A recent popular analysis of the growth of data in our daily lives caught my attention: Domo's report entitled "Data Never Sleeps". It's no surprise that such studies focus on human-generated data. After all, we're all users and consumers. But purely in terms of data "exhaust" metrics, humans simply cannot keep up with machines in the modern data center. In this post, I offer a perspective on the scale of metrics data that is far more relevant to IT teams, inspired by my work with the CloudPhysics global data set.
Data Exhaust from Computers
Computers generate a continuous stream of data about their own activities, events, errors, and more. This data source is absolutely massive. And that's today, before the Internet of Things (IoT) is fully up and running. When the IoT is fully functional, with fairly simple devices streaming data about their environment over the internet to servers (possibly in the cloud) for analysis and action, we'll be facing an entirely new set of data management challenges. And we need to start planning for it now.
But let’s focus on today. When I refer to data exhaust from computers, I’m referring not to IoT, but to general-purpose computers running inside commercial data centers or clouds. These computers are equipped with thousands of software sensors (another key differentiator from IoT, where devices are generally hardware sensors) measuring the types of workloads generated by virtual machines and applications. The data “exhaust” from such general-purpose computers running in the data center is surprisingly large. Let’s construct a notion of the “total addressable data” (TAD).
Sizing the Data Center Exhaust
Conservative assumptions for our analysis:
- 90 million VMs in the world (source)
- 35 million servers in the world (source)
- 50 metrics of interest per VM and per server
- Seconds-level ideal resolution of metric capture
Doing the math on these assumptions yields a total of 375,000,000,000 metrics per minute. That’s 375 billion metrics per minute generated in data centers worldwide, as a conservative estimate. Why conservative? I’m not counting application-level metrics, nor network or storage device metrics. I’m also not counting configuration, change, event, error, or log data. Still, the dataset I am counting is sufficient for a large amount of data center analytics, so it’s a good starting point. Now let’s compare that to the scale of social media data. In this figure, I plot the types of metric data collected; note the LOG scale needed to fit everything on the same chart. Data center exhaust is about 90,000× larger per minute, or about 5 orders of magnitude larger.
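For readers who want to check the arithmetic, here is a minimal sketch of the calculation using the conservative assumptions listed above (the machine counts, metric count, and one-second sampling rate are the post's figures, not measured values):

```python
# Back-of-the-envelope estimate of worldwide data center metric volume.
# All inputs are the conservative assumptions from the post above.
VMS_WORLDWIDE = 90_000_000        # virtual machines in the world
SERVERS_WORLDWIDE = 35_000_000    # physical servers in the world
METRICS_PER_MACHINE = 50          # metrics of interest per VM and per server
SAMPLES_PER_MINUTE = 60           # seconds-level resolution: one sample/second

machines = VMS_WORLDWIDE + SERVERS_WORLDWIDE
metrics_per_minute = machines * METRICS_PER_MACHINE * SAMPLES_PER_MINUTE

print(f"{metrics_per_minute:,} metrics per minute")
# 375,000,000,000 metrics per minute
```

Dividing 375 billion by the roughly 90,000× ratio quoted above implies a social media comparison point on the order of a few million data points per minute, which is why the chart needs a log scale.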
Why is the 90,000x Larger Data Set Interesting?
For all of us infrastructure and virtualization professionals, this is a reminder that we live in a data-rich world. As IT professionals, the service we provide to our businesses is the infrastructure we build, maintain, and optimize. So when we keep hearing about the developers, marketers, and others in our own businesses enjoying the benefits of large data sets, we don’t have to be far behind. That data richness exists not only at the level of the apps our developers run on top of the execution layer we provide; it is in our very own infrastructures. The great ongoing irony of our industry is that IT has not turned the benefit of this large dataset onto itself. Great benefits await us if we decide to become data-driven IT leaders. Harnessing this dataset can allow us to optimize our business in fundamental ways, as other industries have, including:
- Lower cost of service delivery by the right-sizing of the IT infrastructure
- More predictable infrastructure service quality from fine-grained analytics and modeling of workload demands
- Higher availability and uptime from proactive identification of hazards and configuration problems
Steps to Take to Prepare Your IT Infrastructure for the Data Revolution
Thankfully, harnessing this massive quantity of metrics to reap the aforementioned benefits is no longer a gargantuan task. For example, CloudPhysics today provides a platform that ingests the fire hose of metrics from VMware-based data centers. On top of this platform, dozens of analytical modules (called “cards”) analyze the environment using modeling and simulation, empowering you to reap these benefits in an easy-to-use SaaS form factor.
Here are four steps to take to prepare your infrastructure and team:
- Find an upcoming project where answering questions about sizing, cost reduction or service quality improvement will make a material difference to your data center
- Empower one of the IT architects in your team to research and educate the rest of your team on what value a data-driven approach to the above can bring
- Allocate a small project to work with a vendor who can provide a data-driven approach to the project
- Track the success of the initial project with your team and ask the assigned IT architect to project out the benefits and ROI if rolled out to your entire organization
My experience working on this transformation with many IT organizations has been that once team members get a taste of the data-driven approach to data center management, adoption grows virally within the organization, accelerating the benefits of efficiency, collaboration, and quality improvements. Please leave your comments or reach out to me on Twitter (@virtualirfan) to discuss this further.