Many organizations approach Big Data almost exclusively from a data science perspective. They know that the massive data sets they now have access to contain high-value insights, and they’re trying to determine what kind of analytics they need to extract those insights.
But if you don’t also re-think your infrastructure, you’ll be in trouble. You can’t simply throw data science over the wall and expect operations to deliver the performance you need in the production environment—any more than you can do the same with application code.
That’s why DataOps—the discipline that ensures alignment between data science and infrastructure—is as important to Big Data success as DevOps is to application success.
Stale results are no results
Speed counts. In fact, some results are completely useless if they aren’t delivered in real time. So if your infrastructure can’t deliver near-wirespeed performance, you’ll never be able to reap the full potential business value of your ever-expanding data resources.
Of particular importance is data intake. Big Data projects often get derailed because everyone is too busy thinking about analytic processing workloads on the back end. But intake can be the thornier problem. It’s not easy to prep large volumes of disparate data for analytic processing. Data has to be validated and rationalized. And, again, this has to happen fast.
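To make that concrete, here is a minimal sketch of intake-side validation in Python. The field names and rules are hypothetical and purely illustrative of the kind of checking and triage that has to happen before data reaches the analytics layer; a production pipeline would run this at far higher volume with far richer rules.

```python
# Hypothetical intake validation: check each incoming record before it is
# handed to the analytics layer, and quarantine anything that fails.
from datetime import datetime

REQUIRED_FIELDS = {"account_id", "timestamp", "amount"}  # illustrative schema

def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one incoming record."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS - record.keys()]
    if "timestamp" in record:
        try:
            datetime.fromisoformat(record["timestamp"])
        except (TypeError, ValueError):
            problems.append("timestamp is not ISO-8601")
    if "amount" in record and not isinstance(record["amount"], (int, float)):
        problems.append("amount is not numeric")
    return problems

def triage(records):
    """Split a batch into clean records and records needing review."""
    clean, quarantined = [], []
    for r in records:
        (quarantined if validate_record(r) else clean).append(r)
    return clean, quarantined
```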
Incorporating legacy mainframe data into Big Data environments can be especially challenging. EBCDIC-encoded data is a beast to normalize, and the EBCDIC-to-ASCII conversion alone can take days to execute if you don’t have the right infrastructure in place.
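For a sense of what that conversion involves at the character level, here is a minimal sketch using Python’s built-in EBCDIC code pages (cp037 is assumed here). It covers only plain character data; the genuinely hard part of real mainframe extracts is the copybook-defined packed-decimal and binary fields, which a straight character conversion doesn’t touch, and doing all of it at volume.

```python
# Minimal sketch: converting an EBCDIC (code page 037) record to ASCII.
# Assumes plain character data; packed-decimal (COMP-3) and binary fields
# in real mainframe extracts need a copybook-driven parser instead.
ebcdic_record = bytes.fromhex("c8c5d3d3d640e6d6d9d3c4")  # "HELLO WORLD" in cp037

text = ebcdic_record.decode("cp037")   # EBCDIC bytes -> Unicode text
ascii_record = text.encode("ascii")    # Unicode text -> ASCII bytes
print(ascii_record)                    # b'HELLO WORLD'
```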
The cloud is not enough
Scalable, low-cost commodity cloud infrastructure—both from as-a-service providers and in the datacenter—is awesome. Unfortunately, that capacity can’t solve every performance problem.
In fact, infrastructure models that require you to move lots of data in and out of memory are going to create bottlenecks that will ultimately prevent you from achieving the real-time results your business demands.
So, sure, there’s a lot you can do with Hadoop running on cloud VMs. But you’re not going to make your Basel II reporting deadlines if those are the only cards you have to play.
No future for one-hit wonders
It’s also important to bear in mind that Big Data success isn’t about just getting one type of analytic result from one set of data sources. It’s about adaptively performing multiple types of operations—including operational analytics, predictive analytics, mobile data serving and transaction processing—on whatever combination of data sources may be relevant to any given business objective, today and in the future.
That means your infrastructure has to deliver superlative performance on everything from social sentiment gleaned from unstructured content to anomaly detection drawn from IoT telemetry.
In other words, infrastructure isn’t commoditized just because VMs and open-source solutions are. Infrastructure that cost-efficiently delivers differentiated performance by adaptively aligning with ever-changing Big Data workloads will definitely provide a competitive advantage. And if you don’t have that, even the best data science can’t help you.