It’s hard to find a set of topics more relevant to the interplay of technology and society than security and privacy. From Glenn Greenwald’s new book on NSA leaker Edward Snowden to the recent finding of a European Union court that Google has to drastically alter the persistence of user data in its services, the societal fallout from the Internet as it enters its Big Data phase is everywhere.
So it was with no small amount of interest that I sat through the first day of Informatica’s user conference last week, listening to how this formerly somewhat boring, still very nerdy data integration company is transforming itself into a front-line player in what has become an all-out war to protect the privacy and security of companies and individuals alike.
The position of Informatica is simple: for optimal usability and control, manage data at the point of use, not at the point of origin. Companies still get to run their back-end data centers using all those legacy tools and skills the IT department cherishes, but in the multi-petabyte world of wildly disparate data from every conceivable (and a few inconceivable) sources, trying to manage, massage, transform, protect, reject, and otherwise deal with data at the source is a Sisyphean task best left to the realm of mythology.
Of course, it’s still easy to walk out of a presentation like Agile Data Integration for Big Data Analytics at GE Aviation and miss this not-so-hidden message: GE Aviation tried doing data transformation at the source for the dozens of engine types and thousands of engines it monitors, and realized after pushing that boulder up the hill that it was better to do the transformation as the data were being loaded into a “data lake” for analysis. Faster, more agile, better results were the key takeaways from GE Aviation’s efforts.
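To make the contrast concrete, here is a minimal sketch of the load-then-transform pattern in Python. The paths, field names, and formats below are invented for illustration (this is not GE’s or Informatica’s actual pipeline), but they show the shape of the idea: raw telemetry lands in the lake exactly as received, and normalization happens only when the data is loaded for analysis.

```python
import csv
import json
from pathlib import Path

RAW_LAKE = Path("data_lake/raw/engine_telemetry")      # hypothetical landing zone
CURATED = Path("data_lake/curated/engine_telemetry")   # hypothetical analysis-ready zone

def land_raw(source_file: Path) -> Path:
    """Step 1: copy source data into the lake exactly as received --
    no per-source transformation logic lives upstream."""
    RAW_LAKE.mkdir(parents=True, exist_ok=True)
    target = RAW_LAKE / source_file.name
    target.write_bytes(source_file.read_bytes())
    return target

def transform_on_load(raw_file: Path) -> Path:
    """Step 2: normalize only when the data is loaded for analysis, so each
    engine type's quirks are handled in one place, at the point of use."""
    CURATED.mkdir(parents=True, exist_ok=True)
    out = CURATED / (raw_file.stem + ".json")
    with raw_file.open(newline="") as f, out.open("w") as o:
        for row in csv.DictReader(f):
            record = {
                "engine_id": row.get("engine_id") or row.get("ENGINE"),   # tolerate per-source naming
                "egt_celsius": float(row["egt"]) if row.get("egt") else None,
            }
            o.write(json.dumps(record) + "\n")
    return out
```

The point isn’t the code; it’s that each source’s quirks get handled once, downstream, rather than negotiated with every upstream system.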
As the conference wore on, with its customer stories and its announcements of new capabilities like Project Springbok, the Intelligent Data Platform, and Secure@Source, it became clear that Informatica’s brand is poised to become synonymous with something far removed from the collection of three-letter acronyms – MDM, TDM, ILM, DQ, and others – that characterizes much of Informatica’s messaging today.
The big-picture problem that Informatica solves is a not-so-hidden side of the Big Data gold rush now under way. As data grows exponentially in volume and variety of sources, companies’ ability to manage it falls further and further behind. Indeed, what constitutes “managing” data is itself changing at an unmanageable clip.
In the new world of Big Data, data quality has to be managed along five main parameters: Is it the right data for the job? Is it the right amount of data? Is it in the right format to be useful? Are its access and use being controlled appropriately? And is it being analyzed and deployed appropriately?
These big, broad parameters in turn raise a whole set of questions about data and its uses: data has to be safe and secure; it has to be reliable and timely; it has to be blended and transformed in order to be useful; it has to be moved in and out of the right kinds of databases; it has to be analyzed, archived, tested for quality, made as accessible as necessary and hidden from unauthorized use. Data has to journey from an almost infinite number of potential sources and formats to an equally infinite number of targets, pass through increasingly rigorous regulatory regimes and controls, and emerge safe, useful, reliable, and defensible.
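As a rough illustration (with field names and thresholds invented for the example, not drawn from any Informatica product), here is the kind of check a point-of-use load step might run against a batch before handing it to analysts:

```python
REQUIRED_FIELDS = {"customer_id", "order_total", "order_date"}  # hypothetical schema
MIN_ROWS = 1_000                                                # hypothetical "right amount" threshold

def quality_report(rows: list[dict]) -> dict:
    """Score a batch against a few of the parameters above: right amount,
    right format, and basic reliability."""
    missing_fields = [r for r in rows if not REQUIRED_FIELDS <= r.keys()]
    null_totals = [r for r in rows if r.get("order_total") in (None, "")]
    return {
        "row_count_ok": len(rows) >= MIN_ROWS,      # enough data for the job?
        "schema_violations": len(missing_fields),   # is it in a usable format?
        "null_order_totals": len(null_totals),      # is it reliable?
    }
```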
Our data warehouse legacy treats data like water, and models data management on the central utility model that delivers potable water to our communities: Centralize all the sources of water into a single water treatment plant, treat the water according to the most rigorous drinking water standard, and send it out to our homes and businesses. There it would move through a single set of pipes to the sinks, tubs, dishwashers, scrubbers, irrigation systems, and the like, where it would be used once and sent on down the drain.
But data isn’t like water in so many ways. Primarily, big data comes from many sources in many different formats, and it requires an enormous amount of work before it can be useful. And what counts as useful differs widely depending on which data is being used in which way. Time-series data is useful for spotting anomalies, sentiment data carries a lot of noise that needs to be filtered, customer data is fraught with errors and duplicates, sensor data is voluminous in the extreme, and financial and health-related data are highly regulated and controlled. And if you want to develop new apps and services, you’ll need a test data set that accurately reflects the real data you’ll eventually use, without actually using real data that might contain confidential or regulated information.
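That test-data problem has a well-worn shape: keep the statistical character of production data while stripping or scrambling anything confidential. A minimal sketch of the idea (the field names and masking rules are invented for illustration, not taken from Informatica’s TDM tooling) might look like this:

```python
import hashlib
import random

def mask_record(record: dict) -> dict:
    """Produce a test-safe copy of a customer record: identifiers become
    stable pseudonyms, regulated values are perturbed, the shape is preserved."""
    masked = dict(record)
    # Stable pseudonym: the same real customer always maps to the same fake ID,
    # so joins across test tables still work.
    masked["customer_id"] = hashlib.sha256(str(record["customer_id"]).encode()).hexdigest()[:12]
    masked["name"] = "Test Customer"
    masked["email"] = masked["customer_id"] + "@example.test"
    # Perturb the regulated numeric value so aggregates stay plausible
    # without reproducing any individual's real figure.
    masked["account_balance"] = round(record["account_balance"] * random.uniform(0.9, 1.1), 2)
    return masked
```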
Trying to deal with these issues as data emerges from its myriad sources isn’t just hard; at times it’s impossible. All too often the data a company uses for mission-critical processes like planning and forecasting comes from a third party – a retailer’s POS data or a supply chain partner’s inventory data – over which the user has no control. All the more reason why Informatica’s notion of dealing with data at the point of use makes the most sense.
So where does Informatica go from here? Judging by my conversations with its customers, there is huge market demand, though much of it is not necessarily understood in precisely the terms that Informatica is now addressing. Point-of-use data issues abound in the enterprise; the trick for Informatica is to see that its brand is identified as the solution to the problem at all levels of the enterprise.
Right now there are lots of ways these problems get solved that don’t involve Informatica. I was just at Anaplan’s user conference listening to yet another example of a customer using Anaplan’s planning tool to do basic master data management at the point of use by training business users to spot data anomalies in the analytics they run against their data. Using Anaplan this way isn’t a bad idea – users of other planning engines, like Kinaxis, do the same thing – but Informatica can and should make the case that planning is planning and data management is data management.
Doing this level of analysis at the point of use is – back to the water analogy – akin to testing your water for contamination right before you start to cook. Wouldn’t you rather just start the whole cooking process knowing the water was safe in the first place?
Moving Informatica from its secure niche as the “data integration” company to something a little more innovative and forward-looking will take some nerve: it’s not clear that Informatica’s investors get it, but then again the investor community tends to like the status quo as long as it delivers the quarterly numbers, even if the long-term prospects are dimming (cf. Hewlett-Packard).
This may be a time for a little leadership, not followership, when it comes to the question of where Informatica has to go next. The customers are ready for this new vision, and the market is too. With so many different vendors vying for the opportunity to solve these problems, the time for Informatica to strike is now. This is one Big Data opportunity that won’t wait.
Hi Josh
Do we have the correct mechanisms now to deal with distributed data? In my opinion the ETL model is also wrong because it is too P2P. Will we see the emergence of the equivalent of UPC for data? Could the metadata live in a centralized repository while the actual data is made available in a distributed manner? In other words, will the Internet architecture be extended to include not only data publishing but data retrieval too?
We deal with a lot of heavily structured data, such as bills of material (BoM) for fighter jets and helicopters. It just seems impractical to always load the full BoM. To get around this, would a pub/sub construct be sufficient to reduce the amount of data that needs to be transferred while guaranteeing data quality?
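To make the question concrete, here is a purely hypothetical sketch of what I mean (the payload shape and checksum scheme are made up), publishing only the BoM nodes that changed rather than shipping the full structure every time:

```python
import hashlib
import json

def bom_delta_message(old_bom: dict, new_bom: dict) -> str:
    """Build a message containing only changed/removed parts, plus a checksum
    of the full BoM so subscribers can verify what they reconstruct."""
    changed = {part: rev for part, rev in new_bom.items() if old_bom.get(part) != rev}
    removed = [part for part in old_bom if part not in new_bom]
    payload = {
        "changed": changed,
        "removed": removed,
        "full_checksum": hashlib.sha256(
            json.dumps(new_bom, sort_keys=True).encode()
        ).hexdigest(),
    }
    return json.dumps(payload)  # publish to e.g. a "bom.updates" topic
```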
As an aside, did you ever read Grammatical Man? It may be worth dusting off this 1982 book. I think there is still some good insight in ‘global’ data. http://en.wikipedia.org/wiki/Grammatical_man
Regards
Trevor
Josh,
I enjoyed reading your analogy of managing data as if it were water. The question you raise about when and where to optimize data management is interesting. Given the changing dynamics of how enterprises are starting to leverage their data assets, I don’t think you can conclude that either the point of use or the point of creation is preferred over the other. Enterprises need to be flexible and have an approach that can account for both, depending on where they stack up in the data-centric maturity model. No matter what form you consume water in (liquid, ice, or vapor), you want it to be safe. And the most efficient place to test may not always be at the time of use.
With our newly announced Secure@Source, the earlier you can detect exposure of sensitive and company-confidential data and protect it upstream, the less risk you introduce into your security strategy further downstream as your data proliferates.
As Informatica rolls out more data-centric applications – such as Secure@Source – that are built on the Intelligent Data Platform, all of those three-letter acronyms (ETL, MDM, ILM, etc.) become the DNA of any viable data-centric application development framework. These capabilities should be available as services and applied when and where they make the most sense.
Thanks again for your insightful comments!
Julie