A 2011 McKinsey & Co. survey pointed out that many organizations lack both the skilled personnel needed to mine big data for insights and the structures and incentives required to use big data to make informed decisions and act on them.
Big data is a mixture of distributed data architectures and tools like Hadoop, NoSQL, Hive and R. Data scientists serve as the gatekeepers and mediators between these systems and the people who run the business – the domain experts.
The data scientist serves three main roles: data architecture, machine learning, and analytics. While these roles are important, not every company actually needs the kind of highly specialized data team you’d find at Google or Facebook.
Most of the standard challenges that call for big data, like recommendation engines and personalization systems, can be abstracted out, and on a per-domain basis feature creation could be templatized. What if domain experts could encode their ideas and representations of their domains directly into the system, bypassing the data scientist as middleman and translator?
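To make "templatized feature creation" concrete, here is a minimal sketch of what a declarative feature layer might look like: domain experts register features as named expressions over a record, and the pipeline applies them uniformly. All names here (`FEATURES`, `extract_features`, the sample record) are hypothetical, invented purely for illustration.

```python
# Hypothetical feature template: domain experts declare features as
# named functions over a user record; the pipeline applies them all.
FEATURES = {
    "order_count": lambda user: len(user["orders"]),
    "avg_order_value": lambda user: (
        sum(user["orders"]) / len(user["orders"]) if user["orders"] else 0.0
    ),
    "is_repeat_buyer": lambda user: len(user["orders"]) > 1,
}

def extract_features(user):
    """Apply every declared feature to one user record."""
    return {name: fn(user) for name, fn in FEATURES.items()}

user = {"orders": [20.0, 35.0, 5.0]}
print(extract_features(user))
# {'order_count': 3, 'avg_order_value': 20.0, 'is_repeat_buyer': True}
```

The point of the design is that adding a new feature is a one-line declaration by a domain expert, not a change to pipeline code.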
Currently, big data is synonymous with technologies like Hadoop and the “NoSQL” class of databases, such as MongoDB (document stores) and Cassandra (key-value stores). Today it’s possible to stream real-time analytics with ease, and spinning clusters up and down is a (relative) cinch, accomplished in 20 minutes or less.
But beyond these, a new crop of largely untapped open source technologies is emerging.
STORM AND KAFKA
Storm and Kafka are used at a number of high-profile companies including Groupon, Alibaba, and The Weather Channel.
Together, Storm and Kafka are said to handle data velocities of tens of thousands of messages every second.
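Storm's programming model is a topology of "spouts" (stream sources) and "bolts" (transformations). The sketch below imitates that dataflow in-process with plain Python generators, just to show the shape of a streaming word count; real Storm distributes these components across a cluster, and none of this code uses Storm's actual API.

```python
# Toy, in-process imitation of a Storm-style topology:
# spout (source) -> split bolt (transform) -> count bolt (aggregate).
from collections import Counter

def sentence_spout(sentences):
    # Source of the stream: emits one sentence at a time.
    for s in sentences:
        yield s

def split_bolt(stream):
    # Transform bolt: splits each sentence into word tuples.
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    # Aggregating bolt: keeps a running count per word.
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

stream = sentence_spout(["the cat", "the dog"])
print(count_bolt(split_bolt(stream)))
```

Because each stage only consumes and emits tuples, stages can in principle be scaled out independently, which is the property Storm exploits for high message velocities.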
DRILL AND DREMEL
Drill and Dremel are said to put power in the hands of business analysts, and not just data engineers.
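The kind of question Drill and Dremel target is an ad-hoc aggregation over nested, JSON-like records, without first flattening everything into a warehouse schema. As a toy illustration (pure Python, made-up data, not Drill's actual SQL interface), here is "average order value per country" computed directly over nested records:

```python
# Hypothetical nested records of the sort Drill/Dremel query in place.
from collections import defaultdict

records = [
    {"user": {"country": "US"}, "orders": [{"total": 10.0}, {"total": 30.0}]},
    {"user": {"country": "DE"}, "orders": [{"total": 25.0}]},
    {"user": {"country": "US"}, "orders": [{"total": 20.0}]},
]

def avg_spend_by_country(rows):
    # Walk the nested structure, accumulating per-country sums and counts.
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        country = row["user"]["country"]
        for order in row["orders"]:
            sums[country] += order["total"]
            counts[country] += 1
    return {c: sums[c] / counts[c] for c in sums}

print(avg_spend_by_country(records))
# {'US': 20.0, 'DE': 25.0}
```

In Drill, an analyst would express this as a SQL-like query against the raw files; the engine, not the analyst, handles the nesting.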
R
R is an open source statistical programming language, and it is incredibly powerful: over two million (and counting) analysts use R. It also works very well with Hadoop.
GREMLIN AND GIRAPH
Gremlin and Giraph help empower graph analysis, and are often coupled with graph databases like Neo4j or InfiniteGraph, or, in Giraph’s case, with Hadoop.
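The signature operation these tools express is a traversal: start at a vertex and walk edges outward. The sketch below shows a "friends of friends" traversal over a toy adjacency-list graph in plain Python; the graph and names are made up, and this is the idea of a Gremlin traversal, not Gremlin's actual syntax.

```python
# Toy graph as an adjacency list; a real deployment would hold this
# in a graph database like Neo4j and traverse it with Gremlin.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": [],
    "erin": [],
}

def friends_of_friends(g, start):
    """Vertices exactly two hops from `start`, excluding direct friends."""
    one_hop = set(g[start])
    two_hop = {v for u in one_hop for v in g[u]}
    return two_hop - one_hop - {start}

print(sorted(friends_of_friends(graph, "alice")))
# ['dave', 'erin']
```

Graph databases make this cheap because edges are stored as direct links, so each hop is a pointer lookup rather than a join.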
SAP HANA
SAP HANA is an in-memory analytics platform that includes an in-memory database and a suite of tools and software for creating analytical processes and moving data in and out in the right formats.
Business intelligence applications have begun to transition from OLAP to a new type of service that connects data sources from social networks, third-party apps and elsewhere. NoSQL has emerged as a popular option for its ability to scale across cheap, commodity nodes; that is much cheaper than scaling with vertically integrated systems that require attaching expensive storage arrays.
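The mechanism behind that horizontal scaling is worth a sketch: many NoSQL systems hash each key onto a ring of nodes, so adding a commodity node remaps only a slice of the keys instead of re-sharding everything. The toy consistent-hashing ring below illustrates the idea in Python; it is a simplification (no virtual nodes, no replication) and not any particular database's implementation.

```python
# Toy consistent-hashing ring: keys and nodes are hashed onto the same
# circle, and each key belongs to the next node clockwise from it.
import hashlib
from bisect import bisect

def ring_position(name):
    # Hash a node or key name to a position on the ring.
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

def build_ring(nodes):
    return sorted((ring_position(n), n) for n in nodes)

def node_for_key(ring, key):
    positions = [p for p, _ in ring]
    idx = bisect(positions, ring_position(key)) % len(ring)
    return ring[idx][1]

ring = build_ring(["node-a", "node-b", "node-c"])
print(node_for_key(ring, "user:42"))
```

Production systems refine this with virtual nodes for balance and replication for fault tolerance, but the cost model is the same: capacity grows by adding cheap machines, not by buying a bigger one.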
A new generation of big data applications is turning up, which in turn has put pressure on enterprise vendors to modify their existing software suites. Venture capitalists will continue to invest in data infrastructure and in big data apps that represent a manifest disruption in IT.