Keeping You Up On The Latest

Posts tagged ‘Apache Hadoop’

Walmart Takes On Big Data


 

Many of Walmart’s big data tools were developed at Walmart Labs, which was created after Walmart took over Kosmix in 2011. The products developed at Walmart Labs include ‘Social Genome’, ‘Shoppycat’ and ‘Get on the Shelf’.

The Social Genome product allows Walmart to reach customers, or friends of customers, who have mentioned a product online, inform them about that exact product, and include a discount.

Public data from the web is combined with social data and proprietary data such as customer purchasing data and contact information. The result is a constantly changing, up-to-date knowledge base with hundreds of millions of entities and relationships. This gives Walmart a better understanding of what its customers are saying online. An example mentioned by Walmart Labs shows a woman tweeting regularly about movies. When she tweets “I love Salt”, Walmart is able to understand that she is talking about the movie Salt and not the condiment.
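
As a rough illustration of the kind of disambiguation in the “Salt” example, the sketch below scores each possible meaning of an ambiguous word by counting cue words in a user's recent posts. The senses, the cue words, and the function name are all hypothetical and chosen only for illustration; this is a toy, not a description of how Social Genome actually works.

```python
from collections import Counter

# Hypothetical cue words for each sense of an ambiguous term (illustrative only).
SENSES = {
    "salt": {
        "movie": {"movie", "film", "watched", "cinema", "actor", "trailer"},
        "condiment": {"cooking", "recipe", "pepper", "dinner", "taste", "kitchen"},
    }
}

def disambiguate(term, recent_posts):
    """Pick the sense whose cue words appear most often in the user's recent posts."""
    words = Counter(
        word.strip(".,!?\"'").lower()
        for post in recent_posts
        for word in post.split()
    )
    scores = {
        sense: sum(words[cue] for cue in cues)
        for sense, cues in SENSES[term.lower()].items()
    }
    return max(scores, key=scores.get), scores

# A made-up posting history for someone who tweets regularly about movies.
history = [
    "Watched a great movie last night",
    "That film trailer looks amazing",
    "I love Salt",
]
sense, scores = disambiguate("Salt", history)
print(sense, scores)
```

A real system would draw on far richer signals (the knowledge base of entities and relationships mentioned above), but the idea of using surrounding context to choose between meanings is the same.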

The Shoppycat product developed by Walmart recommends suitable products to Facebook users based on the hobbies and interests of their friends.

Get on the Shelf was a crowd-sourcing solution that gave anyone the chance to promote his or her product in front of a large online audience. The best products would be sold at Walmart, with the potential to suddenly reach millions of customers.


Techfest 2013 and Microsoft’s Predictive Whiteboard

 


Can Big Data Survive Without Data Scientists?


A 2011 McKinsey & Co. survey pointed out that many organizations have neither the skilled personnel needed to mine big data for insights nor the structures and incentives required to use big data to make informed decisions and act on them.

Big data is a mixture of distributed data architectures and tools like Hadoop, NoSQL, Hive and R. Data scientists serve as the gatekeepers and mediators between these systems and the people who run the business: the domain experts.

The data scientist serves three main roles: data architecture, machine learning, and analytics. While these roles are important, not every company actually needs a highly specialized data team of the sort you’d find at Google or Facebook.

Most of the standard challenges that require big data, like recommendation engines and personalization systems, can be abstracted out. On a per-domain basis, feature creation could likewise be templatized. What if domain experts could directly encode their ideas and representations of their domains into the system, bypassing data scientists as middlemen and translators?
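
One way to picture “templatized” feature creation is a small declarative table that a domain expert can edit without touching any modeling code. The sketch below is only an illustration under that assumption; the template format, the field names, and the `build_features` helper are all hypothetical.

```python
# Hypothetical, declarative feature templates a domain expert could edit
# without writing model code. Names and fields are illustrative only.
RETAIL_TEMPLATE = [
    {"name": "total_spend", "field": "amount", "agg": "sum"},
    {"name": "order_count", "field": "amount", "agg": "count"},
    {"name": "max_order", "field": "amount", "agg": "max"},
]

# The small set of aggregations the templates are allowed to reference.
AGGREGATIONS = {
    "sum": sum,
    "count": len,
    "max": max,
}

def build_features(records, template):
    """Apply each template entry to a list of record dicts and return a feature dict."""
    features = {}
    for spec in template:
        values = [r[spec["field"]] for r in records if spec["field"] in r]
        # Fall back to 0 when no record carries the field.
        features[spec["name"]] = AGGREGATIONS[spec["agg"]](values) if values else 0
    return features

orders = [{"amount": 20.0}, {"amount": 35.5}, {"amount": 4.5}]
features = build_features(orders, RETAIL_TEMPLATE)
print(features)
```

Adding a new feature then means adding one line to the template rather than asking a data scientist to write new code.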


 

Data Becoming Bigger and Better 2013


 

 

Mortar

Infochimps

Microsoft Windows Azure HDInsight

There are companies trying to make Hadoop more useful by turning it into a platform for something other than running MapReduce jobs. The companies: Continuuity, Platfora, and Drawn to Scale.

 

Big Data


Having become accustomed to terms like megabyte, gigabyte, and terabyte, we must now prepare ourselves for a whole new vocabulary, such as petabyte, exabyte, and zettabyte, which will become just as common.

Dr Riza Berkan, CEO and board member of Hakia, provides a list of mechanisms generating big data:

  • Data from scientific measurements and experiments (astronomy, physics, genetics, etc.)
  • Peer to peer communication (text messaging, chat lines, digital phone calls)
  • Broadcasting (News, blogs)
  • Social Networking (Facebook, Twitter)
  • Authorship (digital books, magazines, Web pages, images, videos)
  • Administrative (enterprise or government documents, legal and financial records)
  • Business (e-commerce, stock markets, business intelligence, marketing, advertising)
  • Other

Dr Riza Berkan says Big Data can be a blessing and a curse.

He says that although there should be clear boundaries between data segments that belong to specific objectives, this very concept is misleading and can undermine potential opportunities. For example, scientists working on human genome data may improve their analysis if they could take the entire content (publications) on Medline (or Pubmed) and analyze it in conjunction with the human genome data. However, this requires natural language processing (semantic) technology combined with bioinformatics algorithms, which is an unusual coupling at best. Two different data segments in different formats, when combined, actually define a new “big data”. Now, add to that a third data segment, such as the FBI’s DNA bank, or geneology.com, and you’ll see the complications and opportunities can go on and on. This is where the mystery and the excitement reside with the concept of big data.

Super Big Data Software

Dr Riza Berkan asks: are we prepared for generating data at colossal volumes? He suggests looking at this question in two stages: (1) platform and (2) analytics “super” software.

Apache Hadoop’s open source software enables the distributed processing of large data sets across clusters of commodity servers, an approach closely associated with cloud computing. IBM’s Platform Symphony is another example of grid management suitable for a variety of distributed computing and big data analytics applications. Oracle, HP, SAP, and Software AG are very much in the game for this $10 billion industry. While these giants are offering a variety of solutions for distributed computing platforms, there is still a huge void at the level of Analytics Super Software. Super Software’s main function would be to discover new knowledge that would otherwise be impossible to acquire via manual means, says Dr Berkan.

Discovery requires the following functions:

  • Finding associations across information in any format
  • Visualization of associations
  • Search
  • Categorization, compacting, summarization
  • Characterization of new data (where it fits)
  • Alerting
  • Cleaning (deleting unnecessary clogging information)

Moreover, Dr Berkan says that “Super Software would be able to identify genetic patterns of a disease from human genome data, supported by clinical results reported in Medline, and further analyzed to unveil mutation possibilities using FBI’s DNA bank of millions of DNA information. One can extend the scope and meaning of top level objectives which is only limited by our imagination.”

Then too, Dr Berkan says big data can also be a curse if the cleaning (deleting) technologies are not considered as part of the Super Software operation. In his previous post, “information pollution”, he emphasized the danger of uncontrollable growth of information, which is the invisible devil in the information age.

credits: Search Engine Journal/SEG

 

Big Data and the Legal Profession

 


IBM’s Understanding Big Data e-book

 


Big Data and Other Technologies

 

Currently, big data is synonymous with technologies like Hadoop and the “NoSQL” class of databases, such as Mongo (a document store) and Cassandra (a key-value store). Today it’s possible to stream real-time analytics with ease. Spinning clusters up and down is a (relative) cinch, accomplished in 20 minutes or less.

Now there are new untapped open source technologies out there.

STORM AND KAFKA

Storm and Kafka are used at a number of high-profile companies including Groupon, Alibaba, and The Weather Channel.

Storm and Kafka are said to handle data velocities of tens of thousands of messages every second.

DRILL AND DREMEL

Drill and Dremel are said to put power in the hands of business analysts, and not just data engineers.

R

R is an open source statistical programming language. It is incredibly powerful. Over two million (and counting) analysts use R. R works very well with Hadoop.

GREMLIN AND GIRAPH

Gremlin and Giraph help empower graph analysis and are often coupled with graph databases like Neo4j or InfiniteGraph or, in the case of Giraph, with Hadoop.

SAP HANA

SAP HANA is an in-memory analytics platform that includes an in-memory database and a suite of tools and software for creating analytical processes and moving data in and out in the right formats.

 

Big Data

Big data is measured in terabytes, petabytes, or more. Data becomes “big data” when it outgrows your current ability to process it, store it, and cope with it efficiently. Storage has become very cheap in the past ten years, allowing loads of data to be collected. However, our ability to actually process those loads of data quickly has not scaled as fast. Traditional tools to analyze and store data (SQL databases, spreadsheets, the Chinese abacus) were not designed to deal with vast data problems. The amount of information in the world is now measured in zettabytes. A zettabyte, which is 10^21 bytes (that is, 1 followed by twenty-one zeroes), is a big number. Imagine writing three paragraphs describing your favorite movie; that’s about 1 kilobyte. Next, imagine writing three paragraphs for every grain of sand on the earth; that amount of information is in the zettabyte range.

The best tool available today for processing and storing herculean amounts of big data is Hadoop. Hundreds or thousands of computers are thrown at the big data problem, rather than using a single computer.
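
To make the scale concrete, here is a short sketch of the arithmetic, assuming decimal (SI) units in which each step up is a factor of 1,000. The one-terabyte drive size is just an assumed reference point for comparison.

```python
# Decimal (SI) storage units expressed as powers of ten.
UNITS = {
    "kilobyte": 10**3,
    "megabyte": 10**6,
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
}

# Number of one-kilobyte movie descriptions that fit in one zettabyte.
descriptions_per_zettabyte = UNITS["zettabyte"] // UNITS["kilobyte"]
print(descriptions_per_zettabyte)  # 10**18

# Number of one-terabyte drives needed to hold one zettabyte.
drives_per_zettabyte = UNITS["zettabyte"] // UNITS["terabyte"]
print(drives_per_zettabyte)  # 10**9
```

In other words, one zettabyte holds about a billion billion one-kilobyte descriptions, and storing it would take about a billion one-terabyte drives.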

Hadoop makes data mining, analytics, and processing of big data cheap and fast. Hadoop can take most of your big data problems and unlock the answers, because you can keep all your data, including all of your historical data, and get an answer before your children graduate college.

Apache Hadoop is an open-source project inspired by research published by Google. Hadoop is named after the stuffed toy elephant of the lead programmer’s son. In Hadoop parlance, the group of coordinated computers is called a cluster, and the individual computers in the cluster are called nodes.
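
Hadoop's best-known processing model is MapReduce. The sketch below imitates its three steps (map, shuffle, reduce) as a word count in plain Python, with each string standing in for the data held on one node. It runs in a single process and does not use any Hadoop API; the function names and sample data are made up for illustration.

```python
from collections import defaultdict

def map_phase(chunk):
    """Run on each node: emit (word, 1) pairs for the node's chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_outputs):
    """Group the emitted values by key across all nodes."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the grouped counts for each word."""
    return {key: sum(values) for key, values in grouped.items()}

# Each string stands in for the data held by one node in the cluster.
node_chunks = ["big data big cluster", "data node", "big node"]
counts = reduce_phase(shuffle(map_phase(chunk) for chunk in node_chunks))
print(counts)
```

On a real cluster the map step runs in parallel on the nodes that already hold the data, which is what lets hundreds or thousands of machines share the work.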
