Big data is measured in terabytes, petabytes, or more. Data becomes “big data” when it outgrows your current ability to process it, store it, and cope with it efficiently. Storage has become very cheap in the past ten years, allowing loads of data to be collected. However, our ability to actually process the loads of data quickly has not scaled as fast. Traditional tools to analyze and store data — SQL databases, spreadsheets, the Chinese abacus — were not designed to deal with vast data problems. The amount of information in the world is now measured in zettabytes. A zettabyte, which is 1021 bytes (that is 1 followed by twenty-one zeroes), is a big number. Imagine writing three paragraphs describing your favorite movie – that’s about 1 kilobyte. Next, imagine writing three paragraphs for every grain of sand on the earth — that amount of information is in the zettabyte range.
The best tool available today for processing and storing herculean amounts of big data is Hadoop. Hundreds or thousands of computers are thrown at the big data problem, rather than using single computer.
Hadoop makes data mining, analytics, and processing of big data cheap and fast. Hadoop can take most of your big data problems and unlock the answers, because you can keep all your data, including all of your historical data, and get an answer before your children graduate college.
Apache Hadoop is an open-source project inspired by research of Google. Hadoop is named after the stuffed toy elephant of the lead programmer’s son. In Hadoop parlance, the group of coordinated computers is called a cluster, and the individual computers in the cluster are called nodes.