Between now and 2020, the sheer volume of digital information is predicted to increase to 35 trillion gigabytes, much of it coming from new sources including financial and retail transactions, web logs, RFID, sensor networks, social media, telephony, internet indexing, clickstreams, call detail records, e-commerce, medical records, and more. This "Big Data" usually has one or more of the following characteristics:
- Volume - very large data volumes, measured in terabytes or petabytes
- Variety - structured and unstructured data
- Velocity - rapidly changing data
Big datasets grow so large that they become awkward or uneconomical to handle with traditional database management and BI tools. New Big Data technologies allow companies to store, manage, and analyze massive amounts of very diverse data, and to perform increasingly sophisticated analytics on it, quickly and cost-effectively. A growing number of enterprises are sifting through huge volumes of detailed and intricate data for facts and patterns they did not know about or simply were unable to recognize in the past.
Open source tools like Hadoop and MapReduce are giving companies new ways to attack big data. Hadoop provides tremendous data processing power. Apache Hadoop users typically build their own parallelized computing clusters from commodity servers, each with dedicated storage in the form of a small disk array or, more recently, solid-state drives (SSDs) for performance. These are commonly referred to as “shared-nothing” architectures, and Hadoop storage is direct-attached storage (DAS), that is, storage attached directly to a single computer or server. Hadoop is now central to the big data strategies of enterprises, service providers, and other organizations.
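To make the MapReduce programming model concrete, the sketch below is a minimal version of the classic word-count job written against the Hadoop Java MapReduce API. It is an illustrative example rather than any particular deployment: mappers run in parallel, each reading the data blocks stored locally on its node, and the framework shuffles the intermediate (word, count) pairs to reducers for aggregation, which is the essence of the shared-nothing approach described above.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each mapper processes one block of input stored on its node
  // and emits (word, 1) for every token it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: all counts for the same word are shuffled to one reducer
  // and summed into the final total.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures and submits the job; input and output paths are
  // passed on the command line (hypothetical HDFS directories).
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Such a job would typically be packaged as a JAR and launched with something like `hadoop jar wordcount.jar WordCount /input /output` (the JAR name and paths here are placeholders), with both directories living in HDFS on the cluster's direct-attached storage.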
Hadoop is the nucleus of the next-generation enterprise data warehouse (EDW) in the cloud. Hadoop implements the core features that are at the heart of most modern EDWs: cloud-facing architectures, massively parallel processing (MPP), in-database analytics, mixed workload management, and a hybrid storage layer.
Real-time or near-real-time information delivery is one of the defining characteristics of Big Data analytics. Providers of traditional EDW solutions see great advantages in adopting Hadoop for handling complex content, advanced analytics, and massively parallel in-database processing in the cloud. Essentially, application development and business process professionals should regard today’s Hadoop market as the reinvention of the EDW for the new age of cloud-centric business models that require rapid execution of advanced, embedded analytics against big data.
Some commercial EDW vendors, such as EMC Greenplum and IBM, have begun to integrate Apache Hadoop to enable cloud-based deployment of increasingly complex analytics against unstructured and real-time data, and others are offering, developing, or preannouncing big data solutions that incorporate Apache Hadoop technologies. Traditional BI tools such as MicroStrategy are also adding big data support that lets users leverage Hadoop's massive storage capabilities and turn petabytes of data into actionable insights.
Early adopters of Hadoop include companies such as Yahoo, Facebook, Amazon, Netflix, Adobe, eBay, Twitter, Salesforce, and eHarmony.