
Spark versus Hadoop and using them in collaboration

Written by Rushikesh Garadade | Jul 24, 2017 7:44:42 AM


Nowadays, whenever we hear the term "Big Data", two names come to mind: Spark and Hadoop, and the question that arises is which one we should use. Before debating Apache Spark versus Apache Hadoop, let us understand the architectural difference between them, which determines their speed in different applications. The two frameworks are similar in that both come into the picture when we need to process data that cannot be processed on a single machine: both are used for the distributed processing of big datasets, and both enjoy the advantage of running on commodity hardware, which makes them very cost-effective. However, neither is a replacement for the other; each has its own advantages and they are inter-dependent.

Now, let us briefly look at the architectural difference between them. Hadoop has two layers: HDFS for storage and MapReduce for processing. Spark, on the other hand, has only a processing layer; it has no storage layer of its own, which is both an advantage and a disadvantage. It is an advantage because Spark can work even on local storage, and a disadvantage because it has to depend on other storage systems to store big data. Spark can process any file stored in HDFS or in the other storage systems supported by Hadoop, including the local file system, Amazon S3, Cassandra, Hypertable, HBase, etc.
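For illustration, here is a minimal PySpark sketch of pointing the same read at a few of these storage backends; the paths and bucket name are placeholders rather than real datasets.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-backends").getOrCreate()

# Local file system: handy for development or small data
local_df = spark.read.csv("file:///tmp/events.csv", header=True)

# HDFS: the usual choice when the data is too big for one machine
hdfs_df = spark.read.csv("hdfs:///data/events.csv", header=True)

# Amazon S3: needs the Hadoop AWS connector on the classpath
s3_df = spark.read.csv("s3a://my-bucket/events.csv", header=True)

print(local_df.count(), hdfs_df.count(), s3_df.count())

The processing code stays the same whichever storage layer sits underneath; only the URI scheme changes.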

When to Use Apache Spark?

In machine learning, iterative models require continuous read and write operations over the same data. So, if we have an application that runs an iterative machine learning model, we should go for Apache Spark, which can keep that data in memory between iterations.
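As a rough sketch of that pattern (assuming a feature-engineered Parquet file; the path and column names are hypothetical), the training data below is cached in memory once and then reused by every pass of the optimizer, instead of being re-read from disk on each iteration as a chain of MapReduce jobs would do.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("iterative-ml").getOrCreate()

# Cache the training set so every iteration reuses the in-memory copy
training = spark.read.parquet("hdfs:///data/training.parquet").cache()

# The optimizer makes up to 50 passes over the cached data
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=50)
model = lr.fit(training)
print(model.coefficients)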

If you want to perform analysis on streaming data, for example a real-time stream of messages on Twitter, log files generated by a mobile/web application, or purchases on e-commerce sites, or if you have an application that requires multiple operations, such as a machine learning algorithm, then Apache Spark is the preferable way to deal with these situations.
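As a minimal sketch of such a streaming job, here is Spark's Structured Streaming word count over a socket source, standing in for a real Twitter or log stream; the host and port are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Treat lines arriving on a TCP socket as an unbounded table
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

# Split each line into words and keep a running count per word
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the updated counts to the console
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()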

Spark is also well suited to common applications such as online product recommendation, machine log monitoring, real-time marketing campaigns and cyber-security analysis.

When to Use Apache Hadoop?

Hadoop was initially created for processing and analysing logs, but it has since been adopted for a much wider range of workloads. Apache Hadoop MapReduce is best suited for batch processing. If your data-processing requirements are static, then Hadoop MapReduce is sufficient to process the data fast enough.
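To illustrate that batch style, the classic word count can be written as a Hadoop Streaming job with a Python mapper and reducer; the sketch below puts both halves in one listing for readability, and the input/output paths in the submission command are placeholders.

import sys

def mapper():
    # Emit (word, 1) for every word read from stdin
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word)

def reducer():
    # MapReduce sorts by key, so all counts for a word arrive together
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

In practice the two functions would live in separate mapper.py and reducer.py scripts (each calling its function at the bottom) and be submitted with something like: hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /logs -output /wordcount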

Spark and Hadoop are both fault-tolerant. However, Hadoop is comparatively more fault-tolerant than Spark, so if you need an extremely fault-tolerant system you should go for Hadoop.

Working with Spark and Hadoop together

One of the major disadvantages of Spark is that it does not have a storage layer, and this is where Hadoop's HDFS comes in.

In one of my earlier projects, we were building an audience measurement platform with two sets of data. The first came from set-top boxes and contained viewership attributes such as "person A started watching channel B on date C during the time period D to E", along with many other viewing details; let's call it the viewers' data. The second set was about advertisements and contained all the attributes of each ad, such as "ad 'a' was published on date 'b', on channel 'c', between times 'd' and 'e'", and so on. Our task was to combine the ad data and the viewership data and provide interesting insights that would help our client grow their business. Analytics were built on top of this combined data, for example: ad X was viewed Y times on channel Z, P viewers watched the ad completely, Q watched less than 50% of it and then switched away, and so on.

Now, to perform this entire ETL flow we needed many transformations. Initially, we stored all the intermediate data on disk, and executing the flow took a lot of time; persisting intermediate data that is never read again is of no use. So, we started keeping the intermediate data in memory instead, and it reduced our total execution time drastically: the whole flow became about 6 times faster than before. This is why we needed Spark, for its in-memory processing advantage. Now, why use Hadoop's HDFS and not local storage? Because one day of viewers' data contains approximately 5 million rows and one day of ad data contains approximately 20 thousand rows, so combining the two for even a single day can generate at most 5 million × 20 thousand rows, and the same flow had to be run over 3 months of data. Given this huge data size, if we had used Spark with local storage alone, Spark would have run into caching issues and ultimately consumed more time than Hadoop alone. So, here we had to use HDFS for storage.
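To make this concrete, here is a simplified PySpark sketch of the joining-and-caching step described above; the paths, join keys and column names are illustrative, not the project's actual schema.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("audience-etl").getOrCreate()

viewers = spark.read.parquet("hdfs:///audience/viewers/")  # ~5 million rows per day
ads = spark.read.parquet("hdfs:///audience/ads/")          # ~20 thousand rows per day

# Match each viewing session with the ads aired on the same channel and date,
# and cache the intermediate result because several analytics steps reuse it
joined = viewers.join(ads, on=["channel", "air_date"]).cache()

# Insight 1: how many times was each ad viewed on each channel?
views_per_ad = joined.groupBy("ad_id", "channel").count()

# Insight 2: who watched an ad to completion versus switching away?
completion = joined.withColumn(
    "watched_fully", F.col("watch_seconds") >= F.col("ad_duration_seconds")
).groupBy("ad_id", "watched_fully").count()

views_per_ad.show()
completion.show()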

As a result, when we used Spark for processing and Hadoop for storage, the ETL flow became far faster than it would have been with either framework alone, and this is the benefit of using the two together.

Spark alone is not yet well suited for production workloads, and Hadoop alone is slower in terms of execution time. Using them together compensates for each one's weaknesses and produces great results.