Apache Hadoop vs. Apache Spark
- September 9, 2017
- Posted by: ProfessionalGuru
- Category: Uncategorized
Comparison Between Apache Hadoop and Apache Spark
Data Processing

Hadoop: Apache Hadoop provides batch processing. In fact, Hadoop was the first framework that created ripples in the open-source community. Google's revelation about how it was working with vast amounts of data helped Hadoop developers a great deal in creating new algorithms and a component stack to improve access to large-scale batch processing. MapReduce is Hadoop's native batch processing engine, and several components or layers (like YARN, HDFS, etc.) in modern versions of Hadoop allow easy processing of batch data. Since MapReduce relies on permanent storage, it keeps data on disk, which means it can handle very large datasets. MapReduce is scalable and has proved its efficacy on clusters of tens of thousands of nodes. However, Hadoop's data processing is slow, as MapReduce operates in sequential steps, writing intermediate results back to disk between them.
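To make the MapReduce model concrete, here is a minimal sketch of its three phases (map, shuffle, reduce) as a word count in plain Python. It illustrates the programming model only; it does not use Hadoop's actual Java API, and the function names are ours.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data is data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

In real Hadoop the shuffle also sorts and moves data across the network, and the intermediate pairs are spilled to disk, which is exactly the overhead described above.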
Spark: Apache Spark is a good fit for both batch processing and stream processing, making it a hybrid processing framework. Spark speeds up batch processing via in-memory computation and processing optimization. It is a strong alternative for streaming workloads, interactive queries, and machine learning. Spark can also work with Hadoop and its modules, and its real-time data processing capability makes it a top choice for big data analytics. The Resilient Distributed Dataset (RDD) abstraction allows Spark to transparently keep data in memory and send to disk only what is necessary. As a result, much of the time otherwise spent on disk reads and writes is saved.
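The benefit of keeping an RDD in memory can be sketched with a toy dataset class in plain Python. This is a hypothetical stand-in, not the Spark API: without `cache()`, every action re-runs the recorded computation (the "lineage"); with it, later actions reuse the in-memory result.

```python
class ToyRDD:
    """A toy stand-in for an RDD: a recorded computation, not stored data."""

    def __init__(self, source, transform):
        self.source = source        # the input data
        self.transform = transform  # lazy transformation to apply
        self.cached = None          # in-memory result, once cached
        self.recomputes = 0         # how often the lineage was re-run

    def compute(self):
        # An "action": use the cached copy if present, else recompute.
        if self.cached is not None:
            return self.cached
        self.recomputes += 1
        return [self.transform(x) for x in self.source]

    def cache(self):
        # Materialize the result in memory for reuse by later actions.
        self.cached = self.compute()
        return self

uncached = ToyRDD(range(5), lambda x: x * x)
uncached.compute(); uncached.compute()
print(uncached.recomputes)  # 2 -- each action re-ran the lineage

cached = ToyRDD(range(5), lambda x: x * x).cache()
cached.compute(); cached.compute()
print(cached.recomputes)    # 1 -- computed once, then served from memory
```

Spark's real RDDs add partitioning across the cluster and rebuild lost partitions from the lineage, which is where the fault tolerance comes from.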
Architecture

Hadoop: Hadoop is built around a distributed file system (HDFS) that stores varieties of data coming from any type and number of dissimilar data sources. Its basic architecture is a node-cluster system, where massive datasets are distributed across multiple nodes in a single Hadoop cluster running on commodity hardware. Thus, there is no requirement for custom outside hardware and no additional maintenance cost for it.

Spark: Conversely, Spark is not a distributed storage framework; rather, it supports and encourages reusing data across distributed collections within an application. Where Hadoop stores data on disk, Spark leans on in-memory data storage. The primary concept of Apache Spark is the Resilient Distributed Dataset (RDD), which provides a fault-tolerant and efficient mechanism for recovery across the cluster.
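The node-cluster idea can be illustrated by splitting a file into fixed-size blocks and spreading them, with replicas, across nodes. The block size, node names, and round-robin placement below are simplified assumptions for illustration, not HDFS's actual placement policy.

```python
def place_blocks(file_bytes, block_size, nodes, replication=2):
    # Split the file into fixed-size blocks, then assign each block
    # to `replication` distinct nodes in round-robin fashion.
    blocks = [file_bytes[i:i + block_size]
              for i in range(0, len(file_bytes), block_size)]
    placement = {}
    for i in range(len(blocks)):
        placement[i] = [nodes[(i + r) % len(nodes)]
                        for r in range(replication)]
    return blocks, placement

data = b"x" * 1000
blocks, placement = place_blocks(data, block_size=256,
                                 nodes=["node1", "node2", "node3"])
print(len(blocks))   # 4 -- blocks of up to 256 bytes
print(placement[0])  # ['node1', 'node2'] -- two replicas per block
```

Because each block lives on more than one node, the loss of a single machine costs no data, which is why a cluster of ordinary commodity hardware suffices.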
Costs

Both Hadoop and Spark are open-source projects and therefore come for free. However, Spark uses large amounts of RAM to run everything in memory, and RAM is more expensive than hard disks. Hadoop is disk-bound, so it saves the cost of buying expensive RAM, but it requires more systems to distribute the disk I/O over multiple machines. As far as costs are concerned, organizations need to look at their requirements: if the workload is processing very large amounts of big data, Hadoop will be cheaper, since hard disk space comes at a much lower rate than memory space.
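As a back-of-the-envelope illustration of the disk-versus-RAM gap (the per-gigabyte prices below are hypothetical placeholders, not real quotes), the hardware cost to hold a working set can be computed directly:

```python
def storage_cost(dataset_gb, price_per_gb):
    # Total hardware cost to hold the dataset at a given per-GB price.
    return dataset_gb * price_per_gb

dataset_gb = 10_000                            # a 10 TB working set
disk_cost = storage_cost(dataset_gb, 0.03)     # assumed $/GB for disk
ram_cost = storage_cost(dataset_gb, 3.00)      # assumed $/GB for RAM
print(disk_cost, ram_cost)  # 300.0 30000.0
```

Whatever the actual prices in a given year, the ratio between them is what drives the conclusion above: a disk-bound cluster holds the same data for a small fraction of an in-memory one's cost.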