Remember the last time you tried to write a MapReduce job (obviously something non trivial than a word count)? It sure did the work, but has lot of pain points from getting an idea to implement it in terms of map reduce. Did you wonder how life will be much simple if you had to code like doing collection operations and hence being transparent* to its distributed nature? Did you want/hope for more performant/low latency jobs? Well, seems like you are in luck.
In this talk, we will be covering a different way to do MapReduce kind of operations without being just limited to map and reduce, yes, we will be talking about Apache Spark. We will compare and contrast Spark programming model with Map Reduce. We will see where it shines, and why to use it, how to use it. We’ll be covering aspects like testability, maintainability, conciseness of the code, and some features like iterative processing, optional in-memory caching and others. We will see how Spark, being just a cluster computing engine, abstracts the underlying distributed storage, and cluster management aspects, giving us a uniform interface to consume/process/query the data. We will explore the basic abstraction of RDD which gives us so many awesome features making Apache Spark a very good choice for your big data applications. We will see this through some non trivial code examples.