Data Analysis, Programming

Performing EDA on NY Taxi Fare Dataset to see PySpark in action — because cloud computing is the next big thing!

Introducing the Technologies

What is Spark and PySpark — Spark SQL and Spark MLlib?

According to Wikipedia, Apache Spark is an open-source analytics engine for large-scale data processing. It lets programmers work on data distributed across a cluster of machines, with implicit data parallelism and fault tolerance.

At the core of the Spark engine are Resilient Distributed Datasets (RDDs): collections of data items partitioned across a cluster of machines and maintained in a fault-tolerant manner. RDDs were developed to overcome a limitation of the MapReduce paradigm, which forces a linear flow of data on programs: read from disk, map, reduce, and write back to disk. …
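To make the RDD idea concrete, here is a minimal single-machine sketch in plain Python. It is not Spark and not the PySpark API: `ToyRDD` is a hypothetical class invented for illustration, and the numbers are made up. It only loosely mirrors the shape of PySpark's `map`, `filter`, `collect` and `reduce`. The aim is to show two properties described above: transformations are recorded rather than run immediately, and results are recomputed in memory from that record instead of being written to disk between steps.

```python
from functools import reduce


class ToyRDD:
    """Toy, single-machine stand-in for an RDD (hypothetical, not PySpark)."""

    def __init__(self, source, lineage=()):
        self._source = list(source)     # original data, never modified
        self._lineage = tuple(lineage)  # ordered record of transformations

    def map(self, fn):
        # Lazy: nothing is computed here, the step is only recorded.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Lazy as well: returns a new ToyRDD with a longer lineage.
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: replay the recorded lineage over the source, in memory.
        # Replaying the lineage is also how lost data could be rebuilt.
        items = self._source
        for kind, fn in self._lineage:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def reduce(self, fn):
        # Action: combine the computed items into a single value.
        return reduce(fn, self.collect())


nums = ToyRDD([5, 17, 6, 8, 5])            # made-up sample values
doubled = nums.map(lambda x: x * 2)        # recorded, not yet executed
large = doubled.filter(lambda x: x > 10)   # recorded, not yet executed

print(large.collect())                     # [34, 12, 16]
print(large.reduce(lambda a, b: a + b))    # 62
print(nums.collect())                      # [5, 17, 6, 8, 5]
```

Both actions run on the same `large` object with no intermediate files, and `nums` is unchanged afterwards. A real RDD adds what this sketch leaves out: partitioning across machines, optional caching, and per-partition recovery.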

