Apache Spark Overview: Key Features, Use Cases, and Alternatives
Every enterprise organization is in the data business, whether they know it or not. Apache Spark is a powerful open source technology for organizations that acknowledge that reality and want to fully leverage the power of their data.
In this blog, we dive into Apache Spark and its features, explain how it works and how it's used, and give a brief overview of common Apache Spark alternatives.
What Is Apache Spark?
Apache Spark is an open-source framework for real-time data analytics in a distributed computing environment.
Apache Spark is a distributed processing system that can run on Hadoop YARN (as well as other cluster managers) and is used to process big data workloads.
Apache Spark Features
The main feature of Spark, compared to Hadoop MapReduce, is its in-memory cluster computing, which increases application processing speed, and its optimized query execution, which delivers fast analytic queries against data sets ranging from small to very large.
Other key features of Apache Spark include:
Speed
Applications running on Spark can process data up to 100 times faster in memory and up to 10 times faster on disk. Spark achieves this by storing intermediate processing data in memory, which reduces the number of read/write operations to disk.
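As a minimal sketch of this in-memory behavior (the input file and column names below are hypothetical), caching a DataFrame keeps it in memory so repeated actions don't re-read from disk:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-example").getOrCreate()

events = spark.read.parquet("events.parquet")  # hypothetical input path
errors = events.filter(events["status"] == "error").cache()  # keep results in memory

print(errors.count())                                   # first action materializes and caches the data
print(errors.select("user_id").distinct().count())      # second action reuses the cached data

spark.stop()
```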
Multi-Language Support
Spark provides built-in APIs in Java, Scala, Python, and R, so you can write applications in the language that best fits your team.
Advanced Analytics
Spark supports not only 'map' and 'reduce' operations, but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
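For example, here is a short, illustrative sketch of Spark's SQL support (the sales.csv file and its columns are assumptions), registering a DataFrame as a temporary view and querying it with standard SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-example").getOrCreate()

sales = spark.read.csv("sales.csv", header=True, inferSchema=True)  # hypothetical input
sales.createOrReplaceTempView("sales")

# Standard SQL over the same data the DataFrame API operates on
top_regions = spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
""")
top_regions.show()

spark.stop()
```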
How Apache Spark Is Used
The Apache Spark application consists of two main components: a driver, which converts the user's code into multiple tasks that can be distributed across worker nodes, and executors, which run on those nodes and execute the tasks assigned to them.
A cluster manager is needed to mediate between the two; YARN, Mesos, Kubernetes, and Spark's standalone cluster manager are all well-known options.
How Apache Spark Works
A Spark program is executed either through an interactive client (e.g., the Scala shell or PySpark) or by submitting a job via the spark-submit command.
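For instance, a batch job might be submitted to a YARN cluster like this (the application name, script, and resource sizes below are illustrative assumptions, not fixed requirements):

```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name my-spark-job \
  --num-executors 4 \
  --executor-memory 4g \
  my_app.py
```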
Spark orchestrates its operations through the driver program. When the driver program is run, the Spark framework initializes executor processes on the cluster hosts that process your data.
The following occurs when you submit a Spark application to a cluster:
- The driver is launched and invokes the main method in the Spark application.
- The driver requests resources from the cluster manager to launch executors.
- The cluster manager launches executors on behalf of the driver program.
- The driver runs the application. Based on the transformations and actions in the application, the driver sends tasks to executors.
- Tasks are run on executors to compute and save results.
- If dynamic allocation is enabled, after executors are idle for a specified period, they are released.
- When the driver's main method exits or calls SparkContext.stop, it terminates any outstanding executors and releases resources from the cluster manager.
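To make that lifecycle concrete, here is a minimal, hedged PySpark sketch (the input file and column names are hypothetical): the driver builds the SparkSession, transformations like filter and groupBy only build an execution plan, and actions like show and count are what cause the driver to send tasks to the executors.

```python
from pyspark.sql import SparkSession

# Driver code: creating the SparkSession requests executors via the cluster manager
spark = SparkSession.builder.appName("lifecycle-example").getOrCreate()

logs = spark.read.json("app_logs.json")  # hypothetical input

# Transformations are lazy -- they only build up an execution plan
errors = logs.filter(logs["level"] == "ERROR")
by_service = errors.groupBy("service").count()

# Actions trigger the driver to schedule tasks on the executors
by_service.show()
print("total errors:", errors.count())

# Stopping the session releases the executors back to the cluster manager
spark.stop()
```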
Apache Spark Alternatives
Top Apache Spark alternatives include Apache Flink, Presto, and Google Dataflow.
Apache Flink
Apache Flink is an open source framework used for event-driven applications, stream and batch analytics, and data pipelines. At its core, Flink is a "high-throughput, low-latency streaming engine."
Presto
Presto is an open source distributed SQL query engine used in big data. Compared to Spark, Presto is a more specialized query engine and is less applicable for machine learning or data transformation.
Google Dataflow
Google Dataflow is a fully managed, commercial service that executes Apache Beam pipelines. Apache Beam serves as Dataflow's SDK, and Beam pipelines can also run on other engines, including Spark and Flink. Google Dataflow is a good option for those already working within the Google Cloud ecosystem.
Final Thoughts
If your organization needs to process real-time or streaming data, apply machine learning approaches, or execute a large number of queries across multiple data sources, then Apache Spark will likely be a good option. With multi-language support, it's a good fit for teams working in (or even mixing) Java, Python, Scala, and R.
However, working with Apache Spark can have sharp edges due to the scale at which it's deployed. Before you start development, be sure you and your team have the requisite knowledge and experience to avoid making any potentially costly mistakes.
Get Technical Support for Your Apache Spark Deployments
When you're working with data at scale with Spark, having SLA-backed support at the ready is critical. Talk to an expert today to learn more about how OpenLogic can provide SLA-backed support for your Spark deployments.
Additional Resources
- White Paper - The New Stack: Cassandra, Kafka, and Spark
- On-Demand Webinar - Real-Time Data Lakes: Kafka Streaming With Spark
- Blog - Spark vs. Hadoop: Key Differences and Use Cases
- Blog - Processing Data Streams With Kafka and Spark