Predictive Data Analytics With Apache Spark (Part 1 Introduction)
January 3, 2019
1. Apache Spark
Apache Spark is an open-source distributed platform for fast data processing. Spark's programming model outperforms the traditional Hadoop MapReduce model in several key aspects. Firstly, Spark enables in-memory computation, which makes it much faster than MapReduce. Secondly, Spark natively supports different computational paradigms, such as stream processing and interactive SQL queries. In addition, Spark ships with out-of-the-box modules that cover general-purpose data processing requirements, such as machine learning and graph processing. On top of its core components, Spark offers APIs in several languages (SQL, Scala, Java, Python, and R). Because of these advantages, Spark has become a cornerstone of modern BI and advanced analytics systems.
The basic motivation behind this tutorial is to introduce Apache Spark in a simple way, especially to data scientists who are already familiar with data processing languages such as R and Python. The tutorial is designed to transfer the reader's experience smoothly from Python and R to Apache Spark. To make this transition easier, I use Spark DataFrames as the basic data structure in most parts of the tutorial, which lets the reader quickly discover the similarities and differences between DataFrames in Spark and those in Python/R.
Throughout this tutorial I introduce Spark code blocks that are designed to be mostly generic, so they can be reused with little or no modification in different data analytics applications. I hope you find the material useful for a strong kick-off in Spark programming.
3. The Use Case
I selected Predictive Maintenance as the use case of this tutorial for several reasons. Firstly, while learning Apache Spark, readers also get to learn about a common IoT (Internet of Things) use case. Secondly, Predictive Maintenance allows us to tackle a range of data analysis challenges in Apache Spark, such as feature engineering, dimensionality reduction, regression analysis, and binary and multiclass classification. This makes the code blocks included in this tutorial useful for the widest range of readers. Finally, through this tutorial I am trying to reach out to experts in the IoT and IIoT (Industrial IoT) areas and learn more about the current challenges in these fields.
4. Data At A Glance
The dataset used in this tutorial is the Turbofan Engine Degradation Simulation Data Set. It is an open dataset that can be downloaded from this link. It contains sensor and operational readings generated by 100 turbofan engines of the same model. In the training subset, each engine starts in a healthy state and ends in failure: its performance degrades over time until the engine eventually fails.
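Before loading this data into Spark, it helps to see its shape in plain Python. The sketch below is a minimal, stdlib-only illustration, and it assumes the common layout of the public dataset files: whitespace-delimited rows whose first two columns are the engine id and the cycle number, followed by operational settings and sensor readings (the sample rows and their values here are synthetic and truncated, for illustration only). Since each training engine runs until failure, the remaining useful life (RUL) of a row can be derived as the engine's last observed cycle minus the row's cycle.

```python
from collections import defaultdict

# Synthetic rows in the assumed layout: engine id, cycle number, then
# operational settings and sensor readings (truncated for brevity).
sample_rows = """\
1 1 -0.0007 0.0019 100.0 518.67 641.82
1 2 0.0043 -0.0003 100.0 518.67 642.15
1 3 0.0015 0.0006 100.0 518.67 642.35
2 1 -0.0018 0.0006 100.0 518.67 641.89
2 2 0.0009 -0.0004 100.0 518.67 642.04
"""

def compute_rul(lines):
    """Return {(engine_id, cycle): rul}, where RUL is the number of
    cycles remaining before the engine's last observed cycle."""
    max_cycle = defaultdict(int)
    records = []
    for line in lines.strip().splitlines():
        fields = line.split()
        engine, cycle = int(fields[0]), int(fields[1])
        records.append((engine, cycle))
        max_cycle[engine] = max(max_cycle[engine], cycle)
    return {(e, c): max_cycle[e] - c for e, c in records}

rul = compute_rul(sample_rows)
print(rul[(1, 1)])  # engine 1's last cycle is 3, so 2 cycles remain
```

The same derivation (a per-engine maximum cycle joined back to each row) carries over directly to Spark DataFrames later in the tutorial.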
4.1 Predictive Analytics Tasks
In this tutorial we handle three predictive analytics tasks:
- Predict Engine’s Remaining Useful Life (RUL): how many operational cycles are left before the engine fails?
- Predict Alarm and Normal Time Zones before Failure: given a new sensor reading, predict whether the engine is in the Alarm Zone (i.e. 1 to \(n_1\) operational cycles before failure) or in the Normal Zone (more than \(n_1\) operational cycles before failure).
- Predict Alarm, Warning, and Normal Time Zones before Failure: given a new sensor reading, predict whether the engine is in the Alarm Zone (i.e. 1 to \(n_1\) operational cycles before failure), in the Warning Zone (i.e. \(n_1\) to \(n_2\) operational cycles before failure), or in the Normal Zone (more than \(n_2\) operational cycles before failure).
The following diagram explains the different time zones of an engine.
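The zone definitions above can be sketched as a small labeling function. This is a stdlib-only illustration, and the threshold values \(n_1 = 15\) and \(n_2 = 30\) used below are purely hypothetical; the tutorial leaves their choice to the analyst.

```python
def rul_zone(rul, n1, n2):
    """Map a remaining-useful-life value to a time-zone label.

    n1 and n2 are analyst-chosen thresholds:
      rul <= n1        -> "alarm"
      n1 < rul <= n2   -> "warning"
      rul > n2         -> "normal"
    """
    if rul <= n1:
        return "alarm"
    if rul <= n2:
        return "warning"
    return "normal"

# With illustrative thresholds n1=15 and n2=30:
print(rul_zone(10, 15, 30))  # alarm
print(rul_zone(20, 15, 30))  # warning
print(rul_zone(50, 15, 30))  # normal
```

Setting `n2 == n1` collapses this into the two-zone (Alarm/Normal) labeling of the binary classification task, since the Warning Zone then becomes empty.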
The goal of this tutorial is purely instructional. This means that the outcomes and results included here are not the best possible solutions for the given tasks. The reader is encouraged to build upon the basic solutions introduced in this tutorial and improve the results in her/his own way.