Category: Spark


Real World Spark 2 – Spark Core Overview

By Toyin Akin,

Real World Spark 2 - Spark Core Overview

Why you should take a look at Spark 2. The easiest, open source, modern cluster computation engine to write code against

Course Access


You can access all the Big Data / Spark courses for one low monthly fee. Currently the membership site houses courses that covers deploying Hadoop with Cloudera and Hortonworks as well as installing and working with Spark 2.0.

This course can be purchased from


Why Apache Spark …

Apache Spark run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing. Apache Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python and R shells. Apache Spark can combine SQL, streaming, and complex analytics.

Apache Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

Spark Overview

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Jupyter Notebook

Jupyter Notebook is a system similar to Mathematica that allows you to create “executable documents”. Notebooks integrate formatted text (Markdown), executable code (Scala),

The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.

The Jupyter Notebook is based on a set of open standards for interactive computing. Think HTML and CSS for interactive computing on the web. These open standards can be leveraged by third party developers to build customized applications with embedded interactive computing.

Spark shell

Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python.

ScalaIDE

Scala IDE provides advanced editing and debugging support for the development of pure Scala and mixed Scala-Java applications.

Spark Monitoring and Instrumentation

Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:

A list of scheduler stages and tasks A summary of RDD sizes and memory usage Environmental information. Information about the running executors


Recommended Spark course path. If you already have spark installed, you do not need to access the first three courses

Spark.Courses

Real World Vagrant – Build an Apache Spark Development Env!

By Toyin Akin,

Real World Vagrant - Build an Apache Spark Development Env!

With a single command, build an IDE, Scala and Spark (1.6.2 or 2.0.1) Development Environment! Run in under 3 minutes!!

Course Access


You can access all the Big Data / Spark courses for one low monthly fee. Currently the membership site houses courses that covers deploying Hadoop with Cloudera and Hortonworks as well as installing and working with Spark 2.0.

This course can be purchased from


Note : This course is built on top of the “Real World Vagrant For Distributed Computing – Toyin Akin” course

This course enables you to package a complete Spark Development environment into your own custom 2.3GB vagrant box.

Once built you no longer need to manipulate your Windows machine in order to get a fully fledged Spark environment to work. With the final solution, you can boot up a complete Apache Spark environment in under 3 minutes!!

Install any version of Spark you prefer. We have codified for 1.6.2 or 2.0.1. but it’s pretty easy to extend this for a new version.

Why Apache Spark …

Apache Spark run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
Apache Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python and R shells.
Apache Spark can combine SQL, streaming, and complex analytics.
Apache Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.


Recommended Spark course path. If you already have spark installed, you do not need to access the first three courses

Spark.Courses

Vagrant For Distributed Computing Course

By Toyin Akin,

vagrant.logo

NoSQL“, “Big Data“, “DevOps” and “In Memory Database” technology are a hot and highly valuable skill to have – and this course will teach you how to quickly create a distributed environment for you to deploy these technologies on.

Course Access


You can access all the Big Data / Spark courses for one low monthly fee. Currently the membership site houses courses that covers deploying Hadoop with Cloudera and Hortonworks as well as installing and working with Spark 2.0.

This course can be purchased from


A combination of VirtualBox and Vagrant will transform your desktop machine into a virtual cluster. However this needs to be configured correctly. Simply enabling multinode within Vagrant is not good enough. It needs to be tuned. Developers and Operators within large enterprises, including investment banks, all use Vagrant to simulate Production environments.

After all, if you are developing against or operating a distributed environment, it needs to be tested. Tested in terms of code deployed and the deployment code itself.

You’ll learn the same techniques these enterprise guys use on your own Microsoft Windows computer/laptop.

Vagrant provides easy to configure, reproducible, and portable work environments built on top of industry-standard technology and controlled by a single consistent workflow to help maximize the productivity and flexibility of you and your team.

This course will use VirtualBox to carve out your virtual environment. However the same skills learned with Vagrant can be used to provision virtual machines on VMware, AWS, or any other provider.

If you are a developer, this course will help you will isolate dependencies and their configuration within a single disposable, consistent environment, without sacrificing any of the tools you are used to working with (editors, browsers, debuggers, etc.). Once you or someone else creates a single Vagrantfile, you just need to vagrant up and everything is installed and configured for you to work. Other members of your team create their development environments from the same configuration. Say goodbye to “works on my machine” bugs.

If you are an operations engineer, this course will help you build a disposable environment and consistent workflow for developing and testing infrastructure management scripts. You can quickly test your deployment scripts and more using local virtualization such as VirtualBox or VMware. (VirtualBox for this course). Ditch your custom scripts to recycle EC2 instances, stop juggling SSH prompts to various machines, and start using Vagrant to bring sanity to your life.

If you are a designer, this course will help you with distributed installation of software in order for you to focus on doing what you do best: design. Once a developer configures Vagrant, you do not need to worry about how to get that software running ever again. No more bothering other developers to help you fix your environment so you can test designs. Just check out the code, vagrant up, and start designing.