Category: Hadoop


Real World Vagrant – Build an Apache Spark Development Env!

By Toyin Akin


With a single command, build an IDE, Scala and Spark (1.6.2 or 2.0.1) development environment. Runs in under 3 minutes!

Course Access


You can access all the Big Data / Spark courses for one low monthly fee. The membership site currently houses courses that cover deploying Hadoop with Cloudera and Hortonworks, as well as installing and working with Spark 2.0.

This course can be purchased from


Note: This course is built on top of the “Real World Vagrant For Distributed Computing – Toyin Akin” course.

This course enables you to package a complete Spark development environment into your own custom 2.3 GB Vagrant box.

Once built, you no longer need to manipulate your Windows machine in order to get a fully fledged Spark environment to work. With the final solution, you can boot up a complete Apache Spark environment in under 3 minutes!

Install any version of Spark you prefer. We have codified for 1.6.2 and 2.0.1, but it’s pretty easy to extend this for a new version.

Why Apache Spark …

Apache Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
Apache Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use them interactively from the Scala, Python and R shells (see the sketch after this list).
Apache Spark can combine SQL, streaming, and complex analytics.
Apache Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
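
To make these points concrete, here is a minimal sketch of a word count in the Spark Scala API (assuming Spark 2.x for the SparkSession entry point; the input path is hypothetical). The high-level operators chain together lazily into a DAG, and cache() keeps the result in memory so later actions reuse it instead of recomputing it.

    import org.apache.spark.sql.SparkSession

    object WordCountSketch {
      def main(args: Array[String]): Unit = {
        // local[*] uses all local cores; point master at your cluster instead
        val spark = SparkSession.builder()
          .appName("WordCountSketch")
          .master("local[*]")
          .getOrCreate()

        // Hypothetical input path on HDFS
        val lines = spark.sparkContext.textFile("hdfs:///data/sample.txt")

        // High-level operators build the DAG lazily; nothing executes yet
        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .cache() // keep the result in memory for reuse across actions

        counts.take(10).foreach(println)              // first action triggers execution
        println(s"Distinct words: ${counts.count()}") // reuses the cached RDD
        spark.stop()
      }
    }

The same pipeline can also be typed line by line into the interactive spark-shell, which is the Scala shell mentioned above.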


Recommended Spark course path. If you already have Spark installed, you do not need to access the first three courses.

Spark.Courses

Real World Hadoop – Upgrade Cloudera and Hadoop hands on

By Toyin Akin


New version of Hadoop? Need to upgrade a running PROD Hadoop environment without losing data? We show you, hands on …

Course Access


You can access all the Big Data / Spark courses for one low monthly fee. The membership site currently houses courses that cover deploying Hadoop with Cloudera and Hortonworks, as well as installing and working with Spark 2.0.

This course can be purchased from


Note: This course is built on top of the “Real World Vagrant – Automate a Cloudera Manager Build – Toyin Akin” course.


Upgrading Cloudera Manager enables new features of the latest product versions while preserving existing data and settings. Some new settings are added, and some additional steps may be required, but no existing configuration is removed.

Upgrading Cloudera Manager
The process for upgrading Cloudera Manager varies depending on the starting point. Categories of tasks to be completed include the following:

Install databases required for the release. In Cloudera Manager 5, the Host Monitor and Service Monitor roles use an internal database that provides greater capacity and flexibility. You do not need to configure an external database for these roles.
Upgrade the Cloudera Manager Server.
Upgrade the Cloudera Manager Agent. You can use an upgrade wizard that is invoked when you connect to the Admin Console or manually install the Cloudera Manager Agent packages.

Upgrading CDH
Cloudera Manager 5 can manage both CDH 4 and CDH 5. To benefit from the most current CDH features, you must upgrade CDH.


Recommended Cloudera Manager curriculum path. If you already have Cloudera Manager installed, you do not need to access the first three courses.

Cloudera.Courses

 

Real World Vagrant – Hortonworks Data Platform 2.5

By Toyin Akin


Build a Distributed Cluster of Hortonworks 2.5 Manager and Agent nodes with a single command! Includes Spark 2.0!

Course Access


You can access all the Big Data / Spark courses for one low monthly fee. The membership site currently houses courses that cover deploying Hadoop with Cloudera and Hortonworks, as well as installing and working with Spark 2.0.

This course can be purchased from


Note: This course is built on top of the “Real World Vagrant For Distributed Computing – Toyin Akin” course.

“NoSQL”, “Big Data”, “DevOps” and “In Memory Database” technologies are hot and highly valuable skills to have – and this course will teach you how to quickly create a distributed environment on which to deploy them.

A combination of VirtualBox and Vagrant will transform your desktop machine into a virtual cluster. However, this needs to be configured correctly; simply enabling multinode within Vagrant is not good enough – it needs to be tuned. Developers and operators within large enterprises, including investment banks, all use Vagrant to simulate production environments.

After all, if you are developing against or operating a distributed environment, it needs to be tested – both in terms of the code deployed and the deployment code itself.

You’ll learn the same techniques these enterprise teams use, on your own Microsoft Windows computer or laptop.

Vagrant provides easy-to-configure, reproducible, and portable work environments built on top of industry-standard technology, controlled by a single consistent workflow to help maximize the productivity and flexibility of you and your team.

This course will use VirtualBox to carve out your virtual environment. However, the same skills learned with Vagrant can be used to provision virtual machines on VMware, AWS, or any other provider.

If you are a developer, this course will help you isolate dependencies and their configuration within a single disposable, consistent environment, without sacrificing any of the tools you are used to working with (editors, browsers, debuggers, etc.). Once you or someone else creates a single Vagrantfile, you just need to vagrant up and everything is installed and configured for you to work. Other members of your team create their development environments from the same configuration. Say goodbye to “works on my machine” bugs.

If you are an operations engineer, this course will help you build a disposable environment and consistent workflow for developing and testing infrastructure management scripts. You can quickly test your deployment scripts and more using local virtualization such as VirtualBox or VMware (VirtualBox for this course). Ditch your custom scripts to recycle EC2 instances, stop juggling SSH prompts to various machines, and start using Vagrant to bring sanity to your life.

If you are a designer, this course will help you with the distributed installation of software so that you can focus on doing what you do best: design. Once a developer configures Vagrant, you do not need to worry about how to get that software running ever again. No more bothering other developers to help you fix your environment so you can test designs. Just check out the code, vagrant up, and start designing.


Recommended Hortonworks curriculum path.

HDP.Courses

 

Real World Vagrant – Automate a Cloudera Manager Build

By Toyin Akin


Build a Distributed Cluster of Cloudera Manager and any number of Cloudera Manager Agent nodes with a single command!

Course Access


You can access all the Big Data / Spark courses for one low monthly fee. The membership site currently houses courses that cover deploying Hadoop with Cloudera and Hortonworks, as well as installing and working with Spark 2.0.

This course can be purchased from


Note: This course is built on top of the “Real World Vagrant For Distributed Computing – Toyin Akin” course.

“NoSQL”, “Big Data”, “DevOps” and “In Memory Database” technologies are hot and highly valuable skills to have – and this course will teach you how to quickly create a distributed environment on which to deploy them.

A combination of VirtualBox and Vagrant will transform your desktop machine into a virtual cluster. However, this needs to be configured correctly; simply enabling multinode within Vagrant is not good enough – it needs to be tuned. Developers and operators within large enterprises, including investment banks, all use Vagrant to simulate production environments.

After all, if you are developing against or operating a distributed environment, it needs to be tested – both in terms of the code deployed and the deployment code itself.

You’ll learn the same techniques these enterprise teams use, on your own Microsoft Windows computer or laptop.

Vagrant provides easy-to-configure, reproducible, and portable work environments built on top of industry-standard technology, controlled by a single consistent workflow to help maximize the productivity and flexibility of you and your team.

This course will use VirtualBox to carve out your virtual environment. However, the same skills learned with Vagrant can be used to provision virtual machines on VMware, AWS, or any other provider.

If you are a developer, this course will help you isolate dependencies and their configuration within a single disposable, consistent environment, without sacrificing any of the tools you are used to working with (editors, browsers, debuggers, etc.). Once you or someone else creates a single Vagrantfile, you just need to vagrant up and everything is installed and configured for you to work. Other members of your team create their development environments from the same configuration. Say goodbye to “works on my machine” bugs.

If you are an operations engineer, this course will help you build a disposable environment and consistent workflow for developing and testing infrastructure management scripts. You can quickly test your deployment scripts and more using local virtualization such as VirtualBox or VMware (VirtualBox for this course). Ditch your custom scripts to recycle EC2 instances, stop juggling SSH prompts to various machines, and start using Vagrant to bring sanity to your life.

If you are a designer, this course will help you with the distributed installation of software so that you can focus on doing what you do best: design. Once a developer configures Vagrant, you do not need to worry about how to get that software running ever again. No more bothering other developers to help you fix your environment so you can test designs. Just check out the code, vagrant up, and start designing.


Recommended Cloudera Manager curriculum path. If you already have Cloudera Manager installed, you do not need to access the first three courses.

Cloudera.Courses

 

Real World Hadoop – Hands on Enterprise Distributed Storage

By Toyin Akin


Master the art of manipulating files within a distributed storage enterprise platform. It’s easier than you think!

Course Access


You can access all the Big Data / Spark courses for one low monthly fee. The membership site currently houses courses that cover deploying Hadoop with Cloudera and Hortonworks, as well as installing and working with Spark 2.0.

This course can be purchased from


The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.

We will be manipulating the HDFS file system (a short sketch of such manipulation follows the points below), but why are enterprises interested in HDFS in the first place?

HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.

HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

HDFS relaxes a few POSIX requirements to enable streaming access to file system data.

HDFS is part of the Apache Hadoop Core project.
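
To give a taste of manipulating files programmatically, here is a minimal sketch using Hadoop's FileSystem API from Scala (the NameNode address and paths are hypothetical, and the Hadoop client libraries are assumed to be on the classpath). The same API underlies the hdfs dfs command-line tools.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HdfsSketch {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        conf.set("fs.defaultFS", "hdfs://namenode:8020") // hypothetical NameNode address

        val fs = FileSystem.get(conf)

        val dir = new Path("/user/demo") // hypothetical directory
        if (!fs.exists(dir)) fs.mkdirs(dir)

        // Write a small file, then read it back
        val file = new Path(dir, "hello.txt")
        val out = fs.create(file, true) // overwrite if it already exists
        out.writeBytes("Hello, HDFS!\n")
        out.close()

        val in = fs.open(file)
        val buf = new Array[Byte](fs.getFileStatus(file).getLen.toInt)
        in.readFully(buf)
        in.close()
        println(new String(buf, "UTF-8"))

        fs.delete(file, false) // recursive = false
        fs.close()
      }
    }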

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
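
The "interfaces" mentioned above are visible in the same FileSystem API: a client can ask where each block of a file physically lives and schedule its work on those hosts, which is what frameworks such as Spark and MapReduce do. A minimal sketch (the file path is hypothetical):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object BlockLocationsSketch {
      def main(args: Array[String]): Unit = {
        val fs = FileSystem.get(new Configuration())
        val status = fs.getFileStatus(new Path("/user/demo/big.dat")) // hypothetical file

        // One BlockLocation per block; getHosts lists the DataNodes holding replicas
        for (loc <- fs.getFileBlockLocations(status, 0, status.getLen)) {
          println(s"offset=${loc.getOffset} length=${loc.getLength} " +
                  s"hosts=${loc.getHosts.mkString(",")}")
        }
        fs.close()
      }
    }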


Recommended Cloudera Manager curriculum path. If you already have Cloudera Manager installed, you do not need to access the first three courses.

Cloudera.Courses