

Real World Hadoop – Hands on Enterprise Distributed Storage

By Toyin Akin

Master the art of manipulating files within a distributed storage enterprise platform. It’s easier than you think!

Course Access


You can access all the Big Data / Spark courses for one low monthly fee. Currently, the membership site houses courses that cover deploying Hadoop with Cloudera and Hortonworks, as well as installing and working with Spark 2.0.

This course can be purchased from


The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems.

We will be manipulating the HDFS file system. But why are enterprises interested in HDFS in the first place?

However, the differences from other distributed file systems are significant.

HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.

HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

HDFS relaxes a few POSIX requirements to enable streaming access to file system data.

HDFS is part of the Apache Hadoop Core project.
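To give a taste of the hands-on work, here is a minimal sketch of manipulating HDFS from Python by shelling out to the standard hdfs dfs command-line tool. It assumes the hdfs binary is on your PATH and points at a running cluster (the QuickStart VM works); the paths and file names are illustrative only.

    import subprocess

    def hdfs(*args):
        # Shell out to the stock "hdfs dfs" CLI; check=True raises if it fails.
        subprocess.run(["hdfs", "dfs", *args], check=True)

    hdfs("-mkdir", "-p", "/user/demo/input")              # create a directory tree
    hdfs("-put", "-f", "local.csv", "/user/demo/input/")  # upload a local file
    hdfs("-ls", "/user/demo/input")                       # list the directory
    hdfs("-cat", "/user/demo/input/local.csv")            # stream the file back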

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
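Curious where a file’s blocks actually live? The standard fsck tool prints the datanodes holding each block’s replicas, which is the same placement information a scheduler uses to move computation to the data. A minimal sketch, again shelling out from Python; the file path is illustrative:

    import subprocess

    # "hdfs fsck <path> -files -blocks -locations" prints, for every block of
    # the file, the addresses of the datanodes storing its replicas.
    report = subprocess.run(
        ["hdfs", "fsck", "/user/demo/input/local.csv",
         "-files", "-blocks", "-locations"],
        capture_output=True, text=True, check=True,
    ).stdout
    print(report)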


Recommended Cloudera Manager curriculum path. If you already have Cloudera Manager installed, you do not need to access the first three courses.


 

Big Data Intro for IT Administrators, Devs and Consultants

By Toyin Akin

Grasp why “Big Data” knowledge is in hot demand for Developers / Consultants and Admins

Course Access


You can access all the Big Data / Spark courses for one low monthly fee. Currently, the membership site houses courses that cover deploying Hadoop with Cloudera and Hortonworks, as well as installing and working with Spark 2.0.

This course can be purchased from


Understand “Big Data” and grasp why, if you are a Developer, Database Administrator, Software Architect or an IT Consultant, you should be looking at this technology stack

There are more job opportunities in Big Data management and analytics than there were last year, and many IT professionals are prepared to invest time and money in training.

Why Is Big Data Different?

In the old days… you know… a few years ago, we would utilize systems to extract, transform and load data (ETL) into giant data warehouses that had business intelligence solutions built over them for reporting. Periodically, all the systems would back up and combine the data into a database where reports could be run and everyone could get insight into what was going on.

The problem was that the database technology simply couldn’t handle multiple, continuous streams of data. It couldn’t handle the volume of data. It couldn’t modify the incoming data in real time. And the reporting tools couldn’t handle anything but a relational query on the back end. Big Data solutions offer cloud hosting, highly indexed and optimized data structures, automatic archival and extraction capabilities, and reporting interfaces designed to provide more accurate analyses that enable businesses to make better decisions.

Better business decisions mean that companies can reduce the risk of their choices, cut costs, and increase marketing and sales effectiveness.

What Are the Benefits of Big Data?

An infographic from Informatica walks through the risks and opportunities associated with leveraging Big Data in corporations:

Big Data is Timely – Knowledge workers spend a large percentage of each workday just attempting to find and manage data.

Big Data is Accessible – Senior executives report that accessing the right data is difficult.

Big Data is Holistic – Information is currently kept in silos within the organization. Marketing data, for example, might be found in web analytics, mobile analytics, social analytics, CRMs, A/B testing tools, email marketing systems, and more… each focused on its own silo.

Big Data is Trustworthy – Organizations measure the monetary cost of poor data quality. Things as simple as monitoring multiple systems for customer contact information updates can save millions of dollars.

Big Data is Relevant – Organizations are dissatisfied with their tools’ ability to filter out irrelevant data. Something as simple as filtering customers out of your web analytics can provide a ton of insight into your acquisition efforts.

Big Data is Authoritative – Organizations struggle with multiple versions of the truth depending on the source of their data. By combining multiple, vetted sources, more companies can produce highly accurate intelligence sources.

Big Data is Actionable – Outdated or bad data results in organizations making bad decisions that can cost billions.


Recommended Cloudera Manager curriculum path. If you already have Cloudera Manager installed, you do not need to access the first three courses.


 

Real World Hadoop – Deploying Hadoop with Cloudera Manager

By Toyin Akin

Move to the next step from the Cloudera QuickStart VM. Install a DEV Hadoop Environment with Enterprise Tooling

Course Access


You can access all the Big Data / Spark courses for one low monthly fee. Currently, the membership site houses courses that cover deploying Hadoop with Cloudera and Hortonworks, as well as installing and working with Spark 2.0.

This course can be purchased from


“Big Data” technology is a hot and highly valuable skill to have – and this course will teach you how to quickly deploy a Hadoop cluster using the Cloudera stack.

Cloudera allows you to download a QuickStart virtual machine, which is great for developers, but it is of no use to the operations team for planning and building out DEV / UAT and PROD environments within their organization. What assumptions were made when the QuickStart VM was put together?

In addition, hosting all of Cloudera’s processes as well as Hadoop’s processes on one VM is not a model that any large organization can or should follow. The Hadoop services need to be split out across multiple VMs/servers. In fact, that’s the whole point of Hadoop!

Distributed Data and Distributed Compute.

After all, if you are developing against or operating a distributed environment, it needs to be tested. Tested in terms of forcing various failure modes within the cluster and ensuring that the cluster can still respond to user requests. Killing the QuickStart VM destroys the entire cluster!

You’ll learn the same techniques these large enterprise guys use to move to the next step in building out an enterprise grade Hadoop cluster.

If you are a developer, the operations team can build out that centralized cluster so that you are truly testing against a distributed cluster. Testing code against the QuickStart VM may work, but as any experienced distributed developer knows, verifying code against a pseudo-cluster on a single machine is different from verifying it against a truly distributed cluster.

For example, network or CPU bottlenecks will come to light. In addition, this will assist in capacity planning for the UAT / PROD clusters, as initial metrics can be acquired.

If you are in operations, this gives the operations team an environment in which to start learning how to jointly operate the cluster. Here the team can start to understand cluster metrics, adding/removing cluster nodes, managing the various Hadoop services (ZooKeeper, HDFS, YARN and Spark) and a lot more. We also look at managing Cloudera Hadoop parcels as well as changing Hadoop versions once a cluster is deployed.

The operations team can start to develop procedures and change-management documentation ready for Production operation of a Hadoop cluster.
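As a small example of the cluster metrics the team will get familiar with, the dfsadmin report summarises configured and remaining capacity and lists every live and dead datanode. A minimal sketch, assuming shell access to a machine with the Hadoop client configured:

    import subprocess

    # "hdfs dfsadmin -report" prints cluster-wide capacity figures followed by
    # a per-datanode breakdown of usage and liveness.
    report = subprocess.run(["hdfs", "dfsadmin", "-report"],
                            capture_output=True, text=True, check=True).stdout
    print(report)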


Recommended Cloudera Manager curriculum path. If you already have Cloudera Manager installed, you do not need to access the first three courses.


 

Real World Hadoop – Automating Hadoop install with Python!

By Toyin Akin

Deploy a Hadoop cluster (Zookeeper, HDFS, YARN, Spark) with Cloudera Manager’s Python API. Hands on.

Course Access


You can access all the Big Data / Spark courses for one low monthly fee. Currently, the membership site houses courses that cover deploying Hadoop with Cloudera and Hortonworks, as well as installing and working with Spark 2.0.

This course can be purchased from


Note: This course is built on top of the “Real World Vagrant – Automate a Cloudera Manager Build – Toyin Akin” course.

Deploy a Hadoop cluster (Zookeeper, HDFS, YARN, Spark) with Python! Instruct Cloudera Manager to do the work! Hands on. Here we use Python to instruct an already installed Cloudera Manager to deploy your Hadoop Services.

The Cloudera Manager API provides configuration and service lifecycle management, service health information and metrics, and allows you to configure Cloudera Manager itself. The API is served on the same host and port as the Cloudera Manager Admin Console, and does not require an extra process or extra configuration. The API supports HTTP Basic Authentication, accepting the same users and credentials as the Cloudera Manager Admin Console.
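To make that concrete, here is a minimal sketch of connecting with the cm_api Python client (an assumption: the cm-api package is installed separately, e.g. via pip) and walking the cluster. The host name and credentials are illustrative:

    from cm_api.api_client import ApiResource

    # Same host/port and HTTP Basic Auth credentials as the Admin Console.
    api = ApiResource("cm-host.example.com", username="admin", password="admin")

    for cluster in api.get_all_clusters():
        print("cluster:", cluster.name)
        for service in cluster.get_all_services():
            # e.g. "HDFS-1 STARTED GOOD"
            print(" ", service.name, service.serviceState, service.healthSummary)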


Here are some of the cool things you can do with Cloudera Manager via the API:

Deploy an entire Hadoop cluster programmatically. Cloudera Manager supports HDFS, MapReduce, YARN, ZooKeeper, HBase, Hive, Oozie, Hue, Flume, Impala, Solr, Sqoop, Spark and Accumulo.
Configure various Hadoop services and get config validation.
Take admin actions on services and roles, such as start, stop, restart, failover, etc. (see the sketch after this list). Also available are more advanced workflows, such as setting up high availability and decommissioning.
Monitor your services and hosts, with intelligent service health checks and metrics.
Monitor user jobs and other cluster activities.
Retrieve timeseries metric data.
Search for events in the Hadoop system.
Administer Cloudera Manager itself.
Download the entire deployment description of your Hadoop cluster in a JSON file.
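And a short, hedged sketch of the admin actions from the list above, reusing the api handle from the previous example; the cluster and service names are illustrative:

    # Restart a service and block until Cloudera Manager reports the outcome.
    cluster = api.get_cluster("cluster")          # illustrative cluster name
    hdfs_service = cluster.get_service("HDFS-1")  # illustrative service name

    cmd = hdfs_service.restart()  # returns an ApiCommand handle immediately
    cmd = cmd.wait()              # poll until the command finishes
    print("restart succeeded:", cmd.success)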

Additionally, with the appropriate licenses, the API lets you:

Perform rolling restart and rolling upgrade.
Audit user activities and accesses in Hadoop.
Perform backup and cross data-center replication for HDFS and Hive.
Retrieve per-user HDFS usage report and per-user MapReduce resource usage report.


Recommended Cloudera Manager curriculum path. If you already have Cloudera Manager installed, you do not need to access the first three courses.


 

 

Vagrant For Distributed Computing Course

By Toyin Akin

“NoSQL”, “Big Data”, “DevOps” and “In Memory Database” technologies are hot and highly valuable skills to have – and this course will teach you how to quickly create a distributed environment for you to deploy these technologies on.

Course Access


You can access all the Big Data / Spark courses for one low monthly fee. Currently, the membership site houses courses that cover deploying Hadoop with Cloudera and Hortonworks, as well as installing and working with Spark 2.0.

This course can be purchased from


A combination of VirtualBox and Vagrant will transform your desktop machine into a virtual cluster. However, this needs to be configured correctly. Simply enabling multi-node within Vagrant is not good enough; it needs to be tuned. Developers and operators within large enterprises, including investment banks, all use Vagrant to simulate Production environments.

After all, if you are developing against or operating a distributed environment, it needs to be tested. Tested in terms of code deployed and the deployment code itself.

You’ll learn the same techniques these enterprise guys use on your own Microsoft Windows computer/laptop.

Vagrant provides easy-to-configure, reproducible, and portable work environments built on top of industry-standard technology and controlled by a single consistent workflow to help maximize the productivity and flexibility of you and your team.

This course will use VirtualBox to carve out your virtual environment. However the same skills learned with Vagrant can be used to provision virtual machines on VMware, AWS, or any other provider.

If you are a developer, this course will help you isolate dependencies and their configuration within a single disposable, consistent environment, without sacrificing any of the tools you are used to working with (editors, browsers, debuggers, etc.). Once you or someone else creates a single Vagrantfile, you just need to vagrant up and everything is installed and configured for you to work. Other members of your team create their development environments from the same configuration. Say goodbye to “works on my machine” bugs.

If you are an operations engineer, this course will help you build a disposable environment and consistent workflow for developing and testing infrastructure management scripts. You can quickly test your deployment scripts and more using local virtualization such as VirtualBox or VMware. (VirtualBox for this course). Ditch your custom scripts to recycle EC2 instances, stop juggling SSH prompts to various machines, and start using Vagrant to bring sanity to your life.
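If you would rather script that workflow in Python like the rest of this curriculum, the third-party python-vagrant package (an assumption: it wraps the vagrant CLI and is installed separately via pip) can drive the same commands. A minimal sketch:

    import vagrant  # third-party "python-vagrant" package (assumed installed)

    v = vagrant.Vagrant(root="path/to/project")  # directory with the Vagrantfile
    v.up()                                       # equivalent to "vagrant up"
    for machine in v.status():                   # one entry per defined VM
        print(machine.name, machine.state)
    v.halt()                                     # power the cluster back down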

If you are a designer, this course will help you get software installed in a distributed environment so you can focus on doing what you do best: design. Once a developer configures Vagrant, you do not need to worry about how to get that software running ever again. No more bothering other developers to help you fix your environment so you can test designs. Just check out the code, vagrant up, and start designing.