Big data lab with Hadoop, Spark, SAS and Qlik
February 16, 2016
One of the key advantages of being an independent consultant is that you are 100% accountable for your own personal development. This article provides some insight into why and how I spent six weeks learning Hadoop, a major Big Data storage and processing framework.
Even though I am an independent consultant, I prefer to look at it as a "one-man" business. As with any other business, there are many aspects to take into account. It is important to have a steady income stream to pay for basic needs (the "operational" part of business). It is just as important, and even more so in the long run, to invest in yourself (the "R&D" part of business). It is certainly valuable to invest in what you love to do, but you also need to consider how the market is evolving and spot opportunities as they arise.
My personal focus area is Big Data management and Business Analytics software applications. This basically covers all activities for transforming raw data into valuable insights. Throughout the years I have combined many job roles, from pre-sales and business development to IT architecture, software installation and configuration, technical support and end-user coaching. One constant since the start of my career in 2008 is that I have primarily worked with SAS Analytics software for transforming that raw data into valuable insights. SAS has been the market leader in enterprise Business Analytics software for many years and I am sure they will continue to be successful in that area. With Hadoop I was not looking for a replacement for SAS; instead, I wanted to find out and experience hands-on how the two technologies complement each other.
What is Hadoop? The Apache Hadoop software library is an open-source framework that allows for the distributed storage and processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
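The "simple programming models" mentioned above refers mainly to MapReduce: each node maps over the data blocks it holds locally, and results are merged by key in a reduce phase. A minimal pure-Python sketch of the idea (the three-block word count below is an illustrative stand-in, not Hadoop code):

```python
from collections import defaultdict

# Hypothetical dataset, as if split into three blocks across HDFS nodes.
blocks = [
    "big data needs big storage",
    "hadoop stores big data",
    "spark processes data fast",
]

def map_phase(block):
    # Map: emit a (word, 1) pair for every word in the local block.
    return [(word, 1) for word in block.split()]

def reduce_phase(pairs):
    # Reduce: after the shuffle groups pairs by key, sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each block is mapped independently (on a real cluster, by the node
# that physically stores that block); the shuffle merges the output.
intermediate = [pair for block in blocks for pair in map_phase(block)]
word_counts = reduce_phase(intermediate)
print(word_counts["big"])   # 3
print(word_counts["data"])  # 3
```

Because the map step touches only local data, adding machines adds both storage and processing capacity, which is exactly the scaling property described above.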
I mentioned earlier the importance of spotting market opportunities. Hadoop has been around for roughly a decade, and its underlying building blocks for even longer. Nevertheless, only recently has the software matured to a point where IT organizations can consider it for enterprise-wide deployment: processing capabilities were optimized, and operational, security and governance features were added. In addition, organizations such as Hortonworks, Cloudera and MapR have started offering commercial technical support and consulting services for the open-source Hadoop framework.
To be able to offer consulting services in the Hadoop domain, I needed training and some hands-on experience. The rest of this article describes how I went from zero knowledge and experience to a solid foundation.
After considering various options, I chose the Hortonworks University Self-Paced Learning Library as my primary training resource, supplemented by the O'Reilly books Hadoop: The Definitive Guide and Hadoop Application Architectures.
Hortonworks is one of the key Hadoop distributors, alongside Cloudera and MapR. Since the primary source of income for these companies is technical support and consulting services for Hadoop, I thought it safe to assume that they would have the best learning material available. Hortonworks' offering seemed the better fit for me, so I opted for their online Learning Library.
Hortonworks' Learning Library consists of three major streams:
- Hadoop essentials: An introduction to all Hadoop building blocks (YARN, HDFS, MapReduce, Hive, HBase, Spark and many more)
- Hadoop administration: Installing and configuring Hadoop (with a major focus on Hortonworks' own distribution, the Hortonworks Data Platform)
- Hadoop development: Primarily focuses on storing and processing big data with Hive, Pig, HBase, Storm and Spark. Additional topics include Java programming and developing custom YARN applications.
Hardware and OS
In order to create my Big Data learning and development lab, I used the following hardware configuration:
- Case: 5x Intel Next Unit of Computing (NUC)
- Processor: 5x Intel Core i3-5010U processor (2.1 GHz Dual Core, 3 MB Cache, 15W TDP)
- RAM: 5x Kingston ValueRAM 8 GB SODIMM, 1.35 V (expandable to 16 GB)
- Hard drive: 5x Western Digital Blue 500 GB, SATA III, 5400 rpm, 8 MB cache
- Network connection: 5x Intel Gigabit (1000 Mbps) network connection
- Network switch: TP-LINK TL-SG108E 8-port Gigabit Ethernet switch
For production environments, see for example Hortonworks' cluster sizing guide.
After setting up the hardware, I installed CentOS 7 Linux on all machines. Since I am not a seasoned Linux system administrator, I first followed the training "Up and Running with CentOS Linux" from Lynda.com. Note that Lynda.com offers a free 10-day trial with unlimited access to all resources (including the Hadoop Fundamentals training).
Hadoop (Hortonworks distribution)
After taking care of the prerequisites, it was time to install Hadoop. I would not recommend installing the off-the-shelf Hadoop binaries from apache.org, for a couple of reasons. First, Hadoop consists of many building blocks, such as YARN, HDFS, MapReduce, Hive, Pig and HBase, and when setting up a cluster it is important that all of these components are compatible with each other. Second, a manual installation is not only very time-consuming but also prone to configuration errors.
Alternatively, Hortonworks, Cloudera and MapR offer free integrated Hadoop distributions that include user-friendly installation wizards. I was able to set up a Hadoop cluster in under 30 minutes.
Here are some screenshots from Hortonworks' management console Ambari:
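Beyond the console, Ambari also exposes the same information through a REST API under `/api/v1`, which is handy for scripting health checks. A minimal sketch using only the standard library (the hostname and the default admin/admin credentials below are placeholders; change them for any real cluster):

```python
import base64
import json
import urllib.request

# Placeholder host; Ambari listens on port 8080 by default.
AMBARI = "http://ambari-host.example.com:8080"

def ambari_get(path, user="admin", password="admin"):
    """Issue an authenticated GET against the Ambari REST API."""
    req = urllib.request.Request(f"{AMBARI}/api/v1{path}")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example usage against a live cluster (not run here):
# clusters = ambari_get("/clusters")
# print([c["Clusters"]["cluster_name"] for c in clusters["items"]])
```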
Hadoop (Cloudera distribution)
In order to compare Hortonworks and Cloudera, I also installed the Cloudera Hadoop distribution. Since both distributions share most Hadoop components, they are, naturally, very similar. The management consoles differ (Hortonworks' Ambari versus Cloudera Manager) and Cloudera offers some additional premium components for a fee, but the real difference lies in their technical support and consulting capabilities. In my opinion, the success of an IT implementation is rarely attributable to the software itself; it primarily depends on the people implementing and delivering the project.
Here are some screenshots from Cloudera's management console Cloudera Manager:
Before continuing, I reverted to Hortonworks' distribution, mainly because I subscribe to their online Learning Library and had gotten used to the Ambari management console.
SAS Data Loader for Hadoop
SAS is the worldwide market leader in advanced analytics. At its core, SAS' processing engine uses statistical procedures to recognize patterns in raw data and translate them into valuable insights that support the decision-making process.
SAS does not offer any relational or NoSQL database management systems. Instead, it integrates with the most common RDBMSs (Oracle, SQL Server, PostgreSQL and others), traditional MPP data warehouses (Teradata, Netezza, Greenplum) and, most recently, Hadoop. Integration does not simply mean moving the data from the database to the SAS server for processing; it means processing the data inside (or alongside) the database using proprietary in-database processing engines.
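The value of in-database processing is easy to see in miniature. The sketch below uses SQLite purely as a stand-in for any SQL engine (an RDBMS, or Hive on Hadoop) and contrasts pulling every row to the client with pushing the aggregation down so that only the small result set crosses the wire:

```python
import sqlite3

# SQLite stands in for the remote database; the table is toy data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU", 100.0), ("EU", 200.0), ("US", 50.0), ("US", 150.0)],
)

# Naive integration: transfer every row, then aggregate on the client.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
client_side = {}
for region, amount in rows:
    client_side[region] = client_side.get(region, 0.0) + amount

# In-database processing: push the aggregation down; only the
# per-region totals are transferred back.
pushed_down = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

assert client_side == pushed_down  # same answer, far less data moved
print(sorted(pushed_down.items()))  # [('EU', 300.0), ('US', 200.0)]
```

With four rows the difference is cosmetic; with billions of rows in Hadoop, shipping only the aggregates is what makes the integration practical.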
The SAS Data Loader for Hadoop includes the following technical components:
- SAS/ACCESS Interface to Hadoop: A connection interface for executing HiveQL commands from a SAS session. The HiveQL processing takes place in Hadoop and the results are transferred to the SAS session.
- Integration with Spark: For executing data quality functions
- In-database processing engine: Certain ETL and data quality directives, as well as scoring models (DS2 code), are translated into MapReduce jobs and executed within the Hadoop cluster.
- A web interface for executing directives
A trial version is available at sas.com.
Here are some screenshots from the SAS Data Loader for Hadoop, starting with the installation wizard:
SAS Data Loader for Hadoop includes many directives, which are translated into HiveQL queries, Spark code or MapReduce jobs. You can even write your own DS2 ("data step 2") code, which is translated into MapReduce jobs for distributed processing inside Hadoop:
The following screenshot illustrates the distributed processing of SAS MapReduce jobs:
Data visualization with Microsoft Excel
Once the data processing is done, the final step is to visualize the results. Usually, ODBC/JDBC drivers are used to connect the data visualization tool to the database. For Hadoop, out-of-the-box ODBC/JDBC drivers are available for Hive and Spark: SQL queries are launched from the data visualization tool against Hadoop (via Hive or Spark), and the aggregated results are transferred back to the tool.
Note that Hive and Spark share the same SQL front end. One advantage of Spark is that Spark RDD tables can be stored in-memory (in addition to on-disk). This feature can be useful for setting up a high-performance data visualization environment on top of Hadoop.
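The same ODBC pattern the visualization tools use can be sketched from any client. The snippet below assumes the pyodbc package plus a configured Hive ODBC data source; the DSN name "HiveDSN" and the `sales` table are placeholders. Because Hive and Spark share the SQL front end, pointing the DSN at a Spark SQL endpoint instead leaves the query unchanged:

```python
# The aggregation runs inside the Hadoop cluster; only the small
# per-region result set is transferred back over ODBC.
AGG_QUERY = """
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region
"""

def fetch_aggregates(dsn="HiveDSN", query=AGG_QUERY):
    # Deferred import: pyodbc (and the Hive ODBC driver behind the
    # DSN) is only needed when a connection is actually made.
    import pyodbc
    conn = pyodbc.connect(f"DSN={dsn}", autocommit=True)
    try:
        return conn.cursor().execute(query).fetchall()
    finally:
        conn.close()
```

This is essentially what Excel or Qlik does under the hood when you point it at the Hive driver and build a chart.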
The following screenshots illustrate the usage of the Hive ODBC driver in combination with Microsoft Excel.
Data visualization with Qlik Sense Desktop
ODBC drivers are tool-agnostic. The following screenshots illustrate the usage of the same Hive ODBC driver in combination with Qlik Sense Desktop, a data visualization tool that can be downloaded free of charge from qlik.com.
Thank you for taking the time to read this article. In case of any questions or remarks, feel free to reach out to me.