HDInsight Whitepapers
Informative whitepapers covering the operation of HDInsight, including:
1) Compression in Hadoop
When using Hadoop, there are many challenges in dealing with large data sets. The goal of this document is to provide compression techniques that you can use to optimize your Hadoop jobs, and reduce bottlenecks associated with moving and processing large data sets.
In this paper, we describe the problem of data volumes in the different phases of a Hadoop job, and explain how we have used compression to mitigate these problems. We review the compression tools and techniques that are available, and report on tests of each tool. We also describe how to enable compression and decompression using both command-line arguments and configuration files.
To review the document, please download the Compression in Hadoop Word document.
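As a rough illustration of the kind of codec comparison the abstract mentions, the sketch below measures compression ratio and compression time for two formats that Hadoop also provides codecs for, gzip and bzip2. It is not taken from the whitepaper: it uses Python's standard-library implementations rather than Hadoop's, the function name `compare_codecs` and the sample log line are hypothetical, and its numbers say nothing about the results reported in the paper.

```python
import bz2
import gzip
import time


def compare_codecs(data: bytes) -> dict:
    """Compress `data` with each codec; report size ratio and elapsed time.

    Hypothetical helper for illustration only. It uses Python's stdlib
    gzip and bz2 modules, not Hadoop's codec classes.
    """
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bz2": (bz2.compress, bz2.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Verify that the round trip is lossless before recording anything.
        if decompress(packed) != data:
            raise ValueError(f"{name} round trip failed")
        results[name] = {
            "ratio": len(packed) / len(data),  # smaller is better
            "seconds": elapsed,
        }
    return results


if __name__ == "__main__":
    # Made-up, highly repetitive log-style sample; real data compresses less.
    sample = b"2013-01-01,host-42,GET,/index.html,200\n" * 50_000
    for name, stats in compare_codecs(sample).items():
        print(f"{name}: ratio={stats['ratio']:.4f} time={stats['seconds']:.3f}s")
```

A comparison like this shows only the trade-off between output size and CPU time on one machine. In a Hadoop job, other factors such as whether a format can be split across map tasks also matter.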
2) Hadoop Performance in Hyper-V
Compelling use cases from industry leaders are quickly changing Hadoop from an emerging technology to an industry standard. However, Hadoop requires considerable resources, and in the search for computing power, users are increasingly asking whether it is possible to virtualize Hadoop (that is, to create clusters on a virtual machine farm) in order to build a private cloud infrastructure.
This paper presents the results of internal benchmarks by Microsoft IT, in which the performance of jobs on a private cloud using virtual machines was compared to that of the same jobs running on servers dedicated to Hadoop. The goal was to determine whether Hadoop clusters hosted in Microsoft Hyper-V can be as efficient as physical clusters.
The results indicate that the performance impact of virtualization is small, and that Hadoop on Microsoft Hyper-V offers compelling performance as well as other benefits.
To review the document, please download the Performance of Hadoop on Windows in Hyper-V Environments Word document.
3) Job Optimization in Hadoop
The Map/Reduce paradigm has greatly simplified development of large-scale data processing tasks. However, when processing data at the terabyte or petabyte scale in Hadoop, jobs might run for hours or even days. Therefore, understanding how to analyze, fix, and fine-tune the performance of Map/Reduce jobs is an extremely important skill for Hadoop developers.
This paper describes the principal bottlenecks that occur in Hadoop jobs, and presents a selection of techniques for resolving each issue and mitigating performance problems on different workloads. The paper explains the interaction of disk I/O, CPU, RAM and other resources, and demonstrates with examples why efforts to tune performance should adopt a balanced approach.
It includes the results of extensive performance-tuning experiments, which show significant differences in the speed of the same Map/Reduce job before and after tuning.
To review the document, please download the Hadoop Job Optimization Word document.
4) Leveraging a Hadoop cluster from SQL Server Integration Services (SSIS)
With the explosion of data, the open source Apache™ Hadoop™ Framework is gaining traction thanks to the huge ecosystem that has arisen around the core functionalities of the Hadoop distributed file system (HDFS™) and Hadoop Map/Reduce. Having SQL Server work with Hadoop™ is becoming increasingly important because the two are complementary. For instance, while petabytes of data can be stored unstructured in Hadoop and take hours to be queried, terabytes of data can be stored in a structured way in the SQL Server platform and queried in seconds. This leads to the need to transfer data between Hadoop and SQL Server.
This white paper explores how SQL Server Integration Services (SSIS), the SQL Server Extract, Transform and Load (ETL) tool, can be used to automate the execution of Hadoop and non-Hadoop jobs, and to manage data transfers between Hadoop and other sources and destinations.
To review the document, please download the Leveraging a Hadoop cluster from SQL Server Integration Services (SSIS) Word document.