Archive
HDInsight Whitepapers
Informative whitepapers covering the operation of HDInsight, including:
1) Compression in Hadoop
When using Hadoop, there are many challenges in dealing with large data sets. The goal of this document is to provide compression techniques that you can use to optimize your Hadoop jobs, and reduce bottlenecks associated with moving and processing large data sets.
In this paper, we will describe the problem of data volumes in different phases of a Hadoop job, and explain how we have used compression to mitigate these problems. We review the compression tools and techniques that are available, and report on tests of each tool. We describe how to enable compression and decompression using both command-line arguments and configuration files.
To review the document, please download the Compression in Hadoop Word document.
2) Hadoop Performance in Hyper-V
Compelling use cases from industry leaders are quickly changing Hadoop from an emerging technology to an industry standard. However, Hadoop requires considerable resources, and in the search for computing power, users are increasingly asking if it is possible to virtualize Hadoop (that is, create clusters on a virtual machine farm) to build a private cloud infrastructure.
This paper presents the results of internal benchmarks by Microsoft IT, in which the performance of a private cloud using virtual machines was compared to the same jobs running on servers dedicated to Hadoop. The goal was to determine whether Hadoop clusters hosted in Microsoft Hyper-V can be as efficient as physical clusters.
The results indicate that the performance impact of virtualization is small, and that Hadoop on Microsoft Hyper-V offers compelling performance as well as other benefits.
To review the document, please download the Performance of Hadoop on Windows in Hyper-V Environments Word document.
3) Job Optimization in Hadoop
The Map/Reduce paradigm has greatly simplified development of large-scale data processing tasks. However, when processing data at the terabyte or petabyte scale in Hadoop, jobs might run for hours or even days. Therefore, understanding how to analyze, fix, and fine-tune the performance of Map/Reduce jobs is an extremely important skill for Hadoop developers.
This paper describes the principal bottlenecks that occur in Hadoop jobs, and presents a selection of techniques for resolving each issue and mitigating performance problems on different workloads. The paper explains the interaction of disk I/O, CPU, RAM and other resources, and demonstrates with examples why efforts to tune performance should adopt a balanced approach.
It includes the results of extensive performance-tuning experiments, which produced significant differences in the speed of the same Map/Reduce job before and after tuning.
To review the document, please download the Hadoop Job Optimization Word document.
4) Leveraging a Hadoop cluster from SQL Server Integration Services (SSIS)
With the explosion of data, the open source Apache™ Hadoop™ Framework is gaining traction thanks to the huge ecosystem that has arisen around the core functionality of the Hadoop distributed file system (HDFS™) and Hadoop Map Reduce. Being able to have SQL Server work with Hadoop™ is increasingly important because the two are complementary. For instance, while petabytes of data can be stored unstructured in Hadoop and take hours to be queried, terabytes of data can be stored in a structured way in the SQL Server platform and queried in seconds. This leads to the need to transfer data between Hadoop and SQL Server.
This white paper explores how SQL Server Integration Services (SSIS), the SQL Server Extract, Transform and Load (ETL) tool, can be used to automate Hadoop and non-Hadoop job executions, and to manage data transfers between Hadoop and other sources and destinations.
To review the document, please download the Leveraging a Hadoop cluster from SQL Server Integration Services (SSIS) Word document.
SSAS Crashing Intermittently: Caused by Monitoring / AV Scans
Problem Description:
Analysis Services is crashing intermittently and also producing mini-dumps.
Assessment:
Crashes like this can have several causes. For a full analysis of the mini-dumps, you can engage Microsoft Customer Support Services and ask them to analyze the dumps.
In this case we found:
The SMS client (CcmExec) periodically runs a software inventory that scans the Analysis Services data folder. When a job needs to commit, it has to delete the older version of the data files. If the SMS client holds a file handle on some of the database files at that moment, SSAS is unable to delete the older version of the database, so the commit fails and SSAS crashes.
As the Process Monitor trace shows, CcmExec is browsing through the SQL Server folders:
Resolution
We resolved the issue by adding an exclusion in the SMS client so that it does not browse or scan the SSAS folders:
- Data
- Temp
- Config
- Log
Recommendations
Exclude the Analysis Services folders from virus scans, file monitoring tools, the Systems Management Server client (CcmExec.exe), and any other third-party file monitoring or file backup tool.
For the SQL Server engine, follow the recommendations given in this article: http://support.microsoft.com/kb/309422
Microsoft TechEd India 2013: Come and join our session
Microsoft BI Authentication and Identity Delegation
From straightforward client/server designs to complex architectures relying on distributed Windows services, SharePoint applications, Web services, and data sources, Microsoft BI solutions can pose many challenges to seamless user authentication and end-to-end identity delegation. SQL Server technologies and data providers expect to use Windows authentication while SharePoint Server uses Web Services Security (WS-Security). Flowing a user identity from a Windows or browser-based BI client application through a claims-based SharePoint service to a Windows backend system is not always possible due to various limitations in data providers, security protocols, and identity services. Network, forest, and federation topologies also influence the authentication flows. Familiarity with the authentication protocols and capabilities, delegation limitations, and possible workarounds is an indispensable prerequisite to delivering a positive BI user experience across the entire Microsoft BI solution stack in enterprise environments.
To review the document, please download the Microsoft BI Authentication and Identity Delegation Word document.
Connectivity Issue: "A connection cannot be made to redirector. Ensure that ‘SQL Browser’ service is running"
Symptoms:
An SSAS named instance is running on a cluster with two nodes. On one node we are able to connect to SSAS using the named instance, but after a failover to the other node the connection attempt returns an error message.
SSAS Cluster Virtual Server Name – SSASVirtualServer
Instance is MySSAS
Two Nodes:
NodeA
NodeB
When NodeA is the owner of SSASVirtualServer and we connect to SSASVirtualServer\MySSAS, it works. When we fail SSASVirtualServer over to NodeB and try to connect to SSASVirtualServer\MySSAS, it fails with the error:
A connection cannot be made to redirector. Ensure that ‘SQL Browser’ service is running.
Cause:
The startup account of the SQL Browser service does not have permission to access msmdredir.ini. This account needs both Read and Write permission on the ASConfig folder and its child objects.
By default, SQL Browser periodically checks and updates the 90\Shared\ASConfig\msmdredir.ini file so that it knows the details of each named SSAS instance (port and so on), and it redirects clients that need to connect to a named instance to the correct name and port.
Solution:
a. If SQL Browser is running under the "NT AUTHORITY\LOCAL SERVICE" account, ensure the account has read and write permission on the C:\Program Files (x86)\Microsoft SQL Server\90\Shared\ASConfig folder and its child objects. If SQL Browser runs under another account, ensure the same for that account.
b. If you are not sure about the permissions, change the SQL Browser startup account from Local Service to Local System and restart the service.
Today we resolved an issue of the same nature. Thanks to Saman Alaghehband for his time and patience.
Note: In a cluster environment, it is always recommended to connect using the SSAS virtual server name.
DAX: Using filter and summarize in the same query
In this SQL query we group sales by year and color, and keep only the groups whose total sales amount exceeds 5000:
SELECT CalendarYear, Color, SUM(SalesAmount)
FROM DimProduct
JOIN FactInternetSales
    ON DimProduct.ProductKey = FactInternetSales.ProductKey
JOIN DimDate
    ON DimDate.DateKey = FactInternetSales.OrderDateKey
GROUP BY CalendarYear, Color
HAVING SUM(SalesAmount) > 5000
ORDER BY CalendarYear, Color
The equivalent DAX query is:
EVALUATE
FILTER (
    SUMMARIZE (
        'Internet Sales',
        'Date'[Calendar Year],
        'Product'[Color],
        "Sales Amount", SUM ( 'Internet Sales'[Sales Amount] )
    ),
    CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ) ) > 5000
)
ORDER BY 'Date'[Calendar Year], 'Product'[Color]
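Both queries follow the same two steps: first summarize (group by year and color and total the sales), then filter on the group totals. The following minimal Python sketch shows that logic outside of any database. The function name is hypothetical and the rows are made-up sample values, not AdventureWorks data.

```python
from collections import defaultdict


def summarize_and_filter(rows, threshold):
    """Group rows by (year, color), sum the sales, keep groups above threshold."""
    totals = defaultdict(float)
    for year, color, amount in rows:
        # Summarize step: GROUP BY in SQL, SUMMARIZE in DAX
        totals[(year, color)] += amount
    # Filter step: HAVING in SQL, FILTER in DAX (applied to group totals)
    kept = {key: total for key, total in totals.items() if total > threshold}
    # Order step: ORDER BY year, color
    return sorted((year, color, total) for (year, color), total in kept.items())


# Made-up sample rows: (calendar year, color, sales amount)
sample_rows = [
    (2007, "Red", 3000.0),
    (2007, "Red", 2500.0),
    (2007, "Blue", 1200.0),
    (2008, "Black", 6000.0),
    (2008, "Blue", 4000.0),
]

result = summarize_and_filter(sample_rows, 5000)
for row in result:
    print(row)
```

Note that the threshold is tested against each group total, not against individual rows: neither 2007 "Red" row exceeds 5000 on its own, but their sum does, so the group is kept.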
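PowerShell: How to List Database Roles and Their Members
# Load the Analysis Management Objects (AMO) assembly
[Reflection.Assembly]::LoadWithPartialName("Microsoft.AnalysisServices") | Out-Null
$ServerName = ".\sql2008r2"
$DB = "Adventure Works DW 2008"
# Connect to the Analysis Services instance
$Server = New-Object Microsoft.AnalysisServices.Server
$Server.Connect($ServerName)
# Item() looks the database up by its ID, which normally matches the name
$SSASDatabase = $Server.Databases.Item($DB)
# List each role together with its members
$SSASDatabase.Roles | Select Name, Members
$Server.Disconnect()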
Hardware Sizing a Tabular Solution (SQL Server Analysis Services)
Applies to: SQL Server 2012 Analysis Services, Tabular Solutions
Summary: Provides guidance for estimating the hardware requirements needed to support processing and query workloads for an Analysis Services tabular solution.
To review the document, please download the Hardware Sizing a Tabular Solution (SQL Server Analysis Services) Word document.
Cleared SSAS Maestro (MCM)
Finally achieved the highest certification in the SSAS world.
What is SSAS Maestros?
The SSAS Maestro program was created as a way to share the lessons learned by enterprise customers of the SQL Customer Advisory Team (SQLCAT) using complex SQL Server 2008 R2 Analysis Services and a Unified Dimensional Model (UDM).
Requirements
Because of the complexity of the subject matter and the depth of the lessons, the course has strict requirements, including the following:
- Applicants will be accepted based on their depth of technical experience with Analysis Services.
- The three-day course will include several labs that intentionally provide very little guidance.
- Upon completion of the course, attendees will be given a take-home exam project that they will need to complete within thirty (30) days.
Following the technical conference conventions defining 400-level sessions, this is a 500-level course. It is modeled after the MCM SQL certification program.
Additional Microsoft hosted SSAS Maestro training courses will be available in the near future. Check back on this site for updates. For questions or for a list of courses hosted by certified Maestro trainers, please email
SSAS Synchronization: Across different versions
Problem Statement: Synchronization between different builds or versions of Analysis Services.
Solution: The Synchronize Database Wizard makes two Microsoft SQL Server Analysis Services databases equivalent by copying the data and metadata from a source server to a destination server. This wizard can also be used to deploy a database from a staging server onto a production server, or to synchronize a database on a production server with changes made to the data and metadata in a database on a staging server.
Because the wizard copies the metadata as well as the data, synchronization between different builds is unsupported: you cannot synchronize across major versions or service packs.
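As a pre-flight check, you could compare the version strings of the source and destination instances before attempting a synchronization. The Python sketch below assumes you have already obtained each instance's dotted version string by whatever means you prefer. The function names and the sample version strings are illustrative only.

```python
def parse_version(version):
    """Split a dotted version string such as '10.50.1600.1' into integers."""
    return tuple(int(part) for part in version.strip().split("."))


def same_build(source_version, target_version):
    """Return True only when both instances report exactly the same build."""
    return parse_version(source_version) == parse_version(target_version)


# Illustrative version strings, not taken from a real environment
print(same_build("10.50.1600.1", "10.50.1600.1"))  # identical builds
print(same_build("10.50.1600.1", "10.50.2500.0"))  # different builds
```

The check is deliberately strict: it treats any difference in the version string as a mismatch, in line with the statement above that synchronization between different builds is unsupported.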