Use Hive to read data into Azure ML

Azure Machine Learning supports a range of data sources – primarily cloud data sources. One of the very interesting data sources is Hive, which supports the capability to write SQL-like queries against HDInsight clusters. In this post, I will show how you can configure the Reader component in an Azure ML experiment.

First, create a new experiment in Azure ML and drag a ‘Reader’ component onto the canvas. Then drag a ‘Metadata Editor’ component onto the canvas as well and connect the output of the Reader to the input of the Metadata Editor. Your experiment should look like this.

Settings for loading data into Azure ML using HiveQL



I will now walk you through each of the settings in the reader component.

Data Source
Here you select ‘Hive Query’, which enables you to specify a query written in the HiveQL language.

Hive Database Query
Specify your Hive query here. (See the HiveQL language reference here.)
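As a minimal sketch, a query against the sample table that ships with HDInsight clusters could look like the following (replace the table and column names with your own):

    SELECT devicemake, COUNT(*) AS devices
    FROM hivesampletable
    GROUP BY devicemake;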

HCatalog server URI
HCatalog is a table and storage management layer for Hadoop. HCatalog exposes a REST API named WebHCat where Hive queries can be queued and job status monitored. WebHCat was previously known as Templeton, so queries against WebHCat are based on the URI http://clustername.azurehdinsight.net/templeton/v1/xxx.
In this field, you should type in the URI of the HDInsight cluster.

Hadoop user account name
A valid user account for the HDInsight cluster, like ‘admin’.

Hadoop user account password
A valid password associated with the Hadoop user account.

Location of output data
When a Hive query has executed successfully, the result is stored. Based on your configuration, the result can be stored either in Azure Blob storage or in HDFS. In this example, I’m using Azure Blob storage.

Azure storage account name
Type in the name of the storage account used by the HDInsight cluster. If you are not sure of this setting, then look at the Dashboard page of the HDInsight cluster in the Azure Management portal. The storage account is listed under Linked resources.

Azure storage key
The storage key for the Azure Blob storage. This can be found under ‘Manage Access Keys’ on the storage account in the Azure Management portal.

Azure container name
Type in the name of the default container for the HDInsight cluster. The container is part of the Azure storage account.



If you want to rename the fields coming from the Hive query, use the Metadata Editor component.

That’s it – you are ready to make your HDInsight cluster crunch some data and load the result into Azure ML.


Data-driven Storytelling using Microsoft Power BI

Okay, be honest with me – how many hours have you spent the last week watching movies or series on television or reading novels? Your answer will probably exceed one hour and likely much more than that. So let us agree on one thing – stories are great! We love to relax hearing or watching stories. Even kids growing up with tablets, streaming television and 24-hour cable still love the old-fashioned bedtime story told by parents, grandparents etc. Why is that? I think there are multiple reasons, like the special bond created between reader and listener. More important, however, is the fact that the listener doesn’t just hear the story but experiences it. Imagination kick-starts our brain, and the neurons inside it act as if they are actually reliving the story. The brain loves fiction – the brain loves stories.

I think the community around Business Intelligence (BI) has for many years forgotten how powerful human storytelling is. BI is all about delivering insight based on vast amounts of data in a way that supports decision-making. We have to exploit data-driven storytelling much more actively in our projects and recognize the fact that storytelling is useful and powerful.

In this blog post, I will try to map the theoretical concepts of storytelling with Microsoft Power BI products.

Storytelling brings insight

My thesis is that storytelling brings insight and insight leads to better decisions. To support this thesis I have defined the following three headlines:

  • A story brings life and meaning into your data
  • Narrative visualization enhances your story
  • Structure makes your story understandable and leads toward the decision

A story brings life and meaning into your data

We are humans, not robots – yet. So when presented with lots of data and facts, it can be very hard for a human to digest. How should I interpret a dashboard with multiple values, different charts and KPIs? Is there a correlation between the values? Which measure should I look at first? What we need is a context based on the decision we need to make – a story can provide that context. A story will not only describe the result but also the processes leading to the result.

Where in the BI value chain should we consider using storytelling? The BI chain that I am referring to is based on the following flow: we take data and transform it into information, which gives the end user insight to make a decision that brings value. The leap from information (like charts and key measures) to insight (“Okay, I get it!”) seems easy when you look at a drawing of the BI chain. In real life, and with the vast amount of information available today, it can be a very difficult step to navigate. What makes that particular step so difficult? It is because you move from ‘machine’ to ‘human’, and we need to make the information digestible and understandable in a given context for a human brain. A story arranges facts and events in an order that establishes a connection between them in a logical and perhaps causal chain – a story that our brain can try to relive.

Narrative visualization enhances your story

Over the last years, visualization tools have evolved dramatically. The market offers many technologies for visualizing data in multiple ways. When you combine the discipline of storytelling with the possibilities of data visualization, you get a very strong and usable toolset – also known as narrative visualization. For many years, the Swedish professor in international health Hans Rosling has been a master of that discipline. His simple visualization of complex datasets using GapMinder, combined with great narrative skills, has made him known around the world – and sometimes referred to as an ‘edutainer’. If you haven’t had the chance to view some of his presentations, then take the time to do so and see how easily he guides you through huge datasets using storytelling.

Segel and Heer of Stanford University described in their paper “Narrative Visualization: Telling Stories with Data” different ways to understand narrative visualization. They divide the discipline into “author-driven” and “reader-driven” stories and describe seven different genres of narrative visualization. An author-driven story is one where the author controls the flow of the story and interactivity is limited. This could be a typical slideshow, a lecture or a delivered chart where the author has a clear message to communicate. A reader-driven story, on the other hand, is controlled by the reader and will typically have the purpose of data exploration using a visualization tool.

Power BI and storytelling

Many vendors offer tools for both author-driven and reader-driven storytelling, including Microsoft with their Power BI suite. Power Map is great for making visualizations for author-driven stories – primarily when your data is geographically based. As a free add-in to Excel 2013, it is very easy to attach the geographical canvas to a Power Pivot data model and visualize comprehensive amounts of data. The built-in MP4 movie export makes it easy to export the visualization and integrate it into a PowerPoint presentation.

With Power View you can make highly interactive dashboards based on Power Pivot models or enterprise-level models using SQL Server Analysis Services. Power View provides a series of different visuals ranging from simple text to charts and maps. All elements are interrelated, which makes cross-filtering possible: if you select an element like ‘year’, the other elements on the dashboard will filter on that year. Power View is an excellent tool for reader-driven stories, exposed through Excel, SharePoint or Power BI in the cloud.

Structure makes your story understandable and leads toward the decision

“Once upon a time…” – that is the first sentence in almost every fairy tale by H.C. Andersen. When talking about data-driven stories we also need structure, and here you can find great inspiration in the old fairy tales. They all start by setting the scene, move on to the main story and finish up with a close-down: “…and they lived happily ever after”. Author-driven stories can adopt this approach.

But how do you tell an author-driven story? One way is to just point the reader to a visualization tool like Power View, but an even better approach is to guide the reader a little bit – setting the scene. That approach requires that you combine the author-driven and reader-driven disciplines. One way to illustrate this is the Martini Glass model. The model illustrates that you start with a narrow, author-driven story introducing the subject. When the stem of the glass ends, we move into a reader-driven story that can move in any direction through data interaction.

Microsoft BI supports this scenario in multiple ways, but one elegant solution is to integrate Power View into a PowerPoint presentation. If your Power View sheet is exposed through a SharePoint server, you can export it as an interactive slide in the PowerPoint presentation. In that way, you can perform reader-driven storytelling without leaving your PowerPoint – and your context.

Structure is important and must be integrated into every story, both author-driven and reader-driven. However, the structure must be based on the decision scenario: which insight is necessary?

Conclusion

Storytelling is very powerful, and combined with modern visualization tools such as the ones in Microsoft Power BI, you can provide even better insight into your data. We are still humans with a complex brain that loves to understand context and tries to relive stories. So with that in mind, remember that stories need data, but the data also needs the stories.

 

References:

PowerPoint presentation (in Danish): Storytelling


Connect R to SQL Server 2012 and “14”

This post will demonstrate how to connect R to Microsoft SQL Server, so that data can be extracted directly from a database by using SQL-statements. The approach described in this post is supported by both SQL Server 2012 and the upcoming SQL Server “14”. You can connect to SQL Server using different techniques – one of them is by using ODBC. This post will use ODBC.

The first time you need to connect to a database, you need to perform some one-time tasks, which are:

  • Create an ODBC DSN data source
  • Install necessary R-packages from CRAN

The screenshot below shows a table containing the well-known data set ‘weather.nominal’. The table is part of a database named ‘MiningDataSets’. The goal of this tutorial is to load all this data into R.

SQLR00
Create DSN
First we need to set up a user DSN data source pointing at our SQL Server using ODBC. The data source will be called from R using the package “RODBC”.

  1. Open “Administrative Tools” and “ODBC Data Sources (32 bit)”

WEKASQL02

  2. On the tab “User DSN” press “Add”
  3. Select “SQL Server” in the provider list
  4. Now give the data source a name and a description. Remember this name – you will need it later in the process. Last, provide the name of the server where the SQL Server is installed. Press “Next”.

WEKASQL03

  5. Select the way in which you will authenticate against the SQL Server. In this case we use Integrated Security. Press “Next”.

WEKASQL04

  6. You now have the possibility to select the default database for the data source. Here we choose to point at ‘MiningDataSets’. Press “Next” and then “Finish”.

WEKASQL05

  7. On the last page remember to press the “Test Connection” button and ensure that the connection can be established.
  8. The User DSN is now created and active.

WEKASQL06


Install and load RODBC

Support for SQL Server is not available in base R – therefore we have to install the RODBC package from CRAN.

  1. Open up R and in the console window type: install.packages("RODBC")

    SQLR01

  2. The RODBC package is now downloaded and installed on your system. The next step is to load the package into R so that the functions of the package can be used. In the console window type: library("RODBC")

    SQLR02

  3. The RODBC package is now loaded and ready

Connect with R
Now it is time to connect to the SQL Server database from R and retrieve the nominal weather dataset.

  1. When calling the database you first have to open a connection to the database, and after you have performed your operations, you have to close the connection again. This is done by using the commands odbcConnect() and odbcClose(). The name specified as the parameter is the name of the ODBC user data source. (A complete sketch of the commands used in these steps is shown after the list.)

    SQLR03

  2. First, we read an entire table into R – this can be done using the sqlFetch command.

    SQLR04

  3. If you want to load the data into a data.frame, this can easily be done

    SQLR05

  4. If you want to execute a specific SQL-query, this can be done using the sqlQuery-command.

    SQLR06
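The screenshots above show the exact commands; a minimal, self-contained sketch could look like this (the DSN name 'MiningDataSets' and the table name 'weather_nominal' are assumptions – use the names from your own setup):

    library(RODBC)

    # open a connection through the user DSN created earlier
    conn <- odbcConnect("MiningDataSets")

    # read an entire table into a data frame
    weather <- sqlFetch(conn, "weather_nominal")

    # or run a specific SQL query
    sunny <- sqlQuery(conn, "SELECT * FROM weather_nominal WHERE outlook = 'sunny'")

    # close the connection when you are done
    odbcClose(conn)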

You can run many more SQL-statements against the database by using the different RODBC functions. You can get more help on the RODBC package by typing RShowDoc("RODBC", package="RODBC")

SQLR07

I hope this little tutorial helped you in connecting R to SQL Server 2012 / “14”.

If you are interested in using the data mining workbench WEKA together with SQL Server, then see my tutorial on that here.


Connect WEKA to SQL Server 2012 and “14”

This post will demonstrate how to connect the data mining workbench WEKA to Microsoft SQL Server. Data can then be loaded directly from a database by using SQL-statements. The approach described in this post is supported by both SQL Server 2012 and the upcoming SQL Server “14”.

The first time you need to connect to a database you need to perform some one-time tasks, which are:

  • Create an ODBC DSN data source
  • Configuration of WEKA to support Microsoft SQL Server

The screenshot below shows a table containing the well-known data set ‘weather.nominal’. The table is part of a database named ‘MiningDataSets’. The goal of this tutorial is to load all this data into WEKA using the shown SQL-query.

WEKASQL01

Create DSN
First we need to set up a user DSN data source pointing at our SQL Server using ODBC. The data source will be called from JDBC, which is Java’s component for calling databases. This setup is called an ODBC-JDBC bridge.

1. Open “Administrative Tools” and “ODBC Data Sources (32 bit)”

WEKASQL02

2. On the tab “User DSN” press “Add”

3. Select “SQL Server” in the provider list

4. Now give the data source a name and a description. Remember this name – you will need it later in the process. Last, provide the name of the server on which the SQL Server is installed. Press “Next”.

WEKASQL03

5. Select the authentication method against the SQL Server. In this case we use Integrated Security (Windows NT authentication). Press “Next”.

WEKASQL04

6. You now have the possibility to select the default database for the data source. Here we choose the database ‘MiningDataSets’. Press “Next” and then “Finish”.

WEKASQL05

7. On the last page remember to press the “Test Connection” button and ensure that the connection could be established successfully.

8. The User DSN is now created and active.

WEKASQL06


WEKA configuration

Support for SQL Server is not enabled by default when you install WEKA – therefore we have to set up WEKA to support it.

1. Go to C:\Program Files\WEKA-3-6\

2. Locate and un-zip the file weka.jar (You can use tools like 7-Zip for this)

WEKASQL07

3. After you have unzipped the .jar file, go into the folder C:\Program Files\WEKA-3-6\weka\weka\experiment\

4. Locate the file “DatabaseUtils.props.mssqlserver2005” and copy it to your home directory. This file contains information about the database setup, like how to convert data types.

WEKASQL08

5. Rename the file in the home directory to “DatabaseUtils.props”

WEKASQL09

6. Now open the file with e.g. WordPad. Here you have to specify how to convert between WEKA/Java data types and SQL Server data types. I have added some of the most used conversions in the screenshot below. An example is the row “bit=1”, which says ‘convert the SQL Server bit data type to a WEKA boolean’. (A sketch of typical entries is shown after these steps.)

WEKASQL10

7. Save and close the file
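As a rough sketch, the type-conversion entries in DatabaseUtils.props look like the lines below. The numbers on the right are WEKA’s internal type codes (the comments at the top of the file list them all), and the exact rows you need depend on the column types in your tables, so treat this as an illustration only:

    # SQL Server data type = WEKA/Java internal type code
    bit=1
    int=5
    float=2
    varchar=0
    datetime=8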


Connect with WEKA

Now it is time to connect to the SQL Server database from WEKA and retrieve the nominal weather dataset into the workbench.

1. Open WEKA Explorer

2. Press the “Open DB” button and you will get the window named “SQL-Viewer”

3. Press the “User…” button and specify the database URL as “jdbc:odbc:User DSN Name”, where ‘User DSN Name’ is the name you gave the data source earlier. Then press “OK”.

WEKASQL11

4. Press “Connect” and WEKA will connect to the database server. A status row is shown in the info textbox saying “connecting to: jdbc:odbc:MiningDatabase = true”

5. Now type your SQL-statement in the query textbox and press “Execute”. Data is now fetched from the SQL Server and the result is shown in the result window. If you are satisfied with the result, press “OK”.

WEKASQL12

6. Data is now loaded into the WEKA Explorer and you can start doing your data analysis and mining.

WEKASQL13

The next time you start up WEKA Explorer it is very easy to connect to the database again. Just press “Open DB” and then the “History” button. Here you can find the JDBC/ODBC strings used earlier. Select the one you want to use and connect to the database again.

I hope this little tutorial will help you – enjoy!

R is another tool for data analysis, so if you would like to know how to connect R to SQL Server, then read this post.


Spin up SQL Server 2014 CTP1 on Azure in 10 minutes

Microsoft SQL Server 2014 CTP 1 was released just a few days ago. CTP stands for Community Technology Preview and can be compared to a public beta version of the product. So you want to play around with this new piece of software – but how can you do that?

In this posting I will show you how you can get an instance of SQL Server 2014 CTP 1 up and running in approximately 10 minutes by using Microsoft Azure. The instance will be running as a virtual machine in the cloud. The great thing about this approach is that it is fast to get working, and you can easily delete the image when a new version is released.

When Microsoft released the CTP version, they also announced that a preconfigured image had been made available in Azure. So let’s get started and spin up this image to start testing the new product.

Create the Azure Virtual Machine
To start the process, go to http://manage.windowsazure.com and log in with your Azure account. If you don’t have an account yet, you can easily create one. You can either try a free trial, use your MSDN subscription to get access, or use your credit card to buy access immediately.

When you get to the Azure dashboard, select ‘Virtual Machines’ in the left menu and press ‘Create a virtual machine’.

001

Now do a quick create and select “SQL Server 2014 CTP1” as the predefined image. This will give you an installed and configured SQL Server. Also assign the virtual machine a DNS name, which will be the public name of the server, like servername.cloudapp.net.

You must also name the administrator account for the virtual image – remember this username and password. Last, press the ‘Create a virtual machine’ button.

002

 

Now the virtual machine is created on Azure. First the necessary storage is reserved for the virtual machine, next the virtual machine is provisioned, and last the machine is started. This process takes about 5 minutes. The following screenshots show some of the status messages displayed during provisioning.

003

004

 

007

The virtual machine is now ready for use, as the status shows ‘running’. This completes the virtual machine creation process. By pressing the name of the virtual machine, you can access performance dashboards and configuration options.

008

 

Connect to the server
To access the newly created virtual machine running Windows Server 2012 R2 and SQL Server 2014 CTP1 we will be using Remote Desktop (RDP). To get the RDP connection configuration, press the ‘Connect’ button at the bottom of the screen.

009

This will allow you to download the configuration for the RDP connection as a .rdp file, which can be opened with the Windows Remote Desktop application. (The screenshots are in Danish, but it is a typical Internet Explorer download pop-up.) Save the .rdp file to a local folder.

010

 

Now double-click the .rdp file and a typical authentication screen will appear. Type in the administrator credentials that you assigned when you created the virtual image.

012

Now you are connected to the virtual machine in the cloud and you will see the desktop of the server.

013

Create your first database using SQL Server 2014
To create a database we need to open SQL Server Management Studio (SSMS). To do that, press the Start button at the bottom. On the start screen that appears, press the little down-pointing arrow and you will get an overview of all the installed applications. Here you can also find other SQL Server related applications like the Data Quality Client, SQL Server Profiler etc. Click the SQL Server Management Studio icon.

034

 

Management Studio will now start up and you are asked to log in. Again, type in the credentials of the server administrator that you created earlier.

026

027

You are now looking at Management Studio for SQL Server 2014 CTP1. To write the query that creates your database, press the ‘New Query’ button.

030

In the query window that opens, type in this SQL-DDL statement and press F5.

031
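If the screenshot is hard to read, a minimal statement along these lines will do – the database name here is just an example:

    CREATE DATABASE DemoDB;
    GO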

Congratulations, you have now created a virtual machine in the cloud and created a database in SQL Server 2014 CTP1.

032

Now go and have fun with the product and follow the SQL Server Blog for articles on using the new features of this product.


Deployment of Master Data Service models

SQL Server Master Data Services (MDS) offers a command-line tool named MDSModelDeploy.exe that can be used to deploy MDS models between environments like development, test and production.

First of all, MDSModelDeploy.exe can be found under Program Files -> Microsoft SQL Server -> 110 -> Master Data Services -> Configuration.

The tool is very easy to use, but I have nevertheless experienced some confusion about the command-line tool’s options for deploying packages. The options are named ‘deploynew’ and ‘deployclone’.

To understand the difference between the two options you need to understand some internals of MDS models. Besides a name, an MDS model is identified by a MUID, which is an internal ID for a specific model. When you create a deployment package using the ‘createpackage’ option, the MUID is included in the package.

When deploying the deployment package to another server or instance of MDS you can use the ‘deployclone’ option. This will keep the original MUID from the source MDS instance. In that way, you keep a reference back to the original model and it will be possible to deploy additional updates to the model based on the original model.

Using the option ‘deploynew’ will ignore the MUID from the deployment package and generate a new one. In that way you can’t deploy additional updates to the model based on the original model. This can be useful in cases where you want to build a new model based on an existing one.

If you, by mistake, deploy a package using ‘deploynew’ and later try to deploy an update based on the original model, you will receive the error message “The xxx model cannot be updated. There is no match for a model with the same name and MUID”.

In the scenario below, I want to deploy a package from one environment to another. Afterwards I want the possibility to deploy additional updates based on the original model.

First, I want to create a deployment package for the model “Customer” based on the version named “Version_1”. Here I use the ‘createpackage’ option.

CreatePackage

A deployment package named “MDS_Customers.pkg” is now placed in C:\. I wanted the package to contain not only the model definition but also the master data. Therefore I specified the parameter –includedata.
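As a sketch, the full command looks roughly like this (model, version, path and the -includedata switch follow the example above; check the MSDN documentation linked at the bottom for the exact syntax):

    MDSModelDeploy createpackage -model "Customer" -version "Version_1" -package "C:\MDS_Customers.pkg" -includedata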

If you have multiple instances of MDS installed on a server, you need to specify the instance name in the –service parameter. If there is only one instance installed you can omit this parameter. Use the ‘listservices’ option to list all instances on a server.

Listservices

Next I want to deploy the package to a new MDS instance in a way that makes it possible to deploy additional updates afterward.

In the screenshot below I first use the ‘deployclone’ option and after that I deploy an update to the model using the ‘deployupdate’ option.

OkayDeployment
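Again as a sketch under the same assumptions, the clone and the subsequent update could be run like this:

    MDSModelDeploy deployclone -package "C:\MDS_Customers.pkg"
    MDSModelDeploy deployupdate -package "C:\MDS_Customers.pkg" -version "Version_1"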

To demonstrate that the ‘deploynew’ option breaks the reference back to the original model by assigning a new MUID, I have made a deliberately wrong example, shown in the screenshot below.

The ‘deployupdate’ fails because it can’t identify a model with the original MUID and name.
FailDeployment

If you want additional information about MDS model deployment, I have included a couple of useful links.

Read the MSDN documentation about MDSModelDeploy.exe here: http://msdn.microsoft.com/en-us/library/ff486956.aspx

Watch a video by Anand Subbaraj which goes into more details about MDS model deployment using different tools: http://msdn.microsoft.com/en-us/sqlserver/hh780972.aspx


Checklist for SSAS Tabular Model Development

The development of SSAS tabular models has been simplified dramatically compared to traditional SSAS multidimensional cube development, making time-to-delivery much shorter on corporate business intelligence projects. But tabular model development is still an advanced discipline which requires analytical, data modeling and technical skills. There are many details to remember in the development and therefore many things to forget! If the end-user experience in client tools like Power View and Excel is expected to be high, development must focus on the details.

I have defined this development checklist to ensure that most details are remembered and taken into consideration in every SSAS tabular project. This is not a tutorial in SSAS tabular modeling; if you are looking for that, please visit MSDN.

I will go into more details with all the bullets in future blog posts. I hope you find this list useful.

  • Project- and model configuration
    • Select query mode
    • Name the project and the cube
    • Rename the .bim-file
    • Select workspace server and retention mode
  • Create data model
    • Import data from data sources
    • Create relationships between tables
    • Handle role-playing dimensions
    • Rename or hide tables
    • Mark date tables
    • Delete unnecessary columns
    • Rename columns
    • Verify data category, format and types on columns
    • Reduce the number of tables
  • Define business logic
    • Define measures (see the DAX sketch after this checklist)
    • Set the ‘Summarize By’ property on measures
    • Define calculated columns
    • Define hierarchies
    • Define Key Performance Indicators
  • Client presentation optimization
    • Set format on date columns
    • Set format on numeric columns and measures
    • Hide columns
    • Fill out table and column descriptions
    • Define default field set on tables (Power View)
    • Configure table behaviour
      • Set row identifier column
      • Define rows to keep unique
      • Define default label
      • Define default image
  • Define perspectives
  • Security
    • Specify security roles
    • Define row filters
  • Deployment and processing
    • Define partitions
    • Specify processing option
    • Specify transactional deployment option
    • Post-deployment (performed in SSMS)
      • Add users or groups to security roles
      • Setup data source impersonation information
      • Fill out database description
      • Verify the storage location
  • Verify and validate the model
  • Setup processing job to load and process data
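For the ‘Define measures’ bullet above, a measure is simply a named DAX expression entered in the measure grid; a minimal sketch could look like this (the table and column names are made up for illustration):

    Total Sales := SUM ( FactSales[SalesAmount] )
    Sales LY := CALCULATE ( [Total Sales], SAMEPERIODLASTYEAR ( 'Date'[Date] ) )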

Download here: Checklist for SSAS Tabular Model Development
