JP Voogt - A Data Enthusiast's Ramblings

How to Mount Azure Storage to Azure Databricks

Introduction

I found multiple articles on how to mount an Azure Storage account in Azure Databricks, but most of them referred to using Azure Key Vault, which I did not want to set up at this time. So I decided to continue my quest, as I needed the storage account and container mounted in order to read *.sas7bdat files into a DataFrame. I also could not find clear instructions on how to use the Databricks CLI in Azure Databricks.

Prerequisites

  1. Azure Subscription
  2. Azure Storage Account with a Container ready
  3. Azure Databricks

Code

First we will need to generate a SAS Token, which we can achieve by following this guide by June Castillote here.

I prefer to create a new Scala notebook for the next part, so that I can save it and remember how I achieved this in the future. With the below script we start off by creating all the required variables for the actual mounting function. For a more complete guide, see the full article by Gauri Mahajan linked under External Links below.

val containerName = "<Enter Container Name Here>"
val storageAccountName = "<Enter Storage Account Here>"
val sas = "<Enter SAS Key Here>"
val config = "fs.azure.sas." + containerName + "." + storageAccountName + ".blob.core.windows.net"

Once all the parameters are set, we can go ahead and mount the storage container into Azure Databricks. Remember to change the mount name in the below script.

dbutils.fs.mount(
  source = "wasbs://" + containerName + "@" + storageAccountName + ".blob.core.windows.net/",
  mountPoint = "/mnt/<Enter Unique Mount Name Here>/",
  extraConfigs = Map(config -> sas))

Data Exploration

Once we have our Storage Account mounted, we can start exploring the data in these mounts. First off, let's look at all of our mapped mounts.

display(dbutils.fs.mounts())

If you want to have a look at what files are in your Storage Account, you can use the below script.

dbutils.fs.ls("/mnt/<Enter Unique Mount Name Here>")


And if for any reason you want to unmount the Storage Account you can use the below script.

dbutils.fs.unmount("/mnt/<Enter Unique Mount Name Here>/")

Comprehensive Guide

Disclaimer: Content is accurate at the time of publication, however updates and new additions happen frequently which could change the accuracy or relevance. Please keep this in mind when using my content as guidelines. Please always test in a testing or development environment, I do not accept any liability for damages caused by this content.

If you liked this post, you can share it with your followers or follow me on Twitter!

Azure Synapse Query Monitoring

Introduction

I have gotten so used to Adam Machanic’s sp_whoisactive that when we migrated our workloads to Azure Synapse I felt completely lost. So over time I started to compile my own little queries to inspect what is/was going on. They are far from perfect, but for me they were a great start. Please feel free to make changes to the project on GitHub.

Installation

You should be able to just open and run the sp_synapse_queries_combined.sql script; this will “deploy” two stored procedures:

  * sp_synapse_queries
  * sp_synapse_queries_deepdive

I wanted to combine both into one, but unfortunately Azure Synapse does not yet allow us to create stored procedure parameters with default values.

Clone or Download Here

Permissions Required

  * GRANT VIEW DATABASE STATE
  * GRANT EXECUTE ON sp_synapse_queries
  * GRANT EXECUTE ON sp_synapse_queries_deepdive
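
As a minimal sketch, assuming a database user named [reporting_user] (a placeholder; substitute your own user or role), the grants would look something like this:

-- [reporting_user] is a hypothetical principal; replace it with your own user or role
GRANT VIEW DATABASE STATE TO [reporting_user];
GRANT EXECUTE ON [dbo].[sp_synapse_queries] TO [reporting_user];
GRANT EXECUTE ON [dbo].[sp_synapse_queries_deepdive] TO [reporting_user];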

How to Use

So let's get started. All you need to do is run the below SQL command.

EXEC dbo.sp_synapse_queries;

OK great, what now? Well, now I explain (this will be shortened because I do not enjoy documentation). You will get the below output: the first result set is all the queries in the queue (YES, THAT MEANS THEY ARE NOT DOING ANYTHING), and the second result set is all the queries that are actually running at the moment.

You can also read Microsoft Docs on this, as some of the code came from there.
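
The procedures build on the standard Synapse monitoring DMVs described there. As a rough sketch (not the actual deployed code), the underlying query looks something like this:

-- Rough sketch of the monitoring DMVs the procedures are built on
SELECT  r.request_id,
        r.session_id,
        s.login_name,
        r.[status],
        r.submit_time,
        r.start_time,
        r.end_compile_time,
        r.end_time,
        r.[label],
        r.command,
        r.resource_class,
        s.app_name
FROM    sys.dm_pdw_exec_requests r
        JOIN sys.dm_pdw_exec_sessions s ON s.session_id = r.session_id
WHERE   r.[status] NOT IN ('Completed', 'Failed', 'Cancelled')
ORDER BY r.submit_time DESC;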

Column Name | JP’s Description
session_id | Unique session id for a query
request_id | Unique Query id in a session
login_name | Who is running this session
status | Is the query running or in a queue
running_time | How long in seconds has the query been running for
time_in_queue | How long was the query in a queue before it started executing
submit_time | When did you press F5
start_time | When did it actually start
end_compile_time | When did it finish creating the execution plan
end_time | When did your query finish
label | Query Label
command | SQL Command that is being executed
blocking_session_id | If populated something is blocking your query
app_name | Application Session is coming from
resource_class | Resource Class, you can read up on this
deepdive | Magic statement we will be explaining

Yes, I knew you would not be able to stop thinking about the magic command, so what is it? Well, essentially it is just another stored procedure, called sp_synapse_queries_deepdive, pre-populated with the basic deep dive parameters.

EXEC dbo.sp_synapse_queries_deepdive @request_id = 'QID10741526', @distributions = 0, @tempdb = 0

If you run it as is, you will get 2 more result sets as below.

  1. The Query Steps, a nice way of looking at what step your query is busy with
  2. Waits, what your query has locks on and what it could be waiting on

If you want to dive even deeper, you can set the @distributions parameter to 1; this will then produce the below.

  1. The Query Steps on all distribution Nodes
  2. The Data Movement steps your query is busy with

Now, when you are feeling adventurous, turn on the @tempdb parameter as well and you will see what your query is doing to TempDB on Synapse, as in the sketch below.
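
As a quick illustration (reusing the request id from the earlier example as a placeholder; substitute the QID of your own query), a deeper dive could look something like this:

-- 'QID10741526' is the placeholder request id from the example above
EXEC dbo.sp_synapse_queries_deepdive @request_id = 'QID10741526', @distributions = 1, @tempdb = 1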

GitHub Project

Disclaimer: Content is accurate at the time of publication, however updates and new additions happen frequently which could change the accuracy or relevance. Please keep this in mind when using my content as guidelines. Please always test in a testing or development environment, I do not accept any liability for damages caused by this content.

If you liked this post, you can share it with your followers or follow me on Twitter!

Simple Example of using sp_execute_external_script with Python

Introduction

Since the release of Microsoft SQL Server 2017, Microsoft has introduced the ability to execute Python code directly from within the SQL Server environment. Now my first question was: why? Well, over time I came to the conclusion that SQL Server is not always the best/only tool for the job, and this feature allows us to perform more complex and complete functions, for example string distance functions, in SQL Server. We perform many of these on a daily basis, like Jaro-Winkler, which works great except that I can do it even quicker in Python, as it is a RBAR operation in SQL.

You can read more on this in the Microsoft Docs.
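
Note that external scripts have to be enabled on the instance before sp_execute_external_script will run. Assuming Machine Learning Services with Python is installed, something like the below should do it:

-- Enable external script execution (a service restart may be required on some versions)
EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE;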

Code

Create Table for Demo

Let's start off by creating our input dataset. In most cases you would already have a dataset in mind for this; if that is the case you can skip the next 2 steps.

CREATE TABLE Employees
(
	Id INT IDENTITY(1,1) PRIMARY KEY,
	Firstname varchar(50),
	Surname varchar(50),
	EmailAddress varchar(50)
)

Insert Records into table

Insert some dummy data for processing.

INSERT INTO Employees VALUES ('JP','Voogt','jvoogt1@outlook.com')
INSERT INTO Employees VALUES ('John','Smith','john.smith@gmail.com')
INSERT INTO Employees VALUES ('Faf','De Klerk','faffie@webmail.com')

Python script

Note that I will pass my data (a SQL query) from SQL Server to Python as my_input_data, and use OutputDataSet as my final output. You will need to know what columns and data types you will be expecting back from Python.

In this example we will use pandas to take the first letter of the Firstname, combine it with the Surname, and create our OutputName column. This is just an easy example of how to perform string manipulation operations in Python.

DECLARE @NewScript NVARCHAR(MAX) 
SET @NewScript  = N'
import pandas as pd

#Read Data From Input @input_data_1 
df = my_input_data

#Perform String Manipulation
df["OutputName"] = df.apply(lambda x : x.Firstname[0] + " " + x.Surname, axis=1)

#Assign pandas.DataFrame to our specified OutputDataSet
OutputDataSet = df
'

EXEC sp_execute_external_script

This is now where it all comes together: we indicate that we want to use Python, provide the input script and the input data, and specify what the expected output will look like.

EXEC sp_execute_external_script
 @language = N'Python'
, @script = @NewScript
, @input_data_1 = N'SELECT * FROM Employees'
, @input_data_1_name = N'my_input_data'
WITH RESULT SETS ((Id int, Firstname varchar(50), Surname varchar(50), EmailAddress varchar(50), OutputName varchar(50)))

Results

[Screenshot: query results including the new OutputName column]

Conclusion

Disclaimer: Content is accurate at the time of publication, however updates and new additions happen frequently which could change the accuracy or relevance. Please keep this in mind when using my content as guidelines. Please always test in a testing or development environment, I do not accept any liability for damages caused by this content.

If you liked this post, you can share it with your followers or follow me on Twitter!

Read and Write Data From Azure Synapse to Azure Databricks

Introduction

Working with big datasets has introduced many more technologies for the average Data Analyst to learn and understand. I have recently started to work more in Azure Databricks to make our processes more optimized. For example, I do a lot of string distance calculations on my datasets, and SQL Server or even Azure Synapse is not always the best solution for this. In the past, we created these functions as CLRs and processed the data directly in SQL, but in time our data grew even more and this became harder and harder. Moving this workload to Azure Databricks saved us hours of processing time, but introduced a new technology barrier that we had to learn and overcome.

Below I go through the basic outline of what is required to load data from Azure Synapse to Azure Databricks and push down to Synapse again once done.

  1. Configure your BLOB storage access; this can be achieved in many other ways. Read more here
  2. Create the JDBC connection string and BLOB connection string
  3. Read the data from Azure Synapse into a Spark DataFrame using the spark.read function
  4. Write the transformed data back into Azure Synapse with spark.write


Code

CONFIGURE BLOB CREDENTIALS

spark.conf.set(
  "fs.azure.account.key.<BLOBSTORAGENAME>.blob.core.windows.net",
  "<ACCESSKEY>")

CONFIGURE JDBC AND BLOB PATH

jdbc = "jdbc:sqlserver://<YOURSERVERNAME>.database.windows.net:1433;database=<YOURDATABASENAME>;user=<SQLUSERNAME>@<YOURDATABASENAME>;password=<PASSWORD>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
blob = "wasbs://<BLOBCONTAINER>@<BLOBSTORAGENAME>.blob.core.windows.net/"

READ DATA FROM SYNAPSE INTO DATAFRAME

df = spark.read \
  .format("com.databricks.spark.sqldw") \
  .option("url", jdbc) \
  .option("tempDir", blob) \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("Query", "SELECT TOP 1000 * FROM <> ORDER BY NEWID()") \
  .load()

WRITE DATA FROM DATAFRAME BACK TO AZURE SYNAPSE

df.write \
  .format("com.databricks.spark.sqldw") \
  .option("url", jdbc) \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", "YOURTABLENAME") \
  .option("tempDir", blob) \
  .mode("overwrite") \
  .save()

Disclaimer: Content is accurate at the time of publication, however updates and new additions happen frequently which could change the accuracy or relevance. Please keep this in mind when using my content as guidelines. Please always test in a testing or development environment, I do not accept any liability for damages caused by this content.

If you liked this post, you can share it with your followers or follow me on Twitter!

T-SQL Removing Duplicate non-Identical Records

Introduction

I found that some of us repeat the same processes over and over again, where other people might never need to perform a task which I would consider basic, so I want to share with you how I approach removing duplicate records from my data. Now, if you are lucky enough that the entire row is a complete duplicate, then you can just go ahead and slap a DISTINCT in your query and your problem should be gone, as in the example below. You will have to be mindful of the limitations here, as DISTINCT does not support long text fields, as per the Microsoft Docs.
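
As a minimal sketch (using the [myTable] placeholder from the examples further down), the DISTINCT approach is simply:

-- Collapse fully identical rows; [myTable] is a placeholder table name
SELECT DISTINCT *
FROM   [myTable]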

I prefer to rather decide what makes my records distinct and act accordingly.

Remove Duplicate Records for Reporting

Description

When removing duplicates you will have to decide what makes your record unique; could it be a merchant id or even a social security number? I like to go with the CTE and ROW_NUMBER approach when removing duplicates from my datasets. The only drawback here is that, if you do not want the new column, you will have to specify the required columns when you select from the CTE.

Code

WITH CTE
AS
(
    SELECT  *
            , ROW_NUMBER() OVER(PARTITION BY [<WHAT COLUMN/S MAKE MY RECORD UNIQUE>] ORDER BY [<YOUR PREFERED ORDER>]) RN
    FROM    [myTable]
)
SELECT  *
FROM    CTE
WHERE   RN = 1

Remove Duplicate Records From a Table

Description

Sometimes you are not just removing duplicates for reporting purposes and you want to delete the duplicates from the actual table. SQL Server allows this awesome trick where you can write your logic in a CTE and then DELETE records based on the CTE logic, as per the below example. Here I delete every duplicate occurrence after the first, keeping one row per unique key.

Code

WITH CTE
AS
(
    SELECT  *
            , ROW_NUMBER() OVER(PARTITION BY [<WHAT COLUMN/S MAKE MY RECORD UNIQUE>] ORDER BY [<YOUR PREFERED ORDER>]) RN
    FROM    [myTable]
)
DELETE FROM CTE
WHERE   RN != 1

Disclaimer: Content is accurate at the time of publication, however updates and new additions happen frequently which could change the accuracy or relevance. Please keep this in mind when using my content as guidelines. Please always test in a testing or development environment, I do not accept any liability for damages caused by this content.

If you liked this post, you can share it with your followers or follow me on Twitter!