Databricks managed identity setup in ADF
This post shows how to quickly set up a managed identity for Databricks activities in Azure Data Factory (ADF), eliminating the need to manage credentials.
Managed identity
What is a managed identity?
Formerly known as Managed Service Identity (MSI), a managed identity is a service principal of a special type that may only be used with Azure resources.
For an overview of managed identities, see the Microsoft documentation: Managed identities for Azure resources | Microsoft Docs
Why use a managed identity
The following is a short list of benefits of using a managed identity:
- Simplify service principal lifecycle management
- Avoid the need to manage credentials
- Relatively quick to set up
Databricks managed identity setup
Since Databricks supports using Azure Active Directory tokens to authenticate to the REST API 2.0, we can set up Data Factory to use a system-assigned managed identity.
To follow along, it is assumed that the reader is familiar with setting up ADF linked services.
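To make the token-based authentication mentioned above more concrete, here is a minimal Python sketch of how a request to the Databricks REST API 2.0 could be assembled once an Azure Active Directory token is in hand. ADF does this for you behind the linked service, so this is only an illustration. The function name and the example workspace URL are hypothetical, and the two optional headers reflect my understanding of what Azure Databricks expects when the calling identity has a Contributor role on the workspace but has not been added as a workspace user.

```python
# Well-known Azure AD application ID of the Azure Databricks resource.
# A token meant for Databricks is requested with this ID as its audience.
DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"


def build_clusters_list_request(workspace_url, aad_token,
                                management_token=None,
                                workspace_resource_id=None):
    """Return (url, headers) for a GET on the clusters/list endpoint.

    Hypothetical helper for illustration only; it performs no network call.
    """
    # The AAD token is passed as a standard bearer token.
    headers = {"Authorization": f"Bearer {aad_token}"}

    # Assumption: an identity that is not yet a workspace user also sends an
    # Azure management token and the workspace's Azure resource ID.
    if management_token and workspace_resource_id:
        headers["X-Databricks-Azure-SP-Management-Token"] = management_token
        headers["X-Databricks-Azure-Workspace-Resource-Id"] = workspace_resource_id

    url = workspace_url.rstrip("/") + "/api/2.0/clusters/list"
    return url, headers


if __name__ == "__main__":
    # Placeholder values; a real token would come from the managed identity.
    url, headers = build_clusters_list_request(
        "https://adb-1234567890123456.7.azuredatabricks.net", "<aad-token>"
    )
    print(url)
    print(sorted(headers))
```

The returned URL and headers can be handed to any HTTP client. No credential is stored anywhere in this flow, which is the point of using a managed identity.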
1. Create the role assignment
If you attempt to set up a linked service to a Databricks workspace without the correct role assignment in place, it will fail. The following screenshot shows such a failure using an existing interactive cluster.
To grant the correct role assignment:
- Grant the “Contributor” role on the Databricks workspace to the managed identity.
- The managed identity in this instance has the same name as the Data Factory in which the Databricks linked service will be created.
The following diagram shows how to grant the “Contributor” role assignment via the Azure Portal.
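For readers who prefer to script this step rather than use the portal, the role assignment can also be expressed as a call to the Azure Resource Manager REST API. The sketch below only builds the request URL and body; it does not send anything. The function name and all IDs in the usage example are hypothetical placeholders, the API version is an assumption on my part, and the principal ID is assumed to be the object ID of the Data Factory's managed identity.

```python
import json
import uuid

# Built-in role definition ID of the Azure "Contributor" role.
CONTRIBUTOR_ROLE_ID = "b24988ac-6180-42a0-ab88-20f7382dd24c"


def build_role_assignment(subscription_id, workspace_resource_id,
                          principal_id, assignment_name=None):
    """Return (url, body) for a PUT that creates the role assignment.

    Hypothetical helper for illustration only; it performs no network call.
    """
    # Role assignments are named with a GUID; generate one if none is given.
    name = assignment_name or str(uuid.uuid4())

    # The assignment is scoped to the Databricks workspace resource.
    url = (
        f"https://management.azure.com{workspace_resource_id}"
        f"/providers/Microsoft.Authorization/roleAssignments/{name}"
        "?api-version=2022-04-01"  # assumed API version
    )
    body = {
        "properties": {
            "roleDefinitionId": (
                f"/subscriptions/{subscription_id}/providers/"
                f"Microsoft.Authorization/roleDefinitions/{CONTRIBUTOR_ROLE_ID}"
            ),
            # Object ID of the Data Factory's system-assigned managed identity.
            "principalId": principal_id,
            "principalType": "ServicePrincipal",
        }
    }
    return url, body


if __name__ == "__main__":
    sub = "00000000-0000-0000-0000-000000000000"  # placeholder subscription
    workspace = (
        f"/subscriptions/{sub}/resourceGroups/my-rg"
        "/providers/Microsoft.Databricks/workspaces/my-workspace"
    )
    url, body = build_role_assignment(
        sub, workspace, "11111111-1111-1111-1111-111111111111"
    )
    print(url)
    print(json.dumps(body, indent=2))
```

Whichever route is used, the outcome is the same as in the portal: the Data Factory's identity holds the Contributor role on the workspace.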
2. Create the linked service
a. In the Data Factory, navigate to the “Manage” pane and, under linked services, create a new linked service by choosing “Compute” and then “Azure Databricks”.
b. Select the Databricks workspace and the appropriate cluster type (I use an existing interactive cluster), and set the “authentication type” to “Managed service identity”.
c. Depending on your cluster type, you will see the appropriate cluster options. For example, if you are using an interactive cluster, that cluster should now appear in the “choose from existing clusters” option instead of the earlier “load failed” message.
d. Test the connection to validate that the Managed Identity based linked service connection has been configured correctly.
The following diagram shows the options described above.
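The options chosen in steps a to d end up in the linked service's JSON definition, which can be viewed in ADF's code view. As a rough sketch, the Python below builds what I would expect that definition to look like for an existing interactive cluster. The function name and all values in the usage example are hypothetical, and the property names are my assumption of the Azure Databricks linked service schema, so compare them against the JSON that ADF generates for you.

```python
import json


def build_msi_linked_service(name, workspace_url, workspace_resource_id,
                             existing_cluster_id):
    """Return a dict resembling an ADF Azure Databricks linked service
    that authenticates with the factory's managed identity.

    Hypothetical helper for illustration only.
    """
    return {
        "name": name,
        "properties": {
            "type": "AzureDatabricks",
            "typeProperties": {
                # URL of the Databricks workspace.
                "domain": workspace_url,
                # "MSI" selects managed identity authentication (assumed value).
                "authentication": "MSI",
                # Azure resource ID of the workspace, needed with MSI.
                "workspaceResourceId": workspace_resource_id,
                # ID of the existing interactive cluster to run on.
                "existingClusterId": existing_cluster_id,
            },
        },
    }


if __name__ == "__main__":
    linked_service = build_msi_linked_service(
        name="ls_databricks_msi",  # placeholder names and IDs throughout
        workspace_url="https://adb-1234567890123456.7.azuredatabricks.net",
        workspace_resource_id=(
            "/subscriptions/00000000-0000-0000-0000-000000000000"
            "/resourceGroups/my-rg/providers/Microsoft.Databricks"
            "/workspaces/my-workspace"
        ),
        existing_cluster_id="0000-000000-abcdefgh",
    )
    print(json.dumps(linked_service, indent=2))
```

Note that the definition contains no token or secret, only identifiers for the workspace and the cluster.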
Alternative setup
An alternative way of connecting would be through a personal access token (PAT).
1. The PAT is set up in Databricks and set to expire at a certain point in time.
2. The PAT can either be saved in Azure Key Vault (preferred) or added as a hard-coded value in the linked service (not preferred) for connectivity.
3. When the linked service is created, the important difference is that the authentication type is set to “access token”.
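For comparison, here is a sketch of what the PAT-based linked service definition might look like when the token is read from Key Vault, as recommended in step 2. The function name, the Key Vault linked service name and the secret name are hypothetical, and the property names are again my assumption of the ADF schema rather than a confirmed reference.

```python
import json


def build_pat_linked_service(name, workspace_url, existing_cluster_id,
                             key_vault_linked_service, secret_name):
    """Return a dict resembling an ADF Azure Databricks linked service
    that authenticates with a PAT stored in Azure Key Vault.

    Hypothetical helper for illustration only.
    """
    return {
        "name": name,
        "properties": {
            "type": "AzureDatabricks",
            "typeProperties": {
                "domain": workspace_url,
                # The PAT is referenced from Key Vault, not hard coded.
                "accessToken": {
                    "type": "AzureKeyVaultSecret",
                    "store": {
                        # Name of an existing Key Vault linked service in ADF.
                        "referenceName": key_vault_linked_service,
                        "type": "LinkedServiceReference",
                    },
                    "secretName": secret_name,
                },
                "existingClusterId": existing_cluster_id,
            },
        },
    }


if __name__ == "__main__":
    linked_service = build_pat_linked_service(
        name="ls_databricks_pat",  # placeholder names and IDs throughout
        workspace_url="https://adb-1234567890123456.7.azuredatabricks.net",
        existing_cluster_id="0000-000000-abcdefgh",
        key_vault_linked_service="ls_key_vault",
        secret_name="databricks-pat",
    )
    print(json.dumps(linked_service, indent=2))
```

Unlike the managed identity version, this definition depends on a secret that someone has to create, store and rotate before it expires.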
Conclusion
In summary, managed identity provides a more secure way of connecting to Databricks using Data Factory linked services.
In addition to the security benefits, using a managed identity to connect to resources, as described in this post, is slightly quicker to set up than more manual alternatives, is less cumbersome, and ultimately means less management overhead for developers and data engineers.
Resources
Managed identities for Azure resources | Microsoft Docs
July 2020 - Azure Databricks - Workspace | Microsoft Docs
Authenticate using Azure Active Directory tokens - Azure Databricks - Workspace | Microsoft Docs