May 25, 2021 - azure synapse-analytics

Decode Base64 with Azure Synapse Analytics its Serverless SQL

Xavier Geerinck

@XavierGeerinck

I wanted to analyze some data in Azure Synapse Analytics, Serverless SQL. This data was coming in from my IoT Hub that I created and was being routed towards an Azure Data Lake Storage account as small JSON files. However when trying to analyze this data, it appeared that IoT Hub shows the body as Base64 encoded and T-SQL couldn't analyze it.

Problem Statement

Normally, to convert base64 towards a string, we can use a built-in method where it converts this as XML. The following SQL statement can then be used:

SELECT
[JSON] = cast(cast('' as XML).value('xs:base64Binary(sql:column("StackOverflow.Base64JSON"))', 'varbinary(30)') as varchar(30))
FROM my_table

However, when executing this in Synapse Analytics, Serverless SQL we get the following error:

xml is not supported

Solution

To solve this, we need to execute the following steps:

  1. Create a User Database (we don't have rights on the master database)
  2. Set the correct active Database
  3. Create a Base64 Decoder
  4. Load our JSON data
  5. Decode our Column

Breaking this down

1. Create a User Database

CREATE DATABASE MY_DATABASE;

2. Setting the correct active database

USE MY_DATABASE;

3. Creating the Base64 Decoder

CREATE FUNCTION dbo.tvf_Base64Decoder( @encodedPayload VARCHAR(8000) )
RETURNS TABLE
AS
RETURN WITH cte8Rows AS
(
-- generating 4096 rows, here are the first 8
SELECT ones = 1
FROM
(
values (1),(1),(1),(1)
, (1),(1),(1),(1)
) AS tens(ones)
)
, cte64Rows AS
(
-- generating 4096 rows, here are the first 64
SELECT ones = 1
FROM cte8Rows AS eight1
CROSS JOIN cte8Rows AS eight2
)
, cteSextet as
(
-- a Base64 encoded string is comprised of 4-byte sextets, create an iterator of starting offsets for each sextet
SELECT top ((len(@encodedPayload) / 4))
setextOffset = ((ROW_NUMBER() OVER (ORDER BY (SELECT 1))) * 4) -3
FROM cte64Rows AS sixtyfour1
CROSS JOIN cte64Rows AS sixtyfour2
)
, cteMapping as
(
-- Base64 mapping table
SELECT Base64.[value]
, Base64.[char]
FROM
(
values ( 0, 'A'), ( 1, 'B'), ( 2, 'C'), ( 3, 'D'), ( 4, 'E'), ( 5, 'F'), ( 6, 'G'), ( 7, 'H'), ( 8, 'I'), ( 9, 'J'), (10, 'K'), (11, 'L'), (12, 'M'), (13, 'N'), (14, 'O'), (15, 'P')
, (16, 'Q'), (17, 'R'), (18, 'S'), (19, 'T'), (20, 'U'), (21, 'V'), (22, 'W'), (23, 'X'), (24, 'Y'), (25, 'Z'), (26, 'a'), (27, 'b'), (28, 'c'), (29, 'd'), (30, 'e'), (31, 'f')
, (32, 'g'), (33, 'h'), (34, 'i'), (35, 'j'), (36, 'k'), (37, 'l'), (38, 'm'), (39, 'n'), (40, 'o'), (41, 'p'), (42, 'q'), (43, 'r'), (44, 's'), (45, 't'), (46, 'u'), (47, 'v')
, (48, 'w'), (49, 'x'), (50, 'y'), (51, 'z'), (52, '0'), (53, '1'), (54, '2'), (55, '3'), (56, '4'), (57, '5'), (58, '6'), (59, '7'), (60, '8'), (61, '9'), (62, '+'), (63, '/')
) AS Base64([value], [char])
)
, cteDecode as
(
-- the fun stuff! (see explaination above)
SELECT chars = concat(
char(
cast(cast(((SELECT m.[value] FROM cteMapping AS m WHERE m.[char] = SUBSTRING(@encodedPayload, s.setextOffset , 1) collate Latin1_General_CS_AS) * 4) AS BINARY(1)) AS TINYINT)
|
cast(cast(((SELECT m.[value] FROM cteMapping AS m WHERE m.[char] = SUBSTRING(@encodedPayload, s.setextOffset +1, 1) collate Latin1_General_CS_AS) / 16) AS BINARY(1)) AS TINYINT)
)
, char(
cast(cast(((SELECT m.[value] FROM cteMapping AS m WHERE m.[char] = SUBSTRING(@encodedPayload, s.setextOffset +1, 1) collate Latin1_General_CS_AS) * 16) AS BINARY(1)) AS TINYINT)
|
cast(cast(((SELECT m.[value] FROM cteMapping AS m WHERE m.[char] = SUBSTRING(@encodedPayload, s.setextOffset +2, 1) collate Latin1_General_CS_AS) / 4) AS BINARY(1)) AS TINYINT)
)
, char(
cast(cast(((SELECT m.[value] FROM cteMapping AS m WHERE m.[char] = SUBSTRING(@encodedPayload, s.setextOffset +2, 1) collate Latin1_General_CS_AS) * 64) AS BINARY(1)) AS TINYINT)
|
cast(cast(((SELECT m.[value] FROM cteMapping AS m WHERE m.[char] = SUBSTRING(@encodedPayload, s.setextOffset +3, 1) collate Latin1_General_CS_AS) / 1) AS BINARY(1)) AS TINYINT)
)
)
FROM cteSextet AS s
)
SELECT decodedPayload = string_agg(chars, '')
FROM cteDecode;

4. Loading our JSON Data

SELECT
JSON_VALUE(jsonContent, '$.Body') AS jsonBody
FROM
OPENROWSET(
BULK 'https://myadlsaccount.dfs.core.windows.net/myadlscontainer/myadlsfolder/*/*/*/*/*/*.json',
FORMAT = 'CSV',
FIELDQUOTE = '0x0b',
FIELDTERMINATOR ='0x0b',
ROWTERMINATOR = '0x0b'
)
WITH (
jsonContent varchar(MAX)
) AS [result]

5. Decoding the Column

WITH decoder AS (
SELECT
JSON_VALUE(jsonContent, '$.Body') AS jsonBody
FROM
OPENROWSET(
BULK 'https://myadlsaccount.dfs.core.windows.net/myadlscontainer/myadlsfolder/*/*/*/*/*/*.json',
FORMAT = 'CSV',
FIELDQUOTE = '0x0b',
FIELDTERMINATOR ='0x0b',
ROWTERMINATOR = '0x0b'
)
WITH (
jsonContent varchar(MAX)
) AS [result]
)
SELECT
decoded = (SELECT decodedPayload FROM dbo.tvf_Base64Decoder(jsonBody))
FROM decoder

Conclusion

When you now combine those 5 steps into one SQL file, you will get a decoded JSON string containing your IoT Hub Body!

⚠ I personally would recommend utilizing Spark, seeing that this method works, but is not the fastest!

Did you enjoy reading? Or do you want to stay up-to-date of new Articles?

Consider sponsoring me or providing feedback so I can continue creating high-quality articles!

Xavier Geerinck © 2020

Twitter - LinkedIn