Access your Dataset from Blob Storage directly

Sometimes while working with Azure ML you don't want to use the "Dataset" feature, but rather your own Blob Storage account that hosts your dataset.

Luckily for us, Microsoft created a Linux virtual filesystem implementation based on FUSE that translates filesystem calls straight into calls to our Azure Blob Storage account. Combining this with Azure ML's ability to run custom Docker containers, we should be able to run BlobFuse! So, let's try this out.

At the time of writing, the latest implementation of BlobFuse is version 2. Although it is still in preview, I wanted to start using it already since this is for testing purposes.

Directory Structure

First off, start by creating the following file structure (or check it out from my GitHub repository):

Dockerfile # Container Definition
build.sh   # Easily build the container with ./build.sh
docker/
  azure-blobfuse-config.yaml # Configuration for our mount point
  azure-blobfuse-mount.sh    # Mount our blob storage
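
The contents of build.sh are not spelled out in this post; a minimal sketch of what it could contain is shown below (the azure-blobfuse image name is just an assumption to match the docker commands further down):

#!/bin/bash
# Sketch of build.sh: build the container image from the Dockerfile in this directory.
# The "azure-blobfuse" tag is an assumed name, matching the docker commands used later on.
set -euo pipefail

docker build -t azure-blobfuse -f Dockerfile .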

Creating our Docker Container

Now that we have all of that ready, let's create a container with the following Dockerfile:

# https://hub.docker.com/r/nvidia/cuda/tags?page=1&name=20.04
FROM nvidia/cuda:11.7.1-base-ubuntu20.04

RUN apt update \
    && apt install -y wget curl software-properties-common apt-utils

# ======================================================================
# Provide BlobFuse v2 (https://docs.microsoft.com/en-us/azure/storage/blobs/blobfuse2-how-to-deploy)
# this will translate calls from Linux Filesystem to Azure Blob Storage 
# beside installing it, we need to perform 3 base actions
# - Configure a temporary path for caching or streaming
# - Create an empty directory for mounting the blob container
# - Authorize access to your storage account
# ======================================================================
# https://github.com/Azure/azure-storage-fuse/releases/download/blobfuse2-2.0.0-preview2/blobfuse2-2.0.0-preview.2-ubuntu-20.04-x86-64.deb
RUN apt install -y libfuse3-dev fuse3 \
    && wget https://github.com/Azure/azure-storage-fuse/releases/download/blobfuse2-2.0.0-preview2/blobfuse2-2.0.0-preview.2-ubuntu-20.04-x86-64.deb \
    && apt install -y ./blobfuse2-2.0.0-preview.2-ubuntu-20.04-x86-64.deb

# Authorize access to your storage account
# https://github.com/Azure/azure-storage-fuse/blob/main/sampleFileCacheConfig.yaml
ADD docker/azure-blobfuse-config.yaml /docker/azure-blobfuse-config.yaml

# ======================================================================
# Other
# ======================================================================
# ...

# ======================================================================
# Configure Scripts
# ======================================================================
COPY docker/ /docker
RUN chmod 755 /docker/*.sh

# ======================================================================
# Configure Other
# ======================================================================
ADD ./ ./

ENTRYPOINT ["/docker/azure-blobfuse-mount.sh"]

Besides creating this file, we also need to add the content to our `docker/azure-blobfuse-mount.sh` and `docker/azure-blobfuse-config.yaml` files.

docker/azure-blobfuse-config.yaml

allow-other: true

logging:
  type: syslog
  level: log_debug

components:
  - libfuse
  - file_cache
  - attr_cache
  - azstorage

libfuse:
  attribute-expiration-sec: 120
  entry-expiration-sec: 120
  negative-entry-expiration-sec: 240

file_cache:
  path: /tmp/blobfuse
  timeout-sec: 120
#   max-size-mb: 4096

attr_cache:
  timeout-sec: 7200

azstorage:
  type: block
  endpoint: YOUR_ACCOUNT_NAME.blob.core.windows.net
  account-name: YOUR_ACCOUNT_NAME
  account-key: YOUR_ACCOUNT_KEY
  mode: key
  container: YOUR_CONTAINER_NAME
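
Since this configuration file contains the account key, you probably don't want to commit the real values; one option is to keep the placeholders in the repository and substitute them just before building. A rough sketch (the STORAGE_* environment variable names are my own assumptions, not something blobfuse2 reads itself):

# Sketch: replace the placeholders from environment variables before building the image.
# STORAGE_ACCOUNT_NAME, STORAGE_ACCOUNT_KEY and STORAGE_CONTAINER_NAME are assumed variable names.
sed -i \
    -e "s|YOUR_ACCOUNT_NAME|${STORAGE_ACCOUNT_NAME}|g" \
    -e "s|YOUR_ACCOUNT_KEY|${STORAGE_ACCOUNT_KEY}|g" \
    -e "s|YOUR_CONTAINER_NAME|${STORAGE_CONTAINER_NAME}|g" \
    docker/azure-blobfuse-config.yaml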

docker/azure-blobfuse-mount.sh

The mount file will take care of creating our temporary directory and mounting our blob storage on /root/azurestorage.

#!/bin/bash
set -euo pipefail
set -o errtrace
IFS=$'\n\t'

# Configure a temporary path for caching or streaming
# finally we create an empty directory for mounting the blob container (/root/azurestorage)
mkdir -p /tmp/blobfuse \
    && chown root /tmp/blobfuse \
    && mkdir -p /root/azurestorage

# Authorize access to your storage account and mount our blobstore
# Example: https://github.com/Azure/azure-storage-fuse/blob/main/sampleFileCacheConfig.yaml
# Full Config: https://github.com/Azure/azure-storage-fuse/blob/main/setup/baseConfig.yaml
blobfuse2 mount /root/azurestorage --config-file=/docker/azure-blobfuse-config.yaml

# run the command passed to us
exec "$@"

Finalizing

We can now build and run our container with the commands below.

Important to note is that we are running with --cap-add=SYS_ADMIN --device=/dev/fuse --security-opt apparmor:unconfined, which allows us to use the FUSE library inside Docker!
docker build -t azure-blobfuse -f Dockerfile .
docker run -it --rm --cap-add=SYS_ADMIN --device=/dev/fuse --security-opt apparmor:unconfined azure-blobfuse /bin/bash

This will result in us being able to view the /root/azurestorage directory.
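
If you want to double-check that the mount actually succeeded, a quick sanity check from inside the container could look like this (just a sketch; the listed blobs obviously depend on your own container):

# verify that blobfuse2 mounted the container on /root/azurestorage
mount | grep azurestorage
ls -la /root/azurestorage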

Note: If you, like me, get an error while trying to access a file directory, then you should create a BlobFuse-specific metadata file; see this post for more information.