Access your Dataset from Blob Storage directly
Sometimes, while working with Azure ML, you don't want to use the "Dataset" feature directly but would rather use your own Blob Storage account that hosts your dataset.
Luckily for us, Microsoft created BlobFuse, a virtual filesystem driver for Linux built on FUSE, which translates filesystem calls straight into requests to our Azure Blob Storage account. Combining this with Azure ML's ability to run custom Docker containers, we should be able to run BlobFuse! So, let's try this out.
At the time of writing, the latest implementation of BlobFuse is version 2. Although it is still in preview, I wanted to start using it already, since this is for testing purposes.
Directory Structure
First off, start by creating the following file structure (or check it out from my GitHub repository):
Dockerfile # Container Definition
build.sh # Easily build the container with ./build.sh
docker/
    azure-blobfuse-config.yaml # Configuration for our mount point
    azure-blobfuse-mount.sh # Mount our blob storage
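If you are not checking the repository out, the layout above can be scaffolded with a few shell commands. This is only a sketch: the `PROJECT_DIR` variable and its default value are my own assumptions, not something from the original setup.

```shell
#!/bin/bash
set -eu

# Hypothetical project directory; point this wherever you keep the code
PROJECT_DIR="${PROJECT_DIR:-./azure-blobfuse}"

# Create the docker/ subdirectory (and the project directory itself)
mkdir -p "$PROJECT_DIR/docker"

# Create empty placeholder files to fill in during the next steps
touch "$PROJECT_DIR/Dockerfile" \
      "$PROJECT_DIR/build.sh" \
      "$PROJECT_DIR/docker/azure-blobfuse-config.yaml" \
      "$PROJECT_DIR/docker/azure-blobfuse-mount.sh"

# The two scripts need the executable bit
chmod +x "$PROJECT_DIR/build.sh" "$PROJECT_DIR/docker/azure-blobfuse-mount.sh"
```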
Creating our Docker Container
Now that we have all that ready, let's create a container with the following Dockerfile:
# https://hub.docker.com/r/nvidia/cuda/tags?page=1&name=20.04
FROM nvidia/cuda:11.7.1-base-ubuntu20.04
RUN apt update \
    && apt install -y wget curl software-properties-common apt-utils
# ======================================================================
# Provide BlobFuse v2 (https://docs.microsoft.com/en-us/azure/storage/blobs/blobfuse2-how-to-deploy)
# this will translate calls from Linux Filesystem to Azure Blob Storage
# beside installing it, we need to perform 3 base actions
# - Configure a temporary path for caching or streaming
# - Create an empty directory for mounting the blob container
# - Authorize access to your storage account
# ======================================================================
# https://github.com/Azure/azure-storage-fuse/releases/download/blobfuse2-2.0.0-preview2/blobfuse2-2.0.0-preview.2-ubuntu-20.04-x86-64.deb
# -y is required on every install: docker build cannot answer apt's confirmation prompt
RUN apt install -y libfuse3-dev fuse3 \
    && wget https://github.com/Azure/azure-storage-fuse/releases/download/blobfuse2-2.0.0-preview2/blobfuse2-2.0.0-preview.2-ubuntu-20.04-x86-64.deb \
    && apt install -y ./blobfuse2-2.0.0-preview.2-ubuntu-20.04-x86-64.deb
# Authorize access to your storage account
# https://github.com/Azure/azure-storage-fuse/blob/main/sampleFileCacheConfig.yaml
ADD docker/azure-blobfuse-config.yaml docker/azure-blobfuse-config.yaml
# ======================================================================
# Other
# ======================================================================
# ...
# ======================================================================
# Configure Scripts
# ======================================================================
COPY docker/ /docker
RUN chmod 755 /docker/*.sh
# ======================================================================
# Configure Other
# ======================================================================
ADD ./ ./
ENTRYPOINT ["/docker/azure-blobfuse-mount.sh"]
Besides creating this file, we also need to add the content of our `docker/azure-blobfuse-mount.sh` and `docker/azure-blobfuse-config.yaml` files.
docker/azure-blobfuse-config.yaml
allow-other: true
logging:
  type: syslog
  level: log_debug
components:
  - libfuse
  - file_cache
  - attr_cache
  - azstorage
libfuse:
  attribute-expiration-sec: 120
  entry-expiration-sec: 120
  negative-entry-expiration-sec: 240
file_cache:
  path: /tmp/blobfuse
  timeout-sec: 120
  # max-size-mb: 4096
attr_cache:
  timeout-sec: 7200
azstorage:
  type: block
  endpoint: YOUR_ACCOUNT_NAME.blob.core.windows.net
  account-name: YOUR_ACCOUNT_NAME
  account-key: YOUR_ACCOUNT_KEY
  mode: key
  container: YOUR_CONTAINER_NAME
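If you would rather not bake the account key into the image, one option is to keep the `YOUR_...` placeholders in the file and fill them in when the container starts. Below is a minimal sketch of that idea; the function name `render_blobfuse_config` and the three `AZURE_STORAGE_*` environment variable names are hypothetical choices of mine, not part of BlobFuse.

```shell
#!/bin/bash

# Render a config file by replacing the YOUR_... placeholders with values
# taken from (hypothetical) environment variables.
# Usage: render_blobfuse_config <template-file> <output-file>
render_blobfuse_config() {
  template="$1"
  output="$2"

  # Abort with a message if any of the required variables is missing
  : "${AZURE_STORAGE_ACCOUNT:?AZURE_STORAGE_ACCOUNT must be set}"
  : "${AZURE_STORAGE_KEY:?AZURE_STORAGE_KEY must be set}"
  : "${AZURE_STORAGE_CONTAINER:?AZURE_STORAGE_CONTAINER must be set}"

  # '|' is used as the sed delimiter because base64 keys can contain '/'
  sed -e "s|YOUR_ACCOUNT_NAME|${AZURE_STORAGE_ACCOUNT}|g" \
      -e "s|YOUR_ACCOUNT_KEY|${AZURE_STORAGE_KEY}|g" \
      -e "s|YOUR_CONTAINER_NAME|${AZURE_STORAGE_CONTAINER}|g" \
      "$template" > "$output"

  # The rendered file contains a secret, so restrict it to the owner
  chmod 600 "$output"
}
```

The mount script could call this just before `blobfuse2 mount`, with the values passed in through `docker run -e`, for example `render_blobfuse_config /docker/azure-blobfuse-config.yaml /tmp/blobfuse-config.yaml`, and then point `--config-file` at the rendered file.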
docker/azure-blobfuse-mount.sh
The mount script takes care of creating our temporary directory and mounting our blob storage on /root/azurestorage.
#!/bin/bash
set -euo pipefail
set -o errexit
set -o errtrace
IFS=$'\n\t'
# Configure the temporary path used for caching or streaming,
# then create an empty directory for mounting the blob container (/root/azurestorage)
# mkdir -p keeps the script from failing when the cache directory already exists
mkdir -p /tmp/blobfuse \
    && chown root /tmp/blobfuse \
    && mkdir -p /root/azurestorage
# Authorize access to your storage account and mount our blobstore
# Example: https://github.com/Azure/azure-storage-fuse/blob/main/sampleFileCacheConfig.yaml
# Full Config: https://github.com/Azure/azure-storage-fuse/blob/main/setup/baseConfig.yaml
blobfuse2 mount /root/azurestorage --config-file=/docker/azure-blobfuse-config.yaml
# run the command passed to us
exec "$@"
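It can be handy to confirm that the mount actually came up before handing control to the command passed to the container. Here is a small sketch of such a check, using `mountpoint` from util-linux; the helper name `wait_for_mount` and the retry count are my own assumptions, and whether a retry loop is needed at all depends on how quickly `blobfuse2 mount` makes the mount available in your setup.

```shell
#!/bin/bash

# Poll until the given directory is a mount point, or give up.
# Usage: wait_for_mount <directory> [attempts]
wait_for_mount() {
  mount_dir="$1"
  attempts="${2:-10}"   # hypothetical default: try 10 times, 1 second apart
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # mountpoint -q exits 0 only when the directory is a mount point
    if mountpoint -q "$mount_dir"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "ERROR: $mount_dir is not a mount point after $attempts attempts" >&2
  return 1
}
```

In the mount script this would go between the `blobfuse2 mount` line and the final `exec`, for example `wait_for_mount /root/azurestorage`. With `set -e` active, a failed check stops the container instead of silently running against an empty directory.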
Finalizing
We can now build and run our container with the commands below.
It is important to note that we are running with --cap-add=SYS_ADMIN --device=/dev/fuse --security-opt apparmor:unconfined, which allows us to use the FUSE library inside Docker!
docker build -t azure-blobfuse -f Dockerfile .
docker run -it --rm --cap-add=SYS_ADMIN --device=/dev/fuse --security-opt apparmor:unconfined azure-blobfuse /bin/bash
This lets us view the contents of the /root/azurestorage directory from inside the container.
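The directory structure lists a build.sh whose contents are not shown here. A minimal sketch of what it could contain, wrapping the same build and run commands, is given below; the function names and the `IMAGE_NAME` and `FUSE_FLAGS` variables are my own assumptions.

```shell
#!/bin/bash
# build.sh: hypothetical contents, wrapping the docker commands from this post
set -eu

# Image tag; can be overridden from the environment
IMAGE_NAME="${IMAGE_NAME:-azure-blobfuse}"

# Flags that allow FUSE to work inside the container
FUSE_FLAGS="--cap-add=SYS_ADMIN --device=/dev/fuse --security-opt apparmor:unconfined"

# Build the image from the Dockerfile in the current directory
build_image() {
  docker build -t "$IMAGE_NAME" -f Dockerfile .
}

# Run the image; any arguments become the command executed after mounting
run_image() {
  # FUSE_FLAGS is left unquoted on purpose so it splits into separate arguments
  docker run -it --rm $FUSE_FLAGS "$IMAGE_NAME" "$@"
}
```

Adding `build_image` as the last line of the file makes `./build.sh` build the container, and `run_image /bin/bash` afterwards drops you into a shell with the storage mounted.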
Note: If, like me, you get an error while trying to access a directory, you should create a BlobFuse-specific metadata file; see this post for more information.