Working on GenomeDK

GenomeDK

GenomeDK is the name of Aarhus University’s supercomputer cluster. It is a high-performance computing (HPC) system that is ISO 27001 certified and approved to store sensitive data in compliance with GDPR and the Danish Data Protection Act.

In the Seedcase context, we use it as an example of how Sprout can be used to import, organise, and store data on a remote cluster. It will also be used by the two projects that we are currently working with, DP-Next and ON-LiMiT.

Access

People associated with a university, the public sector (government agencies, regions, and municipalities), or small and medium-sized enterprises (SMEs) can request an account on GenomeDK via the request account portal. For now, we need to select the “Open” zone. State your preferred username and password, even though the form seems to ask for them only if you want to re-activate an old account.

Setting up

Once you are approved for access, you can work through the documentation, which is extensive and quite detailed. Because GenomeDK is a Linux-based server, you set it up, access it, and perform most actions through a shell terminal. You will also have to set up two-factor authentication using an authenticator app (GenomeDK doesn’t stipulate a particular one).

Connecting

SSH

You can connect to your GenomeDK environment from the terminal on your local machine. Follow the instructions on GenomeDK’s website.

Desktop

If you prefer a graphical interface, you can also access GenomeDK through their virtual desktop. Follow the instructions on GenomeDK’s website.

Mounting files

If you want to view and edit the files stored on GenomeDK using applications installed on your local machine, you can mount your GenomeDK file system locally. Follow the instructions on GenomeDK’s website.

Installing software tools

To develop Seedcase projects on GenomeDK, you will need to install uv. If you want to render markdown documents, you will also need to install Quarto. Finally, if you want to run recipes in the justfile, you will need to install just.

Installing uv

Log in to GenomeDK (either through SSH or the desktop interface) and open a terminal window. Following uv’s installation guide, run:

Terminal
curl -LsSf https://astral.sh/uv/install.sh | sh

Installing Quarto

Log in to GenomeDK (either through SSH or the desktop interface) and open a terminal window. Follow the instructions on Quarto’s website.

Notes:

  • Step 3: if ~/.local/bin exists, you don’t need to run mkdir ~/.local/bin
  • Step 4: as the instructions say, you can probably skip this step
  • If VS Code is open, you’ll need to restart it to detect the Quarto CLI

Installing just

Log in to GenomeDK (either through SSH or the desktop interface) and open a terminal window. Following just’s installation guide, run:

Terminal
curl --proto '=https' --tlsv1.2 -sSf https://just.systems/install.sh | bash -s -- --to ~/bin
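
After installing, it can help to confirm that the three tools are visible from your shell. The following is a minimal sketch, assuming that `~/bin` (the `--to` target used above) may not yet be on your `PATH`. The loop and its messages are illustrative and not part of any tool's installation guide.

```shell
# Add ~/bin to PATH for this shell session if it is not there already
case ":$PATH:" in
  *":$HOME/bin:"*) ;;
  *) export PATH="$HOME/bin:$PATH" ;;
esac

# Report which of the three tools can be found on PATH
for tool in uv quarto just; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found at $(command -v "$tool")"
  else
    echo "$tool: not found on PATH"
  fi
done
```

If a tool is reported as not found, opening a new terminal (so that your shell profile is re-read) is usually the first thing to try.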

Developing with VS Code

The best way to work on Seedcase projects in VS Code is to use the VS Code application on your local machine and connect to GenomeDK via SSH. This allows you to edit files and run code on GenomeDK as if you were working locally. This setup uses the Remote SSH VS Code extension, which works by adding a .vscode-server folder in your home directory on GenomeDK. This is a lightweight backend that allows VS Code to run terminals, extensions, debuggers, etc., remotely on GenomeDK. If at any point you want to remove the backend, you can simply delete this folder.

  1. Open VS Code on your local machine and install the Remote SSH VS Code extension

  2. Select “Remote-SSH: Connect to Host…” from the Command Palette (F1 or Ctrl+Shift+P) and enter <username>@login.genome.au.dk. Follow the prompts to update or create your SSH config file. In the “Remote Explorer” panel, you should now see the GenomeDK host listed as login.genome.au.dk.

  3. Click the GenomeDK SSH entry and enter your GenomeDK password when prompted. Now, you should be connected to GenomeDK and see the remote file system if you open up the VS Code terminal.

  4. Clone your repository using the VS Code terminal, if you haven’t already:

    Terminal
    cd <.../my/projects>
    git clone <repo-url>
    cd <repo-name>

  5. Open your project folder in VS Code by clicking the “Open Folder” button in the panel on the left-hand side

  6. Install your project dependencies (you’ll need uv installed):

    Terminal
    uv sync

  7. Install any VS Code extensions needed for development (e.g. the Python language extension). To list recommended extensions, open the Extensions panel and type @recommended into the search bar. You can install extensions from here.

  8. To be able to commit to GitHub, you will need to set up your Git credentials. You can do this by running the following commands in the terminal:

    Terminal
    git config --global user.name "Your Name"
    git config --global user.email "you@example.com"

Now, you should be able to run scripts in your project using either the VS Code terminal or the Run/Debug UI, as well as push changes to GitHub.
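
For step 2 above, the prompts add a host entry to your SSH config file. The sketch below prints an entry of roughly that shape so you know what to expect. `GDK_USER` and `your-username` are hypothetical placeholders, and the exact entry VS Code writes may differ.

```shell
# Hypothetical placeholder; replace with your own GenomeDK username
GDK_USER="your-username"

# Build an entry of the kind the Remote SSH prompts add to ~/.ssh/config
ssh_entry="$(printf 'Host login.genome.au.dk\n    HostName login.genome.au.dk\n    User %s' "$GDK_USER")"

# Print the entry for review; nothing is written to disk here
printf '%s\n' "$ssh_entry"
```

With an entry like this in place, the host can be selected by name in VS Code without retyping the username.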

Using Git LFS on GenomeDK

In Data Packages on GenomeDK, all Parquet data files and data kept in the raw/ folder are tracked using Git LFS with a local LFS store on GenomeDK itself. This means that the contents of data files are stored on GenomeDK and only pointers to these files are uploaded to GitHub. On GenomeDK you can see the contents of the data files in two places: in your working copy of the Data Package and in a folder outside the Data Package directory called the LFS store.

When working with data or adding new data to the Data Package, it is important to take care that the Git LFS configuration is not changed and that sensitive data is always tracked by Git LFS. Any data that is not tracked by Git LFS runs the risk of being uploaded to GitHub accidentally, which would constitute a serious data breach.

Setting up Git LFS

  1. Install Git LFS in your GenomeDK user space either from the tarball or via conda.

  2. Run git lfs install.

    Important

    If you don’t initialise Git LFS with git lfs install, Git LFS will not manage data files. Instead, files that should be tracked by LFS will be handled by Git as normal and uploaded to GitHub on push.

  3. Create a folder on GenomeDK that everyone working on the Data Package can access and edit: this will be the LFS store where all the data will be saved. This folder should be outside the Data Package folder.

  4. Set the LFS store folder up as a bare repository by opening it in the Terminal and running git init --bare.

  5. Create an .lfsconfig file in the Data Package root folder and add the absolute path to the LFS store:

    .lfsconfig
    [lfs]
        url = file:///path/to/.../lfs-store

    Tip

    Be sure to get the real filesystem path to this folder, not the one pointing to your user space. You can find this by running readlink -f . from the LFS store folder.

  6. Define a rule for tracking all Parquet files and all data in the raw/ folder by running the following commands from the root of the Data Package:

    Terminal
    git lfs track "*.parquet"
    git lfs track "raw/**"

    This will create a .gitattributes file listing the tracked paths. Tracked files shouldn’t be listed in .gitignore.

  7. To get an overview of the LFS settings in the repo, run git lfs env or git lfs ls-files from the root of the Data Package.
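
Steps 3 to 5 above can be sketched as a single shell function. This is a minimal sketch rather than the project's own tooling: `setup_lfs_store` and its two arguments are hypothetical names, and it assumes the Data Package folder already exists. It uses `git config --file` to write the `[lfs]` entry instead of editing `.lfsconfig` by hand.

```shell
# Hypothetical helper covering steps 3 to 5
# $1: LFS store folder (outside the Data Package)
# $2: root folder of the Data Package
setup_lfs_store() {
  store_dir="$1"
  package_dir="$2"

  # Steps 3 and 4: create the store folder and make it a bare repository
  mkdir -p "$store_dir" || return 1
  git init --bare --quiet "$store_dir" || return 1

  # Resolve symlinks so the URL uses the real filesystem path
  real_store="$(readlink -f "$store_dir")" || return 1

  # Step 5: write the lfs.url entry into .lfsconfig
  git config --file "$package_dir/.lfsconfig" lfs.url "file://$real_store"
}
```

It could be called as `setup_lfs_store /path/to/.../lfs-store <data-package-root>`. Folder permissions for collaborators are not handled here and still need to be set so that everyone working on the Data Package can access and edit the store.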

Adding data to LFS

  1. To add data, create a new Parquet file or move data file(s) into the raw/ folder, which is tracked by LFS.

  2. Stage the new files (using git add). They should now be tracked by LFS.

  3. Double-check that Git LFS is configured correctly:

    • The output of git lfs ls-files should include the new files.
    • git lfs env should show that Endpoint is set to a local path pointing to the LFS store on GenomeDK.

  4. Commit and push the new data. If you look at the new data file on GitHub, you should see a pointer file rather than the actual data.
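
As an extra safeguard before committing, you can ask Git which filter applies to a path. The sketch below assumes the tracking rules from `.gitattributes` described earlier; `check_lfs_tracked` is a hypothetical helper name. It relies only on `git check-attr`, which reports `filter: lfs` for paths covered by an LFS tracking rule.

```shell
# Hypothetical helper: succeeds only if every given path has filter=lfs
# according to the .gitattributes rules of the current repository
check_lfs_tracked() {
  rc=0
  for f in "$@"; do
    attr="$(git check-attr filter -- "$f")"
    case "$attr" in
      *": filter: lfs") echo "ok: $f" ;;
      *) echo "NOT tracked by LFS: $f" >&2; rc=1 ;;
    esac
  done
  return "$rc"
}
```

Run it from the root of the Data Package with the files you are about to stage, e.g. `check_lfs_tracked raw/<new-file> <new-file>.parquet`. A non-zero exit status means at least one file would be handled by Git as a normal file.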

Working on a Data Package outside GenomeDK

If you want to work on a Data Package that uses Git LFS on your own machine, you can check out the repo without downloading the data files. In fact, it is not possible to download the data files via Git LFS, because the LFS store path points to a local folder on GenomeDK that your machine cannot access.

You don’t need Git LFS installed locally, but you do have to tell Git not to attempt to fetch data files tracked by Git LFS, as this would lead to an error. After cloning the repo, run the following from the root of the Data Package:

Terminal
# Show LFS-tracked files as pointers in the working tree
git config --local filter.lfs.smudge "cat"
# Do not require Git LFS to be installed
git config --local filter.lfs.required false
# Disable all LFS processing for tracked files
git config --local filter.lfs.process ""

With this setup, you won’t have access to the actual data or be able to run scripts that depend on it. However, you can still run other parts of the code and work on data-dependent scripts without executing them.
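
To confirm that a data file in such a checkout is a pointer rather than real data, you can inspect its first line. This is a minimal sketch: `is_lfs_pointer` is a hypothetical helper name, and it relies on Git LFS pointer files starting with the line `version https://git-lfs.github.com/spec/v1`.

```shell
# Hypothetical helper: succeeds if the file looks like a Git LFS pointer
is_lfs_pointer() {
  # Compare the first line of the file with the LFS pointer version line
  head -n 1 "$1" 2>/dev/null \
    | grep -qx 'version https://git-lfs\.github\.com/spec/v1'
}
```

For example, `is_lfs_pointer <file>.parquet && echo "pointer only"` prints the message when the file holds a pointer and prints nothing when it holds actual data.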