Working on GenomeDK
GenomeDK
GenomeDK is Aarhus University’s supercomputer cluster, a high-performance computing (HPC) system. It is ISO 27001 certified and approved to store sensitive data in compliance with GDPR and the Danish Data Protection Act.
Within a Seedcase context we are going to use it as an example of how Sprout can be used to import, organise and store data on a remote cluster. It will also be used by the two projects that we are currently working with, DP-Next and ON-LiMiT.
Access
People associated with a university, the public sector (government agencies, regions, and municipalities), or small and medium-sized enterprises (SMEs) can request an account on GenomeDK through the request account portal. For now, select the “Open” zone. State your preferred username and password, even though the form appears to ask for them only when you want to re-activate an old account.
Setting up
Once you are approved for access, you can work through the documentation, which is extensive and quite detailed. Because GenomeDK is a Linux-based server, you set up, access, and perform many actions through a shell terminal. You will also have to set up two-factor authentication using an authenticator app (GenomeDK doesn’t require a particular one).
Connecting
SSH
You can connect to your GenomeDK environment from the terminal on your local machine. Follow the instructions on GenomeDK’s website.
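As a convenience on top of GenomeDK's own instructions, the sketch below adds a host alias to an SSH config file so that you can connect with a short name. The function name `add_genomedk_host` and the alias `genomedk` are hypothetical choices made for this example; only the host name `login.genome.au.dk` comes from this guide.

```shell
# Append a host alias for GenomeDK to an SSH config file.
# "genomedk" is a hypothetical alias; pick any name you like.
add_genomedk_host() {
    config_file="$1"   # e.g. "$HOME/.ssh/config"
    username="$2"      # your GenomeDK username

    # Skip if the alias is already defined, so the function can be re-run safely.
    if [ -f "$config_file" ] && grep -q '^Host genomedk$' "$config_file"; then
        return 0
    fi

    printf 'Host genomedk\n    HostName login.genome.au.dk\n    User %s\n' \
        "$username" >> "$config_file"
}

# Example (uncomment and fill in your username):
# add_genomedk_host "$HOME/.ssh/config" "<username>"
# ssh genomedk
```

The alias only shortens the command. Password and two-factor prompts should still appear as usual.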
Desktop
If you prefer a graphical interface, you can also access GenomeDK through their virtual desktop. Follow the instructions on GenomeDK’s website.
Mounting files
If you want to view and edit the files stored on GenomeDK using applications installed on your local machine, you can mount your GenomeDK file system locally. Follow the instructions on GenomeDK’s website.
Installing software tools
To develop Seedcase projects on GenomeDK, you will need to install uv. If you want to render markdown documents, you will also need to install Quarto. Finally, if you want to run recipes in the justfile, you will need to install just.
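Before installing anything, it can help to see which of these tools are already available in your GenomeDK session. The following is a minimal sketch. The helper name `tool_status` is made up for this example, and it only checks whether a command is on your `PATH`, not which version is installed.

```shell
# Report whether a command is available on the PATH.
tool_status() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

# Check the three tools mentioned above.
for tool in uv quarto just; do
    tool_status "$tool"
done
```

Running the same loop again after installation is a quick way to confirm that each tool ended up in a directory on your `PATH`.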
Installing uv
Log in to GenomeDK (either through SSH or the desktop interface) and open a terminal window. Following uv’s installation guide, run:
Terminal
curl -LsSf https://astral.sh/uv/install.sh | sh
Installing Quarto
Log in to GenomeDK (either through SSH or the desktop interface) and open a terminal window. Follow the instructions on Quarto’s website.
Notes:
- Step 3: if ~/.local/bin exists, you don’t need to run mkdir ~/.local/bin
- Step 4: as the instructions say, you can probably skip this step
- If VS Code is open, you’ll need to restart it to detect the Quarto CLI
Installing just
Log in to GenomeDK (either through SSH or the desktop interface) and open a terminal window. Following just’s installation guide, run:
Terminal
curl --proto '=https' --tlsv1.2 -sSf https://just.systems/install.sh | bash -s -- --to ~/bin
Developing with VS Code
The best way to work on Seedcase projects in VS Code is to use the VS Code application on your local machine and connect to GenomeDK via SSH. This allows you to edit files and run code on GenomeDK as if you were working locally. This setup uses the Remote SSH VS Code extension, which works by adding a .vscode-server folder in your home directory on GenomeDK. This is a lightweight backend that allows VS Code to run terminals, extensions, debuggers, etc., remotely on GenomeDK. If at any point you want to remove the backend, you can simply delete this folder.
1. Open VS Code on your local machine and install the Remote SSH VS Code extension.
2. Select “Remote-SSH: Connect to Host…” from the Command Palette (F1 or Ctrl+Shift+P) and enter <username>@login.genome.au.dk. Follow the prompts given to update or create your config file. In the “Remote Explorer” panel, you should now see the GenomeDK host listed as login.genome.au.dk.
3. Click the GenomeDK SSH entry and enter your GenomeDK password when prompted. You should now be connected to GenomeDK and see the remote file system when you open the VS Code terminal.
4. Clone your repository using the VS Code terminal, if you haven’t already:
   Terminal
   cd <.../my/projects>
   git clone <repo-url>
   cd <repo-name>
5. Open your project folder in VS Code by clicking the “Open Folder” button in the panel on the left-hand side.
6. Install your project dependencies (you’ll need uv installed):
   Terminal
   uv sync
7. Install any VS Code extensions needed for development (e.g. the Python language extension). To list recommended extensions, open the Extensions panel and type @recommended into the search bar. You can install extensions from here.
8. To be able to commit to GitHub, you will need to set up your Git user name and email. You can do this by running the following commands in the terminal:
   Terminal
   git config --global user.name "Your Name"
   git config --global user.email "you@example.com"
Now, you should be able to run scripts in your project using either the VS Code terminal or the Run/Debug UI, as well as push changes to GitHub.
Using Git LFS on GenomeDK
In Data Packages on GenomeDK, all Parquet data files and data kept in the raw/ folder are tracked using Git LFS with a local LFS store on GenomeDK itself. This means that the contents of data files are stored on GenomeDK and only pointers to these files are uploaded to GitHub. On GenomeDK you can see the contents of the data files in two places: in your working copy of the Data Package and in a folder outside the Data Package directory called the LFS store.
When working with data or adding new data to the Data Package, it is important to take care that the Git LFS configuration is not changed and that sensitive data is always tracked by Git LFS. Any data that is not tracked by Git LFS runs the risk of being uploaded to GitHub accidentally, which would constitute a serious data breach.
Setting up Git LFS
1. Install Git LFS in your GenomeDK user space either from the tarball or via conda.
2. Run git lfs install.
   Important
   If you don’t initialise Git LFS with git lfs install, Git LFS will not manage data files. Instead, files that should be tracked by LFS will be handled by Git as normal and uploaded to GitHub on push.
3. Create a folder on GenomeDK that everyone working on the Data Package can access and edit: this will be the LFS store where all the data will be saved. This folder should be outside the Data Package folder.
4. Set the LFS store folder up as a bare repository by opening it in the terminal and running git init --bare.
5. Create a .lfsconfig file in the Data Package root folder and add the absolute path to the LFS store:
   .lfsconfig
   [lfs]
   url = file:///path/to/.../lfs-store
   Tip
   Be sure to get the real filesystem path to this folder, not the one pointing to your user space. You can find this by running readlink -f . from the LFS store folder.
6. Define a rule for tracking all Parquet files and all data in the raw/ folder by running the following commands from the root of the Data Package:
   Terminal
   git lfs track "*.parquet"
   git lfs track "raw/**"
   This will create a .gitattributes file listing the tracked paths. Tracked files shouldn’t be listed in .gitignore.
7. To get an overview of the LFS settings in the repo, run git lfs env or git lfs ls-files from the root of the Data Package.
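The .lfsconfig step can also be scripted, so that the resolved path is written without copying it by hand. This is only a sketch: the function name `write_lfsconfig` and its two arguments are hypothetical, and it overwrites any existing .lfsconfig in the target folder.

```shell
# Write an .lfsconfig that points at the resolved path of the LFS store.
# Note: this overwrites an existing .lfsconfig in the package folder.
write_lfsconfig() {
    store_dir="$1"     # folder holding the bare LFS store
    package_dir="$2"   # root folder of the Data Package

    # Resolve symlinks to get the real filesystem path of the store.
    real_path="$(readlink -f "$store_dir")" || return 1

    # real_path is absolute, so "file://" + path yields "file:///...".
    printf '[lfs]\n\turl = file://%s\n' "$real_path" > "$package_dir/.lfsconfig"
}

# Example (uncomment and replace with your actual LFS store path):
# write_lfsconfig /path/to/.../lfs-store .
```

Afterwards, `git lfs env` run from the Data Package root should report an endpoint that matches the path written to the file.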
Adding data to LFS
1. To add data, create a new Parquet file or move data file(s) into the raw/ folder, which is tracked by LFS.
2. Stage the new files (using git add). They should now be tracked by LFS.
3. Double-check that Git LFS is configured correctly:
   - The output of git lfs ls-files should include the new files.
   - git lfs env should show that Endpoint is set to a local path pointing to the LFS store on GenomeDK.
4. Commit and push the new data. If you look at the new data file on GitHub, you should see a pointer file rather than the actual data.
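As an extra safeguard alongside the checks above, the sketch below flags any Parquet file or file under raw/ that is not covered by an LFS rule in .gitattributes. The function name `check_lfs_tracking` is hypothetical. It relies on `git check-attr`, which only reads .gitattributes, so it does not confirm that `git lfs install` has been run.

```shell
# Print every given path that should be tracked by Git LFS but is not.
# Returns non-zero if any such path is found.
# Run from the root of the Data Package.
check_lfs_tracking() {
    untracked=0
    for file in "$@"; do
        case "$file" in
            *.parquet|raw/*)
                # Ask Git which filter applies to this path according to .gitattributes.
                attr="$(git check-attr filter -- "$file")"
                case "$attr" in
                    *": filter: lfs") ;;   # covered by an LFS rule
                    *) echo "NOT tracked by LFS: $file"; untracked=1 ;;
                esac
                ;;
        esac
    done
    return "$untracked"
}

# Example: check everything currently staged
# (assumes file names without spaces).
# check_lfs_tracking $(git diff --cached --name-only)
```

If the function prints anything, fix the tracking rules before committing rather than after.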
Working on a Data Package outside GenomeDK
If you want to work on a Data Package that uses Git LFS on your own machine, you can check out the repo without downloading the data files. In fact, it is not possible to download the data files via Git LFS, because the LFS store path points to a local folder on GenomeDK that your machine cannot access.
You don’t need Git LFS installed locally, but you do have to tell Git not to attempt to fetch data files tracked by Git LFS, as this would lead to an error. After cloning the repo, run the following from the root of the Data Package:
Terminal
# Show LFS-tracked files as pointers in the working tree
git config --local filter.lfs.smudge "cat"
# Do not require Git LFS to be installed
git config --local filter.lfs.required false
# Disable all LFS processing for tracked files
git config --local filter.lfs.process ""
With this setup, you won’t have access to the actual data or be able to run scripts that depend on it. However, you can still run other parts of the code and work on data-dependent scripts without executing them.
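When working outside GenomeDK, it can be useful to tell whether a file in your working tree is real data or only an LFS pointer, for example to skip data-dependent steps. The sketch below checks for the version line that Git LFS pointer files begin with. The function name `is_lfs_pointer` and the example file path are hypothetical.

```shell
# Succeed if the file looks like a Git LFS pointer rather than real data.
is_lfs_pointer() {
    [ -f "$1" ] || return 1
    # Pointer files start with this exact version line.
    head -n 1 "$1" | grep -qxF 'version https://git-lfs.github.com/spec/v1'
}

# Example (hypothetical file name):
# if is_lfs_pointer raw/example.csv; then
#     echo "Only a pointer is available; skipping data-dependent steps."
# fi
```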