There’s a GitHub repo that includes everything we’re talking about here.
Since I lead my own research group now, it’s a good opportunity to spell out what I consider the best practices for a new research repo. And share it more broadly while at it, instead of having it buried as internal know-how.
I’ve been in research for about ten years now. In science, you start new projects all the time, and coming from a software engineering background, I’ve always felt like it gives me an opportunity to experiment with tools and abstractions.
Note! This is not about the quality of your programming skills. Most people in research do not have any experience with large-scale software projects. They might not know how to design good APIs, or choose the right abstractions. This is okay because research software rarely lives longer than 6-12 months anyway.
However, this is not an excuse to dump 5000 LoC on GitHub split between three files, where half of it is commented in and out depending on which experiment you want to run.
You’re lucky if you have
Goals
Like any software project, modern science repos must have:
- Clear, high-quality code that can be easily reproduced.
- Support for quick collaborative development.
If you work in software, these are quite obvious, but there’s a twist to the first point. People consume research code differently than applications or tools. It’s semi-throwaway code that somehow needs to remain reproducible. Almost no one is going to integrate it as a neat dependency.
Additionally, it must address:
- Support and tracking of tens of hyperparameters, i.e., configuration hell on steroids.
- Stable behaviour of existing experiments when the project evolves and refocuses.
In the remainder of this post, I’m providing the rationale for why certain things are needed, but I’m not going into too much detail. Use this post and the GitHub repo as your North Star, not as the be-all and end-all technical reference.
Reproducible environment
Pin everything. Python version, library versions. The temperature and wind direction on that day. Importantly, research repos should be pinned to exact versions, not ranges. Your goal is to maximise reproducibility, not compatibility. If one day you decide to release a package based on your research, you will do things differently.
Use uv as your package and project manager. It’s a fast, modern, feature-rich tool. It will handle the virtual environments (including the Python version and installing the dependencies) and launching code. Here’s an example pyproject.toml:
[project]
name = "sciencerepo"
version = "0.1.0"
description = "Best practices for a science repo in 2026."
authors = [{name = "taclab.aalto.fi"}]
readme = "README.md"
license = {file = "LICENSE"}
requires-python = ">=3.12.0"
dependencies = [
"hydra-core==1.3.2",
"hydra-colorlog==1.2.0",
"tensorboard==2.18.0",
"torch==2.7.0",
"torchvision==0.22.0",
"tqdm==4.67.1",
]
[project.optional-dependencies]
dev = [
"pre-commit==4.1.0",
"basedpyright==1.28.1",
"pytest==8.3.2",
"ruff==0.11.0",
]
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[tool.hatch.build.targets.wheel]
packages = ["src/sciencerepo"]
[tool.basedpyright]
venvPath = "."
venv = ".venv"
typeCheckingMode = "standard"
deprecateTypingAliases = true
reportDeprecated = true
reportPrivateImportUsage = false

Before switching to uv, I used poetry. It’s fine. Practically the same functionality, a bit slower, somehow rougher edges even though it’s supposedly more mature.
Conda is popular, but it’s a kitchen-sink installer. You’d manage CUDA and system libs at the system level, not at your Python project level. It’s also more of a pip/venv replacement rather than a project management tool.
Whatever you do, don’t rawdog virtual environments with pip and requirements.txt. It’s a very manual process that doesn’t capture your intent.
I’m also a fan of Makefiles that simplify common workflows: building, testing, running specific configurations:
.PHONY: help
help:
@echo "Available targets:"
@echo " make install - Check and install uv, install dependencies, and set up pre-commit"
@echo " make sync - Sync dependencies with including dev extras"
@echo " make test - Run tests with pytest"
.PHONY: install
install:
@command -v uv >/dev/null 2>&1 || { echo >&2 "uv is not installed"; exit 1; }
@echo "✓ uv is installed"
@echo ""
@echo "Installing dependencies..."
uv sync --all-extras
@echo ""
	@uv run pre-commit --version >/dev/null 2>&1 || { echo >&2 "pre-commit is not installed; check uv extras"; exit 1; }
@echo "✓ pre-commit is installed"
@echo ""
@echo "Installing pre-commit hooks..."
uv run pre-commit install
.PHONY: sync
sync:
uv sync --all-extras
.PHONY: test
test:
	uv run pytest tests/ -v

LLMart has a good example of scripting tedious workflows.
Typing and formatting
Always use type hints. Even though it’s only Python, the point is to communicate the intent of your APIs — a contract. They also let you catch bugs with type checking tools.
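As a small, hypothetical illustration (this function isn’t part of the repo; torch is already in the example dependencies), a fully typed signature documents what goes in and what comes out before anyone reads the body:

import torch


def accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of correctly classified samples.

    logits: (batch, num_classes) raw scores; labels: (batch,) integer class ids.
    """
    preds = logits.argmax(dim=-1)
    return (preds == labels).float().mean().item()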
Speaking of which, I recommend basedpyright. It’s a fork of Microsoft’s pyright with some improvements; also easier to install as a Python dependency. Both offer a language server which mypy doesn’t. Keep an eye on ty though — a WIP type checker and language server written in Rust. It’s from the Astral team behind uv.
Use a code formatter. No one 100% likes how they format code but they keep it consistent. This project is configured with ruff which is a fast, drop-in replacement for black.
Code quality
You’d have some automation that runs all kinds of checks for you. Pre-commit is the default choice.
In this example, it’s installed and set up alongside all the other tools. We mostly care about running type checking, formatting with ruff, and a few other style checks.
Test coverage is notoriously bad in research code. Some of it is due to the development velocity but some due to complacency or just not knowing any better.
I recommend having some tests for the hot spots: tricky implementations of the important parts, and making sure that the project runs in a clean directory. But you can forget about end-to-end testing, at least until the project is stable. Things just change too much.
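For example, a couple of focused unit tests for a hot spot might look like this (the sciencerepo.metrics module is a hypothetical path, reusing the typed accuracy example from above):

import torch

from sciencerepo.metrics import accuracy  # hypothetical module path


def test_accuracy_perfect_predictions():
    logits = torch.tensor([[10.0, 0.0], [0.0, 10.0]])
    labels = torch.tensor([0, 1])
    assert accuracy(logits, labels) == 1.0


def test_accuracy_all_wrong():
    logits = torch.tensor([[0.0, 10.0], [10.0, 0.0]])
    labels = torch.tensor([0, 1])
    assert accuracy(logits, labels) == 0.0

Run them with make test or uv run pytest tests/ -v.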
CI/CD
It’s a good idea to set up some GitHub workflows too — check that all PRs and commits to the main branch satisfy all code quality requirements:
name: precommit-checks
on:
push:
branches: [main]
pull_request:
branches: [main, "release/*"]
permissions: read-all
jobs:
code-quality:
runs-on: ubuntu-latest
# strategy:
# matrix: # if you need to test against many
# python-version: ['3.9', '3.10', '3.11', '3.12']
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: "3.12" # or w/e version you need
# python-version: ${{ matrix.python-version }} # if you need to test against many
- name: Install uv
run: |
pip install uv
uv sync --all-extras
- name: Pre-commit on all files
uses: pre-commit/action@v3.0.1
with:
extra_args: --all-files
- name: Run tests
if: github.event_name == 'pull_request' || github.ref == 'refs/heads/main'
        run: uv run pytest

The above example is configured with a GitHub-hosted runner. However, it’s quite trivial to set up a self-hosted one on your lab server.
Perhaps a controversial take: do PR reviews with your colleagues. Another pair of eyes will often catch bugs and flawed assumptions. Even if you work by yourself and you’ll be the one merging it, the act of creating a PR and looking at your code in the web UI feels different.
Configuration & tracking
One of the most difficult things about science projects is keeping track of the configuration. It becomes unwieldy when you have different experiments, datasets, models, and hyperparameters. For tiny projects, argparse will do, but you will outgrow it quickly.
For many years, I’ve been shilling for hydra. It provides several niceties that you could but shouldn’t code yourself:
- Structured and typed configuration based on Python data classes.
- Better logging.
- Management of output directories.
- CLI argument parsing with nested overrides.
Hydra is quite difficult to use correctly, and there are many different ways to use it. I encourage you to read the docs carefully and check out several configuration examples. This repo sits somewhere between LLMart and MART in terms of design choices: polymorphism, YAML vs Python.
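As a rough sketch (the config fields and entry point are illustrative, not the repo’s actual setup), a structured config registered with the ConfigStore gives you typed fields plus CLI overrides for free:

from dataclasses import dataclass

import hydra
from hydra.core.config_store import ConfigStore


@dataclass
class TrainConfig:
    lr: float = 1e-3
    epochs: int = 10
    dataset: str = "cifar10"


cs = ConfigStore.instance()
cs.store(name="train", node=TrainConfig)


@hydra.main(version_base=None, config_path=None, config_name="train")
def main(cfg: TrainConfig) -> None:
    # Hydra resolves overrides, configures logging, and creates the output directory.
    print(f"Training on {cfg.dataset} with lr={cfg.lr} for {cfg.epochs} epochs")


if __name__ == "__main__":
    main()

Every field can then be overridden from the command line, e.g. uv run python train.py lr=3e-4 epochs=50, and the resolved config is saved next to the run’s outputs.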
On top of that, you’d use something to visualise and compare the running jobs and the results. Tensorboard is alright if you mostly care about plotting. Personally, I prefer Weights & Biases because it also lets me easily run some aggregation queries on all my results; exporting everything as a dataframe is trivial too. The caveat is you need an account.
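A minimal logging sketch with Weights & Biases could look like this (the project name and metrics are placeholders, and note that wandb isn’t in the example pyproject.toml, so you’d add it as a dependency first):

import wandb

# The config dict becomes searchable and filterable across runs in the UI.
run = wandb.init(project="sciencerepo", config={"lr": 1e-3, "epochs": 10})

for epoch in range(10):
    # Replace with real metrics from your training loop.
    wandb.log({"epoch": epoch, "train/loss": 1.0 / (epoch + 1)})

run.finish()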
Non-dev practices
In addition to how you organise your code, you’d consider what you’re actually doing with it. The most common mistake is running experiments on dirty branches. If you do, then once (not if) bugs or confusion creep in, you won’t be able to trace back to the exact version of the code. Similarly, you’d keep a project notebook where you record the progress and evolving assumptions, referring to specific commits when talking about the code. Ideas aren’t self-documenting, and if you want to revisit your project two years later when you’re writing your dissertation, or want to onboard a new person, you will need that log.
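One lightweight way to keep runs traceable, sketched here as an assumption rather than something the repo ships, is to record the current commit and whether the working tree is dirty at the start of every experiment:

import subprocess


def git_state() -> dict[str, str | bool]:
    """Return the current commit hash and whether the working tree has uncommitted changes."""
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    dirty = bool(
        subprocess.run(
            ["git", "status", "--porcelain"], capture_output=True, text=True, check=True
        ).stdout.strip()
    )
    return {"commit": commit, "dirty": dirty}

Log the result alongside your metrics, or refuse to launch when the tree is dirty, and every notebook entry can point at an exact commit.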
In many ways, research runs on an honour system. Make sure that you attribute the work that you’re building on: cite and link it, and respect the licensing. Always license your own projects: a LICENSE file (in academic research, usually MIT or Apache-2.0) and some license text in the file headers go a long way.
Additional considerations
You might run into more challenges when working with growing ML and security projects: C/C++ FFI, hardware orchestration, slurm jobs, or large scale data storage. This repo is a common denominator that you can build on to facilitate efficient and reproducible research but it isn’t meant to cover everything. Always seek guidance from more experienced colleagues, HPC support staff at your organisation, web resources and even the LLMs.
I’m consciously avoiding LLM recommendations; I think it’s too early to tell what the best, or even good, practices are. Ironically, everything we talked about in this post makes projects LLM-friendly, because it imposes a lot of explicit structure.
If there’s one thing that I think you’d have, it’d be