GeoAB Model: Installation, Training, And Setup Guide
Hey guys! Today, we're diving into the GeoAB model, an AI system designed for antibody drug development. This model tackles two big challenges: predicting the 3D structure of antibodies (especially the CDR regions) with deep learning, and generating antibody variants that obey physical and biological constraints. Plus, it mimics affinity maturation in the natural immune system to optimize how strongly antibodies bind to antigens. It's a pretty useful tool, so let's get started on setting it up.
If you want to dive deeper into the theory and background, check out the research paper: GeoAB: Towards Realistic Antibody Design and Reliable Affinity Maturation.
The official GitHub repo (https://github.com/EDAPINENUT/GeoAB) does provide installation steps, but honestly, it can be a bit of a headache. I struggled with it for a while, especially with mixing pip and conda packages. So, I'm going to share my setup to hopefully save you some trouble.
System Information and Conda Configuration
First, here’s my system info and conda setup that worked for me:
[kl2@localhost ~]$ cat /etc/centos-release
CentOS Linux release 8.5.2111
[kl2@localhost ~]$ nvidia-smi
Mon Mar 31 10:26:43 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000               Off |   00000000:17:00.0 Off |                  Off |
| 58%   82C    P0            246W /  300W |    8805MiB /  49140MiB |     76%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2931      G   /usr/libexec/Xorg                         4MiB |
|    0   N/A  N/A         2378303      C   python3                                8782MiB |
+-----------------------------------------------------------------------------------------+
Because this involves quite a few system-level dependencies, I recommend using both conda and pip for installation. Conda handles the core dependencies, and pip takes care of the rest. Here's the conda environment setup using a yml file:
name: geoab
channels:
  - https://conda.rosettacommons.org
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python==3.9
  - _libgcc_mutex=0.1
  - _openmp_mutex=4.5
  - bzip2=1.0.8
  - ca-certificates=2024.2.2
  - c-ares=1.19.1
  - krb5=1.21.2
  - libblas=3.9.0
  - libcblas=3.9.0
  - liblapack=3.9.0
  - libopenblas=0.3.24
  - libstdcxx-ng=13.2.0
  - libgcc-ng=13.1.0
  - libgfortran-ng=13.2.0
  - libgfortran5=13.2.0
  - libzlib=1.2.13
  - openssl=3.2.1
Conda Environment Setup: A Detailed Guide
Setting up the conda environment correctly is crucial for the GeoAB model to function optimally. Start by creating a new environment using the provided geoab.yml file. This file specifies all the necessary dependencies, including the correct Python version and various libraries. Open your terminal and navigate to the directory containing the geoab.yml file. Then, run the following command:
conda env create -f geoab.yml
This command instructs conda to create a new environment named geoab and install all the dependencies listed in the geoab.yml file. This process might take a while, depending on your internet speed and system configuration, as conda downloads and installs each package. It ensures that you have the correct versions of Python, compilers, and essential libraries like libblas, libcblas, and liblapack, which are vital for numerical computations. Once the environment is created, activate it using:
conda activate geoab
Activating the environment ensures that all subsequent commands and scripts run within the isolated environment, preventing conflicts with other Python installations or package versions on your system. This isolation is particularly important when dealing with complex projects like GeoAB, which rely on specific versions of various libraries. After activating the environment, you can verify that the correct Python version is being used by running python --version. It should output Python 3.9, as specified in the geoab.yml file. With the conda environment set up and activated, you can proceed to install the remaining dependencies using pip, as detailed in the next section. Remember, a well-configured conda environment is the foundation for a smooth and error-free experience with the GeoAB model.
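Beyond eyeballing python --version, you can let Python confirm the environment itself. Here is a minimal sanity-check sketch; the 3.9 pin is taken from the geoab.yml above:

```python
# sanity check: is the active interpreter the Python pinned in geoab.yml?
import sys

def env_ok(expected_major=3, expected_minor=9):
    """True if the running interpreter matches the pinned major.minor version."""
    return (sys.version_info.major, sys.version_info.minor) == (expected_major, expected_minor)

print(sys.version.split()[0], "ok" if env_ok() else "WRONG ENV")
```

If this prints WRONG ENV, the geoab environment is probably not the one that's active; re-run conda activate geoab and check again.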
Installing Pip Dependencies
Once you're inside the conda environment, it's time to install the pip dependencies. These are the Python packages that GeoAB needs to run. With the environment activated, install the following pinned packages with pip:
aiohttp==3.11.14
biopython==1.81
biotite==0.38.0
e3nn==0.5.1
easydict==1.13
fair-esm==2.0.0
matplotlib==3.8.0
nni==3.0
pip==25.0.1
pytorch-lightning==2.0.9
rdkit==2023.3.2
tensorboard==2.13.0
torch_geometric==2.3.1
torch-scatter==2.1.2+pt20cpu
torchaudio==2.0.2
torchvision==0.15.2
Pip Dependency Installation: A Step-by-Step Guide
After setting up the conda environment, the next crucial step is to install the pip dependencies. These are Python packages that are not included in the conda environment but are necessary for the GeoAB model to function correctly. To ensure a smooth installation process, it's recommended to install these dependencies using a requirements.txt file. Create a file named requirements.txt in your GeoAB project directory and add the following lines:
aiohttp==3.11.14
biopython==1.81
biotite==0.38.0
e3nn==0.5.1
easydict==1.13
fair-esm==2.0.0
matplotlib==3.8.0
nni==3.0
pip==25.0.1
pytorch-lightning==2.0.9
rdkit==2023.3.2
tensorboard==2.13.0
torch_geometric==2.3.1
torch-scatter==2.1.2+pt20cpu
torchaudio==2.0.2
torchvision==0.15.2
Save the file and then, with your conda environment activated, run the following command in your terminal:
pip install -r requirements.txt
This command tells pip to install all the packages listed in the requirements.txt file, along with their specified versions. Installing packages with specified versions helps avoid compatibility issues and ensures that the GeoAB model runs as intended. During the installation, you might encounter errors related to certain packages, such as torch-scatter, due to CUDA version incompatibilities. In such cases, it's best to install the CPU version of the package, as mentioned earlier. Once all the packages are installed, it's a good idea to verify the installation by running pip list. This command lists all the installed packages and their versions, allowing you to confirm that all the required dependencies are installed correctly. If any packages are missing or have incorrect versions, you can reinstall them individually using pip install <package_name>==<version_number>. With the pip dependencies installed and verified, you can now proceed to download the necessary datasets and start training the GeoAB model.
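If you'd rather script the verification than scan pip list by hand, here is a minimal sketch. The names and pins are copied from the requirements.txt above (only a sample of them, to keep it short), and importlib.metadata is in the standard library from Python 3.8 onward:

```python
# report any installed package whose version differs from its pin in requirements.txt
from importlib import metadata

PINNED = {
    "biopython": "1.81",          # sample pins copied from requirements.txt
    "e3nn": "0.5.1",
    "pytorch-lightning": "2.0.9",
}

def check(pins):
    """Return (name, wanted, found) for every mismatch; found is None if absent."""
    problems = []
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            problems.append((name, wanted, found))
    return problems

for name, wanted, found in check(PINNED):
    print(f"{name}: wanted {wanted}, found {found or 'not installed'}")
```

An empty report means the sampled pins are satisfied; anything printed tells you exactly which pip install <package>==<version> to re-run.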
Note: I had to install the CPU build of torch-scatter (the +pt20cpu wheel in the list above) because of a CUDA version mismatch on my machine. You might need to do the same.
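For context, torch-scatter wheels are usually pulled from the PyG wheel index (https://data.pyg.org/whl/) rather than plain PyPI. As a sketch, here is a tiny helper that builds the index URL matching your torch build; note the URL scheme encoded here is an assumption based on current PyG packaging and may change:

```python
# build the data.pyg.org wheel-index URL matching a given torch build
# NOTE: the URL scheme below is an assumption based on current PyG packaging
def scatter_index_url(torch_version: str, accel: str = "cpu") -> str:
    """'2.0.1', 'cpu' -> 'https://data.pyg.org/whl/torch-2.0.0+cpu.html'"""
    major, minor = torch_version.split(".")[:2]
    return f"https://data.pyg.org/whl/torch-{major}.{minor}.0+{accel}.html"

# then: pip install torch-scatter==2.1.2 -f <that URL>
print(scatter_index_url("2.0.1", "cpu"))
```

Swap "cpu" for a CUDA tag like "cu118" if your driver setup supports it; in my case the CPU wheel was the one that installed cleanly.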
Cloning the GeoAB Repository
Now that you've set up your environment, it's time to grab the GeoAB code from GitHub. Just use the following commands:
git clone https://github.com/EDAPINENUT/GeoAB.git
cd GeoAB
Cloning the GeoAB Repository: Ensuring a Smooth Download
After setting up the conda environment and installing the pip dependencies, the next step is to clone the GeoAB repository from GitHub. This repository contains all the necessary source code, scripts, and configuration files required to run the GeoAB model. To clone the repository, you'll need to have Git installed on your system. If you don't have Git installed, you can download and install it from the official Git website (https://git-scm.com/). Once Git is installed, open your terminal and navigate to the directory where you want to store the GeoAB project. Then, run the following command:
git clone https://github.com/EDAPINENUT/GeoAB.git
This command tells Git to download the entire GeoAB repository to your local machine. The cloning process might take a few minutes, depending on your internet speed and the size of the repository. After the cloning is complete, you'll have a new directory named GeoAB in your chosen location. This directory contains all the files and folders from the GitHub repository. To navigate into the newly cloned directory, use the following command:
cd GeoAB
This command changes your current directory to the GeoAB directory, allowing you to access the contents of the repository. Before proceeding to the next step, it's a good idea to check out the latest version of the code by running git pull. This command updates your local copy of the repository with any changes that have been made since you cloned it. Ensuring that you have the latest version of the code helps prevent compatibility issues and ensures that you're working with the most up-to-date features and bug fixes. With the GeoAB repository cloned and updated, you can now proceed to download the necessary datasets and start training the model.
Downloading the Dataset
Next up, you'll need to download the dataset required for training and evaluation. Use these commands:
wget "https://drive.usercontent.google.com/download?id=1UwFDuQSE_7gEvrfQkEhqOp2h1NfGAHua&export=download&authuser=0&confirm=t&uuid=7e68c0a8-d842-4cb7-b09c-48e7841cb0b1&at=AEz70l7zL3Yrg_jnQ6f5lACx2v-e:1743385403752" -O all_data.zip
unzip all_data.zip
Downloading the Dataset: Ensuring Data Integrity
After cloning the GeoAB repository, the next critical step is to download the necessary dataset for training and evaluation. The dataset is essential for the GeoAB model to learn and make accurate predictions. To download the dataset, you'll need to use the wget command, which is a command-line utility for retrieving files over HTTP or HTTPS. If you don't have wget installed on your system, you can install it using your system's package manager. For example, on Debian-based systems, you can install wget by running sudo apt-get install wget. Once wget is installed, open your terminal and navigate to the GeoAB directory. Then, run the following command:
wget "https://drive.usercontent.google.com/download?id=1UwFDuQSE_7gEvrfQkEhqOp2h1NfGAHua&export=download&authuser=0&confirm=t&uuid=7e68c0a8-d842-4cb7-b09c-48e7841cb0b1&at=AEz70l7zL3Yrg_jnQ6f5lACx2v-e:1743385403752" -O all_data.zip
This command tells wget to download the dataset from the specified URL and save it as all_data.zip in your current directory. The download might take a while, depending on your internet speed and the size of the dataset. After it completes, you'll have an all_data.zip file in your GeoAB directory. To extract the contents of the ZIP file, you'll need the unzip command. If you don't have unzip installed on your system, you can install it using your system's package manager; for example, on Debian-based systems, run sudo apt-get install unzip. Once unzip is installed, run the following command:
unzip all_data.zip
This command tells unzip to extract all the files and folders from the all_data.zip file into your current directory. After the extraction is complete, you'll have a new directory containing the dataset files. It's essential to ensure that the dataset is downloaded and extracted correctly, as any errors or corrupted files can lead to incorrect training and evaluation results. With the dataset downloaded and extracted, you can now proceed to train and evaluate the GeoAB model.
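Since a corrupted download can silently poison training, it's worth verifying the archive before trusting it. A minimal sketch using the standard-library zipfile module (the filename matches the wget step above):

```python
# verify a downloaded archive before relying on its contents
import zipfile

def verify_zip(path: str) -> bool:
    """True if path is a valid zip archive with no corrupt members."""
    if not zipfile.is_zipfile(path):
        return False          # truncated download, HTML error page, etc.
    with zipfile.ZipFile(path) as zf:
        # testzip() returns the first bad member name, or None if all CRCs pass
        return zf.testzip() is None

# usage: verify_zip("all_data.zip") should print/return True before you unzip
```

If it returns False, delete the file and re-run the wget command; Google Drive downloads in particular sometimes return an HTML page instead of the archive.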
Training the Model
To train the GeoAB model, use the following commands:
# Train GeoAB-refiner
python train_refine.py
# Train GeoAB-Initializer
python train_init.py
# After GeoAB-Initializer is trained, train GeoAB-Designer
python train_design.py
Training the Model: A Detailed Walkthrough
With the GeoAB repository cloned, dependencies installed, and the dataset downloaded, you're finally ready to train the model. The GeoAB model consists of three main components: GeoAB-Refiner, GeoAB-Initializer, and GeoAB-Designer. Each component serves a specific purpose and requires separate training. To start training the GeoAB-Refiner, navigate to the GeoAB directory in your terminal and run the following command:
python train_refine.py
This command executes the train_refine.py script, which contains the code for training the GeoAB-Refiner model. The training process might take several hours or even days, depending on your hardware and the size of the dataset. During the training, the script will output various metrics, such as loss and accuracy, to the console. These metrics can be used to monitor the progress of the training and identify any potential issues. After the GeoAB-Refiner is trained, you can proceed to train the GeoAB-Initializer. To do this, run the following command:
python train_init.py
This command executes the train_init.py script, which contains the code for training the GeoAB-Initializer model. Similar to the GeoAB-Refiner training, this process might also take a significant amount of time. It is crucial that the GeoAB-Initializer is trained before the GeoAB-Designer, as the latter depends on the former. Once the GeoAB-Initializer is trained, you can train the GeoAB-Designer by running the following command:
python train_design.py
This command executes the train_design.py script, which contains the code for training the GeoAB-Designer model. The training process for each component involves optimizing the model's parameters to minimize the difference between the predicted and actual outputs. This optimization is typically done using algorithms like stochastic gradient descent (SGD) or Adam. The training scripts also include techniques like regularization and dropout to prevent overfitting and improve the generalization ability of the model. With all three components of the GeoAB model trained, you can now proceed to evaluate its performance.
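To make the "optimizing the model's parameters" step concrete, here is a toy illustration of the loop those training scripts run, reduced to vanilla SGD on a one-parameter least-squares problem. GeoAB itself uses PyTorch optimizers on far richer losses; this is only a sketch of the idea:

```python
# toy version of a training loop: gradient descent on mean squared error
def sgd_fit(xs, ys, lr=0.01, epochs=200):
    """Fit y ~ w*x by minimizing MSE with plain stochastic-gradient-style updates."""
    w = 0.0
    for _ in range(epochs):
        # gradient of MSE wrt w: mean over samples of 2*(w*x - y)*x
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad        # step against the gradient, scaled by the learning rate
    return w

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # generated with true w = 2
print(round(sgd_fit(xs, ys), 3))   # converges to 2.0
```

The falling loss you watch in the console during train_refine.py and friends is exactly this process, just with millions of parameters and Adam instead of raw SGD.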
Evaluating the Model
For evaluation, run these commands:
# Evaluate GeoAB-Refiner
python eval.py --eval_dir H3_refine --run 1
# Evaluate GeoAB-Designer
python eval.py --eval_dir H3_design
That’s it! Hope this helps you get GeoAB up and running more smoothly. Let me know if you have any questions!
Evaluating the Model: Assessing Performance Metrics
After training the GeoAB model, the final step is to evaluate its performance. Evaluation involves assessing how well the model performs on a separate dataset that it has not seen during training. This helps determine the model's generalization ability and its ability to make accurate predictions on new, unseen data. To evaluate the GeoAB-Refiner, navigate to the GeoAB directory in your terminal and run the following command:
python eval.py --eval_dir H3_refine --run 1
This command executes the eval.py script with the --eval_dir argument set to H3_refine and the --run argument set to 1. The --eval_dir argument specifies the directory containing the evaluation data for the GeoAB-Refiner, while the --run argument identifies which run to evaluate. The evaluation script calculates various metrics, such as root mean squared error (RMSE) and Pearson correlation coefficient, to assess the accuracy of the model's predictions. These metrics are printed to the console and can be used to compare the performance of different models or training configurations. To evaluate the GeoAB-Designer, run the following command:
python eval.py --eval_dir H3_design
This command executes the eval.py script with the --eval_dir argument set to H3_design. The evaluation script calculates various metrics, such as the percentage of correctly designed antibodies and the binding affinity of the designed antibodies, to assess the effectiveness of the model. These metrics are printed to the console and can be used to compare the performance of different models or design strategies. By evaluating the GeoAB model on a separate dataset, you can gain insights into its strengths and weaknesses and identify areas for improvement. This information can then be used to refine the model's architecture, training procedure, or design strategy to further enhance its performance.
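For reference, the two refiner metrics mentioned above are easy to compute by hand. A minimal sketch (not the eval.py implementation, just the textbook formulas):

```python
# reference implementations of the evaluation metrics discussed above
import math

def rmse(pred, true):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def pearson(pred, true):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(true) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in true))
    return cov / (sp * st)

print(rmse([1.0, 2.0], [1.0, 3.0]))    # ~0.7071
print(pearson([1, 2, 3], [2, 4, 6]))   # 1.0 (perfectly linear)
```

Lower RMSE and a Pearson coefficient closer to 1 both indicate that the refined structures track the ground truth more closely.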