Abstract
Reverse vaccinology (RV) is a computer-aided approach to vaccine development that identifies a subset of pathogen proteins as protective antigens (PAgs), or potential vaccine candidates. Machine learning (ML)-based RV is promising but requires a dataset of PAgs (positives) and non-protective protein sequences (negatives). This study aimed to create an ML dataset, VPAgs-Dataset4ML, for predicting viral PAgs, based on PAgs obtained from Protegen. We performed seven steps to identify PAgs from the Protegen website and non-protective protein sequences from the Universal Protein Resource (UniProt). The seven steps included downloading viral PAgs from Protegen, performing quality checks on the PAgs using the standard BLASTp identity check (≤30%) via MMseqs2, and computational steps, run on Google Colaboratory and the Ubuntu terminal, to retrieve non-protective protein sequences from UniProt as negatives and apply quality checks similar to those used for the PAgs. VPAgs-Dataset4ML contains 2145 viral protein sequences: 210 PAgs in positive.fasta and 1935 non-protective protein sequences in negative.fasta. This dataset can be used to train ML models to predict antigens for various viral pathogens, with the aim of developing effective vaccines.
Dataset: https://doi.org/10.17632/w78tyrjz4z.1
Dataset License: CC BY 4.0
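As a minimal sketch of how the two files described above might be combined into a labeled set for ML training, the following Python reads FASTA records and assigns label 1 to PAgs and label 0 to non-protective sequences. The function names (`parse_fasta`, `build_labeled_set`, `load_dataset`, `positive_fraction`) are hypothetical and not part of the dataset or the study's pipeline, and the code assumes that positive.fasta and negative.fasta are plain FASTA files on local disk.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str, int]  # (header, sequence, label)


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse FASTA-formatted lines into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; join them later.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def build_labeled_set(positive_lines: Iterable[str],
                      negative_lines: Iterable[str]) -> List[Record]:
    """Label PAgs as 1 and non-protective sequences as 0."""
    positives = [(h, s, 1) for h, s in parse_fasta(positive_lines)]
    negatives = [(h, s, 0) for h, s in parse_fasta(negative_lines)]
    return positives + negatives


def load_dataset(positive_path: str, negative_path: str) -> List[Record]:
    """Read both FASTA files (paths are assumptions) into one labeled list."""
    with open(positive_path) as pos, open(negative_path) as neg:
        return build_labeled_set(pos, neg)


def positive_fraction(n_pos: int, n_neg: int) -> float:
    """Share of positives, useful for judging class imbalance."""
    return n_pos / (n_pos + n_neg)
```

With the counts reported above, `positive_fraction(210, 1935)` is 210/2145, a little under 10%, so a model trained on this data would likely need some handling of class imbalance (for example, class weights or stratified splits).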
Original language | English |
---|---|
Article number | 41 |
Journal | Data |
Volume | 8 |
Issue number | 2 |
State | Published - Feb 2023 |
Keywords
- antigens
- bioinformatics
- machine learning
- reverse vaccinology
- vaccines
- vaccinology
- viruses
Funding Agency
- Kuwait Foundation for the Advancement of Sciences