VPAgs-Dataset4ML: A Dataset to Predict Viral Protective Antigens for Machine Learning-Based Reverse Vaccinology

Zakia Salod, Ozayr Mahomed

Research output: Contribution to journalArticlepeer-review

1 Scopus citations

Abstract

Reverse vaccinology (RV) is a computer-aided approach for vaccine development that identifies a subset of pathogen proteins as protective antigens (PAgs) or potential vaccine candidates. Machine learning (ML)-based RV is promising, but requires a dataset of PAgs (positives) and non-protective protein sequences (negatives). This study aimed to create an ML dataset, VPAgs-Dataset4ML, to predict viral PAgs based on PAgs obtained from Protegen. We performed seven steps to identify PAgs from the Protegen website and non-protective protein sequences from Universal Protein Resource (UniProt). The seven steps included downloading viral PAgs from Protegen, performing quality checks on PAgs using the standard BLASTp identity check ≤30% via MMseqs2, and computational steps running on Google Colaboratory and the Ubuntu terminal to retrieve and perform quality checks (similar to the PAgs) on non-protective protein sequences as negatives from UniProt. VPAgs-Dataset4ML contains 2145 viral protein sequences, with 210 PAgs in positive.fasta and 1935 non-protective protein sequences in negative.fasta. This dataset can be used to train ML models to predict antigens for various viral pathogens with the aim of developing effective vaccines. Dataset: https://doi.org/10.17632/w78tyrjz4z.1 Dataset License: CC BY 4.0

Original languageEnglish
Article number41
JournalData
Volume8
Issue number2
DOIs
StatePublished - Feb 2023

Keywords

  • antigens
  • bioinformatics
  • machine learning
  • reverse vaccinology
  • vaccines
  • vaccinology
  • viruses

Funding Agency

  • Kuwait Foundation for the Advancement of Sciences

Fingerprint

Dive into the research topics of 'VPAgs-Dataset4ML: A Dataset to Predict Viral Protective Antigens for Machine Learning-Based Reverse Vaccinology'. Together they form a unique fingerprint.

Cite this