Foldseek Structural Search
Overview
Foldseek is a structural similarity search tool that identifies proteins with similar 3D structures. The pipeline integrates Foldseek to search for structural homologs of both Boltzgen-designed and Protenix-refolded structures against large databases like AlphaFold or Swiss-Model.
What is Foldseek?
Foldseek uses a novel 3Di structural alphabet combined with traditional amino acid sequences to enable ultra-fast structural similarity searches. It's significantly faster than traditional structural alignment tools like TM-align while maintaining high sensitivity.
When to Use Foldseek
Enable Foldseek structural search when you want to:
- Identify homologs: Find proteins with similar structures in large databases
- Validate designs: Check if designed structures resemble known proteins
- Discover function: Infer potential functions based on structural similarity
- Assess novelty: Determine if designs are truly novel or similar to existing structures
Enabling Foldseek
nextflow run seqeralabs/nf-proteindesign \
-profile docker \
--input samplesheet.csv \
--run_foldseek \
--foldseek_database /path/to/database_dir \
--foldseek_database_name afdb \
--outdir results
Key Parameters
Required Parameters
| Parameter | Description |
|---|---|
| --run_foldseek | Enable Foldseek structural search (default: false) |
| --foldseek_database | Path to Foldseek database directory (required when Foldseek is enabled) |
| --foldseek_database_name | Name of the database within the directory (default: afdb) |
Search Parameters
| Parameter | Default | Description |
|---|---|---|
| --foldseek_evalue | 0.001 | E-value threshold for reporting matches (lower = more stringent) |
| --foldseek_max_seqs | 100 | Maximum number of target sequences to report |
| --foldseek_sensitivity | 9.5 | Search sensitivity (1.0-9.5, higher = more sensitive but slower) |
| --foldseek_coverage | 0.0 | Minimum fraction of aligned residues (0.0-1.0) |
| --foldseek_alignment_type | 2 | 0=3Di only, 1=TMalign (global), 2=3Di+AA (local, default) |
Database Setup
AlphaFold Database (Recommended)
Download and prepare the AlphaFold database:
# Download AlphaFold database (choose your version)
wget https://foldseek.steineggerlab.workers.dev/afdb-swissprot.tar.gz
tar xzf afdb-swissprot.tar.gz
# Use in pipeline - database files will be in the extracted directory
--foldseek_database /path/to/afdb-swissprot \
--foldseek_database_name afdb
Swiss-Model Database
Alternatively, use Swiss-Model structures:
# Download Swiss-Model database
wget https://foldseek.steineggerlab.workers.dev/swissprot.tar.gz
tar xzf swissprot.tar.gz
# Use in pipeline
--foldseek_database /path/to/swissprot \
--foldseek_database_name swissprot
Custom Database
Create a custom database from your own PDB or mmCIF files:
# Create database from directory of PDB/CIF files
foldseek createdb /path/to/structures/ mydb
# Use in pipeline - specify your custom database name
--foldseek_database /path/to/ \
--foldseek_database_name mydb
What Structures Are Searched?
The pipeline runs Foldseek on:
- Boltzgen budget designs - All structures from intermediate_designs_inverse_folded/
- Protenix refolded structures - All structures predicted by Protenix (if enabled)
Each structure is searched independently, allowing comparison of:
- Original Boltzgen designs
- ProteinMPNN-optimized sequences refolded by Protenix
Output Files
For each design, Foldseek generates:
results/
└── sample_id/
    └── foldseek/
        ├── design_id_boltzgen/
        │   ├── aln.m8          # Alignment results in BLAST-like format
        │   ├── summary.tsv     # Summary of top hits
        │   └── alignment.html  # Detailed alignment visualization
        └── design_id_protenix/ # (if Protenix enabled)
            ├── aln.m8
            ├── summary.tsv
            └── alignment.html
Output Format
The summary.tsv file contains:
| Column | Description |
|---|---|
| query | Query structure name |
| target | Target structure identifier |
| evalue | E-value (lower = more significant) |
| prob | Probability score |
| score | Alignment score |
| qlen | Query length |
| tlen | Target length |
| alnlen | Alignment length |
| qstart, qend | Query alignment boundaries |
| tstart, tend | Target alignment boundaries |
| description | Target protein description |
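The boundary columns make it possible to estimate how much of the query each hit covers, which can help when choosing a value for --foldseek_coverage. The following is a minimal sketch, not part of the pipeline. It assumes that summary.tsv is tab-separated, starts with a header row, and stores the columns in the order listed above (qlen in column 6, qstart in column 9, qend in column 10). The helper name query_coverage is hypothetical, and the coverage definition used by the pipeline itself may differ.

```shell
# Print query, target, E-value and query coverage for every hit.
# Assumption: header row present, columns ordered as in the table above.
query_coverage() {
    awk -F'\t' 'NR > 1 && $6 > 0 {
        # fraction of query residues spanned by the alignment
        cov = ($10 - $9 + 1) / $6
        printf "%s\t%s\t%s\t%.2f\n", $1, $2, $3, cov
    }' "$1"
}

# Usage:
#   query_coverage results/sample1/foldseek/design1_boltzgen/summary.tsv
```

A hit with a strong E-value but low coverage usually matches only one domain or motif of the design rather than the whole fold.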
Interpreting Results
E-value Interpretation
- E < 1e-10: Very strong structural similarity
- E < 1e-5: Strong structural similarity
- E < 0.001: Moderate similarity (default threshold)
- E < 0.01: Weak similarity
- E > 0.1: Likely not significant
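To see how the hits for one design are distributed across these tiers, a short awk sketch such as the one below can help. It is illustrative only: evalue_tiers is a hypothetical helper, it assumes a tab-separated summary.tsv with a header row and the E-value in column 3, and it assigns each hit to the strictest tier it satisfies. Hits with E >= 0.01 are grouped into a single "other" bucket, because the list above does not label the range between 0.01 and 0.1.

```shell
# Count hits per E-value tier (strictest matching tier wins).
# Assumption: header row present, E-value in column 3.
evalue_tiers() {
    awk -F'\t' 'NR > 1 {
        e = $3 + 0
        if      (e < 1e-10) tier = "very_strong"
        else if (e < 1e-5)  tier = "strong"
        else if (e < 0.001) tier = "moderate"
        else if (e < 0.01)  tier = "weak"
        else                tier = "other"
        n[tier]++
    }
    END {
        split("very_strong strong moderate weak other", order, " ")
        for (i = 1; i <= 5; i++) printf "%s\t%d\n", order[i], n[order[i]]
    }' "$1"
}

# Usage:
#   evalue_tiers results/sample1/foldseek/design1_boltzgen/summary.tsv
```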
Example Analysis
# View top hits for a design
head results/sample1/foldseek/design1_boltzgen/summary.tsv
# Count significant hits (E < 1e-5); assumes a tab-separated file with a header row
awk -F'\t' 'NR > 1 && $3 + 0 < 1e-5' results/sample1/foldseek/design1_boltzgen/summary.tsv | wc -l
# Extract top hit details
head -n 2 results/sample1/foldseek/design1_boltzgen/summary.tsv
Integration with Other Analyses
Foldseek results are automatically integrated into the consolidated metrics report when both --run_foldseek and --run_consolidation are enabled:
nextflow run seqeralabs/nf-proteindesign \
--input samplesheet.csv \
--run_foldseek \
--foldseek_database /path/to/afdb \
--run_consolidation \
--outdir results
The consolidated report includes:
- Best E-value for each design
- Top matching protein name/description
- Number of significant hits
- Comparison across Boltzgen and Protenix structures
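For a quick manual check of the same kind of information without the consolidated report, the per-design summary files can be scanned directly. The sketch below is not the pipeline's consolidation code. best_hits is a hypothetical helper that assumes the output layout shown under Output Files and a tab-separated summary.tsv with a header row, the target in column 2, and the E-value in column 3.

```shell
# Print the best (lowest E-value) hit for every design directory.
# $1 = pipeline output directory (the value passed to --outdir)
best_hits() {
    for f in "$1"/*/foldseek/*/summary.tsv; do
        [ -f "$f" ] || continue
        design=$(basename "$(dirname "$f")")
        awk -F'\t' -v d="$design" '
            NR > 1 && (!seen || $3 + 0 < best) {
                best = $3 + 0; target = $2; raw = $3; seen = 1
            }
            END {
                if (seen) print d "\t" target "\t" raw
                else      print d "\tNA\tNA"   # no hits reported
            }
        ' "$f"
    done
}

# Usage:
#   best_hits results
```

Because the Boltzgen and Protenix directories for a design share a prefix, their rows appear next to each other in the output, which makes the two structures easy to compare by eye.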
Performance Notes
- GPU accelerated: Foldseek can utilize GPUs for faster searches
- Memory usage: ~4-8 GB per search depending on database size
- Search time: ~1-5 minutes per structure with AlphaFold database
- Database size: AlphaFold database is ~200 GB
Troubleshooting
Database Not Found
Solution: Ensure the database path is correct and accessible:
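A pre-flight check along the following lines can catch a wrong path or database name before the pipeline starts. This is a sketch under assumptions: check_foldseek_db is a hypothetical helper, and it only looks for the base file plus the .dbtype and .index companions that foldseek createdb normally writes. A complete database contains further files that are not checked here.

```shell
# Verify that a Foldseek database directory and name look usable.
# $1 = value for --foldseek_database, $2 = value for --foldseek_database_name
check_foldseek_db() {
    db_dir=$1
    db_name=$2
    if [ ! -d "$db_dir" ]; then
        echo "missing directory: $db_dir" >&2
        return 1
    fi
    # Assumed minimal file set: <name>, <name>.dbtype, <name>.index
    for suffix in "" ".dbtype" ".index"; do
        if [ ! -r "$db_dir/$db_name$suffix" ]; then
            echo "missing or unreadable: $db_dir/$db_name$suffix" >&2
            return 1
        fi
    done
    echo "ok: $db_dir/$db_name"
}

# Usage:
#   check_foldseek_db /path/to/database_dir afdb
```

When running with -profile docker, the directory must also be visible inside the container, so an absolute path is usually the safer choice.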
Out of Memory
Solution: Reduce the number of results or increase memory allocation:
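One way to raise the allocation is a small Nextflow configuration override. The sketch below writes such a file. The process selector '.*FOLDSEEK.*' and the 16 GB value are assumptions for illustration only; replace the selector with the actual name of the Foldseek process in the pipeline.

```shell
# Write a hypothetical resource override for the Foldseek step.
# Assumption: the process name contains "FOLDSEEK"; adjust as needed.
cat > foldseek_memory.config <<'EOF'
process {
    withName: '.*FOLDSEEK.*' {
        memory = '16 GB'
    }
}
EOF
```

Pass the file to the run with Nextflow's -c option (-c foldseek_memory.config). To reduce the number of results instead, lower --foldseek_max_seqs below its default of 100, for example to 50.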
No Significant Hits
If no hits are found:
- Check E-value threshold (try relaxing to --foldseek_evalue 0.01)
- Confirm sensitivity is at its maximum (--foldseek_sensitivity 9.5 is the default; raise it back if you lowered it for speed)
- Verify database is appropriate for your designs
- Consider if designs are truly novel (no existing homologs)
References
- Foldseek Publication: van Kempen M, et al. (2024) Fast and accurate protein structure search with Foldseek. Nature Biotechnology. doi:10.1038/s41587-023-01773-0
- Documentation: https://github.com/steineggerlab/foldseek
See Also
- PRODIGY Binding Affinity - Predict binding affinity
- ipSAE Scoring - Evaluate interface quality
- Consolidated Metrics - Unified reporting