ProteinMPNN & Boltz-2¶
Overview¶
ProteinMPNN and Boltz-2 form a sequence optimization and validation workflow that improves designed structures through iterative refinement:
- ProteinMPNN optimizes amino acid sequences for Boltzgen-designed structures
- Boltz-2 predicts structures for the optimized sequences to validate refolding
This workflow helps identify sequences that maintain the desired structure while potentially improving stability, expression, or other properties.
Workflow Diagram¶
flowchart TB
A[Boltzgen Budget Designs<br/>CIF Files] --> B[Convert CIF to PDB<br/>Per Design]
B --> C[ProteinMPNN Optimize<br/>Parallel per Budget Design<br/>🎮 GPU Process]
C --> D[Optimized Sequences<br/>Multi-FASTA + Scores<br/>8 sequences/design]
D --> E[Prepare Boltz-2 Input<br/>Split Multi-FASTA<br/>Process Target FASTA]
E --> F[Boltz-2 Structure Prediction<br/>Parallel per Sequence<br/>🎮 GPU Process]
F --> G[Predicted Structures<br/>CIF + Confidence JSON<br/>+ PAE NPZ]
G --> H{Analysis<br/>Modules}
H -->|Enabled| I[ipSAE Scoring<br/>🎮 GPU]
H -->|Enabled| J[PRODIGY Affinity<br/>💻 CPU]
H -->|Enabled| K[Foldseek Search<br/>🎮 GPU]
I --> L[Combined Metrics]
J --> L
K --> L
L --> M{Consolidation}
M -->|Enabled| N[Unified Report<br/>CSV + HTML]
style C fill:#8E24AA,color:#fff,stroke:#8E24AA,stroke-width:3px
style F fill:#7B1FA2,color:#fff,stroke:#7B1FA2,stroke-width:3px
style N fill:#6A1B9A,color:#fff,stroke:#6A1B9A,stroke-width:3px
classDef dataNode fill:#FFF9C4,stroke:#FBC02D,stroke-width:2px
class D,G,L dataNode
Workflow Details
- Parallelization: Each budget design is processed independently by ProteinMPNN
- Sequence Generation: ProteinMPNN creates 8 sequences per structure by default (
--mpnn_num_seq_per_target) - Boltz-2 Input: Multi-FASTA is split into individual files, target FASTA is cleaned
- Analysis Requirements: All analysis modules require Boltz-2 outputs (not Boltzgen designs)
When to Use This Workflow¶
Enable ProteinMPNN and Boltz-2 when you want to:
- Optimize sequences: Improve sequence properties while maintaining structure
- Validate designs: Confirm optimized sequences can refold correctly
- Compare alternatives: Explore sequence diversity for the same structure
- Assess stability: Check if predicted structures match designed structures
Enabling the Workflow¶
ProteinMPNN Only¶
nextflow run seqeralabs/nf-proteindesign \
-profile docker \
--input samplesheet.csv \
--run_proteinmpnn \
--outdir results
ProteinMPNN + Boltz-2 (Recommended)¶
nextflow run seqeralabs/nf-proteindesign \
-profile docker \
--input samplesheet.csv \
--run_proteinmpnn \
--run_boltz2_refold \
--outdir results
Boltz-2 Requires ProteinMPNN
The --run_boltz2_refold flag requires --run_proteinmpnn to be enabled, as it operates on ProteinMPNN's output sequences.
ProteinMPNN Parameters¶
Core Parameters¶
| Parameter | Default | Description |
|---|---|---|
--run_proteinmpnn |
false |
Enable ProteinMPNN sequence optimization |
--mpnn_sampling_temp |
0.1 |
Sampling temperature (0.1-0.3 recommended, lower = more conservative) |
--mpnn_num_seq_per_target |
8 |
Number of sequence variants per structure |
--mpnn_batch_size |
1 |
Batch size for inference |
--mpnn_seed |
37 |
Random seed for reproducibility |
--mpnn_backbone_noise |
0.02 |
Backbone noise level (0.02-0.20, lower = more faithful to input) |
Parameter Guidelines¶
Sampling Temperature (--mpnn_sampling_temp):
- 0.1: Very conservative, high sequence identity to original
- 0.2: Moderate diversity (recommended)
- 0.3: High diversity, more sequence variation
Number of Sequences (--mpnn_num_seq_per_target):
- 4-8: Standard, good for most applications
- 16-32: High throughput, explore more sequence space
- 1-2: Quick validation runs
Backbone Noise (--mpnn_backbone_noise):
- 0.02: Minimal noise, strict adherence to backbone
- 0.10: Moderate noise, some flexibility
- 0.20: High noise, allows more structural variation
Boltz-2 Parameters¶
| Parameter | Default | Description |
|---|---|---|
--run_boltz2_refold |
false |
Enable Boltz-2 structure prediction |
--boltz2_diffusion_samples |
1 |
Number of diffusion samples per sequence |
--boltz2_seed |
42 |
Random seed for reproducibility |
Parameter Guidelines¶
Diffusion Samples (--boltz2_diffusion_samples):
- 1: Standard, one prediction per sequence
- 5: Multiple samples for consensus
- 10+: Extensive sampling (very slow)
Output Files¶
ProteinMPNN Outputs¶
results/
└── sample_id/
└── proteinmpnn/
├── sequences/ # Optimized FASTA sequences
│ ├── design_0001.fasta
│ ├── design_0002.fasta
│ └── ...
├── scores/ # ProteinMPNN scores
│ ├── design_0001_scores.npz
│ └── ...
└── summary/
└── optimization_summary.txt # Summary statistics
Boltz-2 Outputs¶
results/
└── sample_id/
└── boltz2/
├── structures/ # Predicted CIF structures
│ ├── mpnn_0001_model_0.cif
│ ├── mpnn_0001_model_1.cif
│ └── ...
├── confidence/ # Confidence scores (JSON)
│ ├── mpnn_0001_confidence.json
│ └── ...
└── npz/ # Converted NPZ files (for ipSAE)
├── mpnn_0001_model_0.npz
└── ...
Interpreting Results¶
ProteinMPNN Scores¶
Lower scores indicate better sequence-structure compatibility:
- Score < -2.0: Excellent sequence fit
- Score -2.0 to -1.5: Good fit
- Score -1.5 to -1.0: Acceptable fit
- Score > -1.0: Poor fit, may not fold correctly
Boltz-2 Confidence¶
Boltz-2 provides multiple confidence metrics in JSON format:
{
"plddt": 85.3, // Per-residue confidence (0-100)
"ptm": 0.82, // Predicted TM-score
"iptm": 0.79, // Interface PTM (for complexes)
"ranking_confidence": 0.81
}
Interpretation: - pLDDT > 80: High confidence structure - pLDDT 60-80: Moderate confidence - pLDDT < 60: Low confidence, may be disordered - pTM > 0.8: Good overall structure quality
Comparison with Boltzgen¶
Structural Similarity¶
Compare Boltz-2 structures to original Boltzgen designs:
# Use TM-align or similar tool
tmalign results/sample1/boltzgen/design_0001.cif \
results/sample1/boltz2/structures/mpnn_0001_model_0.cif
TM-score interpretation: - TM-score > 0.9: Nearly identical structures - TM-score 0.7-0.9: Similar structures - TM-score < 0.7: Different structures
Analysis Comparison¶
When multiple analyses are enabled, compare metrics:
# Enable all analyses for both Boltzgen and Boltz-2
nextflow run seqeralabs/nf-proteindesign \
--input samplesheet.csv \
--run_proteinmpnn \
--run_boltz2_refold \
--run_ipsae \
--run_prodigy \
--run_foldseek \
--foldseek_database /path/to/database_dir \
--foldseek_database_name afdb \
--run_consolidation \
--outdir results
The consolidated metrics report will show: - Boltzgen designs: Original structure metrics - Boltz-2 designs: Sequence-optimized structure metrics - Side-by-side comparison of quality scores
Use Cases¶
1. Sequence Optimization¶
Goal: Find sequences with better properties while maintaining structure
Workflow:
Analysis: Compare ProteinMPNN scores, select top sequences
2. Structural Validation¶
Goal: Verify optimized sequences maintain desired structure
Workflow:
Analysis: Compare Boltzgen vs. Boltz-2 structures using TM-align
3. Comprehensive Quality Assessment¶
Goal: Full characterization of both original and optimized designs
Workflow:
Analysis: Use consolidated metrics to identify best overall designs
Performance Notes¶
ProteinMPNN¶
- GPU acceleration: Recommended for large batches
- Memory: ~4-8 GB GPU memory
- Time: ~5-10 seconds per structure
- Parallelization: Processes multiple structures in parallel
Boltz-2¶
- GPU acceleration: Required for reasonable speed
- Memory: ~16-32 GB GPU memory per sample
- Time: ~2-5 minutes per sequence
- Parallelization: Processes multiple sequences in parallel
Combined Workflow¶
For a design with 20 budget structures and 8 sequences per structure: - ProteinMPNN: ~5 minutes total - Boltz-2: ~20 minutes total (160 predictions) - Total overhead: ~25-30 minutes
Troubleshooting¶
ProteinMPNN Issues¶
Low scores (> -1.0):
- Original structure may have issues
- Try adjusting --mpnn_backbone_noise
- Check input PDB quality
Similar sequences:
- Increase --mpnn_sampling_temp (try 0.2-0.3)
- Increase --mpnn_num_seq_per_target
Boltz-2 Issues¶
Low confidence predictions: - Sequences may not be foldable - Check ProteinMPNN scores first - Try different sampling temperature
Out of memory:
- Reduce --boltz2_diffusion_samples
- Process fewer structures at once
- Use GPU with more memory
Structural Divergence¶
Boltz-2 structures differ from Boltzgen:
- Check ProteinMPNN scores (should be < -1.5)
- Verify target sequence extraction worked correctly
- Consider if sequence optimization is too aggressive
- Try lower --mpnn_sampling_temp
Best Practices¶
- Start conservative: Use default parameters first
- Validate small set: Test on 2-3 designs before full run
- Compare metrics: Use consolidation to compare Boltzgen vs. Boltz-2
- Check structural similarity: Always verify refolding maintains structure
- Consider tradeoffs: Lower ProteinMPNN scores may not always mean better designs
Integration with Other Analyses¶
ipSAE¶
Automatically analyzes both Boltzgen and Boltz-2 structures when enabled:
PRODIGY¶
Predicts binding affinity for both structure types:
Foldseek¶
Searches for homologs of both Boltzgen and Boltz-2 designs:
References¶
- ProteinMPNN: Dauparas J, et al. (2022) Robust deep learning–based protein sequence design using ProteinMPNN. Science. doi:10.1126/science.add2187
- Boltz-2: Structure prediction model for protein folding validation
See Also¶
- ipSAE Scoring - Works with both Boltzgen and Boltz-2 NPZ files
- PRODIGY Binding Affinity - Analyzes all predicted structures
- Metrics Consolidation - Compare Boltzgen vs. Boltz-2 metrics