ProteinMPNN & Boltz-2¶

Overview¶

ProteinMPNN and Boltz-2 form a sequence optimization and validation workflow that improves designed structures through iterative refinement:

ProteinMPNN optimizes amino acid sequences for Boltzgen-designed structures
Boltz-2 predicts structures for the optimized sequences to validate refolding

This workflow helps identify sequences that maintain the desired structure while potentially improving stability, expression, or other properties.

Workflow Diagram¶

flowchart TB
    A[Boltzgen Budget Designs<br/>CIF Files] --> B[Convert CIF to PDB<br/>Per Design]

    B --> C[ProteinMPNN Optimize<br/>Parallel per Budget Design<br/>🎮 GPU Process]

    C --> D[Optimized Sequences<br/>Multi-FASTA + Scores<br/>8 sequences/design]

    D --> E[Prepare Boltz-2 Input<br/>Split Multi-FASTA<br/>Process Target FASTA]

    E --> F[Boltz-2 Structure Prediction<br/>Parallel per Sequence<br/>🎮 GPU Process]

    F --> G[Predicted Structures<br/>CIF + Confidence JSON<br/>+ PAE NPZ]

    G --> H{Analysis<br/>Modules}

    H -->|Enabled| I[ipSAE Scoring<br/>🎮 GPU]
    H -->|Enabled| J[PRODIGY Affinity<br/>💻 CPU]
    H -->|Enabled| K[Foldseek Search<br/>🎮 GPU]

    I --> L[Combined Metrics]
    J --> L
    K --> L

    L --> M{Consolidation}
    M -->|Enabled| N[Unified Report<br/>CSV + HTML]

    style C fill:#8E24AA,color:#fff,stroke:#8E24AA,stroke-width:3px
    style F fill:#7B1FA2,color:#fff,stroke:#7B1FA2,stroke-width:3px
    style N fill:#6A1B9A,color:#fff,stroke:#6A1B9A,stroke-width:3px

    classDef dataNode fill:#FFF9C4,stroke:#FBC02D,stroke-width:2px
    class D,G,L dataNode

Workflow Details

Parallelization: Each budget design is processed independently by ProteinMPNN
Sequence Generation: ProteinMPNN creates 8 sequences per structure by default (--mpnn_num_seq_per_target)
Boltz-2 Input: Multi-FASTA is split into individual files, target FASTA is cleaned
Analysis Requirements: All analysis modules require Boltz-2 outputs (not Boltzgen designs)

When to Use This Workflow¶

Enable ProteinMPNN and Boltz-2 when you want to:

Optimize sequences: Improve sequence properties while maintaining structure
Validate designs: Confirm optimized sequences can refold correctly
Compare alternatives: Explore sequence diversity for the same structure
Assess stability: Check if predicted structures match designed structures

Enabling the Workflow¶

ProteinMPNN Only¶

nextflow run seqeralabs/nf-proteindesign \
    -profile docker \
    --input samplesheet.csv \
    --run_proteinmpnn \
    --outdir results

ProteinMPNN + Boltz-2 (Recommended)¶

nextflow run seqeralabs/nf-proteindesign \
    -profile docker \
    --input samplesheet.csv \
    --run_proteinmpnn \
    --run_boltz2_refold \
    --outdir results

Boltz-2 Requires ProteinMPNN

The --run_boltz2_refold flag requires --run_proteinmpnn to be enabled, as it operates on ProteinMPNN's output sequences.

ProteinMPNN Parameters¶

Core Parameters¶

Parameter	Default	Description
`--run_proteinmpnn`	`false`	Enable ProteinMPNN sequence optimization
`--mpnn_sampling_temp`	`0.1`	Sampling temperature (0.1-0.3 recommended, lower = more conservative)
`--mpnn_num_seq_per_target`	`8`	Number of sequence variants per structure
`--mpnn_batch_size`	`1`	Batch size for inference
`--mpnn_seed`	`37`	Random seed for reproducibility
`--mpnn_backbone_noise`	`0.02`	Backbone noise level (0.02-0.20, lower = more faithful to input)

Parameter Guidelines¶

Sampling Temperature (--mpnn_sampling_temp): - 0.1: Very conservative, high sequence identity to original - 0.2: Moderate diversity (recommended) - 0.3: High diversity, more sequence variation

Number of Sequences (--mpnn_num_seq_per_target): - 4-8: Standard, good for most applications - 16-32: High throughput, explore more sequence space - 1-2: Quick validation runs

Backbone Noise (--mpnn_backbone_noise): - 0.02: Minimal noise, strict adherence to backbone - 0.10: Moderate noise, some flexibility - 0.20: High noise, allows more structural variation

Boltz-2 Parameters¶

Parameter	Default	Description
`--run_boltz2_refold`	`false`	Enable Boltz-2 structure prediction
`--boltz2_diffusion_samples`	`1`	Number of diffusion samples per sequence
`--boltz2_seed`	`42`	Random seed for reproducibility

Parameter Guidelines¶

Diffusion Samples (--boltz2_diffusion_samples): - 1: Standard, one prediction per sequence - 5: Multiple samples for consensus - 10+: Extensive sampling (very slow)

Output Files¶

ProteinMPNN Outputs¶

results/
└── sample_id/
    └── proteinmpnn/
        ├── sequences/                     # Optimized FASTA sequences
        │   ├── design_0001.fasta
        │   ├── design_0002.fasta
        │   └── ...
        ├── scores/                        # ProteinMPNN scores
        │   ├── design_0001_scores.npz
        │   └── ...
        └── summary/
            └── optimization_summary.txt   # Summary statistics

Boltz-2 Outputs¶

results/
└── sample_id/
    └── boltz2/
        ├── structures/                    # Predicted CIF structures
        │   ├── mpnn_0001_model_0.cif
        │   ├── mpnn_0001_model_1.cif
        │   └── ...
        ├── confidence/                    # Confidence scores (JSON)
        │   ├── mpnn_0001_confidence.json
        │   └── ...
        └── npz/                          # Converted NPZ files (for ipSAE)
            ├── mpnn_0001_model_0.npz
            └── ...

Interpreting Results¶

ProteinMPNN Scores¶

Lower scores indicate better sequence-structure compatibility:

Score < -2.0: Excellent sequence fit
Score -2.0 to -1.5: Good fit
Score -1.5 to -1.0: Acceptable fit
Score > -1.0: Poor fit, may not fold correctly

Boltz-2 Confidence¶

Boltz-2 provides multiple confidence metrics in JSON format:

{
  "plddt": 85.3,           // Per-residue confidence (0-100)
  "ptm": 0.82,             // Predicted TM-score
  "iptm": 0.79,            // Interface PTM (for complexes)
  "ranking_confidence": 0.81
}

Interpretation: - pLDDT > 80: High confidence structure - pLDDT 60-80: Moderate confidence - pLDDT < 60: Low confidence, may be disordered - pTM > 0.8: Good overall structure quality

Comparison with Boltzgen¶

Structural Similarity¶

Compare Boltz-2 structures to original Boltzgen designs:

# Use TM-align or similar tool
tmalign results/sample1/boltzgen/design_0001.cif \
        results/sample1/boltz2/structures/mpnn_0001_model_0.cif

TM-score interpretation: - TM-score > 0.9: Nearly identical structures - TM-score 0.7-0.9: Similar structures - TM-score < 0.7: Different structures

Analysis Comparison¶

When multiple analyses are enabled, compare metrics:

# Enable all analyses for both Boltzgen and Boltz-2
nextflow run seqeralabs/nf-proteindesign \
    --input samplesheet.csv \
    --run_proteinmpnn \
    --run_boltz2_refold \
    --run_ipsae \
    --run_prodigy \
    --run_foldseek \
    --foldseek_database /path/to/database_dir \
    --foldseek_database_name afdb \
    --run_consolidation \
    --outdir results

The consolidated metrics report will show: - Boltzgen designs: Original structure metrics - Boltz-2 designs: Sequence-optimized structure metrics - Side-by-side comparison of quality scores

Use Cases¶

1. Sequence Optimization¶

Goal: Find sequences with better properties while maintaining structure

Workflow:

--run_proteinmpnn \
--mpnn_sampling_temp 0.2 \
--mpnn_num_seq_per_target 16

Analysis: Compare ProteinMPNN scores, select top sequences

2. Structural Validation¶

Goal: Verify optimized sequences maintain desired structure

Workflow:

--run_proteinmpnn \
--run_boltz2_refold \
--boltz2_diffusion_samples 1

Analysis: Compare Boltzgen vs. Boltz-2 structures using TM-align

3. Comprehensive Quality Assessment¶

Goal: Full characterization of both original and optimized designs

Workflow:

--run_proteinmpnn \
--run_boltz2_refold \
--run_ipsae \
--run_prodigy \
--run_consolidation

Analysis: Use consolidated metrics to identify best overall designs

Performance Notes¶

ProteinMPNN¶

GPU acceleration: Recommended for large batches
Memory: ~4-8 GB GPU memory
Time: ~5-10 seconds per structure
Parallelization: Processes multiple structures in parallel

Boltz-2¶

GPU acceleration: Required for reasonable speed
Memory: ~16-32 GB GPU memory per sample
Time: ~2-5 minutes per sequence
Parallelization: Processes multiple sequences in parallel

Combined Workflow¶

For a design with 20 budget structures and 8 sequences per structure: - ProteinMPNN: ~5 minutes total - Boltz-2: ~20 minutes total (160 predictions) - Total overhead: ~25-30 minutes

Troubleshooting¶

ProteinMPNN Issues¶

Low scores (> -1.0): - Original structure may have issues - Try adjusting --mpnn_backbone_noise - Check input PDB quality

Similar sequences: - Increase --mpnn_sampling_temp (try 0.2-0.3) - Increase --mpnn_num_seq_per_target

Boltz-2 Issues¶

Low confidence predictions: - Sequences may not be foldable - Check ProteinMPNN scores first - Try different sampling temperature

Out of memory: - Reduce --boltz2_diffusion_samples - Process fewer structures at once - Use GPU with more memory

Structural Divergence¶

Boltz-2 structures differ from Boltzgen: - Check ProteinMPNN scores (should be < -1.5) - Verify target sequence extraction worked correctly - Consider if sequence optimization is too aggressive - Try lower --mpnn_sampling_temp

Best Practices¶

Start conservative: Use default parameters first
Validate small set: Test on 2-3 designs before full run
Compare metrics: Use consolidation to compare Boltzgen vs. Boltz-2
Check structural similarity: Always verify refolding maintains structure
Consider tradeoffs: Lower ProteinMPNN scores may not always mean better designs

Integration with Other Analyses¶

ipSAE¶

Automatically analyzes both Boltzgen and Boltz-2 structures when enabled:

--run_ipsae  # Will process both sources

PRODIGY¶

Predicts binding affinity for both structure types:

--run_prodigy  # Analyzes all CIF files

Foldseek¶

Searches for homologs of both Boltzgen and Boltz-2 designs:

--run_foldseek --foldseek_database /path/to/database_dir --foldseek_database_name afdb

References¶

ProteinMPNN: Dauparas J, et al. (2022) Robust deep learning–based protein sequence design using ProteinMPNN. Science. doi:10.1126/science.add2187
Boltz-2: Structure prediction model for protein folding validation

ProteinMPNN & Boltz-2¶

Overview¶

Workflow Diagram¶

When to Use This Workflow¶

Enabling the Workflow¶

ProteinMPNN Only¶

ProteinMPNN + Boltz-2 (Recommended)¶

ProteinMPNN Parameters¶

Core Parameters¶

Parameter Guidelines¶

Boltz-2 Parameters¶

Parameter Guidelines¶

Output Files¶

ProteinMPNN Outputs¶

Boltz-2 Outputs¶

Interpreting Results¶

ProteinMPNN Scores¶

Boltz-2 Confidence¶

Comparison with Boltzgen¶

Structural Similarity¶

Analysis Comparison¶

Use Cases¶

1. Sequence Optimization¶

2. Structural Validation¶

3. Comprehensive Quality Assessment¶

Performance Notes¶

ProteinMPNN¶

Boltz-2¶

Combined Workflow¶

Troubleshooting¶

ProteinMPNN Issues¶

Boltz-2 Issues¶

Structural Divergence¶

Best Practices¶

Integration with Other Analyses¶

ipSAE¶

PRODIGY¶

Foldseek¶

References¶

See Also¶