Metrics Consolidation¶
Overview¶
The metrics consolidation module aggregates results from all analysis tools into a unified CSV report and markdown summary. This provides a comprehensive overview of design quality across all enabled analyses.
Unified Analysis
Consolidation automatically collects metrics from Boltzgen, ProteinMPNN, Protenix, ipSAE, PRODIGY, and Foldseek, making it easy to compare designs and identify top candidates.
When to Use Consolidation¶
Enable metrics consolidation when you need to:
- Compare designs: evaluate multiple designs across different metrics
- Identify top candidates: quickly find the best designs based on multiple criteria
- Track provenance: know which designs came from Boltzgen vs. Protenix
- Generate reports: produce publication-ready summary tables
Enabling Consolidation¶
nextflow run seqeralabs/nf-proteindesign \
-profile docker \
--input samplesheet.csv \
--run_proteinmpnn \
--run_protenix_refold \
--run_ipsae \
--run_prodigy \
--run_foldseek \
--foldseek_database /path/to/database_dir \
--foldseek_database_name afdb \
--run_consolidation \
--outdir results
Note
Consolidation works with any combination of analysis modules. It will include whatever metrics are available from enabled analyses.
Key Parameters¶
| Parameter | Default | Description |
|---|---|---|
| --run_consolidation | false | Enable metrics consolidation |
| --report_top_n | 10 | Number of top designs to highlight in report |
Output Files¶
Consolidation generates a unified metrics directory:
results/
└── consolidated_metrics/
    ├── all_designs_metrics.csv   # Complete metrics for all designs
    ├── top_designs_summary.md    # Markdown report of top designs
    └── metrics_by_source.csv     # Metrics grouped by source (Boltzgen/Protenix)
Consolidated Metrics CSV¶
The all_designs_metrics.csv file contains all available metrics in a single table:
Core Columns¶
| Column | Description | Source |
|---|---|---|
| design_id | Unique design identifier | All |
| parent_id | Parent design ID (links Protenix to Boltzgen) | All |
| source | boltzgen or protenix | All |
| structure_file | Path to CIF structure | Boltzgen/Protenix |
ProteinMPNN Metrics (if enabled)¶
| Column | Description |
|---|---|
| mpnn_score | ProteinMPNN sequence score |
| mpnn_sequence | Optimized sequence |
| mpnn_recovery | Sequence recovery rate |
Protenix Metrics (if enabled)¶
| Column | Description |
|---|---|
| protenix_confidence | Overall confidence score |
| protenix_plddt | Mean pLDDT score |
| protenix_ptm | Predicted TM-score |
ipSAE Metrics (if enabled)¶
| Column | Description |
|---|---|
| ipsae_score | Interface quality score |
| ipsae_ipae | Interface predicted aligned error |
| ipsae_num_contacts | Number of interface contacts |
PRODIGY Metrics (if enabled)¶
| Column | Description |
|---|---|
| prodigy_dg | Predicted binding free energy (ΔG, kcal/mol) |
| prodigy_kd | Predicted dissociation constant (Kd, M) |
| prodigy_kd_temp | Kd at specified temperature |
Foldseek Metrics (if enabled)¶
| Column | Description |
|---|---|
| foldseek_top_hit | Best matching structure |
| foldseek_evalue | E-value of top hit |
| foldseek_score | Alignment score |
| foldseek_num_hits | Number of significant hits |
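Because the available columns depend on which analyses were enabled, it can help to check which modules actually contributed data before filtering. The following is a minimal sketch using only the Python standard library. The MODULE_PREFIXES mapping and the modules_with_data helper are hypothetical names, the prefixes are inferred from the column tables above, and the inline rows are made up for illustration.

```python
import csv
import io

# Column-name prefixes inferred from the tables above (an assumption, not a pipeline API).
MODULE_PREFIXES = {
    "ProteinMPNN": "mpnn_",
    "Protenix": "protenix_",
    "ipSAE": "ipsae_",
    "PRODIGY": "prodigy_",
    "Foldseek": "foldseek_",
}


def modules_with_data(rows):
    """Return modules that have at least one non-missing value in any of their columns."""
    found = []
    for module, prefix in MODULE_PREFIXES.items():
        has_value = any(
            column and column.startswith(prefix) and value not in (None, "", "NA")
            for row in rows
            for column, value in row.items()
        )
        if has_value:
            found.append(module)
    return found


# Made-up rows standing in for all_designs_metrics.csv.
SAMPLE = """design_id,source,ipsae_score,prodigy_dg,foldseek_evalue
design1_0001,boltzgen,0.92,-12.5,NA
design1_0002,protenix,0.89,-11.8,NA
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(modules_with_data(rows))
```

To run this against real output, replace the io.StringIO(SAMPLE) argument with an open handle on results/consolidated_metrics/all_designs_metrics.csv.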
Top Designs Summary¶
The top_designs_summary.md provides a markdown-formatted report highlighting the best designs:
# Top Designs Summary
## Overview
- Total designs analyzed: 120
- Boltzgen designs: 60
- Protenix designs: 60
## Top 10 Designs by ipSAE Score
| Rank | Design ID | Source | ipSAE | PRODIGY ΔG | Foldseek E-value |
|------|-----------|--------|-------|------------|------------------|
| 1 | design1_0001 | boltzgen | 0.92 | -12.5 | 1.2e-8 |
| 2 | design1_0002 | protenix | 0.89 | -11.8 | 3.4e-7 |
...
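A table like the one in this report can be rebuilt from the CSV with a different ranking metric. The sketch below is not the pipeline's own report script. It is a small standard-library illustration in which top_n_markdown is a hypothetical helper and the sample rows are invented.

```python
import csv
import io


def top_n_markdown(rows, metric, n):
    """Build a Markdown table of the n rows with the highest value of `metric`."""
    # Rows where the metric is missing (NA or empty) cannot be ranked, so skip them.
    scored = [row for row in rows if row.get(metric) not in (None, "", "NA")]
    scored.sort(key=lambda row: float(row[metric]), reverse=True)
    lines = [
        f"| Rank | Design ID | Source | {metric} |",
        "|------|-----------|--------|-------|",
    ]
    for rank, row in enumerate(scored[:n], start=1):
        lines.append(f"| {rank} | {row['design_id']} | {row['source']} | {row[metric]} |")
    return "\n".join(lines)


# Invented rows standing in for all_designs_metrics.csv.
SAMPLE = """design_id,source,ipsae_score
design1_0002,protenix,0.89
design1_0003,boltzgen,NA
design1_0001,boltzgen,0.92
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
table = top_n_markdown(rows, "ipsae_score", n=2)
print(table)
```

Passing a different column name, such as protenix_plddt, ranks by that metric instead. For metrics where lower is better (for example prodigy_dg), the sort direction would need to be reversed.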
Example Analysis Workflow¶
1. View All Metrics¶
# Open in spreadsheet software
libreoffice results/consolidated_metrics/all_designs_metrics.csv
# Or view in terminal
column -t -s, results/consolidated_metrics/all_designs_metrics.csv | less -S
2. Filter Top Designs¶
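# Column positions depend on which modules were enabled, so each filter looks up
# its column by header name, keeps the header row, and skips non-numeric values such as NA.
# Assumes no quoted fields that contain commas.
# Find designs with strong binding (PRODIGY ΔG < -10)
awk -F',' 'NR==1 {for (i=1; i<=NF; i++) if ($i=="prodigy_dg") c=i; print; next} c && $c ~ /[0-9]/ && $c+0 < -10' results/consolidated_metrics/all_designs_metrics.csv
# Find designs with high ipSAE score (> 0.8)
awk -F',' 'NR==1 {for (i=1; i<=NF; i++) if ($i=="ipsae_score") c=i; print; next} c && $c ~ /[0-9]/ && $c+0 > 0.8' results/consolidated_metrics/all_designs_metrics.csv
# Find designs with significant Foldseek hits (E-value < 1e-5)
awk -F',' 'NR==1 {for (i=1; i<=NF; i++) if ($i=="foldseek_evalue") c=i; print; next} c && $c ~ /[0-9]/ && $c+0 < 1e-5' results/consolidated_metrics/all_designs_metrics.csv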
3. Compare Boltzgen vs. Protenix¶
# View metrics by source
cat results/consolidated_metrics/metrics_by_source.csv
# Count designs per source
awk -F',' 'NR>1 {print $3}' results/consolidated_metrics/all_designs_metrics.csv | sort | uniq -c
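For a per-source comparison of a single metric, a grouped mean is often enough. The source does not describe the layout of metrics_by_source.csv, so this sketch works from all_designs_metrics.csv instead. The mean_by_source helper is a hypothetical name and the rows are invented.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_by_source(rows, metric):
    """Average `metric` per value of the source column, ignoring missing entries."""
    groups = defaultdict(list)
    for row in rows:
        value = row.get(metric)
        if value not in (None, "", "NA"):
            groups[row["source"]].append(float(value))
    return {source: mean(values) for source, values in groups.items()}


# Invented rows standing in for all_designs_metrics.csv.
SAMPLE = """design_id,source,ipsae_score
design1_0001,boltzgen,0.90
design1_0002,boltzgen,0.70
design1_0003,protenix,0.85
design1_0004,protenix,NA
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for source, value in mean_by_source(rows, "ipsae_score").items():
    print(f"{source}: {value:.3f}")
```

A source whose rows are all NA for the chosen metric does not appear in the result at all, which distinguishes "no data" from a mean of zero.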
4. Identify Best Overall Designs¶
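import pandas as pd

# Load metrics (NA entries are read as NaN)
df = pd.read_csv('results/consolidated_metrics/all_designs_metrics.csv')

# Define scoring criteria (adjust weights as needed).
# All four columns must exist, so the corresponding modules must have been enabled.
df['combined_score'] = (
    df['ipsae_score'] * 0.3 +                    # Interface quality
    (df['prodigy_dg'] / -15) * 0.3 +             # Binding strength (normalized)
    df['foldseek_evalue'].clip(upper=1) * 0.2 +  # Structural novelty (higher E-value = weaker match, capped at 1)
    df['protenix_confidence'] * 0.2              # Confidence (if available)
)

# Rows with NA in any term get a NaN score; drop them before ranking
scored = df.dropna(subset=['combined_score'])

# Get top 10
top_designs = scored.nlargest(10, 'combined_score')
print(top_designs[['design_id', 'source', 'combined_score', 'ipsae_score', 'prodigy_dg']])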
Integration with Analysis Modules¶
Partial Analysis Support¶
Consolidation works with any subset of analysis modules:
# Only ipSAE and PRODIGY
--run_ipsae --run_prodigy --run_consolidation
# Only Foldseek
--run_foldseek --foldseek_database /path/to/database_dir --foldseek_database_name afdb --run_consolidation
# All modules
--run_proteinmpnn --run_protenix_refold --run_ipsae --run_prodigy --run_foldseek --run_consolidation
Missing metrics will be indicated as NA in the CSV.
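Counting NA entries per column is a quick way to see which metrics are sparse before ranking. The sketch below uses only the standard library. The missing_counts helper is a hypothetical name and the rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def missing_counts(rows):
    """Count NA (or empty) entries per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value in ("", "NA"):
                counts[column] += 1
    return counts


# Invented rows standing in for all_designs_metrics.csv.
SAMPLE = """design_id,source,ipsae_score,prodigy_dg
design1_0001,boltzgen,0.92,NA
design1_0002,protenix,NA,NA
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
missing = missing_counts(rows)
for column, count in missing.items():
    print(f"{column}: {count} of {len(rows)} rows missing")
```

When loading with pandas instead, read_csv treats NA as a missing value by default, so df.isna().sum() gives the same per-column counts.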
Provenance Tracking¶
The report tracks design provenance:
- Boltzgen designs: Original structures from Boltzgen design
- Protenix designs: Structures from ProteinMPNN sequences refolded by Protenix
Parent-child relationships are maintained via the parent_id column.
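The parent link makes it possible to compare each refolded design with the structure it came from. This sketch assumes that a Protenix row's parent_id equals the design_id of a Boltzgen row, which the column description suggests but does not spell out. The metric_change_from_parent helper, the design IDs, and the NA parent value for Boltzgen rows are all illustrative.

```python
import csv
import io


def metric_change_from_parent(rows, metric):
    """Map each Protenix design ID to (its metric value - its parent's metric value)."""
    by_id = {row["design_id"]: row for row in rows}
    changes = {}
    for row in rows:
        parent = by_id.get(row["parent_id"])
        if row["source"] != "protenix" or parent is None:
            continue
        # Skip pairs where either side has no value for this metric.
        if any(value in ("", "NA") for value in (row[metric], parent[metric])):
            continue
        changes[row["design_id"]] = float(row[metric]) - float(parent[metric])
    return changes


# Invented rows standing in for all_designs_metrics.csv.
SAMPLE = """design_id,parent_id,source,ipsae_score
design1_0001,NA,boltzgen,0.80
design1_0001_seq1,design1_0001,protenix,0.90
design1_0002,NA,boltzgen,0.70
design1_0002_seq1,design1_0002,protenix,NA
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
changes = metric_change_from_parent(rows, "ipsae_score")
for design_id, delta in changes.items():
    print(f"{design_id}: {delta:+.2f}")
```

A positive change means the refolded, sequence-optimized design scores higher than its parent on that metric.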
Customizing the Report¶
Change Number of Top Designs¶
Set --report_top_n to the number of designs to highlight in the report (default: 10).
Sort by Different Metrics¶
The consolidation script can be customized to prioritize different metrics. Edit assets/consolidate_design_metrics.py to change sorting criteria.
Use Cases¶
1. Therapeutic Development¶
Identify designs with:
- Strong binding affinity (PRODIGY ΔG < -10 kcal/mol)
- High interface quality (ipSAE > 0.8)
- Novel structures (Foldseek E-value > 0.01)
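The three criteria above can be applied in a single pass. The sketch below is one possible way to do that with the standard library. CRITERIA and meets_all are hypothetical names, the rows are invented, and treating a missing value as a failed criterion is a design choice rather than pipeline behavior.

```python
import csv
import io

# Thresholds taken from the criteria above; each entry is (column, predicate).
CRITERIA = [
    ("prodigy_dg", lambda v: v < -10),        # strong predicted binding (kcal/mol)
    ("ipsae_score", lambda v: v > 0.8),       # high interface quality
    ("foldseek_evalue", lambda v: v > 0.01),  # no close structural match
]


def meets_all(row):
    """True only if every criterion column is present, non-missing, and passes."""
    for column, predicate in CRITERIA:
        value = row.get(column)
        if value in (None, "", "NA"):
            return False
        if not predicate(float(value)):
            return False
    return True


# Invented rows standing in for all_designs_metrics.csv.
SAMPLE = """design_id,source,ipsae_score,prodigy_dg,foldseek_evalue
design1_0001,boltzgen,0.92,-12.5,1.2e-8
design1_0002,protenix,0.89,-11.8,0.5
design1_0003,protenix,0.75,-13.0,2.0
design1_0004,boltzgen,0.95,NA,1.0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
candidates = [row["design_id"] for row in rows if meets_all(row)]
print(candidates)
```

In this sample, design1_0001 is rejected despite strong binding and interface scores because its very low E-value indicates a close match to a known structure.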
2. Protein Engineering¶
Compare:
- Boltzgen designs (original scaffold)
- Protenix designs (sequence-optimized)
- Identify improvements from ProteinMPNN optimization
3. High-Throughput Screening¶
Process large design sets:
- Rank by combined score
- Filter by specific thresholds
- Identify patterns in successful designs
Performance Notes¶
- Execution time: < 1 minute for typical datasets
- Resource usage: Minimal (CPU-only, < 1 GB memory)
- Scales linearly: Works with tens to thousands of designs
Troubleshooting¶
Missing Metrics¶
Cause: Analysis module was not run or failed
Solution:
- Check that the module was enabled (--run_ipsae)
- Verify module completed successfully in pipeline logs
- Consolidation will still run with available metrics
Empty CSV¶
If all_designs_metrics.csv is empty:
- Check that at least one analysis module completed
- Verify output directory structure
- Check pipeline logs for errors in analysis modules
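Before digging through logs, it can be useful to tell apart a file that is missing, one that is completely empty, and one that has a header but no design rows. The following sketch does that with the standard library. The csv_status helper and its status labels are hypothetical, and the demonstration writes to a temporary directory rather than real pipeline output.

```python
import csv
import os
import tempfile


def csv_status(path):
    """Classify a metrics CSV as 'missing', 'empty', 'header_only', or 'ok'."""
    if not os.path.exists(path):
        return "missing"
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    if not rows:
        return "empty"
    if len(rows) == 1:
        return "header_only"
    return "ok"


# Demonstration on a temporary file instead of real pipeline output.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "all_designs_metrics.csv")
    status_missing = csv_status(path)
    with open(path, "w", newline="") as handle:
        handle.write("design_id,source\n")
    status_header = csv_status(path)

print(status_missing, status_header)
```

For a real run, call csv_status("results/consolidated_metrics/all_designs_metrics.csv"), adjusting the prefix to match the --outdir value.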
Inconsistent Design Counts¶
If the Protenix design count doesn't match the number of ProteinMPNN sequences:
- This is expected behavior (some sequences may fail refolding)
- Check Protenix logs for failed predictions
- Consolidation will include all successfully generated structures
Best Practices¶
- Always enable consolidation: Provides overview even with single analysis
- Use with multiple analyses: Maximum value when combining multiple metrics
- Document criteria: Note which metrics matter for your application
- Archive reports: Save consolidated reports for reproducibility
- Visualize: Import CSV into plotting tools for visual analysis
See Also¶
- ipSAE Scoring - Interface quality evaluation
- PRODIGY Binding Affinity - Binding strength prediction
- Foldseek Structural Search - Structural similarity analysis
- Output Files Reference - Complete output documentation