Basic Usage¶

This guide covers the fundamental concepts for using nf-proteindesign.

Basic Command Structure¶

nextflow run seqeralabs/nf-proteindesign \
    -profile <PROFILE> \
    --input <SAMPLESHEET> \
    --outdir <OUTPUT_DIR> \
    [OPTIONS]

Components¶

-profile: Execution profile (docker, test)
--input: Path to samplesheet CSV file
--outdir: Output directory path
[OPTIONS]: Additional pipeline parameters

Samplesheet Format¶

The pipeline uses a CSV samplesheet to specify design jobs. Each row represents a separate design run.

Required Columns¶

Column	Required	Description
`sample`	✅	Unique sample identifier
`design_yaml`	✅	Path to design YAML file (see below)

Optional Columns¶

Additional columns can override default parameters per sample:

Column	Type	Description
`num_designs`	Integer	Number of designs to generate (overrides `--num_designs`)
`budget`	Integer	Number of final designs to keep (overrides `--budget`)

Example Samplesheet¶

sample,design_yaml,num_designs,budget
protein_binder,designs/egfr_binder.yaml,10000,50
nanobody_design,designs/spike_nanobody.yaml,5000,20
peptide_binder,designs/il6_peptide.yaml,3000,10
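
When there are many design files, the samplesheet can be generated rather than typed by hand. The following is a minimal sketch, assuming one design YAML per sample and that the file name (without `.yaml`) is an acceptable sample identifier. `make_samplesheet` is a hypothetical helper, not part of the pipeline, and it writes only the two required columns.

```shell
#!/usr/bin/env bash
# Sketch: build a samplesheet from every *.yaml file in a directory.
# make_samplesheet is a hypothetical helper, not shipped with the pipeline.
make_samplesheet() {
    local design_dir="$1" out_csv="$2" yaml
    # Header with the two required columns
    echo "sample,design_yaml" > "$out_csv"
    for yaml in "$design_dir"/*.yaml; do
        [ -e "$yaml" ] || continue   # skip when the glob matches nothing
        # Sample ID = file name without directory or .yaml extension
        echo "$(basename "$yaml" .yaml),$yaml" >> "$out_csv"
    done
}
```

Calling `make_samplesheet designs samples.csv` writes one row per YAML file in `designs/`. Paths are written exactly as given, so pass an absolute directory if absolute paths are needed.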

Design YAML Format¶

For Design mode, create YAML files following this structure:

# Boltzgen design specification
entities:
  # Designed protein entity
  - protein:
      id: C
      sequence: 50..100  # Length range for designed protein

  # Target structure entity
  - file:
      path: target.cif
      include:
        - chain:
            id: A  # Target chain to bind

Common Parameters¶

Essential Parameters¶

--input            # Path to samplesheet CSV (required)
--outdir           # Output directory (required)
--mode             # Explicit mode: design, target, binder (optional, auto-detected)

Design Parameters¶

--n_samples        # Number of designs per specification (default: 10)
--timesteps        # Diffusion timesteps (default: 100)
--save_traj        # Save trajectory files (default: false)

Analysis Options¶

--run_ipsae        # Enable IPSAE scoring (default: false)
--run_prodigy      # Enable PRODIGY affinity prediction (default: false)

Resource Management¶

--max_cpus         # Maximum CPUs (default: 16)
--max_memory       # Maximum memory (default: 128.GB)
--max_time         # Maximum time per job (default: 48.h)

Output Structure¶

The pipeline creates an organized output directory:

results/
├── {sample_id}/
│   ├── boltzgen/
│   │   ├── final_ranked_designs/    # Your final designs ⭐
│   │   │   ├── design_1.cif
│   │   │   ├── design_2.cif
│   │   │   └── ...
│   │   ├── intermediate_designs/    # Intermediate outputs
│   │   │   └── ...
│   │   └── boltzgen.log            # Execution log
│   │
│   ├── prodigy/                     # If --run_prodigy enabled
│   │   ├── design_1_prodigy_results.txt
│   │   ├── design_1_prodigy_summary.csv
│   │   └── ...
│   │
│   └── ipsae/                       # If --run_ipsae enabled
│       └── design_1_ipsae_scores.csv
│
└── pipeline_info/
    ├── execution_report.html        # Execution summary
    ├── execution_timeline.html      # Timeline visualization
    └── execution_trace.txt          # Detailed trace

Key Output Files¶

Most Important Files

Final designs: {sample_id}/boltzgen/final_ranked_designs/*.cif
Execution report: pipeline_info/execution_report.html
Affinity predictions: {sample_id}/prodigy/design_*_prodigy_summary.csv (only with --run_prodigy)
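
A quick way to confirm that each sample produced final designs is to count the `.cif` files per sample. This is a minimal sketch that assumes the directory layout shown in the output tree above. `count_final_designs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print "<sample>\t<number of final designs>" for each sample directory.
# Assumes the layout <outdir>/<sample_id>/boltzgen/final_ranked_designs/*.cif.
count_final_designs() {
    local outdir="$1" sample_dir designs n
    for sample_dir in "$outdir"/*/; do
        designs="${sample_dir}boltzgen/final_ranked_designs"
        # Directories without this subfolder (e.g. pipeline_info) are skipped
        [ -d "$designs" ] || continue
        n=$(find "$designs" -maxdepth 1 -name '*.cif' | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$(basename "$sample_dir")" "$n"
    done
}
```

For example, `count_final_designs results` prints one line per sample. A count lower than the requested budget is a cue to check `boltzgen.log` for that sample.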

Example Workflows¶

Example 1: Basic Protein Design¶

# 1. Create design YAML
cat > protein_design.yaml << EOF
name: egfr_binder
target:
  structure: data/egfr.pdb
  residues: [10, 11, 12, 45, 46]
designed:
  chain_type: protein
  length: [60, 100]
global:
  n_samples: 20
EOF

# 2. Create samplesheet
cat > samples.csv << EOF
sample,design_yaml
egfr_binder,protein_design.yaml
EOF

# 3. Run pipeline
nextflow run seqeralabs/nf-proteindesign \
    -profile docker \
    --input samples.csv \
    --outdir results

Example 2: Multiple Designs with Analysis¶

# 1. Create design YAMLs for different targets
cat > egfr_design.yaml << EOF
name: egfr_binder
target:
  structure: data/egfr.pdb
  residues: [10, 11, 12, 45, 46]
designed:
  chain_type: protein
  length: [60, 120]
EOF

cat > spike_design.yaml << EOF
name: spike_nanobody
target:
  structure: data/spike.cif
  residues: [417, 484, 501]
designed:
  chain_type: nanobody
  length: [110, 130]
EOF

# 2. Create samplesheet
cat > samples.csv << EOF
sample,design_yaml,num_designs,budget
egfr_binder,egfr_design.yaml,10000,50
spike_nanobody,spike_design.yaml,5000,20
EOF

# 3. Run with analysis modules
nextflow run seqeralabs/nf-proteindesign \
    -profile docker \
    --input samples.csv \
    --outdir results \
    --run_proteinmpnn \
    --run_protenix_refold \
    --run_prodigy \
    --run_consolidation

Example 3: Test Run¶

# Use built-in test profile
nextflow run seqeralabs/nf-proteindesign \
    -profile test_design_protein,docker

Resume Failed Runs¶

Nextflow can resume from the last successful step:

nextflow run seqeralabs/nf-proteindesign \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    -resume  # ← Add this flag

Always Use Resume

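The -resume flag is safe to pass on every run: Nextflow reuses cached results for tasks whose inputs and scripts are unchanged and reruns only the rest. Resuming relies on the work/ directory from the earlier run, so do not delete it between attempts.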

Monitoring Execution¶

Check Pipeline Progress¶

# Watch Nextflow output
# Progress is shown in real-time

# Monitor GPU usage
watch -n 1 nvidia-smi

# Check disk usage
du -sh work/ results/

View Execution Report¶

After completion, open the HTML report:

# Linux
xdg-open results/pipeline_info/execution_report.html

# Mac
open results/pipeline_info/execution_report.html

# View timeline (Linux; use open on Mac)
xdg-open results/pipeline_info/execution_timeline.html

Advanced Usage¶

Custom Configuration¶

Create a custom config file my_config.config:

process {
    withLabel: gpu {
        memory = '32 GB'
        time = '24 h'
    }
}

params {
    n_samples = 50
    timesteps = 200
}

Use with:

nextflow run seqeralabs/nf-proteindesign \
    -profile docker \
    -c my_config.config \
    --input samplesheet.csv \
    --outdir results

Profile Combinations¶

Combine multiple profiles by listing them comma-separated, without spaces:

# Docker with test data
nextflow run ... -profile docker,test

# Docker with a user-defined profile (here named custom, declared in your own config)
nextflow run ... -profile docker,custom

Common Issues¶

Issue 1: Samplesheet Format¶

Error

Invalid samplesheet format

Solution: Ensure CSV is properly formatted with required columns:

# Check for proper headers
head -n 1 samplesheet.csv

# List any lines with trailing commas (no output means none were found)
grep -nE ',$' samplesheet.csv
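
A slightly fuller check can catch the most common formatting mistakes before launching a run. The sketch below assumes a plain CSV without quoted fields or embedded commas. `check_samplesheet` is a hypothetical helper and does not reproduce the pipeline's own validation.

```shell
#!/usr/bin/env bash
# Sketch: verify required columns, consistent field counts, and unique sample IDs.
# Returns non-zero and prints one message per problem found.
check_samplesheet() {
    local csv="$1"
    awk -F',' '
        { sub(/\r$/, "") }                      # tolerate Windows line endings
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("sample" in col) || !("design_yaml" in col)) {
                print "header lacks sample and/or design_yaml"
                bad = 1
                exit
            }
            s = col["sample"]; ncol = NF
            next
        }
        $0 == "" { next }                       # ignore blank lines
        NF != ncol { printf "line %d: expected %d fields, found %d\n", NR, ncol, NF; bad = 1 }
        $s == ""   { printf "line %d: empty sample ID\n", NR; bad = 1 }
        seen[$s]++ { printf "line %d: duplicate sample %s\n", NR, $s; bad = 1 }
        END { exit bad + 0 }
    ' "$csv"
}
```

Run it as `check_samplesheet samplesheet.csv && echo OK`. A trailing comma shows up as an unexpected field count on the affected line.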

Issue 2: File Not Found¶

Error

File not found: design.yaml

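Solution: Use absolute paths, or paths relative to the directory where you run nextflow:

# Absolute path
sample,design_yaml
design1,/full/path/to/design.yaml

# Or let the shell expand $PWD while writing the samplesheet
# (a literal "$PWD" inside a CSV file is not expanded)
cat > samplesheet.csv << EOF
sample,design_yaml
design1,$PWD/designs/design.yaml
EOF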
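
An existing samplesheet with relative paths can also be rewritten in one pass. The following is a minimal sketch, assuming a plain CSV without quoted fields and a column named `design_yaml` as described above. `absolutize_samplesheet` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: prefix relative design_yaml paths with a base directory.
# Usage: absolutize_samplesheet IN.csv OUT.csv [BASE_DIR]   (BASE_DIR defaults to $PWD)
absolutize_samplesheet() {
    local in_csv="$1" out_csv="$2" base_dir="${3:-$PWD}"
    awk -F',' -v OFS=',' -v base="$base_dir" '
        # Locate the design_yaml column from the header, then print the header as is
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "design_yaml") c = i; print; next }
        # Rewrite only non-empty paths that do not already start with "/"
        c && $c != "" && $c !~ /^\// { $c = base "/" $c }
        { print }
    ' "$in_csv" > "$out_csv"
}
```

Write to a new file rather than overwriting the input, so the original samplesheet stays available for comparison.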

Issue 3: GPU Memory¶

Error

CUDA out of memory

Solution: Reduce --n_samples below the value that failed, or limit how many GPU tasks run at once:

nextflow run ... --n_samples 5  # Below the default of 10
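
If several GPU tasks are competing for the same device, capping concurrency may help. The sketch below writes a small config file that uses Nextflow's `maxForks` process directive. It assumes the pipeline's GPU processes carry the label `gpu`, as in the custom configuration example in this guide, and the file name `sequential.config` is arbitrary.

```shell
#!/usr/bin/env bash
# Sketch: write a config that lets at most one GPU-labelled task run at a time.
# Assumption: GPU processes use the label "gpu"; adjust if the pipeline differs.
cat > sequential.config << 'EOF'
process {
    withLabel: gpu {
        maxForks = 1   // run GPU tasks one after another
    }
}
EOF
```

Pass the file with `-c sequential.config` alongside the usual options. This only helps when the out-of-memory error comes from concurrent tasks; a single task that exceeds GPU memory still needs a smaller `--n_samples`.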

Next Steps¶

Check the Quick Reference for common commands
Explore Analysis Tools integration
Review Pipeline Parameters for advanced configuration

Need Help?

See Quick Reference for command templates
Check GitHub Issues