18  Bioinformatics Tool Integration

Bioinformatics involves many command-line tools, each with its own quirks. Claude Code helps navigate this complexity.

18.1 The Bioinformatics Challenge

  • Dozens of tools, each with unique syntax
  • Complex installation procedures
  • Chaining tools into pipelines
  • Managing file formats
  • Dealing with compute clusters

18.2 Installing Tools

18.2.1 Package Managers

> Help me install samtools, bedtools, and STAR aligner

Using Conda (recommended for bioinformatics):

# Create environment
conda create -n biotools python=3.10

# Activate
conda activate biotools

# Install tools
conda install -c conda-forge -c bioconda samtools bedtools star

Using Homebrew (Mac):

brew install samtools bedtools

18.2.2 When Installation Fails

> I'm getting this error installing STAR:
> [paste error message]

Claude diagnoses common issues:

  • Missing dependencies
  • Architecture mismatches (M1 Mac)
  • Version conflicts
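
For architecture mismatches in particular, the first diagnostic step is checking what machine you are on. A minimal sketch; the conda subdir workaround in the comment assumes an Apple Silicon Mac with Rosetta 2 installed:

```shell
# First diagnostic: what architecture is this machine?
ARCH=$(uname -m)   # arm64 on Apple Silicon, x86_64 on Intel/most Linux
echo "Detected architecture: $ARCH"

# On Apple Silicon, some bioconda packages ship only osx-64 (Intel) builds;
# a common workaround is forcing the Intel subdir and running via Rosetta 2:
#   conda config --env --set subdir osx-64
```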

18.3 Common Tool Workflows

18.3.1 Sequence Alignment

> Create a script to align FASTQ files to a reference genome using STAR
#!/bin/bash
# align.sh - Align RNA-seq reads with STAR
# Usage: ./align.sh reads_R1.fastq.gz reads_R2.fastq.gz output_prefix
set -euo pipefail

GENOME_DIR="/path/to/star_index"
FASTQ_R1="$1"
FASTQ_R2="$2"
OUTPUT_PREFIX="$3"

STAR \
    --runThreadN 8 \
    --genomeDir "$GENOME_DIR" \
    --readFilesIn "$FASTQ_R1" "$FASTQ_R2" \
    --readFilesCommand zcat \
    --outFileNamePrefix "$OUTPUT_PREFIX" \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMattributes Standard

18.3.2 BAM File Processing

> I have a BAM file. Help me:
> 1. Sort it
> 2. Index it
> 3. Get basic statistics
> 4. Extract reads from chromosome 1
# Sort
samtools sort input.bam -o sorted.bam

# Index
samtools index sorted.bam

# Statistics
samtools flagstat sorted.bam
samtools idxstats sorted.bam

# Extract chr1
samtools view -b sorted.bam chr1 > chr1.bam

18.3.3 Variant Calling

> Walk me through a basic variant calling pipeline with BWA and GATK

Claude provides the multi-step workflow:

  1. Index the reference
  2. Align reads
  3. Mark duplicates
  4. Base quality recalibration
  5. Call variants
  6. Filter variants
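
The steps above can be sketched as a single bash function. This is a minimal single-sample outline, not a production pipeline: it assumes GATK4, BWA, and samtools are on the PATH, and every file name (ref.fa, known.vcf.gz, the FASTQ paths) and the QD hard-filter threshold are placeholders to adapt:

```shell
# Single-sample germline variant calling sketch (GATK4 + BWA + samtools).
call_variants() {
    local ref=ref.fa r1=reads_R1.fastq.gz r2=reads_R2.fastq.gz

    # 1. Index the reference (once per genome)
    bwa index "$ref"
    samtools faidx "$ref"
    gatk CreateSequenceDictionary -R "$ref"

    # 2. Align reads with a read group, then coordinate-sort
    bwa mem -t 8 -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' \
        "$ref" "$r1" "$r2" | samtools sort -o aligned.bam
    samtools index aligned.bam

    # 3. Mark duplicates
    gatk MarkDuplicates -I aligned.bam -O dedup.bam -M dup_metrics.txt

    # 4. Base quality score recalibration (needs a known-sites VCF)
    gatk BaseRecalibrator -I dedup.bam -R "$ref" \
        --known-sites known.vcf.gz -O recal.table
    gatk ApplyBQSR -I dedup.bam -R "$ref" \
        --bqsr-recal-file recal.table -O recal.bam

    # 5. Call variants
    gatk HaplotypeCaller -I recal.bam -R "$ref" -O raw.vcf.gz

    # 6. Filter variants (hard filter shown; VQSR needs cohort data)
    gatk VariantFiltration -V raw.vcf.gz -R "$ref" \
        --filter-name "lowQD" --filter-expression "QD < 2.0" \
        -O filtered.vcf.gz
}
```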

18.3.4 Peak Calling (ChIP-seq/ATAC-seq)

> Run MACS2 peak calling on my ChIP-seq data
macs2 callpeak \
    -t treatment.bam \
    -c control.bam \
    -f BAM \
    -g hs \
    -n experiment \
    --outdir peaks/

18.4 Building Pipelines

18.4.1 Snakemake

> Create a Snakemake pipeline for RNA-seq analysis
# Snakefile
configfile: "config.yaml"  # must define star_index and gtf

SAMPLES = ["sample1", "sample2", "sample3"]

rule all:
    input:
        expand("results/counts/{sample}.counts", sample=SAMPLES)

rule align:
    input:
        r1 = "data/{sample}_R1.fastq.gz",
        r2 = "data/{sample}_R2.fastq.gz"
    output:
        "results/aligned/{sample}.bam"
    shell:
        """
        STAR --genomeDir {config[star_index]} \
             --readFilesIn {input.r1} {input.r2} \
             --readFilesCommand zcat \
             --outSAMtype BAM SortedByCoordinate \
             --outFileNamePrefix results/aligned/{wildcards.sample}
        mv results/aligned/{wildcards.sample}Aligned.sortedByCoord.out.bam {output}
        """

rule count:
    input:
        "results/aligned/{sample}.bam"
    output:
        "results/counts/{sample}.counts"
    shell:
        "featureCounts -a {config[gtf]} -o {output} {input}"

18.4.2 Nextflow

> Convert this Snakemake pipeline to Nextflow

18.5 Working with HPC/SLURM

18.5.1 Submitting Jobs

> Create a SLURM script to run this analysis on our cluster
#!/bin/bash
#SBATCH --job-name=rnaseq
#SBATCH --output=logs/%j.out
#SBATCH --error=logs/%j.err
#SBATCH --time=24:00:00
#SBATCH --mem=64G
#SBATCH --cpus-per-task=16
#SBATCH --partition=normal

module load star/2.7.10
module load samtools/1.17

./align.sh "$FASTQ_R1" "$FASTQ_R2" "$OUTPUT"

18.5.2 Array Jobs

> I have 50 samples. Set up a SLURM array job to process them in parallel
#!/bin/bash
#SBATCH --array=1-50
# Note: %a patterns are only expanded in --output/--error, not --job-name
#SBATCH --job-name=align
#SBATCH --output=logs/align_%a.out

# Get sample name from list
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" sample_list.txt)

./align.sh data/${SAMPLE}_R1.fastq.gz data/${SAMPLE}_R2.fastq.gz results/${SAMPLE}

18.5.3 Monitoring Jobs

> Show me all my running jobs and their status
squeue -u $USER
sacct -j <job_id> --format=JobID,State,Elapsed,MaxRSS

18.6 Common Bioinformatics Tasks

18.6.1 Format Conversion

> Convert this GTF file to BED format
> Convert SAM to BAM and sort by coordinate
> Extract FASTA sequences for these BED regions
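
Conversions like these often reduce to a coordinate shift or a single command. For example, GTF to BED is mostly arithmetic: GTF is 1-based and end-inclusive, BED is 0-based and half-open, so the start drops by one and the end stays. The helper below is a sketch; the one-liners in the comments assume samtools and bedtools are installed:

```shell
# Convert GTF records on stdin to 6-column BED on stdout.
# (1-based inclusive -> 0-based half-open: start shifts down by one.)
gtf_to_bed() {
    awk 'BEGIN { OFS = "\t" } !/^#/ { print $1, $4 - 1, $5, $3, ".", $7 }'
}

# The other conversions are single commands:
#   samtools view -bS input.sam | samtools sort -o sorted.bam
#   bedtools getfasta -fi genome.fa -bed regions.bed -fo regions.fa
```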

18.6.2 Quality Control

> Run FastQC on all my FASTQ files and compile a MultiQC report
mkdir -p qc_reports

# Run FastQC
for f in data/*.fastq.gz; do
    fastqc -o qc_reports/ "$f"
done

# Compile with MultiQC
multiqc qc_reports/ -o qc_reports/multiqc/

18.6.3 Subsetting and Filtering

> Extract only the mitochondrial reads from this BAM
> Filter variants with quality < 30
> Get sequences longer than 500bp from this FASTA
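
The first two prompts are one-liners with samtools and bcftools (assuming those tools are installed); the FASTA length filter can be done in pure awk, as in this sketch, which handles multi-line sequences:

```shell
# One-liners for the first two prompts (samtools/bcftools required):
#   samtools view -b input.bam chrM > chrM.bam              # mitochondrial reads
#   bcftools view -e 'QUAL<30' in.vcf.gz -Oz -o hq.vcf.gz   # keep QUAL >= 30

# Keep FASTA records whose sequence is longer than a minimum length.
# Reads stdin, writes stdout; usage: fasta_min_len 500 < in.fa > long.fa
fasta_min_len() {
    awk -v min="$1" '
        /^>/ { if (seq != "" && length(seq) > min) print hdr "\n" seq
               hdr = $0; seq = ""; next }
        { seq = seq $0 }
        END { if (seq != "" && length(seq) > min) print hdr "\n" seq }'
}
```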

18.7 Debugging Bioinformatics Tools

18.7.1 Understanding Error Messages

> I'm getting this error from STAR:
> "FATAL: cannot open file"
> What does it mean?

18.7.2 Memory Issues

> My job keeps getting killed. How do I estimate memory requirements for STAR?
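
A commonly cited rule of thumb is that STAR genome generation needs roughly 10x the genome FASTA size in RAM (about 30 GB for human), with alignment needing about as much as the index; actual usage also varies with options like --sjdbOverhang. A quick back-of-the-envelope helper, as a sketch:

```shell
# Rough RAM estimate for STAR indexing: ~10x the genome FASTA size.
# (Rule of thumb only; measure with a test run before trusting it.)
est_star_ram_gib() {
    local bytes
    bytes=$(stat -c %s "$1" 2>/dev/null || stat -f %z "$1")
    echo $(( bytes * 10 / 1024 / 1024 / 1024 ))
}

# Example: est_star_ram_gib genome.fa
```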

18.7.3 Performance Optimization

> The alignment is taking forever. How can I speed it up?

Options:

  • More threads
  • Better indexing
  • Subset data for testing
  • Use faster storage
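
Subsetting data for testing is usually the cheapest win: FASTQ stores four lines per read, so taking the first N reads is a single pipeline. A sketch (file names are placeholders):

```shell
# Take the first N reads from a gzipped FASTQ (4 lines per read).
subsample_fastq() {
    local n_reads=$1 in=$2 out=$3
    zcat "$in" | head -n "$(( n_reads * 4 ))" | gzip > "$out"
}

# Example: a 100,000-read test set
#   subsample_fastq 100000 data/sample_R1.fastq.gz test_R1.fastq.gz
```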

18.8 Containerization

18.8.1 Docker

> Create a Dockerfile for my analysis pipeline
FROM continuumio/miniconda3:latest

RUN conda install -y -c conda-forge -c bioconda \
    star=2.7.10 \
    samtools=1.17 \
    bedtools=2.30

WORKDIR /data
ENTRYPOINT ["/bin/bash"]

18.8.2 Singularity (for HPC)

> Convert this Docker container to Singularity for use on our cluster
# Name the output image explicitly so the exec line below matches
singularity pull star_2.7.10.sif docker://quay.io/biocontainers/star:2.7.10--h9ee0642_0
singularity exec star_2.7.10.sif STAR --version

18.9 Reproducibility

18.9.1 Environment Management

> Export my conda environment so collaborators can reproduce it
# --no-builds omits platform-specific build strings for portability
conda env export --no-builds > environment.yml

# Collaborator runs:
conda env create -f environment.yml

18.9.2 Recording Tool Versions

> Add version logging to my pipeline script
echo "Tool versions:" > versions.txt
samtools --version >> versions.txt
bedtools --version >> versions.txt
STAR --version >> versions.txt

18.10 What You’ve Learned

You can now:

  • Install bioinformatics tools
  • Run common analyses
  • Build reproducible pipelines
  • Work with HPC systems
  • Debug tool issues

18.11 Next Steps

Continue to Part 6: Best Practices.