18 Bioinformatics Tool Integration
Bioinformatics involves many command-line tools, each with its own quirks. Claude Code helps navigate this complexity.
18.1 The Bioinformatics Challenge
- Dozens of tools, each with unique syntax
- Complex installation procedures
- Chaining tools into pipelines
- Managing file formats
- Dealing with compute clusters
18.2 Installing Tools
18.2.1 Package Managers
> Help me install samtools, bedtools, and STAR aligner
Using Conda (recommended for bioinformatics):
# Create environment
conda create -n biotools python=3.10
# Activate
conda activate biotools
# Install tools
conda install -c bioconda samtools bedtools star
Using Homebrew (Mac):
brew install samtools bedtools
18.2.2 When Installation Fails
> I'm getting this error installing STAR:
> [paste error message]
Claude diagnoses common issues:
- Missing dependencies
- Architecture mismatches (e.g., Apple Silicon Macs)
- Version conflicts
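A quick first step before digging into any install error is checking what platform you are actually on, since many bioconda packages lack native Apple Silicon builds. A minimal sketch (the `CONDA_SUBDIR` workaround assumes Rosetta 2 is installed):

```shell
# What architecture and OS is this machine?
uname -m    # "arm64" on Apple Silicon; many bioconda builds are x86_64-only
uname -s    # "Darwin" on macOS, "Linux" on Linux

# On Apple Silicon, one common workaround is an x86_64 environment run
# under Rosetta 2 (commented out here; adapt names to your setup):
# CONDA_SUBDIR=osx-64 conda create -n biotools_x86 python=3.10
```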
18.3 Common Tool Workflows
18.3.1 Sequence Alignment
> Create a script to align FASTQ files to a reference genome using STAR
#!/bin/bash
# align.sh - Align RNA-seq reads with STAR
GENOME_DIR="/path/to/star_index"
FASTQ_R1=$1
FASTQ_R2=$2
OUTPUT_PREFIX=$3
STAR \
--runThreadN 8 \
--genomeDir "$GENOME_DIR" \
--readFilesIn "$FASTQ_R1" "$FASTQ_R2" \
--readFilesCommand zcat \
--outFileNamePrefix "$OUTPUT_PREFIX" \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes Standard
18.3.2 BAM File Processing
> I have a BAM file. Help me:
> 1. Sort it
> 2. Index it
> 3. Get basic statistics
> 4. Extract reads from chromosome 1
# Sort
samtools sort input.bam -o sorted.bam
# Index
samtools index sorted.bam
# Statistics
samtools flagstat sorted.bam
samtools idxstats sorted.bam
# Extract chr1
samtools view -b sorted.bam chr1 > chr1.bam
18.3.3 Variant Calling
> Walk me through a basic variant calling pipeline with BWA and GATK
Claude provides the multi-step workflow:
1. Index reference
2. Align reads
3. Mark duplicates
4. Base quality recalibration
5. Call variants
6. Filter variants
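The six steps can be sketched as a single shell function. This is a minimal outline, not a production pipeline: `ref.fa`, `known.vcf.gz`, the read-group string, and the filter expression are all placeholders to adapt, and several GATK prerequisites (e.g., a sequence dictionary) are omitted for brevity. It is defined as a function so it can be sourced and reviewed before touching real data.

```shell
#!/bin/bash
# Sketch of a germline variant calling pipeline (BWA-MEM + GATK4).
# All file names are placeholders; flags are kept minimal.
call_variants() {
    REF=ref.fa; R1=$1; R2=$2; SAMPLE=$3

    bwa index "$REF"                                          # 1. Index reference
    samtools faidx "$REF"
    bwa mem -t 8 -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA" \
        "$REF" "$R1" "$R2" | samtools sort -o "${SAMPLE}.bam" # 2. Align + sort
    gatk MarkDuplicates -I "${SAMPLE}.bam" -O "${SAMPLE}.md.bam" \
        -M "${SAMPLE}.metrics"                                # 3. Mark duplicates
    gatk BaseRecalibrator -I "${SAMPLE}.md.bam" -R "$REF" \
        --known-sites known.vcf.gz -O recal.table             # 4. BQSR (build model)
    gatk ApplyBQSR -I "${SAMPLE}.md.bam" -R "$REF" \
        --bqsr-recal-file recal.table -O "${SAMPLE}.recal.bam"
    gatk HaplotypeCaller -I "${SAMPLE}.recal.bam" -R "$REF" \
        -O "${SAMPLE}.vcf.gz"                                 # 5. Call variants
    gatk VariantFiltration -V "${SAMPLE}.vcf.gz" \
        --filter-name "lowQD" --filter-expression "QD < 2.0" \
        -O "${SAMPLE}.filtered.vcf.gz"                        # 6. Filter variants
}
```

Asking Claude to generate this for your own reference genome and sample sheet will fill in the project-specific details.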
18.3.4 Peak Calling (ChIP-seq/ATAC-seq)
> Run MACS2 peak calling on my ChIP-seq data
macs2 callpeak \
-t treatment.bam \
-c control.bam \
-f BAM \
-g hs \
-n experiment \
--outdir peaks/
18.4 Building Pipelines
18.4.1 Snakemake
> Create a Snakemake pipeline for RNA-seq analysis
# Snakefile
SAMPLES = ["sample1", "sample2", "sample3"]
rule all:
input:
expand("results/counts/{sample}.counts", sample=SAMPLES)
rule align:
input:
r1 = "data/{sample}_R1.fastq.gz",
r2 = "data/{sample}_R2.fastq.gz"
output:
"results/aligned/{sample}.bam"
shell:
"""
STAR --genomeDir {config[star_index]} \
--readFilesIn {input.r1} {input.r2} \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix results/aligned/{wildcards.sample}
mv results/aligned/{wildcards.sample}Aligned.sortedByCoord.out.bam {output}
"""
rule count:
input:
"results/aligned/{sample}.bam"
output:
"results/counts/{sample}.counts"
shell:
"featureCounts -a {config[gtf]} -o {output} {input}"
18.4.2 Nextflow
> Convert this Snakemake pipeline to Nextflow
18.5 Working with HPC/SLURM
18.5.1 Submitting Jobs
> Create a SLURM script to run this analysis on our cluster
#!/bin/bash
#SBATCH --job-name=rnaseq
#SBATCH --output=logs/%j.out
#SBATCH --error=logs/%j.err
#SBATCH --time=24:00:00
#SBATCH --mem=64G
#SBATCH --cpus-per-task=16
#SBATCH --partition=normal
module load star/2.7.10
module load samtools/1.17
./align.sh "$FASTQ_R1" "$FASTQ_R2" "$OUTPUT"
18.5.2 Array Jobs
> I have 50 samples. Set up a SLURM array job to process them in parallel
#!/bin/bash
#SBATCH --array=1-50
#SBATCH --job-name=align
#SBATCH --output=logs/align_%a.out
# Get sample name from list
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" sample_list.txt)
./align.sh "data/${SAMPLE}_R1.fastq.gz" "data/${SAMPLE}_R2.fastq.gz" "results/${SAMPLE}"
18.5.3 Monitoring Jobs
> Show me all my running jobs and their status
squeue -u $USER
sacct -j <job_id> --format=JobID,State,Elapsed,MaxRSS
18.6 Common Bioinformatics Tasks
18.6.1 Format Conversion
> Convert this GTF file to BED format
> Convert SAM to BAM and sort by coordinate
> Extract FASTA sequences for these BED regions
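For the GTF-to-BED conversion, a plain awk one-liner is often enough. The sketch below assumes a typical GTF layout where `gene_id "X";` is the first attribute; the `genes.gtf` line is a made-up example so the snippet is self-contained. Remember the coordinate shift: GTF is 1-based inclusive, BED is 0-based half-open.

```shell
# Tiny example GTF record (tab-separated, 1-based coordinates)
printf 'chr1\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "g1"; gene_name "GENE1";\n' > genes.gtf

# GTF -> BED for gene records: shift start by -1, pull gene_id from the attributes
awk 'BEGIN{FS=OFS="\t"} $3=="gene" {
    split($9, a, " ")            # attributes: gene_id "g1"; gene_name ...
    id = a[2]; gsub(/[";]/, "", id)
    print $1, $4 - 1, $5, id, ".", $7
}' genes.gtf > genes.bed

cat genes.bed    # -> chr1  99  200  g1  .  +
```

For the other two prompts, the usual one-liners are `samtools view -b input.sam | samtools sort -o sorted.bam` (SAM to sorted BAM) and `bedtools getfasta -fi ref.fa -bed regions.bed` (FASTA sequences for BED regions).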
18.6.2 Quality Control
> Run FastQC on all my FASTQ files and compile a MultiQC report
mkdir -p qc_reports
# Run FastQC
for f in data/*.fastq.gz; do
fastqc -o qc_reports/ "$f"
done
# Compile with MultiQC
multiqc qc_reports/ -o qc_reports/multiqc/
18.6.3 Subsetting and Filtering
> Extract only the mitochondrial reads from this BAM
> Filter variants with quality < 30
> Get sequences longer than 500bp from this FASTA
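The FASTA length filter is doable with awk alone, which is handy on clusters where you cannot install anything. A minimal sketch (the toy `input.fasta` is generated inline so the example runs as-is; note that filtered sequences are emitted on a single line):

```shell
# Build a tiny test FASTA: one 4 bp record, one 600 bp record
{
  echo ">short"; echo "ACGT"
  echo ">long"
  printf '%300s\n' '' | tr ' ' 'A'
  printf '%300s\n' '' | tr ' ' 'C'
} > input.fasta

# Keep records whose total sequence length is >= 500 bp
awk -v min=500 '
    /^>/ { if (seq != "" && length(seq) >= min) print hdr "\n" seq
           hdr = $0; seq = ""; next }
    { seq = seq $0 }
    END  { if (seq != "" && length(seq) >= min) print hdr "\n" seq }
' input.fasta > long.fasta

grep '^>' long.fasta    # -> >long
```

For the first two prompts, `samtools view -b input.bam chrM > chrM.bam` extracts mitochondrial reads (the contig may be `MT` depending on the reference), and `bcftools view -e 'QUAL<30'` drops low-quality variants.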
18.7 Debugging Bioinformatics Tools
18.7.1 Understanding Error Messages
> I'm getting this error from STAR:
> "FATAL: cannot open file"
> What does it mean?
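"Cannot open file" is almost always a path, permission, or disk problem rather than a STAR bug, so a few standard commands narrow it down quickly. A triage sketch (`GENOME_DIR` is a placeholder for whatever path the error message names):

```shell
GENOME_DIR=/path/to/star_index

# Does the path exist and is it readable by you?
ls -ld "$GENOME_DIR" 2>/dev/null || echo "missing or unreadable: $GENOME_DIR"

# Is the output disk full?
df -h . | tail -1

# STAR opens many files at once; a low open-file limit can also trigger this
ulimit -n
```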
18.7.2 Memory Issues
> My job keeps getting killed. How do I estimate memory requirements for STAR?
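A commonly cited rule of thumb from STAR's documentation is roughly 10 bytes of RAM per genome base, which a one-line calculation turns into a concrete request:

```shell
# ~3.1 Gb human genome at ~10 bytes per base
GENOME_BASES=3100000000
awk -v g="$GENOME_BASES" 'BEGIN{printf "~%.0f GB RAM\n", g * 10 / 1e9}'
# -> ~31 GB RAM
```

Request a little headroom in SLURM (e.g., `--mem=40G`), then check what the job actually used afterwards with `sacct --format=JobID,MaxRSS` and tighten future requests accordingly.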
18.7.3 Performance Optimization
> The alignment is taking forever. How can I speed it up?
Options:
- More threads
- Better indexing
- Subset data for testing
- Use faster storage
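Subsetting deserves emphasis: debug the pipeline on a few thousand reads, not the full run. Because FASTQ records are exactly four lines each, `head` gives a quick (non-random) subset with no extra tools. A self-contained sketch with a toy file:

```shell
# Make a toy FASTQ with 3 reads (4 lines per record)
for i in 1 2 3; do
  printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > reads.fastq

# Keep only the first 2 reads (2 reads x 4 lines = 8 lines)
head -n 8 reads.fastq > subset.fastq
grep -c '^@read' subset.fastq    # -> 2
```

On real gzipped data the same idea is `zcat R1.fastq.gz | head -n 400000 | gzip > test_R1.fastq.gz` (the first 100,000 reads); for a random subset, `seqtk sample` is the usual tool, assuming it is installed.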
18.8 Containerization
18.8.1 Docker
> Create a Dockerfile for my analysis pipeline
FROM continuumio/miniconda3:latest
RUN conda install -y -c bioconda \
star=2.7.10 \
samtools=1.17 \
bedtools=2.30
WORKDIR /data
ENTRYPOINT ["/bin/bash"]
18.8.2 Singularity (for HPC)
> Convert this Docker container to Singularity for use on our cluster
singularity pull docker://quay.io/biocontainers/star:2.7.10--h9ee0642_0
singularity exec star_2.7.10--h9ee0642_0.sif STAR --version
18.9 Reproducibility
18.9.1 Environment Management
> Export my conda environment so collaborators can reproduce it
conda env export > environment.yml
# Collaborator runs:
conda env create -f environment.yml
18.9.2 Recording Tool Versions
> Add version logging to my pipeline script
echo "Tool versions:" > versions.txt
samtools --version >> versions.txt
bedtools --version >> versions.txt
STAR --version >> versions.txt
18.10 What You’ve Learned
You can now:
- Install bioinformatics tools
- Run common analyses
- Build reproducible pipelines
- Work with HPC systems
- Debug tool issues
18.11 Next Steps
Continue to Part 6: Best Practices.