18 Bioinformatics Tool Integration
Bioinformatics involves many command-line tools, each with its own quirks. Claude Code helps navigate this complexity.
18.1 The Bioinformatics Challenge
- Dozens of tools, each with unique syntax
- Complex installation procedures
- Chaining tools into pipelines
- Managing file formats
- Dealing with compute clusters
18.2 Installing Tools
18.2.1 Package Managers
> Help me install samtools, bedtools, and STAR aligner
Using Conda (recommended for bioinformatics):
# Create environment
conda create -n biotools python=3.10
# Activate
conda activate biotools
# Install tools
conda install -c bioconda samtools bedtools star
Using Homebrew (Mac):
brew install samtools bedtools
18.2.2 When Installation Fails
> I'm getting this error installing STAR:
> [paste error message]
Claude diagnoses common issues:
- Missing dependencies
- Architecture mismatches (e.g., Apple Silicon Macs)
- Version conflicts
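A quick first step before digging into any install error is checking what platform you are actually on, since many bioconda packages lack native Apple Silicon builds. A minimal sketch (the `CONDA_SUBDIR` workaround assumes Rosetta 2 is installed):

```shell
# What architecture and OS is this machine?
uname -m    # "arm64" on Apple Silicon; many bioconda builds are x86_64-only
uname -s    # "Darwin" on macOS, "Linux" on Linux

# On Apple Silicon, one common workaround is an x86_64 environment run
# under Rosetta 2 (commented out here; adapt names to your setup):
# CONDA_SUBDIR=osx-64 conda create -n biotools_x86 python=3.10
```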
18.3 Common Tool Workflows
18.3.1 Sequence Alignment
> Create a script to align FASTQ files to a reference genome using STAR
#!/bin/bash
# align.sh - Align RNA-seq reads with STAR
GENOME_DIR="/path/to/star_index"
FASTQ_R1=$1
FASTQ_R2=$2
OUTPUT_PREFIX=$3
STAR \
--runThreadN 8 \
--genomeDir "$GENOME_DIR" \
--readFilesIn "$FASTQ_R1" "$FASTQ_R2" \
--readFilesCommand zcat \
--outFileNamePrefix "$OUTPUT_PREFIX" \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes Standard
18.3.2 BAM File Processing
> I have a BAM file. Help me:
> 1. Sort it
> 2. Index it
> 3. Get basic statistics
> 4. Extract reads from chromosome 1
# Sort
samtools sort input.bam -o sorted.bam
# Index
samtools index sorted.bam
# Statistics
samtools flagstat sorted.bam
samtools idxstats sorted.bam
# Extract chr1
samtools view -b sorted.bam chr1 > chr1.bam
18.3.3 Variant Calling
> Walk me through a basic variant calling pipeline with BWA and GATK
Claude provides the multi-step workflow:
1. Index reference
2. Align reads
3. Mark duplicates
4. Base quality recalibration
5. Call variants
6. Filter variants
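The six steps can be sketched as a single shell function. This is a minimal outline, not a production pipeline: `ref.fa`, `known.vcf.gz`, the read-group string, and the filter expression are all placeholders to adapt, and several GATK prerequisites (e.g., a sequence dictionary) are omitted for brevity. It is defined as a function so it can be sourced and reviewed before touching real data.

```shell
#!/bin/bash
# Sketch of a germline variant calling pipeline (BWA-MEM + GATK4).
# All file names are placeholders; flags are kept minimal.
call_variants() {
    REF=ref.fa; R1=$1; R2=$2; SAMPLE=$3

    bwa index "$REF"                                          # 1. Index reference
    samtools faidx "$REF"
    bwa mem -t 8 -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA" \
        "$REF" "$R1" "$R2" | samtools sort -o "${SAMPLE}.bam" # 2. Align + sort
    gatk MarkDuplicates -I "${SAMPLE}.bam" -O "${SAMPLE}.md.bam" \
        -M "${SAMPLE}.metrics"                                # 3. Mark duplicates
    gatk BaseRecalibrator -I "${SAMPLE}.md.bam" -R "$REF" \
        --known-sites known.vcf.gz -O recal.table             # 4. BQSR (build model)
    gatk ApplyBQSR -I "${SAMPLE}.md.bam" -R "$REF" \
        --bqsr-recal-file recal.table -O "${SAMPLE}.recal.bam"
    gatk HaplotypeCaller -I "${SAMPLE}.recal.bam" -R "$REF" \
        -O "${SAMPLE}.vcf.gz"                                 # 5. Call variants
    gatk VariantFiltration -V "${SAMPLE}.vcf.gz" \
        --filter-name "lowQD" --filter-expression "QD < 2.0" \
        -O "${SAMPLE}.filtered.vcf.gz"                        # 6. Filter variants
}
```

Asking Claude to generate this for your own reference genome and sample sheet will fill in the project-specific details.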
18.3.4 Peak Calling (ChIP-seq/ATAC-seq)
> Run MACS2 peak calling on my ChIP-seq data
macs2 callpeak \
-t treatment.bam \
-c control.bam \
-f BAM \
-g hs \
-n experiment \
--outdir peaks/
18.4 Building Pipelines
18.4.1 Snakemake
> Create a Snakemake pipeline for RNA-seq analysis
# Snakefile
SAMPLES = ["sample1", "sample2", "sample3"]
rule all:
input:
expand("results/counts/{sample}.counts", sample=SAMPLES)
rule align:
input:
r1 = "data/{sample}_R1.fastq.gz",
r2 = "data/{sample}_R2.fastq.gz"
output:
"results/aligned/{sample}.bam"
shell:
"""
STAR --genomeDir {config[star_index]} \
--readFilesIn {input.r1} {input.r2} \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix results/aligned/{wildcards.sample}
mv results/aligned/{wildcards.sample}Aligned.sortedByCoord.out.bam {output}
"""
rule count:
input:
"results/aligned/{sample}.bam"
output:
"results/counts/{sample}.counts"
shell:
"featureCounts -a {config[gtf]} -o {output} {input}"
18.4.2 Nextflow
> Convert this Snakemake pipeline to Nextflow
18.5 Working with HPC/SLURM
18.5.1 Submitting Jobs
> Create a SLURM script to run this analysis on our cluster
#!/bin/bash
#SBATCH --job-name=rnaseq
#SBATCH --output=logs/%j.out
#SBATCH --error=logs/%j.err
#SBATCH --time=24:00:00
#SBATCH --mem=64G
#SBATCH --cpus-per-task=16
#SBATCH --partition=normal
module load star/2.7.10
module load samtools/1.17
./align.sh "$FASTQ_R1" "$FASTQ_R2" "$OUTPUT"
18.5.2 Array Jobs
> I have 50 samples. Set up a SLURM array job to process them in parallel
#!/bin/bash
#SBATCH --array=1-50
#SBATCH --job-name=align
#SBATCH --output=logs/align_%a.out
# Get sample name from list
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" sample_list.txt)
./align.sh "data/${SAMPLE}_R1.fastq.gz" "data/${SAMPLE}_R2.fastq.gz" "results/${SAMPLE}"
18.5.3 Monitoring Jobs
> Show me all my running jobs and their status
squeue -u $USER
sacct -j <job_id> --format=JobID,State,Elapsed,MaxRSS
18.6 Common Bioinformatics Tasks
18.6.1 Format Conversion
> Convert this GTF file to BED format
> Convert SAM to BAM and sort by coordinate
> Extract FASTA sequences for these BED regions
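For the GTF-to-BED conversion, a plain awk one-liner is often enough. The sketch below assumes a typical GTF layout where `gene_id "X";` is the first attribute; the `genes.gtf` line is a made-up example so the snippet is self-contained. Remember the coordinate shift: GTF is 1-based inclusive, BED is 0-based half-open.

```shell
# Tiny example GTF record (tab-separated, 1-based coordinates)
printf 'chr1\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "g1"; gene_name "GENE1";\n' > genes.gtf

# GTF -> BED for gene records: shift start by -1, pull gene_id from the attributes
awk 'BEGIN{FS=OFS="\t"} $3=="gene" {
    split($9, a, " ")            # attributes: gene_id "g1"; gene_name ...
    id = a[2]; gsub(/[";]/, "", id)
    print $1, $4 - 1, $5, id, ".", $7
}' genes.gtf > genes.bed

cat genes.bed    # -> chr1  99  200  g1  .  +
```

For the other two prompts, the usual one-liners are `samtools view -b input.sam | samtools sort -o sorted.bam` (SAM to sorted BAM) and `bedtools getfasta -fi ref.fa -bed regions.bed` (FASTA sequences for BED regions).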
18.6.2 Quality Control
> Run FastQC on all my FASTQ files and compile a MultiQC report
mkdir -p qc_reports
# Run FastQC
for f in data/*.fastq.gz; do
fastqc -o qc_reports/ "$f"
done
# Compile with MultiQC
multiqc qc_reports/ -o qc_reports/multiqc/
18.6.3 Subsetting and Filtering
> Extract only the mitochondrial reads from this BAM
> Filter variants with quality < 30
> Get sequences longer than 500bp from this FASTA
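The FASTA length filter is doable with awk alone, which is handy on clusters where you cannot install anything. A minimal sketch (the toy `input.fasta` is generated inline so the example runs as-is; note that filtered sequences are emitted on a single line):

```shell
# Build a tiny test FASTA: one 4 bp record, one 600 bp record
{
  echo ">short"; echo "ACGT"
  echo ">long"
  printf '%300s\n' '' | tr ' ' 'A'
  printf '%300s\n' '' | tr ' ' 'C'
} > input.fasta

# Keep records whose total sequence length is >= 500 bp
awk -v min=500 '
    /^>/ { if (seq != "" && length(seq) >= min) print hdr "\n" seq
           hdr = $0; seq = ""; next }
    { seq = seq $0 }
    END  { if (seq != "" && length(seq) >= min) print hdr "\n" seq }
' input.fasta > long.fasta

grep '^>' long.fasta    # -> >long
```

For the first two prompts, `samtools view -b input.bam chrM > chrM.bam` extracts mitochondrial reads (the contig may be `MT` depending on the reference), and `bcftools view -e 'QUAL<30'` drops low-quality variants.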
18.7 Debugging Bioinformatics Tools
18.7.1 Understanding Error Messages
> I'm getting this error from STAR:
> "FATAL: cannot open file"
> What does it mean?
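"Cannot open file" is almost always a path, permission, or disk problem rather than a STAR bug, so a few standard commands narrow it down quickly. A triage sketch (`GENOME_DIR` is a placeholder for whatever path the error message names):

```shell
GENOME_DIR=/path/to/star_index

# Does the path exist and is it readable by you?
ls -ld "$GENOME_DIR" 2>/dev/null || echo "missing or unreadable: $GENOME_DIR"

# Is the output disk full?
df -h . | tail -1

# STAR opens many files at once; a low open-file limit can also trigger this
ulimit -n
```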
18.7.2 Memory Issues
> My job keeps getting killed. How do I estimate memory requirements for STAR?
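A commonly cited rule of thumb from STAR's documentation is roughly 10 bytes of RAM per genome base, which a one-line calculation turns into a concrete request:

```shell
# ~3.1 Gb human genome at ~10 bytes per base
GENOME_BASES=3100000000
awk -v g="$GENOME_BASES" 'BEGIN{printf "~%.0f GB RAM\n", g * 10 / 1e9}'
# -> ~31 GB RAM
```

Request a little headroom in SLURM (e.g., `--mem=40G`), then check what the job actually used afterwards with `sacct --format=JobID,MaxRSS` and tighten future requests accordingly.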
18.7.3 Performance Optimization
> The alignment is taking forever. How can I speed it up?
Options:
- More threads
- Better indexing
- Subset data for testing
- Use faster storage
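Subsetting deserves emphasis: debug the pipeline on a few thousand reads, not the full run. Because FASTQ records are exactly four lines each, `head` gives a quick (non-random) subset with no extra tools. A self-contained sketch with a toy file:

```shell
# Make a toy FASTQ with 3 reads (4 lines per record)
for i in 1 2 3; do
  printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > reads.fastq

# Keep only the first 2 reads (2 reads x 4 lines = 8 lines)
head -n 8 reads.fastq > subset.fastq
grep -c '^@read' subset.fastq    # -> 2
```

On real gzipped data the same idea is `zcat R1.fastq.gz | head -n 400000 | gzip > test_R1.fastq.gz` (the first 100,000 reads); for a random subset, `seqtk sample` is the usual tool, assuming it is installed.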
18.8 Containerization
18.8.1 Docker
> Create a Dockerfile for my analysis pipeline
FROM continuumio/miniconda3:latest
RUN conda install -y -c bioconda \
star=2.7.10 \
samtools=1.17 \
bedtools=2.30
WORKDIR /data
ENTRYPOINT ["/bin/bash"]
18.8.2 Singularity (for HPC)
> Convert this Docker container to Singularity for use on our cluster
singularity pull docker://quay.io/biocontainers/star:2.7.10--h9ee0642_0
singularity exec star_2.7.10--h9ee0642_0.sif STAR --version
18.9 Reproducibility
18.9.1 Environment Management
> Export my conda environment so collaborators can reproduce it
conda env export > environment.yml
# Collaborator runs:
conda env create -f environment.yml
18.9.2 Recording Tool Versions
> Add version logging to my pipeline script
echo "Tool versions:" > versions.txt
samtools --version >> versions.txt
bedtools --version >> versions.txt
STAR --version >> versions.txt
18.10 What You’ve Learned
You can now:
- Install bioinformatics tools
- Run common analyses
- Build reproducible pipelines
- Work with HPC systems
- Debug tool issues
18.11 Next Steps
Continue to Part 6: Best Practices.