Population stratification not corrected: Without PCA covariates, population structure produces thousands of false positives. Claude Code always includes PCA computation and covariate inclusion in the association model. Wrong regression model: Using linear regression for a case/control trait inf...

Claude Code for Variant Calling Claude Code for Computational Biology CLAUDE.md File Guide

Claude Code for Genomics GWAS Analysis (2026)

Last updated: April 20, 2026

Why Claude Code for GWAS

Genome-wide association studies involve processing millions of genetic variants across thousands of individuals. The statistical pipeline has critical quality control steps: Hardy-Weinberg equilibrium filtering, minor allele frequency thresholds, population stratification correction, linkage disequilibrium pruning, and multiple testing correction. Skip any step and you get false positives that do not replicate.

Claude Code generates PLINK2/REGENIE workflows that follow established GWAS best practices, produces QC reports at each step, and creates publication-ready Manhattan and QQ plots. It knows the difference between a logistic and linear regression GWAS and when to use mixed models for related individuals.

The Workflow

Step 1: Environment Setup

# Install GWAS tools
conda create -n gwas -c bioconda -c conda-forge \
  plink2=2.0 regenie=3.4 bcftools=1.19 \
  r-base=4.3 r-qqman pandas numpy matplotlib scipy
conda activate gwas
# Download reference data for population stratification
wget -P ref/ https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr1-22.shapeit2_integrated.GRCh38.20230126.phased.vcf.gz
mkdir -p gwas/{input,qc,assoc,plots}

Step 2: QC and Association Testing Pipeline

# gwas/run_gwas.py
"""GWAS pipeline: QC, PCA, association testing, visualization."""
import subprocess
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
MAF_THRESHOLD = 0.01
HWE_THRESHOLD = 1e-6
GENO_MISSING = 0.02
MIND_MISSING = 0.05
MAX_VARIANTS = 50_000_000
GENOME_WIDE_SIG = 5e-8
SUGGESTIVE_SIG = 1e-5
def run_cmd(cmd: str) -> None:
    """Execute shell command with error checking."""
    assert len(cmd) > 0, "Empty command"
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    assert result.returncode == 0, \
        f"Command failed: {cmd}\n{result.stderr[:500]}"
def qc_pipeline(bfile_prefix: str, output_prefix: str) -> dict:
    """Run standard GWAS QC steps with PLINK2."""
    assert Path(f"{bfile_prefix}.bed").exists(), \
        f"BED file not found: {bfile_prefix}.bed"
    stats = {}
    # Step 1: Initial variant and sample counts
    run_cmd(f"plink2 --bfile {bfile_prefix} --freq --out qc/freq")
    freq = pd.read_csv("qc/freq.afreq", sep=r'\s+')
    stats["initial_variants"] = len(freq)
    assert stats["initial_variants"] > 0
    # Step 2: Filter by MAF, HWE, missingness
    run_cmd(
        f"plink2 --bfile {bfile_prefix} "
        f"--maf {MAF_THRESHOLD} "
        f"--hwe {HWE_THRESHOLD} "
        f"--geno {GENO_MISSING} "
        f"--mind {MIND_MISSING} "
        f"--make-bed --out {output_prefix}_qc1"
    )
    # Step 3: LD pruning for PCA
    run_cmd(
        f"plink2 --bfile {output_prefix}_qc1 "
        f"--indep-pairwise 50 5 0.2 "
        f"--out qc/ld_prune"
    )
    # Step 4: PCA for population stratification
    run_cmd(
        f"plink2 --bfile {output_prefix}_qc1 "
        f"--extract qc/ld_prune.prune.in "
        f"--pca 10 "
        f"--out qc/pca"
    )
    # Count post-QC
    run_cmd(f"plink2 --bfile {output_prefix}_qc1 --freq --out qc/freq_post")
    freq_post = pd.read_csv("qc/freq_post.afreq", sep=r'\s+')
    stats["post_qc_variants"] = len(freq_post)
    stats["removed_variants"] = stats["initial_variants"] - stats["post_qc_variants"]
    print(f"QC complete: {stats['initial_variants']} -> {stats['post_qc_variants']} variants")
    return stats
def run_association(bfile_prefix: str, pheno_file: str,
                    covars: str = "qc/pca.eigenvec",
                    output_prefix: str = "assoc/gwas") -> str:
    """Run GWAS association testing with PLINK2."""
    assert Path(pheno_file).exists(), f"Phenotype file not found: {pheno_file}"
    assert Path(covars).exists(), f"Covariate file not found: {covars}"
    # Logistic regression for binary traits, linear for quantitative
    run_cmd(
        f"plink2 --bfile {bfile_prefix} "
        f"--pheno {pheno_file} "
        f"--covar {covars} "
        f"--covar-col-nums 3-12 "
        f"--glm hide-covar "
        f"--out {output_prefix}"
    )
    # Find the results file
    result_files = list(Path("assoc").glob("gwas.*.glm.*"))
    assert len(result_files) > 0, "No association results produced"
    return str(result_files[0])
def manhattan_plot(results_file: str, output_path: str) -> None:
    """Generate Manhattan plot from GWAS results."""
    assert Path(results_file).exists()
    df = pd.read_csv(results_file, sep='\t')
    assert "P" in df.columns, "Missing P-value column"
    assert "#CHROM" in df.columns or "CHR" in df.columns
    chr_col = "#CHROM" if "#CHROM" in df.columns else "CHR"
    df = df.dropna(subset=["P"])
    df = df[df["P"] > 0]  # remove P=0
    df["-log10p"] = -np.log10(df["P"])
    # Assign cumulative position for x-axis
    df["chr_num"] = pd.to_numeric(df[chr_col], errors="coerce")
    df = df.dropna(subset=["chr_num"])
    fig, ax = plt.subplots(figsize=(16, 6))
    colors = ["#1f77b4", "#aec7e8"]
    cumulative_pos = 0
    xtick_positions = []
    xtick_labels = []
    for chrom in sorted(df["chr_num"].unique()):
        chrom_data = df[df["chr_num"] == chrom].copy()
        chrom_data["plot_pos"] = chrom_data["POS"] + cumulative_pos
        color = colors[int(chrom) % 2]
        ax.scatter(chrom_data["plot_pos"], chrom_data["-log10p"],
                   c=color, s=1, alpha=0.7)
        xtick_positions.append(
            cumulative_pos + chrom_data["POS"].median())
        xtick_labels.append(str(int(chrom)))
        cumulative_pos += chrom_data["POS"].max()
    ax.axhline(-np.log10(GENOME_WIDE_SIG), color="red",
               linestyle="--", linewidth=0.8, label="Genome-wide (5e-8)")
    ax.axhline(-np.log10(SUGGESTIVE_SIG), color="blue",
               linestyle="--", linewidth=0.5, label="Suggestive (1e-5)")
    ax.set_xticks(xtick_positions)
    ax.set_xticklabels(xtick_labels, fontsize=8)
    ax.set_xlabel("Chromosome")
    ax.set_ylabel("-log10(P)")
    ax.set_title("GWAS Manhattan Plot")
    ax.legend()
    plt.tight_layout()
    plt.savefig(output_path, dpi=300)
    print(f"Manhattan plot saved: {output_path}")
    # Report significant hits
    sig_hits = df[df["P"] < GENOME_WIDE_SIG]
    print(f"Genome-wide significant variants: {len(sig_hits)}")
if __name__ == "__main__":
    import sys
    assert len(sys.argv) >= 3, \
        "Usage: python run_gwas.py <bfile_prefix> <pheno_file>"
    bfile = sys.argv[1]
    pheno = sys.argv[2]
    stats = qc_pipeline(bfile, "qc/study")
    results = run_association("qc/study_qc1", pheno)
    manhattan_plot(results, "plots/manhattan.png")

Step 3: Validate Results

python3 gwas/run_gwas.py input/study input/phenotypes.tsv
# Expected output:
# QC complete: 8500000 -> 6200000 variants
# Genome-wide significant variants: 3-15 (depends on trait)
# QQ plot for genomic inflation check
# Lambda_GC should be 1.00-1.05 for well-controlled GWAS

CLAUDE.md for GWAS

# GWAS Pipeline Rules
## Standards
- PLINK2 file format (BED/BIM/FAM)
- VCF 4.3 for variant input
- GWAS Catalog submission format
## File Formats
- .bed/.bim/.fam (PLINK binary)
- .vcf.gz (variant calls)
- .eigenvec (PCA covariates)
- .glm.logistic / .glm.linear (results)
## Libraries
- PLINK2 2.0+
- REGENIE 3.4+ (for biobank-scale)
- LDSC (LD Score Regression)
- pandas, numpy, matplotlib
## Testing
- Genomic inflation factor (lambda_GC): 1.00-1.05
- QQ plot must show no systematic inflation
- Ti/Tv for quality variants: expected ~2.1
- Replication in independent cohort
## QC Thresholds
- MAF > 0.01
- HWE p > 1e-6
- Genotype missingness < 2%
- Sample missingness < 5%
- PCA: correct for top 10 principal components

Common Pitfalls

Population stratification not corrected: Without PCA covariates, population structure produces thousands of false positives. Claude Code always includes PCA computation and covariate inclusion in the association model.
Wrong regression model: Using linear regression for a case/control trait inflates type I error. Claude Code detects binary vs quantitative phenotypes and selects logistic or linear regression accordingly.
Multiple testing not accounted for: Reporting uncorrected p-values leads to false discoveries. Claude Code applies Bonferroni correction (5e-8 for genome-wide) and generates both Manhattan and QQ plots for visual inspection.

Frequently Asked Questions

Do I need a paid Anthropic plan to use this?

Claude Code works with any Anthropic API plan, including the free tier. However, the free tier has lower rate limits (requests per minute and tokens per minute) that may slow down multi-step workflows. For professional use, the Build or Scale plan provides higher limits and priority access during peak hours.

How does this affect token usage and cost?

The token cost depends on the size of your prompts and Claude’s responses. Typical development tasks consume 10K-50K tokens per interaction. Using a CLAUDE.md file and skills reduces exploration tokens by 50-80%, which directly lowers costs. Monitor your usage at console.anthropic.com/settings/billing.

Can I customize this for my specific project?

Yes. All Claude Code behavior can be customized through CLAUDE.md (project rules), .claude/settings.json (permissions), and .claude/skills/ (domain knowledge). The most impactful customization is adding your project’s specific patterns, conventions, and common commands to CLAUDE.md so Claude Code follows your standards from the start.

What happens when Claude Code makes a mistake?

Claude Code creates files and edits through standard filesystem operations, so all changes are visible in git diff. If a change is wrong, revert it with git checkout -- <file> for a single file or git stash for all changes. Claude Code does not make irreversible changes unless you explicitly allow destructive commands in settings.json.

Practical Details

When working with Claude Code on this topic, keep these implementation details in mind:

Project Configuration. Your CLAUDE.md should include specific references to how your project handles this area. Include file paths, naming conventions, and any project-specific patterns that differ from defaults. Claude Code reads this file at session start and uses it to guide all operations.

Integration with Existing Tools. Claude Code works alongside your existing development tools rather than replacing them. It respects .gitignore for file visibility, uses your project’s installed dependencies, and follows the build/test scripts defined in package.json (or equivalent). Ensure your toolchain is working correctly before involving Claude Code.

Performance Considerations. For large codebases (10,000+ files), Claude Code’s file scanning can be slow if not properly scoped. Use .claudeignore to exclude generated directories (dist, build, .next, coverage) and dependency directories (node_modules, vendor). This typically reduces scan time by 80-90%.

Version Control Integration. All changes Claude Code makes are regular filesystem operations visible to git. Use git diff after each significant change to review what was modified. For experimental changes, create a branch first with git checkout -b experiment/topic so you can easily discard or keep the results.

Build yours → Create a custom CLAUDE.md with our Generator Tool.

Estimate tokens → Calculate your usage with our Token Estimator.

Try it: Paste your error into our Error Diagnostic for an instant fix.

Implementation Details

When working with this in Claude Code, pay attention to these practical details:

Project configuration. Add specific instructions to your CLAUDE.md file describing how your project handles this area. Include file paths, naming conventions, and any patterns that differ from common defaults. Claude Code reads CLAUDE.md at the start of every session and uses it to guide all operations.

Testing the setup. After configuration, verify everything works by running a simple test task. Ask Claude Code to perform a read-only operation first (like listing files or reading a config) before moving to write operations. This confirms that permissions, paths, and tools are all correctly configured.

Monitoring and iteration. Track your results over several sessions. If Claude Code consistently makes the same mistake, the fix is usually a more specific CLAUDE.md instruction. If it makes different mistakes each time, the issue is likely in the project setup or toolchain configuration.

Troubleshooting Checklist

When something does not work as expected, check these items in order:

CLAUDE.md exists at the project root — run ls -la CLAUDE.md to verify
Node.js version is 18+ — run node --version to check
API key is set — run echo $ANTHROPIC_API_KEY | head -c 10 to verify (shows first 10 characters only)
Disk space is available — run df -h . to check
Network can reach the API — run curl -s -o /dev/null -w "%{http_code}" https://api.anthropic.com (should return 401 without auth, meaning the server is reachable)
No conflicting processes — run ps aux | grep claude | grep -v grep to check for stale sessions