bjpop/crpipe

Language: Python

git: https://github.com/bjpop/crpipe

Bioinformatics pipeline based on Ruffus

A bioinformatics pipeline based on Ruffus

Author: Bernie Pope (bjpope@unimelb.edu.au)

Crpipe is based on the Ruffus library for writing bioinformatics pipelines. Its features include:

  • Job submission on a cluster using DRMAA (currently only tested with SLURM).
  • Job dependency calculation and checkpointing.
  • Pipeline can be displayed as a flowchart.
  • Re-running a pipeline will start from the most up-to-date stage. It will not redo previously completed tasks.
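The last feature can be pictured with a simplified timestamp rule: a stage needs to run only when its output is missing or older than its input. The sketch below is an illustration in Python 3, not Ruffus's actual implementation (the help output further down also mentions a checksum database). The function name `needs_rerun` and the file names are hypothetical.

```python
import os
import tempfile


def needs_rerun(input_path, output_path):
    """Return True if the output is missing or older than the input."""
    if not os.path.exists(output_path):
        return True
    return os.path.getmtime(output_path) < os.path.getmtime(input_path)


with tempfile.TemporaryDirectory() as workdir:
    fastq = os.path.join(workdir, "sample1_R1.fastq.gz")
    bam = os.path.join(workdir, "sample1.bam")

    # Create a stand-in input file; no output exists yet.
    open(fastq, "w").close()
    before_run = needs_rerun(fastq, bam)

    # Pretend the stage ran: write the output and give it a later timestamp.
    open(bam, "w").close()
    os.utime(fastq, (1000, 1000))
    os.utime(bam, (2000, 2000))
    after_run = needs_rerun(fastq, bam)

    # Make the input newer than the output again.
    os.utime(fastq, (3000, 3000))
    after_input_change = needs_rerun(fastq, bam)

print(before_run, after_run, after_input_change)
```

Explicit timestamps are set with `os.utime` so the comparison does not depend on the clock resolution of the file system.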

License

3-Clause BSD License. See LICENSE.txt in the source repository.

Installation

External dependencies

crpipe depends on the following programs and libraries:

  • python (version 2.7.5)
  • DRMAA for submitting jobs to the cluster (crpipe uses the Python wrapper to do this).
    You need to install your own libdrmaa.so for your local job submission system. Versions are
    available for common schedulers such as Torque/PBS, SLURM and so on.
  • fastqc (version 0.10.1)
  • samtools (version 1.1)
  • bwa for aligning reads to the reference genome (version 0.7.12)
  • sambamba for sorting bam files (version 0.5.4).
  • lumpy for calling structural variants (version 0.2.11)
  • svtyper for genotyping the structural variants (version 0.0.2)

You will need to install these dependencies yourself.
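Before running the pipeline it can help to confirm that the external programs are reachable. The following is a minimal sketch in Python 3 (it is separate from crpipe itself, which targets Python 2.7). The executable names in `REQUIRED_TOOLS` are assumptions based on the list above and may differ on your system, and `missing_tools` is a hypothetical helper.

```python
import shutil

# Executable names are assumptions for illustration; adjust to your installation.
REQUIRED_TOOLS = ["fastqc", "samtools", "bwa", "sambamba", "lumpy", "svtyper"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All external tools were found.")
```

This only checks that each program can be found; it does not check the versions listed above.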

I recommend using a virtual environment:

cd /place/to/install
virtualenv crpipe
source crpipe/bin/activate
pip install -U git+https://github.com/bjpop/crpipe

If you don't want to use a virtual environment then you can just install with pip:

pip install -U git+https://github.com/bjpop/crpipe

Worked example

The example directory in the source distribution contains a small dataset to illustrate the use of the pipeline.

Get a copy of the source distribution

cd /path/to/test/directory
git clone https://github.com/bjpop/crpipe

Install crpipe as described above

Get a reference genome.

cd crpipe/example
mkdir reference
# copy your reference into this directory, or make a symbolic link
# call it reference/genome.fa

Tell Python where your DRMAA library is

For example (this will depend on your local settings):

export DRMAA_LIBRARY_PATH=/usr/local/slurm_drmaa/1.0.7-gcc/lib/libdrmaa.so
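A wrong or missing setting here tends to surface only when the first cluster job is submitted, so a quick check beforehand can save time. The sketch below, in Python 3, only confirms that the variable is set and points to an existing file. The helper name `drmaa_library_problem` is hypothetical, and the check does not prove that the library can actually be loaded.

```python
import os


def drmaa_library_problem(environ):
    """Return a description of the problem, or None if the setting looks usable."""
    path = environ.get("DRMAA_LIBRARY_PATH")
    if not path:
        return "DRMAA_LIBRARY_PATH is not set"
    if not os.path.isfile(path):
        return "DRMAA_LIBRARY_PATH points to a missing file: " + path
    return None


problem = drmaa_library_problem(os.environ)
print(problem or "DRMAA library found")
```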

Run crpipe and ask it what it will do next

crpipe -n --verbose 3

Generate a flowchart diagram

crpipe --flowchart pipeline_flow.png --flowchart_format png

Run the pipeline

crpipe --use_threads --log_file pipeline.log --jobs 2 --verbose 3

Usage

You can get a summary of the command line arguments like so:

crpipe -h
usage: crpipe [-h] [--verbose [VERBOSE]] [-L FILE] [-T JOBNAME] [-j N]
              [--use_threads] [-n] [--touch_files_only] [--recreate_database]
              [--checksum_file_name FILE] [--flowchart FILE]
              [--key_legend_in_graph] [--draw_graph_horizontally]
              [--flowchart_format FORMAT] [--forced_tasks JOBNAME]
              [--config CONFIG] [--jobscripts JOBSCRIPTS] [--version]

Colorectal cancer pipeline

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       Pipeline configuration file in YAML format, defaults
                        to pipeline.config
  --jobscripts JOBSCRIPTS
                        Directory to store cluster job scripts created by the
                        pipeline, defaults to jobscripts
  --version             show program's version number and exit

Common options:
  --verbose [VERBOSE], -v [VERBOSE]
                        Print more verbose messages for each additional
                        verbose level.
  -L FILE, --log_file FILE
                        Name and path of log file

pipeline arguments:
  -T JOBNAME, --target_tasks JOBNAME
                        Target task(s) of pipeline.
  -j N, --jobs N        Allow N jobs (commands) to run simultaneously.
  --use_threads         Use multiple threads rather than processes. Needs
                        --jobs N with N > 1
  -n, --just_print      Don't actually run any commands; just print the
                        pipeline.
  --touch_files_only    Don't actually run the pipeline; just 'touch' the
                        output for each task to make them appear up to date.
  --recreate_database   Don't actually run the pipeline; just recreate the
                        checksum database.
  --checksum_file_name FILE
                        Path of the checksum file.
  --flowchart FILE      Don't run any commands; just print pipeline as a
                        flowchart.
  --key_legend_in_graph
                        Print out legend and key for dependency graph.
  --draw_graph_horizontally
                        Draw horizontal dependency graph.
  --flowchart_format FORMAT
                        format of dependency graph file. Can be 'pdf', 'svg',
                        'svgz' (Structured Vector Graphics), 'pdf', 'png'
                        'jpg' (bitmap graphics) etc
  --forced_tasks JOBNAME
                        Task(s) which will be included even if they are up to
                        date.

Configuration file

You must supply a configuration file for the pipeline in YAML format.

Here is an example:

```
# Default settings for the pipeline stages.
# These can be overridden in the stage settings below.

defaults:
    # Number of CPU cores to use for the task
    cores: 1
    # Maximum memory in gigabytes for a cluster job
    mem: 4
    # VLSCI account for quota
    account: VR0002
    queue: main
    # Maximum allowed running time on the cluster in Hours:Minutes
    walltime: '1:00'
    # Load modules for running a command on the cluster.
    modules:
    # Run on the local machine (where the pipeline is run)
    # instead of on the cluster. False means run on the cluster.
    local: False

# Stage-specific settings. These override the defaults above.
# Each stage must have a unique name. This name will be used in
# the pipeline to find the settings for the stage.

stages:
    # Run quality checks on the FASTQ files using fastQC
    fastqc:
        walltime: '10:00'
        mem: 8
        modules:
            - 'fastqc/0.10.1'

    # Index the hg19 human genome reference with BWA
    index_reference_bwa:
        walltime: '10:00'
        mem: 8
        modules:
            - 'bwa-intel/0.7.12'

    # Index the hg19 human genome reference with samtools
    index_reference_samtools:
        walltime: '10:00'
        mem: 8
        modules:
            - 'samtools-intel/1.1'

    # Align paired end FASTQ files to the reference
    align_bwa:
        cores: 8
        walltime: '48:00'
        mem: 32
        modules:
            - 'bwa-intel/0.7.12'
            - 'samtools-intel/1.1'

# The Human Genome in FASTA format

reference: /path/to/reference/genome.fa

# The input FASTQ files.

fastqs:
    - /path/to/fastqs/sample1_R1.fastq.gz
    - /path/to/fastqs/sample1_R2.fastq.gz
    - /path/to/fastqs/sample2_R1.fastq.gz
    - /path/to/fastqs/sample2_R2.fastq.gz

read_groups:
    'sample1': '@RG\tID:id1\tPU:pu1\tSM:sample1\tPL:ILLUMINA\tLB:lib_sample1'
    'sample2': '@RG\tID:id2\tPU:pu2\tSM:sample2\tPL:ILLUMINA\tLB:lib_sample2'
```
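The rule that stage-specific settings override the defaults can be sketched as a simple lookup with a fallback. This is an illustration in Python 3, not crpipe's own code: the function name `stage_option` is hypothetical, and a hand-written dict stands in for the parsed YAML file so that the example needs no YAML library.

```python
# A hand-written dict standing in for the parsed configuration file.
config = {
    "defaults": {"cores": 1, "mem": 4, "walltime": "1:00", "local": False},
    "stages": {
        "fastqc": {"walltime": "10:00", "mem": 8},
        "align_bwa": {"cores": 8, "walltime": "48:00", "mem": 32},
    },
}


def stage_option(config, stage, option):
    """Look up an option for a stage, falling back to the defaults."""
    stage_settings = config["stages"].get(stage, {})
    if option in stage_settings:
        return stage_settings[option]
    return config["defaults"][option]


print(stage_option(config, "fastqc", "mem"))       # set by the stage
print(stage_option(config, "fastqc", "cores"))     # taken from the defaults
print(stage_option(config, "align_bwa", "cores"))  # set by the stage
```

With the values above, `fastqc` gets 8 gigabytes of memory from its own settings but inherits a single core from the defaults, while `align_bwa` requests 8 cores explicitly.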