SciLifeLab/facs

Language: Jupyter Notebook

git: https://github.com/SciLifeLab/facs

Fast and Accurate Classification of Sequences using Bloom filters

FACS (Fast and Accurate Classification of Sequences) C implementation

WARNING: This program is under active development and this documentation might not reflect reality.
Please file a GitHub issue and we will take care of it as soon as we can.

Introduction

FACS is the C implementation of an earlier Perl module. Select the
perl branch if you want to look at the old (unsupported) implementation.

Some components of this work are based on the excellent Perl Bloom::Faster
implementation.

Overview

  • 'build' is for building a bloom filter from a reference file.
    It supports large genome files (>4GB), such as the human genome.
  • 'query' is for querying a fastq/fasta file against the bloom filter.
  • 'remove' is for removing contamination sequences from a fastq/fasta file.
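
To make the build/query idea concrete, here is a minimal Bloom filter sketch in Python 3. It is not FACS's implementation: the SHA-256 based hashing, the filter size, the k-mer length of 8 and all function names are assumptions chosen for illustration (FACS itself is written in C and uses the 'lookup8' hash).

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus several salted hash functions."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest (illustrative choice).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build(reference, k=8):
    """Insert all k-mers of the reference into a new filter."""
    bloom = BloomFilter()
    for kmer in kmers(reference, k):
        bloom.add(kmer)
    return bloom


def count_hits(bloom, read, k=8):
    """Count how many k-mers of the read are reported by the filter."""
    return sum(1 for kmer in kmers(read, k) if kmer in bloom)


reference = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
bloom = build(reference)
print(count_hits(bloom, "TGCATGTCGCATG"))  # read copied from the reference -> 6
print(count_hits(bloom, "TTTTTTTTTTTTT"))  # unrelated read, expected to score low
```

A Bloom filter never produces false negatives, so every k-mer of a read taken from the reference is reported. False positives are possible, which is why a hit count is evidence of contamination rather than proof.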

Quickstart

To fetch the source code, run:

$ git clone https://github.com/SciLifeLab/facs

For the Python interface, it is highly recommended to install and run FACS under
a Python virtual environment. Python virtual environments provide an isolated
environment to run your Python code, which solves dependency and version problems
and, indirectly, permission problems. See the virtualenv documentation for details.

To easily install a virtual environment you can use virtualenv-burrito.
Follow its instructions to create a new virtual environment.

Installing


For a standalone facs command-line tool, type: make.

To compile the Python bindings, first create and activate the virtual environment, then type: make python.

Citation

Henrik Stranneheim, Max Käller, Tobias Allander, Björn Andersson, Lars Arvestad, Joakim Lundeberg: Classification of DNA sequences using Bloom filters. Bioinformatics 26(13): 1595-1600 (2010)

License

The code is freely available under the MIT license, as is the hashing algorithm 'lookup8', which was developed by Bob Jenkins and is used under the MIT license.

Usage

FACS uses a command-line structure similar to that of the popular bwa.
There are three main commands: build, query and remove. Their flags may differ slightly, but they should
behave similarly.

$ ./facs -h

Program: facs - Sequence analysis using bloom filters
Version: 2.0 
Contact: Enze Liu <enze.liu@scilifelab.se>

Usage:   facs <command> [options]

Command: build         build a bloom filter from a FASTA/FASTQ reference file
         query         query a bloom filter given a FASTA/FASTQ file
         remove        remove (contamination) sequences from FASTQ/FASTA file

For example, to build a bloom filter out of a FASTA reference genome, one should type:

$ ./facs build -r ecoli.fasta -o ecoli.bloom

That generates an E. coli bloom filter that can be used to query a FASTQ file:

$ ./facs query -r ecoli.bloom -q contaminated_sample.fastq.gz -f "json"

Both plaintext and gzip-compressed fastq files are supported transparently.

The query returns some metrics, in JSON format, indicating how many reads might
be contaminated with E. coli in that particular sample:

{
    "timestamp": "2013-03-27T11:16:21.809+0100",
    "organism": "test200.fastq",
    "bloom_filter": "eschColi_K12.bloom",
    "total_read_count": 201,
    "contaminated_reads": 1,
    "total_hits": 36,
    "contamination_rate": 0.004975,
    "p_value": 1.522929e-01
}
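
As a rough illustration of how such a report could be consumed, the Python 3 sketch below parses a report and recomputes the contamination rate as contaminated_reads / total_read_count. It assumes the report is valid JSON (comma-separated fields) and that the rate is defined this way, which matches the numbers shown (1 / 201 ≈ 0.004975). The embedded string stands in for the output of `./facs query`.

```python
import json

# Stand-in for the text printed by `./facs query ... -f "json"` (assumed to be valid JSON).
report_text = """
{
    "organism": "test200.fastq",
    "bloom_filter": "eschColi_K12.bloom",
    "total_read_count": 201,
    "contaminated_reads": 1,
    "contamination_rate": 0.004975
}
"""

report = json.loads(report_text)

# Recompute the rate from the raw counts and compare it with the reported value.
rate = report["contaminated_reads"] / report["total_read_count"]
print(f"{rate:.6f}")  # 0.004975
```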

To get tsv output that is easy to import into LibreOffice or Excel, pass
-f "tsv" on the command line; a tsv file will be written in the local directory:

$ cat test200.fastq.tsv
organism    bloom_filter    total_read_count    contaminated_reads  contamination_rate
test200.fastq   eschColi_K12.bloom  201 1   0.004975
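
The tsv file can also be read programmatically. The following Python 3 sketch assumes the columns are separated by single tab characters and uses an inline string in place of test200.fastq.tsv.

```python
import csv
import io

# Stand-in for the contents of test200.fastq.tsv (tab-separated, by assumption).
tsv_text = (
    "organism\tbloom_filter\ttotal_read_count\tcontaminated_reads\tcontamination_rate\n"
    "test200.fastq\teschColi_K12.bloom\t201\t1\t0.004975\n"
)

# DictReader maps each row to the header names in the first line.
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
for row in rows:
    print(row["organism"], int(row["contaminated_reads"]), float(row["contamination_rate"]))
```

To read the real file, replace `io.StringIO(tsv_text)` with an open file handle, for example `open("test200.fastq.tsv", newline="")`.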

Finally, to remove those reads from the sample, run the following
command:

$ ./facs remove -r ecoli.bloom -q contaminated_sample.fastq

Output:
Clean sequences are written to stdout and contaminated sequences to stderr.
Both streams can be redirected to files, for instance:

$ (./facs remove -r ecoli.bloom -q contaminated_sample.fastq > clean_part.fastq) >& contaminated_part.fastq
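
The same stream separation can be scripted. This Python 3 sketch sends a command's stdout and stderr to two different files with the standard subprocess module. The helper name `split_streams` is hypothetical, and a small Python one-liner stands in for `./facs remove` so the example runs without the binary.

```python
import os
import subprocess
import sys
import tempfile


def split_streams(command, stdout_path, stderr_path):
    """Run command, writing its stdout and stderr to separate files."""
    with open(stdout_path, "wb") as out, open(stderr_path, "wb") as err:
        return subprocess.run(command, stdout=out, stderr=err).returncode


# Stand-in for ["./facs", "remove", "-r", "ecoli.bloom", "-q", "contaminated_sample.fastq"].
demo_command = [
    sys.executable, "-c",
    "import sys; sys.stdout.write('clean\\n'); sys.stderr.write('contam\\n')",
]

workdir = tempfile.mkdtemp()
clean_path = os.path.join(workdir, "clean_part.fastq")
contam_path = os.path.join(workdir, "contaminated_part.fastq")

exit_code = split_streams(demo_command, clean_path, contam_path)
print(exit_code, clean_path, contam_path)
```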

If an output path is specified with '-o', two output files will be generated:

contaminated_sample_ecoli_contam.fastq
contaminated_sample_ecoli_clean.fastq

MPI facs2.0 version

The MPI version of facs 2.0 can be used on multi-CPU systems, for instance a cluster, in order to take advantage
of both multiple cores and multiple CPUs at the same time.

Usage:

First download the facs package and run 'make', then 'make mpi'. A binary file named 'facs_mpi' will be generated.

$ mpirun -np number_of_cpu ./facs_mpi -r reference_bloom_filter -q query_sequence

Note that, besides the OpenMP library, MPI facs2.0 requires an MPI library (Open MPI, MPICH, etc.).

Python interface

A Python C extension provides a very simple API to build, query and remove sequences,
just as described above for the plain C-based command line.

$ python
Python 2.6.6 (r266:84292, Jun 18 2012, 09:57:52) 
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import facs
>>> facs.build("ecoli.fasta", "ecoli.bloom")
>>> facs.query("contaminated_sample.fastq.gz", "ecoli.bloom")
>>> facs.remove("contaminated_sample.fastq", "ecoli.bloom")

Update results to a database

FACS provides results in JSON format, which eases the
storage of these results in a CouchDB instance. To do so, you need to create a
configuration file with the information for your CouchDB instance.

The file should be named either .facsrc or .facs.cnf and should be located in
your home directory. For system-wide installations it can also be located at
/etc/facs.conf.

The format should be like this:

[facs]
SERVER: <your server address>
FACS_DB: <DB name>
FASTQ_SCREEN_DB: <DB name>
DECONSEQ_DB: <DB name>
USER: <username>
PASSWORD: <password>
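
A file in this format can be read with Python's standard configparser module, which accepts ':' as a key/value delimiter. The sketch below (Python 3) is only an illustration of the format, not the code FACS uses. The helper names, the search order of the three paths and the example values (server address, user, password) are all assumptions.

```python
import configparser
import os

# Candidate locations from the text above; later files override earlier ones
# when several exist (this precedence is an assumption, not documented behaviour).
CANDIDATE_PATHS = [
    "/etc/facs.conf",
    os.path.expanduser("~/.facs.cnf"),
    os.path.expanduser("~/.facsrc"),
]


def new_parser():
    # interpolation=None keeps '%' in passwords literal;
    # optionxform=str preserves the upper-case key names.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str
    return parser


def load_facs_config(paths=CANDIDATE_PATHS):
    """Return the [facs] section as a dict, or None if no file defines it."""
    parser = new_parser()
    parser.read(paths)  # missing files are skipped silently
    return dict(parser["facs"]) if parser.has_section("facs") else None


# Hypothetical values, used only to show how the format is parsed.
example = """[facs]
SERVER: couch.example.org:5984
FACS_DB: facs
USER: alice
PASSWORD: s3cret%
"""

parser = new_parser()
parser.read_string(example)
settings = dict(parser["facs"])
print(settings["SERVER"])  # couch.example.org:5984
```

Only the first ':' on each line separates the key from the value, so a server address that contains a port number is kept intact.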