seqtk
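seqtk is a lightweight toolkit for processing FASTA/FASTQ sequences, developed by the well-known bioinformatician Heng Li (李恒).
It is written in C and runs very fast, which saves considerable time on routine work. Everyday sequence-processing tasks it handles include converting FASTQ to FASTA, reformatting sequences, extracting subsequences, and randomly sampling sequences.
It can be installed with conda or downloaded from GitHub. ( conda create -p <install path> -c bioconda seqtk==<latest version> )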
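To make the FASTQ-to-FASTA conversion mentioned above more concrete, here is a minimal Python sketch of what such a conversion does conceptually. It is not seqtk's implementation (seqtk is written in C). It assumes well-formed four-line FASTQ records with no blank lines, and the function name `fastq_to_fasta` is hypothetical.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines from 4-line FASTQ records (assumption: no wrapped sequences)."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, plus, _qual = record
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"malformed FASTQ record: {header!r}")
            # FASTA keeps the name and the sequence; the quality line is dropped.
            yield ">" + header[1:]
            yield seq
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")


# Tiny in-memory example with two reads.
fastq = ["@read1 desc", "ACGT", "+", "IIII", "@read2", "GGCC", "+", "FFFF"]
fasta = list(fastq_to_fasta(fastq))
print("\n".join(fasta))
```

For real data, use seqtk itself, which is far faster than a pure Python loop on large files.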
[wangzq@server7 seqtk]$ seqtk
Usage: seqtk <command> <arguments>
Version: 1.3-r117-dirty
Command: seq       common transformation of FASTA/Q  # general-purpose FASTA/Q conversion tool (options allow simple filtering)
         comp      get the nucleotide composition of FASTA/Q  # nucleotide composition statistics (reported per read)
         sample    subsample sequences  # randomly subsample reads from a FASTQ file
         subseq    extract subsequences from FASTA/Q  # extract reads from FASTA/Q (by a list of read IDs or by a BED file)
         fqchk     fastq QC (base/quality summary)  # base quality statistics (at the sample level: a summary for each position across the whole sample)
         mergepe   interleave two PE FASTA/Q files  # interleave the two FASTQ files of a paired-end run (each read pair then occupies 8 lines)
         split     split one file into multiple smaller files  # split one FASTQ file into multiple smaller FASTQ files
         trimfq    trim FASTQ using the Phred algorithm  # trim FASTQ with the Phred algorithm (parameters can be set); can also trim a given number of bp from the left and right ends of reads
         hety      regional heterozygosity  # regional heterozygosity
         gc        identify high- or low-GC regions  # identify regions of high or low GC content
         mutfa     point mutate FASTA at specified positions  # introduce point mutations into a FASTA at specified positions
         mergefa   merge two FASTA/Q files  # merge two FASTA/Q files
         famask    apply a X-coded FASTA to a source FASTA
         dropse    drop unpaired from interleaved PE FASTA/Q  # drop unpaired reads from an interleaved FASTQ file
         rename    rename sequence names  # rename reads
         randbase  choose a random base from hets
         cutN      cut sequence at long N  # cut sequences at long runs of N
         gap       get the gap locations
         listhet   extract the position of each het
         hpc       homopolyer-compressed sequence
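The note on mergepe above says that, after interleaving, each read pair occupies 8 lines. The following Python sketch illustrates that layout. It is only a conceptual illustration under stated assumptions, not seqtk's code: it assumes four-line FASTQ records and that both inputs list the reads in the same order. The helper names `read_records` and `interleave` are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group an iterable of lines into 4-line FASTQ records."""
    it = iter(lines)
    while True:
        rec = list(islice(it, 4))
        if not rec:
            return
        if len(rec) < 4:
            raise ValueError("truncated FASTQ record")
        yield [line.rstrip("\n") for line in rec]


def interleave(r1_lines: Iterable[str], r2_lines: Iterable[str]) -> Iterator[str]:
    """Emit read 1 then read 2 of each pair, i.e. 8 lines per pair."""
    # zip stops at the shorter input; this sketch does not verify that both
    # files contain the same number of reads or that the read names match.
    for rec1, rec2 in zip(read_records(r1_lines), read_records(r2_lines)):
        yield from rec1
        yield from rec2


r1 = ["@p1/1", "ACGT", "+", "IIII", "@p2/1", "TTTT", "+", "IIII"]
r2 = ["@p1/2", "CCCC", "+", "FFFF", "@p2/2", "GGGG", "+", "FFFF"]
merged = list(interleave(r1, r2))
print("\n".join(merged))
```

In the merged output, lines 1 to 4 are the forward read of the first pair and lines 5 to 8 are its reverse read, and so on for every following pair.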
seqkit
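SeqKit is a cross-platform, ultrafast toolkit for FASTA/Q file manipulation.
It can be installed with conda or downloaded from GitHub. ( conda create -p <install path> -c bioconda seqkit==<latest version> )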
[wangzq@server7 seqtk]$ seqkit
SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation
Version: 2.8.2
Author: Wei Shen <shenwei356@gmail.com>
Documents : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite:
1. https://doi.org/10.1002/imt2.191
2. https://doi.org/10.1371/journal.pone.0163962
Seqkit utlizies the pgzip (https://github.com/klauspost/pgzip) package to
read and write gzip file, and the outputted gzip file would be slighty
larger than files generated by GNU gzip.
Seqkit writes gzip files very fast, much faster than the multi-threaded pigz,
therefore there's no need to pipe the result to gzip/pigz.
Seqkit also supports reading and writing xz (.xz) and zstd (.zst) formats since v2.2.0.
Bzip2 format is supported since v2.4.0.
Compression level:
  format   range   default  comment
  gzip     1-9     5        https://github.com/klauspost/pgzip sets 5 as the default value.
  xz       NA      NA       https://github.com/ulikunitz/xz does not support.
  zstd     1-4     2        roughly equals to zstd 1, 3, 7, 11, respectively.
  bzip     1-9     6        https://github.com/dsnet/compress
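### Note: the text above says that seqkit can read and write compressed files directly, and that its built-in gzip writing is very fast (faster than pigz, according to the help text).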
Usage:
seqkit [command]
Commands for Basic Operation:
faidx create the FASTA index file and extract subsequences
scat real time recursive concatenation and streaming of fastx files
seq transform sequences (extract ID, filter by length, remove gaps, reverse complement...)
sliding extract subsequences in sliding windows
stats simple statistics of FASTA/Q files
subseq get subsequences by region/gtf/bed, including flanking sequences
translate translate DNA/RNA to protein sequence (supporting ambiguous bases)
watch monitoring and online histograms of sequence features
Commands for Format Conversion:
convert convert FASTQ quality encoding between Sanger, Solexa and Illumina
fa2fq retrieve corresponding FASTQ records by a FASTA file
fq2fa convert FASTQ to FASTA
fx2tab convert FASTA/Q to tabular format (and length, GC content, average quality...)
tab2fx convert tabular format to FASTA/Q format
Commands for Searching:
amplicon extract amplicon (or specific region around it) via primer(s)
fish look for short sequences in larger sequences using local alignment
grep search sequences by ID/name/sequence/sequence motifs, mismatch allowed
locate locate subsequences/motifs, mismatch allowed
Commands for Set Operation:
common find common/shared sequences of multiple files by id/name/sequence
duplicate duplicate sequences N times
head print first N FASTA/Q records
head-genome print sequences of the first genome with common prefixes in name
pair match up paired-end reads from two fastq files
range print FASTA/Q records in a range (start:end)
rmdup remove duplicated sequences by ID/name/sequence
sample sample sequences by number or proportion
split split sequences into files by id/seq region/size/parts (mainly for FASTA)
split2 split sequences into files by size/parts (FASTA, PE/SE FASTQ)
Commands for Edit:
concat concatenate sequences with the same ID from multiple files
mutate edit sequence (point mutation, insertion, deletion)
rename rename duplicated IDs
replace replace name/sequence by regular expression
restart reset start position for circular genome
sana sanitize broken single line FASTQ files
Commands for Ordering:
shuffle shuffle sequences
sort sort sequences by id/name/sequence/length
Commands for BAM Processing:
bam monitoring and online histograms of BAM record features
Commands for Miscellaneous:
merge-slides merge sliding windows generated from seqkit sliding
sum compute message digest for all sequences in FASTA/Q files
Additional Commands:
genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell)
version print version information and check for update
Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on
which seqkit guesses the sequence type (0 for whole seq)
(default 10000)
--compress-level int compression level for gzip, zstd, xz and bzip2. type "seqkit -h"
for the range and default value for each format (default -1)
-h, --help help for seqkit
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2|
Pseud...
--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
-X, --infile-list string file of input files list (one file per line), if given, they are
appended to files from cli arguments
-w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it
automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. can also set with environment variable
SEQKIT_THREADS) (default 4)
Use "seqkit [command] --help" for more information about a command.
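The above is seqkit's top-level help text. It gives a brief introduction to the tool and then lists its commands by category:
- Commands for Basic Operation: basic sequence operations
- Commands for Format Conversion: format conversion
- Commands for Searching: searching
- Commands for Set Operation: set operations on collections of sequences (common, rmdup, sample, split, ...)
- Commands for Edit: editing
- Commands for Ordering: ordering (shuffle, sort)
- Commands for BAM Processing: monitoring of BAM record features
- Commands for Miscellaneous: miscellaneous commands
- Additional Commands: additional commands
Run "seqkit [command] --help" to see the exact function and parameters of each command.
seqkit covers a very broad range of functions and is convenient for smaller projects.
Its documentation (the Documents link in the help text) is also detailed, so consulting the official docs while using the tool is recommended.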
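Two points from the help text, reading compressed input directly and the simple statistics reported by `stats`, can be sketched in a few lines of Python. This is a rough conceptual illustration, not how seqkit is implemented (seqkit is written in Go and uses pgzip). The function names `open_maybe_gzip` and `fastq_stats` are hypothetical, the input is assumed to be four-line FASTQ, and the result keys are chosen to resemble the column names of `seqkit stats`.

```python
import gzip
import io
from typing import Dict, Iterable


def open_maybe_gzip(raw: bytes) -> io.StringIO:
    """Return a text stream, decompressing first if the data starts with the gzip magic bytes."""
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return io.StringIO(raw.decode("ascii"))


def fastq_stats(lines: Iterable[str]) -> Dict[str, float]:
    """Compute simple length statistics over 4-line FASTQ records."""
    lengths = []
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the second line of each record is the sequence
            lengths.append(len(line.strip()))
    if not lengths:
        raise ValueError("no sequences found")
    return {
        "num_seqs": len(lengths),
        "sum_len": sum(lengths),
        "min_len": min(lengths),
        "avg_len": sum(lengths) / len(lengths),
        "max_len": max(lengths),
    }


# The same records, once as plain bytes and once gzip-compressed in memory.
plain = b"@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n"
compressed = gzip.compress(plain)

stats_plain = fastq_stats(open_maybe_gzip(plain))
stats_gz = fastq_stats(open_maybe_gzip(compressed))
print(stats_plain)
print(stats_gz)
```

Both calls give the same statistics, which is the behaviour a user sees when passing either a plain or a gzipped file to a tool that handles compression transparently.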