
Bioinformatics software: seqtk and seqkit

October 10, 2024

seqtk

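seqtk is a lightweight sequence-processing tool developed by Heng Li, a well-known bioinformatics developer.
It is written in C, runs very fast, and can noticeably speed up routine work. Everyday tasks it handles include converting FASTQ to FASTA, reformatting sequences, extracting subsequences, and randomly subsampling sequences.
It can be installed with conda or downloaded from GitHub. (conda create -p <install path> -c bioconda seqtk==<latest version number>)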

[wangzq@server7 seqtk]$ seqtk

Usage:   seqtk <command> <arguments>
Version: 1.3-r117-dirty

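Command: seq       common transformation of FASTA/Q            # general FASTA/Q conversion (options allow simple filtering)
         comp      get the nucleotide composition of FASTA/Q   # nucleotide composition statistics (reported per read)
         sample    subsample sequences                         # randomly subsample reads from a FASTQ file
         subseq    extract subsequences from FASTA/Q           # extract reads from FASTA/Q (by a list of read IDs or a BED file)
         fqchk     fastq QC (base/quality summary)             # base and quality statistics (per sample: one summary for each read position)
         mergepe   interleave two PE FASTA/Q files             # interleave the two FASTQ files of a paired-end run (each read pair then occupies 8 lines)
         split     split one file into multiple smaller files  # split one FASTQ file into several smaller FASTQ files
         trimfq    trim FASTQ using the Phred algorithm        # trim FASTQ with the Phred algorithm (parameters can be set); a fixed number of bases can also be trimmed from the left or right end of each read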

         hety      regional heterozygosity                     # regional heterozygosity
         gc        identify high- or low-GC regions            # identify regions with high or low GC content
         mutfa     point mutate FASTA at specified positions   # introduce point mutations into a FASTA at specified positions
         mergefa   merge two FASTA/Q files                     # merge two FASTA/Q files
         famask    apply a X-coded FASTA to a source FASTA
         dropse    drop unpaired from interleaved PE FASTA/Q   # drop unpaired reads from an interleaved FASTQ file
         rename    rename sequence names                       # rename reads
         randbase  choose a random base from hets
         cutN      cut sequence at long N                      # cut sequences at long runs of N
         gap       get the gap locations
         listhet   extract the position of each het
         hpc       homopolyer-compressed sequence
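
To make two of the operations above concrete, here is a minimal Python sketch of what FASTQ-to-FASTA conversion and paired-end interleaving (the `mergepe` idea) do to the records. It is only an illustration, not how seqtk is implemented: the function names are hypothetical, and it assumes plain 4-line FASTQ records held in memory as lists of lines.

```python
from typing import Iterator, List, Tuple

# One FASTQ record: header line, sequence, "+" separator line, quality string.
Record = Tuple[str, str, str, str]


def parse_fastq(lines: List[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records (assumes sequences are not wrapped)."""
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        yield header, seq, plus, qual


def fastq_to_fasta(lines: List[str]) -> List[str]:
    """Keep only name and sequence; '@name' becomes '>name'."""
    out: List[str] = []
    for header, seq, _plus, _qual in parse_fastq(lines):
        out.append(">" + header[1:])
        out.append(seq)
    return out


def interleave(r1_lines: List[str], r2_lines: List[str]) -> List[str]:
    """Alternate records from R1 and R2, giving 8 lines per read pair."""
    out: List[str] = []
    for rec1, rec2 in zip(parse_fastq(r1_lines), parse_fastq(r2_lines)):
        out.extend(rec1)
        out.extend(rec2)
    return out
```

With one read in each file, `interleave` returns 8 lines: the 4 lines of the R1 record followed by the 4 lines of its R2 mate, which is the "8 lines per read pair" layout noted for `mergepe`.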

seqkit

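SeqKit is a cross-platform, ultrafast toolkit for FASTA/Q file manipulation.
It can be installed with conda or downloaded from GitHub. (conda create -p <install path> -c bioconda seqkit==<latest version number>)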

[wangzq@server7 seqtk]$ seqkit
SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Version: 2.8.2

Author: Wei Shen <shenwei356@gmail.com>

Documents  : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite:
  1. https://doi.org/10.1002/imt2.191
  2. https://doi.org/10.1371/journal.pone.0163962


Seqkit utlizies the pgzip (https://github.com/klauspost/pgzip) package to
read and write gzip file, and the outputted gzip file would be slighty
larger than files generated by GNU gzip.

Seqkit writes gzip files very fast, much faster than the multi-threaded pigz,
therefore there's no need to pipe the result to gzip/pigz.

Seqkit also supports reading and writing xz (.xz) and zstd (.zst) formats since v2.2.0.
Bzip2 format is supported since v2.4.0.

Compression level:
  format   range   default  comment
  gzip     1-9     5        https://github.com/klauspost/pgzip sets 5 as the default value.
  xz       NA      NA       https://github.com/ulikunitz/xz does not support.
  zstd     1-4     2        roughly equals to zstd 1, 3, 7, 11, respectively.
  bzip     1-9     6        https://github.com/dsnet/compress

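### Note: the text above says that seqkit can read and write compressed files directly, and that its built-in compression is very fast (faster than pigz).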

Usage:
  seqkit [command]

Commands for Basic Operation:
  faidx           create the FASTA index file and extract subsequences 
  scat            real time recursive concatenation and streaming of fastx files
  seq             transform sequences (extract ID, filter by length, remove gaps, reverse complement...)
  sliding         extract subsequences in sliding windows
  stats           simple statistics of FASTA/Q files
  subseq          get subsequences by region/gtf/bed, including flanking sequences
  translate       translate DNA/RNA to protein sequence (supporting ambiguous bases)
  watch           monitoring and online histograms of sequence features

Commands for Format Conversion:
  convert         convert FASTQ quality encoding between Sanger, Solexa and Illumina
  fa2fq           retrieve corresponding FASTQ records by a FASTA file
  fq2fa           convert FASTQ to FASTA
  fx2tab          convert FASTA/Q to tabular format (and length, GC content, average quality...)
  tab2fx          convert tabular format to FASTA/Q format

Commands for Searching:
  amplicon        extract amplicon (or specific region around it) via primer(s)
  fish            look for short sequences in larger sequences using local alignment
  grep            search sequences by ID/name/sequence/sequence motifs, mismatch allowed
  locate          locate subsequences/motifs, mismatch allowed

Commands for Set Operation:
  common          find common/shared sequences of multiple files by id/name/sequence
  duplicate       duplicate sequences N times
  head            print first N FASTA/Q records
  head-genome     print sequences of the first genome with common prefixes in name
  pair            match up paired-end reads from two fastq files
  range           print FASTA/Q records in a range (start:end)
  rmdup           remove duplicated sequences by ID/name/sequence
  sample          sample sequences by number or proportion
  split           split sequences into files by id/seq region/size/parts (mainly for FASTA)
  split2          split sequences into files by size/parts (FASTA, PE/SE FASTQ)

Commands for Edit:
  concat          concatenate sequences with the same ID from multiple files
  mutate          edit sequence (point mutation, insertion, deletion)
  rename          rename duplicated IDs
  replace         replace name/sequence by regular expression
  restart         reset start position for circular genome
  sana            sanitize broken single line FASTQ files

Commands for Ordering:
  shuffle         shuffle sequences
  sort            sort sequences by id/name/sequence/length

Commands for BAM Processing:
  bam             monitoring and online histograms of BAM record features

Commands for Miscellaneous:
  merge-slides    merge sliding windows generated from seqkit sliding
  sum             compute message digest for all sequences in FASTA/Q files

Additional Commands:
  genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell)
  version         print version information and check for update

Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on
                                        which seqkit guesses the sequence type (0 for whole seq)
                                        (default 10000)
      --compress-level int              compression level for gzip, zstd, xz and bzip2. type "seqkit -h"
                                        for the range and default value for each format (default -1)
  -h, --help                            help for seqkit
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2|
                                        Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
  -X, --infile-list string              file of input files list (one file per line), if given, they are
                                        appended to files from cli arguments
  -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it
                                        automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. can also set with environment variable
                                        SEQKIT_THREADS) (default 4)

Use "seqkit [command] --help" for more information about a command.

The above is the top-level help text of seqkit. It gives a short introduction to the software and then lists its functions by category:

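Commands for Basic Operation: basic sequence operations
Commands for Format Conversion: format conversion
Commands for Searching: searching
Commands for Set Operation: set operations on sequences (shared sequences, deduplication, sampling, splitting)
Commands for Edit: editing
Commands for Ordering: ordering (shuffle and sort)
Commands for BAM Processing: BAM processing (monitoring BAM record features)
Commands for Miscellaneous: miscellaneous commands
Additional Commands: additional commands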
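
As a rough illustration of what the basic-operation and format-conversion categories compute, the following Python sketch parses FASTA text, extracts each ID with the default `--id-regexp` pattern shown in the help text, and derives simple length statistics and GC content. This is an assumption-laden sketch rather than seqkit's implementation: the function names and the dictionary keys are my own labels.

```python
import re
from typing import Dict, List, Tuple

# Default ID pattern quoted in the seqkit help text: "^(\S+)\s?"
ID_PATTERN = re.compile(r"^(\S+)\s?")


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (id, sequence) pairs; wrapped sequence lines are joined."""
    records: List[Tuple[str, str]] = []
    name = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            match = ID_PATTERN.match(line[1:])
            name = match.group(1) if match else line[1:]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def basic_stats(records: List[Tuple[str, str]]) -> Dict[str, float]:
    """Count sequences and summarise their lengths."""
    lengths = [len(seq) for _, seq in records]
    if not lengths:
        return {"num_seqs": 0, "sum_len": 0, "min_len": 0,
                "avg_len": 0.0, "max_len": 0}
    return {
        "num_seqs": len(lengths),
        "sum_len": sum(lengths),
        "min_len": min(lengths),
        "avg_len": sum(lengths) / len(lengths),
        "max_len": max(lengths),
    }


def gc_content(seq: str) -> float:
    """Percentage of G and C bases in a sequence (0.0 for empty input)."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return 100.0 * (upper.count("G") + upper.count("C")) / len(upper)
```

For real data the seqkit commands themselves are the better choice; the sketch only shows the kind of per-file and per-sequence values such commands report.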

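Run "seqkit [command] --help" to see the specific functions and parameters of each command.
seqkit offers a large and comprehensive set of functions and is convenient for smaller projects.
Its documentation is also detailed, so consulting the official manual while using it is recommended.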
