Usage

Usage: centreseq [OPTIONS] COMMAND [ARGS]...

  centreseq builds an annotated core genome using assemblies as input.

Options:
  --version   Print the version and exit.
  --help      Show this message and exit.

Commands:
  core     Given an input directory containing assemblies, establishes a core
           genome
  extract  Helper tool to extract sequences from a particular core cluster
  subset   Subset summary_report.tsv to only samples of interest
  tree     Produces output for phylogenetic tree software

Core module

The primary functionality of centreseq is provided by the core module.

Usage: centreseq core [OPTIONS]

  Given an input directory containing any number of assemblies (.fasta),
  centreseq core will 1) annotate the genomes with Prokka, 2) perform self-
  clustering on each genome with MMSeqs2 linclust, 3) concatenate the self-
  clustered genomes into a single pan-genome, 4) cluster the pan-genome with
  MMSeqs2 linclust, establishing a core genome, 5) generate helpful reports
  to interrogate your dataset. Note that if the specified output directory
  already exists, centreseq will search for an existing Prokka directory and
  skip the Prokka annotation step if possible.

Options:
  -f, --fasta-dir PATH         Path to directory containing *.fasta files for
                               input to the core pipeline  [required]
  -o, --outdir PATH            Root directory to store all output files. If
                               this directory already exists, the pipeline
                               will attempt to skip the Prokka step by reading
                               in the existing Prokka output directory, but
                               will overwrite all other existing result files.
                               [required]
  --n-cpu-medoid INTEGER       Number of CPUs for the representative medoid
                               picking step (if enabled). You will need
                               substantial RAM per CPU.
  --n-cpu-pickbest INTEGER     Number of CPUs for pick_best_nucleotide. You
                               will need substantial RAM per CPU.
  -m, --min-seq-id FLOAT       Sets the mmseqs cluster parameter "--min-seq-
                               id". Defaults to 0.95.
  -c, --coverage-length FLOAT  Sets the mmseqs cluster coverage parameter "-c"
                               directly. Defaults to 0.95, which is the
                               recommended setting.
  --medoid-repseqs             This setting will identify the representative
                               medoid nucleotide sequence for each core
                               cluster. Enabling this will increase
                               computation time considerably. Note that this
                               parameter has no effect on the number of core
                               clusters detected.
  --pairwise                   Generate pairwise comparisons of all genomes.
                               This output file can be used to view an
                               interactive network chart of the core genome in
                               a web browser.
  -v, --verbose                Set this flag to enable more verbose logging.
  --version                    Use this flag to print the version and exit.
  --help                       Show this message and exit.
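As an illustration of how the options above fit together, here is a minimal Python sketch that assembles a `centreseq core` invocation as an argument list. The helper name `build_core_command` and the paths `assemblies` and `centreseq_out` are hypothetical; only option names that appear in the help text above are used.

```python
import shlex
from pathlib import Path


def build_core_command(fasta_dir, outdir, min_seq_id=0.95, coverage_length=0.95,
                       pairwise=False, verbose=False):
    """Assemble an argument list for `centreseq core` (hypothetical helper)."""
    cmd = [
        "centreseq", "core",
        "--fasta-dir", str(Path(fasta_dir)),
        "--outdir", str(Path(outdir)),
        # 0.95 is the documented default for both thresholds.
        "--min-seq-id", str(min_seq_id),
        "--coverage-length", str(coverage_length),
    ]
    if pairwise:
        cmd.append("--pairwise")
    if verbose:
        cmd.append("--verbose")
    return cmd


# Hypothetical input and output directories, for illustration only.
core_cmd = build_core_command("assemblies", "centreseq_out", pairwise=True)
print(shlex.join(core_cmd))
```

With centreseq installed, the resulting list could be passed to `subprocess.run(core_cmd, check=True)`. Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.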

Tree module

Results generated via the core module can be further processed using the tree module.

Usage: centreseq tree [OPTIONS]

  Processes centreseq core output files to produce files that can be fed
  into phylogenetic tree building software.

Options:
  -s, --summary-report PATH  Path to summary_report.tsv file produced by the
                             core pipeline  [required]
  -p, --prokka-dir PATH      Path to the Prokka output directory generated by
                             the core pipeline  [required]
  -o, --outdir PATH          Root directory to store all output files
                             [required]
  -pct, --percentile FLOAT   Filter summary report by n_members to the top nth
                             percentile. Defaults to 99.0.
  -n, --n-cpu INTEGER        Number of CPUs to dedicate to parallelizable
                             steps of the pipeline. Will take all available
                             CPUs - 1 if not specified.
  -v, --verbose              Set this flag to enable more verbose logging.
  --version                  Use this flag to print the version and exit.
  --help                     Show this message and exit.
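The help text describes the `--percentile` option only briefly. The sketch below shows one plausible reading of it, offered as an assumption rather than as centreseq's actual implementation: compute the requested percentile of `n_members` across all clusters and keep the clusters at or above that value. The nearest-rank percentile method, the function names, and the `cluster` key are all hypothetical choices made for illustration.

```python
import math


def percentile_threshold(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def filter_by_percentile(clusters, pct=99.0):
    """Keep clusters whose n_members reaches the pct-th percentile (assumed semantics)."""
    threshold = percentile_threshold([c["n_members"] for c in clusters], pct)
    return [c for c in clusters if c["n_members"] >= threshold]


# Toy data: four clusters in a dataset of ten genomes.
clusters = [
    {"cluster": "A", "n_members": 10},
    {"cluster": "B", "n_members": 10},
    {"cluster": "C", "n_members": 9},
    {"cluster": "D", "n_members": 3},
]
kept = filter_by_percentile(clusters, pct=99.0)
print([c["cluster"] for c in kept])
```

Under this reading, the default of 99.0 keeps only clusters present in nearly every genome, and lowering the value admits clusters with fewer members.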

Subset module

The subset module allows for filtering of the summary report output file generated by the core module.

Usage: centreseq subset [OPTIONS]

  Given an input text file of Sample IDs and a summary report, will return a
  filtered version of the summary report for genes that belong exclusively
  to samples in the input sample ID list.

Options:
  -i, --input-samples PATH   Path to a newline-separated text file containing
                             each Sample ID to target  [required]
  -s, --summary-report PATH  Path to summary report generated by the centreseq
                             core command, i.e. summary_report.tsv  [required]
  -o, --outpath PATH         Path to desired output file. If no value is
                             provided, will create a new report in the same
                             directory as the input summary report.
  --help                     Show this message and exit.
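To make the expected input format concrete, the following sketch writes a Sample ID file with one ID per line and assembles the matching `centreseq subset` arguments. The helper names, the sample IDs `sample_A` and `sample_B`, and the file name `samples.txt` are hypothetical; the summary report path is a placeholder for the file produced by the core module.

```python
import tempfile
from pathlib import Path


def write_sample_list(sample_ids, path):
    """Write one Sample ID per line, as expected by --input-samples."""
    path = Path(path)
    path.write_text("\n".join(sample_ids) + "\n")
    return path


def build_subset_command(samples_path, summary_report, outpath=None):
    """Assemble an argument list for `centreseq subset` (hypothetical helper)."""
    cmd = [
        "centreseq", "subset",
        "--input-samples", str(samples_path),
        "--summary-report", str(summary_report),
    ]
    # --outpath is optional; when omitted, the report is written next to the input.
    if outpath is not None:
        cmd += ["--outpath", str(outpath)]
    return cmd


# Hypothetical sample IDs written to a temporary directory.
workdir = Path(tempfile.mkdtemp())
samples_file = write_sample_list(["sample_A", "sample_B"], workdir / "samples.txt")
subset_cmd = build_subset_command(samples_file, "summary_report.tsv")
print(subset_cmd)
```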

Extract module

The extract module provides functionality to extract .ffn and .faa sequences from an existing cluster detected by the core module.

Usage: centreseq extract [OPTIONS]

  Given the path to the centreseq core directory and the ID of a
  cluster representative, will create a multi-FASTA containing the sequences
  for all members of that cluster. Generates both an .ffn and .faa file.

Options:
  -i, --indir PATH                Path to your centreseq output directory
                                  [required]
  -o, --outdir PATH               Root directory to store all output files
                                  [required]
  -c, --cluster_representative TEXT
                                  Name of the target cluster representative
                                  e.g. "Typhi.2299.BMH_00195"  [required]
  --version                       Use this flag to print the version and exit.
  --help                          Show this message and exit.
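A short sketch of an `extract` invocation follows, using the cluster representative from the help text above. The helper name and the directory names `centreseq_out` and `cluster_seqs` are hypothetical. Note that the long option is spelled with an underscore (`--cluster_representative`), unlike the hyphenated options of the other modules.

```python
import shlex


def build_extract_command(indir, outdir, cluster_representative):
    """Assemble an argument list for `centreseq extract` (hypothetical helper)."""
    return [
        "centreseq", "extract",
        "--indir", str(indir),
        "--outdir", str(outdir),
        # Spelled with an underscore, exactly as shown in the help output.
        "--cluster_representative", cluster_representative,
    ]


# Hypothetical directories; the cluster ID is the example from the help text.
extract_cmd = build_extract_command("centreseq_out", "cluster_seqs", "Typhi.2299.BMH_00195")
print(shlex.join(extract_cmd))
```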