To perform a test of Physcraper functionality, we pruned the Gottlieb et al. tree down ~20%, excluding the outgroups, corresponding to 9 tips. We then performed a Physcraper run to test if we would recover the dropped tips. The input alignment will NOT be automatically trimmed by Physcraper

We performed the following steps:

  1. Pruning the tree
  2. Matching the names
  3. Running Physcraper
  4. Checking the results
  5. Analyzing the results

Pruning the Gottlieb et al. tree

Getting the tree from the Open Tree of Life database

tree6577_pg_2827 <- rotl::get_study_tree(study_id = "pg_2827", tree_id = "tree6577", tip_label = "original_label")
## Warning in build_raw_phylo(ncl, missing_edge_length): missing edge lengths are
## not allowed in phylo class. All removed.
tree6577_pg_2827$tip.label
##  [1] "Helwingia_chinensis"   "Helwingia_japonica"    "Ilex_crenata"         
##  [4] "Ilex_mutchagara"       "Ilex_dumosa_dumosa"    "Ilex_dumosa_guaranina"
##  [7] "Ilex_opaca"            "Ilex_argentina_Gbk"    "Ilex_cassine"         
## [10] "Ilex_collina"          "Ilex_decidua"          "Ilex_vomitoria"       
## [13] "Ilex_anomala"          "Ilex_rugosa"           "Ilex_integra"         
## [16] "Ilex_beecheyi"         "Ilex_mertensii"        "Ilex_percoriacea"     
## [19] "Ilex_kiusiana"         "Ilex_maximowicziana"   "Ilex_cornuta"         
## [22] "Ilex_dimorphophylla"   "Ilex_buergeri"         "Ilex_latifolia"       
## [25] "Ilex_liukiuensis"      "Ilex_warburgii"        "Ilex_amelanchier"     
## [28] "Ilex_mucronata"        "Ilex_serrata"          "Ilex_rotunda"         
## [31] "Ilex_glabra"           "Ilex_canariensis"      "Ilex_mitis"           
## [34] "Ilex_yunnanensis"      "Ilex_macropoda"        "Ilex_micrococca"      
## [37] "Ilex_pedunculosa"      "Ilex_argentina"        "Ilex_brasiliensis"    
## [40] "Ilex_theezans"         "Ilex_integerrima"      "Ilex_theezans_Gbk"    
## [43] "Ilex_brevicuspis"      "Ilex_taubertiana"      "Ilex_microdonta"      
## [46] "Ilex_pseudobuxus"      "Ilex_microdonta_Gbk"   "Ilex_paraguariensis"

Identifying the outgroups

outgroup_index <- match(c("Helwingia_chinensis", "Helwingia_japonica"), tree6577_pg_2827$tip.label)

Getting the ingroup tip labels

ingroup_tip.label <- tree6577_pg_2827$tip.label[-outgroup_index]

Calculating the number of tips to keep

ingroup_ntip <- length(ingroup_tip.label)
ingroup_ntip* 0.8
## [1] 36.8

Choosing at random the ingroup tips to drop

set.seed(19022021)
todrop <- sample(ingroup_tip.label, ingroup_ntip-37)
todrop
## [1] "Ilex_decidua"        "Ilex_amelanchier"    "Ilex_warburgii"     
## [4] "Ilex_dimorphophylla" "Ilex_integerrima"    "Ilex_canariensis"   
## [7] "Ilex_anomala"        "Ilex_mutchagara"     "Ilex_percoriacea"

Dropping the tips from the tree

pruned_tree <- ape::drop.tip(phy = tree6577_pg_2827, tip = todrop)
pruned_tree
## 
## Phylogenetic tree with 39 tips and 30 internal nodes.
## 
## Tip labels:
##   Helwingia_chinensis, Helwingia_japonica, Ilex_crenata, Ilex_dumosa_dumosa, Ilex_dumosa_guaranina, Ilex_opaca, ...
## 
## Rooted; no branch lengths.

Saving the tree as newick and nexus

ape::write.tree(phy = pruned_tree, file = "../data-raw/ilex-test/tree6577_pg_2827_pruned.tre")
ape::write.nexus(pruned_tree, file = "../data-raw/ilex-test/tree6577_pg_2827_pruned.nex")

Matching the tree tip labels

First generate a simple text file containing one tip label per line

write(paste(pruned_tree$tip.label, collapse = "\n"), file = "../data-raw/ilex-test/pruned_tree_tip_labels.txt")

Go to OpenTree’s TNRS matching bulk tool

Available at this link https://tree.opentreeoflife.org/curator/tnrs/

Pruning the alignment

Read the alignment

ali_T1281_M2478 <- ape::read.nexus.data(file = "../data-raw/alignments/T1281-M2478.nex")
names(ali_T1281_M2478)
##  [1] "Helwingia_chinensis"   "Helwingia_japonica"    "Ilex_amelanchier"     
##  [4] "Ilex_anomala"          "Ilex_argentina"        "Ilex_argentina_Gbk"   
##  [7] "Ilex_beecheyi"         "Ilex_brasiliensis"     "Ilex_brevicuspis"     
## [10] "Ilex_buergeri"         "Ilex_canariensis"      "Ilex_cassine"         
## [13] "Ilex_collina"          "Ilex_cornuta"          "Ilex_crenata"         
## [16] "Ilex_decidua"          "Ilex_dimorphophylla"   "Ilex_dumosa_dumosa"   
## [19] "Ilex_dumosa_guaranina" "Ilex_glabra"           "Ilex_integerrima"     
## [22] "Ilex_integra"          "Ilex_kiusiana"         "Ilex_latifolia"       
## [25] "Ilex_liukiuensis"      "Ilex_macropoda"        "Ilex_maximowicziana"  
## [28] "Ilex_mertensii"        "Ilex_micrococca"       "Ilex_microdonta"      
## [31] "Ilex_microdonta_Gbk"   "Ilex_mitis"            "Ilex_mucronata"       
## [34] "Ilex_mutchagara"       "Ilex_opaca"            "Ilex_paraguariensis"  
## [37] "Ilex_pedunculosa"      "Ilex_percoriacea"      "Ilex_pseudobuxus"     
## [40] "Ilex_rotunda"          "Ilex_rugosa"           "Ilex_serrata"         
## [43] "Ilex_taubertiana"      "Ilex_theezans"         "Ilex_theezans_Gbk"    
## [46] "Ilex_vomitoria"        "Ilex_warburgii"        "Ilex_yunnanensis"

Prune it

in_ali <- !names(ali_T1281_M2478) %in% todrop
ali_T1281_M2478_pruned <- ali_T1281_M2478[in_ali]

Save the pruned alignment

ape::write.nexus.data(x = ali_T1281_M2478_pruned, file = "../data-raw/ilex-test/alignment_T1281_M2478_pruned.nex", interleaved = FALSE)

# dendropy.dataio.nexusreader.InvalidCharacterStateSymbolError: Error parsing data source 'physcraperex/data-raw/ilex-test/alignment_T1281_M2478_pruned.nex' on line 7 at column 33: Unrecognized character state symbol for state alphabet 'DNA' (DnaStateAlphabet) : '/'

Modify the slash added when multistate characters are present in the matrix

I did this by hand, by looking for “/” symbol for example in “atcg/t-tct” and removing it and enclosing the adjacent bases within curly braces to indicate multistate character, “atc{gt}-tct” I found some cases with three different states, for example “atc/g/t-tct”, would be “at{cgt}-tct”

Running the test

Locally

physcraper_run.py -tf physcraperex/data-raw/ilex-test/tree6577_pg_2827_pruned.tre -tfs newick -a physcraperex/data-raw/ilex-test/alignment_T1281_M2478_pruned.nex -as nexus -ti physcraperex/data-raw/ilex-test/pruned_tree_tip_labels_mapped/main.json -db /absolute/path/local_blast_db -nt 8 -o physcraperex/data/ilex-test-local-1

Trying it with the otu_info.csv file from a previous run:

physcraper_run.py -tf physcraperex/data-raw/ilex-test/tree6577_pg_2827_pruned.tre -tfs newick -a physcraperex/data-raw/alignments/T1281-M2478.nex -as nexus -ti physcraperex/data/ilex-local/run_T1281-M2478/otu_info_T1281-M2478.json -db /absolute/path/local_blast_db -nt 8 -o physcraperex/data/ilex-test-local

Remotely

physcraper_run.py -tf physcraperex/data-raw/ilex-test/tree6577_pg_2827_pruned.tre -tfs newick -a physcraperex/data-raw/alignments/T1281-M2478.nex -as nexus -ti physcraperex/data-raw/ilex-test/pruned_tree_tip_labels_mapped/main.json -o physcraperex/data/ilex-test-remote

Checking the results of the test

Getting information from the OTU info table

otu_info_T1281_M2478 <- readr::read_delim("../data/ilex-test-local-1/inputs_alignment_T1281_M2478_pruned/otu_info_alignment_T1281_M2478_pruned.csv", "\t", escape_double = FALSE, trim_ws = TRUE)
## Parsed with column specification:
## cols(
##   otu_id = col_character(),
##   `^ot:ottTaxonName` = col_character(),
##   `^ot:ottId` = col_double(),
##   `^ncbi:taxon` = col_double(),
##   `^ncbi:accession` = col_character(),
##   `^physcraper:last_blasted` = col_character(),
##   `^physcraper:ingroup` = col_character(),
##   `^physcraper:status` = col_character(),
##   `^ot:originalLabel` = col_character(),
##   `^ncbi:title` = col_character(),
##   `^physcraper:in_current_aln` = col_logical()
## )
DT::datatable(otu_info_T1281_M2478)
# Verify that all tips in input tree are in alignment
sum(!otu_info_T1281_M2478$`^physcraper:in_current_aln`)
## [1] 0

Looking at the output OTU info file

otu_info_alignment_T1281_M2478_pruned <- readr::read_delim("../data/ilex-test-local-1/outputs_alignment_T1281_M2478_pruned/otu_info_alignment_T1281_M2478_pruned.csv", "\t", escape_double = FALSE, trim_ws = TRUE)
## Parsed with column specification:
## cols(
##   otu_id = col_character(),
##   `^ot:ottTaxonName` = col_character(),
##   `^ot:ottId` = col_double(),
##   `^ncbi:taxon` = col_double(),
##   `^ncbi:accession` = col_character(),
##   `^physcraper:last_blasted` = col_character(),
##   `^physcraper:ingroup` = col_character(),
##   `^physcraper:status` = col_character(),
##   `^ot:originalLabel` = col_character(),
##   `^ncbi:title` = col_character(),
##   `^physcraper:in_current_aln` = col_logical()
## )
DT::datatable(otu_info_alignment_T1281_M2478_pruned)

Getting the OTT ids of the dropped taxa

todrop_tnrs <- rotl::tnrs_match_names(names = todrop)

Looking for the dropped taxa in the OTU info table

# Using the OTT id:
todrop_tnrs$ott_id %in% otu_info_alignment_T1281_M2478_pruned$`^ot:ottId`
## [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
# Using the unique taxon name:
todrop_tnrs$unique_name %in% otu_info_alignment_T1281_M2478_pruned$`^ot:ottTaxonName`
## [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
# Getting the taxa in original tree that are not in the updated tree
notinupdated <- todrop_tnrs$unique_name[!todrop_tnrs$ott_id %in% otu_info_alignment_T1281_M2478_pruned$`^ot:ottId`]

The following taxa were in original tree but were not recovered with this Physcraper run:

  • Ilex warburgii
  • Ilex dimorphophylla
  • Ilex percoriacea

Why does this happen?

We looked for the NCBI GenBank accession numbers reported in the supplementary material of the original publication for the ITS sequences of each of the missed taxa, Ilex warburgii, I. dimorphophylla and I. percoriacea. The first too have updated the originally reported accession numbers from U92600/U92601 and U92592/U92593 to a single accession number AH007153 and AH007149, respectively. For I. percoriacea, the NCBI accession number reported in the priginal alignment is AH007156, and it remains the same (as of Tue Apr 06, 2021). Inspecting the sequences, all three contain a middle gap of 100 undetermined nucleotide bases (Ns). These Ns were not present in these sequences in the original alignment. Physcraper drops these sequence because they are unalignable without any manual curation.

Note: updated_taxonname.tree and labelled_tag.tre are exactly the same tree