verification-test.Rmd
To perform a test of Physcraper functionality, we pruned the Gottlieb et al. tree down ~20%, excluding the outgroups, corresponding to 9 tips. We then performed a Physcraper run to test if we would recover the dropped tips. The input alignment will NOT be automatically trimmed by Physcraper
We performed the following steps:
tree6577_pg_2827 <- rotl::get_study_tree(study_id = "pg_2827", tree_id = "tree6577", tip_label = "original_label")
## Warning in build_raw_phylo(ncl, missing_edge_length): missing edge lengths are
## not allowed in phylo class. All removed.
tree6577_pg_2827$tip.label
## [1] "Helwingia_chinensis" "Helwingia_japonica" "Ilex_crenata"
## [4] "Ilex_mutchagara" "Ilex_dumosa_dumosa" "Ilex_dumosa_guaranina"
## [7] "Ilex_opaca" "Ilex_argentina_Gbk" "Ilex_cassine"
## [10] "Ilex_collina" "Ilex_decidua" "Ilex_vomitoria"
## [13] "Ilex_anomala" "Ilex_rugosa" "Ilex_integra"
## [16] "Ilex_beecheyi" "Ilex_mertensii" "Ilex_percoriacea"
## [19] "Ilex_kiusiana" "Ilex_maximowicziana" "Ilex_cornuta"
## [22] "Ilex_dimorphophylla" "Ilex_buergeri" "Ilex_latifolia"
## [25] "Ilex_liukiuensis" "Ilex_warburgii" "Ilex_amelanchier"
## [28] "Ilex_mucronata" "Ilex_serrata" "Ilex_rotunda"
## [31] "Ilex_glabra" "Ilex_canariensis" "Ilex_mitis"
## [34] "Ilex_yunnanensis" "Ilex_macropoda" "Ilex_micrococca"
## [37] "Ilex_pedunculosa" "Ilex_argentina" "Ilex_brasiliensis"
## [40] "Ilex_theezans" "Ilex_integerrima" "Ilex_theezans_Gbk"
## [43] "Ilex_brevicuspis" "Ilex_taubertiana" "Ilex_microdonta"
## [46] "Ilex_pseudobuxus" "Ilex_microdonta_Gbk" "Ilex_paraguariensis"
ingroup_ntip <- length(ingroup_tip.label) ingroup_ntip* 0.8
## [1] 36.8
## [1] "Ilex_decidua" "Ilex_amelanchier" "Ilex_warburgii"
## [4] "Ilex_dimorphophylla" "Ilex_integerrima" "Ilex_canariensis"
## [7] "Ilex_anomala" "Ilex_mutchagara" "Ilex_percoriacea"
pruned_tree <- ape::drop.tip(phy = tree6577_pg_2827, tip = todrop) pruned_tree
##
## Phylogenetic tree with 39 tips and 30 internal nodes.
##
## Tip labels:
## Helwingia_chinensis, Helwingia_japonica, Ilex_crenata, Ilex_dumosa_dumosa, Ilex_dumosa_guaranina, Ilex_opaca, ...
##
## Rooted; no branch lengths.
ape::write.tree(phy = pruned_tree, file = "../data-raw/ilex-test/tree6577_pg_2827_pruned.tre") ape::write.nexus(pruned_tree, file = "../data-raw/ilex-test/tree6577_pg_2827_pruned.nex")
Available at this link https://tree.opentreeoflife.org/curator/tnrs/
ali_T1281_M2478 <- ape::read.nexus.data(file = "../data-raw/alignments/T1281-M2478.nex") names(ali_T1281_M2478)
## [1] "Helwingia_chinensis" "Helwingia_japonica" "Ilex_amelanchier"
## [4] "Ilex_anomala" "Ilex_argentina" "Ilex_argentina_Gbk"
## [7] "Ilex_beecheyi" "Ilex_brasiliensis" "Ilex_brevicuspis"
## [10] "Ilex_buergeri" "Ilex_canariensis" "Ilex_cassine"
## [13] "Ilex_collina" "Ilex_cornuta" "Ilex_crenata"
## [16] "Ilex_decidua" "Ilex_dimorphophylla" "Ilex_dumosa_dumosa"
## [19] "Ilex_dumosa_guaranina" "Ilex_glabra" "Ilex_integerrima"
## [22] "Ilex_integra" "Ilex_kiusiana" "Ilex_latifolia"
## [25] "Ilex_liukiuensis" "Ilex_macropoda" "Ilex_maximowicziana"
## [28] "Ilex_mertensii" "Ilex_micrococca" "Ilex_microdonta"
## [31] "Ilex_microdonta_Gbk" "Ilex_mitis" "Ilex_mucronata"
## [34] "Ilex_mutchagara" "Ilex_opaca" "Ilex_paraguariensis"
## [37] "Ilex_pedunculosa" "Ilex_percoriacea" "Ilex_pseudobuxus"
## [40] "Ilex_rotunda" "Ilex_rugosa" "Ilex_serrata"
## [43] "Ilex_taubertiana" "Ilex_theezans" "Ilex_theezans_Gbk"
## [46] "Ilex_vomitoria" "Ilex_warburgii" "Ilex_yunnanensis"
in_ali <- !names(ali_T1281_M2478) %in% todrop ali_T1281_M2478_pruned <- ali_T1281_M2478[in_ali]
ape::write.nexus.data(x = ali_T1281_M2478_pruned, file = "../data-raw/ilex-test/alignment_T1281_M2478_pruned.nex", interleaved = FALSE) # dendropy.dataio.nexusreader.InvalidCharacterStateSymbolError: Error parsing data source 'physcraperex/data-raw/ilex-test/alignment_T1281_M2478_pruned.nex' on line 7 at column 33: Unrecognized character state symbol for state alphabet 'DNA' (DnaStateAlphabet) : '/'
I did this by hand, by looking for “/” symbol for example in “atcg/t-tct” and removing it and enclosing the adjacent bases within curly braces to indicate multistate character, “atc{gt}-tct” I found some cases with three different states, for example “atc/g/t-tct”, would be “at{cgt}-tct”
physcraper_run.py -tf physcraperex/data-raw/ilex-test/tree6577_pg_2827_pruned.tre -tfs newick -a physcraperex/data-raw/ilex-test/alignment_T1281_M2478_pruned.nex -as nexus -ti physcraperex/data-raw/ilex-test/pruned_tree_tip_labels_mapped/main.json -db /absolute/path/local_blast_db -nt 8 -o physcraperex/data/ilex-test-local-1
Trying it with the otu_info.csv file from a previous run:
physcraper_run.py -tf physcraperex/data-raw/ilex-test/tree6577_pg_2827_pruned.tre -tfs newick -a physcraperex/data-raw/alignments/T1281-M2478.nex -as nexus -ti physcraperex/data/ilex-local/run_T1281-M2478/otu_info_T1281-M2478.json -db /absolute/path/local_blast_db -nt 8 -o physcraperex/data/ilex-test-local
otu_info_T1281_M2478 <- readr::read_delim("../data/ilex-test-local-1/inputs_alignment_T1281_M2478_pruned/otu_info_alignment_T1281_M2478_pruned.csv", "\t", escape_double = FALSE, trim_ws = TRUE)
## Parsed with column specification:
## cols(
## otu_id = col_character(),
## `^ot:ottTaxonName` = col_character(),
## `^ot:ottId` = col_double(),
## `^ncbi:taxon` = col_double(),
## `^ncbi:accession` = col_character(),
## `^physcraper:last_blasted` = col_character(),
## `^physcraper:ingroup` = col_character(),
## `^physcraper:status` = col_character(),
## `^ot:originalLabel` = col_character(),
## `^ncbi:title` = col_character(),
## `^physcraper:in_current_aln` = col_logical()
## )
DT::datatable(otu_info_T1281_M2478)
# Verify that all tips in input tree are in alignment sum(!otu_info_T1281_M2478$`^physcraper:in_current_aln`)
## [1] 0
otu_info_alignment_T1281_M2478_pruned <- readr::read_delim("../data/ilex-test-local-1/outputs_alignment_T1281_M2478_pruned/otu_info_alignment_T1281_M2478_pruned.csv", "\t", escape_double = FALSE, trim_ws = TRUE)
## Parsed with column specification:
## cols(
## otu_id = col_character(),
## `^ot:ottTaxonName` = col_character(),
## `^ot:ottId` = col_double(),
## `^ncbi:taxon` = col_double(),
## `^ncbi:accession` = col_character(),
## `^physcraper:last_blasted` = col_character(),
## `^physcraper:ingroup` = col_character(),
## `^physcraper:status` = col_character(),
## `^ot:originalLabel` = col_character(),
## `^ncbi:title` = col_character(),
## `^physcraper:in_current_aln` = col_logical()
## )
DT::datatable(otu_info_alignment_T1281_M2478_pruned)
todrop_tnrs <- rotl::tnrs_match_names(names = todrop)
# Using the OTT id: todrop_tnrs$ott_id %in% otu_info_alignment_T1281_M2478_pruned$`^ot:ottId`
## [1] TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
# Using the unique taxon name: todrop_tnrs$unique_name %in% otu_info_alignment_T1281_M2478_pruned$`^ot:ottTaxonName`
## [1] TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
# Getting the taxa in original tree that are not in the updated tree notinupdated <- todrop_tnrs$unique_name[!todrop_tnrs$ott_id %in% otu_info_alignment_T1281_M2478_pruned$`^ot:ottId`]
The following taxa were in original tree but were not recovered with this Physcraper run:
We looked for the NCBI GenBank accession numbers reported in the supplementary material of the original publication for the ITS sequences of each of the missed taxa, Ilex warburgii, I. dimorphophylla and I. percoriacea. The first too have updated the originally reported accession numbers from U92600/U92601 and U92592/U92593 to a single accession number AH007153 and AH007149, respectively. For I. percoriacea, the NCBI accession number reported in the priginal alignment is AH007156, and it remains the same (as of Tue Apr 06, 2021). Inspecting the sequences, all three contain a middle gap of 100 undetermined nucleotide bases (Ns). These Ns were not present in these sequences in the original alignment. Physcraper drops these sequence because they are unalignable without any manual curation.
Note: updated_taxonname.tree and labelled_tag.tre are exactly the same tree