
Yi-Yun Cheng, Khanh Linh Hoang, Bertram Ludäscher, Cacao, Cocao, or Cocoa?: Reconciliation of Taxonomic Names in Biodiversity Heritage Library in:

International Society for Knowledge Organization (ISKO), Marianne Lykke, Tanja Svarre, Mette Skov, Daniel Martínez-Ávila (Eds.)

Knowledge Organization at the Interface, page 88 - 97

Proceedings of the Sixteenth International ISKO Conference, 2020 Aalborg, Denmark

1st Edition 2020, ISBN print: 978-3-95650-775-5, ISBN online: 978-3-95650-776-2, https://doi.org/10.5771/9783956507762-88

Series: Advances in Knowledge Organization, vol. 17

Yi-Yun Cheng, iSchool, University of Illinois at Urbana-Champaign, USA
Khanh Linh Hoang, iSchool, University of Illinois at Urbana-Champaign, USA
Bertram Ludäscher, iSchool, University of Illinois at Urbana-Champaign, USA

Cacao, Cocao, or Cocoa? Reconciliation of Taxonomic Names in Biodiversity Heritage Library

Abstract: The Biodiversity Heritage Library (BHL) currently hosts more than 150 thousand titles and 57 million OCR-scanned pages of biodiversity literature dating back to the 16th century. While considerable research effort has gone into extracting taxonomic names from BHL’s literature, the issue of name reconciliation has yet to be studied. Through the use case of Theobroma cacao, commonly known as the chocolate plant, this research presents a framework to reconcile species names in BHL by merging external taxonomies. We demonstrate this by using a logic-based taxonomy alignment approach to match variations of species and subspecies names of Theobroma cacao from four major biodiversity sources: the Encyclopedia of Life (EoL), Integrated Taxonomic Information System (ITIS), Global Biodiversity Information Facility (GBIF), and the United States Department of Agriculture PLANTS Database (USDA Plants).

1.0 Introduction

Consider the following hypothetical scenario: Charlie is a rising researcher in plant biodiversity with an interest in economic plants who wants to find more descriptions of such species and their family trees in historical texts. Determined to use the canonical online resource for biodiversity texts, the Biodiversity Heritage Library, Charlie starts with a quick search for cocoa, the ingredient in chocolate bars. The search returns two seemingly unrelated species names. She then turns to GBIF, another popular online resource for biodiversity information, to perform further searches. She finds that GBIF lists two names: Theobroma cocao and Theobroma cacao.
Unsure which name is the approved scientific name, she tries both and finds 15 other variations of scientific names for Theobroma cacao shown in BHL. These 15 names are listed alphabetically, but without further information on how the terms relate to each other.

With the exponential growth of biodiversity information and data, it has become increasingly difficult to reliably retrieve species based on species names, as the example above shows. The Biodiversity Heritage Library (BHL) houses OCR-scanned pages of legacy biodiversity literature from natural history museum collections and other partnering institutions. This literature dates back to the 16th century, but key taxonomic papers are often obscured by the evolving variations of species names. While efforts have been put into leveraging natural language processing (NLP) and text mining methods to extract scientific names from the texts in BHL, such as developing a series of text mining tools (Page 2011, 2013), discussing taxonomic name recognition from texts (Wei, Heidorn, and Freeland 2010), or automating the extraction of names from texts (Batista-Navarro et al., 2015), little has been done to organize and reconcile these scientific names to provide a better species name representation.

Through the use case of the species Theobroma cacao, commonly known as the chocolate plant, our research aims to develop a framework for reconciling species names in BHL by merging external taxonomies. We propose to employ a logic-based taxonomy alignment approach (Franz et al., 2015, 2016; Cheng and Ludäscher, 2019) to match variations of species and subspecies names of Theobroma cacao from four major biodiversity sources: the Encyclopedia of Life (EoL), Integrated Taxonomic Information System (ITIS), Global Biodiversity Information Facility (GBIF), and the United States Department of Agriculture PLANTS Database (USDA Plants).
2.0 Related Work

2.1 Quality issues in aggregated databases such as GBIF and BHL

Data quality issues in aggregated biodiversity databases are not uncommon. Numerous studies have discussed these issues as well as practical solutions for aggregated databases. Parr et al. (2012) discuss how the rapid growth of biodiversity information from various repositories complicates the discovery and integration of knowledge into the “Tree of Life”, and call for a linked data solution to connect the different aggregated databases. Franz and Sterner (2018) point out how these aggregators underplay a “taxonomic backbone” in their system designs and shift responsibility for data quality issues to the data source providers. The authors conclude that, rather than correcting datasets at the root providers, providing services that enhance the taxonomic concepts in these systems would increase trust and collaboration among systematists and the aggregator communities. While discussions of data quality mostly center on aggregators such as GBIF, extracting scientific articles and finding species names in BHL has also proven to be an unresolved concern.

2.2 Extracting names in BHL

BHL uses the Global Names Recognition and Discovery (GNRD) service to analyze the OCR-scanned texts and identify any string that could be a scientific name. Many studies have proposed using natural language processing (NLP) and text mining methods to enhance the recognition and extraction of scientific names from BHL. For instance, NLP techniques have been developed to support the extraction of taxonomic names and morphological characters from taxonomic descriptions (Thessen, Cui, and Mozzherin 2012). Page (2011, 2013) has developed a series of text mining tools (BioNames, BioStor) that aim to enhance name extraction and retrieval in BHL.
Recently, Page (2019) proposed an approach that links taxonomic names from other databases to papers in BHL in an attempt to leverage existing resources. These prior works show great potential for enhancing the extraction and recognition of names from BHL on top of its native service based on the GNRD API. However, how best to utilize, organize, and represent the information extracted from BHL warrants further research.

Figure 1. BHL’s current search results only return a flat structure of species names.

2.3 Reconciliation of names

Name recognition and extraction are of growing interest within the context of BHL as a way to provide semantics for unstructured data (Wei, Heidorn, and Freeland 2010). Nevertheless, the extraction of names is only the first step towards retrieving species information from text; the reconciliation of species names is the next unsettled territory for many biodiversity experts. Franz et al. (2016) reconciled 11 different taxonomies of the Andropogon complex spanning 126 years and concluded that the meanings of scientific names can change significantly over time. Analogously, the Knowledge Organization literature has emphasized that names are a diachronic concern for biologists and information scientists alike (Blake 2011). Our prior work discussed the reconciliation of names in evolving, disputed geo-entities (Cheng and Ludäscher 2019).

To date, BHL remains a cornucopia of historical taxonomic names and species information. However, the current organization of species taxonomies in BHL makes it hard to relate species names and to determine the credibility of species hierarchies. This paper proposes to reconcile species names from external taxonomies and to link this value-added information with information from BHL to obtain a more informative and user-friendly presentation of search results.

3.0 Use Case: Theobroma cacao

We started with a single species to investigate the current taxonomic backbone of BHL.
The species we chose to examine is commonly known as the cacao (or cocoa) tree and scientifically known as Theobroma cacao. When we searched for Theobroma cacao, BHL performed a full-text search and returned about 4,000 results covering publications and their metadata (title, author, date, page number) related to the search term. Fifteen scientific names were listed alphabetically on the BHL interface. However, how each term relates to Theobroma cacao (whether hierarchically, as a synonym, or as a sibling) was not shown to users. For instance, clicking a term among the 15 names, such as Cacao theobroma, returns a list of bibliographic records that contain the keyword. The current BHL structure is shown in Figure 1. Because information about where species and higher taxa occur in a taxonomic hierarchy is essential for taxonomists, biodiversity experts, and researchers, we are particularly interested in how BHL presents scientific names and whether a better approach to organizing these names can be provided.

Figure 2. Our proposed taxonomic structure framework.

3.1 Method

We propose a framework to gather species taxonomies from external databases. Specifically, we compare and merge the taxonomies of species in four major sources: EoL, GBIF, ITIS, and USDA Plants (Figure 2). In this study, we specifically investigate the inclusion of subspecies in the taxonomic backbone of BHL.

3.2 Data Collection

Our data were collected in December 2019. Detailed descriptions of the data are given below. The taxonomies of the four external databases are not mutually exclusive and may overlap with one another.

3.2.1 Encyclopedia of Life (EoL)

EoL is one of the aggregated databases that annotates preferred scientific names, common names, and synonyms of species. Notably, EoL maintains curated dynamic hierarchies for each species.
EoL staff manually curate and edit the taxonomic information as needed when changes are suggested by biodiversity experts and similar communities¹.

3.2.2 Global Biodiversity Information Facility (GBIF)

GBIF is one of the most popular sources of biodiversity information, including species occurrences, publications, peer-reviewed data, etc. GBIF publishes a backbone taxonomy containing taxonomies from EoL, IUCN Red Lists, published papers, and more. The taxonomic information in GBIF is updated via an automated process, and the Catalogue of Life (CoL) is the primary source against which GBIF compares taxonomies.

3.2.3 Integrated Taxonomic Information System (ITIS)

The main goal of ITIS is to provide species names and taxonomic information. It contains species taxonomies approved by biodiversity experts and its Taxonomy Working Group (TWG). A search on any species returns a list of accepted and not accepted species names, followed by species hierarchies, expert references, and other sources.

3.2.4 United States Department of Agriculture PLANTS Database (USDA Plants)

The scope of USDA Plants is slightly narrower than that of the previously mentioned counterparts. A search on a species name leads users to the species classification and to subordinate taxa, which document the species' parents (genus, family) and subspecies.

Table 1 shows the subspecies we collected from the four external sources. We considered all infraspecific epithets (names) to be children of a species, so species names that contain keywords such as subsp., ssp., infrasubsp., f., fm., or var. were all grouped as subspecies in our analysis. In particular, we treated subsp. as equivalent to ssp. (subspecies), and f. as equivalent to fm. (form). Accordingly, if a database contains both Theobroma cacao ssp. cacao and Theobroma cacao subsp. cacao, we keep only one name for the purpose of analysis.

¹ EoL information page: https://eol.org/docs/what-is-eol/whats-new

Table 1.
A list of subspecies in the four sources examined. *Data were collected in December 2019. The prefix Theobroma cacao is omitted; periods (.) are changed to underscores (_).

3.3 Logic-based taxonomy alignment approach

3.3.1 Taxonomy

We define a taxonomy T as a hierarchical tree structure of terms (or names): each node in the taxonomy has exactly one parent, with the exception of the root node, which has none. Sibling nodes typically represent disjoint taxa, i.e., two nodes on the same level in the tree are considered mutually exclusive. When considering the children of a node, we can either assume that all children are known (i.e., we know or assume that no other child nodes exist), or that there might be as yet unknown children. In the latter case, we introduce a placeholder node called “other” to design for possible future changes (e.g., Theobroma_cacao_other).

3.3.2 Taxonomy Alignment Problem (TAP)

To compare two taxonomies T1 and T2, we first identify a set of articulations (relations) that describe how a concept X in T1 relates to a concept Y in T2. The region connection calculus RCC-5 (Randell, Cui, and Cohn 1992; Cohn and Renz 2008) can be used to define specific articulations: equals, overlaps, disjoint, includes, is_included_in. We then input these articulations as constraints to Euler/X (a Python-based tool built on Answer Set Programming), which computes the possible merged solutions for each pairwise comparison. Euler/X yields one of three outcomes: (1) an inconsistent result with zero Possible Worlds (PWs) (n=0); (2) a single, uniquely merged PW T3 (n=1), usually the desired outcome; or (3) multiple merged PWs T3 (n≥2), where each world is a possible reconciliation of how the two taxonomies can be merged. A simple TAP is shown in Figure 3, where a few relations are marked as equivalent (‘=’) and the ‘other’ nodes are left with open relations. The resulting PW shows that all the child nodes in both taxonomies are equivalent (in grey boxes).
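The preparation of input articulations described above can be sketched in Python. This is an illustrative sketch only and does not reproduce Euler/X's input format or API. The helper names (`normalize`, `propose_articulations`), the simplification of keeping only the rank marker and epithet while dropping authorship, and the two small example child lists are assumptions made for illustration; they are not the contents of Table 1.

```python
from itertools import combinations

# The five RCC-5 articulations named in the text.
RCC5 = ("equals", "overlaps", "disjoint", "includes", "is_included_in")

SPECIES = "Theobroma cacao"

# Rank markers treated as equivalent in this study (ssp. = subsp., fm. = f.).
RANK_ALIASES = {"ssp.": "subsp.", "fm.": "f."}


def normalize(name: str) -> str:
    """Reduce a full name to '<rank> <epithet>' (hypothetical helper).

    Assumes the form 'Theobroma cacao <rank> <epithet> [authorship]'.
    Authorship is dropped, which is a simplification for this sketch.
    """
    rank, epithet = name.replace(SPECIES, "", 1).split()[:2]
    rank = rank.lower()
    return f"{RANK_ALIASES.get(rank, rank)} {epithet.lower()}"


def propose_articulations(t1, t2):
    """Pair names that normalize identically; leave the rest open."""
    n1 = {normalize(n): n for n in t1}
    n2 = {normalize(n): n for n in t2}
    equals = [(n1[k], "equals", n2[k]) for k in sorted(n1.keys() & n2.keys())]
    open_t1 = [n1[k] for k in sorted(n1.keys() - n2.keys())]
    open_t2 = [n2[k] for k in sorted(n2.keys() - n1.keys())]
    return equals, open_t1, open_t2


# Illustrative child lists only; not the actual contents of Table 1.
t_a = ["Theobroma cacao ssp. cacao L."]
t_b = ["Theobroma cacao subsp. cacao L.", "Theobroma cacao ssp. leiocarpum"]

equals, open_a, open_b = propose_articulations(t_a, t_b)

# Four sources compared two at a time give the pairwise alignments to run.
pairs = list(combinations(["EoL", "GBIF", "ITIS", "USDA"], 2))
```

In this sketch, `equals` holds a single articulation linking the ssp. and subsp. spellings of the same name, while the leiocarpum name stays in `open_b` with no asserted relation, mirroring the choice to leave relations open where no counterpart exists. The `pairs` list contains six source combinations.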
Details of the Euler/X tool workflow and its implementation are explained in Cheng et al. (2017).

Figure 3. Example of a TAP. Input figure (left). One Possible World (right).

4.0 Results

Table 1 shows that some of the subspecies names are equivalent but vary in using either ssp. or subsp. to represent subspecies, using f. or fm. to represent forms, and using different authorship spellings (e.g., A. Chevalier vs. A. Chev, L. vs. Linnaeus, etc.). In such cases, we regard the two names as equivalent.

Table 1 also shows that ITIS and USDA Plants provide fewer subspecies names. Therefore, we begin our alignment with these two taxonomies, which are less complex than the others. Because one limitation of our taxonomy alignment approach is that only two taxonomies can be compared at once, we executed six pairwise alignments in total to compare each taxonomy with the other three (T_ITIS-T_USDA, T_EoL-T_ITIS, T_EoL-T_USDA, T_GBIF-T_ITIS, T_GBIF-T_USDA, T_EoL-T_GBIF).

4.1 T_ITIS and T_USDA

We begin our alignment with these two taxonomies. As shown in the PW of Figure 4, every node is considered congruent (in grey round boxes), meaning that the two names from each taxonomy are equivalent. This means that, instead of six pairwise alignments of the four taxonomies, we can reduce the number of alignments to four, because the merged Possible Worlds (PWs) of T_EoL-T_ITIS and T_EoL-T_USDA, and likewise of T_GBIF-T_ITIS and T_GBIF-T_USDA, will have exactly the same structure.

Figure 4. Input alignment (top) and output Possible World (bottom) for T_ITIS and T_USDA.

4.2 T_EoL and T_ITIS, T_EoL and T_USDA

Considering that T_ITIS and T_USDA are equivalent, here we only show the result for T_EoL-T_ITIS (Figure 5). T_EoL provides more subspecies information than either T_ITIS or T_USDA. As non-experts in biodiversity, we cannot assert how these subspecies relate to each other; therefore, we left the relations open in the input alignments.
As a result, all the extra subspecies in T_EoL are inferred to be merged under “TCOther” in T_ITIS (and T_USDA) (Figure 5, PW). This indicates that T_EoL may be a bigger taxonomy than the other two, and that T_EoL.Theobroma cacao includes both T_ITIS.Theobroma cacao and T_USDA.Theobroma cacao.

Figure 5. Input alignment (top) and output Possible World (bottom) for T_EoL and T_ITIS.

4.3 T_GBIF and T_ITIS, T_GBIF and T_USDA

Similar to the result in 4.2, since there were no direct counterparts for many of T_GBIF's subspecies, we left these nodes' relations open without linking them to anything in T_ITIS or T_USDA. The merged PW also indicates that T_GBIF is more granular in terms of subspecies, and that these subspecies should be merged under T_ITIS's “Other” category. (See the Appendix for visualizations of the 4.3 results.)

4.4 T_EoL and T_GBIF

Figure 6 shows the two merged PWs of T_EoL and T_GBIF. Notably, for this pair of taxonomies, we ended up with more than one PW, potentially due to the influence of the “other” category. Specifically, the two PWs differ in whether the infraspecific taxa of T_GBIF are within or equivalent to T_EoL.Other, or whether T_EoL.Theobroma cacao ssp. leiocarpum Bernoulli Cuatrec is within or equivalent to T_GBIF.Other. At this point, it is difficult to discern which PW yields the more reliable merged solution. Expert opinions are needed to proceed further in this case.

Figure 6. Possible Worlds for T_EoL and T_GBIF.

5.0 Discussion and Conclusion

In this paper, we have presented a framework to reconcile taxonomies from different sources in BHL. Specifically, we have conducted six pairwise taxonomy alignments on four different taxonomies (ITIS, USDA Plants, EoL, and GBIF). As also shown in prior research (Franz et al., 2015, 2016; Cheng et al., 2017; Cheng and Ludäscher, 2019), a logic-based approach to taxonomy alignment can be used to align and merge different taxonomic perspectives into a solution that makes hidden relationships explicit.
In this paper, the merged Possible Worlds of the alignments mainly serve as a subspecies grouping mechanism that allows users to identify which taxonomies are more granular than others, as illustrated in 4.2 and 4.3. The result for 4.1 also partially suggests that the PW may serve as a name disambiguation mechanism: the grey boxes that group equivalent terms show that Theobroma cacao ssp. cacao L may be synonymous with Theobroma cacao subsp. cacao L.

Further, while reserving a residual category “Other” for “designing for change” (Tennis, 2012) in classifications is usually considered good practice, our results in 4.2, 4.3, and 4.4 show that the PWs will be partially influenced by the residual category. The child nodes may be classified into the “Other” category and create ambiguities in the alignment results.

Moreover, our framework tries to minimize information overload during the alignment process in these aggregated databases. Interoperability endeavors such as taxonomy alignment, cross-walking, or ontology mapping rely substantially on human decisions, especially when the topic of the alignment is domain-specific. Experts in a domain assert what kind of relation a concept in taxonomy A has with a concept in taxonomy B. In this paper, we attempted to reduce expert involvement at the beginning stage by using a semi-automatic alignment process and incorporating existing external taxonomies. This is not to say that these taxonomies are the ground truth of what Theobroma cacao's taxonomy should look like, nor that the species names aligned are the absolute answers for equivalencies (they are often not equivalent due to evolving semantic changes (Franz et al., 2016)). Rather, these merged solutions serve as interim knowledge organization systems pending further scrutiny by biodiversity and taxonomy experts in the future.
Given the large amount of data in aggregated databases such as the Biodiversity Heritage Library, we believe that using this approach to first establish a minimal viable knowledge product can be helpful for further efforts.

In future work, we plan to extend this study mainly by (1) employing this framework to generate results for more species; and (2) implementing the merged PWs as a new species information representation structure that can be used alongside BHL, and assessing its retrieval effectiveness. In the opening scenario, Charlie's searches further reveal that similar-looking names in BHL, such as Theobroma cacao L and Theobroma cacao Linnaeus, yield identical search results with 2,733 records. This may suggest that BHL has an internal infrastructure that recognizes and groups keywords. We also hope to explore this and to incorporate parents (genus, family, order, or class level), siblings (other species within the same genus), synonyms, and common names into the taxonomies to form more comprehensive species hierarchies in our framework. Ultimately, we hope to continue conversations with BHL on improving the practices of organizing species names and name reconciliation services.

Acknowledgement

This project is supported by the Center for Informatics Research in Science and Scholarship (CIRSS) at the iSchool, University of Illinois at Urbana-Champaign. The first author would also like to acknowledge the LEADS-4-NDP fellowship program, Dr. Jane Greenberg, and Steven Dilliplane for their ongoing support of this work.

References

Blake, James. 2011. “Some Issues in the Classification of Zoology.” Knowledge Organization 38: 463-472.
Cheng, Yi-Yun, Nico Franz, Jodi Schneider, Shizhuo Yu, Thomas Rodenhausen, and Bertram Ludäscher. 2017. “Agreeing to Disagree: Reconciling Conflicting Taxonomic Views Using a Logic-Based Approach.” Proceedings of the Association for Information Science and Technology 54: 46-56.
Cheng, Yi-Yun and Bertram Ludäscher. 2019.
“Exploring Geopolitical Realities through Taxonomies: The Case of Taiwan.” NASKO 7: 77-93.
Cohn, Anthony G. and Jochen Renz. 2008. “Qualitative Spatial Representation and Reasoning.” In Handbook of Knowledge Representation, edited by Frank van Harmelen, Vladimir Lifschitz, and Bruce Porter. Amsterdam: Elsevier, 551-596.
Franz, Nico M., Mingmin Chen, Shizhuo Yu, Parisa Kianmajd, Shawn Bowers, and Bertram Ludäscher. 2015. “Reasoning Over Taxonomic Change: Exploring Alignments for the Perelleschus Use Case.” PLoS ONE 10, no. 2: e0118247. https://doi.org/10.1371/journal.pone.0118247.
Franz, Nico M., Mingmin Chen, Parisa Kianmajd, Shizhuo Yu, Shawn Bowers, Alan S. Weakley, and Bertram Ludäscher. 2016. “Names Are Not Good Enough: Reasoning Over Taxonomic Change in the Andropogon Complex.” Semantic Web 7, no. 6: 645-667.
Franz, Nico M. and Beckett W. Sterner. 2018. “To Increase Trust, Change the Social Design Behind Aggregated Biodiversity Data.” Database 2018. https://doi.org/10.1093/database/bax100.
Page, Roderic D.M. 2011. “Extracting Scientific Articles from a Large Digital Archive: BioStor and the Biodiversity Heritage Library.” BMC Bioinformatics 12: 187. https://doi.org/10.1186/1471-2105-12-187.
Page, Roderic D.M. 2013. “BioNames: Linking Taxonomy, Texts, and Trees.” PeerJ 1: e190. https://doi.org/10.7717/peerj.190.
Page, Roderic. 2019. “Text-Mining BHL: Towards New Interfaces to the Biodiversity Literature.” Biodiversity Information Science and Standards 3: e35013.
Parr, Cynthia S., Robert Guralnick, Nico Cellinese, and Roderic D.M. Page. 2012. “Evolutionary Informatics: Unifying Knowledge about the Diversity of Life.” Trends in Ecology & Evolution 27, no. 2: 94-103. https://doi.org/10.1016/j.tree.2011.11.001.
Randell, D. A., Zhan Cui, and Anthony G. Cohn. 1992. “A Spatial Logic Based on Regions and Connection.” Knowledge Representation and Reasoning 92: 165-176.
Tennis, Joseph T. 2012.
“The Strange Case of Eugenics: A Subject’s Ontogeny in a Long-Lived Classification Scheme and the Question of Collocative Integrity.” Journal of the American Society for Information Science and Technology 63, no. 7: 1350-1359.
Thessen, Anne E., Hong Cui, and Dmitry Mozzherin. 2012. “Applications of Natural Language Processing in Biodiversity Science.” Advances in Bioinformatics 2012. https://doi.org/10.1155/2012/391574.
Wei, Qin, Patrick B. Heidorn, and Chris Freeland. 2010. “Name Matters: Taxonomic Name Recognition (TNR) in Biodiversity Heritage Library (BHL).” In 2010 iConference Proceedings: 284. http://hdl.handle.net/2142/14919

Appendix

4.3. Result for T_GBIF and T_ITIS (Top: input taxonomies; Bottom: PW)

Result for T_GBIF and T_USDA (Top: input taxonomies; Bottom: PW)


Abstract

The proceedings explore knowledge organization systems and their role in knowledge organization, knowledge sharing, and information searching.

The papers cover a wide range of topics related to knowledge transfer, representation, concepts and conceptualization, social tagging, domain analysis, music classification, fiction genres, and museum organization. The papers discuss theoretical issues related to knowledge organization and the design, development, and implementation of knowledge organizing systems, as well as practical considerations and solutions in the application of knowledge organization theory. They cover a range of knowledge organization systems, from classification systems, thesauri, and metadata schemas to ontologies and taxonomies.
