
Tobias Renwick, Rick Szostak, A Thesaural Interface for the Basic Concepts Classification in:

International Society for Knowledge Organization (ISKO), Marianne Lykke, Tanja Svarre, Mette Skov, Daniel Martínez-Ávila (Eds.)

Knowledge Organization at the Interface, page 527 - 531

Proceedings of the Sixteenth International ISKO Conference, 2020 Aalborg, Denmark

1st edition 2020, ISBN print: 978-3-95650-775-5, ISBN online: 978-3-95650-776-2, https://doi.org/10.5771/9783956507762-527

Series: Advances in Knowledge Organization, vol. 17

Bibliographic information
Tobias Renwick – University of Alberta, Canada
Rick Szostak – University of Alberta, Canada

A Thesaural Interface for the Basic Concepts Classification

Abstract: We describe a thesaural interface that is being developed for the Basic Concepts Classification. This interface is particularly well-suited to the synthetic phenomenon-based approach to classification pursued by the BCC. We describe how the thesaural interface works, our plans to develop it further, and the advantages of this interface for both classifier and user.

1.0 Motivation

A classifier using the Basic Concepts Classification (BCC; Szostak 2019) would create a subject string combining terms from separate schedules of phenomena (mostly nouns), relators (mostly verbs or conjunctions), and properties (adjectives and adverbs). The resulting subject strings resemble sentence fragments. It is hoped that a classifier can move fairly directly from a key sentence in an abstract, book description, manuscript description, or object description to a BCC subject string.

Compared to classifying with an enumerative classification, the classifier is spared from having to find a complex enumerated subject heading that best fits a particular document or object. But the classifier now has to synthesize multiple terms, generally from two or three separate schedules. The BCC schedules are generally easy to navigate: hierarchies are flat and logically constructed for the most part. Yet a classifier seeking to synthesize several terms might nevertheless find it time-consuming to identify all of the necessary controlled vocabulary.

It has long been hoped, then, that a thesaural interface could be constructed that would guide classifiers to BCC controlled vocabulary. Importantly, such an interface might allow users also to enter a query in words of their choice and be guided quickly to controlled vocabulary.
This in turn might encourage both public and university libraries to move back toward subject searching: though keyword searching is easier for most library users, it is far less precise than subject searching (Hjørland 2012). A thesaural interface might potentially render subject searching as easy as keyword searching.

2.0 Design of the Interface

We are exploring the possibility that such a thesaural interface can be developed using the Universal Sentence Encoder (USE; Cer et al. 2018). One common criticism of a synthetic (post-coordinated) approach to subject classification is that a user searching for "philosophy of history" will find many documents on "history of philosophy" (Sauperl 2009). But this is only true if the search interface ignores word order in search queries. USE does discriminate on the basis of word order, for it places each term in the context of the phrase it is embedded within.

USE is a transformer-type deep neural network that has been trained on very large quantities of text. USE can help identify synonyms for words and phrases by embedding them as vectors in 512-dimensional space. Embeddings are modeled after the idea that "you shall know a word by the company it keeps" (Firth 1957, 11) and can be seen as a fixed-length numeric representation of a text-based input. The guiding principle behind all embeddings is that if two words are often used in a similar context they likely have a similar meaning. In addition to the context of the word, USE also incorporates a token’s position in the phrase to determine its meaning.

This ability to discriminate words based on their position and context arises from the way USE is trained. During the training phase, USE consists of two principal components: an encoder sub-graph, which builds a 512-dimensional numeric encoding from text input, and a decoder sub-graph, which takes that numeric encoding as input and attempts to predict the next word in the sentence.
USE maintains word order and positioning in the input phrase by adding a second 512D vector to the input. This vector is built by overlapping sine functions of different wavelengths (e.g. sin x, sin 2x), which assign unique values to each position up to a maximum length of 1024 tokens. The oscillating nature of the sine function allows the network to generalize from shorter inputs to longer ones, where it can potentially observe a similar distance and pattern between the words used. Because of this (and other contexts observed during training in which these two phrases are used), the phrases ‘philosophy of history’ and ‘history of philosophy’ have different embeddings.

After the network is trained, USE consists of only the encoder portion of the network. It takes in a sentence in English and outputs a 512D vector, as before, but rather than predicting the next word, we use the information present in that vector to convey information about the input phrase (a sentence embedding).

One aspect of these embeddings that adds some intuition as to how they are constructed is that they behave meaningfully under vector arithmetic. The classic example is that if you take the vector for the word ‘King’ and subtract the vector for ‘man’, you effectively remove all of the words associated with males from king, and you obtain the context that would surround a genderless royal (imagine words like crown, throne, rule, subjects etc.). If you now add the vector for ‘woman’, the result very closely matches the vector for ‘Queen’. Another common example is to take a country, subtract its capital, then add a different capital to obtain the other country’s approximate vector (France – Paris + Rome ≈ Italy).

Happily, USE can deal with phrases up to 1024 tokens in length, rather than just individual words.
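
The position signal described above can be illustrated with a minimal sketch. It assumes the standard sinusoidal scheme used in transformer networks, which pairs each sine with a cosine (the text mentions only sine functions), and it uses hypothetical 8-dimensional token vectors in place of the 512 dimensions used by USE. The function names and toy values are illustrative only and are not taken from the interface. The sketch shows only that the same word yields a different input vector at a different position; the differing sentence embeddings arise once the network processes these inputs.

```python
import math
from typing import Dict, List


def position_vector(position: int, dim: int = 8) -> List[float]:
    """Sinusoidal position signal: sine/cosine pairs of increasing wavelength."""
    vec: List[float] = []
    for pair in range(dim // 2):
        wavelength = 10000 ** (2 * pair / dim)
        vec.append(math.sin(position / wavelength))
        vec.append(math.cos(position / wavelength))
    return vec


def add_positions(tokens: List[str],
                  token_vectors: Dict[str, List[float]]) -> List[List[float]]:
    """Add the position signal to each token's vector, element by element."""
    dim = len(next(iter(token_vectors.values())))
    return [
        [t + p for t, p in zip(token_vectors[token], position_vector(i, dim))]
        for i, token in enumerate(tokens)
    ]


# Hypothetical toy vectors (assumption): not real USE token embeddings.
TOY_VECTORS = {
    "philosophy": [0.9, 0.1, 0.0, 0.3, 0.2, 0.0, 0.5, 0.1],
    "of":         [0.0, 0.2, 0.1, 0.0, 0.0, 0.1, 0.0, 0.0],
    "history":    [0.1, 0.8, 0.4, 0.0, 0.6, 0.2, 0.0, 0.3],
}

first = add_positions(["philosophy", "of", "history"], TOY_VECTORS)
second = add_positions(["history", "of", "philosophy"], TOY_VECTORS)

# "philosophy" sits at position 0 in one phrase and at position 2 in the other,
# so the network receives a different input vector for it in each case.
print(first[0] != second[2])  # True
```
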
Embedding whole phrases will save classifiers and users from having to translate each word individually into controlled vocabulary. More importantly, phrases further clarify the meaning of the words they contain (for example, clarifying whether "picture" is being used as a noun or a verb).

In order to make use of these embeddings, the entire terminology of the BCC (phenomena and relators), ISO country and language codes, and the UNSPSC (United Nations Standard Products and Services Code; https://www.unspsc.org) codes used to identify goods and services within BCC have been embedded with USE (transformed into 512D vectors of floating-point numbers). In addition, embeddings of all two-word classifications consisting of a BCC relator plus a BCC phenomenon have been added. These embeddings are combined into a single vector array, which allows direct comparison with an unseen embedding.

When a phrase is presented to the interface, it is first checked for terms that exist directly in the BCC and is broken into sub-phrases, which are then translated. For example, the phrase ‘a man dancing at a club’ is broken into three sub-phrases (‘a man dancing’, ‘at’, ‘a club’) around the word ‘at’, which is a BCC term (NT3). This is primarily a simple heuristic that allows input to be broken up in a predictable and reasonable way. If a phrase is too long, it will likely contain too much information to be adequately translated into a one- or two-word BCC classification. The best results are therefore obtained by giving the translator the most concise terminology possible.

Each sub-phrase is then embedded with USE, and the resulting vectors are compared with the pre-calculated array of BCC (and related) embeddings. Classifiers and users can then immediately be given ten possible BCC translations for each phrase block from which to choose.
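
The splitting heuristic can be sketched roughly as follows. This is a minimal illustration, not the interface's actual code: it assumes that only single words are matched against the BCC vocabulary, and the function name `split_on_known_terms` and the one-entry vocabulary are hypothetical stand-ins for the full BCC term list.

```python
from typing import List, Set


def split_on_known_terms(phrase: str, known_terms: Set[str]) -> List[str]:
    """Break a phrase into blocks around words that are already controlled terms."""
    blocks: List[str] = []
    current: List[str] = []
    for word in phrase.split():
        if word.lower() in known_terms:
            # Close the block collected so far, then keep the known term
            # as a block of its own.
            if current:
                blocks.append(" ".join(current))
                current = []
            blocks.append(word)
        else:
            current.append(word)
    if current:
        blocks.append(" ".join(current))
    return blocks


# Hypothetical vocabulary (assumption): only 'at' is treated as a BCC term here.
KNOWN_TERMS = {"at"}

print(split_on_known_terms("a man dancing at a club", KNOWN_TERMS))
# ['a man dancing', 'at', 'a club']
```

Each resulting block would then be embedded and matched separately, which is why shorter and more concise blocks tend to give better matches.
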
Technically, the interface selects from the array the ten nearest neighbors, using cosine similarity as the measure (cosine similarity rests on the assumption that vectors pointing in a similar direction have similar meaning, and it ignores the magnitude of the vectors).

A demonstration version of the interface can now be seen at https://sites.google.com/a/ualberta.ca/rick-szostak/research/basic-concepts-classification-web-version-2013/thesaural-interface-for-bcc. At present it deals best with shorter phrases. Readers can enter any phrase and be guided to appropriate BCC terminology (and given the BCC notation that goes along with that terminology). At present, they may have to search again if important terms are missing from the generated subject string.

3.0 Future Work

In its current state, the thesaural interface is a helpful tool. In the future, as more classifications are collected as input data, a true translator could be developed that would better handle ambiguity in input data without breaking the input phrase into blocks. In the earlier example, ‘a man dancing at a club’, the third sub-phrase ‘a club’ is ambiguous without context: the first suggestion of the translator is incorrect (a UNSPSC classification for clubs), and while the correct class (‘E09(901520) - nightclubs and dance halls’) is included, it is returned as a more distant match.

We are also working on algorithms that can cope with longer phrases. We can then also analyze many examples of translations. We are also working on tree structures based on the hierarchies within BCC: the translator can then appreciate that the best place to look for controlled vocabulary for a type of painting is within the category "Art" rather than "Mathematical concepts." Note that our interface can be improved over time through repeated use (and selection by users or classifiers of particular options) to better select the best BCC translation of particular queries.
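
The nearest-neighbor selection described in Section 2.0 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the three-dimensional vectors and their labels are made up for the example (real entries would be 512D USE embeddings of BCC and related terms), and the function names are hypothetical rather than taken from the interface.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; vector magnitude is ignored."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_terms(query: List[float],
                  term_vectors: Dict[str, List[float]],
                  k: int = 10) -> List[Tuple[str, float]]:
    """Return the k labels whose vectors point most nearly in the query's direction."""
    scored = [(cosine_similarity(query, vec), label)
              for label, vec in term_vectors.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(label, score) for score, label in scored[:k]]


# Made-up labels and vectors (assumption): not real BCC entries or USE output.
TOY_TERM_VECTORS = {
    "nightclubs and dance halls": [0.9, 0.1, 0.2],
    "golf clubs": [0.2, 0.9, 0.1],
    "dancing": [0.7, 0.0, 0.6],
}
query = [0.8, 0.2, 0.3]

for label, score in nearest_terms(query, TOY_TERM_VECTORS, k=2):
    print(f"{label}: {score:.3f}")
```

Because only direction matters, scaling a vector by any positive factor leaves its ranking unchanged, which is the property the parenthetical remark about magnitude refers to.
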

4.0 Discussion

Thesauri (within information science) have almost always been developed in the past to guide users toward controlled vocabulary within enumerated classifications. The thesaural interface developed here is much better suited to a synthetic approach to classification, for it can identify the best fit by combining multiple terms within the controlled vocabulary. This thesaural interface thus reverses a potential disadvantage of a classification such as BCC. Without the thesaural interface it might be time-consuming to identify all of the controlled vocabulary necessary for a synthetic subject string (though, again, the logical and flat structure of BCC schedules would facilitate the search for controlled vocabulary). With the thesaural interface it becomes fairly straightforward to move from a key sentence in a document or object description toward a BCC subject string. This is already the case for fairly short sentences and will hopefully become feasible in future for longer phrases.

Just as it is easy for a classifier to move from a document or object description toward a subject string, it should be easy for users to move from a query in their own words toward a subject string that guides them to the document or object they seek. Documents or objects are described in sentences. User queries are generally formulated in sentences. We have in the past attempted to translate user queries into a subject heading that is not constructed grammatically, and used that ungrammatical subject heading to attempt to identify relevant documents. It is potentially both easier and more precise to translate sentence to sentence to sentence: translate the user query into a sentence-like subject string that will guide users to documents or objects that are described by a similar sentence. The thesaural interface described in this paper can hopefully guide users and classifiers to describe a document with the same (or very similar) subject string.
Users with a precise query can thus be guided to the document(s) or object(s) they seek. Users performing exploratory searches will benefit from the fact that the interface provides them with several suggested subject strings. Users can then adjust each term in their search query to identify different sets of subject strings. They start by wondering about dogs biting mail carriers, and move on to dogs biting neighbors, dogs licking mail carriers, or cats biting mail carriers. We hope to develop a visual interface that would allow users to easily adjust their search query term by term.

5.0 Concluding Remarks

There is a dissonance in the field of knowledge organization between a body of theory that urges faceted classification and a body of practice around enumerated classification. One practical advantage of leading enumerated classifications is that they benefit from over a century of development. The thesaural interface discussed here can potentially allow a synthetic approach to classification such as the BCC to outperform enumerated classifications without the painstaking task of developing a thesaurus manually. The thesaural interface can facilitate the work of classifiers in moving from a key sentence in an object or document description toward a BCC subject string. It can facilitate user queries to the point that these become as easy as keyword queries, yet provide much more precise results.

References

Cer, Daniel, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Céspedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. “Universal Sentence Encoder.” ArXiv. https://arxiv.org/pdf/1803.11175.pdf

Firth, John R. 1957. Papers in Linguistics 1934–1951. London: Oxford University Press.

Hjørland, Birger. 2012. “Is Classification Necessary After Google?” Journal of Documentation 68: 299-317.

Sauperl, Alenka. 2009. “Precoordination or Not?: A New View of the Old Question.” Journal of Documentation 65: 817-833.

Szostak, Rick. 2019. “Basic Concepts Classification.” In ISKO Encyclopedia of Knowledge Organization, edited by Birger Hjørland and Claudio Gnoli. https://www.isko.org/cyclo/bcc


Abstract

The proceedings explore knowledge organization systems and their role in knowledge organization, knowledge sharing, and information searching.

The papers cover a wide range of topics related to knowledge transfer, representation, concepts and conceptualization, social tagging, domain analysis, music classification, fiction genres, and museum organization. The papers discuss theoretical issues related to knowledge organization and the design, development and implementation of knowledge organizing systems, as well as practical considerations and solutions in the application of knowledge organization theory. The knowledge organization systems covered range from classification systems, thesauri, and metadata schemas to ontologies and taxonomies.
