M. Cristina Pattuelli, From uniform identifiers to graphs, from individuals to communities: what we talk about when we talk about linked person data in:

Fernanda Ribeiro, Maria Elisa Cerveira (Eds.)

Challenges and Opportunities for Knowledge Organization in the Digital Age, page 571 - 580

Proceedings of the Fifteenth International ISKO Conference 9-11 July 2018 Porto, Portugal

1st edition 2018, ISBN print: 978-3-95650-420-4, ISBN online: 978-3-95650-421-1, https://doi.org/10.5771/9783956504211-571

Series: Advances in Knowledge Organization, vol. 16

M. Cristina Pattuelli

From uniform identifiers to graphs, from individuals to communities: what we talk about when we talk about linked person data

Abstract

Person identities in the Linked Open Data (LOD) environment are the result of a process of semantic representation that often requires a complex interplay of data association, reconciliation and interlinking. This paper aims to lay the foundation for a broader discussion on the role that linked data principles and practices play in shaping person identities and the communities that bind them. We have only begun to explore the potential, as well as the challenges, of shaping people’s identities through the many layers of semantics we can gather from diverse linked data sources. Using the domain of performing arts, jazz music in particular, as a scenario, I hope to offer insights on various aspects of the process of building person identities in the context of linked data applied research.

Introduction

As the vision of the semantic web gets closer to a tangible reality and the linked data cloud continues to expand in every direction, it is worth turning our attention to people as entities populating this new knowledge space. Person entities are indeed the crux of countless relationships and the catalysts for multiple contexts, and Linked Open Data (LOD) technologies provide a powerful means to give visibility to this complex information environment. Individual and collective identities in the context of linked data development are the result of a process of semantic representation that often involves a combination of processes including name resolution, reconciliation and data interlinking. In discussing some of the aspects involved in building and interrelating person identities I move from the atomic level, represented by individual identifiers, toward integrated graph structures and dataset interlinking.
Identity markers

Starting at the most fundamental level, when native data is generated, linked data technologies provide an open convention for naming entities – essentially any type of thing to which a unique string of characters called a Uniform Resource Identifier (URI) can be assigned. This method enables humans and machines to read and process discrete units of information, such as proper names, unambiguously. We can consider name resolution – i.e., the association of identifiers with names – the very first step toward constructing a person’s identity in the LOD context, and the URI the identity marker of an individual.

As Gitelman and Jackson (2013) argue, data is never “raw”, as it is the result of cultural processes of generation, curation and interpretation. Even apparently semantically agnostic URIs carry meaning embedded in their syntax. At its most basic, the morphology of a URI reveals the naming authority or namespace it is governed by. For example, the jazz artist Ella Fitzgerald is identified by the following URI in the Library of Congress Name Authority File (LC/NAF): http://id.loc.gov/authorities/names/n83021406. The information included in a namespace specification, in the form of elements and attributes, determines what semantics can be associated with a person entity and thus what statements can be made about that person. This also means that each identifier – and thus its namespace – inherently carries with it a set of assumptions and views of the world. The general rule for inclusion in bibliographic name authorities, such as the LC/NAF or the aggregated Virtual International Authority File (VIAF), is for an individual to be an author (or other contributor), a subject, or part of a resource’s title.
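The observation above, that the morphology of a URI reveals its naming authority, can be illustrated with a minimal Python sketch. The function name `describe_uri` and the convention of treating everything up to the last slash as the namespace are assumptions made for illustration; real namespaces do not always follow this pattern.

```python
from urllib.parse import urlparse


def describe_uri(uri: str) -> dict:
    """Split a URI into naming authority (host), namespace and local identifier.

    Assumption: the namespace is everything up to the last "/" of the path.
    """
    parts = urlparse(uri)
    namespace_path, _, local_id = parts.path.rpartition("/")
    return {
        "authority": parts.netloc,
        "namespace": f"{parts.scheme}://{parts.netloc}{namespace_path}/",
        "local_id": local_id,
    }


# The LC/NAF URI for Ella Fitzgerald quoted in the text.
info = describe_uri("http://id.loc.gov/authorities/names/n83021406")
print(info)
```

For the LC/NAF example, the host `id.loc.gov` points to the Library of Congress as naming authority, while the remaining path segments identify the name authority namespace and the local record identifier.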
The authority records associated with a person entity, while they have recently come to include additional properties beyond those required for disambiguation (e.g., birth and death dates), are still semantically limited to a biblio-centric view of that entity. New and fast-rising representation practices underpinning the linked data paradigm, however, have started to disrupt and expand the author-centered descriptive model, opening up new opportunities to express and reveal more articulated aspects of person entities and thus enabling richer and more complex identity construction.

Developed to support traditional practices of bibliographic control centered on proper names, library name authority services are broadly leveraged as sources of URIs, especially in LOD development for cultural heritage and libraries, archives and museums (LAMs). The contribution of bibliographic data to LOD development is indeed remarkable, primarily due to the wealth and quality of the data. These authorities fall short, however, when it comes to representing cultural contexts broader than bibliographic ones.

These limitations became apparent when, a few years ago, we began a linked data project centered on the domain of jazz music. The project, called Linked Jazz [1], began in 2012 with the primary goal of representing the relationships within the community of jazz musicians as recorded in oral histories from various jazz history archives. It has become a testbed for the development of various LOD applications and it is used throughout this paper as a source of examples and use cases. One of the first steps in the development of Linked Jazz was to identify jazz artists from textual primary sources and assign them URIs. At that time, there were far fewer sources of authoritative URIs we could rely on for naming artists. Bibliographic name authorities could supply only a portion of the identifiers needed for the thousands of artists that emerged from the archival documents we were mining.
More general-purpose linked data hubs (e.g., DBpedia [2]) needed to be harvested in combination with domain-specific ones (e.g., MusicBrainz [3]) to identify the many jazz artists who did not have a place in traditional name authorities. In a few instances, new identifiers had to be created from scratch.

[1] https://linkedjazz.org/.
[2] http://wiki.dbpedia.org/.
[3] https://musicbrainz.org/.

As discussed earlier, the definition of an individual begins by assigning to a person name – a label made up of a string of characters – an identifier that represents a person entity. This process of identity management often includes the disambiguation of a name through its mapping to name authorities. Key to identity management is the use of the predicate owl:sameAs, the most common mechanism to express equivalence and reconcile name variants, as well as to co-reference identifiers from different namespaces. An example of co-referencing is shown in Figure 1 for the jazz musician Ella Fitzgerald.

Figure 1: Co-referencing example for person entity Ella Fitzgerald

In this example, various identifiers from general and domain-specific linked data sources are associated with the entity Ella Fitzgerald, creating multiple access points and making it possible to combine different layers of description about the same entity and to view the artist through multiple perspectives. With artists of the significance and popularity of Ella Fitzgerald, our running example, assigning URIs, even from multiple linked data sources, is rather straightforward. Entity resolution is instead problematic when it comes to lesser-known figures. This issue is particularly acute within archival collections, where resources are often populated with names of local or less well-known people for whom authoritative or even public identifiers are not available.
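The co-referencing mechanism described above can be sketched in a few lines of Python: owl:sameAs statements are treated as equivalence links, and identifiers that are directly or transitively linked are grouped into one cluster per person. This is an illustrative sketch only, not the Linked Jazz implementation. The Linked Jazz and DBpedia URIs for Ella Fitzgerald used below are assumed for illustration, and the grouping uses a simple union-find structure.

```python
from collections import defaultdict

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


def coreference_clusters(triples):
    """Group URIs connected (directly or transitively) by owl:sameAs."""
    parent = {}

    def find(uri):
        # Return the representative of the cluster that contains uri.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for subject, predicate, obj in triples:
        if predicate == OWL_SAME_AS:  # other predicates are ignored
            parent[find(subject)] = find(obj)

    clusters = defaultdict(set)
    for uri in list(parent):
        clusters[find(uri)].add(uri)
    return list(clusters.values())


# Illustrative identifiers: only the LC/NAF URI is quoted in the paper.
lj = "http://linkedjazz.org/resource/Ella_Fitzgerald"  # assumed local URI
lc = "http://id.loc.gov/authorities/names/n83021406"
db = "http://dbpedia.org/resource/Ella_Fitzgerald"     # assumed DBpedia URI

triples = [
    (lj, OWL_SAME_AS, lc),
    (lc, OWL_SAME_AS, db),
    (lj, FOAF_NAME, "Ella Fitzgerald"),
]
clusters = coreference_clusters(triples)
print(clusters)
```

All three identifiers end up in a single cluster, so descriptions published under any of them can be read as statements about the same person entity.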
For those outliers who do not conform to the criteria of inclusion in common name authorities – e.g., because they don’t have recorded contributions – a different process of identity construction is required. While linked data publishing practices encourage the reuse of semantics whenever possible, new URIs can be minted for entities that don’t have one available. This was the case for jazz musicians who had not recorded with a major label or who had not reached a level of public recognition, but whose existence needed to be accounted for as they were mentioned in source documents included in the Linked Jazz sample collection. We dealt with occurrences of people who didn’t have a match anywhere by minting URIs for them in the Linked Jazz namespace (e.g., http://linkedjazz.org/resource/Lynn_Grissett).

Determining whether or not to make an individual surface from obscurity and become part of the fabric of cultural linked data raises a number of theoretical, even existential, questions. There are also practical considerations involved in the act of URI creation, including ensuring that evidential documentation is provided to justify the minting and that the naming agency takes responsibility for managing its local URIs for persistence and traceability. Issues surrounding minted identifiers and local vocabularies have just begun to be discussed in the LAM community, mostly focusing on methodology and technical development. Broader conversations are also needed to address the implications, for example, of giving digital life to marginal individuals for digital archival practices, history research and models of historiography. Minting URIs for lesser-known musicians can indeed be a way to relate local special collections to the broader global archives of cultural memory. Moreover, it can be a powerful way to create bridges that link the margins to the center, giving diverse narratives the possibility to emerge and be included.
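As a rough illustration of the practical considerations above, the following Python sketch mints a local URI only when supporting evidence is supplied, and keeps a registry so that minted identifiers stay traceable. The namespace matches the example in the text. The slug rule, the numeric suffix used on collisions, the `MintedIdentifier` record and the evidence string are all hypothetical and do not describe the Linked Jazz project's actual procedure.

```python
import re
import unicodedata
from dataclasses import dataclass

LINKED_JAZZ_NS = "http://linkedjazz.org/resource/"


@dataclass(frozen=True)
class MintedIdentifier:
    uri: str
    label: str
    evidence: str  # citation of the source document that mentions the person


def mint_uri(name: str, evidence: str, registry: dict) -> MintedIdentifier:
    """Mint a local URI for a person name, requiring documentary evidence."""
    if not evidence.strip():
        raise ValueError("minting requires evidential documentation")
    # Strip accents, then turn runs of other characters into underscores.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    slug = re.sub(r"[^A-Za-z0-9]+", "_", ascii_name).strip("_")
    candidate = LINKED_JAZZ_NS + slug
    suffix = 2
    while candidate in registry:  # hypothetical rule for homonyms
        candidate = f"{LINKED_JAZZ_NS}{slug}_{suffix}"
        suffix += 1
    record = MintedIdentifier(uri=candidate, label=name, evidence=evidence)
    registry[candidate] = record
    return record


registry = {}
first = mint_uri("Lynn Grissett", "oral history transcript (hypothetical citation)", registry)
second = mint_uri("Lynn Grissett", "concert programme (hypothetical citation)", registry)
print(first.uri)
print(second.uri)
```

Keeping the evidence next to the URI is one simple way to document why an identifier was created; a production service would also need a persistence policy for the namespace itself.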
Semantic expansion

A person means different things in different contexts, and linked data principles and techniques make it possible to define identities in an expanded and multifaceted way thanks to the mixing and matching of predicates from different conceptual models – schemas, vocabularies or ontologies – and through the co-referencing of multiple URIs. As discussed earlier, each URI inherently carries the semantics enabled by its naming standard. As one of the key principles of linked data development (Berners Lee 2006), URIs have to be dereferenceable; in other words, they must resolve to an HTML page that can be read by machines and humans alike. Dereferenceable pages display a wealth of information about an entity in the form of linked data statements. When it comes to a person entity, we can look at their dereferenceable pages as a sort of identity card where an array of RDF predicates makes up a person’s profile while linking that entity to external data sources and to the entire linked data cloud.

The semantic richness of such identity cards varies from source to source. Even a coarse comparison of the dereferenceable pages for Ella Fitzgerald from common linked data sources conveys a general sense of the breadth and depth of the artist’s representation [4]. The power of description has been greatly enhanced by new community-driven linked data platforms such as DBpedia and Wikidata [5]. The latter, a recent development of the Wikimedia Foundation, enables the direct contribution of data from volunteers with relatively low barriers. Recognized as authoritative and working together with other more institutionalized linked data services, DBpedia and Wikidata are rapidly becoming powerful tools for diversifying and expanding the linked data ecosystem.
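Dereferencing, as described above, amounts in practice to an HTTP request for the URI. A common linked data convention is content negotiation, where the client states in the `Accept` header whether it wants a human-readable page or an RDF serialization. The Python sketch below only prepares such a request for the LC/NAF URI quoted earlier. Whether a given server honors the requested media type is an assumption to verify against that service, and the helper names are illustrative.

```python
from urllib.request import Request, urlopen


def build_rdf_request(uri: str, media_type: str = "text/turtle") -> Request:
    """Prepare (but do not send) an HTTP GET asking for an RDF serialization."""
    return Request(uri, headers={"Accept": media_type})


def dereference(uri: str, media_type: str = "text/turtle", timeout: float = 10.0) -> str:
    """Send the request and return the response body as text (needs network access)."""
    with urlopen(build_rdf_request(uri, media_type), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Only the request is built here; dereference() is not called.
request = build_rdf_request("http://id.loc.gov/authorities/names/n83021406")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Calling `dereference` with `media_type="text/html"` would instead ask for the page a human reader sees, which is the "identity card" view discussed in the text.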
Recently, several new name standards and services have emerged with related databases representing new contexts, from ORCID [6] for researchers and the global International Standard Name Identifier (ISNI) [7] to the Social Networks and Archival Context (SNAC) [8] for archives. They are greatly expanding the range of people that can now be identified in the linked data environment as well as the scope of the information associated with them. Because identity management is crucial to linked data development, name authority initiatives are proliferating to a point where issues of alignment and coordination among them are prompting calls for joint efforts to promote convergence (Deupi and Eckman 2016).

[4] Examples of dereferenceable pages for Ella Fitzgerald from DBpedia: http://dbpedia.org/page/Ella_Fitzgerald; Wikidata: https://www.wikidata.org/wiki/Q1768; Union List of Artist Names (ULAN): http://www.getty.edu/vow/ULANFullDisplay?find=&role=&nation=&subjectid=500355437.
[5] https://www.wikidata.org/wiki/Wikidata:Main_Page.
[6] https://orcid.org/.
[7] http://www.isni.org/.
[8] http://snaccooperative.org/.

The need for semantic harmonization and consistency for LOD properties describing people once again became evident in the context of Linked Jazz when we focused on gender data for a study investigating women in jazz. It is axiomatic that the types of descriptive attributes – RDF predicates – available, as well as the quality and consistency of their values, have a direct impact on how we can query person data – any data indeed – and how we can ultimately use the data to support discovery and analysis. In other words, the possibility to computationally leverage a broad range of properties makes it possible to support new and unprecedented lines of inquiry.
A research resource for music historians and students, the Linked Jazz dataset includes a curated knowledge graph of over 2,000 musicians whose personal and professional relationships can be explored and queried. The dataset could be very useful for investigating the historical role of women in jazz and for helping music historians answer research questions about the influences, reputation and authority of women in jazz. This scenario, suggested by actual jazz historians, prompted us to prepare the dataset to support such a line of inquiry, including network analysis for gender ratio and distribution. A complex and problematic construct that defines person identities, gender is indeed a necessary attribute that all the person entities in our dataset needed to have as a preliminary condition for analyzing our data through this specific lens.

What was deemed a rather straightforward task – i.e., harvesting and ingesting a simple “demographic” attribute such as gender – turned out to be more complex and labor-intensive than expected. At the time of the study (2014), gathering gender data from reliable linked data sources required significant effort and expertise, as this data was sparse and inconsistent. Iterative cycles of data harvesting were needed, and multiple rounds of revision and version control had to be performed to identify and correct errors. In bibliographic authorities, for example, gender values, an optional attribute, were often missing. A discussion is underway in the library community on issues surrounding gender as a descriptive attribute in the context of new cataloging practices (Billey, Drabinski and Roberto 2014). DBpedia, another data source used in this study, also lacked a property for gender within its model. Instead, gender had to be discerned from the subjects associated with a person, which required parsing and extraction.
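To give a sense of what discerning gender from subjects can involve, here is a minimal Python sketch that scans subject (category) URIs for gendered keywords. It is not the procedure used in the study: the keyword lists, the category URIs and the fallback value `"unknown"` are assumptions for illustration, and keyword matching of this kind is error-prone and would need manual review.

```python
import re

# Hypothetical keyword patterns; word boundaries keep "male" from matching "female".
FEMALE = re.compile(r"\b(female|women|woman)\b", re.IGNORECASE)
MALE = re.compile(r"\b(male|men|man)\b", re.IGNORECASE)


def infer_gender(subject_uris):
    """Guess a gender value from the labels embedded in subject URIs."""
    labels = [
        uri.rsplit("/", 1)[-1].replace("Category:", "").replace("_", " ")
        for uri in subject_uris
    ]
    female = any(FEMALE.search(label) for label in labels)
    male = any(MALE.search(label) for label in labels)
    if female and not male:
        return "female"
    if male and not female:
        return "male"
    return "unknown"  # no signal, or conflicting signals


# Illustrative subject URIs, not taken from the study's data.
print(infer_gender(["http://dbpedia.org/resource/Category:American_female_jazz_singers"]))
print(infer_gender(["http://dbpedia.org/resource/Category:Jazz_pianists"]))
```

Conflicting or absent signals fall back to `"unknown"`, which is why several sources and rounds of manual revision are needed before such values can support analysis.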
Eventually, through triangulation between multiple sources, we were able to acquire gender information for 75% of our target list of artists (Pattuelli, Hwang and Miller 2017). It is likely that if we had to replicate the study today we could rely on Wikidata as our main data source. Not only does Wikidata provide rich and consistent person-centered data, but when it comes to gender, it conveys a broader representation of gender types than any other linked data service. While the sources we relied on were limited to binary values (male and female) with the option of unknown to address uncertainty or missing information, Wikidata provides multiple options for gender variance. It also includes temporal qualifiers for making gender a time-dependent value, taking transitions into account.

Semantic stratification

The full potential of linked data is achieved when data is interlinked. As discussed earlier, linked open data standards offer a suitable technical platform for semantic stratifications through the combination of predicates from multiple sources. The expressive limitations of an individual RDF vocabulary are largely offset by the endless possibilities for data interconnection and aggregation. This opens up opportunities for layered and multidimensional representations of individual and collective identities. A powerful way to enrich and add complexity to the representation of entities is by combining diverse external datasets, such as linked data graphs, through data mashups. Data integration through interlinking from multiple sources provides the ultimate strategy to enhance resource discovery and move toward the creation of new knowledge. Case studies of dataset mashups are drawn, once again, from Linked Jazz to illustrate the usefulness of this approach. A project using Performance History Data from Carnegie Hall (recently converted to LOD [9]) was our first attempt at dataset integration.
Through a series of Python scripts, artists who existed in both the Carnegie Hall and Linked Jazz datasets were identified. In other words, musicians who performed at Carnegie Hall and were recorded in the Linked Jazz dataset were associated through the equivalence property owl:sameAs, bringing together properties from both sources for those shared person entities (Figure 2). Personal and professional descriptions were combined with information about performances, resulting in an enhanced and integrated knowledge base that could be seamlessly queried. Technical details on the process are discussed in Sistrunk (2016, December 5).

[9] http://data.carnegiehall.org/.

Figure 2: Mashup of Carnegie Hall Performance History Data and Linked Jazz datasets

Mashups have also emerged outside our own project, in the spirit of linked open data, where “…the coolest use of your data will be thought of by somebody else” [10]. For example, the Australian linked data project JazzCats (Jazz Collection of Aggregated Triples) [11] aggregated collections of RDF triples to trace performance history. JazzCats combines discography and granular performance data (e.g., solos including pitch, key, and chord) with interpersonal relationships derived from Linked Jazz to “bridge previously unconnected but complementary information about jazz music” (Bangert, Abdul-Rahman and Nurmikko-Fuller 2017). The project JazzTube [12], a joint initiative between the Hochschule für Musik Franz Liszt Weimar (University of Music Franz Liszt Weimar) and the International Audio Laboratories Erlangen, connected annotations of jazz solos with discographies, including artist data from the Linked Jazz dataset, to represent the network of interpersonal relationships between musicians who perform solo [13]. These are just a few examples of how bringing different datasets together and creating semantic bridges across diverse information spaces can open up an infinite field of connections between individuals, enabling new, unanticipated discovery. In a more diversified discovery context, people and their historical and social contexts can emerge and be more fully understood through a composite of perspectives rather than through a single narrative.

[10] Paul Walk, 23 July 2007, http://blog.paulwalk.net/2007/07/23/.
[11] http://jazzcats.oerc.ox.ac.uk/.
[12] http://mir.audiolabs.uni-erlangen.de/jazztube/about.
[13] http://mir.audiolabs.uni-erlangen.de/jazztube/soloists/.

Conclusion and future directions

Building individual and collective identities in the context of the LOD representation framework is a complex, yet fascinating theme to explore. Pragmatic issues concerning technical development are intertwined with the theoretical aspects of identity definition and inclusion. As linked data is based on a connectionist model, individuals – like other entities populating a networked environment – are shaped through the web of associations that link one to another. Starting from the naming process, where a unique identifier makes a person begin to exist in the linked data space, we have progressed through the process of data interlinking, where descriptive layers from multiple and diverse sources can be interwoven to create rich profiles and support complex queries. Aggregation, integration and mashups of sets of linked data result in new knowledge graphs where people and communities can be viewed through a wider lens and in different contexts.

The aim of this paper is to provide a point of departure for further discussion on this important area of investigation. As technical barriers to linked data development continue to be lowered and more research is conducted, new opportunities will arise to reflect on the complexities inherent in representing individual and collective identities, ultimately strengthening our shared understanding and helping establish best practices.

References

Bangert, D., Abdul-Rahman, A., & Nurmikko-Fuller, T. (2017). JazzCats: a collection of aggregated RDF triples tracing performance history through musicological data. Rich Semantics and Direct Representation for Digital Collections Workshop, ACM Joint Conference on Digital Libraries (JCDL), 2017, Toronto.

Berners Lee, T. (2006). Linked data: design issues. Available at: http://www.w3.org/DesignIssues/LinkedData.html. Accessed 21 January 2018.

Billey, A., Drabinski, E., & Roberto, K. R. (2014). What's Gender Got to Do with It? A Critique of RDA 9.7. Cataloging & Classification Quarterly, 52(4): 412-421.

Deupi, J., & Eckman, C. (2016). Prospects and Strategies for Deep Collaboration in the Galleries, Libraries, Archives, and Museums Sector. Academic Art Museum and Library Summit, 1. Available at: https://scholarlyrepository.miami.edu/con_events_aamls2016/1. Accessed 1 February 2018.

Gitelman, L., & Jackson, V. (2013). Introduction. In L. Gitelman (ed.), ‘Raw data’ is an oxymoron. Cambridge, MA: The MIT Press.

Pattuelli, M. C., Hwang, K., & Miller, M. (2017). Accidental discovery, intentional inquiry: Leveraging linked data to uncover the women of jazz. DSH: Digital Scholarship in the Humanities, 32(4): 918-924.

Sistrunk, H. (2016, December 5). Data Interlinking: Linked Jazz and Carnegie Hall | Linked Jazz [Blog post]. Available at: https://linkedjazz.org/data-interlinking-linked-jazz-and-carnegiehall/. Accessed 12 January 2018.

Keywords

Information Organization, Information access, Societal challenges, Interoperability, Digital age, Information Representation

Abstract

The 15th International ISKO Conference was held in Porto, Portugal, under the theme Challenges and opportunities for KO in the digital age. ISKO has been organizing biennial international conferences since 1990 to promote a space for debate among Knowledge Organization (KO) scholars and practitioners from all over the world.

The topics discussed at the 15th International ISKO Conference cover a wide range of issues that constitute challenges, obstacles and open questions in the field of KO, but that also point to innovative perspectives for this area in a world undergoing constant change due to the digital revolution that unavoidably moulds our society. Accordingly, the three aggregating themes under which papers and posters were submitted are as follows: 1 – Foundations and methods for KO; 2 – Interoperability towards information access; 3 – Societal challenges in KO. In addition to these themes, the inaugural session includes a keynote speech by Prof. David Bawden of City University London, entitled Supporting truth and promoting understanding: knowledge organization and the curation of the infosphere.
