Keynote: J. Stephen Downie
Extracted Features: A Copyright-Sensitive Approach for Digital Library Data Sharing
The HathiTrust Research Center (HTRC) is the research arm of the HathiTrust. As of October 2018, the HathiTrust Digital Library contains 16.8 million volumes (some 5.9 billion scanned pages). HTRC’s mission is to provide “non-consumptive research” access to the HathiTrust collection. The non-consumptive research model is one where researchers can conduct computational analyses against the items found in a given collection but cannot copy, read or redistribute the copyright-restricted materials contained within. As part of the suite of offerings designed to meet its non-consumptive research mission, the HTRC has created and published its Extracted Features (EF) Dataset. The EF Dataset contains page-level computationally derived data that includes such things as unigram word counts, header and footer segmentation, and part-of-speech information (for books published in English and several other languages). The EF Dataset currently covers 15.7 million volumes, 5.7 billion pages, and 2.5 trillion tokens. Because the data found within the HTRC EF Dataset is derived from the underlying digital library content, and not merely copied, it satisfies the main principles of the non-consumptive paradigm and thus can be freely shared with users. This talk will provide an overview of the history, challenges and evolution of the HTRC EF Dataset. It will also show how the EF Dataset is being used by researchers and suggest some possible future development directions. The talk will conclude with a set of proposals and arguments for digital libraries everywhere to think about creating, and then sharing, their own EF Datasets to help maximize the research impact of their collections in a copyright-sensitive manner.
J. Stephen Downie is Associate Dean for Research and a Professor at the School of Information Sciences at the University of Illinois. He is the Illinois Co-Director of the HathiTrust Research Center. He has been an active participant in the digital libraries and digital humanities research domains. He is best known for helping to establish a vibrant music information retrieval research community.
Keynote: Trond Aalberg
Search, interactivity and visualizations in the context of new bibliographic models
The new generation of bibliographic models developed for libraries represents a change from a record-based to an entity-centric perspective on bibliographic data. This holds both opportunities and challenges for implementing new tools for search and exploration. The talk will discuss the development of bibliographic information models and the potential impact of this entity-centric perspective, and will demonstrate and highlight important research and development challenges within this topic.
Trond Aalberg is Professor in Interactive Information Retrieval at the Oslo Metropolitan University and is also affiliated with the Data and Artificial Intelligence research group at the Norwegian University of Science and Technology. He has been involved in the Digital Library community since its early days. His main research interests are methods and techniques to make content available, searchable and accessible in the context of user needs. His current research is focused on search and exploration in the context of new bibliographic models and semantic web data. He has an additional research track on learning technologies and active learning and is one of the project leaders at the Excited Centre for Excellent IT Education.