The Code4Lib Journal, Issue 33 is now available!

http://journal.code4lib.org/issues/issue33

The Editorial Committee is pleased to submit issue 33 for your summer
reading pleasure. We encourage you to explore this issue, engage in the
comments, and reach out to the authors who contributed their work.

Editorial Introduction – Summer Reading List
by Ron Peterson
http://journal.code4lib.org/articles/11859
New additions for your summer reading list!

Emflix – Gone Baby Gone
by Netanel Ganin
http://journal.code4lib.org/articles/11762
Enthusiasm is no replacement for experience. This article describes a tool
developed at the Emerson College Library by an eager but overzealous
cataloger. Attempting to enhance media discovery in a familiar and
intuitive way, he created a browsable and searchable Netflix-style
interface. Though it may have been an interesting idea, many of the crucial
steps that are involved in this kind of high-concept work were neglected.
This article will explore and explain why the tool ultimately has not been
maintained or updated, and what should have been done differently to ensure
its legacy and continued use.

Introduction to Text Mining with R for Information Professionals
by Monica Maceli
http://journal.code4lib.org/articles/11626
The ‘tm: Text Mining Package’ in the open source statistical software R has
made text analysis techniques easily accessible to both novice and expert
practitioners, providing useful ways of analyzing and understanding large,
unstructured datasets. Such an approach can yield many benefits to
information professionals, particularly those involved in text-heavy
research projects. This article will discuss the functionality and
possibilities of text mining, as well as the basic setup necessary for
novice R users to employ the RStudio integrated development environment
(IDE). Common use cases, such as analyzing a corpus of text documents or
spreadsheet text data, will be covered, as well as the text mining tools
for calculating term frequency, term correlations, clustering, creating
wordclouds, and plotting.

Data for Decision Making: Tracking Your Library’s Needs With TrackRef
by Michael Carlozzi
http://journal.code4lib.org/articles/11740
Library services must adapt to changing patron needs. These adaptations
should be data-driven. This paper reports on the use of TrackRef, an open
source and free web program for managing reference statistics.

Are games a viable solution to crowdsourcing improvements to faulty OCR? –
The Purposeful Gaming and BHL experience
by Max J. Seidman; Dr. Mary Flanagan; Trish Rose-Sandler; Mike Lichtenberg
http://journal.code4lib.org/articles/11781
The Missouri Botanical Garden and partners from Dartmouth, Harvard, the New
York Botanical Garden, and Cornell recently wrapped up a project funded by
IMLS called Purposeful Gaming and BHL: engaging the public in improving and
enhancing access to digital texts
(http://biodivlib.wikispaces.com/Purposeful+Gaming). The goal of the
project was to significantly improve access to digital texts by applying
purposeful gaming to the data enhancement tasks needed for content in the
Biodiversity Heritage Library (BHL). This article will share our approach
in terms of game design choices
and the use of algorithms for verifying the quality of inputs from players
as well as challenges related to transcriptions and marketing. We will
conclude by giving an answer to the question of whether games are a
successful tool for analyzing and improving digital outputs from OCR and
whether we recommend their uptake by libraries and other cultural heritage
institutions.

From Digital Commons to OCLC: A Tailored Approach for Harvesting and
Transforming ETD Metadata into High-Quality Records
by Marielle Veve
http://journal.code4lib.org/articles/11676
The library literature contains many examples of automated and
semi-automated approaches to harvest electronic theses and dissertations
(ETD) metadata from institutional repositories (IR) to the Online Computer
Library Center (OCLC). However, most of these approaches could not be
implemented with the institutional repository software Digital Commons
for various reasons, including proprietary schema incompatibilities and
high-level programming expertise requirements that our institution did not
want to pursue. Only one semi-automated approach found in the library
literature met our requirements for implementation, and even though
it catered to the particular needs of the DSpace IR, it could be
implemented in other IR software if further customizations were applied.
The following paper presents an extension of this semi-automated approach
originally created by Deng and Reese, but customized and adapted to address
the particular needs of the Digital Commons community and updated to
integrate the latest Resource Description & Access (RDA) content standards
for ETDs. Advantages and disadvantages of this workflow are also discussed.

Checking the identity of entities by machine algorithms: the next step to
the Hungarian National Namespace
by Zsolt Bánki, Tibor Mészáros, Márton Németh, András Simon
http://journal.code4lib.org/articles/11765
The redundancy of entities coming from different sources caused problems
during the building of the personal name authorities for the Petőfi Museum
of Literature. It was a top priority to cleanse and unite classificatory
records which have different data content but pertain to the same person
without losing any data. As a first step in 2013, we found matching
identities in approximately 80,000 name records and merged the data content
of these records. In the second phase, a much more complicated algorithm had
to be applied to detect these identities. We cleansed the database by uniting
approximately 36,000 records. The workflow for automatic detection of
authority data tries to follow human intelligence. The database scripts
normalize and examine about 20 kinds of data elements according to
information about dates, localities, occupation and name variations. The
result of pairing the database authority records as potentially redundant
elements was a graph, which the curators of the museum condensed to a tree
by hand. With this, the limit of technological identification was reached.
Further data cleansing requires human intelligence, which can be assisted by
regular computerized monitoring based on the developed algorithm. As a
result, the service containing about 620,000 authority name records will be
an indispensable foundation for the establishment of the National Name
Authorities. This article describes the unification workflow.
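
For readers who want a feel for the pairing step described in this
abstract, here is a minimal sketch in Python. It is not the authors'
algorithm: the record fields (id, name, birth_year), the matching rule
(same normalized name and same birth year), and the sample data are all
assumptions made up for illustration. It normalizes names, pairs records
that look redundant, and groups the pairs into clusters that a curator
could then review.

```python
import unicodedata
from itertools import combinations


def normalize_name(name):
    """Lowercase, strip diacritics and punctuation, and sort the tokens
    so that surname-first and given-name-first forms compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_marks.lower())
    return " ".join(sorted(cleaned.split()))


def candidate_pairs(records):
    """Return id pairs whose normalized name and birth year both match.
    This matching rule is a hypothetical stand-in for the roughly 20 data
    elements mentioned in the abstract."""
    pairs = []
    for a, b in combinations(records, 2):
        same_name = normalize_name(a["name"]) == normalize_name(b["name"])
        year = a.get("birth_year")
        same_year = year is not None and year == b.get("birth_year")
        if same_name and same_year:
            pairs.append((a["id"], b["id"]))
    return pairs


def clusters(record_ids, pairs):
    """Group ids into connected components of the candidate-pair graph
    using a simple union-find structure."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for rid in record_ids:
        groups.setdefault(find(rid), []).append(rid)
    # Only groups with more than one record are potential duplicates.
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Illustrative sample records, not taken from the museum's database.
records = [
    {"id": 1, "name": "Petőfi Sándor", "birth_year": 1823},
    {"id": 2, "name": "Sandor Petofi", "birth_year": 1823},
    {"id": 3, "name": "PETŐFI, Sándor", "birth_year": 1823},
    {"id": 4, "name": "Arany János", "birth_year": 1817},
]

pairs = candidate_pairs(records)
duplicate_groups = clusters([r["id"] for r in records], pairs)
print(duplicate_groups)
```

Comparing every record with every other one is quadratic, so a database
of the size mentioned above would likely need some form of blocking (for
example, only comparing records that share a normalized surname) before
pairing; that refinement is left out of the sketch.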

Metadata Analytics, Visualization, and Optimization: Experiments in
statistical analysis of the Digital Public Library of America (DPLA)
by Corey A. Harper
http://journal.code4lib.org/articles/11752
This paper presents the concepts of metadata assessment and
“quantification” and describes preliminary research results applying these
concepts to metadata from the Digital Public Library of America (DPLA). The
introductory sections provide a technical outline of data pre-processing,
and propose visualization techniques that can help us understand metadata
characteristics in a given context. Example visualizations are shown and
discussed, leading up to the use of “metadata fingerprints” — D3 Star Plots
— to summarize metadata characteristics across multiple fields for
arbitrary groupings of resources. Fingerprints are shown comparing metadata
characteristics for different DPLA “Hubs” and also for used versus unused
resources based on Google Analytics “pageview” counts. The closing sections
introduce the concept of metadata optimization and explore the use of
machine learning techniques to optimize metadata in the context of
large-scale metadata aggregators like DPLA. Various statistical models are
used to predict whether a particular DPLA item is used based only on its
metadata. The article concludes with a discussion of the broad potential
for machine learning and data science in libraries, academic institutions,
and cultural heritage.
