Re: [CODE4LIB] hathitrust research center workset browser [github]

2015-06-02 Thread Eric Lease Morgan
I believe I have created a repository of my HTRC Workset Browser code (shell 
and Python scripts) on GitHub. [1] From the Quick Start section of the README:

  1. Download the software putting the bin and etc directories in the same 
directory.
  2. Change to the directory where the bin and etc directories have been saved.
  3. Build a collection by issuing the following command:

   ./bin/build-corpus.sh thoreau etc/rsync-thoreau.sh

  If all goes well, the Browser will create a new directory named thoreau,
  rsync a bunch o' JSON files from the HathiTrust to your computer, index
  the JSON files, do some textual analysis against the corpus, create a
  simple database ("catalog"), and create a few more reports. You can then
  peruse the files in the newly created thoreau directory. If this worked,
  then repeat the process for the other rsync files found in the etc
  directory.
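
The "repeat the process for the other rsync files" step can be sketched in Python. This is only an illustration, not part of the repository: the function name plan_builds is hypothetical, and it assumes every file in etc follows the rsync-<name>.sh pattern seen in the thoreau example. It prints the commands rather than running them.

```python
from pathlib import Path


def plan_builds(root="."):
    """Return one build command (as an argument list) per rsync file in etc/."""
    root = Path(root)
    build_script = root / "bin" / "build-corpus.sh"
    etc_dir = root / "etc"

    # The Quick Start requires bin and etc to sit side by side.
    if not build_script.is_file() or not etc_dir.is_dir():
        raise FileNotFoundError(
            "expected bin/build-corpus.sh and etc/ in the same directory"
        )

    commands = []
    for rsync_file in sorted(etc_dir.glob("rsync-*.sh")):
        # Assumption: the collection name is whatever follows "rsync-".
        name = rsync_file.stem[len("rsync-"):]
        commands.append(
            ["./bin/build-corpus.sh", name, f"etc/{rsync_file.name}"]
        )
    return commands


if __name__ == "__main__":
    for command in plan_builds():
        print(" ".join(command))
```

Run it from the directory that holds bin and etc, then paste the printed commands into the shell one at a time.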

Probably the first issue people will have is the path to their version of 
Python. (Sigh.)
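
If the scripts fail on the Python path, a quick check like the one below may help. It is a minimal sketch, assuming the problem is an interpreter path in a script that does not match your machine; the command names it tries are guesses, not anything taken from the repository.

```python
import shutil
import sys


def find_interpreters(names=("python", "python2", "python3")):
    """Map each candidate command name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    print("running under:", sys.executable)
    for name, path in find_interpreters().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

Compare the paths it prints with the first line of the Python scripts in bin, and adjust whichever one is easier to change.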

[1] repository - https://github.com/ericleasemorgan/HTRC-Workset-Browser

—
Eric “Git Ignorant” Morgan


Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread Karen Coyle
Right. Which is why *someone* copied all of the Google digitized books 
to the Internet Archive -- someone not associated with the library 
partners. So generally if you cannot download from HT you can find the 
same scan via openlibrary.org. Unfortunately that doesn't help with 
using the tool that ELM has alerted us to.


kc

On 6/1/15 2:19 PM, Jimmy Ghaphery wrote:

> I think we are in agreement (especially about the utility of all things
> HathiTrust). My one point is that any restrictions on digitized public
> domain works, as I understand it, are not related to copyright.


--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: +1-510-435-8234
skype: kcoylenet/+1-510-984-3600


Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread Jimmy Ghaphery
I think we are in agreement (especially about the utility of all things
HathiTrust). My one point is that any restrictions on digitized public
domain works, as I understand it, are not related to copyright.

On Mon, Jun 1, 2015 at 5:00 PM, Terry Reese  wrote:

> >> However, the digitizing agency cannot dictate any copyright
> >>restrictions on the digitized copies once released to the public
>
> The digital objects have not, and as far as I understand, cannot be made
> available to the public if digitized as part of the google books
> digitization project.  Most institutions got very limited use, and
> generally these were tied to their specific, immediate, communities.
> Though, with that said each institution has slightly different terms.  For
> what it's worth, the research center does not make the digital copies
> available for download -- it provides tools for working with data in
> aggregate (worksets) and provides a proof of concept environment
> demonstrating the feasibility of creating a secured data repository with I
> believe the long-term goal of providing data mining for the entire
> hathitrust resources (both within and outside of the public domain).  But
> even as it stands now, the tool has become a fantastic teaching tool when
> talking to instructors and graduate students looking for large data sets to
> work with, that also includes some pretty interesting research algorithms
> for working with the data.
>
> --tr
>
> -Original Message-
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
> Jimmy Ghaphery
> Sent: Monday, June 1, 2015 4:47 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] hathitrust research center workset browser
>
> Thanks Eric for posting the webinar in the other thread.
>
> I am pretty sure that digitizing something in the public domain does not
> change its copyright status, at least in the U.S. The digitizing agency
> certainly has the right to sell, restrict access, watermark, or even keep
> the scans locked up on a thumb drive in a closet. They are not obligated to
> share or to provide the digital files in a re-usable format. However, the
> digitizing agency cannot dictate any copyright restrictions on the
> digitized copies once released to the public.
>
> #iamnotalawyer and welcome correction
>
> best,
>
> Jimmy
>
>
>
> On Mon, Jun 1, 2015 at 12:12 PM, Eric Lease Morgan  wrote:
>
> > On Jun 1, 2015, at 10:58 AM, davesgonechina 
> > wrote:
> >
> > > They just informed me I need a .edu address. Having trouble
> > > understanding the use of the term "public domain" here.
> >
> >   Gung fhpx, naq fbhaqf ernyyl fbeg bs fghcvq!! --RYZ
> >
>
>
>
> --
> Jimmy Ghaphery
> Head, Digital Technologies
> VCU Libraries
> 804-827-3551
>



-- 
Jimmy Ghaphery
Head, Digital Technologies
VCU Libraries
804-827-3551


Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread Terry Reese
>> However, the digitizing agency cannot dictate any copyright 
>>restrictions on the digitized copies once released to the public

The digital objects have not, and as far as I understand, cannot be made 
available to the public if digitized as part of the google books digitization 
project.  Most institutions got very limited use, and generally these were tied 
to their specific, immediate, communities.  Though, with that said each 
institution has slightly different terms.  For what it's worth, the research 
center does not make the digital copies available for download -- it provides 
tools for working with data in aggregate (worksets) and provides a proof of 
concept environment demonstrating the feasibility of creating a secured data 
repository with I believe the long-term goal of providing data mining for the 
entire hathitrust resources (both within and outside of the public domain).  
But even as it stands now, the tool has become a fantastic teaching tool when 
talking to instructors and graduate students looking for large data sets to 
work with, that also includes some pretty interesting research algorithms
for working with the data.

--tr

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jimmy 
Ghaphery
Sent: Monday, June 1, 2015 4:47 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] hathitrust research center workset browser

Thanks Eric for posting the webinar in the other thread.

I am pretty sure that digitizing something in the public domain does not change 
its copyright status, at least in the U.S. The digitizing agency certainly has 
the right to sell, restrict access, watermark, or even keep the scans locked up 
on a thumb drive in a closet. They are not obligated to share or to provide the 
digital files in a re-usable format. However, the digitizing agency cannot 
dictate any copyright restrictions on the digitized copies once released to the 
public.

#iamnotalawyer and welcome correction

best,

Jimmy



On Mon, Jun 1, 2015 at 12:12 PM, Eric Lease Morgan  wrote:

> On Jun 1, 2015, at 10:58 AM, davesgonechina 
> wrote:
>
> > They just informed me I need a .edu address. Having trouble 
> > understanding the use of the term "public domain" here.
>
>   Gung fhpx, naq fbhaqf ernyyl fbeg bs fghcvq!! --RYZ
>



--
Jimmy Ghaphery
Head, Digital Technologies
VCU Libraries
804-827-3551


Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread Jimmy Ghaphery
Thanks Eric for posting the webinar in the other thread.

I am pretty sure that digitizing something in the public domain does not
change its copyright status, at least in the U.S. The digitizing agency
certainly has the right to sell, restrict access, watermark, or even keep
the scans locked up on a thumb drive in a closet. They are not obligated to
share or to provide the digital files in a re-usable format. However, the
digitizing agency cannot dictate any copyright restrictions on the
digitized copies once released to the public.

#iamnotalawyer and welcome correction

best,

Jimmy



On Mon, Jun 1, 2015 at 12:12 PM, Eric Lease Morgan  wrote:

> On Jun 1, 2015, at 10:58 AM, davesgonechina 
> wrote:
>
> > They just informed me I need a .edu address. Having trouble understanding
> > the use of the term "public domain" here.
>
>   Gung fhpx, naq fbhaqf ernyyl fbeg bs fghcvq!! --RYZ
>



-- 
Jimmy Ghaphery
Head, Digital Technologies
VCU Libraries
804-827-3551


Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread Eric Lease Morgan
On Jun 1, 2015, at 10:58 AM, davesgonechina  wrote:

> They just informed me I need a .edu address. Having trouble understanding
> the use of the term "public domain" here.

  Gung fhpx, naq fbhaqf ernyyl fbeg bs fghcvq!! --RYZ


Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread Terry Reese
I know that Robert McDonald lurks around here -- so he could clarify this -- 
but what folks need to realize here is that the research center is providing 
tools that allow research access to materials within the hathitrust that are 
within the public domain.  However, the digitized materials themselves are not 
public domain any more (as I understand it).  These materials, as I understand, 
are governed by the agreements institutions made as part of the google project. 
So, while the materials that the research center is currently providing access 
to are ones identified as within the public domain, access to the research 
center is curated due to those agreements.  Robert or someone else can clarify 
if I've misspoken based on my understanding here.

--tr

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
davesgonechina
Sent: Monday, June 1, 2015 10:58 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] hathitrust research center workset browser

They just informed me I need a .edu address. Having trouble understanding the 
use of the term "public domain" here.

On Mon, Jun 1, 2015, 9:58 PM Eric Lease Morgan  wrote:

> On Jun 1, 2015, at 4:33 AM, davesgonechina 
> wrote:
>
> > If your *institutional* email address is not on their whitelist (not 
> > sure if it is limited to subscribing ones, they don't say) you 
> > cannot register using the signup form, instead you can only request 
> > an account by briefly explaining why you want one. Weird, because 
> > they'd have potentially
> learned
> > more about me if they just let me put my gmail address in the signup
> form.
> >
> > I don't get it - can all users download public domain content? If 
> > they
> give
> > me an account, will I be indistinguishable from a subscribing
> institution?
> > If not, why the extra hoops?
>
>
> Dave, you are the second person to bring this “white listing” issue to 
> my attention. Bummer! Yes, apparently, unless your email address is a 
> part of wider something or another, then you need to be authorized to 
> use the Research Center. Weird! In my opinion, while the Research 
> Center’s tools work, I believe the site suffers from usability issues.
>
> In any event, I have enhanced the auto-generated reports created by my 
> “Browser”, and while they are very textual, I also believe they are 
> insightful. For example, the complete works of:
>
>   * William Ellery Channing - http://bit.ly/browser-channing-about
>   * Jane Austen - http://bit.ly/browser-austen-about
>   * Ralph Waldo Emerson - http://bit.ly/browser-emerson-about
>   * Henry David Thoreau - http://bit.ly/browser-thoreau-about
>
> —
> Eric “Beginning To Suffer From ‘Creeping Featuritis’” Morgan
>


Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread davesgonechina
They just informed me I need a .edu address. Having trouble understanding
the use of the term "public domain" here.

On Mon, Jun 1, 2015, 9:58 PM Eric Lease Morgan  wrote:

> On Jun 1, 2015, at 4:33 AM, davesgonechina 
> wrote:
>
> > If your *institutional* email address is not on their whitelist (not sure
> > if it is limited to subscribing ones, they don't say) you cannot register
> > using the signup form, instead you can only request an account by briefly
> > explaining why you want one. Weird, because they'd have potentially
> learned
> > more about me if they just let me put my gmail address in the signup
> form.
> >
> > I don't get it - can all users download public domain content? If they
> give
> > me an account, will I be indistinguishable from a subscribing
> institution?
> > If not, why the extra hoops?
>
>
> Dave, you are the second person to bring this “white listing” issue to my
> attention. Bummer! Yes, apparently, unless your email address is a part of
> wider something or another, then you need to be authorized to use the
> Research Center. Weird! In my opinion, while the Research Center’s tools
> work, I believe the site suffers from usability issues.
>
> In any event, I have enhanced the auto-generated reports created by my
> “Browser”, and while they are very textual, I also believe they are
> insightful. For example, the complete works of:
>
>   * William Ellery Channing - http://bit.ly/browser-channing-about
>   * Jane Austen - http://bit.ly/browser-austen-about
>   * Ralph Waldo Emerson - http://bit.ly/browser-emerson-about
>   * Henry David Thoreau - http://bit.ly/browser-thoreau-about
>
> —
> Eric “Beginning To Suffer From ‘Creeping Featuritis’” Morgan
>


Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread Eric Lease Morgan
On Jun 1, 2015, at 4:33 AM, davesgonechina  wrote:

> If your *institutional* email address is not on their whitelist (not sure
> if it is limited to subscribing ones, they don't say) you cannot register
> using the signup form, instead you can only request an account by briefly
> explaining why you want one. Weird, because they'd have potentially learned
> more about me if they just let me put my gmail address in the signup form.
> 
> I don't get it - can all users download public domain content? If they give
> me an account, will I be indistinguishable from a subscribing institution?
> If not, why the extra hoops?


Dave, you are the second person to bring this “white listing” issue to my 
attention. Bummer! Yes, apparently, unless your email address is a part of 
wider something or another, then you need to be authorized to use the Research 
Center. Weird! In my opinion, while the Research Center’s tools work, I believe 
the site suffers from usability issues.

In any event, I have enhanced the auto-generated reports created by my 
“Browser”, and while they are very textual, I also believe they are insightful. 
For example, the complete works of:

  * William Ellery Channing - http://bit.ly/browser-channing-about
  * Jane Austen - http://bit.ly/browser-austen-about
  * Ralph Waldo Emerson - http://bit.ly/browser-emerson-about
  * Henry David Thoreau - http://bit.ly/browser-thoreau-about

—
Eric “Beginning To Suffer From ‘Creeping Featuritis’” Morgan


Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread davesgonechina
If your *institutional* email address is not on their whitelist (not sure
if it is limited to subscribing ones, they don't say) you cannot register
using the signup form, instead you can only request an account by briefly
explaining why you want one. Weird, because they'd have potentially learned
more about me if they just let me put my gmail address in the signup form.

I don't get it - can all users download public domain content? If they give
me an account, will I be indistinguishable from a subscribing institution?
If not, why the extra hoops?

On Fri, May 29, 2015 at 1:51 AM, Eric Lease Morgan  wrote:

> On May 27, 2015, at 6:33 PM, Karen Coyle  wrote:
>
> >> In my copious spare time I have hacked together a thing I’m calling the
> HathiTrust Research Center Workset Browser, a (fledgling) tool for doing
> “distant reading” against corpora from the HathiTrust. [0, 1] ...
> >>
> >> 'Want to give it a try? For a limited period of time, go to the
> HathiTrust Research Center Portal, create (refine or identify) a collection
> of personal interest, use the Algorithms tool to export the collection's
> rsync file, and send the file to me. I will feed the rsync file to the
> Browser, and then send you the URL pointing to the results.
> >>
> >> [0] introduction in a blog posting - http://ntrda.me/1FUGP2g
> >> [1] HTRC Workset Browser - http://bit.ly/workset-browser
> >
> > Eric, what happens if you access this from a non-HT institution? When I
> go to HT I am often unable to download public domain titles because they
> aren't available to members of the general public.
>
>
> The short answer is, “Nothing”.
>
> The long answer is… longer. The HathiTrust proper is accessible to
> anybody, but the downloading of public domain content is only available to
> subscribing institutions.
>
> On the other hand, the “Workset Browser” is designed to work off the
> HathiTrust Research Center Portal, not the HathiTrust proper. The Portal is
> located at http://sharc.hathitrust.org From there anybody can search the
> collection of public domain content, create collections, and apply various
> algorithms against collections. One of the algorithms is “create RSYNC
> file” which, in turn, allows you to download bunches o’ metadata describing
> the items in your collection. (There is also a “download as MARC”
> algorithm.) This rsync file is the root of the Workset Browser. Feed the
> Browser a rsync file, and the Browser will mirror content locally, index
> it, and generate reports describing the collection.
>
> Thank you for asking. Many people do not know there is a HathiTrust
> Research Center.
>
> —
> Eric Morgan
>


Re: [CODE4LIB] hathitrust research center workset browser

2015-05-28 Thread Eric Lease Morgan
On May 27, 2015, at 6:33 PM, Karen Coyle  wrote:

>> In my copious spare time I have hacked together a thing I’m calling the 
>> HathiTrust Research Center Workset Browser, a (fledgling) tool for doing 
>> “distant reading” against corpora from the HathiTrust. [0, 1] ...
>> 
>> 'Want to give it a try? For a limited period of time, go to the HathiTrust 
>> Research Center Portal, create (refine or identify) a collection of personal 
>> interest, use the Algorithms tool to export the collection's rsync file, and 
>> send the file to me. I will feed the rsync file to the Browser, and then 
>> send you the URL pointing to the results.
>> 
>> [0] introduction in a blog posting - http://ntrda.me/1FUGP2g
>> [1] HTRC Workset Browser - http://bit.ly/workset-browser
> 
> Eric, what happens if you access this from a non-HT institution? When I go to 
> HT I am often unable to download public domain titles because they aren't 
> available to members of the general public.


The short answer is, “Nothing”.

The long answer is… longer. The HathiTrust proper is accessible to anybody, but 
the downloading of public domain content is only available to subscribing 
institutions.

On the other hand, the “Workset Browser” is designed to work off the HathiTrust 
Research Center Portal, not the HathiTrust proper. The Portal is located at 
http://sharc.hathitrust.org. From there anybody can search the collection of 
public domain content, create collections, and apply various algorithms against 
collections. One of the algorithms is “create RSYNC file” which, in turn, 
allows you to download bunches o’ metadata describing the items in your 
collection. (There is also a “download as MARC” algorithm.) This rsync file is 
the root of the Workset Browser. Feed the Browser an rsync file, and the Browser 
will mirror content locally, index it, and generate reports describing the 
collection. 

Thank you for asking. Many people do not know there is a HathiTrust Research 
Center.

—
Eric Morgan


Re: [CODE4LIB] hathitrust research center workset browser

2015-05-27 Thread Karen Coyle
Eric, what happens if you access this from a non-HT institution? When I 
go to HT I am often unable to download public domain titles because they 
aren't available to members of the general public.


kc


--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: +1-510-435-8234
skype: kcoylenet/+1-510-984-3600


Re: [CODE4LIB] hathitrust research center workset browser [call for worksets]

2015-05-27 Thread Eric Lease Morgan
On May 26, 2015, at 11:30 AM, Eric Lease Morgan  wrote:

> In my copious spare time I have hacked together a thing I’m calling the 
> HathiTrust Research Center Workset Browser, a (fledgling) tool for doing 
> “distant reading” against corpora from the HathiTrust. [0]
> 
>   [0] introductory Workset Browser blog posting - http://ntrda.me/1FUGP2g


Help me put my fledgling Browser through some paces; this is a call for 
HathiTrust Research Center worksets.

For a limited period of time, go to the HathiTrust Research Center Portal, 
create (refine or identify) a collection of personal interest, use the 
Algorithms tool to export the collection's rsync file, and send the file to me. 
[1] I will feed the rsync file to the Browser, and then send you the URL 
pointing to the results. Let’s see what happens.

[1] HathiTrust Research Center Portal - https://sharc.hathitrust.org

—
Eric Morgan


[CODE4LIB] hathitrust research center workset browser

2015-05-26 Thread Eric Lease Morgan
In my copious spare time I have hacked together a thing I’m calling the 
HathiTrust Research Center Workset Browser, a (fledgling) tool for doing 
“distant reading” against corpora from the HathiTrust. [1]

The idea is to: 1) create, refine, or identify a HathiTrust Research Center 
workset of interest — your corpus, 2) feed the workset’s rsync file to the 
Browser, 3) have the Browser download, index, and analyze the corpus, and 4) 
enable the reader to search, browse, and interact with the result of the 
analysis. With varying success, I have done this with a number of worksets 
on topics ranging from literature and philosophy to Rome and cookery. The best 
working examples are the ones from Thoreau and Austen. [2, 3] The others are 
still buggy.

As a further example, the Browser can/will create reports describing the corpus 
as a whole. This analysis includes the size of a corpus measured in pages as 
well as words, date ranges, word frequencies, and selected items of interest 
based on pre-set “themes”: usage of color words, names of “great” authors, and 
a set of timeless ideas. [4] This report is based on more fundamental reports 
such as frequency tables, a “catalog”, and lists of unique words. [5, 6, 7, 8] 
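
To make the fundamental reports concrete, here is a rough Python sketch of a frequency table, a unique-word list, and a "theme" count. It is not the Browser's code: the tokenizer is deliberately naive, the function names are made up for illustration, and COLOR_WORDS is a hypothetical stand-in for whatever list of color words the Browser really uses.

```python
import re
from collections import Counter

# Hypothetical theme list; the Browser's actual color words are not given here.
COLOR_WORDS = {"red", "orange", "yellow", "green", "blue", "white", "black"}


def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters (naive)."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_table(text):
    """Count how often each word occurs."""
    return Counter(tokenize(text))


def unique_words(text):
    """Return the distinct words in alphabetical order."""
    return sorted(set(tokenize(text)))


def theme_counts(text, theme=COLOR_WORDS):
    """Count only the words that belong to a pre-set theme."""
    counts = frequency_table(text)
    return {word: counts[word] for word in sorted(theme) if counts[word]}


if __name__ == "__main__":
    sample = "The red leaf fell on the green moss; the RED sky faded."
    for word, count in frequency_table(sample).most_common(3):
        print(f"{count}\t{word}")
    print(unique_words(sample))
    print(theme_counts(sample))
```

A real corpus would need better tokenization (hyphenation, OCR noise, non-English text), but the shape of the three reports stays the same.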

The whole thing is written in a combination of shell and Python scripts. It 
should run on just about any out-of-the-box Linux or Macintosh computer. Take a 
look at the code. [9] No special libraries needed. (“Famous last words.”) In 
its current state, it is very Unix-y. Everything is done from the command line. 
Lots of plain text files and the exploitation of STDIN and STDOUT. Like a 
Renaissance cartoon, the Browser, in its current state, is only a sketch. Only 
later will a more full-bodied, Web-based interface be created. 

The next steps are numerous and listed in no priority order: putting the whole 
thing on GitHub, outputting the reports in generic formats so other things can 
easily read them, improving the terminal-based search interface, implementing a 
Web-based search interface, writing advanced programs in R that chart and graph 
analysis, providing a means for comparing & contrasting two or more items from a 
corpus, indexing the corpus with a (real) indexer such as Solr, writing a 
“cookbook” describing how to use the browser to do “kewl” things, making the 
metadata of corpora available as Linked Data, etc.

'Want to give it a try? For a limited period of time, go to the HathiTrust 
Research Center Portal, create (refine or identify) a collection of personal 
interest, use the Algorithms tool to export the collection's rsync file, and 
send the file to me. I will feed the rsync file to the Browser, and then send 
you the URL pointing to the results. [10] Let’s see what happens.

Fun with public domain content, text mining, and the definition of 
librarianship.

Links

   [1] HTRC Workset Browser - http://bit.ly/workset-browser
   [2] Thoreau - http://bit.ly/browser-thoreau
   [3] Austen - http://bit.ly/browser-austen
   [4] Thoreau report - http://ntrda.me/1LD3xds
   [5] Thoreau dictionary (frequency list) - http://bit.ly/thoreau-dictionary
   [6] usage of color words in Thoreau - http://bit.ly/thoreau-colors
   [7] unique words in the corpus - http://bit.ly/thoreau-unique
   [8] Thoreau “catalog” - http://bit.ly/thoreau-catalog
   [9] source code - http://ntrda.me/1Q8pPoI
  [10] HathiTrust Research Center - https://sharc.hathitrust.org

— 
Eric Lease Morgan, Librarian
University of Notre Dame