Re: [CODE4LIB] hathitrust research center workset browser [github]

2015-06-02 Thread Eric Lease Morgan
I believe I have created a repository of my HTRC Workset Browser code (shell 
and Python scripts) on GitHub. [1] From the Quick Start section of the README:

  1. Download the software putting the bin and etc directories in the same 
directory.
  2. Change to the directory where the bin and etc directories have been saved.
  3. Build a collection by issuing the following command:

   ./bin/build-corpus.sh thoreau etc/rsync-thoreau.sh

  If all goes well, the Browser will create a new directory named thoreau,
  rsync a bunch o' JSON files from the HathiTrust to your computer, index
  the JSON files, do some textual analysis against the corpus, create a
  simple database (catalog), and create a few more reports. You can then
  peruse the files in the newly created thoreau directory. If this worked,
  then repeat the process for the other rsync files found in the etc
  directory.
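The last step of the Quick Start (repeating the build for the other rsync files in etc) could be scripted roughly as below. This is only a sketch: it assumes every rsync file follows the rsync-NAME.sh pattern seen in the thoreau example, and the helper names plan_builds and run_builds are hypothetical, not part of the Browser.

```python
import subprocess
from pathlib import Path


def plan_builds(etc_dir="etc", builder="./bin/build-corpus.sh"):
    """Return one build command per rsync file found in etc_dir.

    Assumption: rsync files are named rsync-NAME.sh, as in the
    etc/rsync-thoreau.sh example; NAME becomes the collection name.
    """
    commands = []
    for rsync_file in sorted(Path(etc_dir).glob("rsync-*.sh")):
        name = rsync_file.stem[len("rsync-"):]  # rsync-thoreau -> thoreau
        commands.append([builder, name, rsync_file.as_posix()])
    return commands


def run_builds(commands):
    """Run each build in turn, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling plan_builds() from the directory that holds bin and etc only lists the commands, so they can be inspected first; passing the result to run_builds() then executes them one at a time.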

Probably the first issue people will have is the path to their version of 
Python. (Sigh.)
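
One way to spot the Python path problem before running anything is to read the first line of each script and check that the interpreter it names can be found. The sketch below assumes the scripts live in a bin directory and start with a "#!" line; the function names are hypothetical and nothing here comes from the repository itself.

```python
import os
import shutil
from pathlib import Path


def read_shebang(script):
    """Return the interpreter command on a script's first line, or None."""
    with open(script, "rb") as handle:
        first_line = handle.readline().decode("utf-8", errors="replace").strip()
    if not first_line.startswith("#!"):
        return None
    return first_line[2:].strip()


def interpreter_found(shebang):
    """Report whether the interpreter named by a shebang can be located."""
    parts = shebang.split()
    if not parts:
        return False
    # "#!/usr/bin/env python" defers the lookup to PATH, so check the name
    # that follows env rather than env itself.
    if os.path.basename(parts[0]) == "env" and len(parts) > 1:
        return shutil.which(parts[1]) is not None
    return os.path.isfile(parts[0]) and os.access(parts[0], os.X_OK)


def check_scripts(bin_dir="bin"):
    """Map each script in bin_dir to (shebang, found) if it has a shebang."""
    report = {}
    for script in sorted(Path(bin_dir).iterdir()):
        if script.is_file():
            shebang = read_shebang(script)
            if shebang is not None:
                report[script.name] = (shebang, interpreter_found(shebang))
    return report
```

Any entry whose second value is False points at a script whose first line would need editing to match the local Python installation.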

[1] repository - https://github.com/ericleasemorgan/HTRC-Workset-Browser

—
Eric “Git Ignorant” Morgan


Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread davesgonechina
If your *institutional* email address is not on their whitelist (not sure
if it is limited to subscribing ones, they don't say) you cannot register
using the signup form, instead you can only request an account by briefly
explaining why you want one. Weird, because they'd have potentially learned
more about me if they just let me put my gmail address in the signup form.

I don't get it - can all users download public domain content? If they give
me an account, will I be indistinguishable from a subscribing institution?
If not, why the extra hoops?

On Fri, May 29, 2015 at 1:51 AM, Eric Lease Morgan emor...@nd.edu wrote:

 On May 27, 2015, at 6:33 PM, Karen Coyle li...@kcoyle.net wrote:

  In my copious spare time I have hacked together a thing I’m calling the
 HathiTrust Research Center Workset Browser, a (fledgling) tool for doing
 “distant reading” against corpora from the HathiTrust. [0, 1] ...
 
  'Want to give it a try? For a limited period of time, go to the
 HathiTrust Research Center Portal, create (refine or identify) a collection
 of personal interest, use the Algorithms tool to export the collection's
 rsync file, and send the file to me. I will feed the rsync file to the
 Browser, and then send you the URL pointing to the results.
 
  [0] introduction in a blog posting - http://ntrda.me/1FUGP2g
  [1] HTRC Workset Browser - http://bit.ly/workset-browser
 
  Eric, what happens if you access this from a non-HT institution? When I
 go to HT I am often unable to download public domain titles because they
 aren't available to members of the general public.


 The short answer is, “Nothing”.

 The long answer is… longer. The HathiTrust proper is accessible to
 anybody, but the downloading of public domain content is only available to
 subscribing institutions.

 On the other hand, the “Workset Browser” is designed to work off the
 HathiTrust Research Center Portal, not the HathiTrust proper. The Portal is
 located at http://sharc.hathitrust.org. From there anybody can search the
 collection of public domain content, create collections, and apply various
 algorithms against collections. One of the algorithms is “create RSYNC
 file” which, in turn, allows you to download bunches o’ metadata describing
 the items in your collection. (There is also a “download as MARC”
 algorithm.) This rsync file is the root of the Workset Browser. Feed the
 Browser a rsync file, and the Browser will mirror content locally, index
 it, and generate reports describing the collection.

 Thank you for asking. Many people do not know there is a HathiTrust
 Research Center.

 —
 Eric Morgan



Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread Terry Reese
I know that Robert McDonald lurks around here -- so he could clarify this -- 
but what folks need to realize here is that the research center is providing 
tools that allow research access to materials within the hathitrust that are 
within the public domain.  However, the digitized materials themselves are not 
public domain any more (as I understand it).  These materials, as I understand, 
are governed by the agreements institutions made as part of the google project. 
So, while the materials that the research center is currently providing access 
to are ones identified as within the public domain, access to the research 
center is curated due to those agreements.  Robert or someone else can clarify 
if I've misspoken based on my understanding here.

--tr

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
davesgonechina
Sent: Monday, June 1, 2015 10:58 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] hathitrust research center workset browser

They just informed me I need a .edu address. Having trouble understanding the 
use of the term public domain here.

On Mon, Jun 1, 2015, 9:58 PM Eric Lease Morgan emor...@nd.edu wrote:

 On Jun 1, 2015, at 4:33 AM, davesgonechina davesgonech...@gmail.com
 wrote:

  If your *institutional* email address is not on their whitelist (not 
  sure if it is limited to subscribing ones, they don't say) you 
  cannot register using the signup form, instead you can only request 
  an account by briefly explaining why you want one. Weird, because 
  they'd have potentially
 learned
  more about me if they just let me put my gmail address in the signup
 form.
 
  I don't get it - can all users download public domain content? If 
  they
 give
  me an account, will I be indistinguishable from a subscribing
 institution?
  If not, why the extra hoops?


 Dave, you are the second person to bring this “white listing” issue to 
 my attention. Bummer! Yes, apparently, unless your email address is a 
 part of wider something or another, then you need to be authorized to 
 use the Research Center. Weird! In my opinion, while the Research 
 Center’s tools work, I believe the site suffers from usability issues.

 In any event, I have enhanced the auto-generated reports created by my 
 “Browser”, and while they are very textual, I also believe they are 
 insightful. For example, the complete works of:

   * William Ellery Channing - http://bit.ly/browser-channing-about
   * Jane Austen - http://bit.ly/browser-austen-about
   * Ralph Waldo Emerson - http://bit.ly/browser-emerson-about
   * Henry David Thoreau - http://bit.ly/browser-thoreau-about

 —
 Eric “Beginning To Suffer From ‘Creeping Featuritis’” Morgan



Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread Eric Lease Morgan
On Jun 1, 2015, at 10:58 AM, davesgonechina davesgonech...@gmail.com wrote:

 They just informed me I need a .edu address. Having trouble understanding
 the use of the term public domain here.

  Gung fhpx, naq fbhaqf ernyyl fbeg bs fghcvq!! --RYZ


Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread davesgonechina
They just informed me I need a .edu address. Having trouble understanding
the use of the term public domain here.

On Mon, Jun 1, 2015, 9:58 PM Eric Lease Morgan emor...@nd.edu wrote:

 On Jun 1, 2015, at 4:33 AM, davesgonechina davesgonech...@gmail.com
 wrote:

  If your *institutional* email address is not on their whitelist (not sure
  if it is limited to subscribing ones, they don't say) you cannot register
  using the signup form, instead you can only request an account by briefly
  explaining why you want one. Weird, because they'd have potentially
 learned
  more about me if they just let me put my gmail address in the signup
 form.
 
  I don't get it - can all users download public domain content? If they
 give
  me an account, will I be indistinguishable from a subscribing
 institution?
  If not, why the extra hoops?


 Dave, you are the second person to bring this “white listing” issue to my
 attention. Bummer! Yes, apparently, unless your email address is a part of
 wider something or another, then you need to be authorized to use the
 Research Center. Weird! In my opinion, while the Research Center’s tools
 work, I believe the site suffers from usability issues.

 In any event, I have enhanced the auto-generated reports created by my
 “Browser”, and while they are very textual, I also believe they are
 insightful. For example, the complete works of:

   * William Ellery Channing - http://bit.ly/browser-channing-about
   * Jane Austen - http://bit.ly/browser-austen-about
   * Ralph Waldo Emerson - http://bit.ly/browser-emerson-about
   * Henry David Thoreau - http://bit.ly/browser-thoreau-about

 —
 Eric “Beginning To Suffer From ‘Creeping Featuritis’” Morgan



Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread Eric Lease Morgan
On Jun 1, 2015, at 4:33 AM, davesgonechina davesgonech...@gmail.com wrote:

 If your *institutional* email address is not on their whitelist (not sure
 if it is limited to subscribing ones, they don't say) you cannot register
 using the signup form, instead you can only request an account by briefly
 explaining why you want one. Weird, because they'd have potentially learned
 more about me if they just let me put my gmail address in the signup form.
 
 I don't get it - can all users download public domain content? If they give
 me an account, will I be indistinguishable from a subscribing institution?
 If not, why the extra hoops?


Dave, you are the second person to bring this “white listing” issue to my 
attention. Bummer! Yes, apparently, unless your email address is a part of 
wider something or another, then you need to be authorized to use the Research 
Center. Weird! In my opinion, while the Research Center’s tools work, I believe 
the site suffers from usability issues.

In any event, I have enhanced the auto-generated reports created by my 
“Browser”, and while they are very textual, I also believe they are insightful. 
For example, the complete works of:

  * William Ellery Channing - http://bit.ly/browser-channing-about
  * Jane Austen - http://bit.ly/browser-austen-about
  * Ralph Waldo Emerson - http://bit.ly/browser-emerson-about
  * Henry David Thoreau - http://bit.ly/browser-thoreau-about

—
Eric “Beginning To Suffer From ‘Creeping Featuritis’” Morgan


Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread Terry Reese
 However, the digitizing agency cannot dictate any copyright 
restrictions on the digitized copies once released to the public

The digital objects have not, and as far as I understand, cannot be made 
available to the public if digitized as part of the google books digitization 
project.  Most institutions got very limited use, and generally these were tied 
to their specific, immediate, communities.  Though, with that said each 
institution has slightly different terms.  For what it's worth, the research 
center does not make the digital copies available for download -- it provides 
tools for working with data in aggregate (worksets) and provides a proof of 
concept environment demonstrating the feasibility of creating a secured data 
repository with I believe the long-term goal of providing data mining for the 
entire hathitrust resources (both within and outside of the public domain).  
But even as it stands now, the tool has become a fantastic teaching tool when 
talking to instructors and graduate students looking for large data sets to 
work with, that also includes some pretty interesting research algorithms for 
working with the data.

--tr

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jimmy 
Ghaphery
Sent: Monday, June 1, 2015 4:47 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] hathitrust research center workset browser

Thanks Eric for posting the webinar in the other thread.

I am pretty sure that digitizing something in the public domain does not change 
its copyright status, at least in the U.S. The digitizing agency certainly has 
the right to sell, restrict access, watermark, or even keep the scans locked up 
on a thumb drive in a closet. They are not obligated to share or to provide the 
digital files in a re-usable format. However, the digitizing agency cannot 
dictate any copyright restrictions on the digitized copies once released to the 
public.

#iamnotalawyer and welcome correction

best,

Jimmy



On Mon, Jun 1, 2015 at 12:12 PM, Eric Lease Morgan emor...@nd.edu wrote:

 On Jun 1, 2015, at 10:58 AM, davesgonechina davesgonech...@gmail.com
 wrote:

  They just informed me I need a .edu address. Having trouble 
  understanding the use of the term public domain here.

   Gung fhpx, naq fbhaqf ernyyl fbeg bs fghcvq!! --RYZ




--
Jimmy Ghaphery
Head, Digital Technologies
VCU Libraries
804-827-3551


Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread Jimmy Ghaphery
I think we are in agreement (especially about the utility of all things
HathiTrust). My one point is that any restrictions on digitized public
domain works, as I understand it, are not related to copyright.

On Mon, Jun 1, 2015 at 5:00 PM, Terry Reese ree...@gmail.com wrote:

  However, the digitizing agency cannot dictate any copyright
 restrictions on the digitized copies once released to the public

 The digital objects have not, and as far as I understand, cannot be made
 available to the public if digitized as part of the google books
 digitization project.  Most institutions got very limited use, and
 generally these were tied to their specific, immediate, communities.
 Though, with that said each institution has slightly different terms.  For
 what it's worth, the research center does not make the digital copies
 available for download -- it provides tools for working with data in
 aggregate (worksets) and provides a proof of concept environment
 demonstrating the feasibility of creating a secured data repository with I
 believe the long-term goal of providing data mining for the entire
 hathitrust resources (both within and outside of the public domain).  But
 even as it stands now, the tool has become a fantastic teaching tool when
 talking to instructors and graduate students looking for large data sets to
 work with, that also includes some pretty interesting research algorithms
 for working with the data.

 --tr

 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Jimmy Ghaphery
 Sent: Monday, June 1, 2015 4:47 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] hathitrust research center workset browser

 Thanks Eric for posting the webinar in the other thread.

 I am pretty sure that digitizing something in the public domain does not
 change its copyright status, at least in the U.S. The digitizing agency
 certainly has the right to sell, restrict access, watermark, or even keep
 the scans locked up on a thumb drive in a closet. They are not obligated to
 share or to provide the digital files in a re-usable format. However, the
 digitizing agency cannot dictate any copyright restrictions on the
 digitized copies once released to the public.

 #iamnotalawyer and welcome correction

 best,

 Jimmy



 On Mon, Jun 1, 2015 at 12:12 PM, Eric Lease Morgan emor...@nd.edu wrote:

  On Jun 1, 2015, at 10:58 AM, davesgonechina davesgonech...@gmail.com
  wrote:
 
   They just informed me I need a .edu address. Having trouble
   understanding the use of the term public domain here.
 
Gung fhpx, naq fbhaqf ernyyl fbeg bs fghcvq!! --RYZ
 



 --
 Jimmy Ghaphery
 Head, Digital Technologies
 VCU Libraries
 804-827-3551




-- 
Jimmy Ghaphery
Head, Digital Technologies
VCU Libraries
804-827-3551


Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread Karen Coyle
Right. Which is why *someone* copied all of the Google digitized books 
to the Internet Archive -- someone not associated with the library 
partners. So generally if you cannot download from HT you can find the 
same scan via openlibrary.org. Unfortunately that doesn't help with 
using the tool that ELM has alerted us to.


kc

On 6/1/15 2:19 PM, Jimmy Ghaphery wrote:

I think we are in agreement (especially about the utility of all things
HathiTrust). My one point is that any restrictions on digitized public
domain works, as I understand it, are not related to copyright.

On Mon, Jun 1, 2015 at 5:00 PM, Terry Reese ree...@gmail.com wrote:


However, the digitizing agency cannot dictate any copyright
restrictions on the digitized copies once released to the public

The digital objects have not, and as far as I understand, cannot be made
available to the public if digitized as part of the google books
digitization project.  Most institutions got very limited use, and
generally these were tied to their specific, immediate, communities.
Though, with that said each institution has slightly different terms.  For
what it's worth, the research center does not make the digital copies
available for download -- it provides tools for working with data in
aggregate (worksets) and provides a proof of concept environment
demonstrating the feasibility of creating a secured data repository with I
believe the long-term goal of providing data mining for the entire
hathitrust resources (both within and outside of the public domain).  But
even as it stands now, the tool has become a fantastic teaching tool when
talking to instructors and graduate students looking for large data sets to
work with, that also includes some pretty interesting research algorithms
for working with the data.

--tr

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
Jimmy Ghaphery
Sent: Monday, June 1, 2015 4:47 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] hathitrust research center workset browser

Thanks Eric for posting the webinar in the other thread.

I am pretty sure that digitizing something in the public domain does not
change its copyright status, at least in the U.S. The digitizing agency
certainly has the right to sell, restrict access, watermark, or even keep
the scans locked up on a thumb drive in a closet. They are not obligated to
share or to provide the digital files in a re-usable format. However, the
digitizing agency cannot dictate any copyright restrictions on the
digitized copies once released to the public.

#iamnotalawyer and welcome correction

best,

Jimmy



On Mon, Jun 1, 2015 at 12:12 PM, Eric Lease Morgan emor...@nd.edu wrote:


On Jun 1, 2015, at 10:58 AM, davesgonechina davesgonech...@gmail.com
wrote:


They just informed me I need a .edu address. Having trouble
understanding the use of the term public domain here.

   Gung fhpx, naq fbhaqf ernyyl fbeg bs fghcvq!! --RYZ




--
Jimmy Ghaphery
Head, Digital Technologies
VCU Libraries
804-827-3551






--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: +1-510-435-8234
skype: kcoylenet/+1-510-984-3600


Re: [CODE4LIB] hathitrust research center workset browser

2015-05-28 Thread Eric Lease Morgan
On May 27, 2015, at 6:33 PM, Karen Coyle li...@kcoyle.net wrote:

 In my copious spare time I have hacked together a thing I’m calling the 
 HathiTrust Research Center Workset Browser, a (fledgling) tool for doing 
 “distant reading” against corpora from the HathiTrust. [0, 1] ...
 
 'Want to give it a try? For a limited period of time, go to the HathiTrust 
 Research Center Portal, create (refine or identify) a collection of personal 
 interest, use the Algorithms tool to export the collection's rsync file, and 
 send the file to me. I will feed the rsync file to the Browser, and then 
 send you the URL pointing to the results.
 
 [0] introduction in a blog posting - http://ntrda.me/1FUGP2g
 [1] HTRC Workset Browser - http://bit.ly/workset-browser
 
 Eric, what happens if you access this from a non-HT institution? When I go to 
 HT I am often unable to download public domain titles because they aren't 
 available to members of the general public.


The short answer is, “Nothing”.

The long answer is… longer. The HathiTrust proper is accessible to anybody, but 
the downloading of public domain content is only available to subscribing 
institutions.

On the other hand, the “Workset Browser” is designed to work off the HathiTrust 
Research Center Portal, not the HathiTrust proper. The Portal is located at 
http://sharc.hathitrust.org. From there anybody can search the collection of 
public domain content, create collections, and apply various algorithms against 
collections. One of the algorithms is “create RSYNC file” which, in turn, 
allows you to download bunches o’ metadata describing the items in your 
collection. (There is also a “download as MARC” algorithm.) This rsync file is 
the root of the Workset Browser. Feed the Browser a rsync file, and the Browser 
will mirror content locally, index it, and generate reports describing the 
collection. 
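
As a rough illustration of the "index it" step, a very small inverted index can be built in plain Python. This is not the Browser's actual indexing code; the function names and the shape of the input (a dictionary of document id to plain text) are assumptions made for the sketch.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of document ids that contain it.

    documents: dict of {doc_id: plain text}; how the text is extracted
    from the mirrored files is out of scope here.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word in the query."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

A query is answered by intersecting the sets for each query word, which is enough for simple "all of these words" searches over a small corpus.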

Thank you for asking. Many people do not know there is a HathiTrust Research 
Center.

—
Eric Morgan


Re: [CODE4LIB] hathitrust research center workset browser [call for worksets]

2015-05-27 Thread Eric Lease Morgan
On May 26, 2015, at 11:30 AM, Eric Lease Morgan emor...@nd.edu wrote:

 In my copious spare time I have hacked together a thing I’m calling the 
 HathiTrust Research Center Workset Browser, a (fledgling) tool for doing 
 “distant reading” against corpora from the HathiTrust. [0]
 
   [0] introductory Workset Browser blog posting - http://ntrda.me/1FUGP2g


Help me put my fledgling Browser through some paces; this is a call for 
HathiTrust Research Center worksets.

For a limited period of time, go to the HathiTrust Research Center Portal, 
create (refine or identify) a collection of personal interest, use the 
Algorithms tool to export the collection's rsync file, and send the file to me. 
[1] I will feed the rsync file to the Browser, and then send you the URL 
pointing to the results. Let’s see what happens?

[1] HathiTrust Research Center Portal - https://sharc.hathitrust.org

—
Eric Morgan


Re: [CODE4LIB] hathitrust research center workset browser

2015-05-27 Thread Karen Coyle
Eric, what happens if you access this from a non-HT institution? When I 
go to HT I am often unable to download public domain titles because they 
aren't available to members of the general public.


kc

On 5/26/15 8:30 AM, Eric Lease Morgan wrote:

In my copious spare time I have hacked together a thing I’m calling the 
HathiTrust Research Center Workset Browser, a (fledgling) tool for doing 
“distant reading” against corpora from the HathiTrust. [1]

The idea is to: 1) create, refine, or identify a HathiTrust Research Center 
workset of interest — your corpus, 2) feed the workset’s rsync file to the 
Browser, 3) have the Browser download, index, and analyze the corpus, and 4) 
enable the reader to search, browse, and interact with the result of the 
analysis. With varying success, I have done this with a number of worksets 
on topics ranging from literature and philosophy to Rome and cookery. The best 
working examples are the ones from Thoreau and Austen. [2, 3] The others are 
still buggy.

As a further example, the Browser can/will create reports describing the corpus 
as a whole. This analysis includes the size of a corpus measured in pages as 
well as words, date ranges, word frequencies, and selected items of interest 
based on pre-set “themes”: usage of color words, names of “great” authors, and 
a set of timeless ideas. [4] This report is based on more fundamental reports 
such as frequency tables, a “catalog”, and lists of unique words. [5, 6, 7, 8]
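
The more fundamental reports mentioned above (a frequency table, a list of unique words, and counts for a pre-set theme such as color words) can be approximated in a few lines. This is a minimal sketch, not the Browser's own code: the tokenizer is deliberately naive, and the COLOR_WORDS list is a hypothetical stand-in for whatever list the Browser really uses.

```python
import re
from collections import Counter

# Hypothetical theme list; the Browser's own list of color words may differ.
COLOR_WORDS = {"red", "orange", "yellow", "green", "blue", "purple",
               "black", "white", "brown", "gray", "grey"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_table(text):
    """Count how often each word occurs in the text."""
    return Counter(tokenize(text))


def theme_counts(frequencies, theme=COLOR_WORDS):
    """Pull the counts for a pre-set theme out of a frequency table."""
    return {word: count for word, count in frequencies.items() if word in theme}


def unique_words(frequencies):
    """Return the alphabetized list of distinct words."""
    return sorted(frequencies)
```

Because the theme report is derived from the frequency table rather than from the raw text, adding another theme only requires another word list.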

The whole thing is written in a combination of shell and Python scripts. It 
should run on just about any out-of-the-box Linux or Macintosh computer. Take a 
look at the code. [9] No special libraries needed. (“Famous last words.”) In 
its current state, it is very Unix-y. Everything is done from the command line. 
Lots of plain text files and the exploitation of STDIN and STDOUT. Like a 
Renaissance cartoon, the Browser, in its current state, is only a sketch. Only 
later will a more full-bodied, Web-based interface be created.

The next steps are numerous and listed in no priority order: putting the whole 
thing on GitHub, outputting the reports in generic formats so other things can 
easily read them, improving the terminal-based search interface, implementing a 
Web-based search interface, writing advanced programs in R that chart and graph 
analysis, providing a means for comparing and contrasting two or more items from a 
corpus, indexing the corpus with a (real) indexer such as Solr, writing a 
“cookbook” describing how to use the browser to do “kewl” things, making the 
metadata of corpora available as Linked Data, etc.

'Want to give it a try? For a limited period of time, go to the HathiTrust 
Research Center Portal, create (refine or identify) a collection of personal 
interest, use the Algorithms tool to export the collection's rsync file, and 
send the file to me. I will feed the rsync file to the Browser, and then send 
you the URL pointing to the results. [10] Let’s see what happens.

Fun with public domain content, text mining, and the definition of 
librarianship.

Links

[1] HTRC Workset Browser - http://bit.ly/workset-browser
[2] Thoreau - http://bit.ly/browser-thoreau
[3] Austen - http://bit.ly/browser-austen
[4] Thoreau report - http://ntrda.me/1LD3xds
[5] Thoreau dictionary (frequency list) - http://bit.ly/thoreau-dictionary
[6] usage of color words in Thoreau — http://bit.ly/thoreau-colors
[7] unique words in the corpus - http://bit.ly/thoreau-unique
[8] Thoreau “catalog” — http://bit.ly/thoreau-catalog
[9] source code - http://ntrda.me/1Q8pPoI
[10] HathiTrust Research Center - https://sharc.hathitrust.org

—
Eric Lease Morgan, Librarian
University of Notre Dame


--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: +1-510-435-8234
skype: kcoylenet/+1-510-984-3600