[CODE4LIB] position announcement [tulane university]

2017-03-09 Thread Eric Lease Morgan
[The following position announcement is being forwarded upon request. —ELM]


> We are currently hiring for the Applications Developer III position at the 
> Howard-Tilton Memorial Library at Tulane University located in New Orleans, 
> Louisiana.
> 
> Please see the job details here: http://bit.ly/2nb119e 
> 
> To see a listing of all open positions available at Howard-Tilton, please 
> visit our website: http://library.tulane.edu/about/job-opportunities
> 
> --
> Candace Maurice
> Web Developer
> Howard-Tilton Memorial Library
> Tulane University
> 504.314.7784
> cmaur...@tulane.edu


[CODE4LIB] on hold

2016-07-19 Thread Eric Lease Morgan
As of this message, I’m putting the Code4Lib mailing list “on hold” while the 
list’s configurations and archives get moved from one place to another. ‘More 
soon, and this process will take at least a day. Please be patient. —Eric Lease 
Morgan


Re: [CODE4LIB] code4lib mailing list

2016-07-13 Thread Eric Lease Morgan
> Alas, the Code4Lib mailing list software will most likely need to be migrated 
> before the end of summer…

On Monday Wayne Graham (CLIR/DLF) and I are hoping to migrate the Code4Lib 
mailing list to a different domain. We don’t think any archives, subscriptions, 
or preferences will get lost in the process. (“Famous last words.”) Wish us 
luck. —Eric Morgan


[CODE4LIB] mashcat

2016-07-12 Thread Eric Lease Morgan
The following Mashcat event seems more than apropos to our group:

  We are excited to announce that the second face-to-face Mashcat
  event in North America will be held on January 24th, 2017, in
  downtown Atlanta, Georgia, USA. We invite you to save the date.
  We will be sending out a call for session proposals and opening
  up registration in the late summer and early fall.

  Not sure what Mashcat is? “Mashcat” was originally an event in
  the UK in 2012 aimed at bringing together people working on the
  IT systems side of libraries with those working in cataloguing
  and metadata. Four years later, Mashcat is a loose group of
  metadata specialists, cataloguers, developers and anyone else
  with an interest in how metadata in and around libraries can be
  created, manipulated, used and re-used by computers and software.
  The aim is to work together and bridge the communications gap
  that has sometimes gotten in the way of building the best tools
  we possibly can to manage library data. Among our accomplishments
  in 2016 was holding the first North American face-to-face event
  in Boston in January and running webinars. If you’re unable to
  attend a face-to-face meeting, we will be holding at least one
  more webinar in 2016.

  http://bit.ly/29FuUuY

Actually, the mass-editing of cataloging (MARC) data is something that is 
particularly interesting to me these days. Hand-crafted metadata records are 
nice, but increasingly unscalable.

—
Eric Lease Morgan


Re: [CODE4LIB] date fields

2016-07-12 Thread Eric Lease Morgan
On Jul 11, 2016, at 4:32 PM, Kyle Banerjee <kyle.baner...@gmail.com> wrote:

>> https://github.com/traject/traject/blob/e98fe35f504a2a519412cd28fdd97dc514b603c6/lib/traject/macros/marc21_semantics.rb#L299-L379
> 
> Is the idea that this new field would be stored as MARC in the system (the
> ILS?).
> 
> If so, the 9xx solution already suggested is probably the way to go if the
> 008 route suggested earlier won't work for you. Otherwise, you run a risk
> that some form of record maintenance will blow out all your changes.
> 
> The actual use case you have in mind makes a big difference in what paths
> make sense, so more detail might be helpful.


Thank you, one & all, for the input & feedback. After thinking about it for a 
while, I believe I will save my normalized dates in a local (9xx) field of some 
sort.

My use case? As a part of the "Catholic Portal", I aggregate many different 
types of metadata and essentially create a union catalog of rare and 
infrequently held materials of a Catholic nature. [1] In an effort to measure 
“rarity” I've counted and tabulated the frequency of a given title in WorldCat. 
I now want to measure the age of the materials in the collection. To do that I 
need to normalize dates and evaluate them. Ideally I would save the normalized 
dates back in MARC and give the MARC back to Portal member libraries, but 
since there is really no standard field for such a value, anything I choose is 
all but arbitrary. I’ll use some 9xx field, just to make things easy. I can 
always (and easily) change it later.

[1] "Catholic Portal” - http://www.catholicresearch.net

—
Eric Lease Morgan


[CODE4LIB] date fields

2016-07-11 Thread Eric Lease Morgan
I’m looking for date fields.

Or more specifically, I have been given a pile o’ MARC records, and I will be 
extracting for analysis the values of dates from MARC 260$c. From the resulting 
set of values — which will include all sorts of string values ([1900], c1900, 
190?, 19—, 1900, etc.) — I plan to normalize things to integers like 1900. I 
then want to save/store these normalized values back to my local set of MARC 
records. I will then re-read the data to create things like timelines, to 
answer questions like “How old is old?”, or to “simply” look for trends in the 
data.
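
For what it is worth, the normalization itself can be sketched in a few lines of 
Python; the regular expression and the sample values below are illustrations only, 
not a finished solution:

  import re

  def normalize_date(value):
      """Return the first four-digit year found in a 260$c string, or None."""
      # treat '?' and '-' as unknown digits and replace them with zeros
      cleaned = re.sub(r'[?\-]', '0', value)
      match = re.search(r'\d{4}', cleaned)
      return int(match.group(0)) if match else None

  # a few representative 260$c values
  for value in ['[1900]', 'c1900', '190?', '19--', '1900.']:
      print(value, '->', normalize_date(value))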

What field would y’all suggest I use to store my normalized date content?

—
Eric Morgan


Re: [CODE4LIB] code4lib mailing list [clir]

2016-06-15 Thread Eric Lease Morgan
On Jun 7, 2016, at 10:11 AM, Eric Lease Morgan <emor...@nd.edu> wrote:

>>> Alas, the Code4Lib mailing list software will most likely need to be 
>>> migrated before the end of summer, and I’m proposing a number of possible 
>>> options for the list’s continued existence...
>> 
>> Our list — Code4Lib — will be migrating to the Digital Library Federation 
>> (DLF) sometime in the near future. 
> 
> This is a gentle reminder that the Code4Lib mailing list will be migrating to 
> a different address sometime in the very near future. Specifically, it will 
> be migrating to the Digital Library Federation. I suspect this work will be 
> finished in less than thirty days, and when I know the exact address of the 
> new list, I will share it here.
> 
> Thanks go to the DLF in general, and specifically Wayne Graham and Bethany 
> Nowviskie for enabling this to happen. “Thanks!”


Yet again, this is a reminder that the mailing list will be moving, and I think 
the list's address will be associated with CLIR (Council on Library and 
Information Resources), which is the host of the DLF (Digital Library 
Federation). [1, 2]

Wayne Graham & I (actually, mostly Wayne) have been practicing with the 
migration process. We have managed to move the archives and the subscriber list 
(complete with subscription preferences) to a new machine. We — Wayne & I — now 
need to coordinate to do the move for real. To do so we will put the mailing 
list on “hold”, copy things from one computer to another, and then “release” 
the new implementation. The only things that will get lost in the migration 
process are messages sent to the older implementation. Consequently, people 
will need to start sending messages to a new address. I’m not sure, but this 
migration might start happening very early next week — June 20. 

Now back to our regularly scheduled programming (all puns intended).

[1] CLIR - http://clir.org
[2] DLF - https://www.diglib.org

—
Eric Lease Morgan


Re: [CODE4LIB] Formalizing Code4Lib?

2016-06-14 Thread Eric Lease Morgan
On Jun 14, 2016, at 8:01 PM, Coral Sheldon-Hess <co...@sheldon-hess.org> wrote:

> Now, there kind of is. By my count, we have 4 volunteers. Chad, Tom, Galen,
> and me. Anyone else?

  Coral, please sign me up. I’d like to learn more. —Eric Lease Morgan


Re: [CODE4LIB] Formalizing Code4Lib? [diy]

2016-06-10 Thread Eric Lease Morgan
On Jun 9, 2016, at 7:55 PM, Coral Sheldon-Hess  wrote:

> One note about what we're discussing: when we talk about just doing the
> regional events (and I mean beyond 2017, which will be a special case if a
> host city can't step in), we need to realize that we have a lot of members
> who aren't in a Code4Lib region.
> 
> You might think I'm talking about Alaska, because that's where I lived when
> I first came to a Code4Lib conference. And that's certainly one place,
> along with Hawaii, that would be left out.
> 
> But even living in Pittsburgh, I'm not in a Code4Lib region, that I can
> tell. Pittsburgh isn't in the midwest, and we also aren't part of the
> tri-state region that Philly's in. I'm employed (part-time/remote) in the
> DC/MD region, so if I can afford the drive and hotel, that's probably the
> one I'd pick right now. I guess?
> 
> So, even landlocked in the continental US, it's possible not to have a
> region.
> 
> More importantly, though: my understanding is that our international
> members are fairly spread out -- maybe Code4Lib Japan being an exception?
> -- so, even ignoring weird cases like Pittsburgh, we stand to lose some
> really fantastic contributors to our community if we drop to regional-only.
> 
> Just something else to consider.
> - Coral


Interesting. Consider searching one or more of the existing Code4Lib mailing 
list archives for things Pittsburgh:

  * https://www.mail-archive.com/code4lib@listserv.nd.edu/
  * http://serials.infomotions.com/code4lib/
  * https://listserv.nd.edu/cgi-bin/wa?A0=CODE4LIB

I’d be willing to bet you can identify six or seven Code4Lib’ers in the results. 
You could then suggest a “meet-up”, a get-together over lunch, or have them 
visit you in your space or a nearby public library. Even if there are only 
three of you, then things will get started, and it will grow from there. I 
promise. —Eric Morgan


Re: [CODE4LIB] Formalizing Code4Lib? [diy]

2016-06-08 Thread Eric Lease Morgan
ings to eat. Make reservations in restaurants for larger groups.

  8) Do the event - On the day of the event, make sure you have name tags, 
lists of attendees, and logistical instructions such as connecting to the 
wi-fi. Have volunteers who want to help greet attendees, organize eating 
events, or lead tours. That is easy. Libraries are full of “service-oriented 
people”. Use the agenda as an outline, not a rule book. Smile. Breathe. Have 
fun. Play host to a party. Understand the problem you are trying to solve — 
communication & sharing. Let it flow. Don’t constantly ask yourself, “What if…” 
because if you do, then I’m going to ask you, “What are you going to do if a 
cow comes into the library?” and I’m going to expect an answer. 

  9) Record the event - Have people take notes on the sessions, and then hope 
they write up their notes for later publishing. Video streaming is expensive 
and over the top. Gather up people’s presentation materials and republish them.

 10) End the event - Graciously say good-bye, clean up, and rest. Put the 
coordination on your vita and make it a part of your annual review.

 11) Evaluate - Follow-up with the people who attended. Ask them what they 
thought worked well and didn’t work well. Record this feedback on the Web page. 
This is all a part of the communication process.

 12) Repeat - Go to Step #1 because this is a never-ending process. 

Now let’s talk about attendee costs. A national meeting almost always requires 
airfare, so we are talking at least a couple hundred dollars. Then there is the 
stay in the “cool” hotel which is at least another hundred dollars per night. 
Taxi fare. Meals. Registration. Etc. Seriously, how much are you going to 
spend? Think about spending that same amount of money more directly for the 
local/regional meeting. If you really wanted to, coordinate with your 
colleagues and sponsor a caterer. Carpool with your colleagues to the event. 
Coordinate with your colleagues and sponsor a tour. Coordinate with your 
colleagues and sponsor video streaming. In the end, I’m positive everybody will 
spend less money.

What do you get? In the end you get a whole lot of professional networking with 
a relatively small group of people. And since they are regional, you will 
continue relationships with them. Want to network with people outside your 
region? No problem. Look on the Code4Lib wiki, see what's playing next, and 
attend the meeting.

Instead of centralization — like older mainframe types of computing — I suggest 
we embrace the ideas of de-centralization a la the Internet and TCP/IP. This 
way, there is no central thing to break, and everything will just find another 
path to get to where it is going. Instead of one large system — let’s call it 
the integrated library system — let’s employ the Unix Way and have lots of 
tools that do one thing and one thing well. When smaller, less expensive 
scholarly journal publishers get tired and find the burden too cumbersome, what 
do they do? They associate themselves with a sort of fiduciary who takes on 
financial responsibilities as well as provides a bit of safety. And then what 
happens to those publications? Hmmm… Can anybody say, “Serials pricing crisis?”

Let’s forgo identifying a fiduciary for a while. What will they facilitate? The 
funding of a large meeting space in a “fancy” hotel? Is that really necessary 
when the same communication & sharing can be done on a smaller, less 
expensive, and more intimate scale? DIY. 

† Here’s a really tricky idea. Do what the TEI people do. Identify a time and 
place where many similar people are having a meeting, and then sponsor a 
Code4Lib-specific event on either end of the first meeting. NASIG? DLF? ACRL? 
Call it a symbiotic relationship.

—
Eric Lease Morgan


Re: [CODE4LIB] Formalizing Code4Lib?

2016-06-07 Thread Eric Lease Morgan
On Jun 7, 2016, at 10:53 PM, Mike Giarlo <mjgia...@stanford.edu> wrote:

>>> I'm also interested in investigating how to formalize Code4Lib as an
>>> entity, for all of the reasons listed earlier in the thread…
>> 
>> -1 because I don’t think the benefits will outweigh the emotional and 
>> bureaucratic expense. We already have enough rules.
> 
> Can you say more about what you expect "the emotional and bureaucratic 
> expense" to be?

Bureaucratic and emotional expenses include yet more committees and politics. 
Things will happen increasingly slowly. Our community will be less nimble and 
somewhat governed by outside forces. We will end up with presidents, 
vice-presidents, secretaries, etc. Increasingly there will be “inside” and 
“outside”. The inside will make decisions and the outside won’t understand and 
feel left out. That is what happens when formalization takes place.

The regional conferences are good things. I call them franchises. The annual 
meeting does not have to be a big deal, and the smaller it is, the less 
financial risk there will be. Somebody will always come forward. It will just 
happen.

—
Eric Lease Morgan


Re: [CODE4LIB] Formalizing Code4Lib?

2016-06-07 Thread Eric Lease Morgan
> I'm also interested in investigating how to formalize Code4Lib as an
> entity, for all of the reasons listed earlier in the thread…


-1 because I don’t think the benefits will outweigh the emotional and 
bureaucratic expense. We already have enough rules. 

—
ELM


[CODE4LIB] viaf and the levenshtein algorithm

2016-06-07 Thread Eric Lease Morgan
In the past few weeks I have had some interesting experiences with WorldCat, 
VIAF, and the Levenshtein algorithm. [1, 2]

In short, I was given a set of authority records with the goal of associating 
each name with a VIAF identifier. To accomplish this goal I first created a 
rudimentary database — an easily parsed list of MARC 1xx fields. I then looped 
through the database, and searched VIAF via the AutoSuggest interface looking 
for one-to-one matches. If found, I updated my database with the VIAF 
identifier. The AutoSuggest interface was fast but only able to associate 20% 
of my names with identifiers. (Moreover, I don’t know how it works; AutoSuggest 
is a “black box” technology.)

I then looped through the database again, but this time I queried VIAF using 
the SRU interface. Searches often returned many hits, not just one-to-one 
matches, but through the use of the Levenshtein algorithm I was able to 
intelligently select items from the search results and update my database 
accordingly. [3] Through the use of the SRU/Levenshtein combination, I was able 
to associate another 50-55 percent of my names with identifiers.
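
If it helps, the gist of the selection step can be sketched in plain Python; the 
candidate list below is made up, and the real thing (which parses the SRU response 
and applies a few more sanity checks) is described in the postings below:

  def levenshtein(a, b):
      """Classic dynamic-programming edit distance between two strings."""
      previous = list(range(len(b) + 1))
      for i, ca in enumerate(a, start=1):
          current = [i]
          for j, cb in enumerate(b, start=1):
              current.append(min(previous[j] + 1,                  # deletion
                                 current[j - 1] + 1,               # insertion
                                 previous[j - 1] + (ca != cb)))    # substitution
          previous = current
      return previous[-1]

  def best_match(name, candidates, threshold=3):
      """Return the candidate closest to name, or None if nothing is close enough."""
      scored = sorted((levenshtein(name.lower(), c.lower()), c) for c in candidates)
      distance, winner = scored[0]
      return winner if distance <= threshold else None

  # made-up example; real candidates would come from the VIAF SRU response
  print(best_match('Morgan, Eric Lease',
                   ['Morgan, Eric', 'Morgan, Eric Lease,', 'Morley, Erica']))

The threshold is arbitrary; tighten or loosen it depending on how much hand-checking 
you are willing to do afterwards.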

Now that I have close to 75% of my names associated with VIAF identifiers, I 
can update my authority list’s MARC 024 fields; in turn, I can then provide 
enhanced services against my catalog as well as pave the way for linked data 
implementations.

Sometimes our library automation tasks can use a bit more computer science. 
Librarianship isn’t all about service and the humanities. Librarianship is an 
arscient discipline. [4]

[1] VIAF Finder - http://infomotions.com/blog/2016/05/viaf-finder/
[2] Almost perfection - http://infomotions.com/blog/2016/06/levenshtein/
[3] Levenshtein - https://en.wikipedia.org/wiki/Levenshtein_distance
[4] arscience - http://infomotions.com/blog/2008/07/arscience/

—
Eric Lease Morgan


Re: [CODE4LIB] code4lib mailing list [dlf]

2016-06-07 Thread Eric Lease Morgan
On May 12, 2016, at 8:30 AM, Eric Lease Morgan <emor...@nd.edu> wrote:

>> Alas, the Code4Lib mailing list software will most likely need to be 
>> migrated before the end of summer, and I’m proposing a number of possible 
>> options for the list’s continued existence...
> 
> Our list — Code4Lib — will be migrating to the Digital Library Federation 
> (DLF) sometime in the near future. 

This is a gentle reminder that the Code4Lib mailing list will be migrating to a 
different address sometime in the very near future. Specifically, it will be 
migrating to the Digital Library Federation. I suspect this work will be 
finished in less than thirty days, and when I know the exact address of the new 
list, I will share it here.

Thanks go to the DLF in general, and specifically Wayne Graham and Bethany 
Nowviskie for enabling this to happen. “Thanks!”

—
Eric Lease Morgan


Re: [CODE4LIB] code4lib mailing list [dlf]

2016-05-12 Thread Eric Lease Morgan
On Mar 24, 2016, at 10:29 AM, Eric Lease Morgan <emor...@nd.edu> wrote:

> Alas, the Code4Lib mailing list software will most likely need to be migrated 
> before the end of summer, and I’m proposing a number of possible options for the 
> list’s continued existence...


Our list — Code4Lib — will be migrating to the Digital Library Federation (DLF) 
sometime in the near future. [1] 

As I believe I alluded to previously, the University of Notre Dame (where 
Code4lib is currently being hosted) is discontinuing support for the venerable 
LISTSERV software. The University is offering two options: 1) doing nothing and 
letting lists die, or 2) migrating them to Google Groups. Neither of the 
options appealed to me. 

Through the process of making these issues public, Bethany Nowviskie and Wayne 
Graham — both of the DLF/CLIR — have graciously offered to host our mailing 
list. “Thank you, Wayne and Bethany!!” Sometime in the near future, I’m not 
exactly sure when, our mailing list's configurations will be copied from one 
host to another, and the address of our list will change to something like 
code4...@lists.clir.org. For better or for worse, the mailing list software 
will continue to be the venerable LISTSERV software. 

‘More later, as news makes itself available. FYI.

[1] DLF - https://www.diglib.org

—
Eric Lease Morgan
Artist- And Librarian-At-Large

“Lost In Rome”


[CODE4LIB] authority work with isni

2016-04-15 Thread Eric Lease Morgan
I am thinking about doing some authority work with content from ISNI, and I 
have a few questions about the resource.

As you may or may not know, ISNI is a sort of authority database. [1] One can 
search for an identity in ISNI, identify a person of interest, get a key, 
transform the key into a URI, and use the URI to get back both human-readable 
and machine readable data about the person. For example, the following URIs 
return the same content in different forms:

  * human-readable - http://isni.org/isni/35046923
  * XML - http://isni.org/isni/35046923.xml

I discovered the former URI through a tiny bit of reading. [2] And I discovered 
the latter URI through a simple guess. What other URIs exist?
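
For example, a few lines of Python are enough to poke at the two forms of the URI; 
nothing below is ISNI-specific, and I make no promises about what other 
serializations exist:

  from urllib.request import urlopen

  # the same identifier in its human-readable and XML forms, as above
  for url in ['http://isni.org/isni/35046923',
              'http://isni.org/isni/35046923.xml']:
      with urlopen(url) as response:
          # report what the server says it returned
          print(url, '->', response.headers.get('Content-Type'))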

When it comes to the authority work, my goal is to enhance authority records; 
to more thoroughly associate global identifiers with named entities in a local 
authority database. Once this goal is accomplished, the library catalog 
experience can be enhanced, and the door is opened for supporting linked data 
initiatives. In order to accomplish the goal, I believe I can:

  1. get a list of authority records
  2. search for name in a global authority database (like VIAF or ISNI)
  3. if found, then update local authority record accordingly
  4. go to Step #2 for all records
  5. done

My questions are:

  * What remote authority databases are available programmatically? I already 
know of one from the Library of Congress, VIAF, and probably WorldCat 
Identities. Does ISNI support some sort of API, and if so, where is some 
documentation?

  * I believe the Library Of Congress, VIAF, and probably WorldCat Identities 
all support linked data. Does ISNI, and if so, then how is it implemented and 
can you point me to documentation?

  * When it comes to updating the local (MARC) authority records, how do you 
suggest the updates happen? More specifically, what types of values do you 
suggest I insert into what specific (MARC) fields/subfields? Some people 
advocate $0 of 1xx, 6xx, and 7xx fields. Other people suggest 024 subfields 2 
and a. Inquiring minds would like to know.

Fun with authorities!? And, “What’s in a name anyway?”

[1] ISNI - http://isni.org
[2] some documentation - http://isni.org/how-isni-works

—
Eric Lease Morgan
Lost In Rome


Re: [CODE4LIB] Software used in Panama Papers Analysis [named entities]

2016-04-08 Thread Eric Lease Morgan
On Apr 8, 2016, at 5:13 PM, Jenn C <jen...@gmail.com> wrote:

> I worked on a text mining project last semester where I had a bunch of
> magazines with text that was totally unstructured (from IA). I would have
> really liked to know how to work entity matching into such a project. Are
> there text mining projects out there that demonstrate doing this?

If I understand your question correctly, then the Stanford Name Entity 
Recognition (NER) library/application may be one solution. [1]

Given text as input, a named entity recognition library/application returns a 
list of nouns (names, places, and things). The things can be all sorts of stuff 
such as organizations, dates, times, fiscal amounts, etc. Stanford’s NER is 
really a Java library, but has a command-line interface. Feed it a text, and 
you get back an XML stream. The stream contains elements, and each element is 
expected to be some sort of entity. Be forewarned. For the best and most 
optimal performance, it is necessary to “train” the library/application. 
Frankly, I’ve never done that, and consequently, I guess I’ve never been 
optimal.* You also might want to read the relevant chapter of the Python 
Natural Language Toolkit (NLTK) book. [2] The noted chapter gives a pretty 
good overview of the subject. 

[1] NER - http://nlp.stanford.edu/software/CRF-NER.shtml
[2] NLTK chapter - http://www.nltk.org/book/ch07.html

* ‘Story of my life.
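
For what it is worth, the NLTK route looks something like the following sketch; the 
sentence is made up, and the model names in the download list are simply the ones I 
believe current NLTK distributions use:

  import nltk

  # one-time downloads of the underlying models
  for package in ['punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words']:
      nltk.download(package, quiet=True)

  text = 'Eric Lease Morgan works at the University of Notre Dame in Indiana.'
  tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))

  # print anything NLTK labeled as a named entity (PERSON, ORGANIZATION, GPE, etc.)
  for subtree in tree.subtrees():
      if subtree.label() != 'S':
          print(subtree.label(), ' '.join(word for word, tag in subtree.leaves()))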

—
Eric Lease Morgan


Re: [CODE4LIB] Software used in Panama Papers Analysis

2016-04-08 Thread Eric Lease Morgan
On Apr 7, 2016, at 4:24 PM, Gregory Markus  wrote:

>> from one of the New York Times stories on the Panama Papers: "The
>> ICIJ made a number of powerful research tools available to the
>> consortium that the group had developed for previous leak
>> investigations. Those included a secure, Facebook-type forum
>> where reporters could post the fruits of their research, as well
>> as database search program called “Blacklight” that allowed the
>> teams to hunt for specific names, countries or sources.”
>> 
>> http://www.nytimes.com/2016/04/06/business/media/how-a-cryptic-message-interested-in-data-led-to-the-panama-papers.html
> 
> https://ijnet.org/en/blog/how-icij-pulled-large-scale-cross-border-investigative-collaboration


Based on my VERY quick read of the articles linked above, a group of people 
created a collaborative system for collecting, indexing, searching, and 
analyzing data/information. In the end, they facilitated the creation of 
knowledge. That sure sounds like a library to me. Kudos! I believe our 
profession has many things to learn from this example, and two of those things 
include: 1) you need full text content, and 2) controlled vocabularies are not 
a necessary component of the system. —ELM


Re: [CODE4LIB] Google can give you answers, but librarians give you the right answers

2016-04-06 Thread Eric Lease Morgan
On Apr 6, 2016, at 12:44 PM, Jason Bengtson <j.bengtson...@gmail.com> wrote:

> This is librarians fighting a PR battle we can't win. I doubt most people
> care about these assertions, and I certainly don't think they stand a
> chance of swaying anyone. This is like the old "librarians need to promote
> themselves better" chestnut. Losing strategies, in my opinion. Rather than
> trying to refight a battle with search technology that search technology
> has already won, libraries and librarians need to reinvent the technology
> and themselves. Semantic technologies, in particular, provide Information
> Science with extraordinary avenues for reinvention. We need to make search
> more effective and approachable, rather than wagging our finger at people
> who we think aren't searching "correctly". In the short term, data provides
> powerful opportunities. And it isn't all about writing code or wrangling
> data . . . informatics, metadata, systematic reviews, all of these are
> fertile ground for additional development. Digitization projects and other
> efforts to make special collections materials broadly accessible are
> exciting stuff, as are the developing technologies that support those
> efforts. We should be seizing the argument and shaping it, rather than
> trying to invent new bromides to support a losing fight.


+1

I wholeheartedly concur. IMHO, the problem to solve nowadays does not 
surround search because everybody can find plenty of stuff, and the stuff is 
usually more than satisfactory. Instead, I think the problem to solve surrounds 
assisting the reader in using & understanding the stuff they find. [1] “Now 
that I’ve done the ‘perfect’ search and downloaded the subsequent 200 articles 
from JSTOR, how — given my limited resources — do I read and comprehend what 
they say? Moreover, how do I compare & contrast what the articles purport with 
the things I already know?” Text mining (a type of semantic technology) is an 
applicable tool here, but then again, “Whenever you have a hammer, everything 
begins to look like a nail.”

[1] an essay elaborating on the idea of use & understand - 
http://infomotions.com/blog/2011/09/dpla/

—
Eric Lease Morgan
Artist- And Librarian-At-Large


Re: [CODE4LIB] Google can give you answers, but librarians give you the right answers

2016-04-06 Thread Eric Lease Morgan
On Apr 5, 2016, at 11:12 PM, Karen Coyle  wrote:

> Eric, there were studies done a few decades ago using factual questions. 
> Here's a critical round-up of some of the studies: 
> http://www.jstor.org/stable/25828215  Basically, 40-60% correct, but possibly 
> the questions were not representative -- so possibly the results are really 
> worse :(

Karen, interesting article, and thank you for bringing it to our attention. 
—Eric


Re: [CODE4LIB] Google can give you answers, but librarians give you the right answers

2016-04-05 Thread Eric Lease Morgan
 I sincerely wonder to what extent librarians give the reader
(patrons) the right -- correct -- answer to a (reference) question.
Such is a hypothesis that can be tested and measured. Please show me
non-anecdotal evidence one way or the other. --ELM


Re: [CODE4LIB] code4lib mailing list [domain]

2016-03-27 Thread Eric Lease Morgan
On Mar 25, 2016, at 1:24 PM, Bethany Nowviskie  wrote:

> Dear all — I’ve been getting this as a digest, so apologies that I’m only 
> seeing the thread on the future of the mailing list now!
> 
> CLIR/DLF is running the same version of ye olde LISTSERV as Notre Dame, to 
> support DLF-ANNOUNCE, some of our working group lists, and (now) all of the 
> discussion lists of the National Digital Stewardship Alliance. 
> 
> We have recent experience migrating NDSA lists over from Library of Congress 
> — with archives and subscribers intact — and would be really happy to do the 
> same for Code4Lib. We could commit to supporting the list for the long haul, 
> as a contribution to this awesome community. 
> 
> It may be that people want to take the opportunity to get off LISTSERV 
> entirely, but if not — just say the word! — Bethany 
> 
> (PS: added gratitude to Eric from all of us at DLF as well.) 
> 
> — 
> Bethany Nowviskie 
> Director of the Digital Library Federation (DLF) at CLIR
> Research Associate Professor of Digital Humanities at UVa
> diglib.org | clir.org | ndsa.org | nowviskie.org


Bethany, yes, thank you. This is a very interesting offer. 

Everybody, please share with the rest of us your opinion about our mailing 
list’s domain. This need to move from the University of Notre Dame is a 
possible opportunity to have our list come from the code4lib.org domain. For 
example, the address of the list might become code4...@lists.code4lib.org. If 
it moved to Google, then the address might be code4...@googlegroups.com. Is the 
list’s address important for us to brand?

—
ELM


Re: [CODE4LIB] code4lib mailing list

2016-03-24 Thread Eric Lease Morgan
Regarding the mailing list, here is what I propose to do:

  1. Upgrade my virtual server to include more RAM and disk space.
  2. Install and configure Mailman.
  3. Ask people to subscribe to a bogus list so I/we can practice.
  4. Evaluate.
  5. If evaluation is successful, then migrate subscribers to new list.
  6. If evaluation is unsuccessful, then look for alternatives.
  7. Re-evaluate on a regular basis. 

On my mark. Get set. Go?

—
ELM


[CODE4LIB] code4lib mailing list

2016-03-24 Thread Eric Lease Morgan
Alas, the Code4Lib mailing list software will most likely need to be migrated 
before the end of summer, and I’m proposing a number of possible options for the 
list’s continued existence. 

I have been managing the Code4Lib mailing list since its inception about twelve 
years ago. This work has been both a privilege and an honor. The list itself 
runs on top of the venerable LISTSERV application and is hosted by the 
University of Notre Dame. The list includes about 3,500 subscribers, and 
traffic very very rarely gets over fifty messages a day. But alas, University 
support for LISTSERV is going away, and I believe the University wants to 
migrate the whole kit and caboodle to Google Groups.

Personally, I don’t like the idea of Code4Lib moving to Google Groups. Google 
knows enough about me (us), and I don’t feel the need for them to know more. 
Sure, moving to Google Groups includes a large convenience factor, but it also 
means we have less control over our own computing environment, let alone our 
data.

So, what do we (I) do? I see three options:

  0. Let the mailing list die — Not really an option, in my opinion
  1. Use Google Groups - Feasible, (probably) reliable, but with less control
  2. Host it ourselves - More difficult, more responsibility, all but absolute 
control

Again, personally, I like Option #2, and I would probably be willing to host 
the list on one of my computers and (after a bit of DNS trickery) complete 
with a code4lib.org domain.

What do y’all think? If we go with Option #2, then where might we host the 
list, who might do the work, and what software might we use?

—
Eric Lease Morgan
Artist- And Librarian-At-Large


Re: [CODE4LIB] personalization of academic library websites

2016-03-23 Thread Eric Lease Morgan
On Mar 23, 2016, at 6:26 PM, Mark Weiler <mwei...@wlu.ca> wrote:

> I'm doing some exploratory research on personalization of academic library 
> websites. E.g. student logs in, the site presents books due dates, room 
> reservations, course list with associated course readings, subject 
> librarians.  For faculty members, the site might present other information, 
> such as how to put material on course reserves, deposit material into 
> institutional repository, etc.   Has anyone looked into this, or tried it?

I did quite a bit of work on this idea quite a number of years ago, measured in 
Internet time. See:

  MyLibrary@NCState (1999)
  http://infomotions.com/musings/sigir-99/

  The text describes MyLibrary@NCState, an extensible
  implementation of a user-centered, customizable interface to a
  library's collection of information resources. The system
   integrates principles of librarianship with globally networked
  computing resources creating a dynamic, customer-driven front-end
  to any library's set of materials. It supports a framework for
  libraries to provide enhanced access to local and remote sets of
   data, information, and knowledge. At the same time, it does not
  overwhelm its users with too much information because the users
  control exactly how much information is displayed to them at any
  given time. The system is active and not passive; direct human
  interaction, computer mediated guidance and communication
  technologies, as well as current awareness services all play
   indispensable roles in its implementation. 


  MyLibrary: A Copernican revolution in libraries (2005)
  http://infomotions.com/musings/copernican-mylibrary/

  "We are suffering from information overload," the speaker said.
  "There is too much stuff to choose from. We want access to the
  world's knowledge, but we only want to see one particular part of
  it at any one particular time."... The speaker was part of a
  focus group at the North Carolina State University (NCSU),
  Raleigh, back in 1997... To address the issues raised in our
  focus groups, the NCSU Libraries chose to create MyLibrary, an
  Internet-based library service. It would mimic the commercial
  portals in functionality but include library content: lists of
  new books, access to the catalog and other bibliographic indexes,
  electronic journals, Internet sites, circulation services,
  interlibrary loan services, the local newspaper, and more. Most
  importantly, we designed the system to provide access to our most
  valuable resource: the expertise of our staff. After all, if you
  are using My Yahoo! and you have a question, then who are you
  going to call? Nobody. But if you are using a library and you
  have a question, then you should be able to reach a librarian.


  MyLibrary: A digital library framework & toolkit (2008)
  http://infomotions.com/musings/mylibrary-framework/

  This article describes a digital library framework and toolkit
  called MyLibrary. At its heart, MyLibrary is designed to create
  relationships between information resources and people. To this
  end, MyLibrary is made up of essentially four parts: 1)
  information resources, 2) patrons, 3) librarians, and 4) a set of
  locally-defined, institution-specific facet/term combinations
  interconnecting the first three. On another level, MyLibrary is a
  set of object-oriented Perl modules intended to read and write to
  a specifically shaped relational database. Used in conjunction
  with other computer applications and tools, MyLibrary provides a
  way to create and support digital library collections and
  services. Librarians and developers can use MyLibrary to create
  any number of digital library applications: full-text indexes to
  journal literature, a traditional library catalog complete with
  circulation, a database-driven website, an institutional
  repository, an image database, etc. The article describes each of
  these points in greater detail.

Technologically, the problem of personalization is not difficult. Instead, the 
problem I encountered in trying to make a thing like MyLibrary a reality was 
library professional ethics. Too many librarians thought the implementation of 
the idea challenged intellectual privacy. Alas.

—
Eric Lease Morgan
Artist- And Librarian-At-Large

(574) 485-6870


[CODE4LIB] worldcat discovery versus metadata apis

2016-03-22 Thread Eric Lease Morgan
I’m curious. What is the difference between the WorldCat Discovery and WorldCat 
Metadata APIs? 

Given an OCLC number, I want to programmatically search WorldCat and get in 
return a full bibliographic record complete with authoritative subject headings 
and names. Which API should I be using?

—
Eric Morgan


Re: [CODE4LIB] research project about feeling stupid in professional communication

2016-03-22 Thread Eric Lease Morgan
In my humble opinion, what we have here is a failure to communicate. [1]

Libraries, especially larger libraries, are increasingly made up of many 
different departments, including but not limited to departments such as: 
cataloging, public services, collections, preservation, archives, and, 
nowadays, departments of computer staff. From my point of view, these various 
departments fail to see the similarities between themselves, and instead focus 
on their differences. This focus on the differences is amplified by the use of 
dissimilar vocabularies and subdiscipline-specific jargon. This use of 
dissimilar vocabularies causes a communications gap that, left unresolved, 
ultimately creates animosity between groups. I believe this is especially true 
between the more traditional library departments and the computer staff. This 
communications gap is an impediment to achieving the goals of 
librarianship, and any library — whether it be big or small — needs to address 
these issues lest it waste both its time and money.

For example, the definitions of things like MARC, databases & indexes, 
collections, and services are not shared across (especially larger) library 
departments.

What is the solution to these problems? In my opinion, there are many 
possibilities, but the solution ultimately rests with individuals willing to 
take the time to learn from their co-workers. It rests in the ability to 
respect — not merely tolerate — another point of view. It requires time, 
listening, discussion, reflection, and repetition. It requires getting to know 
other people on a personal level. It requires learning what others like and 
dislike. It requires comparing & contrasting points of view. It demands 
“walking a mile in the other person’s shoes”, and can be accomplished by things 
such as the physical intermingling of departments, cross-training, and simply 
by going to coffee on a regular basis.

Again, all of us working in libraries have more similarities than differences. 
Learn to appreciate the similarities, and the differences will become 
insignificant. The consequence will be a more holistic set of library 
collections and services.

[1] I have elaborated on these ideas in a blog posting - http://bit.ly/1LDpXkc

—
Eric Lease Morgan


[CODE4LIB] Code4Croatia

2016-03-21 Thread Eric Lease Morgan
Code4Croatia was alluded to in a blog posting. [1] code4croatia++  Inquiring 
minds would like to know more. Please tell us about Code4Croatia, and don’t 
hesitate to update http://wiki.code4lib.org with details.

[1] http://blog.okfn.org/2016/03/21/codeacross-opendataday-zagreb-2016/

—Eric Morgan


Re: [CODE4LIB] onboarding developers coming from industry

2016-03-02 Thread Eric Lease Morgan
On Mar 2, 2016, at 9:48 AM, LeVan,Ralph  wrote:

> …I've written so much bloat that didn't get used because a librarian was sure 
> the system would fail without it….

I’m ROTFL because just a few minutes ago, while composing an informal essay on 
the history of bibliographic description, I wrote the following sentence:

  The result is library jargon solidified in an obscure
  data structure. Moreover, in an attempt to make the
  surrogates of library collections more meaningful, the
  information of bibliographic description bloats to fill
   ^^
  much more than the traditional three to five catalog
  cards of the past.

levan++

—
ELM


Re: [CODE4LIB] onboarding developers coming from industry

2016-03-02 Thread Eric Lease Morgan
On Mar 2, 2016, at 9:30 AM, Tom Hutchinson <thutc...@swarthmore.edu> wrote:

> ...To be honest I feel like I still don’t even really know what libraries / 
> librarians are yet.


  Tom, when you find out, please tell the rest of us.  ;-)  —Eric Lease Morgan


Re: [CODE4LIB] Don't Change Your Site Because of Reference Librarians RE: [CODE4LIB] Responsive website question

2016-02-08 Thread Eric Lease Morgan
On Feb 8, 2016, at 11:25 AM, Katherine Deibel  wrote:

> From a disability accessibility perspective, magnification is not purely 
> about text readability but making sure that all features of a 
> website---images, interactive widgets, text, etc.---are of use to the user. 
> Merely changing the font size is like putting out a fire in the kitchen while 
> the rest of the house is ablaze.

  deibel++  &  ROTFL  —ELM


[CODE4LIB] oclc member code

2016-01-21 Thread Eric Lease Morgan
Given an OCLC member code, such as BXM for Boston College, is it possible to 
use some sort of OCLC API to search WorldCat (or some other database) and 
return information about Boston College? —Eric Lease Morgan


Re: [CODE4LIB] Anyone familiar with XSLT? I'm stuck

2016-01-21 Thread Eric Lease Morgan
> I have around 1400 xml files that I am trying to copy into one xml file so 
> that I can then pull out three elements from each and put into a single csv 
> file.

What are three elements you want to pull out of each XML file, and
what do you want the CSV file to look like?

Your XML files are pretty flat, and if I understand the question
correctly, then it is all but trivial to extract your three elements
as a line of CSV.  Consequently I suggest foregoing the concatenation
of all the XML files into a single file. Such only adds complexity.
Instead I suggest:

 1. Put all XML files in a directory
 2. For each XML file, process with XSL
 3. Output a line of CSV
 4. Done

#!/bin/bash

# xml2csv.sh - batch process a set of XML files

# configure (season the value of XSLTPROC to taste)
XSLTPROC=/usr/bin/xsltproc
XSLT=xml2csv.xsl

# process each file
for FILE in ./data/*.xml; do

 # do the work
 $XSLTPROC $XSLT $FILE

done

# done
exit

 $ mkdir ./data
 $ cp *.xml ./data
 $ ./xml2csv.sh > data.csv
 $ open data.csv

Just about all that is missing is:

 * what elements do you want to extract, and
 * what do you want the CSV to look like

--
ELM


Re: [CODE4LIB] TEI->EPUB serialization testing

2016-01-14 Thread Eric Lease Morgan
On Jan 14, 2016, at 10:32 AM, Ethan Gruber  wrote:

>>> Part of this grant stipulates that open access books be made available
>>> in EPUB 3.0.1, so I got to work on a pipeline for dynamically serializing
>>> TEI into EPUB... 
>>> http://eaditor.blogspot.com/2015/12/the-ans-digital-library-look-under-hood.html
>>>  
>>> ...http://eaditor.blogspot.com/2016/01/first-ebook-published-to-ans-digital.html
>> 
>> I wrote a similar thing a number of years ago, and it was implemented as
>> Alex Lite. [1, 2]...
>> 
>> [1] Alex Lite blog posting - http://bit.ly/eazpJY
>> [2] Alex Lite - http://infomotions.com/sandbox/alex-lite/
> 
> Thanks, Eric. Is the original code online anywhere? I will eventually write
> some XSL:FO to generate PDFs for people who want those, for some reason.

I just put my source code and many of the supporting configuration files (XSL) 
temporarily on the Web at http://infomotions.com/tmp/alex-lite-code/  Enjoy? 
—ELM


Re: [CODE4LIB] TEI->EPUB serialization testing

2016-01-14 Thread Eric Lease Morgan
On Jan 13, 2016, at 4:17 PM, Ethan Gruber <ewg4x...@gmail.com> wrote:

> Part of this grant stipulates that open access books be made available in 
> EPUB 3.0.1, so I got to work on a pipeline for dynamically serializing TEI 
> into EPUB. It works pretty well, but there are some minor issues. The issues 
> might be related more to differences between individual ereader apps in 
> supporting the 3.0.1 spec than anything I might have done wrong in the 
> serialization process (the file validates according to a script I've been 
> running)…
> 
> If you are interested in more information about the framework, there's 
> http://eaditor.blogspot.com/2015/12/the-ans-digital-library-look-under-hood.html
>  and 
> http://eaditor.blogspot.com/2016/01/first-ebook-published-to-ans-digital.html.
>  It's highly LOD aware and is capable of posting to a SPARQL endpoint so that 
> information can be accessed from other archival frameworks and integrated 
> into projects like Pelagios.


I wrote a similar thing a number of years ago, and it was implemented as Alex 
Lite. [1] I started out with TEI files, and then transformed them into a number 
of derivatives: simple HTML, “cooler” HTML, PDF, and ePub. I think my ePub 
version was somewhere around 2.0. The “framework” was written in Perl, of 
course.  ;-)  The whole of Alex Lite was designed to be given away on CD or 
as an instant website. (“Just add water.”) The hard part of the whole thing 
was the creation of the TEI files in the first place. After that, everything 
was relatively easy.

[1] Alex Lite blog posting - http://bit.ly/eazpJY
[2] Alex Lite - http://infomotions.com/sandbox/alex-lite/

—
Eric Lease Morgan
Artist- And Librarian-At-Large

(A man in a trench coat approaches, and says, “Psst. Hey buddy, wanna buy a 
registration to the Code4Lib conference!?”)


Re: [CODE4LIB] The METRO Fellowship

2016-01-05 Thread Eric Lease Morgan
On Jan 5, 2016, at 1:17 PM, Nate Hill  wrote:

>  metro.org/fellowship
> 
> Our goal is to empower a small cohort of fellows to help solve
> cross-institutional problems and to spur innovation within our membership
> of libraries and archives in NYC and Westchester County as well as the
> field at large.

 Cool idea!!! —ELM


Re: [CODE4LIB] selinux [resolved]

2015-12-27 Thread Eric Lease Morgan
On Dec 27, 2015, at 8:29 AM, Michael Berkowski <m...@umn.edu> wrote:

>> How do I modify the permissions of a file under the supervision of SELinux
>> so the file can be executed as a CGI script?
>> 
>> I have two CGI scripts designed to do targeted crawls against remote
>> hosts. One script uses rsync on port 873 and the other uses wget on port
>> 443. I can run these scripts as me without any problems. None. They work
>> exactly as expected. But when the scripts are executed from my HTTP server
>> and under the user apache both rsync and wget fail. I have traced the
>> errors to some sort of permission problems generated from SELinux.
> 
> /usr/sbin/semanage and some other necessary things come from the package
> policycoreutils-python
> 
> By default, Apache is disallowed from making outbound network connections
> and there's an SELinux boolean to enable it (examples here
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Security-Enhanced_Linux/sect-Security-Enhanced_Linux-Booleans-Configuring_Booleans.html)
> 
> This is probably the most common thing anyone needs to change in SELinux.
> 
> $ setsebool -P httpd_can_network_connect on
> 
> (-P is to make it persist beyond reboots) As far as the wget, that setting
> alone may be enough to cure it, provided the  CGI script itself lives in a
> location Apache expects, which would already have the right context.
> Although both produce tcp errors, I'm not so certain it will also correct
> the rsync one.
> 
> To dig further, there are several actions you can take.
> 
> If something has the wrong context and you need to find out what the right
> context should be, you can list the relevant contexts along with the
> filesystem locations they're bound to with:
> 
> # list Apache-related contexts...
> $ semanage fcontext -l | grep httpd
> 
> You probably already know how to change one:
> 
> $ chcon -t new_context_name /path/to/file
> 
> It doesn't look like you got any denials related to CGI execution, so I
> would guess your scripts are where Apache expects them.
> 
> To list all Apache booleans and their states, use
> 
> $ getsebool -a | grep httpd
> 
> If you are unable to get your result using booleans or fixing the context,
> then you have to start digging into audit2allow. It will take denial lines
> from the audit log like those in your email from stdin and attempt to
> diagnose solutions with booleans, or help create a custom SELinux module to
> allow whatever you are attempting.
> 
> Start by grepping the relevant denied lines from /var/log/audit/audit.log,
> or get them from wherever you got the ones in your message. I usually put
> them into a file. Don't take every denial from the log, only the ones
> generated by the action you're trying to solve.
> 
> $ audit2allow < grepped_denials.txt
> 
> There may also be audit2why, but I don't know if CentOS6 has it and I've
> never used it.
> 
> Not sure if CentOS 6 has the updated tools which actually suggest booleans
> you can modify to fix denials, but if it does, you would get output like:
> 
> #= httpd_t ==
> 
> # This avc can be allowed using the boolean 'httpd_run_stickshift'
> allow httpd_t self:capability fowner;
> 
> # This avc can be allowed using the boolean 'httpd_execmem'
> allow httpd_t self:process execmem;
> 
> 
> If there are no booleans to modify, audit2allow will output policy
> configuration which would enable your action. Your last resort is to create
> a custom SELinux module with the -M flag that implements that policy.
> 
> # generate the module
> $ audit2allow -M YOURMODULENAME < grepped_denials.txt
> 
> Then you have to install the module
> 
> $ semodule -i YOURMODULENAME.pp
> 
> There may simpler ways of going about the module creation, but I do it so
> infrequently and this is the method I'm accustomed to. Red Hat has some
> docs here:
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Security-Enhanced_Linux/sect-Security-Enhanced_Linux-Fixing_Problems-Allowing_Access_audit2allow.html
> 
> So, I hope this gets you somewhere useful. In the best case scenario, you
> should only need to enable httpd_can_network_connect.
> 
> — 
> Michael Berkowski
> University of Minnesota Libraries


Michael, resolved, and thank you for the prompt and thorough reply.

Yes, SELinux was doing its job, and it was configured to disallow network 
connections from httpd. After issuing the following command (which allows httpd 
to make network connections) both my rsync- and wget-based CGI scripts worked 
without modification:

  setsebool httpd_can_network_connect on

Maybe I’ll add the -P option later. Yippie! Thank you. 

— 
Eric Lease Morgan


Re: [CODE4LIB] selinux

2015-12-26 Thread Eric Lease Morgan
On Dec 26, 2015, at 8:14 PM, Childs, Riley  wrote:

>> How do I modify the permissions of a file under the supervision of SELinux
>> so the file can be executed as a CGI script?
>> 
>> I have two CGI scripts designed to do targeted crawls against remote
>> hosts. One script uses rsync on port 873 and the other uses wget on port
>> 443. I can run these scripts as me without any problems. None. They work
>> exactly as expected. But when the scripts are executed from my HTTP server
>> and under the user apache both rsync and wget fail. I have traced the
>> errors to some sort of permission problems generated from SELinux.
>> Specifically, SELinux generates the following errors for the rsync script:
>> 
>>  type=AVC msg=audit(1450984068.685:19667): avc:  denied  {
>>  name_connect } for  pid=11826 comm="rsync" dest=873
>>  scontext=unconfined_u:system_r:httpd_sys_script_t:s0
>>  tcontext=system_u:object_r:rsync_port_t:s0 tclass=tcp_socket
>> 
>>  type=SYSCALL msg=audit(1450984068.685:19667): arch=c03e
>>  syscall=42 success=no exit=-13 a0=3 a1=1b3c030 a2=10
>>  a3=7fffb057acc0 items=0 ppid=11824 pid=11826 auid=500 uid=48
>>  gid=48 euid=48 suid=48 fsuid=48 egid=48 sgid=48 fsgid=48
>>  tty=(none) ses=165 comm="rsync" exe="/usr/bin/rsync"
>>  subj=unconfined_u:system_r:httpd_sys_script_t:s0 key=(null)
>> 
>> SELinux generates these errors for the wget script:
>> 
>>  type=AVC msg=audit(1450984510.396:19715): avc:  denied  {
>>  name_connect } for  pid=13263 comm="wget" dest=443
>>  scontext=unconfined_u:system_r:httpd_sys_script_t:s0
>>  tcontext=system_u:object_r:http_port_t:s0 tclass=tcp_socket
>> 
>>  type=SYSCALL msg=audit(1450984510.396:19715): arch=c03e
>>  syscall=42 success=no exit=-13 a0=4 a1=7ffe1d05b890 a2=10
>>  a3=7ffe1d05b4f0 items=0 ppid=13219 pid=13263 auid=500 uid=48
>>  gid=48 euid=48 suid=48 fsuid=48 egid=48 sgid=48 fsgid=48
>>  tty=(none) ses=165 comm="wget" exe="/usr/bin/wget"
>>  subj=unconfined_u:system_r:httpd_sys_script_t:s0 key=(null)
>> 
>> How do I diagnose these errors? Do I need to use something like chcon to
>> change my CGI scripts’ permissions? Maybe I need to use chcon to change
>> rsync’s or wget’s permissions? Maybe I need to use something like semanage
>> (which doesn’t exist on my system) to change the user apache’s permissions
> 
> SELinux :) Which distro are you running?

  I am running CentOS release 6.7. —ELM


[CODE4LIB] selinux

2015-12-26 Thread Eric Lease Morgan
How do I modify the permissions of a file under the supervision of SELinux so 
the file can be executed as a CGI script?

I have two CGI scripts designed to do targeted crawls against remote hosts. One 
script uses rsync on port 873 and the other uses wget on port 443. I can run 
these scripts as me without any problems. None. They work exactly as expected. 
But when the scripts are executed from my HTTP server and under the user apache 
both rsync and wget fail. I have traced the errors to some sort of permission 
problems generated from SELinux. Specifically, SELinux generates the following 
errors for the rsync script:

  type=AVC msg=audit(1450984068.685:19667): avc:  denied  {
  name_connect } for  pid=11826 comm="rsync" dest=873
  scontext=unconfined_u:system_r:httpd_sys_script_t:s0
  tcontext=system_u:object_r:rsync_port_t:s0 tclass=tcp_socket

  type=SYSCALL msg=audit(1450984068.685:19667): arch=c03e
  syscall=42 success=no exit=-13 a0=3 a1=1b3c030 a2=10
  a3=7fffb057acc0 items=0 ppid=11824 pid=11826 auid=500 uid=48
  gid=48 euid=48 suid=48 fsuid=48 egid=48 sgid=48 fsgid=48
  tty=(none) ses=165 comm="rsync" exe="/usr/bin/rsync"
  subj=unconfined_u:system_r:httpd_sys_script_t:s0 key=(null)

SELinux generates these errors for the wget script:

  type=AVC msg=audit(1450984510.396:19715): avc:  denied  {
  name_connect } for  pid=13263 comm="wget" dest=443
  scontext=unconfined_u:system_r:httpd_sys_script_t:s0
  tcontext=system_u:object_r:http_port_t:s0 tclass=tcp_socket

  type=SYSCALL msg=audit(1450984510.396:19715): arch=c03e
  syscall=42 success=no exit=-13 a0=4 a1=7ffe1d05b890 a2=10
  a3=7ffe1d05b4f0 items=0 ppid=13219 pid=13263 auid=500 uid=48
  gid=48 euid=48 suid=48 fsuid=48 egid=48 sgid=48 fsgid=48
  tty=(none) ses=165 comm="wget" exe="/usr/bin/wget"
  subj=unconfined_u:system_r:httpd_sys_script_t:s0 key=(null)

How do I diagnose these errors? Do I need to use something like chcon to change 
my CGI scripts’ permissions? Maybe I need to use chcon to change rsync’s or 
wget’s permissions? Maybe I need to use something like semanage (which doesn’t 
exist on my system) to change the user apache’s permissions?

This is a level of the operating system of which I am unfamiliar. 

— 
Eric Lease Morgan


Re: [CODE4LIB] yaml/xml/json, POST data, bloodcurdling terror

2015-12-17 Thread Eric Lease Morgan
On Dec 17, 2015, at 8:22 AM, Andromeda Yelton <andromeda.yel...@gmail.com> 
wrote:

> I strongly recommend this hilarious, terrifying PyCon talk about
> vulnerabilities in yaml, xml, and json processing:
> 
>   https://www.youtube.com/watch?v=kjZHjvrAS74
> 
> If you process user-submitted data in these formats and don't yet know why
> you should be flatly terrified, please watch this ASAP; it's illuminating.
> If you *do* know why you should be terrified, watch it anyway and giggle
> along in knowing recognition, because the talk is really very funny.


Obviously, the sorts of things outlined in the presentation above are real, and 
they are really scary. We developers need to take note: getting input from the 
‘Net can be a really bad thing. —Eric Lease Morgan


Re: [CODE4LIB] dublin core files [and unicorns]

2015-11-27 Thread Eric Lease Morgan
On Nov 24, 2015, at 8:20 PM, Eric Lease Morgan <emor...@nd.edu> wrote:

>>> Do Dublin Core files exist, and if so, then can somebody show me one? Put 
>>> another way, can you point me to a DTD or schema denoting Dublin Core XML? 
>>> The closest I can come is the standard/default oai_dc description of an 
>>> OAI-PMH item.
>> 
>> On Nov 24, 2015, at 8:11 PM, Benjamin Florin <benjamin.flo...@gmail.com> 
>> wrote:
>> 
>> Sometimes the Dublin Core documentation uses "Dublin Core record" to
>> describe XML records that use Dublin core vocabulary, for example:
>> http://dublincore.org/documents/2003/04/02/dc-xml-guidelines/
>> 
>> Those records do use the Simple and Qualified Dublin Core XML Schema <
>> http://dublincore.org/schemas/xmls/>, which basically layout a list of
>> simple elements with DC labels that may contain strings and possibly a
>> language attribute.
> 

> From one of the links above I see a viable schema:
> 
> http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd
> 
> And yes, I haven’t seen any Dublin Core records “in the wild” either, but 
> based on the information above, they apparently can exist. Thank you.


I take back what I said earlier. Dublin Core records don’t exist, and I would 
like to reinforce what was said by Benjamin, "Sometimes the Dublin Core 
documentation uses 'Dublin Core record' to describe XML records that use Dublin 
core vocabulary.” In this vein, I think Dublin Core records are similar 
to unicorns, and I wish Library Land would stop alluding to them.

Benjamin points to as many as three different XML schema describing the 
implementation of Dublin Core:

 1. http://dublincore.org/schemas/xmls/simpledc20021212.xsd
 2. http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd
 3. http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd

None of these schema define a root element, and therefore it is not possible to 
create an XML file that both: 1) validates against any of the schema, and 2) 
does not declare another schema to contain the Dublin Core data. If a given XML 
file does validate then it will not validate against the Dublin Core schema but 
instead the additional schema. An XML file must have one and only one root 
element, and the schemas listed above do not define root elements. 
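
To make the point concrete, here is a tiny, hypothetical example. Given a little file, 
call it record.xml, containing nothing but simple Dublin Core elements wrapped in the 
OAI-PMH oai_dc container:

  <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Walden</dc:title>
    <dc:creator>Thoreau, Henry David</dc:creator>
  </oai_dc:dc>

the file can be validated, but only against the OAI-PMH container schema (and xmllint 
will need network access to fetch the schemas), never against Dublin Core alone:

  xmllint --noout --schema http://www.openarchives.org/OAI/2.0/oai_dc.xsd record.xml

The root element, and therefore the “record”, belongs to OAI-PMH; Dublin Core merely 
supplies the element names.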

One of my students identified a number of ways Dublin Core data could be 
embedded in HTML [1], but again, such files are not Dublin Core files. Instead, 
they are HTML files.

The idea of a “Dublin Core record” probably stems from the idea of a “MARC 
record” which is bad in and of itself. For example, how many times have you 
seen a delimited version of MARC called a ‘MARC record’? The idea of a "Dublin 
Core record” seems detrimental to the understanding of what Dublin Core is and how 
it is implemented. 

Dublin Core is a set of element names coupled with very loose definitions of 
what those names are to contain and how they are to be applied. 

To what degree am I incorrect? What am I missing?

[1] DC-HTML - http://dublincore.org/documents/dc-html/

—
Eric Lease Morgan
Artist- And Librarian-At-Large


[CODE4LIB] bibframe

2015-10-15 Thread Eric Lease Morgan
[Forwarded upon request. —E “Lost In Venice” M ]


> From: "Fultz, Tamara" 
> Subject: Question about posting
> Date: October 15, 2015 at 12:43:08 AM GMT+2
> To: "code4lib-requ...@listserv.nd.edu" 
>  
> Implementing BIBFRAME
> The UC Davis BIBFLOW Project
> Presented by the New York Technical Services Librarians
> 
> With a focus on cataloging, Xiaoli Li will present the UC Davis BIBFLOW 
> project with a preview of its linked data cataloging tools and workflows. 
> This project was designed to examine how BIBFRAME can be adopted and how it 
> will affect daily library operations.
> 
> Date:
> Monday, November 16, 2015
> 5:00 – 7:45 PM
> Refreshments: 5 – 6 PM
> Program 6 – 7:45 PM
> 
> Location:
> The New York Public Library, Stephen A. Schwarzman Building
> Margaret Liebman Berger Forum, Room 227
> 476 Fifth Avenue (at 42nd Street)
> New York, NY 10018
> 
> $15 for current members
> $30 for event + new or renewed membership
> $20 for event + new or renewed student membership
> $40 for non-members
> 
> View more information and register at 
> http://nytsl.org/nytsl/implementing-bibframe-the-uc-davis-bibflow-project/ 
> 
>  
> —
> Tamara Lee Fultz
> Associate Museum Librarian
> Thomas J. Watson Library
> Metropolitan Museum of Art
> 212-650-2443
> tamara.fu...@metmuseum.org 


[CODE4LIB] code4lib chicago meeting

2015-10-05 Thread Eric Lease Morgan
A Code4Lib Chicago meeting has been scheduled for Monday, November 23 from 8:30 
to 5 o’clock at the University of Illinois-Chicago. [1] Sign up early. Sign up 
often.

[1] meeting - http://wiki.code4lib.org/Code4Lib_Chicago

—
Eric Lease Morgan, Librarian-At-Large


Re: [CODE4LIB] code4lib chicago

2015-09-02 Thread Eric Lease Morgan
On Sep 2, 2015, at 12:04 PM, Cary Gordon  wrote:

> http://cod4lib.com

 ROTFL!!! —Eric Morgan


Re: [CODE4LIB] Code4libBC (Vancouver, BC) - save the date November 26/27. 2015

2015-09-01 Thread Eric Lease Morgan
On Aug 31, 2015, at 9:23 PM, Cary Gordon  wrote:

> Perhaps this belongs on the Cod4lib list.
 ^^^

Yesterday, I didn’t quite understand the allusion to the East Coast, but now I 
see that I lost an e in Code4Lib. Cod4Lib. That's pretty funny. Thanks!  :-D  
—Earache


Re: [CODE4LIB] "coders for libraries"

2015-09-01 Thread Eric Lease Morgan
On Sep 1, 2015, at 9:42 AM, Eric Hellman  wrote:

> As someone who feels that Code4Lib should welcome people who don't 
> particularly identify as "coders", I would welcome a return to the previous 
> title attribute.

  1++ because I believe it is more about libraries than it is about code.  —ELM


Re: [CODE4LIB] code4lib chicago

2015-08-31 Thread Eric Lease Morgan
On Aug 28, 2015, at 11:56 AM, Allan Berry  wrote:

> The UIC Library would be happy to host the Code4Lib event, in November or 
> early December.

The folks at University of Illinois-Chicago would like to sponsor a one-day 
Cod4Lib event, and in order to determine the best date, they are asking folks 
to complete the following Doodle Poll:

  http://doodle.com/45aukez6z6pyav62

Code4Lib events are great ways to meet people doing the same work you are doing 
to discuss common problems and solutions. Chicago is large and central. Fill 
out the Poll. Come to Chicago. Invigorate your professional life.

—
Eric Morgan


Re: [CODE4LIB] Code4Lib 2016: Philadelphia - Save the Date [url]

2015-08-10 Thread Eric Lease Morgan
On Aug 10, 2015, at 11:38 AM, David Lacy david.l...@villanova.edu wrote:

> The 2016 conference will be held from March 7 through March 10 in the Old 
> City District of Philadelphia.  This location puts conference attendees 
> within easy walking distance of many of Philadelphia’s historical treasures, 
> including Independence Hall, the Liberty Bell, the Constitution Center, and 
> the house where Thomas Jefferson drafted the Declaration of Independence. 
> Attendees will also be a very short distance from the Delaware River 
> waterfront and will be a short walk from numerous excellent restaurants.


  Cool! Is there an official Code4Lib 2016 Annual Meeting URL, and if so, then 
what is it? —Eric Morgan


[CODE4LIB] code4lib chicago (chicode4lib)

2015-07-29 Thread Eric Lease Morgan
As some of you in & around Chicago may or may not know, there is a Code4Lib 
Chicago group called chicode4lib. See the Google Group:

  https://groups.google.com/forum/#!forum/chicode4lib

I’m simply trying to drum up business for the community.

— 
Eric Lease Morgan


Re: [CODE4LIB] survey of image analysis packages

2015-07-25 Thread Eric Lease Morgan
On Jul 22, 2015, at 6:49 PM, Peter Mangiafico pmangiaf...@stanford.edu wrote:

> I am conducting a survey of software used for image analysis and metadata 
> enhancement.  Examples include facial recognition, object identification, 
> similarity matching, and so on.  The goal is to understand if it is possible 
> to use algorithmic techniques to improve discoverability in a large dataset 
> consisting mostly of images.  The main project I am working on is automobile 
> history (http://revslib.stanford.edu/) but the 
> techniques can of course be applied much more widely.  I'm interested in a 
> broad sweep, could be open source, commercial, service model, API, etc.  If 
> you have projects you are aware of, or tools you have used or heard about, 
> and wouldn't mind sending me an email, I'd appreciate it!


Alas, I do not have anything to contribute to your survey, but I sure would 
like to see the results. I believe image analysis of this sort is something to 
be taken advantage of in libraries. ‘Looking forward. —Eric Morgan


[CODE4LIB] position description

2015-07-09 Thread Eric Lease Morgan
Below is a position description, posted by request:

  East Carolina University’s Joyner Library is seeking to fill the
  position of Head of the Application & Digital Services (ADS) department.
  The ADS team has a track record of implementing open source solutions
  and developing custom applications. This team works closely with
  personnel across ECU Libraries to develop, manage, and support
  large-scale applications, such as the Digital Collections repository,
  the ScholarShip Institutional Repository, and the Blacklight library
  catalog discovery layer. The successful candidate will provide
  leadership and vision for the ADS department and ensure departmental
  goals are met.

  The Head’s primary roles will be as a project manager for new and
  existing development projects and manager of the staff in the department
  (currently 6 people). Knowledge of and ability to institute and
  communicate a development process to analyze, design, develop,
  implement, and evaluate each project will be critical. Key skills in the
  position will be effective communication and decision making, as well as
  the ability to work with stakeholders to maintain and improve tools and
  interfaces.

  Additionally, this person will collaborate with colleagues to establish
  and manage metrics for measuring, analyzing, and optimizing user
  satisfaction. An important role for the Head of ADS will be to monitor
  trends in web design/development within the library environment and plan
  strategically to implement innovative changes for the Libraries. Also
  important is the responsibility of setting priorities, ensuring
  professional growth of the members of the department, and managing
  department activities to assure the best use of time and resources to
  meet project-defined objectives and deliverables.


For more detail, see: http://www.ecu.edu/cs-lib/about/job942036.cfm

—
Eric Lease Morgan


[CODE4LIB] eebo-tcp workset browser

2015-06-20 Thread Eric Lease Morgan
I have put on GitHub a thing I call the EEBO-TCP Workset Browser. [1] From the 
README file:

  The EEBO-TCP Workset Browser is a suite of software designed to support
  distant reading against the corpus called the Early English Books
  Online - Text Creation Partnership corpus. Using the Browser it is
  possible to: 1) search a catalog of the corpus's metadata, 2) create a
  list of identifiers representing a subset of content for study, 3) feed
  the identifiers to a set of files which will mirror the content locally,
  index it, and do some rudimentary analysis outputting as set of HTML
  files, structured data, and graphs. The reader is then expected to
  examine the output more closely (all puns intended) using their
  favorite Web browser, text editor, spreadsheet, database, or statistical
  application. The purpose and functionality of this suite is very similar
  to the purpose and functionality of HathiTrust Research Center Workset
  Browser.

[1] EBO-TCP Workset Browser - 
https://github.com/ericleasemorgan/EEBO-TCP-Workset-Browser

—
Eric Lease Morgan, Librarian
University of Notre Dame


Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Eric Lease Morgan
On Jun 18, 2015, at 12:02 PM, Matt Sherman matt.r.sher...@gmail.com wrote:

> I am working with colleague on a side project which involves some scanned
> bibliographies and making them more web searchable/sortable/browse-able.
> While I am quite familiar with the metadata and organization aspects we
> need, but I am at a bit of a loss on how to automate the process of putting
> the bibliography in a more structured format so that we can avoid going
> through hundreds of pages by hand.  I am pretty sure regular expressions
> are needed, but I have not had an instance where I need to automate
> extracting data from one file type (PDF OCR or text extracted to Word doc)
> and place it into another (either a database or an XML file) with some
> enrichment.  I would appreciate any suggestions for approaches or tools to
> look into.  Thanks for any help/thoughts people can give.


If I understand your question correctly, then you have two problems to address: 
1) converting PDF, Word, etc. files into plain text, and 2) marking up the 
result (which is a bibliography) into structure data. Correct?

If so, then if your PDF documents have already been OCRed, or if you have other 
files, then you can probably feed them to TIKA to quickly and easily extract 
the underlying plain text. [1] I wrote a brain-dead shell script to run TIKA in 
server mode and then convert Word (.docx) files. [2]
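
As a sketch, and assuming tika-app.jar and bibliography.docx are stand-ins for your own 
jar and files, the extraction step can be as simple as:

  # extract plain text from a single file with the command-line application
  java -jar tika-app.jar --text bibliography.docx > bibliography.txt

  # or do the same thing against a running Tika server (port 9998 is the default)
  curl -T bibliography.docx --header "Accept: text/plain" http://localhost:9998/tika > bibliography.txt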

When it comes to marking up the result into structured data, well, good luck. I 
think such an application is something Library Land has sought for a long time. 
Can you say “Holy Grail”?

[1] Tika - https://tika.apache.org
[2] brain-dead script - 
https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff

— 
Eric


[CODE4LIB] hathitrust research center user group meeting [tomorrow (thursday)]

2015-06-10 Thread Eric Lease Morgan
Consider participating in a conference call (tomorrow, Thursday) on the topic 
of the HathiTrust Research Center.

A HathiTrust Research Center User’s Group Meeting is scheduled for tomorrow 
(Thursday), June 11 from 3-4 o’clock-ish:

Who - anybody and everybody
   What - a discussion of all things HathiTrust Research Center
   When - Thursday, June 11 from 3-4:00 Eastern Time
  Where - via the telephone: (812) 856-3600 or (317) 278-7008 with PIN 803140#
Why - because both you and they have something to offer librarianship

More specifically, Thursday's conference call is about at least two things: 1) 
your concerns regarding the Center, and 2) a discussion of my fledgling 
Workset Browser. [1, 2] This is an opportunity for you to learn the why's &
wherefore's of the Center, as well as influence the direction of programming 
initiatives. For example, you can learn more about the Center's authorization 
and copyright restrictions. You can also discuss how you think the Center can 
provide support for the digital humanities and text mining. 

[1] HathiTrust Research Center - http://hathitrust.org/htrc
[2] blog posting describing the Browser -  
http://blogs.nd.edu/emorgan/2015/05/htrc-workset-browser/

—
Eric Lease Morgan
University of Notre Dame


Re: [CODE4LIB] eebo [perfect texts]

2015-06-08 Thread Eric Lease Morgan
On Jun 8, 2015, at 7:32 AM, Owen Stephens o...@ostephens.com wrote:

> I’ve just seen another interesting take based (mainly) on data in the 
> TCP-EEBO release:
> 
> https://scalablereading.northwestern.edu/2015/06/07/shakespeare-his-contemporaries-shc-released/
> 
> It includes mention of MorphAdorner[1] which does some clever stuff around 
> tagging parts of speech, spelling variations, lemmata etc. and another tool 
> which I hadn't come across before AnnoLex[2] “for the correction and 
> annotation of lexical data in Early Modern texts”.
> 
> This paper[3] from Alistair Baron and Andrew Hardie at the University of 
> Lancaster in the UK about preparing EEBO-TCP texts for corpus-based analysis 
> may also be of interest, and the team at Lancaster have developed a tool 
> called VARD which supports pre-processing texts[4]
> 
> [1] http://morphadorner.northwestern.edu
> [2] http://annolex.at.northwestern.edu
> [3] http://eprints.lancs.ac.uk/60272/1/Baron_Hardie.pdf
> [4] http://ucrel.lancs.ac.uk/vard/about/


All of this is really very interesting. Really. At the same time, there seems 
to be a WHOLE lot of effort spent on cleaning and normalizing data, and very 
little done to actually analyze it beyond “close reading”. The final goal of 
all these interfaces seems to be refined search. Frankly, I don't need search. 
And the only community who will want this level of search will be the scholarly 
scholar. “What about the undergraduate student? What about the just more than 
casual reader? What about the engineer?” Most people don’t know how or why 
parts-of-speech are important let alone what a lemma is. Nor do they care. I 
can find plenty of things. I need (want) analysis. Let’s assume the data is 
clean — or rather, accept the fact that there is dirty data akin to the dirty 
data created through OCR and there is nothing a person can do about it — let's 
see some automated comparisons between texts. Examples might include:

  * this one is longer
  * this one is shorter
  * this one includes more action
  * this one discusses such & such theme more than this one
  * so & so theme came and went during a particular time period
  * the meaning of this phrase changed over time
  * the author’s message of this text is…
  * this given play asserts the following facts
  * here is a map illustrating where the protagonist went when
  * a summary of this text includes…
  * this work is fiction
  * this work is non-fiction
  * this work was probably influenced by…

We don’t need perfect texts before analysis can be done. Sure, perfect texts 
help, but they are not necessary. Observations and generalization can be made 
even without perfectly transcribed texts. 

—
ELM


Re: [CODE4LIB] eebo [developments]

2015-06-07 Thread Eric Lease Morgan
Here some of developments with my playing with the EEBO data. 

I used the repository on Box to get my content, and I mirrored it locally. [1, 
2] I then looped through the content using XPath to extract rudimentary 
metadata, thus creating a “catalog” (index). Along the way I calculated the 
number of words in each document and saved that as a field of each record. 
Being a tab-delimited file, it is trivial to import the catalog into my 
favorite spreadsheet, database, editor, or statistics program. This allowed me 
to browse the collection. I then used grep to search my catalog, and save the 
results to a file. [5] I searched for Richard Baxter. [6, 7, 8]. I then used an 
R script to graph the numeric data of my search results. Currently, there are 
only two types: 1) dates, and 2) number of words. [9, 10, 11, 12] From these 
graphs I can tell that Baxter wrote a lot of relatively short things, and I can 
easily see when he published many of his works. (He published a lot around 1680 
but little in 1665.) I then transformed the search result!
 s into a browsable HTML table. [13] The table has hidden features. (Can you 
say, “Usability?”) For example, you can click on table headers to sort. This is 
cool because I want sort things by number of words. (Number of pages doesn’t 
really tell me anything about length.) There is also a hidden link to the left 
of each record. Upon clicking on the blank space you can see subjects, 
publisher, language, and a link to the raw XML. 

For a good time, I then repeated the process for things Shakespeare and things 
astronomy. [14, 15] Baxter took me about twelve hours worth of work, not 
counting the caching of the data. Combined, Shakespeare and astronomy took me 
less than five minutes. I then got tired.
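
For the curious, the XPath and grep steps amount to something like the following, 
assuming xmllint is installed; the local-name() trick sidesteps the TEI namespace 
declarations:

  # pull a couple of rudimentary fields from one cached TEI file
  xmllint --xpath 'string(//*[local-name()="titleStmt"]/*[local-name()="title"][1])'  xml/A0/A06567.xml
  xmllint --xpath 'string(//*[local-name()="titleStmt"]/*[local-name()="author"][1])' xml/A0/A06567.xml

  # and the "search" step is nothing more than grep against the tab-delimited catalog
  grep -i 'baxter, richard' catalog.txt > baxter/baxter.txt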

My next steps are multi-faceted and presented in the following incomplete 
unordered list:

  * create browsable lists - the TEI metadata is clean and
consistent. The authors and subjects lend themselves very well to
the creation of browsable lists.

  * CGI interface - The ability to search via Web interface is
imperative, and indexing is a prerequisite.

  * transform into HTML - TEI/XML is cool, but…

  * create sets - The collection as a whole is very interesting,
but many scholars will want sub-sets of the collection. I will do
this sort of work, akin to my work with the HathiTrust. [16]

  * do text analysis - This is really the whole point. Given the
full text combined with the inherent functionality of a computer,
additional analysis and interpretation can be done against the
corpus or its subsets. This analysis can be based the counting of
words, the association of themes, parts-of-speech, etc. For
example, I plan to give each item in the collection a colors,
“big” names, and “great” ideas coefficient. These are scores
denoting the use of researcher-defined “themes”. [17, 18, 19] You
can see how these themes play out against the complete writings
of “Dead White Men With Three Names”. [20, 21, 22]

Fun with TEI/XML, text mining, and the definition of librarianship.


 [1] Box - http://bit.ly/1QcvxLP
 [2] mirror - http://dh.crc.nd.edu/sandbox/eebo-tcp/xml/
 [3] xpath script - http://dh.crc.nd.edu/sandbox/eebo-tcp/bin/xml2tab.pl
 [4] catalog (index) - http://dh.crc.nd.edu/sandbox/eebo-tcp/catalog.txt
 [5] search results - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/baxter.txt
 [6] Baxter at VIAF - http://viaf.org/viaf/54178741
 [7] Baxter at WorldCat - http://www.worldcat.org/wcidentities/lccn-n50-5510
 [8] Baxter at Wikipedia - http://en.wikipedia.org/wiki/Richard_Baxter
 [9] box plot of dates - 
http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/boxplot-dates.png
[10] box plot of words - 
http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/boxplot-words.png
[11] histogram of dates - 
http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/histogram-dates.png
[12] histogram of words - 
http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/histogram-words.png
[13] HTML - http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/baxter.html
[14] Shakespeare - http://dh.crc.nd.edu/sandbox/eebo-tcp/shakespeare/
[15] astronomy - http://dh.crc.nd.edu/sandbox/eebo-tcp/astronomy/
[16] HathiTrust work - http://blogs.nd.edu/emorgan/2015/06/browser-on-github/
[17] colors - 
http://dh.crc.nd.edu/sandbox/htrc-workset-browser/etc/theme-colors.txt
[18] “big” names - 
http://dh.crc.nd.edu/sandbox/htrc-workset-browser/etc/theme-names.txt
[19] “great” ideas - 
http://dh.crc.nd.edu/sandbox/htrc-workset-browser/etc/theme-ideas.txt
[20] Thoreau - 
http://dh.crc.nd.edu/sandbox/htrc-workset-browser/thoreau/about.html
[21] Emerson - 
http://dh.crc.nd.edu/sandbox/htrc-workset-browser/emerson/about.html
[22] Channing - 
http://dh.crc.nd.edu/sandbox/htrc-workset-browser/channing/about.html


—
Eric Lease Morgan, Librarian
University of Notre Dame


[CODE4LIB] eebo

2015-06-05 Thread Eric Lease Morgan
Does anybody here have experience reading the SGML/XML files representing the 
content of EEBO? 

I’ve gotten my hands on approximately 24 GB of SGML/XML files representing the 
content of EEBO (Early English Books Online). This data does not include page 
images. Instead it includes metadata of various ilks as well as the transcribed 
full text. I desire to reverse engineer the SGML/XML in order to: 1) provide an 
alternative search/browse interface to the collection, and 2) support various 
types of text mining services. 

While I am making progress against the data, it would be nice to learn of other 
people’s experience so I do not re-invent the wheel (too many times). ‘Got 
ideas?

—
Eric Lease Morgan
University Of Notre Dame


Re: [CODE4LIB] eebo

2015-06-05 Thread Eric Lease Morgan
On Jun 5, 2015, at 8:20 AM, Ethan Gruber ewg4x...@gmail.com wrote:

>> Does anybody here have experience reading the SGML/XML files representing
>> the content of EEBO?
> 
> Are these in TEI? Back when I worked for the University of Virginia
> Library, I did a lot of clean up work and migration of Chadwyck-Healey
> stuff into TEI-P4 compliant XML (thousands of files), but unfortunately all
> of the Perl scripts to migrate old garbage SGML into XML are probably gone.
> 
> How many of these things are really worth keeping, i.e., were not digitized
> by any other organization that has freely published them online?


The data I have comes in two flavors: 1) some flavor of SGML, and 2) some 
flavor of XML which is TEI-like, but not TEI. All of the files are worth 
keeping because I get the basic bibliographic information (id, author, title, 
date, keywords/subjects), as well as transcribed text. (No images.) Given such 
data, I think I can provide interesting, cool, and “kewl” services. Given the 
id number, I may then be able to link to the scanned image. Wish me luck. —ELM


Re: [CODE4LIB] eebo [resolved and coolness!!]

2015-06-05 Thread Eric Lease Morgan
On Jun 5, 2015, at 8:10 AM, Eric Lease Morgan emor...@nd.edu wrote:

> Does anybody here have experience reading the SGML/XML files representing the 
> content of EEBO? 

I ultimately found the EEBO files in the form of TEI, and then I was able to 
transform one of them into VERY functional HTML5. Coolness! Here’s the recipe:

 1. download P5 from Box [1]
 2. download stylesheets from GitHub [2]
 3. transform using Saxon [3]
 4. save output to HTTP server 
 5. open in browser [4]
 6. read results AND get scanned image

Nice clean data + fully functional stylesheets = really cool output

[1] P5 - http://bit.ly/1QcvxLP
[2] stylesheets - https://github.com/TEIC/Stylesheets
[3] transform - java -cp saxon9he.jar net.sf.saxon.Transform -t 
-s:/var/www/html/sandbox/eebo-tcp/xml/A0/A06567.xml 
-xsl:/var/www/html/sandbox/eebo-tcp/style/html5/html5.xsl  
/var/www/html/tmp/eebo.html
[4] output - http://dh.crc.nd.edu/tmp/eebo.html

—
ELM


[CODE4LIB] hathitrust research center user group meeting [rescheduled]

2015-06-04 Thread Eric Lease Morgan
The HathiTrust Research Center User’s Group Meeting (conference call) has been 
rescheduled for next Thursday, June 11:

   Who - anybody and everybody
  What - a discussion of all things HathiTrust Research Center
  When - Thursday, June 11 from 3-4:00 Eastern Time
 Where - via the telephone: (812) 856-3600 or (317) 278-7008 with PIN 803140
   Why - because both you and they have something to offer librarianship

More specifically, next Thursday's conference call is about at least two 
things: 1) your concerns regarding the Center, and 2) a discussion of my 
fledgling Workset Browser. [1, 2] This is an opportunity for you to learn the 
why's & wherefore's of the Center, as well as influence the direction of 
programming initiatives. For example, you can learn more about their 
authorization and copyright restrictions. You can also discuss how you think 
the Center can provide support for the digital humanities and text mining. 

[1] HathiTrust Research Center - http://hathitrust.org/htrc
[2] blog posting describing the Browser -  
http://blogs.nd.edu/emorgan/2015/05/htrc-workset-browser/

—
Eric Lease Morgan
University of Notre Dame


Re: [CODE4LIB] hathitrust research center workset browser [github]

2015-06-02 Thread Eric Lease Morgan
I believe I have created a repository of my HTRC Workset Browser code (shell 
and Python scripts) on GitHub. [1] From the Quick Start section of the README:

  1. Download the software putting the bin and etc directories in the same 
directory.
  2. Change to the directory where the bin and etc directories have been saved.
  3. Build a collection by issuing the following command:

   ./bin/build-corpus.sh thoreau etc/rsync-thoreau.sh

  If all goes well, the Browser will create a new directory named thoreau,
  rsync a bunch o' JSON files from the HathiTrust to your computer, index
  the JSON files, do some textual analysis against the corpus, create a
  simple database (catalog), and create a few more reports. You can then
  peruse the files in the newly created thoreau directory. If this worked,
  then repeat the process for the other rsync files found in the etc
  directory.

Probably the first issue people will have is the path to their version of 
Python. (Sigh.)

[1] repository - https://github.com/ericleasemorgan/HTRC-Workset-Browser

—
Eric “Git Ignorant” Morgan


Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread Eric Lease Morgan
On Jun 1, 2015, at 10:58 AM, davesgonechina davesgonech...@gmail.com wrote:

> They just informed me I need a .edu address. Having trouble understanding
> the use of the term public domain here.

  Gung fhpx, naq fbhaqf ernyyl fbeg bs fghcvq!! --RYZ


Re: [CODE4LIB] hathitrust research center workset browser

2015-06-01 Thread Eric Lease Morgan
On Jun 1, 2015, at 4:33 AM, davesgonechina davesgonech...@gmail.com wrote:

> If your *institutional* email address is not on their whitelist (not sure
> if it is limited to subscribing ones, they don't say) you cannot register
> using the signup form, instead you can only request an account by briefly
> explaining why you want one. Weird, because they'd have potentially learned
> more about me if they just let me put my gmail address in the signup form.
> 
> I don't get it - can all users download public domain content? If they give
> me an account, will I be indistinguishable from a subscribing institution?
> If not, why the extra hoops?


Dave, you are the second person to bring this “white listing” issue to my 
attention. Bummer! Yes, apparently, unless your email address is a part of 
wider something or another, then you need to be authorized to use the Research 
Center. Weird! In my opinion, while the Research Center’s tools work, I believe 
the site suffers from usability issues.

In any event, I have enhanced the auto-generated reports created by my 
“Browser”, and while they are very textual, I also believe they are insightful. 
For example, the complete works of:

  * William Ellery Channing - http://bit.ly/browser-channing-about
  * Jane Austen - http://bit.ly/browser-austen-about
  * Ralph Waldo Emerson - http://bit.ly/browser-emerson-about
  * Henry David Thoreau - http://bit.ly/browser-thoreau-about

—
Eric “Beginning To Suffer From ‘Creeping Featuritis’” Morgan


[CODE4LIB] hathitrust research center user group meeting

2015-06-01 Thread Eric Lease Morgan
Consider participating in Thursday's HathiTrust Research Center User Group 
Meeting:

Who - anybody and everybody
   What - a discussion of all things HathiTrust Research Center
   When - this Thursday, June 4 from 3-4:00 Eastern Time
  Where - via the telephone: (812) 856-3600 or (317) 278-7008 with PIN 803140
Why - because both you and they have something to offer librarianship

More specifically, Thursday's conference call is about at least two things: 1) 
your concerns regarding the Center, and 2) a discussion of my fledgling 
Workset Browser. [1, 2] This is an opportunity for you to learn the why's &
wherefore's of the Center, as well as influence the direction of programming 
initiatives. For example, you can learn more about their authorization and 
copyright restrictions. You can also discuss how you think the Center can 
provide support for the digital humanities and text mining. 

[1] HathiTrust Research Center - http://hathitrust.org/htrc
[2] blog posting describing the Browser - http://ntrda.me/1FUGP2g

—
Eric Lease Morgan
University of Notre Dame


Re: [CODE4LIB] hathitrust research center workset browser

2015-05-28 Thread Eric Lease Morgan
On May 27, 2015, at 6:33 PM, Karen Coyle li...@kcoyle.net wrote:

>> In my copious spare time I have hacked together a thing I’m calling the 
>> HathiTrust Research Center Workset Browser, a (fledgling) tool for doing 
>> “distant reading” against corpora from the HathiTrust. [0, 1] ...
>> 
>> 'Want to give it a try? For a limited period of time, go to the HathiTrust 
>> Research Center Portal, create (refine or identify) a collection of personal 
>> interest, use the Algorithms tool to export the collection's rsync file, and 
>> send the file to me. I will feed the rsync file to the Browser, and then 
>> send you the URL pointing to the results.
>> 
>> [0] introduction in a blog posting - http://ntrda.me/1FUGP2g
>> [1] HTRC Workset Browser - http://bit.ly/workset-browser
> 
> Eric, what happens if you access this from a non-HT institution? When I go to 
> HT I am often unable to download public domain titles because they aren't 
> available to members of the general public.


The short answer is, “Nothing”.

The long answer is… longer. The HathiTrust proper is accessible to anybody, but 
the downloading of public domain content is only available to subscribing 
institutions.

On the other hand, the “Workset Browser” is designed to work off the HathiTrust 
Research Center Portal, not the HathiTrust proper. The Portal is located at 
http://sharc.hathitrust.org From there anybody can search the collection of 
public domain content, create collections, and apply various algorithms against 
collections. One of the algorithms is “create RSYNC file” which, in turn, 
allows you to download bunches o’ metadata describing the items in your 
collection. (There is also a “download as MARC” algorithm.) This rsync file is 
the root of the Workset Browser. Feed the Browser a rsync file, and the Browser 
will mirror content locally, index it, and generate reports describing the 
collection. 

Thank you for asking. Many people do not know there is a HathiTrust Research 
Center.

—
Eric Morgan


Re: [CODE4LIB] hathitrust research center workset browser [call for worksets]

2015-05-27 Thread Eric Lease Morgan
On May 26, 2015, at 11:30 AM, Eric Lease Morgan emor...@nd.edu wrote:

> In my copious spare time I have hacked together a thing I’m calling the 
> HathiTrust Research Center Workset Browser, a (fledgling) tool for doing 
> “distant reading” against corpora from the HathiTrust. [0]
> 
>   [0] introductory Workset Browser blog posting - http://ntrda.me/1FUGP2g


Help me put the my fledgling Browser through some paces; this is a call for 
HathiTrust Research Center worksets.

For a limited period of time, go to the HathiTrust Research Center Portal, 
create (refine or identify) a collection of personal interest, use the 
Algorithms tool to export the collection's rsync file, and send the file to me. 
[1] I will feed the rsync file to the Browser, and then send you the URL 
pointing to the results. Let’s see what happens?

[1] HathiTrust Research Center Portal - https://sharc.hathitrust.org

—
Eric Morgan


[CODE4LIB] hathitrust research center workset browser

2015-05-26 Thread Eric Lease Morgan
In my copious spare time I have hacked together a thing I’m calling the 
HathiTrust Research Center Workset Browser, a (fledgling) tool for doing 
“distant reading” against corpora from the HathiTrust. [1]

The idea is to: 1) create, refine, or identify a HathiTrust Research Center 
workset of interest — your corpus, 2) feed the workset’s rsync file to the 
Browser, 3) have the Browser download, index, and analyze the corpus, and 4) 
enable to reader to search, browse, and interact with the result of the 
analysis. With varying success, I have done this with a number of worksets 
on topics ranging from literature and philosophy to Rome and cookery. The best 
working examples are the ones from Thoreau and Austen. [2, 3] The others are 
still buggy.

As a further example, the Browser can/will create reports describing the corpus 
as a whole. This analysis includes the size of a corpus measured in pages as 
well as words, date ranges, word frequencies, and selected items of interest 
based on pre-set “themes” — usage of color words, name of “great” authors, and 
a set of timeless ideas. [4] This report is based on more fundamental reports 
such as frequency tables, a “catalog”, and lists of unique words. [5, 6, 7, 8] 

The whole thing is written in a combination of shell and Python scripts. It 
should run on just about any out-of-the-box Linux or Macintosh computer. Take a 
look at the code. [9] No special libraries needed. (“Famous last words.”) In 
its current state, it is very Unix-y. Everything is done from the command line. 
Lot’s of plain text files and the exploitation of STDIN and STDOUT. Like a 
Renaissance cartoon, the Browser, in its current state, is only a sketch. Only 
later will a more full-bodied, Web-based interface be created. 

The next steps are numerous and listed in no priority order: putting the whole 
thing on GitHub, outputting the reports in generic formats so other things can 
easily read them, improving the terminal-based search interface, implementing a 
Web-based search interface, writing advanced programs in R that chart and graph 
analysis, provide a means for comparing & contrasting two or more items from a 
corpus, indexing the corpus with a (real) indexer such as Solr, writing a 
“cookbook” describing how to use the browser to do “kewl” things, making the 
metadata of corpora available as Linked Data, etc.

'Want to give it a try? For a limited period of time, go to the HathiTrust 
Research Center Portal, create (refine or identify) a collection of personal 
interest, use the Algorithms tool to export the collection's rsync file, and 
send the file to me. I will feed the rsync file to the Browser, and then send 
you the URL pointing to the results. [10] Let’s see what happens.

Fun with public domain content, text mining, and the definition of 
librarianship.

Links

   [1] HTRC Workset Browser - http://bit.ly/workset-browser
   [2] Thoreau - http://bit.ly/browser-thoreau
   [3] Austen - http://bit.ly/browser-austen
   [4] Thoreau report - http://ntrda.me/1LD3xds
   [5] Thoreau dictionary (frequency list) - http://bit.ly/thoreau-dictionary
   [6] usage of color words in Thoreau — http://bit.ly/thoreau-colors
   [7] unique words in the corpus - http://bit.ly/thoreau-unique
   [8] Thoreau “catalog” — http://bit.ly/thoreau-catalog
   [9] source code - http://ntrda.me/1Q8pPoI
  [10] HathiTrust Research Center - https://sharc.hathitrust.org

— 
Eric Lease Morgan, Librarian
University of Notre Dame


Re: [CODE4LIB] is python s l o o o w ? [resolved]

2015-05-18 Thread Eric Lease Morgan
On May 18, 2015, at 9:23 PM, Galen Charlton g...@esilibrary.com wrote:

>> I have two scripts, attached. They do EXACTLY the same thing
>> in almost EXACTLY the same manner, but the Python script is
>> almost 25 times slower than the Perl script:
> 
> I'm no Python expert, but I think that the difference is much more
> likely due to which JSON processor is being used.  I suspect your Perl
> environment has the JSON::XS module, which is written in C, is fast,
> and is automatically invoked (if present) by "use JSON;".
> 
> In contrast, I believe that the Python json library is written in
> Python itself.  I tried swapping in cjson and UltraJSON [1] in place
> of json in your Python script, and in both cases it ran rather
> faster.
> 
> [1] https://github.com/esnme/ultrajson


Thank you. After using the Python module ujson instead of json, the speed of my 
two scripts is now all but equal. Whew! —Eric
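
P.S. For anybody who wants to reproduce the comparison, something along these lines 
ought to do the trick, assuming ujson has been pip-installed and a directory of sample 
JSON files is at hand:

  # parse every JSON file in sample/ with each library and compare the wall-clock times
  time python -c "import json,  glob; [json.load(open(f))  for f in glob.glob('sample/*.json')]"
  time python -c "import ujson, glob; [ujson.load(open(f)) for f in glob.glob('sample/*.json')]"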


[CODE4LIB] is python s l o o o w ?

2015-05-18 Thread Eric Lease Morgan

Is it just me, or is Python  s l o o o w  when compared to Perl?

I have two scripts, attached. They do EXACTLY the same thing in almost EXACTLY 
the same manner, but the Python script is almost 25 times slower than the Perl 
script:

  $ time bin/json2catalog.py sample/ > sample.db 2>/dev/null
  real 0m10.344s
  user 0m10.281s
  sys 0m0.059s

  $ time bin/json2catalog.pl sample/ > sample.db 2>/dev/null
  real 0m0.364s
  user 0m0.314s
  sys 0m0.048s

When I started learning Python, and specifically learning Python’s Natural 
Language Toolkit (NLTK), I thought this slowness was due to the large NLTK 
library, but now I’m not so sure. Is it just me, or is Python really  s l o o o 
w ? Is there anything I can do to improve/optimize my Python code?

—
Eric Lease Morgan

#!/usr/bin/env python2

# json2catalog.py - create a catalog from a set of HathiTrust json files

# Eric Lease Morgan emor...@nd.edu
# May 18, 2015 - first cut; see https://sharc.hathitrust.org/features


# configure
HEADER   = "id\ttitle\tpublication date\tpage count\tHathiTrust URL\tlanguage\tMARC (JSON) URL\tWorldCat URL"
WORLDCAT = 'http://worldcat.org/oclc/'

# require
import glob
import json
import sys
import os

# sanity check
if len( sys.argv ) != 2 :
	print "Usage:", sys.argv[ 0 ], 'directory'
	quit()

# get input
directory = sys.argv[ 1 ]

# intialize
print( HEADER )

# process each json file in the given directory
for filename in glob.glob( directory + '*.json' ):

	# open and read the file
	with open( filename ) as data: metadata = json.load( data )
		
	# parse
	id   = metadata[ 'id' ]
	title= metadata[ 'metadata' ]['title' ]
	date_created = metadata[ 'metadata' ][ 'dateCreated' ]
	page_count   = metadata[ 'features' ][ 'pageCount' ]
	handle   = metadata[ 'metadata' ][ 'handleUrl' ]
	language = metadata[ 'metadata' ][ 'language' ]
	marc = metadata[ 'metadata' ][ 'htBibUrl' ]
	worldcat = WORLDCAT + metadata[ 'metadata' ][ 'oclc' ]

	# create a list and print it
	metadata = [ id, title, date_created, page_count, handle, language, marc, worldcat ]
	print( '\t'.join( map( str, metadata ) ) )
	
# done
quit()

#!/usr/bin/perl

# json2catalog.pl - create a catalog from a set of HathiTrust json files

# Eric Lease Morgan emor...@nd.edu
# May 15, 2015 - first cut; see https://sharc.hathitrust.org/features


# configure
use constant DEBUG    => 0;
use constant WORLDCAT => 'http://worldcat.org/oclc/';
use constant HEADER   => "id\ttitle\tpublication date\tpage count\tHathiTrust URL\tlanguage\tMARC (JSON) URL\tWorldCat URL\n";

# require
use Data::Dumper;
use JSON;
use strict;

# get input; sanity check
my $directory = $ARGV[ 0 ];
if ( ! $directory ) {

	print "Usage: $0 directory\n";
	exit;
	
}

# initialize
$| = 1;
binmode( STDOUT, ':utf8' );
print HEADER;

# process each file in the given directory
opendir DIRECTORY, $directory or die "Error in opening $directory: $!\n";
while ( my $filename = readdir( DIRECTORY ) ) {

	# only .json files
	next if ( $filename !~ /json$/ );

	# convert the json file to a hash
	my $json = decode_json slurp( "$directory$filename" );
	if ( DEBUG ) { print Dumper( $json ) }

	# parse
	my $id= $$json{ 'id' };
	my $title = $$json{ 'metadata' }{ 'title' };
	my $date  = $$json{ 'metadata' }{ 'pubDate' };
	my $pagecount = $$json{ 'features' }{ 'pageCount' };
	my $handle= $$json{ 'metadata' }{ 'handleUrl' };
	my $language  = $$json{ 'metadata' }{ 'language' };
	my $marc  = $$json{ 'metadata' }{ 'htBibUrl' };
	my $worldcat  = WORLDCAT . $$json{ 'metadata' }{ 'oclc' };

	# dump
	print "$id\t$title\t$date\t$pagecount\t$handle\t$language\t$marc\t$worldcat\n";
	 
}

# clean up and done
closedir(DIRECTORY);
exit;


# read and return the contents of a file
sub slurp {
 
	my $f = shift;
	open ( F, $f ) or die "Can't open $f: $!\n";
	my $r = do { local $/; <F> };
	close F;
	return $r;
 
}



Re: [CODE4LIB] Protagonists

2015-04-14 Thread Eric Lease Morgan
If a person could denote the characteristics of both the main (female) character 
as well as the protagonist, then bits of natural language processing (text 
mining) might be able to address this problem. —Eric “When You Have A Hammer, 
Everything Begins To Look Like a Nail” Morgan


[CODE4LIB] 3,082

2015-03-04 Thread Eric Lease Morgan
  Code4Lib is now 3,082 subscribers strong. Yeah! Almost time to do some 
analysis. —ELM


Re: [CODE4LIB] linked data question

2015-02-26 Thread Eric Lease Morgan
On Feb 26, 2015, at 9:48 AM, Owen Stephens o...@ostephens.com wrote:

> I highly recommend Chapter 6 of the Linked Data book which details different 
> design approaches for Linked Data applications - sections 6.3 
> (http://linkeddatabook.com/editions/1.0/#htoc84) summarises the approaches as:
> 
>   1. Crawling Pattern
>   2. On-the-fly dereferencing pattern
>   3. Query federation pattern
> 
> Generally my view would be that (1) and (2) are viable approaches for 
> different applications, but that (3) is generally a bad idea (having been 
> through federated search before!)


And at the risk of sounding like a broken record, owen++ because the “Linked 
Data book” is a REALLY good read!! [0] While it is computer science-y, it is 
also authoritative, easy-to-read, full of examples, and just plain makes a 
whole lot of sense. 

[0] linked data book - http://linkeddatabook.com/

—
Eric M.


Re: [CODE4LIB] linked data question

2015-02-26 Thread Eric Lease Morgan
On Feb 25, 2015, at 3:12 PM, Sarah Weissman seweiss...@gmail.com wrote:

> I am kind of new to this linked data thing, but it seems like the real
> power of it is not full-text search, but linking through the use of shared
> vocabularies. So if you have data about Jane Austen in your database and
> you are using the same URI as other databases to represent Jane Austen in
> your data (say http://dbpedia.org/resource/Jane_Austen), then you (or
> rather, your software) can do an exact search on that URI in remote
> resources vs. a fuzzy text search. In other words, linked data is really
^
> supposed to be linked by machines and discoverable through URIs. If you
> 
> visit the URL: http://dbpedia.org/page/Jane_Austen you can see a
> human-interpretable representation of the data a SPARQL endpoint would
> return for a query for triples {http://dbpedia.org/page/Jane_Austen ?p ?o}.
> This is essentially asking the database for all subject-predicate-object
> facts it contains where Jane Austen is the subject.


Again, seweissman++  The implementation of linked data is VERY much like the 
implementation of a relational database over HTTP, and in such a scenario, the 
URIs are the database keys. —ELM


Re: [CODE4LIB] linked data question

2015-02-26 Thread Eric Lease Morgan
On Feb 25, 2015, at 2:48 PM, Esmé Cowles escow...@ticklefish.org wrote:

>> In the non-techie library world, linked data is being talked about (perhaps 
>> only in listserv traffic) as if the data (bibliographic data, for instance) 
>> will reside on remote sites (as a SPARQL endpoint??? We don't know the 
>> technical implications of that), and be displayed by your local catalog/the 
>> centralized inter-national catalog by calling data from that remote site. 
>> But the original question was how the data on those remote sites would be 
>> access points - how can I start my search by searching for that remote 
>> content?  I assume there has to be a database implementation that visits 
>> that data and pre-indexes it for it to be searchable, and therefore the 
>> index has to be local (or global a la Google or OCLC or its 
>> bibliographic-linked-data equivalent). 
> 
> I think there are several options for how this works, and different 
> applications may take different approaches.  The most basic approach would be 
> to just include the URIs in your local system and retrieve them any time you 
> wanted to work with them.  But the performance of that would be terrible, and 
> your application would stop working if it couldn't retrieve the URIs.
> 
> So there are lots of different approaches (which could be combined):
> 
> - Retrieve the URIs the first time, and then cache them locally.
> - Download an entire data dump of the remote vocabulary and host it locally.
> - Add text fields in parallel to the URIs, so you at least have a label for 
> it.
> - Index the data in Solr, Elasticsearch, etc. and use that most of the time, 
> esp. for read-only operations.


Yes, exactly. I believe Esmé has articulated the possible solutions well. 
escowles++  —ELM


Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]

2015-02-20 Thread Eric Lease Morgan
On Feb 16, 2015, at 4:54 PM, Levy, Michael ml...@ushmm.org wrote:

> I think you can accomplish what you want by using ICUFoldingFilterFactory
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory
> 
> which should simply perform ICU (cf http://site.icu-project.org/) based 
> character folding (cf. http://www.unicode.org/reports/tr30/tr30-4.html)
> 
> In schema.xml I generally have in both index and query:
> 
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.ICUFoldingFilterFactory" />


For unknown reasons, I was unable to load the ICUFoldingFilterFactory, but 
nonetheless, my interface works as expected. And I was able to do this after a 
combination of things. First, I needed to tell the indexer my content was 
Spanish, and after doing so, Solr parses things correctly. Second, I needed to 
explicitly tell my Web browser that the search form and returned content were 
using UTF-8. This was done via the HTTP content-type header, the HTML meta tag, and 
even in the HTML form. Geesh! Through this whole process I also learned about 
Solr’s edismax (extended dismax) handler. Edismax supports free form queries as 
well as Boolean logic.  solr++  But also solr+- because Solr is getting more 
and more and more complicated. —Eric “Lost In Chicago” Morgan
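
P.S. In case it saves somebody else a few hours, the usual incantations for those three 
places look more or less like the following; the form attributes are abbreviated:

  Content-Type: text/html; charset=UTF-8                                  (the HTTP header)
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />   (the HTML head)
  <form accept-charset="UTF-8" ...>                                       (the HTML form)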


Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]

2015-02-16 Thread Eric Lease Morgan
I know the documents I’m indexing are written in Spanish, and by adding the 
following filters to my field definition, I believe I have resolved my problem:

  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />

In other words, my searchable content is defined thus:

  <field name="text" type="text_general" indexed="true" stored="true" 
multiValued="false" />

And “text_general” is defined to include the filters in both the index and 
query sections:

  <fieldType name="text_general" class="solr.TextField" 
positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords.txt" />
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords.txt" />
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" 
ignoreCase="true" expand="true" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />
    </analyzer>
  </fieldType>


Re: [CODE4LIB] Job Posting [assistant manager, vancouver public library]

2015-02-13 Thread Eric Lease Morgan
[The following announcement is being passed on by request. —ELM]


Assistant Manager – Websites and Online Engagement Digital Services

Vancouver Public Library (VPL) is seeking a dynamic, strategic, and creative 
Assistant Manager to join the Digital Services department. Reporting to the 
Manager, Digital Services, and leading a team of 4 full-time equivalent (FTE) 
staff, the successful candidate will be responsible for ensuring that VPL’s 
online presence is engaging, relevant, and responsive to patrons’ needs. 

The Vancouver Public Library website is our flagship communication channel, 
with over 5 million visits a year.  In 2015 we are planning a substantial 
refresh of the site to increase digital engagement with our services and 
collections, and ensure that we provide a seamless and positive user experience 
for our patrons. We also maintain a strong social media presence which aims to 
educate and excite patrons and showcase the full range of resources the library 
offers.


The Position

In consultation with the public, library staff, and other stakeholders, you 
will be responsible for guiding the future direction of our websites. You will 
develop metrics and evaluation tools for assessing the success of www.vpl.ca 
and informing future changes to content, design, navigation, and architecture. 
Together with your team and our Library Systems (IT) department you will 
monitor emerging technologies and trends and pilot and implement creative, 
innovative ways of delivering digital services and collections and making our 
overall web presence more dynamic, engaging, and effective. 

You will be a champion for best practices in content development, both on the 
website itself and through our social media channels. Through training, 
coaching, and development of guidelines and procedures, you will help staff 
across the VPL system understand how their contributions to our online 
engagement platforms build connections with patrons and enhance the services we 
offer. You will maintain a strong awareness of emerging engagement tools and 
identify opportunities for us to expand our presence into new channels. You 
will ensure that user experience is the central focus for all of our web 
initiatives.

Your team includes two Web Librarians, a Web Technician, and a Web Graphics 
Technician. As a part of the Digital Services leadership team, you will also 
work collaboratively within the department to support the public in their use 
of all web-based library services including electronic resources, eBooks, and 
digital collections. You will participate in ensuring that our work is focused 
on achieving VPL’s strategic priorities, both within Digital Services and 
across the library system.


Qualifications and Experience

This position requires excellent collaboration and communication skills, 
thorough knowledge of current trends and best practices in the use of 
technology in delivering web-based public library services, significant 
experience in the planning, design, development, promotion and maintenance of 
websites, and demonstrated supervisory or leadership experience. A background 
in project management, web management, and user-centred design is essential. We 
are looking for an innovative, flexible individual with a demonstrated ability 
to develop positive working relationships, lead and develop staff teams, manage 
multiple projects and competing priorities, and assist staff in participating 
in and being open to change.

Qualifications include an MLS/MLIS degree from an ALA accredited post-secondary 
institution and a minimum of 2 years of recent relevant experience, including 
project management, website management, and supervisory experience. 
The Workplace

Vancouver Public Library is the third-largest library system in Canada and 
offers exceptional collections, services and technology at 21 branch libraries 
and a superb virtual library with over 5 million web visitors per year and an 
extensive collection of digital resources. If you would like to make a 
meaningful contribution to the City of Vancouver through this exciting, 
forward-looking position, we would like to hear from you.

This position is within the library’s bargaining unit, CUPE 391. The salary 
range begins at $64,482 with annual increments rising to a maximum of $76,112. 
The library offers a comprehensive benefits package including MSP, extended 
health, dental, pension, and annual vacation of 22 days for professional 
positions. 

Expressions of interest accompanied by a résumé should be submitted by 5:00 pm 
on Friday, March 6, 2015 by ONE of the following methods:

  Mail: Human Resources Department
  Vancouver Public Library
  350 West Georgia Street
  Vancouver, BC V6B 6B1
  OR Email: care...@vpl.ca

Please quote the competition # in the subject line when applying electronically 
and upload your cover letter and resume / CV as one attachment. Ensure your 
application has one of the following file extensions: 

Re: [CODE4LIB] indexing word documents using solr

2015-02-11 Thread Eric Lease Morgan
On Feb 10, 2015, at 11:46 AM, Erik Hatcher erikhatc...@mac.com wrote:

 bin/post -c collection_name /path/to/file.doc

The almost trivial command to index a Word document in Solr, above, is most 
certainly appealing, but I’m wondering about the underlying index’s schema.

Tika makes every effort to extract as much metadata from Word documents as 
possible. This metadata includes dates, titles, authors, names of applications, 
last edit, etc. Some of this data can be very useful. The metadata can be 
packaged up as an XML file/stream and then sent to Solr for indexing. “Tastes 
great. Less filling.” But my question is, “To what degree does Solr know what 
to do with the metadata when the (kewl) command, above, is seemingly so 
generic? Does one need to create a Solr schema to specifically accommodate the 
Tika-created metadata, or do such things also come for ‘free’?”
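
Put still another way, is something like the following the intended pattern? This is 
only a guess on my part: Solr's ExtractingRequestHandler with its uprefix and fmap 
parameters, so whatever metadata Tika emits lands in dynamic attr_* fields (the id, 
the collection name, and the file path below are all placeholders):

  # a guess, not a recipe: post one Word file to the extract handler, keeping Tika's metadata
  curl "http://localhost:8983/solr/collection_name/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=text&commit=true" \
    -F "myfile=@/path/to/file.doc"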

— 
Eric Morgan


[CODE4LIB] indexing word documents using solr

2015-02-10 Thread Eric Lease Morgan
Can somebody point me to a good tutorial on how to index Word documents using 
Solr?

I have a few hundred Microsoft Word documents I want to search. Through the use 
of the Tika library it seems as if I ought to be able to index my Word 
documents directly into Solr, but none of the tutorials I have found on the Web 
are complete. Missing directories. Missing files. Documentation for versions 
unreleased. Etc.

Put another way, Tika can create a (nice) XHTML file complete with some useful 
metadata that can all be fed to Solr for indexing, but I can barely get out of 
the starting gate. Have you indexed Word documents using Solr, and if so, then 
how? 
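
To make the question more concrete, here is the sort of two-step process I 
have been fumbling with so far. It is only a sketch, and it assumes a locally 
downloaded copy of tika-app.jar:

  # have Tika render a Word document as XHTML, and dump its metadata too
  java -jar tika-app.jar -x /path/to/file.doc > file.xhtml
  java -jar tika-app.jar -m /path/to/file.doc > file.metadata

Getting from those two outputs into a Solr index is the step where all of the 
tutorials seem to fall down.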

—
Eric Morgan


[CODE4LIB] joy

2015-01-27 Thread Eric Lease Morgan
  It is a joy to manage this mailing list, and I say that with all sincerity. 
—Eric Morgan


Re: [CODE4LIB] circulation statistics

2015-01-15 Thread Eric Lease Morgan
  The replies received have all been very helpful. Thank you! —Eric M. 


[CODE4LIB] circulation statistics

2015-01-13 Thread Eric Lease Morgan
Does anybody here know how to extract circulation statistics from a library 
catalog? Specifically, given a date range, are you able to create a list of the 
most frequently borrowed books ordered by the number of times they’ve been 
circulated?

I have a colleague who wants to digitize sets of modern literature and then do 
text analysis against the result. In an effort to do the analysis against 
popular literature, he wants to create a list of… popular titles. Getting a 
list of such a thing from library circulation statistics sounds like a logical 
option to me. 

Does somebody here know how to do this? If you know how to do it against Ex 
Libris’s Aleph, then that is a bonus. 

—
Eric Morgan


Re: [CODE4LIB] lita

2015-01-05 Thread Eric Lease Morgan
 I’m curious, how large is LITA (Library and Information Technology
 Association)? [0] How many members does it have?
 
 Apparently it has around 3000 members this year. I found this on the ALA
 membership statistics page:
 
 http://www.ala.org/membership/membershipstats_files/divisionstats#lita


Interesting and thank you. Code4Lib only needs fifty more subscribers to equal 
LITA’s size. I think this just goes to show, with the advent of the Internet, 
centralized authorities are not as necessary/useful as they once used to be. 
—ELM


Re: [CODE4LIB] lita

2015-01-05 Thread Eric Lease Morgan
On Jan 5, 2015, at 11:25 AM, Sylvain Machefert smachef...@u-bordeaux3.fr 
wrote:

 Interesting and thank you. Code4Lib only needs fifty more subscribers to 
 equal LITA’s size. I think this just goes to show, with the advent of the 
 Internet, centralized authorities are not as necessary/useful as they once 
 used to be. —ELM
 
 For a list created more than 10 years ago, can we trust the number of 
 subscribers figure ? How many dead addresses ? (not saying that number of 
 members of an association == active members, sure).


There are zero dead mailing list addresses because the LISTSERV software prunes 
such things on a daily basis. Yes, we can trust the number of subscribers, but 
that does not mean all of the subscribers actively participate in the 
community. —ELM


Re: [CODE4LIB] PBCore RDF Ontology Hackathon Wiki page

2015-01-05 Thread Eric Lease Morgan
On Jan 5, 2015, at 1:35 PM, Karen Coyle li...@kcoyle.net wrote:

 1) Everyone should read at least the first chapters of the Allemang book, 
 Semantic Web for the Working Ontologist:
 http://www.worldcat.org/title/semantic-web-for-the-working-ontologist-effective-modeling-in-rdfs-and-owl/oclc/73393667

+2 because it is a very good book


 2) Everyone should understand the RDF meaning of classes, properties, domain 
 and range before beginning. (cf: 
 http://kcoyle.blogspot.com/2014/11/classes-in-rdf.html)

+1 for knowing the distinctions between these things, yes


 3) Don't lean too heavily on Protege. Protege is very OWL-oriented and can 
 lead one far astray. It's easy to click on check boxes without knowing what 
 they really mean. Do as much development as you can without using Protege, 
 and do your development in RDFS not OWL. Later you can use Protege to check 
 your work, or to complete the code.

+1 but at the same time workshops are good places to see how things get done in 
a limited period of time.


 4) Develop in ntriples or turtle but NOT rdf/xml. RDF differs from XML in 
 some fundamental ways that are not obvious, and developing in rdf/xml masks 
 these differences and often leads to the development of not very good 
 ontologies.

+1 and -1, because each of the RDF serializations has its own advantages and 
disadvantages.
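
On that last point, it is easy enough to hop between serializations when 
checking one's work. A tiny sketch, assuming the Raptor command-line tool 
(rapper) is installed and a file named ontology.ttl is at hand:

  # round-trip a Turtle file through ntriples as a sanity check
  rapper -i turtle -o ntriples ontology.ttl > ontology.nt
  rapper -i ntriples -o turtle ontology.nt > ontology-check.ttl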


—
Eric Morgan


[CODE4LIB] lita

2015-01-05 Thread Eric Lease Morgan
I’m curious, how large is LITA (Library and Information Technology 
Association)? [0] How many members does it have? 

[0] LITA - http://www.ala.org/lita/

—
ELM


Re: [CODE4LIB] NEC4L

2014-12-24 Thread Eric Lease Morgan
  It is so cool that we have “franchises”. —Eric Morgan


[CODE4LIB] linked data and open access

2014-12-19 Thread Eric Lease Morgan
I don’t know about y’all, but it seems to me that things like linked data and 
open access are larger trends in Europe than here in the United States. Is 
there a larger commitment to sharing in Europe when compared to the United 
States? If so, is this a factor based on the nonexistence of a national library 
in the United States? Is this your perception too? —Eric Morgan


Re: [CODE4LIB] Starting with Virtuoso - tutorial etc.

2014-12-17 Thread Eric Lease Morgan
On Dec 17, 2014, at 9:52 AM, Nicola Carboni nic.carb...@gmail.com wrote:

 I am collecting some resources (beginner level) in order to start using 
 Virtuoso (OpenSource Edition) for a project I am working with. I would like 
 to use it both for hosting triples and for its sponger (CSV to RDF). I 
 sincerely never used it, but I would like to give it try. Do you have some 
 recommendations, like nice books or tutorial (even video) about it?

I have not used Virtuoso extensively, but I have compiled and installed it. It 
was a big but painless compiling process. It seems to me as if Virtuoso is the 
most feature-rich (open source) triple store available. Please consider sharing 
with the group any of your future experiences with it. —Eric Morgan


Re: [CODE4LIB] Starting with Virtuoso - tutorial etc.

2014-12-17 Thread Eric Lease Morgan
On Dec 17, 2014, at 10:10 AM, Mixter,Jeff mixt...@oclc.org wrote:

 If you want to test out a bare-bones triple store, I would suggest 4Store 
 (http://4store.org/). It has pre-compiled installs for Unix and Unix-like 
 systems (although not Windows). It supports SPARQL 1.1 and is relatively easy 
 to tweak/configure.


Regarding 4Store, I concur. 4Store is my SPARQL endpoint for the RDF created 
from archival (EAD and MARC) materials. Regarding the “futurities” of 
Virtuoso, I also agree.
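
For anybody wanting to give it a whirl, standing up a toy 4store endpoint is 
roughly this simple. A sketch, assuming 4store is already installed and the 
knowledge base is to be called code4lib; the details will vary by version and 
platform:

  # create and start a knowledge base, and expose it over HTTP on port 8000
  4s-backend-setup code4lib
  4s-backend code4lib
  4s-httpd -p 8000 code4lib

  # load some RDF, and then ask the endpoint a trivial question
  4s-import code4lib data.rdf
  curl "http://localhost:8000/sparql/" \
       --data-urlencode "query=SELECT * WHERE { ?s ?p ?o } LIMIT 10"

—Eric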


Re: [CODE4LIB] Scanned PDF to text

2014-12-09 Thread Eric Lease Morgan
On Dec 9, 2014, at 8:25 AM, Kyle Banerjee kyle.baner...@gmail.com wrote:

 I've just started a project that involves harvesting large numbers of
 scanned PDF's and extracting information from the text from the OCR output.
 The process I've started with -- use imagemagick to convert to tiff and
 tesseract to pull out the OCR -- is more system intensive than I hoped it
 would be.

I’m not quite sure if I understand the question, but if all you want to do is 
pull the text out of an OCR’ed PDF file, then I have found both Tika and 
PDFtotext to be useful tools. [1, 2] Here’s a Perl script that takes a PDF as 
input and uses Tika to output the OCR’ed text:

  #!/usr/bin/perl

  # configure
  use constant TIKA => 'java -jar tika.jar -T ';

  # require
  use strict;

  # initialize; die early if no file was given
  die "Usage: $0 <file>\n" unless $ARGV[ 0 ];
  my $cmd = TIKA . $ARGV[ 0 ];

  # do the work; Tika writes the extracted text to STDOUT
  system( $cmd ) == 0 or die "Tika failed: $?\n";

  # done
  exit;

Tika can run in a server mode, making it more efficient for extracting the text 
from multiple files. 
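
A sketch of that mode of operation, assuming the tika-server jar has been 
downloaded and is allowed to listen on its default port (9998):

  # start the server once...
  java -jar tika-server.jar &

  # ...and then extract plain text from any number of PDF files
  curl -T file.pdf http://localhost:9998/tika --header "Accept: text/plain"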

On the other hand, if you need to do the OCR itself, then employing Tesseract 
is probably the way to go. 
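
For the record, the pipeline I have in mind is essentially the one described 
above. A rough sketch, where the density setting is only a guess at a sensible 
default:

  # rasterize the PDF one page per TIFF, OCR each page, and glue it back together
  convert -density 300 scanned.pdf page-%03d.tiff
  for PAGE in page-*.tiff; do tesseract "$PAGE" "${PAGE%.tiff}"; done
  cat page-*.txt > scanned.txt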

[1] Tika - http://tika.apache.org
[2] PDFtoText - http://www.foolabs.com/xpdf/download.html

—
ELM


Re: [CODE4LIB] Registration for Code4Lib 2015 in Portland Oregon is NOW OPEN! [airbnb]

2014-12-08 Thread Eric Lease Morgan
On Dec 8, 2014, at 12:57 PM, Dana Jemison dana.jemi...@ucop.edu wrote:

 Looks like the recommended hotel is already filled up.  Are there any other 
 options close by?

Mine is an unsolicited comment/endorsement for AirBnB as an additional source of 
accommodations, if it does not hurt the conference planning process. [1] With 
AirBnB I believe you can get quite a nice place to stay that is larger, more 
hospitable, and less expensive than a hotel.

[1] AirBnB - http://airbnb.com

—
Eric


Re: [CODE4LIB] CrossRef/DOI content-negotiation for metadata lookup?

2014-10-27 Thread Eric Lease Morgan
 On Oct 23, 2014, at 11:45 AM, Joe Hourcle onei...@grace.nascom.nasa.gov 
 wrote:
 
 I found this blog post talking about CrossRef's support:
 
 http://www.crossref.org/CrossTech/2011/04/content_negotiation_for_crossr.html
 
 But I know DataCite supports it to some extent too.
 
 Does anyone know if there's overall registrar-agnostic documentation from 
 DOI for this service?
 
 None that I'm aware of.  We've actually been discussing this issue in a 
 breakout from 'Data Citation Implementors Group', and I think we're currently 
 leaning towards not relying solely on content negotiation, but also using 
 HTTP Link headers or HTML link elements to make it possible to discover the 
 other formats that the metadata may be available.


I don’t know if the following is helpful or not, but CrossRef has implemented 
an API allowing the developer to use content negotiation to retrieve links to 
full text and/or metadata, and I wrote a hack to implement the idea — 
http://blogs.nd.edu/emorgan/2014/06/tdm/
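
For the curious, the content negotiation itself can be exercised with nothing 
more than curl; what comes back depends on the DOI's registrar. Two 
illustrative examples, where the DOI is arbitrary:

  # ask for the metadata behind a DOI as RDF/XML
  curl -L -H "Accept: application/rdf+xml" "http://dx.doi.org/10.1126/science.1157784"

  # or ask for a formatted citation instead
  curl -L -H "Accept: text/x-bibliography; style=apa" "http://dx.doi.org/10.1126/science.1157784"

—Eric Morgan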


Re: [CODE4LIB] Why learn Unix?

2014-10-27 Thread Eric Lease Morgan
Learning Unix is not necessarily the problem to solve. Instead it is means to 
an end. 

To my mind, there are a number of skills and technologies a person needs to know 
in order to provide (digital) library service. Some of those 
skills/technologies include: indexing, content management (databases), 
programming/scripting, HTTP server management, XML manipulation, etc. While 
these technologies exist in a Windows environment, they are oftentimes more 
robust and specifically designed for a Unix (read “Linux”) environment. 

— 
Eric Morgan


Re: [CODE4LIB] elsevier api program

2014-09-25 Thread Eric Lease Morgan
On Jul 14, 2014, at 5:00 PM, Eric Lease Morgan emor...@nd.edu wrote:

 Does anybody here have any experience with the Elsevier API Program? [1]
 
 [1] Elsevier API - http://www.developers.elsevier.com/cms/


I have had tiny success with the Elsevier API Program.

I first created an API key that is/was intended to be used from a particular IP 
address. This allowed me to sometimes download citation and abstract 
information from a few Elsevier-related products using a REST-ful interface, 
but it did not allow me to download full text. Apparently because I was not 
“entitled”. 

Then some sort of contract was signed between the University and Elsevier. This 
contract was apparently returned to Elsevier, and consequently I have been able 
to create an access token for a text mining project as opposed to just an IP 
address. Using this second access token, I am able to programmatically download 
more articles. For example, using curl:

  curl -H "X-ELS-APIKey: secretKey" -H "Accept: text/xml" \
    "http://api.elsevier.com/content/article/PII:S0166361514000207"

The resulting XML includes links to images, and I can get those like this:

  curl -H "X-ELS-APIKey: secretKey" \
    "http://api.elsevier.com/content/object/eid:1-s2.0-S0166361514000207-si2.gif?httpAccept=%2A%2F%2A"
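
For batch work, a minimal sketch, assuming nothing more than a plain-text file 
of PIIs (one per line) and the same secretKey placeholder as above:

  #!/bin/bash
  # loop over a list of PIIs and save the full text of each article as XML
  while read PII; do
    curl -s -H "X-ELS-APIKey: secretKey" -H "Accept: text/xml" \
      "http://api.elsevier.com/content/article/PII:$PII" > "$PII.xml"
    sleep 1  # be polite to the API
  done < piis.txt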

Using this functionality I ought to be able to:

  1. Create a rudimentary one-box, one-button interface to search select 
Elsevier indexes.
  2. Return the results allowing the reader to select items of interest.
  3. Get the selected items and on-the-fly do text mining against the results. 

Yea, sure, in my copious spare time.

I sincerely believe that indexers will catch on to text mining interfaces to 
their content. Search. Get back hundreds and hundreds of hits. Do some sort of 
analysis and visualization against the result. Allow people to USE the content 
as opposed to just getting it. 

Has anybody else had any experience with the Elsevier API?

—
Eric Lease Morgan


Re: [CODE4LIB] Code4Lib 2014 Conference accounting update

2014-09-02 Thread Eric Lease Morgan
  +  --ELM


[CODE4LIB] test of mr. serials

2014-08-30 Thread Eric Lease Morgan
This is a test, and hopefully the only test, of the Mr. Serials Process against 
my Code4Lib mailing list archive. Delete me. —Eric


[CODE4LIB] crowdsourcing consortium for libraries and archives

2014-08-27 Thread Eric Lease Morgan
The following message about the Crowdsourcing Consortium for Libraries and 
Archives project is being forwarded upon request. —ELM

  From: Christina manzo.christ...@gmail.com
  Subject: Please Distribute [Crowdsourcing Consortium for Libraries and 
Archives project]
  Date: August 27, 2014 at 10:15:16 AM EDT

  …this message is about the Crowdsourcing Consortium for Libraries
  and Archives project supported by the IMLS.

  On behalf of our partners at Dartmouth College and Boston Public
  Library, we are pleased to announce the launch of an exciting new
  initiative, funded by the Institute for Museum and Library
  Services (IMLS), that will examine how libraries, archives, and
  museums, can most effectively use crowdsourcing techniques to
  augment their collections and enhance their patrons’ experience!

  This initiative, provisionally entitled the Crowdsourcing
  Consortium for Libraries and Archives (CCLA), will employ a
  series of meetings and webinars to collect, examine, and share
  the most recent, cutting-edge technologies, tools, and platforms
  and accompanying best practices in the field. The goal of the
  CCLA is to create a forum that enables all interested
  stakeholders to join a national conversation about the most
  pressing needs and challenges regarding the development and
  deployment of crowdsourcing technologies in the cultural heritage
  domain.

  As a first step in this process, we want to hear from you!

  The CCLA team invites you to take a short 10-minute survey to
  share your thoughts on the current state of crowdsourcing in
  libraries, museums, and other cultural heritage institutions. [1]
  Your opinions and insights will directly inform the agenda of
  upcoming CCLA activities and events, influence the discourse of
  current and future discussions, and have the potential to
  translate into real-world applications.

  Thank you!

  P.S. To stay informed about upcoming CCLA events, please follow us
  on Twitter: @crowdconsortium

  [1] survey - https://www.surveymonkey.com/s/CGHJL7B


Re: [CODE4LIB] Hiring strategy for a library programmer with tight budget - thoughts? [out of context]

2014-08-15 Thread Eric Lease Morgan
 ...But there are few programmer projects that would require zero maintenance 
 once finished…

This is a bit out of context, but a Buddhist monk once said, “Software is never 
done. If it were, then it would be called hardware.” —Eric Morgan


[CODE4LIB] doaj and code4lib journal

2014-08-14 Thread Eric Lease Morgan
Albeit a bit late, I very recently learned that the DOAJ is asking journals 
like ours (Code4Lib Journal) to resubmit our application to be in the 
directory. [1] From a Nature article:

  Now, following criticism of its quality-control checks, the
  website [DOAJ] is asking all of the journals in its directory to
  reapply on the basis of stricter criteria. It hopes the move will
  weed out ‘predatory journals’: those that profess to publish
  research openly, often charging fees, but that are either
  outright scams or do not provide the services a scientist would
  expect, such as a minimal standard of peer review or permanent
  archiving. “We all know there has been a lot of fuss about
  questionable publishers,” says Bjørnshauge. [2]

I’m just bringing this to the attention of our current crop of good Code4Lib 
Journal people, in case they hadn’t seen it previously. Others here in the 
crowd may simply want to know.

[1] resubmit - http://doaj.org/application/new
[2] article - http://www.nature.com/news/open-access-website-gets-tough-1.15674

—
Eric Morgan

