Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Eric Hellman
The process by which a URI comes to identify something other than the
stuff you get by resolving it can be mysterious - I've blogged about it
a bit: http://go-to-hellman.blogspot.com/2009/07/illusion-of-internet-identity.html
In the case of worldcat or google, it's fame. If you think a URI
will be usable outside your institution for identification purposes,
and your institution can maintain some sort of identification
machinery as long as the OpenURL is expected to be useful, then it's
fine to use it in rft_id. If you intend the URI to connote identity
only in the context that you're building URLs for, then use rft_dat,
which is there for exactly that purpose.
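
To make the distinction concrete - with hypothetical values modeled on
examples elsewhere in this thread - a globally usable identifier travels as

  rft_id=http://catalog.library.jhu.edu/bib/1234

while a value that only the referrer understands travels as private data,
paired with the referrer's ID:

  rfr_id=info:sid/learn.open.ac.uk:AB123&rft_dat=bib:1234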


On Sep 15, 2009, at 12:17 PM, Jonathan Rochkind wrote:

If it's a URI that is indeed an identifier that "unambiguously  
identifies the referent", as the standard says...   I don't see how  
that's inappropriate in rft_id. Isn't that what it's for?


I mentioned before that I put things like http://catalog.library.jhu.edu/bib/1234 
 in my rft_ids.  Putting http://somewhere.edu/our-purl-server/1234  
in rft_id seems very analogous to me.  Both seem appropriate.


I'm not sure what makes a URI "locally meaningful" or not.  What  
makes http://www.worldcat.org/bibID or http://books.google.com/book?id=foo 
 "globally meaningful" but http://catalog.library.jhu.edu/bib/1234  
or http://somewhere.edu/our-purl-server/1234 "locally meaningful"?   
If it's a URI that is reasonably persistent and unambiguously  
identifies the referent, then it's an identifier and is appropriate  
for rft_id, says me.


Jonathan

Eric Hellman wrote:
I think using locally meaningful ids in rft_id is a misuse and a
mistake. Locally meaningful data should go in rft_dat, accompanied
by rfr_id.


just sayin'

On Sep 15, 2009, at 11:52 AM, Jonathan Rochkind wrote:


I do like Ross's solution, if you really wanna use OpenURL. I'm
much more comfortable with the idea of including a URI based on
your own local service in rft_id, than including any old public
URL in rft_id.


Then at least your link resolver can say "if what's in rft_id  
begins  with (eg)  http://telstar.open.ac.uk/, THEN I know this is  
one of  these purl type things, and I know that sending the user  
to it will  result in a redirect to an end-user-appropriate access  
URL."
Cause that's my concern with putting random URLs in rft_id, that   
there's no way to know if they are intended as end-user- 
appropriate  access URLs or not, and in putting things in rft_id  
that aren't  really good "identifiers" for the referent at all.
But using your  own local service ID, now you really DO have  
something that's  appropriately considered a "persistent  
identifier" for the referent,  AND you have a straightforward way  
to tell when the rft_id of this  context is intended as an access  
URL.


Jonathan



Eric Hellman
President, Gluejar, Inc.
41 Watchung Plaza, #132
Montclair, NJ 07042
USA

e...@hellman.net
http://go-to-hellman.blogspot.com/


Re: [CODE4LIB] indexing pdf files

2009-09-15 Thread Erik Hatcher

Here's a post on how easy it is to send PDF documents to Solr from Java:

  


Not only can you post PDF (and other rich content) files to Solr for
indexing; as shown in that blog entry, you can also extract the text
from such files and have it returned to the client.  This Solr
capability makes the tool chain a bit simpler.
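
The linked URL was stripped by the list archive; in its place, here is a
minimal SolrJ sketch of the same idea. It assumes Solr 1.4 with the
ExtractingRequestHandler mapped at /update/extract, a schema whose unique
key is "id", and a local file named paper.pdf (both names hypothetical):

  import java.io.File;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

  public class PdfIndexer {
      public static void main(String[] args) throws Exception {
          SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
          // Hand the PDF to Solr Cell, which runs it through Tika
          ContentStreamUpdateRequest req =
              new ContentStreamUpdateRequest("/update/extract");
          req.addFile(new File("paper.pdf"));
          req.setParam("literal.id", "paper-1"); // supply the unique key
          req.setParam("commit", "true");
          // setParam("extractOnly", "true") would instead return the
          // extracted text to the client, as described above
          server.request(req);
      }
  }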


Erik


On Sep 15, 2009, at 10:31 AM, Peter Kiraly wrote:


Hi all,

I would like to suggest an API for extracting text (including
highlighted or annotated text) from PDF: iText
(http://www.lowagie.com/iText/). This is a Java API (it has a C# port),
and it helped me a lot when we worked with extraordinary PDF files.
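
For what it's worth, a minimal page-at-a-time extraction sketch against
the iText 2.x API of that era (package names changed in later iText
releases, and journal.pdf is a hypothetical file name):

  import com.lowagie.text.pdf.PdfReader;
  import com.lowagie.text.pdf.parser.PdfTextExtractor;

  public class ExtractText {
      public static void main(String[] args) throws Exception {
          PdfReader reader = new PdfReader("journal.pdf"); // double-layered OCR PDF
          PdfTextExtractor extractor = new PdfTextExtractor(reader);
          // Pages are 1-indexed; extracting page by page suits a
          // one-page-per-screen UI like the one described below
          for (int page = 1; page <= reader.getNumberOfPages(); page++) {
              System.out.println(extractor.getTextFromPage(page));
          }
          reader.close();
      }
  }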

Solr uses Tika (http://lucene.apache.org/tika) for extracting text from
documents, and Tika uses PDFBox (http://incubator.apache.org/pdfbox/)
to extract from PDF files. PDFBox is a great tool for normal PDF files,
but it has (or at least had) some shortcomings I wasn't satisfied with:


- it consumed more memory compared with iText, and couldn't
read files above a given size (the limit was large, about 1 GB, but we
had even larger files)

- it couldn't correctly handle conditional hyphens at the end of
a line
- it had poorer documentation than iText, and its API was also
poorer (by that time Manning had published the iText in Action book).

Our PDF files were double layered (original hi-res image + OCR-ed
text) and several thousand pages long (Hungarian scientific journals,
the diary of the Houses of Parliament from the 19th century, etc.).
We indexed the content with Lucene, and in the UI we showed one page
per screen, so the user didn't need to download the full PDF. We
extracted the table of contents from the PDF as well, and we
implemented it in the web UI, so the user can browse pages according
to the full file's TOC.

This project happened two years ago, so it is possible that a lot has
changed since then.

Király Péter
http://eXtensibleCatalog.org

- Original Message - From: "Mark A. Matienzo"
To: CODE4LIB@LISTSERV.ND.EDU
Sent: Tuesday, September 15, 2009 3:56 PM
Subject: Re: [CODE4LIB] indexing pdf files



Eric,


5. Use pdftotext to extract the OCRed text
  from the PDF and index it along with
  the MyLibrary metadata using Solr. [3, 4]



Have you considered using Solr's ExtractingRequestHandler [1] for the
PDFs? We're using it at NYPL with pretty great success.

[1] http://wiki.apache.org/solr/ExtractingRequestHandler

Mark A. Matienzo
Applications Developer, Digital Experience Group
The New York Public Library


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Erik Hetzner
Hi Owen, all:

This is a very interesting problem.

At Tue, 15 Sep 2009 10:04:09 +0100,
O.Stephens wrote:
> […]
>
> If we look at a website it is pretty difficult to reference it
> without including the URL - it seems to be the only good way of
> describing what you are actually talking about (how many people
> think of websites by 'title', 'author' and 'publisher'?). For me,
> this leads to an immediate confusion between the description of the
> resource and the route of access to it. So, to differentiate I'm
> starting to think of the http URI in a reference like this as a URI,
> but not necessarily a URL. We then need some mechanism to check,
> given a URI, what is the URL.
>
> […]
>
> The problem with the approach (as Nate and Eric mention) is that any
> approach that relies on the URI as a identifier (whether using
> OpenURL or a script) is going to have problems as the same URI could
> be used to identify different resources over time. I think Eric's
> suggestion of using additional information to help differentiate is
> worth looking at, but I suspect that this is going to cause us
> problems - although I'd say that it is likely to cause us much less
> work than the alternative, which is allocating every single
> reference to a web resource used in our course material it's own
> persistent URL.

> […]

I might be misunderstanding you, but I think that you are leaving out
the implicit dimension of time here - when was the URL referenced?
What can we use to represent the tuple <URL, time>, and how do we
retrieve an appropriate representation of this tuple? Is the most
appropriate representation the most recent version of the page,
wherever it may have moved? Or is the most appropriate representation
the page as it existed in the past? I would argue that the most
appropriate representation would be the page as it existed in the
past, not what the page looks like now - but I am biased, because I
work in web archiving.

Unfortunately this is a problem that has not been very well addressed
by the web architecture people, or the web archiving people. The web
architecture people start from the assumption that a URI like
<http://www.bbc.co.uk/> names the same resource, which only varies in its
representation as a function of time, not in its identity as a
resource. The web archives people create closed systems and do not
think about how to store and resolve the tuple <URL, time>.
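
(As a concrete illustration: the Internet Archive's Wayback Machine
encodes the tuple directly in its URLs, along the lines of
http://web.archive.org/web/20090915000000/http://www.bbc.co.uk/ -
timestamp hypothetical - but that convention belongs to one archive
rather than to the general web architecture.)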

I know this doesn’t help with your immediate problem, but I think
these are important issues.

best,
Erik Hetzner
;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3




Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Jonathan Rochkind
If it's a URI that is indeed an identifier that "unambiguously 
identifies the referent", as the standard says...   I don't see how 
that's inappropriate in rft_id. Isn't that what it's for?


I mentioned before that I put things like 
http://catalog.library.jhu.edu/bib/1234 in my rft_ids.  Putting 
http://somewhere.edu/our-purl-server/1234 in rft_id seems very analogous 
to me.  Both seem appropriate.


I'm not sure what makes a URI "locally meaningful" or not.  What makes 
http://www.worldcat.org/bibID or http://books.google.com/book?id=foo 
"globally meaningful" but http://catalog.library.jhu.edu/bib/1234 or 
http://somewhere.edu/our-purl-server/1234 "locally meaningful"?  If it's 
a URI that is reasonably persistent and unambiguously identifies the 
referent, then it's an identifier and is appropriate for rft_id, says me.


Jonathan

Eric Hellman wrote:
I think using locally meaningful ids in rft_id is a misuse and a  
mistake. Locally meaningful data should go in rft_dat, accompanied by
rfr_id


just sayin'

On Sep 15, 2009, at 11:52 AM, Jonathan Rochkind wrote:

  
I do like Ross's solution, if you really wanna use OpenURL. I'm much  
more comfortable with the idea of including a URI based on your own  
local service in rft_id, than including any old public URL in rft_id.


Then at least your link resolver can say "if what's in rft_id begins  
with (eg)  http://telstar.open.ac.uk/, THEN I know this is one of  
these purl type things, and I know that sending the user to it will  
result in a redirect to an end-user-appropriate access URL."
Cause that's my concern with putting random URLs in rft_id, that  
there's no way to know if they are intended as end-user-appropriate  
access URLs or not, and in putting things in rft_id that aren't  
really good "identifiers" for the referent at all.   But using your  
own local service ID, now you really DO have something that's  
appropriately considered a "persistent identifier" for the referent,  
AND you have a straightforward way to tell when the rft_id of this  
context is intended as an access URL.


Jonathan




Eric Hellman
President, Gluejar, Inc.
41 Watchung Plaza, #132
Montclair, NJ 07042
USA

e...@hellman.net
http://go-to-hellman.blogspot.com/

  


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Ross Singer
On Tue, Sep 15, 2009 at 12:06 PM, Eric Hellman  wrote:
> Yes, you can.
>

In this case, I say punt on dc.identifier, throw the URL in rft_id
(since, Eric, you had some concern regarding using the local id for
this?) and let the real URL persistence/resolution work happen with
the by-ref negotiation.

-Ross.

> On Sep 15, 2009, at 11:41 AM, Ross Singer wrote:
>>
>> I can't remember if you can include both metadata-by-reference keys
>> and metadata-by-value, but you could have by-reference
>> (&rft_ref=http://telstar.open.ac.uk/1234&rft_ref_fmt=RIS or something)
>> point at your citation db to return a formatted citation.
>
> Eric Hellman
> President, Gluejar, Inc.
> 41 Watchung Plaza, #132
> Montclair, NJ 07042
> USA
>
> e...@hellman.net
> http://go-to-hellman.blogspot.com/
>


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Eric Hellman

Yes, you can.

On Sep 15, 2009, at 11:41 AM, Ross Singer wrote:

I can't remember if you can include both metadata-by-reference keys
and metadata-by-value, but you could have by-reference
(&rft_ref=http://telstar.open.ac.uk/1234&rft_ref_fmt=RIS or something)
point at your citation db to return a formatted citation.


Eric Hellman
President, Gluejar, Inc.
41 Watchung Plaza, #132
Montclair, NJ 07042
USA

e...@hellman.net
http://go-to-hellman.blogspot.com/


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Eric Hellman
I think using locally meaningful ids in rft_id is a misuse and a  
mistake. Locally meaningful data should go in rft_dat, accompanied by
rfr_id


just sayin'

On Sep 15, 2009, at 11:52 AM, Jonathan Rochkind wrote:

I do like Ross's solution, if you really wanna use OpenURL. I'm much  
more comfortable with the idea of including a URI based on your own  
local service in rft_id, than including any old public URL in rft_id.


Then at least your link resolver can say "if what's in rft_id begins  
with (eg)  http://telstar.open.ac.uk/, THEN I know this is one of  
these purl type things, and I know that sending the user to it will  
result in a redirect to an end-user-appropriate access URL."
Cause that's my concern with putting random URLs in rft_id, that  
there's no way to know if they are intended as end-user-appropriate  
access URLs or not, and in putting things in rft_id that aren't  
really good "identifiers" for the referent at all.   But using your  
own local service ID, now you really DO have something that's  
appropriately considered a "persistent identifier" for the referent,  
AND you have a straightforward way to tell when the rft_id of this  
context is intended as an access URL.


Jonathan



Eric Hellman
President, Gluejar, Inc.
41 Watchung Plaza, #132
Montclair, NJ 07042
USA

e...@hellman.net
http://go-to-hellman.blogspot.com/


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread O.Stephens
I'm thinking about it :)

Logically I think we can avoid this as we have the context based on the rfr_id 
(for which we are proposing)

rfr_id=info:sid/learn.open.ac.uk:[course code] (at the risk of more comment!)

Which seems to me equivalent. I guess it is just a matter of where you do the 
work, since in SFX we'll end up constructing a 'fetch' to the same location 
anyway. The amount of work involved to change it one way or the other is 
probably trivial though.

I'm not sure I agree that what I'm proposing puts 'random' URLs in the rft_id, 
although I do accept that this is a moot point if other resolvers don't do 
something useful with them (or worse, make incorrect assumptions about them) - 
perhaps this is something I could survey as part of the project... (although 
it's all moot if we are only doing this within an internal environment and
no-one else ever does it!)

Owen

Owen Stephens
TELSTAR Project Manager
Library and Learning Resources Centre
The Open University
Walton Hall
Milton Keynes, MK7 6AA

T: +44 (0) 1908 858701
F: +44 (0) 1908 653571
E: o.steph...@open.ac.uk


> -Original Message-
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
> Behalf Of Jonathan Rochkind
> Sent: 15 September 2009 16:52
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources
>
> I do like Ross's solution, if you really wanna use OpenURL.
> I'm much more comfortable with the idea of including a URI
> based on your own local service in rft_id, than including any
> old public URL in rft_id.
>
> Then at least your link resolver can say "if what's in rft_id
> begins with (eg)  http://telstar.open.ac.uk/, THEN I know
> this is one of these purl type things, and I know that
> sending the user to it will result in a redirect to an
> end-user-appropriate access URL."
>
> Cause that's my concern with putting random URLs in rft_id,
> that there's no way to know if they are intended as
> end-user-appropriate access URLs or not, and in putting
> things in rft_id that aren't really good
> "identifiers" for the referent at all.   But using your own local
> service ID, now you really DO have something that's
> appropriately considered a "persistent identifier" for the
> referent, AND you have a straightforward way to tell when the
> rft_id of this context is intended as an access URL.
>
> Jonathan
>
> Ross Singer wrote:
> > Oh yeah, one thing I left off --
> >
> > In Moodle, it would probably make sense to link to the URL in the
> > <a> tag: <a href="http://bbc.co.uk/">The Beeb!</a> but use a javascript
> > onMouseDown action to rewrite the link to route through your funky
> > link resolver path, a la Google.
> >
> > That way, the page works like any normal webpage, "right mouse
> > click->Copy Link Location" gives the user the "real" URL to copy and
> > paste, but normal behavior funnels through the link resolver.
> >
> > -Ross.
> >
> > On Tue, Sep 15, 2009 at 11:41 AM, Ross Singer
>  wrote:
> >
> >> Given that the burden of creating these links is entirely
> on RefWorks
> >> & Telstar, OpenURL seems as good a choice as anything
> (since anything
> >> would require some other service, anyway).  As long as the profs
> >> aren't expected to mess with it, I'm not sure that *how*
> you do the
> >> indirection matters all that much and, as you say, there are added
> >> bonuses to keeping it within SFX.
> >>
> >> It seems to me, though, that your rft_id should be a URI to the db
> >> you're using to store their references, so your CTX would look
> >> something like:
> >>
> >> http://res.open.ac.uk/?rfr_id=info:/telstar.open.ac.uk&rft_id=http://telstar.open.ac.uk/1234&dc.identifier=http://bbc.uk.co/
> >> # not url encoded because I have, you know, a life.
> >>
> >> I can't remember if you can include both
> metadata-by-reference keys
> >> and metadata-by-value, but you could have by-reference
> >> (&rft_ref=http://telstar.open.ac.uk/1234&rft_ref_fmt=RIS or
> >> something) point at your citation db to return a formatted
> citation.
> >>
> >> This way your citations are unique -- somebody pointing at today's
> >> London Times frontpage isn't the same as somebody else's on a
> >> different day.
> >>
> >> While I'm shocked that I agree with using OpenURL for
> this, it seems
> >> as reasonable as any other solution.  That being said,
> unless you can
> >> definitely offer some other service besides linking to the
> resource,
> >> I'd avoid the resolver menu completely.
> >>
> >> -Ross.
> >>
> >> On Tue, Sep 15, 2009 at 11:17 AM, O.Stephens
>  wrote:
> >>
> >>> Ross - no you didn't miss it,
> >>>
> >>> There are 3 ways that references might be added to the
> learning environment:
> >>>
> >>> An author (or realistically a proxy on behalf of the
> author) can insert a reference into a structured Word
> document from an RIS file. This structured document (XML)
> then goes through a 'publication' process which pushes the
> content to the learning environment (Moodle), including
> r

[CODE4LIB] Fall Internships at WGBH Media Library & Archives

2009-09-15 Thread Courtney Michael
Greetings colleagues! We have two opportunities for 2-3 interns at the WGBH 
Media Library & Archives! Please forgive the cross postings and do not respond 
to me, but send a resume and a statement of interest by email to: 
human_resour...@wgbh.org or by mail to:

WGBH Educational Foundation
Human Resources Department
One Guest Street
Boston, MA 02135

Please forward to any interested parties!
Thank you!
Courtney.

Digital Library Projects Internship
http://careers.wgbh.org/internships/internships/mla_digital_library.html
The WGBH Media Library & Archives has opportunities for undergraduate and 
graduate students to work in a film and media production archives. Come and 
learn what happens to all the materials that went into that FRONTLINE you saw 
after it aired on TV. Digital library interns will work with the Project 
Manager and Production Assistant to make archival media materials accessible 
online for two ongoing pilot projects, the CPB American Archive project and the 
Mellon Digital Library project. The CPB American Archive project will focus on 
Civil Rights Movement content. Funded by CPB, the American Archive will 
eventually be a national archive of PBS media materials. The Mellon Digital 
Library project uses foreign policy and the history of science content, and 
focuses on scholarly use of archival media material online. Interns will get 
hands-on experience preparing archival media for web access by digitizing 
materials, applying metadata, and encoding transcripts. This is an opportunity 
to learn moving image digitization for preservation and access, the PBCore 
metadata schema (pbcore.org) and the TEI XML schema (tei-c.org/).

Electronic Records Internship
http://careers.wgbh.org/internships/internships/mla_records.html
The WGBH Media Library & Archives has opportunities for undergraduate and 
graduate students to work in a film and media production archives. Come and use 
your electronic records management knowledge in a real world setting. The Electronic 
Records Management interns will work with both the Program Shutdown Manager and 
the Digital Archives Manager. They will review electronic original interview 
transcripts that have been delivered to the Media Library and Archives by 
productions (such as Frontline or Nova) to standardize names, and correct any 
inconsistencies. This may require some research skills to identify exactly who 
a particular interviewee is and, where applicable, the position held at the 
time of the interview. This will require embedding the interviewee information 
within the document header and linking the transcript back to the physical tape 
holdings. The position will work to standardize naming conventions for 
interview transcripts, and create a suitable electronic workflow, prior to 
upload to the WGBH digital asset management system. Training will be given in 
this Artesia-based application. The position requires excellent skills in 
reviewing and correcting metadata. Familiarity with online search 
engines, Library of Congress Authorities and other online resources is 
recommended, as is attention to detail.


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Jonathan Rochkind
I do like Ross's solution, if you really wanna use OpenURL. I'm much 
more comfortable with the idea of including a URI based on your own 
local service in rft_id, than including any old public URL in rft_id.


Then at least your link resolver can say "if what's in rft_id begins 
with (eg)  http://telstar.open.ac.uk/, THEN I know this is one of these 
purl type things, and I know that sending the user to it will result in 
a redirect to an end-user-appropriate access URL." 

Cause that's my concern with putting random URLs in rft_id, that there's 
no way to know if they are intended as end-user-appropriate access URLs 
or not, and in putting things in rft_id that aren't really good 
"identifiers" for the referent at all.   But using your own local 
service ID, now you really DO have something that's appropriately 
considered a "persistent identifier" for the referent, AND you have a 
straightforward way to tell when the rft_id of this context is intended 
as an access URL.


Jonathan

Ross Singer wrote:

Oh yeah, one thing I left off --

In Moodle, it would probably make sense to link to the URL in the <a> tag:
<a href="http://bbc.co.uk/">The Beeb!</a>
but use a javascript onMouseDown action to rewrite the link to route
through your funky link resolver path, a la Google.

That way, the page works like any normal webpage, "right mouse
click->Copy Link Location" gives the user the "real" URL to copy and
paste, but normal behavior funnels through the link resolver.

-Ross.

On Tue, Sep 15, 2009 at 11:41 AM, Ross Singer  wrote:
  

Given that the burden of creating these links is entirely on RefWorks
& Telstar, OpenURL seems as good a choice as anything (since anything
would require some other service, anyway).  As long as the profs
aren't expected to mess with it, I'm not sure that *how* you do the
indirection matters all that much and, as you say, there are added
bonuses to keeping it within SFX.

It seems to me, though, that your rft_id should be a URI to the db
you're using to store their references, so your CTX would look
something like:

http://res.open.ac.uk/?rfr_id=info:/telstar.open.ac.uk&rft_id=http://telstar.open.ac.uk/1234&dc.identifier=http://bbc.uk.co/
# not url encoded because I have, you know, a life.

I can't remember if you can include both metadata-by-reference keys
and metadata-by-value, but you could have by-reference
(&rft_ref=http://telstar.open.ac.uk/1234&rft_ref_fmt=RIS or something)
point at your citation db to return a formatted citation.

This way your citations are unique -- somebody pointing at today's
London Times frontpage isn't the same as somebody else's on a
different day.

While I'm shocked that I agree with using OpenURL for this, it seems
as reasonable as any other solution.  That being said, unless you can
definitely offer some other service besides linking to the resource,
I'd avoid the resolver menu completely.

-Ross.

On Tue, Sep 15, 2009 at 11:17 AM, O.Stephens  wrote:


Ross - no you didn't miss it,

There are 3 ways that references might be added to the learning environment:

An author (or realistically a proxy on behalf of the author) can insert a 
reference into a structured Word document from an RIS file. This structured 
document (XML) then goes through a 'publication' process which pushes the 
content to the learning environment (Moodle), including rendering the 
references from RIS format into a specified style, with links.
An author/librarian/other can import references to a 'resources' area in our 
learning environment (Moodle) from a RIS file
An author/librarian/other can subscribe to an RSS feed from a RefWorks 
'RefShare' folder within the 'resources' area of the learning environment

In general the project is focussing on the use of RefWorks - so although the 
RIS files could be created by any suitable s/w, we are looking specifically at 
RefWorks.

How you get the reference into RefWorks is something we are looking at 
currently. The best approach varies depending on the type of material you are 
looking at:

For websites it looks like the 'RefGrab-it' bookmarklet/browser plugin 
(depending on your browser) is the easiest way of capturing website details.
For books, probably a Union catalogue search from within RefWorks
For journal articles, probably a Federated search engine (SS 360 is what we've 
got)
Any of these could be entered by hand of course, as could several other kinds 
of reference

Entering the references into RefWorks could be done by an author, but it is more 
likely to be done by a member of clerical staff or a librarian/library assistant.

Owen

Owen Stephens
TELSTAR Project Manager
Library and Learning Resources Centre
The Open University
Walton Hall
Milton Keynes, MK7 6AA

T: +44 (0) 1908 858701
F: +44 (0) 1908 653571
E: o.steph...@open.ac.uk


  

-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
Behalf Of Ross Singer
Sent: 15 September 2009 15:56
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Imple

Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread O.Stephens
Thanks Ross - very helpful

> That being
> said, unless you can definitely offer some other service
> besides linking to the resource, I'd avoid the resolver menu
> completely.

Agreed - this is generally envisaged as a fulltext service. I can see some 
benefit to other options - the Open University is almost 100% distance 
learning, so there may be an argument for pushing students to local library 
services, but in the first instance the idea is that links are to full-text or 
nothing.

We are also planning to include options to not display a link at all, or to use 
the link within the reference directly, bypassing the OpenURL method I'm 
proposing.

> In Moodle, it would probably make sense to link to the URL in
> the <a> tag: <a href="http://bbc.co.uk/">The Beeb!</a> but use a
> javascript onMouseDown action to rewrite the link to route
> through your funky link resolver path, a la Google.

I'll pass that on to the Moodle devs - the only issue I can see with this is 
that copy and paste is such a generic action it's going to be difficult to know 
whether one behaviour or the other is the best way to go. I've got some student 
focus groups coming up so maybe I can explore this with them (if I've got 
time - there is a lot to talk to them about!)

Owen


Owen Stephens
TELSTAR Project Manager
Library and Learning Resources Centre
The Open University
Walton Hall
Milton Keynes, MK7 6AA

T: +44 (0) 1908 858701
F: +44 (0) 1908 653571
E: o.steph...@open.ac.uk


> -Original Message-
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
> Behalf Of Ross Singer
> Sent: 15 September 2009 16:42
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources
>
> Given that the burden of creating these links is entirely on
> RefWorks & Telstar, OpenURL seems as good a choice as
> anything (since anything would require some other service,
> anyway).  As long as the profs aren't expected to mess with
> it, I'm not sure that *how* you do the indirection matters
> all that much and, as you say, there are added bonuses to
> keeping it within SFX.
>
> It seems to me, though, that your rft_id should be a URI to
> the db you're using to store their references, so your CTX
> would look something like:
>
> http://res.open.ac.uk/?rfr_id=info:/telstar.open.ac.uk&rft_id=
> http://telstar.open.ac.uk/1234&dc.identifier=http://bbc.uk.co/
> # not url encoded because I have, you know, a life.
>
> I can't remember if you can include both
> metadata-by-reference keys and metadata-by-value, but you
> could have by-reference
> (&rft_ref=http://telstar.open.ac.uk/1234&rft_ref_fmt=RIS or
> something) point at your citation db to return a formatted citation.
>
> This way your citations are unique -- somebody pointing at
> today's London Times frontpage isn't the same as somebody
> else's on a different day.
>
> While I'm shocked that I agree with using OpenURL for this,
> it seems as reasonable as any other solution.  That being
> said, unless you can definitely offer some other service
> besides linking to the resource, I'd avoid the resolver menu
> completely.
>
> -Ross.
>
> On Tue, Sep 15, 2009 at 11:17 AM, O.Stephens
>  wrote:
> > Ross - no you didn't miss it,
> >
> > There are 3 ways that references might be added to the
> learning environment:
> >
> > An author (or realistically a proxy on behalf of the
> author) can insert a reference into a structured Word
> document from an RIS file. This structured document (XML)
> then goes through a 'publication' process which pushes the
> content to the learning environment (Moodle), including
> rendering the references from RIS format into a specified
> style, with links.
> > An author/librarian/other can import references to a
> 'resources' area
> > in our learning environment (Moodle) from a RIS file An
> > author/librarian/other can subscribe to an RSS feed from a RefWorks
> > 'RefShare' folder within the 'resources' area of the learning
> > environment
> >
> > In general the project is focussing on the use of RefWorks
> - so although the RIS files could be created by any suitable
> s/w, we are looking specifically at RefWorks.
> >
> > How you get the reference into RefWorks is something we are
> looking at currently. The best approach varies depending on
> the type of material you are looking at:
> >
> > For websites it looks like the 'RefGrab-it'
> bookmarklet/browser plugin (depending on your browser) is the
> easiest way of capturing website details.
> > For books, probably a Union catalogue search from within
> RefWorks For
> > journal articles, probably a Federated search engine (SS
> 360 is what
> > we've got) Any of these could be entered by hand of course,
> as could
> > several other kinds of reference
> >
> > Entering the references into RefWorks could be done by an
> author, but
> > it is more likely to be done by a member of clerical staff or a
> > librarian/library assistant
> >
> > Owen
> >
> > Owen Stephens
> > TELSTAR Project Manage

Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Ross Singer
Oh yeah, one thing I left off --

In Moodle, it would probably make sense to link to the URL in the <a> tag:
<a href="http://bbc.co.uk/">The Beeb!</a>
but use a javascript onMouseDown action to rewrite the link to route
through your funky link resolver path, a la Google.

That way, the page works like any normal webpage, "right mouse
click->Copy Link Location" gives the user the "real" URL to copy and
paste, but normal behavior funnels through the link resolver.

-Ross.

On Tue, Sep 15, 2009 at 11:41 AM, Ross Singer  wrote:
> Given that the burden of creating these links is entirely on RefWorks
> & Telstar, OpenURL seems as good a choice as anything (since anything
> would require some other service, anyway).  As long as the profs
> aren't expected to mess with it, I'm not sure that *how* you do the
> indirection matters all that much and, as you say, there are added
> bonuses to keeping it within SFX.
>
> It seems to me, though, that your rft_id should be a URI to the db
> you're using to store their references, so your CTX would look
> something like:
>
> http://res.open.ac.uk/?rfr_id=info:/telstar.open.ac.uk&rft_id=http://telstar.open.ac.uk/1234&dc.identifier=http://bbc.uk.co/
> # not url encoded because I have, you know, a life.
>
> I can't remember if you can include both metadata-by-reference keys
> and metadata-by-value, but you could have by-reference
> (&rft_ref=http://telstar.open.ac.uk/1234&rft_ref_fmt=RIS or something)
> point at your citation db to return a formatted citation.
>
> This way your citations are unique -- somebody pointing at today's
> London Times frontpage isn't the same as somebody else's on a
> different day.
>
> While I'm shocked that I agree with using OpenURL for this, it seems
> as reasonable as any other solution.  That being said, unless you can
> definitely offer some other service besides linking to the resource,
> I'd avoid the resolver menu completely.
>
> -Ross.
>
> On Tue, Sep 15, 2009 at 11:17 AM, O.Stephens  wrote:
>> Ross - no you didn't miss it,
>>
>> There are 3 ways that references might be added to the learning environment:
>>
>> An author (or realistically a proxy on behalf of the author) can insert a 
>> reference into a structured Word document from an RIS file. This structured 
>> document (XML) then goes through a 'publication' process which pushes the 
>> content to the learning environment (Moodle), including rendering the 
>> references from RIS format into a specified style, with links.
>> An author/librarian/other can import references to a 'resources' area in our 
>> learning environment (Moodle) from a RIS file
>> An author/librarian/other can subscribe to an RSS feed from a RefWorks 
>> 'RefShare' folder within the 'resources' area of the learning environment
>>
>> In general the project is focussing on the use of RefWorks - so although the 
>> RIS files could be created by any suitable s/w, we are looking specifically 
>> at RefWorks.
>>
>> How you get the reference into RefWorks is something we are looking at 
>> currently. The best approach varies depending on the type of material you 
>> are looking at:
>>
>> For websites it looks like the 'RefGrab-it' bookmarklet/browser plugin 
>> (depending on your browser) is the easiest way of capturing website details.
>> For books, probably a Union catalogue search from within RefWorks
>> For journal articles, probably a Federated search engine (SS 360 is what 
>> we've got)
>> Any of these could be entered by hand of course, as could several other 
>> kinds of reference
>>
>> Entering the references into RefWorks could be done by an author, but it is 
>> more likely to be done by a member of clerical staff or a librarian/library 
>> assistant
>>
>> Owen
>>
>> Owen Stephens
>> TELSTAR Project Manager
>> Library and Learning Resources Centre
>> The Open University
>> Walton Hall
>> Milton Keynes, MK7 6AA
>>
>> T: +44 (0) 1908 858701
>> F: +44 (0) 1908 653571
>> E: o.steph...@open.ac.uk
>>
>>
>>> -Original Message-
>>> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
>>> Behalf Of Ross Singer
>>> Sent: 15 September 2009 15:56
>>> To: CODE4LIB@LISTSERV.ND.EDU
>>> Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources
>>>
>>> Owen, I might have missed it in this message -- my eyes are
>>> starting to glaze over at this point in the thread, but can you
>>> describe how the input of these resources would work?
>>>
>>> What I'm basically asking is -- what would the professor need
>>> to do to add a new:  citation for a 70 year old book; journal
>>> on PubMed; URL to CiteSeer?
>>>
>>> How does their input make it into your database?
>>>
>>> -Ross.
>>>
>>> On Tue, Sep 15, 2009 at 5:04 AM, O.Stephens
>>>  wrote:
>>> >>True. How, from the OpenURL, are you going to know that the rft is
>>> >>meant to represent a website?
>>> > I guess that was part of my question. But no one has suggested
>>> > defining a new metadata profile for websites (which I
>>> probably would
>>> > avoid tbh). DC doesn't see

Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Ross Singer
Given that the burden of creating these links is entirely on RefWorks
& Telstar, OpenURL seems as good a choice as anything (since anything
would require some other service, anyway).  As long as the profs
aren't expected to mess with it, I'm not sure that *how* you do the
indirection matters all that much and, as you say, there are added
bonuses to keeping it within SFX.

It seems to me, though, that your rft_id should be a URI to the db
you're using to store their references, so your CTX would look
something like:

http://res.open.ac.uk/?rfr_id=info:/telstar.open.ac.uk&rft_id=http://telstar.open.ac.uk/1234&dc.identifier=http://bbc.uk.co/
# not url encoded because I have, you know, a life.

I can't remember if you can include both metadata-by-reference keys
and metadata-by-value, but you could have by-reference
(&rft_ref=http://telstar.open.ac.uk/1234&rft_ref_fmt=RIS or something)
point at your citation db to return a formatted citation.

This way your citations are unique -- somebody pointing at today's
London Times frontpage isn't the same as somebody else's on a
different day.

While I'm shocked that I agree with using OpenURL for this, it seems
as reasonable as any other solution.  That being said, unless you can
definitely offer some other service besides linking to the resource,
I'd avoid the resolver menu completely.

-Ross.

On Tue, Sep 15, 2009 at 11:17 AM, O.Stephens  wrote:
> Ross - no you didn't miss it,
>
> There are 3 ways that references might be added to the learning environment:
>
> An author (or realistically a proxy on behalf of the author) can insert a 
> reference into a structured Word document from an RIS file. This structured 
> document (XML) then goes through a 'publication' process which pushes the 
> content to the learning environment (Moodle), including rendering the 
> references from RIS format into a specified style, with links.
> An author/librarian/other can import references to a 'resources' area in our 
> learning environment (Moodle) from a RIS file
> An author/librarian/other can subscribe to an RSS feed from a RefWorks 
> 'RefShare' folder within the 'resources' area of the learning environment
>
> In general the project is focussing on the use of RefWorks - so although the 
> RIS files could be created by any suitable s/w, we are looking specifically 
> at RefWorks.
>
> How you get the reference into RefWorks is something we are looking at 
> currently. The best approach varies depending on the type of material you are 
> looking at:
>
> For websites it looks like the 'RefGrab-it' bookmarklet/browser plugin 
> (depending on your browser) is the easiest way of capturing website details.
> For books, probably a Union catalogue search from within RefWorks
> For journal articles, probably a Federated search engine (SS 360 is what 
> we've got)
> Any of these could be entered by hand of course, as could several other kinds 
> of reference
>
> Entering the references into RefWorks could be done by an author, but it is more 
> likely to be done by a member of clerical staff or a librarian/library 
> assistant
>
> Owen
>
> Owen Stephens
> TELSTAR Project Manager
> Library and Learning Resources Centre
> The Open University
> Walton Hall
> Milton Keynes, MK7 6AA
>
> T: +44 (0) 1908 858701
> F: +44 (0) 1908 653571
> E: o.steph...@open.ac.uk
>
>
>> -Original Message-
>> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
>> Behalf Of Ross Singer
>> Sent: 15 September 2009 15:56
>> To: CODE4LIB@LISTSERV.ND.EDU
>> Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources
>>
>> Owen, I might have missed it in this message -- my eyes are
>> starting to glaze over at this point in the thread, but can you
>> describe how the input of these resources would work?
>>
>> What I'm basically asking is -- what would the professor need
>> to do to add a new:  citation for a 70 year old book; journal
>> on PubMed; URL to CiteSeer?
>>
>> How does their input make it into your database?
>>
>> -Ross.
>>
>> On Tue, Sep 15, 2009 at 5:04 AM, O.Stephens
>>  wrote:
>> >>True. How, from the OpenURL, are you going to know that the rft is
>> >>meant to represent a website?
>> > I guess that was part of my question. But no one has suggested
>> > defining a new metadata profile for websites (which I
>> probably would
>> > avoid tbh). DC doesn't seem to offer a nice way of doing
>> this (that is
>> > saying 'this is a website'), although there are perhaps
>> some bits and
>> > pieces (format, type) that could be used to give some
>> indication (but
>> > I suspect not unambiguously)
>> >
>> >>But I still think what you want is simply a purl server. What makes
>> >>you think you want OpenURL in the first place?  But I still don't
>> >>really understand what you're trying to do: "deliver consistency of
>> >>approach across all our references" -- so are you using OpenURL for
>> >>it's more "conventional" use too, but you want to tack on a
>> purl-like
>> >>functionality to the same sof

Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread O.Stephens
Do you think? I reckon it is just a few lines of code in a custom source 
parser... Only need to:

Check rft.id contains an http uri (regexp)
Define a fetchID based on this URI (possibly + date/other metadata)
Get a URL or null from a lookup service
Insert URL or rft_id value into rft.856

Simple!
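
To make those four steps concrete, here is a sketch of the logic in Java -
not real SFX source-parser code (SFX parsers are written in Perl), and
LookupService and currentUrlFor are invented names for the 'fetch' service:

  import java.util.regex.Pattern;

  public class WebResourceSourceParser {
      private static final Pattern HTTP_URI = Pattern.compile("^https?://.+");

      interface LookupService {
          String currentUrlFor(String fetchId); // null if no override exists
      }

      public String resolve(String rftId, String courseCode, LookupService lookup) {
          // 1. Check rft_id contains an http URI (regexp)
          if (rftId == null || !HTTP_URI.matcher(rftId).matches()) {
              return null; // not a web resource; normal resolution applies
          }
          // 2. Define a fetch ID based on the URI (plus context, e.g. course code)
          String fetchId = courseCode + "|" + rftId;
          // 3. Get a URL or null from the lookup service
          String url = lookup.currentUrlFor(fetchId);
          // 4. Insert the looked-up URL, or fall back to the rft_id value,
          //    destined for the internal rft.856 field
          return (url != null) ? url : rftId;
      }
  }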

Owen

Owen Stephens
TELSTAR Project Manager
Library and Learning Resources Centre
The Open University
Walton Hall
Milton Keynes, MK7 6AA

T: +44 (0) 1908 858701
F: +44 (0) 1908 653571
E: o.steph...@open.ac.uk


> -Original Message-
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
> Behalf Of Jonathan Rochkind
> Sent: 15 September 2009 16:30
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources
>
> Wait, are you really going to try to do this with _SFX_ too?
>  I missed
> that part. Oh boy. Seriously, I think you are in for a world
> of painful hacky kludge.
>
> Rosalyn Metz wrote:
> > Owen,
> >
> > The reason I suggest a source parser rather than a target parser is
> > that handling the openurl based on the source rather than the target
> > can shave a bit
> > of time off.  Attached is a slide i created (back in the
> day when it
> > was my job to create such slides...no i don't sit around in my hole
> > creating slides because i'm bored...although.) that shows the
> > process an OpenURL goes through.
> >
> > So the source parser in this example would come into play
> before the
> > OpenURL metadata hits the SFX KB.  It would bypass the
> bottom half of
> > the slide completely and reduce any weird formatting that SFX might
> > try to do to the metadata with a value like website (if you
> tell sfx
> > you're looking for an article but you're really looking for
> a book it
> > sometimes ignores metadata unrelated to an article even though you
> > might actually need it).  if you never let it get to that
> point, then
> > you don't need to worry about that "feature" coming into play.
> >
> > Source parsers aren't used as frequently as they once were,
> but they
> > used to be a way to retrieve more metadata from databases
> that didn't
> > create useful openurls (not that many vendors create useful
> openurls
> > now...).  but if you go a hackish route you could use a
> source parser
> > like a redirect rather than using it to fetch more metadata.
> >
> > If none of this makes sense let me know and i can try to
> describe it
> > better off list so as not to bore people into oblivion.
> >
> > Rosalyn
> >
> >
> >
> >
> > On Tue, Sep 15, 2009 at 9:52 AM, O.Stephens
>  wrote:
> >
> >> Thanks Rosalyn,
> >>
> >> As you say we could push a custom value into rfr_genre. I'm a bit
> >> torn on this, as I guess I'm trying to do something that isn't
> >> 'hacky' - or at least not from the OpenURL end of it. It might be
> >> that this is just wishful thinking, and that I'm just
> trying to fool
> >> myself into thinking I'm 'sticking to the standard' when the
> >> likelihood of what I'm doing being transferrable to other
> scenarios
> >> is zero (although Eric's comments make me hope not)
> >>
> >> Yes, we are using SFX. What I'm proposing on the SFX end
> as the path of least resistance is writing a source parser
> for our learning environment which can do a 'fetch' for an
> alternative URL, or use the primary URL, and put it in an SFX
> internal field rft_856. We can then use the existing Target
> Parser 856_URL which displays the contents of rft_856 in the
> menu. Combined with some logic which forces this as the only
> option under certain circumstances we can then push the user
> directly to the resulting URL.
> >>
> >> Owen
> >>
> >> Owen Stephens
> >> TELSTAR Project Manager
> >> Library and Learning Resources Centre The Open University
> Walton Hall
> >> Milton Keynes, MK7 6AA
> >>
> >> T: +44 (0) 1908 858701
> >> F: +44 (0) 1908 653571
> >> E: o.steph...@open.ac.uk
> >>
> >>
> >>
> >>> -Original Message-
> >>> From: Code for Libraries
> [mailto:code4...@listserv.nd.edu] On Behalf
> >>> Of Rosalyn Metz
> >>> Sent: 15 September 2009 14:42
> >>> To: CODE4LIB@LISTSERV.ND.EDU
> >>> Subject: Re: [CODE4LIB] Implementing OpenURL for simple web
> >>> resources
> >>>
> >>> you could force a timestamp if people don't include a date.
> >>>
> >>> and I like the idea of going to the Internet Archive of a
> website,
> >>> because then you're not having to get into the business
> of handling
> >>> www.bbc.co.uk differently than cnn.com and someblog.org.
> >>>
> >>> i also like the idea of using a redirect.  you could
> theoretically
> >>> write a source parser (i'm assuming youre using SFX based on what
> >>> you said about bX) that says if my rfr_id = mylocalid and
> the item
> >>> is a website (however you choose to identify the
> website...which if
> >>> you're writing your own source parser you could put
> website in the
> >>> rft_genre even though its not technically a metadata
> format but you
> >>> just want your source parser to forward the url on anyway, so 

Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Jonathan Rochkind

O.Stephens wrote:

Thanks Rosalyn,

As you say we could push a custom value into rfr_genre. I'm a bit torn on this, 
as I guess I'm trying to do something that isn't 'hacky' - or at least not from 
the OpenURL end of it. It might be that this is just wishful thinking, and that 
I'm just trying to fool myself into thinking I'm 'sticking to the standard' 
when the likelihood of what I'm doing being transferrable to other scenarios is 
zero (although Eric's comments make me hope not)

  


Heh, that is my opinion. Everything I've ever tried to do with OpenURL 
that isn't part of the original "0.1" use case has ended up very hacky, 
despite my best efforts.


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Jonathan Rochkind
Wait, are you really going to try to do this with _SFX_ too?   I missed 
that part. Oh boy. Seriously, I think you are in for a world of painful 
hacky kludge.


Rosalyn Metz wrote:

Owen,

The reason I suggest a source parser rather than a target parser is
that handling the openurl based on the source rather than the target can shave a bit
of time off.  Attached is a slide i created (back in the day when it
was my job to create such slides...no i don't sit around in my hole
creating slides because i'm bored...although.) that shows the
process an OpenURL goes through.

So the source parser in this example would come into play before the
OpenURL metadata hits the SFX KB.  It would bypass the bottom half of
the slide completely and reduce any weird formatting that SFX might
try to do to the metadata with a value like website (if you tell sfx
you're looking for an article but you're really looking for a book it
sometimes ignores metadata unrelated to an article even though you
might actually need it).  if you never let it get to that point, then
you don't need to worry about that "feature" coming into play.

Source parsers aren't used as frequently as they once were, but they
used to be a way to retrieve more metadata from databases that didn't
create useful openurls (not that many vendors create useful openurls
now...).  but if you go a hackish route you could use a source parser
like a redirect rather than using it to fetch more metadata.

If none of this makes sense let me know and i can try to describe it
better off list so as not to bore people into oblivion.

Rosalyn




On Tue, Sep 15, 2009 at 9:52 AM, O.Stephens  wrote:
  

Thanks Rosalyn,

As you say we could push a custom value into rfr_genre. I'm a bit torn on this, 
as I guess I'm trying to do something that isn't 'hacky' - or at least not from 
the OpenURL end of it. It might be that this is just wishful thinking, and that 
I'm just trying to fool myself into thinking I'm 'sticking to the standard' 
when the likelihood of what I'm doing being transferrable to other scenarios is 
zero (although Eric's comments make me hope not)

Yes, we are using SFX. What I'm proposing on the SFX end as the path of least 
resistance is writing a source parser for our learning environment which can 
do a 'fetch' for an alternative URL, or use the primary URL, and put it in an 
SFX internal field rft_856. We can then use the existing Target Parser 856_URL 
which displays the contents of rft_856 in the menu. Combined with some logic 
which forces this as the only option under certain circumstances we can then 
push the user directly to the resulting URL.

Owen

Owen Stephens
TELSTAR Project Manager
Library and Learning Resources Centre
The Open University
Walton Hall
Milton Keynes, MK7 6AA

T: +44 (0) 1908 858701
F: +44 (0) 1908 653571
E: o.steph...@open.ac.uk




-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
Behalf Of Rosalyn Metz
Sent: 15 September 2009 14:42
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources

you could force a timestamp if people don't include a date.

and I like the idea of going to the Internet Archive of a
website, because then you're not having to get into the
business of handling www.bbc.co.uk differently than cnn.com
and someblog.org.

i also like the idea of using a redirect.  you could
theoretically write a source parser (i'm assuming youre using
SFX based on what you said about bX) that says if my rfr_id =
mylocalid and the item is a website (however you choose to
identify the website...which if you're writing your own
source parser you could put website in the rft_genre even
though its not technically a metadata format but you just
want your source parser to forward the url on anyway, so the
link resolver isn't actually going to do anything with it)
bypass everything and just direct to the internet archive of
the website.

all of this is of course kind of hackish...but really isn't
the whole thing hackish?  there were a few source parsers
that would be good models for writing something like this.
but i have no idea if they still exist because i haven't
looked at the back end of sfx in about a year.




On Tue, Sep 15, 2009 at 5:12 AM, O.Stephens
 wrote:
  

I agree with this Rosalyn. The issue that Nate brought up


was that the content at http://www.bbc.co.uk could change
over time, and old content might be moved to another URI -
http://archive.bbc.co.uk or something. So if course A
references http://www.bbc.co.uk on 24/08/09, if the content
that was on http://www.bbc.co.uk on 24/08/09 moves to
http://archive.bbc.co.uk we can use the mechanism I propose
to trap the links to http://www.bbc.co.uk and redirect to
http://archive.bbc.co.uk. However, if at a later date course
B references http://www.bbc.co.uk we have no way of knowing
whether they mean the stuff that is currently on
http://www.bbc.co.uk or the stuff that used to be on
http://www.

Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Jonathan Rochkind

O.Stephens wrote:

True. How, from the OpenURL, are you going to know that the rft is meant
to represent a website?


I guess that was part of my question. But no one has suggested defining a new 
metadata profile for websites (which I probably would avoid tbh). DC doesn't 
seem to offer a nice way of doing this (that is saying 'this is a website'), 
although there are perhaps some bits and pieces (format, type) that could be 
used to give some indication (but I suspect not unambiguously)

  


Yeah, I don't think there IS any good way to do this.  Well, wait, okay, 
you could use a DC metadata package, and try to convey "web site" in 
dc.type.   For the OpenURL dc.type it is _recommended_ that you use a term 
from the DCTerms Type vocabulary, but that only lets you say something like 
it's an "InteractiveResource" or "Text" or "Software".   Unless 
"InteractiveResource" is sufficient to convey what you need, you could 
disregard the suggestion (not requirement) that the openurl dc metadata 
schema "type" element contain a DCMI Type vocabulary term, and just put 
something else there: "Website".  If you want to go this route, probably 
mint a URI (perhaps using purl.org) so you can put an actual URI instead 
of a string literal there to represent "Website".
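
For example, something along these lines (values hypothetical, and left
unencoded for readability in the spirit of Ross's earlier example):

  http://resolver.example.edu/?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dc&rft.type=http://purl.org/NET/example/Website&rft.identifier=http://www.bbc.co.uk/

where rft.type carries your minted URI rather than a DCMI Type term.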


Now, you've still wound up with something that is somewhat local/custom, 
that other resolvers are not going to understand. But frankly, I think 
anything you're going to wind up with is something that you aren't going 
to be able to trust arbitrary resolvers in the wild to do anything in 
particular with.  Which may not be a requirement for you anyway.


(Which is why I personally find a new OpenURL metadata format to be a 
complete non-starter.  I don't think OpenURL's "abstract" core actually 
provides much actual practical benefit, a new metadata format might as 
well be an entirely new standard -- for the practical benefit you get 
from it.  Other link resolvers that aren't yours are unlikely to ever do 
anything with your new format, and if they do, whoever implements that 
is going to have almost as much work to do as if it hadn't been OpenURL 
at all. If I wanted a really abstract metadata framework to create a new 
profile/schema on top of, I'd choose DCMI, not OpenURL. DCMI is also so 
abstract that it doesn't make sense to just say "My app can take DCMI" 
(just like it doesn't make any sense to say "my app can take 
OpenURL"--it's all about the profiles/schemas).  But at least DCMI is a 
lot more flexible, and still has an active body of people working on 
maintaining and developing and adopting it.)


Jonathan


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Eric Hellman

A suggestion on how to get a prof to enter a URL.

I use this bookmarklet to add a URL to Hacker News:

javascript:window.location=%22http://news.ycombinator.com/submitlink?u=%22+encodeURIComponent(document.location)+%22&t=%22+encodeURIComponent(document.title)

I'm tempted to suggest an api based on OpenURL, but I fear the 10  
emails it would provoke.


On Sep 15, 2009, at 10:56 AM, Ross Singer wrote:


Owen, I might have missed it in this message -- my eyes are starting to
glaze over at this point in the thread, but can you describe how the
input of these resources would work?

What I'm basically asking is -- what would the professor need to do to
add a new:  citation for a 70 year old book; journal on PubMed; URL to
CiteSeer?

How does their input make it into your database?

-Ross.

On Tue, Sep 15, 2009 at 5:04 AM, O.Stephens wrote:
True. How, from the OpenURL, are you going to know that the rft is
meant to represent a website?
I guess that was part of my question. But no one has suggested  
defining a new metadata profile for websites (which I probably  
would avoid tbh). DC doesn't seem to offer a nice way of doing this  
(that is saying 'this is a website'), although there are perhaps  
some bits and pieces (format, type) that could be used to give some  
indication (but I suspect not unambiguously)


But I still think what you want is simply a purl server. What  
makes you

think you want OpenURL in the first place?  But I still don't really
understand what you're trying to do: "deliver consistency of approach 
across all our references" -- so are you using OpenURL for its more
"conventional" use too, but you want to tack on a purl-like
functionality to the same software that's doing something more  
like a
conventional link resolver?  I don't completely understand your  
use case.


I wouldn't use OpenURL just to get a persistent URL - I'd almost  
certainly look at PURL for this. But, I want something slightly  
different. I want our course authors to be able to use whatever URL  
they know for a resource, but still try to ensure that the link  
works persistently over time. I don't think it is reasonable for a  
user to have to know a 'special' URL for a resource - and this  
approach means establishing a PURL for all resources used in our  
teaching material whether or not it moves in the future - which is  
an overhead it would be nice to avoid.


You can hit delete now if you aren't interested, but ...

... perhaps if I just say a little more about the project I'm  
working on it may clarify...


The project I'm working on is concerned with referencing and  
citation. We are looking at how references appear in teaching  
material (esp. online) and how they can be reused by students in  
their personal environment (in essays, later study, or something  
else). The references that appear can be to anything - books,  
chapters, journals, articles, etc. Increasingly of course there are  
references to web-based materials.


For print material, references generally describe the resource and  
nothing more, but for digital material references are expected not  
only to describe the resource, but also state a route of access to  
the resource. This tends to be a bad idea when (for example)  
referencing e-journals, as we know the problems that surround this  
- many different routes of access to the same item. OpenURLs work  
well in this situation and seem to me like a sensible (and perhaps  
the only viable) solution. So we can say that for journals/articles  
it is sensible to ignore any URL supplied as part of the reference,  
and to form an OpenURL instead. If there is a DOI in the reference  
(which is increasingly common) then that can be used to form a URL  
using DOI resolution, but it makes more sense to me to hand this  
off to another application rather than bake this into the reference  
- and OpenURL resolvers are reasonably set to do this.


If we look at a website it is pretty difficult to reference it  
without including the URL - it seems to be the only good way of  
describing what you are actually talking about (how many people  
think of websites by 'title', 'author' and 'publisher'?). For me,  
this leads to an immediate confusion between the description of the  
resource and the route of access to it. So, to differentiate I'm  
starting to think of the http URI in a reference like this as a  
URI, but not necessarily a URL. We then need some mechanism to  
check, given a URI, what is the URL.


Now I could do this with a script - just pass the URI to a script  
that checks what URL to use against a list and redirects the user  
if necessary. On this point Jonathan said "if the usefulness of  
your technique does NOT count on being inter-operable with existing  
link resolver infrastructure... PERSONALLY I would be using  
OpenURL, I don't think it's worth it" - but it struck me that if we  
were passing a URI to a script, why not pass it in an OpenURL? I  
could see a

Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread O.Stephens
Ross - no you didn't miss it,

There are 3 ways that references might be added to the learning environment:

- An author (or realistically a proxy on behalf of the author) can insert a 
  reference into a structured Word document from an RIS file. This structured 
  document (XML) then goes through a 'publication' process which pushes the 
  content to the learning environment (Moodle), including rendering the 
  references from RIS format into a specified style, with links.
- An author/librarian/other can import references to a 'resources' area in our 
  learning environment (Moodle) from a RIS file
- An author/librarian/other can subscribe to an RSS feed from a RefWorks 
  'RefShare' folder within the 'resources' area of the learning environment

In general the project is focussing on the use of RefWorks - so although the 
RIS files could be created by any suitable s/w, we are looking specifically at 
RefWorks.
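
(As an aside, a minimal RIS record for a website reference might look 
something like the following -- the field values are invented for 
illustration; 'TY  - ELEC' is the RIS type commonly used for 
electronic/web resources, and Y2 carries the access date:)

  TY  - ELEC
  TI  - BBC News
  UR  - http://www.bbc.co.uk
  Y2  - 2009/08/24
  ER  - 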

How you get the reference into RefWorks is something we are looking at 
currently. The best approach varies depending on the type of material you are 
looking at:

- For websites it looks like the 'RefGrab-it' bookmarklet/browser plugin 
  (depending on your browser) is the easiest way of capturing website details.
- For books, probably a Union catalogue search from within RefWorks
- For journal articles, probably a Federated search engine (SS 360 is what 
  we've got)

Any of these could be entered by hand of course, as could several other kinds 
of reference

Entering the references into RefWorks could be done by an author, but it is 
more likely to be done by a member of clerical staff or a librarian/library 
assistant

Owen

Owen Stephens
TELSTAR Project Manager
Library and Learning Resources Centre
The Open University
Walton Hall
Milton Keynes, MK7 6AA

T: +44 (0) 1908 858701
F: +44 (0) 1908 653571
E: o.steph...@open.ac.uk


> -Original Message-
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
> Behalf Of Ross Singer
> Sent: 15 September 2009 15:56
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources
>
> Owen, I might have missed it in this message -- my eyes are
> starting to glaze over at this point in the thread, but can you
> describe how the input of these resources would work?
>
> What I'm basically asking is -- what would the professor need
> to do to add a new:  citation for a 70 year old book; journal
> on PubMed; URL to CiteSeer?
>
> How does their input make it into your database?
>
> -Ross.
>
> On Tue, Sep 15, 2009 at 5:04 AM, O.Stephens
>  wrote:
> >>True. How, from the OpenURL, are you going to know that the rft is
> >>meant to represent a website?
> > I guess that was part of my question. But no one has suggested
> > defining a new metadata profile for websites (which I
> probably would
> > avoid tbh). DC doesn't seem to offer a nice way of doing
> this (that is
> > saying 'this is a website'), although there are perhaps
> some bits and
> > pieces (format, type) that could be used to give some
> indication (but
> > I suspect not unambiguously)
> >
> >>But I still think what you want is simply a purl server. What makes
> >>you think you want OpenURL in the first place?  But I still don't
> >>really understand what you're trying to do: "deliver consistency of
> >>approach across all our references" -- so are you using OpenURL for
> >>its more "conventional" use too, but you want to tack on a
> purl-like
> >>functionality to the same software that's doing something
> more like a
> >>conventional link resolver?  I don't completely understand
> your use case.
> >
> > I wouldn't use OpenURL just to get a persistent URL - I'd
> almost certainly look at PURL for this. But, I want something
> slightly different. I want our course authors to be able to
> use whatever URL they know for a resource, but still try to
> ensure that the link works persistently over time. I don't
> think it is reasonable for a user to have to know a 'special'
> URL for a resource - and this approach means establishing a
> PURL for all resources used in our teaching material whether
> or not it moves in the future - which is an overhead it would
> be nice to avoid.
> >
> > You can hit delete now if you aren't interested, but ...
> >
> > ... perhaps if I just say a little more about the project
> I'm working on it may clarify...
> >
> > The project I'm working on is concerned with referencing
> and citation. We are looking at how references appear in
> teaching material (esp. online) and how they can be reused by
> students in their personal environment (in essays, later
> study, or something else). The references that appear can be
> to anything - books, chapters, journals, articles, etc.
> Increasingly of course there are references to web-based materials.
> >
> > For print material, references generally describe the
> resource and nothing more, but for digital material
> references are expected not only to describe the resource,
> but also state a route of access to the r

Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Ross Singer
Owen, I might have missed it in this message -- my eyes are starting to
glaze over at this point in the thread, but can you describe how the
input of these resources would work?

What I'm basically asking is -- what would the professor need to do to
add a new:  citation for a 70 year old book; journal on PubMed; URL to
CiteSeer?

How does their input make it into your database?

-Ross.

On Tue, Sep 15, 2009 at 5:04 AM, O.Stephens  wrote:
>>True. How, from the OpenURL, are you going to know that the rft is meant
>>to represent a website?
> I guess that was part of my question. But no one has suggested defining a new 
> metadata profile for websites (which I probably would avoid tbh). DC doesn't 
> seem to offer a nice way of doing this (that is saying 'this is a website'), 
> although there are perhaps some bits and pieces (format, type) that could be 
> used to give some indication (but I suspect not unambiguously)
>
>>But I still think what you want is simply a purl server. What makes you
>>think you want OpenURL in the first place?  But I still don't really
>>understand what you're trying to do: "deliver consistency of approach
>>across all our references" -- so are you using OpenURL for its more
>>"conventional" use too, but you want to tack on a purl-like
>>functionality to the same software that's doing something more like a
>>conventional link resolver?  I don't completely understand your use case.
>
> I wouldn't use OpenURL just to get a persistent URL - I'd almost certainly 
> look at PURL for this. But, I want something slightly different. I want our 
> course authors to be able to use whatever URL they know for a resource, but 
> still try to ensure that the link works persistently over time. I don't think 
> it is reasonable for a user to have to know a 'special' URL for a resource - 
> and this approach means establishing a PURL for all resources used in our 
> teaching material whether or not it moves in the future - which is an 
> overhead it would be nice to avoid.
>
> You can hit delete now if you aren't interested, but ...
>
> ... perhaps if I just say a little more about the project I'm working on it 
> may clarify...
>
> The project I'm working on is concerned with referencing and citation. We are 
> looking at how references appear in teaching material (esp. online) and how 
> they can be reused by students in their personal environment (in essays, 
> later study, or something else). The references that appear can be to 
> anything - books, chapters, journals, articles, etc. Increasingly of course 
> there are references to web-based materials.
>
> For print material, references generally describe the resource and nothing 
> more, but for digital material references are expected not only to describe 
> the resource, but also state a route of access to the resource. This tends to 
> be a bad idea when (for example) referencing e-journals, as we know the 
> problems that surround this - many different routes of access to the same 
> item. OpenURLs work well in this situation and seem to me like a sensible 
> (and perhaps the only viable) solution. So we can say that for 
> journals/articles it is sensible to ignore any URL supplied as part of the 
> reference, and to form an OpenURL instead. If there is a DOI in the reference 
> (which is increasingly common) then that can be used to form a URL using DOI 
> resolution, but it makes more sense to me to hand this off to another 
> application rather than bake this into the reference - and OpenURL resolvers 
> are reasonably set to do this.
>
> If we look at a website it is pretty difficult to reference it without 
> including the URL - it seems to be the only good way of describing what you 
> are actually talking about (how many people think of websites by 'title', 
> 'author' and 'publisher'?). For me, this leads to an immediate confusion 
> between the description of the resource and the route of access to it. So, to 
> differentiate I'm starting to think of the http URI in a reference like this 
> as a URI, but not necessarily a URL. We then need some mechanism to check, 
> given a URI, what is the URL.
>
> Now I could do this with a script - just pass the URI to a script that checks 
> what URL to use against a list and redirects the user if necessary. On this 
> point Jonathan said "if the usefulness of your technique does NOT count on 
> being inter-operable with existing link resolver infrastructure... PERSONALLY 
> I would not be using OpenURL, I don't think it's worth it" - but it struck me 
> that if we were passing a URI to a script, why not pass it in an OpenURL? I 
> could see a number of advantages to this in the local context:
>
> Consistency - references to websites get treated the same as references to 
> journal articles - this means a single approach on the course side, with 
> flexibility
> Usage stats - we could collect these whatever, but if we do it via OpenURL we 
> get this in the same place as the stats about usage of o

Re: [CODE4LIB] indexing pdf files

2009-09-15 Thread danielle plumer
My (much more primitive) version of the same thing involves reading and
annotating articles using my Tablet PC. Although I do get a variety of print
publications, I find I don't tend to annotate them as much anymore. I used
to use EndNote to do the metadata, then I switched to Zotero. I hadn't
thought to try to create a full-text search of the articles -- hmm.

-- 
Danielle Cunniff Plumer, Coordinator
Texas Heritage Digitization Initiative
Texas State Library and Archives Commission
512.463.5852 (phone) / 512.936.2306 (fax)
dplu...@tsl.state.tx.us
dcplu...@gmail.com


On Tue, Sep 15, 2009 at 8:31 AM, Eric Lease Morgan  wrote:

> I have been having fun recently indexing PDF files.
>
> For the past six months or so I have been keeping the articles I've read
> in a pile, and I was rather amazed at the size of the pile. It was about a
> foot tall. When I read these articles I "actively" read them -- meaning, I
> write, scribble, highlight, and annotate the text with my own special
> notation denoting names, keywords, definitions, citations, quotations, list
> items, examples, etc. This active reading process: 1) makes for better
> comprehension on my part, and 2) makes the articles easier to review and
> pick out the ideas I thought were salient. Being the librarian I am, I
> thought it might be cool ("kewl") to make the articles into a collection.
> Thus, the beginnings of Highlights & Annotations: A Value-Added Reading
> List.
>
> The techno-weenie process for creating and maintaining the content is
> something this community might find interesting:
>
>  1. Print article and read it actively.
>
>  2. Convert the printed article into a PDF
>file -- complete with embedded OCR --
>with my handy-dandy ScanSnap scanner. [1]
>
>  3. Use MyLibrary to create metadata (author,
>title, date published, date read, note,
>keywords, facet/term combinations, local
>and remote URLs, etc.) describing the
>article. [2]
>
>  4. Save the PDF to my file system.
>
>  5. Use pdftotext to extract the OCRed text
>from the PDF and index it along with
>the MyLibrary metadata using Solr. [3, 4]
>
>  6. Provide a searchable/browsable user
>interface to the collection through a
>mod_perl module. [5, 6]
>
> Software is never done, and if it were then it would be called hardware.
> Accordingly, I know there are some things I need to do before I can truly
> deem the system version 1.0. At the same time my excitement is overflowing
> and I thought I'd share some geekdom with my fellow hackers. Fun with PDF
> files and open source software.
>
>
> [1] ScanSnap - http://tinyurl.com/oafgwe
> [2] MyLibrary screen dump - http://infomotions.com/tmp/mylibrary.png
> [3] pdftotext - http://www.foolabs.com/xpdf/
> [4] Solr - http://lucene.apache.org/solr/
> [5] module source code - http://infomotions.com/highlights/Highlights.pl
> [6] user interface - http://infomotions.com/highlights/highlights.cgi
>
> --
> Eric Lease Morgan
> University of Notre Dame
>
>
>
>
> --
> Eric Lease Morgan
> Head, Digital Access and Information Architecture Department
> Hesburgh Libraries, University of Notre Dame
>
> (574) 631-8604
>


Re: [CODE4LIB] indexing pdf files

2009-09-15 Thread Peter Kiraly

Hi all,

I would like to suggest an API for extracting text (including highlighted
or annotated text) from PDF: iText (http://www.lowagie.com/iText/).
It is a Java API (with a C# port), and it helped me a lot when we worked
with some extraordinary PDF files.

Solr uses Tika (http://lucene.apache.org/tika) for extracting text from
documents, and Tika uses PDFBox (http://incubator.apache.org/pdfbox/)
to extract from PDF files. PDFBox is a great tool for normal PDF files,
but it has (or at least had) some behaviours which I wasn't satisfied with:

- it consumed more memory compared with iText, and couldn't
read files above a given size (the limit was large, about 1 GB, but we
had even larger files)

- it couldn't correctly handle the conditional hyphens at the end of
the line
- it had poorer documentation than iText, and its API was also
poorer (by that time Manning had published the iText in Action book).

Our PDF files were double-layered (original hi-res image + OCR-ed text)
documents several thousand pages long (Hungarian scientific journals,
the diary of the Houses of Parliament from the 19th century, etc.). We
indexed the content with Lucene, and in the UI we showed one page per screen,
so the user didn't need to download the full PDF. We extracted the
table of contents from the PDF as well, and we implemented it in the web UI,
so the user can browse pages according to the full file's TOC.

This project happened two years ago, so it is possible that a lot has
changed since then.

Király Péter
http://eXtensibleCatalog.org

- Original Message - 
From: "Mark A. Matienzo" 

To: 
Sent: Tuesday, September 15, 2009 3:56 PM
Subject: Re: [CODE4LIB] indexing pdf files



Eric,


 5. Use pdftotext to extract the OCRed text
   from the PDF and index it along with
   the MyLibrary metadata using Solr. [3, 4]



Have you considered using Solr's ExtractingRequestHandler [1] for the
PDFs? We're using it at NYPL with pretty great success.

[1] http://wiki.apache.org/solr/ExtractingRequestHandler

Mark A. Matienzo
Applications Developer, Digital Experience Group
The New York Public Library



Re: [CODE4LIB] indexing pdf files

2009-09-15 Thread Mark A. Matienzo
Eric,

>  5. Use pdftotext to extract the OCRed text
>from the PDF and index it along with
>the MyLibrary metadata using Solr. [3, 4]
>

Have you considered using Solr's ExtractingRequestHandler [1] for the
PDFs? We're using it at NYPL with pretty great success.

[1] http://wiki.apache.org/solr/ExtractingRequestHandler
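
(A minimal sketch of posting a PDF to that handler, for anyone curious -- 
the Solr URL, document id, and field names below are assumptions for 
illustration, not the actual NYPL setup:)

  import requests

  # Send a PDF to Solr's ExtractingRequestHandler (see [1] above);
  # Tika extracts the text server-side, and the literal.* params
  # attach metadata fields to the indexed document.
  solr = "http://localhost:8983/solr/update/extract"  # hypothetical core
  params = {
      "literal.id": "article-0001",    # hypothetical unique key
      "literal.title": "Some article",
      "commit": "true",
  }
  with open("article-0001.pdf", "rb") as f:
      files = {"file": ("article-0001.pdf", f, "application/pdf")}
      requests.post(solr, params=params, files=files).raise_for_status()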

Mark A. Matienzo
Applications Developer, Digital Experience Group
The New York Public Library


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread O.Stephens
Thanks Rosalyn,

As you say we could push a custom value into rft_genre. I'm a bit torn on this, 
as I guess I'm trying to do something that isn't 'hacky' - or at least not from 
the OpenURL end of it. It might be that this is just wishful thinking, and that 
I'm just trying to fool myself into thinking I'm 'sticking to the standard' 
when the likelihood of what I'm doing being transferable to other scenarios is 
zero (although Eric's comments make me hope not)

Yes, we are using SFX. What I'm proposing on the SFX end as the path of least 
resistance is writing a source parser for our learning environment which can 
do a 'fetch' for an alternative URL, or use the primary URL, and put it in an 
SFX internal field rft_856. We can then use the existing Target Parser 856_URL 
which displays the contents of rft_856 in the menu. Combined with some logic 
which forces this as the only option under certain circumstances we can then 
push the user directly to the resulting URL.

Owen

Owen Stephens
TELSTAR Project Manager
Library and Learning Resources Centre
The Open University
Walton Hall
Milton Keynes, MK7 6AA

T: +44 (0) 1908 858701
F: +44 (0) 1908 653571
E: o.steph...@open.ac.uk


> -Original Message-
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
> Behalf Of Rosalyn Metz
> Sent: 15 September 2009 14:42
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources
>
> you could force a timestamp if people don't include a date.
>
> and I like the idea of going to the Internet Archive of a
> website, because then you're not having to get into the
> business of handling www.bbc.co.uk differently than cnn.com
> and someblog.org.
>
> i also like the idea of using a redirect.  you could
> theoretically write a source parser (i'm assuming youre using
> SFX based on what you said about bX) that says if my rfr_id =
> mylocalid and the item is a website (however you choose to
> identify the website...which if you're writing your own
> source parser you could put website in the rft_genre even
> though its not technically a metadata format but you just
> want your source parser to forward the url on anyway, so the
> link resolver isn't actually going to do anything with it)
> bypass everything and just direct to the internet archive of
> the website.
>
> all of this is of course kind of hackish...but really isn't
> the whole thing hackish?  there were a few source parsers
> that would be good models for writing something like this.
> but i have no idea if they still exist because i haven't
> looked at the back end of sfx in about a year.
>
>
>
>
> On Tue, Sep 15, 2009 at 5:12 AM, O.Stephens
>  wrote:
> > I agree with this Rosalyn. The issue that Nate brought up
> was that the content at http://www.bbc.co.uk could change
> over time, and old content might be moved to another URI -
> http://archive.bbc.co.uk or something. So if course A
> references http://www.bbc.co.uk on 24/08/09, if the content
> that was on http://www.bbc.co.uk on 24/08/09 moves to
> http://archive.bbc.co.uk we can use the mechanism I propose
> to trap the links to http://www.bbc.co.uk and redirect to
> http://archive.bbc.co.uk. However, if at a later date course
> B references http://www.bbc.co.uk we have no way of knowing
> whether they mean the stuff that is currently on
> http://www.bbc.co.uk or the stuff that used to be on
> http://www.bbc.co.uk and is now on http://archive.bbc.co.uk -
> and we have a redirect that is being applied across the board.
> >
> > Thinking about it, references are required to include a
> date of access when citing websites, so this is probably the
> best piece of information to use to know where to resolve to
> (and we can put this in the DC metadata). Whether this will
> just get too confusing is a good question - I'll have a
> think about this.
> >
> > Owen
> >
> > PS using the date we could even consider resolving to the
> Internet Archive copy of a website if it was available I
> guess - this might be useful I guess...
> >
> > Owen Stephens
> > TELSTAR Project Manager
> > Library and Learning Resources Centre
> > The Open University
> > Walton Hall
> > Milton Keynes, MK7 6AA
> >
> > T: +44 (0) 1908 858701
> > F: +44 (0) 1908 653571
> > E: o.steph...@open.ac.uk
> >
> >
> >> -Original Message-
> >> From: Code for Libraries [mailto:code4...@listserv.nd.edu]
> On Behalf
> >> Of Rosalyn Metz
> >> Sent: 14 September 2009 21:52
> >> To: CODE4LIB@LISTSERV.ND.EDU
> >> Subject: Re: [CODE4LIB] Implementing OpenURL for simple
> web resources
> >>
> >> oops...just re-read original post s/professor/article
> >>
> >> also your link resolver should be creating a context
> object with each
> >> request.  this context object is what makes the openurl
> unique.  so
> >> if you want uniqueness for stats purposes i would imagine the link
> >> resolver is already doing that (and just another reason to use an
> >> rfr_id that you create).
> >>
> >>
> >>
> >>
> >> On 

Re: [CODE4LIB] indexing pdf files

2009-09-15 Thread Rosalyn Metz
Eric,

I have librarians that would kill for this.  In fact I was talking to
one about it the other day.  She felt there must be a way to handle
active reading and make it portable.  This would be great in
conjunction with RefWorks or Zotero or something along those lines.

Rosalyn



On Tue, Sep 15, 2009 at 9:31 AM, Eric Lease Morgan  wrote:
> I have been having fun recently indexing PDF files.
>
> For the past six months or so I have been keeping the articles I've read in
> a pile, and I was rather amazed at the size of the pile. It was about a foot
> tall. When I read these articles I "actively" read them -- meaning, I write,
> scribble, highlight, and annotate the text with my own special notation
> denoting names, keywords, definitions, citations, quotations, list items,
> examples, etc. This active reading process: 1) makes for better
> comprehension on my part, and 2) makes the articles easier to review and
> pick out the ideas I thought were salient. Being the librarian I am, I
> thought it might be cool ("kewl") to make the articles into a collection.
> Thus, the beginnings of Highlights & Annotations: A Value-Added Reading
> List.
>
> The techno-weenie process for creating and maintaining the content is
> something this community might find interesting:
>
>  1. Print article and read it actively.
>
>  2. Convert the printed article into a PDF
>    file -- complete with embedded OCR --
>    with my handy-dandy ScanSnap scanner. [1]
>
>  3. Use MyLibrary to create metadata (author,
>    title, date published, date read, note,
>    keywords, facet/term combinations, local
>    and remote URLs, etc.) describing the
>    article. [2]
>
>  4. Save the PDF to my file system.
>
>  5. Use pdftotext to extract the OCRed text
>    from the PDF and index it along with
>    the MyLibrary metadata using Solr. [3, 4]
>
>  6. Provide a searchable/browsable user
>    interface to the collection through a
>    mod_perl module. [5, 6]
>
> Software is never done, and if it were then it would be called hardware.
> Accordingly, I know there are some things I need to do before I can truly
> deem the system version 1.0. At the same time my excitement is overflowing
> and I thought I'd share some geekdom with my fellow hackers. Fun with PDF
> files and open source software.
>
>
> [1] ScanSnap - http://tinyurl.com/oafgwe
> [2] MyLibrary screen dump - http://infomotions.com/tmp/mylibrary.png
> [3] pdftotext - http://www.foolabs.com/xpdf/
> [4] Solr - http://lucene.apache.org/solr/
> [5] module source code - http://infomotions.com/highlights/Highlights.pl
> [6] user interface - http://infomotions.com/highlights/highlights.cgi
>
> --
> Eric Lease Morgan
> University of Notre Dame
>
>
>
>
> --
> Eric Lease Morgan
> Head, Digital Access and Information Architecture Department
> Hesburgh Libraries, University of Notre Dame
>
> (574) 631-8604
>


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread Rosalyn Metz
you could force a timestamp if people don't include a date.

and I like the idea of going to the Internet Archive of a website,
because then you're not having to get into the business of handling
www.bbc.co.uk differently than cnn.com and someblog.org.

i also like the idea of using a redirect.  you could theoretically
write a source parser (i'm assuming youre using SFX based on what you
said about bX) that says if my rfr_id = mylocalid and the item is a
website (however you choose to identify the website...which if you're
writing your own source parser you could put website in the rft_genre
even though its not technically a metadata format but you just want
your source parser to forward the url on anyway, so the link resolver
isn't actually going to do anything with it) bypass everything and
just direct to the internet archive of the website.

all of this is of course kind of hackish...but really isn't the whole
thing hackish?  there were a few source parsers that would be good
models for writing something like this.  but i have no idea if they
still exist because i haven't looked at the back end of sfx in about a
year.




On Tue, Sep 15, 2009 at 5:12 AM, O.Stephens  wrote:
> I agree with this Rosalyn. The issue that Nate brought up was that the 
> content at http://www.bbc.co.uk could change over time, and old content might 
> be moved to another URI - http://archive.bbc.co.uk or something. So if course 
> A references http://www.bbc.co.uk on 24/08/09, if the content that was on 
> http://www.bbc.co.uk on 24/08/09 moves to http://archive.bbc.co.uk we can use 
> the mechanism I propose to trap the links to http://www.bbc.co.uk and 
> redirect to http://archive.bbc.co.uk. However, if at a later date course B 
> references http://www.bbc.co.uk we have no way of knowing whether they mean 
> the stuff that is currently on http://www.bbc.co.uk or the stuff that used to 
> be on http://www.bbc.co.uk and is now on http://archive.bbc.co.uk - and we 
> have a redirect that is being applied across the board.
>
> Thinking about it, references are required to include a date of access when 
> citing websites, so this is probably the best piece of information to use to 
> know where to resolve to (and we can put this in the DC metadata). Whether 
> this will just get too confusing is a good question - I'll have a think 
> about this.
>
> Owen
>
> PS using the date we could even consider resolving to the Internet Archive 
> copy of a website if it was available I guess - this might be useful I 
> guess...
>
> Owen Stephens
> TELSTAR Project Manager
> Library and Learning Resources Centre
> The Open University
> Walton Hall
> Milton Keynes, MK7 6AA
>
> T: +44 (0) 1908 858701
> F: +44 (0) 1908 653571
> E: o.steph...@open.ac.uk
>
>
>> -Original Message-
>> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
>> Behalf Of Rosalyn Metz
>> Sent: 14 September 2009 21:52
>> To: CODE4LIB@LISTSERV.ND.EDU
>> Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources
>>
>> oops...just re-read original post s/professor/article
>>
>> also your link resolver should be creating a context object
>> with each request.  this context object is what makes the
>> openurl unique.  so if you want uniqueness for stats purposes
>> i would imagine the link resolver is already doing that (and
>> just another reason to use an rfr_id that you create).
>>
>>
>>
>>
>> On Mon, Sep 14, 2009 at 4:45 PM, Rosalyn Metz
>>  wrote:
>> > Owen,
>> >
>> > rft_id isn't really meant to be a unique identifier
>> (although it can
>> > be in situations like a pmid or doi).  are you looking for it to be?
>> > if so why?
>> >
>> > if professor A is pointing to http://www.bbc.co.uk and
>> professor B is
>> > pointing to http://www.bbc.co.uk why do they have to have unique
>> > OpenURLs.
>> >
>> > Rosalyn
>> >
>> >
>> >
>> >
>> > On Mon, Sep 14, 2009 at 4:41 PM, Eric Hellman
>>  wrote:
>> >> Nate's point is what I was thinking about in this comment in my
>> >> original
>> >> reply:
>> >> If you don't add DC metadata, which seems like a good idea, you'll
>> >> definitely want to include something that will help you to persist
>> >> your replacement record. For example, a label or
>> description for the link.
>> >>
>> >> I should also point out a solution that could work for some people
>> >> but not
>> >> you- put rewrite rules in the gateways serving your network. A bit
>> >> dangerous and kludgy, but we've seen kludgier things.
>> >>
>> >> On Sep 14, 2009, at 4:24 PM, O.Stephens wrote:
>> >>>
>> >>> Nate has a point here - what if we end up with a commonly
>> used URI
>> >>> pointing at a variety of different things over time, and
>> so is used
>> >>> to indicate different content each time. However the
>> problem with a 'short URL'
>> >>> solution (tr.im, purl etc), or indeed any locally assigned
>> >>> identifier that acts as a key, is that as described in
>> the blog post
>> >>> you need prior knowledge of the short URL/identifier to
>> us

[CODE4LIB] indexing pdf files

2009-09-15 Thread Eric Lease Morgan

I have been having fun recently indexing PDF files.

For the past six months or so I have been keeping the articles I've  
read in a pile, and I was rather amazed at the size of the pile. It  
was about a foot tall. When I read these articles I "actively" read  
them -- meaning, I write, scribble, highlight, and annotate the text  
with my own special notation denoting names, keywords, definitions,  
citations, quotations, list items, examples, etc. This active reading  
process: 1) makes for better comprehension on my part, and 2) makes  
the articles easier to review and pick out the ideas I thought were  
salient. Being the librarian I am, I thought it might be cool ("kewl")  
to make the articles into a collection. Thus, the beginnings of  
Highlights & Annotations: A Value-Added Reading List.


The techno-weenie process for creating and maintaining the content is  
something this community might find interesting:


 1. Print article and read it actively.

 2. Convert the printed article into a PDF
file -- complete with embedded OCR --
with my handy-dandy ScanSnap scanner. [1]

 3. Use MyLibrary to create metadata (author,
title, date published, date read, note,
keywords, facet/term combinations, local
and remote URLs, etc.) describing the
article. [2]

 4. Save the PDF to my file system.

 5. Use pdftotext to extract the OCRed text
from the PDF and index it along with
the MyLibrary metadata using Solr. [3, 4]

 6. Provide a searchable/browsable user
interface to the collection through a
mod_perl module. [5, 6]

Software is never done, and if it were then it would be called  
hardware. Accordingly, I know there are some things I need to do  
before I can truly deem the system version 1.0. At the same time my  
excitement is overflowing and I thought I'd share some geekdom with my  
fellow hackers. Fun with PDF files and open source software.
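
(A rough sketch of what step 5 could look like in practice -- this is 
not the actual code; the Solr URL and field names are assumptions, and 
it assumes a Solr that accepts JSON updates:)

  import subprocess
  import requests

  # Extract the OCR text layer with pdftotext, then index it in Solr
  # together with a couple of metadata fields.
  pdf = "article-0001.pdf"                        # hypothetical file
  text = subprocess.run(["pdftotext", pdf, "-"],  # "-" writes to stdout
                        capture_output=True, text=True, check=True).stdout

  doc = {
      "id": "article-0001",              # hypothetical unique key
      "title": "Some actively read article",
      "fulltext": text,                  # the extracted OCR layer
  }
  requests.post("http://localhost:8983/solr/update",
                json=[doc], params={"commit": "true"}).raise_for_status()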



[1] ScanSnap - http://tinyurl.com/oafgwe
[2] MyLibrary screen dump - http://infomotions.com/tmp/mylibrary.png
[3] pdftotext - http://www.foolabs.com/xpdf/
[4] Solr - http://lucene.apache.org/solr/
[5] module source code - http://infomotions.com/highlights/Highlights.pl
[6] user interface - http://infomotions.com/highlights/highlights.cgi

--
Eric Lease Morgan
University of Notre Dame




--
Eric Lease Morgan
Head, Digital Access and Information Architecture Department
Hesburgh Libraries, University of Notre Dame

(574) 631-8604


[CODE4LIB] Results from "Institutional Identifiers in Repositories" Survey

2009-09-15 Thread Michael J. Giarlo
Greetings,

The NISO I2 Working Group surveyed repository managers and developers
about current practices and needs of the repository community around
institutional identifiers.  Results from the survey will inform a set
of use cases that are expected to drive the development of a draft
standard for institutional identifiers.

A report on the results of the survey is now available to the public:

http://bit.ly/14hWly

Feedback from the repository community is most welcome.  It may be
sent to our public i2info mailing list --
http://www.niso.org/lists/i2info/ -- or directly to me.

Thanks,

-Mike
 Co-chair, Repositories scenario, NISO I2 Working Group


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread O.Stephens
I agree with this Rosalyn. The issue that Nate brought up was that the content 
at http://www.bbc.co.uk could change over time, and old content might be moved 
to another URI - http://archive.bbc.co.uk or something. So if course A 
references http://www.bbc.co.uk on 24/08/09, if the content that was on 
http://www.bbc.co.uk on 24/08/09 moves to http://archive.bbc.co.uk we can use 
the mechanism I propose to trap the links to http://www.bbc.co.uk and redirect 
to http://archive.bbc.co.uk. However, if at a later date course B references 
http://www.bbc.co.uk we have no way of knowing whether they mean the stuff that 
is currently on http://www.bbc.co.uk or the stuff that used to be on 
http://www.bbc.co.uk and is now on http://archive.bbc.co.uk - and we have a 
redirect that is being applied across the board.

Thinking about it, references are required to include a date of access when 
citing websites, so this is probably the best piece of information to use to 
know where to resolve to (and we can put this in the DC metadata). Whether this 
will just get too confusing is a good question - I'll have a think about this.

Owen

PS using the date we could even consider resolving to the Internet Archive copy 
of a website if it was available I guess - this might be useful I guess...
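
(For instance, the Wayback Machine's URL convention prefixes a timestamp 
to the original URL and redirects to the nearest capture, so an access 
date of 24/08/09 would give something like 
http://web.archive.org/web/20090824/http://www.bbc.co.uk )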

Owen Stephens
TELSTAR Project Manager
Library and Learning Resources Centre
The Open University
Walton Hall
Milton Keynes, MK7 6AA

T: +44 (0) 1908 858701
F: +44 (0) 1908 653571
E: o.steph...@open.ac.uk


> -Original Message-
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On
> Behalf Of Rosalyn Metz
> Sent: 14 September 2009 21:52
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Implementing OpenURL for simple web resources
>
> oops...just re-read original post s/professor/article
>
> also your link resolver should be creating a context object
> with each request.  this context object is what makes the
> openurl unique.  so if you want uniqueness for stats purposes
> i would imagine the link resolver is already doing that (and
> just another reason to use an rfr_id that you create).
>
>
>
>
> On Mon, Sep 14, 2009 at 4:45 PM, Rosalyn Metz
>  wrote:
> > Owen,
> >
> > rft_id isn't really meant to be a unique identifier
> (although it can
> > be in situations like a pmid or doi).  are you looking for it to be?
> > if so why?
> >
> > if professor A is pointing to http://www.bbc.co.uk and
> professor B is
> > pointing to http://www.bbc.co.uk why do they have to have unique
> > OpenURLs.
> >
> > Rosalyn
> >
> >
> >
> >
> > On Mon, Sep 14, 2009 at 4:41 PM, Eric Hellman
>  wrote:
> >> Nate's point is what I was thinking about in this comment in my
> >> original
> >> reply:
> >> If you don't add DC metadata, which seems like a good idea, you'll
> >> definitely want to include something that will help you to persist
> >> your replacement record. For example, a label or
> description for the link.
> >>
> >> I should also point out a solution that could work for some people
> >> but not
> >> you- put rewrite rules in the gateways serving your network. A bit
> >> dangerous and kludgy, but we've seen kludgier things.
> >>
> >> On Sep 14, 2009, at 4:24 PM, O.Stephens wrote:
> >>>
> >>> Nate has a point here - what if we end up with a commonly
> used URI
> >>> pointing at a variety of different things over time, and
> so is used
> >>> to indicate different content each time. However the
> problem with a 'short URL'
> >>> solution (tr.im, purl etc), or indeed any locally assigned
> >>> identifier that acts as a key, is that as described in
> the blog post
> >>> you need prior knowledge of the short URL/identifier to
> use it. The
> >>> only 'identifier' our authors know for a website is it's
> URL - and
> >>> it seems contrary for us to ask them to use something else. I'll
> >>> need to think about Nate's point - is this common or an
> edge case? Is there any other approach we could take?
> >>>
> >>
> >> Eric Hellman
> >> President, Gluejar, Inc.
> >> 41 Watchung Plaza, #132
> >> Montclair, NJ 07042
> >> USA
> >>
> >> e...@hellman.net
> >> http://go-to-hellman.blogspot.com/
> >>
> >
>


The Open University is incorporated by Royal Charter (RC 000391), an exempt 
charity in England & Wales and a charity registered in Scotland (SC 038302).


Re: [CODE4LIB] Implementing OpenURL for simple web resources

2009-09-15 Thread O.Stephens
>True. How, from the OpenURL, are you going to know that the rft is meant
>to represent a website?
I guess that was part of my question. But no one has suggested defining a new 
metadata profile for websites (which I probably would avoid tbh). DC doesn't 
seem to offer a nice way of doing this (that is saying 'this is a website'), 
although there are perhaps some bits and pieces (format, type) that could be 
used to give some indication (but I suspect not unambiguously)

>But I still think what you want is simply a purl server. What makes you
>think you want OpenURL in the first place?  But I still don't really
>understand what you're trying to do: "deliver consistency of approach
>across all our references" -- so are you using OpenURL for its more
>"conventional" use too, but you want to tack on a purl-like
>functionality to the same software that's doing something more like a
>conventional link resolver?  I don't completely understand your use case.

I wouldn't use OpenURL just to get a persistent URL - I'd almost certainly look 
at PURL for this. But, I want something slightly different. I want our course 
authors to be able to use whatever URL they know for a resource, but still try 
to ensure that the link works persistently over time. I don't think it is 
reasonable for a user to have to know a 'special' URL for a resource - and this 
approach means establishing a PURL for all resources used in our teaching 
material whether or not it moves in the future - which is an overhead it would 
be nice to avoid.

You can hit delete now if you aren't interested, but ...

... perhaps if I just say a little more about the project I'm working on it may 
clarify...

The project I'm working on is concerned with referencing and citation. We are 
looking at how references appear in teaching material (esp. online) and how 
they can be reused by students in their personal environment (in essays, later 
study, or something else). The references that appear can be to anything - 
books, chapters, journals, articles, etc. Increasingly of course there are 
references to web-based materials.

For print material, references generally describe the resource and nothing 
more, but for digital material references are expected not only to describe the 
resource, but also state a route of access to the resource. This tends to be a 
bad idea when (for example) referencing e-journals, as we know the problems 
that surround this - many different routes of access to the same item. OpenURLs 
work well in this situation and seem to me like a sensible (and perhaps the 
only viable) solution. So we can say that for journals/articles it is sensible 
to ignore any URL supplied as part of the reference, and to form an OpenURL 
instead. If there is a DOI in the reference (which is increasingly common) then 
that can be used to form a URL using DOI resolution, but it makes more sense to 
me to hand this off to another application rather than bake this into the 
reference - and OpenURL resolvers are reasonably set to do this.

If we look at a website it is pretty difficult to reference it without 
including the URL - it seems to be the only good way of describing what you are 
actually talking about (how many people think of websites by 'title', 'author' 
and 'publisher'?). For me, this leads to an immediate confusion between the 
description of the resource and the route of access to it. So, to differentiate 
I'm starting to think of the http URI in a reference like this as a URI, but 
not necessarily a URL. We then need some mechanism to check, given a URI, what 
is the URL.
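
(A minimal sketch of such a mechanism -- the lookup table, port, and 
plain-stdlib server below are invented for illustration, not the TELSTAR 
implementation:)

  # Look the cited URI up in a mapping of moved resources and issue a
  # redirect; URIs we don't know about fall through to themselves.
  from http.server import BaseHTTPRequestHandler, HTTPServer
  from urllib.parse import urlparse, parse_qs

  MOVED = {  # hypothetical mappings
      "http://www.bbc.co.uk/oldpage": "http://archive.bbc.co.uk/oldpage",
  }

  class Redirector(BaseHTTPRequestHandler):
      def do_GET(self):
          # expects requests like /resolve?uri=http%3A%2F%2F...
          qs = parse_qs(urlparse(self.path).query)
          uri = qs.get("uri", [""])[0]
          self.send_response(302)
          self.send_header("Location", MOVED.get(uri, uri))
          self.end_headers()

  if __name__ == "__main__":
      HTTPServer(("", 8000), Redirector).serve_forever()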

Now I could do this with a script - just pass the URI to a script that checks 
what URL to use against a list and redirects the user if necessary. On this 
point Jonathan said "if the usefulness of your technique does NOT count on 
being inter-operable with existing link resolver infrastructure... PERSONALLY I 
would not be using OpenURL, I don't think it's worth it" - but it struck me that if 
we were passing a URI to a script, why not pass it in an OpenURL? I could see a 
number of advantages to this in the local context:

- Consistency - references to websites get treated the same as references to 
  journal articles - this means a single approach on the course side, with 
  flexibility
- Usage stats - we could collect these whatever, but if we do it via OpenURL we 
  get this in the same place as the stats about usage of other scholarly 
  material and could consider driving personalisation services off the data 
  (like the bX product from Ex Libris)
- Appropriate copy problem - for resources we subscribe to with authentication 
  mechanisms there is (I think) an equivalent to the 'appropriate copy' issue 
  as with journal articles - we can push a URI to 'Web of Science' to the 
  correct version of Web of Science via a local authentication method (using 
  ezproxy for us)

The problem with the approach (as Nate and Eric mention) is that any approach 
that relies on the URI as a identifier (