Re: [CODE4LIB] web archiving - was: Implementing OpenURL for simple web resources

2009-09-29 Thread Erik Hetzner
At Fri, 18 Sep 2009 10:40:08 -0400,
Ed Summers wrote:
 
 Hi Erik, all

 […]

 I haven't been following this thread completely, but you've taken it
 in an interesting direction. I think you've succinctly described the
 issue with using URLs as references in an academic context: that the
 integrity of the URL is a function of time. As John Kunze has said:
 Just because the URI was the last to see a resource alive doesn't
 mean it killed them :-)
 
 I'm sure you've seen this, but Internet Archive have a nice URL
 pattern for referencing a resource representation in time:
 
   http://web.archive.org/web/{year}{month}{day}{hour}{minute}{seconds}/{url}
 
 So for example you can reference Google's homepage on December 2, 1998
 at 23:04:10 with this URL:
 
   http://web.archive.org/web/19981202230410/http://www.google.com/
 
 As Mike's email points out this is only good as long as Internet
 Archive is up and running the way we expect it to. Having any one
 organization shoulder this burden isn't particularly scalable, or
 realistic IMHO. But luckily the open and distributed nature of the
 web allows other organizations to do the same thing--like the great
 work you all are doing at the California Digital Library [1] and
 similar efforts like WebCite [2]. It would be kinda nice if these
 web archiving solutions sported similar URI patterns to enable
 discovery. For example it looks like:
 
   http://webarchives.cdlib.org/sw1jd4pq4k/http://books.nap.edu/html/id_questions/appB.html
 
 references a frame that surrounds an actual representation in time:
 
   http://webarchives.cdlib.org/wayback.public/NYUL_ag_3/20090320202246/http://books.nap.edu/html/id_questions/appB.html
 
 Which is quite similar to Internet Archive's URI pattern -- not
 surprising given the common use of Wayback [3]. But there are some
 differences. It might be nice to promote some URI patterns for web
 archiving services, so that we could theoretically create
 applications that federated search for a known resource at a given
 time. I guess in part OpenURL was designed to fill this space, but
 it might instead be a bit more natural to define a URI pattern that
 approximated what Wayback does, and come up with some way of sharing
 archive locations. I'm not sure if that last bit made any sense, or
 if some attempt at this has been made already. Maybe something to
 talk about at iPRES?
 
 I had hoped that the Zotero/InternetArchive collaboration would lead
 to some more integration between scholarly use of the web and
 archiving [3]. I guess there's still time?
 
 //Ed
 
 [1] http://webarchives.cdlib.org/
 [2] http://www.webcitation.org/
 [3] http://inkdroid.org/journal/2007/12/17/permalinks-reloaded/

Hi Ed, code4libbers -

Sorry for the late reply, but I have been on vacation.

Thanks for the insightful comments. They are very much in line with
things I have been thinking and you have got me thinking along some
other lines as well.

Our system is based on crawls, so in your example sw1jd4pq4k is a
crawl id. We discussed using the .../20090101.../http://.. scheme
directly as in wayback, but decided to use crawl-based URLs as our
primary mechanism of entry, given the constraints of our system.

(By the way, the ...wayback.public... URL should not be relied on
for permanence!)

We would, however, like to support the use of wayback style URLs as
well. There is some interest in the web archiving community in
increasing interoperability between web archive systems, so that we
can, for instance, direct a user to web.archive.org if we do not have
a URL in our system, and vice versa.

In terms of getting authors to cite archived material rather than live
web material, there are many approaches to this that I can think of,
for example:

a) Encouraging authors to link to archive.org or other web archives
rather than the live web;

b) Creating services to allow authors to take snapshots of websites,
like WebCite, if necessary;

c) Rewriting links in our system to point to archives, so that, for
instance, the reference (taken from the first Google search result
for “mla website citation”, and, of course, broken):

Lynch, Tim. DSN Trials and Tribble-ations Review. Psi Phi: Bradley's
Science Fiction Club. 1996. Bradley University. 8 Oct. 1997
http://www.bradley.edu/campusorg/psiphi/DS9/ep/503r.html.

would be rewritten to the working URL, based on the URL provided and
the access time (8 Oct. 1997):

http://web.archive.org/web/1997100800/http://www.bradley.edu/campusorg/psiphi/DS9/ep/503r.html

d) Publicizing web archiving so that users know that they can use tools
like the web archive to find those broken links;

e) Providing browser plugins so that users who follow 404ed links can
be given the alternative of proceeding to an archived web site.
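
Approach (c) above could be sketched roughly as follows. This is only
an illustration, not how any existing system does it: the function
name, the regular expression, and the choice to pad the access date
with zeros to midnight are all assumptions, and it only handles
citations where a "day month year" access date directly precedes the
URL.

```python
import re
from datetime import date

# Archive prefix taken from the Internet Archive URL pattern quoted in this thread.
WAYBACK_PREFIX = "http://web.archive.org/web/"

_MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
           "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# Hypothetical pattern: "8 Oct. 1997" (or "8 October 1997") followed by a URL.
_ACCESS_DATE_AND_URL = re.compile(
    r"(\d{1,2})\s+([A-Za-z]{3})[a-z]*\.?\s+(\d{4})\s+(https?://\S+)"
)


def archived_url_for_citation(citation: str) -> str:
    """Build a Wayback-style URL from a citation's access date and URL."""
    match = _ACCESS_DATE_AND_URL.search(citation)
    if match is None:
        raise ValueError("no 'day month year URL' sequence found")
    day, month_name, year, url = match.groups()
    month = _MONTHS.get(month_name.lower())
    if month is None:
        raise ValueError("unrecognized month: " + month_name)
    accessed = date(int(year), month, int(day))  # raises on impossible dates
    url = url.rstrip(".,;")  # drop the sentence punctuation after the URL
    # Assumption: pad the access date to midnight to get a 14-digit timestamp.
    stamp = accessed.strftime("%Y%m%d") + "000000"
    return WAYBACK_PREFIX + stamp + "/" + url
```

Whether the archive actually holds a capture near that date is a
separate question; the sketch only constructs the URL to try.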

best,
Erik Hetzner
;; Erik Hetzner, California Digital Library
;; gnupg key id: 1024D/01DB07E3




[CODE4LIB] web archiving - was: Implementing OpenURL for simple web resources

2009-09-18 Thread Ed Summers
Hi Erik, all

On Tue, Sep 15, 2009 at 1:12 PM, Erik Hetzner erik.hetz...@ucop.edu wrote:
 I might be misunderstanding you, but, I think that you are leaving out
 the implicit dimension of time here - when was the URL referenced?
 What can we use to represent the tuple (URL, date), and how do we
 retrieve an appropriate representation of this tuple? Is the most
 appropriate representation the most recent version of the page,
 wherever it may have moved? Or is the most appropriate representation
 the page as it existed in the past? I would argue that the most
 appropriate representation would be the page as it existed in the
 past, not what the page looks like now - but I am biased, because I
 work in web archiving.

 Unfortunately this is a problem that has not been very well addressed
 by the web architecture people, or the web archiving people. The web
 architecture people start from the assumption that
 http://example.org/ is the same resource which only varies in its
 representation as a function of time, not in its identity as a
 resource. The web archives people create closed systems and do not
 think about how to store and resolve the tuple (URL, date).
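
One way to picture storing and resolving a URL and date pair is a
small in-memory index of capture times per URL. The sketch below is
only an illustration: the class name, its methods, and the policy of
preferring the latest capture at or before the reference date are
assumptions, not something any archive in this thread is known to
implement.

```python
import bisect
from datetime import datetime
from typing import Dict, List, Optional


class CaptureIndex:
    """Hypothetical index mapping a URL to its sorted capture times."""

    def __init__(self) -> None:
        self._captures: Dict[str, List[datetime]] = {}

    def add(self, url: str, captured_at: datetime) -> None:
        # insort keeps each URL's capture list in chronological order.
        bisect.insort(self._captures.setdefault(url, []), captured_at)

    def resolve(self, url: str, referenced_at: datetime) -> Optional[datetime]:
        """Return the capture that best matches when the URL was referenced.

        Assumed policy: the latest capture at or before the reference date;
        if every capture is later, fall back to the earliest one.
        """
        times = self._captures.get(url)
        if not times:
            return None
        pos = bisect.bisect_right(times, referenced_at)
        return times[pos - 1] if pos > 0 else times[0]
```

Under this policy a citation made before the first crawl still
resolves to something, which may or may not be the behaviour a given
archive wants.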

I haven't been following this thread completely, but you've taken it
in an interesting direction. I think you've succinctly described the
issue with using URLs as references in an academic context: that the
integrity of the URL is a function of time. As John Kunze has said:
Just because the URI was the last to see a resource alive doesn't
mean it killed them :-)

I'm sure you've seen this, but Internet Archive have a nice URL
pattern for referencing a resource representation in time:

  http://web.archive.org/web/{year}{month}{day}{hour}{minute}{seconds}/{url}

So for example you can reference Google's homepage on December 2, 1998
at 23:04:10 with this URL:

  http://web.archive.org/web/19981202230410/http://www.google.com/
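
As a rough illustration of the pattern, here is a minimal Python
sketch that builds and parses such URLs. The function names are made
up for this example, and the parser assumes a full 14-digit timestamp
as in the pattern above.

```python
import re
from datetime import datetime
from typing import Tuple

WAYBACK_PREFIX = "http://web.archive.org/web/"
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # {year}{month}{day}{hour}{minute}{seconds}

_WAYBACK_URL = re.compile(r"^http://web\.archive\.org/web/(\d{14})/(.+)$")


def build_wayback_url(when: datetime, url: str) -> str:
    """Combine a capture time and an original URL into one archive URL."""
    return WAYBACK_PREFIX + when.strftime(TIMESTAMP_FORMAT) + "/" + url


def parse_wayback_url(archived: str) -> Tuple[datetime, str]:
    """Split an archive URL back into (capture time, original URL)."""
    match = _WAYBACK_URL.match(archived)
    if match is None:
        raise ValueError("not a 14-digit Wayback-style URL: " + archived)
    stamp, original = match.groups()
    return datetime.strptime(stamp, TIMESTAMP_FORMAT), original


google_1998 = build_wayback_url(datetime(1998, 12, 2, 23, 4, 10),
                                "http://www.google.com/")
```

Here `google_1998` is the same string as the example URL above.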

As Mike's email points out this is only good as long as Internet
Archive is up and running the way we expect it to. Having any one
organization shoulder this burden isn't particularly scalable, or
realistic IMHO. But luckily the open and distributed nature of the web
allows other organizations to do the same thing--like the great work
you all are doing at the California Digital Library [1] and similar
efforts like WebCite [2]. It would be kinda nice if these web
archiving solutions sported similar URI patterns to enable discovery.
For example it looks like:

  http://webarchives.cdlib.org/sw1jd4pq4k/http://books.nap.edu/html/id_questions/appB.html

references a frame that surrounds an actual representation in time:

  http://webarchives.cdlib.org/wayback.public/NYUL_ag_3/20090320202246/http://books.nap.edu/html/id_questions/appB.html

Which is quite similar to Internet Archive's URI pattern -- not
surprising given the common use of Wayback [3]. But there are some
differences. It might be nice to promote some URI patterns for web
archiving services, so that we could theoretically create applications
that federated search for a known resource at a given time. I guess in
part OpenURL was designed to fill this space, but it might instead be
a bit more natural to define a URI pattern that approximated what
Wayback does, and come up with some way of sharing archive locations.
I'm not sure if that last bit made any sense, or if some attempt at
this has been made already. Maybe something to talk about at iPRES?
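
To make the idea a bit more concrete, here is a hedged sketch of what
a client could do if several archives did share a Wayback-like
pattern. No such agreement is described in this thread: the second
prefix is a made-up placeholder, and the availability check is left
to the caller rather than tied to any real archive API.

```python
from datetime import datetime
from typing import Callable, Iterable, List, Optional

# Only the first prefix comes from this thread; the second is hypothetical.
ARCHIVE_PREFIXES = [
    "http://web.archive.org/web/",
    "http://archive.example.org/wayback/",
]


def candidate_urls(url: str, when: datetime,
                   prefixes: Iterable[str] = ARCHIVE_PREFIXES) -> List[str]:
    """List one {prefix}{timestamp}/{url} candidate per known archive."""
    stamp = when.strftime("%Y%m%d%H%M%S")
    return [prefix + stamp + "/" + url for prefix in prefixes]


def first_available(candidates: Iterable[str],
                    is_available: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate the caller-supplied check accepts.

    is_available is a placeholder for whatever lookup an application
    uses (an HTTP request, a local index, ...); nothing here touches
    the network.
    """
    for candidate in candidates:
        if is_available(candidate):
            return candidate
    return None
```

The "way of sharing archive locations" would then amount to agreeing
on how a list like `ARCHIVE_PREFIXES` gets published and discovered.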

I had hoped that the Zotero/InternetArchive collaboration would lead
to some more integration between scholarly use of the web and
archiving [3]. I guess there's still time?

//Ed

[1] http://webarchives.cdlib.org/
[2] http://www.webcitation.org/
[3] http://inkdroid.org/journal/2007/12/17/permalinks-reloaded/