Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-03-01 Thread Owen Stephens
Thanks Jason and Ed, I suspect within this project we'll keep using OAI-PMH because we've got tight deadlines and the other project strands (which do stuff with the harvested content) need time from the developer. At the moment it looks like we will probably combine OAI-PMH with web crawling

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-03-01 Thread Ian Ibbotson
Owen... Just wanted to say that, whilst I've been silent since my initial response, I'm not sure I agree with all the viewpoints presented here.. From a point of view of (for example, CultureGrid) I'm not sure what has been done could have been pragmatically achieved solely with web crawling as

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-03-01 Thread Owen Stephens
Thanks Ian, Agree that it is clear from this discussion that there are differing viewpoints and also very different requirements depending on the context and desired outcomes. I think I said earlier in the thread - I'm not against niche solutions, they just make me want to double check that

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-03-01 Thread Jonathan Rochkind
IF your HTML includes embedded semantic data using HTML5 microdata or RDFa or something similar (using a standard vocabulary -- the standard for repositories seems to be DC-based, since that's often all you can get out of OAI-PMH anyway) --- then web crawling combined with site maps probably
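As a rough illustration of the "site maps" half of that suggestion, here is a minimal sketch of how a crawler might read a repository's sitemap to find item pages to visit. The sample sitemap, the repository.example.org URLs and the function name are invented for the example; only the sitemap namespace comes from the sitemaps.org protocol. Extracting the embedded microdata or RDFa from each page would be a separate step not shown here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical sitemap for an imaginary repository.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://repository.example.org/1234/</loc><lastmod>2012-02-20</lastmod></url>
  <url><loc>http://repository.example.org/1235/</loc></url>
</urlset>"""


def sitemap_entries(sitemap_xml):
    """Return (loc, lastmod) pairs; lastmod is None when the element is absent."""
    root = ET.fromstring(sitemap_xml)
    entries = []
    for url in root.iter("{%s}url" % SITEMAP_NS):
        loc = url.findtext("{%s}loc" % SITEMAP_NS)
        lastmod = url.findtext("{%s}lastmod" % SITEMAP_NS)
        if loc:
            entries.append((loc.strip(), lastmod))
    return entries


if __name__ == "__main__":
    for loc, lastmod in sitemap_entries(SAMPLE_SITEMAP):
        print(loc, lastmod)
```

The lastmod value, where a site supplies it, gives a crawler something roughly comparable to the datestamp used for incremental OAI-PMH harvesting.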

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-27 Thread Owen Stephens
On 26 Feb 2012, at 14:42, Godmar Back wrote: May I ask a side question and make a side observation regarding the harvesting of full text of the object to which an OAI-PMH record refers? In general, is the idea to use the dc:source/text() element, treat it as a URL, and then expect to find

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-27 Thread Godmar Back
On Mon, Feb 27, 2012 at 5:25 AM, Owen Stephens o...@ostephens.com wrote: On 26 Feb 2012, at 14:42, Godmar Back wrote: May I ask a side question and make a side observation regarding the harvesting of full text of the object to which an OAI-PMH record refers? In general, is the idea to

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-27 Thread Diane Hillmann
On Mon, Feb 27, 2012 at 5:25 AM, Owen Stephens o...@ostephens.com wrote: This issue is certainly not unique to VT - we've come across this as part of our project. While the OAI-PMH record may point at the PDF, it can also point to an intermediary page. This seems to be standard practice in

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-27 Thread Owen Stephens
On 27 Feb 2012, at 13:31, Diane Hillmann wrote: On Mon, Feb 27, 2012 at 5:25 AM, Owen Stephens o...@ostephens.com wrote: providers provide such intermediate pages (arxiv.org, for instance). The other issue driving providers towards intermediate pages is that it allows them to continue to

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-27 Thread Godmar Back
On Mon, Feb 27, 2012 at 8:31 AM, Diane Hillmann metadata.ma...@gmail.com wrote: On Mon, Feb 27, 2012 at 5:25 AM, Owen Stephens o...@ostephens.com wrote: This issue is certainly not unique to VT - we've come across this as part of our project. While the OAI-PMH record may point at the PDF,

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-27 Thread Joe Hourcle
On Feb 27, 2012, at 10:51 AM, Godmar Back wrote: On Mon, Feb 27, 2012 at 8:31 AM, Diane Hillmann metadata.ma...@gmail.com wrote: On Mon, Feb 27, 2012 at 5:25 AM, Owen Stephens o...@ostephens.com wrote: This issue is certainly not unique to VT - we've come across this as part of our project.

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-27 Thread raffaele messuti
On Sun, Feb 26, 2012 at 3:42 PM, Godmar Back god...@gmail.com wrote: May I ask a side question and make a side observation regarding the harvesting of full text of the object to which an OAI-PMH record refers? In Italy institutional repositories of theses are required to publish metadata as

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-27 Thread Jason Ronallo
On Fri, Feb 24, 2012 at 6:50 AM, Owen Stephens o...@ostephens.com wrote: One obvious alternative that I keep coming back to is 'forget OAI-PMH, just crawl the web' ... Owen, I'd like to bring this back to your suggestion to just forget OAI-PMH and crawl the web. I think that's probably the

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-27 Thread Habing, Thomas Gerald
for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Joe Hourcle Sent: Monday, February 27, 2012 10:43 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Repositories, OAI-PMH and web crawling On Feb 27, 2012, at 10:51 AM, Godmar Back wrote: On Mon, Feb 27, 2012 at 8:31 AM, Diane

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-26 Thread Owen Stephens
On 24 Feb 2012, at 16:52, Ian Ibbotson wrote: Sorry.. late to the discussion... Isn't this a little apples and oranges? Surely robots.txt exists because many static resources are served directly from a tree structured filesystem? (Nearly) all OAI requests are responded to by specific

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-26 Thread Owen Stephens
On 24 Feb 2012, at 18:20, Joe Hourcle wrote: On Feb 24, 2012, at 9:25 AM, Kyle Banerjee wrote: I see it like the people who request that their pages not be cached elsewhere -- they want to make their object 'discoverable', but they want to control the access to those objects -- so it's one

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-26 Thread Owen Stephens
@LISTSERV.ND.EDU] On Behalf Of Joe Hourcle Sent: Friday, February 24, 2012 10:20 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Repositories, OAI-PMH and web crawling On Feb 24, 2012, at 9:25 AM, Kyle Banerjee wrote: One of the questions this raises is what we are/aren't allowed to do

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-26 Thread Owen Stephens
On 24 Feb 2012, at 18:20, Joe Hourcle wrote: I see it like the people who request that their pages not be cached elsewhere -- they want to make their object 'discoverable', but they want to control the access to those objects -- so it's one thing for a search engine to get a copy, but

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-26 Thread Godmar Back
May I ask a side question and make a side observation regarding the harvesting of full text of the object to which an OAI-PMH record refers? In general, is the idea to use the dc:source/text() element, treat it as a URL, and then expect to find the object there (provided that there was a suitable
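The approach being asked about (take the text of a Dublin Core element from the harvested record and treat it as a URL) might look something like the sketch below. The sample record, the repository.example.org URL and the function name are invented for illustration; only the oai_dc and dc namespace URIs are standard. Which element actually carries a usable link varies between repositories, so the element name is left as a parameter, and a returned URL may still lead to an intermediary page rather than the object itself.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core element namespace used inside oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical oai_dc payload, as it might appear inside a GetRecord response.
SAMPLE_RECORD = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example thesis</dc:title>
  <dc:source>http://repository.example.org/theses/1234/thesis.pdf</dc:source>
  <dc:source>Printed copy, shelf 12</dc:source>
</oai_dc:dc>"""


def candidate_urls(record_xml, element="source"):
    """Return the values of dc:<element> that look like http(s) URLs."""
    root = ET.fromstring(record_xml)
    urls = []
    for node in root.iter("{%s}%s" % (DC_NS, element)):
        text = (node.text or "").strip()
        # Dublin Core values are free text, so keep only URL-shaped ones.
        if text.startswith(("http://", "https://")):
            urls.append(text)
    return urls


if __name__ == "__main__":
    print(candidate_urls(SAMPLE_RECORD))
```

A harvester would still need to check what each URL returns, for example by inspecting the Content-Type of the response, before assuming it has found the full text.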

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-26 Thread Joe Hourcle
On Feb 26, 2012, at 9:42 AM, Godmar Back wrote: May I ask a side question and make a side observation regarding the harvesting of full text of the object to which an OAI-PMH record refers? In general, is the idea to use the dc:source/text() element, treat it as a URL, and then expect to find

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-25 Thread Peter Noerr
@LISTSERV.ND.EDU] On Behalf Of Joe Hourcle Sent: Friday, February 24, 2012 10:20 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] Repositories, OAI-PMH and web crawling On Feb 24, 2012, at 9:25 AM, Kyle Banerjee wrote: One of the questions this raises is what we are/aren't allowed to do

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-24 Thread Kyle Banerjee
One of the questions this raises is what we are/aren't allowed to do in terms of harvesting full-text. While I realise we could get into legal stuff here, at the moment we want to put that question to one side. Instead we want to consider what Google, and other search engines, do, the

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-24 Thread Thomas Dowling
On 02/24/2012 09:25 AM, Kyle Banerjee wrote: We use OAI-PMH, and while we often see (usually general and sometimes contradictory) statements about what we can/can't do with the contents of a repository (or a specific record), it feels like there isn't a nice simple mechanism for a repository

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-24 Thread Owen Stephens
Thanks both... Kyle said: If someone goes to the trouble of making things available via a protocol that exists only to make things harvestable and then doesn't want it harvested, you can dismiss them ... True - but that's essentially what Southampton's configuration seems to say. Thomas said:

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-24 Thread Ian Ibbotson
Sorry.. late to the discussion... Isn't this a little apples and oranges? Surely robots.txt exists because many static resources are served directly from a tree structured filesystem? (Nearly) all OAI requests are responded to by specific service applications which are perfectly capable of
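For readers comparing the two mechanisms, this is a minimal sketch of how a harvester could consult robots.txt rules before requesting an OAI-PMH endpoint, using Python's standard urllib.robotparser. The rules, the /cgi/oai2 path, the repository.example.org host and the "ExampleHarvester" user agent are all assumptions made up for the example and do not describe any particular repository's configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that excludes an OAI-PMH endpoint path.
SAMPLE_ROBOTS_TXT = """User-agent: *
Disallow: /cgi/oai2
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    agent = "ExampleHarvester"
    for url in (
        "http://repository.example.org/cgi/oai2?verb=Identify",
        "http://repository.example.org/1234/",
    ):
        print(url, allowed(SAMPLE_ROBOTS_TXT, agent, url))
```

Whether an OAI-PMH harvester ought to honour robots.txt at all is exactly the point under discussion in this thread; the sketch only shows that the check itself is cheap to perform.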

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-24 Thread Ian Ibbotson
Grr.. :1,$s/work/word/g (I blame FRBR ;)) e On Fri, Feb 24, 2012 at 4:52 PM, Ian Ibbotson ian.ibbot...@k-int.com wrote: Sorry.. late to the discussion... Isn't this a little apples and oranges? Surely robots.txt exists because many static resources are served directly from a tree

Re: [CODE4LIB] Repositories, OAI-PMH and web crawling

2012-02-24 Thread Joe Hourcle
On Feb 24, 2012, at 9:25 AM, Kyle Banerjee wrote: One of the questions this raises is what we are/aren't allowed to do in terms of harvesting full-text. While I realise we could get into legal stuff here, at the moment we want to put that question to one side. Instead we want to consider