Thanks Jason and Ed,
I suspect within this project we'll keep using OAI-PMH because we've got tight
deadlines and the other project strands (which do stuff with the harvested
content) need time from the developer. At the moment it looks like we will
probably combine OAI-PMH with web crawling
Owen...
Just wanted to say that, whilst I've been silent since my initial response,
I'm not sure I agree with all the viewpoints presented here. From the point
of view of (for example) CultureGrid, I'm not sure what has been done could
have been pragmatically achieved solely with web crawling as
Thanks Ian,
Agree that it is clear from this discussion that there are differing viewpoints
and also very different requirements depending on the context and desired
outcomes.
I think I said earlier in the thread - I'm not against niche solutions, they
just make me want to double check that
IF your HTML includes embedded semantic data using HTML5 microdata or
RDFa or something similar (using a standard vocabulary -- the standard
for repositories seems to be DC-based, since that's often all you can
get out of OAI-PMH anyway) --- then web crawling combined with site maps
probably
On 26 Feb 2012, at 14:42, Godmar Back wrote:
May I ask a side question and make a side observation regarding the
harvesting of full text of the object to which an OAI-PMH record refers?
In general, is the idea to use the dc:source/text() element, treat it as
a URL, and then expect to find
On Mon, Feb 27, 2012 at 5:25 AM, Owen Stephens o...@ostephens.com wrote:
On 26 Feb 2012, at 14:42, Godmar Back wrote:
May I ask a side question and make a side observation regarding the
harvesting of full text of the object to which an OAI-PMH record refers?
In general, is the idea to
On Mon, Feb 27, 2012 at 5:25 AM, Owen Stephens o...@ostephens.com wrote:
This issue is certainly not unique to VT - we've come across this as part
of our project. While the OAI-PMH record may point at the PDF, it can also
point to an intermediary page. This seems to be standard practice in
On 27 Feb 2012, at 13:31, Diane Hillmann wrote:
On Mon, Feb 27, 2012 at 5:25 AM, Owen Stephens o...@ostephens.com wrote:
providers provide such intermediate pages (arxiv.org, for instance). The
other issue driving providers towards intermediate pages is that it allows
them to continue to
On Mon, Feb 27, 2012 at 8:31 AM, Diane Hillmann metadata.ma...@gmail.com wrote:
On Mon, Feb 27, 2012 at 5:25 AM, Owen Stephens o...@ostephens.com wrote:
This issue is certainly not unique to VT - we've come across this as part
of our project. While the OAI-PMH record may point at the PDF,
On Feb 27, 2012, at 10:51 AM, Godmar Back wrote:
On Mon, Feb 27, 2012 at 8:31 AM, Diane Hillmann
metadata.ma...@gmail.com wrote:
On Mon, Feb 27, 2012 at 5:25 AM, Owen Stephens o...@ostephens.com wrote:
This issue is certainly not unique to VT - we've come across this as part
of our project.
On Sun, Feb 26, 2012 at 3:42 PM, Godmar Back god...@gmail.com wrote:
May I ask a side question and make a side observation regarding the
harvesting of full text of the object to which an OAI-PMH record refers?
In Italy institutional repositories of theses are required to publish
metadata as
On Fri, Feb 24, 2012 at 6:50 AM, Owen Stephens o...@ostephens.com wrote:
One obvious alternative that I keep coming back to is 'forget OAI-PMH, just
crawl the web' ...
Owen,
I'd like to bring this back to your suggestion to just forget OAI-PMH
and crawl the web. I think that's probably the
for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
Joe Hourcle
Sent: Monday, February 27, 2012 10:43 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Repositories, OAI-PMH and web crawling
On Feb 27, 2012, at 10:51 AM, Godmar Back wrote:
On Mon, Feb 27, 2012 at 8:31 AM, Diane
On 24 Feb 2012, at 16:52, Ian Ibbotson wrote:
Sorry.. late to the discussion...
Isn't this a little apples and oranges?
Surely robots.txt exists because many static resources are served directly
from a tree structured filesystem?
(Nearly) all OAI requests are responded to by specific
On 24 Feb 2012, at 18:20, Joe Hourcle wrote:
On Feb 24, 2012, at 9:25 AM, Kyle Banerjee wrote:
I see it like the people who request that their pages not be cached elsewhere
-- they want to make their object 'discoverable', but they want to control
the access to those objects -- so it's one
@LISTSERV.ND.EDU] On Behalf Of Joe
Hourcle
Sent: Friday, February 24, 2012 10:20 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Repositories, OAI-PMH and web crawling
On Feb 24, 2012, at 9:25 AM, Kyle Banerjee wrote:
One of the questions this raises is what we are/aren't allowed to do
On 24 Feb 2012, at 18:20, Joe Hourcle wrote:
I see it like the people who request that their pages not be cached elsewhere
-- they want to make their object 'discoverable', but they want to control
the access to those objects -- so it's one thing for a search engine to get a
copy, but
May I ask a side question and make a side observation regarding the
harvesting of full text of the object to which an OAI-PMH record refers?
In general, is the idea to use the dc:source/text() element, treat it as
a URL, and then expect to find the object there (provided that there was a
suitable
On Feb 26, 2012, at 9:42 AM, Godmar Back wrote:
May I ask a side question and make a side observation regarding the
harvesting of full text of the object to which an OAI-PMH record refers?
In general, is the idea to use the dc:source/text() element, treat it as
a URL, and then expect to find
One of the questions this raises is what we are/aren't allowed to do in
terms of harvesting full-text. While I realise we could get into legal
stuff here, at the moment we want to put that question to one side. Instead
we want to consider what Google, and other search engines, do, the
On 02/24/2012 09:25 AM, Kyle Banerjee wrote:
We use OAI-PMH, and while we often see (usually general and sometimes
contradictory) statements about what we can/can't do with the contents of a
repository (or a specific record), it feels like there isn't a nice simple
mechanism for a repository
Thanks both...
Kyle said: If someone goes to the trouble of making things available via a
protocol that exists only to make things harvestable and
then doesn't want it harvested, you can dismiss them ...
True - but that's essentially what Southampton's configuration seems to say.
Thomas said:
Sorry.. late to the discussion...
Isn't this a little apples and oranges?
Surely robots.txt exists because many static resources are served directly
from a tree structured filesystem?
(Nearly) all OAI requests are responded to by specific service applications
which are perfectly capable of
Grr..
:1,$s/work/word/g
(I blame FRBR ;))
e
On Fri, Feb 24, 2012 at 4:52 PM, Ian Ibbotson ian.ibbot...@k-int.com wrote:
Sorry.. late to the discussion...
Isn't this a little apples and oranges?
Surely robots.txt exists because many static resources are served directly
from a tree
On Feb 24, 2012, at 9:25 AM, Kyle Banerjee wrote:
One of the questions this raises is what we are/aren't allowed to do in
terms of harvesting full-text. While I realise we could get into legal
stuff here, at the moment we want to put that question to one side. Instead
we want to consider