Re: [CODE4LIB] Fwd: [rules] Publication of the RDA Element Vocabularies

2014-01-25 Thread Robert Sanderson
On Sat, Jan 25, 2014 at 6:20 AM, Jon Phipps jphi...@madcreek.com wrote:

 On Fri, Jan 24, 2014 at 11:16 AM, Robert Sanderson azarot...@gmail.com
 wrote:

   All in my opinion, and all debatable. I hope that your choice goes well
   for you,
  
   I'd like to repeat: just because I agree with that choice, and I'm
   defending it here, it wasn't my choice. Not at all. And the concerns
 you
   express were well-aired and very carefully considered before the choice
  was
   made.
 
  And yours :)

 Ok, that makes me feel a bit personally defensive...



Apologies!  It was too much shorthand.  I meant that your concerns were
well-aired, and well explained, in this thread :)

Rob


Re: [CODE4LIB] Fwd: [rules] Publication of the RDA Element Vocabularies

2014-01-24 Thread Robert Sanderson
On Fri, Jan 24, 2014 at 7:56 AM, Jon Phipps jphi...@madcreek.com wrote:

 Hi Rob, the conversation continues below...

 On Thu, Jan 23, 2014 at 7:01 PM, Robert Sanderson azarot...@gmail.com
 wrote:

  Hi Jon,
 
  To present the other side of the argument so that others on the list can
  make an informed decision...
 

 Thanks for reminding me that this is an academic panel discussion in front
 of an audience, rather than a conversation.

 
  On Thu, Jan 23, 2014 at 4:22 PM, Jon Phipps jphi...@madcreek.com
 wrote:
 
   I've developed a quite strong opinion that vocabulary developers should
  not
   _ever_ think that they can understand the semantics of a vocabulary
   resource by 'reading' the URI.
 
 
  100% Agreed. Good documentation is essential for any ontology, and it has
  to be read to understand the semantics. You cannot just look at
  oa:hasTarget, out of context, and have any idea what it refers to.
 
  However if that URI is readable it makes developers' lives much easier in a
  lot of situations, and it has no additional cost. Opaque URIs for
  predicates are the digital equivalent of thumbing your nose at the people
  you should be courting -- the people who will actually use your ontology in
  any practical sense.  It says: "We don't care about you enough to make your
  life one step easier by having something that's memorable." You will always
  have to go back to the ontology every time and reread this documentation,
  over and over and over again.
 

 What you suggest is that an identifier (e.g. @azaroth42 or ORCID:
 -0003-4441-6852 https://orcid.org/-0003-4441-6852) should always
 be readable as a convenience to the developer. RDA does provide a 'readable
 in the language of the reader' uri specifically as a convenience to the
 developer. A feature that I lobbied for. It's just not the /canonical/ URI,
 because it's an identifier of a property, not the property itself, and that
 property is independent of the language used to label it.

 It's the difference between Metadata Management Associates, PO Box 282,
 Jacksonville, NY 14854, USA (for people) and 14854-0282 (a perfectly
 functional complete address in the USA namespace), which is precisely the
 same identifier of that box for machines, and ultimately for the
 postmaster, who doesn't care whose name is on the box numbered 282, who
 only needs to know that highly memorable name when someone uses the
 convenience of not bothering to look up the box number and just sends mail
 addressed to us at 14854, or even just Jacksonville. And no I don't want to
 start a URL vs. URI/URN/IRI discussion.

 
  Do you have some expectation that in order
   for the data to be useful your relational or object database identifiers
   must be readable?
 
 
  Identifiers for objects, no. The table names and field names? Yes. How many
  DBAs do you know that create tables with opaque identifiers for the column
  names?  How many XML schemas do you know that use opaque identifiers for
  the element names?
 
  My count is 0 from many many many instances.  And the reason is the same as
  having readable predicate URIs -- so that when you look at the table,
  schema, ontology, triple or what have you, there is some mnemonic value
  from the name to its intent.
 
  Our experience obviously differs in this regard. I've seen many, many
 databases that have relatively opaque column identifiers that were
 relabeled in the query to suit the audience for the query. I've seen many
 French databases, with French content, intended for a French audience,
 designed by French developers, that had French 'column headers'.

  The point here is that the identifiers /identify/ a property that exists
 independent of the language of the data being used to describe a resource.
 If RDA _had_ to pick a single language to satisfy your requirement for a
 single readable identifier, which one? To assume that the one language
 should be English says to the non-English-speaking world "We don't care
 about you enough to make your life one step easier by having something
 that's memorable".


 
   By whom, and in English? This to me is a frankly colonial
   assumption of the dominance of English in the world of metadata.
 
 
  In the world of computing in general: for, if, while ... all English.
  While there are Turing-complete languages out there, the ones that don't
  have real-world language constructions are toys, like Whitespace for
  example.  Even the lolcats programming language is more usable than
  Whitespace.
 
  Again, it's a cost/value consideration.  There are many people who will
  understand English, and when developers program, they're surrounded by it.
  If your intended audience is primarily people who speak French, then you
  would be entirely justified in using URIs with labels from French. Or
  Chinese, though the IRI expansion would be more of a pain :)
 
 
 
 Despite the fact that developers are surrounded by English I've worked with
 many highly skilled

Re: [CODE4LIB] Fwd: [rules] Publication of the RDA Element Vocabularies

2014-01-24 Thread Robert Sanderson
(Sorry for a previous empty message)

Hi Jon,

On Fri, Jan 24, 2014 at 7:56 AM, Jon Phipps jphi...@madcreek.com wrote:

 Hi Rob, the conversation continues below...

 On Thu, Jan 23, 2014 at 7:01 PM, Robert Sanderson azarot...@gmail.com
 wrote:
  To present the other side of the argument so that others on the list can
  make an informed decision...
 Thanks for reminding me that this is an academic panel discussion in front
 of an audience, rather than a conversation.


Heh :) I just meant that I wasn't trying to convince you to change, just
that I wanted to voice my concerns.
(But, yes, touché!)


 On Thu, Jan 23, 2014 at 4:22 PM, Jon Phipps jphi...@madcreek.com wrote:
 
  However if that URI is readable it makes developers' lives much easier in a
  lot of situations, and it has no additional cost. Opaque URIs for
  predicates are the digital equivalent of thumbing your nose at the people
  you should be courting

What you suggest is that an identifier (e.g. @azaroth42 or ORCID:
 -0003-4441-6852 https://orcid.org/-0003-4441-6852) should always
 be readable as a convenience to the developer.


Those are identifiers for objects or entities, not predicates.  As I said,
I'm happy for entities to have opaque URIs.  Where we disagree is whether you
can carry over that same rationale to predicates/properties/relationships.


RDA does provide a 'readable
 in the language of the reader' uri specifically as a convenience to the
 developer. A feature that I lobbied for. It's just not the /canonical/ URI,
 because it's an identifier of a property, not the property itself, and that
 property is independent of the language used to label it.


So this, IMO, is where the trouble starts.  People /will/ use those
convenience URIs. And that will make for a nightmare in terms of
interoperability (see below).



 It's the difference between Metadata Management Associates, PO Box 282,
 Jacksonville, NY 14854, USA (for people) and 14854-0282 (a perfectly
 functional complete address in the USA namespace), which is precisely the
 same identifier of that box for machines


Which is also an entity, not a predicate. I almost said property there,
which would be amusingly incorrect.



  Do you have some expectation that in order
   for the data to be useful your relational or object database identifiers
   must be readable?
 
  Identifiers for objects, no. The table names and field names? Yes. How many
  DBAs do you know that create tables with opaque identifiers for the column
  names?  How many XML schemas do you know that use opaque identifiers for
  the element names?
 
  My count is 0 from many many many instances.  And the reason is the same as
  having readable predicate URIs -- so that when you look at the table,
  schema, ontology, triple or what have you, there is some mnemonic value
  from the name to its intent.
 
  Our experience obviously differs in this regard. I've seen many, many
 databases that have relatively opaque column identifiers that were
 relabeled in the query to suit the audience for the query. I've seen many
 French databases, with French content, intended for a French audience,
 designed by French developers, that had French 'column headers'.


Yes, but French column headers are not opaque. How many schemas have
completely opaque, non-linguistic column headers, element names, etc?
I'm not talking relatively opaque, I mean P12345 or similar. I didn't
count MARC in my 0, which is strictly true as it's not XML or a relational
table, but you could say 1 to be fair.

Yes, sometimes they're PrpCtr or similar, but that's at least somewhat
readable (Property Counter, perhaps?) compared to a UUID or random integer.


The point here is that the identifiers /identify/ a property that exists
 independent of the language of the data being used to describe a resource.
 If RDA _had_ to pick a single language to satisfy your requirement for a
 single readable identifier, which one? To assume that the one language
 should be English says to the non-English-speaking world "We don't care
 about you enough to make your life one step easier by having something
 that's memorable".


My problem is not with the idea that properties exist independently of
language, it's the side effect of not picking a language to use.  If you
had to pick one, then you should pick one.  If you want to make a political
stand, don't pick English. But at least pick one, and only one.

Not caring about the non-English speaking world is at least caring about
some people, rather than no one.  Or the non-French speaking world.


Despite the fact that developers are surrounded by English I've worked with
 many highly skilled developers who didn't speak or read English. Who relied
 on documentation and meetings in their own language.


Likewise, though admittedly primarily European languages rather than Asian.
 However even if someone doesn't speak English (or Italian, or French, or
German), a language-based construct is more memorable than

Re: [CODE4LIB] Fwd: [rules] Publication of the RDA Element Vocabularies

2014-01-23 Thread Robert Sanderson
Hi Jon,

To present the other side of the argument so that others on the list can
make an informed decision...

On Thu, Jan 23, 2014 at 4:22 PM, Jon Phipps jphi...@madcreek.com wrote:

 I've developed a quite strong opinion that vocabulary developers should not
 _ever_ think that they can understand the semantics of a vocabulary
 resource by 'reading' the URI.


100% Agreed. Good documentation is essential for any ontology, and it has
to be read to understand the semantics. You cannot just look at
oa:hasTarget, out of context, and have any idea what it refers to.

However if that URI is readable it makes developers' lives much easier in a
lot of situations, and it has no additional cost. Opaque URIs for
predicates are the digital equivalent of thumbing your nose at the people
you should be courting -- the people who will actually use your ontology in
any practical sense.  It says: "We don't care about you enough to make your
life one step easier by having something that's memorable." You will always
have to go back to the ontology every time and reread this documentation,
over and over and over again.

Do you have some expectation that in order
 for the data to be useful your relational or object database identifiers
 must be readable?


Identifiers for objects, no. The table names and field names? Yes. How many
DBAs do you know that create tables with opaque identifiers for the column
names?  How many XML schemas do you know that use opaque identifiers for
the element names?

My count is 0 from many many many instances.  And the reason is the same as
having readable predicate URIs -- so that when you look at the table,
schema, ontology, triple or what have you, there is some mnemonic value
from the name to its intent.


 By whom, and in English? This to me is a frankly colonial
 assumption of the dominance of English in the world of metadata.


In the world of computing in general: for, if, while ... all English.
While there are Turing-complete languages out there, the ones that don't
have real-world language constructions are toys, like Whitespace for
example.  Even the lolcats programming language is more usable than
Whitespace.

Again, it's a cost/value consideration.  There are many people who will
understand English, and when developers program, they're surrounded by it.
If your intended audience is primarily people who speak French, then you
would be entirely justified in using URIs with labels from French. Or
Chinese, though the IRI expansion would be more of a pain :)



 The proper
 understanding of the semantics, although still relatively minimal, is from
 the definition, not the URI.


Yes. Any short cuts to *understanding* rather than *remembering* are to be
avoided.



 Our coining and inclusion of multilingual
 (eventually) lexical URIs based on the label is a concession to developers
 who feel that they can't effectively 'use' the vocabularies unless they can
 read the URIs.


So in my opinion, as is everything in the mail of course, this is even
worse. Now instead of 1600 properties, you have 1600 * (number of languages
+1) properties. And you're going to see them appearing in uses of the
ontology. Either stick with your opaque identifiers or pick a language for
the readable ones, and best practice would be English, but doing both is a
disaster in the making.



  I grant that writing ad
 hoc sparql queries with opaque URIs can be intensely frustrating, but the
 vocabularies aren't designed specifically to support that incredibly narrow
 use case.


Writing queries is something developers have to do to work with data.  More
importantly, writing code that builds the triples in the first place is
something that developers have to do. And they have to get it right ...
which they likely won't do the first time. There will be typos. That P1523235
might be written into the code as P1533235 ... an impossible-to-spot typo.
dc:title vs dc:titel ... a bit easier to spot, no?

So the consequence is that the quality of the uses of your ontology will go
down.  If there were 16 fields, maybe there'd be a chance of getting it
right. But 1600, with 5 digit identifiers, is asking for trouble.
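
To make the typo point concrete, here is a small Python/rdflib sketch; the
sample data and the use of rdaa:P50033 are purely illustrative, not real RDA
data:

# Sketch only: the data below is made up, and rdaa:P50033 stands in for
# "some opaque property number".
from rdflib import Graph

data = """
@prefix dc:   <http://purl.org/dc/elements/1.1/> .
@prefix rdaa: <http://rdaregistry.info/Elements/a/> .

<http://example.org/thing/1> dc:title "An example title" ;
    rdaa:P50033 "An example designation" .
"""

g = Graph()
g.parse(data=data, format="turtle")

# Readable predicate: a slip like dc:titel instead of dc:title is easy to
# spot when reading the query back.
readable = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?title WHERE { ?thing dc:title ?title }
"""

# Opaque predicate: transpose two digits and the query still parses and
# runs, it just silently returns an empty result set.
opaque = """
PREFIX rdaa: <http://rdaregistry.info/Elements/a/>
SELECT ?value WHERE { ?thing rdaa:P50033 ?value }
"""

for row in g.query(readable):
    print(row.title)
for row in g.query(opaque):
    print(row.value)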

Compare MARC fields. We all love our 245$a, I know, but dc:title is a lot
easier to recall. Now imagine those fields are (seemingly) random 5 digit
codes without significant structure. And that there's 1600 of them. And
you're asking the developer to use a graph structure that's likely
unfamiliar to them.

All in my opinion, and all debatable. I hope that your choice goes well for
you, but would like other people to think about it carefully before
following suit.

Rob


Re: [CODE4LIB] Fwd: [rules] Publication of the RDA Element Vocabularies

2014-01-22 Thread Robert Sanderson
P166123464771

And now no one understands at all.  CIDOC-CRM has taken the same approach
-- it's better that everyone is equal in their non-comprehension than
people who speak a particular language are somehow advantaged.

BTW, as an English speaker, I also don't understand "other designation
associated with the corporate body", regardless of spaces or camelCase.
Labels and semantic descriptions are *always* important.

The "we might change what this means" argument is also problematic -- if
you change what it means, then you should change the URI! Otherwise people
will continue to use them incorrectly, plus the legacy data generated with
the previous definition will suddenly change what it's saying.

Finally, 1600 properties... good luck with that.

Rob



On Wed, Jan 22, 2014 at 3:03 PM, Hamilton, Gill g.hamil...@nls.uk wrote:

 Je ne comprends pas l'anglais. [I don't understand English.]
 Je ne comprends pas l'URI otherDesignationAssociatedWithTheCorporateBody
 [I don't understand the URI otherDesignationAssociatedWithTheCorporateBody.]

 私は日本人です。私は理解していない、そのURI [I am Japanese. I do not understand that URI.]

 Opaque URIs with human-readable labels help in an international context.

 Just my two yen's worth :)
 G

 -
 Gill Hamilton
 Digital Access Manager
 National Library of Scotland
 George IV Bridge
 Edinburgh EH1 1EW, Scotland
 e: g.hamil...@nls.uk
 t: +44 (0)131 623 3770
 Skype: gill.hamilton.nls

 
 From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Dan
 Scott [deni...@gmail.com]
 Sent: 22 January 2014 21:10
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Fwd: [rules] Publication of the RDA Element
 Vocabularies

 Hi Karen:

 On Wed, Jan 22, 2014 at 3:16 PM, Karen Coyle li...@kcoyle.net wrote:
  I can't address the first points, but I can speak a bit to the question
 of
  meaningful URIs. In the original creation of the RDA elements,
 meaningful
  URIs were used based on the actual RDA terminology. This resulted in URIs
  like:
 
 
 http://rdvocab.info/Elements/alternativeChronologicalDesignationOfLastIssueOrPartOfSequence
 
  and...
 
 
 http://rdvocab.info/Elements/alternativeChronologicalDesignationOfLastIssueOrPartOfSequenceManifestation
 
  Not only that, the terminology for some elements changed over time,
 which in
  some cases meant deprecating a property that was then overly confusing
 based
  on its name.
 
  Now, I agree that one possibility would have been for the JSC to develop
  meaningful but reasonably short property names. Another possibility is
 that
  we cease looking at URIs and begin to work with labels, since URIs are
 for
  machines and labels are for humans. Unfortunately, much RDF software
 still
  expects you to work with the underlying URI rather than the human-facing
  label. We need to get through that stage as quickly as possible, because
  it's causing us to put effort into URI naming that would be best used
 for
  other analysis activities.

 Thanks for responding on this front. I understand that, while the
 vocabulary was in heavy active development it might have been painful
 to adjust as elements changed, but given that this marks the actual
 publication of the vocabulary, that churn should have settled down,
 and then this part of the JSC's contribution to semantic web could
 have semantics applied at both the micro and macro level.

 I guess I see URIs as roughly parallel to API names; as long as humans
 are assembling programs, we're likely to benefit from having
 meaningful (no air quotes required) names... even if sometimes the
 meaning drifts over time and the code and APIs need to be refactored.
 Dealing with sequentially numbered alphanumeric identifiers reminds me
 rather painfully of MARC.

 For what it's worth (and it might not be worth much) "curl
 http://rdaregistry.info/Elements/a/P50101 | grep reg:name | sort |
 uniq -c" shows that the reg:name property is unique across all of the
 agent properties, at least. Remnants of the earlier naming effort? If
 that pattern holds, those could have been simply used for the
 identifiers in place of P#. The most unwieldy of those appears
 to be otherDesignationAssociatedWithTheCorporateBody (which _is_
 unwieldy, certainly, but still more meaningful than
 http://rdaregistry.info/Elements/a/P50033).

 Perhaps it's not too late?



Re: [CODE4LIB] rdf ontologies for archival descriptions

2014-01-20 Thread Robert Sanderson
Have you considered the LOCAH work in mapping EAD into Linked Data?

http://archiveshub.ac.uk/locah/
and
http://data.archiveshub.ac.uk/

Rob




On Sun, Jan 19, 2014 at 5:10 PM, Ben Companjen
ben.compan...@dans.knaw.nlwrote:

 Hi Eric,

 While I'm no archivist by training (information systems engineer I am),
 I've learned a thing or two from having to work with EAD and its basis for
 use, ISAD(G) (all citations below are from ISAD(G), 2nd edition). As with
 all information modelling, either inside or outside the Linked Data
 domain, you should take a step back to look at the goal of the
 description. When you have a list of what you want to describe, you can
 start looking for ontologies.

 You probably know this, but I was triggered to respond by "Because many
 archival descriptions are rooted in MARC records, and MODS is easily mapped
 from MARC." IMO archival descriptions are rooted in rules for description,
 not a specific file format.

 So, when I think of (some of) the essences of archival description, I think of:

 - "The purpose of archival description is to identify and explain the
 context and content of archival material in order to promote its
 accessibility. This is achieved by creating accurate and appropriate
 representations and by organizing them in accordance with predetermined
 models." (§I.2)
 - … seven areas of descriptive information:
   1. Identity Statement Area
  (where essential information is conveyed to identify the unit of
 description)
   2. Context Area
  (where information is conveyed about the origin and custody of the
 unit of description)
   3. Content and Structure Area
  (where information is conveyed about the subject matter and
 arrangement of the unit of description)
   4. Condition of Access and Use Area
  (where information is conveyed about the availability of the unit of
 description)
   5. Allied Materials Area
  (where information is conveyed about materials having an important
 relationship to the unit of description)
   6. Note Area
  (where specialized information and information that cannot be
 accommodated in any of the other areas may be conveyed).
   7. Description Control Area
  (where information is conveyed on how, when and by whom the archival
 description was prepared). (§I.11)



 There is a distinction between the thing being described, and the
 description itself, and both have an important role within the archival
 description. (If anything so far causes confusion with anyone here, I
 misunderstood and accept to be corrected :))
 NB: this is one way of thinking of descriptions. Incorporating the
 PROV-ontology would make sense for expressing more/other aspects of the
 provenance of archival entities, but I haven't got round to becoming an
 expert of PROV yet ;)


 ISAD(G) lists 26 elements that may be combined to constitute the
 description of an archival entity.

 Trying to translate these 'elements', I'd end up with possibly a lot more
 than 26 RDFS/OWL properties.
 *Depending on the type of archival entity you can/should of course use
 more specific ontologies.*



 Let me list some properties and related ontologies.





 # Identity statement area

 ## Identifiers
 The URI, naturally, and other IDs. Could be linked using
 dc(terms):identifier, or mods:identifier, or other ontologies. Ideally
 there is some way of linking the domain of the ID to the ID itself,
 because box 101 is likely not unique in the universe. Perhaps you want
 to publish a URI strategy separately to explain how the URI was
 assembled/derived.

 ## Title
 Again DC(terms), MODS, RDA

 ## Date(s)
 You want properties that have a clear meaning. For example,
 dcterms:created and mods:dateCreated assume it is clear what "when the
 resource was created" means. DC terms are vague, I mean general, on
 purpose. You could create some properties `owl:subPropertyOf` dcterms date
 properties for this.
 I'd look into EDTF for encoding uncertain dates and ranges and BCE dates
 (MODS doesn't support BCE dates).

 ## Level of description
 What kind of 'documentary unit' does the description describe? A whole
 building's content or one piece of paper? I don't know of any ontology
 with terms "fonds", …, "file", "item", but you could say `<http URI>
 rdf:type <fonds class URI>`.
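
 A minimal sketch of that kind of statement with rdflib; the base URI and the
 level class below are made-up placeholders, not a published ontology:

 # Sketch: typing a unit of description and attaching an identifier and a
 # title. All URIs and values here are hypothetical.
 from rdflib import Graph, Literal, Namespace, URIRef
 from rdflib.namespace import DCTERMS, RDF

 LEVEL = Namespace("http://example.org/levels#")        # hypothetical level classes

 g = Graph()
 unit = URIRef("http://example.org/archive/fonds/123")  # hypothetical unit of description

 g.add((unit, RDF.type, LEVEL.Fonds))                   # level of description
 g.add((unit, DCTERMS.identifier, Literal("EX-123")))   # identity statement area
 g.add((unit, DCTERMS.title, Literal("Papers of an example family")))

 print(g.serialize(format="turtle"))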

 ## Extent and medium
 Saying anything about extent and medium should possibly only happen on the
 lowest level of description. Any higher-level extent and medium should be
 calculated by aggregating lower-level descriptions.
 On the lowest level, refer to class URIs. A combination of dimensions and
 material {c|sh}ould be a class, e.g. "A4 paper, 80 grams/square meter".

 # Context area

 ## Creator(s) and administrative/biographical history
 As ISAD(G) refers to ISAAR(CPF) for description of corporate bodies,
 people, and families, this is a perfect example of using existing people-
 and organisation-describing ontologies like FOAF, BIO, ORG, and others are
 useful for separate descriptions of the 

Re: [CODE4LIB] archiving web pages

2014-01-14 Thread Robert Sanderson
For what it's worth, the latest wayback code is:

https://github.com/iipc/openwayback

And being developed by the IIPC consortium, rather than just the Internet
Archive alone.
It has many additional features, contributed by other members.

It should be used in preference to the sourceforge version, IMO.

Rob




On Tue, Jan 14, 2014 at 10:00 AM, L Snider lsni...@gmail.com wrote:

 Hi Kathryn,

 Right now the WARC format is considered the best preservation format for
 websites/social media, in terms of digital archives. It is our best guess
 right now. It will likely be with us for a long time, because it has
 been adopted by most of the major players.

 The way I have seen WARCs served up is through Wayback, the manual version
 of the Internet Archive's Wayback machine.
 http://archive-access.sourceforge.net/projects/wayback/index.html

 I have only used Heritrix and Wayback together, so I haven't played with
 Wayback and WARCs made another way.

 I would stick with WARC in terms of preservation, access is another
 story...that would depend on budget, time, etc.

 Hope that helps.

 Cheers

 Lisa
 --
 Lisa Snider
 Electronic Records Archivist
 Harry Ransom Center
 The University of Texas at Austin
 P.O. Box 7219
 Austin, Texas 78713-7219
 P: 512-232-4616
 www.hrc.utexas.edu



 On Tue, Jan 14, 2014 at 10:48 AM, Kathryn Frederick (Library) 
 kfred...@skidmore.edu wrote:

  Hi,
  I'm trying to develop a strategy for preserving issues of our school's
  online newspaper. Creating a WARC file of the content seems straightforward,
  but how will that content fare long-term? Also, how is the WARC served to an
  end-user? Is there some other method I should look at?
  Thanks in advance for any advice!
  Kathryn
 



Re: [CODE4LIB] archiving web pages

2014-01-14 Thread Robert Sanderson
Here are several to consider:

*
http://www.webarchive.org.uk/wayback/archive/*/http://www.aboutmayfair.co.uk/
*
http://webarchive.loc.gov/lcwa0015/*/http://lawprofessors.typepad.com/adminlaw/
* http://www.padi.cat:8080/wayback/*/http://www.ajberga.cat/
* http://vefsafn.is/index.php?page=english


Hope that helps :)

Rob






On Tue, Jan 14, 2014 at 10:31 AM, Nathan Tallman ntall...@gmail.com wrote:

 Lisa,

 Is your local web archive available online? I'd like to see a production
 example of non-Internet Archive instance of Wayback/Open Wayback.

 Thanks,
 Nathan


 On Tue, Jan 14, 2014 at 12:17 PM, L Snider lsni...@gmail.com wrote:

  Rob is right on! I included the wrong link, thanks for catching that...
 
  Cheers
 
  Lisa
 
 
  On Tue, Jan 14, 2014 at 11:04 AM, Robert Sanderson azarot...@gmail.com
  wrote:
 
   For what it's worth, the latest wayback code is:
  
   https://github.com/iipc/openwayback
  
   And being developed by the IIPC consortium, rather than just the
 Internet
   Archive alone.
   It has many additional features, contributed by other members.
  
   It should be used in preference to the sourceforge version, IMO.
  
   Rob
  
  
  
  
   On Tue, Jan 14, 2014 at 10:00 AM, L Snider lsni...@gmail.com wrote:
  
Hi Kathryn,
   
Right now the WARC format is considered the best preservation format
  for
websites/social media, in terms of digital archives. It is our best
  guess
right now. It will likely will be with us for a long time, because it
  has
been adopted by most of the major players.
   
The way I have seen WARCs served up is through Wayback, the manual
   version
of the Internet Archive's Wayback machine.
http://archive-access.sourceforge.net/projects/wayback/index.html
   
I have only used Heritrix and Wayback together, so I haven't played
  with
Wayback and WARCs made another way.
   
I would stick with WARC in terms of preservation, access is another
story...that would depend on budget, time, etc.
   
Hope that helps.
   
Cheers
   
Lisa
--
Lisa Snider
Electronic Records Archivist
Harry Ransom Center
The University of Texas at Austin
P.O. Box 7219
Austin, Texas 78713-7219
P: 512-232-4616
www.hrc.utexas.edu
   
   
   
On Tue, Jan 14, 2014 at 10:48 AM, Kathryn Frederick (Library) 
kfred...@skidmore.edu wrote:
   
 Hi,
 I'm trying to develop a strategy for preserving issues our school's
online
 newspaper. Creating a WARC file of the content seems
 straightforward,
   but
 how will that content fair long-term? Also, how is the WARC served
 to
   an
 end-user? Is there some other method I should look at?
 Thanks in advance for any advice!
 Kathryn

   
  
 



Re: [CODE4LIB] The lie of the API

2013-12-02 Thread Robert Sanderson
Hi Richard,

On Sun, Dec 1, 2013 at 4:25 PM, Richard Wallis 
richard.wal...@dataliberate.com wrote:

   It's harder to implement Content Negotiation than your own API, because you
   get to define your own API whereas you have to follow someone else's rules

  Don't wish your implementation problems on the consumers of your data.
  There are [you would hope] far more of them than of you ;-)

  Content-negotiation is an already established mechanism - why invent a
  new, and different, one just for *your* data?


I should have been clearer here that I was responding to the original blog
post.  I'm not advocating arbitrary APIs, but instead just to use link
headers between the different representations.

The advantages are that the caching issues (both browser and intermediate
caches) go away as the content is static, you don't need to invent a way to
find out which formats are available (eg no arbitrary content in a 300
response), and you can simply publish the representations as any other
resource without server side logic to deal with conneg.

The disadvantages are ... none.  There's no invention of APIs, it's just
following a simpler route within the HTTP spec.
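
As an illustration, a client that finds the JSON alternate from a Link header
needs only a couple of lines. A sketch: the URI and the rel="alternate" link
are hypothetical, and the requests library exposes parsed Link headers as
response.links.

# Sketch: fetch a resource, then follow its Link header to a JSON alternate.
# The URI and the presence of a rel="alternate" link are assumptions.
import requests

resp = requests.get("http://example.org/record/1")

# requests parses the Link header into a dict keyed by rel value.
alternate = resp.links.get("alternate")
if alternate:
    data = requests.get(alternate["url"]).json()
    print(data)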


Put your self in the place of your consumer having to get their head
 around yet another site specific API pattern.


As a consumer of my own data, I would rather do a simple GET on a URI than
mess around constructing the correct Accept header.



 As to discovering then using the (currently implemented) URI returned from
 a content-negotiated call  - The standard http libraries take care of that,
 like any other http redirects (301,303, etc) plus you are protected from
 any future backend server implementation changes.


No they don't, as there's no way to know which representations are
available via conneg, and hence no automated way to construct the Accept
header.

Rob


Re: [CODE4LIB] The lie of the API

2013-12-02 Thread Robert Sanderson
On Sun, Dec 1, 2013 at 5:57 PM, Barnes, Hugh hugh.bar...@lincoln.ac.nzwrote:

 +1 to all of Richard's points here. Making something easier for you to
 develop is no justification for making it harder to consume or deviating
 from well supported standards.


I'm not suggesting deviating from well supported standards, I'm suggesting
choosing a different approach within the well supported standard that makes
it easier for both consumer and producer.



 [Robert]
   You can't
  just put a file in the file system, unlike with separate URIs for
  distinct representations where it just works, instead you need server
  side processing.

 If we introduce languages into the negotiation, this won't scale.


Sure, there's situations where the number of variants is so large that
including them all would be a nuisance.  The number of times this actually
happens is (in my experience at least) vanishingly small.  Again, I'm not
suggesting an arbitrary API, I'm saying that there's easier ways to
accomplish the 99% of cases than conneg.



 [Robert]
  This also makes it much harder to cache the
  responses, as the cache needs to determine whether or not the
  representation has changed -- the cache also needs to parse the
  headers rather than just comparing URI and content.

 Don't know caches intimately, but I don't see why that's algorithmically
 difficult. Just look at the Content-type of the response. Is it harder for
 caches to examine headers than content or URI? (That's an earnest, perhaps
 naïve, question.)

 If we are talking about caching on the client here (not caching proxies),
 I would think in most cases requests are issued with the same Accept-*
 headers, so caching will work as expected anyway.


I think Joe already discussed this one, but there's an outstanding conneg
caching bug in Firefox and it took even Squid a long time to implement
content-negotiation-aware caching.  Also note, "much harder", not
"impossible" :)

No Conneg:
* Check if we have the URI. Done. O(1) as it's a hash.

Conneg:
* Check if we have the URI. Parse the Accept headers from the request.
 Check if they match the cached content and don't contain wildcards.
 O(quite a lot more than 1)
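
Roughly, in code; an illustrative sketch, not any particular cache's
implementation, and it even ignores q-value ordering, which a real cache
would also have to honour:

# Sketch of the two lookup strategies. The cache is a plain dict.

def lookup_without_conneg(cache, uri):
    # The URI alone is the key: a single hash lookup.
    return cache.get(uri)

def lookup_with_conneg(cache, uri, accept_header):
    # The Accept header has to be parsed, wildcards ruled out, and each
    # acceptable type compared against the variants stored for this URI.
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip()
        if "*" in media_type:
            return None  # can't be sure any cached variant is the right one
        hit = cache.get((uri, media_type))
        if hit is not None:
            return hit
    return None

cache = {("http://example.org/doc", "application/json"): b'{"ok": true}'}

# Header order, not q order, decides here: a correct cache would prefer
# text/html (implicit q=1.0) and treat this request as a miss.
print(lookup_with_conneg(cache, "http://example.org/doc",
                         "application/json;q=0.9, text/html"))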



 [Robert]
  Link headers
  can be added with a simple apache configuration rule, and as they're
  static are easy to cache. So the server side is easy, and the client
 side is trivial.

 Hadn't heard of these. (They are on Wikipedia so they must be real.) What
 do they offer over HTML link elements populated from the Dublin Core
 Element Set?


Nothing :) They're link elements in a header, so you can use them in
non-HTML representations.


 My whatever it's worth ... great topic, though, thanks Robert :)


Welcome :)

Rob


Re: [CODE4LIB] The lie of the API

2013-12-02 Thread Robert Sanderson
To be (more) controversial...

If it's okay to require headers, why can't API keys go in a header rather
than the URL?
Then it's just the same as content negotiation, it seems to me. You send a
header and get a different response from the same URI.
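
In code the two look identical; a sketch, where the endpoint and the
X-API-Key header name are made-up examples, since key header names vary by
service:

# Sketch: an API key sent as a request header instead of a URL parameter.
# The URL and the X-API-Key header name are hypothetical.
import requests

resp = requests.get(
    "http://example.org/api/records/1",
    headers={
        "Accept": "application/json",   # conneg: choose a representation
        "X-API-Key": "my-secret-key",   # identification, same mechanism
    },
)
print(resp.status_code)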

Rob



On Mon, Dec 2, 2013 at 10:57 AM, Edward Summers e...@pobox.com wrote:

 On Dec 3, 2013, at 4:18 AM, Ross Singer rossfsin...@gmail.com wrote:
  I'm not going to defend API keys, but not all APIs are open or free.  You
  need to have *some* way to track usage.

 A key (haha) thing that keys also provide is an opportunity to have a
 conversation with the user of your api: who are they, how could you get in
 touch with them, what are they doing with the API, what would they like to
 do with the API, what doesn’t work? These questions are difficult to ask if
 they are just an IP address in your access log.

 //Ed



Re: [CODE4LIB] The lie of the API

2013-11-29 Thread Robert Sanderson
(posted in the comments on the blog and reposted here for further
discussion, if there's interest)


While I couldn't agree more with the post's starting point -- URIs identify
(concepts) and use HTTP as your API -- I couldn't disagree more with the
"use content negotiation" conclusion.

I'm with Dan Cohen in his comment regarding using different URIs for
different representations for several reasons below.

It's harder to implement Content Negotiation than your own API, because you
get to define your own API whereas you have to follow someone else's rules
when you implement conneg.  You can't get your own API wrong.  I agree with
Ruben that HTTP is better than rolling your own proprietary API, we
disagree that conneg is the correct solution.  The choice is between conneg
or regular HTTP, not conneg or a proprietary API.

Secondly, you need to look at the HTTP headers and parse quite a complex
structure to determine what is being requested.  You can't just put a file
in the file system, unlike with separate URIs for distinct representations
where it just works, instead you need server side processing.  This also
makes it much harder to cache the responses, as the cache needs to
determine whether or not the representation has changed -- the cache also
needs to parse the headers rather than just comparing URI and content.  For
large scale systems like DPLA and Europeana, caching is essential for
quality of service.

How do you find out which formats are supported by conneg? By reading the
documentation. Which could just say "add .json on the end". The Vary header
tells you that negotiation in the format dimension is possible, just not
what to do to actually get anything back. There isn't a way to find this
out from HTTP automatically, so now you need to read both the site's docs
AND the HTTP docs.  APIs can, on the other hand, do this.  Consider
OAI-PMH's ListMetadataFormats and SRU's Explain response.

Instead you can have a separate URI for each representation and link them
with Link headers, or just a simple rule like add '.json' on the end. No
need for complicated content negotiation at all.  Link headers can be added
with a simple apache configuration rule, and as they're static are easy to
cache. So the server side is easy, and the client side is trivial.
 Compared to being difficult at both ends with content negotiation.

It can be useful to make statements about the different representations,
and especially if you need to annotate the structure or content.  Or share
it -- you can't email someone a link that includes the right Accept headers
to send -- as in the post, you need to send them a command line like curl
with -H.

An experiment for fans of content negotiation: Have both .json and 302
style conneg from your original URI to that .json file. Advertise both. See
how many people do the conneg. If it's non-zero, I'll be extremely
surprised.

And a challenge: Even with libraries there's still complexity to figuring
out how and what to serve. Find me sites that correctly implement *-based
fallbacks. Or even process q values. I'll bet I can find 10 that do content
negotiation wrong for every 1 that does it correctly.  I'll start:
dx.doi.org touts its content negotiation for metadata, yet doesn't
implement q values or *s. You have to go to the documentation to figure out
what Accept headers it will do string equality tests against.
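
To make that concrete, here is roughly what a server has to get right just to
honour q values and * fallbacks; a deliberately minimal sketch, not a
conformant implementation (it skips parameters, specificity tie-breaking, and
malformed headers):

# Sketch of server-side Accept matching with q values and wildcard fallbacks.
AVAILABLE = ["application/rdf+xml", "application/json", "text/html"]

def parse_accept(header):
    # "type;q=0.5, type2" -> [(type, q), ...] sorted best-first.
    prefs = []
    for part in header.split(","):
        bits = [b.strip() for b in part.split(";")]
        q = 1.0
        for p in bits[1:]:
            if p.startswith("q="):
                try:
                    q = float(p[2:])
                except ValueError:
                    q = 0.0
        prefs.append((bits[0], q))
    return sorted(prefs, key=lambda pq: pq[1], reverse=True)

def matches(pattern, media_type):
    if pattern == "*/*":
        return True
    ptype, _, psub = pattern.partition("/")
    mtype, _, msub = media_type.partition("/")
    return ptype == mtype and (psub == "*" or psub == msub)

def negotiate(accept_header):
    for pattern, q in parse_accept(accept_header):
        if q == 0:
            continue
        for available in AVAILABLE:
            if matches(pattern, available):
                return available
    return None  # 406 Not Acceptable

print(negotiate("text/*;q=0.5, application/json"))  # -> application/json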

Rob



On Fri, Nov 29, 2013 at 6:24 AM, Seth van Hooland svhoo...@ulb.ac.be
wrote:

 Dear all,

 I guess some of you will be interested in the blogpost of my colleague
and co-author Ruben regarding the misunderstandings on the use and abuse of
APIs in a digital libraries context, including a description of both good
and bad practices from Europeana, DPLA and the Cooper Hewitt museum:

 http://ruben.verborgh.org/blog/2013/11/29/the-lie-of-the-api/

 Kind regards,

 Seth van Hooland
 Président du Master en Sciences et Technologies de l'Information et de la
Communication (MaSTIC)
 Université Libre de Bruxelles
 Av. F.D. Roosevelt, 50 CP 123  | 1050 Bruxelles
 http://homepages.ulb.ac.be/~svhoolan/
 http://twitter.com/#!/sethvanhooland
 http://mastic.ulb.ac.be
 0032 2 650 4765
 Office: DC11.102


Re: [CODE4LIB] Loris

2013-11-09 Thread Robert Sanderson
Hi Andrew,

Not exactly sure what sort of differences you're after...

Do you mean the difference between this:
http://iipimage.sourceforge.net/documentation/protocol/
(and its 74-page reference: http://iipimage.sourceforge.net/IIPv105.pdf )

And this:
  http://www-sul.stanford.edu/iiif/image-api/1.1/

?

Rob




On Fri, Nov 8, 2013 at 10:58 PM, Andrew Hankinson 
andrew.hankin...@gmail.com wrote:

 So what’s the difference between IIIF and IIP? (the protocol, not the
 server implementation)

 -Andrew

 On Nov 8, 2013, at 9:05 PM, Jon Stroop jstr...@princeton.edu wrote:

  It aims to do the same thing...serve big JP2s (and other images) over
 the web, so from that perspective, yes. But, beyond that, time will tell.
 One nice thing about coding against a well-thought-out spec is that there are
 lots of implementations from which you can choose[1]--though as far as I
 know Loris is the only one that supports the IIIF syntax natively (maybe
 IIP?). We still have Djatoka floating around in a few places here, but, as
 many people have noted over the years, it takes a lot of shimming to scale
 it up, and, as far as I know, the project has more or less been abandoned.
 
  I haven't done too much in the way of benchmarking, but to date don't
 have any reason to think Loris can't perform just as well. The demo I sent
 earlier is working against a very large jp2 with small tiles[1] which means
 a lot of rapid hits on the server, and between that, (a little bit of)
 JMeter and ab testing, and a fair bit of concurrent use from the c4l
 community this afternoon, I feel fairly confident about it being able to
 perform as well as Djatoka in a production environment.
 
  By the way, you can page through some other images here:
 http://libimages.princeton.edu/osd-demo/
 
  Not much of an answer, I realize, but, as I said, time and usage will
 tell.
 
  -Js
 
  1. http://iiif.io/apps-demos.html
  2.
 http://libimages.princeton.edu/loris/pudl0052%2F6131707%2F0001.jp2/info.json
 
 
  On 11/8/13 8:07 PM, Peter Murray wrote:
  A clarifying question: is Loris effectively a Python-based replacement
 for the Java-based djatoka [1] server?
 
 
  Peter
 
  [1]
 http://sourceforge.net/apps/mediawiki/djatoka/index.php?title=Main_Page
 
 
  On Nov 8, 2013, at 3:05 PM, Jon Stroop jstr...@princeton.edu wrote:
 
  c4l,
  I was reminded earlier this week at DLF (and a few minutes ago by Tom
  and Simeon) that I hadn't ever announced a project I've been working on
  for the last year or so to this list. I showed an early version in a
  lightning talk at code4libcon last year.
 
  Meet Loris: https://github.com/pulibrary/loris
 
  Loris is a Python based image server that implements the IIIF Image API
  version 1.1 level 2[1].
 
  http://www-sul.stanford.edu/iiif/image-api/1.1/
 
  It can take JP2 (if you make Kakadu available to it), TIFF, or JPEG
  source images, and hand back JPEG, PNG, TIF, and GIF (why not...).
 
  Here's a demo of the server directly: http://goo.gl/8XEmjp
 
  And here's a sample of the server backing OpenSeadragon[2]:
  http://goo.gl/Gks6lR
 
  -Js
 
  1. http://www-sul.stanford.edu/iiif/image-api/1.1/
  2. http://openseadragon.github.io/
 
  --
  Jon Stroop
  Digital Initiatives Programmer/Analyst
  Princeton University Library
  jstr...@princeton.edu
  --
  Peter Murray
  Assistant Director, Technology Services Development
  LYRASIS
  peter.mur...@lyrasis.org
  +1 678-235-2955
  800.999.8558 x2955



Re: [CODE4LIB] rdf serialization

2013-11-05 Thread Robert Sanderson
Yes, I'm going to get sucked into this vi vs emacs argument for nostalgia's
sake.


From the linked, very outdated article:

 In fact, as far as I know I've never used an RDF application, nor do I
know of any that make me want to use them.  So what's wrong with this
picture?

a) Nothing.  You would never know if you've used a CORBA application
either. Or (insert infrastructure technology here) application.
b) You've never been to the BBC website? You've never used anything that
pulls in content from remote sites? Oh wait, see (a).
c) I've never used a Topic Maps application. (and see (a))

 I find most existing RDF/XML entirely unreadable
Patient: Doctor, Doctor it hurts when I use RDF/XML!
Doctor: Don't Do That Then.   (aka #DDTT)

Already covered in this thread. I'm a strong proponent of JSON-LD.

 I think that when we start to bring on board metadata-rich knowledge
monuments such as WorldCat ...

See VIAF in this thread. See, if you must, BIBFRAME in this thread.

There /are/ challenges with RDF, not going to argue against that. And in
fact I /have/ recently argued for it:
http://www.cni.org/news/video-rdf-failures-linked-data-letdowns/

But for the vast majority of cases, the problems are solved (JSON-LD) or no
one cares any more (httpRange14).  Named Graphs (those quads used by
"crazies" you refer to) solve the remaining issues, but aren't standard yet.
They are, however, cleverly baked into JSON-LD for the time when they are.


On Tue, Nov 5, 2013 at 2:48 PM, Alexander Johannesen 
alexander.johanne...@gmail.com wrote:

 Ross Singer rossfsin...@gmail.com wrote:
  This is definitely where RDF outclasses almost every alternative*,

 Having said that, there's tuples of many kinds, it's only that the
 triplet is the most used under the W3C banner. Many are using to a
 more expressive quad, a few crazies , for example, even though that

ad hominem? really? Your argument ceased to be valid right about here.

 may or may not be a better way of dealing with it. In the end, it all
 comes down to some variation over frames theory (or bundles); a
 serialisation of key/value pairs with some ontological denotation for
 what the semantics of that might be.

Except that RDF follows the web architecture through the use of URIs for
everything. That is not to be under-estimated in terms of scalability and
long term usage.


 But wait, there's more! We haven't touched upon the next layer of the
 cake; OWL, which is, more or less, an ontology for dealing with all
 things knowledge and web. And it kinda puzzles me that it is not more
 often mentioned (or used) in the systems we make. A lot of OWL was
 tailored towards being a better language for expressing knowledge
 (which in itself comes from DAML and OIL ontologies), and then there's
 RDFs, and OWL in various formats, and then ...

Your point? You don't like an ontology? #DDTT


 Complexity. The problem, as far as I see it, is that there's not
 enough expression and rigor for the things we want to talk about in
 RDF, but we don't want to complicate things with OWL or RDFs either.

That's no more a problem of RDF than any other system.

 And then there's that tedious distinction between a web resource and
 something that represents the thing in reality that RDF skipped (and
 hacked a 304 solution to). It's all a bit messy.

That RDF skipped? No, *RDF* didn't skip it nor did RDF propose the *303*
solution.
You can use URIs to identify anything.

The 303/httprange14 issue is what happens when you *dereference* a URI that
identifies something that does not have a digital representation because
it's a real world object.  It has a direct impact on RDF, but came from the
TAG not the RDF WG.

http://www.w3.org/2001/tag/doc/httpRange-14/2007-05-31/HttpRange-14

And it's not messy, it's very clean. What it is not, is pragmatic. URIs are
like kittens ... practically free to get, but then you have a kitten to
look after and that costs money.  Thus doubling up your URIs is increasing
the number of kittens you have. [though likely not, in practice, doubling
the cost]

  * Unless you're writing a parser, then having a kajillion serializations
  seriously sucks.
 Some of us do. And yes, it sucks. I wonder about non-political
 solutions ever being possible again ...

This I agree with.

Rob


Re: [CODE4LIB] rdf serialization

2013-11-03 Thread Robert Sanderson
You're still missing a vital step.

Currently your assertion is that the creator /of a web page/ is Jefferson,
which is clearly false.

The page (...) is a transcription of the Declaration of Independence.
The Declaration of Independence is written by Jefferson.
Jefferson is Male.

And it's not very hard given the right mindset -- it's just a fully expanded
relational database, where the identifiers are URIs.  Yes, it's not 1st
year computer science, but it is 2nd or 3rd year rather than postgraduate.

Which is not to say that people do not have great trouble succinctly
articulating knowledge, but like any skill, it can be learned. Just look at
the variation in the ways of writing papers ... some people can do it very
clearly, some have much more difficulty.

And with JSON-LD, you don't have to understand the RDF, just a clean
representation of it.
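
For example, the assertions discussed in the message below can be written,
and read, as an ordinary JSON structure; a hand-written sketch, not the
output of any particular tool:

# Sketch: the Jefferson assertions as JSON-LD, built as a plain dict.
import json

doc = {
    "@context": {
        "dc": "http://purl.org/dc/elements/1.1/",
        "foaf": "http://xmlns.com/foaf/0.1/",
    },
    "@id": "http://www.archives.gov/exhibits/charters/declaration_transcript.html",
    "dc:creator": {
        "@id": "http://id.loc.gov/authorities/names/n79089957",
        "@type": "foaf:Person",
        "foaf:gender": "male",
    },
}
print(json.dumps(doc, indent=2))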

Rob



On Sun, Nov 3, 2013 at 1:45 PM, Eric Lease Morgan emor...@nd.edu wrote:


 Cool input. Thank you. I believe I have tweaked my assertions:

 1. The Declaration of Independence was written by Thomas Jefferson

 <rdf:RDF
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:dc="http://purl.org/dc/elements/1.1/">

   <rdf:Description
     rdf:about="http://www.archives.gov/exhibits/charters/declaration_transcript.html">
     <dc:creator>http://id.loc.gov/authorities/names/n79089957</dc:creator>
   </rdf:Description>

 </rdf:RDF>


 2. Thomas Jefferson is a male person

 <rdf:RDF
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:foaf="http://xmlns.com/foaf/0.1/">

   <rdf:Description rdf:about="http://id.loc.gov/authorities/names/n79089957">
     <foaf:Person foaf:gender="male" />
   </rdf:Description>

 </rdf:RDF>


 Using no additional vocabularies (ontologies), I think my hypothetical
 Linked Data spider / robot ought to be able to assert the following:

 3. The Declaration of Independence was written by Thomas Jefferson, a male
 person

 <rdf:RDF
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmlns:foaf="http://xmlns.com/foaf/0.1/">

   <rdf:Description
     rdf:about="http://www.archives.gov/exhibits/charters/declaration_transcript.html">
     <dc:creator>
       <foaf:Person rdf:about="http://id.loc.gov/authorities/names/n79089957">
         <foaf:gender>male</foaf:gender>
       </foaf:Person>
     </dc:creator>
   </rdf:Description>

 </rdf:RDF>

 The W3C Validator…validates Assertion #3, and returns the attached graph,
 which illustrates the logical combination of Assertion #1 and #2.

 This is hard. The Semantic Web (and RDF) attempt to codify knowledge
 using a strict syntax, specifically a strict syntax of triples. It is very
 difficult for humans to articulate knowledge, let alone codify it. How
 realistic is the idea of the Semantic Web? I wonder this not because I
 don’t think the technology can handle the problem. I say this because I
 think people can’t (or have great difficulty) succinctly articulating
 knowledge. Or maybe knowledge does not fit into triples?

 —
 Eric Morgan
 University of Notre Dame

 [attached image: RDF graph of Assertion #3]




[CODE4LIB] ANN: Memento Client for Chrome

2013-10-09 Thread Robert Sanderson
Dear all,

We are delighted to be able to announce the availability of the beta
Memento extension for Chrome. The extension is available in the Chrome
store:

https://chrome.google.com/webstore/detail/memento/jgbfpjledahoajcppakbgilmojkaghgm?hl=engl=US

Below, we include the description that accompanies the extension in the
Chrome store, which highlights its web time travel and 404-circumventing
features.

Your feedback would be much appreciated to help us get it ready for prime
time.

We would like to take this opportunity to thank:
- Harihar Shankar for the effort he invested in developing the extension.
- Luydmila Balakireva, Martin Klein, Michael Nelson, James Powell, for
their input during the development process.


Many thanks,

Rob Sanderson and Herbert Van de Sompel,
Los Alamos National Laboratory


==

Description

Travel to the past of the web by right-clicking pages and links.

Memento for Chrome allows you to seamlessly navigate between the present
web and the web of the past. It turns your browser into a web time travel
machine that is activated by means of a Memento sub-menu that is available
on right-click.

First, select a date for time travel by clicking the black Memento
extension icon. Now right-click on a web page, and click the Get near …
option from the Memento sub-menu to see what the page looked like around
the selected date. Do the same for any link in a page to see what the
linked page looked like. If you hit one of those nasty Page not Found
errors, right-click and select the Get near current time
option to see what the page looked like before it vanished from the web.
When on a past version of a page - the Memento extension icon is now red -
right-click the page and select the Get current time
option to see what it looks like now.

Memento for Chrome obtains prior versions of pages from web archives around
the world, including the massive web-wide Internet Archive, national
archives such as the British Library and UK National Archives web archives,
and on-demand web archives such as archive.is. It also allows time travel
in all language versions of Wikipedia. There are two things Memento for
Chrome cannot do for you: obtain a prior version of a page when none has
been archived, and time travel into the future. Our sincere apologies for
that.

Technically, the Memento for Chrome extension is a client-side
implementation of the Memento protocol that extends HTTP with content
negotiation in the datetime dimension. Many web archives have implemented
server-side support for the Memento protocol, and, in essence, every
content management system that supports time-based versioning can implement
it. Technical details are in the Memento Internet Draft at
http://www.mementoweb.org/guide/rfc/ID/. General information about the
protocol, including a quick introduction, is available at
http://mementoweb.org.
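
At the HTTP level the same time travel can be scripted directly; a sketch of
a datetime-negotiated request, where the TimeGate URL pattern is illustrative
only -- consult a specific archive's documentation for its actual endpoint:

# Sketch: datetime negotiation against a Memento TimeGate.
# The TimeGate URL below is a hypothetical example endpoint.
import requests

target = "http://www.example.org/"
timegate = "http://timetravel.mementoweb.org/timegate/" + target

resp = requests.get(
    timegate,
    headers={"Accept-Datetime": "Tue, 01 Oct 2013 12:00:00 GMT"},
)

# The negotiated memento is reached via redirect; its archival datetime is
# reported in the Memento-Datetime response header.
print(resp.url)
print(resp.headers.get("Memento-Datetime"))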


Re: [CODE4LIB] anti-harassment policy for code4lib?

2012-11-26 Thread Robert Sanderson
+1, of course :)

You might wish to consider some further derivatives/related pages:
http://www.diglib.org/about/code-of-conduct/
http://wikimediafoundation.org/wiki/Friendly_space_policy
https://thestrangeloop.com/about/policies
http://www.apache.org/foundation/policies/anti-harassment.html

Rob



On Mon, Nov 26, 2012 at 3:57 PM, Mariner, Matthew 
matthew.mari...@ucdenver.edu wrote:

 +1 for all of the below

 Matthew C. Mariner
 Head of Special Collections and Digital Initiatives
 Assistant Professor
 Auraria Library
  1100 Lawrence Street, Denver, CO 80204-2041
 matthew.mari...@ucdenver.edu
 http://library.auraria.edu :: http://archives.auraria.edu





 On 11/26/12 3:51 PM, Tom Cramer tcra...@stanford.edu wrote:

 +1 for Bess's motion
 +1 for Roy's expansion to C4L online interactions as well as face to face
 +1 for Karen's focus on general inclusivity and fair play
 
  For me the hardest thing is how one monitors and resolves issues that
 arise. As a group with no formal management, I suppose the conference
 organizers become the deciders if such a necessity arises. If it's
 elsewhere (email, IRC) -- that's a bit trickier. The Ada project's
 detailed guides should help, but if there is a policy it seems that
 there necessarily has to be some responsible body -- even if ad hoc.
 
 
 It seems to me that there would be tremendous benefit in having
 
 1.) an explicit statement of the community norms around harassment and
 fair play in general. In the best case, this would help avoid
 uncomfortable or inappropriate situations before they occur.
 
 2.) a defined process for handling any incidents that do arise, which in
 the case of this community I would imagine would revolve around
 reporting, communication, negotiation and arbitration rather than
 adjudication by a standing body (which I agree is hard to see in this
 crowd). I know several high schools have adopted peer arbitration
 networks for conflict resolution rather than referring incidents to the
 Principal's Office--perhaps therein lies a model for us for any incidents
 that may not be resolved simply through dialogue.
 
 - Tom
 
 
 
 On Nov 26, 2012, at 2:32 PM, Karen Coyle wrote:
 
  Bess and Code4libbers,
 
  I've only been to one c4l conference and it was a very positive
 experience for me, but I also feel that this is too valuable of a
 community for us to risk it getting itself into crisis mode over some
 unintended consequences or a bad apple incident. For that reason I
 would support the adoption of an anti-harassment policy in part for its
 consciousness-raising value. Ideally this would be not only about sexual
 harassment but would include general goals for inclusiveness and fair
 play within the community. And it would also serve as an acknowledgment
 that none of us is perfect, but we can deal with it.
 
  For me the hardest thing is how one monitors and resolves issues that
 arise. As a group with no formal management, I suppose the conference
 organizers become the deciders if such a necessity arises. If it's
 elsewhere (email, IRC) -- that's a bit trickier. The Ada project's
 detailed guides should help, but if there is a policy it seems that
 there necessarily has to be some responsible body -- even if ad hoc.
 
  kc
 
 
  On 11/26/12 2:16 PM, Bess Sadler wrote:
  Dear Fellow Code4libbers,
 
  I hope I am not about to get flamed. Please take as context that I
 have been a member of this community for almost a decade. I have
 contributed software, support, and volunteer labor to this community's
 events. I have also attended the majority of code4lib conferences,
 which have been amazing and life-changing, and have helped me do my job
 a lot better. But, and I've never really known how to talk about this,
 those conferences have also been problematic for me a couple of times.
 Nothing like what happened to Noirin Shirley at ApacheCon (see
 http://geekfeminism.wikia.com/wiki/Noirin_Shirley_ApacheCon_incident if
 you're unfamiliar with the incident I mean) but enough to concern me
 that even in a wonderful community where we mostly share the same
 values, not everyone has the same definitions of acceptable behavior.
 
  I am watching the toxic fallout from the BritRuby conference
 cancellation with a heavy heart (go search for britruby conference
 cancelled if you want to catch up and/or get depressed). It has me
 wondering what more we could be doing to promote diversity and
 inclusiveness within code4lib. We have already had a couple of
 harassment incidents over the years, which I won't rehash here, which
 have driven away members of our community. We have also had other
 incidents that don't get talked about because sometimes one can feel
 that membership in a community is more important than one's personal
 boundaries or even safety. We should not be a community where people
 have to make that choice.
 
  I would like for us to consider adopting an anti-harassment policy for
 code4lib conferences. This is emerging 

Re: [CODE4LIB] Code4lib 2013 Presentation Election now open!

2012-11-13 Thread Robert Sanderson
I guess that you need to be logged in to vote?
Perhaps a direct link in the text to where to login, and where to request a
new account?

Thanks,

Rob


On Tue, Nov 13, 2012 at 11:15 AM, Becky Yoose b.yo...@gmail.com wrote:

 Not a voting problem per se, but the results page in IE9 [1] in Win7 threw
 up everywhere: http://screencast.com/t/lUnwFl8h

 Otherwise, yay new design :cD

 Thanks,
 Becky

 [1] Related: don't ask why I was in IE.

 On Mon, Nov 12, 2012 at 11:03 PM, Ross Singer rossfsin...@gmail.com
 wrote:

  http://vote.code4lib.org/election/24
 
  Vote early, vote often, but most importantly, vote soon:  the polls close
  sometime on the night of Monday the 19th of November (looking at the host
  that the diebold-o-tron runs on, I think it will be around 11 PM EST, but when
 they
  close, they close!).
 
  -Ross.
  p.s. given the new design, let me know if there are any voting problems.
 



Re: [CODE4LIB] Embedding XHTML into RDF

2012-01-12 Thread Robert Sanderson
+1

Rob

On Thu, Jan 12, 2012 at 9:26 AM, aj...@virginia.edu aj...@virginia.edu wrote:
 My inclination would be to keep the descriptive snippets in some kind of 
 content store with a good RESTful Web exposure and just use those URLs as the 
 values of description triples in your RDF. Then your RDF is genteel Linked 
 Data and your XHTML can be easily available to integrating services.

 ---
 A. Soroka
 Online Library Environment
 the University of Virginia Library




 On Jan 11, 2012, at 11:00 PM, CODE4LIB automatic digest system wrote:

 From: Ethan Gruber ewg4x...@gmail.com
 Date: January 11, 2012 3:07:16 PM EST
 Subject: Re: Embedding XHTML into RDF


 People are going to use the YUI rich text editor and the output is run
 through tidy, so that should ensure the well-formedness of the HTML.

 Right now we have a system where thousands of small XHTML fragments exist
 as text files in a filesystem (edited manually, practically), which are
 rendered through wiki software.  The fragments have RDFa attributes so that
 an RDFa python script can interpret wiki pages as RDF on the fly.  We need
 to redesign the system from the ground up, and I'd like to use RDF as the
 source object.

 Ethan


Re: [CODE4LIB] Embedding XHTML into RDF

2012-01-11 Thread Robert Sanderson
You might consider the Content in RDF specification:
http://www.w3.org/TR/Content-in-RDF10/

which describes how to do this in a generic fashion, as opposed to
stuffing it directly into a string literal.
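
Roughly, with rdflib, that looks like the sketch below. The cnt: terms come
from the specification above; treat the exact class and property choices as
an approximation rather than a definitive mapping.

    # Rough sketch: attach an XHTML snippet via a Content in RDF node instead
    # of a bare string literal. The object URI is made up for illustration.
    from rdflib import Graph, Namespace, URIRef, BNode, Literal
    from rdflib.namespace import RDF, DCTERMS

    CNT = Namespace("http://www.w3.org/2011/content#")

    g = Graph()
    g.bind("cnt", CNT)
    g.bind("dcterms", DCTERMS)

    obj = URIRef("http://example.org/object/1")   # hypothetical object URI
    desc = BNode()

    g.add((obj, DCTERMS.description, desc))
    g.add((desc, RDF.type, CNT.ContentAsText))
    g.add((desc, CNT.characterEncoding, Literal("UTF-8")))
    g.add((desc, CNT.chars, Literal("<p>A <em>free-form</em> description.</p>")))

    print(g.serialize(format="turtle"))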

HTH

Rob

On Wed, Jan 11, 2012 at 12:36 PM, Ethan Gruber ewg4x...@gmail.com wrote:
 Hi all,

 Suppose I have RDF describing an object, and I would like some fairly
 free-form, human-generated description about the object (let's say within
 dcterms:description).  Is it semantically acceptable to have XHTML nested
 directly in this element or would this be considered uncouth for LOD?

 Thanks,
 Ethan


Re: [CODE4LIB] Sending html via ajax -vs- building html in js (was: jQuery Ajax request to update a PHP variable)

2011-12-08 Thread Robert Sanderson
On Thu, Dec 8, 2011 at 9:14 AM, BRIAN TINGLE
brian.tingle.cdlib@gmail.com wrote:

 On Dec 7, 2011, at 2:19 PM, Robert Sanderson wrote:
 * Lax Security -- It's easier to get into trouble when you're simply
 inlining HTML received, compared to building the elements.  Getting
 into the same bad habits as SQL injection. It might not be a big deal
 now, but it will be later on.

 I've been scratching my head about this one.  Can someone elaborate on this?

If you blindly include whatever you get back directly into the page,
it might include either badly performing, out of date, or potentially
malicious script tags that subsequently destroy the page.  It's the
equivalent of blindly accepting web form input into an SQL query and
then wondering where your tables all disappeared off to.
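
To make the analogy concrete, here is the SQL half as a small Python/sqlite3
sketch; the table and the hostile input are made up. Inlining fetched HTML is
the same mistake with markup instead of SQL.

    # Interpolating untrusted input into a query is the server-side twin of
    # inlining untrusted HTML into the page.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE books (title TEXT)")
    conn.execute("INSERT INTO books VALUES ('Time Travel for the Web')")

    user_input = "x'; DROP TABLE books; --"   # hostile form input

    # Risky: the input becomes part of the SQL text itself.
    # conn.executescript("SELECT * FROM books WHERE title = '%s'" % user_input)

    # Safer: the input stays data, bound as a parameter.
    rows = conn.execute(
        "SELECT * FROM books WHERE title = ?", (user_input,)
    ).fetchall()
    print(rows)   # [] -- no match, and the books table is still there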

Rob


Re: [CODE4LIB] Sending html via ajax -vs- building html in js (was: jQuery Ajax request to update a PHP variable)

2011-12-07 Thread Robert Sanderson
Here's some off the top of my head:

* Separation of concerns -- You can keep your server side data
transfer and change the front end easily by working with the
javascript, rather than reworking both.

* Lax Security -- It's easier to get into trouble when you're simply
inlining HTML received, compared to building the elements.  Getting
into the same bad habits as SQL injection. It might not be a big deal
now, but it will be later on.

* Obfuscation -- It's easier to debug one layer of code rather than
two at once. It's thus also easier to maintain the two layers of code,
and easier to see at which end the system is failing.

Rob

On Wed, Dec 7, 2011 at 3:12 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
 A fair number? Anyone but Godmar?

 On 12/7/2011 5:02 PM, Nate Vack wrote:

 OK. So we have a fair number of very smart people saying, in essence,
 it's better to build your HTML in javascript than send it via ajax
 and insert it.

 So, I'm wondering: Why? Is it an issue of data transfer size? Is there
 a security issue lurking? Is it tedious to bind events to the new /
 updated code? Something else? I've thought about it a lot and can't
 think of anything hugely compelling...

 Thanks!
 -Nate




Re: [CODE4LIB] Plea for help from Horowhenua Library Trust to Koha Community

2011-11-23 Thread Robert Sanderson
LibLime
A Division of PTFS, Inc.
Main Office

11501 Huff Court
North Bethesda, Maryland 20895

tel: (301) 654-8088 Ext. 127
fax: (301) 654-5789
email: kohai...@liblime.com

Twitter: @liblime

How about we all contact them? ;)

Rob


2011/11/23 Wilfred Drew dr...@tc3.edu:
 Has anybody contacted the company? A sales rep? PR department?

 Bill Drew

 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
 Parker, Anson (adp6j)
 Sent: Wednesday, November 23, 2011 12:09 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Plea for help from Horowhenua Library Trust to Koha 
 Community

 This is pretty offensive on the liblime part, perhaps not surprising, but
 certainly low brow... I think best practices are to 1) blog it up 2) get a list
 of their clients and email them all to let them know what a bunch of schmarmy
 brats they are working with... make it hurt financially.  It's not libel or
 slander as long as it is true.  Going to New Zealand to play a legal game
 like this is way below the belt.
 -ap

 On 11/23/11 10:32 AM, Eric Lease Morgan emor...@nd.edu wrote:

Just to make it easier, use the following link to read the article and
then donate via PayPal -- http://bit.ly/rBeWN0  Open source software is
about liberty, not gratis. --ELM



Re: [CODE4LIB] OIA Feeds

2011-06-21 Thread Robert Sanderson
Without /any/ infrastructure it would be a challenge, but a simple
database that has timestamps and basic metadata would be sufficient.
The timestamps are the most important, obviously, to populate the feed
correctly and handle the time slicing.
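
To make the time-slicing point concrete, here is a very reduced Python sketch
over a made-up record store; a real data provider must also emit the full
OAI-PMH envelope, error conditions, resumption tokens, and the other verbs.

    # Very reduced sketch of OAI-PMH time slicing; records and fields are
    # made up, and only a fragment of a ListRecords response is produced.
    from datetime import datetime, timezone
    from xml.sax.saxutils import escape

    records = [
        {"id": "oai:example.org:1", "title": "First item",  "updated": "2011-05-01T00:00:00Z"},
        {"id": "oai:example.org:2", "title": "Second item", "updated": "2011-06-15T00:00:00Z"},
    ]

    def parse(ts):
        return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

    def list_records(frm=None, until=None):
        """Yield record headers whose datestamp falls inside the requested slice."""
        for rec in records:
            stamp = parse(rec["updated"])
            if frm and stamp < parse(frm):
                continue
            if until and stamp > parse(until):
                continue
            yield ("<record><header>"
                   f"<identifier>{escape(rec['id'])}</identifier>"
                   f"<datestamp>{rec['updated']}</datestamp>"
                   "</header></record>")

    print("\n".join(list_records(frm="2011-06-01T00:00:00Z")))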

Rob

On Tue, Jun 21, 2011 at 8:55 AM, Eric Lease Morgan emor...@nd.edu wrote:
 On Jun 21, 2011, at 9:50 AM, Nathan Tallman wrote:

 Can anyone direct me towards documentation on creating an OAI feed from
 scratch, without a repository infrastructure?


 Setting up an OAI feed -- becoming an OAI data provider -- without a 
 repository infrastructure would be challenging, to say the least. To learn 
 more one would need to first read the OAI-PMH specification. [1] You would 
 then need to write a program to support the OAI verbs (identify, listSets, 
 etc.). All of the metadata in the resulting feed would need to come from 
 some place, and this place is usually a database of some sort. You have a 
 database listing your content, right?

 [1] specification - http://bit.ly/dJyAE3

 --
 Eric Lease Morgan
 University of Notre Dame

 Great Books Survey -- http://bit.ly/auPD9Q



[CODE4LIB] New Memento Internet Draft available; comments requested!

2011-04-28 Thread Robert Sanderson
Dear all,

We have published an updated internet draft for the Memento
specification concerning Time Travel on the Web.

It is available at:

* TXT version: http://www.ietf.org/id/draft-vandesompel-memento-01.txt
* HTML version: http://mementoweb.org/guide/rfc/ID/

This version contains updates and clarifications that result from
community feedback, as well as new material pertaining to the
discovery of TimeGates, TimeMaps, and Mementos.

This is the technical specification behind our recent article in the
Code4Lib journal on client development:
   http://journal.code4lib.org/articles/4979

We would be very appreciative of any feedback that the community might
have on this specification, either on-list or privately.

We have a Memento google group if you would like to participate in an
ongoing conversation on this topic:
http://groups.google.com/group/memento-dev


Many thanks, and we look forward to hearing your comments,

Rob Sanderson, Herbert Van de Sompel, Michael Nelson.


Re: [CODE4LIB] Fwd: [Air-L] Using archives of the web for research

2011-04-21 Thread Robert Sanderson
Our work on Memento comes to mind, of course.

http://www.mementoweb.org/

And in particular, regarding the second point, our papers about the
use of Memento for non-traditional interactions with web archives:

* http://arxiv.org/abs/1003.3661
Using Memento to recover the state of a web resource at the point in
time it was annotated, to ensure that the annotation is displayed with
the correct representation.

* http://arxiv.org/abs/1003.2643
Using Memento with Linked Data to perform time series analysis.

And hopefully a paper at Open Repositories, describing initial and
ongoing research, briefly summarized in:

* 
http://public.lanl.gov/herbertv/papers/Papers/2011/MementoPoster_IKS_201104.pdf


Hope that helps!

Rob


On Thu, Apr 21, 2011 at 8:58 AM, Jodi Schneider jschnei...@pobox.com wrote:
 Code4Lib, any thoughts for Eric? -Jodi

 -- Forwarded message --
 From: Eric Meyer eric.me...@oii.ox.ac.uk
 Date: Wed, Apr 20, 2011 at 4:46 PM
 Subject: [Air-L] Using archives of the web for research
 To: ai...@listserv.aoir.org ai...@listserv.aoir.org
 Cc: Ralph Schroeder ralph.schroe...@oii.ox.ac.uk, 
 a...@proteus-associates.com a...@proteus-associates.com


 Dear AoIR,

 OII is currently doing some work for the IIPC (International Internet
 Preservation Consortium: http://www.netpreserve.org), and part of the work
 involves identifying current and cutting edge research techniques and tools
 that are available for research on the live web, but that are currently
 either difficult or impossible to use with web archives such as the Internet
 Archive (http://www.archive.org/) or other IIPC member organisations.

 The short version of what we are hoping to get from this group:
 - do you know of any innovative uses of web archives for research?
 - what techniques for researching the live web should be adapted for use
 with web archives?
 - can you envisage any innovative uses of web archives (or other archived
 Internet data) for research that you would ideally like to be able to do?

 The longer version:
 What we are hoping is that members of AoIR will respond to us (off list)
 with your ideas about ways you research the live web that could potentially
 be enhanced using either snapshots from the web at different time points or
 longitudinal data about the web over time, but which would need additional
 support, training, tools, or infrastructure to be able to accomplish.  Your
 responses will be used to influence the IIPC community to add web archive
 support for the kinds of cutting edge research that AoIR members are doing.

 Also, if you have any types of research or research questions you have been
 hoping to be able to do with archived internet data but have not been able
 to do for whatever reason, and you are willing to share the ideas and the
 barriers to researching them with us for possible inclusion in our
 discussion paper, that would be appreciated as well.

 Responses before 1 May will be most helpful. We will post the draft report
 back to the list in May, and the final report in the summer.  Those
 interested in web archives may also find two reports we wrote last autumn to
 be of interest:
 Dougherty, M., Meyer, E.T., Madsen, C., van den Heuvel, C., Thomas, A.,
 Wyatt, S. (2010). Researcher Engagement with Web Archives: State of the Art.
 London: JISC. Online: http://ssrn.com/abstract=1714997 or
 http://ie-repository.jisc.ac.uk/544/
 Thomas, A., Meyer, E.T., Dougherty, M., van den Heuvel, C., Madsen, C.,
 Wyatt, S. (2010). Researcher Engagement with Web Archives: Challenges and
 Opportunities for Investment. London: JISC. Online:
 http://ssrn.com/abstract=1715000 or http://ie-repository.jisc.ac.uk/543/


 Eric T. Meyer
 Research Fellow, Oxford Internet Institute
 University of Oxford
 eric.me...@oii.ox.ac.uk
 http://people.oii.ox.ac.uk/meyer



 ___
 The ai...@listserv.aoir.org mailing list
 is provided by the Association of Internet Researchers http://aoir.org
 Subscribe, change options or unsubscribe at:
 http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org

 Join the Association of Internet Researchers:
 http://www.aoir.org/



[CODE4LIB] Fwd: [dm-l] Postdoctoral Fellowship at MARGOT, University of Waterloo

2011-04-18 Thread Robert Sanderson
-- Forwarded message --
From: Christine McWebb cmcw...@uwaterloo.ca
Date: Mon, Apr 18, 2011 at 7:36 AM
Subject: [dm-l] Postdoctoral Fellowship at MARGOT, University of Waterloo
To: d...@uleth.ca


University of Waterloo – Mellon Postdoctoral Fellowship in Digital Humanities



With apologies for cross-posting; please redistribute:



Postdoctoral Fellowship at MARGOT



The MARGOT Annotation Tool project (imageMAT), funded by the Andrew W.
Mellon Foundation – Scholarly Communications and Technology Program
(2011-2012), invites applications to its 2011 competition for a
postdoctoral fellowship. imageMAT offers a one-year postdoctoral
fellowship valued at $31,500 + 14% vacation pay and benefits to PhD
students in the final year of their program and recent graduates.
Applicants must have knowledge in medieval iconography and/or
literature and manuscript culture/production. Applicants must also
have solid computer skills. The postdoctoral fellow will provide
scholarly leadership and, more generally, add scholarly content to the
project site such as manuscript descriptions and blog posts. He/she
will consult on content creation, and assist the developer and McWebb
with the training of graduate students in content creation and be
responsible for site moderation.



Knowledge of French would be an asset, but is not required. The award
is tenable at the University of Waterloo, Waterloo, Ontario, and is
supervised by Christine McWebb. The start date is September 1, 2011.

Applicants must not hold a tenure or tenure-track position or other
full-time employment. Fellows are expected to engage in full-time
postdoctoral research during the term of the award.



Preference will be given to recent graduates, that is, to graduates
applying within five years of receiving their doctoral degree. The
awards are not renewable beyond the first year.

Please send a cover letter, current c.v., and the names of three
referees by email to:



Christine McWebb

cmcw...@uwaterloo.ca



Application deadline: 1 June, 2011





Christine McWebb

Associate Professor

Associate Chair, Graduate Studies

Département d'études françaises

ML 337

University of Waterloo

200 University Avenue

Waterloo, Ontario N2L 3G1

Canada



T.: 519-888-4567x32426

http://margot.uwaterloo.ca



Digital Medievalist --  http://www.digitalmedievalist.org/
Journal: http://www.digitalmedievalist.org/journal/
Journal Editors: editors _AT_ digitalmedievalist.org
News: http://www.digitalmedievalist.org/news/
Wiki: http://www.digitalmedievalist.org/wiki/
Twitter: http://twitter.com/digitalmedieval
Facebook: http://www.facebook.com/group.php?gid=49320313760
Discussion list: d...@uleth.ca
Change list options: http://listserv.uleth.ca/mailman/listinfo/dm-l


[CODE4LIB] Fwd: OAC RFP Annoncement

2011-04-06 Thread Robert Sanderson
Forwarded:

The Open Annotation Collaboration (OAC) project is pleased to announce
a Request For Proposal to collaborate with OAC researchers for
building implementations of the OAC data model and ontology. The OAC
is seeking to collaborate with scholars and/or librarians currently
using and/or curating established repositories of scholarly digital
resources with well-defined audiences of scholars. The OAC intends to
fund a set of four projects that are complementary in content media
type and use cases that leverage the OAC Data Model to the fullest
extent, and that leverage existing annotation tools or at least have
articulated an interesting scholarly annotation use case.

Two of the successful Respondents will collaborate with OAC
researchers at the University of Maryland and the other two will
collaborate with OAC research at the University of Illinois at
Urbana-Champaign. (For these collaborations, Illinois and Maryland
will provide guidance on the implementation of the OAC data model and
ontology, help in defining extensions of the data model that might be
necessary, advice on existing tools that might be adaptable for the
demonstration experiment, feedback on correctness of mappings from/to
native annotation formats and/or annotations created.)

The full text of the RFP can be found at
http://www.openannotation.org/documents/openAnnotationRFP.pdf

The IP agreement attachment to this RFP is available at:
http://www.openannotation.org/documents/openAnnotationIP_Agreement_forRFP.pdf

A FAQ about this RFP is available at:
http://www.openannotation.org/RFP_FAQs.html

Please make all submissions regarding this RFP, including your letter
of intent and proposal, to oac2...@support.lis.illinois.edu

Questions: regarding any details of this RFP should also be emailed to
oac2...@support.lis.illinois.edu; answers to substantive questions
from individuals will be posted immediately on the RFP FAQ page
mentioned above (so as to be available to all proposers).

The Open Annotation Collaboration is supported by a grant from the
Andrew W. Mellon Foundation. OAC members include the University of
Illinois at Urbana-Champaign, the University of Maryland, the
University of Queensland (Australia), and the Los Alamos National
Laboratory.

Regards,

Jacob Jett
Assistant Coordinator, Open Annotation Collaboration Project
Center for Informatics Research in Science and Scholarship
The Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
501 E. Daniel Street, MC-493, Champaign, IL 61820-6211 USA


Re: [CODE4LIB] XML Schema vs Library APIs (OAI-PMH/SRU/unAPI)

2011-02-24 Thread Robert Sanderson
That is (still) incorrect.

A single schema may contain multiple namespaces, and there isn't a
unique identifier for a schema.  For example, any simple Dublin Core
based syntax must have at least two Namespaces, Dublin Core and the
wrapper element. SchemaLocation is not unique as there can be many
copies of the same schema.  A single schema may define multiple root
elements, such as MODS does with both item and collection level
elements.

Referring to your blog post, you can say how the four inter-relate:

Schema Identifier uniquely identifies the format.
Schema Location is a non-unique description of the format.
Schema Name is a short, human-readable, non-unique name for the format,
and Namespace is a non-unique namespace used by the format.

This is just a rehash of a previous discussion on this list, between us:

http://www.mail-archive.com/code4lib@listserv.nd.edu/msg05309.html

So I guess I'm wasting my time ;)

Rob Sanderson

On Thu, Feb 24, 2011 at 9:44 AM, Jakob Voss jakob.v...@gbv.de wrote:
 Hi,

 We are developing a general API management tool to provide different APIs
 (unAPI, SRU, OAI-PMH...) with different record formats (MARC, MODS, DC...)
 to our databases. We now stumbled upon some confusion regarding XML formats.
 The basic question is what is a format and how do you refer to it?

 I came to the conclusion that at least SRU schema identifiers are useless.
 In addition you can extract XML namespace URIs from XML Schemas, so all you
 need to identify a format is a link to its XML Schema.

 I wrote a more detailed blog posting about this at
 http://jakoblog.de/2011/02/24/xml-schema-vs-library-apis-oai-pmhsruunapi/

 Do any of you rely on SRU schema identifiers when consuming SRU?
 I think at least for XML-based formats we should only use the XML Schema as
 authoritative reference. Sure there are different applications of variants
 of one schema, but then it makes no sense to use global identifiers in
 addition to local names.

 Jakob

 --
 Jakob Voß jakob.v...@gbv.de, skype: nichtich
 Verbundzentrale des GBV (VZG) / Common Library Network
 Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
 +49 (0)551 39-10242, http://www.gbv.de



[CODE4LIB] Using OAC Workshop Planned for March 2011 (CFP)

2010-12-20 Thread Robert Sanderson
Dear all,

The Open Annotation Collaboration (OAC) project is pleased to announce
an open call for statements of interest in participating in the Using
the OAC Model for Annotation Interoperability Workshop. The workshop
will be held 24-25 March 2011 in Chicago, IL and will provide an in-
depth introduction to the OAC data model and ontology for describing
scholarly annotations of Web-accessible information resources. Use
cases involving a range of scholarly annotation classes and target
media types will be presented. Participants will be asked to examine,
comment on, and provide feedback on how well the OAC data model and
framework intersects (or fails to intersect) with domain-specific
needs for scholarly annotation services and with existing discipline
or repository-specific annotation tools and services. By the end of
the day and a half workshop, attendees will be better prepared to
propose and undertake implementations of annotation tools and services
exploiting the OAC data model and ontology.

The workshop is planned for 9 AM March 24 through 1 PM March 25, 2011,
in Chicago, Illinois. Limited support is available to reimburse
invited participants for reasonable travel costs. Preliminary
statements of interest and use case briefs are due by January 24, 2011.
In the event of oversubscription, these briefs will be used to select
invitees; invitations will be issued by February 7. Please see
http://www.openannotation.org/documents/CallForWorkshopParticipation.pdf
for additional details and context; contact Tim Cole (t-
co...@illinois.edu) or Jacob Jett (jje...@illinois.edu) for further
information.

The Open Annotation Collaboration is supported by a grant from the
Andrew W. Mellon Foundation. OAC members include the University of
Illinois at Urbana-Champaign, the University of Maryland, the
University of Queensland (Australia), and the Los Alamos National
Laboratory.


Re: [CODE4LIB] graph processing stack

2010-12-20 Thread Robert Sanderson
On Mon, Dec 20, 2010 at 1:28 PM, BRIAN TINGLE brian.tingle.cdlib.org@
gmail.com wrote:


 graph processing stack on top of a graph database resonates with me more
 than RDF store with SPARQL access but I guess they are
 basically/functionally saying the same thing?  Maybe the graph database
 way of thinking about it is  potentially less interoperable open data
 linking way? -- but I've always believed you have to operate before you can
 interoperate.


An RDF Triple Store is a specific type of graph database, and SPARQL is a
specific way to access it.  Neo4J is another type of graph database, and the
Gremlin/Pipes/Blueprints/Rexster stack is a way to access it.

At the heart of the matter is how you model your graph, and RDF is the
standard way to do that. You can store RDF in Neo4J and it has a SPARQL
interface.

In terms of operating and interoperating, it would seem to me that the
easiest way forwards is to ignore any RDF ontologies you don't understand
and simply create new relationships, such as snac:correspondedWith and
snac:associatedWith ... you or other people can assert equivalencies later
:)
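
As a rough rdflib sketch of that approach (the snac: namespace URI and the
resource URIs below are invented for illustration):

    # Mint project-local relationships now, assert equivalences later.
    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import OWL

    SNAC = Namespace("http://example.org/snac/terms#")   # invented namespace

    g = Graph()
    g.bind("snac", SNAC)

    alice = URIRef("http://example.org/person/alice")
    bob = URIRef("http://example.org/person/bob")

    # Use your own predicates today...
    g.add((alice, SNAC.correspondedWith, bob))
    g.add((alice, SNAC.associatedWith, bob))

    # ...and assert an equivalency whenever a shared vocabulary turns up,
    # e.g. mapping the looser relationship to dcterms:relation.
    g.add((SNAC.associatedWith, OWL.equivalentProperty,
           URIRef("http://purl.org/dc/terms/relation")))

    print(g.serialize(format="turtle"))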

HTH,

Rob Sanderson


Re: [CODE4LIB] What do you want out of a frbrized data web service?

2010-04-20 Thread Robert Sanderson
Exposing the records as Linked Data, rather than just plain old XML,
would be an interesting demonstration of how the library world can
generate and, more importantly, curate massive amounts of data.  They
could then be linked to and from by other resources/services -- for
example linking a copy of a book on Amazon as an Item to the
Manifestation it's drawn from could allow for powerful graph oriented
search.

Rob


On Tue, Apr 20, 2010 at 3:50 PM, Riley, Jenn jenlr...@indiana.edu wrote:
 Hi all,

 At Indiana University we're working on a project that will help us see
 concretely what FRBRized [1] library data and discovery systems might look
 like. [2] One of our project goals is to share the raw FRBRized data widely
 so that others can look at it to see how it's structured, reuse it, improve
 on it, comment on the FRBRization effectiveness, etc. We're planning on
 allowing remote/Web Services/API/SRU/some machine-to-machine method like
 that access to the data. As we're starting to think about how we should set
 that up, we thought it would be useful to gather some use cases from the
 code4lib community, as it's the folks here that are experimenting with
 services like this. So if there were FRBRized data available to you (at
 least for FRBR group 1 and group 2 entities; *maybe* group 3 as well), what
 would you do with it? What kinds of questions would your service (discovery
 system, whatever) ask a service that made this data available? What kinds of
 information would you want in a response? Would you have uses that called
 for downloading of all data at once or would you instead be better off
 with real-time queries to a web service? It's questions like that we're
 interested in brainstorming with this group about.

 Basically, what type of access to the data we're generating is most
 important, since we have finite resources to expend on this right now.

 Thanks, all!

 Jenn

 [1] http://www.loc.gov/cds/downloads/FRBR.PDF
 [2] http://vfrbr.info

 
 Jenn Riley
 Metadata Librarian
 Digital Library Program
 Indiana University - Bloomington
 Wells Library W501
 (812) 856-5759
 www.dlib.indiana.edu

 Inquiring Librarian blog: www.inquiringlibrarian.blogspot.com



Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?

2010-04-12 Thread Robert Sanderson
Depends on the sort of features required, in particular the access
patterns, and the hardware it's going to run on.

In my experience, NoSQL systems (for example Apache's Cassandra) have
extremely good distribution properties over multiple machines, much
better than SQL databases.  Essentially, it's easier to store a bunch
of key/values in a distributed fashion, as you don't need to do joins
across tables (there aren't any) and eventually consistent systems
(such as Cassandra) don't even need to always be internally consistent
between nodes.

If many concurrent write accesses are required, then NoSQL can also be
a good choice, for the same reasons as it's easily distributed.
And for the same reasons, it can be much faster than SQL systems with
the same data given a data model that fits the access patterns.

The flip side is that if later you want to do something that just
requires the equivalent of table joins, it has to be done at the
application level.  This is going to be MUCH MUCH slower and harder
than if there was SQL underneath.


Rob


On Mon, Apr 12, 2010 at 7:55 AM, Thomas Dowling tdowl...@ohiolink.edu wrote:
 So let's say (hypothetically, of course) that a colleague tells you he's
 considering a NoSQL database like MongoDB or CouchDB, to store a couple
 tens of millions of documents, where a document is pretty much an
 article citation, abstract, and the location of full text (not the full
 text itself).  Would your reaction be:

 That's a sensible, forward-looking approach.  Lots of sites are putting
 lots of data into these databases and they'll only get better.

 This guy's on the bleeding edge.  Personally, I'd hold off, but it could
 work.

 Schedule that 2012 re-migration to Oracle or Postgres now.

 Bwahahahah!!!

 Or something else?



 (http://en.wikipedia.org/wiki/NoSQL is a good jumping-in point.)


 --
 Thomas Dowling
 tdowl...@ohiolink.edu



[CODE4LIB] Memento Updates

2010-04-05 Thread Robert Sanderson
*** Apologies for cross-posting ***

We are excited to share some news about the Memento (Time Travel for the
Web) effort. Memento proposes to extend HTTP with datetime content
negotiation as a means to better integrate the present and past Web. The
Memento effort is partly funded by the Library of Congress.

= The MementoFox add-on for Firefox browsers has been released. It allows
time travel on the Web in a manner compliant with the Memento framework.
  *  The MementoFox add-on can be downloaded at 
https://addons.mozilla.org/en-US/firefox/addon/100298.
  *  Suggested Web time travels that can be undertaken using the add-on are
described at http://www.mementoweb.org/demo/. They involve navigations for
both the document Web and the Linked Data cloud.

= There is also a Memento plug-in available for the MediaWiki
platform.  The plug-in provides support for Memento-style navigation of a
Wiki's history pages.
  *  The MediaWiki plug-in can be downloaded at  
http://www.mediawiki.org/wiki/Extension:Memento.
  *  If you run a MediaWiki platform, please install this plug-in and let us
know the URI of your Wiki.

= Further pointers for recent Memento developments:
  *  Memento site http://www.mementoweb.org.
  *  Since Memento was first announced in November 2009, improvements have
been made to the technical framework. Most notably, all of the concerns
related to Web caching have been addressed such that the framwork now takes
maximal advantage of the existing caching infrastructure. Overviews of the
framework are available via http://www.mementoweb.org/guide/.
  *  Some major Web Archives have started working towards Memento support.
See http://www.mementoweb.org/events/IA201002/.

We are very interested in your feedback. Discussions are welcomed on the
Memento list at http://groups.google.com/group/memento-dev/.

On behalf of the Memento team:

Herbert Van de Sompel - Los Alamos National Laboratory
Michael L. Nelson - Old Dominion University
Robert Sanderson - Los Alamos National Laboratory