Re: [CODE4LIB] OCLC Classify API - sfa vs. nsfa

2012-06-21 Thread Houghton,Andrew
Without checking WebDewey or the schedules: the full class numbers in the 
records were probably built from the schedules, and the "truncated" numbers are 
probably what is actually enumerated in the schedules. That's my best guess.
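
If that guess is right, the nsfa values in the examples quoted below could be 
reproduced by trimming digits off the built number until it lands on an 
enumerated one. A minimal sketch in Python, with a purely hypothetical 
enumerated set:

# Hypothetical subset of schedule-enumerated numbers (illustration only).
ENUMERATED = {"004.6", "004.67", "004.678", "510.7", "510.78"}

def normalize(afa):
    ddc = afa
    while ddc and ddc not in ENUMERATED:
        ddc = ddc[:-1].rstrip(".")
    # A non-numeric afa like "B" trims down to nothing, which would line
    # up with the nsfa:null examples below.
    return ddc or None

print(normalize("004.6782"))   # -> 004.678, as in the first example below
print(normalize("510.7808"))   # -> 510.78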

Andy.

On Jun 21, 2012, at 17:02, "Arash.Joorabchi"  wrote:

> Hi Steve,
> 
> Thanks very much for following this up with your colleagues.
> 
> This cleanup/normalization makes perfect sense. However, I am finding a 
> considerable number of cases where, besides the simple cleanup, the actual 
> DDC numbers have been shortened in the normalized form (nsfa). For example, 
> for the following work the DDC class "Cloud computing" has been changed to 
> "Internet" (004.678 Internet, 004.6782 Cloud computing):
> 
> controlNumber: 757133257 DDC -> afa:004.6782 nsfa:004.678
> 
> More examples:
> 
> controlNumber: 32155131 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 31864085 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 41037721 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 48955829 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 52254624 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 47139254 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 53357482 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 26847297 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 31864085 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 213480039 DDC -> afa:572.802855133 nsfa:572.80285
> controlNumber: 216938373 DDC -> afa:776.0941 nsfa:776.09
> controlNumber: 45175701 DDC -> afa:B nsfa:null
> controlNumber: 21024620 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 33055109 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 31864085 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 28566296 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 223021753 DDC -> afa:370.70994 nsfa:370.7
> controlNumber: 42816159 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 44041153 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 30599637 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 488524108 DDC -> afa:025.0427 nsfa:025.04
> controlNumber: 496159895 DDC -> afa:E nsfa:null
> controlNumber: 50004966 DDC -> afa:303.699669 nsfa:303.69
> controlNumber: 664661422 DDC -> afa:347.7109 nsfa:347.71
> controlNumber: 221506914 DDC -> afa:623.8886 nsfa:623.888
> controlNumber: 22710093 DDC -> afa:347.3079 nsfa:347.307
> controlNumber: 50730923 DDC -> afa:B nsfa:null
> controlNumber: 61122961 DDC -> afa:FIC nsfa:null
> controlNumber: 27266348 DDC -> afa:347.30082 nsfa:347.3
> controlNumber: 660539496 DDC -> afa:792.0941090511 nsfa:792.0941
> controlNumber: 646112101 DDC -> afa:809.93358209082 nsfa:809.93358
> controlNumber: 221472619 DDC -> afa:823.08720992870994 nsfa:823.0872099287
> controlNumber: 30111307 DDC -> afa:813.08109975 nsfa:813.08109
> controlNumber: 227654172 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 16057801 DDC -> afa:000 nsfa:null
> controlNumber: 11593698 DDC -> afa:510.784403 nsfa:510.78
> controlNumber: 11213815 DDC -> afa:510.784472 nsfa:510.78
> controlNumber: 37507275 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 10410101 DDC -> afa:510.7844 nsfa:510.78
> controlNumber: 22666045 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 20919707 DDC -> afa:621.3994 nsfa:621.399
> controlNumber: 49944222 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 31855390 DDC -> afa:005.711015118 nsfa:005.711
> controlNumber: 26922572 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 31279447 DDC -> afa:FIC nsfa:null
> controlNumber: 56208849 DDC -> afa:FIC nsfa:null
> controlNumber: 671709856 DDC -> afa:518.5 nsfa:518
> controlNumber: 195735685 DDC -> afa:670.15196 nsfa:670
> controlNumber: 639925075 DDC -> afa:164.04 nsfa:164
> controlNumber: 20649668 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 228195232 DDC -> afa:364.374094209033 nsfa:364.3740942
> controlNumber: 427676546 DDC -> afa:497.124 nsfa:497.1
> controlNumber: 144991217 DDC -> afa:B nsfa:null
> controlNumber: 227922603 DDC -> afa:511.36071 nsfa:511.36
> controlNumber: 26241817 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 42901706 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 45166540 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 28371893 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 246976751 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 29328912 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 25144023 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 32231470 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 22471430 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 21054877 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 23445479 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 22471351 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 43323375 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 35820511 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 40696610 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 24129963 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 28566299 DDC -> afa:510.7808 nsfa:510.78
> controlNumber: 36774829 DDC -> afa:510.7808 nsfa:510.78

Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Houghton,Andrew
> -Original Message-
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
> Jonathan Rochkind
> Sent: Tuesday, April 17, 2012 19:55
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] more on MARC char encoding: Now we're about
> ISO_2709 and MARC21
> 
> Okay, forget XML for a moment, let's just look at marc 'binary'.
> 
> First, for Anglophone-centric MARC21.
> 
> The LC docs don't actually say quite what I thought about leader byte
> 09, used to advertise encoding:
> 
> 
> a - UCS/Unicode
> Character coding in the record makes use of characters from the
> Universal Coded Character Set (UCS) (ISO 10646), or Unicode™, an
> industry subset.
> 
> 
> 
> That doesn't say UTF-8. It says UCS or "Unicode". What does that
> actually mean?  Does it mean UTF-8, or does it mean UTF-16 (closer to
> what used to be called "UCS" I think?).  Whatever it actually means, do
> people violate it in the wild?
> 
First, UCS and Unicode mean basically the same thing. Second, UTF-8, UTF-16, 
and UTF-32 are encoding forms for UCS/Unicode. The MARC documentation does 
actually say that MARC binary records *must* be encoded in UTF-8 when LDR/09 
contains the value 'a'.

You need to refer to the appropriate standards for this information and 
definitions:


Unicode specifies three encoding forms, of which only one, UTF-8 (UCS 
Transformation Format 8), is authorized for use in MARC 21 records.


UCS. Acronym for Universal Character Set, which is specified by International 
Standard ISO/IEC 10646, which is equivalent in repertoire to the Unicode 
Standard.


Unicode Encoding Form. A character encoding form that assigns each Unicode 
scalar value to a unique code unit sequence. The Unicode Standard defines three 
Unicode encoding forms: UTF-8, UTF-16, and UTF-32. (See definition D79 in 
Section 3.9, Unicode Encoding Forms.)


UTF-8. A multibyte encoding for text that represents each Unicode character 
with 1 to 4 bytes, and which is backward-compatible with ASCII. UTF-8 is the 
predominant form of Unicode in web pages. More technically: (1) The UTF-8 
encoding form. (2) The UTF-8 encoding scheme. (3) “UCS Transformation Format 
8,” defined in Annex D of ISO/IEC 10646:2003, technically equivalent to the 
definitions in the Unicode Standard.


UTF-16. A multibyte encoding for text that represents each Unicode character 
with 2 or 4 bytes; it is not backward-compatible with ASCII. It is the internal 
form of Unicode in many programming languages, such as Java, C#, and 
JavaScript, and in many operating systems. More technically: (1) The UTF-16 
encoding form. (2) The UTF-16 encoding scheme. (3) “Transformation format for 
16 planes of Group 00,” defined in Annex C of ISO/IEC 10646:2003; technically 
equivalent to the definitions in the Unicode Standard.
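
In practice that means a consumer of MARC 'binary' can dispatch on LDR/09. A 
minimal sketch in Python, assuming MARC-8 conversion is handled elsewhere 
(Python ships no MARC-8 codec):

def decode_marc(raw: bytes) -> str:
    # LDR/09 = 'a' -> the record *must* be UTF-8, per MARC 21.
    if raw[9:10] == b"a":
        return raw.decode("utf-8")
    # LDR/09 = ' ' -> MARC-8; needs an external MARC-8 to Unicode converter.
    if raw[9:10] == b" ":
        raise NotImplementedError("MARC-8 record: convert before decoding")
    raise ValueError("unexpected LDR/09 value: %r" % raw[9:10])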

Andy


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Houghton,Andrew
> Karen Coyle
> Sent: Tuesday, April 17, 2012 15:41
> Subject: Re: [CODE4LIB] MarcXML and char encodings
> 
> The discussions at the MARC standards group relating to Unicode all had
> to do with using Unicode *within* ISO2709. I can't find any evidence
> that MARCXML ever went through the standards process. (This may not be
> a
> bad thing.) So none of what we know about the MARBI discussions and
> resulting standards can really help us here, except perhaps by analogy.

Well, I can confirm that MARCXML didn't go through MARBI, since I was one of 
OCLC's representatives who helped solidify it. MARCXML came out of a meeting 
at LC between the MARC Standards Office, OCLC, RLG, and one or two other 
interested parties whom I cannot remember or find in my emails or notes about 
the meeting.


Andy.


Re: [CODE4LIB] MarcXML and char encodings

2012-04-17 Thread Houghton,Andrew
> Jonathan Rochkind
> Sent: Tuesday, April 17, 2012 14:18
> Subject: Re: [CODE4LIB] MarcXML and char encodings
> 
> Okay, maybe here's another way to approach the question.
> 
> If I want to have a MarcXML document encoded in Marc8 -- what should it
> look like? What should be in the XML declaration? What should be in the
> MARC header embedded in the XML? Or is it not in fact legal at all?
> 
> If I want to have a MarcXML document encoded in UTF8, what should it
> look like? What should be in the XML declaration? What should be in the
> MARC header embedded in the XML?
> 
> If I want to have a MarcXML document with a char encoding that is
> _neither_ Marc8 nor UTF8, but something else generally legal for XML --
> is this legal at all? And if so, what should it look like? What should
> be in the XML declaration? What should be in the MARC header embedded in
> the XML?

You cannot have a MARC-XML document encoded in MARC-8 (well, sort of, but it's 
not standard). To answer your questions you have to refer to a variety of 
standards:


In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", 
and "ISO-10646-UCS-4" should be used for the various encodings and 
transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", 
"ISO-8859-2", ... "ISO-8859-n" (where n is the part number) should be used 
for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and 
"EUC-JP" should be used for the various encoded forms of JIS X-0208-1997. It 
is recommended that character encodings registered (as charsets) with the 
Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just 
listed, be referred to using their registered names; other encodings should 
use names starting with an "x-" prefix. XML processors should match character 
encoding names in a case-insensitive way and should either interpret an 
IANA-registered name as the encoding registered at IANA for that name or treat 
it as unknown (processors are, of course, not required to support all 
IANA-registered encodings).

In the absence of information provided by an external transport protocol (e.g. 
HTTP or MIME), it is a fatal error for an entity including an encoding 
declaration to be presented to the XML processor in an encoding other than that 
named in the declaration, or for an entity which begins with neither a Byte 
Order Mark nor an encoding declaration to use an encoding other than UTF-8. 
Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not 
strictly need an encoding declaration.


1) The above says that <?xml version="1.0"?> means the same as 
<?xml version="1.0" encoding="UTF-8"?>, and if you prefer you can omit the XML 
declaration entirely; the document is then assumed to be UTF-8 unless there is 
a BOM (Byte Order Mark), which determines UTF-8 vs. UTF-16BE vs. UTF-16LE.
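
A sketch of that BOM check in Python, looking only at the first bytes of an 
entity (the real rules in the spec also involve the encoding declaration 
itself):

def sniff_encoding(entity: bytes) -> str:
    # Byte Order Marks, per the detection rules in the spec text above.
    if entity.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if entity.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if entity.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    return "UTF-8"  # no BOM and no encoding declaration: UTF-8 is assumed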

2) If you really wanted to encode the XML in MARC-8 you would need to specify 
an "x-" name: if you refer to the IANA character set registry 
<http://www.iana.org/assignments/character-sets>, MARC-8 isn't a registered 
character set, hence it cannot be specified in the encoding attribute unless 
the name is prefixed with "x-". Which implies that no standard XML library 
will know how to convert the MARC-8 characters into Unicode so the XML DOM can 
be used. So unless you want to write your own MARC-8 <=> Unicode conversion 
routines and integrate them into your preferred XML library, it isn't going to 
work out of the box for anyone but yourself.

When dealing with MARC-XML you should ignore the values in LDR/00-04, LDR/10, 
LDR/11, LDR/12-16, LDR/20-23. If you look at the MARC-XML schema you will note 
that the definition for leaderDataType specifies LDR/00-04 "[\d ]{5}", LDR/10 
and LDR/11 "(2| )", LDR/12-16 "[\d ]{5}", LDR/20-23 "(4500| )". Note the 
MARC-XML schema allows spaces in those positions because they are not relevant 
in the XML format, though very relevant in the binary format.

You probably should ignore LDR/09, since most MARC to MARC-XML converters do 
not change this value to 'a', although many converters do change the value 
when converting MARC binary between MARC-8 and UTF-8. The only valid character 
set for MARC-XML is Unicode, and it *should* be encoded in UTF-8 in Unicode 
normalization form D (NFD), although most XML libraries will not notice if it 
was encoded as UTF-16BE or UTF-16LE in normalization form D, since XML 
libraries internally work with Unicode.
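
Python's unicodedata can illustrate the NFD/NFC distinction:

import unicodedata

nfc = "\u00e9"                                   # e-acute, precomposed
nfd = unicodedata.normalize("NFD", nfc)          # "e" + combining acute
assert nfc != nfd                                # different code point sequences
assert unicodedata.normalize("NFC", nfd) == nfc  # same abstract character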

I could have sworn that this information was specified on LC's site at one 
point in time, but I'm having trouble finding the documentation.


Hope this helps, Andy.


Re: [CODE4LIB] LCSH and Linked Data

2011-04-07 Thread Houghton,Andrew
My bad in (2): that should have been 781, which is LC's way to indicate the 
geographic form used for a 181 when a heading may be geographically 
subdivided. The point is, when you are trying to do authority matching/mapping 
you have to match against the 181s in LCSH *and* the 781s in NAF. This is an 
oddity of the LC authority file that people may not be aware of, hence why I 
pointed it out. As I indicated, in my mapping projects I have taken LCSH and 
added new 181 records based on the 781s found in NAF. This allows the matching 
process to work reasonably well without dragging in the entire NAF for 
searching and matching. However, this still doesn't give the complete picture, 
since in LCSH the *construction rules* allow you to use things in the name 
authority file as subjects, ugh. Effectively, LCSH isn't useful by itself when 
trying to match/decompose 6XX in bibliographic records; you really need access 
to NAF as well. Things get worse when talking about the Children's headings, 
since you can pull from both LCSH and NAF, ugh-ugh. While LC would like us to 
think of the authority file as three separate authorities, LCSH, LCSHac, and 
NAF, in reality the dependencies require you to ignore the thesaurus 
boundaries and just treat the entire authority file as one thesaurus. We 
struggled with this in the terminology services project, especially when the 
references in one thesaurus cross over into the other thesauri.

 

Andy.

 

From: Ya'aqov Ziso [mailto:yaaq...@gmail.com] 
Sent: Thursday, April 07, 2011 13:47
To: Code for Libraries; Houghton,Andrew
Cc: Hickey,Thom; LeVan,Ralph
Subject: Re: [CODE4LIB] LCSH and Linked Data

 

Andrew, as always, most helpful news, kindest thanks! more [YZ] below:

 

1.   No disagreement, except that some 151s appear in the name file and 
some appear in the subject file:

n82068148     008/11=a 008/14=a  151 _ _ $a England
sh2010015057  008/11=a 008/14=b  151 _ _ $a Tabasco Mountains (Mexico)
[YZ] would it be possible then to use both files as sources and create one file 
for geographical names for our purpose(s)?

2.   Yes, see n5359
151 _ _ $a Sonora (Mexico : State)
751 _ _ $z Mexico $z Sonora (State)

[YZ]  Both stand for a distinct cataloging usage. Jonathan's suggestion to 
consult LC may answer the question of which field/when to use for geographical 
names

3.   Oops, my apologies to my VIAF colleagues, I believe that geographic 
names are in the works… 

[YZ] inshAllah!

 

4. That is probably correct. England may appear as both a 110 *and* a 151 
because the 110 signifies the concept for the country entity while the 151 
signifies the concept for the geographic place. A subtle distinction...

[YZ] Exactly. This distinction called for creating both a 110 AND a 151. But 
we are talking about 151. The case where there is both a 110 and a 151 does 
NOT apply to all geographic names, only to some.

 

[YZ] VIAF would be helpful to provide a way to limit geographical names ONLY to 
151 names and their cross references.



Re: [CODE4LIB] LCSH and Linked Data

2011-04-07 Thread Houghton,Andrew
That is probably correct. England may appear as both a 110 *and* a 151 because 
the 110 signifies the concept for the country entity while the 151 signifies 
the concept for the geographic place. A subtle distinction...

Andy.

> -Original Message-
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
> Ya'aqov Ziso
> Sent: Thursday, April 07, 2011 11:56
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] LCSH and Linked Data
> 
> Ralph, Owen's pointing to a list where corporate (110) and geographic
> names
> (151) are mixed.
> 
> Thanks Owen, I haven't seen that the first time. I guess you got that
> mixed
> 110/151 when limiting to 'exact name'. Perhaps Andrew has a workaround.
> 
> *Ya'aqov*
> 
> 
> 
> 
> 
> On Thu, Apr 7, 2011 at 10:34 AM, LeVan,Ralph  wrote:
> 
> > If you look at the fields those names come from, I think they mean
> > England as a corporation, not England as a place.
> >
> > Ralph
> >
> > > -Original Message-
> > > From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On
> Behalf
> > Of
> > > Owen Stephens
> > > Sent: Thursday, April 07, 2011 11:28 AM
> > > To: CODE4LIB@LISTSERV.ND.EDU
> > > Subject: Re: [CODE4LIB] LCSH and Linked Data
> > >
> > > Still digesting Andrew's response (thanks Andrew), but
> > >
> > > On Thu, Apr 7, 2011 at 4:17 PM, Ya'aqov Ziso 
> > wrote:
> > >
> > > > *Currently under id.loc.gov you will not find name authority
> > records, but
> > > > you can find them at viaf.org*.
> > > > *[YZ]*  viaf.org does not include geographic names. I just
> checked
> > there
> > > > England.
> > > >
> > >
> > > Is this not the relevant VIAF entry
> > > http://viaf.org/viaf/14299580
> > >
> > >
> > > --
> > > Owen Stephens
> > > Owen Stephens Consulting
> > > Web: http://www.ostephens.com
> > > Email: o...@ostephens.com
> >
> 
> 
> 
> --
> *ya'aqov**ZISO | **yaaq...@gmail.com **| 856 217 3456
> 
> *


Re: [CODE4LIB] LCSH and Linked Data

2011-04-07 Thread Houghton,Andrew
1.   No disagreement, except that some 151s appear in the name file and 
some appear in the subject file:

n82068148     008/11=a 008/14=a  151 _ _ $a England
sh2010015057  008/11=a 008/14=b  151 _ _ $a Tabasco Mountains (Mexico)



2.   Yes, see n5359
151 _ _ $a Sonora (Mexico : State)
751 _ _ $z Mexico $z Sonora (State)



3.   Oops, my apologies to my VIAF colleagues, I believe that geographic 
names are in the works… or at least I was under the impression they were from a 
discussion I had last night.



 

From: Ya'aqov Ziso [mailto:yaaq...@gmail.com] 
Sent: Thursday, April 07, 2011 11:18
To: Code for Libraries; Houghton,Andrew
Cc: LeVan,Ralph
Subject: Re: [CODE4LIB] LCSH and Linked Data

 

Andrew, please see [YZ] below

 

181 __ $z England  and you would NOT find this heading in LCSH. This is issue 
one. Unfortunately, LC does not create 181 in LCSH (actually I think there are 
some, but not if it’s a name), instead they create a 781 in the name authority 
record. 

[YZ]  MARC/LCSH distinguishes between names 100 and geographic names 151 in 
their authority record. You'll find all geographic names if you look for 151 
records.

 

So to find the corresponding $z England we need to go to the name authority 
record 150 England with LCCN n82068148. 

[YZ]  LCCN n82068148 authority record is  for 151 England.

Also Andrew, are you indicating there is a difference between the form of 
geographic name in 151$a and 781$z   -- ?

 

Currently under id.loc.gov you will not find name authority records, but you 
can find them at viaf.org. 

[YZ]  viaf.org does not include geographic names. I just checked England 
there. It makes little sense to mix personal/corporate names with geographic 
ones. Let's see what Ralph comments.

 

Ya'aqov



Re: [CODE4LIB] LCSH and Linked Data

2011-04-07 Thread Houghton,Andrew
After having done numerous matching and mapping projects, I can say there are 
some issues that you will face with your strategy, assuming I understand it 
correctly. Trying to match a heading starting at the leftmost subfield and 
working forward will not necessarily produce correct results when matching 
against the LCSH authority file. Using your example:

 

650 _0 $a Education $z England $x Finance

 

is a good example of why processing the heading starting at the left will not 
necessarily produce the correct results.  Assuming I understand your proposal 
you would first search for:

 

150 __ $a Education

 

and find the heading with LCCN sh85040989. Next you would look for:

 

181 __ $z England

 

and you would NOT find this heading in LCSH. This is issue one. Unfortunately, 
LC does not create 181s in LCSH (actually I think there are some, but not if 
it's a name); instead they create a 781 in the name authority record. So to 
find the corresponding $z England we need to go to the name authority record 
150 England with LCCN n82068148. Currently under id.loc.gov you will not find 
name authority records, but you can find them at viaf.org. The second issue 
with your example is that you want to find the "longest" matching heading. 
While the pieces are there, so is the enumerated authority heading:

 

150 __ $a Education $z England

 

as LCCN sh2008102746. So your heading is actually composed of the enumerated 
headings:

 

sh2008102746    150 __ $a Education $z England

sh2002007885    180 __ $x Finance

 

and not the separate headings:

 

sh85040989      150 __ $a Education

n82068148       150 __ $a England

sh2002007885    180 __ $x Finance

 

Although one could argue that either analysis is correct depending upon what 
you are trying to accomplish.

 

The matching algorithm I have used in the past contains two routines. The 
first, f(a), accepts a heading as a parameter, scrubs the heading, e.g., 
removes unnecessary subfields like $0, $3, $6, $8, etc., does any other 
pre-processing necessary on the heading, then calls the second function, f(b). 
The f(b) function accepts a heading as a parameter and recursively calls 
itself until it builds up the list of LCCNs that comprise the heading. It 
first looks for the given heading; when it doesn't find it, it removes the 
*last* subfield and recursively calls itself; otherwise it appends the found 
LCCN to the returned list and exits. This strategy finds the longest match. 
The headings are searched against an augmented LCSH database where the 781 
name authority records have been transformed into 181 records, keeping the 
LCCN of the name authority record. Not ideal, but it generally works well. 
Adjust the algorithm as needed.
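
A minimal sketch of those two routines in Python. lookup_lccn() and the tiny 
in-memory table are hypothetical stand-ins for the augmented LCSH database, 
and resolving the dropped tail on its own (so $x Finance is still found) is an 
assumption the paragraph above doesn't spell out:

# Hypothetical stand-in for the augmented LCSH database (781s folded in).
ENUMERATED = {
    (("a", "Education"),): "sh85040989",
    (("a", "Education"), ("z", "England")): "sh2008102746",
    (("x", "Finance"),): "sh2002007885",
}

def lookup_lccn(subfields):
    return ENUMERATED.get(tuple(subfields))

def f_a(heading):
    # Scrub control subfields ($0, $3, $6, $8, ...) then decompose.
    scrubbed = [sf for sf in heading if sf[0] not in {"0", "3", "6", "8"}]
    return f_b(scrubbed)

def f_b(heading):
    if not heading:
        return []
    lccn = lookup_lccn(heading)
    if lccn is not None:
        return [lccn]                 # longest match found; exit
    if len(heading) == 1:
        return []                     # unmatched single subfield; give up
    # Remove the *last* subfield and recurse, then resolve the dropped
    # subdivision on its own (assumption: subdivisions live as 18X records).
    return f_b(heading[:-1]) + f_b(heading[-1:])

print(f_a([("a", "Education"), ("z", "England"), ("x", "Finance")]))
# -> ['sh2008102746', 'sh2002007885']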

 

Hope this helps, Andy.

 

 

From: public-lld-requ...@w3.org [mailto:public-lld-requ...@w3.org] On Behalf Of 
Owen Stephens
Sent: Thursday, April 07, 2011 08:11
To: Thomas Meehan
Cc: Code for Libraries; public-lld; f.zabl...@open.ac.uk
Subject: Re: LCSH and Linked Data
Importance: Low

 

Thanks Tom - very helpful

Perhaps this suggests that rather than using an order we should check 
combinations while preserving the order of the original 650 field (I assume 
this should in theory always be correct - or at least done to the best of the 
cataloguer's knowledge)?

 

So for:

 

650 _0 $$a Education $$z England $$x Finance.

 

check:

 

Education

England (subdiv)

Finance (subdiv)

Education--England

Education--Finance

Education--England--Finance

 

While for 650 _0 $$a Education $$x Economic aspects $$z England we check

 

Education

Economic aspects (subdiv)

England (subdiv)

Education--Economic aspects

Education--England

Education--Economic aspects--England


- It is possible for other orders in special circumstances, e.g. with 
language dictionaries which can go something like:

650 _0 $$a English language $$v Dictionaries $$x Albanian.

 

This possibility would also be covered by preserving the order - check:

 

English Language

Dictionaries (subdiv)

Albanian (subdiv)

English Language--Dictionaries

English Language--Albanian

English Language--Dictionaries--Albanian

 

Creating possibly invalid headings isn't necessarily a problem - as we won't 
get a match on id.loc.gov anyway. (Instinctively English Language--Albanian 
doesn't feel right)
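
A sketch of generating those candidate headings in Python, preserving the 
original subfield order (each element on its own, then the main heading with 
every ordered subset of the subdivisions):

import itertools

def candidate_headings(main, subdivs):
    yield main
    for s in subdivs:
        yield s
    # combinations() keeps the input order, so "England" always precedes
    # "Finance" when both appear.
    for n in range(1, len(subdivs) + 1):
        for combo in itertools.combinations(subdivs, n):
            yield "--".join((main,) + combo)

print(list(candidate_headings("Education", ["England", "Finance"])))
# ['Education', 'England', 'Finance', 'Education--England',
#  'Education--Finance', 'Education--England--Finance']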

 


- Some of these are repeatable, so you can have two $$vs following each 
other (e.g. Biography--Dictionaries); two $$zs (very common), as in 
Education--England--London; two $$xs (e.g. Biography--History and criticism).

OK - that's fine, we can use each individually and in combination for any 
repeated headings I think

 

- I'm not sure I've ever come across a lot of $$bs in 650s. Do you have a 
lot of them in the database?

Hadn't checked until you asked! We have 1 in the dataset in question (c.30k 
records) :)

 

I'm not sure how possible it would be to come up with a definitive list 
of (reasonable) 

Re: [CODE4LIB] Looking for a Word to EAD converter

2010-10-07 Thread Houghton,Andrew
Don't know whether one exists or not, but the fact that the documents are in MS 
Word means that you could attach some VBA (Visual Basic for Applications) 
macros to the documents and run a macro that extracts and creates XML.
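
If VBA isn't appealing, the same extract-and-emit idea can be done outside 
Word. A rough sketch under two assumptions that go beyond the suggestion 
above: the .doc files are first batch-converted to .docx, and the third-party 
python-docx package is installed (the output is an EAD-ish skeleton, not valid 
EAD):

from docx import Document                 # pip install python-docx
import xml.etree.ElementTree as ET

def word_to_ead_skeleton(path):
    ead = ET.Element("ead")
    desc = ET.SubElement(ead, "archdesc")
    for para in Document(path).paragraphs:
        text = para.text.strip()
        if not text:
            continue
        # Map the finding aids' standard subheadings to <head> elements.
        tag = "head" if para.style.name.startswith("Heading") else "p"
        ET.SubElement(desc, tag).text = text
    return ET.tostring(ead, encoding="unicode")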

Andy.

> -Original Message-
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Cornwall, Daniel D (EED)
> Sent: Thursday, October 07, 2010 01:36 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] Looking for a Word to EAD converter
> 
> Hi All,
> 
> 
> 
> While I think what I'm looking for doesn't exist, I wanted to ask some
> experts before making confident assertions.
> 
> 
> 
> Our institution has a lot of finding aids for photo and manuscript
> collections in MS Word Format. They have pretty standard subheadings.
> An
> example can be found at
> www.library.state.ak.us/hist/hist_docs/finding_aids/MS220.doc
> 
> .
> 
> 
> 
> 
> I've had inquiries about getting these Word finding aids converted to
> EAD (Encoded Archival Description) through some sort of converter. I
> haven't been able to locate any such program, but maybe that's a
> reflection on my searching skills.
> 
> 
> 
> There are a number of programs to create EAD finding aids from scratch
> and I've recommended acquiring one of these programs and getting staff
> to rekey/copy & paste from Word into the EAD finding aid program. Staff
> are not willing to do this at least until I can demonstrate that there
> is no automated way to convert our finding aids. Of course, if there is
> a converter, so much the better.
> 
> 
> 
> Thanks in advance for any enlightenment you can give me. - Daniel
> 
> 
> 
> ===
> 
> Daniel Cornwall
> 
> Head of Technical and Imaging Services
> 
> Division of Libraries, Archives and Museums
> 
> PO Box 110571
> Juneau, AK 99811-0571
> Phone (907) 465-6332
> 
> Fax (907) 465-2665
> E-Mail: dan.cornw...@alaska.gov
> 
> See Division resources at http://lam.alaska.gov
> 
> .
> 
> 
> 
> 
> 
> Any opinions expressed in this e-mail are mine alone and not those of
> my
> employer unless explicitly stated.
> 
> 


Re: [CODE4LIB] generating unique integers

2010-05-28 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Eric Lease Morgan
> Sent: Friday, May 28, 2010 09:35 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] generating unique integers
> 
> Given a list of unique strings, how can I generate a list of short,
> unique integers?
> 
> I have a list of about 250 unique author/title combinations, such as:
> 
>   Aeschylus / Prometheus Bound
>   Aeschylus / Suppliant Maidens
>   American State / Articles of confederation
>   American State / Declaration of Independence
>   Aquinas / Summa Theologica
>   Aristophanes / Acharnians
>   Aristophanes / Clouds
>   Aristophanes / Ecclesiazusae
>   Aristotle / On Generation And Corruption
>   Aristotle / On The Gait Of Animals
>   Aristotle / On The Generation Of Animals
>   ...
> 
> From each author/title combination I want to create a file name (key).
> Specifically, I want a file name with the following form: author-
> firstwordofthetitle-integer.txt  Such a scheme will make it
> (relatively) easy for me to look at the file name and know the what
> title is and by whom.
> 
> Using Perl, how can I convert the author/title combination into some
> sort of integer, checksum, or unique value that is the same every time
> I run my script? I don't want to have to remember what was used before
> because I don't want to maintain a list of previously used keys. Should
> I use some form of the pack function? Should I sum the ASCII values of
> each character in the author/title combination?

You could MD5 hash the author/title combination, which would give you the same 
hash so long as the author/title combination was the same, e.g., same letter 
case and spelling, etc. However, that doesn't meet your requirement of a small 
integer, but if you are using the value for a Perl hash key it might not 
matter all that much.
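
A sketch of the idea in Python (Perl's Digest::MD5 md5_hex yields the same 
digest); taking a few bytes of the digest gives a shortish integer that is 
stable across runs, at a collision risk that is negligible for ~250 strings:

import hashlib
import re

def key_for(author_title):
    digest = hashlib.md5(author_title.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")   # stable 32-bit integer

def filename(author, title):
    first_word = re.split(r"\W+", title.strip())[0].lower()
    return "%s-%s-%d.txt" % (author.lower(), first_word,
                             key_for("%s / %s" % (author, title)))

print(filename("Aeschylus", "Prometheus Bound"))
# -> aeschylus-prometheus-<same integer every run>.txt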

Andy.


Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

2010-05-13 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> MJ Suhonos
> Sent: Thursday, May 13, 2010 12:34 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
> 
> PS.  For those RDF-ites among us, I also happen to think that JSON
> makes a great data structure for a triple store, eg. [3] — but I think
> storing absolute URLs as predicates like the N2 spec does is stupid.
> 
> 1. http://robotlibrarian.billdueber.com/data-structures-and-
> serializations/
> 2. http://en.wikipedia.org/wiki/Worse_is_better
> 3. http://n2.talis.com/wiki/RDF_JSON_Specification

Hmm... having spent time recently creating triples using [3], it is painful. 
From an RDF view [3] makes some reasonable choices in its representation, but 
before you can generate [3] you have to collect everything associated with the 
subject URI, and that can be painful, depending upon your input data, in 
comparison to N3 or Turtle.

BTW, let's not forget an easier, more standardized approach to generating RDF 
in JSON, which has broader implications beyond SPARQL and can be used as an 
exchange format like [3]:

[4] Serializing SPARQL Query Results in JSON



Andy.


Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

2010-05-13 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Kyle Banerjee
> Sent: Thursday, May 13, 2010 11:51 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
> 
> JSON may be a great data exchange format, but it's not a markup
> language like XML, so doing things like preserving field order
> or just getting a bird's eye view of content across multiple
> fields or subfields becomes more complex.

Huh? JSON arrays preserve element order just like XML preserves element order. 
Combining JSON labeled arrays and objects provides you with the same 
mechanisms available in markup languages such as XML.
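
A quick demonstration, with a hypothetical MARC-ish structure:

import json

fields = [
    {"tag": "245", "subfields": [["a", "A title"], ["c", "An author"]]},
    {"tag": "500", "subfields": [["a", "A note"]]},
]
# Arrays come back in exactly the order they were serialized in.
assert json.loads(json.dumps(fields)) == fields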


Andy.


Re: [CODE4LIB] Microsoft Zentity

2010-04-28 Thread Houghton,Andrew
Well... it kind of "requires" SQL Server. The project page says that it uses 
the Entity Framework and LINQ, and that means you could use MySQL or Oracle, 
since there are ADO.NET adapters for them. Microsoft has a pluggable data 
layer that anyone can write an adapter for.
Andy.

> -Original Message-
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Ross Singer
> Sent: Wednesday, April 28, 2010 11:23 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Microsoft Zentity
> 
> On Wed, Apr 28, 2010 at 10:21 AM, Houghton,Andrew 
> wrote:
> > If its open source, I assume that it could be adapted to run under
> Mono and then you could run it on Linux, Macs, etc.  It may even run
> under Mono, don't know, haven't played with it.
> >
> 
> Well, it requires SQLServer, so I think this is probably going to be
> much more difficult than it's worth.
> 
> -Ross.


Re: [CODE4LIB] Microsoft Zentity

2010-04-28 Thread Houghton,Andrew
If it's open source, I assume that it could be adapted to run under Mono and 
then you could run it on Linux, Macs, etc. It may even run under Mono; I don't 
know, I haven't played with it.

Andy.

> -Original Message-
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Ethan Gruber
> Sent: Wednesday, April 28, 2010 10:17 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Microsoft Zentity
> 
> It seems to me that the major flaw of the software is that it isn't
> cross-platform, which comes as no surprise.  But I feel Microsoft
> didn't do
> their market research.  While the financial and business sectors are
> heavily
> reliant on Microsoft servers, American universities, and by extension,
> research libraries, are not.  If they really wanted to make a
> "commitment to
> support the academic community" as they say on the Zentity website,
> they
> would have developed it for a platform that the academic community
> actually
> uses.
> 
> Ethan
> 
> On Wed, Apr 28, 2010 at 10:11 AM, David Kane  wrote:
> 
> > Andy,
> >
> > It is a highly extensible platform, based on .NET and windows.  It is
> also
> > open source!  We did install it and have a play around with it.  But
> not as
> > much as we would have liked, primarily because of skillset and
> resource
> > issues here.
> >
> > Microsoft have come late into the repository space, and have had a
> really
> > good look at the kinds of mistakes others have made.
> >
> > Let us know how you get on.
> >
> > David.
> >
> >
> >
> >
> >
> > On 28 April 2010 14:54, Andrew Ashton 
> wrote:
> >
> > > I'm looking for some background information on Microsoft's Zentity (their
> (their
> > > digital repository software).  If anyone has first-hand experience
> > working
> > > with it, or if you know of institutions that have implemented it,
> please
> > > contact me.
> > >
> > > Thanks,
> > >
> > > Andy Ashton
> > > Senior Research Programmer
> > > Center for Digital Scholarship, Brown University Library
> > > andrew_ash...@brown.edu
> > >
> >
> >
> >
> > --
> > David Kane
> > Systems Librarian
> > Waterford Institute of Technology
> > Ireland
> > http://library.wit.ie/
> > davidfk...@googlewave.com
> > T: ++353.51302838
> > M: ++353.876693212
> >


[CODE4LIB] List of MARC flavors

2010-03-23 Thread Houghton,Andrew
Does anyone know where there might be a list of the various flavors of MARC?

I currently have:

marc21   MARC 21
usmarc   US MARC        Replaced by marc21
rusmarc  Russian MARC
canmarc  Canadian MARC  Replaced by marc21
ukmarc   UK MARC        Replaced by marc21
cmarc    Chinese MARC
unimarc  UNIMARC


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-15 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Monday, March 15, 2010 12:40 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]
> 
> On the one hand, I'm all for following specs. But on the other...should
> we really be too concerned about dealing with the full flexibility of
> the 2709 spec, vs. what's actually used? I mean, I hope to god no one 
> is actually creating new formats based on 2709!
> 
> If there are real-life examples in the wild of, say, multi-character
> indicators, or subfield codes of more than one character, that's one
> thing.

Yes, there are real-life examples, e.g., MarcXchange, now ISO 25577, being the 
one that comes to mind, where IFLA was *compelled* to create a new MARC-XML 
specification, in a different namespace, and the main difference between the 
two specifications was being able to specify up to nine indicator values. 
Given that they are optional in the MarcXchange XML schema, personally I feel 
that IFLA and LC could have just extended the MARC-XML schema and added the 
optional attributes, making all existing MARC-XML documents MarcXchange 
documents, and the library community wouldn't have to deal with two XML 
specifications. This is another example of the library community creating more 
barriers for people entering their market.

> BTW, in the stuff I proposed, you know a controlfield vs. a datafield
> because of the length of the array (2 vs 5); it's well-specified, but
> by the size of the tuple, not by label.

Ahh... I overlooked that aspect of your proposal.


Andy.


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-15 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Monday, March 15, 2010 12:19 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]
> 
> I would like to see ind1 and ind2 get their own fields, though, for
> easier
> use of stuff like jsonpath in json-centric nosql databases.

I'll add this issue to the discussion page. Your point about being able to 
index a specific indicator property has value. So a resolution to the issue 
would be to have 9 indicator properties, per ISO 2709 and MarcXchange, but 
require only the ones implied by the indicator count in the leader.


Thanks, Andy.


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-15 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Jonathan Rochkind
> Sent: Monday, March 15, 2010 11:53 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]
> 
> I would just ask why you didn't use Bill Dueber's already existing
> proto-spec, instead of making up your own incompatible one.

Because the internal use of our specification predated Bill's blog entry, 
dated 2010-02-25, by almost a year. Bill's post reminded me that I had not 
published or publicly discussed our specification.

Secondly, Bill's specification loses semantics from ISO 2709, as I previously 
pointed out. His specification clumps control and data fields into one 
property named fields. According to ISO 2709, control and data fields have 
different semantics. You could have a control field tagged as 001 and a data 
field tagged as 001 which have different semantics. MARC-21 has imposed 
certain rules for tag assignment such that this isn't a concern, but other 
systems based on ISO 2709 may not.
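
Purely as an illustration of that distinction (these are not the draft's 
actual property names), a serialization can keep the two field kinds apart so 
the two 001s above never collide; sketched here as a Python literal:

record = {
    "leader": "00000cam a2200000 a 4500",
    "controlfield": [
        {"tag": "001", "data": "ocm00000001"},    # bare data, no indicators
    ],
    "datafield": [
        {"tag": "001", "ind1": " ", "ind2": " ",  # legal in plain ISO 2709
         "subfields": [{"code": "a", "data": "..."}]},
    ],
}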


Andy.


Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]

2010-03-14 Thread Houghton,Andrew
> From: Houghton,Andrew
> Sent: Saturday, March 06, 2010 06:59 PM
> To: Code for Libraries
> Subject: RE: [CODE4LIB] Q: XML2JSON converter
> 
> Depending on how much time I get next week I'll talk with the developer
> network folks to see what I need to do to put a specification under
> their infrastructure

I finished documenting our existing use of MARC-JSON.  The specification can be 
found on the OCLC developer network wiki [1].  Since it is a wiki, registered 
developer network members can edit the specification and I would ask that you 
refrain from doing so.

However, please do use the discussion tab to record issues with the 
specification or add additional information to existing issues.  There are 
already two open issues on the discussion tab and you can use them as a 
template for new issues.  The first issue is Bill Dueber's request for some 
sort of versioning and the second issue is whether the specification should 
specify the flavor of MARC, e.g., marc21, unicode, etc.

It is recommended that you place issues on the discussion tab since that will 
be the official place for documenting and disposing of them.  I do monitor this 
listserve and the OCLC developer network listserve, but I only selectively look 
at messages on those listserves.  If you would like to use this listserve or 
the OCLC developer network listserve to discuss the MARC-JSON specification, 
make sure you place MARC-JSON in the subject line, to give me a clue that I 
*should* look at that message, or directly CC my e-mail address on your post.

This message marks the beginning of a two-week comment period on the 
specification, which will end at midnight 2010-03-28.

[1] <http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11>


Thanks, Andy. 


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-08 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Benjamin Young
> Sent: Monday, March 08, 2010 09:32 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> Rather than using a newline-delimited format (the whole of which would
> not together be considered a valid JSON object) why not use the JSON
> array format with or without new lines? Something like:
> 
> [{"key":"value"}, {"key","value"}]
> 
> You could include new line delimiters after the "," if you needed to
> make pre-parsing easier (in a streaming context), but may be able to
> get
> away with just looking for the next "," or "]" after each valid JSON
> object.
> 
> That would allow the entire stream, if desired, to be saved to disk and
> read in as a single JSON object, or the same API to serve smaller JSON
> collections in a JSON standard way.

I think we just went around full circle again. There appear to be two distinct 
use cases when dealing with MARC collections. The first conforms to the ECMA 
262 JSON subset, which is what you described above:

[ { "key" : "value" }, { "key" : "value" } ]

its media type should be specified as application/json.

The second use case, about which there was some discussion between Bill Dueber 
and myself, is a newline-delimited format where the JSON array specifiers are 
omitted and the objects are specified one per line, without commas separating 
objects. The misunderstanding between Bill and me was that I took this 
"malformed" JSON to be sent as media type application/json, which is not what 
he was proposing. This newline-delimited JSON appears to be an import/export 
format in both CouchDB and MongoDB.

In the FAST work I'm doing I'm probably going to take an alternate approach to 
generating our 10,000 MARC record collection files for download.  The approach 
I'm going to take is to create valid JSON but make it easier for the CouchDB 
and MongoDB folks to import the collection of records.  The format will be:

[
{ "key" : "value" }
,
{ "key" : "value" }
]

the objects will be one per line, but the array specifier and comma delimiters 
between objects will appear on a separate line.  This would allow the CouchDB 
and MongoDB folks to run a simple sed script on the file before import:

sed -e '/^.$/D' file.json > file.txt

or if they are reading the data as a raw text file, they can just ignore all 
lines that start with the opening bracket, comma, or closing bracket, or 
alternately only process lines starting with an opening brace.
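
A sketch of that raw-text reading in Python, one record object per line:

import json

def read_records(path):
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("{"):      # skip the "[", ",", "]" lines
                yield json.loads(line)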

However, this doesn't mean that I'm balking on pursuing a separate media type 
specific to the library community that specifies a specific MARC JSON 
serialization encoded as a single line.

I see multiple steps here, the first being a consensus on serializing MARC 
(ISO 2709) in JSON, which begins with me documenting it so people can throw 
some darts at it. I don't think what we are proposing is controversial, but 
it's beneficial to have a variety of perspectives as input.


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-06 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Saturday, March 06, 2010 05:11 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> Anyway, hopefully, it won't be a huge surprise that I don't disagree
> with any of the quote above in general; I would assert, though, that
> application/json and application/marc+json should both return JSON
> (in the same way that text/xml, application/xml, and 
> application/marc+xml can all be expected to return XML). 
> Newline-delimited json is starting to crop up in a few places 
> (e.g. couchdb) and should probably have its own mime type
> and associated extension. So I would say something like:
> 
> application/json -- return json (obviously)
> application/marc+json  -- return json
> application/marc+ndj  -- return newline-delimited json

This sounds like consensus on how to deal with newline-delimited JSON in a 
standards-based manner.

I'm not familiar with CouchDB, but I am using MongoDB which is similar.  I'll 
have to dig into how they deal with this newline-delimited JSON.  Can you 
provide any references to get me started?

> In all cases, we should agree on a standard record serialization,
> though, and the pure-json returns should include something that 
> indicates what the heck it is (hopefully a URI that can act as a 
> distinct "namespace"-type identifier, including a version in it).

I agree that our MARC-JSON serialization needs some "namespace" identifier in 
it and it occurred to me that the way it is handling indicators, e.g., ind1 and 
ind2 properties, might be better handled as an array to accommodate IFLA's 
MARC-XML-ish where they can have from 1-9 indicator values.

BTW, our MARC-JSON content is specified in Unicode, not MARC-8, per the JSON 
standard, which means you need to use \u notation to specify characters in 
strings; I'm not sure I made that clear in earlier posts. A downside to the 
current ECMA 262 specification is that it doesn't support \U00XX notation, as 
Python does, for the extended characters. Hopefully that will get rectified in 
a future ECMA 262 specification.
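
Concretely, a character outside the Basic Multilingual Plane has to be written 
as a surrogate pair today; e.g. U+1D11E (musical symbol G clef), checked here 
with Python's json module:

import json

# "\uD834\uDD1E" is the surrogate-pair spelling of U+1D11E in a JSON string.
assert json.loads('"\\uD834\\uDD1E"') == "\U0001D11E"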

> The question for me, I think, is whether within this community,  anyone
> who provides one of these types (application/marc+json and
> application/marc+ndj) should automatically be expected to provide both.
> I don't have an answer for that.

I think this issue gets into familiar territory when dealing with RDF formats.  
Let's see, there is N3, NT, XML, Turtle, etc.  Do you need to provide all of 
them?  No, but it's nice of the server to at least provide NT or Turtle and 
XML.  Ultimately it's up to the server.  But the only difference between use 
cases #2 and #3 is whether the output is wrapped in an array, so it's probably 
easy for the server to produce both.

Depending on how much time I get next week I'll talk with the developer network 
folks to see what I need to do to put a specification under their 
infrastructure.  Looks like from my schedule it's going to be another week of 
hell :(


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-06 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Friday, March 05, 2010 08:48 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> On Fri, Mar 5, 2010 at 6:25 PM, Houghton,Andrew 
> wrote:
> 
> > OK, I will bite, you stated:
> >
> > 1. That large datasets are a problem.
> > 2. That streaming APIs are a pain to deal with.
> > 3. That tool sets have memory constraints.
> >
> > So how do you propose to process large JSON datasets that:
> >
> > 1. Comply with the JSON specification.
> > 2. Can be read by any JavaScript/JSON processor.
> > 3. Do not require the use of streaming API.
> > 4. Do not exceed the memory limitations of current JSON processors.
> >
> >
> What I'm proposing is that we don't process large JSON datasets; I'm
> proposing that we process smallish JSON documents one at a time by
> pulling
> them out of a stream based on an end-of-record character.
> 
> This is basically what we use for MARC21 binary format -- have a
> defined
> structure for a valid record, and separate multiple well-formed record
> structures with an end-of-record character. This preserves JSON
> specification adherence at the record level and uses a different scheme
> to represent collections. Obviously, MARC-XML uses a different 
> mechanism to define a collection of records -- putting well-formed 
> record structures inside a <collection> tag.
> 
> So... I'm proposing we define what we mean by a single MARC record
> serialized to JSON (in whatever format; I'm not very opinionated 
> on this point) that preserves the order, indicators, tags, data, 
> etc. we need to round-trip between marc21binary, marc-xml, and 
> marc-json.
> 
> And then separate those valid records with an end-of-record character
> -- "\n".

Ok, what I see here are divergent use cases and a willingness in the library 
community to break existing Web standards. This is how the library community 
makes it more difficult to use their data and places additional barriers in 
front of people and organizations entering their market: library-centric 
protocols and standards.

If I were to try to sell this idea to the Web community, at large, and tell 
them that when they send an HTTP request with an Accept: application/json 
header to our services, our services will respond with a 200 HTTP status and 
deliver them malformed JSON, I would be immediately impaled with multiple 
arrows and daggers :(  Not to mention that OCLC would be disparaged by a 
certain crowd in their blogs as being idiots who cannot follow standards.

OCLC's goals are to use and conform to Web standards, to make library data 
easier to use by people and organizations outside the library community; 
otherwise libraries and their data will become irrelevant. The JSON 
serialization is a standard, and the Web community expects that when they make 
HTTP requests with an Accept: application/json header they will get back JSON 
conforming to the standard. JSON's main use case is in AJAX scenarios, where 
you are not supposed to be sending megabytes of data across the wire.

Your proposal is asking me to break a widely deployed Web standard that is 
used by AJAX frameworks to access millions (ok, many) of Web sites.

> Unless I've read all this wrong, you've come to the conclusion that the
> benefit of having a JSON serialization that is valid JSON at both the
> record and collection level outweighs the pain of having to deal with
> a streaming parser and writer.  This allows a single collection to be
> treated as any other JSON document, which has obvious benefits (which 
> I certainly don't mean to minimize) and all the drawbacks we've been 
> talking about *ad nauseam*.

The goal is to adhere to existing Web standards, and your underlying 
assumption is that you can or will be retrieving large datasets through an 
AJAX scenario. As I pointed out, this is more an API design issue, and due to 
the way AJAX works you should never design an API in that manner. Your 
assumption that you can or will be retrieving large datasets through an AJAX 
scenario is false, given the caveat of a well designed API. Therefore you will 
never be put into a scenario requiring the use of JSON streaming, so your 
argument from this point of view is moot.

But for argument's sake let's say you could retrieve a line delimited list of 
JSON objects.  You can no longer use any existing AJAX framework for getting 
back that JSON since it's malformed.  You could use the AJAX framework's 
XMLHTTP to retrieve this line delimited list of JSON objects, but this still 
doesn't help because the XMLHTTP object will keep the entire response in memory.

So when our service se

Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Ross Singer
> Sent: Friday, March 05, 2010 09:18 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> I actually just wrote the same exact email as Bill (although probably
> not as polite -- I called the marcxml "collection" element a
> "contrivance that appears nowhere in marc21").  I even wrote the
> "marc21 is EOR character delimited files" bit.  I was hoping to figure
> out how to use unix split to make my point, couldn't, and then
> discarded my draft.
> 
> But I was *right there*.
> 
> -Ross.

I'll answer Bill's message tomorrow after I have had some sleep :) 

Actually, I contend that the MARC-XML collection element does appear in MARC 
(ISO 2709), but it is at the physical layer and not at the structural layer.  
Remember MARC records were placed on a tape reel, thus the tape reel was the 
collection (container).  Placed on disk in a file, the file is the collection 
(container).  I agree that it's not spelled out in the standard, but the 
concept of a collection (container) is implicit when you have more than one 
record of anything.

Basic set theory: a set is a container for its members :)

The obvious reason why it exists in XML is that the XML infoset requires a 
single document element (container).  This is why the MARC-XML schema allows 
either a collection or record element to be specified as the document element.  
It is unfortunate that the XML infoset requires a single document element, 
otherwise you would be back to the file on disk being the implicit collection 
(container) as it is in ISO 2709.


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Friday, March 05, 2010 05:22 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> This is my central point. I'm actually saying that JSON streaming is
> painful
> and rare enough that it should be avoided as a requirement for working
> with
> any new format.

OK, in principle we are in agreement here.

> I guess, in sum, I'm making the following assertions:
> 
> 1. Streaming APIs for JSON, where they exist, are a pain in the ass.
> And
> they don't exist everywhere. Without a JSON streaming parser, you have
> to
> pull the whole array of documents up into memory, which may be
> impossible.
> This is the crux of my argument -- if you disagree with it, then I
> would
> assume you disagree with the other points as well.

Agreed that streaming APIs for JSON are a pain and not universal across all 
clients.

Agreed that without a streaming API you are limited by memory constraints on the 
client. 

> 2. Many people -- and I don't think I'm exaggerating here, honestly --
> really don't like using MARC-XML but have to because of the length
> restrictions on MARC-binary. A useful alternative, based on dead-easy
> parsing and production, is very appealing.

Cannot address this concern.  MARC (ISO 2709) and MARC-XML are library 
community standards.  Doesn't matter whether I like them or not, or you like 
them or not.  This is what the library community has agreed to as a 
communications format between systems for interoperability.

> 2.5 Having to deal with a streaming API takes away the "dead-easy"
> part.

My assumption is that 2.5 is about using a streaming API with MARC-XML.  
I agree that using SAX on MARC-XML is a pain, but that's an issue with 
dealing with large XML datasets in general and has nothing to do with 
MARC-21.  In general, when processing large MARC-XML I use SAX to get me a 
complete record and then process at the record level; that isn't too bad, but 
I'll concede it's still a pain.  Usually, I break up large datasets into 
10,000-record chunks and process them that way, since most XML and XSLT tools 
cannot effectively deal with documents that are 100MB or larger, so I rarely 
ever use SAX anymore.
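
As a sketch of that record-at-a-time approach in Python (the file name is
hypothetical), iterparse can walk a large MARC-XML collection and hand back
one complete record element at a time:

import xml.etree.ElementTree as ET

MARC = "{http://www.loc.gov/MARC21/slim}"

def records(path):
    # iterparse fires an "end" event as each element closes; we yield
    # complete <record> elements and clear them so memory stays flat.
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == MARC + "record":
            yield elem
            elem.clear()

for rec in records("collection.xml"):
    leader = rec.find(MARC + "leader")
    # ... process one record, then move on ...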

> 3. If you accept my assertions about streaming parsers, then dealing
> with
> the format you've proposed for large sets is either painful (with a
> streaming API) or impossible (where such an API doesn't exist) due to
> memory
> constraints.

Large datasets, period, are a pain to deal with.  I work with them all day long 
and run into tool issues, disk space, processing times, etc.
I don't disagree with you here in principle, but as I previously pointed out, 
this is an API issue.

If your API never allows you to return a collection of more than 10 records, 
which is less than 1MB, you are not dealing with large datasets.  If your API 
is returning a large collection of records that is 100MB or larger, then you 
have problems and need to rethink your API.

This is no different than a large MARC-XML collection.  The entire LC authority 
dataset, names and subjects, is 8GB of MARC-XML.  Do I process that as 8GB of 
MARC-XML?  Heck no!  I break it up into smaller chunks and process the chunks.  
This allows me to take those chunks and run parallel algorithms on them or 
throw the chunks at our cluster and get the results back quicker.

It's the size of the data that is the crux of your argument, not the format of 
the data, e.g., XML, JSON, CSV, etc.

> 4. Streaming JSON writer APIs are also painful; everything that applies
> to
> reading applies to writing. Sans a streaming writer, trying to *write*
> a
> large JSON document also results in you having to have the whole thing
> in
> memory.

No disagreement here.
 
> 5. People are going to want to deal with this format, because of its
> benefits over marc21 (record length) and marc-xml (ease of processing),
> which means we're going to want to deal with big sets of data and/or
> dump batches of it to a file. Which brings us back to #1, the pain or
> absence of streaming apis.

So we are back to the general argument that large datasets, regardless of 
format, are a pain to deal with, and that tool sets have issues dealing with 
large datasets.  I don't disagree with these statements and run into these 
issues on a daily basis whether dealing with MARC datasets or other large 
datasets.  A solution to this issue is to create batches of stuff that can be 
processed in parallel; ever heard of Google and map-reduce? :)
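
A minimal sketch of that batch-and-parallelize pattern in Python, where the
chunk file names and the per-chunk work are placeholders:

from multiprocessing import Pool

def process_chunk(path):
    # Stand-in for real per-chunk work (parsing, extraction, etc.);
    # here we just count lines and return a partial result.
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

if __name__ == "__main__":
    chunks = ["chunk-%04d.xml" % i for i in range(8)]
    with Pool() as pool:
        partials = pool.map(process_chunk, chunks)  # the "map" step
    print(sum(partials))                            # the "reduce" step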

> "Write a better JSON parser/writer" or "use a different language" seem
> like
> bad solutions to me, especially when a (potentially) useful alternative
> exists.

OK, I'll bite; you stated:

1. That large datasets are a problem.
2. That streaming APIs are a pain to deal with.
3. That tool sets have memory constraints.

So how do you propose to process large JSON datasets that:

1. Comply with the

Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Benjamin Young
> Sent: Friday, March 05, 2010 04:24 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> For my part, I'd like to explore the options of putting MARC data into
> CouchDB (which stores documents as JSON) which could then open the door
> for replicating that data between any number of installations of
> CouchDB
> as well as providing for various output formats (marc-xml, etc).
> 
> It's just an idea, but it's one that uses JSON outside of the browser
> and is a good proof case for any MARC in JSON format.

This was partly the reason why I developed our MARC-JSON format, since I'm using 
MongoDB [1], a NoSQL database based on JSON.


Andy.

[1] http://www.mongodb.org/


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Friday, March 05, 2010 03:45 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> I guess my concern here is that the specification, as you're describing
> it, is closing off potential uses.  It seems fine if, for example, your
> primary concern is javascript-in-the-browser, and browser-request,
> pagination-enabled systems might be all you're worried about right now.
> 
> That's not the whole universe of uses, though. People are going to want
> to dump these things into a file to read later -- no possibility for
> pagination in that situation.

I disagree that you couldn't dump a paginated result set into a file for 
reading later.  I do this all the time, not only in Javascript but in many 
other programming languages.

> Others may, in fact, want to stream a few thousand
> records down the pipe at once, but without a streaming parser that
> can't happen if it's all one big array.

Well, if your service isn't allowing them to be streamed a few thousand records 
at a time, then that isn't an issue :)

Maybe I have been misled about, or have misunderstood, JSON streaming.  My 
understanding was that you can generate an arbitrarily large outgoing stream on 
the server side and read an arbitrarily large incoming stream on the client 
side.  So it shouldn't matter if the result set was delivered as one big JSON 
array.  The SAX-like interface that JSON streaming uses provides the necessary 
events to allow you to pull the individual records from that arbitrarily large 
array.
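
That is, for example, how the third-party ijson library for Python behaves; a
sketch, with a hypothetical file name:

import ijson  # third-party incremental JSON parser

with open("big-collection.json", "rb") as f:
    # "item" addresses each element of the top-level array, so only
    # one record is materialized in memory at any given moment.
    for record in ijson.items(f, "item"):
        pass  # ... process one record from the large array ...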

> I worry that as specified, the *only* use will be, "Pull these down a
> thin pipe, and if you want to keep them for later, or want a bunch of
> them, you have to deal with marc-xml."

Don't quite follow this.  MARC-XML is an XML format; MARC-JSON is our JSON 
format for expressing the various MARC-21 formats, e.g., authority, 
bibliographic, classification, community information and holdings, in JSON.  The 
JSON is based on the structure of MARC-XML, which was based on the structure of 
ISO 2709.  I don't see how MARC-XML comes into play when you are dealing with 
JSON.  If you want to save our MARC-JSON you don't have to convert it to 
MARC-XML on the client side.  Just save it as a text file.

> Part of my incentive is to *not* have to use marc-xml, but in this 
> case I'd just be trading one technology I don't like (marc-xml) 
> for two technologies, one of which I don't like (that'd be marc-xml 
> again).

Again, not sure how to address this concern.  If you are dealing with library 
data, then its current communication formats are either MARC binary (ISO 2709) 
or MARC-XML, ignoring IFLA's MARC-XML-ish format for the moment.  You might not 
like it, but that is life in library land.  You can go develop your own formats 
based on the various MARC-21 format specifications, but you are unlikely to 
achieve any sort of interoperability with the existing library systems and 
services.

We chose our MARC-JSON to maintain the structural components of MARC-XML and 
hence MARC binary (ISO 2709).  In MARC, control fields have different semantics 
from data fields and you cannot merge them into one thing called field.  If you 
look closely at the MARC-XML schema, you might notice that the controlfield and 
datafield elements can have non-numeric tags.  If you merge everything into 
something called field, then you cannot distinguish between a non-numeric tag 
for a controlfield vs. a datafield element.  There are valid reasons why we 
decided to maintain the existing structure of MARC.


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Ross Singer
> Sent: Friday, March 05, 2010 02:32 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> On Fri, Mar 5, 2010 at 1:10 PM, Houghton,Andrew 
> wrote:
> 
> > I certainly would be will to work with LC on creating a MARC-JSON
> specification as I did in creating the MARC-XML specification.
> 
> Quite frankly, I think I (and I imagine others) would much rather see
> a more open, RFC-style process to creating a marc-json spec than "I
> talked to LC and here you go".
> 
> Maybe I'm misreading this last paragraph a bit, however.

Yes, you misread the last paragraph.

Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Benjamin Young
> Sent: Friday, March 05, 2010 02:06 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> A CouchDB friend of mine just pointed me to the BibJSON format by the
> Bibliographic Knowledge Network:
> http://www.bibkn.org/bibjson/index.html
> 
> Might be worth looking through for future collaboration/transformation
> options.

Unfortunately, it doesn't really work for authority and classification data 
that I'm frequently involved with.

Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Friday, March 05, 2010 01:59 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> On Fri, Mar 5, 2010 at 1:10 PM, Houghton,Andrew 
> wrote:
> 
> >
> > I decided to stick closer to a MARC-XML type definition since its
> would be
> > easier to explain how the two specifications are related, rather than
> take a
> > more radical approach in producing a specification less familiar.
> Not to
> > say that other approaches are bad, they just have different
> advantages and
> > disadvantages.  I was going for simple and familiar.
> >
> >
> That makes sense, but please consider adding a format/version (which we
> get
> in MARC-XML from the namespace and isn't present here). In fact, please
> consider adding a format / version / URI, so people know what they've
> got.

This sounds reasonable and I'll consider adding it to our specification.

> I'm also going to again push the newline-delimited-json stuff. The
> collection-as-array is simple and very clean, but leads to trouble
> for production (where for most of us we'd have to get the whole
> freakin' collection in memory first ...

As far as our MARC-JSON specification is concerned, a server application can 
return either a collection or a record, which mimics the MARC-XML specification, 
where either the collection or record element can be used as the document 
element.
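
A client can smooth over the two shapes with a one-line normalization; a
Python sketch, assuming payloads in our MARC-JSON format:

import json

def as_records(payload):
    # Accept either a collection (JSON array) or a single record
    # (JSON object) and always hand back a list of records.
    data = json.loads(payload)
    return data if isinstance(data, list) else [data]

for record in as_records('{"leader": "01192cz  a2200301n  4500"}'):
    print(record["leader"])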

> Unless, of course, writing json to a stream and reading json from a
> stream
> is a lot easier than I make it out to be across a variety of languages
> and I
> just don't know it, which is entirely possible. The streaming writer
> interfaces for Perl (
> http://search.cpan.org/dist/JSON-Streaming-
> Writer/lib/JSON/Streaming/Writer.pm)
> and Java's Jackson (
> http://wiki.fasterxml.com/JacksonInFiveMinutes#Streaming_API_Example)
> are a
> little more daunting than I'd like them to be.

As you point out, JSON streaming doesn't work with all clients, and I am hesitant 
to build on anything that all clients cannot accept.  I think part of the issue 
here is proper API design.  Sending tens of megabytes back to a client and 
expecting it to process them seems like poor API design regardless of whether 
the client can stream or not.  It might make more sense to have a server API send 
back 10 of our MARC-JSON records in a JSON collection and have the client 
request an additional batch of records for the result set.  In addition, if I 
remember correctly, JSON streaming and other streaming methods keep the 
connection to the server open, which is not a good thing to do if you want to 
maintain server throughput.
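
A sketch of that batching idea from the client side, in Python; the endpoint
and its startRecord/maximumRecords-style parameters are assumptions, loosely
modeled on SRU paging:

import json
import urllib.parse
import urllib.request

def fetch_all(base_url, page_size=10):
    start = 1
    while True:
        query = urllib.parse.urlencode(
            {"startRecord": start, "maximumRecords": page_size})
        with urllib.request.urlopen(base_url + "?" + query) as resp:
            batch = json.loads(resp.read())  # a small collection, e.g. 10 records
        if not batch:
            return
        for record in batch:
            yield record
        start += page_size

Each request is a short-lived connection, so the server never has to hold a
connection open the way a streaming response would.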


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Friday, March 05, 2010 12:30 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> On Fri, Mar 5, 2010 at 12:01 PM, Houghton,Andrew 
> wrote:
> 
> > Too bad I didn't attend code4lib.  OCLC Research has created a
> version of
> > MARC in JSON and will probably release FAST concepts in MARC binary,
> > MARC-XML and our MARC-JSON format among other formats.  I'm wondering
> > whether there is some consensus that can be reached and standardized
> at LC's
> > level, just like OCLC, RLG and LC came to consensus on MARC-XML.
> >  Unfortunately, I have not had the time to document the format,
> although it
> > fairly straight forward, and yes we have an XSLT to convert from
> MARC-XML to
> > MARC-JSON.  Basically the format I'm using is:
> >
> >
> The stuff I've been doing:
> 
>   http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
> 
> ... is pretty much the same, except:

I decided to stick closer to a MARC-XML type definition since it would be 
easier to explain how the two specifications are related, rather than take a 
more radical approach and produce a less familiar specification.  Not to say 
that other approaches are bad; they just have different advantages and 
disadvantages.  I was going for simple and familiar.

I certainly would be willing to work with LC on creating a MARC-JSON 
specification, as I did in creating the MARC-XML specification.


Andy.


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Benjamin Young
> Sent: Friday, March 05, 2010 09:26 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> If you're looking at putting MARC into JSON, there was some discussion
> of that during code4lib 2010. Johnathan Rochkind, who was at code4lib
> 2010 blogged about marc-json recently:
> http://bibwild.wordpress.com/2010/03/03/marc-json/
> He references a project that Bill Dueber's been playing with for a
> year:
> http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
> 
> All told, there's growing momentum for a MARC in JSON format to be
> created, so you might jump in there.

Too bad I didn't attend code4lib.  OCLC Research has created a version of MARC 
in JSON and will probably release FAST concepts in MARC binary, MARC-XML and 
our MARC-JSON format, among other formats.  I'm wondering whether there is some 
consensus that can be reached and standardized at LC's level, just like OCLC, 
RLG and LC came to consensus on MARC-XML.  Unfortunately, I have not had the 
time to document the format, although it is fairly straightforward, and yes, we 
have an XSLT to convert from MARC-XML to MARC-JSON.  Basically the format I'm 
using is:

[
  ...
]

which represents a collection of MARC records or 

{
  ...
}

which represents a single MARC record that takes the form:

{
  leader : "01192cz  a2200301n  4500",
  controlfield :
  [
{ tag : "001", data : "fst01303409" },
{ tag : "003", data : "OCoLC" },
{ tag : "005", data : "20100202194747.3" },
{ tag : "008", data : "060620nn anznnbabn  || ana d" }
  ],
  datafield :
  [
{
  tag : "040",
  ind1 : " ",
  ind2 : " ",
  subfield :
  [
{ code : "a", data : "OCoLC" },
{ code : "b", data : "eng" },
{ code : "c", data : "OCoLC" },
{ code : "d", data : "OCoLC-O" },
{ code : "f", data : "fast" }
  ]
},
{
  tag : "151",
  ind1 : " ",
  ind2 : " ",
  subfield :
  [
{ code : "a", data : "Hawaii" },
{ code : "z", data : "Diamond Head" }
  ]
}
  ]
}
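
For what it's worth, the structure is easy to walk once parsed; a small Python
sketch against the sample record above (the helper name is mine):

def subfield_values(record, tag, code):
    # Collect the data values of matching subfields for a given
    # datafield tag and subfield code.
    return [sub["data"]
            for field in record.get("datafield", [])
            if field["tag"] == tag
            for sub in field["subfield"]
            if sub["code"] == code]

# Against the sample record above, subfield_values(record, "151", "a")
# returns ["Hawaii"].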


[CODE4LIB] Q: MARC formats in XML

2010-02-15 Thread Houghton,Andrew
Does anybody know whether the MARC formats at:







are encoded in an XML format that one might use for processing/validation of 
the leader, field tags, indicator codes, subfield codes, field and subfield 
repeatability, and field and subfield requiredness?  Something akin to the 
codepoint table in XML at:




Thanks, Andy.


Re: [CODE4LIB] Auto-suggest and the id.loc.gov LCSH web service

2009-12-07 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Winona Salesky
> Sent: Monday, December 07, 2009 11:00 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Auto-suggest and the id.loc.gov LCSH web
> service
>
> Quoting Ethan Gruber :
> 
> > I have a need to integrate the LCSH terms into a web form that uses
> > auto-suggest to control the vocabulary.  Is this technically possible
> with
> > the id.loc.gov service?

Why can't you just add a "*" to the end of the data in your search form
and send the request to the id.loc.gov search, per:



then parse the response?


Andy.


Re: [CODE4LIB] FW: PURL Server Update 2

2009-09-02 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Thomas Dowling
> Sent: Wednesday, September 02, 2009 10:25 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] FW: PURL Server Update 2
> 
> The III crawler has been a pain for years and Innovative has shown no
> interest
> in cleaning it up.  It not only ignores robots.txt, but it hits target
> servers
> just as fast and hard as it can.  If you have a lot of links that a lot
> of III
> catalogs check, its behavior is indistinguishable from a DOS attack.  (I
> know
> because our journals server often used to crash about 2:00am on the
> first of
> the month...)

I see that I didn't fully make the connection to the point I was 
making... which is that there are hardware solutions to these 
issues rather than using robots.txt or sitemap.xml.  If a user 
agent is a problem, then network folks should change the router 
to ignore the user agent or reduce the number of requests it is
allowed to make to the server.

In the case you point to, with III hitting the server as fast as 
it can and looking to the network like a DOS attack that caused 
the server to crash: 1) the router hasn't been set up 
to impose throttling limits on user agents, and 2) the server 
probably isn't part of a server farm that is being load balanced.

In the case of GPO, they mentioned, or implied, that they were
having contention issues with user agents hitting the server
while they were trying to restore the data.  This contention could be
mitigated by imposing lower throttling limits in the router on 
user agents until the data is restored, and then raising the 
limits back to whatever their prescribed SLA (service level
agreement) was.

You really don't need to have a document on the server to tell 
user agents what to do.  You can and should impose a network 
policy on user agents, which is a far better solution in my opinion.


Andy.


Re: [CODE4LIB] FW: PURL Server Update 2

2009-09-02 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> David Fiander
> Sent: Wednesday, September 02, 2009 9:32 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] FW: PURL Server Update 2
> 
> If Millenium is acting like a robot in its
> monthly maintenance processes, then it should be checking robots.txt.

User agents are *not required* to check robots.txt nor are servers
*required* to provide a robots.txt.  There are no expectations for
robots.txt other than a gentlemen's agreement that if a server
provides one it should be consulted when any content is accessed from 
the server.

However, if you have publicly accessible URIs it is highly unlikely
that you would restrict access to those URIs in your robots.txt.
It kind of defeats the purpose of the URIs being *public*.  You
might put those URIs in robots.txt when the URIs have been deprecated 
and are being redirected to another URI, e.g., you redesigned your
Web site, but 1) I would argue that it would be better for your user
agents to see the redirect so they can update themselves, and 2) GPO
is running a PURL server, where the URIs are supposed to be *permanent*
and *publicly* accessible.

Robots.txt is a nice idea, but if you are having an issue with a user
agent, the network folks will most likely update the router rules to 
block the traffic rather than let it get through to the server.





Andy.


Re: [CODE4LIB] FW: PURL Server Update 2

2009-09-01 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Edward M. Corrado
> Sent: Tuesday, September 01, 2009 3:57 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] FW: PURL Server Update 2
> 
> This should be a lesson to each and everyone of us who are in charge of
> maintain systems to actually try to restore a system from your backups
> on a some-what regular basis to ensure that you can get important
> systems up and running in a timely manner. From personal experience, I
> can tell you that you can learn a lot from these tests, and on at least
> one occasion they saved my  when I had a critical system fail.

I would be surprised if GPO wasn't backing up their data or even imaging
their server drives on a periodic basis.  However, in my experience with 
operations folks over many years, it isn't as simple as grabbing a backup
tape, putting it in a drive, and copying the data.  Just copying the
data, depending on how large it is, could take some time.

1) Data drive fails. Grab a new drive from stock, reformat the drive,
   grab the backup tape, reload the data, then figure out what was lost
   between the time it was backed up and the time it went down.  If it
   was a SQL database, can you recover from any intact journals?  If yes,
   apply the journals.  And if this isn't the only service running on the
   server, you have to restore the data for those applications too.

2) OS drive fails. Grab a new drive from stock, reformat the drive, grab
   the backup tape or image and reload the OS, then figure out what
   security or application patches were applied between the time the 
   backup/image was taken and today.  Reinstall the missing patches.

3) Total hardware failure.  Do (1) and (2) above.  Your organization
   might keep pre-built spare servers around, but you still have to make
   sure that they are up-to-date with security and application patches,
   that they have the correct applications installed, you need to change
   the IP address, the server name and other little details like that.

4) Server compromised.  Worst case scenario.  They need to preserve all
   the drives so they can analyze them and turn over information to 
   police.  They are not going to trust the backup/image since they don't 
   know how long the server was compromised.  So they are most likely 
   going to rebuild the server from scratch and ensure that it has *all* 
   the latest security and application patches, in addition to doing (1).

We had a research server get compromised a few years ago and it took
several weeks to get it back online due to rebuilding it from scratch.
Nothing is as simple as it seems...


Andy.


Re: [CODE4LIB] XML schemas question

2009-07-27 Thread Houghton,Andrew
> -Original Message-
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Jonathan Rochkind
> Sent: Monday, July 27, 2009 11:35 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] XML schemas question
> 
> Anyone familiar with XML schemas (.xsd)?
> 
> Can you help me figure something out. Is there something in the schema
> that specifies what elements can serve as the 'root node'... or is any
> element described in the schema available for use as a 'root node',
> and
> it'll still validate?

Any global element is available as a document element in an instance
document.  Typically, you can use global types, specify one global
element and use the Russian Doll approach to limit which elements
can be used for the document element in an instance document.  The
upside is that you can control what elements are used for instance
documents.  The downside is that schema reuse drops because inner
elements cannot be reused in other contexts.


Andy.


Re: [CODE4LIB] Open, public standards v. pay per view standards and usage

2009-07-16 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Karen Coyle
> Sent: Thursday, July 16, 2009 2:09 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Open, public standards v. pay per view
> standards and usage
> 
> Houghton,Andrew wrote:
>
> Second, standards can undergo change as they move from NISO to ISO. I
> believe with ISO 2709 there was agreement that it be backward
> compatible
> with Z39.2, but they are not identical. Should NISO revise its standard
> to match the ISO standard?

This would be my preferred solution.


Andy.


Re: [CODE4LIB] Open, public standards v. pay per view standards and usage

2009-07-16 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Bill Dueber
> Sent: Thursday, July 16, 2009 11:45 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Open, public standards v. pay per view
> standards and usage
> 
> On Thu, Jul 16, 2009 at 11:26 AM, Houghton,Andrew 
> wrote:
> 
> I'm not disagreeing with your overall point, but this is a specious
> example,

Yes it was, and I tried to point that out.  I'm sure others might be
able to come up with examples, in or out of the library domain, where
only having the schema wasn't a barrier to adoption.  This is why I
said "it depends", but in philosophy I agree that it's generally a
barrier to adoption.

> The ISO 20775 schema, for example, includes elements like <element name="physicalLocation"> -- and there's no way you're going to know
> what the hell goes in there without a lot more help. And if you were 
> to have to pay for that help, many would rely on cheat-sheets or 
> pattern-matching and it all goes to hell.

Exactly the point of the rest of my prior post, but more bluntly put :)

So why do people keep running new standards through organizations like ISO
that lock them up behind a pay system?  It's probably better to run them
through NISO first, where they will be freely available, then run them
through ISO, where ISO can lock them up for the people who require the ISO
stamp of approval before they can use a standard.

Nowadays, all standards should be free.  They are created by people from
participating institutions who put their time and effort into them, not by
the institution where they are housed.  Charging for standards is counter-
productive to adoption.  In the past, before the Web, you could get away
with that business model because you didn't have a low-cost distribution
model.  Today it doesn't cost much to house a digital copy; heck, Google 
Docs and other cloud services will do it for free.  I can see paying for a 
standard if I want a bound printed copy, otherwise it just
doesn't make sense [to me]. 


Andy.


Re: [CODE4LIB] Open, public standards v. pay per view standards and usage

2009-07-16 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Ross Singer
> Sent: Thursday, July 16, 2009 11:07 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Open, public standards v. pay per view
> standards and usage
> 
> On Wed, Jul 15, 2009 at 8:57 AM, Ray Denenberg, Library of
> Congress wrote:
> 
> > Ross, if you're talking about the ISO 20775 xml schema:
> > http://www.loc.gov/standards/iso20775/ISOholdings_V1.0.xsd
> >
> > It's free.
> 
> It's also not a spec, it's a schema.  If the expectation is that
> people are actually going to adopt a standard from merely looking at
> an .xsd, my prediction is that this will go nowhere.
> 
> I mean, I'm wrong a lot, but I feel pretty good about this reading
> from my crystal ball.

Not saying you're wrong, Ross, but it depends.  People adopted MARC-XML
by looking at the .xsd without an actual specification.  Granted, it's
not a complicated schema, and there already existed the "MARC 21 
Specifications for Record Structure, Character Sets, and Exchange Media",
so it wasn't a big leap to adopt MARC-XML, IMHO.

Generally I agree with your conclusion, Ross.  It's difficult for people
to just pick up an .xsd and understand what the semantics are for each
element and attribute in the schema and which element(s) should be used 
for the document element.  This is mitigated by annotations in the .xsd 
for the elements and attributes, and also by using the Russian
doll schema approach that MARC-XML uses, so it's clear what elements
can be used for the document element.  Also, tools like XMLSpy that
provide a graphical representation of the .xsd can provide insights
into how the schema should be used.

But those are a lot of ifs: that this and that were done, and that you have 
appropriate tools.  A freely available specification detailing each element and 
attribute along with their semantics is much better for understanding a 
schema than the schema itself, but obviously the schema is the definitive 
authority when it comes to generating conforming instance documents.


Andy.


Re: [CODE4LIB] A simple windows macro program?

2009-05-18 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Jill Ellern
> Sent: Monday, May 18, 2009 3:52 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] A simple windows macro program?
> 
> Hey Code4Lib folks,
> 
> Not sure if there are any Clio ILL software users in the Code4Lib
> group...but we are interested in getting Clio to pull emails from the
> ClioAdvanced email more often.  We are thinking there must be a windows
> macro/scripting program on top of Clio (which is an MS Access
> application for those not familiar with it) would be the best idea.
> Right now, it needs human intervention to run the menu option to pop
> these requests into Clio...and they only do it once or twice a day.  We
> want this menu option in Clio to run every 15-20 minutes all day.
> However, we are not familiar with what programs are out there that
> would be good to look at.  Can you folks make a suggestion?  Has anyone
> done this already with Clio?

I'm not familiar with Clio, but I have used MS Access extensively over
the years.  It is quite possible, since Clio is built on MS Access, that
it has VBA macros to accomplish various tasks.  If the macros are
not encrypted, you might be able to modify them to suit your needs.

Alternatively, you could write a VB.NET or C# application that uses 
ADO.NET to open the Clio MS Access database every 15-20 minutes, pull 
the information via ADO.NET, and push it to your e-mail solution.  
Microsoft Express tools, e.g., VB.NET Express, C# Express, etc., can 
be freely downloaded and used.


Hope this helps, Andy.


Re: [CODE4LIB] Anyone else watching rev=canonical?

2009-04-14 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Brett Bonfield
> Sent: Tuesday, April 14, 2009 6:48 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Anyone else watching rev=canonical?
> 
> On Tue, Apr 14, 2009 at 5:53 PM, Houghton,Andrew 
> wrote:
> >> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf
> Of
> >> Brett Bonfield
> >>
> >> Different. Which is one of the problems with rev=canonical.
> >
> > Another issue is that Google, Microsoft, et al. couldn't see that
> their
> > proposal was already taken care of by HTTP with its Content-Location
> > header and that if they wanted people to embed the canonical URI into
> > their HTML that they could have easily done:
> >
> > 
> >
> > rather than creating a new link rel="canonical" and BTW their
> strategy
> > only works in HTML, it doesn't work in RDF, JSON, XML, etc., but
> using
> > HTTP as it was intended, e.g., Content-Location header, it works for
> > all media types.
> 
> Similar issues are arising with the proposed rev=canonical. That is,
> there are different ways to provide the info that rev=canonical is
> providing.
> 
> However, just to be clear, rev=canonical != rel=canonical.
> 
> They are discrete responses to distinct issues.

Agreed.  Another issue with rev=canonical is that I don't believe that
rev= is going to be supported in HTML 5.


Andy.


Re: [CODE4LIB] Anyone else watching rev=canonical?

2009-04-14 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Brett Bonfield
> Sent: Tuesday, April 14, 2009 5:35 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Anyone else watching rev=canonical?
> 
> On Tue, Apr 14, 2009 at 5:30 PM, Jonathan Rochkind 
> wrote:
> > Wait, is this the same or different than , as
> in:
> >
> > http://googlewebmastercentral.blogspot.com/2009/02/specify-your-
> canonical.html
> >
> >  seemed like a good idea to me.  But when I
> start
> > reading some of those URLs, it's not clear to me if they're talking
> about
> > the same thing or not.
> 
> Different. Which is one of the problems with rev=canonical.

Another issue is that Google, Microsoft, et al. couldn't see that their
proposal was already taken care of by HTTP with its Content-Location
header, and that if they wanted people to embed the canonical URI into
their HTML they could have easily done something like:

<meta http-equiv="Content-Location" content="canonical-URI">

rather than creating a new link rel="canonical".  And BTW, their strategy 
only works in HTML; it doesn't work in RDF, JSON, XML, etc., whereas using
HTTP as it was intended, e.g., the Content-Location header, works for 
all media types.
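
The header approach is also trivial for any client to consume; a quick Python
sketch, with a placeholder URL:

import urllib.request

with urllib.request.urlopen("http://example.org/some/alias") as resp:
    # The canonical URI travels in the HTTP response itself, so this
    # works for HTML, RDF, JSON, or any other media type.
    canonical = resp.headers.get("Content-Location")
    print(canonical or "no Content-Location header supplied")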


Andy.


Re: [CODE4LIB] resolution and identification (was Re: [CODE4LIB] registering info: uris?)

2009-04-14 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Jonathan Rochkind
> Sent: Tuesday, April 14, 2009 10:21 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] resolution and identification (was Re:
> [CODE4LIB] registering info: uris?)
> 
> Over in: http://www.w3.org/2001/tag/doc/URNsAndRegistries-50-2006-08-
> 17.html
> 
> They suggest: "URI opacity: 'Agents making use of URIs SHOULD NOT
> attempt to infer properties of the referenced resource.'"
> 
> I understand why that makes sense in theory, but it's entirely
> impractical for me, as I discovered with the SuDoc experiment (which
> turned out to be a useful experiment at least in understanding my own
> requirements).  If I get a URI representing (eg) a Sudoc (or an ISSN,
> or an LCCN), I need to be able to tell from the URI alone that it IS a
> Sudoc, AND I need to be able to extract the actual SuDoc identifier
> from it.  That completely violates their Opacity requirement, but it's
> entirely infeasible to require me to make an individual HTTP request
> for every URI I find, to figure out what it IS.

Jonathan, you need to take URI opacity in context.  The document is correct
in suggesting that user agents should not attempt to infer properties of
the referenced resource.  The Architecture of the Web is also clear on this
point and includes an example.  Just because a resource URI ends in .html
does not mean that HTML will be the representation being returned.  The
user agent is inferring a property by looking at the end of the URI to see
if it ends in .html, e.g., that the Web Document will be returning HTML.  If 
you really want to know for sure you need to dereference it with a HEAD 
request.
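
A quick sketch of that check in Python, using a HEAD request so no body is
transferred (the URL is a placeholder):

import urllib.request

req = urllib.request.Request("http://example.org/page.html", method="HEAD")
with urllib.request.urlopen(req) as resp:
    # The media type comes from the response, not from the ".html"
    # suffix of the URI, which carries no guaranteed semantics.
    print(resp.headers.get("Content-Type"))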

Now having said that, URI opacity applies to user agents dealing with *any*
URIs that they come across in the wild.  They should not try to infer any
semantics from the URI itself.  However, this doesn't mean that the minter
of a URI cannot create a policy decision for a group of URIs under their
control that contain semantics.  In your example, you made a policy 
decision about the URIs you were minting for SUDOCs such that the actual
SUDOC identifier would appear someplace in the URI.  This is perfectly
fine and is the basis for REST URIs, but understand you created a specific
policy statement for those URIs, and if a user agent is aware of your policy
statements about the URIs you mint, then they can infer semantics from
the URIs you minted.

Does that break URI opacity from a user agents perspective?  No.  It just
means that those user agents who know about your policy can infer semantics
from your URIs and those that don't should not infer any semantics because
they don't know what the policies are, e.g., you could be returning PDF
representations when the URI ends in .html, if that was your policy, and
the only way for a user agent to know that is to dereference the URI with 
either HEAD or GET when they don't know what the policies are.


Andy.


Re: [CODE4LIB] points of failure (was Re: [CODE4LIB] resolution and identification )

2009-04-02 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Jonathan Rochkind
> Sent: Thursday, April 02, 2009 10:53 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] points of failure (was Re: [CODE4LIB] resolution
> and identification )
> 
> Isn't there always a single point of failure if you are expecting to be
> able to resolve an http URI via the HTTP protocol?
> 
> Whether it's purl.org or not, there's always a single point of failure
> on a given http URI that you expect to resolve via HTTP, the entity
> operating the web server at the specified address. Right?

I think the answer lies in DNS.  Even though you have a single DNS name,
requests can be directed to one of multiple servers, called a server
farm.  I believe this is how many large sites, like Google, operate.  So
even if a single server fails, the load balancer sends requests to other
servers.  Even OCLC does this.

> Now, if you have a collection of disparate http URIs, you have _many_
> points of failure in that collection. Any entity goes down or ceases to
> exist, and the http URIs that resolved to that entity's web server will
> stop working.

I think this also gets back to DNS.  Even though you have a single DNS
name, requests can be redirected to servers outside the original request
domain.  So you could have distributed servers under many different domain
names.
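
You can see the first point from any client; a Python sketch listing the
addresses behind one DNS name:

import socket

# A single host name can map to multiple A/AAAA records; large sites
# spread these across a server farm behind load balancers.
for info in socket.getaddrinfo("www.google.com", 80, proto=socket.IPPROTO_TCP):
    family, socktype, proto, canonname, sockaddr = info
    print(sockaddr[0])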


Andy.


Re: [CODE4LIB] resolution and identification (was Re: [CODE4LIB] registering info: uris?)

2009-04-02 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Karen Coyle
> Sent: Thursday, April 02, 2009 10:15 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] resolution and identification (was Re:
> [CODE4LIB] registering info: uris?)
> 
> Houghton,Andrew wrote:
> > RFC 3986 (URI generic syntax) says that "http:" is a URI scheme not a
> > protocol.  Just because it says "http" people make all kinds of
> > assumptions about type of use, persistence, resolvability, etc.
> >
> 
> And RFC 2616 (Hypertext transfer protocol) says:
> 
> "The HTTP protocol is a request/response protocol. A client sends a
> request to the server in the form of a request method, URI, and
> protocol
> version, followed by a MIME-like message containing request modifiers,
> client information, and possible body content over a connection with a
> server."
> 
> So what you are saying is that it's ok to use the URI for the hypertext
> transfer protocol in a way that ignores RFC 2616. I'm just not sure how
> functional that is, in the grand scheme of things.

You missed the whole point: URIs, specified by RFC 3986, are just tokens
that are divorced from protocols, like RFC 2616, but often work in conjunction
with them to retrieve a representation of the resource identified by the URI.
It is up to the protocol to decide which URI schemes it will 
accept.  In the case of RFC 2616, there is a one-to-one relationship, today,
with the HTTP URI scheme.  RFC 2616 could also have said it would accept other 
URI schemes, or another protocol could be defined, tomorrow, that also 
accepts the HTTP URI scheme, causing the HTTP URI scheme to have a one-to-many 
relationship between its scheme and the protocols that accept it.

> And when you say:
> 
> > The "Cool URIs for the Semantic Web" document describes how an HTTP
> protocol
> > implementation (of RFC 2616) should respond to a dereference of an
> HTTP URI.
> 
> I think you are deliberating distorting the intent of the Cool URIs
> document. You seem to read it that *given* an http uri, here is how the
> protocol should respond. But in fact the Cool URIs document asks the
> question "So the question is, what URIs should we use in RDF?" and
> responds that one should use http URIs for the reason that:
> 
> "Given only a URI, machines and people should be able to retrieve a
> description about the resource identified by the URI from the Web. Such
> a look-up mechanism is important to establish shared understanding of
> what a URI identifies. Machines should get RDF data and humans should
> get a readable representation, such as HTML. The standard Web transfer
> protocol, HTTP, should be used."

The answer to the question posed in the document is based on Tim 
Berners-Lee's four linked data principles, one of which states to 
use HTTP URIs.  Nobody, as far as I know, has created a hypertext-based 
system based on the URN or info URI schemes.  The only 
hypertext-based system available today is the Web, which is based on 
the HTTP protocol that accepts HTTP URIs.  So you cannot effectively 
accomplish linked data on the Web without using HTTP URIs.

The document has an RDF / Semantic Web slant, but Tim Berners-Lee's 
four linked data principles say nothing about RDF or the Semantic Web.  
Those four principles might be more aptly named the four linked 
"information" principles for the Web.  Further, the document does go on 
to describe how an HTTP server (an implementation of RFC 2616) should 
respond to requests for Real World Objects, Generic Documents and Web 
Documents, which is based on the W3C TAG decisions for httpRange-14 and 
genericResources-53.

The scope of the document clearly says:

  "This document is a practical guide for implementers of the RDF 
   specification... It explains two approaches for RDF data hosted 
   on HTTP servers..."

Section 2.1 discusses HTTP and content negotiation for Generic Documents.

Section 4 discusses how the HTTP server should respond, with diagrams and
actual HTTP status codes, to let user agents know which URIs are Real
World Objects vs. Generic Documents and Web Documents, per the W3C TAG
decisions on httpRange-14 and genericResources-53.

Section 6 directly addresses the question that this thread has been talking
about, namely using new URI schemes, like URN and info, and why they are
not acceptable in the context of linked data.

And here is a quote saying what I have said over and over again about
URIs being tokens divorced from protocols:

  "To be truly useful, a new scheme must be accompanied by a protocol 
   defining how to access more information about the identified resource.
   For example, the ftp:// URI scheme identifies resources (files on

Re: [CODE4LIB] resolution and identification (was Re: [CODE4LIB] registering info: uris?)

2009-04-02 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Mike Taylor
> Sent: Thursday, April 02, 2009 10:07 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] resolution and identification (was Re:
> [CODE4LIB] registering info: uris?)
> 
> Houghton,Andrew writes:
>  > > I have to say I am suspicious of schemes like PURL, which for all
>  > > their good points introduce a single point of failure into, well,
>  > > everything that uses them.  That can't be good.  Especially as
>  > > it's run by the same compary that also runs the often-unavailable
>  > > OpenURL registry.
>  >
>  > What you are saying is that you are suspicious of the HTTP protocol.
> 
> That is NOT what I am saying.
> 
> I am saying I am suspicious of a single point of failure.  Especially
> since the entire architecture of the Internet was (rightly IMHO)
> designed with the goal of avoid SPOFs.

OK, good, then if you are concerned about the PURL service's SPOF, take 
the freely available PURL software, create a distributed PURL-based 
system, and put it up for the community.  I think several people have
looked at this, but I have not heard of any progress or implementations.


Andy.


Re: [CODE4LIB] resolution and identification (was Re: [CODE4LIB] registering info: uris?)

2009-04-02 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Mike Taylor
> Sent: Thursday, April 02, 2009 8:41 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] resolution and identification (was Re:
> [CODE4LIB] registering info: uris?)
> 
> I have to say I am suspicious of schemes like PURL, which for all
> their good points introduce a single point of failure into, well,
> everything that uses them.  That can't be good.  Especially as it's
> run by the same compary that also runs the often-unavailable OpenURL
> registry.

What you are saying is that you are suspicious of the HTTP protocol.  All
the PURL server does is use mechanisms specified by the HTTP protocol.
Any HTTP server is capable of implementing those same mechanisms.  The
actual PURL server is a community-based service that allows people to
create HTTP URIs that redirect to other URIs without having to run an 
actual HTTP server.  If you don't like its single point of failure, then 
create your own in-house service using your existing HTTP server.  I 
believe the source code for the entire PURL service is freely available, 
and other people have taken the opportunity to run their own in-house or 
community-based services.


Andy.


Re: [CODE4LIB] resolution and identification (was Re: [CODE4LIB] registering info: uris?)

2009-04-02 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Karen Coyle
> Sent: Wednesday, April 01, 2009 2:26 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] resolution and identification (was Re:
> [CODE4LIB] registering info: uris?)
> 
> This really puzzles me, because I thought http referred to a protocol:
> hypertext transfer protocol. And when you put "http://" in front of
> something you are indicating that you are sending the following string
> along to be processed by that protocol. It implies a certain
> application
> over the web, just as "mailto:" implies a particular application. Yes,
> "http" is the URI for the hypertext transfer protocol. That doesn't
> negate the fact that it indicates a protocol. 

RFC 3986 (URI generic syntax) says that "http:" is a URI scheme not a
protocol.  Just because it says "http" people make all kinds of 
assumptions about type of use, persistence, resolvability, etc.  As I
indicated in a prior message, whoever registered the http URI scheme
could have easily used the token "web:" instead of "http:".  All the
URI scheme in RFC 3986 does is indicate what the syntax of the rest
of the URI will look like.  That's all.  You give an excellent
example: mailto.  The mailto URI scheme does not imply a particular
application.  It is a URI scheme with a specific syntax.  That URI
is often resolved with the SMTP (mail) protocol.  Whoever registered
the mailto URI scheme could have specified the token as "smtp:"
instead of "mailto:".

> My reading of Cool URIs is
> that they use the protocol, not just the URI. If they weren't intended
> to take advantage of http then W3C would have used something else as a
> URI. Read through the Cool URIs document and it's not about
> identifiers,
> it's all about using the *protocol* in service of identifying. Why use
> http?

I'm assuming here that "My reading of Cool URIs..." means reading
the "Cool URIs for the Semantic Web" document and not the "Cool URIs Don't
Change" document.  The "Cool URIs for the Semantic Web" document is about
linked data.  Tim Berners-Lee's four linked data principles state:

   1. Use URIs as names for things.
   2. Use HTTP URIs so that people can look up those names.
   3. When someone looks up a URI, provide useful information.
   4. Include links to other URIs, so that they can discover more things.

(2) is an important aspect of linking.  The Web is a hypertext-based system
that uses HTTP URIs to identify resources.  If you want to link, then you 
need to use HTTP URIs.  There is only one protocol, today, that accepts 
HTTP URIs as "currency" and it's appropriately called HTTP, defined by 
RFC 2616.

The "Cool URIs for the Semantic Web" document describes how an HTTP protocol
implementation (of RFC 2616) should respond to a dereference of an HTTP URI.
It's important to understand that URIs are just tokens that *can* be presented 
to a protocol for resolution.  It's up to the protocol to define the "currency"
that it will accept, e.g., HTTP URIs, and it's up to an implementation of the
protocol to define the "tokens" of that "currency" that it will accept.

It just so happens that HTTP URIs are accepted by the HTTP protocol, but in
the case of mailto URIs they are accepted by the SMTP protocol.  However,
it is important to note that a HTTP user agent, e.g., a browser, accepts
both HTTP and mailto URIs.  It decides that it should send the mailto URI
to an SMTP user agent, e.g., Outlook, Thunderbird, etc. or it should
dereference the HTTP URI with the HTTP protocol.  In fact the HTTP protocol
doesn't directly accept HTTP URIs.  As part of the dereference process the
HTTP user agent needs to break apart the HTTP URI and present it to the HTTP
protocol.  For example the HTTP URI: http://example.org/ becomes the HTTP 
protocol request:

GET / HTTP/1.1
Host: example.org
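
A sketch of that dereference step in Python, splitting the URI token into the
pieces the protocol actually sees:

from urllib.parse import urlsplit

uri = "http://example.org/"
parts = urlsplit(uri)

# The user agent turns the URI token into protocol "currency":
# a request line plus a Host header.
print("GET %s HTTP/1.1" % (parts.path or "/"))
print("Host: %s" % parts.netloc)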

Think of a URI as a minted token.  The New York subway mints tokens to ride 
the subway to get to a destination.  Placing a U.S. quarter or a Boston
subway token in a turnstile will not allow you to pass.  You must use the 
New York subway minted token, e.g., "currency".  URIs are the same.  OCLC 
can mint HTTP URI tokens and LC can mint HTTP URI tokens; both are using
the HTTP URI "currency", but sending LC HTTP URI tokens, e.g., Boston subway
tokens, to OCLC's Web server will most likely result in a 404; you cannot
pass, since OCLC's Web server only accepts OCLC tokens, e.g., New York subway
tokens, that identify a resource under its control.


Andy.


Re: [CODE4LIB] resolution and identification (was Re: [CODE4LIB] registering info: uris?)

2009-04-02 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Ray Denenberg, Library of Congress
> Sent: Wednesday, April 01, 2009 2:38 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] resolution and identification (was Re:
> [CODE4LIB] registering info: uris?)
> 
> No,  not identical URIs.
> 
> Let's say I've put a copy of the schema permanently at each of the
> following
> locations.
>  http://www.loc.gov/standards/mods/v3/mods-3-3.xsd
>  http://www.acme.com//mods-3-3.xsd
>  http://www.takoma.org/standards/mods-3-3.xsd
> 
> Three locations, three URIs.
> 
> But the issue of redirect or even resolution is irrelevant in the use
> case
> I'm citing.   I'm talking about the use of an identifier within a
> protocol,
> for the sole purpose of identifying an object that the recipient of the
> URI
> already has - or if it doesn't have it it isn't going to retrieve it,
> it
> will just fail the request.   The purpose of the identifier is to
> enable the
> server to determine whether it has the schema that the client is
> looking
> for.  (And by the way that should answer Ed's question about a use
> case.)
> 
> So the server has some table of schemas, in that table is the row:
> 
> ["mods schema"]   [<URI identifying the MODS schema>]
> 
> It receives the SRU request:
> http://z3950.loc.gov:7090/voyager?version=1.1&operation=searchRetrieve&
> query=dinosaur&maximumRecords=1&recordSchema=<URI identifying the mods
> schema>
> 
> If the "URI identifying the MODS schema" in the request matches the URI
> in
> the table, then the server know what schema the client wants, and it
> proceeds.  If there are multiple identifiers then it has to have a row
> in
> its table for each.
> 
> Does that make sense?

Makes absolute sense to me.  Since LC is the "author/creator" of MODS, it
should create a Real World Object URI for the MODS version 3.3 schema.  So LC
now creates:

http://www.loc.gov/standards/mods/v3.3

Everyone uses that URI for the SRU recordSchema parameter.  What LC has
done is define a URI with the following policy statement:

1) Type of usage: Real World Object (RWO)
2) Persistence: Yes
3) Resolvable: No

Issue resolved.

As a side issue, one could argue that placing a schema at:

>  http://www.loc.gov/standards/mods/v3/mods-3-3.xsd
>  http://www.acme.com//mods-3-3.xsd
>  http://www.takoma.org/standards/mods-3-3.xsd

and not having an authorized location is a recipe for disaster.  One of 
those URIs is the authorized URI; the other two are URI aliases.  So
according to RFC 2616 you would probably want to have the latter two
URIs either redirect 301/302/307 back to the first URI or have the
latter two return a 200 with a Content-Location header containing
the first URI.  Now user agents can figure out which is the authorized
version of the schema and which are URI aliases.


Andy.


Re: [CODE4LIB] resolution and identification (was Re: [CODE4LIB] registering info: uris?)

2009-04-02 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Ray Denenberg, Library of Congress
> Sent: Wednesday, April 01, 2009 1:59 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] resolution and identification (was Re:
> [CODE4LIB] registering info: uris?)
> 
> We do just fine minting our URIs at LC, Andy. But we do appreciate your
> concern.

Sorry Ray, that statement wasn't directed at LC in particular, but was a 
general statement.  OCLC doesn't do any better in this area, especially 
with WorldCat, where there are the same issues I pointed out with your 
examples, and additional issues to boot.  The point I was trying to make
was that *all* organizations need to have clear policies on creation, 
maintenance, persistence, etc.  Failure to do so creates a big mess 
that takes time to fix, often creating headaches for those using an 
organization's URIs.  Take for example when NISO redesigned their site 
and broke all the URIs to their standards.  Tim Berners-Lee addresses 
this in his "Cool URIs Don't Change" article.

> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Ross Singer
> Sent: Wednesday, April 01, 2009 2:07 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] resolution and identification (was Re:
> [CODE4LIB] registering info: uris?)
> 
> Ray, you are absolutely right.  These would be bad identifiers.  But
> let's say they're all identical (which I think is what you're saying,
> right?), then this just strengthens the case for indirection through a
> service like purl.org.  Then it doesn't *matter* that all of these are
> different locations, there is one URI that represent the concept of
> what is being kept at these locations.  At the end of the redirect can
> be some sort of 300 response that lets the client pick which endpoint
> is right for them -or arbitrarily chooses one for them.

Exactly, but purl.org is just using standard HTTP protocol mechanisms 
which could be easily done by LC's site given Ray's examples.

What is at issue is the identification of a Real World Object URI for
MODS v3.3.  Whether I get back an XML schema, a RelaxNG schema, etc.,
those are just Web Documents, or representations, of that abstract Real World 
Object.  What Ross did was make the PURL the Real World Object URI for
MODS v3.3 and use it to redirect to the geographically distributed
Web Documents, e.g., representations.  LC could have just as well
minted one under its own domain.


Andy.


Re: [CODE4LIB] resolution and identification

2009-04-01 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Jonathan Rochkind
> Sent: Wednesday, April 01, 2009 1:23 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] resolution and identification
> 
> Houghton,Andrew wrote:
> >
> > Organizations need to have a clear understanding of what they are
> minting
> > URIs for.
> >
> 
> Precisely. And in the real world... they don't always have that.

Then one could contend that they should not be minting URIs; this is 
why they get into trouble.  They have no clear policy on what a URI
identifies, when it is appropriate to create aliases, and, when URI
aliases are created, how to use appropriate HTTP mechanisms to redirect
them or indicate that they are URI aliases.  They also need to understand
the difference between URIs for Real World Objects, Generic Documents
and Web Documents.  Fail to understand these issues and you will
create a mess.

> ONE of the benefits of info is that the registry process forces minters
> to develop that clear understanding (to some extent), and documents it
> for later users.  There are also other pros and cons.

Right, but your argument seems to be that you have all these HTTP URIs,
so rather than sort it out using appropriate standards-based mechanisms,
let's create yet another URI to add to all the other URI aliases?


Andy.


Re: [CODE4LIB] resolution and identification (was Re: [CODE4LIB] registering info: uris?)

2009-04-01 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Karen Coyle
> Sent: Wednesday, April 01, 2009 1:06 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] resolution and identification (was Re:
> [CODE4LIB] registering info: uris?)
> 
> The general convention is that "http://" is a web address, a location.
> I realize that it's also a form of URI, but that's a minority use of
> http.  This leads to a great deal of confusion. I understand the desire
> to use domain names as a way to create unique, managed identifiers, but
> the http part is what is causing us problems.

http:// is an HTTP URI, defined by RFC 3986; loosely, I will agree that
it is a web address.  However, it is not a location.  URIs according
to RFC 3986 are just tokens to identify resources.  These tokens, e.g.,
URIs, are presented to protocol mechanisms as part of the dereferencing
process to locate and retrieve a representation of the resource.

People see http: and assume that it means the HTTP protocol so it must
be a locator.  Whoever initially registered the HTTP URI scheme could 
have used "web" as the token instead and we would all be doing:
web://example.org/.  This is the confusion.  People don't understand 
what RFC 3986 is saying.  It makes no claim that any registered URI 
scheme has persistence or can be dereferenced.  An HTTP URI is just a 
token to identify some resource, nothing more.


Andy.


Re: [CODE4LIB] resolution and identification

2009-04-01 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Ray Denenberg, Library of Congress
> Sent: Wednesday, April 01, 2009 12:55 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] resolution and identification
> 
> A concrete example.
> 
> The MODS schema, version 3.3, has an info identifier, for SRU purposes:
> 
> info:srw/schema/1/mods-v3.3
> 
> So in an SRU request you can say"
> 
> recordSchema=info:srw/schema/1/mods-v3.3

Yes, but you can use any token for a recordSchema so long as everyone
agrees that the token represents the schema for MODS 3.3.  You could
have just as well used an HTTP URI for that parameter in SRU, which
is what I typically do for things like SKOS.
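
For example, a searchRetrieve request could carry either token in its
recordSchema parameter.  A sketch in Python, with a hypothetical SRU
endpoint at sru.example.org:

from urllib.parse import urlencode

base = "http://sru.example.org/sru?"
common = {"operation": "searchRetrieve", "version": "1.1",
          "query": "dc.title=frog"}

# The recordSchema value is just an agreed-upon token; both of these
# work so long as the server recognizes them.
print(base + urlencode(dict(common,
    recordSchema="info:srw/schema/1/mods-v3.3")))
print(base + urlencode(dict(common,
    recordSchema="http://www.loc.gov/standards/mods/v3/mods-3-3.xsd")))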

> Now in contrast, the schema is at
> http://www.loc.gov/standards/mods/v3/mods-3-3.xsd

This could have been used in place of the info URI in the SRU request.  This
URI is a Web Document and probably is the definitive URI for version
3.3 of the MODS XML schema.

> And it's also at:
> http://www.loc.gov/mods/v3/mods-3-3.xsd

This is a URI alias.  The Architecture of the Web warns us not to create
URI aliases, but HTTP allows us to do two things with this URI: 1)
redirect it to the correct URI, or 2) use the Content-Location header to
indicate that it is a URI alias.  Neither of which is done by the LC
Web site.

> And also:
> http://www.loc.gov/mods/mods.xsd
>
> And:
> http://www.loc.gov/standards/mods/mods.xsd

These are not the same URIs as the previous URIs.  These URIs point to the 
latest MODS XML schema document.  Their semantics are those of the "current" 
version of the schema.  One of these is another URI alias that has the
same issues I previously stated.

> And:
> http://www.loc.gov/standards/mods/v3/mods.xsd

This URI is not the same as the previous URIs.  This URI points to the
latest MODS version 3 XML schema document.  Its semantics are those of
the "current" version 3 of the schema.

> So there you have five http "identifiers" for the schema.

At this point in time all five of these HTTP URIs may dereference to the same
Web Document, but over time the latter three will not represent version 3.3
of the MODS XML schema.  The first URI is most likely the definitive URI
for version 3.3 of the MODS XML schema.


Organizations need to have a clear understanding of what they are minting 
URIs for.


Andy.


Re: [CODE4LIB] resolution and identification (was Re: [CODE4LIB] registering info: uris?)

2009-04-01 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Ray Denenberg, Library of Congress
> Sent: Wednesday, April 01, 2009 11:58 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] resolution and identification (was Re:
> [CODE4LIB] registering info: uris?)
> 
> I realize this is pretty much a dead-end debate as everyone has dug
> themselves into a position and nobody is going to change their mind. It
> is a
> philosophical debate and there isn't a right answer.  But in my opinion
> 

Often it is portrayed as a philosophical debate, but it's about standards.
Nothing in RFC 3986 says that any URI scheme must be resolvable.  A URI
with an HTTP scheme is just as good as a URI with any other scheme.  URIs
are just identification tokens.  Resolvability, or dereferencing, is about
the use of a URI.

> Why? Because it drives me nuts to see http URIs everywhere that give
> all appearances of resolvability - browsers, editors, etc.  turn them 
> into clickable links.

This happens with info URIs too.  Show a person an info URI and tell them
that it's a URI, and they might swipe the text and try to resolve it in
their favorite browser.  It doesn't help when their browser spits back
"unknown URI scheme".  They will probably just go off and Google it.

The argument that info URIs are not resolvable doesn't mean that
someone will not try to resolve one in their browser.  Resolvability,
like persistence, is a policy statement about a URI.


Andy.


Re: [CODE4LIB] registering info: uris?

2009-04-01 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Mike Taylor
> Sent: Wednesday, April 01, 2009 10:17 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] registering info: uris?
> 
> Houghton,Andrew writes:
>  > Again we have moved the discussion to a specific resolution
> mechanism,
>  > e.g., OpenURL.  OpenURL could have been defined differently, such
>  > that rft_id and rft_idScheme were available and you used the actual
>  > DOI value and specified the scheme of the identifier.  Then the
> issue
>  > of extraction of the identifier value from the URI goes away,
> because
>  > there is no URI needed.
> 
> Yes, that would have been OK, too.  But no doubt there are other
> contexts where it's possible to pass in an identifier without also
> being able to say "and by the way, it's of type XYZ".  Surely you
> don't disagree that it's good for identifiers to be self-describing?

OK, now we have moved the discussion back to identifiers rather than
resolution mechanisms.  I absolutely agree that it's good for
identifiers to be self-describing; I wasn't saying otherwise.
However, let's take the following URIs:

http://any.identifier.org/?scheme=doi&id=10./j.1475-4983.2007.00728.x
info:doi/10./j.1475-4983.2007.00728.x
urn:doi:10./j.1475-4983.2007.00728.x

All three are self-describing URIs.  The HTTP URI does exactly the same thing
as the info URI without having to create a new URI scheme, e.g., info, which
was the argument the IETF and W3C made against the creation of info URIs.  Also,
since the info URI folks actually created a domain name for registering info 
URIs, you could have easily changed "any.identifier.org" to "info-uri.info"
to achieve the same effect as the info URI.

> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Mike Taylor
> Sent: Wednesday, April 01, 2009 10:44 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] registering info: uris?
>
> Imagine your web-browser extended by a plugin that knows how to
> resolve particular kinds of info: URLs.  If you just paste the raw
> DOI into the URI bar, it won't have a clue what to do with it, but the
> wrapped-in-a-URI version stands alone and self-describes, so the
> plugin can pull it apart and say, "ah yes, this URI is a DOI, and I
> know how my user has configured me to resolve those."

Sure, you can imagine a web-browser plugin, but these things never happen
due to a) the cost of developing one and, b) the fact that for it to work
you need a plugin for every type of browser.  This is why the Architecture
of the Web document states:

  "While Web architecture allows the definition of new schemes, introducing 
   a new scheme is costly. Many aspects of URI processing are scheme-dependent, 
   and a large amount of deployed software already processes URIs of well-known 
   schemes. Introducing a new URI scheme requires the development and 
deployment 
   not only of client software to handle the scheme, but also of ancillary 
agents 
   such as gateways, proxies, and caches. See [RFC2718] for other 
considerations 
   and costs related to URI scheme design"

> What you seem to be suggesting (are you?) is that in the former case, the 
> resolver should recognise that the HTTP URL matches the regular expression
>   ^http://dx\.doi\.org/(.*)$
> and so extract the match and go off and do something else with it.

Back to resolution mechanisms... I'm not suggesting anything.  You are
suggesting a resolution mechanism implementation which uses regular
expressions.  That is one of many ways a resolution mechanism can retrieve
the embedded DOI or identifier of choice.  URI Templates are another.
Given this URI:

http://any.identifier.org/?scheme=doi&id=10./j.1475-4983.2007.00728.x

any Web library on the planet can pull the query parameters out of the URI.
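
For example, in Python, with nothing but the standard library (the DOI
value is copied as it appears in this thread):

from urllib.parse import urlparse, parse_qs

uri = "http://any.identifier.org/?scheme=doi&id=10./j.1475-4983.2007.00728.x"
params = parse_qs(urlparse(uri).query)
print(params["scheme"][0])  # the identifier type: doi
print(params["id"][0])      # the embedded identifier value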

> as the "actionable identifier" might be something uglier...

A URI is just a token with a predefined syntax, per RFC 3986, used to identify a
resource, which can be an abstract "thing", e.g., a Real World Object, or a
representation of a resource, e.g., a Web Document.  One could postulate that
all URIs are ugly.  Whether a URI is ugly or not is irrelevant.


Andy.


Re: [CODE4LIB] resolution and identification (was Re: [CODE4LIB] registering info: uris?)

2009-04-01 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Jonathan Rochkind
> Sent: Wednesday, April 01, 2009 11:08 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] resolution and identification (was Re: [CODE4LIB]
> registering info: uris?)
> 
> Houghton,Andrew wrote:
> > Lets separate your argument into two pieces. Identification and
> > resolution.  The DOI is the identifier and it inherently doesn't
> > tie itself to any resolution mechanism.  So creating an info URI
> > for it is meaningless, it's just another alias for the DOI.  I
> > can create an HTTP resolution mechanism for DOI's by doing:
> >
> > http://resolve.example.org/?doi=10./j.1475-4983.2007.00728.x
> >
> > or
> >
> > http://resolve.example.org/?uri=info:doi/10./j.1475-
> 4983.2007.00728.x
> >
> > since the info URI contains the "natural" DOI identifier, wrapping it
> > in a URI scheme has no value when I could have used the DOI
> identifier
> > directly, as in the first HTTP resolution example.
> >
> 
> I disagree that wrapping it in a URI scheme has no value.  We have very
> much software and schemas that are built to store URIs, even if they
> don't know what the URI is or what can be done with it, we have
> infrastructure in place for dealing with URIs.

Oops... that should have read "... wrapping it in an unresolvable URI
scheme..."

The point being that:

urn:doi:*
info:doi/*

provide no advantages over:

http://doi.org/*

when, per the W3C TAG httpRange-14 decision, you identify the URI as being
a Real World Object.  When identifying the HTTP URI as a Real World Object,
it is the same as what Mike said about the info URI: "the identifier
describes its own type".


Andy.


Re: [CODE4LIB] registering info: uris?

2009-04-01 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Eric Hellman
> Sent: Wednesday, April 01, 2009 9:51 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] registering info: uris?
> 
> There are actually a number of http URLs that work like
> http://dx.doi.org/10./j.1475-4983.2007.00728.x
> One of them is http://doi.wiley.com/10./j.1475-4983.2007.00728.x
> Another is run by crossref.  Some OpenURL link servers also have doi
> proxy capability.
> So for code to extract the doi reliably from http urls, the code needs
> to know all the possibilities for the doi proxy stem. The proxies also
> tend to have optional parameters that can control the resolution. In
> principle, the info:doi/ stem addresses this.

Again we have moved the discussion to a specific resolution mechanism,
e.g., OpenURL.  OpenURL could have been defined differently, such
that rft_id and rft_idScheme were available and you used the actual
DOI value and specified the scheme of the identifier.  Then the issue
of extraction of the identifier value from the URI goes away, because
there is no URI needed.


Andy.


Re: [CODE4LIB] registering info: uris?

2009-04-01 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Mike Taylor
> Sent: Wednesday, April 01, 2009 9:35 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] registering info: uris?
> 
> Houghton,Andrew writes:
>  > So creating an info URI for it is meaningless, it's just another
>  > alias for the DOI.
> 
> Not quite.  Embedding a DOI in an info URI (or a URN) means that the
> identifier describes its own type.  If you just get the naked string
>   10./j.1475-4983.2007.00728.x
> passed to you, say as an rft_id in an OpenURL, then you can't tell
> (except by guessing) whether it's a DOI, a SICI, an ISBN or a
> biological species identifier.  But if you get
>   info:doi/10./j.1475-4983.2007.00728.x
> then you know what you've got, and can act on it accordingly.

Now you are changing the argument to a specific resolution mechanism,
e.g., OpenURL.  OpenURL could have easily defined rft_idType where
you specified DOI, SICI, ISBN, etc. along with its actual identifier
value in rft_id.  However, given that OpenURL didn't do this, there
is no difference plugging either of the following URIs into rft_id:

http://dx.doi.org/10./j.1475-4983.2007.00728.x
info:doi/10./j.1475-4983.2007.00728.x 

when I identify the HTTP URI as a Real World Object.  This was the
whole point of the W3C TAG httpRange-14 decision which the "Cool
URIs for the Semantic Web" document is based on.

So again, wrapping the "natural" DOI in an unresolvable URI scheme
is meaningless.  When talking about resolution mechanisms any number
of implementations are possible, including separating an identifier
type from its value or conflating the two.  In the two URIs above
the only real differences are:

1) http: vs. info: URI scheme
2) an authority named: dx.doi.org vs. doi

These are just simple substitutions.  Whoever registered the info URI
for doi could have easily applied for an authority named: dx.doi.org
instead of just doi, then the only difference would be the URI scheme.
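
A sketch of those substitutions in Python.  The doi/dx.doi.org pair is
from this thread; the functions are plain string handling, not an
official resolver API:

def info_to_http(uri, authority="dx.doi.org"):
    # Rewrite info:doi/<value> as http://<authority>/<value>.
    assert uri.startswith("info:doi/")
    return "http://%s/%s" % (authority, uri[len("info:doi/"):])

def http_to_info(uri, authority="dx.doi.org"):
    # Rewrite http://<authority>/<value> as info:doi/<value>.
    prefix = "http://%s/" % authority
    assert uri.startswith(prefix)
    return "info:doi/" + uri[len(prefix):]

print(info_to_http("info:doi/10./j.1475-4983.2007.00728.x"))
print(http_to_info("http://dx.doi.org/10./j.1475-4983.2007.00728.x"))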


Andy.


Re: [CODE4LIB] registering info: uris?

2009-04-01 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Mike Taylor
> Sent: Wednesday, April 01, 2009 9:17 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] registering info: uris?
> 
> The point is that (I argue) the identifier shouldn't tie itself to a
> particular dereferencing mechanism (such as dx.doi.org, or amazon.com)
> but should be dereferenced by software that knows what's the most
> appropriate dereferencing mechanism _for you_ in your situation, with
> your subscriptions, at particular distances from specific libraries,
> etc.

Lets separate your argument into two pieces.  Identification and
resolution.  The DOI is the identifier and it inherently doesn't
tie itself to any resolution mechanism.  So creating an info URI
for it is meaningless; it's just another alias for the DOI.  I 
can create an HTTP resolution mechanism for DOIs by doing:

http://resolve.example.org/?doi=10./j.1475-4983.2007.00728.x

or

http://resolve.example.org/?uri=info:doi/10./j.1475-4983.2007.00728.x

since the info URI contains the "natural" DOI identifier, wrapping it
in a URI scheme has no value when I could have used the DOI identifier
directly, as in the first HTTP resolution example.


Andy.


Re: [CODE4LIB] registering info: uris?

2009-04-01 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Mike Taylor
> Sent: Wednesday, April 01, 2009 8:38 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] registering info: uris?
> 
> Ross Singer writes:
>  > I suppose my point is, there's a valid case for identifiers like
>  > your doi, I think we can agree on that (well, we don't have to
>  > agree, these identifiers will exist and continue to exist long
>  > after we've grown tired of flashing out gang signs).  What I don't
>  > understand is the reason to express that identifier as:
>  >
>  > info:doi/10./j.1475-4983.2007.00728.x
>  >
>  > when
>  >
>  > http://dx.doi.org/10./j.1475-4983.2007.00728.x
>  >
>  > can serve exactly the same function *and* be actionable.

This was exactly the point I was making, but you said it much more
coherently than I did, Ross.  If you are going to use a 
"natural" identifier, like doi, isbn, lccn, etc., then use it, but 
if you are going to Web-ify that "natural" identifier, use an HTTP 
URI.  It doesn't need to be actionable today, but can be tomorrow, 
without anybody having to write a new resolution mechanism and 
clients having to integrate that new resolution mechanism in their
systems.  Typically, most resolution mechanisms for unresolvable 
URI schemes use HTTP URIs anyway and amount to:

http://resolve.example.org/?uri=info:isbn/141574338X 
http://resolve.example.org/?uri=urn:isbn:141574338X 

which could have just been:

http://isbn.info/141574338X

> The problem with the latter identifier (and to be clear, yes, I agree
> that it COULD function as an identifier) is that it gives the
> impression that what you get when you dereference the DOI is that
> specific resource, i.e. it enshrines dx.doi.org as THE way of
> dereferencing DOIs.

I agree that Ross's DOI example could function as an identifier.  I think 
we can agree that RFC 3986 says that URIs are just tokens with a specified 
syntax.  Nothing in RFC 3986 says that a URI has to be actionable.

You are talking about an "impression" that isn't enshrined in RFC 3986.
It might be better to think about this in terms of the W3C's "Cool URIs
for the Semantic Web" document.  That document classifies URIs into
three types: Real World Objects, Generic Documents and Web Documents.  So
which type is http://dx.doi.org/10./j.1475-4983.2007.00728.x?

It depends.  If I say that it is a Real World Object, it’s an identifier
for the actual DOI identifier.  If I say that it is a Web Document, then 
dereferencing it will give me a specific resource.  In this case I can 
have and probably should have both a Real World Object URI and a Web 
Document URI.

> What if I don't want to get the article from dx.doi.org?  Maybe if I
> go via that site, it'll point me to Elsevier's pay-for copy of an
> article, whereas if I'd fed the DOI to my local library's resolver, it
> would have sent me to Blackwell's version which the library has a
> subscription for.  An actionable URI mandates (or at least strongly
> suggests) a particular course of action: but I don't want you to tell
> me what to _do_, I just want you to tell me what the Thing is.

People wanting to identify the DOI use the Real World Object URI and
people wanting to find out information about the DOI use the Web 
Document URI.

Both these URIs, Real World Object and Web Document, are HTTP URIs, so 
there is little if any value in using info or URN URIs.  People *tend* 
to use URN URIs because RFC 2141 states that the URI has persistence, 
and people *tend* to use info URIs because RFC 4452 states that there 
is no persistence.  However, persistence is a policy statement 
made by the minter of a URI.  You can make a persistence policy 
statement about any URI, including HTTP URIs.


Andy.


Re: [CODE4LIB] registering info: uris?

2009-03-30 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Jonathan Rochkind
> Sent: Monday, March 30, 2009 3:52 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] registering info: uris?
> 
> But when did someone suggest that all identifiers must be resolvable?
> When Andrew argued that:
> 
> > Having unresolvable URIs is anti-Web since the Web is a hypertext
> > system where links are required to make it useful.  Exposing
> > unresolvable links in content on the Web doesn't make the Web
> > more useful.
> 
> Okay, I guess he didn't actually SAY that you should never have non-
> resolvable identifiers, but he rather strongly implied it, by using the
> "anti-Web" epithet.

You are correct that I didn't say that you should never have unresolvable
identifiers, and I wasn't implying that either.  Though I was pointing out
that sticking a link whose target is an unresolvable info URI into the
hypertext system negates the effect of linking to it in the first place.


Andy.


Re: [CODE4LIB] registering info: uris?

2009-03-30 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Jonathan Rochkind
> Sent: Monday, March 30, 2009 12:16 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] registering info: uris?
> 
> Some hints of the existing argument in other forums can be found in
> this
> post by Stu Weibel, and the other posts it links to.
> 
> http://weibel-lines.typepad.com/weibelines/2006/08/uncoupling_iden.html

Unfortunately, Stu is an author of the info URI specification and
he makes the same arguments that were made to justify the info URI
RFC, which have been debunked by the W3C:



Having unresolvable URIs is anti-Web since the Web is a hypertext
system where links are required to make it useful.  Exposing
unresolvable links in content on the Web doesn't make the Web 
more useful.


Andy.


Re: [CODE4LIB] registering info: uris?

2009-03-30 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Mike Taylor
> Sent: Monday, March 30, 2009 12:15 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] registering info: uris?
> 
> The problem is that, after setting up a non-dereferencable http: URI
> to name something like an XML namespace or a CQL context set, it's
> just so darned _tempting_ to put something explanatory at the location
> which happens to be indicated by that URI  :-)

and that is what you are supposed to do...  Having a representation of
the "thing" is useful and is what makes the Web and any other hypertext
system useful.


Andy.


Re: [CODE4LIB] registering info: uris?

2009-03-30 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Mike Taylor
> Sent: Monday, March 30, 2009 11:30 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] registering info: uris?
> 
> Ross Singer writes:
>  > There should be no issue with having both, mainly because like I
>  > mentioned earlier, nobody cares about info:uris.
>  >
>  > Take, for instance, DOIs.  What do you see in the wild?  Do you ever
>  > see info:uris (except in OpenURLs)?  If you don't see
>  > http://dx.doi.org/ URIs you generally see doi:10... URIs.  It seems
>  > like having http and info URIs would *have* to be fine, since
>  > info:uris *not being dereferenceable* are far less useful (I won't
> go
>  > so far as 'useless') on the web, which is where all this is
> happening.
> 
> What on earth does dereferencing have to do with this?
> 
> We're talking about an identifier.

Exactly, that is what people don't understand about RFC 3986.  URIs are
just identifiers and have nothing to do with dereferencing.  Dereferencing
only comes into play when the URI is used with an actual protocol like 
HTTP.  The only thing the URI scheme at the start of the URI, e.g., http:,
tells you is what the syntax of the rest of the URI looks like.  This is where 
the authors of info URIs missed the boat.  They conflated the URI scheme,
e.g., http:, with dereferencing and used it as a justification for a new
URI scheme.  The authors were told of that misconception before info
became an RFC by both the IETF and W3C, but they decided to proceed 
anyway, creating another library-specific standard that no one else will
use.

If people would just follow the practice prescribed by the W3C:

 
Architecture of the Web says:

2.3.1. URI aliases

Best practice: "A URI owner SHOULD NOT associate arbitrarily different URIs 
with the same resource."

2.4. URI Schemes

Best practice: "A specification SHOULD reuse an existing URI scheme (rather 
than create a new one) when it provides the desired properties of identifiers 
and their relation to resources."

Quote: "While Web architecture allows the definition of new schemes, 
introducing a new scheme is costly. Many aspects of URI processing are 
scheme-dependent, and a large amount of deployed software already processes 
URIs of well-known schemes. Introducing a new URI scheme requires the 
development and deployment not only of client software to handle the scheme, 
but also of ancillary agents such as gateways, proxies, and caches. See 
[RFC2718] for other considerations and costs related to URI scheme design."

 
This tag finding pretty much debunks all the reasons given by the info URI 
authors for creating a new URI scheme.  I think Erik Hetzner also referenced it 
in his posts.


Andy.


Re: [CODE4LIB] registering info: uris?

2009-03-27 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Jonathan Rochkind
> Sent: Friday, March 27, 2009 6:09 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] registering info: uris?
> 
> If GPO had a system where I could resolve Sudoc identifiers, then this
> whole problem would be solved right there, I wouldn't need to go any
> further, I'd just use the http URI's associated with that system as
> identifiers! This whole problem statement is because GPO does not
> provide any persistent URIs for sudoc's in the first place, right?

With a little Googling, how about this:

sudoc: E 2.11/3:EL 2


Looks like the param scan_start= holds the SuDoc number.  Sure, it gives you
other results, but it might work for your purposes.

Seems like they are creating bad HTTP responses, since Fiddler throws a protocol
violation because they do not end the HTTP headers with CR,LF,CR,LF and instead 
use LF,LF...
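
For reference, a well-formed response terminates its header block with
CR,LF,CR,LF.  A minimal sketch in Python, showing only the terminator,
not GPO's actual output:

response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 2\r\n"
    "\r\n"   # CR,LF,CR,LF: the blank line that ends the header block
    "ok"
)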


Andy.


Re: [CODE4LIB] registering info: uris?

2009-03-27 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Ray Denenberg, Library of Congress
> Sent: Friday, March 27, 2009 5:38 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] registering info: uris?
> 
> Correct me if I'm wrong but isn't the point of all this to be able to
> put
> the URI in an OpenURL?   And info was invented (in part) to avoid
> putting
> http URIs in OpenURLs  (because they are complicated enough already,
> why
> clutter them further).  So I don't see that pursuing an http solution
> to
> this is very useful.   --Ray

Ray, I don't quite understand the "to avoid putting http URIs in
OpenURLs" part.  An info URI as well as an HTTP URI use the same 
encoding rules from RFC 3986, URI Generic Syntax.  So neither
has an advantage over the other.  If you have a %80%CC in your
info URI or HTTP URI then sticking it in an OpenURL will 
require it to become %2580%25CC.  So what am I missing about
your statement?


Andy.


Re: [CODE4LIB] registering info: uris?

2009-03-27 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Jonathan Rochkind
> Sent: Friday, March 27, 2009 5:28 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] registering info: uris?
> 
> Another good idea, true. There are indeed lots of ways to do this.
> 
> But wait, you don't need a unique hostname for a tag uri, a unique uri
> (hostname+path) will do? purl.org will only give me the latter, not the
> former, right?

Tag URIs require that the authorizing "agency" own the domain name and they
cannot specify a date that is before their domain registration or in the
future.  So nobody could mint Tag URIs with purl.org as the domain name.

PURLs might be an interesting solution for you if GPO has a system where
you can resolve SuDoc identifiers.  Then you could create a PURL and point
it to their system.  Now you get to use your PURL for your project and, as
a side benefit, get lookup capabilities from GPO!  Otherwise you could just
send them to a relevant page on the GPO site.


Andy.


Re: [CODE4LIB] registering info: uris?

2009-03-27 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Jonathan Rochkind
> Sent: Friday, March 27, 2009 5:18 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] registering info: uris?
> 
> I am not interested in maintaining a sudoc.info registration, and
> neither is my institution, who I wouldn't trust to maintain it (even to
> the extent of not letting the DNS registration expire) after I left.

BTW, you could always use http://purl.org/ and later, if you wanted
to have it resolve to something, just change the PURL.


Re: [CODE4LIB] registering info: uris?

2009-03-27 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Jonathan Rochkind
> Sent: Friday, March 27, 2009 5:00 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] registering info: uris?
> 
> Aha, cool!  Yeah, I could use tag for this, but it wouldn't seem
> appropriate for something I want to encourage others to use compatibly
> as well, info seems better.

Not to push tag URIs on you, just providing some information,
but if you are working with other organizations, you could 
just go to GoDaddy and get a domain name for your project, 
then use an email address instead of ND.EDU:

tag:project-n...@my-tags.org,2009:id/sudoc-value


Andy.


Re: [CODE4LIB] registering info: uris?

2009-03-27 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Jonathan Rochkind
> Sent: Friday, March 27, 2009 4:52 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] registering info: uris?
> 
> Also, the date aspect of a tag-uri seems to make it hard to use to mint
> an identifier that will always represent the same SuDoc, regardless of
> when it was minted.

No, the date part is a versioning scheme, not the date you created the
tag URI.  It's used, for example, where I created a specific tag
scheme one day and then decided to create another tag scheme some
other day:

tag:example.org,1999:date/yy-mm-dd

where yy-mm-dd is the year, month and day values.  Then I realize that
it's Y2K, so I create a new tag scheme:

tag:example.org,2000:date/yyyy-mm-dd


Andy.


Re: [CODE4LIB] registering info: uris?

2009-03-27 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Jonathan Rochkind
> Sent: Friday, March 27, 2009 4:42 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] registering info: uris?
> 
> I am looking for the easiest possible way to get a legal URI
> representing a sudoc.
> 
> My understanding, after looking at this stuff previously, is that info:
> is a LOT lower barrier than urn:, and that's part of it's purpose.

Jonathan, you could use tag URIs, RFC 4151, if you are looking for something
quick and dirty.  No need to register with any authority since you are using
your own DNS name.
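
For example, a quick-and-dirty minter in Python.  example.org and the
SuDoc-ish value are placeholders, not a registered authority:

def mint_tag_uri(authority, date, specific):
    # Build a tag URI per RFC 4151: tag:<authority>,<date>:<specific>.
    # The date may be a full yyyy-mm-dd or just a year.
    return "tag:%s,%s:%s" % (authority, date, specific)

print(mint_tag_uri("example.org", "2009", "sudoc/E2.11-3-EL2"))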




Andy.


Re: [CODE4LIB] MIME Type for MARC, Mods, etc.?

2009-02-13 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Jonathan Rochkind
> Sent: Friday, February 13, 2009 12:02 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] MIME Type for MARC, Mods, etc.?
> 
> I'm confused about your suggestion of registering a content type for
> SRU.  My understanding is that SRU is a _protocol_, not a media type?

SRU is a messaging protocol just like SOAP and HTTP.  When SRU and SOAP
are tunneled through HTTP you are sending back the SRU "message" as the
HTTP entity representation which contains a result set from the search.
You can register protocol messages with IANA, e.g., application/soap+xml
and message/http.


Andy.


Re: [CODE4LIB] MIME Type for MARC, Mods, etc.?

2009-02-13 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Ray Denenberg, Library of Congress
> Sent: Thursday, February 12, 2009 5:48 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] MIME Type for MARC, Mods, etc.?
> 
> A few points:
> 
> 1. "x-" is commonly used in cases when an application for a mime type
> is
> pending, and when there is a reasonable expectation that it will be
> approved.   The mime type is prefixed with "x-" until the requested
> mime
> type becomes official, after which the "x-" is dropped.
> 
> 2. We will be registering MODS and MARCXML:
>  - application/mods+xml
>  - application/marcxml+xml

Please don't forget to also register application/mads+xml too, for those of us 
who are using MADS.


Andy.


Re: [CODE4LIB] MIME Type for MARC, Mods, etc.?

2009-02-12 Thread Houghton,Andrew
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
> Alexander Johannesen
> Sent: Thursday, February 12, 2009 4:00 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] MIME Type for MARC, Mods, etc.?
> 
> On Thu, Feb 12, 2009 at 21:43, Rebecca S Guenther  wrote:
> > Patrick is right that an XML schema such as MODS or MARCXML would be
> text/xml.
> 
> I would strongly advise against text/xml, as it is an oxymoron (text
> is not XML; XML is not text, even if it is delivered through a text
> protocol), and more and more are switching away from the generic text
> protocol (which makes little sense for structured data).

According to RFC 3023, section 3 XML Media Types:

   If an XML document -- that is, the unprocessed, source XML document
   -- is readable by casual users, text/xml is preferable to
   application/xml.  MIME user agents (and web user agents) that do not
   have explicit support for text/xml will treat it as text/plain, for
   example, by displaying the XML MIME entity as plain text.
   Application/xml is preferable when the XML MIME entity is unreadable
   by casual users.

So it is justified to return a Content-Type header with text/xml.  It
depends upon whether you think MARC-XML, MODS, MADS, etc. are readable
by casual users, and on the user agents you expect to be accessing the
documents.
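
For example, a toy server that makes the policy decision explicit.  This
is a sketch: the handler, port and XML payload are made up for
illustration:

from http.server import BaseHTTPRequestHandler, HTTPServer

XML = b'<?xml version="1.0"?><record/>'

class XmlHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Policy decision per RFC 3023: text/xml when casual users can
        # read the source, application/xml when they cannot.
        self.send_response(200)
        self.send_header("Content-Type", "text/xml")
        self.send_header("Content-Length", str(len(XML)))
        self.end_headers()
        self.wfile.write(XML)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), XmlHandler).serve_forever()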

> Hence, a more correct MIME type for XMLMARC would be
> application/marc+xml, although until registered should be
> application/x-marc+xml.

I'm not sure the +xml is correct, on two fronts.  First, RFC 2220 defines
the media type for MARC binary, not MARC-XML, and it was my understanding 
that the +xml meant that the schema allowed extension by using XML 
namespaces, which MARC binary does not.  Further, in the case of MARC-XML,
its schema also does not allow arbitrary XML elements.  MODS and MADS I 
believe do, but that is a different story.


Andy.


Re: [CODE4LIB] Javascript trees

2008-11-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
> Yitzchak Schaffer
> Sent: Wednesday, November 05, 2008 1:40 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] Javascript trees
> 
> Coders:
> 
> Have you had any successful experiences with trees in JS frameworks?
> I'm trying to find one for the site I'm building, in order to "entree"
> the results of an API search; here's what I've found:

If you are looking for something to view trees using alternate 
visualization techniques rather than the standard outline-ish 
folder/document visualization, you might want to take a look at:



From the home page:

JavaScript Information Visualization Toolkit (JIT)

What’s the JIT?

The JIT is an advanced JavaScript infovis toolkit based on 5 papers about 
different information visualization techniques.  The JIT implements advanced 
features of information visualization like Treemaps (with the slice and dice 
and squarified methods), an adapted visualization of trees based on the 
Spacetree, a focus+context technique to plot Hyperbolic Trees, and a radial 
layout of trees with advanced animations (RGraph).

I have been meaning to download this toolkit to visualize hierarchies in
controlled vocabularies, but the task is down on my priority list at the
moment :(


Andy.


Re: [CODE4LIB] Cross-Walk - LC Subject Heads to MeSH Subject Headings

2008-10-07 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
> Karen Tschanz
> Sent: Tuesday, October 07, 2008 12:53 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] Cross-Walk - LC Subject Heads to MeSH Subject
> Headings
> 
> Dear Listers:
> 
> We are interested in identifying one or more cross-walks for conversion
> of LC Subject Headings to MeSH subject headings and vice versa. Also,
> if you have used such a cross-walk, what has been your experience with
> it?

Why not just get the Northwestern LCSH/MeSH mappings?


Andy.


Re: [CODE4LIB] LOC Authority Data

2008-10-03 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
> Ya'aqov Ziso
> Sent: Thursday, October 02, 2008 5:39 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] LOC Authority Data
> 
> Andrew Houghton, kindly explain:
> 
> 1. LC names/subjects authority files, current with 2008-09-17, are
> available
> on your SRW server http://tspilot.oclc.org/lcsh/  for us (a consortium)
> to
> harvest and load on our server for
> our consortial authority maintenance?

The SRW server URI endpoint at /lcsh only contains LC subjects.  If we are able 
to provide access to the name authority file it will most likely have the 
endpoint /naf in keeping with LC's code list.  However, harvesting and loading 
is not permitted under the ResearchWorks license, since we are not licensed to 
redistribute the works of the vocabulary owners we are providing service to.  
If you need to load the vocabularies that we are making available in your local 
system, then you should speak directly with the vocabulary owner for their 
licensing terms and conditions.

> 2. Weekly updates to these files to these name/subject files are also
> available on that SRW server?

Since the Terminology service at tspilot.oclc.org is a research project we 
update vocabularies between other research activities.  Most of the 
vocabularies are fairly static in nature or are updated every 6 or 12 months by 
the vocabulary owner and can be worked in with other research activities.  LC 
does provide OCLC with weekly updates for the production WorldCat, but we do 
not have the time to update our research server that frequently.   However, we 
do try to update LCSH every couple of months.  You can visit:



to see when the last time each vocabulary was updated.


Andy. 


Re: [CODE4LIB] LOC Authority Data

2008-09-30 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
> Jonathan Rochkind
> Sent: Tuesday, September 30, 2008 4:35 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] LOC Authority Data
> 
> The NAF (Name/National Authority File) is still one important database
> that we are missing any kind of good machine access to, I believe.

Agreed.  As part of our research project we have enhanced some of the 
vocabulary data in the service to provide mappings and links between 
vocabularies.  One issue we noticed with FAST was that many of the mapped terms 
were not being linked.  We tracked this back to the term being in NAF rather 
than in LCSH.  So to make the FAST data more usable we would have to include 
the entire LC authority file, both names and subjects.  It is something we are 
looking into at the moment...

Andy.


Re: [CODE4LIB] LOC Authority Data

2008-09-30 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
> Ross Singer
> Sent: Tuesday, September 30, 2008 3:23 PM
> 
> I'm going to go out on a limb here and assume you need to be an OCLC
> customer to benefit from this?
>
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
> Ross Singer
> Sent: Tuesday, September 30, 2008 3:23 PM
> 
> s/customer/"partner"/
> 
> Also, in the case of what the thread was initially calling for, what
> would be the legalities of redistributing this data?

You do not need to be a "customer/member/partner" to access the authority 
files.  It's an ongoing research project [1] which is publicly accessible to 
anyone over the web.  The research project is covered by the OCLC ResearchWorks 
Terms and Conditions:



Looks like from a quick reading of this license the redistribution of the data 
is prohibited, but most of the data in the system, except LCSH, is from public 
sources.  So if you wanted to redistribute the data for a vocabulary, you can 
get permission from the vocabulary maintainer, just as we did.  We merely 
consolidated freely available public controlled vocabularies into a service 
that other people could use to build upon, including OCLC Research.

BTW, our research project should not be confused with OCLC's production 
Terminology Service [2] which is only available to members with a cataloging 
authorization.  Actually, I created the prototype for the production service, 
so people do get confused sometimes.  If you have a cataloging authorization 
you can access the production service as a web service.  I posted a how-to on 
the OCLC developer network listserv a while ago.  The production service allows 
access to AAT, DCT, TGN, GSAFD, Maori Subject Headings (Nga Upoko Tukutuku), 
MeSH, NGL, TGM I, TGM II and ULAN.  Obviously, the Getty vocabularies will 
never make it into our research project due to licensing restrictions :(


Andy.

[1] 
[2] 


Re: [CODE4LIB] LOC Authority Data

2008-09-30 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
> Ross Singer
> Sent: Monday, September 29, 2008 7:45 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] LOC Authority Data
> 
> Also, I noticed another dump on the IA of Library of Congress updates
> since the initial "Bisson load".
> http://www.archive.org/details/marc_loc_updates
> 
> In typical IA fashion, it's incredibly difficult to know what the hell
> this stuff is, though.
> -Ross.

If you're just looking for access to the LCSH authority data, you can access it 
through our Terminology Services project.  The data in our SRW server was 
updated to the 2008-09-17 weekly update from LC.  The SRW server is located at 
the URI:



Looking for access to other authority files:

FAST
GSAFD   
MeSH
TGM I   
TGM II  


Andy.


Re: [CODE4LIB] creating call number browse

2008-09-17 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
> Emily Lynema
> Sent: Wednesday, September 17, 2008 11:46 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] creating call number browse
> 
> Hey all,
> 
> I would love to tackle the issue of creating a really cool call number
> browse tool that utilizes book covers, etc. However, I'd like to do
> this
> outside of my ILS/OPAC. What I don't know is whether there are any
> indexing / SQL / query techniques that could be used to browse forward
> and backword in an index like this.
> 
> Has anyone else worked on developing a tool like this outside of the
> OPAC? I guess I would be perfectly happy even if it was something I
> could build directly on top of the ILS database and its indexes (we use
> SirsiDynix Unicorn).

Our group has done something similar with the DDC rather than using call
numbers.  You can find information about the Dewey browser here:



and the prototype browser can be accessed here:



The prototype runs on the Linux platform under Apache HTTP or Tomcat and uses
Apache Solr to index and search the records.  According to the developer, it
was fairly easy to put together using Apache Solr.  The difficult part was the
UI, which has taken many iterations...


Andy.


Re: [CODE4LIB] Assistance with Dewey to LC conversion

2008-08-15 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
> Di Worth
> Sent: Friday, August 15, 2008 12:59 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] Assistance with Dewey to LC conversion
> 
> Dear Code4Libs
> The University of Tasmania has about one third of its collection
> classified by Dewey, and the other part by Library of Congress
> classification.  We will be able to find an LC call number, either from
> a
> call number on an added copy, or from the 050 tag in the bib record.
> We
> have found though that we have about 50,000 bib records with no added
> copies with LC call#, and no 050 information.  About 10,000 of these
> bibs
> do have 010 tags (LC #s), so one assumes that an LC call number would
> be
> available for them should we be able to interrogate a database like
> Libraries Australia or WorldCat and be able to find the information.
> 
> Trouble is, although we may be competent at SQL on our databases,
> automatically "piping" information to an application and parsing
> results
> is not something we've done.  This list seems to have some very savvy
> people on it - can anyone suggest how we could proceed?  Could we use
> something like Yaz with an input file of requests, and get an output
> file
> with 050 information?  Has anyone done this already?  Is there an
> electronic file available with a Dewey - LC conversion, which we could
> use?  50,000 bibs is a lot for us to manually look up!
> 
> Any help you could give would be most appreciated

You might want to look at the OCLC Research Classify project:



Classify is a prototype service designed to support the assignment of 
classification numbers for books, DVDs, CDs, and many other types of materials. 
 The latest prototype can be found here:



If you think Classify fits your scenario, then you might want to contact Diane 
Vizine-Goetz (project lead) and discuss your project with her.  We did a 
special project with the Phoenix Public Library where they sent us their 
catalog records and we used the original Classify prototype to assign Dewey 
classes to records that didn't have any.  The details can be found in the 
following presentation Diane gave, slides 13 to 18:




Andy.


Re: [CODE4LIB] library-related apps for the iPhone?

2008-07-15 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
> Stephens, Owen
> Sent: Tuesday, July 15, 2008 10:31 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] library-related apps for the iPhone?
> 
> >Our biggest challenge is actually getting our hands on an iPhone
> because
> >there are some issues with the university paying for monthly cell
> phone
> >plan.
> 
> I understood the developers SDK included iphone/itouch simulation so
> that you don't need the actual device in front of you to develop?

Yes, the SDK does include a simulator, but at some point you really need to 
test your application on a device in the real world and not just before 
deployment.


Andy.


Re: [CODE4LIB] implementing cool uris in java

2008-07-03 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
> Emily Lynema
> Sent: Thursday, July 03, 2008 12:22 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] implementing cool uris in java
> 
> I'm looking around for tools to implement cool uris in java. I've been
> studying the restlet framework tonight, and while it sounds cool, I
> think it would also require a complete re-write of an application that
> is currently based on the Servlet API. And, of course, I'm working
> under
> a time crunch.
> 
> Is there anything out there to assist me in working with cool uris
> besides just using regular expressions when parsing URLs?
> 
> For example, I'd like to create URLs like:
> 
> http://catalog.lib.ncsu.edu/record/123456
> 
> instead of:
> 
> http://catalog.lib.ncsu.edu/record?id=1234565

There are a number of ways you can create them, but they depend upon your Web 
server and application infrastructure.  For example, if you are using Apache 
for a Web server you could use mod_rewrite, or if you are using Tomcat for a 
Web server you could use URL rewriting.  Since you mentioned that your 
application is using the Servlet API, you could write a Servlet interceptor or 
Servlet filter to process the request before it gets to your application.  The 
easiest is to use mod_rewrite or URL rewriting.  This is what we did for our 
Terminology Services project to hide those ugly SRU URLs.
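
For what it's worth, the same interception pattern sketched as
Python/WSGI middleware rather than a Java Servlet filter; the /record
path is from the example above, everything else is an assumption:

import re

RECORD = re.compile(r"^/record/(\d+)$")

def cool_uri_middleware(app):
    # Rewrite /record/123456 into the /record?id=123456 form before the
    # wrapped application ever sees the request.
    def wrapper(environ, start_response):
        match = RECORD.match(environ.get("PATH_INFO", ""))
        if match:
            environ["PATH_INFO"] = "/record"
            environ["QUERY_STRING"] = "id=" + match.group(1)
        return app(environ, start_response)
    return wrapper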


Andy.


Re: [CODE4LIB] alpha characters used for field names

2008-06-25 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
> Eric Lease Morgan
> Sent: Wednesday, June 25, 2008 3:21 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] alpha characters used for field names
> 
> Are alpha characters used for field names valid in MARC records?
> 
> When we do dumps of MARC records our ILS often dumps them with FMT and
> CAT field names. So not only do I have glorious 246 fields and 100
> fields but I also have CAT fields and FMT fields. Are these features
> of my ILS -- extensions of the standard -- or really a part of MARC?
> Moreover, does something like Marc4J or MARC::Batch and friends deal
> with these alpha field names correctly?

ISO 2709 allows tags to consist of three ASCII numeric characters or ASCII 
alphabetic characters (uppercase or lowercase, but not both) [1].  MARC-21 only 
uses numeric tags.  However, when I was the OCLC representative for the MARC-XML 
standard, the pattern that it uses for tags is:

  

This change was a request from the RLG representatives who had MARC records in 
their system that contained alphanumeric tags.


Andy.


[1] 


Re: [CODE4LIB] MARC::Record::JSON perl and javascript modules on Google Code

2008-04-23 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On
> Behalf Of Bill Dueber
> Sent: 23 April, 2008 10:48
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] MARC::Record::JSON perl and
> javascript modules on Google Code
>
> Right. I keep a "sequence" number for the fields so they can
> be re-ordered if need be, but it just seemed too much of a
> pain (and edge case) to worry about subfield order.

Actually, you don't need to keep a sequence number; any XML
can be transformed into JSON so it's round-tripable, and since
MARC can be transformed to MARC-XML you can use the same type
of JSON model for representation.
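
For example, one JSON shape that preserves document order by using lists
(a sketch; the record content is made up, not a real MARC record):

import json

record = {
    "leader": "00000nam a2200000 a 4500",
    "fields": [                  # a list, so field order is preserved
        {"tag": "100", "ind1": "1", "ind2": " ",
         "subfields": [["a", "Doe, Jane."]]},   # [code, value] pairs
        {"tag": "245", "ind1": "1", "ind2": "0",
         "subfields": [["a", "An example title /"],
                       ["c", "Jane Doe."]]},
    ],
}

print(json.dumps(record, indent=2))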


Andy.


Re: [CODE4LIB] MARC::Record::JSON perl and javascript modules on Google Code

2008-04-23 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On
> Behalf Of Bill Dueber
> Sent: 23 April, 2008 10:34
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] MARC::Record::JSON perl and javascript
> modules on Google Code
>
> The MARC::Record::JSON data structure itself is *not*
> round-tripable (order of subfields within a field is not
> preserved), but makes for an easy-to-work-with format in
> read-only situations that one would probably be encountering
> within a web page.

Depending upon how you structure your JSON, you could make it
round-tripable where the order is preserved.  There are some
tradeoffs, though.


Andy.


Re: [CODE4LIB] code4lib.org hosting

2007-08-02 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On
> Behalf Of Ross Singer
> Sent: 01 August, 2007 23:31
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] code4lib.org hosting
>
> I don't think anything has been 'decided'.  We had a meeting,
> OSU stepped forward, nobody present objected.
> www.code4lib.org is still 'down'.

Has anybody considered talking with the OCLC WebJunction folks about
hosting www.code4lib.org?  They are already hosting the xml4lib
listserve.  I do realize we are talking about a domain vs. a listserve,
but part of WebJunction's mandate from their Gates grant is community
service for libraries.  The www.code4lib.org seems to fit within the
scope of the public service that they are doing for the xml4lib
listserve.  So maybe this might be another option to look into.


Andy.


Re: [CODE4LIB] worldcat

2007-05-21 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On
> Behalf Of Eric Lease Morgan
> Sent: 21 May, 2007 09:34
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] worldcat
>
> We here at Notre Dame subscribe to (license?) WorldCat, and
> I'm wondering, does it have a Web Services interface/API?

I guess it depends on what you consider a Web Service interface
and API.  Today you create URLs to retrieve XHTML documents.
The details are here:



It's not pretty since you have to scrape the XHTML document for
the information you want and they don't use id attributes on
content to make it easy for you to pull information out of the
XHTML.  However, looking at the roadmap for WorldCat.org, they
have always planned a proper Web Service API for it.  It's still
beta, so some functionality has yet to be delivered.

Unfortunately, I cannot say when a Web Service API will be
delivered since I'm not on the WorldCat.org team and do not
know what their current development priorities are.

Also, if you have some specific use cases in mind for how you
would like to interact with WorldCat.org using Web services,
I'm sure the WorldCat.org folks would like to see them.  You
can send feedback here:





Andy.


Re: [CODE4LIB] OCLC is "us" (was Re: [CODE4LIB] more metadata from xISBN)

2007-05-14 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On
> Behalf Of Jonathan Rochkind
> Sent: 14 May, 2007 10:00
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] OCLC is "us" (was Re: [CODE4LIB] more
> metadata from xISBN)
>
> My understanding, from a number of sources, including
> comments Thom Hickey (I think that was Thom? I actually
> missed his name in my notes) made at the FRBR Implementer's
> Group meeting at ALA Midwinter, is that the algorithm current
> OCLC products use for work-set grouping is not limited to
> that published algorithm, but ends up being quite a bit more
> sophisticated than that. Thom (I think!) also said that
> different OCLC products used slightly different algorithms at
> the moment (for instance, FictionFinder vs. xISBN).  I asked
> at the Implementor's Group meeting in December if OCLC
> planned to share the more sophisticated algorithms and tweaks
> currently being used, and he said he wasn't sure.
>
> It is of course OCLC's right as an organization to decide
> what information to share and when.

It is not a matter of sharing or not.  For a number of projects and
some of OCLC's services we do in fact deviate from the published
algorithm.  The deviations are attempts to improve upon the published
algorithm.  Sometimes when the published algorithm is tweaked it
produces slightly better or worse results.

I don't believe that we have significantly improved upon the
published algorithm yet, and we will continue to experiment in our
research efforts and in the services we deliver to members.  We
published the algorithm so others could use it as a base for their
FRBR activities, and we encourage others to experiment with it and
publish their results, especially where it fails to produce
adequate results.

We are aware of a number of areas where the algorithm does not produce
adequate results, but have not found any resolution to those problematic
areas.  I'm sure that we will publish any significant changes to the
algorithm in an appropriate journal when they happen.


Andy.


Re: [CODE4LIB] OCLC is "us" (was Re: [CODE4LIB] more metadata from xISBN)

2007-05-10 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On
> Behalf Of Jonathan Rochkind
> Sent: 10 May, 2007 10:59
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] OCLC is "us" (was Re: [CODE4LIB] more
> metadata from xISBN)
>
> PS: The more I think about this, the more burned up I actually get.
> Which maybe means I shouldn't post about it, but hey, I've
> never been one for circumspection.
>
> If OCLC is "us", then OCLC will gladly share with us (who are
> in fact "them", right?) their research on workset grouping
> algorithms, and precisely what workset grouping algorithm
> they are using in current implementations of xISBN and other
> services, right? After all, if OCLC is not a vendor, but just
> "us" collectively, why would one part of "us"
> need to keep trade secrets from another part of "us"?  Right?
>
> While OCLC is at it, OCLC could throw in some more
> information on this project, which has apparently been
> consigned to trade secret land since its sole (apparently
> mistaken) public outing:
> http://www.code4lib.org/2006/smith

Actually, with just a little bit of research, you would have
found the following information:



OCLC's FRBR algorithm has been publicly available since sometime
in 2005.  The algorithm was used for xISBN and is being used for
a number of ongoing projects in the Office of Research.
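
For readers who want the flavor of it, here is a toy Python sketch
of the core idea: clustering records on a normalized author/title
key.  The published algorithm is considerably more involved, so
treat this as illustration only, not OCLC's actual method:

# Toy illustration: group bibliographic records into work-sets by a
# normalized author/title key.  The published OCLC algorithm is far
# more sophisticated; this only shows the general shape of the idea.
from collections import defaultdict

def normalize(s):
    # Crude normalization: lowercase and drop non-alphanumerics.
    return "".join(ch for ch in s.lower() if ch.isalnum())

def work_key(record):
    return (normalize(record["author"]), normalize(record["title"]))

def group_worksets(records):
    worksets = defaultdict(list)
    for rec in records:
        worksets[work_key(rec)].append(rec)
    return worksets

records = [
    {"author": "Twain, Mark", "title": "Adventures of Huckleberry Finn"},
    {"author": "TWAIN, MARK.", "title": "adventures of huckleberry finn"},
]
print(len(group_worksets(records)))  # 1: both map to the same work-set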


Andy.


Re: [CODE4LIB] OpenURL XML generation libraries?

2006-10-17 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On
> Behalf Of Jonathan Rochkind
> Sent: 17 October, 2006 12:23
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] OpenURL XML generation libraries?
>
> I sort of asked this before here, and got a few answers, but
> I'm going to try again, more specifically.
>
> Are there any libraries available in any language to generate
> XML for an OpenURL 1.0 SAP2 (XML) request?  [Or to interpret
> such requests, either
> SAP1 or SAP2. But at the moment generating SAP2 XML requests
> is my immediate concern].
>
> Does anyone have any of their own code they would be willing to share?

Is this possibly what you are looking for?
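
In case it helps, here is a rough Python sketch of generating a
Z39.88-2004 (OpenURL 1.0) XML context object.  The element and
namespace names reflect my reading of the ctx and journal schemas,
so verify them against the official XSDs before relying on this:

# Hedged sketch: build an OpenURL 1.0 XML context object with
# ElementTree.  Namespaces/elements follow my reading of the spec's
# ctx and journal schemas; check the official XSDs for the real rules.
import xml.etree.ElementTree as ET

CTX = "info:ofi/fmt:xml:xsd:ctx"
JNL = "info:ofi/fmt:xml:xsd:journal"

root = ET.Element(f"{{{CTX}}}context-objects")
co = ET.SubElement(root, f"{{{CTX}}}context-object",
                   {"version": "Z39.88-2004"})
referent = ET.SubElement(co, f"{{{CTX}}}referent")
by_val = ET.SubElement(referent, f"{{{CTX}}}metadata-by-val")
ET.SubElement(by_val, f"{{{CTX}}}format").text = JNL
journal = ET.SubElement(ET.SubElement(by_val, f"{{{CTX}}}metadata"),
                        f"{{{JNL}}}journal")
# Illustrative citation values only.
for tag, value in [("jtitle", "Code4Lib Journal"),
                   ("atitle", "An example article"),
                   ("issn", "1940-5758"),
                   ("spage", "1")]:
    ET.SubElement(journal, f"{{{JNL}}}{tag}").text = value

print(ET.tostring(root, encoding="unicode"))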




Andy.


Re: [CODE4LIB] java application on a cd

2006-10-13 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On
> Behalf Of Eric Muzzy
> Sent: 13 October, 2006 15:34
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] java application on a cd
>
> Eric:
>
> I'm less familiar with java, but if you're not tied to the
> architecture there may be a few other options; The last

I think the reason Eric probably decided on Java was that he was
trying to build a cross-platform solution.  HTC files would only
work under Windows.  Also, one issue with using *either* .NET or
Java is that both assume the person loading the CD-ROM has the
runtime installed.  So while Java is cross platform, there is no
guarantee that it will run on the target system, although the odds
are quite high that a JRE will already exist there.


Andy.


Re: [CODE4LIB] open worldcat and cover images

2006-10-13 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On
> Behalf Of Jonathan Rochkind
> Sent: 13 October, 2006 14:58
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] open worldcat and cover images
>
> So open worldcat has cover images for lots of books.
>
> The idea occured to me of trying to use these cover images in
> my local catalog. I could use Amazon instead, but Amazon's
> licensing would require me to link to Amazon, and I'd rather
> be linking to open worldcat, if anywhere. [I could also pay a
> vendor lots of money for this same thing, of course]. So does
> anyone know:

> 1) If OCLC would allow such a thing. (I'm guessing they might
> not, sadly).

You would have to read the terms of service for Open WorldCat, but
I suspect that the answer is no, because OCLC licenses the cover
art from another organization.  For a definitive answer, send an
e-mail to OCLC; the worst they can say is no.

> 2) If open worldcat provides any XML/API to make this easier to do.

None that I'm aware of.


Andy.


Re: [CODE4LIB] LC MARC records?

2006-09-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On
> Behalf Of Tim Spalding
> Sent: 05 September, 2006 12:53
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] LC MARC records?
>
> *OCLC has a license around their MARC records, and around
> Dewey. The OCLC business model requires this.

It is my understanding that OCLC *pays* LC several thousand dollars
per year for access to their MARC records...


Andy.


Re: [CODE4LIB] LC MARC records?

2006-09-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On
> Behalf Of Keith Jenkins
> Sent: 05 September, 2006 12:42
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] LC MARC records?
>
> Worldcat.org, as nice as it is, only offers a limited subset
> of bibliographic data.  There's no way to get to the
> underlying MARC record, as far as I can see.  Or am I missing
> something?

People seem to be confused about the difference between Open
WorldCat and WorldCat.org.  That is understandable, since both
names contain "WorldCat."

Open WorldCat is a 4-million-record set that search engines such as
Google and Yahoo get from OCLC.  The reason for this reduced set is
that the search engines couldn't ingest all of WorldCat.  We really
wanted them to ingest the whole thing, but they balked.  In Google's
case, because of how they view resources, many bibliographic records
looked like duplicates, so they told OCLC they didn't want the whole
thing.  So we basically gave them a FRBR-ized view of the most
commonly held records.

Due to a number of factors, one being this limitation on exposing
all of WorldCat through the search engines, OCLC created the
WorldCat.org search portal, which exposes all 80-million-plus
bibliographic records.  So when you go to:
 you get access to all the records.  If
you come in from an Internet search engine, you will only see
the 4 million records they ingested.

Actually, most people didn't realize that even though the
search engines only incorporated 4 million records into their
result lists, people had access to the full 80 million records
through URL manipulation.

It is true that, today, you do not have access to the underlying
MARC record; however, the document returned from WorldCat.org is a
well-formed XHTML document that can be run through an XSLT for use
in a mash-up.  The Open WorldCat team, which is now responsible for
both Open WorldCat and WorldCat.org, is working on an SRU/W
interface to WorldCat.org.  I'm not sure what record formats will
be delivered (e.g., MARC, Dublin Core, etc.), since this is on
their TODO list and that's about all I know.
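
As a sketch of that XHTML-through-XSLT idea in Python (the
stylesheet below is a trivial stand-in, since the actual page
structure isn't documented):

# Minimal sketch: run a fetched XHTML document through an XSLT to
# get mash-up-friendly XML.  The stylesheet is a trivial stand-in;
# a real one would target the actual WorldCat.org markup, and real
# XHTML elements live in the http://www.w3.org/1999/xhtml namespace,
# which the XPath would then need to declare.
from lxml import etree

transform = etree.XSLT(etree.XML("""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <records>
      <!-- Crudely treat every anchor's text as a record title. -->
      <xsl:for-each select="//a">
        <title><xsl:value-of select="."/></title>
      </xsl:for-each>
    </records>
  </xsl:template>
</xsl:stylesheet>"""))

page = etree.XML("<html><body><a>An example title</a></body></html>")
print(str(transform(page)))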


Andy.


Re: [CODE4LIB] LC MARC records?

2006-09-05 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On
> Behalf Of Edward Summers
> Sent: 05 September, 2006 10:50
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] LC MARC records?
>
> On Sep 4, 2006, at 7:30 PM, Tim Spalding wrote:
> > I gather that, although the LC sells the CDs, there are no
> copyright
> > or redistribution restrictions—indeed, that companies often resell
> > them, with minimal value adds. I also gather that libraries
> generally
> > don't get these CDs, but rely on OCLC instead.

Why not use WorldCat.org?  I believe it already contains all of
LC's bibliographic data, plus additional member-input data not
found in the LC records.


Andy.


Re: [CODE4LIB] worldcat

2006-08-22 Thread Houghton,Andrew
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On
> Behalf Of Eric Lease Morgan
> Sent: 22 August, 2006 16:24
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] worldcat
>
> Is there a public Z39.50/SRU/SRW/Web Services interface to
> WorldCat or OpenWorldCat?
>
> I would like to create a simple search engine to query
> "Other's books", and *Cat seems like a great candidate.
>
> Inquiring minds would like to know.

I'm not sure about a Z39.50/SRU/SRW interface to WorldCat, but
you can access WorldCat via URL queries, and the data appears to
come back as an XHTML document.  So... you could hack something
together with a little XSLT to simulate an SRU interface.

Since this isn't documented anywhere, at least nowhere I could
find, a little digging and hacking turned up the following:

URL query:
,
e.g.


Results XPath:
/html/body//table[@id='tableResults']
/html/body//table[@id='tableResults']/tr/td[@class='result']

Next set of results:
,
e.g.
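
To make that concrete, here is a small Python sketch applying those
XPaths.  The search URL is a guessed placeholder, since the actual
links were stripped from this archive:

# Sketch: apply the XPaths above to a results page and wrap the hits
# in minimal XML, faking an SRU-style response.  The search URL is a
# guessed placeholder; the XPaths come from the post above.
from urllib.request import urlopen
from lxml import etree

def search(term):
    url = "http://www.worldcat.org/search?q=" + term  # assumed pattern
    doc = etree.parse(urlopen(url), etree.HTMLParser())
    cells = doc.xpath(
        "/html/body//table[@id='tableResults']/tr/td[@class='result']")
    records = etree.Element("records")
    for td in cells:
        rec = etree.SubElement(records, "record")
        rec.text = " ".join(td.itertext()).strip()
    return etree.tostring(records, pretty_print=True).decode()

print(search("frbr"))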




Andy.

