AW: [librecat-dev] A common MARC record path language

2014-02-23 Thread Klee, Carsten
Hi Thomas and Patrick!

Thank you both for bringing the discussion forward. I must admit that I'm 
having some problems following here. I have read your mails multiple times, 
really trying to understand your requirements. After reading this [1], I hope 
I'm getting closer.

I just want to sum up what I think I've understood so far. Please correct me if 
I'm wrong.

-- With cataloging-based delimiters (punctuation), the content of the 
subfields carries some inner semantics. E.g. "=$b" in field 245 means 
something different than ":$b".

-- There may be data you want to get as a whole that is spread over multiple 
subfields. This information cannot be described by a range of subfields, 
but only by the punctuation that closes it. E.g. in the field

245 00$aHeritage Books archives.$pUnderwood biographical 
dictionary.$nVolumes 1 & 2 revised$h[electronic resource] /$cLaverne 
Galeener-Moore.

the data you want to get is

Heritage Books archives. Underwood biographical dictionary. Volumes 1 & 2 
revised [electronic resource]

Is this what you mean when you say something like "Get me everything from field 
XXX until you hit Y"? I guess so.

-- Therefore the order of subfields is crucial. While MARCspec allows subfields 
to be stated in any order, a result should preserve the order in which the 
subfields occur in the field.

-- Some fields are linked through specific subfields. There may be some data 
you want to get depending on a link from other fields. I'm not sure if I have 
an example for this. Maybe you could provide one.

Finally I've found a nice example on the MARC21 website [2] (section $i - 
Relationship information). My question is whether you want to achieve 
something like this:

Source:
100  1# $aVerdi, Giuseppe, $d1813-1901.
245  10 $aOtello :$bin full score /$cGiuseppe Verdi.
700  1# $iLibretto based on (work) $aShakespeare, William, $d1564-1616. 
$tOthello.
787  08 $ireproduction of (manifestation) $aVerdi, Giuseppe, 1813-1901. 
$tOtello.$d Milano: Ricordi, c1913

Result (user display):
Verdi, Giuseppe, 1813-1901. Otello : in full score / Giuseppe Verdi
Reproduction of Verdi, Giuseppe, 1813-1901. Otello. Milano : Ricordi, c1913
Libretto based on Shakespeare, William, 1564-1616. Othello.

Is this something you want to express within a MARCspec?
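
The second and third points of the summary above (data spread over several subfields, closed by punctuation, with field order preserved) can be sketched roughly as follows. This is only an illustration under assumptions: the list-of-pairs representation of field 245 and the function name `take_until_delimiter` are made up here and are not part of MARCspec or Catmandu.

```python
# Hypothetical in-memory form of the 245 field from the example above:
# (subfield code, value) pairs in the order they occur in the field.
field_245 = [
    ("a", "Heritage Books archives."),
    ("p", "Underwood biographical dictionary."),
    ("n", "Volumes 1 & 2 revised"),
    ("h", "[electronic resource] /"),
    ("c", "Laverne Galeener-Moore."),
]


def take_until_delimiter(subfields, delimiter):
    """Join subfield values in field order until one ends with the delimiter."""
    parts = []
    for _code, value in subfields:
        if value.endswith(" " + delimiter):
            # Keep the text in front of the delimiter, then stop.
            parts.append(value[: -len(delimiter) - 1])
            break
        parts.append(value)
    return " ".join(parts)


title = take_until_delimiter(field_245, "/")
print(title)
```

Note that the function never looks at the subfield codes, only at their order and the trailing punctuation, which is exactly why such a rule is hard to express as a list of subfields.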

Anyhow, a collection of use cases is a great idea. That would help to discover 
the tasks a MARCspec should cope with. But I really need your help here. Maybe 
a wider audience would also be helpful?
Cheers!

Carsten

[1] 
[2] 
___
Carsten Klee
Abt. Überregionale Bibliographische Dienste IIE
Staatsbibliothek zu Berlin - Preußischer Kulturbesitz

Fon:  +49 30 266-43 44 02

> -Original Message-
> From: Thomas Berger [mailto:t...@gymel.com]
> Sent: Wednesday, 19 February 2014 23:06
> To: Klee, Carsten; 'Patrick Hochstenbach'
> Cc: v...@gbv.de; librecat-...@mail.librecat.org; perl4lib@perl.org
> Subject: Re: [librecat-dev] A common MARC record path language
> 
> Hi Carsten,
> 
> > I think the whole problem lies in the limited expressivity of strings.
> > MARCspec is pretty close to XPath in its approach, but without regular
> > expressions and functions like first(), last() etc. But even with XPath it
> > would be pretty hard to get the character before a subfield in a MARCXML
> > file.
> >
> > The only solution I can think of is using regular expressions. And I'm not
> > convinced that bringing this into MARCspec is a good idea. As I already
> > mentioned in the spec, MARCspec is not independent from the application
> > using MARCspec. Taking regular expressions into MARCspec wouldn't make the
> > application more usable, but would blow up the specification.
> 
> Agreed, therefore regular expressions or other /general/ mechanisms
> should not be the way to go (for specifying MARCspecs - specific
> implementations may realize it using a regexp implementation at hand)
> 
> Thus, yes, the limited expressiveness of strings demands that the most
> typical and most important "operations" on MARC records be
> expressible. But if it's too limited (say it could only extract fields
> or has blind spots - parts of record data which cannot be accessed at all)
> it wouldn't be of any use.
> 
> Thus MARCspecs need a convincing approach to the peculiarities of MARC
> records:
> 
> Subfields are not always data elements in a proper sense, sometimes
> they are just marks interspersed into the field content.
> 
> And as Patrick pointed out there is the presence of non-MARC delimiters
> (markup) which is crucial for processing of some (sub)fields.
> 
> Many fields contain "ensembles" of subfields with one nature, accompanied
> by other, more data-like subfields of a different nature:
> 
> - Most subfields in 700 are a simple copy of some (hypothetical)
>

AW: [librecat-dev] A common MARC record path language

2014-02-19 Thread Klee, Carsten
Hi Thomas and Patrick!

I think the whole problem lies in the limited expressivity of strings. MARCspec 
is pretty close to XPath in its approach, but without regular expressions 
and functions like first(), last() etc. But even with XPath it would be pretty 
hard to get the character before a subfield in a MARCXML file.

The only solution I can think of is using regular expressions. And I'm not 
convinced that bringing this into MARCspec is a good idea. As I already 
mentioned in the spec, MARCspec is not independent from the application using 
MARCspec. Taking regular expressions into MARCspec wouldn't make the 
application more usable, but would blow up the specification. 

One example:

The data in field 245 is:

"$aConcerto per piano n. 21, K 467$h[sound recording] /$cW.A. Mozart"

The desired result is (rule: take everything from 245 until the string ' /$' 
appears):

"Concerto per piano n. 21, K 467 [sound recording]"

Imagine a MARCspec with a regular expression. // pseudo code coming up!

marcspec = "245.match(/(.*)\s\/\$/)"
titleData = getMARCspec(record, marcspec)
print titleData[1]
// should result in "$aConcerto per piano n. 21, K 467$h[sound recording]"

Now pretty much the same, but without the regular expression in the MARCspec.

marcspec = "245"
titleData = getMARCspec(record, marcspec).match(/(.*)\s\/\$/)
print titleData[1]
// should result in "$aConcerto per piano n. 21, K 467$h[sound recording]"

As you see, nothing is gained here.

But an application could provide a special function like

function takeEverythingFromSpecUntilYouHitBeforeSubfield(marcspec, hitWhat, record)
{
    // get the data before the / or = or else
    regex = new RegExp("(.*)\\s\\" + hitWhat + "\\$")
    data = getMARCspec(record, marcspec).match(regex)[1]

    // now split on subfield codes; the first element is empty
    // because the data starts with a subfield code
    dataSplit = data.split(/\$[a-z0-9]/)

    // loop everything but the last element into result
    result = ""
    for (i = 1; i < dataSplit.length - 1; i++)
    {
        result += dataSplit[i] + " "
    }
    // append the last element without a trailing space
    result += dataSplit[dataSplit.length - 1]

    return result
}

In Catmandu or elsewhere the user calls the function

takeEverythingFromSpecUntilYouHitBeforeSubfield("245","/",record)

--> this should result in the desired "Concerto per piano n. 21, K 467 [sound 
recording]".
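
For comparison, here is a minimal runnable sketch of the same helper in Python. It assumes the field has already been fetched as a single string with "$" subfield markers (the getMARCspec step above is left out), and the name `take_until_before_subfield` is made up for this illustration.

```python
import re


def take_until_before_subfield(field_data, hit_what):
    """Return the field text up to the delimiter that directly precedes a subfield."""
    # Capture everything in front of " <delimiter>$", e.g. in front of " /$c".
    match = re.match(r"(.*)\s" + re.escape(hit_what) + r"\$", field_data)
    if match is None:
        return None
    # Split on subfield markers such as "$a"; the first piece is empty
    # because the data starts with a marker, so empty pieces are dropped.
    pieces = re.split(r"\$[a-z0-9]", match.group(1))
    return " ".join(piece for piece in pieces if piece)


field_245 = "$aConcerto per piano n. 21, K 467$h[sound recording] /$cW.A. Mozart"
print(take_until_before_subfield(field_245, "/"))
```

The regular expression lives entirely in the application function here; the MARCspec itself would still only be "245".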

If there is any other approach you can think of, please make a proposal or 
start a substantial discussion here. Otherwise I can't see any option 
for solving this problem in MARCspec.

Cheers!

Carsten
___
Carsten Klee
Abt. Überregionale Bibliographische Dienste IIE
Staatsbibliothek zu Berlin - Preußischer Kulturbesitz

Fon:  +49 30 266-43 44 02

> -Original Message-
> From: Thomas Berger [mailto:t...@gymel.com]
> Sent: Wednesday, 19 February 2014 01:04
> To: Klee, Carsten; 'Patrick Hochstenbach'
> Cc: v...@gbv.de; librecat-...@mail.librecat.org; perl4lib@perl.org
> Subject: Re: [librecat-dev] A common MARC record path language
> 
> On 18.02.2014 17:47, Klee, Carsten wrote:
> 
> > I understand that there is MARC data combined with cataloging rules. We
> > don't use this approach within our MARC. So I'm not really aware of the
> > problems involved.
> 
> "Your" MARC however will be very much interested in "/" (or "=") as the
> first character of some subfield in 245 if I recall correctly. Not such a
> big difference I would think. But maybe a slight complication of the
> matter, since MARCspec would have to cope with both approaches...
> 
> Thomas Berger


AW: AW: [librecat-dev] A common MARC record path language

2014-02-18 Thread Klee, Carsten
Hi Patrick,

I'm sorry for my even later reply. MARCspec [1] has developed a lot since your 
mail. But unfortunately I haven't found a solution for the issue you described. 
Honestly, I'm not sure I understand the problem you want to solve here.

I understand that there is MARC data combined with cataloging rules. We don't 
use this approach within our MARC. So I'm not really aware of the problems 
involved.

You mentioned the typical case “Take everything from the 245 until you hit the 
first / before a subfield”. I thought about this, but didn't even come close 
to a way of expressing it in MARCspec. How is this solved in Catmandu now?

If you have any suggestions on how this could be expressed in a string, please 
give me a hint.

The only thing I can imagine is a reference to a character within a subfield. 
Something like

245$a[0]/-1

could be read as "A reference to the last character of the first subfield 'a' 
of field 245". Then you could check if the referenced character is "/". But I 
think that doesn't solve your problem, right?
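
To make the idea of a character reference concrete, here is a rough sketch. Everything in it is an assumption for illustration: the record layout, the function name `char_reference`, and the way "245$a[0]/-1" is broken into arguments are not taken from the MARCspec specification.

```python
# Hypothetical record: tag -> (subfield code, value) pairs of one field.
record = {
    "245": [
        ("a", "Concerto per piano n. 21, K 467"),
        ("h", "[sound recording] /"),
        ("c", "W.A. Mozart"),
    ],
}


def char_reference(record, tag, code, subfield_index, position):
    """Return one character of the nth subfield with the given code, or None."""
    values = [value for c, value in record.get(tag, []) if c == code]
    if not -len(values) <= subfield_index < len(values):
        return None
    value = values[subfield_index]
    if not -len(value) <= position < len(value):
        return None
    return value[position]


# "245$a[0]/-1": last character of the first subfield "a" of field 245.
last_char = char_reference(record, "245", "a", 0, -1)
print(last_char)          # 7
print(last_char == "/")   # False: here the "/" sits at the end of $h, not $a
```

With this record the check fails for $a, because the delimiter belongs to whichever subfield happens to precede $c. A fixed character reference would have to know that subfield in advance.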

However, I would be very glad if MARCspec were adopted by Catmandu. If you're 
interested, I've written some algorithm rules for MARCspec parsers [2]. They 
are very comprehensive. Maybe there is a smarter algorithm, but this might give 
you a starting point.

Cheers! 

Carsten

[1] <http://cklee.github.io/marc-spec/marc-spec.html>
[2] <https://github.com/cKlee/marc-spec/blob/master/marc-spec-parser-rules.md>
___
Carsten Klee
Abt. Überregionale Bibliographische Dienste IIE
Staatsbibliothek zu Berlin – Preußischer Kulturbesitz

Fon:  +49 30 266-43 44 02


> -Original Message-
> From: Patrick Hochstenbach [mailto:patrick.hochstenb...@ugent.be]
> Sent: Tuesday, 21 January 2014 09:57
> To: Klee, Carsten; v...@gbv.de; librecat-...@mail.librecat.org;
> perl4lib@perl.org
> Subject: Re: AW: [librecat-dev] A common MARC record path language
> 
> Hi Carsten
> 
> Apologies for the late reply, it took a while to get the system booted
> after winter vacations.
> 
> You are right in the discussion about which parts should be specified by a
> MARCspec language and which parts should be implemented as operations on
> the nodes found. I gave the examples not as a hint for the implementation
> language (e.g. whether it requires regular expressions or not) but as
> examples of MARC in the wild (non-standard tags) and MARC combined with
> cataloging rules (where subfields and characters in front of a subfield
> have a special meaning).
> 
> In daily work I often encounter mapping rules which involve these special
> subfield cases (“Take everything from the 245 until you hit the first /
> before a subfield”). These things can’t easily be expressed (can they?) in
> XPath when using XSLT or in MARCspec when using tools like Catmandu, but
> they are very common and can be shared across tools. I think these would be
> candidates to formalise.
> 
> 
> Cheers
> Patrick
> 
> On 06/01/14 16:33, "Klee, Carsten"  wrote:
> 
> >
> >On the other hand I could imagine something like "100[0]" for the first
> >100 field (author) and "100[1]" for the second and so on. But what
> >about repeatable subfields? Maybe someone requires the first subfield "a"
> >of the second 100 field. Besides, the characters "[" and "]" are also
> >valid subfield codes (see [2]).
> >
> >With substrings it is more complicated. I only could imagine using
> >regular expressions. Maybe something like 245a[Œ\s(.*)]_10. But for
> >usability reasons this might be better left to the applications. Isn't
> >there something in Catmandu like
> >marc_map('245','my.title', -substring-after => 'Π'); ??
> >
> >Maybe you have another solution for that?
> >
> >Another issue I suspect with your last example under
> >https://metacpan.org/pod/Catmandu::Fix::marc_map
> >
> ># Copy all 100 subfields except the digits to the 'author' field
> >marc_map('100^0123456789','author');
> >
> >In the current MARCspec this would be interpreted as "a reference to
> >subfields ^, 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 of field 100". This is
> >because "^" is a valid subfield code (see [2]).
> >
> >So far... I would be happy to read more comments on this.
> >
> >Cheers!
> >
> >Carsten
> >
> >
> >[1] <https://github.com/cKlee/marc-spec/issues>
> >[2] <http://www.loc.gov/marc/specifications/specrecstruc.html#varifields>
> >___
> >Carst

Re: AW: [librecat-dev] A common MARC record path language

2014-01-21 Thread Patrick Hochstenbach
Hi Carsten

Apologies for the late reply, it took a while to get the system booted
after winter vacations.

You are right in the discussion about which parts should be specified by a
MARCspec language and which parts should be implemented as operations on
the nodes found. I gave the examples not as a hint for the implementation
language (e.g. whether it requires regular expressions or not) but as
examples of MARC in the wild (non-standard tags) and MARC combined with
cataloging rules (where subfields and characters in front of a subfield
have a special meaning).

In daily work I often encounter mapping rules which involve these special
subfield cases (“Take everything from the 245 until you hit the first /
before a subfield”). These things can’t easily be expressed (can they?) in
XPath when using XSLT or in MARCspec when using tools like Catmandu, but
they are very common and can be shared across tools. I think these would be
candidates to formalise.


Cheers
Patrick

On 06/01/14 16:33, "Klee, Carsten"  wrote:

>
>On the other hand I could imagine something like "100[0]" for the first
>100 field (author) and "100[1]" for the second and so on. But what
>about repeatable subfields? Maybe someone requires the first subfield "a"
>of the second 100 field. Besides, the characters "[" and "]" are also
>valid subfield codes (see [2]).
>
>With substrings it is more complicated. I only could imagine using
>regular expressions. Maybe something like 245a[Œ\s(.*)]_10. But for
>usability reasons this might be better left to the applications. Isn't
>there something in Catmandu like
>marc_map('245','my.title', -substring-after => 'Π'); ??
>
>Maybe you have another solution for that?
>
>Another issue I suspect with your last example under
>https://metacpan.org/pod/Catmandu::Fix::marc_map
>
># Copy all 100 subfields except the digits to the 'author' field
>marc_map('100^0123456789','author');
>
>In the current MARCspec this would be interpreted as "a reference to
>subfields ^, 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 of field 100". This is
>because "^" is a valid subfield code (see [2]).
>
>So far... I would be happy to read more comments on this.
>
>Cheers!
>
>Carsten
> 
>
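>[1] https://github.com/cKlee/marc-spec/issues
>[2] http://www.loc.gov/marc/specifications/specrecstruc.html#varifields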
>___
>Carsten Klee
>Abt. Überregionale Bibliographische Dienste IIE
>Staatsbibliothek zu Berlin – Preußischer Kulturbesitz
>
>Fon:  +49 30 266-43 44 02
>
>> -Original Message-
>> From: Patrick Hochstenbach [mailto:patrick.hochstenb...@ugent.be]
>> Sent: Friday, 20 December 2013 14:06
>> To: v...@gbv.de; librecat-...@mail.librecat.org; perl4lib@perl.org
>> Cc: Klee, Carsten
>> Subject: Re: [librecat-dev] A common MARC record path language
>> 
>> Hi
>> 
>> Thanks for this initiative to formalise the path language for MARC
>> records. In Catmandu our path language is better described at:
>> https://metacpan.org/pod/Catmandu::Fix::marc_map. It would be an easy fix
>> for us to follow Carsten's MARC spec rules and I will gladly implement it
>> for our community.
>> 
>> We see these types of MARC paths in programming libraries such as the
>> projects mentioned below but also in products like XSLT, SolrMarc,
>> ILS-vendors who need them to define how to index marc, standardisation
>> bodies like e.g. that provide mapping rules (e.g.
>> http://www.loc.gov/standards/mods/mods-mapping.html). I tried to make a
>> small roundup in the past of these projects but it would be great to have
>> a more extensive look at all current practices.
>> 
>> In our Catmandu project we found that XPaths are too verbose for our
>> librarians to interpret and in practice tied to XSLT programming, which
>> requires quite some programming skills to read and interpret.
>> 
>> Our paths are very much simplified but still seem to lack some things that
>> are available in the MARC data model which would be great to have
>> available in the MARCspec syntax:
>> 
>>  - Notion of pointing to the first item (first author)
>>  - Supporting locally defined MARC (sub)fields (e.g. Ex Libris exports
>> contain all kinds of Z30, CAT, etc. fields)
>>  - Support for pointing to subfields that follow a specific character
>> (e.g. in titles I would like to point to everything after the '/' in a
>> 245 field).
>> 
>> Cheers and have a nice holiday
>> 
>> Patrick
>> 
>> 
>> On 19/12/13 13:16, "Jakob Voß"  wrote:
>> 
>> >Hi,
>> >
>> >Carsten Klee specified a simple path language for MARC records, called
>> >"MARC spec". In short it is a formal syntax to refer to selected parts
>> >of a MARC record (similar to XPath for XML):
>> >
>> >http://collidoscope.de/lld/marcspec-as-string.html
>> >http://cklee.github.io/marc-spec/marc-spec.html#examples
>> >
>> >Similar languages have been invented before but not with a strict
>> >specification, as far as I know. For instance the perl Catmandu::MARC
>> >supports references to MARC fields:
>> >
>> >https

AW: [librecat-dev] A common MARC record path language

2014-01-06 Thread Klee, Carsten
Hi Patrick! Hi everyone!

Thanks for looking into MARCspec. I opened some GitHub issues [1] concerning 
your enhancement requests. Let me just give you some thoughts on these:

>  - Supporting locally defined MARC (sub)fields (e.g. Ex Libris exports
> contain all kinds of Z30, CAT, etc. fields)

I think this should be supported. The only problem in the current spec is that 
the character "X" is used as a wildcard in the field tag. If someone defines a 
local field tag like "XYZ", it gets interpreted as "all fields ending with 
YZ". A possible solution is to define another wildcard character like "*".
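
The wildcard clash can be illustrated with a small sketch. It assumes that the wildcard stands for exactly one character position in the tag; the function name `tag_matches` and the sample tags are made up for this illustration.

```python
def tag_matches(pattern, tag, wildcard="X"):
    """Return True if the tag matches the pattern, position by position."""
    if len(pattern) != len(tag):
        return False
    return all(p == wildcard or p == t for p, t in zip(pattern, tag))


tags = ["XYZ", "AYZ", "Z30", "245"]

# With "X" as the wildcard, the local tag "XYZ" cannot be addressed on its own.
print([t for t in tags if tag_matches("XYZ", t)])                 # ['XYZ', 'AYZ']

# With "*" as the wildcard, "XYZ" is a literal tag and "*YZ" is the pattern.
print([t for t in tags if tag_matches("XYZ", t, wildcard="*")])   # ['XYZ']
print([t for t in tags if tag_matches("*YZ", t, wildcard="*")])   # ['XYZ', 'AYZ']
```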

>  - Notion of pointing to the first item (first author)
and
>  - Support for pointing to subfields that follow a specific character
> (e.g. in titles I would like to point to everything after the '/' in a 245
> field).

I understand that both are desired very often. But I think that this is maybe a 
little bit out of scope. Since XSLT provides these features and many more like 
"the last element", "the X element", "substring-after", "substring-before" 
etc., MARCspec should concentrate on the core MARC data model (fields, 
subfields, character positions) and leave the rest to the post-processing done 
by the applications.

If MARCspec provided these functions, it would quickly become very 
complicated. And it might conflict with data processing functions already 
defined in the various applications.

On the other hand I could imagine something like "100[0]" for the first 100 
field (author) and "100[1]" for the second and so on. But what about 
repeatable subfields? Maybe someone requires the first subfield "a" of the 
second 100 field. Besides, the characters "[" and "]" are also valid subfield 
codes (see [2]).

With substrings it is more complicated. I could only imagine using regular 
expressions. Maybe something like 245a[Œ\s(.*)]_10. But for usability reasons 
this might be better left to the applications. Isn't there something in 
Catmandu like 
marc_map('245','my.title', -substring-after => 'Π'); ??

Maybe you have another solution for that?

I suspect another issue with your last example under 
https://metacpan.org/pod/Catmandu::Fix::marc_map 

# Copy all 100 subfields except the digits to the 'author' field
marc_map('100^0123456789','author');

In the current MARCspec this would be interpreted as "a reference to subfields 
^, 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 of field 100". This is because "^" is a 
valid subfield code (see [2]).
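
The two readings of "100^0123456789" can be put side by side in a small sketch. Both functions are hypothetical: `select_literal` follows the interpretation described above, and `select_negated` only mimics what the comment in the Catmandu documentation example says ("all subfields except the digits"); neither is the actual parser of MARCspec or Catmandu. The "0" subfield in the sample data is invented for illustration.

```python
# Field 100 as (subfield code, value) pairs; the "0" subfield is made up.
field_100 = [
    ("a", "Verdi, Giuseppe,"),
    ("d", "1813-1901."),
    ("0", "hypothetical-identifier"),
]


def select_literal(subfields, spec):
    """Read every character after the 3-character tag as a subfield code."""
    codes = set(spec[3:])
    return [value for code, value in subfields if code in codes]


def select_negated(subfields, spec):
    """Treat a leading "^" after the tag as "all subfields except these"."""
    codes = spec[3:]
    if codes.startswith("^"):
        excluded = set(codes[1:])
        return [value for code, value in subfields if code not in excluded]
    return [value for code, value in subfields if code in set(codes)]


spec = "100^0123456789"
print(select_literal(field_100, spec))   # ['hypothetical-identifier']
print(select_negated(field_100, spec))   # ['Verdi, Giuseppe,', '1813-1901.']
```

The same string yields opposite selections, which is why the meaning of "^" would have to be settled before the two syntaxes can be aligned.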

So much for now. I would be happy to read more comments on this.

Cheers!

Carsten
 

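[1] https://github.com/cKlee/marc-spec/issues
[2] http://www.loc.gov/marc/specifications/specrecstruc.html#varifields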
___
Carsten Klee
Abt. Überregionale Bibliographische Dienste IIE
Staatsbibliothek zu Berlin – Preußischer Kulturbesitz

Fon:  +49 30 266-43 44 02

> -Original Message-
> From: Patrick Hochstenbach [mailto:patrick.hochstenb...@ugent.be]
> Sent: Friday, 20 December 2013 14:06
> To: v...@gbv.de; librecat-...@mail.librecat.org; perl4lib@perl.org
> Cc: Klee, Carsten
> Subject: Re: [librecat-dev] A common MARC record path language
> 
> Hi
> 
> Thanks for this initiative to formalise the path language for MARC
> records. In Catmandu our path language is better described at:
> https://metacpan.org/pod/Catmandu::Fix::marc_map. It would be an easy fix
> for us to follow Carsten's MARC spec rules and I will gladly implement it
> for our community.
> 
> We see these types of MARC paths in programming libraries such as the
> projects mentioned below but also in products like XSLT, SolrMarc,
> ILS-vendors who need them to define how to index marc, standardisation
> bodies like e.g. that provide mapping rules (e.g.
> http://www.loc.gov/standards/mods/mods-mapping.html). I tried to make a
> small roundup in the past of these projects but it would be great to have
> a more extensive look at all current practices.
> 
> In our Catmandu project we found that XPaths are too verbose for our
> librarians to interpret and in practice tied to XSLT programming, which
> requires quite some programming skills to read and interpret.
> 
> Our paths are very much simplified but still seem to lack some things that
> are available in the MARC data model which would be great to have
> available in the MARCspec syntax:
> 
>  - Notion of pointing to the first item (first author)
>  - Supporting locally defined MARC (sub)fields (e.g. Ex Libris exports
> contain all kinds of Z30, CAT, etc. fields)
>  - Support for pointing to subfields that follow a specific character
> (e.g. in titles I would like to point to everything after the '/' in a 245
> field).
> 
> Cheers and have a nice holiday
> 
> Patrick
> 
> 
> On 19/12/13 13:16, "Jakob Voß"  wrote:
> 
> >Hi,
> >
> >Carsten Klee specified a simple path language for MARC records, called
> >"MARC spec". In short it is a formal syntax to refer to selected parts
> >of a MARC record (similar