Re: Net::Z3950 and diacritics [book catalogs]

2004-01-05 Thread Eric Lease Morgan
On 12/16/03 8:57 AM, Eric Lease Morgan [EMAIL PROTECTED] wrote:

 Upon further investigation, it seems that MARC::Batch is not necessarily
 causing my problem with diacritics; instead, the problem may lie in the way
 I am downloading my records using Net::Z3950.

Thank you to everybody who replied to my messages about MARC data and
Net::Z3950.

I must admit, I still don't understand all the issues. It seems there are at
least a couple of character sets that can be used to encode MARC data. The
characters in these sets are not always one byte long (specifically, the
characters with diacritics), and consequently the leaders of my downloaded
MARC records were not always accurate, I think. Again, I still don't
understand all the issues, and the discrepancy is most likely entirely my
fault.

I consider my personal catalog about 80% complete. I have about another 200
books to copy catalog, and I can see a few more enhancements to my
application, but they will not significantly increase the system's
functionality. I consider those enhancements to be featuritis. Using my
Web browser I can catalog about two books per minute.

In any event, the number of book descriptions in my personal catalog
containing diacritics is very small. Tiny. Consequently, my solution was
either to hack my MARC records to remove the diacritics or to skip the
record altogether.
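
For what it is worth, the skip-the-record approach takes only a few lines of
MARC::Batch. Here is a minimal sketch, assuming the downloaded records live
in a file called catalog.mrc (the file names are placeholders of mine):

  #!/usr/bin/perl
  # minimal sketch of the "skip the record" approach: copy only those
  # MARC records whose raw data is pure ASCII, i.e. free of diacritics
  use strict;
  use warnings;
  use MARC::Batch;

  my $batch = MARC::Batch->new('USMARC', 'catalog.mrc');
  open my $out, '>', 'no-diacritics.mrc' or die "no-diacritics.mrc: $!";

  while (my $record = $batch->next) {

      # in MARC-8, any byte above 0x7F belongs to a diacritic or other
      # special character, so its presence flags a record to skip
      if ($record->as_usmarc =~ /[\x80-\xFF]/) {
          warn 'skipping: ', $record->title, "\n";
          next;
      }

      print $out $record->as_usmarc;
  }

  close $out;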

The process of creating my personal catalog was very enlightening. The MARC
records in my catalog are very very similar to the records found in catalogs
across the world. My catalog provides author, title, and subject searching.
It provides Boolean logic, nested queries, and right-hand truncation. The
entire record is free-text searchable. Everything is accessible. The results
can be sorted by author, title, subject, and rank (statistical relevance). A
cool search is a search for cookery:

  http://infomotions.com/books/?cmd=search&query=cookery

Yet, I still find the catalog lacking, and what it lacks are three things:
1) more descriptive summaries like abstracts, 2) qualitative judgments like
reviews and/or the number of uses (popularity), and 3) access to the full
text. These are problems I hope to address in the developing third iteration
of my Alex Catalogue:

  http://infomotions.com/alex2/

My book catalog excels at inventorying my collection. It does a very poor
job of recommending or suggesting which book(s) to use. The solution is not
more powerful search features, nor is it bibliographic instruction. The
solution lies in better, more robust data, as well as access to the full
text. This is not just a problem with my catalog. It is a problem with
online public access catalogs everywhere, but I digress. I'm off topic. All
of this is fodder for my book catalog's About text.

Again, thank you for the input.

-- 
Eric Lease Morgan
University Libraries of Notre Dame



Re: Net::Z3950 and diacritics

2003-12-16 Thread Tajoli Zeno
Hi,

In fact, the question is quite complex to explain, and I'm not sure that I
can explain it well.

At 14.57 16/12/03, you wrote:

 This process works just fine for records that contain no diacritics, but
 when diacritics are in the records, extra characters end up in my saved
 files, like this:

   00901nam  22002651
   ^^^
   45100080005001780080041000250350021000669060045000870
   1000170013204000180014905000180016708200100018512900195245009
   200224260003400316347003504900029003975040026004266340045
   27100021004869910044005079910055005510990029006063118006
   1974041700.0731207s1967nyuabf   b000 0beng
   9(DLC)   67029856  a7bcbccorignewdueocipf19gy-gencatlg
   a   67029856   aDLCcDLCdDLC00aND588.D9bR8500a759.31
   aRussell, Francis,d1910-14aThe world of Dˆ®urer,
                                           ^^^
   1471-1528,cby Francis Russell and the editors of Time-Life
   Books.  aNew York,bTime, inc.c[1967]  a183 p.billus.,
   maps, col. plates.c32 cm.0 aTime-Life library of art
   aBibliography: p. 177.10aDˆ®urer, Albrecht,d1471-1528.2
                            ^^^
   aTime-Life Books.  bc-GenCollhND588.D9iR85tCopy 1wBOOKS
   bc-GenCollhND588.D9iR85p00034015107tCopy 2wCCF
   arussell-world-1071495663

 Notice how Dürer got munged into Dˆ®urer, twice, and consequently the
 record length is not 901 but 903 instead.

 Some people say I must be sure to request a specific character set from the
 LOC when downloading my MARC records, specifically MARC-8 or MARC-UCS. Which
 one of these character sets do I want, and how do I tell the remote database
 which one I want?
1) When you call LOC without requesting a specific character set, you
receive data in the MARC-8 character set.

2) In the MARC-8 character set, a letter like è [e grave] is made with TWO
bytes: one for the sign [the grave accent] and one for the letter [the
letter e].

3) In the leader, positions 0-4 give the number of characters, NOT the
number of bytes. In your record there are 901 characters and 903 bytes.

In fact, the length function of Perl counts the number of bytes. The best
option, for now, is to use a character set where one character is always
one byte, for example ISO 8859-1.

A good place to understand character sets is http://www.gymel.com/charsets/
[in German].
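
To make points 2 and 3 concrete, here is a rough sketch (mine, not a part of
any module) that compares the record length claimed in leader positions 0-4
with the byte count Perl reports. The file name download.mrc is a
placeholder:

  #!/usr/bin/perl
  # compare the record length claimed in leader positions 0-4 with
  # the number of bytes Perl actually sees; in MARC-8 an accented
  # letter such as è is two bytes (combining accent + base letter),
  # so the two numbers can disagree, as in the 901 vs. 903 example
  use strict;
  use warnings;

  local $/ = "\x1D";    # 0x1D is the MARC record terminator
  open my $in, '<', 'download.mrc' or die "download.mrc: $!";

  while (my $raw = <$in>) {
      my $claimed = substr($raw, 0, 5);    # leader bytes 0-4, e.g. "00901"
      my $actual  = length($raw);          # length() counts bytes here
      printf "leader says %d, record is actually %d bytes%s\n",
          $claimed, $actual, $claimed == $actual ? '' : '  <-- mismatch';
  }

  close $in;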

Bye

Zeno Tajoli
[EMAIL PROTECTED]
CILEA - Segrate (MI)
02 / 26995321


Re: Net::Z3950 and diacritics

2003-12-16 Thread Ed Summers
On Tue, Dec 16, 2003 at 03:52:56PM +0100, Tajoli Zeno wrote:
 1) When you call LOC without requesting a specific character set, you
 receive data in the MARC-8 character set.

 2) In the MARC-8 character set, a letter like è [e grave] is made with TWO
 bytes: one for the sign [the grave accent] and one for the letter [the
 letter e].

 3) In the leader, positions 0-4 give the number of characters, NOT the
 number of bytes. In your record there are 901 characters and 903 bytes.

 In fact, the length function of Perl counts the number of bytes. The best
 option, for now, is to use a character set where one character is always
 one byte, for example ISO 8859-1.

While this is certainly part of the answer, we still don't know why the
record length is off. The way I see it, there are two possibilities:

1. Net::Z3950 is doing on-the-fly conversion of MARC-8 to Latin1
2. LC's Z39.50 server is emitting the records that way, and not updating the 
   record length.

I guess one way to test which one is true would be to query another Z39.50
server for the same record and see if the same problem exists, in which
case option 1 is probably the culprit.
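
In case it helps, here is a rough sketch of that test with Net::Z3950. The
second host and database are placeholders for whatever other server carries
the record, and the query (Bib-1 attribute 1=9, LC card number) is my
assumption about how to find it:

  #!/usr/bin/perl
  # fetch the same record from two Z39.50 servers and compare the
  # length claimed in the leader with the number of bytes received;
  # the second target below is a placeholder, not a real server
  use strict;
  use warnings;
  use Net::Z3950;

  for my $target (['z3950.loc.gov', 7090, 'Voyager'],
                  ['z3950.example.org', 210, 'Default']) {

      my ($host, $port, $db) = @$target;
      my $conn = Net::Z3950::Connection->new($host, $port,
          databaseName          => $db,
          preferredRecordSyntax => Net::Z3950::RecordSyntax::USMARC)
          or next;

      my $rs = $conn->search('@attr 1=9 67029856') or next;    # the LCCN
      my $raw = $rs->record(1)->rawdata;
      printf "%s: leader says %s, received %d bytes\n",
          $host, substr($raw, 0, 5), length($raw);

      $conn->close;
  }

If the mismatch shows up no matter which server the record comes from, the
conversion is probably happening on our side of the wire.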

//Ed