Re: Net::Z3950 and diacritics [book catalogs]
On 12/16/03 8:57 AM, Eric Lease Morgan [EMAIL PROTECTED] wrote:

> Upon further investigation, it seems that MARC::Batch is not
> necessarily causing my problem with diacritics; instead, the problem
> may lie in the way I am downloading my records using Net::Z3950.

Thank you to everybody who replied to my messages about MARC data and Net::Z3950. I must admit, I still don't understand all the issues. It seems there are at least a couple of character sets that can be used to encode MARC data. The characters in these sets are not always 1 byte long (specifically the characters with diacritics), and consequently the leader of my downloaded MARC records was not always accurate, I think. Again, I still don't understand all the issues, and the discrepancy is most likely entirely my fault.

I consider my personal catalog about 80% complete. I have about another 200 books to copy catalog, and I can see a few more enhancements to my application, but they will not significantly increase the system's functionality. I consider those enhancements to be featuritis. Using my Web browser I can catalog about two books per minute.

In any event, the number of book descriptions from my personal catalog containing diacritics is very small. Tiny. Consequently, my solution was to either hack my MARC records to remove the diacritic or skip the inclusion of the record altogether.

The process of creating my personal catalog was very enlightening. The MARC records in my catalog are very similar to the records found in catalogs across the world. My catalog provides author, title, and subject searching. It provides Boolean logic, nested queries, and right-hand truncation. The entire record is free-text searchable. Everything is accessible. The results can be sorted by author, title, subject, and rank (statistical relevance).
A cool search is a search for cookery: http://infomotions.com/books/?cmd=search&query=cookery

Yet, I still find the catalog lacking, and what it is lacking is three things: 1) more descriptive summaries like abstracts, 2) qualitative judgments like reviews and/or the number of uses (popularity), and 3) access to the full text. These are problems I hope to address in my developing third iteration of my Alex Catalogue: http://infomotions.com/alex2/

My book catalog excels at inventorying my collection. It does a very poor job at recommending/suggesting what book(s) to use. The solution is not more powerful search features, nor is it bibliographic instruction. The solution lies in better, more robust data, as well as access to the full text. This is not just a problem with my catalog. It is a problem with online public access catalogs everywhere, but I deviate. I'm off topic. All of this is fodder for my book catalog's About text.

Again, thank you for the input.

--
Eric Lease Morgan
University Libraries of Notre Dame
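The "remove the diacritic or skip the record" workaround described above can be sketched roughly as follows. This is a hedged illustration, not Eric's actual code (and it uses Python rather than his Perl): it simply treats any byte outside 7-bit ASCII as part of a diacritic and drops the whole record, since stripping individual bytes in place would also require recomputing the leader's record length and the directory offsets.

```python
# Minimal sketch (not the original code) of the "skip records with
# diacritics" workaround: any byte above 0x7F is treated as part of a
# diacritic or multibyte character, and the record is dropped.

def has_non_ascii(record: bytes) -> bool:
    """True if the raw MARC record contains any byte above 0x7F."""
    return any(b > 0x7F for b in record)

def keep_plain_records(records):
    """Yield only records that are pure 7-bit ASCII."""
    return (rec for rec in records if not has_non_ascii(rec))

# Fabricated fragments for illustration only.
plain = b"00901nam a2200265 a 4500...Russell, Francis..."
accented = b"00903nam...D\xe8urer..."  # 0xE8: a MARC-8 combining mark

print(has_non_ascii(plain))     # False
print(has_non_ascii(accented))  # True
```

Skipping rather than stripping keeps the leader and directory internally consistent, at the cost of losing the handful of accented records entirely.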
Re: Net::Z3950 and diacritics
Hi,

in fact the question is quite complex to explain, and I'm not sure that I can explain it well. At 14.57 16/12/03, you wrote:

> This process works just fine for records that contain no diacritics,
> but when diacritics are in the records extra characters end up in my
> saved files, like this:
>
> 00901nam 22002651 ^^^
> 45100080005001780080041000250350021000669060045000870
> 1000170013204000180014905000180016708200100018512900195245009
> 200224260003400316347003504900029003975040026004266340045
> 27100021004869910044005079910055005510990029006063118006
> 1974041700.0731207s1967nyuabf b000 0beng 9(DLC) 67029856
> a7bcbccorignewdueocipf19gy-gencatlg a 67029856 aDLCcDLCdDLC00aND588.D9bR8500a759.31
> aRussell, Francis,d1910-14aThe world of D®urer, ^^^
> 1471-1528,cby Francis Russell and the editors of Time-Life Books.
> aNew York,bTime, inc.c[1967] a183 p.billus., maps, col. plates.c32 cm.0
> aTime-Life library of art aBibliography: p. 177.10aD®urer, Albrecht,d1471-1528.2 ^^^
> aTime-Life Books. bc-GenCollhND588.D9iR85tCopy 1wBOOKS
> bc-GenCollhND588.D9iR85p00034015107tCopy 2wCCF arussell-world-1071495663
>
> Notice how Dürer got munged into D®urer, twice, and consequently the
> record length is not 901 but 903 instead. Some people say I must be
> sure to request a specific character set from the LOC when downloading
> my MARC records, specifically MARC-8 or MARC-UCS. Which one of these
> character sets do I want and how do I tell the remote database which
> one I want?

1) When you call LOC without requesting a specific character set, you receive data in the MARC-8 character set.

2) In the MARC-8 character set a letter like è [e grave] is encoded with TWO bytes: one for the sign [the grave accent] and one for the letter [the letter e].

3) In the leader, positions 0-4 hold the number of characters, NOT the number of bytes. In your record there are 901 characters and 903 bytes. Perl's length function reads the number of bytes.
The best option, for now, is to use a character set where 1 character is always 1 byte, for example ISO 8859-1. A good place to understand character sets is http://www.gymel.com/charsets/ [in German].

Bye

Zeno Tajoli
[EMAIL PROTECTED]
CILEA - Segrate (MI)
02 / 26995321
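Zeno's points 2 and 3 can be made concrete with a short sketch. This is a simplified illustration, not a real MARC-8 decoder: it assumes only one combining mark, 0xE8, which in the ANSEL/MARC-8 table is the combining diaeresis (umlaut), and that a combining mark always precedes its base letter. One displayed character then costs two bytes, so a leader that counts characters will disagree with Perl's `length`, which counts bytes.

```python
# Sketch: count displayed characters in (simplified) MARC-8, where a
# combining mark byte (here only 0xE8, the ANSEL diaeresis) and the
# base letter that follows it form ONE character out of TWO bytes.

COMBINING_DIAERESIS = 0xE8  # ANSEL/MARC-8 combining umlaut

def marc8_char_count(data: bytes) -> int:
    """Characters as displayed: a combining mark plus its base letter counts once."""
    count = 0
    skip_next = False
    for b in data:
        if skip_next:          # base letter was already counted with its mark
            skip_next = False
            continue
        if b == COMBINING_DIAERESIS:
            skip_next = True   # pair the mark with the next byte
        count += 1
    return count

durer = bytes([0x44, 0xE8, 0x75, 0x72, 0x65, 0x72])  # "Dürer": D, ¨+u, r, e, r

print(len(durer))               # 6 bytes on the wire
print(marc8_char_count(durer))  # 5 displayed characters
```

Two such mark-plus-letter pairs in Eric's record would account exactly for the 903 bytes received versus the 901 the leader claims.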
Re: Net::Z3950 and diacritics
On Tue, Dec 16, 2003 at 03:52:56PM +0100, Tajoli Zeno wrote:

> 1) When you call LOC without requesting a specific character set, you
> receive data in the MARC-8 character set.
>
> 2) In the MARC-8 character set a letter like è [e grave] is encoded
> with TWO bytes: one for the sign [the grave accent] and one for the
> letter [the letter e].
>
> 3) In the leader, positions 0-4 hold the number of characters, NOT the
> number of bytes. In your record there are 901 characters and 903
> bytes. Perl's length function reads the number of bytes.
>
> The best option, for now, is to use a character set where 1 character
> is always 1 byte, for example ISO 8859-1.

While this is certainly part of the answer, we still don't know why the record length is off. The way I see it, there are two possible options:

1. Net::Z3950 is doing on-the-fly conversion of MARC-8 to Latin1.
2. LC's Z39.50 server is emitting the records that way, and not updating the record length.

I guess one way to test which one is true would be to query another Z39.50 server for the same record, and see if the same problem exists, in which case 1 is probably the case.

//Ed
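Ed's proposed test can be sketched as a small diagnostic. The Z39.50 fetching itself is omitted here, and the server names and byte strings below are fabricated for illustration; the sketch just compares each blob's actual byte count against the length its leader claims in positions 0-4.

```python
# Hypothetical diagnostic for the test Ed proposes. Reading his logic:
# if another server shows the same nonzero gap for the same record,
# client-side conversion (option 1) is the likely culprit; if only
# LC's server does, suspect that server (option 2).

def length_discrepancy(record: bytes) -> int:
    """Bytes received minus the record length declared in leader 0-4."""
    return len(record) - int(record[0:5])

def diagnose(records_by_server):
    """Map server name -> discrepancy for the same bibliographic record."""
    return {name: length_discrepancy(blob)
            for name, blob in records_by_server.items()}

# Fabricated blobs: both leaders claim 40 bytes, one server ships 42.
samples = {
    "lc":    b"00040" + b"x" * 37,  # 42 bytes, leader says 40
    "other": b"00040" + b"x" * 35,  # 40 bytes, leader says 40
}
print(diagnose(samples))  # {'lc': 2, 'other': 0}
```

A pattern like the one above would point at the server rather than the client; identical gaps everywhere would point at Net::Z3950.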