Re: Net::Z3950 and diacritics [book catalogs]
On 12/16/03 8:57 AM, Eric Lease Morgan [EMAIL PROTECTED] wrote:

> Upon further investigation, it seems that MARC::Batch is not necessarily causing my problem with diacritics; instead, the problem may lie in the way I am downloading my records using Net::Z3950.

Thank you to everybody who replied to my messages about MARC data and Net::Z3950. I must admit, I still don't understand all the issues. It seems there are at least a couple of character sets that can be used to encode MARC data. The characters in these sets are not always one byte long (specifically the characters with diacritics), and consequently the leader of my downloaded MARC records was not always accurate, I think. Again, I still don't understand all the issues, and the discrepancy is most likely entirely my fault.

I consider my personal catalog about 80% complete. I have about another 200 books to copy catalog, and I can see a few more enhancements to my application, but they will not significantly increase the system's functionality; I consider those enhancements to be featuritis. Using my Web browser I can catalog about two books per minute. In any event, the number of book descriptions from my personal catalog containing diacritics is very small. Tiny. Consequently, my solution was either to hack my MARC records to remove the diacritics or to skip the inclusion of the record altogether.

The process of creating my personal catalog was very enlightening. The MARC records in my catalog are very similar to the records found in catalogs across the world. My catalog provides author, title, and subject searching. It provides Boolean logic, nested queries, and right-hand truncation. The entire record is free-text searchable. Everything is accessible. The results can be sorted by author, title, subject, and rank (statistical relevance).
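For what it's worth, the encoding question can be checked directly. Leader position 09 of a MARC record declares the character coding scheme ('a' for UCS/Unicode, blank for MARC-8), and with a multi-byte encoding the record's byte length can drift away from anything counted in characters, which is one way a leader ends up looking inaccurate. A minimal core-Perl sketch; the `inspect_marc` helper and the sample record are fabricated for illustration:

```perl
use strict;
use warnings;

# Sketch: look at a raw MARC record's leader to see which character
# encoding it claims, and compare its stated logical record length
# with the number of bytes actually received.
sub inspect_marc {
    my ($raw) = @_;
    my $leader = substr($raw, 0, 24);
    # Leader position 09: 'a' = UCS/Unicode, ' ' = MARC-8
    my $scheme = substr($leader, 9, 1) eq 'a' ? 'UCS/Unicode' : 'MARC-8';
    my $stated = substr($leader, 0, 5) + 0;   # positions 00-04: record length
    my $actual = length($raw);                # bytes we actually have
    return ($scheme, $stated, $actual);
}

# A fabricated record: a 24-byte leader claiming a length of 26,
# followed by two bytes of "data".
my $record = '00026    a2200000   4500' . 'xx';
my ($scheme, $stated, $actual) = inspect_marc($record);
print "$scheme: stated=$stated actual=$actual\n";   # prints "UCS/Unicode: stated=26 actual=26"
```

When the two lengths disagree, suspect an encoding mismatch somewhere between the server, the transfer, and the local write.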
A cool search is a search for cookery: http://infomotions.com/books/?cmd=search&query=cookery

Yet, I still find the catalog lacking, and what it lacks are three things: 1) more descriptive summaries, like abstracts; 2) qualitative judgments, like reviews and/or the number of uses (popularity); and 3) access to the full text. These are problems I hope to address in the developing third iteration of my Alex Catalogue: http://infomotions.com/alex2/

My book catalog excels at inventorying my collection. It does a very poor job at recommending or suggesting which book(s) to use. The solution is not more powerful search features, nor is it bibliographic instruction. The solution lies in better, more robust data, as well as access to the full text. This is not just a problem with my catalog; it is a problem with online public access catalogs everywhere. But I digress; I'm off topic. All of this is fodder for my book catalog's About text.

Again, thank you for the input.

-- 
Eric Lease Morgan
University Libraries of Notre Dame
Koha
I'm forwarding a question from a colleague in NYC:

> Do you know of any libraries closer than New Zealand or Europe that use Koha? We are almost ready to consider it.

I'd appreciate anyone who can be a resource, preferably in the US, responding on or off the list. Thanks.

JJ

**Views expressed by the author do not necessarily represent those of the Queens Library.**

Jane Jacobs
Asst. Coord., Catalog Division
Queens Borough Public Library
89-11 Merrick Blvd.
Jamaica, NY 11432
tel.: (718) 990-0804
e-mail: [EMAIL PROTECTED]
FAX: (718) 990-8566
Re: Extracting data from an XML file
On Mon, Jan 05, 2004 at 03:54:09PM -0500, Eric Lease Morgan wrote:

> The code works, but is really slow. Can you suggest a way to improve my code or use some other technique for extracting things like author, title, and id from my XML?

It's slow because you're building a DOM for the entire document and only using a piece of it. If you use a stream-based parser like XML::SAX [1] you should see some good speed improvement, and it won't use so much memory :) XML::SAX can use XML::LibXML, but as a stream. Kip Hampton has a good article, "High-Performance XML Parsing with SAX" [2], which should provide some guidance in getting started with XML::SAX.

SAX is a generally useful technique (in Java land too), and SAX filters are really neat tools to have in your toolbox. I used them heavily as part of Net::OAI::Harvester [3], since OAI responses can be arbitrarily large, and building a DOM for some of the responses could be harmful.

//Ed

[1] http://search.cpan.org/perldoc?XML::SAX
[2] http://xml.com/pub/a/2001/02/14/perlsax.html
[3] http://search.cpan.org/perldoc?Net::OAI::Harvester
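To make the event model concrete, here is a sketch of a SAX-style handler that collects the text of a title element. The handler package follows the XML::SAX callback convention (start_element, characters, end_element), but to keep the sketch self-contained and runnable without XML::SAX installed, the events a parser would emit are fired by hand; the TitleHandler name and the sample title are made up. With XML::SAX installed, the same handler object would instead be handed to a real parser, which streams events to it without ever building a tree:

```perl
use strict;
use warnings;

# A SAX-style handler: accumulate character data only while we
# are inside a <title> element.
package TitleHandler;
sub new { bless { in_title => 0, title => '' }, shift }
sub start_element {
    my ($self, $el) = @_;
    $self->{in_title} = 1 if $el->{Name} eq 'title';
}
sub characters {
    my ($self, $data) = @_;
    $self->{title} .= $data->{Data} if $self->{in_title};
}
sub end_element {
    my ($self, $el) = @_;
    $self->{in_title} = 0 if $el->{Name} eq 'title';
}

package main;

# Hand-fire the events a parser would emit for
# <title>My Life and Hard Times</title>.
my $h = TitleHandler->new;
$h->start_element({ Name => 'title' });
$h->characters({ Data => 'My Life and Hard Times' });
$h->end_element({ Name => 'title' });
print $h->{title}, "\n";   # prints "My Life and Hard Times"
```

The point of the pattern is that state lives in the handler, not in a document tree, so memory use stays flat no matter how large the input grows.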
Re: Extracting data from an XML file
On Monday, January 5, 2004, at 03:54 PM, Eric Lease Morgan wrote:

> To create my HTML files with rich meta data, I need to extract bits and pieces of information from the teiHeader of my originals. The snippet of code below illustrates how I am currently doing this with XML::LibXML: [...] The code works, but is really slow. Can you suggest a way to improve my code or use some other technique for extracting things like author, title, and id from my XML?

Check out XML::Twig, which uses XML::Parser. It gives you -- in tree form -- only those elements you're interested in. From the README:

> One of the strengths of XML::Twig is that it lets you work with files that do not fit in memory (BTW storing an XML document in memory as a tree is quite memory-expensive, the expansion factor often being around 10). To do this you can define handlers, that will be called once a specific element has been completely parsed.

I *think* your code would then look like this:

    use XML::Twig;
    my ($author, $title, $id);
    my $twig = XML::Twig->new('twig_roots' => {
        'teiHeader/fileDesc/titleStmt/author'     => sub { $author = $_[1] },
        'teiHeader/fileDesc/titleStmt/title'      => sub { $title  = $_[1] },
        'teiHeader/fileDesc/publicationStmt/idno' => sub { $id     = $_[1] },
    })->parsefile('/foo/bar.xml');
    $twig->purge;

This is totally untested -- I don't even have XML::Twig installed; I'm just going by the documentation on CPAN. For more info (including a tutorial) see <URL:http://www.xmltwig.com/xmltwig/>.

Paul.

-- 
Paul Hoffman :: Taubman Medical Library :: Univ. of Michigan
[EMAIL PROTECTED] :: [EMAIL PROTECTED] :: http://www.nkuitse.com/
Re: Extracting data from an XML file
I wrote:

> Can you suggest a fast, efficient way to use Perl to extract selected data from an XML file?...

First of all, thank you everyone who promptly replied to my query.

Second, I was not quite clear in my question. Many people said I should write an XSLT stylesheet to transform my XML document into HTML. This is in fact what I do, but I need a process not only to transform each of my documents; I also need to create author and title indexes to my collection, and therefore I need to extract bits of data from each of my original XML files.

Third, most of the replies fell into two categories: 1) use an XSLT stylesheet as a sort of subroutine, and 2) use XML::Twig.

Fourth, I tried both of these approaches plus my own, and timed them. I had to process 1.5 MB of data in nineteen files. Tiny. Ironically, my original code was the fastest at 96 seconds. The XSLT implementation came in second at 101 seconds, and the XML::Twig implementation, while straightforward, came in last at 141 seconds. (See the attached code snippets.)

Since my original implementation is still the fastest, and the newer implementations do not improve the speed of the application, I must assume that the process is slow because of the XSLT transformations themselves. These transformations are straightforward:

    # transform the document and save it
    my $doc = $parser->parse_file($file);
    my $results = $stylesheet->transform($doc);
    my $html_file = "$HTML_DIR/$id.html";
    open OUT, "> $html_file";
    print OUT $stylesheet->output_string($results);
    close OUT;

    # convert the HTML to plain text and save it
    my $html = parse_htmlfile($html_file);
    my $text_file = "$TEXT_DIR/$id.txt";
    open OUT, "> $text_file";
    print OUT $formatter->format($html);
    close OUT;

When my collection grows big I will have to figure out a better way to batch-transform my documents. I might even have to break down and write a shell script to call xsltproc directly. (Blasphemy!)
-- 
Eric Lease Morgan
University Libraries of Notre Dame

subroutines.txt Description: application/applefile

    # my original code
    print "Processing $file...\n";
    my $doc    = $parser->parse_file($file);
    my $root   = $doc->getDocumentElement;
    my @header = $root->findnodes('teiHeader');
    my $author = $header[0]->findvalue('fileDesc/titleStmt/author');
    my $title  = $header[0]->findvalue('fileDesc/titleStmt/title');
    my $id     = $header[0]->findvalue('fileDesc/publicationStmt/idno');
    print "author: $author\n title: $title\n id: $id\n\n";

    # using an XSLT stylesheet
    print "Processing $file...\n";
    my $style      = $parser->parse_file($AUTIID);
    my $stylesheet = $xslt->parse_stylesheet($style);
    my $doc        = $parser->parse_file($file);
    my $results    = $stylesheet->transform($doc);
    my $fullResult = $stylesheet->output_string($results);
    my @fullResult = split /#/, $fullResult;
    my $title  = $fullResult[0];
    my $author = $fullResult[1];
    my $id     = $fullResult[2];
    print "author: $author\n title: $title\n id: $id\n\n";

    # using XML::Twig
    print "Processing $file...\n";
    my ($author, $title, $id);
    my $twig = XML::Twig->new(TwigHandlers => {
        'teiHeader/fileDesc/titleStmt/author'     => sub { $author = $_[1]->text },
        'teiHeader/fileDesc/titleStmt/title'      => sub { $title  = $_[1]->text },
        'teiHeader/fileDesc/publicationStmt/idno' => sub { $id     = $_[1]->text },
    });
    $twig->parsefile($file);
    print "author: $author\n title: $title\n id: $id\n\n";