Re: Net::Z3950 and diacritics [book catalogs]

2004-01-05 Thread Eric Lease Morgan
On 12/16/03 8:57 AM, Eric Lease Morgan [EMAIL PROTECTED] wrote:

 Upon further investigation, it seems that MARC::Batch is not necessarily
 causing my problem with diacritics; instead, the problem may lie in the way I
 am downloading my records using Net::Z3950.

Thank you to everybody who replied to my messages about MARC data and
Net::Z3950.

I must admit, I still don't understand all the issues. It seems there are at
least a couple of character sets that can be used to encode MARC data. The
characters in these sets are not always 1 byte long (specifically the
characters with diacritics), and consequently the leader of my downloaded
MARC records was not always accurate, I think. Again, I still don't
understand all the issues, and the discrepancy is most likely entirely my
fault.
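
For illustration, here is a small, untested sketch of the kind of leader
check implied above: it compares the record length claimed in the leader
(bytes 0-4) with the actual byte length of each raw record. The file name is
just a stand-in:

  # records.mrc is a stand-in name for the file of downloaded records
  open my $fh, '<', 'records.mrc' or die "can't open records.mrc: $!";
  local $/ = "\x1D";                      # MARC record terminator
  while ( my $raw = <$fh> ) {
      my $claimed = substr( $raw, 0, 5 ); # leader bytes 0-4: record length
      my $actual  = length( $raw );       # real byte length, terminator included
      print "leader says $claimed bytes, record is really $actual bytes\n"
          if $claimed != $actual;
  }
  close $fh;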

I consider my personal catalog about 80% complete. I have about another 200
books to copy catalog, and I can see a few more enhancements to my
application, but they will not significantly increase the system's
functionality. I consider those enhancements to be featuritis. Using my
Web browser I can catalog about two books per minute.

In any event, the number of book descriptions from my personal catalog
containing diacritics is very small. Tiny. Consequently, my solution was to
either hack my MARC records to remove the diacritics or skip the inclusion of
the record altogether.
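
A minimal sketch of that skip-or-strip workaround, using MARC::Batch, might
look like the following; the file name is a stand-in, and treating any
non-ASCII byte as a diacritic is a simplifying assumption:

  use MARC::Batch;

  my $batch = MARC::Batch->new( 'USMARC', 'downloaded.mrc' );
  $batch->strict_off;    # keep going past records the module finds suspect

  while ( my $record = $batch->next ) {
      my $raw = $record->as_usmarc;
      if ( $raw =~ /[^\x00-\x7F]/ ) {
          next;          # option 1: skip the record altogether
          # option 2, instead of skipping: strip the offending bytes
          # $raw =~ s/[^\x00-\x7F]//g;
      }
      # ...otherwise add the record to the catalog here...
  }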

The process of creating my personal catalog was very enlightening. The MARC
records in my catalog are very very similar to the records found in catalogs
across the world. My catalog provides author, title, and subject searching.
It provides Boolean logic, nested queries, and right-hand truncation. The
entire record is free-text searchable. Everything is accessible. The results
can be sorted by author, title, subject, and rank (statistical relevance). A
cool search is a search for cookery:

  http://infomotions.com/books/?cmd=search&query=cookery

Yet I still find the catalog lacking, and what it lacks are three
things: 1) more descriptive summaries, like abstracts; 2) qualitative
judgments, like reviews and/or the number of uses (popularity); and 3) access
to the full text. These are problems I hope to address in the developing
third iteration of my Alex Catalogue:

  http://infomotions.com/alex2/

My book catalog excels at inventorying my collection. It does a very poor
job at recommending/suggesting what book(s) to use. The solution is not with
more powerful search features, nor is it with bibliographic instruction. The
solution lies in better, more robust data, as well as access to the full
text. This is not just a problem with my catalog; it is a problem with
online public access catalogs everywhere. But I digress; I'm off topic. All
of this is fodder for my book catalog's About text.

Again, thank you for the input.

-- 
Eric Lease Morgan
University Libraries of Notre Dame



Koha

2004-01-05 Thread Jacobs, Jane W
I'm forwarding a question from a colleague in NYC:

Do you know of any libraries closer than New Zealand or Europe that use
Koha?  We are almost ready to consider it.

I'd appreciate it if anyone who can be a resource, preferably in the US,
would respond on or off the list.  Thanks.
JJ

**Views expressed by the author do not necessarily represent those of the
Queens Library.**

Jane Jacobs
Asst. Coord., Catalog Division
Queens Borough Public Library
89-11 Merrick Blvd.
Jamaica, NY 11432

tel.: (718) 990-0804
e-mail: [EMAIL PROTECTED]
FAX. (718) 990-8566



Re: Extracting data from an XML file

2004-01-05 Thread Ed Summers
On Mon, Jan 05, 2004 at 03:54:09PM -0500, Eric Lease Morgan wrote:
 The code works, but is really slow. Can you suggest a way to improve my code
 or use some other technique for extracting things like author, title, and id
 from my XML?

It's slow because you're building a DOM for the entire document, and only 
using a piece of it. If you use a stream based parser like XML::SAX [1] you
should see some good speed improvement, and it won't use so much memory :)

XML::SAX uses XML::LibXML, but as a stream. Kip Hampton has a good article 
High Performance XML Parsing with SAX [2] which should provide some guidance 
in getting started with XML::SAX.  

SAX is a generally useful technique (in Java land too), and SAX filters are 
really neat tools to have in your toolbox. I used them heavily as part of 
Net::OAI::Harvester [3] since OAI responses can be arbitrarily large, and 
building a DOM for some of the responses could be harmful.
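
Here is a minimal, untested sketch of what a SAX handler for the teiHeader
fields discussed in this thread could look like; it matches on bare element
names rather than the full TEI paths, and the package and file names are
made up:

  package TEIHeaderHandler;
  use base qw( XML::SAX::Base );

  # remember which field we are inside, if any
  sub start_element {
      my ( $self, $el ) = @_;
      my $name = $el->{LocalName} || $el->{Name};
      $self->{current} = $name if $name =~ /^(author|title|idno)$/;
  }

  # accumulate character data for the field we are inside
  sub characters {
      my ( $self, $chars ) = @_;
      $self->{data}{ $self->{current} } .= $chars->{Data} if $self->{current};
  }

  sub end_element {
      my ( $self, $el ) = @_;
      my $name = $el->{LocalName} || $el->{Name};
      delete $self->{current} if $self->{current} && $self->{current} eq $name;
  }

  package main;
  use XML::SAX::ParserFactory;

  my $handler = TEIHeaderHandler->new;
  my $parser  = XML::SAX::ParserFactory->parser( Handler => $handler );
  $parser->parse_uri( '/foo/bar.xml' );

  print "author: $handler->{data}{author}\n";
  print " title: $handler->{data}{title}\n";
  print "    id: $handler->{data}{idno}\n";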

//Ed 

[1] http://search.cpan.org/perldoc?XML::SAX
[2] http://xml.com/pub/a/2001/02/14/perlsax.html
[3] http://search.cpan.org/perldoc?Net::OAI::Harvester



Re: Extracting data from an XML file

2004-01-05 Thread Paul Hoffman
On Monday, January 5, 2004, at 03:54  PM, Eric Lease Morgan wrote:

To create my HTML files with rich meta data, I need to extract bits and
pieces of information from the teiHeader of my originals. The snippet of
code below illustrates how I am currently doing this with XML::LibXML:

[...]

The code works, but is really slow. Can you suggest a way to improve my code
or use some other technique for extracting things like author, title, and id
from my XML?

Check out XML::Twig, which uses XML::Parser.  It gives you -- in tree
form -- only those elements you're interested in.  From the README:

   One of the strengths of XML::Twig is that it let you work with files that
   do not fit in memory (BTW storing an XML document in memory as a tree is
   quite memory-expensive, the expansion factor being often around 10).

   To do this you can define handlers, that will be called once a specific
   element has been completely parsed.

I *think* your code would then look like this:

   use XML::Twig;

   my ($author, $title, $id);

   my $twig = XML::Twig->new('twig_roots' => {
      'teiHeader/fileDesc/titleStmt/author'     => sub { $author = $_[1]->text },
      'teiHeader/fileDesc/titleStmt/title'      => sub { $title  = $_[1]->text },
      'teiHeader/fileDesc/publicationStmt/idno' => sub { $id     = $_[1]->text },
   })->parsefile('/foo/bar.xml');

   $twig->purge;

This is totally untested -- I don't even have XML::Twig installed; I'm
just going by the documentation on CPAN.

For more info (including a tutorial) see URL:http://www.xmltwig.com/xmltwig/.

Paul.

--
Paul Hoffman :: Taubman Medical Library :: Univ. of Michigan
[EMAIL PROTECTED] :: [EMAIL PROTECTED] :: http://www.nkuitse.com/


Re: Extracting data from an XML file

2004-01-05 Thread Eric Lease Morgan
I wrote:

 Can you suggest a fast, efficient way to use Perl to extract selected
 data from an XML file?...

First of all, thank you everyone who promptly replied to my query.

Second, I was not quite clear in my question. Many people said I should
write an XSLT style sheet to transform my XML documents into HTML, and this
is in fact what I do. But I also need to create author and title indexes to
my collection, and therefore I need to extract bits of data from each of my
original XML files.

Third, most of the replies fell into two categories: 1) use an XSLT style
sheet as a sort of subroutine, and 2) use XML::Twig.

Fourth, I tried both of these approaches plus my own, and timed them. I had
to process 1.5 MB of data in nineteen files. Tiny. Ironically, my original
code was the fastest at 96 seconds. The XSLT implementation came in second
at 101 seconds, and the XML::Twig implementation, while straightforward,
came in last at 141 seconds. (See the attached code snippets.)
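
For what it is worth, a rough sketch of how such a timing comparison might be
wired up with the core Benchmark module (the subroutine names are just
placeholders standing in for the attached snippets):

  use Benchmark qw( timethese );

  # @files holds the nineteen TEI files
  timethese( 1, {
      'libxml-dom' => sub { extract_with_libxml($_) for @files },
      'xslt'       => sub { extract_with_xslt($_)   for @files },
      'twig'       => sub { extract_with_twig($_)   for @files },
  } );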

Since my original implementation is still the fastest, and the newer
implementations do not improve the speed of the application, I must
assume that the process is slow because of the XSLT transformations
themselves. These transformations are straightforward:

  # transform the document and save it
  my $doc       = $parser->parse_file($file);
  my $results   = $stylesheet->transform($doc);
  my $html_file = "$HTML_DIR/$id.html";
  open OUT, "> $html_file";
  print OUT $stylesheet->output_string($results);
  close OUT;

  # convert the HTML to plain text and save it
  my $html      = parse_htmlfile($html_file);
  my $text_file = "$TEXT_DIR/$id.txt";
  open OUT, "> $text_file";
  print OUT $formatter->format($html);
  close OUT;

When my collection grows big I will have to figure out a better way to batch
transform my documents. I might even have to break down and write a shell
script to call xsltproc directly. (Blasphemy!)
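
If it comes to that, a minimal sketch of the idea, driving xsltproc from Perl
rather than a shell script, might look like this; the stylesheet name and the
directory variables are assumptions:

  my $stylesheet = 'tei2html.xsl';
  foreach my $file ( glob "$XML_DIR/*.xml" ) {
      ( my $id = $file ) =~ s{^.*/}{};   # strip the directory
      $id =~ s{\.xml$}{};                # strip the extension
      system( 'xsltproc', '-o', "$HTML_DIR/$id.html", $stylesheet, $file ) == 0
          or warn "xsltproc failed on $file: $?\n";
  }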

-- 
Eric Lease Morgan
University Libraries of Notre Dame




subroutines.txt
  # my original code
  print "Processing $file...\n";
  my $doc    = $parser->parse_file($file);
  my $root   = $doc->getDocumentElement;
  my @header = $root->findnodes('teiHeader');
  my $author = $header[0]->findvalue('fileDesc/titleStmt/author');
  my $title  = $header[0]->findvalue('fileDesc/titleStmt/title');
  my $id     = $header[0]->findvalue('fileDesc/publicationStmt/idno');
  print "  author: $author\n   title: $title\n  id: $id\n\n";


  # using an XSLT stylesheet
  print "Processing $file...\n";
  my $style      = $parser->parse_file($AUTIID);
  my $stylesheet = $xslt->parse_stylesheet($style);
  my $doc        = $parser->parse_file($file);
  my $results    = $stylesheet->transform($doc);
  my $fullResult = $stylesheet->output_string($results);
  my @fullResult = split /#/, $fullResult;
  my $title      = $fullResult[0];
  my $author     = $fullResult[1];
  my $id         = $fullResult[2];
  print "  author: $author\n   title: $title\n  id: $id\n\n";


  # using XML::Twig
  print "Processing $file...\n";
  my ($author, $title, $id);
  my $twig = new XML::Twig(TwigHandlers => {
    'teiHeader/fileDesc/titleStmt/author'     => sub { $author = $_[1]->text },
    'teiHeader/fileDesc/titleStmt/title'      => sub { $title  = $_[1]->text },
    'teiHeader/fileDesc/publicationStmt/idno' => sub { $id     = $_[1]->text }});
  $twig->parsefile($file);
  print "  author: $author\n   title: $title\n  id: $id\n\n";