XML Lucene Indexing Package Updated

2002-05-15 Thread W. Eliot Kimber

I have updated the demo Lucene XML indexing package at
http://www.isogen.com/papers/lucene_xml_indexing.zip. This new release
includes code improvements from Brandon Jockman and some slightly better
build and run scripts.

You should be able to unzip the package and run the
LuceneClient/runLuceneClient.bat script (on Windows) and it should just
work. If it doesn't, let me know.

Cheers,

Eliot
--
W. Eliot Kimber, [EMAIL PROTECTED]
Consultant, ISOGEN International

1016 La Posada Dr., Suite 240
Austin, TX  78752 Phone: 512.656.4139





Re: PDF4J Project: Gathering Feature Requests

2002-05-06 Thread W. Eliot Kimber

Peter Carlson wrote:

 This is very exciting.

 Are you planning on basing the code on other pdf readers / writers?

At this point I haven't found any Java PDF reader that meets my
requirements. One of the motivations for doing this is the problems we
had using Etymon's PJ library: both the license (GPL, not LGPL) and the
quality of the code itself, which does not meet our engineering
standards. I want to use an LGPL license so that people can use the
code in projects that are not themselves open source, while the library
itself remains protected.

For writing, we may or may not be able to leverage existing code; I
don't know yet.

Note too that there are two aspects of writing: creating a valid PDF
data stream and creating meaningful page layouts--we are not addressing
the second of these (there are lots of libraries that will create useful
PDF output from various non-PDF inputs). Our main writing use case is the
rewriting of existing PDFs following some amount of manipulation through
our API.

A caution: I am still waiting to get approval from my employers to do
this work as open source--it may be a while before I can even start on
the coding.

Cheers,

Eliot
--
W. Eliot Kimber, [EMAIL PROTECTED]
Consultant, ISOGEN International

1016 La Posada Dr., Suite 240
Austin, TX  78752 Phone: 512.656.4139





Re: indexing PDF files

2002-05-03 Thread W. Eliot Kimber

Moturu,Praveen wrote:
 
 Good morning to you all. Can I assume none of the people on the Lucene
 user group has implemented indexing a PDF document using Lucene? If
 someone has, please help me by providing the solution.

You can try using Etymon's PJ library (www.etymon.com). But be aware
that the code as provided does not support some features of PDF and has
some bugs that prevent it from reading some PDFs. 

Note also that there are some inherent problems with full-text indexing
of PDFs, namely that the word order in the PDF does not necessarily
reflect its reading order (for example, in two-column layouts), so if
your tokenizer is doing phrase analysis it may produce incorrect
results. You can see this by doing a multi-word search in Acrobat Reader
on a two-column document. It can also be difficult to accurately
determine word boundaries because of the way that PDF can represent text
strings as sequences of characters and placement instructions. The
Adobe-provided C libraries have largely solved this problem, but the PJ
library has not--you will have to write your own algorithms to reduce
text sequences with explicit kerning instructions into meaningful
tokens. Not impossible, but it takes a little doing.
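
To give a flavor of what that reduction involves, here is a toy sketch.
It is not PJ's API and not our code--the class name, the pre-parsed
argument list, and the -200 threshold are all made up for illustration.
It collapses the argument array of a PDF TJ operator (alternating string
fragments and numeric kerning adjustments) back into word tokens by
treating large negative adjustments as inter-word gaps:

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/**
 * Toy sketch only: reduce the argument array of a PDF "TJ" operator
 * (alternating string fragments and numeric kerning adjustments, assumed
 * already parsed by some PDF reader) into word tokens.  The -200
 * threshold is a guess; real code would derive it from the width of the
 * current font's space glyph.
 */
public class TjTokenizer {

    private static final double WORD_GAP = -200.0; // text-space thousandths, assumed

    /** tjArgs holds String fragments and Double adjustments, in operator order. */
    public static List tokenize(List tjArgs) {
        StringBuffer text = new StringBuffer();
        for (int i = 0; i < tjArgs.size(); i++) {
            Object item = tjArgs.get(i);
            if (item instanceof String) {
                text.append((String) item);
            } else if (((Double) item).doubleValue() < WORD_GAP) {
                // A large negative displacement usually marks an inter-word gap.
                text.append(' ');
            }
            // Small adjustments are just kerning inside a word: ignore them.
        }
        // Split on the spaces we inserted plus any literal spaces in the fragments.
        List tokens = new ArrayList();
        StringTokenizer st = new StringTokenizer(text.toString());
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }
}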

If you have money to spend you could license the Adobe PDF libraries and
create a Java binding for them. It does not appear that Adobe has any
plans to provide a Java library for accessing PDFs, free or otherwise.

However, implementing a Java PDF reader would not be too hard--I started
trying to implement one just to see how hard it would be and got as far
as being able to get page objects by page number after an intense
weekend's work [unfortunately my employment contract prevents me from
creating open-source software without explicit approval and I didn't
want to create a PDF library that wasn't open source, so I haven't done
any more work on it yet]. The PDF spec (www.pdfzone.com) is pretty
clear, although the PDF format is pretty convoluted (lots of byte
offsets and such). But once you get the basic infrastructure in place
for parsing out specific objects, the rest of it is just tedious parser
implementation--there are scads of different field types once you get
down to text streams.
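
As an illustration of the byte-offset flavor of the format (made up for
the example, not code from my weekend experiment): the trailer at the
end of a PDF file contains a startxref keyword giving the byte offset of
the cross-reference table, and a reader bootstraps itself from there.

import java.io.IOException;
import java.io.RandomAccessFile;

/**
 * Toy illustration of PDF's byte-offset bookkeeping: locate the
 * "startxref" keyword in the file trailer and read the offset of the
 * cross-reference table.  A real reader would go on to parse the xref
 * entries and the trailer dictionary found at that offset.
 */
public class StartXref {
    public static long readStartXref(String path) throws IOException {
        RandomAccessFile f = new RandomAccessFile(path, "r");
        try {
            int tail = (int) Math.min(1024, f.length());
            byte[] buf = new byte[tail];
            f.seek(f.length() - tail);
            f.readFully(buf);
            String s = new String(buf, "ISO-8859-1");
            int i = s.lastIndexOf("startxref");
            if (i < 0) {
                throw new IOException("no startxref keyword found");
            }
            // The offset is the first integer after the keyword.
            String rest = s.substring(i + "startxref".length()).trim();
            int end = 0;
            while (end < rest.length() && Character.isDigit(rest.charAt(end))) {
                end++;
            }
            return Long.parseLong(rest.substring(0, end));
        } finally {
            f.close();
        }
    }
}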

Adding the business logic to figure out where things are on the page
would be more involved--you'd have to implement Adobe's layout logic.
However, you need this functionality in order to correlate PDF
annotations (links, bookmarks, notes) to the page objects they relate
to--it's all done with bounding boxes.

Cheers,

Eliot
-- 
W. Eliot Kimber, [EMAIL PROTECTED]
Consultant, ISOGEN International

1016 La Posada Dr., Suite 240
Austin, TX  78752 Phone: 512.656.4139





XML Indexing With Lucene: New Location For Package

2002-02-01 Thread W. Eliot Kimber

You can now find our package for doing XML indexing with Lucene on the
ISOGEN web site:

http://www.isogen.com/papers/lucene_xml_indexing.html

The package (lucene_xml_indexing.zip) includes all the third-party
libraries it depends on (Lucene, Xerces 1.4.4, JUnit).

This package is provided as-is and is not actively supported, but I do
want to know if you run into any problems using it.

Cheers,

Eliot Kimber
ISOGEN International, LLC
[EMAIL PROTECTED]





Re: Zones

2002-01-25 Thread W. Eliot Kimber

Ogren, Philip V. wrote:
 We are indexing a large corpus of XML documents (~10M).  One thing that
 Verity does with XML nodes is that it indexes each XML tag as a zone.*
 What's cool about it is that the zones are nested so that it mirrors the
 schema of your XML document.  You can limit your search to any part of the
 document by searching on specific zones.  A Verity zone is analogous to a
 Lucene field.  Verity also has 'field' indexes - but these are a different
 kind of index that Lucene does not have.  Verity fields allow you to index
 various numeric types, date types etc. side-by-side with your textual index.
 
 The edge that Verity zones have over Lucene fields is that they are nested.
 However, nested fields can be simulated quite easily in Lucene by doing
 redundant indexing.  I have a hunch this is what Verity does anyways because
 their indexes are HUGE.

The XML indexing scheme we developed for Lucene here at ISOGEN (and
posted about late last year) provides more complete XML indexing than
Verity can provide because it is not limited by some of the constraints
inherent in Verity's zone mechanism. Our indexing approach is also
infinitely more flexible than Verity's (or any other commercial
system's) because relatively simple Java code can be used to extend the
default indexing to optimize for specific DTDs or types of queries.

Also, Verity is, as far as I know, unable to index elements or
attributes that have . (period) in their names, because its indexers
always treat . as a word separator. Doh.

Of the commercial full-text indexers that do XML indexing, my analysis
is that Verity does the best job, but it is still, in my opinion, not
sufficiently complete or flexible to be useful in production. Otherwise,
Verity is a fine full-text indexing system.

Cheers,

Eliot Kimber
ISOGEN International, LLC





Trying To Understand Query Syntax Details

2001-10-16 Thread W. Eliot Kimber

I'm trying to understand the details of the query syntax. I found the
grammar in QueryParser.jj, but it doesn't make everything clear.

My initial questions:

- It doesn't appear that ? can be the last character in a search. For
example, to match "fool" and "food", I tried "foo?", but got a
parse error. "fo?l" of course matches "fool" and "foal". Is this a bug
or an implementation constraint?

- How does one specify a date range in a query? We need to be able to
search on docs later than date x, and I know that Lucene supports date
matching, but I don't see how to specify this in a query.

Also, is there a description of the algorithm the ~ (fuzzy match) operator uses?

Thanks,

E.

-- 
. . . . . . . . . . . . . . . . . . . . . . . .

W. Eliot Kimber | Lead Brain

1016 La Posada Dr. | Suite 240 | Austin TX  78752
T 512.656.4139 |  F 512.419.1860 | [EMAIL PROTECTED]

w w w . d a t a c h a n n e l . c o m



Re: Trying To Understand Query Syntax Details

2001-10-16 Thread W. Eliot Kimber

 Scott Ganyo wrote:
 
 Not sure about the rest, but if you've stored your dates in yyyymmdd
 format, you can use a RangeQuery like so:
 
 dateField:[20011001-null]
 
 This would return all dates on or after October 1, 2001.

Cool--thanks!
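
For the archive, here is roughly the same thing through the programmatic
API rather than the query string. This is only a sketch: it assumes a
Lucene 1.x-style API (Term, RangeQuery, Hits), that RangeQuery accepts a
null upper term for an open-ended range, and made-up index path and
field names.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RangeQuery;

/**
 * Sketch of the equivalent programmatic query: all docs whose "dateField"
 * value (indexed as yyyymmdd text) is on or after 2001-10-01.  The index
 * path and field name are placeholders.
 */
public class DateRangeSearch {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("index");
        // null upper term == open-ended range; true == inclusive lower bound
        RangeQuery query = new RangeQuery(
                new Term("dateField", "20011001"), null, true);
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("dateField"));
        }
        searcher.close();
    }
}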

E.
-- 
. . . . . . . . . . . . . . . . . . . . . . . .

W. Eliot Kimber | Lead Brain

1016 La Posada Dr. | Suite 240 | Austin TX  78752
T 512.656.4139 |  F 512.419.1860 | [EMAIL PROTECTED]

w w w . d a t a c h a n n e l . c o m



XML Indexing Samples

2001-10-16 Thread W. Eliot Kimber

I have put together a hopefully useful package that demonstrates our
current experiments with using Lucene for XML indexing. You can get the
files by anonymous ftp from che.isogen.com, /outgoing/lucene. There are
two zip files:

- lucene_xml_indexing.zip
  
  This is the core indexing code and a little Java app that lets you do
searches and see the results (including going back to the original docs
to get data not stored in the index). There is documentation that should
get you going. Also includes Jython support for interacting with the
indexer and Lucene if you don't like GUIs (I wrote the Jython support
first and then the GUI, if you're wondering).

- lucene_xml_sample_index.zip

  This is a sample index containing three books from the New Testament
out of the Jon Bosak World Religions document set. I've included this
sample index because the indexing feature of the GUI may not work (it
works when I run the code from JBuilder, but didn't appear to work when
I ran it standalone, and I've run out of time to spend on this). The
docids in the index are absolute file paths, so you need to put the data
dir from the Zip file at the root of the same drive you're running the
GUI from. This directory contains the original docs, which the GUI goes
back to. Weak, I know, but it's just a demo.

I haven't tested this stuff outside of Windows, but it should just work
elsewhere.

Let me know if there's some hideous problem with the package.

Cheers,

Eliot



Indexing XML With Lucene: Some Initial Results

2001-10-14 Thread W. Eliot Kimber

We have continued to test our experiment of indexing XML docs by making
one Lucene doc for each element. It seems to be working pretty well,
although we haven't tried any really large-scale tests yet (will try to
do that this coming week).

I did do some informal testing with the World Religions document set
provided by Jon Bosak of Sun Microsystems. Using the Xerces DOM
implementation, it took about 75 seconds on a 900 MHz PIII laptop to
index the Book of Mormon (which is the biggest of the four works, at 1.5
MB of XML data). Searches across it were essentially instantaneous (but
the index size was small in terms of the scales Lucene can support). I
have not yet profiled the cost of things like collating hit lists by XML
document (that is, all hits with the same docid field), but that should
be purely a function of Java's speed at list iteration, not anything
Lucene does. 

I also wrote client code that takes the treeloc in a given hit and looks
up the corresponding node in the source document's DOM; a sketch of that
lookup is below. This code was very fast too (again, using the Xerces
DOM implementation). I had to do this because we aren't storing any of
the XML data in the index itself (which you could do, but it seemed
redundant given that the original documents are still accessible).

Given the ability to store pretty much anything in fields, you could
actually capture all of the original XML data in the Lucene index such
that the original document could be reconstituted with sufficient
fidelity. We are not currently taking that approach because we don't
want to add that complexity to the Lucene index. But it does imply that
Lucene could be used as an XML store where the original input documents
are not subsequently kept. (Of course, I don't know whether this
approach would perform well enough, but it almost certainly wouldn't
perform worse than existing XML-specific storage systems that decompose
docs at the element level.)

[I personally don't like storage systems that store XML documents only
as decomposed bits, which is why we're not taking that approach in this
project--we're treating the Lucene indexes as purely transient indexes
over a separately-managed authoritative datastore. This protects us, for
example, from changes to the index rules, such as changing fields from
indexed to non-indexed or changing the rules for particular fields. It's
much easier and faster to simply re-index existing docs than to do some
sort of export/re-import process.]
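
The treeloc lookup itself is nothing more than a walk down child-node
lists. A stripped-down sketch (assuming the treeloc is stored as
space-separated child indexes starting at the Document node and counting
all child-node types--the real code differs in detail):

import java.util.StringTokenizer;

import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

/**
 * Sketch of the treeloc lookup: walk back down the DOM of the original
 * document using the child indexes recorded in the hit's "treeloc" field.
 * Assumes the first number selects a child of the Document node and that
 * indexes count all child nodes; both are assumptions for the example.
 */
public class TreeLocResolver {
    public static Node resolve(String xmlUri, String treeloc) throws Exception {
        DOMParser parser = new DOMParser();
        parser.parse(xmlUri);
        Document dom = parser.getDocument();

        Node node = dom;
        StringTokenizer st = new StringTokenizer(treeloc);
        while (st.hasMoreTokens()) {
            int index = Integer.parseInt(st.nextToken());
            node = node.getChildNodes().item(index);
        }
        return node;
    }
}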

I'm also starting to think about additional contextual information that
could be captured in the index to make it possible to do even more
contextual qualification at the Lucene query level. Will require more
experimentation and thought.

Again, the basic approach is very simple: for a given XML document, walk
the DOM tree, creating one Lucene doc for each element node, where each
Lucene doc has a docid field whose value is the same for all docs
created from the same XML document, a tagname field, an ancestors
field (the ordered list of ancestor element types for the element), a
treeloc field, which is the DOM tree location of the element (e.g., 0
1 0 3 for the 4th child of the first child of the second child of the
document element), and a nodetype field that indicates the DOM node
type that has been indexed (we also index processing instructions and
comments and could do more). Any attributes are captured as fields as
well, enabling searching on attribute values.

For the text content of the document, we are capturing only the
directly-contained content of each element and indexing that as the
content field. We also capture all the PCDATA content for the whole
document and index it on a separate Lucene doc with a distinct node type
(e.g., ALL_CONTENT). This enables phrase searching that ignores
element boundaries and also allows faster queries if all you care about
is whether a given doc contains some text, not which elements contain
it.
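
In condensed form, the walk looks roughly like the following. This is a
simplified sketch, not the actual package: it assumes a Lucene 1.x-style
API (Field.Text, Field.Keyword, IndexWriter), skips PIs and comments,
and uses the docid/tagname/ancestors/treeloc/nodetype field names
described above.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/**
 * Condensed sketch of the indexing walk: one Lucene doc per element, plus
 * one ALL_CONTENT doc per XML document.  Field handling, node types, PIs
 * and comments are all simplified relative to the real package.
 */
public class XmlElementIndexer {

    public static void index(String xmlUri, IndexWriter writer) throws Exception {
        DOMParser parser = new DOMParser();
        parser.parse(xmlUri);
        Element root = parser.getDocument().getDocumentElement();

        StringBuffer allContent = new StringBuffer();
        walk(root, xmlUri, "", "0", writer, allContent);

        // One extra doc per XML document holding all PCDATA, so phrase
        // searches can ignore element boundaries.
        Document allDoc = new Document();
        allDoc.add(Field.Keyword("docid", xmlUri));
        allDoc.add(Field.Keyword("nodetype", "ALL_CONTENT"));
        allDoc.add(Field.Text("content", allContent.toString()));
        writer.addDocument(allDoc);
    }

    private static void walk(Element elem, String docid, String ancestors,
                             String treeloc, IndexWriter writer,
                             StringBuffer allContent) throws Exception {
        Document doc = new Document();
        doc.add(Field.Keyword("docid", docid));
        doc.add(Field.Keyword("nodetype", "ELEMENT"));
        doc.add(Field.Keyword("tagname", elem.getTagName()));
        doc.add(Field.Text("ancestors", ancestors));
        doc.add(Field.Keyword("treeloc", treeloc));

        // Attributes become fields named after the attribute.
        NamedNodeMap atts = elem.getAttributes();
        for (int i = 0; i < atts.getLength(); i++) {
            Node att = atts.item(i);
            doc.add(Field.Keyword(att.getNodeName(), att.getNodeValue()));
        }

        // Directly-contained text only; child elements get their own docs.
        StringBuffer content = new StringBuffer();
        NodeList children = elem.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                content.append(child.getNodeValue());
                allContent.append(child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                walk((Element) child, docid,
                     ancestors + " " + elem.getTagName(),
                     treeloc + " " + i, writer, allContent);
            }
        }
        doc.add(Field.Text("content", content.toString()));
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
        index(args[0], writer);
        writer.optimize();
        writer.close();
    }
}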

We then have a front-end that both handles preparing the queries that go
to Lucene and collating the results that come back (for example,
organizing the hits by XML document or doing additional context-based
filtering that can't be done at the Lucene level). 
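
The collation step is essentially one pass over the hit list, grouping
element-level hits by their docid field. A trivial sketch (Lucene
1.x-style Hits and QueryParser assumed; the query string and index path
are placeholders):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;

/**
 * Sketch of the collation step: run a query over the per-element index and
 * group the element-level hits by their "docid" field so each original XML
 * document shows up once, with the treelocs of its matching elements.
 */
public class CollateByDocid {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("index");
        Hits hits = searcher.search(
                QueryParser.parse("content:\"big brown dog\"", "content",
                                  new StandardAnalyzer()));

        Map byDoc = new HashMap();           // docid -> List of treelocs
        for (int i = 0; i < hits.length(); i++) {
            String docid = hits.doc(i).get("docid");
            List group = (List) byDoc.get(docid);
            if (group == null) {
                group = new ArrayList();
                byDoc.put(docid, group);
            }
            group.add(hits.doc(i).get("treeloc"));
        }

        for (Iterator it = byDoc.keySet().iterator(); it.hasNext();) {
            String docid = (String) it.next();
            System.out.println(docid + ": " + byDoc.get(docid));
        }
        searcher.close();
    }
}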

Cheers,

Eliot
-- 
. . . . . . . . . . . . . . . . . . . . . . . .

W. Eliot Kimber | Lead Brain

1016 La Posada Dr. | Suite 240 | Austin TX  78752
T 512.656.4139 |  F 512.419.1860 | [EMAIL PROTECTED]

w w w . d a t a c h a n n e l . c o m



Another Indexing Question: Case Sensitivity

2001-10-13 Thread W. Eliot Kimber

From reading the docs, my understanding is that if you want to enable
both case-sensitive and case-insensitive searches you must have two
indexes, one that uses a case-insensitive analyzer and one that uses a
case-sensitive one--is this correct?
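
To make the question concrete, what I have in mind is something like the
sketch below: the same docs pushed through two IndexWriters with
different analyzers. WhitespaceAnalyzer is just a stand-in for whatever
case-preserving analyzer one would actually use (or a trivial custom
analyzer if your Lucene build lacks it), and the paths are placeholders.

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/**
 * Two-index arrangement, sketched: the same document written once with a
 * lowercasing analyzer (for case-insensitive searches) and once with a
 * case-preserving one (for case-sensitive searches).
 */
public class TwoCaseIndexes {
    public static void main(String[] args) throws Exception {
        IndexWriter ci = new IndexWriter("index-ci", new StandardAnalyzer(), true);
        IndexWriter cs = new IndexWriter("index-cs", new WhitespaceAnalyzer(), true);

        Document doc = new Document();
        doc.add(Field.Text("content", "The Quick Brown Fox"));

        ci.addDocument(doc);
        cs.addDocument(doc);

        ci.close();
        cs.close();
    }
}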

Thanks,

E.
-- 
. . . . . . . . . . . . . . . . . . . . . . . .

W. Eliot Kimber | Lead Brain

1016 La Posada Dr. | Suite 240 | Austin TX  78752
T 512.656.4139 |  F 512.419.1860 | [EMAIL PROTECTED]

w w w . d a t a c h a n n e l . c o m



Re: Index Optimization: Which is Better?

2001-10-12 Thread W. Eliot Kimber

Steven J. Owens wrote:

  I think that's exactly what Elliot is intending.

Steven is correct. For each element in the XML document we create a
separate Lucene document with the following fields:

- docid (unique identifier of the input XML document, e.g., file system
path, object ID from a repository, URL, etc.)
- list of ancestor element types
- DOM tree location
- text of direct PCDATA content
- DOM node type (ELEMENT_NODE, PROCESSING_INSTRUCTION_NODE,
COMMENT_NODE) [This list is probably incomplete, but it was enough for us
to test the idea.]
- For each attribute of the element, a field whose name is the attribute
name and whose value is the attribute value.

We also capture all the text content of the input XML document as a
single Lucene document with the same docid and the node type
ALL_CONTENT.

Given these Lucene documents, I can do queries like this:

   "big brown dog" AND ancestor:tag2 AND NOT ancestor:tag3 AND
language:english

This will result in one doc for each element instance that contains the
text "big brown dog", is within a tag2 element, is not within a tag3
element, and has the value "english" for its language attribute.

To make sure you match the phrase if it crosses element boundaries, just
include the all-content doc as well:

   "big brown dog" AND ((ancestor:tag2 AND NOT ancestor:tag3 AND
language:english) OR
   (nodetype:ALL_CONTENT))

Given this set of Lucene docs, we can then collect them by docid to
determine which XML documents are represented. The ancestor list and
tree location enable correlating each hit back to its original location
in the input document. They also enable post-processing to do more
involved contextual filtering, such as finding 'foo' in all paras that
are first children of chapters.
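
That last filter can't be expressed in the Lucene query itself, but it
falls straight out of the stored fields. A rough sketch of the
post-processing step (the field formats--space-separated ancestors and
treeloc values--are assumptions for the example, and treating a trailing
treeloc index of 0 as "first child" glosses over intervening text nodes):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

/**
 * Sketch of post-query filtering: keep only hits whose element is a "para"
 * that is the first child of a "chapter", using the stored "tagname",
 * "ancestors" and "treeloc" fields.
 */
public class FirstParaOfChapterFilter {

    public static boolean matches(Document hit) {
        String tagname = hit.get("tagname");
        String ancestors = hit.get("ancestors"); // space-separated, root first (assumed)
        String treeloc = hit.get("treeloc");     // space-separated child indexes (assumed)

        if (!"para".equals(tagname) || ancestors == null || treeloc == null) {
            return false;
        }
        // Immediate parent is the last entry in the ancestors list;
        // "first child" means the last treeloc step is 0.
        String parent = ancestors.trim();
        parent = parent.substring(parent.lastIndexOf(' ') + 1);
        String lastStep = treeloc.trim();
        lastStep = lastStep.substring(lastStep.lastIndexOf(' ') + 1);
        return "chapter".equals(parent) && "0".equals(lastStep);
    }

    public static void printMatches(Hits hits) throws IOException {
        for (int i = 0; i < hits.length(); i++) {
            Document d = hits.doc(i);
            if (matches(d)) {
                System.out.println(d.get("docid") + " @ " + d.get("treeloc"));
            }
        }
    }
}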

We have implemented a first pass at code that does this indexing but we
have no idea how it will perform (we only got this fully working
yesterday and haven't had time to stress it yet).

I agree that this is somewhat twisted. In fact, my colleague John
Heintz, who suggested the approach of one Lucene doc per element,
characterized the idea as an abuse of Lucene's design. But we haven't
been able to think of a better or easier way to do it.

It was really easy to write the DOM processing code to generate this
index and the interaction with Lucene's API couldn't have been
easier--this is my first experience programming against Lucene and I'm
really impressed with the simplicity of the API and the power of the
architecture. 

The functionality described above for XML retrieval already surpasses
anything I know how to do with Verity, Fulcrum, Excalibur, etc., and it
was freakishly easy to do once we got the idea for the approach. I just hope
it performs adequately.

Cheers,

E.

-- 
. . . . . . . . . . . . . . . . . . . . . . . .

W. Eliot Kimber | Lead Brain

1016 La Posada Dr. | Suite 240 | Austin TX  78752
T 512.656.4139 |  F 512.419.1860 | [EMAIL PROTECTED]

w w w . d a t a c h a n n e l . c o m



Index Optimization: Which is Better?

2001-10-11 Thread W. Eliot Kimber

We are experimenting with XML-aware indexing. The approach we're trying
is to index every element in a given XML document as a separate Lucene
document along with another Lucene document that captures just the
concatenated text content of the document (to handle searching for
phrases across element boundaries), what we're calling the all-content
Lucene document.

We are using a node type field to distinguish the different types of
XML document constructs we are indexing (elements, comments, PIs, etc.)
and also thought we would use node type to distinguish the all-content
document. When we get a hit list, we can then use the node type to
figure out which XML constructs contained the target text and reduce the
per-element Lucene documents to single XML documents for the final query
result. We can also use node type to limit the query (you might want to
search just in PIs or just in comments, for example).

Our question is this: given that for the all-content document we could
either use the default content field for the text and the node type
field to label the document as the all-content node or simply use a
different field name for the content (e.g., alltext or something),
which of the following queries would tend to perform better? This:

"some text" AND nodetype:ALL_CONTENT

or:
  
alltext:"some text"

Or is there any practical difference?

Which way we construct the Lucene document will affect how our front-end
and/or users have to construct queries. It would be slightly more
convenient for front-ends to get the all-content doc by default (using
the content field for the text), but we thought the AND query needed
to limit searches to just the text (thus ignoring element-specific
searching) might incur a performance penalty.
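
To make the two alternatives concrete, here is how the all-content
Lucene document would be constructed in each case. This is just an
illustration (Field.Text/Field.Keyword as in the current API), not our
actual indexing code; "allText" stands for the concatenated text of the
whole XML document.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/**
 * The two all-content layouts being compared, spelled out.
 */
public class AllContentVariants {

    /** Variant 1: same "content" field as the element docs, tagged by nodetype. */
    public static Document withNodeType(String docid, String allText) {
        Document doc = new Document();
        doc.add(Field.Keyword("docid", docid));
        doc.add(Field.Keyword("nodetype", "ALL_CONTENT"));
        doc.add(Field.Text("content", allText)); // query: "some text" AND nodetype:ALL_CONTENT
        return doc;
    }

    /** Variant 2: a dedicated field name, no nodetype test needed in the query. */
    public static Document withOwnField(String docid, String allText) {
        Document doc = new Document();
        doc.add(Field.Keyword("docid", docid));
        doc.add(Field.Text("alltext", allText)); // query: alltext:"some text"
        return doc;
    }
}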

In a related question, is there anything we can or need to do to
optimize Lucene to handle lots of little Lucene documents? 

Thanks,

Eliot

-- 
. . . . . . . . . . . . . . . . . . . . . . . .

W. Eliot Kimber | Lead Brain

1016 La Posada Dr. | Suite 240 | Austin TX  78752
T 512.656.4139 |  F 512.419.1860 | [EMAIL PROTECTED]

w w w . d a t a c h a n n e l . c o m