XML Lucene Indexing Package Updated
I have updated the demo Lucene XML indexing package at http://www.isogen.com/papers/lucene_xml_indexing.zip. This new release includes code improvements from Brandon Jockman and some slightly better build and run scripts. You should be able to unzip the package and run the LuceneClient/runLuceneClient.bat script (on Windows) and it should just work. If it doesn't, let me know. Cheers, Eliot -- W. Eliot Kimber, [EMAIL PROTECTED] Consultant, ISOGEN International 1016 La Posada Dr., Suite 240 Austin, TX 78752 Phone: 512.656.4139 -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: PDF4J Project: Gathering Feature Requests
Peter Carlson wrote: This is very exciting. Are you planning on basing the code on other pdf readers / writers? At this point I haven't found any Java PDF reader that meets my requirements. One of the motivations for doing this is the problems we had using Etymon's PJ library: both the license (GPL, not LGPL) and the quality of the code itself, which does not meet our engineering standards. I want to use an LGPL license so that people can use the code in projects that are not themselves open source, but I want the library itself to be protected. For writing, we may or may not be able to leverage existing code; I don't know yet. Note too that there are two aspects of writing: creating a valid PDF data stream and creating meaningful page layouts--we are not addressing the second of these (there are lots of libraries that will create useful PDF output from various non-PDF inputs). Our main writing use case is rewriting existing PDFs after some amount of manipulation through our API. A caution: I am still waiting for approval from my employers to do this work as open source--it may be a while before I can even start on the coding. Cheers, Eliot
Re: indexing PDF files
Moturu, Praveen wrote: Good morning to you all. Can I assume none of the people on the Lucene user group has implemented indexing a PDF document using Lucene? If someone has, please help me by providing the solution. You can try using Etymon's PJ library (www.etymon.com). But be aware that the code as provided does not support some features of PDF and has some bugs that prevent it from reading some PDFs. Note also that there are some inherent problems with full-text indexing of PDFs, namely that the word order in the PDF does not necessarily reflect its reading order (for example, in two-column layouts), so if your tokenizer is doing phrase analysis it may produce incorrect results. You can see this by doing a multi-word search in Acrobat Reader on a two-column document. It can also be difficult to accurately determine word boundaries because of the way PDF can represent text strings as sequences of characters and placement instructions. The Adobe-provided C libraries have largely solved this problem, but the PJ library has not--you will have to write your own algorithms to reduce text sequences with explicit kerning instructions into meaningful tokens. Not impossible, but it takes a little doing. If you have money to spend you could license the Adobe PDF libraries and create a Java binding for them. It does not appear that Adobe has any plans to provide a Java library for accessing PDFs, free or otherwise. However, implementing a Java PDF reader would not be too hard--I started implementing one just to see how hard it would be and got as far as being able to get page objects by page number after an intense weekend's work [unfortunately my employment contract prevents me from creating open-source software without explicit approval, and I didn't want to create a PDF library that wasn't open source, so I haven't done any more work on it yet].
The PDF spec (www.pdfzone.com) is pretty clear, although the PDF format is pretty convoluted (lots of byte offsets and such). But once you get the basic infrastructure in place for parsing out specific objects, the rest is just tedious parser implementation--there are scads of different field types once you get down to text streams. Adding the business logic to figure out where things are on the page would be more involved--you'd have to implement Adobe's layout logic. However, you need this functionality in order to correlate PDF annotations (links, bookmarks, notes) to the page objects they relate to--it's all done with bounding boxes. Cheers, Eliot
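As a concrete illustration of the byte-offset plumbing mentioned above: per the PDF spec, a reader bootstraps itself by scanning the tail of the file for the startxref keyword, whose trailing number is the byte offset of the cross-reference table. A minimal sketch in plain Java (the class and method names are my own, not part of any library):

```java
import java.nio.charset.StandardCharsets;

public class StartXref {
    /** Scans the tail of a PDF byte stream for the "startxref" keyword
     *  and returns the cross-reference table offset that follows it,
     *  or -1 if the keyword is not found. */
    public static long findStartXref(byte[] pdf) {
        // Per the PDF spec, the end of the file looks like:
        //   startxref\n<byte offset>\n%%EOF
        // so only the last ~1KB needs to be examined.
        int from = Math.max(0, pdf.length - 1024);
        String tail = new String(pdf, from, pdf.length - from,
                                 StandardCharsets.ISO_8859_1);
        int i = tail.lastIndexOf("startxref");
        if (i < 0) return -1;
        // Take the first run of digits after the keyword.
        int j = i + "startxref".length();
        while (j < tail.length() && !Character.isDigit(tail.charAt(j))) j++;
        int k = j;
        while (k < tail.length() && Character.isDigit(tail.charAt(k))) k++;
        return (j == k) ? -1 : Long.parseLong(tail.substring(j, k));
    }
}
```

From that offset a parser can read the xref table and resolve indirect objects, which is the "basic infrastructure" the rest of the tedious field parsing hangs off of.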
XML Indexing With Lucene: New Location For Package
You can now find our package for doing XML indexing with Lucene on the ISOGEN web site: http://www.isogen.com/papers/lucene_xml_indexing.html The package (lucene_xml_indexing.zip) includes all the 3rd-party libraries it depends on (Lucene, Xerces 1.4.4, junit). This package is provided as-is and is not actively supported, but I do want to know if you run into any problems using it. Cheers, Eliot Kimber ISOGEN International, LLC [EMAIL PROTECTED]
Re: Zones
Ogren, Philip V. wrote: We are indexing a large corpus of XML documents (~10M). One thing that Verity does with XML documents is that it indexes each XML tag as a zone. What's cool about it is that the zones are nested so that the index mirrors the schema of your XML document. You can limit your search to any part of the document by searching on specific zones. A Verity zone is analogous to a Lucene field. Verity also has 'field' indexes--but these are a different kind of index that Lucene does not have. Verity fields allow you to index various numeric types, date types, etc. side by side with your textual index. The edge that Verity zones have over Lucene fields is that they are nested. However, nested fields can be simulated quite easily in Lucene by doing redundant indexing. I have a hunch this is what Verity does anyway, because their indexes are HUGE. The XML indexing scheme we developed for Lucene here at ISOGEN (and posted about late last year) provides more complete XML indexing than Verity can, because it is not limited by some of the constraints inherent in Verity's zone mechanism. Our indexing approach is also far more flexible than Verity's (or any other commercial system's), because relatively simple Java code can be used to extend the default indexing to optimize for specific DTDs or types of queries. Also, Verity is, as far as I know, unable to index elements or attributes that have . (period) in their names, because its indexers always treat . as a word separator. Doh. Of the commercial full-text indexers that do XML indexing, my analysis is that Verity does the best job, but it is still, in my opinion, not sufficiently complete or flexible to be useful in production. Otherwise, Verity is a fine full-text indexing system. Cheers, Eliot Kimber ISOGEN International, LLC
Trying To Understand Query Syntax Details
I'm trying to understand the details of the query syntax. I found the grammar in QueryParser.jj, but it doesn't make everything clear. My initial questions: - It doesn't appear that ? can be the last character in a search. For example, to match fool and food, I tried foo?, but got a parse error. fo?l of course matches fool and foal. Is this a bug or an implementation constraint? - How does one specify a date range in a query? We need to be able to search for docs later than date x, and I know that Lucene supports date matching, but I don't see how to specify this in a query. Also, is there a description of the algorithm ~ uses? Thanks, E. -- . . . . . . . . . . . . . . . . . . . . . . . . W. Eliot Kimber | Lead Brain 1016 La Posada Dr. | Suite 240 | Austin TX 78752 T 512.656.4139 | F 512.419.1860 | [EMAIL PROTECTED] w w w . d a t a c h a n n e l . c o m
Re: Trying To Understand Query Syntax Details
Scott Ganyo wrote: Not sure about the rest, but if you've stored your dates in yyyymmdd format, you can use a RangeQuery like so: dateField:[20011001-null] This would return all dates on or after October 1, 2001. Cool--thanks! E.
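The reason this trick works is that zero-padded yyyymmdd strings sort lexicographically in chronological order, which is what a term range comparison sees. A small sketch of producing such a query string (the field name dateField and the bracket syntax simply follow the suggestion quoted above; this is an illustrative helper, not part of Lucene's API):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DateRangeQuery {
    /** Formats a date as yyyyMMdd so that string order matches date
     *  order, then builds an on-or-after range query string in the
     *  style suggested in the post above. */
    public static String onOrAfter(LocalDate date) {
        // BASIC_ISO_DATE renders a LocalDate as e.g. "20011001".
        String stamp = date.format(DateTimeFormatter.BASIC_ISO_DATE);
        return "dateField:[" + stamp + "-null]";
    }
}
```

For example, onOrAfter(LocalDate.of(2001, 10, 1)) yields the string dateField:[20011001-null] from the post.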
XML Indexing Samples
I have put together a hopefully useful package that demonstrates our current experiments with using Lucene for XML indexing. You can get the files by anonymous ftp from che.isogen.com, /outgoing/lucene. There are two zip files: - lucene_xml_indexing.zip This is the core indexing code and a little Java app that lets you do searches and see the results (including going back to the original docs to get data not stored in the index). There is documentation that should get you going. It also includes Jython support for interacting with the indexer and Lucene if you don't like GUIs (I wrote the Jython first and then the GUI, if you're wondering). - lucene_xml_sample_index.zip This is a sample index containing three books from the New Testament out of the Jon Bosak World Religions document set. I've included this sample index because the index feature of the GUI may not work (it works when I run the code from JBuilder, but didn't appear to work when I ran it standalone, and I've run out of time to spend on this). Because the docids in the index are absolute file paths, you need to put the data dir in the zip file at the root of the same drive you're running the GUI from. This directory contains the original docs, which the GUI goes back to. Weak, I know, but it's just a demo. I haven't tested this stuff outside of Windows, but it should just work elsewhere. Let me know if there's some hideous problem with the package. Cheers, Eliot
Indexing XML With Lucene: Some Initial Results
We have continued to test our experiment of indexing XML docs by making one Lucene doc for each element. It seems to be working pretty well, although we haven't tried any really large-scale tests yet (will try to do that this coming week). I did do some informal testing with the World Religions document set provided by Jon Bosak of Sun Microsystems. Using the Xerces DOM implementation, it took about 75 seconds on a 900MHz PIII laptop to index the Book of Mormon (which is the biggest of the four works at 1.5 Meg of XML data). Searches across it were essentially instantaneous (but the index size was small in terms of the scales Lucene can support). I have not yet profiled the cost of things like collating hit lists by XML document (that is, all hits with the same docid field), but that should be purely a function of Java's speed at list iteration, not anything Lucene does. I also wrote client code that takes the treeloc in a given hit and looks up the corresponding node in the source document's DOM. This code was very fast too (again, using the Xerces DOM implementation). I had to do this because we aren't storing any of the XML data in the index itself (which you could do, but that seemed redundant given that the original documents are still accessible). Given the ability to store pretty much anything in fields, you could actually capture all of the original XML data in the Lucene index such that the original document could be reconstituted with sufficient fidelity. We are not currently taking that approach because we don't want to add that complexity to the Lucene index. But it does imply that Lucene could be used as an XML store where the original input documents are not subsequently kept. (Of course, I don't know if this approach would perform well enough, but it almost certainly wouldn't perform worse than existing XML-specific storage systems that decompose docs at the element level.)
[I personally don't like storage systems that store XML documents only as decomposed bits, which is why we're not taking that approach in this project--we're treating the Lucene indexes as purely transient indexes over a separately-managed authoritative datastore. This protects us, for example, from changes to the index rules, such as changing fields from indexed to non-indexed or changing the rules for particular fields. It's much easier and faster to simply re-index existing docs than to do some sort of export/re-import process.] I'm also starting to think about additional contextual information that could be captured in the index to make it possible to do even more contextual qualification at the Lucene query level. Will require more experimentation and thought. Again, the basic approach is very simple: for a given XML document, walk the DOM tree, creating one Lucene doc for each element node, where each Lucene doc has a docid field whose value is the same for all docs created from the same XML document, a tagname field, an ancestors field (the ordered list of ancestor element types for the element), a treeloc field, which is the DOM tree location of the element (e.g., 0 1 0 3 for the 4th child of the first child of the second child of the document element), and a nodetype field that indicates the DOM node type that has been indexed (we also index processing instructions and comments and could do more). We also capture any attributes as fields as well, enabling searching on attribute values. For the text content of the document, we are capturing only the directly-contained content of each element and indexing that as the content field. We also capture all the PCDATA content for the whole document and index it on a separate Lucene doc with a distinct node type (e.g., ALL_CONTENT). 
This enables phrase searching that ignores element boundaries and can also allow for faster queries if all you care about is whether or not a given doc has some text and not which elements have it. We then have a front-end that both handles preparing the queries that go to Lucene and collating the results that come back (for example, organizing the hits by XML document or doing additional context-based filtering that can't be done at the Lucene level). Cheers, Eliot
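The per-element walk described above can be sketched with the JDK's own DOM classes. This toy version (names are my own, not the actual ISOGEN code) emits one record per element carrying the tagname, treeloc, and ancestors fields; docid, nodetype, attributes, and content handling are omitted to keep the sketch short:

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ElementIndexer {
    /** Walks the DOM and returns one record per element, formatted as
     *  "tagname|treeloc|ancestors". In the real scheme each record
     *  would become a Lucene Document with one field per item. */
    public static List<String> index(String xml) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        List<String> records = new ArrayList<String>();
        // The document element gets treeloc "0", matching the "0 1 0 3"
        // style addresses described in the post.
        walk(dom.getDocumentElement(), "0", "", records);
        return records;
    }

    private static void walk(Element e, String treeloc, String ancestors,
                             List<String> records) {
        records.add(e.getTagName() + "|" + treeloc + "|" + ancestors.trim());
        NodeList kids = e.getChildNodes();
        int elemIndex = 0; // position among element siblings only
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid.getNodeType() == Node.ELEMENT_NODE) {
                walk((Element) kid, treeloc + " " + elemIndex,
                     ancestors + " " + e.getTagName(), records);
                elemIndex++;
            }
        }
    }
}
```

Feeding it a document like &lt;a&gt;&lt;b/&gt;&lt;c&gt;&lt;d/&gt;&lt;/c&gt;&lt;/a&gt; yields records such as "d|0 1 0|a c", i.e. the element's type, its tree location, and the ordered ancestor list that the contextual queries filter on.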
Another Indexing Question: Case Sensitivity
From reading the docs, my understanding is that if you want to enable both case-sensitive and case-insensitive searches, you must have two indexes: one that uses a case-insensitive analyzer and one that uses a case-sensitive one--is this correct? Thanks, E.
Re: Index Optimization: Which is Better?
Steven J. Owens wrote: I think that's exactly what Elliot is intending. Steven is correct. For each element in the XML document we create a separate Lucene document with the following fields: - docid (unique identifier of the input XML document, e.g., file system path, object ID from a repository, URL, etc.) - list of ancestor element types - DOM tree location - text of direct PCDATA content - DOM node type (Element_node, processing_instruction_node, comment_node) [This list is probably incomplete but it was enough for us to test the idea.] - For each attribute of the element, a field whose name is the attribute name and whose value is the attribute value. We also capture all the text content of the input XML document as a single Lucene document with the same docid and the node type all_content. Given these Lucene documents, I can do queries like this: big brown dog AND ancestor:tag2 AND NOT ancestor:tag3 AND language:english This will result in one doc for each element instance that contains the text big brown dog, is within a tag2 element, not within a tag3 element, and has the value english for its language attribute. To make sure you match the phrase if it crosses element boundaries, just include the all-content doc as well: big brown dog ((AND ancestor:tag2 AND NOT ancestor:tag3 AND language:english) OR (nodetype:ALL_CONTENT)) Given this set of Lucene docs, we can then collect them by docid to determine which XML documents are represented. The ancestor list and tree location enable correlating each hit back to its original location in the input document. It also enables post-processing to do more involved contextual filtering, such as find 'foo' in all paras that are first children of chapters. We have implemented a first pass at code that does this indexing but we have no idea how it will perform (we only got this fully working yesterday and haven't had time to stress it yet). I agree that this is somewhat twisted.
In fact my colleague John Heintz, who suggested the approach of one Lucene doc per element, characterized the idea as an abuse of Lucene's design. But we haven't been able to think of a better or easier way to do it. It was really easy to write the DOM processing code to generate this index, and the interaction with Lucene's API couldn't have been easier--this is my first experience programming against Lucene and I'm really impressed with the simplicity of the API and the power of the architecture. The functionality described above for XML retrieval already surpasses anything I know how to do with Verity, Fulcrum, Excalibur, etc., and it was freaky easy to do once we got the idea for the approach. I just hope it performs adequately. Cheers, E.
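The collect-by-docid step mentioned above amounts to a simple grouping pass over the hit list. A standalone sketch (the Hit class here is a stand-in for whatever the real front-end gets back from Lucene, not an actual Lucene type):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HitCollator {
    /** One per-element hit: the docid of the source XML document plus
     *  the treeloc of the matching element, following the field scheme
     *  described above. */
    public static class Hit {
        public final String docid;
        public final String treeloc;
        public Hit(String docid, String treeloc) {
            this.docid = docid;
            this.treeloc = treeloc;
        }
    }

    /** Groups per-element hits by docid, preserving result order, so
     *  that many element-level hits reduce to one entry per XML
     *  document for the final query result. */
    public static Map<String, List<String>> byDocument(List<Hit> hits) {
        Map<String, List<String>> grouped =
                new LinkedHashMap<String, List<String>>();
        for (Hit h : hits) {
            List<String> locs = grouped.get(h.docid);
            if (locs == null) {
                locs = new ArrayList<String>();
                grouped.put(h.docid, locs);
            }
            locs.add(h.treeloc);
        }
        return grouped;
    }
}
```

Because the treelocs are kept per document, the same grouped structure can feed the post-processing filters (e.g., "paras that are first children of chapters") without another trip to the index.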
Index Optimization: Which is Better?
We are experimenting with XML-aware indexing. The approach we're trying is to index every element in a given XML document as a separate Lucene document, along with another Lucene document that captures just the concatenated text content of the document (to handle searching for phrases across element boundaries)--what we're calling the all-content Lucene document. We are using a node type field to distinguish the different types of XML document constructs we are indexing (elements, comments, PIs, etc.) and also thought we would use node type to distinguish the all-content document. When we get a hit list, we can then use the node type to figure out which XML constructs contained the target text and reduce the per-element Lucene documents to single XML documents for the final query result. We can also use node type to limit the query (you might want to search just in PIs or just in comments, for example). Our question is this: given that for the all-content document we could either use the default content field for the text and the node type field to label the document as the all-content node, or simply use a different field name for the content (e.g., alltext or something), which of the following queries would tend to perform better? This: some text AND nodetype:ALL_CONTENT or this: alltext:some text Or is there any practical difference? Which way we construct the Lucene document will affect how our front-end and/or users have to construct queries. It would be slightly more convenient for front-ends to get the all-content doc by default (using the content field for the text), but we thought the AND query needed to limit searches to just the text (thus ignoring element-specific searching) might incur a performance penalty. In a related question, is there anything we can or need to do to optimize Lucene to handle lots of little Lucene documents? Thanks, Eliot