One way to implement hierarchical documents is through the use of
predefined phrases. Consider the 2 hierarchies:
1. Kids_and_Teens/Computers/Software/Games
2. Computers/Software/Freeware
When indexing a document belonging to (1), add these terms in consecutive
order (autoincrement=1, so each term sits one position after the previous):
  dir:Top dir:Kids_and_Teens dir:Computers dir:Software dir:Games dir:Bottom
For documents belonging to (2), add:
  dir:Top dir:Computers dir:Software dir:Freeware dir:Bottom
The terms dir:Top and dir:Bottom can be used to anchor a query
to a specific portion of the hierarchy.
Thus, a query containing the phrase: dir:Computers dir:Software would
match documents in both (1) and (2) (and perhaps others), but a query for:
dir:Top dir:Kids_and_Teens dir:Computers dir:Software would target only
'Computers/Software' documents from the 'Kids_and_Teens' top level directory.
(The PhraseQuery 'slop factor' would be set to 0.)
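A rough sketch of the idea, in plain Python rather than Lucene code: the
helpers path_terms and phrase_match are made-up names for illustration, and
phrase_match only imitates what an exact phrase match (slop 0) does over
consecutive term positions.

```python
from typing import List


def path_terms(path: str) -> List[str]:
    """Turn 'A/B/C' into ['dir:Top', 'dir:A', 'dir:B', 'dir:C', 'dir:Bottom']."""
    parts = [p for p in path.split("/") if p]
    return ["dir:Top"] + [f"dir:{p}" for p in parts] + ["dir:Bottom"]


def phrase_match(terms: List[str], phrase: List[str]) -> bool:
    """Exact phrase match (slop 0): the phrase must appear as a contiguous run."""
    n = len(phrase)
    return any(terms[i:i + n] == phrase for i in range(len(terms) - n + 1))


# Terms indexed for one document in each hierarchy.
doc1 = path_terms("Kids_and_Teens/Computers/Software/Games")
doc2 = path_terms("Computers/Software/Freeware")

# Unanchored phrase: can match anywhere in the path.
unanchored = ["dir:Computers", "dir:Software"]

# Anchored phrase: dir:Top pins the match to the start of the path.
anchored = ["dir:Top", "dir:Kids_and_Teens", "dir:Computers", "dir:Software"]
```

Here phrase_match(doc1, unanchored) and phrase_match(doc2, unanchored) both
hold, while the anchored phrase matches only doc1.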
Peter
----- Original Message -----
From: Tatu Saloranta [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, October 20, 2003 8:24 PM
Subject: Re: Hierarchical document
On Monday 20 October 2003 10:31, Erik Hatcher wrote:
> On Monday, October 20, 2003, at 11:06 AM, Tom Howe wrote:
> There is not a more Lucene way to do this - it's really up to you to
> be creative with this. I'm sure there are folks that have implemented
> something along these lines on top of Lucene. In fact, I have a
> particular interest in doing so at some point myself. This is very
> similar to the object-relational issues surrounding relational
> databases - turning a pretty flat structure into an object graph.
> There are several ideas that could be explored by playing tricks with
> fields, such as giving them a hierarchical naming structure and
> querying at the level you like (think Field.Keyword and PrefixQuery,
> for example), and using a field to indicate type and narrowing queries
> to documents of the desired type.
> I'm interested to see what others have done in this area, or what ideas
> emerge about how to accomplish this.
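One way to picture the field trick mentioned in the quoted message, as a
plain-Python stand-in and not Lucene code: each document keeps its whole
path as a single untokenized value (the role Field.Keyword would play), and
a prefix test plays the role of PrefixQuery. The field names "path" and
"type", the sample documents and the helper prefix_search are all
assumptions made up for this sketch.

```python
from typing import Dict, List, Optional

# Hypothetical documents: 'path' holds the full hierarchy as one value,
# 'type' says what kind of document it is.
docs: List[Dict[str, str]] = [
    {"id": "a", "type": "article", "path": "Computers/Software/Freeware"},
    {"id": "b", "type": "review", "path": "Computers/Software/Games"},
    {"id": "c", "type": "article", "path": "Kids_and_Teens/Computers/Software/Games"},
    {"id": "d", "type": "article", "path": "Computers/Hardware"},
]


def prefix_search(
    docs: List[Dict[str, str]],
    prefix: str,
    doc_type: Optional[str] = None,
) -> List[str]:
    """Return ids of documents whose path starts with prefix,
    optionally narrowed to a single document type."""
    return [
        d["id"]
        for d in docs
        if d["path"].startswith(prefix)
        and (doc_type is None or d["type"] == doc_type)
    ]


# Everything under Computers/Software, at any depth.
under_software = prefix_search(docs, "Computers/Software/")

# Same level, narrowed to documents of type 'article'.
software_articles = prefix_search(docs, "Computers/Software/", doc_type="article")
```

The trailing slash in the prefix keeps a sibling such as a hypothetical
"Computers/Software_Archive" from matching by accident.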
I'm planning to do something similar. In my case the problem is a bit simpler;
documents have associated products, and products form a hierarchy.
Searches should be able to match not only direct matches (searching
product, article associated with product), but also indirect ones via
membership (product member of a product group, matching group).
The product hierarchy also has variable depth.
To do searches using non-leaf hierarchy items (groups), all actual product
items/groups associated with docs are expanded to full ids when
indexing (i.e. they contain the path from the root, up to and including the
node, each node component having its own unique id).
Thus, when searching for an intermediate node (product grouping),
a match occurs since that node id is part of the path to products that are in
the group (either directly or as members of sub-groups).
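A small sketch of that expansion, in plain Python with an in-memory
dictionary instead of a real index. The product tree, the ids (g1, p10 and
so on) and the helpers full_path and build_index are invented for
illustration only.

```python
from typing import Dict, List, Optional, Set

# Hypothetical product tree: node id -> parent id (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "g1": "root",   # product group
    "g2": "g1",     # sub-group of g1
    "p10": "g2",    # product in g2
    "p20": "g1",    # product directly in g1
    "p30": "root",  # product outside any group
}


def full_path(node: str) -> List[str]:
    """Expand a node to its full id: every node id from the root down to it."""
    path: List[str] = []
    cur: Optional[str] = node
    while cur is not None:
        path.append(cur)
        cur = PARENT[cur]
    path.reverse()
    return path


def build_index(doc_products: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each node id to the documents whose products have it in their path."""
    index: Dict[str, Set[str]] = {}
    for doc_id, products in doc_products.items():
        for product in products:
            for node in full_path(product):
                index.setdefault(node, set()).add(doc_id)
    return index


index = build_index({"doc1": ["p10"], "doc2": ["p20"], "doc3": ["p30"]})
```

Looking up index["g1"] then returns doc1 and doc2, since p10 is in a
sub-group of g1 and p20 is a direct member, without walking the tree at
query time.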
Since no such path is stored (directly) in the database, this also allows me
to do queries that would be impossible to do in the database (I could add
similar path/full id fields for search purposes of course). Thus, the Lucene
index is optimized for searching purposes, and the database structure for
editing and retrieval of data.
Another thing to keep in mind is that, at least for metadata, it may make
sense to use a specialized analyzer, one that tokenizes specific ids
as separate tokens, instead of using some standard plain text
analyzer. This way it is possible to separate ids from textual words (by
using prefixes, for example, @1253 or #13945); this allows for accurate
matching based on identity of associated metadata selections.
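To illustrate the tokenizing idea, here is a minimal sketch in plain Python
(not a Lucene Analyzer). It assumes ids are written as '@' or '#' followed
by digits, as in the examples above; the regular expression and the
tokenize helper are illustrative choices, not part of any library.

```python
import re
from typing import List

# Assumed convention: an id is '@' or '#' followed by digits.
# Anything else is treated as an ordinary word.
TOKEN_RE = re.compile(r"[@#]\d+|\w+")


def tokenize(text: str) -> List[str]:
    """Split text into tokens, keeping prefixed ids intact and
    lowercasing ordinary words."""
    return [
        tok if tok[0] in "@#" else tok.lower()
        for tok in TOKEN_RE.findall(text)
    ]


tokens = tokenize("Manual for @1253 and #13945, rev 1253")
```

Because the prefix stays part of the token, the id @1253 and the plain
number 1253 end up as different terms, so a search on the id does not hit
documents that merely mention the number.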
-+ Tatu +-
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]