On Thu, 6 Mar 2003, sandy pittendrigh wrote:

> I have an off-the-cuff idea I wonder if anybody else
> has considered: "does it make any sense to think
> about using apache::lucene as an alternate, fuzzy-search
> mechanism over collections of XML files, rather than, or
> in addition to xpath?"

Well, I suggested it only a few days ago, so yes. ;) The main question, as
far as I'm aware, is the data store itself. It would be fantastic to use
the same data store, and provide alternate query mechanisms. Otherwise,
there would have to be a translator between pure xml documents as stored
in Xindice and Documents as stored in Lucene.

>
> Lucene appears to provide a way of indexing words
> and word proximities in otherwise free-form text
> documents. You could, for instance, use a term modifier
> like ["jakarta apache"~10] to find all the documents that
> contain the words jakarta and apache, appearing no
> more than ten words apart from each other.
>
> To the extent this query language is useful over
> completely unstructured, free-form text, it seems likely
> that it (the lucene query language) would be even more
> powerful operating over more regularly structured text, like XML files.

Yes, but you need to define "fields" for Lucene in order for it to
distinguish one part of a document from another. You would have a query
like:

subject:"Lucene as an alternate Query Mechanism" AND
email:"[EMAIL PROTECTED]"

I suppose you could use Lucene and create an exact correspondence between
xml attributes, xml element values, and Lucene fields. This would,
however, make it difficult to manage, since most people use xml so
differently from one another. Furthermore, Lucene cannot return a document
as xml structured data. You could add the entire xml document as text in
its own field, but that would make it much less useful for querying within
particular attributes and/or element values. So, either way, you need to
add fields to the documents. In my opinion, XPath is much nicer for this.
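
To make the "exact correspondence" idea concrete, here is a rough sketch in
plain Python. It does not use the Lucene or Xindice APIs at all; the
function name xml_to_fields, the "path@attribute" naming scheme, and the
sample document are all made up for illustration. It only shows how xml
attributes and element values could be flattened into named fields of the
kind a Lucene Document would need:

```python
import xml.etree.ElementTree as ET


def xml_to_fields(xml_text):
    """Flatten an XML document into {field_name: [values]}.

    Field names are hypothetical: "a/b" for element text and
    "a/b@attr" for attribute values.
    """
    fields = {}

    def walk(elem, path):
        name = f"{path}/{elem.tag}" if path else elem.tag
        # Every attribute becomes its own field.
        for attr, value in elem.attrib.items():
            fields.setdefault(f"{name}@{attr}", []).append(value)
        # Non-empty element text becomes a field named after the path.
        text = (elem.text or "").strip()
        if text:
            fields.setdefault(name, []).append(text)
        for child in elem:
            walk(child, name)

    walk(ET.fromstring(xml_text), "")
    return fields


# Hypothetical sample document, not taken from any real collection.
SAMPLE = """<message id="42">
  <subject>Lucene as an alternate Query Mechanism</subject>
  <author role="poster">David</author>
</message>"""

if __name__ == "__main__":
    for field, values in xml_to_fields(SAMPLE).items():
        print(field, values)
```

Even in this toy form the management problem shows up: the field names
depend entirely on how each person happens to structure their xml.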

Worst case scenario, I will probably keep track of a certain set of xml
attributes and element values that I want to be known as Lucene fields. I
will probably do this with an xml attribute such as 'index="true"' (or "false")
associated with every element. If somehow the presence of that attribute
in Xindice would "automatically" cause it to be indexed by Lucene, that
would be cool. Since I will probably do this anyway, I will offer my
code, but it will be hard to keep it generic enough to be used by any
Xindice user unless I deliberately design it that way.
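
As a rough illustration of that marker attribute, the following sketch (plain
Python, standard library only, no Lucene or Xindice calls) picks out just the
elements flagged with index="true". The function name indexed_fields, the
choice of the element tag as the field name, and the sample document are my
own assumptions, not working code from either project:

```python
import xml.etree.ElementTree as ET


def indexed_fields(xml_text, marker="index"):
    """Collect text of elements carrying marker="true".

    Returns {element_tag: [text, ...]}; elements without the marker,
    or with any other value, are skipped.
    """
    fields = {}
    for elem in ET.fromstring(xml_text).iter():
        if elem.get(marker) == "true":
            text = (elem.text or "").strip()
            fields.setdefault(elem.tag, []).append(text)
    return fields


# Hypothetical document using the index="true"/"false" convention.
DOC = """<message>
  <subject index="true">Lucene as an alternate Query Mechanism</subject>
  <body index="false">Long free-form text that should not become a field</body>
  <sender index="true">David</sender>
</message>"""

if __name__ == "__main__":
    print(indexed_fields(DOC))
```

The resulting name/value pairs are what would then be handed to Lucene as
fields; how to hook that step into Xindice automatically is the open part.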

>
> Lucene is more of a search-engine technology than a database
> technology, where answer sets are expected to have an attractive ratio
> between relevant and irrelevant data, rather than
> the rigid, 100% metadata criteria matches possible with
> xpath queries over XML data.
>
> Does it make sense for projects like Xindice to have alternate,
> plug-in-like ways to search and query the same datasets? Or should alternate
> query technologies exist as disparate, separate software entities?
>
>

I think this primarily relates to the format of the data store. Lucene
obviously uses some type of database format for its document index, as
does Xindice. You would probably have to have a different database format
to support a different query structure.

-David
