On 29 Nov 2003, at 16:28, Erik Hatcher wrote:

On Saturday, November 29, 2003, at 07:00 AM, Stefano Mazzocchi wrote:
I think you'll find that Lucene will serve Slide's needs nicely - you'll just have to be a little creative in how you build Lucene Document objects and break things into fields. Lucene is a "flat" structure - so implying hierarchy requires some thought - perhaps just the URI will work to give you the hierarchy you need. But if properties are also hierarchical (can't non-live, or "dead", properties contain an entire DOM tree?) then things will get more interesting and tricky.
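
As a rough illustration of what breaking things into fields could look like, here is a minimal sketch in plain Python, not against the Lucene API. The `flatten` helper and the path-style field names are assumptions made for this example; neither Slide nor Lucene prescribes them.

```python
def flatten(uri, properties):
    """Turn a resource with nested properties into one flat field map.

    Hypothetical scheme: nested property names are joined with "/" so the
    hierarchy survives as part of the field name.
    """
    fields = {"uri": uri}  # the URI itself carries the collection hierarchy

    def walk(prefix, value):
        if isinstance(value, dict):
            for name, child in value.items():
                walk(f"{prefix}/{name}" if prefix else name, child)
        else:
            # Index-style fields hold text, so every leaf is stored as a string.
            fields[prefix] = str(value)

    walk("", properties)
    return fields
```

A nested "dead" property then shows up as a single field such as `meta/author/name`, which a flat index can store like any other field.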

Hmmm, seems to me like trying to fit a square peg into a round hole.

Perhaps. But, folks are doing object-relational mapping with databases. A database could be viewed as simply a flat structure of bytes on a disk.

well, that's a relational database without relations and yeah, that's normally called "persistent hashtable", but I digress :-)

So, mapping Lucene's flat structure into something more structured and hierarchical is doable.

cool

ZOE (the e-mail client-server-indexer) does a lot of this type of stuff using Lucene, in fact.

I've used ZOE in the past and I knew it was using Lucene, but I didn't know it did that.

But, certainly it is just one possible solution and may not be the most pragmatic one. If a database is being used for property storage already, then Lucene might be overkill for a query like you provided.

Keep in mind that queries like those are not borderline cases; they are the ones the system will be running most of the time.


Can you elaborate more on how you would do a query like

 SELECT {DAV}allprop
 FROM /files/whatever
 WHERE {DAV}contentlength > 40000
 ORDER BY {DAV}lastmodified

on top of lucene?

I would AND together a PrefixQuery for URI "/files/whatever" (allowing it to search a sub-tree rooted there) with a RangeQuery on field "contentlength" for values 40000 and greater.
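
To make the shape of that query concrete, here is a minimal sketch in plain Python rather than against the Lucene API. It assumes a toy inverted index (`TinyIndex`, a hypothetical name) in which each field maps terms to sets of document ids. The zero-padding of `contentlength` is an assumption of the sketch too: term ranges compare as strings, so unpadded numbers would sort incorrectly.

```python
from bisect import bisect_left
from collections import defaultdict

PAD = 12  # assumed width; terms compare as strings, so numbers are zero-padded


class TinyIndex:
    """Toy stand-in for a flat inverted index: field -> term -> set of doc ids."""

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, uri, contentlength):
        self.postings["uri"][uri].add(doc_id)
        self.postings["contentlength"][str(contentlength).zfill(PAD)].add(doc_id)

    def prefix(self, field, prefix):
        """Docs whose term in `field` starts with `prefix`."""
        terms = sorted(self.postings[field])
        hits = set()
        # In sorted order, all terms sharing the prefix are contiguous.
        for term in terms[bisect_left(terms, prefix):]:
            if not term.startswith(prefix):
                break
            hits |= self.postings[field][term]
        return hits

    def at_least(self, field, lower):
        """Docs whose padded numeric term is >= `lower` (inclusive bound)."""
        terms = sorted(self.postings[field])
        low = str(lower).zfill(PAD)
        hits = set()
        for term in terms[bisect_left(terms, low):]:
            hits |= self.postings[field][term]
        return hits


def scoped_query(index, scope, min_length):
    """AND the prefix clause on the URI with the range clause on the length."""
    return index.prefix("uri", scope) & index.at_least("contentlength", min_length)
```

Passing the scope with a trailing slash (`"/files/whatever/"`) keeps a sibling such as `/files/whatever2` out of the result. The lower bound is inclusive here, matching "40000 and greater"; a strict `>` would need an exclusive bound.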

Hmmm. This seems to have O(n) complexity on scoping. I was aiming to obtain an O(1) complexity on scoping and O(n) on the rest (WHERE and ORDER). O(n) on scoping is not going to be fast enough on very large collections of documents.

Ordering is not something Lucene does though, other than by its score calculation, so this is where the mismatch occurs most strongly.

that's not a big issue.

If you're doing full-text searching combined with these types of conditions and want the order to be by how well the documents match your query then Lucene will shine.

yes, but in that case, I think Lucene should handle its own query language through a specific DASL implementation... using a text-oriented search engine for relational stuff is, IMO, a little abusive.

Traditional relational database type of queries with ORDER BY clauses don't map as well. Ordering, though, can be applied after the query results are returned in this case as you will want to collect all documents that match the query anyway. I'd almost be willing to bet that Lucene will beat most, if not all, relational databases here especially in this case where the hierarchy is being recursively traversed.
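
Applying the order after the fact can be as small as the sketch below. It assumes each hit is a plain dict of stored fields and that `lastmodified` is kept as an ISO 8601 UTC string in one fixed format, so that string order matches time order; both are assumptions of this example, not something Slide or Lucene dictates.

```python
from operator import itemgetter


def order_hits(hits, field="lastmodified"):
    """Sort already-collected hits by a stored field (ascending)."""
    # Every matching document has been collected anyway, so this is a plain
    # in-memory sort over the result set, independent of the index.
    return sorted(hits, key=itemgetter(field))
```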

Not sure about that.

Lucene is not relational, so it will have to scan the entire list of documents to determine whether each one belongs to a particular scope or not.

In a relational model, if scopes are kept relational, I can access the scopes with O(m) complexity where 'm' is the number of paths in the scope (not more than 20, I believe), then I can obtain a list of all the files in that scope (we can precompute all those scope->URL relations... it would make the writing operations more expensive, but the lookup will be O(1)).
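
The precomputed scope->URL idea can be sketched with an in-memory table instead of a real database. `ScopeTable` and its method names are hypothetical, and the sketch assumes absolute, slash-separated URLs.

```python
from collections import defaultdict
from posixpath import dirname


class ScopeTable:
    """Precomputed scope -> URLs relation, kept up to date on every write."""

    def __init__(self):
        self.members = defaultdict(set)

    @staticmethod
    def ancestors(url):
        """Yield every collection above `url`: /a/b/c -> /a/b, /a, /."""
        if not url.startswith("/"):
            raise ValueError("expected an absolute path")
        path = url
        while path != "/":
            path = dirname(path)
            yield path

    def add(self, url):
        # Writes cost one insert per ancestor, i.e. O(depth of the URL).
        for scope in self.ancestors(url):
            self.members[scope].add(url)

    def remove(self, url):
        for scope in self.ancestors(url):
            self.members[scope].discard(url)

    def in_scope(self, scope):
        # A single hash lookup, regardless of how many documents exist overall.
        return self.members.get(scope, set())
```

The trade-off is the one described above: each write touches every ancestor scope, while reading the members of a scope no longer depends on the total number of documents.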

note that since DASL queries will be more or less the same all the time, it is possible to think of a relational model that will optimize them greatly.

but I'm no database guru, so I might be completely off-topic here... does anybody have other opinions on this matter?

--
Stefano.
