Re: Lucene vs. in-DB-full-text-searching

2005-02-24 Thread Kevin A. Burton
Otis Gospodnetic wrote:
The most obvious answer is that the full-text indexing features of
RDBMS's are not as good (as fast) as Lucene.  MySQL, PostgreSQL,
Oracle, MS SQL Server etc. all have full-text indexing/searching
features, but I always hear people complaining about the speed.  A
person from a well-known online bookseller told me recently that Lucene
was about 10x faster than MySQL for full-text searching, and I am
currently helping someone get away from MySQL and into Lucene for
performance reasons.
 

Also, MySQL full-text search isn't perfect. If you're not a Java
programmer it would be difficult to hack on. Another downside is that
full-text (FT) indexing in MySQL only works with MyISAM tables, which
aren't transaction-aware and use table-level locks (not fun).

I'm sure, though, that MySQL would do a better job at online index
maintenance than Lucene, which falls down a bit in this area.

Kevin
--
Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an 
invite!  Also see irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
If you're interested in RSS, Weblogs, Social Networking, etc... then you 
should work for Rojo!  If you recommend someone and we hire them you'll 
get a free iPod!
   
Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene vs. in-DB-full-text-searching

2005-02-24 Thread Kevin A. Burton
David Sitsky wrote:
On Sat, 19 Feb 2005 09:31, Otis Gospodnetic wrote:
 

You are right.
Since there are C++ and now C ports of Lucene, it would be interesting
to integrate them directly with DBs, so that the RDBMS full-text search
under the hood is actually powered by one of the Lucene ports.
   

Or to see Lucene + Derby (a 100% Java embedded database donated by IBM,
currently in Apache incubation) integrated together... that would be
really nice and powerful.

Does anyone know if there are any integration plans?
 

Don't forget Berkeley DB Java Edition... that would be interesting too.

Kevin



Re: Lucene vs. in-DB-full-text-searching

2005-02-22 Thread David Sitsky
On Sat, 19 Feb 2005 09:31, Otis Gospodnetic wrote:
 You are right.
 Since there are C++ and now C ports of Lucene, it would be interesting
 to integrate them directly with DBs, so that the RDBMS full-text search
 under the hood is actually powered by one of the Lucene ports.

Or to see Lucene + Derby (a 100% Java embedded database donated by IBM,
currently in Apache incubation) integrated together... that would be
really nice and powerful.

Does anyone know if there are any integration plans?

-- 
Cheers,
David

This message is intended only for the named recipient.  If you are not the 
intended recipient you are notified that disclosing, copying, distributing 
or taking any action  in reliance on the contents of this information is 
strictly prohibited.





Re: Lucene vs. in-DB-full-text-searching

2005-02-19 Thread Steven J. Owens
On Fri, Feb 18, 2005 at 04:45:50PM -0500, Mike Rose wrote:
 I can comment on this since I'm in the middle of excising Oracle text
 searching and replacing it with Lucene in one of my projects.

 Interesting, particularly as it's from somebody who's already
tried an existing in-db fulltext search feature.

 All in all, I don't think that a JDBC wrapper is going to do what
 you want.

 I wasn't thinking about trying to do the whole thing under the
JDBC driver.  Mainly I was thinking that one key point is that you
need to treat the lucene index somewhat like a cache.  This also means
that you have to watch database writes and make sure you update your
cache, which means you have to have some sort of single point of data
access to monitor.  Well, we already have that - it's called the JDBC
driver.

 The general design I was eyeing speculatively is basically that
the driver would be set up with a reference to an object that
implements a CacheManager interface.  This interface basically gives
the driver a way to notify the cache manager of when certain tables
and columns are being edited.  Exactly how is another question.  I
don't know enough of the innards of, say, a PreparedStatement, to say
more.  It could be as simple as sending the CacheManager a copy of
every SQL query string and letting the CacheManager figure out the
rest.  Ideally I'd like it to be a little bit more structured.

 From there, it's the CacheManager's job to decide what to do
about it, and how to do it.  This leaves the tricky issue of mapping
from a specific database to a specific lucene index up to the
developer.
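
A minimal sketch of the design described above, in its simplest form: the wrapper hands every SQL string to a CacheManager and lets it work out the rest. Every name here (CacheManager.sqlExecuted, DirtyTableTracker, NotifyingStatements) is hypothetical and invented for illustration. The regex only recognizes simple INSERT/UPDATE/DELETE statements, and the sketch wraps a plain java.sql.Statement rather than a PreparedStatement or a whole driver.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.sql.Statement;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical callback interface: the wrapper reports every SQL string it sees. */
interface CacheManager {
    void sqlExecuted(String sql);
}

/** Naive manager: remembers which tables were written so they can be re-indexed. */
final class DirtyTableTracker implements CacheManager {
    // Assumption: only simple INSERT INTO / UPDATE / DELETE FROM statements are recognized.
    private static final Pattern WRITE = Pattern.compile(
            "^\\s*(?:insert\\s+into|update|delete\\s+from)\\s+([A-Za-z_][A-Za-z0-9_.]*)",
            Pattern.CASE_INSENSITIVE);

    private final Set<String> dirtyTables = new LinkedHashSet<>();

    @Override
    public synchronized void sqlExecuted(String sql) {
        Matcher m = WRITE.matcher(sql);
        if (m.find()) {
            dirtyTables.add(m.group(1).toLowerCase(Locale.ROOT));
        }
    }

    /** Returns a copy of the tables written so far, in first-seen order. */
    synchronized Set<String> dirtyTables() {
        return new LinkedHashSet<>(dirtyTables);
    }
}

final class NotifyingStatements {
    private NotifyingStatements() {}

    /** Wraps a Statement so SQL strings passed to execute* calls reach the manager. */
    static Statement wrap(Statement delegate, CacheManager manager) {
        InvocationHandler handler = (proxy, method, args) -> {
            Object result;
            try {
                result = method.invoke(delegate, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow the driver's own exception
            }
            // Notify only after the underlying call returned without throwing.
            if (method.getName().startsWith("execute")
                    && args != null && args.length > 0 && args[0] instanceof String) {
                manager.sqlExecuted((String) args[0]);
            }
            return result;
        };
        return (Statement) Proxy.newProxyInstance(
                Statement.class.getClassLoader(), new Class<?>[] {Statement.class}, handler);
    }
}
```

With this in place, code that obtains its Statement through NotifyingStatements.wrap(...) feeds the tracker as a side effect, and a background job can periodically re-index whatever dirtyTables() reports. Batched statements (addBatch) and parameterized writes are not covered; that is where the "little bit more structured" notification would be needed.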

-- 
Steven J. Owens
[EMAIL PROTECTED]

I'm going to make broad, sweeping generalizations and strong,
 declarative statements, because otherwise I'll be here all night and
 this document will be four times longer and much less fun to read.
 Take it all with a grain of salt. - http://darksleep.com/notablog





Lucene vs. in-DB-full-text-searching

2005-02-18 Thread Steven J. Owens
Hi,

 I was rambling to some friends about an idea to build a
cache-aware JDBC driver wrapper, to make it easier to keep a lucene
index of a database up to date.

 They asked me a question that I have to take seriously, which is
that most RDBMSes provide some built-in fulltext searching - postgres,
mysql, even oracle - why not use that instead of adding another layer
of caching?

 I have to take this question seriously, especially since it
reminds me a lot of what Doug has often said to folks contemplating
doing similar things (caching query results, etc) with Lucene.

 Has anybody done some serious investigation into this, and could
summarize the pros and cons?




Re: Lucene vs. in-DB-full-text-searching

2005-02-18 Thread Otis Gospodnetic
The most obvious answer is that the full-text indexing features of
RDBMS's are not as good (as fast) as Lucene.  MySQL, PostgreSQL,
Oracle, MS SQL Server etc. all have full-text indexing/searching
features, but I always hear people complaining about the speed.  A
person from a well-known online bookseller told me recently that Lucene
was about 10x faster than MySQL for full-text searching, and I am
currently helping someone get away from MySQL and into Lucene for
performance reasons.

Otis







RE: Lucene vs. in-DB-full-text-searching

2005-02-18 Thread Mike Rose
I can comment on this since I'm in the middle of excising Oracle text
searching and replacing it with Lucene in one of my projects.

Oracle does provide mechanisms for creating fuzzy indexes of text and
doing word stemming as well, has a scoring mechanism, etc...  However,
this requires additional licensing (or an enterprise license, big $$$)
and index creation is slow.  Unlike other indexes in Oracle, this needs
to be explicitly dropped and recreated in order to pick up changes to
the content, and you can't update a single entry in the index, you have
to do the whole thing in one shot.  That being said, it has been
successful for me so far, you just have to use some non-standard funky
SQL operators to make use of it.

So why am I switching to Lucene on this project?

Speed: Lucene is faster at indexing and searching.

Price: I don't think I need to explain this one.

Size: The size of the Lucene index is tiny and easier to deploy
to the servers that search it.

Flexibility: If I want to change my methodology of index or
search, I don't need to worry about db schema evolution across multiple
environments on the way to production.

All in all, I don't think that a JDBC wrapper is going to do what you
want.  The material you want to index is application-specific, as are
the mechanics of searching the index.  A JDBC driver isn't going to know
which of the fields you are updating you might care to index and search
later.  In the end, the approach that worked for me was to create a
config-driven wrapper that knows how to index specific properties of
POJOs.  The same config also drives the formation of the query
expressions.  This way I don't care whether the content was
instantiated from a db or XML (I need to do both), or some other source.
I think one of the great benefits of Lucene is that it allows me to
embed sophisticated search functionality into my apps without being
dependent upon any particular persistence mechanism.
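
A rough sketch of what such a config-driven wrapper might look like, assuming the config is simply a map from class to the bean properties worth indexing. PropertyIndexer, Article and the property names are hypothetical, not taken from the project described above, and the returned map stands in for a Lucene Document so that the example runs without Lucene on the classpath.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Config-driven extraction of searchable fields from plain Java objects. */
final class PropertyIndexer {
    // Config: class -> names of the bean properties worth indexing.
    private final Map<Class<?>, List<String>> config;

    PropertyIndexer(Map<Class<?>, List<String>> config) {
        this.config = config;
    }

    /** The same config can tell the query side which fields are searchable. */
    List<String> searchableFields(Class<?> type) {
        return config.get(type);
    }

    /** Returns field name -> text; a stand-in for building a Lucene Document. */
    Map<String, String> extract(Object bean) {
        List<String> properties = config.get(bean.getClass());
        if (properties == null) {
            throw new IllegalArgumentException("No index config for " + bean.getClass().getName());
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (String property : properties) {
            Object value = readProperty(bean, property);
            if (value != null) {
                fields.put(property, value.toString());
            }
        }
        return fields;
    }

    // Reads a property through its public getter, e.g. "title" -> getTitle().
    private static Object readProperty(Object bean, String property) {
        String getter = "get" + Character.toUpperCase(property.charAt(0)) + property.substring(1);
        try {
            Method m = bean.getClass().getMethod(getter);
            return m.invoke(bean);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot read property " + property, e);
        }
    }
}

/** Example POJO; whether it was loaded from a database or XML does not matter. */
final class Article {
    private final String title;
    private final String body;
    private final String internalNotes;

    Article(String title, String body, String internalNotes) {
        this.title = title;
        this.body = body;
        this.internalNotes = internalNotes;
    }

    public String getTitle() { return title; }
    public String getBody() { return body; }
    public String getInternalNotes() { return internalNotes; }
}

final class PropertyIndexerDemo {
    public static void main(String[] args) {
        Map<Class<?>, List<String>> config = new LinkedHashMap<>();
        config.put(Article.class, Arrays.asList("title", "body")); // internalNotes is not indexed
        PropertyIndexer indexer = new PropertyIndexer(config);
        System.out.println(indexer.extract(new Article("Lucene", "Full-text search", "draft")));
    }
}
```

Because the indexer only sees objects and property names, nothing in it depends on the persistence mechanism, which is the point made above.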

Mike







Re: Lucene vs. in-DB-full-text-searching

2005-02-18 Thread David Spencer
Otis Gospodnetic wrote:
> The most obvious answer is that the full-text indexing features of
> RDBMS's are not as good (as fast) as Lucene.  MySQL, PostgreSQL,
> Oracle, MS SQL Server etc. all have full-text indexing/searching
> features, but I always hear people complaining about the speed.

Yeah, but in theory, in the ideal world :), it shouldn't be any slower -
there's no magic Lucene has that DB's don't.  And the big advantage of
it being embedded in the DB is the index can always be up to date, just
as if you had Lucene updating the index based on a trigger. You don't
need any separate cron job to periodically update the index.

But this brings up - has anyone run Lucene off a database trigger, or are
triggers known to be slow and bad for this use?

> A person from a well-known online bookseller told me recently that Lucene
> was about 10x faster than MySQL for full-text searching, and I am
> currently helping someone get away from MySQL and into Lucene for
> performance reasons.
>
> Otis



Re: Lucene vs. in-DB-full-text-searching

2005-02-18 Thread markharw00d
But this brings up - has anyone run Lucene off a database trigger or 
are  triggers known to be slow and bad for this use?

I suspect the tricky bit would be knowing how to balance the calls to
Reader/Writer closes, opens and optimizes.
Record updates are the usual fun and games involving a reader.delete and 
a document.write.




Re: Lucene vs. in-DB-full-text-searching

2005-02-18 Thread David Spencer
markharw00d wrote:
> > But this brings up - has anyone run Lucene off a database trigger or
> > are triggers known to be slow and bad for this use?
>
> I suspect the tricky bit would be knowing how to balance the calls to
> Reader/Writer closes, opens and optimizes.
> Record updates are the usual fun and games involving a reader.delete and
> a document.write.

I agree this is the usual tricky/fun thing.
In similar situations I have:
- batched the updates in, well, sort of a queue
- flushed the queue after t seconds or n documents (e.g. t=60sec, 
n=1000 docs)

Part of the trick is a document that changes multiple times during one
of these periods - if you have an add queue and a delete queue then
you'll probably end up with a wrong index, with the doc present either
zero times or more than once. It's not impossible to cover, just
something to keep in mind.
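
One way to cover that case, sketched under some assumptions: instead of separate add and delete queues, keep a single map of pending changes keyed by document id, so a later change to the same document overwrites the earlier one. IndexSink and CoalescingIndexQueue are hypothetical names. A real sink would wrap the Lucene reader/writer calls, and the caller passes in the current time so the t/n thresholds are easy to exercise.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sink; a real one would wrap the Lucene reader/writer calls. */
interface IndexSink {
    void delete(String docId);
    void add(String docId, String content);
}

/** Batches changes and keeps only the latest pending change per document. */
final class CoalescingIndexQueue {
    private final IndexSink sink;
    private final int maxDocs;       // n: flush once this many documents are pending
    private final long maxAgeMillis; // t: flush once the oldest pending change is this old
    // docId -> latest content, or null when the latest change is a delete.
    private final Map<String, String> pending = new LinkedHashMap<>();
    private long oldestPendingAt = -1;

    CoalescingIndexQueue(IndexSink sink, int maxDocs, long maxAgeMillis) {
        this.sink = sink;
        this.maxDocs = maxDocs;
        this.maxAgeMillis = maxAgeMillis;
    }

    synchronized void upsert(String docId, String content, long nowMillis) {
        record(docId, Objects.requireNonNull(content), nowMillis);
    }

    synchronized void remove(String docId, long nowMillis) {
        record(docId, null, nowMillis);
    }

    private void record(String docId, String contentOrNull, long nowMillis) {
        if (pending.isEmpty()) {
            oldestPendingAt = nowMillis;
        }
        pending.put(docId, contentOrNull); // overwrites any earlier change to the same doc
        flushIfDue(nowMillis);
    }

    /** Call periodically (e.g. from a timer) so a quiet queue still gets flushed. */
    synchronized void flushIfDue(long nowMillis) {
        if (pending.isEmpty()) {
            return;
        }
        if (pending.size() >= maxDocs || nowMillis - oldestPendingAt >= maxAgeMillis) {
            flush();
        }
    }

    synchronized void flush() {
        for (Map.Entry<String, String> change : pending.entrySet()) {
            sink.delete(change.getKey()); // always drop the old copy first
            if (change.getValue() != null) {
                sink.add(change.getKey(), change.getValue());
            }
        }
        pending.clear();
        oldestPendingAt = -1;
    }
}

final class CoalescingIndexQueueDemo {
    public static void main(String[] args) {
        IndexSink printer = new IndexSink() {
            @Override public void delete(String docId) { System.out.println("delete " + docId); }
            @Override public void add(String docId, String content) { System.out.println("add " + docId); }
        };
        CoalescingIndexQueue queue = new CoalescingIndexQueue(printer, 1000, 60_000L);
        long now = System.currentTimeMillis();
        queue.upsert("doc-1", "first draft", now);
        queue.upsert("doc-1", "second draft", now); // replaces the pending first draft
        queue.remove("doc-2", now);
        queue.flush();
    }
}
```

Since every flushed change is applied as "delete, then add if there is content", each document ends up in the index exactly zero or one times regardless of how often it changed within the window.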

- Dave
