RE: commercial websites powered by Lucene?

2003-06-25 Thread John Takacs
Tatu,

I agree 100% with everything you've said.

Let's look at MySQL for example.  Great database.  No doubt about it.

BUT, looking at the full-text indexing/searching part... it's not up to snuff.

Currently, I'm using MySQL's full-text search support. I have a database of
3-5 million rows. Each row is unique; let's say each row is a product. Each
row has several columns, but the two I search on are title and description.
I created a full-text index on title and description. Title has approximately
100 characters, and description has 255 characters.

At the moment, MySQL is taking 50-plus seconds to return results on simple
one-word searches. My dedicated server is a P4 2.0 GHz with 1.5 GB of RAM
running Red Hat Linux 7.3, with nothing else on it, i.e. another server is
handling the HTTP requests. It is a dedicated MySQL box.  In addition,
I'm the only person making queries.

Obviously, the above performance is unacceptable for real world web
applications.

I'd love to try Lucene with the above, but the Lucene install fails because
of JavaCC issues.  Surprised more people haven't encountered this problem,
as the install instructions are out of date.

Regards,

John
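
For anyone wanting to reproduce the comparison, here is a minimal sketch of
what indexing rows like the ones above (an id, a ~100-character title and a
~255-character description per product) looks like with the Lucene 1.x API
of the time; the index path, field names and sample rows are made up, and
real data would of course be streamed out of the products table:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class ProductIndexer {
        public static void main(String[] args) throws Exception {
            // true = create a new index, overwriting anything at that path
            IndexWriter writer = new IndexWriter("/tmp/product-index",
                                                 new StandardAnalyzer(), true);
            // Stand-ins for rows from the products table: id, title, description.
            String[][] rows = {
                { "1", "Acme blue widget", "A sturdy blue widget, ships worldwide." },
                { "2", "Acme red widget",  "The red variant of our popular widget." }
            };
            for (int i = 0; i < rows.length; i++) {
                Document doc = new Document();
                doc.add(Field.Keyword("id", rows[i][0]));       // stored, not tokenized
                doc.add(Field.Text("title", rows[i][1]));       // the two columns
                doc.add(Field.Text("description", rows[i][2])); // being searched on
                writer.addDocument(doc);
            }
            writer.optimize();   // merge segments so searches touch fewer files
            writer.close();
        }
    }

Field.Keyword keeps the id un-tokenized, so a row can be found again later
for updates or deletes.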



-Original Message-
From: Tatu Saloranta [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 25, 2003 12:26 PM
To: Lucene Users List
Subject: Re: commercial websites powered by Lucene?


On Tuesday 24 June 2003 07:36, Ulrich Mayring wrote:
 Chris Miller wrote:
...
 Well, nothing against Lucene, but it doesn't solve your problem, which
 is an overloaded DB-Server. It may temporarily alleviate the effects,
 but you'll soon be at the same load again. So I'd recommend to install

I don't think that would necessarily be the case. As you mention later on,
indexing data stored in the DB does flatten it to allow faster indexing (and
retrieval), and faster in this context means more efficient: not only
sharing the load between the DB and the search engine, but potentially
lowering the total load.

The alternative, data-warehouse-like preprocessing of the data for faster
search, would likely be doable too, but it's usually more useful for running
reports. For actual searches Lucene does its job nicely and efficiently; the
biggest problems I've seen are more related to relevancy questions. But
there, tuning Lucene's ranking should be easier than trying to build your
own ranking from raw database hits (unless one uses Oracle Text or the like,
which is pretty much a search engine on top of the DB itself).

So, to me it all comes down to the right tool for the job: DBs are good at
mass retrieval of data and at aggregate functions (on the read-only side),
whereas dedicated search engines are better at, well, searching.

...
 Of course, in real life there may be political obstacles which will
 prevent you from doing the right thing as detailed above for example,
 and your only chance is to circumvent in some way - and then Lucene is a
 great way to do that. But keep in mind that you are basically
 reinventing the functionality that is already built-in in a database :)

It depends on the type of queries, but Lucene certainly has much more
advanced text-searching functionality, even if the indexed content comes
from a rigid structure like an RDBMS. I'm not sure using a ready-made
product like Lucene is reinventing much functionality, even considering
synchronization issues.

So I would go as far as saying that for searching purposes, plain vanilla
RDBMSs are not all that great in the first place. Even if queries don't need
advanced search features (advanced as in more than just using % and _ in
addition to exact matches), Lucene may well offer better search performance
and functionality.

-+ Tatu +-





Re: commercial websites powered by Lucene?

2003-06-25 Thread Ulrich Mayring
John Takacs wrote:

I'd love to try Lucene with the above, but the Lucene install fails because
of JavaCC issues.  Surprised more people haven't encountered this problem,
as the install instructions are out of date.
Well, what do you need JavaCC for? Isn't it just the tool used to build the
supplied HTML parser? There are much better HTML parsers out there, which
you can use.

Ulrich





RE: commercial websites powered by Lucene?

2003-06-25 Thread John Takacs
Good idea.  I was just following the install directions, but if I don't have
to pay attention to them, I'll find a much better HTML parser.

Any hints?  Perhaps a previous email discussion?  I found some references by
searching the archives, but I'm not 100% convinced they are applicable to my
situation.

John






Re: commercial websites powered by Lucene?

2003-06-25 Thread Ulrich Mayring
John Takacs wrote:
Good idea.  I was just following the install directions, but if I don't have
to pay attention to the install directions, I'll find a much better one.
Any hints?  Previous email discussion maybe?  I found some references via
searching the archives, but I'm not 100% convinced they are applicable to my
situation.

I'm not sure what you mean by install directions; Lucene is just a JAR
file and you use it like any other Java class library. There's also the
WAR file with a few demos, which you can just drop into Tomcat.

Perhaps you were trying to build it? I just downloaded the binary 
distribution and used it.

Ulrich





RE: commercial websites powered by Lucene?

2003-06-25 Thread Otis Gospodnetic
 I'd love to try Lucene with the above, but the Lucene install fails
 because
 of JavaCC issues.  Surprised more people haven't encountered this
 problem,
 as the install instructions are out of date.

The JavaCC fix is in the queue.  Check Bugzilla for details (link on
Lucene home page).

Otis







Re: commercial websites powered by Lucene?

2003-06-25 Thread Otis Gospodnetic
 Well, what do you need JavaCC for? Isn't it just the technology for 
 building the supplied HTML-Parser? There are much better HTML parsers
 out there, which you can use.

Its primary use in the Lucene package is for parsing users' queries.

Otis
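
The generated parser ships pre-built inside the binary Lucene jar, so JavaCC
is only needed when rebuilding Lucene from source. At search time it is what
turns the user's query string into a Query object; a small sketch against
the 1.x API (the index path and field names are made up):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class SearchDemo {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/tmp/product-index");
            // QueryParser is the JavaCC-generated class; "description" is the
            // default field for terms that don't name a field themselves.
            Query query = QueryParser.parse("title:widget OR description:widget",
                                            "description", new StandardAnalyzer());
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.score(i) + "\t" + hits.doc(i).get("title"));
            }
            searcher.close();
        }
    }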





Re: commercial websites powered by Lucene?

2003-06-25 Thread Tatu Saloranta
On Wednesday 25 June 2003 09:47, Ulrich Mayring wrote:
 John Takacs wrote:
  I'd love to try Lucene with the above, but the Lucene install fails
  because of JavaCC issues.  Surprised more people haven't encountered this
  problem, as the install instructions are out of date.

 Well, what do you need JavaCC for? Isn't it just the technology for
 building the supplied HTML-Parser? There are much better HTML parsers
 out there, which you can use.

On a related note: has anyone done performance measurements on the various
HTML parsers used for indexing?

I have written a couple of XML/HTML parsers that were optimized for speed
(and/or for leniency, to be able to handle and fix invalid documents), and I
was wondering if they might be useful to other people for indexing purposes
(one is in general pretty optimal if the document contents are already fully
in memory, like when fetching from a DB; another uses very little memory,
while being only slightly slower). However, using those as opposed to more
standard parsers would only make sense if there are significant speed
improvements. To judge that, it would be good to have baseline measurements,
and/or to know which parsers are currently the best candidates from a
performance perspective.

The thing is that creating a parser that only cares about textual content
(and perhaps in some cases about the surrounding element, but not about
attributes, structure, DTDs/Schemas, validity, etc.) is fairly easy, and
since indexing is often the most CPU-intensive part of a search engine, it
may make sense to optimize this part heavily, up to and including using
specialized parsers.

-+ Tatu +-
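
A parser that only cares about character data really can be tiny; the sketch
below (an illustration of the idea, not Tatu's parser) simply drops
everything between angle brackets and ignores attributes, structure and
validity, which is roughly the trade-off being described:

    public class TextOnlyExtractor {

        /** Return the character data of an HTML fragment, dropping all markup. */
        public static String extractText(String html) {
            StringBuffer out = new StringBuffer(html.length());
            boolean inTag = false;
            for (int i = 0; i < html.length(); i++) {
                char c = html.charAt(i);
                if (c == '<') {
                    inTag = true;        // start of a tag: skip until '>'
                } else if (c == '>') {
                    inTag = false;
                    out.append(' ');     // keep words on either side separated
                } else if (!inTag) {
                    out.append(c);
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(extractText("<p>Hello <b>world</b></p>"));
        }
    }

A real lenient parser additionally has to deal with entities, script and
style bodies, and broken markup, which is where the speed and robustness
trade-offs mentioned above come in.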





Re: commercial websites powered by Lucene?

2003-06-24 Thread Chris Miller
Hi Nader,

I was wondering if you'd mind me asking you a couple of questions about your
implementation?

The main thing I'm interested in is how you handle updates to Lucene's
index. I'd imagine you have a fairly high turnover of CVs and jobs, so index
updates must place a reasonable load on the CPU/disk. Do you keep CVs and
jobs in the same index or two different ones? And what is the process you
use to update the index(es) - do you batch-process updates or do you handle
them in real-time as changes are made?

Any insight you can offer would be much appreciated as I'm about to
implement something similar and am a little unsure of the best approach to
take. We need to be able to handle indexing about 60,000 documents/day,
while allowing (many) searches to continue operating alongside.

Thanks!
Chris

Nader S. Henein [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
 We use Lucene http://www.bayt.com , we're basically an on-line
 Recruitment site and up until now we've got around 500 000 CVs and
 documents indexed with results that stump Oracle Intermedia.

 Nader Henein
 Senior Web Dev

 Bayt.com

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, June 04, 2003 6:09 PM
 To: [EMAIL PROTECTED]
 Subject: commercial websites powered by Lucene?



 Hello All,

 I've been trying to find examples of large commercial websites that
 use Lucene to power their search.  Having such examples would
 make Lucene an easy sell to management

 Does anyone know of any good examples?  The bigger the better, and
 the more the better.

 TIA,
 -John






RE: commercial websites powered by Lucene?

2003-06-24 Thread Nader S. Henein
I handle updates and inserts the same way: first I delete the document
from the index and then I insert it (better safe than sorry). I batch my
updates/inserts every twenty minutes; I would use smaller intervals, but
since I have to sync the XML files created from the DB to three machines
(I maintain three separate Lucene indices on my three separate
web-servers) it takes a little longer. You have to batch your changes
because updating the index takes time, as opposed to deletes, which I
batch every two minutes. You won't have a problem updating the index and
searching at the same time, because Lucene updates the index on a
separate set of files and then, when it's done, it overwrites the old
version. I've had to provide for backups and for things like server
crashes mid-indexing, but I was using Oracle Intermedia before and Lucene
BLOWS IT AWAY.
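
Mapped onto the Lucene 1.x API, the delete-then-insert update described
above looks roughly like the sketch below (field names are made up; in 1.x
deletes go through IndexReader while adds go through IndexWriter, which is
one reason batching the two separately is convenient):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class CvUpdater {

        /** Replace one document: delete any old copy, then add the new one. */
        public static void update(String indexPath, String cvId, String cvText)
                throws Exception {
            // 1. Delete the old version by its unique id (a Keyword field, untokenized).
            IndexReader reader = IndexReader.open(indexPath);
            reader.delete(new Term("cvId", cvId));
            reader.close();

            // 2. Add the new version; 'false' opens the existing index for appending.
            IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
            Document doc = new Document();
            doc.add(Field.Keyword("cvId", cvId));
            doc.add(Field.Text("body", cvText));
            writer.addDocument(doc);
            writer.close();
        }
    }

Doing this one document at a time means repeatedly opening and closing the
index, so collecting a batch of changes and doing all the deletes and then
all the adds in one pass, as described above, is considerably cheaper.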




Re: commercial websites powered by Lucene?

2003-06-24 Thread Chris Miller
So you have a holding table in a database (or a directory on disk?) where
you store the incoming documents, correct? Does each webserver run its own
indexing thread which grabs any new documents every 20 minutes, or is there
a central process that manages that? I'm trying to understand how you know
when you can safely clean out the holding table.

Did you look at having just a single process that was responsible for
updating the index, and then pushing copies out to all the webservers? I'm
wondering if that might be worth investigating (since it would take a lot of
load off the webservers that are running the searches), or if it will be too
troublesome in practice.

Also, I'm interested to see how you handle the situation when a server gets
shut down/restarted - does it just take a copy of the index from one of the
other servers (since its own index is likely out of date)? I take it it's
not safe to copy an index while it is being updated, so you have to block on
that somehow?

PS: It's great to hear Lucene blows Oracle out of the water! I've got some
skeptical management that need some convincing; hearing stories like this
helps a lot :-)




Re: commercial websites powered by Lucene?

2003-06-24 Thread Gareth Griffiths
Nader,
You say you have to cope with server crashes mid-indexing. I think I'm
seeing lots of garbage files created by a server crash mid merge/optimise
while Lucene is creating a new index. Did you write code specifically to
handle this, or is there something more automated? (I was thinking of
writing a sanity check to run before start-up that looks in 'segments' and
'deletable' and gets rid of any files in the catalog directory that are not
referenced.)

Did you do something similar, or have I missed something?

TIA

Gareth





RE: commercial websites powered by Lucene?

2003-06-24 Thread Nader S. Henein
I have to store the information I am indexing in the database because the
nature of our application requires it. On update of certain columns in a
table I create an XML file, which is then copied to directories on each of
my web servers; separate Lucene apps running on separate machines then
digest the information into separate indices. You also have to provide
procedures that run periodically to ensure that all your indices are in
sync with each other and in sync with the DB (I run this once every three
days, when the CPU usage on the machines is low).

To update the index I have a servlet running off a scheduler in Resin (you
could use any webserver; Orion's cool too). The upside to distributing your
search engines like this is that you have three active backups in case one
gets corrupted (hasn't happened in two years), and the load on each machine
is pretty low even during the updates/optimizations every 20 minutes.

If the server crashes, it's not a problem unless it happens mid-indexing;
then you have to somehow remove the write locks created in the index
directory (I just delete them, optimize, and re-start the update that
crashed).

Lucene destroyed Oracle on speed tests, and we used to have to use our
single monster DB machine for all the searching and indexing, which made
the load on it pretty high; now I have 0.5 load on all my CPUs and no need
to buy new hardware.
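
In the Lucene releases of that era the write locks are ordinary files, so
the crash clean-up Nader describes can be as simple as the sketch below; the
lock file names and their location in the index directory are an assumption
that depends on the Lucene version and its lock settings, and this is only
safe when no indexing process is actually running:

    import java.io.File;

    public class IndexLockCleaner {

        /** Remove stale Lucene lock files after a crash, before re-opening the index. */
        public static void clearStaleLocks(String indexPath) {
            String[] lockNames = { "write.lock", "commit.lock" };  // assumed names
            for (int i = 0; i < lockNames.length; i++) {
                File lock = new File(indexPath, lockNames[i]);
                if (lock.exists() && lock.delete()) {
                    System.out.println("Removed stale lock: " + lock);
                }
            }
        }
    }

After that, re-running the batch that crashed and optimizing, as described
above, brings the index back to a consistent state.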




RE: commercial websites powered by Lucene?

2003-06-24 Thread John Takacs
Hi Nader,

This thread is by far one of the best, and most practical.  It will only be
topped when someone provides benchmarks for a DMOZ.org-type directory of 3
million-plus URLs.  I would love to, but the whole JavaCC thing is a show
stopper.

Questions:

I noticed that search is a little slow.  What has been your experience?
Perhaps it was a bandwidth issue, but I'm living in a country with the
greatest internet connectivity and penetration in the world (South Korea),
so I don't think that is an issue on my end.

You have 500,000 resumes.  Based on the steps you took to get to 500,000,
do you think your current setup will scale to millions, say 3 million or
so?

What is your hardware like?  CPU/RAM?

Warm regards, and thanks for sharing.  If I can ever get past the
Lucene/JavaCC installation failure, I'll share my benchmarks on the above
directory scenario.

John






RE: commercial websites powered by Lucene?

2003-06-24 Thread Nader S. Henein
Because I've set up Lucene as a webapp with a centralized init file and a
setup properties file, I do my sanity check in the init: if the server
crashes mid-indexing, I have to delete the lock files, optimize, and
re-index the files that were being indexed when the crash occurred. There
was a long discussion about this back in August; search for Crash /
Recovery Scenario in the lucene-dev archived discussions. That should
answer all your questions.

Nader Henein




RE: commercial websites powered by Lucene?

2003-06-24 Thread Nader S. Henein
The search is a little sluggish because our initial architecture was based
on Tcl, not Java, so until we complete the full Java overhaul, every time I
perform a search the AOL webserver (Tcl) has to call the servlet in Resin
(where Lucene is) and then perform the search, and then, this is the
killer, I have to parse all the results from a Java Collection into a Tcl
list. The most intense search, with thousands of results, takes less than a
second; it's all the things I have to do around it that take time.

Nader


Re: commercial websites powered by Lucene?

2003-06-24 Thread David Medinets
- Original Message -
From: Chris Miller [EMAIL PROTECTED]
 Did you look at having just a single process that was responsible for
 updating the index, and then pushing copies out to all the webservers? I'm
 wondering if that might be worth investigating (since it would take a lot
of
 load off the webservers that are running the searches), or if it will be
too
 troublesome in practice.

I've found that pulling information from a central source is simpler than
pushing information. When information is pushed, there is a lot of
administration on the central server to track the recipient machines; it
seems like servers are constantly being added to and dropped from the push
list. Additionally, you need to account for servers that stop responding.
When information is pulled from the central source, these issues of
coordination are eliminated.

David Medinets
http://www.codebits.com






Re: commercial websites powered by Lucene?

2003-06-24 Thread Ulrich Mayring
Chris Miller wrote:
The main thing I'm interested in is how you handle updates to Lucene's
index. I'd imagine you have a fairly high turnover of CVs and jobs, so index
updates must place a reasonable load on the CPU/disk. Do you keep CVs and
jobs in the same index or two different ones? And what is the process you
use to update the index(es) - do you batch-process updates or do you handle
them in real-time as changes are made?
The way we do it: we re-index everything periodically in a temporary
directory and then rename the temporary directory. That way the index
remains accessible at all times, and its currency is simply determined by
the interval at which I run the re-indexing.

 We need to be able to handle indexing about 60,000 documents/day,
while allowing (many) searches to continue operating alongside.
On an entry-level Sun I can index about 23 documents per second and 
these are real-life HTML pages. Thus in less than one hour you would be 
finished with a complete index run and save yourself all kinds of 
trouble with crashes during indexing etc.

On my 2 GHz Linux workstation it's even faster: more than 2000 documents 
per minute, so you'd be done in half an hour.

BTW, we're not using the supplied JavaCC-based HTML parser; instead we use
htmlparser.sourceforge.net, which is a joy to use and pretty fast.

Ulrich
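
The rebuild-and-rename scheme described above might look like the sketch
below; the directory names are made up, and the comment marks where the
real document loop would go:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class FullReindex {
        public static void main(String[] args) throws Exception {
            File live = new File("/data/index");      // what the searchers point at
            File temp = new File("/data/index.tmp");  // the full rebuild goes here
            File old  = new File("/data/index.old");

            IndexWriter writer = new IndexWriter(temp.getPath(),
                                                 new StandardAnalyzer(), true);
            // ... add every document here, e.g. by walking the HTML tree ...
            writer.optimize();
            writer.close();

            // Swap the finished index into place; the old one is kept as a backup.
            if (old.exists()) deleteRecursively(old);
            live.renameTo(old);
            temp.renameTo(live);
        }

        private static void deleteRecursively(File dir) {
            File[] children = dir.listFiles();
            for (int i = 0; children != null && i < children.length; i++) {
                deleteRecursively(children[i]);
            }
            dir.delete();
        }
    }

Searchers opened before the swap keep reading the old files (at least on
Unix-like filesystems) until they are re-opened against the live path, which
is why the index stays available throughout.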





Re: commercial websites powered by Lucene?

2003-06-24 Thread Chris Miller
Thanks David, that's about what I figured. Of course if the servers are
pulling the information then a central holding table that contains only new
data doesn't make much sense anymore. Instead I guess the easiest approach
would be to have a central table that contains the entire dataset, and has
last-modified timestamps on each record so the individual webservers can
grab just the data that was changed since they last ran an index update. My
concern still is that the effort of indexing (which is potentially quite
high) is being duplicated across all the webservers.

Is there any reason why it would be a bad idea to have one machine
responsible for grabbing updates and adding documents to a master index, so
the other servers could periodically grab a copy of that index and hot-swap
it with their previous copy? Is Lucene capable of handling that scenario?
Seems to me that this approach would reduce the stress on the webservers even
more, and even if the indexing server went down the webservers would still
have a stale index to search against. Has anyone attempted something like
this?
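
The pull-by-timestamp idea could look like this on each webserver; the table
and column names are purely illustrative, and the essential point is that
every server tracks its own high-water mark:

    import java.sql.*;

    public class IncrementalPuller {

        /** Fetch rows changed since this server's last successful index run. */
        public static void pullChanges(Connection con, Timestamp lastRun)
                throws SQLException {
            PreparedStatement ps = con.prepareStatement(
                "SELECT id, title, body, last_modified FROM documents " +
                "WHERE last_modified > ?");
            ps.setTimestamp(1, lastRun);
            ResultSet rs = ps.executeQuery();
            Timestamp newHighWater = lastRun;
            while (rs.next()) {
                // Hand the row to the local Lucene updater (delete old copy, add new).
                // ...
                Timestamp t = rs.getTimestamp("last_modified");
                if (t.after(newHighWater)) newHighWater = t;
            }
            rs.close();
            ps.close();
            // Persist newHighWater locally so the next run starts from here.
        }
    }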





Re: commercial websites powered by Lucene?

2003-06-24 Thread Chris Miller
Thanks for your comments Ulrich. I just posted a message asking if anyone
had attempted this approach! Sounds like you have, and it works :-)  Thanks
for the information, this sounds pretty close to what my preferred approach
would be.

You say you get 2000 docs/minute. I've done some benchmarking and managed to
get our data indexing at ~1000/minute on an Athlon 1800+ (and most of that
speed was achieved by bumping the IndexWriter.mergeFactor up to 100 or so).
Our data is coming from a database table, each record contains about 40
fields, and I'm indexing 8 of those fields (an ID, 4 number fields, 3 text
fields including one that has ~2k text). Does this sound reasonable to you,
or do you have any tips that might improve that performance?





RE: commercial websites powered by Lucene?

2003-06-24 Thread Nader S. Henein
About 100 documents every twenty minutes, but it fluctuates depending on
how much traffic is on the site.

-Original Message-
From: news [mailto:[EMAIL PROTECTED] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 3:28 PM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?


Hmm, good point with the cost of copying indices in a distributed
environment, although that is unlikely to affect us in the foreseeable
future. But, noted!

Do you have any rough statistics on how many documents you index/day, or
how many every 20 minutes?

This discussion is fantastic by the way, lots of great experience and
comments coming out here. Thanks, it's really appreciated.

Nader S. Henein [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
 We thought of that in the beginning and then we became more 
 comfortable with multiple indices for simple backup purposes, and now 
 our indices are in excess of 100megs, and transferring that kind of 
 data between three machines sitting in the same data center is 
 passable, but once you start thinking of distributed webservers in 
 different hosting facilities, copying  100Megs every 20 minutes, or 
 even every hour becomes financially expensive.

 Our webservers are on single-processor Sun UltraSPARC III 400 MHz boxes
 with two gigs of memory, and I've never seen the CPU usage go over 0.8
 at peak time with the indexer running. Try it out first; take your
 time to gather your own numbers so you can really get a feel for what
 setup fits you best.

 Nader







RE: commercial websites powered by Lucene?

2003-06-24 Thread Nader S. Henein
We were using Oracle Intermedia before we switched to Lucene, and Lucene
has been much faster and has allowed us to distribute our search
functionality over multiple servers. Intermedia, which is supposedly one
of the best in the business, couldn't hold a candle to Lucene, and our
Oracle installation and setup is impeccable; we spent years perfecting it
before we decided to separate from Intermedia and use Oracle as a DBMS,
not a search engine. Also, when you use Lucene and not a proprietary
product like Intermedia, you can switch databases at will if licensing
fees become too high to ignore.

Nader

-Original Message-
From: news [mailto:[EMAIL PROTECTED] On Behalf Of Ulrich Mayring
Sent: Tuesday, June 24, 2003 3:40 PM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?


 Chris Miller wrote:
  Thanks for your comments Ulrich. I just posted a message asking if
  anyone had attempted this approach! Sounds like you have, and it works
  :-)  Thanks for the information, this sounds pretty close to what my
  preferred approach would be.

 This is a good approach if the number of total documents doesn't grow
 too much. There's obviously a limit to full index runs at some point.

  You say you get 2000 docs/minute. I've done some benchmarking and
  managed to get our data indexing at ~1000/minute on an Athlon 1800+
  (and most of that speed was achieved by bumping the
  IndexWriter.mergeFactor up to 100 or so). Our data is coming from a
  database table, each record contains about 40 fields, and I'm indexing
  8 of those fields (an ID, 4 number fields, 3 text fields including one
  that has ~2k text). Does this sound reasonable to you, or do you have
  any tips that might improve that performance?

 You need to find out where you lose most of the time:

 a) in data access (like your database could be too slow; in my case I am
 scanning the local filesystem)
 b) in parsing (probably not an issue when reading from a DB, but in my
 case it is, since I have HTML files)
 c) in indexing

 I haven't gone to the trouble to find that out for my app, because it is
 fast enough the way it is.

 However, what I wonder: if you have your data in a database anyway, why
 not use the database's indexing features? It seems like Lucene is an
 additional layer on top of your data, which you don't really need.

 cheers,

 Ulrich






Re: commercial websites powered by Lucene?

2003-06-24 Thread Chris Miller
 This is a good approach if the number of total documents doesn't grow
 too much. There's obviously a limit to full index runs at some point.

Well I was actually going to go with incremental indexing, since a full
reindex will probably take ~1 hour. We have a relatively fixed size of data,
but the data is updated very frequently - almost 100% turnover/day.

 You need to find out where you lose most of the time:

Fair enough; I haven't tried much in the way of profiling yet. I just
thought you might have found some Lucene settings that made a big difference
for you, or that you'd found indexing into a RAMDirectory and then dumping
it to disk was faster, etc. But it sounds like you're pretty happy with
near-default settings.
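
The RAMDirectory-then-dump variant would look roughly like the sketch below
against the 1.x API; whether it actually beats simply raising mergeFactor is
exactly the kind of thing worth benchmarking:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamThenDisk {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // 1. Build a batch of documents entirely in memory.
            RAMDirectory ramDir = new RAMDirectory();
            IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
            Document doc = new Document();
            doc.add(Field.Text("body", "example document"));
            ramWriter.addDocument(doc);
            ramWriter.close();

            // 2. Merge the in-memory segments into the on-disk index in one go.
            //    (true creates a fresh target index for this example; a real setup
            //    would open its existing index with false.)
            IndexWriter diskWriter = new IndexWriter("/tmp/index", analyzer, true);
            diskWriter.addIndexes(new Directory[] { ramDir });
            diskWriter.close();
        }
    }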

 However, what I wonder: if you have your data in a database anyway, why
 not use the database's indexing features? It seems like Lucene is an
 additional layer on top of your data, which you don't really need.

Our current DB server (running SQL Server) is under enormous strain, partly
due to the complex searches that are being performed against it. We've got
it pretty heavily tweaked already, so I don't think there's too much room to
improve on that front. The idea is to use Lucene to take the searching load
off it so it can get on with all the other tasks it has to perform. The
Lucene implementation I'm working on here is just a proof of concept - it
may be that we stay with SQL Server in the long run anyway, but Lucene
definitely seems to be worth investigating - it has certainly worked well
for us on smaller projects.







Re: commercial websites powered by Lucene?

2003-06-24 Thread David Medinets
- Original Message -
From: Chris Miller [EMAIL PROTECTED]
 Thanks David, that's about what I figured. Of course if the servers are
 pulling the information then a central holding table that contains only
new
 data doesn't make much sense anymore. Instead I guess the easiest approach
 would be to have a central table that contains the entire dataset

The following commentary might have no bearing on Lucene or relevance to
today's technology, but I feel garrulous this morning.

Each pulling server did a three-step dance when updating. First, the central
server (Oracle) was polled to get the latest data (actually we sucked it all
because there were only 30,000 records). A text file was created (format is
unimportant, use the easiest for your application). Then that text file was
read to update the local datastore.

The advantage of this rigamarole was to allow the servers to fail and be
restored without needing to poll the central server. We had 400 servers in
the cluster, and at times many of them would fail (this was in 1999, don't
be critical!). If many systems pulled data from the central server, the
process would slow down, which started another round of failures. To avoid
that vicious circle of failure, all of the systems could reboot
independently.

David Medinets
http://www.codebits.com







RE: commercial websites powered by Lucene?

2003-06-05 Thread Nader S. Henein
We use Lucene at http://www.bayt.com; we're basically an on-line
recruitment site, and up until now we've got around 500,000 CVs and
documents indexed, with results that stump Oracle Intermedia.

Nader Henein
Senior Web Dev

Bayt.com

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 04, 2003 6:09 PM
To: [EMAIL PROTECTED]
Subject: commercial websites powered by Lucene?



Hello All,

I've been trying to find examples of large commercial websites that
use Lucene to power their search.  Having such examples would
make Lucene an easy sell to management

Does anyone know of any good examples?  The bigger the better, and
the more the better.

TIA,
-John






Re: commercial websites powered by Lucene?

2003-06-05 Thread Otis Gospodnetic
A few big names are listed in the 1st Lucene article on Onjava.com, if
I recall correctly.

Otis





Re: commercial websites powered by Lucene?

2003-06-05 Thread Che Dong
http://search.163.com: the Chinese portal NetEase uses Lucene for its
directory search and news search.


Che, Dong
http://www.chedong.com
