RE: commercial websites powered by Lucene?
Tatu, I agree 100% with everything you've said. Let's look at MySQL, for example. Great database, no doubt about it. BUT, looking at the full-text indexing/searching part... it's not up to snuff. Currently I'm using MySQL's full-text search support. I have a database of 3-5 million rows; each row is unique, let's say a product. Each row has several columns, but the two I search on are title and description, and I created a full-text index on them. Title has approximately 100 characters, and description has 255 characters. At the moment, MySQL is taking 50 seconds plus to return results on simple one-word searches. My dedicated server is a P4 2.0 GHz with 1.5 GB RAM on a Red Hat Linux 7.3 platform, with nothing else running on it, i.e. another server is handling the HTTP requests; it is a dedicated MySQL box. In addition, I'm the only person making queries. Obviously, the above performance is unacceptable for real-world web applications. I'd love to try Lucene with the above, but the Lucene install fails because of JavaCC issues. I'm surprised more people haven't encountered this problem, as the install instructions are out of date. Regards, John

-Original Message- From: Tatu Saloranta [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 25, 2003 12:26 PM To: Lucene Users List Subject: Re: commercial websites powered by Lucene? On Tuesday 24 June 2003 07:36, Ulrich Mayring wrote: Chris Miller wrote: ... Well, nothing against Lucene, but it doesn't solve your problem, which is an overloaded DB server. It may temporarily alleviate the effects, but you'll soon be at the same load again. So I'd recommend to install I don't think that would necessarily be the case. Like you mention later on, indexing data stored in the DB does flatten it to allow faster indexing (and retrieval), and faster in this context means more efficient: not only sharing the load between DB and search engine, but potentially lowering total load?
The alternative, data warehouse-like preprocessing of data for faster search, would likely be doable too, but it's usually more useful for running reports. For actual searches Lucene does its job nicely and efficiently; the biggest problems I've seen are more related to relevancy questions. But that's where tuning Lucene's ranking should be easier than trying to build your own ranking from raw database hits (except if one uses Oracle Text or such, which is pretty much a search engine on top of the DB itself). So, to me it all comes down to the right-tool-for-the-job aspect: DBs are good at mass retrieval of data, or at using aggregate functions (on the read-only side), whereas dedicated search engines are better at, well, searching. ... Of course, in real life there may be political obstacles which will prevent you from doing the right thing as detailed above, for example, and your only chance is to circumvent them in some way - and then Lucene is a great way to do that. But keep in mind that you are basically reinventing functionality that is already built into a database :) It depends on the type of queries, but Lucene certainly has much more advanced text searching functionality, even if the indexed content comes from a rigid structure like an RDBMS. I'm not sure using a ready product like Lucene is reinventing much functionality, even considering synchronization issues? So I would go as far as saying that for searching purposes, plain vanilla RDBMSs are not all that great in the first place. Even if queries need not use advanced search features (advanced as in not just using % and _ in addition to exact matches), Lucene may well offer better search performance and functionality. -+ Tatu +- - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
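John's 50-second one-word queries are exactly what an inverted index is built to avoid. As a rough sketch of the core idea behind engines like Lucene (all class and method names here are invented for illustration; this is not Lucene's API), a one-word search becomes a single map lookup into a posting list instead of a scan over millions of rows:

```java
import java.util.*;

// A toy inverted index: term -> sorted set of document ids.
// Illustrative only; real engines add scoring, compression, on-disk
// segment files, and much more.
class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Tokenize a document's text and record which doc each term appears in.
    void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // A one-word search is one map probe, independent of corpus size.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }
}
```

The cost is paid once, at indexing time; queries then touch only the posting list for the searched term, which is why a dedicated search engine can stay fast on a corpus where a row-scanning approach degrades.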
Re: commercial websites powered by Lucene?
John Takacs wrote: I'd love to try Lucene with the above, but the Lucene install fails because of JavaCC issues. Surprised more people haven't encountered this problem, as the install instructions are out of date. Well, what do you need JavaCC for? Isn't it just the technology for building the supplied HTML parser? There are much better HTML parsers out there, which you can use. Ulrich
RE: commercial websites powered by Lucene?
Good idea. I was just following the install directions, but if I don't have to pay attention to them, I'll find a much better parser. Any hints? Previous email discussion, maybe? I found some references via searching the archives, but I'm not 100% convinced they are applicable to my situation. John
Re: commercial websites powered by Lucene?
John Takacs wrote: Good idea. I was just following the install directions, but if I don't have to pay attention to them, I'll find a much better one. Any hints? I'm not sure what you mean by install directions; Lucene is just a JAR file, and you use it like any other Java class library. There's also a WAR file with a few demos, which you can just drop into Tomcat. Perhaps you were trying to build it? I just downloaded the binary distribution and used it. Ulrich
RE: commercial websites powered by Lucene?
I'd love to try Lucene with the above, but the Lucene install fails because of JavaCC issues. Surprised more people haven't encountered this problem, as the install instructions are out of date. The JavaCC fix is in the queue. Check Bugzilla for details (link on the Lucene home page). Otis
Re: commercial websites powered by Lucene?
Well, what do you need JavaCC for? Isn't it just the technology for building the supplied HTML-Parser? There are much better HTML parsers out there, which you can use. Its primary use in the Lucene package is parsing users' queries. Otis
Re: commercial websites powered by Lucene?
On Wednesday 25 June 2003 09:47, Ulrich Mayring wrote: John Takacs wrote: I'd love to try Lucene with the above, but the Lucene install fails because of JavaCC issues. ... Well, what do you need JavaCC for? Isn't it just the technology for building the supplied HTML-Parser? There are much better HTML parsers out there, which you can use. On a related note: has anyone done performance measurements on the various HTML parsers used for indexing? I have written a couple of XML/HTML parsers that were optimized for speed (and/or for leniency, to be able to handle/fix non-valid documents), and was wondering if they might be useful for indexing purposes for other people (one is in general pretty optimal if document contents are fully in memory already, like when fetching from a DB; another uses very little memory while being only slightly slower). However, using those as opposed to more standard ones would only make sense if there are significant speed improvements. To judge that, it would be good to have baseline measurements, and/or to know what the current best candidates are from a performance perspective. The thing is that creating a parser that only cares about textual content (and perhaps in some cases about the surrounding element, but not about attributes, structure, DTD/Schema, validity, etc.) is fairly easy, and since indexing is often the most CPU-intensive part of a search engine, it may make sense to optimize this part heavily, up to and including using specialized parsers. -+ Tatu +-
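A parser of the kind Tatu describes, one that only recovers textual content for indexing, can indeed be very small. A deliberately minimal, lenient sketch (it ignores attributes, structure, and validity, and makes no attempt to handle entities or script/style bodies, which a real indexing parser would need to address):

```java
// One pass, no DOM, tolerant of malformed markup: keep everything
// outside angle brackets, replace each tag with a space so adjacent
// words don't run together, and collapse whitespace at the end.
class TagStripper {
    static String textContent(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;            // tolerate an unclosed tag at EOF
            } else if (c == '>') {
                inTag = false;
                out.append(' ');         // word boundary where the tag was
            } else if (!inTag) {
                out.append(c);
            }
        }
        return out.toString().trim().replaceAll("\\s+", " ");
    }
}
```

Since the indexer discards everything but tokens anyway, this single linear scan is about as cheap as extraction can get, which is the performance argument Tatu is making.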
Re: commercial websites powered by Lucene?
Hi Nader, I was wondering if you'd mind me asking you a couple of questions about your implementation? The main thing I'm interested in is how you handle updates to Lucene's index. I'd imagine you have a fairly high turnover of CVs and jobs, so index updates must place a reasonable load on the CPU/disk. Do you keep CVs and jobs in the same index or two different ones? And what is the process you use to update the index(es) - do you batch-process updates or do you handle them in real time as changes are made? Any insight you can offer would be much appreciated, as I'm about to implement something similar and am a little unsure of the best approach to take. We need to be able to handle indexing about 60,000 documents/day, while allowing (many) searches to continue operating alongside. Thanks! Chris

Nader S. Henein [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] We use Lucene at http://www.bayt.com; we're basically an on-line recruitment site, and up until now we've got around 500,000 CVs and documents indexed, with results that stump Oracle Intermedia. Nader Henein Senior Web Dev Bayt.com

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 04, 2003 6:09 PM To: [EMAIL PROTECTED] Subject: commercial websites powered by Lucene? Hello All, I've been trying to find examples of large commercial websites that use Lucene to power their search. Having such examples would make Lucene an easy sell to management. Does anyone know of any good examples? The bigger the better, and the more the better. TIA, -John
RE: commercial websites powered by Lucene?
I handle updates and inserts the same way: first I delete the document from the index, then I insert it (better safe than sorry). I batch my updates/inserts every twenty minutes; I would use smaller intervals, but since I have to sync the XML files created from the DB to three machines (I maintain three separate Lucene indices on my three separate web servers), it takes a little longer. You have to batch your changes because updating the index takes time, as opposed to deletes, which I batch every two minutes. You won't have a problem updating the index and searching at the same time, because Lucene updates the index on a separate set of files and then, when it's done, it overwrites the old version. I've had to provide for backups, and for things like server crashes mid-indexing, but I was using Oracle Intermedia before and Lucene BLOWS IT AWAY.
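Nader's scheme can be sketched as two pending queues flushed on different cycles. Here a plain map stands in for the index, and all names are invented for illustration; in real code the flush methods would drive Lucene's delete and add calls from a scheduler:

```java
import java.util.*;

// Updates are treated as delete-then-add and flushed on a slow cycle;
// bare deletes are cheap, so they flush on a faster cycle.
class BatchingUpdater {
    private final Map<String, String> index = new HashMap<>();       // stand-in for the Lucene index
    private final Map<String, String> pendingUpserts = new LinkedHashMap<>();
    private final Set<String> pendingDeletes = new LinkedHashSet<>();

    void queueUpsert(String id, String doc) { pendingUpserts.put(id, doc); }
    void queueDelete(String id)             { pendingDeletes.add(id); }

    // Fast cycle (Nader: every two minutes): deletes only.
    void flushDeletes() {
        for (String id : pendingDeletes) index.remove(id);
        pendingDeletes.clear();
    }

    // Slow cycle (Nader: every twenty minutes): delete-then-add, which is
    // how an index without in-place update avoids duplicating a document
    // (redundant for a map, essential for Lucene).
    void flushUpserts() {
        for (Map.Entry<String, String> e : pendingUpserts.entrySet()) {
            index.remove(e.getKey());
            index.put(e.getKey(), e.getValue());
        }
        pendingUpserts.clear();
    }

    String get(String id) { return index.get(id); }
}
```

The two flush methods would be driven by timers (Nader uses a servlet scheduler); they are plain methods here so the batching behavior itself is visible.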
Re: commercial websites powered by Lucene?
So you have a holding table in a database (or a directory on disk?) where you store the incoming documents, correct? Does each webserver run its own indexing thread which grabs any new documents every 20 minutes, or is there a central process that manages that? I'm trying to understand how you know when you can safely clean out the holding table. Did you look at having just a single process that was responsible for updating the index, and then pushing copies out to all the webservers? I'm wondering if that might be worth investigating (since it would take a lot of load off the webservers that are running the searches), or if it would be too troublesome in practice. Also, I'm interested to see how you handle the situation when a server gets shut down/restarted - does it just take a copy of the index from one of the other servers (since its own index is likely out of date)? I take it it's not safe to copy an index while it is being updated, so you have to block on that somehow? PS: It's great to hear Lucene blows Oracle out of the water! I've got some skeptical management that needs some convincing; hearing stories like this helps a lot :-)
Re: commercial websites powered by Lucene?
Nader, You say you have to cope with server crashes mid-indexing. I think I'm seeing lots of garbage files created by a server crash mid merge/optimise, while Lucene is creating a new index. Did you write code specifically to handle this, or is there something more automated? (I was thinking of writing a sanity check to run before start-up that looked in 'segments' and 'deletable' and got rid of any files in the catalog directory that are not referenced.) Did you do something similar, or have I missed something? TIA Gareth
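The sanity check Gareth proposes is essentially a set difference. A minimal sketch, assuming you have already extracted the referenced file names from the index metadata (how to parse them out of 'segments' is Lucene-version-specific and omitted here):

```java
import java.util.*;

// Given the file names referenced by the index metadata ('segments'
// plus 'deletable') and the actual directory listing, anything
// unreferenced is leftover crash debris and can be removed.
class OrphanFinder {
    static Set<String> orphans(Set<String> referenced, Set<String> onDisk) {
        Set<String> result = new TreeSet<>(onDisk);
        result.removeAll(referenced);
        result.remove("segments");     // metadata files are never orphans
        result.remove("deletable");
        return result;
    }
}
```

Running this only at start-up, before any writer is opened, avoids racing against a live merge that is legitimately creating new, not-yet-referenced files.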
RE: commercial websites powered by Lucene?
I have to store the information I am indexing in the database because the nature of our application requires it. On update of certain columns in a table, I create an XML file, which is then copied to directories on each of my web servers; separate Lucene apps running on separate machines then digest the information into separate indices. You also have to provide procedures that run periodically to ensure that all your indices are in sync with each other and with the DB (I run this once every three days, when the CPU usage on the machines is low). To update the index I have a servlet running off a scheduler in Resin (you could use any webserver; Orion's cool too). The up-side to distributing your search engines like this is that you have three active backups in case one gets corrupted (hasn't happened in two years), and the load on each machine is pretty low even during updates/optimizations every 20 minutes. If a server crashes, it's not a problem unless it happens mid-indexing; then you have to somehow remove the write locks created in the index directory (I just delete them, optimize, and re-start the update that crashed). Lucene destroyed Oracle on speed tests. We used to have to use our single monster DB machine for all the searching and indexing, which made the load on it pretty high, but now I have 0.5 load on all my CPUs and no need to buy new hardware.
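The periodic consistency pass Nader mentions boils down to comparing the set of document ids in the database with the set in each index. A sketch with invented names; real code would pull the two id sets from the DB and from the Lucene index respectively, then re-index one difference and purge the other:

```java
import java.util.*;

// Two set differences drive the repair work of a DB/index sync pass.
class SyncCheck {
    // ids present in the DB but absent from the index: need re-indexing
    static Set<String> toReindex(Set<String> dbIds, Set<String> indexIds) {
        Set<String> s = new TreeSet<>(dbIds);
        s.removeAll(indexIds);
        return s;
    }

    // ids present in the index but gone from the DB: need deletion
    static Set<String> toPurge(Set<String> dbIds, Set<String> indexIds) {
        Set<String> s = new TreeSet<>(indexIds);
        s.removeAll(dbIds);
        return s;
    }
}
```

Because the pass only reads id sets, it can run while searches continue, which is why scheduling it for a low-CPU window (every three days, in Nader's setup) is mainly about the re-indexing work it triggers, not the comparison itself.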
RE: commercial websites powered by Lucene?
Hi Nader, This thread is by far one of the best and most practical. It will only be topped when someone provides benchmarks for a DMOZ.org-type directory of 3 million plus URLs. I would love to, but the whole JavaCC thing is a show-stopper. Questions: I noticed that search is a little slow. What has been your experience? Perhaps it was a bandwidth issue, but I'm living in a country with the greatest internet connectivity and penetration in the world (South Korea), so I don't think the issue is on my end. You have 500,000 resumes. Based on the steps you took to get to 500,000, do you think your current setup will scale to millions, like say, 3 million or so? What is your hardware like? CPU/RAM? Warm regards, and thanks for sharing. If I can ever get past the Lucene/JavaCC installation failure, I'll share my benchmarks on the above directory scenario. John
RE: commercial websites powered by Lucene?
Because I've setup Lucene as a webapp with a centralized Init file and setup properties file, I do my sanity check in the Init, because if the serer crashes mid-indexing, I have to delete the lock files optimize and re-index the files that were indexing when the crash occurred, there was long discussion about this back in August, search for Crash / Recovery Scenario in the lucene-dev archived discussions. Should answer all your questions Nader Henein -Original Message- From: Gareth Griffiths [mailto:[EMAIL PROTECTED] Sent: Tuesday, June 24, 2003 1:11 PM To: Lucene Users List; [EMAIL PROTECTED] Subject: Re: commercial websites powered by Lucene? Nader, You say you have to cope with server crash mid-indexing. I think I'm seeing lots of garbage files created by server crash mid merge/optimise while lucene is creating a new index. Did you write code specifically to handle this or is there something more automated. (I was thinking of writing a sanity check for before start-up that looked in 'segments' and 'deletable and got rid of any files in the catalog directory that are not referenced.) Did you do something similar or have I missed something... TIA Gareth - Original Message - From: Nader S. Henein [EMAIL PROTECTED] To: 'Lucene Users List' [EMAIL PROTECTED] Sent: Tuesday, June 24, 2003 9:30 AM Subject: RE: commercial websites powered by Lucene? I handle updates or inserts the same way first I delete the document from the index and then I insert it (better safe than sorry), I batch my updates/inserts every twenty minutes, I would do it in smaller intervals but since I have to sync the XML files created from the DB to three machines (I maintain three separate Lucene indices on my three separate web-servers) it takes a little longer. You have to batch your changes because Updating the index takes time as opposed to deleted which I batch every two minutes. 
You won't have a problem updating the index and searching at the same time, because Lucene updates the index on a separate set of files and then, when it's done, overwrites the old version. I've had to provide for backups and things like server crashes mid-indexing, but I was using Oracle Intermedia before, and Lucene BLOWS IT AWAY. -Original Message- From: news [mailto:[EMAIL PROTECTED] On Behalf Of Chris Miller Sent: Tuesday, June 24, 2003 12:06 PM To: [EMAIL PROTECTED] Subject: Re: commercial websites powered by Lucene? Hi Nader, I was wondering if you'd mind me asking a couple of questions about your implementation? The main thing I'm interested in is how you handle updates to Lucene's index. I'd imagine you have a fairly high turnover of CVs and jobs, so index updates must place a reasonable load on the CPU/disk. Do you keep CVs and jobs in the same index or in two different ones? And what is the process you use to update the index(es) - do you batch-process updates, or do you handle them in real time as changes are made? Any insight you can offer would be much appreciated, as I'm about to implement something similar and am a little unsure of the best approach to take. We need to be able to handle indexing about 60,000 documents/day while allowing (many) searches to continue operating alongside. Thanks! Chris Nader S. Henein [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] We use Lucene at http://www.bayt.com ; we're basically an on-line recruitment site, and up until now we've got around 500,000 CVs and documents indexed, with results that stump Oracle Intermedia. Nader Henein Senior Web Dev Bayt.com -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 04, 2003 6:09 PM To: [EMAIL PROTECTED] Subject: commercial websites powered by Lucene? Hello All, I've been trying to find examples of large commercial websites that use Lucene to power their search.
Having such examples would make Lucene an easy sell to management. Does anyone know of any good examples? The bigger the better, and the more the better. TIA, -John - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
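The sanity check Gareth describes (before start-up, look at what 'segments' and 'deletable' reference and delete everything else in the index directory) can be sketched as follows. This is an illustrative Python model, not Lucene's Java API; parsing Lucene's actual bookkeeping files is version-specific and omitted, so the function name and the idea of passing the referenced file names in as a set are assumptions of mine:

```python
import os
import tempfile

def clean_unreferenced(index_dir, referenced):
    """Delete files in index_dir that the index no longer references.

    'referenced' stands in for the file names recovered from Lucene's
    'segments' and 'deletable' bookkeeping files.
    """
    removed = []
    for name in sorted(os.listdir(index_dir)):
        if name not in referenced:
            os.remove(os.path.join(index_dir, name))
            removed.append(name)
    return removed
```

Run against a directory left over from a crash, this would drop half-written merge output while keeping every file the live index still points at.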
RE: commercial websites powered by Lucene?
The search is a little sluggish because our initial architecture was based on TCL, not Java. Until we complete the full Java overhaul, every time I perform a search the AOL webserver (TCL) has to call the servlet in Resin (where Lucene is) and then perform the search; then (this is the killer) I have to parse all the results from a Java Collection into a TCL list. The most intense search, with thousands of results, takes less than a second; it's all the things I have to do around it that take time. Nader -Original Message- From: John Takacs [mailto:[EMAIL PROTECTED] Sent: Tuesday, June 24, 2003 1:52 PM To: Lucene Users List Subject: RE: commercial websites powered by Lucene? Hi Nader, This thread is by far one of the best, and most practical. It will only be topped when someone provides benchmarks for a DMOZ.org-type directory of 3 million plus URLs. I would love to, but the whole JavaCC thing is a show stopper. Questions: I noticed that search is a little slow. What has been your experience? Perhaps it was a bandwidth issue, but I'm living in the country with the greatest internet connectivity and penetration in the world (South Korea), so I don't think that is an issue on my end. You have 500,000 resumes. Based on the steps you took to get to 500,000, do you think your current setup will scale to millions, say, 3 million or so? What is your hardware like? CPU/RAM? Warm regards, and thanks for sharing. If I can ever get past the Lucene/JavaCC installation failure, I'll share my benchmarks on the above directory scenario. John -Original Message- From: Nader S. Henein [mailto:[EMAIL PROTECTED] Sent: Tuesday, June 24, 2003 5:30 PM To: 'Lucene Users List' Subject: RE: commercial websites powered by Lucene?
I handle updates and inserts the same way: first I delete the document from the index, then I insert it (better safe than sorry). I batch my updates/inserts every twenty minutes; I would use smaller intervals, but since I have to sync the XML files created from the DB to three machines (I maintain three separate Lucene indices on my three separate web-servers), it takes a little longer. You have to batch your changes because updating the index takes time, as opposed to deletes, which I batch every two minutes. You won't have a problem updating the index and searching at the same time, because Lucene updates the index on a separate set of files and then, when it's done, overwrites the old version. I've had to provide for backups and things like server crashes mid-indexing, but I was using Oracle Intermedia before, and Lucene BLOWS IT AWAY. -Original Message- From: news [mailto:[EMAIL PROTECTED] On Behalf Of Chris Miller Sent: Tuesday, June 24, 2003 12:06 PM To: [EMAIL PROTECTED] Subject: Re: commercial websites powered by Lucene? Hi Nader, I was wondering if you'd mind me asking a couple of questions about your implementation? The main thing I'm interested in is how you handle updates to Lucene's index. I'd imagine you have a fairly high turnover of CVs and jobs, so index updates must place a reasonable load on the CPU/disk. Do you keep CVs and jobs in the same index or in two different ones? And what is the process you use to update the index(es) - do you batch-process updates, or do you handle them in real time as changes are made? Any insight you can offer would be much appreciated, as I'm about to implement something similar and am a little unsure of the best approach to take. We need to be able to handle indexing about 60,000 documents/day while allowing (many) searches to continue operating alongside. Thanks! Chris Nader S.
Henein [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] We use Lucene at http://www.bayt.com ; we're basically an on-line recruitment site, and up until now we've got around 500,000 CVs and documents indexed, with results that stump Oracle Intermedia. Nader Henein Senior Web Dev Bayt.com -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 04, 2003 6:09 PM To: [EMAIL PROTECTED] Subject: commercial websites powered by Lucene? Hello All, I've been trying to find examples of large commercial websites that use Lucene to power their search. Having such examples would make Lucene an easy sell to management. Does anyone know of any good examples? The bigger the better, and the more the better. TIA, -John - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
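Nader's delete-then-insert update rule can be sketched in a few lines. A plain dict stands in for the index here (Lucene itself deletes by term via an IndexReader and adds via an IndexWriter); `apply_batch` and the `"id"` field are illustrative names of mine, not anything from the thread:

```python
def apply_batch(index, batch):
    """Treat every incoming document as delete-then-insert, so updates
    and brand-new inserts share one code path (better safe than sorry)."""
    for doc in batch:
        index.pop(doc["id"], None)  # delete any stale copy; no-op for new docs
        index[doc["id"]] = doc      # insert the fresh version
    return index
```

The point of the pattern is that the writer never has to know whether a document already exists, which is what makes batching the changes every twenty minutes so simple.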
Re: commercial websites powered by Lucene?
- Original Message - From: Chris Miller [EMAIL PROTECTED] Did you look at having just a single process that was responsible for updating the index and then pushing copies out to all the webservers? I'm wondering if that might be worth investigating (since it would take a lot of load off the webservers that are running the searches), or if it will be too troublesome in practice. I've found that pulling information from a central source is simpler than pushing information. When information is pushed, there is a lot of administration on the central server to track the recipient machines: servers get added to and dropped from the push list, and you also need to account for servers that stop responding. When information is pulled from the central source, these coordination issues are eliminated. David Medinets http://www.codebits.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: commercial websites powered by Lucene?
Chris Miller wrote: The main thing I'm interested in is how you handle updates to Lucene's index. I'd imagine you have a fairly high turnover of CVs and jobs, so index updates must place a reasonable load on the CPU/disk. Do you keep CVs and jobs in the same index or two different ones? And what is the process you use to update the index(es) - do you batch-process updates or do you handle them in real-time as changes are made? The way we do it: we re-index everything periodically in a temporary directory and then rename the temporary directory. That way the index remains accessible at all times and its currency is simply determined by the interval I run the re-indexing in. We need to be able to handle indexing about 60,000 documents/day, while allowing (many) searches to continue operating alongside. On an entry-level Sun I can index about 23 documents per second and these are real-life HTML pages. Thus in less than one hour you would be finished with a complete index run and save yourself all kinds of trouble with crashes during indexing etc. On my 2 GHz Linux workstation it's even faster: more than 2000 documents per minute, so you'd be done in half an hour. BTW, we're not using the supplied JavaCC-based HTML parser, instead we got htmlparser.sourceforge.net, which is a joy to use and pretty fast. Ulrich - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: commercial websites powered by Lucene?
Thanks David, that's about what I figured. Of course, if the servers are pulling the information, then a central holding table that contains only new data doesn't make much sense anymore. Instead, I guess the easiest approach would be to have a central table that contains the entire dataset and has last-modified timestamps on each record, so the individual webservers can grab just the data that has changed since they last ran an index update. My concern still is that the effort of indexing (which is potentially quite high) is being duplicated across all the webservers. Is there any reason why it would be a bad idea to have one machine responsible for grabbing updates and adding documents to a master index, so the other servers could periodically grab a copy of that index and hot-swap it with their previous copy? Is Lucene capable of handling that scenario? It seems to me that this approach would reduce the stress on the webservers even more, and even if the indexing server went down, the webservers would still have a stale index to search against. Has anyone attempted something like this? David Medinets [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] - Original Message - From: Chris Miller [EMAIL PROTECTED] Did you look at having just a single process that was responsible for updating the index and then pushing copies out to all the webservers? I'm wondering if that might be worth investigating (since it would take a lot of load off the webservers that are running the searches), or if it will be too troublesome in practice. I've found that pulling information from a central source is simpler than pushing information. When information is pushed, there is a lot of administration on the central server to track the recipient machines: servers get added to and dropped from the push list, and you also need to account for servers that stop responding. When information is pulled from the central source, these coordination issues are eliminated.
David Medinets http://www.codebits.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
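The last-modified-timestamp pull that Chris describes reduces to a filter plus a bookmark. A sketch under assumed names (`pull_changes`, a `modified` column; the thread only describes the idea, not an implementation):

```python
def pull_changes(central_rows, last_sync):
    """Each webserver pulls only the rows modified since its own last
    successful sync, then advances its bookmark to the newest timestamp
    it saw."""
    changed = [r for r in central_rows if r["modified"] > last_sync]
    new_sync = max((r["modified"] for r in changed), default=last_sync)
    return changed, new_sync
```

In practice the filter would be a WHERE clause against the central table rather than an in-memory scan, but the bookmark logic is the same.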
Re: commercial websites powered by Lucene?
Thanks for your comments Ulrich. I just posted a message asking if anyone had attempted this approach! Sounds like you have, and it works :-) Thanks for the information; this sounds pretty close to what my preferred approach would be. You say you get 2000 docs/minute. I've done some benchmarking and managed to get our data indexing at ~1000/minute on an Athlon 1800+ (and most of that speed was achieved by bumping the IndexWriter.mergeFactor up to 100 or so). Our data is coming from a database table; each record contains about 40 fields, and I'm indexing 8 of those fields (an ID, 4 number fields, 3 text fields including one that has ~2k of text). Does this sound reasonable to you, or do you have any tips that might improve that performance? Ulrich Mayring [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] Chris Miller wrote: The main thing I'm interested in is how you handle updates to Lucene's index. I'd imagine you have a fairly high turnover of CVs and jobs, so index updates must place a reasonable load on the CPU/disk. Do you keep CVs and jobs in the same index or in two different ones? And what is the process you use to update the index(es) - do you batch-process updates, or do you handle them in real time as changes are made? The way we do it: we re-index everything periodically in a temporary directory and then rename the temporary directory. That way the index remains accessible at all times, and its currency is simply determined by the interval I run the re-indexing at. We need to be able to handle indexing about 60,000 documents/day while allowing (many) searches to continue operating alongside. On an entry-level Sun I can index about 23 documents per second, and these are real-life HTML pages. Thus in less than one hour you would be finished with a complete index run and save yourself all kinds of trouble with crashes during indexing etc. On my 2 GHz Linux workstation it's even faster: more than 2000 documents per minute, so you'd be done in half an hour.
BTW, we're not using the supplied JavaCC-based HTML parser, instead we got htmlparser.sourceforge.net, which is a joy to use and pretty fast. Ulrich - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: commercial websites powered by Lucene?
About 100 documents every twenty minutes, but it fluctuates depending on how much traffic is on the site. -Original Message- From: news [mailto:[EMAIL PROTECTED] On Behalf Of Chris Miller Sent: Tuesday, June 24, 2003 3:28 PM To: [EMAIL PROTECTED] Subject: Re: commercial websites powered by Lucene? Hmm, good point about the cost of copying indices in a distributed environment, although that is unlikely to affect us in the foreseeable future. But, noted! Do you have any rough statistics on how many documents you index per day, or every 20 minutes? This discussion is fantastic, by the way; lots of great experience and comments coming out here. Thanks, it's really appreciated. Nader S. Henein [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] We thought of that in the beginning, and then we became more comfortable with multiple indices for simple backup purposes. Now our indices are in excess of 100 megs, and transferring that kind of data between three machines sitting in the same data center is passable, but once you start thinking of distributed webservers in different hosting facilities, copying 100 megs every 20 minutes, or even every hour, becomes financially expensive. Our webservers are single-processor Sun UltraSPARC III 400 MHz machines with two gigs of memory, and I've never seen the CPU usage go over 0.8 at peak time with the indexer running. Try it out first; take your time to gather your own numbers so you can really get a feel for what setup fits you best. Nader - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: commercial websites powered by Lucene?
We were using Oracle Intermedia before we switched to Lucene, and Lucene has been much faster; it has also allowed us to distribute our search functionality over multiple servers. Intermedia, which is supposedly one of the best in the business, couldn't hold a candle to Lucene, and our Oracle installation and setup is impeccable: we spent years perfecting it before we decided to separate from Intermedia and use Oracle as a DBMS, not a search engine. Also, when you use Lucene rather than a proprietary product like Intermedia, you can switch databases at will if licensing fees become too high to ignore. Nader -Original Message- From: news [mailto:[EMAIL PROTECTED] On Behalf Of Ulrich Mayring Sent: Tuesday, June 24, 2003 3:40 PM To: [EMAIL PROTECTED] Subject: Re: commercial websites powered by Lucene? Chris Miller wrote: Thanks for your comments Ulrich. I just posted a message asking if anyone had attempted this approach! Sounds like you have, and it works :-) Thanks for the information; this sounds pretty close to what my preferred approach would be. This is a good approach if the total number of documents doesn't grow too much. There's obviously a limit to full index runs at some point. You say you get 2000 docs/minute. I've done some benchmarking and managed to get our data indexing at ~1000/minute on an Athlon 1800+ (and most of that speed was achieved by bumping the IndexWriter.mergeFactor up to 100 or so). Our data is coming from a database table; each record contains about 40 fields, and I'm indexing 8 of those fields (an ID, 4 number fields, 3 text fields including one that has ~2k of text). Does this sound reasonable to you, or do you have any tips that might improve that performance?
You need to find out where you lose most of the time: a) in data access (your database could be too slow; in my case I am scanning the local filesystem), b) in parsing (probably not an issue when reading from a DB, but in my case it is, since I have HTML files), or c) in indexing. I haven't gone to the trouble of finding that out for my app, because it is fast enough the way it is. However, I do wonder: if you have your data in a database anyway, why not use the database's indexing features? It seems like Lucene is an additional layer on top of your data, which you don't really need. cheers, Ulrich - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
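Ulrich's three candidate bottlenecks (data access, parsing, indexing) are easy to separate with a timing harness like the sketch below. The function name and the fetch/parse/index callback split are my own illustration of the idea, not code from the thread:

```python
import time

def profile_stages(records, fetch, parse, index):
    """Accumulate wall-clock time spent in each of the three stages
    so you know which one to tune."""
    timings = {"fetch": 0.0, "parse": 0.0, "index": 0.0}
    for rec in records:
        t0 = time.perf_counter()
        raw = fetch(rec)                                  # a) data access
        timings["fetch"] += time.perf_counter() - t0
        t0 = time.perf_counter()
        doc = parse(raw)                                  # b) parsing
        timings["parse"] += time.perf_counter() - t0
        t0 = time.perf_counter()
        index(doc)                                        # c) indexing
        timings["index"] += time.perf_counter() - t0
    return timings
```

Whichever bucket dominates tells you whether to tune the database query, the parser, or the index writer settings.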
Re: commercial websites powered by Lucene?
This is a good approach if the number of total documents doesn't grow too much. There's obviously a limit to full index runs at some point. Well I was actually going to go with incremental indexing, since a full reindex will probably take ~1 hour. We have a relatively fixed size of data, but the data is updated very frequently - almost 100% turnover/day. You need to find out where you lose most of the time: Fair enough, I haven't tried much in the way of profiling yet. I just thought you might have found some Lucene settings that made a big difference for you, or you'd found indexing into a RAMDirectory then dumping it to disk was faster, etc. But it sounds like you're pretty happy with near default settings. However, what I wonder: if you have your data in a database anyway, why not use the database's indexing features? It seems like Lucene is an additional layer on top of your data, which you don't really need. Our current DB server (running SQL Server) is under enormous strain, partly due to the complex searches that are being performed against it. We've got it pretty heavily tweaked already, so I don't think there's too much room to improve on that front. The idea is to use Lucene to take the searching load off it so it can get on with all the other tasks it has to perform. The Lucene implementation I'm working on here is just a proof of concept - it may be that we stay with SQL Server in the long run anyway, but Lucene definitely seems to be worth investigating - it has certainly worked well for us on smaller projects. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
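The RAMDirectory-then-dump idea Chris mentions (batch documents in memory, merge into the on-disk index in one step rather than many small writes) can be modeled like this. A dict stands in for the on-disk index, and the class and parameter names are assumptions of mine; in Lucene itself this would be a RAMDirectory merged via the IndexWriter:

```python
class BufferedIndexer:
    """Buffer documents in memory, then merge them into the persistent
    index in one step once the buffer fills up."""

    def __init__(self, disk_index, flush_at=1000):
        self.disk_index = disk_index  # dict standing in for the on-disk index
        self.flush_at = flush_at
        self.buffer = {}
        self.flushes = 0

    def add(self, doc):
        self.buffer[doc["id"]] = doc
        if len(self.buffer) >= self.flush_at:
            self.flush()

    def flush(self):
        if self.buffer:
            self.disk_index.update(self.buffer)  # one big merge, not many small writes
            self.buffer.clear()
            self.flushes += 1
```

The trade-off is the usual one: a larger `flush_at` means fewer expensive disk merges, but more work is lost if the process dies before a flush.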
Re: commercial websites powered by Lucene?
- Original Message - From: Chris Miller [EMAIL PROTECTED] Thanks David, that's about what I figured. Of course if the servers are pulling the information then a central holding table that contains only new data doesn't make much sense anymore. Instead I guess the easiest approach would be to have a central table that contains the entire dataset The following commentary might have no bearing on Lucene or relevance to today's technology, but I feel garrulous this morning. Each pulling server did a three-step dance when updating. First, the central server (Oracle) was polled to get the latest data (actually we pulled it all, because there were only 30,000 records). Next, a text file was created (the format is unimportant; use whatever is easiest for your application). Then that text file was read to update the local datastore. The advantage of this rigamarole was to allow the servers to fail and be restored without needing to poll the central server. We had 400 servers in the cluster, and at times many of them would fail (this was in 1999, don't be critical!). If many systems pulled data from the central server, the process would slow down, which started another round of failures. To avoid that vicious circle of failures, all of the systems could reboot independently. David Medinets http://www.codebits.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: commercial websites powered by Lucene?
We use Lucene at http://www.bayt.com ; we're basically an on-line recruitment site, and up until now we've got around 500,000 CVs and documents indexed, with results that stump Oracle Intermedia. Nader Henein Senior Web Dev Bayt.com -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 04, 2003 6:09 PM To: [EMAIL PROTECTED] Subject: commercial websites powered by Lucene? Hello All, I've been trying to find examples of large commercial websites that use Lucene to power their search. Having such examples would make Lucene an easy sell to management. Does anyone know of any good examples? The bigger the better, and the more the better. TIA, -John - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: commercial websites powered by Lucene?
A few big names are listed in the 1st Lucene article on Onjava.com, if I recall correctly. Otis --- [EMAIL PROTECTED] wrote: Hello All, I've been trying to find examples of large commercial websites that use Lucene to power their search. Having such examples would make Lucene an easy sell to management. Does anyone know of any good examples? The bigger the better, and the more the better. TIA, -John - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: commercial websites powered by Lucene?
http://search.163.com : the China portal NetEase uses Lucene for directory search and news search. Che, Dong http://www.chedong.com - Original Message - From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Wednesday, June 04, 2003 10:08 PM Subject: commercial websites powered by Lucene? Hello All, I've been trying to find examples of large commercial websites that use Lucene to power their search. Having such examples would make Lucene an easy sell to management. Does anyone know of any good examples? The bigger the better, and the more the better. TIA, -John - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]