RE: Which searched words are found in a document
Take a look at the highlighter code; you could implement this on the front end while processing the page.

Nader

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 25, 2004 10:51 AM
To: [EMAIL PROTECTED]
Subject: Which searched words are found in a document

Hi,

I have the following question: is there an easy way to see which words from a query were found in a resulting document? If I search for 'cat OR dog' and get a result document containing only 'cat', I would like to ask the Searcher object (or something similar) to tell me that 'cat' was the only word found in that document. I did see it is somehow possible with the explain() method, but that does not give a clean answer. I could also get the contents of the document and do an indexOf() for each search term, but in our case there could be quite a lot of terms. Any suggestions?

Thanks,
Edvard Scheffers

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: SELECTIVE Indexing
So you basically want to index only the parts of your document within <table> ... </table> tags. I'm not sure if there's an easier way, but here's what I do:

1) Parse the XML files using JDOM (or any XML parser that floats your boat) into a Map or an ArrayList.
2) Create a Lucene document and loop through the aforementioned structure (Map or ArrayList), adding field/value pairs to it like so:

contentDoc.add(new Field(fieldName, fieldValue, true, true, true));

So all you would need to do is put an if statement around the latter call, to the effect of:

if (fieldName.equalsIgnoreCase("table")) {
    contentDoc.add(new Field(fieldName, fieldValue, true, true, true));
}

This may be overkill; someone feel free to correct me if I'm wrong.

Nader

-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 19, 2004 1:01 PM
To: Lucene Users List
Subject: RE: SELECTIVE Indexing

Hey Lucene Users,

My original intention was to index certain portions of the HTML [not the whole document]. If JTidy does not support this, then what are my options?

Karthik

-Original Message-
From: Viparthi, Kiran (AFIS) [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 19, 2004 1:43 PM
To: 'Lucene Users List'
Subject: RE: SELECTIVE Indexing

I doubt if it can be used as a plug-in. Would be good to know if it can.

Regards,
Kiran.

-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: 17 May 2004 12:30
To: Lucene Users List
Subject: RE: SELECTIVE Indexing

Hi,

Can I use Tidy [as a plug-in] with Lucene?

With regards,
Karthik

-Original Message-
From: Viparthi, Kiran (AFIS) [mailto:[EMAIL PROTECTED]
Sent: Monday, May 17, 2004 3:27 PM
To: 'Lucene Users List'
Subject: RE: SELECTIVE Indexing

Try using Tidy. It creates a Document from the HTML and allows you to apply XPath.

Hope this helps.
Kiran.
-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: 17 May 2004 11:59
To: Lucene Users List
Subject: SELECTIVE Indexing

Hi all,

Can somebody tell me how to index only a CERTAIN PORTION of an HTML file, e.g. <table> ... </table>?

With regards,
Karthik
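The parse-then-filter approach suggested in this thread can be sketched with the JDK's built-in DOM parser instead of JDOM. This is only an illustration under assumptions: the class and method names are made up, and "table" stands in for whichever element you actually want to keep.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class SelectiveFields {

    // Walk the root element's children and keep only the fields we
    // want to index (here: "table"), mirroring the if-statement above.
    public static Map<String, String> extractFields(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> fields = new LinkedHashMap<>();
            for (Node n = doc.getDocumentElement().getFirstChild();
                 n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE
                        && ((Element) n).getTagName().equalsIgnoreCase("table")) {
                    fields.put(((Element) n).getTagName(), n.getTextContent());
                }
            }
            return fields;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Each surviving entry would then become a Field on the Lucene document, exactly as in the contentDoc.add(...) line above.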
RE: change directory
When my server restarts, I have a little procedure that validates and sorts out the index in case the server crashed mid-indexing/optimizing. It checks for locks and frees them if need be, then optimizes the whole thing (as a precaution). Here's the code I use; try it out in your Lucene init:

try {
    Directory directory = FSDirectory.getDirectory(indexPath, false);
    if (directory.list().length == 0)
        clear(); // Create a new index
    Lock writeLock = directory.makeLock(writeFileName);
    if (!writeLock.obtain()) {
        IndexReader.unlock(directory);
    } else {
        writeLock.release();
    }
} catch (IOException e) {
    logger.error("Index Validate", e);
}

Try it out, hope it helps.

Nader Henein

-Original Message-
From: Rosen Marinov [mailto:[EMAIL PROTECTED]
Sent: Monday, May 03, 2004 5:52 PM
To: Lucene Users List
Subject: change directory

Hi all,

I have a good working index, about 3 GB, in one directory, for example c:/index1. Now I want to move it to another computer and directory, for example d:/index2 (is this possible?). When I copy it to the new PC and directory, on IndexReader(indexPath) I get:

java.io.IOException: Lock obtain timed out
    at org.apache.lucene.store.Lock.obtain(Lock.java:97)
    at org.apache.lucene.store.Lock$With.run(Lock.java:147)
    at org.apache.lucene.index.IndexReader.open

Before copying I closed all Java applications; the index was closed (writers, readers, searchers, terms, etc.). I have finally clauses to close all of these, a shutdown function, and all my methods that work with the index are synchronized.

Thanks for any help in advance.
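The lock-freeing part of that startup check can also be done with plain file operations. Note the assumptions: in Lucene of that era the lock files were named write.lock and commit.lock and, depending on version and configuration, lived either in the index directory or the system temp directory; this sketch assumes they are in the index directory, and is only safe while no other process is touching the index.

```java
import java.io.File;

public class StaleLockCleaner {

    // Lock-file names used by early Lucene versions (an assumption;
    // check your version's FSDirectory for the actual names/location).
    private static final String[] LOCK_NAMES = { "write.lock", "commit.lock" };

    // Remove leftover lock files from a crashed indexing run.
    // Returns how many lock files were deleted.
    public static int clearStaleLocks(File indexDir) {
        int removed = 0;
        for (String name : LOCK_NAMES) {
            File lock = new File(indexDir, name);
            if (lock.exists() && lock.delete()) {
                removed++;
            }
        }
        return removed;
    }
}
```

Calling this once at startup, before the first IndexReader or IndexWriter is opened, is the file-level equivalent of the obtain/unlock dance in the snippet above.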
RE: Disappearing segments
You're catching an exception and acting on it, but you're not reporting it. For now, comment out the deletion and copy-from-backup code and report the errors instead; if the batch is failing on a regular basis you want to know about it. Also watch out: if you back up the index during an indexing run you could end up with a limp index missing a few files, hence the missing segments. I would check for write and commit locks pre-backup to avoid that.

This is probably caused by two unrelated errors: first batchindex() fails, then the backup restores a version that may not have all the index files there (depending on when it was backed up), thereby giving you the feeling that segments are disappearing randomly.

Hope this helps.

Nader Henein

-Original Message-
From: Kelvin Tan [mailto:[EMAIL PROTECTED]
Sent: Monday, May 03, 2004 6:52 AM
To: Lucene Users List
Subject: RE: Disappearing segments

Thanks for responding, Nader. Hmm... you've hit the nail on the head. I do have a cron job which backs up the index; it's run as a batch-index scheduled job. The logic is basically:

backupindex();
try {
    batchindex();
} catch (Exception e) {
    deleteindex();
    copyfrombackuptoindex();
    deletebackup();
}

I assume that the original index before backing up was complete and 'working'. I'm also deleting the index that failed, instead of just overwriting. Where did I go wrong? I'm not checking that the index isn't write-locked before backing up, but I don't think that's the problem (though it very well could be a separate one).

Kelvin

On Fri, 30 Apr 2004 23:20:42 +0400, Nader Henein said:

Could you share your indexing code? And just to make sure: is there anything running on your machine that could delete these files, like a cron job that backs up the index? You could go by process of elimination and shut down your server to see if the files disappear, because if the problem is contained within the server you know you can safely go on the DEBUG rampage.
Nader

-Original Message-
From: Kelvin Tan [mailto:[EMAIL PROTECTED]
Sent: Friday, April 30, 2004 9:15 AM
To: Lucene Users List
Subject: Re: Disappearing segments

An update: Daniel Naber suggested using IndexWriter.setUseCompoundFile() to see if it happens with the compound index format. Before I had a chance to try it out, this happened:

java.io.FileNotFoundException: C:\index\segments (The system cannot find the file specified)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:200)
    at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:321)
    at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:329)
    at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:71)
    at org.apache.lucene.index.IndexWriter$1.doBody(IndexWriter.java:154)
    at org.apache.lucene.store.Lock$With.run(Lock.java:116)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:149)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:131)

So even the segments file somehow got deleted. Hoping someone can shed some light on this...

Kelvin

On Thu, 29 Apr 2004 11:45:36 +0800, Kelvin Tan said:

Errr, sorry for the cross-post to lucene-dev as well, but I realized this mail really belongs on lucene-user...
I've been experiencing intermittently disappearing segments which result in the following stack trace:

Caused by: java.io.FileNotFoundException: C:\index\_1ae.fnm (The system cannot find the file specified)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:200)
    at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:321)
    at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:329)
    at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
    at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:78)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:104)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95)
    at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:112)
    at org.apache.lucene.store.Lock$With.run(Lock.java:116)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:103)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:91)
    at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:75)

The segment file that disappears (_1ae.fnm) varies. I can't seem to reproduce this error consistently, so I don't have a clue what might cause it, but it usually happens after the application has been running for some time. Has anyone experienced something similar, or can anyone point me in the right direction? When this occurs, I need to rebuild the entire index for it to be usable. Very troubling indeed...

Kelvin
RE: Documents the same search is done many times.
The short answer is: it's up to you :-) Lucene doesn't know which field is your primary key (you're thinking like a DB programmer). If you add the new document with ID=one without deleting the old one from the index, then when you search you'll get two documents, pig and mongoose; but if you delete all documents with ID=one and then index your new document, you'll only get mongoose.

From a DBA perspective, Lucene is like a table with a unique ID on each document, that being the Lucene-assigned doc ID (which changes every time you optimize, but nevertheless remains unique). All other columns, whether indexed, tokenized, stored or not, can bear repetition. So if you want to implement a unique key like ID on your Lucene index, you'll have to do a little delete based on that ID field every time you insert a new document into the index. Quite simple, and I've been doing it for a few years now without fail.

Hope this helps,
Nader Henein
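The delete-then-insert pattern described above looks roughly like this against the Lucene 1.x API. This is a hedged sketch, not a drop-in implementation: the field name "id", the class name, the choice of StandardAnalyzer, and doing the delete through an IndexReader are all assumptions for illustration (it needs the Lucene jar on the classpath, and each call should really be batched rather than opening and closing per document).

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UniqueKeyUpsert {

    // Emulate a primary key: remove every document carrying this id,
    // then add the new version, so at most one copy survives per id.
    public static void upsert(String indexPath, Document doc, String id)
            throws Exception {
        IndexReader reader = IndexReader.open(indexPath);
        try {
            reader.delete(new Term("id", id)); // delete all docs with this id
        } finally {
            reader.close(); // releases the lock taken by delete()
        }
        // false = append to the existing index, don't recreate it
        IndexWriter writer =
                new IndexWriter(indexPath, new StandardAnalyzer(), false);
        try {
            doc.add(Field.Keyword("id", id)); // untokenized, stored key field
            writer.addDocument(doc);
        } finally {
            writer.close();
        }
    }
}
```

The key detail is that the "id" field is added as a Keyword (untokenized), so the delete-by-Term matches the stored value exactly.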
RE: Count for a keyword occurance in a file
Tricky. Scoring has to do with the frequency of the occurrence of the word, as opposed to the number of words in the file in general (somebody correct me if I'm wrong). So short of an educated approximation, you could hack the indexer to dynamically store the frequency of a word (oh so inadvisable). Personally I recommend the educated approximation: you could index the document with the number of words in it (you would have to make sure you're not using the stop-word analyzer or the Porter stemmer) and then, based on the score, reverse-engineer the result you want.

Nader Henein

-Original Message-
From: hemal bhatt [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 28, 2004 5:50 PM
To: Lucene Users List
Subject: Count for a keyword occurance in a file

Hi,

How can I get a count from the score given by Hits.score()? I.e., I want to know how many times a keyword occurs in a file. Any help on this would be appreciated.

Regards,
Hemal Bhatt
RE: Count for a keyword occurance in a file
So even an educated calculation won't do it, because you'd need to know how many documents the word occurs in (you could do a search, but that would be overkill and impractical). Cool.

-Original Message-
From: Ype Kingma [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 29, 2004 10:57 AM
To: Lucene Users List
Subject: Re: Count for a keyword occurance in a file

On Thursday 29 April 2004 08:14, Nader S. Henein wrote:

> Tricky. Scoring has to do with the frequency of the occurrence of the
> word, as opposed to the number of words in the file in general (somebody
> correct me if I'm wrong). So short of an educated approximation, you
> could hack

Lucene uses two frequencies for a term: the number of docs in which it occurs in an index (the basis for IDF), and the number of times the term occurs in a document.

> the indexer to dynamically store the frequency of a word (oh so
> inadvisable). Personally I recommend the educated approximation: you
> could index the document with the number of words in it (you would have
> to make sure you're not using the stop-word analyzer or the Porter
> stemmer) and then, based on the score, reverse-engineer the result you
> want.
>
> Nader Henein

> From: hemal bhatt [mailto:[EMAIL PROTECTED]
> Subject: Count for a keyword occurance in a file
>
> How can I get a count from the score given by Hits.score()? I.e., I want
> to know how many times a keyword occurs in a file. Any help on this
> would be appreciated.

The easiest way is to use an IndexReader. I don't know what you mean by "file" (index or document), but you can get both frequencies I mentioned above from an IndexReader, possibly using skipTo() to go to the document. The methods are docFreq(Term) and termDocs(Term).

Regards,
Ype
RE: sorting by date (XML)
Here's my two cents on this. Either way you will need to combine the date into one field, but if you use a millisecond representation you will not be able to use the FLOAT sort type and will have to use STRING sort (slower), because the millisecond representation is longer than FLOAT allows. So you have three options:

1) Use YYYYMMDD and sort by FLOAT type
2) Use the millisecond representation and sort by STRING type
3) If the date you're entering here is the date of indexing, then you can just sort by DOC type (which is the doc ID) and save yourself the pain

Hope this helps.

Nader Henein

-Original Message-
From: Michael Wechner [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 27, 2004 3:52 PM
To: Lucene Users List
Subject: sorting by date (XML)

My XML files contain something like

<date>
  <year>2004</year>
  <month>04</month>
  <day>27</day>
  ...
</date>

and I would like to sort by this date. So I guess I need to modify the document parser and generate something like a millisecond field and then sort by this, correct? Has anyone done something like this yet?

Thanks,
Michi

--
Michael Wechner
Wyona Inc. - Open Source Content Management - Apache Lenya
http://www.wyona.com http://cocoon.apache.org/lenya/
[EMAIL PROTECTED] [EMAIL PROTECTED]
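Option 1 is easy to build from the year/month/day elements: concatenate them, zero-padded, as YYYYMMDD. The resulting value sorts the same way as the underlying date whether compared numerically or lexically. A minimal sketch (the class and method names are illustrative):

```java
public class SortableDate {

    // Combine year/month/day into a zero-padded YYYYMMDD value that
    // sorts in the same order as the underlying date.
    public static String toYyyymmdd(int year, int month, int day) {
        return String.format("%04d%02d%02d", year, month, day);
    }
}
```

Because the field is fixed-width and zero-padded, "20040427" sorts before "20041201" under both STRING and numeric comparison, which is what makes the FLOAT sort type usable here.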
RE: searching only part of an index
You may be able to jimmy a filter into producing the most recent 100, but really, keeping your fetch count at 100 and ordering by DOC should be sufficient.

-Original Message-
From: Alan Smith [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 27, 2004 4:03 PM
To: [EMAIL PROTECTED]
Subject: searching only part of an index

Hi,

I wondered if anyone knows whether it is possible to search ONLY the 100 (or whatever) most recently added documents in a Lucene index? I know that once I have all my results ordered by ID number in Hits I could just display the required amount, but I wondered if there is a way to avoid searching all documents in the index in the first place?

Many thanks,
Alan

_
Express yourself with cool new emoticons http://www.msn.co.uk/specials/myemo
RE: searching only part of an index
Are the doc IDs sequential? Or just unique and ascending? I'm thinking like a good little Oracle boy, so does anyone know?

-Original Message-
From: Ioan Miftode [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 27, 2004 4:55 PM
To: Lucene Users List
Subject: Re: searching only part of an index

If you know the ID of the last document in the index (I don't know what's the best way to get it), you could probably use a range query, something like: find all docs with the ID in [lastId-100 TO lastId]. Maybe you should make sure that the first limit is non-negative, though. Just a thought.

ioan

At 08:02 AM 4/27/2004, you wrote:

Hi,

I wondered if anyone knows whether it is possible to search ONLY the 100 (or whatever) most recently added documents in a Lucene index? I know that once I have all my results ordered by ID number in Hits I could just display the required amount, but I wondered if there is a way to avoid searching all documents in the index in the first place?

Many thanks,
Alan
RE: searching only part of an index
So if Alan wants to limit it to the most recent 100, he can't really use a range search unless he can guarantee that the index is optimized after deletes. But if his deletion rounds are anything like mine (every 2 minutes), then optimizing at each delete will make searching the index really slow. Right?

Nader

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 27, 2004 5:15 PM
To: Lucene Users List
Subject: Re: searching only part of an index

On Apr 27, 2004, at 9:00 AM, Nader S. Henein wrote:

> Are the doc IDs sequential? Or just unique and ascending? I'm thinking
> like a good little Oracle boy, so does anyone know?

They are unique and ascending. Gaps in IDs exist when documents are removed, and then the IDs are squeezed back to completely sequential with no holes during an optimize.

Erik
RE: Segments file get deleted?!
Can you give us a bit of background? We've been using Lucene since the first stable release two years ago, and I've never had segments disappear on me. First, can you provide some background on your setup? Secondly, when you say "a certain period of time", how much time are we talking about, and does that interval coincide with your indexing schedule? You may have the create flag on the indexer set to true, so it simply recreates the index at every update and deletes whatever was there; of course, if there are no files to index at that point it will just give you a blank index.

Nader Henein

-Original Message-
From: Surya Kiran [mailto:[EMAIL PROTECTED]
Sent: Monday, April 26, 2004 7:48 AM
To: [EMAIL PROTECTED]
Subject: Segments file get deleted?!

Hi all,

We have implemented our portal search using Lucene. It works fine, but after a certain period of time the Lucene segments file gets deleted. Eventually all searches fail. Can anyone guess where the error could be?

Thanks a lot.

Regards,
Surya.
RE: converting text/doc to XML
We read from the database and parse the data into valid XML, then I hand the XML file over to Lucene, which in turn digests it and indexes the information.

N.

-Original Message-
From: Jagdip Singh [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 08, 2003 10:39 AM
To: 'Lucene Users List'; [EMAIL PROTECTED]
Subject: RE: converting text/doc to XML

Hi Nader,

As you talked about using Lucene for your http://www.bayt.com web site: do you convert CVs or any other documents to XML format before submitting them to Lucene for indexing?

Regards,
Jagdip

-Original Message-
From: Nader S. Henein [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 08, 2003 1:55 AM
To: 'Lucene Users List'
Subject: RE: converting text/doc to XML

XML is an organized, standardized format. So let's say your document has the following characteristics:

File name: foobar.doc
First line / title: Foo Bar
File content:
Blah blah blah blah
Blah blah blah blah
Blah blah blah blah
Blah blah blah blah

Then you have to read the file (a simple file read; Java can do this in about ten different ways, pick one), put each of the file's characteristics in a variable, and then write it out as valid XML:

<doc doc_id="1">
  <file_name>foobar.doc</file_name>
  <title>Foo Bar</title>
  <content>
    Blah blah blah blah
    Blah blah blah blah
    Blah blah blah blah
    Blah blah blah blah
  </content>
</doc>

There are probably packages that will do this for you, but it's so simple you could pull it off in under a hundred lines. It's also good exercise to familiarize yourself with XML (if you haven't played around with it before).

-Original Message-
From: Jagdip Singh [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 08, 2003 9:41 AM
To: 'Lucene Users List'
Subject: converting text/doc to XML

Hi,

How can I convert text/doc to XML? Please help.
Regards,
Jagdip
RE: converting text/doc to XML
XML is an organized, standardized format. So let's say your document has the following characteristics:

File name: foobar.doc
First line / title: Foo Bar
File content:
Blah blah blah blah
Blah blah blah blah
Blah blah blah blah
Blah blah blah blah

Then you have to read the file (a simple file read; Java can do this in about ten different ways, pick one), put each of the file's characteristics in a variable, and then write it out as valid XML:

<doc doc_id="1">
  <file_name>foobar.doc</file_name>
  <title>Foo Bar</title>
  <content>
    Blah blah blah blah
    Blah blah blah blah
    Blah blah blah blah
    Blah blah blah blah
  </content>
</doc>

There are probably packages that will do this for you, but it's so simple you could pull it off in under a hundred lines. It's also good exercise to familiarize yourself with XML (if you haven't played around with it before).

-Original Message-
From: Jagdip Singh [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 08, 2003 9:41 AM
To: 'Lucene Users List'
Subject: converting text/doc to XML

Hi,

How can I convert text/doc to XML? Please help.

Regards,
Jagdip
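The read-and-wrap step described above can be sketched in a few lines. One detail the prose glosses over is XML escaping: arbitrary file content may contain &, < and friends, which must be escaped before embedding. The element names match the example; the class and method names are made up for illustration.

```java
public class DocToXml {

    // Escape the five characters XML reserves, so arbitrary file
    // content can be embedded safely. Order matters: '&' goes first.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Wrap a file's metadata and content in the <doc> structure
    // shown in the example above.
    public static String toXml(int docId, String fileName,
                               String title, String content) {
        StringBuilder sb = new StringBuilder();
        sb.append("<doc doc_id=\"").append(docId).append("\">");
        sb.append("<file_name>").append(escape(fileName)).append("</file_name>");
        sb.append("<title>").append(escape(title)).append("</title>");
        sb.append("<content>").append(escape(content)).append("</content>");
        sb.append("</doc>");
        return sb.toString();
    }
}
```

Reading the file itself is the "about ten different ways" part; any of the standard java.io readers will do, with the result passed in as the content argument.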
RE: commercial websites powered by Lucene?
I handle updates and inserts the same way: first I delete the document from the index and then I insert it (better safe than sorry). I batch my updates/inserts every twenty minutes; I would do it in smaller intervals, but since I have to sync the XML files created from the DB to three machines (I maintain three separate Lucene indices on my three separate web servers), it takes a little longer. You have to batch your changes because updating the index takes time, as opposed to deletes, which I batch every two minutes. You won't have a problem updating the index and searching at the same time, because Lucene updates the index on a separate set of files and then, when it's done, overwrites the old version. I've had to provide for backups and things like server crashes mid-indexing, but I was using Oracle interMedia before and Lucene BLOWS IT AWAY.

-Original Message-
From: news [mailto:[EMAIL PROTECTED] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 12:06 PM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?

Hi Nader,

I was wondering if you'd mind me asking a couple of questions about your implementation? The main thing I'm interested in is how you handle updates to Lucene's index. I'd imagine you have a fairly high turnover of CVs and jobs, so index updates must place a reasonable load on the CPU/disk. Do you keep CVs and jobs in the same index or two different ones? And what is the process you use to update the index(es) - do you batch-process updates or do you handle them in real time as changes are made?

Any insight you can offer would be much appreciated, as I'm about to implement something similar and am a little unsure of the best approach to take. We need to be able to handle indexing about 60,000 documents/day, while allowing (many) searches to continue operating alongside.

Thanks!
Chris

Nader S. Henein [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

We use Lucene at http://www.bayt.com . We're basically an online recruitment site, and up until now we've got around 500,000 CVs and documents indexed, with results that stump Oracle interMedia.

Nader Henein
Senior Web Dev
Bayt.com

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 04, 2003 6:09 PM
To: [EMAIL PROTECTED]
Subject: commercial websites powered by Lucene?

Hello All,

I've been trying to find examples of large commercial websites that use Lucene to power their search. Having such examples would make Lucene an easy sell to management. Does anyone know of any good examples? The bigger the better, and the more the better.

TIA,
-John
RE: commercial websites powered by Lucene?
I have to store the information I am indexing in the database because the nature of our application requires it. On update of certain columns in a table I create an XML file, which is then copied to directories on each of my web servers; then separate Lucene apps, running on separate machines, digest the information into separate indices. You also have to provide procedures that run periodically to ensure that all your indices are in sync with each other and with the DB (I run this once every three days, when the CPU usage on the machines is low).

To update the index I have a servlet running off a scheduler in Resin (you could use any web server; Orion's cool too). The upside to distributing your search engines like this is that you have three active backups in case one gets corrupted (hasn't happened in two years), and the load on each machine is pretty low, even during updates/optimizations every 20 minutes. If the server crashes, it's not a problem unless it happens mid-indexing; then you have to somehow remove the write locks created in the index directory (I just delete them, optimize, and restart the update that crashed).

Lucene destroyed Oracle on speed tests. We used to have to use our single DB monster machine for all the searching and indexing, which made the load on it pretty high, but now I have 0.5 loads on all my CPUs and no need to buy new hardware.

-Original Message-
From: news [mailto:[EMAIL PROTECTED] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 1:12 PM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?

So you have a holding table in a database (or directory on disk?) where you store the incoming documents, correct? Does each web server run its own indexing thread which grabs any new documents every 20 minutes, or is there a central process that manages that? I'm trying to understand how you know when you can safely clean out the holding table.

Did you look at having just a single process that is responsible for updating the index, and then pushing copies out to all the web servers? I'm wondering if that might be worth investigating (since it would take a lot of load off the web servers that are running the searches), or if it would be too troublesome in practice. Also, I'm interested to see how you handle the situation when a server gets shut down/restarted - does it just take a copy of the index from one of the other servers (since its own index is likely out of date)? I take it it's not safe to copy an index while it is being updated, so you have to block on that somehow?

PS: It's great to hear Lucene blows Oracle out of the water! I've got some skeptical management that need convincing; hearing stories like this helps a lot :-)

Nader S. Henein [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

I handle updates and inserts the same way: first I delete the document from the index and then I insert it (better safe than sorry). I batch my updates/inserts every twenty minutes; I would do it in smaller intervals, but since I have to sync the XML files created from the DB to three machines (I maintain three separate Lucene indices on my three separate web servers), it takes a little longer. You have to batch your changes because updating the index takes time, as opposed to deletes, which I batch every two minutes. You won't have a problem updating the index and searching at the same time, because Lucene updates the index on a separate set of files and then, when it's done, overwrites the old version. I've had to provide for backups and things like server crashes mid-indexing, but I was using Oracle interMedia before and Lucene BLOWS IT AWAY.
RE: commercial websites powered by Lucene?
Because I've setup Lucene as a webapp with a centralized Init file and setup properties file, I do my sanity check in the Init, because if the serer crashes mid-indexing, I have to delete the lock files optimize and re-index the files that were indexing when the crash occurred, there was long discussion about this back in August, search for Crash / Recovery Scenario in the lucene-dev archived discussions. Should answer all your questions Nader Henein -Original Message- From: Gareth Griffiths [mailto:[EMAIL PROTECTED] Sent: Tuesday, June 24, 2003 1:11 PM To: Lucene Users List; [EMAIL PROTECTED] Subject: Re: commercial websites powered by Lucene? Nader, You say you have to cope with server crash mid-indexing. I think I'm seeing lots of garbage files created by server crash mid merge/optimise while lucene is creating a new index. Did you write code specifically to handle this or is there something more automated. (I was thinking of writing a sanity check for before start-up that looked in 'segments' and 'deletable and got rid of any files in the catalog directory that are not referenced.) Did you do something similar or have I missed something... TIA Gareth - Original Message - From: Nader S. Henein [EMAIL PROTECTED] To: 'Lucene Users List' [EMAIL PROTECTED] Sent: Tuesday, June 24, 2003 9:30 AM Subject: RE: commercial websites powered by Lucene? I handle updates or inserts the same way first I delete the document from the index and then I insert it (better safe than sorry), I batch my updates/inserts every twenty minutes, I would do it in smaller intervals but since I have to sync the XML files created from the DB to three machines (I maintain three separate Lucene indices on my three separate web-servers) it takes a little longer. You have to batch your changes because Updating the index takes time as opposed to deleted which I batch every two minutes. 
You won't have a problem updating the index and searching at the same time because Lucene updates the index on a separate set of files and then, when it's done, it overwrites the old version. I've had to provide for backups, and things like server crashes mid-indexing, but I was using Oracle Intermedia before and Lucene BLOWS IT AWAY.

-Original Message-
From: news [mailto:[EMAIL PROTECTED]] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 12:06 PM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?

Hi Nader, I was wondering if you'd mind me asking you a couple of questions about your implementation? The main thing I'm interested in is how you handle updates to Lucene's index. I'd imagine you have a fairly high turnover of CVs and jobs, so index updates must place a reasonable load on the CPU/disk. Do you keep CVs and jobs in the same index or two different ones? And what is the process you use to update the index(es) - do you batch-process updates or do you handle them in real-time as changes are made? Any insight you can offer would be much appreciated, as I'm about to implement something similar and am a little unsure of the best approach to take. We need to be able to handle indexing about 60,000 documents/day, while allowing (many) searches to continue operating alongside. Thanks! Chris

Nader S. Henein [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

We use Lucene at http://www.bayt.com . We're basically an on-line recruitment site, and up until now we've got around 500,000 CVs and documents indexed, with results that stump Oracle Intermedia.

Nader Henein
Senior Web Dev
Bayt.com

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 04, 2003 6:09 PM
To: [EMAIL PROTECTED]
Subject: commercial websites powered by Lucene?

Hello All, I've been trying to find examples of large commercial websites that use Lucene to power their search.
Having such examples would make Lucene an easy sell to management. Does anyone know of any good examples? The bigger the better, and the more the better. TIA, -John
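The crash-recovery step Nader describes earlier in this thread (delete the lock files, optimize, re-index) can be sketched as a small start-up sanity check. This is an illustration only, assuming a hypothetical `LockSanityCheck` helper class; it is not part of Lucene's API. The `*.lock` pattern matches the `write.lock` and `commit.lock` files Lucene of this era left behind after a crash.

```java
import java.io.File;

// Start-up sanity check: if the indexer died mid-run, stale lock files are
// left in the index directory and must be removed before indexing resumes.
public class LockSanityCheck {

    // True if the file name looks like one of Lucene's lock files.
    public static boolean isLockFile(String name) {
        return name.endsWith(".lock");
    }

    // Given a directory listing, return the names that should be deleted.
    public static java.util.List<String> staleLocks(String[] listing) {
        java.util.List<String> out = new java.util.ArrayList<>();
        for (String n : listing) {
            if (isLockFile(n)) out.add(n);
        }
        return out;
    }

    // Remove stale lock files from the index directory; returns count removed.
    public static int clearStaleLocks(File indexDir) {
        int removed = 0;
        File[] files = indexDir.listFiles();
        if (files == null) return 0;          // not a directory
        for (File f : files) {
            if (isLockFile(f.getName()) && f.delete()) removed++;
        }
        return removed;
    }
}
```

As the thread notes, after clearing the locks you would still optimize and re-index whatever was in flight when the crash occurred; this sketch only covers the lock cleanup.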
RE: commercial websites powered by Lucene?
The search is a little sluggish because our initial architecture was based on TCL, not Java, so until we complete the full Java overhaul, every time I perform a search the AOL Webserver (TCL) has to call the servlet in Resin (where Lucene is) and then perform the search; then (this is the killer) I have to parse all the results from a Java Collection into a TCL list. The most intense search, with thousands of results, takes less than a second; it's all the things I have to do around it that take time.

Nader

-Original Message-
From: John Takacs [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, June 24, 2003 1:52 PM
To: Lucene Users List
Subject: RE: commercial websites powered by Lucene?

Hi Nader, This thread is by far one of the best, and most practical. It will only be topped when someone provides benchmarks for a DMOZ.org type directory of 3 million plus URLs. I would love to, but the whole JavaCC thing is a show stopper. Questions: I noticed that search is a little slow. What has been your experience? Perhaps it was a bandwidth issue, but I'm living in the country with the greatest internet connectivity and penetration in the world (South Korea), so I don't think that is an issue on my end. You have 500,000 resumes. Based on the steps you took to get to 500,000, do you think your current setup will scale to millions, like say, 3 million or so? What is your hardware like? CPU/RAM? Warm regards, and thanks for sharing. If I can ever get past the Lucene/JavaCC installation failure, I'll share my benchmarks on the above directory scenario. John

-Original Message-
From: Nader S. Henein [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, June 24, 2003 5:30 PM
To: 'Lucene Users List'
Subject: RE: commercial websites powered by Lucene?
I handle updates and inserts the same way: first I delete the document from the index and then I insert it (better safe than sorry). I batch my updates/inserts every twenty minutes; I would do it in smaller intervals, but since I have to sync the XML files created from the DB to three machines (I maintain three separate Lucene indices on my three separate web-servers) it takes a little longer. You have to batch your changes because updating the index takes time, as opposed to deletes, which I batch every two minutes. You won't have a problem updating the index and searching at the same time because Lucene updates the index on a separate set of files and then, when it's done, it overwrites the old version. I've had to provide for backups, and things like server crashes mid-indexing, but I was using Oracle Intermedia before and Lucene BLOWS IT AWAY.

-Original Message-
From: news [mailto:[EMAIL PROTECTED]] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 12:06 PM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?

Hi Nader, I was wondering if you'd mind me asking you a couple of questions about your implementation? The main thing I'm interested in is how you handle updates to Lucene's index. I'd imagine you have a fairly high turnover of CVs and jobs, so index updates must place a reasonable load on the CPU/disk. Do you keep CVs and jobs in the same index or two different ones? And what is the process you use to update the index(es) - do you batch-process updates or do you handle them in real-time as changes are made? Any insight you can offer would be much appreciated, as I'm about to implement something similar and am a little unsure of the best approach to take. We need to be able to handle indexing about 60,000 documents/day, while allowing (many) searches to continue operating alongside. Thanks! Chris

Nader S. Henein [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

We use Lucene at http://www.bayt.com . We're basically an on-line recruitment site, and up until now we've got around 500,000 CVs and documents indexed, with results that stump Oracle Intermedia.

Nader Henein
Senior Web Dev
Bayt.com

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 04, 2003 6:09 PM
To: [EMAIL PROTECTED]
Subject: commercial websites powered by Lucene?

Hello All, I've been trying to find examples of large commercial websites that use Lucene to power their search. Having such examples would make Lucene an easy sell to management. Does anyone know of any good examples? The bigger the better, and the more the better. TIA, -John
RE: commercial websites powered by Lucene?
About 100 documents every twenty minutes, but it fluctuates depending on how much traffic is on the site.

-Original Message-
From: news [mailto:[EMAIL PROTECTED]] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 3:28 PM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?

Hmm, good point with the cost of copying indices in a distributed environment, although that is unlikely to affect us in the foreseeable future. But, noted! Do you have any rough statistics on how many documents you index per day, or how many every 20 minutes? This discussion is fantastic by the way, lots of great experience and comments coming out here. Thanks, it's really appreciated.

Nader S. Henein [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

We thought of that in the beginning, and then we became more comfortable with multiple indices for simple backup purposes. Now our indices are in excess of 100 megs, and transferring that kind of data between three machines sitting in the same data center is passable, but once you start thinking of distributed webservers in different hosting facilities, copying 100 megs every 20 minutes, or even every hour, becomes financially expensive. Our webservers are single-processor Sun UltraSPARC III 400 MHz machines with two gigs of memory, and I've never seen the CPU usage go over 0.8 at peak time with the indexer running. Try it out first; take your time to gather your own numbers so you can really get a feel of what setup fits you best. Nader
RE: commercial websites powered by Lucene?
We were using Oracle Intermedia before we switched to Lucene, and Lucene has been much faster. It has also allowed us to distribute our search functionality over multiple servers. Intermedia, which is supposedly one of the best in the business, couldn't hold a candle to Lucene, and our Oracle installation and setup is impeccable; we spent years perfecting it before we decided to separate from Intermedia and use Oracle as a DBMS, not a search engine. Also, because we use Lucene and not a proprietary product like Intermedia, we can switch databases at will if licensing fees become too high to ignore.

Nader

-Original Message-
From: news [mailto:[EMAIL PROTECTED]] On Behalf Of Ulrich Mayring
Sent: Tuesday, June 24, 2003 3:40 PM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?

Chris Miller wrote:

Thanks for your comments Ulrich. I just posted a message asking if anyone had attempted this approach! Sounds like you have, and it works :-) Thanks for the information, this sounds pretty close to what my preferred approach would be.

This is a good approach if the number of total documents doesn't grow too much. There's obviously a limit to full index runs at some point.

You say you get 2000 docs/minute. I've done some benchmarking and managed to get our data indexing at ~1000/minute on an Athlon 1800+ (and most of that speed was achieved by bumping the IndexWriter.mergeFactor up to 100 or so). Our data is coming from a database table, each record contains about 40 fields, and I'm indexing 8 of those fields (an ID, 4 number fields, 3 text fields including one that has ~2k of text). Does this sound reasonable to you, or do you have any tips that might improve that performance?
You need to find out where you lose most of the time:

a) in data access (your database could be too slow; in my case I am scanning the local filesystem)
b) in parsing (probably not an issue when reading from a DB, but in my case it is, since I have HTML files)
c) in indexing

I haven't gone to the trouble to find that out for my app, because it is fast enough the way it is. However, what I wonder: if you have your data in a database anyway, why not use the database's indexing features? It seems like Lucene is an additional layer on top of your data, which you don't really need. cheers, Ulrich
RE: commercial websites powered by Lucene?
We use Lucene at http://www.bayt.com . We're basically an on-line recruitment site, and up until now we've got around 500,000 CVs and documents indexed, with results that stump Oracle Intermedia.

Nader Henein
Senior Web Dev
Bayt.com

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 04, 2003 6:09 PM
To: [EMAIL PROTECTED]
Subject: commercial websites powered by Lucene?

Hello All, I've been trying to find examples of large commercial websites that use Lucene to power their search. Having such examples would make Lucene an easy sell to management. Does anyone know of any good examples? The bigger the better, and the more the better. TIA, -John
RE: Size limit for indexing ?
The size of the document is limited only by OS constraints, and 500 KB is really small; I have documents in the hundreds of megs and it's fine. Check your indexing and searching, you might find the problem there. Also, are you using wildcard searches? Because they don't work from both sides.

Nader Henein

-Original Message-
From: Christophe GOGUYER DESSAGNES [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, October 09, 2002 12:08 PM
To: [EMAIL PROTECTED]
Subject: Size limit for indexing ?

Hi, I use Lucene 1.2 and I index a text document whose size is near 500 KB. (I use the Field.UnStored method.) It seems that only the beginning of this document is indexed! If I search for a term that is at the end of this document, I don't find it (but I find terms at the beginning). So, I split my document in 2 parts and indexed them, and now it works fine. Is there a size limit for indexing a document? Thx. - Christophe
RE: Lucene and RDBMS.
We had to do the same thing: we moved from an Oracle Intermedia search to Lucene (much better); the data is stored in the database. What we did is produce XML files on an interval (15 minutes), and those files would be picked up by the indexer, which would delete any previous occurrence of the same entry, re-index the new one, and then optimize the index. You could do the whole process in one shot: retrieve a stream from the DB and then pass it directly to Lucene, but the stream should be in field/value pairs (so XML makes sense). The answer to your question is no, you don't have to use files to create the index. The index itself is file based, though.

Nader Henein

-Original Message-
From: Rehan Syed [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 25, 2002 10:51 AM
To: [EMAIL PROTECTED]
Subject: Lucene and RDBMS.

Hi, I am in the process of implementing a knowledge base for internal use by my company. The contents of this knowledge base will be stored in one or more database table(s). I am evaluating Lucene for performing text searches on this knowledge base. I understand that Lucene has two components, indexing and searching, but both these components work on files, not on text data stored in an RDBMS. In order for me to use Lucene, would I need to develop a process that will extract text data out of the database, create text files and then do the indexing and searching? Are there any other approaches to this problem? Comments/suggestions would be greatly appreciated.
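The field/value XML hand-off Nader describes could look something like the sketch below. The `<doc>`/`<field>` element names and the `FieldValueXml` class are assumptions for illustration; the post doesn't specify the actual schema used at Bayt.com.

```java
import java.util.Map;

// Render one database record as a tiny XML document of field/value pairs,
// ready to be handed to an indexer that maps each <field> to a Lucene field.
public class FieldValueXml {

    // Minimal escaping for text placed inside an XML element.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // e.g. {title=...} -> <doc><field name="title">...</field></doc>
    public static String toXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(e.getKey()).append("\">")
              .append(escape(e.getValue()))
              .append("</field>");
        }
        return sb.append("</doc>").toString();
    }
}
```

As the follow-up message notes, the same files can then double as a display cache on the webservers, which is a nice side benefit of materializing them rather than streaming straight from the DB.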
RE: Lucene and RDBMS.
The initial motivation behind switching from Intermedia to Lucene was a first step in achieving DB abstraction, because if you rely on Intermedia for your indexing and searching purposes you're pretty much stuck with Oracle - an excellent DB, but if your business is growing the licensing fees become massive. Another thing is that I don't maintain one index on the database server; I maintain an index on each webserver, which allowed me to reduce the average load on the DB machine by 78%. It's a little bit of a synchronization nightmare, but we've had it in place for the past three months without incident, plus you have redundant indexes in case one becomes corrupted. Furthermore, the traffic between the DB machine and the webserver, which was inflated by having to pass search results back and forth, has been dwarfed. Now the true joy behind using Lucene is the performance boost you'll get: we had Intermedia customized and tuned to our needs, yet Lucene was able to give a 200% increase in performance, a huge asset to our site, which is mainly search driven.

PS: the reason why we create XML files and then hand them to Lucene is that the files are then used for display and caching purposes; once they are transmitted to the webserver machines they save me the hassle of retrieving them from the database, since they are the most recent version of the documents.

Nader Henein

-Original Message-
From: Mariusz Dziewierz [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 25, 2002 4:23 PM
To: Lucene Users List
Subject: Re: Lucene and RDBMS.

Nader S. Henein wrote:

We had to do the same thing, we moved from an Oracle Intermedia search to Lucene (much better), the data is stored in the database.

Could you give some reasons which led you to the conclusion that Lucene is much better than Oracle Intermedia in terms of searching data stored in a database?
I'm currently reviewing technologies related to text mining and I am very curious about your motives, because I haven't had the opportunity to evaluate both technologies yet. -- Mariusz Dziewierz
RE: Lucene is not closing connections to index files
I know it's not the most efficient way, but I do close the searcher after every search using searcher.close(). This saves me the hassle of worrying about memory problems, and the search on my system is quite intensive, about half a million searches a day. I haven't faced any problems with the opening and closing of searchers. Anyway, if you keep it open you're in a way working against the garbage collector, slowly creating your own structured memory leak. Memory is cheap: any of my clients would rather pay for a couple of gigs of memory than have a team come two months after launch to troubleshoot a memory leak, somewhat reminiscent of the days of C dangling pointers. Granted, one shouldn't go around using memory liberally, but some trade-offs do pay off.

Nader Henein

-Original Message-
From: Halácsy Péter [mailto:[EMAIL PROTECTED]]
Sent: Monday, August 12, 2002 11:18 AM
To: Lucene Users List
Subject: RE: Lucene is not closing connections to index files

-Original Message-
From: Jason Coleman [mailto:[EMAIL PROTECTED]]
Sent: Monday, August 12, 2002 12:25 AM
To: [EMAIL PROTECTED]
Subject: Lucene is not closing connections to index files

Lucene is not letting go of (closing) index files that are being searched. I have not traced exactly where the problem is occurring, so I thought I would get some ideas first from the board. It appears that when a user does a search against the Lucene index files, the connections to these files are not released. It continues to maintain a connection until the JVM runs out of file space.

Yes, you are right. You have to close the searcher to release opened files.

This is how I am querying the index:

Searcher searcher = new IndexSearcher(index_path);
Query query = QueryParser.parse(queryString, "body", new StandardAnalyzer());
hits = searcher.search(query);

index_path is just the location of the Lucene index files. I am sure that a Reader class somewhere is not being closed properly.
Has anyone experienced this problem when querying the index?

It's not a bug but a feature. Lucene doesn't close files after searching unless you call the close() method. The cause: it's very slow to reopen the files. You should check the discussion about searcher caching (see the mailing list archive). peter
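The searcher-cache pattern Peter points to (reuse one open searcher, reopen only when the index changes) can be sketched generically. `Searcher`, `Factory`, and `SearcherCache` here are stand-in names for illustration, not Lucene classes; in real code the version check would come from the index itself.

```java
// Keep one shared searcher open across requests; reopen only when the
// index version moves, so files aren't reopened on every search.
public class SearcherCache {
    public interface Searcher { void close(); }
    public interface Factory {
        Searcher open();        // open a searcher over the current index
        long indexVersion();    // cheap check for "has the index changed?"
    }

    private final Factory factory;
    private Searcher current;
    private long openedVersion = -1;

    public SearcherCache(Factory factory) { this.factory = factory; }

    // Return the shared searcher, reopening it only if the index changed.
    public synchronized Searcher get() {
        long v = factory.indexVersion();
        if (current == null || v != openedVersion) {
            if (current != null) current.close();   // release old files
            current = factory.open();
            openedVersion = v;
        }
        return current;
    }
}
```

This is the middle ground between the two camps in the thread: you avoid the reopen cost Peter mentions without holding file handles forever, though Nader's close-every-time approach is clearly simpler to reason about.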
RE: Hit Navigation in Lucene?
Here's the highlighting javascript, ready with copyright and all.

-Original Message-
From: Peter Carlson [mailto:[EMAIL PROTECTED]]
Sent: Friday, August 02, 2002 5:11 AM
To: Lucene Users List
Subject: Re: Hit Navigation in Lucene?

This clicking to the next highlighted term is all done in javascript, not by the backend system. So if you get permission, you can use their code and link this in with the Lucene highlighting. I'll bet that the highlighting is being done via javascript too, so you don't need the Lucene highlighting code. Although, the Lucene highlighting code works with wildcards. --Peter

On 8/1/02 12:36 PM, Bruce Best (CRO) [EMAIL PROTECTED] wrote:

I am looking at Lucene as the search engine for our office's legal research site. We have been looking at some of the commercial offerings, but Lucene seems to offer most of what we need, and we may end up using it and spending money on paying someone to customize it to our needs. For our purposes, one feature that is probably indispensable is hit highlighting and hit navigation. I see the former has already been added to the contributions section. With respect to hit navigation, the kind of thing I am looking at is along the lines of that used by the Fulcrum search engine; if anyone is not familiar with Fulcrum, a good example site is the Government of Canada Employment Insurance Jurisprudence Library at http://www.ei-ae.gc.ca/easyk/search.asp. Do a search for any term (try fired), then click on any of the resulting documents. The resulting page has the search terms highlighted, much as they would be in Lucene with the hit highlighting added, with a narrow frame at the top of the window with hit navigation buttons to allow users to jump to the next search term in the document. Would it be difficult to implement something similar with Lucene? I am not familiar with the technologies involved (I am not a coder), so I do not know if this is trivial or impossible or somewhere in between.
Any thoughts would be appreciated, Bruce
RE: Size Capabilities of Lucene Index
Since it's a file-system-based index, I don't see any limitations other than the OS max file size, and I imagine if your data is 3 terabytes you have monster machines with monster memory (you'll need it). Also, you'll need to max out the file-handle setting on the OS and probably use a high mergeFactor.

PS: I'm hypothesizing here, so please, anyone feel free to jump in.

Nader Henein

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 31, 2002 6:32 PM
To: Lucene Users List
Subject: Size Capabilities of Lucene Index

Can anyone tell me the amount of data that Lucene is able to index? Can it handle up to 3 terabytes? How large are the indexes it creates (1/2 the size of the data)? Thanks, Scott

The information contained in this message may be privileged and confidential and protected from disclosure. If the reader of this message is not the intended recipient, or an employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify us immediately by replying to the message and deleting it from your computer. Thank you. Ernst & Young LLP
RE: Using Filters in Lucene
My index changes (updates every 15 minutes and deletes every 2 minutes), so using the filter is not going to work for me, because the order of the documents might change from the time the initial search is done to the time the filter is applied. I'm currently using a crude method ( ... doc_id:(23 AND 78 .. ) ) to filter, and it works surprisingly well; I thought the query parser would cave, but it's doing great even with sets as large as filtering within 2000 documents.

-Original Message-
From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 31, 2002 10:24 PM
To: 'Lucene Users List'
Subject: RE: Using Filters in Lucene

Cool. But instead of adding a new class, why not change Hits to inherit from Filter and add the bits() method to it? Then one could pipe the output of one query into another search without modifying the queries... Scott

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 29, 2002 12:03 PM
To: Lucene Users List
Subject: Re: Using Filters in Lucene

Peter Carlson wrote:

Would you suggest that search-in-selection type functionality use filters or redo the search with an AND clause?

I'm not sure I fully understand the question. If you have a condition that is likely to re-occur commonly in subsequent queries, then using a Filter which caches its bit vector is much faster than using an AND clause. However, you probably cannot afford to keep a large number of such filters around, as the cached bit vectors use a fair amount of memory: one bit per document in the index. Perhaps the ultimate filter is something like the attached class, QueryFilter. This caches the results of an arbitrary query in a bit vector. The filter can then be reused with multiple queries, and (so long as the index isn't altered) that part of the query computation will be cached. For example, RangeQuery could be used with this, instead of using DateFilter, which does not cache (yet). Caution: I have not yet tested this code.
If someone does try it, please send a message to the list telling how it goes. If this is useful, I can document it better and add it to Lucene. Doug
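Nader's crude query-string filter can be sketched as a small helper that restricts a follow-up search to a known set of document ids. Note an assumption: his message shows AND between ids, but set membership in a field clause is expressed with OR (a document has only one doc_id), so OR is used here; the `IdFilterQuery` class name is illustrative.

```java
// Build a field clause like doc_id:(23 OR 78) that limits a query to a
// known set of ids, as a poor man's filter when bit-vector filters won't
// survive index churn.
public class IdFilterQuery {
    public static String build(String field, int[] ids) {
        StringBuilder sb = new StringBuilder(field).append(":(");
        for (int i = 0; i < ids.length; i++) {
            if (i > 0) sb.append(" OR ");
            sb.append(ids[i]);
        }
        return sb.append(")").toString();
    }
}
```

The resulting string would be ANDed onto the user's query before parsing; as the thread warns, with thousands of ids this stresses the query parser, so Doug's cached QueryFilter is the better tool on a stable index.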
RE: is this possible in a query?
This is a long shot, but if you want your search to yield exact results alone on that specific field, you might want to think about replacing the spaces between words with underscores (make sure the analyzer doesn't split them up) and then applying that same rule to the query string, in the sense that Cathflo OrthoMed will become Cathflo_OrthoMed and OrthoMed will stay the same; so when you search for OrthoMed you'll only get exact results. This does not save you from re-indexing (unfortunately), but it does save you from writing a whole new analyzer.

Nader Henein

-Original Message-
From: Robert A. Decker [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 01, 2002 6:35 AM
To: Lucene Users List
Subject: Re: is this possible in a query?

I think this may be what I end up doing... Unfortunately this means reindexing the documents... thanks, rob http://www.robdecker.com/ http://www.planetside.com/

On Wed, 31 Jul 2002 [EMAIL PROTECTED] wrote:

If you make the product name a Field.Keyword type, it will still be indexed and searchable, but will not be tokenized. --dmg

- Original Message -
From: Robert A. Decker [EMAIL PROTECTED]
Date: Wednesday, July 31, 2002 5:07 pm
Subject: is this possible in a query?

I have a Text Field named product. Two of the products are 'Cathflo OrthoMed' and 'OrthoMed'. When I search for Cathflo OrthoMed, I correctly only get items that have the product Cathflo OrthoMed. However, when I search for OrthoMed, not only do I get all OrthoMed products, but I also get all Cathflo OrthoMed products. Is there a way, when searching on a Field.Text type, to limit the above OrthoMed search to only OrthoMed, and to exclude Cathflo OrthoMed? The solution has to be generic enough to work with any combination of product names.
thanks, rob http://www.robdecker.com/ http://www.planetside.com/
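Nader's underscore trick amounts to one normalization applied in two places: to the field value at index time and to the product name in the query string. A minimal sketch, assuming a hypothetical `ExactProductName` helper (whether the underscored token survives intact still depends on the analyzer in use, as he notes):

```java
// Join multi-word product names with underscores so "OrthoMed" no longer
// matches inside "Cathflo OrthoMed"; apply the same rewrite when building
// the query so index-time and search-time tokens agree.
public class ExactProductName {
    public static String normalize(String productName) {
        return productName.trim().replaceAll("\\s+", "_");
    }
}
```

dmg's Field.Keyword suggestion in the quoted message achieves the same exact-match behavior without rewriting values, at the cost of losing tokenized search on that field entirely.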
RE: Autonomy vs Lucene, etc..
Could you explain a little bit what Autonomy does for you and what requirements you have that need to be met?

Nader Henein

-Original Message-
From: Anoop Kumar V [mailto:[EMAIL PROTECTED]]
Sent: Thursday, July 25, 2002 9:05 AM
To: 'Lucene Users List'
Subject: Autonomy vs Lucene, etc..

Hi, I have a very basic question. We have been using Autonomy until now, but now we are looking for any alternative tools to substitute Autonomy. We decided this as we have now shifted to a more internal database/site search rather than the external search offered by Autonomy. What I want to know is: can Lucene (or any other search engine) substitute Autonomy, and what are the impacts? Can you also guide me to any other search engine (OK if it is not open source) that is suitable in terms of ease of installation and integration? Thanks in advance. -anoop
RE: I need help
If you're talking about the ranking (scoring) scheme of the search results, I imagine that you could use a vectorial model (a lot of changes), but why, when an algebraic ranking method is more accurate?

Nader Henein

-Original Message-
From: ilma barbosa [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 24, 2002 10:41 PM
To: [EMAIL PROTECTED]
Subject: I need help

I would like to know whether Lucene makes queries using the vectorial model.
RE: Replication of indexes
I maintain the index on multiple machines, UPDATING/DELETING/OPTIMIZING on all three machines. It's hard to make sure that everything is synchronized, but it provides a fallback in case anything happens to the index. What you're doing is mainly a copy, which is probably like my backup, which is quite simple now: I check the index directory for *.lock files; if none are present (the index isn't being edited/optimized) I create a write.lock file, which tells the indexer not to run, and I read the file list using a shell script and copy the files to a different directory. It's a hack, but it works fine. I'm currently working on a backup and rollback API for Lucene, which should work for copying the index across.

Nader Henein

-Original Message-
From: Harpreet S Walia [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 15, 2002 3:39 AM
To: Lucene Users List
Subject: Replication of indexes

Hello Everybody, I have a requirement where I need to replicate the index files generated by Lucene to another server at a remote location. What I have observed is that Lucene keeps on changing file names for the index files. Does it follow any specific pattern in doing this? Is anybody doing something like this? From what I understand it will be best if I optimize the index before I replicate it, and also make a local copy so that the index is not updated while it is being replicated. What other issues can there be if I try something like this? TIA Regards Harpreet
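The go/no-go decision in Nader's backup hack (only copy when no *.lock files are present) can be isolated as a tiny predicate. The `BackupGuard` name is illustrative, and the actual copy is left to a shell script as in his post; note this relies on the same lock-file convention the indexer itself uses, so a race is still possible between the check and creating your own write.lock.

```java
// Decide whether a backup copy of the index may start: refuse while any
// *.lock file (write.lock / commit.lock) is present in the directory listing.
public class BackupGuard {
    public static boolean safeToBackup(String[] listing) {
        for (String name : listing) {
            if (name.endsWith(".lock")) return false;   // index busy
        }
        return true;
    }
}
```

On a green light the post's procedure then creates its own write.lock to hold the indexer off, copies the files, and deletes the lock afterwards.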
RE: simultaneous searching and indexing
I'm not sure as to the status of the FAQ, but I've had this discussion before, and I've tested Lucene heavily during the last few months. I've searched it during many of my repeated full indexing sessions (which are extremely exhaustive) and it has not failed me once, and about a month ago I had a related discussion about concurrent indexing and backup that might shed some light on your issue: http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg01709.html

Nader Henein

-Original Message-
From: Harpreet S Walia [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 10, 2002 8:09 PM
To: Lucene Users List
Subject: simultaneous searching and indexing

Hi, I was going through the FAQ and found a mention of the thread safety of Lucene. From what I understand, Lucene is not fully thread-safe. The FAQ is dated 2001; have there been any improvements on this since then? Is it safe to perform search and index operations on the same index simultaneously? Please pardon me if this question has been asked before. Thanks and Regards, Harpreet
RE: Crash / Recovery Scenario
I'm not worried about my hardware; I've been blessed with an 8 CPU Sun machine and two 2 CPU Sun machines with gigs of memory, and I do run Lucene with 15 threads. I've set my merge factor at 1000 so a lot of work is done in memory (speed). My current concerns are recovery related, as I'm a few days from deployment. On Windows-based machines I'm not too familiar with the threading setup; the beauty of Unix is you can do anything. I'm worried about Lucene hanging mid-indexing; how do I monitor that? -Original Message- From: none none [mailto:[EMAIL PROTECTED]] Sent: Monday, July 08, 2002 11:05 PM To: [EMAIL PROTECTED] Subject: RE: Crash / Recovery Scenario If you tell me the computer doesn't crash, and the only thing is that you want to stop the process safely, well, in this case the Manager will not stop until the task is complete. Because I am running the Manager as an NT service, I have a little problem here: you cannot stop a thread while it is doing an I/O operation like a recursive scan of a directory; you have to wait a little bit. I see that you are looking for software stability, but software is strictly related to hardware; you need good hardware too. Think about a RAID structure (0 or 5, it depends), think about a clustered system. This depends on what you want from your search engine. Also, I think it is good to focus on having a good cache status, e.g. if I have a bad error and I can't recover the index, I rebuild it by calling a method that scans all my cache; it is not great but better than nothing. Also, I never had that kind of problem. Also, adopting multiple threads will improve the actual speed by 40%; you need to merge all the segments at the end. (I tested with just 2 threads on Win2K.) 
If you are looking for a search engine like Google, there is a lot of work to do, A LOT. My opinion is to split index and cache over 'n' machines, but the one thing I don't know how to do is run a search on multiple indexes on multiple machines; with sockets it will not work, sockets become really slow with heavy traffic. I was thinking of a Java-compatible DLL able to merge multiple machines into one logical unit. ciao. -- On Mon, 8 Jul 2002 21:07:32 Nader S. Henein wrote: brilliant .. I was thinking along the same lines, a new issue that I'm facing is just Lucene dying on me, in the middle of indexing .. no server crash .. nothing .. what do you do if it just stops mid-indexing ? -Original Message- From: none none [mailto:[EMAIL PROTECTED]] Sent: Monday, July 08, 2002 8:42 PM To: [EMAIL PROTECTED] Subject: Re: Crash / Recovery Scenario hi, I do perform the same things as you do, but I do that every time I get a NullPointerException when I try to run a search. If this happens I try to reopen the index searcher; if I get an exception here I sleep for 500 ms then try again; after 5 attempts I generate a servlet exception. Concerning the deletion of write.lock and commit.lock, I use a manager; what it does is execute different kinds of operations in blocks, like 100 or 1000. Each operation can be: 1. Delete documents 2. Add documents 3. Search document/s. A combination of these 3 operations allows me to update the index with searches still running. But there is a versioning problem between the current cache of documents and the current version of INDEXED documents: during an update you can search for something that is found in the index but that has been updated in the cache, so I have a bunch of duplicate documents during that window, and at the end I notify, using an RMI callback, all the clients connected to that Manager to reopen the index, then I clean up all the duplicates. 
At this stage I still have an error in case the Manager dies, because I have everything in memory, but I did a little workaround to handle that. My next step is to make these transactions persistent, so I can recover the previous status. Every time I run an operation as listed above I check whether write.lock or commit.lock exists; in that case I call the unlock() method, I delete them (if the unlock method doesn't), then I optimize the index. Until now everything seems to work fine. ciao. -- On Mon, 8 Jul 2002 09:40:10 Nader S. Henein wrote: I'm currently using Lucene to sift through about a million documents. I've written a servlet to do the indexing and the searching; the servlets are run through Resin. The crash scenario I'm thinking of is a web server crash (for a million possible reasons) while the index is being updated or optimized. What I've noticed is the creation of write.lock and commit.lock files, which stop further indexing because the application thinks that the previously scheduled indexer is still running (which could very well be true depending on the size of the update). This is the recovery I have in mind, but I think it might be somewhat of a hack: on restart of the web server I've written an Init function that checks for write.lock or commit.lock, and if either exists it deletes both of them and optimizes the index. Am I forgetting anything
RE: Crash / Recovery Scenario
Karl, what if I copy the index in memory or to another directory prior to indexing, thereby assuring a working index in the case of a crash? I want to stay away from DB interaction as I am trying to move off an Oracle Intermedia search solution (if you saw the Oracle price list you would too). I have a backup process which 1) Checks if the index is being updated 2) Does a small trial search (to ensure that the index is not corrupt) 3) Tars the index and moves the file to another disk. I'm thinking of writing a full backup/restore add-on to Lucene so all of this can be jarred together as part of the package. Nader -Original Message- From: Karl Øie [mailto:[EMAIL PROTECTED]] Sent: Tuesday, July 09, 2002 1:49 PM To: Lucene Users List Subject: Re: Crash / Recovery Scenario only deletes the old one while it's working on the new one, so is there a way of checking for the .lock files in case of a crash and rolling back to the old index image? Nader Henein I have some thoughts about crash/recovery/rollback that I haven't found any good solutions for. If a crash happens during writing there is no good way to know if the index is intact; removing lock files doesn't help this fact, as we really don't know. So providing rollback functionality is a good but expensive way of compensating for the lack of recovery. To provide rollback I have used a RAMDirectory and serialized it to a SQL table. By doing this I can catch any exceptions and ask the database to roll back if required. This works great for small indexes, but as the index grows you will have performance problems, because the whole RAMDir has to be serialized/deserialized into the BLOB all the time. A better solution would be to hack the FSDirectory to store each file it would normally write to a directory as a serialized byte array in a BLOB of a SQL table. This would increase performance because the whole Directory doesn't have to change each time, and it doesn't have to read the whole directory into memory. 
I also suspect Lucene sorts its records into these different files for increased performance (like: I KNOW that record will be in segment xxx if it is there at all). I have looked at the source for the RAMDirectory and the FSDirectory and they could both be altered to store their internal buffers in a BLOB, but I haven't managed to do this successfully. The problem I have been pounding on is the lucene.InputStream seek() function. This really requires the underlying impl to be either a file or an array in memory. For a BLOB this would mean that the blob has to be fetched, then read/seek-ed/written, then stored back again. (Is this correct?!? And if so, is there a way to know WHEN it is required to fetch/store the array?) I would really appreciate any tips on this, as I think crash/recovery/rollback functionality would benefit Lucene greatly. I have indexes that take 5 days to build, and it's really bad to receive exceptions during a long index run with no recovery/rollback functionality. Mvh Karl Øie
RE: Crash / Recovery Scenario
I understand that these files are there for a reason, but in case of a web server crash Lucene will not be able to update/delete/optimize the index while these files exist. The existence of these two files after a web server restart means that the crash occurred while the web server was editing the index, and since there is no way to roll back (is there? that would be a cool feature) I have to cut my losses and continue. Sorry for thinking out loud, but speaking of rollback, I asked a question a while back about backing up the index while it's being written to: http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg01711.html and Peter told me that it's no problem, especially on a Unix machine, because the Lucene writer creates a new index and only deletes the old one while it's working on the new one. So is there a way of checking for the .lock files in case of a crash and rolling back to the old index image? Nader Henein -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]] Sent: Monday, July 08, 2002 9:43 PM To: Lucene Users List; [EMAIL PROTECTED] Subject: Re: Crash / Recovery Scenario Nader, I don't have a solution for you, but just removing these two files is probably not a good idea. There is a reason for their existence. Actually, check the jGuru Lucene FAQ for more information about them. Otis P.S. s/witch/which/gi :) witch = the ugly woman flying around on a broomstick :) --- Nader S. 
Henein [EMAIL PROTECTED] wrote: I'm currently using Lucene to sift through about a million documents. I've written a servlet to do the indexing and the searching; the servlets are run through Resin. The crash scenario I'm thinking of is a web server crash (for a million possible reasons) while the index is being updated or optimized. What I've noticed is the creation of write.lock and commit.lock files, which stop further indexing because the application thinks that the previously scheduled indexer is still running (which could very well be true depending on the size of the update). This is the recovery I have in mind, but I think it might be somewhat of a hack: on restart of the web server I've written an Init function that checks for write.lock or commit.lock, and if either exists it deletes both of them and optimizes the index. Am I forgetting anything? Is this wrong? Is there a Lucene-specific way of doing this, like running the optimizer with a specific setup? Nader S. Henein Bayt.com , Dubai Internet City Tel. +9714 3911900 Fax. +9714 3911915 GSM. +9715 05659557 www.bayt.com
Crash / Recovery Scenario
I'm currently using Lucene to sift through about a million documents. I've written a servlet to do the indexing and the searching; the servlets are run through Resin. The crash scenario I'm thinking of is a web server crash (for a million possible reasons) while the index is being updated or optimized. What I've noticed is the creation of write.lock and commit.lock files, which stop further indexing because the application thinks that the previously scheduled indexer is still running (which could very well be true depending on the size of the update). This is the recovery I have in mind, but I think it might be somewhat of a hack: on restart of the web server I've written an Init function that checks for write.lock or commit.lock, and if either exists it deletes both of them and optimizes the index. Am I forgetting anything? Is this wrong? Is there a Lucene-specific way of doing this, like running the optimizer with a specific setup? Nader S. Henein Bayt.com , Dubai Internet City Tel. +9714 3911900 Fax. +9714 3911915 GSM. +9715 05659557 www.bayt.com
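The Init-time cleanup described above can be sketched in plain Java. A minimal sketch, assuming the lock files sit directly in the index directory and that nothing else is touching the directory during startup; the follow-up optimize (via IndexWriter) is deliberately omitted:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Startup recovery sketch: if a crash left write.lock or commit.lock
// behind, remove them before reopening the index. The file names match
// the message above; everything else is illustrative.
public class StaleLockCleanup {

    // Delete leftover lock files; returns true if any were removed.
    static boolean clearStaleLocks(Path indexDir) throws IOException {
        boolean removed = false;
        for (String name : new String[] { "write.lock", "commit.lock" }) {
            removed |= Files.deleteIfExists(indexDir.resolve(name));
        }
        return removed;
    }
}
```

The caveat from the rest of the thread applies: this only makes sense when you are certain the indexer is not actually running, i.e. right after a restart.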
RE: Stress Testing Lucene
That's the weird thing: I wasn't writing to the index at the time I was searching (hardcore searching), 20 clients each issuing 20 simultaneous search requests .. it was going fine until it started throwing errors at me, and when I looked at the logs I found a set of Too many open files errors. Previously this only happened if there was a crash on the server while indexing, leaving an un-optimized index with 800+ files. -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED]] Sent: Thursday, June 27, 2002 7:36 PM To: Lucene Users List Subject: Re: Stress Testing Lucene It's very hard to leave an index in a bad state. Updating the segments file atomically updates the index. So the only way to corrupt things is to only partly update the segments file. But that too is hard, since it's first written to a temporary file, which is then renamed segments. The only vulnerability I know of is that in Java on Win32 you can't atomically rename a file to something that already exists, so Lucene has to first remove the old version. So if you were to crash between the time that the old version of segments is removed and the new version is moved into place, then the index would be corrupt, because it would have no segments file. Doug Scott Ganyo wrote: Which came first--the out of file handles error or the corruption? I haven't looked, but I would guess that if you ran into the file handles exception while writing, that might leave Lucene in a bad state. Lucene isn't transactional and doesn't really have the ACID properties of a database... -Original Message- From: Nader S. Henein [mailto:[EMAIL PROTECTED]] Sent: Wednesday, June 26, 2002 11:45 PM To: Lucene Users List Subject: RE: Stress Testing Lucene I rebooted my machine and still the same issue .. if I knew what caused that to happen, I would be able to solve it with some source tweaking, and it's not the file handles on the machine; I got over that problem months ago. 
Let's consider the worst case scenario and that corruption did occur: what could be the reasons? I'm going to need some insider help to get through this one. N. -Original Message- From: Scott Ganyo [mailto:[EMAIL PROTECTED]] Sent: Wednesday, June 26, 2002 7:15 PM To: 'Lucene Users List' Subject: RE: Stress Testing Lucene 1) Are you sure that the index is corrupted? Maybe the file handles just haven't been released yet. Did you try to reboot and try again? 2) To avoid the too-many-files problem: a) increase the system file handle limits, b) make sure that you reuse IndexReaders as much as you can across requests and clients rather than opening and closing them. -Original Message- From: Nader S. Henein [mailto:[EMAIL PROTECTED]] Sent: Wednesday, June 26, 2002 10:11 AM To: [EMAIL PROTECTED] Subject: Stress Testing Lucene Importance: High Hey people, I'm running a Lucene (v1.2) servlet on Resin and I must say compared to Oracle Intermedia it's working beautifully. BUT today, I started stress testing and I downloaded a program called Web Roller, which simulates clients, requests, multi-threading .. the works, and I was doing something like 50 simultaneous requests repeated 10 times in a row. But then something happened and the index got corrupted; every time I try opening the index with the reader to search, or open it with the writer to optimize, I get that damned too-many-files-open error. I can imagine that every application on the market has a breaking point and these breaking points have side effects, so is the corruption of the index a side effect, and if so is there a way that I can configure my web server to crash before the corruption occurs? I'd rather re-start the web server and throw some people off whack rather than have to re-build the index or revert to an older version. Do you know of any way to safeguard against this? General Info: The index is about 45 MB with 60 000 XML files each containing 18-25 fields. Nader S. 
Henein Bayt.com , Dubai Internet City Tel. +9714 3911900 Fax. +9714 3911915 GSM. +9715 05659557 www.bayt.com
RE: Internationalization - Arabic Language Support
I'm indexing Arabic in my index, and to make it searchable I had to switch character sets (not fun). The problem lies in the weak standards surrounding Arabic character sets: between ISO 8859-6, win-1256 and UTF-8 you can have three different representations of the same exact thing. UTF-8 stores Arabic in numeric form (the code that represents each letter), and the Lucene analyzer isn't too friendly with numbers, especially if you use a stemmer. When it comes to the other two encodings, they are different but both come back to the same result: Lucene views them as if they were European character sets and tries to apply the same rules to them. So take care when you're indexing Arabic. I only figured it out when I started experimenting with different Unix charset settings while encoding, because I have an Oracle DB that spits out the XML files on a Solaris OS and then Lucene picks them up for indexing, and since my core application isn't in Java I have to contend with two web servers: the main application (AOL Server) and the search application (Lucene on Resin). When trying to figure out encoding issues, you need to convert everything to its most simple form and compare and contrast as it passes through your application. Nader -Original Message- From: W. Eliot Kimber [mailto:[EMAIL PROTECTED]] Sent: Friday, June 28, 2002 6:59 PM To: Lucene Users List Subject: Re: Internationalization - Arabic Language Support Peter Carlson wrote: The biggest part that is usually changed per language is the analyzer. This is the part of Lucene which transforms and breaks up a string into distinct terms. I have only the smallest understanding of Arabic as a language, but I have done some work to implement back-of-the-book indexing of Arabic (and other languages) for XSL/XSLT. Based on that experience, I think that the main challenges in implementing an Arabic analyzer would be: 1. Understanding the stemming rules for Arabic. 
Our research into Arabic collation revealed that the rules for how Arabic words are formed are not nearly as simple as in English and other Western languages. At this point we haven't stepped up to trying to implement (or find an implementation for) Arabic stemming for collation (words are collated first by their roots, which are not necessarily at the start of the words, so simple lexical collation won't work for Arabic, and I'm assuming that full-text indexing by word roots would have the same problem). So I don't know more than that the problem is hard, even for native speakers of Arabic. 2. Handling different letter forms in queries--Semitic languages often have different forms for the same abstract character in different positions in a word: initial forms, final forms, and base forms. These different forms have different Unicode code points (although initial and final forms are identified as such in the Unicode database). Often a word will be stored with the base forms, but the presented word will be transformed to use the appropriate initial or final form. This means, for example, that cutting and pasting a word from, say, a PDF document into a query might require rationalization of variant forms to base forms before performing the search (assuming that the analyzer also reduces all letters to their base forms for indexing). 3. Right-to-left entry of queries and presentation of results. Mixing right-to-left data with left-to-right data can get pretty tricky at the user interface level (it's not an issue at the data storage level, where all characters are stored in order of occurrence regardless of presentation direction). Good support for bidirectional input and presentation is hit and miss at best. For example, we could not figure out how to get Internet Explorer to correctly present mixed English and Arabic where there were lots of special characters (as opposed to simple flowed prose, which seems to work OK). 
I would expect Arabic-localized Web browsers to handle input OK, but it might be hard to find GUI toolkits that do it well. IBM's ICU4J package, a collection of national language support utilities and libraries, might offer some solutions to this problem, but I have not yet investigated its support for Arabic and similar languages (we used it for its Thai word breaker, which would be needed to implement a Thai analyzer for Lucene). Cheers, Eliot -- W. Eliot Kimber, [EMAIL PROTECTED] Consultant, ISOGEN International 1016 La Posada Dr., Suite 240 Austin, TX 78752 Phone: 512.656.4139
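Point 2 above (positional presentation forms vs. base letters) has a standard fix in the JDK: Unicode NFKC normalization folds the Arabic presentation-form code points (U+FB50-U+FEFF) back to their base letters, so both indexed text and pasted query text can be rationalized the same way. A small sketch; the class and method names are illustrative:

```java
import java.text.Normalizer;

// Fold Arabic presentation forms (positional glyph variants) back to base
// letters with NFKC, so a query pasted from e.g. a PDF matches indexed text.
public class ArabicFolding {

    static String toBaseForms(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FEFB is the isolated lam-alef ligature; NFKC yields
        // base lam (U+0644) followed by base alef (U+0627).
        System.out.println(toBaseForms("\uFEFB").equals("\u0644\u0627")); // true
    }
}
```

Running the same normalization in the analyzer at index time and again on every query keeps the two sides consistent, which is exactly the assumption Eliot flags in the message above.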
RE: Stress Testing Lucene
I rebooted my machine and still the same issue .. if I knew what caused that to happen, I would be able to solve it with some source tweaking, and it's not the file handles on the machine; I got over that problem months ago. Let's consider the worst case scenario and that corruption did occur: what could be the reasons? I'm going to need some insider help to get through this one. N. -Original Message- From: Scott Ganyo [mailto:[EMAIL PROTECTED]] Sent: Wednesday, June 26, 2002 7:15 PM To: 'Lucene Users List' Subject: RE: Stress Testing Lucene 1) Are you sure that the index is corrupted? Maybe the file handles just haven't been released yet. Did you try to reboot and try again? 2) To avoid the too-many-files problem: a) increase the system file handle limits, b) make sure that you reuse IndexReaders as much as you can across requests and clients rather than opening and closing them. -Original Message- From: Nader S. Henein [mailto:[EMAIL PROTECTED]] Sent: Wednesday, June 26, 2002 10:11 AM To: [EMAIL PROTECTED] Subject: Stress Testing Lucene Importance: High Hey people, I'm running a Lucene (v1.2) servlet on Resin and I must say compared to Oracle Intermedia it's working beautifully. BUT today, I started stress testing and I downloaded a program called Web Roller, which simulates clients, requests, multi-threading .. the works, and I was doing something like 50 simultaneous requests repeated 10 times in a row. But then something happened and the index got corrupted; every time I try opening the index with the reader to search, or open it with the writer to optimize, I get that damned too-many-files-open error. 
I can imagine that every application on the market has a breaking point and these breaking points have side effects, so is the corruption of the index a side effect, and if so is there a way that I can configure my web server to crash before the corruption occurs? I'd rather re-start the web server and throw some people off whack rather than have to re-build the index or revert to an older version. Do you know of any way to safeguard against this? General Info: The index is about 45 MB with 60 000 XML files each containing 18-25 fields. Nader S. Henein Bayt.com , Dubai Internet City Tel. +9714 3911900 Fax. +9714 3911915 GSM. +9715 05659557 www.bayt.com
RE: Stress Testing Lucene
sorry .. but still the same problem. I've saved the index in a separate directory and I've re-indexed overnight so testing (which is currently underway) on the system can resume. Like I said in my previous email, worst case scenario and the index is corrupted: any ideas as to why? I'll gladly go into the source, but some guidance as to a starting point would be nice. -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]] Sent: Wednesday, June 26, 2002 7:33 PM To: Lucene Users List Subject: RE: Stress Testing Lucene --- Scott Ganyo [EMAIL PROTECTED] wrote: 1) Are you sure that the index is corrupted? Maybe the file handles just haven't been released yet. Did you try to reboot and try again? You can also do something like this: # lsof | wc -l 8727 # lsof | grep -c java 5382 # lsof | grep java | head mozilla-b 8428 otis memREG3,5 1242726 1287892 /usr/local/.version/IBMJava2-13/jre/bin/libjavaplugin_oji.so mozilla-b 8453 otis memREG3,5 1242726 1287892 /usr/local/.version/IBMJava2-13/jre/bin/libjavaplugin_oji.so mozilla-b 8454 otis memREG3,5 1242726 1287892 /usr/local/.version/IBMJava2-13/jre/bin/libjavaplugin_oji.so mozilla-b 8455 otis memREG3,5 1242726 1287892 /usr/local/.version/IBMJava2-13/jre/bin/libjavaplugin_oji.so mozilla-b 8457 otis memREG3,5 1242726 1287892 /usr/local/.version/IBMJava2-13/jre/bin/libjavaplugin_oji.so mozilla-b 8471 otis memREG3,5 1242726 1287892 /usr/local/.version/IBMJava2-13/jre/bin/libjavaplugin_oji.so 2) To avoid the too-many-files problem: a) increase the system file handle limits, b) make sure that you reuse IndexReaders as much as you can across requests and clients rather than opening and closing them. -Original Message- From: Nader S. Henein [mailto:[EMAIL PROTECTED]] Sent: Wednesday, June 26, 2002 10:11 AM To: [EMAIL PROTECTED] Subject: Stress Testing Lucene Importance: High Hey people, I'm running a Lucene (v1.2) servlet on Resin and I must say compared to Oracle Intermedia it's working beautifully. 
BUT today, I started stress testing and I downloaded a program called Web Roller, which simulates clients, requests, multi-threading .. the works, and I was doing something like 50 simultaneous requests repeated 10 times in a row. But then something happened and the index got corrupted; every time I try opening the index with the reader to search, or open it with the writer to optimize, I get that damned too-many-files-open error. I can imagine that every application on the market has a breaking point and these breaking points have side effects, so is the corruption of the index a side effect, and if so is there a way that I can configure my web server to crash before the corruption occurs? I'd rather re-start the web server and throw some people off whack rather than have to re-build the index or revert to an older version. Do you know of any way to safeguard against this? General Info: The index is about 45 MB with 60 000 XML files each containing 18-25 fields. Nader S. Henein Bayt.com , Dubai Internet City Tel. +9714 3911900 Fax. +9714 3911915 GSM. +9715 05659557 www.bayt.com
IndexReader Pool
I was going through the lucene-user posts on the web and I came across a posting by Scott Oshima http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00693.html which talks about creating an IndexReader pool to speed up the search. I've looked into that, but I can't figure out what to use for a DataSource, as when creating a pool of DB connections. Is there an equivalent in the Lucene architecture, or should one just take the initiative? Nader S. Henein Bayt.com , Dubai Internet City Tel. +9714 3911900 Fax. +9714 3911915 GSM. +9715 05659557 www.bayt.com
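As far as I know there is no DataSource equivalent in Lucene, so a reader pool has to be hand rolled. A minimal generic sketch of the idea (the type parameter would be IndexReader in practice; sizing, and invalidating pooled readers when the index changes, are left out, and all names here are illustrative):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Supplier;

// Tiny object pool: pre-create N instances up front and hand them out;
// callers must release() what they borrow().
public class SimplePool<T> {
    private final BlockingQueue<T> idle = new LinkedBlockingQueue<>();

    public SimplePool(int size, Supplier<T> factory) {
        for (int i = 0; i < size; i++) idle.add(factory.get());
    }

    // Hand out an idle instance; fails fast when the pool is empty.
    public T borrow() {
        T t = idle.poll();
        if (t == null) throw new IllegalStateException("pool exhausted");
        return t;
    }

    public void release(T t) { idle.add(t); }
}
```

Usage would look something like `new SimplePool<>(4, () -> openReader(indexPath))`, where openReader is a hypothetical helper wrapping IndexReader.open; the point is simply to reuse readers across requests instead of opening and closing them, as Scott suggests elsewhere in this archive.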
RE: Updating Documents in the index
As there is no update in Lucene, this is exactly what you need to do, and I would advise you to batch your updates and optimize after you update, because the number of files balloons if you don't. -Original Message- From: Harpreet S Walia [mailto:[EMAIL PROTECTED]] Sent: Tuesday, June 25, 2002 8:26 AM To: Lucene Users List Subject: Updating Documents in the index Hi, My application needs to provide a feature for updating documents in the index. I am thinking of doing this by deleting the original document and indexing the updated one again; I think this is possible using the delete methods in the IndexReader class. Is there some other, better way to achieve this with Lucene? Thanks and Regards Harpreet
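The delete-then-re-add cycle above can be sketched against the Lucene 1.x API. This is API-level pseudocode, not compiled here; the "id" keyword field, the analyzer, and indexPath are assumptions about your schema, not anything Lucene prescribes:

```
// Sketch only: batch update = delete old copies by a unique keyword
// field, then re-add the new documents and optimize once at the end.
IndexReader reader = IndexReader.open(indexPath);
for each Document doc in updatedDocs:
    reader.delete(new Term("id", doc.get("id")));  // assumes an "id" keyword field
reader.close();                                    // release the delete lock

IndexWriter writer = new IndexWriter(indexPath, analyzer, false);
for each Document doc in updatedDocs:
    writer.addDocument(doc);
writer.optimize();                                 // collapse the ballooned files
writer.close();
```

Doing the deletes and adds in separate passes matters because, in this era of Lucene, deletes go through IndexReader and adds through IndexWriter, and the two cannot hold the write lock at the same time.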
RE: Boolean Query + Memory Monster
I'm all ears .. I'm running the search from a servlet on a Resin web server, any suggestions as to increasing the heap size in this case? -Original Message- From: Scott Ganyo [mailto:[EMAIL PROTECTED]] Sent: Thursday, June 13, 2002 9:47 PM To: 'Lucene Users List' Subject: RE: Boolean Query + Memory Monster Use the java -Xmx option to increase your heap size. Scott -Original Message- From: Nader S. Henein [mailto:[EMAIL PROTECTED]] Sent: Thursday, June 13, 2002 12:20 PM To: [EMAIL PROTECTED] Subject: Boolean Query + Memory Monster I have 1 gig of memory on the machine with the application. When I use a normal query it goes well, but when I use a range query it sucks the memory out of the machine and throws a servlet out-of-memory error. I have 80 000 records in the index and it's 43 MB large. Anything, people? Nader S. Henein Bayt.com , Dubai Internet City Tel. +9714 3911900 Fax. +9714 3911915 GSM. +9715 05659557 www.bayt.com
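For the servlet case specifically, the -Xmx flag has to reach the JVM that runs the container, not your own code. A hypothetical launch line (the class name and heap values are illustrative; with Resin the flags go into its startup script's JVM arguments, and the exact mechanism varies by version):

```shell
# Raise the heap ceiling of the JVM running the servlet container.
java -Xms128m -Xmx512m com.example.YourServerMain
```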
DateField issues
I managed to index according to the date field no problem, but then when I search using a date filter, the search is slightly slower and the results do not seem to be constrained by any date. The following code segment shows how I'm searching (I basically want all the records indexed with dates after start): // Current time in millis long currInMills = System.currentTimeMillis(); // startTime = currInMills - ( a number of days * ( length of a day in millis) ) long start = currInMills - ( freshness * dayInMillis ) ; Query query = QueryParser.parse(queryString, "title", new SuperStandardAnalyzer()); filter = DateFilter.After("datemodified", start) ; Searcher searcher = new IndexSearcher(indexPath); Hits hits = searcher.search(query, filter); I know I'm indexing the dates correctly because I encode them, then I decode them and print them and they seem to be accurate. Just to be sure, time in millis is measured since 01 01 1970, right? If anyone has any idea why this isn't working please feel free to contribute. Oh, and if you're wondering, yes, I also tried the date filter with start and end .. nada. Nader S. Henein Bayt.com , Dubai Internet City Tel. +9714 3911900 Fax. +9714 3911915 GSM. +9715 05659557 www.bayt.com
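On the "since 01 01 1970" question above: yes, millisecond timestamps in Java count from the Unix epoch, 1970-01-01T00:00:00Z, so the cutoff is plain arithmetic. A small sketch; the variable names mirror the snippet above and are otherwise illustrative:

```java
import java.time.Instant;

// Confirms the epoch question and the cutoff arithmetic from the message
// above: start = now - (freshness days in milliseconds).
public class FreshnessCutoff {
    static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

    static long cutoff(long nowMillis, int freshnessDays) {
        return nowMillis - freshnessDays * DAY_MILLIS;
    }

    public static void main(String[] args) {
        // Millisecond zero is the Unix epoch.
        System.out.println(Instant.ofEpochMilli(0)); // 1970-01-01T00:00:00Z
        System.out.println(cutoff(10 * DAY_MILLIS, 3) == 7 * DAY_MILLIS); // true
    }
}
```

If the decoded dates look right, the arithmetic above is unlikely to be the culprit; the mismatch is more likely between how the field was encoded at index time and what the filter compares against.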
RE: Creating indexes
It depends on the structure of the document, but I guess not; I had to write my own XML parser. You get better results when you customize something like that to your needs. -Original Message- From: Chris Sibert [mailto:[EMAIL PROTECTED]] Sent: Wednesday, June 12, 2002 10:27 AM To: Lucene Users List Subject: Creating indexes I have a big (40 MB or so) file to index. The file contains a whole bunch of documents, which are each pretty small, about a few typewritten pages long. There's a title, date, and author for each document, in addition to the documents' actual text. I'm not quite sure how you index this in Lucene. For each document in the original file, I assume that I create a separate Lucene Document object in the index with author, date, title, and text fields. If so, my question is: when I'm reading in the original file for indexing, does Lucene know where each document begins and ends in the original file? Or do I have to write a parser or filter or something for the InputStream that's reading the file? Chris Sibert
Wildcard Search Issues
I'm using the new Lucene 1.5 release, and I remember a message on the lucene-user mailing list about a wildcard issue: if you have a document containing <resloc>CCsa</resloc> and search using the query string resloc:CCsa*, it yields no results. Then there was a reply saying the issue had been resolved in the nightly builds; this was about two weeks before rc1.5 (which I'm using), and according to the rc1.5 mailer that went out, the wildcard issues were hammered out. But I still have this problem: if I search using resloc:CCsa I get 5 results, but when I add the star to the right-hand side of the query string, like so: resloc:CCsa*, I get no results. Anyone care to shed some light on this issue?

Nader S. Henein
Bayt.com, Dubai Internet City
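One plausible cause (an assumption on my part, not something confirmed in the thread): analyzers such as StandardAnalyzer lowercase tokens at index time, while wildcard terms in early QueryParser versions were not run through the analyzer, so a mixed-case prefix never matches the lowercased indexed token. The mismatch can be illustrated with plain strings:

```java
// Illustrates the case-mismatch theory: the analyzer stores "CCsa" as
// "ccsa", but the raw wildcard term "CCsa*" keeps its original case,
// so the prefix comparison fails against what is actually in the index.
public class WildcardCaseDemo {
    static boolean prefixMatches(String indexedToken, String rawPrefix) {
        return indexedToken.startsWith(rawPrefix);
    }

    public static void main(String[] args) {
        String indexed = "CCsa".toLowerCase(); // what the analyzer stored: "ccsa"
        System.out.println(prefixMatches(indexed, "CCsa"));               // false
        System.out.println(prefixMatches(indexed, "CCsa".toLowerCase())); // true
    }
}
```

If this is the cause, lowercasing the wildcard term before querying (or indexing the field untokenized) would be the workaround.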
Filtering in Lucene
For those of you who have worked with the BitSet concept to use Lucene for searching within a subset, just to make sure that I've got this right: if I have 100 000 documents to search, my bit vector will be 100 000 bits long, and to save that vector for repeated use I'll have to use a clob? Am I thinking about this right, or have I misunderstood the concept?

thanks

Nader S. Henein
Bayt.com, Dubai Internet City
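For scale: one bit per document means the vector itself is small. 100 000 bits is only 12 500 bytes, which fits comfortably in an ordinary blob column (or even a flat file) rather than needing a clob. A quick check with java.util.BitSet:

```java
import java.util.BitSet;

// One bit per document: 100 000 documents = 100 000 bits = 12 500 bytes.
public class FilterBitsDemo {
    // Raw size of a one-bit-per-document vector, in bytes.
    static long rawBytes(int numDocs) {
        return numDocs / 8;
    }

    public static void main(String[] args) {
        int numDocs = 100000;
        BitSet bits = new BitSet(numDocs);
        bits.set(42);    // pretend document 42 matched the subset
        bits.set(99999);
        System.out.println(rawBytes(numDocs));  // 12500, i.e. about 12.5 KB
        System.out.println(bits.cardinality()); // 2 documents in the subset
    }
}
```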
RE: Italian web sites
Sniff the IP, and then using the database at the internet topology website http://netgeo.caida.org/perl/netgeo.cgi you can find the country of origin (use that to populate your own DB, so retrieval time decreases as you accumulate IPs). But that will give you the websites hosted in Italy, not Italian-language websites. Unfortunately, unless Italian used a different encoding for the page, picking the language up from the page (JavaScript) won't help much.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, April 24, 2002 1:03 PM
To: [EMAIL PROTECTED]
Subject: Italian web sites

Hi all,

I'm using Jobo for spidering web sites and Lucene for indexing. The problem is that I'd like to spider only Italian web sites. How can I discover the country of a web site? Do you know a method you can suggest?

Thanks
Laura
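The "populate your own DB" step amounts to a memoizing cache in front of the remote lookup. A minimal sketch, where remoteLookup() is a hypothetical stand-in for the netgeo.caida.org query and the Map stands in for a real database table:

```java
import java.util.HashMap;
import java.util.Map;

// Cache each IP's country the first time it is looked up, so repeat
// visitors never hit the remote service again. remoteLookup() is a
// hypothetical placeholder for the actual netgeo.caida.org query.
public class CountryCache {
    private final Map<String, String> cache = new HashMap<String, String>();
    int remoteCalls = 0; // counter, visible for the demo below

    String remoteLookup(String ip) {
        remoteCalls++;
        return "IT"; // placeholder; the real service returns the registered country
    }

    String countryOf(String ip) {
        String country = cache.get(ip);
        if (country == null) {
            country = remoteLookup(ip);
            cache.put(ip, country); // in practice: insert into your own DB
        }
        return country;
    }

    public static void main(String[] args) {
        CountryCache cc = new CountryCache();
        cc.countryOf("192.0.2.1");
        cc.countryOf("192.0.2.1"); // second call is served from the cache
        System.out.println(cc.remoteCalls); // 1
    }
}
```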
RE: too many open files in system
It's not a matter of releasing the handles; it needs to keep them open. This tricked me as well: I thought it kept the file handles of the source XML files open, but if you look at the code, it actually reads the contents of the files from an HTTP request. The file handles are consumed by the files that Lucene creates to store the index. That's why you get the same error when you try to search as well: it tries to open all the files but runs out of handles in the process. You have to increase your Unix file handles and reboot the system (how to do that depends on your OS); this solves one problem. I just hit another one, but I'm convinced it's worth it. I've gotten excellent results after indexing 20 000 files, very fast and very responsive, and if it's going to take some tweaking to get it over this problem, so be it; that's the joy of open source.

cheers .. I hope that was useful

-----Original Message-----
From: root [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 09, 2002 4:02 PM
To: [EMAIL PROTECTED]
Subject: too many open files in system

Hi List!

Doesn't Lucene release the file handles? I get "too many open files in system" after running Lucene for a while! I use the 1.2 rc4 version!

regards
RE: too many open files in system
That depends on how many files you're indexing .. I still have to figure out what logic the LuceneCocoonIndexer adheres to when it is creating the index files.

-----Original Message-----
From: root [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 09, 2002 4:50 PM
To: Lucene Users List
Subject: Re: too many open files in system

On Tuesday, 9. April 2002 14:08, you wrote:
> root wrote:
> > Doesn't Lucene release the file handles? I get "too many open files in system" after running Lucene for a while!
> Are you closing the readers and writers after you've finished using them?
> cheers, Chris

Yes, I close the readers and writers!

@Nader S. Henein: If I increase the file handles, to what count should I increase them?
RE: too many open files in system
That might be the case. I'm indexing 200 000 files; each one has about 30 XML fields, and each field has a set of attributes .. could that be it?

-----Original Message-----
From: Karl Øie [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 09, 2002 7:03 PM
To: Lucene Users List
Subject: Re: too many open files in system

I have worked a little with the Cocoon indexer, and it indexes each XML attribute in a Field. I have done some indexing on both plain-text and XML sources, and I think the "Too many open files" problem is directly related to the number of fields stored in a document in an index. The reason is that I have never encountered "Too many open files" when indexing clean text into one large field, but when creating the many fields required by indexing XML, I got "Too many open files" until I resorted to indexing document batches into a RAM directory.

mvh
karl øie

On Tuesday 09 April 2002 16:42, you wrote:
> This sounds like a question for Cocoon people, as what you are asking about seems to be related to Cocoon's usage of Lucene, not the core Lucene API.
> Otis
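Some rough arithmetic shows why field count matters here. In the old, pre-compound Lucene index format each segment kept one norms file per indexed field on top of a handful of fixed files (.fnm, .fdt, .fdx, .tis, .tii, .frq, .prx); the exact file list is my assumption about that format, not something stated in the thread:

```java
// Back-of-the-envelope estimate of open file handles for a many-field
// index in the pre-compound format: fixed per-segment files plus one
// norms (.fN) file per indexed field, multiplied by live segments.
public class FileHandleEstimate {
    static int filesPerSegment(int indexedFields, int fixedFiles) {
        return fixedFiles + indexedFields; // one .fN norms file per field
    }

    public static void main(String[] args) {
        int perSegment = filesPerSegment(30, 7); // 37 files for a 30-field doc
        int segments = 50;                       // plausible before a big merge
        System.out.println(perSegment * segments); // 1850, over many default
                                                   // per-process handle limits
    }
}
```

A single-field index, by contrast, stays near the fixed-file floor per segment, which matches Karl's observation that plain-text indexing never hit the limit.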
RE: Index problem
I'm currently working on indexing 200 000 documents, with index updates (delete and add) every half hour on three separate webservers, so you can see my ordeal: I have to update the index on three separate machines. How many files are you indexing? The first issue I faced was the "Too many open files" error. And are you indexing your files from the webapp, or did you write the indexer to run from the command line? Sorry about all the questions, but there are so few people on the dev mailers talking about the Lucene/Cocoon issues that it's a joy when a new voice props up.

Nader

-----Original Message-----
From: Flavio Arruda [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 08, 2002 7:03 PM
To: [EMAIL PROTECTED]
Subject: Index problem

Hi everybody,

All documents of my application (indexed by Lucene) come from a web form whose fields the application's administrator can change, remove, or add regularly. Researching Lucene's FAQs, I learned that the only way to alter an indexed document (adding fields, deleting fields, modifying fields) is to delete the given document and then add the modified version. Unfortunately this looks very slow for my application, because I have thousands of documents for each form. My questions are:

- Is there any efficient way to do what I need using Lucene?
- If not, where is the best place to modify the Lucene code? Is someone working on this?

Thanks in advance and best wishes,
Flavio

Flavio Regis de Arruda
[EMAIL PROTECTED]
PROMON*INTELLIGENS
Av. Pres. Juscelino Kubitschek, 1830/6º andar - T3
CEP: 04543-900, São Paulo, SP
Tel.: 55.11.3847 1173, Fax: 55.11.3847 4546
www.promoninteligens.com.br
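For what it's worth, the delete-then-add cycle is usually tolerable when the two phases are batched, since (in old Lucene) deletes go through an IndexReader and adds through an IndexWriter: do all the deletes in one pass, close the reader, then do all the adds in one pass. A plain-Java model of the per-document pattern (the Map stands in for the index; the keys and contents are made up):

```java
import java.util.HashMap;
import java.util.Map;

// Models "update = delete old version + add new version", the only way
// to modify an indexed document. The Map is a stand-in for the index.
public class UpdateModel {
    static void update(Map<String, String> index, String id, String newContents) {
        index.remove(id);           // the IndexReader "delete by term" step
        index.put(id, newContents); // the IndexWriter "add document" step
    }

    public static void main(String[] args) {
        Map<String, String> index = new HashMap<String, String>();
        index.put("doc-1", "old contents");
        index.put("doc-2", "unchanged");
        update(index, "doc-1", "new contents");
        System.out.println(index.get("doc-1")); // new contents
        System.out.println(index.size());       // 2
    }
}
```

The key to making this efficient is a unique-id field per document, so the stale version can be deleted by a single term rather than by re-searching.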