Re: sorting by score and an additional field
On Thursday 04 November 2004 03:52, Chris Fraschetti wrote:
> I can only get it to sort by one or the other... but when it does one, it does sort correctly, but together in {score, custom_field} only the first sort seems to apply.

Do you use real documents for that test? The score is a float value and it is hardly ever the same for two documents (unless you use very short test documents), so that may be why the second field never appears to be used for sorting.

regards
Daniel
--
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
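The (score, custom_field) ordering Daniel describes can be illustrated without Lucene's Sort API at all; the sketch below is plain Java with invented hit data, showing that the secondary key only matters when two primary scores are exactly equal:

```java
import java.util.*;

public class ScoreThenFieldSort {
    public static void main(String[] args) {
        // Each row: {id, score, rank}. A hypothetical stand-in for hits
        // carrying a float relevance score and a secondary sort field.
        Object[][] hits = {
            {"a", 0.71f, 3},
            {"b", 0.71f, 1},   // same score as "a": only here does rank matter
            {"c", 0.93f, 2}
        };

        // Primary key: score descending; secondary key: rank ascending.
        Arrays.sort(hits, Comparator
            .<Object[]>comparingDouble(h -> -((Float) h[1]))
            .thenComparingInt(h -> (Integer) h[2]));

        for (Object[] h : hits) System.out.println(h[0]);
    }
}
```

With real documents, scores like 0.71 vs 0.7100001 never tie, so the secondary field is simply never consulted, which is exactly the behavior reported in the thread.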
RE: Faster highlighting with TermPositionVectors
Hi Aviran,

The code you are calling assumes that you have indexed with TermVector support for offsets (and optionally positions), i.e. code like this:

    doc.add(new Field("contents", content, Field.Store.COMPRESS,
        Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));

If you haven't stored offsets then the getTermFreqVector method returns a TermFreqVector rather than the TermPositionVector subclass, hence the class cast exception. I should tighten up that section of code to check for this situation and throw an exception with a suitable message.

By the way, the getAnyTokenStream method is coded a little more defensively and will silently drop back to re-analyzing (parsing) the original content if it is asked to get a TokenStream for a field that doesn't have offset data stored. This is probably the safest way to code your app, and the cost of the logic which checks the field storage type is minimal.

Cheers
Mark
Re: Faster highlighting with TermPositionVectors
Mark,

This is great stuff! One quick comment just from my look at the code (I haven't tried it yet). Shouldn't the tpv variable be used in this method?

    public static TokenStream getAnyTokenStream(IndexReader reader, int docId,
            String field, Analyzer analyzer) throws IOException {
        TokenStream ts = null;
        TermFreqVector tfv = (TermFreqVector) reader.getTermFreqVector(docId, field);
        if (tfv != null) {
            if (tfv instanceof TermPositionVector) {
                // the most efficient choice..
                TermPositionVector tpv = (TermPositionVector) reader.getTermFreqVector(docId, field);
                ts = getTokenStream(reader, docId, field);
            }
        }
        // No token info stored so fall back to analyzing raw content
        if (ts == null) {
            ts = getTokenStream(reader, docId, field, analyzer);
        }
        return ts;
    }

Erik

On Oct 28, 2004, at 7:16 PM, [EMAIL PROTECTED] wrote:
> Thanks to the recent changes (see CVS) in TermFreqVector support we can now make use of term offset information held in the Lucene index rather than incurring the cost of re-analyzing text to highlight it. I have created a class (see http://www.inperspective.com/lucene/TokenSources.java ) which handles creating a TokenStream from the TermPositionVector stored in the database which can then be passed to the highlighter. This approach is significantly faster than re-parsing the original text. If people are happy with this class I'll add it to the Highlighter sandbox, but it may sit better elsewhere in the Lucene code base as a more general purpose utility. BTW, as part of putting this together I found that the TermFreq code throws a null pointer when indexing fields that produce no tokens (i.e. empty or all stopwords). Otherwise things work very well.
> Cheers
> Mark
A TokenFilter to split words and numbers
Hi,

I'm trying to implement a TokenFilter that splits words containing numbers into a phrase of separate words and numbers. For example, I want to turn "v70" into the phrase "v 70". I've implemented a filter that does the actual split with a regular expression. Then I use this filter in my analyzer, which is passed to the QueryParser. The resulting Query looks fine, +(+words:"v 70"), but it does not return any Hits. If I instead pass in the input string "v 70" (ignored by the filter) the resulting query looks the same, but I get Hits. Why is this? Does it have something to do with the QueryParser guessing what kind of query it is by examining the string, and thus presuming that the first string should not be parsed into a PhraseQuery? Anyway, if there is a correct way to accomplish what I want, could anyone please give me a hint? One way I thought about is pre-parsing the query and constructing several subqueries, i.e. PhraseQuerys and so on, and then combining them in a BooleanQuery, but I guess there is a nicer solution? I have a similar problem with another filter I'm trying to implement that should remove certain suffixes and replace them with a wildcard (bilar -> bil*).

/William
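The splitting step William describes can be sketched with plain java.util.regex (this is an illustration of the idea, not his actual TokenFilter; the class and method names are invented):

```java
import java.util.*;
import java.util.regex.*;

public class WordNumberSplitter {
    // Split a token like "v70" into its letter and digit runs,
    // the kind of expansion the filter in the thread performs.
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        // Alternate runs of digits and non-digits.
        Matcher m = Pattern.compile("\\d+|\\D+").matcher(token);
        while (m.find()) parts.add(m.group());
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("v70"));     // [v, 70]
        System.out.println(split("abc123x")); // [abc, 123, x]
    }
}
```

Since QueryParser builds a PhraseQuery whenever the analyzer yields more than one token, the split must have been applied identically at index time for such a phrase to match.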
Re: A TokenFilter to split words and numbers
william.sporrong writes:
> Does it have something to do with the QueryParser guessing what kind of query it is by examining the string, and thus presuming that the first string should not be parsed into a PhraseQuery?

QueryParser creates a PhraseQuery for words that are tokenized to more than one token. You should see that in the serialized query.

> Anyway, if there is a correct way to accomplish what I want, could anyone please give me a hint? One way I thought about is pre-parsing the query and constructing several subqueries, i.e. PhraseQuerys and so on, and then combining them in a BooleanQuery, but I guess there is a nicer solution?

I guess you could override the getFieldQuery method of QueryParser and change the way queries are generated.

> I have a similar problem with another filter I'm trying to implement that should remove certain suffixes and replace them with a wildcard (bilar -> bil*).

If you expect bil* to be executed as a wildcard/prefix query, this cannot work. The query parser parses the query, not the analyzer output. Again, you might introduce such behaviour in getFieldQuery.

Morus
RE: Efficient search on lucene mailing archives
When I want to search for anything I use the following URL: http://marc.theaimsgroup.com/

-Sreedhar

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]]
Sent: Friday, October 15, 2004 2:18 AM
To: Lucene Users List
Subject: Re: Efficient search on lucene mailing archives

On Oct 14, 2004, at 4:27 PM, David Spencer wrote:
> sam s wrote:
>> Hi Folks, Is there any place where I can do a better search on the lucene mailing archives? I tried JGuru and it looks like their search is paid. The Apache-maintained archives lack efficient searching.
> Of course one of the ironies is, shouldn't we be able to use Lucene to search the mailing list archives and even apache.org?

Eyebrowse uses Lucene and is set up for the Apache e-mail lists: http://nagoya.apache.org/eyebrowse/SummarizeList?listId=30 It seems clunky to navigate though, and it would be nice to have more recent e-mails ranked higher than older ones.

Erik
one huge index or many small ones?
Hi,

We are going to move from a just-in-time perl based search to using lucene in our project. I have to index emails (bodies and also attachments). I keep all the bodies and attachments in the filesystem for a long period of time. I have to find emails that fulfill certain conditions; some of the conditions are taken care of at a different level, so in the end I have a SUBSET of emails I have to run through lucene. I was assuming that the best way would be to create an index for each email. Having a unique index for a group of emails (say a day's worth of email) seems too coarse grained: imagine a day has 1 emails, and some queries will only want to look in a handful of those emails... But the problem with having one index per email is the massive number of emails... imagine having 10 indexes. Anyway, any ideas about that? I just wanted to check whether someone feels I am wrong. Thanks
Re: one huge index or many small ones?
One index per e-mail is way overkill and probably not even feasible resource-wise. Take advantage of fields in Lucene documents and use BooleanQuery to AND in other criteria for filtering, or use a Filter if the filtering criteria are relatively static.

Erik

On Nov 4, 2004, at 11:00 AM, javier muguruza wrote:
> [...]
Re: one huge index or many small ones?
Hi Javier,

I suggest you build a single index, with all the information you need to find the right mail you are looking for. You can then use Lucene alone to find your messages.

Giulio Cesare

On Thu, 4 Nov 2004 17:00:35 +0100, javier muguruza [EMAIL PROTECTED] wrote:
> [...]
Re: one huge index or many small ones?
Thanks Erik and Giulio for the fast reply. I am just starting to look at lucene, so forgive me if I got some ideas wrong. I understand your concerns about one index per email. But having only one index is also (I guess) out of the question. I am building an email archive. Email will be kept indefinitely available for search, adding new email every day. Imagine a company with millions of emails per day (been there), keep it growing for years, adding stuff to the index while using it for searches continuously... That's why my idea is to decide on a time frame (a day, a month... an extreme would be an instant, that is, a single email, my original idea) and build the index for all the email in that timeframe. After the timeframe is finished no more stuff will ever be added. Before the lucene search, emails are selected based on other conditions (we store the from, to, date etc. in a database as well, and these conditions are enforced with a sql query first, so I would not need to enforce them in the lucene search again; also, that query can be quite sophisticated and I guess would not easily be possible in lucene by itself). That first db step gives me a group of emails that maybe I have to further narrow down based on a lucene search (of body and attachment contents). Having an index for more than one email means that after the search I would have to get only the overlapping emails from the two searches... Maybe this is better than keeping the same info I have in the db in lucene fields as well. An example: I want all the email from [EMAIL PROTECTED] from Jan to Dec containing the word 'money'. I run the db query that returns a list with john's email for that period of time, then (let's assume I have one index per day) I iterate over every day, looking for emails that contain 'money', and from the results returned by lucene I keep only those that are also in the first list. Does that sound better?
Re: one huge index or many small ones?
First off, I think you should make a decision about what you want to store in your index and how you go about searching it. The less information you store in your index, the better, for performance reasons. If you can store the messages in an external database you probably should. I would create a table that contains a clob and an associated id that can be used to get the message at any time. Assuming mail is in SMTP RFC format, I would suggest:

Unstored: Subject
Keyword: From
Keyword: To
Stored, Unindexed: ID -- this would be the ID of the message in your database
Unstored: Body
Keyword: Month
Keyword: Day
Keyword: Year
(and any other keywords you might use)

Your lucene query would then look something like:

+From:[EMAIL PROTECTED] +(Subject:money Body:money) +Year:2004

Use the stored ID field to get the message contents from your database. If you want to break your index down into multiple indexes, based on some criterion such as time frame, you could do that too. You would then use a MultiSearcher or ParallelMultiSearcher to process the multiple indexes.

On Thu, 4 Nov 2004 18:03:49 +0100, javier muguruza [EMAIL PROTECTED] wrote:
> [...]
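The schema above keeps Month, Day, and Year as separate Keyword fields; a small sketch of deriving those keyword values from a message date (field names taken from the suggestion above, but the zero-padding of month and day is my own assumption, chosen so the values also sort lexicographically):

```java
import java.time.LocalDate;
import java.util.*;

public class MailDateKeywords {
    // Produce the keyword field values for one message date.
    static Map<String, String> keywords(LocalDate date) {
        Map<String, String> kw = new LinkedHashMap<>();
        kw.put("Year", String.valueOf(date.getYear()));
        kw.put("Month", String.format("%02d", date.getMonthValue()));
        kw.put("Day", String.format("%02d", date.getDayOfMonth()));
        return kw;
    }

    public static void main(String[] args) {
        System.out.println(keywords(LocalDate.of(2004, 11, 4)));
    }
}
```

A query like +Year:2004 +Month:11 then narrows by date entirely inside the index, instead of intersecting Lucene results with a database result set afterwards.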
prefix wildcard matching options (*blah)
I'm thinking about making a separate field in my index for prefix wildcard searches. I would chop off x characters from the front to create subtokens for the prefix matches. For the term "republican" the terms created would be: republican, epublican, publican, ublican, blican. My query parser would then intelligently decide if there is a term that has a wildcard as its first character. Instead of searching the normal field, it would then remove the wildcard from the start of the term and search on the prefix field instead. A search for *pub* would be converted to pub* in the prefix field. A search for *blican would be converted to blican. Does this sound like an intelligent way to create a fast prefix querying ability? Can I index the prefix field with a separate analyzer that makes the prefix tokens, or should I just do the index-time expansion manually? I wouldn't need to search with this analyzer, just index with it, because the searching doesn't have to expand all those terms. If using a separate analyzer for the prefix field makes more sense, how do I make a tokenizer that returns multiple tokens for one word?
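The index-time expansion described above amounts to emitting every suffix of the term down to some minimum length; a plain-Java sketch of that generation step (not an actual Lucene TokenFilter; the minimum-length cutoff is an assumption matching the example terms):

```java
import java.util.*;

public class PrefixSubtokens {
    // For each starting position, emit the suffix beginning there, so a
    // leading-wildcard query like *blican can be answered by the ordinary
    // prefix (or exact) query blican on this field.
    static List<String> subtokens(String term, int minLength) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + minLength <= term.length(); i++) {
            out.add(term.substring(i));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(subtokens("republican", 6));
    }
}
```

The trade-off is index size: a term of length n produces up to n subtokens, which is why a minimum suffix length is worth having.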
Re: one huge index or many small ones?
javier muguruza wrote:
> [...]

Hi Javier,

I think your optimization should take care of the response time of search queries; I assume this is the variable you need to optimize. It would probably be good to read the lucene benchmarks first: http://jakarta.apache.org/lucene/docs/benchmarks.html

If you have a mandatory date constraint for each of your indexes you can split the index on a time basis; I assume one index per month will be enough. I think ... 10.000 emails ... it will be fast enough if you search in only one index afterwards. But maybe this is not such a good idea? What about creating one index per user? If your search requires a user or a sender, and you can get its name from the database and apply only the other constraints on an index dedicated to that user... I think the lucene search will be much faster. Also the database search will be fast... I don't think you will have more than 1.000-10.000 user names. Or maybe 1 index/user/year, or 1 index/receiver/year + 1 index/sender/year. What about this solution, is it feasible for your system?

All the best,
Sergiu
updating documents in the index
So I've read that the only way to change a field in an already indexed document is to simply remove it and re-add it... but that can be costly if I need to go back to where the data originally came from and reparse and reindex it all. Is there a way to keep the document around after the delete call to the IndexReader so that I can modify a field and add it again with a writer? I would simply rip out all the fields and then create a new document, but the 'content' field isn't stored, because my index would be much larger if I kept the content around. Does anyone have any good solutions for this, short of keeping the content in the index or going back to the original document source? Does 'luke' rebuild a document so that it can be updated? If so, how does it go about it? Thanks in advance everyone!

-Chris Fraschetti
Re: sorting by score and an additional field
Erik:

    doc.add(Field.Keyword("rank_field", rank_value));

is what I use to build my customized rank field. Considering the rank_value is an integer, should it be zero padded? Currently I have it padded because the rest of lucene needs it that way; should it be the same here? If I specify INT or STRING, the sort on rank works just fine... but it's when I combine the two that I have issues. I'm using 1.4.2... but I'll see how my code differs from yours and give it a try. Can you tell me how you indexed your secondary rank field? As a keyword, or what have you?

Thanks,
Chris Fraschetti

On Thu, 4 Nov 2004 04:33:12 -0500, Erik Hatcher [EMAIL PROTECTED] wrote:
> On Nov 3, 2004, at 9:52 PM, Chris Fraschetti wrote:
>> Has anyone had any luck using lucene's built in sort functions to sort first by the lucene hit score and secondarily by a Field in each document indexed as Keyword and in integer form?
>
> I get multiple sort fields to work; here are two examples:
>
>     new Sort(new SortField[] { new SortField("category"), SortField.FIELD_SCORE, new SortField("pubmonth", SortField.INT, true) });
>
>     new Sort(new SortField[] { SortField.FIELD_SCORE, new SortField("category") });
>
> Both of these, on a tiny dataset of only 10 documents, work exactly as expected.
>
>> I can only get it to sort by one or the other... but when it does one, it does sort correctly, but together in {score, custom_field} only the first sort seems to apply. Any ideas?
>
> Are you using Lucene 1.4.2? How did you index your integer field? Are you simply using the .toString() of an Integer? Or zero padding the field somehow? You can use the .toString method, but you have to be sure that the sorting code does the right parsing of it, so you might need to specify SortField.INT as its type. It will do automatic detection if the type is not specified, but that assumes that the first document it encounters parses properly; otherwise it will fall back to using a String sort.
>
> Erik

--
___
Chris Fraschetti, Student CompSci System Admin
University of San Francisco
e [EMAIL PROTECTED] | http://meteora.cs.usfca.edu
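On the zero-padding question in this thread: padding only matters when the field is compared as a String. A quick plain-Java sketch (not Lucene API) of why; with SortField.INT the unpadded value is parsed numerically and padding is unnecessary:

```java
public class PaddedRank {
    // Left-pad an integer with zeros so lexicographic order matches
    // numeric order ("42" > "123" as strings, but "00000042" < "00000123").
    static String pad(int rank, int width) {
        return String.format("%0" + width + "d", rank);
    }

    public static void main(String[] args) {
        System.out.println(pad(42, 8));
        // Unpadded strings compare wrongly, padded ones correctly:
        System.out.println("42".compareTo("123") < 0);            // false
        System.out.println(pad(42, 8).compareTo(pad(123, 8)) < 0); // true
    }
}
```

So a padded Keyword field sorts correctly even under the String fallback Erik mentions, while an unpadded one requires the INT type to be specified explicitly.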
Re: updating documents in the index
Chris Fraschetti wrote:
> So I've read that the only way to change a field in an already indexed document is to simply remove it and re-add it... but that can be costly if I need to go back to where the data originally came from and reparse and reindex it all.

Yes.

> Is there a way to keep the document around after the delete call to the indexreader so that I can modify a field and add it again with a writer?

Lucene does not provide this functionality (yet?). If you try to read from the index a document that contains unstored fields, you will get nulls instead of their values. In other words, you cannot read the Document instance, modify it, and then add it, because you will lose all information from the unstored fields. Also, when you re-add the document all fields need to be analyzed once again.

> I would simply rip out all the fields and then create a new document, but the 'content' field isn't stored, because my index would be much larger if I kept the content around. Does anyone have any good solutions for this, short of keeping the content in the index or going back to the original document source? Does 'luke' rebuild a document so that it can be updated? If so, how does it go about it?

They (me and Luke :) do it the hard way: we iterate over all terms in the index, and then iterate over all documents which contain each term. If the enumeration contains the selected doc number, the term and its positions are put in the target term array. After going through the whole index, we end up with an array containing all terms and every position of each term in the document. This array is then concatenated using spaces. That's it, not really a solution, rather a hack. This could be sped up using term vectors (Lucene 1.4.x), but you first need to build your index with term vectors.
--
Best regards,
Andrzej Bialecki

Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
FreeBSD developer (http://www.freebsd.org)
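The reconstruction Andrzej describes, once all terms and their positions for one document have been collected (by scanning the index, or from stored term vectors), reduces to placing each term at its positions and joining with spaces. A minimal sketch of that final step, with invented sample data:

```java
import java.util.*;

public class DocReconstructor {
    // Rebuild the token text of one document from a map of
    // term -> positions at which that term occurs.
    static String reconstruct(Map<String, int[]> termPositions) {
        int max = -1;
        for (int[] ps : termPositions.values())
            for (int p : ps) max = Math.max(max, p);
        String[] slots = new String[max + 1];
        for (Map.Entry<String, int[]> e : termPositions.entrySet())
            for (int p : e.getValue()) slots[p] = e.getKey();
        return String.join(" ", Arrays.asList(slots));
    }

    public static void main(String[] args) {
        Map<String, int[]> tp = new LinkedHashMap<>();
        tp.put("quick", new int[]{1});
        tp.put("the", new int[]{0, 3});
        tp.put("fox", new int[]{2});
        System.out.println(reconstruct(tp));
    }
}
```

Note the limits of the hack: stopwords and anything else the analyzer dropped leave gaps (null slots here), and original punctuation, case, and spacing are gone for good.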
Re: one huge index or many small ones?
Sergiu,

A month could have tens of millions of emails in the worst case, but maybe I can discard such a bad assumption for our current project. Let's say 10,000 emails per day max; that makes 300k emails a month. I would choose either one index per day or per month (or week or whatever). Your suggestion about an index per user is not valid for us, unfortunately; my searches do not require a user. They can, say, ask for 'all email from department C from last week', etc. So, if I choose one index per day (or month) I already know that I will have to search in many indexes depending on the timeframe (the time frame is the only required value for the search).

Thanks for the suggestions!

On Thu, 04 Nov 2004 19:01:53 +0100, Sergiu Gordea [EMAIL PROTECTED] wrote:
> [...]
I am just starting to look at Lucene, so forgive me if I got some ideas wrong. I understand your concerns about one index per email. But having only one index is also (I guess) out of the question. I am building an email archive. Email will be kept available for search indefinitely, adding new email every day. Imagine a company with millions of emails per day (been there), keep it growing for years, adding stuff to the index while using it for searches continuously... That's why my idea is to decide on a time frame (a day, a month... an extreme would be an instant, that is a single email, my original idea) and build the index for all the email in that timeframe. After the timeframe is finished no more stuff will ever be added. Before the Lucene search, emails are selected based on other conditions (we store the from, to, date etc. in a database as well, and those conditions are enforced with an SQL query first, so I would not need to enforce them in the Lucene search again; also that query can be quite sophisticated and I guess would not easily be possible in Lucene by itself). That first db step gives me a group of emails that maybe I have to further narrow down based on a Lucene search (of body and attachment contents). Having an index for more than one email means that after the search I would have to keep only the overlapping emails from the two searches... Maybe this is better than keeping the same info I have in the db in Lucene fields as well. An example: I want all the email from [EMAIL PROTECTED] from Jan to Dec containing the word 'money'. I run the db query, which returns a list with John's email for that period of time; then (let's assume I have one index per day) I iterate over every day, looking for emails that contain 'money', and from the results returned by Lucene I keep only those that are also in the first list. Does that sound better? 
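The overlap step Javier describes (SQL first, then full-text, then keep only the emails in both result lists) is just a set intersection on message IDs. A minimal, dependency-free Java sketch — the IDs here are made up and stand in for the database rows and the Lucene hits:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OverlapDemo {
    // Keep only the messages that appear in BOTH result lists.
    static Set<String> overlap(List<String> dbIds, List<String> searchIds) {
        Set<String> result = new HashSet<String>(dbIds);
        result.retainAll(searchIds); // in-place intersection
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical IDs: the SQL query matched these messages...
        List<String> fromDb = Arrays.asList("m1", "m2", "m3", "m7");
        // ...and the full-text search for 'money' matched these.
        List<String> fromSearch = Arrays.asList("m2", "m5", "m7");
        System.out.println(overlap(fromDb, fromSearch)); // m2 and m7 survive
    }
}
```

Whether this beats putting the metadata into Lucene fields mostly depends on how large the two candidate lists get before the intersection.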
On Thu, 4 Nov 2004 17:26:21 +0100, Giulio Cesare Solaroli [EMAIL PROTECTED] wrote: Hi Javier, I suggest you build a single index, with all the information you need to find the mail you are looking for. You can then use Lucene alone to find your messages. Giulio Cesare On Thu, 4 Nov 2004 17:00:35 +0100, javier muguruza [EMAIL PROTECTED] wrote: Hi, We are going to move from a just-in-time perl-based search to using Lucene in our project. I have to index emails (bodies and also attachments). I keep all the bodies and attachments in the filesystem for a long period of time. I have to find emails that fulfill certain conditions; some of the conditions are taken care of at a different level, so in the end I have a SUBSET of emails I have to run through Lucene. I was assuming that the best way would be to create an index for each email. Having a unique index for a group of emails (say a day's worth of email) seems too coarse grained: imagine a day has 10,000 emails, and some queries will like to look in only a
Re: one huge index or many small ones?
Justin, Yes, I wanted as little info as possible in the index. The body and attachments will be stored outside Lucene. As I mentioned, I only need to deal with the body/attachment contents in Lucene; from, to, subject, dates etc. are dealt with before. My idea was: Unstored: Body + attachment (after extracting text). I don't need to know in which attachment the words I am looking for are; it's enough to know they are in the email. I will have a look at MultiSearcher or ParallelMultiSearcher, thanks! On Thu, 4 Nov 2004 10:28:18 -0700, Justin Swanhart [EMAIL PROTECTED] wrote: First off, I think you should make a decision about what you want to store in your index and how you go about searching it. The less information you store in your index, the better, for performance reasons. If you can store the messages in an external database you probably should. I would create a table that contains a clob and an associated id that can be used to get the message at any time. Assuming mail is in SMTP RFC format, I would suggest: Unstored: Subject Keyword: From Keyword: To Stored,Unindexed: ID -- this would be the ID of the message in your database Unstored: Body Keyword: Month Keyword: Day Keyword: Year (and any other keywords you might use) Your Lucene query would then look something like: +From:[EMAIL PROTECTED] +(Subject:money Body:money) +Year:2004 Use the stored ID field to get the message contents from your database. If you want to break your index down into multiple indexes, based on some criteria such as time frame, you could do that too. You would then use a MultiSearcher or ParallelMultiSearcher to process the multiple indexes. On Thu, 4 Nov 2004 18:03:49 +0100, javier muguruza [EMAIL PROTECTED] wrote: Thanks Erik and Giulio for the fast reply. I am just starting to look at Lucene so forgive me if I got some ideas wrong. I understand your concerns about one index per email. But having only one index is also (I guess) out of the question. I am building an email archive. 
[...]
Re: one huge index or many small ones?
Hi Javier, On Thu, 4 Nov 2004 20:08:15 +0100, javier muguruza [EMAIL PROTECTED] wrote: [...] You can probably get away with this solution as well, but I would suggest you test Lucene performance before starting to optimize. If your queries on the text of the body/attachments are not huge (my users end up with rewritten queries whose lengths are up to 600 KBytes!!), Lucene will probably be able to return you the right result much faster than looking in different places for the same query. Don't be afraid of the number of documents either; not before testing on some real data. You could easily find that a simpler architecture performs fast enough, and is much easier to set up and tune. [...] Giulio Cesare
Highlighting in Lucene
Hi All, I would like to know if Lucene supports highlighting of the searched text? Thanks in advance. Thanks, Ramon Aseniero
RE: Highlighting in Lucene
There is a highlighting tool in the sandbox (3/4 of the way down): http://jakarta.apache.org/lucene/docs/lucene-sandbox/ -Original Message- From: Ramon Aseniero [mailto:[EMAIL PROTECTED] Sent: Thursday, November 04, 2004 3:40 PM To: 'Lucene Users List' Subject: Highlighting in Lucene Hi All, I would like to know if Lucene support highlighting on the searched text? Thanks in advance. Thanks, Ramon Aseniero - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
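Conceptually, that sandbox tool marks up the query terms found in the retrieved text. The real Highlighter works on analyzed TokenStreams and picks the best-scoring fragments; the following is only a toy, dependency-free sketch of the idea, not the sandbox API:

```java
public class ToyHighlighter {
    // Wrap every case-insensitive occurrence of term in <b>...</b>.
    static String highlight(String text, String term) {
        StringBuffer out = new StringBuffer();
        String lowerText = text.toLowerCase();
        String lowerTerm = term.toLowerCase();
        int from = 0;
        int at;
        while ((at = lowerText.indexOf(lowerTerm, from)) >= 0) {
            out.append(text.substring(from, at));           // text before the match
            out.append("<b>")
               .append(text.substring(at, at + term.length())) // the match, original case
               .append("</b>");
            from = at + term.length();
        }
        out.append(text.substring(from));                    // remainder
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("Money makes money", "money"));
        // <b>Money</b> makes <b>money</b>
    }
}
```

The sandbox version additionally tokenizes with the same Analyzer used at index time, so stemmed or lowercased query terms still line up with the original text.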
RE: Highlighting in Lucene
Hi Will, Thanks a lot that really helps. Thanks, Ramon -Original Message- From: Will Allen [mailto:[EMAIL PROTECTED] Sent: Thursday, November 04, 2004 12:45 PM To: Lucene Users List Subject: RE: Highlighting in Lucene There is a highlighting tool in the sandbox (3/4 of the way down): http://jakarta.apache.org/lucene/docs/lucene-sandbox/ -Original Message- From: Ramon Aseniero [mailto:[EMAIL PROTECTED] Sent: Thursday, November 04, 2004 3:40 PM To: 'Lucene Users List' Subject: Highlighting in Lucene Hi All, I would like to know if Lucene support highlighting on the searched text? Thanks in advance. Thanks, Ramon Aseniero - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory?
Hm, as far as I know, a CVS sub-directory in an index directory should not bother Lucene. As a matter of fact, I tested this (I used a file, not a directory) for Lucene in Action. What error are you getting? I know there is a -I CVS option for ignoring files; perhaps it works with directories, too. Otis --- Chuck Williams [EMAIL PROTECTED] wrote: I have a Tomcat web module being developed with the Netbeans 4.0 IDE using CVS. One CVS repository holds the sources of my various web files in a directory structure that directly parallels the standard Tomcat webapp directory structure. This is well supported in a fully automated way within Netbeans. I have my search index directory as a subdirectory of WEB-INF, which seemed the natural place to put it. The index files themselves are not in the repository. I want to be able to do a CVS Update for the web module directory tree as a whole. However, this places a CVS subdirectory within the index directory, which in turn causes Lucene indexing to blow up the next time I run it, since this is an unexpected entry in the index directory. To work around the problem, I need to both delete the CVS subdirectory and find and delete the pointers to it in the Entries file and the Netbeans cache file within the CVS subdirectory of the parent directory. This is annoying, to say the least. I've asked the Netbeans users if there is a way to avoid creation of the index's CVS subdirectory, but the same thing happened using WinCVS, so I expect this is not a Netbeans issue. It could be my relative ignorance of CVS. How do others avoid this problem? Any advice or suggestions would be appreciated. Thanks, Chuck
RE: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory?
Otis, thanks for looking at this. The stack trace of the exception is below. I looked at the code: it wants to delete every file in the index directory, but fails to delete the CVS subdirectory entry (presumably because it is marked read-only; the specific exception is swallowed). Even if it could delete the CVS subdirectory, that would just cause another problem with Netbeans/CVS, since it wouldn't know how to fix up the pointers in the parent CVS subdirectory. Is there a change I could make that would cause it to safely leave this alone? This problem only arises on a full index (incremental == false, i.e. create == true). Incremental indexing works fine in my app. Chuck

    java.io.IOException: Cannot delete CVS
        at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:144)
        at org.apache.lucene.store.FSDirectory.<init>(FSDirectory.java:128)
        at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:102)
        at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:83)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:173)
        at [my app]...

-Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Thursday, November 04, 2004 1:54 PM To: Lucene Users List Subject: Re: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory? Hm, as far as I know, a CVS sub-directory in an index directory should not bother Lucene. As a matter of fact, I tested this (I used a file, not a directory) for Lucene in Action. What error are you getting? I know there is a -I CVS option for ignoring files; perhaps it works with directories, too. Otis --- Chuck Williams [EMAIL PROTECTED] wrote: I have a Tomcat web module being developed with the Netbeans 4.0 IDE using CVS. One CVS repository holds the sources of my various web files in a directory structure that directly parallels the standard Tomcat webapp directory structure. This is well supported in a fully automated way within Netbeans. 
[...]
PorterStemmer / Levenshtein Distance
Hey, On the site it says Lucene uses the Levenshtein distance algorithm for fuzzy matching; where is this in the source code? Also, I would like to use the Porter stemming algorithm for something else. Are there any documents on the Lucene implementation of the Porter stemmer? Best, Yousef
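For reference: if memory serves, the edit distance behind fuzzy matching is computed inside org.apache.lucene.search.FuzzyTermEnum (which FuzzyQuery uses to expand terms), and the stemmer behind org.apache.lucene.analysis.PorterStemFilter lives in PorterStemmer in the same package. The distance itself is the classic dynamic program; a self-contained sketch of the algorithm (not Lucene's exact code):

```java
public class EditDistance {
    // Classic Levenshtein dynamic program: d[i][j] = minimum edits
    // to turn the first i chars of a into the first j chars of b.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete everything
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert everything
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,          // deletion
                        d[i][j - 1] + 1),         // insertion
                        d[i - 1][j - 1] + cost);  // substitution (or match)
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("lucene", "lucence")); // 1
    }
}
```

FuzzyQuery then turns the distance into a similarity score relative to the term lengths.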
Sorting in Lucene.
Hi All, Does Lucene support sorting on the search results? Thanks in advance. Ramon
RE: Sorting in Lucene.
Yes, by one or multiple criteria. Chuck -Original Message- From: Ramon Aseniero [mailto:[EMAIL PROTECTED] Sent: Thursday, November 04, 2004 6:21 PM To: 'Lucene Users List' Subject: Sorting in Lucene. Hi All, Does Lucene supports sorting on the search results? Thanks in advance. Ramon - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Sorting in Lucene.
Hi Chuck, Can you please point me to some articles or FAQ about Sorting in Lucene? Thanks a lot for your reply. Thanks, Ramon -Original Message- From: Chuck Williams [mailto:[EMAIL PROTECTED] Sent: Thursday, November 04, 2004 9:44 PM To: Lucene Users List Subject: RE: Sorting in Lucene. Yes, by one or multiple criteria. Chuck -Original Message- From: Ramon Aseniero [mailto:[EMAIL PROTECTED] Sent: Thursday, November 04, 2004 6:21 PM To: 'Lucene Users List' Subject: Sorting in Lucene. Hi All, Does Lucene supports sorting on the search results? Thanks in advance. Ramon - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Sorting in Lucene.
Ramon, I'm not sure where a guide or tutorial might be, but you should be able to see how to do it from the javadoc. Look at the classes Sort, SortField, SortComparator. I've also included a recent message from this group below concerning sorting with multiple fields. FYI, a number of people have wanted to sort first by score and secondarily by another field. This is tricky since scores are frequently different in low-order decimal positions. Good luck, Chuck -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Thursday, November 04, 2004 1:33 AM To: Lucene Users List Subject: Re: sorting by score and an additional field On Nov 3, 2004, at 9:52 PM, Chris Fraschetti wrote: Has anyone had any luck using lucene's built-in sort functions to sort first by the lucene hit score and secondarily by a Field in each document indexed as Keyword and in integer form? I get multiple sort fields to work; here are two examples:

    new Sort(new SortField[]{ new SortField("category"), SortField.FIELD_SCORE, new SortField("pubmonth", SortField.INT, true) });

    new Sort(new SortField[]{ SortField.FIELD_SCORE, new SortField("category") });

Both of these, on a tiny dataset of only 10 documents, work exactly as expected. I can only get it to sort by one or the other... but when it does one, it does sort correctly, but together in {score, custom_field} only the first sort seems to apply. Any ideas? Are you using Lucene 1.4.2? How did you index your integer field? Are you simply using the .toString() of an Integer? Or zero-padding the field somehow? You can use the .toString method, but you have to be sure that the sorting code does the right parsing of it - so you might need to specify SortField.INT as its type. It will do automatic detection if the type is not specified, but that assumes that the first document it encounters parses properly; otherwise it will fall back to using a String sort. 
Erik [...]
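Chuck's caveat is easy to see with plain floats: a score-then-field sort only consults the second key when two scores are bit-for-bit equal, and real relevance scores rarely are. A dependency-free illustration with made-up scores — the comparator mirrors the contract of a Sort over {FIELD_SCORE, pubmonth descending}:

```java
import java.util.Arrays;
import java.util.Comparator;

public class ScoreTieDemo {
    static class Hit {
        final float score;
        final int pubmonth;
        Hit(float score, int pubmonth) { this.score = score; this.pubmonth = pubmonth; }
        public String toString() { return score + "/" + pubmonth; }
    }

    // Score descending first; pubmonth descending only breaks exact ties.
    static void sortByScoreThenMonth(Hit[] hits) {
        Arrays.sort(hits, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                int byScore = Float.compare(b.score, a.score);
                return byScore != 0 ? byScore : (b.pubmonth - a.pubmonth);
            }
        });
    }

    public static void main(String[] args) {
        Hit[] hits = {
            new Hit(0.7071068f, 200401),
            new Hit(0.7071067f, 200412), // differs only in the 7th decimal place
            new Hit(0.7071068f, 200412),
        };
        sortByScoreThenMonth(hits);
        // Only the two bit-identical 0.7071068 scores get ordered by pubmonth;
        // the 0.7071067 hit sorts below both, even though it "looks" tied.
        System.out.println(Arrays.toString(hits));
    }
}
```

This is why people who want a meaningful secondary key sometimes round or bucket the score before sorting.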
RE: Sorting in Lucene.
Hi Chuck, Thanks a lot, this is really helpful. Thanks, Ramon -Original Message- From: Chuck Williams [mailto:[EMAIL PROTECTED] Sent: Thursday, November 04, 2004 10:05 PM To: Lucene Users List Subject: RE: Sorting in Lucene. [...]
INDEXREADER + DELETE + LUCENE1.4.1
Hi Guys, Apologies — there seems to be an unresolved bug [or I may be doing something wrong] in IndexReader.delete(int docNum). Here is the code:

    indexSearcher = null;
    indexDirectory = null;
    indexReader = null;
    indexDirectory = FSDirectory.getDirectory("/root/MERGEDINDEX/MERGER_1", false);
    indexReader = IndexReader.open(indexDirectory);
    IndexReader.unlock(indexDirectory);
    indexSearcher = new IndexSearcher(indexReader);
    query = new TermQuery(new Term(fieldName, FiledValue));
    hits = indexSearcher.search(query);
    if (hits.length() > 0) {
        for (int k = 0; k <= hits.length(); k++) {
            PRINTDBG_.append("QUERY : " + query.toString() + "\n"
                + "FIELD NAME : " + fieldName + "\n"
                + "FIELD VALUE: " + FiledValue + "\n"
                + "TOTAL HITS : " + hits.length() + "\n"
                + "DELETING : " + k);
            indexReader.delete(k);
        }
    }
    indexReader.close();
    indexSearcher.close();
    indexDirectory.close();
    System.out.println("Debugger : " + PRINTDBG_);
    indexReader = null;
    indexSearcher = null;
    indexDirectory = null;
    // optimization
    indexDirectory = FSDirectory.getDirectory(pathMergeIndex, false);
    IndexWriter writer = new IndexWriter(indexDirectory, analyzer, false);
    writer.mergeFactor = mergeFactorVal_;
    writer.maxMergeDocs = maxMergeDocsVal_;
    writer.optimize();
    writer.close();
    indexDirectory = null;
    writer = null;

In spite of using a new IndexReader for every deletion of documents and optimization, the indexReader.delete(k) call does not seem to work. Configuration history: a) 1 merged index = 1000 sub-indexes [fieldName = Keyword field type] b) OS: Windows c) AMD processor e) Lucene 1.4.1 f) JDK 1.4.2. Could somebody please suggest alternatives? WITH WARM REGARDS HAVE A NICE DAY [N.S.KARTHIK]
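The likely culprit in the code above: indexReader.delete(k) is handed the loop counter, but IndexReader.delete(int) expects a Lucene document number — hits.id(k) in this API — so the loop deletes documents 0, 1, 2, ... regardless of which documents actually matched (and iterating with k <= hits.length() also overruns by one). The mix-up can be reproduced without Lucene; in this sketch, positions in the hits array are confused with document numbers:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DeleteByIdDemo {
    // "Index": document numbers are positions in this list.
    // matchedDocNums: the document numbers a query matched (the Hits).
    static List<String> deleteMatches(List<String> docs, int[] matchedDocNums, boolean buggy) {
        boolean[] deleted = new boolean[docs.size()];
        for (int k = 0; k < matchedDocNums.length; k++) {
            // The bug: k is a position in the hit list, not a document number.
            int docNum = buggy ? k : matchedDocNums[k];
            deleted[docNum] = true;
        }
        List<String> remaining = new ArrayList<String>();
        for (int i = 0; i < docs.size(); i++)
            if (!deleted[i]) remaining.add(docs.get(i));
        return remaining;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("doc0", "doc1", "doc2", "doc3");
        int[] hits = {2, 3};                                  // the query matched doc2 and doc3
        System.out.println(deleteMatches(docs, hits, true));  // buggy: doc0/doc1 removed instead
        System.out.println(deleteMatches(docs, hits, false)); // fixed: doc2/doc3 removed
    }
}
```

In Lucene 1.4 terms the fix is deleting by hits.id(k) (or by Term, which skips the search entirely).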