Re: a query for a special AND?
On Thursday 20 September 2007 07:29, Mohammad Norouzi wrote:
> Sorry Paul, I just hurried in replying ;)
> I read the Lucene documentation on query syntax and figured out the
> difference, but my problem is different. It has preoccupied my mind and
> I am under pressure to solve it. After analyzing the results I get, I
> now think we need a "group by" in our query.
>
> Let me give an example: we need a list of patients that have been
> examined by certain services specified by the user, say service one and
> service two.
>
> In this case, here is the correct result:
> patient-id  service_name  patient_result
> 1           s1            12
> 1           s2            13
> 2           s1            41
> 2           s2            22
>
> But the following, for example, is incorrect because patient 1 has no
> service with name s2:
> patient-id  service_name  patient_result
> 1           s1            12
> 1           s3            13

That depends on what you put in your Lucene documents. You can only get
complete Lucene documents as query results. For the above example, a
patient with all its service names should be indexed in a single Lucene
doc.

The rows above suggest that the relation between patient and service
forms the relational result. However, for a text search engine it is
usual to denormalize the relational records into indexed documents,
depending on the required output.

Regards,
Paul Elschot

> On 9/20/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:
> > Hi Paul,
> > Would you tell me what the difference is between AND and + ?
> > I tried both but get different results:
> > with AND I get 1777 documents and with + I get nearly 25000.
> >
> > On 9/17/07, Paul Elschot <[EMAIL PROTECTED]> wrote:
> > > On Monday 17 September 2007 11:40, Mohammad Norouzi wrote:
> > > > Hi
> > > > I have a problem getting the correct result from Lucene. Consider
> > > > an index containing documents with fields "field1" and "field2".
> > > > I want documents whose field1 values are equal one by one and
> > > > whose field2 has two different values.
> > > >
> > > > To clarify, consider this query:
> > > > field1:val* (field2:"myValue1" XOR field2:"myValue2")
> > > >
> > > > I want this result:
> > > > field1  field2
> > > > val1    myValue1
> > > > val1    myValue2
> > > > val2    myValue1
> > > > val2    myValue2
> > > >
> > > > This result is not acceptable:
> > > > val3    myValue1
> > > > or
> > > > val4    myValue1
> > > > val4    myValue3
> > > >
> > > > I put XOR as the operator because this is not a typical OR; it
> > > > means documents that contain both myValue1 and myValue2 in the
> > > > field field2.
> > > >
> > > > How do I build a query to get such a result?
> > >
> > > Did you try this:
> > >
> > > +field1:val* +field2:"myValue1" +field2:"myValue2"
> > >
> > > Regards,
> > > Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
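Paul's denormalization advice can be sketched without Lucene at all: index one "document" per patient carrying all of its service names as a multi-valued field, and then a conjunctive query (the +service:s1 +service:s2 style he suggests) selects exactly the patients examined by both services. The following is a stdlib-only sketch of that idea; the class and method names are illustrative, not any Lucene API.

```java
import java.util.*;

// Stdlib-only sketch of the denormalization idea: one document per patient
// holds ALL of its service names, so requiring both terms (like
// +service:s1 +service:s2) matches only patients examined by both.
public class DenormalizedIndex {
    // patient-id -> multi-valued "service_name" field
    private final Map<String, Set<String>> docs = new HashMap<>();

    void add(String patientId, String serviceName) {
        docs.computeIfAbsent(patientId, k -> new HashSet<>()).add(serviceName);
    }

    // AND semantics over the multi-valued field
    List<String> searchAll(String... requiredServices) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : docs.entrySet()) {
            if (e.getValue().containsAll(Arrays.asList(requiredServices))) {
                hits.add(e.getKey());
            }
        }
        Collections.sort(hits);
        return hits;
    }

    public static void main(String[] args) {
        DenormalizedIndex idx = new DenormalizedIndex();
        idx.add("1", "s1"); idx.add("1", "s2");   // patient 1: s1 and s2
        idx.add("2", "s1"); idx.add("2", "s2");   // patient 2: s1 and s2
        idx.add("3", "s1"); idx.add("3", "s3");   // patient 3: s1 but not s2
        System.out.println(idx.searchAll("s1", "s2")); // [1, 2]
    }
}
```

With the per-service rows of the relational table folded into one document per patient, no "group by" is needed: the conjunctive query does the grouping implicitly.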
Re: a query for a special AND?
Well, do you mean we should separate documents just like relational tables
in databases? If yes, how do we make the relationship between those
documents?

Thank you so much Paul

On 9/20/07, Paul Elschot <[EMAIL PROTECTED]> wrote:
> That depends on what you put in your lucene documents.
> You can only get complete lucene documents as query results.
> For the above example a patient with all service names
> should be indexed in a single lucene doc.
>
> The rows above suggest that the relation between patient and
> service forms the relational result. However, for a text search
> engine it is usual to denormalize the relational records into
> indexed documents, depending on the required output.
>
> Regards,
> Paul Elschot

--
Regards,
Mohammad
Re: a query for a special AND?
On Thursday 20 September 2007 09:19, Mohammad Norouzi wrote:
> well, you mean we should separate documents just like relational tables
> in databases?

Quite the contrary, it's called _de_normalization. This means that the
documents in Lucene normally contain more information than is present in
a single relational entity.

> if yes, how to make the relationship between those documents

Lucene has no facilities to maintain relational relationships among its
documents. A Lucene index allows free-format documents, i.e. any document
may have any field or not. In practice you will need at least a primary
key, but even that you will need to program yourself.

Regards,
Paul Elschot
Multiple Indices vs Single Index
Hi,

I have about 40 indices which range in size from 10MB to 700MB. There are
quite a few stored fields. To get an idea of the document size, I have
about 400k documents in the 700MB index.

Depending on the query, I choose the index which needs to be searched.
Each query hits only one index. I was wondering if creating a single
index, where every document has the index name as a field, would be more
efficient. I created such an index and it was 3.4 GB in size. My initial
performance tests with it are not conclusive.

Also, what are the other points to be addressed while deciding between
1 index and 40 indices? I have 8GB RAM on the machine.

Thanks,
Nikhil
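The single-index layout Nikhil describes amounts to giving every document a category field (the former index name) and making each search require that field. The following stdlib-only sketch shows that shape; the class, the substring match, and the field names are illustrative stand-ins, not Lucene code (in Lucene this would be a required term on the category field AND-ed onto the user query).

```java
import java.util.*;

// Sketch of "one index, index name as a field": each document carries its
// category, and every search adds a mandatory category filter, so a query
// only ever sees one category's documents, as the GUI drop-down requires.
public class CategoryIndex {
    static class Doc {
        final String category, text;
        Doc(String category, String text) { this.category = category; this.text = text; }
    }

    private final List<Doc> docs = new ArrayList<>();

    void add(String category, String text) { docs.add(new Doc(category, text)); }

    // search restricted to a single category
    List<String> search(String category, String term) {
        List<String> hits = new ArrayList<>();
        for (Doc d : docs) {
            if (d.category.equals(category) && d.text.contains(term)) {
                hits.add(d.text);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        CategoryIndex idx = new CategoryIndex();
        idx.add("books", "lucene in action");
        idx.add("mail", "lucene query syntax question");
        System.out.println(idx.search("books", "lucene")); // [lucene in action]
    }
}
```

The trade-off the thread is weighing: one index means one set of Searchers and one merge/optimize cycle, at the cost of every query carrying the category filter and the 10MB categories sharing disk and cache with the 700MB one.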
Lucene multiple indexes
Hi People,

I am trying to get Lucene to work for a mail indexing solution.

Scenario: traffic into the index method is on average 250 mails, plus
their attachments, per minute. This volume has made me think of a
solution that splits the index on the domain name of the owner of the
message. So if I have, say, 100 users from 50 domains, I have 50 indexes.
In the query method of my program, I then have to load the indexes
according to the user who is logged on.

Is it a good idea to cache the Searcher objects once they are created?
Or is there a better approach to what I am trying to achieve?

Many thanks
d i n ok o r a h
Tel: +44 795 66 65 283
51°21'52"N 0°5' 14.16"
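Caching one searcher per domain usually comes down to a concurrent map keyed by domain name. Here is a minimal stdlib sketch of that; the Searcher class is a placeholder for whatever handle is expensive to open (such as an IndexSearcher over one of the 50 domain indexes), not Lucene's actual class.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a per-domain searcher cache: computeIfAbsent opens the
// (stand-in) searcher at most once per domain, even under concurrency.
public class SearcherCache {
    static class Searcher {
        final String domain;
        Searcher(String domain) { this.domain = domain; } // imagine: opens the index here
    }

    private final Map<String, Searcher> cache = new ConcurrentHashMap<>();

    Searcher forDomain(String domain) {
        return cache.computeIfAbsent(domain, Searcher::new);
    }

    public static void main(String[] args) {
        SearcherCache cache = new SearcherCache();
        Searcher a = cache.forDomain("example.com");
        Searcher b = cache.forDomain("example.com");
        System.out.println(a == b); // true: the cached instance is reused
    }
}
```

With 250 mails per minute being indexed, a real cache must also reopen or invalidate a cached searcher after the index changes, which is the harder half of the problem; that is what the reference-counting utilities discussed elsewhere on this list address.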
Question regarding proximity search
Hi,

I have a doubt about proximity search. Is the query "cat dog"~6 the same
as (cat dog)~6 ? I think both cases search for "cat" and "dog" within 6
words of each other, but I am getting different numbers of results for
the two queries; the count for the second one seems to be higher. Please
clarify this.

Thanks,
Sonu
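For reference, the intent behind a sloppy phrase like "cat dog"~6 (both terms present, with occurrences close together) can be sketched over plain token positions. This models only the intuitive "within N positions" idea; Lucene's actual slop is an edit-distance measure on term positions, so this is an approximation of the intent, not the exact semantics.

```java
import java.util.*;

// Sketch of the idea behind a sloppy phrase query: both terms must occur,
// with some pair of occurrences at most `slop` positions apart.
// (Approximation only: Lucene's real slop is edit-distance based.)
public class ProximityCheck {
    static boolean within(String text, String a, String b, int slop) {
        String[] tokens = text.toLowerCase().split("\\s+");
        List<Integer> posA = new ArrayList<>(), posB = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(a)) posA.add(i);
            if (tokens[i].equals(b)) posB.add(i);
        }
        for (int i : posA)
            for (int j : posB)
                if (Math.abs(i - j) <= slop) return true;
        return false;
    }

    public static void main(String[] args) {
        System.out.println(within("the cat sat near the sleeping dog", "cat", "dog", 6)); // true
        System.out.println(within("cat one two three four five six seven dog", "cat", "dog", 6)); // false
    }
}
```

The observed difference in hit counts is consistent with the two query strings not parsing to the same query: a quoted phrase with ~ is a proximity (sloppy phrase) query, whereas the parenthesized form is grouping, so the ~6 does not apply as phrase slop.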
Re: Oracle-Lucene integration (OJVMDirectory and Lucene Domain Index) - LONG
Hi Chris:

First, sorry for the delay :(

I have some preliminary performance tests using Oracle 11g running in a
VMware virtual machine with a 400Mb SGA (the virtual machine uses 812Mb
RAM for Oracle Enterprise Linux 4.0). This virtual machine is hosted on
modest hardware, a Pentium IV 2.18GHz with 2Gb RAM running Linux
Mandriva 2007.

Here are some results. Indexing the all_source system view took 23
minutes; all_source has 220731 rows with 50Mb of data. To be sure, this
text is not free text, because many rows contain wrapped code with
hexadecimal numbers.

Here are the table and the index:

SQL> desc test_source_big
 Name      Null?    Type
 --------  -------  --------------
 OWNER              VARCHAR2(30)
 NAME               VARCHAR2(30)
 TYPE               VARCHAR2(12)
 LINE               NUMBER
 TEXT               VARCHAR2(4000)

SQL> create index source_big_lidx on test_source_big(text)
  indextype is lucene.LuceneIndex
  parameters('Stemmer:English;MaxBufferedDocs:5000;DecimalFormat:;ExtraCols:line');

Index created.
Elapsed: 00:23:02.74

The index storage (45Mb, 220K Lucene docs) is:

 FILE_SIZE  NAME
 ---------  ------------
         9  parameters
         2  updateCount
        20  segments.gen
  45941031  _1d.cfs
        42  segments_2t

A query like this:

select /*+ FIRST_ROWS(10) */ lscore(1) from test_source_big
 where lcontains(text,'"procedure java"~10',1)>0 order by lscore(1) desc;

took 11ms, and it will be faster if you don't need the lscore(1) value.
Here is another example:

select /*+ FIRST_ROWS(10) DOMAIN_INDEX_SORT */ lscore(1) from test_source_big
 where lcontains(text,'(optimize OR sync) AND "LANGUAGE JAVA"',1)>0
 order by lscore(1) asc;

It took 7ms.

But there are other benefits related to the Domain Index implementation
using the Data Cartridge API:

- Any modification on the table is notified to Lucene automatically; you
  can apply the modifications online or deferred, except for deletions,
  which are always synced.
- The execution plan is calculated by the optimizer using the domain
  index, and with the latest additions (User Data Store) you can use
  Lucene to reduce how many rows the database will process, using
  multiple columns in the lcontains operator.

For example, this query uses Lucene to search free text in the TEXT
column and Oracle's filter reduction on the LINE column:

SQL> select count(text) from test_source_big
  where lcontains(text,'function')>0 and line>=6000;

COUNT(TEXT)
-----------
          2

Elapsed: 00:00:00.74

PLAN_TABLE_OUTPUT
Plan hash value: 2350958379

| Id | Operation                    | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
|  0 | SELECT STATEMENT             |                 |     1 |  2027 |  2968   (1)| 00:00:36 |
|  1 |  SORT AGGREGATE              |                 |     1 |  2027 |            |          |
|* 2 |  TABLE ACCESS BY INDEX ROWID | TEST_SOURCE_BIG |     7 | 14189 |  2968   (1)| 00:00:36 |
|* 3 |   DOMAIN INDEX               | SOURCE_BIG_LIDX |       |       |            |          |

Predicate Information (identified by operation id):
   2 - filter("LINE">=6000)
   3 - access("LUCENE"."LCONTAINS"("TEXT",'function')>0)

But if you use Lucene to reduce the number of rows visited by Oracle, by
using the User Data Store to index the LINE column too, you can perform a
query like this:

SQL> select count(text) from test_source_big
  where lcontains(text,'function AND line:[6000 TO 7000]')>0;

COUNT(TEXT)
-----------
          2

Elapsed: 00:00:00.05

PLAN_TABLE_OUTPUT
Plan hash value: 2350958379

| Id | Operation                    | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
|  0 | SELECT STATEMENT             |                 |     1 |  2014 |  2968   (1)| 00:00:36 |
|  1 |  SORT AGGREGATE              |                 |     1 |  2014 |            |          |
|  2 |  TABLE ACCESS BY INDEX ROWID | TEST_SOURCE_BIG | 11587 |   22M |  2968   (1)| 00:00:36 |
|* 3 |   DOMAIN INDEX               | SOURCE_BIG_LIDX |       |       |            |          |
Re: Multiple Indices vs Single Index
If I understand correctly, you want to do a two-stage retrieval, right?
That is, look up in the initial index (3.4 GB) and then do a second
search on the sub-index? Presumably, you have to manage the Searchers,
etc. for each of the sub-indexes as well as the big index. This means you
have to go through the hits from the first search, then route, etc.,
correct?

Have you tried creating one single index with all the (stored) fields,
etc.? The worst-case scenario, assuming 1GB per index, is a 40GB index,
but my guess is index compression will reduce it more. Since you are well
below that anyway, have you tried just the straightforward solution? Or
do you have other requirements that force the sub-index solution? I am
not sure it will work, but it seems worth a try. Of course, this also
depends on how much you expect your indexes to grow.

Also, what was inconclusive about your tests? Maybe you can describe more
of what you have tried to date?

Cheers,
Grant

On Sep 20, 2007, at 3:50 AM, Nikhil Chhaochharia wrote:
> I have about 40 indices which range in size from 10MB to 700MB.

--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Multiple Indices vs Single Index
I am sorry, it seems that I was not clear about what my problem is. I
will try to describe it again.

My data is divided into 40 categories, and at any one time only one
category can be searched. The GUI for the system asks the user to select
the category from a drop-down.

Currently, I have a separate index for every category. The index sizes
vary: one category index is 10MB and another is 700MB; the other index
sizes are somewhere in between.

I was wondering if it would be better to just have 1 large index with all
the 40 indices combined. I do not need to do dual queries, and my total
index size (if I create a single index) is about 3.4GB. It will increase
to a maximum of 5-6 GB. I am running this on a dedicated machine with 8GB
RAM.

Unfortunately I do not have enough hardware to run both in parallel and
test properly; I have just one server, which is being used by live users.
So it would be great if you could tell me whether I should stick with my
40 indices or combine them into 1 index. What are the pros and cons of
each approach?

Thanks,
Nikhil
Re: thread safe shared IndexSearcher
Mark,

Thanks for sharing your valuable experience and thoughts. Frankly, our
system already has most of the functionality LuceneIndexAccessor offers.
The only thing I am looking for is to sync the searchers' close; that is
why I am a little worried about the way the accessor handles the searcher
sync. I will probably give it a try to see how it performs in our system.

Thanks!
Jay

Mark Miller wrote:
> The method is synched, but this is because each thread *does* share the
> same Searcher. To maintain a cache of searchers across multiple
> threads, you've got to sync -- to reference count, you've got to sync.
> The performance hit of LuceneIndexAccessor is pretty minimal for its
> functionality, and frankly, for the functionality you want, you have to
> pay a cost. That's not even the end of it really... you're going to
> need to maintain a cache of Accessor objects for each index as well,
> and if you don't know all the indexes at startup time, access to this
> cache will also need to be synched.
>
> I wouldn't worry though -- searches are still lightning fast... that
> won't be the bottleneck. I'll work on getting you some code, but if
> you're worried, try some benchmarking on the original code.
>
> Also, to be clear (I don't have the code in front of me): getting a
> Searcher does not require waiting for a Writer to be released.
> Searchers are cached and reused (and instantly available) until a
> Writer is released. When this happens, the release-Writer method waits
> for all the Searchers to return (which happens pretty quickly, as
> searches are quick), the Searcher cache is cleared, and subsequent
> calls to getSearcher create new Searchers that can see what the Writer
> added. The key is to use your Writer/Searcher/Reader quickly and then
> release it (unless you are bulk loading). I've had such a system with
> 5+ million docs on a standard machine, and searches were still well
> below a second after the first Searcher is cached (and even the first
> search is darn quick). And that includes a lot of extra crap I am
> doing.
>
> - Mark
>
> Jay Yu wrote:
> > Mark,
> > After reading the implementation of
> > LuceneIndexAccessor.getSearcher(), I realized that the method is
> > synchronized and waits for the writingDirector to be released. That
> > means that if we getSearcher for each query in each thread, there
> > might be contention and a performance hit. In fact, even the
> > release(searcher) method is costly. On the other hand, if multiple
> > threads share one searcher, that would defeat the purpose of using
> > LuceneIndexAccessor. Do I miss something here? What is your suggested
> > use case for LuceneIndexAccessor?
> > Thanks!
> > Jay
> >
> > Mark Miller wrote:
> > > I'll respond a point at a time:
> > >
> > > 1. "Hi Maik, So what happens in this case:
> > >
> > >   IndexAccessProvider accessProvider =
> > >       new IndexAccessProvider(directory, analyzer);
> > >   LuceneIndexAccessor accessor =
> > >       new LuceneIndexAccessor(accessProvider);
> > >   accessor.open();
> > >   IndexWriter writer = accessor.getWriter();
> > >   // reference to the same instance?
> > >   IndexWriter writer2 = accessor.getWriter();
> > >   writer.addDocument();
> > >   writer2.addDocument();
> > >   // I didn't release the writer yet -- will this block?
> > >   IndexReader reader = accessor.getReader();
> > >   reader.delete();"
> > >
> > > This is not really an issue. First, if you are going to delete with
> > > a Reader you need to call getWritingReader and not getReader. When
> > > you do that, the getWritingReader call will block until writer and
> > > writer2 are released. If you are just adding a couple of docs
> > > before releasing the writers, this is no problem because the block
> > > will be very short. If you are loading tons of docs and you want to
> > > be able to delete with a Reader in a timely manner, you should
> > > release the writers every now and then (release and re-get the
> > > Writer every 100 docs or so). An interactive index should not hog
> > > the Writer, while something that is just loading a lot can hog the
> > > Writer. This is no different than normal: you cannot delete with a
> > > Reader while adding with a Writer in Lucene. This code just
> > > enforces those semantics. The best solution is to just use a Writer
> > > to delete -- I never get a ReadingWriter.
> > >
> > > 2. http://issues.apache.org/bugzilla/show_bug.cgi?id=34995#c3
> > > This is no big deal either. I just added another getWriter call
> > > that takes a create boolean.
> > >
> > > 3. I don't think there is a latest release. This has never gotten
> > > much official attention and is not in the sandbox. I worked
> > > straight from the originally submitted code.
> > >
> > > 4. I will look into getting together some code that I can share.
> > > The MultiSearcher changes that are needed are a couple of
> > > one-liners really, so at a minimum I will give you the changes
> > > needed.
> > >
> > > - Mark
> > >
> > > On 9/19/07, Jay Yu <[EMAIL PROTECTED]> wrote:
> > > > Mark, thanks for sharing your insight and experience about
> > > > LuceneIndexAccessor! I remember seeing some people reporting some
> > > > issues about it, such as:
> > > > http://www.archivum.info/[EMAIL PROTECTED]/2005-05/msg00114.html
> > > > http://issues.apache.org/bugzilla/show_bug.cgi?i
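The reference-counting scheme Mark describes (many threads share one cached searcher; releasing a writer waits until outstanding searchers are returned, then refreshes the cache) can be sketched in plain Java. All names here are illustrative, not the LuceneIndexAccessor API, and the stand-in Searcher carries no real index.

```java
// Stdlib sketch of the reference-counting idea behind LuceneIndexAccessor:
// threads share one cached searcher; after the index changes, a writer-release
// step waits for all acquired searchers to drain, then drops the cache so the
// next acquire opens a fresh searcher. Names are illustrative, not the real API.
public class RefCountedSearcher {
    static class Searcher { final int version; Searcher(int v) { version = v; } }

    private Searcher current;
    private int inUse = 0;   // searchers handed out and not yet released
    private int version = 0; // bumped whenever the index changes

    synchronized Searcher acquire() {
        if (current == null) current = new Searcher(version); // (re)open lazily
        inUse++;
        return current;
    }

    synchronized void release(Searcher s) {
        inUse--;
        notifyAll(); // wake a writer-release waiting for searchers to drain
    }

    // called when a writer is released: wait for outstanding searchers,
    // then drop the cache so subsequent acquires see the new index state
    synchronized void indexChanged() {
        while (inUse > 0) {
            try { wait(); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); return; }
        }
        version++;
        current = null;
    }

    public static void main(String[] args) {
        RefCountedSearcher rc = new RefCountedSearcher();
        Searcher s = rc.acquire();
        System.out.println(s == rc.acquire()); // true: cached instance is shared
        rc.release(s); rc.release(s);
        rc.indexChanged();
        System.out.println(s == rc.acquire()); // false: refreshed after the change
    }
}
```

This also shows why Mark's "use it quickly and release it" advice matters: indexChanged() blocks for exactly as long as the slowest outstanding search, so short-lived acquisitions keep writer releases fast.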
Re: Multiple Indices vs Single Index
OK, I thought you meant your index would have in it the name of the second index and would thus do a two-stage retrieval. At any rate, if you are saying your combined index with all the stored fields is ~3.4 GB I would think it would fit reasonably on the machine you have and perform reasonably. Naturally, this depends on your application, your users, etc. and I can't make any guarantees, but I certainly recall others managing this size just fine. See the many tips on improving searching and indexing on the Wiki (link at bottom in my signature) and do some profiling/testing. When you said your tests were inconclusive, what tests have you done? If you can, run the tests in a profiler to see where your bottlenecks are. -Grant On Sep 20, 2007, at 11:16 AM, Nikhil Chhaochharia wrote: I am sorry, it seems that I was not clear with what my problem is. I will try to describe it again. My data is divided into 40 categories and at one time only one category can be searched. The GUI for the system will ask the user to select the category from a drop-down. Currently, I have a separate index for every category. The index sizes varies - one category index is 10MB and another is 700MB. Other index-sizes are somewhere in between. I was wondering if it will be better to just have 1 large index with all the 40 indices combined. I do not need to do dual-queries and my total index size (if I create a single index) is about 3.4GB. It will increase to maximum of 5-6 GB. I am running this on a dedicated machine with 8GB RAM. Unfortunately I do not have enough hardware to run both in parallel and test properly. Have just one server which is being used by live users. So it would be great if you could tell me whether I should stick with my 40 indices or combine them into 1 index. What are the pros and cons of each approach ? 
Thanks, Nikhil - Original Message From: Grant Ingersoll <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Thursday, 20 September, 2007 7:57:21 PM Subject: Re: Multiple Indices vs Single Index If I understand correctly, you want to do a two stage retrieval right? That is, look up in the initial index (3.4 GB) and then do a second search on the sub index? Presumably, you have to manage the Searchers, etc. for each of the sub-indexes as well as the big index. This means you have to go through the hits from the first search, then route, etc. correct? Have you tried creating one single index with all the (stored) fields, etc? Worst case scenario, assuming 1GB per index, is you would have a 40GB index, but my guess is index compression will reduce it more. Since you are less than that anyway, have you tried just the straightforward solution? Or do you have other requirements that force the sub-index solution? Also, I am not sure it will work, but it seems worth a try. Of course, this also depends on how much you expect your indexes to grow. Also, what was inconclusive about your tests? Maybe you can describe more what you have tried to date? Cheers, Grant On Sep 20, 2007, at 3:50 AM, Nikhil Chhaochharia wrote: Hi, I have about 40 indices which range in size from 10MB to 700MB. There are quite a few stored fields. To get an idea of the document size, I have about 400k documents in the 700MB index. Depending on the query, I choose the index which needs to be searched. Each query hits only one index. I was wondering if creating a single index where every document will have the indexname as a field will be more efficient. I created such an index and it was 3.4 GB in size. My initial performance tests with it are not conclusive. Also, what are the other points to be addressed while deciding between 1 index and 40 indices. I have 8GB RAM on the machine. 
Thanks, Nikhil

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]

--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
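Grant's suggestion -- one combined index where each document carries its category ("indexname") as a field, and each query filters on it -- can be pictured with a toy example. The class and method names below are invented for illustration; in Lucene itself the category clause would be a required TermQuery on that field rather than a plain Java filter.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * The single-index idea in miniature: every document carries its
 * category as a field, and each search filters on it. This is only
 * an illustration of the concept, not Lucene code.
 */
public class CategoryFilterDemo {
    static class Doc {
        final String category;
        final String text;
        Doc(String category, String text) { this.category = category; this.text = text; }
    }

    static List<Doc> search(List<Doc> index, String category, String term) {
        List<Doc> hits = new ArrayList<Doc>();
        for (Doc d : index) {
            // the category clause: a hit must come from the chosen set
            if (d.category.equals(category) && d.text.contains(term)) {
                hits.add(d);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> index = new ArrayList<Doc>();
        index.add(new Doc("news", "albino elephant sighted"));
        index.add(new Doc("science", "albino gene expression"));
        System.out.println(search(index, "news", "albino").size());
    }
}
```

In Lucene the equivalent would be a BooleanQuery with a required clause on the category field, so a single combined index still behaves like 40 logical ones at query time.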
Re: thread safe shared IndexSearcher
Good luck Jay. Keep in mind, pretty much all LuceneIndexAccessor does is sync Readers with Writers and allow multiple threads to share the same instances of them -- nothing more. The code just forces Readers to refresh when Writers are used to change the index. There really isn't any functionality beyond that offered. Since you want to have a multi-threaded system access the same resources (which occasionally need to be refreshed), it's not too easy to get around a synchronized block. If I am able to extract some usable code for you soon I will let you know.

- Mark

Jay Yu wrote:

Mark, Thanks for sharing your valuable experience and thoughts. Frankly, our system already has most of the functionality LuceneIndexAccessor offers. The only thing I am looking for is to sync the searchers' close. That's why I am a little worried about the way the accessor handles the searcher sync. I will probably give it a try to see how it performs in our system. Thanks! Jay

Mark Miller wrote:

The method is synched, but this is because each thread *does* share the same Searcher. To maintain a cache of searchers across multiple threads, you've got to sync -- to reference count, you've got to sync. The performance hit of LuceneIndexAccessor is pretty minimal for its functionality, and frankly, for the functionality you want, you have to pay a cost. That's not even the end of it really... you're going to need to maintain a cache of Accessor objects for each index as well... and if you don't know all the indexes at startup time, access to this will also need to be synched. I wouldn't worry though -- searches are still lightning fast... that won't be the bottleneck. I'll work on getting you some code, but if you're worried, try some benchmarking on the original code. Also, to be clear, I don't have the code in front of me, but getting a Searcher does not require waiting for a Writer to be released. Searchers are cached and reused (and instantly available) until a Writer is released.
When this happens, the release Writer method waits for all the Searchers to return (happens pretty quick, as searches are pretty quick), the Searcher cache is cleared, and then subsequent calls to getSearcher create new Searchers that can see what the Writer added. The key is to use your Writer/Searcher/Reader quickly and then release it (unless you're bulk loading). I've had such a system with 5+ million docs on a standard machine, and searches were still well below a second after the first Searcher is cached (and even the first search is darn quick). And that includes a lot of extra crap I am doing.

- Mark

Jay Yu wrote:

Mark, After reading the implementation of LuceneIndexAccessor.getSearcher(), I realized that the method is synchronized and waits for the writingDirector to be released. That means if we getSearcher for each query in each thread, there might be contention and a performance hit. In fact, even the release(searcher) method is costly. On the other hand, if multiple threads share one searcher, then it'd defeat the purpose of using LuceneIndexAccessor. Do I miss something here? What's your suggested use case for LuceneIndexAccessor? Thanks! Jay

Mark Miller wrote:

I'll respond a point at a time:

1. Hi Maik, So what happens in this case:

IndexAccessProvider accessProvider = new IndexAccessProvider(directory, analyzer);
LuceneIndexAccessor accessor = new LuceneIndexAccessor(accessProvider);
accessor.open();
IndexWriter writer = accessor.getWriter();
// reference to the same instance?
IndexWriter writer2 = accessor.getWriter();
writer.addDocument();
writer2.addDocument();
// I didn't release the writer yet
// will this block?
IndexReader reader = accessor.getReader();
reader.delete();

This is not really an issue. First, if you are going to delete with a Reader you need to call getWritingReader and not getReader. When you do that, the getWritingReader call will block until writer and writer2 are released.
If you are just adding a couple docs before releasing the writers, this is no problem because the block will be very short. If you are loading tons of docs and you want to be able to delete with a Reader in a timely manner, you should release the writers every now and then (release and re-get the Writer every 100 docs or something). An interactive index should not hog the Writer, while something that is just loading a lot could hog the Writer. This is no different than normal… you cannot delete with a Reader while adding with a Writer in Lucene. This code just enforces those semantics. The best solution is to just use a Writer to delete – I never get a ReadingWriter.

2. http://issues.apache.org/bugzilla/show_bug.cgi?id=34995#c3

This is no big deal either. I just added another getWriter call that takes a create boolean.

3. I don't think there is a latest release. This has never gotten much official attention and is not in th
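The reference-counting scheme Mark describes -- shared cached Searchers, and a Writer release that waits for outstanding searchers to drain before invalidating the cache -- can be sketched in plain Java. This is only an illustration of the pattern; the class and method names are invented and this is not the actual LuceneIndexAccessor source.

```java
// Toy stand-in for an IndexSearcher; the real class comes from Lucene.
class FakeSearcher {
    final int version;
    FakeSearcher(int version) { this.version = version; }
}

/**
 * A minimal sketch of the reference-counting pattern discussed above:
 * getSearcher hands out one shared instance and bumps a count, release
 * decrements it, and releasing a writer waits for the count to reach
 * zero before dropping the cache so the next caller sees a fresh index.
 */
class SearcherAccessor {
    private FakeSearcher cached;
    private int refCount = 0;
    private int indexVersion = 0;

    // Hand out the shared searcher, creating it lazily.
    synchronized FakeSearcher getSearcher() {
        if (cached == null) {
            cached = new FakeSearcher(indexVersion);
        }
        refCount++;
        return cached;
    }

    synchronized void release(FakeSearcher s) {
        refCount--;
        notifyAll(); // wake a writer waiting for searchers to drain
    }

    // Called when a writer is released: wait for outstanding searchers,
    // then clear the cache so getSearcher() creates new Searchers that
    // can see what the Writer added.
    synchronized void writerReleased() {
        indexVersion++;
        while (refCount > 0) {
            try {
                wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return; // sketch: give up on invalidation if interrupted
            }
        }
        cached = null;
    }
}
```

This is why getting a Searcher is normally instant (the cached instance is returned) and why hogging a Writer delays only the refresh, not searches against the current cached view.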
Re: thread safe shared IndexSearcher
Mark Miller wrote: Good luck Jay. Keep in mind, pretty much all LuceneIndexAccessor does is sync Readers with Writers and allow multiple threads to share the same instances of them -- nothing more. The code just forces Readers to refresh when Writers are used to change the index. There really isn't any functionality beyond that offered. Since you want to have a multi-thread system access the same resources (which occasionally need to be refreshed) its not too easy to get around a synchronized block. If I am able to extract some usable code for you soon I will let you know.

I will appreciate it! Thanks for your help!

Jay
Re: Multiple Indices vs Single Index
OK, thanks. I actually have both systems implemented. The multi-index one is being used currently and it works well. I have deployed the single index solution a few times during off-peak hours and the response time has been almost the same as the multi-index solution. I tried to simulate some load but again my numbers were mostly similar for both cases.

I have already done all the suggested optimizations since I first ran into problems a few months ago. The performance had improved considerably. Since then, my traffic has increased and I have again started facing some issues during peak-load hours.

I guess I should get another box and run proper tests there. Will run a profiler also. Thanks for all the suggestions.

Regards, Nikhil
Re: Multiple Indices vs Single Index
If the current version is working well, what is the reason to move? Is it just to make management of the indices easier?
highlighting and fragments
Hello Folks,

I wanted to stay away from storing text in the indexes in order to keep them smaller. I have a requirement now, though, to provide highlighting and, more so, fragments of the content so they will be displayed on the UI. Do you all prefer to store the text in the index to make this easier, or would you suggest retrieving the text from the source after doing your search? From what I can tell, you need to run through the Hits anyway. I am trying to keep the indexes as small as possible (they are still HUGE... but...), so storing fields is not really what I want to do. I will if it is the best and most efficient way to do so.

Thanks, Michael
Re: highlighting and fragments
Lucene's storing functionality is just a simple storage mechanism. You can certainly and easily use your own storage mechanism. When you get your user-created id back from Lucene for a hit, just pass that id to your storage system to get the original text and then feed that to the Highlighter. Your storage system/code might be slower than Lucene, but I don't believe there is anything about Lucene's system that would give it an inside advantage.

- Mark
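The lookup-then-highlight flow Mark describes can be sketched like this. The storage map and the tag-wrapping "highlighter" are toy stand-ins invented for illustration; a real application would query its database or filesystem by the stored id and pass the text to Lucene's contrib Highlighter.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of keeping full text outside the index: after a hit, look the
 * document up by the application's own id, then highlight the fetched
 * text. The highlight() here just wraps the term in <b> tags.
 */
public class ExternalHighlight {
    // Stand-in for your own storage system (database, filesystem, ...).
    static final Map<String, String> store = new HashMap<String, String>();

    // Toy fragment highlighter; a real app would use Lucene's Highlighter.
    static String highlight(String text, String term) {
        return text.replace(term, "<b>" + term + "</b>");
    }

    public static void main(String[] args) {
        store.put("doc42", "the quick brown fox jumps over the lazy dog");
        // Pretend a Lucene search returned the stored id "doc42".
        String original = store.get("doc42");
        System.out.println(highlight(original, "fox"));
    }
}
```

The index then only needs to store the id field, keeping it small, at the cost of one extra lookup per displayed hit.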
Re: Question regarding proximity search
: Is the query "cat dog"~6 same as (cat dog)~6 ?
: I think both cases will search for "cat" and "dog" within 6 words of each other.
: But I am getting different number of results for the above queries. The
: second one may be the higher. Please clarify this.

i don't believe (cat dog)~6 is even a legal query in the Lucene QueryParser syntax ... it isn't documented, and it doesn't work in Lucene 2.2.

-Hoss
Re: Multiple Indices vs Single Index
: I was wondering if it will be better to just have 1 large index with all
: the 40 indices combined. I do not need to do dual-queries and my total
: index size (if I create a single index) is about 3.4GB. It will
: increase to maximum of 5-6 GB. I am running this on a dedicated machine
: with 8GB RAM.

off the top of my head, there are 3 main reasons i can think of that would motivate one choice over another -- ultimately it's up to you...

1) FieldCache and sorting ... if all 40 sets of Documents contain consistently named fields, then there won't be much difference between 40 indexes and 1 index ... but if each of those 40 sets contains documents with radically different fields -- and you want to sort on N different fields for each set -- then the total FieldCache sizes for the 40 indexes will be smaller than the FieldCaches for one giant index (because in the single index every document gets an entry whether it makes sense or not).

2) idf statistics. if you have common fields you search regardless of document set, the 40-index approach will maintain separate statistics -- this may be important if some terms are very common in only some docsets. the word "albino" may be really common in docset A but only one doc in docset B has it ... in the 40-index approach, querying B for (albino elephant) will give a lot of weight to albino because it's so rare, but in the single-index case albino may not be considered as significant because of the unified idf value for all docsets (even if the query is constrained to docset B) ... again: this only matters if the fields overlap; if every docset has a unique set of fields then the idfs will be unique because they are computed per field.

3) management: it's probably a lot simpler to maintain and manage code that deals with one index than code that deals with 40 indexes. your mileage may vary.

-Hoss
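Hoss's idf point can be made concrete with a quick calculation. The formula below follows Lucene's DefaultSimilarity (idf = 1 + ln(numDocs / (docFreq + 1))); the document counts are invented for the albino example.

```java
/**
 * Worked example of how idf differs between per-category indexes and
 * one combined index. Suppose "albino" appears in 200 of 1000 docs in
 * docset A but only 1 of 1000 docs in docset B.
 */
public class IdfExample {
    static double idf(int numDocs, int docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        // Separate indexes: "albino" is rare in B, so it scores high there.
        System.out.println("idf in B alone:  " + idf(1000, 1));
        // ...and common in A, so it scores low there.
        System.out.println("idf in A alone:  " + idf(1000, 200));
        // Combined index: statistics are pooled, so a query restricted
        // to docset B still sees the diluted idf (2000 docs, df 201).
        System.out.println("idf in combined: " + idf(2000, 201));
    }
}
```

The combined-index value lands between the two per-index values, which is exactly the dilution Hoss warns about when the query is constrained to docset B.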
Re: Question regarding proximity search
Thanks Hoss, for the reply. I am using Lucene 2.1. I checked the Lucene-converted syntax (using Query.toString()) in both cases and found the second one actually not converting to a proximity query. "cat dog"~6 is converted to ABST:"cat dog"~4 and (cat dog)~6 is converted to +ABST:cat +ABST:dog. That is, the proximity operator is discarded in the second case.
Re: Question regarding proximity search
: I checked the lucene converted syntax (using Query.toString()) in both cases
: and found the second one actually not converting to proximity query.

I don't think you understood what I was trying to say... using parens with a "~" character after it is not currently, and has never been (to my knowledge), a means of creating a "proximity query". It is not documented in 2.2, 2.1, 2.0, 1.9, or 1.4.3. It is not legal syntax in 2.2 (it causes a parse exception).

In Lucene, the way to do proximity-based queries is either with SpanNearQueries or with PhraseQueries -- the way to create a PhraseQuery using the Lucene QueryParser is with the quote character '"'. There is no reason why you should expect (cat dog)~3 to create a proximity query.

-Hoss
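To see what the slop in "cat dog"~n buys you, here is a toy position check. It only approximates PhraseQuery slop semantics (real slop is an edit distance over term positions, built via PhraseQuery.setSlop or a SpanNearQuery); the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Rough illustration of a slop-based proximity match: two terms match
 * when the number of tokens between their positions is within the slop.
 */
public class ProximityDemo {
    static boolean withinSlop(String text, String a, String b, int slop) {
        List<String> tokens = Arrays.asList(text.toLowerCase().split("\\s+"));
        List<Integer> posA = new ArrayList<Integer>();
        List<Integer> posB = new ArrayList<Integer>();
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).equals(a)) posA.add(i);
            if (tokens.get(i).equals(b)) posB.add(i);
        }
        // match if any pair of occurrences has at most `slop` tokens between
        for (int i : posA)
            for (int j : posB)
                if (Math.abs(i - j) - 1 <= slop) return true;
        return false;
    }

    public static void main(String[] args) {
        String text = "the cat sat near the dog";
        System.out.println(withinSlop(text, "cat", "dog", 6)); // 3 words apart
        System.out.println(withinSlop(text, "cat", "dog", 2));
    }
}
```

A quoted phrase with ~n in the QueryParser builds this kind of constraint; bare parens with ~n do not, which is why the second query above degenerated into two independent term clauses.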
Re: Multiple Indices vs Single Index
Thanks Grant and Chris for the replies. I am looking at a single index because the 40-index system has started having performance issues at high load. My daily traffic is increasing at a steady pace, about 40% of the traffic is concentrated in a 2-hour period, and searches start slowing down just a little bit.

All my indices have exactly the same fields, so I guess FieldCache and sorting are not an issue. I do not do any updates, so I just open 40 searchers and store them in a HashMap. When a query comes, I select the appropriate searcher and fire the query. So code maintenance also is not an issue. The different idf statistics are a valid point. I have not analysed the difference in the results returned - I sort of assumed the results will be the same in both cases. I will look at the difference in results.

There is one more point. Sooner or later, I will have to move to multiple servers due to increasing traffic. (I will be handling 30,000+ hits per hour by the end of the year.) One option is to have the full index on all servers and load-balance them. Another is to have half the indices on one server and half of them on the other. The front-end (a separate server) will then fire the query on the appropriate server. Any suggestions on which one would be a better choice? All data on all servers will give me redundancy; the system will be up even if one server goes down. Also, adding more servers would be trivial.

Thanks, Nikhil
About the search efficiency based on document's length
Hi everyone,

There is a question about document length and search efficiency. Think of this situation: there are two ways to index some HTML pages (ignoring some information): one is to both store and index the HTML content in the Lucene directory; the other is to just index the content. For the first method, is there an efficiency problem compared to the second, besides the folder size increase?

Thanks, Jarvis
Re: About the search efficiency based on document's length
On 21 Sep 2007, at 08:23, Jarvis wrote:

There is a question about document length and search efficiency. Two ways to index some HTML pages (ignoring some information): one is to both store and index the HTML content in the Lucene directory; the other is to just index the content. For the first method, is there an efficiency problem compared to the second, besides the folder size increase?

Not sure I understand your question, but I'll give it a go. As far as I know, storing data in a document will not affect search speed. However, loading large amounts of data into a Document will of course consume resources. Therefore it is possible to pass a FieldSelector to the IndexReader when you retrieve a Document, allowing you to define which fields to ignore, load, lazy-load, etc. I hope this helps.

-- karl
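The lazy-load behaviour Karl mentions can be pictured without Lucene at all: the point of FieldSelector's lazy option is that a field's bytes are only pulled from the index the first time the value is requested. Below is a toy version of that contract; the names (Loader, LazyField, stringValue) are invented for illustration, though stringValue mirrors the accessor on Lucene's Field class.

```java
/**
 * Toy sketch of lazy field loading: the expensive value is produced
 * only on first access, so retrieving a Document stays cheap when a
 * large stored field is never read.
 */
public class LazyFieldDemo {
    interface Loader { String load(); }

    static class LazyField {
        private final Loader loader;
        private String value;
        private boolean loaded = false;

        LazyField(Loader loader) { this.loader = loader; }

        String stringValue() {
            if (!loaded) {          // pay the cost only on first access
                value = loader.load();
                loaded = true;
            }
            return value;
        }
    }

    public static void main(String[] args) {
        final int[] loads = {0};
        LazyField body = new LazyField(new Loader() {
            public String load() { loads[0]++; return "large stored text"; }
        });
        System.out.println("loads before access: " + loads[0]);
        System.out.println(body.stringValue());   // triggers the one load
        body.stringValue();                       // cached; no second load
        System.out.println("loads after access:  " + loads[0]);
    }
}
```

With Lucene's real API the same effect comes from `IndexReader.document(docId, fieldSelector)` returning fields whose contents are fetched on demand.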