Re: a query for a special AND?

2007-09-20 Thread Paul Elschot
On Thursday 20 September 2007 07:29, Mohammad Norouzi wrote:
> Sorry Paul, I just hurried in replying ;)
> I read the Lucene documentation about query syntax and figured out what
> the difference is, but my problem is different. It has preoccupied my mind
> and I am under pressure to solve it; after analyzing the results I get, I
> now think we need a "group by" in our query.
> 
> let me give you an example: we need a list of patients that have been
> examined by certain services specified by the user, say service one and
> service two.
> 
> in this case here is the correct result:
> patient-id   service_name   patient_result
> 1            s1             12
> 1            s2             13
> 2            s1             41
> 2            s2             22
> 
> but, for example, the following is incorrect because patient 1 has no service
> named service2:
> patient-id   service_name   patient_result
> 1            s1             12
> 1            s3             13

That depends on what you put in your Lucene documents.
You can only get complete Lucene documents as query results.
For the above example, a patient with all its service names
should be indexed in a single Lucene doc.

The rows above suggest that the relation between patient and
service forms the relational result. However, for a text search
engine it is usual to denormalize the relational records into the
indexed documents, depending on the required output.
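
As a rough sketch (the field names beyond your example's patient-id,
service_name and result are just illustrative, and this assumes the Lucene 2.x
Field API with an open IndexWriter "writer"), one denormalized document per
patient could look like:

  Document doc = new Document();
  doc.add(new Field("patient_id", "1", Field.Store.YES, Field.Index.UN_TOKENIZED));
  // one service_name value per service the patient was examined by
  doc.add(new Field("service_name", "s1", Field.Store.YES, Field.Index.UN_TOKENIZED));
  doc.add(new Field("service_name", "s2", Field.Store.YES, Field.Index.UN_TOKENIZED));
  // results stored for display only, not searched
  doc.add(new Field("result_s1", "12", Field.Store.YES, Field.Index.NO));
  doc.add(new Field("result_s2", "13", Field.Store.YES, Field.Index.NO));
  writer.addDocument(doc);

A query like +service_name:s1 +service_name:s2 then matches only patients
that were examined by both services.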

Regards,
Paul Elschot



> 
> 
> 
> On 9/20/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:
> >
> > Hi Paul,
> > would you tell me what is the difference between AND and + ?
> > I tried both but get different result
> > with AND I get 1777 documents and with + I get nearly 25000 ?
> >
> >
> > On 9/17/07, Paul Elschot <[EMAIL PROTECTED]> wrote:
> > >
> > > On Monday 17 September 2007 11:40, Mohammad Norouzi wrote:
> > > > Hi
> > > > I have a problem in getting correct result from Lucene, consider we
> > > have an
> > > > index containing documents with fields "field1" and "field2" etc. now
> > > I want
> > > > to have documents in which their field1 are equal one by one and their
> > > > field2 with two different value
> > > >
> > > > to clarify consider I have this query:
> > > > field1:val*  (field2:"myValue1" XOR field2:"myValue2")
> > >
> > > Did you try this:
> > >
> > > +field1:val*  +field2:"myValue1" +field2:"myValue2"
> > >
> > > Regards,
> > > Paul Elschot
> > >
> > >
> > > >
> > > > now I want this result:
> > > > field1  field2
> > > > val1myValue1
> > > > val1myValue2
> > > > val2myValue1
> > > > val2myValue2
> > > >
> > > > this result is not acceptable:
> > > > val3  myValue1
> > > > or
> > > > val4 myValue1
> > > > val4 myValue3
> > > >
> > > > I put XOR as operator because this is not a typical OR, it's
> > > different, it
> > > > means documents that contains both myValue1 and myValue2 for the field
> > >
> > > > field2
> > > >
> > > > how to build a query to get such result?
> > > >
> > > > thanks in advance
> > > > --
> > > > Regards,
> > > > Mohammad
> > > > --
> > > > see my blog: http://brainable.blogspot.com/
> > > > another in Persian: http://fekre-motefavet.blogspot.com/
> > > > Sun Certified Java Programmer
> > > > ExpertsExchange Certified, Master:
> > > > http://www.experts-exchange.com/M_1938796.html
> > > >
> > >
> > > -
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > >
> >
> >
> > --
> > Regards,
> > Mohammad
> > --
> > see my blog: http://brainable.blogspot.com/
> > another in Persian: http://fekre-motefavet.blogspot.com/
> > Sun Certified Java Programmer
> > ExpertsExchange Certified, Master: 
http://www.experts-exchange.com/M_1938796.html
> >
> >
> 
> 
> 
> -- 
> Regards,
> Mohammad
> --
> see my blog: http://brainable.blogspot.com/
> another in Persian: http://fekre-motefavet.blogspot.com/
> Sun Certified Java Programmer
> ExpertsExchange Certified, Master:
> http://www.experts-exchange.com/M_1938796.html
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: a query for a special AND?

2007-09-20 Thread Mohammad Norouzi
well, do you mean we should separate documents just like relational tables in
databases?
if yes, how do we make the relationship between those documents?

thank you so much Paul

On 9/20/07, Paul Elschot <[EMAIL PROTECTED]> wrote:
>
> On Thursday 20 September 2007 07:29, Mohammad Norouzi wrote:
> > Sorry Paul I just hurried in replying ;)
> > I read the documents of Lucene about query syntax and I figured out the
> what
> > is the difference
> > but my problem is different, this is preoccupied my mind and I am under
> > pressure to solve this problem, after analyzing the results I get, now I
> > think we need a "group by" in our query.
> >
> > let me tell you an example: we need a list of patients that have been
> > examined by certain services specified by the user , say service one and
> > service two.
> >
> > in this case here is the correct result:
> > patient-id   service_name   patient_result
> > 1            s1             12
> > 1            s2             13
> > 2            s1             41
> > 2            s2             22
> >
> > but for example, following is incorrect because patient 1 has no service
> > with name service2:
> > patient-id   service_name   patient_result
> > 1            s1             12
> > 1            s3             13
>
> That depends on what you put in your lucene documents.
> You can only get complete lucene documents as query results.
> For the above example a patient with all service names
> should be indexed in a single lucene doc.
>
> The rows above suggest that the relation between patient and
> service forms the relational result. However, for a text search
> engine it is usual to denormalize the relational records into
> indexed documents, depending on the required output.
>
> Regards,
> Paul Elschot
>
>
>
> >
> >
> >
> > On 9/20/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi Paul,
> > > would you tell me what is the difference between AND and + ?
> > > I tried both but get different result
> > > with AND I get 1777 documents and with + I get nearly 25000 ?
> > >
> > >
> > > On 9/17/07, Paul Elschot <[EMAIL PROTECTED]> wrote:
> > > >
> > > > On Monday 17 September 2007 11:40, Mohammad Norouzi wrote:
> > > > > Hi
> > > > > I have a problem in getting correct result from Lucene, consider
> we
> > > > have an
> > > > > index containing documents with fields "field1" and "field2" etc.
> now
> > > > I want
> > > > > to have documents in which their field1 are equal one by one and
> their
> > > > > field2 with two different value
> > > > >
> > > > > to clarify consider I have this query:
> > > > > field1:val*  (field2:"myValue1" XOR field2:"myValue2")
> > > >
> > > > Did you try this:
> > > >
> > > > +field1:val*  +field2:"myValue1" +field2:"myValue2"
> > > >
> > > > Regards,
> > > > Paul Elschot
> > > >
> > > >
> > > > >
> > > > > now I want this result:
> > > > > field1  field2
> > > > > val1myValue1
> > > > > val1myValue2
> > > > > val2myValue1
> > > > > val2myValue2
> > > > >
> > > > > this result is not acceptable:
> > > > > val3  myValue1
> > > > > or
> > > > > val4 myValue1
> > > > > val4 myValue3
> > > > >
> > > > > I put XOR as operator because this is not a typical OR, it's
> > > > different, it
> > > > > means documents that contains both myValue1 and myValue2 for the
> field
> > > >
> > > > > field2
> > > > >
> > > > > how to build a query to get such result?
> > > > >
> > > > > thanks in advance
> > > > > --
> > > > > Regards,
> > > > > Mohammad
> > > > > --
> > > > > see my blog: http://brainable.blogspot.com/
> > > > > another in Persian: http://fekre-motefavet.blogspot.com/
> > > > > Sun Certified Java Programmer
> > > > > ExpertsExchange Certified, Master:
> > > > > http://www.experts-exchange.com/M_1938796.html
> > > > >
> > > >
> > > >
> -
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: [EMAIL PROTECTED]
> > > >
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > Mohammad
> > > --
> > > see my blog: http://brainable.blogspot.com/
> > > another in Persian: http://fekre-motefavet.blogspot.com/
> > > Sun Certified Java Programmer
> > > ExpertsExchange Certified, Master:
> http://www.experts-exchange.com/M_1938796.html
> > >
> > >
> >
> >
> >
> > --
> > Regards,
> > Mohammad
> > --
> > see my blog: http://brainable.blogspot.com/
> > another in Persian: http://fekre-motefavet.blogspot.com/
> > Sun Certified Java Programmer
> > ExpertsExchange Certified, Master:
> > http://www.experts-exchange.com/M

Re: a query for a special AND?

2007-09-20 Thread Paul Elschot
On Thursday 20 September 2007 09:19, Mohammad Norouzi wrote:
> well, do you mean we should separate documents just like relational tables in
> databases?

Quite the contrary, it's called _de_normalization. This means that the
documents in Lucene normally contain more information than is present
in a single relational entity.

> if yes, how do we make the relationship between those documents?

Lucene has no facilities to maintain relational relationships among
its documents. A Lucene index allows free-format documents, i.e.
any document may have any field or none.
In practice you will need at least a primary key, but even that you will
need to program yourself.
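
As a sketch of that last point (Lucene 2.x API; "directory", "writer" and
rebuiltPatientDoc are assumed to exist, and the field name is taken from the
earlier example), a key field plus a delete-by-term is enough to replace a
document:

  // when indexing, give every document a key field:
  doc.add(new Field("patient_id", "1", Field.Store.YES, Field.Index.UN_TOKENIZED));

  // to "update" that patient later: delete by the key, then re-add
  IndexReader reader = IndexReader.open(directory);
  reader.deleteDocuments(new Term("patient_id", "1"));
  reader.close();
  writer.addDocument(rebuiltPatientDoc);  // rebuilt from your relational source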

Regards,
Paul Elschot



> 
> thank you so much Paul
> 
> On 9/20/07, Paul Elschot <[EMAIL PROTECTED]> wrote:
> >
> > On Thursday 20 September 2007 07:29, Mohammad Norouzi wrote:
> > > Sorry Paul I just hurried in replying ;)
> > > I read the documents of Lucene about query syntax and I figured out the
> > what
> > > is the difference
> > > but my problem is different, this is preoccupied my mind and I am under
> > > pressure to solve this problem, after analyzing the results I get, now I
> > > think we need a "group by" in our query.
> > >
> > > let me tell you an example: we need a list of patients that have been
> > > examined by certain services specified by the user , say service one and
> > > service two.
> > >
> > > in this case here is the correct result:
> > > patient-id   service_name   patient_result
> > > 1            s1             12
> > > 1            s2             13
> > > 2            s1             41
> > > 2            s2             22
> > >
> > > but for example, following is incorrect because patient 1 has no service
> > > with name service2:
> > > patient-id   service_name   patient_result
> > > 1            s1             12
> > > 1            s3             13
> >
> > That depends on what you put in your lucene documents.
> > You can only get complete lucene documents as query results.
> > For the above example a patient with all service names
> > should be indexed in a single lucene doc.
> >
> > The rows above suggest that the relation between patient and
> > service forms the relational result. However, for a text search
> > engine it is usual to denormalize the relational records into
> > indexed documents, depending on the required output.
> >
> > Regards,
> > Paul Elschot
> >
> >
> >
> > >
> > >
> > >
> > > On 9/20/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Hi Paul,
> > > > would you tell me what is the difference between AND and + ?
> > > > I tried both but get different result
> > > > with AND I get 1777 documents and with + I get nearly 25000 ?
> > > >
> > > >
> > > > On 9/17/07, Paul Elschot <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > On Monday 17 September 2007 11:40, Mohammad Norouzi wrote:
> > > > > > Hi
> > > > > > I have a problem in getting correct result from Lucene, consider
> > we
> > > > > have an
> > > > > > index containing documents with fields "field1" and "field2" etc.
> > now
> > > > > I want
> > > > > > to have documents in which their field1 are equal one by one and
> > their
> > > > > > field2 with two different value
> > > > > >
> > > > > > to clarify consider I have this query:
> > > > > > field1:val*  (field2:"myValue1" XOR field2:"myValue2")
> > > > >
> > > > > Did you try this:
> > > > >
> > > > > +field1:val*  +field2:"myValue1" +field2:"myValue2"
> > > > >
> > > > > Regards,
> > > > > Paul Elschot
> > > > >
> > > > >
> > > > > >
> > > > > > now I want this result:
> > > > > > field1  field2
> > > > > > val1myValue1
> > > > > > val1myValue2
> > > > > > val2myValue1
> > > > > > val2myValue2
> > > > > >
> > > > > > this result is not acceptable:
> > > > > > val3  myValue1
> > > > > > or
> > > > > > val4 myValue1
> > > > > > val4 myValue3
> > > > > >
> > > > > > I put XOR as operator because this is not a typical OR, it's
> > > > > different, it
> > > > > > means documents that contains both myValue1 and myValue2 for the
> > field
> > > > >
> > > > > > field2
> > > > > >
> > > > > > how to build a query to get such result?
> > > > > >
> > > > > > thanks in advance
> > > > > > --
> > > > > > Regards,
> > > > > > Mohammad
> > > > > > --
> > > > > > see my blog: http://brainable.blogspot.com/
> > > > > > another in Persian: http://fekre-motefavet.blogspot.com/
> > > > > > Sun Certified Java Programmer
> > > > > > ExpertsExchange Certified, Master:
> > > > > > http://www.experts-exchange.com/M_1938796.html
> > > > > >
> > > > >
> > > > >
> > ---

Multiple Indices vs Single Index

2007-09-20 Thread Nikhil Chhaochharia
Hi,

I have about 40 indices which range in size from 10MB to 700MB.  There are 
quite a few stored fields.  To get an idea of the document size, I have about 
400k documents in the 700MB index.

Depending on the query, I choose the index which needs to be searched.  Each 
query hits only one index.  I was wondering if creating a single index where 
every document will have the indexname as a field will be more efficient.  I 
created such an index and it was 3.4 GB in size.  My initial performance tests 
with it are not conclusive.
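
To be concrete, this is roughly what I mean (a sketch against the Lucene 2.x
API; the category value is made up, and userQuery/searcher are whatever is
already in place):

  // at index time, tag every document with its category
  doc.add(new Field("indexname", "category17", Field.Store.NO, Field.Index.UN_TOKENIZED));

  // at search time, add the selected category as a required clause
  BooleanQuery query = new BooleanQuery();
  query.add(userQuery, BooleanClause.Occur.MUST);
  query.add(new TermQuery(new Term("indexname", "category17")), BooleanClause.Occur.MUST);
  Hits hits = searcher.search(query);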

Also, what other points should be addressed when deciding between 1 index 
and 40 indices?

I have 8GB RAM on the machine.


Thanks,
Nikhil



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene multiple indexes

2007-09-20 Thread Dino Korah
Hi People,

 

I was trying to get Lucene to work for a mail indexing solution.

 

Scenario:

Traffic into the index method is on average 250 mails and their attachments
per minute. This volume has made me think of a solution that splits the
index on the domain name of the message owner. So if I have, say, 100
users from 50 domains, I have 50 indexes. Now in the query method of my
program, I have to load the indexes according to the user who is logged on.

 

Is it a good idea to cache the Searcher objects once they are created? Or is
there a better approach to what I am trying to achieve?
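
What I have in mind is roughly this (only a sketch; the class fields and the
per-domain directory layout are made up, and it assumes the Lucene 2.x API):

  private final Map searchers = new HashMap();  // domain -> IndexSearcher

  public synchronized IndexSearcher getSearcher(String domain) throws IOException {
      IndexSearcher searcher = (IndexSearcher) searchers.get(domain);
      if (searcher == null) {
          // assumed layout: one index directory per domain under indexRoot
          searcher = new IndexSearcher(FSDirectory.getDirectory(indexRoot + "/" + domain));
          searchers.put(domain, searcher);
      }
      return searcher;
  }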

 

Many thanks

 

d i n ok o r a h
Tel: +44 795 66 65 283

51°21'52"N  0°5' 14.16"

 



Question regarding proximity search

2007-09-20 Thread Sonu SR
Hi,
I have a doubt about proximity search.
Is the query "cat dog"~6 the same as (cat dog)~6 ?
I think both cases will search for "cat" and "dog" within 6 words of each other,
but I am getting different numbers of results for the above queries. The
second one may be higher. Please clarify this.

Thanks,
Sonu


Re: Oracle-Lucene integration (OJVMDirectory and Lucene Domain Index) - LONG

2007-09-20 Thread Marcelo Ochoa
Hi Chris:
  First, sorry for the delay :(
  I have some preliminary performance tests using Oracle 11g running
in a VMware virtual machine with a 400MB SGA (the virtual machine uses
812MB RAM for Oracle Enterprise Linux 4.0). This virtual machine is
hosted on modest hardware, a Pentium IV 2.18GHz with 2GB RAM running
Mandriva Linux 2007.
  Here are some results:
  Indexing the all_source system view took 23 minutes; the all_source view
has 220731 rows with 50MB of data. Admittedly this text is not free text,
because many rows contain wrapped code with hexadecimal numbers. Here are the
table and the index:
SQL> desc test_source_big
 Name                            Null?    Type
 ------------------------------- -------- ----------------
 OWNER                                    VARCHAR2(30)
 NAME                                     VARCHAR2(30)
 TYPE                                     VARCHAR2(12)
 LINE                                     NUMBER
 TEXT                                     VARCHAR2(4000)
SQL> create index source_big_lidx on test_source_big(text)
indextype is lucene.LuceneIndex
parameters('Stemmer:English;MaxBufferedDocs:5000;DecimalFormat:;ExtraCols:line');

Index created.

Elapsed: 00:23:02.74
   Index storage (45MB, 220K Lucene docs) is:

 FILE_SIZE NAME
---------- ------------
         9 parameters
         2 updateCount
        20 segments.gen
  45941031 _1d.cfs
        42 segments_2t

   A query like this:

select /*+ FIRST_ROWS(10) */ lscore(1) from test_source_big where
lcontains(text,'"procedure java"~10',1)>0 order by lscore(1) desc;

   It took 11ms, and will be faster if you don't need the lscore(1) value;
here is another example:

select /*+ FIRST_ROWS(10) DOMAIN_INDEX_SORT */ lscore(1) from
test_source_big where lcontains(text,'(optimize OR sync) AND "LANGUAGE
JAVA"',1)>0 order by lscore(1) asc;

  It took 7ms.

  But there are other benefits related to the Domain Index
implementation using the Data Cartridge API:
  - Any modification to the table is notified to Lucene automatically;
you can apply these modifications online or deferred, except for
deletions, which are always synced.
  - The execution plan is calculated by the optimizer using the domain
index, and with the latest additions (User Data Store) you can use Lucene
to reduce how many rows the database will process, using multiple columns
in the lcontains operator.
   For example, this query uses Lucene to search free text in the TEXT
column and Oracle's filter reduction on the LINE column:

SQL> select count(text) from test_source_big where
lcontains(text,'function')>0 and line>=6000;

COUNT(TEXT)
---
  2

Elapsed: 00:00:00.74
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------
Plan hash value: 2350958379

------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                 |     1 |  2027 |  2968   (1)| 00:00:36 |
|   1 |  SORT AGGREGATE              |                 |     1 |  2027 |            |          |
|*  2 |   TABLE ACCESS BY INDEX ROWID| TEST_SOURCE_BIG |     7 | 14189 |  2968   (1)| 00:00:36 |
|*  3 |    DOMAIN INDEX              | SOURCE_BIG_LIDX |       |       |            |          |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("LINE">=6000)
   3 - access("LUCENE"."LCONTAINS"("TEXT",'function')>0)

   But if you use Lucene to reduce the number of rows visited by
Oracle, by using the User Data Store to index the LINE column too, you can
perform a query like this:

SQL> select count(text) from test_source_big where
lcontains(text,'function AND line:[6000 TO 7000]')>0;

COUNT(TEXT)
---
  2

Elapsed: 00:00:00.05
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------
Plan hash value: 2350958379

------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                 |     1 |  2014 |  2968   (1)| 00:00:36 |
|   1 |  SORT AGGREGATE              |                 |     1 |  2014 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| TEST_SOURCE_BIG | 11587 |   22M |  2968   (1)| 00:00:36 |
|*  3 |    DOMAIN INDEX              | SOURCE_BIG_LI

Re: Multiple Indices vs Single Index

2007-09-20 Thread Grant Ingersoll
If I understand correctly, you want to do a two stage retrieval  
right?  That is, look up in the initial index (3.4 GB) and then do a  
second search on the sub index?  Presumably, you have to manage the  
Searchers, etc. for each of the sub-indexes as well as the big  
index.  This means you have to go through the hits from the first  
search, then route, etc. correct?


Have you tried creating one single index with all the (stored)  
fields, etc?  Worst case scenario, assuming 1GB per index, is you  
would have a 40GB index, but my guess is index compression will  
reduce it more.  Since you are less than that anyway, have you tried  
just the straightforward solution?  Or do you have other requirements  
that force the sub-index solution?  Also, I am not sure it will work,  
but it seems worth a try.  Of course, this also depends on how much  
you expect your indexes to grow.


Also, what was inconclusive about your tests?  Maybe you can describe  
more what you have tried to date?


Cheers,
Grant

On Sep 20, 2007, at 3:50 AM, Nikhil Chhaochharia wrote:


Hi,

I have about 40 indices which range in size from 10MB to 700MB.   
There are quite a few stored fields.  To get an idea of the  
document size, I have about 400k documents in the 700MB index.


Depending on the query, I choose the index which needs to be  
searched.  Each query hits only one index.  I was wondering if  
creating a single index where every document will have the  
indexname as a field will be more efficient.  I created such an  
index and it was 3.4 GB in size.  My initial performance tests with  
it are not conclusive.


Also, what are the other points to be addressed while deciding  
between 1 index and 40 indices.


I have 8GB RAM on the machine.


Thanks,
Nikhil



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Multiple Indices vs Single Index

2007-09-20 Thread Nikhil Chhaochharia
I am sorry, it seems that I was not clear with what my problem is.  I will try 
to describe it again.

My data is divided into 40 categories and at one time only one category can be 
searched.  The GUI for the system will ask the user to select the category from 
a drop-down.  Currently, I have a separate index for every category.  The index 
sizes varies - one category index is 10MB and another is 700MB.  Other 
index-sizes are somewhere in between.

I was wondering if it will be better to just have 1 large index with all the 40 
indices combined.  I do not need to do dual-queries and my total index size (if 
I create a single index) is about 3.4GB.  It will increase to maximum of 5-6 
GB.  I am running this on a dedicated machine with 8GB RAM.

Unfortunately I do not have enough hardware to run both in parallel and test 
properly.  I have just one server, which is being used by live users.  So it would 
be great if you could tell me whether I should stick with my 40 indices or 
combine them into 1 index.  What are the pros and cons of each approach?

Thanks,
Nikhil


- Original Message 
From: Grant Ingersoll <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, 20 September, 2007 7:57:21 PM
Subject: Re: Multiple Indices vs Single Index

If I understand correctly, you want to do a two stage retrieval  
right?  That is, look up in the initial index (3.4 GB) and then do a  
second search on the sub index?  Presumably, you have to manage the  
Searchers, etc. for each of the sub-indexes as well as the big  
index.  This means you have to go through the hits from the first  
search, then route, etc. correct?

Have you tried creating one single index with all the (stored)  
fields, etc?  Worst case scenario, assuming 1GB per index, is you  
would have a 40GB index, but my guess is index compression will  
reduce it more.  Since you are less than that anyway, have you tried  
just the straightforward solution?  Or do you have other requirements  
that force the sub-index solution?  Also, I am not sure it will work,  
but it seems worth a try.  Of course, this also depends on how much  
you expect your indexes to grow.

Also, what was inconclusive about your tests?  Maybe you can describe  
more what you have tried to date?

Cheers,
Grant

On Sep 20, 2007, at 3:50 AM, Nikhil Chhaochharia wrote:

> Hi,
>
> I have about 40 indices which range in size from 10MB to 700MB.   
> There are quite a few stored fields.  To get an idea of the  
> document size, I have about 400k documents in the 700MB index.
>
> Depending on the query, I choose the index which needs to be  
> searched.  Each query hits only one index.  I was wondering if  
> creating a single index where every document will have the  
> indexname as a field will be more efficient.  I created such an  
> index and it was 3.4 GB in size.  My initial performance tests with  
> it are not conclusive.
>
> Also, what are the other points to be addressed while deciding  
> between 1 index and 40 indices.
>
> I have 8GB RAM on the machine.
>
>
> Thanks,
> Nikhil
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: thread safe shared IndexSearcher

2007-09-20 Thread Jay Yu

Mark,

Thanks for sharing your valuable experience and thoughts.
Frankly, our system already has most of the functionality 
LuceneIndexAccessor offers. The only thing I am looking for is to sync 
the searchers' close. That's why I am a little worried about the way the 
accessor handles the searcher sync.

I will probably give it a try to see how it performs in our system.

Thanks!

Jay

Mark Miller wrote:
The method is synched, but this is because each thread *does* share the 
same Searcher. To maintain a cache of searchers across multiple threads, 
you've got to sync -- to reference count, you've got to sync. The 
performance hit of LuceneIndexAccessor is pretty minimal for its 
functionality, and frankly, for the functionality you want, you have to 
pay a cost. That's not even the end of it really... you're going to need to 
maintain a cache of Accessor objects for each index as well... and if you 
don't know all the indexes at startup time, access to this will also need 
to be synched. I wouldn't worry though -- searches are still lightning 
fast... that won't be the bottleneck. I'll work on getting you some code, 
but if you're worried, try some benchmarking on the original code.


Also, to be clear, I don't have the code in front of me, but getting a 
Searcher does not require waiting for a Writer to be released. Searchers 
are cached and reused (and instantly available) until a Writer is 
released. When this happens, the release Writer method waits for all the 
Searchers to return (which happens pretty quickly, as searches are quick), 
the Searcher cache is cleared, and then subsequent calls to getSearcher 
create new Searchers that can see what the Writer added.


The key is to use your Writer/Searcher/Reader quickly and then release it 
(unless you're bulk loading). I've had such a system with 5+ million docs 
on a standard machine and searches were still well below a second after 
the first Searcher is cached (and even the first search is darn quick). 
And that includes a lot of extra crap I am doing.
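
In code, the pattern is basically this (a sketch only, using the
getSearcher/release method names discussed in this thread; "accessor" and
"query" are assumed to exist):

  Searcher searcher = accessor.getSearcher();
  try {
      Hits hits = searcher.search(query);
      // ... read what you need from the hits ...
  } finally {
      accessor.release(searcher);  // lets a pending Writer release proceed
  }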


- Mark

Jay Yu wrote:

Mark,

After reading the implementation of LuceneIndexAccessor.getSearcher(),
I realized that the method is synchronized and waits for the 
writingDirector to be released. That means if we getSearcher for each 
query in each thread, there might be contention and a performance hit. 
In fact, even the release(searcher) method is costly. On the other 
hand, if multiple threads share one searcher, then it would defeat the 
purpose of using LuceneIndexAccessor.
Do I miss something here? What's your suggested use case for 
LuceneIndexAccessor?


Thanks!

Jay
Mark Miller wrote:

I'll respond a point at a time:

1.

** Hi Maik,

So what happens in this case:

IndexAccessProvider accessProvider = new IndexAccessProvider(directory,

analyzer);

LuceneIndexAccessor accessor = new LuceneIndexAccessor(accessProvider);

accessor.open();

IndexWriter writer = accessor.getWriter();

// reference to the same instance?

IndexWriter writer2 = accessor.getWriter();

writer.addDocument();

writer2.addDocument();



// I didn't release the writer yet

// will this block?

IndexReader reader = accessor.getReader();

reader.delete();



This is not really an issue. First, if you are going to delete with a 
Reader
you need to call getWritingReader and not getReader. When you do 
that, the
getWritingReader call will block until writer and writer2 are 
released. If
you are just adding a couple docs before releasing the writers, this 
is no

problem because the block will be very short. If you are loading tons of
docs and you want to be able to delete with a Reader in a timely 
manner, you
should release the writers every now and then (release and re-get the 
Writer

every 100 docs or something). An interactive index should not hog the
Writer, while something that is just loading a lot could hog the Writer.
This is no different than normal... you cannot delete with a Reader while
adding with a Writer with Lucene. This code just enforces those 
semantics.

The best solution is to just use a Writer to delete -- I never get a
ReadingWriter.

2. http://issues.apache.org/bugzilla/show_bug.cgi?id=34995#c3

This is no big deal either. I just added another getWriter call that 
takes a

create Boolean.

3. I don't think there is a latest release. This has never gotten much
official attention and is not in the sandbox. I worked straight from the
originally submitted code.

4. I will look into getting together some code that I can share. The
multisearcher changes that are need are a couple of one liners 
really, so at

a minimum I will give you the changes needed.



-   Mark



On 9/19/07, Jay Yu <[EMAIL PROTECTED]> wrote:

Mark,



thanks for sharing your insight and experience about 
LuceneIndexAccessor!


I remember seeing some people reporting some issues about it, such as:

http://www.archivum.info/[EMAIL PROTECTED]/2005-05/msg00114.html 



http://issues.apache.org/bugzilla/show_bug.cgi?i

Re: Multiple Indices vs Single Index

2007-09-20 Thread Grant Ingersoll
OK, I thought you meant your index would have in it the name of the  
second index and would thus do a two-stage retrieval.


At any rate, if you are saying your combined index with all the  
stored fields is ~3.4 GB I would think it would fit reasonably on the  
machine you have and perform reasonably.  Naturally, this depends on  
your application, your users, etc. and I can't make any guarantees,  
but I certainly recall others managing this size just fine.  See the  
many tips on improving searching and indexing on the Wiki (link at  
bottom in my signature) and do some profiling/testing.


When you said your tests were inconclusive, what tests have you  
done?  If you can, run the tests in a profiler to see where your  
bottlenecks are.


-Grant


On Sep 20, 2007, at 11:16 AM, Nikhil Chhaochharia wrote:

I am sorry, it seems that I was not clear with what my problem is.   
I will try to describe it again.


My data is divided into 40 categories and at one time only one  
category can be searched.  The GUI for the system will ask the user  
to select the category from a drop-down.  Currently, I have a  
separate index for every category.  The index sizes varies - one  
category index is 10MB and another is 700MB.  Other index-sizes are  
somewhere in between.


I was wondering if it will be better to just have 1 large index  
with all the 40 indices combined.  I do not need to do dual-queries  
and my total index size (if I create a single index) is about  
3.4GB.  It will increase to maximum of 5-6 GB.  I am running this  
on a dedicated machine with 8GB RAM.


Unfortunately I do not have enough hardware to run both in parallel  
and test properly.  Have just one server which is being used by  
live users.  So it would be great if you could tell me whether I  
should stick with my 40 indices or combine them into 1 index.  What  
are the pros and cons of each approach ?


Thanks,
Nikhil


- Original Message 
From: Grant Ingersoll <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, 20 September, 2007 7:57:21 PM
Subject: Re: Multiple Indices vs Single Index

If I understand correctly, you want to do a two stage retrieval
right?  That is, look up in the initial index (3.4 GB) and then do a
second search on the sub index?  Presumably, you have to manage the
Searchers, etc. for each of the sub-indexes as well as the big
index.  This means you have to go through the hits from the first
search, then route, etc. correct?

Have you tried creating one single index with all the (stored)
fields, etc?  Worst case scenario, assuming 1GB per index, is you
would have a 40GB index, but my guess is index compression will
reduce it more.  Since you are less than that anyway, have you tried
just the straightforward solution?  Or do you have other requirements
that force the sub-index solution?  Also, I am not sure it will work,
but it seems worth a try.  Of course, this also depends on how much
you expect your indexes to grow.

Also, what was inconclusive about your tests?  Maybe you can describe
more what you have tried to date?

Cheers,
Grant

On Sep 20, 2007, at 3:50 AM, Nikhil Chhaochharia wrote:


Hi,

I have about 40 indices which range in size from 10MB to 700MB.
There are quite a few stored fields.  To get an idea of the
document size, I have about 400k documents in the 700MB index.

Depending on the query, I choose the index which needs to be
searched.  Each query hits only one index.  I was wondering if
creating a single index where every document will have the
indexname as a field will be more efficient.  I created such an
index and it was 3.4 GB in size.  My initial performance tests with
it are not conclusive.

Also, what are the other points to be addressed while deciding
between 1 index and 40 indices.

I have 8GB RAM on the machine.


Thanks,
Nikhil



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: thread safe shared IndexSearcher

2007-09-20 Thread Mark Miller
Good luck Jay. Keep in mind, pretty much all LuceneIndexAccessor does is 
sync Readers with Writers and allow multiple threads to share the same 
instances of them -- nothing more. The code just forces Readers to 
refresh when Writers are used to change the index. There really isn't 
any functionality beyond that offered. Since you want to have a 
multi-threaded system access the same resources (which occasionally need 
to be refreshed), it's not too easy to get around a synchronized block.
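
To illustrate why (a minimal sketch, not the actual accessor code; the field
names and "directory" are made up): the count of in-flight searchers has to
be updated atomically, so some synchronization is unavoidable:

  private int inUse = 0;
  private IndexSearcher current;      // the shared searcher
  private boolean reopenPending;      // set when a Writer has changed the index

  public synchronized IndexSearcher acquire() {
      inUse++;
      return current;
  }

  public synchronized void release() throws IOException {
      inUse--;
      if (inUse == 0 && reopenPending) {
          current.close();            // safe: nobody is using it any more
          current = new IndexSearcher(directory);
          reopenPending = false;
      }
  }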


If I am able to extract some usable code for you soon I will let you know.

- Mark

Jay Yu wrote:

Mark,

Thanks for sharing your valuable exp. and thoughts.
Frankly our system already has most of the functionalities 
LuceneIndexAcessor offers. The only thing I am looking for is to sync 
the searchers' close. That's why I am little worried about the way 
accessor handles the searcher sync.

I will probably give it a try to see how it performs in our system.

Thanks!

Jay

Mark Miller wrote:
The method is synched, but this is because each thread *does* share 
the same Searcher. To maintain a cache of searchers across multiple 
threads, you've got to sync -- to reference count, you've got to 
sync. The performance hit of LuceneIndexAcessor is pretty minimal for 
its functionality, and frankly, for the functionality you want, you 
have to pay a cost. Thats not even the end of it really...your going 
to need to maintain a cache of Accessor objects for each index as 
well...and if you dont know all the indexes at startup time, access 
to this will also need to be synched. I wouldn't worry though -- 
searches are still lightening fast...that won't be the bottleneck. 
I'll work on getting you some code, but if your worried, try some 
benchmarking on the original code.


Also, to be clear, I don't have the code in front of me, but getting 
a Searcher does not require waiting for a Writer to be released. 
Searchers are cached and resused (and instantly available) until a 
Writer is released. When this happens, the release Writer method 
waits for all the Searchers to return (happens pretty quick as 
searches are pretty quick), the Searcher cache is cleared, and then 
subsequent calls to getSearcher create new Searchers that can see 
what the Writer added.


The key is use your Writer/Searcher/Reader quickly and then release 
it (unless your bulk loading). I've had such a system with 5+ million 
docs on a standard machine and searches where still well below a 
second after the first Searcher is cached (and even the first search 
is darn quick). And that includes a lot of extra crap I am doing.


- Mark

Jay Yu wrote:

Mark,

After reading the implementation of LuceneIndexAccessor.getSearcher(),
I realized that the method is synchronized and wait for 
writingDirector to be released. That means if we getSearcher for 
each query in each thread, there might be a contention and 
performance hit. In fact, even the method of release(searcher) is 
costly. On the other hand, if multiple threads share share one 
searcher then it'd defeat the

purpose of using LuceneIndexAccessor.
Do I miss sth here? What's your suggested use case for 
LuceneIndexAccessor?


Thanks!

Jay
Mark Miller wrote:

Ill respond a point at a time:

1.

** Hi Maik,

So what happens in this case:

IndexAccessProvider accessProvider = new 
IndexAccessProvider(directory,


analyzer);

LuceneIndexAccessor accessor = new 
LuceneIndexAccessor(accessProvider);


accessor.open();

IndexWriter writer = accessor.getWriter();

// reference to the same instance?

IndexWriter writer2 = accessor.getWriter();

writer.addDocument();

writer2.addDocument();



// I didn't release the writer yet

// will this block?

IndexReader reader = accessor.getReader();

reader.delete();



This is not really an issue. First, if you are going to delete with 
a Reader
you need to call getWritingReader and not getReader. When you do 
that, the
getWritingReader call will block until writer and writer2 are 
released. If
you are just adding a couple docs before releasing the writers, 
this is no
problem because the block will be very short. If you are loading 
tons of
docs and you want to be able to delete with a Reader in a timely 
manner, you
should release the writers every now and then (release and re-get 
the Writer

every 100 docs or something). An interactive index should not hog the
Writer, while something that is just loading a lot could hog the 
Writer.

This is no different than normal... you cannot delete with a Reader while
adding with a Writer with Lucene. This code just enforces those 
semantics.

The best solution is to just use a Writer to delete -- I never get a
ReadingWriter.

2. http://issues.apache.org/bugzilla/show_bug.cgi?id=34995#c3

This is no big deal either. I just added another getWriter call 
that takes a

create Boolean.

3. I don't think there is a latest release. This has never gotten much
official attention and is not in th

Re: thread safe shared IndexSearcher

2007-09-20 Thread Jay Yu



Mark Miller wrote:
Good luck Jay. Keep in mind, pretty much all LuceneIndexAccessor does is 
sync Readers with Writers and allow multiple threads to share the same 
instances of them -- nothing more. The code just forces Readers to 
refresh when Writers are used to change the index. There really isn't 
any functionality beyond that offered. Since you want to have a multi 
thread system access the same resources (which occasionally need to be 
refreshed) its not too easy to get around a synchronized block.


If I am able to extract some usable code for you soon I will let you know.

I will appreciate it!
Thanks for your help!



- Mark

Jay Yu wrote:

Mark,

Thanks for sharing your valuable exp. and thoughts.
Frankly our system already has most of the functionalities 
LuceneIndexAcessor offers. The only thing I am looking for is to sync 
the searchers' close. That's why I am little worried about the way 
accessor handles the searcher sync.

I will probably give it a try to see how it performs in our system.

Thanks!

Jay

Mark Miller wrote:
The method is synched, but this is because each thread *does* share 
the same Searcher. To maintain a cache of searchers across multiple 
threads, you've got to sync -- to reference count, you've got to 
sync. The performance hit of LuceneIndexAcessor is pretty minimal for 
its functionality, and frankly, for the functionality you want, you 
have to pay a cost. Thats not even the end of it really...your going 
to need to maintain a cache of Accessor objects for each index as 
well...and if you dont know all the indexes at startup time, access 
to this will also need to be synched. I wouldn't worry though -- 
searches are still lightening fast...that won't be the bottleneck. 
I'll work on getting you some code, but if your worried, try some 
benchmarking on the original code.


Also, to be clear, I don't have the code in front of me, but getting 
a Searcher does not require waiting for a Writer to be released. 
Searchers are cached and resused (and instantly available) until a 
Writer is released. When this happens, the release Writer method 
waits for all the Searchers to return (happens pretty quick as 
searches are pretty quick), the Searcher cache is cleared, and then 
subsequent calls to getSearcher create new Searchers that can see 
what the Writer added.


The key is use your Writer/Searcher/Reader quickly and then release 
it (unless your bulk loading). I've had such a system with 5+ million 
docs on a standard machine and searches where still well below a 
second after the first Searcher is cached (and even the first search 
is darn quick). And that includes a lot of extra crap I am doing.


- Mark

Jay Yu wrote:

Mark,

After reading the implementation of LuceneIndexAccessor.getSearcher(),
I realized that the method is synchronized and wait for 
writingDirector to be released. That means if we getSearcher for 
each query in each thread, there might be a contention and 
performance hit. In fact, even the method of release(searcher) is 
costly. On the other hand, if multiple threads share share one 
searcher then it'd defeat the

purpose of using LuceneIndexAccessor.
Do I miss sth here? What's your suggested use case for 
LuceneIndexAccessor?


Thanks!

Jay
Mark Miller wrote:

Ill respond a point at a time:

1.

** Hi Maik,

So what happens in this case:

IndexAccessProvider accessProvider = new 
IndexAccessProvider(directory,


analyzer);

LuceneIndexAccessor accessor = new 
LuceneIndexAccessor(accessProvider);


accessor.open();

IndexWriter writer = accessor.getWriter();

// reference to the same instance?

IndexWriter writer2 = accessor.getWriter();

writer.addDocument();

writer2.addDocument();



// I didn't release the writer yet

// will this block?

IndexReader reader = accessor.getReader();

reader.delete();



This is not really an issue. First, if you are going to delete with 
a Reader
you need to call getWritingReader and not getReader. When you do 
that, the
getWritingReader call will block until writer and writer2 are 
released. If
you are just adding a couple docs before releasing the writers, 
this is no
problem because the block will be very short. If you are loading 
tons of
docs and you want to be able to delete with a Reader in a timely 
manner, you
should release the writers every now and then (release and re-get 
the Writer

every 100 docs or something). An interactive index should not hog the
Writer, while something that is just loading a lot could hog the 
Writer.

This is no different than normal... you cannot delete with a Reader while
adding with a Writer with Lucene. This code just enforces those 
semantics.

The best solution is to just use a Writer to delete -- I never get a
ReadingWriter.

2. http://issues.apache.org/bugzilla/show_bug.cgi?id=34995#c3

This is no big deal either. I just added another getWriter call 
that takes a

create Boolean.

3. I don't think there is a latest re

Re: Multiple Indices vs Single Index

2007-09-20 Thread Nikhil Chhaochharia
OK, thanks.

I actually have both systems implemented. The multi-index one is being used 
currently and it works well.  I have deployed the single index solution a few 
times during off-peak hours and the response time has been almost the same as 
the multi-index solution.  I tried to simulate some load but again my numbers 
were mostly similar for both cases.

I have already done all the suggested optimizations since I first ran into 
problems a few months ago.  The performance had improved considerably.  Since 
then, my traffic has increased and I have again started facing some issues 
during peak-load hours.

I guess I should get another box and run proper tests there.  Will run a 
profiler also.

Thanks for all the suggestions.

Regards,
Nikhil


- Original Message 
From: Grant Ingersoll <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, 20 September, 2007 9:25:01 PM
Subject: Re: Multiple Indices vs Single Index

OK, I thought you meant your index would have in it the name of the  
second index and would thus do a two-stage retrieval.

At any rate, if you are saying your combined index with all the  
stored fields is ~3.4 GB I would think it would fit reasonably on the  
machine you have and perform reasonably.  Naturally, this depends on  
your application, your users, etc. and I can't make any guarantees,  
but I certainly recall others managing this size just fine.  See the  
many tips on improving searching and indexing on the Wiki (link at  
bottom in my signature) and do some profiling/testing.

When you said your tests were inconclusive, what tests have you  
done?  If you can, run the tests in a profiler to see where your  
bottlenecks are.

-Grant


On Sep 20, 2007, at 11:16 AM, Nikhil Chhaochharia wrote:

> I am sorry, it seems that I was not clear with what my problem is.   
> I will try to describe it again.
>
> My data is divided into 40 categories and at one time only one  
> category can be searched.  The GUI for the system will ask the user  
> to select the category from a drop-down.  Currently, I have a  
> separate index for every category.  The index sizes varies - one  
> category index is 10MB and another is 700MB.  Other index-sizes are  
> somewhere in between.
>
> I was wondering if it will be better to just have 1 large index  
> with all the 40 indices combined.  I do not need to do dual-queries  
> and my total index size (if I create a single index) is about  
> 3.4GB.  It will increase to maximum of 5-6 GB.  I am running this  
> on a dedicated machine with 8GB RAM.
>
> Unfortunately I do not have enough hardware to run both in parallel  
> and test properly.  Have just one server which is being used by  
> live users.  So it would be great if you could tell me whether I  
> should stick with my 40 indices or combine them into 1 index.  What  
> are the pros and cons of each approach ?
>
> Thanks,
> Nikhil
>
>
> - Original Message 
> From: Grant Ingersoll <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Thursday, 20 September, 2007 7:57:21 PM
> Subject: Re: Multiple Indices vs Single Index
>
> If I understand correctly, you want to do a two stage retrieval
> right?  That is, look up in the initial index (3.4 GB) and then do a
> second search on the sub index?  Presumably, you have to manage the
> Searchers, etc. for each of the sub-indexes as well as the big
> index.  This means you have to go through the hits from the first
> search, then route, etc. correct?
>
> Have you tried creating one single index with all the (stored)
> fields, etc?  Worst case scenario, assuming 1GB per index, is you
> would have a 40GB index, but my guess is index compression will
> reduce it more.  Since you are less than that anyway, have you tried
> just the straightforward solution?  Or do you have other requirements
> that force the sub-index solution?  Also, I am not sure it will work,
> but it seems worth a try.  Of course, this also depends on how much
> you expect your indexes to grow.
>
> Also, what was inconclusive about your tests?  Maybe you can describe
> more what you have tried to date?
>
> Cheers,
> Grant
>
> On Sep 20, 2007, at 3:50 AM, Nikhil Chhaochharia wrote:
>
>> Hi,
>>
>> I have about 40 indices which range in size from 10MB to 700MB.
>> There are quite a few stored fields.  To get an idea of the
>> document size, I have about 400k documents in the 700MB index.
>>
>> Depending on the query, I choose the index which needs to be
>> searched.  Each query hits only one index.  I was wondering if
>> creating a single index where every document will have the
>> indexname as a field will be more efficient.  I created such an
>> index and it was 3.4 GB in size.  My initial performance tests with
>> it are not conclusive.
>>
>> Also, what are the other points to be addressed while deciding
>> between 1 index and 40 indices.
>>
>> I have 8GB RAM on the machine.
>>
>>
>> Thanks,
>> Nikhil
>>
>>
>>
>> --

Re: Multiple Indices vs Single Index

2007-09-20 Thread Grant Ingersoll
If the current version is working well, what is the reason to move?   
Is it just to make management of the indices easier?


On Sep 20, 2007, at 12:07 PM, Nikhil Chhaochharia wrote:


OK, thanks.

I actually have both systems implemented. The multi-index one is  
being used currently and it works well.  I have deployed the single  
index solution a few times during off-peak hours and the response  
time has been almost the same as the multi-index solution.  I tried  
to simulate some load but again my numbers were mostly similar for  
both cases.


I have already done all the suggested optimizations since I first  
ran into problems a few months ago.  The performance had improved  
considerably.  Since then, my traffic has increased and I have  
again started facing some issues during peak-load hours.


I guess I should get another box and run proper tests there.  Will  
run a profiler also.


Thanks for all the suggestions.

Regards,
Nikhil


- Original Message 
From: Grant Ingersoll <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, 20 September, 2007 9:25:01 PM
Subject: Re: Multiple Indices vs Single Index

OK, I thought you meant your index would have in it the name of the
second index and would thus do a two-stage retrieval.

At any rate, if you are saying your combined index with all the
stored fields is ~3.4 GB I would think it would fit reasonably on the
machine you have and perform reasonably.  Naturally, this depends on
your application, your users, etc. and I can't make any guarantees,
but I certainly recall others managing this size just fine.  See the
many tips on improving searching and indexing on the Wiki (link at
bottom in my signature) and do some profiling/testing.

When you said your tests were inconclusive, what tests have you
done?  If you can, run the tests in a profiler to see where your
bottlenecks are.

-Grant


On Sep 20, 2007, at 11:16 AM, Nikhil Chhaochharia wrote:


I am sorry, it seems that I was not clear with what my problem is.
I will try to describe it again.

My data is divided into 40 categories and at one time only one
category can be searched.  The GUI for the system will ask the user
to select the category from a drop-down.  Currently, I have a
separate index for every category.  The index sizes varies - one
category index is 10MB and another is 700MB.  Other index-sizes are
somewhere in between.

I was wondering if it will be better to just have 1 large index
with all the 40 indices combined.  I do not need to do dual-queries
and my total index size (if I create a single index) is about
3.4GB.  It will increase to maximum of 5-6 GB.  I am running this
on a dedicated machine with 8GB RAM.

Unfortunately I do not have enough hardware to run both in parallel
and test properly.  Have just one server which is being used by
live users.  So it would be great if you could tell me whether I
should stick with my 40 indices or combine them into 1 index.  What
are the pros and cons of each approach ?

Thanks,
Nikhil


- Original Message 
From: Grant Ingersoll <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, 20 September, 2007 7:57:21 PM
Subject: Re: Multiple Indices vs Single Index

If I understand correctly, you want to do a two stage retrieval
right?  That is, look up in the initial index (3.4 GB) and then do a
second search on the sub index?  Presumably, you have to manage the
Searchers, etc. for each of the sub-indexes as well as the big
index.  This means you have to go through the hits from the first
search, then route, etc. correct?

Have you tried creating one single index with all the (stored)
fields, etc?  Worst case scenario, assuming 1GB per index, is you
would have a 40GB index, but my guess is index compression will
reduce it more.  Since you are less than that anyway, have you tried
just the straightforward solution?  Or do you have other requirements
that force the sub-index solution?  Also, I am not sure it will work,
but it seems worth a try.  Of course, this also depends on how much
you expect your indexes to grow.

Also, what was inconclusive about your tests?  Maybe you can describe
more what you have tried to date?

Cheers,
Grant

On Sep 20, 2007, at 3:50 AM, Nikhil Chhaochharia wrote:


Hi,

I have about 40 indices which range in size from 10MB to 700MB.
There are quite a few stored fields.  To get an idea of the
document size, I have about 400k documents in the 700MB index.

Depending on the query, I choose the index which needs to be
searched.  Each query hits only one index.  I was wondering if
creating a single index where every document will have the
indexname as a field will be more efficient.  I created such an
index and it was 3.4 GB in size.  My initial performance tests with
it are not conclusive.

Also, what are the other points to be addressed while deciding
between 1 index and 40 indices.

I have 8GB RAM on the machine.


Thanks,
Nikhil



---

highlighting and fragments

2007-09-20 Thread Michael J. Prichard

Hello Folks,

I wanted to stay away from storing text in the indexes in order to keep 
them smaller.  I have a requirement now though to provide highlighting 
and, more so, fragments of the content so they will be displayed on the UI.


Do you all prefer to store the text in the index to make this easier, or 
would you suggest retrieving the text from the source after doing your 
search?  From what I can tell, you need to run through the Hits anyway.


I am trying to keep the indexes as small as possible (they are still 
HUGE...but...) so storing fields is not really what I want to do.  I 
will if it is the best and most efficient way to do so.


Thanks,
Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: highlighting and fragments

2007-09-20 Thread Mark Miller
Lucene's storing functionality is just a simple storage mechanism. You 
can certainly and easily use your own storage mechanism. When you get 
your user created id back from Lucene due to a hit, just pass that id to 
your storage system to get the original text and then feed that to the 
Highlighter. Your storage system/code might be slower than Lucene, but I 
don't believe there is anything about Lucene's system that would give it 
an inside advantage.
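
Roughly like this (a sketch against the contrib Highlighter; "content",
loadTextFromMyStore() and the surrounding variables stand in for your own
field name, storage call, query and analyzer):

  String text = loadTextFromMyStore(id);           // your own storage, not Lucene
  Highlighter highlighter = new Highlighter(new QueryScorer(query));
  TokenStream tokens = analyzer.tokenStream("content", new StringReader(text));
  String fragment = highlighter.getBestFragment(tokens, text);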


- Mark

Michael J. Prichard wrote:

Hello Folks,

I wanted to stay away from storing text in the indexes in order to 
keep them smaller.  I have a requirement now though to provide 
highlighting and, more so, fragments of the content so they will be 
displayed on the UI.


Do you all prefer to store the text in the index to make this easier 
or would you suggest retrieving the text from the source after doing 
your search.  From I can tell you need to run through the Hits anyway


I am trying to keep the indexes as small as possible (they are still 
HUGE...but...) so storing fields is not really what I want to do.  I 
will if it is the best and most efficient way to do so.


Thanks,
Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Question regarding proximity search

2007-09-20 Thread Chris Hostetter

: Is the query "cat dog"~6 same as (cat dog)~6 ?
: I think both case will search for "cat" and "dog" within 6 words each other.
: But I am getting different number of results for the above queries. The
: second one may be the higher. Please clarify this.

i don't believe (cat dog)~6 is even a legal query in the Lucene 
QueryParser syntax ... it isn't documented, and it doesn't work in Lucene 2.2.
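
for reference, the documented way to get a proximity constraint is the quoted
form, which corresponds to a PhraseQuery with a slop when built
programmatically (sketch only; the field name is just an example):

  PhraseQuery pq = new PhraseQuery();
  pq.add(new Term("contents", "cat"));
  pq.add(new Term("contents", "dog"));
  pq.setSlop(6);   // same as the parsed form contents:"cat dog"~6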



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Multiple Indices vs Single Index

2007-09-20 Thread Chris Hostetter

: I was wondering if it will be better to just have 1 large index with all 
: the 40 indices combined.  I do not need to do dual-queries and my total 
: index size (if I create a single index) is about 3.4GB.  It will 
: increase to a maximum of 5-6 GB.  I am running this on a dedicated machine 
: with 8GB RAM.

Off the top of my head, there are three main reasons I can think of that would 
motivate one choice over the other -- ultimately it's up to you...

1) FieldCache and sorting ... if all 40 sets of documents have 
consistently named fields, then there won't be much difference 
between 40 indexes and 1 index ... but if each of those 40 sets contains 
documents with radically different fields -- and you want to sort on N 
different fields for each set -- then the total FieldCache sizes for 
those 40 separate indexes will be smaller than the FieldCaches for one giant 
index (because every document gets an entry whether it makes sense or 
not).

2) idf statistics ... if you have common fields that you search regardless of 
document set, the 40-index approach will maintain separate statistics -- 
this may be important if some terms are very common in only some docsets.  
The word "albino" may be really common in docset A while only one doc in 
docset B has it ... in the 40-index approach, querying B for (albino 
elephant) will give a lot of weight to albino because it's so rare, but in 
the single-index case albino may not be considered as significant because 
of the unified idf value across all docsets (even if the query is constrained 
to docset B) ... again: this only matters if the fields overlap; if every 
docset has a unique set of fields then the idfs will be unique, because 
they are computed per field.  (A sketch for comparing the two score 
explanations follows this list.)

3) management: it's probably a lot simpler to maintain and manage code 
that deals with one index than code that deals with 40 indexes. Your mileage 
may vary.
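
A minimal sketch for comparing the score explanations mentioned in point 2,
assuming the Lucene 2.x API; the index paths, field name and doc ids are
hypothetical, and the internal doc ids would have to be looked up per index:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class IdfCheck {
    public static void main(String[] args) throws Exception {
        Query q = new QueryParser("content", new StandardAnalyzer())
                      .parse("albino elephant");

        IndexSearcher small = new IndexSearcher("/indexes/docsetB");
        IndexSearcher combined = new IndexSearcher("/indexes/combined");

        int docInSmall = 0;      // internal doc ids differ per index
        int docInCombined = 0;

        Explanation a = small.explain(q, docInSmall);        // idf from docset B only
        Explanation b = combined.explain(q, docInCombined);  // idf over all 40 docsets
        System.out.println(a);
        System.out.println(b);
    }
}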


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Question regarding proximity search

2007-09-20 Thread Sonu SR
Thanks, Hoss, for the reply.  I am using Lucene 2.1.
I checked the parsed Lucene syntax (using Query.toString()) in both cases
and found that the second one is not actually converted to a proximity query.

"cat dog"~6 is converted to ABST:"cat dog"~4 and
(cat dog)~6 is converted to +ABST:cat +ABST:dog.

That is, the proximity operator is discarded in the second case.
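
A minimal sketch of that check, assuming the Lucene 2.x QueryParser, the "ABST"
default field from the output above, and a default AND operator (an assumption,
chosen because it matches the +ABST:cat +ABST:dog form reported):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class ParseCheck {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("ABST", new StandardAnalyzer());
        parser.setDefaultOperator(QueryParser.AND_OPERATOR);

        Query phrase = parser.parse("\"cat dog\"~6");  // sloppy phrase (proximity) query
        System.out.println(phrase);                    // e.g. ABST:"cat dog"~6

        // Per this thread: in 2.1 the trailing ~6 is silently dropped and this
        // becomes a plain BooleanQuery; in 2.2 it may throw a ParseException.
        Query parens = parser.parse("(cat dog)~6");
        System.out.println(parens);                    // e.g. +ABST:cat +ABST:dog
    }
}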




Re: Question regarding proximity search

2007-09-20 Thread Chris Hostetter

: I checked the parsed Lucene syntax (using Query.toString()) in both cases
: and found that the second one is not actually converted to a proximity query.

I don't think you understood what I was trying to say...

Using parens with a "~" character after them is not currently, and has never 
been (to my knowledge), a means of creating a "proximity query".  It is not 
documented in 2.2, 2.1, 2.0, 1.9, or 1.4.3, and it is not legal syntax in 2.2 
(it causes a parse exception).  In Lucene, the way to do proximity-based 
queries is either with SpanNearQueries or with PhraseQueries -- the way 
to create a PhraseQuery using the Lucene QueryParser is with the quote 
character '"'.

There is no reason to expect  (cat dog)~3  to create a 
proximity query.
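
A minimal sketch of both programmatic options, assuming the Lucene 2.x API and a
hypothetical "content" field:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class ProximityQueries {
    public static void main(String[] args) {
        // Sloppy PhraseQuery: what the QueryParser builds for "cat dog"~6
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("content", "cat"));
        phrase.add(new Term("content", "dog"));
        phrase.setSlop(6);

        // SpanNearQuery: cat and dog within 6 positions, in any order
        SpanQuery[] clauses = new SpanQuery[] {
            new SpanTermQuery(new Term("content", "cat")),
            new SpanTermQuery(new Term("content", "dog"))
        };
        SpanNearQuery near = new SpanNearQuery(clauses, 6, false);

        System.out.println(phrase);  // content:"cat dog"~6
        System.out.println(near);    // spanNear([content:cat, content:dog], 6, false)
    }
}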



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Multiple Indices vs Single Index

2007-09-20 Thread Nikhil Chhaochharia
Thanks Grant and Chris for the replies.

I am looking at a single index because the 40-index system has started having 
performance issues at high load.  My daily traffic is increasing at a steady 
pace, about 40% of the traffic is concentrated in a 2-hour period, and 
searches are starting to slow down just a little.

All my indices have exactly the same fields, so I guess FieldCache and sorting 
are not an issue.  I do not do any updates, so I just open 40 searchers and 
store them in a HashMap.  When a query comes in, I select the appropriate searcher 
and fire the query.  So code maintenance is also not an issue.
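
A minimal sketch of that per-index searcher lookup, assuming the Lucene 2.x API;
index names and paths are hypothetical:

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearcherRegistry {
    // indexName -> IndexSearcher; opened once at startup and never reopened,
    // because the indexes are never updated.
    private final Map<String, IndexSearcher> searchers =
        new HashMap<String, IndexSearcher>();

    public void open(String indexName, String indexPath) throws Exception {
        searchers.put(indexName, new IndexSearcher(indexPath));
    }

    public Hits search(String indexName, Query query) throws Exception {
        return searchers.get(indexName).search(query);
    }
}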

The different idf statistics are a valid point.  I have not analysed the 
difference in the results returned -- I sort of assumed the results would be the 
same in both cases.  I will look at the difference in results.


There is one more point.  Sooner or later, I will have to move to multiple 
servers due to increasing traffic (I will be handling 30,000+ hits per hour 
by the end of the year).  One option is to have the full index on all servers and 
load-balance across them.  Another is to have half the indices on one server and 
half of them on the other; the front-end (a separate server) would then fire the 
query at the appropriate server.  Any suggestions on which would be the better 
choice?  All data on all servers gives me redundancy -- the system stays up even 
if one server goes down -- and adding more servers would be trivial.


Thanks,
Nikhil







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



About the search efficiency based on document's length

2007-09-20 Thread Jarvis
Hi everyone,

There is a question about the document’s length and search efficiency.

Think of this situation:

Two ways to index some HTML pages (ignoring some details): one is to both
store and index the HTML content in the Lucene index, the other is to just
index the content.  For the first method, is there an efficiency problem
compared to the second, besides the increase in index (folder) size?
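
For reference, a minimal sketch of the two indexing choices being compared,
assuming the Lucene 2.x API; field and class names are illustrative only:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class StoredVsIndexedOnly {
    public static Document build(String htmlText, boolean storeContent) {
        Document doc = new Document();
        if (storeContent) {
            // stored AND indexed: larger index, but the text can be read back later
            doc.add(new Field("content", htmlText,
                              Field.Store.YES, Field.Index.TOKENIZED));
        } else {
            // indexed only: searchable, but the original text is not retrievable
            doc.add(new Field("content", htmlText,
                              Field.Store.NO, Field.Index.TOKENIZED));
        }
        return doc;
    }
}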

Thanks,
Jarvis



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: About the search efficiency based on document's length

2007-09-20 Thread Karl Wettin



Not sure I understand your question, but I'll give it a go.

As far as I know, storing data in a document will not affect search
speed. However, loading large amounts of data into a Document will of
course consume resources. Therefore it is possible to pass a
FieldSelector to the IndexReader when you retrieve a Document,
allowing you to define which fields to ignore, load, lazy-load, etc.
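
A minimal sketch of that, assuming the Lucene 2.x API; the field names are
hypothetical:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.MapFieldSelector;
import org.apache.lucene.index.IndexReader;

public class SelectiveLoad {
    public static Document loadSmallFieldsOnly(IndexReader reader, int docId)
            throws Exception {
        // Only "title" and "url" are loaded; a large stored "content" field is skipped.
        FieldSelector selector = new MapFieldSelector(new String[] { "title", "url" });
        return reader.document(docId, selector);
    }
}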


I hope this helps.

--
karl
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]