Re: Retrieving large num of docs

2010-01-07 Thread Otis Gospodnetic
Strange.  Did you ever figure out the source of the performance difference?

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch






Re: Retrieving large num of docs

2009-12-05 Thread Raghuveer Kancherla
Hi Otis,
I think my experiments are not conclusive about the reduction in search time. I
was playing around with various configurations to reduce the time to
retrieve documents from Solr. I am sure that after changing the two multi-valued
text fields from stored to un-stored, retrieval time (query time + time to
load the stored fields) became very fast. I was expecting the
enableLazyFieldLoading setting in solrconfig to take care of this, but apparently
it is not working as expected.

Out of curiosity, I removed these 2 fields from the index (this time I am
not even indexing them) and my search time got better (10 times better).
However, I am still trying to isolate the reason for the search time
reduction. It may be because of 2 fewer fields to search in, or because of
the reduction in the size of the index, or maybe something else. I am not
sure if enableLazyFieldLoading has any part in explaining this.

- Raghu
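
A sketch of one way to isolate search cost from retrieval cost, using the
localhost:1212 endpoint and ResumeAllText field that appear in the URLs
elsewhere in this thread (the query itself is illustrative): a rows=0 request
searches and ranks but loads no stored fields, so the gap between it and a
normal request approximates the retrieval overhead.

# Pure search + ranking, no stored fields loaded; wall time should be
# close to the QTime in the Solr log:
$ /usr/bin/time -p curl -sS -o /dev/null \
    "http://localhost:1212/solr/select/?q=ResumeAllText:(java+j2ee)&rows=0&wt=python"

# Same search plus loading and writing 300 documents; the extra wall
# time relative to the rows=0 run is spent on retrieval:
$ /usr/bin/time -p curl -sS -o /dev/null \
    "http://localhost:1212/solr/select/?q=ResumeAllText:(java+j2ee)&rows=300&wt=python"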






Re: Retrieving large num of docs

2009-12-03 Thread Raghuveer Kancherla
Hi Hoss,

I was experimenting with various queries to solve this problem, and in one
such test I remember that requesting only the ID did not change the
retrieval time. To be sure, I tested it again using the curl command today,
and it confirmed my previous observation.

Also, the enableLazyFieldLoading setting is set to true in my solrconfig.

Another general observation (off topic) is that having a moderately large
multi-valued text field (~200 entries) in the index seems to slow down the
search significantly. I removed the 2 multi-valued text fields from my index
and my search got ~10 times faster. :)

- Raghu
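
A minimal version of that curl test (endpoint and query assumed from the URL
elsewhere in this thread): if enableLazyFieldLoading were being honored, the
fl=id run should not pay for the large stored fields, so near-identical wall
times for the two runs are what suggest it is not working.

# Request only the unique key:
$ /usr/bin/time -p curl -sS -o /dev/null \
    "http://localhost:1212/solr/select/?q=ResumeAllText:(java+j2ee)&rows=300&fl=id&wt=python"

# Request all stored fields:
$ /usr/bin/time -p curl -sS -o /dev/null \
    "http://localhost:1212/solr/select/?q=ResumeAllText:(java+j2ee)&rows=300&wt=python"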




Re: Retrieving large num of docs

2009-12-03 Thread Otis Gospodnetic
Hm, hm, interesting.  I was looking into something like this the other day (BIG 
indexed+stored text fields).  After seeing enableLazyFieldLoading=true in 
solrconfig, and after seeing that fl didn't include those big fields, I thought 
"hm, so Lucene/Solr will not be pulling those large fields from disk, OK."

You are saying that this may not be true based on your experiment?
And what I'm calling "your experiment" means that you reindexed the same data, 
but without the 2 multi-valued text fields... and that was the only change you 
made, and you got roughly a 10x search performance improvement?

Sorry for repeating your words, just trying to confirm and understand.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch






Re: Retrieving large num of docs

2009-12-02 Thread Chris Hostetter

: I think I solved the problem of retrieving 300 docs per request for now. The
: problem was that I was storing 2 moderately large multivalued text fields
: though I was not retrieving them during search time.  I reindexed all my
: data without storing these fields. Now the response time (time for Solr to
: return the http response) is very close to the QTime Solr is showing in the

Hmmm

two comments:

1) the example URL from your previous mail...

:  
http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python

...doesn't match your earlier statement that you are only returning the id 
field (there is no fl param in that URL) ... are you certain you weren't 
returning those large stored fields in the response?

2) assuming you were actually using an fl param to limit the fields, make 
sure you have this setting in your solrconfig.xml...

<enableLazyFieldLoading>true</enableLazyFieldLoading>

...that should make it pretty fast to return only a few fields of each 
document, even if you do have some jumbo stored fields that aren't being 
returned.



-Hoss
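
For illustration, the earlier URL with an fl param added to limit the
response to the unique key would look like this (sketch):

http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&fl=id&start=0&wt=python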



Re: Retrieving large num of docs

2009-12-01 Thread Raghuveer Kancherla
Hi Hoss/Andrew,
I think I solved the problem of retrieving 300 docs per request for now. The
problem was that I was storing 2 moderately large multi-valued text fields
even though I was not retrieving them at search time.  I reindexed all my
data without storing these fields. Now the response time (time for Solr to
return the http response) is very close to the QTime Solr is showing in the
logs.

Thanks for all the help,
Raghu





Re: Retrieving large num of docs

2009-11-29 Thread Chris Hostetter

: I am using Solr1.4 for searching through half a million documents. The
: problem is, I want to retrieve nearly 200 documents for each search query.
: The query time in Solr logs is showing 0.02 seconds and I am fairly happy
: with that. However Solr is taking a long time (4 to 5 secs) to return the
: results (I think it is because of the number of docs I am requesting). I
: tried returning only the id's (unique key) without any other stored fields,
: but it is not helping me improve the response times (time to return the id's
: of matching documents).

What exactly does your request URL look like, and how exactly are you 
timing the total response time?

200 isn't a very big number for the rows param -- people who want to get 
100K documents back in their response at a time may have problems, but 200 
is not that big.

so like i said: how exactly are you timing things?

My guess: it's more likely that network overhead or the performance of 
your client code (reading the data off the wire) is causing your timing 
code to seem slow, than that Solr is taking 5 seconds to write out 
those document IDs.

I suspect if you try hitting the same exact URL using curl via localhost, 
you'll see the total response time be a lot less than 5 seconds.

Here's an example of a query that asks solr to return *every* field from 
500 documents, in the XML format.  And these are not small documents...

$ /usr/bin/time -p curl -sS -o /tmp/solr.out \
    "http://localhost:5051/solr/select/?q=doctype:product&version=2.2&start=0&rows=500&indent=on"
real 0.07
user 0.00
sys 0.00
[chr...@c18-ssa-so-dfll-qry1 ~]$ du -sh /tmp/solr.out 
1.6M    /tmp/solr.out

...that's 1.6 MB of 500 Solr documents with all of their fields in 
verbose XML format (including indenting) fetched in 70ms.

If it's taking 5 seconds for you to get just the ids of 200 docs, you've 
got a problem somewhere and i'm 99% certain it's not in Solr.

what does a similar "time curl" command for your URL look like when you 
run it on your solr server?


-Hoss



Re: Retrieving large num of docs

2009-11-29 Thread Raghuveer Kancherla
Thanks Hoss,
In my previous mail, I was measuring the system time difference between
sending an (http) request and receiving a response. This was being run on a
(different) client machine.

Like you suggested, I tried to time the response on the server itself as
follows:

$ /usr/bin/time -p curl -sS -o solr.out \
    "http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python"

real 3.49
user 0.00
sys 0.00

The query time in the solr log shows me QTime=600.
The size of solr.out is 843 kB.

As you've mentioned, Solr shouldn't give these kinds of numbers for 300 docs,
and we're quite perplexed as to what's going on.

Thanks,
Raghu
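
One way to see where the gap between QTime (600 ms) and wall time (3.49 s)
goes is curl's built-in timing variables: time-to-first-byte roughly covers
the search itself, while the remainder covers loading the stored fields,
writing the response, and the transfer. A sketch against the same endpoint:

$ curl -sS -o solr.out \
    -w "first_byte: %{time_starttransfer}s total: %{time_total}s bytes: %{size_download}\n" \
    "http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python"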






Re: Retrieving large num of docs

2009-11-28 Thread Raghuveer Kancherla
Hi Andrew,
I applied the patch you suggested. I am not finding any significant changes
in the response times.
I am wondering if I forgot some important configuration setting etc.
Here is what I did:

   1. Wrote a small program using solrj to use EmbeddedSolrServer (most of
   the code is from the solr wiki) and run the server on an index of ~700k docs
   and note down the avg response time
   2. Applied the SOLR-797.patch to the source code of Solr1.4
   3. Compiled the source code and rebuilt the jar files.
   4. Reran step 1 using the new jar files.

Am I supposed to make any other config changes in order to see the performance
jump that you are able to achieve?

Thanks a lot,
Raghu
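
For reference, steps 2 and 3 above would look something like the following
(the source directory name is hypothetical; Solr 1.4 builds with ant):

$ cd apache-solr-1.4.0-src    # hypothetical path to the Solr 1.4 source
$ patch -p0 < SOLR-797.patch  # step 2: apply the patch
$ ant dist                    # step 3: rebuild the solr jars/war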







Re: Retrieving large num of docs

2009-11-28 Thread Andrey Klochkov
Hi Raghu

Let me describe our use case in more detail. Probably that will clarify
things.

The usual use case for Lucene/Solr is retrieving a small portion of the
result set (10-20 documents). In our case we need to read the whole result
set, and this creates a huge load on the Lucene index, meaning a lot of IO. Keep in
mind that we have a large number of stored fields in the index.

In our case there's one thing that makes things simpler: our index is so
small that we can get every document in cache. This means that even if we
retrieve all documents for every result set, we don't retrieve them from the
Lucene index, and then the performance should be OK. But here we've got 2
problems:

1. Solr caches Lucene's Document instances, and in the case of retrieving the
whole result set it recreates SolrDocument instances every time. This
creates a load on the CPU and in particular on Java GC.
2. EmbeddedSolrServer converts the whole response into a byte array and then
restores it back, converting Lucene's Documents and DocLists to Solr's
SolrDocument and SolrDocumentList instances. This creates additional load on
the CPU and GC.

We patched Solr to eliminate those things and that fixed our performance
problems.

I think that if you don't place all your documents in caches, and/or you
don't use stored fields (retrieving only the ID field), then those
improvements probably won't help you.

I suggest you first find your bottlenecks. Look at IO, memory usage, etc.
Using a profiler is the best thing too. Probably you can use some tools from
Lucid Imagination for profiling.




-- 
Andrew Klochkov
Senior Software Engineer,
Grid Dynamics


Re: Retrieving large num of docs

2009-11-27 Thread Raghuveer Kancherla
Hi Andrew,
We are running solr using its http interface from python. From the resources
I could find, EmbeddedSolrServer is possible only if I am using solr from a
java program.  It will be useful to understand if a significant part of the
performance increase is due to bypassing HTTP before going down this path.

In the mean time I am trying my luck with the other suggestions. Can you
share the patch that helps cache solr documents instead of lucene documents?


On a different note, I am wondering why it takes 4-5 seconds for Solr
to return the IDs of ranked documents when it can rank the results in about
20 milliseconds. Am I missing something here?

Thanks,
Raghu






Re: Retrieving large num of docs

2009-11-27 Thread AHMET ARSLAN

Maybe these links can help:
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://www.lucidimagination.com/Downloads/LucidGaze-for-Solr

How often do you update your index?
Is your index optimized?
Configuring caching can also help:

http://wiki.apache.org/solr/SolrCaching
http://wiki.apache.org/solr/SolrPerformanceFactors
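
On the caching side, the piece of solrconfig.xml most directly relevant to
retrieving whole documents is the documentCache. A hedged example (the sizes
are illustrative placeholders, not tuned values; the documentCache cannot be
autowarmed, hence autowarmCount=0):

<documentCache class="solr.LRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>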







Retrieving large num of docs

2009-11-26 Thread Raghuveer Kancherla
Hi,
I am using Solr1.4 for searching through half a million documents. The
problem is, I want to retrieve nearly 200 documents for each search query.
The query time in Solr logs is showing 0.02 seconds and I am fairly happy
with that. However Solr is taking a long time (4 to 5 secs) to return the
results (I think it is because of the number of docs I am requesting). I
tried returning only the id's (unique key) without any other stored fields,
but it is not helping me improve the response times (time to return the id's
of matching documents).
I understand that retrieving 200 documents for each search term is
impractical in most scenarios, but I don't have any other option. Any pointers
on how to improve the response times will be a great help.

Thanks,
 Raghu


Re: Retrieving large num of docs

2009-11-26 Thread Andrey Klochkov
Hi

We obtain ALL documents for every query; the index size is about 50k. We use
a number of stored fields. Often the result set size is several thousand
docs.

We performed the following things to make it faster:

1. Use EmbeddedSolrServer
2. Patch Solr to avoid unnecessary marshalling while using
EmbeddedSolrServer (there's an issue in Solr JIRA)
3. Patch Solr to cache SolrDocument instances instead of Lucene's Document
instances. I was going to share this patch, but then decided that our usage
of Solr is not common and this functionality is useless in most cases.
4. We have all documents in cache.
5. In fact our index is stored in a data grid, not a file system. But as
tests showed, this is not important, because standard FSDirectory is faster if
you have enough RAM free for OS caches.

These changes improved the performance very much, so in the end we have
performance comparable (about 3-5 times slower) to proper Solr usage
(obtaining the first 20 documents).

To get more details on how different Solr components perform, we injected
perf4j statements into key points in the code. And a profiler was helpful
too.

Hope it helps somehow.




-- 
Andrew Klochkov
Senior Software Engineer,
Grid Dynamics