solr4 performance question

2014-04-08 Thread Joshi, Shital
Hi,

We have a 10-node Solr Cloud (5 shards, 2 replicas) with a 30 GB JVM on 60 GB 
machines and 40 GB of index. 
We're constantly noticing that Solr queries take longer while an update (with 
the commit=false setting) is in progress. A query which usually takes 0.5 seconds 
can take up to 2 minutes while updates are in progress. And it is not the case 
with all queries; the behavior is very sporadic.
 
Any pointers to nail down this issue would be appreciated. 

Is there a way to find how much of a query result came from cache? Can we 
enable any log settings to start printing what came from cache vs. what was 
queried?

Thanks!


Re: solr4 performance question

2014-04-08 Thread Erick Erickson
What do you have for your _softcommit_ settings in solrconfig.xml? I'm
guessing you're using SolrJ or similar, but the solrconfig settings
will trip a commit as well.

For that matter, what are all your commit settings in solrconfig.xml,
both hard and soft?

Best,
Erick



Re: solr4 performance question

2014-04-08 Thread Furkan KAMACI
Hi Joshi;

Click the Plugins / Stats section under your collection in the Solr Admin UI.
You will see the cache statistics for the different types of caches; hitratio
and evictions are good statistics to look at first. On the other hand, you
should read here: https://wiki.apache.org/solr/SolrPerformanceFactors
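
If you prefer to pull the same numbers over HTTP, they are also exposed by
the per-core mbeans handler; a minimal sketch (host, port and core name are
placeholders for yours):

curl "http://localhost:8983/solr/collection1/admin/mbeans?cat=CACHE&stats=true&wt=json"

That returns lookups, hits, hitratio, inserts and evictions for
queryResultCache, filterCache, documentCache, etc.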

Thanks;
Furkan KAMACI





RE: solr4 performance question

2014-04-08 Thread Joshi, Shital
We don't do any soft commit. This is our hard commit setting:

<autoCommit>
   <maxTime>${solr.autoCommit.maxTime:600000}</maxTime>
   <maxDocs>100000</maxDocs>
   <openSearcher>true</openSearcher>
</autoCommit>

We use this update command: 

solr_command=$(cat <<EnD
time zcat --force $file2load | /usr/bin/curl --proxy  --silent --show-error 
--max-time 3600 \
http://$solr_url/solr/$solr_core/update/csv?\
commit=false&\
separator=|&\
escape=\\&\
trim=true&\
header=false&\
skipLines=2&\
overwrite=true&\
_shard_=$shardid&\
fieldnames=$fieldnames&\
f.cs_rep.split=true&\
f.cs_rep.separator=%5E  --data-binary @-  -H 'Content-type:text/plain; 
charset=utf-8'
EnD)


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, April 08, 2014 2:21 PM
To: solr-user@lucene.apache.org
Subject: Re: solr4 performance question



Re: solr4 performance question

2014-04-08 Thread Erick Erickson
bq:   <maxTime>${solr.autoCommit.maxTime:600000}</maxTime>
   <maxDocs>100000</maxDocs>
   <openSearcher>true</openSearcher>

Every 100K documents or 10 minutes (whichever comes first) your
current searchers will be closed and a new searcher opened, and all the
warmup queries etc. happen. I suspect you're not doing much with
autowarming and/or newSearcher queries, so occasionally your search has
to wait for caches to be read, terms to be populated, etc.

Some possibilities to test this:
1) Create some newSearcher queries in solrconfig.xml.
2) Specify a reasonable autowarm count for queryResultCache (don't go
crazy here, start with 16 or something similar).
3) Set openSearcher to false above. In this case you won't be able to
see new documents until either a hard or soft commit happens; you
could cure this with a single hard commit at the end of your indexing
run. It all depends on what latency you can tolerate in terms of
searching newly-indexed documents.
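
A bare-bones sketch of 1-3 in solrconfig.xml, with the warming query, cache
sizes and commit intervals as placeholders you'd tune:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">some_sort_field desc</str></lst>
  </arr>
</listener>

<queryResultCache class="solr.LRUCache" size="512" initialSize="512"
                  autowarmCount="16"/>

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:600000}</maxTime>
  <maxDocs>100000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>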

Here's a reference...

http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Best,
Erick



Performance Question: 'facets.missing'

2013-11-06 Thread andres
I'm debating whether or not to set the 'facets.missing' parameter to true by
default when faceting. What is the performance impact of setting
'facets.missing' to true?






Re: Performance Question: 'facets.missing'

2013-11-06 Thread Yonik Seeley
On Wed, Nov 6, 2013 at 12:07 PM, andres and...@octopart.com wrote:
 I'm debating whether or not to set the 'facets.missing' parameter to true by
 default when faceting. What is the performance impact of setting
 'facets.missing' to true?

It really depends on the faceting method.  For some faceting methods
(like enum), the first time on a new view of the index can be somewhat
expensive, but then the set of docs that have a value in the field
should be cached and it will be very cheap.  Other facet methods
should be cheap regardless.
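
For instance, a request along these lines exercises it (note the actual
parameter name is facet.missing; the core and field names here are
placeholders):

curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.field=manu&facet.missing=true&facet.method=enum&wt=json"

The missing count comes back as one extra, unlabeled bucket at the end of
the facet counts for that field.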

-Yonik
http://heliosearch.com -- making solr shine


Re: Solr4 update and query performance question

2013-08-15 Thread Erick Erickson
bq: There is no batching while updating/inserting documents in Solr3

Correct, but all the updates only went to the server you targeted them at.
The batching you're seeing is the auto-distribution of docs to the various
shards, a whole different animal.

Keep an eye on: https://issues.apache.org/jira/browse/SOLR-4816. You might
prompt Joel to see if this is testable. This JIRA routes the docs directly
to the leader of the shard they should go to. IOW it does the routing on
the client side. There will still be batching from the leader to the
replicas, but this should help.

It is usually a Bad Thing to commit after every batch from the client, in
either Solr 3 or Solr 4. I suspect you're right that the wait for all the
searchers on all the shards is one of your problems. Try configuring
autocommit (both hard and soft) in solrconfig.xml and dropping the commit
bits from the client. This is the usual pattern in Solr4.

Your soft commit setting (which may be commented out) controls when the
documents become searchable. A soft commit is less expensive than a hard
commit with openSearcher=true and makes docs visible. A hard commit closes
the current segment and opens a new one. So my recommendation would be
openSearcher=false for your hard commit and a soft commit interval of
whatever latency you can stand.
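
Concretely, something like this in solrconfig.xml (the intervals are
placeholders; tune them to the latency you can tolerate):

<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>15000</maxTime>
</autoSoftCommit>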

Final note: if you set your hard commit with openSearcher=false, do it
fairly often, since it truncates the transaction logs and is quite
inexpensive. If you let your tlog grow huge and then kill your server and
restart Solr, you can get into a situation where Solr replays the tlog; if
it has a bazillion docs in it, that can take a very long time at startup.

Best
Erick





RE: Solr4 update and query performance question

2013-08-14 Thread Joshi, Shital
We didn't copy/paste the Solr3 config to Solr4. We started with the Solr4 config 
and only updated the new searcher queries and a few other things.

There is no batching while updating/inserting documents in Solr3, is that 
correct? Committing 1000 documents in Solr3 takes 19 seconds while in Solr4 it 
takes about 3-4 minutes. We noticed in the Solr4 logs that commit only returns 
after a new searcher is created across all nodes. This is possibly because 
waitSearcher=true by default in Solr4. This was not the case with Solr3, where 
commit would return without waiting for new searcher creation. 

In order to improve performance with Solr4, we first changed from commit=true 
to commit=false in the update URL and added the autoHardCommit setting in 
solrconfig.xml. This improved performance from 3-4 minutes to 1-2 minutes but 
that is not good enough. 

Then we changed the maxBufferedAddsPerServer value in the SolrCmdDistributor 
class from 10 to 1000, deployed this class in the 
$JETTY_TEMP_FOLDER/solr-webapp/webapp/WEB-INF/classes folder, and restarted the 
Solr4 nodes. But we still see a batch size of 10 being used. Did we change the 
correct variable/class? 

Next we will try using softCommit=true in the update URL and check if it 
gives us the desired performance. 

Thanks for looking into this. Appreciate your help. 

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, August 13, 2013 8:12 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr4 update and query performance question






Re: Solr4 update and query performance question

2013-08-13 Thread Erick Erickson
1) That's hard-coded at present. There's anecdotal evidence that there
   are throughput improvements with larger batch sizes, but no action
   yet.
2) Yep, all searchers are also re-opened, caches re-warmed, etc.
3) Odd. I'm assuming your Solr3 was a master/slave setup? Seeing the
   queries would help diagnose this. Also, did you try to copy/paste
   the configuration from your Solr3 to Solr4? I'd start with the
   Solr4 config and copy/paste only the parts needed from your Solr3 setup.

Best
Erick







Solr4 update and query performance question

2013-08-12 Thread Joshi, Shital
Hi,

We have a SolrCloud (4.4.0) cluster (5 shards and 2 replicas) on 10 boxes with 
about 450 mil documents (~90 mil per shard). We're loading 1000 or fewer 
documents in CSV format every few minutes. In Solr3, with 300 mil documents, it 
used to take 30 seconds to load 1000 documents, while in Solr4 it's taking up to 
3 minutes to load 1000 documents. We're using custom sharding; we include the 
_shard_=shardid parameter in the update command. Upon looking at the Solr4 log 
files we found that:

1.   Documents are added in batches of 10 records. How do we increase this 
batch size from 10 to 1000 documents?

2.  We do a hard commit after loading 1000 documents. For every hard commit, 
the searcher is refreshed on all nodes. Are all caches also refreshed when a 
hard commit happens? We're planning to change to soft commits and do an auto 
hard commit every 10-15 minutes.

3.  We're not seeing improved query performance compared to Solr3. Queries 
which took 3-5 seconds in Solr3 (300 mil docs) are taking 20 seconds with 
Solr4. We think this could be due to the frequent hard commits and searcher 
refreshes. Do you think that when we change to soft commits and increase the 
batch size, we will see better query performance?

Thanks!




Re: Performance question on Spatial Search

2013-08-05 Thread Steven Bower
So after re-feeding our data with a new boolean field that is true when
data exists and false when it doesn't, our search times have gone from an avg
of about 20s to around 150ms... a pretty amazing change in perf... It seems
like https://issues.apache.org/jira/browse/SOLR-5093 might alleviate many
people's pain in doing this kind of query (if I have some time I may take a
look at it)..
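
For reference, the query now just carries one extra cheap filter ahead of the
spatial one, roughly like this (has_geo is a made-up name for the new boolean
field):

fq=has_geo:true
fq={!cache=false}<the geo query>

so docs with no data never reach the expensive spatial filter.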

Anyway, we are in pretty good shape at this point.. the only remaining issue
is that the first queries after commits are taking 5-6s... This is caused by
the loading of 2 FieldCaches (one long and one int, via uninversion) that are
used for sorting.. I'm suspecting that docvalues will greatly help this
load performance?

thanks,

steve



Re: Performance question on Spatial Search

2013-08-05 Thread Shawn Heisey
On 8/5/2013 7:13 AM, Steven Bower wrote:

I would handle this by using newSearcher events in the config to search
for all documents (*:*) with your desired sort parameters.  That way,
the fieldcache will be pre-populated before the new searcher accepts any
queries.  The old searcher will continue to handle queries while this is
happening.
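
Something like this, with the placeholder sort fields swapped for yours:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="sort">my_long_field desc, my_int_field asc</str>
    </lst>
  </arr>
</listener>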

Be aware that this will increase your commit time, which might mean that
you need to decrease your autowarmCount values on your Solr caches to
compensate.

If you have removed this section from your solrconfig.xml file, see the
example config.

Thanks,
Shawn



Re: Performance question on Spatial Search

2013-08-05 Thread David Smiley (@MITRE.org)

From: Steven Bower-2 [via Lucene] <ml-node+s472066n4082569...@n3.nabble.com>
Date: Monday, August 5, 2013 9:14 AM
To: Smiley, David W. <dsmi...@mitre.org>
Subject: Re: Performance question on Spatial Search

So after re-feeding our data with a new boolean field that is true when
data exists and false when it doesn't, our search times have gone from an avg
of about 20s to around 150ms... a pretty amazing change in perf... It seems
like https://issues.apache.org/jira/browse/SOLR-5093 might alleviate many
people's pain in doing this kind of query (if I have some time I may take a
look at it)..

Awesome performance improvement!


Anyway, we are in pretty good shape at this point.. the only remaining issue
is that the first queries after commits are taking 5-6s... This is caused by
the loading of 2 FieldCaches (one long and one int, via uninversion) that are
used for sorting.. I'm suspecting that docvalues will greatly help this
load performance?

DocValues will help a lot.  I'd love to see the before & after times on that 
conversion.  I'm surprised it's taking as long as it is… but then you have a 
ton of data in one index so it's plausible.  Lucene 4.4 has some compression 
improvements there: LUCENE-5035
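
Turning it on is just a schema attribute plus a full re-index; roughly, with
the field name and type standing in for yours:

<field name="my_sort_field" type="long" indexed="true" stored="true"
       docValues="true"/>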


~ David




-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book

Re: Performance question on Spatial Search

2013-07-31 Thread Mikhail Khludnev
On Wed, Jul 31, 2013 at 1:10 AM, Steven Bower sbo...@alcyon.net wrote:


 not sure what you mean by good hit raitio?


I mean such queries are really expensive (even on a cache hit), so if the
list of ids changes every time, it never hits the cache and hence executes
these heavy queries every time. It's a well-known performance problem.
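
For instance, a filter like

fq=id:(101 202 303)

is cached keyed on the exact query string, so if the id list differs on every
request it never hits the filter cache and the heavy work repeats each time
(the ids here are made up).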


 Here are the stacks...

they seem like hotspots, and show index reading, which is reasonable. But I
can't see what caused these reads; to get that I need the whole stack of the
hot thread.




Re: Performance question on Spatial Search

2013-07-31 Thread Steven Bower
the list of IDs does change relatively frequently, but this doesn't seem to
have very much impact on the performance of the query as far as I can tell.

attached are the stacks

thanks,

steve



Re: Performance question on Spatial Search

2013-07-30 Thread Steven Bower
 the results with LatLonType.
 
  ~ David Smiley
 
 
 Steven Bower wrote
 @Erick it is a lot of hw, but basically trying to create a best case
 scenario to take HW out of the question. Will try increasing heap size
 tomorrow.. I haven't seen it get close to the max heap size yet.. but it's
 worth trying...

 Note that these queries look something like:

 q=*:*
 fq=[date range]
 fq=geo query

 on the fq for the geo query I've added {!cache=false} to prevent it from
 ending up in the filter cache.. once it's in the filter cache queries come
 back in 10-20ms. For my use case I need the first unique geo search query
 to come back in a more reasonable time so I am currently ignoring the cache.

 @Bill will look into that, I'm not certain it will support the particular
 queries that are being executed but I'll investigate..

 steve
  
  
 On Mon, Jul 29, 2013 at 6:25 PM, Erick Erickson <erickerickson@...> wrote:
  
   This is very strange. I'd expect slow queries on
   the first few queries while these caches were
   warmed, but after that I'd expect things to
   be quite fast.
  
   For a 12G index and 256G RAM, you have on the
   surface a LOT of hardware to throw at this problem.
   You can _try_ giving the JVM, say, 18G but that
   really shouldn't be a big issue, your index files
   should be MMaped.
  
   Let's try the crude thing first and give the JVM
   more memory.
  
   FWIW
   Erick
  
 On Mon, Jul 29, 2013 at 4:45 PM, Steven Bower <smb-apache@...> wrote:
 I've been doing some performance analysis of a spatial search use case I'm
 implementing in Solr 4.3.0. Basically I'm seeing search times a lot higher
 than I'd like them to be and I'm hoping people may have some suggestions
 for how to optimize further.

 Here are the specs of what I'm doing now:

 Machine:
 - 16 cores @ 2.8ghz
 - 256gb RAM
 - 1TB (RAID 1+0 on 10 SSD)

 Content:
 - 45M docs (not very big, only a few fields with no large textual content)
 - 1 geo field (using config below)
 - index is 12gb
 - 1 shard
 - Using MMapDirectory

 Field config:

 <fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType"
            distErrPct="0.025" maxDistErr="0.00045"
            spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
            units="degrees"/>

 <field name="geopoint" indexed="true" multiValued="false"
        required="false" stored="true" type="geo"/>

 What I've figured out so far:

 - Most of my time (98%) is being spent in
 java.nio.Bits.copyToByteArray(long,Object,long,long), which is being
 driven by BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock(),
 which from what I gather is basically reading terms from the .tim file
 in blocks

 - I moved from Java 1.6 to 1.7 based upon what I read here:
 http://blog.vlad1.com/2011/10/05/looking-at-java-nio-buffer-performance/
 and it definitely had some positive impact (I haven't been able to
 measure this independently yet)

 - I changed maxDistErr from 0.000009 (which is 1m precision per the docs)
 to 0.00045 (50m precision)

 - It looks to me that the .tim files are being memory mapped fully (ie
 they show up in pmap output); the virtual size of the jvm is ~18gb
 (heap is 6gb)

 - I've optimized the index but this doesn't have a dramatic impact on
 performance

 Changing the precision and the JVM upgrade yielded a drop from ~18s
 avg query time to ~9s avg query time.. This is fantastic but I want to
 get this down into the 1-2 second range.

 At this point it seems that basically I am bottlenecked on copying
 memory out of the mapped .tim file, which leads me to think that the
 only solution to my problem would be to read less data or somehow read
 it more efficiently..

 If anyone has any suggestions of where to go with this I'd love to know

 thanks,

 steve
  
 
 
 
 
 











Re: Performance question on Spatial Search

2013-07-30 Thread Smiley, David W.
 looking to ensure you are not using IsWithin, which is not meant for point
 data.  If your query shape is a circle or the bounding box of a circle, you
 should use the geofilt query parser, otherwise use the quirky syntax that
 allows you to specify the spatial predicate with Intersects.
 (2) Do you actually need JTS?  i.e. are you using Polygons, etc.
 (3) How dense would you estimate the data is at the 50m resolution you've
 configured the data?  If it's very dense then I'll tell you how to raise
 the prefix grid scan level to a # closer to max-levels.
 (4) Do all of your searches find less than a million points, considering
 all filters?  If so then it's worth comparing the results with LatLonType.
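
 For (1), the two forms look roughly like this (the field name, point and
 distance are placeholders; d is kilometers for geofilt but degrees for the
 raw shape syntax):

 fq={!geofilt sfield=geopoint pt=45.15,-93.85 d=5}
 fq=geopoint:"Intersects(Circle(45.15,-93.85 d=0.045))"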
 
  ~ David Smiley
 
 

Re: Performance question on Spatial Search

2013-07-30 Thread Luis Cappa Banda




-- 
- Luis Cappa


Re: Performance question on Spatial Search

2013-07-30 Thread Steven Bower
Very good read... Already using MMap... verified using pmap and vsz from
top..

not sure what you mean by good hit ratio?

Here are the stacks...

  Name Time (ms) Own Time (ms)
org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(AtomicReaderContext,
Bits) 300879 203478
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.nextDoc()
45539 19
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs()
45519 40
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readVIntBlock(IndexInput,
int[], int[], int, boolean) 24352 0
org.apache.lucene.store.DataInput.readVInt() 24352 24352
org.apache.lucene.codecs.lucene41.ForUtil.readBlock(IndexInput, byte[],
int[]) 21126 14976
org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
6150 0  java.nio.DirectByteBuffer.get(byte[], int, int)
6150 0
java.nio.Bits.copyToArray(long, Object, long, long, long) 6150 6150
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(Bits,
DocsEnum, int) 35342 421
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData()
34920 27939
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.nextTerm(FieldInfo,
BlockTermState) 6980 6980
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next()
14129 1053
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadNextFloorBlock()
5948 261
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
5686 199
org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
3606 0  java.nio.DirectByteBuffer.get(byte[], int, int)
3606 0
java.nio.Bits.copyToArray(long, Object, long, long, long) 3606 3606
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput,
FieldInfo, BlockTermState) 1879 80
org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
1798 0java.nio.DirectByteBuffer.get(byte[], int, int)
1798 0
java.nio.Bits.copyToArray(long, Object, long, long, long) 1798 1798
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.next()
4010 3324
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.nextNonLeaf()
685 685
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
3117 144
org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
1861 0java.nio.DirectByteBuffer.get(byte[], int, int) 1861
0
java.nio.Bits.copyToArray(long, Object, long, long, long) 1861 1861
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput,
FieldInfo, BlockTermState) 1090 19
org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
1070 0  java.nio.DirectByteBuffer.get(byte[], int, int)
1070 0
java.nio.Bits.copyToArray(long, Object, long, long, long) 1070 1070
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.initIndexInput()
20 0org.apache.lucene.store.ByteBufferIndexInput.clone()
20 0
org.apache.lucene.store.ByteBufferIndexInput.clone() 20 0
org.apache.lucene.store.ByteBufferIndexInput.buildSlice(long, long) 20
0
org.apache.lucene.util.WeakIdentityMap.put(Object, Object) 20 0
org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference.<init>(Object,
ReferenceQueue) 20 0
java.lang.System.identityHashCode(Object) 20 20
org.apache.lucene.index.FilteredTermsEnum.docs(Bits, DocsEnum, int)
1485 527
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(Bits,
DocsEnum, int) 957 0
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData()
957 513
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.nextTerm(FieldInfo,
BlockTermState) 443 443
org.apache.lucene.index.FilteredTermsEnum.next() 874 324
org.apache.lucene.search.NumericRangeQuery$NumericRangeTermsEnum.accept(BytesRef)
368 0
org.apache.lucene.util.BytesRef$UTF8SortedAsUnicodeComparator.compare(Object,
Object) 368 368
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next()
160 0
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadNextFloorBlock()
160 0
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
160 0
org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int) 120
0
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput,
FieldInfo, BlockTermState) 39 0
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekCeil(BytesRef,
boolean) 19 0
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
19 0
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.initIndexInput()
19 0  org.apache.lucene.store.ByteBufferIndexInput.clone()
19 0

Re: Performance question on Spatial Search

2013-07-30 Thread Steven Bower
@David I will certainly update when we get the data refed... and if you
have things you'd like to investigate or try out please let me know.. I'm
happy to eval things at scale here... we will be taking this index from its
current 45m records to 600-700m over the next few months as well..

steve


On Tue, Jul 30, 2013 at 5:10 PM, Steven Bower sbo...@alcyon.net wrote:

 Very good read... Already using MMap... verified using pmap and vsz from
 top..

 not sure what you mean by a good hit ratio?

 Here are the stacks...

Name Time (ms) Own Time (ms)
 org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(AtomicReaderContext,
 Bits) 300879 203478
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.nextDoc()
 45539 19
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs()
 45519 40
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readVIntBlock(IndexInput,
 int[], int[], int, boolean) 24352 0
 org.apache.lucene.store.DataInput.readVInt() 24352 24352
 org.apache.lucene.codecs.lucene41.ForUtil.readBlock(IndexInput, byte[],
 int[]) 21126 14976
 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
 6150 0  java.nio.DirectByteBuffer.get(byte[], int, int) 6150 0
 java.nio.Bits.copyToArray(long, Object, long, long, long) 6150 6150
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(Bits,
 DocsEnum, int) 35342 421
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData()
 34920 27939
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.nextTerm(FieldInfo,
 BlockTermState) 6980 6980
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next()
 14129 1053
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadNextFloorBlock()
 5948 261
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
 5686 199
 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
 3606 0  java.nio.DirectByteBuffer.get(byte[], int, int) 3606 0
 java.nio.Bits.copyToArray(long, Object, long, long, long) 3606 3606
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput,
 FieldInfo, BlockTermState) 1879 80
 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
 1798 0java.nio.DirectByteBuffer.get(byte[], int, int) 1798
 0  java.nio.Bits.copyToArray(long, Object, long, long,
 long) 1798 1798
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.next()
 4010 3324
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.nextNonLeaf()
 685 685
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
 3117 144
 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
 1861 0java.nio.DirectByteBuffer.get(byte[], int, int) 1861 0
 java.nio.Bits.copyToArray(long, Object, long, long, long) 1861 1861
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock(IndexInput,
 FieldInfo, BlockTermState) 1090 19
 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
 1070 0  java.nio.DirectByteBuffer.get(byte[], int, int) 1070 0
 java.nio.Bits.copyToArray(long, Object, long, long, long) 1070 1070
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.initIndexInput()
 20 0org.apache.lucene.store.ByteBufferIndexInput.clone() 20 0
 org.apache.lucene.store.ByteBufferIndexInput.clone() 20 0
 org.apache.lucene.store.ByteBufferIndexInput.buildSlice(long, long) 20 0
 org.apache.lucene.util.WeakIdentityMap.put(Object, Object) 20 0
 org.apache.lucene.util.WeakIdentityMap$IdentityWeakReference.init(Object,
 ReferenceQueue) 20 0
 java.lang.System.identityHashCode(Object) 20 20
 org.apache.lucene.index.FilteredTermsEnum.docs(Bits, DocsEnum, int) 1485
 527
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docs(Bits,
 DocsEnum, int) 957 0
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData()
 957 513
 org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.nextTerm(FieldInfo,
 BlockTermState) 443 443
 org.apache.lucene.index.FilteredTermsEnum.next() 874 324
 org.apache.lucene.search.NumericRangeQuery$NumericRangeTermsEnum.accept(BytesRef)
 368 0
 org.apache.lucene.util.BytesRef$UTF8SortedAsUnicodeComparator.compare(Object,
 Object) 368 368
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next()
 160 0
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadNextFloorBlock()
 160 0
 org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
 160 0
 org.apache.lucene.store.ByteBufferIndexInput.readBytes(byte[], int, int)
 120 0
 

Re: Performance question on Spatial Search

2013-07-30 Thread Smiley, David W.
  - Most of my time (98%) is being spent in
  java.nio.Bits.copyToByteArray(long,Object,long,long) which is being
  driven by BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
  which from what I gather is basically reading terms from the .tim file
  in blocks

  [...]

  At this point it seems that basically i am bottle-necked on basically
  copying memory out of the mapped .tim file which leads me to think
  that the only solution to my problem would be to read less data or
  somehow read it more efficiently..

  If anyone has any suggestions of where to go with this I'd love to know

  thanks,
  steve

-
Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book



Re: Performance question on Spatial Search

2013-07-30 Thread Luis Cappa Banda

  <fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType"
     distErrPct="0.025" maxDistErr="0.00045"
     spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
     units="degrees"/>

  <field name="geopoint" indexed="true" multiValued="false"
     required="false" stored="true" type="geo"/>

  What I've figured out so far:

  - Most of my time (98%) is being spent in
  java.nio.Bits.copyToByteArray(long,Object,long,long) which is being
  driven by BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
  which from what I gather is basically reading terms from the .tim file
  in blocks

  [...]

  At this point it seems that basically i am bottle-necked on basically
  copying memory out of the mapped .tim file which leads me to think
  that the only solution to my problem would be to read less data or
  somehow read it more efficiently..

  If anyone has any suggestions of where to go with this I'd love to know

  thanks,
  steve

-- 
- Luis Cappa


Re: Performance question on Spatial Search

2013-07-29 Thread Bill Bell
Can you compare with the old geo handler as a baseline?

Bill Bell
Sent from mobile


On Jul 29, 2013, at 4:25 PM, Erick Erickson erickerick...@gmail.com wrote:

 This is very strange. I'd expect slow queries on
 the first few queries while these caches were
 warmed, but after that I'd expect things to
 be quite fast.
 
 For a 12G index and 256G RAM, you have on the
 surface a LOT of hardware to throw at this problem.
 You can _try_ giving the JVM, say, 18G but that
 really shouldn't be a big issue, your index files
 should be MMaped.
 
 Let's try the crude thing first and give the JVM
 more memory.
 
 FWIW
 Erick
 
 On Mon, Jul 29, 2013 at 4:45 PM, Steven Bower smb-apa...@alcyon.net wrote:
 I've been doing some performance analysis of a spatial search use case I'm
 implementing in Solr 4.3.0. Basically I'm seeing search times a lot higher
 than I'd like them to be and I'm hoping people may have some suggestions
 for how to optimize further.
 
 Here are the specs of what I'm doing now:
 
 Machine:
 - 16 cores @ 2.8ghz
 - 256gb RAM
 - 1TB (RAID 1+0 on 10 SSD)
 
 Content:
 - 45M docs (not very big only a few fields with no large textual content)
 - 1 geo field (using config below)
 - index is 12gb
 - 1 shard
 - Using MMapDirectory
 
 Field config:
 
 <fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType"
    distErrPct="0.025" maxDistErr="0.00045"
    spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
    units="degrees"/>
 
 <field name="geopoint" indexed="true" multiValued="false"
    required="false" stored="true" type="geo"/>
 
 
 What I've figured out so far:
 
 - Most of my time (98%) is being spent in
 java.nio.Bits.copyToByteArray(long,Object,long,long) which is being
 driven by BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
 which from what I gather is basically reading terms from the .tim file
 in blocks
 
 - I moved from Java 1.6 to 1.7 based upon what I read here:
 http://blog.vlad1.com/2011/10/05/looking-at-java-nio-buffer-performance/
 and it definitely had some positive impact (i haven't been able to
 measure this independantly yet)
 
 - I changed maxDistErr from 0.000009 (which is 1m precision per the docs)
 to 0.00045 (50m precision) ..
 
 - It looks to me that the .tim file are being memory mapped fully (ie
 they show up in pmap output) the virtual size of the jvm is ~18gb
 (heap is 6gb)
 
 - I've optimized the index but this doesn't have a dramatic impact on
 performance
 
 Changing the precision and the JVM upgrade yielded a drop from ~18s
 avg query time to ~9s avg query time.. This is fantastic but I want to
 get this down into the 1-2 second range.
 
 At this point it seems that basically i am bottle-necked on basically
 copying memory out of the mapped .tim file which leads me to think
 that the only solution to my problem would be to read less data or
 somehow read it more efficiently..
 
 If anyone has any suggestions of where to go with this I'd love to know
 
 
 thanks,
 
 steve
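
For reference, a field like the geopoint one above is typically hit with a
point-radius filter. A minimal SolrJ sketch of such a query follows; it is
not code from this thread, and the core URL, point and distance are made-up
values:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class GeoFilterSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical core URL; substitute your own.
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/core1");

            SolrQuery q = new SolrQuery("*:*");
            // Keep only documents within 5 km of the given lat,lon point,
            // using the geopoint field from the schema snippet above.
            q.addFilterQuery("{!geofilt sfield=geopoint pt=45.15,-93.85 d=5}");
            q.setRows(10);

            QueryResponse rsp = server.query(q);
            System.out.println("numFound: " + rsp.getResults().getNumFound());
            server.shutdown();
        }
    }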


SOLR Performance question

2013-02-19 Thread anurag.jain
Hi everybody.

I stored 42 fields in Solr,

and indexed 34 fields,

and am going to store 4-6 more columns and index 3-5 more.

The total docs I have stored --- 250

and it may reach up to 500.

So the question is:

Will I get any problem? My machine is m1.small in Amazon EC2.

So should I shift the machine to m1.large for 250 data, or for 500?
Or will it work for now?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-Performance-question-tp4041245.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: SOLR Performance question

2013-02-19 Thread Harshvardhan Ojha
Hi Anurag,

We are running solr with almost same number of documents having more than 50 
indexed fields and 78 stored fields, we have no issues as of now, so I can say 
that you won't face any problem.

Regards
Harshvardhan Ojha

-Original Message-
From: anurag.jain [mailto:anurag.k...@gmail.com]
Sent: Tuesday, February 19, 2013 1:46 PM
To: solr-user@lucene.apache.org
Subject: SOLR Performance question

Hi everybody.

I stored 42 field in solr.

and indexed  34 field.

and going to store 4-6 coloum more  and indexed 3-5

and total doc i have stored --- 250

and may be it will reach upto 500

SO question is,

Will i get any problem ?? my machine is m1.small in amazon ec2.

so should i shift machine to m1.large   for 250 data  or for 500??
or it will work for now ??



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-Performance-question-tp4041245.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr 4.0 indexing performance question

2013-01-23 Thread Kevin Stone
I am having some difficulty migrating our solr indexing scripts from using 3.5 
to solr 4.0. Notably, I am trying to track down why our performance in solr 4.0 
is about 5-10 times slower when indexing documents. Querying is still quite 
fast.

The code adds documents in groups of 1000, and adds each group to Solr in 
a thread. The documents are somewhat large, including maybe 30-40 different 
field types, mostly multivalued. Here are some snippets of the code we used in 
3.5.


 MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();

 HttpClient client = new HttpClient(mgr);

 CommonsHttpSolrServer server = new CommonsHttpSolrServer("<some url for our index>", client);

 server.setRequestWriter(new BinaryRequestWriter());


 Then, we delete the index, and proceed to generate documents and load the 
groups in a thread that looks kind of like this. I've omitted some overhead for 
handling exceptions, and retry attempts.


class DocWriterThread implements Runnable
{
    CommonsHttpSolrServer server;

    Collection<SolrInputDocument> docs;

    private int commitWithin = 50000; // 50 seconds, in milliseconds

    public DocWriterThread(CommonsHttpSolrServer server, Collection<SolrInputDocument> docs)
    {
        this.server = server;
        this.docs = docs;
    }

    public void run()
    {
        try {
            // set the commitWithin feature
            server.add(docs, commitWithin);
        } catch (Exception e) {
            // exception handling and retries omitted, as noted above
        }
    }
}


Now, I've had to change some things to get this to compile with the Solr 4.0 
libraries. Here is what I tried to convert the above code to. I don't know if 
these are the correct equivalents, as I am not familiar with apache 
httpcomponents.



 ThreadSafeClientConnManager mgr = new ThreadSafeClientConnManager();

 DefaultHttpClient client = new DefaultHttpClient(mgr);

 HttpSolrServer server = new HttpSolrServer("<some url for our solr index>", client);

 server.setRequestWriter(new BinaryRequestWriter());




The thread method is the same, but uses HttpSolrServer instead of 
CommonsHttpSolrServer.

We also had an old solrconfig (not sure what version, but it is pre 3.x and 
had mostly default values) that I had to replace with a 4.0 style 
solrconfig.xml. I don't want to post the entire file (as it is large), but I 
copied one from the solr 4.0 examples, and made a couple changes. First, I 
wanted to turn off transaction logging. So essentially I have a line like this 
(everything inside is commented out):


<updateHandler class="solr.DirectUpdateHandler2"></updateHandler>


And I added a handler for javabin


<requestHandler name="/update/javabin" class="solr.BinaryUpdateRequestHandler">
  <lst name="defaults">
    <str name="stream.contentType">application/javabin</str>
  </lst>
</requestHandler>

I'm not sure what other configurations I should look at. I would think that 
there should be a big obvious reason why the indexing performance would drop 
nearly 10 fold.

Against our 3.5 instance I timed our index load, and it adds roughly 40,000 
documents every 3-8 seconds.

Against our 4.0 instance it adds 40,000 documents every 70-75 seconds.

This isn't the end of the world, and I would love to use the new join feature 
in solr 4.0. However, we have many different indexes with millions of 
documents, and this kind of increase in load time is troubling.


Thanks for your help.


-Kevin
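
An aside on bulk loading: besides HttpSolrServer, SolrJ 4.0 also ships
ConcurrentUpdateSolrServer, which queues documents and streams them to the
update handler from background threads. A minimal sketch, assuming a
hypothetical core URL and reusing the 50-second commitWithin from the code
above:

    import java.util.ArrayList;
    import java.util.Collection;
    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkLoadSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical URL; queue up to 10000 docs, drain with 4 threads.
            ConcurrentUpdateSolrServer server =
                new ConcurrentUpdateSolrServer("http://localhost:8983/solr/core1", 10000, 4);

            Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 1000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i); // made-up id values
                batch.add(doc);
            }
            server.add(batch, 50000);    // commitWithin in milliseconds
            server.blockUntilFinished(); // wait for the queue to drain
            server.shutdown();
        }
    }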




Re: Solr 4.0 indexing performance question

2013-01-23 Thread Mark Miller
It's hard to guess, but I might start by looking at what the new UpdateLog is 
costing you. Take its definition out of solrconfig.xml and try your test 
again. Then let's take it from there.

- Mark

On Jan 23, 2013, at 11:00 AM, Kevin Stone kevin.st...@jax.org wrote:

 I am having some difficulty migrating our solr indexing scripts from using 
 3.5 to solr 4.0. Notably, I am trying to track down why our performance in 
 solr 4.0 is about 5-10 times slower when indexing documents. Querying is 
 still quite fast.
 
 The code adds  documents in groups of 1000, and adds each group to the solr 
 in a thread. The documents are somewhat large, including maybe 30-40 
 different field types, mostly multivalued. Here are some snippets of the code 
 we used in 3.5.
 
 
 MultiThreadedHttpConnectionManager mgr = new 
 MultiThreadedHttpConnectionManager();
 
 HttpClient client = new HttpClient(mgr);
 
 CommonsHttpSolrServer server = new CommonsHttpSolrServer( some url for our 
 index,client );
 
 server.setRequestWriter(new BinaryRequestWriter());
 
 
 Then, we delete the index, and proceed to generate documents and load the 
 groups in a thread that looks kind of like this. I've omitted some overhead 
 for handling exceptions, and retry attempts.
 
 
 class DocWriterThread implements Runnable
 
 {
 
CommonsHttpSolrServer server;
 
 Collection<SolrInputDocument> docs;
  
 private int commitWithin = 50000; // 50 seconds
  
 public DocWriterThread(CommonsHttpSolrServer 
  server, Collection<SolrInputDocument> docs)
 
{
 
this.server=server;
 
this.docs=docs;
 
}
 
 public void run()
 
 {
 
// set the commitWithin feature
 
server.add(docs,commitWithin);
 
 }
 
 }
 
 
 Now, I've had to change some things to get this compile with the Solr 4.0 
 libraries. Here is what I tried to convert the above code to. I don't know if 
 these are the correct equivalents, as I am not familiar with apache 
 httpcomponents.
 
 
 
 ThreadSafeClientConnManager mgr = new ThreadSafeClientConnManager();
 
 DefaultHttpClient client = new DefaultHttpClient(mgr);
 
 HttpSolrServer server = new HttpSolrServer( some url for our solr 
 index,client );
 
 server.setRequestWriter(new BinaryRequestWriter());
 
 
 
 
 The thread method is the same, but uses HttpSolrServer instead of 
 CommonsHttpSolrServer.
 
 We also, had an old solrconfig (not sure what version, but it is pre 3.x and 
 had mostly default values) that I had to replace with a 4.0 style 
 solrconfig.xml. I don't want to post the entire file (as it is large), but I 
 copied one from the solr 4.0 examples, and made a couple changes. First, I 
 wanted to turn off transaction logging. So essentially I have a line like 
 this (everything inside is commented out):
 
 
  <updateHandler class="solr.DirectUpdateHandler2"></updateHandler>
 
 
 And I added a handler for javabin
 
 
 <requestHandler name="/update/javabin" class="solr.BinaryUpdateRequestHandler">
   <lst name="defaults">
     <str name="stream.contentType">application/javabin</str>
   </lst>
 </requestHandler>
 
 I'm not sure what other configurations I should look at. I would think that 
 there should be a big obvious reason why the indexing performance would drop 
 nearly 10 fold.
 
 Against our 3.5 instance I timed our index load, and it adds roughly 40,000 
 documents every 3-8 seconds.
 
 Against our 4.0 instance it adds 40,000 documents every 70-75 seconds.
 
 This isn't the end of the world, and I would love to use the new join feature 
 in solr 4.0. However, we have many different indexes with millions of 
 documents, and this kind of increase in load time is troubling.
 
 
 Thanks for your help.
 
 
 -Kevin
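
For readers looking for the definition Mark means: in the stock Solr 4.0
example solrconfig.xml the transaction log is declared inside the update
handler, roughly as below (check your own file, the details can differ).
Removing or commenting out the updateLog element is what disables it:

    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
      </updateLog>
    </updateHandler>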
 
 



Re: Solr 4.0 indexing performance question

2013-01-23 Thread Kevin Stone
Do you mean commenting out the <updateLog>...</updateLog> tag? Because
that I already commented out. Or do I also need to remove the entire
<updateHandler> tag? Sorry, I am not too familiar with everything in the
solrconfig file. I have a tag that essentially looks like this:

<updateHandler class="solr.DirectUpdateHandler2"></updateHandler>


Everything inside is commented out.

-Kevin

On 1/23/13 11:21 AM, Mark Miller markrmil...@gmail.com wrote:

It's hard to guess, but I might start by looking at what the new
UpdateLog is costing you. Take it's definition out of solrconfig.xml and
try your test again. Then let's take it from there.

- Mark

On Jan 23, 2013, at 11:00 AM, Kevin Stone kevin.st...@jax.org wrote:

 I am having some difficulty migrating our solr indexing scripts from
using 3.5 to solr 4.0. Notably, I am trying to track down why our
performance in solr 4.0 is about 5-10 times slower when indexing
documents. Querying is still quite fast.

 The code adds  documents in groups of 1000, and adds each group to the
solr in a thread. The documents are somewhat large, including maybe
30-40 different field types, mostly multivalued. Here are some snippets
of the code we used in 3.5.


 MultiThreadedHttpConnectionManager mgr = new
MultiThreadedHttpConnectionManager();

 HttpClient client = new HttpClient(mgr);

 CommonsHttpSolrServer server = new CommonsHttpSolrServer( some url for
our index,client );

 server.setRequestWriter(new BinaryRequestWriter());


 Then, we delete the index, and proceed to generate documents and load
the groups in a thread that looks kind of like this. I've omitted some
overhead for handling exceptions, and retry attempts.


 class DocWriterThread implements Runnable

 {

CommonsHttpSolrServer server;

Collection<SolrInputDocument> docs;

private int commitWithin = 50000; // 50 seconds

public DocWriterThread(CommonsHttpSolrServer
server, Collection<SolrInputDocument> docs)

{

this.server=server;

this.docs=docs;

}

 public void run()

 {

// set the commitWithin feature

server.add(docs,commitWithin);

 }

 }


 Now, I've had to change some things to get this compile with the Solr
4.0 libraries. Here is what I tried to convert the above code to. I
don't know if these are the correct equivalents, as I am not familiar
with apache httpcomponents.



 ThreadSafeClientConnManager mgr = new ThreadSafeClientConnManager();

 DefaultHttpClient client = new DefaultHttpClient(mgr);

 HttpSolrServer server = new HttpSolrServer( some url for our solr
index,client );

 server.setRequestWriter(new BinaryRequestWriter());




 The thread method is the same, but uses HttpSolrServer instead of
CommonsHttpSolrServer.

 We also, had an old solrconfig (not sure what version, but it is pre
3.x and had mostly default values) that I had to replace with a 4.0
style solrconfig.xml. I don't want to post the entire file (as it is
large), but I copied one from the solr 4.0 examples, and made a couple
changes. First, I wanted to turn off transaction logging. So essentially
I have a line like this (everything inside is commented out):


 <updateHandler class="solr.DirectUpdateHandler2"></updateHandler>


 And I added a handler for javabin


 <requestHandler name="/update/javabin" class="solr.BinaryUpdateRequestHandler">
   <lst name="defaults">
     <str name="stream.contentType">application/javabin</str>
   </lst>
 </requestHandler>

 I'm not sure what other configurations I should look at. I would think
that there should be a big obvious reason why the indexing performance
would drop nearly 10 fold.

 Against our 3.5 instance I timed our index load, and it adds roughly
40,000 documents every 3-8 seconds.

 Against our 4.0 instance it adds 40,000 documents every 70-75 seconds.

 This isn't the end of the world, and I would love to use the new join
feature in solr 4.0. However, we have many different indexes with
millions of documents, and this kind of increase in load time is
troubling.


 Thanks for your help.


 -Kevin




Re: Solr 4.0 indexing performance question

2013-01-23 Thread Kevin Stone
I'm still poking around trying to find the differences. I found a couple
things that may or may not be relevant.
First, when I start up my 3.5 solr, I get all sorts of warnings that my
solrconfig is old and will run using 2.4 emulation.
Of course I had to upgrade the solrconfig for the 4.0 instance (which I
already described). I am curious if there could be some feature I was
taking advantage of in 2.4 that doesn't exist now in 4.0. I don't know.

Second when I look at the console logs for my server (3.5 and 4.0) and I
run the indexer against each, I see a subtle difference in this print out
when it connects to the solr core.
The 3.5 version prints this out:
webapp=/solr path=/update
params={waitSearcher=true&wt=javabin&commit=true&softCommit=false&version=2}
{commit=} 0 2722


The 4.0 version prints this out:
 webapp=/solr path=/update/javabin
params={wt=javabin&commit=true&waitFlush=true&waitSearcher=true&version=2}
status=0 QTime=1404



The params for the update handler seem ever so slightly different. The 3.5
version (the one that runs fast) has a setting softCommit=false.
The 4.0 version does not print that setting, but instead prints this
setting waitFlush=true.

These could be irrelevant, but thought I should add the information.

-Kevin

On 1/23/13 11:42 AM, Kevin Stone kevin.st...@jax.org wrote:

Do you mean commenting out the <updateLog>...</updateLog> tag? Because
that I already commented out. Or do I also need to remove the entire
<updateHandler> tag? Sorry, I am not too familiar with everything in the
solrconfig file. I have a tag that essentially looks like this:

<updateHandler class="solr.DirectUpdateHandler2"></updateHandler>


Everything inside is commented out.

-Kevin

On 1/23/13 11:21 AM, Mark Miller markrmil...@gmail.com wrote:

It's hard to guess, but I might start by looking at what the new
UpdateLog is costing you. Take it's definition out of solrconfig.xml and
try your test again. Then let's take it from there.

- Mark

On Jan 23, 2013, at 11:00 AM, Kevin Stone kevin.st...@jax.org wrote:

 I am having some difficulty migrating our solr indexing scripts from
using 3.5 to solr 4.0. Notably, I am trying to track down why our
performance in solr 4.0 is about 5-10 times slower when indexing
documents. Querying is still quite fast.

 The code adds  documents in groups of 1000, and adds each group to the
solr in a thread. The documents are somewhat large, including maybe
30-40 different field types, mostly multivalued. Here are some snippets
of the code we used in 3.5.


 MultiThreadedHttpConnectionManager mgr = new
MultiThreadedHttpConnectionManager();

 HttpClient client = new HttpClient(mgr);

 CommonsHttpSolrServer server = new CommonsHttpSolrServer( some url for
our index,client );

 server.setRequestWriter(new BinaryRequestWriter());


 Then, we delete the index, and proceed to generate documents and load
the groups in a thread that looks kind of like this. I've omitted some
overhead for handling exceptions, and retry attempts.


 class DocWriterThread implements Runnable

 {

CommonsHttpSolrServer server;

Collection<SolrInputDocument> docs;

private int commitWithin = 50000; // 50 seconds

public DocWriterThread(CommonsHttpSolrServer
server, Collection<SolrInputDocument> docs)

{

this.server=server;

this.docs=docs;

}

 public void run()

 {

// set the commitWithin feature

server.add(docs,commitWithin);

 }

 }


 Now, I've had to change some things to get this compile with the Solr
4.0 libraries. Here is what I tried to convert the above code to. I
don't know if these are the correct equivalents, as I am not familiar
with apache httpcomponents.



 ThreadSafeClientConnManager mgr = new ThreadSafeClientConnManager();

 DefaultHttpClient client = new DefaultHttpClient(mgr);

 HttpSolrServer server = new HttpSolrServer( some url for our solr
index,client );

 server.setRequestWriter(new BinaryRequestWriter());




 The thread method is the same, but uses HttpSolrServer instead of
CommonsHttpSolrServer.

 We also, had an old solrconfig (not sure what version, but it is pre
3.x and had mostly default values) that I had to replace with a 4.0
style solrconfig.xml. I don't want to post the entire file (as it is
large), but I copied one from the solr 4.0 examples, and made a couple
changes. First, I wanted to turn off transaction logging. So essentially
I have a line like this (everything inside is commented out):


 <updateHandler class="solr.DirectUpdateHandler2"></updateHandler>


 And I added a handler for javabin


 <requestHandler name="/update/javabin" class="solr.BinaryUpdateRequestHandler">
   <lst name="defaults">
     <str name="stream.contentType">application/javabin</str>
   </lst>
 </requestHandler>

 I'm not sure what other configurations I should look at. I would think
that there should be a big obvious reason why the indexing performance
would drop nearly 10 fold.

 Against our 3.5 instance I timed our index load, and it adds roughly
40,000 documents every 3-8 

Re: Solr 4.0 indexing performance question

2013-01-23 Thread Kevin Stone
Another revelation...
I can see that there is a time difference in the Solr output for adding
these documents when I watch it realtime.
Here are some rows from the 3.5 solr server:

Jan 23, 2013 11:57:23 AM org.apache.solr.core.SolrCore execute
INFO: [gxdResult] webapp=/solr path=/update/javabin
params={wt=javabin&version=2} status=0 QTime=6196
Jan 23, 2013 11:57:23 AM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[RNA in situ-1386104, RNA in situ-1351487, RNA in situ-1363917,
RNA in situ-1377125, RNA in situ-1371738, RNA in situ-1378746, RNA in
situ-1383410, RNA in situ-1362712, ... (1001 adds)]} 0 6266
Jan 23, 2013 11:57:23 AM org.apache.solr.core.SolrCore execute
INFO: [gxdResult] webapp=/solr path=/update/javabin
params={wt=javabin&version=2} status=0 QTime=6266
Jan 23, 2013 11:57:24 AM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[RNA in situ-1371578, RNA in situ-1377716, RNA in situ-1378151,
RNA in situ-1360580, RNA in situ-1391657, RNA in situ-1370288, RNA in
situ-1388236, RNA in situ-1361465, ... (1001 adds)]} 0 6371
Jan 23, 2013 11:57:24 AM org.apache.solr.core.SolrCore execute
INFO: [gxdResult] webapp=/solr path=/update/javabin
params={wt=javabin&version=2} status=0 QTime=6371
Jan 23, 2013 11:57:24 AM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[RNA in situ-1350555, RNA in situ-1350887, RNA in situ-1379699,
RNA in situ-1373773, RNA in situ-1374004, RNA in situ-1372265, RNA in
situ-1373027, RNA in situ-1380691, ... (1001 adds)]} 0 6440
Jan 23, 2013 11:57:24 AM org.apache.solr.core.SolrCore execute



And here from the 4.0 solr:

Jan 23, 2013 3:40:22 PM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2}
{add=[RNA in situ-115650, RNA in situ-4109, RNA in situ-107614, RNA in
situ-86038, RNA in situ-19647, RNA in situ-1422, RNA in situ-119536, RNA
in situ-5, RNA in situ-86825, RNA in situ-91009, ... (1001 adds)]} 0
3105
Jan 23, 2013 3:40:23 PM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2}
{add=[RNA in situ-38103, RNA in situ-15797, RNA in situ-79946, RNA in
situ-124877, RNA in situ-62025, RNA in situ-67908, RNA in situ-70527, RNA
in situ-20581, RNA in situ-107574, RNA in situ-96497, ... (1001 adds)]} 0
2689
Jan 23, 2013 3:40:24 PM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2}
{add=[RNA in situ-35518, RNA in situ-50512, RNA in situ-109961, RNA in
situ-113025, RNA in situ-33729, RNA in situ-116967, RNA in situ-133871,
RNA in situ-55287, RNA in situ-67367, RNA in situ-8617, ... (1001 adds)]}
0 2367
Jan 23, 2013 3:40:28 PM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2}
{add=[RNA in situ-105749, RNA in situ-125415, RNA in situ-14667, RNA in
situ-41067, RNA in situ-1099, RNA in situ-86169, RNA in situ-90834, RNA in
situ-114639, RT-PCR-26160, RNA in situ-79745, ... (1001 adds)]} 0 3401
Jan 23, 2013 3:40:28 PM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [gxdResult] webapp=/solr path=/update params={wt=javabin&version=2}
{add=[RNA in situ-82061, RNA in situ-96965, RNA in situ-22677, RNA in
situ-52637, RNA in situ-131842, RNA in situ-31863, RNA in situ-111656, RNA
in situ-120509, RNA in situ-29659, RNA in situ-63579, ... (1001 adds)]} 0
3580
Jan 23, 2013 3:40:31 PM
org.apache.solr.update.processor.LogUpdateProcessor finish



I know that they aren't the same exact documents (like I said, there are
millions to load), but the times look pretty much like this for all of
them.

Can someone help me parse out the times of this? It *appears* to me that
the inserts are happening just as fast, if not faster in 4.0 than 3.5, BUT
the timestamps between the LogUpdateProcessor calls are much longer in
4.0.
I do not have the <updateLog> tag anywhere in my solrconfig.xml. So why
does it look to me like it is spending a lot of time logging? It shouldn't
really be logging anything, right? Bear in mind that these inserts happen
in threads that are pushing to Solr concurrently. So if 4.0 is logging
somewhere that 3.5 didn't, then the file-locking on that log file could be
slowing me down.

-Kevin

On 1/23/13 12:03 PM, Kevin Stone kevin.st...@jax.org wrote:

I'm still poking around trying to find the differences. I found a couple
things that may or may not be relevant.
First, when I start up my 3.5 solr, I get all sorts of warnings that my
solrconfig is old and will run using 2.4 emulation.
Of course I had to upgrade the solconfig for the 4.0 instance (which I
already described). I am curious if there could be some feature I was
taking advantage of in 2.4 that doesn't exist now in 4.0. I don't know.

Second when I look at the console logs for my server (3.5 and 4.0) and I
run the indexer against each, I 

Re: Performance Question

2012-03-19 Thread Jamie Johnson
Mikhail,

Thanks for the response.  Just to be clear you're saying that the size
of the index does not matter, it's more the size of the results?

On Fri, Mar 16, 2012 at 2:43 PM, Mikhail Khludnev
mkhlud...@griddynamics.com wrote:
 Hello,

 Frankly speaking, the computational complexity of a Lucene search depends on
 the size of the search result, numFound*log(start+rows), not on the size of the index.

 Regards

 On Fri, Mar 16, 2012 at 9:34 PM, Jamie Johnson jej2...@gmail.com wrote:

 I'm curious if anyone tell me how Solr/Lucene performs in a situation
 where you have 100,000 documents each with 100 tokens vs having
 1,000,000 documents each with 10 tokens.  Should I expect the
 performance to be the same?  Any information would be greatly
 appreciated.




 --
 Sincerely yours
 Mikhail Khludnev
 Lucid Certified
 Apache Lucene/Solr Developer
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com


Re: Performance Question

2012-03-19 Thread Mikhail Khludnev
Exactly. That's what I mean.

On Mon, Mar 19, 2012 at 6:15 PM, Jamie Johnson jej2...@gmail.com wrote:

 Mikhail,

 Thanks for the response.  Just to be clear you're saying that the size
 of the index does not matter, it's more the size of the results?

 On Fri, Mar 16, 2012 at 2:43 PM, Mikhail Khludnev
 mkhlud...@griddynamics.com wrote:
  Hello,
 
  Frankly speaking, the computational complexity of a Lucene search depends on
  the size of the search result, numFound*log(start+rows), not on the size of
  the index.
 
  Regards
 
  On Fri, Mar 16, 2012 at 9:34 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  I'm curious if anyone tell me how Solr/Lucene performs in a situation
  where you have 100,000 documents each with 100 tokens vs having
  1,000,000 documents each with 10 tokens.  Should I expect the
  performance to be the same?  Any information would be greatly
  appreciated.
 
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Lucid Certified
  Apache Lucene/Solr Developer
  Grid Dynamics
 
  http://www.griddynamics.com
   mkhlud...@griddynamics.com




-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Performance Question

2012-03-19 Thread Bill Bell
The size of the index does matter practically speaking.

Bill Bell
Sent from mobile


On Mar 19, 2012, at 11:41 AM, Mikhail Khludnev mkhlud...@griddynamics.com 
wrote:

 Exactly. That's what I mean.
 
 On Mon, Mar 19, 2012 at 6:15 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 Mikhail,
 
 Thanks for the response.  Just to be clear you're saying that the size
 of the index does not matter, it's more the size of the results?
 
 On Fri, Mar 16, 2012 at 2:43 PM, Mikhail Khludnev
 mkhlud...@griddynamics.com wrote:
 Hello,
 
 Frankly speaking, the computational complexity of a Lucene search depends on
 the size of the search result, numFound*log(start+rows), not on the size of
 the index.
 
 Regards
 
 On Fri, Mar 16, 2012 at 9:34 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
 I'm curious if anyone tell me how Solr/Lucene performs in a situation
 where you have 100,000 documents each with 100 tokens vs having
 1,000,000 documents each with 10 tokens.  Should I expect the
 performance to be the same?  Any information would be greatly
 appreciated.
 
 
 
 
 --
 Sincerely yours
 Mikhail Khludnev
 Lucid Certified
 Apache Lucene/Solr Developer
 Grid Dynamics
 
 http://www.griddynamics.com
 mkhlud...@griddynamics.com
 
 
 
 
 -- 
 Sincerely yours
 Mikhail Khludnev
 Lucid Certified
 Apache Lucene/Solr Developer
 Grid Dynamics
 
 http://www.griddynamics.com
 mkhlud...@griddynamics.com


Performance Question

2012-03-16 Thread Jamie Johnson
I'm curious if anyone can tell me how Solr/Lucene performs in a situation
where you have 100,000 documents each with 100 tokens vs having
1,000,000 documents each with 10 tokens.  Should I expect the
performance to be the same?  Any information would be greatly
appreciated.


Re: Performance Question

2012-03-16 Thread Mikhail Khludnev
Hello,

Frankly speaking, the computational complexity of a Lucene search depends on
the size of the search result, numFound*log(start+rows), not on the size of the index.

Regards

On Fri, Mar 16, 2012 at 9:34 PM, Jamie Johnson jej2...@gmail.com wrote:

 I'm curious if anyone tell me how Solr/Lucene performs in a situation
 where you have 100,000 documents each with 100 tokens vs having
 1,000,000 documents each with 10 tokens.  Should I expect the
 performance to be the same?  Any information would be greatly
 appreciated.




-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com
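
Mikhail's cost model is easy to plug numbers into; the figures below are
made up for illustration only:

    public class QueryCostSketch {
        // Relative cost model numFound * log(start + rows) from the
        // message above. Note that index size does not appear in it.
        public static void main(String[] args) {
            int start = 0, rows = 10;
            long numFoundSmallIndex = 20000; // hits in the 100k-doc index
            long numFoundLargeIndex = 20000; // same hit count in the 1M-doc index
            double costSmall = numFoundSmallIndex * Math.log(start + rows);
            double costLarge = numFoundLargeIndex * Math.log(start + rows);
            // Equal result sizes give equal modeled cost, whatever the index size.
            System.out.println(costSmall + " vs " + costLarge);
        }
    }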


Re: performance question

2010-01-06 Thread A. Steven Anderson
 Strictly speaking there is some insignificant distinctions in performance
 related to how a field name is resolved -- Grant alluded to this
 earlier in this thread -- but it only comes into play when you actually
 refer to that field by name and Solr has to look them up in the
 metadata.  So for example if your request refered to 100 differnet field
 names in the q, fq, and facet.field params there would be a small overhead
 for any of those 100 fields that existed because of dynamicField/
 declarations, that would not exist for any of those fields that were
 declared using field/ -- but there would be no added overhead to htat
 query if there were 999 other fields that existed in your index
 because of that same dynamicField/ declaration.

 But frankly: we're getting talking about seriously ridiculous
 pico-optimizing at this point ... if you find yourselv with performance
 concerns there are probaly 500 other things worth worrying about before
 this should ever cross your mind.


Thanks for the follow up.

I've converted our schema to required fields only with every other field
being a dynamic field.

The only negative that I've found so far is that you lose the copyField
capability, so it makes my ingest a little bigger, since I have to manually
copy the values myself.

-- 
A. Steven Anderson
Independent Consultant
st...@asanderson.com


Re: performance question

2010-01-06 Thread Erik Hatcher
You don't lose copyField capability with dynamic fields.  You can copy
dynamic fields into a fixed field name like *_s => text or dynamic
fields into another dynamic field like *_s => *_t


Erik

On Jan 6, 2010, at 9:35 AM, A. Steven Anderson wrote:

Strictly speaking there is some insignificant distinctions in  
performance

related to how a field name is resolved -- Grant alluded to this
earlier in this thread -- but it only comes into play when you  
actually

refer to that field by name and Solr has to look them up in the
metadata.  So for example if your request refered to 100 differnet  
field
names in the q, fq, and facet.field params there would be a small  
overhead

for any of those 100 fields that existed because of dynamicField/
declarations, that would not exist for any of those fields that were
declared using field/ -- but there would be no added overhead to  
htat

query if there were 999 other fields that existed in your index
because of that same dynamicField/ declaration.

But frankly: we're getting talking about seriously ridiculous
pico-optimizing at this point ... if you find yourselv with  
performance
concerns there are probaly 500 other things worth worrying about  
before

this should ever cross your mind.



Thanks for the follow up.

I've converted our schema to required fields only with every other  
field

being a dynamic field.

The only negative that I've found so far is that you lose the  
copyField
capability, so it makes my ingest a little bigger, since I have to  
manually

copy the values myself.

--
A. Steven Anderson
Independent Consultant
st...@asanderson.com
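
Erik's *_s examples, spelled out as schema.xml directives (a sketch; the
field and type names are illustrative):

    <!-- copy every *_s dynamic field into the fixed field "text" -->
    <copyField source="*_s" dest="text"/>

    <!-- or copy each *_s field into the matching *_t dynamic field -->
    <copyField source="*_s" dest="*_t"/>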




Re: performance question

2010-01-06 Thread A. Steven Anderson
 You don't lose copyField capability with dynamic fields.  You can copy
 dynamic fields into a fixed field name like *_s => text or dynamic fields
 into another dynamic field like *_s => *_t


Ahhh...I missed that little detail.  Nice!

Ok, so there are no negatives to using dynamic fields then. ;-)

Thanks for all the info!

-- 
A. Steven Anderson
Independent Consultant
st...@asanderson.com


Re: performance question

2010-01-05 Thread Chris Hostetter

:  So, in general, there is no *significant* performance difference with using
:  dynamic fields. Correct?
: 
: Correct.  There's not even really an insignificant performance difference.
: A dynamic field is the same as a regular field in practically every way on the
: search side of things.

Strictly speaking there are some insignificant distinctions in performance 
related to how a field name is resolved -- Grant alluded to this 
earlier in this thread -- but it only comes into play when you actually 
refer to that field by name and Solr has to look it up in the 
metadata.  So for example if your request referred to 100 different field 
names in the q, fq, and facet.field params there would be a small overhead 
for any of those 100 fields that existed because of <dynamicField/> 
declarations, that would not exist for any of those fields that were 
declared using <field/> -- but there would be no added overhead to that 
query if there were 999 other fields that existed in your index 
because of that same <dynamicField/> declaration.

But frankly: we're talking about seriously ridiculous 
pico-optimizing at this point ... if you find yourself with performance 
concerns there are probably 500 other things worth worrying about before 
this should ever cross your mind.





-Hoss



Re: performance question

2010-01-04 Thread Erik Hatcher


On Jan 4, 2010, at 12:04 AM, A. Steven Anderson wrote:



dynamic fields don't make it worse ... the number of actaul field  
names

you sort on makes it worse.

If you sort on 100 fields, the cost is the same regardless of  
wether all
100 of those fields exist because of a single dynamicField/  
declaration,

or 100 distinct field/ declarations.



Ahh...thanks for the clarification.

So, in general, there is no *significant* performance difference  
with using

dynamic fields. Correct?


Correct.  There's not even really an insignificant performance  
difference.  A dynamic field is the same as a regular field in  
practically every way on the search side of things.


Erik



Re: performance question

2010-01-03 Thread A. Steven Anderson
 Sorting and index norms have space penalties.
 Sorting on a field creates an array of Java ints, one for every
 document in the index. Index norms (used for boosting documents and
 other things) create an array of bytes in the Lucene index files, one
 for every document in the index.
 If you sort on many of your dynamic fields your memory use will
 explode, and the same with index norms and disk space.


Thanks for the info.  In general, I knew sorting was expensive, but I didn't
realize that dynamic fields made it worse.

-- 
A. Steven Anderson
Independent Consultant
st...@asanderson.com


Re: performance question

2010-01-03 Thread Chris Hostetter

:  If you sort on many of your dynamic fields your memory use will
:  explode, and the same with index norms and disk space.

: Thanks for the info.  In general, I knew sorting was expensive, but I didn't
: realize that dynamic fields made it worse.

dynamic fields don't make it worse ... the number of actual field names 
you sort on makes it worse.  

If you sort on 100 fields, the cost is the same regardless of whether all 
100 of those fields exist because of a single <dynamicField/> declaration, 
or 100 distinct <field/> declarations.


-Hoss



Re: performance question

2010-01-03 Thread A. Steven Anderson

 dynamic fields don't make it worse ... the number of actaul field names
 you sort on makes it worse.

 If you sort on 100 fields, the cost is the same regardless of wether all
 100 of those fields exist because of a single dynamicField/ declaration,
 or 100 distinct field/ declarations.


Ahh...thanks for the clarification.

So, in general, there is no *significant* performance difference with using
dynamic fields. Correct?


-- 
A. Steven Anderson
Independent Consultant
st...@asanderson.com


Re: performance question

2010-01-02 Thread Lance Norskog
Sorting and index norms have space penalties.

Sorting on a field creates an array of Java ints, one for every
document in the index. Index norms (used for boosting documents and
other things) create an array of bytes in the Lucene index files, one
for every document in the index.

If you sort on many of your dynamic fields your memory use will
explode, and the same with index norms and disk space.

On Wed, Dec 30, 2009 at 6:54 AM, A. Steven Anderson
a.steven.ander...@gmail.com wrote:
 There can be an impact if you are searching against a lot of fields or if
 you are indexing a lot of fields on every document, but for the most part in
 most applications it is negligible.


 We index a lot of fields at one time, but we can tolerate the performance
 impact at index time.

 It probably can't hurt to be more streamlined, but without knowing more
 about your model, it's hard to say.  I've built apps that were totally
 dynamic field based and they worked just fine, but these were more for
 discovery than just pure search.  In other words, the user was interacting
 with the system in a reflective model that selected which fields to search
 on.


 Our application is as much about discovery as search, so this is good to
 know.

 Thanks for the feedback. It was very helpful.
 --
 A. Steven Anderson
 Independent Consultant
 st...@asanderson.com




-- 
Lance Norskog
goks...@gmail.com
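
Lance's warning is easy to quantify with rough arithmetic; the document and
field counts below are made up:

    public class SortMemorySketch {
        // One Java int per document is cached for every distinct field
        // you sort on, per the note above.
        public static void main(String[] args) {
            long docs = 10000000L;  // 10M documents (hypothetical)
            int sortedFields = 100; // distinct (possibly dynamic) fields sorted on
            long bytes = docs * 4L * sortedFields;
            System.out.println(bytes / (1024 * 1024) + " MB of sort caches"); // ~3815 MB
        }
    }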


Re: performance question

2009-12-30 Thread Grant Ingersoll

On Dec 29, 2009, at 2:19 PM, A. Steven Anderson wrote:

 Greetings!
 
 Is there any significant negative performance impact of using a
 dynamicField?

There can be an impact if you are searching against a lot of fields or if you 
are indexing a lot of fields on every document, but for the most part in most 
applications it is negligible. 

 
 Likewise for multivalued fields?

No.  Multivalued fields are just concatenated together with a large position 
gap underneath the hood.

 
 The reason why I ask is that our system basically aggregates data from many
 disparate data sources (structured, unstructured, and semi-structured), and
 the management of the schema.xml has become unwieldy; i.e. we currently have
 dozens of fields which grows every time we add a new data source.
 
 I was considering redefining the domain model outside of Solr which would be
 used to generate the fields for the indexing process and the metadata (e.g.
 display names) for the search process.
 
 Thoughts?

It probably can't hurt to be more streamlined, but without knowing more about 
your model, it's hard to say.  I've built apps that were totally dynamic field 
based and they worked just fine, but these were more for discovery than just 
pure search.  In other words, the user was interacting with the system in a 
reflective model that selected which fields to search on.

-Grant

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
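
A note on the "large position gap" Grant mentions: it is the
positionIncrementGap on the field type, which keeps phrase queries from
matching across two different values of a multiValued field. A sketch (the
names are illustrative):

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <field name="author" type="text" indexed="true" stored="true" multiValued="true"/>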



Re: performance question

2009-12-30 Thread A. Steven Anderson
 There can be an impact if you are searching against a lot of fields or if
 you are indexing a lot of fields on every document, but for the most part in
 most applications it is negligible.


We index a lot of fields at one time, but we can tolerate the performance
impact at index time.

It probably can't hurt to be more streamlined, but without knowing more
 about your model, it's hard to say.  I've built apps that were totally
 dynamic field based and they worked just fine, but these were more for
 discovery than just pure search.  In other words, the user was interacting
 with the system in a reflective model that selected which fields to search
 on.


Our application is as much about discovery as search, so this is good to
know.

Thanks for the feedback. It was very helpful.
-- 
A. Steven Anderson
Independent Consultant
st...@asanderson.com


performance question

2009-12-29 Thread A. Steven Anderson
Greetings!

Is there any significant negative performance impact of using a
dynamicField?

Likewise for multivalued fields?

The reason why I ask is that our system basically aggregates data from many
disparate data sources (structured, unstructured, and semi-structured), and
the management of the schema.xml has become unwieldy; i.e. we currently have
dozens of fields which grows every time we add a new data source.

I was considering redefining the domain model outside of Solr which would be
used to generate the fields for the indexing process and the metadata (e.g.
display names) for the search process.

Thoughts?
-- 
A. Steven Anderson
Independent Consultant
st...@asanderson.com
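
For concreteness, the schema pattern under discussion is a few fixed
required fields plus catch-all dynamic fields, along these lines (a sketch;
names and types are illustrative):

    <field name="id" type="string" indexed="true" stored="true" required="true"/>

    <!-- any field name ending in _s or _t is accepted at index time
         without touching schema.xml again -->
    <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
    <dynamicField name="*_t" type="text" indexed="true" stored="true" multiValued="true"/>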


Re: Performance question: Solr 64 bit java vs 32 bit mode.

2007-11-20 Thread Otis Gospodnetic
Solr runs equally well on both 64-bit and 32-bit systems.

Your 15 second problem could be caused by IO bottleneck (not likely if your 
index is small and fits in RAM), could be concurrency (esp. if you are using 
compound index format), could be something else on production killing your CPU, 
could be the JVM being busy sweeping the garbage out, etc.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Robert Purdy [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thursday, November 15, 2007 4:05:00 PM
Subject: Performance question: Solr 64 bit java vs 32 bit mode.


Would anyone know if solr runs better in 64bit java vs 32 bit and could
answer another possible related question.

I currently have two servers running solr under identical tomcat
installations. One is the production server and is under heavy user
 load and
the other is under no load at all because it is a test box.

I was looking in the logs on the production server and noticed some
 queries
were taking about 15 seconds, and this is after auto-warming. So I
 decided
to execute that same query on the other server with nothing in the
 caches
and found that it only took 2 seconds to complete. 

My question is why an Dual Intel Core Duo  Xserve server in 64 bit java
 mode
with 8GB of ram allocated to the tomcat server be slower than a Dual
 Power
PC G5 server running in 32 bit mode with only 2GB of ram allocated? Is
 it
because of the load/concurrrency issues on the production sever that
 made
the time next to the query in the log greater on the production server?
 If
so what is the best way to configure tomcat to deal with that issue? 

Thanks Robert.
-- 
View this message in context:
 
http://www.nabble.com/Performance-question%3A-Solr-64-bit-java-vs-32-bit-mode.-tf4817186.html#a13781791
Sent from the Solr - User mailing list archive at Nabble.com.
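
One quick way to test Otis's garbage-collection theory is to start the
production Tomcat's JVM with GC logging and look for long pauses around the
slow queries. These are standard HotSpot flags, not settings from this
thread:

    JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log"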






Performance question: Solr 64 bit java vs 32 bit mode.

2007-11-15 Thread Robert Purdy

Would anyone know if solr runs better in 64bit java vs 32 bit and could
answer another possible related question.

I currently have two servers running solr under identical tomcat
installations. One is the production server and is under heavy user load and
the other is under no load at all because it is a test box.

I was looking in the logs on the production server and noticed some queries
were taking about 15 seconds, and this is after auto-warming. So I decided
to execute that same query on the other server with nothing in the caches
and found that it only took 2 seconds to complete. 

My question is why would a Dual Intel Core Duo Xserve server in 64-bit Java mode
with 8GB of RAM allocated to the Tomcat server be slower than a Dual Power
PC G5 server running in 32-bit mode with only 2GB of RAM allocated? Is it
because of the load/concurrency issues on the production server that made
the time next to the query in the log greater on the production server? If
so, what is the best way to configure Tomcat to deal with that issue? 

Thanks Robert.
-- 
View this message in context: 
http://www.nabble.com/Performance-question%3A-Solr-64-bit-java-vs-32-bit-mode.-tf4817186.html#a13781791
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Phrase Query Performance Question and score threshold

2007-11-05 Thread Yonik Seeley
On 11/5/07, Haishan Chen [EMAIL PROTECTED] wrote:
 As for the first issue: the number of different phrase queries with
 performance issues that I have found so far is about 10.

If these are normal phrase queries (no slop), a good solution might be
to simply index and query these phrases as a single token.  One could
do this with a SynonymFilter.

Oh, and no, a score threshold won't help performance.

 I believe there will be a lot more; I just haven't tried. It can be solved by
 using faster hardware though. Also I believe it would help if Solr had a
 distributed search architecture similar to Nutch's, so that it can scale out
 instead of scaling up.

It's coming...

-Yonik
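
A minimal sketch of that single-token approach, assuming an analyzer in
schema.xml plus a hand-maintained synonyms.txt (the field type and file
names here are illustrative, not from this thread):

    <!-- schema.xml: collapse each known phrase into one token -->
    <fieldType name="text_phrase" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- synonyms.txt (hypothetical) contains lines like:
             auto repair => autorepair -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="false"/>
      </analyzer>
    </fieldType>

Once collapsed, a mapped phrase is matched as an ordinary term lookup, with
no position data read at query time; note the query must arrive as a quoted
phrase so the analyzer sees both tokens together.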


Re: Phrase Query Performance Question

2007-11-02 Thread Walter Underwood
He means extremely frequent and I agree. --wunder

On 11/2/07 1:51 AM, Haishan Chen [EMAIL PROTECTED] wrote:

 Thanks for the advice. You certainly have a point. I believe you mean a query
 term that appears in 5-10% of an index in a natural language corpus is
 extremely INFREQUENT?  



RE: Phrase Query Performance Question

2007-11-02 Thread Haishan Chen




From: [EMAIL PROTECTED]
Subject: Re: Phrase Query Performance Question
Date: Thu, 1 Nov 2007 11:25:26 -0700
To: solr-user@lucene.apache.org

On 31-Oct-07, at 11:54 PM, Haishan Chen wrote:

Date: Wed, 31 Oct 2007 17:54:53 -0700
Subject: Re: Phrase Query Performance Question
From: [EMAIL PROTECTED]
To: solr-[EMAIL PROTECTED]

hurricane katrina is a very expensive query against a collection focused on
Hurricane Katrina. There will be many matches in many documents. If you want
to measure worst-case, this is fine.

I'd try other things, like:

* ninth ward
* Ray Nagin
* Audubon Park
* Canal Street
* French Quarter
* FEMA mistakes
* storm surge
* Jackson Square

Of course, real query logs are the only real test.

wunder

These terms are not frequent in my index. I believe they are going to be
fast. The thing is that I feel 2 million documents is a small index.
100,000 or 200,000 hits is a small set and should always have sub-second
query performance. Now I am only querying one field and the response is
almost one second. I feel I can't achieve sub-second performance if I add a
bit more complexity to the query.

Many of the category terms in my index will appear in more than 5% of the
documents and those category terms are very popular search terms. So the
examples I gave were not extreme cases for my index.

I think that you are somewhat misguided about what constitutes a small set.
A query term that appears in 5-10% of the index in a natural language corpus
is _extremely_ frequent. Not quite on the order of stopwords, but getting
there. As a comparison, on an extremely large corpus that I have handy,
documents containing both the word 'auto' and 'repair' (not necessarily
adjacent) constitute 0.1% of the index. The frequency of the phrase auto
repair is 0.025%.

@200k docs would be the response rate from an 800-million-doc corpus.

What data are you indexing, and what is the intended effect of the phrase
queries you are performing? Perhaps getting at the issue from this end would
be more productive than hammering at the phrase-query performance question.
 
 
 
 
Thanks for the advice. You certainly have a point. I believe you mean a query 
term that appears in 5-10% of an index in a  natural language corpus is 
extremely INFREQUENT?  
 
 
 
 
When I start tomcat I saw this message:
The Apache Tomcat Native library which allows optimal performance
in production environments was not found on the java.library.path

Does that mean that if I use the Apache Tomcat Native library the query
performance will be better? Does anyone have experience with that?

Unlikely, though it might help you slightly at a high query rate with
high cache hit ratios.

-Mike
 
I have tried the Apache Tomcat Native library on my Windows machine and you are
right: no obvious difference in query performance.
 
 
 
I have tried the index on a Linux machine.
The Windows machine: Windows 2003, one Intel(R) Xeon(TM) CPU 3.00 GHz
(quad-core CPU), 4G RAM
The Linux machine: (not sure what version of Linux), two Intel(R) Xeon(R) CPU
E5310 1.60 GHz (quad-core CPUs), 4G RAM
 
Both systems have RAID5 but I don't know the difference.
 
I found a substantial indexing performance improvement on the Linux machine.
On the Windows machine it took more than 5 hours, but it took only one hour to
index 2 million documents on the Linux system. I am really happy to see that.
I guess both Linux and the extra CPU contributed to the improvement.
 
Query performance is almost the same though. The CPU on the Linux machine is
slower, so I think that if the Linux system were using the same CPU as the
Windows system, query performance would improve too. Both indexing and
querying are CPU bound, if I am right.
 
I guess I have learned enough on this question. But I still want to try
solr-trunk. Will update everyone later.
 
 
 
Thanks
-Haishan
 
 
 
 
 
 
 
 
 
 
 
 
 
_
Boo! Scare away worms, viruses and so much more! Try Windows Live OneCare!
http://onecare.live.com/standard/en-us/purchase/trial.aspx?s_cid=wl_hotmailnews

Re: Phrase Query Performance Question

2007-11-02 Thread Mike Klaas

On 2-Nov-07, at 10:03 AM, Haishan Chen wrote:






Date: Fri, 2 Nov 2007 07:32:30 -0700
Subject: Re: Phrase Query Performance Question
From: [EMAIL PROTECTED]
To: solr-[EMAIL PROTECTED]

He means extremely frequent and I agree. --wunder



Then a PHRASE (a combination of terms, excluding stopwords) that
appears in 5% to 10% of an index should NOT be that frequent? I
guess I get the idea.


Phrases should be rarer than individual keywords.  5-10% is  
moderately high even for a _single_ keyword, let alone the  
conjunction of two keywords, let alone the _exact phrase_ of two  
keywords (non stopwords in all of this discussion).


As I mentioned, the 'natural' rate of 'auto'+'repair' on a corpus  
100's of times bigger than yours (web documents) is .1%, and the rate  
of the phrase 'auto repair' is .025%.


It still feels to me that you are trying to do something unique with
your phrase queries.  Unfortunately, you still haven't said what you
are trying to do in general terms, which makes it very difficult for
people to help you.


-Mike


Re: Phrase Query Performance Question

2007-11-02 Thread Chris Hostetter

: It still feels to me that you are trying to do something unique with your
: phrase queries.  Unfortunately, you still haven't said what you are trying to
: do in general terms, which makes it very difficult for people to help you.

Agreed.  This seems like a very special case, but we don't know what the case is.

If there are specific phrases you know in advance that you will care 
about, and those phrases occur as frequently as the individual 
words, then the best way to deal with them is to index each phrase as 
a single Term (and ignore the individual words)

Speaking more generally to mike's point...

http://people.apache.org/~hossman/#xyproblem
Your question appears to be an XY Problem ... that is: you are dealing
with X, you are assuming Y will help you, and you are asking about Y
without giving more details about the X so that we can understand the
full issue.  Perhaps the best solution doesn't involve Y at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341





-Hoss



RE: Phrase Query Performance Question

2007-11-02 Thread Haishan Chen


 Date: Fri, 2 Nov 2007 12:31:29 -0700
 From: [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Subject: Re: Phrase Query Performance Question

 : It still feels to me that you are trying to do something unique with your
 : phrase queries. Unfortunately, you still haven't said what you are trying
 : to do in general terms, which makes it very difficult for people to help
 : you.

 Agreed. This seems like a very special case, but we don't know what the
 case is.

 If there are specific phrases you know in advance that you will care about,
 and those phrases occur as frequently as the individual words, then the
 best way to deal with them is to index each phrase as a single Term (and
 ignore the individual words)

 Speaking more generally to mike's point...

 http://people.apache.org/~hossman/#xyproblem
 Your question appears to be an XY Problem ... that is: you are dealing
 with X, you are assuming Y will help you, and you are asking about Y
 without giving more details about the X so that we can understand the full
 issue. Perhaps the best solution doesn't involve Y at all? See Also:
 http://www.perlmonks.org/index.pl?node_id=542341

 -Hoss
I think the documents I was indexing cannot be considered natural language 
documents. They are constructed following certain rules and then fed into the 
indexing process. I guess that because of those rules many of the target search 
terms have a high document frequency. I am under no obligation to achieve the 
quarter-second performance; I am just interested to see whether it is achievable. 
 
Thanks everyone for offering advice
-Haishan
 
 
 
 
 
 
 
_
Help yourself to FREE treats served up daily at the Messenger Café. Stop by 
today.
http://www.cafemessenger.com/info/info_sweetstuff2.html?ocid=TXT_TAGLM_OctWLtagline

Re: Phrase Query Performance Question

2007-11-01 Thread Mike Klaas

On 31-Oct-07, at 11:54 PM, Haishan Chen wrote:



Date: Wed, 31 Oct 2007 17:54:53 -0700
Subject: Re: Phrase Query Performance Question
From: [EMAIL PROTECTED]
To: solr-[EMAIL PROTECTED]

hurricane katrina is a very expensive query against a collection focused on
Hurricane Katrina. There will be many matches in many documents. If you want
to measure worst-case, this is fine.

I'd try other things, like:

* ninth ward
* Ray Nagin
* Audubon Park
* Canal Street
* French Quarter
* FEMA mistakes
* storm surge
* Jackson Square

Of course, real query logs are the only real test.

wunder


These terms are not frequent in my index. I believe they are going  
to be fast. The thing is that I feel 2 million documents is a small  
index.
100,000 or 200,000 hits is a small set and should always have sub  
second query performance. Now I am only querying one field and the
response is almost one second. I feel I can't achieve sub second  
performance if I add a bit more complexity to the query.


Many of the category terms in my index will appear in more than 5%
of the documents and those category terms are very popular search
terms. So the examples I gave were not extreme cases for my index.


I think that you are somewhat misguided about what constitutes a  
small set.  A query term that appears in 5-10% of the index in a  
natural language corpus is _extremely_ frequent.  Not quite on the  
order of stopwords, but getting there.  As a comparison, on an  
extremely large corpus that I have handy, documents containing both  
the word 'auto' and 'repair' (not necessarily adjacent) constitute  
0.1% of the index.  The frequency of the phrase auto repair is 0.025%.


@200k docs would be the response rate from an 800-million-doc corpus
(800,000,000 x 0.00025 = 200,000).

What data are you indexing, and what is the intended effect of the  
phrase queries you are performing?  Perhaps getting at the issue from  
this end would be more productive than hammering at the phrase-query  
performance question.



When I start tomcat I saw this message:
The Apache Tomcat Native library which allows optimal performance  
in production environments was not found on the java.library.path


Does that mean that if I use the Apache Tomcat Native library the query  
performance will be better? Does anyone have experience with that?


Unlikely, though it might help you slightly at a high query rate with  
high cache hit ratios.


-Mike


RE: Phrase Query Performance Question

2007-10-31 Thread Haishan Chen




 From: [EMAIL PROTECTED]
 Subject: Re: Phrase Query Performance Question
 Date: Tue, 30 Oct 2007 11:22:17 -0700
 To: solr-user@lucene.apache.org

 On 30-Oct-07, at 6:09 AM, Yonik Seeley wrote:

 On 10/30/07, Haishan Chen [EMAIL PROTECTED] wrote:

 Thanks a lot for replying Yonik!

 I am running Solr on a Windows 2003 server (standard version), Intel Xeon
 CPU 3.00GHz, with 4.00 GB RAM. The index is located on RAID5 with 2 million
 documents. Is there any way to improve query performance without moving to
 a more powerful computer?

 I understand that the query performance of the phrase query ("auto repair")
 has to do with the number of documents containing the two words. In fact
 the number of documents that have auto and repair is about 100,000. It is
 like 5% of the documents containing auto and repair. It seems to me 937 ms
 is too slow.

 Chen, that does seem slow; I'm not sure why.
 1) was this the first search on the index? if so, try running some other
 searches to warm things up first.

 Indeed--phrase matching uses a completely different part of the index, so
 that needs to be warmed too.

 One thing to try is solr trunk: it contains some speedups for phrase
 queries (though perhaps not as substantial as you hope for).

 -Mike
 
Thanks for replying.
The statistics I collected were not from the first query. And I believe I was 
running the JVM in server mode: 
I configured Tomcat to use the server version of jvm.dll. I guess that is the 
way to set it on Windows.
I executed the same phrase query ("auto repair") over and over again and that 
is the best performance I observed. 
Also, when I did the test I disabled all Solr caches. I wanted to see the 
performance without the Solr cache.
 
I am currently trying to test the index on a Linux system with similar
hardware. It will take me some time to set it up.
 
I read a discussion between Doug Cutting and Andrzej Bialecki about Lucene 
performance:
 
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200512.mbox/[EMAIL PROTECTED]

It mentioned that http://websearch.archive.org/katrina/ (in Nutch) had 10M 
documents and a search of hurricane katrina was able to return in 1.35 
seconds with 600,867 hits. Although the computer it was using might be more 
powerful than mine, I feel 937ms for a phrase query on a single field is kind 
of slow. Nutch actually expands a search into more complex queries. My index 
and the number of hits on my query (auto repair) are about one fifth of 
websearch.archive.org's and its testing query's. So I feel a reasonable 
performance for my query should be less than 300 ms. I am not sure if I am 
right on that logic.
 
Anyway I will collect the statistics on Linux first and try out other options. 
 
 
Thanks a lot
Haishan
_
Windows Live Hotmail and Microsoft Office Outlook – together at last.  Get it 
now.
http://office.microsoft.com/en-us/outlook/HA102225181033.aspx?pid=CL100626971033

Re: Phrase Query Performance Question

2007-10-31 Thread Mike Klaas

On 31-Oct-07, at 2:40 PM, Haishan Chen wrote:



http://mail-archives.apache.org/mod_mbox/lucene-java-user/200512.mbox/[EMAIL PROTECTED]
It mentioned that http://websearch.archive.org/katrina/ (in Nutch)  
had 10M documents and a search of hurricane katrina was able to  
return in 1.35 seconds with 600,867 hits. Although the computer  
it was using might be more powerful than mine, I feel 937ms for a  
phrase query on a single field is kind of slow. Nutch actually  
expands a search into more complex queries. My index and the number of  
hits on my query (auto repair) are about one fifth of  
websearch.archive.org's and its testing query's. So I feel a reasonable  
performance for my query should be less than 300 ms. I am not sure  
if I am right on that logic.


I'm not sure that it is reasonable, but I'm not sure that it isn't.   
However, have you tried other queries?  937ms seems a little high,  
even for phrase queries.


Anyway I will collect the statistics on Linux first and try out  
other options.


Have you tried using the performance enhancements present in solr-trunk?

-Mike


RE: Phrase Query Performance Question

2007-10-31 Thread Haishan Chen


 From: [EMAIL PROTECTED]
 Subject: Re: Phrase Query Performance Question
 Date: Wed, 31 Oct 2007 15:25:42 -0700
 To: solr-user@lucene.apache.org

 On 31-Oct-07, at 2:40 PM, Haishan Chen wrote:

 http://mail-archives.apache.org/mod_mbox/lucene-java-user/200512.mbox/[EMAIL PROTECTED]
 It mentioned that http://websearch.archive.org/katrina/ (in Nutch) had 10M
 documents and a search of hurricane katrina was able to return in 1.35
 seconds with 600,867 hits. Although the computer it was using might be more
 powerful than mine, I feel 937ms for a phrase query on a single field is
 kind of slow. Nutch actually expands a search into more complex queries. My
 index and the number of hits on my query (auto repair) are about one fifth
 of websearch.archive.org's and its testing query's. So I feel a reasonable
 performance for my query should be less than 300 ms. I am not sure if I am
 right on that logic.

 I'm not sure that it is reasonable, but I'm not sure that it isn't.
 However, have you tried other queries? 937ms seems a little high, even for
 phrase queries.

 Anyway I will collect the statistics on Linux first and try out other
 options.

 Have you tried using the performance enhancements present in solr-trunk?

 -Mike
 
Here are some query statistics. The phrase queries look slow to me.
These are queries with more than 100,000 hits. For those returning a couple of
thousand hits the response time is quite fast.
But this is querying on one field only.
 
"auto repair"            100384 hits    946 ms
(auto repair)            100384 hits     31 ms
"car repair"~100         112183 hits    766 ms
(car repair)             112183 hits     63 ms
"business service"~100  1209751 hits   1500 ms
(business service)      1209751 hits    234 ms
"shopping center"~100    119481 hits    359 ms
"shopping center"~100    119481 hits     63 ms
 
I don't know what solr-trunk is yet but I will find out.
 
Thank you
Haishan
 
 
 
_
Climb to the top of the charts!  Play Star Shuffle:  the word scramble 
challenge with star power.
http://club.live.com/star_shuffle.aspx?icid=starshuffle_wlmailtextlink_oct

Re: Phrase Query Performance Question

2007-10-31 Thread Walter Underwood
hurricane katrina is a very expensive query against a collection
focused on Hurricane Katrina. There will be many matches in many
documents. If you want to measure worst-case, this is fine.

I'd try other things, like:

* ninth ward
* Ray Nagin
* Audubon Park
* Canal Street
* French Quarter
* FEMA mistakes
* storm surge
* Jackson Square

Of course, real query logs are the only real test.

wunder
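
For timing queries like these, one option (a sketch, assuming the default
example port and a field named content) is to ask Solr for debug output and
read the server-side numbers instead of browser time:

    # QTime in the responseHeader and the timing section of the debug
    # output show where Solr itself spends the milliseconds
    curl 'http://localhost:8983/solr/select?q=content:%22storm+surge%22&debugQuery=true'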

On 10/31/07 3:25 PM, Mike Klaas [EMAIL PROTECTED] wrote:

 On 31-Oct-07, at 2:40 PM, Haishan Chen wrote:
 
 
 http://mail-archives.apache.org/mod_mbox/lucene-java-user/200512.mbox/[EMAIL PROTECTED]
 It mentioned that http://websearch.archive.org/katrina/ (in nutch)
 had 10M documents and a search of hurricane katrina was able to
 return in 1.35 seconds with 600,867 hits. Although the computer
 it was using might be more powerful than mine, I feel 937ms for a
 phrase query on a single field is kind of slow. Nutch actually
 expands a search into more complex queries. My index and the number of
 hits on my query (auto repair) are about one fifth of
 websearch.archive.org's and its testing query's. So I feel a reasonable
 performance for my query should be less than 300 ms. I am not sure
 if I am right on that logic.
 
 I'm not sure that it is reasonable, but I'm not sure that it isn't.
 However, have you tried other queries?  937ms seems a little high,
 even for phrase queries.
 
 Anyway I will collect the statistic on linux first and try out
 other options.
 
 Have you tried using the performance enhancements present in solr-trunk?
 
 -Mike



RE: Phrase Query Performance Question

2007-10-31 Thread Chris Hostetter

: "auto repair"            100384 hits    946 ms
: (auto repair)            100384 hits     31 ms
: "car repair"~100         112183 hits    766 ms
: (car repair)             112183 hits     63 ms
: "business service"~100  1209751 hits   1500 ms
: (business service)      1209751 hits    234 ms
: "shopping center"~100    119481 hits    359 ms
: "shopping center"~100    119481 hits     63 ms

if i'm reading those numbers right, every document in your corpus 
containing the words auto or repair also contains the exact phrase 
auto repair with no slop ... this seems HIGHLY unlikely.  can you show 
us *exactly* what the query URLs you are using look like, and show us what 
the request handler section of your solrconfig.xml looks like.

also: where are you getting these times from?  are these from the logging 
output solr produces, or from the client you have hitting solr?

: I don't know what solr-trunk is yet but I will find out

he's referring to the unreleased development code, which you can check out 
from the trunk of the Solr subversion repository...

http://lucene.apache.org/solr/version_control.html


-Hoss



RE: Phrase Query Performance Question

2007-10-31 Thread Haishan Chen




 Date: Wed, 31 Oct 2007 19:19:07 -0700
 From: [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Subject: RE: Phrase Query Performance Question

 : "auto repair"            100384 hits    946 ms
 : (auto repair)            100384 hits     31 ms
 : "car repair"~100         112183 hits    766 ms
 : (car repair)             112183 hits     63 ms
 : "business service"~100  1209751 hits   1500 ms
 : (business service)      1209751 hits    234 ms
 : "shopping center"~100    119481 hits    359 ms
 : "shopping center"~100    119481 hits     63 ms

 if i'm reading those numbers right, every document in your corpus
 containing the words auto or repair also contains the exact phrase
 auto repair with no slop ... this seems HIGHLY unlikely. can you show
 us *exactly* what the query URLs you are using look like, and show us what
 the request handler section of your solrconfig.xml looks like.
 
 
Yes, that's exactly what the documents are like. The documents are categorized; 
I indexed the category with the content 
of the documents using the text field type. The URL I used is 
select?q=content:"auto repair"~100&fl=title. All other options like 
faceting and highlighting are not used.
 
 also: where are you getting these times from? are these from the logging
 output solr produces, or from the client you have hitting solr?

 : I don't know what solr-trunk is yet but I will find out

 he's referring to the unreleased development code, which you can check out
 from the trunk of the Solr subversion repository...

 http://lucene.apache.org/solr/version_control.html

 -Hoss
 
I am getting the time from the client browser
 
 
Thanks
-Haishan
 
 
 
 
 
 
 
_
Help yourself to FREE treats served up daily at the Messenger Café. Stop by 
today.
http://www.cafemessenger.com/info/info_sweetstuff2.html?ocid=TXT_TAGLM_OctWLtagline

RE: Phrase Query Performance Question

2007-10-31 Thread Haishan Chen




 Date: Wed, 31 Oct 2007 17:54:53 -0700
 Subject: Re: Phrase Query Performance Question
 From: [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org

 hurricane katrina is a very expensive query against a collection focused
 on Hurricane Katrina. There will be many matches in many documents. If you
 want to measure worst-case, this is fine.

 I'd try other things, like:

 * ninth ward
 * Ray Nagin
 * Audubon Park
 * Canal Street
 * French Quarter
 * FEMA mistakes
 * storm surge
 * Jackson Square

 Of course, real query logs are the only real test.

 wunder
 
 
 
These terms are not frequent in my index. I believe they are going to be fast. 
The thing is that I feel 2 million documents is a small index.
100,000 or 200,000 hits is a small set and should always have sub-second query 
performance. Now I am only querying one field and the
response is almost one second. I feel I can't achieve sub-second performance if 
I add a bit more complexity to the query.
 
Many of the category terms in my index will appear in more than 5% of the 
documents and those category terms are very popular search
terms. So the examples I gave were not extreme cases for my index.
 
When I start tomcat I saw this message:
The Apache Tomcat Native library which allows optimal performance in production 
environments was not found on the java.library.path
 
Does that mean that if I use the Apache Tomcat Native library, query 
performance will be better? Does anyone have experience with that?
 
 
 
Thanks a lot
-Haishan
 
 
 
 
 
 
 
 
 
 
 
 
 On 10/31/07 3:25 PM, Mike Klaas [EMAIL PROTECTED] wrote:

 On 31-Oct-07, at 2:40 PM, Haishan Chen wrote:

 http://mail-archives.apache.org/mod_mbox/lucene-java-user/200512.mbox/[EMAIL PROTECTED]
 It mentioned that http://websearch.archive.org/katrina/ (in nutch) had 10M
 documents and a search of hurricane katrina was able to return in 1.35
 seconds with 600,867 hits. Although the computer it was using might be more
 powerful than mine, I feel 937ms for a phrase query on a single field is
 kind of slow. Nutch actually expands a search into more complex queries. My
 index and the number of hits on my query (auto repair) are about one fifth
 of websearch.archive.org's and its testing query's. So I feel a reasonable
 performance for my query should be less than 300 ms. I am not sure if I am
 right on that logic.

 I'm not sure that it is reasonable, but I'm not sure that it isn't.
 However, have you tried other queries? 937ms seems a little high, even for
 phrase queries.

 Anyway I will collect the statistics on Linux first and try out other
 options.

 Have you tried using the performance enhancements present in solr-trunk?

 -Mike
_
Peek-a-boo FREE Tricks  Treats for You!
http://www.reallivemoms.com?ocid=TXT_TAGHMloc=us

RE: Phrase Query Performance Question

2007-10-30 Thread Haishan Chen
Thanks a lot for replying Yonik!
 
I am running Solr on a Windows 2003 server (standard version), Intel Xeon CPU 
3.00GHz, with 4.00 GB RAM.
The index is located on RAID5 with 2 million documents. Is there any way to 
improve query performance without moving to a more powerful computer?
 
I understand that the query performance of the phrase query ("auto repair") has 
to do with the number of documents containing the two words. In fact the number 
of documents that have auto and repair is about 100,000. It is like 5% of the 
documents containing auto and repair. It seems to me 937 ms is too slow.
 
Would it be faster if I ran Solr on a Linux system? If so, how much faster 
would it generally be? My performance target for this kind of phrase query is 
a quarter of a second or so. Any advice on how to achieve this on the above 
hardware?
 
 
Thanks a lot
 
Haishan
 
 
 
 
 
 
 
 
Re: phrase query performance
Yonik Seeley
Fri, 26 Oct 2007 08:09:52 -0700

The differences lie in Lucene. Instead of thinking of phrase queries as slow,
think of term queries as fast :-)  Phrase queries need to read and consider
position information that term queries do not.
-Yonik
 
 
On 10/26/07, Haishan Chen [EMAIL PROTECTED] wrote:

 I am a new Solr user and wonder if anyone can help me with these questions.
 I used Solr to index about two million documents and query on it using the
 standard request handler. I disabled all caches. I found phrase queries
 were substantially slower than the usual queries. The statistics I
 collected are as follows. I was doing the query on one field only.

 content:(auto repair)        47 ms   repeatable
 content:"auto repair"       937 ms   repeatable
 content:"auto repair"~1     766 ms   repeatable

 What are the factors affecting phrase query performance? How come the
 phrase query content:"auto repair" is almost 20 times slower than
 content:(auto repair)? I also notice the phrase query with a slop is always
 faster than the one without a slop. Is the difference I observe here a
 performance problem of Lucene or Solr? It will be appreciated if anyone
 can help
_
Boo! Scare away worms, viruses and so much more! Try Windows Live OneCare!
http://onecare.live.com/standard/en-us/purchase/trial.aspx?s_cid=wl_hotmailnews

Re: Phrase Query Performance Question

2007-10-30 Thread Yonik Seeley
On 10/30/07, Haishan Chen [EMAIL PROTECTED] wrote:
 Thanks a lot for replying Yonik!

 I am running Solr on a Windows 2003 server (standard version), Intel Xeon CPU 
 3.00GHz, with 4.00 GB RAM.
 The index is located on RAID5 with 2 million documents. Is there any way to 
 improve query performance without moving to a more powerful computer?

 I understand that the query performance of the phrase query ("auto repair") 
 has to do with the number of documents containing the two words. In fact the 
 number of documents that have auto and repair is about 100,000. It is like 5% 
 of the documents containing auto and repair. It seems to me 937 ms is too 
 slow.

Chen, that does seem slow; I'm not sure why.
1) was this the first search on the index?  if so, try running some
other searches to warm things up first.
2) was the jvm in server mode?  (start with -server)
3) shut down unrelated things on the system so that there is more
memory available to the OS to cache the index files
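
For point 2, a minimal sketch (assuming Tomcat picks up JAVA_OPTS from its
startup script; the heap sizes are illustrative):

    # bin/setenv.sh, or wherever JAVA_OPTS is set for your Tomcat
    # -server selects the optimizing HotSpot VM; a modest heap leaves the
    # rest of RAM free for the OS to cache the Lucene index files (point 3)
    export JAVA_OPTS="-server -Xms512m -Xmx1g $JAVA_OPTS"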

 Would it be faster if I run solr on linux system?

Maybe... Lucene does rely on the OS caching often used parts of the
index, so this can differ the most between Windows and Linux.  If you
have a Linux box lying around, trying it out quick to remove that
variable would be a good idea.

-Yonik


Re: Phrase Query Performance Question

2007-10-30 Thread Mike Klaas

On 30-Oct-07, at 6:09 AM, Yonik Seeley wrote:


On 10/30/07, Haishan Chen [EMAIL PROTECTED] wrote:

Thanks a lot for replying Yonik!

I am running Solr on a Windows 2003 server (standard version),  
Intel Xeon CPU 3.00GHz, with 4.00 GB RAM.
The index is located on RAID5 with 2 million documents. Is there  
any way to improve query performance without moving to a more  
powerful computer?


I understand that the query performance of the phrase query ("auto  
repair") has to do with the number of documents containing the two  
words. In fact the number of documents that have auto and repair  
is about 100,000. It is like 5% of the documents containing auto  
and repair. It seems to me 937 ms is too slow.


Chen, that does seem slow; I'm not sure why.
1) was this the first search on the index?  if so, try running some
other searches to warm things up first.


Indeed--phrase matching uses a completely different part of the  
index, so that needs to be warmed too.


One thing to try is solr trunk: it contains some speedups for phrase  
queries (though perhaps not as substantial as you hope for).


-Mike
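
A sketch of how that extra warming could be wired into solrconfig.xml (the
query strings are illustrative):

    <!-- solrconfig.xml: run a few phrase queries when a searcher is opened,
         pulling the position data into the OS cache before user traffic -->
    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">content:"auto repair"</str></lst>
        <lst><str name="q">content:"business service"</str></lst>
      </arr>
    </listener>

The same listener can also be registered for the newSearcher event, so
warming happens again after each commit.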




Phrase Query Performance Question

2007-10-26 Thread Haishan Chen
I am a new Solr user and wonder if anyone can help me with these questions. I
used Solr to index about two million documents and query on it using the
standard request handler. I disabled all caches. I found phrase queries were
substantially slower than the usual queries. The statistics I collected are as
follows. I was doing the query on one field only.

content:(auto repair)        47 ms   repeatable
content:"auto repair"       937 ms   repeatable
content:"auto repair"~1     766 ms   repeatable

What are the factors affecting phrase query performance? How come the phrase
query content:"auto repair" is almost 20 times slower than content:(auto
repair)? I also notice that the phrase query with a slop is always faster than
the one without a slop. Is the performance difference I observed here between
a phrase query and a regular query a performance problem of Lucene or Solr?
I was having trouble starting a new discussion thread earlier. Hopefully I did
it right this time.
It will be appreciated if anyone can help. Haishan
_
Climb to the top of the charts!  Play Star Shuffle:  the word scramble 
challenge with star power.
http://club.live.com/star_shuffle.aspx?icid=starshuffle_wlmailtextlink_oct

Re: Dynamic fields performance question

2007-03-26 Thread Yonik Seeley

On 3/26/07, climbingrose [EMAIL PROTECTED] wrote:

I'm developing an application that potentially creates thousands of dynamic
fields.  Does anyone know if a large number of dynamic fields will degrade
Solr performance?


Thousands of fields won't be a problem if
- you don't sort on most of them (sorting by a field takes up memory)
- you can omit norms on most of them

Provided the above is true, differences in searching + indexing
performance shouldn't be noticeable.

-Yonik
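
A schema.xml sketch along those lines (the field suffixes are illustrative):
dynamic fields with norms omitted, intended for matching rather than sorting:

    <!-- catch-all dynamic fields; omitNorms="true" avoids the per-document
         norm array each normed field would otherwise hold in memory -->
    <dynamicField name="*_s" type="string" indexed="true" stored="true" omitNorms="true"/>
    <dynamicField name="*_t" type="text"   indexed="true" stored="true" omitNorms="true"/>

Omitting norms disables length normalization in scoring for those fields,
which is usually acceptable for short metadata values.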


Re: Dynamic fields performance question

2007-03-26 Thread climbingrose

Thanks Yonik. I think both of the conditions hold true for our application
;).

On 3/27/07, Yonik Seeley [EMAIL PROTECTED] wrote:


On 3/26/07, climbingrose [EMAIL PROTECTED] wrote:
 I'm developing an application that potentially creates thousands of dynamic
 fields.  Does anyone know if a large number of dynamic fields will degrade
 Solr performance?

Thousands of fields won't be a problem if
- you don't sort on most of them (sorting by a field takes up memory)
- you can omit norms on most of them

Provided the above is true, differences in searching + indexing
performance shouldn't be noticeable.

-Yonik





--
Regards,

Cuong Hoang


Dynamic fields performance question

2007-03-25 Thread climbingrose

Hi all,

I'm developing an application that potentially creates thousands of dynamic
fields.  Does anyone know if a large number of dynamic fields will degrade
Solr performance?

Thanks.


--
Regards,

Cuong Hoang