On Wed, Oct 19, 2011 at 3:58 PM, Pranav Prakash <pra...@gmail.com> wrote:
> Hi guys,
>
> I have set up a Solr instance, and indexing documents is painfully slow. I
> will try to put as much info as I can in this mail. Please feel free to ask
> for anything else that might be required.
>
> I am sending documents in batches of at most 2,000. The batch size varies,
> but is usually around 10-15 MiB. My indexing script tells me that Solr took
> T seconds to add N documents of total size S. For the same data, the add
> QTime in the Solr log is QT. Some sample data:
>
>   N (docs) |  S (bytes)  | T (s) | QT (ms)
> -----------+-------------+-------+---------
>      390   |  3,478,804  | 14.5  |  2297
>      852   |  6,039,535  | 25.3  |  4237
>     1345   | 11,147,512  | 47    |  8543
>     1147   |  9,457,717  | 44    |  2297
>     1096   | 13,058,204  | 54.3  |  8782
>
> The time T includes converting an array of Hash objects into XML, POSTing
> it to Solr, and receiving Solr's acknowledgement. Clearly, there is a huge
> difference between T and QT. After a lot of effort, I still have no clue
> why these times differ so much.
>
> The server has 16 cores and 48 GiB RAM. The JVM options are -Xms5000M
> -Xmx5000M -XX:+UseParNewGC.
>
> I believe my indexing is slow. The relevant portions of my config file are
> as follows. On a related note, every document has one dynamic field. At
> this rate, a full index of my database takes ~30 hours. I would really
> appreciate the community's help in making this indexing faster.
>
> <indexDefaults>
>   <useCompoundFile>false</useCompoundFile>
>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>     <int name="maxMergeCount">10</int>
>     <int name="maxThreadCount">10</int>
>   </mergeScheduler>
>   <ramBufferSizeMB>2048</ramBufferSizeMB>
>   <maxMergeDocs>2147483647</maxMergeDocs>
>   <maxFieldLength>3000000</maxFieldLength>
>   <writeLockTimeout>1000</writeLockTimeout>
>   <maxBufferedDocs>50000</maxBufferedDocs>
>   <termIndexInterval>256</termIndexInterval>
>   <mergeFactor>10</mergeFactor>
>   <!-- <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>     <int name="maxMergeAtOnceExplicit">19</int>
>     <int name="segmentsPerTier">9</int>
>   </mergePolicy> -->
> </indexDefaults>
>
> <mainIndex>
>   <unlockOnStartup>true</unlockOnStartup>
>   <reopenReaders>true</reopenReaders>
>   <deletionPolicy class="solr.SolrDeletionPolicy">
>     <str name="maxCommitsToKeep">1</str>
>     <str name="maxOptimizedCommitsToKeep">0</str>
>   </deletionPolicy>
>   <infoStream file="INFOSTREAM.txt">false</infoStream>
> </mainIndex>
>
> <updateHandler class="solr.DirectUpdateHandler2">
>   <autoCommit>
>     <maxDocs>100000</maxDocs>
>   </autoCommit>
> </updateHandler>
>
>
> *Pranav Prakash*
>
> "temet nosce"
>
> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
> Google <http://www.google.com/profiles/pranny>
>

hey,

are you calling commit after your batches, or doing an optimize, by any chance?
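
By that I mean explicit commit/optimize messages posted to /update after
each batch, e.g. something like this (illustrative, not taken from your
script):

  <commit waitSearcher="true"/>
  <optimize/>

An optimize after every batch in particular can easily dominate the wall
clock time your script measures, while QTime only covers the add itself.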

I would suggest you stream your documents to Solr and commit only when
you really need to. Set your RAM buffer to something between 256 and
320 MB and remove the maxBufferedDocs setting completely. You can also
experiment with your merge settings a little; 10 merge threads seems
like a lot. I know you have plenty of CPU, but IO will be the
bottleneck here.
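
As a sketch, the relevant parts could look something like the following.
The exact numbers are guesses to start experimenting from, not tuned
values, and the 5-minute autoCommit is just one way to commit less often
from the client side:

<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <!-- fewer concurrent merges: merging is IO-bound, not CPU-bound -->
    <int name="maxMergeCount">4</int>
    <int name="maxThreadCount">3</int>
  </mergeScheduler>
  <!-- flush based on RAM alone; maxBufferedDocs removed entirely -->
  <ramBufferSizeMB>320</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>
</indexDefaults>

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- let the server commit on a timer instead of per client batch -->
    <maxTime>300000</maxTime> <!-- 5 min; pick what your freshness needs allow -->
  </autoCommit>
</updateHandler>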

simon
