Hi Community,

We occasionally reindex all of the data in our Auto-Suggest corpus. About
250 million documents are sent for indexing but, because of atomic updates,
the number of unique documents after a full index converges to about 60
million.

We have to index documents atomically to store different names for the same
product (like "bag" and "bags"), to increase its demand, and to store the
months in which it was searched for in the past. One alternative would be to
calculate all of this beforehand and then index into Solr normally
(non-atomic).
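
To make the "calculate beforehand" idea concrete, here is a minimal sketch of
collapsing many per-product records into one document per id before sending
anything to Solr. The field names (names, demand, months) and the
merge_updates helper are hypothetical and only for illustration; they are not
our actual schema or code.

```python
def merge_updates(records):
    """Collapse many partial records into one full document per id."""
    merged = {}
    for rec in records:
        # Create the accumulator for this id the first time we see it.
        doc = merged.setdefault(
            rec["id"],
            {"id": rec["id"], "names": set(), "demand": 0, "months": set()},
        )
        doc["names"].update(rec.get("names", []))    # distinct name variants
        doc["demand"] += rec.get("demand", 0)        # summed demand
        doc["months"].update(rec.get("months", []))  # distinct months searched
    # Convert sets to sorted lists so the documents serialize cleanly to JSON.
    return [
        {
            "id": d["id"],
            "names": sorted(d["names"]),
            "demand": d["demand"],
            "months": sorted(d["months"]),
        }
        for d in merged.values()
    ]
```

With around 250 million input records this in-memory version would probably
not fit in RAM as written; the same merge could be done by sorting or
partitioning the input by id first, so that each id is seen contiguously.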

Once the atomic update process has gone past 50 million documents, indexing
becomes more than 10x slower than it was initially.

As far as I understand, an atomic update fetches the matching document by
uniqueKey and then does a normal index using the information in the fetched
document. Is this fetch what is actually taking the time? As the number of
documents increases, Solr might be taking longer to fetch the stored
document.
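
To illustrate the kind of request I mean, here is a minimal sketch that builds
an atomic update payload using Solr's "add-distinct" and "inc" modifiers. The
field names and the build_atomic_update helper are hypothetical, not our
actual schema or code; the resulting JSON would be posted to the collection's
/update handler.

```python
import json


def build_atomic_update(doc_id, name, demand_delta, month):
    """Return a JSON atomic-update payload for a single document."""
    update = {
        "id": doc_id,                          # uniqueKey used to find the stored doc
        "names": {"add-distinct": [name]},     # add the name only if not already present
        "demand": {"inc": demand_delta},       # increment the numeric field
        "months": {"add-distinct": [month]},   # add the month only if not already present
    }
    # Solr's JSON update format accepts a list of documents.
    return json.dumps([update])
```

Each such request makes Solr read the currently stored document for that id,
apply the modifiers, and reindex the whole document, which is the fetch I am
asking about.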

But shouldn't the fetch by uniqueKey take O(1) time? If this really impacts
the fetch, can we use docValues for the field id (uniqueKey)? Our field is
of type string.



I'm pasting my config lines that may impact this:

----------------------------------------------------------------------------------

-Xmx8g -Xms8g

<field name="id" type="string" indexed="true" stored="true" required="true"
omitNorms="false" multiValued="false" />
<uniqueKey>id</uniqueKey>

<ramBufferSizeMB>2000</ramBufferSizeMB>

<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
        <int name="maxMergeAtOnce">50</int>
        <int name="segmentsPerTier">50</int>
        <int name="maxMergeAtOnce">150</int>
</mergePolicyFactory>

<autoCommit>
        <maxDocs>100000</maxDocs>
        <maxTime>120000</maxTime>
        <openSearcher>false</openSearcher>
</autoCommit>

----------------------------------------------------------------------------------



A normal indexing run that should take less than 1 day actually takes over 5
days with atomic updates. Any experience or suggestions will help. How do
you expedite your indexing process, specifically atomic updates? I know this
might have been asked many times, and I have read and implemented all of the
recommendations. My question is specific to atomic updates: is there
anything exclusive to atomic updates that can make them faster?


-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*
