A quick update:

I removed the auto warm configurations for the various caches and reduced
the cache sizes. I then issued a call to delete a day's worth of data (800K
documents).

There was no out-of-memory error this time, but some of the nodes went into
recovery mode. I was able to capture some logs this time around, and this is
what I see:

****************
WARN  [2014-04-14 18:11:00.381] [org.apache.solr.update.PeerSync]
PeerSync: core=core1_shard1_replica2 url=http://host1:8983/solr too many
updates received since start - startingUpdates no longer overlaps with our
currentUpdates
INFO  [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy]
PeerSync Recovery was not successful - trying replication.
core=core1_shard1_replica2
INFO  [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy]
Starting Replication Recovery. core=core1_shard1_replica2
INFO  [2014-04-14 18:11:00.535] [org.apache.solr.cloud.RecoveryStrategy]
Begin buffering updates. core=core1_shard1_replica2
INFO  [2014-04-14 18:11:00.536] [org.apache.solr.cloud.RecoveryStrategy]
Attempting to replicate from http://host2:8983/solr/core1_shard1_replica1/.
core=core1_shard1_replica2
INFO  [2014-04-14 18:11:00.536]
[org.apache.solr.client.solrj.impl.HttpClientUtil] Creating new http client,
config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
INFO  [2014-04-14 18:11:01.964]
[org.apache.solr.client.solrj.impl.HttpClientUtil] Creating new http client,
config:connTimeout=5000&socketTimeout=20000&allowCompression=false&maxConnections=10000&maxConnectionsPerHost=10000
INFO  [2014-04-14 18:11:01.969] [org.apache.solr.handler.SnapPuller]
No value set for 'pollInterval'. Timer Task not started.
INFO  [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
Master's generation: 1108645
INFO  [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
Slave's generation: 1108627
INFO  [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
Starting replication process
INFO  [2014-04-14 18:11:02.007] [org.apache.solr.handler.SnapPuller]
Number of files in latest index in master: 814
INFO  [2014-04-14 18:11:02.007]
[org.apache.solr.core.CachingDirectoryFactory] return new directory for
/opt/data/solr/core1_shard1_replica2/data/index.20140414181102007
INFO  [2014-04-14 18:11:02.008] [org.apache.solr.handler.SnapPuller]
Starting download to
NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/data/solr/core1_shard1_replica2/data/index.20140414181102007
lockFactory=org.apache.lucene.store.NativeFSLockFactory@5f6570fe;
maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true

****************


So, it looks like the number of updates is too large for PeerSync recovery,
and the replica falls back to a full copy of the index. Since our index is
very large (350G), this leaves the cluster stuck in recovery, endlessly
trying to copy that huge index.
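
As a side note, the replication handler's "details" command looks like a way
to keep an eye on whether a replica is still doing the full copy (core name
taken from the logs above, host is a placeholder):

curl "http://host1:8983/solr/core1_shard1_replica2/replication?command=details&wt=json&indent=true"

While a fetch is in progress, that should report the master/slave index
generations and how far along the download is.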

I also read in this thread
http://lucene.472066.n3.nabble.com/Recovery-too-many-updates-received-since-start-td3935281.html
that there is a limit of 100 updates for peer sync.

I wonder if that limit has been made configurable since that thread. If not,
the only option I see is to do a "trickle" delete of, say, 100 documents per
second.
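
To make the trickle delete concrete, here is a rough sketch of what I have in
mind - just a shell loop that walks the date range in small windows and pauses
between batches (host, collection, and field names are the placeholders from
the earlier query; window size and sleep would need tuning):

#!/bin/bash
# Trickle delete: walk the date range in small (1-hour) windows and pause
# between batches so each delete-by-query stays small.
SOLR_URL="http://host:port/solr/coll-name1/update"
START=1383955200000            # range start (epoch millis)
END=1385164800000              # range end (epoch millis)
STEP=$((3600 * 1000))          # 1-hour window

from=$START
while [ "$from" -lt "$END" ]; do
  to=$((from + STEP))
  curl -s -H 'Content-Type: text/xml' \
       --data "<delete><query>param1:(val1 OR val2) AND -param2:(val3 OR val4) AND date_param:[$from TO $to]</query></delete>" \
       "$SOLR_URL?commit=true"
  from=$to
  sleep 5                      # breathing room between batches
done

That said, if newer Solr versions let you raise the update log's
numRecordsToKeep (the number of recent updates PeerSync can recover from) in
solrconfig.xml, that might be the cleaner fix - I still need to confirm
whether our version supports it.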

Also - the other suggestion of using "distrib=false" might not help, because
the issue right now is that replication falls back to a full copy.

Any thoughts?

Thanks
Vinay

On 14 April 2014 07:54, Vinay Pothnis <poth...@gmail.com> wrote:

> Yes, that is our approach. We did try deleting a day's worth of data at a
> time, and that resulted in OOM as well.
>
> Thanks
> Vinay
>
>
> On 14 April 2014 00:27, Furkan KAMACI <furkankam...@gmail.com> wrote:
>
>> Hi;
>>
>> I mean you can divide the range (e.g. one week per delete instead of one
>> month) and check whether you still get an OOM or not.
>>
>> Thanks;
>> Furkan KAMACI
>>
>>
>> 2014-04-14 7:09 GMT+03:00 Vinay Pothnis <poth...@gmail.com>:
>>
>> > Aman,
>> > Yes - Will do!
>> >
>> > Furkan,
>> > How do you mean by 'bulk delete'?
>> >
>> > -Thanks
>> > Vinay
>> >
>> >
>> > On 12 April 2014 14:49, Furkan KAMACI <furkankam...@gmail.com> wrote:
>> >
>> > > Hi;
>> > >
>> > > Do you get any problems when you index your data? On the other hand,
>> > > deleting in smaller bulks and reducing the number of documents per
>> > > delete may help you avoid hitting OOM.
>> > >
>> > > Thanks;
>> > > Furkan KAMACI
>> > >
>> > >
>> > > 2014-04-12 8:22 GMT+03:00 Aman Tandon <amantandon...@gmail.com>:
>> > >
>> > > > Vinay please share your experience after trying this solution.
>> > > >
>> > > >
>> > > > On Sat, Apr 12, 2014 at 4:12 AM, Vinay Pothnis <poth...@gmail.com>
>> > > wrote:
>> > > >
>> > > > > The query is something like this:
>> > > > >
>> > > > >
>> > > > > curl -H 'Content-Type: text/xml' --data
>> > > > > '<delete><query>param1:(val1 OR val2) AND -param2:(val3 OR val4)
>> > > > > AND date_param:[1383955200000 TO 1385164800000]</query></delete>'
>> > > > > 'http://host:port/solr/coll-name1/update?commit=true'
>> > > > >
>> > > > > Trying to restrict the number of documents deleted via the date
>> > > > parameter.
>> > > > >
>> > > > > Had not tried the "distrib=false" option. I could give that a try.
>> > > > > Thanks for the link! I will check on the cache sizes and autowarm
>> > > > > values. Will try and disable the caches when I am deleting and give
>> > > > > that a try.
>> > > > >
>> > > > > Thanks Erick and Shawn for your inputs!
>> > > > >
>> > > > > -Vinay
>> > > > >
>> > > > >
>> > > > >
>> > > > > On 11 April 2014 15:28, Shawn Heisey <s...@elyograg.org> wrote:
>> > > > >
>> > > > > > On 4/10/2014 7:25 PM, Vinay Pothnis wrote:
>> > > > > >
>> > > > > >> When we tried to delete the data through a query - say 1
>> > > > > >> day/month's worth of data. But after deleting just 1 month's
>> > > > > >> worth of data, the master node is going out of memory - heap
>> > > > > >> space.
>> > > > > >>
>> > > > > >> Wondering is there any way to incrementally delete the data
>> > > > > >> without affecting the cluster adversely.
>> > > > > >>
>> > > > > >
>> > > > > > I'm curious about the actual query being used here.  Can you
>> > > > > > share it, or a redacted version of it?  Perhaps there might be a
>> > > > > > clue there?
>> > > > > >
>> > > > > > Is this a fully distributed delete request?  One thing you might
>> > > > > > try, assuming Solr even supports it, is sending the same delete
>> > > > > > request directly to each shard core with distrib=false.
>> > > > > >
>> > > > > > Here's a very incomplete list about how you can reduce Solr heap
>> > > > > > requirements:
>> > > > > >
>> > > > > > http://wiki.apache.org/solr/SolrPerformanceProblems#Reducing_heap_requirements
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Shawn
>> > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > With Regards
>> > > > Aman Tandon
>> > > >
>> > >
>> >
>>
>
>
