Hi Michael,
Thank you for your email.
I deployed the latest Usergrid code (master branch), reduced the Tomcat JVM
heap to 4.5 GB and then triggered the re-index job. After 20 hours of
reindexing data, Usergrid seems to have hung / stopped after exhausting the
Java heap.
Below are the logs:
*1. Usergrid logs (out of Java heap space)*
Feb 04 17:06:32 Usergrid-2 catalina.out: 06:36:31,961 WARN
OioServerSocketPipelineSink:83 - Failed to accept a connection.
Feb 04 17:06:32 Usergrid-2 catalina.out: java.lang.OutOfMemoryError: Java
heap space
Feb 04 15:05:03 Usergrid-2 catalina.out: 04:34:25,166 WARN jvm:203 -
[default] [gc][old][29454][2683] duration [54.2s], collections [3]/[54.3s],
total [54.2s]/[13.2h], memory [4.4gb]->[4.4gb]/[4.4gb], all_pools {[young]
[532.5mb]->[532.5mb]/[532.5mb]}{[survivor] [62.5mb]->[65mb]/[66.5mb]}{[old]
[3.8gb]->[3.8gb]/[3.8gb]}
Feb 03 20:38:34 Usergrid-2 catalina.out: 10:08:34,616 ERROR
AmazonAsyncEventService:361 - Failed to index message:
886ea9bd-708d-4bea-ab1c-844ff97c947c
*2. ES logs (time out, removed non-data node from cluster)*
Feb 04 16:00:33 Elasticsearch elasticsearch.log: [2016-02-04
05:31:17,243][INFO
][cluster.service ] [Arlette Truffaut] removed
{[default][3GuynlamR6GvdiCAhhTBmw][ip-10-0-0-237][inet[/10.0.0.237:9301]]{client=true,
data=false},}, reason:
zen-disco-node_failed([default][3GuynlamR6GvdiCAhhTBmw][ip-10-0-0-237][inet[/10.0.0.237:9301]]{client=true,
data=false}), reason failed to ping, tried [3] times, each with maximum
[30s] timeout
Feb 04 16:00:33 Elasticsearch elasticsearch.log: [2016-02-04
05:31:17,256][DEBUG][action.admin.cluster.node.stats]
[Arlette Truffaut] failed to execute on node [3GuynlamR6GvdiCAhhTBmw]
Feb 04 16:00:33 Elasticsearch elasticsearch.log:
org.elasticsearch.transport.NodeDisconnectedException:
[default][inet[/10.0.0.237:9301]][cluster:monitor/nodes/stats[n]]
disconnected
Is it possible that the re-indexing code in Usergrid has a memory leak and is
therefore using up all of the Java heap memory?
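In case it helps, this is roughly how I plan to capture more evidence before
the next run (the Tomcat PID and the heap-dump path below are placeholders):
# sample Tomcat JVM heap / GC occupancy every 5 seconds
jstat -gcutil <tomcat-pid> 5000
# capture a heap dump automatically on the next OOM
export CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp"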
Please suggest.
Many Thanks
Harish
On Wed, Feb 3, 2016 at 5:09 AM, Michael Russo <[email protected]>
wrote:
> Can you take a build from the latest of the master branch? We recently made
> a fix in Usergrid that improves memory usage and the overall speed of
> iterating all of the entities during a re-index, as well as for normal GET
> requests where the limit query parameter has a value > 1.
>
> Those errors show that the Tomcat JVM was having a full GC event on a total
> heap of 11.8 GB. When a full GC occurs, the application will pause. All of
> the testing we have been doing is on an instance where we give the Tomcat
> JVM only 4.5 GB of heap allocation. This makes a full GC take less time than
> what you observed, and you should also see it happen less frequently with
> the improvements in master.
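>
> For reference, a minimal sketch of how that 4.5 GB cap might be applied (the
> setenv.sh location and exact values are assumptions about your Tomcat layout):
> # $CATALINA_HOME/bin/setenv.sh (assumed location)
> export CATALINA_OPTS="$CATALINA_OPTS -Xms4500m -Xmx4500m"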
>
> Thanks.
> -Michael
>
> On Thu, Jan 28, 2016 at 6:06 AM, Harish Singh Bisht <
> [email protected]> wrote:
>
>> Hi Michael,
>>
>> Thank you very much for the help. I checked the logs as you suggested:
>> 1. Elasticsearch logs show the below warnings:
>> Jan 28 19:19:19 Elasticsearch elasticsearch.log: [2016-01-28
>> 08:49:47,115][WARN
>> ][transport ] [Zach] Received response for a request that
>> has timed out, sent [56430ms] ago, timed out [26430ms] ago, action
>> [internal:discovery/zen/fd/ping], node
>> [[default][AGRTPd64QyuiYtE4QKxfYQ][ip-10-0-0-237][inet[/10.0.0.237:9301]]{client=true,
>> data=false}], id [157852]
>>
>> 2. Usergrid logs show the below warning:
>> Jan 28 19:20:29 Usergrid-2 catalina.out: 2016-01-28 08:50:29 WARN
>> elasticsearch.monitor.jvm.monitorLongGc(203)<elasticsearch[default][scheduler][T#1]>-
>> [default] [gc][old][102381][600] duration [56.4s], collections [1]/[57.2s],
>> total [56.4s]/[3.8h], memory [11.8gb]->[11.7gb]/[11.9gb], all_pools
>> {[young] [532.5mb]->[358.6mb]/[532.5mb]}{[survivor]
>> [18.3mb]->[0b]/[66.5mb]}{[old] [11.3gb]->[11.3gb]/[11.3gb]}
>>
>> The re-indexing is still in progress, but in the last 6 hours it seems to
>> be progressing very slowly. We have observed very high RAM usage on the
>> Usergrid machine (which runs Tomcat and Cassandra). The warning logs on the
>> ES machine appeared at the same time that the free RAM dropped on the
>> Usergrid machine.
>> (On the other hand, RAM / CPU on the ES machine seem to be fine.)
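>>
>> (For reference, this is roughly how we have been sampling memory on both
>> machines; the ES host and Tomcat PID are placeholders:)
>> curl -s 'http://<es-host>:9200/_nodes/stats/jvm?pretty'   # ES-side heap / GC
>> free -m && jstat -gcutil <tomcat-pid> 1000 5   # Usergrid-side RAM and Tomcat GC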
>>
>> Please advise.
>>
>> Many thanks
>> Harish
>>
>> On Thu, Jan 28, 2016 at 9:05 AM, Michael Russo <[email protected]>
>> wrote:
>>
>>> 1. This simply means that the processing is being run in the calling
>>> thread (the queue consumer thread that reads off of SQS) instead of the
>>> async pool, because there are no threads available in the async pool. It's
>>> typical during a full re-index or extreme write load. The events are still
>>> getting processed and indexed. We implemented this to ensure that the
>>> system does not get saturated and requests do not get rejected.
>>>
>>> 3. It doesn't return the number of errors / failures, only the number of
>>> entities that it has processed. Currently you'd only be able to see this
>>> information by parsing the log file.
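>>>
>>> A quick way to count those failures from the log might be something like
>>> this (the catalina.out path below is just a placeholder):
>>> grep -c 'Failed to index message' /path/to/catalina.out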
>>>
>>> Thanks.
>>> -Michael
>>>
>>> On Wed, Jan 27, 2016 at 5:28 AM, Harish Singh Bisht <
>>> [email protected]> wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> Thank you very much for the detailed explanation. I deleted the ES data
>>>> folders, increased the ES bulk queue size to 2000, restarted ES / Usergrid
>>>> and then started the re-indexing process. This fixed most of the errors - a
>>>> big thanks to you.
>>>> Below are my observations when I started the re-indexing process today:
>>>>
>>>> 1. Usergrid logs show the below warnings very frequently
>>>> (Usergrid-SQS-Pool task queue full):
>>>> Jan 27 14:33:00 Usergrid-2 catalina.out: 2016-01-27 04:03:00 INFO
>>>> corepersistence.index.ReIndexServiceImpl.lambda$rebuildIndex$97(131)<RxCachedThreadScheduler-14>-
>>>> Sending batch of 1000 to be indexed.
>>>> Jan 27 14:33:01 Usergrid-2 catalina.out: 2016-01-27 04:03:01 WARN
>>>> core.executor.TaskExecutorFactory.rejectedExecution(171)<QueueConsumer_9>-
>>>> Usergrid-SQS-Pool task queue full, rejecting task
>>>> rx.schedulers.ExecutorScheduler$ExecutorSchedulerWorker@4ef107a and
>>>> running in thread QueueConsumer_9
>>>>
>>>> Does this mean some index requests are being rejected, resulting in a
>>>> loss of data in ES? Do I need to set scheduler.io.threads to a higher
>>>> value in the Usergrid properties file? Will this prevent a loss of data
>>>> during the re-indexing?
>>>>
>>>> 2. I needed to trigger the re-index API twice, as the first time it
>>>> appears to have hung / stopped partway through (only 2 indexes were
>>>> created out of 6, and no docs were being created after that).
>>>> After waiting for a few hours, I re-triggered the rebuild API, which seems
>>>> to be working well now (the 5th index is in progress).
>>>>
>>>> 3. Does the reindex 'status' API "
>>>> http://localhost:8080/system/index/rebuild/<job-uuid>" return the number
>>>> of indexing errors / failures?
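>>>> (For reference, I am polling it in the same style as your earlier curl
>>>> examples; I am assuming GET is the right verb for the status call:)
>>>> curl -i -X GET -u <sysadmin user>:<sysadmin pass> "
>>>> http://localhost:8080/system/index/rebuild/<job-uuid>"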
>>>>
>>>> Please advise. Appreciate the help.
>>>>
>>>> Many Thanks,
>>>> Harish
>>>>
>>>> On Mon, Jan 25, 2016 at 10:15 PM, Michael Russo <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Harish,
>>>>>
>>>>> I'm going to try explaining the errors by type:
>>>>>
>>>>> 1)
>>>>>
>>>>> This is occurring because Elasticsearch is not able to start the shard
>>>>> for the index. If your cluster health is red, it is possible that you
>>>>> don't have enough disk space, or at one point did not have enough disk
>>>>> space, so the already created shards for Usergrid indexes are corrupt.
>>>>> Since Usergrid can completely rebuild the index, I'd suggest that you
>>>>> completely delete the Elasticsearch data directories, restart
>>>>> Elasticsearch, restart Usergrid, and begin the re-indexing process again.
>>>>> As an FYI, we have configured in the code a default of 18 shards and 1
>>>>> replica for each index.
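>>>>>
>>>>> A rough sketch of that sequence (the ES data path and service name are
>>>>> assumptions about your install; the rebuild call is the same one from my
>>>>> earlier email):
>>>>> curl -s 'http://localhost:9200/_cluster/health?pretty'   # confirm red status first
>>>>> sudo service elasticsearch stop
>>>>> rm -rf /var/lib/elasticsearch/<cluster-name>/nodes/*     # assumed ES data directory
>>>>> sudo service elasticsearch start
>>>>> # restart Usergrid (Tomcat), then trigger the rebuild:
>>>>> curl -i -X POST -u <sysadmin user>:<sysadmin pass> "
>>>>> http://localhost:8080/system/index/rebuild"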
>>>>>
>>>>> 2.)
>>>>>
>>>>> *Type 1* - Example exceptions that can occur due to the issues seen in
>>>>> #1. Usergrid's use of the Elasticsearch SDK as a client will print these
>>>>> types of statements, as it's aware of the Elasticsearch cluster state.
>>>>>
>>>>> *Type 2* - This shows that the async index queue consumers were unable
>>>>> to index an entity. This is most likely due to #1 as well, but #1 does
>>>>> not always have to be the cause of this exception. If for any reason the
>>>>> consumer threads get an exception while trying to index the entity in
>>>>> Elasticsearch, this type of error will be seen.
>>>>>
>>>>> *Type 3 *- This message goes with 'Type 2'.
>>>>>
>>>>> *Type 4* - We have a finite threadpool size set for the asynchronous
>>>>> indexing (Usergrid-SQS-Pool). This default used to be 100, but is now 40
>>>>> in the master branch. If there is no thread available in the pool (a
>>>>> rejected task), we implemented a 'caller runs' strategy in which the
>>>>> thread trying to hand off the event (Queue_Consumer) to the pool will
>>>>> execute the request itself. The number of threads available in this pool
>>>>> can be configured with the *scheduler.io.threads* property.
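>>>>> For example, in the Usergrid properties file (use whatever file your
>>>>> deployment loads; 100 here is just the old default, not a recommendation):
>>>>> scheduler.io.threads=100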
>>>>>
>>>>> *Type 5* - Elasticsearch also maintains threadpool sizes, and will
>>>>> queue up requests when a thread is not available to process them. The
>>>>> default queue sizes are fairly small if you're aiming for high
>>>>> concurrency. However, if there is a problem with ES and it's unable to
>>>>> process requests properly, the queue will likely fill up regardless. See
>>>>> https://www.elastic.co/guide/en/elasticsearch/reference/2.x/modules-threadpool.html.
>>>>> As an FYI, when we scaled Usergrid to 10,000 requests per second, we set
>>>>> our queue sizes for search and bulk to 5000. That is a bit high; you
>>>>> should probably start with something lower like 1000 if you plan to
>>>>> change it.
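>>>>> For example, in elasticsearch.yml, something along these lines (double-check
>>>>> the exact keys for your ES version; 1000 follows the suggestion above):
>>>>> threadpool.bulk.queue_size: 1000
>>>>> threadpool.search.queue_size: 1000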
>>>>>
>>>>>
>>>>> Hope this info helps! Let me know how things progress.
>>>>>
>>>>> Thanks!
>>>>> -Michael
>>>>>
>>>>>
>>>>> On Mon, Jan 25, 2016 at 5:06 AM, Harish Singh Bisht <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> (resending the email as I found another type of error in the logs)
>>>>>> Thank you very much for the help. I tried the full system re-index
>>>>>> API, but I am still seeing the below issues:
>>>>>>
>>>>>> 1. One of the ES indexes (usergrid_applications_4) still has 15
>>>>>> shards in "INITIALIZING" state. The ES cluster health is red.
>>>>>> Log Extract from ES (details in appendices) shows that:
>>>>>> "[usergrid_applications_4][14] failed to start shard.".
>>>>>>
>>>>>> 2. Usergrid logs show the below types of errors after invoking the
>>>>>> re-index api (details in appendices):
>>>>>> Type 1
>>>>>> UnavailableShardsException[[usergrid_applications_4][7] Primary shard
>>>>>> is not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>>> org.elasticsearch.action.bulk.BulkShardRequest@155fc46c]
>>>>>> Type 2
>>>>>> corepersistence.asyncevents.AmazonAsyncEventService.lambda$callEventHandlers$65(359)<Usergrid-SQS-Pool-13>-
>>>>>> Failed to index message: 17ed55a5-3091-4f0d-8620-12f2915668c1
>>>>>> Type 3
>>>>>> corepersistence.asyncevents.AmazonAsyncEventService.lambda$null$70(735)<Usergrid-SQS-Pool-16>-
>>>>>> Missing messages from queue post operation
>>>>>> Type 4
>>>>>> core.executor.TaskExecutorFactory.rejectedExecution(171)<QueueConsumer_11>-
>>>>>> Usergrid-SQS-Pool task queue full, rejecting task
>>>>>> rx.schedulers.ExecutorScheduler$ExecutorSchedulerWorker@131a4c5 and
>>>>>> running in thread QueueConsumer_11
>>>>>> Type 5
>>>>>> EsRejectedExecutionException[rejected execution (queue capacity 300)
>>>>>> on
>>>>>> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@13afccc8
>>>>>> ];
>>>>>>
>>>>>> Please suggest the next steps. Appreciate the help.
>>>>>>
>>>>>> Many Thanks,
>>>>>> Harish
>>>>>>
>>>>>>
>>>>>>
>>>>>> -------------------------------------------------------------------
>>>>>> *Appendices*
>>>>>> *1. ELASTICSEARCH LOG*
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: [2016-01-25
>>>>>> 06:40:44,227][WARN
>>>>>> ][indices.cluster ] [Blindside] [usergrid_applications_4][14]
>>>>>> failed to start shard
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:
>>>>>> org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
>>>>>> [usergrid_applications_4][14] failed to recover shard
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>> org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:287)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>> org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>> java.lang.Thread.run(Thread.java:745)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: Caused by:
>>>>>> org.elasticsearch.index.translog.TranslogCorruptedException: translog
>>>>>> corruption while reading from stream
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>> org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:70)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>> org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:257)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: ... 4 more
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: Caused by:
>>>>>> java.io.EOFException
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>> org.elasticsearch.common.io.stream.InputStreamStreamInput.readBytes(InputStreamStreamInput.java:53)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>> org.elasticsearch.index.translog.BufferedChecksumStreamInput.readBytes(BufferedChecksumStreamInput.java:55)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>> org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:86)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>> org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:74)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>> org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:495)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>> org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: ... 5 more
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: [2016-01-25
>>>>>> 06:40:44,279][WARN
>>>>>> ][cluster.action.shard ] [Blindside] [usergrid_applications_4][14]
>>>>>> sending failed shard for [usergrid_applications_4][14],
>>>>>> node[lb-HRQpWRQGCeIadzTEHSw], [P], s[INITIALIZING], indexUUID
>>>>>> [fdyaoJQZQKuFeBONTQSD1g], reason [Failed to start shard, message
>>>>>> [IndexShardGatewayRecoveryException[[usergrid_applications_4][14] failed
>>>>>> to
>>>>>> recover shard]; nested: TranslogCorruptedException[translog corruption
>>>>>> while reading from stream]; nested: EOFException; ]]
>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: [2016-01-25
>>>>>> 06:40:44,279][WARN
>>>>>> ][cluster.action.shard ] [Blindside] [usergrid_applications_4][14]
>>>>>> received shard failed for [usergrid_applications_4][14],
>>>>>> node[lb-HRQpWRQGCeIadzTEHSw], [P], s[INITIALIZING], indexUUID
>>>>>> [fdyaoJQZQKuFeBONTQSD1g], reason [Failed to start shard, message
>>>>>> [IndexShardGatewayRecoveryException[[usergrid_applications_4][14] failed
>>>>>> to
>>>>>> recover shard]; nested: TranslogCorruptedException[translog corruption
>>>>>> while reading from stream]; nested: EOFException; ]]
>>>>>> ......
>>>>>> *2. USERGRID LOG*
>>>>>> Jan 25 16:05:19 Usergrid-2 catalina.out: 2016-01-25 05:35:19 INFO
>>>>>> rest.system.IndexResource.rebuildIndexesPost(78)<http-bio-80-exec-92>-
>>>>>> Rebuilding all applications
>>>>>> Jan 25 16:05:19 Usergrid-2 catalina.out: 2016-01-25 05:35:19 INFO
>>>>>> corepersistence.index.ReIndexServiceImpl.lambda$rebuildIndex$97(131)<RxCachedThreadScheduler-35>-
>>>>>> Sending batch of 1000 to be indexed.
>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out: 2016-01-25 05:35:20 ERROR
>>>>>> index.impl.EsIndexProducerImpl.sendRequest(209)<Usergrid-SQS-Pool-13>-
>>>>>> Unable to index
>>>>>> id=appId(cd2bd460-a3e8-11e5-a327-0a75091e6d25,application).entityId(7c3328cc-bdd1-11e5-88d3-0a75091e6d25,activity).version(7c3328cd-bdd1-11e5-88d3-0a75091e6d25).nodeId(99400999-a3ef-11e5-a327-0a75091e6d25,group).edgeName(zzzcollzzz|feed).nodeType(TARGET),
>>>>>> type=entity, index=usergrid_applications_4,
>>>>>> failureMessage=UnavailableShardsException[[usergrid_applications_4][4]
>>>>>> Primary shard is not active or isn't assigned is a known node. Timeout:
>>>>>> [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3d277a1
>>>>>> ]
>>>>>> .....
>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out: 2016-01-25 05:35:20 ERROR
>>>>>> corepersistence.asyncevents.AmazonAsyncEventService.lambda$callEventHandlers$65(359)<Usergrid-SQS-Pool-13>-
>>>>>> Failed to index message: 17ed55a5-3091-4f0d-8620-12f2915668c1
>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out: java.lang.RuntimeException:
>>>>>> Error during processing of bulk index operations one of the responses
>>>>>> failed.
>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>> UnavailableShardsException[[usergrid_applications_4][4] Primary shard is
>>>>>> not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>>> org.elasticsearch.action.bulk.BulkShardRequest@3d277a1]
>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>> UnavailableShardsException[[usergrid_applications_4][7] Primary shard is
>>>>>> not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>>> org.elasticsearch.action.bulk.BulkShardRequest@155fc46c]
>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>> UnavailableShardsException[[usergrid_applications_4][10] Primary shard
>>>>>> is
>>>>>> not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>>> org.elasticsearch.action.bulk.BulkShardRequest@445968e3]
>>>>>> .......
>>>>>> Jan 25 16:53:02 Usergrid-2 catalina.out: 2016-01-25 06:23:02 ERROR
>>>>>> index.impl.EsIndexProducerImpl.sendRequest(209)<Usergrid-SQS-Pool-97>-
>>>>>> Unable to index
>>>>>> id=appId(65ccc2b7-bde0-11e5-88d3-0a75091e6d25,application).entityId(b7dcf6c9-bf62-11e5-88d3-0a75091e6d25,activity).version(b7dcf6cb-bf62-11e5-88d3-0a75091e6d25).nodeId(65ccc2b7-bde0-11e5-88d3-0a75091e6d25,application).edgeName(zzzcollzzz|activities).nodeType(TARGET),
>>>>>> type=entity, index=usergrid_applications_3,
>>>>>> failureMessage=RemoteTransportException[[Blindside][inet[/10.0.0.148:9300]][indices:data/write/bulk[s]]];
>>>>>> nested: EsRejectedExecutionException[rejected execution (queue capacity
>>>>>> 300) on
>>>>>> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@13afccc8
>>>>>> ];
>>>>>> Jan 25 16:53:03 Usergrid-2 catalina.out:
>>>>>> RemoteTransportException[[Blindside][inet[/10.0.0.148:9300]][indices:data/write/bulk[s]]];
>>>>>> nested: EsRejectedExecutionException[rejected execution (queue capacity
>>>>>> 300) on
>>>>>> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@59da0cf2
>>>>>> ];
>>>>>>
>>>>>> On Mon, Jan 25, 2016 at 5:54 PM, Harish Singh Bisht <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi Michael,
>>>>>>>
>>>>>>> Thank you very much for the help. I tried the full system re-index
>>>>>>> API, but I am still seeing the below issues:
>>>>>>>
>>>>>>> 1. One of the ES indexes (usergrid_applications_4) still has 15
>>>>>>> shards in "INITIALIZING" state. The ES cluster health is red.
>>>>>>> Log Extract from ES (details in appendices) shows that:
>>>>>>> "[usergrid_applications_4][14] failed to start shard.".
>>>>>>>
>>>>>>> 2. Usergrid logs show the below types of errors after invoking the
>>>>>>> re-index api (details in appendices):
>>>>>>> Type 1
>>>>>>> UnavailableShardsException[[usergrid_applications_4][7] Primary
>>>>>>> shard is not active or isn't assigned is a known node. Timeout: [1m],
>>>>>>> request: org.elasticsearch.action.bulk.BulkShardRequest@155fc46c]
>>>>>>> Type 2
>>>>>>> corepersistence.asyncevents.AmazonAsyncEventService.lambda$callEventHandlers$65(359)<Usergrid-SQS-Pool-13>-
>>>>>>> Failed to index message: 17ed55a5-3091-4f0d-8620-12f2915668c1
>>>>>>> Type 3
>>>>>>> corepersistence.asyncevents.AmazonAsyncEventService.lambda$null$70(735)<Usergrid-SQS-Pool-16>-
>>>>>>> Missing messages from queue post operation
>>>>>>> Type 4
>>>>>>> core.executor.TaskExecutorFactory.rejectedExecution(171)<QueueConsumer_11>-
>>>>>>> Usergrid-SQS-Pool task queue full, rejecting task
>>>>>>> rx.schedulers.ExecutorScheduler$ExecutorSchedulerWorker@131a4c5 and
>>>>>>> running in thread QueueConsumer_11
>>>>>>>
>>>>>>> Please suggest the next steps. Appreciate the help.
>>>>>>>
>>>>>>> Many Thanks,
>>>>>>> Harish
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -------------------------------------------------------------------
>>>>>>> *Appendices*
>>>>>>> *1. ELASTICSEARCH LOG*
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: [2016-01-25
>>>>>>> 06:40:44,227][WARN
>>>>>>> ][indices.cluster ] [Blindside] [usergrid_applications_4][14]
>>>>>>> failed to start shard
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log:
>>>>>>> org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
>>>>>>> [usergrid_applications_4][14] failed to recover shard
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>>> org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:287)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>>> org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>>> java.lang.Thread.run(Thread.java:745)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: Caused by:
>>>>>>> org.elasticsearch.index.translog.TranslogCorruptedException: translog
>>>>>>> corruption while reading from stream
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>>> org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:70)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>>> org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:257)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: ... 4 more
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: Caused by:
>>>>>>> java.io.EOFException
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>>> org.elasticsearch.common.io.stream.InputStreamStreamInput.readBytes(InputStreamStreamInput.java:53)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>>> org.elasticsearch.index.translog.BufferedChecksumStreamInput.readBytes(BufferedChecksumStreamInput.java:55)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>>> org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:86)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>>> org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:74)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>>> org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:495)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: at
>>>>>>> org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: ... 5 more
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: [2016-01-25
>>>>>>> 06:40:44,279][WARN
>>>>>>> ][cluster.action.shard ] [Blindside] [usergrid_applications_4][14]
>>>>>>> sending failed shard for [usergrid_applications_4][14],
>>>>>>> node[lb-HRQpWRQGCeIadzTEHSw], [P], s[INITIALIZING], indexUUID
>>>>>>> [fdyaoJQZQKuFeBONTQSD1g], reason [Failed to start shard, message
>>>>>>> [IndexShardGatewayRecoveryException[[usergrid_applications_4][14]
>>>>>>> failed to
>>>>>>> recover shard]; nested: TranslogCorruptedException[translog corruption
>>>>>>> while reading from stream]; nested: EOFException; ]]
>>>>>>> Jan 25 17:10:23 Elasticsearch elasticsearch.log: [2016-01-25
>>>>>>> 06:40:44,279][WARN
>>>>>>> ][cluster.action.shard ] [Blindside] [usergrid_applications_4][14]
>>>>>>> received shard failed for [usergrid_applications_4][14],
>>>>>>> node[lb-HRQpWRQGCeIadzTEHSw], [P], s[INITIALIZING], indexUUID
>>>>>>> [fdyaoJQZQKuFeBONTQSD1g], reason [Failed to start shard, message
>>>>>>> [IndexShardGatewayRecoveryException[[usergrid_applications_4][14]
>>>>>>> failed to
>>>>>>> recover shard]; nested: TranslogCorruptedException[translog corruption
>>>>>>> while reading from stream]; nested: EOFException; ]]
>>>>>>> ......
>>>>>>> *2. USERGRID LOG*
>>>>>>> Jan 25 16:05:19 Usergrid-2 catalina.out: 2016-01-25 05:35:19 INFO
>>>>>>> rest.system.IndexResource.rebuildIndexesPost(78)<http-bio-80-exec-92>-
>>>>>>> Rebuilding all applications
>>>>>>> Jan 25 16:05:19 Usergrid-2 catalina.out: 2016-01-25 05:35:19 INFO
>>>>>>> corepersistence.index.ReIndexServiceImpl.lambda$rebuildIndex$97(131)<RxCachedThreadScheduler-35>-
>>>>>>> Sending batch of 1000 to be indexed.
>>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out: 2016-01-25 05:35:20 ERROR
>>>>>>> index.impl.EsIndexProducerImpl.sendRequest(209)<Usergrid-SQS-Pool-13>-
>>>>>>> Unable to index
>>>>>>> id=appId(cd2bd460-a3e8-11e5-a327-0a75091e6d25,application).entityId(7c3328cc-bdd1-11e5-88d3-0a75091e6d25,activity).version(7c3328cd-bdd1-11e5-88d3-0a75091e6d25).nodeId(99400999-a3ef-11e5-a327-0a75091e6d25,group).edgeName(zzzcollzzz|feed).nodeType(TARGET),
>>>>>>> type=entity, index=usergrid_applications_4,
>>>>>>> failureMessage=UnavailableShardsException[[usergrid_applications_4][4]
>>>>>>> Primary shard is not active or isn't assigned is a known node. Timeout:
>>>>>>> [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3d277a1
>>>>>>> ]
>>>>>>> .....
>>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out: 2016-01-25 05:35:20 ERROR
>>>>>>> corepersistence.asyncevents.AmazonAsyncEventService.lambda$callEventHandlers$65(359)<Usergrid-SQS-Pool-13>-
>>>>>>> Failed to index message: 17ed55a5-3091-4f0d-8620-12f2915668c1
>>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>>> java.lang.RuntimeException: Error during processing of bulk index
>>>>>>> operations one of the responses failed.
>>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>>> UnavailableShardsException[[usergrid_applications_4][4] Primary shard
>>>>>>> is
>>>>>>> not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>>>> org.elasticsearch.action.bulk.BulkShardRequest@3d277a1]
>>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>>> UnavailableShardsException[[usergrid_applications_4][7] Primary shard
>>>>>>> is
>>>>>>> not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>>>> org.elasticsearch.action.bulk.BulkShardRequest@155fc46c]
>>>>>>> Jan 25 16:05:21 Usergrid-2 catalina.out:
>>>>>>> UnavailableShardsException[[usergrid_applications_4][10] Primary shard
>>>>>>> is
>>>>>>> not active or isn't assigned is a known node. Timeout: [1m], request:
>>>>>>> org.elasticsearch.action.bulk.BulkShardRequest@445968e3]
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jan 22, 2016 at 10:29 PM, Michael Russo <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Harish,
>>>>>>>>
>>>>>>>> Yeah, in Usergrid 2 you can re-index all of the entity data that
>>>>>>>> exists in Cassandra. Here are example curl requests that invoke the
>>>>>>>> reindex APIs:
>>>>>>>>
>>>>>>>> *Full system re-index:*
>>>>>>>>
>>>>>>>> curl -i -X POST -u <sysadmin user>:<sysadmin pass> "
>>>>>>>> http://localhost:8080/system/index/rebuild"
>>>>>>>>
>>>>>>>> *Per application re-index:*
>>>>>>>>
>>>>>>>> curl -i -X POST -u <sysadmin user>:<sysadmin pass> "
>>>>>>>> http://localhost:8080/system/index/rebuild/<application uuid>"
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>> -Michael
>>>>>>>>
>>>>>>>> On Fri, Jan 22, 2016 at 4:37 AM, Harish Singh Bisht <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Team,
>>>>>>>>>
>>>>>>>>> We have been testing our application based on Usergrid 2
>>>>>>>>> (master branch) and started noticing unusually poor performance with
>>>>>>>>> spikes in the response time.
>>>>>>>>>
>>>>>>>>> Our investigations revealed that during the load testing we ran
>>>>>>>>> out of HDD space on the single-node Elasticsearch cluster. This led
>>>>>>>>> to indexing failures.
>>>>>>>>>
>>>>>>>>> So we increased the HDD space and restarted ES. But now the
>>>>>>>>> cluster health is red and a lot of shards are in the INITIALIZING
>>>>>>>>> state. It seems data has been lost on the ES node.
>>>>>>>>>
>>>>>>>>> Is there any way to recover the lost data in ES? Specifically, is
>>>>>>>>> there a way to trigger a re-index of data from Cassandra to ES?
>>>>>>>>>
>>>>>>>>> Appreciate the help.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Harish
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Regards,
>>>>>>>>> Harish Singh Bisht
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Harish Singh Bisht
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Harish Singh Bisht
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Harish Singh Bisht
>>>>
>>>>
>>>
>>
>>
>> --
>> Regards,
>> Harish Singh Bisht
>>
>>
>
--
Regards,
Harish Singh Bisht