Hi Harish,

I'm going to try explaining the errors by type:

1.

This occurs when Elasticsearch is unable to start a shard for the index.
If your cluster health is red, it's possible that you don't have enough
disk space, or that at one point you ran out of disk space and the
already-created shards for the Usergrid indexes are now corrupt.  Since
Usergrid can completely rebuild the index, I'd suggest that you completely
delete the Elasticsearch data directories, restart Elasticsearch, restart
Usergrid, and begin the re-indexing process again.  As an FYI, we configure
a default of 18 shards and 1 replica for each index in the code.
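For reference, the recovery steps above look roughly like this on a typical
Linux install.  The data path and service name are assumptions here; check
path.data in your elasticsearch.yml and adjust for your environment.  The
re-index curl at the end is the same full-system rebuild endpoint from my
earlier email:

```shell
# Stop Elasticsearch and remove the (corrupt) index data.
# ASSUMPTION: data lives under /var/lib/elasticsearch -- verify path.data first!
sudo service elasticsearch stop
sudo rm -rf /var/lib/elasticsearch/*
sudo service elasticsearch start

# Wait until the cluster no longer reports "red" before re-indexing.
curl -s "http://localhost:9200/_cluster/health?pretty"

# Restart Usergrid (Tomcat), then kick off the full system re-index:
curl -i -X POST -u <sysadmin user>:<sysadmin pass> "http://localhost:8080/system/index/rebuild"
```

With a single-node cluster and 1 replica configured, health will stay yellow
(unassigned replicas) rather than green; that's fine for re-indexing, it's
red you need to avoid.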

2.

*Type 1* - Example exceptions that can occur due to the issues seen in #1.
Usergrid uses the Elasticsearch SDK as a client, so it will print these
types of statements because it's aware of the Elasticsearch cluster state.

*Type 2* - This shows that the async index queue consumers were unable to
index an entity.  This is most likely also due to #1, though #1 isn't
always the cause.  If the consumer threads hit any exception while trying
to index the entity in Elasticsearch, this type of error will be seen.

*Type 3* - This message goes with 'Type 2'.

*Type 4* - We set a finite thread pool size for the asynchronous indexing
(Usergrid-SQS-Pool).  The default used to be 100, but is now 40 on the
master branch.  If no thread is available in the pool (a rejected task),
we implemented a 'caller runs' strategy in which the thread trying to hand
off the event (Queue_Consumer) to the pool executes the request itself.
The number of threads available in this pool can be configured with the
*scheduler.io.threads* property.
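For example, the override would be a one-line property like the below.
Where you set it depends on your deployment (the filename here is just an
example of a Usergrid properties override file):

```
# usergrid-deployment.properties (example location)
# Raise the async indexing pool (Usergrid-SQS-Pool) back to the old default of 100
scheduler.io.threads=100
```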

*Type 5* - Elasticsearch also maintains thread pool sizes, and will queue
up requests when a thread is not available to process them.  The default
queue sizes are fairly small if you're aiming for high concurrency.
However, if there is a problem with ES and it's unable to process requests
properly, the queue would likely fill up regardless.  See
https://www.elastic.co/guide/en/elasticsearch/reference/2.x/modules-threadpool.html.
As an FYI, when we scaled Usergrid to 10,000 requests per second, we set
our queue sizes for search and bulk to 5000.  That is a bit high; you
should probably start with something lower, like 1000, if you plan to
change them.
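If you do change them, the queue sizes are settings in elasticsearch.yml
(this uses the pre-5.x "threadpool" key that matches your ES version; the
values are just the conservative starting point I mentioned, not a
recommendation for every cluster):

```
# elasticsearch.yml -- thread pool queue sizes (defaults are much smaller,
# e.g. the "queue capacity 300" in your Type 5 error)
threadpool.search.queue_size: 1000
threadpool.bulk.queue_size: 1000
```

Note these are a backstop, not a fix: if ES itself can't keep up, a bigger
queue just delays the rejection.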


Hope this info helps!  Let me know how things progress.

Thanks!
-Michael


On Mon, Jan 25, 2016 at 5:06 AM, Harish Singh Bisht <
[email protected]> wrote:

> Hi Michael,
>
> (resending email as i found another type of error from the logs)
> Thank you very much for the help. I tried the full system re-index API,
> but i am still seeing the below issues:
>
> 1. One of the ES indexes (usergrid_applications_4) still has 15 shards in
> "INITIALIZING" state. The ES cluster health is red.
>  Log Extract from ES (details in appendices) shows that:
> "[usergrid_applications_4][14] failed to start shard.".
>
> 2. Usergrid logs show the below types of errors after invoking the
> re-index api (details in appendices):
> Type 1
> UnavailableShardsException[[usergrid_applications_4][7] Primary shard is
> not active or isn't assigned is a known node. Timeout: [1m], request:
> org.elasticsearch.action.bulk.BulkShardRequest@155fc46c]
> Type 2
> corepersistence.asyncevents.AmazonAsyncEventService.lambda$callEventHandlers$65(359)<Usergrid-SQS-Pool-13>-
> Failed to index message: 17ed55a5-3091-4f0d-8620-12f2915668c1
> Type 3
> corepersistence.asyncevents.AmazonAsyncEventService.lambda$null$70(735)<Usergrid-SQS-Pool-16>-
> Missing messages from queue post operation
> Type 4
> core.executor.TaskExecutorFactory.rejectedExecution(171)<QueueConsumer_11>-
> Usergrid-SQS-Pool task queue full, rejecting task
> rx.schedulers.ExecutorScheduler$ExecutorSchedulerWorker@131a4c5 and
> running in thread QueueConsumer_11
> Type 5
> EsRejectedExecutionException[rejected execution (queue capacity 300) on
> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@13afccc8
> ];
>
> Please suggest the next steps. Appreciate the help.
>
> Many Thanks,
> Harish
>
>
>
> -------------------------------------------------------------------
> *Appendices*
> *1. ELASTICSEARCH LOG*
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  [2016-01-25 
> 06:40:44,227][WARN
> ][indices.cluster          ] [Blindside] [usergrid_applications_4][14]
> failed to start shard
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:
>  org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
> [usergrid_applications_4][14] failed to recover shard
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
> org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:287)
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
> org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
> java.lang.Thread.run(Thread.java:745)
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  Caused by:
> org.elasticsearch.index.translog.TranslogCorruptedException: translog
> corruption while reading from stream
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
> org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:70)
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
> org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:257)
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   ... 4 more
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  Caused by:
> java.io.EOFException
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
> org.elasticsearch.common.io.stream.InputStreamStreamInput.readBytes(InputStreamStreamInput.java:53)
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
> org.elasticsearch.index.translog.BufferedChecksumStreamInput.readBytes(BufferedChecksumStreamInput.java:55)
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
> org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:86)
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
> org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:74)
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
> org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:495)
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   at
> org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:   ... 5 more
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  [2016-01-25 
> 06:40:44,279][WARN
> ][cluster.action.shard     ] [Blindside] [usergrid_applications_4][14]
> sending failed shard for [usergrid_applications_4][14],
> node[lb-HRQpWRQGCeIadzTEHSw], [P], s[INITIALIZING], indexUUID
> [fdyaoJQZQKuFeBONTQSD1g], reason [Failed to start shard, message
> [IndexShardGatewayRecoveryException[[usergrid_applications_4][14] failed to
> recover shard]; nested: TranslogCorruptedException[translog corruption
> while reading from stream]; nested: EOFException; ]]
> Jan 25 17:10:23 Elasticsearch elasticsearch.log:  [2016-01-25 
> 06:40:44,279][WARN
> ][cluster.action.shard     ] [Blindside] [usergrid_applications_4][14]
> received shard failed for [usergrid_applications_4][14],
> node[lb-HRQpWRQGCeIadzTEHSw], [P], s[INITIALIZING], indexUUID
> [fdyaoJQZQKuFeBONTQSD1g], reason [Failed to start shard, message
> [IndexShardGatewayRecoveryException[[usergrid_applications_4][14] failed to
> recover shard]; nested: TranslogCorruptedException[translog corruption
> while reading from stream]; nested: EOFException; ]]
> ......
> *2. USERGRID LOG*
> Jan 25 16:05:19 Usergrid-2 catalina.out:  2016-01-25 05:35:19 INFO
> rest.system.IndexResource.rebuildIndexesPost(78)<http-bio-80-exec-92>-
> Rebuilding all applications
> Jan 25 16:05:19 Usergrid-2 catalina.out:  2016-01-25 05:35:19 INFO
> corepersistence.index.ReIndexServiceImpl.lambda$rebuildIndex$97(131)<RxCachedThreadScheduler-35>-
> Sending batch of 1000 to be indexed.
> Jan 25 16:05:21 Usergrid-2 catalina.out:  2016-01-25 05:35:20 ERROR
> index.impl.EsIndexProducerImpl.sendRequest(209)<Usergrid-SQS-Pool-13>-
> Unable to index
> id=appId(cd2bd460-a3e8-11e5-a327-0a75091e6d25,application).entityId(7c3328cc-bdd1-11e5-88d3-0a75091e6d25,activity).version(7c3328cd-bdd1-11e5-88d3-0a75091e6d25).nodeId(99400999-a3ef-11e5-a327-0a75091e6d25,group).edgeName(zzzcollzzz|feed).nodeType(TARGET),
> type=entity, index=usergrid_applications_4,
> failureMessage=UnavailableShardsException[[usergrid_applications_4][4]
> Primary shard is not active or isn't assigned is a known node. Timeout:
> [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@3d277a1]
> .....
> Jan 25 16:05:21 Usergrid-2 catalina.out:  2016-01-25 05:35:20 ERROR
> corepersistence.asyncevents.AmazonAsyncEventService.lambda$callEventHandlers$65(359)<Usergrid-SQS-Pool-13>-
> Failed to index message: 17ed55a5-3091-4f0d-8620-12f2915668c1
> Jan 25 16:05:21 Usergrid-2 catalina.out:  java.lang.RuntimeException:
> Error during processing of bulk index operations one of the responses
> failed.
> Jan 25 16:05:21 Usergrid-2 catalina.out:
>  UnavailableShardsException[[usergrid_applications_4][4] Primary shard is
> not active or isn't assigned is a known node. Timeout: [1m], request:
> org.elasticsearch.action.bulk.BulkShardRequest@3d277a1]
> Jan 25 16:05:21 Usergrid-2 catalina.out:
>  UnavailableShardsException[[usergrid_applications_4][7] Primary shard is
> not active or isn't assigned is a known node. Timeout: [1m], request:
> org.elasticsearch.action.bulk.BulkShardRequest@155fc46c]
> Jan 25 16:05:21 Usergrid-2 catalina.out:
>  UnavailableShardsException[[usergrid_applications_4][10] Primary shard is
> not active or isn't assigned is a known node. Timeout: [1m], request:
> org.elasticsearch.action.bulk.BulkShardRequest@445968e3]
> .......
> Jan 25 16:53:02 Usergrid-2 catalina.out: 2016-01-25 06:23:02 ERROR
> index.impl.EsIndexProducerImpl.sendRequest(209)<Usergrid-SQS-Pool-97>-
> Unable to index
> id=appId(65ccc2b7-bde0-11e5-88d3-0a75091e6d25,application).entityId(b7dcf6c9-bf62-11e5-88d3-0a75091e6d25,activity).version(b7dcf6cb-bf62-11e5-88d3-0a75091e6d25).nodeId(65ccc2b7-bde0-11e5-88d3-0a75091e6d25,application).edgeName(zzzcollzzz|activities).nodeType(TARGET),
> type=entity, index=usergrid_applications_3,
> failureMessage=RemoteTransportException[[Blindside][inet[/10.0.0.148:9300]][indices:data/write/bulk[s]]];
> nested: EsRejectedExecutionException[rejected execution (queue capacity
> 300) on
> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@13afccc8
> ];
> Jan 25 16:53:03 Usergrid-2 catalina.out:
> RemoteTransportException[[Blindside][inet[/10.0.0.148:9300]][indices:data/write/bulk[s]]];
> nested: EsRejectedExecutionException[rejected execution (queue capacity
> 300) on
> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@59da0cf2
> ];
>
> On Mon, Jan 25, 2016 at 5:54 PM, Harish Singh Bisht <
> [email protected]> wrote:
>
>> [original send trimmed -- identical to the resent message above, minus Type 5]
>>
>> On Fri, Jan 22, 2016 at 10:29 PM, Michael Russo <[email protected]>
>> wrote:
>>
>>> Hi Harish,
>>>
>>> Yeah, in Usergrid 2 you can re-index all of the entity data that exists
>>> in Cassandra.  Here are example curl requests that invoke the reindex APIs:
>>>
>>> *Full system re-index:*
>>>
>>> curl -i -X POST -u <sysadmin user>:<sysadmin pass> "
>>> http://localhost:8080/system/index/rebuild";
>>>
>>> *Per application re-index:*
>>>
>>> curl -i -X POST -u <sysadmin user>:<sysadmin pass> "
>>> http://localhost:8080/system/index/rebuild/<application uuid>"
>>>
>>> Thanks.
>>> -Michael
>>>
>>> On Fri, Jan 22, 2016 at 4:37 AM, Harish Singh Bisht <
>>> [email protected]> wrote:
>>>
>>>> Hi Team,
>>>>
>>>> We have been our testing our application based on Usergrid 2 (master
>>>> branch) and started noticing unusual poor performance with spikes in the
>>>> response time.
>>>>
>>>> Our investigations revealed that during the load testing we ran out of
>>>> HDD space on the elasticsearch single node cluster. This led to indexing
>>>> failures.
>>>>
>>>> So we increased the HDD space and restarted ES. But now the cluster
>>>> health is red and alot of shards are in the initializing state. It seem
>>>> data has been lost on the ES node.
>>>>
>>>> Is there any way to recover the lost data in ES? Specifically, is there
>>>> a way to trigger a re-index of data from Cassandra to ES?
>>>>
>>>> Appreciate the help.
>>>>
>>>> Thanks
>>>> Harish
>>>>
>>>> --
>>>> Regards,
>>>> Harish Singh Bisht
>>>>
>>>>
>>>
>>
>>
>> --
>> Regards,
>> Harish Singh Bisht
>>
>>
>
>
> --
> Regards,
> Harish Singh Bisht
>
>
