There's already a ticket:
https://issues.apache.org/jira/browse/USERGRID-1051

Just as an FYI: when we load tested Usergrid to upwards of 10k TPS, we were
using a search queue size of 5000. In fact, we set all of the queues to a
size of 5000, including the bulk queue.
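
For reference, that sizing corresponds roughly to the following settings in
elasticsearch.yml (property names from the ES 1.x threadpool docs linked
further down in this thread; the values are simply what we used, not a
universal recommendation):

    # queue sizes used in our load tests
    threadpool.search.queue_size: 5000
    threadpool.index.queue_size: 5000
    threadpool.bulk.queue_size: 5000
    # thread counts are a separate knob, e.g. threadpool.search.size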

On Thu, Dec 10, 2015 at 7:17 AM, Jaskaran Singh <
jaskaran.si...@comprotechnologies.com> wrote:

> Hi Michael,
>
> I am providing an update on my situation. We have changed our application
> logic to minimize the use of queries (i.e. calls with "ql=.....") in
> Usergrid 2.x. This seems to have provided a significant benefit, and all
> the problems reported below seem to have disappeared.
>
> To some extent this is good news. However, we were lucky to be able to
> work around it in our application logic, and we would like to understand
> any limitations or best practices around the use of queries (which are
> serviced by Elasticsearch in Usergrid 2.x) under high-load conditions.
>
> Also, please let me know if there is an existing Jira issue for addressing
> the empty entity response when Elasticsearch is overloaded, or should I
> add one?
>
> Thanks in advance,
>
> Thanks
> Jaskaran
>
>
> On Tue, Dec 8, 2015 at 6:00 PM, Jaskaran Singh <
> jaskaran.si...@comprotechnologies.com> wrote:
>
>> Hi Michael,
>>
>> This makes sense. I can confirm that while we have been seeing missing
>> entity errors under high load, these automatically resolve themselves as
>> the load decreases.
>>
>> Another anomaly that we have noticed is that Usergrid responds with a
>> "401" code and the message "Unable to authenticate OAuth credentials" for
>> certain users' credentials under high load, while the same credentials
>> work fine after the load reduces. Can we assume that this issue
>> (intermittent invalid credentials) has the same underlying root cause
>> (i.e. Elasticsearch is not responding)? Below are a few examples of the
>> error_description for such 401 errors:
>> 1. "invalid username or password"
>> 2. "Unable to authenticate OAuth credentials"
>> 3. "Unable to authenticate due to corrupt access token"
>>
>> Regarding your suggestion to increase the search thread pool queue size,
>> we were already using a setting of 1000 (with 320 threads). Should we
>> consider increasing this further, or simply provide additional resources
>> (CPU / RAM) to the ES process?
>>
>> Additionally, we are also seeing Cassandra connection timeouts,
>> specifically the exceptions below, under high-load conditions:
>> ERROR stage.write.WriteCommit.call(132)<Usergrid-Collection-Pool-12>-
>> Failed to execute write asynchronously
>> com.netflix.astyanax.connectionpool.exceptions.TimeoutException:
>> TimeoutException: [host=10.0.0.237(10.0.0.237):9160, latency=2003(2003),
>> attempts=1]org.apache.thrift.transport.TTransportException:
>> java.net.SocketTimeoutException: Read timed out
>>
>> These exceptions occur even though OpsCenter was reporting medium load on
>> our cluster. Is there a way to tune the Astyanax library? Please let us
>> know if you have any recommendations in this area.
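>>
>> Our rough understanding is that the timeout in that trace is governed by
>> the Astyanax connection pool settings, which Usergrid wires up internally.
>> The sketch below is only meant to show which knobs appear to be involved;
>> the class and method names come from the Astyanax API, the seed host is
>> taken from the trace, and the other values and the keyspace name are
>> placeholders:
>>
>> import com.netflix.astyanax.AstyanaxContext;
>> import com.netflix.astyanax.Keyspace;
>> import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
>> import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
>> import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
>> import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
>> import com.netflix.astyanax.thrift.ThriftFamilyFactory;
>>
>> public class AstyanaxPoolSketch {
>>     public static void main(String[] args) {
>>         // latency=2003 in the trace suggests a ~2000 ms socket timeout is hit.
>>         ConnectionPoolConfigurationImpl poolConfig =
>>                 new ConnectionPoolConfigurationImpl("UsergridPool")
>>                         .setPort(9160)
>>                         .setSeeds("10.0.0.237:9160")
>>                         .setMaxConnsPerHost(20)    // placeholder value
>>                         .setConnectTimeout(2000)   // ms
>>                         .setSocketTimeout(5000);   // ms; the knob behind the timeout
>>
>>         AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
>>                 .forKeyspace("Usergrid_Applications")   // placeholder keyspace
>>                 .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
>>                         .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE))
>>                 .withConnectionPoolConfiguration(poolConfig)
>>                 .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
>>                 .buildKeyspace(ThriftFamilyFactory.getInstance());
>>         context.start();
>>
>>         // Reads through this keyspace would now use the timeouts above.
>>         Keyspace keyspace = context.getClient();
>>         System.out.println("connected to " + keyspace.getKeyspaceName());
>>     }
>> }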
>>
>> Thanks a lot for the help.
>>
>> Thanks
>> Jaskaran
>>
>> On Mon, Dec 7, 2015 at 2:29 AM, Michael Russo <michaelaru...@gmail.com>
>> wrote:
>>
>>> Here are a couple things to check:
>>>
>>> 1) Can you query all of these entities out when the system is not under
>>> load?
>>> 2) Elasticsearch has a search queue for index query requests (
>>> https://www.elastic.co/guide/en/elasticsearch/reference/1.6/modules-threadpool.html).
>>> When this queue is full, searches are rejected. Currently Usergrid
>>> surfaces this as no results returned rather than an "unable to query" or
>>> other identifying error message (we're aware and plan to fix this in the
>>> future). Try increasing the queue size to 1000. You might see delayed
>>> results, but it can prevent empty results for data that's known to be in
>>> the index. One way to watch for these rejections is sketched below.
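>>>
>>> To confirm the queue is actually overflowing during your test, you can
>>> poll the cat thread pool endpoint and watch search.rejected. A minimal
>>> sketch, assuming a node reachable on localhost:9200 (column names taken
>>> from the 1.x cat API):
>>>
>>> import java.io.BufferedReader;
>>> import java.io.InputStreamReader;
>>> import java.net.HttpURLConnection;
>>> import java.net.URL;
>>>
>>> public class SearchQueueCheck {
>>>     public static void main(String[] args) throws Exception {
>>>         // Prints one row per node: active, queued and rejected searches.
>>>         URL url = new URL("http://localhost:9200/_cat/thread_pool"
>>>                 + "?v&h=host,search.active,search.queue,search.rejected");
>>>         HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>>>         conn.setRequestMethod("GET");
>>>         try (BufferedReader in = new BufferedReader(
>>>                 new InputStreamReader(conn.getInputStream()))) {
>>>             String line;
>>>             while ((line = in.readLine()) != null) {
>>>                 System.out.println(line);
>>>             }
>>>         }
>>>         conn.disconnect();
>>>     }
>>> }
>>>
>>> If search.rejected climbs while the empty responses appear, a bigger
>>> queue (or more ES capacity) is the right lever.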
>>>
>>> Thanks.
>>> -Michael R.
>>>
>>> On Dec 5, 2015, at 07:07, Jaskaran Singh <
>>> jaskaran.si...@comprotechnologies.com> wrote:
>>>
>>> Hello All,
>>>
>>> We are testing Usergrid 2.x (master branch) for our application, which
>>> was previously prototyped on Usergrid 1.x. We are noticing some weird
>>> anomalies that cause errors in our application, which otherwise works
>>> fine against Usergrid 1.x. Specifically, we are seeing empty responses
>>> when querying custom collections for a particular entity record.
>>> Following is an example of one such query:
>>> http://server-name/b2perf1/default/userdata?client_id=
>>> <...>&client_secret=<....>&ql=userproductid='4d543507-9839-11e5-ba08-0a75091e6d25~~5c856de9-9828-11e5-ba08-0a75091e6d25'
>>>
>>> In the above scenario, we are querying a custom collection "userdata".
>>> Under high-load conditions (performance tests), this query starts
>>> returning an empty entities array (see below), even though the entity
>>> did exist at one point and we have no code or logic that deletes
>>> entities.
>>> {
>>>     "action": "get",
>>>     "application": "0f7a2396-9826-11e5-ba08-0a75091e6d25",
>>>     "params": {
>>>         "ql": [
>>>
>>> "userproductid='4d543507-9839-11e5-ba08-0a75091e6d25~~5c856de9-9828-11e5-ba08-0a75091e6d25'"
>>>         ]
>>>     },
>>>     "path": "/userdata",
>>>     "uri": "http://localhost:8080/b2perf1/default/userdata";,
>>>     "entities": [],
>>>     "timestamp": 1449322746733,
>>>     "duration": 1053,
>>>     "organization": "b2perf1",
>>>     "applicationName": "default",
>>>     "count": 0
>>> }
>>>
>>> This has been happening quite randomly / intermittently, and we have not
>>> been able to isolate any reproduction steps beyond running load /
>>> performance tests until the problem eventually shows up.
>>> Note that the entities are created prior to the load test, and we can
>>> confirm that they existed before running it.
>>>
>>> We have never noticed this issue for non-query calls (i.e. calls that do
>>> not directly provide a field to query on).
>>>
>>> Our suspicion is that while these records do exist in Cassandra (because
>>> we have never deleted them), the Elasticsearch index is not in sync or
>>> is not functioning properly.
>>> How do we go about debugging this problem? Is there any particular
>>> logging or metric we can check to confirm whether the Elasticsearch
>>> index is up to date with the changes in Cassandra?
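>>>
>>> For example, one check we could run is to fetch the same entity once
>>> directly by UUID and once via a ql query, and compare the responses.
>>> This is only a rough sketch; the host, credentials, UUID and field value
>>> below are placeholders, and we are assuming that a direct fetch by UUID
>>> is served from Cassandra while a ql query goes through Elasticsearch:
>>>
>>> import java.io.BufferedReader;
>>> import java.io.InputStreamReader;
>>> import java.net.HttpURLConnection;
>>> import java.net.URL;
>>> import java.net.URLEncoder;
>>>
>>> public class IndexSyncCheck {
>>>     // Placeholder values; substitute the real org/app, credentials and IDs.
>>>     static final String BASE = "http://localhost:8080/b2perf1/default/userdata";
>>>     static final String AUTH = "client_id=PLACEHOLDER&client_secret=PLACEHOLDER";
>>>
>>>     static String get(String url) throws Exception {
>>>         HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
>>>         conn.setRequestMethod("GET");
>>>         StringBuilder body = new StringBuilder();
>>>         try (BufferedReader in = new BufferedReader(
>>>                 new InputStreamReader(conn.getInputStream()))) {
>>>             String line;
>>>             while ((line = in.readLine()) != null) body.append(line);
>>>         }
>>>         conn.disconnect();
>>>         return body.toString();
>>>     }
>>>
>>>     public static void main(String[] args) throws Exception {
>>>         String uuid = "PLACEHOLDER-ENTITY-UUID";
>>>         String ql = URLEncoder.encode("userproductid='PLACEHOLDER-VALUE'", "UTF-8");
>>>
>>>         // Direct fetch by UUID vs. the same entity via a ql query.
>>>         System.out.println("direct: " + get(BASE + "/" + uuid + "?" + AUTH));
>>>         System.out.println("query:  " + get(BASE + "?" + AUTH + "&ql=" + ql));
>>>     }
>>> }
>>>
>>> If the direct fetch keeps returning the entity while the ql query comes
>>> back empty under load, that would point at the index rather than the data.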
>>>
>>> Any other suggestions will be greatly appreciated.
>>>
>>> Thanks
>>> Jaskaran
>>>
>>>
>>
>
