Thanks Sandeep. I just saw the JIRA created. I'll see if I can try to
eliminate variables from our side.

Kevin Risden

On Tue, Oct 31, 2017 at 1:51 PM, Sandeep More <moresand...@gmail.com> wrote:

> Hello Kevin,
>
> So I tried a bit but could not reproduce the issue. I tried with two
> separate topologies and tried sending concurrent requests (loop of 100).
> I tried with Auth caching and with auth failure requests still could not
> reproduce the issue.
>
> I am suspecting this has something to do with the load balancer (somehow
> sticky requests).
> For now I'll open a JIRA for it and target it to 0.15.0
>
> Best,
> Sandeep
>
> On Tue, Oct 31, 2017 at 10:23 AM, Kevin Risden <kris...@apache.org> wrote:
>
>> Sandeep - Any progress here? Any more ideas? We haven't had a chance to
>> look much closer yet but if you need more logs let me know.
>>
>> Kevin Risden
>>
>> On Wed, Oct 18, 2017 at 8:42 AM, Sandeep More <moresand...@gmail.com>
>> wrote:
>>
>>> This is interesting (and troubling). Thanks for the detailed analysis
>>> Kevin, this really helps to narrow down the problem. I will try to take a
>>> look at Auth failures and Concurrency issue that are still to be
>>> investigated.
>>>
>>> Best,
>>> Sandeep
>>>
>>> On Tue, Oct 17, 2017 at 5:47 PM, Kevin Risden <kris...@apache.org>
>>> wrote:
>>>
>>>> Responding to myself to add some more detail after digging deeper:
>>>>
>>>> More evidence from duplicate correlation ids for the following cases:
>>>>
>>>>    - this affects different services (same correlation id, different
>>>>    services being called)
>>>>    - multiple successful individual requests with same correlation id
>>>>
>>>> I'll try anonymize them and share to this thread in the next few days.
>>>>
>>>> We had a few theories that we are slowly working through
>>>>
>>>>    - Correlation id generation is returning duplicates?
>>>>    - Status: Basically disproven
>>>>       - https://github.com/apache/knox/blob/master/gateway-server/sr
>>>>       c/main/java/org/apache/hadoop/gateway/filter/CorrelationHand
>>>>       ler.java#L37
>>>>       
>>>> <https://github.com/apache/knox/blob/master/gateway-server/src/main/java/org/apache/hadoop/gateway/filter/CorrelationHandler.java#L37>
>>>>       - Wrote a small Java program to test this.
>>>>          - UUID.randomUUID().toString in multiple threads
>>>>          - No duplicates returned for 100+ threads and 10000 ids
>>>>          generated quickly
>>>>       - Pretty sure the random number generator is ok
>>>>    - Requests from same IP address = same correlation id?
>>>>       - Status: Disproven
>>>>       - This doesn't make sense for the different services
>>>>       - Different users would be hitting different services (hive vs
>>>>       hbase)
>>>>    - If it only occurs with authentication failures could be the
>>>>    system recycling the ID?
>>>>       - Status: Not explored yet
>>>>       - Active conversations require a unique ID, so it might be that
>>>>       auth fail 401 = conversation over and UUID recycling.
>>>>       - Not sure how to test this right now
>>>>    - CorrelationContext or CorrelationService or CorrelationHandler is
>>>>    getting confused?
>>>>       - The filter is adding the same id multiple times?
>>>>       - Asynchronous calls are being handled differently?
>>>>
>>>> As of right now we don't have a smoking gun as to what is happening,
>>>> but I would guess there is something in Knox causing the duplicates to
>>>> happen.
>>>>
>>>> Any ideas on where to look next would be appreciated.
>>>>
>>>>
>>>>
>>>> Kevin Risden
>>>>
>>>> On Tue, Oct 17, 2017 at 4:35 PM, Kevin Risden <kris...@apache.org>
>>>> wrote:
>>>>
>>>>> From the Knox User list thread: "Multiple topology audit logging", it
>>>>> came to my attention that Knox seems to be logging duplicate correlation
>>>>> ids. Separating out the topic specifically here to dig a bit deeper.
>>>>>
>>>>> While looking at our Knox audit logs (Knox 0.9 on HDP 2.5) the
>>>>> "correlation id" doesn't seem to be unique across requests. Is this to be
>>>>> expected? Here is a snippet (anonymized):
>>>>>
>>>>> grep 7557c91b-2a48-4e09-aefc-44e9892372da /var/knox/gateway-audit.log
>>>>>>
>>>>>
>>>>>
>>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>>>> 892372da|audit|WEBHBASE||||access|uri|/gateway/HADOOPTEST/hb
>>>>>> ase/hbase/NAMESPACE1:TABLE1/ID1//|unavailable|Request method: GET
>>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>>>> 892372da|audit|WEBHBASE|USER1|||authentication|uri|/gateway/
>>>>>> HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3ara
>>>>>> w&|success|
>>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>>>> 892372da|audit|WEBHBASE|USER1|||authentication|uri|/gateway/
>>>>>> HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|Groups:
>>>>>> []
>>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>>>> 892372da|audit|WEBHBASE|USER1|||dispatch|uri|http://WEBHBASE
>>>>>> .example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=
>>>>>> ID2%2Fd%3Araw|unavailable|Request
>>>>>> <http://webhbase.example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=ID2%2Fd%3Araw%7Cunavailable%7CRequest>
>>>>>>  method:
>>>>>> GET
>>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>>>> 892372da|audit|WEBHBASE|USER1|||dispatch|uri|http://WEBHBASE
>>>>>> .example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=
>>>>>> ID2%2Fd%3Araw|success|Response
>>>>>> <http://webhbase.example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=ID2%2Fd%3Araw%7Csuccess%7CResponse>
>>>>>>  status:
>>>>>> 200
>>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>>>> 892372da|audit|WEBHBASE|USER1|||access|uri|/gateway/HADOOPPR
>>>>>> OD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|Response
>>>>>> status: 200
>>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>>>> 892372da|audit|WEBHBASE||||authentication|principal|USER2|failure|LDAP
>>>>>> authentication failed.
>>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9
>>>>>> 892372da|audit|WEBHBASE||||access|uri|/gateway/HADOOPTEST/hb
>>>>>> ase/hbase/NAMESPACE1:TABLE2/ID1//|success|Response status: 401
>>>>>
>>>>>
>>>>> The things to highlight here for the same correlation id:
>>>>> * different topologies are being used
>>>>> * different uris are being used
>>>>> * different users are being used
>>>>>
>>>>> Some of the things that we have configured that could impact results:
>>>>> * authentication caching
>>>>> * multiple Knox servers
>>>>> * load balancer in front of Knox
>>>>>
>>>>> I would be surprised if those things would make one Knox node (it
>>>>> could affect our other Knox nodes haven't looked yet) have duplicate
>>>>> correlation ids.
>>>>>
>>>>> From Larry Mccay:
>>>>>
>>>>>> Interesting...
>>>>>>
>>>>>> I have not seen this before at all.
>>>>>> I don't think there is anyway for those things to affect it either -
>>>>>> especially the HA related things.
>>>>>> Those entries are all a single request that will not go to different
>>>>>> hosts - unless there is are redirects involved and then it should become 
>>>>>> a
>>>>>> separate correlation id.
>>>>>>
>>>>>> We should dig into the creation of the id to see what could cause it.
>>>>>>
>>>>>
>>>>> From myself:
>>>>>
>>>>>> Yea that is what I was afraid of :) I looked a bit further and it
>>>>>> isn't every request that has duplicates. The ones I have found so far are
>>>>>> from LDAP authentication failures. There could very well be others since
>>>>>> I've only been looking at authentication failures.
>>>>>> I'll see if my team can dig some more information up. I'm hoping to
>>>>>> get a few more people involved with Apache Knox. It will probably take a
>>>>>> few days to dig further.
>>>>>
>>>>>
>>>>> Kevin Risden
>>>>>
>>>>
>>>>
>>>
>>
>

Reply via email to