Thanks Sandeep. I just saw the JIRA created. I'll see if I can try to eliminate variables from our side.
Kevin Risden On Tue, Oct 31, 2017 at 1:51 PM, Sandeep More <moresand...@gmail.com> wrote: > Hello Kevin, > > So I tried a bit but could not reproduce the issue. I tried with two > separate topologies and tried sending concurrent requests (loop of 100). > I tried with Auth caching and with auth failure requests still could not > reproduce the issue. > > I am suspecting this has something to do with the load balancer (somehow > sticky requests). > For now I'll open a JIRA for it and target it to 0.15.0 > > Best, > Sandeep > > On Tue, Oct 31, 2017 at 10:23 AM, Kevin Risden <kris...@apache.org> wrote: > >> Sandeep - Any progress here? Any more ideas? We haven't had a chance to >> look much closer yet but if you need more logs let me know. >> >> Kevin Risden >> >> On Wed, Oct 18, 2017 at 8:42 AM, Sandeep More <moresand...@gmail.com> >> wrote: >> >>> This is interesting (and troubling). Thanks for the detailed analysis >>> Kevin, this really helps to narrow down the problem. I will try to take a >>> look at Auth failures and Concurrency issue that are still to be >>> investigated. >>> >>> Best, >>> Sandeep >>> >>> On Tue, Oct 17, 2017 at 5:47 PM, Kevin Risden <kris...@apache.org> >>> wrote: >>> >>>> Responding to myself to add some more detail after digging deeper: >>>> >>>> More evidence from duplicate correlation ids for the following cases: >>>> >>>> - this affects different services (same correlation id, different >>>> services being called) >>>> - multiple successful individual requests with same correlation id >>>> >>>> I'll try anonymize them and share to this thread in the next few days. >>>> >>>> We had a few theories that we are slowly working through >>>> >>>> - Correlation id generation is returning duplicates? >>>> - Status: Basically disproven >>>> - https://github.com/apache/knox/blob/master/gateway-server/sr >>>> c/main/java/org/apache/hadoop/gateway/filter/CorrelationHand >>>> ler.java#L37 >>>> >>>> <https://github.com/apache/knox/blob/master/gateway-server/src/main/java/org/apache/hadoop/gateway/filter/CorrelationHandler.java#L37> >>>> - Wrote a small Java program to test this. >>>> - UUID.randomUUID().toString in multiple threads >>>> - No duplicates returned for 100+ threads and 10000 ids >>>> generated quickly >>>> - Pretty sure the random number generator is ok >>>> - Requests from same IP address = same correlation id? >>>> - Status: Disproven >>>> - This doesn't make sense for the different services >>>> - Different users would be hitting different services (hive vs >>>> hbase) >>>> - If it only occurs with authentication failures could be the >>>> system recycling the ID? >>>> - Status: Not explored yet >>>> - Active conversations require a unique ID, so it might be that >>>> auth fail 401 = conversation over and UUID recycling. >>>> - Not sure how to test this right now >>>> - CorrelationContext or CorrelationService or CorrelationHandler is >>>> getting confused? >>>> - The filter is adding the same id multiple times? >>>> - Asynchronous calls are being handled differently? >>>> >>>> As of right now we don't have a smoking gun as to what is happening, >>>> but I would guess there is something in Knox causing the duplicates to >>>> happen. >>>> >>>> Any ideas on where to look next would be appreciated. >>>> >>>> >>>> >>>> Kevin Risden >>>> >>>> On Tue, Oct 17, 2017 at 4:35 PM, Kevin Risden <kris...@apache.org> >>>> wrote: >>>> >>>>> From the Knox User list thread: "Multiple topology audit logging", it >>>>> came to my attention that Knox seems to be logging duplicate correlation >>>>> ids. Separating out the topic specifically here to dig a bit deeper. >>>>> >>>>> While looking at our Knox audit logs (Knox 0.9 on HDP 2.5) the >>>>> "correlation id" doesn't seem to be unique across requests. Is this to be >>>>> expected? Here is a snippet (anonymized): >>>>> >>>>> grep 7557c91b-2a48-4e09-aefc-44e9892372da /var/knox/gateway-audit.log >>>>>> >>>>> >>>>> >>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>>>> 892372da|audit|WEBHBASE||||access|uri|/gateway/HADOOPTEST/hb >>>>>> ase/hbase/NAMESPACE1:TABLE1/ID1//|unavailable|Request method: GET >>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>>>> 892372da|audit|WEBHBASE|USER1|||authentication|uri|/gateway/ >>>>>> HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3ara >>>>>> w&|success| >>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>>>> 892372da|audit|WEBHBASE|USER1|||authentication|uri|/gateway/ >>>>>> HADOOPPROD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|Groups: >>>>>> [] >>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>>>> 892372da|audit|WEBHBASE|USER1|||dispatch|uri|http://WEBHBASE >>>>>> .example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row= >>>>>> ID2%2Fd%3Araw|unavailable|Request >>>>>> <http://webhbase.example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=ID2%2Fd%3Araw%7Cunavailable%7CRequest> >>>>>> method: >>>>>> GET >>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>>>> 892372da|audit|WEBHBASE|USER1|||dispatch|uri|http://WEBHBASE >>>>>> .example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row= >>>>>> ID2%2Fd%3Araw|success|Response >>>>>> <http://webhbase.example.com:8084/NAMESPACE2:TABLE2/multiget?doAs=USER1&row=ID2%2Fd%3Araw%7Csuccess%7CResponse> >>>>>> status: >>>>>> 200 >>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>>>> 892372da|audit|WEBHBASE|USER1|||access|uri|/gateway/HADOOPPR >>>>>> OD/hbase/NAMESPACE2:TABLE2/multiget?row=ID2%2fd%3araw&|success|Response >>>>>> status: 200 >>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>>>> 892372da|audit|WEBHBASE||||authentication|principal|USER2|failure|LDAP >>>>>> authentication failed. >>>>>> 17/10/10 12:50:09 ||7557c91b-2a48-4e09-aefc-44e9 >>>>>> 892372da|audit|WEBHBASE||||access|uri|/gateway/HADOOPTEST/hb >>>>>> ase/hbase/NAMESPACE1:TABLE2/ID1//|success|Response status: 401 >>>>> >>>>> >>>>> The things to highlight here for the same correlation id: >>>>> * different topologies are being used >>>>> * different uris are being used >>>>> * different users are being used >>>>> >>>>> Some of the things that we have configured that could impact results: >>>>> * authentication caching >>>>> * multiple Knox servers >>>>> * load balancer in front of Knox >>>>> >>>>> I would be surprised if those things would make one Knox node (it >>>>> could affect our other Knox nodes haven't looked yet) have duplicate >>>>> correlation ids. >>>>> >>>>> From Larry Mccay: >>>>> >>>>>> Interesting... >>>>>> >>>>>> I have not seen this before at all. >>>>>> I don't think there is anyway for those things to affect it either - >>>>>> especially the HA related things. >>>>>> Those entries are all a single request that will not go to different >>>>>> hosts - unless there is are redirects involved and then it should become >>>>>> a >>>>>> separate correlation id. >>>>>> >>>>>> We should dig into the creation of the id to see what could cause it. >>>>>> >>>>> >>>>> From myself: >>>>> >>>>>> Yea that is what I was afraid of :) I looked a bit further and it >>>>>> isn't every request that has duplicates. The ones I have found so far are >>>>>> from LDAP authentication failures. There could very well be others since >>>>>> I've only been looking at authentication failures. >>>>>> I'll see if my team can dig some more information up. I'm hoping to >>>>>> get a few more people involved with Apache Knox. It will probably take a >>>>>> few days to dig further. >>>>> >>>>> >>>>> Kevin Risden >>>>> >>>> >>>> >>> >> >