Re: Client Caller Context Through Router(RBF)

2020-05-10 Thread Akira Ajisaka
Thanks Ayush for bringing this up.

NameNode uses clientId and callId for the retry cache.
- clientId: I'm +1 for adding remoteAddr to IpcConnectionContext, and the NN
should use remoteAddr for clientId. This also resolves the data locality issue.
- callId: Can DFSRouters share the callId and the clientId of the requests
via ZooKeeper? That way DFSRouters can use the same callId for each
request. It may not be too costly for ZK if DFSRouters share the IDs
of @AtMostOnce requests only.

Thanks and regards,
Akira
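
To illustrate the point about clientId and callId, here is a minimal sketch of a
retry cache keyed by that pair. It is not the NameNode's implementation: the
class RetryCacheSketch, its methods, and the use of a String clientId are
assumptions made for illustration. It only shows why a retry that arrives with
Router-generated IDs misses the cache, while a retry that carries the original
IDs hits it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical, simplified stand-in for a retry cache keyed by (clientId, callId). */
final class RetryCacheSketch {

    /** Cache key: the pair used to recognise a retried call. */
    static final class Key {
        final String clientId; // assumption: a String here, for readability
        final int callId;

        Key(String clientId, int callId) {
            this.clientId = clientId;
            this.callId = callId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return callId == other.callId && clientId.equals(other.clientId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(clientId, callId);
        }
    }

    private final Map<Key, String> completed = new HashMap<>();
    private int executions = 0;

    /** Runs the operation once per key; a retry with the same key gets the cached result. */
    String invoke(String clientId, int callId, String op) {
        return completed.computeIfAbsent(new Key(clientId, callId), k -> {
            executions++; // counts how often the operation really ran
            return op + "#" + executions;
        });
    }

    /** Number of times an operation was actually executed (cache misses). */
    int executions() {
        return executions;
    }

    public static void main(String[] args) {
        // Same logical request retried through two Routers that use their own IDs.
        RetryCacheSketch routerIds = new RetryCacheSketch();
        routerIds.invoke("router-1", 7, "create");
        routerIds.invoke("router-2", 3, "create");
        System.out.println("executions with Router IDs: " + routerIds.executions());

        // Same request retried with the original client's IDs preserved.
        RetryCacheSketch clientIds = new RetryCacheSketch();
        clientIds.invoke("client-A", 42, "create");
        clientIds.invoke("client-A", 42, "create");
        System.out.println("executions with client IDs: " + clientIds.executions());
    }
}
```

In this sketch the first scenario executes the operation twice and the second
only once, which is the behaviour an @AtMostOnce operation needs on retry.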


Re: Client Caller Context Through Router(RBF)

2020-05-07 Thread Hui Fei
Ayush, thanks for bringing this up; it is very meaningful!

One more typical problem to add: the NameNode should log both the real
client's IP and the Router's IP. But right now the NN logs only the Router's
IP, which makes troubleshooting difficult.
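
As a rough sketch of the logging problem described above: if the Router
appended the real client's address to the caller context it forwards, a log
line on the NameNode side could show both addresses. The class
AuditLineSketch, the "clientIp:" field name, and the line format below are all
hypothetical and are not taken from Hadoop. Note also that a caller-supplied
field like this is not authenticated by itself.

```java
/** Hypothetical helper showing how one log line could carry both addresses. */
final class AuditLineSketch {

    /** Appends the real client's address to the caller context the Router forwards. */
    static String withClientIp(String callerContext, String clientIp) {
        String field = "clientIp:" + clientIp; // assumed field name
        if (callerContext == null || callerContext.isEmpty()) {
            return field;
        }
        return callerContext + "," + field;
    }

    /** Formats an illustrative log line: connection peer (the Router) plus the context. */
    static String auditLine(String peerIp, String cmd, String callerContext) {
        return "ip=/" + peerIp + " cmd=" + cmd + " callerContext=" + callerContext;
    }

    public static void main(String[] args) {
        String ctx = withClientIp("job-1", "10.0.0.5");
        // The peer address is the Router's; the client's address travels in the context.
        System.out.println(auditLine("10.1.1.1", "create", ctx));
    }
}
```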



Re: Client Caller Context Through Router(RBF)

2020-05-05 Thread Xiaoqiao He
Hi Ayush,

Thank you for the summary and for relaunching this discussion. The JIRA
links are very helpful for pushing it forward.
I would like to underline that this is a serious issue, and I very much hope
that we can reach a solution here. As far as I know, many users who deploy
RBF have hit this issue (as reported in HDFS-15079 and HDFS-15310) or have
hit it without detecting it.

I would like to give my +1 for improving the RPC framework. IMO, it is the
more common and general solution (HADOOP-16254 is one attempt); of course, we
should resolve the security issues that concern Daryn.
In my practice, we enhanced the RPC framework and exposed a proxy interface
for the Router; the data locality and data inconsistency problems were both
resolved as expected.
Any suggestions and feedback are welcome.

Thanks and best regards.
Hexiaoqiao


On Mon, May 4, 2020 at 5:26 PM Ayush Saxena  wrote:

> Hi All,
> I wanted to share and discuss a problem that we are currently facing when
> using Router Based Federation. At present, when a client connects through
> the Router to the Namenode, the Namenode receives the caller context of the
> Router rather than that of the actual client. This can cause a couple of
> problems, two of which we have identified so far:
>
> Firstly, data locality doesn't work correctly when connecting through the
> Router, as the Namenode considers the Router to be the actual client and
> performs all the optimizations/computations based on the Router's location
> rather than the actual client's location.
>
> Secondly, the Namenode Retry Cache cannot be used. If, in case of failover
> or a similar event, the client retries and connects to another Router, the
> Call Id comes from the Router rather than from the actual client, so the
> Retry Cache doesn't identify it as a repeated call and serves it as a whole
> new call, which creates inconsistencies.
>
> We have been discussing and trying solutions for a long time now and have
> tried out a few:
>
>    - Add proxy address in IPC connection (HADOOP-16254) --> This had some
>      security concerns for Daryn.
>    - The RouterRPCServer should transfer CallerContext and client IP to
>      NamenodeRpcServer (HDFS-13293) --> This tends to be a little opaque,
>      and there are a couple more problems, as stated in HDFS-13248 by
>      Ajay Kumar and Arpit Agarwal.
>    - Favored Nodes --> Pass the local node as a favored node. But this
>      isn't a complete solution. It doesn't take into account the fallback
>      when local nodes are unavailable, and a couple of other cases. It
>      isn't a solution for the Retry Cache problem either.
>
>
> The related JIRAs where most of the discussion happened, if someone wants
> to follow along:
> HDFS-13248: For the Data Locality problem. It also has a patch at the end
> with Solution 3 (Favored Nodes).
> HDFS-15079, HDFS-15078 & HDFS-15310: For the Retry Cache problem.
> HADOOP-16254: Solution 1: Add proxy address in IPC connection.
> HDFS-13293: Solution 2: Passing Caller Context.
>
> Do let us know if you can help here, or if there are any further
> solutions, workarounds, or a way to unblock or improve the solutions
> tried so far.
>
> Thanx!!!
> -Ayush
>
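
The data locality problem in the original message can be sketched as follows.
This is a simplified illustration, not the NameNode's code: the class
LocalitySketch, its Node type, and the three-level distance (0 for the same
host, 2 for the same rack, 4 otherwise) are assumptions. It shows that
ordering replicas relative to the Router's location can put a remote replica
first even when the real client has a local one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: ordering replicas by distance from the assumed reader. */
final class LocalitySketch {

    /** A node identified by host name and rack path, e.g. ("dn1", "/rack1"). */
    static final class Node {
        final String host;
        final String rack;

        Node(String host, String rack) {
            this.host = host;
            this.rack = rack;
        }
    }

    /** Simplified distance: 0 for the same host, 2 for the same rack, 4 otherwise. */
    static int distance(Node a, Node b) {
        if (a.host.equals(b.host)) {
            return 0;
        }
        if (a.rack.equals(b.rack)) {
            return 2;
        }
        return 4;
    }

    /** Returns replica hosts ordered nearest-first relative to the given reader. */
    static List<String> sortByDistance(Node reader, List<Node> replicas) {
        List<Node> sorted = new ArrayList<>(replicas);
        // List.sort is stable, so equally distant replicas keep their input order.
        sorted.sort(Comparator.comparingInt((Node n) -> distance(reader, n)));
        List<String> hosts = new ArrayList<>();
        for (Node n : sorted) {
            hosts.add(n.host);
        }
        return hosts;
    }

    public static void main(String[] args) {
        List<Node> replicas = Arrays.asList(
            new Node("dn1", "/rack1"), new Node("dn2", "/rack2"), new Node("dn3", "/rack3"));
        Node client = new Node("dn3", "/rack3");     // real client runs on dn3
        Node router = new Node("router1", "/rack1"); // Router sits in rack1

        System.out.println("relative to client: " + sortByDistance(client, replicas));
        System.out.println("relative to Router: " + sortByDistance(router, replicas));
    }
}
```

With the client's location the local replica on dn3 comes first; with the
Router's location dn1 comes first and dn3 last, so the client reads remotely.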