So, let me give my take on what the development space is like with Hadoop,
exactly how Knox fits in today, and where I think it will be pulled in the
future by natural demand.

First, a statement about the history of REST APIs in the Hadoop ecosystem.
For the most part, Hadoop developers (those developing MapReduce jobs or
various Hadoop services) have always worked within the cluster. ETL workers
have worked inside the cluster on gateway nodes and used the various CLIs
built into the platform to accomplish their work requirements. These
gateway machines and development environments were necessary due to the
heavy footprint of binaries and configuration required for clients to work
properly.

The REST-ish APIs were added largely to enable some scripting of tasks that
would otherwise require the binaries and configuration needed to use the
CLIs that data workers were accustomed to.

However, there were a number of things about how these APIs were added that
hampered their uptake, and it basically comes down to the fact that no
attention was paid to how REST APIs are consumed by typical REST-based
application developers. The authentication may be wide open or require
Kerberos/SPNEGO, for instance. Middleware applications were encouraged to
build dependencies on Hadoop jars for things like the UserGroupInformation
class and the Hadoop Java client. We basically left the entire population
of non-Hadoop developers out in the cold instead of enabling them to
consume Hadoop resources in the way that they, as REST-based developers,
are accustomed to accessing things.

The history above explains why nearly every user of Hadoop has had to spend
time on a gateway/edge machine with the CLIs, and why uptake of the Hadoop
REST APIs has been slow.

Apache Knox comes along with a charter to encourage the use of the REST
APIs and to engage a whole new type of developer for the Hadoop world. With
a single endpoint that exposes any number of Hadoop-related APIs, an
assortment of authentication methods, and provider integrations with
enterprise-class solutions for authentication and authorization, developers
are now able to code to a topology deployment model instead of a cluster
authentication setting. Developers are able to leverage the REST APIs
without requiring Hadoop jars and the version conflicts they bring. They
are able to build new CLIs that use this REST programming model without the
binaries and configuration needed on an edge node. They can use the SDK
classes provided by the KnoxShell to access WebHDFS and other services
without the use of UGI or SPNEGO.
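
To make that concrete, here is a minimal sketch of what the REST
programming model looks like from the client side: a WebHDFS LISTSTATUS
call proxied through the Knox gateway using nothing but the JDK. The
gateway host, the "sandbox" topology name and the guest credentials are
illustrative assumptions, not part of the discussion above, and the sketch
assumes the gateway's TLS certificate is trusted by the JVM.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class KnoxWebHdfsExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical gateway address and topology name - adjust to your deployment.
            String gateway = "https://knox.example.com:8443/gateway/sandbox";

            // WebHDFS LISTSTATUS on /tmp, proxied by Knox; no Hadoop jars, UGI or SPNEGO involved.
            URL url = new URL(gateway + "/webhdfs/v1/tmp?op=LISTSTATUS");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();

            // HTTP Basic credentials are validated by whatever authentication
            // provider the topology configures (LDAP, etc.).
            String creds = Base64.getEncoder()
                    .encodeToString("guest:guest-password".getBytes(StandardCharsets.UTF_8));
            conn.setRequestProperty("Authorization", "Basic " + creds);

            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // JSON FileStatuses payload from WebHDFS
                }
            }
        }
    }

The same request works from any HTTP client; the point is simply that no
Hadoop jars, UGI or SPNEGO handshakes are involved.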

Are there data workers that will continue to need access to edge nodes and
to be inside the cluster? In many instances, yes. Do they need to share
their use of the edge machines with everyone that needs to do anything with
Hadoop? No.

The future in my mind separates the end user population into 3 or 4 camps:

* A relatively small number of data workers within an organization that
regularly need to ssh into an edge machine
* A very small number of data curators that may not even have access to the
edge nodes but usher data in and out of the cluster
* A much larger number of data scientists that will access and provide
analytics from their desktop using HS2, Phoenix, HBase, etc. with JDBC/ODBC
(see the JDBC sketch after this list)
* An extremely large number of new developers and users that will
incorporate the use of REST APIs for Hadoop in their middleware
applications to provide multi-tenant access to data and processing through
proxies like Knox and messaging like Kafka, etc. These developers and their
users will actually be the largest growing group of Hadoop end users going
forward.
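
As a hedged illustration of that third camp, Knox can proxy HiveServer2
over its HTTP transport, so a desktop JDBC connection can ride through the
same gateway. The sketch below assumes a topology named "default" that
exposes Hive at the httpPath shown, guest credentials, a JVM-trusted
gateway certificate, and the hive-jdbc driver on the classpath; adjust all
of those to your own deployment.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class KnoxHiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical gateway host and "default" topology; transportMode=http and the
            // httpPath must match the Hive service exposed by the topology you deploy.
            String url = "jdbc:hive2://knox.example.com:8443/;ssl=true;"
                       + "transportMode=http;httpPath=gateway/default/hive";

            try (Connection conn = DriverManager.getConnection(url, "guest", "guest-password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1)); // one database name per row
                }
            }
        }
    }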

Edge node risk is greatly reduced in most cases and may be eliminated in
some. The largest number of users will never see the network topology
details of the cluster itself. Access to datanodes, for instance, is
completely hidden from the end users and clients that use Knox; the
hostnames and ports are not exposed.

So, in short, we still suffer from shortsightedness with regard to wanting
to make existing CLIs use Knox, or twist our minds around trying to figure
out how to write an app that doesn't assume Knox during development but
needs to be deployed to an environment that has it.
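
One way to ease that tension, offered as a sketch rather than a
prescription: keep the base URL and credentials outside the code, so the
same application can point at a direct WebHDFS endpoint in a dev
environment without Knox and at a gateway topology in production. The
hostnames, ports and environment variable names below are invented for
illustration.

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class HdfsRestClient {
        // e.g. http://namenode.dev.example.com:50070 in dev,
        // https://knox.example.com:8443/gateway/prod in production
        private final String baseUrl;
        // empty on a wide-open dev cluster, "user:password" against a Knox topology
        private final String basicCreds;

        public HdfsRestClient(String baseUrl, String basicCreds) {
            this.baseUrl = baseUrl;
            this.basicCreds = basicCreds;
        }

        // LISTSTATUS is the same WebHDFS call whether or not Knox sits in front;
        // only the base URL and the authentication differ.
        public int listStatus(String path) throws Exception {
            URL url = new URL(baseUrl + "/webhdfs/v1" + path + "?op=LISTSTATUS");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            if (!basicCreds.isEmpty()) {
                String encoded = Base64.getEncoder()
                        .encodeToString(basicCreds.getBytes(StandardCharsets.UTF_8));
                conn.setRequestProperty("Authorization", "Basic " + encoded);
            }
            // (an unsecured dev cluster may additionally want a user.name=... query parameter)
            return conn.getResponseCode(); // 200 on a successful listing
        }

        public static void main(String[] args) throws Exception {
            // Both values come from the environment, not the code, so nothing assumes Knox.
            String baseUrl = System.getenv().getOrDefault(
                    "HDFS_REST_BASE_URL", "https://knox.example.com:8443/gateway/sandbox");
            String creds = System.getenv().getOrDefault(
                    "HDFS_REST_BASIC_CREDS", "guest:guest-password");
            System.out.println(new HdfsRestClient(baseUrl, creds).listStatus("/tmp"));
        }
    }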

I am hoping to help make this clearer in upcoming blog articles on the
programming model afforded by Knox, and as that programming model matures
it should become more and more self-evident.

Thank you for the interesting discussion!

On Thu, Mar 16, 2017 at 8:38 PM, Damien Claveau <[email protected]>
wrote:

> Hi Larry,
>
> Thanks for your thoughts on that.
> "How Knox fits into the traditional hadoop programming model" is exactly
> the question.
>
> Few of the Java client libs are wrapped in an equivalent HTTP-oriented lib,
> and we certainly don't want the apps running inside YARN to request the
> Knox API (it would be a severe bottleneck). That's my personal
> assumption, correct me if I am wrong.
>
> So, the point here is: if we lock down a production cluster with Knox +
> strict firewalling,
> we still must allow access to the ports of the core services in the
> development environments.
>
> Consequences:
> 1. As you said, one will probably have to limit the dev envs to subsets
> of (or anonymized) data.
> 2. When the tested app is deployed in production,
> it will be very, very hard to troubleshoot (through Knox again) in case we
> encounter bugs at run-time.
> 3. I guess that different security policies and deployment methods among
> the clusters would be very confusing for the devs and the scientists: "On
> the dev you have an edge node, but not in production. Good luck, dear
> scientist."
> 4. Again, Hadoop administration, even with a proxied Ambari UI and some
> enhancements to KnoxShell,
> would be extremely challenging (impossible?) as well.
>
> For these reasons, I am very confident that we will continue to
> encourage the usage of Knox for its strengths:
>
>    - Single REST API Access Point
>    - Centralized authentication for Hadoop REST/HTTP services
>
> but I am still wondering if the following promises written by vendors are
> realistic, and how?
>
>    - Eliminates SSH edge node risks
>    - Hides Network Topology
>
> Another side effect is that integration with the Hadoop REST APIs
> becomes a restrictive requirement for most of our tools (ETL, Viz,
> Data Science, from vendors and the community).
>
> I hope I have explained my concerns correctly, in not-too-bad English :-)
>
> I am interested in any feedback on these concerns.
>
> Thanks for reading.
>
> Damien
>
> 2017-03-15 0:53 GMT+01:00 larry mccay <[email protected]>:
>
>> Hi Damien -
>>
>> Interesting questions...
>>
>> I suspect that development environments vary quite a bit in
>> configuration, but for the most part they are not typical production
>> deployment configurations.
>>
>> With the recent focus on the KnoxShell DSL and SDK classes, it makes sense
>> to try and determine what the programming model is for the use of those
>> aspects of Knox. However, the question you ask is how Knox fits into the
>> traditional Hadoop programming model, environment and flow.
>>
>> If you have anything particular in mind, I would be interested in hearing
>> what you think.
>>
>> Perimeter security is certainly achievable, but I guess there are valid
>> questions as to what sort of deployments are generally available for such
>> development. If you need access to the actual data, does it push you to
>> development in production-like environments?
>>
>> Again, I'm not sure what you have in mind here but interested to hear
>> more.
>>
>> thanks,
>>
>> --larry
>>
>>
>> On Tue, Mar 14, 2017 at 5:54 PM, Damien Claveau <[email protected]
>> > wrote:
>>
>>> Hi,
>>>
>>> First time emailing the user mailing list.
>>>
>>> We currently use Knox successfully on several Kerberized clusters in 
>>> production,
>>>
>>> and mainly use it to integrate with external client applications (such as 
>>> ETL and Viz tools),
>>>
>>> We would like to promote and generalize the concept of a single REST access
>>> point for all services,
>>>
>>> then, in an ideal world, ban access from the outside world to the RPC and
>>> Thrift interfaces of the core Hadoop services.
>>>
>>>
>>> The question is ...
>>>
>>> Even if we can deploy binaries, scripts, and workflows to HDFS and submit or
>>> schedule them through Knox,
>>>
>>> At the very beginning, the developers of course have to code apps (say
>>> Spark jobs)
>>> that are designed to run natively inside the cluster (and will use Java
>>> client libs to access the Thrift interfaces).
>>>
>>> How do you deal with that need?
>>> Do they develop on sandboxed environments or their own laptops without Knox,
>>> and so Knox only applies to the production/target clusters?
>>> Is the promise of "Perimeter Level Security" really achievable?
>>>
>>> Thank you for your feedback.
>>>
>>> Damien Claveau
>>>
>>> France
>>>
>>
>
>
> --
>
> *Damien Claveau*
> *MOBILE* 06 60 31 47 84 • *E-MAIL* [email protected]
>
