So, let me give my take on what the development space is like with Hadoop, how exactly Knox fits in today, and where I think natural demand will pull it in the future.
First, a word about the history of REST APIs in the Hadoop ecosystem. For the most part, Hadoop developers (those developing MapReduce jobs or the various Hadoop services) have always worked within the cluster. ETL workers have worked inside the cluster on gateway nodes and used the various CLIs built into the platform to accomplish their work. These gateway machines and development environments were necessary due to the heavy footprint of binaries and configuration required for clients to work properly.

The addition of REST-ish APIs was largely to enable some scripting of tasks that would otherwise require those binaries and configuration in order to use the CLIs that data workers were used to. However, a number of things about how these APIs were added hampered their uptake, and it basically comes down to the fact that no attention was paid to how REST APIs are consumed by typical REST-based application developers. The authentication may be wide open or require Kerberos/SPNEGO, for instance. Middleware applications were encouraged to build dependencies on Hadoop jars for things like the UserGroupInformation classes and the Hadoop Java client. We basically left the entire population of non-Hadoop developers out in the cold instead of enabling them to consume Hadoop resources in the way that they, as REST-based developers, are accustomed to accessing things.

The history above explains why nearly every user of Hadoop has had to spend time on a gateway/edge machine and with the CLIs, and why there has been a slow uptake of the Hadoop REST APIs.

Apache Knox comes along with a charter to encourage the use of the REST APIs and to engage a whole new type of developer for the Hadoop world. With a single endpoint that exposes any number of Hadoop-related APIs, via an assortment of authentication methods and provider integrations for enterprise-class authentication and authorization solutions, developers are now able to code to a topology deployment model instead of a cluster authentication setting. Developers are able to leverage the REST APIs without requiring Hadoop jars and the version conflicts that come with them. They are able to build new CLIs that use this REST programming model and do not require the binaries and configuration that are needed inside an edge node. They can use SDK classes provided by the KnoxShell to access WebHDFS and other services without the use of UGI or SPNEGO (a minimal sketch follows after the list below).

Are there data workers that will continue to need access to edge nodes and be inside the cluster? In many instances, yes. Do they need to share their use of the edge machines with everyone that needs to do anything with Hadoop? No. The future in my mind separates the end user population into 3 or 4 camps:

* A relatively small number of data workers within an organization that regularly need to ssh into an edge machine
* A very small number of data curators that may not even have access to the edge nodes but usher data in and out of the cluster
* A much larger number of data scientists that will access and provide analytics from their desktops using HS2, Phoenix, HBase, etc. with JDBC/ODBC
* An extremely large number of new developers and users that will incorporate the use of REST APIs for Hadoop in their middleware applications to provide multi-tenant access to data and processing through proxies like Knox and messaging like Kafka, etc.

These developers and their users will actually be the largest-growing group of Hadoop end users going forward.
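To make that last point a bit more concrete, here is a minimal sketch of what the "no Hadoop jars, no UGI, no SPNEGO" model looks like from a plain Java client, using nothing but the JDK. The gateway host, the topology name ("sandbox") and the guest credentials are placeholders I made up; substitute whatever your own Knox deployment uses, and assume the JVM already trusts the gateway's TLS certificate.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class KnoxWebHdfsList {
    public static void main(String[] args) throws Exception {
        // Placeholder gateway address, topology name and demo credentials --
        // replace with your deployment's values. Assumes the JVM trusts the
        // gateway's TLS certificate.
        String gateway = "https://knox.example.com:8443/gateway/sandbox";
        String user = "guest";
        String password = "guest-password";

        // WebHDFS LISTSTATUS on /tmp, proxied by Knox. The client never learns
        // the NameNode or DataNode hostnames and ports.
        URL url = new URL(gateway + "/webhdfs/v1/tmp?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // HTTP Basic credentials are handled by the topology's authentication
        // provider; no Kerberos ticket, UGI or SPNEGO negotiation client-side.
        String creds = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + creds);

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON FileStatuses payload relayed by Knox
            }
        } finally {
            conn.disconnect();
        }
    }
}

The KnoxShell SDK classes wrap this same HTTP exchange so you don't have to hand-roll the request and response handling, but either way there are no Hadoop client binaries or cluster configuration files on the developer's machine.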
Edge node risk is greatly reduced in most cases and may be eliminated in some. The largest number of users will never see the network topology details of the cluster itself. Access to DataNodes, for instance, is completely hidden from end users and clients that use Knox; the hostnames and ports are not exposed.

So, in short, we still suffer from some short-sightedness in wanting to make the existing CLIs use Knox, or in twisting our minds around how to write an app that doesn't assume Knox during development but needs to deploy to an environment that has it. I am hoping to help make this clearer in blog articles on the programming model afforded by Knox, and as that programming model matures it should become more and more self-evident.

Thank you for the interesting discussion!

On Thu, Mar 16, 2017 at 8:38 PM, Damien Claveau <[email protected]> wrote:

> Hi Larry,
>
> Thanks for your thoughts on that.
> "How Knox fits into the traditional Hadoop programming model" is exactly
> the question.
>
> Few of the Java client libs are wrapped in an equivalent HTTP-oriented lib,
> and we certainly don't want the apps running inside YARN to request
> Knox's API (it would be a severe bottleneck). That's my personal
> assumption, correct me if I am wrong.
>
> So, the point here is: if we lock down a production cluster with Knox +
> strict firewalling,
> we still must allow access to the ports of the core services on the
> development environments.
>
> Consequences:
> 1. As you said, one will probably have to limit the dev envs to subsets
> of (or anonymized) data.
> 2. When the tested app is deployed in production,
> it will be very, very hard to troubleshoot (through Knox again) in case we
> encounter bugs at run-time.
> 3. I guess that different security policies and deployment methods among
> the clusters would be very confusing for the devs and the scientists: "On
> dev you have an edge node, but not in production. Good luck, dear
> scientist."
> 4. Again, Hadoop administration, even with a proxied Ambari UI and some
> enhancements to KnoxShell,
> would be extremely challenging (impossible?) as well.
>
> For these reasons, I am very confident that we will continue to
> encourage the usage of Knox for its strengths:
>
> - Single REST API Access Point
> - Centralized authentication for Hadoop REST/HTTP services
>
> but I am still wondering if the following promises written by vendors are
> realistic, and how:
>
> - Eliminates SSH edge node risks
> - Hides Network Topology
>
> Another side effect is that integration with the Hadoop REST APIs
> becomes a restrictive requirement for most of our tools (ETL, Viz,
> Data Science, from vendors and community).
>
> I hope I have correctly explained my interrogations, with a not-too-bad
> English :-)
>
> I am interested in any feedback on these concerns.
>
> Thanks for reading.
>
> Damien
>
>
> 2017-03-15 0:53 GMT+01:00 larry mccay <[email protected]>:
>
>> Hi Damien -
>>
>> Interesting questions...
>>
>> I suspect that development environments vary quite a bit in
>> configuration, but for the most part they are not typical production
>> deployment configurations.
>>
>> With the recent focus on the KnoxShell DSL and SDK classes, it makes
>> sense to try and determine what the programming model is for the use of
>> those aspects of Knox. However, the question you ask is how Knox fits
>> into the traditional Hadoop programming model, environment and flow.
>>
>> If you have anything particular in mind, I would be interested in hearing
>> what you think.
>>
>> Perimeter security is certainly achievable, but I guess there are valid
>> questions as to what sorts of deployments are generally available for such
>> development. If you need access to the actual data, does it push you to
>> development in production-like environments?
>>
>> Again, I'm not sure what you have in mind here but interested to hear
>> more.
>>
>> thanks,
>>
>> --larry
>>
>>
>> On Tue, Mar 14, 2017 at 5:54 PM, Damien Claveau <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> First time emailing the user mailing list.
>>>
>>> We currently use Knox successfully on several Kerberized clusters in
>>> production,
>>> and mainly use it to integrate with external client applications (such as
>>> ETL and Viz tools).
>>>
>>> We would like to promote and generalize the concept of a single REST
>>> access point for all services,
>>> then, in an ideal world, ban access from the outside world to the RPC and
>>> Thrift interfaces of the core Hadoop services.
>>>
>>> The question is...
>>>
>>> Even if we can deploy binaries, scripts and workflows to HDFS and submit
>>> or schedule them through Knox,
>>> at the very beginning the developers of course have to code apps (say
>>> Spark jobs)
>>> that are designed to run natively inside the cluster (and will use Java
>>> client libs to access the Thrift interfaces).
>>>
>>> How do you deal with that need?
>>> Do they develop on sandboxed environments or their own laptops without
>>> Knox, so that Knox only applies to the production/target clusters?
>>> Is the promise of "Perimeter Level Security" really achievable?
>>>
>>> Thank you for your feedback.
>>>
>>> Damien Claveau
>>>
>>> France
>>>
>>
>
>
> --
>
> *Damien Claveau*
> *MOBILE* 06 60 31 47 84 • *E-MAIL* [email protected]
