If the Knox team implements the Hadoop FileSystem API, the ORC reader and
writer could use it automatically.

.. Owen
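To make that concrete, here is a rough sketch of what this would mean in
practice. It is hypothetical: no Knox-backed FileSystem implementation exists
today, and the knoxfs:// scheme and the org.example.KnoxFileSystem class below
are placeholders. Because the core ORC API (org.apache.orc) resolves its
FileSystem from the Path's URI scheme and the Configuration, registering such
an implementation via the standard fs.<scheme>.impl setting would be enough
for the reader and writer to use it without any ORC-side changes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;

public class OrcOverKnoxSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical: register a Knox-backed FileSystem under the knoxfs scheme.
    // No such implementation ships today; the class name is a placeholder.
    conf.set("fs.knoxfs.impl", "org.example.KnoxFileSystem");

    // ORC resolves the FileSystem from the path's URI scheme, so the reader
    // would route all I/O through the registered implementation unchanged.
    Path path = new Path("knoxfs://gateway-host:8443/tmp/example/data.orc");
    Reader reader = OrcFile.createReader(path, OrcFile.readerOptions(conf));
    System.out.println(reader.getSchema());
  }
}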
On Mon, Mar 6, 2017 at 10:17 PM, Srinivas M <[email protected]> wrote:

> Thanks Owen and Larry for your perspective on this. This information is
> helpful. I shall explore the alternatives to meet the requirements of my
> use case for now.
>
> On a side note, it was mentioned that there are plans (or it is being
> considered) to add the knoxFS. I have a question on that. As and when such
> a client / API is made available, would the ORC implementation also have
> to be enhanced to support the knoxFS in the ORC API, or would that come
> in by default? Or is it too early to discuss that?
>
> On Tue, Mar 7, 2017 at 12:20 AM, larry mccay <[email protected]> wrote:
>
>> Thanks for adding the Knox list to this conversation, Owen!
>>
>> This is an interesting topic and one that we should define an end-to-end
>> use case for.
>>
>> We have considered a number of things to address this at one time or
>> another and have encountered one or more roadblocks on some of them:
>>
>> * A Knox (or Proxy) FileSystem implementation that would accommodate the
>> additional context needed to route requests through a proxy such as Knox
>> by altering the default URLs to match what is expected by Knox. There
>> was a POC of this done a while back, and we can try to dust that off.
>> * Knox did have a feature for configuring the "default topology", which
>> would allow URLs in the form expected by direct webhdfs access to work,
>> with Knox translating the interactions into the context of the
>> configured default URLs. Unfortunately, this feature is currently not
>> working, and we have a JIRA filed to correct that.
>> * There may be work needed in the Java webhdfs client in order to
>> accommodate SPNEGO on the redirected DataNode (DN) interactions.
>> Currently, the DN doesn't expect the hadoop.auth cookie but a block
>> access token instead (I believe). So, when the block access token is
>> presented to a Knox instance that is configured to use the Hadoop Auth
>> provider, Knox doesn't find a hadoop.auth cookie and challenges the
>> client again. Existing clients don't expect this and throw an exception.
>> Investigation is needed here to find the most efficient way to address
>> this.
>>
>> Incidentally, you may also consider looking at the KnoxShell client
>> classes to write a file to HDFS.
>>
>> http://knox.apache.org/books/knox-0-11-0/user-guide.html#Client+Details
>>
>> The example below shows how to use the Groovy-based DSL for establishing
>> a "session" and for deleting, writing, and reading files in HDFS. The
>> underlying Java classes can also be used directly, as an SDK, to do the
>> same.
>>
>> Taking up the gateway-shell module is easily done by adding a Maven
>> dependency on that module to your project. Additionally, the 0.12.0
>> release, which is currently undergoing a release VOTE, contains a
>> separate client artifact for download.
>>
>> import org.apache.hadoop.gateway.shell.Hadoop
>> import org.apache.hadoop.gateway.shell.hdfs.Hdfs
>> import groovy.json.JsonSlurper
>>
>> gateway = "https://localhost:8443/gateway/sandbox"
>> username = "guest"
>> password = "guest-password"
>> dataFile = "README"
>>
>> session = Hadoop.login( gateway, username, password )
>> Hdfs.rm( session ).file( "/tmp/example" ).recursive().now()
>> Hdfs.put( session ).file( dataFile ).to( "/tmp/example/README" ).now()
>> text = Hdfs.ls( session ).dir( "/tmp/example" ).now().string
>> json = (new JsonSlurper()).parseText( text )
>> println json.FileStatuses.FileStatus.pathSuffix
>> session.shutdown()
>> exit
>>
>> On Mon, Mar 6, 2017 at 11:45 AM, Owen O'Malley <[email protected]> wrote:
>>
>>> Unfortunately, in the short run, you'll need to copy them locally using
>>> wget or curl and then read the ORC file using file:/// paths to use the
>>> local file system.
>>>
>>> I talked with Larry McCay from the Knox project, and he said that they
>>> are considering making a KnoxFS Java client, which implements
>>> org.apache.hadoop.fs.FileSystem, that would handle this use case.
>>>
>>> .. Owen
>>>
>>> On Mon, Mar 6, 2017 at 4:05 AM, Srinivas M <[email protected]> wrote:
>>>
>>> > Hi
>>> >
>>> > I have an application that uses the Hive ORC API to write an ORC file
>>> > to HDFS. I use the native FileSystem API and pass the WebHDFS URI
>>> > (webhdfs://host:port) to create a FileSystem object:
>>> >
>>> > fs = FileSystem.get(hdfsuri, conf, _user);
>>> >
>>> > While trying to connect through the Knox gateway, is there a way to
>>> > still use the native FileSystem, or should I be using REST API calls
>>> > to access the files on HDFS?
>>> >
>>> > If so, is there any way to read or write an ORC file in such a case,
>>> > given that the ORC Reader or Writer needs an object of type
>>> > "org.apache.hadoop.fs.FileSystem"?
>>> >
>>> > --
>>> > Srinivas
>>> > (*-*)
>>> > You have to grow from the inside out. None can teach you, none can
>>> > make you spiritual.
>>> > -Narendra Nath Dutta(Swamy Vivekananda)
>
> --
> Srinivas
> (*-*)
> You have to grow from the inside out. None can teach you, none can make
> you spiritual.
> -Narendra Nath Dutta(Swamy Vivekananda)
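For the short-run workaround described in the thread, a minimal sketch of the
write path, assuming the core ORC API (org.apache.orc) and placeholder paths
and schema: write the ORC file against the local file system using a file:///
path, so no HDFS FileSystem object is needed, and then push the finished file
through the gateway afterwards, for example with the KnoxShell Hdfs.put(...)
call shown above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class WriteLocalOrcSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    TypeDescription schema = TypeDescription.fromString("struct<x:bigint>");

    // Write to the local file system; no connection to HDFS is involved here.
    Writer writer = OrcFile.createWriter(new Path("file:///tmp/example.orc"),
        OrcFile.writerOptions(conf).setSchema(schema));

    VectorizedRowBatch batch = schema.createRowBatch();
    LongColumnVector x = (LongColumnVector) batch.cols[0];
    for (long row = 0; row < 10; ++row) {
      x.vector[batch.size++] = row;
    }
    writer.addRowBatch(batch);
    writer.close();

    // The finished local file can then be uploaded through the gateway, e.g.
    // with the KnoxShell Hdfs.put( session ).file( ... ).to( ... ) call above.
  }
}

Reading works the same way in reverse, as Owen describes: pull the file down
through the gateway first, then open it with OrcFile.createReader on a
file:/// path and the local file system.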
