Hello Todd, one additional question:
There is a KuduContext in org.apache.kudu.spark.kudu._ that provides read/write/update for use with Scala and Spark. I'm now looking for a similar solution for Python and Spark.

I've found https://github.com/bkvarda/iot_demo, which looks fine at first glance, but I would much prefer an "official" solution. Is anything to be expected in the near future? Or is there a way, one I don't know of yet, to use the Scala library from Python?

Thanks
Frank

2016-12-13 16:05 GMT+01:00 Frank Heimerzheim <[email protected]>:

> Hello Todd,
>
> thanks a lot for the clarification.
>
> Greetings
> Frank
>
> 2016-12-13 15:36 GMT+01:00 Todd Lipcon <[email protected]>:
>
>> Hi Frank,
>>
>> I'm sorry to say that the Java storage handler implementation you're
>> looking for doesn't exist. The Hive metastore requires that non-HDFS
>> storage engines set some value for the 'storage handler' property, so
>> Impala uses that special string to denote a Kudu table in the HMS. However,
>> there is no such Java implementation; Impala detects this class name and
>> uses its own implementation to plan and execute queries against Kudu.
>>
>> The Hive support for Kudu is tracked here:
>> https://issues.apache.org/jira/browse/HIVE-12971
>> This work isn't committed to the Hive project but there is a prototype on
>> github that you could try. Note that it's not being actively developed by
>> the Kudu dev community at this point in time, but if you get it working,
>> please report back with your experiences.
>>
>> Thanks
>> -Todd
>>
>> On Tue, Dec 13, 2016 at 6:12 PM, Frank Heimerzheim <[email protected]>
>> wrote:
>>
>>> Hello,
>>>
>>> within the impala-shell I can create an external table and thereafter
>>> select and insert data from an underlying Kudu table. Within the statement
>>> for creation of the table a 'StorageHandler' will be set to
>>> 'com.cloudera.kudu.hive.KuduStorageHandler'. Everything works fine, as
>>> there apparently exists a *.jar with the referenced library within.
>>>
>>> When trying to select from a hive-shell there is an error that the
>>> handler is not available. Trying to 'rdd.collect()' from a hiveCtx within
>>> a sparkSession I also get an error JavaClassNotFoundException as
>>> the KuduStorageHandler is not available.
>>>
>>> I then tried to find a jar in my system with the intention to copy it to
>>> all my data nodes. Sadly I couldn't find the specific jar. I think it
>>> exists in the system as Impala apparently is using it. For a test I've
>>> changed the 'StorageHandler' in the creation statement to
>>> 'com.cloudera.kudu.hive.KuduStorageHandler_foo'. The create statement
>>> worked. So did the select from Impala, but it didn't return any data. There
>>> was no error, as I had expected. The test was just for the case that Impala
>>> would in a magic way select data from Kudu without a correct 'StorageHandler'.
>>> Apparently this is not the case and Impala has access to a
>>> 'com.cloudera.kudu.hive.KuduStorageHandler'.
>>>
>>> Long story, short question:
>>> In which *.jar can I find the 'com.cloudera.kudu.hive.KuduStorageHandler'?
>>> Is the approach of copying the jar by hand to all nodes an appropriate way
>>> to bring Spark into a position to work with Kudu?
>>> What about the beeline-shell from Hive and the possibility to read from
>>> Kudu?
>>>
>>> My environment: Cloudera 5.7 with kudu and impala-kudu installed from
>>> parcels. Built a working python-kudu library successfully from scratch (git).
>>>
>>> Thanks a lot!
>>> Frank
>>>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
