On Mon, Jan 9, 2017 at 2:54 AM, Frank Heimerzheim <[email protected]> wrote:
> Hello Todd,
>
> one additional question:
>
> There exists a KuduContext in org.apache.kudu.spark.kudu._ which provides
> read/write/update to be used with Scala and Spark. I'm now looking for a
> similar solution for Python and Spark. I've found
> https://github.com/bkvarda/iot_demo which looks fine at first glance. But
> I would much prefer an "official" solution. Is there anything to be
> expected in the near future? Or is there a way, which I don't know yet,
> to use the Scala library from Python?
>

I'm not a real Spark expert (especially not pyspark) so I don't have a great
answer to this question. The github demo you linked above looks like a
reasonable approach, though.

Jordan Birdsell is our primary Python expert, and he filed
https://issues.apache.org/jira/browse/KUDU-1603 a while back. Hopefully he
will chime in with a better answer than I can give :)

-Todd

2016-12-13 16:05 GMT+01:00 Frank Heimerzheim <[email protected]>:
>
>> Hello Todd,
>>
>> thanks a lot for the clarification.
>>
>> Greetings
>> Frank
>>
>> 2016-12-13 15:36 GMT+01:00 Todd Lipcon <[email protected]>:
>>
>>> Hi Frank,
>>>
>>> I'm sorry to say that the Java storage handler implementation you're
>>> looking for doesn't exist. The Hive metastore requires that non-HDFS
>>> storage engines set some value for the 'storage handler' property, so
>>> Impala uses that special string to denote a Kudu table in the HMS.
>>> However, there is no such Java implementation: Impala detects this
>>> class name and uses its own implementation to plan and execute queries
>>> against Kudu.
>>>
>>> The Hive support for Kudu is tracked here:
>>> https://issues.apache.org/jira/browse/HIVE-12971
>>> This work isn't committed to the Hive project but there is a prototype
>>> on github that you could try. Note that it's not being actively
>>> developed by the Kudu dev community at this point in time, but if you
>>> get it working, please report back with your experiences.
>>>
>>> Thanks
>>> -Todd
>>>
>>> On Tue, Dec 13, 2016 at 6:12 PM, Frank Heimerzheim <[email protected]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> within the impala-shell I can create an external table and thereafter
>>>> select and insert data from an underlying Kudu table. Within the
>>>> statement for creation of the table a 'StorageHandler' will be set to
>>>> 'com.cloudera.kudu.hive.KuduStorageHandler'. Everything works fine, as
>>>> there apparently exists a *.jar with the referenced library within.
>>>>
>>>> When trying to select from a hive-shell there is an error that the
>>>> handler is not available. Trying to 'rdd.collect()' from a hiveCtx
>>>> within a sparkSession I also get an error JavaClassNotFoundException
>>>> as the KuduStorageHandler is not available.
>>>>
>>>> I then tried to find a jar in my system with the intention to copy it
>>>> to all my data nodes. Sadly I couldn't find the specific jar. I think
>>>> it exists in the system as Impala apparently is using it. For a test
>>>> I've changed the 'StorageHandler' in the creation statement to
>>>> 'com.cloudera.kudu.hive.KuduStorageHandler_foo'. The create statement
>>>> worked. Also the select from Impala, but it didn't return any data.
>>>> There was no error as I expected. The test was just for the case that
>>>> Impala would in a magic way select data from Kudu without a correct
>>>> 'StorageHandler'. Apparently this is not the case and Impala has
>>>> access to a 'com.cloudera.kudu.hive.KuduStorageHandler'.
>>>>
>>>> Long story, short question:
>>>> In which *.jar can I find the
>>>> 'com.cloudera.kudu.hive.KuduStorageHandler'?
>>>> Is copying the jar by hand to all nodes an appropriate way to put
>>>> Spark in a position to work with Kudu?
>>>> What about the beeline-shell from Hive and the possibility to read
>>>> from Kudu?
>>>>
>>>> My Environment: Cloudera 5.7 with kudu and impala-kudu from installed
>>>> parcels. Built a working python-kudu library successfully from
>>>> scratch (git).
>>>>
>>>> Thanks a lot!
>>>> Frank
>>>>
>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>

--
Todd Lipcon
Software Engineer, Cloudera
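
On the question at the top of this thread about using the Scala library from
Python: one possible route, sketched here as an illustration rather than an
official recommendation, is to go through Spark's generic DataFrame reader
and writer with the kudu-spark data source instead of KuduContext. The
sketch assumes the kudu-spark jar matching your Spark and Kudu versions is
already on the Spark classpath, and that your kudu-spark version supports
these options. The helper names and the master address and table name in the
usage comment are hypothetical placeholders.

```python
# Minimal sketch, assuming the kudu-spark jar is on the Spark classpath.
# Helper names are illustrative; they are not part of any Kudu API.

# Package name of the kudu-spark data source, as used from Scala.
KUDU_FORMAT = "org.apache.kudu.spark.kudu"


def kudu_options(master, table):
    """Build the option map passed to the kudu-spark data source."""
    return {"kudu.master": master, "kudu.table": table}


def read_kudu_table(spark, master, table):
    """Load a Kudu table as a Spark DataFrame via the generic reader."""
    return (spark.read
            .format(KUDU_FORMAT)
            .options(**kudu_options(master, table))
            .load())


def append_to_kudu_table(df, master, table):
    """Write the rows of a DataFrame to an existing Kudu table."""
    (df.write
       .format(KUDU_FORMAT)
       .options(**kudu_options(master, table))
       .mode("append")
       .save())


# Usage from a pyspark shell (placeholder address and table name):
#   df = read_kudu_table(spark, "kudu-master.example.com:7051", "my_table")
#   df.show()
```

The helpers take the Spark session and DataFrame as arguments and do not
import pyspark themselves, so they can be pasted into an existing pyspark
session as-is.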
