Hello, from the impala-shell I can create an external table and afterwards select from and insert into an underlying Kudu table. In the CREATE statement the storage handler is set to 'com.cloudera.kudu.hive.KuduStorageHandler'. Everything works fine, so apparently there is a *.jar somewhere on the system that contains the referenced class.
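For reference, this is roughly what the table creation looks like. The snippet wraps the DDL in the impyla client only so the example is self-contained; in practice I type the equivalent statement into impala-shell, and the host name, table names and the exact set of kudu.* properties are placeholders from memory rather than copies from my cluster:

```python
# Sketch only: impyla is used here just to make the example runnable as-is;
# in practice I run the equivalent DDL in impala-shell.
from impala.dbapi import connect

conn = connect(host='impalad-host.example.com', port=21050)  # placeholder host
cur = conn.cursor()

# External mapping table over an existing Kudu table. The relevant part for my
# question is the 'storage_handler' property; names and addresses are placeholders.
cur.execute("""
    CREATE EXTERNAL TABLE my_mapping_table
    TBLPROPERTIES(
      'storage_handler'       = 'com.cloudera.kudu.hive.KuduStorageHandler',
      'kudu.table_name'       = 'my_kudu_table',
      'kudu.master_addresses' = 'kudu-master.example.com:7051'
    )
""")

cur.close()
conn.close()
```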
When I try to select from that table in the hive-shell, I get an error that the storage handler is not available. Trying to rdd.collect() via a HiveContext in a Spark session likewise fails with a Java ClassNotFoundException because the KuduStorageHandler is not available (a minimal sketch of that attempt follows below).

I then tried to locate the jar on my system, intending to copy it to all data nodes, but I could not find it anywhere. It must exist somewhere, since Impala is apparently using it. As a test I changed the storage handler in the CREATE statement to 'com.cloudera.kudu.hive.KuduStorageHandler_foo'. The CREATE statement succeeded, and a SELECT from Impala also ran, but it returned no data and, contrary to my expectation, no error either. The test was only to rule out that Impala somehow reads from Kudu in some magic way without a correct storage handler; apparently that is not the case, so Impala does have access to a 'com.cloudera.kudu.hive.KuduStorageHandler'.

Long story, short questions:
- In which *.jar can I find the 'com.cloudera.kudu.hive.KuduStorageHandler'?
- Is copying that jar by hand to all nodes an appropriate way to put Spark in a position to work with Kudu?
- What about the beeline-shell from Hive, is it possible to read from Kudu there?
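For completeness, here is a minimal sketch of the Spark attempt (PySpark on the Spark 1.6 that ships with CDH 5.7; the table name is the placeholder from the DDL sketch above):

```python
# Minimal reproduction of the failing Spark read via the Hive metastore.
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(conf=SparkConf().setAppName("kudu-storage-handler-test"))
hiveCtx = HiveContext(sc)

# 'my_mapping_table' is the placeholder name from the DDL sketch above.
df = hiveCtx.sql("SELECT * FROM my_mapping_table")

# On my cluster this fails with a Java ClassNotFoundException for
# com.cloudera.kudu.hive.KuduStorageHandler once the rows are actually fetched.
rows = df.rdd.collect()
print(len(rows))

sc.stop()
```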
My environment: Cloudera 5.7 with Kudu and Impala-Kudu installed from parcels; I also built a working python-kudu library successfully from source (git).

Thanks a lot!
Frank