Hi guys, I'm doing my best to follow along with [0], but I'm hitting some stumbling blocks. I'm running HDP 2.3 for HBase and Spark. My Phoenix build is much newer: essentially the 4.6 branch plus PHOENIX-2503 and PHOENIX-2568. I'm using pyspark for now.
I've added phoenix-$VERSION-client-spark.jar to both spark.executor.extraClassPath and spark.driver.extraClassPath. This allows me to use sqlContext.read to define a DataFrame against a Phoenix table. It appears to basically work: I see PhoenixInputFormat in the logs, and df.printSchema() shows me what I expect. However, when I try df.take(5), I get "IllegalStateException: unread block data" [1] from the workers. Poking around, this is commonly a classpath problem. Any ideas as to which jars, specifically, are needed? Or better still, how can I debug this issue myself? Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath instead gives me a VerifyError about a netty method version mismatch; indeed, I see two netty versions in that lib directory... (A sketch of my setup is at the end of this message.)

Thanks a lot,
-n

[0]: http://phoenix.apache.org/phoenix_spark.html

[1]: java.lang.IllegalStateException: unread block data
    at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <[email protected]> wrote:

> Thanks for remembering about the docs, Josh.
>
> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <[email protected]> wrote:
>
>> Just an update for anyone interested, PHOENIX-2503 was just committed
>> for 4.7.0 and the docs have been updated to include these samples for
>> PySpark users.
>>
>> https://phoenix.apache.org/phoenix_spark.html
>>
>> Josh
>>
>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <[email protected]> wrote:
>>
>>> Hey Nick,
>>>
>>> I think this used to work, and will again once PHOENIX-2503 gets
>>> resolved. With the Spark DataFrame support, all the necessary glue is
>>> there for Phoenix and pyspark to play nice. With that client JAR (or
>>> by overriding the com.fasterxml.jackson JARs), you can do something
>>> like:
>>>
>>> df = sqlContext.read \
>>>     .format("org.apache.phoenix.spark") \
>>>     .option("table", "TABLE1") \
>>>     .option("zkUrl", "localhost:63512") \
>>>     .load()
>>>
>>> And:
>>>
>>> df.write \
>>>     .format("org.apache.phoenix.spark") \
>>>     .mode("overwrite") \
>>>     .option("table", "TABLE1") \
>>>     .option("zkUrl", "localhost:63512") \
>>>     .save()
>>>
>>> Yes, this should be added to the documentation. I hadn't actually
>>> tried this till just now. :)
>>>
>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <[email protected]> wrote:
>>>
>>>> Heya,
>>>>
>>>> Has anyone any experience using the phoenix-spark integration from
>>>> pyspark instead of Scala? Folks prefer Python around here...
>>>>
>>>> I did find this example [0] of using HBaseOutputFormat from pyspark;
>>>> haven't tried extending it for phoenix.
>>>> Maybe someone with more experience in pyspark knows better? It would
>>>> be a great addition to our documentation.
>>>>
>>>> Thanks,
>>>> Nick
>>>>
>>>> [0]:
>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>
>>
>
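As mentioned above, here's a minimal sketch of my setup. The jar path, table name, and ZooKeeper quorum below are placeholders, not values from my cluster:

    # spark-defaults.conf -- same client JAR on driver and executors
    spark.driver.extraClassPath   /path/to/phoenix-4.6.0-client-spark.jar
    spark.executor.extraClassPath /path/to/phoenix-4.6.0-client-spark.jar

And the pyspark session that hits the failure:

    # MY_TABLE and zkhost:2181 are placeholders
    df = sqlContext.read \
        .format("org.apache.phoenix.spark") \
        .option("table", "MY_TABLE") \
        .option("zkUrl", "zkhost:2181") \
        .load()

    df.printSchema()  # works -- shows the expected schema
    df.take(5)        # fails on the workers: "unread block data"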
