Hey Nick,

What version of Spark are you using?

I just tried spark-1.5.2-bin-hadoop2.4 with the latest from Phoenix master
(probably the same phoenix-spark code as your version) from pyspark, and was
able to do a df.take(). Note that this was on one machine, using
phoenix_sandbox.py and a local spark-shell. I also tried it with my fork of
HDP Phoenix (with basically those same patches applied), and a df.take()
works for me on a reasonable dataset across a number of nodes using pyspark.
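
For what it's worth, the check I did was essentially just the read example
from further down this thread plus a take(). A rough sketch (the table name
and zkUrl are only the placeholder values from that example):

# Define a DataFrame over a Phoenix table, then force an actual read.
# "TABLE1" and the zkUrl are placeholders from the example below, not real values.
# (sqlContext here is the one the pyspark shell already provides.)
df = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "TABLE1") \
    .option("zkUrl", "localhost:63512") \
    .load()

df.printSchema()  # resolved on the driver, so it can succeed even if workers are misconfigured
df.take(5)        # this is the step that ships tasks to the executors

If printSchema() works but take() blows up on the workers, that usually points
at the executor classpath rather than the driver.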

Are there any other traces of exceptions anywhere? Are all your Spark nodes
set up to point to the same phoenix-client-spark JAR?
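
Concretely, what I'd expect on every node is just the two properties you
mentioned, both pointing at the exact same client JAR. Something along these
lines in spark-defaults.conf (the path is only an illustrative placeholder;
use wherever your phoenix-$VERSION-client-spark.jar actually lives):

# spark-defaults.conf on every node -- the path below is a placeholder
spark.executor.extraClassPath  /path/to/phoenix-$VERSION-client-spark.jar
spark.driver.extraClassPath    /path/to/phoenix-$VERSION-client-spark.jar

If a worker picks up a different JAR (or none at all), the kind of
deserialization failure you pasted is a pretty common symptom.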

Josh

On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <[email protected]> wrote:

> Hi guys,
>
> I'm doing my best to follow along with [0], but I'm hitting some stumbling
> blocks. I'm running with HDP 2.3 for HBase and Spark. My Phoenix build is
> much newer, basically 4.6-branch + PHOENIX-2503 and PHOENIX-2568. I'm using
> pyspark for now.
>
> I've added phoenix-$VERSION-client-spark.jar to both
> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
> me to use sqlContext.read to define a DataFrame against a Phoenix table.
> This appears to basically work: I see PhoenixInputFormat in the logs, and
> df.printSchema() shows me what I expect. However, when I try df.take(5), I
> get "IllegalStateException: unread block data" [1] from the workers. From
> poking around, this is commonly a classpath problem. Any ideas as to
> specifically which JARs are needed? Or better still, how can I debug this
> issue myself? Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
> gives me a VerifyError about a netty method version mismatch. Indeed, I see
> two netty versions in that lib directory...
>
> Thanks a lot,
> -n
>
> [0]: http://phoenix.apache.org/phoenix_spark.html
> [1]:
>
> java.lang.IllegalStateException: unread block data
>   at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>   at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
>
> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <[email protected]> wrote:
>
>> Thanks for remembering about the docs, Josh.
>>
>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <[email protected]> wrote:
>>
>>> Just an update for anyone interested: PHOENIX-2503 was just committed
>>> for 4.7.0, and the docs have been updated to include these samples for
>>> PySpark users.
>>>
>>> https://phoenix.apache.org/phoenix_spark.html
>>>
>>> Josh
>>>
>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <[email protected]> wrote:
>>>
>>>> Hey Nick,
>>>>
>>>> I think this used to work, and will again once PHOENIX-2503 gets
>>>> resolved. With the Spark DataFrame support, all the necessary glue is
>>>> there for Phoenix and pyspark to play nice. With that client JAR (or by
>>>> overriding the com.fasterxml.jackson JARs), you can do something like:
>>>>
>>>> df = sqlContext.read \
>>>>     .format("org.apache.phoenix.spark") \
>>>>     .option("table", "TABLE1") \
>>>>     .option("zkUrl", "localhost:63512") \
>>>>     .load()
>>>>
>>>> And:
>>>>
>>>> df.write \
>>>>     .format("org.apache.phoenix.spark") \
>>>>     .mode("overwrite") \
>>>>     .option("table", "TABLE1") \
>>>>     .option("zkUrl", "localhost:63512") \
>>>>     .save()
>>>>
>>>> Yes, this should be added to the documentation. I hadn't actually tried
>>>> this till just now. :)
>>>>
>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <[email protected]> wrote:
>>>>
>>>>> Heya,
>>>>>
>>>>> Does anyone have experience using the phoenix-spark integration from
>>>>> pyspark instead of Scala? Folks prefer Python around here...
>>>>>
>>>>> I did find this example [0] of using HBaseOutputFormat from pyspark,
>>>>> but I haven't tried extending it for Phoenix. Maybe someone with more
>>>>> experience in pyspark knows better? It would be a great addition to
>>>>> our documentation.
>>>>>
>>>>> Thanks,
>>>>> Nick
>>>>>
>>>>> [0]: https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
