Hi guys, I'm doing my best to follow along with [0], but I'm hitting some stumbling blocks. I'm running HDP 2.3 for HBase and Spark. My Phoenix build is much newer: essentially the 4.6 branch plus PHOENIX-2503 and PHOENIX-2568. I'm using pyspark for now.
I've added phoenix-$VERSION-client-spark.jar to both spark.executor.extraClassPath and spark.driver.extraClassPath. This allows me to use sqlContext.read to define a DataFrame against a Phoenix table. It appears to basically work: I see PhoenixInputFormat in the logs, and df.printSchema() shows me what I expect. However, when I try df.take(5), I get "IllegalStateException: unread block data" [1] from the workers. Poking around, this is commonly a classpath problem. Any ideas as to which jars, specifically, are needed? Or better still, how can I debug this issue myself? Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath instead gives me a VerifyError about a netty method version mismatch; indeed, I see two netty versions in that lib directory... (A sketch of my setup is at the end of this message.)

Thanks a lot,
-n

[0]: http://phoenix.apache.org/phoenix_spark.html

[1]: java.lang.IllegalStateException: unread block data
    at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <[email protected]> wrote:

> Thanks for remembering about the docs, Josh.
>
> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <[email protected]> wrote:
>
>> Just an update for anyone interested, PHOENIX-2503 was just committed
>> for 4.7.0 and the docs have been updated to include these samples for
>> PySpark users.
>>
>> https://phoenix.apache.org/phoenix_spark.html
>>
>> Josh
>>
>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <[email protected]> wrote:
>>
>>> Hey Nick,
>>>
>>> I think this used to work, and will again once PHOENIX-2503 gets
>>> resolved. With the Spark DataFrame support, all the necessary glue is
>>> there for Phoenix and pyspark to play nice. With that client JAR (or
>>> by overriding the com.fasterxml.jackson JARs), you can do something
>>> like:
>>>
>>> df = sqlContext.read \
>>>     .format("org.apache.phoenix.spark") \
>>>     .option("table", "TABLE1") \
>>>     .option("zkUrl", "localhost:63512") \
>>>     .load()
>>>
>>> And:
>>>
>>> df.write \
>>>     .format("org.apache.phoenix.spark") \
>>>     .mode("overwrite") \
>>>     .option("table", "TABLE1") \
>>>     .option("zkUrl", "localhost:63512") \
>>>     .save()
>>>
>>> Yes, this should be added to the documentation. I hadn't actually
>>> tried this till just now. :)
>>>
>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <[email protected]> wrote:
>>>
>>>> Heya,
>>>>
>>>> Has anyone any experience using the phoenix-spark integration from
>>>> pyspark instead of Scala? Folks prefer Python around here...
>>>>
>>>> I did find this example [0] of using HBaseOutputFormat from pyspark;
>>>> haven't tried extending it for phoenix.
>>>> Maybe someone with more experience in pyspark knows better? It would
>>>> be a great addition to our documentation.
>>>>
>>>> Thanks,
>>>> Nick
>>>>
>>>> [0]:
>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>
>>
>
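As mentioned above, here's a minimal sketch of my setup. The jar path, table name, and ZooKeeper quorum below are placeholders, not values from my cluster:

    # spark-defaults.conf -- same client JAR on driver and executors
    spark.driver.extraClassPath   /path/to/phoenix-4.6.0-client-spark.jar
    spark.executor.extraClassPath /path/to/phoenix-4.6.0-client-spark.jar

And the pyspark session that hits the failure:

    # MY_TABLE and zkhost:2181 are placeholders
    df = sqlContext.read \
        .format("org.apache.phoenix.spark") \
        .option("table", "MY_TABLE") \
        .option("zkUrl", "zkhost:2181") \
        .load()

    df.printSchema()  # works -- shows the expected schema
    df.take(5)        # fails on the workers: "unread block data"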
