Hey Nick,

What version of Spark are you using?

I just tried spark-1.5.2-bin-hadoop2.4 with the latest from Phoenix master
(probably the same phoenix-spark code as your version) from pyspark, and was
able to do a df.take(). Note that this was on one machine, using
phoenix_sandbox.py and a local spark-shell. I also tried it with my fork of
HDP Phoenix (with basically those same patches applied), and a df.take()
works for me on a reasonable dataset across a number of nodes using pyspark.
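
For what it's worth, the check I did was essentially just the read example
from further down this thread plus a take(). A rough sketch (the table name
and zkUrl are only the placeholder values from that example):

# Define a DataFrame over a Phoenix table, then force an actual read.
# "TABLE1" and the zkUrl are placeholders from the example below, not real values.
# (sqlContext here is the one the pyspark shell already provides.)
df = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "TABLE1") \
    .option("zkUrl", "localhost:63512") \
    .load()

df.printSchema()  # resolved on the driver, so it can succeed even if workers are misconfigured
df.take(5)        # this is the step that ships tasks to the executors

If printSchema() works but take() blows up on the workers, that usually points
at the executor classpath rather than the driver.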

Are there any other traces of exceptions anywhere? Are all your Spark nodes
set up to point to the same phoenix-client-spark JAR?
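
Concretely, what I'd expect on every node is just the two properties you
mentioned, both pointing at the exact same client JAR. Something along these
lines in spark-defaults.conf (the path is only an illustrative placeholder;
use wherever your phoenix-$VERSION-client-spark.jar actually lives):

# spark-defaults.conf on every node -- the path below is a placeholder
spark.executor.extraClassPath  /path/to/phoenix-$VERSION-client-spark.jar
spark.driver.extraClassPath    /path/to/phoenix-$VERSION-client-spark.jar

If a worker picks up a different JAR (or none at all), the kind of
deserialization failure you pasted is a pretty common symptom.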

Josh

On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <[email protected]> wrote:

> Hi guys,
>
> I'm doing my best to follow along with [0], but I'm hitting some stumbling
> blocks. I'm running with HDP 2.3 for HBase and Spark. My Phoenix build is
> much newer, basically 4.6-branch + PHOENIX-2503 and PHOENIX-2568. I'm using
> pyspark for now.
>
> I've added phoenix-$VERSION-client-spark.jar to both
> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
> me to use sqlContext.read to define a DataFrame against a Phoenix table.
> This appears to basically work: I see PhoenixInputFormat in the logs, and
> df.printSchema() shows me what I expect. However, when I try df.take(5), I
> get "IllegalStateException: unread block data" [1] from the workers. From
> poking around, this is commonly a classpath problem. Any ideas as to
> specifically which JARs are needed? Or better still, how can I debug this
> issue myself? Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
> gives me a VerifyError about a netty method version mismatch. Indeed, I see
> two netty versions in that lib directory...
>
> Thanks a lot,
> -n
>
> [0]: http://phoenix.apache.org/phoenix_spark.html
> [1]:
>
> java.lang.IllegalStateException: unread block data
>   at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>   at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
>
> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <[email protected]> wrote:
>
>> Thanks for remembering about the docs, Josh.
>>
>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <[email protected]> wrote:
>>
>>> Just an update for anyone interested: PHOENIX-2503 was just committed
>>> for 4.7.0, and the docs have been updated to include these samples for
>>> PySpark users.
>>>
>>> https://phoenix.apache.org/phoenix_spark.html
>>>
>>> Josh
>>>
>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <[email protected]> wrote:
>>>
>>>> Hey Nick,
>>>>
>>>> I think this used to work, and will again once PHOENIX-2503 gets
>>>> resolved. With the Spark DataFrame support, all the necessary glue is
>>>> there for Phoenix and pyspark to play nice. With that client JAR (or by
>>>> overriding the com.fasterxml.jackson JARs), you can do something like:
>>>>
>>>> df = sqlContext.read \
>>>>     .format("org.apache.phoenix.spark") \
>>>>     .option("table", "TABLE1") \
>>>>     .option("zkUrl", "localhost:63512") \
>>>>     .load()
>>>>
>>>> And:
>>>>
>>>> df.write \
>>>>     .format("org.apache.phoenix.spark") \
>>>>     .mode("overwrite") \
>>>>     .option("table", "TABLE1") \
>>>>     .option("zkUrl", "localhost:63512") \
>>>>     .save()
>>>>
>>>> Yes, this should be added to the documentation. I hadn't actually tried
>>>> this till just now. :)
>>>>
>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <[email protected]> wrote:
>>>>
>>>>> Heya,
>>>>>
>>>>> Does anyone have experience using the phoenix-spark integration from
>>>>> pyspark instead of Scala? Folks prefer Python around here...
>>>>>
>>>>> I did find this example [0] of using HBaseOutputFormat from pyspark,
>>>>> but I haven't tried extending it for Phoenix. Maybe someone with more
>>>>> experience in pyspark knows better? It would be a great addition to
>>>>> our documentation.
>>>>>
>>>>> Thanks,
>>>>> Nick
>>>>>
>>>>> [0]: https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
