I'm using Spark on YARN, not Spark standalone. The YARN NodeManagers are colocated with the RegionServers; all the hosts have everything, so there are no Spark workers to restart. Are you sure the JAR isn't shipped by the YARN runtime?
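To check that from my end, a rough probe like the one below (run from the same pyspark shell, so sc already exists; the path and JAR name pattern are only guesses based on the HDP layout you mention) would at least tell me whether the file is present on the hosts where executors land. It only confirms the file exists on disk there, not that the executor JVM actually has it on its classpath:

def probe_phoenix_jar(_):
    # Runs on an executor: list any phoenix client-spark JARs visible on that host.
    import glob
    import socket
    jars = glob.glob("/usr/hdp/current/phoenix-client/phoenix-*-client-spark.jar")
    return (socket.gethostname(), tuple(sorted(jars)))

# Spread tasks across the cluster (not guaranteed to reach every host) and
# de-duplicate the results per host.
print(sorted(set(sc.parallelize(range(64), 64).map(probe_phoenix_jar).collect())))
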
On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <[email protected]> wrote:

> Sadly, it needs to be installed onto each Spark worker (for now). The executor config tells each Spark worker to look for that file to add to its classpath, so once you have it installed, you'll probably need to restart all the Spark workers.
>
> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in /usr/hdp/current/phoenix-client/, but anywhere that each worker can consistently see is fine.
>
> One day we'll be able to have Spark ship the JAR around and use it without this classpath nonsense, but we need to do some extra work on the Phoenix side to make sure that Phoenix's calls to DriverManager actually go through Spark's weird wrapper version of it.
>
> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <[email protected]> wrote:
>
>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <[email protected]> wrote:
>>
>>> What version of Spark are you using?
>>
>> Probably HDP's Spark 1.4.1; that's what the jars in my install say, and the welcome message in the pyspark console agrees.
>>
>>> Are there any other traces of exceptions anywhere?
>>
>> No other exceptions that I can find. YARN apparently doesn't want to aggregate Spark's logs.
>>
>>> Are all your Spark nodes set up to point to the same phoenix-client-spark JAR?
>>
>> Yes, as far as I can tell... though come to think of it, is that jar shipped by the Spark runtime to workers, or is it loaded locally on each host? I only changed spark-defaults.conf on the client machine, the machine from which I submitted the job.
>>
>> Thanks for taking a look Josh!
>>
>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <[email protected]> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I'm doing my best to follow along with [0], but I'm hitting some stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My Phoenix build is much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568. I'm using pyspark for now.
>>>>
>>>> I've added phoenix-$VERSION-client-spark.jar to both spark.executor.extraClassPath and spark.driver.extraClassPath. This allows me to use sqlContext.read to define a DataFrame against a Phoenix table. This appears to basically work, as I see PhoenixInputFormat in the logs and df.printSchema() shows me what I expect. However, when I try df.take(5), I get "IllegalStateException: unread block data" [1] from the workers. Poking around, this is commonly a classpath problem. Any ideas as to specifically which jars are needed? Or better still, how do I debug this issue myself? Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath gives me a VerifyError about a netty method version mismatch. Indeed, I see two netty versions in that lib directory...
>>>>
>>>> Thanks a lot,
>>>> -n
>>>>
>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>> [1]:
>>>>
>>>> java.lang.IllegalStateException: unread block data
>>>>         at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>         at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>         at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>
>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <[email protected]> wrote:
>>>>
>>>>> Thanks for remembering about the docs, Josh.
>>>>>
>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <[email protected]> wrote:
>>>>>
>>>>>> Just an update for anyone interested: PHOENIX-2503 was just committed for 4.7.0, and the docs have been updated to include these samples for PySpark users.
>>>>>>
>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <[email protected]> wrote:
>>>>>>
>>>>>>> Hey Nick,
>>>>>>>
>>>>>>> I think this used to work, and will again once PHOENIX-2503 gets resolved. With the Spark DataFrame support, all the necessary glue is there for Phoenix and pyspark to play nice. With that client JAR (or by overriding the com.fasterxml.jackson JARs), you can do something like:
>>>>>>>
>>>>>>> df = sqlContext.read \
>>>>>>>     .format("org.apache.phoenix.spark") \
>>>>>>>     .option("table", "TABLE1") \
>>>>>>>     .option("zkUrl", "localhost:63512") \
>>>>>>>     .load()
>>>>>>>
>>>>>>> And
>>>>>>>
>>>>>>> df.write \
>>>>>>>     .format("org.apache.phoenix.spark") \
>>>>>>>     .mode("overwrite") \
>>>>>>>     .option("table", "TABLE1") \
>>>>>>>     .option("zkUrl", "localhost:63512") \
>>>>>>>     .save()
>>>>>>>
>>>>>>> Yes, this should be added to the documentation. I hadn't actually tried this till just now. :)
>>>>>>>
>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <[email protected]> wrote:
>>>>>>>
>>>>>>>> Heya,
>>>>>>>>
>>>>>>>> Has anyone any experience using phoenix-spark integration from pyspark instead of Scala? Folks prefer Python around here...
>>>>>>>>
>>>>>>>> I did find this example [0] of using HBaseOutputFormat from pyspark, but I haven't tried extending it for Phoenix. Maybe someone with more experience in pyspark knows better? It would be a great addition to our documentation.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> [0]: https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
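
For reference, the minimal pyspark repro described in the quoted thread above looks roughly like this. The table name and ZK quorum are placeholders, and it assumes the phoenix-*-client-spark JAR is listed on both spark.driver.extraClassPath and spark.executor.extraClassPath in spark-defaults.conf on the submitting machine:

# Run inside the pyspark shell, where sqlContext already exists.
df = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "EXAMPLE_TABLE") \
    .option("zkUrl", "zkhost:2181") \
    .load()

df.printSchema()   # works: the schema is resolved on the driver
print(df.take(5))  # dies on the executors with "unread block data"

printSchema() only needs the driver-side classpath, while take(5) launches executor tasks that use PhoenixInputFormat, which would be consistent with the executors not seeing the same JAR as the driver.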
