That's great to hear. Looking forward to the doc patch!

On Wed, Jan 20, 2016 at 3:43 PM, Nick Dimiduk <ndimi...@apache.org> wrote:
> Josh -- I deployed my updated phoenix build across the cluster, added the
> phoenix-client-spark.jar to configs on the whole cluster, and now basic
> dataframe access is working. Let me see about updating the docs page to
> be more clear; I'll send a patch by you for review.
>
> Thanks a lot for the help!
> -n
>
> On Tue, Jan 19, 2016 at 5:59 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>
>> Right, this cluster I just tested on is HDP 2.3.4, so it's Spark on YARN
>> as well. I suppose the JAR is probably shipped by YARN, though I don't see
>> any logging saying so, so I'm not certain how the nuts and bolts of that
>> work. By explicitly setting the classpath, we're bypassing Spark's native
>> JAR broadcast though.
>>
>> Taking a quick look at the config in Ambari (which ships the config to
>> each node after saving), in 'Custom spark-defaults' I have the following:
>>
>> spark.driver.extraClassPath ->
>> /etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>> spark.executor.extraClassPath ->
>> /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>
>> I'm not sure if the /etc/hbase/conf is strictly needed, but I think that
>> gets the Ambari-generated hbase-site.xml on the classpath. Each node has
>> the custom phoenix-client-spark.jar installed at that same path as well.
>>
>> I can pop into the regular spark-shell and load RDDs/DataFrames using:
>> /usr/hdp/current/spark-client/bin/spark-shell --master yarn-client
>>
>> or pyspark via:
>> /usr/hdp/current/spark-client/bin/pyspark
>>
>> I also do this as the Ambari-created 'spark' user; I think there was some
>> fun HDFS permission issue otherwise.
>>
>> On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <ndimi...@apache.org> wrote:
>>
>>> I'm using Spark on YARN, not Spark standalone. YARN NodeManagers are
>>> colocated with RegionServers; all the hosts have everything. There are no
>>> Spark workers to restart. You're sure it's not shipped by the YARN runtime?
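As a plain spark-defaults.conf fragment, the two Ambari settings described above amount to the following (a sketch; the paths are the HDP defaults discussed in this thread and will differ on other layouts):

```
# spark-defaults.conf -- classpath entries for phoenix-spark on HDP.
# /etc/hbase/conf puts the Ambari-generated hbase-site.xml on the driver's path.
spark.driver.extraClassPath   /etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
spark.executor.extraClassPath /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
```

Note that extraClassPath entries are resolved locally on each node rather than shipped by Spark, which is why the thread stresses installing phoenix-client-spark.jar at the same path on every host.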
>>>
>>> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>>>
>>>> Sadly, it needs to be installed onto each Spark worker (for now). The
>>>> executor config tells each Spark worker to look for that file to add to its
>>>> classpath, so once you have it installed, you'll probably need to restart
>>>> all the Spark workers.
>>>>
>>>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>>>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>>>> consistently see it is fine.
>>>>
>>>> One day we'll be able to have Spark ship the JAR around and use it
>>>> without this classpath nonsense, but we need to do some extra work on the
>>>> Phoenix side to make sure that Phoenix's calls to DriverManager actually go
>>>> through Spark's weird wrapper version of it.
>>>>
>>>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <ndimi...@apache.org> wrote:
>>>>
>>>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>>>>>
>>>>>> What version of Spark are you using?
>>>>>
>>>>> Probably HDP's Spark 1.4.1; that's what the jars in my install say,
>>>>> and the welcome message in the pyspark console agrees.
>>>>>
>>>>>> Are there any other traces of exceptions anywhere?
>>>>>
>>>>> No other exceptions that I can find. YARN apparently doesn't want to
>>>>> aggregate Spark's logs.
>>>>>
>>>>>> Are all your Spark nodes set up to point to the same
>>>>>> phoenix-client-spark JAR?
>>>>>
>>>>> Yes, as far as I can tell... though come to think of it, is that jar
>>>>> shipped by the Spark runtime to workers, or is it loaded locally on each
>>>>> host? I only changed spark-defaults.conf on the client machine, the
>>>>> machine from which I submitted the job.
>>>>>
>>>>> Thanks for taking a look, Josh!
>>>>>
>>>>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <ndimi...@apache.org> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> I'm doing my best to follow along with [0], but I'm hitting some
>>>>>>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My phoenix
>>>>>>> build is much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568.
>>>>>>> I'm using pyspark for now.
>>>>>>>
>>>>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>>>>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>>>>>> This appears to basically work, as I see PhoenixInputFormat in the logs
>>>>>>> and df.printSchema() shows me what I expect. However, when I try
>>>>>>> df.take(5), I get "IllegalStateException: unread block data" [1] from the
>>>>>>> workers. Poking around, this is commonly a problem with classpath. Any
>>>>>>> ideas as to specifically which jars are needed? Or better still, how to
>>>>>>> debug this issue myself? Adding "/usr/hdp/current/hbase-client/lib/*" to
>>>>>>> the classpath gives me a VerifyError about a netty method version
>>>>>>> mismatch. Indeed, I see two netty versions in that lib directory...
>>>>>>>
>>>>>>> Thanks a lot,
>>>>>>> -n
>>>>>>>
>>>>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>>>>> [1]:
>>>>>>>
>>>>>>> java.lang.IllegalStateException: unread block data
>>>>>>>     at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>>>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>>>>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>>>>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>>>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>>>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>>>     at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>>>>     at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>>
>>>>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <jamestay...@apache.org> wrote:
>>>>>>>
>>>>>>>> Thanks for remembering about the docs, Josh.
>>>>>>>>
>>>>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <jmaho...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Just an update for anyone interested: PHOENIX-2503 was just
>>>>>>>>> committed for 4.7.0 and the docs have been updated to include these
>>>>>>>>> samples for PySpark users.
>>>>>>>>>
>>>>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>
>>>>>>>>> Josh
>>>>>>>>>
>>>>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Nick,
>>>>>>>>>>
>>>>>>>>>> I think this used to work, and will again once PHOENIX-2503 gets
>>>>>>>>>> resolved. With the Spark DataFrame support, all the necessary glue is
>>>>>>>>>> there for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>>>>>>> overriding the com.fasterxml.jackson JARs), you can do something like:
>>>>>>>>>>
>>>>>>>>>> df = sqlContext.read \
>>>>>>>>>>     .format("org.apache.phoenix.spark") \
>>>>>>>>>>     .option("table", "TABLE1") \
>>>>>>>>>>     .option("zkUrl", "localhost:63512") \
>>>>>>>>>>     .load()
>>>>>>>>>>
>>>>>>>>>> and
>>>>>>>>>>
>>>>>>>>>> df.write \
>>>>>>>>>>     .format("org.apache.phoenix.spark") \
>>>>>>>>>>     .mode("overwrite") \
>>>>>>>>>>     .option("table", "TABLE1") \
>>>>>>>>>>     .option("zkUrl", "localhost:63512") \
>>>>>>>>>>     .save()
>>>>>>>>>>
>>>>>>>>>> Yes, this should be added to the documentation. I hadn't actually
>>>>>>>>>> tried this till just now. :)
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <ndimi...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Heya,
>>>>>>>>>>>
>>>>>>>>>>> Has anyone any experience using phoenix-spark integration from
>>>>>>>>>>> pyspark instead of scala? Folks prefer python around here...
>>>>>>>>>>>
>>>>>>>>>>> I did find this example [0] of using HBaseOutputFormat from
>>>>>>>>>>> pyspark, haven't tried extending it for phoenix. Maybe someone with
>>>>>>>>>>> more experience in pyspark knows better? Would be a great addition to
>>>>>>>>>>> our documentation.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Nick
>>>>>>>>>>>
>>>>>>>>>>> [0]: https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
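For anyone retracing this thread end to end, the recipe it converges on looks roughly like the following. This is a sketch only: the paths are the HDP defaults, and the "TABLE1" and "localhost:63512" values are just the examples used in the messages above.

```
# 1. Install phoenix-client-spark.jar at the same local path on every node, and
#    point spark.driver.extraClassPath / spark.executor.extraClassPath at it
#    in spark-defaults.conf (extraClassPath is resolved locally, not shipped).
# 2. Launch pyspark on YARN (on HDP, as the Ambari-created 'spark' user):
/usr/hdp/current/spark-client/bin/pyspark --master yarn-client

# 3. In the shell, read via the phoenix-spark DataFrame API:
#    df = sqlContext.read.format("org.apache.phoenix.spark") \
#             .option("table", "TABLE1") \
#             .option("zkUrl", "localhost:63512") \
#             .load()
#    df.take(5)
```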