Well, I spoke too soon. It's working, but only in local mode. When I invoke `pyspark --master yarn` (or yarn-client), the submitted application goes from ACCEPTED to FAILED, with a NumberFormatException [0] in my container log. Now that Phoenix is on my classpath, I suspect the YARN client libraries bundled in the Phoenix client JAR are an incompatible version. I found an old thread [1] with the same stack trace I'm seeing and a similar conclusion. I tried setting spark.driver.extraClassPath and spark.executor.extraClassPath to /usr/hdp/current/hadoop-yarn-client:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar, but that appears to have no impact.
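
For reference, here's roughly how I'm submitting, with the classpath settings passed explicitly (a sketch; the paths follow the HDP layout above, and the same values could equally live in spark-defaults.conf):

pyspark --master yarn-client \
  --conf spark.driver.extraClassPath=/usr/hdp/current/hadoop-yarn-client:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar \
  --conf spark.executor.extraClassPath=/usr/hdp/current/hadoop-yarn-client:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar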
[0]:

16/01/20 22:03:45 ERROR yarn.ApplicationMaster: Uncaught exception:
java.lang.IllegalArgumentException: Invalid ContainerId: container_e07_1452901320122_0042_01_000001
    at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182)
    at org.apache.spark.deploy.yarn.YarnRMClient.getAttemptId(YarnRMClient.scala:93)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:85)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:574)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:572)
    at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:599)
    at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
Caused by: java.lang.NumberFormatException: For input string: "e07"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Long.parseLong(Long.java:589)
    at java.lang.Long.parseLong(Long.java:631)
    at org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
    at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
    ... 12 more

[1]: http://mail-archives.us.apache.org/mod_mbox/spark-user/201503.mbox/%3CCAAqMD1jSEvfyw9oUBymhZukm7f+WTDVZ8E6Zp3L4a9OBJ-hz=a...@mail.gmail.com%3E

On Wed, Jan 20, 2016 at 1:29 PM, Josh Mahonin <[email protected]> wrote:

> That's great to hear. Looking forward to the doc patch!
>
> On Wed, Jan 20, 2016 at 3:43 PM, Nick Dimiduk <[email protected]> wrote:
>
>> Josh -- I deployed my updated phoenix build across the cluster, added the
>> phoenix-client-spark.jar to configs on the whole cluster, and now basic
>> dataframe access is working. Let me see about updating the docs page to be
>> clearer; I'll send a patch by you for review.
>>
>> Thanks a lot for the help!
>> -n
>>
>> On Tue, Jan 19, 2016 at 5:59 PM, Josh Mahonin <[email protected]> wrote:
>>
>>> Right, this cluster I just tested on is HDP 2.3.4, so it's Spark on YARN
>>> as well. I suppose the JAR is probably shipped by YARN, though I don't see
>>> any logging that says so, so I'm not certain how the nuts and bolts of
>>> that work. By explicitly setting the classpath, we're bypassing Spark's
>>> native JAR broadcast, though.
>>>
>>> Taking a quick look at the config in Ambari (which ships the config to
>>> each node after saving), in 'Custom spark-defaults' I have the following:
>>>
>>> spark.driver.extraClassPath -> /etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>> spark.executor.extraClassPath -> /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>
>>> I'm not sure if the /etc/hbase/conf is strictly needed, but I think that
>>> gets the Ambari-generated hbase-site.xml onto the classpath. Each node has
>>> the custom phoenix-client-spark.jar installed at that same path as well.
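>>>
>>> (A quick way to confirm every node really does have the JAR at that path
>>> is a loop like the following; hostnames are placeholders, of course:)
>>>
>>> for h in node1 node2 node3; do
>>>   ssh "$h" ls -l /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>> done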
>>>
>>> I can pop into regular spark-shell and load RDDs/DataFrames using:
>>>
>>> /usr/hdp/current/spark-client/bin/spark-shell --master yarn-client
>>>
>>> or pyspark via:
>>>
>>> /usr/hdp/current/spark-client/bin/pyspark
>>>
>>> I also do this as the Ambari-created 'spark' user; I think there was some
>>> fun HDFS permission issue otherwise.
>>>
>>> On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <[email protected]> wrote:
>>>
>>>> I'm using Spark on YARN, not Spark standalone. YARN NodeManagers are
>>>> colocated with RegionServers; all the hosts have everything. There are no
>>>> Spark workers to restart. Are you sure it's not shipped by the YARN runtime?
>>>>
>>>> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <[email protected]> wrote:
>>>>
>>>>> Sadly, it needs to be installed on each Spark worker (for now). The
>>>>> executor config tells each Spark worker to look for that file to add to
>>>>> its classpath, so once you have it installed, you'll probably need to
>>>>> restart all the Spark workers.
>>>>>
>>>>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>>>>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>>>>> consistently see it is fine.
>>>>>
>>>>> One day we'll be able to have Spark ship the JAR around and use it
>>>>> without this classpath nonsense, but we need to do some extra work on the
>>>>> Phoenix side to make sure that Phoenix's calls to DriverManager actually
>>>>> go through Spark's weird wrapper version of it.
>>>>>
>>>>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <[email protected]> wrote:
>>>>>
>>>>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <[email protected]> wrote:
>>>>>>
>>>>>>> What version of Spark are you using?
>>>>>>
>>>>>> Probably HDP's Spark 1.4.1; that's what the jars in my install say, and
>>>>>> the welcome message in the pyspark console agrees.
>>>>>>
>>>>>>> Are there any other traces of exceptions anywhere?
>>>>>>
>>>>>> No other exceptions that I can find. YARN apparently doesn't want to
>>>>>> aggregate Spark's logs.
>>>>>>
>>>>>>> Are all your Spark nodes set up to point to the same
>>>>>>> phoenix-client-spark JAR?
>>>>>>
>>>>>> Yes, as far as I can tell... though come to think of it, is that jar
>>>>>> shipped by the Spark runtime to the workers, or is it loaded locally on
>>>>>> each host? I only changed spark-defaults.conf on the client machine, the
>>>>>> machine from which I submitted the job.
>>>>>>
>>>>>> Thanks for taking a look, Josh!
>>>>>>
>>>>>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> I'm doing my best to follow along with [0], but I'm hitting some
>>>>>>>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My
>>>>>>>> phoenix build is much newer, basically 4.6-branch + PHOENIX-2503 and
>>>>>>>> PHOENIX-2568. I'm using pyspark for now.
>>>>>>>>
>>>>>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>>>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This
>>>>>>>> allows me to use sqlContext.read to define a DataFrame against a Phoenix
>>>>>>>> table. This appears to basically work, as I see PhoenixInputFormat in the
>>>>>>>> logs and df.printSchema() shows me what I expect. However, when I try
>>>>>>>> df.take(5), I get "IllegalStateException: unread block data" [1] from the
>>>>>>>> workers.
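>>>>>>>>
>>>>>>>> To be concrete, the sequence looks like this (table name and ZK quorum
>>>>>>>> here are placeholders for my actual values):
>>>>>>>>
>>>>>>>> df = sqlContext.read \
>>>>>>>>     .format("org.apache.phoenix.spark") \
>>>>>>>>     .option("table", "MY_TABLE") \
>>>>>>>>     .option("zkUrl", "zk-host:2181") \
>>>>>>>>     .load()
>>>>>>>> df.printSchema()  # works; schema is as expected
>>>>>>>> df.take(5)        # fails with [1] on the workers
>>>>>>>>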
>>>>>>>> Poking around, this is commonly a problem with the classpath. Any ideas
>>>>>>>> as to specifically which jars are needed? Or better still, how can I
>>>>>>>> debug this issue myself? Adding "/usr/hdp/current/hbase-client/lib/*" to
>>>>>>>> the classpath gives me a VerifyError about a netty method version
>>>>>>>> mismatch. Indeed, I see two netty versions in that lib directory...
>>>>>>>>
>>>>>>>> Thanks a lot,
>>>>>>>> -n
>>>>>>>>
>>>>>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>>>>>> [1]:
>>>>>>>>
>>>>>>>> java.lang.IllegalStateException: unread block data
>>>>>>>>     at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>>>>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>>>>>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>>>>>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>>>>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>>>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>>>>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>>>>     at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>>>>>     at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>>>
>>>>>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks for remembering about the docs, Josh.
>>>>>>>>>
>>>>>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Just an update for anyone interested: PHOENIX-2503 was just committed
>>>>>>>>>> for 4.7.0, and the docs have been updated to include these samples for
>>>>>>>>>> PySpark users.
>>>>>>>>>>
>>>>>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>>
>>>>>>>>>> Josh
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Nick,
>>>>>>>>>>>
>>>>>>>>>>> I think this used to work, and will again once PHOENIX-2503 gets
>>>>>>>>>>> resolved. With the Spark DataFrame support, all the necessary glue is
>>>>>>>>>>> there for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>>>>>>>> overriding the com.fasterxml.jackson JARs), you can do something like:
>>>>>>>>>>>
>>>>>>>>>>> df = sqlContext.read \
>>>>>>>>>>>     .format("org.apache.phoenix.spark") \
>>>>>>>>>>>     .option("table", "TABLE1") \
>>>>>>>>>>>     .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>     .load()
>>>>>>>>>>>
>>>>>>>>>>> and:
>>>>>>>>>>>
>>>>>>>>>>> df.write \
>>>>>>>>>>>     .format("org.apache.phoenix.spark") \
>>>>>>>>>>>     .mode("overwrite") \
>>>>>>>>>>>     .option("table", "TABLE1") \
>>>>>>>>>>>     .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>     .save()
>>>>>>>>>>>
>>>>>>>>>>> Yes, this should be added to the documentation. I hadn't actually
>>>>>>>>>>> tried this till just now. :)
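>>>>>>>>>>>
>>>>>>>>>>> For anyone following along, a full round trip from a freshly built
>>>>>>>>>>> DataFrame would look something like this sketch (it assumes TABLE1 has
>>>>>>>>>>> columns ID and COL1, as in the docs example):
>>>>>>>>>>>
>>>>>>>>>>> df = sqlContext.createDataFrame([(1, "foo"), (2, "bar")], ["ID", "COL1"])
>>>>>>>>>>> df.write \
>>>>>>>>>>>     .format("org.apache.phoenix.spark") \
>>>>>>>>>>>     .mode("overwrite") \
>>>>>>>>>>>     .option("table", "TABLE1") \
>>>>>>>>>>>     .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>     .save()
>>>>>>>>>>>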
>>>>>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Heya,
>>>>>>>>>>>>
>>>>>>>>>>>> Does anyone have experience using the phoenix-spark integration from
>>>>>>>>>>>> pyspark instead of Scala? Folks prefer Python around here...
>>>>>>>>>>>>
>>>>>>>>>>>> I did find this example [0] of using HBaseOutputFormat from pyspark,
>>>>>>>>>>>> but I haven't tried extending it for Phoenix. Maybe someone with more
>>>>>>>>>>>> pyspark experience knows better? It would be a great addition to our
>>>>>>>>>>>> documentation.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Nick
>>>>>>>>>>>>
>>>>>>>>>>>> [0]: https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
