Well, I spoke too soon. It's working, but only in local mode. When I invoke `pyspark --master yarn` (or yarn-client), the submitted application goes from ACCEPTED to FAILED, with a NumberFormatException [0] in my container log. Now that Phoenix is on my classpath, I suspect the YARN client libraries bundled in the Phoenix client JAR are an incompatible version. I found an old thread [1] with the same stack trace I'm seeing and a similar conclusion. I tried setting spark.driver.extraClassPath and spark.executor.extraClassPath to /usr/hdp/current/hadoop-yarn-client:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar, but that appears to have no impact.
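
For reference, here's roughly how I'm submitting, with the classpath settings passed explicitly (a sketch; the paths follow the HDP layout above, and the same values could equally live in spark-defaults.conf):

pyspark --master yarn-client \
  --conf spark.driver.extraClassPath=/usr/hdp/current/hadoop-yarn-client:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar \
  --conf spark.executor.extraClassPath=/usr/hdp/current/hadoop-yarn-client:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar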
[0]:

16/01/20 22:03:45 ERROR yarn.ApplicationMaster: Uncaught exception:
java.lang.IllegalArgumentException: Invalid ContainerId: container_e07_1452901320122_0042_01_000001
    at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182)
    at org.apache.spark.deploy.yarn.YarnRMClient.getAttemptId(YarnRMClient.scala:93)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:85)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:574)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:572)
    at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:599)
    at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
Caused by: java.lang.NumberFormatException: For input string: "e07"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Long.parseLong(Long.java:589)
    at java.lang.Long.parseLong(Long.java:631)
    at org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
    at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
    ... 12 more

[1]: http://mail-archives.us.apache.org/mod_mbox/spark-user/201503.mbox/%3CCAAqMD1jSEvfyw9oUBymhZukm7f+WTDVZ8E6Zp3L4a9OBJ-hz=a...@mail.gmail.com%3E

On Wed, Jan 20, 2016 at 1:29 PM, Josh Mahonin <[email protected]> wrote:

> That's great to hear. Looking forward to the doc patch!
>
> On Wed, Jan 20, 2016 at 3:43 PM, Nick Dimiduk <[email protected]> wrote:
>
>> Josh -- I deployed my updated phoenix build across the cluster, added the
>> phoenix-client-spark.jar to configs on the whole cluster, and now basic
>> dataframe access is working. Let me see about updating the docs page to be
>> clearer; I'll send a patch by you for review.
>>
>> Thanks a lot for the help!
>> -n
>>
>> On Tue, Jan 19, 2016 at 5:59 PM, Josh Mahonin <[email protected]> wrote:
>>
>>> Right, this cluster I just tested on is HDP 2.3.4, so it's Spark on YARN
>>> as well. I suppose the JAR is probably shipped by YARN, though I don't see
>>> any logging that says so, so I'm not certain how the nuts and bolts of
>>> that work. By explicitly setting the classpath, we're bypassing Spark's
>>> native JAR broadcast, though.
>>>
>>> Taking a quick look at the config in Ambari (which ships the config to
>>> each node after saving), in 'Custom spark-defaults' I have the following:
>>>
>>> spark.driver.extraClassPath -> /etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>> spark.executor.extraClassPath -> /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>
>>> I'm not sure if the /etc/hbase/conf is strictly needed, but I think that
>>> gets the Ambari-generated hbase-site.xml onto the classpath. Each node has
>>> the custom phoenix-client-spark.jar installed at that same path as well.
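>>>
>>> (A quick way to confirm every node really does have the JAR at that path
>>> is a loop like the following; hostnames are placeholders, of course:)
>>>
>>> for h in node1 node2 node3; do
>>>   ssh "$h" ls -l /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>> done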
>>>
>>> I can pop into regular spark-shell and load RDDs/DataFrames using:
>>>
>>> /usr/hdp/current/spark-client/bin/spark-shell --master yarn-client
>>>
>>> or pyspark via:
>>>
>>> /usr/hdp/current/spark-client/bin/pyspark
>>>
>>> I also do this as the Ambari-created 'spark' user; I think there was some
>>> fun HDFS permission issue otherwise.
>>>
>>> On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <[email protected]> wrote:
>>>
>>>> I'm using Spark on YARN, not Spark standalone. YARN NodeManagers are
>>>> colocated with RegionServers; all the hosts have everything. There are no
>>>> Spark workers to restart. Are you sure it's not shipped by the YARN runtime?
>>>>
>>>> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <[email protected]> wrote:
>>>>
>>>>> Sadly, it needs to be installed on each Spark worker (for now). The
>>>>> executor config tells each Spark worker to look for that file to add to
>>>>> its classpath, so once you have it installed, you'll probably need to
>>>>> restart all the Spark workers.
>>>>>
>>>>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>>>>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>>>>> consistently see it is fine.
>>>>>
>>>>> One day we'll be able to have Spark ship the JAR around and use it
>>>>> without this classpath nonsense, but we need to do some extra work on the
>>>>> Phoenix side to make sure that Phoenix's calls to DriverManager actually
>>>>> go through Spark's weird wrapper version of it.
>>>>>
>>>>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <[email protected]> wrote:
>>>>>
>>>>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <[email protected]> wrote:
>>>>>>
>>>>>>> What version of Spark are you using?
>>>>>>
>>>>>> Probably HDP's Spark 1.4.1; that's what the jars in my install say, and
>>>>>> the welcome message in the pyspark console agrees.
>>>>>>
>>>>>>> Are there any other traces of exceptions anywhere?
>>>>>>
>>>>>> No other exceptions that I can find. YARN apparently doesn't want to
>>>>>> aggregate Spark's logs.
>>>>>>
>>>>>>> Are all your Spark nodes set up to point to the same
>>>>>>> phoenix-client-spark JAR?
>>>>>>
>>>>>> Yes, as far as I can tell... though come to think of it, is that jar
>>>>>> shipped by the Spark runtime to the workers, or is it loaded locally on
>>>>>> each host? I only changed spark-defaults.conf on the client machine, the
>>>>>> machine from which I submitted the job.
>>>>>>
>>>>>> Thanks for taking a look, Josh!
>>>>>>
>>>>>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> I'm doing my best to follow along with [0], but I'm hitting some
>>>>>>>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My
>>>>>>>> phoenix build is much newer, basically 4.6-branch + PHOENIX-2503 and
>>>>>>>> PHOENIX-2568. I'm using pyspark for now.
>>>>>>>>
>>>>>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>>>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This
>>>>>>>> allows me to use sqlContext.read to define a DataFrame against a Phoenix
>>>>>>>> table. This appears to basically work, as I see PhoenixInputFormat in the
>>>>>>>> logs and df.printSchema() shows me what I expect. However, when I try
>>>>>>>> df.take(5), I get "IllegalStateException: unread block data" [1] from the
>>>>>>>> workers.
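>>>>>>>>
>>>>>>>> To be concrete, the sequence looks like this (table name and ZK quorum
>>>>>>>> here are placeholders for my actual values):
>>>>>>>>
>>>>>>>> df = sqlContext.read \
>>>>>>>>     .format("org.apache.phoenix.spark") \
>>>>>>>>     .option("table", "MY_TABLE") \
>>>>>>>>     .option("zkUrl", "zk-host:2181") \
>>>>>>>>     .load()
>>>>>>>> df.printSchema()  # works; schema is as expected
>>>>>>>> df.take(5)        # fails with [1] on the workers
>>>>>>>>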
>>>>>>>> Poking around, this is commonly a problem with the classpath. Any ideas
>>>>>>>> as to specifically which jars are needed? Or better still, how can I
>>>>>>>> debug this issue myself? Adding "/usr/hdp/current/hbase-client/lib/*" to
>>>>>>>> the classpath gives me a VerifyError about a netty method version
>>>>>>>> mismatch. Indeed, I see two netty versions in that lib directory...
>>>>>>>>
>>>>>>>> Thanks a lot,
>>>>>>>> -n
>>>>>>>>
>>>>>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>>>>>> [1]:
>>>>>>>>
>>>>>>>> java.lang.IllegalStateException: unread block data
>>>>>>>>     at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>>>>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>>>>>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>>>>>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>>>>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>>>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>>>>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>>>>     at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>>>>>     at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>>>
>>>>>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks for remembering about the docs, Josh.
>>>>>>>>>
>>>>>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Just an update for anyone interested: PHOENIX-2503 was just committed
>>>>>>>>>> for 4.7.0, and the docs have been updated to include these samples for
>>>>>>>>>> PySpark users.
>>>>>>>>>>
>>>>>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>>
>>>>>>>>>> Josh
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Nick,
>>>>>>>>>>>
>>>>>>>>>>> I think this used to work, and will again once PHOENIX-2503 gets
>>>>>>>>>>> resolved. With the Spark DataFrame support, all the necessary glue is
>>>>>>>>>>> there for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>>>>>>>> overriding the com.fasterxml.jackson JARs), you can do something like:
>>>>>>>>>>>
>>>>>>>>>>> df = sqlContext.read \
>>>>>>>>>>>     .format("org.apache.phoenix.spark") \
>>>>>>>>>>>     .option("table", "TABLE1") \
>>>>>>>>>>>     .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>     .load()
>>>>>>>>>>>
>>>>>>>>>>> and:
>>>>>>>>>>>
>>>>>>>>>>> df.write \
>>>>>>>>>>>     .format("org.apache.phoenix.spark") \
>>>>>>>>>>>     .mode("overwrite") \
>>>>>>>>>>>     .option("table", "TABLE1") \
>>>>>>>>>>>     .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>     .save()
>>>>>>>>>>>
>>>>>>>>>>> Yes, this should be added to the documentation. I hadn't actually
>>>>>>>>>>> tried this till just now. :)
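>>>>>>>>>>>
>>>>>>>>>>> For anyone following along, a full round trip from a freshly built
>>>>>>>>>>> DataFrame would look something like this sketch (it assumes TABLE1 has
>>>>>>>>>>> columns ID and COL1, as in the docs example):
>>>>>>>>>>>
>>>>>>>>>>> df = sqlContext.createDataFrame([(1, "foo"), (2, "bar")], ["ID", "COL1"])
>>>>>>>>>>> df.write \
>>>>>>>>>>>     .format("org.apache.phoenix.spark") \
>>>>>>>>>>>     .mode("overwrite") \
>>>>>>>>>>>     .option("table", "TABLE1") \
>>>>>>>>>>>     .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>     .save()
>>>>>>>>>>>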
>>>>>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Heya,
>>>>>>>>>>>>
>>>>>>>>>>>> Does anyone have experience using the phoenix-spark integration from
>>>>>>>>>>>> pyspark instead of Scala? Folks prefer Python around here...
>>>>>>>>>>>>
>>>>>>>>>>>> I did find this example [0] of using HBaseOutputFormat from pyspark,
>>>>>>>>>>>> but I haven't tried extending it for Phoenix. Maybe someone with more
>>>>>>>>>>>> pyspark experience knows better? It would be a great addition to our
>>>>>>>>>>>> documentation.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Nick
>>>>>>>>>>>>
>>>>>>>>>>>> [0]: https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
