I'm using Spark on YARN, not Spark standalone. The YARN NodeManagers are colocated with the RegionServers; all the hosts have everything, so there are no Spark workers to restart. Are you sure the JAR isn't shipped by the YARN runtime?
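To check that from my end, a rough probe like the one below (run from the same pyspark shell, so sc already exists; the path and JAR name pattern are only guesses based on the HDP layout you mention) would at least tell me whether the file is present on the hosts where executors land. It only confirms the file exists on disk there, not that the executor JVM actually has it on its classpath:

def probe_phoenix_jar(_):
    # Runs on an executor: list any phoenix client-spark JARs visible on that host.
    import glob
    import socket
    jars = glob.glob("/usr/hdp/current/phoenix-client/phoenix-*-client-spark.jar")
    return (socket.gethostname(), tuple(sorted(jars)))

# Spread tasks across the cluster (not guaranteed to reach every host) and
# de-duplicate the results per host.
print(sorted(set(sc.parallelize(range(64), 64).map(probe_phoenix_jar).collect())))
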
On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <[email protected]> wrote:

> Sadly, it needs to be installed onto each Spark worker (for now). The executor config tells each Spark worker to look for that file to add to its classpath, so once you have it installed, you'll probably need to restart all the Spark workers.
>
> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in /usr/hdp/current/phoenix-client/, but anywhere that each worker can consistently see is fine.
>
> One day we'll be able to have Spark ship the JAR around and use it without this classpath nonsense, but we need to do some extra work on the Phoenix side to make sure that Phoenix's calls to DriverManager actually go through Spark's weird wrapper version of it.
>
> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <[email protected]> wrote:
>
>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <[email protected]> wrote:
>>
>>> What version of Spark are you using?
>>
>> Probably HDP's Spark 1.4.1; that's what the jars in my install say, and the welcome message in the pyspark console agrees.
>>
>>> Are there any other traces of exceptions anywhere?
>>
>> No other exceptions that I can find. YARN apparently doesn't want to aggregate Spark's logs.
>>
>>> Are all your Spark nodes set up to point to the same phoenix-client-spark JAR?
>>
>> Yes, as far as I can tell... though come to think of it, is that jar shipped by the Spark runtime to workers, or is it loaded locally on each host? I only changed spark-defaults.conf on the client machine, the machine from which I submitted the job.
>>
>> Thanks for taking a look Josh!
>>
>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <[email protected]> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I'm doing my best to follow along with [0], but I'm hitting some stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My Phoenix build is much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568. I'm using pyspark for now.
>>>>
>>>> I've added phoenix-$VERSION-client-spark.jar to both spark.executor.extraClassPath and spark.driver.extraClassPath. This allows me to use sqlContext.read to define a DataFrame against a Phoenix table. This appears to basically work, as I see PhoenixInputFormat in the logs and df.printSchema() shows me what I expect. However, when I try df.take(5), I get "IllegalStateException: unread block data" [1] from the workers. Poking around, this is commonly a classpath problem. Any ideas as to specifically which jars are needed? Or better still, how do I debug this issue myself? Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath gives me a VerifyError about a netty method version mismatch. Indeed, I see two netty versions in that lib directory...
>>>>
>>>> Thanks a lot,
>>>> -n
>>>>
>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>> [1]:
>>>>
>>>> java.lang.IllegalStateException: unread block data
>>>>         at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>         at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>         at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>
>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <[email protected]> wrote:
>>>>
>>>>> Thanks for remembering about the docs, Josh.
>>>>>
>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <[email protected]> wrote:
>>>>>
>>>>>> Just an update for anyone interested: PHOENIX-2503 was just committed for 4.7.0, and the docs have been updated to include these samples for PySpark users.
>>>>>>
>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <[email protected]> wrote:
>>>>>>
>>>>>>> Hey Nick,
>>>>>>>
>>>>>>> I think this used to work, and will again once PHOENIX-2503 gets resolved. With the Spark DataFrame support, all the necessary glue is there for Phoenix and pyspark to play nice. With that client JAR (or by overriding the com.fasterxml.jackson JARs), you can do something like:
>>>>>>>
>>>>>>> df = sqlContext.read \
>>>>>>>     .format("org.apache.phoenix.spark") \
>>>>>>>     .option("table", "TABLE1") \
>>>>>>>     .option("zkUrl", "localhost:63512") \
>>>>>>>     .load()
>>>>>>>
>>>>>>> And
>>>>>>>
>>>>>>> df.write \
>>>>>>>     .format("org.apache.phoenix.spark") \
>>>>>>>     .mode("overwrite") \
>>>>>>>     .option("table", "TABLE1") \
>>>>>>>     .option("zkUrl", "localhost:63512") \
>>>>>>>     .save()
>>>>>>>
>>>>>>> Yes, this should be added to the documentation. I hadn't actually tried this till just now. :)
>>>>>>>
>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <[email protected]> wrote:
>>>>>>>
>>>>>>>> Heya,
>>>>>>>>
>>>>>>>> Has anyone any experience using phoenix-spark integration from pyspark instead of Scala? Folks prefer Python around here...
>>>>>>>>
>>>>>>>> I did find this example [0] of using HBaseOutputFormat from pyspark, but I haven't tried extending it for Phoenix. Maybe someone with more experience in pyspark knows better? It would be a great addition to our documentation.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> [0]: https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
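
For reference, the minimal pyspark repro described in the quoted thread above looks roughly like this. The table name and ZK quorum are placeholders, and it assumes the phoenix-*-client-spark JAR is listed on both spark.driver.extraClassPath and spark.executor.extraClassPath in spark-defaults.conf on the submitting machine:

# Run inside the pyspark shell, where sqlContext already exists.
df = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "EXAMPLE_TABLE") \
    .option("zkUrl", "zkhost:2181") \
    .load()

df.printSchema()   # works: the schema is resolved on the driver
print(df.take(5))  # dies on the executors with "unread block data"

printSchema() only needs the driver-side classpath, while take(5) launches executor tasks that use PhoenixInputFormat, which would be consistent with the executors not seeing the same JAR as the driver.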
