That's great to hear. Looking forward to the doc patch!

On Wed, Jan 20, 2016 at 3:43 PM, Nick Dimiduk <ndimi...@apache.org> wrote:
> Josh -- I deployed my updated phoenix build across the cluster, added the
> phoenix-client-spark.jar to configs on the whole cluster, and now basic
> dataframe access is working. Let me see about updating the docs page to
> be more clear; I'll send a patch by you for review.
>
> Thanks a lot for the help!
> -n
>
> On Tue, Jan 19, 2016 at 5:59 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>
>> Right, this cluster I just tested on is HDP 2.3.4, so it's Spark on YARN
>> as well. I suppose the JAR is probably shipped by YARN, though I don't see
>> any logging saying so, so I'm not certain how the nuts and bolts of that
>> work. By explicitly setting the classpath, we're bypassing Spark's native
>> JAR broadcast though.
>>
>> Taking a quick look at the config in Ambari (which ships the config to
>> each node after saving), in 'Custom spark-defaults' I have the following:
>>
>> spark.driver.extraClassPath ->
>> /etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>> spark.executor.extraClassPath ->
>> /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>
>> I'm not sure if the /etc/hbase/conf is strictly needed, but I think that
>> gets the Ambari-generated hbase-site.xml on the classpath. Each node has
>> the custom phoenix-client-spark.jar installed at that same path as well.
>>
>> I can pop into the regular spark-shell and load RDDs/DataFrames using:
>> /usr/hdp/current/spark-client/bin/spark-shell --master yarn-client
>>
>> or pyspark via:
>> /usr/hdp/current/spark-client/bin/pyspark
>>
>> I also do this as the Ambari-created 'spark' user; I think there was some
>> fun HDFS permission issue otherwise.
>>
>> On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <ndimi...@apache.org> wrote:
>>
>>> I'm using Spark on YARN, not Spark standalone. YARN NodeManagers are
>>> colocated with RegionServers; all the hosts have everything. There are no
>>> Spark workers to restart. You're sure it's not shipped by the YARN runtime?
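As a plain spark-defaults.conf fragment, the two Ambari settings described above amount to the following (a sketch; the paths are the HDP defaults discussed in this thread and will differ on other layouts):

```
# spark-defaults.conf -- classpath entries for phoenix-spark on HDP.
# /etc/hbase/conf puts the Ambari-generated hbase-site.xml on the driver's path.
spark.driver.extraClassPath   /etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
spark.executor.extraClassPath /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
```

Note that extraClassPath entries are resolved locally on each node rather than shipped by Spark, which is why the thread stresses installing phoenix-client-spark.jar at the same path on every host.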
>>>
>>> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>>>
>>>> Sadly, it needs to be installed onto each Spark worker (for now). The
>>>> executor config tells each Spark worker to look for that file to add to its
>>>> classpath, so once you have it installed, you'll probably need to restart
>>>> all the Spark workers.
>>>>
>>>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>>>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>>>> consistently see it is fine.
>>>>
>>>> One day we'll be able to have Spark ship the JAR around and use it
>>>> without this classpath nonsense, but we need to do some extra work on the
>>>> Phoenix side to make sure that Phoenix's calls to DriverManager actually go
>>>> through Spark's weird wrapper version of it.
>>>>
>>>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <ndimi...@apache.org> wrote:
>>>>
>>>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>>>>>
>>>>>> What version of Spark are you using?
>>>>>
>>>>> Probably HDP's Spark 1.4.1; that's what the jars in my install say,
>>>>> and the welcome message in the pyspark console agrees.
>>>>>
>>>>>> Are there any other traces of exceptions anywhere?
>>>>>
>>>>> No other exceptions that I can find. YARN apparently doesn't want to
>>>>> aggregate Spark's logs.
>>>>>
>>>>>> Are all your Spark nodes set up to point to the same
>>>>>> phoenix-client-spark JAR?
>>>>>
>>>>> Yes, as far as I can tell... though come to think of it, is that jar
>>>>> shipped by the Spark runtime to workers, or is it loaded locally on each
>>>>> host? I only changed spark-defaults.conf on the client machine, the
>>>>> machine from which I submitted the job.
>>>>>
>>>>> Thanks for taking a look, Josh!
>>>>>
>>>>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <ndimi...@apache.org> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> I'm doing my best to follow along with [0], but I'm hitting some
>>>>>>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My phoenix
>>>>>>> build is much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568.
>>>>>>> I'm using pyspark for now.
>>>>>>>
>>>>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>>>>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>>>>>> This appears to basically work, as I see PhoenixInputFormat in the logs
>>>>>>> and df.printSchema() shows me what I expect. However, when I try
>>>>>>> df.take(5), I get "IllegalStateException: unread block data" [1] from the
>>>>>>> workers. Poking around, this is commonly a problem with classpath. Any
>>>>>>> ideas as to specifically which jars are needed? Or better still, how to
>>>>>>> debug this issue myself? Adding "/usr/hdp/current/hbase-client/lib/*" to
>>>>>>> the classpath gives me a VerifyError about a netty method version
>>>>>>> mismatch. Indeed, I see two netty versions in that lib directory...
>>>>>>>
>>>>>>> Thanks a lot,
>>>>>>> -n
>>>>>>>
>>>>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>>>>> [1]:
>>>>>>>
>>>>>>> java.lang.IllegalStateException: unread block data
>>>>>>>     at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>>>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>>>>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>>>>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>>>>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>>>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>>>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>>>     at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>>>>     at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>>
>>>>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <jamestay...@apache.org> wrote:
>>>>>>>
>>>>>>>> Thanks for remembering about the docs, Josh.
>>>>>>>>
>>>>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <jmaho...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Just an update for anyone interested: PHOENIX-2503 was just
>>>>>>>>> committed for 4.7.0 and the docs have been updated to include these
>>>>>>>>> samples for PySpark users.
>>>>>>>>>
>>>>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>
>>>>>>>>> Josh
>>>>>>>>>
>>>>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Nick,
>>>>>>>>>>
>>>>>>>>>> I think this used to work, and will again once PHOENIX-2503 gets
>>>>>>>>>> resolved. With the Spark DataFrame support, all the necessary glue is
>>>>>>>>>> there for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>>>>>>> overriding the com.fasterxml.jackson JARs), you can do something like:
>>>>>>>>>>
>>>>>>>>>> df = sqlContext.read \
>>>>>>>>>>     .format("org.apache.phoenix.spark") \
>>>>>>>>>>     .option("table", "TABLE1") \
>>>>>>>>>>     .option("zkUrl", "localhost:63512") \
>>>>>>>>>>     .load()
>>>>>>>>>>
>>>>>>>>>> and
>>>>>>>>>>
>>>>>>>>>> df.write \
>>>>>>>>>>     .format("org.apache.phoenix.spark") \
>>>>>>>>>>     .mode("overwrite") \
>>>>>>>>>>     .option("table", "TABLE1") \
>>>>>>>>>>     .option("zkUrl", "localhost:63512") \
>>>>>>>>>>     .save()
>>>>>>>>>>
>>>>>>>>>> Yes, this should be added to the documentation. I hadn't actually
>>>>>>>>>> tried this till just now. :)
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <ndimi...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Heya,
>>>>>>>>>>>
>>>>>>>>>>> Has anyone any experience using phoenix-spark integration from
>>>>>>>>>>> pyspark instead of scala? Folks prefer python around here...
>>>>>>>>>>>
>>>>>>>>>>> I did find this example [0] of using HBaseOutputFormat from
>>>>>>>>>>> pyspark, haven't tried extending it for phoenix. Maybe someone with
>>>>>>>>>>> more experience in pyspark knows better? Would be a great addition to
>>>>>>>>>>> our documentation.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Nick
>>>>>>>>>>>
>>>>>>>>>>> [0]: https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
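For anyone retracing this thread end to end, the recipe it converges on looks roughly like the following. This is a sketch only: the paths are the HDP defaults, and the "TABLE1" and "localhost:63512" values are just the examples used in the messages above.

```
# 1. Install phoenix-client-spark.jar at the same local path on every node, and
#    point spark.driver.extraClassPath / spark.executor.extraClassPath at it
#    in spark-defaults.conf (extraClassPath is resolved locally, not shipped).
# 2. Launch pyspark on YARN (on HDP, as the Ambari-created 'spark' user):
/usr/hdp/current/spark-client/bin/pyspark --master yarn-client

# 3. In the shell, read via the phoenix-spark DataFrame API:
#    df = sqlContext.read.format("org.apache.phoenix.spark") \
#             .option("table", "TABLE1") \
#             .option("zkUrl", "localhost:63512") \
#             .load()
#    df.take(5)
```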