NPE in Parquet
All, I strongly suspect this might be caused by a glitch in the communication with Google Cloud Storage, where my job is writing to, as this NPE shows up fairly randomly. Any ideas?

Exception in thread "Thread-126" java.lang.NullPointerException
    at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
    at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
    at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:447)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:485)
    at org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:65)
    at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:190)
    at Truven$Stats$anonfun$save_to_parquet$3$anonfun$21$anon$7.run(Truven.scala:957)

Alex
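If the NPE really is triggered by transient GCS hiccups while Spark reads the Parquet footer metadata, one pragmatic workaround is to retry the failing call a few times before giving up. Below is a minimal sketch of a generic retry helper in plain Scala; the attempt count is arbitrary, and wrapping `sqlContext.parquetFile` (the call at the bottom of the trace) is just one candidate use:

```scala
object Retry {
  /** Evaluate `body`, retrying up to `attempts` times on any exception.
    * `body` is by-name, so each retry re-evaluates it from scratch. */
  def retry[T](attempts: Int)(body: => T): T =
    try body
    catch {
      case _: Exception if attempts > 1 => retry(attempts - 1)(body)
    }
}

// Hypothetical usage around the failing call from the stack trace:
//   val data = Retry.retry(3)(sqlContext.parquetFile("gs://my-bucket/out"))
```

This only papers over a transient failure; if the NPE is deterministic (e.g. a missing or corrupt `_metadata` file in the output directory), retrying will not help.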
Scaladoc
How do I build the scaladoc HTML files from the Spark source distribution? Alex Baretta
Still struggling with building documentation
I finally came to realize that there is a special Maven target to build the scaladocs, although arguably a very unintuitive one: mvn verify. So now I have scaladocs for each package, but not for the whole Spark project. Specifically, build/docs/api/scala/index.html is missing. Indeed, the whole build/docs/api directory referenced in api.html is missing. How do I build it? Alex Baretta
Re: Still struggling with building documentation
Nicholas and Patrick,

Thanks for your help, but, no, it still does not work. The latest master produces the following scaladoc errors:

[error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:55: not found: type Type
[error]   protected Type type() { return Type.UPLOAD_BLOCK; }
[error]             ^
[error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:39: not found: type Type
[error]   protected Type type() { return Type.STREAM_HANDLE; }
[error]             ^
[error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40: not found: type Type
[error]   protected Type type() { return Type.OPEN_BLOCKS; }
[error]             ^
[error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44: not found: type Type
[error]   protected Type type() { return Type.REGISTER_EXECUTOR; }
[error]             ^
...
[error] four errors found
[error] (spark/javaunidoc:doc) javadoc returned nonzero exit code
[error] (spark/scalaunidoc:doc) Scaladoc generation failed
[error] Total time: 140 s, completed Nov 11, 2014 10:20:53 AM

Moving back into docs dir.
Making directory api/scala
cp -r ../target/scala-2.10/unidoc/. api/scala
Making directory api/java
cp -r ../target/javaunidoc/. api/java
Moving to python/docs directory and building sphinx.
Makefile:14: *** The 'sphinx-build' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the 'sphinx-build' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/. Stop.
Moving back into home dir.
Making directory api/python
cp -r python/docs/_build/html/. docs/api/python
/usr/lib/ruby/1.9.1/fileutils.rb:1515:in `stat': No such file or directory - python/docs/_build/html/.
(Errno::ENOENT)
    from /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `block in fu_each_src_dest'
    from /usr/lib/ruby/1.9.1/fileutils.rb:1529:in `fu_each_src_dest0'
    from /usr/lib/ruby/1.9.1/fileutils.rb:1513:in `fu_each_src_dest'
    from /usr/lib/ruby/1.9.1/fileutils.rb:436:in `cp_r'
    from /home/alex/git/spark/docs/_plugins/copy_api_dirs.rb:79:in `'
    from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
    from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
    from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:76:in `block in setup'
    from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `each'
    from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `setup'
    from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:30:in `initialize'
    from /usr/bin/jekyll:224:in `new'
    from /usr/bin/jekyll:224:in `'

What next?

Alex

On Fri, Nov 7, 2014 at 12:54 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> I believe the web docs need to be built separately according to the instructions here
> <https://github.com/apache/spark/blob/master/docs/README.md>.
>
> Did you give those a shot?
>
> It's annoying to have a separate thing with new dependencies in order to build the web docs, but that's how it is at the moment.
>
> Nick
Spark Shell slowness on Google Cloud
All, I'm using the Spark shell to interact with a small test deployment of Spark, built from the current master branch. I'm processing a dataset comprising a few thousand objects on Google Cloud Storage, split into a half dozen directories. My code constructs an object--let me call it the Dataset object--that defines a distinct RDD for each directory. The constructor of the object only defines the RDDs; it does not actually evaluate them, so I would expect it to return very quickly. Indeed, the logging code in the constructor prints a line signaling the completion of the code almost immediately after invocation, but the Spark shell does not show the prompt right away. Instead, it spends a few minutes seemingly frozen, eventually producing the following output:

14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to process : 9
14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to process : 759
14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to process : 228
14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to process : 3076
14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to process : 1013
14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to process : 156

This stage is inexplicably slow. What could be happening?

Thanks.

Alex
Re: Spark Shell slowness on Google Cloud
Denny,

No, gsutil scans through the listing of the bucket quickly. See the following.

alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"
6860

real    0m6.971s
user    0m1.052s
sys     0m0.096s

Alex

On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee wrote:
> I'm curious if you're seeing the same thing when using bdutil against GCS? I'm wondering if this may be an issue concerning the transfer rate of Spark -> Hadoop -> GCS Connector -> GCS.
Re: Spark Shell slowness on Google Cloud
Well, what do you suggest I run to test this? But more importantly, what information would this give me?

On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee wrote:
> Oh, it makes sense if gsutil scans through this quickly, but I was wondering if running a Hadoop job / bdutil would result in just as fast scans?
Re: Spark Shell slowness on Google Cloud
Here's another data point: the slow part of my code is the construction of an RDD as the union of the textFile RDDs representing data from several distinct Google Storage directories. So the question becomes the following: what computation happens when calling the union method on two RDDs?

On Wed, Dec 17, 2014 at 11:24 PM, Alessandro Baretta wrote:
> Well, what do you suggest I run to test this? But more importantly, what information would this give me?
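A plausible explanation (an assumption on my part, not confirmed in the thread): union itself builds the combined RDD cheaply, but as soon as Spark needs the combined partition list, each underlying Hadoop RDD invokes FileInputFormat.getSplits, which lists every input path against GCS; those listings are exactly the "Total input paths to process" lines above, and against an object store they can be slow. The define-fast/force-slow timing pattern can be modeled in plain Scala, with no Spark involved; `FakeRDD` and `listFiles` below are made-up stand-ins for HadoopRDD and the GCS directory scan:

```scala
// Plain-Scala model of lazy partition computation (no Spark involved).
object LazyListingModel {
  var listings = 0  // counts how many directory listings have run

  def listFiles(dir: String): Seq[String] = {
    listings += 1  // pretend this is a slow GCS metadata scan
    Seq(s"$dir/part-00000", s"$dir/part-00001")
  }

  class FakeRDD(dir: String) {
    // Partitions are computed lazily, on first access -- as in Spark.
    lazy val partitions: Seq[String] = listFiles(dir)
  }

  /** Returns (listings after construction, listings after union-like op). */
  def demo(): (Int, Int) = {
    val rdds = Seq("gs://bucket/a", "gs://bucket/b").map(new FakeRDD(_))
    val afterConstruction = listings          // still 0: defining is cheap
    val allParts = rdds.flatMap(_.partitions) // forcing partitions runs the listings
    (afterConstruction, listings)
  }
}
```

If this model matches what Spark is doing, the fix is not in union but in making the per-directory listing cheaper (fewer, larger objects, or a flatter directory layout).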
/tmp directory fills up
Gents, I'm building Spark from the current master branch and deploying it to Google Compute Engine on top of Hadoop 2.4/YARN via bdutil, Google's Hadoop cluster provisioning tool. bdutil configures Spark with spark.local.dir=/hadoop/spark/tmp, but this option is ignored in combination with YARN. bdutil also configures YARN with:

  yarn.nodemanager.local-dirs = /mnt/pd1/hadoop/yarn/nm-local-dir
  ("Directories on the local machine in which to store application temp files.")

This is the right directory for Spark to store temporary data in. Still, Spark is creating directories such as this:

  /tmp/spark-51388ee6-9de6-411d-b9b9-ab6f9502d01e

and filling them up with gigabytes' worth of output files, filling up the very small root filesystem. How can I diagnose why my Spark installation is not picking up yarn.nodemanager.local-dirs from YARN?

Alex
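One angle to check (based on my understanding of Spark on YARN, not confirmed in this thread): processes running inside YARN containers inherit their scratch space from the NodeManager via the LOCAL_DIRS environment variable, whereas the driver-side process (e.g. a shell running in yarn-client mode, outside any container) falls back to spark.local.dir or java.io.tmpdir, which defaults to /tmp. So /tmp/spark-* directories on a node may be coming from a process that YARN does not manage. For reference, the yarn-site.xml fragment corresponding to the setting above would look roughly like this (the path is the one bdutil reportedly uses; adjust for your cluster):

```xml
<!-- yarn-site.xml: where NodeManagers place container scratch space.
     Spark executors launched by YARN inherit this via LOCAL_DIRS. -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/mnt/pd1/hadoop/yarn/nm-local-dir</value>
  <description>Directories on the local machine in which to store
    application temp files.</description>
</property>
```

A quick diagnostic is to note which host the /tmp/spark-* directories appear on: if only on the driver machine, the YARN setting is likely working and the driver's own local dir is the culprit.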
Re: Job priority
Mark, Thanks, but I don't see how this documentation solves my problem. You are referring me to documentation of fair scheduling; whereas, I am asking about as unfair a scheduling policy as can be: a priority queue. Alex On Sat, Jan 10, 2015 at 5:00 PM, Mark Hamstra wrote: > -dev, +user > > http://spark.apache.org/docs/latest/job-scheduling.html > > > On Sat, Jan 10, 2015 at 4:40 PM, Alessandro Baretta > wrote: > >> Is it possible to specify a priority level for a job, such that the active >> jobs might be scheduled in order of priority? >> >> Alex >> > >
Re: Job priority
Cody,

Maybe I'm not getting this, but it doesn't look like this page is describing a priority-queue scheduling policy. What this section discusses is how resources are shared between pools. A weight-1000 pool will get 1000 times more resources allocated to it than a weight-1 pool. Great, but not what I want. I want to be able to define an Ordering on my jobs representing their priority, and have Spark allocate all resources to the job that has the highest priority.

Alex

On Sat, Jan 10, 2015 at 10:11 PM, Cody Koeninger wrote:
> http://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties
>
> "Setting a high weight such as 1000 also makes it possible to implement *priority* between pools—in essence, the weight-1000 pool will always get to launch tasks first whenever it has jobs active."
Re: Job priority
Cody,

I might be able to improve the scheduling of my jobs by using a few different pools with weights equal to, say, 1, 1e3 and 1e6, effectively getting a small handful of priority classes. Still, this is really not quite what I am describing, which is why my original post was on the dev list. Let me then ask if there is any interest in having priority-queue job scheduling in Spark. This is something I might be able to pull off.

Alex

On Sun, Jan 11, 2015 at 6:21 AM, Cody Koeninger wrote:
> If you set up a number of pools equal to the number of different priority levels you want, make the relative weights of those pools very different, and submit a job to the pool representing its priority, I think you'll get behavior equivalent to a priority queue. Try it and see.
>
> If I'm misunderstanding what you're trying to do, then I don't know.
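For readers who want to try the pool-based approximation Cody describes: a sketch of a fairscheduler.xml with three pools of widely spread weights (the pool names and weights here are illustrative, not from the thread). Spark picks this file up when spark.scheduler.mode is FAIR and spark.scheduler.allocation.file points at it; a job is routed to a pool by calling sc.setLocalProperty("spark.scheduler.pool", "high") on the submitting thread before kicking it off:

```xml
<!-- fairscheduler.xml: three pools approximating priority classes.
     Weights are illustrative; per the docs, a much heavier pool will in
     practice launch its tasks first whenever it has active jobs. -->
<allocations>
  <pool name="low">
    <schedulingMode>FIFO</schedulingMode>
    <weight>1</weight>
  </pool>
  <pool name="medium">
    <schedulingMode>FIFO</schedulingMode>
    <weight>1000</weight>
  </pool>
  <pool name="high">
    <schedulingMode>FIFO</schedulingMode>
    <weight>1000000</weight>
  </pool>
</allocations>
```

Note the caveat from the thread stands: this gives a fixed set of priority classes, not a true priority queue ordered over individual jobs, and a running lower-priority task is not preempted.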