NPE in Parquet

2015-01-20 Thread Alessandro Baretta
All,

I strongly suspect this is caused by a glitch in the communication with
Google Cloud Storage, which my job is writing to, as the NPE shows up
fairly randomly. Any ideas?

Exception in thread "Thread-126" java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108)
at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:447)
at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:485)
at org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:65)
at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:190)
at Truven$Stats$anonfun$save_to_parquet$3$anonfun$21$anon$7.run(Truven.scala:957)
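
For now I'm considering simply retrying the read when this happens. Below is
a minimal sketch of the idea: a hypothetical helper, not the code in
Truven.scala, and it assumes a Spark 1.2-era API where parquetFile returns a
SchemaRDD and that the failure really is transient.

import org.apache.spark.sql.{SQLContext, SchemaRDD}

// Hypothetical helper, not the code in Truven.scala: retry the Parquet read a
// few times, assuming the NPE is a transient GCS listing glitch rather than
// genuinely corrupt metadata.
def parquetFileWithRetry(sqlContext: SQLContext, path: String,
                         maxAttempts: Int = 3): SchemaRDD = {
  def attempt(n: Int): SchemaRDD =
    try sqlContext.parquetFile(path)
    catch {
      case e: NullPointerException if n < maxAttempts =>
        Thread.sleep(1000L * n) // brief linear back-off before retrying
        attempt(n + 1)
    }
  attempt(1)
}

// val data = parquetFileWithRetry(sqlContext, "gs://my-bucket/out.parquet") // hypothetical path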


Alex


Scaladoc

2014-10-30 Thread Alessandro Baretta
How do I build the scaladoc html files from the spark source distribution?

Alex Baretta


Still struggling with building documentation

2014-11-07 Thread Alessandro Baretta
I finally came to realize that there is a special Maven target to build the
scaladocs, although arguably a very unintuitive one: mvn verify. So now I
have scaladocs for each package, but not for the whole Spark project.
Specifically, build/docs/api/scala/index.html is missing. Indeed the whole
build/docs/api directory referenced in api.html is missing. How do I build
it?

Alex Baretta


Re: Still struggling with building documentation

2014-11-11 Thread Alessandro Baretta
Nicholas and Patrick,

Thanks for your help, but, no, it still does not work. The latest master
produces the following scaladoc errors:

[error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:55: not found: type Type
[error]   protected Type type() { return Type.UPLOAD_BLOCK; }
[error] ^
[error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:39: not found: type Type
[error]   protected Type type() { return Type.STREAM_HANDLE; }
[error] ^
[error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40: not found: type Type
[error]   protected Type type() { return Type.OPEN_BLOCKS; }
[error] ^
[error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44: not found: type Type
[error]   protected Type type() { return Type.REGISTER_EXECUTOR; }
[error] ^

...

[error] four errors found
[error] (spark/javaunidoc:doc) javadoc returned nonzero exit code
[error] (spark/scalaunidoc:doc) Scaladoc generation failed
[error] Total time: 140 s, completed Nov 11, 2014 10:20:53 AM
Moving back into docs dir.
Making directory api/scala
cp -r ../target/scala-2.10/unidoc/. api/scala
Making directory api/java
cp -r ../target/javaunidoc/. api/java
Moving to python/docs directory and building sphinx.
Makefile:14: *** The 'sphinx-build' command was not found. Make sure you
have Sphinx installed, then set the SPHINXBUILD environment variable to
point to the full path of the 'sphinx-build' executable. Alternatively you
can add the directory with the executable to your PATH. If you don't have
Sphinx installed, grab it from http://sphinx-doc.org/.  Stop.

Moving back into home dir.
Making directory api/python
cp -r python/docs/_build/html/. docs/api/python
/usr/lib/ruby/1.9.1/fileutils.rb:1515:in `stat': No such file or directory
- python/docs/_build/html/. (Errno::ENOENT)
from /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `block in fu_each_src_dest'
from /usr/lib/ruby/1.9.1/fileutils.rb:1529:in `fu_each_src_dest0'
from /usr/lib/ruby/1.9.1/fileutils.rb:1513:in `fu_each_src_dest'
from /usr/lib/ruby/1.9.1/fileutils.rb:436:in `cp_r'
from /home/alex/git/spark/docs/_plugins/copy_api_dirs.rb:79:in `'
from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:76:in `block in setup'
from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `each'
from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:75:in `setup'
from /usr/lib/ruby/vendor_ruby/jekyll/site.rb:30:in `initialize'
from /usr/bin/jekyll:224:in `new'
from /usr/bin/jekyll:224:in `'

What next?

Alex




On Fri, Nov 7, 2014 at 12:54 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> I believe the web docs need to be built separately according to the
> instructions here
> <https://github.com/apache/spark/blob/master/docs/README.md>.
>
> Did you give those a shot?
>
> It's annoying to have a separate thing with new dependencies in order to
> build the web docs, but that's how it is at the moment.
>
> Nick
>
> On Fri, Nov 7, 2014 at 3:39 PM, Alessandro Baretta 
> wrote:
>
>> I finally came to realize that there is a special Maven target to build
>> the scaladocs, although arguably a very unintuitive one: mvn verify. So now
>> I have scaladocs for each package, but not for the whole Spark project.
>> Specifically, build/docs/api/scala/index.html is missing. Indeed the whole
>> build/docs/api directory referenced in api.html is missing. How do I build
>> it?
>>
>> Alex Baretta
>>
>
>


Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
All,

I'm using the Spark shell to interact with a small test deployment of
Spark, built from the current master branch. I'm processing a dataset
comprising a few thousand objects on Google Cloud Storage, split into a
half dozen directories. My code constructs an object--let me call it the
Dataset object--that defines a distinct RDD for each directory. The
constructor of the object only defines the RDDs; it does not actually
evaluate them, so I would expect it to return very quickly. Indeed, the
logging code in the constructor prints a line signaling the completion of
the code almost immediately after invocation, but the Spark shell does not
show the prompt right away. Instead, it spends a few minutes seemingly
frozen, eventually producing the following output:

14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to process : 9
14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to process : 759
14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to process : 228
14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to process : 3076
14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to process : 1013
14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to process : 156

This stage is inexplicably slow. What could be happening?

Thanks.


Alex


Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Denny,

No, gsutil scans through the listing of the bucket quickly. See the
following.

alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"

6860

real    0m6.971s
user    0m1.052s
sys     0m0.096s

Alex

On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee  wrote:
>
> I'm curious if you're seeing the same thing when using bdutil against
> GCS?  I'm wondering if this may be an issue concerning the transfer rate of
> Spark -> Hadoop -> GCS Connector -> GCS.
>
>
> On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <
> alexbare...@gmail.com> wrote:
>
>> All,
>>
>> I'm using the Spark shell to interact with a small test deployment of
>> Spark, built from the current master branch. I'm processing a dataset
>> comprising a few thousand objects on Google Cloud Storage, split into a
>> half dozen directories. My code constructs an object--let me call it the
>> Dataset object--that defines a distinct RDD for each directory. The
>> constructor of the object only defines the RDDs; it does not actually
>> evaluate them, so I would expect it to return very quickly. Indeed, the
>> logging code in the constructor prints a line signaling the completion of
>> the code almost immediately after invocation, but the Spark shell does not
>> show the prompt right away. Instead, it spends a few minutes seemingly
>> frozen, eventually producing the following output:
>>
>> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
>> process : 9
>>
>> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
>> process : 759
>>
>> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
>> process : 228
>>
>> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
>> process : 3076
>>
>> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
>> process : 1013
>>
>> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
>> process : 156
>>
>> This stage is inexplicably slow. What could be happening?
>>
>> Thanks.
>>
>>
>> Alex
>>
>


Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Well, what do you suggest I run to test this? But more importantly, what
information would this give me?
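
For concreteness, here is the kind of thing I could run from the Spark shell
(where sc is already defined). It is a sketch I have not yet tried on this
cluster: it times the same glob listing through the Hadoop FileSystem layer
that the GCS connector implements, for comparison with the ~7s gsutil run
above.

import org.apache.hadoop.fs.Path

// Time a glob listing of the same bucket path through the Hadoop FileSystem
// API, i.e. through the GCS connector rather than gsutil.
val path = new Path("gs://my-bucket/20141205/csv/*/*/*")
val fs = path.getFileSystem(sc.hadoopConfiguration)
val t0 = System.nanoTime
val listed = fs.globStatus(path)
println(s"listed ${listed.length} objects in ${(System.nanoTime - t0) / 1e9} s")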

On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee  wrote:
>
> Oh, it makes sense if gsutil scans through this quickly, but I was
> wondering if running a Hadoop job / bdutil would result in just as fast
> scans?
>
>
> On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta <
> alexbare...@gmail.com> wrote:
>
>> Denny,
>>
>> No, gsutil scans through the listing of the bucket quickly. See the
>> following.
>>
>> alex@hadoop-m:~/split$ time bash -c "gsutil ls
>> gs://my-bucket/20141205/csv/*/*/* | wc -l"
>>
>> 6860
>>
>> real0m6.971s
>> user0m1.052s
>> sys 0m0.096s
>>
>> Alex
>>
>>
>> On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee 
>> wrote:
>>>
>>> I'm curious if you're seeing the same thing when using bdutil against
>>> GCS?  I'm wondering if this may be an issue concerning the transfer rate of
>>> Spark -> Hadoop -> GCS Connector -> GCS.
>>>
>>>
>>> On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <
>>> alexbare...@gmail.com> wrote:
>>>
>>>> All,
>>>>
>>>> I'm using the Spark shell to interact with a small test deployment of
>>>> Spark, built from the current master branch. I'm processing a dataset
>>>> comprising a few thousand objects on Google Cloud Storage, split into a
>>>> half dozen directories. My code constructs an object--let me call it the
>>>> Dataset object--that defines a distinct RDD for each directory. The
>>>> constructor of the object only defines the RDDs; it does not actually
>>>> evaluate them, so I would expect it to return very quickly. Indeed, the
>>>> logging code in the constructor prints a line signaling the completion of
>>>> the code almost immediately after invocation, but the Spark shell does not
>>>> show the prompt right away. Instead, it spends a few minutes seemingly
>>>> frozen, eventually producing the following output:
>>>>
>>>> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
>>>> process : 9
>>>>
>>>> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
>>>> process : 759
>>>>
>>>> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
>>>> process : 228
>>>>
>>>> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
>>>> process : 3076
>>>>
>>>> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
>>>> process : 1013
>>>>
>>>> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
>>>> process : 156
>>>>
>>>> This stage is inexplicably slow. What could be happening?
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> Alex
>>>>
>>>


Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Here's another data point: the slow part of my code is the construction of
an RDD as the union of the textFile RDDs representing data from several
distinct Google Cloud Storage directories. So the question becomes: what
computation happens when calling the union method on two RDDs?
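
In the meantime, one thing I may try is to hand all the directories to a
single textFile call, since textFile accepts comma-separated paths and globs,
instead of unioning one RDD per directory. A sketch, with a made-up directory
layout:

// Hypothetical directory layout, just to illustrate the two shapes of the code.
val dirs = Seq(
  "gs://my-bucket/20141205/csv/a",
  "gs://my-bucket/20141205/csv/b",
  "gs://my-bucket/20141205/csv/c")

// Current shape: one textFile RDD per directory, then union them all.
val unioned = dirs.map(dir => sc.textFile(dir)).reduce(_ union _)

// Alternative: a single textFile over a comma-separated list of paths (globs
// also work), so the whole input is described by one RDD.
val combined = sc.textFile(dirs.mkString(","))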

On Wed, Dec 17, 2014 at 11:24 PM, Alessandro Baretta 
wrote:
>
> Well, what do you suggest I run to test this? But more importantly, what
> information would this give me?
>
> On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee  wrote:
>>
>> Oh, it makes sense if gsutil scans through this quickly, but I was
>> wondering if running a Hadoop job / bdutil would result in just as fast
>> scans?
>>
>>
>> On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta <
>> alexbare...@gmail.com> wrote:
>>
>>> Denny,
>>>
>>> No, gsutil scans through the listing of the bucket quickly. See the
>>> following.
>>>
>>> alex@hadoop-m:~/split$ time bash -c "gsutil ls
>>> gs://my-bucket/20141205/csv/*/*/* | wc -l"
>>>
>>> 6860
>>>
>>> real0m6.971s
>>> user0m1.052s
>>> sys 0m0.096s
>>>
>>> Alex
>>>
>>>
>>> On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee 
>>> wrote:
>>>>
>>>> I'm curious if you're seeing the same thing when using bdutil against
>>>> GCS?  I'm wondering if this may be an issue concerning the transfer rate of
>>>> Spark -> Hadoop -> GCS Connector -> GCS.
>>>>
>>>>
>>>> On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta <
>>>> alexbare...@gmail.com> wrote:
>>>>
>>>>> All,
>>>>>
>>>>> I'm using the Spark shell to interact with a small test deployment of
>>>>> Spark, built from the current master branch. I'm processing a dataset
>>>>> comprising a few thousand objects on Google Cloud Storage, split into a
>>>>> half dozen directories. My code constructs an object--let me call it the
>>>>> Dataset object--that defines a distinct RDD for each directory. The
>>>>> constructor of the object only defines the RDDs; it does not actually
>>>>> evaluate them, so I would expect it to return very quickly. Indeed, the
>>>>> logging code in the constructor prints a line signaling the completion of
>>>>> the code almost immediately after invocation, but the Spark shell does not
>>>>> show the prompt right away. Instead, it spends a few minutes seemingly
>>>>> frozen, eventually producing the following output:
>>>>>
>>>>> 14/12/18 05:52:49 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 9
>>>>>
>>>>> 14/12/18 05:54:15 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 759
>>>>>
>>>>> 14/12/18 05:54:40 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 228
>>>>>
>>>>> 14/12/18 06:00:11 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 3076
>>>>>
>>>>> 14/12/18 06:02:02 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 1013
>>>>>
>>>>> 14/12/18 06:02:21 INFO mapred.FileInputFormat: Total input paths to
>>>>> process : 156
>>>>>
>>>>> This stage is inexplicably slow. What could be happening?
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>> Alex
>>>>>
>>>>


/tmp directory fills up

2015-01-09 Thread Alessandro Baretta
Gents,

I'm building Spark from the current master branch and deploying it to
Google Compute Engine on top of Hadoop 2.4/YARN via bdutil, Google's Hadoop
cluster provisioning tool. bdutil configures Spark with

spark.local.dir=/hadoop/spark/tmp

but this option is ignored in combination with YARN. bdutil also configures
YARN with:

  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/mnt/pd1/hadoop/yarn/nm-local-dir</value>
    <description>
      Directories on the local machine in which to store application temp files.
    </description>
  </property>

This is the right directory for Spark to store temporary data in. Still,
Spark is creating directories such as this:

/tmp/spark-51388ee6-9de6-411d-b9b9-ab6f9502d01e

and filling them with gigabytes' worth of output files, which fills up the
very small root filesystem.

How can I diagnose why my Spark installation is not picking up
yarn.nodemanager.local-dirs from YARN?
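
My working hypothesis, which I have not verified, is that the /tmp/spark-*
directories are created on the driver, where spark.local.dir falls back to
its /tmp default, rather than by the YARN executors, which should pick up
yarn.nodemanager.local-dirs. If that is right, a minimal sketch of the
workaround is to point the driver's scratch space elsewhere:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch, assuming the hypothesis above is right: point the driver's scratch
// space at the big disk instead of the /tmp default. The same setting can go
// in spark-defaults.conf or in the SPARK_LOCAL_DIRS environment variable.
val conf = new SparkConf()
  .setAppName("local-dir-test")                 // hypothetical application name
  .set("spark.local.dir", "/hadoop/spark/tmp")

val sc = new SparkContext(conf)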

Alex


Re: Job priority

2015-01-10 Thread Alessandro Baretta
Mark,

Thanks, but I don't see how this documentation solves my problem. You are
referring me to documentation of fair scheduling; whereas, I am asking
about as unfair a scheduling policy as can be: a priority queue.

Alex

On Sat, Jan 10, 2015 at 5:00 PM, Mark Hamstra 
wrote:

> -dev, +user
>
> http://spark.apache.org/docs/latest/job-scheduling.html
>
>
> On Sat, Jan 10, 2015 at 4:40 PM, Alessandro Baretta  > wrote:
>
>> Is it possible to specify a priority level for a job, such that the active
>> jobs might be scheduled in order of priority?
>>
>> Alex
>>
>
>


Re: Job priority

2015-01-10 Thread Alessandro Baretta
Cody,

Maybe I'm not getting this, but it doesn't look like this page is
describing a priority queue scheduling policy. What this section discusses
is how resources are shared between queues. A weight-1000 pool will get
1000 times more resources allocated to it than a weight-1 pool. Great,
but not what I want. I want to be able to define an Ordering on my
tasks representing their priority, and have Spark allocate all resources to
the job that has the highest priority.

Alex

On Sat, Jan 10, 2015 at 10:11 PM, Cody Koeninger  wrote:

>
> http://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties
>
> "Setting a high weight such as 1000 also makes it possible to implement
> *priority* between pools—in essence, the weight-1000 pool will always get
> to launch tasks first whenever it has jobs active."
>
> On Sat, Jan 10, 2015 at 11:57 PM, Alessandro Baretta <
> alexbare...@gmail.com> wrote:
>
>> Mark,
>>
>> Thanks, but I don't see how this documentation solves my problem. You are
>> referring me to documentation of fair scheduling; whereas, I am asking
>> about as unfair a scheduling policy as can be: a priority queue.
>>
>> Alex
>>
>> On Sat, Jan 10, 2015 at 5:00 PM, Mark Hamstra 
>> wrote:
>>
>>> -dev, +user
>>>
>>> http://spark.apache.org/docs/latest/job-scheduling.html
>>>
>>>
>>> On Sat, Jan 10, 2015 at 4:40 PM, Alessandro Baretta <
>>> alexbare...@gmail.com> wrote:
>>>
>>>> Is it possible to specify a priority level for a job, such that the
>>>> active
>>>> jobs might be scheduled in order of priority?
>>>>
>>>> Alex
>>>>
>>>
>>>
>>
>


Re: Job priority

2015-01-11 Thread Alessandro Baretta
Cody,

I might be able to improve the scheduling of my jobs by using a few
different pools with weights equal to, say, 1, 1e3, and 1e6, effectively
getting a small handful of priority classes. Still, this is not quite what I
am describing, which is why my original post was on the dev list. Let me then
ask: is there any interest in having priority-queue job scheduling in Spark?
This is something I might be able to pull off.
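
For the record, here is roughly what the pool-based approximation would look
like in my code. It is only a sketch: the pool names and weights are my own,
and it assumes spark.scheduler.mode=FAIR with a fairscheduler.xml referenced
via spark.scheduler.allocation.file.

// Sketch of the weighted-pool approximation discussed above. Assumes a
// fairscheduler.xml along these lines (pool names and weights are mine):
//
//   <allocations>
//     <pool name="low"><weight>1</weight></pool>
//     <pool name="medium"><weight>1000</weight></pool>
//     <pool name="high"><weight>1000000</weight></pool>
//   </allocations>
//
// Jobs only compete when submitted concurrently from different threads; each
// thread picks its pool via a local property before running its job.

sc.setLocalProperty("spark.scheduler.pool", "high")
sc.parallelize(1 to 1000000).count() // stand-in for a high-priority job

sc.setLocalProperty("spark.scheduler.pool", "low")
sc.parallelize(1 to 1000000).count() // stand-in for a low-priority job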

Alex

On Sun, Jan 11, 2015 at 6:21 AM, Cody Koeninger  wrote:

> If you set up a number of pools equal to the number of different priority
> levels you want, make the relative weights of those pools very different,
> and submit a job to the pool representing its priority, I think you'll get
> behavior equivalent to a priority queue. Try it and see.
>
> If I'm misunderstanding what you're trying to do, then I don't know.
>
>
> On Sunday, January 11, 2015, Alessandro Baretta 
> wrote:
>
>> Cody,
>>
>> Maybe I'm not getting this, but it doesn't look like this page is
>> describing a priority queue scheduling policy. What this section discusses
>> is how resources are shared between queues. A weight-1000 pool will get
>> 1000 times more resources allocated to it than a priority 1 queue. Great,
>> but not what I want. I want to be able to define an Ordering on make my
>> tasks representing their priority, and have Spark allocate all resources to
>> the job that has the highest priority.
>>
>> Alex
>>
>> On Sat, Jan 10, 2015 at 10:11 PM, Cody Koeninger 
>> wrote:
>>
>>>
>>> http://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties
>>>
>>> "Setting a high weight such as 1000 also makes it possible to implement
>>> *priority* between pools—in essence, the weight-1000 pool will always
>>> get to launch tasks first whenever it has jobs active."
>>>
>>> On Sat, Jan 10, 2015 at 11:57 PM, Alessandro Baretta <
>>> alexbare...@gmail.com> wrote:
>>>
>>>> Mark,
>>>>
>>>> Thanks, but I don't see how this documentation solves my problem. You
>>>> are referring me to documentation of fair scheduling; whereas, I am asking
>>>> about as unfair a scheduling policy as can be: a priority queue.
>>>>
>>>> Alex
>>>>
>>>> On Sat, Jan 10, 2015 at 5:00 PM, Mark Hamstra 
>>>> wrote:
>>>>
>>>>> -dev, +user
>>>>>
>>>>> http://spark.apache.org/docs/latest/job-scheduling.html
>>>>>
>>>>>
>>>>> On Sat, Jan 10, 2015 at 4:40 PM, Alessandro Baretta <
>>>>> alexbare...@gmail.com> wrote:
>>>>>
>>>>>> Is it possible to specify a priority level for a job, such that the
>>>>>> active
>>>>>> jobs might be scheduled in order of priority?
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>