Re: Spark SQL: Caching nested structures extremely slow

2014-08-25 Thread Michael Armbrust
One useful thing to do when you run into unexpected slowness is to run 'jstack' a few times on the driver and executors and see if there is any particular hotspot in the Spark SQL code. Also, it seems like a better option here might be to use the new applySchema API
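
A minimal sketch of the jstack sampling suggested above, in Python. The helper names (`sample_stacks`, `hot_frames`), the sample count, the interval, and the `org.apache.spark.sql` prefix filter are illustrative assumptions and do not come from the thread. The only thing assumed to exist is the JDK's `jstack <pid>` command on the PATH of the host running the driver or executor.

```python
import subprocess
import time
from collections import Counter


def sample_stacks(pid, count=5, interval=2.0, jstack_bin="jstack"):
    """Run `jstack <pid>` `count` times and return the raw dumps as strings."""
    dumps = []
    for i in range(count):
        result = subprocess.run(
            [jstack_bin, str(pid)], capture_output=True, text=True, check=True
        )
        dumps.append(result.stdout)
        if i < count - 1:
            time.sleep(interval)  # pause so samples hit different moments
    return dumps


def hot_frames(dumps, prefix="org.apache.spark.sql", top=10):
    """Count how often each stack frame under `prefix` appears across dumps."""
    counts = Counter()
    for dump in dumps:
        for line in dump.splitlines():
            line = line.strip()
            # jstack prints frames as "at package.Class.method(File:line)"
            if line.startswith("at " + prefix):
                counts[line[3:]] += 1
    return counts.most_common(top)
```

Calling `hot_frames(sample_stacks(pid))` with the JVM's process id returns the frames seen most often; a frame that shows up in nearly every sample is a candidate hotspot.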

Re: Spark SQL: Caching nested structures extremely slow

2014-08-21 Thread Yin Huai
I have not profiled this part, but I think one possible cause is the allocation of an array for every inner struct in every row (every struct value is represented by a Spark SQL row). I will play with it later and see what I find.

On Tue, Aug 19, 2014 at 9:01 PM, Evan Chan wrote:
> Hey guys,
>
> I'm
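
To put rough numbers on the per-struct allocation hypothesis in this reply, here is a back-of-the-envelope sketch in Python. It models only the idea as stated (one backing array per struct value, plus one for the outer row). The function name and the struct counts per row are made up for illustration, and the 4 million rows is the subset size from the original post. It is not a measurement of Spark SQL.

```python
def arrays_allocated(num_rows, inner_structs_per_row):
    """Arrays allocated if every row and every inner struct gets its own array."""
    outer = num_rows                          # one array backing each top-level row
    inner = num_rows * inner_structs_per_row  # one array per nested struct value
    return outer + inner


rows = 4_000_000  # subset size mentioned in the original post
for structs in (0, 1, 5, 10):  # hypothetical struct counts, not from the thread
    total = arrays_allocated(rows, structs)
    print(f"{structs:>2} inner structs/row -> {total:,} arrays")
```

Under this model the allocation count grows linearly with the number of inner structs per row, which is why nested data could be noticeably slower to cache than a flat schema with the same row count.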

Spark SQL: Caching nested structures extremely slow

2014-08-19 Thread Evan Chan
Hey guys, I'm using Spark 1.0.2 in AWS with 8 x c3.xlarge machines. I am working with a subset of the GDELT dataset (57 columns and more than 250 million rows in full, but my subset is only 4 million rows) and trying to query it with Spark SQL. Since a CSV importer isn't available, my first thought was to use nested c