There have been optimizations in this area, e.g. https://issues.apache.org/jira/browse/SPARK-8125
You can also look at the parent issue. Which Spark release are you using?

> On Jan 22, 2016, at 1:08 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hi,
>
> I have a Spark table (created from a hiveContext) with a couple of hundred
> partitions and a few thousand files.
>
> When I run a query on the table, Spark spends a lot of time (as seen in the
> pyspark output) collecting these files from the several partitions. Only after
> this does the query start running.
>
> Is there a way to store the object that has collected all these partitions
> and files, so that every time I restart the job I can load that object instead
> of spending 50 minutes just collecting the files before the query starts?
>
> Please do let me know in case the question is not quite clear.
>
> Regards,
> Gourav Sengupta

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
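In the meantime, a few metadata-related settings may reduce the listing cost. This is a hedged sketch, not a definitive fix: these property names exist in the Spark 1.5/1.6 line, but their defaults and effect depend on the release, so verify them against the configuration docs for your version.

```
# spark-defaults.conf -- sketch only; check these keys against your release's docs

# Ask the Hive metastore to prune partitions instead of listing all of them
# (applies to Hive-metastore-backed tables; available from Spark 1.5)
spark.sql.hive.metastorePartitionPruning                true

# Cache discovered Parquet schema metadata within a session (default true)
spark.sql.parquet.cacheMetadata                         true

# List partition directories in parallel once a table has more than this
# many paths (this is the default threshold)
spark.sql.sources.parallelPartitionDiscovery.threshold  32
```

Note that the metadata cache only lives for the duration of the session, so it will not by itself avoid the file listing on a fresh job start; the metastore pruning and parallel discovery settings are what shorten that initial phase.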