Hi,

We use sequence files as input as well. Spark creates a task for each part* file by default. We use RDD.coalesce (set to the number of cores, or 2x the number of cores). This helps when there are many more part* files than cores and each part* file is relatively small. Coalesce doesn't actually move files around or force a repartition.
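Something along these lines, as a minimal sketch (the master URL, input path, Writable types, and core count below are placeholders for your own setup):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.hadoop.io.{BytesWritable, Text}

    val sc = new SparkContext("spark://master:7077", "coalesce-example")

    // Spark gives you one partition per part* file to start with.
    val features = sc.sequenceFile[Text, BytesWritable]("hdfs:///data/features")

    // Collapse down to roughly 2x the cluster's core count (assumed 8 here).
    // shuffle defaults to false, so no data moves across the network.
    val coalesced = features.coalesce(2 * 8)

    println("partitions: " + coalesced.partitions.length)

If you ever do want a full repartition (e.g. to rebalance skewed partitions), coalesce(n, shuffle = true) will do that instead, at the cost of a shuffle.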
This shouldn't affect your overall job initialization time, but it might improve your general job throughput.

Roshan

On Tue, Feb 25, 2014 at 12:13 PM, polkosity <polkos...@gmail.com> wrote:
> As mentioned in a previous post, I have an application which relies on a
> quick response. The application matches a client's image against a set of
> stored images. Image features are stored in a SequenceFile and passed over
> JNI to match in OpenCV, along with the features for the client's image. An
> id for the matched image is returned.
>
> I was using Hadoop 1.2.1 and achieved some pretty good results, but the job
> initialization was taking about 15 seconds, and we'd hoped to have a
> response in ~5 seconds. So we moved to Hadoop 2.2, YARN & Spark. Sadly,
> job initialization is still taking over 10 seconds (on a cluster of 10 EC2
> m1.large instances).
>
> Any suggestions on what I can do to bring this initialization time down?
>
> Once the executors begin work, the performance is quite good, but any
> general performance optimization tips are also welcome!
>
> Thanks,
> Dan
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-performance-optimization-tp2017.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.