Hi,

We use sequence files as input as well. Spark creates a task for each part*
file by default. We use RDD.coalesce (set to the number of cores, or 2x the
number of cores). This helps when there are many more part* files than
cores and each part* file is relatively small. Coalesce doesn't actually
move data around or force a repartition.
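
Something like this, as a minimal sketch (the sequence file path, key/value
types, cluster URL, and core count below are made up for illustration):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // implicit converters for sequenceFile

    val sc = new SparkContext("spark://master:7077", "ImageMatch")

    // Spark gives one partition per part* file; merge down to ~total cores.
    val totalCores = 16  // tune to your cluster
    val features = sc
      .sequenceFile[String, Array[Byte]]("hdfs:///data/image-features")
      .coalesce(totalCores)  // shuffle = false by default, so no data movement

Passing shuffle = true to coalesce would force a full repartition, which is
exactly what we're trying to avoid here.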

This shouldn't affect your overall job initialization times, but might
improve your general job throughput.

Roshan


On Tue, Feb 25, 2014 at 12:13 PM, polkosity <polkos...@gmail.com> wrote:

> As mentioned in a previous post, I have an application which relies on a
> quick response.  The application matches a client's image against a set of
> stored images.  Image features are stored in a SequenceFile and passed over
> JNI to match in OpenCV, along with the features for the client's image.  An
> id for the matched image is returned.
>
> I was using Hadoop 1.2.1 and achieved some pretty good results, but the job
> initialization was taking about 15 seconds, and we'd hoped to have a
> response in ~5 seconds.  So we moved to Hadoop 2.2, YARN & Spark.  Sadly,
> job initialization is still taking over 10 seconds (on a cluster of 10 EC2
> m1.large instances).
>
> Any suggestions on what I can do to bring this initialization time down?
>
> Once the executors begin work, the performance is quite good, but any
> general performance optimization tips are also welcome!
>
> Thanks.
> - Dan
>
