That really depends on how you extract the data. It would help if you could
paste your code so we can understand the problem better.
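
In the meantime, here is one workaround in case the stragglers come from
individual pathological records rather than skewed partition sizes: enforce
a per-record timeout inside the map function itself, so the task finishes
normally and you keep the rest of the data. Note that spark.speculation only
re-launches slow tasks on other executors; Spark 1.2.1 has no built-in way to
kill a task after a timeout while keeping the stage's other results, so the
timeout has to live in your own code. A rough sketch (assuming Unix executors,
since it relies on SIGALRM, with a placeholder extract() standing in for your
real logic):

import signal

TIMEOUT_SECONDS = 60  # illustrative; tune to what "too long" means for you

class RecordTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise RecordTimeout()

def extract(record):
    # Placeholder for your real per-record extraction logic.
    return record

def extract_with_timeout(record):
    """Run extract() on one record, giving up after TIMEOUT_SECONDS instead
    of letting one pathological record stall the whole task."""
    signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(TIMEOUT_SECONDS)
    try:
        return extract(record)
    except RecordTimeout:
        return None                  # accept that no data comes back
    finally:
        signal.alarm(0)              # always cancel the pending alarm

# Keep only the records that finished in time:
# results = raw_rdd.map(extract_with_timeout).filter(lambda x: x is not None)

One caveat: signal.alarm only works in the main thread of a Unix process,
which is where the PySpark worker runs your function, but do verify that on
your cluster before relying on it.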

Thanks,
Best regards

On Wed, Jun 24, 2015 at 2:32 PM, William Ferrell <wferr...@gmail.com> wrote:

> Hello -
>
> I am using Apache Spark 1.2.1 via pyspark. Thanks to any developers here
> for the great product!
>
> In my use case, I am running spark jobs to extract data from some raw
> data. Generally this works quite well.
>
> However, I am noticing that for certain data sets there are tasks that run
> extremely long -- i.e. 8-12x longer than the normal task. I don't actually
> need the data from these extremely long-running tasks -- so I am writing
> today to ask: is there a way to kill tasks that take significantly longer
> and just accept that no data will be found from them?
>
> I have read:
> http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-always-wait-for-stragglers-to-finish-running-td14298.html
>  and
> know about spark.speculation -- however, I think my use case is different
> in that I don't want the tasks restarted -- I just want to accept that a
> task is taking too long, kill it, and move on.
>
> So in effect, I'd like to time out the task, but still collect the data
> from the remaining tasks.
>
> Does anyone have any advice on how I can time out / kill these stragglers
> -- and keep the remaining data?
>
> Thanks!
>
>
> - Bill
>
