Re: Non-deterministic behavior in spark

Ognen Duzlevski Fri, 24 Jan 2014 05:17:56 -0800

No. The filter splits each line of a JSON file and extracts a
position from it, so it does the same thing on every run.
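
To illustrate the seeding point raised in the quoted message below, here is
a minimal sketch in plain Python (not Spark code, and not the filter from
this job). The names sample_filter, keep_fraction and seed are hypothetical.
It only shows that a filter which draws random numbers gives repeatable
output once its generator is seeded; the filter in question uses no
randomness at all.

```python
import random

def sample_filter(lines, keep_fraction, seed=None):
    """Keep each line with probability keep_fraction.

    Hypothetical helper for illustration only: with seed=None the result
    can differ from run to run, with a fixed seed it cannot.
    """
    rng = random.Random(seed)  # private generator, independent of global state
    return [line for line in lines if rng.random() < keep_fraction]

lines = ["record-%d" % i for i in range(1000)]

# Two runs with the same seed select exactly the same lines.
first = sample_filter(lines, 0.5, seed=42)
second = sample_filter(lines, 0.5, seed=42)
print(first == second)
```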


That's what bothers me about this.
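
One way to narrow this down might be to check first that the two copies of
the input are identical, and then to compare record counts after each stage
rather than only at the end. The sketch below is plain Python standing in
for the textfile / filter / groupByKey / count steps, not Spark code. It
assumes one JSON object per line and a key field; the names fingerprint,
stage_counts and key_field, and the sample records, are all hypothetical.

```python
import hashlib
import json

def fingerprint(lines):
    """Return (line count, order-independent digest) for one copy of the input."""
    digest = hashlib.sha256()
    for line in sorted(lines):  # sort so that read order does not matter
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return len(lines), digest.hexdigest()

def stage_counts(lines, key_field):
    """Count records after each stage: read, filter, group by key."""
    kept = []
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # assumption: the filter drops lines that do not parse
        if isinstance(record, dict) and key_field in record:
            kept.append(record)
    # Counting after a group-by-key yields the number of distinct keys.
    keys = {json.dumps(record[key_field], sort_keys=True) for record in kept}
    return {"read": len(lines), "filtered": len(kept), "distinct_keys": len(keys)}

# Made-up data: copy_b simulates a copy of the input that is missing a line.
copy_a = [
    '{"user": "a", "pos": 1}',
    '{"user": "b", "pos": 2}',
    '{"user": "a", "pos": 3}',
]
copy_b = copy_a[:2]

print(fingerprint(copy_a) == fingerprint(copy_b))
print(stage_counts(copy_a, "user"))
print(stage_counts(copy_b, "user"))
```

If the fingerprints already differ, the inputs are not the same and the
transformations are not the place to look. If they match, the first stage
whose counts disagree is where the two runs diverge.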

Ognen


On Fri, Jan 24, 2014 at 12:40 PM, 尹绪森 <[email protected]> wrote:

>  Is there any non-deterministic code in the filter, such as
> Random.nextInt()? If so, the program is no longer idempotent. You
> should specify a seed for it.
>
>
> 2014/1/24 Ognen Duzlevski <[email protected]>
>
>> Hello,
>>
>> (Sorry for the sensationalist title) :)
>>
>> If I run Spark on files from S3 and do a basic transformation like:
>>
>> textFile()
>> filter
>> groupByKey
>> count
>>
>> I get one number (e.g. 40,000).
>>
>> If I do the same on the same files from HDFS, the number spat out is
>> completely different (VERY different - something like 13,000).
>>
>> What would one do in a situation like this? How do I even go about
>> figuring out what the problem is? This is run on a cluster of 15 instances
>> on Amazon.
>>
>> Thanks,
>> Ognen
>>
>
>
>
> --
> Best Regards
> -----------------------------------
> Xusen Yin    尹绪森
> Beijing Key Laboratory of Intelligent Telecommunications Software and
> Multimedia
> Beijing University of Posts & Telecommunications
> Intel Labs China
> Homepage: http://yinxusen.github.io/
>



-- 
"Le secret des grandes fortunes sans cause apparente est un crime oublié,
parce qu'il a été proprement fait" - Honore de Balzac
