I don't know how to debug a distributed application. Any tools or suggestions?

From the Spark web UI, the GC time (~0.1 s) and Shuffle Write (11 GB) are
similar for Spark 1.1 and 1.2, and there is no Shuffle Read or Spill.
The only difference is the task Duration:
Duration     Min    25th percentile    Median    75th percentile    Max
Spark 1.2    4 s    37 s               45 s      53 s               1.9 min
Spark 1.1    2 s    17 s               18 s      18 s               34 s
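
One quick way to narrow this down, following Sean's theory in the thread below
(a sketch; inputPath is the same path the job reads, and the snippet should be
run under both 1.1 and 1.2 so the numbers can be compared):

    // A large difference between versions would support the "more partitions
    // across more workers" theory discussed below.
    val input = sc.textFile(inputPath)
    println(s"input partitions:     ${input.partitions.length}")
    println(s"default parallelism:  ${sc.defaultParallelism}")
    println(s"registered executors: ${sc.getExecutorMemoryStatus.size}")  // count includes the driver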

2015-01-21 16:56 GMT+08:00 Sean Owen <so...@cloudera.com>:

> I mean that if you had tasks running on 10 machines now instead of 3 for
> some reason you would have more than 3 times the read load on your source
> of data all at once. Same if you made more executors per machine. But from
> your additional info it does not sound like this is the case. I think you
> need more debugging to pinpoint what is slower.
> On Jan 21, 2015 9:30 AM, "Fengyun RAO" <raofeng...@gmail.com> wrote:
>
>> thanks, Sean.
>>
>> I don't quite understand "you have *more* partitions across *more*
>> workers".
>>
>> It's the same cluster and the same data, so I think the same partitions
>> and the same workers.
>>
>> we switched from spark 1.1 to 1.2, then it's 3x slower.
>>
>> (We upgraded from CDH 5.2.1 to CDH 5.3, hence Spark 1.1 to 1.2, and found
>> the problem.
>> Then we installed a standalone Spark 1.1, stopped 1.2, ran the same
>> script, and it was 3x faster.
>> Stopped 1.1, started 1.2: 3x slower again.)
>>
>>
>> 2015-01-21 15:45 GMT+08:00 Sean Owen <so...@cloudera.com>:
>>
>>> I don't know of any reason to think the singleton pattern doesn't work
>>> or works differently. I wonder if, for example, task scheduling is
>>> different in 1.2 and you have more partitions across more workers and so
>>> are loading more copies more slowly into your singletons.
>>> On Jan 21, 2015 7:13 AM, "Fengyun RAO" <raofeng...@gmail.com> wrote:
>>>
>>>> The LogParser instance is not serializable, and thus cannot be a
>>>> broadcast.
>>>>
>>>> What's worse, it contains an LRU cache, which is essential to performance
>>>> and which we would like to share among all the tasks on the same node.
>>>>
>>>> If that is the case, what's the recommended way to share a variable among
>>>> all the tasks within the same executor?
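
One common way to get an executor-local shared instance without broadcasting
is a lazily initialized singleton object. A minimal sketch, assuming a
hypothetical LogParser class standing in for the real one (the wrapper name
LogParserHolder is illustrative, not part of the original code):

    // Stub for the user's real parser: expensive to construct, holds an LRU
    // cache, and is not serializable.
    class LogParser {
      def parseLine(line: String): Iterator[(String, String)] = Iterator.empty
    }

    // Executor-local singleton: the lazy val is initialized once per executor
    // JVM, and every task running in that executor reuses the same instance,
    // so the LRU cache is shared by all tasks on the same node.
    object LogParserHolder {
      lazy val parser: LogParser = new LogParser()
    }

    // Usage inside a transformation; only the closure is shipped, never the parser:
    //   sc.textFile(inputPath).flatMap(line => LogParserHolder.parser.parseLine(line))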
>>>>
>>>> 2015-01-21 15:04 GMT+08:00 Davies Liu <dav...@databricks.com>:
>>>>
>>>>> Maybe some change related to serializing the closure means LogParser is
>>>>> no longer a singleton, so it is initialized for every task.
>>>>>
>>>>> Could you change it to a Broadcast?
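
For context, a minimal sketch of what the broadcast approach would look like,
reusing the hypothetical LogParser stub from the sketch above; it only works
if the parser (and its cache) is serializable, which the reply above says it
is not:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Sketch only: build the parser once on the driver and ship it to every
    // executor as a broadcast variable; each task then reuses the local copy.
    def parseWithBroadcast(sc: SparkContext, inputPath: String): RDD[(String, String)] = {
      val parserBc = sc.broadcast(new LogParser())
      sc.textFile(inputPath).flatMap(line => parserBc.value.parseLine(line))
    }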
>>>>>
>>>>> On Tue, Jan 20, 2015 at 10:39 PM, Fengyun RAO <raofeng...@gmail.com>
>>>>> wrote:
>>>>> > Currently we are migrating from Spark 1.1 to Spark 1.2, but found the
>>>>> > program runs 3x slower, with nothing else changed.
>>>>> > Note: our program on Spark 1.1 has successfully processed a whole
>>>>> > year's data, quite stably.
>>>>> >
>>>>> > the main script is as below
>>>>> >
>>>>> > sc.textFile(inputPath)
>>>>> > .flatMap(line => LogParser.parseLine(line))
>>>>> > .groupByKey(new HashPartitioner(numPartitions))
>>>>> > .mapPartitionsWithIndex(...)
>>>>> > .foreach(_ => {})
>>>>> >
>>>>> > where LogParser is a singleton which may take some time to initialize
>>>>> > and is shared across the executor.
>>>>> >
>>>>> > the flatMap stage is 3x slower.
>>>>> >
>>>>> > We tried to change spark.shuffle.manager back to hash, and
>>>>> > spark.shuffle.blockTransferService back to nio, but it didn't help.
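
For reference, one way to pin those two properties back to their Spark 1.1
values (a sketch; the keys can equally be passed as --conf options to
spark-submit, and the app name is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Spark 1.2 changed the defaults to "sort" and "netty"; this restores the
    // 1.1 behaviour of hash-based shuffle over the NIO block transfer service.
    val conf = new SparkConf()
      .setAppName("log-parsing")
      .set("spark.shuffle.manager", "hash")
      .set("spark.shuffle.blockTransferService", "nio")
    val sc = new SparkContext(conf)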
>>>>> >
>>>>> > Could somebody explain possible causes, or suggest what we should test
>>>>> > or change to find it out?
>>>>>
>>>>
>>>>
>>
