I use standalone mode. Each machine has 4 workers. Spark is deployed correctly, as the web UI and the jps command show.
Actually, we are a team and have been using Spark for nearly half a year, starting from Spark 1.3.1. We found this problem in one of our applications, and I wrote a simple program to demonstrate it. So can the Spark team try this for us? If you don't have this problem, we will debug it ourselves.

Jeffrey

On Mon, Oct 26, 2015 at 5:32 PM Sean Owen <so...@cloudera.com> wrote:

> Hm, how about the opposite question -- do you have just 1 executor? Then
> again everything will be remote except for a small fraction of blocks.
>
> On Mon, Oct 26, 2015 at 9:28 AM, Jinfeng Li <liji...@gmail.com> wrote:
>
>> Replication factor is 3 and we have 18 data nodes. We checked the HDFS
>> web UI; data is evenly distributed among the 18 machines.
>>
>> On Mon, Oct 26, 2015 at 5:18 PM Sean Owen <so...@cloudera.com> wrote:
>>
>>> Have a look at your HDFS replication, and where the blocks are for these
>>> files. For example, if you had only 2 HDFS data nodes, then data would be
>>> remote to 16 of 18 workers and would always entail a copy.
>>>
>>> On Mon, Oct 26, 2015 at 9:12 AM, Jinfeng Li <liji...@gmail.com> wrote:
>>>
>>>> I cat /proc/net/dev and take the difference of received bytes
>>>> before and after the job. I also see a long-lasting peak (nearly
>>>> 600 Mb/s) in the nload interface. We have 18 machines and each
>>>> machine receives 4.7 GB.
>>>>
>>>> On Mon, Oct 26, 2015 at 5:00 PM Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>>> -dev +user
>>>>> How are you measuring network traffic?
>>>>> It is not in general true that there will be zero network traffic,
>>>>> since not all executors are local to all data. That can be the
>>>>> situation in many cases, but not always.
>>>>>
>>>>> On Mon, Oct 26, 2015 at 8:57 AM, Jinfeng Li <liji...@gmail.com> wrote:
>>>>>
>>>>>> Hi, I find that loading files from HDFS can incur a huge amount of
>>>>>> network traffic. Input size is 90 GB and network traffic is about
>>>>>> 80 GB. By my understanding, local blocks should be read and thus no
>>>>>> network communication is needed.
>>>>>>
>>>>>> I use Spark 1.5.1, and the following is my code:
>>>>>>
>>>>>> val textRDD = sc.textFile("hdfs://master:9000/inputDir")
>>>>>> textRDD.count
>>>>>>
>>>>>> Jeffrey
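For reference, the /proc/net/dev measurement described in the thread (snapshot the cumulative received-byte counters before and after the job, then take the difference) can be sketched roughly as below. This is my own illustration of the technique, not the original measurement code; the interface names and the Spark job in the middle are placeholders.

```python
# Sketch: measure bytes received per network interface while a job runs,
# using the cumulative counters Linux exposes in /proc/net/dev.

def rx_bytes():
    """Return {interface: cumulative_received_bytes} from /proc/net/dev."""
    counters = {}
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:        # first two lines are headers
            iface, data = line.split(":", 1)
            fields = data.split()
            counters[iface.strip()] = int(fields[0])  # field 0 = rx bytes
    return counters

before = rx_bytes()
# ... run the Spark job here (e.g. textRDD.count) ...
after = rx_bytes()

for iface, total in after.items():
    delta = total - before.get(iface, 0)
    print("%s: received %.2f GB during the job" % (iface, delta / 1e9))
```

Run on every machine in the cluster and summed, this gives the ~4.7 GB/node figure quoted above; note the counters are cumulative since boot, so only the difference is meaningful.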