The JobManager does not read all files, but it has to query HDFS for all file metadata (size, blocks, block locations), which can take a while. There is a separate call to the HDFS NameNode for each file. The more files there are, the more metadata has to be collected.
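For illustration only (this is not Flink's actual code, just a minimal Java sketch of the kind of per-file metadata lookups involved, using the Hadoop FileSystem API; the input path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataScan {
    public static void main(String[] args) throws Exception {
        // Hypothetical input directory; substitute your own path.
        Path dir = new Path("hdfs:///data/input");
        FileSystem fs = dir.getFileSystem(new Configuration());

        // One directory listing, then one block-location lookup per file.
        // With ~50,000 small files this means ~50,000 NameNode round trips,
        // which is where the time before the job starts goes.
        for (FileStatus file : fs.listStatus(dir)) {
            BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
            System.out.println(file.getPath() + ": " + blocks.length + " block(s)");
        }
    }
}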
On Wed, Nov 18, 2015 at 7:15 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

> So why does it take so long to start the job? Because in any case the job
> manager has to read all the lines of the input files before generating the
> splits?
>
> On 18 Nov 2015 17:52, "Stephan Ewen" <se...@apache.org> wrote:
>
>> Late answer, sorry:
>>
>> The splits are created in the JobManager, so the job submission should
>> not be affected by that.
>>
>> The assignment of splits to workers is very fast, so many splits with
>> small data are not very different from few splits with large data.
>>
>> Lines are never materialized and the operators do not work differently
>> based on different numbers of splits.
>>
>> On Wed, Oct 7, 2015 at 4:26 PM, Flavio Pompermaier <pomperma...@okkam.it>
>> wrote:
>>
>>> I've tried to split my huge file by line count (using the bash command
>>> split -l) in 2 different ways:
>>>
>>> 1. small line count (huge number of small files)
>>> 2. big line count (small number of big files)
>>>
>>> I can't understand why the time required to effectively start the job is
>>> more or less the same:
>>>
>>> - in 1. it takes a long time to fetch the file list (~50,000 files) and
>>> the split assigner assigns the splits quickly (but even so, there are a
>>> lot of them)
>>> - in 2. Flink is fast at fetching the file list but extremely slow at
>>> generating the splits to assign
>>>
>>> Initially I was thinking that Flink was eagerly materializing the lines
>>> somewhere, but neither memory nor disk usage increases.
>>> What is going on underneath? Is it normal?
>>>
>>> Thanks in advance,
>>> Flavio
>>>
>>> On Wed, Oct 7, 2015 at 3:27 PM, Stephan Ewen <se...@apache.org> wrote:
>>>
>>>> The split functionality is in the FileInputFormat and the functionality
>>>> that takes care of lines across splits is in the DelimitedInputFormat.
>>>>
>>>> On Wed, Oct 7, 2015 at 3:24 PM, Fabian Hueske <fhue...@gmail.com>
>>>> wrote:
>>>>
>>>>> I'm sorry, there is no such documentation.
>>>>> You need to look at the code :-(
>>>>>
>>>>> 2015-10-07 15:19 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>>
>>>>>> And what is the split policy for the FileInputFormat? Does it depend
>>>>>> on the fs block size?
>>>>>> Is there a pointer to the various Flink input formats and a
>>>>>> description of their internals?
>>>>>>
>>>>>> On Wed, Oct 7, 2015 at 3:09 PM, Fabian Hueske <fhue...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Flavio,
>>>>>>>
>>>>>>> it is not possible to split by line count because that would mean
>>>>>>> reading and parsing the file just for splitting.
>>>>>>>
>>>>>>> Parallel processing of data sources depends on the input splits
>>>>>>> created by the InputFormat. Local files can be split just like files
>>>>>>> in HDFS. Usually, each file corresponds to at least one split, but
>>>>>>> multiple files could also be put into a single split if necessary.
>>>>>>> The logic for that would go into the InputFormat.createInputSplits()
>>>>>>> method.
>>>>>>>
>>>>>>> Cheers, Fabian
>>>>>>>
>>>>>>> 2015-10-07 14:47 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>>>>
>>>>>>>> Hi to all,
>>>>>>>>
>>>>>>>> is there a way to split a single local file by line count (e.g. a
>>>>>>>> split every 100 lines) in a LocalEnvironment to speed up a simple
>>>>>>>> map function? For me it is not very clear how local files (files in
>>>>>>>> a directory if recursive=true) are managed by Flink. Is there any
>>>>>>>> reference to these internals?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Flavio
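For reference (not part of the original thread), a minimal sketch of where the custom split logic Fabian mentions would hook in, via FileInputFormat.createInputSplits(); the class name is hypothetical and the override only delegates to the parent:

import java.io.IOException;

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.Path;

// Hypothetical subclass: createInputSplits() is the hook where custom
// grouping logic (e.g. packing many small files into one split) would go.
public class LoggingTextInputFormat extends TextInputFormat {

    public LoggingTextInputFormat(Path path) {
        super(path);
    }

    @Override
    public FileInputSplit[] createInputSplits(int minNumSplits) throws IOException {
        // Delegate to FileInputFormat's default byte-range splitting,
        // then just report how many splits were produced.
        FileInputSplit[] splits = super.createInputSplits(minNumSplits);
        System.out.println("Created " + splits.length + " input splits");
        return splits;
    }
}

Such a format could then be passed to ExecutionEnvironment.readFile(inputFormat, path) in the DataSet API.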