The JobManager does not read all files, but it has to query HDFS for all file metadata (size, blocks, block locations), which can take a while. There is a separate call to the HDFS NameNode for each file. The more files there are, the more metadata has to be collected.
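For illustration only (this is not Flink's actual code, just a minimal Java sketch of the kind of per-file metadata lookups involved, using the Hadoop FileSystem API; the input path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataScan {
    public static void main(String[] args) throws Exception {
        // Hypothetical input directory; substitute your own path.
        Path dir = new Path("hdfs:///data/input");
        FileSystem fs = dir.getFileSystem(new Configuration());

        // One directory listing, then one block-location lookup per file.
        // With ~50,000 small files this means ~50,000 NameNode round trips,
        // which is where the time before the job starts goes.
        for (FileStatus file : fs.listStatus(dir)) {
            BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
            System.out.println(file.getPath() + ": " + blocks.length + " block(s)");
        }
    }
}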
On Wed, Nov 18, 2015 at 7:15 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

> So why does it take so long to start the job? Because in any case the job
> manager has to read all the lines of the input files before generating the
> splits?
>
> On 18 Nov 2015 17:52, "Stephan Ewen" <se...@apache.org> wrote:
>
>> Late answer, sorry:
>>
>> The splits are created in the JobManager, so the job submission should
>> not be affected by that.
>>
>> The assignment of splits to workers is very fast, so many splits with
>> small data are not very different from few splits with large data.
>>
>> Lines are never materialized and the operators do not work differently
>> based on different numbers of splits.
>>
>> On Wed, Oct 7, 2015 at 4:26 PM, Flavio Pompermaier <pomperma...@okkam.it>
>> wrote:
>>
>>> I've tried to split my huge file by line count (using the bash command
>>> split -l) in 2 different ways:
>>>
>>> 1. small line count (huge number of small files)
>>> 2. big line count (small number of big files)
>>>
>>> I can't understand why the time required to effectively start the job is
>>> more or less the same:
>>>
>>> - in 1. it takes a long time to fetch the file list (~50,000 files) and
>>> the split assigner assigns the splits quickly (but even so, there are a
>>> lot of them)
>>> - in 2. Flink is fast at fetching the file list but extremely slow at
>>> generating the splits to assign
>>>
>>> Initially I was thinking that Flink was eagerly materializing the lines
>>> somewhere, but neither memory nor disk usage increases.
>>> What is going on underneath? Is it normal?
>>>
>>> Thanks in advance,
>>> Flavio
>>>
>>> On Wed, Oct 7, 2015 at 3:27 PM, Stephan Ewen <se...@apache.org> wrote:
>>>
>>>> The split functionality is in the FileInputFormat and the functionality
>>>> that takes care of lines across splits is in the DelimitedInputFormat.
>>>>
>>>> On Wed, Oct 7, 2015 at 3:24 PM, Fabian Hueske <fhue...@gmail.com>
>>>> wrote:
>>>>
>>>>> I'm sorry, there is no such documentation.
>>>>> You need to look at the code :-(
>>>>>
>>>>> 2015-10-07 15:19 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>>
>>>>>> And what is the split policy for the FileInputFormat? Does it depend
>>>>>> on the fs block size?
>>>>>> Is there a pointer to the various Flink input formats and a
>>>>>> description of their internals?
>>>>>>
>>>>>> On Wed, Oct 7, 2015 at 3:09 PM, Fabian Hueske <fhue...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Flavio,
>>>>>>>
>>>>>>> it is not possible to split by line count because that would mean
>>>>>>> reading and parsing the file just for splitting.
>>>>>>>
>>>>>>> Parallel processing of data sources depends on the input splits
>>>>>>> created by the InputFormat. Local files can be split just like files
>>>>>>> in HDFS. Usually, each file corresponds to at least one split, but
>>>>>>> multiple files could also be put into a single split if necessary.
>>>>>>> The logic for that would go into the InputFormat.createInputSplits()
>>>>>>> method.
>>>>>>>
>>>>>>> Cheers, Fabian
>>>>>>>
>>>>>>> 2015-10-07 14:47 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>>>>
>>>>>>>> Hi to all,
>>>>>>>>
>>>>>>>> is there a way to split a single local file by line count (e.g. a
>>>>>>>> split every 100 lines) in a LocalEnvironment to speed up a simple
>>>>>>>> map function? For me it is not very clear how local files (files in
>>>>>>>> a directory if recursive=true) are managed by Flink. Is there any
>>>>>>>> reference to these internals?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Flavio
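For reference (not part of the original thread), a minimal sketch of where the custom split logic Fabian mentions would hook in, via FileInputFormat.createInputSplits(); the class name is hypothetical and the override only delegates to the parent:

import java.io.IOException;

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.Path;

// Hypothetical subclass: createInputSplits() is the hook where custom
// grouping logic (e.g. packing many small files into one split) would go.
public class LoggingTextInputFormat extends TextInputFormat {

    public LoggingTextInputFormat(Path path) {
        super(path);
    }

    @Override
    public FileInputSplit[] createInputSplits(int minNumSplits) throws IOException {
        // Delegate to FileInputFormat's default byte-range splitting,
        // then just report how many splits were produced.
        FileInputSplit[] splits = super.createInputSplits(minNumSplits);
        System.out.println("Created " + splits.length + " input splits");
        return splits;
    }
}

Such a format could then be passed to ExecutionEnvironment.readFile(inputFormat, path) in the DataSet API.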