Thanks a lot Robert!

On Dec 15, 2014 12:54 PM, "Robert Metzger" <[email protected]> wrote:
> Hey Flavio,
>
> this pull request got merged:
> https://github.com/apache/incubator-flink/pull/260
>
> With this, you can now simulate an append behavior with Flink:
>
> - You have a directory in HDFS where you put the files you want to append: hdfs:///data/appendjob/
> - Each time you want to append something, you run your job and let it create a new directory in hdfs:///data/appendjob/, let's say hdfs:///data/appendjob/run-X/
> - Now, you can instruct the job to read the full output by letting it recursively read hdfs:///data/appendjob/.
>
> I hope that helps.
>
> Best,
> Robert
>
> On Tue, Dec 9, 2014 at 3:37 PM, Flavio Pompermaier <[email protected]> wrote:
>>
>> I didn't know about that difference! Flink is very smart :)
>> Thanks for the explanation, Robert.
>>
>> On Tue, Dec 9, 2014 at 3:33 PM, Robert Metzger <[email protected]> wrote:
>>
>>> Vasia is working on support for reading directories recursively, but I thought that this would also allow you to simulate something like an append.
>>>
>>> Did you notice an issue when reading many small files with Flink? Flink handles the reading of files differently than Spark.
>>>
>>> Spark basically starts a task for each file / file split. So if you have millions of small files in your HDFS, Spark will start millions of tasks (queued, however). You need to coalesce in Spark to reduce the number of partitions; by default, operators re-use the partitions of the preceding operator.
>>> Flink, on the other hand, starts a fixed number of tasks which read multiple input splits; the splits are lazily assigned to these tasks once they are ready to process new splits.
>>> Flink will not create a partition for each (small) input file. I expect Flink to handle that case a bit better than Spark (I haven't tested it though).
>>>
>>> On Tue, Dec 9, 2014 at 3:03 PM, Flavio Pompermaier <[email protected]> wrote:
>>>
>>>> Great! Appending data to HDFS will be a very useful feature!
>>>> I think you should then also think about how to efficiently read directories containing a lot of small files. I know that this can be quite inefficient; that's why Spark gives you a coalesce operation to be able to deal with such cases.
>>>>
>>>> On Tue, Dec 9, 2014 at 2:39 PM, Vasiliki Kalavri <[email protected]> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> Yes, I took a look into this. I hope I'll be able to find some time to work on it this week.
>>>>> I'll keep you updated :)
>>>>>
>>>>> Cheers,
>>>>> V.
>>>>>
>>>>> On 9 December 2014 at 14:03, Robert Metzger <[email protected]> wrote:
>>>>>
>>>>>> It seems that Vasia started working on adding support for recursive reading: https://issues.apache.org/jira/browse/FLINK-1307.
>>>>>> I'm still occupied with refactoring the YARN client; the HDFS refactoring is next on my list.
>>>>>>
>>>>>> On Tue, Dec 9, 2014 at 11:59 AM, Flavio Pompermaier <[email protected]> wrote:
>>>>>>
>>>>>>> Any news about this, Robert?
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Flavio
>>>>>>>
>>>>>>> On Thu, Dec 4, 2014 at 10:03 PM, Robert Metzger <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I think there is no support for appending to HDFS files in Flink yet.
>>>>>>>> HDFS supports it, but some adjustments in the system are required (not deleting / creating directories before writing; exposing the append() methods in the FS abstractions).
>>>>>>>>
>>>>>>>> I'm planning to work on the FS abstractions next week; if I have enough time, I can also look into adding support for append().
>>>>>>>>
>>>>>>>> Another approach could be adding support for recursively reading directories with the input formats. Vasia asked for this feature a few days ago on the mailing list. If we had that feature, you could just write to a directory and read the parent directory (with all the dirs for the appends).
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Robert
>>>>>>>>
>>>>>>>> On Thu, Dec 4, 2014 at 5:59 PM, Flavio Pompermaier <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>> how can I efficiently append data (as plain strings or also Avro records) to HDFS using Flink?
>>>>>>>>> Do I need to use Flume or can I avoid it?
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>> Flavio
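[For later readers of this thread: the recursive reading that Robert and Vasia discuss was tracked in FLINK-1307 and is enabled via an input-format parameter. A minimal sketch, assuming the Flink batch (DataSet) API of that era and the `recursive.file.enumeration` configuration flag documented with that feature; the path matches Robert's example layout and needs a running Flink setup:]

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class RecursiveReadJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Tell the input format to descend into subdirectories
        // (run-1/, run-2/, ... created by each "append" job).
        Configuration parameters = new Configuration();
        parameters.setBoolean("recursive.file.enumeration", true);

        // Reading the parent directory now yields the union of all runs.
        DataSet<String> allAppends = env
                .readTextFile("hdfs:///data/appendjob/")
                .withParameters(parameters);

        allAppends.print();
    }
}
```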

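[The append-by-subdirectory workflow itself can be demonstrated without a cluster. A rough sketch in plain Java (java.nio only, no Flink or HDFS dependencies; the `run-N` directory layout mirrors Robert's example, and all names are illustrative): each "append" writes a fresh run directory under the base path, and a recursive walk of the base path reads the union of all runs.]

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class AppendSimulation {

    // One "append": write the new records into a fresh run-N subdirectory
    // instead of modifying any existing file.
    static void appendRun(Path base, int run, List<String> records) throws IOException {
        Path runDir = Files.createDirectories(base.resolve("run-" + run));
        Files.write(runDir.resolve("part-0"), records);
    }

    // The "recursive read": walk the base directory and concatenate the
    // contents of every file found under any run-N subdirectory.
    static List<String> readAll(Path base) throws IOException {
        try (Stream<Path> files = Files.walk(base)) {
            return files.filter(Files::isRegularFile)
                        .flatMap(p -> {
                            try {
                                return Files.lines(p);
                            } catch (IOException e) {
                                throw new UncheckedIOException(e);
                            }
                        })
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("appendjob");
        appendRun(base, 1, Arrays.asList("a", "b"));
        appendRun(base, 2, Arrays.asList("c"));
        System.out.println(readAll(base)); // prints [a, b, c]
    }
}
```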