Thanks a lot Robert!

On Dec 15, 2014 12:54 PM, "Robert Metzger" <[email protected]> wrote:
> Hey Flavio,
>
> this pull request got merged:
> https://github.com/apache/incubator-flink/pull/260
>
> With this, you can now simulate an append behavior with Flink:
>
> - You have a directory in HDFS where you put the files you want to append: hdfs:///data/appendjob/
> - Each time you want to append something, you run your job and let it create a new directory in hdfs:///data/appendjob/, let's say hdfs:///data/appendjob/run-X/
> - Now, you can instruct the job to read the full output by letting it recursively read hdfs:///data/appendjob/.
>
> I hope that helps.
>
> Best,
> Robert
>
> On Tue, Dec 9, 2014 at 3:37 PM, Flavio Pompermaier <[email protected]> wrote:
>>
>> I didn't know about that difference! Flink is very smart :)
>> Thanks for the explanation, Robert.
>>
>> On Tue, Dec 9, 2014 at 3:33 PM, Robert Metzger <[email protected]> wrote:
>>
>>> Vasia is working on support for reading directories recursively, but I thought that this would also allow you to simulate something like an append.
>>>
>>> Did you notice an issue when reading many small files with Flink? Flink handles the reading of files differently than Spark.
>>>
>>> Spark basically starts a task for each file / file split. So if you have millions of small files in your HDFS, Spark will start millions of tasks (queued, however). You need to coalesce in Spark to reduce the number of partitions; by default, operators re-use the partitions of the preceding operator.
>>> Flink, on the other hand, starts a fixed number of tasks which read multiple input splits; the splits are lazily assigned to these tasks once they are ready to process new splits.
>>> Flink will not create a partition for each (small) input file. I expect Flink to handle that case a bit better than Spark (I haven't tested it though).
>>>
>>> On Tue, Dec 9, 2014 at 3:03 PM, Flavio Pompermaier <[email protected]> wrote:
>>>
>>>> Great! Appending data to HDFS will be a very useful feature!
>>>> I think you should then also think about how to efficiently read directories containing a lot of small files. I know that this can be quite inefficient; that's why Spark gives you a coalesce operation to be able to deal with such cases.
>>>>
>>>> On Tue, Dec 9, 2014 at 2:39 PM, Vasiliki Kalavri <[email protected]> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> Yes, I took a look into this. I hope I'll be able to find some time to work on it this week.
>>>>> I'll keep you updated :)
>>>>>
>>>>> Cheers,
>>>>> V.
>>>>>
>>>>> On 9 December 2014 at 14:03, Robert Metzger <[email protected]> wrote:
>>>>>
>>>>>> It seems that Vasia started working on adding support for recursive reading: https://issues.apache.org/jira/browse/FLINK-1307.
>>>>>> I'm still occupied with refactoring the YARN client; the HDFS refactoring is next on my list.
>>>>>>
>>>>>> On Tue, Dec 9, 2014 at 11:59 AM, Flavio Pompermaier <[email protected]> wrote:
>>>>>>
>>>>>>> Any news about this, Robert?
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Flavio
>>>>>>>
>>>>>>> On Thu, Dec 4, 2014 at 10:03 PM, Robert Metzger <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I think there is no support for appending to HDFS files in Flink yet.
>>>>>>>> HDFS supports it, but some adjustments in the system are required (not deleting / creating directories before writing; exposing the append() methods in the FS abstractions).
>>>>>>>>
>>>>>>>> I'm planning to work on the FS abstractions next week; if I have enough time, I can also look into adding support for append().
>>>>>>>>
>>>>>>>> Another approach could be adding support for recursively reading directories with the input formats. Vasia asked for this feature a few days ago on the mailing list. If we had that feature, you could just write to a directory and read the parent directory (with all the dirs for the appends).
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Robert
>>>>>>>>
>>>>>>>> On Thu, Dec 4, 2014 at 5:59 PM, Flavio Pompermaier <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>> how can I efficiently append data (as plain strings or also Avro records) to HDFS using Flink?
>>>>>>>>> Do I need to use Flume or can I avoid it?
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>> Flavio
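[For later readers of this thread: the recursive reading that Robert and Vasia discuss was tracked in FLINK-1307 and is enabled via an input-format parameter. A minimal sketch, assuming the Flink batch (DataSet) API of that era and the `recursive.file.enumeration` configuration flag documented with that feature; the path matches Robert's example layout and needs a running Flink setup:]

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class RecursiveReadJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Tell the input format to descend into subdirectories
        // (run-1/, run-2/, ... created by each "append" job).
        Configuration parameters = new Configuration();
        parameters.setBoolean("recursive.file.enumeration", true);

        // Reading the parent directory now yields the union of all runs.
        DataSet<String> allAppends = env
                .readTextFile("hdfs:///data/appendjob/")
                .withParameters(parameters);

        allAppends.print();
    }
}
```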

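[The append-by-subdirectory workflow itself can be demonstrated without a cluster. A rough sketch in plain Java (java.nio only, no Flink or HDFS dependencies; the `run-N` directory layout mirrors Robert's example, and all names are illustrative): each "append" writes a fresh run directory under the base path, and a recursive walk of the base path reads the union of all runs.]

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class AppendSimulation {

    // One "append": write the new records into a fresh run-N subdirectory
    // instead of modifying any existing file.
    static void appendRun(Path base, int run, List<String> records) throws IOException {
        Path runDir = Files.createDirectories(base.resolve("run-" + run));
        Files.write(runDir.resolve("part-0"), records);
    }

    // The "recursive read": walk the base directory and concatenate the
    // contents of every file found under any run-N subdirectory.
    static List<String> readAll(Path base) throws IOException {
        try (Stream<Path> files = Files.walk(base)) {
            return files.filter(Files::isRegularFile)
                        .flatMap(p -> {
                            try {
                                return Files.lines(p);
                            } catch (IOException e) {
                                throw new UncheckedIOException(e);
                            }
                        })
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("appendjob");
        appendRun(base, 1, Arrays.asList("a", "b"));
        appendRun(base, 2, Arrays.asList("c"));
        System.out.println(readAll(base)); // prints [a, b, c]
    }
}
```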