I stand corrected: a file does not take an entire block, according to Hadoop: 
The Definitive Guide. As Stephan mentioned, it takes space in the namenode 
(about 150 bytes per object, as I read). So the limitation with many small files 
is the capacity of the namenode.
Still, the data, when replicated, is distributed by the HDFS engine; I don't see 
how there can be any assurance of the locality of the data. It's possible to 
figure out where it is physically located, but I'm not sure it can be predicted.
The whole idea behind Hadoop is to bring the computation to the data, and 
that's what YARN does when it distributes processing tasks. (The Yahoo Developer 
Network explains this well: https://developer.yahoo.com/hadoop/tutorial/module1.html)
Is the Flink behavior mentioned native or is this something happening when 
running Flink on YARN?

Date: Sat, 14 Mar 2015 19:16:24 +0100
Subject: Re: Maintaining data locality with list of paths (strings) as input
From: se...@apache.org
To: user@flink.apache.org

Hi Guy,
This sounds like a use case that should work with Flink.
When it comes to input handling, Flink differs a bit from Spark. Flink creates 
a set of input tasks and a set of input splits. The splits are then assigned to 
the tasks on the fly. Each task may work on multiple input splits, which are 
pulled from the JobManager. The input localization also happens on the fly, as 
each input task requests the next input split to process.
It should be possible to make sure this is executed in a data-local way, if the 
input splits expose the location properly.
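The pull-based assignment Stephan describes can be sketched roughly as follows. This is a hypothetical toy coordinator in Python, not Flink's actual JobManager code: a coordinator hands each requesting task a split whose data lives on that task's host when one is available, and falls back to a remote split otherwise.

```python
# Hypothetical sketch of on-the-fly, locality-aware split assignment
# (illustration only; not Flink's real implementation).
from collections import deque

class SplitCoordinator:
    def __init__(self, splits):
        # splits: list of (split_id, [hosts holding that split's data])
        self.pending = deque(splits)

    def next_split(self, task_host):
        # Prefer a split whose data lives on the requesting task's host.
        for _ in range(len(self.pending)):
            split_id, hosts = self.pending.popleft()
            if task_host in hosts:
                return split_id, True       # local assignment
            self.pending.append((split_id, hosts))
        # No local split left: hand out any remaining split (remote read).
        if self.pending:
            return self.pending.popleft()[0], False
        return None, False

coordinator = SplitCoordinator([
    ("split-0", ["nodeA"]),
    ("split-1", ["nodeB"]),
    ("split-2", ["nodeA", "nodeC"]),
])
print(coordinator.next_split("nodeB"))  # ('split-1', True)
print(coordinator.next_split("nodeA"))  # ('split-2', True)
```

The key point is that tasks pull splits one at a time, so locality is a per-request preference rather than a fixed up-front partitioning, which only works if each split exposes its storage hosts.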

If you share a bit more information about what your input format looks like, we 
can definitely help you realize that.
As a side comment: Small files in HDFS do not take up 64 megabytes on disk; 
there must be some confusion. In early versions of HDFS, there was a big 
problem with small files, because the Namenode was getting overloaded with 
metadata. This has gotten better, as far as I know, but I am not sure whether 
they have fully solved it.
Greetings,
Stephan
On Sat, Mar 14, 2015 at 5:49 PM, Emmanuel <ele...@msn.com> wrote:

Hi Guy,

I don't have an answer about Flink, but a couple of comments on your use case 
that I hope might help:

- you should view HDFS as a giant RAID across nodes: the namenode maintains the 
file table, but the data is distributed and replicated across nodes by block. 
There is no 'data locality' guarantee: the data is distributed and replicated, 
so it could be spread across many nodes.
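To make the "no whole-file locality" point concrete, here is a toy sketch of splitting a file into blocks and replicating each block independently. The round-robin placement policy below is hypothetical, chosen only for illustration; HDFS's real placement algorithm is rack-aware and more involved.

```python
import itertools

# Toy sketch of HDFS-style placement: a file is cut into fixed-size blocks,
# and each block is replicated independently across the cluster.
# Round-robin placement is a made-up policy for illustration only.
def place_blocks(file_size, block_size, nodes, replication=3):
    num_blocks = -(-file_size // block_size)  # ceiling division
    ring = itertools.cycle(nodes)
    return [[next(ring) for _ in range(replication)]
            for _ in range(num_blocks)]

nodes = ["node1", "node2", "node3", "node4", "node5"]
# A 200 MB file with 64 MB blocks -> 4 blocks, each on 3 nodes:
for i, replicas in enumerate(place_blocks(200, 64, nodes)):
    print(f"block {i}: {replicas}")
```

Once a file spans more than one block, no single node is guaranteed to hold all of it, which is why whole-file locality cannot be assumed in general.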

- small files on HDFS are not a good idea because the typical minimal block 
size is 64MB, which means even if your file is 1kB, it will use 64MB on disk. 
It is best to aggregate those small files into a big one or use a DB storage 
like HBase or Cassandra.
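For a rough sense of the namenode pressure from many small files, here is a back-of-the-envelope sketch. The ~150 bytes of namenode memory per object (file, directory, or block) is the commonly cited estimate from Hadoop: The Definitive Guide, not something measured here, and real numbers vary by version.

```python
# Back-of-the-envelope estimate of namenode heap consumed by file metadata.
# Assumes ~150 bytes of namenode memory per object (file or block), the
# commonly cited estimate; treat the result as an order of magnitude only.
BYTES_PER_OBJECT = 150

def namenode_bytes(num_files, blocks_per_file=1):
    # Each file costs one file object plus one object per block.
    return num_files * BYTES_PER_OBJECT * (1 + blocks_per_file)

# 10 million 1 kB files (one block each) vs. the same ~10 GB of data
# aggregated into 64 MB files:
small = namenode_bytes(10_000_000)                                 # ~3 GB heap
big = namenode_bytes(10_000_000 * 1024 // (64 * 1024 * 1024) + 1)  # ~46 kB heap
print(small, big)
```

The disk footprint is the same either way; it is the per-file metadata held in namenode memory that makes millions of tiny files expensive.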



In Spark, you can load files from the local file system, but usually it 
requires that the files be on each node, which defeats the purpose.

Emmanuel

-------- Original message --------

From: Guy Rapaport <guy4...@gmail.com> 

Date: 03/14/2015 8:38 AM (GMT-08:00) 

To: Flink Users <user@flink.apache.org> 

Subject: Maintaining data locality with list of paths (strings) as input 

Hello,

Here's a use case I'd like to implement, and I wonder if Flink is the answer:

My input is a file containing a list of paths.
(It's actually a message queue with incoming messages, each containing a path, 
but let's use a simpler use-case.)

Each path points at a file stored on HDFS. The files are rather small, so 
although they are replicated, they are not broken into chunks.

I want each file to get processed on the node on which it is stored, for the 
sake of data locality.
However, if I run such a job on Spark, what I get is that the input path gets 
to some node, which then has to access the file by pulling it from HDFS - no 
data locality, but network congestion instead.

Can Flink solve this problem for me?

Note: I saw similar examples in which file lists are processed on Spark... by 
having each file in the list downloaded from the internet to the node 
processing it. That's not my use case - I already have the files on HDFS; all 
I want is to enjoy data locality in a cluster-like environment!



Thanks,
Guy.