Hi,
Answering the first question:
What happens is that the client on the machine issuing the
'copyFromLocal' command first creates an instance of
DistributedFileSystem, which makes an RPC call to the NameNode to create
a new file in the filesystem's namespace. The NN performs various checks
(for example that the file doesn't already exist and that the client is
allowed to create it); if they pass, it creates a record for the file.
The DistributedFileSystem then returns an FSDataOutputStream to the
client, which handles the communication with the datanodes.
The data is then split into packets (small chunks, much smaller than a
block) which are written to an internal data queue. The queue is
consumed by the DataStreamer, which asks the NN to allocate new blocks
by picking a list of DNs to store the replicas on. That list of DNs
forms a pipeline whose length is the replication factor. The
DataStreamer streams each packet to the first datanode in the pipeline,
which stores the packet and forwards it to the second datanode in the
pipeline, and so on until all the replicas are written.
When the client finishes writing, it closes the stream and tells the NN
that the file is complete.
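To make the pipeline part more concrete, here is a toy sketch in Python
of the flow described above. It is not Hadoop code: the DataNode class,
the 4-byte packet size and the function names are all made up for
illustration, and acknowledgements and failure handling are left out.

```python
from collections import deque

PACKET_SIZE = 4  # bytes; tiny on purpose (hypothetical value, not HDFS's)


def split_into_packets(data, packet_size=PACKET_SIZE):
    """Chop the client's byte stream into packets and queue them."""
    return deque(data[i:i + packet_size]
                 for i in range(0, len(data), packet_size))


class DataNode:
    """Toy datanode: stores each packet, then forwards it downstream."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.stored = bytearray()

    def receive(self, packet):
        self.stored.extend(packet)            # store the packet locally
        if self.downstream is not None:
            self.downstream.receive(packet)   # forward to the next replica


def build_pipeline(names):
    """Chain the datanodes so each one forwards to the next in the list."""
    head = None
    for name in reversed(names):
        head = DataNode(name, downstream=head)
    return head


def write_file(data, pipeline_names):
    """Stream every packet to the first datanode; return all the nodes."""
    head = build_pipeline(pipeline_names)
    data_queue = split_into_packets(data)
    while data_queue:
        head.receive(data_queue.popleft())
    nodes, node = [], head
    while node is not None:
        nodes.append(node)
        node = node.downstream
    return nodes


if __name__ == "__main__":
    for dn in write_file(b"000\n111\n222\n", ["dn1", "dn2", "dn3"]):
        print(dn.name, bytes(dn.stored))
```

The client only ever talks to the first datanode; the other replicas
get their copy through forwarding, which is the point of the pipeline.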
Second question: the split size is used during processing of a file. In
other words, splits are consumed by the mappers, not by HDFS
operations. The block is what the HDFS client consumes during read and
write operations. The default block size in Hadoop 2.6 is 128MB
(dfs.blocksize), and by default the split size is the same as the block
size.
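As far as I know, FileInputFormat derives the split size as
max(minSize, min(maxSize, blockSize)). Below is a small Python sketch of
that rule. The function name mirrors Hadoop's computeSplitSize, but the
default min/max values used here are my assumption; on a real cluster
they come from mapreduce.input.fileinputformat.split.minsize and
mapreduce.input.fileinputformat.split.maxsize.

```python
MB = 1024 * 1024


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size = block size, clamped between min_size and max_size."""
    return max(min_size, min(max_size, block_size))


if __name__ == "__main__":
    # Defaults (assumed): the split is exactly one block.
    print(compute_split_size(128 * MB) // MB)                      # 128
    # Lowering the maximum gives splits smaller than a block.
    print(compute_split_size(128 * MB, max_size=64 * MB) // MB)    # 64
    # Raising the minimum gives splits larger than a block.
    print(compute_split_size(128 * MB, min_size=256 * MB) // MB)   # 256
```

So the split size only differs from the block size when someone sets
the min or max explicitly.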
Third question: a split is the logical representation of the data in a
block, while a block is the physical representation of the data. As I
said, splits are consumed when processing the data; a mapper reads data
from a block through a split.
Consider the following content of a file:
000
111
222
333
444
555
666
777
888
999
aaa
bbb
ccc
ddd
eee
fff
The block representation of this file could look something like this:
block-0
000
111
222
333
444
55
block-1
5
666
777
888
999
aa
block-2
a
bbb
ccc
ddd
eee
fff
If you notice, the blocks aren't aligned with the records. This happens
because blocks are cut at fixed byte offsets, while the records in a
file don't all have the same length. This is where splits come in: they
put the records back together before they are processed by a mapper.
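Here is a rough Python sketch of how a line-oriented reader can put the
records back together. It follows the idea behind Hadoop's
LineRecordReader as I understand it (skip the first line unless the
split starts at offset 0, and keep reading past the end of the split to
finish the last record), but it is a simplification, not the real
implementation. The 22-byte split size is just chosen so that the
boundaries fall in the middle of a record.

```python
def read_split(data, start, length):
    """Return the complete records that belong to one split."""
    end = start + length
    pos = start
    if start != 0:
        # Skip the first (possibly partial) line: the reader of the
        # previous split reads past its own end to pick it up.
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1
    records = []
    # Read every line that starts at or before the end of the split,
    # even if it finishes in the next block.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        records.append(data[pos:stop].decode())
        pos = stop + 1
    return records


def read_all_splits(data, split_size):
    """Cut the data at fixed byte offsets and read each split in turn."""
    return [read_split(data, off, min(split_size, len(data) - off))
            for off in range(0, len(data), split_size)]


if __name__ == "__main__":
    lines = ["000", "111", "222", "333", "444", "555", "666", "777",
             "888", "999", "aaa", "bbb", "ccc", "ddd", "eee", "fff"]
    data = "".join(line + "\n" for line in lines).encode()
    for i, records in enumerate(read_all_splits(data, 22)):
        print("split-%d" % i, records)
```

With 22-byte splits the first boundary falls inside "555", yet the
first mapper still gets the whole record and the second mapper starts
at "666", so every record is processed exactly once.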
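Unlike traditional filesystems, HDFS doesn't reserve a full block for
data that is smaller than the HDFS block size: the last block of a file
only occupies as much space as the data it holds, so the "free" part of
that block is never allocated. Note that there are 2 block sizes here,
the underlying filesystem block size and the HDFS block size.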
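The arithmetic for your 100 MB example, as a small Python sketch
(block_lengths is just a helper name I made up):

```python
MB = 1024 * 1024


def block_lengths(file_size, block_size):
    """Sizes of the HDFS blocks a file of file_size bytes is stored in."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


if __name__ == "__main__":
    blocks = block_lengths(100 * MB, 64 * MB)
    print([b // MB for b in blocks])   # [64, 36]
    print(sum(blocks) // MB)           # 100, not 128
    # A 1TB file with 128MB blocks:
    print(len(block_lengths(1024 * 1024 * MB, 128 * MB)))   # 8192
```

The second block is 36 MB long and takes 36 MB on the datanode's disk;
the remaining 28 MB is not held back for this file or any other.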
On 02/21/2015 01:59 AM, SP wrote:
Hello everyone,
I have a couple of doubts; can anyone please point me in the right
direction?
1> What exactly happens when I want to copy a 1TB file to a Hadoop
cluster using the copyFromLocal command?
1> What will be the split size? Will it be the same as the block size?
2> What is a block and what is a split?
If we have a 100 MB file and a block size of 64 MB, as we know it will
be divided into 2 blocks of 64 MB and 36 MB. The second block still has
28 MB of space left; what will happen to that free space?
Will the cluster have unequal block sizes, or will the space be
occupied by another file?
3) Let's say a 64MB block is on node A and replicated on 2 other
nodes (B, C), and the input split size for the map-reduce program is
64MB. Will this split just have the location of node A, or will it have
locations for all three nodes A, B, C?
4) How is it handled if the input split size is greater or smaller than
the block size?
Can anyone please help?
thanks
SP
--
Regards,
Ahmed Ossama