@Rahul : Yes, distcp can do that. And the bigger the files, the less metadata there is, hence lower memory consumption on the NN.
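For the record, here is roughly what that can look like. The paths and the namenode URI below are made-up placeholders, and keep in mind that a file:// source has to be readable from whichever nodes actually run the copy tasks (a shared mount, for example):

    # Copy a local directory into HDFS with distcp (placeholder paths).
    hadoop distcp file:///data/staging hdfs://namenode:8020/user/rahul/staging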
Warm Regards,
Tariq
cloudfront.blogspot.com


On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <[email protected]> wrote:

> IMHO, I think the statement about the NN with regard to block metadata is
> more of a general statement. Even if you put lots of small files with a
> combined size of 10 TB, you need to have a capable NN.
>
> Can distcp be used to copy local-to-HDFS?
>
> Thanks,
> Rahul
>
>
> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <[email protected]> wrote:
>
>> Absolutely right, Mohammad.
>>
>>
>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <[email protected]> wrote:
>>
>>> Sorry for barging in, guys. I think Nitin is talking about this:
>>>
>>> Every file and block in HDFS is treated as an object, and for each
>>> object around 200 B of metadata gets created. So the NN should be
>>> powerful enough to handle that much metadata, since it is all held
>>> in memory. Actually, memory is the most important metric when it
>>> comes to the NN.
>>>
>>> Am I correct @Nitin?
>>>
>>> @Thoihen : As Nitin has said, when you talk about that much data you
>>> don't actually just do a "put". You could use something like "distcp"
>>> for parallel copying. A better approach would be to use a data
>>> aggregation tool like Flume or Chukwa, as Nitin has already pointed
>>> out. Facebook uses its own data aggregation tool, called Scribe, for
>>> this purpose.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <[email protected]> wrote:
>>>
>>>> The NN would still be in the picture because it will be writing a
>>>> lot of metadata for each individual file, so you will need an NN
>>>> capable enough to store the metadata for your entire dataset. Data
>>>> will never go to the NN, but a lot of metadata about the data will
>>>> be on the NN, so it is always a good idea to have a strong NN.
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <[email protected]> wrote:
>>>>
>>>>> @Nitin, writing to hdfs with parallel dfs clients is great, but I
>>>>> could not understand the meaning of a capable NN. As I understand
>>>>> it, the NN is not part of the actual data write pipeline, meaning
>>>>> the data does not travel through the NN; the dfs client contacts
>>>>> the NN from time to time to get the locations of the DNs where the
>>>>> data blocks should be stored.
>>>>>
>>>>> Thanks,
>>>>> Rahul
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <[email protected]> wrote:
>>>>>
>>>>>> Is it safe? There is no direct yes-or-no answer.
>>>>>>
>>>>>> When you say you have files worth 10 TB and you want to upload
>>>>>> them to HDFS, several factors come into the picture:
>>>>>>
>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>
>>>>>> And, most importantly, I assume that you have a capable hadoop
>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>
>>>>>> I would definitely not write files sequentially into HDFS. I would
>>>>>> prefer to write files in parallel to hdfs to utilize the DFS write
>>>>>> features and speed up the process. You can run the hdfs put
>>>>>> command in a parallel manner, and in my experience it has not
>>>>>> failed when we write a lot of data.
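For illustration, one rough way of doing the parallel puts described above. The source directory, target directory and degree of parallelism here are just placeholders, not anything from the thread:

    # Push many local files into HDFS with several "hadoop fs -put"
    # processes running at once (8 here, purely as an example).
    find /data/incoming -type f -print0 | \
      xargs -0 -P 8 -I {} hadoop fs -put {} /user/ingest/incoming/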
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <[email protected]> wrote:
>>>>>>
>>>>>>> @Nitin Pawar, thanks for clearing my doubts.
>>>>>>>
>>>>>>> But I have one more question. Say I have 10 TB of data in the
>>>>>>> pipeline. Is it perfectly OK to use the hadoop fs put command to
>>>>>>> upload these files of size 10 TB, and is there any limit to the
>>>>>>> file size using the hadoop command line? Can the hadoop put
>>>>>>> command line work with huge data?
>>>>>>>
>>>>>>> Thanks in advance
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <[email protected]> wrote:
>>>>>>>
>>>>>>>> First of all, most companies do not get 100 PB of data in one go.
>>>>>>>> It is an accumulating process, and most companies have a data
>>>>>>>> pipeline in place where the data is written to hdfs on a regular
>>>>>>>> frequency, retained on hdfs for some duration as needed, and from
>>>>>>>> there sent to archives or deleted.
>>>>>>>>
>>>>>>>> For data management products, you can look at Falcon, which is
>>>>>>>> open sourced by InMobi along with Hortonworks.
>>>>>>>>
>>>>>>>> In any case, if you want to write files to hdfs there are a few
>>>>>>>> options available to you:
>>>>>>>> 1) Write your own dfs client which writes to dfs
>>>>>>>> 2) Use hdfs proxy
>>>>>>>> 3) There is webhdfs
>>>>>>>> 4) Command line hdfs
>>>>>>>> 5) Data collection tools, like Flume etc., come with support for
>>>>>>>> writing to hdfs
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> Can anyone help me understand how companies like Facebook,
>>>>>>>>> Yahoo, etc. upload bulk files, say to the tune of 100 petabytes,
>>>>>>>>> to a Hadoop HDFS cluster for processing, and after processing
>>>>>>>>> how they download those files from HDFS to the local file
>>>>>>>>> system?
>>>>>>>>>
>>>>>>>>> I don't think they would be using the command line hadoop fs put
>>>>>>>>> to upload files, as it would take too long. Or do they divide
>>>>>>>>> the data into, say, 10 parts of 10 petabytes each, compress
>>>>>>>>> them, and use the command line hadoop fs put?
>>>>>>>>>
>>>>>>>>> Or do they use some tool to upload huge files?
>>>>>>>>>
>>>>>>>>> Please help me.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> thoihen
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Nitin Pawar
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>
>>
>> --
>> Nitin Pawar
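P.S. For anyone curious about the webhdfs route (option 3 in Nitin's list): assuming webhdfs is enabled on the cluster, a file upload is a two-step HTTP call; the namenode answers the first PUT with a redirect to a datanode, and the bytes go in the second PUT. Host names, ports and paths below are only placeholders:

    # Step 1: ask the namenode where to write (returns a 307 redirect).
    curl -i -X PUT "http://<namenode>:50070/webhdfs/v1/user/ingest/file.txt?op=CREATE&overwrite=true"

    # Step 2: send the actual data to the datanode URL taken from the
    # Location header of the response above.
    curl -i -X PUT -T file.txt "<location-header-url>"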
