@Nitin, writing to HDFS in parallel through the DFS client sounds great, but I could not understand what you mean by a capable NN. As far as I know, the NN is not part of the actual data write pipeline; that is, the data does not travel through the NN. The DFS client contacts the NN from time to time to get the locations of the DNs where the data blocks should be stored.
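Just to make that picture concrete, here is a minimal sketch of a client write through the Hadoop Java FileSystem API. The namenode address and the file paths are placeholders I made up, not anything from an actual setup:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        FileSystem fs = FileSystem.get(conf);

        // create() talks to the NN only for metadata (the file entry and,
        // as the write proceeds, DN locations for each block); the bytes
        // written below stream directly to the DNs, never through the NN.
        InputStream in = new BufferedInputStream(
                new FileInputStream("/local/data/part-0001"));          // placeholder local file
        FSDataOutputStream out = fs.create(new Path("/user/rahul/part-0001")); // placeholder HDFS path

        IOUtils.copyBytes(in, out, 4096, true); // true closes both streams
        fs.close();
    }
}
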
Thanks,
Rahul

On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <[email protected]> wrote:

> Is it safe? There is no direct answer, yes or no.
>
> When you say you have files worth 10 TB and you want to upload them to
> HDFS, several factors come into the picture:
>
> 1) Is the machine in the same network as your Hadoop cluster?
> 2) Is there a guarantee that the network will not go down?
>
> And most importantly, I assume that you have a capable Hadoop cluster. By
> that I mean you have a capable namenode.
>
> I would definitely not write files sequentially to HDFS. I would prefer to
> write files in parallel to HDFS to utilize the DFS write features and
> speed up the process. You can run hdfs put commands in parallel, and in my
> experience it has not failed when we write a lot of data.
>
>
> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <[email protected]> wrote:
>
>> @Nitin Pawar, thanks for clearing my doubts.
>>
>> But I have one more question. Say I have 10 TB of data in the pipeline.
>>
>> Is it perfectly OK to use the hadoop fs put command to upload these files
>> of size 10 TB, and is there any limit to the file size using the hadoop
>> command line? Can the hadoop put command work with huge data?
>>
>> Thanks in advance
>>
>>
>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <[email protected]> wrote:
>>
>>> First of all, most companies do not get 100 PB of data in one go. It is
>>> an accumulating process, and most companies have a data pipeline in
>>> place where the data is written to HDFS on a regular frequency, retained
>>> on HDFS for some duration as needed, and from there sent to archival
>>> storage or deleted.
>>>
>>> For data management products, you can look at Falcon, which is open
>>> sourced by InMobi along with Hortonworks.
>>>
>>> In any case, if you want to write files to HDFS there are a few options
>>> available to you:
>>> 1) Write your own DFS client which writes to DFS
>>> 2) Use HDFS proxy
>>> 3) Use WebHDFS
>>> 4) Use the command-line hdfs tools
>>> 5) Use data collection tools that come with support for writing to HDFS,
>>> like Flume etc.
>>>
>>>
>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <[email protected]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Can anyone help me understand how companies like Facebook, Yahoo etc.
>>>> upload bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS
>>>> cluster for processing, and after processing how they download those
>>>> files from HDFS to the local file system?
>>>>
>>>> I don't think they would be using the command line hadoop fs put to
>>>> upload the files, as it would take too long. Or do they divide the data
>>>> into, say, 10 parts of 10 petabytes each, compress them, and use the
>>>> command line hadoop fs put?
>>>>
>>>> Or do they use some tool to upload huge files?
>>>>
>>>> Please help me.
>>>>
>>>> Thanks
>>>> thoihen
>>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>
>
> --
> Nitin Pawar
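P.S. For the "put in parallel" approach Nitin describes, a rough sketch in Java using a thread pool around copyFromLocalFile is below; the local staging directory, target path, namenode address and thread count are all made-up placeholders:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelPutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder

        final FileSystem fs = FileSystem.get(conf);
        final Path target = new Path("/user/rahul/incoming");  // placeholder

        // Assumes the staging directory exists and holds the files to upload.
        File[] localFiles = new File("/local/staging").listFiles();
        ExecutorService pool = Executors.newFixedThreadPool(8); // tune to taste

        for (final File f : localFiles) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        // Each copy streams its blocks to DNs independently,
                        // so the copies can proceed in parallel.
                        fs.copyFromLocalFile(new Path(f.getAbsolutePath()), target);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        fs.close();
    }
}

At larger scale, distcp does essentially the same thing, spreading the copies across the map tasks of a MapReduce job.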
