As others have answered, the number of blocks/files/directories that can be addressed by a NameNode is limited by the amount of heap space available to the NameNode JVM. If you need more background on this topic, I'd suggest reviewing the material on the Hadoop JIRA and from the various vendors that supply and support HDFS.

For instance, this JIRA: https://issues.apache.org/jira/browse/HADOOP-1687

Or, for instance, Cloudera discusses this topic: http://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html

I don't intend to speak for Cloudera (obviously), but you can see on that page:

> Cloudera recommends 1 GB of NameNode heap space per million blocks to
> account for the namespace objects

So, do you have >200GB of memory to give to the NameNode JVM? And do you want to do that? If yes, then you could probably address more than 200 million blocks.
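To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in plain Java against the standard FileSystem API. It simply applies the 1-GB-per-million-objects rule of thumb quoted above to whatever namespace your client is configured for; the class name is just illustrative, and note that ContentSummary reports files and directories but not block counts, so treat the result as a lower bound.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative only: applies the "~1 GB of NameNode heap per million
    // namespace objects" rule of thumb to the cluster's root directory.
    // ContentSummary counts files and directories but not blocks, so
    // multi-block files push the real requirement higher than this.
    public class NameNodeHeapEstimate {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml/hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        ContentSummary summary = fs.getContentSummary(new Path("/"));
        long objects = summary.getFileCount() + summary.getDirectoryCount();
        double heapGb = objects / 1_000_000.0;      // ~1 GB of heap per million objects
        System.out.printf("~%d namespace objects -> at least %.1f GB of NameNode heap%n",
            objects, heapGb);
      }
    }

Whatever figure you arrive at still has to fit in the heap you actually give the NameNode (typically set via HADOOP_NAMENODE_OPTS in hadoop-env.sh), which is really the question above.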
On Mon, Jun 6, 2016 at 9:35 AM, Ascot Moss <[email protected]> wrote:

> Hi Aaron, from the MapR site, [now HDFS2] "Limit to 50-200 million files", is it really true?
>
> On Tue, Jun 7, 2016 at 12:09 AM, Aaron Eng <[email protected]> wrote:
>
>> As I said, MapRFS has topologies. You assign a volume (which is mounted at a directory path) to a topology, and in turn all the data for the volume (e.g. under the directory) is stored on the storage hardware assigned to the topology.
>>
>> These topological labels provide the same benefits as dfs.storage.policy as well as enabling additional types of use cases.
>>
>> On Mon, Jun 6, 2016 at 9:02 AM, Ascot Moss <[email protected]> wrote:
>>
>>> In HDFS2, I can find "dfs.storage.policy"; for instance, HDFS2 allows you to *apply the COLD storage policy to a directory*. Where are these features in Mapr-FS?
>>>
>>> On Mon, Jun 6, 2016 at 11:43 PM, Aaron Eng <[email protected]> wrote:
>>>
>>>> >Since MapR is proprietary, I find that it has many compatibility issues in Apache open source projects
>>>>
>>>> This is faulty logic. And rather than saying it has "many compatibility issues", perhaps you can describe one.
>>>>
>>>> Both MapRFS and HDFS are accessible through the same API. The backend implementations are what differ.
>>>>
>>>> >Hadoop has a built-in storage policy named COLD, where is it in Mapr-FS?
>>>>
>>>> Long before HDFS had storage policies, MapRFS had topologies. You can restrict particular types of storage to a topology and then assign a volume (a subset of the data stored in MapRFS) to the topology, and hence the data in that subset would be served by whatever hardware was mapped into the topology.
>>>>
>>>> >not to mention that Mapr-FS loses Data-Locality.
>>>>
>>>> This statement is false.
>>>>
>>>> On Mon, Jun 6, 2016 at 8:32 AM, Ascot Moss <[email protected]> wrote:
>>>>
>>>>> Since MapR is proprietary, I find that it has many compatibility issues in Apache open source projects, or even worse, loses Hadoop's features. For instance, Hadoop has a built-in storage policy named COLD; where is it in Mapr-FS? Not to mention that Mapr-FS loses Data-Locality.
>>>>>
>>>>> On Mon, Jun 6, 2016 at 11:26 PM, Ascot Moss <[email protected]> wrote:
>>>>>
>>>>>> I don't think HDFS2 needs a SAN; using the QuorumJournal approach is much better than using the shared edits directory (SAN) approach.
>>>>>>
>>>>>> On Monday, June 6, 2016, Peyman Mohajerian <[email protected]> wrote:
>>>>>>
>>>>>>> It is very common practice to back up the metadata in some SAN store, so a complete loss of all the metadata is preventable. You could lose a day's worth of data if, for example, you back up the metadata once a day, but you could do it more frequently. I'm not saying S3 or Azure Blob are bad ideas.
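On that last point, the fsimage can also be pulled off the NameNode without any SAN in the picture. A sketch of what a scheduled metadata backup could look like, assuming the caller has HDFS admin privileges and the Hadoop client libraries on the classpath (the backup directory is a placeholder); it is equivalent to running "hdfs dfsadmin -fetchImage" from cron:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hdfs.tools.DFSAdmin;
    import org.apache.hadoop.util.ToolRunner;

    // Sketch: download the most recent fsimage from the NameNode into a local
    // directory, the same thing "hdfs dfsadmin -fetchImage <dir>" does.
    // Assumes admin privileges and an existing, writable /backup/namenode.
    public class FetchFsImageBackup {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        int rc = ToolRunner.run(conf, new DFSAdmin(),
            new String[] {"-fetchImage", "/backup/namenode"});
        System.exit(rc);
      }
    }

Doing that on a schedule, and shipping the copy somewhere off the cluster, bounds how much metadata history you could lose, which is the trade-off Peyman describes above.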
>>>>>>> On Sun, Jun 5, 2016 at 8:19 AM, Marcin Tustin <[email protected]> wrote:
>>>>>>>
>>>>>>>> The namenode architecture is a source of fragility in HDFS. While a high availability deployment (with two namenodes and a failover mechanism) means you're unlikely to see a service interruption, it is still possible to have a complete loss of filesystem metadata with the loss of two machines.
>>>>>>>>
>>>>>>>> Secondly, because HDFS identifies datanodes by their hostname/IP, DNS changes can cause havoc with HDFS (see my war story on this here: https://medium.com/handy-tech/renaming-hdfs-datanodes-considered-terribly-harmful-2bc2f37aabab).
>>>>>>>>
>>>>>>>> Also, the namenode/datanode architecture probably does contribute to the small files problem being a problem. That said, there are a lot of practical solutions for the small files problem.
>>>>>>>>
>>>>>>>> If you're just setting up a data infrastructure, I would say consider alternatives before you pick HDFS. If you run in AWS, S3 is a good alternative. If you run in some other cloud, it's probably worth considering whatever their equivalent storage system is.
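Tying this back to the API point made earlier in the thread: the object stores Marcin mentions sit behind the same Hadoop FileSystem interface, so client code barely changes. A minimal sketch, assuming the hadoop-aws module (and its AWS SDK dependency) is on the classpath; the bucket name and credentials are placeholders:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: the same FileSystem API used for hdfs:// paths also serves
    // s3a:// paths when hadoop-aws is available. Bucket and keys below are
    // placeholders, not real values.
    public class ListS3Bucket {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");  // or rely on another credentials provider
        conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");
        FileSystem fs = FileSystem.get(URI.create("s3a://your-bucket/"), conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
          System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
      }
    }

The same program pointed at an hdfs://, maprfs://, or wasb:// URI behaves the same way from the application's perspective, which is why the earlier "accessible through the same API" comment holds.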
>>>>>>>> On Sat, Jun 4, 2016 at 7:43 AM, Ascot Moss <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I read some (old?) articles from the Internet about Mapr-FS vs HDFS:
>>>>>>>>>
>>>>>>>>> https://www.mapr.com/products/m5-features/no-namenode-architecture
>>>>>>>>>
>>>>>>>>> It states that HDFS Federation has
>>>>>>>>>
>>>>>>>>> a) "Multiple Single Points of Failure", is it really true? Why does MapR use HDFS rather than HDFS2 in its comparison? That would make it an unfair (or even misleading) comparison. (HDFS was from Hadoop 1.x, the old generation.) HDFS2 has been available since 2013-10-15, and there is no Single Point of Failure in HDFS2.
>>>>>>>>>
>>>>>>>>> b) "Limit to 50-200 million files", is it really true? I have seen so many real-world Hadoop clusters with over 10PB of data, some even with 150PB. If "Limit to 50-200 million files" were true in HDFS2, why are there so many production Hadoop clusters in the real world, and how do they manage the issue of "Limit to 50-200 million files"? For instance, Facebook's "Like" implementation runs on HBase at web scale; I can imagine HBase generates a huge number of files in Facebook's Hadoop cluster, so the number of files there should be much, much bigger than 50-200 million.
>>>>>>>>>
>>>>>>>>> From my point of view, in contrast, MaprFS should have a true limit of up to 1T files while HDFS2 can handle an effectively unlimited number of files; please do correct me if I am wrong.
>>>>>>>>>
>>>>>>>>> c) "Performance Bottleneck", again, is it really true? MaprFS does not have a namenode in order to gain file system performance. Without a NameNode, MaprFS would lose Data Locality, which is one of the beauties of Hadoop. If Data Locality is no longer available, any big data application running on MaprFS might gain some file system performance, but it would lose the much larger performance gain that comes from the Data Locality provided by Hadoop's NameNode (gain small, lose big).
>>>>>>>>>
>>>>>>>>> d) "Commercial NAS required"
>>>>>>>>> Is there any wiki/blog/discussion about Commercial NAS on Hadoop Federation?
>>>>>>>>>
>>>>>>>>> regards
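On the Data Locality point in (c): in HDFS, locality is simply the block-to-host mapping that the NameNode hands out to clients, and it is what MapReduce, YARN and Spark use to schedule tasks next to their data. You can see it directly from the client API; a small sketch (the file path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: print which datanodes hold each block of a file. This
    // block-to-host map is the information schedulers use for locality.
    public class ShowBlockLocations {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example.csv");                  // placeholder path
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.println("offset " + block.getOffset() + " -> "
              + String.join(",", block.getHosts()));
        }
      }
    }

Whether MapR-FS does or does not expose equivalent locality information is the point being disputed earlier in the thread; the sketch above only shows what the HDFS side of that argument looks like.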
