Hi,
Thanks for taking this into account.
As you seem to be interested in adapting the "per directory" HDFS
attributes regarding replication factor, I would also suggest adding
finer granularity to the DFS block size.
The idea would be to optimize the block size on a per-directory basis, so
that an application dealing with very large files could benefit from a
very large block size without impacting everyone else.
As I am not familiar with the code, I don't know whether this kind of
feature would break the simplicity of Hadoop.
Maybe this could be done by adding attributes to the HDFS storage policies.
For coherence's sake, it might also be a good idea to make the replication
factor an attribute of the HDFS storage policies.
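For context, the block size is already a client-side, per-write setting
(dfs.blocksize), so a per-directory default would mostly save applications
from having to pass it on every write. A rough sketch of the current
workaround (paths and sizes are only examples, and this assumes a running
cluster):

```shell
# dfs.blocksize is a client-side setting, so a single write can use a
# block size different from the cluster default (here 512 MiB):
hdfs dfs -D dfs.blocksize=536870912 -put bigfile.dat /data/large-files/

# %o prints the block size (in bytes) actually recorded for the file:
hdfs dfs -stat "blocksize=%o" /data/large-files/bigfile.dat
```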

Regards
Julien

On Thu, Jun 13, 2019 at 8:41 PM Wei-Chiu Chuang <weic...@apache.org> wrote:

> Thank you. I really appreciate your feedback as I don't always know the
> detailed use case for a feature. (For me, it's mostly "hey, this thing is
> broken, fix it")
>
> What does the rest of the community think? This is a great opportunity to
> share your thoughts.
>
> My answers inline:
>
> On Wed, Jun 12, 2019 at 1:12 AM Julien Laurenceau <
> julien.laurenc...@pepitedata.com> wrote:
>
>> Hi,
>>
>> I am not sure whether these are already on a roadmap or supported, but
>> I would appreciate the following two features:
>>
>> - First feature: I would like to be able to use a dedicated directory
>> in HDFS as a /tmp directory, leveraging RAMFS for high-performance
>> checkpointing of Spark jobs, without using Alluxio or Ignite.
>>
>> My current issue is that RAMFS is only useful with replication factor
>> x1 (in order to avoid network traffic).
>> My default replication factor is x3, but I would need a way to set
>> replication factor x1 on a specific directory (/tmp) for all new writes
>> coming to this directory.
>> Currently, if I use "hdfs setrep 1 /tmp", it only applies to blocks
>> already written.
>> For example, this could be done by specifying the replication factor at
>> the storage policy level.
>> In my view, this would dramatically increase the usefulness of the
>> Lazy_Persist storage policy.
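To make the current behavior concrete, this is roughly what it looks like
today (an untested sketch; per-write settings are client-side, and the
file names are only examples):

```shell
# setrep changes the replication of files that already exist under /tmp
# (-w waits until the change is applied):
hdfs dfs -setrep -w 1 /tmp

# ...but new files still follow the client's dfs.replication default
# (x3 here), unless each writer overrides it explicitly per write:
hdfs dfs -D dfs.replication=1 -put checkpoint.dat /tmp/
```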
>>
>
> I am told LAZY_PERSIST was never considered a completed feature, and two
> Hadoop distros, CDH and HDP, don't support it.
>
> But now that I understand the use case, it does look useful.
>
>> From the docs: "Note 1: The Lazy_Persist policy is useful only for
>> single-replica blocks. For blocks with more than one replica, all the
>> replicas will be written to DISK since writing only one of the replicas to
>> RAM_DISK does not improve the overall performance."
>> In the current state of HDFS configuration, I only see the following hack
>> (not tested) to implement such a solution: configure HDFS replication x1
>> as the default and use Erasure Coding RS(6,3) for the main storage by
>> attaching an EC storage policy to all directories except /tmp.
>>
>> hdfs ec -setPolicy -path <directory> [-policy <policyName>]
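Spelling out that hack a bit more (untested, as the original message says;
the /data path is only an example, and this assumes the built-in
RS-6-3-1024k policy of Hadoop 3):

```shell
# In hdfs-site.xml on the clients, make single-replica the default:
#   <property><name>dfs.replication</name><value>1</value></property>

# Enable the RS(6,3) policy and attach it to the main storage tree,
# leaving /tmp on plain (now x1) replication:
hdfs ec -enablePolicy -policy RS-6-3-1024k
hdfs ec -setPolicy -path /data -policy RS-6-3-1024k

# Check which policy applies to a directory:
hdfs ec -getPolicy -path /data
```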
>>
>>
>>
>> - Second feature: bandwidth throttling dedicated to re-replication after
>> a failed datanode.
>> Something similar to the balancer option
>> dfs.datanode.balance.bandwidthPerSec, but only for re-replication.
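Until such a knob exists, the closest existing levers I'm aware of throttle
re-replication indirectly, by limiting concurrency rather than bandwidth
(the values below are purely illustrative, not recommendations):

```shell
# hdfs-site.xml on the NameNode -- illustrative values only:
#
#   <property>
#     <name>dfs.namenode.replication.max-streams</name>
#     <value>2</value>   <!-- concurrent replication streams per DataNode -->
#   </property>
#   <property>
#     <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
#     <value>2</value>   <!-- replication work scheduled per heartbeat -->
#   </property>
```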
>>
> I am pretty sure people have asked about this a few times before.
>
>>
>> Thanks and regards
>> JL
>>
>> On Mon, Jun 10, 2019 at 7:08 PM Wei-Chiu Chuang
>> <weic...@cloudera.com.invalid> wrote:
>>
>>> Hi!
>>>
>>> I am soliciting feedback on HDFS roadmap items and a wish list for
>>> future Hadoop releases. A community meetup
>>> <https://www.meetup.com/Hadoop-Contributors/events/262055924/?rv=ea1_v2&_xtd=gatlbWFpbF9jbGlja9oAJGJiNTE1ODdkLTY0MDAtNDFiZS1iOTU5LTM5ZWYyMDU1N2Q4Nw>
>>> is happening soon, and perhaps we can use this thread to converge on things
>>> we should talk about there.
>>>
>>> I am aware of several major features that merged into trunk, such as
>>> RBF, Consistent Standby Serving Reads, as well as some recent features that
>>> merged into 3.2.0 release (storage policy satisfier).
>>>
>>> What else should we be doing? I have a laundry list of supportability
>>> improvement projects, mostly about improving performance or making
>>> performance diagnostics easier. I can share the list if folks are
>>> interested.
>>>
>>> Are there things we should do to make developers' lives easier, or things
>>> that would be nice to have for downstream applications? I know Sahil Takiar
>>> made a series of improvements in HDFS for Impala recently, and those
>>> improvements are applicable to other downstream projects such as HBase. Or
>>> would it help if we provided more Hadoop API examples?
>>>
>>
