Hi,

I am not sure whether these are already on the roadmap or already
supported, but I would appreciate the following two features:

- First feature: I would like to be able to use a dedicated HDFS directory
as a /tmp directory backed by RAMFS, for high-performance checkpointing of
Spark jobs without using Alluxio or Ignite.
My current issue is that RAMFS is only useful with a replication factor of
1 (in order to avoid the network).
My default replication factor is 3, so I would need a way to set a
replication factor of 1 on a specific directory (/tmp) that applies to all
new writes to this directory.
Currently, "hdfs dfs -setrep 1 /tmp" only affects data that has already
been written.
This could be done, for example, by specifying the replication factor at
the storage policy level.
In my view, this would make the Lazy_Persist storage policy dramatically
more useful.

From the documentation:

> Note 1: The Lazy_Persist policy is useful only for single replica
> blocks. For blocks with more than one replicas, all the replicas will be
> written to DISK since writing only one of the replicas to RAM_DISK does
> not improve the overall performance.

With the current HDFS configuration options, the only workaround I can see
(not tested) is to configure a replication factor of 1 as the cluster
default and to use erasure coding RS(6,3) for the main storage, by setting
an erasure coding policy on all directories except /tmp:

hdfs ec -setPolicy -path <directory> [-policy <policyName>]
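
Spelled out, the workaround might look roughly like the sketch below (not
tested on a cluster). It assumes a hypothetical /data directory for the main
storage, the built-in RS-6-3-1024k policy, and dfs.replication=1 already set
as the default in hdfs-site.xml. The run helper is only an illustration: it
prints each command unless RUN=1 is set.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): print each command unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

DATA_DIR=/data   # hypothetical main storage directory
TMP_DIR=/tmp     # scratch directory for Spark checkpoints

# Assumption: dfs.replication=1 is already the cluster default.

# Protect the main storage with erasure coding instead of 3x replication.
run hdfs ec -enablePolicy -policy RS-6-3-1024k
run hdfs ec -setPolicy -path "$DATA_DIR" -policy RS-6-3-1024k

# Let /tmp write its single replica to RAM_DISK first.
run hdfs storagepolicies -setStoragePolicy -path "$TMP_DIR" -policy LAZY_PERSIST
```

Without changing the default, a client can already ask for a single replica
on each write, for example with "hdfs dfs -D dfs.replication=1 -put", but
every writer has to do so itself, which is why a per-directory setting would
help.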



- Second feature: bandwidth throttling dedicated to re-replication after a
DataNode failure.
Something similar to the option for the balancer,
dfs.datanode.balance.bandwidthPerSec, but applying only to re-replication.
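
For comparison, here is a small sketch of the related controls I am aware of
today. The balancer limit can be changed at runtime, while the
re-replication settings I know of bound the number of concurrent streams
rather than the bandwidth. The 10 MB/s value is an arbitrary example, and
the run helper is a hypothetical dry-run wrapper that prints each command
unless RUN=1 is set.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): print each command unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Arbitrary example value: 10 MB/s, expressed in bytes per second.
BALANCER_BW=$((10 * 1024 * 1024))

# Existing throttle: balancer bandwidth per DataNode, applied at runtime.
run hdfs dfsadmin -setBalancerBandwidth "$BALANCER_BW"

# Existing re-replication limits are stream counts, not bandwidth.
run hdfs getconf -confKey dfs.namenode.replication.max-streams
run hdfs getconf -confKey dfs.namenode.replication.max-streams-hard-limit
```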

Thanks and regards
JL

On Mon, 10 Jun 2019 at 19:08, Wei-Chiu Chuang <weic...@cloudera.com.invalid>
wrote:

> Hi!
>
> I am soliciting feedback on HDFS roadmap items and wish-list features for
> future Hadoop releases. A community meetup
> <https://www.meetup.com/Hadoop-Contributors/events/262055924/?rv=ea1_v2&_xtd=gatlbWFpbF9jbGlja9oAJGJiNTE1ODdkLTY0MDAtNDFiZS1iOTU5LTM5ZWYyMDU1N2Q4Nw>
> is happening soon, and perhaps we can use this thread to converge on things
> we should talk about there.
>
> I am aware of several major features that merged into trunk, such as RBF,
> Consistent Standby Serving Reads, as well as some recent features that
> merged into 3.2.0 release (storage policy satisfier).
>
> What else should we be doing? I have a laundry list of supportability
> improvement projects, mostly about improving performance or making
> performance diagnostics easier. I can share the list if folks are
> interested.
>
> Are there things we should do to make developers' lives easier, or things
> that would be nice to have for downstream applications? I know Sahil Takiar
> made a series of improvements in HDFS for Impala recently, and those
> improvements are applicable to other downstreamers such as HBase. Or would
> it help if we provide more Hadoop API examples?
>
