Hi, I am not sure whether these are already supported or on the roadmap, but I would appreciate the following two features:
- First feature: the ability to use a dedicated directory in HDFS (such as /tmp) backed by RAMFS for high-performance checkpointing of Spark jobs, without resorting to Alluxio or Ignite. My current issue is that RAMFS is only useful with a replication factor of 1 (in order to avoid network traffic). My default replication factor is 3, but I would need a way to force replication factor 1 on a specific directory (/tmp) for all new writes to that directory. Currently, "hdfs setrep 1 /tmp" only affects blocks that are already written. This could be done, for example, by specifying the replication factor at the storage policy level. In my view this would dramatically improve the usefulness of the Lazy_Persist storage policy.

From the doc:
> Note 1: The Lazy_Persist policy is useful only for single replica blocks. For blocks with more than one replica, all the replicas will be written to DISK since writing only one of the replicas to RAM_DISK does not improve the overall performance.

In the current state of HDFS configuration, the only workaround I can see (not tested) is the following: configure a default replication factor of 1, and use Erasure Coding RS(6,3) for the main storage by attaching an EC storage policy to all directories except /tmp:

hdfs ec -setPolicy -path <directory> [-policy <policyName>]

- Second feature: bandwidth throttling dedicated to re-replication after a datanode failure. Something similar to the dfs.datanode.balance.bandwidthPerSec option used by the balancer, but applying only to re-replication.

Thanks and regards
JL

On Mon, Jun 10, 2019 at 7:08 PM, Wei-Chiu Chuang <weic...@cloudera.com.invalid> wrote:

> Hi!
>
> I am soliciting feedback for HDFS roadmap items and wish-list items in
> future Hadoop releases. A community meetup
> <https://www.meetup.com/Hadoop-Contributors/events/262055924/?rv=ea1_v2&_xtd=gatlbWFpbF9jbGlja9oAJGJiNTE1ODdkLTY0MDAtNDFiZS1iOTU5LTM5ZWYyMDU1N2Q4Nw>
> is happening soon, and perhaps we can use this thread to converge on things
> we should talk about there.
>
> I am aware of several major features that merged into trunk, such as RBF
> and Consistent Standby Serving Reads, as well as some recent features that
> merged into the 3.2.0 release (storage policy satisfier).
>
> What else should we be doing? I have a laundry list of supportability
> improvement projects, mostly about improving performance or making
> performance diagnostics easier. I can share the list if folks are
> interested.
>
> Are there things we should do to make developers' lives easier, or things
> that would be nice to have for downstream applications? I know Sahil Takiar
> made a series of improvements in HDFS for Impala recently, and those
> improvements are applicable to other downstreamers such as HBase. Or would
> it help if we provide more Hadoop API examples?
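P.S. For concreteness, the untested workaround I described above might look roughly like the following. This is only a sketch: the /data path and the RS-6-3-1024k policy name are examples (RS-6-3-1024k is the built-in RS(6,3) EC policy in Hadoop 3.x), and it assumes the datanodes already have RAM_DISK storage configured for Lazy_Persist to take effect.

```shell
# Untested sketch of the workaround: default replication 1, EC for the
# main storage, LAZY_PERSIST on /tmp. Paths and policy names are examples.

# 1. Set the cluster-wide default replication factor to 1 in hdfs-site.xml:
#      dfs.replication = 1
#    so that new files are written as single replicas by default.

# 2. Protect the main data directories with erasure coding RS(6,3)
#    instead of 3x replication:
hdfs ec -enablePolicy -policy RS-6-3-1024k
hdfs ec -setPolicy -path /data -policy RS-6-3-1024k

# 3. Attach the LAZY_PERSIST storage policy to /tmp so that single-replica
#    writes there can go to RAM_DISK:
hdfs storagepolicies -setStoragePolicy -path /tmp -policy LAZY_PERSIST

# 4. Verify the resulting policies:
hdfs ec -getPolicy -path /data
hdfs storagepolicies -getStoragePolicy -path /tmp
```

The obvious downside is that any directory not covered by an EC policy silently ends up with a single replica, which is why a per-directory (or per-storage-policy) replication factor would be much safer.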