Hi,

Thanks for taking this into account. Since you seem to be interested in adapting the per-directory HDFS attributes for the replication factor, I would also suggest adding finer granularity for the DFS block size. The idea would be to optimize the block size on a per-directory basis, so that an application dealing with very large files could benefit from a very large block size without impacting everyone else. As I am not familiar with the code, I don't know whether this kind of feature would break the simplicity of Hadoop. Maybe this could be done by adding attributes to the HDFS storage policies. For coherence's sake, it might also be a good idea to make the replication factor an attribute of the storage policies.
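As far as I know, the block size can already be overridden per file at write time via the generic `-D` option; a per-directory attribute would make this automatic rather than something every writer must remember. A minimal sketch (the file and directory names are illustrative):

```shell
# Write one file with a 512 MB block size instead of the cluster default
# (dfs.blocksize is a per-write client setting, not a directory attribute).
hdfs dfs -D dfs.blocksize=536870912 -put huge.dat /data/large/huge.dat

# Check which block size the file actually got.
hdfs fsck /data/large/huge.dat -files -blocks
```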
Regards,
Julien

On Thu, Jun 13, 2019 at 8:41 PM, Wei-Chiu Chuang <weic...@apache.org> wrote:

> Thank you. I really appreciate your feedback, as I don't always know the
> detailed use case for a feature. (For me, it's mostly "hey, this thing is
> broken, fix it".)
>
> What do the rest of the community think? This is a great opportunity to
> share your thoughts.
>
> My answers inline:
>
> On Wed, Jun 12, 2019 at 1:12 AM Julien Laurenceau <
> julien.laurenc...@pepitedata.com> wrote:
>
>> Hi,
>>
>> I am not absolutely sure whether these are already on a roadmap or
>> supported, but I would appreciate the following two features:
>>
>> - First feature: I would like to be able to use a dedicated directory
>> in HDFS as a /tmp directory, leveraging RAMFS for high-performance
>> checkpointing of Spark jobs without using Alluxio or Ignite.
>>
>> My current issue is that RAMFS is only useful with replication factor
>> x1 (in order to avoid the network).
>> My default replication factor is x3, but I would need a way to set
>> replication factor x1 on a specific directory (/tmp) for all new writes
>> coming to this directory.
>> Currently, if I use "hdfs setrep 1 /tmp", it only applies to blocks
>> already written.
>> For example, this could be done by specifying the replication factor at
>> the storage policy level.
>> In my view this would dramatically improve the usefulness of the
>> Lazy-persist storage policy.
>
> I am told LAZY_PERSIST was never considered a completed feature, and two
> Hadoop distros, CDH and HDP, don't support it.
>
> But now that I understand the use case, it looks useful.
>
> From the doc:
>> Note 1: The Lazy_Persist policy is useful only for single replica
>> blocks. For blocks with more than one replica, all the replicas will be
>> written to DISK, since writing only one of the replicas to RAM_DISK
>> does not improve the overall performance.
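As an inline illustration of the gap being discussed, the per-path knobs that exist today look roughly like this (a sketch; /tmp matches the example in the thread):

```shell
# Attach the LAZY_PERSIST storage policy so single-replica writes can
# land on RAM_DISK (falling back to DISK).
hdfs storagepolicies -setStoragePolicy -path /tmp -policy LAZY_PERSIST

# Lower replication to 1 -- but this only changes files that already
# exist under /tmp; new writes still inherit the cluster default,
# which is exactly the limitation described above.
hdfs setrep -w 1 /tmp
```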
>> In the current state of HDFS configuration, I only see the following
>> hack (not tested) to implement such a solution: configure HDFS
>> replication x1 as the default and use Erasure Coding RS(6,3) for the
>> main storage by attaching an EC storage policy to all directories
>> except /tmp:
>>
>> hdfs ec -setPolicy -path <directory> [-policy <policyName>]
>>
>> - Second feature: bandwidth throttling dedicated to re-replication in
>> the case of a failed datanode.
>> Something similar to the option dedicated to the balancing algorithm,
>> dfs.datanode.balance.bandwidthPerSec, but only for re-replication.
>
> I am pretty sure people have asked me about this a few times before.
>
>> Thanks and regards
>> JL
>>
>> On Mon, Jun 10, 2019 at 7:08 PM, Wei-Chiu Chuang
>> <weic...@cloudera.com.invalid> wrote:
>>
>>> Hi!
>>>
>>> I am soliciting feedback on HDFS roadmap items and wish-list items for
>>> future Hadoop releases. A community meetup
>>> <https://www.meetup.com/Hadoop-Contributors/events/262055924/?rv=ea1_v2&_xtd=gatlbWFpbF9jbGlja9oAJGJiNTE1ODdkLTY0MDAtNDFiZS1iOTU5LTM5ZWYyMDU1N2Q4Nw>
>>> is happening soon, and perhaps we can use this thread to converge on
>>> the things we should talk about there.
>>>
>>> I am aware of several major features that merged into trunk, such as
>>> RBF and Consistent Standby Serving Reads, as well as some recent
>>> features that merged into the 3.2.0 release (Storage Policy Satisfier).
>>>
>>> What else should we be doing? I have a laundry list of supportability
>>> improvement projects, mostly about improving performance or making
>>> performance diagnostics easier. I can share the list if folks are
>>> interested.
>>>
>>> Are there things we should do to make developers' lives easier, or
>>> things that would be nice to have for downstream applications? I know
>>> Sahil Takiar made a series of improvements in HDFS for Impala
>>> recently, and those improvements are applicable to other downstream
>>> projects such as HBase.
>>> Or would it help if we provided more Hadoop API examples?
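Spelling out the erasure-coding workaround mentioned earlier in the thread (untested, as the original message says; the /data path is illustrative, and RS-6-3-1024k is assumed to be the built-in RS(6,3) policy name on a typical Hadoop 3 cluster):

```shell
# With dfs.replication=1 as the cluster default (so /tmp writes stay
# single replica), protect the main data tree with erasure coding
# instead of replication. RS(6,3): 6 data + 3 parity blocks per stripe.
hdfs ec -enablePolicy -policy RS-6-3-1024k
hdfs ec -setPolicy -path /data -policy RS-6-3-1024k

# Confirm the policy is attached.
hdfs ec -getPolicy -path /data
```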