Thanks Vinoth for driving this major release, and thanks to everyone involved.

Vinoth Chandar <[email protected]> wrote on Wed, Jan 27, 2021 at 6:33 AM:
> Hello all,
>
> We are excited to share that the 0.7.0 release is out. It is by far our
> biggest release, with lots of code moving around, new unique features,
> and bug fixes.
>
> Please find more information here and provide feedback:
> http://hudi.apache.org/releases.html#release-070-docs
>
> A few quick highlights:
>
> *Clustering* <http://hudi.apache.org/releases.html#clustering>: 0.7.0
> brings the ability to cluster your Hudi tables, to optimize for file
> sizes and also storage layout. Hudi will continue to enforce file sizes
> during the write, as it always has. Clustering provides more flexibility
> to increase file sizes down the line, or to ingest data at much fresher
> intervals and later coalesce it into bigger files. This is very similar
> to the benefits of clustering delivered by cloud data warehouses
> <https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html>.
> We are proud to announce that such capability is freely available in
> open source, for the first time, through the 0.7.0 release.
>
> *Metadata Table*: 0.7.0 lays the foundation for storing more indexes and
> metadata in an internal metadata table, which is implemented using a
> Hudi MOR table - which means it's compacted, cleaned, and also
> incrementally updated like any other Hudi table. Setting
> hoodie.metadata.enable=true on the writer side will populate the
> metadata table with file system listings, so operations no longer have
> to explicitly call fs.listStatus() on data partitions. In our testing,
> on a large table with 250K files, the metadata table delivers a 2-3x
> speedup
> <https://github.com/apache/hudi/pull/2441#issuecomment-761742963> over
> the parallelized listing done by the Hudi Spark writer.
>
> Users can also leverage the metadata table on the query side for the
> following query paths.
> For Hive, setting the hoodie.metadata.enable=true session property, and
> for Spark SQL on Hive-registered tables, passing --conf
> spark.hadoop.hoodie.metadata.enable=true, allows the file listings for
> the partition to be fetched out of the metadata table, instead of
> listing the underlying DFS partition. Support for more engines is coming.
>
> *Java/Flink Writers*: In 0.7.0, we have additionally added Java and
> Flink based writers, as initial steps. Specifically, the
> HoodieFlinkStreamer allows a Hudi Copy-On-Write table to be built by
> streaming data from a Kafka topic.
>
> *Spark3 Support*: We have added support for writing/querying data using
> Spark 3. Please be sure to use the Scala 2.12 hudi-spark-bundle.
>
> *Insert Overwrite/Insert Overwrite Table*: We have added these two new
> write operation types, predominantly to help existing batch ETL jobs
> which typically overwrite entire tables/partitions on each run. These
> operations are much cheaper than issuing upserts, since they bulk
> replace the target table. Check here
> <http://hudi.apache.org/docs/quick-start-guide.html#insert-overwrite-table>
> for examples.
>
> *Incremental Query on MOR (Spark Datasource)*: The Spark datasource now
> has experimental support for incremental queries on MOR tables. This
> feature will be hardened and certified in the next release, along with a
> large overhaul of the Spark datasource implementation. (sshh! :))
>
> Thanks,
> Vinoth
> (on behalf of the Hudi Community)
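For anyone trying the metadata table on the write path, here is a minimal sketch of setting hoodie.metadata.enable=true through the Spark datasource, assuming Hudi 0.7.0 with the Scala 2.12 hudi-spark-bundle on the classpath. The table name, key fields, and paths are hypothetical placeholders, not from the announcement.

```scala
// Minimal sketch: writing a Hudi table with the internal metadata table
// enabled. Assumes the hudi-spark-bundle (Scala 2.12) is on the classpath;
// table name, field names, and paths below are hypothetical.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-metadata-table-sketch")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

val df = spark.read.json("/tmp/input")  // hypothetical input data

df.write.format("hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
  .option("hoodie.metadata.enable", "true")  // populate the internal metadata table on write
  .mode(SaveMode.Append)
  .save("/tmp/hudi/my_table")
```

On the query side, the same flag can be passed to Spark as described above, e.g. --conf spark.hadoop.hoodie.metadata.enable=true when launching spark-shell or spark-submit.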
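The insert overwrite operations mentioned above are selected via the write operation option. A hedged sketch of a batch ETL job replacing a partition's data follows; the staging path, table name, and key fields are hypothetical, and the quick start link in the announcement has the authoritative examples.

```scala
// Sketch: using the new insert_overwrite operation to bulk-replace data
// in a batch ETL job. Names and paths are hypothetical.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().getOrCreate()
val freshBatch = spark.read.parquet("/tmp/staging/latest")  // hypothetical input

freshBatch.write.format("hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
  // "insert_overwrite" replaces the partitions present in the incoming
  // batch; "insert_overwrite_table" replaces the entire table.
  .option("hoodie.datasource.write.operation", "insert_overwrite")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/my_table")
```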
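And for the experimental incremental query on MOR tables, a sketch of the Spark datasource read path, assuming a table written as above; the path and the begin instant time are hypothetical placeholders.

```scala
// Sketch: experimental incremental query on a MOR table via the Spark
// datasource. The table path and instant time below are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val incremental = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  // fetch only records committed after this instant time
  .option("hoodie.datasource.read.begin.instanttime", "20210101000000")
  .load("/tmp/hudi/my_mor_table")

incremental.show()
```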
