Thanks Vinoth for driving this major release, and thanks to everyone involved.

Vinoth Chandar <[email protected]> wrote on Wed, Jan 27, 2021 at 6:33 AM:
> Hello all,
>
> We are excited to share that the 0.7.0 release is out. It is by far our
> biggest release, with lots of code moving around, new unique features,
> and bug fixes.
>
> Please find more information here and provide feedback:
> http://hudi.apache.org/releases.html#release-070-docs
>
> A few quick highlights:
>
> *Clustering* <http://hudi.apache.org/releases.html#clustering>: 0.7.0
> brings the ability to cluster your Hudi tables, to optimize for file
> sizes and also storage layout. Hudi will continue to enforce file sizes
> during the write, as it always has. Clustering provides more flexibility
> to increase file sizes down the line, or to ingest data at much fresher
> intervals and later coalesce it into bigger files. This is very similar
> to the benefits of clustering delivered by cloud data warehouses
> <https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html>.
> We are proud to announce that such capability is freely available in
> open source, for the first time, through the 0.7.0 release.
>
> *Metadata Table*: 0.7.0 lays the foundation for storing more indexes and
> metadata in an internal metadata table, which is implemented using a
> Hudi MOR table - which means it's compacted, cleaned, and also
> incrementally updated like any other Hudi table. Setting
> hoodie.metadata.enable=true on the writer side will populate the
> metadata table with file system listings, so operations no longer have
> to explicitly call fs.listStatus() on data partitions. In our testing,
> on a large table with 250K files, the metadata table delivers a 2-3x
> speedup
> <https://github.com/apache/hudi/pull/2441#issuecomment-761742963> over
> the parallelized listing done by the Hudi Spark writer.
>
> Users can also leverage the metadata table on the query side for the
> following query paths.
> For Hive, setting the hoodie.metadata.enable=true session property, and
> for Spark SQL on Hive-registered tables, passing --conf
> spark.hadoop.hoodie.metadata.enable=true, allows the file listings for
> the partition to be fetched out of the metadata table, instead of
> listing the underlying DFS partition. Support for more engines is coming.
>
> *Java/Flink Writers*: In 0.7.0, we have additionally added Java and
> Flink based writers, as initial steps. Specifically, the
> HoodieFlinkStreamer allows a Hudi Copy-On-Write table to be built by
> streaming data from a Kafka topic.
>
> *Spark3 Support*: We have added support for writing/querying data using
> Spark 3. Please be sure to use the Scala 2.12 hudi-spark-bundle.
>
> *Insert Overwrite/Insert Overwrite Table*: We have added these two new
> write operation types, predominantly to help existing batch ETL jobs
> which typically overwrite entire tables/partitions on each run. These
> operations are much cheaper than issuing upserts, since they bulk
> replace the target table. Check here
> <http://hudi.apache.org/docs/quick-start-guide.html#insert-overwrite-table>
> for examples.
>
> *Incremental Query on MOR (Spark Datasource)*: The Spark datasource now
> has experimental support for incremental queries on MOR tables. This
> feature will be hardened and certified in the next release, along with a
> large overhaul of the Spark datasource implementation. (sshh! :))
>
> Thanks,
> Vinoth
> (on behalf of the Hudi Community)
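For anyone trying the metadata table on the write path, here is a minimal sketch of setting hoodie.metadata.enable=true through the Spark datasource, assuming Hudi 0.7.0 with the Scala 2.12 hudi-spark-bundle on the classpath. The table name, key fields, and paths are hypothetical placeholders, not from the announcement.

```scala
// Minimal sketch: writing a Hudi table with the internal metadata table
// enabled. Assumes the hudi-spark-bundle (Scala 2.12) is on the classpath;
// table name, field names, and paths below are hypothetical.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-metadata-table-sketch")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

val df = spark.read.json("/tmp/input")  // hypothetical input data

df.write.format("hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
  .option("hoodie.metadata.enable", "true")  // populate the internal metadata table on write
  .mode(SaveMode.Append)
  .save("/tmp/hudi/my_table")
```

On the query side, the same flag can be passed to Spark as described above, e.g. --conf spark.hadoop.hoodie.metadata.enable=true when launching spark-shell or spark-submit.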
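The insert overwrite operations mentioned above are selected via the write operation option. A hedged sketch of a batch ETL job replacing a partition's data follows; the staging path, table name, and key fields are hypothetical, and the quick start link in the announcement has the authoritative examples.

```scala
// Sketch: using the new insert_overwrite operation to bulk-replace data
// in a batch ETL job. Names and paths are hypothetical.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().getOrCreate()
val freshBatch = spark.read.parquet("/tmp/staging/latest")  // hypothetical input

freshBatch.write.format("hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
  // "insert_overwrite" replaces the partitions present in the incoming
  // batch; "insert_overwrite_table" replaces the entire table.
  .option("hoodie.datasource.write.operation", "insert_overwrite")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/my_table")
```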
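And for the experimental incremental query on MOR tables, a sketch of the Spark datasource read path, assuming a table written as above; the path and the begin instant time are hypothetical placeholders.

```scala
// Sketch: experimental incremental query on a MOR table via the Spark
// datasource. The table path and instant time below are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val incremental = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  // fetch only records committed after this instant time
  .option("hoodie.datasource.read.begin.instanttime", "20210101000000")
  .load("/tmp/hudi/my_mor_table")

incremental.show()
```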
