Hi all,
These are the meeting notes from today's community meeting.

Date: 2/23/2021
Attendees: Xinli Shang, Gábor Szádovszky, Gidon Gershinsky, Ryan Blue

1. Iceberg and Parquet
   1. Column ID vs. name
      1. Column resolution: Parquet relies on the name, while Iceberg relies on the ID. For example, column filtering/projection by ID would avoid many issues, not just in schema resolution.
   2. FilterAPI: Iceberg expressions cover more. It would be great if Parquet also supported them.
      1. IN, StartsWith, etc.
   3. How much effort is needed for Parquet to use the Iceberg filter API?
      1. It depends on how we do it. We could just move that code to Parquet, which would save time. But that is only one solution and might not be the best.
   4. Is the requirement generic across the industry or Iceberg-specific?
      1. The parquet-avro module has something similar.
      2. Pig resolves columns by position.
      3. So it is pretty generic.
   5. Should we create a parquet-iceberg module or make it generic?
      1. Making it generic would make more sense.
   6. Record materialization: read support has MessageColumnIO. In Iceberg, records are materialized faster. Flink and Spark run on the same API, so it is fairly general.
   7. Support vectorization into Arrow in Parquet
      1. This is a great idea. It would boost performance.
   8. To conclude, we can start with ID resolution first.
2. Parquet-12 release
   1. Once this PR <https://github.com/apache/parquet-mr/pull/868> is done, we can create an RC build.
3. Interop testing
   1. The idea is about how to create data structures for interop testing.
   2. Parquet test repo change.

Please let me know if you have any questions.

Xinli Shang | Tech Lead Manager @ Uber Data Infra