[C++][Parquet] Best practice to write duplicated strings / enums into parquet

Haocheng Liu Mon, 22 May 2023 08:53:10 -0700

Hi,

I have a use case that can be simplified as follows: there is a mapping
{0 -> "RED", 1 -> "GREEN", 2 -> "BLUE", etc.} and I need to write these
values hundreds of millions of times. Each row may contain tens of such
int -> string maps. When users read the data, they want to see "RED",
"GREEN" and "BLUE" rather than an opaque int.


According to the doc
<https://arrow.apache.org/docs/cpp/parquet.html#writer-properties>,
dictionary encoding is enabled by default so there are two possible
solutions:

1. Write strings via a StringBuilder and let Arrow do the encoding under
the hood.
2. Write enums (as ints) and provide the int -> string mapping in
metadata(?).
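
To make option 1 concrete, here is roughly what I understand dictionary
encoding to do, written as a stdlib-only sketch. It is not Arrow or Parquet
code: DictEncoded and DictionaryEncode are names I made up for illustration,
and the real writer may differ in detail.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical container for illustration; not an Arrow or Parquet type.
struct DictEncoded {
    std::vector<std::string> dictionary;  // each distinct value, stored once
    std::vector<std::int32_t> indices;    // one small integer per input row
};

// Replace every string with the index of its first occurrence.
DictEncoded DictionaryEncode(const std::vector<std::string>& values) {
    DictEncoded out;
    std::unordered_map<std::string, std::int32_t> seen;
    out.indices.reserve(values.size());
    for (const std::string& v : values) {
        auto it = seen.find(v);
        if (it == seen.end()) {
            // First time we see this value: assign it the next free index.
            const auto next = static_cast<std::int32_t>(out.dictionary.size());
            it = seen.emplace(v, next).first;
            out.dictionary.push_back(v);
        }
        out.indices.push_back(it->second);
    }
    return out;
}
```

With the input {"RED", "GREEN", "RED", "BLUE", "GREEN"} the dictionary holds
three strings and the rows become five integers, so the stored data stays
small. The cost I am worried about is the hash lookup per row on the write
side.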

Option 2 sounds preferable to me, as it avoids expensive string comparisons
and possible string copies. Can folks please confirm whether my
understanding is correct? If so, how do I provide the int -> string mapping
in metadata? If not, what's the best practice here?
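
For option 2, the shape I have in mind is something like the sketch below.
As far as I can tell, key-value metadata holds string keys and string values,
so the mapping would have to be flattened into one string. The "0=RED;1=GREEN"
format and both function names are my own invention, and the sketch assumes
labels contain neither '=' nor ';'.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: flatten {0: "RED", 1: "GREEN"} into "0=RED;1=GREEN".
std::string SerializeEnumMapping(const std::map<int, std::string>& mapping) {
    std::ostringstream out;
    bool first = true;
    for (const auto& kv : mapping) {
        if (!first) out << ';';
        out << kv.first << '=' << kv.second;
        first = false;
    }
    return out.str();
}

// Hypothetical helper: inverse of SerializeEnumMapping for the reader side.
// Entries without '=' are skipped; std::stoi throws on a non-numeric code.
std::map<int, std::string> ParseEnumMapping(const std::string& text) {
    std::map<int, std::string> mapping;
    std::istringstream in(text);
    std::string entry;
    while (std::getline(in, entry, ';')) {
        const std::string::size_type eq = entry.find('=');
        if (eq == std::string::npos) continue;
        mapping[std::stoi(entry.substr(0, eq))] = entry.substr(eq + 1);
    }
    return mapping;
}
```

The writer would store the serialized string under some agreed key, and
readers would parse it back to translate the ints. The downside is that
generic tools would still show raw ints unless they know about this
convention.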

Thanks in advance.

Regards,
Haocheng Liu
