Arrow can also represent dictionary-encoded data directly. If you are
already comfortable with StringBuilder, there is also a
StringDictionaryBuilder, which should be more or less compatible:

  TEST(TestStringDictionaryBuilder, Basic) {
    // Build the dictionary Array
    StringDictionaryBuilder builder;
    ASSERT_OK(builder.Append("RED"));
    ASSERT_OK(builder.Append("GREEN"));
    ASSERT_OK(builder.Append("RED"));
    std::shared_ptr<Array> result;
    ASSERT_OK(builder.Finish(&result));

    // Build expected data
    auto ex_dict = ArrayFromJSON(utf8(), "[\"RED\", \"GREEN\"]");
    auto dtype = dictionary(int8(), utf8());
    auto int_array = ArrayFromJSON(int8(), "[0, 1, 0]");
    DictionaryArray expected(dtype, int_array, ex_dict);
    ASSERT_TRUE(expected.Equals(result));
  }

If your encoding is fixed in advance (e.g. you must always represent
"GREEN" with 0 and "RED" with 1), then you can call InsertMemoValues
before appending anything to establish that encoding first:

  // Renamed here only so the two snippets do not share a test name.
  TEST(TestStringDictionaryBuilder, BasicWithMemoValues) {
    auto values = ArrayFromJSON(utf8(), R"(["GREEN", "RED"])");

    // Build the dictionary Array, seeding the memo table first
    StringDictionaryBuilder builder;
    ASSERT_OK(builder.InsertMemoValues(*values));
    ASSERT_OK(builder.Append("RED"));
    ASSERT_OK(builder.Append("GREEN"));
    ASSERT_OK(builder.Append("RED"));
    std::shared_ptr<Array> result;
    ASSERT_OK(builder.Finish(&result));

    // Build expected data
    auto ex_dict = ArrayFromJSON(utf8(), "[\"GREEN\", \"RED\"]");
    auto dtype = dictionary(int8(), utf8());
    auto int_array = ArrayFromJSON(int8(), "[1, 0, 1]");
    DictionaryArray expected(dtype, int_array, ex_dict);
    ASSERT_TRUE(expected.Equals(result));
  }

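To make the difference between the two tests concrete, here is a toy sketch of the idea in plain C++. It is not Arrow's implementation; the class name ToyDictionaryEncoder and its Seed/Append methods are made up for illustration, with Seed standing in for the role InsertMemoValues plays above. It only assumes that codes are handed out in the order values are first registered:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in for a dictionary builder (hypothetical, NOT Arrow's code).
class ToyDictionaryEncoder {
 public:
  // Pre-register values so they receive codes 0, 1, 2, ... in the given
  // order, without appending any rows.
  void Seed(const std::vector<std::string>& values) {
    for (const auto& v : values) CodeFor(v);
  }

  // Append one row: store the value's code, adding the value to the
  // dictionary first if it has not been seen before.
  void Append(const std::string& value) { indices_.push_back(CodeFor(value)); }

  const std::vector<std::string>& dictionary() const { return dictionary_; }
  const std::vector<int32_t>& indices() const { return indices_; }

 private:
  // Look up the code for a value via a hash map; assign the next free
  // code when the value is new.
  int32_t CodeFor(const std::string& value) {
    auto it = memo_.find(value);
    if (it != memo_.end()) return it->second;
    const int32_t code = static_cast<int32_t>(dictionary_.size());
    memo_.emplace(value, code);
    dictionary_.push_back(value);
    return code;
  }

  std::unordered_map<std::string, int32_t> memo_;
  std::vector<std::string> dictionary_;
  std::vector<int32_t> indices_;
};
```

Appending "RED", "GREEN", "RED" to a fresh encoder gives the dictionary ["RED", "GREEN"] with indices [0, 1, 0]. Calling Seed({"GREEN", "RED"}) first gives the dictionary ["GREEN", "RED"] with indices [1, 0, 1], which mirrors the two expected results in the tests above.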
On Mon, May 22, 2023 at 8:53 AM Haocheng Liu <[email protected]> wrote:
> Hi,
>
> I have a use case which can be simplified as follows: there is a mapping
> {0->"RED", 1->"GREEN", 2->"BLUE", etc.} and I need to write these values
> hundreds of millions of times. In each row, there may be tens of
> int -> string maps. When users read the data, they want to see "RED",
> "GREEN" and "BLUE" rather than some unclear int.
>
> According to the doc
> <https://arrow.apache.org/docs/cpp/parquet.html#writer-properties>,
> dictionary encoding is enabled by default so there are two possible
> solutions:
>
> 1. Write strings via a StringBuilder and let Arrow do the encoding under
> the hood.
> 2. Write enums (int) and provide the encoding in metadata(?).
>
> Option 2 sounds preferable to me as it does not require expensive string
> comparisons and possible string copies. Can folks please confirm whether
> my understanding is correct? If so, how do I provide the int->string
> mapping in metadata? If not, what's the best practice here?
>
> Thanks in advance.
>
> Regards,
> Haocheng Liu