StringDictionaryBuilder sounds like a perfect candidate for my use case. Thanks Weston!
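For anyone reading along, here is a minimal sketch of the idea behind dictionary encoding as I understand it, using only the C++ standard library. It is not Arrow's implementation: ToyDictionaryEncoder and its members are hypothetical names made up for illustration, and the int8 index width is an assumption chosen to mirror the int8() index type in the quoted tests. Each distinct string is stored once, and every row keeps only a small integer index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Toy dictionary encoder (hypothetical; NOT Arrow's implementation).
// Each distinct string is stored once; rows hold only an int8 index.
class ToyDictionaryEncoder {
 public:
  // Pre-register values so their indices are fixed before any row is
  // appended (the role InsertMemoValues plays in the quoted example).
  void InsertMemoValues(const std::vector<std::string>& values) {
    for (const auto& value : values) GetOrInsert(value);
  }

  // Append one row: look the string up (or add it) and store its index.
  void Append(const std::string& value) {
    indices_.push_back(GetOrInsert(value));
  }

  // Map a stored row back to the string the reader wants to see.
  const std::string& Decode(std::size_t row) const {
    assert(row < indices_.size());
    return dictionary_[static_cast<std::size_t>(indices_[row])];
  }

  const std::vector<std::string>& dictionary() const { return dictionary_; }
  const std::vector<std::int8_t>& indices() const { return indices_; }

 private:
  // Return the index of `value`, inserting it at the end if unseen.
  std::int8_t GetOrInsert(const std::string& value) {
    auto it = memo_.find(value);
    if (it != memo_.end()) return it->second;
    assert(dictionary_.size() < 128);  // toy limit imposed by int8 indices
    const auto index = static_cast<std::int8_t>(dictionary_.size());
    memo_.emplace(value, index);
    dictionary_.push_back(value);
    return index;
  }

  std::unordered_map<std::string, std::int8_t> memo_;
  std::vector<std::string> dictionary_;
  std::vector<std::int8_t> indices_;
};
```

With {"GREEN", "RED"} pre-registered, appending "RED", "GREEN", "RED" stores the indices 1, 0, 1, which is the same result the second quoted test expects. The string lookup happens once per append, but each distinct string is copied into the dictionary only once.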

On Mon, May 22, 2023 at 3:01 PM Weston Pace wrote:

> Arrow can also represent dictionary encoding. If you like StringBuilder
> then there is also a StringDictionaryBuilder which should be more or less
> compatible:
>
> TEST(TestStringDictionaryBuilder, Basic) {
>   // Build the dictionary Array
>   StringDictionaryBuilder builder;
>   ASSERT_OK(builder.Append("RED"));
>   ASSERT_OK(builder.Append("GREEN"));
>   ASSERT_OK(builder.Append("RED"));
>
>   std::shared_ptr<Array> result;
>   ASSERT_OK(builder.Finish(&result));
>
>   // Build expected data
>   auto ex_dict = ArrayFromJSON(utf8(), "[\"RED\", \"GREEN\"]");
>   auto dtype = dictionary(int8(), utf8());
>   auto int_array = ArrayFromJSON(int8(), "[0, 1, 0]");
>   DictionaryArray expected(dtype, int_array, ex_dict);
>
>   ASSERT_TRUE(expected.Equals(result));
> }
>
> If your encoding is standard (e.g. you must always represent "RED" with 1
> and "GREEN" with 0) then you can use InsertMemoValues to establish your
> encoding first:
>
> TEST(TestStringDictionaryBuilder, Basic) {
>   auto values = ArrayFromJSON(utf8(), R"(["GREEN", "RED"])");
>
>   // Build the dictionary Array
>   StringDictionaryBuilder builder;
>   ASSERT_OK(builder.InsertMemoValues(*values));
>   ASSERT_OK(builder.Append("RED"));
>   ASSERT_OK(builder.Append("GREEN"));
>   ASSERT_OK(builder.Append("RED"));
>
>   std::shared_ptr<Array> result;
>   ASSERT_OK(builder.Finish(&result));
>
>   // Build expected data
>   auto ex_dict = ArrayFromJSON(utf8(), "[\"GREEN\", \"RED\"]");
>   auto dtype = dictionary(int8(), utf8());
>   auto int_array = ArrayFromJSON(int8(), "[1, 0, 1]");
>   DictionaryArray expected(dtype, int_array, ex_dict);
>
>   ASSERT_TRUE(expected.Equals(result));
> }
>
> On Mon, May 22, 2023 at 8:53 AM Haocheng Liu wrote:
>
>> Hi,
>>
>> I have a use case which can be simplified as follows: there is a mapping
>> {0->"RED", 1->"GREEN", 2->"BLUE", etc.} and I need to write its values
>> hundreds of millions of times. In each row, there may be tens of
>> int -> string maps. When users read the data, they want to see "RED",
>> "GREEN" and "BLUE" rather than some unclear int.
>>
>> According to the doc
>> <https://arrow.apache.org/docs/cpp/parquet.html#writer-properties>,
>> dictionary encoding is enabled by default so there are two possible
>> solutions:
>>
>> 1. Write strings via a StringBuilder and let Arrow do the encoding under
>> the hood.
>> 2. Write enums (int) and provide the encoding in metadata(?).
>>
>> Option 2 sounds preferable to me as it does not require expensive string
>> comparisons and possible string copies. Can folks please advise whether my
>> understanding is correct? If so, how do I provide the int->string mapping
>> in metadata? If not, what's the best practice here?
>>
>> Thanks in advance.
>>
>> Regards,
>> Haocheng Liu
>>

--
Best regards
