StringDictionaryBuilder sounds like a perfect candidate for my use case.
Thanks Weston!
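
For anyone finding this thread later, here is a rough sketch of the idea behind a dictionary builder, using only the standard library. It is not Arrow's implementation: ToyDictionaryBuilder and its members are hypothetical names made up for illustration, and only the method names Append and InsertMemoValues are borrowed from the examples quoted below. Each distinct string is stored once, and each row stores only a small integer index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for a dictionary builder.
// Assumption: this only mirrors the concept, not Arrow's actual internals.
class ToyDictionaryBuilder {
 public:
  // Pre-register values so that they receive fixed indices, in order.
  void InsertMemoValues(const std::vector<std::string>& values) {
    for (const auto& value : values) GetOrInsert(value);
  }

  // Append one row: only the integer index is stored per row.
  void Append(const std::string& value) {
    indices_.push_back(GetOrInsert(value));
  }

  const std::vector<std::string>& dictionary() const { return dictionary_; }
  const std::vector<std::int32_t>& indices() const { return indices_; }

 private:
  // Return the index of `value`, adding it to the dictionary if it is new.
  std::int32_t GetOrInsert(const std::string& value) {
    auto it = memo_.find(value);
    if (it != memo_.end()) return it->second;
    // Guard against overflowing the index type.
    assert(dictionary_.size() <
           static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()));
    const auto index = static_cast<std::int32_t>(dictionary_.size());
    dictionary_.push_back(value);
    memo_.emplace(value, index);
    return index;
  }

  std::unordered_map<std::string, std::int32_t> memo_;  // string -> index
  std::vector<std::string> dictionary_;                 // index -> string
  std::vector<std::int32_t> indices_;                   // one entry per row
};
```

With InsertMemoValues({"GREEN", "RED"}) followed by appending "RED", "GREEN", "RED", the toy builder holds the indices 1, 0, 1 and the dictionary GREEN, RED, which matches the expected data in the second quoted test. The string lookup still happens once per appended row, but each string is copied into the dictionary only once.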

On Mon, May 22, 2023 at 3:01 PM Weston Pace <[email protected]> wrote:

> Arrow can also represent dictionary encoding.  If you like StringBuilder
> then there is also a StringDictionaryBuilder which should be more or less
> compatible:
>
> TEST(TestStringDictionaryBuilder, Basic) {
>   // Build the dictionary Array
>   StringDictionaryBuilder builder;
>   ASSERT_OK(builder.Append("RED"));
>   ASSERT_OK(builder.Append("GREEN"));
>   ASSERT_OK(builder.Append("RED"));
>
>   std::shared_ptr<Array> result;
>   ASSERT_OK(builder.Finish(&result));
>
>   // Build expected data
>   auto ex_dict = ArrayFromJSON(utf8(), "[\"RED\", \"GREEN\"]");
>   auto dtype = dictionary(int8(), utf8());
>   auto int_array = ArrayFromJSON(int8(), "[0, 1, 0]");
>   DictionaryArray expected(dtype, int_array, ex_dict);
>
>   ASSERT_TRUE(expected.Equals(result));
> }
>
> If your encoding is fixed in advance (e.g. you must always represent "RED"
> with 1 and "GREEN" with 0) then you can use InsertMemoValues to establish
> your encoding first:
>
> TEST(TestStringDictionaryBuilder, Basic) {
>   auto values = ArrayFromJSON(utf8(), R"(["GREEN", "RED"])");
>
>   // Build the dictionary Array
>   StringDictionaryBuilder builder;
>   ASSERT_OK(builder.InsertMemoValues(*values));
>   ASSERT_OK(builder.Append("RED"));
>   ASSERT_OK(builder.Append("GREEN"));
>   ASSERT_OK(builder.Append("RED"));
>
>   std::shared_ptr<Array> result;
>   ASSERT_OK(builder.Finish(&result));
>
>   // Build expected data
>   auto ex_dict = ArrayFromJSON(utf8(), "[\"GREEN\", \"RED\"]");
>   auto dtype = dictionary(int8(), utf8());
>   auto int_array = ArrayFromJSON(int8(), "[1, 0, 1]");
>   DictionaryArray expected(dtype, int_array, ex_dict);
>
>   ASSERT_TRUE(expected.Equals(result));
> }
>
> On Mon, May 22, 2023 at 8:53 AM Haocheng Liu <[email protected]> wrote:
>
>> Hi,
>>
>> I have a use case which can be simplified as follows: there is a mapping
>> {0->"RED", 1->"GREEN", 2->"BLUE", etc.} and I need to write its values
>> hundreds of millions of times. In each row, there may be tens of
>> int -> string maps. When users read the data, they want to see "RED",
>> "GREEN" and "BLUE" rather than some unclear int.
>>
>> According to the doc
>> <https://arrow.apache.org/docs/cpp/parquet.html#writer-properties>,
>> dictionary encoding is enabled by default so there are two possible
>> solutions:
>>
>> 1. Write strings via a StringBuilder and let Arrow do the encoding under
>> the hood.
>> 2. Write enums (ints) and provide the encoding in metadata(?).
>>
>> Option 2 sounds preferable to me, as it does not require expensive string
>> comparisons and possible string copies. Can folks please advise whether my
>> understanding is correct? If so, how do I provide the int->string mapping
>> in metadata? If not, what's the best practice here?
>>
>> Thanks in advance.
>>
>> Regards,
>> Haocheng Liu
>>
>>
>>

-- 
Best regards
