Arrow can also represent dictionary-encoded data directly.  If you are
already using StringBuilder, there is also a StringDictionaryBuilder, which
should be more or less compatible with it:

TEST(TestStringDictionaryBuilder, Basic) {
  // Build the dictionary Array
  StringDictionaryBuilder builder;
  ASSERT_OK(builder.Append("RED"));
  ASSERT_OK(builder.Append("GREEN"));
  ASSERT_OK(builder.Append("RED"));

  std::shared_ptr<Array> result;
  ASSERT_OK(builder.Finish(&result));

  // Build expected data
  auto ex_dict = ArrayFromJSON(utf8(), "[\"RED\", \"GREEN\"]");
  auto dtype = dictionary(int8(), utf8());
  auto int_array = ArrayFromJSON(int8(), "[0, 1, 0]");
  DictionaryArray expected(dtype, int_array, ex_dict);

  ASSERT_TRUE(expected.Equals(result));
}

If your encoding is fixed in advance (e.g. you must always represent "RED"
with 1 and "GREEN" with 0) then you can call InsertMemoValues first, before
appending any values, to establish that encoding:

TEST(TestStringDictionaryBuilder, InsertMemoValues) {
  auto values = ArrayFromJSON(utf8(), R"(["GREEN", "RED"])");

  // Build the dictionary Array
  StringDictionaryBuilder builder;
  ASSERT_OK(builder.InsertMemoValues(*values));
  ASSERT_OK(builder.Append("RED"));
  ASSERT_OK(builder.Append("GREEN"));
  ASSERT_OK(builder.Append("RED"));

  std::shared_ptr<Array> result;
  ASSERT_OK(builder.Finish(&result));

  // Build expected data
  auto ex_dict = ArrayFromJSON(utf8(), "[\"GREEN\", \"RED\"]");
  auto dtype = dictionary(int8(), utf8());
  auto int_array = ArrayFromJSON(int8(), "[1, 0, 1]");
  DictionaryArray expected(dtype, int_array, ex_dict);

  ASSERT_TRUE(expected.Equals(result));
}
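
To make the difference between the two tests above concrete, here is a toy
model of what a dictionary builder does with its memo table. This is only a
sketch using the standard library, not Arrow's implementation: the class
ToyDictionaryBuilder and the Demo function are made up for illustration, and
the real builder picks its index width adaptively where this one simply uses
int32.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a string dictionary builder (not an Arrow class).
class ToyDictionaryBuilder {
 public:
  // Pre-seed the dictionary so that each value gets a known index,
  // mirroring the role InsertMemoValues plays in the second test.
  void InsertMemoValues(const std::vector<std::string>& values) {
    for (const auto& value : values) GetOrInsert(value);
  }

  // Append one row: only the integer index is stored per row.
  void Append(const std::string& value) {
    indices_.push_back(GetOrInsert(value));
  }

  const std::vector<std::int32_t>& indices() const { return indices_; }
  const std::vector<std::string>& dictionary() const { return dictionary_; }

 private:
  // Return the index of an already-seen value, or assign the next free one.
  std::int32_t GetOrInsert(const std::string& value) {
    auto it = memo_.find(value);
    if (it != memo_.end()) return it->second;
    const auto index = static_cast<std::int32_t>(dictionary_.size());
    dictionary_.push_back(value);
    memo_.emplace(value, index);
    return index;
  }

  std::unordered_map<std::string, std::int32_t> memo_;
  std::vector<std::string> dictionary_;
  std::vector<std::int32_t> indices_;
};

// Replays both tests against the toy model.
void Demo() {
  // Without seeding, indices follow first-seen order: RED=0, GREEN=1.
  ToyDictionaryBuilder first_seen;
  for (const char* s : {"RED", "GREEN", "RED"}) first_seen.Append(s);
  assert(first_seen.dictionary() == (std::vector<std::string>{"RED", "GREEN"}));
  assert(first_seen.indices() == (std::vector<std::int32_t>{0, 1, 0}));

  // With seeding, the encoding is fixed up front: GREEN=0, RED=1.
  ToyDictionaryBuilder seeded;
  seeded.InsertMemoValues({"GREEN", "RED"});
  for (const char* s : {"RED", "GREEN", "RED"}) seeded.Append(s);
  assert(seeded.dictionary() == (std::vector<std::string>{"GREEN", "RED"}));
  assert(seeded.indices() == (std::vector<std::int32_t>{1, 0, 1}));
}
```

Each Append still costs one hash lookup on the string, but the string itself
is stored only once in the dictionary; every row holds just an integer.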

On Mon, May 22, 2023 at 8:53 AM Haocheng Liu <[email protected]> wrote:

> Hi,
>
> I have a use case which can be simplified as follows: there is a mapping
> {0 -> "RED", 1 -> "GREEN", 2 -> "BLUE", etc.} and I need to write these
> values hundreds of millions of times. In each row, there may be tens of
> int -> string maps. When users read the data, they want to see "RED",
> "GREEN" and "BLUE" rather than some unclear int.
>
> According to the doc
> <https://arrow.apache.org/docs/cpp/parquet.html#writer-properties>,
> dictionary encoding is enabled by default so there are two possible
> solutions:
>
> 1. Write strings via a StringBuilder and let Arrow do the encoding under
> the hood.
> 2. Write enums (int) and provide the encoding in metadata(?).
>
> Option 2 sounds preferable to me as it does not require expensive string
> comparisons and possible string copies. Can folks please advise whether my
> understanding is correct? If so, how do I provide the int -> string mapping
> in metadata? If not, what's the best practice here?
>
> Thanks in advance.
>
> Regards,
> Haocheng Liu
