Hello, two questions for y'all:

Q1) I'm trying out the Arrow IPC format and comparing it to Parquet. I'd like to
be able to write to a file in batches, with dictionary encoding for the
"String" columns. Per the format docs at
https://arrow.apache.org/docs/format/Columnar.html#dictionary-messages,
each batch should record a (replacement) dictionary when `isDelta` is false
(which it appears to be in the Java library). However, `ArrowFileWriter` and
`ArrowFileReader` seem to expect a single dictionary in the first batch and
fail to flush or read (respectively) dictionaries for subsequent batches. I
can easily tweak the internal state with reflection to flush and read the
dictionaries properly, but I'd like to know whether I'm missing something
before trying to fix the underlying code. See the code example below.

Q2) The dictionary recipe at
https://arrow.apache.org/cookbook/java/io.html#id21 and the unit tests point to
creating `VarCharVector`s that are then encoded and decoded into new
vectors for use. I'd rather avoid the encode/decode round-trip and work with
the dictionary indices directly (for contrast, I've sketched that round-trip at
the end of this message). Is the following code the best way to do it?

```
@Test
public void testMultiBatchWithDictionary() throws Exception {
  File file = new File("target/mytest_multi_dictionary.arrow");
  Map<String, Integer> stringToIndex = new HashMap<>();

  try (VarCharVector dictionaryVector = new VarCharVector("dictionary", allocator)) {
    DictionaryEncoding dictionaryEncoding =
        new DictionaryEncoding(42, false, new ArrowType.Int(16, false));

    Dictionary dictionary = new Dictionary(dictionaryVector, dictionaryEncoding);
    DictionaryProvider.MapDictionaryProvider provider =
        new DictionaryProvider.MapDictionaryProvider();
    provider.put(dictionary);

    try (UInt2Vector vector = new UInt2Vector(
        "vector",
        new FieldType(false, new ArrowType.Int(16, false), dictionaryEncoding),
        allocator)) {
      vector.allocateNew(4);
      dictionaryVector.allocateNew(4);

      dictionaryVector.set(0, "foo".getBytes(StandardCharsets.UTF_8));
      stringToIndex.put("foo", 0);
      dictionaryVector.set(1, "bar".getBytes(StandardCharsets.UTF_8));
      stringToIndex.put("bar", 1);

      vector.set(0, stringToIndex.get("foo"));
      vector.set(1, stringToIndex.get("bar"));
      vector.set(2, stringToIndex.get("bar"));
      vector.set(3, stringToIndex.get("foo"));

      vector.setValueCount(4);
      dictionaryVector.setValueCount(4); // NOTE: Should be 2 really.

      VectorSchemaRoot root = VectorSchemaRoot.of(dictionaryVector, vector);
      try (FileOutputStream fileOutputStream = new FileOutputStream(file);
           ArrowFileWriter arrowWriter =
               new ArrowFileWriter(root, provider, fileOutputStream.getChannel())) {

        // batch 1
        arrowWriter.start();
        arrowWriter.writeBatch();
        dictionaryVector.reset();
        vector.reset();
        stringToIndex.clear();

        // TODO - This is needed to write the next dictionary
        java.lang.reflect.Field dictionariesWritten =
            ArrowFileWriter.class.getDeclaredField("dictionariesWritten");
        dictionariesWritten.setAccessible(true);
        dictionariesWritten.set(arrowWriter, false);

        // note the order is different for the strings
        dictionaryVector.set(0, "bar".getBytes(StandardCharsets.UTF_8));
        stringToIndex.put("bar", 0);
        dictionaryVector.set(1, "foo".getBytes(StandardCharsets.UTF_8));
        stringToIndex.put("foo", 1);

        vector.set(0, stringToIndex.get("bar"));
        vector.set(1, stringToIndex.get("bar"));
        vector.set(2, stringToIndex.get("foo"));
        vector.set(3, stringToIndex.get("foo"));

        dictionaryVector.setValueCount(2);
        vector.setValueCount(4);

        arrowWriter.writeBatch();
        arrowWriter.end();
      }
    }
  }

  try (FileInputStream fileInputStream = new FileInputStream(file);
       ArrowFileReader reader =
           new ArrowFileReader(fileInputStream.getChannel(), allocator)) {
    VectorSchemaRoot root = reader.getVectorSchemaRoot();
    assertEquals(reader.getRecordBlocks().size(), 2);
    assertTrue(reader.loadNextBatch());

    FieldVector encoded = root.getVector("vector");
    DictionaryEncoding dictionaryEncoding = encoded.getField().getDictionary();
    Dictionary dictionary = reader.getDictionaryVectors().get(dictionaryEncoding.getId());
    try (ValueVector decoded = DictionaryEncoder.decode(encoded, dictionary)) {
      assertEquals(decoded.getObject(0).toString(), "foo");
      assertEquals(decoded.getObject(1).toString(), "bar");
      assertEquals(decoded.getObject(2).toString(), "bar");
      assertEquals(decoded.getObject(3).toString(), "foo");
    }

    assertTrue(reader.loadNextBatch());
    // TODO without the load, values are mapped only to the first dictionary.
    Method loadDictionary =
        ArrowReader.class.getDeclaredMethod("loadDictionary", ArrowDictionaryBatch.class);
    loadDictionary.invoke(reader, reader.readDictionary());

    try (ValueVector decoded = DictionaryEncoder.decode(encoded, dictionary)) {
      System.out.println(decoded);
      assertEquals(decoded.getObject(0).toString(), "bar");
      assertEquals(decoded.getObject(1).toString(), "bar");
      assertEquals(decoded.getObject(2).toString(), "foo");
      assertEquals(decoded.getObject(3).toString(), "foo");
    }
  }
}
```
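
For contrast, here is roughly the cookbook-style encode/decode round-trip from Q2 that I'd like to avoid, since both `encode()` and `decode()` allocate new vectors. This is only a rough sketch: it assumes the same imports as the test above plus `org.apache.arrow.memory.BufferAllocator`/`RootAllocator`, and the `sketchAllocator`/`dataVector` names are just for illustration.

```
// Cookbook-style: build a VarCharVector of raw strings, then encode/decode.
try (BufferAllocator sketchAllocator = new RootAllocator();
     VarCharVector dictionaryVector = new VarCharVector("dictionary", sketchAllocator);
     VarCharVector dataVector = new VarCharVector("data", sketchAllocator)) {

  // Dictionary values.
  dictionaryVector.allocateNew(2);
  dictionaryVector.set(0, "foo".getBytes(StandardCharsets.UTF_8));
  dictionaryVector.set(1, "bar".getBytes(StandardCharsets.UTF_8));
  dictionaryVector.setValueCount(2);

  // Raw string column that will be encoded.
  dataVector.allocateNew(4);
  dataVector.set(0, "foo".getBytes(StandardCharsets.UTF_8));
  dataVector.set(1, "bar".getBytes(StandardCharsets.UTF_8));
  dataVector.set(2, "bar".getBytes(StandardCharsets.UTF_8));
  dataVector.set(3, "foo".getBytes(StandardCharsets.UTF_8));
  dataVector.setValueCount(4);

  Dictionary dictionary = new Dictionary(dictionaryVector,
      new DictionaryEncoding(42, false, new ArrowType.Int(16, false)));

  // encode() allocates a new index vector; decode() allocates another VarCharVector.
  try (ValueVector encoded = DictionaryEncoder.encode(dataVector, dictionary);
       ValueVector decoded = DictionaryEncoder.decode(encoded, dictionary)) {
    // `encoded` holds the indices you would hand to the writer;
    // decoded.getObject(i) gives back the original strings.
  }
}
```

Compared to that, the test above fills the `UInt2Vector` indices directly and only decodes for the assertions, which is what I'm hoping is the intended usage.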
