Hi All,
I have code that creates a table with string columns as follows:
for (/* each column */) {
    // ...
    column_vectors.push(Vector.new(Data.Utf8(new Utf8(), 0, element_count,
        null_count, nullmap_buffer, offsets_buffer, data_buffer)));
}
const arrow_table = Table.new(column_vectors, column_names);
const data = arrow_table.serialize('binary', false).buffer;
const arrow_table2 = Table.from([new Uint8Array(data)]);
Here offsets_buffer is an Int32Array with the offsets and data_buffer is a
Uint8Array with the strings, in accordance with the Arrow format described in
https://arrow.apache.org/docs/format/Columnar.html.
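
To make the layout concrete, here is a minimal sketch of how such a pair of buffers can be built from a list of strings. It is only an illustration, not the actual producer: the helper name packUtf8 is hypothetical, and it assumes there are no null entries and that a global TextEncoder is available (browsers and recent Node.js).

```javascript
// Hypothetical helper: packs an array of strings into the offsets + data
// layout described above. Nulls are not handled in this sketch.
function packUtf8(strings) {
  const encoder = new TextEncoder();
  const encoded = strings.map((s) => encoder.encode(s));

  // One more offset than there are strings:
  // bytes offsets[i] .. offsets[i + 1] belong to string i.
  const offsets_buffer = new Int32Array(strings.length + 1);
  let total = 0;
  encoded.forEach((bytes, i) => {
    total += bytes.length;
    offsets_buffer[i + 1] = total;
  });

  // All UTF-8 bytes concatenated, with no separators or terminators.
  const data_buffer = new Uint8Array(total);
  encoded.forEach((bytes, i) => data_buffer.set(bytes, offsets_buffer[i]));

  return { offsets_buffer, data_buffer };
}

const packed = packUtf8(["foo", "", "quux"]);
console.log(Array.from(packed.offsets_buffer)); // [0, 3, 3, 7]
console.log(packed.data_buffer.length);         // 7
```

Note that offsets_buffer has one more entry than there are strings, which is why the element count can be recovered as offsets_buffer.length - 1.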
I am trying to change this to use dictionary encoding instead. I changed
the producer of the data to return only the unique strings in data_buffer
and offsets_buffer, and to additionally produce an interned_buffer
(Int32Array) holding, for each element, the index of its string among the
unique strings. However, I couldn't find how to initialize such a column in
JavaScript.
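
For clarity, the interning step amounts to something like the following sketch. The helper name internStrings is hypothetical and the real producer differs; this version ignores nulls and assigns indices in order of first appearance.

```javascript
// Hypothetical helper: dictionary-encodes strings into a list of unique
// values plus one Int32 index per element. Nulls are not handled here.
function internStrings(strings) {
  const index_of = new Map(); // string -> its position in `unique`
  const unique = [];
  const interned_buffer = new Int32Array(strings.length);

  strings.forEach((s, i) => {
    let index = index_of.get(s);
    if (index === undefined) {
      // First time this string is seen: append it to the dictionary.
      index = unique.length;
      index_of.set(s, index);
      unique.push(s);
    }
    interned_buffer[i] = index;
  });

  return { unique, interned_buffer };
}

const interned = internStrings(["a", "b", "a", "c", "b"]);
console.log(interned.unique);                      // ["a", "b", "c"]
console.log(Array.from(interned.interned_buffer)); // [0, 1, 0, 2, 1]
```

The unique strings are then what gets packed into data_buffer and offsets_buffer, while interned_buffer keeps one entry per element of the column.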
Shooting in the dark, I tried:
for (/* each column */) {
    // ...
    const dictionary = Vector.new(Data.Utf8(new Utf8(), 0,
        offsets_buffer.length - 1, 0, 0, offsets_buffer, data_buffer));
    column_vectors.push(Vector.new(Data.Dictionary(
        new Dictionary(new Utf8(), new Int32()), 0, element_count,
        null_count, nullmap_buffer, 0, interned_buffer, dictionary)));
}
// ...
However, this causes the deserialization (Table.from) to fail with:
TypeError: undefined has no properties
visitUtf8
visit
visit
visitMany
map
visitMany
_loadVectors
_loadDictionaryBatch
_readDictionaryBatch
open
open
from
What's the correct way of creating a dictionary-encoded column?
Yakov Galka
http://stannum.io/