Hi,
I have encountered a problem while saving an Apache Arrow table as a
Parquet file and reading the table back again. The dictionary types of the
table read from the Parquet file are not the same as the dictionary types
of the written table; more specifically, all integer index types come back
as int32. For example, a column that is stored with int8 indices is read
back with int32 indices:

storing: column name: dictionary<values=string, indices=int8, ordered=0>
reading: column name: dictionary<values=string, indices=int32, ordered=0>
I'm on Arrow version 8.0.0 on Windows 10.
Please advise on how to correct/prevent this bug/feature so that the
original dictionary types are preserved.
Best regards,
Matthieu
Code used to write the Parquet file:
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>

// Writes the table to Parquet, storing the original Arrow schema in the file metadata.
arrow::Status Table2Pqt(const std::shared_ptr<arrow::Table>& t,
                        const std::string& output_filepath) {
  ARROW_ASSIGN_OR_RAISE(auto outfile,
                        arrow::io::FileOutputStream::Open(output_filepath));
  return parquet::arrow::WriteTable(
      *t,
      arrow::default_memory_pool(),
      outfile,
      /*chunk_size=*/t->num_rows(),
      parquet::default_writer_properties(),
      parquet::ArrowWriterProperties::Builder().store_schema()->build());
}
Code used to read the Parquet file:
#include <arrow/filesystem/localfs.h>
#include <parquet/arrow/reader.h>

// Reads a Parquet file back into an Arrow table.
arrow::Status Pqt2Table(std::shared_ptr<arrow::Table>& t,
                        const std::string& pqt_filepath) {
  arrow::fs::LocalFileSystem fs;
  ARROW_ASSIGN_OR_RAISE(auto input, fs.OpenInputFile(pqt_filepath));
  std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(input, arrow::default_memory_pool(), &arrow_reader));
  return arrow_reader->ReadTable(&t);
}
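
And this is roughly how the round trip can be exercised to show the
mismatch. The RoundTrip helper and the file path are placeholders; it
assumes the two functions above and <iostream>:

#include <iostream>

// Illustrative sketch only: writes the table, reads it back, and prints both
// schemas; the second line shows indices=int32 instead of indices=int8.
arrow::Status RoundTrip(const std::shared_ptr<arrow::Table>& original,
                        const std::string& path) {
  ARROW_RETURN_NOT_OK(Table2Pqt(original, path));

  std::shared_ptr<arrow::Table> restored;
  ARROW_RETURN_NOT_OK(Pqt2Table(restored, path));

  std::cout << "written: " << original->schema()->ToString() << std::endl;
  std::cout << "read:    " << restored->schema()->ToString() << std::endl;
  return arrow::Status::OK();
}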