Alright, I got it working:
parquet::WriterProperties::Builder file_writer_options_builder;
file_writer_options_builder.compression(arrow::Compression::BROTLI);
// file_writer_options_builder.compression(arrow::Compression::UNCOMPRESSED);
std::shared_ptr<parquet::WriterProperties> props =
    file_writer_options_builder.build();

// DefaultWriteOptions() returns the base-class pointer, so downcast to
// ParquetFileWriteOptions to reach the writer_properties field.
// checked_pointer_cast lives in arrow/util/checked_cast.h.
std::shared_ptr<ds::FileWriteOptions> file_write_options =
    format->DefaultWriteOptions();
auto parquet_options =
    arrow::internal::checked_pointer_cast<ds::ParquetFileWriteOptions>(
        file_write_options);
parquet_options->writer_properties = props;

ds::FileSystemDatasetWriteOptions write_options;
write_options.file_write_options = parquet_options;
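
The rest of the write options are unchanged from my original snippet
(quoted below):

write_options.filesystem = filesystem;
write_options.base_dir = base_path;
write_options.partitioning = partitioning;
write_options.basename_template = "part{i}.parquet";
ABORT_ON_FAILURE(ds::FileSystemDataset::Write(write_options, scanner));
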
But surely a call to arrow::internal is not the intended usage?
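
For what it's worth, a plain std::static_pointer_cast expresses the same
downcast without reaching into arrow::internal (a sketch, assuming the
format is known to be Parquet so the downcast is safe):

auto parquet_options = std::static_pointer_cast<ds::ParquetFileWriteOptions>(
    format->DefaultWriteOptions());
parquet_options->writer_properties = props;
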
On Sat, May 22, 2021 at 8:52 PM Xander Dunn <[email protected]> wrote:
> I see how to compress writes to a particular file using
> arrow::io::CompressedOutputStream::Make, but I’m having difficulty
> figuring out how to make Dataset writes compressed. I have my code set up
> similarly to the CreateExampleParquetHivePartitionedDataset example here
> <https://github.com/apache/arrow/blob/master/cpp/examples/arrow/dataset_documentation_example.cc#L113>.
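>
> For reference, the single-file version looks roughly like this (a sketch,
> assuming a surrounding function that returns arrow::Status; "path" is a
> placeholder for the destination file):
>
> // Create a codec for the desired compression and wrap the raw stream.
> // The codec must stay alive as long as the compressed stream is in use.
> ARROW_ASSIGN_OR_RAISE(auto codec,
>     arrow::util::Codec::Create(arrow::Compression::BROTLI));
> ARROW_ASSIGN_OR_RAISE(auto raw, filesystem->OpenOutputStream(path));
> ARROW_ASSIGN_OR_RAISE(auto compressed,
>     arrow::io::CompressedOutputStream::Make(codec.get(), raw));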
>
>
> I suspect there is some option on the FileSystemDatasetWriteOptions to
> specify compression, but I haven’t been able to uncover it:
>
> ds::FileSystemDatasetWriteOptions write_options;
> write_options.file_write_options = format->DefaultWriteOptions();
> write_options.filesystem = filesystem;
> write_options.base_dir = base_path;
> write_options.partitioning = partitioning;
> write_options.basename_template = "part{i}.parquet";
> ABORT_ON_FAILURE(ds::FileSystemDataset::Write(write_options, scanner));
>
> FileSystemDatasetWriteOptions is defined here
> <https://github.com/apache/arrow/blob/602a76ac58bc8de60a353648f02cf11891563e77/cpp/src/arrow/dataset/file_base.h#L331>
> and doesn’t have a compression option.
>
> The file_write_options property is a ParquetFileWriteOptions, which is
> defined here
> <https://github.com/apache/arrow/blob/8b4942728e7347dc921a2d423e996fea5f9e2102/cpp/src/arrow/dataset/file_parquet.h#L222>
> and has a parquet::WriterProperties and parquet::ArrowWriterProperties.
> It’s created here:
>
> std::shared_ptr<FileWriteOptions> ParquetFileFormat::DefaultWriteOptions() {
>   std::shared_ptr<ParquetFileWriteOptions> options(
>       new ParquetFileWriteOptions(shared_from_this()));
>   options->writer_properties = parquet::default_writer_properties();
>   options->arrow_writer_properties =
>       parquet::default_arrow_writer_properties();
>   return options;
> }
>
> parquet::WriterProperties can be created with a compression specified
> like this:
>
> parquet::WriterProperties::Builder file_writer_options_builder;
> file_writer_options_builder.compression(arrow::Compression::BROTLI);
> std::shared_ptr<parquet::WriterProperties> props =
>     file_writer_options_builder.build();
>
> However, I have been unable to create a FileWriteOptions that includes
> this WriterProperties. What is shared_from_this()? Attempting to create a
> FileWriteOptions with std::make_shared<> doesn’t compile. Any pointers on
> creating a FileWriteOptions in my project, or a better way to specify the
> compression type on a dataset write?
>