One minor note: specifying compression in parquet::WriterProperties
produces a slightly different file than one created with
arrow::io::CompressedOutputStream::Make. The former sets the default
compression parquet uses for column data (you can also specify a
per-column compression scheme if desired) and is specific to parquet.
The latter compresses the entire file and can be used with any output
format.

What you have should be fine. There is currently no way (that I am
aware of) to specify file-wide compression on dataset writes. This
will probably become a more important feature once CSV support (or
support for some other format that doesn't natively handle
compression) is added for dataset writes.
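
On the cast question further down the thread: as Micah describes,
checked_pointer_cast behaves like std::static_pointer_cast in release
builds and std::dynamic_pointer_cast in debug builds, so a plain
standard-library cast should work equally well. Below is a minimal,
self-contained sketch of that pattern. The types in it are hypothetical
stand-ins that only mirror the class relationships quoted below, not the
real Arrow classes; CheckedPointerCast, WithCompression and the string
"compression" field are names made up for illustration. It also shows
what shared_from_this() does: it returns a shared_ptr to an object that
is already owned by a shared_ptr, which is how DefaultWriteOptions() can
hand the format to the options it creates.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins: they mimic the base/derived relationships
// discussed in this thread, NOT the real Arrow dataset API.
struct FileFormat;

struct FileWriteOptions {
  explicit FileWriteOptions(std::shared_ptr<FileFormat> f) : format(std::move(f)) {}
  virtual ~FileWriteOptions() = default;  // polymorphic, so dynamic casts work
  std::shared_ptr<FileFormat> format;
};

struct ParquetFileWriteOptions : FileWriteOptions {
  using FileWriteOptions::FileWriteOptions;
  // Stand-in for the writer_properties member (assumption for illustration).
  std::string compression = "UNCOMPRESSED";
};

struct FileFormat : std::enable_shared_from_this<FileFormat> {
  virtual ~FileFormat() = default;
  virtual std::shared_ptr<FileWriteOptions> DefaultWriteOptions() = 0;
};

struct ParquetFileFormat : FileFormat {
  std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override {
    // shared_from_this() shares ownership with the shared_ptr that already
    // manages *this; it requires that such a shared_ptr exists.
    return std::make_shared<ParquetFileWriteOptions>(shared_from_this());
  }
};

// Sketch of the debug/release switch described in the thread.
template <typename To, typename From>
std::shared_ptr<To> CheckedPointerCast(const std::shared_ptr<From>& value) {
#ifdef NDEBUG
  return std::static_pointer_cast<To>(value);   // release: no runtime check
#else
  return std::dynamic_pointer_cast<To>(value);  // debug: null on type mismatch
#endif
}

// Downcast with the standard library only and set the (stand-in) codec.
// Returns nullptr if the format's options are not ParquetFileWriteOptions.
inline std::shared_ptr<ParquetFileWriteOptions> WithCompression(
    const std::shared_ptr<FileFormat>& format, const std::string& codec) {
  assert(format != nullptr);
  std::shared_ptr<FileWriteOptions> base = format->DefaultWriteOptions();
  auto parquet_options = std::dynamic_pointer_cast<ParquetFileWriteOptions>(base);
  if (parquet_options) parquet_options->compression = codec;
  return parquet_options;
}
```

With these stand-ins, WithCompression(std::make_shared<ParquetFileFormat>(),
"BROTLI") yields options whose compression field is "BROTLI" and whose
format member points back at the same format object. In the real code
the equivalent would presumably be
std::dynamic_pointer_cast<ds::ParquetFileWriteOptions>(file_write_options)
followed by a null check.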

On Sat, May 22, 2021 at 9:17 PM Micah Kornfield <[email protected]> wrote:
>
> internal::checked_pointer_cast isn't really anything special.  It simply 
> switches between std::static_pointer_cast<T> and std::dynamic_pointer_cast<T> 
> depending on debug/release compilation. So you can choose one or the other 
> depending on how confident you are in the type you are casting.
>
>
> On Sat, May 22, 2021 at 9:23 PM Xander Dunn <[email protected]> wrote:
>>
>> Alright, I got it working:
>>
>>     parquet::WriterProperties::Builder file_writer_options_builder;
>>     file_writer_options_builder.compression(arrow::Compression::BROTLI);
>>     //file_writer_options_builder.compression(arrow::Compression::UNCOMPRESSED);
>>     std::shared_ptr<parquet::WriterProperties> props = file_writer_options_builder.build();
>>
>>     std::shared_ptr<ds::FileWriteOptions> file_write_options = format->DefaultWriteOptions();
>>     auto parquet_options = arrow::internal::checked_pointer_cast<ds::ParquetFileWriteOptions>(file_write_options);
>>     parquet_options->writer_properties = props;
>>     arrow::dataset::FileSystemDatasetWriteOptions write_options;
>>     write_options.file_write_options = parquet_options;
>>
>> But surely a call to arrow::internal is not the intended usage?
>>
>>
>> On Sat, May 22, 2021 at 8:52 PM Xander Dunn <[email protected]> wrote:
>>>
>>> I see how to compress writes to a particular file using 
>>> arrow::io::CompressedOutputStream::Make, but I’m having difficulty figuring 
>>> out how to make Dataset writes compressed. I have my code set up similar to 
>>> the CreateExampleParquetHivePartitionedDataset example here.
>>>
>>> I suspect there is some option on the FileSystemDatasetWriteOptions to 
>>> specify compression, but I haven’t been able to uncover it:
>>>
>>>   ds::FileSystemDatasetWriteOptions write_options;
>>>   write_options.file_write_options = format->DefaultWriteOptions();
>>>   write_options.filesystem = filesystem;
>>>   write_options.base_dir = base_path;
>>>   write_options.partitioning = partitioning;
>>>   write_options.basename_template = "part{i}.parquet";
>>>   ABORT_ON_FAILURE(ds::FileSystemDataset::Write(write_options, scanner));
>>>
>>> FileSystemDatasetWriteOptions is defined here and doesn’t have a 
>>> compression option.
>>>
>>> The file_write_options property is a ParquetFileWriteOptions, which is 
>>> defined here and has a parquet::WriterProperties and 
>>> parquet::ArrowWriterProperties. It’s created here:
>>>
>>> std::shared_ptr<FileWriteOptions> ParquetFileFormat::DefaultWriteOptions() {
>>>   std::shared_ptr<ParquetFileWriteOptions> options(
>>>       new ParquetFileWriteOptions(shared_from_this()));
>>>   options->writer_properties = parquet::default_writer_properties();
>>>   options->arrow_writer_properties = parquet::default_arrow_writer_properties();
>>>   return options;
>>> }
>>>
>>> parquet::WriterProperties can be created with a compression specified like 
>>> this:
>>>
>>>     parquet::WriterProperties::Builder file_writer_options_builder;
>>>     file_writer_options_builder.compression(arrow::Compression::BROTLI);
>>>     std::shared_ptr<parquet::WriterProperties> props = file_writer_options_builder.build();
>>>
>>> However, I have been unable to create a FileWriteOptions which includes 
>>> this WriterProperties. What is shared_from_this()? Creating a 
>>> FileWriteOptions with std::make_shared<> doesn’t compile. Any pointers on 
>>> creating a FileWriteOptions in my project, or a better way to specify the 
>>> compression type on a dataset write?
