Hi Weston,

Thanks again for your suggestion; I was able to come up with a semi-working
example with an Arrow write node. Now I run into two issues:


   - The input data has a few columns whose type is *int64_t*. This triggers
   an Arrow error, "Invalid: Casting from timestamp[ns] to timestamp[us]
   would lose data: XXXXXX", in column_writer.cc
   <https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_writer.cc#L1934-L1938>.
   Can you advise how I can fix this conversion issue via
   FileSystemDatasetWriteOptions or something else? My bespoke format uses
   int64_t to represent nanoseconds since the epoch, and I have not found a
   way to set that in FileSystemDatasetWriteOptions. (My current guess is
   shown in the sketch after this list.)
   - In the BespokeReaderNode, I confirmed that I send all columns and rows
   to the write node. However, the output data only has a subset of the
   partition keys, and the whole directory contains only *empty* .parquet
   files... Any suggestions?
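
For the first issue, my current (untested) guess is that I need to set Parquet
writer properties on the ParquetFileWriteOptions that I assign to
write_options.file_write_options, either targeting Parquet format 2.6 so
nanoseconds are kept, or coercing to microseconds while explicitly allowing
the lossy cast. Is something like this the right knob?

```
#include <arrow/dataset/file_parquet.h>
#include <parquet/properties.h>

// Untested sketch -- just my guess at the relevant knobs; probably only one
// of the two options below is needed.
auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
auto parquet_options =
    std::static_pointer_cast<arrow::dataset::ParquetFileWriteOptions>(
        format->DefaultWriteOptions());

// Option A: keep nanoseconds by writing Parquet format version 2.6.
parquet_options->writer_properties =
    parquet::WriterProperties::Builder()
        .version(parquet::ParquetVersion::PARQUET_2_6)
        ->build();

// Option B: coerce timestamps to microseconds and explicitly allow truncation.
parquet_options->arrow_writer_properties =
    parquet::ArrowWriterProperties::Builder()
        .coerce_timestamps(arrow::TimeUnit::MICRO)
        ->allow_truncated_timestamps()
        ->build();

// Then hand these to the dataset write options used below.
write_options.file_write_options = parquet_options;
```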


The code looks like this:
```
BespokeReader reader = OpenBespokeReaderForGiantFile(...);
std::shared_ptr<RecordBatchReader> rb_reader = reader.ToRecordBatchReader();
RecordBatchReaderSourceNodeOptions source_options{rb_reader};
// According to the execution_plan_documentation_example, I should not specify
// which columns to partition on, nor the partitioning, here at the source?
Declaration record_batch_reader_source{"record_batch_reader_source",
                                       std::move(source_options)};
....

/// Here is an almost copy-paste from
/// https://github.com/apache/arrow/blob/main/cpp/examples/arrow/execution_plan_documentation_examples.cc#L647
FileSystemDatasetWriteOptions write_options = CreateWriteOptions(...);
/// I specify which columns to partition on and the hive partitioning in
/// write_options.<parameters> (see the sketch of CreateWriteOptions after
/// this code block).
.......

WriteNodeOptions write_node_options(write_options);
Declaration write{"write", std::move(write_node_options)};
ARROW_RETURN_NOT_OK(Declaration::Sequence({record_batch_reader_source, write})
                        .AddToPlan(plan.get()));
ARROW_RETURN_NOT_OK(plan->Validate());
plan->StartProducing();
ARROW_RETURN_NOT_OK(plan->finished().status());
```
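
For reference, my CreateWriteOptions(...) helper roughly does the following.
The partition field names, filesystem, and basename template below are
placeholders standing in for my real ones, so please read it as a sketch of
the shape rather than the exact code:

```
#include <arrow/api.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/dataset/partition.h>
#include <arrow/filesystem/filesystem.h>

// Rough shape of my CreateWriteOptions(...); field names and the basename
// template are placeholders.
arrow::dataset::FileSystemDatasetWriteOptions CreateWriteOptions(
    std::shared_ptr<arrow::fs::FileSystem> filesystem, std::string base_dir) {
  // Hive-style partitioning on the key columns.
  auto partition_schema = arrow::schema({arrow::field("year", arrow::int32()),
                                         arrow::field("month", arrow::int32())});
  auto partitioning =
      std::make_shared<arrow::dataset::HivePartitioning>(partition_schema);

  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();

  arrow::dataset::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = std::move(filesystem);
  write_options.base_dir = std::move(base_dir);
  write_options.partitioning = partitioning;
  write_options.basename_template = "part{i}.parquet";
  return write_options;
}
```

This also relates to the question in the source-node comment above: my
assumption is that the partitioning only needs to be configured here in
write_options, not at the source node. Please correct me if that is wrong.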

Thanks.

Best,
Haocheng
