Hi Weston,

Thanks again for your suggestion; I was able to put together a semi-working example with an Arrow write node. Now I run into two issues:
- The input data has a few columns whose type is *int64_t*. This triggers an Arrow error, "Invalid: Casting from timestamp[ns] to timestamp[us] would lose data: XXXXXX", in column_writer.cc <https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_writer.cc#L1934-L1938>. Can you guide me on how to fix this conversion issue with FileSystemDatasetWriteOptions or something else? My bespoke format uses int64_t to represent nanoseconds since the epoch, and I failed to find a way to express that in FileSystemDatasetWriteOptions (my best guess so far is sketched after the code below).
- In the BespokeReaderNode, I confirmed that I send all columns and rows to the writer node. However, the output only covers a subset of the partition keys, and the whole directory contains only *empty* .parquet files... Any suggestions?

The code looks roughly like this:

```
BespokeReader reader = OpenBespokeReaderForGiantFile(...);
std::shared_ptr<arrow::RecordBatchReader> rb_reader = reader.ToRecordBatchReader();
RecordBatchReaderSourceNodeOptions source_options{rb_reader};
Declaration record_batch_reader_source{"record_batch_reader_source", std::move(source_options)};
// According to the execution_plan_documentation_example, I should not need to
// specify the partition columns or the partitioning here?
....
/// Here is an almost copy-paste from
/// https://github.com/apache/arrow/blob/main/cpp/examples/arrow/execution_plan_documentation_examples.cc#L647
FileSystemDatasetWriteOptions write_options = CreateWriteOptions(...);
/// I specify which columns to partition on and the hive partitioning in write_options.<parameters>
.......
WriteNodeOptions write_node_options(write_options);
Declaration write{"write", std::move(write_node_options)};
ARROW_RETURN_NOT_OK(Declaration::Sequence({record_batch_reader_source, write}).AddToPlan(plan.get()));
ARROW_RETURN_NOT_OK(plan->Validate());
ARROW_RETURN_NOT_OK(plan->StartProducing());
ARROW_RETURN_NOT_OK(plan->finished().status());
```
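In case it helps, the write options are set up roughly along these lines; the filesystem, output path, and partition column below are placeholders for this email, not the real values:

```
#include "arrow/dataset/file_parquet.h"
#include "arrow/dataset/partition.h"
#include "arrow/filesystem/localfs.h"

// Placeholder sketch of what CreateWriteOptions produces (real paths/columns differ).
arrow::dataset::FileSystemDatasetWriteOptions write_options;
auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
write_options.file_write_options = format->DefaultWriteOptions();
write_options.filesystem = std::make_shared<arrow::fs::LocalFileSystem>();
write_options.base_dir = "/tmp/bespoke_out";  // placeholder path
write_options.partitioning = std::make_shared<arrow::dataset::HivePartitioning>(
    arrow::schema({arrow::field("key1", arrow::utf8())}));  // placeholder partition column
write_options.basename_template = "part{i}.parquet";
```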
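For the first issue, my best guess so far is to override the Arrow writer properties on the Parquet file write options so the writer coerces timestamp[ns] to timestamp[us] and allows the truncation, but I am not sure this is the intended knob, or whether truncating is even acceptable for my data. A rough, untested sketch:

```
#include "arrow/dataset/file_parquet.h"
#include "parquet/properties.h"

// Sketch: ask the Parquet writer to coerce timestamps to micros and tolerate truncation.
auto parquet_format = std::make_shared<arrow::dataset::ParquetFileFormat>();
auto parquet_write_options = std::static_pointer_cast<arrow::dataset::ParquetFileWriteOptions>(
    parquet_format->DefaultWriteOptions());
parquet_write_options->arrow_writer_properties =
    parquet::ArrowWriterProperties::Builder()
        .coerce_timestamps(arrow::TimeUnit::MICRO)
        ->allow_truncated_timestamps()
        ->build();
// Then attach it to the FileSystemDatasetWriteOptions used by the write node.
write_options.file_write_options = parquet_write_options;
```

Alternatively, I suppose I could add a project node that casts the nanosecond columns before the write node, but I was hoping there is a cleaner way through the write options.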
Thanks.

Best,
Haocheng