Francois Saint-Jacques created ARROW-8382: ---------------------------------------------
Summary: [C++][Dataset] Refactor WritePlan to decouple from Fragment/Scan/Partition classes Key: ARROW-8382 URL: https://issues.apache.org/jira/browse/ARROW-8382 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques WritePlan should look like the following. {code:c++} class ARROW_DS_EXPORT WritePlan { public: /// Execute the WritePlan and return a FileSystemDataset as a result. Result<FileSystemDataset> Execute(); protected: /// The schema of the Dataset which will be written std::shared_ptr<Schema> schema; /// The format into which fragments will be written std::shared_ptr<FileFormat> format; using SourceAndReader = std::pair<FIleSource, RecordBatchReader>; /// std::vector<SourceAndReader> outputs; }; {code} * Refactor FileFormat::Write(FileSource destination, RecordBatchReader), not sure if it should take the output schema, or the RecordBatchReader should be already of the right schema. * Add a class/function that constructs SourceAndReader from Fragments, Partitioning and base path. And remove any Write/Fragment logic from partition.cc. * Move Write() out FIleSystemDataset into WritePlan. It could take a FileSystemDatasetFactory to recreate the FileSystemDataset. This is a bonus, not a requirement. * Simplify writing routine to avoid the PathTree directory structure, it shouldn't be more complex than `for task in write_tasks: task()`. Not path construction should there. The effects are: * Simplified WritePlan execution, abstracted away from path construction, and can write to multiple FileSystem and/or Buffers since it doesn't construct the FileSource. * By the virtue of using RecordBatchReader instead of Fragment, it isn't tied to writing from Fragment, it can take any construct that yields a RecordBatchReader. It also means that WritePlan doesn't have to know about any Scan related classes. * Writing can be done with or without partitioning, this logic is given to whomever generates the SourceAndReader list. * Should be simpler to test. -- This message was sent by Atlassian Jira (v8.3.4#803005)