Hi Sam, it's an interesting proposition. Other file formats like
Parquet don't make "resuming" particularly easy, either. The magic
number at the beginning of an Arrow file means that it's a lot more
expensive to turn a stream file into an Arrow IPC file: the stream
bytes have to be rewritten after the leading magic number rather than
being left in place with a footer appended. If we'd thought about this
use case, we might have chosen to only put the magic number at the end
of the file.

It's also not possible to put the file metadata "outside" the stream
file. One thing that occurs to me is whether we could enable the file
footer metadata to live in a "sidecar" file to support this use case.
To enable this, we would have to add a new optional field to Footer in
File.fbs that indicates the file path that the Footer references. This
would be null when the footer is part of the same file where the data
lives. A function could be implemented to produce this "sidecar index"
file from a stream file.

I'm not sure what others think about this.

Thanks,
Wes


On Wed, Jul 14, 2021 at 5:39 AM Sam Davis <[email protected]> wrote:
>
> Hi,
>
> I'm interested in a use case where there is a long running job producing 
> results as it goes that may die and therefore must be restarted, making sure 
> to continue from the last known-good point.
>
> For this use case, it seems best to use the "IPC Streaming Format" and write 
> out the batches as they are generated.
>
> However, once the job is finished it would also be beneficial to have random 
> access into the file. It seems like this is possible by:
>
> 1. Manually creating a file with the correct magic number/padding bytes
>    and then seeking past them.
> 2. Writing batches out as they appear.
> 3. Doing a pass over the record batches to gather the information
>    required to generate the footer data.
>
>
> Whilst this seems possible, it doesn't seem like it is a use case that has 
> come up before. However, this does surprise me because adding index 
> information to a "completed" file seems like a genuinely useful thing to want 
> to do.
>
> Has anyone encountered something similar before?
>
> Is there an easier way to achieve this? i.e. does this functionality, or 
> parts of it, exist in another language that I can bind to in Python?
>
> Best,
>
> Sam
