>
> So far, this code has worked as expected and I have been able to read in
> multiple files simultaneously across processes, but recently I hit a case
> where reading a file in a single process resulted in an error that could be
> handled gracefully (with an `Unexpected end of stream` error), but reading
> in that same file across multiple processes crashed the code, and I would
> like to be able to handle the errors rather than having it crash. Thanks.


As long as there is no shared state (i.e., the processes aren't sharing a
reader handle inherited via fork), reading from multiple processes should be
safe. If you have a small reproducible example of the error that occurs only
when reading from multiple processes (and not when reading from a single
process), please share it; that would help figure out what is going on.
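
To illustrate the "no shared handle" point, here is a minimal sketch of the pattern, assuming a POSIX system. It deliberately does not use the Parquet API: `ReadSlice` and `ReadInParallel` are hypothetical names, and a plain `std::ifstream` stands in for the reader. Each child opens its own handle after `fork()` and catches exceptions itself, so a failed read reaches the parent as an exit code instead of an uncaught exception.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <cstdint>
#include <exception>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-process reader: reads `count` bytes starting
// at `offset` and throws if the file ends before `count` bytes were read.
void ReadSlice(const std::string& path, std::uint64_t offset, std::uint64_t count) {
  std::ifstream in(path, std::ios::binary);  // opened by the calling process only
  if (!in) throw std::runtime_error("cannot open " + path);
  in.seekg(static_cast<std::streamoff>(offset));
  std::vector<char> buf(count);
  in.read(buf.data(), static_cast<std::streamsize>(count));
  if (static_cast<std::uint64_t>(in.gcount()) != count) {
    throw std::runtime_error("Unexpected end of stream");
  }
}

// Forks `num_workers` children, each reading its own slice of the file.
// Returns one code per worker: 0 = success, 1 = error caught in the child,
// -1 = the child did not exit normally (e.g. it was killed by a signal).
std::vector<int> ReadInParallel(const std::string& path, std::uint64_t total_bytes,
                                int num_workers) {
  assert(num_workers > 0);
  const std::uint64_t chunk = total_bytes / static_cast<std::uint64_t>(num_workers);
  std::vector<pid_t> pids;
  for (int i = 0; i < num_workers; ++i) {
    pid_t pid = fork();
    if (pid < 0) throw std::runtime_error("fork failed");
    if (pid == 0) {
      // Child: no reader exists yet, so nothing is shared with the parent.
      int status = 0;
      try {
        const std::uint64_t offset = static_cast<std::uint64_t>(i) * chunk;
        // The last worker also takes the remainder.
        const std::uint64_t count = (i == num_workers - 1) ? total_bytes - offset : chunk;
        ReadSlice(path, offset, count);
      } catch (const std::exception&) {
        status = 1;  // handled here; reported to the parent via the exit code
      }
      _exit(status);  // skip the parent's atexit handlers and stdio flush
    }
    pids.push_back(pid);
  }
  std::vector<int> codes;
  for (pid_t pid : pids) {
    int wstatus = 0;
    waitpid(pid, &wstatus, 0);
    codes.push_back(WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1);
  }
  return codes;
}
```

A `-1` in the result means a worker died without exiting normally, which is the case worth reducing to a small reproducer.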

On Thu, Mar 31, 2022 at 3:33 PM McDonald, Ben <[email protected]> wrote:

> Hello,
>
>
>
> I am currently writing some distributed code where I am reading Parquet
> columns from the same file across multiple processes. I see that
> https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N7parquet5arrow10FileReaderE seems
> to suggest that parallelism within a process would need to read at the row
> group granularity and that multiple file readers working independently on
> the same file in a single process would not be safe.
>
>
>
> Given that I haven’t been able to find anything suggesting the contrary, I
> was thinking that reading the same file from different processes would be
> allowed, but a recent crash I encountered made me question if that were
> true.
>
>
>
> Is it allowed to read a single Parquet file simultaneously from separate
> processes? I am currently using the low level `ReadBatch` API and, for
> example, if I were reading 1 file across 2 processes, I would have the
> first process read the first half of the elements and the second process
> read the second half of the elements, and both of these are happening
> simultaneously, but as I have mentioned, it is in different processes, so I
> wouldn’t expect there to be any conflict.
>
>
>
> So far, this code has worked as expected and I have been able to read in
> multiple files simultaneously across processes, but recently I hit a case
> where reading a file in a single process resulted in an error that could be
> handled gracefully (with an `Unexpected end of stream` error), but reading
> in that same file across multiple processes crashed the code, and I would
> like to be able to handle the errors rather than having it crash. Thanks.
>
>
>
> Best,
>
> Ben McDonald
>
