> My dataset is partitioned into 10000 files. And that seems to be
> the problem. Even if I use a filter that targets only one
> partition. If I use a small number of partitions, it works.
Ok, that makes sense. We definitely have some improvements we can make
when there is a large number of files. If a directory is provided then we
currently list the directory completely first and only begin scanning once
that directory listing is done. We should start scanning as soon as we
start discovering files. I believe this is captured by ARROW-8163 [1]. I
am not aware of anyone working on this at the moment, but some of the
prerequisite work was done in ARROW-11924 [2]. For your use case it may
also help to list files using multiple threads (I'm not familiar enough
with how SMB works to know whether this is true or not).

As a workaround you can manually create a list of files and construct your
dataset from that list. If the dataset is created with a list of files
then there should be no discovery step and the scanning should begin
immediately (and in parallel). A rough sketch of this is below, after the
links.

> Another would be to make the simplifying assumption that the
> schemas are all the same and so to use a single file to collect the
> schema and then validate that each file has the same schema when you
> actually read it, rather than inspecting and validating all 10,000
> files before beginning to read them.

This optimization is in place but is only discussed in the C++ docs [3]. I
think Python always uses the default, which is to read the schema of the
first file only. If you already know the schema you can also pass it to
the dataset factory yourself; see the second sketch below.

> I see this discussion of common metadata files in
>
> https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files
>
> But I don't see anything in the documentation about how to use them
> for better performance for a many-file dataset. We should also fix
> that

The inverse of that discussion is at [4], although it could probably be
made more clear that the two are linked. I do believe that we can utilize
a common metadata file to skip the initial directory discovery; the
performance should be pretty similar to the workaround I mentioned above,
and the third sketch below shows what this looks like. Right now we can
only utilize the smaller _metadata file and not the larger
_common_metadata file, which (if memory serves, and it is quite hazy on
this point so don't hold me to it) would be a minor optimization that
would allow us to apply predicate pushdown to filter out files more
quickly instead of waiting until we've read in the metadata for each file
to filter out the row groups.

[1] https://issues.apache.org/jira/browse/ARROW-8163
[2] https://issues.apache.org/jira/browse/ARROW-11924
[3] https://arrow.apache.org/docs/cpp/api/dataset.html#_CPPv4N5arrow7dataset14InspectOptions9fragmentsE
[4] https://arrow.apache.org/docs/python/dataset.html#working-with-parquet-datasets
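First, a rough (untested) sketch of the file-list workaround. The paths
and the "value" column are made up; substitute however you already know
your file names, e.g. a cached listing:

    import pyarrow.dataset as ds

    # Hypothetical file list -- build or cache this yourself instead of
    # letting the dataset factory crawl the 10000-file directory over SMB.
    files = [
        "/mnt/share/data/part-0000.parquet",
        "/mnt/share/data/part-0001.parquet",
        # ...
    ]

    # No discovery step: scanning can start immediately and in parallel.
    dataset = ds.dataset(files, format="parquet")
    table = dataset.to_table(filter=ds.field("value") > 0)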
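Second, if you already know the schema you should be able to hand it to
the dataset factory so that it does not need to be inferred from the
first file (a sketch; the schema and path are made up, and I haven't
measured how much of the inspection this actually skips):

    import pyarrow as pa
    import pyarrow.dataset as ds

    # Hypothetical schema -- replace with the real schema of your files.
    schema = pa.schema([
        ("id", pa.int64()),
        ("value", pa.float64()),
    ])

    # Passing schema= avoids inferring the schema from a file, although the
    # directory still has to be listed unless you also pass explicit files.
    dataset = ds.dataset("/mnt/share/data", format="parquet", schema=schema)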
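Third, roughly the _metadata round trip that [4] describes (a sketch with
toy data and made-up paths):

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})  # toy data

    # Writing: collect per-file metadata while writing the dataset, then
    # write it out as a single _metadata file next to the data files.
    collected = []
    pq.write_to_dataset(table, root_path="dataset_root",
                        metadata_collector=collected)
    pq.write_metadata(table.schema, "dataset_root/_metadata",
                      metadata_collector=collected)

    # Reading: build the dataset from _metadata so there is no directory
    # listing and no per-file footer reads before scanning starts.
    dataset = ds.parquet_dataset("dataset_root/_metadata")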
On Sun, Nov 14, 2021 at 2:56 PM Wes McKinney <[email protected]> wrote:
>
> It sounds like the operation is blocking on the initial metadata
> inspection, which is well known to be a problem when the dataset has a
> large number of files. We should have some better-documented ways
> around this — I don't think it's possible to pass an explicit Arrow
> schema to parquet.read_table, but that is something we should look at.
> Another would be to make the simplifying assumption that the schemas
> are all the same and so to use a single file to collect the schema and
> then validate that each file has the same schema when you actually
> read it, rather than inspecting and validating all 10,000 files before
> beginning to read them.
>
> I see this discussion of common metadata files in
>
> https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files
>
> But I don't see anything in the documentation about how to use them
> for better performance for a many-file dataset. We should also fix
> that
>
> On Sat, Nov 13, 2021 at 6:27 AM Farhad Taebi
> <[email protected]> wrote:
> >
> > Thanks for the investigation. Your test code works and so does any other
> > where a single parquet file is targeted.
> > My dataset is partitioned into 10000 files. And that seems to be the
> > problem. Even if I use a filter that targets only one partition. If I use a
> > small number of partitions, it works.
> > That looks like pyarrow tries to locate all file paths in the dataset
> > before running the query, even if only one needs to be known and since the
> > network drive is slow, it just waits for the response. Wouldn't it be
> > better, if a meta data file would be created along with the partitions, so
> > the needed paths could be read fast instead of asking the OS every time?
> > I don't know if my thoughts are correct though.
> >
> > Cheers
