> My dataset is partitioned into 10000 files. And that seems to be
> the problem. Even if I use a filter that targets only one
> partition. If I use a small number of partitions, it works.
Ok, that makes sense. We definitely have some improvements we can make
when there is a large number of files. If a directory is provided then we
currently list the directory completely first and only begin scanning once
that directory listing is done. We should start scanning as soon as we
start discovering files. I believe this is captured by ARROW-8163 [1]. I
am not aware of anyone working on this at the moment, but some of the
prerequisite work was done in ARROW-11924 [2]. For your use case it may
also help to list files using multiple threads (I'm not familiar enough
with how SMB works to know whether this is true or not).

As a workaround you can manually create a list of files and construct your
dataset from that list. If the dataset is created with a list of files
then there should be no discovery step and the scanning should begin
immediately (and in parallel). A rough sketch of this is below, after the
links.

> Another would be to make the simplifying assumption that the
> schemas are all the same and so to use a single file to collect the
> schema and then validate that each file has the same schema when you
> actually read it, rather than inspecting and validating all 10,000
> files before beginning to read them.

This optimization is in place but is only discussed in the C++ docs [3]. I
think Python always uses the default, which is to read the schema of the
first file only. If you already know the schema you can also pass it to
the dataset factory yourself; see the second sketch below.

> I see this discussion of common metadata files in
>
> https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files
>
> But I don't see anything in the documentation about how to use them
> for better performance for a many-file dataset. We should also fix
> that

The inverse of that discussion is at [4], although it could probably be
made more clear that the two are linked. I do believe that we can utilize
a common metadata file to skip the initial directory discovery; the
performance should be pretty similar to the workaround I mentioned above,
and the third sketch below shows what this looks like. Right now we can
only utilize the smaller _metadata file and not the larger
_common_metadata file, which (if memory serves, and it is quite hazy on
this point so don't hold me to it) would be a minor optimization that
would allow us to apply predicate pushdown to filter out files more
quickly instead of waiting until we've read in the metadata for each file
to filter out the row groups.

[1] https://issues.apache.org/jira/browse/ARROW-8163
[2] https://issues.apache.org/jira/browse/ARROW-11924
[3] https://arrow.apache.org/docs/cpp/api/dataset.html#_CPPv4N5arrow7dataset14InspectOptions9fragmentsE
[4] https://arrow.apache.org/docs/python/dataset.html#working-with-parquet-datasets
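First, a rough (untested) sketch of the file-list workaround. The paths
and the "value" column are made up; substitute however you already know
your file names, e.g. a cached listing:

    import pyarrow.dataset as ds

    # Hypothetical file list -- build or cache this yourself instead of
    # letting the dataset factory crawl the 10000-file directory over SMB.
    files = [
        "/mnt/share/data/part-0000.parquet",
        "/mnt/share/data/part-0001.parquet",
        # ...
    ]

    # No discovery step: scanning can start immediately and in parallel.
    dataset = ds.dataset(files, format="parquet")
    table = dataset.to_table(filter=ds.field("value") > 0)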
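Second, if you already know the schema you should be able to hand it to
the dataset factory so that it does not need to be inferred from the
first file (a sketch; the schema and path are made up, and I haven't
measured how much of the inspection this actually skips):

    import pyarrow as pa
    import pyarrow.dataset as ds

    # Hypothetical schema -- replace with the real schema of your files.
    schema = pa.schema([
        ("id", pa.int64()),
        ("value", pa.float64()),
    ])

    # Passing schema= avoids inferring the schema from a file, although the
    # directory still has to be listed unless you also pass explicit files.
    dataset = ds.dataset("/mnt/share/data", format="parquet", schema=schema)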
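Third, roughly the _metadata round trip that [4] describes (a sketch with
toy data and made-up paths):

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})  # toy data

    # Writing: collect per-file metadata while writing the dataset, then
    # write it out as a single _metadata file next to the data files.
    collected = []
    pq.write_to_dataset(table, root_path="dataset_root",
                        metadata_collector=collected)
    pq.write_metadata(table.schema, "dataset_root/_metadata",
                      metadata_collector=collected)

    # Reading: build the dataset from _metadata so there is no directory
    # listing and no per-file footer reads before scanning starts.
    dataset = ds.parquet_dataset("dataset_root/_metadata")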
On Sun, Nov 14, 2021 at 2:56 PM Wes McKinney <[email protected]> wrote:
>
> It sounds like the operation is blocking on the initial metadata
> inspection, which is well known to be a problem when the dataset has a
> large number of files. We should have some better-documented ways
> around this — I don't think it's possible to pass an explicit Arrow
> schema to parquet.read_table, but that is something we should look at.
> Another would be to make the simplifying assumption that the schemas
> are all the same and so to use a single file to collect the schema and
> then validate that each file has the same schema when you actually
> read it, rather than inspecting and validating all 10,000 files before
> beginning to read them.
>
> I see this discussion of common metadata files in
>
> https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files
>
> But I don't see anything in the documentation about how to use them
> for better performance for a many-file dataset. We should also fix
> that
>
> On Sat, Nov 13, 2021 at 6:27 AM Farhad Taebi
> <[email protected]> wrote:
> >
> > Thanks for the investigation. Your test code works and so does any other
> > where a single parquet file is targeted.
> > My dataset is partitioned into 10000 files. And that seems to be the
> > problem. Even if I use a filter that targets only one partition. If I use a
> > small number of partitions, it works.
> > That looks like pyarrow tries to locate all file paths in the dataset
> > before running the query, even if only one needs to be known and since the
> > network drive is slow, it just waits for the response. Wouldn't it be
> > better, if a meta data file would be created along with the partitions, so
> > the needed paths could be read fast instead of asking the OS every time?
> > I don't know if my thoughts are correct though.
> >
> > Cheers
