Hello François,

Sorry that this question went unanswered for so long. We have gotten many requests for a feature to skip bad files, but we haven't come to a consensus on how it should be implemented.
The problem largely comes from the ambiguity in what skipping "some" bad files means, and from differing user expectations. If the files are valid and a bug in Drill is preventing us from reading them, just skipping the files doesn't seem like the right behavior. In this case, if you could post a small parquet file that produces this error, I can take a look at what is causing the issue, because I would like to make sure this is fixed.

I completely agree that we should be failing with a helpful message that points users to the file that failed to read. We catch many of these cases today and add the filename to the failure message; we should add one where this is failing as well. I have filed a JIRA for fixing this, as well as for reviewing the other storage plugins for similar cases that fail without useful context information [1].

[1] - https://issues.apache.org/jira/browse/DRILL-4426

On Fri, Jan 22, 2016 at 8:49 AM, François Méthot <[email protected]> wrote:

> Hi Drill Community,
>
> Using drill-embedded, I encountered this error while doing a query on
> folders containing thousands of parquet files:
>
> Error: SYSTEM ERROR: IOException: FAILED_TO_UNCOMPRESSED(5)
>
> Fragment 1:9
>
> After re-running the same query with the log level set to DEBUG, I tracked
> the files that were scanned by Fragment 1:9, performed the same query on
> each individual file until I got the same error.
>
> It turned out that a column in one of the parquet file is causing this
> issue. Whether it is an issue with our parquet writer or with the drill
> reader remains to be determined.
>
> My questions is :
>
> Is there an option to have a fragment thread to move on to the next file
> after it encounter such error, without completely spoiling the whole query
> and result?
>
> Also in this case, it would have been useful if it was clearly specified in
> the log which parquet file is causing issue.
>
> Thanks a lot
>
> François
