This issue has been raised before; currently, when execution detects the bad
format, it has no knowledge of which file the data came from.

See https://issues.apache.org/jira/browse/DRILL-3764

That issue links to a design that could report the name of the file by adding
this information to the data batches produced by the scan (this can work in
simple cases, such as a cast directly above a scan, but not in cases where
joins etc. sit below the failing operator).

    Boaz

On Mar 1, 2017, at 2:10 PM, Wesley Chow <[email protected]> wrote:

Great, I do get back a bit more info, which is nice, but still not enough to
determine from Drill which file has the bad input. I ended up downloading a
few GBs of CSVs and examining them with good old grep, but this is obviously
less than ideal from a workflow perspective. Is there really no way to get
more debug info than a NumberFormatException? Or is there some way to ignore
a configurable small number of errors?
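I am not aware of a built-in error-tolerance option for the CSV reader, but one workaround is to guard the cast in SQL so bad rows produce NULL instead of aborting the query. A sketch, where the path and the numeric pattern are assumptions for illustration:

```sql
-- Sketch: only cast values that look numeric; everything else becomes
-- NULL rather than throwing NumberFormatException mid-query.
SELECT CASE WHEN columns[0] SIMILAR TO '[0-9]+'
            THEN CAST(columns[0] AS BIGINT)
            ELSE NULL
       END AS first_col
FROM dfs.`/data/csvs`;
```

Inverting the filter (`WHERE NOT columns[0] SIMILAR TO '[0-9]+'`) instead returns just the offending rows for inspection.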

Thanks,
Wes


On Wed, Mar 1, 2017 at 12:54 PM, John Omernik <[email protected]> wrote:

The first thing I would try is turning on verbose errors.

The setting for that is exec.errors.verbose

I use select * from sys.options quite a bit when determining how to
approach problems.

To alter your session to use verbose errors, type:

ALTER SESSION set `exec.errors.verbose` = true;

Then you may get more data back on your error.
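The two statements together, plus a check that the option took effect (the `bool_val` column name is my assumption about the sys.options schema):

```sql
ALTER SESSION SET `exec.errors.verbose` = true;

-- Confirm the new value; the column name is an assumption and may
-- differ across Drill versions (SELECT * works regardless).
SELECT name, bool_val
FROM sys.options
WHERE name = 'exec.errors.verbose';
```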

On Wed, Mar 1, 2017 at 10:34 AM, Wesley Chow <[email protected]> wrote:

Is there any guidance on finding a needle-in-a-haystack input error in a ton
of data? For example, I've got one row, in a CSV file among thousands
containing tens of millions of rows, that has a string as its first column
instead of the number it should be. Is there some way to get Drill to tell me
which file that row is in? Something like the dirX columns would work, since
I can select for the row.

Note, these are CSV files hosted in S3.
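The dirX idea can be sketched like this, assuming the S3 workspace is laid out in subdirectories (the storage-plugin name, path, and pattern are all hypothetical):

```sql
-- Sketch: dir0/dir1 are Drill's implicit directory columns; selecting
-- them alongside the suspect value narrows down which subdirectory
-- the bad row lives in, without casting anything.
SELECT dir0, dir1, columns[0]
FROM s3.`/data`
WHERE NOT columns[0] SIMILAR TO '[0-9]+';
```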

Thanks,
Wes


