This issue has been raised before; currently, when the execution detects the bad format, it has no knowledge of which file the data came from.

See https://issues.apache.org/jira/browse/DRILL-3764

There is a link to a design that could tell the name of the file by adding this information to the data batches produced by the scan (this can work in simple cases, like a cast directly above a scan, but not in cases where there are joins etc. below).

Boaz

On Mar 1, 2017, at 2:10 PM, Wesley Chow <[email protected]> wrote:

Great, I do get back a bit more info, which is nice, but still not enough to determine from Drill which file has the bad input. I ended up downloading a few GBs of CSVs and examining them with good old grep, but this is obviously less than ideal from a workflow perspective.

Is there really no way to get more debug info than NumberFormatException? Or is there some way to ignore a configurable small number of errors?

Thanks,
Wes

On Wed, Mar 1, 2017 at 12:54 PM, John Omernik <[email protected]> wrote:

The first thing I would try is turning on verbose errors. The setting for that is exec.errors.verbose. I use select * from sys.options quite a bit when determining how to approach problems.

To alter your session to use verbose errors, type:

ALTER SESSION set `exec.errors.verbose` = true;

Then you may get more data back on your error.

On Wed, Mar 1, 2017 at 10:34 AM, Wesley Chow <[email protected]> wrote:

Is there any guidance on finding a needle-in-a-haystack input error in a ton of data? For example, I've got one row in a CSV file, amongst thousands of files containing tens of millions of rows, that has as its first column a string instead of a number like it should be. Is there some way to get Drill to tell me which file that row is in? Something like the dirX columns would work, since I can select for the row.

Note, these are CSV files hosted in S3.

Thanks,
Wes
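[Editor's note] The "download and grep" workaround mentioned in the thread can be sketched as a one-liner. This is not a Drill feature, just a local scan; the directory, file name, and sample data below are hypothetical, and the pattern assumes the first column should be a plain integer (adjust the regex for floats, headers, or quoted fields):

```shell
# Hypothetical local copy of the S3 CSVs.
mkdir -p /tmp/drill_csvs
printf '1,foo\n2,bar\nbad,baz\n3,qux\n' > /tmp/drill_csvs/part-0001.csv

# Print file, line number, and row for every row whose first
# comma-separated column is not a plain integer.
awk -F, '$1 !~ /^[0-9]+$/ {print FILENAME ":" FNR ": " $0}' /tmp/drill_csvs/*.csv
# → /tmp/drill_csvs/part-0001.csv:3: bad,baz
```

FILENAME and FNR are standard awk variables (current file and per-file line number), which is what makes this answer the "which file is the bad row in" question that Drill's NumberFormatException does not.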
