Re: data issue
Hi Vishal,

I think you can add the following to $DRILL_HOME/conf/logback.xml to enable the needed logging:

Note that if you use a config directory separate from your install (using the --site flag to launch Drill) then modify the file in your custom location.

To file a JIRA ticket, just go to Drill's home page [1], click on Community, then Community Resources, then the first entry under Developer Resources: JIRA, which is [2]. Make sure the Drill project is selected. Then just fill in the type (Improvement), title, your version number and a description. There are many other fields, but we mostly don't use them.

Would be super-helpful if you can include a few lines of a CSV file that exhibits the problem (once you track down the problem using logging).

Thanks,
- Paul

[1] http://drill.apache.org/
[2] https://issues.apache.org/jira/browse/DRILL/

On Tuesday, February 18, 2020, 5:21:26 AM PST, Vishal Jadhav (BLOOMBERG/ 731 LEX) wrote:

> Hello Paul,
> Yes, I agree that a better error message would be a better solution. I am on drill 1.17.
> Regarding the logs - do I need to add/modify any specific things in the logback.xml to produce the trace?
> I can file a Jira with the instructions. What is the process for it?
> - Vishal
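Once trace logging is enabled for CompliantTextBatchReader, the "Opening file" messages discussed in this thread can be pulled out of the Drill log to see how far the conversion got. The following is only a rough sketch: it assumes each message appears on one line as "Opening file <path>" with no spaces in the path, and the helper name opened_files is made up for illustration.

```python
import re

# Matches the trace message written by CompliantTextBatchReader.openReader().
# Assumption: the path contains no whitespace.
OPENING = re.compile(r"Opening file (\S+)")


def opened_files(log_text):
    """Return the paths from 'Opening file ...' messages, in log order."""
    return [match.group(1) for match in OPENING.finditer(log_text)]
```

To use it, read the log file that your logback appender writes (its location depends on your configuration) and pass the text to opened_files(). Because several fragments read files in parallel, the last path listed is a hint about where the failure happened, not proof.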
Re: data issue
Hello Paul,

Yes, I agree that a better error message would be a better solution. I am on Drill 1.17.
Regarding the logs - do I need to add/modify any specific things in the logback.xml to produce the trace?
I can file a Jira with the instructions. What is the process for it?

- Vishal

From: user@drill.apache.org At: 02/14/20 17:47:26
To: Vishal Jadhav (BLOOMBERG/ 731 LEX), user@drill.apache.org
Subject: Re: data issue

> Hi Vishal,
>
> Yes, it is a known issue that Drill error reporting needs some TLC. Obviously, a better solution would be for the error to say something like "NumberFormatException: Column foo, value "this is not a number"". Feel free to file a JIRA ticket to remind us to fix this particular case. Please explain the context so we have a good shot at reproducing the issue.
>
> You said that the logs, at trace level, provided no information. Which version of Drill are you using? If the latest (and, I think, 1.16), there is a log message each time the reader opens a file:
>
>     package org.apache.drill.exec.store.easy.text.reader;
>
>     public class CompliantTextBatchReader ...
>
>       private void openReader(TextOutput output) throws IOException {
>         logger.trace("Opening file {}", split.getPath());
>
> Given this, you should see a series of "Opening file" messages when you enable trace-level logging for the above class.
>
> As Charles noted, CSV reads columns as text; let's assume that you do have a CAST or other conversion. Then the number format exception says that you are trying to convert a column from text to a number, and that value does not actually contain a number.
>
> Again, it would be better if the error message told us the column that has the problem. Otherwise, if the number of columns in question is small, you can run a query to find non-numeric values. Now, it would be nice if Drill had an isNumber() function. (Another Jira feature request you can file.) Since I can't find one, we can roll our own with a regex. Something like:
>
>     SELECT foo FROM yourTable WHERE NOT regexp_matches(foo, '\d+')
>
> If the number is a float or decimal, add the proper pattern. Caveat: I didn't try the above regex; there may be some fiddly bits with backslashes.
>
> Then, you can add file metadata (AKA "implicit") columns to give you the information you want:
>
>     SELECT filename, foo FROM ...
>
> If that finds the data, and it is something you must handle, you can add an IF function to handle the data.
>
> Thanks,
> - Paul
Re: data issue
Hi Vishal,

Yes, it is a known issue that Drill error reporting needs some TLC. Obviously, a better solution would be for the error to say something like "NumberFormatException: Column foo, value "this is not a number"". Feel free to file a JIRA ticket to remind us to fix this particular case. Please explain the context so we have a good shot at reproducing the issue.

You said that the logs, at trace level, provided no information. Which version of Drill are you using? If the latest (and, I think, 1.16), there is a log message each time the reader opens a file:

    package org.apache.drill.exec.store.easy.text.reader;

    public class CompliantTextBatchReader ...

      private void openReader(TextOutput output) throws IOException {
        logger.trace("Opening file {}", split.getPath());

Given this, you should see a series of "Opening file" messages when you enable trace-level logging for the above class.

As Charles noted, CSV reads columns as text; let's assume that you do have a CAST or other conversion. Then the number format exception says that you are trying to convert a column from text to a number, and that value does not actually contain a number.

Again, it would be better if the error message told us the column that has the problem. Otherwise, if the number of columns in question is small, you can run a query to find non-numeric values. Now, it would be nice if Drill had an isNumber() function. (Another Jira feature request you can file.) Since I can't find one, we can roll our own with a regex. Something like:

    SELECT foo FROM yourTable WHERE NOT regexp_matches(foo, '\d+')

If the number is a float or decimal, add the proper pattern. Caveat: I didn't try the above regex; there may be some fiddly bits with backslashes.

Then, you can add file metadata (AKA "implicit") columns to give you the information you want:

    SELECT filename, foo FROM ...

If that finds the data, and it is something you must handle, you can add an IF function to handle the data.

Thanks,
- Paul

On Friday, February 14, 2020, 7:44:59 AM PST, Vishal Jadhav (BLOOMBERG/ 731 LEX) wrote:

> During my select statement on conversion of csv file to parquet file, I get a NumberFormatException. I am running Drill in embedded mode. Is there a way to find out which CSV file or row in that file is causing the issue?
> I checked the logs with trace verbosity, but was not able to find the 'data' which has the issue.
>
> Error: SYSTEM ERROR: NumberFormatException
>
> Fragment 1:5
>
> Please, refer to logs for more information.
>
> Thanks!
> - Vishal
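The same check can also be run outside Drill by scanning the CSV files directly and reporting each file and line whose value does not parse as a number. What follows is a minimal sketch, not anything that ships with Drill. It assumes every file has a header row. The function names are made up for illustration, and the column name "price" in the usage note is taken from the CAST mentioned elsewhere in this thread. Python's Decimal parser only approximates what Drill's CAST accepts, so treat the output as a list of suspects.

```python
import csv
from decimal import Decimal, InvalidOperation
from pathlib import Path


def find_bad_values(csv_path, column):
    """Yield (line_number, value) for cells in `column` that are not finite decimals."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)  # assumes the first line is a header row
        for row in reader:
            value = row.get(column)  # None if the row is short or the column is absent
            try:
                ok = Decimal(value).is_finite()  # NaN and Infinity are rejected too
            except (InvalidOperation, TypeError):
                ok = False  # empty string, non-numeric text, or a missing cell
            if not ok:
                yield reader.line_num, value


def scan_directory(directory, column):
    """Return (path, line_number, value) for every suspect cell in *.csv under `directory`."""
    results = []
    for path in sorted(Path(directory).glob("*.csv")):
        for line_num, value in find_bad_values(path, column):
            results.append((str(path), line_num, value))
    return results
```

For example, scan_directory("csv_dir", "price") would list the suspect rows for a hypothetical directory named csv_dir. Empty cells are reported as well, which is useful here because null or empty values were suggested in this thread as a possible cause.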
Re: data issue
Vishal,

Does the output have to be DEC? Could you try FLOAT? The other option would be to use the TO_NUMBER function. Null values might also be causing an issue.

In any event, can you query the parquet output that Drill is generating? If so, another option might be to look at the last entry in the parquet file and then find that entry in your CSV data to see what is "near" it, to see if you can figure out what is breaking.

Best,
-- C

> On Feb 14, 2020, at 2:40 PM, Vishal Jadhav (BLOOMBERG/ 731 LEX) wrote:
>
> Yes, that's what I am doing and it seems to work, I am casting the data as
> e.g. CAST (price as DEC(a,b)).
>
> I have about 40,000 csv, each with about 1000+ rows, it fails after 5 mins of
> conversion and I do see some parquet files are produced. So, it would be nice to
> know how far we went through the logs, what record is having an issue.
>
> Error does say, look at the logs, but not able to find anything meaningful in
> there.
Re: data issue
Yes, that's what I am doing and it seems to work; I am casting the data as e.g. CAST (price as DEC(a,b)).

I have about 40,000 CSV files, each with about 1000+ rows. It fails after 5 mins of conversion and I do see some parquet files are produced. So, it would be nice to know how far we went through the logs, and what record is having an issue.

The error does say to look at the logs, but I am not able to find anything meaningful in there.

From: user@drill.apache.org At: 02/14/20 12:18:36
To: Vishal Jadhav (BLOOMBERG/ 731 LEX), user@drill.apache.org
Subject: Re: data issue

> Hi Vishal,
> This one is an easy one (I think)... All columns in CSV are read as VARCHAR. So if you are trying to convert anything in CSV to a Numeric format, you will first have to CAST it via one of Drill's data conversion functions to the appropriate numeric type.
> -- C
Re: data issue
Hi Vishal,

This one is an easy one (I think)... All columns in CSV are read as VARCHAR. So if you are trying to convert anything in CSV to a Numeric format, you will first have to CAST it via one of Drill's data conversion functions to the appropriate numeric type.

-- C

> On Feb 14, 2020, at 10:44 AM, Vishal Jadhav (BLOOMBERG/ 731 LEX) wrote:
>
> During my select statement on conversion of csv file to parquet file, I get
> the NumberFormatException exception, I am running drill in the embedded mode.
> Is there a way to find out which csv file or row in that file is causing the
> issue?
> I checked the logs with trace verbosity, but not able find the 'data' which
> has the issue.
>
> Error: SYSTEM ERROR: NumberFormatException
>
> Fragment 1:5
>
> Please, refer to logs for more information.
>
> Thanks!
> - Vishal
data issue
During my select statement on conversion of csv file to parquet file, I get a NumberFormatException. I am running Drill in embedded mode. Is there a way to find out which CSV file or row in that file is causing the issue?
I checked the logs with trace verbosity, but was not able to find the 'data' which has the issue.

Error: SYSTEM ERROR: NumberFormatException

Fragment 1:5

Please, refer to logs for more information.

Thanks!
- Vishal