Re: data issue

2020-02-18 Thread Paul Rogers
Hi Vishal,

I think you can add the following to $DRILL_HOME/conf/logback.xml to enable the 
needed logging:
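
A minimal sketch of such a logger entry, assuming the stock logback.xml that ships with Drill. The appender name FILE and the choice of a package-level logger are assumptions; match them to the appenders defined in your own file:

```xml
<!-- Assumption: an appender named FILE is already defined in this file. -->
<!-- The logger name is the package of CompliantTextBatchReader. -->
<logger name="org.apache.drill.exec.store.easy.text.reader" additivity="false">
  <level value="trace" />
  <appender-ref ref="FILE" />
</logger>
```

The entry belongs inside the top-level configuration element. Trace output is verbose, so it is worth removing once the problem file has been found.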

  
    
    
  


Note that if you use a config directory separate from your install (using the 
--site flag to launch Drill) then modify the file in your custom location.

To file a JIRA ticket, go to Drill's home page [1], click Community, then 
Community Resources, then the first entry under Developer Resources, JIRA, 
which is [2].

Make sure the Drill project is selected. Then just fill in the type 
(Improvement), title, your version number and a description. There are many 
other fields, but we mostly don't use them.

It would be super helpful if you could include a few lines of a CSV file that 
exhibits the problem (once you have tracked it down using logging).


Thanks,
- Paul


[1] http://drill.apache.org/
 
[2] https://issues.apache.org/jira/browse/DRILL/


Re: data issue

2020-02-18 Thread Vishal Jadhav (BLOOMBERG/ 731 LEX)
Hello Paul,
Yes, I agree that a better error message would be a better solution. I am on 
Drill 1.17. Regarding the logs: do I need to add or modify anything specific in 
logback.xml to produce the trace?
I can file a Jira with the instructions. What is the process for it?
- Vishal




Re: data issue

2020-02-14 Thread Paul Rogers
Hi Vishal,

Yes, it is a known issue that Drill error reporting needs some TLC. Obviously, 
a better solution would be for the error to say something like 
"NumberFormatException: Column foo, value 'this is not a number'". Feel free to 
file a JIRA ticket to remind us to fix this particular case. Please explain the 
context so we have a good shot at reproducing the issue.


You said that the logs, at trace level, provided no information. Which version 
of Drill are you using? If the latest (and, I think, 1.16 as well), there is a 
log message each time the reader opens a file:

package org.apache.drill.exec.store.easy.text.reader;


public class CompliantTextBatchReader ...

  private void openReader(TextOutput output) throws IOException {
    logger.trace("Opening file {}", split.getPath());


Given this, you should see a series of "Opening file" messages when you enable 
trace-level logging for the above class.

As Charles noted, Drill reads CSV columns as text. Let's assume that you do 
have a CAST or other conversion. Then the NumberFormatException says that you 
are trying to convert a column from text to a number, and some value does not 
actually contain a number.

Again, it would be better if the error message told us the column that has the 
problem. Otherwise, if the number of columns in question is small, you can run 
a query to find non-numeric values. Now, it would be nice if Drill had an 
isNumber() function. (Another Jira feature request you can file.)

Since I can't find one, we can roll our own with a regex. Something like:

SELECT foo FROM yourTable WHERE NOT regexp_matches(foo, '\d+')

If the number is a float or decimal, add the proper pattern.

Caveat: I didn't try the above regex; there may be some fiddly bits with 
backslashes.
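
One way to sidestep the backslash question is to spell the pattern with character classes. An untested sketch, assuming regexp_matches takes the column as its first argument and the pattern as its second, and assuming the files carry a header row so that foo is a named column (for headerless files, use columns[n] instead). The path is hypothetical:

```sql
-- Hypothetical path and column name.
-- The anchors state the intent explicitly, whether the function
-- matches the whole value or only part of it.
SELECT foo
FROM dfs.`/path/to/csv`
WHERE NOT regexp_matches(foo, '^-?[0-9]+([.][0-9]+)?$');
```

The optional group covers decimals such as 12.50. Fields that Drill reads as empty strings also fail the pattern, so they appear in the result.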

Then, you can add file metadata (AKA "implicit") columns to give you the 
information you want:

SELECT filename, foo FROM ...


If that finds the data, and it is something you must handle, you can add an 
IF function to handle the data.
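
The two steps above can be sketched together: first count the offending values per file using the implicit filename column, then guard the conversion. This is an untested sketch. It uses CASE in place of an IF function, and the path, the column name foo, the pattern, and the DECIMAL precision are all hypothetical:

```sql
-- Step 1: which files hold values that will not convert, and how many?
SELECT filename, COUNT(*) AS bad_rows
FROM dfs.`/path/to/csv`
WHERE NOT regexp_matches(foo, '^-?[0-9]+([.][0-9]+)?$')
GROUP BY filename
ORDER BY bad_rows DESC;

-- Step 2: convert only the values that look numeric.
-- A CASE with no ELSE branch yields NULL for everything else.
SELECT filename,
       CASE
         WHEN regexp_matches(foo, '^-?[0-9]+([.][0-9]+)?$')
           THEN CAST(foo AS DECIMAL(18, 4))
       END AS foo_num
FROM dfs.`/path/to/csv`;
```

With many input files, the grouped query in step 1 narrows the search to a handful of file names before you look at individual rows.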

Thanks,
- Paul

 


Re: data issue

2020-02-14 Thread Charles Givre
Vishal, 

Does the output have to be DEC?  Could you try FLOAT?  The other option would 
be to use the TO_NUMBER function.  Null values might also be causing the issue. 
In any event, can you query the parquet output that Drill is generating?  If 
so, another option might be to look at the last entry in the parquet file, 
then find that entry in your CSV data and check what is "near" it to see if 
you can figure out what is breaking.
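
A sketch of the FLOAT and null-handling ideas, assuming the "null" values arrive from the CSV as empty strings. The path and the column name price are hypothetical, and this has not been run against your data:

```sql
-- Turn empty strings into NULL before converting, so the cast
-- never sees an empty value.
SELECT CAST(NULLIF(price, '') AS DOUBLE) AS price
FROM dfs.`/path/to/csv`;

-- Alternative, as I recall it from the Drill docs: let CAST itself
-- treat empty strings as NULL for the whole session.
ALTER SESSION SET `drill.exec.functions.cast_empty_string_to_null` = true;
```

Neither form helps with values that hold actual text, so a check for non-numeric values is still worth running.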

Best,
-- C







Re: data issue

2020-02-14 Thread Vishal Jadhav (BLOOMBERG/ 731 LEX)
Yes, that's what I am doing and it seems to work. I am casting the data as, 
e.g., CAST(price AS DEC(a,b)). 

I have about 40,000 CSV files, each with 1000+ rows. The conversion fails after 
about 5 minutes, and I do see that some parquet files are produced. So it would 
be nice to know from the logs how far we got and which record has the issue.

The error does say to look at the logs, but I am not able to find anything 
meaningful in there.






Re: data issue

2020-02-14 Thread Charles Givre
Hi Vishal, 
This one is an easy one (I think)... All columns in CSV are read as VARCHAR.  
So if you are trying to convert anything in CSV to a Numeric format, you will 
first have to CAST it via one of Drill's data conversion functions to the 
appropriate numeric type.
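
For illustration, a minimal sketch of such casts on a headerless CSV file, where each field arrives in the columns array as VARCHAR. The path, the column positions, and the target types are hypothetical:

```sql
-- Each element of columns is VARCHAR until it is cast.
SELECT CAST(columns[0] AS INT)            AS id,
       CAST(columns[1] AS DECIMAL(18, 4)) AS price
FROM dfs.`/path/to/file.csv`;
```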
-- C




data issue

2020-02-14 Thread Vishal Jadhav (BLOOMBERG/ 731 LEX)
During my SELECT statement converting CSV files to parquet, I get a 
NumberFormatException. I am running Drill in embedded mode. Is there a way to 
find out which CSV file, or which row in that file, is causing the issue?
I checked the logs with trace verbosity, but was not able to find the 'data' 
which has the issue. 

Error: SYSTEM ERROR: NumberFormatException

Fragment 1:5

Please, refer to logs for more information.

Thanks!
- Vishal