Okay manually filtering out the header row while casting values seems to work.
On Mon, Feb 9, 2015 at 1:59 AM, Andries Engelbrecht < [email protected]> wrote: > You can can eliminate the header row with a predicate. Also Drill will try > to guess the data type of columns, but sometimes it may be wrong. In this > case the header row can confuse Drill as to the data type of the column as > the first column has a string, and the next ones have a number. In such > cases it is a good practice to cast the data type of the column. > > Try this query as an example for the file you are testing with. > > select columns[0], sum(cast(columns[1] as INT)) from `./csv/test.csv` > where columns[0]<>'Employee' group by columns[0]; > > If you remove the header row in the file, you may find that Drill will > more than likely be able to perform the sum without the cast. > > --Andries > > > > > On Feb 8, 2015, at 7:44 PM, Minnow Noir <[email protected]> wrote: > > > > That works with the file that has no row header, Sudhakar. > > > >> On Sun, Feb 8, 2015 at 10:34 PM, Sudhakar Thota <[email protected]> > wrote: > >> > >> May be you have to cast it to integer. > >> > >> Sudhakar Thota > >> Sent from my iPhone > >> > >>> On Feb 8, 2015, at 6:54 PM, Minnow Noir <[email protected]> wrote: > >>> > >>> The error actually happens when I remove the header row from the file. > >>> > >>> cat test2.csv > >>> "Ed",100 > >>> "Pete",200 > >>> "Ed",100 > >>> "Pete",400 > >>> > >>> > >>> 0: jdbc:drill:zk=local> select columns[0], columns[1] from > >>> dfs.`/data/test2.csv`; > >>> +------------+------------+ > >>> | EXPR$0 | EXPR$1 | > >>> +------------+------------+ > >>> | "Ed" | 100 | > >>> | "Pete" | 200 | > >>> | "Ed" | 100 | > >>> | "Pete" | 400 | > >>> +------------+------------+ > >>> 4 rows selected (0.837 seconds) > >>> > >>> 0: jdbc:drill:zk=local> select columns[0], sum(columns[1]) from > >>> dfs.`/data/test2.csv` group by columns[0]; > >>> Query failed: Query failed: Failure while running fragment., Only COUNT > >>> aggregate function supported for Boolean type [ > >>> a9344c36-ebb4-4b9d-9c7d-75a0f7e52edd on sandbox.hortonworks.com:31010 > ] > >>> [ a9344c36-ebb4-4b9d-9c7d-75a0f7e52edd on > sandbox.hortonworks.com:31010 > >> ] > >>> > >>> > >>> Error: exception while executing query: Failure while executing query. > >>> (state=,code=0) > >>> > >>> Also, I don't see anything on the Drill wiki about filtering out the > >> header > >>> row. How is that done? > >>> > >>> Thanks > >>> > >>> > >>> On Sun, Feb 8, 2015 at 9:23 PM, Andries Engelbrecht < > >>> [email protected]> wrote: > >>> > >>>> You need to filter out the header line on the CSV file as you are > trying > >>>> to sum a string in column1. > >>>> > >>>> > >>>> —Andries > >>>> > >>>> > >>>>> On Feb 8, 2015, at 6:12 PM, Minnow Noir <[email protected]> > wrote: > >>>>> > >>>>> I'm trying to perform a basic query in order to learn Drill, but > >> getting > >>>> an > >>>>> the error message in the subject line. > >>>>> > >>>>> I created a dead simple CSV file on disk. Note that Sales values are > >> not > >>>>> quoted. > >>>>> > >>>>> cat test.csv > >>>>> Employee,Sales > >>>>> Ed,100 > >>>>> Pete,200 > >>>>> Ed,100 > >>>>> Pete,400 > >>>>> > >>>>> When I query it without performing a sum, it returns the expected > >> values. > >>>>> > >>>>> 0: jdbc:drill:zk=local> select columns[0], columns[1] as TotalSales > >> from > >>>>> dfs.`/data/test.csv`; > >>>>> +------------+------------+ > >>>>> | EXPR$0 | TotalSales | > >>>>> +------------+------------+ > >>>>> | Employee | Sales | > >>>>> | Ed | 100 | > >>>>> | Pete | 200 | > >>>>> | Ed | 100 | > >>>>> | Pete | 400 | > >>>>> +------------+------------+ > >>>>> 5 rows selected (0.11 seconds) > >>>>> > >>>>> However, if I throw a sum() in there, I get the confusing error > message > >>>> in > >>>>> the subject line: > >>>>> > >>>>> 0: jdbc:drill:zk=local> select columns[0], sum(columns[1]) as > >> TotalSales > >>>>> from dfs.` > >>>>> /data/test.csv` group by columns[0]; > >>>>> Query failed: Query failed: Failure while running fragment., Only > COUNT > >>>>> aggregate function supported for Boolean type [ > >>>>> bfd34bd1-2fac-4d9e-a9bd-26bced552120 on > sandbox.hortonworks.com:31010 > >> ] > >>>>> [ bfd34bd1-2fac-4d9e-a9bd-26bced552120 on > >> sandbox.hortonworks.com:31010 > >>>> ] > >>>>> > >>>>> The message seems to be saying that Drill interpreted the Sales > column > >>>> data > >>>>> as being Boolean somehow, and therefore, the only function that can > be > >>>>> called is count(). It's not clear why Drill would interpret the > Sales > >>>>> column values as being Boolean. > >>>>> > >>>>> Some Googling turned up this thread, which says the error also occurs > >> for > >>>>> character data: https://issues.apache.org/jira/browse/DRILL-1998 > >>>>> > >>>>> Of course, it's not clear why Drill would interpret 100, 200, etc. as > >>>> being > >>>>> character data unless it's getting thrown off by the header > row...which > >>>> of > >>>>> course is present in every CSV and TSV file. However, I created a > copy > >>>> of > >>>>> test.csv *without* the header row, and reran the query, but got the > >> same > >>>>> error. > >>>>> > >>>>> Any ideas what's causing the issue and how to resolve it? > >>>>> > >>>>> Thanks > >> >
