benj created DRILL-7524:
---------------------------

Summary: Distinct on array with any_value
Key: DRILL-7524
URL: https://issues.apache.org/jira/browse/DRILL-7524
Project: Apache Drill
Issue Type: Bug
Components: Functions - Drill
Affects Versions: 1.17.0
Reporter: benj
Attachments: IndexOutOfBoundsException.txt, NegativeArraySizeException.txt

Since Drill does not allow GROUP BY, DISTINCT, or ORDER BY on complex types, using the any_value aggregate function may look like a workaround. However, some problems appear.

With a dataset of 223664 rows like:
{code:sql}
SELECT Url, Tags FROM dfs.tmp.`data.json` LIMIT 1;
+-----------------------------------------+--------+
| Url                                     | Tags   |
+-----------------------------------------+--------+
| http://000.dijiushipindian.com/feed.rss | ["us"] |
+-----------------------------------------+--------+
{code}

With a custom UDF to_string that only does:
{code:java}
@Param FieldReader input;
...
String rowString = input.readObject().toString();
...
{code}

{code:sql}
SELECT any_value(T.Tags) Tags
FROM dfs.tmp.`data.json` T
GROUP BY NULLIF(UPPER(to_string(T.Tags)),'') /* works */;
+--------+
| Tags   |
+--------+
| ["us"] |
| ["cn"] |
...

SELECT Url, any_value(T.Tags) Tags
FROM dfs.tmp.`data.json` T
GROUP BY Url, NULLIF(UPPER(to_string(T.Tags)),'') /* fails */;
java.lang.NegativeArraySizeException
{code}

Sometimes the error is different (details in the attachments):
java.lang.IndexOutOfBoundsException: index: 1634787136, length: 7629168 (expected: range(0, 8388608))

Before producing the error, the output shows some results like the ones below:
{code}
+----------------------------------------------------------------------------------+------+
| Url                                                                              | Tags |
+----------------------------------------------------------------------------------+------+
| http://everythiing4u.blogspot.com.es/2013/04/omg-proposal-fail.html              | []   |
| http://everythiing4u.blogspot.com.es/2013/04/omg-this-dude-just-owned-his-friend.html | []   |
{code}

This result is not correct: the Tags field is empty in the output, although it is never empty in the source file.

So there may be a problem with the any_value aggregate function.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)