Hi,
I am experimenting with Drill 1.6 to see whether it fits our SQL-on-Hadoop
needs. Since repeated_count doesn't work on nested objects
(https://issues.apache.org/jira/browse/DRILL-1650), I decided to implement
my own UDF for that using a FieldReader. I was hoping that FieldReader
would give me a generic way to count the number of elements in an array.
However, along the way I discovered a couple of inconsistencies in some of
the FieldReader implementations.
Here is my test UDF implementation. It exists only to illustrate the
issue:
import javax.inject.Inject;

import io.netty.buffer.DrillBuf;
import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.VarCharHolder;
import org.apache.drill.exec.vector.complex.reader.FieldReader;

@FunctionTemplate(name = "arrayCount",
    scope = FunctionTemplate.FunctionScope.SIMPLE)
public class ArrayCount implements DrillSimpleFunc {
  @Param FieldReader prArray;
  @Output VarCharHolder out;
  @Inject DrillBuf buffer;

  /**
   * Builds a string as output, in the following format:
   * Size:<result of FieldReader.size()>,Iterating Count:<number of
   * iterations counted>,<simple class name of the FieldReader>
   */
  public void eval() {
    int count = 0;
    StringBuilder sb = new StringBuilder()
        .append("Size:").append(prArray.size()).append(",");
    // Count the elements by walking the reader.
    while (prArray.next()) count++;
    sb.append("Iterating Count:").append(count).append(",")
        .append(prArray.getClass().getSimpleName());
    byte[] d = sb.toString().getBytes();
    out.buffer = buffer;
    out.start = 0;
    out.end = d.length;
    buffer.setBytes(0, d);
  }

  public void setup() {}
}
Here's the output from a sample file:
0: jdbc:drill:zk=local> select t1.b, arrayCount(t1.b) from
dfs.`/s/tmp/delete/btmpoc/data/a.json` t1;
+------------+----------------------------------------------------------+
| b | EXPR$1 |
+------------+----------------------------------------------------------+
| [1,2] | Size:2,Iterating Count:2,RepeatedBigIntHolderReaderImpl |
| [1,2,3] | Size:3,Iterating Count:5,RepeatedBigIntHolderReaderImpl |
| [1,2,3,4] | Size:4,Iterating Count:9,RepeatedBigIntHolderReaderImpl |
| [] | Size:0,Iterating Count:9,RepeatedBigIntHolderReaderImpl |
| [] | Size:0,Iterating Count:9,RepeatedBigIntHolderReaderImpl |
+------------+----------------------------------------------------------+
5 rows selected (0.179 seconds)
0: jdbc:drill:zk=local> select t1.c, arrayCount(t1.c) from
dfs.`/s/tmp/delete/btmpoc/data/a.json` t1;
+----------------------------------------------+------------------------------------------------+
| c                                            | EXPR$1                                         |
+----------------------------------------------+------------------------------------------------+
| [{"ca":"11cav"}]                             | Size:1,Iterating Count:1,RepeatedMapReaderImpl |
| [{"ca":"21cav","cb":"21cbv"},{"ca":"22cav"}] | Size:3,Iterating Count:2,RepeatedMapReaderImpl |
| [{"ca":"31cav"},{"ca":"32cav"}]              | Size:3,Iterating Count:2,RepeatedMapReaderImpl |
| [{"ca":"3"}]                                 | Size:2,Iterating Count:1,RepeatedMapReaderImpl |
| []                                           | Size:0,Iterating Count:0,RepeatedMapReaderImpl |
+----------------------------------------------+------------------------------------------------+
5 rows selected (0.115 seconds)
================================================
RepeatedBigIntHolderReaderImpl is generated from HolderReaderImpl.java. I
think the following line in HolderReaderImpl.java has the issue:
https://github.com/apache/drill/blob/245da9790813569c5da9404e0fc5e45cc88e22bb/exec/vector/src/main/codegen/templates/HolderReaderImpl.java#L80
Maybe we should change it to:
if (repeatedHolder.start + index + 1 < repeatedHolder.end)
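To make the reasoning behind that suggestion concrete, here is a small stand-alone simulation. It does not use any Drill classes. It assumes that the current check compares index + 1 directly against repeatedHolder.end, that index starts at -1 for every row, and that start/end are absolute offsets into one shared vector (rows at [0,2), [2,5), [5,9), [9,9), [9,9)). All of these are my assumptions, inferred from the output above; the class and method names are made up for illustration. Under those assumptions the first loop reproduces the observed iteration counts 2, 5, 9, 9, 9, and the second gives 2, 3, 4, 0, 0, which agrees with size().

```java
// Stand-alone sketch; hypothetical names, not Drill's actual reader code.
final class HolderIterationSketch {

    // Assumed current behaviour: a row-relative index is compared
    // against the absolute end offset of the holder.
    static int countCurrent(int start, int end) {
        int index = -1; // assumed to be reset to -1 for each row
        int count = 0;
        while (index + 1 < end) {
            index++;
            count++;
        }
        return count;
    }

    // Proposed check: shift the relative index by start before comparing.
    static int countProposed(int start, int end) {
        int index = -1;
        int count = 0;
        while (start + index + 1 < end) {
            index++;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed {start, end} offsets for the five rows of column b.
        int[][] rows = {{0, 2}, {2, 5}, {5, 9}, {9, 9}, {9, 9}};
        for (int[] r : rows) {
            System.out.println("current=" + countCurrent(r[0], r[1])
                + " proposed=" + countProposed(r[0], r[1]));
        }
    }
}
```

This only models the loop condition; whether the value reads inside the real reader also need the start offset applied is something I have not checked.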
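I am also not sure whether the size function of RepeatedMapReaderImpl is
implemented correctly: in the second query it returns 3 for both
two-element arrays and 2 for the one-element array [{"ca":"3"}].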