Hello,
Thanks for your answers.
My payload is not encoded in UTF8, it can contain some non printable
characters, new lines and it may contain bytes that are not valid in UTF8.
An the latter should be my case.
I have tried with regexp_matches:
SELECT * FROM `dfs`.`myfile.avro` WHERE regexp_matches(payload,
'(?s).*abcd.*');
but I have the same problem, and I have the same error if I do, obviously,
SELECT CAST(payload as VARCHAR) FROM `dfs`.`myfile.avro`;
So, I implemented an UDF function to convert the bytes into a hex encoded
string
public class AsciiStringBinaryFunc {
// Converts a varbinary type into a hex encoded string.
// (byte[]) {(byte)0xca, (byte)0xfe, (byte)0xba, (byte)0xbe} =>
"\xca\xfe\xba\xbe"
@FunctionTemplate(name = "ascii_string_binary", scope =
FunctionScope.SIMPLE, nulls = NullHandling.NULL_IF_NULL)
public static class StringBinary implements DrillSimpleFunc {
@Param VarBinaryHolder in;
@Output VarCharHolder out;
@Workspace Charset charset;
@Inject DrillBuf buffer;
@Override
public void setup() {
charset = java.nio.charset.Charset.forName("US-ASCII");
}
@Override
public void eval() {
byte[] buf =
org.apache.drill.common.util.DrillStringUtils.toBinaryString(in.buffer,
in.start, in.end).getBytes(charset);
buffer.setBytes(0, buf);
buffer.setIndex(0, buf.length);
out.start = 0;
out.end = buf.length;
out.buffer = buffer;
}
}
}
but then, I have a new problem
SELECT ascii_string_binary(payload) FROM `dfs`.`myfile.avro` LIMIT 1;
Error: SYSTEM ERROR: IndexOutOfBoundsException: index: 0, length: 3484
(expected: range(0, 256))
Fragment 0:0
[Error Id: d0ab90d6-8b2a-4200-8809-534138c217fb on maprdemo:31010]
(state=,code=0)
knowing that
SELECT length(payload) FROM `dfs`.`myfile.avro` LIMIT 1;
+---------+
| EXPR$0 |
+---------+
| 3484 |
+---------+
Thanks a lot for your help,
Franca
On Sat, Apr 29, 2017 at 12:13 AM, Jinfeng Ni <[email protected]> wrote:
> The error seems to indicated 'PAYLOAD' does not contain UTF8-encoded
> bytes. The like function is a string function, and it only accepts
> varchar/char type, which assumes inputs are UTF8 bytes.
>
> You may consider implementing a Drill UDF 'blike" which works similar
> to string function 'like', but could operate on non-UTF8 bytes.
>
> On Fri, Apr 28, 2017 at 3:02 PM, Boaz Ben-Zvi <[email protected]> wrote:
> > Hi Franca,
> >
> > This issue is specific to the “bytes” type; for other Avro types the
> LIKE clause matches the printed representation, like:
> >
> > select * from dfs.`/data/avro/twitter.snappy.avro` where `timestamp`
> like '%66%';
> > +-------------+--------------------------------------+-------------+
> > | username | tweet | timestamp |
> > +-------------+--------------------------------------+-------------+
> > | miguno | Rock: Nerf paper, scissors is fine. | 1366150681 |
> > | BlizzardCS | Works as intended. Terran is IMBA. | 1366154481 |
> > +-------------+--------------------------------------+-------------+
> >
> > Can you share some sample avro file with “bytes” type? (I couldn’t find
> any such sample online) Maybe we’ll need to open a Jira for this case …
> >
> > Thanks,
> >
> > -- Boaz
> >
> > On 4/25/17, 8:45 AM, "franca perrina" <[email protected]> wrote:
> >
> > Hi,
> >
> > I would like to use Drill to query data formatted in avro.
> >
> > My avro schema looks like
> >
> > ..
> > {"name":"payload",
> > "type":"bytes"}
> > ..
> >
> > and the result to the query
> >
> > SELECT payload FROM `dfs`.`myfile.avro` LIMIT 1
> >
> > looks like:
> >
> > +-----------------+
> > | payload |
> > +-----------------+
> > | [B@3b8e004e |
> > +-----------------+
> >
> >
> > My problem is that when I run a query like:
> >
> > SELECT * FROM `dfs`.`myfile.avro` WHERE `PAYLOAD` LIKE '%abcd%'
> >
> > then I have
> > org.apache.drill.common.exceptions.UserRemoteException: SYSTEM
> ERROR:
> > DrillRuntimeException: Unexpected byte 0xfd at position 1008556
> encountered
> > while decoding UTF8 string. Fragment 0:0 [Error Id:
> > 0c247c14-0e51-402c-ad9a-411cbc445597
> > on maprdemo:31010]
> >
> > It seems like drill tries to decode the payload's bytes to UTF8.
> >
> > What I would need is a grep like behaviour, where my payload data is
> > considered as is, i.e. binary data, and it is not converted to a
> string
> > data type.
> >
> > Thanks a lot for your help.
> > franca
> >
> >
>