I found the way to resize the buffer
@Override
public void eval() {
byte[] buf =
org.apache.drill.common.util.DrillStringUtils.toBinaryString(in.buffer,
in.start, in.end).getBytes(charset);
* out.buffer = buffer = buffer.reallocIfNeeded(buf.length);*
...
}
Thanks a lot for you help.
Franca
On Fri, May 5, 2017 at 6:03 PM, franca perrina <[email protected]> wrote:
> Hello,
>
> Thanks for your answers.
>
> My payload is not encoded in UTF8, it can contain some non printable
> characters, new lines and it may contain bytes that are not valid in UTF8.
> An the latter should be my case.
>
> I have tried with regexp_matches:
>
> SELECT * FROM `dfs`.`myfile.avro` WHERE regexp_matches(payload,
> '(?s).*abcd.*');
>
> but I have the same problem, and I have the same error if I do,
> obviously,
>
> SELECT CAST(payload as VARCHAR) FROM `dfs`.`myfile.avro`;
>
> So, I implemented an UDF function to convert the bytes into a hex encoded
> string
>
>
> public class AsciiStringBinaryFunc {
>
> // Converts a varbinary type into a hex encoded string.
> // (byte[]) {(byte)0xca, (byte)0xfe, (byte)0xba, (byte)0xbe} =>
> "\xca\xfe\xba\xbe"
> @FunctionTemplate(name = "ascii_string_binary", scope =
> FunctionScope.SIMPLE, nulls = NullHandling.NULL_IF_NULL)
> public static class StringBinary implements DrillSimpleFunc {
> @Param VarBinaryHolder in;
> @Output VarCharHolder out;
> @Workspace Charset charset;
> @Inject DrillBuf buffer;
>
> @Override
> public void setup() {
> charset = java.nio.charset.Charset.forName("US-ASCII");
> }
>
> @Override
> public void eval() {
> byte[] buf =
> org.apache.drill.common.util.DrillStringUtils.toBinaryString(in.buffer,
> in.start, in.end).getBytes(charset);
> buffer.setBytes(0, buf);
> buffer.setIndex(0, buf.length);
>
> out.start = 0;
> out.end = buf.length;
> out.buffer = buffer;
> }
> }
> }
>
> but then, I have a new problem
>
> SELECT ascii_string_binary(payload) FROM `dfs`.`myfile.avro` LIMIT 1;
>
> Error: SYSTEM ERROR: IndexOutOfBoundsException: index: 0, length: 3484
> (expected: range(0, 256))
>
> Fragment 0:0
>
> [Error Id: d0ab90d6-8b2a-4200-8809-534138c217fb on maprdemo:31010]
> (state=,code=0)
>
>
> knowing that
>
> SELECT length(payload) FROM `dfs`.`myfile.avro` LIMIT 1;
>
> +---------+
> | EXPR$0 |
> +---------+
> | 3484 |
> +---------+
>
>
>
>
> Thanks a lot for your help,
> Franca
>
> On Sat, Apr 29, 2017 at 12:13 AM, Jinfeng Ni <[email protected]> wrote:
>
>> The error seems to indicated 'PAYLOAD' does not contain UTF8-encoded
>> bytes. The like function is a string function, and it only accepts
>> varchar/char type, which assumes inputs are UTF8 bytes.
>>
>> You may consider implementing a Drill UDF 'blike" which works similar
>> to string function 'like', but could operate on non-UTF8 bytes.
>>
>> On Fri, Apr 28, 2017 at 3:02 PM, Boaz Ben-Zvi <[email protected]> wrote:
>> > Hi Franca,
>> >
>> > This issue is specific to the “bytes” type; for other Avro types
>> the LIKE clause matches the printed representation, like:
>> >
>> > select * from dfs.`/data/avro/twitter.snappy.avro` where `timestamp`
>> like '%66%';
>> > +-------------+--------------------------------------+-------------+
>> > | username | tweet | timestamp |
>> > +-------------+--------------------------------------+-------------+
>> > | miguno | Rock: Nerf paper, scissors is fine. | 1366150681 |
>> > | BlizzardCS | Works as intended. Terran is IMBA. | 1366154481 |
>> > +-------------+--------------------------------------+-------------+
>> >
>> > Can you share some sample avro file with “bytes” type? (I couldn’t
>> find any such sample online) Maybe we’ll need to open a Jira for this
>> case …
>> >
>> > Thanks,
>> >
>> > -- Boaz
>> >
>> > On 4/25/17, 8:45 AM, "franca perrina" <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > I would like to use Drill to query data formatted in avro.
>> >
>> > My avro schema looks like
>> >
>> > ..
>> > {"name":"payload",
>> > "type":"bytes"}
>> > ..
>> >
>> > and the result to the query
>> >
>> > SELECT payload FROM `dfs`.`myfile.avro` LIMIT 1
>> >
>> > looks like:
>> >
>> > +-----------------+
>> > | payload |
>> > +-----------------+
>> > | [B@3b8e004e |
>> > +-----------------+
>> >
>> >
>> > My problem is that when I run a query like:
>> >
>> > SELECT * FROM `dfs`.`myfile.avro` WHERE `PAYLOAD` LIKE '%abcd%'
>> >
>> > then I have
>> > org.apache.drill.common.exceptions.UserRemoteException: SYSTEM
>> ERROR:
>> > DrillRuntimeException: Unexpected byte 0xfd at position 1008556
>> encountered
>> > while decoding UTF8 string. Fragment 0:0 [Error Id:
>> > 0c247c14-0e51-402c-ad9a-411cbc445597
>> > on maprdemo:31010]
>> >
>> > It seems like drill tries to decode the payload's bytes to UTF8.
>> >
>> > What I would need is a grep like behaviour, where my payload data is
>> > considered as is, i.e. binary data, and it is not converted to a
>> string
>> > data type.
>> >
>> > Thanks a lot for your help.
>> > franca
>> >
>> >
>>
>
>