I found a way to resize the buffer:

   @Override
   public void eval() {
     byte[] buf = org.apache.drill.common.util.DrillStringUtils
         .toBinaryString(in.buffer, in.start, in.end).getBytes(charset);
     // Grow the injected buffer if needed and keep the returned reference.
     out.buffer = buffer = buffer.reallocIfNeeded(buf.length);
     ...
   }
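
For anyone hitting the same IndexOutOfBoundsException, here is a minimal plain-Java sketch of why the returned reference has to be reassigned. FixedBuf is a hypothetical stand-in written for this example, not Drill's DrillBuf; the 256-byte starting capacity is taken from the error message quoted below, and the real reallocIfNeeded may pick a different capacity than this simplified version does.

```java
public class ReallocSketch {
    /** Hypothetical stand-in for a fixed-capacity buffer (not Drill's DrillBuf). */
    static final class FixedBuf {
        final byte[] data;

        FixedBuf(int capacity) {
            data = new byte[capacity];
        }

        int capacity() {
            return data.length;
        }

        /** Returns this buffer if it is large enough, otherwise a new, larger one. */
        FixedBuf reallocIfNeeded(int size) {
            return size <= data.length ? this : new FixedBuf(size);
        }

        /** Copies src to the start of the buffer; fails if src does not fit. */
        void setBytes(byte[] src) {
            if (src.length > data.length) {
                throw new IndexOutOfBoundsException(
                    "length: " + src.length + " (expected: range(0, " + data.length + "))");
            }
            System.arraycopy(src, 0, data, 0, src.length);
        }
    }

    /** Grows the buffer first, then writes; returns the buffer actually written to. */
    static FixedBuf write(FixedBuf buf, byte[] payload) {
        FixedBuf target = buf.reallocIfNeeded(payload.length); // keep the returned reference
        target.setBytes(payload);
        return target;
    }

    public static void main(String[] args) {
        FixedBuf buffer = new FixedBuf(256);
        buffer = write(buffer, new byte[3484]);
        System.out.println(buffer.capacity()); // prints 3484
    }
}
```

Calling setBytes on the original 256-byte buffer with the 3484-byte payload fails in the same way as the query did, which is why the write has to go to whatever reallocIfNeeded returns.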

Thanks a lot for your help.

Franca

On Fri, May 5, 2017 at 6:03 PM, franca perrina <[email protected]> wrote:

> Hello,
>
> Thanks for your answers.
>
> My payload is not encoded in UTF8: it can contain non-printable
> characters, newlines, and bytes that are not valid in UTF8.
> The latter appears to be my case.
>
> I have tried with regexp_matches:
>
> SELECT * FROM  `dfs`.`myfile.avro` WHERE regexp_matches(payload,
> '(?s).*abcd.*');
>
> but I have the same problem, and, unsurprisingly, I get the same error
> with
>
> SELECT CAST(payload as VARCHAR) FROM `dfs`.`myfile.avro`;
>
> So I implemented a UDF to convert the bytes into a hex-encoded
> string:
>
>
> public class AsciiStringBinaryFunc {
>
>  // Converts a varbinary value into a hex-encoded string, e.g.
>  // (byte[]) {(byte)0xca, (byte)0xfe, (byte)0xba, (byte)0xbe} => "\xca\xfe\xba\xbe"
>  @FunctionTemplate(name = "ascii_string_binary",
>      scope = FunctionScope.SIMPLE, nulls = NullHandling.NULL_IF_NULL)
>  public static class StringBinary implements DrillSimpleFunc {
>    @Param VarBinaryHolder in;
>    @Output VarCharHolder   out;
>    @Workspace Charset charset;
>    @Inject DrillBuf buffer;
>
>    @Override
>    public void setup() {
>      charset = java.nio.charset.Charset.forName("US-ASCII");
>    }
>
>    @Override
>    public void eval() {
>      byte[] buf = org.apache.drill.common.util.DrillStringUtils
>          .toBinaryString(in.buffer, in.start, in.end).getBytes(charset);
>      buffer.setBytes(0, buf);
>      buffer.setIndex(0, buf.length);
>
>      out.start = 0;
>      out.end = buf.length;
>      out.buffer = buffer;
>    }
>  }
> }
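
The conversion described in the comment above can be sketched in plain Java roughly as follows. This is a hypothetical re-implementation for illustration only, not the code of DrillStringUtils.toBinaryString; the exact set of bytes left unescaped and the case of the hex digits are assumptions (lowercase is used here to match the example in the comment).

```java
public class BinaryEscapeSketch {
    /**
     * Escapes the bytes in [start, end): printable ASCII other than the
     * backslash is copied as-is, every other byte becomes \xNN.
     */
    static String escape(byte[] bytes, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++) {
            int ch = bytes[i] & 0xFF; // treat the byte as unsigned
            if (ch >= 0x20 && ch <= 0x7E && ch != '\\') {
                sb.append((char) ch);
            } else {
                sb.append(String.format("\\x%02x", ch));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] magic = {(byte) 0xca, (byte) 0xfe, (byte) 0xba, (byte) 0xbe};
        System.out.println(escape(magic, 0, magic.length)); // prints \xca\xfe\xba\xbe
    }
}
```

Each escaped byte takes four characters, so the escaped string can be up to four times as long as the input. Any output buffer therefore has to be sized from the length of the escaped bytes, not from the length of the payload.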
>
> but then, I have a new problem
>
> SELECT ascii_string_binary(payload) FROM `dfs`.`myfile.avro` LIMIT 1;
>
> Error: SYSTEM ERROR: IndexOutOfBoundsException: index: 0, length: 3484
> (expected: range(0, 256))
>
> Fragment 0:0
>
> [Error Id: d0ab90d6-8b2a-4200-8809-534138c217fb on maprdemo:31010]
> (state=,code=0)
>
>
> knowing that
>
> SELECT length(payload) FROM `dfs`.`myfile.avro` LIMIT 1;
>
> +---------+
> | EXPR$0  |
> +---------+
> | 3484    |
> +---------+
>
>
>
>
> Thanks a lot for your help,
> Franca
>
> On Sat, Apr 29, 2017 at 12:13 AM, Jinfeng Ni <[email protected]> wrote:
>
>> The error seems to indicate that 'PAYLOAD' does not contain UTF8-encoded
>> bytes. The like function is a string function, and it only accepts
>> varchar/char types, which assume the input is UTF8 bytes.
>>
>> You may consider implementing a Drill UDF 'blike' which works similarly
>> to the string function 'like', but can operate on non-UTF8 bytes.
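
The core of such a byte-level match could look something like the sketch below. It covers only the '%abcd%' case (substring containment) in plain Java, with no decoding step; the class and method names are made up for this example, and wrapping it in an actual Drill UDF is not shown.

```java
import java.nio.charset.StandardCharsets;

public class ByteContainsSketch {
    /** Returns true if needle occurs anywhere in haystack, comparing raw bytes. */
    static boolean contains(byte[] haystack, byte[] needle) {
        if (needle.length == 0) {
            return true; // an empty pattern matches everything
        }
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer; // mismatch, try the next start position
                }
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // 0xfd is not valid in UTF8, but a raw byte comparison does not care.
        byte[] payload = {(byte) 0xfd, 'x', 'a', 'b', 'c', 'd', (byte) 0x00};
        byte[] pattern = "abcd".getBytes(StandardCharsets.US_ASCII);
        System.out.println(contains(payload, pattern)); // prints true
    }
}
```

Because the bytes are never decoded, a payload containing 0xfd cannot trigger the UTF8 decoding error seen with LIKE.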
>>
>> On Fri, Apr 28, 2017 at 3:02 PM, Boaz Ben-Zvi <[email protected]> wrote:
>> >  Hi Franca,
>> >
>> >     This issue is specific to the “bytes” type; for other Avro types
>> the LIKE clause matches the printed representation, like:
>> >
>> > select * from dfs.`/data/avro/twitter.snappy.avro` where `timestamp`
>> like '%66%';
>> > +-------------+--------------------------------------+-------------+
>> > |  username   |                tweet                 |  timestamp  |
>> > +-------------+--------------------------------------+-------------+
>> > | miguno      | Rock: Nerf paper, scissors is fine.  | 1366150681  |
>> > | BlizzardCS  | Works as intended.  Terran is IMBA.  | 1366154481  |
>> > +-------------+--------------------------------------+-------------+
>> >
>> > Can you share some sample avro file with “bytes” type?  (I couldn’t
>> find any such sample online)     Maybe we’ll need to open a Jira for this
>> case …
>> >
>> >      Thanks,
>> >
>> >              -- Boaz
>> >
>> > On 4/25/17, 8:45 AM, "franca perrina" <[email protected]> wrote:
>> >
>> >     Hi,
>> >
>> >     I would like to use Drill to query data formatted in avro.
>> >
>> >     My avro schema looks like
>> >
>> >     ..
>> >     {"name":"payload",
>> >       "type":"bytes"}
>> >     ..
>> >
>> >     and the result to the query
>> >
>> >     SELECT payload FROM `dfs`.`myfile.avro` LIMIT 1
>> >
>> >     looks like:
>> >
>> >     +-----------------+
>> >     |     payload     |
>> >     +-----------------+
>> >     | [B@3b8e004e     |
>> >     +-----------------+
>> >
>> >
>> >     My problem is that when I run a query like:
>> >
>> >     SELECT * FROM  `dfs`.`myfile.avro` WHERE `PAYLOAD` LIKE '%abcd%'
>> >
>> >     then I have
>> >     org.apache.drill.common.exceptions.UserRemoteException: SYSTEM
>> ERROR:
>> >     DrillRuntimeException: Unexpected byte 0xfd at position 1008556
>> encountered
>> >     while decoding UTF8 string. Fragment 0:0 [Error Id:
>> >     0c247c14-0e51-402c-ad9a-411cbc445597
>> >     on maprdemo:31010]
>> >
>> >     It seems like drill tries to decode the payload's bytes to UTF8.
>> >
>> >     What I would need is a grep like behaviour, where my payload data is
>> >     considered as is, i.e. binary data, and it is not converted to a
>> string
>> >     data type.
>> >
>> >     Thanks a lot for your help.
>> >     franca
>> >
>> >
>>
>
>
