I noticed that if you specifically use * for file matching, Drill will still 
read hidden files. However, if you only point Drill at a directory, it will 
read the directory and its substructure without reading any hidden files.

select * from `/dir1/*`  - will read hidden files
select * from `/dir1`    - will not read hidden files
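This asymmetry mirrors the Hadoop-style convention that dot-prefixed (and underscore-prefixed) names are treated as "hidden". A minimal Python sketch of the idea follows; it is only an illustration of the convention, not Drill's actual matching code, and the file names are hypothetical:

```python
import fnmatch

def is_hidden(name):
    # Hadoop-style convention: names starting with '.' or '_' are hidden
    return name.startswith('.') or name.startswith('_')

def directory_scan(names):
    # Pointing at a directory: hidden files are filtered out
    return [n for n in names if not is_hidden(n)]

def glob_scan(names, pattern):
    # Explicit pattern matching: fnmatch's '*' matches a leading dot,
    # so hidden files are picked up too
    return [n for n in names if fnmatch.fnmatch(n, pattern)]

files = ['FlumeData.1', 'FlumeData.2', '.FlumeData.3.tmp']
print(directory_scan(files))   # the two visible files only
print(glob_scan(files, '*'))   # all three, including the hidden .tmp file
```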

So it depends on whether you need file name pattern matching or not. Most of 
the time it is a good idea not to mix different data in the same directory 
structure, but rather to use directory structures to separate different data 
and types, as that makes pattern matching easier to manage down the road.

At any rate, you found a solution for your needs. I have not looked into 
whether there are exclusion parameters for pattern matching.
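For what it's worth, plain shell-style globbing has no general exclusion operator, but a character class can sometimes approximate one. The sketch below uses Python's fnmatch purely to illustrate that idea against the file names from this thread; it makes no claim about what Drill's matcher actually supports:

```python
import fnmatch

files = ['FlumeData.1474467815655', 'FlumeData.1474467815656.tmp']

# '*[!p]' matches names whose last character is not 'p', which
# happens to exclude the '.tmp' suffix in this listing
matched = [f for f in files if fnmatch.fnmatch(f, '*[!p]')]
print(matched)   # only the completed file
```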

--Andries

> On Sep 22, 2016, at 4:11 AM, Robin Moffatt <robin.moff...@rittmanmead.com> 
> wrote:
> 
> Hi,
> 
> It still tried to read it, even with a . prefix:
> 
> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
> table(`hdfs`.`/user/flume/incoming/twitter/2016/09/22/*`(type => 'json'));
> Error: DATA_READ ERROR: Failure reading JSON file - Cannot obtain block
> length for
> LocatedBlock{BP-478416316-192.168.10.112-1466151126376:blk_1074005711_265071;
> getBlockSize()=39945; corrupt=false; offset=0;
> locs=[DatanodeInfoWithStorage[192.168.10.115:50010,DS-a0e97909-3d40-4f49-b67f-636e9f10928a,DISK],
> DatanodeInfoWithStorage[192.168.10.114:50010,DS-6c2cd5a6-22c5-4445-9018-ca0f2549a6cf,DISK],
> DatanodeInfoWithStorage[192.168.10.117:50010,DS-70946f9d-95d9-4f35-b19d-97b8dc01cb88,DISK]]}
> 
> File  /user/flume/incoming/twitter/2016/09/22/.FlumeData.1474530954642.tmp
> Record  1
> Fragment 0:0
> 
> [Error Id: 04d44b5d-6d02-4062-9f4a-ebf3831d9ba1 on
> cdh57-01-node-01.moffatt.me:31010] (state=,code=0)
> 
> 
> However - good call on the Flume hdfs.inUsePrefix - by using that I can
> then set my Drill file pattern sufficiently so that it doesn't match on the
> in-use file:
> 
> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
> table(`hdfs`.`/user/flume/incoming/twitter/2016/09/22/Flume*`(type =>
> 'json'));
> +---------+
> | EXPR$0  |
> +---------+
> | 12652   |
> +---------+
> 1 row selected (6.34 seconds)
> 
> 
> So the problem is sidestepped for now - but it would be good to understand,
> if I couldn't modify the producer, whether there's a way to get Drill to
> omit certain patterns from its file matching?
> 
> thanks, Robin.
> 
> 
> On 21 September 2016 at 23:15, Andries Engelbrecht <aengelbre...@maprtech.com> wrote:
> 
>> Add a . prefix to the Flume temp files. Drill will ignore the hidden files
>> when you query the directory structure.
>> 
>> --Andries
>> 
>>> On Sep 21, 2016, at 2:36 PM, Robin Moffatt <robin.moff...@rittmanmead.com> wrote:
>>> 
>>> Hi,
>>> I have a stream of data from Flume landing in HDFS in files of a set size.
>>> I can query these files individually just fine, and across multiple ones
>>> too - except if the wildcard encompasses the *currently open HDFS file that
>>> Flume is writing to*. When this happens, Drill understandably barfs.
>>> 
>>> 0: jdbc:drill:drillbit=localhost> show files in
>>> `hdfs`.`/user/flume/incoming/twitter/2016/09/21/`;
>>> +------------------------------+--------------+---------+---
>> -------+--------+-------------+--------------+--------------
>> ------------+--------------------------+
>>> |             name             | isDirectory  | isFile  |  length  |
>> owner
>>> |    group    | permissions  |        accessTime        |
>>> modificationTime     |
>>> +------------------------------+--------------+---------+---
>> -------+--------+-------------+--------------+--------------
>> ------------+--------------------------+
>>> [...]
>>> | FlumeData.1474467815652      | false        | true    | 1055490  |
>> flume
>>> | supergroup  | rw-r--r--    | 2016-09-21 21:52:07.219  | 2016-09-21
>>> 21:58:58.28   |
>>> | FlumeData.1474467815653      | false        | true    | 1050470  |
>> flume
>>> | supergroup  | rw-r--r--    | 2016-09-21 21:58:58.556  | 2016-09-21
>>> 22:06:28.636  |
>>> | FlumeData.1474467815654      | false        | true    | 1051043  |
>> flume
>>> | supergroup  | rw-r--r--    | 2016-09-21 22:06:29.564  | 2016-09-21
>>> 22:13:40.808  |
>>> | FlumeData.1474467815655      | false        | true    | 1052657  |
>> flume
>>> | supergroup  | rw-r--r--    | 2016-09-21 22:13:40.978  | 2016-09-21
>>> 22:23:00.409  |
>>> | FlumeData.1474467815656.tmp  | false        | true    | 9447     |
>> flume
>>> | supergroup  | rw-r--r--    | 2016-09-21 22:23:00.788  | 2016-09-21
>>> 22:23:00.788  |
>>> +------------------------------+--------------+---------+---
>> -------+--------+-------------+--------------+--------------
>> ------------+--------------------------+
>>> 59 rows selected (0.265 seconds)
>>> 
>>> Note the .tmp file as the last one in the folder
>>> 
>>> Querying a single file works :
>>> 
>>> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
>>> table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815655`(type => 'json'));
>>> +---------+
>>> | EXPR$0  |
>>> +---------+
>>> | 221     |
>>> +---------+
>>> 1 row selected (0.685 seconds)
>>> 
>>> 
>>> As does across multiple files where the wildcard pattern would exclude the
>>> .tmp file:
>>> 
>>> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
>>> table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/FlumeData.147446781564*`(type => 'json'));
>>> +---------+
>>> | EXPR$0  |
>>> +---------+
>>> | 2178    |
>>> +---------+
>>> 1 row selected (1.24 seconds)
>>> 
>>> 
>>> But if I try to query all the files, Drill includes the .tmp file and
>>> errors:
>>> 
>>> 0: jdbc:drill:drillbit=localhost> SELECT count(*) FROM
>>> table(`hdfs`.`/user/flume/incoming/twitter/2016/09/21/*`(type => 'json'));
>>> Error: DATA_READ ERROR: Failure reading JSON file - Cannot obtain block
>>> length for
>>> LocatedBlock{BP-478416316-192.168.10.112-1466151126376:blk_1074004983_264343;
>>> getBlockSize()=9447; corrupt=false; offset=0;
>>> locs=[DatanodeInfoWithStorage[192.168.10.116:50010,DS-39bf5e74-3eec-4447-9cd2-f17b5cc259b8,DISK],
>>> DatanodeInfoWithStorage[192.168.10.113:50010,DS-845945e7-0bc8-44aa-945c-a140ad1f55ab,DISK],
>>> DatanodeInfoWithStorage[192.168.10.115:50010,DS-a0e97909-3d40-4f49-b67f-636e9f10928a,DISK]]}
>>> 
>>> File  /user/flume/incoming/twitter/2016/09/21/FlumeData.1474467815656.tmp
>>> Record  1
>>> Fragment 0:0
>>> 
>>> [Error Id: d3f322cb-c64d-43c8-9231-fb2c96e8589d on
>>> cdh57-01-node-01.moffatt.me:31010] (state=,code=0)
>>> 0: jdbc:drill:drillbit=localhost>
>>> 
>>> 
>>> Is there a way around this with Drill? For example, can I use a regex in
>>> the path? I've tried, but just hit
>>> Error: VALIDATION ERROR: null
>>> 
>>> thanks, Robin.
>> 
>> 
