Re: The praises for Drill

2016-02-25 Thread Antonio Romero (carnorom)
Can you tell us what the volume of those files was? How many records, how many 
files, how many columns?

Sent from my iPhone

> On Feb 25, 2016, at 7:27 PM, "Edmon Begoli"  wrote:
> 
> Hello fellow Drillers,
> 
> I have been inactive on the development side of the project, as we got busy
> being heavy/power users of Drill in the last few months.
> 
> I just want to share some great experiences with the latest versions of
> Drill.
> 
> Just tonight, as we were scrambling to meet the deadline, we were able to
> query two years of flat psv files of claims/billing and clinical data in
> Drill in less than 60 seconds.
> 
> No ETL, no warehousing - just plain SQL against tons of files. Run SQL, get
> results.
> 
> Amazing!
> 
> We have also done some much more important things, and we had a paper
> accepted to Big Data Services about these experiences. The co-author of the
> paper is Drill's own Dr. Ted Dunning :-)
> I will share it once it is published.
> 
> Anyway, cheers to all, and hope to re-join the dev activities soon.
> 
> Best,
> Edmon


The praises for Drill

2016-02-25 Thread Edmon Begoli
Hello fellow Drillers,

I have been inactive on the development side of the project, as we got busy
being heavy/power users of Drill in the last few months.

I just want to share some great experiences with the latest versions of
Drill.

Just tonight, as we were scrambling to meet the deadline, we were able to
query two years of flat psv files of claims/billing and clinical data in
Drill in less than 60 seconds.

No ETL, no warehousing - just plain SQL against tons of files. Run SQL, get
results.

Amazing!
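A sketch of that kind of workflow, for anyone curious (the path, column
positions, and values here are hypothetical, not Edmon's actual data):

```sql
-- Query pipe-separated (PSV) files in place. In Drill's default dfs
-- storage plugin the .psv extension maps to a text format with a '|'
-- delimiter, and text files expose their fields via the `columns` array.
SELECT columns[0] AS claim_id,
       columns[3] AS billed_amount
FROM dfs.`/data/claims`
WHERE columns[1] = '2015'
LIMIT 10;
```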

We have also done some much more important things, and we had a paper
accepted to Big Data Services about these experiences. The co-author of the
paper is Drill's own Dr. Ted Dunning :-)
I will share it once it is published.

Anyway, cheers to all, and hope to re-join the dev activities soon.

Best,
Edmon


Re: Drill error with large sort

2016-02-25 Thread Abdel Hakim Deneche
Not so short answer:

In Drill 1.5 (I assume you are using 1.5) we have an improved allocator
that better tracks how much memory each operator is using. In your case it
seems that the data has very wide columns that are causing Sort to choke on
the very first batch of data (1024 records taking up 224MB!!!) because that
is way more than its memory limit (around 178MB in your particular case).
Drill uses a fancy equation to compute this limit, and increasing the option
from my previous message (planner.memory.max_query_memory_per_node) will
raise the sort's limit. More details here:

http://drill.apache.org/docs/configuring-drill-memory/
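For reference, that option can be raised per session before running the
query (the value is in bytes; 4 GB here is just an example):

```sql
-- Give the sort more room: raise the per-node memory budget for queries.
-- The default is 2 GB (2147483648); this sets 4 GB for the session.
ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 4294967296;
```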

On Thu, Feb 25, 2016 at 5:26 PM, Abdel Hakim Deneche 
wrote:

> Short answer:
>
> increase the value of planner.memory.max_query_memory_per_node; by default
> it's set to 2GB, try setting it to 4 or even 8GB. This should get the query
> to pass.
>
> On Thu, Feb 25, 2016 at 5:24 PM, Jeff Maass  wrote:
>
>>
>> If you are open to changing the query:
>>   # try removing the functions on the 5th column
>>   # is there any way you could further limit the query?
>>   # does the query finish if you add a limit / top clause?
>>   # what do the logs say?
>>
>> 
>> From: Paul Friedman 
>> Sent: Thursday, February 25, 2016 7:07:12 PM
>> To: user@drill.apache.org
>> Subject: Drill error with large sort
>>
>> I’ve got a query reading from a large directory of parquet files (41 GB)
>> and I’m consistently getting this error:
>>
>>
>>
>> Error: RESOURCE ERROR: One or more nodes ran out of memory while executing
>> the query.
>>
>> Unable to allocate sv2 for 1023 records, and not enough batchGroups to
>> spill.
>> batchGroups.size 0
>> spilledBatchGroups.size 0
>> allocated memory 224287987
>> allocator limit 178956970
>> Fragment 0:0
>>
>> [Error Id: 878d604c-4656-4a5a-8b46-ff38a6ae020d on
>> chai.dev.streetlightdata.com:31010] (state=,code=0)
>>
>> Direct memory is set to 48GB and heap is 8GB.
>>
>> The query is:
>>
>> select probe_id, provider_id, is_moving, mode,
>>        cast(convert_to(points, 'JSON') as varchar(1))
>> from dfs.`/home/paul/data`
>> where start_lat between 24.4873780449008 and 60.0108911181433
>>   and start_lon between -139.065890469841 and -52.8305074899881
>>   and provider_id = '343'
>>   and mod(abs(hash(probe_id)), 100) = 0
>> order by probe_id, start_time;
>>
>> I’m also using the “example” drill-override configuration.
>>
>> Any help would be appreciated.
>>
>> Thanks.
>>
>> ---Paul
>>
>
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   
>
>
> Now Available - Free Hadoop On-Demand Training
> 
>





Re: Drill error with large sort

2016-02-25 Thread Jeff Maass

If you are open to changing the query:
  # try removing the functions on the 5th column
  # is there any way you could further limit the query?
  # does the query finish if you add a limit / top clause?
  # what do the logs say?
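Along the lines of the third suggestion, a quick sanity check (sketch only,
reusing the table and predicate from the original query) would be:

```sql
-- Sketch: add a LIMIT so the planner can use a bounded (top-N style)
-- sort instead of a full sort, to see whether memory is the only problem.
select probe_id, provider_id, is_moving, mode
from dfs.`/home/paul/data`
where provider_id = '343'
order by probe_id, start_time
limit 1000;
```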


From: Paul Friedman 
Sent: Thursday, February 25, 2016 7:07:12 PM
To: user@drill.apache.org
Subject: Drill error with large sort

I’ve got a query reading from a large directory of parquet files (41 GB)
and I’m consistently getting this error:



Error: RESOURCE ERROR: One or more nodes ran out of memory while executing
the query.

Unable to allocate sv2 for 1023 records, and not enough batchGroups to
spill.
batchGroups.size 0
spilledBatchGroups.size 0
allocated memory 224287987
allocator limit 178956970
Fragment 0:0

[Error Id: 878d604c-4656-4a5a-8b46-ff38a6ae020d on
chai.dev.streetlightdata.com:31010] (state=,code=0)

Direct memory is set to 48GB and heap is 8GB.

The query is:

select probe_id, provider_id, is_moving, mode,
       cast(convert_to(points, 'JSON') as varchar(1))
from dfs.`/home/paul/data`
where start_lat between 24.4873780449008 and 60.0108911181433
  and start_lon between -139.065890469841 and -52.8305074899881
  and provider_id = '343'
  and mod(abs(hash(probe_id)), 100) = 0
order by probe_id, start_time;

I’m also using the “example” drill-override configuration.

Any help would be appreciated.

Thanks.

---Paul


Drill error with large sort

2016-02-25 Thread Paul Friedman
I’ve got a query reading from a large directory of parquet files (41 GB)
and I’m consistently getting this error:



Error: RESOURCE ERROR: One or more nodes ran out of memory while executing
the query.

Unable to allocate sv2 for 1023 records, and not enough batchGroups to
spill.
batchGroups.size 0
spilledBatchGroups.size 0
allocated memory 224287987
allocator limit 178956970
Fragment 0:0

[Error Id: 878d604c-4656-4a5a-8b46-ff38a6ae020d on
chai.dev.streetlightdata.com:31010] (state=,code=0)

Direct memory is set to 48GB and heap is 8GB.

The query is:

select probe_id, provider_id, is_moving, mode,
       cast(convert_to(points, 'JSON') as varchar(1))
from dfs.`/home/paul/data`
where start_lat between 24.4873780449008 and 60.0108911181433
  and start_lon between -139.065890469841 and -52.8305074899881
  and provider_id = '343'
  and mod(abs(hash(probe_id)), 100) = 0
order by probe_id, start_time;

I’m also using the “example” drill-override configuration.

Any help would be appreciated.

Thanks.

---Paul


Avro support in Drill - Missing support for the IN operator and other frustrating things

2016-02-25 Thread Stefán Baxter
Hi,

This query targets Avro files in the latest 1.5 release:

0: jdbc:drill:zk=local> select count(*) from
dfs.asa.`/streaming/venuepoint/transactions/` as s where s.sold_to =
'Customer/4-2492847';
+---------+
| EXPR$0  |
+---------+
| 5788    |
+---------+

0: jdbc:drill:zk=local> select count(*) from
dfs.asa.`/streaming/venuepoint/transactions/` as s where s.sold_to IN
('Customer/4-2492847');
+---------+
| EXPR$0  |
+---------+
| 0       |
+---------+

It shows that the IN operator does not work with Avro (works with Parquet).
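In the meantime, one workaround sketch (untested against Avro, and the
second customer id below is made up) is to expand the IN list into OR'd
equality predicates, since plain equality evaluates correctly against the
same data:

```sql
-- Workaround sketch: rewrite `sold_to IN (a, b)` as explicit equalities.
select count(*)
from dfs.asa.`/streaming/venuepoint/transactions/` as s
where s.sold_to = 'Customer/4-2492847'
   or s.sold_to = 'Customer/4-1111111';  -- hypothetical second value
```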

This finally tips us over. We have invested hundreds of hours moving all
streaming/fresh data from JSON to Avro but the Avro part of Drill is broken
in too many ways to recommend its use to anyone.

Attempts to report Avro errors and shortcomings, like the missing support
for dirX, have had no results.

I think it would be prudent to warn people on the Drill website that the
Avro support is experimental, at best.

- Stefán Baxter


Add rest server to each drill node

2016-02-25 Thread Jeff Maass
What is the prescribed / appropriate way to do the below in Apache Drill?


What we want is to do what one can do with Elasticsearch:
  * write our REST service endpoint in Java
  * consume the Elasticsearch library
  * deploy our application
  * have an Elasticsearch cluster that also has our code running in it
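As far as I know, Drill has no Elasticsearch-style plugin model for running
service code inside the drillbit, but each drillbit already exposes a REST
API (port 8047 by default), and a standalone REST service can reach the
cluster through the Drill JDBC driver. A minimal sketch of the JDBC route
(the ZooKeeper quorum, the `drillbits1` cluster id, and the class name are
placeholders for your deployment):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: a service-side helper that talks to a Drill cluster over JDBC.
public class DrillQueryClient {

    // Drill's JDBC URL format: jdbc:drill:zk=<zk quorum>/<zk root>/<cluster id>
    static String jdbcUrl(String zkQuorum) {
        return "jdbc:drill:zk=" + zkQuorum + "/drill/drillbits1";
    }

    // Run a query against the cluster and print the first column of each row.
    static void runQuery(String zkQuorum, String sql) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl(zkQuorum));
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Without a live cluster, just show the connection URL being built.
        System.out.println(jdbcUrl("zk1:2181,zk2:2181,zk3:2181"));
    }
}
```

A REST framework of your choice would then call `runQuery` from its handler;
the service runs next to the cluster rather than inside it.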