Hi Yun,
Can you try using the “managed” version of the external sort – either
change this option to false:
0: jdbc:drill:zk=local> select * from sys.options where name like '%man%';
+----------------------------+----------+-------------------+--------------+----------+----------+-------------+-----------+------------+
| name                       | kind     | accessibleScopes  | optionScope  | status   | num_val  | string_val  | bool_val  | float_val  |
+----------------------------+----------+-------------------+--------------+----------+----------+-------------+-----------+------------+
| exec.sort.disable_managed  | BOOLEAN  | ALL               | BOOT         | DEFAULT  | null     | null        | false     | null       |
+----------------------------+----------+-------------------+--------------+----------+----------+-------------+-----------+------------+
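That is, using the standard ALTER SYSTEM syntax (shown here as an illustration; setting it to false keeps the managed sort enabled):

```sql
ALTER SYSTEM SET `exec.sort.disable_managed` = false;
```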
Or override it to 'false' in the configuration:
0: jdbc:drill:zk=local> select * from sys.boot where name like '%managed%';
+-----------------------------------------------+----------+-------------------+--------------+---------+----------+-------------+-----------+------------+
| name                                          | kind     | accessibleScopes  | optionScope  | status  | num_val  | string_val  | bool_val  | float_val  |
+-----------------------------------------------+----------+-------------------+--------------+---------+----------+-------------+-----------+------------+
| drill.exec.options.exec.sort.disable_managed  | BOOLEAN  | BOOT              | BOOT         | BOOT    | null     | null        | false     | null       |
+-----------------------------------------------+----------+-------------------+--------------+---------+----------+-------------+-----------+------------+
i.e., in the drill-override.conf file:
sort: {
  external: {
    disable_managed: false
  }
}
Please let us know if this change helped,
-- Boaz
On 11/2/17, 1:12 PM, "Yun Liu" <[email protected]> wrote:
Please let me know what further information I can provide to get this
going. I am also experiencing a separate issue:
RESOURCE ERROR: One or more nodes ran out of memory while executing the
query.
Unable to allocate sv2 for 8501 records, and not enough batchGroups to
spill.
batchGroups.size 1
spilledBatchGroups.size 0
allocated memory 42768000
allocator limit 41943040
Current settings are:
planner.memory.max_query_memory_per_node = 10GB
Heap = 12G
Direct memory = 32G
Perm = 1024M
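The numbers in the error itself are telling (a quick check of the reported values, not Drill's internal memory accounting): the sort's allocator was capped at 40 MiB, far below the configured 10GB per-node query maximum, and the allocation request just exceeded that cap.

```python
# Values copied from the error message above
allocated = 42_768_000  # bytes the sort operator had allocated
limit = 41_943_040      # the sort's allocator limit

# The limit is exactly 40 MiB, which suggests the sort operator itself
# received only a small slice of the 10 GB query budget.
print(limit / 2**20)      # 40.0 (MiB)
print(allocated - limit)  # 824960 bytes over the cap
```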
What is the issue here?
Thanks,
Yun
-----Original Message-----
From: Yun Liu [mailto:[email protected]]
Sent: Thursday, November 2, 2017 3:52 PM
To: [email protected]
Subject: RE: Drill Capacity
Yes - I increased planner.memory.max_query_memory_per_node to 10GB, heap to
12G, direct memory to 16G, and perm to 1024M.
It didn't have any schema changes. With the same file format but less
data, it works perfectly OK. I am unable to tell if there's corruption.
Yun
-----Original Message-----
From: Andries Engelbrecht [mailto:[email protected]]
Sent: Thursday, November 2, 2017 3:35 PM
To: [email protected]
Subject: Re: Drill Capacity
What memory setting did you increase? Have you tried 6 or 8GB?
How much memory is allocated to Drill Heap and Direct memory for the
embedded Drillbit?
Also did you check the larger document doesn’t have any schema changes or
corruption?
--Andries
On 11/2/17, 12:31 PM, "Yun Liu" <[email protected]> wrote:
Hi Kunal and Andries,
Thanks for your reply. We need JSON in this case because Drill only
supports up to 65,536 columns in a CSV file. I also tried increasing the memory
size to 4GB but I am still experiencing the same issues. Drill is installed in
embedded mode.
Thanks,
Yun
-----Original Message-----
From: Kunal Khatua [mailto:[email protected]]
Sent: Thursday, November 2, 2017 2:01 PM
To: [email protected]
Subject: RE: Drill Capacity
Hi Yun
Andries' solution should address your problem. However, do understand
that, unlike CSV files, a JSON file cannot be processed in parallel, because
there is no clear record delimiter (CSV data usually has a newline character
to indicate the end of a record). So, the larger a file gets, the more work a
single minor fragment has to do in processing it, including maintaining the
internal data structures that represent the complex JSON document.
The preferable approach would be to split the data across more, smaller
JSON files so that they can be processed in parallel.
Hope that helps.
~ Kunal
-----Original Message-----
From: Andries Engelbrecht [mailto:[email protected]]
Sent: Thursday, November 02, 2017 10:26 AM
To: [email protected]
Subject: Re: Drill Capacity
How much memory is allocated to the Drill environment?
Embedded or in a cluster?
I don’t think there is a particular limit, but a single JSON file will
be read by a single minor fragment; in general it is better to match the
number and size of files to the Drill environment.
In the short term, try bumping up
planner.memory.max_query_memory_per_node in the options and see if that works
for you.
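For example (the option value is in bytes; the 4GB figure here is just an illustration):

```sql
ALTER SYSTEM SET `planner.memory.max_query_memory_per_node` = 4294967296;
```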
--Andries
On 11/2/17, 7:46 AM, "Yun Liu" <[email protected]> wrote:
Hi,
I've been using Apache Drill actively and am wondering what the
capacity of Drill is. I have a JSON file which is 390MB and it keeps throwing
a DATA_READ ERROR. I have another JSON file with the exact same format but only
150MB, and it processes fine. When I did a select on the large JSON, it
returned successfully for some of the fields. None of these errors really apply
to me, so I am trying to understand what JSON file sizes Drill
supports, or whether there's something else I missed.
Thanks,
Yun Liu
Solutions Delivery Consultant
321 West 44th St | Suite 501 | New York, NY 10036
+1 212.871.8355 office | +1 646.752.4933 mobile
CAST, Leader in Software Analysis and Measurement
Achieve Insight. Deliver Excellence.
Join the discussion http://blog.castsoftware.com/
LinkedIn<http://www.linkedin.com/companies/162909> |
Twitter<http://twitter.com/onquality> |
Facebook<http://www.facebook.com/pages/CAST/105668942817177>