Hi Yun,

I’m going to give you multiple ways to understand the issue. I generally like 
to see the full logs to diagnose such problems, but we’ll start with the 
information you’ve provided thus far.
 
How large is each record in your file? How many fields? How many bytes? 
(Alternatively, how big is a single input file and how many records does it 
contain?)

You mention the limit of 64K columns in CSV. This makes me wonder if you have a 
“jumbo” record. If each individual record is large, then there won’t be enough 
space in the sort to take even a single batch of records, and you’ll get the 
sv2 error that you saw.

We can guess the size, however, from the info you provided:

batchGroups.size 1
spilledBatchGroups.size 0
allocated memory 42768000
allocator limit 41943040

This says you have a batch in memory and are trying to allocate some memory 
(the “sv2”). The allocated memory number tells us that each batch size is 
probably ~43 MB. But, the sort only has 42 MB to play with. The sort needs at 
least two batches in memory to make progress, hence the out-of-memory errors.
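A quick back-of-the-envelope check of those numbers (the byte counts are taken straight from your log):

```python
allocated = 42_768_000   # "allocated memory" from your log, in bytes
limit     = 41_943_040   # "allocator limit" from your log (exactly 40 MiB)

print(f"batch size  ~{allocated / 1e6:.1f} MB")  # ~42.8 MB
print(f"sort budget ~{limit / 1e6:.1f} MB")      # ~41.9 MB
print("over budget:", allocated > limit)         # True
```

So a single batch already exceeds the sort’s entire allocation, which is consistent with the failure you saw.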

It would be nice to confirm this from the logs, but unfortunately, Drill does 
not normally log the size of each batch. As it turns out, however, the 
“managed” version that Boaz mentioned added more logging around this problem: 
it will tell you how large it thinks each batch is, and will warn if you have, 
say, a 43 MB batch but only 42 MB in which to sort.

(If you do want to use the “managed” version of the sort, I suggest you try 
Drill 1.12 when it is released as that version contains additional fixes to 
handle constrained memory.)

Also, at present, the JSON record reader loads 4096 records into each batch. If 
your file has at least that many records, then we can guess each record is 
about 43 MB / 4096 ≈ 10 KB in size. (You can confirm, as noted above, by 
dividing total file size by record count.)
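The same arithmetic as a sketch (4096 is the JSON reader’s batch size; the batch size in bytes is the estimate from your log):

```python
batch_bytes = 42_768_000   # estimated batch size, from the log above
records_per_batch = 4096   # JSON record reader rows per batch

record_bytes = batch_bytes / records_per_batch
print(f"~{record_bytes / 1024:.1f} KiB per record")  # ~10.2 KiB
```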

We are doing work to handle such large batches, but that work is not yet 
available in a release. Unfortunately, in the meantime, we also don’t let you 
control the batch size. But we can provide another solution.

Let's explain why the message you provided said that the “allocator limit” was 
42 MB. Drill does the following to allocate memory to the sort:

* Take the “max query memory per node” (default of 2 GB regardless of actual 
direct memory);
* Divide by the number of sort operators in the plan (as shown in the 
visualized query profile);
* Divide by the “planner width,” which is, by default, 70% of the number of 
cores on your system.

In your case, if you are using the default 2 GB total, but getting 41 MB per 
sort, the divisor is 50. Maybe you have 2 sorts and 32 cores? (2 * 32 * 70% =~ 
45.) Or some other combination.
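Here is that allocation rule as a rough sketch. The sort count and core count are guesses (as above), and the exact rounding Drill applies may differ, so treat the result as ballpark only:

```python
GB = 1024**3
MiB = 1024**2

max_query_memory = 2 * GB   # planner.memory.max_query_memory_per_node default
num_sorts = 2               # guessed number of sort operators in the plan
cores = 32                  # guessed core count
width = int(cores * 0.7)    # planner.width.max_per_node default: 70% of cores

per_sort = max_query_memory / (num_sorts * width)
print(f"~{per_sort / MiB:.0f} MiB per sort")  # ~47 MiB with these guesses
```

That lands in the same ballpark as the ~40 MiB allocator limit in your log, which is why some divisor around 50 looks likely.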

We can’t reduce the number of sorts; that’s determined by your query. But, we 
can play with the other numbers.

First, we can increase the memory per query:

ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 4294967296

That is, 4 GB. This obviously means you must have at least 6 GB of direct 
memory; more is better.

And/or, we can reduce the number of fragments:

ALTER SESSION SET `planner.width.max_per_node` = <a number>

The value is a bit tricky. Drill normally creates a number of fragments equal 
to 70% of the number of CPUs on your system. Let’s say you have 32 cores. If 
so, change the max_per_node to, say, 10 or even 5. This will mean fewer sorts 
and so more memory per sort, helping compensate for the “jumbo” batches in your 
query. Pick a number based on your actual number of cores.

As an alternative, as Ted suggested, you could create a larger number of 
smaller files as this would solve the batch size problem while also getting the 
parallelization benefits that Kunal mentioned.

Those are three separate possible solutions. Try them one by one or (carefully) 
together.

- Paul

>> On 11/2/17, 12:31 PM, "Yun Liu" <[email protected]> wrote:
>> 
>>    Hi Kunal and Andries,
>> 
>>    Thanks for your reply. We need json in this case because Drill only
>> supports up to 65536 columns in a csv file.
