Hi Yun,

I'll give you several ways to understand the issue based on the information so far. I generally like to see the full logs to diagnose problems like this, but we can start with what you've provided. How large is each record in your file? How many fields? How many bytes? (Alternatively, how big is a single input file and how many records does it contain?)
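If you don't have a record count handy, you can get one from Drill itself. This is just a sketch; the dfs path is a placeholder for your actual file:

    SELECT COUNT(*) AS record_count
    FROM dfs.`/path/to/your/file.json`;

Dividing the file's size in bytes (from the filesystem) by that count gives a rough average record size.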
You mention the limit of 64K columns in CSV. This makes me wonder if you have "jumbo" records. If each individual record is large, there won't be enough space in the sort to hold even a single batch of records, and you'll get the sv2 error that you saw.

We can, however, estimate the batch size from the info you provided:

    batchGroups.size 1
    spilledBatchGroups.size 0
    allocated memory 42768000
    allocator limit 41943040

This says you have one batch in memory and are trying to allocate more memory (the "sv2"). The allocated-memory number tells us that each batch is probably about 43 MB, but the sort has only about 42 MB to play with. The sort needs at least two batches in memory to make progress, hence the out-of-memory error.

It would be nice to confirm this from the logs, but unfortunately Drill does not normally log the size of each batch. As it turns out, however, the "managed" version of the sort that Boaz mentioned added more logging around this problem: it will tell you how large it thinks each batch is, and will warn if you have, say, a 43 MB batch but only 42 MB in which to sort it. (If you do want to use the managed sort, I suggest you try Drill 1.12 when it is released, as that version contains additional fixes for constrained-memory cases.)

Also, at present, the JSON record reader loads 4096 records into each batch. If your file has at least that many records, we can estimate that each record is about 43 MB / 4096 =~ 10 KB. (You can confirm, as noted above, by dividing total file size by record count.) We are doing work to handle such large batches, but it is not yet available in a release, and in the meanwhile we don't let you control the batch size. But we can offer other solutions.

First, let's explain why the message you provided said the "allocator limit" was 42 MB. Drill does the following to allocate memory to each sort:

* Take the "max query memory per node" (default of 2 GB, regardless of actual direct memory),
* Divide by the number of sort operators in the plan (as shown in the visualized query profile),
* Divide by the "planner width," which is, by default, 70% of the number of cores on your system.

In your case, if you are using the default 2 GB but getting ~42 MB per sort, the divisor is about 50. Maybe you have 2 sorts and 32 cores? (2 * 32 * 70% =~ 45.) Or some other combination. (See the worked arithmetic at the end of this note.)

We can't reduce the number of sorts; that is determined by your query. But we can adjust the other numbers. First, we can increase the memory per query:

    ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 4294967296;

That is, 4 GB. This obviously means you must have at least 6 GB of direct memory; more is better.

And/or we can reduce the number of fragments:

    ALTER SESSION SET `planner.width.max_per_node` = <a number>;

The value is a bit tricky. Drill normally creates a number of fragments equal to 70% of the number of CPUs on your system. Let's say you have 32 cores; if so, change max_per_node to, say, 10 or even 5. This means fewer sorts per node, and thus more memory per sort, which helps compensate for the "jumbo" batches in your query. Pick a number based on your actual core count.

As an alternative, as Ted suggested, you could create a larger number of smaller files; this would solve the batch-size problem while also getting the parallelization benefits that Kunal mentioned.

That is three separate possible solutions. Try them one by one or (carefully) together. Two quick sketches follow that may help you apply them.
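First, to make the memory arithmetic concrete, here is the back-of-the-envelope calculation sketched above. The inputs (2 sorts, 32 cores) are guesses about your system, not facts, and Drill's actual computation differs in its details:

    memory per sort ~= max_query_memory_per_node / (sorts x per-node width)
                    ~= 2147483648 / (2 x 32 x 70%)
                    ~= 2147483648 / 45
                    ~= 48 MB

That lands in the same neighborhood as the 42 MB allocator limit in your error message, which is what makes me suspect a configuration along these lines.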
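Second, here is what the combined session setup might look like. The values are illustrative assumptions (4 GB per query, a width of 10); pick your own based on your core count and available direct memory:

    ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 4294967296;
    ALTER SESSION SET `planner.width.max_per_node` = 10;

    -- Optional: confirm what the session is now using.
    SELECT * FROM sys.options
    WHERE name IN ('planner.memory.max_query_memory_per_node',
                   'planner.width.max_per_node');

These are session-level settings, so they won't affect other workloads on the cluster.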
- Paul

>> On 11/2/17, 12:31 PM, "Yun Liu" <[email protected]> wrote:
>>
>> Hi Kunal and Andries,
>>
>> Thanks for your reply. We need json in this case because Drill only
>> supports up to 65536 columns in a csv file.
