Hello community,
The VM where I run Drill has only 2 GB of memory, but the input file is 3.2 GB:
$ du -h rate.csv
3.2G rate.csv
$ free -m
              total        used        free      shared  buff/cache   available
Mem:           1992        1243         696           4          52         647
Swap:          1023         532         491
Even so, Drill runs the query without any problem:
apache drill (dfs.skydrive)> select columns[1] as itemID, count(*) as dd
from `rate.csv` group by itemID order by dd desc limit 50;
+------------+-------+
| itemID | dd |
+------------+-------+
| B0054JZC6E | 25368 |
| B00FAPF5U0 | 24024 |
| B009UX2YAC | 23956 |
| 0439023483 | 21398 |
| 030758836X | 19867 |
| B0051VVOB2 | 19529 |
| B005SUHPO6 | 18688 |
| B00G5LQ5MU | 18645 |
| B0074BW614 | 18244 |
| B000GF7ZRA | 17948 |
| B00DR0PDNE | 16454 |
| B0064X7B4A | 16239 |
| B00DJFIMW6 | 16221 |
| B00I8Q77Y0 | 15601 |
| B00992CF6W | 15294 |
| B00AREIAI8 | 15080 |
| B00AWH595M | 14393 |
| B005ZXWMUS | 14310 |
| B007WTAJTO | 14172 |
| 0439023513 | 14114 |
| B004KNWWMW | 13572 |
| B009HKL4B8 | 13527 |
| B008JK6W5K | 13517 |
| B00E8KLWB4 | 13207 |
| B0063IH60K | 13126 |
| B007Q1W586 | 13102 |
| 0385537859 | 12973 |
| B007T36S34 | 12855 |
| B00DAHSVYC | 12633 |
| 0007444117 | 12629 |
| 0375831002 | 12571 |
| 038536315X | 12564 |
| B00BWYQ9YE | 12316 |
| 0345803485 | 12290 |
| B0019EHU8G | 12285 |
| B006GWO5WK | 12226 |
| 1608838137 | 11906 |
| B004MC8CA2 | 11835 |
| B0086700CM | 11787 |
| 0316055433 | 11746 |
| B003ELYQGG | 11617 |
| B00CWY76CC | 11546 |
| B001KXZ808 | 11365 |
| B0077SPHM4 | 11101 |
| B00I3MPDP4 | 10938 |
| B004LLIKVU | 10778 |
| B006OC2ANS | 10617 |
| B008Y7SMQU | 10577 |
| 0849922070 | 10424 |
| B00B2V66VS | 10417 |
+------------+-------+
50 rows selected (92.905 seconds)
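For what it is worth, I assume I could also look at the physical plan to see
which operators handle the grouping and ordering (HashAgg, Sort, etc.), since
those are presumably the memory-hungry parts. This is just a sketch of what I
would run, i.e. the same query prefixed with EXPLAIN PLAN FOR:

-- Show the physical plan so I can see which aggregation/sort
-- operators are used for this query (same query as above).
explain plan for
select columns[1] as itemID, count(*) as dd
from `rate.csv`
group by itemID
order by dd desc
limit 50;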
May I know how Drill can optimize memory usage like this? Is it swapping
to disk?
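In case it helps with the diagnosis, I assume the memory- and spill-related
settings can be listed from sys.options like below (I have not changed any of
them from their defaults, and the exact option names may differ between Drill
versions):

-- List options whose names mention memory or spill; I assume
-- sys.options exposes them this way in my Drill version.
select *
from sys.options
where name like '%memory%'
   or name like '%spill%';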
Thanks.