Hi Stone,
Welcome to the Drill mailing list!
The situation you describe is not the expected behavior. Something is going
wrong.
You mention that, in the large-file case, you do get results, but the values are
empty. This suggests that the query did, in fact, run (that is, it didn't run
out of memory or encounter some other internal problem). The question is: why
were the columns empty?
And, when you say "empty", do you mean NULL (a SQL NULL) or an empty string?
(Or might they be a string of spaces?)
You mention that you used both the Web console and the API. Do you mean the
REST API? What results do you get if you use the Sqlline command-line tool?
Your configuration seems fine. Your query is a simple scan and filter, so not
much memory should be needed; 32 GB should be plenty. Are you using the default
Drill memory options?
The odd thing is that you have two files: one small, one large. The small one
works. Is the schema the same for the two Parquet files?
Parquet files are divided into row groups. How large are the row groups in your
file?
With only the data you provided, my hunch is that the problem is related to
file size somehow, but we will have to poke around to find where the problem
lies.
Please do file a bug. It would be super helpful if you can include:
* The same description as in your e-mail.
* The actual output of the query (from Sqlline or the REST API).
* An example small Parquet file. Mostly we will need to see the schema so we
can try to reproduce the problem.
Here is a possible workaround. Drill does not need the entire data set in one
file: you can create a directory of files. Try doing this with files of, say,
100 MB or 256 MB each. A side benefit is that, for real queries (not just
LIMIT 10), having multiple files lets Drill divide the query into parallel
scans so it can use all the CPUs on your machine.
Thanks,
- Paul
On Tuesday, January 7, 2020, 5:14:12 PM PST, Stone Zhong
<[email protected]> wrote:
Hi,
This is my first message to the user email list.
*Symptom*: I have a large Parquet file (about 1 TB in size), and I run a query
like the one below against it:
select * from `dfs`.`/root/data/hwd/machine_main2`
WHERE
verifiedSerial='1547NM70EN'
LIMIT 20;
However, I found that in both the Drill Web UI and the API, the returned
column `verifiedSerial` has an empty value, which should be impossible, since
the explicit condition in the query means `verifiedSerial` cannot be empty.
*my configuration*:
- I am running Apache Drill 1.17 (the latest version now) in embedded mode
- The Parquet file is generated from Apache Spark (version 3.x). For example,
I use code like this to generate it:
`df.coalesce(1).write.mode('overwrite').parquet('/tmp/foo.parquet')`
- The Parquet file is about 1 TB in size
- My machine running Drill has 32 GB of RAM; Drill is pretty much the only
app running on that machine
What is weird is that if I filter the Parquet file in Spark (which results in a
smaller Parquet file), everything works fine.
My questions are:
1. Is the behavior I described expected?
2. If (1) is not expected, how can I avoid this behavior? What kind of
configuration change is needed? Or is this perhaps a bug?
Thanks in advance for any kind of help!
Regards,
Stone Zhong