Hi Stone,

Welcome to the Drill mailing list!

The situation you describe is not the expected behavior. Something is going 
wrong.

You mention that, in the large-file case, you do get results, but the values 
are empty. This suggests that the query did, in fact, run (that is, it didn't 
run out of memory or hit some other internal problem). The question is: why 
were the columns empty?

And, when you say "empty", do you mean SQL NULL or an empty string? (Or might 
they be a string of spaces?)
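One quick way to tell the three cases apart is to project a few diagnostics 
alongside the column. A sketch against your table path (CHAR_LENGTH, TRIM, and 
IS NULL are all standard Drill SQL):

```sql
SELECT verifiedSerial,
       verifiedSerial IS NULL            AS is_sql_null,
       CHAR_LENGTH(verifiedSerial)       AS len,          -- 0 means empty string
       CHAR_LENGTH(TRIM(verifiedSerial)) AS trimmed_len   -- 0 with len > 0 means all spaces
FROM `dfs`.`/root/data/hwd/machine_main2`
LIMIT 10;
```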

You mention that you used both the Web console and the API. Do you mean the 
REST API? What results do you get if you use the SQLLine command-line tool?


Your configuration seems fine. Your query is a simple scan and filter, so not 
much memory should be needed; 32 GB should be plenty. Are you using the default 
Drill memory options?
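For reference, the memory limits live in conf/drill-env.sh. The shipped file 
has them commented out, so the fallback values apply; they look roughly like 
this (exact defaults may vary by version):

```
export DRILL_HEAP=${DRILL_HEAP:-"4G"}
export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"8G"}
```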

The odd thing is that you have two files: one small, one large. The small one 
works. Is the schema the same for the two Parquet files?

Parquet files are divided into row groups. How large are the row groups in your 
file?


With only the data you provided, my hunch is that the problem is related to 
file size somehow, but we will have to poke around to find where the problem 
lies.


Please do file a bug. It would be super helpful if you can include:
* The same description as in your e-mail.
* The actual output of the query (from SQLLine or the REST API).
* An example small Parquet file. Mostly we will need to see the schema so we 
can try to reproduce the problem.

Here is a possible workaround. Drill does not need the entire data set in one 
file: you can create a directory of files. Try doing this with files of, say, 
100 MB or 256 MB in size. A side benefit is that, for real queries (not just 
LIMIT 10), having multiple files lets Drill divide the query into parallel 
scans so it can use all the CPUs on your machine.
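As a back-of-the-envelope sketch of what that means for your 1 TB data set 
(the repartition call mirrors the coalesce(1) line from your Spark job; the 
output path is hypothetical):

```python
# Rough sizing: split ~1 TB of data into ~256 MB output files.
total_bytes = 1 * 1024**4          # ~1 TB of input data
target_file_bytes = 256 * 1024**2  # ~256 MB per output file
num_files = total_bytes // target_file_bytes
print(num_files)  # 4096 output files

# In Spark, replace coalesce(1) with something like:
#   df.repartition(num_files).write.mode('overwrite').parquet('/tmp/foo_split')
```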

Thanks,
- Paul

 

    On Tuesday, January 7, 2020, 5:14:12 PM PST, Stone Zhong 
<[email protected]> wrote:  
 
 Hi,

This is my first message to the user email list.

*Symptom*: I have a large parquet file (size = 1 TB). I run a query against
the parquet file; the query is like the one below:

select * from `dfs`.`/root/data/hwd/machine_main2`
WHERE
    verifiedSerial='1547NM70EN'
LIMIT 20;

However, I found that in both the Drill Web UI and the API, the returned
'verifiedSerial' column has an empty value, which should be impossible since
the query has an explicit condition that the column `verifiedSerial` cannot be
empty.

*my configuration*:

  - I am running Apache Drill 1.17 (the latest version now) in embedded
  mode
  - The parquet file is generated by Apache Spark (3.x). E.g., I
  use code like this to generate the parquet file:
  `df.coalesce(1).write.mode('overwrite').parquet('/tmp/foo.parquet')`
  - The parquet file is about 1 TB in size
  - My machine running Drill has 32 GB RAM; Drill is pretty much the only
  app running on that machine


What is weird is that if I filter the parquet file in Spark (which results in
a smaller parquet file), everything works fine.

My questions are:

  1. Is the behavior I described expected?
  2. If (1) is not expected, how can I avoid this behavior? What kind of
  configuration change is needed? Or is this perhaps a bug?

Thanks in advance for any kind of help!

Regards,
Stone Zhong
  
