Hi,

Your query
    parquetFile = spark.read.parquet("path/to/hdfs")
    parquetFile.createOrReplaceTempView("parquetFile")
    spark.sql("SELECT * FROM parquetFile WHERE field1 = 'value' ORDER BY timestamp LIMIT 10000")

will be lazily evaluated and won't do anything until the SQL statement is
actioned with .show() etc. In local mode there is only one executor.
Assuming you are actioning the SQL statement, it will have to do a full
table scan to find field1 = 'value'.

    scala> spark.sql("""select * from tmp where 'Account Type' = 'abc' limit 1000""").explain()
    == Physical Plan ==
    LocalTableScan <empty>, [Date#16, Type#17, Description#18, Value#19, Balance#20, Account Name#21, Account Number#22]

(Note that in Spark SQL single quotes make 'Account Type' a string literal
rather than a column reference, so this predicate compares two constants
and the optimizer folds the query down to an empty LocalTableScan; a column
name containing a space has to be quoted with backticks, as in `Account Type`.)

Try actioning it and see what happens.
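As an illustration of the lazy-evaluation point, here is a minimal PySpark
sketch (the path and column names are the placeholders from the query
above, and .show() is just one possible action):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

    # Transformations only build a logical plan; no file is read yet.
    parquetFile = spark.read.parquet("path/to/hdfs")
    parquetFile.createOrReplaceTempView("parquetFile")
    result = spark.sql(
        "SELECT * FROM parquetFile "
        "WHERE field1 = 'value' ORDER BY timestamp LIMIT 10000")

    # Only an action triggers the actual scan:
    result.show()   # or result.count(), result.collect(), result.write ...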
HTH

View my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

On Thu, 29 Apr 2021 at 07:30, Amin Borjian <borjianami...@outlook.com> wrote:

> Hi.
>
> We use Spark 3.0.1 on an HDFS cluster and store our files as Parquet with
> Snappy compression and dictionary encoding enabled. We try to perform a
> simple query:
>
>     parquetFile = spark.read.parquet("path/to/hdfs")
>     parquetFile.createOrReplaceTempView("parquetFile")
>     spark.sql("SELECT * FROM parquetFile WHERE field1 = 'value' ORDER BY timestamp LIMIT 10000")
>
> The condition in the WHERE clause is chosen (on purpose) so that no
> record matches the query. However, this query takes about 5-6 minutes to
> complete on our cluster (with NODE_LOCAL locality) and a simple Spark
> configuration. (The input data is about 8 TB in the following tests, but
> can be much more.)
>
> We decided to measure disk, network, memory and CPU consumption in order
> to find the bottleneck in this query, but we got much stranger results,
> which we discuss below. We set up dashboards for network, disk, memory
> and CPU usage with our monitoring tools so that we could watch each
> resource while the query was running.
>
> 1) First, we increased the amount of CPU allocated to Spark to 2 and then
> about 2.5 times its initial value. Although after the last increase not
> all of the dedicated CPU was used, we did not see any change in the
> duration of the query. (As a general note, across all tests the execution
> time varied by 0 to 20 seconds, which is insignificant against 5 minutes.)
>
> 2) We then similarly increased the memory allocated to Spark to 2 and
> then 2.5 times its original value. After the last increase the query did
> not use all of the memory allocated to Spark, but again we did not see
> any change in the duration of the query.
>
> 3) Throughout these tests we monitored network traffic, both sent and
> received, across the whole cluster. We can run queries whose network
> consumption is 2 or almost 3 times that of the query in this email, which
> shows that this query does not reach the cluster's network capacity. Of
> course, it was strange to us that a query that matches no record and runs
> completely node-local needs the network at all, but we assumed it
> probably does for various reasons, and even so we were still very far
> from the network capacity.
>
> 4) We also checked the amount of reading from and writing to disk. As in
> the third case, we can run queries with about 4 times the read and write
> rate of the query in this email, and our disks can handle much more. For
> this query the write rate is almost zero (as we expected) and the read
> rate is moderate, very far from the disks' capacity, so disk cannot be
> the bottleneck either.
>
> After all these tests, with the query still taking 5 minutes, we do not
> know what else could be the limiting factor, and it is strange to us that
> such a simple query needs such a run time (with such an execution time,
> heavier queries will take even longer).
>
> Does it make sense that the query takes so long? Is there something we
> need to pay attention to, or something we can change to improve it?
>
> Thanks,
> Amin Borjian
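One way to check whether the field1 = 'value' predicate actually reaches
the Parquet reader is to look for it under "PushedFilters" in the physical
plan. A minimal PySpark sketch, reusing the placeholder path and column
name from the thread:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("pushdown-check").getOrCreate()

    df = spark.read.parquet("path/to/hdfs")

    # The Parquet scan node of the physical plan lists the predicates that
    # were handed to the reader under "PushedFilters"; a pushed filter lets
    # Parquet skip row groups whose min/max statistics rule out 'value'.
    df.filter(F.col("field1") == "value").explain()

    # Filter pushdown to Parquet is controlled by this setting (true by default):
    print(spark.conf.get("spark.sql.parquet.filterPushdown"))

If the filter does appear there and the scan still takes minutes, one
plausible remaining cost over 8 TB is simply opening every file and
reading its footer, so the number of files and partitions is worth
checking as well.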