Data locality can only occur if the Spark executor's host string matches 
the preferred location returned by the file system. So this job would only have 
local tasks if the datanode replicas for the files in question had the same IP 
addresses as the Spark executors you are using. If they don't, the scheduler 
falls back to assigning read tasks to the first available executor at 
locality level ANY. 
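
One quick way to check is to compare the executor hosts with the datanode 
hosts that actually hold the block replicas. A rough sketch below, assuming 
PySpark with a running SparkSession, the path from the question, and that the 
driver UI is reachable from where you run this; the REST endpoint and hdfs 
fsck are standard, but I leave matching the two lists to eyeballing the output:

import json
import subprocess
from urllib.request import urlopen

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Executor hosts, via the Spark UI's REST API (uiWebUrl and applicationId
# are both exposed by PySpark's SparkContext).
url = "%s/api/v1/applications/%s/executors" % (sc.uiWebUrl, sc.applicationId)
executors = json.load(urlopen(url))
executor_hosts = sorted({e["hostPort"].split(":")[0]
                         for e in executors if e["id"] != "driver"})
print("executor hosts:", executor_hosts)

# Datanode hosts holding the block replicas; the -locations flag prints the
# replica addresses. Compare them against the executor hosts above.
report = subprocess.run(
    ["hdfs", "fsck", "/test/parquets", "-files", "-blocks", "-locations"],
    capture_output=True, text=True).stdout
print(report)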

So unless you have that HDFS/Spark cluster co-location, I wouldn't expect 
this job to run at any locality level other than ANY.
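
If the clusters are co-located and you still see ANY, the scheduler may 
simply be giving up on locality too quickly. The wait times are tunable; a 
minimal sketch below, with illustrative values (the defaults are 3s):

from pyspark.sql import SparkSession

# Illustrative values only; spark.locality.wait defaults to 3s.
spark = (SparkSession.builder
         .config("spark.locality.wait", "10s")       # overall locality wait
         .config("spark.locality.wait.node", "10s")  # wait for a NODE_LOCAL slot
         .getOrCreate())

df = spark.read.parquet("hdfs://test/parquets/*")
df.filter(df["word"] == "jhon").show()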

> On Apr 13, 2021, at 3:47 AM, Mohamadreza Rostami 
> <mohamadrezarosta...@gmail.com> wrote:
> 
> I have a Hadoop cluster that uses Apache Spark to query Parquet files saved 
> on Hadoop. For example, I use the following PySpark code to find a 
> word in the Parquet files:
> df = spark.read.parquet("hdfs://test/parquets/*")
> df.filter(df['word'] == "jhon").show()
> After running this code, I go to the Spark application UI, Stages tab, and 
> see that the locality level summary is set to Any. In contrast, given this 
> query's nature, it should run locally, at the NODE_LOCAL locality level at 
> least. When I check the cluster's network I/O while running this, I find 
> that the query uses the network (network I/O increases while the query is 
> running). The strange part of this situation is that the number shown in 
> the Spark UI's shuffle section is very small.
> How can I find the root cause of this problem and solve it?
> Stack Overflow link: 
> https://stackoverflow.com/questions/66612906/problem-with-data-locality-when-running-spark-query-with-local-nature-on-apache
