Data locality can only occur if the Spark executor's host (the IP address string it registers with) matches the preferred location returned by the file system. So this job would only have local tasks if the DataNode replicas for the files in question were on the same hosts as the Spark executors you are using. If they aren't, the scheduler falls back to assigning read tasks to the first available executor, at locality level ANY.
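One quick way to reason about this: collect the executor hosts (Spark UI, Executors tab) and the replica hosts for the files (e.g. `hdfs fsck /test/parquets -files -blocks -locations`), and check whether they overlap. A minimal sketch of that check, with placeholder host names standing in for whatever your cluster reports:

```python
def node_local_possible(executor_hosts, replica_hosts):
    """Return the hosts where NODE_LOCAL tasks are possible.

    NODE_LOCAL scheduling requires a Spark executor to run on the same
    host as a DataNode holding a replica of the block being read.
    """
    return set(executor_hosts) & set(replica_hosts)

# Placeholder host lists: executors as listed in the Spark UI,
# replica locations as reported by hdfs fsck.
executors = {"spark-node-1", "spark-node-2"}
replicas = {"datanode-1", "datanode-2", "datanode-3"}

overlap = node_local_possible(executors, replicas)
if not overlap:
    print("No co-located hosts: expect locality level ANY and network reads")
```

If the intersection is empty, ANY is the best locality the scheduler can achieve, regardless of `spark.locality.wait` settings.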
So unless you have that HDFS/Spark cluster co-location, I wouldn't expect this job to run at any locality level other than ANY.

> On Apr 13, 2021, at 3:47 AM, Mohamadreza Rostami
> <mohamadrezarosta...@gmail.com> wrote:
>
> I have a Hadoop cluster that uses Apache Spark to query Parquet files saved
> on Hadoop. For example, I'm using the following PySpark code to find a
> word in the Parquet files:
>
> df = spark.read.parquet("hdfs://test/parquets/*")
> df.filter(df['word'] == "jhon").show()
>
> After running this code, I go to the Spark application UI, Stages tab, and I
> see that the locality level summary is set to Any. In contrast, given this
> query's nature, it should run locally, at the NODE_LOCAL locality level at
> least. When I check the network I/O of the cluster while running this, I find
> that the query uses the network (network I/O increases while the query is
> running). The strange part is that the number shown in the Spark UI's
> shuffle section is very small.
>
> How can I find the root cause of this problem and solve it?
>
> Link on stackoverflow.com:
> https://stackoverflow.com/questions/66612906/problem-with-data-locality-when-running-spark-query-with-local-nature-on-apache