Christian Homberg created SPARK-27966:
-----------------------------------------

             Summary: input_file_name empty when listing files in parallel
                 Key: SPARK-27966
                 URL: https://issues.apache.org/jira/browse/SPARK-27966
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.4.0
         Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
                      Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
                      Workers: 3
                      Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
            Reporter: Christian Homberg


I ran into an issue similar to, and probably related to, SPARK-26128: the 
result of `org.apache.spark.sql.functions.input_file_name` is sometimes empty.

My environment is Databricks. Debugging the Log4j output showed me that the 
issue occurs when the files are listed in parallel, e.g. when the log contains:

19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 127; threshold: 32

 

This is not an issue when listing fewer than 32 files. Alternatively, setting 
`spark.sql.sources.parallelPartitionDiscovery.threshold` to 9999 resolves the 
issue.
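
For anyone who wants to try the same workaround, here is a minimal PySpark-style sketch. The helper names (`lists_in_parallel`, `apply_workaround`) are hypothetical and not part of Spark. The rule "parallel once the path count exceeds the threshold" is an assumption that matches the log line above (127 paths, threshold 32), and the exact boundary behaviour at 32 paths is not confirmed by this report. `spark` is assumed to be an existing SparkSession.

```python
# Sketch only: helper names are hypothetical, not part of the Spark API.
THRESHOLD_KEY = "spark.sql.sources.parallelPartitionDiscovery.threshold"
DEFAULT_THRESHOLD = 32  # threshold value shown in the log line above


def lists_in_parallel(num_paths: int, threshold: int = DEFAULT_THRESHOLD) -> bool:
    """Assumed rule: listing goes parallel once the path count exceeds the threshold."""
    return num_paths > threshold


def apply_workaround(spark, threshold: int = 9999) -> None:
    """Raise the threshold on an existing session so that file listing stays serial.

    `spark` is assumed to be a SparkSession; only `spark.conf.set` is used.
    """
    spark.conf.set(THRESHOLD_KEY, str(threshold))


if __name__ == "__main__":
    print(lists_in_parallel(127))        # the case from the log: True
    print(lists_in_parallel(127, 9999))  # with the workaround applied: False
```

The setting has to be applied before the DataFrame is created, since the file listing happens when the paths are resolved. Raising the threshold this far effectively disables parallel listing, which may slow down reads over very large numbers of paths.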


