Junfeng, I would suggest preprocessing/validating the paths in plain Python (and not Spark) before you try to fetch data. I am not familiar with Python Hadoop libraries, but see if this helps - http://crs4.github.io/pydoop/tutorial/hdfs_api.html
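From a quick look at that tutorial, something along these lines might work for validating the wildcard paths up front. This is only a rough, untested sketch: it assumes pydoop is installed and that hdfs.ls / hdfs.path.exists behave as the tutorial describes, and it handles one wildcard per path component, matching the shape of your '/data/*/12' example:

    import fnmatch
    import pydoop.hdfs as hdfs

    def expand_pattern(pattern):
        """Expand a glob-style HDFS pattern into the concrete paths that exist."""
        candidates = ["/"]
        for part in pattern.strip("/").split("/"):
            matched = []
            for base in candidates:
                if "*" in part or "?" in part:
                    # List the directory and keep children whose name matches this component
                    for child in hdfs.ls(base):
                        if fnmatch.fnmatch(child.rstrip("/").split("/")[-1], part):
                            matched.append(child)
                else:
                    path = base.rstrip("/") + "/" + part
                    if hdfs.path.exists(path):
                        matched.append(path)
            candidates = matched
        return candidates

    patterns = ["/data/*/12", "/data/*/13"]
    existing = [p for pat in patterns for p in expand_pattern(pat)]

Alternatively, since you already have a SparkSession (even through Livy), you can reach Hadoop's FileSystem.globStatus from Python through the py4j gateway, so no extra Python library is needed. Again an untested sketch; note that spark._jvm and spark._jsc are internal attributes:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    jvm = spark._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

    def glob_existing(pattern):
        # globStatus returns the matching FileStatus entries; treat "no match" as an empty list
        statuses = fs.globStatus(jvm.org.apache.hadoop.fs.Path(pattern))
        return [s.getPath().toString() for s in (statuses or [])]

    good_paths = [p for pat in ["/data/*/12", "/data/*/13"] for p in glob_existing(pat)]
    if good_paths:
        df = spark.read.json(good_paths)  # or whatever format you are actually reading

Passing all the surviving paths to a single spark.read call would also sidestep the union/schema problem you mention below.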
Best,
Jayesh

From: JF Chen <darou...@gmail.com>
Date: Monday, May 21, 2018 at 10:20 PM
To: ayan guha <guha.a...@gmail.com>
Cc: "Thakrar, Jayesh" <jthak...@conversantmedia.com>, user <user@spark.apache.org>
Subject: Re: How to skip nonexistent file when read files with spark?

Thanks ayan,

I have also tried this method. The tricky part is that the DataFrame union method requires the same schema, while the schema of my files is variable.

Regards,
Junfeng Chen

On Tue, May 22, 2018 at 10:33 AM, ayan guha <guha.a...@gmail.com> wrote:

A relatively naive solution would be:

0. Create a dummy blank dataframe.
1. Loop through the list of paths.
2. Try to create a dataframe from the path. On success, union it cumulatively.
3. On error, just ignore it or handle it as you wish.

At the end of the loop, just use the unioned df (a rough sketch of this loop is at the very end of this thread). This should not add any performance overhead, since declaring dataframes and unioning them is not expensive unless you call an action within the loop.

Best
Ayan

On Tue, 22 May 2018 at 11:27 am, JF Chen <darou...@gmail.com> wrote:

Thanks, Thakrar,

I have tried to check the existence of the paths before reading them, but the HDFSCli Python package does not seem to support wildcards. "FileSystem.globStatus" is a Java API, while I am using Python via Livy.... Do you know of any Python API implementing the same function?

Regards,
Junfeng Chen

On Mon, May 21, 2018 at 9:01 PM, Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:

You can probably do some preprocessing/checking of the paths before you attempt to read them via Spark. Whether it is a local or HDFS filesystem, you can check for existence and other details by using the "FileSystem.globStatus" method from the Hadoop API.

From: JF Chen <darou...@gmail.com>
Date: Sunday, May 20, 2018 at 10:30 PM
To: user <user@spark.apache.org>
Subject: How to skip nonexistent file when read files with spark?

Hi Everyone

I ran into a tricky problem recently. I am trying to read some file paths generated by another method. The paths are given as wildcards in a list, like ['/data/*/12', '/data/*/13']. In practice, if a wildcard does not match any existing path, Spark throws an exception: "pyspark.sql.utils.AnalysisException: 'Path does not exist: ...'", and the program stops there. I want Spark to simply ignore and skip these nonexistent paths and continue running. I have tried the Python HDFSCli API to check the existence of a path, but HDFSCli does not support wildcards.

Any good idea to solve my problem? Thanks~

Regards,
Junfeng Chen

--
Best Regards,
Ayan Guha
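A rough PySpark rendering of the loop Ayan describes above — again just a sketch, untested; it assumes JSON input and that the files share a compatible schema for union, which, as Junfeng notes, may not hold here:

    from pyspark.sql import SparkSession
    from pyspark.sql.utils import AnalysisException

    spark = SparkSession.builder.getOrCreate()
    paths = ["/data/*/12", "/data/*/13"]

    result = None
    for path in paths:
        try:
            df = spark.read.json(path)   # or whatever format is being read
        except AnalysisException:
            continue                     # the wildcard matched nothing: skip it
        result = df if result is None else result.union(df)

    if result is not None:
        result.show()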