Thanks, Thakrar~
Regards,
Junfeng Chen

On Tue, May 22, 2018 at 11:22 AM, Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:
> Junfeng,
>
> I would suggest preprocessing/validating the paths in plain Python (and
> not Spark) before you try to fetch data.
>
> I am not familiar with Python Hadoop libraries, but see if this helps -
> http://crs4.github.io/pydoop/tutorial/hdfs_api.html
>
> Best,
> Jayesh
>
> From: JF Chen <darou...@gmail.com>
> Date: Monday, May 21, 2018 at 10:20 PM
> To: ayan guha <guha.a...@gmail.com>
> Cc: "Thakrar, Jayesh" <jthak...@conversantmedia.com>, user <user@spark.apache.org>
> Subject: Re: How to skip nonexistent file when read files with spark?
>
> Thanks ayan,
>
> I have also tried this method. The trickiest part is that the DataFrame
> union method requires the same schema on both sides, while the schema of
> my files is variable.
>
> Regards,
> Junfeng Chen
>
> On Tue, May 22, 2018 at 10:33 AM, ayan guha <guha.a...@gmail.com> wrote:
>
> A relatively naive solution would be:
>
> 0. Create a dummy blank dataframe.
> 1. Loop through the list of paths.
> 2. Try to create a dataframe from each path. On success, union it
> cumulatively.
> 3. On error, just ignore it or handle it as you wish.
>
> At the end of the loop, just use the unioned df. This should not add any
> performance overhead, as declaring dataframes and unioning them is not
> expensive unless you call an action within the loop.
>
> Best
> Ayan
>
> On Tue, 22 May 2018 at 11:27 am, JF Chen <darou...@gmail.com> wrote:
>
> Thanks, Thakrar,
>
> I have tried to check the existence of each path before reading it, but
> the HDFSCli python package does not seem to support wildcards.
> "FileSystem.globStatus" is a Java API, while I am using Python via
> Livy.... Do you know of any Python API implementing the same function?
>
> Regards,
> Junfeng Chen
>
> On Mon, May 21, 2018 at 9:01 PM, Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:
>
> Probably you can do some preprocessing/checking of the paths before you
> attempt to read them via Spark.
>
> Whether it is the local or HDFS filesystem, you can check for existence
> and other details by using the "FileSystem.globStatus" method from the
> Hadoop API.
>
> From: JF Chen <darou...@gmail.com>
> Date: Sunday, May 20, 2018 at 10:30 PM
> To: user <user@spark.apache.org>
> Subject: How to skip nonexistent file when read files with spark?
>
> Hi Everyone,
>
> I have run into a tricky problem recently. I am trying to read some file
> paths generated by another method. The paths are given as wildcards in a
> list, like ['/data/*/12', '/data/*/13'].
>
> In practice, if a wildcard does not match any existing path, Spark throws
> an exception: "pyspark.sql.utils.AnalysisException: 'Path does not
> exist: ...'", and the program stops after that.
>
> I would like Spark to simply ignore and skip these nonexistent paths and
> continue to run. I have tried the Python HDFSCli API to check the
> existence of a path, but hdfs cli cannot support wildcards.
>
> Any good idea to solve my problem? Thanks~
>
> Regards,
> Junfeng Chen
>
> --
>
> Best Regards,
> Ayan Guha
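
On the question of calling "FileSystem.globStatus" from Python: when the code already runs inside PySpark (including via Livy), one option is to reach the Hadoop API through Spark's py4j gateway. A minimal sketch, assuming an existing SparkSession named "spark"; the wildcard paths are just the examples from the thread, and spark._jvm / spark._jsc are internal handles, so this relies on PySpark internals rather than a public API:

    # Filter out wildcard paths that match nothing, using Hadoop's
    # FileSystem.globStatus through PySpark's py4j gateway.
    # Assumes a SparkSession named `spark`; paths are illustrative.
    paths = ['/data/*/12', '/data/*/13']

    jvm = spark._jvm                      # internal py4j handle to the JVM
    conf = spark._jsc.hadoopConfiguration()
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)

    def matches_something(pattern):
        # globStatus returns an empty array (or None) when nothing matches
        statuses = fs.globStatus(jvm.org.apache.hadoop.fs.Path(pattern))
        return statuses is not None and len(statuses) > 0

    existing = [p for p in paths if matches_something(p)]
    if existing:
        df = spark.read.json(existing)    # pick the reader that fits your format

Because the check runs on the same Hadoop configuration that Spark uses, it sees the same filesystem (local or HDFS) that the subsequent read would hit.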
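The loop-and-union approach Ayan described can be sketched as below; instead of starting from a dummy blank dataframe it collects the successful reads in a list and unions them at the end, catching the AnalysisException raised for paths that match nothing. Again this assumes a SparkSession named "spark" and JSON input; adjust the reader to your format:

    from functools import reduce
    from pyspark.sql.utils import AnalysisException

    paths = ['/data/*/12', '/data/*/13']   # example wildcard paths

    dfs = []
    for p in paths:
        try:
            dfs.append(spark.read.json(p))  # lazy; no action is triggered here
        except AnalysisException:
            pass                            # "Path does not exist" -> skip it

    if dfs:
        df = reduce(lambda a, b: a.union(b), dfs)

Note that union still requires matching schemas, which was the concern raised above for files with variable schemas; DataFrame.unionByName (and its allowMissingColumns option in Spark 3.1+) can relax the column-order requirement, but on older versions the columns would have to be aligned manually before unioning.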