Junfeng, I would suggest preprocessing/validating the paths in plain Python (and not Spark) before you try to fetch data. I am not familiar with Python Hadoop libraries, but see if this helps - http://crs4.github.io/pydoop/tutorial/hdfs_api.html
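From a quick look at that tutorial, something along these lines might work for validating the wildcard paths up front. This is only a rough, untested sketch: it assumes pydoop is installed and that hdfs.ls / hdfs.path.exists behave as the tutorial describes, and it handles one wildcard per path component, matching the shape of your '/data/*/12' example:

    import fnmatch
    import pydoop.hdfs as hdfs

    def expand_pattern(pattern):
        """Expand a glob-style HDFS pattern into the concrete paths that exist."""
        candidates = ["/"]
        for part in pattern.strip("/").split("/"):
            matched = []
            for base in candidates:
                if "*" in part or "?" in part:
                    # List the directory and keep children whose name matches this component
                    for child in hdfs.ls(base):
                        if fnmatch.fnmatch(child.rstrip("/").split("/")[-1], part):
                            matched.append(child)
                else:
                    path = base.rstrip("/") + "/" + part
                    if hdfs.path.exists(path):
                        matched.append(path)
            candidates = matched
        return candidates

    patterns = ["/data/*/12", "/data/*/13"]
    existing = [p for pat in patterns for p in expand_pattern(pat)]

Alternatively, since you already have a SparkSession (even through Livy), you can reach Hadoop's FileSystem.globStatus from Python through the py4j gateway, so no extra Python library is needed. Again an untested sketch; note that spark._jvm and spark._jsc are internal attributes:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    jvm = spark._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

    def glob_existing(pattern):
        # globStatus returns the matching FileStatus entries; treat "no match" as an empty list
        statuses = fs.globStatus(jvm.org.apache.hadoop.fs.Path(pattern))
        return [s.getPath().toString() for s in (statuses or [])]

    good_paths = [p for pat in ["/data/*/12", "/data/*/13"] for p in glob_existing(pat)]
    if good_paths:
        df = spark.read.json(good_paths)  # or whatever format you are actually reading

Passing all the surviving paths to a single spark.read call would also sidestep the union/schema problem you mention below.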
Best,
Jayesh

From: JF Chen <darou...@gmail.com>
Date: Monday, May 21, 2018 at 10:20 PM
To: ayan guha <guha.a...@gmail.com>
Cc: "Thakrar, Jayesh" <jthak...@conversantmedia.com>, user <user@spark.apache.org>
Subject: Re: How to skip nonexistent file when read files with spark?

Thanks ayan,

I have also tried this method. The tricky part is that the DataFrame union method requires the same schema, while the schema of my files is variable.

Regards,
Junfeng Chen

On Tue, May 22, 2018 at 10:33 AM, ayan guha <guha.a...@gmail.com> wrote:

A relatively naive solution would be:

0. Create a dummy blank dataframe.
1. Loop through the list of paths.
2. Try to create a dataframe from the path. On success, union it cumulatively.
3. On error, just ignore it or handle it as you wish.

At the end of the loop, just use the unioned df (a rough sketch of this loop is at the very end of this thread). This should not add any performance overhead, since declaring dataframes and unioning them is not expensive unless you call an action within the loop.

Best
Ayan

On Tue, 22 May 2018 at 11:27 am, JF Chen <darou...@gmail.com> wrote:

Thanks, Thakrar,

I have tried to check the existence of the paths before reading them, but the HDFSCli Python package does not seem to support wildcards. "FileSystem.globStatus" is a Java API, while I am using Python via Livy.... Do you know of any Python API implementing the same function?

Regards,
Junfeng Chen

On Mon, May 21, 2018 at 9:01 PM, Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:

You can probably do some preprocessing/checking of the paths before you attempt to read them via Spark. Whether it is a local or HDFS filesystem, you can check for existence and other details by using the "FileSystem.globStatus" method from the Hadoop API.

From: JF Chen <darou...@gmail.com>
Date: Sunday, May 20, 2018 at 10:30 PM
To: user <user@spark.apache.org>
Subject: How to skip nonexistent file when read files with spark?

Hi Everyone

I ran into a tricky problem recently. I am trying to read some file paths generated by another method. The paths are given as wildcards in a list, like ['/data/*/12', '/data/*/13']. In practice, if a wildcard does not match any existing path, Spark throws an exception: "pyspark.sql.utils.AnalysisException: 'Path does not exist: ...'", and the program stops there. I want Spark to simply ignore and skip these nonexistent paths and continue running. I have tried the Python HDFSCli API to check the existence of a path, but HDFSCli does not support wildcards.

Any good idea to solve my problem? Thanks~

Regards,
Junfeng Chen

--
Best Regards,
Ayan Guha
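A rough PySpark rendering of the loop Ayan describes above — again just a sketch, untested; it assumes JSON input and that the files share a compatible schema for union, which, as Junfeng notes, may not hold here:

    from pyspark.sql import SparkSession
    from pyspark.sql.utils import AnalysisException

    spark = SparkSession.builder.getOrCreate()
    paths = ["/data/*/12", "/data/*/13"]

    result = None
    for path in paths:
        try:
            df = spark.read.json(path)   # or whatever format is being read
        except AnalysisException:
            continue                     # the wildcard matched nothing: skip it
        result = df if result is None else result.union(df)

    if result is not None:
        result.show()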