Thanks, Thakrar~

Regards,
Junfeng Chen

On Tue, May 22, 2018 at 11:22 AM, Thakrar, Jayesh <
jthak...@conversantmedia.com> wrote:

> Junfeng,
>
>
>
> I would suggest preprocessing/validating the paths in plain Python (and
> not Spark) before you try to fetch data.
>
> I am not familiar with Python Hadoop libraries, but see if this helps -
> http://crs4.github.io/pydoop/tutorial/hdfs_api.html
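>
> As a rough, untested sketch along those lines (assuming pydoop is
> installed, can reach your cluster, and exposes hdfs.ls and
> hdfs.path.exists as its docs describe; it only expands a single "*" one
> directory level deep, which matches the patterns in this thread):
>
>     import pydoop.hdfs as hdfs
>
>     def expand_pattern(pattern):
>         # split '/data/*/12' into '/data/' and '/12'
>         head, star, tail = pattern.partition('*')
>         if not star:
>             return [pattern] if hdfs.path.exists(pattern) else []
>         # list the children of the prefix, keep only paths that exist
>         return [p + tail for p in hdfs.ls(head) if hdfs.path.exists(p + tail)]
>
>     patterns = ['/data/*/12', '/data/*/13']
>     existing = [p for pat in patterns for p in expand_pattern(pat)]
>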
>
>
>
> Best,
>
> Jayesh
>
>
>
> *From: *JF Chen <darou...@gmail.com>
> *Date: *Monday, May 21, 2018 at 10:20 PM
> *To: *ayan guha <guha.a...@gmail.com>
> *Cc: *"Thakrar, Jayesh" <jthak...@conversantmedia.com>, user <
> user@spark.apache.org>
> *Subject: *Re: How to skip nonexistent file when read files with spark?
>
>
>
> Thanks, Ayan,
>
>
>
> I have also tried this method. The trickiest part is that the dataframe
> union method requires the same schema, while the schema of my files
> varies.
>
>
>
>
> Regards,
> Junfeng Chen
>
>
>
> On Tue, May 22, 2018 at 10:33 AM, ayan guha <guha.a...@gmail.com> wrote:
>
> A relatively naive solution would be:
>
>
>
> 0. Create a dummy blank dataframe.
>
> 1. Loop through the list of paths.
>
> 2. Try to create a dataframe from each path. If it succeeds, union it
> cumulatively.
>
> 3. If it fails, just ignore the error or handle it as you wish.
>
>
>
> At the end of the loop, just use the unioned df. This should not add any
> performance overhead, as declaring dataframes and unioning them is not
> expensive unless you call an action within the loop.
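>
> A minimal sketch of this idea in PySpark (assuming JSON input, an
> existing SparkSession named "spark", and a list of candidate paths in
> "paths"; it starts from None instead of a dummy blank dataframe):
>
>     from pyspark.sql.utils import AnalysisException
>
>     unioned = None
>     for path in paths:
>         try:
>             # may raise AnalysisException if the path does not exist
>             df = spark.read.json(path)
>         except AnalysisException:
>             continue  # skip nonexistent paths and move on
>         unioned = df if unioned is None else unioned.union(df)
>
>     if unioned is not None:
>         unioned.show()
>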
>
>
>
> Best
>
> Ayan
>
>
>
> On Tue, 22 May 2018 at 11:27 am, JF Chen <darou...@gmail.com> wrote:
>
> Thanks, Thakrar,
>
>
>
> I have tried to check the existence of the path before reading it, but
> the HDFSCli Python package does not seem to support wildcards.
> "FileSystem.globStatus" is a Java API, while I am using Python via
> Livy.... Do you know of any Python API that implements the same function?
>
>
>
>
> Regards,
> Junfeng Chen
>
>
>
> On Mon, May 21, 2018 at 9:01 PM, Thakrar, Jayesh <
> jthak...@conversantmedia.com> wrote:
>
> You can probably do some preprocessing/checking of the paths before you
> attempt to read them via Spark.
>
> Whether it is the local or HDFS filesystem, you can check for existence
> and other details by using the "FileSystem.globStatus" method from the
> Hadoop API.
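>
> From PySpark, one way to reach that method is through the py4j gateway.
> This is only a sketch, not a supported API: _jvm and _jsc are internal
> accessors, and it assumes a SparkSession named "spark":
>
>     hadoop = spark._jvm.org.apache.hadoop.fs
>     fs = hadoop.FileSystem.get(spark._jsc.hadoopConfiguration())
>
>     def expand_existing(pattern):
>         # globStatus returns None or an empty array when nothing matches
>         statuses = fs.globStatus(hadoop.Path(pattern))
>         if not statuses:
>             return []
>         return [s.getPath().toString() for s in statuses]
>
>     paths = [p for pat in ['/data/*/12', '/data/*/13']
>              for p in expand_existing(pat)]
>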
>
>
>
> *From: *JF Chen <darou...@gmail.com>
> *Date: *Sunday, May 20, 2018 at 10:30 PM
> *To: *user <user@spark.apache.org>
> *Subject: *How to skip nonexistent file when read files with spark?
>
>
>
> Hi everyone,
>
> I have run into a tricky problem recently. I am trying to read some file
> paths generated by another method. The paths contain wildcards and are
> given in a list, like ['/data/*/12', '/data/*/13'].
>
> In practice, if a wildcard does not match any existing path, Spark throws
> an exception: "pyspark.sql.utils.AnalysisException: 'Path does not
> exist: ...'", and the program stops after that.
>
> I want Spark to simply ignore and skip these nonexistent file paths and
> continue running. I have tried the Python HDFSCli API to check the
> existence of a path, but it does not support wildcards.
>
>
>
> Any good ideas for solving this problem? Thanks~
>
>
>
> Regards,
> Junfeng Chen
>
>
>
> --
>
> Best Regards,
> Ayan Guha
>
>
>