Or maybe you could give this one a try: https://labs.spotify.com/2013/05/07/snakebite/
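In case it helps, a rough, untested sketch of what listing and filtering the paths with snakebite could look like -- the NameNode address and the /data/events directory are made-up placeholders:

    # Sketch only: assumes a NameNode reachable at namenode.example.com:8020
    # and data under /data/events (both hypothetical).
    from snakebite.client import Client

    client = Client('namenode.example.com', 8020)

    # ls() yields one dict per entry; keep only paths whose final component
    # does not start with '.' or '_' (the administrative files to skip).
    paths = [entry['path']
             for entry in client.ls(['/data/events'])
             if not entry['path'].rsplit('/', 1)[-1].startswith(('.', '_'))]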
On Mon, Sep 15, 2014 at 2:51 PM, Davies Liu <dav...@databricks.com> wrote:
> There is one way to do it in bash: hadoop fs -ls xxxx. Maybe you could
> end up with a bash script that does the job.
>
> On Mon, Sep 15, 2014 at 1:01 PM, Eric Friedman
> <eric.d.fried...@gmail.com> wrote:
>> That's a good idea and one I had considered too. Unfortunately I'm not
>> aware of an API in PySpark for enumerating paths on HDFS. Have I
>> overlooked one?
>>
>> On Mon, Sep 15, 2014 at 10:01 AM, Davies Liu <dav...@databricks.com> wrote:
>>>
>>> In PySpark, I think you could enumerate all the valid files, create an
>>> RDD for each with newAPIHadoopFile(), then union them together.
>>>
>>> On Mon, Sep 15, 2014 at 5:49 AM, Eric Friedman
>>> <eric.d.fried...@gmail.com> wrote:
>>> > I neglected to specify that I'm using pyspark. Doesn't look like these
>>> > APIs have been bridged.
>>> >
>>> > ----
>>> > Eric Friedman
>>> >
>>> >> On Sep 14, 2014, at 11:02 PM, Nat Padmanabhan <reachn...@gmail.com>
>>> >> wrote:
>>> >>
>>> >> Hi Eric,
>>> >>
>>> >> Something along the lines of the following should work:
>>> >>
>>> >>   val fs = getFileSystem(...)  // standard Hadoop API call
>>> >>   // pathFilter is an instance of org.apache.hadoop.fs.PathFilter
>>> >>   val filteredConcatenatedPaths = fs.listStatus(topLevelDirPath, pathFilter)
>>> >>     .map(_.getPath.toString).mkString(",")
>>> >>   val parquetRdd = sc.hadoopFile(filteredConcatenatedPaths,
>>> >>     classOf[ParquetInputFormat[Something]], classOf[Void],
>>> >>     classOf[SomeAvroType], getConfiguration(...))
>>> >>
>>> >> You have to do some initialization on ParquetInputFormat, such as
>>> >> AvroReadSupport/AvroWriteSupport etc., but I'm guessing you are doing
>>> >> that already.
>>> >>
>>> >> Cheers,
>>> >> Nat
>>> >>
>>> >>
>>> >> On Sun, Sep 14, 2014 at 7:37 PM, Eric Friedman
>>> >> <eric.d.fried...@gmail.com> wrote:
>>> >>> Hi,
>>> >>>
>>> >>> I have a directory structure with parquet+avro data in it. There are
>>> >>> a couple of administrative files (.foo and/or _foo) that I need to
>>> >>> ignore when processing this data, or Spark tries to read them as
>>> >>> containing parquet content, which they do not.
>>> >>>
>>> >>> How can I set a PathFilter on the FileInputFormat used to construct
>>> >>> an RDD?
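
For completeness, a rough PySpark sketch of the "enumerate, then union" idea from earlier in the thread, picking up the filtered `paths` list from the snakebite example above. The input format and value class names here are assumptions (the 2014-era parquet.avro.AvroParquetInputFormat), and getting Avro records back into Python still needs the Parquet/Avro jars and a suitable value converter on the classpath -- treat it as a starting point, not a drop-in solution:

    # Sketch only: the class names below are assumptions, and a value
    # converter is needed for the records to deserialize into Python.
    rdds = [sc.newAPIHadoopFile(
                p,
                'parquet.avro.AvroParquetInputFormat',    # assumed input format
                'java.lang.Void',
                'org.apache.avro.generic.GenericRecord')  # assumed value class
            for p in paths]

    combined = sc.union(rdds)  # one logical RDD over all the clean paths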