Hi Eric,

Something along the lines of the following should work:

val fs = getFileSystem(...) // standard Hadoop API call
// pathFilter is an instance of org.apache.hadoop.fs.PathFilter
val filteredConcatenatedPaths = fs.listStatus(topLevelDirPath, pathFilter)
  .map(_.getPath.toString).mkString(",")
// ParquetInputFormat is a new-API (mapreduce) InputFormat, so use
// newAPIHadoopFile, which also takes the Configuration as its last argument
val parquetRdd = sc.newAPIHadoopFile(filteredConcatenatedPaths,
  classOf[ParquetInputFormat[SomeAvroType]], classOf[Void],
  classOf[SomeAvroType], getConfiguration(...))
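
For the pathFilter itself, a minimal sketch could look like the following. The name DataFilesOnly is made up for illustration, and I am assuming the administrative files are exactly those whose names start with "." or "_", as described in the question:

```scala
import org.apache.hadoop.fs.{Path, PathFilter}

// Hypothetical filter (the name is an assumption, not part of any library):
// accepts a path only if its final component does not start with "." or "_".
object DataFilesOnly extends PathFilter {
  override def accept(path: Path): Boolean = {
    val name = path.getName // last component of the path, e.g. "_foo"
    !name.startsWith(".") && !name.startsWith("_")
  }
}
```

It can then be passed as the second argument to fs.listStatus in the snippet above.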

You also have to do some initialization on ParquetInputFormat, such as
setting AvroReadSupport (or AvroWriteSupport when writing), but I am
guessing you are doing that already.
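
In case it helps, here is a rough sketch of the read-side initialization. It assumes the org.apache.parquet package names (older Parquet releases use parquet.* instead) and uses Avro's GenericRecord as a stand-in for your own record class, so adjust both to your setup:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.avro.AvroReadSupport
import org.apache.parquet.hadoop.ParquetInputFormat

// Assumption: Hadoop 2 style Job creation; the Job is only used here as a
// holder for the Configuration.
val job = Job.getInstance()

// Tell ParquetInputFormat to materialize records through Avro.
// GenericRecord is a placeholder for a specific Avro class.
ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[GenericRecord]])

// This Configuration is the one to hand over when creating the RDD.
val parquetConf = job.getConfiguration
```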

Cheers,
Nat


On Sun, Sep 14, 2014 at 7:37 PM, Eric Friedman
<eric.d.fried...@gmail.com> wrote:
> Hi,
>
> I have a directory structure with parquet+avro data in it. There are a
> couple of administrative files (.foo and/or _foo) that I need to ignore when
> processing this data or Spark tries to read them as containing parquet
> content, which they do not.
>
> How can I set a PathFilter on the FileInputFormat used to construct an RDD?
