I neglected to specify that I'm using PySpark. It doesn't look like these APIs have been bridged.

----
Eric Friedman
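For anyone following along in PySpark, here is a minimal sketch of one possible workaround; it is not from this thread. It goes through PySpark's Py4J gateway (sc._jvm and sc._jsc, both undocumented internals) to reach the same Hadoop FileSystem API used in Nat's reply below, does the PathFilter's job in plain Python, and joins the surviving paths into one comma-separated string. The directory location and app name are made up, and whether parquetFile accepts a comma-separated list of paths depends on your Spark version.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="filtered-parquet")  # hypothetical app name
sqlContext = SQLContext(sc)

# Reach into the JVM for the Hadoop FileSystem API via Py4J
# (sc._jvm and sc._jsc are internal, unsupported attributes).
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())

# Do the PathFilter's job on the Python side: list the top-level
# directory and drop names starting with "." or "_".
statuses = fs.listStatus(hadoop.fs.Path("hdfs:///path/to/topLevelDir"))
paths = [s.getPath().toString() for s in statuses
         if not s.getPath().getName().startswith((".", "_"))]

# Hand the survivors to the reader as one comma-separated string.
rdd = sqlContext.parquetFile(",".join(paths))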
> On Sep 14, 2014, at 11:02 PM, Nat Padmanabhan <reachn...@gmail.com> wrote:
>
> Hi Eric,
>
> Something along the lines of the following should work:
>
> val fs = getFileSystem(...) // standard Hadoop API call
> val filteredConcatenatedPaths = fs.listStatus(topLevelDirPath, pathFilter)
>   .map(_.getPath.toString).mkString(",") // pathFilter is an instance
>                                          // of org.apache.hadoop.fs.PathFilter
> val parquetRdd = sc.hadoopFile(filteredConcatenatedPaths,
>   classOf[ParquetInputFormat[Something]], classOf[Void],
>   classOf[SomeAvroType], getConfiguration(...))
>
> You have to do some initialization on ParquetInputFormat, such as
> AvroReadSupport/AvroWriteSupport, but I am guessing you are doing that
> already.
>
> Cheers,
> Nat
>
>
> On Sun, Sep 14, 2014 at 7:37 PM, Eric Friedman
> <eric.d.fried...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have a directory structure with parquet+avro data in it. There are a
>> couple of administrative files (.foo and/or _foo) that I need to ignore
>> when processing this data, or Spark tries to read them as containing
>> parquet content, which they do not.
>>
>> How can I set a PathFilter on the FileInputFormat used to construct an RDD?
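Under the same caveats as the sketch above (hypothetical path, version-dependent reader), a second sketch: when the filter is just "skip dot- and underscore-prefixed names," Hadoop's glob syntax can express it directly in the path, with no Py4J round trip, assuming the reader expands Hadoop globs the way FileInputFormat-based inputs do.

# "[^._]*" is Hadoop glob notation for names that do not start
# with "." or "_".
rdd = sqlContext.parquetFile("hdfs:///path/to/topLevelDir/[^._]*")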