I neglected to specify that I'm using PySpark. It doesn't look like these
APIs have been bridged to Python.
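
One workaround that might work in the meantime (an untested sketch; it leans
on PySpark's private _jsc/_jvm py4j handles, which could change between
releases): list and filter the paths in Python via the Hadoop FileSystem API,
and hand only the surviving paths to the reader.

# PathFilter can't be implemented in Python, so do the filtering on the
# driver instead.  sc._jsc / sc._jvm are private PySpark internals.
hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

# Keep the real data files; skip the administrative .foo / _foo files.
statuses = fs.listStatus(Path("/data/topLevelDir"))  # hypothetical path
data_paths = [s.getPath().toString() for s in statuses
              if not s.getPath().getName().startswith((".", "_"))]

# Point the reader at the filtered files instead of the whole directory.
parts = [sqlContext.parquetFile(p) for p in data_paths]
rdd = sc.union(parts)  # plain RDD of rows from the per-file SchemaRDDs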

----
Eric Friedman

> On Sep 14, 2014, at 11:02 PM, Nat Padmanabhan <reachn...@gmail.com> wrote:
> 
> Hi Eric,
> 
> Something along the lines of the following should work
> 
> import org.apache.hadoop.fs.{FileSystem, Path, PathFilter}
> import parquet.hadoop.ParquetInputFormat
> 
> val conf = sc.hadoopConfiguration
> val fs = FileSystem.get(conf) // standard Hadoop API call
> // pathFilter is an instance of org.apache.hadoop.fs.PathFilter
> val filteredConcatenatedPaths = fs.listStatus(topLevelDirPath, pathFilter)
>   .map(_.getPath.toString).mkString(",")
> // ParquetInputFormat is a "new" (mapreduce) InputFormat, so use
> // newAPIHadoopFile rather than hadoopFile
> val parquetRdd = sc.newAPIHadoopFile(filteredConcatenatedPaths,
>   classOf[ParquetInputFormat[SomeAvroType]], classOf[Void],
>   classOf[SomeAvroType], conf)
> 
> You have to do some initialization on ParquetInputFormat, such as
> setting AvroReadSupport/AvroWriteSupport as the read/write support
> class, but I'm guessing you are doing that already.
> 
> Cheers,
> Nat
> 
> 
> On Sun, Sep 14, 2014 at 7:37 PM, Eric Friedman
> <eric.d.fried...@gmail.com> wrote:
>> Hi,
>> 
>> I have a directory structure with parquet+avro data in it. There are a
>> couple of administrative files (.foo and/or _foo) that I need to ignore
>> when processing this data; otherwise Spark tries to read them as
>> containing parquet content, which they do not.
>> 
>> How can I set a PathFilter on the FileInputFormat used to construct an RDD?
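
For completeness, the quoted question also has a configuration-based answer
that is reachable from PySpark: on the JVM side,
FileInputFormat.setInputPathFilter just records the filter's class name in
the job configuration, so a filter compiled as a Java/Scala class and shipped
in a jar can be wired in through the conf dict. A sketch, assuming a
hypothetical com.example.HiddenPathFilter on the executor classpath (getting
the Avro values into Python would additionally need a value converter):

conf = {
    # The key FileInputFormat.setInputPathFilter sets on the JVM side; the
    # value must name a Java/Scala PathFilter on the executor classpath.
    "mapreduce.input.pathFilter.class": "com.example.HiddenPathFilter",
    # Read support for parquet-avro, as in the Scala snippet above.
    "parquet.read.support.class": "parquet.avro.AvroReadSupport",
}
rdd = sc.newAPIHadoopFile(
    "/data/topLevelDir",  # hypothetical path
    "parquet.hadoop.ParquetInputFormat",
    "java.lang.Void",
    "org.apache.avro.generic.GenericRecord",
    conf=conf)

Either way, the .foo/_foo files never reach the parquet footer parser.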

