One way is to do it in bash: hadoop fs -ls xxxx. You might end up with a bash script to do the whole thing.
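As an illustration only (a minimal sketch, not from the thread), the same listing can be driven from Python so the result feeds straight into a Spark job; the directory name below is a hypothetical placeholder:

# Sketch of the bash idea, driven from Python: shell out to
# `hadoop fs -ls`, drop administrative files (.foo / _foo), and
# return a comma-separated path list usable as a Spark input path.
# "/data/topLevelDir" is a hypothetical placeholder.
import subprocess

def list_clean_paths(dir_path):
    out = subprocess.check_output(["hadoop", "fs", "-ls", dir_path])
    paths = []
    for line in out.decode("utf-8").splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip the "Found N items" header line
        path = fields[-1]
        name = path.rsplit("/", 1)[-1]
        if name.startswith(".") or name.startswith("_"):
            continue  # administrative files Spark should not read
        paths.append(path)
    return ",".join(paths)

print(list_clean_paths("/data/topLevelDir"))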
On Mon, Sep 15, 2014 at 1:01 PM, Eric Friedman <eric.d.fried...@gmail.com> wrote:
> That's a good idea and one I had considered too. Unfortunately I'm not
> aware of an API in PySpark for enumerating paths on HDFS. Have I
> overlooked one?
>
> On Mon, Sep 15, 2014 at 10:01 AM, Davies Liu <dav...@databricks.com> wrote:
>>
>> In PySpark, I think you could enumerate all the valid files, create an
>> RDD for each with newAPIHadoopFile(), and then union them together.
>>
>> On Mon, Sep 15, 2014 at 5:49 AM, Eric Friedman
>> <eric.d.fried...@gmail.com> wrote:
>> > I neglected to specify that I'm using pyspark. It doesn't look like
>> > these APIs have been bridged.
>> >
>> > ----
>> > Eric Friedman
>> >
>> >> On Sep 14, 2014, at 11:02 PM, Nat Padmanabhan <reachn...@gmail.com>
>> >> wrote:
>> >>
>> >> Hi Eric,
>> >>
>> >> Something along the lines of the following should work:
>> >>
>> >> // standard Hadoop API call
>> >> val fs = getFileSystem(...)
>> >> // pathFilter is an instance of org.apache.hadoop.fs.PathFilter
>> >> val filteredConcatenatedPaths = fs.listStatus(topLevelDirPath,
>> >>   pathFilter).map(_.getPath.toString).mkString(",")
>> >> val parquetRdd = sc.hadoopFile(filteredConcatenatedPaths,
>> >>   classOf[ParquetInputFormat[Something]], classOf[Void],
>> >>   classOf[SomeAvroType], getConfiguration(...))
>> >>
>> >> You have to do some initialization on ParquetInputFormat, such as
>> >> AvroReadSupport/AvroWriteSupport, but I'm guessing you're doing that
>> >> already.
>> >>
>> >> Cheers,
>> >> Nat
>> >>
>> >>
>> >> On Sun, Sep 14, 2014 at 7:37 PM, Eric Friedman
>> >> <eric.d.fried...@gmail.com> wrote:
>> >>> Hi,
>> >>>
>> >>> I have a directory structure with parquet+avro data in it. There are
>> >>> a couple of administrative files (.foo and/or _foo) that I need to
>> >>> ignore when processing this data, or Spark tries to read them as
>> >>> containing parquet content, which they do not.
>> >>>
>> >>> How can I set a PathFilter on the FileInputFormat used to construct
>> >>> an RDD?
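Putting the thread's suggestions together, below is a rough PySpark sketch of Davies' enumerate-and-union idea. It reaches the Hadoop FileSystem API through the Py4J gateway (sc._jvm), which is an internal mechanism rather than a documented PySpark API, and the directory and input-format/key/value class names are placeholders, not from the thread:

# Rough sketch of Davies' suggestion, not a tested recipe.
# Listing HDFS paths via the Py4J gateway (sc._jvm) is internal,
# not a documented PySpark API. "/data/topLevelDir" is hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="filtered-union")

# Enumerate children of the top-level dir through the JVM-side Hadoop API.
jvm = sc._jvm
conf = sc._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)
statuses = fs.listStatus(jvm.org.apache.hadoop.fs.Path("/data/topLevelDir"))

valid = []
for status in statuses:
    name = status.getPath().getName()
    if not (name.startswith(".") or name.startswith("_")):
        valid.append(status.getPath().toString())

# One RDD per valid file via newAPIHadoopFile(), then union them.
# The input format and key/value classes below are placeholders; a real
# Parquet read would need the appropriate format and converters.
rdds = [sc.newAPIHadoopFile(
            path,
            "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
            "org.apache.hadoop.io.LongWritable",
            "org.apache.hadoop.io.Text")
        for path in valid]
combined = sc.union(rdds)

Filtering the listing up front sidesteps the PathFilter question entirely: Spark never sees the administrative files.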