Or maybe you could give this one a try: https://labs.spotify.com/2013/05/07/snakebite/
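In case it helps, a rough, untested sketch of what listing and filtering the paths with snakebite could look like -- the NameNode address and the /data/events directory are made-up placeholders:

    # Sketch only: assumes a NameNode reachable at namenode.example.com:8020
    # and data under /data/events (both hypothetical).
    from snakebite.client import Client

    client = Client('namenode.example.com', 8020)

    # ls() yields one dict per entry; keep only paths whose final component
    # does not start with '.' or '_' (the administrative files to skip).
    paths = [entry['path']
             for entry in client.ls(['/data/events'])
             if not entry['path'].rsplit('/', 1)[-1].startswith(('.', '_'))]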
On Mon, Sep 15, 2014 at 2:51 PM, Davies Liu <dav...@databricks.com> wrote:
> There is one way to do it in bash: hadoop fs -ls xxxx. Maybe you could
> end up with a bash script that does the job.
>
> On Mon, Sep 15, 2014 at 1:01 PM, Eric Friedman
> <eric.d.fried...@gmail.com> wrote:
>> That's a good idea and one I had considered too. Unfortunately I'm not
>> aware of an API in PySpark for enumerating paths on HDFS. Have I
>> overlooked one?
>>
>> On Mon, Sep 15, 2014 at 10:01 AM, Davies Liu <dav...@databricks.com> wrote:
>>>
>>> In PySpark, I think you could enumerate all the valid files, create an
>>> RDD for each with newAPIHadoopFile(), then union them together.
>>>
>>> On Mon, Sep 15, 2014 at 5:49 AM, Eric Friedman
>>> <eric.d.fried...@gmail.com> wrote:
>>> > I neglected to specify that I'm using pyspark. Doesn't look like these
>>> > APIs have been bridged.
>>> >
>>> > ----
>>> > Eric Friedman
>>> >
>>> >> On Sep 14, 2014, at 11:02 PM, Nat Padmanabhan <reachn...@gmail.com>
>>> >> wrote:
>>> >>
>>> >> Hi Eric,
>>> >>
>>> >> Something along the lines of the following should work:
>>> >>
>>> >>   val fs = getFileSystem(...)  // standard Hadoop API call
>>> >>   // pathFilter is an instance of org.apache.hadoop.fs.PathFilter
>>> >>   val filteredConcatenatedPaths = fs.listStatus(topLevelDirPath, pathFilter)
>>> >>     .map(_.getPath.toString).mkString(",")
>>> >>   val parquetRdd = sc.hadoopFile(filteredConcatenatedPaths,
>>> >>     classOf[ParquetInputFormat[Something]], classOf[Void],
>>> >>     classOf[SomeAvroType], getConfiguration(...))
>>> >>
>>> >> You have to do some initialization on ParquetInputFormat, such as
>>> >> AvroReadSupport/AvroWriteSupport etc., but I'm guessing you are doing
>>> >> that already.
>>> >>
>>> >> Cheers,
>>> >> Nat
>>> >>
>>> >>
>>> >> On Sun, Sep 14, 2014 at 7:37 PM, Eric Friedman
>>> >> <eric.d.fried...@gmail.com> wrote:
>>> >>> Hi,
>>> >>>
>>> >>> I have a directory structure with parquet+avro data in it. There are
>>> >>> a couple of administrative files (.foo and/or _foo) that I need to
>>> >>> ignore when processing this data, or Spark tries to read them as
>>> >>> containing parquet content, which they do not.
>>> >>>
>>> >>> How can I set a PathFilter on the FileInputFormat used to construct
>>> >>> an RDD?
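
For completeness, a rough PySpark sketch of the "enumerate, then union" idea from earlier in the thread, picking up the filtered `paths` list from the snakebite example above. The input format and value class names here are assumptions (the 2014-era parquet.avro.AvroParquetInputFormat), and getting Avro records back into Python still needs the Parquet/Avro jars and a suitable value converter on the classpath -- treat it as a starting point, not a drop-in solution:

    # Sketch only: the class names below are assumptions, and a value
    # converter is needed for the records to deserialize into Python.
    rdds = [sc.newAPIHadoopFile(
                p,
                'parquet.avro.AvroParquetInputFormat',    # assumed input format
                'java.lang.Void',
                'org.apache.avro.generic.GenericRecord')  # assumed value class
            for p in paths]

    combined = sc.union(rdds)  # one logical RDD over all the clean paths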