Great, that worked! The only problem was that it returned all the files, including _SUCCESS and _metadata, but I filtered to keep only the *.parquet files.
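Roughly, the filtering looks like this (a minimal sketch; the table name is a placeholder):

    val parquetFiles = sqlContext.table("my_parquet_table")
      .inputFiles                        // best effort, per Michael's note below
      .filter(_.endsWith(".parquet"))    // drops _SUCCESS, _metadata, etc.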
Thanks Michael,
Krzysztof

2015-12-01 20:20 GMT+01:00 Michael Armbrust <[email protected]>:

> sqlContext.table("...").inputFiles
>
> (This is best effort, but should work for Hive tables.)
>
> Michael
>
> On Tue, Dec 1, 2015 at 10:55 AM, Krzysztof Zarzycki <[email protected]> wrote:
>
>> Hi there,
>> Does anyone know how I can easily get a list of all the files of a Hive table?
>>
>> What I want to achieve is to get all the files underneath a Parquet table
>> and, using the sparksql-protobuf [1] library (a really handy library!) and its
>> helper class ProtoParquetRDD:
>>
>> val protobufsRdd = new ProtoParquetRDD(sc, "files", classOf[MyProto])
>>
>> access the underlying Parquet files as normal protocol buffers. But I
>> don't know how to get them. When I pointed the call above at one file by
>> hand, it worked well.
>> The Parquet table was created with the same library and its implicit
>> hiveContext extension createDataFrame, which creates a DataFrame based on a
>> protocol buffer class.
>>
>> (The reverse read operation is needed to support legacy code: after
>> converting protocol buffers to Parquet, I still want some code to access
>> the Parquet data as normal protocol buffers.)
>>
>> Maybe someone has another way to get an RDD of protocol buffers from a
>> Parquet-stored table.
>>
>> [1] https://github.com/saurfang/sparksql-protobuf
>>
>> Thanks!
>> Krzysztof
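Putting the two pieces together, the full read-back might look like this. This is a sketch only: it assumes ProtoParquetRDD takes a single path per instance (as in the snippet above) and is an RDD[MyProto], so one RDD is built per file and the results are unioned; the import path, table name, and MyProto are placeholders to adjust for your build.

    import org.apache.spark.rdd.RDD
    // adjust to the library's actual package; this path is an assumption
    import com.github.saurfang.parquet.proto.spark.ProtoParquetRDD

    // list the table's data files, skipping Parquet sidecar files
    val files = sqlContext.table("my_parquet_table")
      .inputFiles
      .filter(_.endsWith(".parquet"))

    // one ProtoParquetRDD per file, unioned into a single RDD of protobufs
    val protos: RDD[MyProto] =
      sc.union(files.map(f => new ProtoParquetRDD(sc, f, classOf[MyProto]): RDD[MyProto]).toSeq)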
