Hi V, 

I am assuming that each of the three .parquet paths you mentioned has
multiple partitions in it.

For eg: [/dataset/city=London/data.parquet/part-r-0.parquet,
/dataset/city=London/data.parquet/part-r-1.parquet]

I haven't personally used this with HDFS, but I've worked with a similar
file structure (with '=' in the path) on S3.

The way I get around this is by building a string of all the file paths
separated by commas (with NO spaces in between), and then using that string
as the file path parameter. I think the following adaptation of the S3
file-access pattern to HDFS would work (there is also a small sketch of
building the string programmatically after the examples below).

If I want to load 1 file:
sqlContext.parquetFile("hdfs://some-ip:8029/dataset/city=London/data.parquet")

If I want to load multiple files (let's say all 3 of them):
sqlContext.parquetFile("hdfs://some-ip:8029/dataset/city=London/data.parquet,hdfs://some-ip:8029/dataset/city=NewYork/data.parquet,hdfs://some-ip:8029/dataset/city=Paris/data.parquet")

*** But in the multiple-file scenario, the schemas of all the files should
be the same.
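In case it helps, here is a rough Scala sketch of building that comma-separated
path string programmatically rather than by hand. The NameNode host/port
("namenode:8029") and the city list are just placeholders based on the example
above, and sqlContext.parquetFile is the Spark 1.3.x API:

// placeholder NameNode address and partition values -- replace with yours
val basePath = "hdfs://namenode:8029/dataset"
val cities = Seq("London", "NewYork", "Paris")

// one comma-separated string of paths, with NO spaces in between
val paths = cities.map(c => s"$basePath/city=$c/data.parquet").mkString(",")

val df = sqlContext.parquetFile(paths)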

I hope you can use this S3 pattern with HDFS and that it works for you!

Thanks
in4




