No, Scala. Suppose I read the Parquet file as shown below. How would that be converted to an RDD to use it in my Spark batch job? I use core Spark; I don't use Spark SQL.
ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[AminoAcid]])
val file = sc.newAPIHadoopFile(outputDir, classOf[ParquetInputFormat[AminoAcid]],
  classOf[Void], classOf[AminoAcid], job.getConfiguration)

On Thu, Nov 5, 2015 at 12:48 PM, Igor Berman <igor.ber...@gmail.com> wrote:

> java/scala? I think there is everything in the dataframes tutorial,
> e.g. if you have a dataframe and are working from Java: toJavaRDD
> <https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#toJavaRDD()>
>
> On 5 November 2015 at 21:13, swetha kasireddy <swethakasire...@gmail.com> wrote:
>
>> How to convert a Parquet file that is saved in HDFS to an RDD after
>> reading the file from HDFS?
>>
>> On Thu, Nov 5, 2015 at 10:02 AM, Igor Berman <igor.ber...@gmail.com> wrote:
>>
>>> Hi,
>>> we are using Avro with compression (Snappy). As soon as you have enough
>>> partitions, the saving won't be a problem, IMHO.
>>> In general HDFS is pretty fast; S3 is less so.
>>> The issue with storing data is that you will lose your partitioner (even
>>> though the RDD has it) at loading time. There is a PR that tries to solve this.
>>>
>>> On 5 November 2015 at 01:09, swetha <swethakasire...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> What is the efficient approach to save an RDD as a file in HDFS and
>>>> retrieve it back? I was thinking between Avro, Parquet and
>>>> SequenceFileFormat. We currently use SequenceFileFormat for one of our
>>>> use cases.
>>>>
>>>> Any example on how to store and retrieve an RDD in the Avro and Parquet
>>>> file formats would be of great help.
>>>>
>>>> Thanks,
>>>> Swetha
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-approach-to-store-an-RDD-as-a-file-in-HDFS-and-read-it-back-as-an-RDD-tp25279.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
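
[Editor's note: to answer the question at the top of the thread — `newAPIHadoopFile` already returns a pair RDD, so no Spark SQL is needed; dropping the (always-null) `Void` keys yields a plain `RDD[AminoAcid]`. A minimal sketch, assuming the Avro-generated `AminoAcid` class, the configured `job`, and the `outputDir` path from the snippet above:]

```scala
import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.avro.AvroReadSupport
import org.apache.parquet.hadoop.ParquetInputFormat
import org.apache.spark.rdd.RDD

// `job` configured as in the snippet above.
ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[AminoAcid]])

// newAPIHadoopFile returns RDD[(Void, AminoAcid)]; the keys are always
// null, so map them away to get an RDD of records.
val pairs = sc.newAPIHadoopFile(
  outputDir,
  classOf[ParquetInputFormat[AminoAcid]],
  classOf[Void],
  classOf[AminoAcid],
  job.getConfiguration)

val records: RDD[AminoAcid] = pairs.map(_._2)
```

From here `records` is an ordinary RDD and can be used in any core-Spark batch job.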
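
[Editor's note: Igor's suggestion of Avro with Snappy compression can be sketched for the write side as follows — untested, and assuming the Avro-generated `AminoAcid` class and a hypothetical output path; the codec is enabled through standard Hadoop output-compression properties:]

```scala
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job

val job = Job.getInstance()
// Tell the output format which Avro schema to write.
AvroJob.setOutputKeySchema(job, AminoAcid.getClassSchema)

// Enable Snappy compression via the standard Hadoop output properties.
job.getConfiguration.set("mapreduce.output.fileoutputformat.compress", "true")
job.getConfiguration.set(
  "mapreduce.output.fileoutputformat.compress.codec",
  "org.apache.hadoop.io.compress.SnappyCodec")

// Wrap each record in an AvroKey and save; the path is a placeholder.
rdd.map(r => (new AvroKey(r), NullWritable.get()))
   .saveAsNewAPIHadoopFile(
     "hdfs:///path/to/output",
     classOf[AvroKey[AminoAcid]],
     classOf[NullWritable],
     classOf[AvroKeyOutputFormat[AminoAcid]],
     job.getConfiguration)
```

As noted above, the partitioner is not preserved across a save/load round trip, so a repartition (or re-`partitionBy`) may be needed after reading the data back.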