OK, I found the following code that does that:

def readParquetRDD[T <% SpecificRecord](sc: SparkContext, parquetFile: String)(implicit tag: ClassTag[T]): RDD[T] = {
  val jobConf = new JobConf(sc.hadoopConfiguration)
  ParquetInputFormat.setReadSupportClass(jobConf, classOf[AvroReadSupport[T]])
  sc.newAPIHadoopFile(
      parquetFile,
      classOf[ParquetInputFormat[T]],
      classOf[Void],
      tag.runtimeClass.asInstanceOf[Class[T]],
      jobConf)
    .map(_._2.asInstanceOf[T])
}
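For completeness, here is a sketch of how that helper could be called, with the imports it needs. This assumes the Avro-generated AminoAcid class from your snippet below and an already-created SparkContext; the HDFS path is a made-up placeholder, and the parquet imports are for a recent parquet-mr (older releases used the parquet.* namespace instead of org.apache.parquet.*):

import scala.reflect.ClassTag

import org.apache.hadoop.mapred.JobConf
import org.apache.parquet.avro.AvroReadSupport
import org.apache.parquet.hadoop.ParquetInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Read the Parquet file back as a typed RDD and use it in core Spark,
// no Spark SQL involved.
val aminoAcids: RDD[AminoAcid] =
  readParquetRDD[AminoAcid](sc, "hdfs:///path/to/amino_acids.parquet")
aminoAcids.take(10).foreach(println)

Note the key class is Void because ParquetInputFormat produces (Void, T) pairs; the helper's final map discards the null keys.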
On Thu, Nov 5, 2015 at 2:14 PM, swetha kasireddy <swethakasire...@gmail.com> wrote: > No, Scala. Suppose I read the Parquet file as shown in the following. How > would that be converted to an RDD to use in my Spark batch job? I use core > Spark; I don't use Spark SQL. > > ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[AminoAcid]]) > val file = sc.newAPIHadoopFile(outputDir, classOf[ParquetInputFormat[AminoAcid]], > classOf[Void], classOf[AminoAcid], job.getConfiguration) > > On Thu, Nov 5, 2015 at 12:48 PM, Igor Berman <igor.ber...@gmail.com> > wrote: > >> Java/Scala? I think everything is in the DataFrames tutorial, >> e.g. if you have a DataFrame and are working from Java - toJavaRDD >> <https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#toJavaRDD()> >> >> On 5 November 2015 at 21:13, swetha kasireddy <swethakasire...@gmail.com> >> wrote: >> >>> How do I convert a Parquet file that is saved in HDFS to an RDD after >>> reading the file from HDFS? >>> >>> On Thu, Nov 5, 2015 at 10:02 AM, Igor Berman <igor.ber...@gmail.com> >>> wrote: >>> >>>> Hi, >>>> we are using Avro with compression (Snappy). As soon as you have enough >>>> partitions, the saving won't be a problem, IMHO. >>>> In general HDFS is pretty fast; S3 is less so. >>>> The issue with storing data is that you will lose your >>>> partitioner (even though the RDD has it) at loading time. There is a PR that >>>> tries to solve this. >>>> >>>> >>>> On 5 November 2015 at 01:09, swetha <swethakasire...@gmail.com> wrote: >>>> >>>>> Hi, >>>>> >>>>> What is the efficient approach to save an RDD as a file in HDFS and >>>>> retrieve >>>>> it back? I was deciding between Avro, Parquet and SequenceFileFormat. >>>>> We >>>>> currently use SequenceFileFormat for one of our use cases. >>>>> >>>>> Any example of how to store and retrieve an RDD in the Avro and Parquet >>>>> file >>>>> formats would be of great help. 
>>>>> >>>>> Thanks, >>>>> Swetha >>>>> >>>>> -- >>>>> View this message in context: >>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-approach-to-store-an-RDD-as-a-file-in-HDFS-and-read-it-back-as-an-RDD-tp25279.html >>>>> Sent from the Apache Spark User List mailing list archive at >>>>> Nabble.com.
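Since the original question also asked how to store an RDD in Parquet, here is a sketch of the write side to mirror readParquetRDD. The helper name writeParquetRDD is my own, and this assumes parquet-avro's AvroParquetOutputFormat and an Avro schema for the record type; adjust the package names for your parquet-mr version:

import scala.reflect.ClassTag

import org.apache.avro.Schema
import org.apache.avro.specific.SpecificRecord
import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.avro.AvroParquetOutputFormat
import org.apache.spark.rdd.RDD

// Hypothetical write-side counterpart: saves an RDD of Avro records as Parquet.
def writeParquetRDD[T <: SpecificRecord](rdd: RDD[T], path: String, schema: Schema)(implicit tag: ClassTag[T]): Unit = {
  val job = Job.getInstance(rdd.sparkContext.hadoopConfiguration)
  // Tell the output format which Avro schema to write with.
  AvroParquetOutputFormat.setSchema(job, schema)
  rdd
    // ParquetOutputFormat ignores the key, so pair each record with a null Void key.
    .map(record => (null.asInstanceOf[Void], record))
    .saveAsNewAPIHadoopFile(
      path,
      classOf[Void],
      tag.runtimeClass,
      classOf[AvroParquetOutputFormat],
      job.getConfiguration)
}

A round trip would then be writeParquetRDD(aminoAcids, outputDir, AminoAcid.getClassSchema) followed by readParquetRDD[AminoAcid](sc, outputDir).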