OK, I found the following code that does that:

def readParquetRDD[T <% SpecificRecord](sc: SparkContext, parquetFile: String)(implicit tag: ClassTag[T]): RDD[T] = {
  val jobConf = new JobConf(sc.hadoopConfiguration)
  ParquetInputFormat.setReadSupportClass(jobConf, classOf[AvroReadSupport[T]])
  sc.newAPIHadoopFile(
      parquetFile,
      classOf[ParquetInputFormat[T]],
      classOf[Void],
      tag.runtimeClass.asInstanceOf[Class[T]],
      jobConf)
    .map(_._2.asInstanceOf[T])
}
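For completeness, here is a sketch of how that helper could be called, with the imports it needs. This assumes the Avro-generated AminoAcid class from your snippet below and an already-created SparkContext; the HDFS path is a made-up placeholder, and the parquet imports are for a recent parquet-mr (older releases used the parquet.* namespace instead of org.apache.parquet.*):

import scala.reflect.ClassTag

import org.apache.hadoop.mapred.JobConf
import org.apache.parquet.avro.AvroReadSupport
import org.apache.parquet.hadoop.ParquetInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Read the Parquet file back as a typed RDD and use it in core Spark,
// no Spark SQL involved.
val aminoAcids: RDD[AminoAcid] =
  readParquetRDD[AminoAcid](sc, "hdfs:///path/to/amino_acids.parquet")
aminoAcids.take(10).foreach(println)

Note the key class is Void because ParquetInputFormat produces (Void, T) pairs; the helper's final map discards the null keys.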
On Thu, Nov 5, 2015 at 2:14 PM, swetha kasireddy <swethakasire...@gmail.com> wrote: > No, Scala. Suppose I read the Parquet file as shown in the following. How > would that be converted to an RDD to use in my Spark batch job? I use core > Spark; I don't use Spark SQL. > > ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[AminoAcid]]) > val file = sc.newAPIHadoopFile(outputDir, classOf[ParquetInputFormat[AminoAcid]], > classOf[Void], classOf[AminoAcid], job.getConfiguration) > > On Thu, Nov 5, 2015 at 12:48 PM, Igor Berman <igor.ber...@gmail.com> > wrote: > >> Java/Scala? I think everything is in the DataFrames tutorial, >> e.g. if you have a DataFrame and are working from Java - toJavaRDD >> <https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#toJavaRDD()> >> >> On 5 November 2015 at 21:13, swetha kasireddy <swethakasire...@gmail.com> >> wrote: >> >>> How do I convert a Parquet file that is saved in HDFS to an RDD after >>> reading the file from HDFS? >>> >>> On Thu, Nov 5, 2015 at 10:02 AM, Igor Berman <igor.ber...@gmail.com> >>> wrote: >>> >>>> Hi, >>>> we are using Avro with compression (Snappy). As soon as you have enough >>>> partitions, the saving won't be a problem, IMHO. >>>> In general HDFS is pretty fast; S3 is less so. >>>> The issue with storing data is that you will lose your >>>> partitioner (even though the RDD has it) at loading time. There is a PR that >>>> tries to solve this. >>>> >>>> >>>> On 5 November 2015 at 01:09, swetha <swethakasire...@gmail.com> wrote: >>>> >>>>> Hi, >>>>> >>>>> What is the efficient approach to save an RDD as a file in HDFS and >>>>> retrieve >>>>> it back? I was deciding between Avro, Parquet and SequenceFileFormat. >>>>> We >>>>> currently use SequenceFileFormat for one of our use cases. >>>>> >>>>> Any example of how to store and retrieve an RDD in the Avro and Parquet >>>>> file >>>>> formats would be of great help. 
>>>>> >>>>> Thanks, >>>>> Swetha >>>>> >>>>> -- >>>>> View this message in context: >>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-approach-to-store-an-RDD-as-a-file-in-HDFS-and-read-it-back-as-an-RDD-tp25279.html >>>>> Sent from the Apache Spark User List mailing list archive at >>>>> Nabble.com.
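Since the original question also asked how to store an RDD in Parquet, here is a sketch of the write side to mirror readParquetRDD. The helper name writeParquetRDD is my own, and this assumes parquet-avro's AvroParquetOutputFormat and an Avro schema for the record type; adjust the package names for your parquet-mr version:

import scala.reflect.ClassTag

import org.apache.avro.Schema
import org.apache.avro.specific.SpecificRecord
import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.avro.AvroParquetOutputFormat
import org.apache.spark.rdd.RDD

// Hypothetical write-side counterpart: saves an RDD of Avro records as Parquet.
def writeParquetRDD[T <: SpecificRecord](rdd: RDD[T], path: String, schema: Schema)(implicit tag: ClassTag[T]): Unit = {
  val job = Job.getInstance(rdd.sparkContext.hadoopConfiguration)
  // Tell the output format which Avro schema to write with.
  AvroParquetOutputFormat.setSchema(job, schema)
  rdd
    // ParquetOutputFormat ignores the key, so pair each record with a null Void key.
    .map(record => (null.asInstanceOf[Void], record))
    .saveAsNewAPIHadoopFile(
      path,
      classOf[Void],
      tag.runtimeClass,
      classOf[AvroParquetOutputFormat],
      job.getConfiguration)
}

A round trip would then be writeParquetRDD(aminoAcids, outputDir, AminoAcid.getClassSchema) followed by readParquetRDD[AminoAcid](sc, outputDir).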