No Scala. Suppose I read the Parquet file as shown below. How would that be
converted to an RDD for use in my Spark batch job? I use core Spark; I don't
use Spark SQL.

ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[AminoAcid]])
val file = sc.newAPIHadoopFile(outputDir,
  classOf[ParquetInputFormat[AminoAcid]],
  classOf[Void],
  classOf[AminoAcid],
  job.getConfiguration)
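In other words, I'd expect something like the following to give me a plain RDD of records (a sketch, assuming newAPIHadoopFile returns an RDD of (Void, AminoAcid) pairs, with outputDir, job, and AminoAcid as above):

```scala
// Sketch: newAPIHadoopFile on a ParquetInputFormat yields an
// RDD[(Void, AminoAcid)] -- the key is always a null Void.
// Dropping the keys with a map gives a plain RDD[AminoAcid]
// usable in a core-Spark batch job.
val pairs = sc.newAPIHadoopFile(outputDir,
  classOf[ParquetInputFormat[AminoAcid]],
  classOf[Void],
  classOf[AminoAcid],
  job.getConfiguration)

// Keep only the values; `pairs.values` would do the same.
val records = pairs.map(_._2)
```

Is that the right approach, or is there a better way without going through Spark SQL?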

On Thu, Nov 5, 2015 at 12:48 PM, Igor Berman <igor.ber...@gmail.com> wrote:

> java/scala? I think there is everything in the dataframes tutorial.
> e.g. if you have a dataframe and are working from Java - toJavaRDD()
> <https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#toJavaRDD()>
>
> On 5 November 2015 at 21:13, swetha kasireddy <swethakasire...@gmail.com>
> wrote:
>
>> How to convert a parquet file that is saved in hdfs to an RDD after
>> reading the file from hdfs?
>>
>> On Thu, Nov 5, 2015 at 10:02 AM, Igor Berman <igor.ber...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> we are using Avro with compression (Snappy). As soon as you have enough
>>> partitions, the saving won't be a problem, imho.
>>> In general HDFS is pretty fast; S3 is less so.
>>> The issue with storing data is that you will lose your partitioner (even
>>> though the RDD has it) at load time. There is a PR that tries to solve this.
>>>
>>>
>>> On 5 November 2015 at 01:09, swetha <swethakasire...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> What is the efficient approach to save an RDD as a file in HDFS and
>>>> retrieve
>>>> it back? I was thinking between Avro, Parquet and SequenceFileFormat. We
>>>> currently use SequenceFileFormat for one of our use cases.
>>>>
>>>> Any example on how to store and retrieve an RDD in an Avro and Parquet
>>>> file
>>>> formats would be of great help.
>>>>
>>>> Thanks,
>>>> Swetha
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-approach-to-store-an-RDD-as-a-file-in-HDFS-and-read-it-back-as-an-RDD-tp25279.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>
>>
>