Concurrent inserts into the same table are not supported. I can try to make this clearer in the documentation.
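Since the jobs already use unique, timestamp-suffixed names, one workaround is to skip the shared table entirely and have each job write its own Parquet directory. A minimal sketch, assuming the Spark 1.2-era SchemaRDD API (`jsonRDD` / `saveAsParquetFile`); the names `baseDir`, `jobName`, and `uniqueOutputPath` are hypothetical, not from the original code:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// Build a per-job output path so parallel jobs never write to the same directory.
def uniqueOutputPath(baseDir: String, jobName: String): String =
  s"$baseDir/$jobName-${System.currentTimeMillis()}"

// Each job converts its JSON RDD to Parquet in its own HDFS directory,
// avoiding concurrent inserts into one table altogether.
def writeJsonAsParquet(sc: SparkContext, results: RDD[String],
                       baseDir: String, jobName: String): String = {
  val sqlContext = new SQLContext(sc)
  val jsonRdd = sqlContext.jsonRDD(results) // infer schema from the JSON strings
  val out = uniqueOutputPath(baseDir, jobName)
  jsonRdd.saveAsParquetFile(out)            // one directory per job, no shared table
  out
}
```

The separate directories can still be read back together later (e.g. by pointing `parquetFile` at each path), without any two jobs ever racing on the same table.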
On Tue, Feb 17, 2015 at 8:01 PM, Vasu C <vasuc.bigd...@gmail.com> wrote:
> Hi,
>
> I am running a Spark batch-processing job using the spark-submit command,
> and below is my code snippet. Basically it converts a JSON RDD to Parquet
> and stores it in an HDFS location.
>
> The problem I am facing is that if multiple jobs are triggered in parallel,
> even though each job executes properly (as I can see in the Spark web UI),
> not every Parquet file is created in the HDFS path. If 5 jobs are executed
> in parallel, only 3 Parquet files get created.
>
> Is this a data-loss scenario, or am I missing something here? Please help
> me with this.
>
> Here tableName is unique, with a timestamp appended to it.
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val jsonRdd = sqlContext.jsonRDD(results)
> val parquetTable = sqlContext.parquetFile(parquetFilePath)
> parquetTable.registerTempTable(tableName)
> jsonRdd.insertInto(tableName)
>
> Regards,
> Vasu C