I have tried Structured Streaming, but the line
> Dataset<Row> rowDataset = spark.read().json(jsondataset);
throws
> Queries with streaming sources must be executed with writeStream.start()
All I need to do in this step is transform the JSON string data into a
Dataset<Row>. How can I fix it?
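
From the Structured Streaming guide it looks like from_json with an
explicit schema might be the intended way to do this parsing; here is a
rough sketch of what I mean (the schema and field names below are just
placeholders):

> import static org.apache.spark.sql.functions.*;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructType;
>
> // Placeholder schema; the real one would mirror my JSON layout.
> StructType schema = new StructType()
>         .add("appname", DataTypes.StringType)
>         .add("payload", DataTypes.StringType);
>
> // "stream" is the streaming Dataset<Row> from spark.readStream();
> // the Kafka "value" column holds the raw JSON strings.
> Dataset<Row> parsed = stream
>         .selectExpr("CAST(value AS STRING) AS json")
>         .select(from_json(col("json"), schema).alias("data"))
>         .select("data.*");

Is that the right direction?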
On Thu, Apr 12, 2018 at 3:08 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
> It's not very surprising that doing this sort of RDD to DF conversion
> inside DStream.foreachRDD has weird corner cases like this. In fact, you
> are going to have additional problems with partial parquet files (when
> there are failures) in this approach. I strongly suggest that you use
> Structured Streaming, which is designed to do this sort of processing. It
> will take care of tracking the written parquet files correctly.
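> A rough sketch of what that end-to-end pipeline might look like (the
> topic, schema, and paths below are placeholders):
>
> import static org.apache.spark.sql.functions.*;
> import org.apache.spark.sql.streaming.StreamingQuery;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructType;
>
> // Placeholder schema matching the JSON payload.
> StructType schema = new StructType()
>         .add("appname", DataTypes.StringType);
>
> StreamingQuery query = spark.readStream()
>         .format("kafka")
>         .option("kafka.bootstrap.servers", "host:9092")
>         .option("subscribe", "mytopic")
>         .load()
>         .selectExpr("CAST(value AS STRING) AS json")
>         .select(from_json(col("json"), schema).alias("data"))
>         .select("data.*")
>         .writeStream()
>         .format("parquet")
>         .option("path", "/path/to/parquet")
>         // the checkpoint directory is how Spark tracks committed files
>         .option("checkpointLocation", "/path/to/checkpoint")
>         .partitionBy("appname")
>         .start();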
> On Wed, Apr 11, 2018 at 6:58 PM, Junfeng Chen <darou...@gmail.com> wrote:
>> I have written a program that reads some JSON data from Kafka and saves
>> it as parquet files on HDFS.
>> Here is my code:
>>> JavaInputDStream stream = ...
>>> JavaDStream rdd = stream.map...
>>> Dataset<Row> df = spark.read().json( stringjavardd ); // convert json to df
>>> JavaRDD<Row> rowJavaRDD = df.javaRDD().map... // add some new fields
>>> StructType type = df.schema()...; // construct new schema for the added fields
>>> Dataset<Row> newdf = spark.createDataFrame(rowJavaRDD, type); // create new dataframe
>>> newdf.repartition(...).write().partitionBy("appname").parquet(savepath); // save to parquet
>> However, if I remove the repartition call on newdf in the parquet-writing
>> stage, the program always throws a NullPointerException during the JSON
>> conversion:
>>> at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.
>> This seems to make no sense to me: the parquet-writing operation should be
>> in a different stage from the JSON transformation.
>> So how can I solve it? Thanks!
>> Junfeng Chen