Hi, I'm in a somewhat similar situation. Here's what I do (it seems to be working so far):
1. Stream in the JSON as a plain string.
2. Feed this string into a JSON library to validate it (I use Circe).
3. Using the same library, parse the JSON and extract fields X, Y and Z.
4. Create a dataset with fields X, Y, Z and the JSON as a String.
5. Write this dataset to HDFS as Parquet, partitioned on X and sorted on Y.

Obviously, this is not exactly the same as your use case (for instance, I have no idea what your requirements are regarding "flattening the nested JSONs"). Also, I extract only a few fields to use as columns in the resulting Dataset and store the rest of the JSON as a string. However, the principle should be the same for you.

HTH.
Phillip

On Mon, Feb 11, 2019 at 2:59 PM Jacek Laskowski <ja...@japila.pl> wrote:

> Hi Lian,
>
> "What have you tried?" would be a good starting point. Any help on this?
>
> How do you read the JSONs? readStream.json? You could use readStream.text
> followed by a filter to include/exclude good/bad JSONs.
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sat, Feb 9, 2019 at 8:25 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>
>> Hi,
>>
>> We have a structured streaming job that converts JSON into Parquet. We
>> want to validate the JSON records. If a JSON record is not valid, we want
>> to log a message and refuse to write it to the Parquet output. Also, the
>> JSON contains nested JSONs, and we want to flatten the nested JSONs into
>> other Parquet outputs using the same streaming job. My questions are:
>>
>> 1. How do we validate the JSON records in a structured streaming job?
>> 2. How do we flatten the nested JSONs in a structured streaming job?
>> 3. Is it possible to use one structured streaming job to validate JSON,
>> convert JSON into Parquet, and convert nested JSONs into other Parquet
>> outputs?
>>
>> I think unstructured streaming can achieve these goals, but structured
>> streaming is recommended by the Spark community.
>>
>> Appreciate your feedback!
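For what it's worth, the steps Phillip describes (combined with Jacek's readStream.text-plus-filter suggestion) could be sketched roughly as below. This is only an illustration, not Phillip's actual code: the input/output/checkpoint paths and the field names x, y, z are placeholders, and the whole thing assumes Spark Structured Streaming and Circe on the classpath. Note that a streaming query cannot globally sort its output, so the "sorted on Y" part would need something extra such as sortWithinPartitions inside a foreachBatch sink.

```scala
// Sketch only: paths and field names (x, y, z) are hypothetical placeholders.
import io.circe.parser.parse
import org.apache.spark.sql.SparkSession

object JsonToParquet {
  // A few extracted columns plus the raw JSON kept as a plain string.
  case class Record(x: String, y: String, z: String, raw: String)

  // Validate with Circe and extract the fields; None means the record is
  // invalid (unparseable JSON or missing fields) and gets dropped. Logging
  // of bad records could be added here.
  def toRecord(raw: String): Option[Record] =
    parse(raw).toOption.flatMap { json =>
      val c = json.hcursor
      for {
        x <- c.get[String]("x").toOption
        y <- c.get[String]("y").toOption
        z <- c.get[String]("z").toOption
      } yield Record(x, y, z, raw)
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()
    import spark.implicits._

    // 1. Stream the JSON in as plain strings (readStream.text, not readStream.json).
    val lines = spark.readStream.text("hdfs:///in/json").as[String]

    // 2-4. Validate, parse, extract x/y/z, and keep the raw JSON as a column.
    val records = lines.flatMap(toRecord(_))

    // 5. Write to HDFS as Parquet, partitioned on x.
    records.writeStream
      .format("parquet")
      .option("path", "hdfs:///out/parquet")
      .option("checkpointLocation", "hdfs:///checkpoints/json-to-parquet")
      .partitionBy("x")
      .start()
      .awaitTermination()
  }
}
```

Flattening the nested JSONs into separate Parquet outputs could follow the same pattern: decode the nested objects into their own case classes in toRecord-style helpers and start an additional writeStream per output.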