Here's a link to the thread:
http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-Dropping-Duplicates-td20884.html

On Sat, 11 Feb 2017 at 08:47, Sam Elamin <hussam.ela...@gmail.com> wrote:
> Hey Egor,
>
> You can use ForeachWriter or you can write a custom sink. I personally
> went with a custom sink, since I get a DataFrame per batch:
>
> https://github.com/samelamin/spark-bigquery/blob/master/src/main/scala/com/samelamin/spark/bigquery/streaming/BigQuerySink.scala
>
> You can have a look at how I implemented something similar to the file
> sink that, in the event of a failure, skips batches already written.
>
> Also have a look at Michael's reply to me a few days ago on exactly the
> same topic. The email subject was "Structured Streaming. Dropping
> Duplicates".
>
> Regards,
>
> Sam
>
> On Sat, 11 Feb 2017 at 07:59, Jacek Laskowski <ja...@japila.pl> wrote:
>
> "Something like that." I've never tried it out myself, so I'm only
> guessing, having had a brief look at the API.
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sat, Feb 11, 2017 at 1:31 AM, Egor Pahomov <pahomov.e...@gmail.com> wrote:
> > Jacek, so I create a cache in ForeachWriter, in every "process()" I
> > write to it, and on close I flush? Something like that?
> >
> > 2017-02-09 12:42 GMT-08:00 Jacek Laskowski <ja...@japila.pl>:
> >>
> >> Hi,
> >>
> >> Yes, that's ForeachWriter.
> >>
> >> Yes, it works element by element. You're looking for mapPartitions,
> >> and ForeachWriter has a partitionId that you could use to implement a
> >> similar thing.
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> ----
> >> https://medium.com/@jaceklaskowski/
> >> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >>
> >> On Thu, Feb 9, 2017 at 3:55 AM, Egor Pahomov <pahomov.e...@gmail.com> wrote:
> >> > Jacek, you mean
> >> > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter ?
> >> > I do not understand how to use it, since it passes every value
> >> > separately, not every partition. And adding to the table value by
> >> > value would not work.
> >> >
> >> > 2017-02-07 12:10 GMT-08:00 Jacek Laskowski <ja...@japila.pl>:
> >> >>
> >> >> Hi,
> >> >>
> >> >> Have you considered the foreach sink?
> >> >>
> >> >> Jacek
> >> >>
> >> >> On 6 Feb 2017 8:39 p.m., "Egor Pahomov" <pahomov.e...@gmail.com> wrote:
> >> >>>
> >> >>> Hi, I'm thinking of using Structured Streaming instead of the old
> >> >>> streaming, but I need to be able to save results to a Hive table.
> >> >>> The documentation for the file sink
> >> >>> (http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks)
> >> >>> says: "Supports writes to partitioned tables." But being able to
> >> >>> write to partitioned directories is not enough to write to the
> >> >>> table: someone needs to write to the Hive metastore. How can I use
> >> >>> Structured Streaming and write to a Hive table?
> >> >>>
> >> >>> --
> >> >>> Sincerely yours
> >> >>> Egor Pakhomov
> >> >
> >> >
> >> > --
> >> > Sincerely yours
> >> > Egor Pakhomov
> >
> >
> > --
> > Sincerely yours
> > Egor Pakhomov
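[Note: below is a minimal, untested sketch of the "create a cache in
ForeachWriter, write to it in process(), flush on close()" idea confirmed
above. BufferedHiveWriter, flushPartition() and the `input` streaming
DataFrame are placeholder names, not Spark APIs; replace the flush with
whatever bulk write fits your case, e.g. writing a file under the table's
partition directory and registering it with the Hive metastore via
ALTER TABLE ... ADD PARTITION.]

import org.apache.spark.sql.{ForeachWriter, Row}
import scala.collection.mutable.ArrayBuffer

// Sketch only: buffer rows per (partitionId, version) and write them in one go
// instead of value by value.
class BufferedHiveWriter extends ForeachWriter[Row] {
  private val buffer = ArrayBuffer.empty[Row]
  private var partitionId: Long = _
  private var version: Long = _

  override def open(partitionId: Long, version: Long): Boolean = {
    this.partitionId = partitionId
    this.version = version
    buffer.clear()
    true  // return false to skip a (partitionId, version) already written
  }

  override def process(value: Row): Unit = {
    buffer += value  // accumulate instead of writing each row immediately
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull == null) flushPartition(partitionId, version, buffer)
    buffer.clear()
  }

  // Placeholder, not a Spark API: do one bulk write here, e.g. write a file
  // into the table's partition directory and then add the partition to the
  // Hive metastore.
  private def flushPartition(partitionId: Long, version: Long, rows: Seq[Row]): Unit = {
    // your bulk-write logic here
  }
}

Wire it up with the foreach sink (assuming `input` is your streaming DataFrame):

val query = input.writeStream
  .foreach(new BufferedHiveWriter)
  .start()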