Here's a link to the thread:
http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-Dropping-Duplicates-td20884.html

On Sat, 11 Feb 2017 at 08:47, Sam Elamin <hussam.ela...@gmail.com> wrote:
> Hey Egor,
>
> You can use ForeachWriter or you can write a custom sink. I personally
> went with a custom sink, since I get a DataFrame per batch:
>
> https://github.com/samelamin/spark-bigquery/blob/master/src/main/scala/com/samelamin/spark/bigquery/streaming/BigQuerySink.scala
>
> You can have a look at how I implemented something similar to the file
> sink that, in the event of a failure, skips batches already written.
>
> Also have a look at Michael's reply to me a few days ago on exactly the
> same topic. The email subject was "Structured Streaming. Dropping
> Duplicates".
>
> Regards,
>
> Sam
>
> On Sat, 11 Feb 2017 at 07:59, Jacek Laskowski <ja...@japila.pl> wrote:
>
> "Something like that." I've never tried it out myself, so I'm only
> guessing, having had a brief look at the API.
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sat, Feb 11, 2017 at 1:31 AM, Egor Pahomov <pahomov.e...@gmail.com> wrote:
> > Jacek, so I create a cache in ForeachWriter, in every "process()" I
> > write to it, and on close I flush? Something like that?
> >
> > 2017-02-09 12:42 GMT-08:00 Jacek Laskowski <ja...@japila.pl>:
> >>
> >> Hi,
> >>
> >> Yes, that's ForeachWriter.
> >>
> >> Yes, it works element by element. You're looking for mapPartitions,
> >> and ForeachWriter has a partitionId that you could use to implement a
> >> similar thing.
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> ----
> >> https://medium.com/@jaceklaskowski/
> >> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >>
> >> On Thu, Feb 9, 2017 at 3:55 AM, Egor Pahomov <pahomov.e...@gmail.com> wrote:
> >> > Jacek, you mean
> >> > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter ?
> >> > I do not understand how to use it, since it passes every value
> >> > separately, not every partition. And adding to the table value by
> >> > value would not work.
> >> >
> >> > 2017-02-07 12:10 GMT-08:00 Jacek Laskowski <ja...@japila.pl>:
> >> >>
> >> >> Hi,
> >> >>
> >> >> Have you considered the foreach sink?
> >> >>
> >> >> Jacek
> >> >>
> >> >> On 6 Feb 2017 8:39 p.m., "Egor Pahomov" <pahomov.e...@gmail.com> wrote:
> >> >>>
> >> >>> Hi, I'm thinking of using Structured Streaming instead of the old
> >> >>> streaming, but I need to be able to save results to a Hive table.
> >> >>> The documentation for the file sink
> >> >>> (http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks)
> >> >>> says: "Supports writes to partitioned tables." But being able to
> >> >>> write to partitioned directories is not enough to write to the
> >> >>> table: someone needs to write to the Hive metastore. How can I use
> >> >>> Structured Streaming and write to a Hive table?
> >> >>>
> >> >>> --
> >> >>> Sincerely yours
> >> >>> Egor Pakhomov
> >> >
> >> >
> >> > --
> >> > Sincerely yours
> >> > Egor Pakhomov
> >
> >
> > --
> > Sincerely yours
> > Egor Pakhomov
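[Note: below is a minimal, untested sketch of the "create a cache in
ForeachWriter, write to it in process(), flush on close()" idea confirmed
above. BufferedHiveWriter, flushPartition() and the `input` streaming
DataFrame are placeholder names, not Spark APIs; replace the flush with
whatever bulk write fits your case, e.g. writing a file under the table's
partition directory and registering it with the Hive metastore via
ALTER TABLE ... ADD PARTITION.]

import org.apache.spark.sql.{ForeachWriter, Row}
import scala.collection.mutable.ArrayBuffer

// Sketch only: buffer rows per (partitionId, version) and write them in one go
// instead of value by value.
class BufferedHiveWriter extends ForeachWriter[Row] {
  private val buffer = ArrayBuffer.empty[Row]
  private var partitionId: Long = _
  private var version: Long = _

  override def open(partitionId: Long, version: Long): Boolean = {
    this.partitionId = partitionId
    this.version = version
    buffer.clear()
    true  // return false to skip a (partitionId, version) already written
  }

  override def process(value: Row): Unit = {
    buffer += value  // accumulate instead of writing each row immediately
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull == null) flushPartition(partitionId, version, buffer)
    buffer.clear()
  }

  // Placeholder, not a Spark API: do one bulk write here, e.g. write a file
  // into the table's partition directory and then add the partition to the
  // Hive metastore.
  private def flushPartition(partitionId: Long, version: Long, rows: Seq[Row]): Unit = {
    // your bulk-write logic here
  }
}

Wire it up with the foreach sink (assuming `input` is your streaming DataFrame):

val query = input.writeStream
  .foreach(new BufferedHiveWriter)
  .start()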