Recommendation of using StreamSinkProvider for a custom KairosDB Sink

2018-06-25 Thread subramgr
We are using Spark 2.3 and would like to know whether it is recommended to create a custom KairosDBSink by implementing StreamSinkProvider. The interface is marked experimental and unstable. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
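
A minimal sketch of what that could look like against Spark 2.3's v1 sink API (not runnable standalone; it needs spark-sql 2.3 on the classpath). Only the two Spark interfaces and their signatures are real; the `kairos.url` parameter name and the omitted KairosDB client call are assumptions. Note that the API was indeed `@Experimental` in 2.3, so it could change between releases.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

// Provider: Spark instantiates this by class name from .format(...)
class KairosDBSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink =
    new KairosDBSink(parameters.getOrElse("kairos.url",
      throw new IllegalArgumentException("kairos.url is required")))
}

// Sink: addBatch receives each micro-batch as a DataFrame, with a batchId
// that can be used for exactly-once bookkeeping on the KairosDB side.
class KairosDBSink(url: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // collect() is only reasonable for small micro-batches;
    // otherwise push rows from the executors via foreachPartition.
    data.collect().foreach { row =>
      // send `row` to KairosDB at `url` (client code omitted)
    }
  }
}
```

The provider would then be referenced by its fully qualified class name, e.g. `df.writeStream.format("com.example.KairosDBSinkProvider").start()`, or by a short name if `DataSourceRegister` is also implemented.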

Re: Emit Custom metrics in Spark Structured Streaming job

2018-06-26 Thread subramgr
I am planning to send these metrics to our KairosDB. Let me know if there are any examples that I can take a look at.

Emit Custom metrics in Spark Structured Streaming job

2018-06-26 Thread subramgr
In our Spark Structured Streaming job we listen to Kafka and filter out messages that we believe are malformed. Currently we log that information using the LOGGER. Is there a way to emit some kind of metric each time such a malformed message is seen in Structured Streaming? Thanks
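
One common approach is to increment a counter inside the filter itself, for example a Spark `LongAccumulator` registered via `spark.sparkContext.longAccumulator("malformedMessages")`. The counting logic is sketched here Spark-free, with a plain thread-safe counter; `isWellFormed` is a hypothetical validity check, since the real message schema is not shown in the thread:

```scala
import java.util.concurrent.atomic.LongAdder

// Thread-safe stand-in for a Spark LongAccumulator.
final class MalformedCounter {
  private val adder = new LongAdder
  def increment(): Unit = adder.increment()
  def count: Long = adder.sum()
}

// Hypothetical check: the real schema is not shown in the thread.
def isWellFormed(msg: String): Boolean =
  msg.nonEmpty && msg.contains("|")

// Keep well-formed messages; count each malformed one as it is dropped.
def filterAndCount(msgs: Seq[String], c: MalformedCounter): Seq[String] =
  msgs.filter { m =>
    val ok = isWellFormed(m)
    if (!ok) c.increment()
    ok
  }
```

In the streaming job the same pattern would sit inside the `filter` on the Kafka `Dataset`, with the accumulator value visible in the Spark UI or scraped for an external sink.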

[Structured Streaming] Metrics or logs of events that are ignored due to watermark

2018-07-02 Thread subramgr
Hi all, Do we have some logs or metrics, recorded in log files or sent to some metrics sink, about the number of events that are ignored due to the watermark in Structured Streaming? Thanks

Spark 2.3.0 and Custom Sink

2018-06-21 Thread subramgr
Hi Spark mailing list, We are looking to push the output of a Structured Streaming query to KairosDB (a time series database). What would be the recommended way of doing this? Do we implement the *Sink* trait or do we use *ForeachWriter*? At each trigger point if I do a
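
For comparison with the *Sink* route, a sketch of the *ForeachWriter* alternative (not runnable standalone; it needs spark-sql on the classpath). The three callbacks and their Spark 2.3 signatures are real; the KairosDB connection details are left as comments because no client library is named in the thread:

```scala
import org.apache.spark.sql.{ForeachWriter, Row}

// One writer instance runs per partition per trigger.
class KairosForeachWriter(url: String) extends ForeachWriter[Row] {

  override def open(partitionId: Long, version: Long): Boolean = {
    // Open one connection to KairosDB here; return false to skip
    // this partition/version (e.g. if it was already written).
    true
  }

  override def process(value: Row): Unit = {
    // Push one row as a KairosDB data point (client code omitted).
  }

  override def close(errorOrNull: Throwable): Unit = {
    // Flush and close the connection; errorOrNull is non-null on failure.
  }
}
```

Usage would be `df.writeStream.foreach(new KairosForeachWriter("http://kairos:8080")).start()`. The trade-off: *ForeachWriter* is a stable public API and works row by row, while the *Sink* trait hands you the whole micro-batch as a DataFrame plus a `batchId` for deduplication, but was experimental in 2.3.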

[Structured Streaming] Understanding watermark, flatMapGroupsWithState and possibly windowing

2018-08-08 Thread subramgr
Hi, We have a use case where we need to *sessionize* our data and, for each *session*, emit some *metrics*; we also need to handle *repeated sessions* and *session timeouts*. We have come up with the following code structure and would like to understand whether we understand all the concepts of *watermark*,
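
The per-key update that `flatMapGroupsWithState` would invoke can be sketched as a pure function. `SessionState`, `SessionMetrics`, and the restart-on-timeout behavior are assumptions about the use case, not code from the thread; in the real job this logic would live in the function passed to `groupByKey(...).flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.EventTimeTimeout())`, with `GroupState` carrying the state between triggers:

```scala
final case class SessionState(eventCount: Int, lastSeenMs: Long)
final case class SessionMetrics(sessionId: String, eventCount: Int)

// Returns (new state, optionally-emitted metrics for a closed session).
def updateSession(
    sessionId: String,
    state: Option[SessionState],
    newEventTimesMs: Seq[Long],
    timeoutMs: Long,
    nowMs: Long): (Option[SessionState], Option[SessionMetrics]) = {
  val timedOut = state.exists(s => nowMs - s.lastSeenMs > timeoutMs)
  // A timed-out session emits its metrics and restarts (repeated session).
  val emitted = if (timedOut) state.map(s => SessionMetrics(sessionId, s.eventCount)) else None
  val base    = if (timedOut) None else state
  val next =
    if (newEventTimesMs.isEmpty) base
    else {
      val prev = base.getOrElse(SessionState(0, 0L))
      Some(SessionState(prev.eventCount + newEventTimesMs.size, newEventTimesMs.max))
    }
  (next, emitted)
}
```

Keeping the update logic pure like this also makes the timeout behavior unit-testable without a Spark cluster.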

[Structured Streaming] Two watermarks and StreamingQueryListener

2018-08-09 Thread subramgr
Hi, We have two *flatMapGroupsWithState* calls in our job and two *withWatermark* calls. We are getting the event max time, event time, and watermark from *QueryProgressEvent*. Right now it returns just one *watermark* value. Are two watermarks maintained by Spark, or just one? If one, which one

Re: Number of records per micro-batch in DStream vs Structured Streaming

2018-07-08 Thread subramgr
Anyone?

[Structured Streaming] Reading Checkpoint data

2018-07-09 Thread subramgr
Hi, I read somewhere that with Structured Streaming all the checkpoint data is more readable (JSON-like). Is there any documentation on how to read the checkpoint data? If I do `hadoop fs -ls` on the `state` directory I get some encoded data. Thanks, Girish

[Structured Streaming] User Defined Aggregation Function

2018-07-09 Thread subramgr
Hi, I am trying to explore how I can use a UDAF for my use case. I have something like this in my Structured Streaming job: val counts: Dataset[(String, Double)] = events .withWatermark("timestamp", "30 minutes") .groupByKey(e => e._2.siteIdentifier + "|" +
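
The truncated snippet suggests a typed `groupByKey`, and with the typed API the aggregation is usually an `org.apache.spark.sql.expressions.Aggregator` rather than a classic UDAF. Since the message cuts off before the metric is defined, here is the aggregation semantics as a plain Scala function; the per-composite-key average is an assumption:

```scala
// What the typed Aggregator would compute per key: an average of the
// Double values grouped by the "siteIdentifier|..." composite key.
def averageByKey(pairs: Seq[(String, Double)]): Map[String, Double] =
  pairs.groupBy(_._1).map { case (key, vs) =>
    key -> vs.map(_._2).sum / vs.size
  }
```

In the job itself, the same computation would be expressed as an `Aggregator[IN, BUF, OUT]` with `zero`, `reduce`, `merge`, and `finish`, applied via `groupByKey(...).agg(myAggregator.toColumn)`.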

[Structured Streaming] Last processed event time always behind Direct Streaming

2018-07-09 Thread subramgr
Hi, We are migrating our Direct Streaming Spark job to Structured Streaming. We have a batch size of 1 minute. I am consistently seeing that the Structured Streaming job is always 3-5 minutes behind the Direct Streaming job. Is there some kind of fine tuning that will help Structured Streaming

Re: [Structured Streaming] Reading Checkpoint data

2018-07-09 Thread subramgr
Thanks

Re: Emit Custom metrics in Spark Structured Streaming job

2018-07-10 Thread subramgr
Hi, I am still working on implementing my idea, but here is how it goes: 1. Use this library: https://github.com/groupon/spark-metrics 2. Have a cron job which periodically curls the /metrics/json endpoint on the driver and all other nodes 3. Parse the response and send the data through a telegraf agent
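
Step 3 could be rough-sketched like this. The regex only handles the flat `"name": {"count": N}` shape that the Dropwizard-style JSON servlet emits for counters; a real implementation should use a proper JSON library instead:

```scala
// Matches `"metric.name": {"count": 42` fragments in the /metrics/json payload.
val counterPattern = """"([^"]+)"\s*:\s*\{\s*"count"\s*:\s*(\d+)""".r

// Extract counter name -> value pairs from the raw JSON response body.
def parseCounts(json: String): Map[String, Long] =
  counterPattern
    .findAllMatchIn(json)
    .map(m => m.group(1) -> m.group(2).toLong)
    .toMap
```

The resulting map could then be reshaped into whatever line protocol the telegraf agent expects.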

[Structured Streaming] Custom StateStoreProvider

2018-07-10 Thread subramgr
Hi, Currently we are using HDFS for our checkpointing, but we are having issues maintaining an HDFS cluster. We tried glusterfs in the past for checkpointing, but in our setup glusterfs does not work well. We are evaluating Cassandra for storing the checkpoint data. Has anyone implemented
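
For reference, Spark 2.3 does let you swap the state store implementation via configuration. A config fragment, where the Cassandra-backed class name is hypothetical (you would have to implement `org.apache.spark.sql.execution.streaming.state.StateStoreProvider` yourself; the default is `HDFSBackedStateStoreProvider`):

```scala
// Hypothetical provider class; Spark loads it by name for every stateful operator.
spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "com.example.state.CassandraStateStoreProvider")
```

Note this covers the operator *state* store; the offset/commit logs of the checkpoint location are written separately by the query itself.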

Re: [Structured Streaming] Custom StateStoreProvider

2018-07-10 Thread subramgr
Hi, This looks like a very daunting *trait*. Is there a blog post or article which explains how to implement this *trait*? Thanks, Girish

[Structured Streaming] Fine tuning GC performance

2018-07-10 Thread subramgr
Hi, Are there any specific methods to fine-tune our Structured Streaming job? Or is it similar to the ones mentioned here for RDDs: https://spark.apache.org/docs/latest/tuning.html

Re: [Structured Streaming] Metrics or logs of events that are ignored due to watermark

2018-07-03 Thread subramgr
Thanks

Number of records per micro-batch in DStream vs Structured Streaming

2018-07-03 Thread subramgr
Hi, We have 2 Spark streaming jobs, one using DStreams and the other using Structured Streaming. I have observed that the number of records per micro-batch (per trigger in the case of Structured Streaming) is not the same between the two jobs. The Structured Streaming job has higher numbers compared

[Spark Structured Streaming] Running out of disk quota due to /work/tmp

2018-10-11 Thread subramgr
We have a Spark Structured Streaming job which runs out of disk quota after some days. The primary reason is that a bunch of empty folders are getting created in the /work/tmp directory. Any idea how to prune them?

[Spark Structured Streaming] Dynamically changing maxOffsetsPerTrigger

2018-12-04 Thread subramgr
Is there a way to dynamically change the value of *maxOffsetsPerTrigger*?
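
As far as I know there is no supported way to change it on a running query in Spark 2.x: the option is read when the query starts, so the usual workaround is to stop the query and restart it from the same checkpoint with a new value. A config fragment with placeholder broker and topic names:

```scala
// maxOffsetsPerTrigger caps how many Kafka records each micro-batch reads;
// it is fixed for the lifetime of the query, so changing it means restarting
// the query (checkpointed offsets are preserved across the restart).
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", "events")                     // placeholder
  .option("maxOffsetsPerTrigger", "100000")
  .load()
```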

[Spark Structured Streaming] Metrics for latency or performance of checkpointing

2019-02-23 Thread subramgr
Hi all, We are using Spark 2.3 and we are facing a weird issue. The streaming job works perfectly fine in one environment, but in another environment it does not. We feel that in the low-performing environment the checkpointing of the state is not as performant as in the other. Is there any