Re: unable to stream kafka messages
The exception is telling you precisely what is wrong. The Kafka source has a fixed schema of (topic, partition, offset, key, value, timestamp, timestampType). Nothing about those columns makes sense as a tweet by itself. You need to tell Spark how to get from bytes to a tweet; it doesn't know how you serialized the messages into Kafka.

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/unable-to-stream-kafka-messages-tp28537p29107.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
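For example, if the producer wrote each tweet as JSON into the message value, the bytes-to-columns step could be sketched as below. The bootstrap servers, topic name, and Tweet fields are assumptions for illustration, not details from this thread:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("tweets").getOrCreate()
import spark.implicits._

// Hypothetical tweet schema -- replace with whatever you serialized.
val tweetSchema = new StructType()
  .add("id", LongType)
  .add("user", StringType)
  .add("text", StringType)

val tweets = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumption
  .option("subscribe", "tweets")                       // assumption
  .load()
  .selectExpr("CAST(value AS STRING) AS json")         // bytes -> string
  .select(from_json($"json", tweetSchema).as("tweet")) // string -> struct
  .select("tweet.*")                                   // struct -> columns
```

If you serialized with Avro or some other format instead of JSON, the same idea applies: cast or map the value column through your own deserializer before treating the result as a tweet.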
Re: [Spark Structured Streaming]: truncated Parquet after driver crash or kill
The default spark.sql.streaming.commitProtocolClass is https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ManifestFileCommitProtocol.scala which may or may not be best suited for all needs. Code deploys could be improved by ensuring you shut down gracefully, e.g. by invoking StreamingQuery#stop. https://issues.apache.org/jira/browse/SPARK-21029 is probably of interest.

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Structured-Streaming-truncated-Parquet-after-driver-crash-or-kill-tp29043p29106.html
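A minimal sketch of stopping the query at deploy time rather than killing the driver mid-write, assuming `query` is your running StreamingQuery (see SPARK-21029 for the discussion of fully graceful stops):

```scala
// Registering a shutdown hook so a deploy (SIGTERM) stops the query
// through StreamingQuery#stop instead of tearing down the JVM while a
// file commit is in flight.
sys.addShutdownHook {
  if (query.isActive) {
    query.stop() // signals StreamExecution to stop the stream
  }
  spark.stop()   // then shut down the SparkSession itself
}
```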
Re: Structured Streaming: multiple sinks
1. Would it not be more natural to write the processed data to Kafka, and then sink it from Kafka to S3?
2a. addBatch is the time Sink#addBatch took, as measured by StreamExecution.
2b. getBatch is the time Source#getBatch took, as measured by StreamExecution.
3. triggerExecution is effectively the end-to-end processing time for the micro-batch. Note that all the other durations sum closely to triggerExecution; there is a little slippage from book-keeping activities in StreamExecution.

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-multiple-sinks-tp29056p29105.html
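You can see these durations per micro-batch from the query's progress reports. A small sketch, assuming `query` is an active StreamingQuery:

```scala
// Inspect the per-batch durations that StreamExecution records.
// lastProgress is null until the first trigger has completed.
val progress = query.lastProgress
if (progress != null) {
  val d = progress.durationMs // java.util.Map[String, java.lang.Long]
  println(s"getBatch:         ${d.get("getBatch")} ms")
  println(s"addBatch:         ${d.get("addBatch")} ms")
  println(s"triggerExecution: ${d.get("triggerExecution")} ms")
}
```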
Re: [Streaming][Structured Streaming] Understanding dynamic allocation in streaming jobs
You can leverage dynamic resource allocation with structured streaming. Certainly there's an argument that trivial jobs won't benefit, and that important jobs should have fixed resources for stable end-to-end latency. A few scenarios come to mind where there are benefits:
- I want my application to automatically leverage more resources if my environment changes, e.g. Kafka topic partitions were increased at runtime.
- I am not building a toy application, and my driver is managing many streaming queries with fair scheduling enabled, where not every streaming query has strict latency requirements.
- My source's underlying RDD representing the DataFrame provided by getBatch is volatile, e.g. the number of partitions varies batch to batch.

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Structured-Streaming-Understanding-dynamic-allocation-in-streaming-jobs-tp29091p29104.html
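A minimal sketch of turning dynamic allocation on for a streaming driver; the executor bounds here are illustrative, not recommendations, and an external shuffle service is generally required alongside it:

```scala
import org.apache.spark.sql.SparkSession

// Dynamic allocation lets the cluster manager grow/shrink executors
// between micro-batches as the work per batch changes.
val spark = SparkSession.builder
  .appName("streaming-with-dynamic-allocation")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")  // illustrative
  .config("spark.dynamicAllocation.maxExecutors", "20") // illustrative
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()
```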