Re: unable to stream kafka messages

2017-08-24 Thread cbowden
The exception is telling you precisely what is wrong. The Kafka source has a
schema of (topic, partition, offset, key, value, timestamp, timestampType).
Nothing about those columns makes sense as a tweet. You need to tell Spark
how to get from those bytes to a tweet; it doesn't know how you serialized
the messages into Kafka.
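For example, if the tweets were written to Kafka as JSON, a minimal sketch of
the deserialization step might look like the following (the topic name,
bootstrap servers, and tweet schema here are all illustrative assumptions;
match them to however you actually produced the messages):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

object TweetStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("tweets").getOrCreate()

    // Hypothetical tweet schema; define it to mirror your own serialization.
    val tweetSchema = new StructType()
      .add("id", StringType)
      .add("text", StringType)
      .add("createdAt", TimestampType)

    val tweets = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "tweets")
      .load()
      // The Kafka source exposes a binary value column; cast and parse it
      // explicitly to get from bytes to typed tweet columns.
      .selectExpr("CAST(value AS STRING) AS json")
      .select(from_json(col("json"), tweetSchema).as("tweet"))
      .select("tweet.*")

    tweets.writeStream.format("console").start().awaitTermination()
  }
}
```

The key point is the explicit cast-and-parse: the (topic, partition, offset,
key, value, ...) columns from the source are replaced by the columns of your
own schema before any tweet-specific logic runs.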



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/unable-to-stream-kafka-messages-tp28537p29107.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark Structured Streaming]: truncated Parquet after driver crash or kill

2017-08-24 Thread cbowden
The default spark.sql.streaming.commitProtocolClass is
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ManifestFileCommitProtocol.scala
which may not be best suited for all needs.

Code deploys could be made safer by ensuring you shut down gracefully, e.g.
by invoking StreamingQuery#stop.
https://issues.apache.org/jira/browse/SPARK-21029 is probably of interest.
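One way to wire up a graceful stop is a JVM shutdown hook, so a SIGTERM from a
deploy stops the query cleanly rather than killing the driver mid-commit. A
minimal sketch (the rate source and the /tmp paths are placeholders, not a
recommendation):

```scala
import org.apache.spark.sql.SparkSession

object GracefulStop {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("graceful").getOrCreate()

    val query = spark.readStream
      .format("rate") // stand-in source for the sketch
      .load()
      .writeStream
      .format("parquet")
      .option("path", "/tmp/out")
      .option("checkpointLocation", "/tmp/ckpt")
      .start()

    // On JVM shutdown (e.g. SIGTERM during a code deploy), stop the query
    // so the current batch commit is not interrupted.
    sys.addShutdownHook {
      query.stop()
    }

    query.awaitTermination()
  }
}
```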



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Structured-Streaming-truncated-Parquet-after-driver-crash-or-kill-tp29043p29106.html



Re: Structured Streaming: multiple sinks

2017-08-24 Thread cbowden
1. Would it not be more natural to write processed data to Kafka, and sink
the processed data from Kafka to S3?
2a. addBatch is the time Sink#addBatch took, as measured by StreamExecution.
2b. getBatch is the time Source#getBatch took, as measured by
StreamExecution.
3. triggerExecution is effectively the end-to-end processing time for the
micro-batch; note that all the other durations sum closely to
triggerExecution, with a little slippage from book-keeping activities in
StreamExecution.
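These durations surface under durationMs in the StreamingQueryProgress
returned by StreamingQuery#lastProgress. An illustrative fragment (the
millisecond values are made up; the field names are the ones discussed
above):

```json
{
  "durationMs": {
    "addBatch": 1234,
    "getBatch": 12,
    "getOffset": 5,
    "queryPlanning": 45,
    "walCommit": 20,
    "triggerExecution": 1350
  }
}
```

Here addBatch dominates, and the smaller durations plus addBatch account for
nearly all of triggerExecution, matching point 3.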



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-multiple-sinks-tp29056p29105.html



Re: [Streaming][Structured Streaming] Understanding dynamic allocation in streaming jobs

2017-08-24 Thread cbowden
You can leverage dynamic resource allocation with structured streaming.
Certainly there's an argument that trivial jobs won't benefit. Certainly
there's an argument that important jobs should have fixed resources for
stable end-to-end latency.

A few scenarios with benefits come to mind:
- I want my application to automatically leverage more resources if my
environment changes, e.g. Kafka topic partitions were increased at runtime.
- I am not building a toy application, and my driver is managing many
streaming queries with fair scheduling enabled, where not every streaming
query has strict latency requirements.
- My source's underlying RDD representing the dataframe provided by getBatch
is volatile, e.g. the number of partitions varies batch to batch.
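For reference, enabling it is a matter of configuration at submit time; an
illustrative sketch (the executor bounds and jar name are placeholders, and
the external shuffle service must be running on the cluster nodes):

```shell
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  my-streaming-app.jar
```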






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Structured-Streaming-Understanding-dynamic-allocation-in-streaming-jobs-tp29091p29104.html