Hi,
I have started using Spark Structured Streaming for reading data from Kafka,
and the job is very slow. The number of output rows keeps increasing in query 0
and the job is running forever. Any suggestions for this, please?
[image: image.png]
Thanks,
Asmath
Hi,
Can you share the code?
You have to call *writeStream* on a streaming Dataset/DataFrame.
On Fri, 16 Oct 2020 at 02:14, Krishnanand Khambadkone
wrote:
> Hi, I am trying to write to mysql from a spark structured streaming kafka
> source. Using spark 2.4.0. I get this exc
Hi, I am trying to write to mysql from a spark structured streaming kafka
source. Using spark 2.4.0. I get this exception,
AnalysisException: u"'write' can not be called on streaming Dataset/DataFrame
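A minimal sketch of the usual workaround, for reference: wrap the batch JDBC writer in foreachBatch (available since 2.4.0), so write is only ever called on the per-micro-batch, non-streaming DataFrame. Broker, topic, and MySQL coordinates below are placeholders:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("KafkaToMysql").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                    // placeholder
  .load()

// foreachBatch hands each micro-batch over as a plain DataFrame,
// on which the batch-only JDBC writer is allowed.
stream.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write
      .format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/mydb") // placeholder
      .option("dbtable", "events")
      .mode("append")
      .save()
  }
  .option("checkpointLocation", "/tmp/ckpt")
  .start()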
I can't spend too much time on explaining one by one. I strongly encourage
you to do a deep-dive instead of just looking around as you want to know
about "details" - that's how open source works.
I'll go through a general explanation instead of replying inline; probably
I'd write a blog doc if the
Hi Jungtaek,
*> I meant the subdirectory inside the directory you're providing as
"checkpointLocation", as there're several directories in that directory...*
There are two:
*my-spark-checkpoint-dir/MainApp*
created by sparkSession.sparkContext().setCheckpointDir()
contains only empty subdir wit
roposed under SPARK-30294 [1].
>>
>> 2) Spark leverages the HDFS API, which is configured to create a crc file per
>> file by default. (So you'll have 2x the files you expected.) There's a bug in
>> the HDFS API (HADOOP-16255 [2]) which missed handling crc files during rename
&g
round (SPARK-28025 [3]) Spark tries to
> delete the crc file, which means two additional operations (exist -> delete) may
> occur per crc file.
>
> Hope this helps.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-30294
> 2. https://issues.apache.org/jira/browse/HADOOP-16255
Jungtaek Lim (HeartSaVioR)
1. https://issues.apache.org/jira/browse/SPARK-30294
2. https://issues.apache.org/jira/browse/HADOOP-16255
3. https://issues.apache.org/jira/browse/SPARK-28025
On Sun, Oct 4, 2020 at 10:08 PM Sergey Oboguev wrote:
> I am trying to run a Spark structured streaming program simulating bas
I am trying to run a Spark structured streaming program simulating basic
scenario of ingesting events and calculating aggregates on a window with
watermark, and I am observing an inordinate amount of disk IO Spark
performs.
The basic structure of the program is like this:
sparkSession
As per the solution, if we are closing and starting the query, then what
happens to the state which is maintained in memory? Will that be
retained?
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
The file stream sink doesn't support this functionality. There are several
approaches to work around it:
1) two queries write to Kafka (or any intermediate storage which allows
concurrent writes), and let next Spark application read and write to the
final path
2) two queries write to two different directories, a
Hi,
I am looking for some information on how to gracefully kill a Spark
Structured Streaming Kafka job and redeploy it.
How do I kill a Spark Structured Streaming job in YARN?
Any suggestions on how to kill it gracefully?
I was able to monitor the job from the SQL tab, but how can I kill this job when
deployed
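One common pattern (not a Spark built-in) is to poll for an external marker and call StreamingQuery.stop(), letting the checkpoint handle the resume after redeployment. A sketch, with placeholder paths and a rate source standing in for Kafka:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("gracefulStop").getOrCreate()
val df = spark.readStream.format("rate").load() // stand-in for the Kafka source

val query = df.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/ckpt") // resume point for the redeploy
  .start()

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val marker = new Path("/tmp/stop-this-job") // placeholder marker file
while (query.isActive) {
  if (fs.exists(marker)) query.stop() // stops the query; the checkpoint stays intact
  query.awaitTermination(10000)       // re-check every 10 seconds
}

On YARN, yarn application -kill <appId> also works, but it gives the query no chance to finish the in-flight batch.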
Hi,
I have two Spark Structured Streaming queries writing to the same output path in
object storage.
Once in a while I am getting the "IllegalStateException: Race while writing
batch 4".
I found that this error is because there are two writers writing to the
output path. The file streaming sink doesn't su
dummy record to move watermark
>> forward.
>>
>> Hope this helps.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Mon, Jul 27, 2020 at 8:10 PM Phillip Henry
>> wrote:
>>
>>> Sorry, should have mentioned that Spark only seems reluctan
should have mentioned that Spark only seems reluctant to take the
>> last windowed, groupBy batch from Kafka when using OutputMode.Append.
>>
>> I've asked on StackOverflow:
>>
>> https://stackoverflow.com/questions/62915922/spark-structured-streaming-wont-pull-the-final-batch-from-kafka
Thanks,
Jungtaek Lim (HeartSaVioR)
On Mon, Jul 27, 2020 at 8:10 PM Phillip Henry
wrote:
> Sorry, should have mentioned that Spark only seems reluctant to take the
> last windowed, groupBy batch from Kafka when using OutputMode.Append.
>
> I've asked on StackOverflow:
>
>
Sorry, should have mentioned that Spark only seems reluctant to take the
last windowed, groupBy batch from Kafka when using OutputMode.Append.
I've asked on StackOverflow:
https://stackoverflow.com/questions/62915922/spark-structured-streaming-wont-pull-the-final-batch-from-kafka
but am
We are using Spark Structured Streaming to join two data streams. We use
Kafka to collect the data, reading from the earliest offsets (the sender
sends data cyclically, only one message at a time).
The following are our Kafka configuration parameters:
def
Can you try calling batchDF.unpersist() once the work is done in the loop?
On Mon, Jul 20, 2020 at 3:38 PM Yong Yuan wrote:
> It seems the following structured streaming code keeps consuming the
> usercache until all disk space is occupied.
>
> val monitoring_stream =
> monitoring_df.writeSt
It seems the following structured streaming code keeps consuming the
usercache until all disk space is occupied.
val monitoring_stream =
monitoring_df.writeStream
.trigger(Trigger.ProcessingTime("120 seconds"))
.foreachBatch {
(batchDF: DataFrame, b
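A sketch of how the unpersist suggestion slots into this loop; the code around the archived fragment is reconstructed, so treat the shape (not the names) as the point:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

// monitoring_df is the streaming DataFrame from the original post.
val monitoring_stream = monitoring_df.writeStream
  .trigger(Trigger.ProcessingTime("120 seconds"))
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.persist()
    batchDF.count()     // stand-in for the real per-batch work
    batchDF.unpersist() // frees the blocks that keep filling usercache
  }
  .start()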
Trigger = Once.
Regards,
..Piyush
On Sun, Jul 19, 2020 at 11:01 PM anbutech wrote:
> Hi Team,
>
> I'm very new to Spark Structured Streaming. Could you please guide me on how
> to schedule/orchestrate a Spark Structured Streaming job? Any scheduler
> similar to Airflow? I knew air
Please try the maxBytesPerTrigger option; the files are probably big enough to
crash the JVM.
Please give some info on the executors and the files (size etc.).
Regards,
..Piyush
On Sun, Jul 19, 2020 at 3:29 PM Rachana Srivastava
wrote:
> *Issue:* I am trying to process 5000+ files of gzipped json file
Hi Team,
I'm very new to Spark Structured Streaming. Could you please guide me on how to
schedule/orchestrate a Spark Structured Streaming job? Any scheduler similar
to Airflow? I knew Airflow doesn't support streaming jobs.
Thanks
Anbu
--
Sent from: http://apache-spark-user-list.
Can you reduce maxFilesPerTrigger further and see if the OOM still persists;
if it does, then the problem may be somewhere else.
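For reference, this is where that knob sits on the file source (sketch, assuming an existing SparkSession spark; the path is a placeholder):

val input = spark.readStream
  .format("text")
  .option("maxFilesPerTrigger", 50) // lower until a micro-batch fits in memory
  .load("s3a://bucket/prefix/")     // placeholder path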
> On Jul 19, 2020, at 5:37 AM, Jungtaek Lim
> wrote:
>
> Please provide logs and dump file for the OOM case - otherwise no one could
> say what's the cause.
>
> Add
Please provide the logs and a dump file for the OOM case - otherwise no one can
say what the cause is.
Add JVM options to driver/executor => -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath="...dir..."
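A sketch of wiring those options in (the dump directory is a placeholder; note the driver-side option has to be set via spark-submit or spark-defaults.conf, since the driver JVM is already up by the time this code runs):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("oom-diagnosis")
  // picked up by executor JVMs launched after this point
  .config("spark.executor.extraJavaOptions",
    "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dumps") // placeholder dir
  .getOrCreate()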
On Sun, Jul 19, 2020 at 6:56 PM Rachana Srivastava
wrote:
> *Issue:* I am trying to process 5000+
Issue: I am trying to process 5000+ gzipped JSON files periodically
from S3 using Structured Streaming code.
Here are the key steps:
- Read the JSON schema and broadcast it to the executors
- Read the stream
Dataset inputDS = sparkSession.readStream() .format("text")
.option("inf
It depends a bit on the data as well, but have you investigated in the Spark UI
which executor/task becomes slow?
Could it also be the database from which you load the data?
> Am 18.07.2020 um 17:00 schrieb Yong Yuan :
>
>
> The spark job has the correct functions and logic. However, after several
The Spark job has the correct functions and logic. However, after several
hours of running, it becomes slower and slower. Are there some pitfalls in the
below code? Thanks!
val query = "(select * from meta_table) as meta_data"
val meta_schema = new StructType()
.add("config_id", BooleanType)
Hi, folks.
I noticed that SSS won't process a waiting batch if there are no batches
after that. To put it another way, Spark must always leave one batch on
Kafka waiting to be consumed.
There is a JIRA for this at:
https://issues.apache.org/jira/browse/SPARK-24156
that says it's resolved in 2.4
In SS, checkpointing is now part of running a micro-batch and it's
supported natively. (To be clear, my library doesn't deal with the native
behavior of checkpointing.)
In other words, it can't be customized like you have been doing with your
database. You probably don't need to do it with SS, but
Thanks Lim, this is really helpful. I have a few questions.
Our earlier approach used a low-level consumer to read offsets from a database
and used that information to read with Spark Streaming DStreams, saving
the offsets back once the process finished. This way we never lost data.
with your libr
There're sections in SS programming guide which exactly answer these
questions:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#managing-streaming-queries
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries
A
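For quick programmatic checks, the StreamingQuery handle exposes the same information those sections describe (sketch, assuming an existing streaming DataFrame df and SparkSession spark):

val query = df.writeStream.format("console").start()

println(query.status)       // whether the trigger is active / waiting for data
println(query.lastProgress) // per-batch rates, durations, state-store metrics
spark.streams.active.foreach(q => println(q.name)) // all queries in this session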
In 3.0 the community just added it.
On Sun, 5 Jul 2020, 14:28 KhajaAsmath Mohammed,
wrote:
> Hi,
>
> We are trying to move our existing code from Spark DStreams to Structured
> Streaming for one of the old applications which we built a few years ago.
>
> A Structured Streaming job doesn't have stream
Hi,
We are trying to move our existing code from Spark DStreams to Structured
Streaming for one of the old applications which we built a few years ago.
A Structured Streaming job doesn't have a Streaming tab in the Spark UI. Is
there a way to monitor a job submitted via Structured Streaming?
Since Structured Streaming basically follows SQL semantics, there is no such
notion as a "max allowance of failures". If you'd like to
tolerate malformed data, read with the raw format (string or binary),
which won't fail on such data, and try converting; e.g. from_json() will
produce null
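A sketch of that suggestion; broker, topic, and the payload schema are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("tolerant-parse").getOrCreate()
import spark.implicits._

val schema = new StructType().add("id", LongType).add("name", StringType) // placeholder

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                    // placeholder
  .load()
  .selectExpr("CAST(value AS STRING) AS raw")   // raw read: never fails on bad payloads
  .select(from_json($"raw", schema).as("data")) // malformed rows become null
  .filter($"data".isNotNull)                    // drop (or route aside) the bad rows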
Currently my job fails even on a single failure. In other words, even if
one incoming message is malformed the job fails. I believe there's a
property that allows us to set an acceptable number of failures. I Googled
but couldn't find the answer. Can someone please help? Thanks.
started working. Hopefully this will help someone. Thanks.
>
> On Fri, Jun 26, 2020 at 2:12 PM Something Something <
> mailinglist...@gmail.com> wrote:
>
>> My Spark Structured Streaming job works fine when I set "startingOffsets"
>> to "latest". Whe
My apologies... After I set the 'maxOffsetsPerTrigger' to a value such as
'20' it started working. Hopefully this will help someone. Thanks.
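For anyone landing here, the option goes on the Kafka reader (sketch, assuming an existing SparkSession spark; broker and topic are placeholders):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                    // placeholder
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", 20) // caps records per micro-batch
  .load()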
On Fri, Jun 26, 2020 at 2:12 PM Something Something <
mailinglist...@gmail.com> wrote:
> My Spark Structured Stream
My Spark Structured Streaming job works fine when I set "startingOffsets"
to "latest". When I simply change it to "earliest" & specify a new "check
point directory", the job doesn't work. The states don't get timed out
after 10 minutes.
Whi
Hi Rachana,
> Should I go backward and use Spark Streaming DStream based.
No. Never. It's no longer supported (and should really be removed from the
codebase once and for all - dreaming...).
Spark focuses on Spark SQL and Spark Structured Streaming as user-facing
modules for batch and s
Structured Streaming vs Spark Streaming (DStream)?
Which is recommended for system stability? Exactly once is NOT the first priority.
The first priority is a STABLE system.
I need to make a decision soon. I need help. Here is the question again.
Should I go backward and use Spark Streaming DStream b
Frankly speaking, I do not care about EXACTLY ONCE... I am OK with AT LEAST ONCE
as long as the system does not fail every 5 to 7 days with no recovery option.
On Wednesday, June 17, 2020, 02:31:50 PM PDT, Rachana Srivastava
wrote:
Thanks so much TD. Thanks for forwarding your datalake pro
Thanks so much TD. Thanks for forwarding your datalake project, but at this
time we have budget constraints; we can only use open-source projects.
I just want the Structured Streaming application or Spark Streaming DStream
application to run without an issue for a long time. I do not want the
Kafka-connect (https://docs.confluent.io/current/connect/index.html) may
be an easier solution for this use case of just dumping kafka topics.
On 17/06/2020 18:02, Jungtaek Lim wrote:
Just in case if anyone prefers ASF projects then there are other
alternative projects in ASF as well, alphabeti
Just in case if anyone prefers ASF projects then there are other
alternative projects in ASF as well, alphabetically, Apache Hudi [1] and
Apache Iceberg [2]. Both recently graduated as top-level projects.
(DISCLAIMER: I'm not involved in both.)
BTW it would be nice if we make the metadata impl
Hello Rachana,
Getting exactly-once semantics on files and making it scale to a very large
number of files are very hard problems to solve. While Structured Streaming
+ built-in file sink solves the exactly-once guarantee that DStreams could
not, it is definitely limited in other ways (scaling in
so big that we
start getting OOM errors. I want to get rid of metadata and checkpoint
folders of Spark Structured Streaming and manage offsets myself.
How we managed offsets in Spark Streaming: I have used val offsetRanges =
rdd.asInstanceOf[HasOffsetRanges].offsetRanges to get offsets in Spark
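The DStream-era pattern being described looks roughly like this (sketch only; stream stands for a direct Kafka DStream, and the persistence call is a hypothetical helper for whatever the database layer provides):

import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

// Assumes `stream` is a direct Kafka DStream (KafkaUtils.createDirectStream).
stream.foreachRDD { rdd =>
  val offsetRanges: Array[OffsetRange] =
    rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the RDD, then persist the offsets ...
  // saveOffsetsToDatabase(offsetRanges) // hypothetical helper, commit only after success
}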
I have written a simple Spark Structured Streaming app to move data from Kafka
to S3. I found that in order to support the exactly-once guarantee, Spark creates
a _spark_metadata folder, which ends up growing too large as the streaming app is
SUPPOSED TO run FOREVER. But when the streaming app runs for a l
Does stateful Structured Streaming work on a standalone Spark cluster with a
few nodes? Does it need HDFS? If not, how do I get it working without HDFS?
Regards
Srini
ok, thanks for confirming, I will do it this way.
Regards
Srini
On Tue, Jun 9, 2020 at 11:31 PM Gerard Maas wrote:
> Hi Srinivas,
>
> Reading from different brokers is possible but you need to connect to each
> Kafka cluster separately.
> Trying to mix connections to two different Kafka cluster
Hi Srinivas,
Reading from different brokers is possible but you need to connect to each
Kafka cluster separately.
Trying to mix connections to two different Kafka clusters in one subscriber
is not supported. (I'm sure that it would give all kinds of weird errors.)
The "kafka.bootstrap.servers" opti
Thanks for the quick reply. This may work, but I have around 5 topics to
listen to right now. I am trying to keep all topics in an array in a
properties file and read them all at once. This way it is dynamic: you have
one code block like below, and you may add or delete topics from
the config f
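A sketch of that dynamic setup; subscribe accepts a comma-separated list, but only for topics on a single cluster (assuming an existing SparkSession spark; names are placeholders):

val topics: Seq[String] = Seq("topic1", "topic2", "topic3") // e.g. loaded from a properties file

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // one cluster only
  .option("subscribe", topics.mkString(","))        // comma-separated topic list
  .load()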
Hello,
I've never tried that; doesn't this work?
val df_cluster1 = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port")
  .option("subscribe", "topic1")
  .load()
val df_cluster2 = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "cluster2_host:cluster2_port")
  .option("subscribe", "topic2")
  .load() // the archived snippet was cut off above; the cluster2 tail and the .load() calls are completed from the cluster1 pattern
Hello,
In Structured Streaming, is it possible to have one Spark application with
one query consuming topics from multiple Kafka clusters?
I am trying to consume two topics, each from a different Kafka cluster, but it
reports one of the topics as an unknown topic and the job keeps running
without com
format(*"kafka"*)
>>
>> .option(
>>
>> *"kafka.bootstrap.servers"*,
>>
>> conig.outputBootstrapServer
>>
>> )
>>
>> .option(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, *"1000"*)
>>
>> -
>>
>>
>>
>> But this is not working. Am I setting this correctly? Is there a
>> different way to set this property under Spark Structured Streaming?
>>
>>
>> Please help. Thanks.
>>
>>
>>
> .format(*"kafka"*)
> .option(*"kafka.bootstrap.servers"*, conig.outputBootstrapServer)
> .option(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, *"1000"*)
>
> But this is not working. Am I setting this correctly? Is there a different
> way to set this property under Spark Structured Streaming?
>
> Please help. Thanks.
Am I setting this correctly? Is there a different
way to set this property under Spark Structured Streaming?
Please help. Thanks.
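For what it's worth, the Kafka sink only forwards options whose keys carry a kafka. prefix to the underlying producer, so an un-prefixed ProducerConfig key is likely being ignored. A sketch (df stands for the streaming DataFrame; broker and topic are placeholders):

df.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("kafka.max.request.size", "1048576")      // producer config, "kafka."-prefixed
  .option("topic", "out")                           // placeholder
  .option("checkpointLocation", "/tmp/ckpt")
  .start() // df must have a 'value' column (string or binary)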
asks.
> 1. How to use coalesce with Spark Structured Streaming?
>
> Also I want to ask a few more questions:
> 2. How to restrict the number of executors in Structured Streaming?
> Is --num-executors the minimum?
> To cap the max, can I use spark.dynamicAllocation.maxExecutors?
Hi Alex, I read the book; it is a good one, but I don't see the things I
really want to understand.
You are right on the partitions and tasks.
1. How to use coalesce with Spark Structured Streaming?
Also I want to ask a few more questions:
2. How to restrict the number of executors in Structured
It's not intermittent; it seems to happen every time: Spark fails when it starts
up from the last checkpoint and complains the offset is old. I checked the
offset and it is indeed true that the offset expired on the Kafka side. My version
of Spark is 2.4.4 using Kafka 0.10.
On Sun, Apr 19, 2020 at 3:38 PM Jungtaek
On Wed, Apr 15, 2020 at 9:24 AM Ruijing Li wrote:
>>
>>> I see, I wasn't sure if that would work as expected. The docs seem to
>>> suggest being careful before turning off that option, and I'm not sure why
>>> failOnDataLoss is true by default.
>>>
>>> On Tu
That sounds odd. Is it intermittent, or always reproducible if you start
with the same checkpoint? What's the version of Spark?
On Fri, Apr 17, 2020 at 6:17 AM Ruijing Li wrote:
> Hi all,
>
> I have a question on how structured streaming does checkpointing. I’m
> noticing that spark is not reading
Just to clarify - I didn't write this explicitly in my answer. When you're
working with Kafka, every partition in Kafka is mapped to a Spark
partition, and in Spark, every partition is mapped to a task. But you can
use `coalesce` to decrease the number of Spark partitions, so you'll have
fewer tasks.
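As a sketch (assuming an existing SparkSession spark; broker and topic are placeholders):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                    // placeholder
  .load()
  .coalesce(4) // e.g. 100 Kafka partitions -> 4 Spark partitions, no shuffle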
Thank you Alex. I will check it out and let you know if I have any questions
On Fri, Apr 17, 2020 at 11:36 PM Alex Ott wrote:
> http://shop.oreilly.com/product/0636920047568.do has quite good
> information
> on it. For Kafka, you need to start with approximation that processing of
> each partit
http://shop.oreilly.com/product/0636920047568.do has quite good information
on it. For Kafka, you need to start with the approximation that processing
each partition is a separate task that needs to be executed, so you need to
plan the number of cores correspondingly.
Srinivas V at "Thu, 16 Apr 2020 2
Hi all,
I have a question on how Structured Streaming does checkpointing. I'm
noticing that Spark is not reading from the max/latest offset it has seen.
For example, in HDFS, I see it stored offset file 30, which contains
partition: offset {1: 2000}
But instead after stopping the job and restartin
Hello,
Can someone point me to a good video or document which talks about
performance tuning for a Structured Streaming app?
I am looking especially at listening to Kafka topics, say 5 topics each
with 100 partitions.
I am trying to figure out the best cluster size and the number of executors
and cores required.
k as expected. The docs seem to
>> suggest being careful before turning off that option, and I'm not sure why
>> failOnDataLoss is true by default.
>>
>> On Tue, Apr 14, 2020 at 5:16 PM Burak Yavuz wrote:
>>
>>> Just set `failOnDataLoss=false` as an option in
2020 at 5:16 PM Burak Yavuz wrote:
>
>> Just set `failOnDataLoss=false` as an option in readStream?
>>
>> On Tue, Apr 14, 2020 at 4:33 PM Ruijing Li wrote:
>>
>>> Hi all,
>>>
>>> I have a spark structured streaming app that is consuming from a ka
eam?
>
> On Tue, Apr 14, 2020 at 4:33 PM Ruijing Li wrote:
>
>> Hi all,
>>
>> I have a spark structured streaming app that is consuming from a kafka
>> topic with retention set up. Sometimes I face an issue where my query has
>> not finished processing a mess
Just set `failOnDataLoss=false` as an option in readStream?
On Tue, Apr 14, 2020 at 4:33 PM Ruijing Li wrote:
> Hi all,
>
> I have a spark structured streaming app that is consuming from a kafka
> topic with retention set up. Sometimes I face an issue where my query has
&g
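Spelled out, the suggested option (sketch, assuming an existing SparkSession spark; broker and topic are placeholders):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                    // placeholder
  .option("failOnDataLoss", "false") // warn instead of failing when offsets have already expired
  .load()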
Hi all,
I have a spark structured streaming app that is consuming from a kafka
topic with retention set up. Sometimes I face an issue where my query has
not finished processing a message but the retention kicks in and deletes
the offset, which since I use the default setting of “failOnDataLoss
That seems to come from a difference in how Spark infers the schema and creates
the serializer/deserializer for Java beans to construct the bean encoder.
When inferring the schema for Java beans, all properties which have getter
methods are considered. When creating the serializer/deserializer, only
properties whic
Never mind. It got resolved after I removed the two extra getter methods (to
calculate duration) I had created in my state-specific Java bean
(ProductSessionInformation). But I am surprised why it created so many
problems. I guess when this bean is converted to a Scala class it may not be
taking care of n
Ok, I will try to create some simple code to reproduce it, if I can. The problem
is that I am adding this code to an existing big project with several
dependencies, with an older Spark Streaming version (2.2) at the root level, etc.
Also, I observed that there is @Experimental on the GroupState class. What
state is it
I haven't heard of a known issue for this - that said, this may require new
investigation, which is not possible or requires huge effort without a simple
reproducer.
Contributors (who are basically volunteers) may not want to struggle to
reproduce it from your partial information - I'd recommend you to spend y
Sorry for the typos, correcting them below.
On Sat, Mar 28, 2020 at 4:39 PM Srinivas V wrote:
> Sorry I was just changing some names not to send exact names. Please
> ignore that. I am really struggling with this since couple of days. Can
> this happen due to
> 1. some of the values being null or
>
Sorry, I was just changing some names so as not to send the exact names. Please ignore
that. I am really struggling with this since a couple of days. Can this happen
due to
1. some of the values being null, or
2. a UTF8 issue? Or some serialization/deserialization issue?
3. Not enough memory?
I am using same nam
Well, the code itself doesn't seem to be OK - you're using
ProductStateInformation as the class of the State, whereas you provide
ProductSessionInformation to the Encoder for the State.
On Fri, Mar 27, 2020 at 11:14 PM Jungtaek Lim
wrote:
> Could you play with Encoders.bean()? You can Encoders.bean() with yo
Could you play with Encoders.bean()? You can call Encoders.bean() with your
class, and call .schema() on the return value to see how it transforms to
the schema in Spark SQL. The schema must be consistent across multiple JVM
runs to make it work properly, but I suspect it doesn't retain the order.
On
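A way to run that check, with ProductSessionInformation standing in for the bean from this thread:

import org.apache.spark.sql.Encoders

// Inspect the schema Spark derives from the bean's getters; run this in each
// environment and diff the output - field order and types must line up.
val enc = Encoders.bean(classOf[ProductSessionInformation])
enc.schema.printTreeString()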
I am listening to a Kafka topic with a Structured Streaming application in
Java, testing it on my local Mac.
When I retrieve the GroupState object back with
state.get(), it gives some random values for the fields in the object:
some are interchanged, some are defaults, and some are junk values.
See
There are lots of examples on 'Stateful Structured Streaming' in 'The
Definitive Guide' book, BUT all of them read JSON from a 'path'. That's
working for me.
Now I need to read from Kafka.
I Googled but couldn't find any example. I am struggling to map the
'value' of the Kafka message to my JSON
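A sketch of the Kafka-side mapping; the case class, broker, and topic are placeholders, and the grouped stateful step would follow on readings:

import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.from_json

val spark = SparkSession.builder.appName("kafka-json").getOrCreate()
import spark.implicits._

case class Reading(deviceId: String, value: Double) // placeholder payload shape

val schema = Encoders.product[Reading].schema // derive the schema from the case class

val readings = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                    // placeholder
  .load()
  .selectExpr("CAST(value AS STRING) AS json") // the Kafka 'value' arrives as bytes
  .select(from_json($"json", schema).as("r"))
  .select("r.*")
  .as[Reading] // typed Dataset, ready for groupByKey / mapGroupsWithState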
To: user@spark.apache.org
Subject: FW: [Spark Structured Streaming] Getting all the data in flatMapGroup
I have a scenario in which we need to calculate 'charges' for a stream of
events which has the following details:
1. Event contains eventTime,
r facet
4. Events that arrive in a single minute can be considered equivalent (for
reduced state maintenance) and all of them need to have free units
proportionally distributed
I was hoping to make this work in the following manner using Spark Structured
Streaming:
1. Aggregate events a
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
> On Fri, Nov 29, 2019 at 2:08 PM shicheng31...@gmail.com <
> shicheng31...@gmail.com> wrote:
>
>> Hi:
>> Spark S
g31...@gmail.com> wrote:
> Hi:
> Spark Structured Streaming uses the DataFrame API. When programming,
> there are no compilation errors, but when running, it will report various
> unsupported conditions. The official website does not seem to have a
> document to list the unsupported
Hi:
Spark Structured Streaming uses the DataFrame API. When programming, there
are no compilation errors, but at runtime it reports various unsupported
conditions. The official website does not seem to have a document listing the
unsupported operators. This is inconvenient when
The Spark MLlib streaming training models work with DStreams. Is there
any way to use them with Spark Structured Streaming?
I am trying to integrate MLeap with Spark Structured Streaming, but I am facing
a problem. Spark Structured Streaming with Kafka works with data
frames, while MLeap requires a LeapFrame. So I tried to convert the data
frame to a LeapFrame using the MLeap Spark support library function
(toSparkLeapFrame
I'm trying to analyze data using a Kinesis source in PySpark Structured
Streaming on Databricks.
I created a DataFrame as shown below.
kinDF = spark.readStream.format("kinesis").option("streamName",
"test-stream-1").load()
Converted the data from base64 encoding as below.
df = kinDF.withColumn("xml_data
The key option does not work!
I have set "maxOffsetsPerTrigger", but it still receives one partition
per trigger in micro-batch mode. So where do I set up receiving from 10 partitions
in parallel, like Spark Streaming does?
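If the goal is more read parallelism than the topic's partition count provides, the Kafka source's minPartitions option (Spark 2.4+) may be what's being looked for; maxOffsetsPerTrigger only caps the batch size. A sketch (assuming an existing SparkSession spark; broker and topic are placeholders):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                    // placeholder
  .option("minPartitions", "10") // ask for >= 10 Spark partitions, splitting Kafka partitions if needed
  .load()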
Hi Gabor,
sure, but DSv2 seems to be undergoing backward-incompatible changes from
Spark 2 -> 3, right? That, combined with the fact that the API is
pretty new, still doesn't instill confidence in its stability (API-wise I
mean).
Cheers,
Lars
On Fri, Jun 28, 2019 at 4:10 PM Gabor Somogyi
wrote:
Hi Lars,
DSv2 is already used in production.
Documentation-wise: since Spark is evolving fast, I would take a look at how
the built-in connectors are implemented.
BR,
G
On Fri, Jun 28, 2019 at 3:52 PM Lars Francke wrote:
> Gabor,
>
> thank you. That is immensely helpful. DataSource v1 it is then. Does
Gabor,
thank you. That is immensely helpful. DataSource v1 it is then. Does that
mean DSV2 is not really for production use yet?
Any idea what the best documentation would be? I'd probably start by
looking at existing code.
Cheers,
Lars
On Fri, Jun 28, 2019 at 1:06 PM Gabor Somogyi
wrote:
> H
Hi Lars,
Since Structured Streaming doesn't support receivers at all, that
source/sink can't be used.
Data Source V2 is under development and because of that it's a moving
target, so I suggest implementing it with V1 (unless special features are
required from V2).
Additionally since I've just ad
Hi,
I'm a bit confused about the current state and the future plans of custom
data sources in Structured Streaming.
So for DStreams we could write a Receiver as documented. Can this be used
with Structured Streaming?
Then we had the DataSource API with DefaultSource et al., which was (in my
opin
Got the point. If you would like to get "correct" output, you may need to
set the global watermark policy to "min", because the watermark is not only
used for evicting rows in state, but also for discarding input rows later than
the watermark. Here you may want to be aware that there are two stateful
operators which will
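The policy being referred to is a session conf in Spark 2.4+; "min" is the default, and "max" trades correctness for earlier state eviction:

spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "min")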
Hi all
it took me some time to get the issues extracted into a piece of standalone
code. I created the following gist
https://gist.github.com/jammann/b58bfbe0f4374b89ecea63c1e32c8f17
It has messages for 4 topics A/B/C/D and a simple Python program which shows 6
use cases, with my expectations a
https://stackoverflow.com/questions/56428367/any-clue-how-to-join-this-spark-structured-stream-joins