Spark Structured streaming - Kafka - slowness with query 0

2020-10-20 Thread KhajaAsmath Mohammed
Hi, I have started using Spark structured streaming for reading data from Kafka and the job is very slow. The number of output rows keeps increasing in query 0 and the job runs forever. Any suggestions for this, please? Thanks, Asmath

Re: Writing to mysql from pyspark spark structured streaming

2020-10-15 Thread German Schiavon
Hi, can you share the code? You have to call *writeStream* on a streaming Dataset/DataFrame. On Fri, 16 Oct 2020 at 02:14, Krishnanand Khambadkone wrote: > Hi, I am trying to write to mysql from a spark structured streaming kafka > source. Using spark 2.4.0. I get this exc
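
A minimal sketch of the pattern German describes, in Scala for consistency with other snippets in this archive: route the streaming DataFrame through foreachBatch (Spark 2.4+), where each micro-batch arrives as a plain DataFrame on which write — here a JDBC/MySQL sink — is legal. Broker, topic, URL, table, credentials and checkpoint path below are placeholders.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder().appName("kafka-to-mysql").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
      .option("subscribe", "events")                    // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS value")

    // 'write' is only valid on the batch DataFrame handed to foreachBatch.
    val query = stream.writeStream
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        batchDF.write
          .format("jdbc")
          .option("url", "jdbc:mysql://host:3306/db")   // placeholder URL
          .option("dbtable", "events")                  // placeholder table
          .option("user", "user")
          .option("password", "pw")
          .mode("append")
          .save()
      }
      .option("checkpointLocation", "/tmp/ckpt")        // placeholder path
      .start()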

Writing to mysql from pyspark spark structured streaming

2020-10-15 Thread Krishnanand Khambadkone
Hi, I am trying to write to mysql from a spark structured streaming kafka source. Using spark 2.4.0. I get this exception: AnalysisException: u"'write' can not be called on streaming Dataset/DataFrame

Re: Excessive disk IO with Spark structured streaming

2020-10-07 Thread Jungtaek Lim
I can't spend too much time on explaining one by one. I strongly encourage you to do a deep-dive instead of just looking around as you want to know about "details" - that's how open source works. I'll go through a general explanation instead of replying inline; probably I'd write a blog doc if the

Re: Excessive disk IO with Spark structured streaming

2020-10-07 Thread Sergey Oboguev
Hi Jungtaek, *> I meant the subdirectory inside the directory you're providing as "checkpointLocation", as there're several directories in that directory...* There are two: *my-spark-checkpoint-dir/MainApp* created by sparkSession.sparkContext().setCheckpointDir() contains only empty subdir wit

Re: Excessive disk IO with Spark structured streaming

2020-10-05 Thread Jungtaek Lim
roposed under SPARK-30294 [1]. >> >> 2) Spark leverages HDFS API which is configured to create crc file per >> file by default. (So you'll have 2x files than expected.) There's a bug in >> HDFS API (HADOOP-16255 [2]) which missed to handle crc files during rename &g

Re: Excessive disk IO with Spark structured streaming

2020-10-05 Thread Sergey Oboguev
round (SPARK-28025 [3]) Spark tries to > delete the crc file which two additional operations (exist -> delete) may > occur per crc file. > > Hope this helps. > > Thanks, > Jungtaek Lim (HeartSaVioR) > > 1. https://issues.apache.org/jira/browse/SPARK-30294 > 2. https://

Re: Excessive disk IO with Spark structured streaming

2020-10-05 Thread Jungtaek Lim
(HeartSaVioR) 1. https://issues.apache.org/jira/browse/SPARK-30294 2. https://issues.apache.org/jira/browse/HADOOP-16255 3. https://issues.apache.org/jira/browse/SPARK-28025 On Sun, Oct 4, 2020 at 10:08 PM Sergey Oboguev wrote: > I am trying to run a Spark structured streaming program simulating bas

Excessive disk IO with Spark structured streaming

2020-10-04 Thread Sergey Oboguev
I am trying to run a Spark structured streaming program simulating basic scenario of ingesting events and calculating aggregates on a window with watermark, and I am observing an inordinate amount of disk IO Spark performs. The basic structure of the program is like this: sparkSession

Re: Spark structured streaming: periodically refresh static data frame

2020-09-17 Thread Harsh
As per the solution, if we are closing and starting the query, then what happens to the state which is maintained in memory? Will that be retained? -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: [SPARK-STRUCTURED-STREAMING] IllegalStateException: Race while writing batch 4

2020-08-12 Thread Jungtaek Lim
File stream sink doesn't support the functionality. There're several approaches to do so: 1) two queries write to Kafka (or any intermediate storage which allows concurrent writes), and let next Spark application read and write to the final path 2) two queries write to two different directories, a

Spark Structured streaming 2.4 - Kill and deploy in yarn

2020-08-10 Thread KhajaAsmath Mohammed
Hi, I am looking for some information on how to gracefully kill the spark structured streaming kafka job and redeploy it. How do I kill a spark structured streaming job in YARN? Any suggestions on how to kill it gracefully? I was able to monitor the job from the SQL tab, but how can I kill this job when deployed
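
One pattern that comes up for this — a sketch, not the only way: run the query until an external marker file appears, then call StreamingQuery.stop() from the driver so no new micro-batch is started, and let the checkpoint handle the redeploy. The marker path and polling interval here are assumptions.

    import java.nio.file.{Files, Paths}
    import org.apache.spark.sql.streaming.StreamingQuery

    // Poll for a shutdown marker and stop the query from the driver;
    // on redeploy, the checkpoint restarts from the last committed offsets.
    def awaitGracefulShutdown(query: StreamingQuery, markerPath: String): Unit = {
      while (query.isActive) {
        if (Files.exists(Paths.get(markerPath))) {
          query.stop()
        } else {
          query.awaitTermination(10000) // re-check every 10 seconds
        }
      }
    }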

[SPARK-STRUCTURED-STREAMING] IllegalStateException: Race while writing batch 4

2020-08-07 Thread Amit Joshi
Hi, I have 2 Spark structured streaming queries writing to the same output path in object storage. Once in a while I am getting the "IllegalStateException: Race while writing batch 4". I found that this error is because there are two writers writing to the output path. The file streaming sink doesn't su

Re: Lazy Spark Structured Streaming

2020-08-02 Thread Jungtaek Lim
dummy record to move watermark >> forward. >> >> Hope this helps. >> >> Thanks, >> Jungtaek Lim (HeartSaVioR) >> >> On Mon, Jul 27, 2020 at 8:10 PM Phillip Henry >> wrote: >> >>> Sorry, should have mentioned that Spark only seems reluctan

Re: Lazy Spark Structured Streaming

2020-08-02 Thread Phillip Henry
should have mentioned that Spark only seems reluctant to take the >> last windowed, groupBy batch from Kafka when using OutputMode.Append. >> >> I've asked on StackOverflow: >> >> https://stackoverflow.com/questions/62915922/spark-structured-streaming-wont-pull-the

Re: Lazy Spark Structured Streaming

2020-07-27 Thread Jungtaek Lim
Thanks, Jungtaek Lim (HeartSaVioR) On Mon, Jul 27, 2020 at 8:10 PM Phillip Henry wrote: > Sorry, should have mentioned that Spark only seems reluctant to take the > last windowed, groupBy batch from Kafka when using OutputMode.Append. > > I've asked on StackOverflow: > >

Re: Lazy Spark Structured Streaming

2020-07-27 Thread Phillip Henry
Sorry, should have mentioned that Spark only seems reluctant to take the last windowed, groupBy batch from Kafka when using OutputMode.Append. I've asked on StackOverflow: https://stackoverflow.com/questions/62915922/spark-structured-streaming-wont-pull-the-final-batch-from-kafka but am

Spark Structured Streaming join data results in missing result set

2020-07-21 Thread dong524dong
We are using Spark structured streaming to join two data streams. Kafka is used to collect data in the "earliest" mode (the sender sends data cyclically, sending only one data message at a time). The following are our Kafka configuration parameters: def

Re: Spark Structured Streaming keep on consuming usercache

2020-07-20 Thread Piyush Acharya
Can you try calling batchDF.unpersist() once the work is done in loop? On Mon, Jul 20, 2020 at 3:38 PM Yong Yuan wrote: > It seems the following structured streaming code keeps on consuming > usercache until all disk space are occupied. > > val monitoring_stream = > monitoring_df.writeSt

Spark Structured Streaming keep on consuming usercache

2020-07-20 Thread Yong Yuan
It seems the following structured streaming code keeps on consuming usercache until all disk space are occupied. val monitoring_stream = monitoring_df.writeStream .trigger(Trigger.ProcessingTime("120 seconds")) .foreachBatch { (batchDF: DataFrame, b
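
For reference, a sketch of the persist/unpersist pairing Piyush suggests inside foreachBatch, using the names from this thread; the sink path and the second action are placeholders:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.streaming.Trigger

    val monitoring_stream = monitoring_df.writeStream
      .trigger(Trigger.ProcessingTime("120 seconds"))
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        batchDF.persist()   // reuse across the actions below
        batchDF.write.mode("append").parquet("/tmp/out") // placeholder sink
        batchDF.count()                                  // placeholder second action
        batchDF.unpersist() // release cached blocks so usercache stops growing
      }
      .start()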

Re: Schedule/Orchestrate spark structured streaming job

2020-07-19 Thread Piyush Acharya
Trigger = Once. Regards, ..Piyush On Sun, Jul 19, 2020 at 11:01 PM anbutech wrote: > Hi Team, > > I'm very new to spark structured streaming.could you please guide me how to > Schedule/Orchestrate spark structured streaming job.Any scheduler similar > like airflow.I knew air
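
"Trigger = Once" refers to Trigger.Once(): the query processes whatever is available, commits the checkpoint, and exits, so a batch scheduler such as Airflow can launch it on a cron-like schedule. A sketch, with sink and paths as placeholders:

    import org.apache.spark.sql.streaming.Trigger

    val query = df.writeStream
      .format("parquet")                           // placeholder sink
      .option("path", "/data/out")                 // placeholder path
      .option("checkpointLocation", "/data/ckpt")  // resumes from here on each run
      .trigger(Trigger.Once())                     // drain available data, then stop
      .start()

    query.awaitTermination() // returns once the single run has finished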

Re: OOM while processing read/write to S3 using Spark Structured Streaming

2020-07-19 Thread Piyush Acharya
Please try with the maxBytesPerTrigger option; probably the files are big enough to crash the JVM. Please give some info on the executors and on the files (size etc.). Regards, ..Piyush On Sun, Jul 19, 2020 at 3:29 PM Rachana Srivastava wrote: > *Issue:* I am trying to process 5000+ files of gzipped json file

Schedule/Orchestrate spark structured streaming job

2020-07-19 Thread anbutech
Hi Team, I'm very new to spark structured streaming. Could you please guide me on how to schedule/orchestrate a spark structured streaming job? Any scheduler similar to Airflow? I know Airflow doesn't support streaming jobs. Thanks Anbu -- Sent from: http://apache-spark-user-list.

Re: OOM while processing read/write to S3 using Spark Structured Streaming

2020-07-19 Thread Sanjeev Mishra
Can you reduce maxFilesPerTrigger further and see if the OOM still persists, if it does then the problem may be somewhere else. > On Jul 19, 2020, at 5:37 AM, Jungtaek Lim > wrote: > > Please provide logs and dump file for the OOM case - otherwise no one could > say what's the cause. > > Add
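
For context, a sketch of the throttling knob being discussed: the file source caps how many files go into each micro-batch via maxFilesPerTrigger. Path and value below are placeholders; the reader shape follows the snippet quoted later in this thread.

    val inputDS = sparkSession.readStream
      .format("text")
      .option("maxFilesPerTrigger", 50) // bound the per-batch file count
      .load("s3a://bucket/input/")      // placeholder path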

Re: OOM while processing read/write to S3 using Spark Structured Streaming

2020-07-19 Thread Jungtaek Lim
Please provide logs and dump file for the OOM case - otherwise no one could say what's the cause. Add JVM options to driver/executor => -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath="...dir..." On Sun, Jul 19, 2020 at 6:56 PM Rachana Srivastava wrote: > *Issue:* I am trying to process 5000+

OOM while processing read/write to S3 using Spark Structured Streaming

2020-07-19 Thread Rachana Srivastava
Issue: I am trying to process 5000+ gzipped JSON files periodically from S3 using Structured Streaming code. Here are the key steps: - Read json schema and broadcast to executors - Read Stream Dataset inputDS = sparkSession.readStream() .format("text") .option("inf

Re: Are there some pitfalls in my spark structured streaming code which causes slow response after several hours running?

2020-07-18 Thread Jörn Franke
It depends a bit on the data as well, but have you investigated in the Spark UI which executor/task becomes slow? Could it also be the database from which you load data? > On 18.07.2020 at 17:00, Yong Yuan wrote: > >  > The spark job has the correct functions and logic. However, after several

Are there some pitfalls in my spark structured streaming code which causes slow response after several hours running?

2020-07-18 Thread Yong Yuan
The spark job has the correct functions and logic. However, after several hours running, it becomes slower and slower. Are there some pitfalls in the below code? Thanks! val query = "(select * from meta_table) as meta_data" val meta_schema = new StructType() .add("config_id", BooleanType)

Lazy Spark Structured Streaming

2020-07-12 Thread Phillip Henry
Hi, folks. I noticed that SSS won't process a waiting batch if there are no batches after that. To put it another way, Spark must always leave one batch on Kafka waiting to be consumed. There is a JIRA for this at: https://issues.apache.org/jira/browse/SPARK-24156 that says it's resolved in 2.4

Re: Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-06 Thread Jungtaek Lim
In SS, checkpointing is now a part of running a micro-batch and it's supported natively. (To be clear, my library doesn't deal with the native behavior of checkpointing.) In other words, it can't be customized like you have been doing with your database. You probably don't need to do it with SS, but

Re: Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-06 Thread KhajaAsmath Mohammed
Thanks Lim, this is really helpful. I have a few questions. Our earlier approach used a low-level consumer to read offsets from a database and use that information to read using spark streaming with DStreams, saving the offsets back once the process finished. This way we never lost data. With your libr

Re: Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-05 Thread Jungtaek Lim
There're sections in SS programming guide which exactly answer these questions: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#managing-streaming-queries http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries A

Re: Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-05 Thread Gabor Somogyi
In 3.0 the community just added it. On Sun, 5 Jul 2020, 14:28 KhajaAsmath Mohammed, wrote: > Hi, > > We are trying to move our existing code from spark dstreams to structured > streaming for one of the old application which we built few years ago. > > Structured streaming job doesn’t have stream

Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-05 Thread KhajaAsmath Mohammed
Hi, We are trying to move our existing code from spark dstreams to structured streaming for one of the old applications which we built a few years ago. The structured streaming job doesn't have a streaming tab in the Spark UI. Is there a way to monitor the job submitted by us in structured streaming? Since

Re: Failure Threshold in Spark Structured Streaming?

2020-07-02 Thread Jungtaek Lim
Structured Streaming is basically following SQL semantic, which doesn't have such a semantic of "max allowance of failures". If you'd like to tolerate malformed data, please read with raw format (string or binary) which won't fail with such data, and try converting. e.g. from_json() will produce nu
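
A sketch of the approach Jungtaek describes, assuming a Kafka source named rawKafkaDF and a hypothetical two-field schema: read the payload as a raw string, parse with from_json — which yields null for rows that fail to parse — then split good and bad records instead of failing the query.

    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

    val schema = new StructType()     // assumed schema for illustration
      .add("id", StringType)
      .add("ts", TimestampType)

    val parsed = rawKafkaDF
      .selectExpr("CAST(value AS STRING) AS json")
      .select(from_json(col("json"), schema).as("data"), col("json"))

    val good = parsed.filter(col("data").isNotNull).select("data.*")
    val bad  = parsed.filter(col("data").isNull) // route to a dead-letter sink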

Failure Threshold in Spark Structured Streaming?

2020-07-02 Thread Eric Beabes
Currently my job fails even on a single failure. In other words, even if one incoming message is malformed the job fails. I believe there's a property that allows us to set an acceptable number of failures. I Googled but couldn't find the answer. Can someone please help? Thanks.

Re: Spark Structured Streaming: “earliest” as “startingOffsets” is not working

2020-06-26 Thread Srinivas V
started working. Hopefully this will help someone. Thanks. > > On Fri, Jun 26, 2020 at 2:12 PM Something Something < > mailinglist...@gmail.com> wrote: > >> My Spark Structured Streaming job works fine when I set "startingOffsets" >> to "latest". Whe

Re: Spark Structured Streaming: “earliest” as “startingOffsets” is not working

2020-06-26 Thread Eric Beabes
My apologies... After I set the 'maxOffsetsPerTrigger' to a value such as '20' it started working. Hopefully this will help someone. Thanks. On Fri, Jun 26, 2020 at 2:12 PM Something Something < mailinglist...@gmail.com> wrote: > My Spark Structured Stream
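
For anyone hitting the same thing, a sketch of the combination being described: with startingOffsets=earliest the first batch covers the whole topic backlog unless maxOffsetsPerTrigger bounds it. Broker, topic and the cap below are placeholders.

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                    // placeholder
      .option("startingOffsets", "earliest")
      .option("maxOffsetsPerTrigger", 20000000L)        // bound the first, huge batch
      .load()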

Spark Structured Streaming: “earliest” as “startingOffsets” is not working

2020-06-26 Thread Something Something
My Spark Structured Streaming job works fine when I set "startingOffsets" to "latest". When I simply change it to "earliest" & specify a new "check point directory", the job doesn't work. The states don't get timed out after 10 minutes. Whi

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-18 Thread Jacek Laskowski
Hi Rachana, > Should I go backward and use Spark Streaming DStream based. No. Never. It's no longer supported (and should really be removed from the codebase once and for all - dreaming...). Spark focuses on Spark SQL and Spark Structured Streaming as user-facing modules for batch and s

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Rachana Srivastava
Structured Streaming vs Spark Streaming (DStream)? Which is recommended for system stability? Exactly once is NOT first priority. First priority is a STABLE system. I need to make a decision soon. I need help. Here is the question again. Should I go backward and use Spark Streaming DStream b

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Rachana Srivastava
Frankly speaking I do not care about EXACTLY ONCE... I am OK with AT LEAST ONCE as long as the system does not fail every 5 to 7 days with no recovery option. On Wednesday, June 17, 2020, 02:31:50 PM PDT, Rachana Srivastava wrote: Thanks so much TD.  Thanks for forwarding your datalake pro

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Rachana Srivastava
Thanks so much TD. Thanks for forwarding your datalake project, but at this time we have budget constraints and can only use open source projects. I just want the Structured Streaming Application or Spark Streaming DStream Application to run without an issue for a long time. I do not want the

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Breno Arosa
Kafka-connect (https://docs.confluent.io/current/connect/index.html) may be an easier solution for this use case of just dumping kafka topics. On 17/06/2020 18:02, Jungtaek Lim wrote: Just in case if anyone prefers ASF projects then there are other alternative projects in ASF as well, alphabeti

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Jungtaek Lim
Just in case if anyone prefers ASF projects then there are other alternative projects in ASF as well, alphabetically, Apache Hudi [1] and Apache Iceberg [2]. Both are recently graduated as top level projects. (DISCLAIMER: I'm not involved in both.) BTW it would be nice if we make the metadata impl

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Tathagata Das
Hello Rachana, Getting exactly-once semantics on files and making it scale to a very large number of files are very hard problems to solve. While Structured Streaming + built-in file sink solves the exactly-once guarantee that DStreams could not, it is definitely limited in other ways (scaling in

How to manage offsets in Spark Structured Streaming?

2020-06-17 Thread Rachana Srivastava
so big that we start getting OOM errors. I want to get rid of the metadata and checkpoint folders of Spark Structured Streaming and manage offsets myself. How we managed offsets in Spark Streaming: I have used val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges to get offsets in Spark
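
In Structured Streaming the offsets live in the checkpoint, but they can still be mirrored into your own store — one way, sketched below, is a StreamingQueryListener, where the println stands in for the database write that was used with DStreams:

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        // endOffset is a JSON string of topic -> partition -> offset.
        event.progress.sources.foreach(s => println(s.endOffset))
      }
    })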

Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Rachana Srivastava
I have written a simple spark structured streaming app to move data from Kafka to S3. Found that in order to support the exactly-once guarantee spark creates a _spark_metadata folder, which ends up growing too large as the streaming app is SUPPOSED TO run FOREVER. But when the streaming app runs for a l

[spark-structured-streaming] [stateful]

2020-06-13 Thread Srinivas V
Does stateful structured streaming work on a stand-alone spark cluster with a few nodes? Does it need HDFS? If not, how to get it working without HDFS? Regards Srini

Re: [spark-structured-streaming] [kafka] consume topics from multiple Kafka clusters

2020-06-09 Thread Srinivas V
ok, thanks for confirming, I will do it this way. Regards Srini On Tue, Jun 9, 2020 at 11:31 PM Gerard Maas wrote: > Hi Srinivas, > > Reading from different brokers is possible but you need to connect to each > Kafka cluster separately. > Trying to mix connections to two different Kafka cluster

Re: [spark-structured-streaming] [kafka] consume topics from multiple Kafka clusters

2020-06-09 Thread Gerard Maas
Hi Srinivas, Reading from different brokers is possible but you need to connect to each Kafka cluster separately. Trying to mix connections to two different Kafka clusters in one subscriber is not supported. (I'm sure that it would give all kind of weird errors) The "kafka.bootstrap.servers" opti

Re: [spark-structured-streaming] [kafka] consume topics from multiple Kafka clusters

2020-06-09 Thread Srinivas V
Thanks for the quick reply. This may work, but I have like 5 topics to listen to right now. I am trying to keep all topics in an array in a properties file and to read them all at once. This way it is dynamic, and you have one code block like below, and you may add or delete topics from the config f

Re: [spark-structured-streaming] [kafka] consume topics from multiple Kafka clusters

2020-06-09 Thread German SM
Hello, I've never tried that, this doesn't work? val df_cluster1 = spark .read .format("kafka") .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port") .option("subscribe", "topic1") val df_cluster2 = spark .read .format("kafka") .option("kafka.bootstrap.servers", "cluste
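
The snippet above is cut off in the archive; the overall shape Gerard confirms is one reader per cluster, then a union. A sketch for the streaming case, with hosts and topics as placeholders:

    val df_cluster1 = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "cluster1_host:9092") // placeholder
      .option("subscribe", "topic1")
      .load()

    val df_cluster2 = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "cluster2_host:9092") // placeholder
      .option("subscribe", "topic2")
      .load()

    // One logical stream over both clusters; a single query can consume it.
    val df_all = df_cluster1.union(df_cluster2)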

[spark-structured-streaming] [kafka] consume topics from multiple Kafka clusters

2020-06-09 Thread Srinivas V
Hello, In Structured Streaming, is it possible to have one spark application with one query to consume topics from multiple kafka clusters? I am trying to consume two topics each from different Kafka Cluster, but it gives one of the topics as an unknown topic and the job keeps running without com

Re: RecordTooLargeException in Spark *Structured* Streaming

2020-05-26 Thread Something Something
format(*"kafka"*) >> >> .option( >> >> *"kafka.bootstrap.servers"*, >> >> conig.outputBootstrapServer >> >> ) >> >> .option(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, *"1000"*) >> >> - >> >> >> >> But this is not working. Am I setting this correctly? Is there a >> different way to set this property under Spark Structured Streaming? >> >> >> Please help. Thanks. >> >> >>

Re: RecordTooLargeException in Spark *Structured* Streaming

2020-05-25 Thread Jungtaek Lim
; > .format(*"kafka"*) > > .option( > > *"kafka.bootstrap.servers"*, > > conig.outputBootstrapServer > > ) > > .option(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, *"1000"*) > > --------- > > > > But this is not working. Am I setting this correctly? Is there a different > way to set this property under Spark Structured Streaming? > > > Please help. Thanks. > > >

RecordTooLargeException in Spark *Structured* Streaming

2020-05-25 Thread Something Something
ing this correctly? Is there a different way to set this property under Spark Structured Streaming? Please help. Thanks.
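
The snippet quoted throughout this thread passes ProducerConfig.MAX_REQUEST_SIZE_CONFIG (i.e. max.request.size) as a bare option key; the Kafka sink forwards only options prefixed with "kafka." to the underlying producer, so the likely fix is to add that prefix — a sketch with placeholder broker, topic, checkpoint path, and an illustrative 10 MB cap:

    import org.apache.kafka.clients.producer.ProducerConfig

    val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("topic", "out")                           // placeholder
      // producer configs reach the Kafka producer only with the "kafka." prefix:
      .option("kafka." + ProducerConfig.MAX_REQUEST_SIZE_CONFIG, "10485760")
      .option("checkpointLocation", "/tmp/ckpt")        // placeholder
      .start()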

Re: Spark structured streaming - performance tuning

2020-05-08 Thread Srinivas V
asks. > 1.How to use coalesce with spark structured streaming ? > > Also I want to ask few more questions, > 2. How to restrict number of executors on structured streaming? > —num-executors is minimum is it ? > To cap max, can I use spark.dynamicAllocation.maxExecutors ?

Re: Spark structured streaming - performance tuning

2020-05-02 Thread Srinivas V
Hi Alex, I read the book; it is a good one, but I don't see the things which I strongly want to understand. You are right on the partitions and tasks. 1. How to use coalesce with spark structured streaming? Also I want to ask a few more questions, 2. How to restrict the number of executors on structured

Re: Understanding spark structured streaming checkpointing system

2020-04-19 Thread Ruijing Li
It’s not intermittent, seems to happen everytime spark fails when it starts up from last checkpoint and complains the offset is old. I checked the offset and it is indeed true the offset expired from kafka side. My version of spark is 2.4.4 using kafka 0.10 On Sun, Apr 19, 2020 at 3:38 PM Jungtaek

Re: Spark structured streaming - Fallback to earliest offset

2020-04-19 Thread Jungtaek Lim
n Wed, Apr 15, 2020 at 9:24 AM Ruijing Li wrote: >> >>> I see, I wasn’t sure if that would work as expected. The docs seems to >>> suggest to be careful before turning off that option, and I’m not sure why >>> failOnDataLoss is true by default. >>> >>> On Tu

Re: Understanding spark structured streaming checkpointing system

2020-04-19 Thread Jungtaek Lim
That sounds odd. Is it intermittent, or always reproducible if you starts with same checkpoint? What's the version of Spark? On Fri, Apr 17, 2020 at 6:17 AM Ruijing Li wrote: > Hi all, > > I have a question on how structured streaming does checkpointing. I’m > noticing that spark is not reading

Re: Spark structured streaming - performance tuning

2020-04-18 Thread Alex Ott
Just to clarify - I didn't write this explicitly in my answer. When you're working with Kafka, every partition in Kafka is mapped into Spark partition. And in Spark, every partition is mapped into task. But you can use `coalesce` to decrease the number of Spark partitions, so you'll have less tas
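
A sketch of Alex's point, assuming a Kafka-sourced DataFrame named kafkaDF: Kafka partitions map 1:1 to Spark partitions, and coalesce shrinks the task count per micro-batch without a shuffle (the target count is illustrative):

    // Fewer, larger partitions per micro-batch; coalesce avoids a shuffle.
    val compacted = kafkaDF.coalesce(4)

    val query = compacted.writeStream
      .format("console") // placeholder sink
      .start()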

Re: Spark structured streaming - performance tuning

2020-04-17 Thread Srinivas V
Thank you Alex. I will check it out and let you know if I have any questions On Fri, Apr 17, 2020 at 11:36 PM Alex Ott wrote: > http://shop.oreilly.com/product/0636920047568.do has quite good > information > on it. For Kafka, you need to start with approximation that processing of > each partit

Re: Spark structured streaming - performance tuning

2020-04-17 Thread Alex Ott
http://shop.oreilly.com/product/0636920047568.do has quite good information on it. For Kafka, you need to start with approximation that processing of each partition is a separate task that need to be executed, so you need to plan number of cores correspondingly. Srinivas V at "Thu, 16 Apr 2020 2

Understanding spark structured streaming checkpointing system

2020-04-16 Thread Ruijing Li
Hi all, I have a question on how structured streaming does checkpointing. I’m noticing that spark is not reading from the max / latest offset it’s seen. For example, in HDFS, I see it stored offset file 30 which contains partition: offset {1: 2000} But instead after stopping the job and restartin

Spark structured streaming - performance tuning

2020-04-16 Thread Srinivas V
Hello, Can someone point me to a good video or document which talks about performance tuning for a structured streaming app? I am looking especially at listening to Kafka topics, say 5 topics each with 100 partitions. Trying to figure out the best cluster size and number of executors and cores required.

Re: Spark structured streaming - Fallback to earliest offset

2020-04-16 Thread Ruijing Li
k as expected. The docs seems to >> suggest to be careful before turning off that option, and I’m not sure why >> failOnDataLoss is true by default. >> >> On Tue, Apr 14, 2020 at 5:16 PM Burak Yavuz wrote: >> >>> Just set `failOnDataLoss=false` as an option in

Re: Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Jungtaek Lim
2020 at 5:16 PM Burak Yavuz wrote: > >> Just set `failOnDataLoss=false` as an option in readStream? >> >> On Tue, Apr 14, 2020 at 4:33 PM Ruijing Li wrote: >> >>> Hi all, >>> >>> I have a spark structured streaming app that is consuming from a ka

Re: Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Ruijing Li
eam? > > On Tue, Apr 14, 2020 at 4:33 PM Ruijing Li wrote: > >> Hi all, >> >> I have a spark structured streaming app that is consuming from a kafka >> topic with retention set up. Sometimes I face an issue where my query has >> not finished processing a mess

Re: Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Burak Yavuz
Just set `failOnDataLoss=false` as an option in readStream? On Tue, Apr 14, 2020 at 4:33 PM Ruijing Li wrote: > Hi all, > > I have a spark structured streaming app that is consuming from a kafka > topic with retention set up. Sometimes I face an issue where my query has &g
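
For reference, a sketch of where Burak's option goes (broker and topic are placeholders); note the thread's caveat that this silently skips the range Kafka retention has already deleted:

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                    // placeholder
      // don't fail when retention has deleted the recorded offsets;
      // the query falls back to the earliest offset still available
      .option("failOnDataLoss", "false")
      .load()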

Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Ruijing Li
Hi all, I have a spark structured streaming app that is consuming from a kafka topic with retention set up. Sometimes I face an issue where my query has not finished processing a message but the retention kicks in and deletes the offset, which since I use the default setting of “failOnDataLoss

Re: spark structured streaming GroupState returns weird values from sate

2020-03-31 Thread Jungtaek Lim
That seems to come from the difference how Spark infers schema and create serializer / deserializer for Java beans to construct bean encoder. When inferring schema for Java beans, all properties which have getter methods are considered. When creating serializer / deserializer, only properties whic

Re: spark structured streaming GroupState returns weird values from sate

2020-03-31 Thread Srinivas V
Never mind. It got resolved after I removed the extra two getter methods (to calculate duration) that I had created in my state-specific Java bean (ProductSessionInformation). But I am surprised why it caused so much of a problem. I guess when this bean is converted to a Scala class it may not be taking care of n

Re: spark structured streaming GroupState returns weird values from sate

2020-03-28 Thread Srinivas V
Ok, I will try to create some simple code to reproduce, if I can. The problem is that I am adding this code to an existing big project with several dependencies, with an older spark streaming version (2.2) at the root level, etc. Also, I observed that there is @Experimental on the GroupState class. What state is it

Re: spark structured streaming GroupState returns weird values from sate

2020-03-28 Thread Jungtaek Lim
I haven't heard of a known issue for this - that said, this may require new investigation, which is not possible or requires huge effort without a simple reproducer. Contributors (who are basically volunteers) may not want to struggle to reproduce from your partial information - I'd recommend you to spend y

Re: spark structured streaming GroupState returns weird values from sate

2020-03-28 Thread Srinivas V
Sorry for typos , correcting them below On Sat, Mar 28, 2020 at 4:39 PM Srinivas V wrote: > Sorry I was just changing some names not to send exact names. Please > ignore that. I am really struggling with this since couple of days. Can > this happen due to > 1. some of the values being null or >

Re: spark structured streaming GroupState returns weird values from sate

2020-03-28 Thread Srinivas V
Sorry, I was just changing some names so as not to send the exact names. Please ignore that. I have really been struggling with this since a couple of days ago. Can this happen due to 1. some of the values being null, or 2. a UTF8 issue? Or some serialization/deserialization issue? 3. Not enough memory? I am using the same nam

Re: spark structured streaming GroupState returns weird values from sate

2020-03-27 Thread Jungtaek Lim
Well, the code itself doesn't seem to be OK - you're using ProductStateInformation as the class of State whereas you provide ProductSessionInformation to Encoder for State. On Fri, Mar 27, 2020 at 11:14 PM Jungtaek Lim wrote: > Could you play with Encoders.bean()? You can Encoders.bean() with yo

Re: spark structured streaming GroupState returns weird values from sate

2020-03-27 Thread Jungtaek Lim
Could you play with Encoders.bean()? You can Encoders.bean() with your class, and call .schema() with the return value to see how it transforms to the schema in Spark SQL. The schema must be consistent across multiple JVM runs to make it work properly, but I suspect it doesn't retain the order. On
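
A sketch of the check Jungtaek proposes, using the bean class named in this thread: build the encoder and print its schema in each environment to see whether the field order and types stay consistent.

    import org.apache.spark.sql.Encoders

    val enc = Encoders.bean(classOf[ProductSessionInformation])
    // Compare this output across JVM runs; a differing field order would
    // explain state fields coming back interchanged or as junk values.
    enc.schema.printTreeString()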

spark structured streaming GroupState returns weird values from sate

2020-03-27 Thread Srinivas V
I am listening to a Kafka topic with a structured streaming application in Java, testing it on my local Mac. When I retrieve the GroupState object back with state.get(), it gives some random values for the fields in the object: some are interchanged, some are default, and some are junk values. See

Example of Stateful Spark Structured Streaming with Kafka

2020-03-03 Thread Something Something
There are lots of examples on 'Stateful Structured Streaming' in 'The Definitive Guide' book BUT all of them read JSON from a 'path'. That's working for me. Now I need to read from Kafka. I Googled but I couldn't find any example. I am struggling to Map the 'Value' of the Kafka message to my JSON

RE: [Spark Structured Streaming] Getting all the data in flatMapGroup

2020-01-12 Thread Shaji U
AM To: user@spark.apache.org Subject: FW: [Spark Structured Streaming] Getting all the data in flatMapGroup I have a scenario in which we need to calculate 'charges' for a stream of events which has the following details: 1. Event contains eventTime,

FW: [Spark Structured Streaming] Getting all the data in flatMapGroup

2020-01-10 Thread Shaji U
r facet 4. Events that arrive in a single minute can be considered equivalent (for reduced state maintenance) and all of them need to have free units proportionally distributed. I was hoping to make this work in the following manner using spark structured streaming 1. Aggregate events a

Re: Operators supported by Spark Structured Streaming

2019-11-28 Thread hahaha sc
rk.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations > > Thanks, > Jungtaek Lim (HeartSaVioR) > > > On Fri, Nov 29, 2019 at 2:08 PM shicheng31...@gmail.com < > shicheng31...@gmail.com> wrote: > >> Hi: >> Spark S

Re: Operators supported by Spark Structured Streaming

2019-11-28 Thread Jungtaek Lim
g31...@gmail.com> wrote: > Hi: > Spark Structured Streaming uses the DataFrame API. When programming, > there are no compilation errors, but when running, it will report various > unsupported conditions. The official website does not seem to have a > document to list the unsupported

Operators supported by Spark Structured Streaming

2019-11-28 Thread shicheng31...@gmail.com
Hi: Spark Structured Streaming uses the DataFrame API. When programming, there are no compilation errors, but when running, it will report various unsupported conditions. The official website does not seem to have a document listing the unsupported operators. This is inconvenient when

How to Integrate Spark mllib Streaming Training Models To Spark Structured Streaming

2019-09-17 Thread Praful Rana
The Spark MLlib streaming training models work with DStreams. So is there any way to use them with spark structured streaming?

How to integrates MLeap to Spark Structured Streaming

2019-09-17 Thread Praful Rana
So I am trying to integrate MLeap with spark structured streaming, but am facing a problem. As Spark structured Streaming with Kafka works with data frames, and MLeap requires a LeapFrame, I tried to convert the data frame to a leap frame using the mleap spark support library function (toSparkLeapFrame

Spark Structured Streaming XML content

2019-08-14 Thread Nick Dawes
I'm trying to analyze data using a Kinesis source in PySpark Structured Streaming on Databricks. Created a Dataframe as shown below. kinDF = spark.readStream.format("kinesis").option("streamName", "test-stream-1").load() Converted the data from base64 encoding as below. df = kinDF.withColumn("xml_data

how to specify which partition each record send on spark structured streaming kafka sink?

2019-08-12 Thread zenglong chen
The key option does not work!

How to add spark structured streaming kafka source receiver

2019-08-09 Thread zenglong chen
I have set "maxOffsetsPerTrigger", but it still receives one partition per trigger in micro-batch mode. So where do I set receiving from 10 partitions in parallel, like Spark Streaming does?
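
If the goal is more read parallelism rather than batch-size control, note that maxOffsetsPerTrigger only caps how many offsets a batch takes; since Spark 2.4 the Kafka source also accepts a minPartitions option. A sketch with placeholder connection details:

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                    // placeholder
      .option("minPartitions", "10") // split topic partitions into more Spark tasks
      .load()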

Re: Spark Structured Streaming Custom Sources confusion

2019-06-28 Thread Lars Francke
Hi Gabor, sure, the DSv2 seems to be undergoing backward-incompatible changes from Spark 2 -> 3 though, right? That combined with the fact that the API is pretty new still doesn't instill confidence in its stability (API wise I mean). Cheers, Lars On Fri, Jun 28, 2019 at 4:10 PM Gabor Somogyi w

Re: Spark Structured Streaming Custom Sources confusion

2019-06-28 Thread Gabor Somogyi
Hi Lars, DSv2 is already used in production. Documentation - well, since Spark is evolving fast I would take a look at how the built-in connectors are implemented. BR, G On Fri, Jun 28, 2019 at 3:52 PM Lars Francke wrote: > Gabor, > > thank you. That is immensely helpful. DataSource v1 it is then. Does

Re: Spark Structured Streaming Custom Sources confusion

2019-06-28 Thread Lars Francke
Gabor, thank you. That is immensely helpful. DataSource v1 it is then. Does that mean DSV2 is not really for production use yet? Any idea what the best documentation would be? I'd probably start by looking at existing code. Cheers, Lars On Fri, Jun 28, 2019 at 1:06 PM Gabor Somogyi wrote: > H

Re: Spark Structured Streaming Custom Sources confusion

2019-06-28 Thread Gabor Somogyi
Hi Lars, Since Structured Streaming doesn't support receivers at all so that source/sink can't be used. Data source v2 is under development and because of that it's a moving target so I suggest to implement it with v1 (unless special features are required from v2). Additionally since I've just ad

Spark Structured Streaming Custom Sources confusion

2019-06-25 Thread Lars Francke
Hi, I'm a bit confused about the current state and the future plans for custom data sources in Structured Streaming. For DStreams we could write a Receiver as documented. Can this be used with Structured Streaming? Then we had the DataSource API with DefaultSource et al. which was (in my opin

Re: Spark structured streaming leftOuter join not working as I expect

2019-06-11 Thread Jungtaek Lim
Got the point. If you would like to get "correct" output, you may need to set global watermark as "min", because watermark is not only used for evicting rows in state, but also discarding input rows later than watermark. Here you may want to be aware that there're two stateful operators which will
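
For context, the global watermark policy Jungtaek mentions is exposed as a session config (available since Spark 2.4); a one-line sketch:

    // "min" holds the global watermark back to the slowest input,
    // so rows still needed by the other branch aren't discarded.
    spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "min")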

Re: Spark structured streaming leftOuter join not working as I expect

2019-06-10 Thread Joe Ammann
Hi all, it took me some time to get the issues extracted into a piece of standalone code. I created the following gist https://gist.github.com/jammann/b58bfbe0f4374b89ecea63c1e32c8f17 It has messages for 4 topics A/B/C/D and a simple Python program which shows 6 use cases, with my expectations a

Does anyone used spark-structured streaming successfully in production ?

2019-06-10 Thread Shyam P
https://stackoverflow.com/questions/56428367/any-clue-how-to-join-this-spark-structured-stream-joins
