Re: [Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-08 Thread Mich Talebzadeh
You may consider - Increase Watermark Retention: increasing the watermark retention duration allows records to be kept for a longer period before being dropped. However, this might increase processing latency and violate at-least-once semantics if the watermark lags behind real time.
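A minimal PySpark sketch of that suggestion (broker, topic, and column names here are illustrative, not from the thread): widening the delay passed to withWatermark keeps late records eligible for aggregation longer, at the cost of more state and latency.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window, col

    spark = SparkSession.builder.appName("watermark-retention").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load()
              .selectExpr("CAST(value AS STRING) AS value", "timestamp"))

    # Widening the watermark from e.g. "10 minutes" to "2 hours" means records
    # up to 2 hours late are still aggregated instead of being dropped.
    counts = (events
              .withWatermark("timestamp", "2 hours")
              .groupBy(window(col("timestamp"), "10 minutes"))
              .count())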

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Mich Talebzadeh
Hi Karthick, Unfortunately materialised views are not available in Spark as yet. I raised Jira [SPARK-48117] Spark Materialized Views: Improve Query Performance and Data Management - ASF JIRA (apache.org) as a feature request. Let me think

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Karthick Nk
Thanks Mich, can you please confirm whether my understanding is correct? First, we have to create the materialized view based on the mapping details we have, using multiple tables as the source (since we have join conditions across different tables). From the materialised view we can stream the

Re: ********Spark streaming issue to Elastic data**********

2024-05-03 Thread Mich Talebzadeh
My recommendation: using materialized views (MVs) created in Hive with Spark Structured Streaming and Change Data Capture (CDC) is a good combination for efficiently streaming view data updates in your scenario. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI |

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Kidong Lee
Thanks, Mich, for your reply. I agree, it is not so scalable and efficient. But it works correctly for Kafka transactions, and there is no problem with committing offsets to Kafka asynchronously for now. Let me give you some more details about my streaming job. CustomReceiver does not receive anything

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Mich Talebzadeh
Interesting. My concern is the infinite loop in foreachRDD: the while(true) loop within foreachRDD creates an infinite loop within each Spark executor. This might not be the most efficient approach, especially since offsets are committed asynchronously. HTH Mich Talebzadeh, Technologist |

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Kidong Lee
Because spark streaming for kafka transactions does not work correctly to suit my need, I moved to another approach using a raw kafka consumer which handles read_committed messages from kafka correctly. My code looks like the following. JavaDStream stream = ssc.receiverStream(new CustomReceiver());

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-13 Thread Kidong Lee
Thank you Mich for your reply. Actually, I tried to follow most of your advice. When spark.streaming.kafka.allowNonConsecutiveOffsets=false, I got the following error: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 3)

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-13 Thread Mich Talebzadeh
Hi Kidong, There may be a few potential reasons why the message counts from your Kafka producer and Spark Streaming consumer might not match, especially with transactional messages and the read_committed isolation level. 1) Just ensure that both your Spark Streaming job and the Kafka consumer written
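As a first check on point 1, a sketch of pinning the isolation level on the Structured Streaming side so it matches a read_committed standalone consumer (assumes an active SparkSession named spark; broker and topic are placeholders):

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "tx-topic")
          # Kafka consumer properties pass through with the "kafka." prefix;
          # read_committed filters out records from aborted transactions.
          .option("kafka.isolation.level", "read_committed")
          .load())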

Re: [Spark streaming]: Microbatch id in logs

2023-06-26 Thread Mich Talebzadeh
In SSS: writeStream. \ outputMode('append'). \ option("truncate", "false"). \ foreachBatch(SendToBigQuery). \ option('checkpointLocation', checkpoint_path). \ so this writeStream will call
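On the original question (micro-batch id in logs): the function passed to foreachBatch receives the batch id as its second argument, so it can be logged there. A minimal sketch (df, the sink write, and checkpoint_path are placeholders):

    def send_to_bigquery(batch_df, batch_id):
        # batch_id is the micro-batch (epoch) id assigned by Structured
        # Streaming; logging it ties log lines to a specific micro-batch.
        print(f"micro-batch {batch_id}: {batch_df.count()} rows")
        # ... write batch_df to the sink here ...

    query = (df.writeStream
             .outputMode("append")
             .option("truncate", "false")
             .foreachBatch(send_to_bigquery)
             .option("checkpointLocation", checkpoint_path)
             .start())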

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Mich Talebzadeh
…evolving. So if anyone is interested, > please support the project. > -- > Lingzhe Sun > Hirain Technologies > From: Mich Talebzadeh > Date: 2023-04-11 02:06 > To: Rajesh Katkar > CC: user > Subject: Re: spark streaming and kinesis integration

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread 孙令哲
Hi Rajesh, It's working fine, at least for now. But you'll need to build your own spark image using later versions. Lingzhe Sun, Hirain Technologies. Original: From: Rajesh Katkar, Date: 2023-04-12 21:36:52, To: Lingzhe Sun, Cc: Mich Talebzadeh, user, Subject: Re: Re: spark streaming

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Yi Huang
>> From: Mich Talebzadeh >> Date: 2023-04-11 02:06 >> To: Rajesh Katkar >> CC: user >> Subject: Re: spark streaming and kinesis integration >> What I said was this >> "In so far as I know k8s does not support spark structured stre

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Rajesh Katkar
…is interested, > please support the project. > -- > Lingzhe Sun > Hirain Technologies > From: Mich Talebzadeh > Date: 2023-04-11 02:06 > To: Rajesh Katkar > CC: user > Subject: Re: spark streaming and kinesis integration

Re: Re: spark streaming and kinesis integration

2023-04-11 Thread Lingzhe Sun
for quite a long time. Kind of worried that this project might finally become outdated as k8s is evolving. So if anyone is interested, please support the project. Lingzhe Sun, Hirain Technologies. From: Mich Talebzadeh, Date: 2023-04-11 02:06, To: Rajesh Katkar, CC: user, Subject: Re: spark streaming

Re: spark streaming and kinesis integration

2023-04-10 Thread Mich Talebzadeh
Just to clarify, a major benefit of k8s in this case is to host your Spark applications in the form of containers in an automated fashion so that one can easily deploy as many instances of the application as required (autoscaling). From below:

Re: spark streaming and kinesis integration

2023-04-10 Thread Mich Talebzadeh
What I said was this "In so far as I know k8s does not support spark structured streaming?" So it is an open question. I just recalled it. I have not tested myself. I know structured streaming works on Google Dataproc cluster but I have not seen any official link that says Spark Structured

Re: spark streaming and kinesis integration

2023-04-10 Thread Rajesh Katkar
Do you have any link or ticket which justifies that k8s does not support spark streaming? On Thu, 6 Apr, 2023, 9:15 pm Mich Talebzadeh, wrote: > Do you have a high level diagram of the proposed solution? > > In so far as I know k8s does not support spark structured streaming? > > Mich

Re: spark streaming and kinesis integration

2023-04-06 Thread Rajesh Katkar
The use case is: we want to read/write to Kinesis streams using k8s. Officially I could not find a connector or reader for Kinesis in Spark like the one it has for Kafka. Checking here if anyone has used the Kinesis and Spark streaming combination? On Thu, 6 Apr, 2023, 7:23 pm Mich Talebzadeh, wrote: > Hi

RE: spark streaming and kinesis integration

2023-04-06 Thread Jonske, Kurt
…Katkar, Cc: u...@spark.incubator.apache.org, Subject: Re: spark streaming and kinesis integration. ⚠ [EXTERNAL EMAIL]: Use Caution. Do you have a high level diagram of the proposed solution? In so far as I know k8s does not support spark structured streaming? Mich Talebzadeh, Lead Solutions

Re: spark streaming and kinesis integration

2023-04-06 Thread Mich Talebzadeh
Do you have a high level diagram of the proposed solution? In so far as I know k8s does not support spark structured streaming? Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies London United Kingdom view my Linkedin profile

Re: spark streaming and kinesis integration

2023-04-06 Thread Mich Talebzadeh
Hi Rajesh, What is the use case for Kinesis here? I have not used it personally. Which use case does it concern? https://aws.amazon.com/kinesis/ Can you use something else instead? HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies London United Kingdom view

Re: Spark streaming

2022-08-20 Thread Gourav Sengupta
Hi, spark is just an unnecessary, overengineered overkill for that kind of a job. I know they are trying to make SPARK a one-stop solution for everything, but that is a marketing attempt to capture market share rather than the true-blue engineering creativity that led to the creation of SPARK - so

Re: [EXTERNAL] Re: Spark streaming

2022-08-20 Thread sandra sukumaran
…sandra sukumaran > Cc: user@spark.apache.org > Subject: [EXTERNAL] Re: Spark streaming > Caution! This email originated outside of FedEx. Please do not open attachments or click links from an unknown or suspicious origin. > https://github.com/allwefantasy/spark-binlog

Re: [EXTERNAL] Re: Spark streaming

2022-08-19 Thread Saurabh Gulati
You can also try out https://debezium.io/documentation/reference/0.10/connectors/mysql.html From: Ajit Kumar Amit Sent: 19 August 2022 14:30 To: sandra sukumaran Cc: user@spark.apache.org Subject: [EXTERNAL] Re: Spark streaming Caution! This email originated

Re: Spark streaming

2022-08-19 Thread Ajit Kumar Amit
https://github.com/allwefantasy/spark-binlog Sent from my iPhone > On 19 Aug 2022, at 5:45 PM, sandra sukumaran wrote: > Dear Sir, > Is there any possible method to fetch the MySQL database bin log with the help of spark streaming? > Kafka streaming is not applicable

Re: Spark streaming

2022-08-18 Thread ミユナ (alice)
> Dear sir, > I want to check the logs of MySQL database using spark streaming, can someone help me with those listening queries. > Thanks and regards > Akash P You can ingest the logs with fluent-bit into Kafka, then set up Spark to read records from Kafka by streaming.

Re: [EXTERNAL] Re: Spark streaming - Data Ingestion

2022-08-17 Thread Akash Vellukai
…Sent: 17 August 2022 16:53 > To: Akash Vellukai > Cc: user@spark.apache.org > Subject: [EXTERNAL] Re: Spark streaming - Data Ingestion > Caution! This email originated outside of FedEx. Please do not open attachments or click links from an unknown or suspicious origi

Re: [EXTERNAL] Re: Spark streaming - Data Ingestion

2022-08-17 Thread Gibson
…Regards >> From: Gibson >> Sent: 17 August 2022 16:53 >> To: Akash Vellukai >> Cc: user@spark.apache.org >> Subject: [EXTERNAL] Re: Spark streaming - Data Ingestion >> Caution! This email originat

Re: [EXTERNAL] Re: Spark streaming - Data Ingestion

2022-08-17 Thread Saurabh Gulati
…Hive. Regards. From: Gibson, Sent: 17 August 2022 16:53, To: Akash Vellukai, Cc: user@spark.apache.org, Subject: [EXTERNAL] Re: Spark streaming - Data Ingestion. Caution! This email originated outside of FedEx. Please do not open attachments or click links from an unknown or su

Re: Spark streaming - Data Ingestion

2022-08-17 Thread Gibson
If you have space for a message log like Kafka, then you should try: MySQL -> Kafka (via CDC) -> Spark (Structured Streaming) -> HDFS/S3/ADLS -> Hive On Wed, Aug 17, 2022 at 5:40 PM Akash Vellukai wrote: > Dear sir > > I have tried a lot on this; could you help me with this? > > Data ingestion from
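A sketch of just the Spark leg of that pipeline (topic, broker, and paths are placeholders; Debezium or a similar CDC tool would populate the Kafka topic):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mysql-cdc-sink").getOrCreate()

    cdc = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "mysql.mydb.orders")
           .load()
           .selectExpr("CAST(value AS STRING) AS payload", "timestamp"))

    # Land the change events as Parquet files that Hive can expose
    # as an external table.
    (cdc.writeStream
        .format("parquet")
        .option("path", "s3a://my-bucket/cdc/orders/")
        .option("checkpointLocation", "s3a://my-bucket/chk/orders/")
        .start()
        .awaitTermination())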

Re: Spark streaming pending microbatches queue max length

2022-07-13 Thread Anil Dasari
Retry. From: Anil Dasari, Date: Tuesday, July 12, 2022 at 3:42 PM, To: user@spark.apache.org, Subject: Spark streaming pending microbatches queue max length. Hello, Spark adds an entry to the pending microbatches queue at each periodic batch interval. Is there a config to set the max size for pending

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-20 Thread Xavier Gervilla
Thank you for the flatten function; it has more functionality than what I need for my project, but the examples (which were really, really useful) helped me find a solution. Instead of accessing the confidence and entity attributes (metadata.confidence and metadata.entity) I was accessing

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-20 Thread Bjørn Jørgensen
Glad to hear that it works :) Your dataframe is nested with map, array and struct. I'm using this function to flatten a nested dataframe to rows and columns. from pyspark.sql.types import * from pyspark.sql.functions import * def flatten_test(df, sep="_"): """Returns a flattened
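The quoted function is cut off above; as a rough stand-in (not Bjørn's original), a sketch of the same idea: keep expanding one struct/array/map level per pass until the schema is flat.

    from pyspark.sql.types import StructType, ArrayType, MapType
    from pyspark.sql.functions import col, explode_outer

    def flatten(df, sep="_"):
        while True:
            complex_cols = [f for f in df.schema.fields
                            if isinstance(f.dataType, (StructType, ArrayType, MapType))]
            if not complex_cols:
                return df
            f = complex_cols[0]
            if isinstance(f.dataType, StructType):
                # Promote each struct field to a top-level column.
                expanded = [col(f.name + "." + c.name).alias(f.name + sep + c.name)
                            for c in f.dataType.fields]
                df = df.select("*", *expanded).drop(f.name)
            elif isinstance(f.dataType, ArrayType):
                # One row per array element; nulls preserved.
                df = df.withColumn(f.name, explode_outer(col(f.name)))
            else:
                # MapType: explode into key/value columns.
                df = (df.select("*", explode_outer(col(f.name))
                                .alias(f.name + sep + "key", f.name + sep + "value"))
                        .drop(f.name))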

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-19 Thread Bjørn Jørgensen
https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet Change spark = sparknlp.start() to spark = sparknlp.start(spark32=True) On Tue, 19 Apr 2022 at 21:10, Bjørn Jørgensen wrote: > Yes, there are some that have that issue. > > Please open a new issue at >

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-19 Thread Bjørn Jørgensen
Yes, there are some that have that issue. Please open a new issue at https://github.com/JohnSnowLabs/spark-nlp/issues and they will help you. On Tue, 19 Apr 2022 at 20:33, Xavier Gervilla < xavier.gervi...@datapta.com> wrote: > Thank you for your advice, I had small knowledge of Spark NLP and

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-19 Thread Jungtaek Lim
I have no context on ML, but your "streaming" query exposes the possibility of memory issues. >>> flattenedNER.registerTempTable("df") >>> querySelect = "SELECT col as entity, avg(sentiment) as sentiment, count(col) as count FROM df GROUP BY col" >>> finalDF =
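One common way out (a sketch; the timestamp column is hypothetical, the other names come from the quoted query) is to bound the state with a watermark and a time window instead of an unbounded global GROUP BY:

    from pyspark.sql.functions import window, avg, count, col

    # State for windows older than the watermark can be dropped, so memory
    # no longer grows without bound as new groups arrive.
    bounded = (flattenedNER
               .withWatermark("timestamp", "10 minutes")
               .groupBy(window(col("timestamp"), "5 minutes"), col("col"))
               .agg(avg("sentiment").alias("sentiment"),
                    count("col").alias("count")))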

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-18 Thread Bjørn Jørgensen
When did SpaCy have support for Spark? Try Spark NLP; it's made for Spark. They have a lot of notebooks at https://github.com/JohnSnowLabs/spark-nlp and they publish user guides at

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-18 Thread Sean Owen
It looks good; are you sure it even starts? The problem I see is that you send a copy of the model from the driver for every task. Try broadcasting the model instead. I'm not sure if that resolves it, but it would be a good practice. On Mon, Apr 18, 2022 at 9:10 AM Xavier Gervilla wrote: > Hi Team,
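A minimal sketch of the broadcast suggestion (spark, model, and rdd are assumed to exist; the scoring call is a stand-in):

    bc_model = spark.sparkContext.broadcast(model)  # shipped once per executor

    def score_partition(rows):
        m = bc_model.value          # read from the broadcast, not the task closure
        for row in rows:
            yield m.predict(row)    # stand-in for the real inference call

    predictions = rdd.mapPartitions(score_partition)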

Re: Spark Streaming | Dynamic Action Support

2022-03-03 Thread Mich Talebzadeh
In short, I don't think there is such a possibility. However, there is the option of shutting down spark gracefully with the checkpoint directory enabled. That way you can re-submit the modified code, which will pick up the BatchID from where it left off, assuming the topic is the same. See the

Re: Spark Streaming with Files

2021-04-30 Thread muru
Yes, set trigger(once=True) on all streaming sources and it will be treated as batch mode. Then you can use any scheduler (e.g. Airflow) to run it on whatever time window. With checkpointing, the next run will start processing files from the last checkpoint. On Fri, Apr 23, 2021 at 8:13 AM Mich
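A sketch of that pattern (paths and schema are placeholders): each scheduled run picks up only the files that arrived since the last checkpointed run, then exits.

    query = (spark.readStream
             .schema(input_schema)          # file sources need an explicit schema
             .json("s3a://bucket/incoming/")
             .writeStream
             .format("parquet")
             .option("path", "s3a://bucket/output/")
             .option("checkpointLocation", "s3a://bucket/chk/")
             .trigger(once=True)            # process what's available, then stop
             .start())
    query.awaitTermination()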

Re: Spark Streaming non functional requirements

2021-04-27 Thread Mich Talebzadeh
Forgot to add under non-functional requirements, under the heading Supportability and Maintainability: someone queried the other day how to shut down a streaming job gracefully, meaning wait until such time as the "current queue", including backlog, is drained and all processing is completed.

Re: Spark Streaming non functional requirements

2021-04-27 Thread ashok34...@yahoo.com.INVALID
Hello Mich Thank you for your great explanation. Best A. On Tuesday, 27 April 2021, 11:25:19 BST, Mich Talebzadeh wrote: Hi, Any design (in whatever framework) needs to consider both Functional and non-functional requirements. Functional requirements are those which are related to

Re: Spark Streaming non functional requirements

2021-04-27 Thread Mich Talebzadeh
Hi, Any design (in whatever framework) needs to consider both Functional and non-functional requirements. Functional requirements are those which are related to the technical functionality of the system that we cover daily in this forum. The non-functional requirement is a requirement that

Re: [Spark-Streaming] moving average on categorical data with time windowing

2021-04-26 Thread Sean Owen
You might be able to do this with multiple aggregations on avg(col("col1") == "cat1") etc., but how about pivoting the DataFrame first so that you get columns like "cat1" being 1 or 0? You would end up with (columns x categories) new columns if you want to count all categories in all cols. But then
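A sketch of the pivot idea (column and category names are hypothetical): after pivoting, a moving average over each per-category count column is the rolling frequency of that category.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    pivoted = (df.groupBy("event_time")
                 .pivot("col1", ["cat1", "cat2"])   # one count column per category
                 .count()
                 .na.fill(0))

    w = Window.orderBy("event_time").rowsBetween(-9, 0)  # 10-row moving window
    result = pivoted.select(
        "event_time",
        F.avg("cat1").over(w).alias("cat1_ma"),
        F.avg("cat2").over(w).alias("cat2_ma"))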

Re: Spark Streaming with Files

2021-04-23 Thread Mich Talebzadeh
Interesting. If we go back to classic Lambda architecture on premise, you could use the Flume API to Kafka to add files to HDFS on a time-series basis. Most high-end CDC vendors do exactly that. Oracle GoldenGate (OGG) classic gets data from Oracle redo logs and sends it to subscribers. One can deploy OGG

Re: Spark streaming giving error for version 2.4

2021-03-16 Thread Attila Zsolt Piros
Hi! I am just guessing here (as Gabor said before we need more information / logs): But is it possible Renu that you just upgraded one single jar? Best Regards, Attila On Tue, Mar 16, 2021 at 11:31 AM Gabor Somogyi wrote: > Well, this is not much. Please provide driver and executor logs... >

Re: Spark streaming giving error for version 2.4

2021-03-16 Thread Gabor Somogyi
Well, this is not much. Please provide driver and executor logs... G On Tue, Mar 16, 2021 at 6:03 AM Renu Yadav wrote: > Hi Team, > > > I have upgraded my spark streaming from 2.2 to 2.4 but getting below error: > > > spark-streaming-kafka_0-10.2.11_2.4.0 > > > scala 2.11 > > > Any Idea? > >

Re: Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread forece85
Not sure if Kinesis has such flexibility. What other possibilities are there at the transformations level?

Re: Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread forece85
Any example for this, please?

Re: Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread Sean Owen
You can also group by the key in the transformation on each batch. But yes that's faster/easier if it's already partitioned that way. On Tue, Mar 9, 2021 at 7:30 AM Ali Gouta wrote: > Do not know Kenesis, but it looks like it works like kafka. Your producer > should implement a paritionner that

Re: Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread Ali Gouta
I do not know Kinesis, but it looks like it works like Kafka. Your producer should implement a partitioner that makes it possible to send your data with the same key to the same partition. Then, each task in your spark streaming app will load data from the same partition in the same executor. I

Re: Spark streaming with Kafka

2020-11-03 Thread Kevin Pis
Hi, this is my Word Count demo. https://github.com/kevincmchen/wordcount On Wed, 4 Nov 2020 at 3:32 AM, MohitAbbi wrote: > Hi, > > Can you please share the correct versions of the JAR files which you used to > resolve the issue. I'm also facing the same issue. > > Thanks

Re: Spark streaming with Kafka

2020-11-03 Thread MohitAbbi
Hi, Can you please share the correct versions of the JAR files which you used to resolve the issue? I'm also facing the same issue. Thanks

Re: Spark Streaming Job is stucked

2020-10-18 Thread Artemis User
If it was running fine before and stops working now, one thing I can think of is that your disk was full. Checking your disk space and cleaning up your old log files might help... On 10/18/20 12:06 PM, rajat kumar wrote: Hello Everyone, My spark streaming job is running too slow, it is having

Re: Spark Streaming ElasticSearch

2020-10-06 Thread jainshasha
Hi Siva, in that case you can use the structured streaming foreach / foreachBatch functions, which can help you process each record and write it into some sink
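A sketch of the foreachBatch route (index name, host, and the per-row change are placeholders), using the elasticsearch-hadoop connector for the actual write:

    from pyspark.sql import functions as F

    def write_to_es(batch_df, batch_id):
        # Per-row changes happen here, before the batch write to ES.
        transformed = batch_df.withColumn("ingested_at", F.current_timestamp())
        (transformed.write
            .format("org.elasticsearch.spark.sql")
            .option("es.nodes", "es-host:9200")
            .mode("append")
            .save("my-index"))

    (df.writeStream
       .foreachBatch(write_to_es)
       .option("checkpointLocation", "/tmp/es-chk")
       .start())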

Re: Spark Streaming ElasticSearch

2020-10-05 Thread Siva Samraj
Hi Jainshasha, I need to read each row from the Dataframe and make some changes to it before inserting it into ES. Thanks Siva On Mon, Oct 5, 2020 at 8:06 PM jainshasha wrote: > Hi Siva > > To emit data into ES using a spark structured streaming job you need to use > an ElasticSearch jar which has

Re: Spark Streaming ElasticSearch

2020-10-05 Thread jainshasha
Hi Siva, To emit data into ES using a spark structured streaming job you need to use an ElasticSearch jar which has sink support for spark structured streaming jobs. For this you can use my branch, where we have integrated ES with spark 3.0, compatible with scala 2.12

Re: Spark Streaming Checkpointing

2020-09-04 Thread András Kolbert
Hi Gábor, Thanks for your reply on this! That's what is used internally at the company I work at - it hasn't been changed, mainly due to compatibility with the currently deployed java applications. Hence I am attempting to make the most of this version :) András On Fri, 4 Sep 2020, 14:09 Gabor

Re: Spark Streaming Checkpointing

2020-09-04 Thread Gabor Somogyi
Hi Andras, A general suggestion is to use Structured Streaming instead of DStreams because it provides several things out of the box (stateful streaming, etc...). Kafka 0.8 is super old and deprecated (no security...). Do you have a specific reason to use that? BR, G On Thu, Sep 3, 2020 at

Re: Spark Streaming with Kafka and Python

2020-08-12 Thread Sean Owen
What supports Python in (Kafka?) 0.8? I don't think Spark ever had a specific Python-Kafka integration. But you have always been able to use it to read DataFrames as in Structured Streaming. Kafka 0.8 support is deprecated (gone in 3.0) but 0.10 means 0.10+ - works with the latest 2.x. What is the
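For reference, the Structured Streaming path from Python (broker and topic are placeholders; requires the spark-sql-kafka-0-10 package on the classpath):

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "my-topic")
          .option("startingOffsets", "latest")
          .load()
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"))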

Re: Spark Streaming with Kafka and Python

2020-08-12 Thread German Schiavon
Hey, Maybe I'm missing some restriction with EMR, but have you tried to use Structured Streaming instead of Spark Streaming? https://spark.apache.org/docs/2.4.5/structured-streaming-kafka-integration.html Regards On Wed, 12 Aug 2020 at 14:12, Hamish Whittal wrote: > Hi folks, > > Thought I

Re: Spark streaming receivers

2020-08-10 Thread Russell Spitzer
The direct approach, which is also available through dstreams, and structured streaming use a different model. Instead of being a push based streaming solution they instead are pull based. (In general) On every batch the driver uses the configuration to create a number of partitions, each is

Re: Spark streaming receivers

2020-08-09 Thread Dark Crusader
Hi Russell, This is super helpful. Thank you so much. Can you elaborate on the differences between structured streaming vs dstreams? How would the number of receivers required etc change? On Sat, 8 Aug, 2020, 10:28 pm Russell Spitzer, wrote: > Note, none of this applies to Direct streaming

Re: Spark streaming receivers

2020-08-08 Thread Russell Spitzer
Note, none of this applies to Direct streaming approaches, only receiver based Dstreams. You can think of a receiver as a long running task that never finishes. Each receiver is submitted to an executor slot somewhere, it then runs indefinitely and internally has a method which passes records

Re: Spark Streaming - Set Parallelism and Optimize driver

2020-07-20 Thread Russell Spitzer
Without seeing the rest (and you can confirm this by looking at the DAG visualization in the Spark UI) I would say your first stage with 6 partitions is: Stage 1: Read data from kinesis (or read blocks from receiver not sure what method you are using) and write shuffle files for repartition Stage

Re: Spark Streaming - Set Parallelism and Optimize driver

2020-07-20 Thread forece85
Thanks for the reply. Please find pseudo code below. It's DStreams reading every 10 secs from a Kinesis stream and, after transformations, pushing into HBase. Once we have the DStream, we use the below code to repartition and do processing: dStream = dStream.repartition(javaSparkContext.defaultMinPartitions()

Re: Spark Streaming - Set Parallelism and Optimize driver

2020-07-20 Thread forece85
Thanks for the reply. Please find pseudo code below. We fetch DStreams from a Kinesis stream every 10 secs, perform transformations, and finally persist to HBase tables using batch insertions. dStream = dStream.repartition(jssc.defaultMinPartitions() * 3); dStream.foreachRDD(javaRDD ->

Re: Spark Streaming - Set Parallelism and Optimize driver

2020-07-20 Thread Russell Spitzer
Without your code this is hard to determine but a few notes. The number of partitions is usually dictated by the input source, see if it has any configuration which allows you to increase input splits. I'm not sure why you think some of the code is running on the driver. All methods on

Re: Spark streaming with Confluent kafka

2020-07-03 Thread Gabor Somogyi
The error is clear: Caused by: java.lang.IllegalArgumentException: No serviceName defined in either JAAS or Kafka config On Fri, 3 Jul 2020, 15:40 dwgw, wrote: > Hi > > I am trying to stream confluent kafka topic in the spark shell. For that i > have invoked spark shell using following command.
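For a Kerberos (GSSAPI) secured broker, one way to supply the missing service name from the Spark side (values are placeholders; the "kafka." prefix passes the property straight to the underlying Kafka consumer):

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9093")
          .option("subscribe", "secure-topic")
          .option("kafka.security.protocol", "SASL_SSL")
          .option("kafka.sasl.kerberos.service.name", "kafka")  # the missing serviceName
          .load())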

Re: Spark streaming with Kafka

2020-07-02 Thread dwgw
Hi, I was able to correct the issue. It was due to a wrong version of a JAR file I had used. I removed these JAR files and copied the correct versions, and the error has gone away. Regards

Re: Spark streaming with Kafka

2020-07-02 Thread Jungtaek Lim
I can't reproduce. Could you please make sure you're running spark-shell with official spark 3.0.0 distribution? Please try out changing the directory and using relative path like "./spark-shell". On Thu, Jul 2, 2020 at 9:59 PM dwgw wrote: > Hi > I am trying to stream kafka topic from spark

Re: Spark Streaming Memory

2020-05-17 Thread Ali Gouta
The spark UI is misleading in spark 2.4.4. I moved to spark 2.4.5 and it fixed it. Now, your problem should be somewhere else. Probably related to memory consumption but not the one you see in the UI. Best regards, Ali Gouta. On Sun, May 17, 2020 at 7:36 PM András Kolbert wrote: > Hi, > > I

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-03 Thread Jungtaek Lim
Replied inline: On Sun, May 3, 2020 at 6:25 PM Magnus Nilsson wrote: > Thank you, so that would mean spark gets the current latest offset(s) when > the trigger fires and then process all available messages in the topic upto > and including that offset as long as maxOffsetsPerTrigger is the

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-03 Thread Magnus Nilsson
Thank you, so that would mean spark gets the current latest offset(s) when the trigger fires and then processes all available messages in the topic up to and including that offset, as long as maxOffsetsPerTrigger is the default of None (or large enough to handle all available messages). I think the

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-02 Thread Jungtaek Lim
If I understand correctly, Trigger.once executes only one micro-batch and terminates, that's all. Your understanding of structured streaming applies there as well. It's like a hybrid approach: bringing incremental processing from micro-batching, but with the processing interval of a batch job. That said,
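For reference, the knob being discussed (broker/topic are placeholders): without it, each trigger processes everything up to the offsets snapshotted when it fired.

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .option("maxOffsetsPerTrigger", 100000)  # cap records per micro-batch
          .load())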

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-02 Thread Magnus Nilsson
I've always had a question about Trigger.Once that I never got around to ask or test for myself. If you have a 24/7 stream to a Kafka topic, will Trigger.Once get the last offset(s) when it starts and then quit once it hits those offset(s), or will the job run until no new messages are added to the

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-01 Thread Rishi Shah
Thanks Burak! Appreciate it. This makes sense. How do you suggest we make sure the resulting data doesn't produce tiny files, if we are not on databricks yet and can not leverage delta lake features? Also, for the checkpointing feature, do you have an active blog/article I can take a look at to try out an

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-01 Thread Burak Yavuz
Hi Rishi, That is exactly why Trigger.Once was created for Structured Streaming. The way we look at streaming is that it doesn't have to be always real time, or 24-7 always on. We see streaming as a workflow that you have to repeat indefinitely. See this blog post for more details!

Re: Spark Streaming not working

2020-04-14 Thread Gerard Maas
Hi, Could you share the code that you're using to configure the connection to the Kafka broker? This is a bread-and-butter feature. My first thought is that there's something in your particular setup that prevents this from working. kind regards, Gerard. On Fri, Apr 10, 2020 at 7:34 PM

Re: Spark Streaming not working

2020-04-14 Thread Gabor Somogyi
…in task 0.5 in stage 0.0 (TID 24): java.lang.AssertionError: assertion failed: Failed to get records for spark-executor-service-spark-ingestion dice-ingestion 11 0 after polling for 12 Cheers, -z

Re: Spark Streaming not working

2020-04-14 Thread Gabor Somogyi
…in task 0.5 in stage 0.0 (TID 24): java.lang.AssertionError: assertion failed: Failed to get records for spark-executor-service-spark-ingestion dice-ingestion 11 0 after polling for 12 Cheers, -z From: Debabrata Ghosh

Re: Spark Streaming not working

2020-04-14 Thread ZHANG Wei
…stage 0.0 (TID 24): java.lang.AssertionError: assertion failed: Failed to get records for spark-executor-service-spark-ingestion dice-ingestion 11 0 after polling for 12 Cheers, -z From: Debabrata Ghosh, Sent: Saturday, April 11, 2020 2:25, To: user, Subjec

Re: Spark Streaming not working

2020-04-10 Thread Debabrata Ghosh
Any solution, please? On Fri, Apr 10, 2020 at 11:04 PM Debabrata Ghosh wrote: > Hi, > I have a spark streaming application where Kafka is producing > records but unfortunately spark streaming isn't able to consume those. > > I am hitting the following error: > > 20/04/10 17:28:04 ERROR

Re: Spark Streaming not working

2020-04-10 Thread Chenguang He
unsubscribe

Re: Spark Streaming not working

2020-04-10 Thread Debabrata Ghosh
Yes, the Kafka producer is producing records from the same host - I rechecked the Kafka connection and the connection is there. I came across this URL but am unable to understand it: https://stackoverflow.com/questions/42264669/spark-streaming-assertion-failed-failed-to-get-records-for-spark-executor-a-gro

Re: Spark Streaming not working

2020-04-10 Thread Srinivas V
Check if your broker details are correct, verify if you have network connectivity to your client box and Kafka broker server host. On Fri, Apr 10, 2020 at 11:04 PM Debabrata Ghosh wrote: > Hi, > I have a spark streaming application where Kafka is producing > records but unfortunately

Re: Spark Streaming on Compact Kafka topic - consumes 1 message per partition per batch

2020-04-08 Thread Hrishikesh Mishra
It seems I found the issue. The actual problem is something related to back pressure. When I add either of these configs, spark.streaming.kafka.maxRatePerPartition or spark.streaming.backpressure.initialRate (the values of these configs are 100), it starts consuming one message per partition
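For reference, the configs in question (DStreams API; the values are illustrative). Both are per-second, per-partition record caps, so a small value combined with a short batch interval yields only a handful of records per partition per batch:

    from pyspark import SparkConf

    conf = (SparkConf()
            .set("spark.streaming.backpressure.enabled", "true")
            # cap used before the rate controller has an estimate:
            .set("spark.streaming.backpressure.initialRate", "100")
            # hard per-partition ceiling, in records/sec:
            .set("spark.streaming.kafka.maxRatePerPartition", "100"))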

Re: Spark Streaming on Compact Kafka topic - consumes 1 message per partition per batch

2020-04-01 Thread Waleed Fateem
Well this is interesting. Not sure if this is the expected behavior. The log messages you have referenced are actually printed out by the Kafka Consumer itself (org.apache.kafka.clients.consumer.internals.Fetcher). That log message belongs to a new feature added starting with Kafka 1.1:

Re: Spark Streaming Code

2020-03-28 Thread Jungtaek Lim
To get any meaningful answers you may want to provide the information/context as much as possible. e.g. Spark version, which behavior/output was expected (and why you think) and how it behaves actually. On Sun, Mar 29, 2020 at 3:37 AM Siva Samraj wrote: > Hi Team, > > Need help on windowing &

Re: Spark Streaming with mapGroupsWithState

2020-03-02 Thread Something Something
I changed it to Tuple2 and that problem is solved. Any thoughts on this message: Unapplied methods are only converted to functions when a function type is expected. You can make this conversion explicit by writing `updateAcrossEvents _` or `updateAcrossEvents(_,_,_,_,_)` instead of

Re: Spark Streaming with mapGroupsWithState

2020-03-02 Thread lec ssmi
Maybe you can combine the fields you want to use into one field. On Tue, 3 Mar 2020 at 6:37 AM, Something Something wrote: > I am writing a Stateful Streaming application in which I am using > mapGroupsWithState to create aggregates for Groups, but I need to create > Groups based on more than one column in

Re: Spark Streaming: Aggregating values across batches

2020-02-27 Thread Tathagata Das
Use Structured Streaming. Its aggregation, by definition, is across batches. On Thu, Feb 27, 2020 at 3:17 PM Something Something < mailinglist...@gmail.com> wrote: > We've a Spark Streaming job that calculates some values in each batch. > What we need to do now is aggregate values across ALL
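A minimal sketch (events and its key column are placeholders): the groupBy state carries over between micro-batches, and "update" mode emits only the groups that changed in each batch.

    counts = events.groupBy("key").count()   # running counts across all batches

    (counts.writeStream
           .outputMode("update")
           .format("console")
           .option("checkpointLocation", "/tmp/agg-chk")
           .start())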

Re: spark streaming exception

2019-11-10 Thread Akshay Bhardwaj
Hi, Could you provide the code snippet of how you are connecting to and reading data from kafka? Akshay Bhardwaj +91-97111-33849 On Thu, Oct 17, 2019 at 8:39 PM Amit Sharma wrote: > Please update me if any one knows about it. > > Thanks > Amit > > On Thu, Oct 10, 2019 at 3:49 PM Amit

Re: spark streaming exception

2019-10-17 Thread Amit Sharma
Please update me if anyone knows about it. Thanks Amit On Thu, Oct 10, 2019 at 3:49 PM Amit Sharma wrote: > Hi, we have a spark streaming job to which we send a request through our UI > using kafka. It processes and returns the response. We are getting the below > error and this streaming is not

Re: [Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-06 Thread Sethupathi T
Gabor, Thanks for the clarification. Thanks On Fri, Sep 6, 2019 at 12:38 AM Gabor Somogyi wrote: > Sethupathi, > > Let me extract then the important part what I've shared: > > 1. "This ensures that each Kafka source has its own consumer group that > does not face interference from any other

Re: [Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-06 Thread Gabor Somogyi
Sethupathi, Let me extract the important part of what I've shared: 1. "This ensures that each Kafka source has its own consumer group that does not face interference from any other consumer" 2. Consumers may eat the data from each other, offset calculation may give back a wrong result (that's

Re: [Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-05 Thread Sethupathi T
Gabor, Thanks for the quick response and for sharing about spark 3.0. We need to use spark streaming (KafkaUtils.createDirectStream) rather than structured streaming, following this document https://spark.apache.org/docs/2.2.0/streaming-kafka-0-10-integration.html, and to re-iterate the issue for

Re: [Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-05 Thread Gabor Somogyi
Hi, Let me share the Spark 3.0 documentation part (Structured Streaming, not the DStreams you've mentioned, but still relevant): kafka.group.id (string, default: none, for streaming and batch) - The Kafka group id to use in the Kafka consumer while reading from Kafka. Use this with caution. By default, each query
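A sketch of the option being quoted (broker, topic, and group id are placeholders); it replaces the auto-generated spark-kafka-source-* group id, with the interference caveats described above:

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .option("kafka.group.id", "my-fixed-group")  # fixed group id, use with caution
          .load())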
