checkpoint file deletion

2023-06-29 Thread Lingzhe Sun
Hi all,

I am running a stateful structured streaming job, and I have set
spark.cleaner.referenceTracking.cleanCheckpoints to true
spark.cleaner.periodicGC.interval to 1min
in the config. But the checkpoint files are not deleted, and their number keeps
growing. Did I miss something?
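
For completeness, this is how the two settings above can be applied (a minimal sketch; setting them via --conf on spark-submit is equivalent, and the app name is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object StatefulJobConfig {
  def main(args: Array[String]): Unit = {
    // The two cleaner settings described above, set on the session builder.
    val spark = SparkSession.builder()
      .appName("StatefulStreamingJob") // hypothetical app name
      .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
      .config("spark.cleaner.periodicGC.interval", "1min")
      .getOrCreate()
  }
}
```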



Lingzhe Sun  
Hirain Technology


Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Lingzhe Sun
Hi Mich,

Spark 3.4.0 prebuilt with Scala 2.13 is built with version 2.13.8. Since you
are using spark-core_2.13 and spark-sql_2.13, you should stick to that exact
version, both the major part (13) and the minor part (8). Deviating from either
may cause unexpected behaviour (though Scala claims compatibility across
minor-version changes, I have encountered problems using a Scala package with
the same major version but a different minor version; that may be due to bug
fixes and upgrades of Scala itself). And although I have not encountered such a
problem here, this can be a pitfall for you.
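
One quick sanity check, for what it's worth: print the Scala library version that actually ends up on the classpath at runtime (a minimal sketch using only the standard library):

```scala
// Prints the version of the scala-library jar actually loaded at runtime,
// e.g. "version 2.13.8". Useful for spotting a mismatch between the version
// your build declares and the one Spark ships with.
object ScalaVersionCheck {
  def main(args: Array[String]): Unit = {
    println(scala.util.Properties.versionString)
  }
}
```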



Best Regards!
...
Lingzhe Sun 
Hirain Technology

 
From: Mich Talebzadeh
Date: 2023-05-29 17:55
To: Bjørn Jørgensen
CC: user @spark
Subject: Re: maven with Spark 3.4.0 fails compilation
Thanks for your helpful comments, Bjorn.

I managed to compile the code with Maven, but when I run it, it fails with:

  Application is ReduceByKey

Exception in thread "main" java.lang.NoSuchMethodError: scala.package$.Seq()Lscala/collection/immutable/Seq$;
    at ReduceByKey$.main(ReduceByKey.scala:23)
    at ReduceByKey.main(ReduceByKey.scala)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1020)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I attach the pom.xml; the sample Scala code is self-contained and basic.
Again, it runs with SBT with no issues.

FYI, my scala version on host is

 scala -version
Scala code runner version 2.13.6 -- Copyright 2002-2021, LAMP/EPFL and 
Lightbend, Inc.

I think I have a Scala incompatibility somewhere again.

Cheers


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom

   view my Linkedin profile

 https://en.everybodywiki.com/Mich_Talebzadeh
 
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction. 
 


On Sun, 28 May 2023 at 20:29, Bjørn Jørgensen  wrote:
From ChatGPT-4:


The problem appears to be that there is a mismatch between the version of Scala 
used by the Scala Maven plugin and the version of the Scala library defined as 
a dependency in your POM. You've defined your Scala version in your properties 
as `2.12.17` but you're pulling in `scala-library` version `2.13.6` as a 
dependency.

The Scala Maven plugin will be using the Scala version defined in the 
`scala.version` property for compilation, but then it tries to load classes 
from a different Scala version, hence the error.

To resolve this issue, make sure the `scala.version` property matches the 
version of `scala-library` defined in your dependencies. In your case, you may 
want to change `scala.version` to `2.13.6`.

Here's the corrected part of your POM:

```xml
<properties>
  <maven.compiler.source>1.7</maven.compiler.source>
  <maven.compiler.target>1.7</maven.compiler.target>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <scala.version>2.13.6</scala.version> <!-- updated to match scala-library -->
  <maven.scala.plugin.version>2.15.2</maven.scala.plugin.version>
</properties>
```

Additionally, ensure that the Scala versions in the Spark dependencies match 
the `scala.version` property as well. If you've updated the Scala version to 
`2.13.6`, the artifactIds for Spark dependencies should be `spark-core_2.13` 
and `spark-sql_2.13`. 
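
For illustration, the dependency section would then look something like this (a sketch only: the `provided` scope and the explicit `3.4.0` version literal are assumptions based on this thread, not taken from the attached POM):

```xml
<dependencies>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
  </dependency>
  <dependency>
    <!-- the _2.13 suffix must match the scala.version line above -->
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.13</artifactId>
    <version>3.4.0</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.13</artifactId>
    <version>3.4.0</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
```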

Another thing to consider: your Java version defined in `maven.compiler.source` 
and `maven.compiler.target` is `1.7`, which is quite outdated and might not be 
compatible with the latest versions of these libraries. Consider updating to a 
more recent version of Java, such as Java 8 or above, depending on the 
requirements of the libraries you're using.



The same problem persists in this updated POM file - there's a mismatch in the 
Scala version declared in the properties and the version used in your 
dependencies. Here's what you need to update:

1. Update the Scala version in your properties to match the Scala library and 
your Spark dependencies:

```xml
<properties>
  <maven.compiler.source>1.7</maven.compiler.source>
  <maven.compiler.target>1.7</maven.compiler.target>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <scala.version>2.13.6</scala.version>
  <maven.scala.plugin.version>2.15.2</maven.scala.plugin.version>
</properties>
```

2. Make sure all Spark dependency artifactIds match that Scala version, i.e.
`spark-core_2.13` and `spark-sql_2.13`.

Re: Re: spark streaming and kinesis integration

2023-04-11 Thread Lingzhe Sun
Hi Mich,

FYI, we have been using the spark
operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) to build
stateful structured streaming on k8s for a year. We haven't tested it the
non-operator way.

Besides that, the main contributor of the spark operator, Yinan Li, has been
inactive for quite a long time. We are kind of worried that this project might
eventually become outdated as k8s evolves. So if anyone is interested, please
support the project.



Lingzhe Sun
Hirain Technologies
 
From: Mich Talebzadeh
Date: 2023-04-11 02:06
To: Rajesh Katkar
CC: user
Subject: Re: spark streaming and kinesis integration
What I said was this
"In so far as I know k8s does not support spark structured streaming?"

So it is an open question; I just recalled it. I have not tested it myself. I
know structured streaming works on a Google Dataproc cluster, but I have not
seen any official link that says Spark Structured Streaming is supported on k8s.

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies
London
United Kingdom

   view my Linkedin profile

 https://en.everybodywiki.com/Mich_Talebzadeh
 
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction. 
 


On Mon, 10 Apr 2023 at 06:31, Rajesh Katkar  wrote:
Do you have any link or ticket which justifies that k8s does not support spark
streaming?

On Thu, 6 Apr, 2023, 9:15 pm Mich Talebzadeh,  wrote:
Do you have a high level diagram of the proposed solution?

In so far as I know k8s does not support spark structured streaming?

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies
London
United Kingdom

   view my Linkedin profile

 https://en.everybodywiki.com/Mich_Talebzadeh
 
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction. 
 


On Thu, 6 Apr 2023 at 16:40, Rajesh Katkar  wrote:
The use case is that we want to read/write Kinesis streams using k8s.
Officially, I could not find a connector or reader for Kinesis from Spark like
the one it has for Kafka.

Checking here if anyone has used the Kinesis and Spark streaming combination?

On Thu, 6 Apr, 2023, 7:23 pm Mich Talebzadeh,  wrote:
Hi Rajesh,

What is the use case for Kinesis here? I have not used it personally, so which
use case does it concern?

https://aws.amazon.com/kinesis/

Can you use something else instead?

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies
London
United Kingdom

   view my Linkedin profile

 https://en.everybodywiki.com/Mich_Talebzadeh
 
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction. 
 


On Thu, 6 Apr 2023 at 13:08, Rajesh Katkar  wrote:
Hi Spark Team,
We need to read/write Kinesis streams using Spark streaming.
We checked the official documentation:
https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
It does not mention a Kinesis connector for Structured Streaming. An alternative
is https://github.com/qubole/kinesis-sql, which is no longer active; it has been
handed over to https://github.com/roncemer/spark-sql-kinesis.
Also, according to SPARK-18165, Spark officially does not have any Kinesis
connector.
We have a few questions below; it would be great if you could answer them:
1. Does Spark officially provide any Kinesis connector that has
readStream/writeStream support, or endorse any connector for production use
cases?
2. The documentation
(https://spark.apache.org/docs/latest/streaming-kinesis-integration.html) does
not mention how to write to Kinesis. The method it describes uses DynamoDB as
the default checkpoint store; can we override that?
3. We use RocksDB as a state store, but when we ran an application using the
official integration above, the RocksDB configurations were not effective. Can
you please confirm whether RocksDB is not applicable in these cases?
4. RocksDB does, however, work with the qubole connector. Do you have any plan
to release a Kinesis connector?
Please help/recommend us a good, stable Kinesis connector or some pointers
around it.
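
For anyone searching the archives, here is a minimal sketch of what this looks like with the third-party spark-sql-kinesis connector linked above, combined with Spark's built-in RocksDB state store provider (available since Spark 3.2). The `kinesis` option names follow that connector's README and should be treated as assumptions to verify against the fork you actually depend on; the stream name and checkpoint path are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object KinesisReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KinesisReadSketch")
      // Built-in RocksDB state store (Spark 3.2+). Note this only affects
      // Structured Streaming stateful operators; it has no effect on the
      // DStream-based spark-streaming-kinesis-asl receiver.
      .config("spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
      .getOrCreate()

    // Source options as documented by the spark-sql-kinesis connector
    // (assumption: verify against the connector version you use). Credentials
    // come from the default AWS provider chain unless passed explicitly.
    val events = spark.readStream
      .format("kinesis")
      .option("streamName", "my-stream") // hypothetical stream name
      .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
      .option("startingposition", "LATEST")
      .load()

    events.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/kinesis-checkpoint") // hypothetical path
      .start()
      .awaitTermination()
  }
}
```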


Re: Re: spark+kafka+dynamic resource allocation

2023-01-29 Thread Lingzhe Sun
Hi Mich,

Thanks for the information. I think this open issue should have a higher
priority, as more and more streaming applications are being built using Spark.
I hope this becomes a feature in the near future.



BR
Lingzhe Sun
 
From: Mich Talebzadeh
Date: 2023-01-30 02:14
To: Lingzhe Sun
CC: ashok34...@yahoo.com; User
Subject: Re: Re: spark+kafka+dynamic resource allocation
Hi,

Spark Structured Streaming currently does not support dynamic allocation (see
SPARK-24815, "Structured Streaming should support dynamic allocation", which is
still open).

Autoscaling in Cloud offerings like Google Dataproc does not support Spark
Structured Streaming either.
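
For reference, these are the core dynamic allocation settings involved (a sketch; the flag names come from the job-scheduling documentation, the values are illustrative, and the point above is that the scale-down heuristics are not micro-batch aware):

```scala
import org.apache.spark.sql.SparkSession

object DynamicAllocationConfig {
  def main(args: Array[String]): Unit = {
    // Core dynamic allocation flags. These work for batch workloads; for
    // Structured Streaming, SPARK-24815 tracks making them streaming-aware.
    val spark = SparkSession.builder()
      .appName("DynAllocSketch") // hypothetical app name
      .config("spark.dynamicAllocation.enabled", "true")
      // Needed when no external shuffle service is available (e.g. on k8s).
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.maxExecutors", "10")
      .getOrCreate()
  }
}
```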

HTH

   view my Linkedin profile

 https://en.everybodywiki.com/Mich_Talebzadeh
 
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction. 
 


On Sun, 29 Jan 2023 at 01:49, Lingzhe Sun  wrote:
Thank you for the response, but the reference does not seem to answer any of
those questions.

BR
Lingzhe Sun
 
From: ashok34...@yahoo.com
Date: 2023-01-29 04:01
To: User; Lingzhe Sun
Subject: Re: spark+kafka+dynamic resource allocation
Hi,

Worth checking this link

https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation

On Saturday, 28 January 2023 at 06:18:28 GMT, Lingzhe Sun 
 wrote: 


Hi all,

I'm wondering if dynamic resource allocation works in spark+kafka streaming
applications. Here are some questions:
1. Will structured streaming be supported?
2. Is the number of consumers always equal to the number of partitions of the
subscribed topic (let's say there's only one topic)?
3. If consumers are evenly distributed across executors, will a newly added
executor (through dynamic resource allocation) trigger a consumer reassignment?
4. Would it simply be a bad idea to use dynamic resource allocation in a
streaming app, because there's no way to scale down the number of executors
unless no data is coming in?
Any thoughts are welcome.

Lingzhe Sun 
Hirain Technology


Re: Re: spark+kafka+dynamic resource allocation

2023-01-28 Thread Lingzhe Sun
Thank you for the response, but the reference does not seem to answer any of
those questions.

BR
Lingzhe Sun
 
From: ashok34...@yahoo.com
Date: 2023-01-29 04:01
To: User; Lingzhe Sun
Subject: Re: spark+kafka+dynamic resource allocation
Hi,

Worth checking this link

https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation

On Saturday, 28 January 2023 at 06:18:28 GMT, Lingzhe Sun 
 wrote: 


Hi all,

I'm wondering if dynamic resource allocation works in spark+kafka streaming
applications. Here are some questions:
1. Will structured streaming be supported?
2. Is the number of consumers always equal to the number of partitions of the
subscribed topic (let's say there's only one topic)?
3. If consumers are evenly distributed across executors, will a newly added
executor (through dynamic resource allocation) trigger a consumer reassignment?
4. Would it simply be a bad idea to use dynamic resource allocation in a
streaming app, because there's no way to scale down the number of executors
unless no data is coming in?
Any thoughts are welcome.

Lingzhe Sun 
Hirain Technology


spark+kafka+dynamic resource allocation

2023-01-27 Thread Lingzhe Sun
Hi all,

I'm wondering if dynamic resource allocation works in spark+kafka streaming
applications. Here are some questions:
1. Will structured streaming be supported?
2. Is the number of consumers always equal to the number of partitions of the
subscribed topic (let's say there's only one topic)?
3. If consumers are evenly distributed across executors, will a newly added
executor (through dynamic resource allocation) trigger a consumer reassignment?
4. Would it simply be a bad idea to use dynamic resource allocation in a
streaming app, because there's no way to scale down the number of executors
unless no data is coming in?
Any thoughts are welcome.

Lingzhe Sun 
Hirain Technology


Re: Re: should one ever make a spark streaming job in pyspark

2022-11-03 Thread Lingzhe Sun
In addition to that:

For now, some stateful operations in Structured Streaming don't have an
equivalent Python API, e.g. flatMapGroupsWithState. However, Spark engineers
are making this possible in an upcoming version. See more:
https://www.databricks.com/blog/2022/10/18/python-arbitrary-stateful-processing-structured-streaming.html
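
For readers, a minimal sketch of what the Scala side of this API looks like today: a per-key running count via flatMapGroupsWithState (the rate source, case classes, and names here are just for illustration):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(key: String)
case class KeyCount(key: String, count: Long)

object StatefulCountSketch {
  // Arbitrary stateful update function: keeps a Long count per key.
  def updateCounts(
      key: String,
      events: Iterator[Event],
      state: GroupState[Long]): Iterator[KeyCount] = {
    val newCount = state.getOption.getOrElse(0L) + events.size
    state.update(newCount)
    Iterator(KeyCount(key, newCount))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StatefulCountSketch")
      .master("local[*]") // local run for illustration
      .getOrCreate()
    import spark.implicits._

    // Hypothetical source; any streaming Dataset[Event] would do.
    val events: Dataset[Event] = spark.readStream
      .format("rate").load()
      .selectExpr("cast(value % 10 as string) as key").as[Event]

    val counts = events
      .groupByKey(_.key)
      .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(updateCounts)

    counts.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```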



Best Regards!
...
Lingzhe Sun 
Hirain Technology / APIC
 
From: Mich Talebzadeh
Date: 2022-11-03 19:15
To: Joris Billen
CC: User
Subject: Re: should one ever make a spark streaming job in pyspark
Well, your mileage varies, so to speak.

Spark itself is written in Scala. However, that does not imply you should stick
with Scala.
I have used both for Spark Streaming and Spark Structured Streaming; they both
work fine.
PySpark has become popular with the widespread use of Data Science projects.
What matters normally is the skill set you already have in-house. The
likelihood is that there are more Python developers than Scala developers, and
the learning curve for Scala has to be taken into account.
The idea of performance etc. is tangential.
With regard to the Spark code itself, there should be little effort in
converting from Scala to PySpark or vice versa (see the sketch below).
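
As a tiny illustration of that last point, here is the same aggregation in Scala with the PySpark equivalent in comments (a hypothetical example, runnable locally):

```scala
import org.apache.spark.sql.SparkSession

object ConversionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ConversionSketch")
      .master("local[*]") // local run for illustration
      .getOrCreate()
    import spark.implicits._

    val df = Seq("a", "b", "a").toDF("category")

    // Scala:
    val counts = df.groupBy("category").count()
    // PySpark equivalent, near line-for-line:
    //   counts = df.groupBy("category").count()

    counts.show()
  }
}
```
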
HTH

   view my Linkedin profile

  
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction. 
 


On Wed, 2 Nov 2022 at 08:54, Joris Billen  wrote:
Dear community,
I had a general question about the use of Scala vs PySpark for Spark streaming.
I believe Spark streaming will work most efficiently when written in Scala, but
I believe things can also be implemented in PySpark. My questions:
1) Is it completely dumb to make a streaming job in PySpark?
2) What are the technical reasons that it is done best in Scala (is it easy to
understand why)?
3) Any good links anyone has seen with numbers on the performance difference,
and under what circumstances, with explanation?
4) Are there certain scenarios where the use of PySpark can be motivated (maybe
when someone doesn't feel comfortable writing a job in Scala and the number of
messages per minute isn't gigantic, so performance isn't that crucial)?

Thanks for any input!
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org