Re: Which version of spark version supports parquet version 2 ?

2024-04-18 Thread Bjørn Jørgensen
dge but of course cannot be guaranteed . It is 
>>>>>>>>>>> essential to
>>>>>>>>>>> note that, as with any advice, quote "one test result is worth
>>>>>>>>>>> one-thousand expert opinions (Werner
>>>>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>>>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, 15 Apr 2024 at 21:33, Prem Sahoo 
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thank you so much for the info! But do we have any release
>>>>>>>>>>>> notes where it says Spark 2.4.0 onwards supports Parquet version 2?
>>>>>>>>>>>> I was under the impression that support started with Spark 3.0.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Apr 15, 2024 at 4:28 PM Mich Talebzadeh <
>>>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Well if I am correct, Parquet version 2 support was introduced
>>>>>>>>>>>>> in Spark version 2.4.0. Therefore, any version of Spark starting 
>>>>>>>>>>>>> from 2.4.0
>>>>>>>>>>>>> supports Parquet version 2. Assuming that you are using Spark 
>>>>>>>>>>>>> version
>>>>>>>>>>>>> 2.4.0 or later, you should be able to take advantage of Parquet 
>>>>>>>>>>>>> version 2
>>>>>>>>>>>>> features.
>>>>>>>>>>>>>
>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>>>>> Technologist | Solutions Architect | Data Engineer  |
>>>>>>>>>>>>> Generative AI
>>>>>>>>>>>>> London
>>>>>>>>>>>>> United Kingdom
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>view my Linkedin profile
>>>>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Disclaimer:* The information provided is correct to the best
>>>>>>>>>>>>> of my knowledge but of course cannot be guaranteed . It is 
>>>>>>>>>>>>> essential to
>>>>>>>>>>>>> note that, as with any advice, quote "one test result is
>>>>>>>>>>>>> worth one-thousand expert opinions (Werner
>>>>>>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>>>>>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, 15 Apr 2024 at 20:53, Prem Sahoo 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you for the information!
>>>>>>>>>>>>>> I can use any version of parquet-mr to produce a Parquet file.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regarding the 2nd question:
>>>>>>>>>>>>>> Which version of Spark supports Parquet version 2?
>>>>>>>>>>>>>> May I get the release notes where Parquet versions are
>>>>>>>>>>>>>> mentioned?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Apr 15, 2024 at 2:34 PM Mich Talebzadeh <
>>>>>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Parquet-mr is a Java library that provides functionality
>>>>>>>>>>>>>>> for working with Parquet files in Hadoop. It is therefore geared
>>>>>>>>>>>>>>> towards the Hadoop ecosystem, particularly MapReduce jobs. There
>>>>>>>>>>>>>>> is no definitive way to check exact compatible versions within
>>>>>>>>>>>>>>> the library itself. However, you can have a look at this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/master/CHANGES.md
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>>>>>>> Technologist | Solutions Architect | Data Engineer  |
>>>>>>>>>>>>>>> Generative AI
>>>>>>>>>>>>>>> London
>>>>>>>>>>>>>>> United Kingdom
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>view my Linkedin profile
>>>>>>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Disclaimer:* The information provided is correct to the
>>>>>>>>>>>>>>> best of my knowledge but of course cannot be guaranteed . It is 
>>>>>>>>>>>>>>> essential
>>>>>>>>>>>>>>> to note that, as with any advice, quote "one test result is
>>>>>>>>>>>>>>> worth one-thousand expert opinions (Werner
>>>>>>>>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>>>>>>>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, 15 Apr 2024 at 18:59, Prem Sahoo <
>>>>>>>>>>>>>>> prem.re...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello Team,
>>>>>>>>>>>>>>>> May I know how to check which version of Parquet is
>>>>>>>>>>>>>>>> supported by parquet-mr 1.2.1?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Which version of parquet-mr supports Parquet version 2
>>>>>>>>>>>>>>>> (V2)?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Which version of Spark supports Parquet version 2?
>>>>>>>>>>>>>>>> May I get the release notes where Parquet versions are
>>>>>>>>>>>>>>>> mentioned?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297
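
In the spirit of the "one test result is worth one-thousand expert opinions"
maxim above, this is easy to check empirically. A minimal PySpark sketch,
assuming only that the build honours parquet-mr's parquet.writer.version
property ("v1"/"v2"); the output path and the parquet-cli inspection step are
illustrative assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-v2-check").getOrCreate()

# Ask the underlying parquet-mr writer for format version 2 data pages.
spark.sparkContext._jsc.hadoopConfiguration().set("parquet.writer.version", "v2")

spark.range(1000).write.mode("overwrite").parquet("/tmp/parquet_v2_check")

# Inspect a part file with parquet-cli (assumed installed), e.g.:
#   parquet pages /tmp/parquet_v2_check/part-*.parquet
# V2 data pages are listed as DATA_PAGE_V2.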


Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
I think this answers your question about what to do if you need more space
on nodes.

https://spark.apache.org/docs/latest/running-on-kubernetes.html#local-storage

Local Storage
<https://spark.apache.org/docs/latest/running-on-kubernetes.html#local-storage>

Spark supports using volumes to spill data during shuffles and other
operations. To use a volume as local storage, the volume’s name should
start with spark-local-dir-, for example:

--conf spark.kubernetes.driver.volumes.[VolumeType].spark-local-dir-[VolumeName].mount.path=<mount path>
--conf spark.kubernetes.driver.volumes.[VolumeType].spark-local-dir-[VolumeName].mount.readOnly=false

Specifically, you can use persistent volume claims if the jobs require
large shuffle and sorting operations in executors.

spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=gp
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=500Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false

To enable the shuffle data recovery feature via the built-in
KubernetesLocalDiskShuffleDataIO plugin, we need the following.
You may additionally want to enable
spark.kubernetes.driver.waitToReusePersistentVolumeClaim.

spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data/spark-x/executor-x
spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO

If no volume is set as local storage, Spark uses temporary scratch space to
spill data to disk during shuffles and other operations. When using
Kubernetes as the resource manager, the pods will be created with an emptyDir
<https://kubernetes.io/docs/concepts/storage/volumes/#emptydir> volume
mounted for each directory listed in spark.local.dir or the environment
variable SPARK_LOCAL_DIRS. If no directories are explicitly specified then
a default directory is created and configured appropriately.

emptyDir volumes use the ephemeral storage feature of Kubernetes and do not
persist beyond the life of the pod.
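
Put together, a hedged spark-submit sketch of the PVC-backed local storage
described above; the API server address and application file are placeholders,
while the conf keys and values are the ones quoted above:

spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=gp \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=500Gi \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false \
  local:///opt/spark/examples/src/main/python/pi.py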

tor. 11. apr. 2024 kl. 10:29 skrev Bjørn Jørgensen :

> " In the end for my usecase I started using pvcs and pvc aware scheduling
> along with decommissioning. So far performance is good with this choice."
> How did you do this?
>
>
> tor. 11. apr. 2024 kl. 04:13 skrev Arun Ravi :
>
>> Hi Everyone,
>>
>> I had explored IBM's and AWS's S3 shuffle plugins (some time back), and I
>> had also explored AWS FSx for Lustre in a few of my production jobs, which
>> have ~20TB of shuffle operations with 200-300 executors. What I observed is
>> that S3 and FSx behaviour was fine during the write phase; however, I faced
>> IOPS throttling during the read phase (reads taking forever to complete). I
>> think this might be caused by the heavy use of the shuffle index file (I
>> didn't perform any extensive research on this), so I believe the shuffle
>> manager logic has to be intelligent enough to reduce the fetching of files
>> from the object store. In the end, for my use case I started using PVCs and
>> PVC-aware scheduling along with decommissioning. So far performance is good
>> with this choice.
>>
>> Thank you
>>
>> On Tue, 9 Apr 2024, 15:17 Mich Talebzadeh, 
>> wrote:
>>
>>> Hi,
>>>
>>> First thanks everyone for their contributions
>>>
>>> I was going to reply to @Enrico Minack but noticed additional info. As I
>>> understand it, for example, Apache Uniffle is an incubating project aimed
>>> at providing a pluggable shuffle service for Spark. So basically, what all
>>> these "external shuffle services" have in common is offloading shuffle data
>>> management to external services, thus reducing the memory and CPU overhead
>>> on Spark executors. That is great. While Uniffle and others enhance shuffle
>>> performance and scalability, it would be great to integrate them with the
>>> Spark UI. This may require additional development effort. I suppose the
>>> interest would be to have these external metrics incorporated into Spark
>>> with one look and feel. This may require customizing the UI to fetch and
>>> display metrics or statistics from the external shuffle services. Has any
>>> project done this?
>>>
>>> Thanks
>>>
>>> Mich Talebzadeh,
>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>>

Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
>>> "service_account_name",
>>>>> "spark.hadoop.fs.gs.impl":
>>>>> "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
>>>>> "spark.hadoop.google.cloud.auth.service.account.enable": "true",
>>>>> "spark.hadoop.google.cloud.auth.service.account.json.keyfile":
>>>>> "/path/to/keyfile.json",
>>>>> }
>>>>>
>>>>> For Amazon S3 similar
>>>>>
>>>>> spark_config_s3 = {
>>>>> "spark.kubernetes.authenticate.driver.serviceAccountName":
>>>>> "service_account_name",
>>>>> "spark.hadoop.fs.s3a.impl":
>>>>> "org.apache.hadoop.fs.s3a.S3AFileSystem",
>>>>> "spark.hadoop.fs.s3a.access.key": "s3_access_key",
>>>>> "spark.hadoop.fs.s3a.secret.key": "secret_key",
>>>>> }
>>>>>
>>>>>
>>>>> To implement these configurations and enable Spark applications to
>>>>> interact with GCS and S3, I guess we can approach it this way
>>>>>
>>>>> 1) Spark Repository Integration: These configurations need to be added
>>>>> to the Spark repository as part of the supported configuration options for
>>>>> k8s deployments.
>>>>>
>>>>> 2) Configuration Settings: Users need to specify these configurations
>>>>> when submitting Spark applications to a Kubernetes cluster. They can
>>>>> include these configurations in the Spark application code or pass them as
>>>>> command-line arguments or environment variables during application
>>>>> submission.
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>>
>>>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* The information provided is correct to the best of my
>>>>> knowledge but of course cannot be guaranteed . It is essential to note
>>>>> that, as with any advice, quote "one test result is worth one-thousand
>>>>> expert opinions (Werner
>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>>>
>>>>>
>>>>> On Sun, 7 Apr 2024 at 13:31, Vakaris Baškirov <
>>>>> vakaris.bashki...@gmail.com> wrote:
>>>>>
>>>>>> There is an IBM shuffle service plugin that supports S3
>>>>>> https://github.com/IBM/spark-s3-shuffle
>>>>>>
>>>>>> Though I would think a feature like this could be a part of the main
>>>>>> Spark repo. Trino already has out-of-box support for s3 exchange 
>>>>>> (shuffle)
>>>>>> and it's very useful.
>>>>>>
>>>>>> Vakaris
>>>>>>
>>>>>> On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Thanks for your suggestion that I take it as a workaround. Whilst
>>>>>>> this workaround can potentially address storage allocation issues, I was
>>>>>>> more interested in exploring solutions that offer a more seamless
>>>>>>> integration with large distributed file systems like HDFS, GCS, or S3. 
>>>>>>> This
>>>>>>> would ensure better performance and scalability for handling larger
>>>>>>> datasets efficiently.
>>>>>>>
>>>>>>>
>>>>>>> Mich Talebzadeh,
>>>>>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>>>>>> London
>>>>>>> United Kingdom
>>>>>>>
>>>>>>>
>>>>>>>view my Linkedin profile
>&g
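
For illustration, a hedged sketch of how the spark_config_s3 dict quoted above
could be applied when building a session; the key names are the ones above,
while the credential values and bucket path are placeholders:

from pyspark.sql import SparkSession

spark_config_s3 = {
    "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3a.access.key": "s3_access_key",
    "spark.hadoop.fs.s3a.secret.key": "secret_key",
}

builder = SparkSession.builder.appName("s3a-example")
for key, value in spark_config_s3.items():
    builder = builder.config(key, value)  # apply every option before the session starts

spark = builder.getOrCreate()
df = spark.read.parquet("s3a://some-bucket/some-path/")  # placeholder bucket/path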

Re: External Spark shuffle service for k8s

2024-04-06 Thread Bjørn Jørgensen
You can make a PVC on K8s and call it 300gb.

Make a folder in your Dockerfile:
WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir

Start Spark adding this:

.config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName",
"300gb") \

.config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path",
"/opt/spark/work-dir") \

.config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly",
"False") \

.config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName",
"300gb") \

.config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path",
"/opt/spark/work-dir") \

.config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly",
"False") \
  .config("spark.local.dir", "/opt/spark/work-dir")




lør. 6. apr. 2024 kl. 15:45 skrev Mich Talebzadeh :

> I have seen some older references for shuffle service for k8s,
> although it is not clear they are talking about a generic shuffle
> service for k8s.
>
> Anyhow, with the advent of genai and the need to allow for a larger
> volume of data, I was wondering if there has been any more work on
> this matter. Specifically, larger and scalable file systems like HDFS,
> GCS, S3 etc. offer significantly larger storage capacity than local
> disks on individual worker nodes in a k8s cluster, thus allowing
> handling of much larger datasets more efficiently. Also the degree of
> parallelism and fault tolerance with these file systems comes into
> it. I will be interested in hearing more about any progress on this.
>
> Thanks
> .
>
> Mich Talebzadeh,
>
> Technologist | Solutions Architect | Data Engineer  | Generative AI
>
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Bjørn Jørgensen
Something like this: Spark community · GitHub
<https://github.com/Spark-community>


man. 18. mars 2024 kl. 17:26 skrev Parsian, Mahmoud
:

> Good idea. Will be useful
>
>
>
> +1
>
>
>
>
>
>
>
> *From: *ashok34...@yahoo.com.INVALID 
> *Date: *Monday, March 18, 2024 at 6:36 AM
> *To: *user @spark , Spark dev list <
> dev@spark.apache.org>, Mich Talebzadeh 
> *Cc: *Matei Zaharia 
> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for Apache
> Spark Community
>
> External message, be mindful when clicking links or attachments
>
>
>
> Good idea. Will be useful
>
>
>
> +1
>
>
>
> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
>
>
>
> Some of you may be aware that Databricks community Home | Databricks
>
> have just launched a knowledge sharing hub. I thought it would be a
>
> good idea for the Apache Spark user group to have the same, especially
>
> for repeat questions on Spark core, Spark SQL, Spark Structured
>
> Streaming, Spark Mlib and so forth.
>
>
>
> Apache Spark user and dev groups have been around for a good while.
>
> They are serving their purpose. We went through creating a Slack
>
> community that managed to create more heat than light. This is
>
> what Databricks community came up with and I quote
>
>
>
> "Knowledge Sharing Hub
>
> Dive into a collaborative space where members like YOU can exchange
>
> knowledge, tips, and best practices. Join the conversation today and
>
> unlock a wealth of collective wisdom to enhance your experience and
>
> drive success."
>
>
>
> I don't know the logistics of setting it up, but I am sure that should
>
> not be that difficult. If anyone is supportive of this proposal, let
>
> the usual +1, 0, -1 decide
>
>
>
> HTH
>
>
>
> Mich Talebzadeh,
>
> Dad | Technologist | Solutions Architect | Engineer
>
> London
>
> United Kingdom
>
>
>
>
>
>   view my Linkedin profile
>
>
>
>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
>
>
>
>
> Disclaimer: The information provided is correct to the best of my
>
> knowledge but of course cannot be guaranteed . It is essential to note
>
> that, as with any advice, quote "one test result is worth one-thousand
>
> expert opinions (Werner Von Braun)".
>
>
>
> -
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: When and how does Spark use metastore statistics?

2023-12-26 Thread Bjørn Jørgensen
Tell me more about
spark.sql.cbo.strategy


tir. 12. des. 2023 kl. 00:25 skrev Nicholas Chammas <
nicholas.cham...@gmail.com>:

> Where exactly are you getting this information from?
>
> As far as I can tell, spark.sql.cbo.enabled has defaulted to false since
> it was introduced 7 years ago.
> It has never been enabled by default.
>
> And I cannot see mention of spark.sql.cbo.strategy anywhere at all in the
> code base.
>
> So again, where is this information coming from? Please link directly to
> your source.
>
>
>
> On Dec 11, 2023, at 5:45 PM, Mich Talebzadeh 
> wrote:
>
> You are right. By default CBO is not enabled. Whilst the CBO was the
> default optimizer in earlier versions of Spark, it has been replaced by
> the AQE in recent releases.
>
> spark.sql.cbo.strategy
>
> As I understand, The spark.sql.cbo.strategy configuration property
> specifies the optimizer strategy used by Spark SQL to generate query
> execution plans. There are two main optimizer strategies available:
>
>    - CBO (Cost-Based Optimization): The default optimizer strategy, which
>    analyzes the query plan and estimates the execution costs associated with
>    each operation. It uses statistics to guide its decisions, selecting the
>    plan with the lowest estimated cost.
>    - CBO-Like (Cost-Based Optimization-Like): A simplified optimizer
>    strategy that mimics some of the CBO's logic, but without the ability to
>    estimate costs. This strategy is faster than CBO for simple queries, but
>    may not produce the most efficient plan for complex queries.
>
> The spark.sql.cbo.strategy property can be set to either CBO or CBO-Like.
> The default value is AUTO, which means that Spark will automatically
> choose the most appropriate strategy based on the complexity of the query
> and availability of statistics.
>
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 11 Dec 2023 at 17:11, Nicholas Chammas 
> wrote:
>
>>
>> On Dec 11, 2023, at 6:40 AM, Mich Talebzadeh 
>> wrote:
>>
>> By default, the CBO is enabled in Spark.
>>
>>
>> Note that this is not correct. AQE is enabled by
>> default, but CBO isn’t.
>>
>
>
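
Whatever the story with spark.sql.cbo.strategy, the settings that are easy to
verify empirically are spark.sql.adaptive.enabled and spark.sql.cbo.enabled. A
minimal sketch; the table name is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cbo-check").getOrCreate()

# Defaults, as discussed above: AQE on, CBO off.
print(spark.conf.get("spark.sql.adaptive.enabled"))  # 'true'
print(spark.conf.get("spark.sql.cbo.enabled"))       # 'false'

# Opt in to the CBO and give it statistics to work with.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("DESCRIBE EXTENDED my_table").show(truncate=False)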


Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-10 Thread Bjørn Jørgensen
t; Kubernetes operator, making it a part of the Apache Flink project (
>>>>>> https://github.com/apache/flink-kubernetes-operator
>>>>>> ).
>>>>>> This move has gained wide industry adoption and contributions from the
>>>>>> community. In a mere year, the Flink operator has garnered more than 600
>>>>>> stars and has attracted contributions from over 80 contributors. This
>>>>>> showcases the level of community interest and collaborative momentum that
>>>>>> can be achieved in similar scenarios.
>>>>>> More details can be found at SPIP doc : Spark Kubernetes Operator
>>>>>> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
>>>>>>
>>>>>> Thanks,
>>>>>> --
>>>>>> *Zhou JIANG*
>>>>>>
>>>>>>
>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: [Reminder] Spark 3.5 RC Cut

2023-08-02 Thread Bjørn Jørgensen
@Dongjoon Hyun  FYI
[image: image.png]

We better ask common-...@hadoop.apache.org.

ons. 2. aug. 2023 kl. 18:03 skrev Dongjoon Hyun :

> Oh, I got it, Emil and Bjorn.
>
> Dongjoon.
>
> On Wed, Aug 2, 2023 at 12:32 AM Bjørn Jørgensen 
> wrote:
>
>> "*As far as I can tell this makes both 3.3.5 and 3.3.6 unusable with s3
>> without providing an alternative committer code.*"
>>
>> https://github.com/apache/hadoop/pull/5706#issuecomment-1619927992
>>
>> ons. 2. aug. 2023 kl. 08:05 skrev Emil Ejbyfeldt
>> :
>>
>>>  > Apache Spark is not affected by HADOOP-18757 because it is not a part
>>> of
>>>  > both Apache Hadoop 3.3.5 and 3.3.6.
>>>
>>> I am not sure I am following what you are trying to say here. Is the
>>> jira saying that only 3.3.5 is affected? Here I think the Jira is
>>> just incorrect. The jira (and the PR with the fix) was
>>> created before 3.3.6 was released and I just think the jira has not been
>>> updated to reflect the fact that 3.3.6 is also affected.
>>>
>>>  > HADOOP-18757 seems to be merged just two weeks ago and there is no
>>>  > Apache Hadoop release with it, isn't it?
>>>
>>> That is correct, there is no hadoop release containing the fix. So
>>> therefore 3.3.6 would also be affected by the regression.
>>>
>>> Best,
>>> Emil
>>>
>>> On 02/08/2023 07:51, Dongjoon Hyun wrote:
>>> > It's still invalid information, Emil.
>>> >
>>> > Apache Spark is not affected by HADOOP-18757 because it is not a part
>>> of
>>> > both Apache Hadoop 3.3.5 and 3.3.6.
>>> >
>>> > HADOOP-18757 seems to be merged just two weeks ago and there is no
>>> > Apache Hadoop release with it, isn't it?
>>> >
>>> > Could you check your local branch once more, please?
>>> >
>>> > Dongjoon.
>>> >
>>> >
>>> >
>>> > On Tue, Aug 1, 2023 at 9:46 PM Emil Ejbyfeldt <
>>> eejbyfe...@liveintent.com
>>> > <mailto:eejbyfe...@liveintent.com>> wrote:
>>> >
>>> > Hi,
>>> >
>>> > Yes, sorry about that seem to have messed up the link. Should have
>>> been
>>> > https://issues.apache.org/jira/browse/HADOOP-18757
>>> > <https://issues.apache.org/jira/browse/HADOOP-18757>
>>> >
>>> > Best,
>>> > Emil
>>> >
>>> > On 01/08/2023 19:08, Dongjoon Hyun wrote:
>>> >  > Hi, Emil.
>>> >  >
>>> >  > HADOOP-18568 is still open and it seems to be never a part of
>>> the
>>> > Hadoop
>>> >  > trunk branch.
>>> >  >
>>> >  > Do you mean another JIRA?
>>> >  >
>>> >  > Dongjoon.
>>> >  >
>>> >  >
>>> >  >
>>> >  > On Tue, Aug 1, 2023 at 2:59 AM Emil Ejbyfeldt
>>> >  > >> > <mailto:eejbyfe...@liveintent.com>.invalid> wrote:
>>> >  >
>>> >  > Hi,
>>> >  >
>>> >  > We previously ran some experiments on builds from the 3.5
>>> > branch and
>>> >  > noticed that Hadoop had a regression
>>> >  > (https://issues.apache.org/jira/browse/HADOOP-18568
>>> > <https://issues.apache.org/jira/browse/HADOOP-18568>
>>> >  > <https://issues.apache.org/jira/browse/HADOOP-18568
>>> > <https://issues.apache.org/jira/browse/HADOOP-18568>>) in their
>>> s3a
>>> >  > committer affecting 3.3.5 and 3.3.6 (Spark 3.4 uses hadoop
>>> > 3.3.4). This
>>> >  > fix has been merged into Hadoop and will be part of the next
>>> > release of
>>> >  > Hadoop.
>>> >  >
>>> >  >   From our testing, the regression when writing data to S3
>>> > with a large
>>> >  > number of tasks is severe enough that we would need to
>>> > revert to
>>> >  > hadoop 3.3.4 in order to use the spark 3.5 release.
>>> >  >
>>> >  > Since it is only for S3 I am not sure it warrants
>>> changes
>>> > in Spark
>>> >  > (

Re: [Reminder] Spark 3.5 RC Cut

2023-08-02 Thread Bjørn Jørgensen
"*As far as I can tell this makes both 3.3.5 and 3.3.6 unusable with s3
without providing an alternative committer code.*"

https://github.com/apache/hadoop/pull/5706#issuecomment-1619927992
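
For anyone who needs to pin Hadoop back to 3.3.4 while testing the 3.5 RCs, a
hedged sketch using the standard Maven version override (the flags are the
usual make-distribution.sh ones; the distribution name is arbitrary):

./dev/make-distribution.sh --name hadoop-3.3.4 --tgz \
  -Phadoop-3 -Phadoop-cloud -Dhadoop.version=3.3.4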

ons. 2. aug. 2023 kl. 08:05 skrev Emil Ejbyfeldt
:

>  > Apache Spark is not affected by HADOOP-18757 because it is not a part of
>  > both Apache Hadoop 3.3.5 and 3.3.6.
>
> I am not sure I am following what you are trying to say here. Is the
> jira saying that only 3.3.5 is affected? Here I think the Jira is
> just incorrect. The jira (and the PR with the fix) was
> created before 3.3.6 was released and I just think the jira has not been
> updated to reflect the fact that 3.3.6 is also affected.
>
>  > HADOOP-18757 seems to be merged just two weeks ago and there is no
>  > Apache Hadoop release with it, isn't it?
>
> That is correct, there is no hadoop release containing the fix. So
> therefore 3.3.6 would also be affected by the regression.
>
> Best,
> Emil
>
> On 02/08/2023 07:51, Dongjoon Hyun wrote:
> > It's still invalid information, Emil.
> >
> > Apache Spark is not affected by HADOOP-18757 because it is not a part of
> > both Apache Hadoop 3.3.5 and 3.3.6.
> >
> > HADOOP-18757 seems to be merged just two weeks ago and there is no
> > Apache Hadoop release with it, isn't it?
> >
> > Could you check your local branch once more, please?
> >
> > Dongjoon.
> >
> >
> >
> > On Tue, Aug 1, 2023 at 9:46 PM Emil Ejbyfeldt  > <mailto:eejbyfe...@liveintent.com>> wrote:
> >
> > Hi,
> >
> > Yes, sorry about that seem to have messed up the link. Should have
> been
> > https://issues.apache.org/jira/browse/HADOOP-18757
> > <https://issues.apache.org/jira/browse/HADOOP-18757>
> >
> > Best,
> > Emil
> >
> > On 01/08/2023 19:08, Dongjoon Hyun wrote:
> >  > Hi, Emil.
> >  >
> >  > HADOOP-18568 is still open and it seems to be never a part of the
> > Hadoop
> >  > trunk branch.
> >  >
> >  > Do you mean another JIRA?
> >  >
> >  > Dongjoon.
> >  >
> >  >
> >  >
> >  > On Tue, Aug 1, 2023 at 2:59 AM Emil Ejbyfeldt
> >  >  > <mailto:eejbyfe...@liveintent.com>.invalid> wrote:
> >  >
> >  > Hi,
> >  >
> >  > We previously ran some experiments on builds from the 3.5
> > branch and
> >  > noticed that Hadoop had a regression
> >  > (https://issues.apache.org/jira/browse/HADOOP-18568
> > <https://issues.apache.org/jira/browse/HADOOP-18568>
> >  > <https://issues.apache.org/jira/browse/HADOOP-18568
> > <https://issues.apache.org/jira/browse/HADOOP-18568>>) in their s3a
> >  > committer affecting 3.3.5 and 3.3.6 (Spark 3.4 uses hadoop
> > 3.3.4). This
> >  > fix has been merged into Hadoop and will be part of the next
> > release of
> >  > Hadoop.
> >  >
> >  >   From our testing, the regression when writing data to S3
> > with a large
> >  > number of tasks is severe enough that we would need to
> > revert to
> >  > hadoop 3.3.4 in order to use the spark 3.5 release.
> >  >
> >  > Since it is only for S3 I am not sure it warrants changes
> > in Spark
> >  > (e.g. rolling back hadoop to 3.3.4). But it is probably something
> > people
> >  > testing the RC against S3 should be aware of.
> >  >
> >  > Best,
> >  > Emil
> >  >
> >  > On 29/07/2023 10:29, Yuanjian Li wrote:
> >  >  > Hi everyone,
> >  >  >
> >  >  > Following the release timeline, I will cut the RC
> > on*Tuesday, Aug
> >  > 1st at
> >  >  > 1 pm PST* as scheduled.
> >  >  >
> >  >  > Date  Event
> >  >  > July 17th 2023
> >  >  > Late July
> >  >  > 2023  Code freeze. Release branch cut.
> >  >  > QA period. Focus on bug fixes, tests, stability and docs.
> >  >  > Generally, no new features merged.
> >  >  >
> >  >  >
> >  >  > August 2023   Release candidates (RC), voting, etc. until
> > final
> >  > release passes
> >  >  >
> >  >  >
> >  >  > Best,
> >  >  > Yuanjian
> >  >
> >  >
> >
>  -
> >  > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > <mailto:dev-unsubscr...@spark.apache.org>
> >  > <mailto:dev-unsubscr...@spark.apache.org
> > <mailto:dev-unsubscr...@spark.apache.org>>
> >  >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Spark 3.4.0 and 3.4.1 and Java version in Dockerfile

2023-07-22 Thread Bjørn Jørgensen
https://hub.docker.com/_/openjdk
DEPRECATION NOTICE

This image is officially deprecated and all users are recommended to find
and use suitable replacements ASAP. Some examples of other Official Image
alternatives (listed in alphabetical order with no intentional or implied
preference):

   - amazoncorretto <https://hub.docker.com/_/amazoncorretto>
   - eclipse-temurin <https://hub.docker.com/_/eclipse-temurin>
   - ibm-semeru-runtimes <https://hub.docker.com/_/ibm-semeru-runtimes>
   - ibmjava <https://hub.docker.com/_/ibmjava>
   - sapmachine <https://hub.docker.com/_/sapmachine>

See docker-library/openjdk#505
<https://github.com/docker-library/openjdk/issues/505> for more information.


[SPARK-40941][K8S] Use Java 17 in K8s Dockerfile by default and remove
Dockerfile.java17 <https://github.com/apache/spark/pull/38417>
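
A possible workaround, rather than editing the Dockerfile: pass a tag that
eclipse-temurin actually publishes (11-jre exists, 11-jre-slim does not) as a
build arg to the bundled docker-image-tool.sh; the repo and tag below are
placeholders:

./bin/docker-image-tool.sh -r myrepo -t 3.4.1-java11 \
  -b java_image_tag=11-jre build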


lør. 22. juli 2023 kl. 17:33 skrev Mich Talebzadeh <
mich.talebza...@gmail.com>:

> Hi,
>
> I was checking the contents of the Dockerfile for Java in the Spark
> directory, i.e.
>
> ${SPARK_HOME}/kubernetes/dockerfiles/spark/Dockerfile
>
> in version 3.4.1
>
> I recall that in 3.4.0, I made an adjustment to the Dockerfile content, replacing
>
> #ARG java_image_tag=17-jre
> #FROM eclipse-temurin:${java_image_tag}
>
> with
>
>
> *ARG java_image_tag=11-jre-slim*
>
> *FROM openjdk:${java_image_tag}*
>
> This worked and dockerfile was created
>
> With version 3.4.1, the same issue seems to exist
>
> Sending build context to Docker daemon  466.1MB
> Step 1/18 : ARG java_image_tag=17-jre
> Step 2/18 : FROM eclipse-temurin:${java_image_tag}
> *manifest for eclipse-temurin:11-jre-slim not found: manifest unknown:
> manifest unknown*
> *Failed to build Spark JVM Docker image, please refer to Docker build
> output for details.*
>
> Thanks
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: [VOTE][RESULT] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-20 Thread Bjørn Jørgensen
Adjustment
>>> >
>>> > before setting the timeline for Spark 4.0.0 because we're unclear on
>>> the
>>> > picture of Spark 4.0.0. So discussing the timeline 4.0.0 first is the
>>> > opposite order procedurally.
>>> > The vote passed as a procedural issue, but I would prefer to consider
>>> this
>>> > as a tentative date, and should probably need another vote to adjust
>>> the
>>> > date considering the plans, preview dates, and items we aim for 4.0.0.
>>> >
>>> >
>>> > On Sat, 17 Jun 2023 at 04:33, Dongjoon Hyun 
>>> wrote:
>>> >
>>> > > This was a part of the following on-going discussions.
>>> > >
>>> > > 2023-05-28  Apache Spark 3.5.0 Expectations (?)
>>> > > https://lists.apache.org/thread/3x6dh17bmy20n3frtt3crgxjydnxh2o0
>>> > >
>>> > > 2023-05-30 Apache Spark 4.0 Timeframe?
>>> > > https://lists.apache.org/thread/xhkgj60j361gdpywoxxz7qspp2w80ry6
>>> > >
>>> > > 2023-06-05 ASF policy violation and Scala version issues
>>> > > https://lists.apache.org/thread/k7gr65wt0fwtldc7hp7bd0vkg1k93rrb
>>> > >
>>> > > 2023-06-12 [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)
>>> > > https://lists.apache.org/thread/r0zn6rd8y25yn2dg59ktw3ttrwxzqrfb
>>> > >
>>> > > I'm looking forward to seeing the upcoming detailed discussions
>>> including
>>> > > the following
>>> > > - Apache Spark 4.0.0 Preview (and Dates)
>>> > > - Apache Spark 4.0.0 Items
>>> > > - Apache Spark 4.0.0 Plan Adjustment
>>> > >
>>> > > Please initiate the discussion.
>>> > >
>>> > > Thanks,
>>> > > Dongjoon.
>>> > >
>>> > >
>>> > > On 2023/06/16 19:30:42 Dongjoon Hyun wrote:
>>> > > > The vote passes with 6 +1s (4 binding +1s), one -0, and one -1.
>>> > > > Thank you all for your participation and
>>> > > > especially your additional comments during this voting,
>>> > > > Mridul, Hyukjin, and Jungtaek.
>>> > > >
>>> > > > (* = binding)
>>> > > > +1:
>>> > > > - Dongjoon Hyun *
>>> > > > - Huaxin Gao *
>>> > > > - Liang-Chi Hsieh *
>>> > > > - Kazuyuki Tanimura
>>> > > > - Chao Sun *
>>> > > > - Jia Fan
>>> > > >
>>> > > > -0: Holden Karau
>>> > > >
>>> > > > -1: Xiao Li *
>>> > > >
>>> > >
>>> > > -
>>> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> > >
>>> > >
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Apache Spark 3.5.0 Expectations (?)

2023-05-31 Thread Bjørn Jørgensen
@Cheng Pan

https://issues.apache.org/jira/browse/HIVE-22126

ons. 31. mai 2023 kl. 03:58 skrev Cheng Pan :

> @Bjørn Jørgensen
>
> I did some investigation on upgrading Guava after Spark dropped Hadoop 2
> support, but unfortunately Hive still depends on it. The worse thing
> is that Guava’s classes are marked as shared in IsolatedClientLoader[1],
> which means Spark cannot upgrade Guava (even after upgrading the built-in
> Hive from the current 2.3.9 to a new version that does not depend on an old
> Guava) without breaking old versions of the Hive Metastore client.
>
> I can't find clues as to why the Guava classes need to be marked as shared; can
> anyone provide some background?
>
> [1]
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L215
>
> Thanks,
> Cheng Pan
>
>
> > On May 31, 2023, at 03:49, Bjørn Jørgensen 
> wrote:
> >
> > @Dongjoon Hyun Thank you.
> >
> > I have two points to discuss.
> > First, we are currently conducting tests with Python versions 3.8 and
> 3.9.
> > Should we consider replacing 3.9 with 3.11?
> >
> > Secondly, I'd like to know the status of Google Guava.
> > With Hadoop version 2 no longer being utilized, is there any other
> factor that is posing a blockage for this?
> >
> > tir. 30. mai 2023 kl. 10:39 skrev Mich Talebzadeh <
> mich.talebza...@gmail.com>:
> I don't know whether it is related, but Scala 2.12.17 is fine for the
> Spark 3 family (compile and run). I spent a day compiling Spark 3.4.0
> code against Scala 2.13.8 with maven and was getting all sorts of weird and
> wonderful errors at runtime.
> >
> > HTH
> >
> > Mich Talebzadeh,
> > Lead Solutions Architect/Engineering Lead
> > Palantir Technologies Limited
> > London
> > United Kingdom
> >
> >view my Linkedin profile
> >
> >  https://en.everybodywiki.com/Mich_Talebzadeh
> >  Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
> >
> >
> > On Tue, 30 May 2023 at 01:59, Jungtaek Lim 
> wrote:
> > Shall we initiate a new discussion thread for Scala 2.13 by default?
> While I'm not an expert on this area, it sounds like the change is major
> and (probably) breaking. It seems to be worth having a separate discussion
> thread rather than just treat it like one of 25 items.
> >
> > On Tue, May 30, 2023 at 9:54 AM Sean Owen  wrote:
> > It does seem risky; there are still likely libs out there that don't
> cross compile for 2.13. I would make it the default at 4.0, myself.
> >
> > On Mon, May 29, 2023 at 7:16 PM Hyukjin Kwon 
> wrote:
> > While I support going forward with a higher version, actually using
> Scala 2.13 by default is a big deal especially in a way that:
> > • Users would likely download the built-in version assuming that
> it’s backward binary compatible.
> > • PyPI doesn't allow specifying the Scala version, meaning that
> users wouldn’t have a way to 'pip install pyspark' based on Scala 2.12.
> > I wonder if it’s safer to do it in Spark 4 (which I believe will be
> discussed soon).
> >
> >
> > On Mon, 29 May 2023 at 13:21, Jia Fan  wrote:
> > Thanks Dongjoon!
> > There are some ticket I want to share.
> > SPARK-39420 Support ANALYZE TABLE on v2 tables
> > SPARK-42750 Support INSERT INTO by name
> > SPARK-43521 Support CREATE TABLE LIKE FILE
> >
> > Dongjoon Hyun  于2023年5月29日周一 08:42写道:
> > Hi, All.
> >
> > Apache Spark 3.5.0 is scheduled for August (1st Release Candidate) and
> currently a few notable things are under discussions in the mailing list.
> >
> > I believe it's a good time to share a short summary list (containing
> both completed and in-progress items) to give a highlight in advance and to
> collect your targets too.
> >
> > Please share your expectations or working items if you want to
> prioritize them more in the community in Apache Spark 3.5.0 timeframe.
> >
> > (Sorted by ID)
> > SPARK-40497 Upgrade Scala 2.13.11
> > SPARK-42452 Remove hadoop-2 profile from Apache Spark 3.5.0
> > SPARK-42913 Upgrade to Hadoop 3.3.5 (aws-java-sdk-bundle: 1.12.262 ->
> 1.12.316)
> > SPARK-43024 Upgrade Pandas to 2.0.0
> > SPARK-43200 Remove Hadoop 2 reference in docs
> > SPARK-43347 Remove Python 3.7 Support
> > SPARK-4334

Re: Apache Spark 3.5.0 Expectations (?)

2023-05-30 Thread Bjørn Jørgensen
@Dongjoon Hyun  Thank you.

I have two points to discuss.
First, we are currently conducting tests with Python versions 3.8 and 3.9.
Should we consider replacing 3.9 with 3.11?

Secondly, I'd like to know the status of Google Guava.
With Hadoop version 2 no longer being utilized, is there any other factor
that is posing a blockage for this?

tir. 30. mai 2023 kl. 10:39 skrev Mich Talebzadeh :

> I don't know whether it is related, but Scala 2.12.17 is fine for the Spark
> 3 family (compile and run). I spent a day compiling Spark 3.4.0 code
> against Scala 2.13.8 with maven and was getting all sorts of weird and
> wonderful errors at runtime.
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 30 May 2023 at 01:59, Jungtaek Lim 
> wrote:
>
>> Shall we initiate a new discussion thread for Scala 2.13 by default?
>> While I'm not an expert on this area, it sounds like the change is major
>> and (probably) breaking. It seems to be worth having a separate
>> discussion thread rather than just treat it like one of 25 items.
>>
>> On Tue, May 30, 2023 at 9:54 AM Sean Owen  wrote:
>>
>>> It does seem risky; there are still likely libs out there that don't
>>> cross compile for 2.13. I would make it the default at 4.0, myself.
>>>
>>> On Mon, May 29, 2023 at 7:16 PM Hyukjin Kwon 
>>> wrote:
>>>
>>>> While I support going forward with a higher version, actually using
>>>> Scala 2.13 by default is a big deal especially in a way that:
>>>>
>>>>- Users would likely download the built-in version assuming that
>>>>it’s backward binary compatible.
>>>>- PyPI doesn't allow specifying the Scala version, meaning that
>>>>users wouldn’t have a way to 'pip install pyspark' based on Scala 2.12.
>>>>
>>>> I wonder if it’s safer to do it in Spark 4 (which I believe will be
>>>> discussed soon).
>>>>
>>>>
>>>> On Mon, 29 May 2023 at 13:21, Jia Fan  wrote:
>>>>
>>>>> Thanks Dongjoon!
>>>>> There are some ticket I want to share.
>>>>> SPARK-39420 Support ANALYZE TABLE on v2 tables
>>>>> SPARK-42750 Support INSERT INTO by name
>>>>> SPARK-43521 Support CREATE TABLE LIKE FILE
>>>>>
>>>>> Dongjoon Hyun  于2023年5月29日周一 08:42写道:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> Apache Spark 3.5.0 is scheduled for August (1st Release Candidate)
>>>>>> and currently a few notable things are under discussions in the mailing
>>>>>> list.
>>>>>>
>>>>>> I believe it's a good time to share a short summary list (containing
>>>>>> both completed and in-progress items) to give a highlight in advance and 
>>>>>> to
>>>>>> collect your targets too.
>>>>>>
>>>>>> Please share your expectations or working items if you want to
>>>>>> prioritize them more in the community in Apache Spark 3.5.0 timeframe.
>>>>>>
>>>>>> (Sorted by ID)
>>>>>> SPARK-40497 Upgrade Scala 2.13.11
>>>>>> SPARK-42452 Remove hadoop-2 profile from Apache Spark 3.5.0
>>>>>> SPARK-42913 Upgrade to Hadoop 3.3.5 (aws-java-sdk-bundle: 1.12.262 ->
>>>>>> 1.12.316)
>>>>>> SPARK-43024 Upgrade Pandas to 2.0.0
>>>>>> SPARK-43200 Remove Hadoop 2 reference in docs
>>>>>> SPARK-43347 Remove Python 3.7 Support
>>>>>> SPARK-43348 Support Python 3.8 in PyPy3
>>>>>> SPARK-43351 Add Spark Connect Go prototype code and example
>>>>>> SPARK-43379 Deprecate old Java 8 versions prior to 8u371
>>>>>> SPARK-43394 Upgrade to Maven 3.8.8
>>>>>> SPARK-43436 Upgrade to RocksDbjni 8.1.1.1
>>>>>> SPARK-43446 Upgrade to Apache Arrow 12.0.0
>>>>>> SPARK-43447 Support R 4.3.0
>>>>>> SPARK-43489 Remove protobuf 2.5.0
>>>>>> SPARK-43519 Bump Parquet to 1.13.1
>>>>>> SPARK-43581 Upgrade kubernetes-client to 6.6.2
>>>>>> SPARK-43588 Upgrade to ASM 9.5
>>>>>> SPARK-43600 Update K8s doc to recommend K8s 1.24+
>>>>>> SPARK-43738 Upgrade to DropWizard Metrics 4.2.18
>>>>>> SPARK-43831 Build and Run Spark on Java 21
>>>>>> SPARK-43832 Upgrade to Scala 2.12.18
>>>>>> SPARK-43836 Make Scala 2.13 as default in Spark 3.5
>>>>>> SPARK-43842 Upgrade gcs-connector to 2.2.14
>>>>>> SPARK-43844 Update to ORC 1.9.0
>>>>>> UMBRELLA: Add SQL functions into Scala, Python and R API
>>>>>>
>>>>>> Thanks,
>>>>>> Dongjoon.
>>>>>>
>>>>>> PS. The above is not a list of release blockers. Instead, it could be
>>>>>> a nice-to-have from someone's perspective.
>>>>>>
>>>>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Slack for Spark Community: Merging various threads

2023-04-07 Thread Bjørn Jørgensen
Yes, I have done some searching for Slack alternatives
<https://itsfoss.com/open-source-slack-alternative/>
I feel that we should do some searching to find out if there is a
better solution than Slack.
From what I have found, there are two that can be an alternative to Slack.

Rocket.Chat  <https://www.rocket.chat/>

and

Zulip Chat <https://zulip.com>
Zulip Cloud Standard is free for open-source projects
<https://zulip.com/for/open-source/>
Which means we get

   - Unlimited search history
   - File storage up to 10 GB per user
   - Message retention policies
   <https://sparkzulip.zulipchat.com/help/message-retention-policy>
   - Brand Zulip with your logo
   - Priority commercial support
   - Funds the Zulip open source project


Rust is using zulip  <https://forge.rust-lang.org/platforms/zulip.html>

We can import chats from slack
<https://sparkzulip.zulipchat.com/help/import-from-slack>
We can use Zulip for events <https://zulip.com/for/events/>. With multi-use
invite links <https://zulip.com/help/invite-new-users>, there’s no need to
create individual Zulip invitations. This means that the PMC doesn't have to
send a link to every user.
CODE BLOCKS

Discuss code with ease using Markdown code blocks, syntax
highlighting, and code
playgrounds <https://zulip.com/help/code-blocks#code-playgrounds>.






fre. 7. apr. 2023 kl. 18:54 skrev Holden Karau :

> I think there was some concern around how to make any sync channel show up
> in logs / index / search results?
>
> On Fri, Apr 7, 2023 at 9:41 AM Dongjoon Hyun 
> wrote:
>
>> Thank you, All.
>>
>> I'm very satisfied with the focused and right questions for the real
>> issues by removing irrelevant claims. :)
>>
>> Let me collect your relevant comments simply.
>>
>>
>> # Category 1: Invitation Hurdle
>>
>> > The key question here is that do PMC members have the bandwidth of
>> inviting everyone in user@ and dev@?
>>
>> > Extending this to inviting everyone on @user (over >4k  subscribers
>> according to the previous thread) might be a stretch,
>>
>> > we should have an official project Slack with an easy invitation
>> process.
>>
>>
>> # Category 2: Controllability
>>
>> > Additionally. there is no indication that the-asf.slack.com is
>> intended for general support.
>>
>> > I would also lean towards a standalone workspace, where we have more
>> control over organizing the channels,
>>
>>
>> # Category 3: Policy Suggestion
>>
>> > *Developer* discussions should still happen on email, JIRA and GitHub
>> and be async-friendly (72-hour rule) to fit the ASF’s development model.
>>
>>
>> Are there any other questions?
>>
>>
>> Dongjoon.
>>
>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Slack for PySpark users

2023-04-04 Thread Bjørn Jørgensen
ard-to-define
>>>>>>>>>
>>>>>>>>> It's unavoidable if "users" prefer to use an alternative
>>>>>>>>> communication mechanism rather than the user mailing list. Before 
>>>>>>>>> Stack
>>>>>>>>> Overflow days, there had been a meaningful number of questions around 
>>>>>>>>> user@.
>>>>>>>>> It's just impossible to let them go back and post to the user mailing 
>>>>>>>>> list.
>>>>>>>>>
>>>>>>>>> We just need to make sure it is not the purpose of employing Slack
>>>>>>>>> to move all discussions about developments, direction of the project, 
>>>>>>>>> etc
>>>>>>>>> which must happen in dev@/private@. The purpose of Slack thread
>>>>>>>>> here does not seem to aim to serve the purpose.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Mar 31, 2023 at 7:00 AM Mich Talebzadeh <
>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Good discussions and proposals.all around.
>>>>>>>>>>
>>>>>>>>>> I have used slack in anger on a customer site before. For small
>>>>>>>>>> and medium size groups it is good and affordable. Alternatives have 
>>>>>>>>>> been
>>>>>>>>>> suggested as well so those who like investigative search can agree 
>>>>>>>>>> and come
>>>>>>>>>> up with a freebie one.
>>>>>>>>>> I am inclined to agree with Bjorn that this slack has more social
>>>>>>>>>> dimensions than the mailing list. It is akin to a sports club using
>>>>>>>>>> WhatsApp groups for communication. Remember we were originally 
>>>>>>>>>> looking for
>>>>>>>>>> space for webinars, including Spark on Linkedin that Denny Lee
>>>>>>>>>> suggested.
>>>>>>>>>> I think Slack and mailing groups can coexist happily. On a more 
>>>>>>>>>> serious
>>>>>>>>>> note, when I joined the user group back in 2015-2016, there was a 
>>>>>>>>>> lot of
>>>>>>>>>> traffic. Currently we hardly get many mails daily, less than 5. So
>>>>>>>>>> having
>>>>>>>>>> a slack type medium may improve members participation.
>>>>>>>>>>
>>>>>>>>>> so +1 for me as well.
>>>>>>>>>>
>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>> Lead Solutions Architect/Engineering Lead
>>>>>>>>>> Palantir Technologies Limited
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>view my Linkedin profile
>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all
>>>>>>>>>> responsibility for any loss, damage or destruction of data or any 
>>>>>>>>>> other
>>>>>>>>>> property which may arise from relying on this email's technical 
>>>>>>>>>> content is
>>>>>>>>>> explicitly disclaimed. The author will in no case be liable for any
>>>>>>>>>> monetary damages arising from such loss, damage or destruction.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, 30 Mar 2023 at 22:19, Denny Lee 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1.

Re: Slack for PySpark users

2023-03-30 Thread Bjørn Jørgensen
;>> On Wed, Mar 29, 2023 at 11:32 PM Xiao Li  wrote:
>>>>
>>>>> +1
>>>>>
>>>>> + @dev@spark.apache.org 
>>>>>
>>>>> This is a good idea. The other Apache projects (e.g., Pinot, Druid,
>>>>> Flink) have created their own dedicated Slack workspaces for faster
>>>>> communication. We can do the same in Apache Spark. The Slack workspace 
>>>>> will
>>>>> be maintained by the Apache Spark PMC. I propose to initiate a vote for 
>>>>> the
>>>>> creation of a new Apache Spark Slack workspace. Does that sound good?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Xiao
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Mich Talebzadeh  于2023年3月28日周二 07:07写道:
>>>>>
>>>>>> I created one at slack called pyspark
>>>>>>
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Lead Solutions Architect/Engineering Lead
>>>>>> Palantir Technologies Limited
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>> may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 28 Mar 2023 at 03:52, asma zgolli 
>>>>>> wrote:
>>>>>>
>>>>>>> +1 good idea, I d like to join as well.
>>>>>>>
>>>>>>> Le mar. 28 mars 2023 à 04:09, Winston Lai  a
>>>>>>> écrit :
>>>>>>>
>>>>>>>> Please let us know when the channel is created. I'd like to join :)
>>>>>>>>
>>>>>>>> Thank You & Best Regards
>>>>>>>> Winston Lai
>>>>>>>> --
>>>>>>>> *From:* Denny Lee 
>>>>>>>> *Sent:* Tuesday, March 28, 2023 9:43:08 AM
>>>>>>>> *To:* Hyukjin Kwon 
>>>>>>>> *Cc:* keen ; u...@spark.apache.org <
>>>>>>>> u...@spark.apache.org>
>>>>>>>> *Subject:* Re: Slack for PySpark users
>>>>>>>>
>>>>>>>> +1 I think this is a great idea!
>>>>>>>>
>>>>>>>> On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Yeah, actually I think we should better have a slack channel so we
>>>>>>>> can easily discuss with users and developers.
>>>>>>>>
>>>>>>>> On Tue, 28 Mar 2023 at 03:08, keen  wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> I really like *Slack *as communication channel for a tech
>>>>>>>> community.
>>>>>>>> There is a Slack workspace for *delta lake users* (
>>>>>>>> https://go.delta.io/slack) that I enjoy a lot.
>>>>>>>> I was wondering if there is something similar for PySpark users.
>>>>>>>>
>>>>>>>> If not, would there be anything wrong with creating a new
>>>>>>>> Slack workspace for PySpark users? (when explicitly mentioning that 
>>>>>>>> this is
>>>>>>>> *not* officially part of Apache Spark)?
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>> Martin
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Asma ZGOLLI
>>>>>>>
>>>>>>> Ph.D. in Big Data - Applied Machine Learning
>>>>>>>
>>>>>>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Topics for Spark online classes & webinars

2023-03-28 Thread Bjørn Jørgensen
;>>>>
>>>>>
>>>>>
>>>>> On Tue, 14 Mar 2023 at 15:09, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> Hi Denny,
>>>>>
>>>>> That Apache Spark Linkedin page
>>>>> https://www.linkedin.com/company/apachespark/ looks fine. It also
>>>>> allows a wider audience to benefit from it.
>>>>>
>>>>> +1 for me
>>>>>
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, 14 Mar 2023 at 14:23, Denny Lee  wrote:
>>>>>
>>>>> In the past, we've been using the Apache Spark LinkedIn page
>>>>> <https://www.linkedin.com/company/apachespark/> and group to
>>>>> broadcast these type of events - if you're cool with this?  Or we could go
>>>>> through the process of submitting and updating the current
>>>>> https://spark.apache.org or request to leverage the original Spark
>>>>> confluence page <https://cwiki.apache.org/confluence/display/SPARK>.
>>>>>WDYT?
>>>>>
>>>>> On Mon, Mar 13, 2023 at 9:34 AM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> Well that needs to be created first for this purpose. The appropriate
>>>>> name etc. to be decided. Maybe @Denny Lee 
>>>>> can facilitate this as he offered his help.
>>>>>
>>>>>
>>>>> cheers
>>>>>
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, 13 Mar 2023 at 16:29, asma zgolli 
>>>>> wrote:
>>>>>
>>>>> Hello Mich,
>>>>>
>>>>> Can you please provide the link for the confluence page?
>>>>>
>>>>> Many thanks
>>>>> Asma
>>>>> Ph.D. in Big Data - Applied Machine Learning
>>>>>
>>>>> On Mon, 13 Mar 2023 at 17:21, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> Apologies I missed the list.
>>>>>
>>>>> To move forward I selected these topics from the thread "Online
>>>>> classes for spark topics".
>>>>>
>>>>> To take this further I propose a Confluence page to be set up.
>>>>>
>>>>>
>>>>>1. Spark UI
>>>>>2. Dynamic allocation
>>>>>3. Tuning of jobs
>>>>>4. Collecting spark metrics for monitoring and alerting
>>>>>5. For those who prefer to use the Pandas API on Spark since the
>>>>>release of Spark 3.2: what are some important notes for those users?
>>>>>For example, what are the additional factors affecting Spark
>>>>>performance when using the Pandas API on Spark, and how can they be
>>>>>tuned in addition to the conventional Spark tuning methods applied
>>>>>to Spark SQL users?
>>>>>6. Spark internals and/or comparing spark 3 and 2
>>>>>7. Spark Streaming & Spark Structured Streaming
>>>>>8. Spark on notebooks
>>>>>9. Spark on serverless (for example Spark on Google Cloud)
>>>>>10. Spark on k8s
>>>>>
>>>>> Opinions and how-tos are welcome
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, 13 Mar 2023 at 16:16, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> Hi guys
>>>>>
>>>>> To move forward I selected these topics from the thread "Online
>>>>> classes for spark topics".
>>>>>
>>>>> To take this further I propose a Confluence page to be set up.
>>>>>
>>>>> Opinions and how-tos are welcome
>>>>>
>>>>> Cheers
>>>>>
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>
>> --
>> Asma ZGOLLI
>>
>> PhD in Big Data - Applied Machine Learning
>> Email : zgollia...@gmail.com
>> Tel : (+49) 015777685768
>> Skype : asma_zgolli
>>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Failed to build master google protobuf protoc-3.22.0-linux-x86_64.exe

2023-03-16 Thread Bjørn Jørgensen
Hi, I build Spark master on the Jupyter Docker Stacks every day.

Today this job failed, for some reason:

Downloaded from gcs-maven-central-mirror:
https://maven-central.storage-download.googleapis.com/maven2/com/google/protobuf/protoc/3.22.0/protoc-3.22.0-linux-x86_64.exe

Why is this an exe file?



https://github.com/bjornjorgensen/jupyter-spark-master-docker/actions/runs/4434813547/jobs/7782658831#step:7:37761






Should we contact Google about this?
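
(For context, a hedged note: as far as I can tell, the protoc artifacts on
Maven Central use an ".exe" extension for every platform, including Linux,
because the artifact type is "exe" with a platform classifier; the Linux
file should still be a native ELF binary. A quick check, assuming the
default local Maven cache layout:)

```python
# Sketch: confirm the ".exe" is really a Linux ELF binary, not a Windows PE.
# The repository path is an assumption about the local ~/.m2 layout.
from pathlib import Path

artifact = Path.home() / (
    ".m2/repository/com/google/protobuf/protoc/3.22.0/"
    "protoc-3.22.0-linux-x86_64.exe"
)
magic = artifact.read_bytes()[:4]
# ELF files start with b"\x7fELF"; Windows PE files start with b"MZ".
print("ELF (native Linux binary)" if magic == b"\x7fELF" else f"unexpected: {magic!r}")
```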

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Topics for Spark online classes & webinars

2023-03-15 Thread Bjørn Jørgensen
t;>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, 13 Mar 2023 at 16:29, asma zgolli 
>>>>> wrote:
>>>>>
>>>>>> Hello Mich,
>>>>>>
>>>>>> Can you please provide the link for the confluence page?
>>>>>>
>>>>>> Many thanks
>>>>>> Asma
>>>>>> Ph.D. in Big Data - Applied Machine Learning
>>>>>>
>>>>>> On Mon, 13 Mar 2023 at 17:21, Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Apologies I missed the list.
>>>>>>>
>>>>>>> To move forward I selected these topics from the thread "Online
>>>>>>> classes for spark topics".
>>>>>>>
>>>>>>> To take this further I propose a Confluence page to be set up.
>>>>>>>
>>>>>>>
>>>>>>>1. Spark UI
>>>>>>>2. Dynamic allocation
>>>>>>>3. Tuning of jobs
>>>>>>>4. Collecting spark metrics for monitoring and alerting
>>>>>>>5. For those who prefer to use the Pandas API on Spark since the
>>>>>>>release of Spark 3.2: what are some important notes for those users?
>>>>>>>For example, what are the additional factors affecting Spark
>>>>>>>performance when using the Pandas API on Spark, and how can they be
>>>>>>>tuned in addition to the conventional Spark tuning methods applied
>>>>>>>to Spark SQL users?
>>>>>>>6. Spark internals and/or comparing spark 3 and 2
>>>>>>>7. Spark Streaming & Spark Structured Streaming
>>>>>>>8. Spark on notebooks
>>>>>>>9. Spark on serverless (for example Spark on Google Cloud)
>>>>>>>10. Spark on k8s
>>>>>>>
>>>>>>> Opinions and how-tos are welcome
>>>>>>>
>>>>>>>
>>>>>>>view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>>
>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>>> may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>> damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 13 Mar 2023 at 16:16, Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi guys
>>>>>>>>
>>>>>>>> To move forward I selected these topics from the thread "Online
>>>>>>>> classes for spark topics".
>>>>>>>>
>>>>>>>> To take this further I propose a Confluence page to be set up.
>>>>>>>>
>>>>>>>> Opinions and how-tos are welcome
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>view my Linkedin profile
>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>
>>>>>>>>
>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>>> for any loss, damage or destruction of data or any other property 
>>>>>>>> which may
>>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>>> damages
>>>>>>>> arising from such loss, damage or destruction.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: [VOTE] Release Apache Spark 3.4.0 (RC4)

2023-03-12 Thread Bjørn Jørgensen
./build/mvn clean package  -Phive

SQLImplicitsTestSuite:
- column resolution
- test implicit encoder resolution *** FAILED ***
 2023-03-11T23:13:13.873033 did not equal 2023-03-11T23:13:13.873033776
(SQLImplicitsTestSuite.scala:63)
FunctionTestSuite:


Run completed in 45 seconds, 854 milliseconds.
Total number of tests run: 683
Suites: completed 12, aborted 0
Tests: succeeded 682, failed 1, canceled 0, ignored 1, pending 0
*** 1 TEST FAILED ***
[INFO]

[INFO] Reactor Summary for Spark Project Parent POM 3.4.0:
[INFO]
[INFO] Spark Project Parent POM ... SUCCESS [
 4.791 s]
[INFO] Spark Project Tags . SUCCESS [
 9.107 s]
[INFO] Spark Project Sketch ... SUCCESS [
23.398 s]
[INFO] Spark Project Local DB . SUCCESS [
15.841 s]
[INFO] Spark Project Networking ... SUCCESS [
58.199 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [
15.835 s]
[INFO] Spark Project Unsafe ... SUCCESS [
16.041 s]
[INFO] Spark Project Launcher . SUCCESS [
10.052 s]
[INFO] Spark Project Core . SUCCESS [35:14
min]
[INFO] Spark Project ML Local Library . SUCCESS [
39.237 s]
[INFO] Spark Project GraphX ... SUCCESS [02:21
min]
[INFO] Spark Project Streaming  SUCCESS [05:43
min]
[INFO] Spark Project Catalyst . SUCCESS [11:15
min]
[INFO] Spark Project SQL .. SUCCESS [
 02:32 h]
[INFO] Spark Project ML Library ... SUCCESS [22:00
min]
[INFO] Spark Project Tools  SUCCESS [
 6.520 s]
[INFO] Spark Project Hive . SUCCESS [
 01:19 h]
[INFO] Spark Project REPL . SUCCESS [02:02
min]
[INFO] Spark Project Assembly . SUCCESS [
12.161 s]
[INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [
25.144 s]
[INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:37
min]
[INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS [33:00
min]
[INFO] Spark Project Examples . SUCCESS [
57.948 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [
18.391 s]
[INFO] Spark Avro . SUCCESS [02:02
min]
[INFO] Spark Project Connect Common ... SUCCESS [
45.043 s]
[INFO] Spark Project Connect Server ... SUCCESS [01:02
min]
[INFO] Spark Project Connect Client ... FAILURE [01:22
min]
[INFO] Spark Protobuf . SKIPPED
[INFO]

[INFO] BUILD FAILURE
[INFO]

[INFO] Total time:  05:55 h
[INFO] Finished at: 2023-03-11T23:13:25+01:00
[INFO]

[ERROR] Failed to execute goal
org.scalatest:scalatest-maven-plugin:2.2.0:test (test) on project
spark-connect-client-jvm_2.12: There are test failures -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions,
please read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the
command
[ERROR]   mvn  -rf :spark-connect-client-jvm_2.12
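
A note on the single test failure above: it looks like the usual
microsecond-versus-nanosecond mismatch (Spark timestamps carry microsecond
precision, while java.time.Instant on newer JDKs can carry nanoseconds).
That reading is inferred from the printed values. A tiny illustration:

```python
# Pure-Python sketch of the precision gap in the failed assertion:
# ...873033 (microseconds) vs ...873033776 (nanoseconds).
nanos = 873_033_776             # fractional second of the right-hand value, in ns
micros = nanos // 1_000         # truncated to microseconds
assert micros == 873_033        # matches the left-hand value from the test
assert micros * 1_000 != nanos  # the trailing 776 ns are lost on the round-trip
print(f"{nanos} ns truncates to {micros} us")
```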


On Sat, 11 Mar 2023 at 13:43, yangjie01 wrote:

> Can you test  `./build/mvn clean package  -Phive` ? Thanks
>
>
>
>
>
> *From:* Bjørn Jørgensen 
> *Date:* Saturday, 11 March 2023, 20:33
> *To:* Xinrong Meng 
> *Cc:* beliefer , dev 
> *Subject:* Re: Re: [VOTE] Release Apache Spark 3.4.0 (RC4)
>
>
>
> Ubuntu 23.04
>
> java --version
> openjdk 17.0.6 2023-01-17
> OpenJDK Runtime Environment (build 17.0.6+10-Ubuntu-1)
> OpenJDK 64-Bit Server VM (build 17.0.6+10-Ubuntu-1, mixed mode, sharing)
>
>
>
> python3
> Python 3.11.1 (main, Dec 31 2022, 10:23:59) [GCC 12.2.0] on linux
>
>
>
>
>
> ./build/mvn clean package
>
>
>
>
>
> - broadcast join
> - test temp view *** FAILED ***
>  io.grpc.StatusRuntimeException: INTERNAL: Error while instantiating
> 'org.apache.spark.sql.hive.HiveExternalCatalog':
>  at io.grpc.Status.asRuntimeException(Status.java:535)
>  at
> io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660)
>  at

Re: Re: [VOTE] Release Apache Spark 3.4.0 (RC4)

2023-03-11 Thread Bjørn Jørgensen
ark-connect-client-jvm_2.12







On Sat, 11 Mar 2023 at 03:18, Xinrong Meng wrote:

> Thank you @beliefer.
>
> On Sat, Mar 11, 2023 at 9:54 AM beliefer  wrote:
>
>> There is a bug fix.
>>
>> https://issues.apache.org/jira/browse/SPARK-42740
>>
>>
>>
>> On 2023-03-10 at 20:48:30, "Xinrong Meng" wrote:
>>
>> https://issues.apache.org/jira/browse/SPARK-42745 can be a new release
>> blocker, thanks @Peter Toth  for reporting that.
>>
>> On Fri, Mar 10, 2023 at 8:21 PM Xinrong Meng 
>> wrote:
>>
>>> Please vote on releasing the following candidate (RC4) as Apache Spark
>>> version 3.4.0.
>>>
>>> The vote is open until 11:59pm Pacific time *March 15th* and passes if
>>> a majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.4.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is *v3.4.0-rc4* (commit
>>> 4000d6884ce973eb420e871c8d333431490be763):
>>> https://github.com/apache/spark/tree/v3.4.0-rc4
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc4-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1438
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc4-docs/
>>>
>>> The list of bug fixes going into 3.4.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>>>
>>> This release is using the release script of the tag v3.4.0-rc4.
>>>
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running it on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks; in Java/Scala,
>>> you can add the staging repository to your project's resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out-of-date RC going forward).
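>>>
>>> (A hedged sketch of that PySpark check; the pyspark tarball name under
>>> the RC "-bin" directory is an assumption, so confirm it against the
>>> directory listing:)
>>>
>>> ```python
>>> # Install the RC's pyspark into a fresh virtual env and run a smoke test.
>>> import subprocess
>>> import venv
>>>
>>> venv.create(".rc-venv", with_pip=True)
>>> subprocess.run(
>>>     [".rc-venv/bin/pip", "install",
>>>      "https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc4-bin/"
>>>      "pyspark-3.4.0.tar.gz"],  # assumed artifact name
>>>     check=True,
>>> )
>>> subprocess.run(
>>>     [".rc-venv/bin/python", "-c",
>>>      "from pyspark.sql import SparkSession; "
>>>      "s = SparkSession.builder.master('local[1]').getOrCreate(); "
>>>      "print(s.range(10).count()); s.stop()"],
>>>     check=True,
>>> )
>>> ```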
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.4.0?
>>> ===
>>> The current list of open tickets targeted at 3.4.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.4.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>> Thanks,
>>> Xinrong Meng
>>>
>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: spark executor pod has same memory value for request and limit

2023-03-10 Thread Bjørn Jørgensen
Strange to see that you are using Spark 3.1.2, which is EOL, and that you are
reading source files from 3.4.0-SNAPSHOT
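
For reference, a hedged sketch of pointing executors at a pod template; the
config keys are the standard ones, while the cluster URL, image, and path
are placeholders. Judging from the BasicExecutorFeatureStep code quoted
below, Spark sets the executor memory request and limit itself after the
template is loaded, so a memory request written in the template is not
expected to survive:

```python
# Sketch only: submit with a custom executor pod template file.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://example-cluster:6443")        # placeholder URL
    .config("spark.kubernetes.container.image",
            "example/spark:3.1.2")                       # placeholder image
    .config("spark.executor.memory", "5g")
    .config("spark.kubernetes.executor.podTemplateFile",
            "/tmp/executor-template.yaml")               # placeholder path
    .getOrCreate()
)
```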

On Fri, 10 Mar 2023 at 19:01, Ismail Yenigul wrote:

> and If you look at the code
>
>
> https://github.com/apache/spark/blob/e64262f417bf381bdc664dfd1cbcfaa5aa7221fe/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L194
>
> .editOrNewResources()
> .addToRequests("memory", executorMemoryQuantity)
> .addToLimits("memory", executorMemoryQuantity)
> .addToRequests("cpu", executorCpuQuantity)
> .addToLimits(executorResourceQuantities.asJava)
> .endResources()
>
> addToRequests and addToLimits for memory have the same value.
> Maybe it is by design, but can I set custom values for them if I use a
> pod template?
>
>
>
> On Fri, 10 Mar 2023 at 20:52, Ismail Yenigul
> wrote:
>
>> Hi,
>> I'm using Spark version 3.1.2.
>>
>> spark.executor.memory is set.
>> But the problem is not setting spark.executor.memory; the problem is that
>> whatever value I set for spark.executor.memory,
>> the Spark executor pod has the same value for resources.limits.memory and
>> resources.requests.memory.
>> I want to be able to set different values for them.
>>
>>
>>
>>
>> On Fri, 10 Mar 2023 at 20:44, Mich Talebzadeh
>> wrote:
>>
>>> What are those currently set to in spark-submit, and which Spark version on
>>> k8s?
>>>
>>>  --conf spark.driver.memory=2000m \
>>>--conf spark.executor.memory=2000m \
>>>
>>>   HTH
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 10 Mar 2023 at 17:39, Ismail Yenigul 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> There are CPU parameters for the Spark executor on k8s,
>>>> spark.kubernetes.executor.limit.cores and
>>>> spark.kubernetes.executor.request.cores,
>>>> but there is no parameter to set a memory request different from the
>>>> memory limit (such as spark.kubernetes.executor.request.memory).
>>>> For that reason,
>>>> spark.executor.memory is assigned to  requests.memory and limits.memory
>>>> like the following
>>>>
>>>> Limits:
>>>>   memory:  5734Mi
>>>> Requests:
>>>>   cpu:     4
>>>>   memory:  5734Mi
>>>>
>>>>
>>>> Is there any special reason not to have a
>>>> spark.kubernetes.executor.request.memory parameter?
>>>> And can I use the spark.kubernetes.executor.podTemplateFile parameter to
>>>> set a smaller memory request than the memory limit in the pod template file?
>>>>
>>>>
>>>> Limits:
>>>>   memory:  5734Mi
>>>> Requests:
>>>>   cpu:     4
>>>>   memory:  1024Mi
>>>>
>>>>
>>>> Thanks
>>>>
>>>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-22 Thread Bjørn Jørgensen
 http://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is *v3.4.0-rc1* (commit
>>>>> e2484f626bb338274665a49078b528365ea18c3b):
>>>>> https://github.com/apache/spark/tree/v3.4.0-rc1
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc1-bin/
>>>>>
>>>>> Signatures used for Spark RCs can be found in this file:
>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1435
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc1-docs/
>>>>>
>>>>> The list of bug fixes going into 3.4.0 can be found at the following
>>>>> URL:
>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>>>>>
>>>>> This release is using the release script of the tag v3.4.0-rc1.
>>>>>
>>>>>
>>>>> FAQ
>>>>>
>>>>> =
>>>>> How can I help test this release?
>>>>> =
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running it on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>> the current RC and see if anything important breaks; in Java/Scala,
>>>>> you can add the staging repository to your project's resolvers and test
>>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>>> you don't end up building with an out-of-date RC going forward).
>>>>>
>>>>> ===
>>>>> What should happen to JIRA tickets still targeting 3.4.0?
>>>>> ===
>>>>> The current list of open tickets targeted at 3.4.0 can be found at:
>>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>>> Version/s" = 3.4.0
>>>>>
>>>>> Committers should look at those and triage. Extremely important bug
>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>> be worked on immediately. Everything else please retarget to an
>>>>> appropriate release.
>>>>>
>>>>> ==
>>>>> But my bug isn't fixed?
>>>>> ==
>>>>> In order to make timely releases, we will typically not hold the
>>>>> release unless the bug in question is a regression from the previous
>>>>> release. That being said, if there is something which is a regression
>>>>> that has not been correctly targeted please ping me or a committer to
>>>>> help target the issue.
>>>>>
>>>>> Thanks,
>>>>> Xinrong Meng
>>>>>
>>>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-13 Thread Bjørn Jørgensen
There is a fix for Python 3.11: https://github.com/apache/spark/pull/38987
We should have this in more branches.

On Mon, 13 Feb 2023 at 09:39, Bjørn Jørgensen wrote:

> On Manjaro it is Python 3.10.9
>
> On Ubuntu it is Python 3.11.1
>
> On Mon, 13 Feb 2023 at 03:24, yangjie01 wrote:
>
>> Which Python version do you use for testing? When I use the latest Python
>> 3.11, I can reproduce similar test failures (43 tests of the sql module fail),
>> but when I use Python 3.10, they succeed.
>>
>>
>>
>> YangJie
>>
>>
>>
>> *From:* Bjørn Jørgensen 
>> *Date:* Monday, 13 February 2023, 05:09
>> *To:* Sean Owen 
>> *Cc:* "L. C. Hsieh" , Spark dev list <
>> dev@spark.apache.org>
>> *Subject:* Re: [VOTE] Release Spark 3.3.2 (RC1)
>>
>>
>>
>> Tried it one more time and the same result.
>>
>>
>>
>> On another box with Manjaro
>>
>> 
>> [INFO] Reactor Summary for Spark Project Parent POM 3.3.2:
>> [INFO]
>> [INFO] Spark Project Parent POM ... SUCCESS
>> [01:50 min]
>> [INFO] Spark Project Tags . SUCCESS [
>> 17.359 s]
>> [INFO] Spark Project Sketch ... SUCCESS [
>> 12.517 s]
>> [INFO] Spark Project Local DB . SUCCESS [
>> 14.463 s]
>> [INFO] Spark Project Networking ... SUCCESS
>> [01:07 min]
>> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>>  9.013 s]
>> [INFO] Spark Project Unsafe ... SUCCESS [
>>  8.184 s]
>> [INFO] Spark Project Launcher . SUCCESS [
>> 10.454 s]
>> [INFO] Spark Project Core . SUCCESS
>> [23:58 min]
>> [INFO] Spark Project ML Local Library . SUCCESS [
>> 21.218 s]
>> [INFO] Spark Project GraphX ... SUCCESS
>> [01:24 min]
>> [INFO] Spark Project Streaming  SUCCESS
>> [04:57 min]
>> [INFO] Spark Project Catalyst . SUCCESS
>> [08:00 min]
>> [INFO] Spark Project SQL .. SUCCESS [
>>  01:02 h]
>> [INFO] Spark Project ML Library ... SUCCESS
>> [14:38 min]
>> [INFO] Spark Project Tools  SUCCESS [
>>  4.394 s]
>> [INFO] Spark Project Hive . SUCCESS
>> [53:43 min]
>> [INFO] Spark Project REPL . SUCCESS
>> [01:16 min]
>> [INFO] Spark Project Assembly . SUCCESS [
>>  2.186 s]
>> [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [
>> 16.150 s]
>> [INFO] Spark Integration for Kafka 0.10 ... SUCCESS
>> [01:34 min]
>> [INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS
>> [32:55 min]
>> [INFO] Spark Project Examples . SUCCESS [
>> 23.800 s]
>> [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [
>>  7.301 s]
>> [INFO] Spark Avro . SUCCESS
>> [01:19 min]
>> [INFO]
>> 
>> [INFO] BUILD SUCCESS
>> [INFO]
>> ----
>> [INFO] Total time:  03:31 h
>> [INFO] Finished at: 2023-02-12T21:54:20+01:00
>> [INFO]
>> 
>> [bjorn@amd7g spark-3.3.2]$  java -version
>> openjdk version "17.0.6" 2023-01-17
>> OpenJDK Runtime Environment (build 17.0.6+10)
>> OpenJDK 64-Bit Server VM (build 17.0.6+10, mixed mode)
>>
>>
>>
>>
>>
>> :)
>>
>>
>>
>> So I'm +1
>>
>>
>>
>>
>>
>> On Sun, 12 Feb 2023 at 12:53, Bjørn Jørgensen <
>> bjornjorgen...@gmail.com> wrote:
>>
>> I use ubuntu rolling
>>
>> $ java -version
>> openjdk version "17.0.6" 2023-01-17
>> OpenJDK Runtime Environment (build 17.0.6+10-Ubuntu-0ubuntu1)
>> OpenJDK 64-Bit Server VM (build 17.0.6+10-Ubuntu-0ubuntu1, mixed mode,
>> sharing)
>>
>>
>>
>> I have rebooted now and restarted ./build/mvn clean package
>>
>>
>>
>>
>>
>>
>>
Sun, 12 Feb 20

Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-13 Thread Bjørn Jørgensen
On Manjaro it is Python 3.10.9

On Ubuntu it is Python 3.11.1

On Mon, 13 Feb 2023 at 03:24, yangjie01 wrote:

> Which Python version do you use for testing? When I use the latest Python
> 3.11, I can reproduce similar test failures (43 tests of the sql module fail),
> but when I use Python 3.10, they succeed.
>
>
>
> YangJie
>
>
>
> *From:* Bjørn Jørgensen 
> *Date:* Monday, 13 February 2023, 05:09
> *To:* Sean Owen 
> *Cc:* "L. C. Hsieh" , Spark dev list <
> dev@spark.apache.org>
> *Subject:* Re: [VOTE] Release Spark 3.3.2 (RC1)
>
>
>
> Tried it one more time and the same result.
>
>
>
> On another box with Manjaro
>
> 
> [INFO] Reactor Summary for Spark Project Parent POM 3.3.2:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [01:50
> min]
> [INFO] Spark Project Tags . SUCCESS [
> 17.359 s]
> [INFO] Spark Project Sketch ... SUCCESS [
> 12.517 s]
> [INFO] Spark Project Local DB . SUCCESS [
> 14.463 s]
> [INFO] Spark Project Networking ... SUCCESS [01:07
> min]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>  9.013 s]
> [INFO] Spark Project Unsafe ... SUCCESS [
>  8.184 s]
> [INFO] Spark Project Launcher . SUCCESS [
> 10.454 s]
> [INFO] Spark Project Core . SUCCESS [23:58
> min]
> [INFO] Spark Project ML Local Library . SUCCESS [
> 21.218 s]
> [INFO] Spark Project GraphX ... SUCCESS [01:24
> min]
> [INFO] Spark Project Streaming  SUCCESS [04:57
> min]
> [INFO] Spark Project Catalyst . SUCCESS [08:00
> min]
> [INFO] Spark Project SQL .. SUCCESS [
>  01:02 h]
> [INFO] Spark Project ML Library ... SUCCESS [14:38
> min]
> [INFO] Spark Project Tools  SUCCESS [
>  4.394 s]
> [INFO] Spark Project Hive . SUCCESS [53:43
> min]
> [INFO] Spark Project REPL . SUCCESS [01:16
> min]
> [INFO] Spark Project Assembly . SUCCESS [
>  2.186 s]
> [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [
> 16.150 s]
> [INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:34
> min]
> [INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS [32:55
> min]
> [INFO] Spark Project Examples . SUCCESS [
> 23.800 s]
> [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [
>  7.301 s]
> [INFO] Spark Avro . SUCCESS [01:19
> min]
> [INFO]
> 
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time:  03:31 h
> [INFO] Finished at: 2023-02-12T21:54:20+01:00
> [INFO]
> --------
> [bjorn@amd7g spark-3.3.2]$  java -version
> openjdk version "17.0.6" 2023-01-17
> OpenJDK Runtime Environment (build 17.0.6+10)
> OpenJDK 64-Bit Server VM (build 17.0.6+10, mixed mode)
>
>
>
>
>
> :)
>
>
>
> So I'm +1
>
>
>
>
>
> On Sun, 12 Feb 2023 at 12:53, Bjørn Jørgensen <
> bjornjorgen...@gmail.com> wrote:
>
> I use ubuntu rolling
>
> $ java -version
> openjdk version "17.0.6" 2023-01-17
> OpenJDK Runtime Environment (build 17.0.6+10-Ubuntu-0ubuntu1)
> OpenJDK 64-Bit Server VM (build 17.0.6+10-Ubuntu-0ubuntu1, mixed mode,
> sharing)
>
>
>
> I have rebooted now and restarted ./build/mvn clean package
>
>
>
>
>
>
>
> On Sun, 12 Feb 2023 at 04:47, Sean Owen wrote:
>
> +1 The tests and all results were the same as ever for me (Java 11, Scala
> 2.13, Ubuntu 22.04)
>
> I also didn't see that issue ... maybe somehow locale related, which could
> still be a bug.
>
>
>
> On Sat, Feb 11, 2023 at 8:49 PM L. C. Hsieh  wrote:
>
> Thank you for testing it.
>
> I was going to run it again but still didn't see any errors.
>
> I also checked CI (and looked again now) on branch-3.3 before cutting RC.
>
> BTW, I didn't find an actual test failure (i.e. "- test_name ***
> FAILED ***") in the log file.
>
> Maybe it is due to the dev env? What dev env are you using to run the

Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-12 Thread Bjørn Jørgensen
Tried it one more time and the same result.

On another box with Manjaro

[INFO] Reactor Summary for Spark Project Parent POM 3.3.2:
[INFO]
[INFO] Spark Project Parent POM ... SUCCESS [01:50
min]
[INFO] Spark Project Tags . SUCCESS [
17.359 s]
[INFO] Spark Project Sketch ... SUCCESS [
12.517 s]
[INFO] Spark Project Local DB . SUCCESS [
14.463 s]
[INFO] Spark Project Networking ... SUCCESS [01:07
min]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [
 9.013 s]
[INFO] Spark Project Unsafe ... SUCCESS [
 8.184 s]
[INFO] Spark Project Launcher . SUCCESS [
10.454 s]
[INFO] Spark Project Core . SUCCESS [23:58
min]
[INFO] Spark Project ML Local Library . SUCCESS [
21.218 s]
[INFO] Spark Project GraphX ... SUCCESS [01:24
min]
[INFO] Spark Project Streaming  SUCCESS [04:57
min]
[INFO] Spark Project Catalyst . SUCCESS [08:00
min]
[INFO] Spark Project SQL .. SUCCESS [
 01:02 h]
[INFO] Spark Project ML Library ... SUCCESS [14:38
min]
[INFO] Spark Project Tools  SUCCESS [
 4.394 s]
[INFO] Spark Project Hive . SUCCESS [53:43
min]
[INFO] Spark Project REPL . SUCCESS [01:16
min]
[INFO] Spark Project Assembly . SUCCESS [
 2.186 s]
[INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [
16.150 s]
[INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:34
min]
[INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS [32:55
min]
[INFO] Spark Project Examples . SUCCESS [
23.800 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [
 7.301 s]
[INFO] Spark Avro . SUCCESS [01:19
min]
[INFO]

[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time:  03:31 h
[INFO] Finished at: 2023-02-12T21:54:20+01:00
[INFO]

[bjorn@amd7g spark-3.3.2]$  java -version
openjdk version "17.0.6" 2023-01-17
OpenJDK Runtime Environment (build 17.0.6+10)
OpenJDK 64-Bit Server VM (build 17.0.6+10, mixed mode)


:)

So I'm +1


On Sun, 12 Feb 2023 at 12:53, Bjørn Jørgensen wrote:

> I use ubuntu rolling
> $ java -version
> openjdk version "17.0.6" 2023-01-17
> OpenJDK Runtime Environment (build 17.0.6+10-Ubuntu-0ubuntu1)
> OpenJDK 64-Bit Server VM (build 17.0.6+10-Ubuntu-0ubuntu1, mixed mode,
> sharing)
>
> I have rebooted now and restarted ./build/mvn clean package
>
>
>
> On Sun, 12 Feb 2023 at 04:47, Sean Owen wrote:
>
>> +1 The tests and all results were the same as ever for me (Java 11, Scala
>> 2.13, Ubuntu 22.04)
>> I also didn't see that issue ... maybe somehow locale related, which
>> could still be a bug.
>>
>> On Sat, Feb 11, 2023 at 8:49 PM L. C. Hsieh  wrote:
>>
>>> Thank you for testing it.
>>>
>>> I was going to run it again but still didn't see any errors.
>>>
>>> I also checked CI (and looked again now) on branch-3.3 before cutting RC.
>>>
>>> BTW, I didn't find an actual test failure (i.e. "- test_name ***
>>> FAILED ***") in the log file.
>>>
>>> Maybe it is due to the dev env? What dev env are you using to run the
>>> test?
>>>
>>>
>>> On Sat, Feb 11, 2023 at 8:58 AM Bjørn Jørgensen
>>>  wrote:
>>> >
>>> >
>>> > ./build/mvn clean package
>>> >
>>> > Run completed in 1 hour, 18 minutes, 29 seconds.
>>> > Total number of tests run: 11652
>>> > Suites: completed 516, aborted 0
>>> > Tests: succeeded 11609, failed 43, canceled 8, ignored 57, pending 0
>>> > *** 43 TESTS FAILED ***
>>> > [INFO]
>>> 
>>> > [INFO] Reactor Summary for Spark Project Parent POM 3.3.2:
>>> > [INFO]
>>> > [INFO] Spark Project Parent POM ... SUCCESS [
>>> 3.418 s]
>>> > [INFO] Spark Project Tags . SUCCESS [
>>> 17.845 s]
>>> > [INFO] Spark Project Sket

Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-12 Thread Bjørn Jørgensen
I use ubuntu rolling
$ java -version
openjdk version "17.0.6" 2023-01-17
OpenJDK Runtime Environment (build 17.0.6+10-Ubuntu-0ubuntu1)
OpenJDK 64-Bit Server VM (build 17.0.6+10-Ubuntu-0ubuntu1, mixed mode,
sharing)

I have rebooted now and restarted ./build/mvn clean package



On Sun, 12 Feb 2023 at 04:47, Sean Owen wrote:

> +1 The tests and all results were the same as ever for me (Java 11, Scala
> 2.13, Ubuntu 22.04)
> I also didn't see that issue ... maybe somehow locale related, which could
> still be a bug.
>
> On Sat, Feb 11, 2023 at 8:49 PM L. C. Hsieh  wrote:
>
>> Thank you for testing it.
>>
>> I was going to run it again but still didn't see any errors.
>>
>> I also checked CI (and looked again now) on branch-3.3 before cutting RC.
>>
>> BTW, I didn't find an actual test failure (i.e. "- test_name ***
>> FAILED ***") in the log file.
>>
>> Maybe it is due to the dev env? What dev env are you using to run the test?
>>
>>
>> On Sat, Feb 11, 2023 at 8:58 AM Bjørn Jørgensen
>>  wrote:
>> >
>> >
>> > ./build/mvn clean package
>> >
>> > Run completed in 1 hour, 18 minutes, 29 seconds.
>> > Total number of tests run: 11652
>> > Suites: completed 516, aborted 0
>> > Tests: succeeded 11609, failed 43, canceled 8, ignored 57, pending 0
>> > *** 43 TESTS FAILED ***
>> > [INFO]
>> 
>> > [INFO] Reactor Summary for Spark Project Parent POM 3.3.2:
>> > [INFO]
>> > [INFO] Spark Project Parent POM ... SUCCESS [
>> 3.418 s]
>> > [INFO] Spark Project Tags . SUCCESS [
>> 17.845 s]
>> > [INFO] Spark Project Sketch ... SUCCESS [
>> 20.791 s]
>> > [INFO] Spark Project Local DB . SUCCESS [
>> 16.527 s]
>> > [INFO] Spark Project Networking ... SUCCESS
>> [01:03 min]
>> > [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>> 9.914 s]
>> > [INFO] Spark Project Unsafe ... SUCCESS [
>> 12.007 s]
>> > [INFO] Spark Project Launcher . SUCCESS [
>> 7.620 s]
>> > [INFO] Spark Project Core . SUCCESS
>> [40:04 min]
>> > [INFO] Spark Project ML Local Library . SUCCESS [
>> 29.997 s]
>> > [INFO] Spark Project GraphX ... SUCCESS
>> [02:33 min]
>> > [INFO] Spark Project Streaming  SUCCESS
>> [05:51 min]
>> > [INFO] Spark Project Catalyst . SUCCESS
>> [13:29 min]
>> > [INFO] Spark Project SQL .. FAILURE [
>> 01:25 h]
>> > [INFO] Spark Project ML Library ... SKIPPED
>> > [INFO] Spark Project Tools  SKIPPED
>> > [INFO] Spark Project Hive . SKIPPED
>> > [INFO] Spark Project REPL . SKIPPED
>> > [INFO] Spark Project Assembly . SKIPPED
>> > [INFO] Kafka 0.10+ Token Provider for Streaming ... SKIPPED
>> > [INFO] Spark Integration for Kafka 0.10 ... SKIPPED
>> > [INFO] Kafka 0.10+ Source for Structured Streaming  SKIPPED
>> > [INFO] Spark Project Examples . SKIPPED
>> > [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
>> > [INFO] Spark Avro . SKIPPED
>> > [INFO]
>> 
>> > [INFO] BUILD FAILURE
>> > [INFO]
>> 
>> > [INFO] Total time:  02:30 h
>> > [INFO] Finished at: 2023-02-11T17:32:45+01:00
>> >
>> >> On Sat, 11 Feb 2023 at 06:01, L. C. Hsieh wrote:
>> >>
>> >> Please vote on releasing the following candidate as Apache Spark
>> version 3.3.2.
>> >>
>> >> The vote is open until Feb 15th 9AM (PST) and passes if a majority of
>> >> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >>
>> >> [ ] +1 Release this package as Apache Spark 3.3.2
>> >> [ ] -1 Do not release this package because ...
>> >>
>> >> To learn more about Apache Spark, please see https:

Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Bjørn Jørgensen
ventually(condition, timeout=180.0)
>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>> 93, in eventually
>> > raise AssertionError(
>> > AssertionError: Test failed due to timeout after 180 sec, with last
>> condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74,
>> 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78,
>> 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64,
>> 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
>> >
>> > --
>> > Ran 13 tests in 661.536s
>> >
>> > FAILED (failures=3, skipped=1)
>> >
>> > Had test failures in pyspark.mllib.tests.test_streaming_algorithms with
>> /usr/local/bin/python3; see logs.
>> > ```
>> >
>> > Here's how I'm currently building Spark; I was using the
>> [building-spark](https://spark.apache.org/docs/3..1/building-spark.html)
>> docs as a reference.
>> > ```
>> > > git clone g...@github.com:apache/spark.git
>> > > git checkout -b spark-321 v3.2.1
>> > > ./build/mvn -DskipTests clean package -Phive
>> > > export JAVA_HOME=$(path/to/jdk/11)
>> > > ./python/run-tests
>> > ```
>> >
>> > Current Java version
>> > ```
>> > java -version
>> > openjdk version "11.0.17" 2022-10-18
>> > OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>> > OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>> > ```
>> >
>> > Alternatively, I've also tried simply building Spark and using a
>> python=3.9 venv and installing the requirements from `pip install -r
>> dev/requirements.txt` and using that as the interpreter to run tests.
>> However, I ran into some failing pandas tests, which seemed to come
>> from a pandas version difference, as `requirements.txt`
>> didn't specify a version.
>> >
>> > I suppose I have a couple of questions regarding this:
>> > 1. Am I missing a build step to build Spark and run PySpark unit tests?
>> > 2. Where could I find whether an upstream test is failing for a
>> specific release?
>> > 3. Would it be possible to configure the `run-tests` script to run all
>> tests regardless of test failures? (see the sketch after this list)
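>>
>> (On points 1 and 3, a hedged sketch; the flag names are taken from
>> python/run-tests.py from memory, so check them against the branch being
>> built:)
>>
>> ```python
>> # Re-run only the failing suite instead of the whole test matrix; the
>> # return code is not checked, so one failure does not stop the run.
>> import subprocess
>>
>> subprocess.run(
>>     ["python/run-tests",
>>      "--python-executables=python3",
>>      "--testnames", "pyspark.mllib.tests.test_streaming_algorithms"],
>>     check=False,
>> )
>> ```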
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: [Suggest] Add geo function to core

2023-01-17 Thread Bjørn Jørgensen
ve
>> Interface
>> > > (JNI). PROJ is the most well known map projection library, but it is
>> > > difficult to bundle native code in a Java application.
>> > > >
>> > > > I'm not in a neutral position to say that, but I believe that
>> Apache
>> > > SIS is the most powerful open source pure-Java referencing library.
>> But it
>> > > is relatively big, about 4 Mb for the referencing module with its
>> > > dependencies, not counting the optional EPSG geodetic dataset
>> (because not
>> > > compatible with Apache license). Apache SIS is not the library with
>> the
>> > > largest amount of map projections (PROJ4J has more), but it handles
>> some
>> > > difficult problems and scale well with three- or four-dimensional
>> data (or
>> > > more).
>> > > >
>> > > > PROJ4J is a lightweight library which may be sufficient if data are
>> > > mostly two-dimensional (limited 3D support seems also possible) and if
>> > > uncertainty of a few metres in coordinate transformations (depending
>> how
>> > > datum shifts are specified) is acceptable.
>> > > >
>> > > > It is possible to write some code in an implementation-independent
>> way
>> > > using GeoAPI interfaces, which aim to do what JDBC interfaces do for
>> > > databases. Apache SIS and PROJ-JNI are implementations of GeoAPI
>> > > interfaces, so by using those interfaces you can let users choose
>> among
>> > > those two implementations. I think that GeoAPI wrappers could easily
>> be
>> > > contributed to PROJ4J as well if there is a desire for that.
>> > > >
>> > > > Regarding Geohash, if we are talking about the algorithm described
>> at
>> > > https://en.wikipedia.org/wiki/Geohash, then SIS already supports it.
>> SIS
>> > > also supports the Military Grid Reference System (MGRS), which can be
>> seen
>> > > as another kind of geohash with better characteristics.
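>> > >
>> > > (A compact sketch of the standard Geohash encoding from the Wikipedia
>> > > page cited above; this is not SIS's or PROJ4J's API:)
>> > >
>> > > ```python
>> > > # Interleave longitude/latitude bisection bits, 5 bits per base32 char.
>> > > _B32 = "0123456789bcdefghjkmnpqrstuvwxyz"
>> > >
>> > > def geohash(lat, lon, precision=11):
>> > >     lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
>> > >     out, bits, n, even = [], 0, 0, True
>> > >     while len(out) < precision:
>> > >         rng, val = (lon_rng, lon) if even else (lat_rng, lat)
>> > >         mid = (rng[0] + rng[1]) / 2
>> > >         bits = bits * 2 + (val >= mid)   # 1 if in the upper half
>> > >         rng[0 if val >= mid else 1] = mid
>> > >         even, n = not even, n + 1
>> > >         if n == 5:                       # 5 bits -> one base32 character
>> > >             out.append(_B32[bits])
>> > >             bits, n = 0, 0
>> > >     return "".join(out)
>> > >
>> > > print(geohash(57.64911, 10.40744))  # "u4pruydqqvj", the Wikipedia example
>> > > ```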
>> > > >
>> > > > Regards,
>> > > >
>> > > > Martin
>> > > >
>> > > >
>> -
>> > > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> > > >
>> > > >
>> > >
>> > > -
>> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> > >
>> > >
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: [VOTE][RESULT] Release Spark 3.2.3, RC1

2022-11-26 Thread Bjørn Jørgensen
How are things going with this release?
I can't find it on https://spark.apache.org/downloads.html

On Fri, 18 Nov 2022 at 19:38, Chao Sun wrote:

> CORRECTED:
>
> The vote passes with 12 +1s (6 binding +1s).
> Thanks to all who helped with the release!
>
> (* = binding)
> +1:
> - Dongjoon Hyun (*)
> - L. C. Hsieh (*)
> - Huaxin Gao (*)
> - Sean Owen (*)
> - Kazuyuki Tanimura
> - Mridul Muralidharan (*)
> - Yuming Wang
> - Chris Nauroth
> - Yang Jie
> - Wenchen Fan (*)
> - Ruifeng Zheng
> - Chao Sun
>
> +0: None
>
> -1: None
>
> On Fri, Nov 18, 2022 at 10:35 AM Chao Sun  wrote:
> >
> > Oops, sorry! I thought he voted but for some reason I didn't see his
> > vote in the email thread. Strange. Now I found it in here:
> > https://lists.apache.org/thread/gh2oktrndxopqnyxbsvp2p0k6jk1n9fs
> >
> > On Fri, Nov 18, 2022 at 10:33 AM Mridul Muralidharan 
> wrote:
> > >
> > >
> > > This vote result is missing Sean Owen's vote.
> > >
> > > - Mridul
> > >
> > >
> > >
> > > On Fri, Nov 18, 2022 at 11:51 AM Chao Sun  wrote:
> > >>
> > >> The vote passes with 11 +1s (5 binding +1s).
> > >> Thanks to all who helped with the release!
> > >>
> > >> (* = binding)
> > >> +1:
> > >> - Dongjoon Hyun (*)
> > >> - L. C. Hsieh (*)
> > >> - Huaxin Gao (*)
> > >> - Kazuyuki Tanimura
> > >> - Mridul Muralidharan (*)
> > >> - Yuming Wang
> > >> - Chris Nauroth
> > >> - Yang Jie
> > >> - Wenchen Fan (*)
> > >> - Ruifeng Zheng
> > >> - Chao Sun
> > >>
> > >> +0: None
> > >>
> > >> -1: None
> > >>
> > >> -
> > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Contributor privilege

2022-11-11 Thread Bjørn Jørgensen
Just open a PR in the GitHub Spark repo.
When the PR is merged, you will be assigned the JIRA.


On Fri, 11 Nov 2022 at 10:31, deng ziming wrote:

> Hello, can someone give me contributor privileges for the Spark JIRA project? I
> want to assign tasks to myself.
> My jira/confluence user id is: dengziming
>
> --
> Best,
> Ziming
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Upgrade guava to 31.1-jre and remove hadoop2

2022-11-06 Thread Bjørn Jørgensen
Hi, has anyone tried to upgrade Guava now that we have stopped supporting
Hadoop 2?
And is there a plan for removing Hadoop 2 code from the code base?


Re: Issue with SparkContext

2022-09-20 Thread Bjørn Jørgensen
Hi, we have a user group at u...@spark.apache.org

You must install a Java JRE.

If you are on Ubuntu you can type:
apt-get install openjdk-17-jre-headless
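
A small pre-flight check can also help, since PySpark needs a reachable
JVM; this is a minimal sketch, assuming a Unix-like environment:

```python
# Confirm a Java runtime is reachable before starting PySpark.
import os
import shutil

java = shutil.which("java")
if java is None and "JAVA_HOME" in os.environ:
    candidate = os.path.join(os.environ["JAVA_HOME"], "bin", "java")
    java = candidate if os.path.exists(candidate) else None
if java is None:
    raise RuntimeError("No Java runtime found: install a JRE or set JAVA_HOME")
print("Using Java at:", java)
```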

On Tue, 20 Sep 2022 at 06:15, yogita bhardwaj <
yogita.bhard...@iktara.ai> wrote:

>
>
> I am getting the py4j.protocol.Py4JJavaError while running SparkContext.
> Can you please help me to resolve this issue.
>
>
>
> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
> Windows
>
>
>


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Jupyter notebook on Dataproc versus GKE

2022-09-14 Thread Bjørn Jørgensen
Mich: Why I'm switching from Jupyter Notebooks to JupyterLab...Such a
better experience! DegreeTutors.com <https://youtu.be/djupMug3qUc>

On Tue, 6 Sep 2022 at 20:28, Holden Karau wrote:

> I’ve used Argo for K8s scheduling; for a while it’s also what Kubeflow used
> underneath for scheduling.
>
> On Tue, Sep 6, 2022 at 10:01 AM Mich Talebzadeh 
> wrote:
>
>> Thank you all.
>>
>> Has anyone used Argo as a k8s scheduler, by any chance?
>>
>> On Tue, 6 Sep 2022 at 13:41, Bjørn Jørgensen 
>> wrote:
>>
>>> "*JupyterLab is the next-generation user interface for Project Jupyter
>>> offering all the familiar building blocks of the classic Jupyter Notebook
>>> (notebook, terminal, text editor, file browser, rich outputs, etc.) in a
>>> flexible and powerful user interface.*"
>>> https://github.com/jupyterlab/jupyterlab
>>>
>>> You will find them both at https://jupyter.org
>>>
>>> On Mon, 5 Sep 2022 at 23:40, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Thanks Bjorn,
>>>>
>>>> What are the differences, and what functionality does JupyterLab bring on
>>>> top of Jupyter Notebook?
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, 5 Sept 2022 at 20:58, Bjørn Jørgensen 
>>>> wrote:
>>>>
>>>>> Jupyter Notebook is replaced with JupyterLab :)
>>>>>
>>>>> On Mon, 5 Sep 2022 at 21:10, Holden Karau wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Sep 5, 2022 at 9:00 AM Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for that.
>>>>>>>
>>>>>>> How do you rate the performance of Jupyter W/Spark on K8s compared
>>>>>>> to the same on  a cluster of VMs (example Dataproc).
>>>>>>>
>>>>>>> Also, a somewhat related question (maybe naive as well). For example,
>>>>>>> Google offers a lot of standard ML libraries for example built into a 
>>>>>>> data
>>>>>>> warehouse like BigQuery. What does the Jupyter notebook offer that 
>>>>>>> others
>>>>>>> don't?
>>>>>>>
>>>>>> Jupyter notebook doesn’t offer any particular set of libraries,
>>>>>> although you can add your own to the container etc.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>>
>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>>> may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>> damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 5 Sept 2022 at 12:47, Holden Karau 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I’ve run Jupyter w/Spark on K8s, haven’t tried it with Dataproc
>>>>>>>> personally.
>>>>>>>>
>>>>>>>>

Re: Time for Spark 3.3.1 release?

2022-09-14 Thread Bjørn Jørgensen
At least we should upgrade Hadoop to the latest version:
https://hadoop.apache.org/release/2.10.2.html

Are there some special reasons why we have a Hadoop version that is 7
years old?

On Wed, 14 Sep 2022 at 20:25, Dongjoon Hyun wrote:

> Ya, +1 for Sean's comment.
>
> In addition, all Apache Spark's Maven artifacts are depending on Hadoop
> 3.3.x already.
>
>
> https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.12/3.3.0
>
> https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.13/3.3.0
>
> Apache Spark has been moving away from Hadoop 2 due to many many reasons.
>
> Dongjoon.
>
>
> On Wed, Sep 14, 2022 at 10:54 AM Sean Owen  wrote:
>
>> Yeah we're not going to make convenience binaries for all possible
>> combinations. It's a pretty good assumption that anyone moving to later
>> Scala versions is also off old Hadoop versions.
>> You can of course build the combo you like.
>>
>> On Wed, Sep 14, 2022 at 11:26 AM Denis Bolshakov <
>> bolshakov.de...@gmail.com> wrote:
>>
>>> Unfortunately it's for hadoop 3 only.
>>>
>>> ср, 14 сент. 2022 г., 19:04 Dongjoon Hyun :
>>>
 Hi, Denis.

 Apache Spark community already provides both Scala 2.12 and 2.13
 pre-built distributions.
 Please check the distribution site and Apache Spark download page.

 https://dlcdn.apache.org/spark/spark-3.3.0/

 spark-3.3.0-bin-hadoop3-scala2.13.tgz
 spark-3.3.0-bin-hadoop3.tgz

 [image: Screenshot 2022-09-14 at 9.03.27 AM.png]

 Dongjoon.

 On Wed, Sep 14, 2022 at 12:31 AM Denis Bolshakov <
 bolshakov.de...@gmail.com> wrote:

> Hello,
>
> It would be great if it's possible to provide a Spark distro for both
> Scala 2.12 and Scala 2.13.
>
> It will encourage Spark users to switch to Scala 2.13.
>
> I know that Spark jar artifacts are available for both Scala versions, but
> it does not make sense to migrate to Scala 2.13 while there is no Spark
> distro for this version.
>
> Kind regards,
> Denis
>
> On Tue, 13 Sept 2022 at 17:38, Yuming Wang  wrote:
>
>> Thank you all.
>>
>> I will be preparing 3.3.1 RC1 soon.
>>
>> On Tue, Sep 13, 2022 at 12:09 PM John Zhuge 
>> wrote:
>>
>>> +1
>>>
>>> On Mon, Sep 12, 2022 at 9:08 PM Yang,Jie(INF) 
>>> wrote:
>>>
 +1



 Thanks Yuming ~



 *From:* Hyukjin Kwon 
 *Date:* Tuesday, 13 September 2022, 08:19
 *To:* Gengliang Wang 
 *Cc:* "L. C. Hsieh" , Dongjoon Hyun <
 dongjoon.h...@gmail.com>, Yuming Wang , dev <
 dev@spark.apache.org>
 *Subject:* Re: Time for Spark 3.3.1 release?



 +1



 On Tue, 13 Sept 2022 at 06:45, Gengliang Wang 
 wrote:

 +1.

 Thank you, Yuming!



 On Mon, Sep 12, 2022 at 12:10 PM L. C. Hsieh 
 wrote:

 +1

 Thanks Yuming!

 On Mon, Sep 12, 2022 at 11:50 AM Dongjoon Hyun <
 dongjoon.h...@gmail.com> wrote:
 >
 > +1
 >
 > Thanks,
 > Dongjoon.
 >
 > On Mon, Sep 12, 2022 at 6:38 AM Yuming Wang 
 wrote:
 >>
 >> Hi, All.
 >>
 >>
 >>
 >> Since Apache Spark 3.3.0 tag creation (Jun 10), new 138 patches
 including 7 correctness patches arrived at branch-3.3.
 >>
 >>
 >>
 >> Shall we make a new release, Apache Spark 3.3.1, as the second
 release at branch-3.3? I'd like to volunteer as the release manager for
 Apache Spark 3.3.1.
 >>
 >>
 >>
 >> All changes:
 >>
 >> https://github.com/apache/spark/compare/v3.3.0...branch-3.3
 
 >>
 >>
 >>
 >> Correctness issues:
 >>
 >> SPARK-40149: Propagate metadata columns through Project
 >>
 >> SPARK-40002: Don't push down limit through window using ntile
 >>
 >> SPARK-39976: ArrayIntersect should handle null in left
 expression correctly
 >>
 >> SPARK-39833: Disable Parquet column index in DSv1 to fix a
 correctness issue in the case of overlapping partition and data columns
 >>
 >> SPARK-39061: Set nullable correctly for Inline output attributes
 >>
 >> SPARK-39887: RemoveRedundantAliases should keep aliases that
 make the output of projection nodes unique
 >>
 >> SPARK-38614: Don't push down limit through window that's using
 percent_rank


 

Re: Jupyter notebook on Dataproc versus GKE

2022-09-06 Thread Bjørn Jørgensen
"*JupyterLab is the next-generation user interface for Project Jupyter
offering all the familiar building blocks of the classic Jupyter Notebook
(notebook, terminal, text editor, file browser, rich outputs, etc.) in a
flexible and powerful user interface.*"
https://github.com/jupyterlab/jupyterlab

You will find them both at https://jupyter.org

On Mon, 5 Sep 2022 at 23:40, Mich Talebzadeh wrote:

> Thanks Bjorn,
>
> What are the differences, and what functionality does JupyterLab bring on top
> of Jupyter Notebook?
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 5 Sept 2022 at 20:58, Bjørn Jørgensen 
> wrote:
>
>> Jupyter Notebook is replaced with JupyterLab :)
>>
>> On Mon, 5 Sep 2022 at 21:10, Holden Karau wrote:
>>
>>>
>>>
>>> On Mon, Sep 5, 2022 at 9:00 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Thanks for that.
>>>>
>>>> How do you rate the performance of Jupyter W/Spark on K8s compared to
>>>> the same on  a cluster of VMs (example Dataproc).
>>>>
>>>> Also, a somewhat related question (maybe naive as well). For example,
>>>> Google offers a lot of standard ML libraries for example built into a data
>>>> warehouse like BigQuery. What does the Jupyter notebook offer that others
>>>> don't?
>>>>
>>> Jupyter notebook doesn’t offer any particular set of libraries, although
>>> you can add your own to the container etc.
>>>
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, 5 Sept 2022 at 12:47, Holden Karau 
>>>> wrote:
>>>>
>>>>> I’ve run Jupyter w/Spark on K8s, haven’t tried it with Dataproc
>>>>> personally.
>>>>>
>>>>> The Spark K8s pod scheduler is now more pluggable for Yunikorn and
>>>>> Volcano can be used with less effort.
>>>>>
>>>>> On Mon, Sep 5, 2022 at 7:44 AM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> Has anyone got experience of running Jupyter on dataproc versus
>>>>>> Jupyter notebook on GKE (k8).
>>>>>>
>>>>>>
>>>>>> I have not looked at this for a while but my understanding is that
>>>>>> Spark on GKE/k8s is not yet performant. This is classic Spark with
>>>>>> Python/Pyspark.
>>>>>>
>>>>>>
>>>>>> Also I would like to know the state of Spark with Volcano. Has
>>>>>> progress been made on that front?
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>>
>>>>>> Mich
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>> may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Jupyter notebook on Dataproc versus GKE

2022-09-05 Thread Bjørn Jørgensen
Jupyter Notebook is replaced with JupyterLab :)

On Mon, 5 Sep 2022 at 21:10, Holden Karau wrote:

>
>
> On Mon, Sep 5, 2022 at 9:00 AM Mich Talebzadeh 
> wrote:
>
>> Thanks for that.
>>
>> How do you rate the performance of Jupyter W/Spark on K8s compared to the
>> same on  a cluster of VMs (example Dataproc).
>>
>> Also, a somewhat related question (maybe naive as well). For example,
>> Google offers a lot of standard ML libraries for example built into a data
>> warehouse like BigQuery. What does the Jupyter notebook offer that others
>> don't?
>>
> Jupyter notebook doesn’t offer any particular set of libraries, although
> you can add your own to the container etc.
>
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 5 Sept 2022 at 12:47, Holden Karau  wrote:
>>
>>> I’ve run Jupyter w/Spark on K8s, haven’t tried it with Dataproc
>>> personally.
>>>
>>> The Spark K8s pod scheduler is now more pluggable for Yunikorn and
>>> Volcano can be used with less effort.
>>>
>>> On Mon, Sep 5, 2022 at 7:44 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>>
>>>> Has anyone got experience of running Jupyter on dataproc versus Jupyter
>>>> notebook on GKE (k8).
>>>>
>>>>
>>>> I have not looked at this for a while but my understanding is that
>>>> Spark on GKE/k8s is not yet performant. This is classic Spark with
>>>> Python/Pyspark.
>>>>
>>>>
>>>> Also I would like to know the state of Spark with Volcano. Has progress
>>>> been made on that front?
>>>>
>>>>
>>>> Regards,
>>>>
>>>>
>>>> Mich
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Welcoming three new PMC members

2022-08-10 Thread Bjørn Jørgensen
Congratulations :)

On Tue, 9 Aug 2022 at 18:40, Xiao Li wrote:

> Hi all,
>
> The Spark PMC recently voted to add three new PMC members. Join me in
> welcoming them to their new roles!
>
> New PMC members: Huaxin Gao, Gengliang Wang and Maxim Gekk
>
> The Spark PMC
>


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Welcome Xinrong Meng as a Spark committer

2022-08-10 Thread Bjørn Jørgensen
Congratulations :)

On Tue, 9 Aug 2022 at 10:13, Hyukjin Kwon wrote:

> Hi all,
>
> The Spark PMC recently added Xinrong Meng as a committer on the project.
> Xinrong is a major contributor to PySpark, especially the Pandas API on Spark.
> She has guided a lot of new contributors enthusiastically. Please join me
> in welcoming Xinrong!
>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Apache Spark 3.2.2 Release?

2022-07-06 Thread Bjørn Jørgensen
+1

On Wed, 6 Jul 2022 at 23:05, Hyukjin Kwon wrote:

> Yeah +1
>
> On Thu, Jul 7, 2022 at 5:40 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Since the Apache Spark 3.2.1 tag creation (Jan 19), 197 new patches,
>> including 11 correctness patches, have arrived at branch-3.2.
>>
>> Shall we make a new release, Apache Spark 3.2.2, as the third release
>> in the 3.2 line? I'd like to volunteer as the release manager for Apache
>> Spark 3.2.2. I'm thinking about starting the first RC next week.
>>
>> $ git log --oneline v3.2.1..HEAD | wc -l
>>  197
>>
>> # Correctness issues
>>
>> SPARK-38075 Hive script transform with order by and limit will
>> return fake rows
>> SPARK-38204 All state operators are at a risk of inconsistency
>> between state partitioning and operator partitioning
>> SPARK-38309 SHS has incorrect percentiles for shuffle read bytes
>> and shuffle total blocks metrics
>> SPARK-38320 (flat)MapGroupsWithState can timeout groups which just
>> received inputs in the same microbatch
>> SPARK-38614 After Spark update, df.show() shows incorrect
>> F.percent_rank results
>> SPARK-38655 OffsetWindowFunctionFrameBase cannot find the offset
>> row whose input is not null
>> SPARK-38684 Stream-stream outer join has a possible correctness
>> issue due to weakly read consistent on outer iterators
>> SPARK-39061 Incorrect results or NPE when using Inline function
>> against an array of dynamically created structs
>> SPARK-39107 Silent change in regexp_replace's handling of empty
>> strings
>> SPARK-39259 Timestamps returned by now() and equivalent functions
>> are not consistent in subqueries
>> SPARK-39293 The accumulator of ArrayAggregate should copy the
>> intermediate result if string, struct, array, or map
>>
>> Best,
>> Dongjoon.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: The draft of the Spark 3.3.0 release notes

2022-06-03 Thread Bjørn Jørgensen
- Support lambda `column` parameter of `DataFrame.rename` (SPARK-38763
  <https://issues.apache.org/jira/browse/SPARK-38763>)


This worked before 18 Jan 2022; see the JIRA issue for more information.
I think we can remove this one from the list.
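
For reference, a minimal sketch of the call this item describes (the frame
and the column names are made up for illustration):

import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2], "b": [3, 4]})
# SPARK-38763: rename accepts a callable for `columns`, applied to every
# column label.
renamed = psdf.rename(columns=lambda c: c.upper())
print(renamed.columns)  # expected: Index(['A', 'B'], dtype='object')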

On Fri, 3 Jun 2022 at 20:42, Dongjoon Hyun wrote:

> You are right.
>
> After SPARK-36837, we tried to ship Apache Spark 3.3.0 with Apache Kafka
> 3.1.1 via the following PR.
>
> https://github.com/apache/spark/pull/36135
> [WIP][SPARK-38850][BUILD] Upgrade Kafka to 3.1.1
>
> However, the final decision was to revert it from `branch-3.3` and move
> directly to Apache Kafka 3.2.0 at `master` branch. We need to remove it
> from the 3.3.0 release note.
>
>
> On Fri, Jun 3, 2022 at 9:54 AM Koert Kuipers  wrote:
>
>> I thought SPARK-36837 didn't make it in? I see it in the notes.
>>
>> On Fri, Jun 3, 2022 at 4:31 AM Maxim Gekk
>>  wrote:
>>
>>> Hi All,
>>>
>>> I am preparing the release notes of Spark 3.3.0. Here is a draft
>>> document:
>>>
>>> https://docs.google.com/document/d/1gGySrLGvIK8bajKdGjTI_mDqk0-YPvHmPN64YjoWfOQ/edit?usp=sharing
>>>
>>> Please take a look and let me know if I missed any major changes or
>>> something.
>>>
>>> Maxim Gekk
>>>
>>> Software Engineer
>>>
>>> Databricks, Inc.
>>>
>>
>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Introducing "Pandas API on Spark" component in JIRA, and use "PS" PR title component

2022-05-18 Thread Bjørn Jørgensen
+1
But can we have the PR title and the PR label be the same: PS?

On Wed, 18 May 2022 at 18:57, Xinrong Meng wrote:

> Great!
>
> It saves us from always specifying "Pandas API on Spark" in PR titles.
>
> Thanks!
>
>
> Xinrong Meng
>
> Software Engineer
>
> Databricks
>
>
> On Tue, May 17, 2022 at 1:08 AM Maciej  wrote:
>
>> Sounds good!
>>
>> +1
>>
>> On 5/17/22 06:08, Yikun Jiang wrote:
>> > It's a pretty good idea, +1.
>> >
>> > To be clear in Github:
>> >
>> > - For each PR title: [SPARK-XXX][PYTHON][PS] The pandas-on-Spark PR
>> > title (*still keep [PYTHON]*; [PS] is newly added)
>> >
>> > - For the PR label: `PANDAS API ON SPARK` is newly added; still keep
>> > `PYTHON`, `CORE`
>> > https://github.com/apache/spark/pull/36574
>> > <https://github.com/apache/spark/pull/36574>
>> >
>> > Right?
>> >
>> > Regards,
>> > Yikun
>> >
>> >
>> > On Tue, May 17, 2022 at 11:26 AM Hyukjin Kwon > > <mailto:gurwls...@gmail.com>> wrote:
>> >
>> > Hi all,
>> >
>> > What if we introduced a component in JIRA, "Pandas API on Spark",
>> > and used "PS" (pandas-on-Spark) in PR titles? We already use "ps" in
>> > many places when we do: import pyspark.pandas as ps.
>> > This is similar to "Structured Streaming" in JIRA, and "SS" in PR
>> title.
>> >
>> > I think it'd be easier to track the changes here with that.
>> > Currently it's a bit difficult to identify it from pure PySpark
>> changes.
>> >
>>
>>
>> --
>> Best regards,
>> Maciej Szymkiewicz
>>
>> Web: https://zero323.net
>> PGP: A30CEF0C31A501EC
>>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Issue on Spark on K8s with Proxy user on Kerberized HDFS : Spark-25355

2022-05-03 Thread Bjørn Jørgensen
>> //wiki.apache.org/hadoop/ConnectionRefused
>>>>
>>>> at
>>>> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>>> Method)
>>>>
>>>> at
>>>> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown
>>>> Source)
>>>>
>>>> at
>>>> java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
>>>> Source)
>>>>
>>>> at java.base/java.lang.reflect.Constructor.newInstance(Unknown
>>>> Source)
>>>>
>>>> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
>>>>
>>>> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:755)
>>>>
>>>> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1501)
>>>>
>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1443)
>>>>
>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1353)
>>>>
>>>> at
>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
>>>>
>>>> at
>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>>>>
>>>> at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
>>>>
>>>> at
>>>>
>>>>
>>>>
>>>> On deeper debugging, we found the proxy user doesn't have access to
>>>> delegation tokens in the K8s case. SparkSubmit.submit explicitly creates
>>>> the proxy user, and this user doesn't have a delegation token.
>>>>
>>>>
>>>> Please help me with the same.
>>>>
>>>>
>>>> Regards
>>>>
>>>> Pralabh Kumar
>>>>
>>>>
>>>>
>>>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Tools for regression testing

2022-03-24 Thread Bjørn Jørgensen
At the wikipedia regression testing page
https://en.wikipedia.org/wiki/Regression_testing
<https://en.wikipedia.org/wiki/Regression_testing>
Under use
"
Regression tests can be broadly categorized as functional tests
<https://en.wikipedia.org/wiki/Functional_test> or unit tests
<https://en.wikipedia.org/wiki/Unit_testing>. Functional tests exercise the
complete program with various inputs. Unit tests exercise individual
functions, subroutines <https://en.wikipedia.org/wiki/Subroutine>, or
object methods. Both functional testing tools and unit-testing tools tend
to be automated and are often third-party products that are not part of the
compiler suite. A functional test may be a scripted series of program
inputs, possibly even involving an automated mechanism for controlling
mouse movements and clicks. A unit test may be a set of separate functions
within the code itself or a driver layer that links to the code without
altering the code being tested.
"

When you change or add anything in Spark, you fork the git repo, make a
new branch, make your changes, and push the branch to your forked repo.

"Push commits to your branch. This will trigger “Build and test” and
“Report test results” workflows on your forked repository and start testing
and validating your changes."
This is step 7 of how to contribute to Spark, under Pull request
<https://spark.apache.org/contributing.html>


So, to answer your question: yes, every change is tested before the change
[PR] goes to the master branch.
We don't have unit tests for everything, so some things have to be tested
manually after building.
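
To make the unit-test side of this concrete, here is a minimal sketch in
the unittest style PySpark's own test suites use (the function under test
is hypothetical, not Spark code):

import unittest

def add_prefix(name):
    # Hypothetical function under test.
    return "spark_" + name

class AddPrefixTest(unittest.TestCase):
    def test_add_prefix(self):
        # Once merged, a test like this doubles as a regression test: a
        # later change that breaks the behavior fails CI on the pull request.
        self.assertEqual(add_prefix("job"), "spark_job")

if __name__ == "__main__":
    unittest.main()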




On Thu, 24 Mar 2022 at 20:47, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> good point.
>
> I just wanted to know when we do changes to releases or RC, is there some
> mechanism that ensures the Spark release still functions as expected
> after any code changes, updates etc?
>
> For example there was a recent discussion about Kafka upgrade to 3.x with
> Spark upgrade to 3.x and its likely impact. Integration testing can be
> achieved through CI/CD which I believe Spark relied on Jenkins until
> recently.
>
> HTH
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 24 Mar 2022 at 19:38, Sean Owen  wrote:
>
>> Hm, then what are you looking for besides all the tests in Spark?
>>
>> On Thu, Mar 24, 2022, 2:34 PM Mich Talebzadeh 
>> wrote:
>>
>>> Thanks
>>>
>>> I know what unit testing is. The question was not about unit testing; it
>>> was specifically about regression testing
>>> <https://katalon.com/resources-center/blog/regression-testing#:~:text=Regression%20testing%20is%20a%20software,functionality%20of%20the%20existing%20features.>
>>> artifacts.
>>>
>>>
>>> cheers,
>>>
>>>
>>> Mich
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 24 Mar 2022 at 19:02, Bjørn Jørgensen 
>>> wrote:
>>>
>>>> Yes, Spark uses unit tests.
>>>>
>>>> https://app.codecov.io/gh/apache/spark
>>>>
>>>> https://en.wikipedia.org/wiki/Unit_testing
>>>>
>>>>
>>>>
>>>> On Mon, 21 Mar 2022 at 15:46, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> As a matter of interest do Spark releases deploy a specific regression
>>>>> testing tool?
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Bjørn Jørgensen
>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>> Norge
>>>>
>>>> +47 480 94 297
>>>>
>>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Tools for regression testing

2022-03-24 Thread Bjørn Jørgensen
Yes, Spark uses unit tests.

https://app.codecov.io/gh/apache/spark

https://en.wikipedia.org/wiki/Unit_testing



On Mon, 21 Mar 2022 at 15:46, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> As a matter of interest do Spark releases deploy a specific regression
> testing tool?
>
> Thanks
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: One click to run Spark on Kubernetes

2022-02-23 Thread Bjørn Jørgensen
So, if I get this right, you will make a Helm <https://helm.sh> chart to
deploy Spark and some other components on K8s?

On Wed, 23 Feb 2022 at 17:49, bo yang wrote:

> Hi Sarath, let's follow up offline on this.
>
> On Wed, Feb 23, 2022 at 8:32 AM Sarath Annareddy <
> sarath.annare...@gmail.com> wrote:
>
>> Hi bo
>>
>> How do we start?
>>
>> Is there a plan? Onboarding, Arch/design diagram, tasks lined up etc
>>
>>
>> Thanks
>> Sarath
>>
>>
>> Sent from my iPhone
>>
>> On Feb 23, 2022, at 10:27 AM, bo yang  wrote:
>>
>> 
>> Hi Sarath, thanks for your interest and willing to contribute! The
>> project supports local development using MiniKube. Similarly there is a one
>> click command with one extra argument to deploy all components in MiniKube,
>> and people could use that to develop on their local MacBook.
>>
>>
>> On Wed, Feb 23, 2022 at 7:41 AM Sarath Annareddy <
>> sarath.annare...@gmail.com> wrote:
>>
>>> Hi bo
>>>
>>> I am interested to contribute.
>>> But I don't have free access to any cloud provider, and I am not sure
>>> how I can get it. I know Google, AWS, and Azure only provide temporary
>>> free access; it may not be sufficient.
>>>
>>> Guidance is appreciated.
>>>
>>> Sarath
>>>
>>> Sent from my iPhone
>>>
>>> On Feb 23, 2022, at 2:01 AM, bo yang  wrote:
>>>
>>> 
>>>
>>> Right, normally people start with a simple script, then add more stuff,
>>> like permissions and more components. After some time, people want to run
>>> the script consistently in different environments. Things then become
>>> complex.
>>>
>>> That is why we want to see whether people have interest for such a "one
>>> click" tool to make things easy.
>>>
>>>
>>> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> There are two distinct actions here; namely Deploy and Run.
>>>>
>>>> Deployment can be done by a command line script with autoscaling. In the
>>>> newer versions of Kubernetes you don't even need to specify the node
>>>> types; you can leave it to the Kubernetes cluster to scale up and down
>>>> and decide on the node type.
>>>>
>>>> The second point is running the Spark job that you will need to submit.
>>>> However, that depends on setting up access permissions, the use of
>>>> service accounts, and pulling the correct Dockerfiles for the driver and
>>>> the executors. Those details add to the complexity.
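
As a rough illustration of the settings involved, a Spark-on-K8s submission
typically carries configuration like the following (a sketch only; the API
server address, image tag, and service account name are placeholders, not
values from this thread):

from pyspark import SparkConf

conf = (
    SparkConf()
    # Placeholder Kubernetes API server address.
    .setMaster("k8s://https://<k8s-apiserver>:443")
    # Container image used for the driver and executors (placeholder tag).
    .set("spark.kubernetes.container.image", "apache/spark:3.2.1")
    # Service account that allows the driver to create executor pods.
    .set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .set("spark.executor.instances", "2")
)
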
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>>>
>>>>> Hi Spark Community,
>>>>>
>>>>> We built an open source tool to deploy and run Spark on Kubernetes
>>>>> with a one click command. For example, on AWS, it could automatically
>>>>> create an EKS cluster, node group, NGINX ingress, and Spark Operator. Then
>>>>> you will be able to use curl or a CLI tool to submit Spark application.
>>>>> After the deployment, you could also install Uber Remote Shuffle
>>>>> Service to enable Dynamic Allocation on Kubernetes.
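
For context, a minimal sketch (not from this thread) of the Spark settings
that dynamic allocation on Kubernetes involves; shuffle tracking is the
built-in alternative when no remote shuffle service is installed, and the
values here are illustrative:

from pyspark import SparkConf

conf = (
    SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    # Without an external or remote shuffle service, shuffle tracking lets
    # Spark scale executors while keeping shuffle files reachable.
    .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    # Illustrative upper bound on executors.
    .set("spark.dynamicAllocation.maxExecutors", "10")
)
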
>>>>>
>>>>> Anyone interested in using or working together on such a tool?
>>>>>
>>>>> Thanks,
>>>>> Bo
>>>>>
>>>>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: [ANNOUNCE] Apache Spark 3.1.3 released + Docker images

2022-02-22 Thread Bjørn Jørgensen
>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, 21 Feb 2022 at 21:58, Holden Karau 
>>>> wrote:
>>>>
>>>>> My bad, the correct link is:
>>>>>
>>>>> https://hub.docker.com/r/apache/spark/tags
>>>>>
>>>>> On Mon, Feb 21, 2022 at 1:17 PM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Well, that Docker link is not found! Maybe a permission issue.
>>>>>>
>>>>>> [image: image.png]
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>> may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, 21 Feb 2022 at 21:09, Holden Karau 
>>>>>> wrote:
>>>>>>
>>>>>>> We are happy to announce the availability of Spark 3.1.3!
>>>>>>>
>>>>>>> Spark 3.1.3 is a maintenance release containing stability fixes. This
>>>>>>> release is based on the branch-3.1 maintenance branch of Spark. We
>>>>>>> strongly
>>>>>>> recommend all 3.1 users to upgrade to this stable release.
>>>>>>>
>>>>>>> To download Spark 3.1.3, head over to the download page:
>>>>>>> https://spark.apache.org/downloads.html
>>>>>>>
>>>>>>> To view the release notes:
>>>>>>> https://spark.apache.org/releases/spark-release-3-1-3.html
>>>>>>>
>>>>>>> We would like to acknowledge all community members for contributing
>>>>>>> to this
>>>>>>> release. This release would not have been possible without you.
>>>>>>>
>>>>>>> *New Dockerhub magic in this release:*
>>>>>>>
>>>>>>> We've also started publishing docker containers to the Apache
>>>>>>> Dockerhub,
>>>>>>> these contain non-ASF artifacts that are subject to different
>>>>>>> license terms than the
>>>>>>> Spark release. The docker containers are built for Linux x86 and
>>>>>>> ARM64 since that's
>>>>>>> what I have access to (thanks to NV for the ARM64 machines).
>>>>>>>
>>>>>>> You can get them from https://hub.docker.com/apache/spark (and
>>>>>>> spark-r and spark-py) :)
>>>>>>> (And version 3.2.1 is also now published on Dockerhub).
>>>>>>>
>>>>>>> Holden
>>>>>>>
>>>>>>> --
>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: [VOTE] Release Spark 3.2.1 (RC2)

2022-01-21 Thread Bjørn Jørgensen
OK, but deleting users' data without them knowing is never a good idea.
That's why I give this RC a -1.

On Sat, 22 Jan 2022 at 00:16, Sean Owen wrote:

> (Bjorn - unless this is a regression, it would not block a release, even
> if it's a bug)
>
> On Fri, Jan 21, 2022 at 5:09 PM Bjørn Jørgensen 
> wrote:
>
>> [x] -1 Do not release this package, because it deletes all of my columns
>> that contain only nulls.
>>
>> I have opened https://issues.apache.org/jira/browse/SPARK-37981 for this
>> bug.
>>
>>
>>
>>
>> On Fri, 21 Jan 2022 at 21:45, Sean Owen wrote:
>>
>>> (Are you suggesting this is a regression, or is it a general question?
>>> here we're trying to figure out whether there are critical bugs introduced
>>> in 3.2.1 vs 3.2.0)
>>>
>>> On Fri, Jan 21, 2022 at 1:58 PM Bjørn Jørgensen <
>>> bjornjorgen...@gmail.com> wrote:
>>>
>>>> Hi, I am wondering whether this is a bug or not.
>>>>
>>>> I have a lot of JSON files in which some columns are all null.
>>>>
>>>> I start Spark with:
>>>>
>>>> from pyspark import pandas as ps
>>>> import re
>>>> import numpy as np
>>>> import os
>>>> import pandas as pd
>>>>
>>>> from pyspark import SparkContext, SparkConf
>>>> from pyspark.sql import SparkSession
>>>> from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
>>>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>>>>
>>>> os.environ["PYARROW_IGNORE_TIMEZONE"]="1"
>>>>
>>>> def get_spark_session(app_name: str, conf: SparkConf):
>>>>     conf.setMaster('local[*]')
>>>>     conf \
>>>>       .set('spark.driver.memory', '64g') \
>>>>       .set("fs.s3a.access.key", "minio") \
>>>>       .set("fs.s3a.secret.key", "") \
>>>>       .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
>>>>       .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>>>>       .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>>>>       .set("spark.sql.repl.eagerEval.enabled", "True") \
>>>>       .set("spark.sql.adaptive.enabled", "True") \
>>>>       .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>>>>       .set("spark.sql.repl.eagerEval.maxNumRows", "1") \
>>>>       .set("sc.setLogLevel", "error")
>>>>
>>>>     return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
>>>>
>>>> spark = get_spark_session("Falk", SparkConf())
>>>>
>>>> d3 =
>>>> spark.read.option("multiline","true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
>>>>
>>>> import pyspark
>>>> def sparkShape(dataFrame):
>>>>     return (dataFrame.count(), len(dataFrame.columns))
>>>> pyspark.sql.dataframe.DataFrame.shape = sparkShape
>>>> print(d3.shape())
>>>>
>>>>
>>>> (653610, 267)
>>>>
>>>>
>>>> d3.write.json("d3.json")
>>>>
>>>>
>>>> d3 = spark.read.json("d3.json/*.json")
>>>>
>>>> import pyspark
>>>> def sparkShape(dataFrame):
>>>>     return (dataFrame.count(), len(dataFrame.columns))
>>>> pyspark.sql.dataframe.DataFrame.shape = sparkShape
>>>> print(d3.shape())
>>>>
>>>> (653610, 186)
>>>>
>>>>
>>>> So Spark is dropping 81 columns. I think all of these 81 dropped
>>>> columns contain only nulls.
>>>>
>>>> Is this a bug, or is it intentional?
>>>>
>>>>
>>>> fre. 21. jan. 2022 kl. 04:59 skrev huaxin gao :
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 3.2.1. The vote is open until 8:00pm Pacific time January 25 and
>>>>> passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. 
>>>>> [
>>

Re: [VOTE] Release Spark 3.2.1 (RC2)

2022-01-21 Thread Bjørn Jørgensen
[x] -1 Do not release this package, because it deletes all of my columns
that contain only nulls.

I have opened https://issues.apache.org/jira/browse/SPARK-37981 for this
bug.




On Fri, 21 Jan 2022 at 21:45, Sean Owen wrote:

> (Are you suggesting this is a regression, or is it a general question?
> here we're trying to figure out whether there are critical bugs introduced
> in 3.2.1 vs 3.2.0)
>
> On Fri, Jan 21, 2022 at 1:58 PM Bjørn Jørgensen 
> wrote:
>
>> Hi, I am wondering whether this is a bug or not.
>>
>> I have a lot of JSON files in which some columns are all null.
>>
>> I start Spark with:
>>
>> from pyspark import pandas as ps
>> import re
>> import numpy as np
>> import os
>> import pandas as pd
>>
>> from pyspark import SparkContext, SparkConf
>> from pyspark.sql import SparkSession
>> from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>>
>> os.environ["PYARROW_IGNORE_TIMEZONE"]="1"
>>
>> def get_spark_session(app_name: str, conf: SparkConf):
>>     conf.setMaster('local[*]')
>>     conf \
>>       .set('spark.driver.memory', '64g') \
>>       .set("fs.s3a.access.key", "minio") \
>>       .set("fs.s3a.secret.key", "") \
>>       .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
>>       .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>>       .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>>       .set("spark.sql.repl.eagerEval.enabled", "True") \
>>       .set("spark.sql.adaptive.enabled", "True") \
>>       .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>>       .set("spark.sql.repl.eagerEval.maxNumRows", "1") \
>>       .set("sc.setLogLevel", "error")
>>
>>     return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
>>
>> spark = get_spark_session("Falk", SparkConf())
>>
>> d3 =
>> spark.read.option("multiline","true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
>>
>> import pyspark
>> def sparkShape(dataFrame):
>>     return (dataFrame.count(), len(dataFrame.columns))
>> pyspark.sql.dataframe.DataFrame.shape = sparkShape
>> print(d3.shape())
>>
>>
>> (653610, 267)
>>
>>
>> d3.write.json("d3.json")
>>
>>
>> d3 = spark.read.json("d3.json/*.json")
>>
>> import pyspark
>> def sparkShape(dataFrame):
>>     return (dataFrame.count(), len(dataFrame.columns))
>> pyspark.sql.dataframe.DataFrame.shape = sparkShape
>> print(d3.shape())
>>
>> (653610, 186)
>>
>>
>> So Spark is dropping 81 columns. I think all of these 81 dropped columns
>> contain only nulls.
>>
>> Is this a bug, or is it intentional?
>>
>>
>> On Fri, 21 Jan 2022 at 04:59, huaxin gao wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.2.1. The vote is open until 8:00 PM Pacific time January 25 and passes
>>> if a majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.2.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.2.1-rc2 (commit
>>> 4f25b3f71238a00508a356591553f2dfa89f8290):
>>> https://github.com/apache/spark/tree/v3.2.1-rc2
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1398/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-docs/_site/
>>> The list of bug fixes going into 3.2.1 can be found at the following URL:
>>> https://s.apache.org/yu0cy
>>>
>>> This release is using the release script of the tag v3.2.1-rc

Re: [VOTE] Release Spark 3.2.1 (RC2)

2022-01-21 Thread Bjørn Jørgensen
Hi, I am wondering whether this is a bug or not.

I have a lot of JSON files in which some columns are all null.

I start Spark with:

from pyspark import pandas as ps
import re
import numpy as np
import os
import pandas as pd

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

os.environ["PYARROW_IGNORE_TIMEZONE"]="1"

def get_spark_session(app_name: str, conf: SparkConf):
    conf.setMaster('local[*]')
    conf \
      .set('spark.driver.memory', '64g') \
      .set("fs.s3a.access.key", "minio") \
      .set("fs.s3a.secret.key", "") \
      .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
      .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
      .set("spark.hadoop.fs.s3a.path.style.access", "true") \
      .set("spark.sql.repl.eagerEval.enabled", "True") \
      .set("spark.sql.adaptive.enabled", "True") \
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
      .set("spark.sql.repl.eagerEval.maxNumRows", "1") \
      .set("sc.setLogLevel", "error")

    return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()

spark = get_spark_session("Falk", SparkConf())

d3 =
spark.read.option("multiline","true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")

import pyspark
def sparkShape(dataFrame):
    return (dataFrame.count(), len(dataFrame.columns))
pyspark.sql.dataframe.DataFrame.shape = sparkShape
print(d3.shape())


(653610, 267)


d3.write.json("d3.json")


d3 = spark.read.json("d3.json/*.json")

import pyspark
def sparkShape(dataFrame):
    return (dataFrame.count(), len(dataFrame.columns))
pyspark.sql.dataframe.DataFrame.shape = sparkShape
print(d3.shape())

(653610, 186)


So Spark is dropping 81 columns. I think all of these 81 dropped columns
contain only nulls.

Is this a bug, or is it intentional?
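
A plausible explanation, with a workaround sketch: by default Spark's JSON
writer skips null fields, so an all-null column leaves no trace in the
written files, and schema inference on re-read cannot recover it. Keeping
nulls on write, or reusing the original schema on read, should preserve
the columns (paths as in the example above):

# Option 1: keep null fields in the written JSON
# (the JSON data source option ignoreNullFields defaults to true).
d3.write.option("ignoreNullFields", "false").json("d3.json")

# Option 2: skip schema inference and re-read with the original schema.
d3_back = spark.read.schema(d3.schema).json("d3.json/*.json")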


On Fri, 21 Jan 2022 at 04:59, huaxin gao wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.2.1. The vote is open until 8:00 PM Pacific time January 25 and passes
> if a majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.2.1-rc2 (commit
> 4f25b3f71238a00508a356591553f2dfa89f8290):
> https://github.com/apache/spark/tree/v3.2.1-rc2
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1398/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-docs/_site/
> The list of bug fixes going into 3.2.1 can be found at the following URL:
> https://s.apache.org/yu0cy
>
> This release is using the release script of the tag v3.2.1-rc2.
>
> FAQ
>
> =========================
> How can I help test this release?
> =========================
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions. If you're working in PySpark you can set up a
> virtual env and install the current RC and see if anything important
> breaks; in Java/Scala you can add the staging repository to your project's
> resolvers and test with the RC (make sure to clean up the artifact cache
> before/after so you don't end up building with an out-of-date RC going
> forward).
>
> ===========================================================
> What should happen to JIRA tickets still targeting 3.2.1?
> ===========================================================
>
> The current list of open tickets targeted at 3.2.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.2.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should be
> worked on immediately. Everything else please retarget to an appropriate
> release.
>
> ==========================
> But my bug isn't fixed?
> ==========================
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from the previous release. That
> being said, if there is something which is a regression that has not been
> correctly targeted please ping me or a committer to help target the issue.
>


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: [VOTE] Release Spark 3.2.1 (RC1)

2022-01-15 Thread Bjørn Jørgensen
integration tests passed.
>>>>>
>>>>>
>>>>>
>>>>> Qian
>>>>>
>>>>>
>>>>>
>>>>> On 11 Jan 2022 at 2:09 AM, huaxin gao wrote:
>>>>>
>>>>>
>>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 3.2.1.
>>>>>
>>>>>
>>>>> The vote is open until Jan. 13th at 12 PM PST (8 PM UTC) and passes if
>>>>> a majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 3.2.1
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>
>>>>> There are currently no issues targeting 3.2.1 (try project = SPARK AND
>>>>> "Target Version/s" = "3.2.1" AND status in (Open, Reopened, "In
>>>>> Progress"))
>>>>>
>>>>> The tag to be voted on is v3.2.1-rc1 (commit
>>>>> 2b0ee226f8dd17b278ad11139e62464433191653):
>>>>>
>>>>> https://github.com/apache/spark/tree/v3.2.1-rc1
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc1-bin/
>>>>>
>>>>> Signatures used for Spark RCs can be found in this file:
>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1395/
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc1-docs/
>>>>>
>>>>> The list of bug fixes going into 3.2.1 can be found at the following
>>>>> URL:
>>>>> https://s.apache.org/7tzik
>>>>>
>>>>> This release is using the release script of the tag v3.2.1-rc1.
>>>>>
>>>>> FAQ
>>>>>
>>>>>
>>>>> =
>>>>> How can I help test this release?
>>>>> =
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>> the current RC and see if anything important breaks, in the Java/Scala
>>>>> you can add the staging repository to your projects resolvers and test
>>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>>> you don't end up building with an out of date RC going forward).
>>>>>
>>>>> ===
>>>>> What should happen to JIRA tickets still targeting 3.2.1?
>>>>> ===
>>>>>
>>>>> The current list of open tickets targeted at 3.2.1 can be found at:
>>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>>> Version/s" = 3.2.1
>>>>>
>>>>> Committers should look at those and triage. Extremely important bug
>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>> be worked on immediately. Everything else please retarget to an
>>>>> appropriate release.
>>>>>
>>>>> ==
>>>>> But my bug isn't fixed?
>>>>> ==
>>>>>
>>>>> In order to make timely releases, we will typically not hold the
>>>>> release unless the bug in question is a regression from the previous
>>>>> release. That being said, if there is something which is a regression
>>>>> that has not been correctly targeted please ping me or a committer to
>>>>> help target the issue.
>>>>>
>>>>>
>>>>>
>>>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297