Re: Spark Compatibility with Spring Boot 3.x

2023-10-05 Thread Sean Owen
I think we already updated this in Spark 4. However, for now you would also have to include a JAR with the jakarta.* classes instead. You are welcome to try Spark 4 now by building from master, but it's far from release. On Thu, Oct 5, 2023 at 11:53 AM Ahmed Albalawi wrote: > Hello team, > > We

Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Sean Owen
I think the announcement mentioned there were some issues with pypi and the upload size this time. I am sure it's intended to be there when possible. On Wed, Sep 20, 2023, 3:00 PM Kezhi Xiong wrote: > Hi, > > Are there any plans to upload PySpark 3.5.0 to PyPI ( >

Re: Discrepancy in sample standard deviation between PySpark and Excel

2023-09-20 Thread Sean Owen
and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destructio

Re: Discrepancy in sample standard deviation between PySpark and Excel

2023-09-19 Thread Sean Owen
Pyspark follows SQL databases here. stddev is stddev_samp, and sample standard deviation is the calculation with the Bessel correction, n-1 in the denominator. stddev_pop is simply standard deviation, with n in the denominator. On Tue, Sep 19, 2023 at 7:13 AM Helene Bøe wrote: > Hi! > > > > I
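The Bessel-correction distinction can be checked outside Spark with Python's standard `statistics` module, whose `stdev`/`pstdev` mirror `stddev_samp`/`stddev_pop` (a small illustration with made-up data, not Spark code):

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
mean = sum(data) / n  # 5.0

# Sample standard deviation: Bessel correction, n-1 in the denominator.
# This is what Spark's stddev / stddev_samp (and Excel's STDEV.S) compute.
samp = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

# Population standard deviation: n in the denominator.
# This is what Spark's stddev_pop (and Excel's STDEV.P) compute.
pop = math.sqrt(sum((x - mean) ** 2 for x in data) / n)

print(samp)  # ~2.138
print(pop)   # 2.0

# The stdlib agrees: stdev is the sample form, pstdev the population form.
assert abs(samp - statistics.stdev(data)) < 1e-12
assert abs(pop - statistics.pstdev(data)) < 1e-12
```

The two differ most on small samples; as n grows the n-1 versus n distinction becomes negligible.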

Re: getting emails in different order!

2023-09-18 Thread Sean Owen
I have seen this, and not sure if it's just the ASF mailer being weird, or more likely, because emails are moderated and we inadvertently moderate them out of order On Mon, Sep 18, 2023 at 10:59 AM Mich Talebzadeh wrote: > Hi, > > I use gmail to receive spark user group emails. > > On

Re: Spark stand-alone mode

2023-09-15 Thread Sean Owen
Yes, should work fine, just set up according to the docs. There needs to be network connectivity between whatever the driver node is and these 4 nodes. On Thu, Sep 14, 2023 at 11:57 PM Ilango wrote: > > Hi all, > > We have 4 HPC nodes and installed spark individually in all nodes. > > Spark is

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Sean Owen
same issue. > > > org.elasticsearch > elasticsearch-spark-30_${scala.compat.version} > 7.12.1 > > > > On Fri, Sep 8, 2023 at 4:41 AM Sean Owen wrote: > >> By marking it provided, you are not including this dependency with your >> app. If it is also

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Sean Owen
By marking it provided, you are not including this dependency with your app. If it is also not somehow already provided by your spark cluster (this is what it means), then yeah this is not anywhere on the class path at runtime. Remove the provided scope. On Thu, Sep 7, 2023, 4:09 PM Dipayan Dev

Re: Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Sean Owen
of some other dependency. > > > > *From:* Sean Owen > *Sent:* Thursday, August 31, 2023 5:10 PM > *To:* Agrawal, Sanket > *Cc:* user@spark.apache.org > *Subject:* [EXT] Re: Okio Vulnerability in Spark 3.4.1 > > > > Does the vulnerability affect Spark? >

Re: Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Sean Owen
Does the vulnerability affect Spark? In any event, have you tried updating Okio in the Spark build? I don't believe you could just replace the JAR, as other libraries probably rely on it and were compiled against the current version. On Thu, Aug 31, 2023 at 6:02 AM Agrawal, Sanket wrote: > Hi All, >

Re: error trying to save to database (Phoenix)

2023-08-21 Thread Sean Owen
Looks like spark 3.4.1 (my version) uses Scala 2.12 > How do I specify the scala version? > > On Mon, Aug 21, 2023 at 4:47 PM Sean Owen wrote: > >> That's a mismatch in the version of scala that your library uses vs spark >> uses. >> >> On Mon, Aug 21, 2023, 6:

Re: error trying to save to database (Phoenix)

2023-08-21 Thread Sean Owen
That's a mismatch in the version of scala that your library uses vs spark uses. On Mon, Aug 21, 2023, 6:46 PM Kal Stevens wrote: > I am having a hard time figuring out what I am doing wrong here. > I am not sure if I have an incompatible version of something installed or > something else. > I

Re: Spark Vulnerabilities

2023-08-14 Thread Sean Owen
Yeah, we generally don't respond to "look at the output of my static analyzer". Some of these are already addressed in a later version. Some don't affect Spark. Some are possibly an issue but hard to change without breaking lots of things - they are really issues with upstream dependencies. But

Re: conver panda image column to spark dataframe

2023-08-03 Thread Sean Owen
pp4 has one row, I'm guessing - containing an array of 10 images. You want 10 rows of 1 image each. But, just don't do this. Pass the bytes of the image as an array, along with width/height/channels, and reshape it on use. It's just easier. That is how the Spark image representation works anyway

Re: Interested in contributing to SPARK-24815

2023-08-03 Thread Sean Owen
According to the ASF Source Header and Copyright Notice Policy[1], code >>> directly submitted to ASF should include the Apache license header >>> without any additional copyright notice. >>> >>> >>> Kent Yao >>> >>> [1] >>> https://u

Re: spark context list_packages()

2023-07-27 Thread Sean Owen
There is no such method in Spark. I think that's some EMR-specific modification. On Wed, Jul 26, 2023 at 11:06 PM second_co...@yahoo.com.INVALID wrote: > I ran the following code > > spark.sparkContext.list_packages() > > on spark 3.4.1 and i get below error > > An error was encountered: >

Re: Interested in contributing to SPARK-24815

2023-07-24 Thread Sean Owen
When contributing to an ASF project, it's governed by the terms of the ASF ICLA: https://www.apache.org/licenses/icla.pdf or CCLA: https://www.apache.org/licenses/cla-corporate.pdf I don't believe ASF projects ever retain an original author copyright statement, but rather source files have a

Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
No, a pandas on Spark DF is distributed. On Tue, Jun 20, 2023, 1:45 PM Mich Talebzadeh wrote: > Thanks but if you create a Spark DF from Pandas DF that Spark DF is not > distributed and remains on the driver. I recall a while back we had this > conversation. I don't think anything has changed.

Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
It is indeed not part of SparkSession. See the link you cite. It is part of the pyspark pandas API On Tue, Jun 20, 2023, 5:42 AM John Paul Jayme wrote: > Good day, > > > > I have a task to read excel files in databricks but I cannot seem to > proceed. I am referencing the API documents -

Re: Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread Sean Owen
You sure it is not just that it's displaying in your local TZ? Check the actual value as a long for example. That is likely the same time. On Thu, Jun 8, 2023, 5:50 PM karan alang wrote: > ref : >
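The display-versus-stored-value point can be sketched in plain Python (nothing MongoDB-specific; the epoch value is arbitrary): the rendered string depends on the timezone you view it in, while the underlying long does not change.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

epoch_ms = 1_686_268_800_000  # an instant stored as a UTC-based long

utc_view = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
local_view = datetime.fromtimestamp(epoch_ms / 1000, tz=ZoneInfo("America/Chicago"))

# The rendered strings differ by the UTC offset of the viewing timezone...
print(utc_view.isoformat())
print(local_view.isoformat())

# ...but both denote the same instant: the stored long is unchanged.
assert utc_view == local_view
assert int(utc_view.timestamp() * 1000) == epoch_ms
```

This is the check suggested in the reply: compare the raw long values, not the formatted timestamps.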

Re: JDK version support information

2023-05-29 Thread Sean Owen
Per docs, it is Java 8. It's possible Java 11 partly works with 2.x but not supported. But then again 2.x is not supported either. On Mon, May 29, 2023, 6:43 AM Poorna Murali wrote: > We are currently using JDK 11 and spark 2.4.5.1 is working fine with that. > So, we wanted to check the maximum

Re: [MLlib] how-to find implementation of Decision Tree Regressor fit function

2023-05-25 Thread Sean Owen
Are you looking for https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala On Thu, May 25, 2023 at 6:54 AM Max wrote: > Good day, I'm working on an Implantation from Joint Probability Trees > (JPT) using the Spark framework. For this

Re: Tensorflow on Spark CPU

2023-04-30 Thread Sean Owen
nds > > the code is at below > https://gist.github.com/cometta/240bbc549155e22f80f6ba670c9a2e32 > > Do you have an example of tensorflow+big dataset that I can test? > > > > > > > > On Saturday, April 29, 2023 at 08:44:04 PM GMT+8, Sean Owen < > sro...@gmai

Re: Tensorflow on Spark CPU

2023-04-29 Thread Sean Owen
You don't want to use CPUs with Tensorflow. If it's not scaling, you may have a problem that is far too small to distribute. On Sat, Apr 29, 2023 at 7:30 AM second_co...@yahoo.com.INVALID wrote: > Anyone successfully run native tensorflow on Spark ? i tested example at >

Re: Looping through a series of telephone numbers

2023-04-02 Thread Sean Owen
That won't work, you can't use Spark within Spark like that. If it were exact matches, the best solution would be to load both datasets and join on telephone number. For this case, I think your best bet is a UDF that contains the telephone numbers as a list and decides whether a given number
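A plain-Python sketch of the matching logic such a UDF could contain (the prefix list, data, and function name are all hypothetical); in PySpark the function would be registered as a UDF and applied column-wise:

```python
# Hypothetical matching logic: does a given number fall under any of a
# fixed set of tracked prefixes? The list lives in the UDF's closure.
KNOWN_PREFIXES = ["1555", "4420", "9171"]  # assumed data, for illustration

def matches_known_number(number: str) -> bool:
    """Return True if the number starts with any tracked prefix."""
    return any(number.startswith(p) for p in KNOWN_PREFIXES)

# The PySpark registration would look something like (not executed here):
#   from pyspark.sql import functions as F
#   match_udf = F.udf(matches_known_number, "boolean")
#   df.withColumn("matched", match_udf(F.col("phone")))

print(matches_known_number("15551234567"))  # True
print(matches_known_number("15661234567"))  # False
```

For exact matches a join on the number column would be preferable, as the reply notes; the UDF is for the fuzzy/prefix case.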

Re: What is the range of the PageRank value of graphx

2023-03-28 Thread Sean Owen
From the docs: * Note that this is not the "normalized" PageRank and as a consequence pages that have no * inlinks will have a PageRank of alpha. In particular, the pageranks may have some values * greater than 1. On Tue, Mar 28, 2023 at 9:11 AM lee wrote: > When I calculate pagerank using

Re: Question related to asynchronously map transformation using java spark structured streaming

2023-03-26 Thread Sean Owen
What do you mean by asynchronously here? On Sun, Mar 26, 2023, 10:22 AM Emmanouil Kritharakis < kritharakismano...@gmail.com> wrote: > Hello again, > > Do we have any news for the above question? > I would really appreciate it. > > Thank you, > >

Re: Kind help request

2023-03-25 Thread Sean Owen
It is telling you that the UI can't bind to any port. I presume that's because of container restrictions? If you don't want the UI at all, just set spark.ui.enabled to false On Sat, Mar 25, 2023 at 8:28 AM Lorenzo Ferrando < lorenzo.ferra...@edu.unige.it> wrote: > Dear Spark team, > > I am

Re: Question related to parallelism using structed streaming parallelism

2023-03-21 Thread Sean Owen
Yes; more specifically, you can't request executors via SparkConf once the app has started. You set this when you launch it against a Spark cluster in spark-submit or otherwise. On Tue, Mar 21, 2023 at 4:23 AM Mich Talebzadeh wrote: > Hi Emmanouil, > > This means that your job is running on

Re: Understanding executor memory behavior

2023-03-16 Thread Sean Owen
All else equal it is better to have the same resources in fewer executors. More tasks are local to other tasks which helps perf. There is more possibility of 'borrowing' extra mem and CPU in a task. On Thu, Mar 16, 2023, 2:14 PM Nikhil Goyal wrote: > Hi folks, > I am trying to understand what

Re: logging pickle files on local run of spark.ml Pipeline model

2023-03-15 Thread Sean Owen
Pickle won't work. But the others should. I think you are specifying an invalid path in both cases but hard to say without more detail On Wed, Mar 15, 2023, 9:13 AM Mnisi, Caleb wrote: > Good Day > > > > I am having trouble saving a spark.ml Pipeline model to a pickle file, > when running

Re: Question related to parallelism using structed streaming parallelism

2023-03-14 Thread Sean Owen
That's incorrect, it's spark.default.parallelism, but as the name suggests, that is merely a default. You control partitioning directly with .repartition() On Tue, Mar 14, 2023 at 11:37 AM Mich Talebzadeh wrote: > Check this link > > >

Re: Question related to parallelism using structed streaming parallelism

2023-03-14 Thread Sean Owen
Are you just looking for DataFrame.repartition()? On Tue, Mar 14, 2023 at 10:57 AM Emmanouil Kritharakis < kritharakismano...@gmail.com> wrote: > Hello, > > I hope this email finds you well! > > I have a simple dataflow in which I read from a kafka topic, perform a map > transformation and then

Re: Spark 3.3.2 not running with Antlr4 runtime latest version

2023-03-14 Thread Sean Owen
You want Antlr 3 and Spark is on 4? No, I don't think Spark would downgrade. You can shade your app's dependencies maybe. On Tue, Mar 14, 2023 at 8:21 AM Sahu, Karuna wrote: > Hi Team > > > > We are upgrading a legacy application using Spring boot , Spark and > Hibernate. While upgrading

Re: How to share a dataset file across nodes

2023-03-09 Thread Sean Owen
Put the file on HDFS, if you have a Hadoop cluster? On Thu, Mar 9, 2023 at 3:02 PM sam smith wrote: > Hello, > > I use Yarn client mode to submit my driver program to Hadoop, the dataset > I load is from the local file system, when i invoke load("file://path") > Spark complains about the csv

Re: 回复:Re: Build SPARK from source with SBT failed

2023-03-07 Thread Sean Owen
I need to install Apple Developer Tools? > - Original message - > From: Sean Owen > To: ckgppl_...@sina.cn > Cc: user > Subject: Re: Build SPARK from source with SBT failed > Date: 2023-03-07 20:58 > > This says you don't have the java compiler installed. Did you install the > Apple

Re: Pandas UDFs vs Inbuilt pyspark functions

2023-03-07 Thread Sean Owen
It's hard to evaluate without knowing what you're doing. Generally, using a built-in function will be fastest. pandas UDFs can be faster than normal UDFs if you can take advantage of processing multiple rows at once. On Tue, Mar 7, 2023 at 6:47 AM neha garde wrote: > Hello All, > > I need help

Re: Build SPARK from source with SBT failed

2023-03-07 Thread Sean Owen
This says you don't have the java compiler installed. Did you install the Apple Developer Tools package? On Tue, Mar 7, 2023 at 1:42 AM wrote: > Hello, > > I have tried to build SPARK source codes with SBT in my local dev > environment (MacOS 13.2.1). But it reported following error: > [error]

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Sean Owen
which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Sat, 4 Mar 2023 at 20:13, Sean Owen wrote: > >> It's the sam

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Sean Owen
It's the same batch ID already, no? Or why not simply put the logic of both in one function? or write one function that calls both? On Sat, Mar 4, 2023 at 2:07 PM Mich Talebzadeh wrote: > > This is probably pretty straight forward but somehow is does not look > that way > > > > On Spark

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
", line 62, in main >>> distances = joined.withColumn("distance", max(col("start") - >>> col("position"), col("position") - col("end"), 0)) >>> File >>> "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
That error sounds like it's from pandas not spark. Are you sure it's this line? On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > > Hello, > > I'm trying to calculate the distance between a gene (with start and end) > and a variant (with position),
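The traceback quoted in this thread applies Python's builtin `max` to Column expressions, which is where it breaks. A sketch of the intended per-row logic in plain Python, with the column-wise PySpark form (`F.greatest`) in comments, using the names from the snippet:

```python
# Distance between a gene [start, end] and a variant position: zero when
# the position lies inside the interval, otherwise the gap to the nearer
# boundary. On plain ints, builtin max works fine.
def distance(start: int, end: int, position: int) -> int:
    return max(start - position, position - end, 0)

print(distance(100, 200, 150))  # inside the gene -> 0
print(distance(100, 200, 250))  # 50 past the end
print(distance(100, 200, 40))   # 60 before the start

# In PySpark, builtin max cannot combine Columns; the column-wise
# equivalent is F.greatest (sketch, not executed here):
#   from pyspark.sql import functions as F
#   joined.withColumn("distance",
#       F.greatest(F.col("start") - F.col("position"),
#                  F.col("position") - F.col("end"),
#                  F.lit(0)))
```

Note the literal 0 must be wrapped in `F.lit` in the Spark version, since `greatest` expects Column arguments.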

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread Sean Owen
a single partition, which has the >> same downside as collect, so this is as bad as using collect. >> >> Cheers, >> Enrico >> >> >> Am 12.02.23 um 18:05 schrieb sam smith: >> >> @Enrico Minack Thanks for "unpivot" but I am &g

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread Sean Owen
version 3.3.0 (you are taking it way too far as usual :) ) > @Sean Owen Pls then show me how it can be improved by > code. > > Also, why such an approach (using withColumn() ) doesn't work: > > for (String columnName : df.columns()) { > df= df.withColumn(columnName, > df.sele

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread Sean Owen
> > > > > On Fri, 10 Feb 2023 at 21:59, sam smith > wrote: > >> I am not sure i understand well " Just need to do the cols one at a >> time". Plus I think Apostolos is right, this needs a dataframe approach not >> a list approach. >> >>

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread Sean Owen
That gives you all distinct tuples of those col values. You need to select the distinct values of each col one at a time. Sure just collect() the result as you do here. On Fri, Feb 10, 2023, 3:34 PM sam smith wrote: > I want to get the distinct values of each column in a List (is it good >
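The distinct-tuples versus distinct-per-column difference can be illustrated in plain Python (toy data; the PySpark analogue is sketched in comments):

```python
rows = [("a", 1), ("a", 2), ("b", 1)]

# df.select(c1, c2).distinct() is the distinct *tuples* of the columns:
distinct_tuples = set(rows)  # three distinct pairs here

# The distinct values of each column, taken one column at a time:
distinct_per_col = [sorted({r[i] for r in rows}) for i in range(2)]
print(distinct_per_col)  # [['a', 'b'], [1, 2]]

# A PySpark sketch of "one column at a time" (not executed here):
#   values = {c: [r[0] for r in df.select(c).distinct().collect()]
#             for c in df.columns}
```

Each per-column distinct is a separate job in the Spark sketch, which is the trade-off the thread is discussing; collecting is unavoidable if a driver-side list is the goal.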

Re: [PySPark] How to check if value of one column is in array of another column

2023-01-17 Thread Sean Owen
I think you want array_contains: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_contains.html On Tue, Jan 17, 2023 at 4:18 PM Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > > Hello, > > I have data originally stored as

Re: pyspark.sql.dataframe.DataFrame versus pyspark.pandas.frame.DataFrame

2023-01-13 Thread Sean Owen
One is a normal Pyspark DataFrame, the other is a pandas work-alike wrapper on a Pyspark DataFrame. They're the same thing with different APIs. Neither has a 'storage format'. spark-excel might be fine, and it's used with Spark DataFrames. Because it emulates pandas's read_excel API, the Pyspark

Re: [pyspark/sparksql]: How to overcome redundant/repetitive code? Is a for loop over an sql statement with a variable a bad idea?

2023-01-06 Thread Sean Owen
Right, nothing wrong with a for loop here. Seems like just the right thing. On Fri, Jan 6, 2023, 3:20 PM Joris Billen wrote: > Hello Community, > I am working in pyspark with sparksql and have a very similar very complex > list of dataframes that Ill have to execute several times for all the >
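A minimal sketch of the loop-over-a-parameterized-query pattern (table and column names are hypothetical):

```python
# Build one SQL statement per period and run them in a loop, instead of
# copy-pasting near-identical query text.
months = ["2023-01", "2023-02", "2023-03"]

def monthly_query(month: str) -> str:
    # Prefer real parameter binding over string interpolation where the
    # API supports it, to avoid injection issues; this is a sketch.
    return f"SELECT * FROM events WHERE month = '{month}'"

queries = [monthly_query(m) for m in months]
for q in queries:
    print(q)
    # In PySpark this would be: spark.sql(q)  (not executed here)
```

The loop body can just as well call a function that returns each resulting DataFrame, collecting them in a list for later union or comparison.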

Re: GPU Support

2023-01-05 Thread Sean Owen
Spark itself does not use GPUs, but you can write and run code on Spark that uses GPUs. You'd typically use software like Tensorflow that uses CUDA to access the GPU. On Thu, Jan 5, 2023 at 7:05 AM K B M Kaala Subhikshan < kbmkaalasubhiks...@gmail.com> wrote: > Is Gigabyte GeForce RTX 3080 GPU

Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Sean Owen
That does not appear to be the same input you used in your example. What is the contents of test.csv? On Wed, Jan 4, 2023 at 7:45 AM Saurabh Gulati wrote: > Hi @Sean Owen > Probably the data is incorrect, and the source needs to fix it. > But using python's csv parser returns th

Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Sean Owen
That input is just invalid as CSV for any parser. You end a quoted col without following with a col separator. What would the intended parsing be and how would it work? On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati wrote: > > @Sean Owen Also see the example below with quotes >

Re: Incorrect csv parsing when delimiter used within the data

2023-01-03 Thread Sean Owen
ww.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > > > https://en.everybodywiki.com/Mich_Talebzadeh > > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > fr

Re: Incorrect csv parsing when delimiter used within the data

2023-01-03 Thread Sean Owen
No, you've set the escape character to double-quote, when it looks like you mean for it to be the quote character (which it already is). Remove this setting, as it's incorrect. On Tue, Jan 3, 2023 at 11:00 AM Saurabh Gulati wrote: > Hello, > We are seeing a case with csv data when it parses csv
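Python's stdlib `csv` module shows the default behavior being described: with the quote character left at `"`, a quoted field may contain the delimiter, and no escape option is needed (toy input):

```python
import csv
import io

# A field containing the delimiter parses fine when it is quoted and the
# quote character is left at its default ('"'):
raw = 'id,name\n1,"Doe, John"\n'
rows = list(csv.reader(io.StringIO(raw)))
print(rows)  # [['id', 'name'], ['1', 'Doe, John']]
```

The analogous Spark fix is to drop the `escape='"'` option: the escape character governs how embedded quote characters are read, and setting it to the quote character itself changes the parse in unintended ways.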

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Sean Owen
main object is not > getting deserialized in executor, otherise it would have failed then also. > > On Mon, 2 Jan 2023 at 9:15 PM, Sean Owen wrote: > >> It silently allowed the object to serialize, though the >> serialized/deserialized session would not work. Now it explicitly fail

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Sean Owen
error there? > > On Mon, 2 Jan 2023 at 9:09 PM, Sean Owen wrote: > >> Oh, it's because you are defining "spark" within your driver object, and >> then it's getting serialized because you are trying to use TestMain methods >> in your program. >> This was never c

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Sean Owen
A master URL must be set in > your configuration > >at org.apache.spark.SparkContext.<init>(SparkContext.scala:385) > >at > org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2574) > > at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Sean Owen
something to do with df to rdd conversion or serialization > behavior change from Spark 2.3 to Spark 3.0 if there is any. But couldn't > find the root cause. > > Regards, > Shrikant > > On Mon, 2 Jan 2023 at 7:54 PM, Sean Owen wrote: > >> So call .setMaster("yarn&

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Sean Owen
So call .setMaster("yarn"), per the error On Mon, Jan 2, 2023 at 8:20 AM Shrikant Prasad wrote: > We are running it in cluster deploy mode with yarn. > > Regards, > Shrikant > > On Mon, 2 Jan 2023 at 6:15 PM, Stelios Philippou > wrote: > >> Can we see your Spark Configuration parameters ? >>

Re: Profiling data quality with Spark

2022-12-27 Thread Sean Owen
I think this is kind of mixed up. Data warehouses are simple SQL creatures; Spark is (also) a distributed compute framework. Kind of like comparing maybe a web server to Java. Are you thinking of Spark SQL? Then sure, you may well find it more complicated, but it's also just a data

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Sean Owen
As Mich says, isn't this just max by population partitioned by country in a window function? On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker wrote: > > Hello, > > Thank you for the response! > > I can think of two ways to get the largest city by country, but both > seem to be
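In plain Python, "best row per group" reduces to a single pass keeping the max per key; the PySpark window-function form the reply refers to is sketched in comments (toy data):

```python
# Largest city per country: keep the row with the max population per group.
cities = [
    {"country": "FR", "city": "Paris",   "pop": 2_100_000},
    {"country": "FR", "city": "Lyon",    "pop": 520_000},
    {"country": "DE", "city": "Berlin",  "pop": 3_600_000},
    {"country": "DE", "city": "Hamburg", "pop": 1_800_000},
]

best = {}
for row in cities:
    cur = best.get(row["country"])
    if cur is None or row["pop"] > cur["pop"]:
        best[row["country"]] = row

print(sorted(r["city"] for r in best.values()))  # ['Berlin', 'Paris']

# The PySpark window-function analogue (sketch, not executed here):
#   from pyspark.sql import Window, functions as F
#   w = Window.partitionBy("country").orderBy(F.col("pop").desc())
#   df.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")
```

The window form keeps the whole winning row, which a plain `groupBy().max()` would not.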

Re: Unable to run Spark Job(3.3.2 SNAPSHOT) with Volcano scheduler in Kubernetes

2022-12-16 Thread Sean Owen
wrote: > I have been following below steps. > > git clone --branch branch-3.3 https://github.com/apache/spark.git > cd spark > ./dev/make-distribution.sh --tgz --name with-volcano > -Pkubernetes,volcano,hadoop-3 > > How to increase stack size ? Please let me know. > > Thank

Re: Unable to run Spark Job(3.3.2 SNAPSHOT) with Volcano scheduler in Kubernetes

2022-12-16 Thread Sean Owen
You need to increase the stack size during compilation. The included mvn wrapper in build does this. Are you using it? On Fri, Dec 16, 2022 at 9:13 AM Gnana Kumar wrote: > This is my latest error and fails to build SPARK CATALYST > > Exception in thread "main" java.lang.StackOverflowError >

Re: [EXTERNAL] Re: [Spark vulnerability] replace jackson-mapper-asl

2022-12-15 Thread Sean Owen
our firm appsec team, given the library is still being > used in spark3.3.1. Also I can see the dependency as below: > > https://github.com/apache/spark/blob/v3.3.1/pom.xml#L1784 > > > > Something misunderstanding? appreciate if you could clarify more, thanks. > > > >

Re: Query regarding Apache spark version 3.0.1

2022-12-15 Thread Sean Owen
Do you mean, when is branch 3.0.x EOL? It was EOL around the end of 2021. But there were releases 3.0.2 and 3.0.3 beyond 3.0.1, so not clear what you mean by support for 3.0.1. On Thu, Dec 15, 2022 at 9:53 AM Pranav Kumar (EXT) wrote: > Hi Team, > > > > Could you please help us to know when

Re: [EXTERNAL] Re: [Spark vulnerability] replace jackson-mapper-asl

2022-12-14 Thread Sean Owen
78a3a34c28fc15e898307e458d501a7e11d6d51?context=explore > > https://pypi.org/project/pyspark/ > > > > Regards > > Harper > > > > > > *From:* Sean Owen > *Sent:* Wednesday, December 14, 2022 9:32 PM > *To:* Wang, Harper (FRPPE) > *Cc:* user@spa

Re: [Spark vulnerability] replace jackson-mapper-asl

2022-12-14 Thread Sean Owen
What Spark version are you referring to? If it's an unsupported version, no, no plans to update it. What image are you referring to? On Wed, Dec 14, 2022 at 7:14 AM haibo.w...@morganstanley.com < haibo.w...@morganstanley.com> wrote: > Hi All > > > > Hope you are doing well. > > > > Writing this

Re: Create Jira account

2022-11-28 Thread Sean Owen
-user@ Send me your preferred email and username for the ASF JIRA and I'll create it. On Mon, Nov 28, 2022 at 10:55 AM Gerben van der Huizen < gerbenvanderhui...@gmail.com> wrote: > Hello, > > I would like to contribute to the Apache Spark project through Jira, but > according to this blog post

Re: Unable to use GPU with pyspark in windows

2022-11-23 Thread Sean Owen
Using a GPU is unrelated to Spark. You can run code that uses GPUs. This error indicates that something failed when you ran your code (GPU OOM?) and you need to investigate why. On Wed, Nov 23, 2022 at 7:51 AM Vajiha Begum S A < vajihabegu...@maestrowiz.com> wrote: > Hi Sean Owen, &g

Re: CVE-2022-33891 mitigation

2022-11-21 Thread Sean Owen
CCing Kostya for a better view, but I believe that this will not be an issue if you're not using the ACLs in Spark, yes. On Mon, Nov 21, 2022 at 2:38 PM Andrew Pomponio wrote: > I am using Spark 2.3.0 and trying to mitigate > https://nvd.nist.gov/vuln/detail/CVE-2022-33891. The correct thing to

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-18 Thread Sean Owen
; > On Fri, Nov 18, 2022, 8:13 AM Ramakrishna Rayudu < > ramakrishna560.ray...@gmail.com> wrote: > >> Sure I will test with latest spark and let you the result. >> >> Thanks, >> Rama >> >> On Thu, Nov 17, 2022, 11:16 PM Sean Owen wrote: >> >&g

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
ng this kind of queries. Okay then > problem is LIMIT is not coming up in query. Can you please suggest me any > direction. > > Thanks, > Rama > > On Thu, Nov 17, 2022, 10:56 PM Sean Owen wrote: > >> Hm, the existence queries even in 2.4.x had LIMIT 1. Are you s

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
s in DB logs. > > SELECT 1 FROM (INPUT_QUERY) SPARK_GEN_SUB_0 > > SELECT * FROM (INPUT_QUERY) SPARK_GEN_SUB_0 WHERE 1=0 > > When we see `SELECT *` which ending up with `Where 1=0` but query starts > with `SELECT 1` there is no where condition. > > Thanks, > Rama > >

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
. > > 1 > 1 > 1 > 1 > . > . > 1 > > > Its impact the performance. Can we any alternate solution for this. > > Thanks, > Rama > > > On Thu, Nov 17, 2022, 10:17 PM Sean Owen wrote: > >> This is a query to check the existence of the table upfr

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

2022-11-17 Thread Sean Owen
This is a query to check the existence of the table upfront. It is nearly a no-op query; can it have a perf impact? On Thu, Nov 17, 2022 at 10:42 AM Ramakrishna Rayudu < ramakrishna560.ray...@gmail.com> wrote: > Hi Team, > > I am facing one issue. Can you please help me on this. > >

Re: [EXTERNAL] Re: Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-03 Thread Sean Owen
Er, wait, this is what stage-level scheduling is right? this has existed since 3.1 https://issues.apache.org/jira/browse/SPARK-27495 On Thu, Nov 3, 2022 at 12:10 PM bo yang wrote: > Interesting discussion here, looks like Spark does not support configuring > different number of executors in

Re: Ctrl-left and Ctrl-right not working in Spark Shell in Windows 10

2022-11-01 Thread Sean Owen
This won't be related to Spark, but rather your shell or terminal program. On Tue, Nov 1, 2022 at 1:57 PM Salil Surendran wrote: > I installed Spark on Windows 10. Everything works fine except for the Ctrl > - left and Ctrl - right keys which doesn't move a word but just a > character. How do I

Re: spark - local question

2022-10-31 Thread Sean Owen
Sure, as stable and available as your machine is. If you don't need fault tolerance or scale beyond one machine, sure. On Mon, Oct 31, 2022 at 8:43 AM 张健BJ wrote: > Dear developers: > I have a question about the pyspark local > mode. Can it be used in production and Will it cause

Re: Running 30 Spark applications at the same time is slower than one on average

2022-10-26 Thread Sean Owen
is too small, > considering each app only uses a small number of cores and RAM. So you may > consider increase the number of nodes. When all these apps jam on a few > nodes, the cluster manager/scheduler and/or the network becomes > overwhelmed... > > On 10/26/22 8:09 AM, Sean

Re: Running 30 Spark applications at the same time is slower than one on average

2022-10-26 Thread Sean Owen
Resource contention. Now all the CPU and I/O is competing and probably slows down On Wed, Oct 26, 2022, 5:37 AM eab...@163.com wrote: > Hi All, > > I have a CDH5.16.2 hadoop cluster with 1+3 nodes(64C/128G, 1NN/RM + > 3DN/NM), and yarn with 192C/240G. I used the following test scenario: > >

Re: As a Scala newbie starting to work with Spark does it make more sense to learn Scala 2 or Scala 3?

2022-10-11 Thread Sean Owen
Spark version would have it > built-in? > > thanks > > Sean Owen wrote: > > I would imagine that Scala 2.12 support goes away, and Scala 3 support > > is added, for maybe Spark 4.0, and maybe that happens in a year or so. > > -- >

Re: As a Scala newbie starting to work with Spark does it make more sense to learn Scala 2 or Scala 3?

2022-10-11 Thread Sean Owen
For Spark, the issue is maintaining simultaneous support for multiple Scala versions, which has historically been mutually incompatible across minor versions. Until Scala 2.12 support is reasonable to remove, it's hard to also support Scala 3, as it would mean maintaining three versions of code. I

Re: [Spark Core][Release]Can we consider add SPARK-39725 into 3.3.1 or 3.3.2 release?

2022-10-04 Thread Sean Owen
I think it's fine to backport that to 3.3.x, regardless of whether it clearly affects Spark or not. On Tue, Oct 4, 2022 at 11:31 AM phoebe chen wrote: > Hi: > (Not sure if this mailing group is good to use for such question, but just > try my luck here, thanks) > > SPARK-39725

Re: Spark ML VarianceThresholdSelector Unexpected Results

2022-09-29 Thread Sean Owen
This is sample variance, not population (i.e. divide by n-1, not n). I think that's justified as the data are notionally a sample from a population. On Thu, Sep 29, 2022 at 9:21 PM 姜鑫 wrote: > Hi folks, > > Has anyone used VarianceThresholdSelector refer to >

Re: Updating Broadcast Variable in Spark Streaming 2.4.4

2022-09-28 Thread Sean Owen
I don't think that can work. Your BroadcastUpdater is copied to the task, with a reference to an initial broadcast. When that is later updated on the driver, this does not affect the broadcast inside the copy in the tasks. On Wed, Sep 28, 2022 at 10:11 AM Dipl.-Inf. Rico Bergmann <

Re: 答复: [how to]RDD using JDBC data source in PySpark

2022-09-19 Thread Sean Owen
Just use the .format('jdbc') data source? This is built in, for all languages. You can get an RDD out if you must. On Mon, Sep 19, 2022, 5:28 AM javaca...@163.com wrote: > Thank you answer alton. > > But i see that is use scala to implement it. > I know java/scala can get data from mysql using

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Sean Owen
Wait, how do you start reduce tasks before maps are finished? is the idea that some reduce tasks don't depend on all the maps, or at least you can get started? You can already execute unrelated DAGs in parallel of course. On Wed, Sep 7, 2022 at 5:49 PM Sungwoo Park wrote: > You are right --

Re: Spark equivalent to hdfs groups

2022-09-07 Thread Sean Owen
wondered if there was a class in Spark (eg. Security or > ACL) which would let you access a particular user's groups. > > > > - Original message - > From: "Sean Owen" > To: phi...@free.fr > Cc: "User" > Sent: Wednesday, 7 September 2022 16:41:01 >

Re: Spark equivalent to hdfs groups

2022-09-07 Thread Sean Owen
Spark isn't a storage system or user management system; no there is no notion of groups (groups for what?) On Wed, Sep 7, 2022 at 8:36 AM wrote: > Hello, > is there a Spark equivalent to "hdfs groups "? > Many thanks. > Philippe > >

Re: Error in Spark in Jupyter Notebook

2022-09-06 Thread Sean Owen
That just says a task failed - no real info there. You have to look at Spark logs from the UI to see why. On Tue, Sep 6, 2022 at 7:07 AM Mamata Shee wrote: > Hello, > > I'm using spark in Jupyter Notebook, but when performing some queries > getting the below error, can you please tell me what

Re: Spark got incorrect scala version while using spark 3.2.1 and spark 3.2.2

2022-08-26 Thread Sean Owen
Spark is built with and ships with a copy of Scala. It doesn't use your local version. On Fri, Aug 26, 2022 at 2:55 AM wrote: > Hi all, > > I found a strange thing. I have run Spark 3.2.1 prebuilt in local mode. My > OS Scala version is 2.13.7. > But when I run spark-submit then check the

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Sean Owen
’s RTL utils and other tools to figure out >> how much overhead there is using Pandera and Spark together to validate >> data: https://github.com/Graphlet-AI/graphlet >> >> I’ll respond by tomorrow evening with code in a fist! We’ll see if it >> gets consistent, measurab

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Sean Owen
It's important to realize that while pandas UDFs and pandas on Spark are both related to pandas, they are not themselves directly related. The first lets you use pandas within Spark, the second lets you use pandas on Spark. Hard to say with this info but you want to look at whether you are doing

Re: spark-3.2.2-bin-without-hadoop : NoClassDefFoundError: org/apache/log4j/spi/Filter when starting the master

2022-08-24 Thread Sean Owen
You have to provide your own Hadoop distro and all its dependencies. This build is intended for use on a Hadoop cluster, really. If you're running stand-alone, you should not be using it. Use a 'normal' distribution that bundles Hadoop libs. On Wed, Aug 24, 2022 at 9:35 AM FLORANCE Grégory

Re: Spark with GPU

2022-08-13 Thread Sean Owen
decimal type/Udfs etc. > So, will it use CPU automatically for running those tasks which require > nested types or will it run on GPU and fail. > > Thanks > Rajat > > On Sat, Aug 13, 2022, 18:54 Sean Owen wrote: > >> Spark does not use GPUs itself, but tasks

Re: Spark with GPU

2022-08-13 Thread Sean Owen
Spark does not use GPUs itself, but tasks you run on Spark can. The only 'support' there is is for requesting GPUs as resources for tasks, so it's just a question of resource management. That's in OSS. On Sat, Aug 13, 2022 at 8:16 AM rajat kumar wrote: > Hello, > > I have been hearing about GPU
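The resource-management support Sean mentions is driven by Spark's resource-scheduling configuration. A sketch of the relevant conf keys (the discovery-script path is an example; the key names are Spark's `spark.{executor,task}.resource.gpu.*` options):

```python
# Sketch of Spark's GPU resource-scheduling configuration. Spark itself
# does not run on the GPU; these confs only make GPUs schedulable
# resources that tasks (e.g. RAPIDS, TensorFlow) can claim.
gpu_confs = {
    "spark.executor.resource.gpu.amount": "1",   # GPUs per executor
    "spark.task.resource.gpu.amount": "1",       # GPUs per task
    # Example path: a script printing the GPU addresses on each node.
    "spark.executor.resource.gpu.discoveryScript": "/opt/spark/getGpus.sh",
}

# With a live session these would be applied at startup, e.g.:
# builder = SparkSession.builder
# for k, v in gpu_confs.items():
#     builder = builder.config(k, v)
# spark = builder.getOrCreate()
```

Tasks can then read their assigned GPU addresses from the task context; whether the work actually runs on the GPU is up to the task's own code (or a plugin such as the RAPIDS Accelerator).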

Re: Spark Scala API still not updated for 2.13 or it's a mistake?

2022-08-02 Thread Sean Owen
> Thanks > On 2 Aug 2022, at 18:52, Sean Owen wrote: > Spark 3.3.0 supports 2.13, though you need to build it for 2.13. The > default binary distro uses 2.12. > On Tue, Aug 2, 2022, 10:47 AM Roman I wrote: >> >> For the Scala API, Spark 3.3.0 uses Scala 2.1

Re: Spark Scala API still not updated for 2.13 or it's a mistake?

2022-08-02 Thread Sean Owen
Spark 3.3.0 supports 2.13, though you need to build it for 2.13. The default binary distro uses 2.12. On Tue, Aug 2, 2022, 10:47 AM Roman I wrote: > > For the Scala API, Spark 3.3.0 uses Scala 2.12. You will need to use a > compatible Scala version (2.12.x). > >

Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-02 Thread Sean Owen
CREATE TABLE IF NOT EXISTS > > > > > https://spark.apache.org/docs/3.3.0/sql-ref-syntax-ddl-create-table-datasource.html > > On Tue, 2 Aug 2022 at 14:38, Sean Owen wrote: >> I don't think "CREATE OR REPLACE TABLE" exists (in SQL?); this isn't a >> VIEW. >> D
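The portable form from the linked DDL reference can be sketched as follows (table name and schema are hypothetical; `CREATE OR REPLACE TABLE` is only accepted by certain catalogs such as Delta Lake, not by plain Spark SQL DDL):

```python
# Hypothetical table name and schema -- a sketch of the portable
# Spark SQL DDL. CREATE OR REPLACE TABLE is not plain Spark SQL;
# CREATE TABLE IF NOT EXISTS is.
ddl = """
CREATE TABLE IF NOT EXISTS events (
    id BIGINT,
    payload STRING
) USING parquet
"""

# With a live SparkSession: spark.sql(ddl)
```

If the target path already holds data, `IF NOT EXISTS` simply leaves the existing table in place rather than failing with the "location is not empty" AnalysisException; truly replacing it means dropping the table (and clearing the path) first.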

Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-02 Thread Sean Owen
IVE', > 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', > 'UPDATE', 'USE', 'USER', 'USING', 'VALUES', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', > 'WINDOW', 'WITH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 23) > > == SQL == > CREATE OR REPLACE TABLE > &
