Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
Mine is an Oracle. Postgres is good as well. HTH -- Mich Talebzadeh, Solutions Architect/Engineering Lead, London, United Kingdom

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci

Re: why advisoryPartitionSize <= maxShuffledHashJoinLocalMapThreshold

2023-08-15 Thread XiDuo You
CoalesceShufflePartitions will merge small partitions into bigger ones. Say, if you set maxShuffledHashJoinLocalMapThreshold to 32MB but the advisoryPartitionSize is 64MB, then each final reducer partition size will be close to 64MB. That breaks the maxShuffledHashJoinLocalMapThreshold. So we
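
A minimal PySpark sketch of the constraint described above (values are illustrative, not recommendations; the keys are Spark's documented AQE settings):

    from pyspark.sql import SparkSession

    # Keep the advisory partition size at or below the local-map threshold
    # so that coalesced partitions stay eligible for shuffled hash join.
    spark = (
        SparkSession.builder
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "32m")
        .config("spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold", "32m")
        .getOrCreate()
    )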

Re: Spark Vulnerabilities

2023-08-14 Thread Cheng Pan
For the Guava case, you may be interested in https://github.com/apache/spark/pull/42493 Thanks, Cheng Pan > On Aug 14, 2023, at 16:50, Sankavi Nagalingam > wrote: > > Hi Team, > We could see there are many dependent vulnerabilities present in the latest > spark-core:3.4.1.jar. PFA > Could

Re: Spark Vulnerabilities

2023-08-14 Thread Sean Owen
Yeah, we generally don't respond to "look at the output of my static analyzer". Some of these are already addressed in a later version. Some don't affect Spark. Some are possibly an issue but hard to change without breaking lots of things - they are really issues with upstream dependencies. But

Re: Spark Vulnerabilities

2023-08-14 Thread Bjørn Jørgensen
I have added links to the GitHub PR, or a comment for those that I have not seen before. Apache Spark has very many dependencies; some can easily be upgraded while others are very hard to fix. Please feel free to open a PR if you wanna help. On Mon, Aug 14, 2023 at 14:06, Sankavi Nagalingam wrote:

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Mich Talebzadeh

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Patrick Tucci

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
...installing Delta Lake and re-writing all tables as Delta tables, the issue persists. On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh < mich.talebza...@gmail.com> wrote: ok sure. Is this Delta Lake going to be on-premise?

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
Yes, on premise. Unfortunately after installing Delta Lake and re-writing all tables as Delta tables, the issue persists. On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh wrote: > ok sure. > > Is this Delta Lake going to be on-premise? > > Mich Talebzadeh, > Solutions Arc

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
ok sure. Is this Delta Lake going to be on-premise? Mich Talebzadeh, Solutions Architect/Engineering Lead, London, United Kingdom

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
Hi Mich, Thanks for the feedback. My original intention after reading your response was to stick to Hive for managing tables. Unfortunately, I'm running into another case of SQL scripts hanging. Since all tables are already Parquet, I'm out of troubleshooting options. I'm going to migrate to

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Mich Talebzadeh
Hi Patrick, There is nothing wrong with Hive. On-premise, it is the best data warehouse there is. Hive handles both ORC and Parquet formats well; they are both columnar implementations of the relational model. What you are seeing is the Spark API to Hive, which prefers Parquet. I found out a few

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Patrick Tucci
Thanks for the reply Stephen and Mich. Stephen, you're right, it feels like Spark is waiting for something, but I'm not sure what. I'm the only user on the cluster and there are plenty of resources (+60 cores, +250GB RAM). I even tried restarting Hadoop, Spark and the host servers to make sure

Re: unsubscribe

2023-08-11 Thread Mich Talebzadeh
To unsubscribe e-mail: user-unsubscr...@spark.apache.org. Mich Talebzadeh, Solutions Architect/Engineering Lead, London, United Kingdom

Re: Extracting Logical Plan

2023-08-11 Thread Vibhatha Abeykoon
df("column") > 10) >> > val resultDF = filteredDF.select("column1", "column2") >> > >> > // Trigger the execution of the DF to invoke the listener >> > resultDF.show() >> >> Thank You & Best Regards >> Winston Lai >&

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Mich Talebzadeh
Steve may have a valid point. You raised an issue with concurrent writes before, if I recall correctly; this limitation may be due to the Hive metastore. By default Spark uses Apache Derby for its database persistence. However, it is limited to only one Spark session at any time for the purposes
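
For illustration, a hedged sketch of pointing the metastore at PostgreSQL instead of Derby (the URL, database name, and credentials are placeholders, and these settings usually live in hive-site.xml; passing them via spark.hadoop.* is one common style):

    from pyspark.sql import SparkSession

    # A PostgreSQL-backed metastore allows multiple concurrent Spark
    # sessions, unlike the single-session Derby default.
    spark = (
        SparkSession.builder
        .config("spark.hadoop.javax.jdo.option.ConnectionURL",
                "jdbc:postgresql://metastore-host:5432/hive_metastore")
        .config("spark.hadoop.javax.jdo.option.ConnectionDriverName",
                "org.postgresql.Driver")
        .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
        .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "changeme")
        .enableHiveSupport()
        .getOrCreate()
    )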

Re: Spark Connect, Master, and Workers

2023-08-10 Thread Brian Huynh
Hi Kezhi, Yes, you no longer need to start a master to make the client work. Please see the quickstart. https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_connect.html You can think of Spark Connect as an API on top of Master so workers can be added to the cluster same
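
A minimal client-side sketch in PySpark (host and port are placeholders; 15002 is the documented default Spark Connect port):

    from pyspark.sql import SparkSession

    # The client connects straight to the Spark Connect endpoint; no master
    # or worker processes are started on the client side.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
    spark.range(5).show()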

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Stephen Coy
Hi Patrick, When this has happened to me in the past (admittedly via spark-submit) it has been because another job was still running and had already claimed some of the resources (cores and memory). I think this can also happen if your configuration tries to claim resources that will never be

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hi Mich, I don't believe Hive is installed. I set up this cluster from scratch. I installed Hadoop and Spark by downloading them from their project websites. If Hive isn't bundled with Hadoop or Spark, I don't believe I have it. I'm running the Thrift server distributed with Spark, like so:

Re: dockerhub does not contain apache/spark-py 3.4.1

2023-08-10 Thread Mich Talebzadeh
Hi Mark, I created a Spark 3.4.1 docker file. Details from spark-py-3.4.1-scala_2.12-11-jre-slim-buster. Pull instructions are given: docker pull

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Sorry, host is 10.0.50.1. Mich Talebzadeh, Solutions Architect/Engineering Lead, London, United Kingdom

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Hi Patrick, That beeline on port 10000 is a Hive Thrift server running on your Hive host 10.0.50.1:10000. If you can access that host, you should be able to log into Hive by typing hive. The OS user is hadoop in your case and it sounds like there is no password! Once inside that host, Hive logs

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hi Mich, Thanks for the reply. Unfortunately I don't have Hive set up on my cluster. I can explore this if there are no other ways to troubleshoot. I'm using beeline to run commands against the Thrift server. Here's the command I use: ~/spark/bin/beeline -u jdbc:hive2://10.0.50.1:10000 -n

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Can you run this SQL query through Hive itself? Are you using this command or similar for your Thrift server? beeline -u jdbc:hive2://<host>:10000/default org.apache.hive.jdbc.HiveDriver -n hadoop -p xxx HTH Mich Talebzadeh, Solutions Architect/Engineering Lead, London, United Kingdom view my

Re: [PySpark][UDF][PickleException]

2023-08-10 Thread Bjørn Jørgensen
+ On Thu, Aug 10, 2023 at 17:26, Sanket Sharma wrote: > Hi, > I've been trying to debug a Spark UDF for a couple of days now but I can't > seem to figure out what is going on. The UDF essentially pads a 2D array to > a certain fixed length. When the

Re: [PySpark] Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path:

2023-08-09 Thread lnxpgn
Yes, both ls -l /tmp/app-submodules.zip and hdfs dfs -ls /tmp/app-submodules.zip show the file. On 2023/8/9 22:48, Mich Talebzadeh wrote: If you are running in the cluster mode, that zip file should exist in all the nodes! Is that the case? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead

Re: dockerhub does not contain apache/spark-py 3.4.1

2023-08-09 Thread Mich Talebzadeh
Hi Mark, you can build it yourself, no big deal :)

    REPOSITORY        TAG                                             IMAGE ID      CREATED       SIZE
    sparkpy/spark-py  3.4.1-scala_2.12-11-jre-slim-buster-Dockerfile  a876102b2206  1 second ago

Re: [PySpark] Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path:

2023-08-09 Thread Mich Talebzadeh
If you are running in the cluster mode, that zip file should exist in all the nodes! Is that the case? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile

Re: [EXTERNAL] Use of ML in certain aspects of Spark to improve the performance

2023-08-08 Thread Daniel Tavares de Santana
unsubscribe From: Mich Talebzadeh Sent: Tuesday, August 8, 2023 4:43 PM To: user @spark Subject: [EXTERNAL] Use of ML in certain aspects of Spark to improve the performance I am currently pondering and sharing my thoughts openly. Given our reliance on gathered

Re: Dynamic allocation does not deallocate executors

2023-08-08 Thread Holden Karau
So, from memory: if you disable shuffle tracking but enable shuffle block decommissioning, it should work. On Tue, Aug 8, 2023 at 4:13 AM Mich Talebzadeh wrote: > Hm. I don't think it will work > > --conf spark.dynamicAllocation.shuffleTracking.enabled=false > > In Spark 3.4.1 running spark in k8s
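
A hedged sketch of the combination Holden suggests, on Kubernetes (all keys are documented Spark settings; note Mich reports below that this still raised a SparkException for him on 3.4.1, so verify against your setup):

    from pyspark.sql import SparkSession

    # Shuffle tracking off, block decommissioning on, so executors can be
    # released without an external shuffle service.
    spark = (
        SparkSession.builder
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "false")
        .config("spark.decommission.enabled", "true")
        .config("spark.storage.decommission.enabled", "true")
        .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
        .getOrCreate()
    )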

Re: Dynamic allocation does not deallocate executors

2023-08-08 Thread Mich Talebzadeh
Hm. I don't think it will work: --conf spark.dynamicAllocation.shuffleTracking.enabled=false. In Spark 3.4.1, running Spark in k8s you get: org.apache.spark.SparkException: Dynamic allocation of executors requires the external shuffle service. You may enable this through

Re: Dynamic allocation does not deallocate executors

2023-08-07 Thread Holden Karau
I think you need to set "spark.dynamicAllocation.shuffleTracking.enabled" to false. On Mon, Aug 7, 2023 at 2:50 AM Mich Talebzadeh wrote: > Yes I have seen cases where the driver gone but a couple of executors > hanging on. Sounds like a code issue. > > HTH > > Mich Talebzadeh, > Solutions

Re: Dynamic allocation does not deallocate executors

2023-08-07 Thread Mich Talebzadeh
Yes I have seen cases where the driver gone but a couple of executors hanging on. Sounds like a code issue. HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile

Re: conver panda image column to spark dataframe

2023-08-03 Thread Sean Owen
pp4 has one row, I'm guessing - containing an array of 10 images. You want 10 rows of 1 image each. But, just don't do this. Pass the bytes of the image as an array, along with width/height/channels, and reshape it on use. It's just easier. That is how the Spark image representation works anyway
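
A short sketch of the approach Sean describes (shapes and values are illustrative):

    import numpy as np
    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.getOrCreate()

    # One image per row: raw bytes plus shape metadata.
    img = np.zeros((32, 32, 3), dtype=np.uint8)
    df = spark.createDataFrame([Row(data=bytearray(img.tobytes()),
                                    height=32, width=32, channels=3)])

    # Reshape at the point of use.
    row = df.first()
    restored = np.frombuffer(bytes(row.data), dtype=np.uint8).reshape(
        row.height, row.width, row.channels)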

Re: conver panda image column to spark dataframe

2023-08-03 Thread second_co...@yahoo.com.INVALID
Hello Adrian, here is the snippet:

    import tensorflow_datasets as tfds
    (ds_train, ds_test), ds_info = tfds.load(
        dataset_name, data_dir='', split=["train", "test"],
        with_info=True, as_supervised=True
    )
    schema = StructType([
        StructField("image",

Re: Interested in contributing to SPARK-24815

2023-08-03 Thread Sean Owen
Formally, an ICLA is required, and you can read more here: https://www.apache.org/licenses/contributor-agreements.html In practice, it's unrealistic to collect and verify an ICLA for every PR contributed by 1000s of people. We have not gated on that. But, contributions are in all cases governed

Re: Interested in contributing to SPARK-24815

2023-08-03 Thread Rinat Shangeeta
(Adding my manager Eugene Kim who will cover me as I plan to be out of the office soon) Hi Kent and Sean, Nice to meet you. I am working on the OSS legal aspects with Pavan who is planning to make the contribution request to the Spark project. I saw that Sean mentioned in his email that the

Re: conver panda image column to spark dataframe

2023-08-03 Thread Adrian Pop-Tifrea
Hello, can you also please show us how you created the pandas dataframe? I mean, how you added the actual data into the dataframe. It would help us reproduce the error. Thank you, Pop-Tifrea Adrian On Mon, Jul 31, 2023 at 5:03 AM second_co...@yahoo.com < second_co...@yahoo.com> wrote: >

Re: Extracting Logical Plan

2023-08-02 Thread Vibhatha Abeykoon
n") > 10) > > val resultDF = filteredDF.select("column1", "column2") > > > > // Trigger the execution of the DF to invoke the listener > > resultDF.show() > > Thank You & Best Regards > Winston Lai > -- > *

Re: Extracting Logical Plan

2023-08-02 Thread Winston Lai
...df("column") > 10)
val resultDF = filteredDF.select("column1", "column2")
// Trigger the execution of the DF to invoke the listener
resultDF.show()
Thank You & Best Regards, Winston Lai. From: Vibhatha Abeykoon, Sent: Wednesday, August 2, 2023

Re: Extracting Logical Plan

2023-08-02 Thread Vibhatha Abeykoon
I understand. I sort of drew the same conclusion. But I wasn’t sure. Thanks everyone for taking time on this. On Wed, Aug 2, 2023 at 2:29 PM Ruifeng Zheng wrote: > In Spark Connect, I think the only API to show optimized plan is > `df.explain("extended")` as Winston mentioned, but it is not a

Re: Extracting Logical Plan

2023-08-02 Thread Ruifeng Zheng
In Spark Connect, I think the only API to show optimized plan is `df.explain("extended")` as Winston mentioned, but it is not a LogicalPlan object. On Wed, Aug 2, 2023 at 4:36 PM Vibhatha Abeykoon wrote: > Hello Ruifeng, > > Thank you for these pointers. Would it be different if I use the Spark
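
For reference, a minimal PySpark illustration of that suggestion (runs the same against a classic or Spark Connect session):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10).filter("id > 5")

    # With Spark Connect the optimized plan is only exposed as text; there
    # is no client-side LogicalPlan object.
    df.explain(mode="extended")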

Re: Extracting Logical Plan

2023-08-02 Thread Vibhatha Abeykoon
Hello Ruifeng, Thank you for these pointers. Would it be different if I use Spark Connect? I am not using the regular SparkSession. I am pretty new to these APIs. Appreciate your thoughts. On Wed, Aug 2, 2023 at 2:00 PM Ruifeng Zheng wrote: > Hi Vibhatha, > I think those APIs are still

Re: Extracting Logical Plan

2023-08-02 Thread Ruifeng Zheng
Hi Vibhatha, I think those APIs are still available? (spark-shell banner) Welcome to Spark version 3.4.1. Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 11.0.19). Type in

Re: Extracting Logical Plan

2023-08-02 Thread Vibhatha Abeykoon
Hi Winston, I need to use the LogicalPlan object and process it with another function I have written. In earlier Spark versions we can access that via the dataframe object. So if it can be accessed via the UI, is there an API to access the object? On Wed, Aug 2, 2023 at 1:24 PM Winston Lai

Re: Extracting Logical Plan

2023-08-02 Thread Winston Lai
Hi Vibhatha, How about reading the logical plan from Spark UI, do you have access to the Spark UI? I am not sure what infra you run your Spark jobs on. Usually you should be able to view the logical and physical plan under Spark UI in text version at least. It is independent from the language

Re: Extracting Logical Plan

2023-08-02 Thread Vibhatha Abeykoon
Hi Winston, I am looking for a way to access the LogicalPlan object in Scala. Not sure if explain function would serve the purpose. On Wed, Aug 2, 2023 at 9:14 AM Winston Lai wrote: > Hi Vibhatha, > > Have you tried pyspark.sql.DataFrame.explain — PySpark 3.4.1 > documentation (apache.org) >

Re: Extracting Logical Plan

2023-08-01 Thread Winston Lai
Hi Vibhatha, Have you tried pyspark.sql.DataFrame.explain — PySpark 3.4.1 documentation (apache.org) before? I am not sure what infra that you have, you can

Re: conver panda image column to spark dataframe

2023-07-31 Thread second_co...@yahoo.com.INVALID
I changed to ArrayType(ArrayType(ArrayType(IntegerType()))), still get the same error. Thank you for responding. On Thursday, July 27, 2023 at 06:58:09 PM GMT+8, Adrian Pop-Tifrea wrote: Hello, when you said your pandas Dataframe has 10 rows, does that mean it contains 10 images?

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Mich Talebzadeh
OK, so as expected the underlying database is Hive. Hive uses HDFS storage. You said you encountered limitations on concurrent writes. The order and limitations are introduced by the Hive metastore, so to speak. Since this is all happening through Spark, by default the implementation of the Hive metastore

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Patrick Tucci
Hi Mich and Pol, Thanks for the feedback. The database layer is Hadoop 3.3.5. The cluster restarted so I lost the stack trace in the application UI. In the snippets I saved, it looks like the exception being thrown was from Hive. Given the feedback you've provided, I suspect the issue is with how

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Pol Santamaria
Hi Patrick, You can have multiple writers simultaneously writing to the same table in HDFS by utilizing an open table format with concurrency control. Several formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast Format, offer this capability. All of them provide advanced features
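
A minimal sketch of the concurrent-writer pattern Pol describes, assuming the delta-spark package is installed and the session is configured for Delta (the table name is a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10).toDF("id")

    # With an open table format, concurrent appends to the same table are
    # mediated by optimistic concurrency control instead of failing outright.
    df.write.format("delta").mode("append").saveAsTable("events")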

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Mich Talebzadeh
It is not Spark SQL that throws the error. It is the underlying database or layer that throws the error; Spark acts as an ETL tool. What is the underlying DB where the table resides? Is concurrency supported? Please send the error to this list. HTH Mich Talebzadeh, Solutions

Re: The performance difference when running Apache Spark on K8s and traditional server

2023-07-27 Thread Mich Talebzadeh
Spark on tin boxes like Google Dataproc or AWS EC2 often utilise YARN resource manager. YARN is the most widely used resource manager not just for Spark but for other artefacts as well. On-premise YARN is used extensively. In Cloud it is also used widely in Infrastructure as a Service such as

Re: conver panda image column to spark dataframe

2023-07-27 Thread Adrian Pop-Tifrea
Hello, when you said your pandas Dataframe has 10 rows, does that mean it contains 10 images? Because if that's the case, then you'd want to only use 3 layers of ArrayType when you define the schema. Best regards, Adrian On Thu, Jul 27, 2023, 11:04 second_co...@yahoo.com.INVALID wrote: > i
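
A small sketch of Adrian's point (the field name is taken from the thread):

    from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

    # Three levels of ArrayType for one image per row
    # (height x width x channels), not four.
    schema = StructType([
        StructField("image", ArrayType(ArrayType(ArrayType(IntegerType()))))
    ])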

Re: spark context list_packages()

2023-07-27 Thread Sean Owen
There is no such method in Spark. I think that's some EMR-specific modification. On Wed, Jul 26, 2023 at 11:06 PM second_co...@yahoo.com.INVALID wrote: > I ran the following code > > spark.sparkContext.list_packages() > > on spark 3.4.1 and i get below error > > An error was encountered: >

Re: Interested in contributing to SPARK-24815

2023-07-26 Thread Pavan Kotikalapudi
Thanks for the response with all the information, Sean and Kent. Is there a way to figure out if my employer (Twilio) is part of a CCLA? cc'ing: @Rinat Shangeeta, our Open Source Counsel at Twilio. Thank you, Pavan On Tue, Jul 25, 2023 at 10:48 PM Kent Yao wrote: > Hi Pavan, > > Refer to the ASF

Re: Interested in contributing to SPARK-24815

2023-07-25 Thread Kent Yao
Hi Pavan, Refer to the ASF Source Header and Copyright Notice Policy[1]: code directly submitted to ASF should include the Apache license header without any additional copyright notice. Kent Yao [1] https://www.apache.org/legal/src-headers.html#headers On Tue, Jul 25, 2023 at 07:22, Sean Owen wrote: > >

Re: Interested in contributing to SPARK-24815

2023-07-24 Thread Sean Owen
When contributing to an ASF project, it's governed by the terms of the ASF ICLA: https://www.apache.org/licenses/icla.pdf or CCLA: https://www.apache.org/licenses/cla-corporate.pdf I don't believe ASF projects ever retain an original author copyright statement, but rather source files have a

Re: Spark 3.3 + parquet 1.10

2023-07-24 Thread Mich Talebzadeh
Personally I have not done it myself. CCed to the Spark user group in case some user there has tried it. HTH Mich Talebzadeh, Solutions Architect/Engineering Lead, Palantir Technologies Limited, London, United Kingdom

Re: Unable to launch Spark connect on Docker image

2023-07-22 Thread Mich Talebzadeh
This is the downloaded Docker image? Try this with the added configuration options as below: /opt/spark/sbin/start-connect-server.sh --conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp" --packages org.apache.spark:spark-connect_2.12:3.4.1 And you will get starting

Re: Spark File Output Committer algorithm for GCS

2023-07-21 Thread Mich Talebzadeh
this link might help https://stackoverflow.com/questions/46929351/spark-reading-orc-file-in-driver-not-in-executors Mich Talebzadeh, Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Re: Spark File Output Committer algorithm for GCS

2023-07-21 Thread Dipayan Dev
I used the following config and the performance has improved a lot: .config("spark.sql.orc.splits.include.file.footer", true). I am not able to find the default value of this config anywhere. Can someone please share the default of this config - is it false? Also, just curious what this

Re: Spark File Output Committer algorithm for GCS

2023-07-19 Thread Dipayan Dev
Thank you. Will try out these options. With Best Regards, On Wed, Jul 19, 2023 at 1:40 PM Mich Talebzadeh wrote: > Sounds like if the mv command is inherently slow, there is little that can > be done. > > The only suggestion I can make is to create the staging table as > compressed to

Re: Spark File Output Committer algorithm for GCS

2023-07-19 Thread Mich Talebzadeh
Sounds like if the mv command is inherently slow, there is little that can be done. The only suggestion I can make is to create the staging table as compressed to reduce its size and hence speed up the mv. Is that feasible? Also, the managed table can be created with SNAPPY compression, STORED AS ORC
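
A small sketch of that suggestion (table and column names are assumptions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Staging table stored as ORC with SNAPPY compression, so there is
    # less data for the subsequent mv/rename step.
    spark.sql("""
        CREATE TABLE staging_tbl (id INT, payload STRING)
        STORED AS ORC
        TBLPROPERTIES ('orc.compress' = 'SNAPPY')
    """)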

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Dipayan Dev
Hi Mich, Ok, my use-case is a bit different. I have a Hive table partitioned by dates and need to do dynamic partition updates (insert overwrite) daily for the last 30 days (partitions). The ETL inside the staging directories is completed in hardly 5 minutes, but then renaming takes a lot of time as
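
For reference, a minimal sketch of that daily insert-overwrite pattern (table names are assumptions; with partitionOverwriteMode=dynamic only the partitions present in the incoming data are replaced):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    df = spark.table("staging_events")  # placeholder source of 30 days of data

    # Older partitions outside the incoming data are left untouched.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    df.write.mode("overwrite").insertInto("events_by_date")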

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Mich Talebzadeh
Spark has no role in creating that Hive staging directory. That directory belongs to Hive, and Spark simply does ETL there, loading to the Hive managed table in your case, which ends up in the staging directory. I suggest that you review your design and use an external Hive table with explicit location

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Dipayan Dev
It does help performance, but not significantly. I am just wondering: once Spark creates that staging directory along with the SUCCESS file, can we just do a gsutil rsync command and move these files to the original directory? Anyone tried this approach or foresee any concern? On Mon, 17 Jul 2023

Re: Spark Scala SBT Local build fails

2023-07-18 Thread Varun Shah
++ DEV community On Mon, Jul 17, 2023 at 4:14 PM Varun Shah wrote: > Resending this message with a proper Subject line > > Hi Spark Community, > > I am trying to set up my forked apache/spark project locally for my 1st > Open Source Contribution, by building and creating a package as mentioned

Re: Spark Scala SBT Local build fails

2023-07-17 Thread Varun Shah
Hi Team, I am still looking for guidance here. Really appreciate anything that points me in the right direction. On Mon, Jul 17, 2023, 16:14 Varun Shah wrote: > Resending this message with a proper Subject line > > Hi Spark Community, > > I am trying to set up my forked apache/spark project

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Thanks Jay, is there any suggestion how much I can increase those parameters? On Mon, 17 Jul 2023 at 8:25 PM, Jay wrote: > Fileoutputcommitter v2 is supported in GCS but the rename is a metadata > copy and delete operation in GCS and therefore if there are many number of > files it will take a

Re: Contributing to Spark MLLib

2023-07-17 Thread Gourav Sengupta
Hi, Holden Karau has some fantastic videos in her channel which will be quite helpful. Thanks Gourav On Sun, 16 Jul 2023, 19:15 Brian Huynh, wrote: > Good morning Dipayan, > > Happy to see another contributor! > > Please go through this document for contributors. Please note the >

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay
Fileoutputcommitter v2 is supported in GCS, but the rename is a metadata copy-and-delete operation in GCS, and therefore if there are a large number of files it will take a long time to perform this step. One workaround is to create a smaller number of larger files if that is possible from Spark, and

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Mich Talebzadeh
You said this Hive table was a managed table partitioned by date -->${TODAY} How do you define your Hive managed table? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
It does support it - it doesn't error out for me at least. But it took around 4 hours to finish the job. Interestingly, it took only 10 minutes to write the output in the staging directory, and the rest of the time was spent renaming the objects. That's the concern. Looks like a known issue, as Spark behaves

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Yeachan Park
Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is supported on GCS? IIRC it wasn't, but you could check with GCP support On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev wrote: > Thanks Jay, > > I will try that option. > > Any insight on the file committer algorithms? > > I

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
Thanks Jay, I will try that option. Any insight on the file committer algorithms? I tried the v2 algorithm but it's not improving the runtime. What's the best practice in Dataproc for dynamic updates in Spark? On Mon, 17 Jul 2023 at 7:05 PM, Jay wrote: > You can try increasing

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay
You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch. The definitions for these flags are available here - https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md On Mon, 17 Jul 2023 at 14:59, Dipayan Dev wrote: > No, I am using Spark
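
A hedged sketch of raising those two connector settings from PySpark (values are illustrative; consult the CONFIGURATION.md linked above for semantics and limits):

    from pyspark.sql import SparkSession

    # These map to the GCS connector's fs.gs.batch.threads and
    # fs.gs.max.requests.per.batch settings, passed via Hadoop conf.
    spark = (
        SparkSession.builder
        .config("spark.hadoop.fs.gs.batch.threads", "32")
        .config("spark.hadoop.fs.gs.max.requests.per.batch", "30")
        .getOrCreate()
    )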

Re: Unsubscribe

2023-07-17 Thread srini subramanian
Unsubscribe  On Monday, July 17, 2023 at 11:19:41 AM GMT+5:30, Bode, Meikel wrote: Unsubscribe

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Dipayan Dev
No, I am using Spark 2.4 to update the GCS partitions. I have a managed Hive table on top of this. When I do a dynamic partition update from Spark, it creates the new file in a staging area as shown here. But the GCS blob renaming takes a lot of time. I have a partition based on

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Mich Talebzadeh
So you are using GCP and your Hive is installed on Dataproc which happens to run your Spark as well. Is that correct? What version of Hive are you using? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Re: Contributing to Spark MLLib

2023-07-16 Thread Brian Huynh
Good morning Dipayan, Happy to see another contributor! Please go through this document for contributors. Please note the MLlib-specific contribution guidelines section in particular. https://spark.apache.org/contributing.html Since you are looking for something to start with, take a look at

Re: Unable to populate spark metrics using custom metrics API

2023-07-13 Thread Surya Soma
Gentle reminder on this. On Sat, Jul 8, 2023 at 7:59 PM Surya Soma wrote: > Hello, > > I am trying to publish custom metrics using Spark CustomMetric API as > supported since spark 3.2 https://github.com/apache/spark/pull/31476, > > >

Re: Spark Not Connecting

2023-07-12 Thread Artemis User
Well, in that case, you may want to make sure your Spark server is running properly and that you can access the Spark UI using your browser. If you don't own the Spark cluster, contact your Spark admin. On 7/12/23 1:56 PM, timi ayoade wrote: I can't even connect to the spark UI On Wed, Jul

Re: [EXTERNAL] Spark Not Connecting

2023-07-12 Thread Daniel Tavares de Santana
unsubscribe From: timi ayoade Sent: Wednesday, July 12, 2023 6:11 AM To: user@spark.apache.org Subject: [EXTERNAL] Spark Not Connecting Hi Apache spark community, I am a Data EngineerI have been using Apache spark for some time now. I recently tried to use it

Re: Loading in custom Hive jars for spark

2023-07-11 Thread Mich Talebzadeh
Are you using Spark 3.4? Under directory $SPARK_HOME, get a list of jar files for Hive and Hadoop. This one is for version 3.4.0:

    /opt/spark/jars> ltr *hive* *hadoop*
    -rw-r--r--. 1 hduser hadoop 717820 Apr  7 03:43 spark-hive_2.12-3.4.0.jar
    -rw-r--r--. 1 hduser hadoop 563632 Apr  7 03:43
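
Related, a hedged sketch of pointing Spark at custom Hive client jars (the version and path are assumptions; the spark.sql.hive.metastore.* keys are documented Spark settings):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.sql.hive.metastore.version", "2.3.9")
        .config("spark.sql.hive.metastore.jars", "path")
        .config("spark.sql.hive.metastore.jars.path", "file:///opt/hive/lib/*.jar")
        .enableHiveSupport()
        .getOrCreate()
    )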

Re: PySpark error java.lang.IllegalArgumentException

2023-07-10 Thread elango vaidyanathan
Finally I was able to solve this issue by setting this conf: "spark.driver.extraJavaOptions=-Dorg.xerial.snappy.tempdir=/my_user/temp_folder". Thanks all! On Sat, 8 Jul 2023 at 3:45 AM, Brian Huynh wrote: > Hi Khalid, > > Elango mentioned the file is working fine in another environment of ours

Re: PySpark error java.lang.IllegalArgumentException

2023-07-07 Thread Brian Huynh
Hi Khalid, Elango mentioned the file is working fine in another environment of ours with the same driver and executor memory. Brian On Jul 7, 2023, at 10:18 AM, Khalid Mammadov wrote: Perhaps that parquet file is corrupted or got that is in that folder? To check, try to read that file with pandas or other

Re: PySpark error java.lang.IllegalArgumentException

2023-07-07 Thread Khalid Mammadov
Perhaps that parquet file is corrupted or got that is in that folder? To check, try to read that file with pandas or other tools to see if you can read without Spark. On Wed, 5 Jul 2023, 07:25 elango vaidyanathan, wrote: > > Hi team, > > Any updates on this below issue > > On Mon, 3 Jul 2023 at

Re: Unsubscribe

2023-07-07 Thread Atheeth SH
please send an empty email to: user-unsubscr...@spark.apache.org to unsubscribe yourself from the list. Thanks On Fri, 7 Jul 2023 at 12:05, Mihai Musat wrote: > Unsubscribe >

Re: PySpark error java.lang.IllegalArgumentException

2023-07-05 Thread elango vaidyanathan
Hi team, Any updates on the below issue? On Mon, 3 Jul 2023 at 6:18 PM, elango vaidyanathan wrote: > > > Hi all, > > I am reading a parquet file like this and it gives > java.lang.IllegalArgumentException. > However I can work with other parquet files (such as nyc taxi parquet > files)

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-04 Thread Shashank Rao
Z is just an example. It could be anything. Basically, anything that's not in schema should be filtered out. On Tue, 4 Jul 2023, 13:27 Hill Liu, wrote: > I think you can define schema with column z and filter out records with z > is null. > > On Tue, Jul 4, 2023 at 3:24 PM Shashank Rao >

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-04 Thread Hill Liu
I think you can define schema with column z and filter out records with z is null. On Tue, Jul 4, 2023 at 3:24 PM Shashank Rao wrote: > Yes, drop malformed does filter out record4. However, record 5 is not. > > On Tue, 4 Jul 2023 at 07:41, Vikas Kumar wrote: > >> Have you tried dropmalformed

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-04 Thread Shashank Rao
Yes, dropMalformed does filter out record 4. However, record 5 is not filtered out. On Tue, 4 Jul 2023 at 07:41, Vikas Kumar wrote: > Have you tried the dropMalformed option? > > On Mon, Jul 3, 2023, 1:34 PM Shashank Rao wrote: >> Update: Got it working by using the *_corrupt_record *field for the >> first

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-03 Thread Vikas Kumar
Have you tried dropmalformed option ? On Mon, Jul 3, 2023, 1:34 PM Shashank Rao wrote: > Update: Got it working by using the *_corrupt_record *field for the first > case (record 4) > > schema = schema.add("_corrupt_record", DataTypes.StringType); > Dataset ds =
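
For reference, a minimal sketch of the dropMalformed suggestion (schema and path are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType, StructField, StructType

    spark = SparkSession.builder.getOrCreate()

    schema = StructType([StructField("a", StringType()),
                         StructField("b", StringType())])

    # DROPMALFORMED silently drops rows that cannot be parsed against the
    # schema; rows with extra fields (like record 5 in this thread) still
    # parse and are therefore kept.
    df = (spark.read.schema(schema)
          .option("mode", "DROPMALFORMED")
          .json("/data/records.json"))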

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Gavin Ray
Wow, really neat -- thanks for sharing! On Mon, Jul 3, 2023 at 8:12 PM Gengliang Wang wrote: > Dear Apache Spark community, > > We are delighted to announce the launch of a groundbreaking tool that aims > to make Apache Spark more user-friendly and accessible - the English SDK >

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Hyukjin Kwon
The demo was really amazing. On Tue, 4 Jul 2023 at 09:17, Farshid Ashouri wrote: > This is wonderful news! > > On Tue, 4 Jul 2023 at 01:14, Gengliang Wang wrote: > >> Dear Apache Spark community, >> >> We are delighted to announce the launch of a groundbreaking tool that >> aims to make Apache

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Farshid Ashouri
This is wonderful news! On Tue, 4 Jul 2023 at 01:14, Gengliang Wang wrote: > Dear Apache Spark community, > > We are delighted to announce the launch of a groundbreaking tool that aims > to make Apache Spark more user-friendly and accessible - the English SDK >

Re: Filtering JSON records when there isn't an exact schema match in Spark

2023-07-03 Thread Shashank Rao
Update: Got it working by using the *_corrupt_record *field for the first case (record 4) schema = schema.add("_corrupt_record", DataTypes.StringType); Dataset ds = spark.read().schema(schema).option("mode", "PERMISSIVE").json("path").collect(); ds =
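
A self-contained sketch of that pattern (field names and path are assumptions; the cache() call works around Spark's restriction on queries that reference only the internal corrupt-record column):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    spark = SparkSession.builder.getOrCreate()

    schema = StructType([StructField("x", LongType()), StructField("y", LongType())])
    schema = schema.add("_corrupt_record", StringType())

    # PERMISSIVE mode keeps every row, routing non-matching ones into
    # _corrupt_record instead of dropping them.
    df = (spark.read.schema(schema)
          .option("mode", "PERMISSIVE")
          .json("/data/records.json"))

    df = df.cache()
    clean = df.filter(col("_corrupt_record").isNull()).drop("_corrupt_record")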
