Spark-SQL : Getting current user name in UDF

2022-02-21 Thread Lavelle, Shawn
Hello Spark Users, I have a UDF I wrote for use with Spark-SQL that performs a lookup. In that lookup, I need to get the current SQL user so I can validate their permissions. I was using org.apache.spark.sql.util.Utils.getCurrentUserName() to retrieve the current active user from within
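One caveat with the approach above: `Utils.getCurrentUserName()` is an internal Spark API and reflects the Hadoop/OS user on the driver, which is not necessarily the SQL session user. Spark 3.2+ also exposes a `current_user()` SQL function. A minimal sketch of the pattern, assuming an active SparkSession `spark`; the ACL map and `is_allowed` helper are hypothetical stand-ins for a real permission store:

```python
# Sketch only: assumes Spark 3.2+, where the SQL function current_user()
# is available, and an active SparkSession `spark`. The ACL dict and
# is_allowed() helper are made-up placeholders for a real permission store.

def resolve_current_user(spark):
    """Ask the SQL engine who the current session user is."""
    return spark.sql("SELECT current_user()").first()[0]

def is_allowed(user, table, acl):
    """Check a user against a hypothetical {table: {allowed users}} map."""
    return user in acl.get(table, set())

# Example ACL (made up for illustration):
ACL = {"sales.orders": {"alice", "bob"}}
```

Note that resolving the user on the driver and closing over the value in a UDF gives different semantics from evaluating it inside the UDF body on executors, where the internal API may return the executor process user instead.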

Re: StructuredStreaming - foreach/foreachBatch

2022-02-21 Thread karan alang
Thanks, Gourav - will check out the book. regds, Karan Alang On Thu, Feb 17, 2022 at 9:05 AM Gourav Sengupta wrote: > Hi, > > The following excellent documentation may help as well: >

Re: StructuredStreaming - foreach/foreachBatch

2022-02-21 Thread Danilo Sousa
Hello Gourav, I'll read this document. Thanks. > On 17 Feb 2022, at 14:05, Gourav Sengupta wrote: > > Hi, > > The following excellent documentation may help as well: > https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch > >
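For reference, the guide linked above centers on the shape of the `foreachBatch` callback. A minimal sketch, with a made-up output path:

```python
# Sketch of a foreachBatch callback, following the pattern in the linked
# structured-streaming guide. The output path is a made-up example.

def write_batch(batch_df, epoch_id):
    # Each micro-batch arrives as a normal (non-streaming) DataFrame,
    # so any batch writer can be used. epoch_id identifies the micro-batch,
    # which helps make writes idempotent on retries.
    batch_df.write.mode("append").parquet(f"/tmp/out/epoch={epoch_id}")

# Wiring it into a streaming query (not executed here):
# query = df.writeStream.foreachBatch(write_batch).start()
```

The key design point from the guide: `foreachBatch` hands you a batch DataFrame per trigger, so existing batch sinks work unchanged, whereas `foreach` operates row by row.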

Re: Help With unstructured text file with spark scala

2022-02-21 Thread Danilo Sousa
Yes, this is only a single file. Thanks Rafael Mendes. > On 13 Feb 2022, at 07:13, Rafael Mendes wrote: > > Hi, Danilo. > Do you have a single large file, only? > If so, I guess you can use tools like sed/awk to split it into more files > based on layout, so you can read these files into Spark.
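The sed/awk idea above can also be done in Python before loading into Spark. A sketch, where the `"HEADER"` marker is a made-up placeholder for whatever line starts a new record in the real layout:

```python
# Sketch: split one large unstructured file into per-record chunks based
# on a layout marker, similar in spirit to the sed/awk suggestion above.
# "HEADER" is a made-up placeholder for the real record-start pattern.

def split_records(lines, marker="HEADER"):
    """Group a flat sequence of lines into records, one per marker line."""
    records, current = [], []
    for line in lines:
        if line.startswith(marker) and current:
            records.append(current)
            current = []
        current.append(line)
    if current:
        records.append(current)
    return records
```

Each record could then be written to its own file, or parsed directly into rows before creating a DataFrame.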

Re: Question about spark.sql min_by

2022-02-21 Thread Mich Talebzadeh
I gave a similar answer about windowing functions in the thread "add an auto_increment column" dated 7th February https://lists.apache.org/list.html?user@spark.apache.org HTH view my Linkedin profile

Re: Logging to determine why driver fails

2022-02-21 Thread Artemis User
Another issue I'd like to mention, which we ran into in the past, is that Spark 3.2.1 was bundled with log4j version 1.2.7.  That jar file is missing some APIs (e.g. RollingFileAppender), and you may encounter some ClassNotFound exceptions.  To resolve that issue, please make sure you

Re: Question about spark.sql min_by

2022-02-21 Thread David Diebold
Thank you for your answers. Indeed windowing should help there. Also, I just realized maybe I can try to create a struct column with both price and sellerId and apply min() on it, ordering would consider price first for the ordering (https://stackoverflow.com/a/52669177/2015762) Cheers! Le lun.
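The struct trick mentioned above works because Spark compares structs field by field, left to right, exactly the way Python compares tuples. A plain-Python analogy (not Spark code) with made-up sample data:

```python
# Plain-Python analogy of the struct trick above (not Spark code):
# min() over (price, sellerId) tuples picks the row with the lowest
# price, just as min(struct(price, sellerId)) would in Spark SQL.

rows = [
    # (productId, sellerId, price) -- made-up sample data
    (1, 101, 9.99),
    (1, 102, 7.50),
    (2, 101, 3.00),
]

def cheapest_seller(rows, product_id):
    candidates = [(price, seller) for pid, seller, price in rows if pid == product_id]
    return min(candidates)[1]  # min over (price, seller) tuples
```

One caveat of this approach: ties on price fall back to comparing the next struct field (here, sellerId), which may or may not be the tie-breaking semantics you want.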

Re: Question about spark.sql min_by

2022-02-21 Thread ayan guha
Why can this not be done by a window function? Or is min_by just a shorthand? On Tue, 22 Feb 2022 at 12:42 am, Sean Owen wrote: > From the source code, looks like this function was added to pyspark in > Spark 3.3, up for release soon. It exists in SQL. You can still use it in > SQL with

RE: Logging to determine why driver fails

2022-02-21 Thread Michael Williams (SSI)
Thank you. From: Artemis User [mailto:arte...@dtechspace.com] Sent: Monday, February 21, 2022 8:23 AM To: Michael Williams (SSI) Subject: Re: Logging to determine why driver fails Spark uses Log4j for logging. There is a log4j properties template file located in the conf directory. You can

Re: Logging to determine why driver fails

2022-02-21 Thread Artemis User
Spark uses log4j for logging.  There is a log4j properties template file in the conf directory.  Just remove the "template" extension and change the content of log4j.properties to meet your need.  More info on log4j can be found at logging.apache.org... On 2/21/22 9:15 AM, Michael Williams
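To make the steps above concrete, a sketch of what the resulting conf/log4j.properties can look like for Spark 3.2.x (log4j 1.x syntax, closely following the shipped template; levels and patterns are starting points, not recommendations):

```properties
# Sketch of conf/log4j.properties for Spark 3.2.x (log4j 1.x syntax).
# Rename conf/log4j.properties.template and adjust levels/paths to taste.
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Quiet noisy components (illustrative)
log4j.logger.org.apache.spark.repl.Main=WARN
```

Raising the root level to DEBUG temporarily is a common way to capture why a driver fails, at the cost of very verbose output.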

Logging to determine why driver fails

2022-02-21 Thread Michael Williams (SSI)
Hello, We have a POC using Spark 3.2.1 and none of us have any prior Spark experience. Our setup uses the native Spark REST api (http://localhost:6066/v1/submissions/create) on the master node (not Livy, not Spark Job server). While we have been successful at submitting python jobs via this
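For the submission endpoint mentioned above, a hedged sketch of the JSON body the standalone master's REST interface expects. Field names follow the CreateSubmissionRequest message; all concrete values below (paths, app name, main class) are made-up placeholders, and this interface is not formally documented, so verify against your deployment:

```python
import json

# Sketch of the JSON body for the standalone-master REST endpoint
# POST http://<master>:6066/v1/submissions/create. All values below
# are made-up placeholders for illustration.

def build_submission(app_resource, spark_master, args=()):
    return {
        "action": "CreateSubmissionRequest",
        "appResource": app_resource,
        "appArgs": list(args),
        "clientSparkVersion": "3.2.1",
        # Assumption: the main class depends on the job type; verify
        # what your working submissions actually send.
        "mainClass": "org.apache.spark.deploy.SparkSubmit",
        "environmentVariables": {"SPARK_ENV_LOADED": "1"},
        "sparkProperties": {
            "spark.master": spark_master,
            "spark.app.name": "poc-job",
            "spark.submit.deployMode": "cluster",
        },
    }

payload = json.dumps(build_submission("file:///jobs/app.py", "spark://master:7077"))
```

Posting this payload with any HTTP client returns a submission id, which can then be polled via the status endpoint on the same port.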

Re: Question about spark.sql min_by

2022-02-21 Thread Sean Owen
From the source code, looks like this function was added to pyspark in Spark 3.3, up for release soon. It exists in SQL. You can still use it in SQL with `spark.sql(...)` in Python though, not hard. On Mon, Feb 21, 2022 at 4:01 AM David Diebold wrote: > Hello all, > > I'm trying to use the

Re: Encoders.STRING() causing performance problems in Java application

2022-02-21 Thread Sean Owen
Oh, yes of course. If you run an entire distributed Spark job for one row, over and over, that's much slower. It would make much more sense to run the whole data set at once - the point is parallelism here. On Mon, Feb 21, 2022 at 2:36 AM wrote: > Thanks a lot, Sean, for the comments. I realize
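Sean's point above is about fixed per-job overhead. A back-of-envelope model (all numbers made up for illustration) shows why one job per row is so much slower than one job over the whole dataset:

```python
# Back-of-envelope model of the point above (numbers are illustrative):
# each Spark job carries a fixed scheduling/planning overhead, so running
# one job per row pays that overhead N times, while a single job over the
# whole dataset pays it once and parallelizes the per-row work.

def per_row_total(n_rows, overhead_s, work_per_row_s):
    # One Spark job per row: overhead paid n_rows times.
    return n_rows * (overhead_s + work_per_row_s)

def batched_total(n_rows, overhead_s, work_per_row_s):
    # One job for everything: overhead paid once.
    return overhead_s + n_rows * work_per_row_s
```

With, say, 10,000 rows, 0.2 s of overhead per job, and 1 ms of real work per row, the per-row approach costs about 2,010 s while the batched one costs about 10.2 s, before even counting the speedup from parallelism.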

Question about spark.sql min_by

2022-02-21 Thread David Diebold
Hello all, I'm trying to use the spark.sql min_by aggregation function with pyspark. I'm relying on this distribution of spark : spark-3.2.1-bin-hadoop3.2 I have a dataframe made of these columns: - productId : int - sellerId : int - price : double For each product, I want to get the seller who
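For the question above, one route that works even before the pyspark wrapper ships is calling `min_by` through SQL, since the function already exists in the SQL engine of that distribution. A sketch, assuming an active SparkSession `spark` and a DataFrame `df` with the columns described:

```python
# Sketch: use min_by through spark.sql(), assuming a SparkSession `spark`
# and a DataFrame `df` with columns productId, sellerId, price.
# The view name "offers" is a made-up choice.

def cheapest_seller_per_product(spark, df):
    df.createOrReplaceTempView("offers")
    return spark.sql("""
        SELECT productId, min_by(sellerId, price) AS sellerId
        FROM offers
        GROUP BY productId
    """)
```

The returned DataFrame has one row per product with the seller offering the lowest price, which is exactly the min_by semantics the pyspark API exposes later.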

Re: Encoders.STRING() causing performance problems in Java application

2022-02-21 Thread martin
Thanks a lot, Sean, for the comments. I realize I didn't provide enough background information to properly diagnose this issue. In the meantime, I have created some test cases for isolating the problem and running some specific performance tests. The numbers are quite revealing: Running

Re: Spark Explain Plan and Joins

2022-02-21 Thread Gourav Sengupta
Hi, I think that the best option is to use the SPARK UI. In SPARK 3.x the UI and its additional settings are fantastic. Also take a look at the settings for Adaptive Query Execution in SPARK; under certain conditions it really works wonders. For certain long queries, the way you are finally
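The Adaptive Query Execution settings mentioned above are plain Spark configs. An illustrative fragment for spark-defaults.conf (these are the common starting points, not tuned recommendations; note AQE is enabled by default from Spark 3.2):

```properties
# Illustrative spark-defaults.conf entries for Adaptive Query Execution
# (AQE) in Spark 3.x; values shown are defaults/starting points.
spark.sql.adaptive.enabled                     true
spark.sql.adaptive.coalescePartitions.enabled  true
spark.sql.adaptive.skewJoin.enabled            true
```

The effect of these settings (runtime partition coalescing, skew-join splitting, plan re-optimization) is visible in the SQL tab of the Spark UI, where adapted plans are marked as such.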