Re: Spark-sql can replace Hive ?

2021-06-10 Thread Mich Talebzadeh
a Reddy > *Cc: *d...@spark.apache.org , user@spark.apache.org < > user@spark.apache.org> > *Subject: *Re: Spark-sql can replace Hive ? > > Would you mind expanding the ask? Spark SQL can use Hive by itself > > > > On Thu, 10 Jun 2021 at 8:58 pm, Battula, Brahma

Re: Spark-sql can replace Hive ?

2021-06-10 Thread Battula, Brahma Reddy
Thanks for the prompt reply. I want to replace hive with spark. From: ayan guha Date: Thursday, 10 June 2021 at 4:35 PM To: Battula, Brahma Reddy Cc: d...@spark.apache.org , user@spark.apache.org Subject: Re: Spark-sql can replace Hive ? Would you mind expanding the ask? Spark SQL can use

Re: Spark-sql can replace Hive ?

2021-06-10 Thread ayan guha
Would you mind expanding the ask? Spark SQL can use Hive by itself On Thu, 10 Jun 2021 at 8:58 pm, Battula, Brahma Reddy wrote: > Hi > > > > Would like to know any references/docs to replace hive with spark-sql > completely, like how to migrate the existing data in hive

Spark-sql can replace Hive ?

2021-06-10 Thread Battula, Brahma Reddy
Hi, Would like to know any references/docs to replace hive with spark-sql completely, like how to migrate the existing data in Hive? Thanks

[Spark SQL][Intermediate][How to] Custom transformation to datasource V2 write apis

2021-06-03 Thread Sivabalan
Hey folks, Is it possible to add some custom transformations to dataframe with a custom datasource V2 write api? I understand I need to define Table <https://github.com/apache/spark/blob/v3.1.1/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Table.java> -> SupportsWri

Re: DataSource API v2 & Spark-SQL

2021-04-29 Thread roizaig
You can create a custom data source following this blog. It shows how to read a Java log file using the Spark v3 API as an example. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: [Spark SQL]: Does Spark SQL can have better performance?

2021-04-29 Thread Mich Talebzadeh
Hi, your query parquetFile = spark.read.parquet("path/to/hdfs") parquetFile.createOrReplaceTempView("parquetFile") spark.sql("SELECT * FROM parquetFile WHERE field1 = 'value' ORDER BY timestamp LIMIT 1") will be lazily evaluated and won't do anything until the sql statement is actioned with
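A minimal runnable sketch of the laziness being described here, assuming an illustrative path and column names (field1, timestamp) rather than anything confirmed in the original post:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lazy_parquet_sql").getOrCreate()

    # Both of these calls are lazy: nothing is read from storage yet.
    parquetFile = spark.read.parquet("path/to/hdfs")
    parquetFile.createOrReplaceTempView("parquetFile")

    result = spark.sql(
        "SELECT * FROM parquetFile WHERE field1 = 'value' ORDER BY timestamp LIMIT 1"
    )

    # Only an action (show, collect, write, ...) triggers the scan, filter and sort.
    result.show()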

[Spark SQL]: Does Spark SQL can have better performance?

2021-04-28 Thread Amin Borjian
Hi. We use spark 3.0.1 in HDFS cluster and we store our files as parquet with snappy compression and enabled dictionary. We try to perform a simple query: parquetFile = spark.read.parquet("path/to/hadf") parquetFile.createOrReplaceTempView("parquetFile") spark.sql("SELECT * FROM parquetFile WHER

Re: Is a Hive installation necessary for Spark SQL?

2021-04-25 Thread Dennis Suhari
Hi, you can also load other data sources without Hive using spark.read.format into a Spark DataFrame. From there you can also combine the results using the DataFrame world. The use case of Hive is to have a common abstraction layer when you want to do data tagging, access management under on

Is a Hive installation necessary for Spark SQL?

2021-04-25 Thread krchia
Does it make sense to keep a Hive installation when your parquet files come with a transactional metadata layer like Delta Lake / Apache Iceberg? My understanding from this: https://github.com/delta-io/delta/issues/85 is that Hive is no longer necessary other than discovering where the table is s

Accelerating Spark SQL / Dataframe using GPUs & Alluxio

2021-04-23 Thread Bin Fan
Hi Spark users, We have been working on GPU acceleration for Apache Spark SQL / Dataframe using the RAPIDS Accelerator for Apache Spark <https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/apache-spark-3/> and open source project Alluxio <https://github.com/Alluxi

Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe

2021-04-09 Thread ayan guha
to add the below scenario based code to the executing spark >>> job,while executing this it took lot of time to complete,please suggest >>> best way to get below requirement without using UDF >>> >>> >>> Thanks, >>> >>> Ankamma Rao B >>

Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe

2021-04-09 Thread Sean Owen
ecuting this it took lot of time to complete,please suggest >> best way to get below requirement without using UDF >> >> >> Thanks, >> >> Ankamma Rao B >> -- >> *From:* Sean Owen >> *Sent:* Friday, April 9, 2021 6:11 PM &

Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe

2021-04-09 Thread ayan guha
> *From:* Sean Owen > *Sent:* Friday, April 9, 2021 6:11 PM > *To:* ayan guha > *Cc:* Rao Bandaru ; User > *Subject:* Re: [Spark SQL]: to calculate distance between four > coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark > dataframe >

Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe

2021-04-09 Thread Rao Bandaru
, April 9, 2021 6:11 PM To: ayan guha Cc: Rao Bandaru ; User Subject: Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe This can be significantly faster with a pandas UDF, note, because you can vectorize the

Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe

2021-04-09 Thread Sean Owen
os(toRadians(long_x) - toRadians(long_y)) > ) * lit(6371.0) > > distudf = udf(haversine_distance, FloatType()) > > in case you just want to use just Spark SQL, you can still utilize the > functions shown above to implement in SQL. > > Any reason you do not want to use UDF?

Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe

2021-04-09 Thread ayan guha
(toRadians(lat_y)) + cos(toRadians(lat_x)) * cos(toRadians(lat_y)) * cos(toRadians(long_x) - toRadians(long_y)) ) * lit(6371.0) distudf = udf(haversine_distance, FloatType()) in case you just want to use just Spark SQL, you can still utilize the functions shown above to implement
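A minimal sketch of the no-UDF approach outlined in this thread, built only from pyspark.sql.functions column expressions; the outer acos and the sample coordinates are assumptions added for illustration, not taken from the (truncated) original:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import acos, cos, sin, toRadians, lit, col

    spark = SparkSession.builder.getOrCreate()

    def great_circle_km(lat_x, long_x, lat_y, long_y):
        # Spherical law of cosines on column expressions; 6371.0 km is the mean Earth radius.
        return acos(
            sin(toRadians(col(lat_x))) * sin(toRadians(col(lat_y)))
            + cos(toRadians(col(lat_x))) * cos(toRadians(col(lat_y)))
            * cos(toRadians(col(long_x)) - toRadians(col(long_y)))
        ) * lit(6371.0)

    df = spark.createDataFrame(
        [(52.52, 13.405, 48.8566, 2.3522)],  # roughly Berlin to Paris
        ["Latitude1", "Longitude1", "Latitude2", "Longitude2"],
    )
    df.withColumn(
        "distance_km",
        great_circle_km("Latitude1", "Longitude1", "Latitude2", "Longitude2"),
    ).show()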

[Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe

2021-04-09 Thread Rao Bandaru
Hi All, I have a requirement to calculate the distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe with the help of from geopy import distance, without using a UDF (user defined function). Please help how to achieve this scenario and do the needful.

Re: [SPARK SQL] Sometimes spark does not scale down on k8s

2021-04-05 Thread Alexei
I've increased spark.scheduler.listenerbus.eventqueue.executorManagement.capacity to 10M; this led to several things. First, the scaler didn't break when it was expected to. I mean, maxNeededExecutors remained low (except peak values). Second, the scaler started to behave a bit weird. Having maxExecutors=50
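For reference, the settings being discussed, expressed as SparkSession config; the values are illustrative and this is not a verified fix for the scale-down issue:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("sql-on-k8s")
        .config("spark.dynamicAllocation.enabled", "true")
        # Needed for dynamic allocation without an external shuffle service (Spark 3.x on k8s).
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
        .config("spark.dynamicAllocation.maxExecutors", "50")
        .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
        # The event queue the poster enlarged to 10M entries.
        .config("spark.scheduler.listenerbus.eventqueue.executorManagement.capacity", "10000000")
        .getOrCreate()
    )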

[SPARK SQL] Sometimes spark does not scale down on k8s

2021-04-02 Thread Alexei
Hi all! We are using Spark as a constantly running SQL interface to parquet on HDFS and GCS with our in-house app. We use autoscaling with the k8s backend. Sometimes (approx. once a day) something nasty happens and Spark stops scaling down, staying with the max available executors. I've checked graphs (https

Re: [Spark SQL]: Can complex oracle views be created using Spark SQL

2021-03-23 Thread Mich Talebzadeh
ying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Mon, 22 Mar 2021 at 05:38, Gaurav Singh wrote: > Hi Team, > > We have lots of complex oracle views ( containing multiple tables, joins, > analytical and aggregate functions, sub queries etc) and we are wondering > if Spark can help us execute those views faster. > > Also we want to know if those complex views can be implemented using Spark > SQL? > > Thanks and regards, > Gaurav Singh > +91 8600852256 > >

Re: [Spark SQL]: Can complex oracle views be created using Spark SQL

2021-03-22 Thread Mich Talebzadeh
we want to know if those complex views can be implemented using Spark > SQL? > > Thanks and regards, > Gaurav Singh > +91 8600852256 > >

[Spark SQL]: Can complex oracle views be created using Spark SQL

2021-03-21 Thread Gaurav Singh
Hi Team, We have lots of complex Oracle views (containing multiple tables, joins, analytical and aggregate functions, subqueries etc.) and we are wondering if Spark can help us execute those views faster. Also, we want to know if those complex views can be implemented using Spark SQL? Thanks

Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-04 Thread Oldrich Vlasic
rict schema matching". From: Jeff Evans Sent: Thursday, March 4, 2021 2:55 PM To: Oldrich Vlasic Cc: Russell Spitzer ; Sean Owen ; user ; Ondřej Havlíček Subject: Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto Why not perform a df.select(...) before th

Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-04 Thread Jeff Evans
tists) from > falling victim to this. > -- > *From:* Russell Spitzer > *Sent:* Wednesday, March 3, 2021 3:31 PM > *To:* Sean Owen > *Cc:* Oldrich Vlasic ; user < > user@spark.apache.org>; Ondřej Havlíček > *Subject:* Re: [Spark SQL, intermedi

Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-04 Thread Oldrich Vlasic
lasic ; user ; Ondřej Havlíček Subject: Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto Yep this is the behavior for Insert Into, using the other write apis does schema matching I believe. On Mar 3, 2021, at 8:29 AM, Sean Owen mailto:sro...@gmail.com>> wrot

Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-03 Thread Russell Spitzer
Yep, this is the behavior for Insert Into; using the other write APIs does schema matching, I believe. > On Mar 3, 2021, at 8:29 AM, Sean Owen wrote: > > I don't have any good answer here, but, I seem to recall that this is because > of SQL semantics, which follows column ordering not naming whe

Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-03 Thread Sean Owen
I don't have any good answer here, but, I seem to recall that this is because of SQL semantics, which follows column ordering not naming when performing operations like this. It may well be as intended. On Tue, Mar 2, 2021 at 6:10 AM Oldrich Vlasic < oldrich.vla...@datasentics.com> wrote: > Hi, >

[Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-02 Thread Oldrich Vlasic
Hi, I have encountered a weird and potentially dangerous behaviour of Spark concerning partial overwrites of partitioned data. Not sure if this is a bug or just abstraction leak. I have checked Spark section of Stack Overflow and haven't found any relevant questions or answers. Full minimal wo

Spark SQL Macros

2021-02-19 Thread Harish Butani
Hi, I have been working on Spark SQL Macros https://github.com/hbutani/spark-sql-macros Spark SQL Macros provide a capability to register custom functions into a Spark Session that is similar to UDF Registration. The difference being that the SQL Macros registration mechanism attempts to

[Spark SQL] - Not able to consume Kafka topics

2021-02-19 Thread Rathore, Yashasvini
same code doesn’t work there. We are referring to the official documentation page, and using the exact same syntax and the same versions as mentioned, but somehow the code fails on the startingOffsetsByTimestamps line. * The following versions are being used: * Scala : 2.12.12 * Spark-sql

Re: [Spark SQL] - Not able to consume Kafka topics

2021-02-18 Thread Jungtaek Lim
there. We are referring the > official documentation page, and using the exact same syntax and the same > versions as mentioned but somehow the code fails on the > startingOffsetsByTimestamps line. > * The following versions are being used: > > * Scala : 2.12.12 > *

Re: Spark SQL Dataset and BigDecimal

2021-02-18 Thread Khalid Mammadov
>> If it returns a scala decimal, java code cannot handle it. >> >> If you want a scala decimal, you need to convert it by yourself. >> >> Bests, >> Takeshi >> >>> On Wed, Feb 17, 2021 at 9:48 PM Ivan Petrov wrote: >>> Hi, I'm us

[Spark SQL] - Not able to consume Kafka topics

2021-02-18 Thread Rathore, Yashasvini
same code doesn’t work there. We are referring to the official documentation page, and using the exact same syntax and the same versions as mentioned, but somehow the code fails on the startingOffsetsByTimestamps line. * The following versions are being used: * Scala : 2.12.12 * Spark-sql

Re: Spark SQL Dataset and BigDecimal

2021-02-18 Thread Ivan Petrov
's needed for interoperability between > scala/java. > If it returns a scala decimal, java code cannot handle it. > > If you want a scala decimal, you need to convert it by yourself. > > Bests, > Takeshi > > On Wed, Feb 17, 2021 at 9:48 PM Ivan Petrov wrote: > >>

Re: Spark SQL Dataset and BigDecimal

2021-02-17 Thread Takeshi Yamamuro
I'm using Spark Scala Dataset API to write spark sql jobs. > I've noticed that Spark dataset accepts scala BigDecimal as the value but > it always returns java.math.BigDecimal when you read it back. > > Is it by design? > Should I use java.math.BigDecimal everywhere instead?

Spark SQL Dataset and BigDecimal

2021-02-17 Thread Ivan Petrov
Hi, I'm using Spark Scala Dataset API to write spark sql jobs. I've noticed that Spark dataset accepts scala BigDecimal as the value but it always returns java.math.BigDecimal when you read it back. Is it by design? Should I use java.math.BigDecimal everywhere instead? Is there any p

[SPARK-SQL] Does Spark 3.0 support parquet predicate pushdown for array of structs?

2021-02-14 Thread Haijia Zhou
Hi, I know Spark 3.0 has added Parquet predicate pushdown for nested structures (SPARK-17636) Does it also support predicate pushdown for an array of structs?   For example, say I have a spark table 'individuals' (in parquet format) with the following schema  root |-- individual_id: string (nulla

Re: Spark SQL query

2021-02-03 Thread Mich Talebzadeh
content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Wed, 3 Feb 2021 at 11:17, Arpan Bhandari wrote: > Yes Mich, > > Mapping the spark sql query that got executed corresponding to an > app

Re: Spark SQL query

2021-02-03 Thread Arpan Bhandari
Yes Mich, Mapping the spark sql query that got executed corresponding to an application Id on yarn would greatly help in analyzing and debugging the query for any potential problems. Thanks, Arpan Bhandari -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com

Re: Spark SQL query

2021-02-03 Thread Mich Talebzadeh
I gather what you are after is a code sniffer for Spark that provides a form of GUI to get the code that applications run against spark. I don't think Spark has this type of plug-in although it would be potentially useful. Some RDBMS provide this. Usually stored on some form of persistent storage

Re: Spark SQL query

2021-02-02 Thread Arpan Bhandari
Mich, The directory is already there and event logs are getting generated, I have checked them it contains the query plan but not the actual query. Thanks, Arpan Bhandari -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ --

Re: Spark SQL query

2021-02-02 Thread Mich Talebzadeh
create a directory in hdfs hdfs dfs -mkdir /spark_event_logs modify file $SPARK_HOME/conf/spark-defaults.conf and add these two lines spark.eventLog.enabled=true # do not use quotes below spark.eventLog.dir=hdfs://rhes75:9000/spark_event_logs Then run a job and check it hdfs dfs -ls /spark_eve

Re: Spark SQL query

2021-02-02 Thread Arpan Bhandari
Yes, I can see the jobs on 8088 and also on the Spark history URL. The Spark history server is showing the plan details on the SQL tab but not giving the query. Thanks, Arpan Bhandari -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: Spark SQL query

2021-02-02 Thread Arpan Bhandari
Hi Mich, I do see the .scala_history directory, but it contains all the queries which got executed up till now; if I have to map a specific query to an application ID in YARN, that would not correlate, hence this method alone won't suffice. Thanks, Arpan Bhandari -- Sent from: http://apache

Re: Spark SQL query

2021-02-02 Thread Mich Talebzadeh
Hi Arpan. I believe all applications including spark and scala create a hidden history file. You can go to the home directory: cd # see the list of all hidden files: ls -a | egrep '^\.' If you are using scala, do you see the .scala_history file? .scala_history HTH LinkedIn * https://www.linkedin.com/pr

Re: Spark SQL query

2021-02-02 Thread Arpan Bhandari
Hi Mich, Repeated the steps as suggested, but still there is no such folder created in the home directory. Do we need to enable some property so that it creates one? Thanks, Arpan Bhandari -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: Spark SQL query

2021-02-02 Thread Arpan Bhandari
Sanchit, It seems I have to do some sort of analysis from the plan to get the query. Appreciate all your help on this. Thanks, Arpan Bhandari -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscrib

Re: Spark SQL query

2021-02-01 Thread Mich Talebzadeh
Hi Arpan, log in as any user that has execution rights for spark. Type spark-shell, do some simple commands, then exit. Go to the home directory of that user and look for that hidden file ${HOME}/.spark_history; it will be there. HTH, LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2g

Re: Spark SQL query

2021-02-01 Thread Sachit Murarka
Application-wise it won't show as such. You can try to correlate it with the explain plan output using some filters or attributes. Or else, if you do not have too many queries in history, just take the queries, find the plans of those queries and match them with what is shown in the UI. I know that's a tedious task. But I

Re: Spark SQL query

2021-02-01 Thread Arpan Bhandari
Sachit, That is showing all the queries that got executed, but how would it get mapped to the specific application ID it was associated with? Thanks, Arpan Bhandari -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: Spark SQL query

2021-02-01 Thread Sachit Murarka
Hi Arpan, In spark-shell, when you type :history, is it still not showing? Thanks Sachit On Mon, 1 Feb 2021, 21:13 Arpan Bhandari, wrote: > Hey Sachit, > > It shows the query plan, which is difficult to diagnose and work back to the > actual query. > > > Thanks, > Arpan Bhandari > > > > -- >

Re: Spark SQL query

2021-02-01 Thread Arpan Bhandari
Hey Mich, Thanks for the suggestions, but I don't see any such folder created on the edge node. Thanks, Arpan Bhandari -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr

Re: Spark SQL query

2021-02-01 Thread Arpan Bhandari
Hey Sachit, It shows the query plan, which is difficult to diagnose and work back to the actual query. Thanks, Arpan Bhandari -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-uns

Re: Spark SQL query

2021-01-31 Thread Mich Talebzadeh
n no case be liable for any monetary damages arising from such loss, damage or destruction. On Fri, 29 Jan 2021 at 13:49, Arpan Bhandari wrote: > Hi , > > Is there a way to track back spark sql after it has been already run i.e. > query has been already submitted by a person and i have

Re: Spark SQL query

2021-01-31 Thread Sachit Murarka
Hi Arpan, Launch spark-shell and in the shell type ":history"; you will see the queries executed. In the Spark UI, under the SQL tab, you can see the query plan when you click on the details button (though it won't show you the complete query). But by looking at the plan you can get your query. Hope t

Re: Spark SQL query

2021-01-29 Thread Arpan Bhandari
Hi Sachit, Yes, it was executed using spark-shell; history is already enabled. I already checked the SQL tab but it is not showing the query. My Spark version is 2.4.5. Thanks, Arpan Bhandari -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ --

Re: Spark SQL query

2021-01-29 Thread Sachit Murarka
Hi Arpan, Was it executed using spark-shell? If yes, type :history. Do you have the history server enabled? If yes, go to the history and open the SQL tab in the History UI. Thanks Sachit On Fri, 29 Jan 2021, 19:19 Arpan Bhandari, wrote: > Hi , > > Is there a way to track back spark sql aft

Spark SQL query

2021-01-29 Thread Arpan Bhandari
Hi, Is there a way to trace back a Spark SQL query after it has already been run, i.e. the query has already been submitted by a person and I have to back-trace what query actually got submitted? Appreciate any help on this. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com

Re: Column-level encryption in Spark SQL

2021-01-21 Thread Gourav Sengupta
Never heard of it (and have once been tasked to explore a similar use >> case). I'm curious how you'd like it to work? (no idea how Hive does this >> either) >> >> Pozdrawiam, >> Jacek Laskowski >> >> https://about.me/JacekLaskowski >> &

Re: Column-level encryption in Spark SQL

2021-01-21 Thread Mich Talebzadeh
Books <https://books.japila.pl/> > Follow me on https://twitter.com/jaceklaskowski > > <https://twitter.com/jaceklaskowski> > > > On Sat, Dec 19, 2020 at 2:38 AM john washington > wrote: > >> Dear Spark team members, >> >> Can you please advise if Column-level encryption is available in Spark >> SQL? >> I am aware that HIVE supports column level encryption. >> >> Appreciate your response. >> >> Thanks, >> John >> >

Re: Column-level encryption in Spark SQL

2021-01-21 Thread Jacek Laskowski
apila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Sat, Dec 19, 2020 at 2:38 AM john washington wrote: > Dear Spark team members, > > Can you please advise if Column-level encryption is available in Spark SQL? > I am awar

Re: [Spark SQL]HiveQL and Spark SQL producing different results

2021-01-12 Thread Terry Kim
Ying, Can you share a query that produces different results? Thanks, Terry On Sun, Jan 10, 2021 at 1:48 PM Ying Zhou wrote: > Hi, > > I run some SQL using both Hive and Spark. Usually we get the same results. > However when a window function is in the script Hive and Spark can produce > differe

[Spark SQL]HiveQL and Spark SQL producing different results

2021-01-10 Thread Ying Zhou
Hi, I run some SQL using both Hive and Spark. Usually we get the same results. However, when a window function is in the script, Hive and Spark can produce different results. Is this intended behavior, or does either Hive or Spark have a bug? Thanks, Ying

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-26 Thread Mich Talebzadeh
Well, I gave up on using anything except the standard one offered by PySpark itself. The problem is that anything that is homemade (a UDF) is never going to be as performant as the functions offered by Spark itself. What I don't understand is why a numpy-provided STDDEV should be more performant than

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-24 Thread Sean Owen
Why not just use STDDEV_SAMP? it's probably more accurate than the differences-of-squares calculation. You can write an aggregate UDF that calls numpy and register it for SQL, but, it is already a built-in. On Thu, Dec 24, 2020 at 8:12 AM Mich Talebzadeh wrote: > Thanks for the feedback. > > I h
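A minimal sketch of the built-ins being referred to, against an assumed view `sales` with a numeric column `amount`:

    spark.sql("""
        SELECT stddev_samp(amount) AS std_sample,      -- sample (Bessel-corrected) standard deviation
               stddev_pop(amount)  AS std_population,  -- population standard deviation
               stddev(amount)      AS std_default      -- in Spark SQL an alias for stddev_samp
        FROM sales
    """).show()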

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-24 Thread Mich Talebzadeh
l-corrected standard deviation. > > On Thu, Dec 24, 2020 at 3:17 AM Mich Talebzadeh > wrote: > >> >> Well the truth is that we had this discussion in 2016 :(. what Hive >> calls Standard Deviation Function STDDEV is a pointer to STDDEV_POP. This >> is incorrect

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-24 Thread Sean Owen
ard deviation. On Thu, Dec 24, 2020 at 3:17 AM Mich Talebzadeh wrote: > > Well the truth is that we had this discussion in 2016 :(. what Hive calls > Standard Deviation Function STDDEV is a pointer to STDDEV_POP. This is > incorrect and has not been rectified yet! > > > Spark-sql

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-24 Thread Mich Talebzadeh
Well, the truth is that we had this discussion in 2016 :(. What Hive calls the Standard Deviation Function STDDEV is a pointer to STDDEV_POP. This is incorrect and has not been rectified yet! Spark-sql, Oracle and Sybase point STDDEV to STDDEV_SAMP and not STDDEV_POP. Run a test on *Hive* SELECT

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-23 Thread Sean Owen
gt; """ > > spark.sql(sqltext) > > Now if I wanted to use UDF based on numpy STD function, I can do > > import numpy as np > from pyspark.sql.functions import UserDefinedFunction > from pyspark.sql.types import DoubleType > udf = User

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-23 Thread Mich Talebzadeh
OK, thanks for the tip. I found this link useful for Python from Databricks: User-defined functions - Python — Databricks Documentation <https://docs.databricks.com/spark/latest/spark-sql/udf-python.html> Linke

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-23 Thread Peyman Mohajerian
> 3 DESC > > """ > > spark.sql(sqltext) > > Now if I wanted to use UDF based on numpy STD function, I can do > > import numpy as np > from pyspark.sql.functions import UserDefinedFunction > from pyspark.sql.types import DoubleType > udf = User

Using UDF based on Numpy functions in Spark SQL

2020-12-23 Thread Mich Talebzadeh
function, I can do import numpy as np from pyspark.sql.functions import UserDefinedFunction from pyspark.sql.types import DoubleType udf = UserDefinedFunction(np.std, DoubleType()) How can I use that udf with spark SQL? I gather this is only possible through functional programming?
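One way to make such a UDF callable from SQL is spark.udf.register; a minimal sketch, assuming a registered view `sales` with a numeric column `amount`, and using collect_list because a plain UDF is not an aggregate function:

    import numpy as np
    from pyspark.sql.types import DoubleType

    # np.std computes the population standard deviation by default (ddof=0).
    spark.udf.register("np_std", lambda xs: float(np.std(xs)), DoubleType())

    spark.sql("""
        SELECT np_std(collect_list(amount)) AS std_np
        FROM sales
    """).show()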

Re: Re: Is Spark SQL able to auto update partition stats like hive by setting hive.stats.autogather=true

2020-12-19 Thread Mich Talebzadeh
roperty which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Sat, 19 Dec 2020 at 07:51, 疯狂的哈丘 wrote: > thx,but `hive.stats.autogather` is n

Re: Re: Is Spark SQL able to auto update partition stats like hive by setting hive.stats.autogather=true

2020-12-18 Thread 疯狂的哈丘
Thanks, but `hive.stats.autogather` does not work for Spark SQL. - Original Message - From: Mich Talebzadeh To: kongt...@sina.com Cc: user Subject: Re: Is Spark SQL able to auto update partition stats like hive by setting hive.stats.autogather=true Date: 19 December 2020, 06:45 Hi, A fellow forum member kindly spotted a

Column-level encryption in Spark SQL

2020-12-18 Thread john washington
Dear Spark team members, Can you please advise if Column-level encryption is available in Spark SQL? I am aware that HIVE supports column level encryption. Appreciate your response. Thanks, John

Re: Is Spark SQL able to auto update partition stats like hive by setting hive.stats.autogather=true

2020-12-18 Thread Mich Talebzadeh
*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monet

Re: Is Spark SQL able to auto update partition stats like hive by setting hive.stats.autogather=true

2020-12-18 Thread Mich Talebzadeh
I am afraid it is not supported for Spark SQL; see Automatic Statistics Collection For Better Query Performance | Qubole <https://www.qubole.com/blog/automatic-statistics-collection-better-query-performance/> I tried it as below: spark = SparkSession.builder \ .appName

Is Spark SQL able to auto update partition stats like hive by setting hive.stats.autogather=true

2020-12-17 Thread 疯狂的哈丘
`spark.sql.statistics.size.autoUpdate.enabled` only works for table stats updates. But for partition stats, I can only update them with `ANALYZE TABLE tablename PARTITION(part) COMPUTE STATISTICS`. So is Spark SQL able to auto-update partition stats like Hive by setting hive.stats.autogather=true?
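A minimal sketch of the two mechanisms mentioned, assuming an illustrative partitioned table sales_db.sales with a partition column dt:

    # Table-level size statistics can be kept up to date automatically.
    spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", "true")

    # Partition-level statistics still require an explicit ANALYZE per partition.
    spark.sql(
        "ANALYZE TABLE sales_db.sales PARTITION (dt='2020-12-17') COMPUTE STATISTICS"
    )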

Re: Spark SQL check timestamp with other table and update a column.

2020-11-21 Thread Khalid Mammadov
Hi, I am not sure if you were writing pseudo-code or real code, but there were a few issues in the SQL. I have reproduced your example in the Spark REPL and all worked as expected, and the result is the one you need. Please see the full code below: ## *Spark 3.0.0* >>> a = spark.read.csv("tab1", sep="|"

Spark SQL check timestamp with other table and update a column.

2020-11-18 Thread anbutech
Hi Team, I want to update col3 in table1 if col1 from table2 is less than col1 in table1, and update each record in table1. I am not getting the correct output. Table 1: col1|col2|col3 2020-11-17T20:50:57.777+|1|null Table 2: col1|col2|col3 2020-11-17T21:19:06.508+|1|win 2020-11-17T20
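A minimal sketch of one join-based way to express this in DataFrame terms; the join key (col2) and the handling of multiple matches are assumptions, since the original post is truncated:

    from pyspark.sql import functions as F

    t1 = spark.table("table1")
    t2 = (spark.table("table2")
          .select(F.col("col1").alias("t2_col1"), "col2", F.col("col3").alias("t2_col3")))

    # If table2 can hold several rows per col2 value, reduce it to one row per key first
    # (e.g. the earliest t2_col1); otherwise the join will fan out table1's rows.
    updated = (
        t1.join(t2, on="col2", how="left")
          .withColumn(
              "col3",
              F.when(F.col("t2_col1") < F.col("col1"), F.col("t2_col3"))
               .otherwise(F.col("col3")),
          )
          .drop("t2_col1", "t2_col3")
    )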

Can all the parameters of hive be used on spark sql?

2020-11-17 Thread Gang Li
eg: set hive.merge.smallfiles.avgsize=1600; SET hive.auto.convert.join = true; SET hive.exec.compress.intermediate=true; SET hive.exec.compress.output=true; SET hive.exec.parallel=true; thank you very much!!! -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

spark-sql on windows throws Exception in thread "main" java.lang.UnsatisfiedLinkError:

2020-11-16 Thread Mich Talebzadeh
Need to create some hive test tables for PyCharm. SPARK_HOME is set up as D:\temp\spark-3.0.1-bin-hadoop2.7, HADOOP_HOME is c:\hadoop\, and spark-shell works. Trying to run spark-sql, I get the following errors: PS C:\tmp\hive> spark-sql log4j:WARN No appenders could be found for log

Spark[SqL] performance tuning

2020-11-12 Thread Lakshmi Nivedita
Hi all, I have a pyspark sql script that loads one table of 80 MB and one of 2 MB, and the remaining 3 are small tables, performing lots of joins in the script to fetch the data. My system configuration is 4 nodes, 300 GB, 64 cores. To write a data frame into a table of 24 MB of records, the system is taking 4 minut

Re: Integration testing Framework Spark SQL Scala

2020-11-02 Thread Lars Albertsson
Thu, Feb 20, 2020 at 6:09 PM Ruijing Li wrote: >> >> Hi all, >> >> I’m interested in hearing the community’s thoughts on best practices to do >> integration testing for spark sql jobs. We run a lot of our jobs with cloud >> infrastructure and hdfs - this

Re: [Spark SQL] does pyspark udf support spark.sql inside def

2020-10-01 Thread Lakshmi Nivedita
Sure, will do that. I am using Impala in pyspark to retrieve the data. Table A schema: date1 Bigint, date2 Bigint, ctry string. Sample data for table A: date1 date2 ctry 22-12-2012 06-01-2013 IN. Table B schema: holidate Bigint, Holiday = 0/1 string (0 means holiday, 1 means

Re: [Spark SQL] does pyspark udf support spark.sql inside def

2020-09-30 Thread Amit Joshi
Can you please post the schema of both the tables? On Wednesday, September 30, 2020, Lakshmi Nivedita wrote: > Thank you for the clarification. I would like to know how I can proceed for > this kind of scenario in pyspark > > I have a scenario subtracting the total number of days with the number of > ho

[Spark SQL]pyspark to count total number of days-no of holidays by using sql

2020-09-30 Thread Lakshmi Nivedita
I have a table with dates date1, date2 in one table and the number of holidays in another table. df1 = select date1, date2, ctry, unixtimestamp(date2-date1) totalnumberofdays - df2.holidays from table A; df2 = select count(holidays) from table B where holidate >= 'date1' (table A) and holidate <= date
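A minimal sketch of a pure-DataFrame version of that pseudo-SQL (a range join plus datediff), assuming date1/date2/holidate are proper date columns and that date1, date2, ctry identify a row; this avoids calling spark.sql inside a UDF, which the reply further down notes is not supported:

    from pyspark.sql import functions as F

    a = spark.table("A")   # date1, date2, ctry
    b = spark.table("B")   # holidate, holiday flag

    result = (
        a.join(
            b,
            (F.col("holidate") >= F.col("date1")) & (F.col("holidate") <= F.col("date2")),
            "left",
        )
        .groupBy("date1", "date2", "ctry")
        .agg(F.count("holidate").alias("holidays"))
        .withColumn("totalnumberofdays", F.datediff("date2", "date1") - F.col("holidays"))
    )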

Re: [Spark SQL] does pyspark udf support spark.sql inside def

2020-09-30 Thread Lakshmi Nivedita
Thank you for the clarification. I would like to know how I can proceed for this kind of scenario in pyspark. I have a scenario subtracting the total number of days with the number of holidays in pyspark by using dataframes. I have a table with dates date1, date2 in one table and the number of holidays in

Re: [Spark SQL] does pyspark udf support spark.sql inside def

2020-09-30 Thread Sean Owen
No, you can't use the SparkSession from within a function executed by Spark tasks. On Wed, Sep 30, 2020 at 7:29 AM Lakshmi Nivedita wrote: > Here is a spark udf structure as an example > > Def sampl_fn(x): >Spark.sql(“select count(Id) from sample Where Id = x ”) > > > Spark.udf.regis

[Spark SQL] does pyspark udf support spark.sql inside def

2020-09-30 Thread Lakshmi Nivedita
Here is a spark udf structure as an example Def sampl_fn(x): Spark.sql(“select count(Id) from sample Where Id = x ”) Spark.udf.register(“sample_fn”, sample_fn) Spark.sql(“select id, sampl_fn(Id) from example”) Advance Thanks for the help -- k.Lakshmi Nivedita

[Spark SQL]: Rationale for access modifiers and qualifiers in Spark

2020-08-12 Thread 김민우
https://gist.github.com/JoeyValentine/23821c27c1f540a4ac63e446e2243dbc I wonder why the constructor and stateStoreCoordinator are private[sql]. In other words, I want to know the reason why the scope has to be [sql] and not [streaming]. And the next question is: "The reason for using private[sql

Re: [SPARK-SQL] How to return GenericInternalRow from spark udf

2020-08-06 Thread Sean Owen
The UDF should return the result value you want, not a whole Row. In Scala it figures out the schema of the UDF's result from the signature. On Thu, Aug 6, 2020 at 7:56 AM Amit Joshi wrote: > > Hi, > > I have a spark udf written in scala that takes a couple of columns and applies > some logic and o
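The thread is about a Scala UDF, but as a rough PySpark analogue of "return the value and declare the schema": return a plain tuple and declare a StructType return type (the names here are illustrative assumptions):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    schema = StructType([
        StructField("label", StringType()),
        StructField("length", IntegerType()),
    ])

    @udf(returnType=schema)
    def describe(s):
        # Return an ordinary tuple matching the declared struct, not a Row/InternalRow.
        return (s.upper(), len(s))

    df = spark.createDataFrame([("spark",), ("sql",)], ["word"])
    df.select(describe("word").alias("info")).select("info.*").show()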

[SPARK-SQL] How to return GenericInternalRow from spark udf

2020-08-06 Thread Amit Joshi
Hi, I have a spark udf written in scala that takes a couple of columns and applies some logic and outputs an InternalRow. There is a spark schema of StructType also present. But when I try to return the InternalRow from the UDF there is an exception java.lang.

Multi insert with join in Spark SQL

2020-08-05 Thread moqi
Hi, I am trying to migrate Hive SQL to Spark SQL. When I execute the Multi insert with join statement, Spark SQL will scan the same table multiple times, while Hive SQL will only scan once. In the actual production environment, this table is relatively large, which causes the running time of
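One common workaround (an assumption, not something confirmed in this thread) is to materialize the joined result once and then write it to each target, so the large table is not rescanned per insert; a minimal sketch with illustrative table names:

    joined = spark.table("big_table").join(spark.table("dim_table"), "key")
    joined.persist()

    # Each write reuses the cached join result instead of rescanning big_table.
    joined.filter("category = 'a'").write.insertInto("target_a")
    joined.filter("category = 'b'").write.insertInto("target_b")

    joined.unpersist()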

RE: DataSource API v2 & Spark-SQL

2020-08-03 Thread Lavelle, Shawn
v2 & Spark-SQL That's a bad error message. Basically you can't make a spark nati

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Henrique Oliveira
Thank you for both tips, I will definitely try the pandas_udfs. About changing the select operation, it's not possible to have multiple explode functions on the same select, sadly they must be applied one at a time. Em seg., 3 de ago. de 2020 às 11:41, Patrick McCarthy < pmccar...@dstillery.com> e

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Patrick McCarthy
If you use pandas_udfs in 2.4 they should be quite performant (or at least won't suffer serialization overhead), might be worth looking into. I didn't run your code but one consideration is that the while loop might be making the DAG a lot bigger than it has to be. You might see if defining those
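A minimal sketch of a scalar pandas_udf in the Spark 2.4 style mentioned above; the function and column are illustrative assumptions:

    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf("double", PandasUDFType.SCALAR)
    def plus_one(v):
        # Receives whole pandas Series (Arrow batches) rather than one row at a time.
        return v + 1.0

    df = spark.range(5).toDF("value")
    df.withColumn("value_plus_one", plus_one("value")).show()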

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Henrique Oliveira
Hi Patrick, thank you for your quick response. That's exactly what I think. Actually, the result of this processing is an intermediate table that is going to be used for other views generation. Another approach I'm trying now, is to move the "explosion" step for this "view generation" step, this wa

Re: [Spark SQL]: Can't write DataFrame after using explode function on multiple columns.

2020-08-03 Thread Patrick McCarthy
This seems like a very expensive operation. Why do you want to write out all the exploded values? If you just want all combinations of values, could you instead do it at read-time with a UDF or something? On Sat, Aug 1, 2020 at 8:34 PM hesouol wrote: > I forgot to add an information. By "can't w
