Re: StructuredStreaming error - pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist

2022-02-27 Thread karan alang
Hi Gourav,

Please see my responses below:

Can you please let us know:
1. the SPARK version, and the kind of streaming query that you are running?

KA : Apache Spark 3.1.2 - on Dataproc using Ubuntu 18.04 (the highest Spark
version supported on Dataproc is 3.1.2).

2. whether you are using at-least once, at-most once, or exactly once concepts?

KA : default value - at-least-once delivery semantics
(per my understanding, though, I don't believe delivery semantics is
related to the issue)

3. any additional details that you can provide, regarding the storage
duration in Kafka, etc?

KA : storage duration - 1 day.
However, as I mentioned in the stackoverflow ticket, readStream is set with
"failOnDataLoss" = "false", so the log retention should not cause this
issue.
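
For reference, a minimal sketch of such a reader (the broker and topic
names are illustrative, not the actual job's settings):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-reader").getOrCreate()

# failOnDataLoss=false tells Spark not to fail the query when offsets
# recorded in the checkpoint have already been deleted by Kafka retention
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "input-topic")
      .option("startingOffsets", "latest")
      .option("failOnDataLoss", "false")
      .load())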

4. are you running stateful or stateless operations? If you are using
stateful operations and SPARK 3.2 try to use RocksDB which is now natively
integrated with SPARK :)

KA : Stateful - since I'm using windowing+watermark in the aggregation
queries.
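
A minimal sketch of that kind of stateful aggregation, continuing from the
reader above (column names and durations are illustrative):

from pyspark.sql.functions import col, window

# An event-time window plus a watermark: Spark keeps per-window state
# until the watermark passes the end of the window, which is what makes
# the query stateful
agg = (df.selectExpr("CAST(value AS STRING) AS value", "timestamp")
         .withWatermark("timestamp", "10 minutes")
         .groupBy(window(col("timestamp"), "5 minutes"))
         .count())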

Also, thanks - will check the links you provided.
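
For anyone following along, a minimal sketch of the exception-handling and
metrics pattern those links describe, continuing from the sketches above
(the sink options and checkpoint path are illustrative assumptions):

import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("stream-monitor")

out = agg.selectExpr("CAST(window.start AS STRING) AS key",
                     "CAST(count AS STRING) AS value")

query = (out.writeStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("topic", "output-topic")
            .option("checkpointLocation", "gs://some-bucket/checkpoints/agg")
            .outputMode("update")
            .start())

try:
    while query.isActive:
        progress = query.lastProgress  # dict: batchId, numInputRows, durationMs, ...
        if progress:
            log.info("batchId=%s inputRows=%s",
                     progress.get("batchId"), progress.get("numInputRows"))
        query.awaitTermination(60)  # returns after 60s, or when the query terminates
except Exception as err:
    log.error("streaming query failed: %s", err)
    query.stop()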

regds,
Karan Alang

On Sat, Feb 26, 2022 at 3:31 AM Gourav Sengupta 
wrote:

> Hi,
>
> Can you please let us know:
> 1. the SPARK version, and the kind of streaming query that you are
> running?
> 2. whether you are using at-least once, at-most once, or exactly once concepts?
> 3. any additional details that you can provide, regarding the storage
> duration in Kafka, etc?
> 4. are you running stateful or stateless operations? If you are using
> stateful operations and SPARK 3.2 try to use RocksDB which is now natively
> integrated with SPARK :)
>
> Besides the mail sent by Mich, the following are useful:
> 1.
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#managing-streaming-queries
> (see the stop operation, and awaitTermination... operation)
> 2. Always ensure that you are doing exception handling based on the
> options mentioned in the above link; long-running streaming programmes in
> distributed systems do have issues, and handling exceptions is important.
> 3. Another thing I do is read the streaming metrics and push them out for
> logging; that helps me to know whether there are any performance issues in
> a long-running system (
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#reading-metrics-interactively
> or
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#reporting-metrics-programmatically-using-asynchronous-apis).
> The following is an interesting read on the kind of metrics to look
> out for and how to interpret them (
> https://docs.databricks.com/spark/latest/rdd-streaming/debugging-streaming-applications.html
> )
>
>
> Regards,
> Gourav
>
>
> On Sat, Feb 26, 2022 at 10:45 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Check the thread I forwarded on how to gracefully shutdown spark
>> structured streaming
>>
>> HTH
>>
>> On Fri, 25 Feb 2022 at 22:31, karan alang  wrote:
>>
>>> Hello All,
>>> I'm running a StructuredStreaming program on GCP Dataproc, which reads
>>> data from Kafka, does some processing, and puts processed data back into
>>> Kafka. The program was running fine; I killed it (to make minor
>>> changes) and then re-started it.
>>>
>>> It is giving me the error -
>>> pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist
>>>
>>> Here is the error:
>>>
>>> 22/02/25 22:14:08 ERROR 
>>> org.apache.spark.sql.execution.streaming.MicroBatchExecution: Query [id = 
>>> 0b73937f-1140-40fe-b201-cf4b7443b599, runId = 
>>> 43c9242d-993a-4b9a-be2b-04d8c9e05b06] terminated with error
>>> java.lang.IllegalStateException: batch 44 doesn't exist
>>> at 
>>> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$populateStartOffsets$1(MicroBatchExecution.scala:286)
>>> at scala.Option.getOrElse(Option.scala:189)
>>> at 
>>> org.apache.spark.sql.execution.streaming.MicroBatchExecution.populateStartOffsets(MicroBatchExecution.scala:286)
>>> at 
>>> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:197)
>>> at 
>>> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>>> at 
>>> org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357)
>>> at 
>>> org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355)
>>> at 
>>> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
>>> at 
>>> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:194)
>>> at 
>>> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57)
>>> at 
>>> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream

Re: StructuredStreaming error - pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist

2022-02-27 Thread karan alang
Hi Mich,
thanks - I'll check the thread you forwarded and revert back.

regds,
Karan Alang
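
For anyone reading the archive: one common graceful-shutdown pattern (not
necessarily the exact approach in the forwarded thread) is to poll an
external flag from the driver and stop the query cleanly instead of killing
the process; the marker path below is an illustrative assumption:

import os
import time

STOP_MARKER = "/tmp/stop_streaming"  # illustrative: create this file to request shutdown

# `query` is the handle returned by writeStream...start()
while query.isActive:
    if os.path.exists(STOP_MARKER):
        query.stop()  # stop the query from the driver rather than killing the JVM
        break
    time.sleep(10)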

On Sat, Feb 26, 2022 at 2:44 AM Mich Talebzadeh 
wrote:

> Check the thread I forwarded on how to gracefully shutdown spark
> structured streaming
>
> HTH
>
> On Fri, 25 Feb 2022 at 22:31, karan alang  wrote:
>
>> Hello All,
>> I'm running a StructuredStreaming program on GCP Dataproc, which reads
>> data from Kafka, does some processing, and puts processed data back into
>> Kafka. The program was running fine; I killed it (to make minor
>> changes) and then re-started it.
>>
>> It is giving me the error - pyspark.sql.utils.StreamingQueryException:
>> batch 44 doesn't exist
>>
>> Here is the error:
>>
>> 22/02/25 22:14:08 ERROR 
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution: Query [id = 
>> 0b73937f-1140-40fe-b201-cf4b7443b599, runId = 
>> 43c9242d-993a-4b9a-be2b-04d8c9e05b06] terminated with error
>> java.lang.IllegalStateException: batch 44 doesn't exist
>> at 
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$populateStartOffsets$1(MicroBatchExecution.scala:286)
>> at scala.Option.getOrElse(Option.scala:189)
>> at 
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution.populateStartOffsets(MicroBatchExecution.scala:286)
>> at 
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:197)
>> at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>> at 
>> org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357)
>> at 
>> org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355)
>> at 
>> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
>> at 
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:194)
>> at 
>> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57)
>> at 
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:188)
>> at 
>> org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:334)
>> at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>> at 
>> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:317)
>> at 
>> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:244)
>> Traceback (most recent call last):
>>   File 
>> "/tmp/0149aedd804c42f288718e013fb16f9c/StructuredStreaming_GCP_Versa_Sase_gcloud.py",
>>  line 609, in 
>> query.awaitTermination()
>>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", 
>> line 101, in awaitTermination
>>   File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", 
>> line 1304, in __call__
>>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 
>> 117, in deco
>> pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist
>>
>>
>> Question - what is the cause of this error and how to debug/fix it? Also, I
>> notice that the checkpoint location gets corrupted occasionally when I do
>> multiple restarts. After checkpoint corruption, it does not return any
>> records.
>>
>> For the above issue (as well as when the checkpoint was corrupted), when I
>> cleared the checkpoint location and re-started the program, it went through
>> fine.
>>
>> Please note: while doing readStream, I've enabled failOnDataLoss=false
>>
>> Additional details are on stackoverflow:
>>
>>
>> https://stackoverflow.com/questions/71272328/structuredstreaming-error-pyspark-sql-utils-streamingqueryexception-batch-44
>>
>> Any input on this?
>>
>> tia!
>>
>>
>> --
>
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: StructuredStreaming error - pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist

2022-02-27 Thread karan alang
Hi Gabor,
I just responded to your comment on stackoverflow.

regds,
Karan Alang

On Sat, Feb 26, 2022 at 3:06 PM Gabor Somogyi 
wrote:

> Hi Karan,
>
> Please have a look at the stackoverflow comment I made 2 days ago 😉
>
> G
>
> On Fri, 25 Feb 2022, 23:31 karan alang,  wrote:
>
>> Hello All,
>> I'm running a StructuredStreaming program on GCP Dataproc, which reads
>> data from Kafka, does some processing, and puts processed data back into
>> Kafka. The program was running fine; I killed it (to make minor
>> changes) and then re-started it.
>>
>> It is giving me the error - pyspark.sql.utils.StreamingQueryException:
>> batch 44 doesn't exist
>>
>> Here is the error:
>>
>> 22/02/25 22:14:08 ERROR 
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution: Query [id = 
>> 0b73937f-1140-40fe-b201-cf4b7443b599, runId = 
>> 43c9242d-993a-4b9a-be2b-04d8c9e05b06] terminated with error
>> java.lang.IllegalStateException: batch 44 doesn't exist
>> at 
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$populateStartOffsets$1(MicroBatchExecution.scala:286)
>> at scala.Option.getOrElse(Option.scala:189)
>> at 
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution.populateStartOffsets(MicroBatchExecution.scala:286)
>> at 
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:197)
>> at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>> at 
>> org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357)
>> at 
>> org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355)
>> at 
>> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
>> at 
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:194)
>> at 
>> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57)
>> at 
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:188)
>> at 
>> org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:334)
>> at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>> at 
>> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:317)
>> at 
>> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:244)
>> Traceback (most recent call last):
>>   File 
>> "/tmp/0149aedd804c42f288718e013fb16f9c/StructuredStreaming_GCP_Versa_Sase_gcloud.py",
>>  line 609, in 
>> query.awaitTermination()
>>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", 
>> line 101, in awaitTermination
>>   File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", 
>> line 1304, in __call__
>>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 
>> 117, in deco
>> pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist
>>
>>
>> Question - what is the cause of this error and how to debug/fix it? Also, I
>> notice that the checkpoint location gets corrupted occasionally when I do
>> multiple restarts. After checkpoint corruption, it does not return any
>> records.
>>
>> For the above issue (as well as when the checkpoint was corrupted), when I
>> cleared the checkpoint location and re-started the program, it went through
>> fine.
>>
>> Please note: while doing readStream, I've enabled failOnDataLoss=false
>>
>> Additional details are on stackoverflow:
>>
>>
>> https://stackoverflow.com/questions/71272328/structuredstreaming-error-pyspark-sql-utils-streamingqueryexception-batch-44
>>
>> Any input on this?
>>
>> tia!
>>
>>
>>


Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Raghavendra Ganesh
What is optimal depends on the context of the problem.
Is the intent here to find the best solution for top n values with a group
by?

Both solutions look sub-optimal to me. The window function would be
expensive, as it needs an order by (which a top n solution shouldn't need).
It would be best to just group by department and use an aggregate function
which stores the top n values in a heap, as sketched below.
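
A minimal sketch of that idea at the RDD level; the (department, salary)
pairs are illustrative, and since PySpark has no built-in top-n aggregate
the heap is maintained by hand:

import heapq
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("topn-heap").getOrCreate()

rows = [("eng", 100), ("eng", 90), ("eng", 80), ("eng", 70),
        ("hr", 60), ("hr", 55), ("hr", 50), ("hr", 40)]
rdd = spark.sparkContext.parallelize(rows)

N = 3

def add_salary(heap, salary):
    # min-heap capped at N elements: O(log N) per row, no global sort
    if len(heap) < N:
        heapq.heappush(heap, salary)
    elif salary > heap[0]:
        heapq.heapreplace(heap, salary)
    return heap

def merge_heaps(left, right):
    for salary in right:
        add_salary(left, salary)
    return left

top_n = rdd.aggregateByKey([], add_salary, merge_heaps)
print(top_n.mapValues(sorted).collect())
# e.g. [('eng', [80, 90, 100]), ('hr', [50, 55, 60])]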
--
Raghavendra


On Mon, Feb 28, 2022 at 12:01 AM Sid  wrote:

> My bad.
>
> Aggregation Query:
>
> # Write your MySQL query statement below
>
>SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
> FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
> WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee
>WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
> ORDER by E.DepartmentId, E.Salary DESC
>
> Time Taken: 1212 ms
>
> Windowing Query:
>
> select Department,Employee,Salary from (
> select d.name as Department, e.name as Employee,e.salary as
> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
> rnk from Department d join Employee e on e.departmentId=d.id ) a where
> rnk<=3
>
> Time Taken: 790 ms
>
> Thanks,
> Sid
>
>
> On Sun, Feb 27, 2022 at 11:35 PM Sean Owen  wrote:
>
>> Those two queries are identical?
>>
>> On Sun, Feb 27, 2022 at 11:30 AM Sid  wrote:
>>
>>> Hi Team,
>>>
>>> I am aware that if windowing functions are used, then at first it loads
>>> the entire dataset into one window, scans and then performs the other
>>> mentioned operations for that particular window which could be slower when
>>> dealing with trillions / billions of records.
>>>
>>> I did a POC where I used an example to find the max 3 highest salary for
>>> an employee per department. So, I wrote the below queries and compared the
>>> times for them:
>>>
>>> Windowing Query:
>>>
>>> select Department,Employee,Salary from (
>>> select d.name as Department, e.name as Employee,e.salary as
>>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>>> rnk<=3
>>>
>>> Time taken: 790 ms
>>>
>>> Aggregation Query:
>>>
>>> select Department,Employee,Salary from (
>>> select d.name as Department, e.name as Employee,e.salary as
>>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>>> rnk<=3
>>>
>>> Time taken: 1212 ms
>>>
>>> But as per my understanding, the aggregation should have run faster. So,
>>> my whole point is if the dataset is huge I should force some kind of map
>>> reduce jobs like we have an option called df.groupby().reduceByGroups()
>>>
>>> So I think the aggregation query is taking more time since the dataset
>>> size here is smaller and as we all know that map reduce works faster when
>>> there is a huge volume of data. Haven't tested it yet on big data but
>>> needed some expert guidance over here.
>>>
>>> Please correct me if I am wrong.
>>>
>>> TIA,
>>> Sid
>>>
>>>
>>>
>>>


Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Mich Talebzadeh
Am I correct that with

.. WHERE (SELECT COUNT(DISTINCT(Salary)) ..

you will have to shuffle because of DISTINCT, as each worker will have to
read data separately and perform the reduce task to get the local
distinct values,
plus one final shuffle to get the actual distinct values
for all the data?
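
One way to check is to look at the physical plan; every Exchange operator
in the output is a shuffle. A small sketch with illustrative data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.appName("distinct-plan").getOrCreate()

emp = spark.createDataFrame(
    [(1, "a", 100), (1, "b", 90), (2, "c", 80)],
    ["DepartmentId", "Name", "Salary"])

# COUNT(DISTINCT ...) is planned as a partial aggregate, an Exchange
# (shuffle) on the grouping keys, and a final aggregate
emp.groupBy("DepartmentId").agg(countDistinct("Salary")).explain()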



   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 27 Feb 2022 at 20:31, Sean Owen  wrote:

> "count distinct' does not have that problem, whether in a group-by or not.
> I'm still not sure these are equivalent queries but maybe not seeing it.
> Windowing makes sense when you need the whole window, or when you need
> sliding windows to express the desired groups.
> It may be unnecessary when your query does not need the window, just a
> summary stat like 'max'. Depends.
>
> On Sun, Feb 27, 2022 at 2:14 PM Bjørn Jørgensen 
> wrote:
>
>> You are using distinct which collects everything to the driver. So use
>> the other one :)
>>
>> On Sun, 27 Feb 2022 at 21:00, Sid  wrote:
>>
>>> Basically, I am trying two different approaches for the same problem and
>>> my concern is how it will behave in the case of big data if you talk about
>>> millions of records. Which one would be faster? Is using windowing
>>> functions a better way since it will load the entire dataset into a single
>>> window and do the operations?
>>>
>>
>>


Re: Issue while creating spark app

2022-02-27 Thread Mich Talebzadeh
Might as well update the artefacts to the correct versions, hopefully.
Downloaded Scala 2.12.8.

 scala -version

Scala code runner version 2.12.8 -- Copyright 2002-2018, LAMP/EPFL and
Lightbend, Inc.

Edited the pom.xml as below


<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>spark</groupId>
    <artifactId>MichTest</artifactId>
    <version>1.0</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.12.8</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.13</artifactId>
            <version>3.2.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.13</artifactId>
            <version>3.2.1</version>
        </dependency>
    </dependencies>
</project>

and built with maven. All good


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 27 Feb 2022 at 20:16, Mich Talebzadeh 
wrote:

> Thanks Bjorn. I am aware of that. I  just really wanted to create the uber
> jar files with both sbt and maven in Intellij.
>
>
> cheers
>
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 27 Feb 2022 at 20:12, Bjørn Jørgensen 
> wrote:
>
>> Mitch: You are using scala 2.11 to do this. Have a look at Building Spark
>>  "Spark
>> requires Scala 2.12/2.13; support for Scala 2.11 was removed in Spark
>> 3.0.0."
>>
>> On Sun, 27 Feb 2022 at 20:55, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> OK I decided to give a try to maven.
>>>
>>> Downloaded maven and unzipped the file WSL-Ubuntu terminal as unzip
>>> apache-maven-3.8.4-bin.zip
>>>
>>> Then added to Windows env variable as MVN_HOME and added the bin
>>> directory to path in windows. Restart intellij to pick up the correct path.
>>>
>>> Again on the command line in intellij do
>>>
>>> *mvn -v*
>>> Apache Maven 3.8.4 (9b656c72d54e5bacbed989b64718c159fe39b537)
>>> Maven home: d:\temp\apache-maven-3.8.4
>>> Java version: 1.8.0_73, vendor: Oracle Corporation, runtime: C:\Program
>>> Files\Java\jdk1.8.0_73\jre
>>> Default locale: en_GB, platform encoding: Cp1252
>>> OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"
>>>
>>> in Intellij add maven support to your project. Follow this link Add
>>> Maven support to an existing project
>>> 
>>>
>>> There will be a pom.xml file under project directory
>>>
>>> [image: image.png]
>>>
>>> Edit that pom.xml file and add the following
>>>
>>> 
>>> http://maven.apache.org/POM/4.0.0";
>>>  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance";
>>>  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
>>> http://maven.apache.org/xsd/maven-4.0.0.xsd";>
>>> 4.0.0
>>>
>>> spark
>>> MichTest
>>> 1.0
>>>
>>> 
>>> 8
>>> 8
>>> 
>>> 
>>> 
>>> org.scala-lang
>>> scala-library
>>> 2.11.7
>>> 
>>> 
>>> org.apache.spark
>>> spark-core_2.10
>>> 2.0.0
>>> 
>>> 
>>> org.apache.spark
>>> spark-sql_2.10
>>> 2.0.0
>>> 
>>> 
>>> 
>>>
>>> In intellij open a Terminal under project sub-directory where the pom
>>> file is created and you edited.
>>>
>>>
>>>  *mvn clean*
>>>
>>> [INFO] Scanning for projects...
>>>
>>> [INFO]
>>>
>>> [INFO] ---< spark:MichTest
>>> >---
>>>
>>> [INFO] Building MichTest 1.0
>>>
>>> [INFO] [ jar
>>> ]-
>>>
>>> [INFO]
>>>
>>> [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ MichTest ---
>>>
>>> [INFO] Deleting D:\temp\intellij\MichTest\target
>>>
>>> [INFO]
>>> 
>>>
>>> [INFO] BUILD SUCCESS
>>>
>>> [INFO]
>>> 
>>>
>>> [INFO] Total time:  4.451 s
>>>
>>> [INFO] Finished at: 2022-02-27T19:37:57Z
>>>
>>> [INFO]
>>> ---

Re: Issue while creating spark app

2022-02-27 Thread Bjørn Jørgensen
Rajat Kumer wrote: "Cannot find project Scala library 2.12.12 for module
SparkSimpleApp"

So when I google this error message I find "scala project maven sync failed".




On Sun, 27 Feb 2022 at 21:59, Mich Talebzadeh <
mich.talebza...@gmail.com> wrote:

> sorry which error?
>
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 27 Feb 2022 at 20:52, Bjørn Jørgensen 
> wrote:
>
>> Anyway I did google on your error and found this scala project maven
>> sync failed 
>> Is this the same as the one you are getting?
>>
>>
>> On Sun, 27 Feb 2022 at 21:16, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks Bjorn. I am aware of that. I  just really wanted to create the
>>> uber jar files with both sbt and maven in Intellij.
>>>
>>>
>>> cheers
>>>
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sun, 27 Feb 2022 at 20:12, Bjørn Jørgensen 
>>> wrote:
>>>
 Mitch: You are using scala 2.11 to do this. Have a look at Building
 Spark  "Spark
 requires Scala 2.12/2.13; support for Scala 2.11 was removed in Spark
 3.0.0."

 On Sun, 27 Feb 2022 at 20:55, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> OK I decided to give a try to maven.
>
> Downloaded maven and unzipped the file WSL-Ubuntu terminal as unzip
> apache-maven-3.8.4-bin.zip
>
> Then added to Windows env variable as MVN_HOME and added the bin
> directory to path in windows. Restart intellij to pick up the correct 
> path.
>
> Again on the command line in intellij do
>
> *mvn -v*
> Apache Maven 3.8.4 (9b656c72d54e5bacbed989b64718c159fe39b537)
> Maven home: d:\temp\apache-maven-3.8.4
> Java version: 1.8.0_73, vendor: Oracle Corporation, runtime:
> C:\Program Files\Java\jdk1.8.0_73\jre
> Default locale: en_GB, platform encoding: Cp1252
> OS name: "windows 10", version: "10.0", arch: "amd64", family:
> "windows"
>
> in Intellij add maven support to your project. Follow this link Add
> Maven support to an existing project
> 
>
> There will be a pom.xml file under project directory
>
> [image: image.png]
>
> Edit that pom.xml file and add the following
>
> 
> http://maven.apache.org/POM/4.0.0";
>  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance";
>  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
> http://maven.apache.org/xsd/maven-4.0.0.xsd";>
> 4.0.0
>
> spark
> MichTest
> 1.0
>
> 
> 8
> 8
> 
> 
> 
> org.scala-lang
> scala-library
> 2.11.7
> 
> 
> org.apache.spark
> spark-core_2.10
> 2.0.0
> 
> 
> org.apache.spark
> spark-sql_2.10
> 2.0.0
> 
> 
> 
>
> In intellij open a Terminal under project sub-directory where the pom
> file is created and you edited.
>
>
>  *mvn clean*
>
> [INFO] Scanning for projects...
>
> [INFO]
>
> [INFO] ---< spark:MichTest
> >---
>
> [INFO] Building MichTest 1.0
>
> [INFO] [ jar
> ]-
>
> [INFO]
>
> [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ MichTest ---
>
> [INFO] Deleting D:\temp\intellij\MichTest\target
>
> [INFO]
> 
>
> [INFO] BUILD SUCCESS
>
> [INFO]
> 
>
> [INFO] Total time:  4.451 s
>
>

Re: Issue while creating spark app

2022-02-27 Thread Mich Talebzadeh
sorry which error?



   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 27 Feb 2022 at 20:52, Bjørn Jørgensen 
wrote:

> Anyway I did google on your error and found this scala project maven sync
> failed 
> Is this the same as the one you are getting?
>
>
> On Sun, 27 Feb 2022 at 21:16, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Thanks Bjorn. I am aware of that. I  just really wanted to create the
>> uber jar files with both sbt and maven in Intellij.
>>
>>
>> cheers
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sun, 27 Feb 2022 at 20:12, Bjørn Jørgensen 
>> wrote:
>>
>>> Mitch: You are using scala 2.11 to do this. Have a look at Building
>>> Spark  "Spark
>>> requires Scala 2.12/2.13; support for Scala 2.11 was removed in Spark
>>> 3.0.0."
>>>
>>> On Sun, 27 Feb 2022 at 20:55, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 OK I decided to give a try to maven.

 Downloaded maven and unzipped the file WSL-Ubuntu terminal as unzip
 apache-maven-3.8.4-bin.zip

 Then added to Windows env variable as MVN_HOME and added the bin
 directory to path in windows. Restart intellij to pick up the correct path.

 Again on the command line in intellij do

 *mvn -v*
 Apache Maven 3.8.4 (9b656c72d54e5bacbed989b64718c159fe39b537)
 Maven home: d:\temp\apache-maven-3.8.4
 Java version: 1.8.0_73, vendor: Oracle Corporation, runtime: C:\Program
 Files\Java\jdk1.8.0_73\jre
 Default locale: en_GB, platform encoding: Cp1252
 OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"

 in Intellij add maven support to your project. Follow this link Add
 Maven support to an existing project
 

 There will be a pom.xml file under project directory

 [image: image.png]

 Edit that pom.xml file and add the following

 
 http://maven.apache.org/POM/4.0.0";
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance";
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
 http://maven.apache.org/xsd/maven-4.0.0.xsd";>
 4.0.0

 spark
 MichTest
 1.0

 
 8
 8
 
 
 
 org.scala-lang
 scala-library
 2.11.7
 
 
 org.apache.spark
 spark-core_2.10
 2.0.0
 
 
 org.apache.spark
 spark-sql_2.10
 2.0.0
 
 
 

 In intellij open a Terminal under project sub-directory where the pom
 file is created and you edited.


  *mvn clean*

 [INFO] Scanning for projects...

 [INFO]

 [INFO] ---< spark:MichTest
 >---

 [INFO] Building MichTest 1.0

 [INFO] [ jar
 ]-

 [INFO]

 [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ MichTest ---

 [INFO] Deleting D:\temp\intellij\MichTest\target

 [INFO]
 

 [INFO] BUILD SUCCESS

 [INFO]
 

 [INFO] Total time:  4.451 s

 [INFO] Finished at: 2022-02-27T19:37:57Z

 [INFO]
 

 *mvn compile*

 [INFO] Scanning for projects...

 [INFO]

 [INFO] ---< spark:MichTest
 >---

 [INFO] Building MichTest 1.0

 [INFO] [ jar
 ]-

 [INFO]

 [INFO] --- m

Re: Issue while creating spark app

2022-02-27 Thread Bjørn Jørgensen
Anyway, I did google your error and found this: "scala project maven sync
failed".
Is this the same as the one you are getting?


On Sun, 27 Feb 2022 at 21:16, Mich Talebzadeh <
mich.talebza...@gmail.com> wrote:

> Thanks Bjorn. I am aware of that. I  just really wanted to create the uber
> jar files with both sbt and maven in Intellij.
>
>
> cheers
>
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 27 Feb 2022 at 20:12, Bjørn Jørgensen 
> wrote:
>
>> Mitch: You are using scala 2.11 to do this. Have a look at Building Spark
>>  "Spark
>> requires Scala 2.12/2.13; support for Scala 2.11 was removed in Spark
>> 3.0.0."
>>
>> On Sun, 27 Feb 2022 at 20:55, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> OK I decided to give a try to maven.
>>>
>>> Downloaded maven and unzipped the file WSL-Ubuntu terminal as unzip
>>> apache-maven-3.8.4-bin.zip
>>>
>>> Then added to Windows env variable as MVN_HOME and added the bin
>>> directory to path in windows. Restart intellij to pick up the correct path.
>>>
>>> Again on the command line in intellij do
>>>
>>> *mvn -v*
>>> Apache Maven 3.8.4 (9b656c72d54e5bacbed989b64718c159fe39b537)
>>> Maven home: d:\temp\apache-maven-3.8.4
>>> Java version: 1.8.0_73, vendor: Oracle Corporation, runtime: C:\Program
>>> Files\Java\jdk1.8.0_73\jre
>>> Default locale: en_GB, platform encoding: Cp1252
>>> OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"
>>>
>>> in Intellij add maven support to your project. Follow this link Add
>>> Maven support to an existing project
>>> 
>>>
>>> There will be a pom.xml file under project directory
>>>
>>> [image: image.png]
>>>
>>> Edit that pom.xml file and add the following
>>>
>>> 
>>> http://maven.apache.org/POM/4.0.0";
>>>  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance";
>>>  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
>>> http://maven.apache.org/xsd/maven-4.0.0.xsd";>
>>> 4.0.0
>>>
>>> spark
>>> MichTest
>>> 1.0
>>>
>>> 
>>> 8
>>> 8
>>> 
>>> 
>>> 
>>> org.scala-lang
>>> scala-library
>>> 2.11.7
>>> 
>>> 
>>> org.apache.spark
>>> spark-core_2.10
>>> 2.0.0
>>> 
>>> 
>>> org.apache.spark
>>> spark-sql_2.10
>>> 2.0.0
>>> 
>>> 
>>> 
>>>
>>> In intellij open a Terminal under project sub-directory where the pom
>>> file is created and you edited.
>>>
>>>
>>>  *mvn clean*
>>>
>>> [INFO] Scanning for projects...
>>>
>>> [INFO]
>>>
>>> [INFO] ---< spark:MichTest
>>> >---
>>>
>>> [INFO] Building MichTest 1.0
>>>
>>> [INFO] [ jar
>>> ]-
>>>
>>> [INFO]
>>>
>>> [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ MichTest ---
>>>
>>> [INFO] Deleting D:\temp\intellij\MichTest\target
>>>
>>> [INFO]
>>> 
>>>
>>> [INFO] BUILD SUCCESS
>>>
>>> [INFO]
>>> 
>>>
>>> [INFO] Total time:  4.451 s
>>>
>>> [INFO] Finished at: 2022-02-27T19:37:57Z
>>>
>>> [INFO]
>>> 
>>>
>>> *mvn compile*
>>>
>>> [INFO] Scanning for projects...
>>>
>>> [INFO]
>>>
>>> [INFO] ---< spark:MichTest
>>> >---
>>>
>>> [INFO] Building MichTest 1.0
>>>
>>> [INFO] [ jar
>>> ]-
>>>
>>> [INFO]
>>>
>>> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @
>>> MichTest ---
>>>
>>> [WARNING] Using platform encoding (Cp1252 actually) to copy filtered
>>> resources, i.e. build is platform dependent!
>>>
>>> [INFO] Copying 0 resource
>>>
>>> [INFO]
>>>
>>> [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @
>>> MichTest ---
>>>
>>> [INFO] Nothing to compile - all classes are up to date
>>>
>>> [INFO]
>>> 
>>>
>>> [INFO] BUILD SUCCESS
>>>
>>> [INFO]
>>> 
>>>
>>> [INFO] Total time:  1.242 s
>>>
>>> [INFO] Finished at: 2022-02-27T19:38:58Z
>>>
>>> [INFO]

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sean Owen
"count distinct' does not have that problem, whether in a group-by or not.
I'm still not sure these are equivalent queries, but maybe I'm not seeing it.
Windowing makes sense when you need the whole window, or when you need
sliding windows to express the desired groups.
It may be unnecessary when your query does not need the window, just a
summary stat like 'max'. Depends.
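
To illustrate the distinction with a small, assumed example in PySpark:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("max-vs-window").getOrCreate()

emp = spark.createDataFrame(
    [("eng", "a", 100), ("eng", "b", 90), ("hr", "c", 80)],
    ["dept", "name", "salary"])

# If only the summary stat is needed, a plain aggregation is enough:
max_only = emp.groupBy("dept").agg(F.max("salary").alias("max_salary"))

# The window form computes the same stat but keeps every input row, which
# only pays off when per-row context (such as a rank) is actually needed:
w = Window.partitionBy("dept")
with_max = emp.withColumn("max_salary", F.max("salary").over(w))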

On Sun, Feb 27, 2022 at 2:14 PM Bjørn Jørgensen 
wrote:

> You are using distinct which collects everything to the driver. So use
> the other one :)
>
> On Sun, 27 Feb 2022 at 21:00, Sid  wrote:
>
>> Basically, I am trying two different approaches for the same problem and
>> my concern is how it will behave in the case of big data if you talk about
>> millions of records. Which one would be faster? Is using windowing
>> functions a better way since it will load the entire dataset into a single
>> window and do the operations?
>>
>
>


Re: Issue while creating spark app

2022-02-27 Thread Mich Talebzadeh
Thanks Bjorn. I am aware of that. I just really wanted to create the uber
jar files with both sbt and maven in Intellij.


cheers



   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 27 Feb 2022 at 20:12, Bjørn Jørgensen 
wrote:

> Mitch: You are using scala 2.11 to do this. Have a look at Building Spark
>  "Spark
> requires Scala 2.12/2.13; support for Scala 2.11 was removed in Spark
> 3.0.0."
>
> On Sun, 27 Feb 2022 at 20:55, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> OK I decided to give a try to maven.
>>
>> Downloaded maven and unzipped the file WSL-Ubuntu terminal as unzip
>> apache-maven-3.8.4-bin.zip
>>
>> Then added to Windows env variable as MVN_HOME and added the bin
>> directory to path in windows. Restart intellij to pick up the correct path.
>>
>> Again on the command line in intellij do
>>
>> *mvn -v*
>> Apache Maven 3.8.4 (9b656c72d54e5bacbed989b64718c159fe39b537)
>> Maven home: d:\temp\apache-maven-3.8.4
>> Java version: 1.8.0_73, vendor: Oracle Corporation, runtime: C:\Program
>> Files\Java\jdk1.8.0_73\jre
>> Default locale: en_GB, platform encoding: Cp1252
>> OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"
>>
>> in Intellij add maven support to your project. Follow this link Add
>> Maven support to an existing project
>> 
>>
>> There will be a pom.xml file under project directory
>>
>> [image: image.png]
>>
>> Edit that pom.xml file and add the following
>>
>> 
>> http://maven.apache.org/POM/4.0.0";
>>  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance";
>>  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
>> http://maven.apache.org/xsd/maven-4.0.0.xsd";>
>> 4.0.0
>>
>> spark
>> MichTest
>> 1.0
>>
>> 
>> 8
>> 8
>> 
>> 
>> 
>> org.scala-lang
>> scala-library
>> 2.11.7
>> 
>> 
>> org.apache.spark
>> spark-core_2.10
>> 2.0.0
>> 
>> 
>> org.apache.spark
>> spark-sql_2.10
>> 2.0.0
>> 
>> 
>> 
>>
>> In intellij open a Terminal under project sub-directory where the pom
>> file is created and you edited.
>>
>>
>>  *mvn clean*
>>
>> [INFO] Scanning for projects...
>>
>> [INFO]
>>
>> [INFO] ---< spark:MichTest
>> >---
>>
>> [INFO] Building MichTest 1.0
>>
>> [INFO] [ jar
>> ]-
>>
>> [INFO]
>>
>> [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ MichTest ---
>>
>> [INFO] Deleting D:\temp\intellij\MichTest\target
>>
>> [INFO]
>> 
>>
>> [INFO] BUILD SUCCESS
>>
>> [INFO]
>> 
>>
>> [INFO] Total time:  4.451 s
>>
>> [INFO] Finished at: 2022-02-27T19:37:57Z
>>
>> [INFO]
>> 
>>
>> *mvn compile*
>>
>> [INFO] Scanning for projects...
>>
>> [INFO]
>>
>> [INFO] ---< spark:MichTest
>> >---
>>
>> [INFO] Building MichTest 1.0
>>
>> [INFO] [ jar
>> ]-
>>
>> [INFO]
>>
>> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @
>> MichTest ---
>>
>> [WARNING] Using platform encoding (Cp1252 actually) to copy filtered
>> resources, i.e. build is platform dependent!
>>
>> [INFO] Copying 0 resource
>>
>> [INFO]
>>
>> [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ MichTest
>> ---
>>
>> [INFO] Nothing to compile - all classes are up to date
>>
>> [INFO]
>> 
>>
>> [INFO] BUILD SUCCESS
>>
>> [INFO]
>> 
>>
>> [INFO] Total time:  1.242 s
>>
>> [INFO] Finished at: 2022-02-27T19:38:58Z
>>
>> [INFO]
>> 
>>
>> Now create the package
>>
>>  *mvn package*
>> [INFO] Scanning for projects...
>> [INFO]
>> [INFO] ---< spark:MichTest
>> >---
>> [INFO] Building MichTest 1.0
>> [INFO] [ jar
>> ]-
>> [INFO]
>> [INFO] --- maven-resources-plugin:2.6:resources (default-resource

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Bjørn Jørgensen
You are using distinct which collects everything to the driver. So use the
other one :)

On Sun, 27 Feb 2022 at 21:00, Sid  wrote:

> Basically, I am trying two different approaches for the same problem and
> my concern is how it will behave in the case of big data if you talk about
> millions of records. Which one would be faster? Is using windowing
> functions a better way since it will load the entire dataset into a single
> window and do the operations?
>
> On Mon, Feb 28, 2022 at 12:26 AM Sean Owen  wrote:
>
>> Those queries look like they do fairly different things. One is selecting
>> top employees by salary, the other is ... selecting where there are less
>> than 3 distinct salaries or something.
>> Not sure what the intended comparison is then; these are not equivalent
>> ways of doing the same thing, or does not seem so as far as I can see.
>>
>> On Sun, Feb 27, 2022 at 12:30 PM Sid  wrote:
>>
>>> My bad.
>>>
>>> Aggregation Query:
>>>
>>> # Write your MySQL query statement below
>>>
>>>SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
>>> FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
>>> WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee
>>>WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
>>> ORDER by E.DepartmentId, E.Salary DESC
>>>
>>> Time Taken: 1212 ms
>>>
>>> Windowing Query:
>>>
>>> select Department,Employee,Salary from (
>>> select d.name as Department, e.name as Employee,e.salary as
>>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>>> rnk<=3
>>>
>>> Time Taken: 790 ms
>>>
>>> Thanks,
>>> Sid
>>>
>>>
>>> On Sun, Feb 27, 2022 at 11:35 PM Sean Owen  wrote:
>>>
 Those two queries are identical?

 On Sun, Feb 27, 2022 at 11:30 AM Sid  wrote:

> Hi Team,
>
> I am aware that if windowing functions are used, then at first it
> loads the entire dataset into one window, scans and then performs the other
> mentioned operations for that particular window which could be slower when
> dealing with trillions / billions of records.
>
> I did a POC where I used an example to find the max 3 highest salary
> for an employee per department. So, I wrote the below queries and compared
> the times for them:
>
> Windowing Query:
>
> select Department,Employee,Salary from (
> select d.name as Department, e.name as Employee,e.salary as
> Salary,dense_rank() over(partition by d.name order by e.salary desc)
> as rnk from Department d join Employee e on e.departmentId=d.id ) a
> where rnk<=3
>
> Time taken: 790 ms
>
> Aggregation Query:
>
> select Department,Employee,Salary from (
> select d.name as Department, e.name as Employee,e.salary as
> Salary,dense_rank() over(partition by d.name order by e.salary desc)
> as rnk from Department d join Employee e on e.departmentId=d.id ) a
> where rnk<=3
>
> Time taken: 1212 ms
>
> But as per my understanding, the aggregation should have run faster.
> So, my whole point is if the dataset is huge I should force some kind of
> map reduce jobs like we have an option called 
> df.groupby().reduceByGroups()
>
> So I think the aggregation query is taking more time since the dataset
> size here is smaller and as we all know that map reduce works faster when
> there is a huge volume of data. Haven't tested it yet on big data but
> needed some expert guidance over here.
>
> Please correct me if I am wrong.
>
> TIA,
> Sid
>
>
>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Issue while creating spark app

2022-02-27 Thread Bjørn Jørgensen
Mich: You are using Scala 2.11 to do this. Have a look at Building Spark:
"Spark requires Scala 2.12/2.13; support for Scala 2.11 was removed in
Spark 3.0.0."

On Sun, 27 Feb 2022 at 20:55, Mich Talebzadeh <
mich.talebza...@gmail.com> wrote:

> OK I decided to give a try to maven.
>
> Downloaded maven and unzipped the file in a WSL-Ubuntu terminal as unzip
> apache-maven-3.8.4-bin.zip
>
> Then added to Windows env variable as MVN_HOME and added the bin directory
> to path in windows. Restart intellij to pick up the correct path.
>
> Again on the command line in intellij do
>
> *mvn -v*
> Apache Maven 3.8.4 (9b656c72d54e5bacbed989b64718c159fe39b537)
> Maven home: d:\temp\apache-maven-3.8.4
> Java version: 1.8.0_73, vendor: Oracle Corporation, runtime: C:\Program
> Files\Java\jdk1.8.0_73\jre
> Default locale: en_GB, platform encoding: Cp1252
> OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"
>
> in Intellij add maven support to your project. Follow this link Add Maven
> support to an existing project
> 
>
> There will be a pom.xml file under project directory
>
> [image: image.png]
>
> Edit that pom.xml file and add the following
>
> 
> <project xmlns="http://maven.apache.org/POM/4.0.0"
>          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
> http://maven.apache.org/xsd/maven-4.0.0.xsd">
>     <modelVersion>4.0.0</modelVersion>
>
>     <groupId>spark</groupId>
>     <artifactId>MichTest</artifactId>
>     <version>1.0</version>
>
>     <properties>
>         <maven.compiler.source>8</maven.compiler.source>
>         <maven.compiler.target>8</maven.compiler.target>
>     </properties>
>     <dependencies>
>         <dependency>
>             <groupId>org.scala-lang</groupId>
>             <artifactId>scala-library</artifactId>
>             <version>2.11.7</version>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.spark</groupId>
>             <artifactId>spark-core_2.10</artifactId>
>             <version>2.0.0</version>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.spark</groupId>
>             <artifactId>spark-sql_2.10</artifactId>
>             <version>2.0.0</version>
>         </dependency>
>     </dependencies>
> </project>
>
> In intellij open a Terminal under project sub-directory where the pom file
> is created and you edited.
>
>
>  *mvn clean*
>
> [INFO] Scanning for projects...
>
> [INFO]
>
> [INFO] ---< spark:MichTest
> >---
>
> [INFO] Building MichTest 1.0
>
> [INFO] [ jar
> ]-
>
> [INFO]
>
> [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ MichTest ---
>
> [INFO] Deleting D:\temp\intellij\MichTest\target
>
> [INFO]
> 
>
> [INFO] BUILD SUCCESS
>
> [INFO]
> 
>
> [INFO] Total time:  4.451 s
>
> [INFO] Finished at: 2022-02-27T19:37:57Z
>
> [INFO]
> 
>
> *mvn compile*
>
> [INFO] Scanning for projects...
>
> [INFO]
>
> [INFO] ---< spark:MichTest
> >---
>
> [INFO] Building MichTest 1.0
>
> [INFO] [ jar
> ]-
>
> [INFO]
>
> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @
> MichTest ---
>
> [WARNING] Using platform encoding (Cp1252 actually) to copy filtered
> resources, i.e. build is platform dependent!
>
> [INFO] Copying 0 resource
>
> [INFO]
>
> [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ MichTest
> ---
>
> [INFO] Nothing to compile - all classes are up to date
>
> [INFO]
> 
>
> [INFO] BUILD SUCCESS
>
> [INFO]
> 
>
> [INFO] Total time:  1.242 s
>
> [INFO] Finished at: 2022-02-27T19:38:58Z
>
> [INFO]
> 
>
> Now create the package
>
>  *mvn package*
> [INFO] Scanning for projects...
> [INFO]
> [INFO] ---< spark:MichTest
> >---
> [INFO] Building MichTest 1.0
> [INFO] [ jar
> ]-
> [INFO]
> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @
> MichTest ---
> [WARNING] Using platform encoding (Cp1252 actually) to copy filtered
> resources, i.e. build is platform dependent!
> [INFO] Copying 0 resource
> [INFO]
> [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ MichTest
> ---
> [INFO] Nothing to compile - all classes are up to date
> [INFO]
> [INFO] --- maven-resources-plugin:2.6:testResources
> (default-testResources) @ MichTest ---
> [WARNING] Using platform encoding (Cp1252 actually) to copy filtered
> resources, i.e. build is platform dependent!
> [INFO] skip non existing resourceDirectory
> D:\temp\intellij\MichTest\src\test\resources
> [INFO]
> [INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @
> MichTest ---
> [INFO] Nothing to compile - all classes are up to date
> [INFO]
> [INFO] --- maven-surefire-plugin:2.12.4

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sid
Basically, I am trying two different approaches for the same problem and my
concern is how it will behave in the case of big data if you talk about
millions of records. Which one would be faster? Is using windowing
functions a better way since it will load the entire dataset into a single
window and do the operations?

On Mon, Feb 28, 2022 at 12:26 AM Sean Owen  wrote:

> Those queries look like they do fairly different things. One is selecting
> top employees by salary, the other is ... selecting where there are less
> than 3 distinct salaries or something.
> Not sure what the intended comparison is then; these are not equivalent
> ways of doing the same thing, or does not seem so as far as I can see.
>
> On Sun, Feb 27, 2022 at 12:30 PM Sid  wrote:
>
>> My bad.
>>
>> Aggregation Query:
>>
>> # Write your MySQL query statement below
>>
>>SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
>> FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
>> WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee
>>WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
>> ORDER by E.DepartmentId, E.Salary DESC
>>
>> Time Taken: 1212 ms
>>
>> Windowing Query:
>>
>> select Department,Employee,Salary from (
>> select d.name as Department, e.name as Employee,e.salary as
>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>> rnk<=3
>>
>> Time Taken: 790 ms
>>
>> Thanks,
>> Sid
>>
>>
>> On Sun, Feb 27, 2022 at 11:35 PM Sean Owen  wrote:
>>
>>> Those two queries are identical?
>>>
>>> On Sun, Feb 27, 2022 at 11:30 AM Sid  wrote:
>>>
 Hi Team,

 I am aware that if windowing functions are used, then at first it loads
 the entire dataset into one window, scans and then performs the other
 mentioned operations for that particular window which could be slower when
 dealing with trillions / billions of records.

 I did a POC where I used an example to find the max 3 highest salary
 for an employee per department. So, I wrote the below queries and compared
 the times for them:

 Windowing Query:

 select Department,Employee,Salary from (
 select d.name as Department, e.name as Employee,e.salary as
 Salary,dense_rank() over(partition by d.name order by e.salary desc)
 as rnk from Department d join Employee e on e.departmentId=d.id ) a
 where rnk<=3

 Time taken: 790 ms

 Aggregation Query:

 select Department,Employee,Salary from (
 select d.name as Department, e.name as Employee,e.salary as
 Salary,dense_rank() over(partition by d.name order by e.salary desc)
 as rnk from Department d join Employee e on e.departmentId=d.id ) a
 where rnk<=3

 Time taken: 1212 ms

 But as per my understanding, the aggregation should have run faster.
 So, my whole point is if the dataset is huge I should force some kind of
 map reduce jobs like we have an option called df.groupby().reduceByGroups()

 So I think the aggregation query is taking more time since the dataset
 size here is smaller and as we all know that map reduce works faster when
 there is a huge volume of data. Haven't tested it yet on big data but
 needed some expert guidance over here.

 Please correct me if I am wrong.

 TIA,
 Sid






Re: Issue while creating spark app

2022-02-27 Thread Mich Talebzadeh
OK I decided to give a try to maven.

Downloaded maven and unzipped the file in a WSL-Ubuntu terminal as unzip
apache-maven-3.8.4-bin.zip

Then added a Windows env variable MVN_HOME and added the bin directory
to the path in Windows. Restart intellij to pick up the correct path.

Again on the command line in intellij do

*mvn -v*
Apache Maven 3.8.4 (9b656c72d54e5bacbed989b64718c159fe39b537)
Maven home: d:\temp\apache-maven-3.8.4
Java version: 1.8.0_73, vendor: Oracle Corporation, runtime: C:\Program
Files\Java\jdk1.8.0_73\jre
Default locale: en_GB, platform encoding: Cp1252
OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"

in Intellij add maven support to your project. Follow this link Add Maven
support to an existing project


There will be a pom.xml file under project directory

[image: image.png]

Edit that pom.xml file and add the following


<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>spark</groupId>
    <artifactId>MichTest</artifactId>
    <version>1.0</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.7</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>2.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>2.0.0</version>
        </dependency>
    </dependencies>
</project>

In intellij open a Terminal under project sub-directory where the pom file
is created and you edited.


 *mvn clean*

[INFO] Scanning for projects...

[INFO]

[INFO] ---< spark:MichTest
>---

[INFO] Building MichTest 1.0

[INFO] [ jar
]-

[INFO]

[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ MichTest ---

[INFO] Deleting D:\temp\intellij\MichTest\target

[INFO]


[INFO] BUILD SUCCESS

[INFO]


[INFO] Total time:  4.451 s

[INFO] Finished at: 2022-02-27T19:37:57Z

[INFO]


*mvn compile*

[INFO] Scanning for projects...

[INFO]

[INFO] ---< spark:MichTest
>---

[INFO] Building MichTest 1.0

[INFO] [ jar
]-

[INFO]

[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @
MichTest ---

[WARNING] Using platform encoding (Cp1252 actually) to copy filtered
resources, i.e. build is platform dependent!

[INFO] Copying 0 resource

[INFO]

[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ MichTest
---

[INFO] Nothing to compile - all classes are up to date

[INFO]


[INFO] BUILD SUCCESS

[INFO]


[INFO] Total time:  1.242 s

[INFO] Finished at: 2022-02-27T19:38:58Z

[INFO]


Now create the package

 *mvn package*
[INFO] Scanning for projects...
[INFO]
[INFO] ---< spark:MichTest
>---
[INFO] Building MichTest 1.0
[INFO] [ jar
]-
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @
MichTest ---
[WARNING] Using platform encoding (Cp1252 actually) to copy filtered
resources, i.e. build is platform dependent!
[INFO] Copying 0 resource
[INFO]
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ MichTest
---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources)
@ MichTest ---
[WARNING] Using platform encoding (Cp1252 actually) to copy filtered
resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory
D:\temp\intellij\MichTest\src\test\resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @
MichTest ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- maven-surefire-plugin:2.12.4:test (default-test) @ MichTest ---
[INFO] No tests to run.
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ MichTest ---
[INFO] *Building jar:
D:\temp\intellij\MichTest\target\scala-2.11\MichTest-1.0.jar*
[INFO]

[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time:  1.511 s
[INFO] Finished at: 2022-02-27T19:40:22Z
[INFO]


sbt should work from the same directory as well.

I find it easier

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sid
Hi Enrico,

Thanks for your time :)

Consider a huge data volume scenario. If I don't use any keywords like
distinct, which one would be faster? Window with partitionBy or normal SQL
aggregation methods? And how does df.groupBy().reduceByGroups() work
internally?

Thanks,
Sid

On Mon, Feb 28, 2022 at 12:59 AM Enrico Minack 
wrote:

> Sid,
>
> Your Aggregation Query selects all employees where less than three
> distinct salaries exist that are larger. So, both queries seem to do the
> same.
>
> The Windowing Query is explicit in what it does: give me the rank for
> salaries per department in the given order and pick the top 3 per
> department.
>
> The Aggregation Query is trying to get to this conclusion by constructing
> some comparison. The former is the better approach, the second scales badly
> as this is done by counting distinct salaries that are larger than each
> salary in E. This looks like a Cartesian product of Employees. You make
> this very hard to optimize or execute by the query engine.
>
> And as you say, your example is very small, so this will not give any
> insights into big data.
>
> Enrico
>
>
> Am 27.02.22 um 19:30 schrieb Sid:
>
> My bad.
>
> Aggregation Query:
>
> # Write your MySQL query statement below
>
>SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
> FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
> WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee
>WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
> ORDER by E.DepartmentId, E.Salary DESC
>
> Time Taken: 1212 ms
>
> Windowing Query:
>
> select Department,Employee,Salary from (
> select d.name as Department, e.name as Employee,e.salary as
> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
> rnk from Department d join Employee e on e.departmentId=d.id ) a where
> rnk<=3
>
> Time Taken: 790 ms
>
> Thanks,
> Sid
>
>
> On Sun, Feb 27, 2022 at 11:35 PM Sean Owen  wrote:
>
>> Those two queries are identical?
>>
>> On Sun, Feb 27, 2022 at 11:30 AM Sid  wrote:
>>
>>> Hi Team,
>>>
>>> I am aware that if windowing functions are used, then at first it loads
>>> the entire dataset into one window,scans and then performs the other
>>> mentioned operations for that particular window which could be slower when
>>> dealing with trillions / billions of records.
>>>
>>> I did a POC where I used an example to find the max 3 highest salary for
>>> an employee per department. So, I wrote a below queries and compared the
>>> time for it:
>>>
>>> Windowing Query:
>>>
>>> select Department,Employee,Salary from (
>>> select d.name as Department, e.name as Employee,e.salary as
>>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>>> rnk<=3
>>>
>>> Time taken: 790 ms
>>>
>>> Aggregation Query:
>>>
>>> select Department,Employee,Salary from (
>>> select d.name as Department, e.name as Employee,e.salary as
>>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>>> rnk<=3
>>>
>>> Time taken: 1212 ms
>>>
>>> But as per my understanding, the aggregation should have run faster. So,
>>> my whole point is if the dataset is huge I should force some kind of map
>>> reduce jobs like we have an option called df.groupby().reduceByGroups()
>>>
>>> So I think the aggregation query is taking more time since the dataset
>>> size here is smaller and as we all know that map reduce works faster when
>>> there is a huge volume of data. Haven't tested it yet on big data but
>>> needed some expert guidance over here.
>>>
>>> Please correct me if I am wrong.
>>>
>>> TIA,
>>> Sid
>>>
>>>
>>>
>>>
>


Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Enrico Minack

Sid,

Your Aggregation Query selects all employees for whom fewer than three larger distinct salaries exist. So both queries seem to do the same thing.


The Windowing Query is explicit in what it does: give me the rank for 
salaries per department in the given order and pick the top 3 per 
department.


The Aggregation Query tries to reach the same conclusion by constructing a comparison. The former is the better approach; the latter scales badly, because it counts the distinct salaries larger than each salary in E. This amounts to a Cartesian product of Employees, which makes the query very hard for the engine to optimize or execute.


And as you say, your example is very small, so this will not give any 
insights into big data.


Enrico
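
To make the scaling concern concrete, here is a minimal sketch (table and column names assumed as in the thread, data invented) that expresses the "count distinct larger salaries" logic of the Aggregation Query as the explicit non-equi self-join it amounts to:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AggregationScaling extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("AggregationScaling")
    .getOrCreate()
  import spark.implicits._

  val employee = Seq(
    (1, "Joe", 85000, 1), (2, "Ann", 90000, 1),
    (3, "Bob", 70000, 1), (4, "Eve", 60000, 2)
  ).toDF("id", "name", "salary", "departmentId")

  // Every employee is compared with every other employee of the same
  // department: a non-equi self-join whose intermediate size can approach
  // |E|^2 per department, versus the single sort a window rank needs.
  val top3 = employee.as("x")
    .join(employee.as("y"),
      $"x.departmentId" === $"y.departmentId" && $"y.salary" > $"x.salary",
      "left")
    .groupBy($"x.id", $"x.name", $"x.salary", $"x.departmentId")
    .agg(countDistinct($"y.salary").as("largerSalaries"))
    .filter($"largerSalaries" < 3)

  top3.show()
}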


Am 27.02.22 um 19:30 schrieb Sid:

My bad.

Aggregation Query:

# Write your MySQL query statement below

   SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee
       WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
ORDER by E.DepartmentId, E.Salary DESC

Time Taken: 1212 ms

Windowing Query:

select Department,Employee,Salary from (
select d.name as Department, e.name as Employee,e.salary as Salary,
dense_rank() over(partition by d.name order by e.salary desc) as rnk
from Department d join Employee e on e.departmentId=d.id ) a where rnk<=3


Time Taken: 790 ms

Thanks,
Sid


On Sun, Feb 27, 2022 at 11:35 PM Sean Owen  wrote:

Those two queries are identical?

On Sun, Feb 27, 2022 at 11:30 AM Sid  wrote:

Hi Team,

I am aware that if windowing functions are used, then at first
it loads the entire dataset into one window,scans and then
performs the other mentioned operations for that particular
window which could be slower when dealing with trillions /
billions of records.

I did a POC where I used an example to find the max 3 highest
salary for an employee per department. So, I wrote a below
queries and compared the time for it:

Windowing Query:

select Department,Employee,Salary from (
select d.name as Department, e.name as Employee,e.salary as Salary,
dense_rank() over(partition by d.name order by e.salary desc) as rnk
from Department d join Employee e on e.departmentId=d.id ) a where rnk<=3

Time taken: 790 ms

Aggregation Query:

select Department,Employee,Salary from (
select d.name as Department, e.name as Employee,e.salary as Salary,
dense_rank() over(partition by d.name order by e.salary desc) as rnk
from Department d join Employee e on e.departmentId=d.id ) a where rnk<=3

Time taken: 1212 ms

But as per my understanding, the aggregation should have run
faster. So, my whole point is if the dataset is huge I should
force some kind of map reduce jobs like we have an option
called df.groupby().reduceByGroups()

So I think the aggregation query is taking more time since the
dataset size here is smaller and as we all know that map
reduce works faster when there is a huge volume of data.
Haven't tested it yet on big data but needed some expert
guidance over here.

Please correct me if I am wrong.

TIA,
Sid




Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sean Owen
Those queries look like they do fairly different things. One is selecting
top employees by salary; the other is ... selecting where there are fewer
than 3 distinct larger salaries, or something.
Not sure what the intended comparison is then; these are not equivalent
ways of doing the same thing, or at least it does not seem so as far as I can see.

On Sun, Feb 27, 2022 at 12:30 PM Sid  wrote:

> My bad.
>
> Aggregation Query:
>
> # Write your MySQL query statement below
>
>SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
> FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
> WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee
>WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
> ORDER by E.DepartmentId, E.Salary DESC
>
> Time Taken: 1212 ms
>
> Windowing Query:
>
> select Department,Employee,Salary from (
> select d.name as Department, e.name as Employee,e.salary as
> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
> rnk from Department d join Employee e on e.departmentId=d.id ) a where
> rnk<=3
>
> Time Taken: 790 ms
>
> Thanks,
> Sid
>
>
> On Sun, Feb 27, 2022 at 11:35 PM Sean Owen  wrote:
>
>> Those two queries are identical?
>>
>> On Sun, Feb 27, 2022 at 11:30 AM Sid  wrote:
>>
>>> Hi Team,
>>>
>>> I am aware that if windowing functions are used, then at first it loads
>>> the entire dataset into one window,scans and then performs the other
>>> mentioned operations for that particular window which could be slower when
>>> dealing with trillions / billions of records.
>>>
>>> I did a POC where I used an example to find the max 3 highest salary for
>>> an employee per department. So, I wrote a below queries and compared the
>>> time for it:
>>>
>>> Windowing Query:
>>>
>>> select Department,Employee,Salary from (
>>> select d.name as Department, e.name as Employee,e.salary as
>>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>>> rnk<=3
>>>
>>> Time taken: 790 ms
>>>
>>> Aggregation Query:
>>>
>>> select Department,Employee,Salary from (
>>> select d.name as Department, e.name as Employee,e.salary as
>>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>>> rnk<=3
>>>
>>> Time taken: 1212 ms
>>>
>>> But as per my understanding, the aggregation should have run faster. So,
>>> my whole point is if the dataset is huge I should force some kind of map
>>> reduce jobs like we have an option called df.groupby().reduceByGroups()
>>>
>>> So I think the aggregation query is taking more time since the dataset
>>> size here is smaller and as we all know that map reduce works faster when
>>> there is a huge volume of data. Haven't tested it yet on big data but
>>> needed some expert guidance over here.
>>>
>>> Please correct me if I am wrong.
>>>
>>> TIA,
>>> Sid
>>>
>>>
>>>
>>>


Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sid
My bad.

Aggregation Query:

# Write your MySQL query statement below

   SELECT D.Name AS Department, E.Name AS Employee, E.Salary AS Salary
FROM Employee E INNER JOIN Department D ON E.DepartmentId = D.Id
WHERE (SELECT COUNT(DISTINCT(Salary)) FROM Employee
   WHERE DepartmentId = E.DepartmentId AND Salary > E.Salary) < 3
ORDER by E.DepartmentId, E.Salary DESC

Time Taken: 1212 ms

Windowing Query:

select Department,Employee,Salary from (
select d.name as Department, e.name as Employee,e.salary as
Salary,dense_rank() over(partition by d.name order by e.salary desc) as rnk
from Department d join Employee e on e.departmentId=d.id ) a where rnk<=3

Time Taken: 790 ms

Thanks,
Sid


On Sun, Feb 27, 2022 at 11:35 PM Sean Owen  wrote:

> Those two queries are identical?
>
> On Sun, Feb 27, 2022 at 11:30 AM Sid  wrote:
>
>> Hi Team,
>>
>> I am aware that if windowing functions are used, then at first it loads
>> the entire dataset into one window,scans and then performs the other
>> mentioned operations for that particular window which could be slower when
>> dealing with trillions / billions of records.
>>
>> I did a POC where I used an example to find the max 3 highest salary for
>> an employee per department. So, I wrote a below queries and compared the
>> time for it:
>>
>> Windowing Query:
>>
>> select Department,Employee,Salary from (
>> select d.name as Department, e.name as Employee,e.salary as
>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>> rnk<=3
>>
>> Time taken: 790 ms
>>
>> Aggregation Query:
>>
>> select Department,Employee,Salary from (
>> select d.name as Department, e.name as Employee,e.salary as
>> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
>> rnk from Department d join Employee e on e.departmentId=d.id ) a where
>> rnk<=3
>>
>> Time taken: 1212 ms
>>
>> But as per my understanding, the aggregation should have run faster. So,
>> my whole point is if the dataset is huge I should force some kind of map
>> reduce jobs like we have an option called df.groupby().reduceByGroups()
>>
>> So I think the aggregation query is taking more time since the dataset
>> size here is smaller and as we all know that map reduce works faster when
>> there is a huge volume of data. Haven't tested it yet on big data but
>> needed some expert guidance over here.
>>
>> Please correct me if I am wrong.
>>
>> TIA,
>> Sid
>>
>>
>>
>>


Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sean Owen
Those two queries are identical?

On Sun, Feb 27, 2022 at 11:30 AM Sid  wrote:

> Hi Team,
>
> I am aware that if windowing functions are used, then at first it loads
> the entire dataset into one window,scans and then performs the other
> mentioned operations for that particular window which could be slower when
> dealing with trillions / billions of records.
>
> I did a POC where I used an example to find the max 3 highest salary for
> an employee per department. So, I wrote a below queries and compared the
> time for it:
>
> Windowing Query:
>
> select Department,Employee,Salary from (
> select d.name as Department, e.name as Employee,e.salary as
> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
> rnk from Department d join Employee e on e.departmentId=d.id ) a where
> rnk<=3
>
> Time taken: 790 ms
>
> Aggregation Query:
>
> select Department,Employee,Salary from (
> select d.name as Department, e.name as Employee,e.salary as
> Salary,dense_rank() over(partition by d.name order by e.salary desc) as
> rnk from Department d join Employee e on e.departmentId=d.id ) a where
> rnk<=3
>
> Time taken: 1212 ms
>
> But as per my understanding, the aggregation should have run faster. So,
> my whole point is if the dataset is huge I should force some kind of map
> reduce jobs like we have an option called df.groupby().reduceByGroups()
>
> So I think the aggregation query is taking more time since the dataset
> size here is smaller and as we all know that map reduce works faster when
> there is a huge volume of data. Haven't tested it yet on big data but
> needed some expert guidance over here.
>
> Please correct me if I am wrong.
>
> TIA,
> Sid
>
>
>
>


Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Sid
Hi Team,

I am aware that when windowing functions are used, the entire dataset is
first loaded into a window, scanned, and then the other operations are
performed within that window, which could be slow when dealing with
trillions / billions of records.

I did a POC using an example that finds the top 3 highest salaries per
department. I wrote the queries below and compared their execution times:

Windowing Query:

select Department,Employee,Salary from (
select d.name as Department, e.name as Employee,e.salary as
Salary,dense_rank() over(partition by d.name order by e.salary desc) as rnk
from Department d join Employee e on e.departmentId=d.id ) a where rnk<=3

Time taken: 790 ms

Aggregation Query:

select Department,Employee,Salary from (
select d.name as Department, e.name as Employee,e.salary as
Salary,dense_rank() over(partition by d.name order by e.salary desc) as rnk
from Department d join Employee e on e.departmentId=d.id ) a where rnk<=3

Time taken: 1212 ms

But as per my understanding, the aggregation should have run faster. So my
whole point is: if the dataset is huge, should I force some kind of
map-reduce-style job, for example with df.groupByKey().reduceGroups()?

So I think the aggregation query takes more time here because the dataset
is small, and map-reduce-style execution pays off only when there is a huge
volume of data. I haven't tested this on big data yet, but I need some
expert guidance here.

Please correct me if I am wrong.

TIA,
Sid


Re: Issue while creating spark app

2022-02-27 Thread Mich Talebzadeh
Got curious about this IntelliJ stuff.

I recall using sbt rather than Maven, so go to the terminal in your IntelliJ
and verify what is installed:

 sbt -version
sbt version in this project: 1.3.4
sbt script version: 1.3.4

 scala -version

Scala code runner version 2.11.7 -- Copyright 2002-2013, LAMP/EPFL

java --version
openjdk 11.0.7 2020-04-14
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.7+10)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.7+10, mixed mode)

For now, in the directory where you have Main.scala, create a build.sbt file:

// The simplest possible sbt build file is just one line:

scalaVersion := "2.11.7"
// That is, to create a valid sbt build, all you've got to do is define the
// version of Scala you'd like your project to use.

// To learn more about multi-project builds, head over to the official sbt
// documentation at http://www.scala-sbt.org/documentation.html
libraryDependencies += "org.scala-lang.modules" %%
"scala-parser-combinators" % "1.1.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0"


This is my Main.scala file, a copy from SparkByExamples:



package org.example
import org.apache.spark.sql.SparkSession

object SparkSessionTest extends App {
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExample")
    .getOrCreate()

  println("First SparkContext:")
  println("APP Name :" + spark.sparkContext.appName)
  println("Deploy Mode :" + spark.sparkContext.deployMode)
  println("Master :" + spark.sparkContext.master)

  // getOrCreate() returns the already-running session here, so no second
  // SparkContext is created and the first appName is kept.
  val sparkSession2 = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExample-test")
    .getOrCreate()

  println("Second SparkContext:")
  println("APP Name :" + sparkSession2.sparkContext.appName)
  println("Deploy Mode :" + sparkSession2.sparkContext.deployMode)
  println("Master :" + sparkSession2.sparkContext.master)
}

Go back to the terminal in the directory where you have both files, build.sbt
and Main.scala.


sbt clean

[info] Loading global plugins from C:\Users\admin\.sbt\1.0\plugins

[info] Loading project definition from
D:\temp\intellij\MichTest\src\main\scala\com\ctp\training\scala\project

[info] Loading settings for project scala from build.sbt ...

[info] Set current project to MichTest (in build
file:/D:/temp/intellij/MichTest/src/main/scala/com/ctp/training/scala/)

[success] Total time: 0 s, completed Feb 27, 2022 9:54:10 AM

sbt compile
[info] Loading global plugins from C:\Users\admin\.sbt\1.0\plugins
[info] Loading project definition from
D:\temp\intellij\MichTest\src\main\scala\com\ctp\training\scala\project
[info] Loading settings for project scala from build.sbt ...
[info] Set current project to MichTest (in build
file:/D:/temp/intellij/MichTest/src/main/scala/com/ctp/training/scala/)
[info] Executing in batch mode. For better performance use sbt's shell
[warn] There may be incompatibilities among your library dependencies; run
'evicted' to see detailed eviction warnings.
[info] Compiling 1 Scala source to
D:\temp\intellij\MichTest\src\main\scala\com\ctp\training\scala\target\scala-2.11\classes
...
[success] Total time: 5 s, completed Feb 27, 2022 9:55:10 AM

sbt package

[info] Loading global plugins from C:\Users\admin\.sbt\1.0\plugins

[info] Loading project definition from
D:\temp\intellij\MichTest\src\main\scala\com\ctp\training\scala\project

[info] Loading settings for project scala from build.sbt ...

[info] Set current project to MichTest (in build
file:/D:/temp/intellij/MichTest/src/main/scala/com/ctp/training/scala/)

[success] Total time: 1 s, completed Feb 27, 2022 9:56:48 AM

ls

    Directory: D:\temp\intellij\MichTest\src\main\scala\com\ctp\training\scala

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----         2/27/2022   7:04 AM                null
d-----         2/27/2022   8:33 AM                project
d-----         2/27/2022   9:08 AM                spark-warehouse
d-----         2/27/2022   9:55 AM                target
-a----         2/27/2022   9:17 AM           3511 build.sbt

Note that you now have a target directory, with scala-2.11 underneath (in my
case), containing the jar file michtest_2.11-1.0.jar (a plain jar; sbt
package does not bundle dependencies).

ls

    Directory: D:\temp\intellij\MichTest\src\main\scala\com\ctp\training\scala\target\scala-2.11

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----         2/27/2022   9:55 AM                classes
d-----         2/27/2022   9:55 AM                update
-a----         2/27/2022   9:56 AM           3938 michtest_2.11-1.0.jar


This is old stuff, but it still shows how to create a jar file with sbt.
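
To run the packaged jar, something like the following should work (assuming spark-submit is on the PATH; the class name comes from the Main.scala above, which sets its own master):

spark-submit --class org.example.SparkSessionTest D:\temp\intellij\MichTest\src\main\scala\com\ctp\training\scala\target\scala-2.11\michtest_2.11-1.0.jar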

HTH

