Spark SQL readSideCharPadding issue while reading ENUM column from mysql

2024-09-21 Thread Suyash Ajmera
I have upgraded my Spark job from Spark 3.3.1 to Spark 3.5.0. I am querying
a MySQL database and applying

`UPPER(col) = UPPER(value)` in the subsequent SQL query. It works as
expected in Spark 3.3.1, but not with 3.5.0.

Where condition: `UPPER(vn) = 'ERICSSON' AND (upper(st) = 'OPEN' OR
upper(st) = 'REOPEN' OR upper(st) = 'CLOSED')`

The *st* column is an ENUM in the database, and it is causing the issue.

Below is the physical plan of the *FILTER* phase:

For 3.3.1 :

+- Filter ((upper(vn#11) = ERICSSON) AND (((upper(st#42) = OPEN) OR
(upper(st#42) = REOPEN)) OR (upper(st#42) = CLOSED)))

For 3.5.0 :

+- Filter ((upper(vn#11) = ERICSSON) AND (((upper(staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true)) = OPEN) OR
(upper(staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true)) = REOPEN)) OR
(upper(staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true)) = CLOSED)))

-

I have debugged it and found that Spark added a property in version 3.4.0,
i.e. **spark.sql.readSideCharPadding**, which has the default value **true**.

Link to the JIRA : https://issues.apache.org/jira/browse/SPARK-40697

It added a new method to the class **CharVarcharCodegenUtils**:

public static UTF8String readSidePadding(UTF8String inputStr, int limit) {
  int numChars = inputStr.numChars();
  if (numChars == limit) {
    return inputStr;
  } else if (numChars < limit) {
    return inputStr.rpad(limit, SPACE);
  } else {
    return inputStr;
  }
}


**This method appends whitespace padding to the ENUM values while
reading, which is causing the issue.**
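
For intuition, the effect can be reproduced directly in PySpark (a minimal sketch;
the CHAR length 13 matches the `readSidePadding(..., 13, ...)` calls in the plan above):

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 'OPEN' (4 chars) is right-padded to the declared CHAR length of 13 on read,
# so a comparison against the unpadded literal 'OPEN' no longer matches.
(spark.range(1)
 .select(F.rpad(F.lit("OPEN"), 13, " ").alias("st_padded"))
 .select("st_padded", (F.col("st_padded") == "OPEN").alias("matches"))
 .show(truncate=False))
```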

---

When I remove the UPPER function from the where condition, the
**FILTER** phase looks like this:

 +- Filter (((staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils,
 StringType, readSidePadding, st#42, 13, true, false, true) = OPEN
) OR (staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true) = REOPEN   )) OR
(staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true) = CLOSED   ))


**You can see it has added some whitespace after the value, and the query
runs fine, giving the correct result.**

But with the UPPER function I am not getting the data.

--

I have also tried disabling this property (*spark.sql.readSideCharPadding =
false*) with the following cases:

1. With the UPPER function in the where clause:
   It does not push the filters to the database and the *query works fine*.

  +- Filter (((upper(st#42) = OPEN) OR (upper(st#42) = REOPEN)) OR
(upper(st#42) = CLOSED))

2. But when I remove the UPPER function:

 *It pushes the filter to MySQL with the whitespace and I am not
getting the data. (THIS IS CAUSING A VERY BIG ISSUE)*

  PushedFilters: [*IsNotNull(vn), *EqualTo(vn,ERICSSON),
*Or(Or(EqualTo(st,OPEN ),EqualTo(st,REOPEN
)),EqualTo(st,CLOSED   ))]

I cannot move this filter to the JDBC read query, and I also can't remove the
UPPER function from the where clause.



Also, I found the same data getting written to Cassandra with *padding*.
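
Two workarounds one might try, sketched below under the assumption of a
JDBC-backed DataFrame with the columns from the query above (connection
details and table name are placeholders): trim the read-side padding before
comparing, or disable the padding for the session and keep the UPPER
comparison so the filter is not pushed down with padded literals.

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("enum-padding-workaround").getOrCreate()

# Hypothetical JDBC read; URL, table and credentials are placeholders.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://host:3306/db")
      .option("dbtable", "tickets")
      .option("user", "user")
      .option("password", "password")
      .load())

# Option 1 (Spark 3.4+, padding left enabled): strip the padded spaces from
# the CHAR/ENUM column before comparing, so UPPER() sees the same value as
# in Spark 3.3.1.
filtered = df.filter(
    (F.upper(F.col("vn")) == "ERICSSON") &
    F.upper(F.rtrim(F.col("st"))).isin("OPEN", "REOPEN", "CLOSED")
)

# Option 2: disable read-side padding for the session; as noted above, keep
# the UPPER() comparison in that case, since without it the filter was
# observed to be pushed down to MySQL with padded literals.
spark.conf.set("spark.sql.readSideCharPadding", "false")
```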


Re: [Issue] Spark SQL - broadcast failure

2024-08-01 Thread Sudharshan V
Hi all,
Do we have any ideas on this?

Thanks

On Tue, 23 Jul, 2024, 12:54 pm Sudharshan V, 
wrote:

> We removed the explicit broadcast for that particular table and it took
> longer time since the join type changed from BHJ to SMJ.
>
> I wanted to understand how I can find what went wrong with the broadcast
> now.
> How do I know the size of the table inside of spark memory.
>
> I have tried to cache the table hoping I could see the table size in the
> storage tab of spark UI of EMR.
>
> But I see no data there .
>
> Thanks
>
> On Tue, 23 Jul, 2024, 12:48 pm Sudharshan V, 
> wrote:
>
>> Hi all, apologies for the delayed response.
>>
>> We are using spark version 3.4.1 in jar and EMR 6.11 runtime.
>>
>> We have disabled the auto broadcast always and would broadcast the
>> smaller tables using explicit broadcast.
>>
>> It was working fine historically and only now it is failing.
>>
>> The data sizes I mentioned was taken from S3.
>>
>> Thanks,
>> Sudharshan
>>
>> On Wed, 17 Jul, 2024, 1:53 am Meena Rajani, 
>> wrote:
>>
>>> Can you try disabling broadcast join and see what happens?
>>>
>>> On Mon, Jul 8, 2024 at 12:03 PM Sudharshan V 
>>> wrote:
>>>
 Hi all,

 Been facing a weird issue lately.
 In our production code base , we have an explicit broadcast for a small
 table.
 It is just a look up table that is around 1gb in size in s3 and just
 had few million records and 5 columns.

 The ETL was running fine , but with no change from the codebase nor the
 infrastructure, we are getting broadcast failures. Even weird fact is the
 older size of the data is 1.4gb while for the new run is just 900 MB

 Below is the error message
 Cannot broadcast table that is larger than 8 GB : 8GB.

 I find it extremely weird considering that the data size is very well
 under the thresholds.

 Are there any other ways to find what could be the issue and how we can
 rectify this issue?

 Could the data characteristics be an issue?

 Any help would be immensely appreciated.

 Thanks

>>>
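
Not from the thread, but a sketch of how one might check what size Spark's
optimizer thinks the table is (the number the broadcast limit is checked
against, which can differ a lot from the on-disk size in S3) and keep the
broadcast explicit. Paths and the join key are hypothetical.

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("broadcast-debug").getOrCreate()

lookup = spark.read.parquet("s3://bucket/lookup/")   # hypothetical path
fact = spark.read.parquet("s3://bucket/fact/")       # hypothetical path

# "cost" mode prints the optimizer's size estimate (the Statistics line);
# ~1 GB of compressed parquet can expand to far more once decoded in memory.
lookup.explain(mode="cost")

# Explicit broadcast hint on the smaller side of the join.
joined = fact.join(F.broadcast(lookup), "key", "left")

# If auto-broadcast keeps picking the wrong side, it can be disabled so that
# only the explicit hint above is honoured.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```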


Re: [Issue] Spark SQL - broadcast failure

2024-07-23 Thread Sudharshan V
We removed the explicit broadcast for that particular table and it took a
longer time, since the join type changed from BHJ to SMJ.

I wanted to understand how I can find what went wrong with the broadcast
now.
How do I know the size of the table inside Spark memory?

I have tried to cache the table, hoping I could see the table size in the
Storage tab of the Spark UI on EMR.

But I see no data there.

Thanks

On Tue, 23 Jul, 2024, 12:48 pm Sudharshan V, 
wrote:

> Hi all, apologies for the delayed response.
>
> We are using spark version 3.4.1 in jar and EMR 6.11 runtime.
>
> We have disabled the auto broadcast always and would broadcast the smaller
> tables using explicit broadcast.
>
> It was working fine historically and only now it is failing.
>
> The data sizes I mentioned was taken from S3.
>
> Thanks,
> Sudharshan
>
> On Wed, 17 Jul, 2024, 1:53 am Meena Rajani, 
> wrote:
>
>> Can you try disabling broadcast join and see what happens?
>>
>> On Mon, Jul 8, 2024 at 12:03 PM Sudharshan V 
>> wrote:
>>
>>> Hi all,
>>>
>>> Been facing a weird issue lately.
>>> In our production code base , we have an explicit broadcast for a small
>>> table.
>>> It is just a look up table that is around 1gb in size in s3 and just had
>>> few million records and 5 columns.
>>>
>>> The ETL was running fine , but with no change from the codebase nor the
>>> infrastructure, we are getting broadcast failures. Even weird fact is the
>>> older size of the data is 1.4gb while for the new run is just 900 MB
>>>
>>> Below is the error message
>>> Cannot broadcast table that is larger than 8 GB : 8GB.
>>>
>>> I find it extremely weird considering that the data size is very well
>>> under the thresholds.
>>>
>>> Are there any other ways to find what could be the issue and how we can
>>> rectify this issue?
>>>
>>> Could the data characteristics be an issue?
>>>
>>> Any help would be immensely appreciated.
>>>
>>> Thanks
>>>
>>


Re: [Issue] Spark SQL - broadcast failure

2024-07-23 Thread Sudharshan V
Hi all, apologies for the delayed response.

We are using Spark version 3.4.1 in the jar and the EMR 6.11 runtime.

We have always disabled auto broadcast and would broadcast the smaller
tables using an explicit broadcast.

It was working fine historically; only now is it failing.

The data sizes I mentioned were taken from S3.

Thanks,
Sudharshan

On Wed, 17 Jul, 2024, 1:53 am Meena Rajani,  wrote:

> Can you try disabling broadcast join and see what happens?
>
> On Mon, Jul 8, 2024 at 12:03 PM Sudharshan V 
> wrote:
>
>> Hi all,
>>
>> Been facing a weird issue lately.
>> In our production code base , we have an explicit broadcast for a small
>> table.
>> It is just a look up table that is around 1gb in size in s3 and just had
>> few million records and 5 columns.
>>
>> The ETL was running fine , but with no change from the codebase nor the
>> infrastructure, we are getting broadcast failures. Even weird fact is the
>> older size of the data is 1.4gb while for the new run is just 900 MB
>>
>> Below is the error message
>> Cannot broadcast table that is larger than 8 GB : 8GB.
>>
>> I find it extremely weird considering that the data size is very well
>> under the thresholds.
>>
>> Are there any other ways to find what could be the issue and how we can
>> rectify this issue?
>>
>> Could the data characteristics be an issue?
>>
>> Any help would be immensely appreciated.
>>
>> Thanks
>>
>


[Spark SQL]: Why the OptimizeSkewedJoin rule does not optimize FullOuterJoin?

2024-07-22 Thread 王仲轩(万章)
Hi,
I am a beginner in Spark and am currently learning the Spark source code. I have a
question about the AQE rule OptimizeSkewedJoin.
I have a SQL query using an SMJ FullOuterJoin, where there is read skew on the
left side (the case is shown below).

Case:
remote bytes read total (min, med, max):
90.5 GiB [bytes: 97189264140] (min 208.5 MiB [bytes: 218673776], med 210.0 MiB
[bytes: 220191607], max 18.1 GiB [bytes: 19467332173])

However, the OptimizeSkewedJoin rule does not optimize FullOuterJoin. I would
like to know the reason behind this.
Thanks.
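
Not an answer from the thread, but for reference, a sketch of the AQE settings
that govern skew-join splitting for the join types the rule does support
(values shown are illustrative). One reading of the rule is that splitting a
skewed partition requires replicating the matching partition on the other
side, and in a full outer join neither side can be replicated without
duplicating unmatched rows, which is why FullOuterJoin is left alone.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# A partition counts as skewed when it is larger than this factor times the
# median partition size AND larger than the byte threshold below.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```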


Re: [Issue] Spark SQL - broadcast failure

2024-07-16 Thread Meena Rajani
Can you try disabling broadcast join and see what happens?

On Mon, Jul 8, 2024 at 12:03 PM Sudharshan V 
wrote:

> Hi all,
>
> Been facing a weird issue lately.
> In our production code base , we have an explicit broadcast for a small
> table.
> It is just a look up table that is around 1gb in size in s3 and just had
> few million records and 5 columns.
>
> The ETL was running fine , but with no change from the codebase nor the
> infrastructure, we are getting broadcast failures. Even weird fact is the
> older size of the data is 1.4gb while for the new run is just 900 MB
>
> Below is the error message
> Cannot broadcast table that is larger than 8 GB : 8GB.
>
> I find it extremely weird considering that the data size is very well
> under the thresholds.
>
> Are there any other ways to find what could be the issue and how we can
> rectify this issue?
>
> Could the data characteristics be an issue?
>
> Any help would be immensely appreciated.
>
> Thanks
>


Re: [Issue] Spark SQL - broadcast failure

2024-07-16 Thread Mich Talebzadeh
It will help if you mention the Spark version and the piece of problematic
code

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
PhD  Imperial College
London 
London, United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Tue, 16 Jul 2024 at 08:51, Sudharshan V 
wrote:

>
> On Mon, 8 Jul, 2024, 7:53 pm Sudharshan V, 
> wrote:
>
>> Hi all,
>>
>> Been facing a weird issue lately.
>> In our production code base , we have an explicit broadcast for a small
>> table.
>> It is just a look up table that is around 1gb in size in s3 and just had
>> few million records and 5 columns.
>>
>> The ETL was running fine , but with no change from the codebase nor the
>> infrastructure, we are getting broadcast failures. Even weird fact is the
>> older size of the data is 1.4gb while for the new run is just 900 MB
>>
>> Below is the error message
>> Cannot broadcast table that is larger than 8 GB : 8GB.
>>
>> I find it extremely weird considering that the data size is very well
>> under the thresholds.
>>
>> Are there any other ways to find what could be the issue and how we can
>> rectify this issue?
>>
>> Could the data characteristics be an issue?
>>
>> Any help would be immensely appreciated.
>>
>> Thanks
>>
>


[Issue] Spark SQL - broadcast failure

2024-07-08 Thread Sudharshan V
Hi all,

We've been facing a weird issue lately.
In our production code base, we have an explicit broadcast for a small
table. It is just a lookup table that is around 1 GB in size in S3 and has just
a few million records and 5 columns.

The ETL was running fine, but with no change to the codebase or the
infrastructure, we are getting broadcast failures. An even weirder fact is that
the older size of the data is 1.4 GB, while for the new run it is just 900 MB.

Below is the error message
Cannot broadcast table that is larger than 8 GB : 8GB.

I find it extremely weird considering that the data size is very well under
the thresholds.

Are there any other ways to find what the issue could be and how we can
rectify it?

Could the data characteristics be an issue?

Any help would be immensely appreciated.

Thanks


Re: [Spark SQL]: Does Spark support processing records with timestamp NULL in stateful streaming?

2024-05-27 Thread Mich Talebzadeh
When you use applyInPandasWithState, Spark processes each input row as it
arrives, regardless of whether certain columns, such as the timestamp
column, contain NULL values. This behavior is useful where you want to
handle incomplete or missing data gracefully within your stateful
processing logic. By allowing NULL timestamps to trigger calls to the
stateful function, you can implement custom handling strategies, such as
skipping incomplete records, within your stateful function.


However, it is important to understand that this behavior also *means that
the watermark is not advanced for NULL timestamps*. The watermark is used
for event-time processing in Spark Structured Streaming to track the
progress of event time in your data stream, and is typically based on the
timestamp column. Since NULL timestamps do not contribute to watermark
advancement, they will not move event-time processing (late-data handling,
window finalization) forward.

Regarding whether you can rely on this behavior for your production code,
it largely depends on your requirements and use case. If your application
logic is designed to handle NULL timestamps appropriately and you have
tested it to ensure it behaves as expected, then you can generally rely on
this behavior. FYI, I have not tested it myself, so I cannot provide a
definitive answer.

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
PhD  Imperial College
London 
London, United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Mon, 27 May 2024 at 22:04, Juan Casse  wrote:

> I am using applyInPandasWithState in PySpark 3.5.0.
>
> I noticed that records with timestamp==NULL are processed (i.e., trigger a
> call to the stateful function). And, as you would expect, does not advance
> the watermark.
>
> I am taking advantage of this in my application.
>
> My question: Is this a supported feature of Spark? Can I rely on this
> behavior for my production code?
>
> Thanks,
> Juan
>
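
A minimal sketch (not from the thread) of a defensive pattern if you would
rather not depend on this behaviour: split off NULL-timestamp records before
the watermark and the stateful step, and handle them explicitly. The topic,
schema and column names are hypothetical.

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("null-ts-routing").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")                        # hypothetical topic
          .load()
          .select(F.from_json(F.col("value").cast("string"),
                              "ts TIMESTAMP, key STRING, amount DOUBLE").alias("e"))
          .select("e.*"))

# Records with a timestamp drive the watermark and the stateful logic.
with_ts = events.filter(F.col("ts").isNotNull()).withWatermark("ts", "10 seconds")

# Records without a timestamp are routed elsewhere (dead-letter sink, etc.)
# instead of silently flowing through applyInPandasWithState.
without_ts = events.filter(F.col("ts").isNull())
```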


[Spark SQL]: Does Spark support processing records with timestamp NULL in stateful streaming?

2024-05-27 Thread Juan Casse
I am using applyInPandasWithState in PySpark 3.5.0.

I noticed that records with timestamp==NULL are processed (i.e., trigger a
call to the stateful function). And, as you would expect, does not advance
the watermark.

I am taking advantage of this in my application.

My question: Is this a supported feature of Spark? Can I rely on this
behavior for my production code?

Thanks,
Juan


Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Shay Elbaz
Seen this before; had a very(!) complex plan behind a DataFrame, to the point
where any additional transformation went OOM on the driver.

A quick and ugly solution was to break the plan - convert the DataFrame to an RDD
and back to a DF at certain points to make the plan shorter. This has obvious
drawbacks, and is not recommended in general, but at least we had something
working. The real, long-term solution was to replace the many (> 200)
withColumn() calls with only a few select() calls. You can easily find sources on
the internet for why this is better (it was on Spark 2.3, but I think the main
principles remain). TBH, it was easier than I expected, as it mainly involved
moving pieces of code from one place to another, and not a "real", meaningful
refactoring.
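
A hedged sketch of the refactor described above (column names are made up):
each withColumn() stacks another projection on the plan, while a single
select() produces the same columns with a much flatter plan.

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)   # stand-in for the real DataFrame, with column "id"

# Before: every withColumn adds another Project node on top of the plan.
before = (df.withColumn("a", F.col("id") + 1)
            .withColumn("b", F.col("id") * 2)
            .withColumn("c", F.col("a") + F.col("b")))

# After: one select builds the same columns in a single projection.
a = F.col("id") + 1
b = F.col("id") * 2
after = df.select("*", a.alias("a"), b.alias("b"), (a + b).alias("c"))
```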



From: Mich Talebzadeh 
Sent: Monday, May 27, 2024 15:43
Cc: user@spark.apache.org 
Subject: Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame 
Processing



Few ideas on top of my head for how to go about solving the problem


  1.  Try with subsets: Try reproducing the issue with smaller subsets of your 
data to pinpoint the specific operation causing the memory problems.
  2.  Explode or Flatten Nested Structures: If your DataFrame schema involves 
deep nesting, consider using techniques like explode or flattening to transform 
it into a less nested structure. This can reduce memory usage during operations 
like withColumn.
  3.  Lazy Evaluation: Use select before withColumn: this ensures lazy 
evaluation, meaning Spark only materializes the data when necessary. This can 
improve memory usage compared to directly calling withColumn on the entire 
DataFrame.
  4.  spark.sql.shuffle.partitions: Setting this configuration to a value close 
to the number of executors can improve shuffle performance and potentially 
reduce memory usage.
  5.  Spark UI Monitoring: Utilize the Spark UI to monitor memory usage 
throughout your job execution and identify potential memory bottlenecks.

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


 
   view my Linkedin 
profile<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge but 
of course cannot be guaranteed . It is essential to note that, as with any 
advice, quote "one test result is worth one-thousand expert opinions (Werner 
<https://en.wikipedia.org/wiki/Wernher_von_Braun> Von 
Braun<https://en.wikipedia.org/wiki/Wernher_von_Braun>)".



On Mon, 27 May 2024 at 12:50, Gaurav Madan 
 wrote:
Dear Community,

I'm reaching out to seek your assistance with a memory issue we've been facing 
while processing certain large and nested DataFrames using Apache Spark. We 
have encountered a scenario where the driver runs out of memory when applying 
the `withColumn` method on specific DataFrames in Spark 3.4.1. However, the 
same DataFrames are processed successfully in Spark 2.4.0.

Problem Summary:
For certain DataFrames, applying the `withColumn` method in Spark 3.4.1 causes 
the driver to choke and run out of memory. However, the same DataFrames are 
processed successfully in Spark 2.4.0.

Heap Dump Analysis:
We performed a heap dump analysis after enabling heap dump on out-of-memory 
errors, and the analysis revealed the following significant frames and local 
variables:

```

org.apache.spark.sql.Dataset.withPlan(Lorg/apache/spark/sql/catalyst/plans/logical/LogicalPlan;)Lorg/apache/spark/sql/Dataset;
 (Dataset.scala:4273)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%) bytes

org.apache.spark.sql.Dataset.select(Lscala/collection/Seq;)Lorg/apache/spark/sql/Dataset;
 (Dataset.scala:1622)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%) bytes

org.apache.spark.sql.Dataset.withColumns(Lscala/collection/Seq;Lscala/collection/Seq;)Lorg/apache/spark/sql/Dataset;
 (Dataset.scala:2820)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%) bytes

org.apache.spark.sql.Dataset.withColumn(Ljava/lang/String;Lorg/apache/spark/sql/Column;)Lorg/apache/spark/sql/Dataset;
 (Dataset.scala:2759)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%) bytes

com.urbanclap.dp.eventpersistence.utils.DataPersistenceUtil$.addStartTimeInPartitionColumn(Lorg/apache/spark/sql/Dataset;Lcom/urbanclap/dp/factory/output/Event;Lcom/urbanclap/dp/eventpersistence/JobMetadata;)Lorg/apache/spark/sql/Dataset;
 (DataPersistenceUtil.scala:88)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%) bytes

com.urbanclap.dp.eventpersistence.

Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Mich Talebzadeh
Few ideas on top of my head for how to go about solving the problem


   1. Try with subsets: Try reproducing the issue with smaller subsets of
   your data to pinpoint the specific operation causing the memory problems.
   2. Explode or Flatten Nested Structures: If your DataFrame schema
   involves deep nesting, consider using techniques like explode or flattening
   to transform it into a less nested structure. This can reduce memory usage
   during operations like withColumn.
   3. Lazy Evaluation: Use select before withColumn: this ensures lazy
   evaluation, meaning Spark only materializes the data when necessary. This
   can improve memory usage compared to directly calling withColumn on the
   entire DataFrame.
   4. spark.sql.shuffle.partitions: Setting this configuration to a value
   close to the number of executors can improve shuffle performance and
   potentially reduce memory usage.
   5. Spark UI Monitoring: Utilize the Spark UI to monitor memory usage
   throughout your job execution and identify potential memory bottlenecks.

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".



On Mon, 27 May 2024 at 12:50, Gaurav Madan
 wrote:

> Dear Community,
>
> I'm reaching out to seek your assistance with a memory issue we've been
> facing while processing certain large and nested DataFrames using Apache
> Spark. We have encountered a scenario where the driver runs out of memory
> when applying the `withColumn` method on specific DataFrames in Spark
> 3.4.1. However, the same DataFrames are processed successfully in Spark
> 2.4.0.
>
>
> *Problem Summary:*For certain DataFrames, applying the `withColumn`
> method in Spark 3.4.1 causes the driver to choke and run out of memory.
> However, the same DataFrames are processed successfully in Spark 2.4.0.
>
>
> *Heap Dump Analysis:*We performed a heap dump analysis after enabling
> heap dump on out-of-memory errors, and the analysis revealed the following
> significant frames and local variables:
>
> ```
>
> org.apache.spark.sql.Dataset.withPlan(Lorg/apache/spark/sql/catalyst/plans/logical/LogicalPlan;)Lorg/apache/spark/sql/Dataset;
> (Dataset.scala:4273)
>
> org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
> bytes
>
> org.apache.spark.sql.Dataset.select(Lscala/collection/Seq;)Lorg/apache/spark/sql/Dataset;
> (Dataset.scala:1622)
>
> org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
> bytes
>
> org.apache.spark.sql.Dataset.withColumns(Lscala/collection/Seq;Lscala/collection/Seq;)Lorg/apache/spark/sql/Dataset;
> (Dataset.scala:2820)
>
> org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
> bytes
>
> org.apache.spark.sql.Dataset.withColumn(Ljava/lang/String;Lorg/apache/spark/sql/Column;)Lorg/apache/spark/sql/Dataset;
> (Dataset.scala:2759)
>
> org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
> bytes
>
> com.urbanclap.dp.eventpersistence.utils.DataPersistenceUtil$.addStartTimeInPartitionColumn(Lorg/apache/spark/sql/Dataset;Lcom/urbanclap/dp/factory/output/Event;Lcom/urbanclap/dp/eventpersistence/JobMetadata;)Lorg/apache/spark/sql/Dataset;
> (DataPersistenceUtil.scala:88)
>
> org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
> bytes
>
> com.urbanclap.dp.eventpersistence.utils.DataPersistenceUtil$.writeRawData(Lorg/apache/spark/sql/Dataset;Lcom/urbanclap/dp/factory/output/Event;Ljava/lang/String;Lcom/urbanclap/dp/eventpersistence/JobMetadata;)V
> (DataPersistenceUtil.scala:19)
>
> org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
> bytes
>
> com.urbanclap.dp.eventpersistence.step.bronze.BronzeStep$.persistRecords(Lorg/apache/spark/sql/Dataset;Lcom/urbanclap/dp/factory/output/Event;Lcom/urbanclap/dp/eventpersistence/JobMetadata;)V
> (BronzeStep.scala:23)
>
> org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
> bytes
>
> com.urbanclap.dp.eventpersistence.MainJob$.processRecords(Lorg/apache/spark/sql/SparkSession;Lcom/urbanclap/dp/factory/output/Event;Lcom/urbanclap/dp/eventpersistence/JobMetadata;Lorg/apache/spark/sql/Dataset;)V
> (MainJob.scala:78)
>
> org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,8

Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Gaurav Madan
Dear Community,

I'm reaching out to seek your assistance with a memory issue we've been
facing while processing certain large and nested DataFrames using Apache
Spark. We have encountered a scenario where the driver runs out of memory
when applying the `withColumn` method on specific DataFrames in Spark
3.4.1. However, the same DataFrames are processed successfully in Spark
2.4.0.


*Problem Summary:* For certain DataFrames, applying the `withColumn` method
in Spark 3.4.1 causes the driver to choke and run out of memory. However,
the same DataFrames are processed successfully in Spark 2.4.0.


*Heap Dump Analysis:* We performed a heap dump analysis after enabling heap
dump on out-of-memory errors, and the analysis revealed the following
significant frames and local variables:

```

org.apache.spark.sql.Dataset.withPlan(Lorg/apache/spark/sql/catalyst/plans/logical/LogicalPlan;)Lorg/apache/spark/sql/Dataset;
(Dataset.scala:4273)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
bytes

org.apache.spark.sql.Dataset.select(Lscala/collection/Seq;)Lorg/apache/spark/sql/Dataset;
(Dataset.scala:1622)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
bytes

org.apache.spark.sql.Dataset.withColumns(Lscala/collection/Seq;Lscala/collection/Seq;)Lorg/apache/spark/sql/Dataset;
(Dataset.scala:2820)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
bytes

org.apache.spark.sql.Dataset.withColumn(Ljava/lang/String;Lorg/apache/spark/sql/Column;)Lorg/apache/spark/sql/Dataset;
(Dataset.scala:2759)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
bytes

com.urbanclap.dp.eventpersistence.utils.DataPersistenceUtil$.addStartTimeInPartitionColumn(Lorg/apache/spark/sql/Dataset;Lcom/urbanclap/dp/factory/output/Event;Lcom/urbanclap/dp/eventpersistence/JobMetadata;)Lorg/apache/spark/sql/Dataset;
(DataPersistenceUtil.scala:88)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
bytes

com.urbanclap.dp.eventpersistence.utils.DataPersistenceUtil$.writeRawData(Lorg/apache/spark/sql/Dataset;Lcom/urbanclap/dp/factory/output/Event;Ljava/lang/String;Lcom/urbanclap/dp/eventpersistence/JobMetadata;)V
(DataPersistenceUtil.scala:19)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
bytes

com.urbanclap.dp.eventpersistence.step.bronze.BronzeStep$.persistRecords(Lorg/apache/spark/sql/Dataset;Lcom/urbanclap/dp/factory/output/Event;Lcom/urbanclap/dp/eventpersistence/JobMetadata;)V
(BronzeStep.scala:23)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
bytes

com.urbanclap.dp.eventpersistence.MainJob$.processRecords(Lorg/apache/spark/sql/SparkSession;Lcom/urbanclap/dp/factory/output/Event;Lcom/urbanclap/dp/eventpersistence/JobMetadata;Lorg/apache/spark/sql/Dataset;)V
(MainJob.scala:78)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
bytes

com.urbanclap.dp.eventpersistence.MainJob$.processBatchCB(Lorg/apache/spark/sql/SparkSession;Lcom/urbanclap/dp/eventpersistence/JobMetadata;Lcom/urbanclap/dp/factory/output/Event;Lorg/apache/spark/sql/Dataset;)V
(MainJob.scala:66)

org.apache.spark.sql.Dataset @ 0x680190b10 retains 4,307,840,192 (80.38%)
bytes

```


*Driver Configuration:*
1. Driver instance: c6g.xlarge with 4 vCPUs and 8 GB RAM.
2. `spark.driver.memory` and `spark.driver.memoryOverhead` are set to
default values.


*Observations:*
- The DataFrame schema is very nested and large, which might be contributing
to the memory issue.
- Despite similar configurations, Spark 2.4.0 processes the DataFrame
without issues, while Spark 3.4.1 does not.


*Tried Solutions:* We have tried several solutions, including disabling
Adaptive Query Execution, setting the driver max result size, increasing
driver cores, and enabling specific optimizer rules. However, the issue
persisted until we increased the driver memory to 48 GB and memory overhead
to 5 GB, which allowed the driver to schedule the tasks successfully.
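
For reference, a sketch of how those driver-side settings are usually
supplied (values mirror the ones above and are not recommendations). Note
that spark.driver.memory generally has to be set before the driver JVM
starts, i.e. via spark-submit or cluster defaults, rather than from inside an
already-running session.

```
from pyspark.sql import SparkSession

# Effective when the session is what launches the driver (e.g. a local/client
# run started from Python); on EMR or in cluster mode, pass the same keys via
# spark-submit --conf or the cluster's spark-defaults instead.
spark = (SparkSession.builder
         .appName("nested-df-job")
         .config("spark.driver.memory", "48g")
         .config("spark.driver.memoryOverhead", "5g")
         .config("spark.driver.maxResultSize", "4g")   # illustrative value
         .getOrCreate())
```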


*Request for Suggestions:* Are there any additional configurations or
optimizations that could help mitigate this memory issue without always
resorting to a larger machine? We would greatly appreciate any insights or
recommendations from the community on how to resolve this issue effectively.

I have attached the DataFrame schema and the complete stack trace from the
heap dump analysis for your reference.

Doc explaining the issue
<https://docs.google.com/document/d/1FL6Zeim6IN1riLH1Hp7Jw4acsBoSyWbSN13mCYnjo60/edit?usp=sharing>
DataFrame Schema
<https://drive.google.com/file/d/1wgFB0_WvdQdGoEMGFePhZwLR7aQZ5fPn/view?usp=sharing>

Thank you in advance for your assistance.

Best regards,
Gaurav Madan
LinkedIn <https://www.linkedin.com/in/gaurav-madan-210b62177/>
*Personal Mail: *gauravmadan...@gmail.com
*Work Mail:* gauravma...@urbancompany.com


Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Sadly, Apache Spark sounds like it has nothing to do with materialised
views. I was hoping it could read them!

>>> spark.sql("SELECT * FROM test.mv").show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/sql/session.py", line 1440, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
  File "/usr/src/Python-3.9.16/venv/venv3.9/lib/python3.9/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/opt/spark/python/pyspark/errors/exceptions/captured.py", line 175, in deco
    raise converted from None
pyspark.errors.exceptions.captured.AnalysisException: Hive materialized view is not supported.


HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Fri, 3 May 2024 at 11:03, Mich Talebzadeh 
wrote:

> Thanks for the comments I received.
>
> So in summary, Apache Spark itself doesn't directly manage materialized
> views,(MV)  but it can work with them through integration with the
> underlying data storage systems like Hive or through iceberg. I believe
> databricks through unity catalog support MVs as well.
>
> Moreover, there is a case for supporting MVs. However, Spark can utilize
> materialized views even though it doesn't directly manage them.. This came
> about because someone in the Spark user forum enquired about "Spark
> streaming issue to Elastic data*". One option I thought of was that uUsing
> materialized views with Spark Structured Streaming and Change Data Capture
> (CDC) is a potential solution for efficiently streaming view data updates
> in this scenario. .
>
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Fri, 3 May 2024 at 00:54, Mich Talebzadeh 
> wrote:
>
>> An issue I encountered while working with Materialized Views in Spark
>> SQL. It appears that there is an inconsistency between the behavior of
>> Materialized Views in Spark SQL and Hive.
>>
>> When attempting to execute a statement like DROP MATERIALIZED VIEW IF
>> EXISTS test.mv in Spark SQL, I encountered a syntax error indicating
>> that the keyword MATERIALIZED is not recognized. However, the same
>> statement executes successfully in Hive without any errors.
>>
>> pyspark.errors.exceptions.captured.ParseException:
>> [PARSE_SYNTAX_ERROR] Syntax error at or near 'MATERIALIZED'.(line 1, pos
>> 5)
>>
>> == SQL ==
>> DROP MATERIALIZED VIEW IF EXISTS test.mv
>> -^^^
>>
>> Here are the versions I am using:
>>
>>
>>
>> *Hive: 3.1.1Spark: 3.4*
>> my Spark session:
>>
>> spark = SparkSession.builder \
>>   .appName("test") \
>>   .enableHiveSupport() \
>>   .getOrCreate()
>>
>> Has anyone seen this behaviour or encountered a similar issue or if there
>> are any insights into why this discrepancy exists between Spark SQL and
>> Hive.
>>
>> Thanks
>>
>> Mich Talebzadeh,
>>
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner Von Braun)".
>>
>


Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Thanks for the comments I received.

So in summary, Apache Spark itself doesn't directly manage materialized
views (MVs), but it can work with them through integration with the
underlying data storage systems like Hive, or through Iceberg. I believe
Databricks, through Unity Catalog, supports MVs as well.

Moreover, there is a case for supporting MVs. However, Spark can utilize
materialized views even though it doesn't directly manage them. This came
about because someone in the Spark user forum enquired about "Spark
streaming issue to Elastic data". One option I thought of was that using
materialized views with Spark Structured Streaming and Change Data Capture
(CDC) is a potential solution for efficiently streaming view data updates
in this scenario.


Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Fri, 3 May 2024 at 00:54, Mich Talebzadeh 
wrote:

> An issue I encountered while working with Materialized Views in Spark SQL.
> It appears that there is an inconsistency between the behavior of
> Materialized Views in Spark SQL and Hive.
>
> When attempting to execute a statement like DROP MATERIALIZED VIEW IF
> EXISTS test.mv in Spark SQL, I encountered a syntax error indicating that
> the keyword MATERIALIZED is not recognized. However, the same statement
> executes successfully in Hive without any errors.
>
> pyspark.errors.exceptions.captured.ParseException:
> [PARSE_SYNTAX_ERROR] Syntax error at or near 'MATERIALIZED'.(line 1, pos 5)
>
> == SQL ==
> DROP MATERIALIZED VIEW IF EXISTS test.mv
> -^^^
>
> Here are the versions I am using:
>
>
>
> *Hive: 3.1.1Spark: 3.4*
> my Spark session:
>
> spark = SparkSession.builder \
>   .appName("test") \
>   .enableHiveSupport() \
>   .getOrCreate()
>
> Has anyone seen this behaviour or encountered a similar issue or if there
> are any insights into why this discrepancy exists between Spark SQL and
> Hive.
>
> Thanks
>
> Mich Talebzadeh,
>
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>


Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Jungtaek Lim
(removing dev@ as I don't think this is a dev@-related thread, but more of a
"question")

My understanding is that Apache Spark does not support Materialized View.
That's all. IMHO it's not a proper expectation that all operations in
Apache Hive will be supported in Apache Spark. They are different projects
and Apache Spark does not aim to be 100% compatible with Apache Hive. There
was a time the community tried to provide some sort of compatibility, but
both projects are 10+ years old, and mature enough to have their own
roadmap to drive.

That said, that's not a bug or an issue. You can initiate a feature request
and wish the community to include that into the roadmap.
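
In the meantime, a common stopgap (a sketch, not a recommendation; database,
table and column names are hypothetical) is to materialize the query as an
ordinary table and refresh it on a schedule, since Spark SQL has no
MATERIALIZED VIEW DDL of its own:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mv-emulation")
         .enableHiveSupport()
         .getOrCreate())

# Build the "materialized view" once as a plain table.
spark.sql("""
  CREATE TABLE IF NOT EXISTS test.mv_emulated
  USING parquet
  AS SELECT customer_id, sum(amount) AS total_amount
     FROM test.orders
     GROUP BY customer_id
""")

# Refresh it periodically with a full rebuild; incremental refresh would need
# something like Iceberg/Delta MERGE or Structured Streaming with CDC.
spark.sql("""
  INSERT OVERWRITE TABLE test.mv_emulated
  SELECT customer_id, sum(amount) AS total_amount
  FROM test.orders
  GROUP BY customer_id
""")
```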

On Fri, May 3, 2024 at 12:01 PM Mich Talebzadeh 
wrote:

> An issue I encountered while working with Materialized Views in Spark SQL.
> It appears that there is an inconsistency between the behavior of
> Materialized Views in Spark SQL and Hive.
>
> When attempting to execute a statement like DROP MATERIALIZED VIEW IF
> EXISTS test.mv in Spark SQL, I encountered a syntax error indicating that
> the keyword MATERIALIZED is not recognized. However, the same statement
> executes successfully in Hive without any errors.
>
> pyspark.errors.exceptions.captured.ParseException:
> [PARSE_SYNTAX_ERROR] Syntax error at or near 'MATERIALIZED'.(line 1, pos 5)
>
> == SQL ==
> DROP MATERIALIZED VIEW IF EXISTS test.mv
> -^^^
>
> Here are the versions I am using:
>
>
>
> *Hive: 3.1.1Spark: 3.4*
> my Spark session:
>
> spark = SparkSession.builder \
>   .appName("test") \
>   .enableHiveSupport() \
>   .getOrCreate()
>
> Has anyone seen this behaviour or encountered a similar issue or if there
> are any insights into why this discrepancy exists between Spark SQL and
> Hive.
>
> Thanks
>
> Mich Talebzadeh,
>
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>


Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Walaa Eldin Moustafa
I do not think the issue is with DROP MATERIALIZED VIEW only, but also with
CREATE MATERIALIZED VIEW, because neither is supported in Spark. I guess
you must have created the view from Hive and are trying to drop it from
Spark, and that is why you are running into the issue with DROP first.

There is some work in the Iceberg community to add the support to Spark
through SQL extensions, and Iceberg support for views and
materialization tables. Some recent discussions can be found here [1] along
with a WIP Iceberg-Spark PR.

[1] https://lists.apache.org/thread/rotmqzmwk5jrcsyxhzjhrvcjs5v3yjcc

Thanks,
Walaa.

On Thu, May 2, 2024 at 4:55 PM Mich Talebzadeh 
wrote:

> An issue I encountered while working with Materialized Views in Spark SQL.
> It appears that there is an inconsistency between the behavior of
> Materialized Views in Spark SQL and Hive.
>
> When attempting to execute a statement like DROP MATERIALIZED VIEW IF
> EXISTS test.mv in Spark SQL, I encountered a syntax error indicating that
> the keyword MATERIALIZED is not recognized. However, the same statement
> executes successfully in Hive without any errors.
>
> pyspark.errors.exceptions.captured.ParseException:
> [PARSE_SYNTAX_ERROR] Syntax error at or near 'MATERIALIZED'.(line 1, pos 5)
>
> == SQL ==
> DROP MATERIALIZED VIEW IF EXISTS test.mv
> -^^^
>
> Here are the versions I am using:
>
>
>
> *Hive: 3.1.1Spark: 3.4*
> my Spark session:
>
> spark = SparkSession.builder \
>   .appName("test") \
>   .enableHiveSupport() \
>   .getOrCreate()
>
> Has anyone seen this behaviour or encountered a similar issue or if there
> are any insights into why this discrepancy exists between Spark SQL and
> Hive.
>
> Thanks
>
> Mich Talebzadeh,
>
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>


Issue with Materialized Views in Spark SQL

2024-05-02 Thread Mich Talebzadeh
Here is an issue I encountered while working with Materialized Views in Spark SQL.
It appears that there is an inconsistency between the behavior of
Materialized Views in Spark SQL and Hive.

When attempting to execute a statement like DROP MATERIALIZED VIEW IF
EXISTS test.mv in Spark SQL, I encountered a syntax error indicating that
the keyword MATERIALIZED is not recognized. However, the same statement
executes successfully in Hive without any errors.

pyspark.errors.exceptions.captured.ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near 'MATERIALIZED'.(line 1, pos 5)

== SQL ==
DROP MATERIALIZED VIEW IF EXISTS test.mv
-^^^

Here are the versions I am using:



*Hive: 3.1.1Spark: 3.4*
my Spark session:

spark = SparkSession.builder \
  .appName("test") \
  .enableHiveSupport() \
  .getOrCreate()

Has anyone seen this behaviour or encountered a similar issue, or are there
any insights into why this discrepancy exists between Spark SQL and
Hive?

Thanks

Mich Talebzadeh,

Technologist | Architect | Data Engineer  | Generative AI | FinCrime

London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge
but of course cannot be guaranteed . It is essential to note that, as with
any advice, quote "one test result is worth one-thousand expert opinions
(Werner Von Braun)".


How to use Structured Streaming in Spark SQL

2024-04-22 Thread ????
In Flink, you can create stream-processing tables using Flink SQL and connect
directly to CDC and Kafka sources through SQL. How can I use SQL for stream
processing in Spark?
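
In Spark, the usual pattern is to build the streaming source with the
DataFrame API, register it as a temporary view, and then run SQL over that
view; there is no Flink-style CREATE TABLE ... WITH ('connector' = 'kafka')
DDL in core Spark. A minimal sketch (topic name and schema are hypothetical):

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sql-on-stream").getOrCreate()

orders = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "orders")                        # hypothetical topic
          .load()
          .select(F.from_json(F.col("value").cast("string"),
                              "order_id BIGINT, amount DOUBLE, ts TIMESTAMP").alias("o"))
          .select("o.*")
          .withWatermark("ts", "10 minutes"))

orders.createOrReplaceTempView("orders")

# Plain SQL over the streaming view, including event-time windowing.
agg = spark.sql("""
  SELECT window(ts, '5 minutes') AS win, count(*) AS cnt, sum(amount) AS total
  FROM orders
  GROUP BY window(ts, '5 minutes')
""")

query = (agg.writeStream
         .outputMode("update")
         .format("console")
         .option("truncate", "false")
         .start())
query.awaitTermination()
```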



308027...@qq.com



 

[Spark SQL][How-To] Remove builtin function support from Spark

2024-04-17 Thread Matthew McMillian
Hello,

I'm very new to the Spark ecosystem, apologies if this question is a bit
simple.

I want to modify a custom fork of Spark to remove function support. For
example, I want to remove the query runner's ability to call reflect and
java_method. I saw that there exists a data structure in spark-sql called
FunctionRegistry that seems to act as an allowlist of the functions Spark
can execute. If I remove a function from the registry, is that enough of a
guarantee that the function can "never" be invoked in Spark, or are there
other areas that would need to be changed as well?

Thanks,
Matthew McMillian


[Spark SQL] xxhash64 default seed of 42 confusion

2024-04-16 Thread Igor Calabria
Hi all,

I've noticed that Spark's xxhash64 output doesn't match other tools' due to
it using seed=42 as a default. I've looked at a few libraries and they use 0
as the default seed:

- python https://github.com/ifduyue/python-xxhash
- java https://github.com/OpenHFT/Zero-Allocation-Hashing/
- java (slice library, used by trino)
https://github.com/airlift/slice/blob/master/src/main/java/io/airlift/slice/XxHash64.java

Was there a special motivation behind this, or is 42 just used for the sake
of the Hitchhiker's Guide reference? It's very common for Spark to interact
with other tools (either via data or a direct connection) and this just seems
like an unnecessary footgun.
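
A hedged way to see the seed difference from PySpark (assuming the
python-xxhash package; the helper maps the unsigned digest onto Spark's
signed 64-bit LongType). For plain strings Spark hashes the UTF-8 bytes, so
the seed-42 value should line up, but treat the comparison as a starting
point rather than a guarantee for every data type:

```
import xxhash
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

s = "hello"
spark_hash = spark.range(1).select(F.xxhash64(F.lit(s))).first()[0]

def as_signed64(u):
    # Map an unsigned 64-bit digest onto the signed range Spark returns.
    return u - (1 << 64) if u >= (1 << 63) else u

seed42 = as_signed64(xxhash.xxh64(s.encode("utf-8"), seed=42).intdigest())
seed0 = as_signed64(xxhash.xxh64(s.encode("utf-8"), seed=0).intdigest())

print(spark_hash, seed42, seed0)   # seed0 is what most other tools emit by default
```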


Re: [Spark SQL]: Source code for PartitionedFile

2024-04-11 Thread Ashley McManamon
Hi Mich,

Thanks for the reply.

I did come across that file but it didn't align with the appearance of
`PartitionedFile`:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala

In fact, the code snippet you shared also references the type
`PartitionedFile`.

There's actually this javadoc.io page for a `PartitionedFile`
at org.apache.spark.sql.execution.datasources for spark-sql_2.12:3.0.2:
https://javadoc.io/doc/org.apache.spark/spark-sql_2.12/3.0.2/org/apache/spark/sql/execution/datasources/PartitionedFile.html.
I double-checked the source code for version 3.0.2 and it doesn't seem to
exist there either:
https://github.com/apache/spark/tree/v3.0.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources

Ashley


On Mon, 8 Apr 2024 at 22:41, Mich Talebzadeh 
wrote:

> Hi,
>
> I believe this is the package
>
>
> https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala
>
> And the code
>
> case class FilePartition(index: Int, files: Array[PartitionedFile])
>   extends Partition with InputPartition {
>   override def preferredLocations(): Array[String] = {
> // Computes total number of bytes that can be retrieved from each host.
> val hostToNumBytes = mutable.HashMap.empty[String, Long]
> files.foreach { file =>
>   file.locations.filter(_ != "localhost").foreach { host =>
> hostToNumBytes(host) = hostToNumBytes.getOrElse(host, 0L) +
> file.length
>   }
> }
>
> // Selects the first 3 hosts with the most data to be retrieved.
> hostToNumBytes.toSeq.sortBy {
>   case (host, numBytes) => numBytes
> }.reverse.take(3).map {
>   case (host, numBytes) => host
> }.toArray
>   }
> }
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Mon, 8 Apr 2024 at 20:31, Ashley McManamon <
> ashley.mcmana...@quantcast.com> wrote:
>
>> Hi All,
>>
>> I've been diving into the source code to get a better understanding of
>> how file splitting works from a user perspective. I've hit a deadend at
>> `PartitionedFile`, for which I cannot seem to find a definition? It appears
>> though it should be found at
>> org.apache.spark.sql.execution.datasources but I find no definition in
>> the entire source code. Am I missing something?
>>
>> I appreciate there may be an obvious answer here, apologies if I'm being
>> naive.
>>
>> Thanks,
>> Ashley McManamon
>>
>>


Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread Mich Talebzadeh
Interesting. So below should be the corrected code with the suggestion from
[SPARK-47718] ".sql() does not recognize watermark defined upstream" - ASF
JIRA (apache.org):

# Define schema for parsing Kafka messages
schema = StructType([
StructField('createTime', TimestampType(), True),
StructField('orderId', LongType(), True),
StructField('payAmount', DoubleType(), True),
StructField('payPlatform', IntegerType(), True),
StructField('provinceId', IntegerType(), True),
])

# Read streaming data from Kafka source
streaming_df = session.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "payment_msg") \
.option("startingOffsets", "earliest") \
.load() \
.select(from_json(col("value").cast("string"),
schema).alias("parsed_value")) \
.select("parsed_value.*") \
.withWatermark("createTime", "10 seconds")

# Create temporary view for SQL queries
streaming_df.createOrReplaceTempView("streaming_df")

# Define SQL query with correct window function usage
query = """
SELECT
window(start, '1 hour', '30 minutes') as window,
provinceId,
sum(payAmount) as totalPayAmount
FROM streaming_df
GROUP BY provinceId, window(start, '1 hour', '30 minutes')
ORDER BY window.start
"""

# Write the aggregated results to Kafka sink
stream = session.sql(query) \
.writeStream \
.format("kafka") \
.option("checkpointLocation", "checkpoint") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "sink") \
.start()


Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Tue, 9 Apr 2024 at 21:45, 刘唯  wrote:

> Sorry this is not a bug but essentially a user error. Spark throws a
> really confusing error and I'm also confused. Please see the reply in the
> ticket for how to make things correct.
> https://issues.apache.org/jira/browse/SPARK-47718
>
> On Sat, 6 Apr 2024 at 11:41, 刘唯 wrote:
>
>> This indeed looks like a bug. I will take some time to look into it.
>>
>> On Wed, 3 Apr 2024 at 01:55, Mich Talebzadeh wrote:
>>
>>>
>>> hm. you are getting below
>>>
>>> AnalysisException: Append output mode not supported when there are
>>> streaming aggregations on streaming DataFrames/DataSets without watermark;
>>>
>>> The problem seems to be that you are using the append output mode when
>>> writing the streaming query results to Kafka. This mode is designed for
>>> scenarios where you want to append new data to an existing dataset at the
>>> sink (in this case, the "sink" topic in Kafka). However, your query
>>> involves a streaming aggregation: group by provinceId, window('createTime',
>>> '1 hour', '30 minutes'). The problem is that Spark Structured Streaming
>>> requires a watermark to ensure exactly-once processing when using
>>> aggregations with append mode. Your code already defines a watermark on the
>>> "createTime" column with a delay of 10 seconds (withWatermark("createTime",
>>> "10 seconds")). However, the error message indicates it is missing on the
>>> start column. Try adding watermark to "start" Column: Modify your code as
>>> below  to include a watermark on the "start" column generated by the
>>> window function:
>>>
>>> from pyspark.sql.functions import col, from_json, explode, window, sum,
>>> watermark
>>>
>>> streaming_df = session.readStream \
>>>   .format("kafka") \
>>>   .option("kafka.bootstrap.servers", "localhost:9092") \
>>>   .option("subscribe", "payment_msg") \
>>>   .option("startingOffsets", "earliest") \
>>>   .load() \
>>>   .select(from_json(col("value").cast("string"),
>>> schema).alias("parsed_value")) \
>>>   .select("parsed_value.*") \
>>>   .withWatermark("createTime", "10 seconds")  # Existing watermark on
>>> createTime
>>>
>>> *# Modified section with watermark on 'start' column*
>>> streaming_df = streaming_df.groupBy(
>>>   col("provinceId"),
>>>   window(col("createTime"), "1 hour", "30 minutes")
>>> ).agg(
>>>   sum(col("payAmount")).alias("totalPayAmount")
>>> ).withWatermark(expr("start"), "10 seconds")  # Watermark on
>>> window-generated 'start'
>>>
>>> # Rest of the code remains the same
>>> streaming_df.createOrReplaceTempView("streaming_df")
>>>
>>> spark.sql("""
>>> SELECT
>>>   window.start, window.end, provinceId, totalPayAmount
>>> FROM streaming_df
>>> ORDER BY window.start
>>> """) \
>>> .writeStream \
>>> .format("kafka") \
>>> .option("checkpointLocation", "chec

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread 刘唯
Sorry this is not a bug but essentially a user error. Spark throws a really
confusing error and I'm also confused. Please see the reply in the ticket
for how to make things correct.
https://issues.apache.org/jira/browse/SPARK-47718

On Sat, 6 Apr 2024 at 11:41, 刘唯 wrote:

> This indeed looks like a bug. I will take some time to look into it.
>
> On Wed, 3 Apr 2024 at 01:55, Mich Talebzadeh wrote:
>
>>
>> hm. you are getting below
>>
>> AnalysisException: Append output mode not supported when there are
>> streaming aggregations on streaming DataFrames/DataSets without watermark;
>>
>> The problem seems to be that you are using the append output mode when
>> writing the streaming query results to Kafka. This mode is designed for
>> scenarios where you want to append new data to an existing dataset at the
>> sink (in this case, the "sink" topic in Kafka). However, your query
>> involves a streaming aggregation: group by provinceId, window('createTime',
>> '1 hour', '30 minutes'). The problem is that Spark Structured Streaming
>> requires a watermark to ensure exactly-once processing when using
>> aggregations with append mode. Your code already defines a watermark on the
>> "createTime" column with a delay of 10 seconds (withWatermark("createTime",
>> "10 seconds")). However, the error message indicates it is missing on the
>> start column. Try adding watermark to "start" Column: Modify your code as
>> below  to include a watermark on the "start" column generated by the
>> window function:
>>
>> from pyspark.sql.functions import col, from_json, explode, window, sum,
>> watermark
>>
>> streaming_df = session.readStream \
>>   .format("kafka") \
>>   .option("kafka.bootstrap.servers", "localhost:9092") \
>>   .option("subscribe", "payment_msg") \
>>   .option("startingOffsets", "earliest") \
>>   .load() \
>>   .select(from_json(col("value").cast("string"),
>> schema).alias("parsed_value")) \
>>   .select("parsed_value.*") \
>>   .withWatermark("createTime", "10 seconds")  # Existing watermark on
>> createTime
>>
>> *# Modified section with watermark on 'start' column*
>> streaming_df = streaming_df.groupBy(
>>   col("provinceId"),
>>   window(col("createTime"), "1 hour", "30 minutes")
>> ).agg(
>>   sum(col("payAmount")).alias("totalPayAmount")
>> ).withWatermark(expr("start"), "10 seconds")  # Watermark on
>> window-generated 'start'
>>
>> # Rest of the code remains the same
>> streaming_df.createOrReplaceTempView("streaming_df")
>>
>> spark.sql("""
>> SELECT
>>   window.start, window.end, provinceId, totalPayAmount
>> FROM streaming_df
>> ORDER BY window.start
>> """) \
>> .writeStream \
>> .format("kafka") \
>> .option("checkpointLocation", "checkpoint") \
>> .option("kafka.bootstrap.servers", "localhost:9092") \
>> .option("topic", "sink") \
>> .start()
>>
>> Try and see how it goes
>>
>> HTH
>>
>> Mich Talebzadeh,
>>
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner Von Braun)".
>>
>> Mich Talebzadeh,
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Tue, 2 Apr 2024 at 22:43, Chloe He 
>> wrote:
>>
>>> Hi Mich,
>>>
>>> Thank you so much for your response. I really appreciate your help!
>>>
>>> You mentioned "defining the watermark using the withWatermark function
>>> on the streaming_df before creating the temporary view” - I believe this is
>>> what I’m doing and it’s not working for me. Here is the exact code snippet
>>> that I’m running:
>>>
>>> ```
>>> >>> streaming_df = session.readStream\
>>> .format("kafka")\
>>> .option("kafka.bootstrap.servers", "localhost:9092")\
>>> .option("subscribe", "payment_msg")\
>>> .option("startingOffsets","earliest")\
>>> .load()\
>>> .select(from_json(col("value").cast("string"),
>>> schema).alias("parsed_value"))\
>>> .select("parsed_value.*")\
>>> .withWatermark("createTime", "10 seconds")
>>>
>>> >>> streaming_df.createOrReplaceTempView("streaming_df”)
>>>
>>> >>> spark.sql("""
>>> SELECT
>>> window.start, window.end, provinceId, sum(payAmount) as
>>> totalPayAmount
>>> FROM streaming

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Mich Talebzadeh
Hi,

I believe this is the package

https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala

And the code

case class FilePartition(index: Int, files: Array[PartitionedFile])
  extends Partition with InputPartition {
  override def preferredLocations(): Array[String] = {
    // Computes total number of bytes that can be retrieved from each host.
    val hostToNumBytes = mutable.HashMap.empty[String, Long]
    files.foreach { file =>
      file.locations.filter(_ != "localhost").foreach { host =>
        hostToNumBytes(host) = hostToNumBytes.getOrElse(host, 0L) + file.length
      }
    }

    // Selects the first 3 hosts with the most data to be retrieved.
    hostToNumBytes.toSeq.sortBy {
      case (host, numBytes) => numBytes
    }.reverse.take(3).map {
      case (host, numBytes) => host
    }.toArray
  }
}
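
If it helps to see the host-selection logic in isolation, below is a small
standalone Python sketch of the same idea (illustrative only, not part of any
Spark API):

```python
from collections import defaultdict

def preferred_locations(files):
    """files: iterable of (length_in_bytes, [host, ...]) pairs.

    Mirrors the Scala above: sum bytes per host, skip "localhost",
    and return the three hosts holding the most data.
    """
    host_to_bytes = defaultdict(int)
    for length, hosts in files:
        for host in hosts:
            if host != "localhost":
                host_to_bytes[host] += length
    ranked = sorted(host_to_bytes.items(), key=lambda kv: kv[1], reverse=True)
    return [host for host, _ in ranked[:3]]

# Example:
# preferred_locations([(100, ["h1", "h2"]), (300, ["h2"]), (50, ["h3", "localhost"])])
# returns ['h2', 'h1', 'h3']
```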

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 8 Apr 2024 at 20:31, Ashley McManamon <
ashley.mcmana...@quantcast.com> wrote:

> Hi All,
>
> I've been diving into the source code to get a better understanding of how
> file splitting works from a user perspective. I've hit a deadend at
> `PartitionedFile`, for which I cannot seem to find a definition? It appears
> though it should be found at
> org.apache.spark.sql.execution.datasources but I find no definition in the
> entire source code. Am I missing something?
>
> I appreciate there may be an obvious answer here, apologies if I'm being
> naive.
>
> Thanks,
> Ashley McManamon
>
>


[Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Ashley McManamon
Hi All,

I've been diving into the source code to get a better understanding of how
file splitting works from a user perspective. I've hit a deadend at
`PartitionedFile`, for which I cannot seem to find a definition? It appears
though it should be found at
org.apache.spark.sql.execution.datasources but I find no definition in the
entire source code. Am I missing something?

I appreciate there may be an obvious answer here, apologies if I'm being
naive.

Thanks,
Ashley McManamon


Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-06 Thread 刘唯
This indeed looks like a bug. I will take some time to look into it.

Mich Talebzadeh wrote on Wednesday, April 3, 2024 at 01:55:

>
> hm. you are getting below
>
> AnalysisException: Append output mode not supported when there are
> streaming aggregations on streaming DataFrames/DataSets without watermark;
>
> The problem seems to be that you are using the append output mode when
> writing the streaming query results to Kafka. This mode is designed for
> scenarios where you want to append new data to an existing dataset at the
> sink (in this case, the "sink" topic in Kafka). However, your query
> involves a streaming aggregation: group by provinceId, window('createTime',
> '1 hour', '30 minutes'). The problem is that Spark Structured Streaming
> requires a watermark to ensure exactly-once processing when using
> aggregations with append mode. Your code already defines a watermark on the
> "createTime" column with a delay of 10 seconds (withWatermark("createTime",
> "10 seconds")). However, the error message indicates it is missing on the
> start column. Try adding watermark to "start" Column: Modify your code as
> below  to include a watermark on the "start" column generated by the
> window function:
>
> from pyspark.sql.functions import col, from_json, explode, window, sum,
> watermark
>
> streaming_df = session.readStream \
>   .format("kafka") \
>   .option("kafka.bootstrap.servers", "localhost:9092") \
>   .option("subscribe", "payment_msg") \
>   .option("startingOffsets", "earliest") \
>   .load() \
>   .select(from_json(col("value").cast("string"),
> schema).alias("parsed_value")) \
>   .select("parsed_value.*") \
>   .withWatermark("createTime", "10 seconds")  # Existing watermark on
> createTime
>
> *# Modified section with watermark on 'start' column*
> streaming_df = streaming_df.groupBy(
>   col("provinceId"),
>   window(col("createTime"), "1 hour", "30 minutes")
> ).agg(
>   sum(col("payAmount")).alias("totalPayAmount")
> ).withWatermark(expr("start"), "10 seconds")  # Watermark on
> window-generated 'start'
>
> # Rest of the code remains the same
> streaming_df.createOrReplaceTempView("streaming_df")
>
> spark.sql("""
> SELECT
>   window.start, window.end, provinceId, totalPayAmount
> FROM streaming_df
> ORDER BY window.start
> """) \
> .writeStream \
> .format("kafka") \
> .option("checkpointLocation", "checkpoint") \
> .option("kafka.bootstrap.servers", "localhost:9092") \
> .option("topic", "sink") \
> .start()
>
> Try and see how it goes
>
> HTH
>
> Mich Talebzadeh,
>
> Technologist | Solutions Architect | Data Engineer  | Generative AI
>
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Tue, 2 Apr 2024 at 22:43, Chloe He 
> wrote:
>
>> Hi Mich,
>>
>> Thank you so much for your response. I really appreciate your help!
>>
>> You mentioned "defining the watermark using the withWatermark function on
>> the streaming_df before creating the temporary view” - I believe this is
>> what I’m doing and it’s not working for me. Here is the exact code snippet
>> that I’m running:
>>
>> ```
>> >>> streaming_df = session.readStream\
>> .format("kafka")\
>> .option("kafka.bootstrap.servers", "localhost:9092")\
>> .option("subscribe", "payment_msg")\
>> .option("startingOffsets","earliest")\
>> .load()\
>> .select(from_json(col("value").cast("string"),
>> schema).alias("parsed_value"))\
>> .select("parsed_value.*")\
>> .withWatermark("createTime", "10 seconds")
>>
>> >>> streaming_df.createOrReplaceTempView("streaming_df”)
>>
>> >>> spark.sql("""
>> SELECT
>> window.start, window.end, provinceId, sum(payAmount) as totalPayAmount
>> FROM streaming_df
>> GROUP BY provinceId, window('createTime', '1 hour', '30 minutes')
>> ORDER BY window.start
>> """)\
>>   .withWatermark("start", "10 seconds")\
>>   .writeStream\
>>   .format("kafka") \
>>   .option("checkpointLocation", "checkpoint") \
>>   .option("kafka.bootstrap.servers", "localhost:9092") \
>>   .option("topic", "sink") \
>>   .start()
>>
>> AnalysisException: Append output mode not 

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
hm. you are getting below

AnalysisException: Append output mode not supported when there are
streaming aggregations on streaming DataFrames/DataSets without watermark;

The problem seems to be that you are using the append output mode when
writing the streaming query results to Kafka. This mode is designed for
scenarios where you want to append new data to an existing dataset at the
sink (in this case, the "sink" topic in Kafka). However, your query
involves a streaming aggregation: group by provinceId, window('createTime',
'1 hour', '30 minutes'). The problem is that Spark Structured Streaming
requires a watermark to ensure exactly-once processing when using
aggregations with append mode. Your code already defines a watermark on the
"createTime" column with a delay of 10 seconds (withWatermark("createTime",
"10 seconds")). However, the error message indicates it is missing on the
start column. Try adding watermark to "start" Column: Modify your code as
below  to include a watermark on the "start" column generated by the window
function:

from pyspark.sql.functions import col, from_json, explode, window, sum,
watermark

streaming_df = session.readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "payment_msg") \
  .option("startingOffsets", "earliest") \
  .load() \
  .select(from_json(col("value").cast("string"),
schema).alias("parsed_value")) \
  .select("parsed_value.*") \
  .withWatermark("createTime", "10 seconds")  # Existing watermark on
createTime

*# Modified section with watermark on 'start' column*
streaming_df = streaming_df.groupBy(
  col("provinceId"),
  window(col("createTime"), "1 hour", "30 minutes")
).agg(
  sum(col("payAmount")).alias("totalPayAmount")
).withWatermark(expr("start"), "10 seconds")  # Watermark on
window-generated 'start'

# Rest of the code remains the same
streaming_df.createOrReplaceTempView("streaming_df")

spark.sql("""
SELECT
  window.start, window.end, provinceId, totalPayAmount
FROM streaming_df
ORDER BY window.start
""") \
.writeStream \
.format("kafka") \
.option("checkpointLocation", "checkpoint") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "sink") \
.start()

Try and see how it goes

HTH

Mich Talebzadeh,

Technologist | Solutions Architect | Data Engineer  | Generative AI

London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge
but of course cannot be guaranteed . It is essential to note that, as with
any advice, quote "one test result is worth one-thousand expert opinions
(Werner Von Braun)".

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".


On Tue, 2 Apr 2024 at 22:43, Chloe He  wrote:

> Hi Mich,
>
> Thank you so much for your response. I really appreciate your help!
>
> You mentioned "defining the watermark using the withWatermark function on
> the streaming_df before creating the temporary view” - I believe this is
> what I’m doing and it’s not working for me. Here is the exact code snippet
> that I’m running:
>
> ```
> >>> streaming_df = session.readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", "localhost:9092")\
> .option("subscribe", "payment_msg")\
> .option("startingOffsets","earliest")\
> .load()\
> .select(from_json(col("value").cast("string"),
> schema).alias("parsed_value"))\
> .select("parsed_value.*")\
> .withWatermark("createTime", "10 seconds")
>
> >>> streaming_df.createOrReplaceTempView("streaming_df”)
>
> >>> spark.sql("""
> SELECT
> window.start, window.end, provinceId, sum(payAmount) as totalPayAmount
> FROM streaming_df
> GROUP BY provinceId, window('createTime', '1 hour', '30 minutes')
> ORDER BY window.start
> """)\
>   .withWatermark("start", "10 seconds")\
>   .writeStream\
>   .format("kafka") \
>   .option("checkpointLocation", "checkpoint") \
>   .option("kafka.bootstrap.servers", "localhost:9092") \
>   .option("topic", "sink") \
>   .start()
>
> AnalysisException: Append output mode not supported when there are
> streaming aggregations on streaming DataFrames/DataSets without watermark;
> EventTimeWatermark start#37: timestamp, 10 seconds
> ```
>
> I’m using pyspark 3.5.1. Please let me know if I missed something. Thanks
> again!
>
> Best,
> Chloe
>
>
> On 2024/04/02 20:32:11 Mich Talebzadeh wrote:
> > ok let us tak

RE: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
Hi Mich,

Thank you so much for your response. I really appreciate your help!

You mentioned "defining the watermark using the withWatermark function on the 
streaming_df before creating the temporary view” - I believe this is what I’m 
doing and it’s not working for me. Here is the exact code snippet that I’m 
running:

```
>>> streaming_df = session.readStream\
.format("kafka")\
.option("kafka.bootstrap.servers", "localhost:9092")\
.option("subscribe", "payment_msg")\
.option("startingOffsets","earliest")\
.load()\
.select(from_json(col("value").cast("string"), 
schema).alias("parsed_value"))\
.select("parsed_value.*")\
.withWatermark("createTime", "10 seconds")

>>> streaming_df.createOrReplaceTempView("streaming_df”)

>>> spark.sql("""
SELECT
window.start, window.end, provinceId, sum(payAmount) as totalPayAmount
FROM streaming_df
GROUP BY provinceId, window('createTime', '1 hour', '30 minutes')
ORDER BY window.start
""")\
  .withWatermark("start", "10 seconds")\
  .writeStream\
  .format("kafka") \
  .option("checkpointLocation", "checkpoint") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("topic", "sink") \
  .start()

AnalysisException: Append output mode not supported when there are streaming 
aggregations on streaming DataFrames/DataSets without watermark;
EventTimeWatermark start#37: timestamp, 10 seconds
```

I’m using pyspark 3.5.1. Please let me know if I missed something. Thanks again!

Best,
Chloe


On 2024/04/02 20:32:11 Mich Talebzadeh wrote:
> ok let us take it for a test.
> 
> The original code of mine
> 
> def fetch_data(self):
> self.sc.setLogLevel("ERROR")
> schema = StructType() \
>  .add("rowkey", StringType()) \
>  .add("timestamp", TimestampType()) \
>  .add("temperature", IntegerType())
> checkpoint_path = "file:///ssd/hduser/avgtemperature/chkpt"
> try:
> 
> # construct a streaming dataframe 'streamingDataFrame' that
> subscribes to topic temperature
> streamingDataFrame = self.spark \
> .readStream \
> .format("kafka") \
> .option("kafka.bootstrap.servers",
> config['MDVariables']['bootstrapServers'],) \
> .option("schema.registry.url",
> config['MDVariables']['schemaRegistryURL']) \
> .option("group.id", config['common']['appName']) \
> .option("zookeeper.connection.timeout.ms",
> config['MDVariables']['zookeeperConnectionTimeoutMs']) \
> .option("rebalance.backoff.ms",
> config['MDVariables']['rebalanceBackoffMS']) \
> .option("zookeeper.session.timeout.ms",
> config['MDVariables']['zookeeperSessionTimeOutMs']) \
> .option("auto.commit.interval.ms",
> config['MDVariables']['autoCommitIntervalMS']) \
> .option("subscribe", "temperature") \
> .option("failOnDataLoss", "false") \
> .option("includeHeaders", "true") \
> .option("startingOffsets", "earliest") \
> .load() \
> .select(from_json(col("value").cast("string"),
> schema).alias("parsed_value"))
> 
> 
> resultC = streamingDataFrame.select( \
>  col("parsed_value.rowkey").alias("rowkey") \
>, col("parsed_value.timestamp").alias("timestamp") \
>, col("parsed_value.temperature").alias("temperature"))
> 
> """
> We work out the window and the AVG(temperature) in the window's
> timeframe below
> This should return back the following Dataframe as struct
> 
>  root
>  |-- window: struct (nullable = false)
>  ||-- start: timestamp (nullable = true)
>  ||-- end: timestamp (nullable = true)
>  |-- avg(temperature): double (nullable = true)
> 
> """
> resultM = resultC. \
>  withWatermark("timestamp", "5 minutes"). \
>  groupBy(window(resultC.timestamp, "5 minutes", "5
> minutes")). \
>  avg('temperature')
> 
> # We take the above DataFrame and flatten it to get the columns
> aliased as "startOfWindowFrame", "endOfWindowFrame" and "AVGTemperature"
> resultMF = resultM. \
>select( \
> F.col("window.start").alias("startOfWindow") \
>   , F.col("window.end").alias("endOfWindow") \
>   ,
> F.col("avg(temperature)").alias("AVGTemperature"))
> 
> # Kafka producer requires a key, value pair. We generate UUID
> key as the unique identifier of Kafka record
> uuidUdf= F.udf(lambda : str(uuid.uuid4()),StringType())
> 
> """
> We take DataFrame resultMF containing temperature info and
> write it to Kafka. The uuid is serialized as a str

Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
ok let us take it for a test.

The original code of mine

def fetch_data(self):
self.sc.setLogLevel("ERROR")
schema = StructType() \
 .add("rowkey", StringType()) \
 .add("timestamp", TimestampType()) \
 .add("temperature", IntegerType())
checkpoint_path = "file:///ssd/hduser/avgtemperature/chkpt"
try:

# construct a streaming dataframe 'streamingDataFrame' that
subscribes to topic temperature
streamingDataFrame = self.spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers",
config['MDVariables']['bootstrapServers'],) \
.option("schema.registry.url",
config['MDVariables']['schemaRegistryURL']) \
.option("group.id", config['common']['appName']) \
.option("zookeeper.connection.timeout.ms",
config['MDVariables']['zookeeperConnectionTimeoutMs']) \
.option("rebalance.backoff.ms",
config['MDVariables']['rebalanceBackoffMS']) \
.option("zookeeper.session.timeout.ms",
config['MDVariables']['zookeeperSessionTimeOutMs']) \
.option("auto.commit.interval.ms",
config['MDVariables']['autoCommitIntervalMS']) \
.option("subscribe", "temperature") \
.option("failOnDataLoss", "false") \
.option("includeHeaders", "true") \
.option("startingOffsets", "earliest") \
.load() \
.select(from_json(col("value").cast("string"),
schema).alias("parsed_value"))


resultC = streamingDataFrame.select( \
 col("parsed_value.rowkey").alias("rowkey") \
   , col("parsed_value.timestamp").alias("timestamp") \
   , col("parsed_value.temperature").alias("temperature"))

"""
We work out the window and the AVG(temperature) in the window's
timeframe below
This should return back the following Dataframe as struct

 root
 |-- window: struct (nullable = false)
 ||-- start: timestamp (nullable = true)
 ||-- end: timestamp (nullable = true)
 |-- avg(temperature): double (nullable = true)

"""
resultM = resultC. \
 withWatermark("timestamp", "5 minutes"). \
 groupBy(window(resultC.timestamp, "5 minutes", "5
minutes")). \
 avg('temperature')

# We take the above DataFrame and flatten it to get the columns
aliased as "startOfWindowFrame", "endOfWindowFrame" and "AVGTemperature"
resultMF = resultM. \
   select( \
F.col("window.start").alias("startOfWindow") \
  , F.col("window.end").alias("endOfWindow") \
  ,
F.col("avg(temperature)").alias("AVGTemperature"))

# Kafka producer requires a key, value pair. We generate UUID
key as the unique identifier of Kafka record
uuidUdf= F.udf(lambda : str(uuid.uuid4()),StringType())

"""
We take DataFrame resultMF containing temperature info and
write it to Kafka. The uuid is serialized as a string and used as the key.
We take all the columns of the DataFrame and serialize them as
a JSON string, putting the results in the "value" of the record.
"""
result = resultMF.withColumn("uuid",uuidUdf()) \
 .selectExpr("CAST(uuid AS STRING) AS key",
"to_json(struct(startOfWindow, endOfWindow, AVGTemperature)) AS value") \
 .writeStream \
 .outputMode('complete') \
 .format("kafka") \
 .option("kafka.bootstrap.servers",
config['MDVariables']['bootstrapServers'],) \
 .option("topic", "avgtemperature") \
 .option('checkpointLocation', checkpoint_path) \
 .queryName("avgtemperature") \
 .start()

except Exception as e:
print(f"""{e}, quitting""")
sys.exit(1)

#print(result.status)
#print(result.recentProgress)
#print(result.lastProgress)

result.awaitTermination()

Now try to use SQL for the entire transformation and aggregation

#import this and anything else needed
from pyspark.sql.functions import from_json, col, window
from pyspark.sql.types import StructType, StringType,IntegerType,
FloatType, TimestampType


# Define the schema for the JSON data
schema = ... # Replace with your schema definition

# construct a streaming dataframe 'streamingDataFrame' that
subscribes to topic temperature
streamingDataFrame = self.spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers",
config['MDVariables']['bootstrapServers'],) \

[Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
Hello!

I am attempting to write a streaming pipeline that would consume data from a 
Kafka source, manipulate the data, and then write results to a downstream sink 
(Kafka, Redis, etc). I want to write fully formed SQL instead of using the 
function API that Spark offers. I read a few guides on how to do this and my 
understanding is that I need to create a temp view in order to execute my raw 
SQL queries via spark.sql(). 

However, I’m having trouble defining watermarks on my source. It doesn’t seem 
like there is a way to introduce watermark in the raw SQL that Spark supports, 
so I’m using the .withWatermark() function. However, this watermark does not 
work on the temp view.

Example code:
```
streaming_df.select(from_json(col("value").cast("string"), 
schema).alias("parsed_value")).select("parsed_value.*").withWatermark("createTime",
 "10 seconds”)

json_df.createOrReplaceTempView("json_df”)

session.sql("""
SELECT
window.start, window.end, provinceId, sum(payAmount) as totalPayAmount
FROM json_df
GROUP BY provinceId, window('createTime', '1 hour', '30 minutes')
ORDER BY window.start
""")\
.writeStream\
.format("kafka") \
.option("checkpointLocation", "checkpoint") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "sink") \
.start()
```
This throws
```
AnalysisException: Append output mode not supported when there are streaming 
aggregations on streaming DataFrames/DataSets without watermark;
```

If I switch out the SQL query and write it in the function API instead, 
everything seems to work fine.

How can I use .sql() in conjunction with watermarks?

Best,
Chloe

[Spark SQL]: Crash when attempting to select PostgreSQL bpchar without length specifier in Spark 3.5.0

2024-01-29 Thread Lily Hahn
Hi,

I’m currently migrating an ETL project to Spark 3.5.0 from 3.2.1 and ran into 
an issue with some of our queries that read from PostgreSQL databases.

Any attempt to run a Spark SQL query that selects a bpchar without a length 
specifier from the source DB seems to crash:
py4j.protocol.Py4JJavaError: An error occurred while calling 
o1061.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0) (192.168.1.48 executor driver): java.lang.OutOfMemoryError: Requested array 
size exceeds VM limit
at org.apache.spark.unsafe.types.UTF8String.rpad(UTF8String.java:880)
at 
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils.readSidePadding(CharVarcharCodegenUtils.java:62)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at 
org.apache.spark.executor.Executor$TaskRunner$$Lambda$2882/0x00080124d840.apply(Unknown
 Source)
at 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)

This appears to be the plan step related to the traceback:
staticinvoke(class org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, 
StringType, readSidePadding, bpcharcolumn#7, 2147483647, true, false, true)

Reading from a subquery and casting the column to varchar appears to work
around the issue.
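
For anyone hitting the same thing, here is a rough sketch of that workaround
(connection details, table and column names are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bpchar-workaround-sketch").getOrCreate()

# The cast to varchar happens inside the pushed-down query, so Spark never has
# to map the bare bpchar type itself.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("user", "db_user")          # placeholder credentials
      .option("password", "db_password")
      .option("query", "SELECT id, bpcharcolumn::varchar AS bpcharcolumn FROM my_table")
      .load())

# If the read-side padding itself is the problem, Spark 3.4+ also has a session
# conf that can be switched off (check the docs for side effects before relying
# on it):
# spark.conf.set("spark.sql.readSideCharPadding", "false")
```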

In PostgreSQL, the bpchar type acts as a variable unlimited length, 
blank-trimmed string if the length is omitted from the definition.

Is this an issue with Spark? I think the column is incorrectly getting 
interpreted as a char, which behaves the same way as a bpchar(n).


Re: Validate spark sql

2023-12-26 Thread Gourav Sengupta
Dear friend,

thanks a ton, I was looking for linting for SQL for a long time; it looks like
https://sqlfluff.com/ is something that can be used :)
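
In case it saves someone a search, sqlfluff also exposes a small Python API, so
a lint pass can be scripted; a quick sketch (the dialect name and the shape of
the returned violations are my assumptions from the sqlfluff docs, so
double-check against the version you install):

```python
# pip install sqlfluff
import sqlfluff

sql = "SELECT id, amount FROM sales WHERE amount > 100"

# lint() returns a list of violation records; an empty list means nothing was flagged.
violations = sqlfluff.lint(sql, dialect="sparksql")
for violation in violations:
    print(violation)
```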

Thank you so much, and wish you all a wonderful new year.

Regards,
Gourav

On Tue, Dec 26, 2023 at 4:42 AM Bjørn Jørgensen 
wrote:

> You can try sqlfluff  it's a linter for SQL code
> and it seems to have support for sparksql
> 
>
>
> On Mon, Dec 25, 2023 at 17:13, ram manickam wrote:
>
>> Thanks Mich, Nicholas. I tried looking over the stack overflow post and
>> none of them
>> Seems to cover the syntax validation. Do you know if it's even possible
>> to do syntax validation in spark?
>>
>> Thanks
>> Ram
>>
>> On Sun, Dec 24, 2023 at 12:49 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Well not to put too finer point on it, in a public forum, one ought to
>>> respect the importance of open communication. Everyone has the right to ask
>>> questions, seek information, and engage in discussions without facing
>>> unnecessary patronization.
>>>
>>>
>>>
>>> Mich Talebzadeh,
>>> Dad | Technologist | Solutions Architect | Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sun, 24 Dec 2023 at 18:27, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 This is a user-list question, not a dev-list question. Moving this
 conversation to the user list and BCC-ing the dev list.

 Also, this statement

 > We are not validating against table or column existence.

 is not correct. When you call spark.sql(…), Spark will lookup the table
 references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them.

 Also, when you run DDL via spark.sql(…), Spark will actually run it. So
 spark.sql(“drop table my_table”) will actually drop my_table. It’s not a
 validation-only operation.

 This question of validating SQL is already discussed on Stack Overflow
 . You may find some
 useful tips there.

 Nick


 On Dec 24, 2023, at 4:52 AM, Mich Talebzadeh 
 wrote:


 Yes, you can validate the syntax of your PySpark SQL queries without
 connecting to an actual dataset or running the queries on a cluster.
 PySpark provides a method for syntax validation without executing the
 query. Something like below
   __
  / __/__  ___ _/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/__ / .__/\_,_/_/ /_/\_\   version 3.4.0
   /_/

 Using Python version 3.9.16 (main, Apr 24 2023 10:36:11)
 Spark context Web UI available at http://rhes75:4040
 Spark context available as 'sc' (master = local[*], app id =
 local-1703410019374).
 SparkSession available as 'spark'.
 >>> from pyspark.sql import SparkSession
 >>> spark = SparkSession.builder.appName("validate").getOrCreate()
 23/12/24 09:28:02 WARN SparkSession: Using an existing Spark session;
 only runtime SQL configurations will take effect.
 >>> sql = "SELECT * FROM  WHERE  = some value"
 >>> try:
 ...   spark.sql(sql)
 ...   print("is working")
 ... except Exception as e:
 ...   print(f"Syntax error: {e}")
 ...
 Syntax error:
 [PARSE_SYNTAX_ERROR] Syntax error at or near '<'.(line 1, pos 14)

 == SQL ==
 SELECT * FROM  WHERE  = some value
 --^^^

 Here we only check for syntax errors and not the actual existence of
 query semantics. We are not validating against table or column existence.

 This method is useful when you want to catch obvious syntax errors
 before submitting your PySpark job to a cluster, especially when you don't
 have access to the actual data.

 In summary

- This method validates syntax but will not catch semantic errors
- If you need more comprehensive validation, consider using a
testing framework and a small dataset.
- For complex queries, using a linter or code analysis tool can
help identify potential issues.

 HTH


 Mich Talebzadeh,
 Dad | Technologist | Solutions Architect | Engineer
 London
 United Kingdom

view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh


 *Disclaimer:* Use it at y

Re: Validate spark sql

2023-12-26 Thread Mich Talebzadeh
Worth trying the EXPLAIN statement
<https://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html>
as suggested by @tianlangstudio.
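
A quick sketch of driving that from PySpark (the table name is hypothetical;
note that EXPLAIN resolves table and column references, so it checks more than
bare syntax):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("explain-check-sketch").getOrCreate()

stmt = "SELECT id, amount FROM sales WHERE amount > 100"   # hypothetical table

try:
    # EXPLAIN parses and analyses the statement and returns the plan as text,
    # without running the query itself.
    plan = spark.sql(f"EXPLAIN {stmt}").collect()[0][0]
    print(plan)
except Exception as e:
    print(f"Statement rejected: {e}")
```
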
HTH
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 25 Dec 2023 at 11:15, tianlangstudio 
wrote:

>
> What about EXPLAIN?
> https://spark.apache.org/docs/3.5.0/sql-ref-syntax-qry-explain.html#content
>
>
>
><https://www.upwork.com/fl/huanqingzhu>
> <https://www.tianlang.tech/>Fusion Zhu <https://www.tianlang.tech/>
>
> --
> From: ram manickam
> Sent: Monday, December 25, 2023, 12:58
> To: Mich Talebzadeh
> Cc: Nicholas Chammas; user <user@spark.apache.org>
> Subject: Re: Validate spark sql
>
> Thanks Mich, Nicholas. I tried looking over the stack overflow post and
> none of them
> Seems to cover the syntax validation. Do you know if it's even possible to
> do syntax validation in spark?
>
> Thanks
> Ram
>
> On Sun, Dec 24, 2023 at 12:49 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
> Well not to put too finer point on it, in a public forum, one ought to
> respect the importance of open communication. Everyone has the right to ask
> questions, seek information, and engage in discussions without facing
> unnecessary patronization.
>
>
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: Use it at your own risk.Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 24 Dec 2023 at 18:27, Nicholas Chammas 
> wrote:
> This is a user-list question, not a dev-list question. Moving this
> conversation to the user list and BCC-ing the dev list.
>
> Also, this statement
>
> > We are not validating against table or column existence.
>
> is not correct. When you call spark.sql(…), Spark will lookup the table
> references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them.
>
> Also, when you run DDL via spark.sql(…), Spark will actually run it. So
> spark.sql(“drop table my_table”) will actually drop my_table. It’s not a
> validation-only operation.
>
> This question of validating SQL is already discussed on Stack Overflow
> <https://stackoverflow.com/q/46973729/877069>. You may find some useful
> tips there.
>
> Nick
>
>
> On Dec 24, 2023, at 4:52 AM, Mich Talebzadeh 
> wrote:
>
>
> Yes, you can validate the syntax of your PySpark SQL queries without
> connecting to an actual dataset or running the queries on a cluster.
> PySpark provides a method for syntax validation without executing the
> query. Something like below
>   __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 3.4.0
>   /_/
> Using Python version 3.9.16 (main, Apr 24 2023 10:36:11)
> Spark context Web UI available at http://rhes75:4040
> Spark context available as 'sc' (master = local[*], app id =
> local-1703410019374).
> SparkSession available as 'spark'.
> >>> from pyspark.sql import SparkSession
> >>> spark = SparkSession.builder.appName("validate").getOrCreate()
> 23/12/24 09:28:02 WARN SparkSession: Using an existing Spark session; only
> runtime SQL configurations will take effect.
> >>> sql = "SELECT * FROM  WHERE  = some value"
> >>> try:
> ...   spark.sql(sql)
> ...   print("is working")
> ... except Exception as e:
> ...   print(f"Syntax error: {e}")
> ...
> Syntax error:
> [PARSE_SYNTAX_ERROR] Syntax error at or near '<'.(line 1, pos 14)
> == SQL ==
> SELECT * FROM  WHERE  = some value
> --^^^
>
> Here we only check for syntax errors and not the act

Re: Validate spark sql

2023-12-25 Thread Bjørn Jørgensen
You can try sqlfluff; it's a linter for SQL code and it seems to have support
for sparksql.


On Mon, Dec 25, 2023 at 17:13, ram manickam wrote:

> Thanks Mich, Nicholas. I tried looking over the stack overflow post and
> none of them
> Seems to cover the syntax validation. Do you know if it's even possible to
> do syntax validation in spark?
>
> Thanks
> Ram
>
> On Sun, Dec 24, 2023 at 12:49 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Well not to put too finer point on it, in a public forum, one ought to
>> respect the importance of open communication. Everyone has the right to ask
>> questions, seek information, and engage in discussions without facing
>> unnecessary patronization.
>>
>>
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sun, 24 Dec 2023 at 18:27, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> This is a user-list question, not a dev-list question. Moving this
>>> conversation to the user list and BCC-ing the dev list.
>>>
>>> Also, this statement
>>>
>>> > We are not validating against table or column existence.
>>>
>>> is not correct. When you call spark.sql(…), Spark will lookup the table
>>> references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them.
>>>
>>> Also, when you run DDL via spark.sql(…), Spark will actually run it. So
>>> spark.sql(“drop table my_table”) will actually drop my_table. It’s not a
>>> validation-only operation.
>>>
>>> This question of validating SQL is already discussed on Stack Overflow
>>> . You may find some useful
>>> tips there.
>>>
>>> Nick
>>>
>>>
>>> On Dec 24, 2023, at 4:52 AM, Mich Talebzadeh 
>>> wrote:
>>>
>>>
>>> Yes, you can validate the syntax of your PySpark SQL queries without
>>> connecting to an actual dataset or running the queries on a cluster.
>>> PySpark provides a method for syntax validation without executing the
>>> query. Something like below
>>>   __
>>>  / __/__  ___ _/ /__
>>> _\ \/ _ \/ _ `/ __/  '_/
>>>/__ / .__/\_,_/_/ /_/\_\   version 3.4.0
>>>   /_/
>>>
>>> Using Python version 3.9.16 (main, Apr 24 2023 10:36:11)
>>> Spark context Web UI available at http://rhes75:4040
>>> Spark context available as 'sc' (master = local[*], app id =
>>> local-1703410019374).
>>> SparkSession available as 'spark'.
>>> >>> from pyspark.sql import SparkSession
>>> >>> spark = SparkSession.builder.appName("validate").getOrCreate()
>>> 23/12/24 09:28:02 WARN SparkSession: Using an existing Spark session;
>>> only runtime SQL configurations will take effect.
>>> >>> sql = "SELECT * FROM  WHERE  = some value"
>>> >>> try:
>>> ...   spark.sql(sql)
>>> ...   print("is working")
>>> ... except Exception as e:
>>> ...   print(f"Syntax error: {e}")
>>> ...
>>> Syntax error:
>>> [PARSE_SYNTAX_ERROR] Syntax error at or near '<'.(line 1, pos 14)
>>>
>>> == SQL ==
>>> SELECT * FROM  WHERE  = some value
>>> --^^^
>>>
>>> Here we only check for syntax errors and not the actual existence of
>>> query semantics. We are not validating against table or column existence.
>>>
>>> This method is useful when you want to catch obvious syntax errors
>>> before submitting your PySpark job to a cluster, especially when you don't
>>> have access to the actual data.
>>>
>>> In summary
>>>
>>>- This method validates syntax but will not catch semantic errors
>>>- If you need more comprehensive validation, consider using a
>>>testing framework and a small dataset.
>>>- For complex queries, using a linter or code analysis tool can help
>>>identify potential issues.
>>>
>>> HTH
>>>
>>>
>>> Mich Talebzadeh,
>>> Dad | Technologist | Solutions Architect | Engineer
>>> London
>>> United Kingdom
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sun, 24 Dec 2023 at 07:57, ram manickam  wrote:
>>>
 Hello,
 Is ther

Re: Validate spark sql

2023-12-25 Thread Bjørn Jørgensen
Mailing lists

For broad or opinion-based questions, requests for external resources,
debugging issues, bugs, contributing to the project, and scenarios, it is
recommended you use the user@spark.apache.org mailing list.

   - user@spark.apache.org is for usage questions, help, and announcements
     (subscribe / unsubscribe / archives)
   - d...@spark.apache.org is for people who want to contribute code to Spark
     (subscribe / unsubscribe / archives)



On Mon, Dec 25, 2023 at 04:58, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Well not to put too finer point on it, in a public forum, one ought to
> respect the importance of open communication. Everyone has the right to ask
> questions, seek information, and engage in discussions without facing
> unnecessary patronization.
>
>
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 24 Dec 2023 at 18:27, Nicholas Chammas 
> wrote:
>
>> This is a user-list question, not a dev-list question. Moving this
>> conversation to the user list and BCC-ing the dev list.
>>
>> Also, this statement
>>
>> > We are not validating against table or column existence.
>>
>> is not correct. When you call spark.sql(…), Spark will lookup the table
>> references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them.
>>
>> Also, when you run DDL via spark.sql(…), Spark will actually run it. So
>> spark.sql(“drop table my_table”) will actually drop my_table. It’s not a
>> validation-only operation.
>>
>> This question of validating SQL is already discussed on Stack Overflow
>> . You may find some useful
>> tips there.
>>
>> Nick
>>
>>
>> On Dec 24, 2023, at 4:52 AM, Mich Talebzadeh 
>> wrote:
>>
>>
>> Yes, you can validate the syntax of your PySpark SQL queries without
>> connecting to an actual dataset or running the queries on a cluster.
>> PySpark provides a method for syntax validation without executing the
>> query. Something like below
>>   __
>>  / __/__  ___ _/ /__
>> _\ \/ _ \/ _ `/ __/  '_/
>>/__ / .__/\_,_/_/ /_/\_\   version 3.4.0
>>   /_/
>>
>> Using Python version 3.9.16 (main, Apr 24 2023 10:36:11)
>> Spark context Web UI available at http://rhes75:4040
>> Spark context available as 'sc' (master = local[*], app id =
>> local-1703410019374).
>> SparkSession available as 'spark'.
>> >>> from pyspark.sql import SparkSession
>> >>> spark = SparkSession.builder.appName("validate").getOrCreate()
>> 23/12/24 09:28:02 WARN SparkSession: Using an existing Spark session;
>> only runtime SQL configurations will take effect.
>> >>> sql = "SELECT * FROM  WHERE  = some value"
>> >>> try:
>> ...   spark.sql(sql)
>> ...   print("is working")
>> ... except Exception as e:
>> ...   print(f"Syntax error: {e}")
>> ...
>> Syntax error:
>> [PARSE_SYNTAX_ERROR] Syntax error at or near '<'.(line 1, pos 14)
>>
>> == SQL ==
>> SELECT * FROM  WHERE  = some value
>> --^^^
>>
>> Here we only check for syntax errors and not the actual existence of
>> query semantics. We are not validating against table or column existence.
>>
>> This method is useful when you want to catch obvious syntax errors before
>> submitting your PySpark job to a cluster, especially when you don't have
>> access to the actual data.
>>
>> In summary
>>
>>- This method validates syntax but will not catch semantic errors
>>- If you need more comprehensive validation, consider using a testing
>>framework and a small dataset.
>>- For complex queries, using a linter or code analysis tool can help
>>identify potential issues.
>>
>> HTH
>>
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from suc

回复:Validate spark sql

2023-12-25 Thread tianlangstudio
What about EXPLAIN?
https://spark.apache.org/docs/3.5.0/sql-ref-syntax-qry-explain.html#content

<https://www.upwork.com/fl/huanqingzhu>
Fusion Zhu <https://www.tianlang.tech/>
--
From: ram manickam
Sent: Monday, December 25, 2023, 12:58
To: Mich Talebzadeh
Cc: Nicholas Chammas; user
Subject: Re: Validate spark sql
Thanks Mich, Nicholas. I tried looking over the stack overflow post and none of 
them
Seems to cover the syntax validation. Do you know if it's even possible to do 
syntax validation in spark?
Thanks
Ram
On Sun, Dec 24, 2023 at 12:49 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
Well not to put too finer point on it, in a public forum, one ought to respect 
the importance of open communication. Everyone has the right to ask questions, 
seek information, and engage in discussions without facing unnecessary 
patronization.
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom
view my Linkedin profile 
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/ >
https://en.everybodywiki.com/Mich_Talebzadeh 
<https://en.everybodywiki.com/Mich_Talebzadeh >
Disclaimer: Use it at your own risk.Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction. 
On Sun, 24 Dec 2023 at 18:27, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
This is a user-list question, not a dev-list question. Moving this conversation 
to the user list and BCC-ing the dev list.
Also, this statement
> We are not validating against table or column existence.
is not correct. When you call spark.sql(…), Spark will lookup the table 
references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them.
Also, when you run DDL via spark.sql(…), Spark will actually run it. So 
spark.sql(“drop table my_table”) will actually drop my_table. It’s not a 
validation-only operation.
This question of validating SQL is already discussed on Stack Overflow 
<https://stackoverflow.com/q/46973729/877069 >. You may find some useful tips 
there.
Nick
On Dec 24, 2023, at 4:52 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
Yes, you can validate the syntax of your PySpark SQL queries without connecting 
to an actual dataset or running the queries on a cluster. 
PySpark provides a method for syntax validation without executing the query. 
Something like below
 __
 / __/__ ___ _/ /__
 _\ \/ _ \/ _ `/ __/ '_/
 /__ / .__/\_,_/_/ /_/\_\ version 3.4.0
 /_/
Using Python version 3.9.16 (main, Apr 24 2023 10:36:11)
Spark context Web UI available at http://rhes75:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1703410019374).
SparkSession available as 'spark'.
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("validate").getOrCreate()
23/12/24 09:28:02 WARN SparkSession: Using an existing Spark session; only 
runtime SQL configurations will take effect.
>>> sql = "SELECT * FROM  WHERE  = some value"
>>> try:
... spark.sql(sql)
... print("is working")
... except Exception as e:
... print(f"Syntax error: {e}")
...
Syntax error:
[PARSE_SYNTAX_ERROR] Syntax error at or near '<'.(line 1, pos 14)
== SQL ==
SELECT * FROM  WHERE  = some value
--^^^
Here we only check for syntax errors and not the actual existence of query 
semantics. We are not validating against table or column existence.
This method is useful when you want to catch obvious syntax errors before 
submitting your PySpark job to a cluster, especially when you don't have access 
to the actual data.
In summary

 * This method validates syntax but will not catch semantic errors.
 * If you need more comprehensive validation, consider using a testing framework
   and a small dataset.
 * For complex queries, using a linter or code analysis tool can help identify
   potential issues.
HTH
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom
view my Linkedin profile 
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/ >
https://en.everybodywiki.com/Mich_Talebzadeh 
<https://en.everybodywiki.com/Mich_Talebzadeh >
Disclaimer: Use it at your own risk.Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages ari

Re: Validate spark sql

2023-12-24 Thread ram manickam
Thanks Mich, Nicholas. I tried looking over the Stack Overflow post and none of
them seems to cover syntax validation. Do you know if it's even possible to do
syntax validation in Spark?

Thanks
Ram

On Sun, Dec 24, 2023 at 12:49 PM Mich Talebzadeh 
wrote:

> Well not to put too finer point on it, in a public forum, one ought to
> respect the importance of open communication. Everyone has the right to ask
> questions, seek information, and engage in discussions without facing
> unnecessary patronization.
>
>
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 24 Dec 2023 at 18:27, Nicholas Chammas 
> wrote:
>
>> This is a user-list question, not a dev-list question. Moving this
>> conversation to the user list and BCC-ing the dev list.
>>
>> Also, this statement
>>
>> > We are not validating against table or column existence.
>>
>> is not correct. When you call spark.sql(…), Spark will lookup the table
>> references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them.
>>
>> Also, when you run DDL via spark.sql(…), Spark will actually run it. So
>> spark.sql(“drop table my_table”) will actually drop my_table. It’s not a
>> validation-only operation.
>>
>> This question of validating SQL is already discussed on Stack Overflow
>> . You may find some useful
>> tips there.
>>
>> Nick
>>
>>
>> On Dec 24, 2023, at 4:52 AM, Mich Talebzadeh 
>> wrote:
>>
>>
>> Yes, you can validate the syntax of your PySpark SQL queries without
>> connecting to an actual dataset or running the queries on a cluster.
>> PySpark provides a method for syntax validation without executing the
>> query. Something like below
>>   __
>>  / __/__  ___ _/ /__
>> _\ \/ _ \/ _ `/ __/  '_/
>>/__ / .__/\_,_/_/ /_/\_\   version 3.4.0
>>   /_/
>>
>> Using Python version 3.9.16 (main, Apr 24 2023 10:36:11)
>> Spark context Web UI available at http://rhes75:4040
>> Spark context available as 'sc' (master = local[*], app id =
>> local-1703410019374).
>> SparkSession available as 'spark'.
>> >>> from pyspark.sql import SparkSession
>> >>> spark = SparkSession.builder.appName("validate").getOrCreate()
>> 23/12/24 09:28:02 WARN SparkSession: Using an existing Spark session;
>> only runtime SQL configurations will take effect.
>> >>> sql = "SELECT * FROM  WHERE  = some value"
>> >>> try:
>> ...   spark.sql(sql)
>> ...   print("is working")
>> ... except Exception as e:
>> ...   print(f"Syntax error: {e}")
>> ...
>> Syntax error:
>> [PARSE_SYNTAX_ERROR] Syntax error at or near '<'.(line 1, pos 14)
>>
>> == SQL ==
>> SELECT * FROM  WHERE  = some value
>> --^^^
>>
>> Here we only check for syntax errors and not the actual existence of
>> query semantics. We are not validating against table or column existence.
>>
>> This method is useful when you want to catch obvious syntax errors before
>> submitting your PySpark job to a cluster, especially when you don't have
>> access to the actual data.
>>
>> In summary
>>
>>- This method validates syntax but will not catch semantic errors
>>- If you need more comprehensive validation, consider using a testing
>>framework and a small dataset.
>>- For complex queries, using a linter or code analysis tool can help
>>identify potential issues.
>>
>> HTH
>>
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sun, 24 Dec 2023 at 07:57, ram manickam  wrote:
>>
>>> Hello,
>>> Is there a way to validate pyspark sql to validate only syntax errors?.
>>> I cannot connect do actual data set to perform this validation.  Any
>>> help would be appreciated.
>>>
>>>
>>> Thanks
>>> Ram
>>>
>>
>>


Re: Validate spark sql

2023-12-24 Thread Mich Talebzadeh
Well, not to put too fine a point on it: in a public forum, one ought to
respect the importance of open communication. Everyone has the right to ask
questions, seek information, and engage in discussions without facing
unnecessary patronization.



Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 24 Dec 2023 at 18:27, Nicholas Chammas 
wrote:

> This is a user-list question, not a dev-list question. Moving this
> conversation to the user list and BCC-ing the dev list.
>
> Also, this statement
>
> > We are not validating against table or column existence.
>
> is not correct. When you call spark.sql(…), Spark will lookup the table
> references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them.
>
> Also, when you run DDL via spark.sql(…), Spark will actually run it. So
> spark.sql(“drop table my_table”) will actually drop my_table. It’s not a
> validation-only operation.
>
> This question of validating SQL is already discussed on Stack Overflow
> . You may find some useful
> tips there.
>
> Nick
>
>
> On Dec 24, 2023, at 4:52 AM, Mich Talebzadeh 
> wrote:
>
>
> Yes, you can validate the syntax of your PySpark SQL queries without
> connecting to an actual dataset or running the queries on a cluster.
> PySpark provides a method for syntax validation without executing the
> query. Something like below
>   __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 3.4.0
>   /_/
>
> Using Python version 3.9.16 (main, Apr 24 2023 10:36:11)
> Spark context Web UI available at http://rhes75:4040
> Spark context available as 'sc' (master = local[*], app id =
> local-1703410019374).
> SparkSession available as 'spark'.
> >>> from pyspark.sql import SparkSession
> >>> spark = SparkSession.builder.appName("validate").getOrCreate()
> 23/12/24 09:28:02 WARN SparkSession: Using an existing Spark session; only
> runtime SQL configurations will take effect.
> >>> sql = "SELECT * FROM  WHERE  = some value"
> >>> try:
> ...   spark.sql(sql)
> ...   print("is working")
> ... except Exception as e:
> ...   print(f"Syntax error: {e}")
> ...
> Syntax error:
> [PARSE_SYNTAX_ERROR] Syntax error at or near '<'.(line 1, pos 14)
>
> == SQL ==
> SELECT * FROM  WHERE  = some value
> --^^^
>
> Here we only check for syntax errors and not the actual existence of query
> semantics. We are not validating against table or column existence.
>
> This method is useful when you want to catch obvious syntax errors before
> submitting your PySpark job to a cluster, especially when you don't have
> access to the actual data.
>
> In summary
>
>    - This method validates syntax but will not catch semantic errors
>- If you need more comprehensive validation, consider using a testing
>framework and a small dataset.
>- For complex queries, using a linter or code analysis tool can help
>identify potential issues.
>
> HTH
>
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 24 Dec 2023 at 07:57, ram manickam  wrote:
>
>> Hello,
>> Is there a way to validate pyspark sql to validate only syntax errors? I
>> cannot connect to an actual data set to perform this validation.  Any
>> help would be appreciated.
>>
>>
>> Thanks
>> Ram
>>
>
>


Re: Validate spark sql

2023-12-24 Thread Nicholas Chammas
This is a user-list question, not a dev-list question. Moving this conversation 
to the user list and BCC-ing the dev list.

Also, this statement

> We are not validating against table or column existence.

is not correct. When you call spark.sql(…), Spark will look up the table 
references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them.

Also, when you run DDL via spark.sql(…), Spark will actually run it. So 
spark.sql(“drop table my_table”) will actually drop my_table. It’s not a 
validation-only operation.

This question of validating SQL is already discussed on Stack Overflow. You 
may find some useful tips there.
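
One way to also catch unresolved tables and columns without touching real data is to register empty temporary views that carry the expected schema and let the analyzer validate against them. A minimal sketch; the table and column names are illustrative assumptions:

```py
# Minimal sketch: validate parsing *and* analysis against empty views that
# only carry the expected schema. Table/column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("sql-validation").getOrCreate()

schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
])
spark.createDataFrame([], schema).createOrReplaceTempView("my_table")

try:
    # spark.sql() parses and analyzes the statement eagerly, so unresolved
    # tables or columns fail here; no data is read until an action is called.
    spark.sql("SELECT id, name FROM my_table WHERE id > 0")
    print("query parses and resolves")
except Exception as e:
    print(f"validation error: {e}")
```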

Nick


> On Dec 24, 2023, at 4:52 AM, Mich Talebzadeh  
> wrote:
> 
>   
> Yes, you can validate the syntax of your PySpark SQL queries without 
> connecting to an actual dataset or running the queries on a cluster.
> PySpark provides a method for syntax validation without executing the query. 
> Something like below
>   __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 3.4.0
>   /_/
> 
> Using Python version 3.9.16 (main, Apr 24 2023 10:36:11)
> Spark context Web UI available at http://rhes75:4040 
> Spark context available as 'sc' (master = local[*], app id = 
> local-1703410019374).
> SparkSession available as 'spark'.
> >>> from pyspark.sql import SparkSession
> >>> spark = SparkSession.builder.appName("validate").getOrCreate()
> 23/12/24 09:28:02 WARN SparkSession: Using an existing Spark session; only 
> runtime SQL configurations will take effect.
> >>> sql = "SELECT * FROM  WHERE  = some value"
> >>> try:
> ...   spark.sql(sql)
> ...   print("is working")
> ... except Exception as e:
> ...   print(f"Syntax error: {e}")
> ...
> Syntax error:
> [PARSE_SYNTAX_ERROR] Syntax error at or near '<'.(line 1, pos 14)
> 
> == SQL ==
> SELECT * FROM  WHERE  = some value
> --^^^
> 
> Here we only check for syntax errors and not the actual existence of query 
> semantics. We are not validating against table or column existence.
> 
> This method is useful when you want to catch obvious syntax errors before 
> submitting your PySpark job to a cluster, especially when you don't have 
> access to the actual data.
> In summary
> This method validates syntax but will not catch semantic errors
> If you need more comprehensive validation, consider using a testing framework 
> and a small dataset.
> For complex queries, using a linter or code analysis tool can help identify 
> potential issues.
> HTH
> 
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
> 
>view my Linkedin profile 
> 
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Sun, 24 Dec 2023 at 07:57, ram manickam  > wrote:
>> Hello,
>> Is there a way to validate pyspark sql to validate only syntax errors? I 
>> cannot connect to an actual data set to perform this validation.  Any help 
>> would be appreciated.
>> 
>> 
>> Thanks
>> Ram



[Spark-sql 3.2.4] Wrong Statistic INFO From 'ANALYZE TABLE' Command

2023-11-24 Thread Nick Luo
Hi, all

This is about the 'ANALYZE TABLE' command run from Spark on a Hive table.

Question:
When I run the 'ANALYZE TABLE' command on the Hive client before running it
on the spark-sql client, the wrong statistic info shows up.

For example:
 1. run the analyze table command on the hive client

 - create table test_anaylze (id int) partitioned by (dt string);
 - insert into test_anaylze partition (dt = "2023-11-24") values(1321);
 - analyze table test_anaylze partition(dt = "2023-11-24") COMPUTE STATISTICS;

 2. run the analyze table command on the spark-sql client

 - analyze table test_anaylze partition(dt = "2023-11-24") COMPUTE STATISTICS;
 - DESC EXTENDED test_anaylze PARTITION (dt = "2023-11-24")

I got the correct info the first time. When I inserted another value using
spark-sql and ran the 'ANALYZE TABLE' command on the spark-sql client, I
still got the right numRows and totalSize. But when I inserted a third value
into the Hive table, ran 'ANALYZE TABLE' on the Hive client, and then ran
'ANALYZE TABLE' on the spark-sql client, I got wrong statistic info from the
partition statistics. It seems that Spark checks whether the Hive parameters
(numRows, totalSize) in the metastore are correct, and the Spark parameters
(spark.sql.statistics.numRows, spark.sql.statistics.totalSize) do not get
updated any more.
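
For reference, the check described above can be driven from a Hive-enabled SparkSession roughly as follows. This is only a sketch: it reuses the example table and partition and assumes the session is configured against the shared Hive metastore:

```py
# Sketch: re-run ANALYZE from Spark and inspect the statistics Spark resolves
# for the partition. Reuses the example table/partition above and assumes the
# SparkSession points at the shared Hive metastore.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("check-partition-stats")
         .enableHiveSupport()
         .getOrCreate())

spark.sql(
    "ANALYZE TABLE test_anaylze PARTITION (dt = '2023-11-24') COMPUTE STATISTICS"
)

# Prints the partition metadata, including the statistics Spark picked up.
spark.sql(
    "DESC EXTENDED test_anaylze PARTITION (dt = '2023-11-24')"
).show(truncate=False)
```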


Can anyone explain why this situation occurs?


Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-11-07 Thread Suyash Ajmera
Any update on this?


On Fri, 13 Oct, 2023, 12:56 pm Suyash Ajmera, 
wrote:

> This issue is related to CharVarcharCodegenUtils readSidePadding method .
>
> Appending white spaces while reading ENUM data from mysql
>
> Causing issue in querying , writing the same data to Cassandra.
>
> On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, 
> wrote:
>
>> I have upgraded my spark job from spark 3.3.1 to spark 3.5.0, I am
>> querying to Mysql Database and applying
>>
>> `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working
>> as expected in spark 3.3.1 , but not working with 3.5.0.
>>
>> Where Condition ::  `*UPPER(vn) = 'ERICSSON' AND (upper(st) = 'OPEN' OR
>> upper(st) = 'REOPEN' OR upper(st) = 'CLOSED')*`
>>
>> The *st *column is ENUM in the database and it is causing the issue.
>>
>> Below is the Physical Plan of *FILTER* phase :
>>
>> For 3.3.1 :
>>
>> +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(st#42) = OPEN) OR
>> (upper(st#42) = REOPEN)) OR (upper(st#42) = CLOSED)))
>>
>> For 3.5.0 :
>>
>> +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true)) = OPEN) OR
>> (upper(staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true)) = REOPEN)) OR
>> (upper(staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true)) = CLOSED)))
>>
>> -
>>
>> I have debugged it and found that Spark added a property in version 3.4.0,
>> i.e. **spark.sql.readSideCharPadding** which has default value **true**.
>>
>> Link to the JIRA : https://issues.apache.org/jira/browse/SPARK-40697
>>
>> Added a new method in Class **CharVarcharCodegenUtils**
>>
>> public static UTF8String readSidePadding(UTF8String inputStr, int limit) {
>> int numChars = inputStr.numChars();
>> if (numChars == limit) {
>>   return inputStr;
>> } else if (numChars < limit) {
>>   return inputStr.rpad(limit, SPACE);
>> } else {
>>   return inputStr;
>> }
>>   }
>>
>>
>> **This method is appending some whitespace padding to the ENUM values
>> while reading and causing the Issue.**
>>
>> ---
>>
>> When I am removing the UPPER function from the where condition the
>> **FILTER** Phase looks like this :
>>
>>  +- Filter (((staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils,
>>  StringType, readSidePadding, st#42, 13, true, false, true) = OPEN
>> ) OR (staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true) = REOPEN   )) OR
>> (staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true) = CLOSED   ))
>>
>>
>> **You can see it has added some white space after the value and the query
>> runs fine giving the correct result.**
>>
>> But with the UPPER function I am not getting the data.
>>
>> --
>>
>> I have also tried to disable this Property *spark.sql.readSideCharPadding
>> = false* with following cases :
>>
>> 1. With Upper function in where clause :
>>It is not pushing the filters to Database and the *query works fine*.
>>
>>
>>   +- Filter (((upper(st#42) = OPEN) OR (upper(st#42) = REOPEN)) OR
>> (upper(st#42) = CLOSED))
>>
>> 2. But when I am removing the upper function
>>
>>  *It is pushing the filter to Mysql with the white spaces and I am not
>> getting the data. (THIS IS CAUSING A VERY BIG ISSUE)*
>>
>>   PushedFilters: [*IsNotNull(vn), *EqualTo(vn,ERICSSON),
>> *Or(Or(EqualTo(st,OPEN ),EqualTo(st,REOPEN
>> )),EqualTo(st,CLOSED   ))]
>>
>> I cannot move this filter to JDBC read query , also I can't remove this
>> UPPER function in the where clause.
>>
>>
>> 
>>
>> Also I found same data getting written to CASSANDRA with *PADDING .*
>>
>


[Spark SQL] [Bug] Adding `checkpoint()` causes "column [...] cannot be resolved" error

2023-11-05 Thread Robin Zimmerman
Hi all,

Wondering if anyone has run into this as I can't find any similar issues in
JIRA, mailing list archives, Stack Overflow, etc. I had a query that was
running successfully, but the query planning time was extremely long (4+
hours). To fix this I added `checkpoint()` calls earlier in the code to
truncate the query plan. This worked to improve the performance, but now I
am getting the error "A column or function parameter with name
`B`.`JOIN_KEY` cannot be resolved." Nothing else in the query changed
besides the `checkpoint()` calls. The only thing I can surmise is that this
is related to a very complex nested query plan where the same table is used
multiple times upstream. The general flow is something like this:

```py
df = spark.sql("...")
df = df.checkpoint()
df.createOrReplaceTempView("df")

df2 = spark.sql("SELECT ... JOIN df ...")
df2.createOrReplaceTempView("df2")

# Error happens here: A column or function parameter with name
# `a`.`join_key` cannot be resolved. Did you mean one of the following?
# [`b`.`join_key`, `a`.`col1`, `b`.`col2`]
spark.sql("""
    SELECT *
    FROM (
        SELECT
            a.join_key,
            a.col1,
            b.col2
        FROM df2 b
        LEFT JOIN df a ON b.join_key = a.join_key
    )
""")
```

In the actual code df and df2 are very complex multi-level nested views
built upon other views. If I checkpoint all of the dataframes in the query
right before I run it the error goes away. Unfortunately I have not been
able to put together a minimal reproducible example.
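
For anyone who wants to poke at it, here is a stripped-down skeleton of the pattern above with the pieces needed to run it standalone. The checkpoint directory and the upstream queries are placeholders and are far too simple to trigger the error on their own:

```py
# Skeleton for poking at the issue: eager checkpoint plus explain() on the
# failing join. Checkpoint directory and upstream queries are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-repro").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # placeholder path

df = spark.sql("SELECT 1 AS join_key, 'x' AS col1")      # placeholder upstream query
df = df.checkpoint(eager=True)
df.createOrReplaceTempView("df")

df2 = spark.sql("SELECT 1 AS join_key, 'y' AS col2")     # placeholder upstream query
df2.createOrReplaceTempView("df2")

# Compare the analyzed/physical plans instead of collecting, to see how the
# checkpointed relation and its column references get resolved.
spark.sql("""
    SELECT a.join_key, a.col1, b.col2
    FROM df2 b
    LEFT JOIN df a ON b.join_key = a.join_key
""").explain(extended=True)
```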

Any ideas?

Thanks,
Robin


Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-10-13 Thread Suyash Ajmera
This issue is related to the CharVarcharCodegenUtils readSidePadding method.

It appends white spaces while reading ENUM data from MySQL,

causing issues in querying and in writing the same data to Cassandra.

On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, 
wrote:

> I have upgraded my spark job from spark 3.3.1 to spark 3.5.0, I am
> querying to Mysql Database and applying
>
> `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working
> as expected in spark 3.3.1 , but not working with 3.5.0.
>
> Where Condition ::  `*UPPER(vn) = 'ERICSSON' AND (upper(st) = 'OPEN' OR
> upper(st) = 'REOPEN' OR upper(st) = 'CLOSED')*`
>
> The *st *column is ENUM in the database and it is causing the issue.
>
> Below is the Physical Plan of *FILTER* phase :
>
> For 3.3.1 :
>
> +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(st#42) = OPEN) OR
> (upper(st#42) = REOPEN)) OR (upper(st#42) = CLOSED)))
>
> For 3.5.0 :
>
> +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true)) = OPEN) OR
> (upper(staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true)) = REOPEN)) OR
> (upper(staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true)) = CLOSED)))
>
> -
>
> I have debugged it and found that Spark added a property in version 3.4.0,
> i.e. **spark.sql.readSideCharPadding** which has default value **true**.
>
> Link to the JIRA : https://issues.apache.org/jira/browse/SPARK-40697
>
> Added a new method in Class **CharVarcharCodegenUtils**
>
> public static UTF8String readSidePadding(UTF8String inputStr, int limit) {
> int numChars = inputStr.numChars();
> if (numChars == limit) {
>   return inputStr;
> } else if (numChars < limit) {
>   return inputStr.rpad(limit, SPACE);
> } else {
>   return inputStr;
> }
>   }
>
>
> **This method is appending some whitespace padding to the ENUM values
> while reading and causing the Issue.**
>
> ---
>
> When I am removing the UPPER function from the where condition the
> **FILTER** Phase looks like this :
>
>  +- Filter (((staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils,
>  StringType, readSidePadding, st#42, 13, true, false, true) = OPEN
> ) OR (staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true) = REOPEN   )) OR
> (staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true) = CLOSED   ))
>
>
> **You can see it has added some white space after the value and the query
> runs fine giving the correct result.**
>
> But with the UPPER function I am not getting the data.
>
> --
>
> I have also tried to disable this Property *spark.sql.readSideCharPadding
> = false* with following cases :
>
> 1. With Upper function in where clause :
>It is not pushing the filters to Database and the *query works fine*.
>
>   +- Filter (((upper(st#42) = OPEN) OR (upper(st#42) = REOPEN)) OR
> (upper(st#42) = CLOSED))
>
> 2. But when I am removing the upper function
>
>  *It is pushing the filter to Mysql with the white spaces and I am not
> getting the data. (THIS IS CAUSING A VERY BIG ISSUE)*
>
>   PushedFilters: [*IsNotNull(vn), *EqualTo(vn,ERICSSON),
> *Or(Or(EqualTo(st,OPEN ),EqualTo(st,REOPEN
> )),EqualTo(st,CLOSED   ))]
>
> I cannot move this filter to JDBC read query , also I can't remove this
> UPPER function in the where clause.
>
>
> 
>
> Also I found same data getting written to CASSANDRA with *PADDING .*
>


[ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-10-12 Thread Suyash Ajmera
I have upgraded my spark job from spark 3.3.1 to spark 3.5.0. I am querying
a Mysql database and applying

`*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working as
expected in spark 3.3.1, but not working with 3.5.0.

Where Condition ::  `*UPPER(vn) = 'ERICSSON' AND (upper(st) = 'OPEN' OR
upper(st) = 'REOPEN' OR upper(st) = 'CLOSED')*`

The *st *column is ENUM in the database and it is causing the issue.

Below is the Physical Plan of *FILTER* phase :

For 3.3.1 :

+- Filter ((upper(vn#11) = ERICSSON) AND (((upper(st#42) = OPEN) OR
(upper(st#42) = REOPEN)) OR (upper(st#42) = CLOSED)))

For 3.5.0 :

+- Filter ((upper(vn#11) = ERICSSON) AND (((upper(staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true)) = OPEN) OR
(upper(staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true)) = REOPEN)) OR
(upper(staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true)) = CLOSED)))

-

I have debugged it and found that Spark added a property in version 3.4.0,
i.e. **spark.sql.readSideCharPadding**, which has the default value **true**.

Link to the JIRA : https://issues.apache.org/jira/browse/SPARK-40697

Added a new method in Class **CharVarcharCodegenUtils**

public static UTF8String readSidePadding(UTF8String inputStr, int limit) {
int numChars = inputStr.numChars();
if (numChars == limit) {
  return inputStr;
} else if (numChars < limit) {
  return inputStr.rpad(limit, SPACE);
} else {
  return inputStr;
}
  }


**This method appends whitespace padding to the ENUM values while reading,
and that is causing the issue.**
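
A minimal plain-Python illustration of the effect, assuming the CHAR(13) padding shown in the plan above:

```py
# What the read-side padding does to the comparison, in plain Python terms.
# The width 13 is taken from the "13" in the plan above.
padded = "OPEN".ljust(13)              # roughly what readSidePadding returns
print(repr(padded))                    # 'OPEN         '
print(padded.upper() == "OPEN")        # False: UPPER keeps the trailing spaces
print(padded == "OPEN".ljust(13))      # True: works when the literal is padded too
```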

---

When I remove the UPPER function from the where condition, the **FILTER**
phase looks like this:

 +- Filter (((staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils,
 StringType, readSidePadding, st#42, 13, true, false, true) = OPEN
) OR (staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true) = REOPEN   )) OR
(staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true) = CLOSED   ))


**You can see it has added some white space after the value and the query
runs fine giving the correct result.**

But with the UPPER function I am not getting the data.

--

I have also tried disabling this property, *spark.sql.readSideCharPadding =
false*, with the following cases:

1. With the UPPER function in the where clause:
   It does not push the filters to the database and the *query works fine*.

  +- Filter (((upper(st#42) = OPEN) OR (upper(st#42) = REOPEN)) OR
(upper(st#42) = CLOSED))

2. But when I remove the UPPER function:

 *It pushes the filter to Mysql with the white spaces and I am not
getting the data. (THIS IS CAUSING A VERY BIG ISSUE)*

  PushedFilters: [*IsNotNull(vn), *EqualTo(vn,ERICSSON),
*Or(Or(EqualTo(st,OPEN ),EqualTo(st,REOPEN
)),EqualTo(st,CLOSED   ))]

I cannot move this filter to the JDBC read query, and I also can't remove the
UPPER function from the where clause.



Also, I found the same data getting written to CASSANDRA with *PADDING*.
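
For anyone hitting the same problem, one possible workaround sketch (not a confirmed fix from this thread) is to keep the read-side padding enabled but trim the CHAR value before the case-insensitive comparison, so neither the padded value nor the pushed-down filter depends on trailing spaces. The JDBC connection options below are placeholders:

```py
# Workaround sketch, not a confirmed fix: trim the padded CHAR/ENUM column
# before the case-insensitive comparison. JDBC options are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("enum-padding-workaround").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://host:3306/db")   # placeholder URL
      .option("dbtable", "tickets")                 # placeholder table name
      .option("user", "user")
      .option("password", "password")
      .load())

filtered = df.where(
    (F.upper(F.col("vn")) == "ERICSSON") &
    (F.upper(F.rtrim(F.col("st"))).isin("OPEN", "REOPEN", "CLOSED"))
)
filtered.show()
```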


Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-18 Thread Mich Talebzadeh
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all
>>>>>>>>>> responsibility for any loss, damage or destruction of data or any 
>>>>>>>>>> other
>>>>>>>>>> property which may arise from relying on this email's technical 
>>>>>>>>>> content is
>>>>>>>>>> explicitly disclaimed. The author will in no case be liable for any
>>>>>>>>>> monetary damages arising from such loss, damage or destruction.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, 12 Aug 2023 at 18:31, Patrick Tucci <
>>>>>>>>>> patrick.tu...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, on premise.
>>>>>>>>>>>
>>>>>>>>>>> Unfortunately after installing Delta Lake and re-writing all
>>>>>>>>>>> tables as Delta tables, the issue persists.
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh <
>>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> ok sure.
>>>>>>>>>>>>
>>>>>>>>>>>> Is this Delta Lake going to be on-premise?
>>>>>>>>>>>>
>>>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>>>> Solutions Architect/Engineering Lead
>>>>>>>>>>>> London
>>>>>>>>>>>> United Kingdom
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>view my Linkedin profile
>>>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all
>>>>>>>>>>>> responsibility for any loss, damage or destruction of data or any 
>>>>>>>>>>>> other
>>>>>>>>>>>> property which may arise from relying on this email's technical 
>>>>>>>>>>>> content is
>>>>>>>>>>>> explicitly disclaimed. The author will in no case be liable for any
>>>>>>>>>>>> monetary damages arising from such loss, damage or destruction.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci <
>>>>>>>>>>>> patrick.tu...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Mich,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the feedback. My original intention after reading
>>>>>>>>>>>>> your response was to stick to Hive for managing tables. 
>>>>>>>>>>>>> Unfortunately, I'm
>>>>>>>>>>>>> running into another case of SQL scripts hanging. Since all 
>>>>>>>>>>>>> tables are
>>>>>>>>>>>>> already Parquet, I'm out of troubleshooting options. I'm going to 
>>>>>>>>>>>>> migrate
>>>>>>>&

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
>>>>>>>> and magic than examining errors/logs and fixing code. I feel that I 
>>>>>>>> should
>>>>>>>> have a contingency plan in place for when I run into an issue with 
>>>>>>>> Spark
>>>>>>>> that can't be resolved.
>>>>>>>>
>>>>>>>> Thanks everyone.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Aug 12, 2023 at 2:18 PM Mich Talebzadeh <
>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> OK you would not have known unless you went through the process so
>>>>>>>>> to speak.
>>>>>>>>>
>>>>>>>>> Let us do something revolutionary here 😁
>>>>>>>>>
>>>>>>>>> Install hive and its metastore. You already have hadoop anyway
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://cwiki.apache.org/confluence/display/hive/adminmanual+installation
>>>>>>>>>
>>>>>>>>> hive metastore
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://data-flair.training/blogs/apache-hive-metastore/#:~:text=What%20is%20Hive%20Metastore%3F,by%20using%20metastore%20service%20API
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>> choose one of these
>>>>>>>>>
>>>>>>>>> derby  hive  mssql  mysql  oracle  postgres
>>>>>>>>>
>>>>>>>>> Mine is an oracle. postgres is good as well.
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Mich Talebzadeh,
>>>>>>>>> Solutions Architect/Engineering Lead
>>>>>>>>> London
>>>>>>>>> United Kingdom
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>view my Linkedin profile
>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>>>> for any loss, damage or destruction of data or any other property 
>>>>>>>>> which may
>>>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>>>> damages
>>>>>>>>> arising from such loss, damage or destruction.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, 12 Aug 2023 at 18:31, Patrick Tucci <
>>>>>>>>> patrick.tu...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, on premise.
>>>>>>>>>>
>>>>>>>>>> Unfortunately after installing Delta Lake and re-writing all
>>>>>>>>>> tables as Delta tables, the issue persists.
>>>>>>>>>>
>>>>>>>>>> On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh <
>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> ok sure.
>>>>>>>>>>>
>>>>>>>>>>> Is this Delta Lake going to be on-premise?
>>>>>>>>>>>
>>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>>> Solutions Architect/Engineering Lead
>>>>>>>>>>> London
>>>>>>>>>>> United Kingdom
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>view my Linkedin profile
>>>>>>>>>>>

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
gt;
>>>>>>>> choose one of these
>>>>>>>>
>>>>>>>> derby  hive  mssql  mysql  oracle  postgres
>>>>>>>>
>>>>>>>> Mine is an oracle. postgres is good as well.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Mich Talebzadeh,
>>>>>>>> Solutions Architect/Engineering Lead
>>>>>>>> London
>>>>>>>> United Kingdom
>>>>>>>>
>>>>>>>>
>>>>>>>>view my Linkedin profile
>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>
>>>>>>>>
>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>>> for any loss, damage or destruction of data or any other property 
>>>>>>>> which may
>>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>>> damages
>>>>>>>> arising from such loss, damage or destruction.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, 12 Aug 2023 at 18:31, Patrick Tucci <
>>>>>>>> patrick.tu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Yes, on premise.
>>>>>>>>>
>>>>>>>>> Unfortunately after installing Delta Lake and re-writing all
>>>>>>>>> tables as Delta tables, the issue persists.
>>>>>>>>>
>>>>>>>>> On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh <
>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> ok sure.
>>>>>>>>>>
>>>>>>>>>> Is this Delta Lake going to be on-premise?
>>>>>>>>>>
>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>> Solutions Architect/Engineering Lead
>>>>>>>>>> London
>>>>>>>>>> United Kingdom
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>view my Linkedin profile
>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all
>>>>>>>>>> responsibility for any loss, damage or destruction of data or any 
>>>>>>>>>> other
>>>>>>>>>> property which may arise from relying on this email's technical 
>>>>>>>>>> content is
>>>>>>>>>> explicitly disclaimed. The author will in no case be liable for any
>>>>>>>>>> monetary damages arising from such loss, damage or destruction.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci <
>>>>>>>>>> patrick.tu...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Mich,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the feedback. My original intention after reading
>>>>>>>>>>> your response was to stick to Hive for managing tables. 
>>>>>>>>>>> Unfortunately, I'm
>>>>>>>>>>> running into another case of SQL scripts hanging. Since all tables 
>>>>>>>>>>> a

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
t;>>>>>>
>>>>>>>> On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh <
>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> ok sure.
>>>>>>>>>
>>>>>>>>> Is this Delta Lake going to be on-premise?
>>>>>>>>>
>>>>>>>>> Mich Talebzadeh,
>>>>>>>>> Solutions Architect/Engineering Lead
>>>>>>>>> London
>>>>>>>>> United Kingdom
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>view my Linkedin profile
>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>>>> for any loss, damage or destruction of data or any other property 
>>>>>>>>> which may
>>>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>>>> damages
>>>>>>>>> arising from such loss, damage or destruction.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci <
>>>>>>>>> patrick.tu...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Mich,
>>>>>>>>>>
>>>>>>>>>> Thanks for the feedback. My original intention after reading your
>>>>>>>>>> response was to stick to Hive for managing tables. Unfortunately, I'm
>>>>>>>>>> running into another case of SQL scripts hanging. Since all tables 
>>>>>>>>>> are
>>>>>>>>>> already Parquet, I'm out of troubleshooting options. I'm going to 
>>>>>>>>>> migrate
>>>>>>>>>> to Delta Lake and see if that solves the issue.
>>>>>>>>>>
>>>>>>>>>> Thanks again for your feedback.
>>>>>>>>>>
>>>>>>>>>> Patrick
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <
>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Patrick,
>>>>>>>>>>>
>>>>>>>>>>> There is not anything wrong with Hive On-premise it is the best
>>>>>>>>>>> data warehouse there is
>>>>>>>>>>>
>>>>>>>>>>> Hive handles both ORC and Parquet formal well. They are both
>>>>>>>>>>> columnar implementations of relational model. What you are seeing 
>>>>>>>>>>> is the
>>>>>>>>>>> Spark API to Hive which prefers Parquet. I found out a few years 
>>>>>>>>>>> ago.
>>>>>>>>>>>
>>>>>>>>>>> From your point of view I suggest you stick to
>>>>>>>>>>> parquet format with Hive specific to Spark. As far as I know you 
>>>>>>>>>>> don't have
>>>>>>>>>>> a fully independent Hive DB as yet.
>>>>>>>>>>>
>>>>>>>>>>> Anyway stick to Hive for now as you never know what issues you
>>>>>>>>>>> may be facing using moving to Delta Lake.
>>>>>>>>>>>
>>>>>>>>>>> You can also use compression
>>>>>>>>>>>
>>>>>>>>>>> STORED AS PARQUET
>>>>>>>>>>> TBLPROPERTIES ("parquet.compression

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
gt;>>> to Delta Lake and see if that solves the issue.
>>>>>>>>>
>>>>>>>>> Thanks again for your feedback.
>>>>>>>>>
>>>>>>>>> Patrick
>>>>>>>>>
>>>>>>>>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <
>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Patrick,
>>>>>>>>>>
>>>>>>>>>> There is not anything wrong with Hive On-premise it is the best
>>>>>>>>>> data warehouse there is
>>>>>>>>>>
>>>>>>>>>> Hive handles both ORC and Parquet formal well. They are both
>>>>>>>>>> columnar implementations of relational model. What you are seeing is 
>>>>>>>>>> the
>>>>>>>>>> Spark API to Hive which prefers Parquet. I found out a few years ago.
>>>>>>>>>>
>>>>>>>>>> From your point of view I suggest you stick to
>>>>>>>>>> parquet format with Hive specific to Spark. As far as I know you 
>>>>>>>>>> don't have
>>>>>>>>>> a fully independent Hive DB as yet.
>>>>>>>>>>
>>>>>>>>>> Anyway stick to Hive for now as you never know what issues you
>>>>>>>>>> may be facing using moving to Delta Lake.
>>>>>>>>>>
>>>>>>>>>> You can also use compression
>>>>>>>>>>
>>>>>>>>>> STORED AS PARQUET
>>>>>>>>>> TBLPROPERTIES ("parquet.compression"="SNAPPY")
>>>>>>>>>>
>>>>>>>>>> ALSO
>>>>>>>>>>
>>>>>>>>>> ANALYZE TABLE  COMPUTE STATISTICS FOR COLUMNS
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>> Solutions Architect/Engineering Lead
>>>>>>>>>> London
>>>>>>>>>> United Kingdom
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>view my Linkedin profile
>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all
>>>>>>>>>> responsibility for any loss, damage or destruction of data or any 
>>>>>>>>>> other
>>>>>>>>>> property which may arise from relying on this email's technical 
>>>>>>>>>> content is
>>>>>>>>>> explicitly disclaimed. The author will in no case be liable for any
>>>>>>>>>> monetary damages arising from such loss, damage or destruction.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, 11 Aug 2023 at 11:26, Patrick Tucci <
>>>>>>>>>> patrick.tu...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for the reply Stephen and Mich.
>>>>>>>>>>>
>>>>>>>>>>> Stephen, you're right, it feels like Spark is waiting for
>>>>>>>>>>> something, but I'm not sure what. I'm the only user on the cluster 
>>>>>>>>>>> and
>>>>>>>>>>> there are plenty of resources (+60 cores, +250GB RAM). I even tried
>>>>>>>>>>> restarting Hadoop, Spark and the host servers to make sure nothing 
>>>>>>>>>>> was
>>>>>>>>>>> lingering in the background.
>>>>>>

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
xplicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sat, 12 Aug 2023 at 18:31, Patrick Tucci 
>>>>> wrote:
>>>>>
>>>>>> Yes, on premise.
>>>>>>
>>>>>> Unfortunately after installing Delta Lake and re-writing all tables
>>>>>> as Delta tables, the issue persists.
>>>>>>
>>>>>> On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> ok sure.
>>>>>>>
>>>>>>> Is this Delta Lake going to be on-premise?
>>>>>>>
>>>>>>> Mich Talebzadeh,
>>>>>>> Solutions Architect/Engineering Lead
>>>>>>> London
>>>>>>> United Kingdom
>>>>>>>
>>>>>>>
>>>>>>>view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>>
>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>>> may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>> damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Mich,
>>>>>>>>
>>>>>>>> Thanks for the feedback. My original intention after reading your
>>>>>>>> response was to stick to Hive for managing tables. Unfortunately, I'm
>>>>>>>> running into another case of SQL scripts hanging. Since all tables are
>>>>>>>> already Parquet, I'm out of troubleshooting options. I'm going to 
>>>>>>>> migrate
>>>>>>>> to Delta Lake and see if that solves the issue.
>>>>>>>>
>>>>>>>> Thanks again for your feedback.
>>>>>>>>
>>>>>>>> Patrick
>>>>>>>>
>>>>>>>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <
>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Patrick,
>>>>>>>>>
>>>>>>>>> There is not anything wrong with Hive On-premise it is the best
>>>>>>>>> data warehouse there is
>>>>>>>>>
>>>>>>>>> Hive handles both ORC and Parquet formal well. They are both
>>>>>>>>> columnar implementations of relational model. What you are seeing is 
>>>>>>>>> the
>>>>>>>>> Spark API to Hive which prefers Parquet. I found out a few years ago.
>>>>>>>>>
>>>>>>>>> From your point of view I suggest you stick to parquet format with
>>>>>>>>> Hive specific to Spark. As far as I know you don't have a fully 
>>>>>>>>> independent
>>>>>>>>> Hive DB as yet.
>>>>>>>>>
>>>>>>>>> Anyway stick to Hive for now as you never know what issues you may
>>>>>>>>> be facing using moving to Delta Lake.
>>>>>>>>>
>>>>>>>>> You can also use compression
>>>>>>>>>
>>>>>>>>> STORED AS PARQUET
>>>>>>>>> TBLPROPERTIES ("parquet.compression"="SNAPPY")
>>>>>>>>>
>>>>>>>>> ALSO
>>>>>>>

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Mich Talebzadeh
loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Mich,
>>>>>>>
>>>>>>> Thanks for the feedback. My original intention after reading your
>>>>>>> response was to stick to Hive for managing tables. Unfortunately, I'm
>>>>>>> running into another case of SQL scripts hanging. Since all tables are
>>>>>>> already Parquet, I'm out of troubleshooting options. I'm going to 
>>>>>>> migrate
>>>>>>> to Delta Lake and see if that solves the issue.
>>>>>>>
>>>>>>> Thanks again for your feedback.
>>>>>>>
>>>>>>> Patrick
>>>>>>>
>>>>>>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Patrick,
>>>>>>>>
>>>>>>>> There is not anything wrong with Hive On-premise it is the best
>>>>>>>> data warehouse there is
>>>>>>>>
>>>>>>>> Hive handles both ORC and Parquet formal well. They are both
>>>>>>>> columnar implementations of relational model. What you are seeing is 
>>>>>>>> the
>>>>>>>> Spark API to Hive which prefers Parquet. I found out a few years ago.
>>>>>>>>
>>>>>>>> From your point of view I suggest you stick to parquet format with
>>>>>>>> Hive specific to Spark. As far as I know you don't have a fully 
>>>>>>>> independent
>>>>>>>> Hive DB as yet.
>>>>>>>>
>>>>>>>> Anyway stick to Hive for now as you never know what issues you may
>>>>>>>> be facing using moving to Delta Lake.
>>>>>>>>
>>>>>>>> You can also use compression
>>>>>>>>
>>>>>>>> STORED AS PARQUET
>>>>>>>> TBLPROPERTIES ("parquet.compression"="SNAPPY")
>>>>>>>>
>>>>>>>> ALSO
>>>>>>>>
>>>>>>>> ANALYZE TABLE  COMPUTE STATISTICS FOR COLUMNS
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Mich Talebzadeh,
>>>>>>>> Solutions Architect/Engineering Lead
>>>>>>>> London
>>>>>>>> United Kingdom
>>>>>>>>
>>>>>>>>
>>>>>>>>view my Linkedin profile
>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>
>>>>>>>>
>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>>> for any loss, damage or destruction of data or any other property 
>>>>>>>> which may
>>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>>> damages
>>>>>>>> arising from such loss, damage or destruction.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, 11 Aug 2023 at 11:26, Patrick Tucci <
>>>>>>>> patrick.tu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the reply Stephen and Mich.
>>>>>>>>>
>>>>>>>>> Stephen, you're right, it feels like Spark is waiting for
>>>>>>>>> something, but I'm not sure what. I'm the only user on the cluster and
>>>>>>>>> there are plenty of resources (+60 cores, +250GB RAM). I even tried
>>>>>>>>> restarting Hadoop, Spark and the host servers to make sure nothing was

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-17 Thread Patrick Tucci
5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 12 Aug 2023 at 18:31, Patrick Tucci 
>>> wrote:
>>>
>>>> Yes, on premise.
>>>>
>>>> Unfortunately after installing Delta Lake and re-writing all tables as
>>>> Delta tables, the issue persists.
>>>>
>>>> On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> ok sure.
>>>>>
>>>>> Is this Delta Lake going to be on-premise?
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Solutions Architect/Engineering Lead
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci 
>>>>> wrote:
>>>>>
>>>>>> Hi Mich,
>>>>>>
>>>>>> Thanks for the feedback. My original intention after reading your
>>>>>> response was to stick to Hive for managing tables. Unfortunately, I'm
>>>>>> running into another case of SQL scripts hanging. Since all tables are
>>>>>> already Parquet, I'm out of troubleshooting options. I'm going to migrate
>>>>>> to Delta Lake and see if that solves the issue.
>>>>>>
>>>>>> Thanks again for your feedback.
>>>>>>
>>>>>> Patrick
>>>>>>
>>>>>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Patrick,
>>>>>>>
>>>>>>> There is not anything wrong with Hive On-premise it is the best data
>>>>>>> warehouse there is
>>>>>>>
>>>>>>> Hive handles both ORC and Parquet formal well. They are both
>>>>>>> columnar implementations of relational model. What you are seeing is the
>>>>>>> Spark API to Hive which prefers Parquet. I found out a few years ago.
>>>>>>>
>>>>>>> From your point of view I suggest you stick to parquet format with
>>>>>>> Hive specific to Spark. As far as I know you don't have a fully 
>>>>>>> independent
>>>>>>> Hive DB as yet.
>>>>>>>
>>>>>>> Anyway stick to Hive for now as you never know what issues you may
>>>>>>> be facing using moving to Delta Lake.
>>>>>>>
>>>>>>> You can also use compression
>>>>>>>
>>>>>>> STORED AS PARQUET
>>>>>>> TBLPROPERTIES ("parquet.compression"="SNAPPY")
>>>>>>>
>>>>>>> ALSO
>>>>>>>
>>>>>>> ANALYZE TABLE  COMPUTE STATISTICS FOR COLUMNS
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Mich Talebzadeh,
>>>>>>> Solutions Architect/Engineering Lead
>>>>>>> London
>>>>>>> United Kingdom
>>>>>>>
>>>>>>>
>>>>>>

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Mich Talebzadeh
t; Mich Talebzadeh,
>>>> Solutions Architect/Engineering Lead
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci 
>>>> wrote:
>>>>
>>>>> Hi Mich,
>>>>>
>>>>> Thanks for the feedback. My original intention after reading your
>>>>> response was to stick to Hive for managing tables. Unfortunately, I'm
>>>>> running into another case of SQL scripts hanging. Since all tables are
>>>>> already Parquet, I'm out of troubleshooting options. I'm going to migrate
>>>>> to Delta Lake and see if that solves the issue.
>>>>>
>>>>> Thanks again for your feedback.
>>>>>
>>>>> Patrick
>>>>>
>>>>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Hi Patrick,
>>>>>>
>>>>>> There is not anything wrong with Hive On-premise it is the best data
>>>>>> warehouse there is
>>>>>>
>>>>>> Hive handles both ORC and Parquet formal well. They are both columnar
>>>>>> implementations of relational model. What you are seeing is the Spark API
>>>>>> to Hive which prefers Parquet. I found out a few years ago.
>>>>>>
>>>>>> From your point of view I suggest you stick to parquet format with
>>>>>> Hive specific to Spark. As far as I know you don't have a fully 
>>>>>> independent
>>>>>> Hive DB as yet.
>>>>>>
>>>>>> Anyway stick to Hive for now as you never know what issues you may be
>>>>>> facing using moving to Delta Lake.
>>>>>>
>>>>>> You can also use compression
>>>>>>
>>>>>> STORED AS PARQUET
>>>>>> TBLPROPERTIES ("parquet.compression"="SNAPPY")
>>>>>>
>>>>>> ALSO
>>>>>>
>>>>>> ANALYZE TABLE  COMPUTE STATISTICS FOR COLUMNS
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Solutions Architect/Engineering Lead
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>> may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, 11 Aug 2023 at 11:26, Patrick Tucci 
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the reply Stephen and Mich.
>>>>>>>
>>>>>>> Stephen, you're right, it feels like Spark is waiting for something,
>>>>>>> but I'm not sure what. I'm the only user on the cluster and there are
>>>>>>> plenty of resources (+60 cores, +250GB RAM). I even tried restarting
>>>>>>> Hado

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-13 Thread Patrick Tucci
of SQL scripts hanging. Since all tables are
>>>> already Parquet, I'm out of troubleshooting options. I'm going to migrate
>>>> to Delta Lake and see if that solves the issue.
>>>>
>>>> Thanks again for your feedback.
>>>>
>>>> Patrick
>>>>
>>>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi Patrick,
>>>>>
>>>>> There is not anything wrong with Hive On-premise it is the best data
>>>>> warehouse there is
>>>>>
>>>>> Hive handles both ORC and Parquet formal well. They are both columnar
>>>>> implementations of relational model. What you are seeing is the Spark API
>>>>> to Hive which prefers Parquet. I found out a few years ago.
>>>>>
>>>>> From your point of view I suggest you stick to parquet format with
>>>>> Hive specific to Spark. As far as I know you don't have a fully 
>>>>> independent
>>>>> Hive DB as yet.
>>>>>
>>>>> Anyway stick to Hive for now as you never know what issues you may be
>>>>> facing using moving to Delta Lake.
>>>>>
>>>>> You can also use compression
>>>>>
>>>>> STORED AS PARQUET
>>>>> TBLPROPERTIES ("parquet.compression"="SNAPPY")
>>>>>
>>>>> ALSO
>>>>>
>>>>> ANALYZE TABLE  COMPUTE STATISTICS FOR COLUMNS
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Solutions Architect/Engineering Lead
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, 11 Aug 2023 at 11:26, Patrick Tucci 
>>>>> wrote:
>>>>>
>>>>>> Thanks for the reply Stephen and Mich.
>>>>>>
>>>>>> Stephen, you're right, it feels like Spark is waiting for something,
>>>>>> but I'm not sure what. I'm the only user on the cluster and there are
>>>>>> plenty of resources (+60 cores, +250GB RAM). I even tried restarting
>>>>>> Hadoop, Spark and the host servers to make sure nothing was lingering in
>>>>>> the background.
>>>>>>
>>>>>> Mich, thank you so much, your suggestion worked. Storing the tables
>>>>>> as Parquet solves the issue.
>>>>>>
>>>>>> Interestingly, I found that only the MemberEnrollment table needs to
>>>>>> be Parquet. The ID field in MemberEnrollment is an int calculated during
>>>>>> load by a ROW_NUMBER() function. Further testing found that if I hard 
>>>>>> code
>>>>>> a 0 as MemberEnrollment.ID instead of using the ROW_NUMBER() function, 
>>>>>> the
>>>>>> query works without issue even if both tables are ORC.
>>>>>>
>>>>>> Should I infer from this issue that the Hive components prefer
>>>>>> Parquet over ORC? Furthermore, should I consider using a different table
>>>>>> storage framework, like Delta Lake, instead of the Hive components? Given
>>>>>> this issue and other issues I've had with Hive, I'm starting to think a
>>>>>> different solution might be more robust and stable. The main condition is
>>>>>> that my application operates solely through Thrift server, so I need to 
>>>>>> be
>>>>>> able to connect to Spark through Thrift server and have it write tables
>>>>>> using Delta Lake ins

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
onetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, 11 Aug 2023 at 11:26, Patrick Tucci 
>>>> wrote:
>>>>
>>>>> Thanks for the reply Stephen and Mich.
>>>>>
>>>>> Stephen, you're right, it feels like Spark is waiting for something,
>>>>> but I'm not sure what. I'm the only user on the cluster and there are
>>>>> plenty of resources (+60 cores, +250GB RAM). I even tried restarting
>>>>> Hadoop, Spark and the host servers to make sure nothing was lingering in
>>>>> the background.
>>>>>
>>>>> Mich, thank you so much, your suggestion worked. Storing the tables as
>>>>> Parquet solves the issue.
>>>>>
>>>>> Interestingly, I found that only the MemberEnrollment table needs to
>>>>> be Parquet. The ID field in MemberEnrollment is an int calculated during
>>>>> load by a ROW_NUMBER() function. Further testing found that if I hard code
>>>>> a 0 as MemberEnrollment.ID instead of using the ROW_NUMBER() function, the
>>>>> query works without issue even if both tables are ORC.
>>>>>
>>>>> Should I infer from this issue that the Hive components prefer Parquet
>>>>> over ORC? Furthermore, should I consider using a different table storage
>>>>> framework, like Delta Lake, instead of the Hive components? Given this
>>>>> issue and other issues I've had with Hive, I'm starting to think a
>>>>> different solution might be more robust and stable. The main condition is
>>>>> that my application operates solely through Thrift server, so I need to be
>>>>> able to connect to Spark through Thrift server and have it write tables
>>>>> using Delta Lake instead of Hive. From this StackOverflow question, it
>>>>> looks like this is possible:
>>>>> https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect-to-delta-using-jdbc
>>>>>
>>>>> Thanks again to everyone who replied for their help.
>>>>>
>>>>> Patrick
>>>>>
>>>>>
>>>>> On Fri, Aug 11, 2023 at 2:14 AM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Steve may have a valid point. You raised an issue with concurrent
>>>>>> writes before, if I recall correctly. Since this limitation may be due to
>>>>>> Hive metastore. By default Spark uses Apache Derby for its database
>>>>>> persistence. *However it is limited to only one Spark session at any
>>>>>> time for the purposes of metadata storage.*  That may be the cause
>>>>>> here as well. Does this happen if the underlying tables are created as
>>>>>> PARQUET as opposed to ORC?
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Solutions Architect/Engineering Lead
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>> may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, 11 Aug 2023 at 01:33, Stephen Coy
>>>>>>  wrote:
>>>>>>
>>>>>>> Hi Patrick,
>>>>>>>
>>>>>>> When this has happened to me in the past (admittedly via
>>>>>>> spark-submit) it has been because another job was still running and had
>>>>>>> already claimed some of the resources (cores and mem

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
Yes, on premise.

Unfortunately after installing Delta Lake and re-writing all tables as
Delta tables, the issue persists.

On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh 
wrote:

> ok sure.
>
> Is this Delta Lake going to be on-premise?
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci 
> wrote:
>
>> Hi Mich,
>>
>> Thanks for the feedback. My original intention after reading your
>> response was to stick to Hive for managing tables. Unfortunately, I'm
>> running into another case of SQL scripts hanging. Since all tables are
>> already Parquet, I'm out of troubleshooting options. I'm going to migrate
>> to Delta Lake and see if that solves the issue.
>>
>> Thanks again for your feedback.
>>
>> Patrick
>>
>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Patrick,
>>>
>>> There is nothing wrong with Hive on-premise; it is the best data
>>> warehouse there is.
>>>
>>> Hive handles both ORC and Parquet formats well. They are both columnar
>>> implementations of the relational model. What you are seeing is the Spark API
>>> to Hive, which prefers Parquet. I found this out a few years ago.
>>>
>>> From your point of view I suggest you stick to the Parquet format with Hive
>>> specific to Spark. As far as I know you don't have a fully independent Hive
>>> DB as yet.
>>>
>>> Anyway, stick to Hive for now, as you never know what issues you may
>>> face when moving to Delta Lake.
>>>
>>> You can also use compression
>>>
>>> STORED AS PARQUET
>>> TBLPROPERTIES ("parquet.compression"="SNAPPY")
>>>
>>> ALSO
>>>
>>> ANALYZE TABLE  COMPUTE STATISTICS FOR COLUMNS
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 11 Aug 2023 at 11:26, Patrick Tucci 
>>> wrote:
>>>
>>>> Thanks for the reply Stephen and Mich.
>>>>
>>>> Stephen, you're right, it feels like Spark is waiting for something,
>>>> but I'm not sure what. I'm the only user on the cluster and there are
>>>> plenty of resources (+60 cores, +250GB RAM). I even tried restarting
>>>> Hadoop, Spark and the host servers to make sure nothing was lingering in
>>>> the background.
>>>>
>>>> Mich, thank you so much, your suggestion worked. Storing the tables as
>>>> Parquet solves the issue.
>>>>
>>>> Interestingly, I found that only the MemberEnrollment table needs to be
>>>> Parquet. The ID field in MemberEnrollment is an int calculated during load
>>>> by a ROW_NUMBER() function. Further testing found that if I hard code a 0
>>>> as MemberEnrollment.ID instead of using the ROW_NUMBER() function, the
>>>> query works without issue even if both tables are ORC.
>>>>
>>>> Should I infer from this issue that the Hive components prefer Parquet
>>>> over ORC? Furthermore, should I consider using a different table storage
>>>> framework, like Delta Lake, instead of the Hive components? Given this

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
ok sure.

Is this Delta Lake going to be on-premise?

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 12 Aug 2023 at 12:03, Patrick Tucci  wrote:

> Hi Mich,
>
> Thanks for the feedback. My original intention after reading your response
> was to stick to Hive for managing tables. Unfortunately, I'm running into
> another case of SQL scripts hanging. Since all tables are already Parquet,
> I'm out of troubleshooting options. I'm going to migrate to Delta Lake and
> see if that solves the issue.
>
> Thanks again for your feedback.
>
> Patrick
>
> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi Patrick,
>>
>> There is nothing wrong with Hive on-premise; it is the best data
>> warehouse there is.
>>
>> Hive handles both ORC and Parquet formats well. They are both columnar
>> implementations of the relational model. What you are seeing is the Spark API
>> to Hive, which prefers Parquet. I found this out a few years ago.
>>
>> From your point of view I suggest you stick to the Parquet format with Hive
>> specific to Spark. As far as I know you don't have a fully independent Hive
>> DB as yet.
>>
>> Anyway, stick to Hive for now, as you never know what issues you may
>> face when moving to Delta Lake.
>>
>> You can also use compression
>>
>> STORED AS PARQUET
>> TBLPROPERTIES ("parquet.compression"="SNAPPY")
>>
>> ALSO
>>
>> ANALYZE TABLE  COMPUTE STATISTICS FOR COLUMNS
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 11 Aug 2023 at 11:26, Patrick Tucci 
>> wrote:
>>
>>> Thanks for the reply Stephen and Mich.
>>>
>>> Stephen, you're right, it feels like Spark is waiting for something, but
>>> I'm not sure what. I'm the only user on the cluster and there are plenty of
>>> resources (+60 cores, +250GB RAM). I even tried restarting Hadoop, Spark
>>> and the host servers to make sure nothing was lingering in the background.
>>>
>>> Mich, thank you so much, your suggestion worked. Storing the tables as
>>> Parquet solves the issue.
>>>
>>> Interestingly, I found that only the MemberEnrollment table needs to be
>>> Parquet. The ID field in MemberEnrollment is an int calculated during load
>>> by a ROW_NUMBER() function. Further testing found that if I hard code a 0
>>> as MemberEnrollment.ID instead of using the ROW_NUMBER() function, the
>>> query works without issue even if both tables are ORC.
>>>
>>> Should I infer from this issue that the Hive components prefer Parquet
>>> over ORC? Furthermore, should I consider using a different table storage
>>> framework, like Delta Lake, instead of the Hive components? Given this
>>> issue and other issues I've had with Hive, I'm starting to think a
>>> different solution might be more robust and stable. The main condition is
>>> that my application operates solely through Thrift server, so I need to be
>>> able to connect to Spark through Thrift server and have it write tables
>>> using Delta Lake instead of Hive. From this StackOverflow question, it
>>> looks like this is possible:
>>> https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect-to-delta-using-jdbc
>>>
>>> Thanks again to everyone who replied for their h

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
Hi Mich,

Thanks for the feedback. My original intention after reading your response
was to stick to Hive for managing tables. Unfortunately, I'm running into
another case of SQL scripts hanging. Since all tables are already Parquet,
I'm out of troubleshooting options. I'm going to migrate to Delta Lake and
see if that solves the issue.

Thanks again for your feedback.

Patrick

On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh 
wrote:

> Hi Patrick,
>
> There is nothing wrong with Hive on-premise; it is the best data
> warehouse there is.
>
> Hive handles both ORC and Parquet formats well. They are both columnar
> implementations of the relational model. What you are seeing is the Spark API
> to Hive, which prefers Parquet. I found this out a few years ago.
>
> From your point of view I suggest you stick to the Parquet format with Hive
> specific to Spark. As far as I know you don't have a fully independent Hive
> DB as yet.
>
> Anyway, stick to Hive for now, as you never know what issues you may
> face when moving to Delta Lake.
>
> You can also use compression
>
> STORED AS PARQUET
> TBLPROPERTIES ("parquet.compression"="SNAPPY")
>
> ALSO
>
> ANALYZE TABLE  COMPUTE STATISTICS FOR COLUMNS
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 11 Aug 2023 at 11:26, Patrick Tucci 
> wrote:
>
>> Thanks for the reply Stephen and Mich.
>>
>> Stephen, you're right, it feels like Spark is waiting for something, but
>> I'm not sure what. I'm the only user on the cluster and there are plenty of
>> resources (+60 cores, +250GB RAM). I even tried restarting Hadoop, Spark
>> and the host servers to make sure nothing was lingering in the background.
>>
>> Mich, thank you so much, your suggestion worked. Storing the tables as
>> Parquet solves the issue.
>>
>> Interestingly, I found that only the MemberEnrollment table needs to be
>> Parquet. The ID field in MemberEnrollment is an int calculated during load
>> by a ROW_NUMBER() function. Further testing found that if I hard code a 0
>> as MemberEnrollment.ID instead of using the ROW_NUMBER() function, the
>> query works without issue even if both tables are ORC.
>>
>> Should I infer from this issue that the Hive components prefer Parquet
>> over ORC? Furthermore, should I consider using a different table storage
>> framework, like Delta Lake, instead of the Hive components? Given this
>> issue and other issues I've had with Hive, I'm starting to think a
>> different solution might be more robust and stable. The main condition is
>> that my application operates solely through Thrift server, so I need to be
>> able to connect to Spark through Thrift server and have it write tables
>> using Delta Lake instead of Hive. From this StackOverflow question, it
>> looks like this is possible:
>> https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect-to-delta-using-jdbc
>>
>> Thanks again to everyone who replied for their help.
>>
>> Patrick
>>
>>
>> On Fri, Aug 11, 2023 at 2:14 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Steve may have a valid point. You raised an issue with concurrent writes
>>> before, if I recall correctly. This limitation may be due to the Hive
>>> metastore. By default, Spark uses Apache Derby for its database
>>> persistence. *However it is limited to only one Spark session at any
>>> time for the purposes of metadata storage.*  That may be the cause here
>>> as well. Does this happen if the underlying tables are created as PARQUET
>>> as opposed to ORC?
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzad

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Mich Talebzadeh
Hi Patrick,

There is nothing wrong with Hive on-premise; it is the best data
warehouse there is.

Hive handles both ORC and Parquet formats well. They are both columnar
implementations of the relational model. What you are seeing is the Spark API
to Hive, which prefers Parquet. I found this out a few years ago.

From your point of view I suggest you stick to the Parquet format with Hive
specific to Spark. As far as I know you don't have a fully independent Hive
DB as yet.

Anyway, stick to Hive for now, as you never know what issues you may
face when moving to Delta Lake.

You can also use compression

STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY")

ALSO

ANALYZE TABLE  COMPUTE STATISTICS FOR COLUMNS
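
For example, a minimal sketch that puts the two suggestions together; the table
name claims_parquet and the trimmed column list are placeholders, not from the
original scripts:

-- Hypothetical Parquet table with Snappy compression
CREATE TABLE claims_parquet
(
  ID INT,
  MemberID VARCHAR(50)
)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY");

-- Copy the data across, then gather column statistics for the optimizer
INSERT INTO claims_parquet SELECT ID, MemberID FROM MemberEnrollment;
ANALYZE TABLE claims_parquet COMPUTE STATISTICS FOR COLUMNS ID, MemberID;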

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 11 Aug 2023 at 11:26, Patrick Tucci  wrote:

> Thanks for the reply Stephen and Mich.
>
> Stephen, you're right, it feels like Spark is waiting for something, but
> I'm not sure what. I'm the only user on the cluster and there are plenty of
> resources (+60 cores, +250GB RAM). I even tried restarting Hadoop, Spark
> and the host servers to make sure nothing was lingering in the background.
>
> Mich, thank you so much, your suggestion worked. Storing the tables as
> Parquet solves the issue.
>
> Interestingly, I found that only the MemberEnrollment table needs to be
> Parquet. The ID field in MemberEnrollment is an int calculated during load
> by a ROW_NUMBER() function. Further testing found that if I hard code a 0
> as MemberEnrollment.ID instead of using the ROW_NUMBER() function, the
> query works without issue even if both tables are ORC.
>
> Should I infer from this issue that the Hive components prefer Parquet
> over ORC? Furthermore, should I consider using a different table storage
> framework, like Delta Lake, instead of the Hive components? Given this
> issue and other issues I've had with Hive, I'm starting to think a
> different solution might be more robust and stable. The main condition is
> that my application operates solely through Thrift server, so I need to be
> able to connect to Spark through Thrift server and have it write tables
> using Delta Lake instead of Hive. From this StackOverflow question, it
> looks like this is possible:
> https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect-to-delta-using-jdbc
>
> Thanks again to everyone who replied for their help.
>
> Patrick
>
>
> On Fri, Aug 11, 2023 at 2:14 AM Mich Talebzadeh 
> wrote:
>
>> Steve may have a valid point. You raised an issue with concurrent writes
>> before, if I recall correctly. This limitation may be due to the Hive
>> metastore. By default, Spark uses Apache Derby for its database
>> persistence. *However it is limited to only one Spark session at any
>> time for the purposes of metadata storage.*  That may be the cause here
>> as well. Does this happen if the underlying tables are created as PARQUET
>> as opposed to ORC?
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 11 Aug 2023 at 01:33, Stephen Coy 
>> wrote:
>>
>>> Hi Patrick,
>>>
>>> When this has happened to me in the past (admittedly via spark-submit)
>>> it has been because another job was still running and had already claimed
>>> some of the resources (cores and memory).
>>>
>>> I think this can also happen if your configuration tries to claim
>>> resources that will never be available.
>>>

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-11 Thread Patrick Tucci
Thanks for the reply Stephen and Mich.

Stephen, you're right, it feels like Spark is waiting for something, but
I'm not sure what. I'm the only user on the cluster and there are plenty of
resources (+60 cores, +250GB RAM). I even tried restarting Hadoop, Spark
and the host servers to make sure nothing was lingering in the background.

Mich, thank you so much, your suggestion worked. Storing the tables as
Parquet solves the issue.

Interestingly, I found that only the MemberEnrollment table needs to be
Parquet. The ID field in MemberEnrollment is an int calculated during load
by a ROW_NUMBER() function. Further testing found that if I hard code a 0
as MemberEnrollment.ID instead of using the ROW_NUMBER() function, the
query works without issue even if both tables are ORC.
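
To make that concrete, here is a rough sketch of the two load variants described;
the staging table name, the ORDER BY column and the trimmed column list are
guesses, not the real load script:

-- Load as described: ID computed with ROW_NUMBER(); the later join then hangs
-- when both tables are stored as ORC
INSERT INTO MemberEnrollment (ID, MemberID, StartDate, EndDate)
SELECT ROW_NUMBER() OVER (ORDER BY MemberID) AS ID,
       MemberID, StartDate, EndDate
FROM member_enrollment_stage;

-- Workaround found in testing: hard-code ID to 0; the later join then completes
-- even with ORC tables
INSERT INTO MemberEnrollment (ID, MemberID, StartDate, EndDate)
SELECT 0 AS ID, MemberID, StartDate, EndDate
FROM member_enrollment_stage;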

Should I infer from this issue that the Hive components prefer Parquet over
ORC? Furthermore, should I consider using a different table storage
framework, like Delta Lake, instead of the Hive components? Given this
issue and other issues I've had with Hive, I'm starting to think a
different solution might be more robust and stable. The main condition is
that my application operates solely through Thrift server, so I need to be
able to connect to Spark through Thrift server and have it write tables
using Delta Lake instead of Hive. From this StackOverflow question, it
looks like this is possible:
https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect-to-delta-using-jdbc
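
If that pans out, the usage through beeline would be something like the sketch
below, assuming the Thrift server is launched with the Delta Lake package and
the Delta SQL extension/catalog settings described in that answer; the table
name is just a placeholder:

-- Hypothetical: convert an existing table to Delta via CTAS through the Thrift server
CREATE TABLE MemberEnrollment_delta
USING delta
AS SELECT * FROM MemberEnrollment;

SELECT COUNT(*) FROM MemberEnrollment_delta;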

Thanks again to everyone who replied for their help.

Patrick


On Fri, Aug 11, 2023 at 2:14 AM Mich Talebzadeh 
wrote:

> Steve may have a valid point. You raised an issue with concurrent writes
> before, if I recall correctly. This limitation may be due to the Hive
> metastore. By default, Spark uses Apache Derby for its database
> persistence. *However it is limited to only one Spark session at any time
> for the purposes of metadata storage.*  That may be the cause here as
> well. Does this happen if the underlying tables are created as PARQUET as
> opposed to ORC?
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 11 Aug 2023 at 01:33, Stephen Coy 
> wrote:
>
>> Hi Patrick,
>>
>> When this has happened to me in the past (admittedly via spark-submit) it
>> has been because another job was still running and had already claimed some
>> of the resources (cores and memory).
>>
>> I think this can also happen if your configuration tries to claim
>> resources that will never be available.
>>
>> Cheers,
>>
>> SteveC
>>
>>
>> On 11 Aug 2023, at 3:36 am, Patrick Tucci 
>> wrote:
>>
>> Hello,
>>
>> I'm attempting to run a query on Spark 3.4.0 through the Spark
>> ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
>> standalone mode using HDFS for storage.
>>
>> The query is as follows:
>>
>> SELECT ME.*, MB.BenefitID
>> FROM MemberEnrollment ME
>> JOIN MemberBenefits MB
>> ON ME.ID = MB.EnrollmentID
>> WHERE MB.BenefitID = 5
>> LIMIT 10
>>
>> The tables are defined as follows:
>>
>> -- Contains about 3M rows
>> CREATE TABLE MemberEnrollment
>> (
>> ID INT
>> , MemberID VARCHAR(50)
>> , StartDate DATE
>> , EndDate DATE
>> -- Other columns, but these are the most important
>> ) STORED AS ORC;
>>
>> -- Contains about 25m rows
>> CREATE TABLE MemberBenefits
>> (
>> EnrollmentID INT
>> , BenefitID INT
>> ) STORED AS ORC;
>>
>> When I execute the query, it runs a single broadcast exchange stage,
>> which completes after a few seconds. Then everything just hangs. The
>> JDBC/ODBC tab in the UI shows the query state as COMPILED, but no stages or
>> tasks are executing or pending:
>>
>> 
>>
>> I've let the query run for as long as 30 minutes with no additional
>> stages, progress, or errors. I'm not sure where to start troubleshooting.
>>
>> Thanks for your help,
>>
>> Patrick
>>

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Steve may have a valid point. You raised an issue with concurrent writes
before, if I recall correctly. This limitation may be due to the Hive
metastore. By default, Spark uses Apache Derby for its database
persistence. *However
it is limited to only one Spark session at any time for the purposes of
metadata storage.*  That may be the cause here as well. Does this happen if
the underlying tables are created as PARQUET as opposed to ORC?

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 11 Aug 2023 at 01:33, Stephen Coy 
wrote:

> Hi Patrick,
>
> When this has happened to me in the past (admittedly via spark-submit) it
> has been because another job was still running and had already claimed some
> of the resources (cores and memory).
>
> I think this can also happen if your configuration tries to claim
> resources that will never be available.
>
> Cheers,
>
> SteveC
>
>
> On 11 Aug 2023, at 3:36 am, Patrick Tucci  wrote:
>
> Hello,
>
> I'm attempting to run a query on Spark 3.4.0 through the Spark
> ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
> standalone mode using HDFS for storage.
>
> The query is as follows:
>
> SELECT ME.*, MB.BenefitID
> FROM MemberEnrollment ME
> JOIN MemberBenefits MB
> ON ME.ID  = MB.EnrollmentID
> WHERE MB.BenefitID = 5
> LIMIT 10
>
> The tables are defined as follows:
>
> -- Contains about 3M rows
> CREATE TABLE MemberEnrollment
> (
> ID INT
> , MemberID VARCHAR(50)
> , StartDate DATE
> , EndDate DATE
> -- Other columns, but these are the most important
> ) STORED AS ORC;
>
> -- Contains about 25m rows
> CREATE TABLE MemberBenefits
> (
> EnrollmentID INT
> , BenefitID INT
> ) STORED AS ORC;
>
> When I execute the query, it runs a single broadcast exchange stage, which
> completes after a few seconds. Then everything just hangs. The JDBC/ODBC
> tab in the UI shows the query state as COMPILED, but no stages or tasks are
> executing or pending:
>
> 
>
> I've let the query run for as long as 30 minutes with no additional
> stages, progress, or errors. I'm not sure where to start troubleshooting.
>
> Thanks for your help,
>
> Patrick
>
>
> This email contains confidential information of and is the copyright of
> Infomedia. It must not be forwarded, amended or disclosed without consent
> of the sender. If you received this message by mistake, please advise the
> sender and delete all copies. Security of transmission on the internet
> cannot be guaranteed, could be infected, intercepted, or corrupted and you
> should ensure you have suitable antivirus protection in place. By sending
> us your or any third party personal details, you consent to (or confirm you
> have obtained consent from such third parties) to Infomedia’s privacy
> policy. http://www.infomedia.com.au/privacy-policy/
>


Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Stephen Coy
Hi Patrick,

When this has happened to me in the past (admittedly via spark-submit) it has 
been because another job was still running and had already claimed some of the 
resources (cores and memory).

I think this can also happen if your configuration tries to claim resources 
that will never be available.

Cheers,

SteveC


On 11 Aug 2023, at 3:36 am, Patrick Tucci  wrote:

Hello,

I'm attempting to run a query on Spark 3.4.0 through the Spark ThriftServer. 
The cluster has 64 cores, 250GB RAM, and operates in standalone mode using HDFS 
for storage.

The query is as follows:

SELECT ME.*, MB.BenefitID
FROM MemberEnrollment ME
JOIN MemberBenefits MB
ON ME.ID = MB.EnrollmentID
WHERE MB.BenefitID = 5
LIMIT 10

The tables are defined as follows:

-- Contains about 3M rows
CREATE TABLE MemberEnrollment
(
ID INT
, MemberID VARCHAR(50)
, StartDate DATE
, EndDate DATE
-- Other columns, but these are the most important
) STORED AS ORC;

-- Contains about 25m rows
CREATE TABLE MemberBenefits
(
EnrollmentID INT
, BenefitID INT
) STORED AS ORC;

When I execute the query, it runs a single broadcast exchange stage, which 
completes after a few seconds. Then everything just hangs. The JDBC/ODBC tab in 
the UI shows the query state as COMPILED, but no stages or tasks are executing 
or pending:



I've let the query run for as long as 30 minutes with no additional stages, 
progress, or errors. I'm not sure where to start troubleshooting.

Thanks for your help,

Patrick

This email contains confidential information of and is the copyright of 
Infomedia. It must not be forwarded, amended or disclosed without consent of 
the sender. If you received this message by mistake, please advise the sender 
and delete all copies. Security of transmission on the internet cannot be 
guaranteed, could be infected, intercepted, or corrupted and you should ensure 
you have suitable antivirus protection in place. By sending us your or any 
third party personal details, you consent to (or confirm you have obtained 
consent from such third parties) to Infomedia's privacy policy. 
http://www.infomedia.com.au/privacy-policy/


Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hi Mich,

I don't believe Hive is installed. I set up this cluster from scratch. I
installed Hadoop and Spark by downloading them from their project websites.
If Hive isn't bundled with Hadoop or Spark, I don't believe I have it. I'm
running the Thrift server distributed with Spark, like so:

~/spark/sbin/start-thriftserver.sh --master spark://10.0.50.1:7077

I can look into installing Hive, but it might take some time. I tried to
set up Hive when I first started evaluating distributed data processing
solutions, but I encountered many issues. Spark was much simpler, which was
part of the reason why I chose it.

Thanks again for the reply, I truly appreciate your help.

Patrick

On Thu, Aug 10, 2023 at 3:43 PM Mich Talebzadeh 
wrote:

> sorry host is 10.0.50.1
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 10 Aug 2023 at 20:41, Mich Talebzadeh 
> wrote:
>
>> Hi Patrick
>>
>> That beeline on port 1 is a hive thrift server running on your hive
>> on host 10.0.50.1:1.
>>
>> if you can access that host, you should be able to log into hive by
>> typing hive. The os user is hadoop in your case and sounds like there is no
>> password!
>>
>> Once inside that host, hive logs are kept in your case
>> /tmp/hadoop/hive.log or go to /tmp and do
>>
>> /tmp> find ./ -name hive.log. It should be under /tmp/hive.log
>>
>> Try running the sql inside hive and see what it says
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 10 Aug 2023 at 20:02, Patrick Tucci 
>> wrote:
>>
>>> Hi Mich,
>>>
>>> Thanks for the reply. Unfortunately I don't have Hive set up on my
>>> cluster. I can explore this if there are no other ways to troubleshoot.
>>>
>>> I'm using beeline to run commands against the Thrift server. Here's the
>>> command I use:
>>>
>>> ~/spark/bin/beeline -u jdbc:hive2://10.0.50.1:1 -n hadoop -f
>>> command.sql
>>>
>>> Thanks again for your help.
>>>
>>> Patrick
>>>
>>>
>>> On Thu, Aug 10, 2023 at 2:24 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Can you run this sql query through hive itself?

 Are you using this command or similar for your thrift server?

 beeline -u jdbc:hive2:///1/default
 org.apache.hive.jdbc.HiveDriver -n hadoop -p xxx

 HTH

 Mich Talebzadeh,
 Solutions Architect/Engineering Lead
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Thu, 10 Aug 2023 at 18:39, Patrick Tucci 
 wrote:

> Hello,
>
> I'm attempting to run a query on Spark 3.4.0 through the Spark
> ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
> standalone mode using HDFS for storage.
>
> The query is as follows:
>
> SELECT ME.*, MB.BenefitID
> FROM MemberEnrollment ME
> JOIN MemberBenefits MB
> ON ME.ID = MB.EnrollmentID
> WHERE MB.BenefitID = 5
> LIMIT 10
>
> The tables are defined as follows:
>
> -- Contains about 3M rows
> CREATE TABLE MemberEnrollment
> (
> ID INT
> , MemberID VARCHAR(50)
> , StartDate DATE
> , EndDate DATE
> -- Other columns, but these are the most important
> ) STORED AS ORC;
>
> -- Contains about 25m rows
> CREATE TABLE MemberBenefits
> (
> EnrollmentID INT
> , BenefitID INT
> ) STORED AS ORC;
>
> W

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
sorry host is 10.0.50.1

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 10 Aug 2023 at 20:41, Mich Talebzadeh 
wrote:

> Hi Patrick
>
> That beeline on port 1 is a hive thrift server running on your hive on
> host 10.0.50.1:1.
>
> if you can access that host, you should be able to log into hive by typing
> hive. The os user is hadoop in your case and sounds like there is no
> password!
>
> Once inside that host, hive logs are kept in your case
> /tmp/hadoop/hive.log or go to /tmp and do
>
> /tmp> find ./ -name hive.log. It should be under /tmp/hive.log
>
> Try running the sql inside hive and see what it says
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 10 Aug 2023 at 20:02, Patrick Tucci 
> wrote:
>
>> Hi Mich,
>>
>> Thanks for the reply. Unfortunately I don't have Hive set up on my
>> cluster. I can explore this if there are no other ways to troubleshoot.
>>
>> I'm using beeline to run commands against the Thrift server. Here's the
>> command I use:
>>
>> ~/spark/bin/beeline -u jdbc:hive2://10.0.50.1:1 -n hadoop -f
>> command.sql
>>
>> Thanks again for your help.
>>
>> Patrick
>>
>>
>> On Thu, Aug 10, 2023 at 2:24 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Can you run this sql query through hive itself?
>>>
>>> Are you using this command or similar for your thrift server?
>>>
>>> beeline -u jdbc:hive2:///1/default
>>> org.apache.hive.jdbc.HiveDriver -n hadoop -p xxx
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 10 Aug 2023 at 18:39, Patrick Tucci 
>>> wrote:
>>>
 Hello,

 I'm attempting to run a query on Spark 3.4.0 through the Spark
 ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
 standalone mode using HDFS for storage.

 The query is as follows:

 SELECT ME.*, MB.BenefitID
 FROM MemberEnrollment ME
 JOIN MemberBenefits MB
 ON ME.ID = MB.EnrollmentID
 WHERE MB.BenefitID = 5
 LIMIT 10

 The tables are defined as follows:

 -- Contains about 3M rows
 CREATE TABLE MemberEnrollment
 (
 ID INT
 , MemberID VARCHAR(50)
 , StartDate DATE
 , EndDate DATE
 -- Other columns, but these are the most important
 ) STORED AS ORC;

 -- Contains about 25m rows
 CREATE TABLE MemberBenefits
 (
 EnrollmentID INT
 , BenefitID INT
 ) STORED AS ORC;

 When I execute the query, it runs a single broadcast exchange stage,
 which completes after a few seconds. Then everything just hangs. The
 JDBC/ODBC tab in the UI shows the query state as COMPILED, but no stages or
 tasks are executing or pending:

 [image: image.png]

 I've let the query run for as long as 30 minutes with no additional
 stages, progress, or errors. I'm not sure where to start troubleshooting.

 Thanks for your help,

 Patrick

>>>


Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Hi Patrick

That beeline on port 1 is a hive thrift server running on your hive on
host 10.0.50.1:1.

if you can access that host, you should be able to log into hive by typing
hive. The os user is hadoop in your case and sounds like there is no
password!

Once inside that host, Hive logs are kept, in your case, at /tmp/hadoop/hive.log,
or go to /tmp and do

/tmp> find ./ -name hive.log

It should be under /tmp/hive.log.

Try running the SQL inside Hive and see what it says.

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 10 Aug 2023 at 20:02, Patrick Tucci  wrote:

> Hi Mich,
>
> Thanks for the reply. Unfortunately I don't have Hive set up on my
> cluster. I can explore this if there are no other ways to troubleshoot.
>
> I'm using beeline to run commands against the Thrift server. Here's the
> command I use:
>
> ~/spark/bin/beeline -u jdbc:hive2://10.0.50.1:1 -n hadoop -f
> command.sql
>
> Thanks again for your help.
>
> Patrick
>
>
> On Thu, Aug 10, 2023 at 2:24 PM Mich Talebzadeh 
> wrote:
>
>> Can you run this sql query through hive itself?
>>
>> Are you using this command or similar for your thrift server?
>>
>> beeline -u jdbc:hive2:///1/default
>> org.apache.hive.jdbc.HiveDriver -n hadoop -p xxx
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 10 Aug 2023 at 18:39, Patrick Tucci 
>> wrote:
>>
>>> Hello,
>>>
>>> I'm attempting to run a query on Spark 3.4.0 through the Spark
>>> ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
>>> standalone mode using HDFS for storage.
>>>
>>> The query is as follows:
>>>
>>> SELECT ME.*, MB.BenefitID
>>> FROM MemberEnrollment ME
>>> JOIN MemberBenefits MB
>>> ON ME.ID = MB.EnrollmentID
>>> WHERE MB.BenefitID = 5
>>> LIMIT 10
>>>
>>> The tables are defined as follows:
>>>
>>> -- Contains about 3M rows
>>> CREATE TABLE MemberEnrollment
>>> (
>>> ID INT
>>> , MemberID VARCHAR(50)
>>> , StartDate DATE
>>> , EndDate DATE
>>> -- Other columns, but these are the most important
>>> ) STORED AS ORC;
>>>
>>> -- Contains about 25m rows
>>> CREATE TABLE MemberBenefits
>>> (
>>> EnrollmentID INT
>>> , BenefitID INT
>>> ) STORED AS ORC;
>>>
>>> When I execute the query, it runs a single broadcast exchange stage,
>>> which completes after a few seconds. Then everything just hangs. The
>>> JDBC/ODBC tab in the UI shows the query state as COMPILED, but no stages or
>>> tasks are executing or pending:
>>>
>>> [image: image.png]
>>>
>>> I've let the query run for as long as 30 minutes with no additional
>>> stages, progress, or errors. I'm not sure where to start troubleshooting.
>>>
>>> Thanks for your help,
>>>
>>> Patrick
>>>
>>


Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hi Mich,

Thanks for the reply. Unfortunately I don't have Hive set up on my cluster.
I can explore this if there are no other ways to troubleshoot.

I'm using beeline to run commands against the Thrift server. Here's the
command I use:

~/spark/bin/beeline -u jdbc:hive2://10.0.50.1:1 -n hadoop -f command.sql

Thanks again for your help.

Patrick


On Thu, Aug 10, 2023 at 2:24 PM Mich Talebzadeh 
wrote:

> Can you run this sql query through hive itself?
>
> Are you using this command or similar for your thrift server?
>
> beeline -u jdbc:hive2:///1/default
> org.apache.hive.jdbc.HiveDriver -n hadoop -p xxx
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 10 Aug 2023 at 18:39, Patrick Tucci 
> wrote:
>
>> Hello,
>>
>> I'm attempting to run a query on Spark 3.4.0 through the Spark
>> ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
>> standalone mode using HDFS for storage.
>>
>> The query is as follows:
>>
>> SELECT ME.*, MB.BenefitID
>> FROM MemberEnrollment ME
>> JOIN MemberBenefits MB
>> ON ME.ID = MB.EnrollmentID
>> WHERE MB.BenefitID = 5
>> LIMIT 10
>>
>> The tables are defined as follows:
>>
>> -- Contains about 3M rows
>> CREATE TABLE MemberEnrollment
>> (
>> ID INT
>> , MemberID VARCHAR(50)
>> , StartDate DATE
>> , EndDate DATE
>> -- Other columns, but these are the most important
>> ) STORED AS ORC;
>>
>> -- Contains about 25m rows
>> CREATE TABLE MemberBenefits
>> (
>> EnrollmentID INT
>> , BenefitID INT
>> ) STORED AS ORC;
>>
>> When I execute the query, it runs a single broadcast exchange stage,
>> which completes after a few seconds. Then everything just hangs. The
>> JDBC/ODBC tab in the UI shows the query state as COMPILED, but no stages or
>> tasks are executing or pending:
>>
>> [image: image.png]
>>
>> I've let the query run for as long as 30 minutes with no additional
>> stages, progress, or errors. I'm not sure where to start troubleshooting.
>>
>> Thanks for your help,
>>
>> Patrick
>>
>


Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Mich Talebzadeh
Can you run this sql query through hive itself?

Are you using this command or similar for your thrift server?

beeline -u jdbc:hive2:///1/default
org.apache.hive.jdbc.HiveDriver -n hadoop -p xxx

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 10 Aug 2023 at 18:39, Patrick Tucci  wrote:

> Hello,
>
> I'm attempting to run a query on Spark 3.4.0 through the Spark
> ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
> standalone mode using HDFS for storage.
>
> The query is as follows:
>
> SELECT ME.*, MB.BenefitID
> FROM MemberEnrollment ME
> JOIN MemberBenefits MB
> ON ME.ID = MB.EnrollmentID
> WHERE MB.BenefitID = 5
> LIMIT 10
>
> The tables are defined as follows:
>
> -- Contains about 3M rows
> CREATE TABLE MemberEnrollment
> (
> ID INT
> , MemberID VARCHAR(50)
> , StartDate DATE
> , EndDate DATE
> -- Other columns, but these are the most important
> ) STORED AS ORC;
>
> -- Contains about 25m rows
> CREATE TABLE MemberBenefits
> (
> EnrollmentID INT
> , BenefitID INT
> ) STORED AS ORC;
>
> When I execute the query, it runs a single broadcast exchange stage, which
> completes after a few seconds. Then everything just hangs. The JDBC/ODBC
> tab in the UI shows the query state as COMPILED, but no stages or tasks are
> executing or pending:
>
> [image: image.png]
>
> I've let the query run for as long as 30 minutes with no additional
> stages, progress, or errors. I'm not sure where to start troubleshooting.
>
> Thanks for your help,
>
> Patrick
>


Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Patrick Tucci
Hello,

I'm attempting to run a query on Spark 3.4.0 through the Spark
ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
standalone mode using HDFS for storage.

The query is as follows:

SELECT ME.*, MB.BenefitID
FROM MemberEnrollment ME
JOIN MemberBenefits MB
ON ME.ID = MB.EnrollmentID
WHERE MB.BenefitID = 5
LIMIT 10

The tables are defined as follows:

-- Contains about 3M rows
CREATE TABLE MemberEnrollment
(
ID INT
, MemberID VARCHAR(50)
, StartDate DATE
, EndDate DATE
-- Other columns, but these are the most important
) STORED AS ORC;

-- Contains about 25m rows
CREATE TABLE MemberBenefits
(
EnrollmentID INT
, BenefitID INT
) STORED AS ORC;

When I execute the query, it runs a single broadcast exchange stage, which
completes after a few seconds. Then everything just hangs. The JDBC/ODBC
tab in the UI shows the query state as COMPILED, but no stages or tasks are
executing or pending:

[image: image.png]

I've let the query run for as long as 30 minutes with no additional stages,
progress, or errors. I'm not sure where to start troubleshooting.

Thanks for your help,

Patrick


Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Mich Talebzadeh
OK, so as expected the underlying database is Hive. Hive uses HDFS storage.

You said you encountered limitations on concurrent writes. The ordering and
limitations are introduced by the Hive metastore, so to speak. Since this is all
happening through Spark, the default implementation of the Hive metastore
<https://hive.apache.org/#:~:text=The%20Hive%20Metastore%20(HMS)%20is,using%20the%20metastore%20service%20API.>
in Apache Spark uses Apache Derby for its database persistence. This is
available with no configuration required, as in your case. *However it is
limited to only one Spark session at any time for the purposes of metadata
storage.* This obviously makes it unsuitable for use in multi-concurrency
situations, as you observed. For an industrial-strength backend Hive metastore
database, you should consider a multi-user, ACID-compliant relational
database product for hosting the metastore. Any current RDBMS should do. I
use Oracle 12g myself and others use MySQL or PostgreSQL for this purpose.

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 30 Jul 2023 at 13:46, Patrick Tucci  wrote:

> Hi Mich and Pol,
>
> Thanks for the feedback. The database layer is Hadoop 3.3.5. The cluster
> restarted so I lost the stack trace in the application UI. In the snippets
> I saved, it looks like the exception being thrown was from Hive. Given the
> feedback you've provided, I suspect the issue is with how the Hive
> components are handling concurrent writes.
>
> While using a different format would likely help with this issue, I think
> I have found an easier solution for now. Currently I have many individual
> scripts that perform logic and insert the results separately. Instead of
> each script performing an insert, each script can instead create a view.
> After the views are created, one single script can perform one single
> INSERT, combining the views with UNION ALL statements.
>
> -- Old logic --
> -- Script 1
> INSERT INTO EventClaims
> /*Long, complicated query 1*/
>
> -- Script N
> INSERT INTO EventClaims
> /*Long, complicated query N*/
>
> -- New logic --
> -- Script 1
> CREATE VIEW Q1 AS
> /*Long, complicated query 1*/
>
> -- Script N
> CREATE VIEW QN AS
> /*Long, complicated query N*/
>
> -- Final script --
> INSERT INTO EventClaims
> SELECT * FROM Q1 UNION ALL
> SELECT * FROM QN
>
> The old approach had almost two dozen stages with relatively fewer tasks.
> The new approach requires only 3 stages. With fewer stages and more tasks,
> cluster utilization is much higher.
>
> Thanks again for your feedback. I suspect better concurrent writes will be
> valuable for my project in the future, so this is good information to have
> ready.
>
> Thanks,
>
> Patrick
>
> On Sun, Jul 30, 2023 at 5:30 AM Pol Santamaria  wrote:
>
>> Hi Patrick,
>>
>> You can have multiple writers simultaneously writing to the same table in
>> HDFS by utilizing an open table format with concurrency control. Several
>> formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast
>> Format, offer this capability. All of them provide advanced features that
>> will work better in different use cases according to the writing pattern,
>> type of queries, data characteristics, etc.
>>
>> *Pol Santamaria*
>>
>>
>> On Sat, Jul 29, 2023 at 4:28 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> It is not Spark SQL that throws the error. It is the underlying Database
>>> or layer that throws the error.
>>>
>>> Spark acts as an ETL tool. What is the underlying DB where the table
>>> resides? Is concurrency supported? Please send the error to this list.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and al

Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Patrick Tucci
Hi Mich and Pol,

Thanks for the feedback. The database layer is Hadoop 3.3.5. The cluster
restarted so I lost the stack trace in the application UI. In the snippets
I saved, it looks like the exception being thrown was from Hive. Given the
feedback you've provided, I suspect the issue is with how the Hive
components are handling concurrent writes.

While using a different format would likely help with this issue, I think I
have found an easier solution for now. Currently I have many individual
scripts that perform logic and insert the results separately. Instead of
each script performing an insert, each script can instead create a view.
After the views are created, one single script can perform one single
INSERT, combining the views with UNION ALL statements.

-- Old logic --
-- Script 1
INSERT INTO EventClaims
/*Long, complicated query 1*/

-- Script N
INSERT INTO EventClaims
/*Long, complicated query N*/

-- New logic --
-- Script 1
CREATE VIEW Q1 AS
/*Long, complicated query 1*/

-- Script N
CREATE VIEW QN AS
/*Long, complicated query N*/

-- Final script --
INSERT INTO EventClaims
SELECT * FROM Q1 UNION ALL
SELECT * FROM QN

The old approach had almost two dozen stages with relatively fewer tasks.
The new approach requires only 3 stages. With fewer stages and more tasks,
cluster utilization is much higher.
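
Filled in with hypothetical view bodies (the real queries are much longer and the
source tables and columns here are placeholders), the pattern looks like:

-- Script 1
CREATE OR REPLACE VIEW Q1 AS
SELECT ClaimID, EventDate FROM ClaimsSourceA WHERE EventDate >= '2023-01-01';

-- Script N
CREATE OR REPLACE VIEW QN AS
SELECT ClaimID, EventDate FROM ClaimsSourceB WHERE EventDate >= '2023-01-01';

-- Final script: the only writer to EventClaims
INSERT INTO EventClaims
SELECT * FROM Q1
UNION ALL
SELECT * FROM QN;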

Thanks again for your feedback. I suspect better concurrent writes will be
valuable for my project in the future, so this is good information to have
ready.

Thanks,

Patrick

On Sun, Jul 30, 2023 at 5:30 AM Pol Santamaria  wrote:

> Hi Patrick,
>
> You can have multiple writers simultaneously writing to the same table in
> HDFS by utilizing an open table format with concurrency control. Several
> formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast
> Format, offer this capability. All of them provide advanced features that
> will work better in different use cases according to the writing pattern,
> type of queries, data characteristics, etc.
>
> *Pol Santamaria*
>
>
> On Sat, Jul 29, 2023 at 4:28 PM Mich Talebzadeh 
> wrote:
>
>> It is not Spark SQL that throws the error. It is the underlying Database
>> or layer that throws the error.
>>
>> Spark acts as an ETL tool. What is the underlying DB where the table
>> resides? Is concurrency supported? Please send the error to this list.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 29 Jul 2023 at 12:02, Patrick Tucci 
>> wrote:
>>
>>> Hello,
>>>
>>> I'm building an application on Spark SQL. The cluster is set up in
>>> standalone mode with HDFS as storage. The only Spark application running is
>>> the Spark Thrift Server using FAIR scheduling mode. Queries are submitted
>>> to Thrift Server using beeline.
>>>
>>> I have multiple queries that insert rows into the same table
>>> (EventClaims). These queries work fine when run sequentially, however, some
>>> individual queries don't fully utilize the resources available on the
>>> cluster. I would like to run all of these queries concurrently to more
>>> fully utilize available resources. When I attempt to do this, tasks
>>> eventually begin to fail. The stack trace is pretty long, but here's what
>>> looks like the most relevant parts:
>>>
>>>
>>> org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:788)
>>>
>>> org.apache.hive.service.cli.HiveSQLException: Error running query:
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 28
>>> in stage 128.0 failed 4 times, most recent failure: Lost task 28.3 in stage
>>> 128.0 (TID 6578) (10.0.50.2 executor 0): org.apache.spark.SparkException:
>>> [TASK_WRITE_FAILED] Task failed while writing rows to hdfs://
>>> 10.0.50.1:8020/user/spark/warehouse/eventclaims.
>>>
>>> Is it possible to have multiple concurrent writers to the same table
>>> with Spark SQL? Is there any way to make this work?
>>>
>>> Thanks for the help.
>>>
>>> Patrick
>>>
>>


Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-30 Thread Pol Santamaria
Hi Patrick,

You can have multiple writers simultaneously writing to the same table in
HDFS by utilizing an open table format with concurrency control. Several
formats, such as Apache Hudi, Apache Iceberg, Delta Lake, and Qbeast
Format, offer this capability. All of them provide advanced features that
will work better in different use cases according to the writing pattern,
type of queries, data characteristics, etc.
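
For example, with Delta Lake the table below could take appends from several
beeline sessions at once. This is a sketch only, assuming the Delta Lake package
and its SQL extensions are configured on the Thrift Server; the table name and
columns are placeholders:

CREATE TABLE EventClaims_delta (ClaimID INT, EventDate DATE)
USING delta;

-- Each writer session can run its own append; the format's optimistic
-- commit protocol reconciles concurrent appends.
INSERT INTO EventClaims_delta VALUES (1, DATE '2023-07-29');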

*Pol Santamaria*


On Sat, Jul 29, 2023 at 4:28 PM Mich Talebzadeh 
wrote:

> It is not Spark SQL that throws the error. It is the underlying Database
> or layer that throws the error.
>
> Spark acts as an ETL tool. What is the underlying DB where the table
> resides? Is concurrency supported? Please send the error to this list.
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 29 Jul 2023 at 12:02, Patrick Tucci 
> wrote:
>
>> Hello,
>>
>> I'm building an application on Spark SQL. The cluster is set up in
>> standalone mode with HDFS as storage. The only Spark application running is
>> the Spark Thrift Server using FAIR scheduling mode. Queries are submitted
>> to Thrift Server using beeline.
>>
>> I have multiple queries that insert rows into the same table
>> (EventClaims). These queries work fine when run sequentially, however, some
>> individual queries don't fully utilize the resources available on the
>> cluster. I would like to run all of these queries concurrently to more
>> fully utilize available resources. When I attempt to do this, tasks
>> eventually begin to fail. The stack trace is pretty long, but here's what
>> looks like the most relevant parts:
>>
>>
>> org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:788)
>>
>> org.apache.hive.service.cli.HiveSQLException: Error running query:
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 28
>> in stage 128.0 failed 4 times, most recent failure: Lost task 28.3 in stage
>> 128.0 (TID 6578) (10.0.50.2 executor 0): org.apache.spark.SparkException:
>> [TASK_WRITE_FAILED] Task failed while writing rows to hdfs://
>> 10.0.50.1:8020/user/spark/warehouse/eventclaims.
>>
>> Is it possible to have multiple concurrent writers to the same table with
>> Spark SQL? Is there any way to make this work?
>>
>> Thanks for the help.
>>
>> Patrick
>>
>


Re: Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Mich Talebzadeh
It is not Spark SQL that throws the error. It is the underlying Database or
layer that throws the error.

Spark acts as an ETL tool. What is the underlying DB where the table
resides? Is concurrency supported? Please send the error to this list.

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 29 Jul 2023 at 12:02, Patrick Tucci  wrote:

> Hello,
>
> I'm building an application on Spark SQL. The cluster is set up in
> standalone mode with HDFS as storage. The only Spark application running is
> the Spark Thrift Server using FAIR scheduling mode. Queries are submitted
> to Thrift Server using beeline.
>
> I have multiple queries that insert rows into the same table
> (EventClaims). These queries work fine when run sequentially, however, some
> individual queries don't fully utilize the resources available on the
> cluster. I would like to run all of these queries concurrently to more
> fully utilize available resources. When I attempt to do this, tasks
> eventually begin to fail. The stack trace is pretty long, but here's what
> looks like the most relevant parts:
>
>
> org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:788)
>
> org.apache.hive.service.cli.HiveSQLException: Error running query:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 28
> in stage 128.0 failed 4 times, most recent failure: Lost task 28.3 in stage
> 128.0 (TID 6578) (10.0.50.2 executor 0): org.apache.spark.SparkException:
> [TASK_WRITE_FAILED] Task failed while writing rows to hdfs://
> 10.0.50.1:8020/user/spark/warehouse/eventclaims.
>
> Is it possible to have multiple concurrent writers to the same table with
> Spark SQL? Is there any way to make this work?
>
> Thanks for the help.
>
> Patrick
>


Spark-SQL - Concurrent Inserts Into Same Table Throws Exception

2023-07-29 Thread Patrick Tucci
Hello,

I'm building an application on Spark SQL. The cluster is set up in
standalone mode with HDFS as storage. The only Spark application running is
the Spark Thrift Server using FAIR scheduling mode. Queries are submitted
to Thrift Server using beeline.

I have multiple queries that insert rows into the same table (EventClaims).
These queries work fine when run sequentially, however, some individual
queries don't fully utilize the resources available on the cluster. I would
like to run all of these queries concurrently to more fully utilize
available resources. When I attempt to do this, tasks eventually begin to
fail. The stack trace is pretty long, but here's what looks like the most
relevant parts:

org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:788)

org.apache.hive.service.cli.HiveSQLException: Error running query:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 28
in stage 128.0 failed 4 times, most recent failure: Lost task 28.3 in stage
128.0 (TID 6578) (10.0.50.2 executor 0): org.apache.spark.SparkException:
[TASK_WRITE_FAILED] Task failed while writing rows to hdfs://
10.0.50.1:8020/user/spark/warehouse/eventclaims.

Is it possible to have multiple concurrent writers to the same table with
Spark SQL? Is there any way to make this work?

Thanks for the help.

Patrick


Re: [Spark SQL] Data objects from query history

2023-07-03 Thread Jack Wells
 Hi Ruben,

I’m not sure if this answers your question, but if you’re interested in
exploring the underlying tables, you could always try something like the
below in a Databricks notebook:

display(spark.read.table('samples.nyctaxi.trips'))

(For vanilla Spark users, it would be
spark.read.table('samples.nyctaxi.trips').show(100, False) )

Since you’re using Databricks, you can also find the data under the Data
menu, scroll down to the samples metastore then click through to trips to
find the file location, schema, and sample data.
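
As a rough sketch of another angle on the original question (extracting the tables a query
touches), Spark's own parser can be borrowed instead of maintaining a separate ANTLR grammar.
This relies on internal Catalyst classes (spark.sessionState is marked unstable), so treat it
as an illustration rather than a supported API:

import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation

// Parse the SQL text with Spark's parser and collect every unresolved table reference.
// Note: CTE aliases also show up as UnresolvedRelation, so filter those out if needed.
def referencedTables(sql: String): Seq[String] = {
  val plan = spark.sessionState.sqlParser.parsePlan(sql)
  plan.collect { case r: UnresolvedRelation => r.multipartIdentifier.mkString(".") }
}

referencedTables(
  "SELECT concat(pickup_zip, '-', dropoff_zip) AS route FROM `samples`.`nyctaxi`.`trips` LIMIT 10")
// expected to return something like Seq(samples.nyctaxi.trips)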

On Jun 29, 2023 at 23:53:25, Ruben Mennes  wrote:

> Dear Apache Spark community,
>
> I hope this email finds you well. My name is Ruben, and I am an
> enthusiastic user of Apache Spark, specifically through the Databricks
> platform. I am reaching out to you today to seek your assistance and
> guidance regarding a specific use case.
>
> I have been exploring the capabilities of Spark SQL and Databricks, and I
> have encountered a challenge related to accessing the data objects used by
> queries from the query history. I am aware that Databricks provides a
> comprehensive query history that contains valuable information about
> executed queries.
>
> However, my objective is to extract the underlying data objects (tables)
> involved in each query. By doing so, I aim to analyze and understand the
> dependencies between queries and the data they operate on. This information
> will provide us new insights in how data is used across our data platform.
>
> I have attempted to leverage the Spark SQL Antlr grammar, available at
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4,
> to parse the queries retrieved from the query history. Unfortunately, I
> have encountered difficulties when parsing more complex queries.
>
> As an example, I have struggled to parse queries with intricate constructs
> such as the following:
>
> SELECT
>   concat(pickup_zip, '-', dropoff_zip) as route,
>   AVG(fare_amount) as average_fare
> FROM
>   `samples`.`nyctaxi`.`trips`
> GROUP BY
>   1
> ORDER BY
>   2 DESC
> LIMIT 1000
>
> I would greatly appreciate it if you could provide me with some guidance
> on how to overcome these challenges. Specifically, I am interested in
> understanding if there are alternative approaches or existing tools that
> can help me achieve my goal of extracting the data objects used by queries
> from the Databricks query history.
>
> Additionally, if there are any resources, documentation, or examples that
> provide further clarity on this topic, I would be more than grateful to
> receive them. Any insights you can provide would be of immense help in
> advancing my understanding and enabling me to make the most of the Spark
> SQL and Databricks ecosystem.
>
> Thank you very much for your time and support. I eagerly look forward to
> hearing from you and benefiting from your expertise.
>
> Best regards,
> Ruben Mennes
>


[Spark SQL] Data objects from query history

2023-06-30 Thread Ruben Mennes
Dear Apache Spark community,

I hope this email finds you well. My name is Ruben, and I am an enthusiastic 
user of Apache Spark, specifically through the Databricks platform. I am 
reaching out to you today to seek your assistance and guidance regarding a 
specific use case.

I have been exploring the capabilities of Spark SQL and Databricks, and I have 
encountered a challenge related to accessing the data objects used by queries 
from the query history. I am aware that Databricks provides a comprehensive 
query history that contains valuable information about executed queries.

However, my objective is to extract the underlying data objects (tables) 
involved in each query. By doing so, I aim to analyze and understand the 
dependencies between queries and the data they operate on. This information 
will provide us new insights in how data is used across our data platform.

I have attempted to leverage the Spark SQL Antlr grammar, available at 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4,
 to parse the queries retrieved from the query history. Unfortunately, I have 
encountered difficulties when parsing more complex queries.

As an example, I have struggled to parse queries with intricate constructs such 
as the following:
> SELECT
>   concat(pickup_zip, '-', dropoff_zip) as route,
>   AVG(fare_amount) as average_fare
> FROM
>   `samples`.`nyctaxi`.`trips`
> GROUP BY
>   1
> ORDER BY
>   2 DESC
> LIMIT 1000
I would greatly appreciate it if you could provide me with some guidance on how 
to overcome these challenges. Specifically, I am interested in understanding if 
there are alternative approaches or existing tools that can help me achieve my 
goal of extracting the data objects used by queries from the Databricks query 
history.

Additionally, if there are any resources, documentation, or examples that 
provide further clarity on this topic, I would be more than grateful to receive 
them. Any insights you can provide would be of immense help in advancing my 
understanding and enabling me to make the most of the Spark SQL and Databricks 
ecosystem.

Thank you very much for your time and support. I eagerly look forward to 
hearing from you and benefiting from your expertise.

Best regards,
Ruben Mennes


[Spark-SQL] Dataframe write saveAsTable failed

2023-06-26 Thread Anil Dasari
Hi,

We have recently upgraded Spark from 2.4.x to 3.3.1, and managed table
creation while writing a dataframe with saveAsTable failed with the below error.

Can not create the managed table(``) The associated
location('hdfs:') already exists.

On high level our code does below before writing dataframe as table:

sparkSession.sql(s"DROP TABLE IF EXISTS $hiveTableName PURGE")
mydataframe.write.mode(SaveMode.Overwrite).saveAsTable(hiveTableName)

The above code works with Spark 2 because of
spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation which is
deprecated in Spark 3.

The table is dropped and purged before writing the dataframe, so I expected the
dataframe write not to complain that the path already exists.

After digging further, I noticed there is a `_temporary` folder present in the
HDFS table path.

dfs -ls /apps/hive/warehouse//
Found 1 items
drwxr-xr-x   - hadoop hdfsadmingroup  0 2023-06-23 04:45
/apps/hive/warehouse//_temporary

[root@ip-10-121-107-90 bin]# hdfs dfs -ls
/apps/hive/warehouse//_temporary
Found 1 items
drwxr-xr-x   - hadoop hdfsadmingroup  0 2023-06-23 04:45
/apps/hive/warehouse//_temporary/0

[root@ip-10-121-107-90 bin]# hdfs dfs -ls
/apps/hive/warehouse//_temporary/0
Found 1 items
drwxr-xr-x   - hadoop hdfsadmingroup  0 2023-06-23 04:45
/apps/hive/warehouse//_temporary/0/_temporary

Is it because of task failures? Is there a way to work around this issue?
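
A minimal sketch of one possible cleanup step before retrying the write, assuming the
leftover directory is safe to delete in your environment (the path below is hypothetical):

import org.apache.hadoop.fs.{FileSystem, Path}

// Remove the stale _temporary directory left behind by a failed or killed job.
val stalePath = new Path("/apps/hive/warehouse/mydb.db/mytable/_temporary")  // hypothetical path
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
if (fs.exists(stalePath)) fs.delete(stalePath, true)  // true = recursive delete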

Thanks


Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
OK, good news. You have made some progress here :)

bzip (bzip2) works (it is splittable) because it is block-oriented, whereas gzip
is stream-oriented. I also noticed that you are creating a managed ORC
table. You can bucket and partition an ORC (Optimized Row Columnar) table.
An example below:


DROP TABLE IF EXISTS dummy;

CREATE TABLE dummy (
 ID INT
   , CLUSTERED INT
   , SCATTERED INT
   , RANDOMISED INT
   , RANDOM_STRING VARCHAR(50)
   , SMALL_VC VARCHAR(10)
   , PADDING  VARCHAR(10)
)
CLUSTERED BY (ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES (
"orc.create.index"="true",
"orc.bloom.filter.columns"="ID",
"orc.bloom.filter.fpp"="0.05",
"orc.compress"="SNAPPY",
"orc.stripe.size"="16777216",
"orc.row.index.stride"="1" )
;

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 26 Jun 2023 at 19:35, Patrick Tucci  wrote:

> Hi Mich,
>
> Thanks for the reply. I started running ANALYZE TABLE on the external
> table, but the progress was very slow. The stage had only read about 275MB
> in 10 minutes. That equates to about 5.5 hours just to analyze the table.
>
> This might just be the reality of trying to process a 240m record file
> with 80+ columns, unless there's an obvious issue with my setup that
> someone sees. The solution is likely going to involve increasing
> parallelization.
>
> To that end, I extracted and re-zipped this file in bzip. Since bzip is
> splittable and gzip is not, Spark can process the bzip file in parallel.
> The same CTAS query only took about 45 minutes. This is still a bit slower
> than I had hoped, but the import from bzip fully utilized all available
> cores. So we can give the cluster more resources if we need the process to
> go faster.
>
> Patrick
>
> On Mon, Jun 26, 2023 at 12:52 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> OK for now have you analyzed statistics in Hive external table
>>
>> spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL
>> COLUMNS;
>> spark-sql (default)> DESC EXTENDED test.stg_t2;
>>
>> Hive external tables have little optimization
>>
>> HTH
>>
>>
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 26 Jun 2023 at 16:33, Patrick Tucci 
>> wrote:
>>
>>> Hello,
>>>
>>> I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master
>>> node has 2 cores and 8GB of RAM. There is a single worker node with 8 cores
>>> and 64GB of RAM.
>>>
>>> I'm trying to process a large pipe delimited file that has been
>>> compressed with gzip (9.2GB zipped, ~58GB unzipped, ~241m records, ~85
>>> columns). I uploaded the gzipped file to HDFS and created an external table
>>> using the attached script. I tried two simpler queries on the same table,
>>> and they finished in ~5 and ~10 minutes respectively:
>>>
>>> SELECT COUNT(*) FROM ClaimsImport;
>>> SELECT COUNT(*) FROM ClaimsImport WHERE ClaimLineID = 1;
>>>
>>> However, when I tried to create a table stored as ORC using this table
>>> as the input, the query ran for almost 4 hours:
>>>
>>> CREATE TABLE Claims STORED AS ORC
>>> AS
>>> SELECT *
>>> FROM ClaimsImport
>>> --Exclude the header record
>>> WHERE ClaimID <> 'ClaimID';
>>>
>>> [image: image.png]
>>>
>>> Why is there such a speed disparity between these different operations?
>>> I understand that this job cannot be parallelized because the file is
>>> compressed with gzip. I also understand that creating an ORC table from the
>>> input will take more time than a simple COUNT(*). But it doesn't feel like
>>> the CREATE TABLE operation should take more than 24x longer than a simple
>>> SELECT COUNT(*) statement.
>>>
>>> Thanks for any help. Please let me know if I can provide any additional
>>> information.
>>>
>>> Patrick
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
Hi Mich,

Thanks for the reply. I started running ANALYZE TABLE on the external
table, but the progress was very slow. The stage had only read about 275MB
in 10 minutes. That equates to about 5.5 hours just to analyze the table.

This might just be the reality of trying to process a 240m record file with
80+ columns, unless there's an obvious issue with my setup that someone
sees. The solution is likely going to involve increasing parallelization.

To that end, I extracted and re-zipped this file in bzip. Since bzip is
splittable and gzip is not, Spark can process the bzip file in parallel.
The same CTAS query only took about 45 minutes. This is still a bit slower
than I had hoped, but the import from bzip fully utilized all available
cores. So we can give the cluster more resources if we need the process to
go faster.
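
As a sketch of squeezing more parallelism out of the write side (the gzip read itself stays
single-threaded), the CTAS can also be expressed as a DataFrame write with an explicit
repartition; table and column names come from the earlier messages, and the partition count
is illustrative:

// Exclude the header record, shuffle into more partitions, then write the ORC files in parallel.
spark.table("ClaimsImport")
  .filter("ClaimID <> 'ClaimID'")
  .repartition(64)
  .write.format("orc")
  .saveAsTable("Claims")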

Patrick

On Mon, Jun 26, 2023 at 12:52 PM Mich Talebzadeh 
wrote:

> OK for now have you analyzed statistics in Hive external table
>
> spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL
> COLUMNS;
> spark-sql (default)> DESC EXTENDED test.stg_t2;
>
> Hive external tables have little optimization
>
> HTH
>
>
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 26 Jun 2023 at 16:33, Patrick Tucci 
> wrote:
>
>> Hello,
>>
>> I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master
>> node has 2 cores and 8GB of RAM. There is a single worker node with 8 cores
>> and 64GB of RAM.
>>
>> I'm trying to process a large pipe delimited file that has been
>> compressed with gzip (9.2GB zipped, ~58GB unzipped, ~241m records, ~85
>> columns). I uploaded the gzipped file to HDFS and created an external table
>> using the attached script. I tried two simpler queries on the same table,
>> and they finished in ~5 and ~10 minutes respectively:
>>
>> SELECT COUNT(*) FROM ClaimsImport;
>> SELECT COUNT(*) FROM ClaimsImport WHERE ClaimLineID = 1;
>>
>> However, when I tried to create a table stored as ORC using this table as
>> the input, the query ran for almost 4 hours:
>>
>> CREATE TABLE Claims STORED AS ORC
>> AS
>> SELECT *
>> FROM ClaimsImport
>> --Exclude the header record
>> WHERE ClaimID <> 'ClaimID';
>>
>> [image: image.png]
>>
>> Why is there such a speed disparity between these different operations? I
>> understand that this job cannot be parallelized because the file is
>> compressed with gzip. I also understand that creating an ORC table from the
>> input will take more time than a simple COUNT(*). But it doesn't feel like
>> the CREATE TABLE operation should take more than 24x longer than a simple
>> SELECT COUNT(*) statement.
>>
>> Thanks for any help. Please let me know if I can provide any additional
>> information.
>>
>> Patrick
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
OK for now have you analyzed statistics in Hive external table

spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL
COLUMNS;
spark-sql (default)> DESC EXTENDED test.stg_t2;

Hive external tables have little optimization

HTH



Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 26 Jun 2023 at 16:33, Patrick Tucci  wrote:

> Hello,
>
> I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master
> node has 2 cores and 8GB of RAM. There is a single worker node with 8 cores
> and 64GB of RAM.
>
> I'm trying to process a large pipe delimited file that has been compressed
> with gzip (9.2GB zipped, ~58GB unzipped, ~241m records, ~85 columns). I
> uploaded the gzipped file to HDFS and created an external table using the
> attached script. I tried two simpler queries on the same table, and they
> finished in ~5 and ~10 minutes respectively:
>
> SELECT COUNT(*) FROM ClaimsImport;
> SELECT COUNT(*) FROM ClaimsImport WHERE ClaimLineID = 1;
>
> However, when I tried to create a table stored as ORC using this table as
> the input, the query ran for almost 4 hours:
>
> CREATE TABLE Claims STORED AS ORC
> AS
> SELECT *
> FROM ClaimsImport
> --Exclude the header record
> WHERE ClaimID <> 'ClaimID';
>
> [image: image.png]
>
> Why is there such a speed disparity between these different operations? I
> understand that this job cannot be parallelized because the file is
> compressed with gzip. I also understand that creating an ORC table from the
> input will take more time than a simple COUNT(*). But it doesn't feel like
> the CREATE TABLE operation should take more than 24x longer than a simple
> SELECT COUNT(*) statement.
>
> Thanks for any help. Please let me know if I can provide any additional
> information.
>
> Patrick
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
Hello,

I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master node
has 2 cores and 8GB of RAM. There is a single worker node with 8 cores and
64GB of RAM.

I'm trying to process a large pipe delimited file that has been compressed
with gzip (9.2GB zipped, ~58GB unzipped, ~241m records, ~85 columns). I
uploaded the gzipped file to HDFS and created an external table using the
attached script. I tried two simpler queries on the same table, and they
finished in ~5 and ~10 minutes respectively:

SELECT COUNT(*) FROM ClaimsImport;
SELECT COUNT(*) FROM ClaimsImport WHERE ClaimLineID = 1;

However, when I tried to create a table stored as ORC using this table as
the input, the query ran for almost 4 hours:

CREATE TABLE Claims STORED AS ORC
AS
SELECT *
FROM ClaimsImport
--Exclude the header record
WHERE ClaimID <> 'ClaimID';

[image: image.png]

Why is there such a speed disparity between these different operations? I
understand that this job cannot be parallelized because the file is
compressed with gzip. I also understand that creating an ORC table from the
input will take more time than a simple COUNT(*). But it doesn't feel like
the CREATE TABLE operation should take more than 24x longer than a simple
SELECT COUNT(*) statement.

Thanks for any help. Please let me know if I can provide any additional
information.

Patrick


Create Table.sql
Description: Binary data

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

RE: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-10 Thread Vijay B
Please see if this works

-- aggregate array into map of element of count
SELECT aggregate(array(1,2,3,4,5),
  map('cnt',0),
  (acc,x) -> map('cnt', acc.cnt+1)) as array_count

thanks
Vijay
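
For the per-element count the original question asks for, here is a rough pure-SQL sketch
(wrapped in spark.sql for a spark-shell). It assumes Spark 3.x map higher-order functions,
that the inner lambda may reference the outer variable x, and the default non-ANSI behaviour
where acc[x] returns NULL for a missing key:

spark.sql("""
  SELECT aggregate(
           array('a', 'b', 'a'),
           CAST(map() AS map<string, int>),            -- empty accumulator map
           (acc, x) -> map_concat(
             map_filter(acc, (k, v) -> k != x),        -- drop the old entry for x, if any
             map(x, coalesce(acc[x], 0) + 1))          -- re-insert x with its incremented count
         ) AS freq_cnt
""").show(false)
// expected output: {a -> 2, b -> 1}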

On 2023/05/05 19:32:04 Yong Zhang wrote:
> Hi, This is on Spark 3.1 environment.
>
> For some reason, I can ONLY do this in Spark SQL, instead of either Scala or 
> PySpark environment.
>
> I want to aggregate an array into a Map of element count, within that array, 
> but in Spark SQL.
> I know that there is an aggregate function available like
>
> aggregate(expr, start, merge [, finish])
>
> But I want to know if this can be done in the Spark SQL only, and:
>
>   *   How to represent an empty Map as "start" element above
>   *   How to merge each element (as String type) into Map (as adding count if 
> exist in the Map, or add as (element -> 1) as new entry in the Map if not 
> exist)
>
> Like the following example -> 
> https://docs.databricks.com/sql/language-manual/functions/aggregate.html
>
> SELECT aggregate(array(1, 2, 3, 4),
>named_struct('sum', 0, 'cnt', 0),
>(acc, x) -> named_struct('sum', acc.sum + x, 'cnt', 
> acc.cnt + 1),
>acc -> acc.sum / acc.cnt) AS avg
>
> I wonder:
> select
>   aggregate(
>   array('a','b','a')),
>   map('', 0),
>   (acc, x) -> ???
>   acc -> acc) as output
>
> How to do the logic after "(acc, x) -> ", so I can output a map of count of 
> each element in the array?
> I know I can "explode", then groupby + count, but since I have multi array 
> columns need to transform, so I want to do more a high order function way, 
> and in pure Spark SQL.
>
> Thanks
>
>



Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-09 Thread Yong Zhang
Hi, Mich:

Thanks for your reply, but maybe I didn't make my question clear.

I am looking for a solution to compute the count of each element in an array, 
without "exploding" the array, and output a Map structure as a column.
For example, for an array as ('a', 'b', 'a'), I want to output a column as 
Map('a' -> 2, 'b' -> 1).
I think that the "aggregate" function should be able to do this, following the example
shown in the link of my original email, as

SELECT aggregate(array('a', 'b', 'a'),   map(), 
  (acc, x) -> ???,   acc -> acc) AS feq_cnt

Here are my questions:

  *   Is using "map()" above the best way? The "start" structure in this case 
should be Map.empty[String, Int], but of course, it won't work in pure Spark 
SQL, so the best solution I can think of is "map()", and I want a mutable Map.
  *   How to implement the logic in "???" place? If I do it in Scala, I will do 
"acc.update(x, acc.getOrElse(x, 0) + 1)", which means if an element exists, 
plus one for the value; otherwise, start the element with the count of 1. Of 
course, the above code won't work in Spark SQL.
  *   As I said, I am NOT running in either Scale or PySpark session, but in a 
pure Spark SQL.
  *   Is it possible to do the above logic in Spark SQL, without using 
"exploding"?

Thanks


From: Mich Talebzadeh 
Sent: Saturday, May 6, 2023 4:52 PM
To: Yong Zhang 
Cc: user@spark.apache.org 
Subject: Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map 
of element of count?

you can create DF from your SQL RS and work with that in Python the way you want

## you don't need all these
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf, col, current_timestamp, lit
from pyspark.sql.types import *
sqltext = """
SELECT aggregate(array(1, 2, 3, 4),
   named_struct('sum', 0, 'cnt', 0),
   (acc, x) -> named_struct('sum', acc.sum + x, 'cnt', acc.cnt 
+ 1),
   acc -> acc.sum / acc.cnt) AS avg
"""
df = spark.sql(sqltext)
df.printSchema()

root
 |-- avg: double (nullable = true)


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


 
   view my Linkedin 
profile<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Fri, 5 May 2023 at 20:33, Yong Zhang 
mailto:java8...@hotmail.com>> wrote:
Hi, This is on Spark 3.1 environment.

For some reason, I can ONLY do this in Spark SQL, instead of either Scala or 
PySpark environment.

I want to aggregate an array into a Map of element count, within that array, 
but in Spark SQL.
I know that there is an aggregate function available like

aggregate(expr, start, merge [, finish])

But I want to know if this can be done in the Spark SQL only, and:

  *   How to represent an empty Map as "start" element above
  *   How to merge each element (as String type) into Map (as adding count if 
exist in the Map, or add as (element -> 1) as new entry in the Map if not exist)

Like the following example -> 
https://docs.databricks.com/sql/language-manual/functions/aggregate.html

SELECT aggregate(array(1, 2, 3, 4),
   named_struct('sum', 0, 'cnt', 0),
   (acc, x) -> named_struct('sum', acc.sum + x, 'cnt', acc.cnt 
+ 1),
   acc -> acc.sum / acc.cnt) AS avg

I wonder:
select
  aggregate(
  array('a','b','a')),
  map('', 0),
  (acc, x) -> ???
  acc -> acc) as output

How to do the logic after "(acc, x) -> ", so I can output a map of count of 
each element in the array?
I know I can "explode", then groupby + count, but since I have multi array 
columns need to transform, so I want to do more a high order function way, and 
in pure Spark SQL.

Thanks



Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-09 Thread Yong Zhang
Hi, Mich:

Thanks for your reply, but maybe I didn't make my question clear.

I am looking for a solution to compute the count of each element in an array, 
without "exploding" the array, and output a Map structure as a column.
For example, for an array as ('a', 'b', 'a'), I want to output a column as 
Map('a' -> 2, 'b' -> 1).
I think that "aggregate" function should be able to, using the example shown in 
the link of my original email, as

SELECT aggregate(array('a', 'b', 'a'),
   map(),
   (acc, x) -> ???,
   acc -> acc) AS feq_cnt

Here are my questions:

  *   Is using "map()" above the best way? The "start" structure in this case 
should be Map.empty[String, Int], but of course, it won't work in pure Spark 
SQL, so the best solution I can think of is "map()", and it is a mutable Map.
  *   How to implement the logic in the "???" place? If I do it in Scala, I
will do "acc.update(x, acc.getOrElse(x, 0) + 1)", which means if the element
exists, add one to its value; otherwise, insert the element with a count of 1.
Of course, the above code won't work in Spark SQL.
  *   As I said, I am NOT running in either Scale or PySpark session, but in a 
pure Spark SQL.
  *   Is it possible to do the above logic in Spark SQL, without using 
"exploding"?

Thanks


From: Mich Talebzadeh 
Sent: Saturday, May 6, 2023 4:52 PM
To: Yong Zhang 
Cc: user@spark.apache.org 
Subject: Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map 
of element of count?

you can create DF from your SQL RS and work with that in Python the way you want

## you don't need all these
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf, col, current_timestamp, lit
from pyspark.sql.types import *
sqltext = """
SELECT aggregate(array(1, 2, 3, 4),
   named_struct('sum', 0, 'cnt', 0),
   (acc, x) -> named_struct('sum', acc.sum + x, 'cnt', acc.cnt 
+ 1),
   acc -> acc.sum / acc.cnt) AS avg
"""
df = spark.sql(sqltext)
df.printSchema()

root
 |-- avg: double (nullable = true)


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


 
   view my Linkedin 
profile<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Fri, 5 May 2023 at 20:33, Yong Zhang 
mailto:java8...@hotmail.com>> wrote:
Hi, This is on Spark 3.1 environment.

For some reason, I can ONLY do this in Spark SQL, instead of either Scala or 
PySpark environment.

I want to aggregate an array into a Map of element count, within that array, 
but in Spark SQL.
I know that there is an aggregate function available like

aggregate(expr, start, merge [, finish])

But I want to know if this can be done in the Spark SQL only, and:

  *   How to represent an empty Map as "start" element above
  *   How to merge each element (as String type) into Map (as adding count if 
exist in the Map, or add as (element -> 1) as new entry in the Map if not exist)

Like the following example -> 
https://docs.databricks.com/sql/language-manual/functions/aggregate.html

SELECT aggregate(array(1, 2, 3, 4),
   named_struct('sum', 0, 'cnt', 0),
   (acc, x) -> named_struct('sum', acc.sum + x, 'cnt', acc.cnt 
+ 1),
   acc -> acc.sum / acc.cnt) AS avg

I wonder:
select
  aggregate(
  array('a','b','a')),
  map('', 0),
  (acc, x) -> ???
  acc -> acc) as output

How to do the logic after "(acc, x) -> ", so I can output a map of count of 
each element in the array?
I know I can "explode", then groupby + count, but since I have multi array 
columns need to transform, so I want to do more a high order function way, and 
in pure Spark SQL.

Thanks



Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-06 Thread Mich Talebzadeh
you can create a DF from your SQL result set and work with that in Python the way you
want

## you don't need all these
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf, col, current_timestamp, lit
from pyspark.sql.types import *
sqltext = """
SELECT aggregate(array(1, 2, 3, 4),
   named_struct('sum', 0, 'cnt', 0),
   (acc, x) -> named_struct('sum', acc.sum + x, 'cnt',
acc.cnt + 1),
   acc -> acc.sum / acc.cnt) AS avg
"""
df = spark.sql(sqltext)
df.printSchema()

root
 |-- avg: double (nullable = true)


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 5 May 2023 at 20:33, Yong Zhang  wrote:

> Hi, This is on Spark 3.1 environment.
>
> For some reason, I can ONLY do this in Spark SQL, instead of either Scala
> or PySpark environment.
>
> I want to aggregate an array into a Map of element count, within that
> array, but in Spark SQL.
> I know that there is an aggregate function available like
>
> aggregate(expr, start, merge [, finish])
>
>
> But I want to know if this can be done in the Spark SQL only, and:
>
>- How to represent an empty Map as "start" element above
>- How to merge each element (as String type) into Map (as adding count
>if exist in the Map, or add as (element -> 1) as new entry in the Map if
>not exist)
>
> Like the following example ->
> https://docs.databricks.com/sql/language-manual/functions/aggregate.html
>
> SELECT aggregate(array(1, 2, 3, 4),   named_struct('sum', 0, 
> 'cnt', 0),   (acc, x) -> named_struct('sum', acc.sum + x, 
> 'cnt', acc.cnt + 1),   acc -> acc.sum / acc.cnt) AS avg
>
>
> I wonder:
> select
> aggregate(
> array('a','b','a')),
> map('', 0),
> (acc, x) -> ???
> acc -> acc) as output
>
> How to do the logic after "(acc, x) -> ", so I can output a map of count
> of each element in the array?
> I know I can "explode", then groupby + count, but since I have multi array
> columns need to transform, so I want to do more a high order function way,
> and in pure Spark SQL.
>
> Thanks
>
>


Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-05 Thread Yong Zhang
Hi, This is on Spark 3.1 environment.

For some reason, I can ONLY do this in Spark SQL, instead of either Scala or 
PySpark environment.

I want to aggregate an array into a Map of element count, within that array, 
but in Spark SQL.
I know that there is an aggregate function available like

aggregate(expr, start, merge [, finish])

But I want to know if this can be done in the Spark SQL only, and:

  *   How to represent an empty Map as "start" element above
  *   How to merge each element (as String type) into the Map (adding to its count if it
already exists in the Map, or adding (element -> 1) as a new entry if it does not)

Like the following example -> 
https://docs.databricks.com/sql/language-manual/functions/aggregate.html

SELECT aggregate(array(1, 2, 3, 4),
   named_struct('sum', 0, 'cnt', 0),
   (acc, x) -> named_struct('sum', acc.sum + x, 'cnt', acc.cnt 
+ 1),
   acc -> acc.sum / acc.cnt) AS avg

I wonder:
select
  aggregate(
  array('a','b','a')),
  map('', 0),
  (acc, x) -> ???
  acc -> acc) as output

How to do the logic after "(acc, x) -> ", so I can output a map of count of 
each element in the array?
I know I can "explode", then group by + count, but since I have multiple array
columns that need to be transformed, I want to do it in a higher-order-function way,
and in pure Spark SQL.

Thanks



Re: Upgrading from Spark SQL 3.2 to 3.3 failed

2023-02-15 Thread lk_spark
I need to use the cast function to wrap the computed expression; then the EXPLAIN of the
SQL is OK. For example:

cast(a.Split_Amt * b.percent / 100 as decimal(20,8)) as split_amt
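
As an illustration only, the same workaround pattern in a self-contained snippet (the table
and column names below are made up, not from the original query):

spark.sql("CREATE TABLE IF NOT EXISTS tgt (split_amt DECIMAL(20,8)) USING parquet")
spark.sql("""
  INSERT INTO tgt
  SELECT CAST(amt * pct / 100 AS DECIMAL(20,8)) AS split_amt  -- explicit cast around the computed expression
  FROM VALUES (100.00D, 12.5D) AS src(amt, pct)
""")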



I don't know why. Is there a config property that could restore compatibility with Spark 3.2?










At 2023-02-16 13:47:25, "lk_spark"  wrote:

Hi all:
I have a SQL statement which can be run on Spark 3.2.1 but not on Spark 3.3.1. When I try
to explain it, I get an error with the message:
org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to 
org.apache.spark.sql.catalyst.expressions.AnsiCast


When executing the SQL, the error stack is:


ERROR ExecuteStatement: Error operating ExecuteStatement: 
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.Literal 
cannot be cast to org.apache.spark.sql.catalyst.expressions.AnsiCast
at 
org.apache.spark.sql.catalyst.expressions.CheckOverflowInTableInsert.withNewChildInternal(Cast.scala:2362)
at 
org.apache.spark.sql.catalyst.expressions.CheckOverflowInTableInsert.withNewChildInternal(Cast.scala:2360)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.$anonfun$mapChildren$5(TreeNode.scala:1233)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1232)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1227)
at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:498)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:635)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:635)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1228)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1227)
at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:498)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:635)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:188)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:200)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:200)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:211)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:216)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:216)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:221)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:427)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:221)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:188)
at 
org.apache.spark.sql.catalyst.optimizer.NullPropagation$$anonfun$apply$19.applyOrElse(expressions.scala:804)
at 
org.apache.spark.sql.catalyst.optimizer.NullPropagation$$anonfun$apply$19.applyOrElse(expressions.scala:801)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNod

Upgrading from Spark SQL 3.2 to 3.3 failed

2023-02-15 Thread lk_spark
Hi all:
I have a SQL statement which can be run on Spark 3.2.1 but not on Spark 3.3.1. When I try
to explain it, I get an error with the message:
org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to 
org.apache.spark.sql.catalyst.expressions.AnsiCast


When executing the SQL, the error stack is:


ERROR ExecuteStatement: Error operating ExecuteStatement: 
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.Literal 
cannot be cast to org.apache.spark.sql.catalyst.expressions.AnsiCast
at 
org.apache.spark.sql.catalyst.expressions.CheckOverflowInTableInsert.withNewChildInternal(Cast.scala:2362)
at 
org.apache.spark.sql.catalyst.expressions.CheckOverflowInTableInsert.withNewChildInternal(Cast.scala:2360)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.$anonfun$mapChildren$5(TreeNode.scala:1233)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1232)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1227)
at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:498)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:635)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:635)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1228)
at 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1227)
at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:498)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:635)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:188)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:200)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:200)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:211)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:216)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:216)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:221)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:427)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:221)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:188)
at 
org.apache.spark.sql.catalyst.optimizer.NullPropagation$$anonfun$apply$19.applyOrElse(expressions.scala:804)
at 
org.apache.spark.sql.catalyst.optimizer.NullPropagation$$anonfun$apply$19.applyOrElse(expressions.scala:801)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at

Fwd: [Spark SQL] : Delete is only supported on V2 tables.

2023-02-09 Thread Jeevan Chhajed
-- Forwarded message -
From: Jeevan Chhajed 
Date: Tue, 7 Feb 2023, 15:16
Subject: [Spark SQL] : Delete is only supported on V2 tables.
To: 


Hi,
How do we create V2 tables? I tried a couple of things using sql but was
unable to do so.

Can you share links/content? It will be of much help.

Is delete support on V2 tables still under dev?
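
For what it's worth, a hedged sketch of one way people usually get DELETE to work: use a
table format that implements it, such as Delta Lake or Apache Iceberg, since the built-in
parquet/orc tables in the default session catalog do not. The example below assumes Delta
Lake is on the classpath with its SQL extension and catalog configured; names are illustrative:

spark.sql("CREATE TABLE IF NOT EXISTS events (id BIGINT, status STRING) USING delta")
spark.sql("INSERT INTO events VALUES (1, 'OPEN'), (2, 'CLOSED')")
spark.sql("DELETE FROM events WHERE status = 'CLOSED'")  // works because the table is Delta-backed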

Thanks,
Jeevan


[Spark SQL]: Spark 3.2 generates different results to query when columns name have mixed casing vs when they have same casing

2023-02-08 Thread Amit Singh Rathore
Hi Team,

I am running a query in Spark 3.2.

val df1 =
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4",
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
df2.select("id").show()

This query runs fine. But when I change the casing of the op_cols to have
mix of upper & lower case ("id" & "ID") it throws an ambiguous col ref
error.

val df1 =
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4",
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "ID")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
df2.select("id").show()

My question is: why is there different behavior when I have duplicate columns
with the same name ("id", "id") vs the same name in different cases ("id",
"ID")? Either both should fail or neither should fail, considering Spark's
caseSensitive setting is false by default in 3.2.

Note that I checked, and this issue is there in Spark 2.4 as well. It works for both
cases (mixed and single casing) in Spark 2.3.
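
Not an explanation of the underlying behaviour, but a defensive sketch of a workaround:
de-duplicate the requested column list case-insensitively before the select, so "id" and
"ID" collapse into a single reference (spark-shell style; the data is the same toy frame as above):

import spark.implicits._

val df1 = Seq((1, 2, 3, 4, 5), (1, 2, 3, 4, 5)).toDF("id", "col2", "col3", "col4", "col5")
val opCols = List("id", "col2", "col3", "col4", "col5", "ID")

// Keep the first occurrence of each column name, compared case-insensitively.
val dedupedCols = opCols.foldLeft(Vector.empty[String]) { (acc, c) =>
  if (acc.exists(_.equalsIgnoreCase(c))) acc else acc :+ c
}

val df2 = df1.select(dedupedCols.head, dedupedCols.tail: _*)
df2.select("id").show()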


Thanks
Spark user


[Spark SQL] : Delete is only supported on V2 tables.

2023-02-07 Thread Jeevan Chhajed
Hi,
How do we create V2 tables? I tried a couple of things using sql but was
unable to do so.

Can you share links/content? It will be of much help.

Is delete support on V2 tables still under dev?

Thanks,
Jeevan


SQL GROUP BY alias with dots, was: Spark SQL question

2023-02-07 Thread Enrico Minack

Hi,

you are right, that is an interesting question.

Looks like GROUP BY is doing something funny / magic here (spark-shell 
3.3.1 and 3.5.0-SNAPSHOT):


With an alias, it behaves as you have pointed out:

spark.range(3).createTempView("ids_without_dots")
spark.sql("SELECT * FROM ids_without_dots").show()

// works
spark.sql("SELECT id AS `an.id` FROM ids_without_dots GROUP BY 
an.id").show()

// fails
spark.sql("SELECT id AS `an.id` FROM ids_without_dots GROUP BY 
`an.id`").show()



Without an alias, it behaves as expected, which is the opposite of above 
(a column with a dot exists, no alias used in SELECT):


spark.range(3).select($"id".as("an.id")).createTempView("ids_with_dots")
spark.sql("SELECT `an.id` FROM ids_with_dots").show()

// works
spark.sql("SELECT `an.id` FROM ids_with_dots GROUP BY `an.id`").show()
// fails
spark.sql("SELECT `an.id` FROM ids_with_dots GROUP BY an.id").show()


With a struct column, it also behaves as expected:

spark.range(3).select(struct($"id").as("an")).createTempView("ids_with_struct")
spark.sql("SELECT an.id FROM ids_with_struct").show()

// works
spark.sql("SELECT an.id FROM ids_with_struct GROUP BY an.id").show()
// fails
spark.sql("SELECT `an.id` FROM ids_with_struct GROUP BY an.id").show()
spark.sql("SELECT an.id FROM ids_with_struct GROUP BY `an.id`").show()
spark.sql("SELECT `an.id` FROM ids_with_struct GROUP BY `an.id`").show()


This does not feel very consistent.

Enrico



Am 28.01.23 um 00:34 schrieb Kohki Nishio:

this SQL works

select 1 as *`data.group`* from tbl group by *data.group*


Since there's no such field as *data,* I thought the SQL has to look 
like this


select 1 as *`data.group`* from tbl group by `*data.group`*


 But that gives and error (cannot resolve '`data.group`') ... I'm no 
expert in SQL, but feel like it's a strange behavior... does anybody 
have a good explanation for it ?


Thanks

--
Kohki Nishio




Re: Spark SQL question

2023-01-28 Thread Bjørn Jørgensen
Hi Mich.
This is a Spark user group mailing list where people can ask *any*
questions about spark.
You know SQL and streaming, but I don't think it's necessary to start a
replay with "*LOL*" to the question that's being asked.
No questions are to stupid to be asked.


lør. 28. jan. 2023 kl. 09:22 skrev Mich Talebzadeh <
mich.talebza...@gmail.com>:

> LOL
>
> First one
>
> spark-sql> select 1 as `data.group` from abc group by data.group;
> 1
> Time taken: 0.198 seconds, Fetched 1 row(s)
>
> means that are assigning alias data.group to select and you are using that
> alias -> data.group in your group by statement
>
>
> This is equivalent to
>
>
> spark-sql> select 1 as `data.group` from abc group by 1;
>
> 1
>
> With regard to your second sql
>
>
> select 1 as *`data.group`* from tbl group by `*data.group`;*
>
>
> *will throw an error *
>
>
> *spark-sql> select 1 as `data.group` from abc group by `data.group`;*
>
> *Error in query: cannot resolve '`data.group`' given input columns:
> [spark_catalog.elayer.abc.keyword, spark_catalog.elayer.abc.occurence];
> line 1 pos 43;*
>
> *'Aggregate ['`data.group`], [1 AS data.group#225]*
>
> *+- SubqueryAlias spark_catalog.elayer.abc*
>
> *   +- HiveTableRelation [`elayer`.`abc`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols:
> [keyword#226, occurence#227L], Partition Cols: []]*
>
> `data.group` with quotes is neither the name of the column or its alias
>
>
> *HTH*
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 27 Jan 2023 at 23:36, Kohki Nishio  wrote:
>
>> this SQL works
>>
>> select 1 as *`data.group`* from tbl group by *data.group*
>>
>>
>> Since there's no such field as *data,* I thought the SQL has to look
>> like this
>>
>> select 1 as *`data.group`* from tbl group by `*data.group`*
>>
>>
>>  But that gives and error (cannot resolve '`data.group`') ... I'm no
>> expert in SQL, but feel like it's a strange behavior... does anybody have a
>> good explanation for it ?
>>
>> Thanks
>>
>> --
>> Kohki Nishio
>>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Spark SQL question

2023-01-28 Thread Mich Talebzadeh
LOL

First one

spark-sql> select 1 as `data.group` from abc group by data.group;
1
Time taken: 0.198 seconds, Fetched 1 row(s)

means that you are assigning the alias data.group in the select and you are using that
alias -> data.group in your group by statement


This is equivalent to


spark-sql> select 1 as `data.group` from abc group by 1;

1

With regard to your second sql


select 1 as *`data.group`* from tbl group by `*data.group`;*


*will throw an error *


*spark-sql> select 1 as `data.group` from abc group by `data.group`;*

*Error in query: cannot resolve '`data.group`' given input columns:
[spark_catalog.elayer.abc.keyword, spark_catalog.elayer.abc.occurence];
line 1 pos 43;*

*'Aggregate ['`data.group`], [1 AS data.group#225]*

*+- SubqueryAlias spark_catalog.elayer.abc*

*   +- HiveTableRelation [`elayer`.`abc`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols:
[keyword#226, occurence#227L], Partition Cols: []]*

`data.group` with quotes is neither the name of the column nor its alias
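
A minimal spark-shell illustration of that distinction (names are illustrative; this mirrors
the examples posted elsewhere in the thread):

// Backticks make the dotted text a single column name.
spark.range(3).selectExpr("id AS `data.group`").createOrReplaceTempView("tbl_dotted")
spark.sql("SELECT `data.group` FROM tbl_dotted").show()

// Without backticks, the dot is struct-field access.
spark.range(3).selectExpr("named_struct('group', id) AS data").createOrReplaceTempView("tbl_struct")
spark.sql("SELECT data.group FROM tbl_struct").show()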


*HTH*



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 27 Jan 2023 at 23:36, Kohki Nishio  wrote:

> this SQL works
>
> select 1 as *`data.group`* from tbl group by *data.group*
>
>
> Since there's no such field as *data,* I thought the SQL has to look like
> this
>
> select 1 as *`data.group`* from tbl group by `*data.group`*
>
>
>  But that gives and error (cannot resolve '`data.group`') ... I'm no
> expert in SQL, but feel like it's a strange behavior... does anybody have a
> good explanation for it ?
>
> Thanks
>
> --
> Kohki Nishio
>


Spark SQL question

2023-01-27 Thread Kohki Nishio
this SQL works

select 1 as *`data.group`* from tbl group by *data.group*


Since there's no such field as *data,* I thought the SQL has to look like
this

select 1 as *`data.group`* from tbl group by `*data.group`*


 But that gives an error (cannot resolve '`data.group`')... I'm no expert
in SQL, but it feels like strange behavior... does anybody have a good
explanation for it?

Thanks

-- 
Kohki Nishio


[Spark SQL] Data duplicate or data lost with non-deterministic function

2023-01-14 Thread 李建伟
Hi all, I met a data duplication issue when writing a table with shuffled data and a
non-deterministic function.

For example:

insert overwrite table target_table partition(ds)
select ... from a join b join c...
distribute by ds, cast(rand()*10 as int)

As rand() is non-deterministic, the way input rows are distributed across shuffle
outputs may change when a task is retried: a row that is already present in one
shuffle output might get distributed again to a new shuffle output (causing data
duplication), or a row might not land in any shuffle output because its designated
shuffle output has already finished (causing data loss).
Do we have any suggestions on how to avoid data duplication and data loss in such a
scenario?
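
A hedged sketch of one mitigation (not the original job: a.id and the bucket
count 10 are illustrative stand-ins, and the "..." mirrors the elided query
above) is to derive the bucket from a stable key column instead of rand(), so
a retried task reproduces the same distribution:

-- sketch only: a.id stands in for any stable, non-null key column
insert overwrite table target_table partition(ds)
select ... from a join b join c ...
distribute by ds, pmod(hash(a.id), 10)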

Re: [Spark SQL]: unpredictable errors: java.io.IOException: can not read class org.apache.parquet.format.PageHeader

2022-12-19 Thread Eric Hanchrow
We’ve discovered a workaround for this; it’s described
here: https://issues.apache.org/jira/browse/HADOOP-18521
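
In case the link goes stale, a hedged sketch of what that workaround amounts to
for a Spark job (the exact property name is an assumption on my part; confirm it
against HADOOP-18521 and your hadoop-azure version) is to disable the ABFS
read-ahead prefetch when launching the job, e.g.

  # assumption: a read-ahead queue depth of 0 disables the ABFS prefetch path
  # implicated in HADOOP-18521
  spark-submit --conf spark.hadoop.fs.azure.readaheadqueue.depth=0 ...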

From: Eric Hanchrow 
Date: Thursday, December 8, 2022 at 17:03
To: user@spark.apache.org 
Subject: [Spark SQL]: unpredictable errors: java.io.IOException: can not read 
class org.apache.parquet.format.PageHeader
My company runs java code that uses Spark to read from, and write to, Azure 
Blob storage.  This code runs more or less 24x7.

Recently we've noticed a few failures that leave stack traces in our logs; what 
they have in common are exceptions that look variously like

Caused by: java.io.IOException: can not read class 
org.apache.parquet.format.PageHeader: Unrecognized type 0
Caused by: java.io.IOException: can not read class 
org.apache.parquet.format.PageHeader : don't know what type: 14
Caused by: java.io.IOException: can not read class 
org.apache.parquet.format.PageHeader Required field 'num_values' was not found 
in serialized data!
Caused by: java.io.IOException: can not read class 
org.apache.parquet.format.PageHeader Required field 'uncompressed_page_size' 
was not found in serialized data!

I searched 
https://stackoverflow.com/search?q=%5Bapache-spark%5D+java.io.IOException+can+not+read+class+org.apache.parquet.format.PageHeader
 and found exactly one marginally-relevant hit -- 
https://stackoverflow.com/questions/47211392/required-field-uncompressed-page-size-was-not-found-in-serialized-data-parque
It contains a suggested workaround which I haven't yet tried, but intend to 
soon.

I searched the ASF archive for
user@spark.apache.org; the only hit is
https://lists.apache.org/list?user@spark.apache.org:2022-9:can%20not%20read%20class%20org.apache.parquet.format.PageHeader
 which is relevant but unhelpful.

It cites https://issues.apache.org/jira/browse/SPARK-11844 which is quite 
relevant, but again unhelpful.

Unfortunately, we cannot provide the relevant parquet file to the mailing list, 
since it of course contains proprietary data.

I've posted the stack trace at 
https://gist.github.com/erich-truveta/f30d77441186a8c30c5f22f9c44bf59f

Here are various maven dependencies that might be relevant (gotten from the 
output of `mvn dependency:tree`):

org.apache.hadoop.thirdparty:hadoop-shaded-guava:jar:1.1.1
org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_7 :jar:1.1.1

org.apache.hadoop:hadoop-annotations:jar:3.3.4
org.apache.hadoop:hadoop-auth   :jar:3.3.4
org.apache.hadoop:hadoop-azure  :jar:3.3.4
org.apache.hadoop:hadoop-client-api :jar:3.3.4
org.apache.hadoop:hadoop-client-runtime :jar:3.3.4
org.apache.hadoop:hadoop-client :jar:3.3.4
org.apache.hadoop:hadoop-common :jar:3.3.4
org.apache.hadoop:hadoop-hdfs-client:jar:3.3.4
org.apache.hadoop:hadoop-mapreduce-client-common:jar:3.3.4
org.apache.hadoop:hadoop-mapreduce-client-core  :jar:3.3.4
org.apache.hadoop:hadoop-mapreduce-client-jobclient :jar:3.3.4
org.apache.hadoop:hadoop-yarn-api   :jar:3.3.4
org.apache.hadoop:hadoop-yarn-client:jar:3.3.4
org.apache.hadoop:hadoop-yarn-common:jar:3.3.4

org.apache.hive:hive-storage-api :jar:2.7.2

org.apache.parquet:parquet-column:jar:1.12.2
org.apache.parquet:parquet-common:jar:1.12.2
org.apache.parquet:parquet-encoding  :jar:1.12.2
org.apache.parquet:parquet-format-structures :jar:1.12.2
org.apache.parquet:parquet-hadoop:jar:1.12.2
org.apache.parquet:parquet-jackson   :jar:1.12.2

org.apache.spark:spark-catalyst_2.12:jar:3.3.1
org.apache.spark:spark-core_2.12:jar:3.3.1
org.apache.spark:spark-kvstore_2.12 :jar:3.3.1
org.apache.spark:spark-launcher_2.12:jar:3.3.1
org.apache.spark:spark-network-common_2.12  :jar:3.3.1
org.apache.spark:spark-network-shuffle_2.12 :jar:3.3.1
org.apache.spark:spark-sketch_2.12  :jar:3.3.1
org.apache.spark:spark-sql_2.12 :jar:3.3.1
org.apache.spark:spark-tags_2.12:jar:3.3.1
org.apache.spark:spark-unsafe_2.12  :jar:3.3.1

Thank you for any help you can provide!


[Spark SQL]: unpredictable errors: java.io.IOException: can not read class org.apache.parquet.format.PageHeader

2022-12-08 Thread Eric Hanchrow
My company runs java code that uses Spark to read from, and write to, Azure 
Blob storage.  This code runs more or less 24x7.

Recently we've noticed a few failures that leave stack traces in our logs; what 
they have in common are exceptions that look variously like

Caused by: java.io.IOException: can not read class 
org.apache.parquet.format.PageHeader: Unrecognized type 0
Caused by: java.io.IOException: can not read class 
org.apache.parquet.format.PageHeader : don't know what type: 14
Caused by: java.io.IOException: can not read class 
org.apache.parquet.format.PageHeader Required field 'num_values' was not found 
in serialized data!
Caused by: java.io.IOException: can not read class 
org.apache.parquet.format.PageHeader Required field 'uncompressed_page_size' 
was not found in serialized data!

I searched 
https://stackoverflow.com/search?q=%5Bapache-spark%5D+java.io.IOException+can+not+read+class+org.apache.parquet.format.PageHeader
 and found exactly one marginally-relevant hit -- 
https://stackoverflow.com/questions/47211392/required-field-uncompressed-page-size-was-not-found-in-serialized-data-parque
It contains a suggested workaround which I haven't yet tried, but intend to 
soon.

I searched the ASF archive for 
user@spark.apache.org; the only hit is 
https://lists.apache.org/list?user@spark.apache.org:2022-9:can%20not%20read%20class%20org.apache.parquet.format.PageHeader
 which is relevant but unhelpful.

It cites https://issues.apache.org/jira/browse/SPARK-11844 which is quite 
relevant, but again unhelpful.

Unfortunately, we cannot provide the relevant parquet file to the mailing list, 
since it of course contains proprietary data.

I've posted the stack trace at 
https://gist.github.com/erich-truveta/f30d77441186a8c30c5f22f9c44bf59f

Here are various maven dependencies that might be relevant (gotten from the 
output of `mvn dependency:tree`):

org.apache.hadoop.thirdparty:hadoop-shaded-guava:jar:1.1.1
org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_7 :jar:1.1.1

org.apache.hadoop:hadoop-annotations:jar:3.3.4
org.apache.hadoop:hadoop-auth   :jar:3.3.4
org.apache.hadoop:hadoop-azure  :jar:3.3.4
org.apache.hadoop:hadoop-client-api :jar:3.3.4
org.apache.hadoop:hadoop-client-runtime :jar:3.3.4
org.apache.hadoop:hadoop-client :jar:3.3.4
org.apache.hadoop:hadoop-common :jar:3.3.4
org.apache.hadoop:hadoop-hdfs-client:jar:3.3.4
org.apache.hadoop:hadoop-mapreduce-client-common:jar:3.3.4
org.apache.hadoop:hadoop-mapreduce-client-core  :jar:3.3.4
org.apache.hadoop:hadoop-mapreduce-client-jobclient :jar:3.3.4
org.apache.hadoop:hadoop-yarn-api   :jar:3.3.4
org.apache.hadoop:hadoop-yarn-client:jar:3.3.4
org.apache.hadoop:hadoop-yarn-common:jar:3.3.4

org.apache.hive:hive-storage-api :jar:2.7.2

org.apache.parquet:parquet-column:jar:1.12.2
org.apache.parquet:parquet-common:jar:1.12.2
org.apache.parquet:parquet-encoding  :jar:1.12.2
org.apache.parquet:parquet-format-structures :jar:1.12.2
org.apache.parquet:parquet-hadoop:jar:1.12.2
org.apache.parquet:parquet-jackson   :jar:1.12.2

org.apache.spark:spark-catalyst_2.12:jar:3.3.1
org.apache.spark:spark-core_2.12:jar:3.3.1
org.apache.spark:spark-kvstore_2.12 :jar:3.3.1
org.apache.spark:spark-launcher_2.12:jar:3.3.1
org.apache.spark:spark-network-common_2.12  :jar:3.3.1
org.apache.spark:spark-network-shuffle_2.12 :jar:3.3.1
org.apache.spark:spark-sketch_2.12  :jar:3.3.1
org.apache.spark:spark-sql_2.12 :jar:3.3.1
org.apache.spark:spark-tags_2.12:jar:3.3.1
org.apache.spark:spark-unsafe_2.12  :jar:3.3.1

Thank you for any help you can provide!


RE: Re: [Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-22 Thread Patrick Tucci

Thanks. How would I go about formally submitting a feature request for this?
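
For reference, a hedged sketch of per-expression workarounds for the comparisons
quoted below; there is no global flag that I am aware of, and ILIKE availability
depends on the Spark version (treat it as an assumption to verify):

-- sketch only: wrap both sides in lower(), or use ILIKE where available
SELECT lower('ABC') = lower('abc');                   -- true
SELECT lower('ABC') IN (lower('AbC'), lower('abc'));  -- true
SELECT 'ABC' ILIKE 'Ab%';                             -- true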

On 2022/11/21 23:47:16 Andrew Melo wrote:
> I think this is the right place, just a hard question :) As far as I
> know, there's no "case insensitive flag", so YMMV
>
> On Mon, Nov 21, 2022 at 5:40 PM Patrick Tucci  wrote:
> >
> > Is this the wrong list for this type of question?
> >
> > On 2022/11/12 16:34:48 Patrick Tucci wrote:
> > > Hello,
> > >
> > > Is there a way to set string comparisons to be case-insensitive
> > > globally? I understand LOWER() can be used, but my codebase contains
> > > 27k lines of SQL and many string comparisons. I would need to apply
> > > LOWER() to each string literal in the code base. I'd also need to
> > > change all the ETL/import code to apply LOWER() to each string value
> > > on import.
> > >
> > > Current behavior:
> > >
> > > SELECT 'ABC' = 'abc';
> > > false
> > > Time taken: 5.466 seconds, Fetched 1 row(s)
> > >
> > > SELECT 'ABC' IN ('AbC', 'abc');
> > > false
> > > Time taken: 5.498 seconds, Fetched 1 row(s)
> > >
> > > SELECT 'ABC' like 'Ab%'
> > > false
> > > Time taken: 5.439 seconds, Fetched 1 row(s)
> > >
> > > Desired behavior would be true for all of the above with the proposed
> > > case-insensitive flag set.
> > >
> > > Thanks,
> > >
> > > Patrick
> > >
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


