Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
OK, you would not have known unless you went through the process, so to speak.

Let us do something revolutionary here.

Install Hive and its metastore. You already have Hadoop anyway:

https://cwiki.apache.org/confluence/display/hive/adminmanual+installation

Hive metastore:

https://data-flair.training/blogs/apache-hive-metastore/#:~:text=What%20is%20Hive%20Metastore%3F,by%20using%20metastore%20service%20API

Choose one of these backend databases:

derby  hive  mssql  mysql  oracle  postgres

Mine is Oracle; Postgres is good as well.
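
For example, if you pick postgres, preparing the backend database is a few
lines of SQL. This is only a sketch; the database, user and password below
are placeholders, not anything from this thread:

-- Illustrative setup for a PostgreSQL-backed Hive metastore.
-- Run as a Postgres superuser; all names and passwords are placeholders.
CREATE DATABASE metastore;
CREATE USER hiveuser WITH PASSWORD 'change_me';
GRANT ALL PRIVILEGES ON DATABASE metastore TO hiveuser;

-- Afterwards, point javax.jdo.option.ConnectionURL in hive-site.xml at
-- jdbc:postgresql://<host>:5432/metastore and initialise the schema with:
--   $HIVE_HOME/bin/schematool -dbType postgres -initSchema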

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my LinkedIn profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
Yes, on premise.

Unfortunately, after installing Delta Lake and rewriting all tables as
Delta tables, the issue persists.

Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Mich Talebzadeh
OK, sure.

Is this Delta Lake going to be on-premise?

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my LinkedIn profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-12 Thread Patrick Tucci
Hi Mich,

Thanks for the feedback. My original intention after reading your response
was to stick to Hive for managing tables. Unfortunately, I'm running into
another case of SQL scripts hanging. Since all tables are already Parquet,
I'm out of troubleshooting options. I'm going to migrate to Delta Lake and
see if that solves the issue.

Thanks again for your feedback.

Patrick

On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh 
wrote:

> Hi Patrick,
>
> There is nothing wrong with Hive on-premise; it is the best data
> warehouse there is.
>
> Hive handles both ORC and Parquet formats well. They are both columnar
> implementations of the relational model. What you are seeing is the Spark
> API to Hive, which prefers Parquet; I found this out a few years ago.
>
> From your point of view, I suggest you stick to the Parquet format with
> Hive when working with Spark. As far as I know, you don't have a fully
> independent Hive DB as yet.
>
> Anyway, stick to Hive for now, as you never know what issues you may
> face when moving to Delta Lake.
>
> You can also use compression:
>
> STORED AS PARQUET
> TBLPROPERTIES ("parquet.compression"="SNAPPY")
>
> Also:
>
> ANALYZE TABLE <table_name> COMPUTE STATISTICS FOR COLUMNS
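>
> For illustration, the same settings on a made-up table (the table and
> column names below are invented, not your schema):
>
> CREATE TABLE member_claims (
>   id INT,
>   claim_amount DOUBLE
> )
> STORED AS PARQUET
> TBLPROPERTIES ("parquet.compression"="SNAPPY");
>
> -- Spark SQL also accepts an explicit column list:
> ANALYZE TABLE member_claims COMPUTE STATISTICS;
> ANALYZE TABLE member_claims COMPUTE STATISTICS FOR COLUMNS id, claim_amount;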
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>    view my LinkedIn profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 11 Aug 2023 at 11:26, Patrick Tucci 
> wrote:
>
>> Thanks for the reply, Stephen and Mich.
>>
>> Stephen, you're right, it feels like Spark is waiting for something, but
>> I'm not sure what. I'm the only user on the cluster and there are plenty of
>> resources (60+ cores, 250+ GB RAM). I even tried restarting Hadoop, Spark
>> and the host servers to make sure nothing was lingering in the background.
>>
>> Mich, thank you so much, your suggestion worked. Storing the tables as
>> Parquet solves the issue.
>>
>> Interestingly, I found that only the MemberEnrollment table needs to be
>> Parquet. The ID field in MemberEnrollment is an int calculated during load
>> by a ROW_NUMBER() function. Further testing found that if I hard code a 0
>> as MemberEnrollment.ID instead of using the ROW_NUMBER() function, the
>> query works without issue even if both tables are ORC.
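>>
>> Schematically, the load in question looks like this (a sketch only;
>> the column names are invented for illustration):
>>
>> CREATE TABLE MemberEnrollment STORED AS PARQUET AS
>> SELECT ROW_NUMBER() OVER (ORDER BY member_number) AS id,
>>        member_number,
>>        enrollment_date
>> FROM member_enrollment_staging;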
>>
>> Should I infer from this issue that the Hive components prefer Parquet
>> over ORC? Furthermore, should I consider using a different table storage
>> framework, like Delta Lake, instead of the Hive components? Given this
>> issue and other issues I've had with Hive, I'm starting to think a
>> different solution might be more robust and stable. The main condition is
>> that my application operates solely through Thrift server, so I need to be
>> able to connect to Spark through Thrift server and have it write tables
>> using Delta Lake instead of Hive. From this StackOverflow question, it
>> looks like this is possible:
>> https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect-to-delta-using-jdbc
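>>
>> If it is, then once the Thrift server is started with the Delta Lake
>> package and the Delta SQL extension/catalog settings described in that
>> answer, writing a Delta table should be plain SQL over the JDBC
>> connection, roughly like this (table names invented):
>>
>> CREATE TABLE member_enrollment_delta
>> USING DELTA
>> AS SELECT * FROM MemberEnrollment;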
>>
>> Thanks again to everyone who replied for their help.
>>
>> Patrick
>>
>>
>> On Fri, Aug 11, 2023 at 2:14 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Steve may have a valid point. You raised an issue with concurrent writes
>>> before, if I recall correctly; this limitation may be due to the Hive
>>> metastore. By default, Spark uses Apache Derby for its database
>>> persistence. *However it is limited to only one Spark session at any
>>> time for the purposes of metadata storage.*  That may be the cause here
>>> as well. Does this happen if the underlying tables are created as PARQUET
>>> as opposed to ORC?
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>    view my LinkedIn profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 11 Aug 2023 at 01:33, Stephen Coy 
>>> wrote:
>>>
>>>> Hi Patrick,
>>>>
>>>> When this has happened to me in the past (admittedly via spark-submit)
>>>> it has been because another job was still running and had already claimed
>>>> some of the resources