Hi Everyone,

I wanted to follow up on this issue, which has continued since our last
correspondence. Today a query hung and I couldn't resolve the problem, so I
upgraded my Spark installation from 3.4.0 to 3.4.1. After the upgrade,
instead of hanging, the query failed with an error saying the driver didn't
have enough memory to broadcast objects. After increasing the driver
memory, the query runs without issue.
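
For anyone who hits the same error: the fix was along these lines, where 8g
is just an illustrative value rather than a recommendation for your
workload:

# in spark-defaults.conf
spark.driver.memory  8g

Equivalently, you can pass --driver-memory 8g when starting the Thrift
server.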

I hope this can be helpful to someone else in the future. Thanks again for
the support,

Patrick

On Sun, Aug 13, 2023 at 7:52 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> OK, I use Hive 3.1.1.
>
> My suggestion is to put your Hive issues, including the Java version
> compatibility question, to u...@hive.apache.org
>
> They will give you better info.
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
> On Sun, 13 Aug 2023 at 11:48, Patrick Tucci <patrick.tu...@gmail.com>
> wrote:
>
>> I attempted to install Hive yesterday. The experience was similar to my
>> other attempts at installing Hive: it took a few hours, and at the end of
>> the process I didn't have a working setup. The latest stable release would
>> not run. I never discovered the cause, but similar StackOverflow questions
>> suggest it might be a Java incompatibility issue. Since I didn't want to
>> downgrade or install an additional Java version, I tried the latest alpha
>> instead. This appears to have worked, although I couldn't figure out how
>> to get it to use the metastore_db from Spark.
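>>
>> For what it's worth, I suspect the missing piece was pointing Hive's
>> metastore connection at Spark's embedded Derby database in hive-site.xml,
>> along these lines. The path below is a hypothetical example and I never
>> confirmed that this works:
>>
>> <property>
>>   <name>javax.jdo.option.ConnectionURL</name>
>>   <value>jdbc:derby:;databaseName=/path/to/spark/metastore_db;create=true</value>
>> </property>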
>>
>> After turning my attention back to Spark, I tracked down the issue. After
>> much troubleshooting, I discovered that if I performed a COUNT(*) using
>> the same JOINs, the problem query worked. I then removed all the columns
>> from the SELECT statement and added them back one by one until I found the
>> culprit: a text field on one of the tables. When the query SELECTs this
>> column, or attempts to filter on it, it hangs and never completes. If I
>> remove all explicit references to this column, the query works fine. Since
>> I need this column in the results, I went back to the ETL and extracted
>> its values to a dimension table. I replaced the text column in the source
>> table with an integer ID column, and the query worked without issue.
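>>
>> The workaround looked roughly like this; the table and column names below
>> are simplified stand-ins for the real schema:
>>
>> -- Extract the distinct text values into a dimension table
>> CREATE TABLE TextDim STORED AS PARQUET AS
>> SELECT ROW_NUMBER() OVER (ORDER BY TextValue) AS TextID, TextValue
>> FROM (SELECT DISTINCT TextValue FROM SourceTable_raw) t;
>>
>> -- Rebuild the source table with the integer ID in place of the text column
>> CREATE TABLE SourceTable STORED AS PARQUET AS
>> SELECT r.ID, d.TextID
>> FROM SourceTable_raw r
>> JOIN TextDim d ON r.TextValue = d.TextValue;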
>>
>> On the topic of Hive, does anyone have any detailed resources for setting
>> up Hive from scratch, aside from the official site? Those instructions
>> didn't work for me. I'm starting to feel uneasy about building my process
>> around Spark. There really shouldn't be any instances where I ask Spark to
>> run legal ANSI SQL code and it just does nothing. In the past 4 days I've
>> run into 2 such instances, and the solution was more voodoo and magic than
>> examining errors/logs and fixing code. I feel I should have a contingency
>> plan in place for when I run into an issue with Spark that can't be
>> resolved.
>>
>> Thanks everyone.
>>
>>
>> On Sat, Aug 12, 2023 at 2:18 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> OK, you would not have known unless you went through the process, so to
>>> speak.
>>>
>>> Let us do something revolutionary here 😁
>>>
>>> Install Hive and its metastore. You already have Hadoop anyway:
>>>
>>> https://cwiki.apache.org/confluence/display/hive/adminmanual+installation
>>>
>>> Hive metastore:
>>>
>>>
>>> https://data-flair.training/blogs/apache-hive-metastore/#:~:text=What%20is%20Hive%20Metastore%3F,by%20using%20metastore%20service%20API
>>> .
>>>
>>> Choose one of these backends for the metastore database:
>>>
>>> derby  hive  mssql  mysql  oracle  postgres
>>>
>>> Mine is on Oracle; postgres is good as well.
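>>>
>>> Once the backend database exists and its connection details are in
>>> hive-site.xml, you initialise the schema with schematool, e.g. for
>>> postgres (substitute your chosen dbType):
>>>
>>> $HIVE_HOME/bin/schematool -dbType postgres -initSchema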
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>> On Sat, 12 Aug 2023 at 18:31, Patrick Tucci <patrick.tu...@gmail.com>
>>> wrote:
>>>
>>>> Yes, on-premise.
>>>>
>>>> Unfortunately after installing Delta Lake and re-writing all tables as
>>>> Delta tables, the issue persists.
>>>>
>>>> On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> OK, sure.
>>>>>
>>>>> Is this Delta Lake going to be on-premise?
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Solutions Architect/Engineering Lead
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci <patrick.tu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Mich,
>>>>>>
>>>>>> Thanks for the feedback. My original intention after reading your
>>>>>> response was to stick to Hive for managing tables. Unfortunately, I'm
>>>>>> running into another case of SQL scripts hanging. Since all tables are
>>>>>> already Parquet, I'm out of troubleshooting options. I'm going to migrate
>>>>>> to Delta Lake and see if that solves the issue.
>>>>>>
>>>>>> Thanks again for your feedback.
>>>>>>
>>>>>> Patrick
>>>>>>
>>>>>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Patrick,
>>>>>>>
>>>>>>> There is nothing wrong with Hive. On-premise, it is the best data
>>>>>>> warehouse there is.
>>>>>>>
>>>>>>> Hive handles both the ORC and Parquet formats well. They are both
>>>>>>> columnar implementations of the relational model. What you are seeing
>>>>>>> is the Spark API to Hive, which prefers Parquet; I found this out a
>>>>>>> few years ago.
>>>>>>>
>>>>>>> From your point of view, I suggest you stick to the Parquet format
>>>>>>> with Hive when working through Spark. As far as I know you don't have
>>>>>>> a fully independent Hive DB as yet.
>>>>>>>
>>>>>>> Anyway, stick to Hive for now, as you never know what issues you may
>>>>>>> face moving to Delta Lake.
>>>>>>>
>>>>>>> You can also enable compression:
>>>>>>>
>>>>>>> STORED AS PARQUET
>>>>>>> TBLPROPERTIES ("parquet.compression"="SNAPPY")
>>>>>>>
>>>>>>> and compute column statistics:
>>>>>>>
>>>>>>> ANALYZE TABLE <TABLE_NAME> COMPUTE STATISTICS FOR COLUMNS
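>>>>>>>
>>>>>>> For example, something along these lines; the new table name here is
>>>>>>> just an illustration:
>>>>>>>
>>>>>>> CREATE TABLE MemberEnrollment_parquet
>>>>>>> STORED AS PARQUET
>>>>>>> TBLPROPERTIES ("parquet.compression"="SNAPPY")
>>>>>>> AS SELECT * FROM MemberEnrollment;
>>>>>>>
>>>>>>> ANALYZE TABLE MemberEnrollment_parquet COMPUTE STATISTICS FOR COLUMNS ID, MemberID;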
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Mich Talebzadeh,
>>>>>>> Solutions Architect/Engineering Lead
>>>>>>> London
>>>>>>> United Kingdom
>>>>>>>
>>>>>>> On Fri, 11 Aug 2023 at 11:26, Patrick Tucci <patrick.tu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for the reply Stephen and Mich.
>>>>>>>>
>>>>>>>> Stephen, you're right, it feels like Spark is waiting for
>>>>>>>> something, but I'm not sure what. I'm the only user on the cluster and
>>>>>>>> there are plenty of resources (60+ cores, 250+ GB of RAM). I even tried
>>>>>>>> restarting Hadoop, Spark, and the host servers to make sure nothing was
>>>>>>>> lingering in the background.
>>>>>>>>
>>>>>>>> Mich, thank you so much, your suggestion worked. Storing the tables
>>>>>>>> as Parquet solves the issue.
>>>>>>>>
>>>>>>>> Interestingly, I found that only the MemberEnrollment table needs
>>>>>>>> to be Parquet. The ID field in MemberEnrollment is an int calculated 
>>>>>>>> during
>>>>>>>> load by a ROW_NUMBER() function. Further testing found that if I hard 
>>>>>>>> code
>>>>>>>> a 0 as MemberEnrollment.ID instead of using the ROW_NUMBER() function, 
>>>>>>>> the
>>>>>>>> query works without issue even if both tables are ORC.
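>>>>>>>>
>>>>>>>> Roughly, the difference in the load was this, where StagingEnrollment
>>>>>>>> is a stand-in for the real source table:
>>>>>>>>
>>>>>>>> -- Load whose output hangs the join when stored as ORC:
>>>>>>>> INSERT INTO MemberEnrollment
>>>>>>>> SELECT ROW_NUMBER() OVER (ORDER BY MemberID) AS ID,
>>>>>>>>     MemberID, StartDate, EndDate
>>>>>>>> FROM StagingEnrollment;
>>>>>>>>
>>>>>>>> -- Load whose output joins fine, even as ORC:
>>>>>>>> INSERT INTO MemberEnrollment
>>>>>>>> SELECT 0 AS ID, MemberID, StartDate, EndDate
>>>>>>>> FROM StagingEnrollment;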
>>>>>>>>
>>>>>>>> Should I infer from this issue that the Hive components prefer
>>>>>>>> Parquet over ORC? Furthermore, should I consider using a different 
>>>>>>>> table
>>>>>>>> storage framework, like Delta Lake, instead of the Hive components? 
>>>>>>>> Given
>>>>>>>> this issue and other issues I've had with Hive, I'm starting to think a
>>>>>>>> different solution might be more robust and stable. The main condition 
>>>>>>>> is
>>>>>>>> that my application operates solely through Thrift server, so I need 
>>>>>>>> to be
>>>>>>>> able to connect to Spark through Thrift server and have it write tables
>>>>>>>> using Delta Lake instead of Hive. From this StackOverflow question, it
>>>>>>>> looks like this is possible:
>>>>>>>> https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect-to-delta-using-jdbc
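>>>>>>>>
>>>>>>>> Based on that answer, I'd expect something like this to work, though
>>>>>>>> I haven't tested it and the Delta package version needs to match the
>>>>>>>> Spark build:
>>>>>>>>
>>>>>>>> $SPARK_HOME/sbin/start-thriftserver.sh \
>>>>>>>>   --packages io.delta:delta-core_2.12:2.4.0 \
>>>>>>>>   --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
>>>>>>>>   --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog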
>>>>>>>>
>>>>>>>> Thanks again to everyone who replied for their help.
>>>>>>>>
>>>>>>>> Patrick
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Aug 11, 2023 at 2:14 AM Mich Talebzadeh <
>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Steve may have a valid point. You raised an issue with concurrent
>>>>>>>>> writes before, if I recall correctly; that limitation may be due to
>>>>>>>>> the Hive metastore. By default, Spark uses Apache Derby for its
>>>>>>>>> metadata persistence. *However, Derby is limited to only one Spark
>>>>>>>>> session at any time for the purposes of metadata storage.* That may
>>>>>>>>> be the cause here as well. Does this happen if the underlying tables
>>>>>>>>> are created as PARQUET as opposed to ORC?
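>>>>>>>>>
>>>>>>>>> If Derby turns out to be the culprit, the usual fix is to run a
>>>>>>>>> standalone metastore service and point Spark at it in hive-site.xml,
>>>>>>>>> along these lines (host and port are placeholders):
>>>>>>>>>
>>>>>>>>> <property>
>>>>>>>>>   <name>hive.metastore.uris</name>
>>>>>>>>>   <value>thrift://metastore-host:9083</value>
>>>>>>>>> </property>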
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Mich Talebzadeh,
>>>>>>>>> Solutions Architect/Engineering Lead
>>>>>>>>> London
>>>>>>>>> United Kingdom
>>>>>>>>>
>>>>>>>>> On Fri, 11 Aug 2023 at 01:33, Stephen Coy
>>>>>>>>> <s...@infomedia.com.au.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Patrick,
>>>>>>>>>>
>>>>>>>>>> When this has happened to me in the past (admittedly via
>>>>>>>>>> spark-submit) it has been because another job was still running and 
>>>>>>>>>> had
>>>>>>>>>> already claimed some of the resources (cores and memory).
>>>>>>>>>>
>>>>>>>>>> I think this can also happen if your configuration tries to claim
>>>>>>>>>> resources that will never be available.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> SteveC
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 11 Aug 2023, at 3:36 am, Patrick Tucci <
>>>>>>>>>> patrick.tu...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I'm attempting to run a query on Spark 3.4.0 through the Spark
>>>>>>>>>> ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
>>>>>>>>>> standalone mode using HDFS for storage.
>>>>>>>>>>
>>>>>>>>>> The query is as follows:
>>>>>>>>>>
>>>>>>>>>> SELECT ME.*, MB.BenefitID
>>>>>>>>>> FROM MemberEnrollment ME
>>>>>>>>>> JOIN MemberBenefits MB
>>>>>>>>>> ON ME.ID = MB.EnrollmentID
>>>>>>>>>> WHERE MB.BenefitID = 5
>>>>>>>>>> LIMIT 10
>>>>>>>>>>
>>>>>>>>>> The tables are defined as follows:
>>>>>>>>>>
>>>>>>>>>> -- Contains about 3M rows
>>>>>>>>>> CREATE TABLE MemberEnrollment
>>>>>>>>>> (
>>>>>>>>>>     ID INT
>>>>>>>>>>     , MemberID VARCHAR(50)
>>>>>>>>>>     , StartDate DATE
>>>>>>>>>>     , EndDate DATE
>>>>>>>>>>     -- Other columns, but these are the most important
>>>>>>>>>> ) STORED AS ORC;
>>>>>>>>>>
>>>>>>>>>> -- Contains about 25m rows
>>>>>>>>>> CREATE TABLE MemberBenefits
>>>>>>>>>> (
>>>>>>>>>>     EnrollmentID INT
>>>>>>>>>>     , BenefitID INT
>>>>>>>>>> ) STORED AS ORC;
>>>>>>>>>>
>>>>>>>>>> When I execute the query, it runs a single broadcast exchange
>>>>>>>>>> stage, which completes after a few seconds. Then everything just 
>>>>>>>>>> hangs. The
>>>>>>>>>> JDBC/ODBC tab in the UI shows the query state as COMPILED, but no 
>>>>>>>>>> stages or
>>>>>>>>>> tasks are executing or pending.
>>>>>>>>>>
>>>>>>>>>> I've let the query run for as long as 30 minutes with no
>>>>>>>>>> additional stages, progress, or errors. I'm not sure where to start
>>>>>>>>>> troubleshooting.
>>>>>>>>>>
>>>>>>>>>> Thanks for your help,
>>>>>>>>>>
>>>>>>>>>> Patrick
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
