Yes, on-premise. Unfortunately, after installing Delta Lake and rewriting all tables as Delta tables, the issue persists.
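For reference, a minimal sketch of one way such a rewrite can be done in Spark SQL with the Delta extensions enabled; the thread doesn't show the actual commands, and the _delta table name is illustrative:

-- Copy an existing Hive table into a new Delta table (assumed approach;
-- the original statements aren't quoted in the thread).
CREATE TABLE MemberEnrollment_delta
USING DELTA
AS SELECT * FROM MemberEnrollment;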
On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Ok, sure.
>
> Is this Delta Lake going to be on-premise?
>
> Mich Talebzadeh
> Solutions Architect/Engineering Lead
> London, United Kingdom
>
> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci <patrick.tu...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> Thanks for the feedback. My original intention after reading your
>> response was to stick to Hive for managing tables. Unfortunately, I'm
>> running into another case of SQL scripts hanging. Since all tables are
>> already Parquet, I'm out of troubleshooting options. I'm going to migrate
>> to Delta Lake and see if that solves the issue.
>>
>> Thanks again for your feedback.
>>
>> Patrick
>>
>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Hi Patrick,
>>>
>>> There is nothing wrong with Hive; on-premise, it is the best data
>>> warehouse there is.
>>>
>>> Hive handles both ORC and Parquet formats well. They are both columnar
>>> implementations of the relational model. What you are seeing is the
>>> Spark API to Hive, which prefers Parquet; I found this out a few years
>>> ago.
>>>
>>> From your point of view, I suggest you stick to the Parquet format with
>>> Hive, specifically for Spark. As far as I know, you don't have a fully
>>> independent Hive DB as yet.
>>>
>>> Anyway, stick to Hive for now, as you never know what issues you may
>>> face moving to Delta Lake.
>>>
>>> You can also use compression:
>>>
>>> STORED AS PARQUET
>>> TBLPROPERTIES ("parquet.compression"="SNAPPY")
>>>
>>> Also:
>>>
>>> ANALYZE TABLE <TABLE_NAME> COMPUTE STATISTICS FOR COLUMNS
>>>
>>> HTH
>>>
>>> Mich Talebzadeh
>>>
>>> On Fri, 11 Aug 2023 at 11:26, Patrick Tucci <patrick.tu...@gmail.com> wrote:
>>>
>>>> Thanks for the reply, Stephen and Mich.
>>>>
>>>> Stephen, you're right, it feels like Spark is waiting for something,
>>>> but I'm not sure what. I'm the only user on the cluster and there are
>>>> plenty of resources (60+ cores, 250+ GB RAM). I even tried restarting
>>>> Hadoop, Spark and the host servers to make sure nothing was lingering
>>>> in the background.
>>>>
>>>> Mich, thank you so much, your suggestion worked. Storing the tables as
>>>> Parquet solves the issue.
>>>>
>>>> Interestingly, I found that only the MemberEnrollment table needs to
>>>> be Parquet. The ID field in MemberEnrollment is an int calculated
>>>> during load by a ROW_NUMBER() function.
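A minimal sketch of the kind of load step described above; the actual statement isn't shown in the thread, so the staging table and ordering column are assumptions:

INSERT INTO MemberEnrollment
SELECT ROW_NUMBER() OVER (ORDER BY MemberID) AS ID,  -- assumed ordering column
       MemberID,
       StartDate,
       EndDate
FROM MemberEnrollment_staging;  -- hypothetical source table

Worth noting: a window function with no PARTITION BY clause has to pull every row into a single partition to assign globally unique numbers, which is one way a step like this can stall on a large table.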
>>>> Further testing found that if I hard-code a 0 as MemberEnrollment.ID
>>>> instead of using the ROW_NUMBER() function, the query works without
>>>> issue even if both tables are ORC.
>>>>
>>>> Should I infer from this issue that the Hive components prefer Parquet
>>>> over ORC? Furthermore, should I consider using a different table
>>>> storage framework, like Delta Lake, instead of the Hive components?
>>>> Given this issue and other issues I've had with Hive, I'm starting to
>>>> think a different solution might be more robust and stable. The main
>>>> condition is that my application operates solely through Thrift
>>>> server, so I need to be able to connect to Spark through Thrift server
>>>> and have it write tables using Delta Lake instead of Hive. From this
>>>> StackOverflow question, it looks like this is possible:
>>>> https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect-to-delta-using-jdbc
>>>>
>>>> Thanks again to everyone who replied for their help.
>>>>
>>>> Patrick
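A minimal sketch of the setup the StackOverflow link describes, assuming the Thrift server is launched with the Delta Lake package and its two standard configuration keys; the table name below is hypothetical:

-- Assumes the Thrift server was started with Delta Lake on the classpath and:
--   spark.sql.extensions=io.delta.sql.DeltaSparkSqlExtension
--   spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
-- A JDBC client such as beeline can then create Delta tables directly:
CREATE TABLE enrollment_delta (ID INT, MemberID STRING) USING DELTA;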
>>>> On Fri, Aug 11, 2023 at 2:14 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Steve may have a valid point. You raised an issue with concurrent
>>>>> writes before, if I recall correctly; this limitation may be due to
>>>>> the Hive metastore. By default, Spark uses Apache Derby for its
>>>>> metadata persistence. *However, Derby is limited to only one Spark
>>>>> session at any time for the purposes of metadata storage.* That may
>>>>> be the cause here as well. Does this happen if the underlying tables
>>>>> are created as PARQUET as opposed to ORC?
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh
>>>>>
>>>>> On Fri, 11 Aug 2023 at 01:33, Stephen Coy <s...@infomedia.com.au.invalid> wrote:
>>>>>
>>>>>> Hi Patrick,
>>>>>>
>>>>>> When this has happened to me in the past (admittedly via
>>>>>> spark-submit), it has been because another job was still running and
>>>>>> had already claimed some of the resources (cores and memory).
>>>>>>
>>>>>> I think this can also happen if your configuration tries to claim
>>>>>> resources that will never be available.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> SteveC
>>>>>>
>>>>>> On 11 Aug 2023, at 3:36 am, Patrick Tucci <patrick.tu...@gmail.com> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I'm attempting to run a query on Spark 3.4.0 through the Spark
>>>>>> ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
>>>>>> standalone mode using HDFS for storage.
>>>>>>
>>>>>> The query is as follows:
>>>>>>
>>>>>> SELECT ME.*, MB.BenefitID
>>>>>> FROM MemberEnrollment ME
>>>>>> JOIN MemberBenefits MB
>>>>>>   ON ME.ID = MB.EnrollmentID
>>>>>> WHERE MB.BenefitID = 5
>>>>>> LIMIT 10
>>>>>>
>>>>>> The tables are defined as follows:
>>>>>>
>>>>>> -- Contains about 3M rows
>>>>>> CREATE TABLE MemberEnrollment
>>>>>> (
>>>>>>     ID INT
>>>>>>     , MemberID VARCHAR(50)
>>>>>>     , StartDate DATE
>>>>>>     , EndDate DATE
>>>>>>     -- Other columns, but these are the most important
>>>>>> ) STORED AS ORC;
>>>>>>
>>>>>> -- Contains about 25M rows
>>>>>> CREATE TABLE MemberBenefits
>>>>>> (
>>>>>>     EnrollmentID INT
>>>>>>     , BenefitID INT
>>>>>> ) STORED AS ORC;
>>>>>>
>>>>>> When I execute the query, it runs a single broadcast exchange stage,
>>>>>> which completes after a few seconds. Then everything just hangs. The
>>>>>> JDBC/ODBC tab in the UI shows the query state as COMPILED, but no
>>>>>> stages or tasks are executing or pending:
>>>>>>
>>>>>> [screenshot of the JDBC/ODBC tab omitted]
>>>>>>
>>>>>> I've let the query run for as long as 30 minutes with no additional
>>>>>> stages, progress, or errors. I'm not sure where to start
>>>>>> troubleshooting.
>>>>>>
>>>>>> Thanks for your help,
>>>>>>
>>>>>> Patrick
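Not suggested in the thread itself, but a common first check for a hang that follows a broadcast exchange is to rule the broadcast join out; a minimal sketch, runnable from the same JDBC session:

-- Disable automatic broadcast joins for the session (a standard Spark SQL
-- setting) and re-run the query; if it then proceeds, the hang relates to
-- the broadcast exchange rather than the data itself.
SET spark.sql.autoBroadcastJoinThreshold=-1;

SELECT ME.*, MB.BenefitID
FROM MemberEnrollment ME
JOIN MemberBenefits MB ON ME.ID = MB.EnrollmentID
WHERE MB.BenefitID = 5
LIMIT 10;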