Hi everyone,

I just wanted to follow up on this issue, which has continued since our last correspondence. Today I had a query hang that I couldn't resolve, so I decided to upgrade my Spark install from 3.4.0 to 3.4.1. After doing so, instead of the query hanging, I got an error message saying the driver didn't have enough memory to broadcast objects. After increasing the driver memory, the query runs without issue.
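For anyone who hits the same error, the fix amounts to raising spark.driver.memory before the Thrift server starts. A minimal sketch (the value is purely illustrative; size it to your data):

    # spark-defaults.conf -- illustrative value, not a recommendation
    spark.driver.memory    8g

    # or equivalently, when launching the Thrift server
    sbin/start-thriftserver.sh --driver-memory 8g

Note that spark.driver.memory only takes effect when the driver JVM starts, so it has to go into the launch configuration rather than a SET command in an existing session.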
I hope this can be helpful to someone else in the future.

Thanks again for the support,
Patrick

On Sun, Aug 13, 2023 at 7:52 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> OK I use Hive 3.1.1
>
> My suggestion is to put your Hive issues, and the Java version compatibility question, to u...@hive.apache.org. They will give you better info.
>
> HTH
>
> Mich Talebzadeh
>
> On Sun, 13 Aug 2023 at 11:48, Patrick Tucci <patrick.tu...@gmail.com> wrote:
>
>> I attempted to install Hive yesterday. The experience was similar to other attempts at installing Hive: it took a few hours and at the end of the process, I didn't have a working setup. The latest stable release would not run. I never discovered the cause, but similar StackOverflow questions suggest it might be a Java incompatibility issue. Since I didn't want to downgrade or install an additional Java version, I attempted to use the latest alpha as well. This appears to have worked, although I couldn't figure out how to get it to use the metastore_db from Spark.
>>
>> After turning my attention back to Spark, I tracked down the issue. After much troubleshooting, I discovered that if I performed a COUNT(*) using the same JOINs, the problem query worked. I removed all the columns from the SELECT statement and added them one by one until I found the culprit. It's a text field on one of the tables. When the query SELECTs this column, or attempts to filter on it, the query hangs and never completes. If I remove all explicit references to this column, the query works fine. Since I need this column in the results, I went back to the ETL and extracted the values to a dimension table. I replaced the text column in the source table with an integer ID column and the query worked without issue.
>>
>> On the topic of Hive, does anyone have any detailed resources for how to set up Hive from scratch, aside from the official site? Those instructions didn't work for me. I'm starting to feel uneasy about building my process around Spark. There really shouldn't be any instances where I ask Spark to run legal ANSI SQL code and it just does nothing. In the past 4 days I've run into 2 of these instances, and the solution was more voodoo and magic than examining errors/logs and fixing code. I feel that I should have a contingency plan in place for when I run into an issue with Spark that can't be resolved.
>>
>> Thanks everyone.
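For illustration, the dimension-table workaround described above might look roughly like the following. The table and column names here are invented and other columns are omitted; this is only a sketch of the idea, not the actual ETL:

    -- Collect the distinct text values into a dimension table with surrogate integer keys
    CREATE TABLE NoteDim STORED AS PARQUET AS
    SELECT ROW_NUMBER() OVER (ORDER BY NoteText) AS NoteID,
           NoteText
    FROM (SELECT DISTINCT NoteText FROM SourceTable_stage) t;

    -- Rebuild the source table with the integer key in place of the raw text column
    CREATE TABLE SourceTable STORED AS PARQUET AS
    SELECT s.ID,
           d.NoteID        -- integer surrogate replaces the original text column
    FROM SourceTable_stage s
    LEFT JOIN NoteDim d
      ON s.NoteText = d.NoteText;

Queries then join back to NoteDim only when the text itself is needed, so the problematic column stays out of the main joins and filters.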
>> On Sat, Aug 12, 2023 at 2:18 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> OK, you would not have known unless you went through the process, so to speak.
>>>
>>> Let us do something revolutionary here 😁
>>>
>>> Install Hive and its metastore. You already have Hadoop anyway.
>>>
>>> https://cwiki.apache.org/confluence/display/hive/adminmanual+installation
>>>
>>> Hive metastore:
>>>
>>> https://data-flair.training/blogs/apache-hive-metastore/#:~:text=What%20is%20Hive%20Metastore%3F,by%20using%20metastore%20service%20API
>>>
>>> For the backing database, choose one of these: derby, hive, mssql, mysql, oracle, postgres.
>>>
>>> Mine is on Oracle; Postgres is good as well.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh
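For anyone who goes down the same route, the metastore setup largely comes down to a handful of JDBC properties in hive-site.xml plus a one-off schema initialization. A minimal sketch for a Postgres-backed metastore (host, database name and credentials are placeholders):

    <!-- hive-site.xml: point the metastore at an external Postgres database -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:postgresql://dbhost:5432/metastore</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.postgresql.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>changeme</value>
    </property>

    # initialize the schema once, then run the metastore service
    schematool -dbType postgres -initSchema
    hive --service metastore

The Postgres JDBC driver jar needs to be on Hive's classpath, and Spark can then be pointed at the running service via hive.metastore.uris.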
>>> On Sat, 12 Aug 2023 at 18:31, Patrick Tucci <patrick.tu...@gmail.com> wrote:
>>>
>>>> Yes, on premise.
>>>>
>>>> Unfortunately, after installing Delta Lake and re-writing all tables as Delta tables, the issue persists.
>>>>
>>>> On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> OK sure.
>>>>>
>>>>> Is this Delta Lake going to be on-premise?
>>>>>
>>>>> Mich Talebzadeh
>>>>>
>>>>> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci <patrick.tu...@gmail.com> wrote:
>>>>>
>>>>>> Hi Mich,
>>>>>>
>>>>>> Thanks for the feedback. My original intention after reading your response was to stick to Hive for managing tables. Unfortunately, I'm running into another case of SQL scripts hanging. Since all tables are already Parquet, I'm out of troubleshooting options. I'm going to migrate to Delta Lake and see if that solves the issue.
>>>>>>
>>>>>> Thanks again for your feedback.
>>>>>>
>>>>>> Patrick
>>>>>>
>>>>>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Patrick,
>>>>>>>
>>>>>>> There is nothing wrong with Hive. On-premise, it is the best data warehouse there is.
>>>>>>>
>>>>>>> Hive handles both the ORC and Parquet formats well. They are both columnar implementations of the relational model. What you are seeing is the Spark API to Hive, which prefers Parquet; I found this out a few years ago.
>>>>>>>
>>>>>>> From your point of view, I suggest you stick to the Parquet format with Hive when working through Spark. As far as I know you don't have a fully independent Hive DB as yet.
>>>>>>>
>>>>>>> Anyway, stick to Hive for now, as you never know what issues you may face moving to Delta Lake.
>>>>>>>
>>>>>>> You can also use compression:
>>>>>>>
>>>>>>> STORED AS PARQUET
>>>>>>> TBLPROPERTIES ("parquet.compression"="SNAPPY")
>>>>>>>
>>>>>>> Also:
>>>>>>>
>>>>>>> ANALYZE TABLE <TABLE_NAME> COMPUTE STATISTICS FOR COLUMNS
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Mich Talebzadeh
>>>>>>>
>>>>>>> On Fri, 11 Aug 2023 at 11:26, Patrick Tucci <patrick.tu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks for the reply, Stephen and Mich.
>>>>>>>>
>>>>>>>> Stephen, you're right, it feels like Spark is waiting for something, but I'm not sure what. I'm the only user on the cluster and there are plenty of resources (60+ cores, 250+ GB RAM). I even tried restarting Hadoop, Spark and the host servers to make sure nothing was lingering in the background.
>>>>>>>>
>>>>>>>> Mich, thank you so much, your suggestion worked. Storing the tables as Parquet solves the issue.
>>>>>>>>
>>>>>>>> Interestingly, I found that only the MemberEnrollment table needs to be Parquet. The ID field in MemberEnrollment is an int calculated during load by a ROW_NUMBER() function. Further testing found that if I hard-code a 0 as MemberEnrollment.ID instead of using the ROW_NUMBER() function, the query works without issue even if both tables are ORC.
>>>>>>>>
>>>>>>>> Should I infer from this issue that the Hive components prefer Parquet over ORC? Furthermore, should I consider using a different table storage framework, like Delta Lake, instead of the Hive components? Given this issue and other issues I've had with Hive, I'm starting to think a different solution might be more robust and stable. The main condition is that my application operates solely through the Thrift server, so I need to be able to connect to Spark through the Thrift server and have it write tables using Delta Lake instead of Hive. From this StackOverflow question, it looks like this is possible: https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect-to-delta-using-jdbc
>>>>>>>>
>>>>>>>> Thanks again to everyone who replied for their help.
>>>>>>>>
>>>>>>>> Patrick
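For reference, pointing the Spark Thrift Server at Delta Lake (the setup the StackOverflow link above is about) generally comes down to launching it with the Delta package and the Delta SQL extension and catalog settings. A rough sketch; the package version is illustrative and has to match the Spark build:

    sbin/start-thriftserver.sh \
      --packages io.delta:delta-core_2.12:2.4.0 \
      --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
      --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

Tables can then be created over JDBC with USING DELTA instead of STORED AS ORC/PARQUET.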
>>>>>>>> On Fri, Aug 11, 2023 at 2:14 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Steve may have a valid point. You raised an issue with concurrent writes before, if I recall correctly, and that limitation may be due to the Hive metastore. By default Spark uses Apache Derby for its metastore persistence. *However, it is limited to only one Spark session at any time for the purposes of metadata storage.* That may be the cause here as well. Does this happen if the underlying tables are created as PARQUET as opposed to ORC?
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Mich Talebzadeh
>>>>>>>>>
>>>>>>>>> On Fri, 11 Aug 2023 at 01:33, Stephen Coy <s...@infomedia.com.au.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Patrick,
>>>>>>>>>>
>>>>>>>>>> When this has happened to me in the past (admittedly via spark-submit) it has been because another job was still running and had already claimed some of the resources (cores and memory).
>>>>>>>>>>
>>>>>>>>>> I think this can also happen if your configuration tries to claim resources that will never be available.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> SteveC
>>>>>>>>>>
>>>>>>>>>> On 11 Aug 2023, at 3:36 am, Patrick Tucci <patrick.tu...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I'm attempting to run a query on Spark 3.4.0 through the Spark ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in standalone mode using HDFS for storage.
>>>>>>>>>>
>>>>>>>>>> The query is as follows:
>>>>>>>>>>
>>>>>>>>>> SELECT ME.*, MB.BenefitID
>>>>>>>>>> FROM MemberEnrollment ME
>>>>>>>>>> JOIN MemberBenefits MB
>>>>>>>>>>   ON ME.ID = MB.EnrollmentID
>>>>>>>>>> WHERE MB.BenefitID = 5
>>>>>>>>>> LIMIT 10
>>>>>>>>>>
>>>>>>>>>> The tables are defined as follows:
>>>>>>>>>>
>>>>>>>>>> -- Contains about 3M rows
>>>>>>>>>> CREATE TABLE MemberEnrollment
>>>>>>>>>> (
>>>>>>>>>>     ID INT
>>>>>>>>>>   , MemberID VARCHAR(50)
>>>>>>>>>>   , StartDate DATE
>>>>>>>>>>   , EndDate DATE
>>>>>>>>>>   -- Other columns, but these are the most important
>>>>>>>>>> ) STORED AS ORC;
>>>>>>>>>>
>>>>>>>>>> -- Contains about 25M rows
>>>>>>>>>> CREATE TABLE MemberBenefits
>>>>>>>>>> (
>>>>>>>>>>     EnrollmentID INT
>>>>>>>>>>   , BenefitID INT
>>>>>>>>>> ) STORED AS ORC;
>>>>>>>>>>
>>>>>>>>>> When I execute the query, it runs a single broadcast exchange stage, which completes after a few seconds. Then everything just hangs. The JDBC/ODBC tab in the UI shows the query state as COMPILED, but no stages or tasks are executing or pending:
>>>>>>>>>>
>>>>>>>>>> <image.png>
>>>>>>>>>>
>>>>>>>>>> I've let the query run for as long as 30 minutes with no additional stages, progress, or errors. I'm not sure where to start troubleshooting.
>>>>>>>>>> Thanks for your help,
>>>>>>>>>>
>>>>>>>>>> Patrick
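Given where the issue eventually landed (a broadcast-related driver memory limit), one low-risk first check for a hang like this is to disable automatic broadcast joins for the session and see whether the join completes as a shuffle join instead. A sketch of that check from the same beeline/JDBC session; the settings are standard Spark SQL, and -1 turns the broadcast threshold off:

    -- run in the same session, before the problem query
    SET spark.sql.autoBroadcastJoinThreshold=-1;

    -- optionally inspect which join strategy the planner picks
    EXPLAIN FORMATTED
    SELECT ME.*, MB.BenefitID
    FROM MemberEnrollment ME
    JOIN MemberBenefits MB ON ME.ID = MB.EnrollmentID
    WHERE MB.BenefitID = 5
    LIMIT 10;

If the query runs with broadcasting disabled, the broadcast step (and therefore driver memory) is the likely bottleneck, which matches the error that finally surfaced after the 3.4.1 upgrade.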