Yes, on-premise. Unfortunately, after installing Delta Lake and rewriting all tables as Delta tables, the issue persists.
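For reference, a minimal sketch of one way such a rewrite can be done in Spark SQL with the Delta extensions enabled; the thread doesn't show the actual commands, and the _delta table name is illustrative:

-- Copy an existing Hive table into a new Delta table (assumed approach;
-- the original statements aren't quoted in the thread).
CREATE TABLE MemberEnrollment_delta
USING DELTA
AS SELECT * FROM MemberEnrollment;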
On Sat, Aug 12, 2023 at 11:34 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Ok, sure.
>
> Is this Delta Lake going to be on-premise?
>
> Mich Talebzadeh
> Solutions Architect/Engineering Lead
> London, United Kingdom
>
> On Sat, 12 Aug 2023 at 12:03, Patrick Tucci <patrick.tu...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> Thanks for the feedback. My original intention after reading your
>> response was to stick to Hive for managing tables. Unfortunately, I'm
>> running into another case of SQL scripts hanging. Since all tables are
>> already Parquet, I'm out of troubleshooting options. I'm going to migrate
>> to Delta Lake and see if that solves the issue.
>>
>> Thanks again for your feedback.
>>
>> Patrick
>>
>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Hi Patrick,
>>>
>>> There is nothing wrong with Hive; on-premise, it is the best data
>>> warehouse there is.
>>>
>>> Hive handles both ORC and Parquet formats well. They are both columnar
>>> implementations of the relational model. What you are seeing is the
>>> Spark API to Hive, which prefers Parquet; I found this out a few years
>>> ago.
>>>
>>> From your point of view, I suggest you stick to the Parquet format with
>>> Hive, specifically for Spark. As far as I know, you don't have a fully
>>> independent Hive DB as yet.
>>>
>>> Anyway, stick to Hive for now, as you never know what issues you may
>>> face moving to Delta Lake.
>>>
>>> You can also use compression:
>>>
>>> STORED AS PARQUET
>>> TBLPROPERTIES ("parquet.compression"="SNAPPY")
>>>
>>> Also:
>>>
>>> ANALYZE TABLE <TABLE_NAME> COMPUTE STATISTICS FOR COLUMNS
>>>
>>> HTH
>>>
>>> Mich Talebzadeh
>>>
>>> On Fri, 11 Aug 2023 at 11:26, Patrick Tucci <patrick.tu...@gmail.com> wrote:
>>>
>>>> Thanks for the reply, Stephen and Mich.
>>>>
>>>> Stephen, you're right, it feels like Spark is waiting for something,
>>>> but I'm not sure what. I'm the only user on the cluster and there are
>>>> plenty of resources (60+ cores, 250+ GB RAM). I even tried restarting
>>>> Hadoop, Spark and the host servers to make sure nothing was lingering
>>>> in the background.
>>>>
>>>> Mich, thank you so much, your suggestion worked. Storing the tables as
>>>> Parquet solves the issue.
>>>>
>>>> Interestingly, I found that only the MemberEnrollment table needs to
>>>> be Parquet. The ID field in MemberEnrollment is an int calculated
>>>> during load by a ROW_NUMBER() function.
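A minimal sketch of the kind of load step described above; the actual statement isn't shown in the thread, so the staging table and ordering column are assumptions:

INSERT INTO MemberEnrollment
SELECT ROW_NUMBER() OVER (ORDER BY MemberID) AS ID,  -- assumed ordering column
       MemberID,
       StartDate,
       EndDate
FROM MemberEnrollment_staging;  -- hypothetical source table

Worth noting: a window function with no PARTITION BY clause has to pull every row into a single partition to assign globally unique numbers, which is one way a step like this can stall on a large table.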
>>>> Further testing found that if I hard-code a 0 as MemberEnrollment.ID
>>>> instead of using the ROW_NUMBER() function, the query works without
>>>> issue even if both tables are ORC.
>>>>
>>>> Should I infer from this issue that the Hive components prefer Parquet
>>>> over ORC? Furthermore, should I consider using a different table
>>>> storage framework, like Delta Lake, instead of the Hive components?
>>>> Given this issue and other issues I've had with Hive, I'm starting to
>>>> think a different solution might be more robust and stable. The main
>>>> condition is that my application operates solely through Thrift
>>>> server, so I need to be able to connect to Spark through Thrift server
>>>> and have it write tables using Delta Lake instead of Hive. From this
>>>> StackOverflow question, it looks like this is possible:
>>>> https://stackoverflow.com/questions/69862388/how-to-run-spark-sql-thrift-server-in-local-mode-and-connect-to-delta-using-jdbc
>>>>
>>>> Thanks again to everyone who replied for their help.
>>>>
>>>> Patrick
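A minimal sketch of the setup the StackOverflow link describes, assuming the Thrift server is launched with the Delta Lake package and its two standard configuration keys; the table name below is hypothetical:

-- Assumes the Thrift server was started with Delta Lake on the classpath and:
--   spark.sql.extensions=io.delta.sql.DeltaSparkSqlExtension
--   spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
-- A JDBC client such as beeline can then create Delta tables directly:
CREATE TABLE enrollment_delta (ID INT, MemberID STRING) USING DELTA;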
>>>> On Fri, Aug 11, 2023 at 2:14 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Steve may have a valid point. You raised an issue with concurrent
>>>>> writes before, if I recall correctly; this limitation may be due to
>>>>> the Hive metastore. By default, Spark uses Apache Derby for its
>>>>> metadata persistence. *However, Derby is limited to only one Spark
>>>>> session at any time for the purposes of metadata storage.* That may
>>>>> be the cause here as well. Does this happen if the underlying tables
>>>>> are created as PARQUET as opposed to ORC?
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh
>>>>>
>>>>> On Fri, 11 Aug 2023 at 01:33, Stephen Coy <s...@infomedia.com.au.invalid> wrote:
>>>>>
>>>>>> Hi Patrick,
>>>>>>
>>>>>> When this has happened to me in the past (admittedly via
>>>>>> spark-submit), it has been because another job was still running and
>>>>>> had already claimed some of the resources (cores and memory).
>>>>>>
>>>>>> I think this can also happen if your configuration tries to claim
>>>>>> resources that will never be available.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> SteveC
>>>>>>
>>>>>> On 11 Aug 2023, at 3:36 am, Patrick Tucci <patrick.tu...@gmail.com> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I'm attempting to run a query on Spark 3.4.0 through the Spark
>>>>>> ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
>>>>>> standalone mode using HDFS for storage.
>>>>>>
>>>>>> The query is as follows:
>>>>>>
>>>>>> SELECT ME.*, MB.BenefitID
>>>>>> FROM MemberEnrollment ME
>>>>>> JOIN MemberBenefits MB
>>>>>>   ON ME.ID = MB.EnrollmentID
>>>>>> WHERE MB.BenefitID = 5
>>>>>> LIMIT 10
>>>>>>
>>>>>> The tables are defined as follows:
>>>>>>
>>>>>> -- Contains about 3M rows
>>>>>> CREATE TABLE MemberEnrollment
>>>>>> (
>>>>>>     ID INT
>>>>>>     , MemberID VARCHAR(50)
>>>>>>     , StartDate DATE
>>>>>>     , EndDate DATE
>>>>>>     -- Other columns, but these are the most important
>>>>>> ) STORED AS ORC;
>>>>>>
>>>>>> -- Contains about 25M rows
>>>>>> CREATE TABLE MemberBenefits
>>>>>> (
>>>>>>     EnrollmentID INT
>>>>>>     , BenefitID INT
>>>>>> ) STORED AS ORC;
>>>>>>
>>>>>> When I execute the query, it runs a single broadcast exchange stage,
>>>>>> which completes after a few seconds. Then everything just hangs. The
>>>>>> JDBC/ODBC tab in the UI shows the query state as COMPILED, but no
>>>>>> stages or tasks are executing or pending:
>>>>>>
>>>>>> [screenshot of the JDBC/ODBC tab omitted]
>>>>>>
>>>>>> I've let the query run for as long as 30 minutes with no additional
>>>>>> stages, progress, or errors. I'm not sure where to start
>>>>>> troubleshooting.
>>>>>>
>>>>>> Thanks for your help,
>>>>>>
>>>>>> Patrick
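Not suggested in the thread itself, but a common first check for a hang that follows a broadcast exchange is to rule the broadcast join out; a minimal sketch, runnable from the same JDBC session:

-- Disable automatic broadcast joins for the session (a standard Spark SQL
-- setting) and re-run the query; if it then proceeds, the hang relates to
-- the broadcast exchange rather than the data itself.
SET spark.sql.autoBroadcastJoinThreshold=-1;

SELECT ME.*, MB.BenefitID
FROM MemberEnrollment ME
JOIN MemberBenefits MB ON ME.ID = MB.EnrollmentID
WHERE MB.BenefitID = 5
LIMIT 10;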