[ https://issues.apache.org/jira/browse/SPARK-5452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen closed SPARK-5452.
----------------------------
    Resolution: Not a Problem

> We are migrating Teradata SQL to Spark SQL. The query is taking a long time.
> Please have a look at this issue
> --------------------------------------------------------------------------
>
>                 Key: SPARK-5452
>                 URL: https://issues.apache.org/jira/browse/SPARK-5452
>             Project: Spark
>          Issue Type: Test
>          Components: Spark Shell
>    Affects Versions: 1.2.0
>            Reporter: irfan
>              Labels: SparkSql
>
> Hi Team,
> We are migrating Teradata SQL to Spark SQL. Because of its complexity, we have split the query into the four sub-queries below, which we run through a HiveContext.
> ====================================================
> val HIVETMP1 = hc.sql("SELECT PARTY_ACCOUNT_ID AS PARTY_ACCOUNT_ID, LMS_ACCOUNT_ID AS LMS_ACCOUNT_ID FROM VW_PARTY_ACCOUNT WHERE PARTY_ACCOUNT_TYPE_CODE IN ('04') AND LMS_ACCOUNT_ID IS NOT NULL")
> HIVETMP1.registerTempTable("VW_HIVETMP1")
> val HIVETMP2 = hc.sql("SELECT PACCNT.LMS_ACCOUNT_ID AS LMS_ACCOUNT_ID, 'NULL' AS RANDOM_PARTY_ACCOUNT_ID, 'NULL' AS MOST_RECENT_SPEND_LA, STXN.PARTY_ACCOUNT_ID AS MAX_SPEND_12WKS_LA, STXN.MAX_SPEND_12WKS_LADATE AS MAX_SPEND_12WKS_LADATE FROM VW_HIVETMP1 AS PACCNT INNER JOIN (SELECT STXTMP.PARTY_ACCOUNT_ID AS PARTY_ACCOUNT_ID, SUM(CASE WHEN (CAST(STXTMP.TRANSACTION_DATE AS DATE) > DATE_SUB(CAST(CONCAT(SUBSTRING(SYSTMP.OPTION_VAL,1,4),'-',SUBSTRING(SYSTMP.OPTION_VAL,5,2),'-',SUBSTRING(SYSTMP.OPTION_VAL,7,2)) AS DATE), 84)) THEN STXTMP.TRANSACTION_VALUE ELSE 0.00 END) AS MAX_SPEND_12WKS_LADATE FROM VW_SHOPPING_TRANSACTION_TABLE AS STXTMP INNER JOIN SYSTEM_OPTION_TABLE AS SYSTMP ON STXTMP.FLAG = SYSTMP.FLAG AND SYSTMP.OPTION_NAME = 'RID' AND STXTMP.PARTY_ACCOUNT_TYPE_CODE IN ('04') GROUP BY STXTMP.PARTY_ACCOUNT_ID) AS STXN ON PACCNT.PARTY_ACCOUNT_ID = STXN.PARTY_ACCOUNT_ID WHERE STXN.MAX_SPEND_12WKS_LADATE IS NOT NULL")
> HIVETMP2.registerTempTable("VW_HIVETMP2")
> val HIVETMP3 = hc.sql("SELECT LMS_ACCOUNT_ID, MAX(MAX_SPEND_12WKS_LA) AS MAX_SPEND_12WKS_LA, 1 AS RANK FROM VW_HIVETMP2 GROUP BY LMS_ACCOUNT_ID")
> HIVETMP3.registerTempTable("VW_HIVETMP3")
> val HIVETMP4 = hc.sql("SELECT PACCNT.LMS_ACCOUNT_ID, 'NULL' AS RANDOM_PARTY_ACCOUNT_ID, 'NULL' AS MOST_RECENT_SPEND_LA, STXN.MAX_SPEND_12WKS_LA AS MAX_SPEND_12WKS_LA, 1 AS RANK1 FROM VW_HIVETMP2 AS PACCNT INNER JOIN VW_HIVETMP3 AS STXN ON PACCNT.LMS_ACCOUNT_ID = STXN.LMS_ACCOUNT_ID AND PACCNT.MAX_SPEND_12WKS_LA = STXN.MAX_SPEND_12WKS_LA")
> HIVETMP4.registerTempTable("WT03_ACCOUNT_BHVR3")
> HIVETMP4.saveAsTextFile("hdfs:/file/")
> ====================================================
> This query has two GROUP BY clauses running over a large file (19.5 GB), and the job takes 40 minutes to produce the final result. Are there any changes to the runtime environment or to Spark configuration settings that could improve the query performance?
> Below are our environment and configuration details:
> Environment details:
>   Number of nodes: 4
>   RAM: 62 GB on each node
>   Storage capacity: 9 TB on each node
>   Total cores: 48
> Spark configuration:
>   .set("spark.default.parallelism", "64")
>   .set("spark.driver.maxResultSize", "2G")
>   .set("spark.driver.memory", "10g")
>   .set("spark.rdd.compress", "true")
>   .set("spark.shuffle.spill.compress", "true")
>   .set("spark.shuffle.compress", "true")
>   .set("spark.shuffle.consolidateFiles", "true/false")
>   .set("spark.shuffle.spill", "true/false")
> Data file sizes:
>   SHOPPING_TRANSACTION  19.5 GB
>   PARTY_ACCOUNT          1.4 GB
>   SYSTEM_OPTIONS        11.6 KB
> Please help us resolve the issue above.
> Thanks,

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
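[Editor's note] Two tuning knobs are commonly checked for this pattern and are not mentioned in the report: VW_HIVETMP2 feeds both HIVETMP3 and HIVETMP4, so without caching its join/aggregation lineage may be recomputed for each consumer; and in Spark SQL 1.2 the number of post-shuffle partitions for GROUP BY/JOIN is governed by `spark.sql.shuffle.partitions`, not `spark.default.parallelism`. The fragment below is a hedged sketch only, assuming the asker's existing HiveContext `hc` and the temp tables registered above; the partition count of 192 (roughly 4 tasks per core on 48 cores) is an illustrative starting point, not a measured recommendation. It is a cluster-dependent configuration fragment, not standalone runnable code.

```scala
// Sketch: assumes the asker's HiveContext `hc` and the temp tables
// registered in the snippet above (hypothetical context, not verified here).

// VW_HIVETMP2 is consumed by both HIVETMP3 and HIVETMP4; caching it
// avoids recomputing the expensive join + aggregation twice.
hc.cacheTable("VW_HIVETMP2")

// In Spark SQL 1.2, shuffle parallelism for GROUP BY / JOIN comes from
// spark.sql.shuffle.partitions (default 200), not spark.default.parallelism.
// 192 here is an assumed starting value for 48 cores; tune from the UI.
hc.setConf("spark.sql.shuffle.partitions", "192")

// Drop the cache once HIVETMP4 has been written out.
hc.uncacheTable("VW_HIVETMP2")
```

Both `cacheTable`/`uncacheTable` and `setConf` exist on SQLContext (and thus HiveContext) in Spark 1.2; whether caching pays off depends on whether VW_HIVETMP2 fits in the executors' storage memory.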