Errors while creating a new table using existing table schema
Hello, I am trying to create a new table using an existing table's schema (existing table name in Hive: jobs). However, when I do that, it doesn't put the new table (new table name in Hive: jobs_ex2) in the same location as the existing table. When I specify the location explicitly, it errors out. The query that has the problem is pasted below:

create table jobs_ex2 as select year, capitalregion, universe from jobs row format delimited fields terminated by ',' location '/user/hive/warehouse/default.db/jobs_ex2'

The file that is being used to create the table is in the following location: /user/hive/warehouse/default.db/jobs/universe=1/Jobs.csv, where universe=1 is the partition. The new table jobs_ex2 needs to be created inside the default.db folder. Thanks, Vidya
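A plausible cause of the error, assuming the Hive CTAS grammar of that era: in CREATE TABLE ... AS SELECT, the ROW FORMAT and LOCATION clauses must come before AS SELECT, not after the select list. A sketch of the reordered statement (table and column names taken from the message above):

```sql
-- Clause order matters in CTAS: storage clauses precede AS SELECT
CREATE TABLE jobs_ex2
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/default.db/jobs_ex2'
AS SELECT year, capitalregion, universe FROM jobs;
```

If your Hive version still rejects LOCATION inside a CTAS, a fallback is to issue a plain CREATE TABLE with the desired LOCATION first, and then populate it with INSERT OVERWRITE TABLE jobs_ex2 SELECT ... FROM jobs.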
Re: how to control hive log location on 0.13?
Thanks, Satish Mittal, I've added that information to the Error Logs section https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ErrorLogs of the Getting Started wiki. -- Lefty

On Fri, Jul 18, 2014 at 12:19 AM, Satish Mittal satish.mit...@inmobi.com wrote: You can configure the following property in $HIVE_HOME/conf/hive-log4j.properties: hive.log.dir=your location. The default value of this property is ${java.io.tmpdir}/${user.name}. Thanks, Satish

On Thu, Jul 17, 2014 at 11:58 PM, Yang tedd...@gmail.com wrote: We just moved to Hadoop 2.0 (HDP 2.1 distro). It turns out that the new Hive version generates a lot of logs in /tmp/ and is quickly creating the danger of running out of our /tmp/ space. I see these two different logs:

[myuser@mybox ~]$ ls -lt /tmp/myuser/
total 1988
-rw-rw-r-- 1 myuser myuser 191687 2014-07-17 11:17 hive.log
-rw-rw-r-- 1 myuser myuser  14472 2014-07-16 14:43 myuser_20140716143232_d76043ed-1c4b-42a0-bf0a-2816377a6a2a.log
-rw-rw-r-- 1 myuser myuser  14260 2014-07-16 14:04 myuser_20140716135353_de698da0-807f-4e3b-8b97-5af5064b55f2.log
-rw-rw-r-- 1 myuser myuser  14254 2014-07-16 13:42 myuser_20140716133838_208329bd-77bb-4981-a2e9-e747647d0704.log

From the doc at https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ErrorLogs I can see that logs are written on a per-Hive-session basis in /tmp/<user.name>/, and can be configured in hive-site.xml https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration with the hive.querylog.location property. But I tried passing -hiveconf hive.querylog.location=/tmp/mycustomlogdir/ and that doesn't seem to work; the hive.log location is not changed by this approach either. So how can I change the location of both logs with some per-script params? (We can't afford to change the system hive-site.xml or /etc/hive/conf etc.) Thanks a lot, Yang

_ The information contained in this communication is intended solely for the use of the individual or entity to whom it is addressed and others authorized to receive it. It may contain confidential or legally privileged information. If you are not the intended recipient you are hereby notified that any disclosure, copying, distribution or taking any action in reliance on the contents of this information is strictly prohibited and may be unlawful. If you have received this communication in error, please notify us immediately by responding to this email and then delete it from your system. The firm is neither liable for the proper and complete transmission of the information contained in this communication nor for any delay in its receipt.
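On Yang's per-script requirement: hive.log.dir is a log4j property rather than a HiveConf variable, but values passed with -hiveconf are also exported as Java system properties that hive-log4j.properties can pick up. A sketch of a per-invocation override (not verified on 0.13; the directory and script names are examples only):

```shell
# Redirect both hive.log (log4j's hive.log.dir) and the per-session
# query logs (hive.querylog.location) for this invocation only.
# /data/hive-logs and my_script.hql are example names.
hive -hiveconf hive.log.dir=/data/hive-logs \
     -hiveconf hive.querylog.location=/data/hive-logs \
     -f my_script.hql
```

Note that Yang reports -hiveconf hive.querylog.location alone did not take effect for him, so the hive.log.dir half (the property Satish points at) is the part most likely to help here.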
Re: how to control hive log location on 0.13?
Make sure the directory you specify has the sticky bit set, otherwise users will have permission problems:

chmod 1777 <dir>

On 18 July 2014 14:19, Satish Mittal satish.mit...@inmobi.com wrote: You can configure the following property in $HIVE_HOME/conf/hive-log4j.properties: hive.log.dir=your location. The default value of this property is ${java.io.tmpdir}/${user.name}.

--
André Araújo
Big Data Consultant/Solutions Architect
The Pythian Group - Australia - www.pythian.com
ERROR in JDBC
Hello Hive Community, I am trying to run the JDBC client example (from cwiki.apache.org) against HiveServer2. Everything in the Java code (attached) runs well except for the last query:

sql = "select * from " + tableName;

Attached is the complete log file of several runs. I have noticed the following error:

ERROR mr.ExecDriver (MapRedTask.java:execute(304)) - Exception: Cannot run program "/usr/local/bin/hadoop-2.2.0\bin\hadoop.cmd" (in directory "C:\cygwin64\usr\local\bin\hive-0.12.0-bin\bin"): CreateProcess error=2, The system cannot find the file specified

I have tried to navigate to the path in question and manually add the hadoop.cmd script there, but this did not work. So I had C:\cygwin64\usr\local\bin\hive-0.12.0-bin\bin\usr\local\bin\hadoop-2.2.0\bin\hadoop.cmd as a path, but it still failed to find it. I have tried to debug the Apache Java code to figure out the problem. I noticed that TStatusCode.findByValue returns ERROR_STATUS, which was only the case for the last query mentioned earlier. This flag is eventually checked and results in throwing the exception below:

Exception in thread "main" java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
    at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:165)
    at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:153)
    at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:198)
    at org.apache.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:300)
    at murex.pop.hadoop.connector.HiveJdbcClient.main(HiveJdbcClient.java:67)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)

My questions are: 1. Why isn't my HiveServer2 finding the hadoop.cmd script? In general, everything during the installation was left at defaults and no extra configuration was done. 2. Why is the exception thrown specifically for this query? Do you have any ideas about what the glitch might be? Any help would be much appreciated. Abdallah Chebaro

*** This e-mail contains information for the intended recipient only. It may contain proprietary material or confidential information. If you are not the intended recipient you are not authorised to distribute, copy or use this e-mail or any attachment to it. Murex cannot guarantee that it is virus free and accepts no responsibility for any loss or damage arising from its use. If you have received this e-mail in error please notify immediately the sender and delete the original email received, any attachments and all copies from your system.

Attachments: HiveJdbcClient.java, hive.log
Hive huge 'startup time'
This is probably a simple question, but I'm noticing that for queries that run on 1+ TB of data, it can take Hive up to 30 minutes to actually start the first map-reduce stage. What is it doing? I imagine it's gathering information about the data somehow, since this 'startup' time is clearly a function of the amount of data I'm trying to process. Cheers,
Re: Hive huge 'startup time'
Maybe you can post your partition structure and the query. Over-partitioning data is one of the reasons this happens. On Fri, Jul 18, 2014 at 2:36 PM, diogo di...@uken.com wrote: This is probably a simple question, but I'm noticing that for queries that run on 1+ TB of data, it can take Hive up to 30 minutes to actually start the first map-reduce stage.
Re: Hive huge 'startup time'
The planning phase needs to do work for every Hive partition and every Hadoop file. If you have a lot of 'small' files or many partitions this can take a long time. Also, the planning phase that happens on the job tracker side is single threaded. Also, the new YARN stuff requires back and forth to allocate containers. Sometimes raising the heap for the hive-cli/launching process helps, because the default heap of 1 GB may not be a lot of space to deal with all of the partition information, and more memory headroom will make this go faster. Sometimes setting the min split size higher launches fewer map tasks, which speeds everything up. So the answer... try to tune everything. Start hive like this:

bin/hive -hiveconf hive.root.logger=DEBUG,console

and record where the longest gaps with no output are; that is what you should try to tune first. On Fri, Jul 18, 2014 at 9:36 AM, diogo di...@uken.com wrote: This is probably a simple question, but I'm noticing that for queries that run on 1+ TB of data, it can take Hive up to 30 minutes to actually start the first map-reduce stage.
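The heap and split-size suggestions above can be sketched as a single invocation (the values and the script name my_query.hql are illustrative assumptions, not recommendations from this thread):

```shell
# Give the Hive CLI / launching JVM more heap than the 1 GB default
# (2048 MB is an illustrative value).
export HADOOP_HEAPSIZE=2048

# Raise the minimum split size (bytes) so fewer map tasks are launched,
# and run with DEBUG console logging to spot the long silent gaps.
hive -hiveconf mapreduce.input.fileinputformat.split.minsize=268435456 \
     -hiveconf hive.root.logger=DEBUG,console \
     -f my_query.hql
```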
Hive Join Running Out of Memory
Hello everyone. I need some assistance. I have a join that fails with return code 3. The query is:

SELECT B.CARD_NBR AS CNT FROM TENDER_TABLE A JOIN LOYALTY_CARDS B ON A.CARD_NBR = B.CARD_NBR LIMIT 10;
-- Row Counts
-- LOYALTY_CARDS = 43,876,938
-- TENDER_TABLE = 1,412,228,333

The query execution output starts with:

2014-07-18 10:30:17 Starting to launch local task to process map join; maximum memory = 1065484288

The last output is as follows:

2014-07-18 10:30:44 Processing rows: 380  Hashtable size: 379  Memory usage: 969531248  percentage: 0.91

I ran SET mapred.child.java.opts=-Xmx4G; before the query, but that did not change the maximum memory. What am I not understanding, and how should I troubleshoot this issue?

hive> SELECT B.CARD_NBR AS CNT FROM TENDER_TABLE A JOIN LOYALTY_CARDS B ON A.CARD_NBR = B.CARD_NBR LIMIT 10;
Query ID = root_20140718103030_df1e7af9-7d66-4ba5-8d73-2d0bf58bb474
Total jobs = 1
14/07/18 10:30:17 WARN conf.Configuration: file:/tmp/root/hive_2014-07-18_10-30-15_081_1503496466695602651-1/-local-10006/jobconf.xml: an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/07/18 10:30:17 WARN conf.Configuration: file:/tmp/root/hive_2014-07-18_10-30-15_081_1503496466695602651-1/-local-10006/jobconf.xml: an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
Execution log at: /tmp/root/root_20140718103030_df1e7af9-7d66-4ba5-8d73-2d0bf58bb474.log
2014-07-18 10:30:17 Starting to launch local task to process map join; maximum memory = 1065484288
2014-07-18 10:30:20 Processing rows: 20   Hashtable size: 19   Memory usage: 53829960   percentage: 0.051
2014-07-18 10:30:21 Processing rows: 30   Hashtable size: 29   Memory usage: 76926312   percentage: 0.072
2014-07-18 10:30:22 Processing rows: 40   Hashtable size: 39   Memory usage: 105119456  percentage: 0.099
2014-07-18 10:30:23 Processing rows: 50   Hashtable size: 49   Memory usage: 129079592  percentage: 0.121
2014-07-18 10:30:24 Processing rows: 60   Hashtable size: 59   Memory usage: 151469744  percentage: 0.142
2014-07-18 10:30:24 Processing rows: 70   Hashtable size: 69   Memory usage: 174968512  percentage: 0.164
2014-07-18 10:30:25 Processing rows: 80   Hashtable size: 79   Memory usage: 207735176  percentage: 0.195
2014-07-18 10:30:25 Processing rows: 90   Hashtable size: 89   Memory usage: 232306976  percentage: 0.218
2014-07-18 10:30:26 Processing rows: 100  Hashtable size: 99   Memory usage: 255813784  percentage: 0.24
2014-07-18 10:30:27 Processing rows: 110  Hashtable size: 109  Memory usage: 280781448  percentage: 0.264
2014-07-18 10:30:27 Processing rows: 120  Hashtable size: 119  Memory usage: 305606024  percentage: 0.287
2014-07-18 10:30:28 Processing rows: 130  Hashtable size: 129  Memory usage: 323502504  percentage: 0.304
2014-07-18 10:30:28 Processing rows: 140  Hashtable size: 139  Memory usage: 347450792  percentage: 0.326
2014-07-18 10:30:29 Processing rows: 150  Hashtable size: 149  Memory usage: 372281800  percentage: 0.349
2014-07-18 10:30:30 Processing rows: 160  Hashtable size: 159  Memory usage: 413191040  percentage: 0.388
2014-07-18 10:30:30 Processing rows: 170  Hashtable size: 169  Memory
Re: Hive Join Running Out of Memory
This is a failed optimization: Hive is trying to build the lookup table locally, put it in the distributed cache, and then do a map join. Look through your hive-site for the configuration to turn these auto map joins off. Depending on your version the variable names changed / were deprecated etc., so I can not tell you the exact ones. On Fri, Jul 18, 2014 at 10:35 AM, Clay McDonald stuart.mcdon...@bateswhite.com wrote: Hello everyone. I need some assistance. I have a join that fails with return code 3. The query is: SELECT B.CARD_NBR AS CNT FROM TENDER_TABLE A JOIN LOYALTY_CARDS B ON A.CARD_NBR = B.CARD_NBR LIMIT 10;
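For Hive in roughly the 0.11-0.13 range, the auto map-join settings Edward refers to are usually the following (a sketch only; as he notes, exact names vary by version, and the threshold value shown is the common default rather than a recommendation):

```sql
-- Stop Hive from automatically converting joins to local-task map joins
SET hive.auto.convert.join=false;
SET hive.auto.convert.join.noconditionaltask=false;
-- If auto map joins are kept on, this is the small-table size
-- threshold (bytes) that triggers the conversion:
SET hive.mapjoin.smalltable.filesize=25000000;
```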
Re: Hive huge 'startup time'
Sweet, great answers, thanks. Indeed, I have a small number of partitions but lots of small files, ~20 MB each. I'll make sure to combine them. Also, increasing the heap size of the CLI process already helped speed it up. Thanks again. On Fri, Jul 18, 2014 at 10:26 AM, Edward Capriolo edlinuxg...@gmail.com wrote: The planning phase needs to do work for every Hive partition and every Hadoop file. If you have a lot of 'small' files or many partitions this can take a long time.
Re: Hive huge 'startup time'
Unleash ze file crusha! https://github.com/edwardcapriolo/filecrush On Fri, Jul 18, 2014 at 10:51 AM, diogo di...@uken.com wrote: Sweet, great answers, thanks. Indeed, I have a small number of partitions but lots of small files, ~20 MB each. I'll make sure to combine them.
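Besides an external tool like filecrush, Hive itself can be asked to merge small output files whenever the data is rewritten. A sketch, assuming the MR-era merge settings (the size and the table name my_table are illustrative):

```sql
-- Merge small files at the end of map-only and map-reduce jobs
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- Target size (bytes) for merged files; 256 MB here is illustrative
SET hive.merge.size.per.task=268435456;
-- Then rewrite the table so the merge step kicks in, e.g.:
-- INSERT OVERWRITE TABLE my_table SELECT * FROM my_table;
```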
RE: Hive Join Running Out of Memory
Thank you. Would it be acceptable to use the following? SET hive.exec.mode.local.auto=false; From: Edward Capriolo [mailto:edlinuxg...@gmail.com] Sent: Friday, July 18, 2014 10:45 AM To: user@hive.apache.org Subject: Re: Hive Join Running Out of Memory This is a failed optimization: Hive is trying to build the lookup table locally, put it in the distributed cache, and then do a map join. Look through your hive-site for the configuration to turn these auto map joins off. On Fri, Jul 18, 2014 at 10:35 AM, Clay McDonald stuart.mcdon...@bateswhite.com wrote: Hello everyone. I need some assistance. I have a join that fails with return code 3.
Re: Hive Join Running Out of Memory
I believe that would be the one. On Fri, Jul 18, 2014 at 10:54 AM, Clay McDonald stuart.mcdon...@bateswhite.com wrote: Thank you. Would it be acceptable to use the following? SET hive.exec.mode.local.auto=false;
RE: Hive Join Running Out of Memory
I changed hive.auto.convert.join.noconditionaltask = false in the hive-site and that seemed to do the trick. Thanks!

From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
Sent: Friday, July 18, 2014 10:57 AM
To: user@hive.apache.org
Subject: Re: Hive Join Running Out of Memory

I believe that would be the one.

On Fri, Jul 18, 2014 at 10:54 AM, Clay McDonald stuart.mcdon...@bateswhite.com wrote: Thank you. Would it be acceptable to use the following? SET hive.exec.mode.local.auto=false;

From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
Sent: Friday, July 18, 2014 10:45 AM
To: user@hive.apache.org
Subject: Re: Hive Join Running Out of Memory

This is a failed optimization: Hive is trying to build the lookup table locally, put it in the distributed cache, and then do a map join. Look through your hive-site for the configuration to turn these auto map joins off. Depending on your version, the variable names changed or were deprecated, so I can't tell you the exact ones.

On Fri, Jul 18, 2014 at 10:35 AM, Clay McDonald stuart.mcdon...@bateswhite.com wrote: Hello everyone. I need some assistance. I have a join that fails with return code 3. The query is:

SELECT B.CARD_NBR AS CNT FROM TENDER_TABLE A JOIN LOYALTY_CARDS B ON A.CARD_NBR = B.CARD_NBR LIMIT 10;

-- Row Counts
-- LOYALTY_CARDS = 43,876,938
-- TENDER_TABLE = 1,412,228,333

The query execution output starts with: 2014-07-18 10:30:17 Starting to launch local task to process map join; maximum memory = 1065484288. The last output is: 2014-07-18 10:30:44 Processing rows: 380 Hashtable size: 379 Memory usage: 969531248 percentage: 0.91. I ran SET mapred.child.java.opts=-Xmx4G; before the query but that did not change the maximum memory. What am I not understanding, and how should I troubleshoot this issue?
hive> SELECT B.CARD_NBR AS CNT FROM TENDER_TABLE A JOIN LOYALTY_CARDS B ON A.CARD_NBR = B.CARD_NBR LIMIT 10;
Query ID = root_20140718103030_df1e7af9-7d66-4ba5-8d73-2d0bf58bb474
Total jobs = 1
14/07/18 10:30:17 WARN conf.Configuration: file:/tmp/root/hive_2014-07-18_10-30-15_081_1503496466695602651-1/-local-10006/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/07/18 10:30:17 WARN conf.Configuration: file:/tmp/root/hive_2014-07-18_10-30-15_081_1503496466695602651-1/-local-10006/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
Execution log at: /tmp/root/root_20140718103030_df1e7af9-7d66-4ba5-8d73-2d0bf58bb474.log
2014-07-18 10:30:17 Starting to launch local task to process map join; maximum memory = 1065484288
2014-07-18 10:30:20 Processing rows: 20 Hashtable size: 19 Memory usage: 53829960 percentage: 0.051
2014-07-18 10:30:21 Processing rows: 30 Hashtable size: 29 Memory usage: 76926312 percentage: 0.072
2014-07-18 10:30:22 Processing rows: 40 Hashtable size: 39 Memory usage: 105119456 percentage: 0.099
2014-07-18 10:30:23 Processing rows: 50 Hashtable size: 49 Memory usage: 129079592 percentage: 0.121
2014-07-18 10:30:24 Processing rows: 60 Hashtable size: 59 Memory usage: 151469744 percentage: 0.142
2014-07-18 10:30:24 Processing rows: 70 Hashtable size: 69 Memory usage: 174968512 percentage: 0.164
2014-07-18 10:30:25 Processing rows: 80 Hashtable size: 79 Memory usage: 207735176 percentage: 0.195
2014-07-18 10:30:25 Processing rows: 90 Hashtable size:
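Edward's advice above boils down to disabling Hive's automatic conversion of the join into a locally-built map join. A sketch of the session settings involved (property availability varies by Hive version, so verify against your own configuration; Clay reports the second one is what fixed it):

```sql
-- Sketch: disable automatic map-join conversion for this session.
-- Property names vary across Hive versions; inspect `SET -v` output on yours.
SET hive.auto.convert.join = false;
SET hive.auto.convert.join.noconditionaltask = false;  -- the setting that worked here

-- Re-run the join; it should now execute as a regular shuffle join instead of
-- building the hash table in the local task that was running out of memory.
SELECT B.CARD_NBR AS CNT
FROM TENDER_TABLE A
JOIN LOYALTY_CARDS B ON A.CARD_NBR = B.CARD_NBR
LIMIT 10;
```

Note that SET mapred.child.java.opts only affects task JVMs on the cluster, which is consistent with Clay's observation that it did not change the local task's maximum memory.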
Hive support for filtering Unicode data
Hello Hive, I posted this question on Stack Overflow (http://stackoverflow.com/questions/24817308/hive-support-for-filtering-unicode-data?noredirect=1#comment38534961_24817308) but decided to ask it here as well: I have a Hive table with Unicode data. A simple SELECT * FROM table returns the correct data in the correct Unicode encoding. However, when I add a filtering criterion such as ... WHERE column = 'some unicode value', the query returns nothing. Is this a Hive limitation, or is there any way to make Unicode filtering work with Hive? Thank you! -- Duc Le
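One plausible cause, offered here as an assumption rather than a confirmed diagnosis: the bytes of the literal in the WHERE clause differ from the bytes stored in the table (for example, a terminal encoding mismatch, or NFC vs. NFD normalization). A quick way to check is to compare raw bytes; the `数据` value below is a hypothetical stand-in for your actual filter value:

```shell
# Hedged sketch: compare the raw bytes of the filter literal you type
# against the bytes actually stored in the table's data file.
literal='数据'                         # hypothetical Unicode filter value
printf '%s' "$literal" | od -An -tx1   # hex dump of the literal's bytes

# Then compare against the bytes in the table's backing file, e.g.:
#   hadoop fs -cat /path/to/table/file | od -An -tx1 | less
# If the two byte sequences differ, the encoding mismatch (not Hive itself)
# would explain the empty result.
```

If the bytes match and filtering still returns nothing, that would point back at Hive's string comparison for your version.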
Re: how to control hive log location on 0.13?
thanks guys. Does anybody know what generates a log like myuser_20140716143232_d76043ed-1c4b-42a0-bf0a-2816377a6a2a.log? I checked our application code; it doesn't generate this, so it looks like it comes from Hive.

On Fri, Jul 18, 2014 at 12:28 AM, Andre Araujo ara...@pythian.com wrote: Make sure the directory you specify has the sticky bit set, otherwise users will have permission problems: chmod 1777 <dir>

On 18 July 2014 14:19, Satish Mittal satish.mit...@inmobi.com wrote: You can configure the following property in $HIVE_HOME/conf/hive-log4j.properties: hive.log.dir=<your location>. The default value of this property is ${java.io.tmpdir}/${user.name}. Thanks, Satish

On Thu, Jul 17, 2014 at 11:58 PM, Yang tedd...@gmail.com wrote: we just moved to hadoop2.0 (HDP2.1 distro). It turns out that the new hive version generates a lot of logs into /tmp/ and is quickly creating the danger of running out of our /tmp/ space. I see these 2 different logs:

[myuser@mybox ~]$ ls -lt /tmp/myuser/
total 1988
-rw-rw-r-- 1 myuser myuser 191687 2014-07-17 11:17 hive.log
-rw-rw-r-- 1 myuser myuser  14472 2014-07-16 14:43 myuser_20140716143232_d76043ed-1c4b-42a0-bf0a-2816377a6a2a.log
-rw-rw-r-- 1 myuser myuser  14260 2014-07-16 14:04 myuser_20140716135353_de698da0-807f-4e3b-8b97-5af5064b55f2.log
-rw-rw-r-- 1 myuser myuser  14254 2014-07-16 13:42 myuser_20140716133838_208329bd-77bb-4981-a2e9-e747647d0704.log

From the doc at https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ErrorLogs I can see that logs are stored on a per-Hive-session basis in /tmp/<user.name>/, but can be configured in hive-site.xml (https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration) with the hive.querylog.location property. I tried to pass it as -hiveconf hive.querylog.location=/tmp/mycustomlogdir/, but that doesn't seem to work; the hive.log location is not changed by this approach either. So how can I change the location of both of these logs by some per-script params? (i.e.
we can't afford to change the system hive-site.xml or /etc/hive/conf etc) Thanks a lot Yang _ The information contained in this communication is intended solely for the use of the individual or entity to whom it is addressed and others authorized to receive it. It may contain confidential or legally privileged information. If you are not the intended recipient you are hereby notified that any disclosure, copying, distribution or taking any action in reliance on the contents of this information is strictly prohibited and may be unlawful. If you have received this communication in error, please notify us immediately by responding to this email and then delete it from your system. The firm is neither liable for the proper and complete transmission of the information contained in this communication nor for any delay in its receipt. -- André Araújo Big Data Consultant/Solutions Architect The Pythian Group - Australia - www.pythian.com Office (calls from within Australia): 1300 366 021 x1270 Office (international): +61 2 8016 7000 x270 *OR* +1 613 565 8696 x1270 Mobile: +61 410 323 559 Fax: +61 2 9805 0544 IM: pythianaraujo @ AIM/MSN/Y! or ara...@pythian.com @ GTalk “Success is not about standing at the top, it's the steps you leave behind.” — Iker Pou (rock climber) --
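Andre's sticky-bit advice can be sketched as follows; the `mktemp -d` directory stands in for whatever shared log directory you actually configure:

```shell
# Sketch of the advice above: a shared Hive log directory needs the sticky bit
# (mode 1777) so every user can write their own logs, but no user can delete
# or rename another user's files.
LOGDIR=$(mktemp -d)        # stand-in for your real shared log directory
chmod 1777 "$LOGDIR"
[ -k "$LOGDIR" ] && echo "sticky bit is set on $LOGDIR"
```

The `-k` test is the portable way to confirm the sticky bit without parsing `ls` output.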
Re: how to control hive log location on 0.13?
Thanks André, I've added the sticky bit advice to Error Logs https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ErrorLogs . -- Lefty On Fri, Jul 18, 2014 at 2:38 PM, Yang tedd...@gmail.com wrote: thanks guys. anybody knows what generates the log like myuser_20140716143232_d76043ed-1c4b-42a0-bf0a-2816377a6a2a.log ? I checked our application code, it doesn't generate this, looks from hive. On Fri, Jul 18, 2014 at 12:28 AM, Andre Araujo ara...@pythian.com wrote: Make sure the directory you specify has the sticky bit set, otherwise users will have permission problems: chmod 1777 dir On 18 July 2014 14:19, Satish Mittal satish.mit...@inmobi.com wrote: You can configure the following property in $HIVE_HOME/conf/hive-log4j.properties: hive.log.dir=your location The default value of this property is ${java.io.tmpdir}/${user.name}. Thanks, Satish On Thu, Jul 17, 2014 at 11:58 PM, Yang tedd...@gmail.com wrote: we just moved to hadoop2.0 (HDP2.1 distro). it turns out that the new hive version generates a lot of logs into /tmp/ and is quickly creating the danger of running out of our /tmp/ space. 
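Pulling the thread's advice together: the session log (hive.log) location is controlled by log4j, not by hive-site.xml. A sketch of the relevant lines in $HIVE_HOME/conf/hive-log4j.properties, using a hypothetical /data/hivelogs directory:

```
# Sketch only: /data/hivelogs is a hypothetical target directory;
# verify the property names against your Hive version's hive-log4j.properties.
hive.log.dir=/data/hivelogs
hive.log.file=hive.log
```

For a per-script override without editing the system config, passing the same property on the command line (hive -hiveconf hive.log.dir=/data/hivelogs) may work on 0.13, but the thread suggests behavior varies by version, so treat that as an assumption to test rather than a guarantee.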
Re: how to control hive log location on 0.13?
Can you give us an excerpt of the contents of this log?

On 19 July 2014 04:38, Yang tedd...@gmail.com wrote: thanks guys. anybody knows what generates the log like myuser_20140716143232_d76043ed-1c4b-42a0-bf0a-2816377a6a2a.log ? I checked our application code, it doesn't generate this, looks from hive.
Re: how to control hive log location on 0.13?
and where is it located?

On 19 July 2014 10:58, Andre Araujo ara...@pythian.com wrote: Can you give us an excerpt of the contents of this log?
Re: Hive huge 'startup time'
Hello everyone, Thanks for sharing valuable inputs. I am working on a similar kind of task; it would be really helpful if you could share the command for increasing the heap size of the hive-cli/launching process. Thanks, Saurabh

Sent from my iPhone, please avoid typos.

On 18-Jul-2014, at 8:23 pm, Edward Capriolo edlinuxg...@gmail.com wrote: Unleash ze file crusha! https://github.com/edwardcapriolo/filecrush

On Fri, Jul 18, 2014 at 10:51 AM, diogo di...@uken.com wrote: Sweet, great answers, thanks. Indeed, I have a small number of partitions, but lots of small files, ~20MB each. I'll make sure to combine them. Also, increasing the heap size of the cli process already helped speed it up. Thanks, again.

On Fri, Jul 18, 2014 at 10:26 AM, Edward Capriolo edlinuxg...@gmail.com wrote: The planning phase needs to do work for every Hive partition and every Hadoop file. If you have a lot of 'small' files or many partitions this can take a long time; also, the planning phase that happens on the job tracker is single-threaded, and the new YARN stuff requires back-and-forth to allocate containers. Sometimes raising the heap for the hive-cli/launching process helps, because the default heap of 1 GB may not be a lot of space to deal with all of the partition information plus memory overhead. Sometimes setting the min split size higher launches fewer map tasks, which speeds up everything. So the answer... try to tune everything. Start hive like this:

bin/hive -hiveconf hive.root.logger=DEBUG,console

and record where the longest stretches with no output are; that is what you should try to tune first.

On Fri, Jul 18, 2014 at 9:36 AM, diogo di...@uken.com wrote: This is probably a simple question, but I'm noticing that for queries that run on 1+TB of data, it can take Hive up to 30 minutes to actually start the first map-reduce stage. What is it doing?
I imagine it's gathering information about the data somehow; this 'startup' time is clearly a function of the amount of data I'm trying to process. Cheers,
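To Saurabh's question about the exact command: the thread doesn't name one, but since the Hive CLI launcher is a Java client process, on many distributions its heap is taken from the Hadoop client environment rather than from job-level settings. The variable names below are the common Hadoop ones and are an assumption here; verify against your distribution's hive/hadoop-env scripts:

```shell
# Hedged sketch (not confirmed by the thread): raise the heap of the
# hive-cli/launching JVM via the Hadoop client environment.
export HADOOP_HEAPSIZE=2048            # launcher heap in MB (assumption)
export HADOOP_CLIENT_OPTS="-Xmx2g"     # alternative JVM-flag route (assumption)
echo "HADOOP_HEAPSIZE=$HADOOP_HEAPSIZE"
# then launch as usual, e.g.:  bin/hive -f my_script.hql   (not executed here)
```

Whichever variable your distribution honors, the effect is the same as Edward describes: more room for the planner to hold partition and file metadata before the first MR stage launches.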
Re: how to control hive log location on 0.13?
2014-07-18 15:03:37,774 INFO mr.ExecDriver (SessionState.java:printInfo(537)) - Execution log at: /tmp/myuser/myuser_20140718150303_56bf6bb0-db30-4dbc-807c-9023ce4103f4.log
2014-07-18 15:03:37,864 WARN conf.Configuration (Configuration.java:loadProperty(2358)) - file:/tmp/myuser/hive_2014-07-18_15-03-30_423_6799963466906099923-1/-local-10011/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2014-07-18 15:03:37,871 WARN conf.Configuration (Configuration.java:loadProperty(2358)) - file:/tmp/myuser/hive_2014-07-18_15-03-30_423_6799963466906099923-1/-local-10011/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2014-07-18 15:03:37,951 INFO log.PerfLogger (PerfLogger.java:PerfLogBegin(108)) - PERFLOG method=deserializePlan from=org.apache.hadoop.hive.ql.exec.Utilities
2014-07-18 15:03:37,951 INFO exec.Utilities (Utilities.java:deserializePlan(822)) - Deserializing MapredLocalWork via kryo
2014-07-18 15:03:38,237 INFO log.PerfLogger (PerfLogger.java:PerfLogEnd(135)) - /PERFLOG method=deserializePlan start=1405721017951 end=1405721018237 duration=286 from=org.apache.hadoop.hive.ql.exec.Utilities
2014-07-18 15:03:38,246 INFO mr.MapredLocalTask (SessionState.java:printInfo(537)) - 2014-07-18 03:03:38 Starting to launch local task to process map join; maximum memory = 4261937152
2014-07-18 15:03:38,261 INFO mr.MapredLocalTask (MapredLocalTask.java:initializeOperators(406)) - fetchoperator for null-subquery2:a-subquery2:dpkg_cntr:dpkg_wtransaction_p2_id_user_30m created
2014-07-18 15:03:38,263 INFO mr.MapredLocalTask (MapredLocalTask.java:initializeOperators(406)) - fetchoperator for null-subquery2:a-subquery2:dpkg:dpkg_wtransaction_p2_id_user_30m created
2014-07-18 15:03:38,264 INFO mr.MapredLocalTask (MapredLocalTask.java:initializeOperators(406)) - fetchoperator for null-subquery2:a-subquery2:xclick:b:wtrans_data_map_p2_30m created
2014-07-18 15:03:38,266 INFO mr.MapredLocalTask (MapredLocalTask.java:initializeOperators(406)) - fetchoperator for null-subquery1:a-subquery1:dpkg_cntr:dpkg_wtransaction_id_user_30m created
2014-07-18 15:03:38,268 INFO mr.MapredLocalTask (MapredLocalTask.java:initializeOperators(406)) - fetchoperator for null-subquery1:a-subquery1:dpkg:dpkg_wtransaction_id_user_30m created
2014-07-18 15:03:38,269 INFO mr.MapredLocalTask (MapredLocalTask.java:initializeOperators(406)) - fetchoperator for null-subquery1:a-subquery1:xclick:b:wtrans_data_map_30m created
- whole bunch of stuff omitted here --
2014-07-18 15:04:08,678 INFO exec.HashTableSinkOperator (HashTableSinkOperator.java:flushToFile(278)) - Temp URI for side table: file:/tmp/myuser/hive_2014-07-18_15-03-30_423_6799963466906099923-1/-local-10008/HashTable-Stage-2
2014-07-18 15:04:08,678 INFO exec.HashTableSinkOperator (SessionState.java:printInfo(537)) - 2014-07-18 03:04:08 Dump the side-table into file: file:/tmp/myuser/hive_2014-07-18_15-03-30_423_6799963466906099923-1/-local-10008/HashTable-Stage-2/MapJoin-mapfile11--.hashtable
2014-07-18 15:04:09,943 INFO exec.HashTableSinkOperator (SessionState.java:printInfo(537)) - 2014-07-18 03:04:09 Uploaded 1 File to: file:/tmp/myuser/hive_2014-07-18_15-03-30_423_6799963466906099923-1/-local-10008/HashTable-Stage-2/MapJoin-mapfile11--.hashtable (58010217 bytes)
2014-07-18 15:04:09,943 INFO exec.HashTableSinkOperator (Operator.java:close(591)) - 6 Close done
2014-07-18 15:04:09,943 INFO exec.SelectOperator (Operator.java:close(591)) - 5 Close done
2014-07-18 15:04:09,943 INFO exec.TableScanOperator (Operator.java:close(591)) - 4 Close done
2014-07-18 15:04:09,951 INFO mapred.FileInputFormat (FileInputFormat.java:listStatus(247)) - Total input paths to process : 1
2014-07-18 15:04:10,008 INFO mapred.FileInputFormat (FileInputFormat.java:listStatus(247)) - Total input paths to process : 1
2014-07-18 15:04:11,526 INFO exec.HashTableSinkOperator (SessionState.java:printInfo(537)) - 2014-07-18 03:04:11 Processing rows: 20 Hashtable size: 19 Memory usage: 190041576 percentage: 0.045
2014-07-18 15:04:11,950 INFO exec.HashTableSinkOperator (SessionState.java:printInfo(537)) - 2014-07-18 03:04:11 Processing rows: 30 Hashtable size: 29 Memory usage: 250890416 percentage: 0.059
2014-07-18 15:04:12,456 INFO exec.HashTableSinkOperator (SessionState.java:printInfo(537)) - 2014-07-18 03:04:12 Processing rows: 40 Hashtable size: 39 Memory usage: 304697120 percentage: 0.071
2014-07-18 15:04:12,744 INFO exec.TableScanOperator (Operator.java:close(574)) - 11 finished. closing...
2014-07-18 15:04:12,745 INFO exec.FilterOperator (Operator.java:close(574)) - 12 finished. closing...
2014-07-18 15:04:12,745 INFO
Re: how to control hive log location on 0.13?
it's in /tmp/my_user/. The funny thing is that I already have a hive.log there.

On Fri, Jul 18, 2014 at 6:01 PM, Andre Araujo ara...@pythian.com wrote: and where is it located?