Errors while creating a new table using existing table schema

2014-07-18 Thread Vidya Sujeet
Hello,

I am trying to create a new table using an existing table's schema
(existing table name in hive: jobs). However, when I do that it doesn't put
the new table (new table name in hive: jobs_ex2) in the same location as
the existing table. When I specify the location explicitly, it errors out.


The query that has the problem is pasted below:

create table jobs_ex2
as select year, capitalregion, universe from jobs
row format delimited fields terminated by ','
location '/user/hive/warehouse/default.db/jobs_ex2'

The file that is being used to create the table is in the following location:
/user/hive/warehouse/default.db/jobs/universe=1/Jobs.csv, where
universe=1 is the partition. The new table jobs_ex2 needs to be created
inside the default.db folder.

thanks,
Vidya
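A hedged rewrite of the failing statement, for reference (table and column names are taken from the mail above): in CTAS, the ROW FORMAT and LOCATION clauses go before AS SELECT, not after it. Note that some Hive releases reject LOCATION together with CTAS entirely; in that case, creating the table first and then running INSERT ... SELECT from jobs is the fallback.

```shell
# Write the candidate HiveQL to a file so it can be run with `hive -f`.
# Clause order is the point here: ROW FORMAT and LOCATION precede AS SELECT.
cat > create_jobs_ex2.hql <<'EOF'
CREATE TABLE jobs_ex2
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/default.db/jobs_ex2'
AS SELECT year, capitalregion, universe FROM jobs;
EOF
# Show the clause order (LOCATION on line 3, AS SELECT on line 4):
grep -n -E 'LOCATION|AS SELECT' create_jobs_ex2.hql
```

Run with: hive -f create_jobs_ex2.hql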


Re: how to control hive log location on 0.13?

2014-07-18 Thread Lefty Leverenz
Thanks, Satish Mittal, I've added that information to the Error Logs section
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ErrorLogs
of the Getting Started wiki.

-- Lefty


On Fri, Jul 18, 2014 at 12:19 AM, Satish Mittal satish.mit...@inmobi.com
wrote:

 You can configure the following property in
 $HIVE_HOME/conf/hive-log4j.properties:

 hive.log.dir=your location

 The default value of this property is ${java.io.tmpdir}/${user.name}.

 Thanks,
 Satish
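A minimal sketch of applying this without editing the system config (all paths below are examples, not recommendations): keep a private copy of hive-log4j.properties and point Hive at it with HIVE_CONF_DIR, which bin/hive should place on the classpath ahead of the defaults.

```shell
# Keep a private conf dir with an edited hive-log4j.properties; the paths
# here are placeholders.
CONF=./myhiveconf
mkdir -p "$CONF"
# Stand-in for: cp "$HIVE_HOME/conf/hive-log4j.properties" "$CONF/"
printf 'hive.log.dir=${java.io.tmpdir}/${user.name}\n' > "$CONF/hive-log4j.properties"
# Rewrite the property to point at a directory with space:
sed -i 's|^hive.log.dir=.*|hive.log.dir=/data/hive-logs|' "$CONF/hive-log4j.properties"
grep '^hive.log.dir=' "$CONF/hive-log4j.properties"
# Then run: HIVE_CONF_DIR="$CONF" hive ...
```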


 On Thu, Jul 17, 2014 at 11:58 PM, Yang tedd...@gmail.com wrote:

 We just moved to Hadoop 2.0 (the HDP 2.1 distro). It turns out that the new
 hive version generates a lot of logs into /tmp/, quickly creating the
 danger of running out of our /tmp/ space.


 I see these 2 different logs :

 [myuser@mybox ~]$  ls -lt /tmp/myuser/
 total 1988
 -rw-rw-r-- 1 myuser myuser  191687 2014-07-17 11:17 hive.log
 -rw-rw-r-- 1 myuser myuser   14472 2014-07-16 14:43
 myuser_20140716143232_d76043ed-1c4b-42a0-bf0a-2816377a6a2a.log
 -rw-rw-r-- 1 myuser myuser   14260 2014-07-16 14:04
 myuser_20140716135353_de698da0-807f-4e3b-8b97-5af5064b55f2.log
 -rw-rw-r-- 1 myuser myuser   14254 2014-07-16 13:42
 myuser_20140716133838_208329bd-77bb-4981-a2e9-e747647d0704.log



 from the doc at
 https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ErrorLogs
 I can see that logs are stored on a per-Hive-session basis in /tmp/<user.name>/,
 but can be configured in hive-site.xml
 (https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration)
 with the hive.querylog.location property.
 I tried passing -hiveconf hive.querylog.location=/tmp/mycustomlogdir/,
 but it doesn't seem to work; the hive.log location is not changed by this
 approach either.

 So how can I change the location of both logs with some per-script
 parameters? (i.e. we can't afford to change the system hive-site.xml or
 /etc/hive/conf, etc.)

 Thanks a lot
  Yang



 _
 The information contained in this communication is intended solely for the
 use of the individual or entity to whom it is addressed and others
 authorized to receive it. It may contain confidential or legally privileged
 information. If you are not the intended recipient you are hereby notified
 that any disclosure, copying, distribution or taking any action in reliance
 on the contents of this information is strictly prohibited and may be
 unlawful. If you have received this communication in error, please notify
 us immediately by responding to this email and then delete it from your
 system. The firm is neither liable for the proper and complete transmission
 of the information contained in this communication nor for any delay in its
 receipt.


Re: how to control hive log location on 0.13?

2014-07-18 Thread Andre Araujo
Make sure the directory you specify has the sticky bit set, otherwise users
will have permission problems:

chmod 1777 dir
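For illustration, a self-contained demonstration of that mode (the directory name is just an example): 1777 lets every user create files in the directory, while only a file's owner, the directory owner, or root can delete them.

```shell
# Create a world-writable directory with the sticky bit set, then verify
# the mode. The trailing "t" in the ls output (drwxrwxrwt) is the sticky bit.
mkdir -p shared-hive-logs
chmod 1777 shared-hive-logs
stat -c '%a' shared-hive-logs
ls -ld shared-hive-logs
```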


On 18 July 2014 14:19, Satish Mittal satish.mit...@inmobi.com wrote:

 You can configure the following property in
 $HIVE_HOME/conf/hive-log4j.properties:

 hive.log.dir=your location

 The default value of this property is ${java.io.tmpdir}/${user.name}.

 Thanks,
 Satish


 On Thu, Jul 17, 2014 at 11:58 PM, Yang tedd...@gmail.com wrote:

 We just moved to Hadoop 2.0 (the HDP 2.1 distro). It turns out that the new
 hive version generates a lot of logs into /tmp/, quickly creating the
 danger of running out of our /tmp/ space.


 I see these 2 different logs :

 [myuser@mybox ~]$  ls -lt /tmp/myuser/
 total 1988
 -rw-rw-r-- 1 myuser myuser  191687 2014-07-17 11:17 hive.log
 -rw-rw-r-- 1 myuser myuser   14472 2014-07-16 14:43
 myuser_20140716143232_d76043ed-1c4b-42a0-bf0a-2816377a6a2a.log
 -rw-rw-r-- 1 myuser myuser   14260 2014-07-16 14:04
 myuser_20140716135353_de698da0-807f-4e3b-8b97-5af5064b55f2.log
 -rw-rw-r-- 1 myuser myuser   14254 2014-07-16 13:42
 myuser_20140716133838_208329bd-77bb-4981-a2e9-e747647d0704.log



 from the doc at
 https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ErrorLogs
 I can see that logs are stored on a per-Hive-session basis in /tmp/<user.name>/,
 but can be configured in hive-site.xml
 (https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration)
 with the hive.querylog.location property.
 I tried passing -hiveconf hive.querylog.location=/tmp/mycustomlogdir/,
 but it doesn't seem to work; the hive.log location is not changed by this
 approach either.

 So how can I change the location of both logs with some per-script
 parameters? (i.e. we can't afford to change the system hive-site.xml or
 /etc/hive/conf, etc.)

 Thanks a lot
  Yang



 _
 The information contained in this communication is intended solely for the
 use of the individual or entity to whom it is addressed and others
 authorized to receive it. It may contain confidential or legally privileged
 information. If you are not the intended recipient you are hereby notified
 that any disclosure, copying, distribution or taking any action in reliance
 on the contents of this information is strictly prohibited and may be
 unlawful. If you have received this communication in error, please notify
 us immediately by responding to this email and then delete it from your
 system. The firm is neither liable for the proper and complete transmission
 of the information contained in this communication nor for any delay in its
 receipt.




-- 
André Araújo
Big Data Consultant/Solutions Architect
The Pythian Group - Australia - www.pythian.com

Office (calls from within Australia): 1300 366 021 x1270
Office (international): +61 2 8016 7000  x270 *OR* +1 613 565 8696   x1270
Mobile: +61 410 323 559
Fax: +61 2 9805 0544
IM: pythianaraujo @ AIM/MSN/Y! or ara...@pythian.com @ GTalk

“Success is not about standing at the top, it's the steps you leave behind.”
— Iker Pou (rock climber)


ERROR in JDBC

2014-07-18 Thread CHEBARO Abdallah
Hello Hive Community,

I am trying to run the JDBC client (from cwiki.apache.org) against HiveServer2.
Everything in the Java code (attached) runs well except for the last
query: sql = "select * from " + tableName;

Attached is the complete log file of several runs. I have noticed the following 
error:
ERROR mr.ExecDriver (MapRedTask.java:execute(304)) - Exception: Cannot run 
program /usr/local/bin/hadoop-2.2.0\bin\hadoop.cmd (in directory 
C:\cygwin64\usr\local\bin\hive-0.12.0-bin\bin): CreateProcess error=2, The 
system cannot find the file specified

I tried to navigate to that path and manually add the hadoop.cmd script
there, but that did not work. So I ended up with
C:\cygwin64\usr\local\bin\hive-0.12.0-bin\bin\usr\local\bin\hadoop-2.2.0\bin\hadoop.cmd
as a path, but it still failed to find it.

I have tried to debug the Apache Java code to figure out the problem. I
noticed that TStatusCode.findByValue returns ERROR_STATUS, which was only
the case for the last query mentioned earlier. This flag is eventually
checked and results in throwing the exception below:

Exception in thread "main" java.sql.SQLException: Error while processing
statement: FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.mr.MapRedTask
at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:165)
at 
org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:153)
at 
org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:198)
at 
org.apache.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:300)
at 
murex.pop.hadoop.connector.HiveJdbcClient.main(HiveJdbcClient.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)

My questions are:

1-  Why isn't my HiveServer2 finding the hadoop.cmd script? Everything
during installation was left at the defaults and no extra configuration was
done.

2-  Why is the exception thrown specifically for this query? Do you have
any ideas about what the glitch might be?

Any help would be much appreciated
Abdallah Chebaro
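One plausible reading of the CreateProcess error, offered as an assumption rather than a diagnosis: Hive was given a Unix-style hadoop location and appended a Windows-style \bin\hadoop.cmd suffix, producing a mixed-separator path that neither cygwin nor Windows can resolve. Reconstructing that concatenation from the values in the log:

```shell
# Values below are copied from the error message; this only reproduces the
# string Hive appears to have built, it does not run anything Hive-related.
HADOOP_LOCATION='/usr/local/bin/hadoop-2.2.0'
printf '%s\\bin\\hadoop.cmd\n' "$HADOOP_LOCATION"
# prints: /usr/local/bin/hadoop-2.2.0\bin\hadoop.cmd  (the path in the error)
```

If that reading is right, the usual fix is to make the hadoop location Hive resolves (e.g. HADOOP_HOME) consistent for one platform rather than mixing cygwin and Windows paths.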
***

This e-mail contains information for the intended recipient only. It may 
contain proprietary material or confidential information. If you are not the 
intended recipient you are not authorised to distribute, copy or use this 
e-mail or any attachment to it. Murex cannot guarantee that it is virus free 
and accepts no responsibility for any loss or damage arising from its use. If 
you have received this e-mail in error please notify immediately the sender and 
delete the original email received, any attachments and all copies from your 
system.


HiveJdbcClient.java
Description: HiveJdbcClient.java


hive.log
Description: hive.log


Hive huge 'startup time'

2014-07-18 Thread diogo
This is probably a simple question, but I'm noticing that for queries that
run on 1+ TB of data, it can take Hive up to 30 minutes to actually start
the first map-reduce stage. What is it doing? I imagine it's gathering
information about the data somehow; this 'startup' time is clearly a
function of the amount of data I'm trying to process.

Cheers,


Re: Hive huge 'startup time'

2014-07-18 Thread Prem Yadav
Maybe you can post your partition structure and the query. Over-partitioning
data is one of the reasons this happens.


On Fri, Jul 18, 2014 at 2:36 PM, diogo di...@uken.com wrote:

 This is probably a simple question, but I'm noticing that for queries that
 run on 1+TB of data, it can take Hive up to 30 minutes to actually start
 the first map-reduce stage. What is it doing? I imagine it's gathering
 information about the data somehow, this 'startup' time is clearly a
 function of the amount of data I'm trying to process.

 Cheers,



Re: Hive huge 'startup time'

2014-07-18 Thread Edward Capriolo
The planning phase needs to do work for every Hive partition and every
Hadoop file. If you have a lot of 'small' files or many partitions, this
can take a long time.
Also, the planning phase that happens on the job tracker is single-threaded.
Also, the new YARN stuff requires back and forth to allocate containers.

Sometimes raising the heap for the hive-cli/launching process helps,
because the default heap of 1 GB may not be a lot of space for all of the
partition information, and the extra memory headroom will make this go faster.
Sometimes setting the min split size higher launches fewer map tasks, which
speeds everything up.

So the answer... try to tune everything. Start hive like this:

bin/hive -hiveconf hive.root.logger=DEBUG,console

and record where the longest gaps with no output are; that is what you
should try to tune first.
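To make "record where the longest spaces with no output are" mechanical, here is a small helper (the HH:MM:SS-prefixed line format is an assumption; adjust the pattern to your console output). It reports the largest gap between consecutive timestamped lines, demonstrated on a tiny fabricated log:

```shell
# find_gap.awk: track the largest time gap between consecutive lines that
# start with an HH:MM:SS timestamp, and report the line that followed it.
cat > find_gap.awk <<'EOF'
/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
    t = substr($0,1,2)*3600 + substr($0,4,2)*60 + substr($0,7,2)
    if (seen && t - prev > best) { best = t - prev; line = $0 }
    prev = t; seen = 1
}
END { print best " seconds before: " line }
EOF
# Fabricated demo log: a 540-second silence before stage-1 launches.
printf '10:00:01 start\n10:00:02 plan\n10:09:02 launch stage-1\n' > demo.log
awk -f find_gap.awk demo.log
# prints: 540 seconds before: 10:09:02 launch stage-1
```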




On Fri, Jul 18, 2014 at 9:36 AM, diogo di...@uken.com wrote:

 This is probably a simple question, but I'm noticing that for queries that
 run on 1+TB of data, it can take Hive up to 30 minutes to actually start
 the first map-reduce stage. What is it doing? I imagine it's gathering
 information about the data somehow, this 'startup' time is clearly a
 function of the amount of data I'm trying to process.

 Cheers,



Hive Join Running Out of Memory

2014-07-18 Thread Clay McDonald
Hello everyone. I need some assistance. I have a join that fails with
"return code 3". The query is:

SELECT B.CARD_NBR AS CNT
FROM TENDER_TABLE A
JOIN  LOYALTY_CARDS B
ON A.CARD_NBR = B.CARD_NBR
LIMIT 10;

-- Row Counts
-- LOYALTY_CARDS =   43,876,938
-- TENDER_TABLE = 1,412,228,333

The query execution output starts with;

2014-07-18 10:30:17 Starting to launch local task to process map join;  
maximum memory = 1065484288

The last output is as follows;

2014-07-18 10:30:44 Processing rows:380 Hashtable size: 379 
Memory usage:   969531248   percentage: 0.91

I ran SET mapred.child.java.opts=-Xmx4G; before the query, but that did not
change the maximum memory. What am I not understanding, and how should I
troubleshoot this issue?


hive SELECT B.CARD_NBR AS CNT
 FROM TENDER_TABLE A
 JOIN  LOYALTY_CARDS B
 ON A.CARD_NBR = B.CARD_NBR
 LIMIT 10;
Query ID = root_20140718103030_df1e7af9-7d66-4ba5-8d73-2d0bf58bb474
Total jobs = 1
14/07/18 10:30:17 WARN conf.Configuration: 
file:/tmp/root/hive_2014-07-18_10-30-15_081_1503496466695602651-1/-local-10006/jobconf.xml:an
 attempt to override final parameter: 
mapreduce.job.end-notification.max.retry.interval;  Ignoring.
14/07/18 10:30:17 WARN conf.Configuration: 
file:/tmp/root/hive_2014-07-18_10-30-15_081_1503496466695602651-1/-local-10006/jobconf.xml:an
 attempt to override final parameter: 
mapreduce.job.end-notification.max.attempts;  Ignoring.
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.reduce.tasks is 
deprecated. Instead, use mapreduce.job.reduces
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.min.split.size is 
deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/07/18 10:30:17 INFO Configuration.deprecation: 
mapred.reduce.tasks.speculative.execution is deprecated. Instead, use 
mapreduce.reduce.speculative
14/07/18 10:30:17 INFO Configuration.deprecation: 
mapred.min.split.size.per.node is deprecated. Instead, use 
mapreduce.input.fileinputformat.split.minsize.per.node
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.input.dir.recursive is 
deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/07/18 10:30:17 INFO Configuration.deprecation: 
mapred.min.split.size.per.rack is deprecated. Instead, use 
mapreduce.input.fileinputformat.split.minsize.per.rack
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.max.split.size is 
deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/07/18 10:30:17 INFO Configuration.deprecation: 
mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use 
mapreduce.job.committer.setup.cleanup.needed
Execution log at: 
/tmp/root/root_20140718103030_df1e7af9-7d66-4ba5-8d73-2d0bf58bb474.log
2014-07-18 10:30:17 Starting to launch local task to process map join;  
maximum memory = 1065484288
2014-07-18 10:30:20 Processing rows:20  Hashtable size: 19  
Memory usage:   53829960percentage: 0.051
2014-07-18 10:30:21 Processing rows:30  Hashtable size: 29  
Memory usage:   76926312percentage: 0.072
2014-07-18 10:30:22 Processing rows:40  Hashtable size: 39  
Memory usage:   105119456   percentage: 0.099
2014-07-18 10:30:23 Processing rows:50  Hashtable size: 49  
Memory usage:   129079592   percentage: 0.121
2014-07-18 10:30:24 Processing rows:60  Hashtable size: 59  
Memory usage:   151469744   percentage: 0.142
2014-07-18 10:30:24 Processing rows:70  Hashtable size: 69  
Memory usage:   174968512   percentage: 0.164
2014-07-18 10:30:25 Processing rows:80  Hashtable size: 79  
Memory usage:   207735176   percentage: 0.195
2014-07-18 10:30:25 Processing rows:90  Hashtable size: 89  
Memory usage:   232306976   percentage: 0.218
2014-07-18 10:30:26 Processing rows:100 Hashtable size: 99  
Memory usage:   255813784   percentage: 0.24
2014-07-18 10:30:27 Processing rows:110 Hashtable size: 109 
Memory usage:   280781448   percentage: 0.264
2014-07-18 10:30:27 Processing rows:120 Hashtable size: 119 
Memory usage:   305606024   percentage: 0.287
2014-07-18 10:30:28 Processing rows:130 Hashtable size: 129 
Memory usage:   323502504   percentage: 0.304
2014-07-18 10:30:28 Processing rows:140 Hashtable size: 139 
Memory usage:   347450792   percentage: 0.326
2014-07-18 10:30:29 Processing rows:150 Hashtable size: 149 
Memory usage:   372281800   percentage: 0.349
2014-07-18 10:30:30 Processing rows:160 Hashtable size: 159 
Memory usage:   413191040   percentage: 0.388
2014-07-18 10:30:30 Processing rows:170 Hashtable size: 169 
Memory 
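A likely explanation for why the SET had no effect, hedged since distros differ: mapred.child.java.opts sizes the task JVMs out on the cluster, while the "local task" in this log runs in a client-side JVM whose 1 GB "maximum memory" typically comes from the launching environment (HADOOP_HEAPSIZE in MB, or HADOOP_CLIENT_OPTS on some installs). An illustrative override:

```shell
# Raise the heap of the client-side JVM that builds the map-join hash table
# (4096 is an illustrative value, not a recommendation):
export HADOOP_HEAPSIZE=4096
# or, on some installs: export HADOOP_CLIENT_OPTS="-Xmx4g"
echo "HADOOP_HEAPSIZE=$HADOOP_HEAPSIZE"
# Then rerun the join from this same shell.
```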

Re: Hive Join Running Out of Memory

2014-07-18 Thread Edward Capriolo
This is a failed optimization: Hive is trying to build the lookup table
locally, put it in the distributed cache, and then do a map join.
Look through your hive-site.xml for the configuration to turn these auto
map joins off. Depending on your version the variable names have changed or
been deprecated, so I can't tell you the exact ones.
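As a concrete but hedged starting point, the property names commonly cited in the Hive 0.11-0.13 era were hive.auto.convert.join for the automatic map-join conversion and hive.exec.mode.local.auto for local-mode execution; confirm against your build with SET -v. Written as a preamble file that hive -i can load before the query:

```shell
# Session settings to stop Hive from attempting the local hash-table build
# (verify these names against your Hive version before relying on them):
cat > disable-mapjoin.hql <<'EOF'
SET hive.auto.convert.join=false;
SET hive.exec.mode.local.auto=false;
EOF
wc -l < disable-mapjoin.hql
# usage: hive -i disable-mapjoin.hql -f join_query.hql
```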


On Fri, Jul 18, 2014 at 10:35 AM, Clay McDonald 
stuart.mcdon...@bateswhite.com wrote:

 Hello everyone. I need some assistance. I have a join that fails with
  return code 3. The query is;

 SELECT B.CARD_NBR AS CNT
 FROM TENDER_TABLE A
 JOIN  LOYALTY_CARDS B
 ON A.CARD_NBR = B.CARD_NBR
 LIMIT 10;

 -- Row Counts
 -- LOYALTY_CARDS =   43,876,938
 -- TENDER_TABLE = 1,412,228,333

 The query execution output starts with;

 2014-07-18 10:30:17 Starting to launch local task to process map join;
  maximum memory = 1065484288

 The last output is as follows;

 2014-07-18 10:30:44 Processing rows:380 Hashtable size:
 379 Memory usage:   969531248   percentage: 0.91

 I ran SET mapred.child.java.opts=-Xmx4G; before the query but that did not
 change the maximum memory. What am I not understanding and how should I
 troubleshoot this issue?


 hive SELECT B.CARD_NBR AS CNT
  FROM TENDER_TABLE A
  JOIN  LOYALTY_CARDS B
  ON A.CARD_NBR = B.CARD_NBR
  LIMIT 10;
 Query ID = root_20140718103030_df1e7af9-7d66-4ba5-8d73-2d0bf58bb474
 Total jobs = 1
 14/07/18 10:30:17 WARN conf.Configuration:
 file:/tmp/root/hive_2014-07-18_10-30-15_081_1503496466695602651-1/-local-10006/jobconf.xml:an
 attempt to override final parameter:
 mapreduce.job.end-notification.max.retry.interval;  Ignoring.
 14/07/18 10:30:17 WARN conf.Configuration:
 file:/tmp/root/hive_2014-07-18_10-30-15_081_1503496466695602651-1/-local-10006/jobconf.xml:an
 attempt to override final parameter:
 mapreduce.job.end-notification.max.attempts;  Ignoring.
 14/07/18 10:30:17 INFO Configuration.deprecation: mapred.reduce.tasks is
 deprecated. Instead, use mapreduce.job.reduces
 14/07/18 10:30:17 INFO Configuration.deprecation: mapred.min.split.size is
 deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
 14/07/18 10:30:17 INFO Configuration.deprecation:
 mapred.reduce.tasks.speculative.execution is deprecated. Instead, use
 mapreduce.reduce.speculative
 14/07/18 10:30:17 INFO Configuration.deprecation:
 mapred.min.split.size.per.node is deprecated. Instead, use
 mapreduce.input.fileinputformat.split.minsize.per.node
 14/07/18 10:30:17 INFO Configuration.deprecation:
 mapred.input.dir.recursive is deprecated. Instead, use
 mapreduce.input.fileinputformat.input.dir.recursive
 14/07/18 10:30:17 INFO Configuration.deprecation:
 mapred.min.split.size.per.rack is deprecated. Instead, use
 mapreduce.input.fileinputformat.split.minsize.per.rack
 14/07/18 10:30:17 INFO Configuration.deprecation: mapred.max.split.size is
 deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
 14/07/18 10:30:17 INFO Configuration.deprecation:
 mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use
 mapreduce.job.committer.setup.cleanup.needed
 Execution log at:
 /tmp/root/root_20140718103030_df1e7af9-7d66-4ba5-8d73-2d0bf58bb474.log
 2014-07-18 10:30:17 Starting to launch local task to process map join;
  maximum memory = 1065484288
 2014-07-18 10:30:20 Processing rows:20  Hashtable size:
 19  Memory usage:   53829960percentage: 0.051
 2014-07-18 10:30:21 Processing rows:30  Hashtable size:
 29  Memory usage:   76926312percentage: 0.072
 2014-07-18 10:30:22 Processing rows:40  Hashtable size:
 39  Memory usage:   105119456   percentage: 0.099
 2014-07-18 10:30:23 Processing rows:50  Hashtable size:
 49  Memory usage:   129079592   percentage: 0.121
 2014-07-18 10:30:24 Processing rows:60  Hashtable size:
 59  Memory usage:   151469744   percentage: 0.142
 2014-07-18 10:30:24 Processing rows:70  Hashtable size:
 69  Memory usage:   174968512   percentage: 0.164
 2014-07-18 10:30:25 Processing rows:80  Hashtable size:
 79  Memory usage:   207735176   percentage: 0.195
 2014-07-18 10:30:25 Processing rows:90  Hashtable size:
 89  Memory usage:   232306976   percentage: 0.218
 2014-07-18 10:30:26 Processing rows:100 Hashtable size:
 99  Memory usage:   255813784   percentage: 0.24
 2014-07-18 10:30:27 Processing rows:110 Hashtable size:
 109 Memory usage:   280781448   percentage: 0.264
 2014-07-18 10:30:27 Processing rows:120 Hashtable size:
 119 Memory usage:   305606024   percentage: 0.287
 2014-07-18 10:30:28 Processing rows:130 Hashtable size:
 129 Memory usage:   323502504   percentage: 0.304
 2014-07-18 10:30:28 

Re: Hive huge 'startup time'

2014-07-18 Thread diogo
Sweet, great answers, thanks.

Indeed, I have a small number of partitions, but lots of small files, ~20MB
each. I'll make sure to combine them. Also, increasing the heap size of the
cli process already helped speed it up.

Thanks, again.


On Fri, Jul 18, 2014 at 10:26 AM, Edward Capriolo edlinuxg...@gmail.com
wrote:

 The planning phase needs to do work for every hive partition and every
 hadoop files. If you have a lot of 'small' files or many partitions this
 can take a long time.
 Also the planning phase that happens on the job tracker is single threaded.
 Also the new yarn stuff requires back and forth to allocated containers.

 Sometimes raising the heap to for the hive-cli/launching process helps
 because the default heap of 1 GB may not be a lot of space to deal with all
 of the partition information and memory overhead will make this go faster.
 Sometimes setting the min split size higher launches less map tasks which
 speeds up everything.

 So the answer...Try to tune everything, start hive like this:

 bin/hive -hiveconf hive.root.logger=DEBUG,console

 And record where the longest spaces with no output are, that is what you
 should try to tune first.




 On Fri, Jul 18, 2014 at 9:36 AM, diogo di...@uken.com wrote:

 This is probably a simple question, but I'm noticing that for queries
 that run on 1+TB of data, it can take Hive up to 30 minutes to actually
 start the first map-reduce stage. What is it doing? I imagine it's
 gathering information about the data somehow, this 'startup' time is
 clearly a function of the amount of data I'm trying to process.

 Cheers,





Re: Hive huge 'startup time'

2014-07-18 Thread Edward Capriolo
Unleash ze file crusha!

https://github.com/edwardcapriolo/filecrush


On Fri, Jul 18, 2014 at 10:51 AM, diogo di...@uken.com wrote:

 Sweet, great answers, thanks.

 Indeed, I have a small number of partitions, but lots of small files,
 ~20MB each. I'll make sure to combine them. Also, increasing the heap size
 of the cli process already helped speed it up.

 Thanks, again.


 On Fri, Jul 18, 2014 at 10:26 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

 The planning phase needs to do work for every hive partition and every
 hadoop files. If you have a lot of 'small' files or many partitions this
 can take a long time.
 Also the planning phase that happens on the job tracker is single
 threaded.
 Also the new yarn stuff requires back and forth to allocated containers.

 Sometimes raising the heap to for the hive-cli/launching process helps
 because the default heap of 1 GB may not be a lot of space to deal with all
 of the partition information and memory overhead will make this go faster.
 Sometimes setting the min split size higher launches less map tasks which
 speeds up everything.

 So the answer...Try to tune everything, start hive like this:

 bin/hive -hiveconf hive.root.logger=DEBUG,console

 And record where the longest spaces with no output are, that is what you
 should try to tune first.




 On Fri, Jul 18, 2014 at 9:36 AM, diogo di...@uken.com wrote:

 This is probably a simple question, but I'm noticing that for queries
 that run on 1+TB of data, it can take Hive up to 30 minutes to actually
 start the first map-reduce stage. What is it doing? I imagine it's
 gathering information about the data somehow, this 'startup' time is
 clearly a function of the amount of data I'm trying to process.

 Cheers,






RE: Hive Join Running Out of Memory

2014-07-18 Thread Clay McDonald
Thank you. Would it be acceptable to use the following?

SET hive.exec.mode.local.auto=false;


From: Edward Capriolo [mailto:edlinuxg...@gmail.com] 
Sent: Friday, July 18, 2014 10:45 AM
To: user@hive.apache.org
Subject: Re: Hive Join Running Out of Memory

This is a failed optimization hive is trying to build the lookup table locally 
and then put it in the distributed cache and then to a map join. Look through 
your hive site for the configuration to turn these auto-map joins off. Based on 
your version the variables changed a names /deprecated etc so I can not tell 
you the exact ones.

On Fri, Jul 18, 2014 at 10:35 AM, Clay McDonald 
stuart.mcdon...@bateswhite.com wrote:
Hello everyone. I need some assistance. I have a join that fails with  return 
code 3. The query is;

SELECT B.CARD_NBR AS CNT
FROM TENDER_TABLE A
JOIN  LOYALTY_CARDS B
ON A.CARD_NBR = B.CARD_NBR
LIMIT 10;

-- Row Counts
-- LOYALTY_CARDS =   43,876,938
-- TENDER_TABLE = 1,412,228,333

The query execution output starts with;

2014-07-18 10:30:17     Starting to launch local task to process map join;      
maximum memory = 1065484288

The last output is as follows;

2014-07-18 10:30:44     Processing rows:        380 Hashtable size: 379 
Memory usage:   969531248       percentage:     0.91

I ran SET mapred.child.java.opts=-Xmx4G; before the query but that did not 
change the maximum memory. What am I not understanding and how should I 
troubleshoot this issue?


hive SELECT B.CARD_NBR AS CNT
     FROM TENDER_TABLE A
     JOIN  LOYALTY_CARDS B
     ON A.CARD_NBR = B.CARD_NBR
     LIMIT 10;
Query ID = root_20140718103030_df1e7af9-7d66-4ba5-8d73-2d0bf58bb474
Total jobs = 1
14/07/18 10:30:17 WARN conf.Configuration: 
file:/tmp/root/hive_2014-07-18_10-30-15_081_1503496466695602651-1/-local-10006/jobconf.xml:an
 attempt to override final parameter: 
mapreduce.job.end-notification.max.retry.interval;  Ignoring.
14/07/18 10:30:17 WARN conf.Configuration: 
file:/tmp/root/hive_2014-07-18_10-30-15_081_1503496466695602651-1/-local-10006/jobconf.xml:an
 attempt to override final parameter: 
mapreduce.job.end-notification.max.attempts;  Ignoring.
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.reduce.tasks is 
deprecated. Instead, use mapreduce.job.reduces
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.min.split.size is 
deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/07/18 10:30:17 INFO Configuration.deprecation: 
mapred.reduce.tasks.speculative.execution is deprecated. Instead, use 
mapreduce.reduce.speculative
14/07/18 10:30:17 INFO Configuration.deprecation: 
mapred.min.split.size.per.node is deprecated. Instead, use 
mapreduce.input.fileinputformat.split.minsize.per.node
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.input.dir.recursive is 
deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/07/18 10:30:17 INFO Configuration.deprecation: 
mapred.min.split.size.per.rack is deprecated. Instead, use 
mapreduce.input.fileinputformat.split.minsize.per.rack
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.max.split.size is 
deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/07/18 10:30:17 INFO Configuration.deprecation: 
mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use 
mapreduce.job.committer.setup.cleanup.needed
Execution log at: 
/tmp/root/root_20140718103030_df1e7af9-7d66-4ba5-8d73-2d0bf58bb474.log
2014-07-18 10:30:17     Starting to launch local task to process map join;      
maximum memory = 1065484288
2014-07-18 10:30:20     Processing rows:        20  Hashtable size: 19  
Memory usage:   53829960        percentage:     0.051
2014-07-18 10:30:21     Processing rows:        30  Hashtable size: 29  
Memory usage:   76926312        percentage:     0.072
2014-07-18 10:30:22     Processing rows:        40  Hashtable size: 39  
Memory usage:   105119456       percentage:     0.099
2014-07-18 10:30:23     Processing rows:        50  Hashtable size: 49  
Memory usage:   129079592       percentage:     0.121
2014-07-18 10:30:24     Processing rows:        60  Hashtable size: 59  
Memory usage:   151469744       percentage:     0.142
2014-07-18 10:30:24     Processing rows:        70  Hashtable size: 69  
Memory usage:   174968512       percentage:     0.164
2014-07-18 10:30:25     Processing rows:        80  Hashtable size: 79  
Memory usage:   207735176       percentage:     0.195
2014-07-18 10:30:25     Processing rows:        90  Hashtable size: 89  
Memory usage:   232306976       percentage:     0.218
2014-07-18 10:30:26     Processing rows:        100 Hashtable size: 99  
Memory usage:   255813784       percentage:     0.24
2014-07-18 10:30:27     Processing rows:        110 Hashtable size: 109 
Memory usage:   280781448       percentage:     0.264
2014-07-18 10:30:27     Processing rows:        120 Hashtable size: 119 

Re: Hive Join Running Out of Memory

2014-07-18 Thread Edward Capriolo
I believe that would be the one.


On Fri, Jul 18, 2014 at 10:54 AM, Clay McDonald 
stuart.mcdon...@bateswhite.com wrote:

 Thank you. Would it be acceptable to use the following?

 SET hive.exec.mode.local.auto=false;


 From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
 Sent: Friday, July 18, 2014 10:45 AM
 To: user@hive.apache.org
 Subject: Re: Hive Join Running Out of Memory

 This is a failed optimization hive is trying to build the lookup table
 locally and then put it in the distributed cache and then to a map join.
 Look through your hive site for the configuration to turn these auto-map
 joins off. Based on your version the variables changed a names /deprecated
 etc so I can not tell you the exact ones.

 On Fri, Jul 18, 2014 at 10:35 AM, Clay McDonald 
 stuart.mcdon...@bateswhite.com wrote:
 Hello everyone. I need some assistance. I have a join that fails with
  return code 3. The query is;

 SELECT B.CARD_NBR AS CNT
 FROM TENDER_TABLE A
 JOIN  LOYALTY_CARDS B
 ON A.CARD_NBR = B.CARD_NBR
 LIMIT 10;

 -- Row Counts
 -- LOYALTY_CARDS =   43,876,938
 -- TENDER_TABLE = 1,412,228,333

 The query execution output starts with;

 2014-07-18 10:30:17 Starting to launch local task to process map join;
  maximum memory = 1065484288

 The last output is as follows;

 2014-07-18 10:30:44 Processing rows:380 Hashtable size:
 379 Memory usage:   969531248   percentage: 0.91

 I ran SET mapred.child.java.opts=-Xmx4G; before the query but that did not
 change the maximum memory. What am I not understanding and how should I
 troubleshoot this issue?


 hive SELECT B.CARD_NBR AS CNT
  FROM TENDER_TABLE A
  JOIN  LOYALTY_CARDS B
  ON A.CARD_NBR = B.CARD_NBR
  LIMIT 10;
 Query ID = root_20140718103030_df1e7af9-7d66-4ba5-8d73-2d0bf58bb474
 Total jobs = 1
 14/07/18 10:30:17 WARN conf.Configuration:
 file:/tmp/root/hive_2014-07-18_10-30-15_081_1503496466695602651-1/-local-10006/jobconf.xml:an
 attempt to override final parameter:
 mapreduce.job.end-notification.max.retry.interval;  Ignoring.
 14/07/18 10:30:17 WARN conf.Configuration:
 file:/tmp/root/hive_2014-07-18_10-30-15_081_1503496466695602651-1/-local-10006/jobconf.xml:an
 attempt to override final parameter:
 mapreduce.job.end-notification.max.attempts;  Ignoring.
 14/07/18 10:30:17 INFO Configuration.deprecation: mapred.reduce.tasks is
 deprecated. Instead, use mapreduce.job.reduces
 14/07/18 10:30:17 INFO Configuration.deprecation: mapred.min.split.size is
 deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
 14/07/18 10:30:17 INFO Configuration.deprecation:
 mapred.reduce.tasks.speculative.execution is deprecated. Instead, use
 mapreduce.reduce.speculative
 14/07/18 10:30:17 INFO Configuration.deprecation:
 mapred.min.split.size.per.node is deprecated. Instead, use
 mapreduce.input.fileinputformat.split.minsize.per.node
 14/07/18 10:30:17 INFO Configuration.deprecation:
 mapred.input.dir.recursive is deprecated. Instead, use
 mapreduce.input.fileinputformat.input.dir.recursive
 14/07/18 10:30:17 INFO Configuration.deprecation:
 mapred.min.split.size.per.rack is deprecated. Instead, use
 mapreduce.input.fileinputformat.split.minsize.per.rack
 14/07/18 10:30:17 INFO Configuration.deprecation: mapred.max.split.size is
 deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
 14/07/18 10:30:17 INFO Configuration.deprecation:
 mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use
 mapreduce.job.committer.setup.cleanup.needed
 Execution log at:
 /tmp/root/root_20140718103030_df1e7af9-7d66-4ba5-8d73-2d0bf58bb474.log
 2014-07-18 10:30:17 Starting to launch local task to process map join;
  maximum memory = 1065484288
 2014-07-18 10:30:20 Processing rows:20  Hashtable size:
 19  Memory usage:   53829960percentage: 0.051
 2014-07-18 10:30:21 Processing rows:30  Hashtable size:
 29  Memory usage:   76926312percentage: 0.072
 2014-07-18 10:30:22 Processing rows:40  Hashtable size:
 39  Memory usage:   105119456   percentage: 0.099
 2014-07-18 10:30:23 Processing rows:50  Hashtable size:
 49  Memory usage:   129079592   percentage: 0.121
 2014-07-18 10:30:24 Processing rows:60  Hashtable size:
 59  Memory usage:   151469744   percentage: 0.142
 2014-07-18 10:30:24 Processing rows:70  Hashtable size:
 69  Memory usage:   174968512   percentage: 0.164
 2014-07-18 10:30:25 Processing rows:80  Hashtable size:
 79  Memory usage:   207735176   percentage: 0.195
 2014-07-18 10:30:25 Processing rows:90  Hashtable size:
 89  Memory usage:   232306976   percentage: 0.218
 2014-07-18 10:30:26 Processing rows:100 Hashtable size:
 99  Memory usage:   255813784   percentage: 0.24
 2014-07-18 10:30:27 Processing rows: 

RE: Hive Join Running Out of Memory

2014-07-18 Thread Clay McDonald
I changed hive.auto.convert.join.noconditionaltask to false in the hive-site
and that seemed to do the trick. Thanks!
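For reference, a session-level sketch of the settings involved; the exact property keys vary across Hive versions, so treat the names here as assumptions to check against your hive-site.xml:

```shell
# Sketch: disable auto map-join conversion for a single invocation instead of
# editing the system hive-site.xml. Key names differ by version
# (hive.auto.convert.join in older releases,
# hive.auto.convert.join.noconditionaltask from 0.11 on).
hive --hiveconf hive.auto.convert.join=false \
     --hiveconf hive.auto.convert.join.noconditionaltask=false \
     -f big_join.hql   # big_join.hql is a placeholder for the failing query
```

With the conversion off, Hive falls back to a common reduce-side join instead of building the hash table inside the client JVM.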


From: Edward Capriolo [mailto:edlinuxg...@gmail.com] 
Sent: Friday, July 18, 2014 10:57 AM
To: user@hive.apache.org
Subject: Re: Hive Join Running Out of Memory

I believe that would be the one. 

On Fri, Jul 18, 2014 at 10:54 AM, Clay McDonald 
stuart.mcdon...@bateswhite.com wrote:
Thank you. Would it be acceptable to use the following?

SET hive.exec.mode.local.auto=false;


From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
Sent: Friday, July 18, 2014 10:45 AM
To: user@hive.apache.org
Subject: Re: Hive Join Running Out of Memory

This is a failed optimization: Hive is trying to build the lookup table locally,
put it in the distributed cache, and then do a map join. Look through your
hive-site.xml for the configuration to turn these auto-map joins off. The
variable names changed or were deprecated across versions, so I can not tell
you the exact ones for your version.

On Fri, Jul 18, 2014 at 10:35 AM, Clay McDonald 
stuart.mcdon...@bateswhite.com wrote:
Hello everyone. I need some assistance. I have a join that fails with  return 
code 3. The query is;

SELECT B.CARD_NBR AS CNT
FROM TENDER_TABLE A
JOIN  LOYALTY_CARDS B
ON A.CARD_NBR = B.CARD_NBR
LIMIT 10;

-- Row Counts
-- LOYALTY_CARDS =   43,876,938
-- TENDER_TABLE = 1,412,228,333

The query execution output starts with;

2014-07-18 10:30:17     Starting to launch local task to process map join;      
maximum memory = 1065484288

The last output is as follows;

2014-07-18 10:30:44     Processing rows:        380 Hashtable size: 379 
Memory usage:   969531248       percentage:     0.91

I ran SET mapred.child.java.opts=-Xmx4G; before the query but that did not 
change the maximum memory. What am I not understanding and how should I 
troubleshoot this issue?


hive> SELECT B.CARD_NBR AS CNT
     FROM TENDER_TABLE A
     JOIN  LOYALTY_CARDS B
     ON A.CARD_NBR = B.CARD_NBR
     LIMIT 10;
Query ID = root_20140718103030_df1e7af9-7d66-4ba5-8d73-2d0bf58bb474
Total jobs = 1
14/07/18 10:30:17 WARN conf.Configuration: 
file:/tmp/root/hive_2014-07-18_10-30-15_081_1503496466695602651-1/-local-10006/jobconf.xml:an
 attempt to override final parameter: 
mapreduce.job.end-notification.max.retry.interval;  Ignoring.
14/07/18 10:30:17 WARN conf.Configuration: 
file:/tmp/root/hive_2014-07-18_10-30-15_081_1503496466695602651-1/-local-10006/jobconf.xml:an
 attempt to override final parameter: 
mapreduce.job.end-notification.max.attempts;  Ignoring.
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.reduce.tasks is 
deprecated. Instead, use mapreduce.job.reduces
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.min.split.size is 
deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/07/18 10:30:17 INFO Configuration.deprecation: 
mapred.reduce.tasks.speculative.execution is deprecated. Instead, use 
mapreduce.reduce.speculative
14/07/18 10:30:17 INFO Configuration.deprecation: 
mapred.min.split.size.per.node is deprecated. Instead, use 
mapreduce.input.fileinputformat.split.minsize.per.node
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.input.dir.recursive is 
deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/07/18 10:30:17 INFO Configuration.deprecation: 
mapred.min.split.size.per.rack is deprecated. Instead, use 
mapreduce.input.fileinputformat.split.minsize.per.rack
14/07/18 10:30:17 INFO Configuration.deprecation: mapred.max.split.size is 
deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/07/18 10:30:17 INFO Configuration.deprecation: 
mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use 
mapreduce.job.committer.setup.cleanup.needed
Execution log at: 
/tmp/root/root_20140718103030_df1e7af9-7d66-4ba5-8d73-2d0bf58bb474.log
2014-07-18 10:30:17     Starting to launch local task to process map join;      
maximum memory = 1065484288
2014-07-18 10:30:20     Processing rows:        20  Hashtable size: 19  
Memory usage:   53829960        percentage:     0.051
2014-07-18 10:30:21     Processing rows:        30  Hashtable size: 29  
Memory usage:   76926312        percentage:     0.072
2014-07-18 10:30:22     Processing rows:        40  Hashtable size: 39  
Memory usage:   105119456       percentage:     0.099
2014-07-18 10:30:23     Processing rows:        50  Hashtable size: 49  
Memory usage:   129079592       percentage:     0.121
2014-07-18 10:30:24     Processing rows:        60  Hashtable size: 59  
Memory usage:   151469744       percentage:     0.142
2014-07-18 10:30:24     Processing rows:        70  Hashtable size: 69  
Memory usage:   174968512       percentage:     0.164
2014-07-18 10:30:25     Processing rows:        80  Hashtable size: 79  
Memory usage:   207735176       percentage:     0.195
2014-07-18 10:30:25     Processing rows:        90  Hashtable size: 

Hive support for filtering Unicode data

2014-07-18 Thread Duc le anh
Hello Hive,

I posted the below question
http://stackoverflow.com/questions/24817308/hive-support-for-filtering-unicode-data?noredirect=1#comment38534961_24817308
on Stackoverflow
http://stackoverflow.com/questions/24817308/hive-support-for-filtering-unicode-data?noredirect=1#comment38534961_24817308,
but decided to ask the same one here:

I have a Hive table with Unicode data. When trying to perform a simple
query SELECT * FROM table, I get back the correct data in correct Unicode
encoding. However, when I tried to add filtering criteria such as ...
WHERE column = 'some unicode value', my query returned nothing.

Is this a Hive limitation? Or is there any way to make Unicode filtering work
with Hive?

Thank you!

-- 
Duc Le
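One way to check whether an encoding mismatch between the stored bytes and the query literal explains the empty result. This is only a debugging sketch: `some_table` and `col` are placeholder names, and it assumes the `hex()` and `binary()` built-ins are available (Hive 0.12+):

```shell
# Sketch: print the hex bytes of stored values next to the hex bytes of the
# filter literal. If they differ (e.g. Latin-1 on disk vs a UTF-8 literal),
# an equality filter matches nothing even though SELECT * looks correct.
hive -e "SELECT col, hex(binary(col)) FROM some_table LIMIT 10;"
hive -e "SELECT hex(binary('some unicode value'));"
```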


Re: how to control hive log location on 0.13?

2014-07-18 Thread Yang
thanks guys. anybody know what generates a log like
myuser_20140716143232_d76043ed-1c4b-42a0-bf0a-2816377a6a2a.log? I checked
our application code; it doesn't generate this, so it looks like it comes from Hive.
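For the per-script override being discussed, one possible sketch that avoids touching the system hive-site.xml; whether --hiveconf values reach the log4j setup early enough can depend on the Hive build, so this is an assumption to verify:

```shell
# Sketch: redirect both hive.log and the per-query user_timestamp_uuid.log
# files for one session. /data/logs/hive is a placeholder directory; create
# it beforehand and, if several users share it, set the sticky bit
# (chmod 1777).
hive --hiveconf hive.log.dir=/data/logs/hive \
     --hiveconf hive.querylog.location=/data/logs/hive/query \
     -e "SHOW TABLES;"
```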


On Fri, Jul 18, 2014 at 12:28 AM, Andre Araujo ara...@pythian.com wrote:

 Make sure the directory you specify has the sticky bit set, otherwise
 users will have permission problems:

 chmod 1777 dir


 On 18 July 2014 14:19, Satish Mittal satish.mit...@inmobi.com wrote:

 You can configure the following property in
 $HIVE_HOME/conf/hive-log4j.properties:

 hive.log.dir=your location

 The default value of this property is ${java.io.tmpdir}/${user.name}.

 Thanks,
 Satish


 On Thu, Jul 17, 2014 at 11:58 PM, Yang tedd...@gmail.com wrote:

 we just moved to hadoop2.0 (HDP2.1 distro). it turns out that the new
 hive version generates a lot of logs into /tmp/ and is quickly creating the
 danger of running out of our /tmp/ space.


 I see these 2 different logs :

 [myuser@mybox ~]$  ls -lt /tmp/myuser/
 total 1988
 -rw-rw-r-- 1 myuser myuser  191687 2014-07-17 11:17 hive.log
 -rw-rw-r-- 1 myuser myuser   14472 2014-07-16 14:43
 myuser_20140716143232_d76043ed-1c4b-42a0-bf0a-2816377a6a2a.log
 -rw-rw-r-- 1 myuser myuser   14260 2014-07-16 14:04
 myuser_20140716135353_de698da0-807f-4e3b-8b97-5af5064b55f2.log
 -rw-rw-r-- 1 myuser myuser   14254 2014-07-16 13:42
 myuser_20140716133838_208329bd-77bb-4981-a2e9-e747647d0704.log



 from the doc at
 https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ErrorLogs
 I can see that per Hive session basis in /tmp/user.name/, but can be
 configured in hive-site.xml
 https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration
  with
 the hive.querylog.location property.,
 but I tried to pass it to -hiveconf 
 hive.querylog.location=/tmp/mycustomlogdir/
  , doesn't seem to work; the hive.log location is not changed by this
 approach either.

 so how can I change the location of both the logs , by some per-script
 params ? (i.e. we can't afford to change the system hive-site.xml or
 /etc/hive/conf etc)

 Thanks a lot
  Yang



 _
 The information contained in this communication is intended solely for
 the use of the individual or entity to whom it is addressed and others
 authorized to receive it. It may contain confidential or legally privileged
 information. If you are not the intended recipient you are hereby notified
 that any disclosure, copying, distribution or taking any action in reliance
 on the contents of this information is strictly prohibited and may be
 unlawful. If you have received this communication in error, please notify
 us immediately by responding to this email and then delete it from your
 system. The firm is neither liable for the proper and complete transmission
 of the information contained in this communication nor for any delay in its
 receipt.




 --
 André Araújo
 Big Data Consultant/Solutions Architect
 The Pythian Group - Australia - www.pythian.com

 Office (calls from within Australia): 1300 366 021 x1270
 Office (international): +61 2 8016 7000  x270 *OR* +1 613 565 8696   x1270
 Mobile: +61 410 323 559
 Fax: +61 2 9805 0544
 IM: pythianaraujo @ AIM/MSN/Y! or ara...@pythian.com @ GTalk

 “Success is not about standing at the top, it's the steps you leave behind.”
 — Iker Pou (rock climber)

 --






Re: how to control hive log location on 0.13?

2014-07-18 Thread Lefty Leverenz
Thanks André, I've added the sticky bit advice to Error Logs
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ErrorLogs
.


-- Lefty


On Fri, Jul 18, 2014 at 2:38 PM, Yang tedd...@gmail.com wrote:

 thanks guys.   anybody knows what generates the log like 
 myuser_20140716143232_d76043ed-1c4b-42a0-bf0a-2816377a6a2a.log ? I
 checked our application code, it doesn't generate this, looks from hive.


 On Fri, Jul 18, 2014 at 12:28 AM, Andre Araujo ara...@pythian.com wrote:

 Make sure the directory you specify has the sticky bit set, otherwise
 users will have permission problems:

 chmod 1777 dir


 On 18 July 2014 14:19, Satish Mittal satish.mit...@inmobi.com wrote:

 You can configure the following property in
 $HIVE_HOME/conf/hive-log4j.properties:

 hive.log.dir=your location

 The default value of this property is ${java.io.tmpdir}/${user.name}.

 Thanks,
 Satish


 On Thu, Jul 17, 2014 at 11:58 PM, Yang tedd...@gmail.com wrote:

 we just moved to hadoop2.0 (HDP2.1 distro). it turns out that the new
 hive version generates a lot of logs into /tmp/ and is quickly creating the
 danger of running out of our /tmp/ space.


 I see these 2 different logs :

 [myuser@mybox ~]$  ls -lt /tmp/myuser/
 total 1988
 -rw-rw-r-- 1 myuser myuser  191687 2014-07-17 11:17 hive.log
 -rw-rw-r-- 1 myuser myuser   14472 2014-07-16 14:43
 myuser_20140716143232_d76043ed-1c4b-42a0-bf0a-2816377a6a2a.log
 -rw-rw-r-- 1 myuser myuser   14260 2014-07-16 14:04
 myuser_20140716135353_de698da0-807f-4e3b-8b97-5af5064b55f2.log
 -rw-rw-r-- 1 myuser myuser   14254 2014-07-16 13:42
 myuser_20140716133838_208329bd-77bb-4981-a2e9-e747647d0704.log



 from the doc at
 https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ErrorLogs
 I can see that per Hive session basis in /tmp/user.name/, but can
 be configured in hive-site.xml
 https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration
  with
 the hive.querylog.location property.,
 but I tried to pass it to -hiveconf 
 hive.querylog.location=/tmp/mycustomlogdir/
  , doesn't seem to work; the hive.log location is not changed by this
 approach either.

 so how can I change the location of both the logs , by some per-script
 params ? (i.e. we can't afford to change the system hive-site.xml or
 /etc/hive/conf etc)

 Thanks a lot
  Yang








 --







Re: how to control hive log location on 0.13?

2014-07-18 Thread Andre Araujo
Can you give us an excerpt of the contents of this log?


On 19 July 2014 04:38, Yang tedd...@gmail.com wrote:

 thanks guys.   anybody knows what generates the log like 
 myuser_20140716143232_d76043ed-1c4b-42a0-bf0a-2816377a6a2a.log ? I
 checked our application code, it doesn't generate this, looks from hive.


 On Fri, Jul 18, 2014 at 12:28 AM, Andre Araujo ara...@pythian.com wrote:

 Make sure the directory you specify has the sticky bit set, otherwise
 users will have permission problems:

 chmod 1777 dir


 On 18 July 2014 14:19, Satish Mittal satish.mit...@inmobi.com wrote:

 You can configure the following property in
 $HIVE_HOME/conf/hive-log4j.properties:

 hive.log.dir=your location

 The default value of this property is ${java.io.tmpdir}/${user.name}.

 Thanks,
 Satish


 On Thu, Jul 17, 2014 at 11:58 PM, Yang tedd...@gmail.com wrote:

 we just moved to hadoop2.0 (HDP2.1 distro). it turns out that the new
 hive version generates a lot of logs into /tmp/ and is quickly creating the
 danger of running out of our /tmp/ space.


 I see these 2 different logs :

 [myuser@mybox ~]$  ls -lt /tmp/myuser/
 total 1988
 -rw-rw-r-- 1 myuser myuser  191687 2014-07-17 11:17 hive.log
 -rw-rw-r-- 1 myuser myuser   14472 2014-07-16 14:43
 myuser_20140716143232_d76043ed-1c4b-42a0-bf0a-2816377a6a2a.log
 -rw-rw-r-- 1 myuser myuser   14260 2014-07-16 14:04
 myuser_20140716135353_de698da0-807f-4e3b-8b97-5af5064b55f2.log
 -rw-rw-r-- 1 myuser myuser   14254 2014-07-16 13:42
 myuser_20140716133838_208329bd-77bb-4981-a2e9-e747647d0704.log



 from the doc at
 https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ErrorLogs
 I can see that per Hive session basis in /tmp/user.name/, but can
 be configured in hive-site.xml
 https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration
  with
 the hive.querylog.location property.,
 but I tried to pass it to -hiveconf 
 hive.querylog.location=/tmp/mycustomlogdir/
  , doesn't seem to work; the hive.log location is not changed by this
 approach either.

 so how can I change the location of both the logs , by some per-script
 params ? (i.e. we can't afford to change the system hive-site.xml or
 /etc/hive/conf etc)

 Thanks a lot
  Yang








 --








-- 


--





Re: how to control hive log location on 0.13?

2014-07-18 Thread Andre Araujo
and where is it located?


On 19 July 2014 10:58, Andre Araujo ara...@pythian.com wrote:

 Can you give us an excerpt of the contents of this log?


 On 19 July 2014 04:38, Yang tedd...@gmail.com wrote:

 thanks guys.   anybody knows what generates the log like 
 myuser_20140716143232_d76043ed-1c4b-42a0-bf0a-2816377a6a2a.log ? I
 checked our application code, it doesn't generate this, looks from hive.


 On Fri, Jul 18, 2014 at 12:28 AM, Andre Araujo ara...@pythian.com
 wrote:

 Make sure the directory you specify has the sticky bit set, otherwise
 users will have permission problems:

 chmod 1777 dir


 On 18 July 2014 14:19, Satish Mittal satish.mit...@inmobi.com wrote:

 You can configure the following property in
 $HIVE_HOME/conf/hive-log4j.properties:

 hive.log.dir=your location

 The default value of this property is ${java.io.tmpdir}/${user.name}.

 Thanks,
 Satish


 On Thu, Jul 17, 2014 at 11:58 PM, Yang tedd...@gmail.com wrote:

 we just moved to hadoop2.0 (HDP2.1 distro). it turns out that the new
 hive version generates a lot of logs into /tmp/ and is quickly creating 
 the
 danger of running out of our /tmp/ space.


 I see these 2 different logs :

 [myuser@mybox ~]$  ls -lt /tmp/myuser/
 total 1988
 -rw-rw-r-- 1 myuser myuser  191687 2014-07-17 11:17 hive.log
 -rw-rw-r-- 1 myuser myuser   14472 2014-07-16 14:43
 myuser_20140716143232_d76043ed-1c4b-42a0-bf0a-2816377a6a2a.log
 -rw-rw-r-- 1 myuser myuser   14260 2014-07-16 14:04
 myuser_20140716135353_de698da0-807f-4e3b-8b97-5af5064b55f2.log
 -rw-rw-r-- 1 myuser myuser   14254 2014-07-16 13:42
 myuser_20140716133838_208329bd-77bb-4981-a2e9-e747647d0704.log



 from the doc at
 https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ErrorLogs
 I can see that per Hive session basis in /tmp/user.name/, but can
 be configured in hive-site.xml
 https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration
  with
 the hive.querylog.location property.,
 but I tried to pass it to -hiveconf 
 hive.querylog.location=/tmp/mycustomlogdir/
  , doesn't seem to work; the hive.log location is not changed by this
 approach either.

 so how can I change the location of both the logs , by some per-script
 params ? (i.e. we can't afford to change the system hive-site.xml or
 /etc/hive/conf etc)

 Thanks a lot
  Yang








 --












-- 


--





Re: Hive huge 'startup time'

2014-07-18 Thread Db-Blog
Hello everyone, 

Thanks for sharing valuable inputs. I am working on a similar kind of task; it
would be really helpful if you could share the command for increasing the heap
size of the hive-cli launching process.

Thanks,
Saurabh

Sent from my iPhone, please avoid typos.
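One commonly used way to raise the heap of the client process that launches the query. This is a sketch; the exact environment variable honored can differ by distribution, so check your bin/hive wrapper:

```shell
# Sketch: HADOOP_CLIENT_OPTS is read by the hadoop launcher scripts that
# start the Hive CLI, so the -Xmx here applies to the client JVM doing
# planning and local map-join work, not to the MapReduce child tasks.
export HADOOP_CLIENT_OPTS="-Xmx4g"
hive -f my_query.hql   # my_query.hql is a placeholder
```

By contrast, SET mapred.child.java.opts affects only the MapReduce task JVMs, which is why it does not change the "maximum memory" reported by the local task.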

 On 18-Jul-2014, at 8:23 pm, Edward Capriolo edlinuxg...@gmail.com wrote:
 
 Unleash ze file crusha!
 
 https://github.com/edwardcapriolo/filecrush
 
 
 On Fri, Jul 18, 2014 at 10:51 AM, diogo di...@uken.com wrote:
 Sweet, great answers, thanks.
 
 Indeed, I have a small number of partitions, but lots of small files, ~20MB 
 each. I'll make sure to combine them. Also, increasing the heap size of the 
 cli process already helped speed it up.
 
 Thanks, again.
 
 
 On Fri, Jul 18, 2014 at 10:26 AM, Edward Capriolo edlinuxg...@gmail.com 
 wrote:
 The planning phase needs to do work for every Hive partition and every
 Hadoop file. If you have a lot of 'small' files or many partitions this
 can take a long time.
 Also, the planning phase that happens on the job tracker is single threaded,
 and the new YARN machinery requires back-and-forth to allocate containers.

 Sometimes raising the heap for the hive-cli launching process helps, because
 the default heap of 1 GB may not be a lot of space to deal with all of the
 partition information; extra headroom will make this go faster.
 Sometimes setting the min split size higher launches fewer map tasks, which
 speeds up everything.
 
 So the answer...Try to tune everything, start hive like this:
 
 bin/hive -hiveconf hive.root.logger=DEBUG,console
 
 And record where the longest spaces with no output are, that is what you 
 should try to tune first.
 
 
 
 
 On Fri, Jul 18, 2014 at 9:36 AM, diogo di...@uken.com wrote:
 This is probably a simple question, but I'm noticing that for queries that 
 run on 1+TB of data, it can take Hive up to 30 minutes to actually start 
 the first map-reduce stage. What is it doing? I imagine it's gathering 
 information about the data somehow; this 'startup' time is clearly a
 function of the amount of data I'm trying to process.
 
 Cheers,
 


Re: how to control hive log location on 0.13?

2014-07-18 Thread Yang
2014-07-18 15:03:37,774 INFO  mr.ExecDriver
(SessionState.java:printInfo(537)) - Execution log at:
/tmp/myuser/myuser_20140718150303_56bf6bb0-db30-4dbc-807c-9023ce4103f4.log
2014-07-18 15:03:37,864 WARN  conf.Configuration
(Configuration.java:loadProperty(2358)) -
file:/tmp/myuser/hive_2014-07-18_15-03-30_423_6799963466906099923-1/-local-10011/jobconf.xml:an attempt to override final
parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2014-07-18 15:03:37,871 WARN  conf.Configuration
(Configuration.java:loadProperty(2358)) -
file:/tmp/myuser/hive_2014-07-18_15-03-30_423_
6799963466906099923-1/-local-10011/jobconf.xml:an attempt to override final
parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2014-07-18 15:03:37,951 INFO  log.PerfLogger
(PerfLogger.java:PerfLogBegin(108)) - PERFLOG method=deserializePlan
from=org.apache.hadoop.hive.ql.exec.Utilities
2014-07-18 15:03:37,951 INFO  exec.Utilities
(Utilities.java:deserializePlan(822)) - Deserializing MapredLocalWork via
kryo
2014-07-18 15:03:38,237 INFO  log.PerfLogger
(PerfLogger.java:PerfLogEnd(135)) - /PERFLOG method=deserializePlan
start=1405721017951 end=1405721018237 duration=286 from=org.apache.hadoop.hive.ql.exec.Utilities
2014-07-18 15:03:38,246 INFO  mr.MapredLocalTask
(SessionState.java:printInfo(537)) - 2014-07-18 03:03:38   Starting to
launch local task to process map join;  maximum memory = 4261937152
2014-07-18 15:03:38,261 INFO  mr.MapredLocalTask
(MapredLocalTask.java:initializeOperators(406)) - fetchoperator for
null-subquery2:a-subquery2
:dpkg_cntr:dpkg_wtransaction_p2_id_user_30m created
2014-07-18 15:03:38,263 INFO  mr.MapredLocalTask
(MapredLocalTask.java:initializeOperators(406)) - fetchoperator for
null-subquery2:a-subquery2
:dpkg:dpkg_wtransaction_p2_id_user_30m created
2014-07-18 15:03:38,264 INFO  mr.MapredLocalTask
(MapredLocalTask.java:initializeOperators(406)) - fetchoperator for
null-subquery2:a-subquery2
:xclick:b:wtrans_data_map_p2_30m created
2014-07-18 15:03:38,266 INFO  mr.MapredLocalTask
(MapredLocalTask.java:initializeOperators(406)) - fetchoperator for
null-subquery1:a-subquery1
:dpkg_cntr:dpkg_wtransaction_id_user_30m created
2014-07-18 15:03:38,268 INFO  mr.MapredLocalTask
(MapredLocalTask.java:initializeOperators(406)) - fetchoperator for
null-subquery1:a-subquery1
:dpkg:dpkg_wtransaction_id_user_30m created
2014-07-18 15:03:38,269 INFO  mr.MapredLocalTask
(MapredLocalTask.java:initializeOperators(406)) - fetchoperator for
null-subquery1:a-subquery1
:xclick:b:wtrans_data_map_30m created
-

whole bunch of stuff omitted here

--


--
2014-07-18 15:04:08,678 INFO  exec.HashTableSinkOperator
(HashTableSinkOperator.java:flushToFile(278)) - Temp URI for side table:
file:/tmp/myuser/hive_2014-07-18_15-03-30_423_6799963466906099923-1/-local-10008/HashTable-Stage-2
2014-07-18 15:04:08,678 INFO  exec.HashTableSinkOperator
(SessionState.java:printInfo(537)) - 2014-07-18 03:04:08   Dump the
side-table into file:
file:/tmp/myuser/hive_2014-07-18_15-03-30_423_6799963466906099923-1/-local-10008/HashTable-Stage-2/MapJoin-mapfile11--.hashtable
2014-07-18 15:04:09,943 INFO  exec.HashTableSinkOperator
(SessionState.java:printInfo(537)) - 2014-07-18 03:04:09   Uploaded 1
File to:
file:/tmp/myuser/hive_2014-07-18_15-03-30_423_6799963466906099923-1/-local-10008/HashTable-Stage-2/MapJoin-mapfile11--.hashtable
(58010217 bytes)
2014-07-18 15:04:09,943 INFO  exec.HashTableSinkOperator
(Operator.java:close(591)) - 6 Close done
2014-07-18 15:04:09,943 INFO  exec.SelectOperator
(Operator.java:close(591)) - 5 Close done
2014-07-18 15:04:09,943 INFO  exec.TableScanOperator
(Operator.java:close(591)) - 4 Close done
2014-07-18 15:04:09,951 INFO  mapred.FileInputFormat
(FileInputFormat.java:listStatus(247)) - Total input paths to process : 1
2014-07-18 15:04:10,008 INFO  mapred.FileInputFormat
(FileInputFormat.java:listStatus(247)) - Total input paths to process : 1
2014-07-18 15:04:11,526 INFO  exec.HashTableSinkOperator
(SessionState.java:printInfo(537)) - 2014-07-18 03:04:11   Processing
rows:
20  Hashtable size: 19  Memory usage:   190041576
percentage: 0.045
2014-07-18 15:04:11,950 INFO  exec.HashTableSinkOperator
(SessionState.java:printInfo(537)) - 2014-07-18 03:04:11   Processing
rows:
30  Hashtable size: 29  Memory usage:   250890416
percentage: 0.059
2014-07-18 15:04:12,456 INFO  exec.HashTableSinkOperator
(SessionState.java:printInfo(537)) - 2014-07-18 03:04:12   Processing
rows:
40  Hashtable size: 39  Memory usage:   304697120
percentage: 0.071
2014-07-18 15:04:12,744 INFO  exec.TableScanOperator
(Operator.java:close(574)) - 11 finished. closing...
2014-07-18 15:04:12,745 INFO  exec.FilterOperator
(Operator.java:close(574)) - 12 finished. closing...
2014-07-18 15:04:12,745 INFO  

Re: how to control hive log location on 0.13?

2014-07-18 Thread Yang
it's in /tmp/my_user/

the funny thing is that I already  have a hive.log there.


On Fri, Jul 18, 2014 at 6:01 PM, Andre Araujo ara...@pythian.com wrote:

 and where is it located?


 On 19 July 2014 10:58, Andre Araujo ara...@pythian.com wrote:

 Can you give us an excerpt of the contents of this log?


 On 19 July 2014 04:38, Yang tedd...@gmail.com wrote:

 thanks guys.   anybody knows what generates the log like 
 myuser_20140716143232_d76043ed-1c4b-42a0-bf0a-2816377a6a2a.log ? I
 checked our application code, it doesn't generate this, looks from hive.


 On Fri, Jul 18, 2014 at 12:28 AM, Andre Araujo ara...@pythian.com
 wrote:

 Make sure the directory you specify has the sticky bit set, otherwise
 users will have permission problems:

 chmod 1777 dir


 On 18 July 2014 14:19, Satish Mittal satish.mit...@inmobi.com wrote:

 You can configure the following property in
 $HIVE_HOME/conf/hive-log4j.properties:

 hive.log.dir=your location

 The default value of this property is ${java.io.tmpdir}/${user.name}.

 Thanks,
 Satish
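
 One way to apply this per script, without touching the system config, is to
 point HIVE_CONF_DIR at a private copy of the configuration. This is only a
 sketch, under the assumption that the hive launcher picks up
 hive-log4j.properties from $HIVE_CONF_DIR; the paths, the one-line
 properties file, and the script name are illustrative:

```shell
# Keep a private conf dir; in a real setup you would copy the whole
# /etc/hive/conf directory rather than write a one-line file (illustrative).
mkdir -p /tmp/myhiveconf
printf 'hive.log.dir=${java.io.tmpdir}/${user.name}\n' > /tmp/myhiveconf/hive-log4j.properties
# Rewrite hive.log.dir in the private copy only; the system config is untouched.
sed -i 's|^hive.log.dir=.*|hive.log.dir=/tmp/mycustomlogdir|' /tmp/myhiveconf/hive-log4j.properties
cat /tmp/myhiveconf/hive-log4j.properties   # hive.log.dir=/tmp/mycustomlogdir
# Then run the script against the private conf dir (hypothetical invocation):
# HIVE_CONF_DIR=/tmp/myhiveconf hive -f my_script.hql
```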


 On Thu, Jul 17, 2014 at 11:58 PM, Yang tedd...@gmail.com wrote:

 we just moved to hadoop2.0 (HDP2.1 distro). it turns out that the new
 hive version generates a lot of logs into /tmp/ and is quickly creating 
 the
 danger of running out of our /tmp/ space.


 I see these 2 different logs :

 [myuser@mybox ~]$  ls -lt /tmp/myuser/
 total 1988
 -rw-rw-r-- 1 myuser myuser  191687 2014-07-17 11:17 hive.log
 -rw-rw-r-- 1 myuser myuser   14472 2014-07-16 14:43
 myuser_20140716143232_d76043ed-1c4b-42a0-bf0a-2816377a6a2a.log
 -rw-rw-r-- 1 myuser myuser   14260 2014-07-16 14:04
 myuser_20140716135353_de698da0-807f-4e3b-8b97-5af5064b55f2.log
 -rw-rw-r-- 1 myuser myuser   14254 2014-07-16 13:42
 myuser_20140716133838_208329bd-77bb-4981-a2e9-e747647d0704.log



  from the doc at
  https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ErrorLogs
  I can see that logs are stored on a per-Hive-session basis in /tmp/<user.name>/, but the
  location can be configured in hive-site.xml
  https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration
  with the hive.querylog.location property.
  I tried passing -hiveconf hive.querylog.location=/tmp/mycustomlogdir/ on the
  command line, but that doesn't seem to work; the hive.log location is not
  changed by this approach either.

  so how can I change the location of both logs via some per-script
  params? (i.e. we can't afford to change the system hive-site.xml or
  /etc/hive/conf, etc.)

 Thanks a lot
  Yang



 _
 The information contained in this communication is intended solely for
 the use of the individual or entity to whom it is addressed and others
 authorized to receive it. It may contain confidential or legally 
 privileged
 information. If you are not the intended recipient you are hereby notified
 that any disclosure, copying, distribution or taking any action in 
 reliance
 on the contents of this information is strictly prohibited and may be
 unlawful. If you have received this communication in error, please notify
 us immediately by responding to this email and then delete it from your
 system. The firm is neither liable for the proper and complete 
 transmission
 of the information contained in this communication nor for any delay in 
 its
 receipt.




 --
 André Araújo
 Big Data Consultant/Solutions Architect
 The Pythian Group - Australia - www.pythian.com

 Office (calls from within Australia): 1300 366 021 x1270
 Office (international): +61 2 8016 7000  x270 *OR* +1 613 565 8696
 x1270
 Mobile: +61 410 323 559
 Fax: +61 2 9805 0544
 IM: pythianaraujo @ AIM/MSN/Y! or ara...@pythian.com @ GTalk

 “Success is not about standing at the top, it's the steps you leave
 behind.” — Iker Pou (rock climber)






 --