Re: Obvious and not so obvious query optimizations in Hive

2012-06-27 Thread yongqiang he
1.   Having my external table data gzipped and reading it in the
table v/s no compression at all.

You may want to gzip your data since it is offline. But if space is not a
concern and you want to optimize CPU, use Snappy.

With snappy, there is no reason to go with no compression.
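
A minimal sketch of what that setup looks like (assuming the Snappy codec is
installed on the cluster; the property names below are the pre-YARN mapred.* ones):

set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapred.output.compression.type=BLOCK;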


3.   Creating intermediate external tables v/s non external tables
v/s creating views?

First go with normal (managed) tables; external tables are harder to manage.
Views are there for complex things which are hard to do with a managed table.

4.   Storing the external table as Textfile v/s Sequence file. I
know sequence file compresses the data, but in what format? I read
about RC files and how efficient they are, how to use them?

Use RCFile if you query your data with Hive: 'create table xxx(xxx) stored
as rcfile'.
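
For illustration, a fuller sketch of moving existing text data into an RCFile
table (table and column names here are made up):

create table logs_rc (id int, msg string) stored as rcfile;
insert overwrite table logs_rc select id, msg from logs_text;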

5.   How does the number of reducers get set for a Hive query (the way
group by and order by set the number of reducers to 1)? If I am not
changing it explicitly, does it pick it up from the underlying Hadoop
cluster? I am trying to understand the bottleneck between query and
cluster size.

Can you say more about your concern about query and cluster size?

On Wed, Jun 27, 2012 at 11:46 AM, Bejoy KS bejoy...@yahoo.com wrote:
 Hi Richin

 I'm not an AWS guy but still lemme try answering a few questions in general
 (not wrt AWS EMR)


 1.   Having my external table data gzipped and reading it in the table
 v/s no compression at all.

 Bejoy: When the data volume is large, compression saves disk space and
 hence it is recommended. Gzip is not splittable, which means one file can't be
 distributed across map tasks. Go with LZO or Snappy.


 2.   Having the external table data on S3 v/s having it on HDFS?

 3.   Creating intermediate external tables v/s non external tables v/s
 creating views?

 Bejoy: External tables are not much different from managed tables. With
 managed tables, when you drop the table using HQL the underlying data in hdfs
 is also deleted, but in the case of external tables only the table definition
 is dropped and the data in hdfs is preserved. Views are not evaluated at
 creation time; when a view is used, MR jobs are triggered and the required
 data is extracted out of the source tables. So if you are planning to reuse a
 view n number of times it'll be better to create a table and use it; else
 the view query will be evaluated n times.
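
 To illustrate the difference (hypothetical names): a view only stores the
 query text, while a CREATE TABLE AS SELECT materializes the result once:

 create view daily_clicks_v as select ds, count(*) as cnt from clicks group by ds;
 create table daily_clicks_t as select ds, count(*) as cnt from clicks group by ds;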


 4.   Storing the external table as Textfile v/s Sequence file. I know
 sequence file compresses the data, but in what format? I read about RC files
 and how efficient they are, how to use them?

 Bejoy: Sequence files are splittable on their own, so they are a good choice
 when used with Snappy. In a sequence file, compression happens either at
 record level or block level, which is configurable at the time of compressing.
 If you are using Gzip, text files offer a better compression ratio as the
 whole data is compressed in one go, compared with sequence files where it
 happens per record/block. But then you compromise on splittability. RC files
 are a good choice when your queries involve only a few columns rather than all
 columns in a row.
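
 As a sketch of the block-compression setup for a sequence file table
 (hypothetical names; the codec line assumes Snappy is installed):

 set hive.exec.compress.output=true;
 set io.seqfile.compression.type=BLOCK;  -- or RECORD
 set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
 create table logs_seq (id int, msg string) stored as sequencefile;
 insert overwrite table logs_seq select id, msg from logs_text;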

 5.   How does the number of reducers get set for a Hive query (the way group
 by and order by set the number of reducers to 1)? If I am not changing it
 explicitly, does it pick it up from the underlying Hadoop cluster? I am trying
 to understand the bottleneck between query and cluster size.

 Bejoy: Some queries in Hive are forced to have just a single reducer, like
 order by. For other queries Hive determines the number of reducers.
 However, you can always specify the number of reducers on a per-query basis
 based on the data it processes.
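
 For reference, a sketch of the knobs involved (values are only examples): Hive
 estimates the reducer count from the input size via
 hive.exec.reducers.bytes.per.reducer, and mapred.reduce.tasks overrides it for
 a single query:

 set hive.exec.reducers.bytes.per.reducer=1000000000;
 set mapred.reduce.tasks=32;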

 6.   Any other optimizations/ best practices?

 Bejoy: There are lots of other optimizations in Hive which can be applied
 based on the query. There are various join optimizations, group-by
 optimizations etc. suited for specific needs.


 Regards
 Bejoy KS

 Sent from handheld, please excuse typos.
 
 From: richin.j...@nokia.com
 Date: Wed, 27 Jun 2012 15:47:54 +
 To: user@hive.apache.org
 ReplyTo: user@hive.apache.org
 Subject: Obvious and not so obvious query optimizations in Hive

 Hey Hivers,



 I am trying to understand what are some of the obvious and not so obvious
 optimizations I could do for a Hive query on an AWS EMR cluster. I know the
 answer for some of these questions but want to know what you guys think
 and by what factor it affects the performance over the other approach.



 1.   Having my external table data gzipped and reading it in the table
 v/s no compression at all.

 2.   Having the external table data on S3 v/s having it on HDFS?

 3.   Creating intermediate external tables v/s non external tables v/s
 creating views?

 4.   Storing the external table as Textfile v/s Sequence file. I know
 sequence file compresses the data, but in what format? I read about RC files
 and how efficient they are, how to use them?

 

Re: hive - snappy and sequence file vs RC file

2012-06-26 Thread yongqiang he
Can you share the reason for choosing Snappy as your compression codec?
Like @omalley mentioned, RCFile will compress the data more densely,
and will avoid reading data not required in your Hive query. And I
think Facebook uses it to store tens of PB (if not hundreds of PB) of data.

Thanks
Yongqiang
On Tue, Jun 26, 2012 at 9:49 AM, Owen O'Malley omal...@apache.org wrote:
 SequenceFile compared to RCFile:
   * More widely deployed.
   * Available from MapReduce and Pig
   * Doesn't compress as small (in RCFile all of each column's values are put
 together)
   * Uncompresses and deserializes all of the columns, even if you are only
 reading a few

 In either case, for long term storage, you should seriously consider the
 default codec since that will provide much tighter compression (at the cost
 of cpu to compress it).

 -- Owen


Re: ColumnarSerDe and LazyBinaryColumnarSerDe

2012-03-06 Thread yongqiang he
I guess LazyBinaryColumnarSerDe is not about saving space, but it is CPU efficient.
Your tests align with our internal tests from a long time ago.

On Tue, Mar 6, 2012 at 8:58 AM, Yin Huai huaiyin@gmail.com wrote:
 Hi,

 Is LazyBinaryColumnarSerDe more space efficient than ColumnarSerDe in
 general?

 Let me make my question more specific.

 I generated two tables from the table lineitem of TPC-H
 using ColumnarSerDe and LazyBinaryColumnarSerDe as follows...
 CREATE TABLE lineitem_rcfile_lazybinary
 ROW FORMAT SERDE
 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
 STORED AS RCFile AS
 SELECT * from lineitem;

 CREATE TABLE lineitem_rcfile_lazy
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
 STORED AS RCFile AS
 SELECT * from lineitem;

 Since serialization with LazyBinaryColumnarSerDe is binary-based and that
 of ColumnarSerDe is text-based, I expected
 table lineitem_rcfile_lazybinary to be smaller than lineitem_rcfile_lazy.
 However, no matter whether compression is
 enabled, lineitem_rcfile_lazybinary is a little bit larger
 than lineitem_rcfile_lazy. Did I use LazyBinaryColumnarSerDe in a wrong way?

 btw, the row group size of RCFile is 32MB.

 Thanks,

 Yin


Re: Hive 0.7.1 authorization woes

2011-08-25 Thread yongqiang he
what is your unix name on that machine? can u do a whoami?

On Thu, Aug 25, 2011 at 5:15 PM, Alex Holmes grep.a...@gmail.com wrote:
 Here's the hive-site.xml file (I use the same file for both the client
 and remote metastore).  We're using mysql as the metastore DB.


 <?xml version="1.0"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 <configuration>
 <property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
 </property>
 <property>
  <name>hive.metastore.local</name>
  <value>false</value>
 </property>
 <property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/hive?createDatabaseIfNotExist=true</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
 </property>
 <property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>secret</value>
 </property>
 </configuration>



 On Wed, Aug 24, 2011 at 6:06 PM, yongqiang he heyongqiang...@gmail.com 
 wrote:
 this is what i have tried with a remote metastore:

     set hive.security.authorization.enabled=false;
 hive
    
    
     drop table src2;
 OK
 Time taken: 1.002 seconds
 hive create table src2 (key int, value string);
 OK
 Time taken: 0.03 seconds
 hive
    
    
     set hive.security.authorization.enabled=true;
 hive grant select on table src2 to user heyongqiang;
 OK
 Time taken: 0.113 seconds
 hive select * from src2;
 OK
 Time taken: 0.188 seconds
 hive show grant user heyongqiang on table src2;
 OK

 database        default
 table   src2
 principalName   heyongqiang
 principalType   USER
 privilege       Select
 grantTime       Wed Aug 24 15:03:51 PDT 2011
 grantor heyongqiang

 can u do a show grant?

 (But with a remote metastore, I think hive should return an empty list
 instead of null for list_privileges etc.)



 On Wed, Aug 24, 2011 at 2:34 PM, Alex Holmes grep.a...@gmail.com wrote:
 Authorization works for me with the local metastore.  The remote
 metastore works with authorization turned off, but as soon as I turn
 it on and issue any commands I get these exceptions on the hive
 client.

 Could you also try the remote metastore please?  I'm pretty sure that
 authorization does not work with it at all.

 Thanks,
 Alex

 On Wed, Aug 24, 2011 at 5:20 PM, yongqiang he heyongqiang...@gmail.com 
 wrote:
 I am using local metastore,  and can not reproduce the problem.

 what message did you get when running local metastore?

 On Wed, Aug 24, 2011 at 1:58 PM, Alex Holmes grep.a...@gmail.com wrote:
 Thanks for opening a ticket.

 Table-level grants aren't working for me either (HIVE-2405 suggests
 that the bug is only related to global grants).

 hive set hive.security.authorization.enabled=false;
 hive CREATE TABLE pokes (foo INT, bar STRING);
 OK
 Time taken: 1.245 seconds
 hive LOAD DATA LOCAL INPATH 'hive1.in' OVERWRITE INTO TABLE pokes;
 FAILED: Error in semantic analysis: Line 1:23 Invalid path 'hive1.in':
 No files matching path file:/app/hadoop/hive-0.7.1/conf/hive1.in
 hive LOAD DATA LOCAL INPATH '/app/hadoop/hive1.in' OVERWRITE INTO TABLE 
 pokes;
 Copying data from file:/app/hadoop/hive1.in
 Copying file: file:/app/hadoop/hive1.in
 Loading data to table default.pokes
 Moved to trash: hdfs://localhost:54310/user/hive/warehouse/pokes
 OK
 Time taken: 0.33 seconds
 hive select * from pokes;
 OK
 1       a
 2       b
 3       c
 Time taken: 0.095 seconds
 hive grant select on table pokes to user hduser;
 OK
 Time taken: 0.251 seconds
 hive set hive.security.authorization.enabled=true;
 hive select * from pokes;
 FAILED: Hive Internal Error:
 org.apache.hadoop.hive.ql.metadata.HiveException(org.apache.thrift.TApplicationException:
 get_privilege_set failed: unknown result)
 org.apache.hadoop.hive.ql.metadata.HiveException:
 org.apache.thrift.TApplicationException: get_privilege_set failed:
 unknown result
        at 
 org.apache.hadoop.hive.ql.metadata.Hive.get_privilege_set(Hive.java:1617)
        at 
 org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserPriv(DefaultHiveAuthorizationProvider.java:201)
        at 
 org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserAndDBPriv(DefaultHiveAuthorizationProvider.java:226)
 ...

 mysql select * from TBL_PRIVS;
 +--+-+--+-+--+++--++
 | TBL_GRANT_ID | CREATE_TIME | GRANT_OPTION | GRANTOR | GRANTOR_TYPE |
 PRINCIPAL_NAME | PRINCIPAL_TYPE | TBL_PRIV | TBL_ID |
 +--+-+--+-+--+++--++
 |            1 |  1314219701 |            0 | hduser  | USER         |
 hduser         | USER           | Select   |      1

Re: Hive 0.7.1 authorization woes

2011-08-24 Thread yongqiang he
This is a bug. Will open a jira to fix this. and will backport it to 0.7.1.
https://issues.apache.org/jira/browse/HIVE-2405

thanks for reporting this one!

On Wed, Aug 24, 2011 at 6:25 AM, Alex Holmes grep.a...@gmail.com wrote:
 I created the mysql database (with the simple create database command)
 and the remote metastore seemed to create the mysql tables.  Here's
 some grant information and what I see in the database:

 [hduser@aholmes-desktop conf]$ hive
 hive grant all to user hduser;
 OK
 Time taken: 0.334 seconds
 hive show grant user hduser;
 OK

 principalName   hduser
 principalType   USER
 privilege       All
 grantTime       1314191500
 grantor hduser
 Time taken: 0.046 seconds
 hive CREATE TABLE pokes (foo INT, bar STRING);
 FAILED: Hive Internal Error:
 org.apache.hadoop.hive.ql.metadata.HiveException(org.apache.thrift.TApplicationException:
 get_privilege_set failed: unknown result)
 org.apache.hadoop.hive.ql.metadata.HiveException:
 org.apache.thrift.TApplicationException: get_privilege_set failed:
 unknown result
        at 
 org.apache.hadoop.hive.ql.metadata.Hive.get_privilege_set(Hive.java:1617)
        at 
 org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserPriv(DefaultHiveAuthorizationProvider.java:201)
        at 
 org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserAndDBPriv(DefaultHiveAuthorizationProvider.java:226)
        at 
 org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorize(DefaultHiveAuthorizationProvider.java:89)
 ...

 mysql use hive;
 Database changed
 mysql select * from GLOBAL_PRIVS;
 +---+-+--+-+--+++---+
 | USER_GRANT_ID | CREATE_TIME | GRANT_OPTION | GRANTOR | GRANTOR_TYPE
 | PRINCIPAL_NAME | PRINCIPAL_TYPE | USER_PRIV |
 +---+-+--+-+--+++---+
 |             1 |  1314191500 |            0 | hduser  | USER
 | hduser         | USER           | All       |
 +---+-+--+-+--+++---+
 1 row in set (0.00 sec)


 Thanks for your help,
 Alex

 On Tue, Aug 23, 2011 at 1:27 PM, yongqiang he heyongqiang...@gmail.com 
 wrote:
 Have you created the metastore mysql tables for authorization? Can u
 do a show grant?

 thanks
 yongqiang
 On Tue, Aug 16, 2011 at 2:55 PM, Alex Holmes grep.a...@gmail.com wrote:
 Hi all,

 I've been struggling with getting Hive authorization to work for a few
 hours, and I really hope someone can help me.  I installed Hive 0.7.1
 on top of Hadoop 0.20.203.  I'm using mysql for the metastore, and
 configured Hive to enable authorization:

 <property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
  <description>enable or disable the hive client authorization</description>
 </property>

 I kept all the other Hive security configs with their default settings.

 I'm running in pseudo-distributed mode on a single node.  HDFS, the Hive
 metastore and the Hive CLI are all running as the same user (the HDFS
 superuser).  Here are the sequence of steps that are causing me issues.
 Without authorization everything works perfectly (creating, loading, 
 selecting).
 I've also tried creating and loading the table without authorization, 
 granting
 the select privilege at various levels (global, table, database), turning on
 auth and performing the select, resulting in the same exception.

 Any help with this would be greatly appreciated!

 Thanks,
 Alex

 --

 [hduser@aholmes-desktop ~]$ hive
 Hive history 
 file=/tmp/hduser/hive_job_log_hduser_201108162158_1976573160.txt
 hive set hive.security.authorization.enabled=false;
 hive grant all to user hduser;
 OK
 Time taken: 0.233 seconds
 hive set hive.security.authorization.enabled=true;
 hive CREATE TABLE pokes3 (foo INT, bar STRING);
 FAILED: Hive Internal Error:
 org.apache.hadoop.hive.ql.metadata.HiveException(org.apache.thrift.TApplicationException:
 get_privilege_set failed: unknown result)
 org.apache.hadoop.hive.ql.metadata.HiveException:
 org.apache.thrift.TApplicationException: get_privilege_set failed:
 unknown result
        at 
 org.apache.hadoop.hive.ql.metadata.Hive.get_privilege_set(Hive.java:1617)
        at 
 org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserPriv(DefaultHiveAuthorizationProvider.java:201)
        at 
 org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserAndDBPriv(DefaultHiveAuthorizationProvider.java:226)
        at 
 org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorize(DefaultHiveAuthorizationProvider.java:89)
        at org.apache.hadoop.hive.ql.Driver.doAuthorization(Driver.java:433)
        at org.apache.hadoop.hive.ql.Driver.compile

Re: Hive 0.7.1 authorization woes

2011-08-24 Thread yongqiang he
I am using local metastore,  and can not reproduce the problem.

what message did you get when running local metastore?

On Wed, Aug 24, 2011 at 1:58 PM, Alex Holmes grep.a...@gmail.com wrote:
 Thanks for opening a ticket.

 Table-level grants aren't working for me either (HIVE-2405 suggests
 that the bug is only related to global grants).

 hive set hive.security.authorization.enabled=false;
 hive CREATE TABLE pokes (foo INT, bar STRING);
 OK
 Time taken: 1.245 seconds
 hive LOAD DATA LOCAL INPATH 'hive1.in' OVERWRITE INTO TABLE pokes;
 FAILED: Error in semantic analysis: Line 1:23 Invalid path 'hive1.in':
 No files matching path file:/app/hadoop/hive-0.7.1/conf/hive1.in
 hive LOAD DATA LOCAL INPATH '/app/hadoop/hive1.in' OVERWRITE INTO TABLE 
 pokes;
 Copying data from file:/app/hadoop/hive1.in
 Copying file: file:/app/hadoop/hive1.in
 Loading data to table default.pokes
 Moved to trash: hdfs://localhost:54310/user/hive/warehouse/pokes
 OK
 Time taken: 0.33 seconds
 hive select * from pokes;
 OK
 1       a
 2       b
 3       c
 Time taken: 0.095 seconds
 hive grant select on table pokes to user hduser;
 OK
 Time taken: 0.251 seconds
 hive set hive.security.authorization.enabled=true;
 hive select * from pokes;
 FAILED: Hive Internal Error:
 org.apache.hadoop.hive.ql.metadata.HiveException(org.apache.thrift.TApplicationException:
 get_privilege_set failed: unknown result)
 org.apache.hadoop.hive.ql.metadata.HiveException:
 org.apache.thrift.TApplicationException: get_privilege_set failed:
 unknown result
        at 
 org.apache.hadoop.hive.ql.metadata.Hive.get_privilege_set(Hive.java:1617)
        at 
 org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserPriv(DefaultHiveAuthorizationProvider.java:201)
        at 
 org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserAndDBPriv(DefaultHiveAuthorizationProvider.java:226)
 ...

 mysql select * from TBL_PRIVS;
 +--+-+--+-+--+++--++
 | TBL_GRANT_ID | CREATE_TIME | GRANT_OPTION | GRANTOR | GRANTOR_TYPE |
 PRINCIPAL_NAME | PRINCIPAL_TYPE | TBL_PRIV | TBL_ID |
 +--+-+--+-+--+++--++
 |            1 |  1314219701 |            0 | hduser  | USER         |
 hduser         | USER           | Select   |      1 |
 +--+-+--+-+--+++--++

 Also, I noticed in HIVE-2405 that you get a meaningful error message:

  Authorization failed:No privilege 'Create' found for outputs {
 database:default}. Use show grant to get more details.

 Whereas I just get an exception (as you can see above).  Were you also
 running with the remote metastore?  I get these meaningful messages
 with the local metastore (and authorization on), but with the remote
 metastore with authorization turned on, I always get exceptions.

 Many thanks,
 Alex

 On Wed, Aug 24, 2011 at 3:38 PM, yongqiang he heyongqiang...@gmail.com 
 wrote:
 This is a bug. Will open a jira to fix this. and will backport it to 0.7.1.
 https://issues.apache.org/jira/browse/HIVE-2405

 thanks for reporting this one!

 On Wed, Aug 24, 2011 at 6:25 AM, Alex Holmes grep.a...@gmail.com wrote:
 I created the mysql database (with the simple create database command)
 and the remote metastore seemed to create the mysql tables.  Here's
 some grant information and what I see in the database:

 [hduser@aholmes-desktop conf]$ hive
 hive grant all to user hduser;
 OK
 Time taken: 0.334 seconds
 hive show grant user hduser;
 OK

 principalName   hduser
 principalType   USER
 privilege       All
 grantTime       1314191500
 grantor hduser
 Time taken: 0.046 seconds
 hive CREATE TABLE pokes (foo INT, bar STRING);
 FAILED: Hive Internal Error:
 org.apache.hadoop.hive.ql.metadata.HiveException(org.apache.thrift.TApplicationException:
 get_privilege_set failed: unknown result)
 org.apache.hadoop.hive.ql.metadata.HiveException:
 org.apache.thrift.TApplicationException: get_privilege_set failed:
 unknown result
        at 
 org.apache.hadoop.hive.ql.metadata.Hive.get_privilege_set(Hive.java:1617)
        at 
 org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserPriv(DefaultHiveAuthorizationProvider.java:201)
        at 
 org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserAndDBPriv(DefaultHiveAuthorizationProvider.java:226)
        at 
 org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorize(DefaultHiveAuthorizationProvider.java:89)
 ...

 mysql use hive;
 Database changed
 mysql select * from GLOBAL_PRIVS;
 +---+-+--+-+--+++---+
 | USER_GRANT_ID

Re: Hive 0.7.1 authorization woes

2011-08-24 Thread yongqiang he
this is what i have tried with a remote metastore:

 set hive.security.authorization.enabled=false;
hive


 drop table src2;
OK
Time taken: 1.002 seconds
hive create table src2 (key int, value string);
OK
Time taken: 0.03 seconds
hive


 set hive.security.authorization.enabled=true;
hive grant select on table src2 to user heyongqiang;
OK
Time taken: 0.113 seconds
hive select * from src2;
OK
Time taken: 0.188 seconds
hive show grant user heyongqiang on table src2;
OK

databasedefault 
table   src2
principalName   heyongqiang 
principalType   USER
privilege   Select  
grantTime   Wed Aug 24 15:03:51 PDT 2011
grantor heyongqiang 

can u do a show grant?

(But with a remote metastore, I think hive should return an empty list
instead of null for list_privileges etc.)



On Wed, Aug 24, 2011 at 2:34 PM, Alex Holmes grep.a...@gmail.com wrote:
 Authorization works for me with the local metastore.  The remote
 metastore works with authorization turned off, but as soon as I turn
 it on and issue any commands I get these exceptions on the hive
 client.

 Could you also try the remote metastore please?  I'm pretty sure that
 authorization does not work with it at all.

 Thanks,
 Alex

 On Wed, Aug 24, 2011 at 5:20 PM, yongqiang he heyongqiang...@gmail.com 
 wrote:
 I am using local metastore,  and can not reproduce the problem.

 what message did you get when running local metastore?

 On Wed, Aug 24, 2011 at 1:58 PM, Alex Holmes grep.a...@gmail.com wrote:
 Thanks for opening a ticket.

 Table-level grants aren't working for me either (HIVE-2405 suggests
 that the bug is only related to global grants).

 hive set hive.security.authorization.enabled=false;
 hive CREATE TABLE pokes (foo INT, bar STRING);
 OK
 Time taken: 1.245 seconds
 hive LOAD DATA LOCAL INPATH 'hive1.in' OVERWRITE INTO TABLE pokes;
 FAILED: Error in semantic analysis: Line 1:23 Invalid path 'hive1.in':
 No files matching path file:/app/hadoop/hive-0.7.1/conf/hive1.in
 hive LOAD DATA LOCAL INPATH '/app/hadoop/hive1.in' OVERWRITE INTO TABLE 
 pokes;
 Copying data from file:/app/hadoop/hive1.in
 Copying file: file:/app/hadoop/hive1.in
 Loading data to table default.pokes
 Moved to trash: hdfs://localhost:54310/user/hive/warehouse/pokes
 OK
 Time taken: 0.33 seconds
 hive select * from pokes;
 OK
 1       a
 2       b
 3       c
 Time taken: 0.095 seconds
 hive grant select on table pokes to user hduser;
 OK
 Time taken: 0.251 seconds
 hive set hive.security.authorization.enabled=true;
 hive select * from pokes;
 FAILED: Hive Internal Error:
 org.apache.hadoop.hive.ql.metadata.HiveException(org.apache.thrift.TApplicationException:
 get_privilege_set failed: unknown result)
 org.apache.hadoop.hive.ql.metadata.HiveException:
 org.apache.thrift.TApplicationException: get_privilege_set failed:
 unknown result
        at 
 org.apache.hadoop.hive.ql.metadata.Hive.get_privilege_set(Hive.java:1617)
        at 
 org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserPriv(DefaultHiveAuthorizationProvider.java:201)
        at 
 org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserAndDBPriv(DefaultHiveAuthorizationProvider.java:226)
 ...

 mysql select * from TBL_PRIVS;
 +--+-+--+-+--+++--++
 | TBL_GRANT_ID | CREATE_TIME | GRANT_OPTION | GRANTOR | GRANTOR_TYPE |
 PRINCIPAL_NAME | PRINCIPAL_TYPE | TBL_PRIV | TBL_ID |
 +--+-+--+-+--+++--++
 |            1 |  1314219701 |            0 | hduser  | USER         |
 hduser         | USER           | Select   |      1 |
 +--+-+--+-+--+++--++

 Also, I noticed in HIVE-2405 that you get a meaningful error message:

  Authorization failed:No privilege 'Create' found for outputs {
 database:default}. Use show grant to get more details.

 Whereas I just get an exception (as you can see above).  Were you also
 running with the remote metastore?  I get these meaningful messages
 with the local metastore (and authorization on), but with the remote
 metastore with authorization turned on, I always get exceptions.

 Many thanks,
 Alex

 On Wed, Aug 24, 2011 at 3:38 PM, yongqiang he heyongqiang...@gmail.com 
 wrote:
 This is a bug. Will open a jira to fix this. and will backport it to 0.7.1.
 https://issues.apache.org/jira/browse/HIVE-2405

 thanks for reporting this one!

 On Wed, Aug 24, 2011 at 6:25 AM, Alex Holmes grep.a...@gmail.com wrote:
 I created the mysql database (with the simple create database command)
 and the remote metastore seemed to create the mysql tables.  Here's
 some grant information and what I see in the database:

 [hduser@aholmes-desktop conf]$ hive
 hive grant

Re: Local and remote metastores

2011-08-24 Thread yongqiang he
Are there features that are only provided by the remote datastore
(like authorization)?

No. All features should be provided by both. In the case of
authorization, the problem is that it did not get fully tested with a
remote metastore. So you saw some small bugs when running
authorization on a remote metastore.

Thanks
Yongqiang
On Wed, Aug 24, 2011 at 5:37 PM, Ashutosh Chauhan hashut...@apache.org wrote:
 Alex,
 Difference between the two is that in the case of a local metastore, the hive
 client communicates directly with mysql, while in the case of a remote
 metastore, the hive client first talks to a remote metastore which in turn
 talks to mysql. In terms of feature-set they are identical.
 Hope it helps,
 Ashutosh

 On Mon, Aug 22, 2011 at 09:03, Alex Holmes grep.a...@gmail.com wrote:

 Hi everyone,

 Does anyone know the differences between local and remote Hive
 metastores?  Are there features that are only provided by the remote
 datastore (like authorization)?  Is the use of a local metastore
 recommended in production?

 Many thanks,
 Alex




Re: Indexing .gz files

2011-08-03 Thread yongqiang he
Unfortunately it does not, because a .gz file cannot be split.

2011/8/3 Martin Konicek martin.koni...@gmail.com:
 Hi,

 can indexes work on gzipped files?

 The index gets build without errors using
 ALTER INDEX syslog_index ON syslog PARTITION(dt='2011-08-03') REBUILD;

 but when querying, no results are returned (and no errors reported). The
 query should be correct because with plaintext files it works.


 Best regards and thanks
 Martin Konicek



Re: hive mapjoin decision process

2011-07-19 Thread yongqiang he
in most cases, the mapjoin falls back to a normal join because of one of
these three reasons:
1) the input table size is very big, so there will be no attempt at a mapjoin
2) if one of the input tables is small (let's say less than 25MB, which
is configurable), hive will try a local hashmap dump. If it causes OOM
on the client side when doing the local hashmap dump, it will fall back
to a normal join. The reason here is mostly very good compression on
the input data.
3) the mapjoin actually got started, and fails. It will fall back to a
normal join. This is very unlikely to happen.
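
The knobs for cases 1) and 2) can be tuned per session; a minimal sketch (the
25MB value is just the default mentioned above):

set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;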

Thanks
Yongqiang
On Tue, Jul 19, 2011 at 11:16 AM, Koert Kuipers ko...@tresata.com wrote:
 note: this is somewhat a repost of something i posted on the CDH3 user
 group. apologies if that is not appropriate.

 i am exploring map-joins in hive. with hive.auto.convert.join=true hive
 tries to do a map-join and then falls back on a mapreduce-join if certain
 conditions are not met. this sounds great. but when i do a
 query and i notice it falls back on a mapreduce-join, how can i see which
 condition triggered the fallback (smalltablle.filesize or mapjoin.maxsize or
 something else perhaps memory related)?

 i tried reading the default log that a hive session produces, but it seems
 more like a massive json file than a log to me, so it is very hard for me to
 interpret that. i also turned on logging to console with debugging, looking
 for any clues there but without luck so far. is the info there and am i just
 overlooking it? any ideas?

 thanks! koert





Re: hive mapjoin decision process

2011-07-19 Thread yongqiang he
 i thought only one table needed to be small?
Yes.

 hive.mapjoin.maxsize also apply to big table?
No.

 i made sure hive.mapjoin.smalltable.filesize and hive.mapjoin.maxsize are 
 set large enough to accomodate the small table. yet hive does not attempt to 
 do a mapjoin.

There are physical limitations. If the local machine can not hold all
records in memory locally, the local hashmap has to fail. So check
your machine's memory or the memory allocated for hive.

Thanks
Yongqiang
On Tue, Jul 19, 2011 at 1:55 PM, Koert Kuipers ko...@tresata.com wrote:
 thanks!
 i only see hive create the hashmap dump and perform mapjoin if both tables
 are small. i thought only one table needed to be small?

 i try to merge a very large table with a small table. i made sure
 hive.mapjoin.smalltable.filesize and hive.mapjoin.maxsize are set large
 enough to accomodate the small table. yet hive does not attempt to do a
 mapjoin. does hive.mapjoin.maxsize also apply to big table? or do i need to
 look at other parameters as well?

 On Tue, Jul 19, 2011 at 4:15 PM, yongqiang he heyongqiang...@gmail.com
 wrote:

 in most cases, the mapjoin falls back to normal join because of one of
 these three reasons:
 1) the input table size is very big, so there will be no try on mapjoin
 2) if one of the input table is small (let's say less than 25MB which
 is configurable), hive will try a local hashmap dump. If it cause OOM
 on the client side when doing the local hashmap dump, it will go back
 normal join.The reason here is mostly due to very good compression on
 the input data.
 3) the mapjoin actually got started, and fails. it will fall back
 normal join. This will most unlikely happen

 Thanks
 Yongqiang
 On Tue, Jul 19, 2011 at 11:16 AM, Koert Kuipers ko...@tresata.com wrote:
  note: this is somewhat a repost of something i posted on the CDH3 user
  group. apologies if that is not appropriate.
 
  i am exploring map-joins in hive. with hive.auto.convert.join=true hive
  tries to do a map-join and then falls back on a mapreduce-join if
  certain
  conditions are not met. this sounds great. but when i do a
  query and i notice it falls back on a mapreduce-join, how can i see
  which
  condition triggered the fallback (smalltablle.filesize or
  mapjoin.maxsize or
  something else perhaps memory related)?
 
  i tried reading the default log that a hive session produces, but it
  seems
  more like a massive json file than a log to me, so it is very hard for
  me to
  interpret that. i also turned on logging to console with debugging,
  looking
  for any clues there but without luck so far. is the info there and am i
  just
  overlooking it? any ideas?
 
  thanks! koert
 
 
 




Re: Hive map join - process a little larger tables with moderate number of rows

2011-03-31 Thread yongqiang he
You possibly got a OOM error when processing the small tables. OOM is
a fatal error that can not be controlled by the hive configs. So can
you try to increase your memory setting?
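
One way to do that (a sketch; how the heap gets set depends on how you launch
Hive) is to raise the client JVM heap before starting the CLI and, if needed,
relax the local-task memory threshold:

export HADOOP_HEAPSIZE=2048
hive -f HiveJob.txt -hiveconf hive.mapjoin.localtask.max.memory.usage=0.9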

thanks
yongqiang
On Thu, Mar 31, 2011 at 7:25 AM, Bejoy Ks bejoy...@yahoo.com wrote:
 Hi Experts
     I'm currently working with hive 0.7, mostly with JOINs. In all
 permissible cases I'm using map joins by setting the
 hive.auto.convert.join=true parameter. Usage of local map joins has made a
 considerable performance improvement in hive queries. I have used this local
 map join only with the default set of hive configuration parameters; now I'd
 like to dig deeper into this. I want to try out this local map join on
 slightly bigger tables with more rows. Given below is the failure log of
 one of my local map tasks, which in turn executed its backup common join task

 2011-03-31 09:56:54 Starting to launch local task to process map
 join;  maximum memory = 932118528
 2011-03-31 09:56:57 Processing rows:    20  Hashtable size:
 19  Memory usage:   115481024   rate:   0.124
 2011-03-31 09:57:00 Processing rows:    30  Hashtable size:
 29  Memory usage:   169344064   rate:   0.182
 2011-03-31 09:57:03 Processing rows:    40  Hashtable size:
 39  Memory usage:   232132792   rate:   0.249
 2011-03-31 09:57:06 Processing rows:    50  Hashtable size:
 49  Memory usage:   282338544   rate:   0.303
 2011-03-31 09:57:10 Processing rows:    60  Hashtable size:
 59  Memory usage:   336738640   rate:   0.361
 2011-03-31 09:57:14 Processing rows:    70  Hashtable size:
 69  Memory usage:   391117888   rate:   0.42
 2011-03-31 09:57:22 Processing rows:    80  Hashtable size:
 79  Memory usage:   453906496   rate:   0.487
 2011-03-31 09:57:27 Processing rows:    90  Hashtable size:
 89  Memory usage:   508306552   rate:   0.545
 2011-03-31 09:57:34 Processing rows:    100 Hashtable size:
 99  Memory usage:   562706496   rate:   0.604
 FAILED: Execution Error, return code 2 from
 org.apache.hadoop.hive.ql.exec.MapredLocalTask
 ATTEMPT: Execute BackupTask: org.apache.hadoop.hive.ql.exec.MapRedTask
 Launching Job 4 out of 6


 Here I'd like to get this local map task running; for the same I tried
 setting the following hive parameters:
 hive -f  HiveJob.txt -hiveconf hive.mapjoin.maxsize=100 -hiveconf
 hive.mapjoin.smalltable.filesize=4000 -hiveconf
 hive.auto.convert.join=true
 But setting the two config parameters doesn't make my local map task
 proceed beyond this stage.  I didn't try
 overriding hive.mapjoin.localtask.max.memory.usage=0.90 because my
 task log shows that the memory usage rate is just 0.604, so I assume setting
 it to a larger value won't be a solution in my case. Could someone
 please guide me on the actual parameters and values I should
 set to get things rolling.

 Thank You

 Regards
 Bejoy.K.S




Re: skew join optimization

2011-03-20 Thread yongqiang he
skew join does not work together with map join.

Map join does not require any reducer.
Please double check that the hive you use has the auto map join feature.
If there is auto convert join in your hive, just SET
hive.auto.convert.join=true; and that should do the work.
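
If the auto conversion is not available or does not kick in, an explicit hint
forces the map join; a sketch based on the query quoted below, where S is the
small table:

SELECT /*+ MAPJOIN(S) */ ...
FROM T
LEFT OUTER JOIN S
ON (T.timestamp = S.timestamp AND T.id = S.id);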

thanks
yongqiang
On Sun, Mar 20, 2011 at 8:22 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
 On Sun, Mar 20, 2011 at 11:20 AM, Ted Yu yuzhih...@gmail.com wrote:
 How about link to http://imageshack.us/ or TinyPic ?

 Thanks

 On Sun, Mar 20, 2011 at 7:56 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

 On Sun, Mar 20, 2011 at 10:30 AM, Ted Yu yuzhih...@gmail.com wrote:
  Can someone re-attach the missing figures for that wiki ?
 
  Thanks
 
  On Sun, Mar 20, 2011 at 7:15 AM, bharath vissapragada
  bharathvissapragada1...@gmail.com wrote:
 
  Hi Igor,
 
  See http://wiki.apache.org/hadoop/Hive/JoinOptimization and see the
  jira 1642 which automatically converts a normal join into map-join
  (Otherwise you can specify the mapjoin hints in the query itself.).
  Because your 'S' table is very small , it can be replicated across all
  the mappers and the reduce phase can be avoided. This can greatly
  reduce the runtime .. (See the results section in the page for
  details.).
 
  Hope this helps.
 
  Thanks
 
 
  On Sun, Mar 20, 2011 at 6:37 PM, Jov zhao6...@gmail.com wrote:
   2011/3/20 Igor Tatarinov i...@decide.com:
   I have the following join that takes 4.5 hours (with 12 nodes)
   mostly
   because of a single reduce task that gets the bulk of the work:
   SELECT ...
   FROM T
   LEFT OUTER JOIN S
   ON T.timestamp = S.timestamp and T.id = S.id
   This is a 1:0/1 join so the size of the output is exactly the same
   as
   the
   size of T (500M records). S is actually very small (5K).
   I've tried:
   - switching the order of the join conditions
   - using a different hash function setting (jenkins instead of
   murmur)
   - using SET set hive.auto.convert.join = true;
  
   are you sure your query convert to mapjoin? if not,try use explicit
   mapjoin hint.
  
  
   - using SET hive.optimize.skewjoin = true;
   but nothing helped :(
   Anything else I can try?
   Thanks!
  
 
 
 
  --
  Regards,
  Bharath .V
  w:http://research.iiit.ac.in/~bharath.v
 
 

 The wiki does not allow images, confluence does but we have not moved
 their yet.



 Not a bad option, but lame that the foundation with the most popular
 web server in the world has to host our images elsewhere :(



Re: Building Custom RCFiles

2011-03-17 Thread yongqiang he
You need to customize Hive's ColumnarSerde (maybe the functions in
LazySerde): its serialize or deserialize function (depending on whether
you want to read or write). And the main thing is that you need to use
your own type defs (not LazyInt/LazyLong).

If your type is int or long (not double/float), casting it to string
only wastes some CPU, but can save you more space.

Thanks
Yongqiang
On Thu, Mar 17, 2011 at 3:48 PM, Severance, Steve ssevera...@ebay.com wrote:
 Hi,



 I am working on building a MR job that generates RCFiles that will become
 partitions of a hive table. I have most of it working however only strings
 (Text) are being deserialized inside of Hive. The hive table is specified to
 use a columnarserde which I thought should allow the writable types stored
 in the RCFile to be deserialized properly.



 Currently all numeric types (IntWritable and LongWritable) come back as null.



 Has anyone else seen anything like this or have any ideas? I would rather
 not convert all my data to strings to use RCFile.



 Thanks.



 Steve


Re: Building Custom RCFiles

2011-03-17 Thread yongqiang he
A side note: in hive, we make all columns saved as Text internally
(even if the column's type is int or double, etc.). And in some
experiments, string is more friendly to compression. But it needs CPU
to decode to its original type.

Thanks
Yongqiang
On Thu, Mar 17, 2011 at 4:04 PM, yongqiang he heyongqiang...@gmail.com wrote:
 You need to customize Hive's ColumnarSerde (maybe functions in
 LazySerde)'s serde and deserialize function (depends you want to read
 or write.). And the main thing is that you need to use your own type
 def (not LazyInt/LazyLong).

 If your type is int or long (not double/float), casting it to string
 only wastes some CPU, but can save you more spaces.

 Thanks
 Yongqiang
 On Thu, Mar 17, 2011 at 3:48 PM, Severance, Steve ssevera...@ebay.com wrote:
 Hi,



 I am working on building a MR job that generates RCFiles that will become
 partitions of a hive table. I have most of it working however only strings
 (Text) are being deserialized inside of Hive. The hive table is specified to
 use a columnarserde which I thought should allow the writable types stored
 in the RCFile to be deserialized properly.



 Currently all numeric types (IntWritable and LongWritable) come back a null.



 Has anyone else seen anything like this or have any ideas? I would rather
 not convert all my data to strings to use RCFile.



 Thanks.



 Steve



Re: [VOTE] Sponsoring Howl as an Apache Incubator project

2011-02-03 Thread yongqiang he
I am interested in some numbers around the lines of code changed (or
files changed) which are in Howl but not in Hive.
Can anyone give some information here?

Thanks
Yongqiang
On Thu, Feb 3, 2011 at 1:15 PM, Jeff Hammerbacher ham...@cloudera.com wrote:
 Hey,


 If we do go ahead with pulling the metastore out of Hive, it might make
 most sense for Howl to become its own TLP rather than a subproject.

 Yes, I did not read the proposal closely enough. I think an end state as a
 TLP makes more sense for Howl than as a Pig subproject. I'd really love to
 see Howl replace the metastore in Hive and it would be more natural to do so
 as a TLP than as a Pig subproject--especially since the current Howl
 repository is literally a fork of Hive.


 In the incubator proposal, we have mentioned these issues, but we've
 attempted to avoid prejudicing any decision.  Instead, we'd like to assess
 the pros and cons (including effort required and impact expected) for both
 approaches as part of the incubation process.

 Glad the issues are being considered.
 Later,
 Jeff


Re: tons of bugs and problem found

2011-02-01 Thread yongqiang he
I just noticed that your input file is actually a text file. There is a
SkipBadRecords feature in Hadoop for text files, but I think hive does
not support that now. However, I think you can hack around it by doing
the setting yourself.

Just look at the SkipBadRecords code to find the conf names and
values, and set them manually before running your query.
Good luck.
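
For example, something along these lines (a guess at the relevant property
names; please verify them against the SkipBadRecords class in your Hadoop
version before relying on them):

set mapred.skip.attempts.to.start.skipping=2;
set mapred.skip.map.max.skip.records=1;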

On Tue, Feb 1, 2011 at 12:54 PM, Laurent Laborde kerdez...@gmail.com wrote:
 thank you for your replies.
 i reinstalled hadoop and hive, switched from Cloudera CDH3 to CDH2,
 restarted everything from scratch
 i've set io.skip.checksum.errors=true

 and i still have the same error :(

 what's wrong ? :(
 the dataset come from a postgresql database and is consistant.


 On Tue, Feb 1, 2011 at 6:57 AM, Aaron Kimball akimbal...@gmail.com wrote:
 In MapReduce, filenames that begin with an underscore are hidden files and
 are not enumerated by FileInputFormat (Hive, I believe, processes tables
 with TextInputFormat and SequenceFileInputFormat, both descendants of this
 class).
 Using _foo as a hidden/ignored filename is conventional in the Hadoop
 world. This is different than the UNIX convention of using .foo, but
 that's software engineering for you. ;)
 This is unlikely to change soon; MapReduce emits files with names like
 _SUCCESS into directories to indicate successful job completion.
 Directories such as _tmp and _logs also appear in datasets, and are
 therefore ignored as input by MapReduce-based tools, but those metadata
 names are established in other projects.
 If you run 'hadoop fs -mv /path/to/_top.sql /path/to/top.sql', that should
 make things work for you.
 - Aaron

 On Mon, Jan 31, 2011 at 10:21 AM, yongqiang he heyongqiang...@gmail.com
 wrote:

 You can first try to set io.skip.checksum.errors to true, which will
 ignore bad checksum.

 In facebook, we also had a requirement to ignore corrupt/bad data - but
  it has not been committed yet. Yongqiang, what is the jira number ?
 there seems no jira for this issue.

 thanks
 yongqiang
 2011/1/31 Namit Jain nj...@fb.com:
 
 
  On 1/31/11 7:46 AM, Laurent Laborde kerdez...@gmail.com wrote:
 
 On Fri, Jan 28, 2011 at 8:05 AM, Laurent Laborde kerdez...@gmail.com
 wrote:
  On Fri, Jan 28, 2011 at 1:12 AM, Namit Jain nj...@fb.com wrote:
  Hi Laurent,
 
  1. Are you saying that _top.sql did not exist in the home directory.
  Or that, _top.sql existed, but hive was not able to read it after
 loading
 
  It exist, it's loaded, and i can see it in the hive's warehouse
 directory.
  it's just impossible to query it.
 
  2. I don¹t think reserved words are documented somewhere. Can you
  file
 a
  jira for this ?
 
  Ok; will do that today.
 
  3. The bad row is printed in the task log.
 
  1. 2011-01-27 11:11:07,046 INFO org.apache.hadoop.fs.FSInputChecker:
 Found
  checksum error: b[1024,
 

  1536]=7374796c653d22666f6e742d73697a653a20313270743b223e3c623e266e627370
 3b2
 

  66e6273703b266e6273703b202a202838302920416d69656e733a3c2f623e3c2f7370616
 e3e
 

  3c2f7370616e3e5c6e20203c2f703e5c6e20203c703e5c6e202020203c7370616e207374
 796
 

  c653d22666f66742d66616d696c793a2068656c7665746963613b223e3c7370616e20737
 479
 

  6c653d22666f6e742d73697a653a20313270743b223e3c623e266e6273703b266e627370
 3b2
 

  66e6273703b266e6273703b266e6273703b266e6273703b266e6273703b266e6273703b2
 66e
 

  6273703b206f203132682c2050697175652d6e6971756520646576616e74206c65205265
 637
 

  46f7261742e3c2f623e3c2f7370616e3e3c2f7370616e3e5c6e20203c2f703e5c6e20203
 c70
 

  3e5c6e202020203c7370616e207374796c653d22666f6e742d66616d696c793a2068656c
 766
 

  5746963613b223e3c7370616e207374796c653d22666f6e742d73697a653a20313270743
 b22
 

  3e3c623e266e6273703b266e6273703b266e6273703b266e6273703b266e6273703b266e
 627
 

  3703b266e6273703b266e6273703b266e6273703b206f2031346833302c204d6169736f6
 e20
 

  6465206c612063756c747572652e3c2f623e3c2f7370616e3e3c2f7370616e3e5c6e2020
 3c2
  f703e5c6e20203c703e5c6e202020203c7370616e207374796c653d
 
  Is this the actual data ?
 
  2. org.apache.hadoop.fs.ChecksumException: Checksum error:
  /blk_2466764552666222475:of:/user/hive/warehouse/article/article.copy
 at
  23446528
 
  23446528 is the line number ?
 
  thank you
 
 optional question (the previous ones are still open) :
 is there a way to tell hive to ignore invalid data ? (if the problem
 is invalid data)
 
 
  Currently, not.
  In facebook, we also had a requirement to ignore corrupt/bad data - but
  it
  has not
  been committed yet. Yongqiang, what is the jira number ?
 
 
  Thanks,
  -namit
 
 
 
 --
 Laurent ker2x Laborde
 Sysadmin  DBA at http://www.over-blog.com/
 
 





 --
 Laurent ker2x Laborde
 Sysadmin  DBA at http://www.over-blog.com/



Re: tons of bugs and problem found

2011-01-31 Thread yongqiang he
You can first try to set io.skip.checksum.errors to true, which will
ignore bad checksum.

 In facebook, we also had a requirement to ignore corrupt/bad data - but it
 has not been committed yet. Yongqiang, what is the jira number ?
There seems to be no jira for this issue.

thanks
yongqiang
2011/1/31 Namit Jain nj...@fb.com:


 On 1/31/11 7:46 AM, Laurent Laborde kerdez...@gmail.com wrote:

On Fri, Jan 28, 2011 at 8:05 AM, Laurent Laborde kerdez...@gmail.com
wrote:
 On Fri, Jan 28, 2011 at 1:12 AM, Namit Jain nj...@fb.com wrote:
 Hi Laurent,

 1. Are you saying that _top.sql did not exist in the home directory.
 Or that, _top.sql existed, but hive was not able to read it after
loading

 It exist, it's loaded, and i can see it in the hive's warehouse
directory.
 it's just impossible to query it.

 2. I don¹t think reserved words are documented somewhere. Can you file
a
 jira for this ?

 Ok; will do that today.

 3. The bad row is printed in the task log.

 1. 2011-01-27 11:11:07,046 INFO org.apache.hadoop.fs.FSInputChecker:
Found
 checksum error: b[1024,

1536]=7374796c653d22666f6e742d73697a653a20313270743b223e3c623e266e627370
3b2

66e6273703b266e6273703b202a202838302920416d69656e733a3c2f623e3c2f7370616
e3e

3c2f7370616e3e5c6e20203c2f703e5c6e20203c703e5c6e202020203c7370616e207374
796

c653d22666f66742d66616d696c793a2068656c7665746963613b223e3c7370616e20737
479

6c653d22666f6e742d73697a653a20313270743b223e3c623e266e6273703b266e627370
3b2

66e6273703b266e6273703b266e6273703b266e6273703b266e6273703b266e6273703b2
66e

6273703b206f203132682c2050697175652d6e6971756520646576616e74206c65205265
637

46f7261742e3c2f623e3c2f7370616e3e3c2f7370616e3e5c6e20203c2f703e5c6e20203
c70

3e5c6e202020203c7370616e207374796c653d22666f6e742d66616d696c793a2068656c
766

5746963613b223e3c7370616e207374796c653d22666f6e742d73697a653a20313270743
b22

3e3c623e266e6273703b266e6273703b266e6273703b266e6273703b266e6273703b266e
627

3703b266e6273703b266e6273703b266e6273703b206f2031346833302c204d6169736f6
e20

6465206c612063756c747572652e3c2f623e3c2f7370616e3e3c2f7370616e3e5c6e2020
3c2
 f703e5c6e20203c703e5c6e202020203c7370616e207374796c653d

 Is this the actual data ?

 2. org.apache.hadoop.fs.ChecksumException: Checksum error:
 /blk_2466764552666222475:of:/user/hive/warehouse/article/article.copy
at
 23446528

 23446528 is the line number ?

 thank you

optional question (the previous ones are still open) :
is there a way to tell hive to ignore invalid data ? (if the problem
is invalid data)


 Currently, not.
 In facebook, we also had a requirement to ignore corrupt/bad data - but it
 has not
 been committed yet. Yongqiang, what is the jira number ?


 Thanks,
 -namit



--
Laurent ker2x Laborde
Sysadmin  DBA at http://www.over-blog.com/




Re: Question of Hive Partitioning

2010-12-27 Thread yongqiang he
Can you post your query? Is there an 'and' or an 'or' connecting the partitions'
predicate with the others?
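
For example, pruning both partitioned tables of a join generally needs both
partition predicates AND-ed in the WHERE clause against the table aliases,
roughly like this sketch (hypothetical tables):

SELECT a.x, b.y
FROM a JOIN b ON (a.id = b.id)
WHERE a.ds = '2010-12-27' AND b.ds = '2010-12-27';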

Thanks
Yongqiang
On Mon, Dec 27, 2010 at 11:09 PM, Narendra uncle.n...@gmail.com wrote:
 Hi,

 Hive wiki says Hive currently does input pruning only if the partition
 predicates are specified in the WHERE clause closest to the table_reference
 in the FROM clause... what does this mean?

 I have a query in which I am joining 4 tables two of which need to be pruned
 by partition. The partition predicates are specified in the WHERE clause.
 However, I see that only one of the table gets pruned by the partition
 whereas the other one goes for a full table scan.

 Have a great holiday season!
 Narendra


Re: Query output formatting

2010-12-06 Thread yongqiang he
You can use the collect_set() UDAF. (And use lateral view join and explode
if you want to operate on the set data.)
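
If the goal is only the pivoted layout shown below, a plain conditional
aggregation (an alternative to a custom UDAF, assuming cdn takes only the
values 8 and 9) would also work:

select hour,
       avg(case when cdn = 8 then bitrate end) as abr_cdn8,
       avg(case when cdn = 9 then bitrate end) as abr_cdn9
from fact_table group by hour;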

On Mon, Dec 6, 2010 at 1:01 PM, Steven Wong sw...@netflix.com wrote:
 I have this query to calculate some averages:



 select hour, cdn, avg(bitrate) from fact_table group by hour, cdn

 1  8  a

 1  9  b

 2  8  c

 3  8  d

 3  9  e



 But I want the output to be in the following format so that it can be
 eyeballed/graphed more easily:



 1  a  b

 2  c  NULL

 3  d e



 (The 2nd and 3rd columns are averages for cdn 8 and 9, respectively.)



 Is there an easy way to do this – in Hive, Unix, etc.? Suggestions (or,
 better yet, solutions) are welcome.



 I imagine a UDAF could do this (see below), but AFAIK it is not built into
 Hive.



 select hour, some_udaf(abr, cdn, array(8, 9))

 from (select hour, cdn, avg(bitrate) as abr from fact_table group by hour,
 cdn) t

 group by hour



 Thanks.

 Steven




Re: Hive produces very small files despite hive.merge...=true settings

2010-11-19 Thread yongqiang he
These are the parameters that control the behavior. (Try to set them
to different values if it does not work in your environment.)

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.min.split.size.per.node=10;
set mapred.min.split.size.per.rack=10;
set mapred.max.split.size=10;

set hive.merge.size.per.task=10;
set hive.merge.smallfiles.avgsize=10;
set hive.merge.size.smallfiles.avgsize=10;
set hive.exec.dynamic.partition.mode=nonstrict;


The output size of the second job is also controlled by the split
size, as shown in the first 4 lines.


On Fri, Nov 19, 2010 at 10:22 AM, Leo Alekseyev dnqu...@gmail.com wrote:
 I'm using Hadoop 0.20.2.  Merge jobs (with static partitions) have
 worked for me in the past.  Again, what's strange here is with the
 latest Hive build the merge stage appears to run, but it doesn't
 actually merge -- it's a quick map-only job that, near as I can tell,
 doesn't do anything.

 On Fri, Nov 19, 2010 at 6:14 AM, Dave Brondsema dbronds...@geek.net wrote:
 What version of Hadoop are you on?

 On Thu, Nov 18, 2010 at 10:48 PM, Leo Alekseyev dnqu...@gmail.com wrote:

 I thought I was running Hive with those changes merged in, but to make
 sure, I built the latest trunk version.  The behavior changed somewhat
 (as in, it runs 2 stages instead of 1), but it still generates the
 same number of files (# of files generated is equal to the number of
 the original mappers, so I have no idea what the second stage is
 actually doing).

 See below for query / explain query.  Stage 1 runs always; Stage 3
 runs if hive.merge.mapfiles=true is set, but it still generates lots
 of small files.

 The query is kind of large, but in essence it's simply
 insert overwrite table foo partition(bar) select [columns] from
 [table] tablesample(bucket 1 out of 1 on rand()) where
 [conditions].


 explain insert overwrite table hbase_prefilter3_us_sample partition
 (ds) select
 server_host,client_ip,time_stamp,concat(server_host,':',regexp_extract(request_url,'/[^/]+/[^/]+/([^/]+)$',1)),referrer,parse_url(referrer,'HOST'),user_agent,cookie,geoip_int(client_ip,
 'COUNTRY_CODE',  './GeoIP.dat'),'',ds from alogs_master
 TABLESAMPLE(BUCKET 1 OUT OF 1 ON rand()) am_s where
 am_s.ds='2010-11-05' and am_s.request_url rlike
 '^/img[0-9]+/[0-9]+/[^.]+\.(png|jpg|gif|mp4|swf)$' and
 geoip_int(am_s.client_ip, 'COUNTRY_CODE',  './GeoIP.dat')='US';
 OK
 ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF alogs_master (TOK_TABLESAMPLE 1
 1 (TOK_FUNCTION rand)) am_s)) (TOK_INSERT (TOK_DESTINATION
 (TOK_TAB hbase_prefilter3_us_sample (TOK_PARTSPEC (TOK_PARTVAL ds
 (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL server_host)) (TOK_SELEXPR
 (TOK_TABLE_OR_COL client_ip)) (TOK_SELEXPR (TOK_TABLE_OR_COL
 time_stamp)) (TOK_SELEXPR (TOK_FUNCTION concat (TOK_TABLE_OR_COL
 server_host) ':' (TOK_FUNCTION regexp_extract (TOK_TABLE_OR_COL
 request_url) '/[^/]+/[^/]+/([^/]+)$' 1))) (TOK_SELEXPR
 (TOK_TABLE_OR_COL referrer)) (TOK_SELEXPR (TOK_FUNCTION parse_url
 (TOK_TABLE_OR_COL referrer) 'HOST')) (TOK_SELEXPR (TOK_TABLE_OR_COL
 user_agent)) (TOK_SELEXPR (TOK_TABLE_OR_COL cookie)) (TOK_SELEXPR
 (TOK_FUNCTION geoip_int (TOK_TABLE_OR_COL client_ip) 'COUNTRY_CODE'
 './GeoIP.dat')) (TOK_SELEXPR '') (TOK_SELEXPR (TOK_TABLE_OR_COL ds)))
 (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL am_s) ds) '2010-11-05')
 (rlike (. (TOK_TABLE_OR_COL am_s) request_url)
 '^/img[0-9]+/[0-9]+/[^.]+\.(png|jpg|gif|mp4|swf)$')) (= (TOK_FUNCTION
 geoip_int (. (TOK_TABLE_OR_COL am_s) client_ip) 'COUNTRY_CODE'
 './GeoIP.dat') 'US')

 STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-5 depends on stages: Stage-1 , consists of Stage-4, Stage-3
  Stage-4
  Stage-0 depends on stages: Stage-4, Stage-3
  Stage-2 depends on stages: Stage-0
  Stage-3

 STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias - Map Operator Tree:
        am_s
          TableScan
            alias: am_s
            Filter Operator
              predicate:
                   expr: (((hash(rand()) & 2147483647) % 1) = 0)
                  type: boolean
              Filter Operator
                predicate:
                    expr: ((request_url rlike
 '^/img[0-9]+/[0-9]+/[^.]+.(png|jpg|gif|mp4|swf)$') and
 (GenericUDFGeoIP ( client_ip, 'COUNTRY_CODE', './GeoIP.dat' ) = 'US'))
                    type: boolean
                Filter Operator
                  predicate:
                      expr: (((ds = '2010-11-05') and (request_url
 rlike '^/img[0-9]+/[0-9]+/[^.]+.(png|jpg|gif|mp4|swf)$')) and
 (GenericUDFGeoIP ( client_ip, 'COUNTRY_CODE', './GeoIP.dat' ) = 'US'))
                      type: boolean
                  Select Operator
                    expressions:
                          expr: server_host
                          type: string
                          expr: client_ip
                          type: int
                          expr: time_stamp
                     

Re: Hive produces very small files despite hive.merge...=true settings

2010-11-19 Thread yongqiang he
I can not see how this could be the cause.

The problem should be that your files can not be merged. I mean the file
size is bigger than the split size.

On Friday, November 19, 2010, Leo Alekseyev dnqu...@gmail.com wrote:
 Folks, thanks for your help.  I've narrowed the problem down to
 compression.  When I set hive.exec.compress.output=false, merges
 proceed as expected.  When compression is on, the merge job doesn't
 seem to actually merge, it just spits out the input.

 On Fri, Nov 19, 2010 at 10:51 AM, yongqiang he heyongqiang...@gmail.com 
 wrote:
 These are the parameters that control the behavior. (Try to set them
 to different values if it does not work in your environment.)

 set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
 set mapred.min.split.size.per.node=10;
 set mapred.min.split.size.per.rack=10;
 set mapred.max.split.size=10;

 set hive.merge.size.per.task=10;
 set hive.merge.smallfiles.avgsize=10;
 set hive.merge.size.smallfiles.avgsize=10;
 set hive.exec.dynamic.partition.mode=nonstrict;


 The output size of the second job is also controlled by the split
 size, as shown in the first 4 lines.


 On Fri, Nov 19, 2010 at 10:22 AM, Leo Alekseyev dnqu...@gmail.com wrote:
 I'm using Hadoop 0.20.2.  Merge jobs (with static partitions) have
 worked for me in the past.  Again, what's strange here is with the
 latest Hive build the merge stage appears to run, but it doesn't
 actually merge -- it's a quick map-only job that, near as I can tell,
 doesn't do anything.

 On Fri, Nov 19, 2010 at 6:14 AM, Dave Brondsema dbronds...@geek.net wrote:
 What version of Hadoop are you on?

 On Thu, Nov 18, 2010 at 10:48 PM, Leo Alekseyev dnqu...@gmail.com wrote:

 I thought I was running Hive with those changes merged in, but to make
 sure, I built the latest trunk version.  The behavior changed somewhat
 (as in, it runs 2 stages instead of 1), but it still generates the
 same number of files (# of files generated is equal to the number of
 the original mappers, so I have no idea what the second stage is
 actually doing).

 See below for query / explain query.  Stage 1 runs always; Stage 3
 runs if hive.merge.mapfiles=true is set, but it still generates lots
 of small files.

 The query is kind of large, but in essence it's simply
 insert overwrite table foo partition(bar) select [columns] from
 [table] tablesample(bucket 1 out of 1 on rand()) where
 [conditions].


 explain insert overwrite table hbase_prefilter3_us_sample partition
 (ds) select
 server_host,client_ip,time_stamp,concat(server_host,':',regexp_extract(request_url,'/[^/]+/[^/]+/([^/]+)$',1)),referrer,parse_url(referrer,'HOST'),user_agent,cookie,geoip_int(client_ip,
 'COUNTRY_CODE',  './GeoIP.dat'),'',ds from alogs_master
 TABLESAMPLE(BUCKET 1 OUT OF 1 ON rand()) am_s where
 am_s.ds='2010-11-05' and am_s.request_url rlike
 '^/img[0-9]+/[0-9]+/[^.]+\.(png|jpg|gif|mp4|swf)$' and
 geoip_int(am_s.client_ip, 'COUNTRY_CODE',  './GeoIP.dat')='US';
 OK
 ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF alogs_master (TOK_TABLESAMPLE 1
 1 (TOK_FUNCTION rand)) am_s)) (TOK_INSERT (TOK_DESTINATION
 (TOK_TAB hbase_prefilter3_us_sample (TOK_PARTSPEC (TOK_PARTVAL ds
 (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL server_host)) (TOK_SELEXPR
 (TOK_TABLE_OR_COL client_ip)) (TOK_SELEXPR (TOK_TABLE_OR_COL
 time_stamp)) (TOK_SELEXPR (TOK_FUNCTION concat (TOK_TABLE_OR_COL
 server_host) ':' (TOK_FUNCTION regexp_extract (TOK_TABLE_OR_COL
 request_url) '/[^/]+/[^/]+/([^/]+)$' 1))) (TOK_SELEXPR
 (TOK_TABLE_OR_COL referrer)) (TOK_SELEXPR (TOK_FUNCTION parse_url
 (TOK_TABLE_OR_COL referrer) 'HOST')) (TOK_SELEXPR (TOK_TABLE_OR_COL
 user_agent)) (TOK_SELEXPR (TOK_TABLE_OR_COL cookie)) (TOK_SELEXPR
 (TOK_FUNCTION geoip_int (TOK_TABLE_OR_COL client_ip) 'COUNTRY_CODE'