Re: Obvious and not so obvious query optimizations in Hive
1. Having my external table data gzipped and reading it in the table v/s no compression at all.
You may want to gzip your data since it is offline. But if space is not a concern and you want to optimize CPU, use Snappy. With Snappy, there is no reason to go with no compression.

3. Creating intermediate external tables v/s non external tables v/s creating views?
First go with normal tables. External tables are hard to manage. Views are there for complex things which are hard to do with a managed table.

4. Storing the external table as Textfile v/s Sequence file. I know sequence file compresses the data, but in what format? I read about RC files and how efficient they are, how to use them?
Use RCFile if you query your data with Hive: 'create table xxx(xxx) stored as rcfile'.

5. How does the number of reducers get set for a Hive query (the way group by and order by set the number of reducers to 1)? If I am not changing it explicitly, does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.
Can you say more about your concern about query and cluster size?

On Wed, Jun 27, 2012 at 11:46 AM, Bejoy KS bejoy...@yahoo.com wrote:
Hi Richin
I'm not an AWS guy, but still let me try answering a few questions in general (not with respect to AWS EMR).

1. Having my external table data gzipped and reading it in the table v/s no compression at all.
Bejoy: When the data volume is large, compression saves disk space and hence is recommended. Gzip is not splittable, meaning one file can't be distributed across map tasks. Go with LZO or Snappy.

2. Having the external table data on S3 v/s having it on HDFS?

3. Creating intermediate external tables v/s non external tables v/s creating views?
Bejoy: External tables are not much different from managed tables. With managed tables, when you drop the table using HQL the underlying data in HDFS is also deleted, but in the case of external tables only the table definition is dropped and the data in HDFS is preserved. Views are not evaluated at creation time; when a view is used, MR jobs are triggered and the required data is extracted from the source tables. So if you are planning to reuse a view n times, it is better to create a table and use that; otherwise the view query will be evaluated n times.

4. Storing the external table as Textfile v/s Sequence file. I know sequence file compresses the data, but in what format? I read about RC files and how efficient they are, how to use them?
Bejoy: Sequence files are splittable on their own, so they are a good choice when used with Snappy. In a sequence file, compression happens either at record level or block level, which is configurable at the time of compressing. If you are using Gzip, text files offer a better compression ratio, as the whole data is compressed in one go compared with sequence files where it happens per record/block. But then you compromise on splittability. RC files are a good choice when your queries involve only a few columns rather than all columns in a row.

5. How does the number of reducers get set for a Hive query (the way group by and order by set the number of reducers to 1)? If I am not changing it explicitly, does it pick it from the underlying Hadoop cluster? I am trying to understand the bottleneck between query and cluster size.
Bejoy: Some queries in Hive are forced to have just a single reducer, like Order By. For other queries Hive determines the number of reducers. However, you can always specify the number of reducers on a per-query basis based on the data it processes.

6. Any other optimizations/best practices?
Bejoy: There are lots of other optimizations in Hive which can be applied based on the query. There are various join optimizations, group by optimizations etc. suited for specific needs.

Regards
Bejoy KS
Sent from handheld, please excuse typos.

From: richin.j...@nokia.com
Date: Wed, 27 Jun 2012 15:47:54 +
To: user@hive.apache.org
ReplyTo: user@hive.apache.org
Subject: Obvious and not so obvious query optimizations in Hive

Hey Hivers,
I am trying to understand what are some of the obvious and not so obvious optimizations I could do for a Hive query on an AWS EMR cluster. I know the answer to some of these questions, but I want to know what you guys think and by what factor it affects performance over the other approach.
1. Having my external table data gzipped and reading it in the table v/s no compression at all.
2. Having the external table data on S3 v/s having it on HDFS?
3. Creating intermediate external tables v/s non external tables v/s creating views?
4. Storing the external table as Textfile v/s Sequence file. I know sequence file compresses the data, but in what format? I read about RC files and how efficient they are, how to use them?
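[Editor's note] To make the suggestions above concrete, here is a minimal sketch of the settings and DDL being discussed. The table name, columns, and numeric values are illustrative only, and whether SnappyCodec is available depends on the Hadoop build in use:

  -- Snappy-compressed output: cheap on CPU and, unlike gzip, usable with splittable containers
  set hive.exec.compress.output=true;
  set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

  -- columnar storage for tables that are mostly queried on a few columns
  create table page_views_rc (view_time bigint, user_id bigint, url string) stored as rcfile;

  -- per-query control over the number of reducers
  set mapred.reduce.tasks=32;                            -- fix it explicitly, or
  set hive.exec.reducers.bytes.per.reducer=256000000;    -- let Hive derive it from the input size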
Re: hive - snappy and sequence file vs RC file
Can you share the reason for choosing Snappy as your compression codec? Like @omalley mentioned, RCFile will compress the data more densely, and will avoid reading data not required by your Hive query. And I think Facebook uses it to store tens of PB (if not hundreds of PB) of data.
Thanks
Yongqiang

On Tue, Jun 26, 2012 at 9:49 AM, Owen O'Malley omal...@apache.org wrote:
SequenceFile compared to RCFile:
* More widely deployed.
* Available from MapReduce and Pig.
* Doesn't compress as small (in RCFile all of each column's values are put together).
* Uncompresses and deserializes all of the columns, even if you are only reading a few.
In either case, for long-term storage, you should seriously consider the default codec, since that will provide much tighter compression (at the cost of CPU to compress it).
-- Owen
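[Editor's note] As an illustration of the default-codec suggestion for long-term storage, a hedged sketch; the property names are standard Hive/Hadoop settings of that era, and the table definitions are made up:

  set hive.exec.compress.output=true;
  set mapred.output.compression.type=BLOCK;   -- block-level compression for SequenceFile output
  set mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;   -- the zlib-based default: tighter than Snappy, more CPU

  create table events_seq (ts bigint, payload string) stored as sequencefile;
  create table events_rc  (ts bigint, payload string) stored as rcfile;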
Re: ColumnarSerDe and LazyBinaryColumnarSerDe
I guess LazyBinaryColumnarSerDe does not save space, but is CPU efficient. Your tests align with our internal tests from a long time ago.

On Tue, Mar 6, 2012 at 8:58 AM, Yin Huai huaiyin@gmail.com wrote:
Hi,
Is LazyBinaryColumnarSerDe more space efficient than ColumnarSerDe in general? Let me make my question more specific. I generated two tables from the table lineitem of TPC-H using ColumnarSerDe and LazyBinaryColumnarSerDe as follows...
CREATE TABLE lineitem_rcfile_lazybinary ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe' STORED AS RCFile AS SELECT * from lineitem;
CREATE TABLE lineitem_rcfile_lazy ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' STORED AS RCFile AS SELECT * from lineitem;
Since serialization of LazyBinaryColumnarSerDe is binary-based and that of ColumnarSerDe is text-based, I expected table lineitem_rcfile_lazybinary to be smaller than lineitem_rcfile_lazy. However, no matter whether compression is enabled, lineitem_rcfile_lazybinary is a little bit larger than lineitem_rcfile_lazy. Did I use LazyBinaryColumnarSerDe in the wrong way? By the way, the row group size of RCFile is 32MB.
Thanks,
Yin
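[Editor's note] One way to check the size comparison in your own environment; a sketch that assumes the tables live under the default warehouse path, so adjust the directories to your setup:

  -- from the Hive CLI, report the on-disk size of each table's files
  dfs -du /user/hive/warehouse/lineitem_rcfile_lazybinary;
  dfs -du /user/hive/warehouse/lineitem_rcfile_lazy;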
Re: Hive 0.7.1 authorization woes
what is your unix name on that machine? can you do a whoami?

On Thu, Aug 25, 2011 at 5:15 PM, Alex Holmes grep.a...@gmail.com wrote:
Here's the hive-site.xml file (I use the same file for both the client and remote metastore). We're using mysql as the metastore DB.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property><name>hive.security.authorization.enabled</name><value>true</value></property>
  <property><name>hive.metastore.local</name><value>false</value></property>
  <property><name>hive.metastore.uris</name><value>thrift://localhost:9083</value></property>
  <property><name>javax.jdo.option.ConnectionURL</name><value>jdbc:mysql://localhost/hive?createDatabaseIfNotExist=true</value></property>
  <property><name>javax.jdo.option.ConnectionDriverName</name><value>com.mysql.jdbc.Driver</value></property>
  <property><name>javax.jdo.option.ConnectionUserName</name><value>hive</value></property>
  <property><name>javax.jdo.option.ConnectionPassword</name><value>secret</value></property>
</configuration>

On Wed, Aug 24, 2011 at 6:06 PM, yongqiang he heyongqiang...@gmail.com wrote:
this is what i have tried with a remote metastore:
set hive.security.authorization.enabled=false;
hive> drop table src2;
OK
Time taken: 1.002 seconds
hive> create table src2 (key int, value string);
OK
Time taken: 0.03 seconds
hive> set hive.security.authorization.enabled=true;
hive> grant select on table src2 to user heyongqiang;
OK
Time taken: 0.113 seconds
hive> select * from src2;
OK
Time taken: 0.188 seconds
hive> show grant user heyongqiang on table src2;
OK
database default table src2 principalName heyongqiang principalType USER privilege Select grantTime Wed Aug 24 15:03:51 PDT 2011 grantor heyongqiang
can you do a show grant? (But with a remote metastore, I think Hive should return an empty list instead of null for list_privileges etc.)

On Wed, Aug 24, 2011 at 2:34 PM, Alex Holmes grep.a...@gmail.com wrote:
Authorization works for me with the local metastore. The remote metastore works with authorization turned off, but as soon as I turn it on and issue any commands I get these exceptions on the hive client. Could you also try the remote metastore please? I'm pretty sure that authorization does not work with it at all.
Thanks,
Alex

On Wed, Aug 24, 2011 at 5:20 PM, yongqiang he heyongqiang...@gmail.com wrote:
I am using a local metastore, and can not reproduce the problem. What message did you get when running a local metastore?

On Wed, Aug 24, 2011 at 1:58 PM, Alex Holmes grep.a...@gmail.com wrote:
Thanks for opening a ticket. Table-level grants aren't working for me either (HIVE-2405 suggests that the bug is only related to global grants).
hive set hive.security.authorization.enabled=false; hive CREATE TABLE pokes (foo INT, bar STRING); OK Time taken: 1.245 seconds hive LOAD DATA LOCAL INPATH 'hive1.in' OVERWRITE INTO TABLE pokes; FAILED: Error in semantic analysis: Line 1:23 Invalid path 'hive1.in': No files matching path file:/app/hadoop/hive-0.7.1/conf/hive1.in hive LOAD DATA LOCAL INPATH '/app/hadoop/hive1.in' OVERWRITE INTO TABLE pokes; Copying data from file:/app/hadoop/hive1.in Copying file: file:/app/hadoop/hive1.in Loading data to table default.pokes Moved to trash: hdfs://localhost:54310/user/hive/warehouse/pokes OK Time taken: 0.33 seconds hive select * from pokes; OK 1 a 2 b 3 c Time taken: 0.095 seconds hive grant select on table pokes to user hduser; OK Time taken: 0.251 seconds hive set hive.security.authorization.enabled=true; hive select * from pokes; FAILED: Hive Internal Error: org.apache.hadoop.hive.ql.metadata.HiveException(org.apache.thrift.TApplicationException: get_privilege_set failed: unknown result) org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: get_privilege_set failed: unknown result at org.apache.hadoop.hive.ql.metadata.Hive.get_privilege_set(Hive.java:1617) at org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserPriv(DefaultHiveAuthorizationProvider.java:201) at org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserAndDBPriv(DefaultHiveAuthorizationProvider.java:226) ... mysql select * from TBL_PRIVS; +--+-+--+-+--+++--++ | TBL_GRANT_ID | CREATE_TIME | GRANT_OPTION | GRANTOR | GRANTOR_TYPE | PRINCIPAL_NAME | PRINCIPAL_TYPE | TBL_PRIV | TBL_ID | +--+-+--+-+--+++--++ | 1 | 1314219701 | 0 | hduser | USER | hduser | USER | Select | 1
Re: Hive 0.7.1 authorization woes
This is a bug. Will open a jira to fix this. and will backport it to 0.7.1. https://issues.apache.org/jira/browse/HIVE-2405 thanks for reporting this one! On Wed, Aug 24, 2011 at 6:25 AM, Alex Holmes grep.a...@gmail.com wrote: I created the mysql database (with the simple create database command) and the remote metastore seemed to creat the mysql tables. Here's some grant information and what I see in the database: [hduser@aholmes-desktop conf]$ hive hive grant all to user hduser; OK Time taken: 0.334 seconds hive show grant user hduser; OK principalName hduser principalType USER privilege All grantTime 1314191500 grantor hduser Time taken: 0.046 seconds hive CREATE TABLE pokes (foo INT, bar STRING); FAILED: Hive Internal Error: org.apache.hadoop.hive.ql.metadata.HiveException(org.apache.thrift.TApplicationException: get_privilege_set failed: unknown result) org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: get_privilege_set failed: unknown result at org.apache.hadoop.hive.ql.metadata.Hive.get_privilege_set(Hive.java:1617) at org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserPriv(DefaultHiveAuthorizationProvider.java:201) at org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserAndDBPriv(DefaultHiveAuthorizationProvider.java:226) at org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorize(DefaultHiveAuthorizationProvider.java:89) ... mysql use hive; Database changed mysql select * from GLOBAL_PRIVS; +---+-+--+-+--+++---+ | USER_GRANT_ID | CREATE_TIME | GRANT_OPTION | GRANTOR | GRANTOR_TYPE | PRINCIPAL_NAME | PRINCIPAL_TYPE | USER_PRIV | +---+-+--+-+--+++---+ | 1 | 1314191500 | 0 | hduser | USER | hduser | USER | All | +---+-+--+-+--+++---+ 1 row in set (0.00 sec) Thanks for your help, Alex On Tue, Aug 23, 2011 at 1:27 PM, yongqiang he heyongqiang...@gmail.com wrote: Have you created the metastore mysql tables for authorization? Can u do a show grant? thanks yongqiang On Tue, Aug 16, 2011 at 2:55 PM, Alex Holmes grep.a...@gmail.com wrote: Hi all, I've been struggling with getting Hive authorization to work for a few hours, and I really hope someone can help me. I installed Hive 0.7.1 on top of Hadoop 0.20.203. I'm using mysql for the metastore, and configured Hive to enable authorization: property namehive.security.authorization.enabled/name valuetrue/value descriptionenable or disable the hive client authorization/description /property I kept all the other Hive security configs with their default settings. I'm running in pseudo-distributed mode on a single node. HDFS, the Hive metastore and the Hive CLI are all running as the same user (the HDFS superuser). Here are the sequence of steps that are causing me issues. Without authorization everything works perfectly (creating, loading, selecting). I've also tried creating and loading the table without authorization, granting the select privilege at various levels (global, table, database), turning on auth and performing the select, resulting in the same exception. Any help with this would be greatly appreciated! 
Thanks, Alex -- [hduser@aholmes-desktop ~]$ hive Hive history file=/tmp/hduser/hive_job_log_hduser_201108162158_1976573160.txt hive set hive.security.authorization.enabled=false; hive grant all to user hduser; OK Time taken: 0.233 seconds hive set hive.security.authorization.enabled=true; hive CREATE TABLE pokes3 (foo INT, bar STRING); FAILED: Hive Internal Error: org.apache.hadoop.hive.ql.metadata.HiveException(org.apache.thrift.TApplicationException: get_privilege_set failed: unknown result) org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: get_privilege_set failed: unknown result at org.apache.hadoop.hive.ql.metadata.Hive.get_privilege_set(Hive.java:1617) at org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserPriv(DefaultHiveAuthorizationProvider.java:201) at org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserAndDBPriv(DefaultHiveAuthorizationProvider.java:226) at org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorize(DefaultHiveAuthorizationProvider.java:89) at org.apache.hadoop.hive.ql.Driver.doAuthorization(Driver.java:433) at org.apache.hadoop.hive.ql.Driver.compile
Re: Hive 0.7.1 authorization woes
I am using local metastore, and can not reproduce the problem. what message did you get when running local metastore? On Wed, Aug 24, 2011 at 1:58 PM, Alex Holmes grep.a...@gmail.com wrote: Thanks for opening a ticket. Table-level grants aren't working for me either (HIVE-2405 suggests that the bug is only related to global grants). hive set hive.security.authorization.enabled=false; hive CREATE TABLE pokes (foo INT, bar STRING); OK Time taken: 1.245 seconds hive LOAD DATA LOCAL INPATH 'hive1.in' OVERWRITE INTO TABLE pokes; FAILED: Error in semantic analysis: Line 1:23 Invalid path 'hive1.in': No files matching path file:/app/hadoop/hive-0.7.1/conf/hive1.in hive LOAD DATA LOCAL INPATH '/app/hadoop/hive1.in' OVERWRITE INTO TABLE pokes; Copying data from file:/app/hadoop/hive1.in Copying file: file:/app/hadoop/hive1.in Loading data to table default.pokes Moved to trash: hdfs://localhost:54310/user/hive/warehouse/pokes OK Time taken: 0.33 seconds hive select * from pokes; OK 1 a 2 b 3 c Time taken: 0.095 seconds hive grant select on table pokes to user hduser; OK Time taken: 0.251 seconds hive set hive.security.authorization.enabled=true; hive select * from pokes; FAILED: Hive Internal Error: org.apache.hadoop.hive.ql.metadata.HiveException(org.apache.thrift.TApplicationException: get_privilege_set failed: unknown result) org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: get_privilege_set failed: unknown result at org.apache.hadoop.hive.ql.metadata.Hive.get_privilege_set(Hive.java:1617) at org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserPriv(DefaultHiveAuthorizationProvider.java:201) at org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserAndDBPriv(DefaultHiveAuthorizationProvider.java:226) ... mysql select * from TBL_PRIVS; +--+-+--+-+--+++--++ | TBL_GRANT_ID | CREATE_TIME | GRANT_OPTION | GRANTOR | GRANTOR_TYPE | PRINCIPAL_NAME | PRINCIPAL_TYPE | TBL_PRIV | TBL_ID | +--+-+--+-+--+++--++ | 1 | 1314219701 | 0 | hduser | USER | hduser | USER | Select | 1 | +--+-+--+-+--+++--++ Also, I noticed in HIVE-2405 that you get a meaningful error message: Authorization failed:No privilege 'Create' found for outputs { database:default}. Use show grant to get more details. Whereas I just get an exception (as you can see above). Were you also running with the remote metastore? I get these meaningful messages with the local metastore (and authorization on), but with the remote metastore with authorization turned on, I always get exceptions. Many thanks, Alex On Wed, Aug 24, 2011 at 3:38 PM, yongqiang he heyongqiang...@gmail.com wrote: This is a bug. Will open a jira to fix this. and will backport it to 0.7.1. https://issues.apache.org/jira/browse/HIVE-2405 thanks for reporting this one! On Wed, Aug 24, 2011 at 6:25 AM, Alex Holmes grep.a...@gmail.com wrote: I created the mysql database (with the simple create database command) and the remote metastore seemed to creat the mysql tables. 
Here's some grant information and what I see in the database: [hduser@aholmes-desktop conf]$ hive hive grant all to user hduser; OK Time taken: 0.334 seconds hive show grant user hduser; OK principalName hduser principalType USER privilege All grantTime 1314191500 grantor hduser Time taken: 0.046 seconds hive CREATE TABLE pokes (foo INT, bar STRING); FAILED: Hive Internal Error: org.apache.hadoop.hive.ql.metadata.HiveException(org.apache.thrift.TApplicationException: get_privilege_set failed: unknown result) org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: get_privilege_set failed: unknown result at org.apache.hadoop.hive.ql.metadata.Hive.get_privilege_set(Hive.java:1617) at org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserPriv(DefaultHiveAuthorizationProvider.java:201) at org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserAndDBPriv(DefaultHiveAuthorizationProvider.java:226) at org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorize(DefaultHiveAuthorizationProvider.java:89) ... mysql use hive; Database changed mysql select * from GLOBAL_PRIVS; +---+-+--+-+--+++---+ | USER_GRANT_ID
Re: Hive 0.7.1 authorization woes
this is what i have tried with a remote metastore: set hive.security.authorization.enabled=false; hive drop table src2; OK Time taken: 1.002 seconds hive create table src2 (key int, value string); OK Time taken: 0.03 seconds hive set hive.security.authorization.enabled=true; hive grant select on table src2 to user heyongqiang; OK Time taken: 0.113 seconds hive select * from src2; OK Time taken: 0.188 seconds hive show grant user heyongqiang on table src2; OK databasedefault table src2 principalName heyongqiang principalType USER privilege Select grantTime Wed Aug 24 15:03:51 PDT 2011 grantor heyongqiang can u do a show grant? (But with remote metastore, i think hive should not return empty list instead of null for list_privileges etc.) On Wed, Aug 24, 2011 at 2:34 PM, Alex Holmes grep.a...@gmail.com wrote: Authorization works for me with the local metastore. The remote metastore works with authorization turned off, but as soon as I turn it on and issue any commands I get these exceptions on the hive client. Could you also try the remote metastore please? I'm pretty sure that authorization does not work with it at all. Thanks, Alex On Wed, Aug 24, 2011 at 5:20 PM, yongqiang he heyongqiang...@gmail.com wrote: I am using local metastore, and can not reproduce the problem. what message did you get when running local metastore? On Wed, Aug 24, 2011 at 1:58 PM, Alex Holmes grep.a...@gmail.com wrote: Thanks for opening a ticket. Table-level grants aren't working for me either (HIVE-2405 suggests that the bug is only related to global grants). hive set hive.security.authorization.enabled=false; hive CREATE TABLE pokes (foo INT, bar STRING); OK Time taken: 1.245 seconds hive LOAD DATA LOCAL INPATH 'hive1.in' OVERWRITE INTO TABLE pokes; FAILED: Error in semantic analysis: Line 1:23 Invalid path 'hive1.in': No files matching path file:/app/hadoop/hive-0.7.1/conf/hive1.in hive LOAD DATA LOCAL INPATH '/app/hadoop/hive1.in' OVERWRITE INTO TABLE pokes; Copying data from file:/app/hadoop/hive1.in Copying file: file:/app/hadoop/hive1.in Loading data to table default.pokes Moved to trash: hdfs://localhost:54310/user/hive/warehouse/pokes OK Time taken: 0.33 seconds hive select * from pokes; OK 1 a 2 b 3 c Time taken: 0.095 seconds hive grant select on table pokes to user hduser; OK Time taken: 0.251 seconds hive set hive.security.authorization.enabled=true; hive select * from pokes; FAILED: Hive Internal Error: org.apache.hadoop.hive.ql.metadata.HiveException(org.apache.thrift.TApplicationException: get_privilege_set failed: unknown result) org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: get_privilege_set failed: unknown result at org.apache.hadoop.hive.ql.metadata.Hive.get_privilege_set(Hive.java:1617) at org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserPriv(DefaultHiveAuthorizationProvider.java:201) at org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider.authorizeUserAndDBPriv(DefaultHiveAuthorizationProvider.java:226) ... mysql select * from TBL_PRIVS; +--+-+--+-+--+++--++ | TBL_GRANT_ID | CREATE_TIME | GRANT_OPTION | GRANTOR | GRANTOR_TYPE | PRINCIPAL_NAME | PRINCIPAL_TYPE | TBL_PRIV | TBL_ID | +--+-+--+-+--+++--++ | 1 | 1314219701 | 0 | hduser | USER | hduser | USER | Select | 1 | +--+-+--+-+--+++--++ Also, I noticed in HIVE-2405 that you get a meaningful error message: Authorization failed:No privilege 'Create' found for outputs { database:default}. Use show grant to get more details. 
Whereas I just get an exception (as you can see above). Were you also running with the remote metastore? I get these meaningful messages with the local metastore (and authorization on), but with the remote metastore with authorization turned on, I always get exceptions. Many thanks, Alex On Wed, Aug 24, 2011 at 3:38 PM, yongqiang he heyongqiang...@gmail.com wrote: This is a bug. Will open a jira to fix this. and will backport it to 0.7.1. https://issues.apache.org/jira/browse/HIVE-2405 thanks for reporting this one! On Wed, Aug 24, 2011 at 6:25 AM, Alex Holmes grep.a...@gmail.com wrote: I created the mysql database (with the simple create database command) and the remote metastore seemed to creat the mysql tables. Here's some grant information and what I see in the database: [hduser@aholmes-desktop conf]$ hive hive grant
Re: Local and remote metastores
Are there features that are only provided by the remote metastore (like authorization)?
No. All features should be provided by both. In the case of authorization, the problem is that it did not get fully tested with a remote metastore, so you saw some small bugs when running authorization against a remote metastore.
Thanks
Yongqiang

On Wed, Aug 24, 2011 at 5:37 PM, Ashutosh Chauhan hashut...@apache.org wrote:
Alex,
The difference between the two is that with a local metastore the Hive client communicates directly with MySQL, while with a remote metastore the Hive client first talks to a remote metastore service which in turn talks to MySQL. In terms of feature set they are identical.
Hope it helps,
Ashutosh

On Mon, Aug 22, 2011 at 09:03, Alex Holmes grep.a...@gmail.com wrote:
Hi everyone,
Does anyone know the differences between local and remote Hive metastores? Are there features that are only provided by the remote metastore (like authorization)? Is the use of a local metastore recommended in production?
Many thanks,
Alex
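[Editor's note] For reference, a hedged sketch of the hive-site.xml properties that select one mode or the other; the host, port, and connection string are the ones shown earlier in this thread, written as name = value for brevity:

  # remote metastore: the client talks Thrift to a metastore service, which owns the JDBC connection
  hive.metastore.local = false
  hive.metastore.uris  = thrift://localhost:9083

  # local metastore: the client opens the JDBC connection to MySQL itself
  hive.metastore.local = true
  javax.jdo.option.ConnectionURL = jdbc:mysql://localhost/hive?createDatabaseIfNotExist=true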
Re: Indexing .gz files
Unfortunately it does not, because a .gz file cannot be split.

2011/8/3 Martin Konicek martin.koni...@gmail.com:
Hi,
can indexes work on gzipped files? The index gets built without errors using
ALTER INDEX syslog_index ON syslog PARTITION(dt='2011-08-03') REBUILD;
but when querying, no results are returned (and no errors reported). The query should be correct because it works with plaintext files.
Best regards and thanks
Martin Konicek
Re: hive mapjoin decision process
In most cases, the mapjoin falls back to a normal join for one of these three reasons:
1) The input table size is very big, so mapjoin is never attempted.
2) If one of the input tables is small (say less than 25MB, which is configurable), Hive will try a local hashmap dump. If that causes an OOM on the client side, it goes back to a normal join. The cause here is usually very good compression on the input data.
3) The mapjoin actually got started and failed; it will fall back to a normal join. This is very unlikely to happen.
Thanks
Yongqiang

On Tue, Jul 19, 2011 at 11:16 AM, Koert Kuipers ko...@tresata.com wrote:
note: this is somewhat a repost of something i posted on the CDH3 user group. apologies if that is not appropriate.
i am exploring map-joins in hive. with hive.auto.convert.join=true hive tries to do a map-join and then falls back on a mapreduce-join if certain conditions are not met. this sounds great. but when i do a query and i notice it falls back on a mapreduce-join, how can i see which condition triggered the fallback (smalltable.filesize or mapjoin.maxsize or something else, perhaps memory related)? i tried reading the default log that a hive session produces, but it seems more like a massive json file than a log to me, so it is very hard for me to interpret. i also turned on logging to console with debugging, looking for any clues there, but without luck so far. is the info there and am i just overlooking it? any ideas?
thanks! koert
Re: hive mapjoin decision process
i thought only one table needed to be small? Yes. hive.mapjoin.maxsize also apply to big table? No. i made sure hive.mapjoin.smalltable.filesize and hive.mapjoin.maxsize are set large enough to accomodate the small table. yet hive does not attempt to do a mapjoin. There are physical limitations. If the local machine can not hold all records in memory locally, the local hashmap has to fail. So check your machine's memory or the memory allocated for hive. Thanks Yongqiang On Tue, Jul 19, 2011 at 1:55 PM, Koert Kuipers ko...@tresata.com wrote: thanks! i only see hive create the hashmap dump and perform mapjoin if both tables are small. i thought only one table needed to be small? i try to merge a very large table with a small table. i made sure hive.mapjoin.smalltable.filesize and hive.mapjoin.maxsize are set large enough to accomodate the small table. yet hive does not attempt to do a mapjoin. does hive.mapjoin.maxsize also apply to big table? or do i need to look at other parameters as well? On Tue, Jul 19, 2011 at 4:15 PM, yongqiang he heyongqiang...@gmail.com wrote: in most cases, the mapjoin falls back to normal join because of one of these three reasons: 1) the input table size is very big, so there will be no try on mapjoin 2) if one of the input table is small (let's say less than 25MB which is configurable), hive will try a local hashmap dump. If it cause OOM on the client side when doing the local hashmap dump, it will go back normal join.The reason here is mostly due to very good compression on the input data. 3) the mapjoin actually got started, and fails. it will fall back normal join. This will most unlikely happen Thanks Yongqiang On Tue, Jul 19, 2011 at 11:16 AM, Koert Kuipers ko...@tresata.com wrote: note: this is somewhat a repost of something i posted on the CDH3 user group. apologies if that is not appropriate. i am exploring map-joins in hive. with hive.auto.convert.join=true hive tries to do a map-join and then falls back on a mapreduce-join if certain conditions are not met. this sounds great. but when i do a query and i notice it falls back on a mapreduce-join, how can i see which condition triggered the fallback (smalltablle.filesize or mapjoin.maxsize or something else perhaps memory related)? i tried reading the default log that a hive session produces, but it seems more like a massive json file than a log to me, so it is very hard for me to interpret that. i also turned on logging to console with debugging, looking for any clues there but without luck so far. is the info there and am i just overlooking it? any ideas? thanks! koert
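[Editor's note] A hedged sketch of the knobs discussed in this thread, plus the explicit hint that forces a map join when the automatic conversion declines; the table names and threshold values are illustrative only:

  set hive.auto.convert.join=true;
  set hive.mapjoin.smalltable.filesize=50000000;   -- small-table threshold in bytes
  set hive.mapjoin.maxsize=1000000;                -- cap on rows loaded into the in-memory hash table

  -- force a map join by naming the small table in the hint
  select /*+ MAPJOIN(s) */ b.*, s.label
  from big_table b join small_table s on (b.key = s.key);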
Re: Hive map join - process a little larger tables with moderate number of rows
You possibly got an OOM error when processing the small tables. OOM is a fatal error that cannot be controlled by the Hive configs. So can you try to increase your memory setting?
thanks
yongqiang

On Thu, Mar 31, 2011 at 7:25 AM, Bejoy Ks bejoy...@yahoo.com wrote:
Hi Experts
I'm currently working with hive 0.7, mostly with JOINS. In all permissible cases I'm using map joins by setting the hive.auto.convert.join=true parameter. Usage of local map joins has made a considerable performance improvement in hive queries. I have used this local map join only with the default set of hive configuration parameters; now I'd like to dig deeper into this and try out the local map join on slightly bigger tables with more rows. Given below is the failure log of one of my local map tasks, which in turn executes its backup common join task:
2011-03-31 09:56:54 Starting to launch local task to process map join; maximum memory = 932118528
2011-03-31 09:56:57 Processing rows: 20 Hashtable size: 19 Memory usage: 115481024 rate: 0.124
2011-03-31 09:57:00 Processing rows: 30 Hashtable size: 29 Memory usage: 169344064 rate: 0.182
2011-03-31 09:57:03 Processing rows: 40 Hashtable size: 39 Memory usage: 232132792 rate: 0.249
2011-03-31 09:57:06 Processing rows: 50 Hashtable size: 49 Memory usage: 282338544 rate: 0.303
2011-03-31 09:57:10 Processing rows: 60 Hashtable size: 59 Memory usage: 336738640 rate: 0.361
2011-03-31 09:57:14 Processing rows: 70 Hashtable size: 69 Memory usage: 391117888 rate: 0.42
2011-03-31 09:57:22 Processing rows: 80 Hashtable size: 79 Memory usage: 453906496 rate: 0.487
2011-03-31 09:57:27 Processing rows: 90 Hashtable size: 89 Memory usage: 508306552 rate: 0.545
2011-03-31 09:57:34 Processing rows: 100 Hashtable size: 99 Memory usage: 562706496 rate: 0.604
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapredLocalTask
ATTEMPT: Execute BackupTask: org.apache.hadoop.hive.ql.exec.MapRedTask
Launching Job 4 out of 6
Here I'd like to make this local map task succeed. For that I tried setting the following hive parameters:
hive -f HiveJob.txt -hiveconf hive.mapjoin.maxsize=100 -hiveconf hive.mapjoin.smalltable.filesize=4000 -hiveconf hive.auto.convert.join=true
But setting the two config parameters doesn't make my local map task proceed beyond this stage. I didn't try overriding hive.mapjoin.localtask.max.memory.usage=0.90, because my task log shows that the memory usage rate is just 0.604, so I assume setting it to a larger value won't provide a solution in my case. Could someone please guide me on the actual parameters and values I should set to get things rolling.
Thank You
Regards
Bejoy.K.S
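[Editor's note] Based on the suggestion to raise the memory setting, a hedged sketch of what one might try. The heap value is illustrative, and it is an assumption here that HADOOP_HEAPSIZE governs the Hive CLI JVM, which is where the local hashtable dump runs:

  export HADOOP_HEAPSIZE=2048   # client-side heap in MB, chosen only for illustration
  hive -f HiveJob.txt \
       -hiveconf hive.auto.convert.join=true \
       -hiveconf hive.mapjoin.localtask.max.memory.usage=0.90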
Re: skew join optimization
Skew join does not work together with map join. Map join does not require any reducer. Please double check that the Hive you use has the auto map join feature. If auto convert join is in your Hive, setting hive.auto.convert.join=true alone should do the work.
thanks
yongqiang

On Sun, Mar 20, 2011 at 8:22 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
On Sun, Mar 20, 2011 at 11:20 AM, Ted Yu yuzhih...@gmail.com wrote:
How about a link to http://imageshack.us/ or TinyPic? Thanks
On Sun, Mar 20, 2011 at 7:56 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
On Sun, Mar 20, 2011 at 10:30 AM, Ted Yu yuzhih...@gmail.com wrote:
Can someone re-attach the missing figures for that wiki? Thanks
On Sun, Mar 20, 2011 at 7:15 AM, bharath vissapragada bharathvissapragada1...@gmail.com wrote:
Hi Igor,
See http://wiki.apache.org/hadoop/Hive/JoinOptimization and see the jira 1642, which automatically converts a normal join into a map-join (otherwise you can specify the mapjoin hints in the query itself). Because your 'S' table is very small, it can be replicated across all the mappers and the reduce phase can be avoided. This can greatly reduce the runtime (see the results section in the page for details). Hope this helps. Thanks
On Sun, Mar 20, 2011 at 6:37 PM, Jov zhao6...@gmail.com wrote:
2011/3/20 Igor Tatarinov i...@decide.com:
I have the following join that takes 4.5 hours (with 12 nodes), mostly because of a single reduce task that gets the bulk of the work:
SELECT ... FROM T LEFT OUTER JOIN S ON T.timestamp = S.timestamp and T.id = S.id
This is a 1:0/1 join, so the size of the output is exactly the same as the size of T (500M records). S is actually very small (5K).
I've tried:
- switching the order of the join conditions
- using a different hash function setting (jenkins instead of murmur)
- using SET hive.auto.convert.join = true;
are you sure your query converts to a mapjoin? if not, try using an explicit mapjoin hint.
- using SET hive.optimize.skewjoin = true;
but nothing helped :( Anything else I can try? Thanks!
--
Regards, Bharath .V
w:http://research.iiit.ac.in/~bharath.v
The wiki does not allow images, confluence does but we have not moved there yet. Not a bad option, but lame that the foundation with the most popular web server in the world has to host our images elsewhere :(
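[Editor's note] To make the advice concrete, a hedged sketch of the two ways to get a map-side join for the small S table; the hint form works even on builds without auto conversion, and the query shape comes from Igor's message with the select list left elided:

  -- automatic conversion, if the Hive build supports it
  set hive.auto.convert.join=true;
  SELECT ... FROM T LEFT OUTER JOIN S ON T.timestamp = S.timestamp AND T.id = S.id;

  -- explicit hint naming the small table
  SELECT /*+ MAPJOIN(S) */ ... FROM T LEFT OUTER JOIN S ON T.timestamp = S.timestamp AND T.id = S.id;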
Re: Building Custom RCFiles
You need to customize the serialize and deserialize functions of Hive's ColumnarSerde (maybe the functions in LazySerde), depending on whether you want to read or write. And the main thing is that you need to use your own type definitions (not LazyInt/LazyLong). If your type is int or long (not double/float), casting it to string only wastes some CPU, but can save you more space.
Thanks
Yongqiang

On Thu, Mar 17, 2011 at 3:48 PM, Severance, Steve ssevera...@ebay.com wrote:
Hi,
I am working on building an MR job that generates RCFiles that will become partitions of a Hive table. I have most of it working; however, only strings (Text) are being deserialized inside of Hive. The Hive table is specified to use a ColumnarSerDe, which I thought should allow the writable types stored in the RCFile to be deserialized properly. Currently all numeric types (IntWritable and LongWritable) come back as null. Has anyone else seen anything like this or have any ideas? I would rather not convert all my data to strings to use RCFile.
Thanks.
Steve
Re: Building Custom RCFiles
A side note, in hive, we make all columns saved as Text internally (even the column's type is int or double etc). And with some experiments, string is more friendly to compression. But it needs CPU to decode to its original type. Thanks Yongqiang On Thu, Mar 17, 2011 at 4:04 PM, yongqiang he heyongqiang...@gmail.com wrote: You need to customize Hive's ColumnarSerde (maybe functions in LazySerde)'s serde and deserialize function (depends you want to read or write.). And the main thing is that you need to use your own type def (not LazyInt/LazyLong). If your type is int or long (not double/float), casting it to string only wastes some CPU, but can save you more spaces. Thanks Yongqiang On Thu, Mar 17, 2011 at 3:48 PM, Severance, Steve ssevera...@ebay.com wrote: Hi, I am working on building a MR job that generates RCFiles that will become partitions of a hive table. I have most of it working however only strings (Text) are being deserialized inside of Hive. The hive table is specified to use a columnarserde which I thought should allow the writable types stored in the RCFile to be deserialized properly. Currently all numeric types (IntWritable and LongWritable) come back a null. Has anyone else seen anything like this or have any ideas? I would rather not convert all my data to strings to use RCFile. Thanks. Steve
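[Editor's note] For context, a hedged sketch of the table definition that MR-generated RCFile partitions are usually read through. The replies above suggest either customizing the serde or, at the cost of some CPU, having the MR job write the column values as text so that the stock ColumnarSerDe can decode them; the table, columns, and partition path below are illustrative only:

  create table clicks_rc (user_id bigint, url string, ts bigint)
  partitioned by (dt string)
  row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
  stored as rcfile;

  alter table clicks_rc add partition (dt='2011-03-17')
  location '/path/written/by/the/mr/job';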
Re: [VOTE] Sponsoring Howl as an Apache Incubator project
I am interested in some numbers around the lines of code changed (or files changed) which are in Howl but not in Hive. Can anyone give some information here?
Thanks
Yongqiang

On Thu, Feb 3, 2011 at 1:15 PM, Jeff Hammerbacher ham...@cloudera.com wrote:
Hey,
If we do go ahead with pulling the metastore out of Hive, it might make most sense for Howl to become its own TLP rather than a subproject.
Yes, I did not read the proposal closely enough. I think an end state as a TLP makes more sense for Howl than as a Pig subproject. I'd really love to see Howl replace the metastore in Hive, and it would be more natural to do so as a TLP than as a Pig subproject, especially since the current Howl repository is literally a fork of Hive.
In the incubator proposal, we have mentioned these issues, but we've attempted to avoid prejudicing any decision. Instead, we'd like to assess the pros and cons (including effort required and impact expected) for both approaches as part of the incubation process.
Glad the issues are being considered.
Later,
Jeff
Re: tons of bugs and problem found
I just noticed that your input file is actually a text file. There is a SkipBadRecords feature in Hadoop for text files, but I think Hive does not support that now. Still, I think you can hack around it by doing the setting yourself: look at the SkipBadRecords code to find the conf names and values, and set them manually before running your query. Good luck.

On Tue, Feb 1, 2011 at 12:54 PM, Laurent Laborde kerdez...@gmail.com wrote:
thank you for your replies. i reinstalled hadoop and hive, switched from Cloudera CDH3 to CDH2, and restarted everything from scratch. i've set io.skip.checksum.errors=true and i still have the same error :( what's wrong ? :( the dataset comes from a postgresql database and is consistent.

On Tue, Feb 1, 2011 at 6:57 AM, Aaron Kimball akimbal...@gmail.com wrote:
In MapReduce, filenames that begin with an underscore are hidden files and are not enumerated by FileInputFormat (Hive, I believe, processes tables with TextInputFormat and SequenceFileInputFormat, both descendants of this class). Using _foo as a hidden/ignored filename is conventional in the Hadoop world. This is different than the UNIX convention of using .foo, but that's software engineering for you. ;) This is unlikely to change soon; MapReduce emits files with names like _SUCCESS into directories to indicate successful job completion. Directories such as _tmp and _logs also appear in datasets, and are therefore ignored as input by MapReduce-based tools, but those metadata names are established in other projects. If you run 'hadoop fs -mv /path/to/_top.sql /path/to/top.sql', that should make things work for you.
- Aaron

On Mon, Jan 31, 2011 at 10:21 AM, yongqiang he heyongqiang...@gmail.com wrote:
You can first try to set io.skip.checksum.errors to true, which will ignore bad checksums.
In facebook, we also had a requirement to ignore corrupt/bad data - but it has not been committed yet. Yongqiang, what is the jira number ?
there seems to be no jira for this issue.
thanks
yongqiang

2011/1/31 Namit Jain nj...@fb.com:
On 1/31/11 7:46 AM, Laurent Laborde kerdez...@gmail.com wrote:
On Fri, Jan 28, 2011 at 8:05 AM, Laurent Laborde kerdez...@gmail.com wrote:
On Fri, Jan 28, 2011 at 1:12 AM, Namit Jain nj...@fb.com wrote:
Hi Laurent,
1. Are you saying that _top.sql did not exist in the home directory? Or that _top.sql existed, but hive was not able to read it after loading?
It exists, it's loaded, and i can see it in the hive warehouse directory. it's just impossible to query it.
2. I don't think reserved words are documented somewhere. Can you file a jira for this ?
Ok; will do that today.
3. The bad row is printed in the task log.
1.
2011-01-27 11:11:07,046 INFO org.apache.hadoop.fs.FSInputChecker: Found checksum error: b[1024, 1536]=7374796c653d22666f6e742d73697a653a20313270743b223e3c623e266e627370 3b2 66e6273703b266e6273703b202a202838302920416d69656e733a3c2f623e3c2f7370616 e3e 3c2f7370616e3e5c6e20203c2f703e5c6e20203c703e5c6e202020203c7370616e207374 796 c653d22666f66742d66616d696c793a2068656c7665746963613b223e3c7370616e20737 479 6c653d22666f6e742d73697a653a20313270743b223e3c623e266e6273703b266e627370 3b2 66e6273703b266e6273703b266e6273703b266e6273703b266e6273703b266e6273703b2 66e 6273703b206f203132682c2050697175652d6e6971756520646576616e74206c65205265 637 46f7261742e3c2f623e3c2f7370616e3e3c2f7370616e3e5c6e20203c2f703e5c6e20203 c70 3e5c6e202020203c7370616e207374796c653d22666f6e742d66616d696c793a2068656c 766 5746963613b223e3c7370616e207374796c653d22666f6e742d73697a653a20313270743 b22 3e3c623e266e6273703b266e6273703b266e6273703b266e6273703b266e6273703b266e 627 3703b266e6273703b266e6273703b266e6273703b206f2031346833302c204d6169736f6 e20 6465206c612063756c747572652e3c2f623e3c2f7370616e3e3c2f7370616e3e5c6e2020 3c2 f703e5c6e20203c703e5c6e202020203c7370616e207374796c653d Is this the actual data ? 2. org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_2466764552666222475:of:/user/hive/warehouse/article/article.copy at 23446528 23446528 is the line number ? thank you optional question (the previous ones are still open) : is there a way to tell hive to ignore invalid data ? (if the problem is invalid data) Currently, not. In facebook, we also had a requirement to ignore corrupt/bad data - but it has not been committed yet. Yongqiang, what is the jira number ? Thanks, -namit -- Laurent ker2x Laborde Sysadmin DBA at http://www.over-blog.com/ -- Laurent ker2x Laborde Sysadmin DBA at http://www.over-blog.com/
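[Editor's note] A hedged sketch of the hack described at the top of this thread. The two mapred.skip.* names below are what the Hadoop 0.20-era SkipBadRecords class appears to read; treat them as assumptions to verify against your Hadoop version, and note that whether Hive's text input path honors them at all is exactly the open question here. The first line is the checksum setting already being tried:

  set io.skip.checksum.errors=true;              -- ignore bad checksums instead of failing the task
  set mapred.skip.map.max.skip.records=1;        -- assumed SkipBadRecords property: records allowed to be skipped around a bad one
  set mapred.skip.attempts.to.start.skipping=2;  -- assumed SkipBadRecords property: failed attempts before skipping starts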
Re: tons of bugs and problem found
You can first try to set io.skip.checksum.errors to true, which will ignore bad checksum. In facebook, we also had a requirement to ignore corrupt/bad data - but it has not been committed yet. Yongqiang, what is the jira number ? there seems no jira for this issue. thanks yongqiang 2011/1/31 Namit Jain nj...@fb.com: On 1/31/11 7:46 AM, Laurent Laborde kerdez...@gmail.com wrote: On Fri, Jan 28, 2011 at 8:05 AM, Laurent Laborde kerdez...@gmail.com wrote: On Fri, Jan 28, 2011 at 1:12 AM, Namit Jain nj...@fb.com wrote: Hi Laurent, 1. Are you saying that _top.sql did not exist in the home directory. Or that, _top.sql existed, but hive was not able to read it after loading It exist, it's loaded, and i can see it in the hive's warehouse directory. it's just impossible to query it. 2. I don¹t think reserved words are documented somewhere. Can you file a jira for this ? Ok; will do that today. 3. The bad row is printed in the task log. 1. 2011-01-27 11:11:07,046 INFO org.apache.hadoop.fs.FSInputChecker: Found checksum error: b[1024, 1536]=7374796c653d22666f6e742d73697a653a20313270743b223e3c623e266e627370 3b2 66e6273703b266e6273703b202a202838302920416d69656e733a3c2f623e3c2f7370616 e3e 3c2f7370616e3e5c6e20203c2f703e5c6e20203c703e5c6e202020203c7370616e207374 796 c653d22666f66742d66616d696c793a2068656c7665746963613b223e3c7370616e20737 479 6c653d22666f6e742d73697a653a20313270743b223e3c623e266e6273703b266e627370 3b2 66e6273703b266e6273703b266e6273703b266e6273703b266e6273703b266e6273703b2 66e 6273703b206f203132682c2050697175652d6e6971756520646576616e74206c65205265 637 46f7261742e3c2f623e3c2f7370616e3e3c2f7370616e3e5c6e20203c2f703e5c6e20203 c70 3e5c6e202020203c7370616e207374796c653d22666f6e742d66616d696c793a2068656c 766 5746963613b223e3c7370616e207374796c653d22666f6e742d73697a653a20313270743 b22 3e3c623e266e6273703b266e6273703b266e6273703b266e6273703b266e6273703b266e 627 3703b266e6273703b266e6273703b266e6273703b206f2031346833302c204d6169736f6 e20 6465206c612063756c747572652e3c2f623e3c2f7370616e3e3c2f7370616e3e5c6e2020 3c2 f703e5c6e20203c703e5c6e202020203c7370616e207374796c653d Is this the actual data ? 2. org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_2466764552666222475:of:/user/hive/warehouse/article/article.copy at 23446528 23446528 is the line number ? thank you optional question (the previous ones are still open) : is there a way to tell hive to ignore invalid data ? (if the problem is invalid data) Currently, not. In facebook, we also had a requirement to ignore corrupt/bad data - but it has not been committed yet. Yongqiang, what is the jira number ? Thanks, -namit -- Laurent ker2x Laborde Sysadmin DBA at http://www.over-blog.com/
Re: Question of Hive Partitioning
Can you post your query? Is there an OR connecting the partition predicate with other conditions?
Thanks
Yongqiang

On Mon, Dec 27, 2010 at 11:09 PM, Narendra uncle.n...@gmail.com wrote:
Hi,
The Hive wiki says "Hive currently does input pruning only if the partition predicates are specified in the WHERE clause closest to the table_reference in the FROM clause" -- what does this mean? I have a query in which I am joining 4 tables, two of which need to be pruned by partition. The partition predicates are specified in the WHERE clause. However, I see that only one of the tables gets pruned by partition whereas the other one goes for a full table scan.
Have a great holiday season!
Narendra
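[Editor's note] To illustrate the wiki sentence being quoted, a hedged sketch with made-up tables and a 'ds' partition column: pruning applies when each partition predicate can be tied directly to its own table, and an OR that mixes it with conditions on another table can defeat it.

  -- both partitioned tables are likely pruned: each ds predicate is a plain AND on its own table
  SELECT a.id, b.val
  FROM part_a a JOIN part_b b ON (a.id = b.id)
  WHERE a.ds = '2010-12-27' AND b.ds = '2010-12-27';

  -- part_b likely falls back to a full scan: its partition predicate is hidden behind an OR
  SELECT a.id, b.val
  FROM part_a a JOIN part_b b ON (a.id = b.id)
  WHERE a.ds = '2010-12-27' AND (b.ds = '2010-12-27' OR a.flag = 1);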
Re: Query output formatting
You can use the collect_set() UDAF. (And use a lateral view with explode if you want to operate on the set data.)

On Mon, Dec 6, 2010 at 1:01 PM, Steven Wong sw...@netflix.com wrote:
I have this query to calculate some averages:
select hour, cdn, avg(bitrate) from fact_table group by hour, cdn
1 8 a
1 9 b
2 8 c
3 8 d
3 9 e
But I want the output to be in the following format so that it can be eyeballed/graphed more easily:
1 a b
2 c NULL
3 d e
(The 2nd and 3rd columns are averages for cdn 8 and 9, respectively.) Is there an easy way to do this – in Hive, Unix, etc.? Suggestions (or, better yet, solutions) are welcome. I imagine a UDAF could do this (see below), but AFAIK it is not built into Hive.
select hour, some_udaf(abr, cdn, array(8, 9)) from (select hour, cdn, avg(bitrate) as abr from fact_table group by hour, cdn) t group by hour
Thanks.
Steven
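[Editor's note] A hedged sketch of one way to get the pivoted layout. This uses conditional aggregation rather than the collect_set approach suggested in the reply, with the cdn values 8 and 9 hard-coded as in the question:

  SELECT hour,
         avg(CASE WHEN cdn = 8 THEN bitrate END) AS abr_cdn8,   -- NULL when an hour has no cdn 8 rows
         avg(CASE WHEN cdn = 9 THEN bitrate END) AS abr_cdn9
  FROM fact_table
  GROUP BY hour;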
Re: Hive produces very small files despite hive.merge...=true settings
These are the parameters that control the behavior. (Try to set them to different values if it does not work in your environment.) set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; set mapred.min.split.size.per.node=10; set mapred.min.split.size.per.rack=10; set mapred.max.split.size=10; set hive.merge.size.per.task=10; set hive.merge.smallfiles.avgsize=10; set hive.merge.size.smallfiles.avgsize=10; set hive.exec.dynamic.partition.mode=nonstrict; The output size of the second job is also controlled by the split size, as shown in the first 4 lines. On Fri, Nov 19, 2010 at 10:22 AM, Leo Alekseyev dnqu...@gmail.com wrote: I'm using Hadoop 0.20.2. Merge jobs (with static partitions) have worked for me in the past. Again, what's strange here is with the latest Hive build the merge stage appears to run, but it doesn't actually merge -- it's a quick map-only job that, near as I can tell, doesn't do anything. On Fri, Nov 19, 2010 at 6:14 AM, Dave Brondsema dbronds...@geek.net wrote: What version of Hadoop are you on? On Thu, Nov 18, 2010 at 10:48 PM, Leo Alekseyev dnqu...@gmail.com wrote: I thought I was running Hive with those changes merged in, but to make sure, I built the latest trunk version. The behavior changed somewhat (as in, it runs 2 stages instead of 1), but it still generates the same number of files (# of files generated is equal to the number of the original mappers, so I have no idea what the second stage is actually doing). See below for query / explain query. Stage 1 runs always; Stage 3 runs if hive.merge.mapfiles=true is set, but it still generates lots of small files. The query is kind of large, but in essence it's simply insert overwrite table foo partition(bar) select [columns] from [table] tablesample(bucket 1 out of 1 on rand()) where [conditions]. explain insert overwrite table hbase_prefilter3_us_sample partition (ds) select server_host,client_ip,time_stamp,concat(server_host,':',regexp_extract(request_url,'/[^/]+/[^/]+/([^/]+)$',1)),referrer,parse_url(referrer,'HOST'),user_agent,cookie,geoip_int(client_ip, 'COUNTRY_CODE', './GeoIP.dat'),'',ds from alogs_master TABLESAMPLE(BUCKET 1 OUT OF 1 ON rand()) am_s where am_s.ds='2010-11-05' and am_s.request_url rlike '^/img[0-9]+/[0-9]+/[^.]+\.(png|jpg|gif|mp4|swf)$' and geoip_int(am_s.client_ip, 'COUNTRY_CODE', './GeoIP.dat')='US'; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_TABREF alogs_master (TOK_TABLESAMPLE 1 1 (TOK_FUNCTION rand)) am_s)) (TOK_INSERT (TOK_DESTINATION (TOK_TAB hbase_prefilter3_us_sample (TOK_PARTSPEC (TOK_PARTVAL ds (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL server_host)) (TOK_SELEXPR (TOK_TABLE_OR_COL client_ip)) (TOK_SELEXPR (TOK_TABLE_OR_COL time_stamp)) (TOK_SELEXPR (TOK_FUNCTION concat (TOK_TABLE_OR_COL server_host) ':' (TOK_FUNCTION regexp_extract (TOK_TABLE_OR_COL request_url) '/[^/]+/[^/]+/([^/]+)$' 1))) (TOK_SELEXPR (TOK_TABLE_OR_COL referrer)) (TOK_SELEXPR (TOK_FUNCTION parse_url (TOK_TABLE_OR_COL referrer) 'HOST')) (TOK_SELEXPR (TOK_TABLE_OR_COL user_agent)) (TOK_SELEXPR (TOK_TABLE_OR_COL cookie)) (TOK_SELEXPR (TOK_FUNCTION geoip_int (TOK_TABLE_OR_COL client_ip) 'COUNTRY_CODE' './GeoIP.dat')) (TOK_SELEXPR '') (TOK_SELEXPR (TOK_TABLE_OR_COL ds))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL am_s) ds) '2010-11-05') (rlike (. (TOK_TABLE_OR_COL am_s) request_url) '^/img[0-9]+/[0-9]+/[^.]+\.(png|jpg|gif|mp4|swf)$')) (= (TOK_FUNCTION geoip_int (. 
(TOK_TABLE_OR_COL am_s) client_ip) 'COUNTRY_CODE' './GeoIP.dat') 'US') STAGE DEPENDENCIES: Stage-1 is a root stage Stage-5 depends on stages: Stage-1 , consists of Stage-4, Stage-3 Stage-4 Stage-0 depends on stages: Stage-4, Stage-3 Stage-2 depends on stages: Stage-0 Stage-3 STAGE PLANS: Stage: Stage-1 Map Reduce Alias - Map Operator Tree: am_s TableScan alias: am_s Filter Operator predicate: expr: (((hash(rand()) 2147483647) % 1) = 0) type: boolean Filter Operator predicate: expr: ((request_url rlike '^/img[0-9]+/[0-9]+/[^.]+.(png|jpg|gif|mp4|swf)$') and (GenericUDFGeoIP ( client_ip, 'COUNTRY_CODE', './GeoIP.dat' ) = 'US')) type: boolean Filter Operator predicate: expr: (((ds = '2010-11-05') and (request_url rlike '^/img[0-9]+/[0-9]+/[^.]+.(png|jpg|gif|mp4|swf)$')) and (GenericUDFGeoIP ( client_ip, 'COUNTRY_CODE', './GeoIP.dat' ) = 'US')) type: boolean Select Operator expressions: expr: server_host type: string expr: client_ip type: int expr: time_stamp
Re: Hive produces very small files despite hive.merge...=true settings
I can not think this could be the cause. The problem should be: your files can not be merged. I mean the file size is bigger than the split size On Friday, November 19, 2010, Leo Alekseyev dnqu...@gmail.com wrote: Folks, thanks for your help. I've narrowed the problem down to compression. When I set hive.exec.compress.output=false, merges proceed as expected. When compression is on, the merge job doesn't seem to actually merge, it just spits out the input. On Fri, Nov 19, 2010 at 10:51 AM, yongqiang he heyongqiang...@gmail.com wrote: These are the parameters that control the behavior. (Try to set them to different values if it does not work in your environment.) set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; set mapred.min.split.size.per.node=10; set mapred.min.split.size.per.rack=10; set mapred.max.split.size=10; set hive.merge.size.per.task=10; set hive.merge.smallfiles.avgsize=10; set hive.merge.size.smallfiles.avgsize=10; set hive.exec.dynamic.partition.mode=nonstrict; The output size of the second job is also controlled by the split size, as shown in the first 4 lines. On Fri, Nov 19, 2010 at 10:22 AM, Leo Alekseyev dnqu...@gmail.com wrote: I'm using Hadoop 0.20.2. Merge jobs (with static partitions) have worked for me in the past. Again, what's strange here is with the latest Hive build the merge stage appears to run, but it doesn't actually merge -- it's a quick map-only job that, near as I can tell, doesn't do anything. On Fri, Nov 19, 2010 at 6:14 AM, Dave Brondsema dbronds...@geek.net wrote: What version of Hadoop are you on? On Thu, Nov 18, 2010 at 10:48 PM, Leo Alekseyev dnqu...@gmail.com wrote: I thought I was running Hive with those changes merged in, but to make sure, I built the latest trunk version. The behavior changed somewhat (as in, it runs 2 stages instead of 1), but it still generates the same number of files (# of files generated is equal to the number of the original mappers, so I have no idea what the second stage is actually doing). See below for query / explain query. Stage 1 runs always; Stage 3 runs if hive.merge.mapfiles=true is set, but it still generates lots of small files. The query is kind of large, but in essence it's simply insert overwrite table foo partition(bar) select [columns] from [table] tablesample(bucket 1 out of 1 on rand()) where [conditions]. 
explain insert overwrite table hbase_prefilter3_us_sample partition (ds) select server_host,client_ip,time_stamp,concat(server_host,':',regexp_extract(request_url,'/[^/]+/[^/]+/([^/]+)$',1)),referrer,parse_url(referrer,'HOST'),user_agent,cookie,geoip_int(client_ip, 'COUNTRY_CODE', './GeoIP.dat'),'',ds from alogs_master TABLESAMPLE(BUCKET 1 OUT OF 1 ON rand()) am_s where am_s.ds='2010-11-05' and am_s.request_url rlike '^/img[0-9]+/[0-9]+/[^.]+\.(png|jpg|gif|mp4|swf)$' and geoip_int(am_s.client_ip, 'COUNTRY_CODE', './GeoIP.dat')='US'; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_TABREF alogs_master (TOK_TABLESAMPLE 1 1 (TOK_FUNCTION rand)) am_s)) (TOK_INSERT (TOK_DESTINATION (TOK_TAB hbase_prefilter3_us_sample (TOK_PARTSPEC (TOK_PARTVAL ds (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL server_host)) (TOK_SELEXPR (TOK_TABLE_OR_COL client_ip)) (TOK_SELEXPR (TOK_TABLE_OR_COL time_stamp)) (TOK_SELEXPR (TOK_FUNCTION concat (TOK_TABLE_OR_COL server_host) ':' (TOK_FUNCTION regexp_extract (TOK_TABLE_OR_COL request_url) '/[^/]+/[^/]+/([^/]+)$' 1))) (TOK_SELEXPR (TOK_TABLE_OR_COL referrer)) (TOK_SELEXPR (TOK_FUNCTION parse_url (TOK_TABLE_OR_COL referrer) 'HOST')) (TOK_SELEXPR (TOK_TABLE_OR_COL user_agent)) (TOK_SELEXPR (TOK_TABLE_OR_COL cookie)) (TOK_SELEXPR (TOK_FUNCTION geoip_int (TOK_TABLE_OR_COL client_ip) 'COUNTRY_CODE'