RE: Partition performance
Hi, Just to check that I understand this problem, my reading suggests that the overhead of many partitions is currently unavoidable. Specifically this means that any query on a table that has, let’s say, 10,000 partitions will be significantly slower (than on an un-partitioned table with the “same” data) even if the query explicitly specifies a single partition. (I mean I _could_ actually do the experiments myself…) Regards, Z From: Owen O'Malley [mailto:omal...@apache.org] Sent: 02 July 2013 15:52 To: user@hive.apache.org Subject: Re: Partition performance On Tue, Jul 2, 2013 at 2:34 AM, Peter Marron peter.mar...@trilliumsoftware.com wrote: Hi Owen, I’m curious about this advice about partitioning. Is there some fundamental reason why Hive is slow when the number of partitions is 10,000 rather than 1,000? The precise numbers don't matter. I wanted to give people a ballpark range that they should be looking at. Most tables at 1000 partitions won't cause big slow downs, but the cost scales with the number of partitions. By the time you are at 10,000 the cost is noticeable. I have one customer who has a table with 1.2 million partitions. That causes a lot of slow downs. And the improvements that you mention, are they going to be in version 12? Is there a JIRA raised so that I can track them? (It’s not currently a problem for me but I can see that I am going to need to be able to explain the situation.) I think this is the one they will use: https://issues.apache.org/jira/browse/HIVE-4051 -- Owen
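The cost being discussed is metastore metadata handling rather than data scanning; a minimal sketch of the scenario (table, column, and partition names are hypothetical):

```sql
-- Hypothetical table partitioned by day. With ~10,000 partitions the
-- per-partition metastore bookkeeping becomes noticeable even when the
-- query prunes down to a single partition.
CREATE TABLE page_views (user_id STRING, url STRING)
PARTITIONED BY (dt STRING);

-- Only one partition's data is read, but the metastore still has to be
-- consulted, and that cost scales with the total number of partitions.
SELECT COUNT(*) FROM page_views WHERE dt = '2013-07-04';
```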
Elastic MapReduce Hive Avro SerDe
Hi! I'm working on a few Avro MapReduce jobs whose output will end up on S3 to be processed by Hive. Amazon's latest Hive version [1] is 0.8.1 but Avro support was added in 0.9.1. I can only find the haivvreo project [2] that supports 0.7. Is this my only option? Thanks! [1] http://aws.amazon.com/elasticmapreduce/faqs/#hive-19 [2] https://github.com/jghoman/haivvreo
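For what it's worth, a hedged sketch of what a haivvreo-backed table might look like on the older Hive versions it supports. The SerDe and input/output format class names and the 'schema.url' property are recollections of the haivvreo project, not verified; check its README for the exact spelling. Bucket and schema paths are hypothetical.

```sql
-- Sketch only: an external table over Avro files on S3 using haivvreo.
-- Class names and the 'schema.url' property are assumptions to verify.
CREATE EXTERNAL TABLE avro_events
ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES ('schema.url'='hdfs:///schemas/event.avsc')
STORED AS
  INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
  OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION 's3://my-bucket/events/';
```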
RE: Partition performance
Sorry, just caught up with the last couple of days’ email and I feel that this question has already been answered fairly comprehensively. Apologies. Z
How Can I store the Hive query result in one file ?
Hello Hive users, Is there a way to store the Hive query result (SELECT *…) in a specific, single file (given the file name), like (INSERT OVERWRITE LOCAL DIRECTORY '/directory_path_name/')? Thanks for your answers
RE: metastore security issue
One setting was missing: hive.metastore.authorization.storage.checks = true. This solves the problem. -Original Message- From: Shunichi Otsuka [mailto:sots...@yahoo-corp.jp] Sent: Thursday, July 04, 2013 2:28 PM To: user@hive.apache.org Subject: metastore security issue I am trying to set up Hive securely, doing authorization at the metastore. However there is a problem. I have relied on Hive JIRA HIVE-3705 to decide the configuration, which was set as below:

javax.jdo.option.ConnectionURL = jdbc
javax.jdo.option.ConnectionDriverName = java.database.jdbc.mysql
javax.jdo.option.ConnectionUserName = hive
javax.jdo.option.ConnectionPassword = userpass
hive.metastore.execute.setugi = true
hive.metastore.uris = thrift://thriftserver.example.com:9083
hive.metastore.sasl.enabled = true
hive.metastore.kerberos.keytab.file = /etc/grid-keytabs/hive.keytab
hive.metastore.kerberos.principal = hive/thriftserver.example@example.com
hive.security.metastore.authorization.enabled = true
hive.security.metastore.authenticator.manager = org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator
hive.security.metastore.authorization.manager = org.apache.hadoop.hive.ql.security.authorization.DefaultHiveMetastoreAuthorizationProvider
hive.security.authorization.enabled = false

However this still allows an unauthorized user to drop a table or database from the metastore, as below:

alice: create database db1 location '/user/alice/warehouse/db1.db';
[The permission of db1.db is drwx------ alice:users]
bob: drop database db1;
OK

This should not happen, so why is it happening? Is my setting wrong, or is it that the code has not covered this case? If it has not been implemented yet, what measures have you taken to avoid malicious users dropping other users' databases/tables? Java version is 1.6.0_33, Hive version is 0.11. Thanks
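The fix described above, as it would appear in hive-site.xml (a minimal config fragment, not a full configuration):

```xml
<!-- Enable storage-based permission checks in the metastore, so that a
     user without HDFS access to a database's directory cannot drop it. -->
<property>
  <name>hive.metastore.authorization.storage.checks</name>
  <value>true</value>
</property>
```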
RE: Experience of Hive local mode execution style
Local mode really helps with those little delays. It definitely helps for small data sets. But my concerns are about the consistency of results with distributed mode, and about some queries that fail only in local mode (see my description below). From: Edward Capriolo Sent: 03 July 2013 00:07 To: user@hive.apache.org Subject: Re: Experience of Hive local mode execution style Local mode is fast. In particular, older versions of hadoop take a lot of time scheduling tasks, and there is a delay between the map and reduce phases. Local mode really helps with those little delays. On Monday, July 1, 2013, Guillaume Allain guillau...@blinkbox.com wrote: Hi all, Would anybody have any comments or feedback about the hive local mode execution? It is advertised as providing a performance boost for small data sets. It seems to fit nicely when running unit/integration tests on a single node or virtual machine. My exact questions are the following: - How significantly does local mode execution of queries diverge from distributed mode? Could the results differ in some way? - I have encountered errors when running complex queries (with several joins/distincts/group-bys) that seem to relate to configuration (see below). I got no exact answers from the ML and I am kind of ready to dive into the source code. Any idea where I should aim in order to solve that particular problem? Thanks in advance, Guillaume From: Guillaume Allain Sent: 18 June 2013 12:14 To: user@hive.apache.org Subject: FileNotFoundException when using hive local mode execution style Hi all, I plan to use hive local mode in order to speed up unit testing on (very) small data sets. (Data is still on hdfs.)
I switch on local mode by setting the following variables: SET hive.exec.mode.local.auto=true; SET mapred.local.dir=/user; SET mapred.tmp.dir=file:///tmp; (plus creating the needed directories and permissions) Simple GROUP BY, INNER and OUTER JOIN queries work just fine (with up to 3 jobs), with nice performance improvements. Unfortunately I ran into a FileNotFoundException (/tmp/vagrant/hive_2013-06-17_16-10-05_614_7672774118904458113/-mr-1/1/emptyFile) on some more complex query (4 jobs, distinct on top of several joins; see the logs below if needed). Any idea about that error? What other option am I missing to have a fully functional local mode? Thanks in advance, Guillaume

$ tail -50 /tmp/vagrant/vagrant_20130617171313_82baad8b-1961-4055-a52e-d8865b2cd4f8.lo
2013-06-17 16:10:05,669 INFO exec.ExecDriver (ExecDriver.java:execute(320)) - Using org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
2013-06-17 16:10:05,688 INFO exec.ExecDriver (ExecDriver.java:execute(342)) - adding libjars: file:///opt/events-warehouse/build/jars/joda-time.jar,file:///opt/events-warehouse/build/jars/we7-hive-udfs.jar,file:///usr/lib/hive/lib/hive-json-serde-0.2.jar,file:///usr/lib/hive/lib/hive-builtins-0.9.0-cdh4.1.2.jar,file:///opt/events-warehouse/build/jars/guava.jar
2013-06-17 16:10:05,688 INFO exec.ExecDriver (ExecDriver.java:addInputPaths(840)) - Processing alias dc
2013-06-17 16:10:05,688 INFO exec.ExecDriver (ExecDriver.java:addInputPaths(858)) - Adding input file hdfs://localhost/user/hive/warehouse/events_super_mart_test.db/dim_cohorts
2013-06-17 16:10:05,689 INFO exec.Utilities (Utilities.java:isEmptyPath(1807)) - Content Summary not cached for hdfs://localhost/user/hive/warehouse/events_super_mart_test.db/dim_cohorts
2013-06-17 16:10:06,185 INFO exec.ExecDriver (ExecDriver.java:addInputPath(789)) - Changed input file to file:/tmp/vagrant/hive_2013-06-17_16-10-05_614_7672774118904458113/-mr-1/1
2013-06-17 16:10:06,226 INFO exec.ExecDriver (ExecDriver.java:addInputPaths(840)) - Processing alias $INTNAME
2013-06-17 16:10:06,226 INFO exec.ExecDriver (ExecDriver.java:addInputPaths(858)) - Adding input file hdfs://localhost/tmp/hive-vagrant/hive_2013-06-17_16-09-42_560_407729448242367/-mr-10004
2013-06-17 16:10:06,226 INFO exec.Utilities (Utilities.java:isEmptyPath(1807)) - Content Summary not cached for hdfs://localhost/tmp/hive-vagrant/hive_2013-06-17_16-09-42_560_407729448242367/-mr-10004
2013-06-17 16:10:06,681 WARN conf.Configuration (Configuration.java:warnOnceIfDeprecated(808)) - session.id is deprecated. Instead, use dfs.metrics.session-id
2013-06-17 16:10:06,682 INFO jvm.JvmMetrics (JvmMetrics.java:init(76)) - Initializing JVM Metrics with processName=JobTracker, sessionId=
2013-06-17 16:10:06,688 INFO exec.ExecDriver (ExecDriver.java:createTmpDirs(215)) - Making Temp Directory: hdfs://localhost/tmp/hive-vagrant/hive_2013-06-17_16-09-42_560_407729448242367/-mr-10002
2013-06-17 16:10:06,706 WARN mapred.JobClient (JobClient.java:copyAndConfigureFiles(704)) - Use
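For reference, local-mode auto selection is also gated by size thresholds; a sketch with the property names as they existed around Hive 0.9/0.10 (the values shown are the usual defaults, but both names and defaults should be verified against your build, as later releases renamed some of these):

```sql
-- Let Hive decide per-job whether to run locally...
SET hive.exec.mode.local.auto=true;
-- ...but only if the job's total input is below this many bytes (128 MB)...
SET hive.exec.mode.local.auto.inputbytes.max=134217728;
-- ...and it needs no more than this many map tasks.
SET hive.exec.mode.local.auto.tasks.max=4;
```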
Re: How Can I store the Hive query result in one file ?
Will hive -e "query" > filename or hive -f query.q > filename do? Or do you specifically want it written into a named file on HDFS only? On Thu, Jul 4, 2013 at 3:12 PM, Matouk IFTISSEN matouk.iftis...@ysance.com wrote: Hello Hive users, Is there a manner to store the Hive query result (SELECT *.) in a specfique and alone file (given the file name) like (INSERT OVERWRITE LOCAL DIRECTORY '/directory_path_name/')? Thanks for your answers -- Nitin Pawar
Re: How Can I store the Hive query result in one file ?
The question is what the volume of your output is. There is one file per output task (map or reduce), because that way each task can write its output independently and in parallel. That's how mapreduce works. And except by forcing the number of tasks to 1, there is no certain way to have one output file. But indeed, if the volume is low enough, you could also capture the standard output into a local file like Nitin described. Bertrand On Thu, Jul 4, 2013 at 12:38 PM, Nitin Pawar nitinpawar...@gmail.com wrote: will hive -e query filename or hive -f query.q filename will do ? you specially want it to write into a named file on hdfs only? On Thu, Jul 4, 2013 at 3:12 PM, Matouk IFTISSEN matouk.iftis...@ysance.com wrote: Hello Hive users, Is there a manner to store the Hive query result (SELECT *.) in a specfique and alone file (given the file name) like (INSERT OVERWRITE LOCAL DIRECTORY '/directory_path_name/')? Thanks for your answers -- Nitin Pawar -- Bertrand Dechoux
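Bertrand's point about forcing the number of tasks to 1 can be sketched as follows (the output path, table, and column names are hypothetical):

```sql
-- Force a single reduce task so the result lands in a single output file.
SET mapred.reduce.tasks=1;

-- Write the result to a local directory; with one reducer, the directory
-- will contain one file (typically named 000000_0), which can then be
-- renamed to the desired file name outside of Hive.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/query_result'
SELECT col_a, COUNT(*) FROM some_table GROUP BY col_a;
```

Note this serializes the whole reduce phase through one task, so it is only sensible for small outputs.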
Re: Elastic MapReduce Hive Avro SerDe
Hi. My guess is that you can try to look it up in their docs or mailing lists (Amazon EMR). IIRC, CDH had the patch for Avro+Hive before it was included in Hive itself, so Amazon EMR may have similar patches... Ruslan On Thu, Jul 4, 2013 at 12:20 PM, Dan Filimon dangeorge.fili...@gmail.com wrote: Hi! I'm working on a few Avro MapReduce jobs whose output will end up on S3 to be processed by Hive. Amazon's latest Hive version [1] is 0.8.1 but Avro support was added in 0.9.1. I can only find the haivvreo project [2] that supports 0.7. Is this my only option? Thanks! [1] http://aws.amazon.com/elasticmapreduce/faqs/#hive-19 [2] https://github.com/jghoman/haivvreo
Hortonworks HDP 1.3 vs. HDP 1.1
Hi Hive Team, Currently I am developing and testing Hive queries on HDP 1.1, with Hadoop 1.0.3 and Hive 0.9.0. However, it seems that my production environment is going to be upgraded to HDP 1.3 in the near future. Will it have an impact with respect to design or optimization? Please suggest. Regards, Kumar Chinnakali CAUTION - Disclaimer * This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely for the use of the addressee(s). If you are not the intended recipient, please notify the sender by e-mail and delete the original message. Further, you are not to copy, disclose, or distribute this e-mail or its contents to any other person and any such actions are unlawful. This e-mail may contain viruses. Infosys has taken every reasonable precaution to minimize this risk, but is not liable for any damage you may sustain as a result of any virus in this e-mail. You should carry out your own virus checks before opening the e-mail or attachment. Infosys reserves the right to monitor and review the content of all messages sent to or from this e-mail address. Messages sent to or from this e-mail address may be stored on the Infosys e-mail system. ***INFOSYS End of Disclaimer INFOSYS***
Re: How Can I store the Hive query result in one file ?
I have found that for output larger than a few GB, redirecting stdout results in an incomplete file. For very large output, I do CREATE TABLE MYTABLE AS SELECT ... and then copy the resulting HDFS files directly out of /user/hive/warehouse.
Re: How Can I store the Hive query result in one file ?
Thanks for your responses; effectively Bertrand's answer makes this possible: the Hive properties below force the job to write the result into one file, without specifying the name (_0): set hive.exec.reducers.max = 1; set mapred.reduce.tasks = 1; For Nitin: I want to store the results of the SELECT, not the stdout (log) of the execution of the query; is this applicable to the results of a SELECT? 2013/7/4 Michael Malak michaelma...@yahoo.com I have found that for output larger than a few GB, redirecting stdout results in an incomplete file. For very large output, I do CREATE TABLE MYTABLE AS SELECT ... and then copy the resulting HDFS files directly out of /user/hive/warehouse.
Re: How Can I store the Hive query result in one file ?
The one I mentioned does not work on HDFS files; it's just one way to write the stdout to a file. I am not sure if Hive allows you to name output files, and the above settings will make your query run really slowly if you have a large dataset. If you are really specific about having a filename, then for now I am not aware that Hive supports it. I did a quick search but did not find anything useful. If you need a quick way to get to your solution, then Pig supports the store function and its output is written to a named file. I will search in depth and see if there is anything in the configuration of Hive. On Thu, Jul 4, 2013 at 8:50 PM, Matouk IFTISSEN matouk.iftis...@ysance.com wrote: Thanks for your responses, effctively the answer of Bertrand make this possible: the set of hive properities below froce thet job to write the hive result in one file whithout specifing the name (_0) : set hive.exec.reducers.max = 1; set mapred.reduce.tasks = 1; for Nitin, I want to store the results of SELECT not the stdout (log) of execution of the query, is this applicable for the results of SELECT? -- Nitin Pawar
Re: Experience of Hive local mode execution style
Since you are launching locally you have to account for this. 1) If multiple jobs are running, they become a burden on the local memory of the system. 2) Your local parameters like java heap size Xmx or mapred.child.java.opts may be getting applied locally; if you are doing distinct queries they may use a lot of memory or spill to disk quite often. However, what you are reporting does not look like a memory error, although distinct queries can become fairly intense. If you can repeat the problem with empty tables it is likely a bug, but if you can't, it just means that the query takes too much memory for local mode. On Thu, Jul 4, 2013 at 6:21 AM, Guillaume Allain guillau...@blinkbox.com wrote: Local mode really helps with those little delays. It definitely helps for small data sets. But my concerns are about the consistency of results with distributed mode, and about some queries that fail only in local mode (see my description below). From: Edward Capriolo Sent: 03 July 2013 00:07 To: user@hive.apache.org Subject: Re: Experience of Hive local mode execution style Local mode is fast. In particular, older versions of hadoop take a lot of time scheduling tasks, and there is a delay between the map and reduce phases. Local mode really helps with those little delays.
Re: How Can I store the Hive query result in one file ?
Normally, if you use set mapred.reduce.tasks=1 you get one output file. You can also look at hive.merge.mapfiles, mapred.reduce.tasks, and hive.merge.reducefiles; also, you can use a separate tool: https://github.com/edwardcapriolo/filecrush On Thu, Jul 4, 2013 at 6:38 AM, Nitin Pawar nitinpawar...@gmail.com wrote: will hive -e query filename or hive -f query.q filename will do ? you specially want it to write into a named file on hdfs only? On Thu, Jul 4, 2013 at 3:12 PM, Matouk IFTISSEN matouk.iftis...@ysance.com wrote: Hello Hive users, Is there a manner to store the Hive query result (SELECT *.) in a specfique and alone file (given the file name) like (INSERT OVERWRITE LOCAL DIRECTORY '/directory_path_name/')? Thanks for your answers -- Nitin Pawar
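A sketch of the merge-related settings mentioned above, which have Hive concatenate small output files at the end of a job. Note the second property is usually spelled hive.merge.mapredfiles rather than hive.merge.reducefiles; names and defaults should be verified against your Hive version.

```sql
SET hive.merge.mapfiles=true;        -- merge small files from map-only jobs
SET hive.merge.mapredfiles=true;     -- merge small files from map-reduce jobs
SET hive.merge.size.per.task=256000000;  -- approximate target size of merged files
```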
Re: How Can I store the Hive query result in one file ?
hive> set hive.io.output.fileformat=CSVTextFile;
hive> insert overwrite local directory '/usr/home/hadoop/da1/' select * from customers;
*** customers is a Hive table
Re: How Can I store the Hive query result in one file ?
Adding to that - multiple files can be concatenated from the directory. Example: cat 000000_0 000001_0 000002_0 > final
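The concatenation step can be sketched in shell. Hive's per-task output files typically have names like 000000_0; here they are simulated with dummy files so the step is self-contained (the /tmp path is illustrative):

```shell
# Simulate a Hive output directory with three per-task part files,
# then concatenate them into one single named file.
mkdir -p /tmp/hive_cat_demo
printf 'row1\n' > /tmp/hive_cat_demo/000000_0
printf 'row2\n' > /tmp/hive_cat_demo/000001_0
printf 'row3\n' > /tmp/hive_cat_demo/000002_0

# Order matters only if your downstream consumer cares about row order.
cat /tmp/hive_cat_demo/000000_0 /tmp/hive_cat_demo/000001_0 \
    /tmp/hive_cat_demo/000002_0 > /tmp/hive_cat_demo/final
```

Against a real Hive output directory, a glob like `cat dir/0* > final` covers all part files at once.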
Re: Hortonworks HDP 1.3 vs. HDP 1.1
For HDP specific questions, you should use the Hortonworks lists: http://hortonworks.com/community/forums/forum/hive/ Your question is about the difference between Hive 0.9 and Hive 0.11. The big additions are:
- Decimal type
- ORC files
- Analytics functions (cube, roll up)
- Windowing functions
- Join improvements
There are some blog entries for Hive 0.10 and Hive 0.11: http://hortonworks.com/blog/apache-hive-0-10-0-is-now-available/ http://hortonworks.com/blog/apache-hive-0-11-stinger-phase-1-delivered -- Owen On Thu, Jul 4, 2013 at 6:26 AM, Kumar Chinnakali kumar_chinnak...@infosys.com wrote: Hi Hive Team, Currently am developing and testing the Hive queries in HDP 1.1 with Hadoop 1.0.3 and Hive 0.9.0 However, it seems that my production is going to get upgraded to HDP 1.3 in near future. Will it will impact with respect to design, optimization? Please suggest. Regards, Kumar Chinnakali
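As a small illustration of one of the Hive 0.11 additions listed above, a windowing query (table and column names are hypothetical):

```sql
-- Windowing functions arrived in Hive 0.11: rank each user's sales by
-- amount without collapsing the rows the way GROUP BY would.
SELECT user_id, sale_dt, amount,
       RANK() OVER (PARTITION BY user_id ORDER BY amount DESC) AS rnk
FROM sales;
```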