Large data ingestion with JDBC versus placing data into HDFS for Hive to ingest
Hi, what are the drawbacks of ingesting large quantities of data into Hive through a JDBC connection, versus placing the data directly into HDFS and having Hive incorporate it from there? Thanks
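One way to frame the trade-off: a JDBC path streams rows one at a time through a single connection, while a direct HDFS load is essentially a file move that Hive then reads with no per-row overhead. A minimal sketch of the HDFS route follows; the table and path names are hypothetical, and the file copy itself would be done beforehand with something like hdfs dfs -put:

```sql
-- Option 1: point an external table at files already copied into HDFS
CREATE EXTERNAL TABLE raw_events (id INT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/landing/events';

-- Option 2: move a staged file into a managed table's directory
LOAD DATA INPATH '/data/landing/events.csv' INTO TABLE events;
```

Either way the data lands as whole files, so ingestion speed is bounded by HDFS write throughput rather than by row-at-a-time JDBC round trips.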
Re: Happy Diwali to those forum members who celebrate this great festival
You are very kind, Sir.

On Sunday, 30 October 2016, 16:42, Devopam Mittra wrote: +1 Thanks and regards, Devopam

On 30 Oct 2016 9:37 pm, "Mich Talebzadeh" wrote: Enjoy the festive season. Regards, Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
Re: Presentation in London: Running Spark on Hive or Hive on Spark
Thanks Mich, looking forward to it :)

On Tuesday, 19 July 2016, 19:13, Mich Talebzadeh wrote: Hi all, This will be in London tomorrow, Wednesday 20th July, starting at 18:00 for refreshments with kick-off at 18:30, a 5-minute walk from Canary Wharf Station (Jubilee Line). If you wish you can register and get more info here. It will be in La Tasca, West India Docks Road, E14, and especially if you like Spanish food :) Regards, Dr Mich Talebzadeh

On 15 July 2016 at 11:06, Joaquin Alzola wrote: It is on the 20th (Wednesday) next week.

From: Marco Mistroni [mailto:mmistr...@gmail.com] Sent: 15 July 2016 11:04 To: Mich Talebzadeh Cc: user @spark ; user Subject: Re: Presentation in London: Running Spark on Hive or Hive on Spark Dr Mich, do you have any slides or videos available for the presentation you did @Canary Wharf? Kindest regards, Marco

On Wed, Jul 6, 2016 at 10:37 PM, Mich Talebzadeh wrote: Dear forum members, I will be presenting on the topic of "Running Spark on Hive or Hive on Spark, your mileage varies" in Future of Data: London. Details: Organized by Hortonworks. Date: Wednesday, July 20, 2016, 6:00 PM to 8:30 PM. Location: One Canada Square, Canary Wharf, London E14 5AB. Nearest Underground: Canary Wharf (map). If you are interested please register here. Looking forward to seeing those who can make it, to have an interesting discussion and leverage your experience. Regards, Dr Mich Talebzadeh
Re: Hive on TEZ + LLAP
Thanks. I have also heard about Hortonworks with Tez + LLAP, but that is a distro? Is it complicated work to build it yourself, so to speak?

On Friday, 15 July 2016, 19:23, "Long, Andrew" <loand...@amazon.com> wrote: Amazon AWS has recently released EMR with Hive + Tez as well. Cheers, Andrew

From: Jörn Franke <jornfra...@gmail.com> Reply-To: "user@hive.apache.org" <user@hive.apache.org> Date: Friday, July 15, 2016 at 8:36 AM To: "user@hive.apache.org" <user@hive.apache.org> Subject: Re: Hive on TEZ + LLAP I would recommend a distribution such as Hortonworks, where everything is already configured. As far as I know, LLAP is currently not part of any distribution.

On 15 Jul 2016, at 17:04, Ashok Kumar <ashok34...@yahoo.com> wrote: Hi, Has anyone managed to make Hive work with Tez + LLAP as the query engine in place of MapReduce, please? If you configured it yourself, which versions of Tez and LLAP work with Hive 2? Do I need to build Tez from source, for example? Thanks
Hive on TEZ + LLAP
Hi, Has anyone managed to make Hive work with Tez + LLAP as the query engine in place of MapReduce, please? If you configured it yourself, which versions of Tez and LLAP work with Hive 2? Do I need to build Tez from source, for example? Thanks
Fast database with writes per second and horizontal scaling
Hi Gurus, Advice appreciated from Hive gurus. My colleague has been using Cassandra. However, he says it is too slow and not user friendly. MongoDB as a document database is pretty neat but not fast enough. My main concern is fast writes per second and good scaling. Hive on Spark or Tez? How about HBase, or anything else? Any expert advice warmly acknowledged. Thanking you
Re: Using Spark on Hive with Hive also using Spark as its execution engine
Hi Mich, Regarding your recent presentation in London on this topic, "Running Spark on Hive or Hive on Spark": have you made any more interesting findings that you would like to bring up? If Hive offers both Spark and Tez in addition to MR, what is stopping one from using Spark? I still don't get why Tez + LLAP is going to be a better choice, from what you mentioned. Thanking you

On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh wrote: A couple of points, if I may, and kindly bear with my remarks. Whilst it will be very interesting to try Tez with LLAP, as I read about LLAP: "Sub-second queries require fast query execution and low setup cost. The challenge for Hive is to achieve this without giving up on the scale and flexibility that users depend on. This requires a new approach using a hybrid engine that leverages Tez and something new called LLAP (Live Long and Process, #llap online). LLAP is an optional daemon process running on multiple nodes that provides the following: caching and data reuse across queries with compressed columnar data in-memory (off-heap); multi-threaded execution including reads with predicate pushdown and hash joins; high-throughput IO using an async IO elevator with a dedicated thread and core per disk; granular column-level security across applications."

OK, so we have added an in-memory capability to Tez by way of LLAP. In other words, what Spark does already, and BTW it does not require a daemon running on any host. Don't take me wrong: it is interesting, but this sounds to me (without testing myself) like adding a caching capability to Tez to bring it on par with Spark. Remember: Spark = DAG + in-memory caching; Tez = MR on DAG; Tez + LLAP = DAG + in-memory caching. OK, it is another way of getting the same result. However, my concerns: Spark has a wide user base (I judge this from Spark user group traffic); the Tez user group has no traffic, I am afraid; LLAP I don't know. Sounds like Hortonworks promote Tez and Cloudera does not want to know anything about Hive, and they promote Impala, but that sounds like a sinking ship these days. Having said that, I will try Tez + LLAP :) No pun intended. Regards, Dr Mich Talebzadeh

On 31 May 2016 at 08:19, Jörn Franke wrote: Thanks, very interesting explanation. Looking forward to testing it.

> On 31 May 2016, at 07:51, Gopal Vijayaraghavan wrote:
>
>> That being said, all systems are evolving. Hive supports Tez+LLAP, which
>> is basically the in-memory support.
>
> There is a big difference between LLAP & SparkSQL, which has to do
> with access pattern needs.
>
> The first is related to the lifetime of the cache - the Spark RDD
> cache is per-user-session, which allows further operations in that
> session to be optimized.
>
> LLAP is designed to be hammered by multiple user sessions running
> different queries, and designed to automate the cache eviction & selection
> process. There's no user-visible explicit .cache() to remember - it's
> automatic and concurrent.
>
> My team works with both engines, trying to improve them for ORC, but the
> goals of the two are different.
>
> I will probably have to write a proper academic paper & get it
> edited/reviewed instead of sending my ramblings to the user lists like this.
> Still, this needs an example to talk about.
>
> To give a qualified example, let's leave the world of single-use clusters
> and take the use-case detailed here:
>
> http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
>
> There are two distinct problems there - one is that a single day sees up to
> 100k independent user sessions running queries, and most queries cover
> the last hour (& possibly join/compare against a similar hour aggregate
> from the past).
>
> The problem with having 100k independent user sessions from different
> connections was that the SparkSQL layer drops the RDD lineage & cache
> whenever a user ends a session.
>
> The scale problem in general for Impala was that even though the data size
> was multiple terabytes, the actual hot data was approx <20GB, which
> resides on <10 machines with locality.
>
> The same problem applies when you apply RDD caching with something
> un-replicated like Tachyon/Alluxio, since the same RDD will be so exceedingly
> popular that the machines which hold those blocks run extra hot.
>
> A cache model per user session is entirely wasteful, and a common cache +
> MPP model effectively overloads 2-3% of the cluster while leaving the other
> machines idle.
>
> LLAP was designed specifically to prevent that hotspotting while
> maintaining the common cache model - within a few minutes after an hour
> ticks over, the whole cluster develops temporal popularity for the hot
> data, and nearly every rack has at least one cached copy of the same data.
Re: Presentation in London: Running Spark on Hive or Hive on Spark
Thanks. Will this presentation be recorded as well? Regards

On Wednesday, 6 July 2016, 22:38, Mich Talebzadeh wrote: Dear forum members, I will be presenting on the topic of "Running Spark on Hive or Hive on Spark, your mileage varies" in Future of Data: London. Details: Organized by Hortonworks. Date: Wednesday, July 20, 2016, 6:00 PM to 8:30 PM. Location: One Canada Square, Canary Wharf, London E14 5AB. Nearest Underground: Canary Wharf (map). If you are interested please register here. Looking forward to seeing those who can make it, to have an interesting discussion and leverage your experience. Regards, Dr Mich Talebzadeh
latest version of Spark to work OK as Hive engine
Hi, Looking at this presentation, "Hive on Spark is Blazing Fast": which latest version of Spark can run as an engine for Hive, please? Thanks. P.S. I am aware of Hive on Tez, but that is not what I am interested in here. Warmest regards
last stats time on table columns
Greetings gurus, When I use ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS, where can I get the last stats time? DESC FORMATTED does not show it. Thanking you
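For context, a sketch of where the stats-related information does surface; the table and column names (sales, amount) are hypothetical, and note that the column-level output shows the stats values but not a timestamp:

```sql
-- Compute column statistics (the statement from the question)
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;

-- Per-column stats (min/max/ndv/null count) appear here:
DESCRIBE FORMATTED sales amount;

-- Table-level DESC FORMATTED does not expose a dedicated stats timestamp,
-- but its Table Parameters section includes COLUMN_STATS_ACCURATE and
-- transient_lastDdlTime, which is bumped when stats are written:
DESCRIBE FORMATTED sales;
```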
Re: Using Spark on Hive with Hive also using Spark as its execution engine
Hi Dr Mich, This is very good news. I will be interested to know how Hive engages with Spark as an engine. What Spark processes are used to make this work? Thanking you

On Monday, 23 May 2016, 19:01, Mich Talebzadeh wrote: Have a look at this thread. Dr Mich Talebzadeh

On 23 May 2016 at 09:10, Mich Talebzadeh wrote: Hi Timur and everyone. I will answer your first question as it is very relevant: 1) How to make 2 versions of Spark live together on the same cluster (libraries clash, paths, etc.)? Most Spark users perform ETL and ML operations on Spark as well, so we may have 3 Spark installations simultaneously. There are two distinct points here.

Using Spark as a query engine: that is BAU, and most forum members use it every day. You run Spark with Standalone, Yarn or Mesos as the cluster manager. You start the master, which does the management of resources, and you start slaves to create workers. You deploy Spark through spark-shell, spark-sql, or submit jobs through spark-submit etc. You may or may not use Hive as your database; you may use HBase via Phoenix etc. If you choose to use Hive as your database, then on every host of the cluster, including your master host, you ensure that the Hive APIs are installed (meaning Hive is installed). In $SPARK_HOME/conf, you create a soft link to the Hive configuration:

cd $SPARK_HOME/conf
hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6/conf> ltr hive-site.xml
lrwxrwxrwx 1 hduser hadoop 32 May 3 17:48 hive-site.xml -> /usr/lib/hive/conf/hive-site.xml

Now in hive-site.xml you can define all the parameters needed for Spark connectivity. Remember, we are making Hive use the Spark 1.3.1 engine. WE ARE NOT RUNNING SPARK 1.3.1 AS A QUERY TOOL. We do not need to start a master or workers for Spark 1.3.1! It is just an execution engine, like mr etc. Let us look at how we do that in hive-site.xml.
Note the settings for hive.execution.engine=spark and spark.home=/usr/lib/spark-1.3.1-bin-hadoop2 below. That tells Hive to use Spark 1.3.1 as the execution engine. You just install the Spark 1.3.1 binary download on the host; it is /usr/lib/spark-1.3.1-bin-hadoop2.6. In hive-site.xml, you set the properties:

hive.execution.engine = spark
  Expects one of [mr, tez, spark]. Chooses the execution engine. Options are: mr (MapReduce, default), tez, spark. While MR remains the default engine for historical reasons, it is itself a historical engine and is deprecated in the Hive 2 line. It may be removed without further warning.

spark.home = /usr/lib/spark-1.3.1-bin-hadoop2

hive.merge.sparkfiles = false
  Merge small files at the end of a Spark DAG transformation.

hive.spark.client.future.timeout = 60s
  Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is sec if not specified. Timeout for requests from the Hive client to the remote Spark driver.

hive.spark.job.monitor.timeout = 60s
  Timeout for the job monitor to get Spark job state.

hive.spark.client.connect.timeout = 1000ms
  Timeout for the remote Spark driver in connecting back to the Hive client.

hive.spark.client.server.connect.timeout = 9ms
  Timeout for the handshake between the Hive client and the remote Spark driver. Checked by both processes.

hive.spark.client.secret.bits = 256
  Number of bits of randomness in the generated secret for communication between the Hive client and the remote Spark driver. Rounded down to the nearest multiple of 8.

hive.spark.client.rpc.threads = 8
  Maximum number of threads for the remote Spark driver's RPC event loop.
And other settings as well. That was the Hive stuff for your Spark BAU; so there are two distinct things. Now, going to Hive itself, you will need to add the correct assembly jar file for Hadoop. These are called spark-assembly-x.y.z-hadoop2.4.0.jar, where x.y.z in this case is 1.3.1, so the assembly file is spark-assembly-1.3.1-hadoop2.4.0.jar. You add that jar to $HIVE_HOME/lib:

ls $HIVE_HOME/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
/usr/lib/hive/lib/spark-assembly-1.3.1-hadoop2.4.0.jar

And you need to compile Spark from source excluding the Hadoop dependencies:

./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

So
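Once hive-site.xml points at the Spark installation as described above, the engine can also be switched per session from the Hive shell; a minimal sketch (the table name is hypothetical):

```sql
-- Select the execution engine for this Hive session
SET hive.execution.engine=spark;

-- Subsequent queries now run on the Spark engine
SELECT COUNT(*) FROM sales;

-- Switch back to MapReduce if needed
SET hive.execution.engine=mr;
```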
Copying all Hive tables from Prod to UAT
Hi, Does anyone have suggestions on how to create and copy Hive and Spark tables from Production to UAT? One way would be to copy the table data to external files, move the external files to a local target directory, and populate the tables in the target Hive with the data. Is there an easier way of doing so? Thanks
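One built-in route worth considering is Hive's EXPORT/IMPORT, which writes both the data and the table metadata to an HDFS directory; a sketch, assuming the clusters can exchange files (e.g. via hadoop distcp), with database, table and path names hypothetical:

```sql
-- On Production: dump the table's data and metadata to an HDFS directory
EXPORT TABLE prod_db.customers TO '/tmp/export/customers';

-- Copy /tmp/export/customers between clusters (e.g. hadoop distcp),
-- then on UAT: recreate the table from the exported copy
IMPORT TABLE uat_db.customers FROM '/tmp/export/customers';
```

This avoids hand-writing the DDL on the target side, since the metadata travels with the export.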
Updating column in table throws error
Hi gurus, I have an ORC table bucketed on invoicenumber with "transactional"="true". I am trying to update the invoicenumber column used for bucketing this table, but it comes back with: Error: Error while compiling statement: FAILED: SemanticException [Error 10302]: Updating values of bucketing columns is not supported. Column invoicenumber. Any ideas how it can be solved? Thank you
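Hive refuses to UPDATE a bucketing column because the new value would belong in a different bucket file. A common workaround on a transactional table is to re-insert the affected rows with the corrected key and then delete the originals; a sketch with hypothetical table, columns and values:

```sql
-- Re-insert the affected row(s) under the corrected invoice number
INSERT INTO invoices
SELECT 99999 AS invoicenumber,   -- corrected bucketing-key value
       customer, amount
FROM invoices
WHERE invoicenumber = 11111;

-- Then remove the rows carrying the old key
DELETE FROM invoices WHERE invoicenumber = 11111;
```

For bulk corrections, rebuilding the table with INSERT OVERWRITE from a SELECT that maps old keys to new ones is another option.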
Re: Hive and Impala
Dr Mich, My two cents here. I don't have direct experience of Impala, but in my humble opinion I share your view that Hive provides the best metastore of all Big Data systems. Looking around, almost every product in one form or shape uses Hive code somewhere. My colleagues inform me that Hive is one of the most stable Big Data products. With the capabilities of Spark on Hive, and Hive on Spark or Tez, plus of course MR, there is really little need for many other products in the same space. It is good to keep things simple. Warmest

On Tuesday, 1 March 2016, 11:33, Mich Talebzadeh wrote: I have not heard of Impala anymore. I saw an article on LinkedIn titled "Apache Hive Or Cloudera Impala? What is Best for me?": "We can access all objects from the Hive data warehouse with HiveQL, which leverages the map-reduce architecture in the background for data retrieval and transformation, and this results in latency." My response was that this statement is no longer valid, as you now have a choice of three engines: MR, Spark and Tez. I have not used Impala myself, as I don't think there is a need for it, with Hive on Spark or Spark using the Hive metastore providing whatever is needed. Hive is for the Data Warehouse and provides what it says on the tin. Please also bear in mind that Hive offers ORC storage files that provide storage-index capabilities, further optimizing queries with additional stats at the file, stripe and row-group levels. Anyway, the question is: with Hive on Spark, or Spark using the Hive metastore, what can we not achieve that we can achieve with Impala? Dr Mich Talebzadeh
Re: Hive optimizer
Thank you for the link.

On Thursday, 4 February 2016, 8:07, Lefty Leverenz <leftylever...@gmail.com> wrote: You can find Hive CBO information in "Cost Based Optimizer in Hive". -- Lefty

On Wed, Feb 3, 2016 at 11:48 AM, John Pullokkaran <jpullokka...@hortonworks.com> wrote: It's both. Some of the optimizations are rule based and some are cost based. John

From: Ashok Kumar <ashok34...@yahoo.com> Reply-To: "user@hive.apache.org" <user@hive.apache.org>, Ashok Kumar <ashok34...@yahoo.com> Date: Wednesday, February 3, 2016 at 11:45 AM To: User <user@hive.apache.org> Subject: Hive optimizer Hi, Is the Hive optimizer a cost-based optimizer (CBO), a rule-based optimizer (RBO), or neither of them? Thanks
Hive optimizer
Hi, Is the Hive optimizer a cost-based optimizer (CBO), a rule-based optimizer (RBO), or neither of them? Thanks
Re: Importing Oracle data into Hive
Thank you Mich and Jörn for your help. Very useful indeed.

On Sunday, 31 January 2016, 13:43, Mich Talebzadeh <m...@peridale.co.uk> wrote: You will need to have the Oracle Database 11g JDBC Driver ojdbc6.jar installed in $SQOOP_HOME/lib. You can download it from here. The approach I prefer is to let Sqoop import it as a text file into a staging table, and then insert/select into an ORC table from the staging table:

sqoop import --connect "jdbc:oracle:thin:@rhes564:1521:mydb" --username scratchpad -P \
--query "select * from scratchpad.dummy where \
\$CONDITIONS" \
-split-by id \
-hive-import -hive-table "test.dummy_staging" --target-dir "/a/b/c/dummy_staging" --create-hive-table

Once the staging table is created, you can then insert/select into an ORC table of your definition and make sure that the schema is clearly defined as you wish. For example, you have to cater for date fields, or columns that are varchar as opposed to string.

Case in point: the source table schema in Oracle is

CREATE TABLE "SCRATCHPAD"."DUMMY"
( "ID" NUMBER,
  "CLUSTERED" NUMBER,
  "SCATTERED" NUMBER,
  "RANDOMISED" NUMBER,
  "RANDOM_STRING" VARCHAR2(50 BYTE),
  "SMALL_VC" VARCHAR2(10 BYTE),
  "PADDING" VARCHAR2(10 BYTE),
  CONSTRAINT "DUMMY_PK" PRIMARY KEY ("ID")
)

The staging table dummy_staging is generated by Sqoop:

desc dummy_staging;
| col_name      | data_type | comment |
| id            | double    |         |
| clustered     | double    |         |
| scattered     | double    |         |
| randomised    | double    |         |
| random_string | string    |         |
| small_vc      | string    |         |
| padding       | string    |         |

Your ORC table may look like:

desc dummy;
| col_name      | data_type   | comment |
| id            | int         |         |
| clustered     | int         |         |
| scattered     | int         |         |
| randomised    | int         |         |
| random_string | varchar(50) |         |
| small_vc      | varchar(10) |         |
| padding       | varchar(10) |         |

This also translates to the Extract, Load, Transform (ELT) methodology, which I prefer. HTH. Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com
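The staging-to-ORC step Mich describes can be sketched as an insert/select with explicit casts; the ORC table definition below simply mirrors the desired schema from the thread:

```sql
-- Target ORC table with the desired types
CREATE TABLE dummy (
  id            INT,
  clustered     INT,
  scattered     INT,
  randomised    INT,
  random_string VARCHAR(50),
  small_vc      VARCHAR(10),
  padding       VARCHAR(10)
) STORED AS ORC;

-- Populate it from the Sqoop-generated staging table, casting the
-- doubles that Sqoop produced for Oracle NUMBER columns down to int
INSERT INTO TABLE dummy
SELECT CAST(id AS INT), CAST(clustered AS INT), CAST(scattered AS INT),
       CAST(randomised AS INT), random_string, small_vc, padding
FROM dummy_staging;
```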
Importing Oracle data into Hive
Hi, What is the easiest method of importing data from an Oracle 11g table to Hive, please? This will be a weekly periodic job. The source table has 20 million rows. I am running Hive 1.2.1. Regards
Re: Importing Oracle data into Hive
Thanks. Can Sqoop create this table as ORC in Hive?

On Sunday, 31 January 2016, 13:11, Nitin Pawar <nitinpawar...@gmail.com> wrote: check sqoop

On Sun, Jan 31, 2016 at 6:36 PM, Ashok Kumar <ashok34...@yahoo.com> wrote: Hi, What is the easiest method of importing data from an Oracle 11g table to Hive, please? This will be a weekly periodic job. The source table has 20 million rows. I am running Hive 1.2.1. Regards -- Nitin Pawar
Re: ORC files and statistics
Thanks Owen, I got a bit confused comparing ORC with what I know about indexes in relational databases. Still need to understand it a bit better. Regards

From: Owen O'Malley [mailto:omal...@apache.org] Sent: 19 January 2016 17:57 To: user@hive.apache.org; Ashok Kumar <ashok34...@yahoo.com> Cc: Jörn Franke <jornfra...@gmail.com> Subject: Re: ORC files and statistics

On Tue, Jan 19, 2016 at 9:45 AM, Ashok Kumar <ashok34...@yahoo.com> wrote: Thank you both. So if I have a Hive table of ORC type and it contains 100K rows, there will be 10 row groups of 10K rows each. Within each row group there will be min, max, count(distinct_value) and sum for each column within that row group. Does count mean the count of distinct values, including null occurrences, for that column?

Actually, it is just count, not count distinct. Newer versions of Hive also have the option of including bloom filters for some columns. That enables fast searches for particular values in columns that aren't sorted.

Also, if the table contains 5 columns, will there be 5x10 row groups in total?

The ORC files are laid out in stripes that correspond to roughly ~64MB compressed. Each column within a stripe is laid out together. The row groups are a feature of the index and correspond to how many entries the index has. So yes, within a file with 100k rows, which obviously will be a single stripe, the index will have 10 row groups for each column, for a total of 50 entries in the index. (The index is also laid out in columns, so the reader only loads the parts of the index it needs for the columns it is reading.) .. Owen
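The bloom filters and row-group indexes Owen mentions are enabled per table through ORC table properties; a sketch with hypothetical table and column names (and the default index stride shown explicitly):

```sql
CREATE TABLE events (
  user_id  BIGINT,
  event_ts TIMESTAMP,
  payload  STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.bloom.filter.columns' = 'user_id',  -- bloom filter for point lookups
  'orc.row.index.stride'     = '10000'     -- one index entry per 10k rows
);
```

With this in place, a predicate like WHERE user_id = 42 can skip row groups whose bloom filter rules the value out, even though user_id is not sorted.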
Re: ORC files and statistics
Thank you both. So if I have a Hive table of ORC type and it contains 100K rows, there will be 10 row groups of 10K rows each. Within each row group there will be min, max, count(distinct_value) and sum for each column within that row group. Does count mean the count of distinct values, including null occurrences, for that column? Also, if the table contains 5 columns, will there be 5x10 row groups in total? Thanks again

On Tuesday, 19 January 2016, 17:35, Jörn Franke <jornfra...@gmail.com> wrote: Just be aware that you should insert the data sorted, at least on the most discriminating column of your WHERE clause.

On 19 Jan 2016, at 17:27, Owen O'Malley <omal...@apache.org> wrote: It has both. Each index has statistics of min, max, count, and sum for each column in the row group of 10,000 rows. It also has the location of the start of each row group, so that the reader can jump straight to the beginning of the row group. The reader takes a SearchArgument (e.g. age > 100) that limits which rows are required for the query and can avoid reading an entire file, or at least sections of the file. .. Owen

On Tue, Jan 19, 2016 at 7:50 AM, Ashok Kumar <ashok34...@yahoo.com> wrote: Hi, I have read some notes on ORC files in Hive and indexes. The document describes the indexes but makes reference to statistics: "ORC provides three levels of indexes within each file: file level - statistics about the values in each column across the entire file" (orc.apache.org). I am confused as it is mixing up indexes with statistics. Can someone clarify these? Thanks
ORC files and statistics
Hi, I have read some notes on ORC files in Hive and indexes. The document describes the indexes but makes reference to statistics: "ORC provides three levels of indexes within each file: file level - statistics about the values in each column across the entire file" (orc.apache.org). I am confused as it is mixing up indexes with statistics. Can someone clarify these? Thanks
equivalent to identity column in Hive
Hi, Is there an equivalent to the Microsoft IDENTITY column in Hive, please? Thanks and regards
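Hive has no auto-incrementing IDENTITY column; one common substitute is to generate a surrogate key at load time with a window function. A sketch with hypothetical table and column names:

```sql
-- Assign a sequential id as rows are inserted from a staging table
INSERT INTO TABLE customers_with_id
SELECT ROW_NUMBER() OVER () AS id,
       name, city
FROM customers_staging;
```

Note the caveat: ROW_NUMBER() restarts at 1 for each load, so for append-only tables the usual trick is to add the current MAX(id) of the target table as an offset.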
foreign keys in Hive
Hi, what is the equivalent of foreign keys in Hive? Thanks
Re: Immutable data in Hive
I second that. Many thanks Mich for your reply. Regards

On Monday, 4 January 2016, 10:58, "Singh, Abhijeet" <absi...@informatica.com> wrote: Very well answered by Mich. Thanks Mich!!
From: Mich Talebzadeh [mailto:m...@peridale.co.uk] Sent: Sunday, January 03, 2016 8:35 PM To: user@hive.apache.org; 'Ashok Kumar' Subject: RE: Immutable data in Hive Hi Ashok. I will have a go at this on top of Alan’s very valuable clarification. Extraction, Transformation and Load (ETL) is a very common method in Data Warehousing (DW)and Business Analytics projects and can be performed by custom programming like writing shell scripts, JAVA. .NET tools or combination of all to get the data from internal or external sources and put them in DW. In general only data of valueends up in DW. What this mean is that in say in Banking environment you collect and feed (Extract) data into a staging area (in relational term often staging tables or the so called global temporary tables that are cleared daily for the next cycle in a staging database), prune it from unwanted data, do some manipulation (Transformation) (often happens into another set of staging tables) and finally Load it into target tables in a Data Warehouse. The analysts then use appropriate tools like Tableau to look at macroscopic trend in the data. Remember a Data Warehouse is still a relational database most probably a columnar implementation of relational model like SAP Sybase IQ. There are many examples of DW repositories used for Business Intelligence (BI, another fancy term for Analytics) such as working out global trading positioning (I did one of these by bolting Oracle TimesTen IMDB to Oracle DW for fast extraction) or data gathered from algorithmic trading using Complex Event Processing. 
Obviously, although DWs store large amounts of data (large being a relative term) and have impressive compression like Sybase IQ (every column is stored as an index, so columnar compression is far more effective -- all data in a column being the same type, as opposed to row compression in OLTP databases), they still require additional space, SAN storage and expensive horizontal scaling (adding another multiplex requires an additional license). ELT (Extraction, Load and Transform) is a similar concept used in the Big Data world. The fundamental difference is that it is not confined to data deemed to be of specific value, i.e. where you know what you are looking for in advance. In Hadoop one can store everything: structured data (from transactional databases) and unstructured data (from the internet, Excel sheets, email, logs and others). This means that you can store potentially all data to be exploited later; the Hadoop ecosystem provides that flexibility by means of horizontal scaling on cheap commodity disks (AKA JBOD), and the lack of licensing restrictions reduces the Total Cost of Ownership (TCO) considerably. In summary, you (E)xtract and (L)oad all data as is (regardless of whether that data is exactly what you want) into HDFS, and then you do the (T)ransformation later through Schema on Read (you decide your data needs at exploration time). HDFS is great for storing large amount of data but on
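The ELT flow described above can be sketched in HiveQL. This is only an illustrative schema-on-read example; the table names, columns and HDFS path are hypothetical:

```sql
-- (E)xtract and (L)oad: raw files land in HDFS as-is, exposed through
-- an external table with no upfront cleansing
CREATE EXTERNAL TABLE raw_trades (
  trade_id STRING,
  trade_ts STRING,
  amount   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/trades';

-- (T)ransform later, on read, into a typed and compressed ORC table
CREATE TABLE trades STORED AS ORC AS
SELECT CAST(trade_id AS BIGINT)        AS trade_id,
       CAST(trade_ts AS TIMESTAMP)     AS trade_ts,
       CAST(amount   AS DECIMAL(18,2)) AS amount
FROM raw_trades
WHERE trade_id IS NOT NULL;
```

The raw data stays on HDFS untouched, so other transformations can be derived from it later.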
Re: Immutable data in Hive
Thank you sir, very helpful. Could you also briefly describe from your experience the major differences between traditional ETL in a DW and ELT in Hive? Why is there an emphasis on taking data from traditional transactional databases into Hive tables with the same format and doing the transform in Hive afterwards? Is it because Hive is meant to be efficient at data transformation? Regards On Wednesday, 30 December 2015, 18:00, Alan Gates <alanfga...@gmail.com> wrote: Traditionally data in Hive was write once (insert) read many. You could append to tables and partitions, add new partitions, etc. You could remove data by dropping tables or partitions. But there were no updates of data or deletes of particular rows. This is what was meant by immutable. Hive was originally done this way because it was based on MapReduce and HDFS, and these were the natural semantics given those underlying systems. For many use cases (e.g. ETL) this is sufficient, and the vast majority of people still run Hive this way. We added transactions and updates and deletes to Hive because some use cases require these features. Hive is being used more and more as a data warehouse, and while updates and deletes are less common there, they are still required (slowly changing dimensions, fixing wrong data, deleting records for compliance, etc.). Also, streaming data into warehouses from transactional systems is a common use case. Alan. Ashok Kumar December 29, 2015 at 14:59 Hi, Can someone please clarify what "immutable data" in Hive means? I have been told that data in Hive is/should be immutable, but in that case why do we need transactional tables in Hive that allow updates to data? thanks and greetings
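Alan's distinction can be sketched in HiveQL. This is a minimal illustration, assuming transactions have been enabled in hive-site.xml (e.g. hive.support.concurrency=true and hive.txn.manager set to the DbTxnManager); the table names are made up:

```sql
-- Classic "immutable" usage: append-only, no row-level updates
CREATE TABLE events (id BIGINT, payload STRING)
STORED AS ORC;

-- Transactional table: must be bucketed, stored as ORC and flagged
-- transactional, after which UPDATE and DELETE become possible
CREATE TABLE customers (id BIGINT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

UPDATE customers SET name = 'New name' WHERE id = 1;
DELETE FROM customers WHERE id = 2;
```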
Immutable data in Hive
Hi, Can someone please clarify what "immutable data" in Hive means? I have been told that data in Hive is/should be immutable but in that case why we need transactional tables in Hive that allow updates to data. thanks and greetings
Difference between ORC and RC files
Hi Gurus, I am trying to understand the advantages that the ORC file format offers over RC. I have read the existing documents but I still don't seem to grasp the main differences. Can someone explain to me, as a user, where ORC scores when compared to RC? What I would like to know about is mainly performance. I am also aware that ORC does some smart compression as well. Finally, is the ORC file format the best choice in Hive? Thank you
Re: The advantages of Hive/Hadoop compared to Data Warehouse
Hi, Thanks for the info. I understand ELT (Extract, Load, Transform) is more appropriate for big data compared to traditional ETL. What are the major advantages of this in the Big Data space? For example, if I started using Sqoop to get data from traditional transactional and Data Warehouse databases and created the same tables in Hive, what would be the next step to get to a consolidated data model in Hive on HDFS? The entry tables will be tabular tables in line with the source, correct? How many ELT steps generally need to be applied to get to the final model, and will ELT speed up this process? I understand this is a very broad question; however, any comments will be welcome. Regards On Friday, 18 December 2015, 22:27, Jörn Franke <jornfra...@gmail.com> wrote: I think you should draw more attention to the fact that Hive is just one component in the ecosystem. You can have many more components, such as ELT, integrating unstructured data, machine learning, streaming data etc. However, usually analysts are not aware of the technologies, and IT staff are not much aware of how they can bring benefits to a specific business domain. You could explore the potential together in workshops, design thinking etc. Once you know more details and both sides decide on potential ways forward, you can start doing PoCs and see what works and what does not. It is important that you break the old ties created by more traditional data warehouse approaches in the past and go beyond the comfort zone. On 18 Dec 2015, at 22:01, Ashok Kumar <ashok34...@yahoo.com> wrote: Gurus, Some analysts keep asking me the advantages of having Hive tables when the star schema in a Data Warehouse (DW) does the same. For example, if you have fact and dimension tables in a DW and just import them into Hive via, say, Sqoop, what are we going to gain? I keep telling them storage economy, cheap disks, that denormalisation can be taken further, etc. However, they are not convinced :( Any additional comments will help my case. Thanks a lot
The advantages of Hive/Hadoop compared to Data Warehouse
Gurus, Some analysts keep asking me the advantages of having Hive tables when the star schema in a Data Warehouse (DW) does the same. For example, if you have fact and dimension tables in a DW and just import them into Hive via, say, Sqoop, what are we going to gain? I keep telling them storage economy, cheap disks, that denormalisation can be taken further, etc. However, they are not convinced :( Any additional comments will help my case. Thanks a lot
Re: The advantages of Hive/Hadoop compared to Data Warehouse
Thank you sir. Can you please describe in a bit more detail your vision of "a fully denormalized columnar store"? Are you referring to getting rid of the star schema altogether in Hive and replacing it with ORC tables? Regards On Friday, 18 December 2015, 21:13, Grant Overby (groverby) <grove...@cisco.com> wrote: You forgot horizontal scaling. A fully denormalized columnar store in Hive will outperform a star schema in Oracle in every way imaginable at scale; however, if your data isn't big enough then this is a moot point. If your data fits in a traditional BI warehouse, and especially if it does so without people constantly complaining about performance or retention issues, your data may simply not be large enough to warrant a Hadoop stack. How many rows are in your largest fact table? Grant Overby Software Engineer Cisco.com grove...@cisco.com Mobile: 865 724 4910 From: Ashok Kumar <ashok34...@yahoo.com> Reply-To: User <user@hive.apache.org>, Ashok Kumar <ashok34...@yahoo.com> Date: Friday, December 18, 2015 at 4:01 PM To: User <user@hive.apache.org> Subject: The advantages of Hive/Hadoop compared to Data Warehouse Gurus, Some analysts keep asking me the advantages of having Hive tables when the star schema in a Data Warehouse (DW) does the same. For example, if you have fact and dimension tables in a DW and just import them into Hive via, say, Sqoop, what are we going to gain? I keep telling them storage economy, cheap disks, that denormalisation can be taken further, etc.
However, they are not convinced :( Any additional comments will help my case. Thanks a lot
Fw: Managed to make Hive run on Spark engine
This is great news sir. It shows perseverance pays at last. Can you let us know when the write-up is ready so I can set it up as well please? I know a bit about the advantages of having Hive use the Spark engine. However, the general question I have is: when should one use Hive on Spark as opposed to Hive on the MapReduce engine? Thanks again On Monday, 7 December 2015, 15:50, Mich Talebzadeh wrote: For those interested From: Mich Talebzadeh [mailto:m...@peridale.co.uk] Sent: 06 December 2015 20:33 To: user@hive.apache.org Subject: Managed to make Hive run on Spark engine Thanks all, especially to Xuefu, for contributions.
Finally it works, which means don't give up until it works :)

hduser@rhes564::/usr/lib/hive/lib> hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> set spark.home=/usr/lib/spark-1.3.1-bin-hadoop2.6;
hive> set hive.execution.engine=spark;
hive> set spark.master=spark://50.140.197.217:7077;
hive> set spark.eventLog.enabled=true;
hive> set spark.eventLog.dir=/usr/lib/spark-1.3.1-bin-hadoop2.6/logs;
hive> set spark.executor.memory=512m;
hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
hive> set hive.spark.client.server.connect.timeout=22ms;
hive> set spark.io.compression.codec=org.apache.spark.io.LZFCompressionCodec;
hive> use asehadoop;
OK
Time taken: 0.638 seconds
hive> select count(1) from t;
Query ID = hduser_20151206200528_4b85889f-e4ca-41d2-9bd2-1082104be42b
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number>
In order to set a constant number of reducers: set mapreduce.job.reduces=<number>
Starting Spark Job = c8fee86c-0286-4276-aaa1-2a5eb4e4958a
Query Hive on Spark job[0] stages: 0, 1
Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2015-12-06 20:05:36,299 Stage-0_0: 0(+1)/1 Stage-1_0: 0/1
2015-12-06 20:05:39,344 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
2015-12-06 20:05:40,350 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished
Status: Finished successfully in 8.10 seconds
OK

The versions used for this project:
OS: Linux version 2.6.18-92.el5xen (brewbuil...@ls20-bc2-13.build.redhat.com) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-41)) #1 SMP Tue Apr 29 13:31:30 EDT 2008
Hadoop 2.6.0
Hive 1.2.1
spark-1.3.1-bin-hadoop2.6 (downloaded from the prebuilt spark-1.3.1-bin-hadoop2.6.gz for starting the Spark standalone
cluster). The JAR file used in $HIVE_HOME/lib to link Hive to Spark was spark-assembly-1.3.1-hadoop2.4.0.jar (built from the source downloaded as the zipped file spark-1.3.1.gz, with the command line: make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"). It is pretty picky on parameters, CLASSPATH, IP addresses or hostnames etc. to make it work. I will create a full guide on how to build and make Hive run with Spark as its engine (as opposed to MR). HTH Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008. A Winning Strategy: Running the most Critical Financial Data on ASE 15, http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf Author of "A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7; co-author of "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4. Publications due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8; Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly. http://talebzadehmich.wordpress.com/
Best practices for monitoring hive
Hi, I would like to know best practices for monitoring the health and performance of Hive and HiveServer, troubleshooting, catching errors etc. To be clear, we do not use any bespoke monitoring tool and are keen on developing our own in-house tools that can be integrated into general monitoring tools to be picked up by operations. greetings and thanks
exporting and importing a schema/database from Hive Test to Hive DEV
Hi gurus, What is the easiest way of exporting one database in Hive from a Test instance and importing it into, say, a DEV database in another instance? How will the metastore at the target handle this? Thanks
Re: Hive and HBase
Hi, It is from Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem (Addison-Wesley Data & Analytics Series) (Kindle Locations 735-738). Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summary, ad hoc queries, and the analysis of large data sets using an SQL-like language called HiveQL. Hive transparently translates queries into MapReduce jobs that are executed in HBase. Hive is considered the de facto standard for interactive SQL queries over petabytes of data On Tuesday, 10 November 2015, 13:03, Binglin Chang <decst...@gmail.com> wrote: Hive transparently translates queries into MapReduce jobs that are executed in HBase I think this is not correct, are you sure it is from some book? On Tue, Nov 10, 2015 at 6:56 PM, Ashok Kumar <ashok34...@yahoo.com> wrote: hi, I have read in a book about Hadoop that says Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summary, ad hoc queries, and the analysis of large data sets using an SQL-like language called HiveQL. Hive transparently translates queries into MapReduce jobs that are executed in HBase. Hive is considered the de facto standard for interactive SQL queries over petabytes of data. What is the relation between Hive and HBase? I always thought that HBase is an independent database. Is it correct that Hive itself uses MapReduce engine that in turn uses HBase as the database. I always thought that Hive is a data warehouse database or I am missing something. Thanking you
DML operations and transactions in Hive
Hi, I would like to understand a bit more about insert/update/select operations in Hive. I believe the ORC table format offers the best performance and concurrency. Is ORC the best format for DML operations? What is the granularity of locks -- is it partition or row? Also, how are ACID properties implemented in Hive? Does it follow read consistency like Oracle? For example, if two updates on the same row happen at more or less the same time, will the exclusive lock be taken by the first user so the second user is blocked until the first user finishes updating and the transaction is committed, or will the second user see the row as it was before any change was made by the first user? Thanks
Re: clarification please
Thank you sir. Very helpful On Thursday, 29 October 2015, 15:22, Alan Gates <alanfga...@gmail.com> wrote: Ashok Kumar October 28, 2015 at 22:43 hi gurus, kindly clarify the following please - Hive currently does not support indexes or indexes are not used in the query Mostly true. There is a create index, but Hive does not use the resulting index by default. Some storage formats (ORC, Parquet I think) have their own indices they use internally to speed access. - The lowest granularity for concurrency is partition. If table is partitioned, then partition will be lucked in DML operation lucked =locked? I'm not sure what you intended here. If you mean locked, then it depends. By default Hive doesn't use locking. You can set it up to do locking via ZooKeeper or as part of Hive transactions. They have different locking models. See https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions and https://cwiki.apache.org/confluence/display/Hive/Locking for more information. You can sub-partition using buckets, but for most queries partition is the lowest level of granularity. Hive does a lot of work to optimize only reading relevant partitions for a query. - What is the best file format to store Hive table in HDFS? Is this ORC or Avro that allow being split and support block compression? It depends on what you want to do. ORC and Parquet do better for traditional data warehousing type queries because they are columnar formats and have lots of optimization built in for fast access, pushing filter down into the storage level etc. People like Avro and other self describing formats when their data brings its own structure. We very frequently see pipelines where people dump Avro, text, etc. into Hive and then ETL it into ORC. - Text/CSV files. By default if file type is not specified at creation time, Hive will default to text file? Out of the box yes, but you can change that in your Hive installation by setting hive.default.fileformat in your hive-site.xml. Alan. Thanks
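The defaults Alan mentions can be sketched as follows; the table names here are illustrative only:

```sql
-- Without a STORED AS clause the out-of-the-box default is a text file:
CREATE TABLE staging_raw (id BIGINT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';  -- TEXTFILE by default

-- The pipeline Alan describes: land text/Avro data, then ETL it into ORC
CREATE TABLE warehouse_orc STORED AS ORC AS
SELECT id, payload FROM staging_raw;
```

The default itself can be changed by setting hive.default.fileformat (e.g. to ORC) in hive-site.xml, as noted above.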
clarification please
hi gurus, kindly clarify the following please - Hive currently does not support indexes or indexes are not used in the query - The lowest granularity for concurrency is partition. If table is partitioned, then partition will be lucked in DML operation - What is the best file format to store Hive table in HDFS? Is this ORC or Avro that allow being split and support block compression? - Text/CSV files. By default if file type is not specified at creation time, Hive will default to text file? Thanks
downloading RDBMS table data to Hive with Sqoop import
Hi gurus, I can use Sqoop import to get RDBMS data, say from Oracle, into Hive first and then use incremental append for new rows with a PK and last value. However, how do you account for updates and deletes with Sqoop without a full load of the table from the RDBMS into Hive? Thanks
Hive and Impala
Hi gurus, Kindly help me understand the advantage that Impala has over Hive. I read a note that Impala does not use MapReduce engine and is therefore very fast for queries compared to Hive. However, Hive as I understand is widely used everywhere! Thank you
Re: partition and bucket
to hash partitioning, where a hashing function is applied. Like an RDBMS, Hive will apply a linear hashing algorithm to prevent data from clustering within specific partitions. Hashing is very effective if the column selected for bucketing has very high selectivity, like an ID column where selectivity (select count(distinct(column))/count(column)) = 1. In this case, the created partitions/files will be as evenly sized as possible. In a nutshell, bucketing is a method to get data evenly distributed over many partitions/files. One should define the number of buckets as a power of two -- 2^n, like 2, 4, 8, 16 etc. -- to achieve best results. Again, bucketing will help concurrency in Hive. It may even allow a partition-wise join, i.e. a join between two tables that are bucketed on the same column with the same number of buckets (has anyone tried this?). One more thing: when one defines the number of buckets at table creation time in Hive, the number of partitions/files will be fixed. In contrast, with partitioning you do not have this limitation. HTH Mich From: Ashok Kumar [mailto:ashok34...@yahoo.com] Sent: 10 April 2015 17:46 To: user@hive.apache.org Subject: partition and bucket Greeting all, Glad to join the user group. I am from a DBA background (Oracle/Sybase/MSSQL). I would like to understand partitioning and bucketing in Hive and the difference between them.
Shall be grateful if someone could explain where I should use partitions or buckets for best practice. thanks
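The distinction Mich draws above can be sketched in HiveQL; a minimal illustration with made-up table names:

```sql
-- Partitioning: one HDFS directory per partition value; new partitions
-- can keep being added over time
CREATE TABLE sales_by_day (id BIGINT, amount DECIMAL(18,2))
PARTITIONED BY (sale_date STRING)
STORED AS ORC;

-- Bucketing: rows spread over a fixed number of files by a hash of a
-- high-selectivity column; the bucket count (here 16 = 2^4) is fixed
-- at table creation
CREATE TABLE sales_bucketed (id BIGINT, amount DECIMAL(18,2))
CLUSTERED BY (id) INTO 16 BUCKETS
STORED AS ORC;
```

Joining two tables bucketed on the same column into the same number of buckets is what can enable the bucketed (partition-wise) join mentioned above.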