large data ingestion with jdbc versus putting data into hdfs with hive to ingest

2016-11-07 Thread Ashok Kumar
hi,
what are the drawbacks of ingesting large quantities of data through a JDBC 
connection into Hive, versus placing the data directly into HDFS for Hive to 
incorporate?
thanks
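
For readers comparing the two paths, here is a minimal, hypothetical sketch of what each looks like in practice (table, database and path names are invented; the Sqoop flags shown are the common ones, adjust for your source database):

# Path 1: pull over JDBC with Sqoop straight into a Hive table
sqoop import \
  --connect "jdbc:oracle:thin:@dbhost:1521:mydb" --username scott -P \
  --table SALES \
  --hive-import --hive-table staging.sales

# Path 2: land files on HDFS first, then expose them to Hive
hdfs dfs -put /data/exports/sales_2016-11-07.csv /landing/sales/

-- external table over the landed files; Hive just reads them in place
CREATE EXTERNAL TABLE staging.sales_ext (
  id BIGINT, amount DOUBLE, sold_at STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/landing/sales';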

Re: Happy Diwali to those forum members who celebrate this great festival

2016-10-30 Thread Ashok Kumar
You are very kind Sir 

On Sunday, 30 October 2016, 16:42, Devopam Mittra  wrote:
 

 +1
Thanks and regards
Devopam
On 30 Oct 2016 9:37 pm, "Mich Talebzadeh"  wrote:

Enjoy the festive season.
Regards,
Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction.

Re: Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-19 Thread Ashok Kumar
Thanks Mich looking forward to it :) 

On Tuesday, 19 July 2016, 19:13, Mich Talebzadeh 
 wrote:
 

 Hi all,
This will be in London tomorrow, Wednesday 20th July, starting at 18:00 for 
refreshments, with kick-off at 18:30, a 5-minute walk from Canary Wharf Station 
(Jubilee Line).
If you wish you can register and get more info here.
It will be in La Tasca, West India Docks Road, E14, especially if you like Spanish food :)
Regards,



Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com
On 15 July 2016 at 11:06, Joaquin Alzola  wrote:

It is on the 20th (Wednesday) next week.

From: Marco Mistroni [mailto:mmistr...@gmail.com]
Sent: 15 July 2016 11:04
To: Mich Talebzadeh
Cc: user @spark; user
Subject: Re: Presentation in London: Running Spark on Hive or Hive on Spark

Dr Mich, do you have any slides or videos available for the presentation you did 
@Canary Wharf?
Kindest regards,
Marco

On Wed, Jul 6, 2016 at 10:37 PM, Mich Talebzadeh wrote:
Dear forum members,
I will be presenting on the topic of "Running Spark on Hive or Hive on Spark, 
your mileage varies" in Future of Data: London.
Details
Organized by: Hortonworks
Date: Wednesday, July 20, 2016, 6:00 PM to 8:30 PM
Place: London
Location: One Canada Square, Canary Wharf, London E14 5AB.
Nearest Underground: Canary Wharf (map)
If you are interested please register here.
Looking forward to seeing those who can make it, to have an interesting 
discussion and leverage your experience.
Regards,
Dr Mich Talebzadeh LinkedIn 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com



  

Re: Hive on TEZ + LLAP

2016-07-15 Thread Ashok Kumar
Thanks.
I have also heard about Hortonworks with Tez + LLAP, but that is a distro?
Is it complicated to build it yourself, so to speak?
 

On Friday, 15 July 2016, 19:23, "Long, Andrew" <loand...@amazon.com> wrote:
 

 Amazon AWS has recently released EMR with Hive + Tez as well.

Cheers,
Andrew

From: Jörn Franke <jornfra...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Friday, July 15, 2016 at 8:36 AM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Re: Hive on TEZ + LLAP

I would recommend a distribution such as Hortonworks, where everything is 
already configured. As far as I know LLAP is currently not part of any 
distribution.

On 15 Jul 2016, at 17:04, Ashok Kumar <ashok34...@yahoo.com> wrote:

Hi,
Has anyone managed to make Hive work with Tez + LLAP as the query engine in 
place of MapReduce, please?
If you configured it yourself, which versions of Tez and LLAP work with Hive 2? 
Do I need to build Tez from source, for example?
Thanks


  

Hive on TEZ + LLAP

2016-07-15 Thread Ashok Kumar
Hi,
Has anyone managed to make Hive work with Tez + LLAP as the query engine in 
place of MapReduce, please?
If you configured it yourself, which versions of Tez and LLAP work with Hive 2? 
Do I need to build Tez from source, for example?
Thanks
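
Not an answer to the version question, but for anyone trying this by hand, the engine switch itself is just a Hive setting. The LLAP-related keys below are the ones I recall from the Hive 2.x documentation and should be treated as assumptions to verify against your build; the daemon host value is hypothetical:

-- run queries on Tez instead of MapReduce
set hive.execution.engine=tez;

-- with LLAP daemons already running (Hive 2.x), direct work to them
set hive.llap.execution.mode=all;            -- assumption: check your version
set hive.llap.daemon.service.hosts=@llap0;   -- hypothetical ZooKeeper namespace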

Fast database with writes per second and horizontal scaling

2016-07-11 Thread Ashok Kumar
Hi Gurus,
Advice appreciated from Hive gurus.
My colleague has been using Cassandra. However, he says it is too slow and not 
user friendly. MongoDB as a document database is pretty neat but not fast enough.
My main concern is fast writes per second and good scaling.

Hive on Spark or Tez?
How about HBase, or anything else?
Any expert advice warmly acknowledged.
thanking you

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Ashok Kumar
Hi Mich,
Regarding your recent presentation in London on this topic, "Running Spark on 
Hive or Hive on Spark":
Have you made any more interesting findings that you would like to bring up?
If Hive offers both Spark and Tez in addition to MR, what stops one from using 
Spark? I still don't get why Tez + LLAP is going to be a better choice, from 
what you mentioned.
thanking you
 

On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh  
wrote:
 

 A couple of points if I may, and kindly bear with my remarks.

Whilst it will be very interesting to try Tez with LLAP, as I read about LLAP:

"Sub-second queries require fast query execution and low setup cost. The 
challenge for Hive is to achieve this without giving up on the scale and 
flexibility that users depend on. This requires a new approach using a hybrid 
engine that leverages Tez and something new called LLAP (Live Long and 
Process, #llap online).
LLAP is an optional daemon process running on multiple nodes, that provides the 
following:
   - Caching and data reuse across queries with compressed columnar data 
in-memory (off-heap)
   - Multi-threaded execution including reads with predicate pushdown and hash 
joins
   - High throughput IO using Async IO Elevator with dedicated thread and core 
per disk
   - Granular column level security across applications"

OK, so we have added an in-memory capability to Tez by way of LLAP; in other 
words, what Spark does already, and BTW it does not require a daemon running on 
any host. Don't take me wrong, it is interesting, but this sounds to me (without 
testing it myself) like adding a caching capability to Tez to bring it on par 
with Spark. Remember:

Spark -> DAG + in-memory caching
TEZ = MR on DAG
TEZ + LLAP => DAG + in-memory caching

OK, it is another way of getting the same result. However, my concerns:

   - Spark has a wide user base. I judge this from Spark user group traffic
   - The TEZ user group has no traffic, I am afraid
   - LLAP I don't know

It sounds like Hortonworks promotes TEZ, and Cloudera does not want to know 
anything about Hive; they promote Impala, but that sounds like a sinking ship 
these days.
Having said that, I will try TEZ + LLAP :) No pun intended.
Regards
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 31 May 2016 at 08:19, Jörn Franke  wrote:

Thanks very interesting explanation. Looking forward to test it.

> On 31 May 2016, at 07:51, Gopal Vijayaraghavan  wrote:
>
>
>> That being said all systems are evolving. Hive supports tez+llap which
>> is basically the in-memory support.
>
> There is a big difference between where LLAP & SparkSQL, which has to do
> with access pattern needs.
>
> The first one is related to the lifetime of the cache - the Spark RDD
> cache is per-user-session which allows for further operation in that
> session to be optimized.
>
> LLAP is designed to be hammered by multiple user sessions running
> different queries, designed to automate the cache eviction & selection
> process. There's no user visible explicit .cache() to remember - it's
> automatic and concurrent.
>
> My team works with both engines, trying to improve it for ORC, but the
> goals of both are different.
>
> I will probably have to write a proper academic paper & get it
> edited/reviewed instead of send my ramblings to the user lists like this.
> Still, this needs an example to talk about.
>
> To give a qualified example, let's leave the world of single use clusters
> and take the use-case detailed here
>
> http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
>
>
> There are two distinct problems there - one is that a single day sees upto
> 100k independent user sessions running queries and that most queries cover
> the last hour (& possibly join/compare against a similar hour aggregate
> from the past).
>
> The problem with having independent 100k user-sessions from different
> connections was that the SparkSQL layer drops the RDD lineage & cache
> whenever a user ends a session.
>
> The scale problem in general for Impala was that even though the data size
> was in multiple terabytes, the actual hot data was approx <20Gb, which
> resides on <10 machines with locality.
>
> The same problem applies when you apply RDD caching with something like
> un-replicated like Tachyon/Alluxio, since the same RDD will be exceeding
> popular that the machines which hold those blocks run extra hot.
>
> A cache model per-user session is entirely wasteful and a common cache +
> MPP model effectively overloads 2-3% of cluster, while leaving the other
> machines idle.
>
> LLAP was designed specifically to prevent that hotspotting, while
> maintaining the common cache model - within a few minutes after an hour
> ticks over, the whole cluster develops temporal popularity for the hot
> data and nearly every rack has at least one cached copy of the same data
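
To make the contrast in Gopal's point concrete: in Spark SQL the cache belongs to one session and has to be requested explicitly, whereas LLAP decides what to cache on its own. A minimal, hedged Spark SQL illustration of the explicit, session-scoped side (table name hypothetical):

-- inside one Spark SQL session: caching is explicit and dies with the session
CACHE TABLE web_logs_last_hour;           -- pins the table in this session's memory
SELECT count(*) FROM web_logs_last_hour;  -- served from the session cache
-- a different user session sees no benefit from the cache above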

Re: Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-07 Thread Ashok Kumar
Thanks.
Will this presentation be recorded as well?
Regards 

On Wednesday, 6 July 2016, 22:38, Mich Talebzadeh 
 wrote:
 

 Dear forum members,
I will be presenting on the topic of "Running Spark on Hive or Hive on Spark, 
your mileage varies" in Future of Data: London.
Details
Organized by: Hortonworks
Date: Wednesday, July 20, 2016, 6:00 PM to 8:30 PM
Place: London
Location: One Canada Square, Canary Wharf, London E14 5AB.
Nearest Underground: Canary Wharf (map)
If you are interested please register here.
Looking forward to seeing those who can make it, to have an interesting 
discussion and leverage your experience.
Regards,
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com

  

latest version of Spark to work OK as Hive engine

2016-07-02 Thread Ashok Kumar
Hi,
Looking at this presentation, "Hive on Spark is Blazing Fast"...
Which is the latest version of Spark that can run as an engine for Hive, please?
Thanks
P.S. I am aware of Hive on Tez, but that is not what I am interested in here.
Warmest regards

last stats time on table columns

2016-06-16 Thread Ashok Kumar
Greetings gurus,
When I use
ANALYZE TABLE <table_name> COMPUTE STATISTICS FOR COLUMNS,
where can I get the last stats time?
DESC FORMATTED <table_name> does not show it.
thanking you
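
Not a full answer, but two places worth checking, shown only as a hedged sketch. The first statement is plain HiveQL; the second runs against the metastore RDBMS, and the table/column names there are from memory of the metastore schema, so treat them as assumptions to verify for your version:

-- column-level statistics are visible per column once computed
DESCRIBE FORMATTED <table_name> <column_name>;

-- in the metastore database (e.g. MySQL), TAB_COL_STATS is assumed to keep a
-- LAST_ANALYZED epoch per column:
SELECT TABLE_NAME, COLUMN_NAME, LAST_ANALYZED
FROM   TAB_COL_STATS
WHERE  TABLE_NAME = '<table_name>';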

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread Ashok Kumar
Hi Dr Mich,
This is very good news. I will be interested to know how Hive engages with 
Spark as an engine. What Spark processes are used to make this work? 
Thanking you 

On Monday, 23 May 2016, 19:01, Mich Talebzadeh  
wrote:
 

 Have a look at this thread
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 23 May 2016 at 09:10, Mich Talebzadeh  wrote:

Hi Timur and everyone.
I will answer your first question as it is very relevant
1) How to make 2 versions of Spark live together on the same cluster (libraries 
clash, paths, etc.) ? 
Most of the Spark users perform ETL, ML operations on Spark as well. So, we may 
have 3 Spark installations simultaneously

There are two distinct points here.
Using Spark as a query engine. That is BAU and most forum members use it every 
day. You run Spark with either Standalone, YARN or Mesos as the cluster manager. 
You start the master, which does the management of resources, and you start 
slaves to create workers.
You deploy Spark either via spark-shell, spark-sql, or by submitting jobs through 
spark-submit etc. You may or may not use Hive as your database. You may use 
HBase via Phoenix etc. If you choose to use Hive as your database, then on every 
host of the cluster, including your master host, you ensure that the Hive APIs 
are installed (meaning Hive is installed). In $SPARK_HOME/conf, you create a 
soft link to the Hive configuration file:

cd $SPARK_HOME/conf
hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6/conf> ltr hive-site.xml
lrwxrwxrwx 1 hduser hadoop 32 May  3 17:48 hive-site.xml -> /usr/lib/hive/conf/hive-site.xml

Now in hive-site.xml you can define all the parameters needed for Spark 
connectivity. Remember, we are making Hive use the Spark 1.3.1 engine. WE ARE NOT 
RUNNING SPARK 1.3.1 AS A QUERY TOOL. We do not need to start a master or workers 
for Spark 1.3.1! It is just an execution engine, like mr etc.
Let us look at how we do that in hive-site.xml, noting the settings for 
hive.execution.engine=spark and spark.home=/usr/lib/spark-1.3.1-bin-hadoop2 
below. That tells Hive to use Spark 1.3.1 as the execution engine. You just 
install Spark 1.3.1 on the host from the binary download; it lives in 
/usr/lib/spark-1.3.1-bin-hadoop2.6.
In hive-site.xml, you set the properties:

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
  <description>
    Expects one of [mr, tez, spark].
    Chooses execution engine. Options are: mr (Map reduce, default), tez, spark.
    While MR remains the default engine for historical reasons, it is itself a
    historical engine and is deprecated in the Hive 2 line. It may be removed
    without further warning.
  </description>
</property>

<property>
  <name>spark.home</name>
  <value>/usr/lib/spark-1.3.1-bin-hadoop2</value>
  <description>something</description>
</property>

<property>
  <name>hive.merge.sparkfiles</name>
  <value>false</value>
  <description>Merge small files at the end of a Spark DAG Transformation</description>
</property>

<property>
  <name>hive.spark.client.future.timeout</name>
  <value>60s</value>
  <description>
    Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is sec if not specified.
    Timeout for requests from Hive client to remote Spark driver.
  </description>
</property>

<property>
  <name>hive.spark.job.monitor.timeout</name>
  <value>60s</value>
  <description>
    Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is sec if not specified.
    Timeout for job monitor to get Spark job state.
  </description>
</property>

<property>
  <name>hive.spark.client.connect.timeout</name>
  <value>1000ms</value>
  <description>
    Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not specified.
    Timeout for remote Spark driver in connecting back to Hive client.
  </description>
</property>

<property>
  <name>hive.spark.client.server.connect.timeout</name>
  <value>9ms</value>
  <description>
    Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not specified.
    Timeout for handshake between Hive client and remote Spark driver. Checked by both processes.
  </description>
</property>

<property>
  <name>hive.spark.client.secret.bits</name>
  <value>256</value>
  <description>Number of bits of randomness in the generated secret for communication between Hive client and remote Spark driver. Rounded down to the nearest multiple of 8.</description>
</property>

<property>
  <name>hive.spark.client.rpc.threads</name>
  <value>8</value>
  <description>Maximum number of threads for remote Spark driver's RPC event loop.</description>
</property>
And other settings as well.

That was the Hive stuff for your Spark BAU, so there are two distinct things. 
Now going to Hive itself, you will need to add the correct assembly jar file 
for Hadoop. These are called

spark-assembly-x.y.z-hadoop2.4.0.jar

where x.y.z in this case is 1.3.1. The assembly file is

spark-assembly-1.3.1-hadoop2.4.0.jar

So you add that spark-assembly-1.3.1-hadoop2.4.0.jar to $HIVE_HOME/lib:

ls $HIVE_HOME/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
/usr/lib/hive/lib/spark-assembly-1.3.1-hadoop2.4.0.jar

And you need to compile Spark from source excluding the Hadoop dependencies:

./make-distribution.sh --name "hadoop2-without-hive" --tgz 
"-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

So

Copying all Hive tables from Prod to UAT

2016-04-08 Thread Ashok Kumar
Hi,
Does anyone have suggestions on how to create and copy Hive and Spark tables 
from Production to UAT?
One way would be to copy the table data to external files, then move the 
external files to a local target directory and populate the tables in the target 
Hive with that data.
Is there an easier way of doing so?
thanks
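
One hedged sketch of the kind of approach described above, using Hive's built-in EXPORT/IMPORT, which bundles the data together with the table metadata. Database names and paths are invented, and the distcp step assumes the two clusters can reach each other:

-- on Production
EXPORT TABLE prod_db.sales TO '/tmp/export/sales';

-- copy the exported directory between clusters (shell):
--   hadoop distcp hdfs://prod-nn:8020/tmp/export/sales hdfs://uat-nn:8020/tmp/export/sales

-- on UAT
IMPORT TABLE uat_db.sales FROM '/tmp/export/sales';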



Updating column in table throws error

2016-03-06 Thread Ashok Kumar
 Hi gurus,
I have an ORC table bucketed on invoicenumber with "transactional"="true"
I am trying to update the invoicenumber column used for bucketing this table, 
but it comes back with:
Error: Error while compiling statement: FAILED: SemanticException [Error 
10302]: Updating values of bucketing columns is not supported.  Column 
invoicenumber
Any ideas how it can be solved?
Thank you
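
No reply appears in this thread, but since Hive refuses UPDATEs on bucketing columns by design, the usual workarounds are either to delete and re-insert the affected rows, or to rewrite the data into a table clustered on a different column. A hedged sketch only, with invented values and columns:

-- delete-and-reinsert on the same transactional table; DELETE and INSERT are
-- permitted even though UPDATE of the bucketing column is not
DELETE FROM invoices WHERE invoicenumber = 1001;
INSERT INTO invoices VALUES (2001, 'ACME Ltd', 99.50);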

Re: Hive and Impala

2016-03-01 Thread Ashok Kumar

Dr Mich,
My two cents here.
I don't have direct experience of Impala, but in my humble opinion I share your 
view that Hive provides the best metastore of all Big Data systems. Looking 
around, almost every product in one form or shape uses Hive code somewhere. My 
colleagues inform me that Hive is one of the most stable Big Data products.
With the capabilities of Spark on Hive and Hive on Spark or Tez, plus of course 
MR, there is really little need for many other products in the same space. It 
is good to keep things simple.
Warmest 

On Tuesday, 1 March 2016, 11:33, Mich Talebzadeh 
 wrote:
 

 I do not hear about Impala any more. I saw an article on LinkedIn titled
"Apache Hive Or Cloudera Impala? What is Best for me?"
"We can access all objects from Hive data warehouse with HiveQL which leverages 
the map-reduce architecture in background for data retrieval and transformation 
and this results in latency."
My response was:
This statement is no longer valid, as you now have the choice of three engines: 
MR, Spark and Tez. I have not used Impala myself, as I don't think there is a 
need for it, with Hive on Spark or Spark using the Hive metastore providing 
whatever is needed. Hive is for data warehousing and provides what it says on 
the tin. Please also bear in mind that Hive offers ORC storage files that 
provide stored index capabilities, further optimizing queries with additional 
stats at file, stripe and row-group level.
Anyway, the question is: with Hive on Spark, or Spark using the Hive metastore, 
what can we not achieve that we can achieve with Impala?

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 

  

Re: Hive optimizer

2016-02-04 Thread Ashok Kumar
Thank you for the link. 

On Thursday, 4 February 2016, 8:07, Lefty Leverenz 
<leftylever...@gmail.com> wrote:
 

 You can find Hive CBO information in Cost Based Optimizer in Hive.
-- Lefty

On Wed, Feb 3, 2016 at 11:48 AM, John Pullokkaran 
<jpullokka...@hortonworks.com> wrote:

It's both. Some of the optimizations are rule based and some are cost based.
John
From: Ashok Kumar <ashok34...@yahoo.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>, Ashok Kumar 
<ashok34...@yahoo.com>
Date: Wednesday, February 3, 2016 at 11:45 AM
To: User <user@hive.apache.org>
Subject: Hive optimizer

  Hi,
Is the Hive optimizer a cost-based optimizer (CBO) or a rule-based optimizer 
(RBO), or neither?
thanks



  

Hive optimizer

2016-02-03 Thread Ashok Kumar
  Hi,
Is the Hive optimizer a cost-based optimizer (CBO) or a rule-based optimizer 
(RBO), or neither?
thanks

Re: Importing Oracle data into Hive

2016-01-31 Thread Ashok Kumar

Thank you Mich and Jörn for your help. Very useful indeed.
 

   

 On Sunday, 31 January 2016, 13:43, Mich Talebzadeh <m...@peridale.co.uk> wrote:
 

 You will need to have the Oracle Database 11g JDBC driver ojdbc6.jar installed 
in $SQOOP_HOME/lib. You can download it from here. The approach I prefer is to 
let Sqoop import it as a text file into a staging table and then insert/select 
into an ORC table from the staging table.

sqoop import --connect "jdbc:oracle:thin:@rhes564:1521:mydb" --username scratchpad -P \
  --query "select * from scratchpad.dummy where \$CONDITIONS" \
  --split-by id \
  --hive-import --hive-table "test.dummy_staging" \
  --target-dir "/a/b/c/dummy_staging" --create-hive-table

Once the staging table is created, you can then insert/select into an ORC table 
of your own definition and make sure that the schema is clearly defined as you 
wish. For example, you have to cater for date fields or columns that are varchar 
as opposed to String.

Case in point: the source table schema in Oracle is

CREATE TABLE "SCRATCHPAD"."DUMMY"
(
  "ID"            NUMBER,
  "CLUSTERED"     NUMBER,
  "SCATTERED"     NUMBER,
  "RANDOMISED"    NUMBER,
  "RANDOM_STRING" VARCHAR2(50 BYTE),
  "SMALL_VC"      VARCHAR2(10 BYTE),
  "PADDING"       VARCHAR2(10 BYTE),
  CONSTRAINT "DUMMY_PK" PRIMARY KEY ("ID")
)

The staging table dummy_staging is generated by Sqoop:

desc dummy_staging;
+----------------+------------+----------+
|    col_name    | data_type  | comment  |
+----------------+------------+----------+
| id             | double     |          |
| clustered      | double     |          |
| scattered      | double     |          |
| randomised     | double     |          |
| random_string  | string     |          |
| small_vc       | string     |          |
| padding        | string     |          |
+----------------+------------+----------+

Your ORC table may look like:

desc dummy;
+----------------+--------------+----------+
|    col_name    |  data_type   | comment  |
+----------------+--------------+----------+
| id             | int          |          |
| clustered      | int          |          |
| scattered      | int          |          |
| randomised     | int          |          |
| random_string  | varchar(50)  |          |
| small_vc       | varchar(10)  |          |
| padding        | varchar(10)  |          |
+----------------+--------------+----------+

This also translates to the Extract, Load, Transform (ELT) methodology, which I 
prefer.

HTH

Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com/
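
To complete the final step Mich describes (staging table into ORC), a hedged sketch using the column list from the example above; the target DDL is only an assumption about how one might define it:

CREATE TABLE test.dummy (
  id INT, clustered INT, scattered INT, randomised INT,
  random_string VARCHAR(50), small_vc VARCHAR(10), padding VARCHAR(10))
STORED AS ORC;

INSERT INTO TABLE test.dummy
SELECT CAST(id AS INT), CAST(clustered AS INT), CAST(scattered AS INT),
       CAST(randomised AS INT), CAST(random_string AS VARCHAR(50)),
       CAST(small_vc AS VARCHAR(10)), CAST(padding AS VARCHAR(10))
FROM   test.dummy_staging;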

Importing Oracle data into Hive

2016-01-31 Thread Ashok Kumar
  Hi,
What is the easiest method of importing data from an Oracle 11g table to Hive 
please? This will be a weekly periodic job. The source table has 20 million 
rows.
I am running Hive 1.2.1
regards



Re: Importing Oracle data into Hive

2016-01-31 Thread Ashok Kumar
Thanks,
Can sqoop create this table as ORC in Hive? 

On Sunday, 31 January 2016, 13:13, Ashok Kumar <ashok34...@yahoo.com> wrote:
 

 Thanks.
Can sqoop create this table as ORC in Hive? 

On Sunday, 31 January 2016, 13:11, Nitin Pawar <nitinpawar...@gmail.com> 
wrote:
 

 check sqoop 

On Sun, Jan 31, 2016 at 6:36 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:

  Hi,
What is the easiest method of importing data from an Oracle 11g table to Hive 
please? This will be a weekly periodic job. The source table has 20 million 
rows.
I am running Hive 1.2.1
regards





-- 
Nitin Pawar
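
On the ORC question above: nobody answered in this thread, but Sqoop's HCatalog integration can reportedly write ORC directly. A hedged sketch only; the flag names are as I recall them from the Sqoop 1.x documentation, so verify them against your Sqoop version, and the connection details are hypothetical:

sqoop import \
  --connect "jdbc:oracle:thin:@dbhost:1521:mydb" --username scott -P \
  --table SALES \
  --hcatalog-database test --hcatalog-table sales_orc \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orc"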


   

  

Re: ORC files and statistics

2016-01-19 Thread Ashok Kumar
 Thanks Owen,
I got a bit confused comparing ORC with what I know about indexes in relational 
databases. Still need to understand it a bit better.
Regards
From: Owen O'Malley [mailto:omal...@apache.org] 
Sent: 19 January 2016 17:57
To: user@hive.apache.org; Ashok Kumar <ashok34...@yahoo.com>
Cc: Jörn Franke <jornfra...@gmail.com>
Subject: Re: ORC files and statistics

On Tue, Jan 19, 2016 at 9:45 AM, Ashok Kumar <ashok34...@yahoo.com> wrote:

> Thank you both. So if I have a Hive table of ORC type and it contains 100K 
> rows, there will be 10 row groups of 10K rows each.

Yes.

> Within each row group there will be min, max, count(distinct_value) and sum 
> for each column within that row group. Does count mean the count of distinct 
> values, including null occurrences, for that column?

Actually, it is just count, not count distinct. Newer versions of Hive also 
have the option of including bloom filters for some columns. That enables fast 
searches for particular values in columns that aren't sorted.

> Also, if the table contains 5 columns, will there be 5x10 row groups in total?

The ORC files are laid out in stripes that correspond to roughly ~64MB 
compressed. Each column within a stripe is laid out together. The row groups 
are a feature of the index and correspond to how many entries the index has. So 
yes, within a file with 100k rows, which obviously will be a single stripe, the 
index will have 10 row groups for each column for a total of 50 entries in the 
index. (The index is also laid out in columns so the reader only loads the 
parts of the index it needs for the columns it is reading.)

.. Owen
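
For anyone who wants to see these row-group statistics for themselves, the ORC file dump utility prints them. A hedged sketch; the file path is hypothetical and the exact flag set varies by Hive version:

# dump the metadata of an ORC file, including stripe and row-group statistics
hive --orcfiledump /user/hive/warehouse/test.db/dummy/000000_0

# newer Hive versions also accept --rowindex to print per-column row-group indexes
hive --orcfiledump --rowindex 0,1,2 /user/hive/warehouse/test.db/dummy/000000_0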
  

  

Re: ORC files and statistics

2016-01-19 Thread Ashok Kumar
Thank you both.
So if I have a Hive table of ORC type and it contains 100K rows, there will be 
10 row groups of 10K row each.
within each row group there will be min, max, count(distinct_value) and sum for 
each column within that row group. Does count mean the count of distinct values, 
including null occurrences, for that column?
Also, if the table contains 5 columns, will there be 5x10 row groups in total?
thanks again 

On Tuesday, 19 January 2016, 17:35, Jörn Franke <jornfra...@gmail.com> 
wrote:
 

 Just be aware that you should insert the data sorted, at least on the most 
discriminating column of your where clause.
On 19 Jan 2016, at 17:27, Owen O'Malley <omal...@apache.org> wrote:


It has both. Each index has statistics of min, max, count, and sum for each 
column in the row group of 10,000 rows. It also has the location of the start 
of each row group, so that the reader can jump straight to the beginning of the 
row group. The reader takes a SearchArgument (eg. age > 100)  that limits which 
rows are required for the query and can avoid reading an entire file, or at 
least sections of the file.
.. Owen
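
A small hedged sketch of the pattern Jörn and Owen describe (sorting on the filter column when loading so that row-group min/max statistics can skip data); the table and column names are made up:

CREATE TABLE events_orc (age INT, name STRING, ts STRING)
STORED AS ORC;

-- load sorted on the column most often used in WHERE clauses
INSERT OVERWRITE TABLE events_orc
SELECT age, name, ts FROM events_raw
SORT BY age;

-- the predicate below can now be pushed down; row groups whose
-- min/max range excludes values > 100 are skipped
SELECT count(*) FROM events_orc WHERE age > 100;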
On Tue, Jan 19, 2016 at 7:50 AM, Ashok Kumar <ashok34...@yahoo.com> wrote:

 Hi,
I have read some notes on ORC files in Hive and indexes.
The document describes the indexes but makes reference to statistics:
[Link: "Indexes" (orc.apache.org): ORC provides three level of indexes within 
each file: file level - statistics about the values in each column across the 
entire file]


I am confused as it is mixing up indexes with statistics. Can someone clarify 
these.
Thanks




  

ORC files and statistics

2016-01-19 Thread Ashok Kumar
 Hi,
I have read some notes on ORC files in Hive and indexes.
The document describes the indexes but makes reference to statistics:
[Link: "Indexes" (orc.apache.org): ORC provides three level of indexes within 
each file: file level - statistics about the values in each column across the 
entire file]


I am confused as it is mixing up indexes with statistics. Can someone clarify 
these.
Thanks

equivalent to identity column in Hive

2016-01-16 Thread Ashok Kumar
 Hi,
Is there an equivalent to Microsoft IDENTITY column in Hive please.
Thanks  and regards

foreign keys in Hive

2016-01-10 Thread Ashok Kumar
 hi,
what is the equivalent to foreign keys in Hive?
Thanks


Re: Immutable data in Hive

2016-01-04 Thread Ashok Kumar
I second that. Many thanks Mich for your reply.
Regards 

On Monday, 4 January 2016, 10:58, "Singh, Abhijeet" 
<absi...@informatica.com> wrote:
 

 Very well answered by Mich.

Thanks Mich !!

From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: Sunday, January 03, 2016 8:35 PM
To: user@hive.apache.org; 'Ashok Kumar'
Subject: RE: Immutable data in Hive

Hi Ashok,

I will have a go at this on top of Alan's very valuable clarification.

Extraction, Transformation and Load (ETL) is a very common method in Data 
Warehousing (DW) and Business Analytics projects and can be performed by custom 
programming, like writing shell scripts, Java or .NET tools, or a combination of 
all, to get the data from internal or external sources and put it in the DW.

In general only data of value ends up in the DW. What this means is that in, 
say, a banking environment you collect and feed (Extract) data into a staging 
area (in relational terms often staging tables, or the so-called global 
temporary tables that are cleared daily for the next cycle, in a staging 
database), prune it of unwanted data, do some manipulation (Transformation) 
(which often happens in another set of staging tables) and finally Load it into 
target tables in a Data Warehouse. The analysts then use appropriate tools like 
Tableau to look at macroscopic trends in the data. Remember a Data Warehouse is 
still a relational database, most probably a columnar implementation of the 
relational model like SAP Sybase IQ.

There are many examples of DW repositories used for Business Intelligence (BI, 
another fancy term for Analytics), such as working out global trading 
positioning (I did one of these by bolting Oracle TimesTen IMDB onto an Oracle 
DW for fast extraction) or data gathered from algorithmic trading using Complex 
Event Processing. Obviously, although DWs store larger amounts of data (large 
being a relative term) and have impressive compression like Sybase IQ (every 
column is stored as an index, so it is far more effective to do columnar 
compression, all data being the same type, as opposed to row compression in OLTP 
databases), they still require additional space, SAN storage and expensive 
horizontal scaling (adding another multiplex requires an additional license).

ELT (Extraction, Load and Transform) is a similar concept used in the Big Data 
world. The fundamental difference being that it is not just confined to data 
deemed to be of specific value, meaning you know what you are looking for in 
advance. In Hadoop one can store everything, from structured data (transactional 
databases) to unstructured data (data coming from the internet, Excel sheets, 
email, logs and others). This means that you can store potentially all data to 
be exploited later. The Hadoop ecosystem provides that flexibility by means of 
horizontal scaling on cheap commodity disks (AKA JBOD), and the lack of 
licensing restrictions results in reducing the Total Cost of Ownership (TCO) 
considerably. In summary, you (E)xtract and (L)oad all data as is (not caring 
whether that data is exactly what you want) into HDFS and then you do the 
(T)ransformation later through schema on read (you decide your data needs at 
exploration time).

HDFS is great for storing large amounts of data but on

Re: Immutable data in Hive

2015-12-30 Thread Ashok Kumar
Thank you sir, very helpful. Could you also briefly describe, from your 
experience, the major differences between traditional ETL in a DW and ELT in 
Hive? Why is there an emphasis on taking data from traditional transactional 
databases into Hive tables in the same format and doing the transform in Hive 
afterwards? Is it because Hive is meant to be efficient at data transformation?
Regards

On Wednesday, 30 December 2015, 18:00, Alan Gates <alanfga...@gmail.com> 
wrote:
 

 Traditionally data in Hive was write once (insert) read many.  You could 
append to tables and partitions, add new partitions, etc.  You could remove 
data by dropping tables or partitions.  But there was no updates of data or 
deletes of particular rows.  This was what was meant by immutable.  Hive was 
originally done this way because it was based on MapReduce and HDFS and these 
were the natural semantics given those underlying systems.

For many use cases (e.g. ETL) this is sufficient, and the vast majority of 
people still run Hive this way.

We added transactions and updates and deletes to Hive because some use cases 
require these features.  Hive is being used more and more as a data warehouse, 
and while updates and deletes are less common there they are still required 
(slow changing dimensions, fixing wrong data, deleting records for compliance, 
etc.)  Also streaming data into warehouses from transactional systems is a 
common use case.

Alan.
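
For readers wondering what the transactional tables Alan mentions look like in practice, a minimal hedged sketch (Hive 1.x/2.x syntax; the table and column names are made up, and the transaction manager must already be enabled on the cluster):

CREATE TABLE customer_dim (
  id INT, name STRING, city STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- slow-changing-dimension style fix-ups that plain immutable tables cannot do
UPDATE customer_dim SET city = 'London' WHERE id = 42;
DELETE FROM customer_dim WHERE id = 43;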


    Ashok Kumar  December 29, 2015 at 14:59  Hi,
Can someone please clarify what  "immutable data" in Hive means?
I have been told that data in Hive is/should be immutable but in that case why 
we need transactional tables in Hive that allow updates to data.
thanks and greetings 






  

Immutable data in Hive

2015-12-29 Thread Ashok Kumar
 Hi,
Can someone please clarify what  "immutable data" in Hive means?
I have been told that data in Hive is/should be immutable but in that case why 
we need transactional tables in Hive that allow updates to data.
thanks and greetings 





Difference between ORC and RC files

2015-12-21 Thread Ashok Kumar
 Hi Gurus,
I am trying to understand the advantages that ORC file format offers over RC.
I have read the existing documents but I still don't seem to grasp the main 
differences.
Can someone explain to me, as a user, where ORC scores when compared to RC? What 
I would like to know about is mainly the performance. I am also aware that ORC 
does some smart compression as well.
Finally, is the ORC file format the best choice in Hive?
Thank you



Re: The advantages of Hive/Hadoop compared to Data Warehouse

2015-12-18 Thread Ashok Kumar
Hi,
Thanks for the info. I understand ELT (Extract, Load, Transform) is more 
appropriate for big data compared to traditional ETL. What are the major 
advantages of this in the Big Data space?
For example, if I started using Sqoop to get data from traditional transactional 
and Data Warehouse databases and created the same tables in Hive, what would be 
the next step to get to a consolidated data model in Hive on HDFS? The entry 
tables will be tabular tables in line with the source, correct? How many ELT 
steps generally need to be applied to get to the final model? Will ELT speed up 
this process?
I understand this is a very broad question. However, any comments will be 
welcome.
Regards
 

On Friday, 18 December 2015, 22:27, Jörn Franke <jornfra...@gmail.com> 
wrote:
 

 I think you should draw attention more to the fact that Hive is just one 
component in the ecosystem. You can have many more components, such as ELT, 
integrating unstructured data, machine learning, streaming data etc. However, 
usually analysts are not aware of the technologies, and IT staff are not much 
aware of how they can bring benefits to a specific business domain. You could 
explore the potential together in workshops, design thinking etc. Once you know 
more details and both sides decide on potential ways forward, you can start 
doing PoCs and see what works and what does not. It is important that you break 
old ties created by more traditional data warehouse approaches in the past and 
go beyond the comfort zone.
On 18 Dec 2015, at 22:01, Ashok Kumar <ashok34...@yahoo.com> wrote:


 Gurus,
Some analysts keep asking me the advantages of having Hive tables when the star 
schema in Data Warehouse (DW) does the same.
For example if you have fact and dimensions table in DW and just import them 
into Hive via a say SQOOP, what are we going to gain.
I keep telling them storage economy and cheap disks, de-normalisation can be 
done further etc. However, they are not convinced :(
Any additional comments will help my case.
Thanks a lot


  

The advantages of Hive/Hadoop compared to Data Warehouse

2015-12-18 Thread Ashok Kumar
 Gurus,
Some analysts keep asking me the advantages of having Hive tables when the star 
schema in Data Warehouse (DW) does the same.
For example, if you have fact and dimension tables in the DW and just import 
them into Hive via, say, Sqoop, what are we going to gain?
I keep telling them storage economy and cheap disks, that de-normalisation can 
be taken further, etc. However, they are not convinced :(
Any additional comments will help my case.
Thanks a lot

Re: The advantages of Hive/Hadoop compared to Data Warehouse

2015-12-18 Thread Ashok Kumar
Thank you sir.
Can you please describe in a bit more detail your vision of "a fully 
denormalized columnar store"? Are you referring to getting rid of the star 
schema altogether in Hive and replacing it with ORC tables?
Regards 

On Friday, 18 December 2015, 21:13, Grant Overby (groverby) 
<grove...@cisco.com> wrote:
 

 You forgot horizontal scaling.
A fully denormalized columnar store in Hive will outperform a star schema in 
Oracle in every way imaginable at scale; however, if your data isn't big enough 
then this is a moot point.
If your data fits in a traditional BI warehouse, and especially if it does so 
without people constantly complaining about performance or retention issues, 
your data may simply not be large enough to warrant a Hadoop stack.
How many rows are in your largest fact table?
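
As an illustration of what "fully denormalized" means here, a hedged sketch that flattens a hypothetical star schema into one wide ORC table (all table and column names invented):

CREATE TABLE sales_denorm
STORED AS ORC
AS
SELECT f.sale_id, f.amount, f.sold_at,
       c.customer_name, c.customer_city,
       p.product_name, p.product_category
FROM   sales_fact f
JOIN   customer_dim c ON f.customer_id = c.customer_id
JOIN   product_dim  p ON f.product_id  = p.product_id;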
Grant Overby
Software Engineer
Cisco.com
grove...@cisco.com
Mobile: 865 724 4910



From: Ashok Kumar <ashok34...@yahoo.com>
Reply-To: User <user@hive.apache.org>, Ashok Kumar <ashok34...@yahoo.com>
Date: Friday, December 18, 2015 at 4:01 PM
To: User <user@hive.apache.org>
Subject: The advantages of Hive/Hadoop compared to Data Warehouse

Gurus,
Some analysts keep asking me the advantages of having Hive tables when the star 
schema in Data Warehouse (DW) does the same.
For example if you have fact and dimensions table in DW and just import them 
into Hive via a say SQOOP, what are we going to gain.
I keep telling them storage economy and cheap disks, de-normalisation can be 
done further etc. However, they are not convinced :(
Any additional comments will help my case.
Thanks a lot

  

Fw: Managed to make Hive run on Spark engine

2015-12-07 Thread Ashok Kumar

 This is great news, sir. It shows perseverance pays off at last.
Can you let us know when the write-up is ready, so I can set it up as well 
please?
I know a bit about the advantages of having Hive use the Spark engine. However, 
the general question I have is: when should one use Hive on Spark as opposed to 
Hive on the MapReduce engine?
Thanks again


On Monday, 7 December 2015, 15:50, Mich Talebzadeh  
wrote:
 

 For those interested

From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: 06 December 2015 20:33
To: user@hive.apache.org
Subject: Managed to make Hive run on Spark engine

Thanks all, especially to Xuefu, for the contributions. Finally it works, which 
means don't give up until it works :)

hduser@rhes564::/usr/lib/hive/lib> hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> set spark.home=/usr/lib/spark-1.3.1-bin-hadoop2.6;
hive> set hive.execution.engine=spark;
hive> set spark.master=spark://50.140.197.217:7077;
hive> set spark.eventLog.enabled=true;
hive> set spark.eventLog.dir=/usr/lib/spark-1.3.1-bin-hadoop2.6/logs;
hive> set spark.executor.memory=512m;
hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
hive> set hive.spark.client.server.connect.timeout=22ms;
hive> set spark.io.compression.codec=org.apache.spark.io.LZFCompressionCodec;
hive> use asehadoop;
OK
Time taken: 0.638 seconds
hive> select count(1) from t;
Query ID = hduser_20151206200528_4b85889f-e4ca-41d2-9bd2-1082104be42b
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Spark Job = c8fee86c-0286-4276-aaa1-2a5eb4e4958a

Query Hive on Spark job[0] stages:
0
1

Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2015-12-06 20:05:36,299 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1
2015-12-06 20:05:39,344 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
2015-12-06 20:05:40,350 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished
Status: Finished successfully in 8.10 seconds
OK

The versions used for this project:

OS version: Linux version 2.6.18-92.el5xen (brewbuil...@ls20-bc2-13.build.redhat.com) 
(gcc version 4.1.2 20071124 (Red Hat 4.1.2-41)) #1 SMP Tue Apr 29 13:31:30 EDT 2008
Hadoop 2.6.0
Hive 1.2.1
spark-1.3.1-bin-hadoop2.6 (downloaded from the prebuilt spark-1.3.1-bin-hadoop2.6.gz 
for starting the Spark standalone cluster)

The jar file used in $HIVE_HOME/lib to link Hive to Spark was 
spark-assembly-1.3.1-hadoop2.4.0.jar (built from the source downloaded as the 
zipped file spark-1.3.1.gz and built with the command line make-distribution.sh 
--name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided").

It is pretty picky on parameters, CLASSPATH, IP addresses or hostnames etc. to 
make it work.

I will create a full guide on how to build and make Hive run with Spark as its 
engine (as opposed to MR).

HTH

Mich Talebzadeh

http://talebzadehmich.wordpress.com/

Best practices for monitoring hive

2015-11-13 Thread Ashok Kumar
 Hi,
I would like to know best practices for monitoring the health and performance of 
Hive and HiveServer, troubleshooting and catching errors etc.
To be clear, we do not use any bespoke monitoring tool and are keen on 
developing our own in-house tools, to be integrated into general monitoring 
tools picked up by operations.
greetings and thanks

exporting and importing a schema/database from Hive Test to Hive DEV

2015-11-11 Thread Ashok Kumar
 Hi gurus,
What is the easiest way of exporting one database in Hive from Test and 
importing it into, say, a DEV database in another instance?
How will the metastore at the target handle this, please?
Thanks 
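
No answer appears in this thread, but one hedged sketch of how it is often done, per table with EXPORT/IMPORT (the IMPORT re-creates the table metadata in the target metastore, which addresses the metastore question). Database names and paths are hypothetical, and the loop is only illustrative:

# on the Test cluster: export every table of the source database
for t in $(hive -S -e "USE testdb; SHOW TABLES;"); do
  hive -e "EXPORT TABLE testdb.$t TO '/tmp/hive_export/$t';"
done

# copy /tmp/hive_export between clusters with distcp, then on DEV:
#   hive -e "CREATE DATABASE IF NOT EXISTS devdb;"
#   hive -e "IMPORT TABLE devdb.<table> FROM '/tmp/hive_export/<table>';"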

Re: Hive and HBase

2015-11-10 Thread Ashok Kumar
Hi,

It is from Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data 
Computing in the Apache Hadoop 2 Ecosystem (Addison-Wesley Data & Analytics 
Series) (Kindle Locations 735-738).

Apache Hive is a data warehouse infrastructure built on top of Hadoop for 
providing data summary, ad hoc queries, and the analysis of large data sets 
using an SQL-like language called HiveQL. Hive transparently translates queries 
into MapReduce jobs that are executed in HBase. Hive is considered the de facto 
standard for interactive SQL queries over petabytes of data  


 On Tuesday, 10 November 2015, 13:03, Binglin Chang <decst...@gmail.com> 
wrote:
   

 
Hive transparently translates queries into MapReduce jobs that are executed in 
HBase

I think this is not correct, are you sure it is from some book? 
On Tue, Nov 10, 2015 at 6:56 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:

 hi,
I have read in a book about Hadoop that says
Apache Hive is a data warehouse infrastructure built on top of Hadoop for 
providing data summary, ad hoc queries, and the analysis of large data sets 
using an SQL-like language called HiveQL. 
Hive transparently translates queries into MapReduce jobs that are executed in 
HBase. Hive is considered the de facto standard for interactive SQL queries 
over petabytes of data.
What is the relation between Hive and HBase? I always thought that HBase is an 
independent database.
Is it correct that Hive itself uses a MapReduce engine that in turn uses HBase 
as the database? I always thought that Hive is a data warehouse database, or am 
I missing something?

Thanking you





  

DML operations and transactions in Hive

2015-11-06 Thread Ashok Kumar
 Hi,
I would like to understand a bit more about insert/update/select operations in 
Hive.
I believe ORC table format offers the best performance and concurrency.
Is ORC the best format for DML operations. What is the granularity of locks? Is 
it partition or row.
Also, how are ACID properties implemented in Hive? Does it follow read 
consistency like Oracle, please? For example, if two updates on the same row 
happen at more or less the same time, will the exclusive lock be taken by the 
first user and the second user be blocked until the first user finishes updating 
and the transaction is committed, or will the second user see the row as it was 
before any change was made by the first user?
Thanks

Re: clarification please

2015-10-29 Thread Ashok Kumar
Thank you sir. Very helpful 


 On Thursday, 29 October 2015, 15:22, Alan Gates <alanfga...@gmail.com> 
wrote:
   

 


    Ashok Kumar  October 28, 2015 at 22:43 hi gurus,
kindly clarify the following please
   
   - Hive currently does not support indexes or indexes are not used in the 
query

Mostly true.  There is a create index, but Hive does not use the resulting 
index by default.  Some storage formats (ORC, Parquet I think) have their own 
indices they use internally to speed access.

  
   - The lowest granularity for concurrency is partition. If table is 
partitioned, then partition will be lucked in DML operation
  
lucked =locked?  I'm not sure what you intended here.  If you mean locked, then 
it depends.  By default Hive doesn't use locking.  You can set it up to do 
locking via ZooKeeper or as part of Hive transactions.  They have different 
locking models.  See 
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions and 
https://cwiki.apache.org/confluence/display/Hive/Locking for more information.

You can sub-partition using buckets, but for most queries partition is the 
lowest level of granularity.  Hive does a lot of work to optimize only reading 
relevant partitions for a query.

  
   - What is the best file format to store Hive table in HDFS? Is this ORC or 
Avro that allow being split and support block compression?
  
It depends on what you want to do.  ORC and Parquet do better for traditional 
data warehousing type queries because they are columnar formats and have lots 
of optimization built in for fast access, pushing filter down into the storage 
level etc. People like Avro and other self describing formats when their data 
brings its own structure.  We very frequently see pipelines where people dump 
Avro, text, etc. into Hive and then ETL it into ORC.

  
   - Text/CSV files. By default if file type is not specified at creation time, 
Hive will default to text file?
  
Out of the box yes, but you can change that in your Hive installation by 
setting hive.default.fileformat in your hive-site.xml.

Alan.
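
As a companion to Alan's pointer on locking, a hedged sketch of the relevant knobs and commands (property names come from the Hive transactions documentation; treat the exact values as assumptions to check for your version):

-- hive-site.xml settings typically required for the transactional lock manager:
--   hive.support.concurrency = true
--   hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager

-- then, from a Hive session, you can inspect what is locked and in flight
SHOW LOCKS;
SHOW LOCKS my_table;
SHOW TRANSACTIONS;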

  

Thanks 


  

clarification please

2015-10-28 Thread Ashok Kumar
hi gurus,
kindly clarify the following please
   
   - Hive currently does not support indexes or indexes are not used in the 
query
   - The lowest granularity for concurrency is partition. If table is 
partitioned, then partition will be lucked in DML operation
   - What is the best file format to store Hive table in HDFS? Is this ORC or 
Avro that allow being split and support block compression?
   - Text/CSV files. By default if file type is not specified at creation time, 
Hive will default to text file?


Thanks

downloading RDBMS table data to Hive with Sqoop import

2015-05-05 Thread Ashok Kumar
Hi gurus,

I can use Sqoop import to get RDBMS data, say from Oracle, into Hive first and 
then use incremental append for new rows with a PK and last value.

However, how do you account for updates and deletes with Sqoop, without a full 
reload of the table from the RDBMS to Hive?

Thanks
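
For the update part of the question, Sqoop's lastmodified incremental mode with a merge key is the usual answer (deletes still need a full reload or a separate reconciliation step). A hedged sketch: the flags are standard Sqoop 1.x options, everything else (connection, table, column names, paths) is hypothetical:

sqoop import \
  --connect "jdbc:oracle:thin:@dbhost:1521:mydb" --username scott -P \
  --table SALES \
  --incremental lastmodified \
  --check-column LAST_UPDATED \
  --last-value "2015-05-01 00:00:00" \
  --merge-key ID \
  --target-dir /warehouse/sales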

Hive and Impala

2015-04-27 Thread Ashok Kumar
Hi gurus,

Kindly help me understand the advantage that Impala has over Hive.

I read a note that Impala does not use MapReduce engine and is therefore very 
fast for queries compared to Hive. However, Hive as I understand is widely used 
everywhere!

Thank you

Re: partition and bucket

2015-04-14 Thread Ashok Kumar
 to hash partitioning, where a hashing function is applied. Like an RDBMS, Hive 
will apply a linear hashing algorithm to prevent data from clustering within 
specific partitions. Hashing is very effective if the column selected for 
bucketing has very high selectivity, like an ID column where selectivity 
(select count(distinct(column))/count(column)) = 1. In this case, the created 
partitions/files will be as evenly sized as possible. In a nutshell, bucketing 
is a method to get data evenly distributed over many partitions/files. One 
should define the number of buckets as a power of two -- 2^n, like 2, 4, 8, 16 
etc. -- to achieve best results. Again, bucketing will help concurrency in Hive. 
It may even allow a partition-wise join, i.e. a join between two tables that are 
bucketed on the same column with the same number of buckets (has anyone tried 
this?).

One more thing: when one defines the number of buckets at table creation level 
in Hive, the number of partitions/files will be fixed. In contrast, with 
partitioning you do not have this limitation.

HTH

Mich

From:
Ashok Kumar [mailto:ashok34...@yahoo.com] 
Sent: 10 April 2015 17:46
To: user@hive.apache.org
Subject: partition and bucket  
Greetings all,

Glad to join the user group. I am from a DBA background: Oracle/Sybase/MSSQL.

I would like to understand partitioning and bucketing in Hive and the difference 
between them.

I shall be grateful if someone explains where I should use a partition or a 
bucket for best practice.

thanks
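
To tie Mich's explanation in this thread to concrete DDL, a hedged sketch of a table that is both partitioned and bucketed (names and bucket count are illustrative only):

CREATE TABLE trades (
  trade_id BIGINT, symbol STRING, amount DOUBLE)
PARTITIONED BY (trade_date STRING)       -- one directory per partition value
CLUSTERED BY (trade_id) INTO 16 BUCKETS  -- hash of trade_id spread over 2^4 files
STORED AS ORC;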