Looking for a working mysql import sqoop command line

2012-05-12 Thread David Morel

Hi everyone,

I must say I'm only starting, and this is on cdh3u3.

I'm struggling with the Sqoop options at the moment, trying to import into 
Hive some MySQL tables whose fields contain tabs, newlines, and all sorts of 
other characters.


I cannot figure out a combination of options on the sqoop command line that 
turns these characters either into literal '\t' sequences (in plain text, as 
the mysql command-line client does) or into anything usable, nor can I get 
--mysql-delimiters to produce usable output (quotes are left in the final 
file in Hive). I have tried many things, tried to understand the docs, and am 
miserably failing to produce anything usable.


The best I can do for now is to pipe the output of mysql -B (which does all 
the escaping I need) into a FIFO, have hadoop fs -put read from that FIFO, 
and then use the resulting file in Hive.


Could anyone either:
- provide a working sqoop command line that actually works fine in 
production (preferably with --direct, since the files I have to put 
there are quite big), or
- suggest alternative solutions, since maybe I'm going about this in a 
completely wrong way?


Thanks a million!

David Morel


Skew join failure

2012-11-30 Thread David Morel

Hi,

I am trying to solve the "last reducer hangs in GC because of truckloads of 
data" issue that I hit on some queries, by using SET 
hive.optimize.skewjoin=true;. Unfortunately, every time I try this, I 
encounter an error of the form:

...
2012-11-30 10:42:39,181 Stage-10 map = 100%,  reduce = 100%, Cumulative 
CPU 406984.1 sec
MapReduce Total cumulative CPU time: 4 days 17 hours 3 minutes 4 seconds 
100 msec

Ended Job = job_201211281801_0463
java.io.FileNotFoundException: File
hdfs://nameservice1/tmp/hive-dmorel/hive_2012-11-30_09-23-00_375_8178040921995939301/-mr-10014/hive_skew_join_bigkeys_0
does not exist.
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:365)
at org.apache.hadoop.hive.ql.plan.ConditionalResolverSkewJoin.getTasks(ConditionalResolverSkewJoin.java:96)
at org.apache.hadoop.hive.ql.exec.ConditionalTask.execute(ConditionalTask.java:81)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:133)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1332)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1123)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:931)
...

Googling didn't give me any hints on how to debug or solve this, so I'd be 
glad for any pointer on where to start looking.


I'm using CMF4.0 currently, so Hive 0.8.1.

Thanks a lot!

David Morel


Re: Skew join failure

2012-12-03 Thread David Morel

On 30 Nov 2012, at 16:46, Mark Grover wrote:


Hi David, It seems like Hive is unable to find the skewed keys on
HDFS. Did you set the hive.skewjoin.key property? If so, to what value?


Hey Mark,

thanks for answering!

I didn't set it to anything, but left it at its default value (100,000
IIRC). I should probably have set it to a much lower value (I guess?),
but I fail to understand why not meeting the threshold would break the
whole thing. I guess I have to inspect the logs more closely? Do you
have real-life examples of skew-join parameter settings? The docs are
really scarce about it...
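
For reference, the knobs under discussion look like the sketch below; the
threshold value and the table/column names are purely illustrative, not
settings taken from this thread:

SET hive.optimize.skewjoin=true;
-- rows per join key above which the key is treated as skewed
SET hive.skewjoin.key=50000;

SELECT a.id, a.payload, b.label
FROM big_table a
JOIN small_table b ON (a.id = b.id);

The query itself is unchanged; only the session settings differ.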

thanks!

David



Mark

On Fri, Nov 30, 2012 at 2:10 AM, David Morel
david.mo...@amakuru.net wrote:


Hi,

I am trying to solve the last reducer hangs because of GC because of
truckloads of data issue that I have on some queries, by using SET
hive.optimize.skewjoin=true; Unfortunately, every time I try this, I
encounter an error of the form:

...
2012-11-30 10:42:39,181 Stage-10 map = 100%,  reduce = 100%, Cumulative CPU 406984.1 sec
MapReduce Total cumulative CPU time: 4 days 17 hours 3 minutes 4 seconds 100 msec

Ended Job = job_201211281801_0463
java.io.FileNotFoundException: File
hdfs://nameservice1/tmp/hive-dmorel/hive_2012-11-30_09-23-00_375_8178040921995939301/-mr-10014/hive_skew_join_bigkeys_0
does not exist.
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:365)
at org.apache.hadoop.hive.ql.plan.ConditionalResolverSkewJoin.getTasks(ConditionalResolverSkewJoin.java:96)
at org.apache.hadoop.hive.ql.exec.ConditionalTask.execute(ConditionalTask.java:81)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:133)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1332)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1123)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:931)
...

Googling didn't give me any indication on how to debug/solve this, so
I'd be glad if I could get any indication where to start looking.

I'm using CMF4.0 currently, so Hive 0.8.1.


Thrift Hive client for CDH 4.1 HiveServer2?

2013-01-03 Thread David Morel
Hi all (and happy New Year!)

Is it possible to build a perl Thrift client for HiveServer2 (from
Cloudera's 4.1.x)?

I'm following the instructions found here:
http://stackoverflow.com/questions/5289164/perl-thrift-client-to-hive

I downloaded Hive from Cloudera's site, but then I'm a bit lost: where do I
find the Thrift files that I need to build the Perl libs? I have the Thrift
compiler working OK, but that's as far as I got.

Any help would be most welcome.

Thanks!

D.Morel


Re: Thrift Hive client for CDH 4.1 HiveServer2?

2013-01-05 Thread David Morel
On 4 Jan 2013, at 16:04, Jov wrote:

they are in the src/service/if and src/metastore/if

Cool. But these would be files for HiveServer, not HiveServer2 which has a
different API, right? After finally generating the libs, it turns out they
work fine on the old-style hive server, but produce this in hiveserver2's
log:

13/01/04 20:09:11 ERROR server.TThreadPoolServer: Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:218)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:170)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:182)
at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:124)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:40)
at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:215)
... 4 more

Where should I start looking (meaning I haven't a clue)? Thanks!
David

On 4 Jan 2013, at 7:16 AM, David Morel dmore...@gmail.com wrote:

Hi all (and happy New Year!) Is it possible to build a perl Thrift client
for HiveServer2 (from Cloudera's 4.1.x) ? I'm following the instructions
found here:
http://stackoverflow.com/questions/5289164/perl-thrift-client-to-hive
Downloaded Hive from Cloudera's site, then i'm a bit lost: where do I find
these thrift files that I need to build the perl libs? I have the thrift
compiler working ok, but thats as far as I got.


Re: Thrift Hive client for CDH 4.1 HiveServer2?

2013-01-05 Thread David Morel
So that would probably be generated using src/service/if/cli_service.thrift
instead of the older hive_service.thrift, which I suppose is for HiveServer1.
I compiled it, and I'm still getting errors that seem transport-related:

13/01/04 23:02:22 ERROR server.TThreadPoolServer: Error occurred during
processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException
at
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:218)
...

This is a bit tedious...

D.Morel


On Sat, Jan 5, 2013 at 10:27 AM, Jov zhao6...@gmail.com wrote:

 here:
 https://issues.apache.org/jira/browse/HIVE-2935
 https://cwiki.apache.org/Hive/hiveserver2-thrift-api.html
 HiveServer2 is currently a CDH extension.

 I think you can use the find command to search the CDH source dir for the
 .thrift files.


 2013/1/5 David Morel dmore...@gmail.com

 On 4 Jan 2013, at 16:04, Jov wrote:

 they are in the src/service/if and src/metastore/if

 Cool. But these would be files for HiveServer, not HiveServer2 which has
 a different API, right? After finally generating the libs, it turns out
 they work fine on the old-style hive server, but produce this in
 hiveserver2's log: 13/01/04 20:09:11 ERROR server.TThreadPoolServer: Error
 occurred during processing of message. java.lang.RuntimeException:
 org.apache.thrift.transport.TTransportException at
 org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:218)
 at
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:170)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662) Caused by:
 org.apache.thrift.transport.TTransportException at
 org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
 at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) at
 org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:182)
 at
 org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:124)
 at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
 at
 org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:40)
 at
 org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:215)
 ... 4 more Where should I start looking (meaning I haven't a clue)? Thanks!
 David

 On 4 Jan 2013, at 7:16 AM, David Morel dmore...@gmail.com wrote:

 Hi all (and happy New Year!) Is it possible to build a perl Thrift client
 for HiveServer2 (from Cloudera's 4.1.x) ? I'm following the instructions
 found here:
 http://stackoverflow.com/questions/5289164/perl-thrift-client-to-hive
 Downloaded Hive from Cloudera's site, then i'm a bit lost: where do I find
 these thrift files that I need to build the perl libs? I have the thrift
 compiler working ok, but thats as far as I got.




 --
 jov
 blog: http://amutu.com/blog



An explanation of LEFT OUTER JOIN and NULL values

2013-01-24 Thread David Morel
Hi!

After hitting the 'curse of the last reducer' many times on LEFT OUTER
JOIN queries, and trying to think about it, I came to the conclusion that
there's something I am missing regarding how keys are handled in mapred
jobs.

The problem shows up when I have a table A containing billions of rows with
distinct keys, which I need to join to a table B that has a much lower
number of rows.

I need to keep all the A rows, populated with NULL values from the B
side, so that's what a LEFT OUTER is for.
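
Schematically, the kind of query under discussion looks like this (table and
column names are hypothetical):

SELECT a.key, a.val, b.val
FROM table_a a
LEFT OUTER JOIN table_b b ON (a.key = b.key);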

Now, when transforming that into a mapred job, my (naive) understanding
would be that for every key in the A table, a missing key on the B side
would be generated with a NULL value. If that were the case, I fail to
understand why all NULL-valued B keys would end up on the same reducer,
since the key determines which reducer is used, not the value.

So, obviously, this is not how it works.

So my question is: how is this construct handled?

Thanks a lot!

D.Morel


Re: An explanation of LEFT OUTER JOIN and NULL values

2013-01-24 Thread David Morel
On 24 Jan 2013, at 18:16, bejoy...@yahoo.com wrote:

 Hi David

 An explain extended would give you the exact pointer.

 From my understanding, this is how it could work.

 You have two tables, so two different map-reduce jobs would be
 processing those. Based on the join keys, a combination of the corresponding
 columns is chosen as the key from mapper1 and mapper2. So if the
 combination of columns has the same value, those records from the two
 sets of mappers go to the same reducer.

 On the reducer, if there is a corresponding value for a key from table
 1 in table 2/mapper 2, that value is populated. If there is no value from
 mapper 2, then the columns from table 2 are made null.

 If there is a key-value pair only from table 2/mapper 2 and no
 corresponding value from mapper 1, that value is just discarded.

Hi Bejoy,

Thanks! So schematically, something like this, right?

mapper1 (bigger table):
K1-A, V1A
K2-A, V2A
K3-A, V3A

mapper2 (joined, smaller table):
K1-B, V1B

reducer1:
K1-A, V1A 
K1-B, V1B

returns:
K1, V1A, V1B etc

reducer2:
K2-A, V2A
*no* K2-B, V so: K2-B, NULL is created, same for next row.
K3-A, V3A

returns:
K2, V2A, NULL etc
K3, V3A, NULL etc

I still don't understand why my reducer2 (and only this one, which
apparently gets all the keys for which we don't have a row on table B)
would become overloaded. Am I completely misunderstanding the whole
thing?

David

 Regards
 Bejoy KS

 Sent from remote device, Please excuse typos

 -Original Message-
 From: David Morel dmore...@gmail.com
 Date: Thu, 24 Jan 2013 18:03:40
 To: user@hive.apache.org
 Reply-To: user@hive.apache.org
 Subject: An explanation of LEFT OUTER JOIN and NULL values

 Hi!

 After hitting the curse of the last reducer many times on LEFT OUTER
 JOIN queries, and trying to think about it, I came to the conclusion
 there's something I am missing regarding how keys are handled in
 mapred jobs.

 The problem shows when I have table A containing billions of rows with
 distinctive keys, that I need to join to table B that has a much lower
 number of rows.

 I need to keep all the A rows, populated with NULL values from the B
 side, so that's what a LEFT OUTER is for.

 Now, when transforming that into a mapred job, my -naive-
 understanding would be that for every key on the A table, a missing
 key on the B table would be generated with a NULL value. If that were
 the case, I fail to understand why all NULL valued B keys would end up
 on the same reducer, since the key defines which reducer is used, not
 the value.

 So, obviously, this is not how it works.

 So my question is: how is this construct handled?

 Thanks a lot!

 D.Morel



Re: An explanation of LEFT OUTER JOIN and NULL values

2013-01-24 Thread David Morel
On 24 Jan 2013, at 20:39, bejoy...@yahoo.com wrote:

 Hi David,

 The default partitioner used in map reduce is the hash partitioner. So,
 based on your keys, they are sent to a particular reducer.

 Maybe in your current data set, the keys that have no values in the other
 table are all falling in the same hash bucket and hence being processed by
 the same reducer.

Really not the case, no, so it doesn't make any sort of sense to me :\
At this point I am starting to think the only way to figure it out is to
set or add debugging at the reducer level. I hoped I could avoid it,
alas...

 If you are noticing a skew on a particular reducer, sometimes a
 simple workaround like increasing the number of reducers explicitly might
 help you get past the hurdle.

Yes, that seemed to work in some cases, but is not very handy: it's
hard to tell my users 'try setting a different number of reducers when
your query is stuck' ;-)

 Also please ensure you have enabled skew join optimization.

Didn't have much success with this, but maybe my version of hive is a
bit old.

So to emulate a LEFT OUTER JOIN I had to do something really horrible:

JOIN (
  -- table_b is the joined table
  SELECT key, value FROM table_b
  UNION ALL
  -- table_a has all the unique ids
  SELECT key, '' AS value FROM table_a
) sub

and then use COUNT(*) - 1 per key to get a count of non-NULL rows, etc. At
least it does work with no slowdowns, but I mean, yuck!
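
Spelled out a bit more, the pattern reads roughly as follows (a sketch only,
with hypothetical table and column names; the '- 1' accounts for the dummy
row contributed by table_a for every key):

SELECT a.key, COUNT(*) - 1 AS b_rows
FROM table_a a
JOIN (
  SELECT key, value FROM table_b
  UNION ALL
  SELECT key, '' AS value FROM table_a
) sub ON (a.key = sub.key)
GROUP BY a.key;

Every key from table_a matches at least its own dummy row, so the inner join
keeps all table_a keys without relying on NULL join keys.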

It's the best I could come up with so far, so if I could fully understand
the root cause of the problem, that would be much better. I guess I'll 
dig in a bit deeper then.

Thanks a lot!

David


 Regards
 Bejoy KS

 Sent from remote device, Please excuse typos

 -Original Message-
 From: David Morel dmore...@gmail.com
 Date: Thu, 24 Jan 2013 18:39:56
 To: user@hive.apache.org; bejoy...@yahoo.com
 Reply-To: user@hive.apache.org
 Subject: Re: An explanation of LEFT OUTER JOIN and NULL values

 On 24 Jan 2013, at 18:16, bejoy...@yahoo.com wrote:

 Hi David

 An explain extended would give you the exact pointer.

 From my understanding, this is how it could work.

 You have two tables then two different map reduce job would be
 processing those. Based on the join keys, combination of corresponding
 columns would be chosen as key from mapper1 and mapper2. So if the
 combination of columns having the same value those records from two
 set of mappers would go into the same reducer.

 On the reducer if there is a corresponding value for a key from table
 1 to  table 2/mapper 2 that value would be populated. If no val for
 mapper 2 then those columns from table 2 are made null.

 If there is a key-value just from table 2/mapper 2 and no
 corresponding value from mapper 1. That value is just discarded.

 Hi Bejoy,

 Thanks! So schematically, something like this, right?

 mapper1 (bigger table):
 K1-A, V1A
 K2-A, V2A
 K3-A, V3A

 mapper2 (joined, smaller table):
 K1-B, V1B

 reducer1:
 K1-A, V1A
 K1-B, V1B

 returns:
 K1, V1A, V1B etc

 reducer2:
 K2-A, V2A
 *no* K2-B, V so: K2-B, NULL is created, same for next row.
 K3-A, V3A

 returns:
 K2, V2A, NULL etc
 K3, V3A, NULL etc

 I still don't understand why my reducer2 (and only this one, which
 apparently gets all the keys for which we don't have a row on table B)
 would become overloaded. Am I completely misunderstanding the whole
 thing?

 David

 Regards
 Bejoy KS

 Sent from remote device, Please excuse typos

 -Original Message-
 From: David Morel dmore...@gmail.com
 Date: Thu, 24 Jan 2013 18:03:40
 To: user@hive.apache.orguser@hive.apache.org
 Reply-To: user@hive.apache.org
 Subject: An explanation of LEFT OUTER JOIN and NULL values

 Hi!

 After hitting the curse of the last reducer many times on LEFT OUTER
 JOIN queries, and trying to think about it, I came to the conclusion
 there's something I am missing regarding how keys are handled in
 mapred jobs.

 The problem shows when I have table A containing billions of rows with
 distinctive keys, that I need to join to table B that has a much lower
 number of rows.

 I need to keep all the A rows, populated with NULL values from the B
 side, so that's what a LEFT OUTER is for.

 Now, when transforming that into a mapred job, my -naive-
 understanding would be that for every key on the A table, a missing
 key on the B table would be generated with a NULL value. If that were
 the case, I fail to understand why all NULL valued B keys would end up
 on the same reducer, since the key defines which reducer is used, not
 the value.

 So, obviously, this is not how it works.

 So my question is: how is this construct handled?

 Thanks a lot!

 D.Morel



Re: Real-life experience of forcing smaller input splits?

2013-01-25 Thread David Morel
On 25 Jan 2013, at 10:37, Bertrand Dechoux wrote:

 It seems to me the question has not been answered:
 is it possible, yes or no, to force a smaller split size
 than a block on the mappers?

 Not that I know of (but you could implement something to do it), but why would
 you do it?
 By default, if the split is set under the size of a block, it will be a
 block.
 One of the reasons is data locality. The second is that a block is written
 to a single hard drive (leaving replicas aside), so if n mappers were
 reading n parts from the same block, they would share the hard drive's
 bandwidth... So it is not a clear win.

 You can change the block size of the file you want to read, but using a
 smaller block size is really an anti-pattern. Most people increase the
 block size.
 (Note: the block size of a file is fixed when writing the file, and it can be
 different between two different files.)

 Are you trying to handle data which are too small?
 If Hive supports multi-threading for mappers it might be a solution, but I
 don't know the configuration for that.

The files are RCFiles with a block size of 128MB IIRC, but the file
compression achieves a ratio of nearly 1 to 100. When going through the
mapper, there is simply not enough memory available to it. Since the
compression scheme is BLOCK, I expected it would be possible to instruct
Hive to process only a limited number of fragments instead of everything
that's in the file in one go.

David


Re: Real-life experience of forcing smaller input splits?

2013-01-25 Thread David Morel

On 25 Jan 2013, at 10:37, Bertrand Dechoux wrote:


It seems to me the question has not been answered:
is it possible, yes or no, to force a smaller split size
than a block on the mappers?

Not that I know of (but you could implement something to do it), but why
would you do it?
By default, if the split is set under the size of a block, it will be a
block.
One of the reasons is data locality. The second is that a block is written
to a single hard drive (leaving replicas aside), so if n mappers were
reading n parts from the same block, they would share the hard drive's
bandwidth... So it is not a clear win.

You can change the block size of the file you want to read, but using a
smaller block size is really an anti-pattern. Most people increase the
block size.
(Note: the block size of a file is fixed when writing the file, and it can
be different between two different files.)


That will be my approach for now, or disabling compression altogether for
these files. The only problem I have is that compression is so efficient
that any operation in the mapper (so on the uncompressed data) just makes
the mapper throw an OOM exception, no matter how much memory I give it.

What partly works, though, is setting a low mapred.max.split.size. In a
directory containing 34 files, I get 33 mappers (???). When setting
hive.merge.mapfiles to false (and leaving mapred.max.split.size at its fs
blocksize default), it doesn't seem to have any effect and I get only 20
mappers.


Are you trying to handle data which are too small?
If Hive supports multi-threading for mappers it might be a solution, but I
don't know the configuration for that.

Regards

Bertrand

PS : the question is quite general and not really hive related


I realized that after re-reading the whole thread :-)

Thanks for all the answers, everyone!

David

On Fri, Jan 25, 2013 at 8:46 AM, Edward Capriolo
edlinuxg...@gmail.com wrote:


Not all files are splittable. Sequence files are; raw gzip files are not.


On Fri, Jan 25, 2013 at 1:47 AM, Nitin Pawar
nitinpawar...@gmail.com wrote:



set mapred.min.split.size=1024000;
set mapred.max.split.size=4096000;
set hive.merge.mapfiles=false;

I had set the above values, and setting the max split size to a lower value
did increase my number of maps. My block size was 128MB.
The only thing was that my files on HDFS were not heavily compressed, and I
was using RCFileFormat.

I would suggest that if you have heavily compressed files, you check what
the size will be after decompression and allocate more memory to the maps.


On Fri, Jan 25, 2013 at 11:46 AM, David Morel dmore...@gmail.com 
wrote:



Hello,

I have seen many posts on various sites and MLs, but didn't find a firm
answer anywhere: is it possible yes or no to force a smaller split size
than a block on the mappers, from the client side? I'm not after
pointers to the docs (unless you're very very sure :-) but after
real-life experience along the lines of 'yes, it works this way, I've
done it like this...'

All the parameters that I could find (especially specifying a max input
split size) seem to have no effect, and the files that I have are so
heavily compressed that they completely saturate the mappers' memory
when processed.

A solution I could imagine for this specific issue is reducing the block
size, but for now I simply went with disabling in-file compression for
those. And changing the block size on a per-file basis is something I'd
like to avoid if at all possible.

All the hive settings that we tried only got me as far as raising the
number of mappers from 5 to 6 (yay!) where I would have needed at least
ten times more.

Thanks!

D.Morel





--
Nitin Pawar







--
Bertrand Dechoux




Re: Avro Backed Hive tables

2013-03-12 Thread David Morel

On 7 Mar 2013, at 2:43, Murtaza Doctor wrote:


Folks,

Wanted to get some help or feedback from the community on this one:


Hello,

in that case it is advisable to start a new thread, and not 'reply-to' 
when you compose your email :-)


Have a nice day

David


Re: Partition performance

2013-07-03 Thread David Morel
On 2 Jul 2013, at 16:51, Owen O'Malley wrote:

 On Tue, Jul 2, 2013 at 2:34 AM, Peter Marron 
 peter.mar...@trilliumsoftware.com wrote:

 Hi Owen,


 I’m curious about this advice about partitioning. Is there some
 fundamental reason why Hive is slow when the number of partitions is
 10,000 rather than 1,000?


 The precise numbers don't matter. I wanted to give people a ballpark range
 that they should be looking at. Most tables at 1000 partitions won't cause
 big slow downs, but the cost scales with the number of partitions. By the
 time you are at 10,000 the cost is noticeable. I have one customer who has
 a table with 1.2 million partitions. That causes a lot of slow downs.

That is still not really answering the question, which is: why is it slower
to run a query on a heavily partitioned table than it is on the same number
of files in a less heavily partitioned table?

David


Re: Seeking Help configuring log4j for sqoop import into hive

2013-11-11 Thread David Morel

On 12 Nov 2013, at 0:01, Sunita Arvind wrote:


Just in case this acts as a workaround for someone:
The issue is resolved if I eliminate the where clause in the query (just
keep where $CONDITIONS). So 2 workarounds I can think of now are:
1. Create views in Oracle and query without the where clause in the sqoop
import command
2. Import everything in the table (not feasible in most cases)

However, I still need to know how to get the exact stack trace.

regards
Sunita


On Mon, Nov 11, 2013 at 1:48 PM, Sunita Arvind
sunitarv...@gmail.com wrote:



Hello,

I am using sqoop to import data from oracle into hive. Below is my SQL:

nohup sqoop import --connect jdbc:oracle:thin:@(DESCRIPTION = (ADDRESS =
(PROTOCOL = TCP)(HOST = xxx)(PORT = )) (CONNECT_DATA = (SERVER =
DEDICATED) (SERVICE_NAME = CDWQ.tms.toyota.com) (FAILOVER_MODE=
(TYPE=select) (METHOD=basic  --username   --password
--split-by employeeid --query  SELECT e.employeeid,p.salary from employee
e, payroll p
where e.employeeid =p.employeeid and $CONDITIONS
--create-hive-table  --hive-table EMPLOYEE --hive-import --target-dir
/user/hive/warehouse/employee --direct --verbose

Note: This is production data hence I cannot share the log file or actual
query. Sorry for that.

Similar query works for some tables and for this particular table, there
is an exception as below:

java.io.IOException: SQLException in nextKeyValue
 at
org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:266)
 at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:484)
 at
org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
 at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
 at
org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
 at 
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:673)

 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
 at org.apache.hadoop.mapred.Child.main(Child.java:262)



This is usually the case when your PK (on which Sqoop will try to do the 
split) isn't an integer.


my 2c.

David


Re: HiveServer2

2013-11-19 Thread David Morel
On 18 Nov 2013, at 21:59, Stephen Sprague wrote:

 A word of warning for users of HiveServer2 - version 0.11 at least. This
 puppy has the ability to crash and/or hang your server with a memory leak.

 Apparently it's not new, since googling shows this discussed before, and I see
 a reference to a workaround here:

 https://cwiki.apache.org/confluence/display/Hive/Setting+up+HiveServer2

 Anyhoo. Consider this a Public Service Announcement. Take heed.

 Regards,
 Stephen.

When setting fs.hdfs.impl.disable.cache to false I have all my ALTER TABLE 
statements involving managed tables throw an Error 1 in Hive (nothing more).
Can anyone confirm that behaviour?

David


Re: Hive query taking a lot of time just to launch map-reduce jobs

2013-11-25 Thread David Morel

On 25 Nov 2013, at 11:50, Sreenath wrote:


hi all,

We are using Hive for ad-hoc querying and have a Hive table which is
partitioned on two fields (date, id). Now, for each date there are around
1400 ids, so on a single day around that many partitions are added. The
actual data resides in S3. The issue we are facing is that if we do a
select count(*) for a month from the table, it takes quite a long
time (approx. 1 hr 52 min) just to launch the map-reduce job.
When I ran the query in Hive verbose mode I could see that it spends this
time deciding how many mappers to spawn (calculating splits). Is there any
means by which I can reduce this lag before the launch of the map-reduce
job?

This is one of the log messages being logged during this lag time:

13/11/19 07:11:06 INFO mapred.FileInputFormat: Total input paths to process : 1
13/11/19 07:11:06 WARN httpclient.RestS3Service: Response
'/Analyze%2F2013%2F10%2F03%2F465' - Unexpected response code 404, expected 200

Anyone have a quick fix for this?


So we're talking about 30 days x 1400 ids x the number of files per ID
(usually more than 1).

This is at least 42,000 file paths, and (regardless of the error you
posted) Hive won't perform well on this many files when making the
query.

It is IMHO a typical case of over-partitioning. I'd use RCFile and keep
IDs unpartitioned.

What volume of data are we talking about here? What's the volume of the
biggest ID for a day, and the average?


David


Re: java.lang.OutOfMemoryError: Java heap space

2013-11-25 Thread David Morel

On 22 Nov 2013, at 9:35, Rok Kralj wrote:

If anybody has any clue what is the cause of this, I'd be happy to 
hear it.

On Nov 21, 2013 9:59 PM, Rok Kralj rok.kr...@gmail.com wrote:


what does echo $HADOOP_HEAPSIZE return in the environment you're trying 
to launch hive from?


David


Re: Difference in number of row observstions from distinct and group by

2013-11-25 Thread David Morel

On 25 Nov 2013, at 9:06, Mayank Bansal wrote:


Hi,

I was also thinking that this might be the case. For that reason I ran
this query:

Select * from (select col1,col2,col3,count(*) as val from table_name
group by col1,col2,col3) a where a.val > 1;

The output that I receive from this query is blank, so I then ended up
doing count(*) and got the same number of rows as originally in the
table. Please help me figure this out.


Instead of going in circles, I would put the 2 result sets in 2 tables
(with a concatenated PK made of your 3 columns, with some separator like
'-'), and do a left outer join of table 1 on table 2.

You'd be able to identify quickly what went wrong. Sort the result so
you spot unlikely dupes, and so on. Just trial and error until you nail it.
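
As a sketch (hypothetical table and column names, using the '-' separator
mentioned above):

CREATE TABLE res_groupby AS
SELECT concat(col1, '-', col2, '-', col3) AS pk
FROM table_name
GROUP BY col1, col2, col3;

CREATE TABLE res_distinct AS
SELECT DISTINCT concat(col1, '-', col2, '-', col3) AS pk
FROM table_name;

-- rows present in the GROUP BY result but missing from the DISTINCT one
SELECT g.pk
FROM res_groupby g
LEFT OUTER JOIN res_distinct d ON (g.pk = d.pk)
WHERE d.pk IS NULL;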


David


Re: Hive query taking a lot of time just to launch map-reduce jobs

2013-11-26 Thread David Morel

On 26 Nov 2013, at 7:02, Sreenath wrote:


Hey David,
Thanks for the swift reply. Each id will have exactly one file. And
regarding the volume, on average each file would be 100MB of compressed
data, with the maximum going up to around 200MB of compressed data.



And how will RC files be an advantage here?


I was thinking RCFiles would be an advantage, but was confusing their
utility with that of indexes used on non-partitioned data, my bad.
ORCFiles (or indexes) would be an advantage as they would allow you to not
use partitions and to regroup your files, thus greatly reducing the overall
number. You could additionally specify a greater block size (say 512MB)
so the number of files to read is divided by 5.
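
As a sketch of what that layout could look like (this assumes a Hive version
with ORC support, i.e. 0.11 or later; table, column and partition names are
hypothetical):

-- keep the date partition, drop the per-id partition, store the id as a column
CREATE TABLE events_orc (
  id      STRING,
  payload STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE events_orc PARTITION (dt)
SELECT id, payload, dt
FROM events_partitioned_by_date_and_id;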

I guess the real issue is having a hive instance communicating with
remote storage on a large number of files, as the metastore only keeps
memory of the directories, not the files. As a result, in order to
assemble all your file paths, which are needed for query execution, it
takes a long time and a large number of I/O on the data store, which
happens to be remote and possibly slow to poll.

This is only a wild guess (and I could be completely wrong in my
understanding of Hive altogether) and there might be a bug and/or
something to optimize; the error you're seeing is maybe the key to the
issue, but that is for more knowledgeable people than me to comment
on.


Sorry (and good luck)

David

On Mon, Nov 25, 2013 at 5:50 PM, David Morel dmore...@gmail.com 
wrote:



On 25 Nov 2013, at 11:50, Sreenath wrote:

hi all,


We are using hive for Ad-hoc querying and have a hive table which is
partitioned on two fields (date,id).Now for each date there are 
around

1400
ids so on a single day around that many partitions are added.The 
actual
data is residing in s3. now the issue we are facing is suppose we do 
a
select count(*) for a month from the table then it takes quite a 
long
amount of time(approx : 1hrs 52 min) just to launch the map reduce 
job.
when i ran the query in hive verbose mode i can see that its 
spending this
time actually deciding how many number of mappers to 
spawn(calculating
splits). Is there any means by which i can reduce this lag time for 
the

launch of map-reduce job.

this is one of the log messages that is being logged during this lag 
time


13/11/19 07:11:06 INFO mapred.FileInputFormat: Total input paths to
process
: 1
13/11/19 07:11:06 WARN httpclient.RestS3Service: Response
'/Analyze%2F2013%2F10%2F03%2F465' - Unexpected response code 404,
expected
200
Anyone has a quick fix for this ?



So we're talking about 30 days x 1400 ids x number of files per ID
(usually more than 1)

this is at least 42,000 file paths, and (regardless of the error you
posted) hive won't perform well on this many files when making the 
query.


It is IMHO a typical case of over-partitioning. I'd use RCFile and 
keep

IDs unpartitioned.

What volume of data are we talking about here? What's the volume of 
the

biggest ID for a day, and the average?

David


Re: Hive Data into a Html Page

2015-07-31 Thread David Morel
Hive is not really meant to serve data as fast as a web page needs. You'll
have to use some intermediate layer (which could even be a DB file, or
Template Toolkit-generated static pages).

David
On 28 Jul 2015, at 8:53 AM, siva kumar siva165...@gmail.com wrote:

 Hi Lohith,
  We use an HTTP server. Let me explain what exactly the requirement is.
 We are using a Perl script to read Hive data and make some modifications
 to that data. Now, this modified data should be displayed on the HTML page
 instead of being stored back to the database. So, I need to know the
 solution for this scenario. Can we achieve this (displaying onto an HTML
 page) using the same Perl script, after reading and modifying the data
 from Hive?

 Any suggestions?

 Thanks and regards,
 siva.

 On Mon, Jul 27, 2015 at 8:09 PM, Lohith Samaga M 
 lohith.sam...@mphasis.com wrote:

 Hi Siva
 What web/application server do you have?

 You could use Hive or Drill odbc drivers...

 Sent from my Sony Xperia™ smartphone


  siva kumar wrote 


 Hi,
   There is some data loaded into Hive and I want to display Hive data
 on an HTML webpage based on the parameters I pass. So, how can I do this?

 Any Help?
 Thanks in Advance
 siva.







Re: Perl-Hive connection

2015-07-30 Thread David Morel

On 29 Jul 2015, at 9:42, siva kumar wrote:


Hi folks,
   I need to set up a connection between Perl and Hive using
Thrift. Can anyone suggest the steps involved in making this happen?


Thanks and regards,
siva.


Hi,

check out 
http://search.cpan.org/~dmor/Thrift-API-HiveClient2/lib/Thrift/API/HiveClient2.pm


David


Re: Perl-Hive connection

2015-08-06 Thread David Morel
You probably forgot to load (use) the module before calling new()
On 6 Aug 2015, at 8:49 AM, siva kumar siva165...@gmail.com wrote:

 Hi David,
  I have tried the link you have posted, but I'm stuck
 with this error message below:

 Can't locate object method new via package Thrift::API::HiveClient2

 could you please help me out ?

 Thanks and regards,
 siva


 On Thu, Jul 30, 2015 at 3:20 PM, David Morel dmore...@gmail.com wrote:

 On 29 Jul 2015, at 9:42, siva kumar wrote:

 Hi folks,
I need to set up a connection between perl and hive using
 thrift. Can anyone sugggest me the steps involved in making this happen?.

 Thanka and regrads,
 siva.


 Hi,

 check out
 http://search.cpan.org/~dmor/Thrift-API-HiveClient2/lib/Thrift/API/HiveClient2.pm

 David





Re: Hive Metadata tables of a schema

2016-04-05 Thread David Morel
Better use HCatalog for this.

David
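
For what it's worth, the kind of hand-written metastore query Mich describes
in the quoted message below could look like this; a sketch only, to be run
against the metastore RDBMS itself (not through Hive), and assuming the stock
DBS and TBLS metastore tables:

-- list the tables belonging to one Hive database/schema
SELECT d.NAME AS db_name, t.TBL_NAME, t.TBL_TYPE
FROM TBLS t
JOIN DBS d ON (t.DB_ID = d.DB_ID)
WHERE d.NAME = 'oraclehadoop'
ORDER BY t.TBL_NAME;
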
On 5 Apr 2016, at 10:14, "Mich Talebzadeh"  wrote:

> So you want to interrogate Hive metastore and get information about
> objects for a given schema/database in Hive.
>
> This info is kept in the Hive metastore database running on an RDBMS, say
> Oracle. There are dozens of tables in the Hive-specific schema.
>
> For example, the table DBS contains the Hive databases, etc.
>
> hiveu...@mydb.mich.LOCAL> select name from DBS;
> NAME
>
> 
> oraclehadoop
> test
> mytable_db
> accounts
> asehadoop
> default
> iqhadoop
> 7 rows selected.
>
> However, to get tables etc you need to understand the Hive schema tables
> and relationships among the tables. There is no package or proc to provide
> the info you need. You need to write your own queries.
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 5 April 2016 at 04:57, brajmohan saxena 
> wrote:
>
>> Hi,
>>
>> How can I access Hive metadata tables of a particular Server and Schema
>> in a C program or through JDBC client.
>>
>> I have one application running on Client and my Hiveserver2 running on
>> Remote machine.
>>
>> So if I provide the server name and schema name in my client C program, I
>> should be able to get all the tables belonging to that schema.
>>
>> So is there any Hive API where I can provide the server and schema name
>> and get the tables of that schema?
>>
>> Thanks in advance,
>>
>> Regards
>> Braj
>>
>>
>>
>