Trying to create an external table in Hive for MongoDB throws error

2018-08-26 Thread Mich Talebzadeh
Hi,

Trying to create an external table in Hive backed by MongoDB.

This is the code in Hive

ADD JAR /home/hduser/jars/mongo-hadoop-core-2.0.2.jar;
ADD JAR /home/hduser/jars/mongo-hadoop-hive-2.0.2.jar;
ADD JAR /home/hduser/jars/mongo-java-driver-3.8.1.jar;
use accounts;
DROP TABLE IF EXISTS ll_18740868_mongo;
CREATE EXTERNAL TABLE ll_18740868_mongo (
 TransactionDate         DATE
,TransactionType         String
,SortCode                String
,AccountNumber           String
,TransactionDescription  String
,DebitAmount             Double
,CreditAmount            Double
,Balance                 Double
)
COMMENT 'Mongo external table'
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'=
'{
   "TransactionDate":"TransactionDate",
   "TransactionType":"TransactionType",
   "SortCode":"SortCode",
   "AccountNumber":"AccountNumber",
   "TransactionDescription":"TransactionDescription",
   "DebitAmount":"DebitAmount",
   "CreditAmount":"CreditAmount",
   "Balance":"Balance"
}'
)
TBLPROPERTIES ('mongo.uri'='mongodb://account_user_RO:mongodb@rhes75:60100/accounts.ll_18740868_mongo');

In debug mode it throws this error:

2018-08-26 23:01:32,424 INFO  [a928186b-244d-4b81-bda9-dd3a83e306b2 main]
ql.Driver: Starting task [Stage-0:DDL] in serial mode
2018-08-26 23:01:32,426 ERROR [a928186b-244d-4b81-bda9-dd3a83e306b2 main]
exec.DDLTask: Failed
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/serde2/SerDe
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at com.mongodb.hadoop.hive.MongoStorageHandler.getSerDeClass(MongoStorageHandler.java:88)
        at org.apache.hadoop.hive.ql.plan.CreateTableDesc.toTable(CreateTableDesc.java:744)
        at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4882)
        at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:428)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:205)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:97)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2479)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:2150)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1826)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1567)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1561)
        at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:157)
        at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:218)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:239)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:188)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:402)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:335)
        at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:471)
        at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:487)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:683)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
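For context, the missing class org.apache.hadoop.hive.serde2.SerDe is the old SerDe interface, which was removed in Hive 3 (HIVE-15167), while mongo-hadoop-hive 2.0.2 is compiled against it, so this looks like a Hive 3 incompatibility rather than a missing jar. One sanity check that can be run in the same session (a sketch only; `list jars` confirms the ADD JAR commands took effect, it cannot detect the interface mismatch itself):

```sql
-- Confirm the three jars added above are registered in this session;
-- the output should include /home/hduser/jars/mongo-hadoop-hive-2.0.2.jar
list jars;
```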
 

Re: Improve performance of Analyze table compute statistics

2018-08-26 Thread Jörn Franke
You can partition it and only compute statistics for new partitions...
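That suggestion can be sketched in HiveQL. The table in the question is not partitioned, so the partition column `load_date` below is hypothetical; with such a layout, each stats run touches only the newly loaded partition instead of all 76 files:

```sql
-- Sketch, assuming the table were rebuilt with a hypothetical
-- `load_date` partition column (not part of the original DDL):
ANALYZE TABLE dept_data_services.lc_credit
  PARTITION (load_date='2018-08-24')
  COMPUTE STATISTICS;
```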

> On 26. Aug 2018, at 12:43, Prabhakar Reddy  wrote:
> 
> Hello,
> 
> Are there any properties that I can set to improve the performance of the Analyze 
> table compute statistics statement? My data sits in S3 and I see it's taking 
> one second per file to read the schema of each file from S3.
> 
> 2018-08-24T03:25:57,525 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(79)) - Reading ORC 
> rows from s3://file_1
> 2018-08-24T03:25:57,526 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: impl.RecordReaderImpl (RecordReaderImpl.java:(187)) - Reader 
> schema not provided -- using file schema 
> 
> 2018-08-24T03:25:58,395 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(79)) - Reading ORC 
> rows from s3://file_2
> 2018-08-24T03:25:58,395 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: impl.RecordReaderImpl (RecordReaderImpl.java:(187)) - Reader 
> schema not provided -- using file schema
> 
> It takes around 80 seconds for 76 files with total size of 23 GB.
> 
> 
> 2018-08-24T03:27:07,673 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: exec.Task (SessionState.java:printInfo()) - Table 
> dept_data_services.lc_credit_2018_08_20_temp stats: [numFiles=76, 
> numRows=101341845, totalSize=26500166568, rawDataSize=294491898741]
> 2018-08-24T03:27:07,673 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: ql.Driver (Driver.java:execute(2050)) - Completed executing 
> command(queryId=lcapp_20180824032545_aba21e71-ea4b-4214-8793-705a5e0367f0); 
> Time taken: 81.169 seconds
> 2018-08-24T03:27:07,674 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: ql.Driver (SessionState.java:printInfo()) - OK
> 2018-08-24T03:27:07,681 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 
> main([])]: CliDriver (SessionState.java:printInfo()) - Time taken: 81.992 
> seconds
> 
> If I run the same command with fewer columns then the query runs 60% faster. Is 
> there any property that I can modify to reduce the time taken for this read?
> 
> Regards
> Prabhakar Reddy
> 


Improve performance of Analyze table compute statistics

2018-08-26 Thread Prabhakar Reddy
Hello,

Are there any properties that I can set to improve the performance of the
Analyze table compute statistics statement? My data sits in S3 and I see
it's taking one second per file to read the schema of each file from S3.

2018-08-24T03:25:57,525 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
main([])]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(79)) - Reading ORC
rows from s3://file_1
2018-08-24T03:25:57,526 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
main([])]: impl.RecordReaderImpl (RecordReaderImpl.java:(187)) -
Reader schema not provided -- using file schema

2018-08-24T03:25:58,395 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
main([])]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(79)) - Reading ORC
rows from s3://file_2
2018-08-24T03:25:58,395 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
main([])]: impl.RecordReaderImpl (RecordReaderImpl.java:(187)) -
Reader schema not provided -- using file schema

It takes around 80 seconds for 76 files with total size of 23 GB.


2018-08-24T03:27:07,673 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
main([])]: exec.Task (SessionState.java:printInfo()) - Table
dept_data_services.lc_credit_2018_08_20_temp stats: [numFiles=76,
numRows=101341845, totalSize=26500166568, rawDataSize=294491898741]
2018-08-24T03:27:07,673 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
main([])]: ql.Driver (Driver.java:execute(2050)) - Completed executing
command(queryId=lcapp_20180824032545_aba21e71-ea4b-4214-8793-705a5e0367f0);
Time taken: 81.169 seconds
2018-08-24T03:27:07,674 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
main([])]: ql.Driver (SessionState.java:printInfo()) - OK
2018-08-24T03:27:07,681 INFO  [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1
main([])]: CliDriver (SessionState.java:printInfo()) - Time taken:
81.992 seconds

If I run the same command with fewer columns then the query runs 60%
faster. Is there any property that I can modify to reduce the time taken for
this read?
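For reference, the column-restricted form alluded to above can be written as follows; `col1` and `col2` are placeholders, not columns from the real table:

```sql
-- Sketch: gather column statistics only for the columns actually queried,
-- which matches the ~60% speedup observed with fewer columns.
ANALYZE TABLE dept_data_services.lc_credit_2018_08_20_temp
  COMPUTE STATISTICS FOR COLUMNS col1, col2;
```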

Regards
Prabhakar Reddy