Re: Hive storm streaming with s3 file system

2018-06-12 Thread Gopal Vijayaraghavan


> So transactional tables only work with hdfs. Thanks for the confirmation 
> Elliot.

No, that's not what I said.

Streaming ingest into transactional tables requires strong filesystem 
consistency and a flush-to-remote operation (hflush).

S3 supports neither of those, and HDFS is not the only filesystem that has
both features.

For example, see AdlFsOutputStream.java#L38.
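
As a rough illustration (just a sketch of the idea, not the exact check the
streaming ingest code performs; the path argument is a placeholder), you can
probe whether a filesystem's output stream offers a real hflush before relying
on it:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.Syncable;

  public class HflushProbe {
    public static void main(String[] args) throws Exception {
      Path p = new Path(args[0]);  // e.g. a scratch file under the table location
      FileSystem fs = p.getFileSystem(new Configuration());
      try (FSDataOutputStream out = fs.create(p)) {
        out.writeBytes("probe");
        // FSDataOutputStream always exposes hflush(), but it only does a real
        // flush-to-remote when the wrapped stream is itself Syncable.
        if (out.getWrappedStream() instanceof Syncable) {
          out.hflush();
          System.out.println(fs.getUri().getScheme() + ": hflush is supported");
        } else {
          System.out.println(fs.getUri().getScheme() + ": no real hflush here");
        }
      }
    }
  }

On HDFS (and ADLS, per the link above) the wrapped stream is Syncable; S3
output streams are not, which is the gap streaming ingest runs into.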

Cheers,
Gopal






Log user name in the hive-server2 logs?

2018-06-12 Thread Kaidi Zhao
(This is for Hive 1.2.x.)

I noticed that in places like ParseDriver.java,

LOG.info("Parsing command: " + command);

does not log the user who submitted the query.

Is there any setting that allows me to log the user name as well?
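
For example, something along these lines is what I'd hope to end up with (just
a sketch, not a tested patch; I haven't verified that SessionState is always
populated by the time ParseDriver runs):

  // Sketch only: inside ParseDriver.parse(), pull the user from the
  // current session (if any) and add it to the existing log line.
  import org.apache.hadoop.hive.ql.session.SessionState;

  SessionState ss = SessionState.get();  // thread-local; may be null
  String user = (ss != null && ss.getUserName() != null)
      ? ss.getUserName() : "unknown";
  LOG.info("Parsing command (user=" + user + "): " + command);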

Thanks!
Kaidi


Re: Hive storm streaming with s3 file system

2018-06-12 Thread Abhishek Raj
So transactional tables only work with hdfs. Thanks for the confirmation
Elliot.

On Tue, Jun 12, 2018 at 10:25 PM, Elliot West  wrote:

> I do not believe that S3 is currently a supported filesystem for
> transactional tables. I believe there are plans to make this so.
>
> On 12 June 2018 at 17:50, Abhishek Raj  wrote:
>
>> Hi. I'm using HiveBolt from Apache Storm to stream data into a
>> transactional Hive table. It works great when the table is backed by
>> HDFS, but it starts throwing an error when the location is S3. The
>> error I get is, "No filesystem for scheme: s3"
>>
>> Just wondering if it's possible to stream data into an S3-backed Hive
>> table from Storm, or is HDFS the only filesystem supported.
>>
>> Any insights would be great.
>>
>> Thanks.
>>
>
>


Re: Hive storm streaming with s3 file system

2018-06-12 Thread Elliot West
I do not believe that S3 is currently a supported filesystem for
transactional tables. I believe there are plans to make this so.

On 12 June 2018 at 17:50, Abhishek Raj  wrote:

> Hi. I'm using HiveBolt from Apache Storm to stream data into a
> transactional Hive table. It works great when the table is backed by
> HDFS, but it starts throwing an error when the location is S3. The
> error I get is, "No filesystem for scheme: s3"
>
> Just wondering if it's possible to stream data into an S3-backed Hive
> table from Storm, or is HDFS the only filesystem supported.
>
> Any insights would be great.
>
> Thanks.
>


Re: Hive storm streaming with s3 file system

2018-06-12 Thread Abhishek Raj
Hi. I'm using HiveBolt from Apache Storm to stream data into a
transactional Hive table. It works great when the table is backed by
HDFS, but it starts throwing an error when the location is S3. The
error I get is, "No filesystem for scheme: s3"
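
(For reference, a bare classpath check like the sketch below prints which
FileSystem implementation, if any, Hadoop resolves for a given scheme; the
bucket path here is made up. When nothing is registered for the scheme,
getFileSystem() throws the same kind of "No FileSystem for scheme" error.)

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SchemeProbe {
    public static void main(String[] args) throws Exception {
      // Ask Hadoop which FileSystem class it would use for this URI scheme;
      // this throws if no implementation is registered for the scheme.
      Path p = new Path(args.length > 0 ? args[0] : "s3://some-bucket/warehouse/t");
      FileSystem fs = p.getFileSystem(new Configuration());
      System.out.println(p.toUri().getScheme() + " -> " + fs.getClass().getName());
    }
  }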

Just wondering if it's possible to stream data into an S3-backed Hive
table from Storm, or is HDFS the only filesystem supported.

Any insights would be great.

Thanks.


[no subject]

2018-06-12 Thread Sowjanya Kakarala
Hi Guys,


I have an EMR cluster with 4 datanodes and one master node, with 120 GB of
data storage left. I have been running Sqoop jobs which load data into a Hive
table. After some jobs ran successfully, I suddenly see these errors all over
the namenode and datanode logs.

I have tried changing many configurations as suggested on Stack Overflow and
the Hortonworks site, but couldn't find a way to fix it.


Here is the error:

2018-06-12 15:32:35,933 WARN [main] org.apache.hadoop.mapred.YarnChild:
Exception running child :
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/user/hive/warehouse/monolith.db/tblname/_SCRATCH0.28417629602676764/time_stamp=2018-04-02/_temporary/1/_temporary/attempt_1528318855054_3528_m_00_1/part-m-0
could only be replicated to 0 nodes instead of minReplication (=1).  There
are 4 datanode(s) running and no node(s) are excluded in this operation.

    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1735)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2561)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:829)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:847)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:790)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2486)

    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1489)
    at org.apache.hadoop.ipc.Client.call(Client.java:1435)
    at org.apache.hadoop.ipc.Client.call(Client.java:1345)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:444)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
    at com.sun.proxy.$Proxy15.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1838)
    at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1638)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:704)


References I already followed:

https://community.hortonworks.com/articles/16144/write-or-append-failures-in-very-small-clusters-un.html

https://stackoverflow.com/questions/14288453/writing-to-hdfs-from-java-getting-could-only-be-replicated-to-0-nodes-instead

https://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo

https://stackoverflow.com/questions/36015864/hadoop-be-replicated-to-0-nodes-instead-of-minreplication-1-there-are-1/36310025
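
(A quick way to double-check how much DFS space is actually left, alongside
the namenode web UI; this is just a sketch against the standard FileSystem API
and assumes fs.defaultFS in the classpath configuration points at the cluster.)

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.FsStatus;

  public class HdfsSpaceCheck {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      FsStatus status = fs.getStatus();  // cluster-wide capacity/used/remaining
      long gb = 1024L * 1024L * 1024L;
      System.out.printf("capacity=%d GB, used=%d GB, remaining=%d GB%n",
          status.getCapacity() / gb, status.getUsed() / gb, status.getRemaining() / gb);
    }
  }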


Any help is appreciated.


Thanks

Sowjanya


Re: issues with Hive 3 simple select from an ORC table

2018-06-12 Thread Sungwoo Park
This is a diff file that let me compile Hive 3.0 on Hadoop 2.8.0 (and also
run it on Hadoop 2.7.x).

diff --git a/pom.xml b/pom.xml
index c57ff58..8445288 100644
--- a/pom.xml
+++ b/pom.xml
@@ -146,7 +146,7 @@
 19.0
 2.4.11
 1.3.166
-3.1.0
+2.8.0

 
${basedir}/${hive.path.to.root}/testutils/hadoop
 1.3
 2.0.0-alpha4
@@ -1212,7 +1212,7 @@
   true
 
   
-  true
+  false
 
   
   
diff --git a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java b/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java
index b13f73b..21d8541 100644
--- a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java
+++ b/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java
@@ -277,7 +277,7 @@ protected void openInternal(String[] additionalFilesNotFromConf,
     } else {
       this.resources = new HiveResources(createTezDir(sessionId, "resources"));
       ensureLocalResources(conf, additionalFilesNotFromConf);
-      LOG.info("Created new resources: " + resources);
+      LOG.info("Created new resources: " + this.resources);
     }

     // unless already installed on all the cluster nodes, we'll have to
@@ -639,7 +639,6 @@ public void ensureLocalResources(Configuration conf, String[] newFilesNotFromCon
    * @throws Exception
    */
   void close(boolean keepDagFilesDir) throws Exception {
-    console = null;
     appJarLr = null;

     try {
@@ -665,6 +664,7 @@ void close(boolean keepDagFilesDir) throws Exception {
         }
       }
     } finally {
+      console = null;
       try {
         cleanupScratchDir();
       } finally {
diff --git a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java b/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
index 84ae157..be66787 100644
--- a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
+++ b/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
@@ -160,7 +160,9 @@ public int execute(DriverContext driverContext) {
       if (userName == null) {
         userName = "anonymous";
       } else {
-        groups = UserGroupInformation.createRemoteUser(userName).getGroups();
+        groups = Arrays.asList(UserGroupInformation.createRemoteUser(userName).getGroupNames());
+        // TODO: for Hadoop 2.8.0+, just call getGroups():
+        //   groups = UserGroupInformation.createRemoteUser(userName).getGroups();
       }
       MappingInput mi = new MappingInput(userName, groups,
           ss.getHiveVariables().get("wmpool"), ss.getHiveVariables().get("wmapp"));
diff --git a/ql/src/java/org/apache/hadoop/hive/ql/hooks/HiveProtoLoggingHook.java b/ql/src/java/org/apache/hadoop/hive/ql/hooks/HiveProtoLoggingHook.java
index 1ae8194..aaf0c62 100644
--- a/ql/src/java/org/apache/hadoop/hive/ql/hooks/HiveProtoLoggingHook.java
+++ b/ql/src/java/org/apache/hadoop/hive/ql/hooks/HiveProtoLoggingHook.java
@@ -472,7 +472,7 @@ static EventLogger getInstance(HiveConf conf) {
     if (instance == null) {
       synchronized (EventLogger.class) {
         if (instance == null) {
-          instance = new EventLogger(conf, SystemClock.getInstance());
+          instance = new EventLogger(conf, new SystemClock());
           ShutdownHookManager.addShutdownHook(instance::shutdown);
         }
       }
diff --git a/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java b/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java
index 183515a..2f393c3 100644
--- a/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java
+++ b/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java
@@ -1051,7 +1051,9 @@ else if (prev != null && next.maxWriteId == prev.maxWriteId
      */
     Collections.sort(original, (HdfsFileStatusWithId o1, HdfsFileStatusWithId o2) -> {
       //this does "Path.uri.compareTo(that.uri)"
-      return o1.getFileStatus().compareTo(o2.getFileStatus());
+      return o1.getFileStatus().getPath().compareTo(o2.getFileStatus().getPath());
+      // TODO: for Hadoop 2.8+
+      // return o1.getFileStatus().compareTo(o2.getFileStatus());
     });

     // Note: isRawFormat is invalid for non-ORC tables. It will always return true, so we're good.
diff --git a/ql/src/test/org/apache/hadoop/hive/ql/hooks/TestHiveProtoLoggingHook.java b/ql/src/test/org/apache/hadoop/hive/ql/hooks/TestHiveProtoLoggingHook.java
index 5e117fe..4367107 100644
--- a/ql/src/test/org/apache/hadoop/hive/ql/hooks/TestHiveProtoLoggingHook.java
+++ b/ql/src/test/org/apache/hadoop/hive/ql/hooks/TestHiveProtoLoggingHook.java
@@ -76,7 +76,7 @@ public void setup() throws Exception {
   @Test
   public void testPreEventLog() throws Exception {
     context.setHookType(HookType.PRE_EXEC_HOOK);
-    EventLogger evtLogger = new EventLogger(conf, SystemClock.getInstance());
+    EventLogger evtLogger = new EventLogger(conf, new SystemClock());
     evtLogger.handle(context);
     evtLogger.shutdown();

@@ -105,7

CachedStore for hive.metastore.rawstore.impl in Hive 3.0

2018-06-12 Thread Sungwoo Park
Hello Hive users,

I am experiencing a problem with the MetaStore in Hive 3.0.

1. Start MetaStore
with hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.ObjectStore.

2. Generate TPC-DS data.

3. TPC-DS queries run okay and produce correct results. E.g., from query 1:
+---+
|   c_customer_id   |
+---+
| CHAA  |
| DCAA  |
| DDAA  |
...
| AAAILIAA  |
+---+
100 rows selected (69.901 seconds)

However, query compilation takes a long time
(https://issues.apache.org/jira/browse/HIVE-16520).

4. Now, restart MetaStore with
hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.cache.CachedStore.

5. TPC-DS queries run okay, but produce wrong results. E.g., from query 1:
++
| c_customer_id  |
++
++
No rows selected (37.448 seconds)

What I noticed is that with hive.metastore.rawstore.impl=CachedStore,
HiveServer2 produces log messages such as:

2018-06-12T23:50:04,223  WARN [b3041385-0290-492f-aef8-c0249de328ad
HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
2018-06-12T23:50:04,223  INFO [b3041385-0290-492f-aef8-c0249de328ad
HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
2018-06-12T23:50:04,225  WARN [b3041385-0290-492f-aef8-c0249de328ad
HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
2018-06-12T23:50:04,225  INFO [b3041385-0290-492f-aef8-c0249de328ad
HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
2018-06-12T23:50:04,226  WARN [b3041385-0290-492f-aef8-c0249de328ad
HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk,
c_customer_id
2018-06-12T23:50:04,226  INFO [b3041385-0290-492f-aef8-c0249de328ad
HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk,
c_customer_id

2018-06-12T23:50:05,158 ERROR [b3041385-0290-492f-aef8-c0249de328ad
HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
Invalid column stats: No of nulls > cardinality
2018-06-12T23:50:05,159 ERROR [b3041385-0290-492f-aef8-c0249de328ad
HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
Invalid column stats: No of nulls > cardinality
2018-06-12T23:50:05,160 ERROR [b3041385-0290-492f-aef8-c0249de328ad
HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
Invalid column stats: No of nulls > cardinality

However, even after computing column stats, queries still return wrong
results, even though the above log messages disappear.

I guess I am missing some configuration parameters (because I imported
hive-site.xml from Hive 2). Any suggestions would be appreciated.

Thanks a lot,

--- Sungwoo Park


Re: Which version of Hive can handle creating XML table?

2018-06-12 Thread kristijan berta
Apologies, here is the link to the product: https://sonra.io/flexter-for-xml/

and how it can be used with Hive: https://sonra.io/2018/01/27/converting-xml-hive/


On Mon, Jun 11, 2018 at 5:46 PM, Mich Talebzadeh wrote:

> Many thanks, but I cannot see any specific product name there?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 11 June 2018 at 14:10, kristijan berta  wrote:
>
>> The XPath stuff works reasonably well for simple XML files.
>>
>> However, for complex XML files that change frequently and need to be
>> ingested in real time, you might look at a third-party solution, e.g. here:
>> https://dataworkssummit.com/san-jose-2018/session/add-a-spark-to-your-etl/
>>
>> On Mon, Jun 11, 2018 at 3:05 PM, kristijan berta wrote:
>>
>>> Thanks Jörn. Is the only alternative to use the xpath UDFs? That works as
>>> shown below, but it is tedious.
>>>
>>> Like the example below
>>>
>>> $ cat employees.xml
>>> <employee>
>>> <id>1</id>
>>> <name>Satish Kumar</name>
>>> <designation>Technical Lead</designation>
>>> </employee>
>>> <employee>
>>> <id>2</id>
>>> <name>Ramya</name>
>>> <designation>Testing</designation>
>>> </employee>
>>>
>>> Step 1: Bring each record onto one line by executing the command below
>>>
>>> $ cat employees.xml | tr -d '&' | tr '\n' ' ' | tr '\r' ' ' | sed 's|</employee>|</employee>\n|g' | grep -v '^\s*$' > employees_records.xml
>>>
>>> $ cat employees_records.xml
>>> <employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical Lead</designation> </employee>
>>> <employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>
>>>
>>> Step 2: Load the file to HDFS
>>>
>>> $ hadoop fs -mkdir /user/hive/sample-xml-inputs
>>>
>>> $ hadoop fs -put employees_records.xml /user/hive/sample-xml-inputs
>>>
>>> $ hadoop fs -cat /user/hive/sample-xml-inputs/employees_records.xml
>>> <employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical Lead</designation> </employee>
>>> <employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>
>>>
>>> Step 3: Create a Hive table and point it at the XML file
>>>
>>> hive> create external table xml_table_org (xmldata string) LOCATION '/user/hive/sample-xml-inputs/';
>>>
>>> hive> select * from xml_table_org;
>>> OK
>>> <employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical Lead</designation> </employee>
>>> <employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>
>>>
>>> Step 4: From the staging table we can query the elements and load them
>>> into another table.
>>>
>>> hive> CREATE TABLE xml_table AS SELECT
>>>   xpath_int(xmldata, 'employee/id'),
>>>   xpath_string(xmldata, 'employee/name'),
>>>   xpath_string(xmldata, 'employee/designation')
>>> FROM xml_table_org;
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn:
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> Disclaimer: Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 9 June 2018 at 07:42, Jörn Franke  wrote:
>>>
 Yes.

 Serde must have been removed then in 2.x.



 On 8. Jun 2018, at 23:52, Mich Talebzadeh wrote:

 Ok I am looking at this jar file

  jar tf hive-serde-3.0.0.jar|grep -i abstractserde
 org/apache/hadoop/hive/serde2/AbstractSerDe.class

 Is this the correct one?

 Thanks


 Dr Mich Talebzadeh



 LinkedIn:
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



 http://talebzadehmich.wordpress.com


 Disclaimer: Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 8 June 2018 at 22:34, Mich Talebzadeh wrote:

> Thanks Jörn, so what is the resolution? Do I need another jar file?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
>