Why can Hive run normally without starting YARN?

2016-05-31 Thread Joseph

Hi all,

I use Hadoop 2.7.2, and I just start HDFS; then I can submit MapReduce jobs and 
run Hive 1.2.1.  Do the jobs just execute locally if I don't start YARN?
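
A quick way to verify this from the Hive shell is to print the relevant
properties; a minimal sketch using standard Hadoop/Hive settings (actual values
depend on your configuration):

-- "local" means jobs run in the local job runner instead of being
-- submitted to YARN.
SET mapreduce.framework.name;
SET hive.execution.engine;

-- Optionally let Hive choose local execution automatically for small jobs.
SET hive.exec.mode.local.auto=true;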



Joseph


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Mich Talebzadeh
Thanks Gopal.

SAP Replication Server (SRS) replicates to Hive in real time as well. That is the
main advantage of replication: it is real time. It picks up committed data
from the log and sends it to Hive. It is also way ahead of Sqoop, which really
only does the initial load. It does 10k rows at a time with an insert into the
Hive table. The Hive table cannot be transactional to start with.
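
For illustration only, a minimal sketch of such a non-transactional (non-ACID)
target table; the table and column names below are assumed, not taken from the
replication setup described above:

-- Hypothetical plain ORC table used as the target of the bulk inserts shown
-- in the trace below; replication requires it to be non-transactional.
CREATE TABLE t (
  owner       STRING,
  object_name STRING,
  object_id   BIGINT,
  created     TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'false');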

I. 2016/04/08 09:38:23. REPLICATE Replication Server: Dropped subscription
<102_105_t> for replication definition <102_t> with replicate at

I. 2016/04/08 09:38:31. REPLICATE Replication Server: Creating subscription
<102_105_t> for replication definition <102_t> with replicate at

I. 2016/04/08 09:38:31. PRIMARY Replication Server: Creating subscription
<102_105_t> for replication definition <102_t> with replicate at

T. 2016/04/08 09:38:32. (84): Command sent to 'SYB_157.scratchpad':
T. 2016/04/08 09:38:32. (84): 'begin transaction  '
T. 2016/04/08 09:38:32. (84): Command sent to 'SYB_157.scratchpad':
T. 2016/04/08 09:38:32. (84): 'select  count (*) from t  '
T. 2016/04/08 09:38:34. (84): Command sent to 'SYB_157.scratchpad':
T. 2016/04/08 09:38:34. (84): 'select OWNER, OBJECT_NAME, SUBOBJECT_NAME,
OBJECT_ID, DATA_OBJECT_ID, OBJECT_TYPE, CREATED, LAST_DDL_TIME, TIMESTAMP2,
STATUS, TEMPORARY2, GENERATED, SECONDARY, NAMESPACE, EDITION_NAME,
PADDING1, PADDING2, ATTRIBUTE from t  '
T. 2016/04/08 09:39:54. (86): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:39:54. (86): 'Bulk insert table 't' ( rows affected)'
T. 2016/04/08 09:40:12. (89): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:40:12. (89): 'Bulk insert table 't' ( rows affected)'
T. 2016/04/08 09:40:34. (87): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:40:34. (87): 'Bulk insert table 't' ( rows affected)'
T. 2016/04/08 09:40:52. (88): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:40:52. (88): 'Bulk insert table 't' ( rows affected)'
T. 2016/04/08 09:41:11. (90): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:41:11. (90): 'Bulk insert table 't' ( rows affected)'
T. 2016/04/08 09:41:56. (86): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:41:56. (86): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:42:30. (87): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:42:30. (87): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:42:53. (89): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:42:53. (89): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:43:14. (90): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:43:14. (90): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:43:33. (88): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:43:33. (88): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:44:25. (86): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:44:25. (86): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:44:44. (89): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:44:44. (89): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:45:37. (90): Command sent to 'hiveserver2.asehadoop':

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 31 May 2016 at 22:18, Gopal Vijayaraghavan  wrote:

>
> > Can LLAP be used as a caching tool for data from Oracle DB or any RDBMS.
>
> No, LLAP intermediates HDFS. It holds column & index data streams as-is
> (i.e. dictionary encoding, RLE, bloom filters etc. are preserved).
>
> Because it does not cache row-tuples, it cannot exist as a caching tool
> for another RDBMS.
>
> I have heard of Oracle GoldenGate replicating into Hive, but it is not
> without its own pains of schema compat.
>
> Cheers,
> Gopal
>
>
>
>


Re: [ANNOUNCE] Apache Hive 2.0.1 Released

2016-05-31 Thread Sergey Shelukhin
Oh. I just copy-pasted the Wiki text, perhaps it should be updated.

From: Mich Talebzadeh
Reply-To: "user@hive.apache.org"
Date: Tuesday, May 31, 2016 at 14:01
To: user
Cc: "d...@hive.apache.org", "annou...@apache.org"
Subject: Re: [ANNOUNCE] Apache Hive 2.0.1 Released

Thanks Sergey,

Congratulations.

May I add that Hive 0.14 and above can also deploy Spark as its execution 
engine, and with Spark on Hive, plus Hive also using Spark as its execution 
engine, you have a winning combination.

BTW we are just discussing the merits of TEZ + LLAP versus Spark as the 
execution engine for Hive. With Hive on Spark vs Hive on MapReduce the 
performance gains are an order of magnitude.
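
For reference, Hive's execution engine is switched per session with a single
property; a minimal sketch (which values are accepted depends on what is
installed and configured in your build):

SET hive.execution.engine=mr;     -- classic MapReduce
SET hive.execution.engine=tez;    -- Tez
SET hive.execution.engine=spark;  -- Hive on Spark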

HTH




Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 31 May 2016 at 21:39, Sergey Shelukhin wrote:
The Apache Hive team is proud to announce the release of Apache Hive
version 2.0.1.

The Apache Hive (TM) data warehouse software facilitates querying and
managing large datasets residing in distributed storage. Built on top of
Apache Hadoop (TM), it provides:

* Tools to enable easy data extract/transform/load (ETL)

* A mechanism to impose structure on a variety of data formats

* Access to files stored either directly in Apache HDFS (TM) or in other
data storage systems such as Apache HBase (TM)

* Query execution via Apache Hadoop MapReduce and Apache Tez frameworks.

For Hive release details and downloads, please visit:
https://hive.apache.org/downloads.html

Hive 2.0.1 Release Notes are available here:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12334886&styleName=Text&projectId=12310843

We would like to thank the many contributors who made this release
possible.

Regards,

The Apache Hive Team





Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Gopal Vijayaraghavan

> Can LLAP be used as a caching tool for data from Oracle DB or any RDBMS.

No, LLAP intermediates HDFS. It holds column & index data streams as-is
(i.e. dictionary encoding, RLE, bloom filters etc. are preserved).

Because it does not cache row-tuples, it cannot exist as a caching tool
for another RDBMS.

I have heard of Oracle GoldenGate replicating into Hive, but it is not
without its own pains of schema compat.

Cheers,
Gopal





Fwd: [ANNOUNCE] Apache Hive 2.0.1 Released

2016-05-31 Thread Mich Talebzadeh
Thanks Sergey,

Congratulations.

May I add that Hive 0.14 and above can also deploy Spark as its execution
engine, and with Spark on Hive, plus Hive also using Spark as its execution
engine, you have a winning combination.

BTW we are just discussing the merits of TEZ + LLAP versus Spark as the
execution engine for Hive. With Hive on Spark vs Hive on MapReduce the
performance gains are an order of magnitude.

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 31 May 2016 at 21:39, Sergey Shelukhin  wrote:

> The Apache Hive team is proud to announce the release of Apache Hive
> version 2.0.1.
>
> The Apache Hive (TM) data warehouse software facilitates querying and
> managing large datasets residing in distributed storage. Built on top of
> Apache Hadoop (TM), it provides:
>
> * Tools to enable easy data extract/transform/load (ETL)
>
> * A mechanism to impose structure on a variety of data formats
>
> * Access to files stored either directly in Apache HDFS (TM) or in other
> data storage systems such as Apache HBase (TM)
>
> * Query execution via Apache Hadoop MapReduce and Apache Tez frameworks.
>
> For Hive release details and downloads, please visit:
> https://hive.apache.org/downloads.html
>
> Hive 2.0.1 Release Notes are available here:
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12334886&styleName=Text&projectId=12310843
> 
>
> We would like to thank the many contributors who made this release
> possible.
>
> Regards,
>
> The Apache Hive Team
>
>
>


Re: [ANNOUNCE] Apache Hive 2.0.1 Released

2016-05-31 Thread Mich Talebzadeh
Thanks Sergey,

Congratulations.

May I add that Hive 0.14 and above can also deploy Spark as its execution
engine, and with Spark on Hive, plus Hive also using Spark as its execution
engine, you have a winning combination.

BTW we are just discussing the merits of TEZ + LLAP versus Spark as the
execution engine for Hive. With Hive on Spark vs Hive on MapReduce the
performance gains are an order of magnitude.

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 31 May 2016 at 21:39, Sergey Shelukhin  wrote:

> The Apache Hive team is proud to announce the release of Apache Hive
> version 2.0.1.
>
> The Apache Hive (TM) data warehouse software facilitates querying and
> managing large datasets residing in distributed storage. Built on top of
> Apache Hadoop (TM), it provides:
>
> * Tools to enable easy data extract/transform/load (ETL)
>
> * A mechanism to impose structure on a variety of data formats
>
> * Access to files stored either directly in Apache HDFS (TM) or in other
> data storage systems such as Apache HBase (TM)
>
> * Query execution via Apache Hadoop MapReduce and Apache Tez frameworks.
>
> For Hive release details and downloads, please visit:
> https://hive.apache.org/downloads.html
>
> Hive 2.0.1 Release Notes are available here:
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12334886&styleName=Text&projectId=12310843
>
> We would like to thank the many contributors who made this release
> possible.
>
> Regards,
>
> The Apache Hive Team
>
>
>


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Gopal Vijayaraghavan

> but this sounds to me (without testing myself) adding caching capability
> to TEZ to bring it on par with SPARK.

Nope, that was the crux of the earlier email.

"Caching" seems to be catch-all term misused in that comparison.

>> There is a big difference between LLAP & SparkSQL, which has to do
>> with access pattern needs.

On another note, LLAP can actually be used inside Spark as well, just use
LlapContext instead of HiveContext.





I even have a Postgres FDW for LLAP, which is mostly used for analytics
web dashboards which are hooked into Hive.

https://github.com/t3rmin4t0r/llap_fdw


LLAP can do 200-400ms queries, but Postgres can get to sub-10ms when
it comes to slicing and dicing result sets of <100k rows.

Cheers,
Gopal




[ANNOUNCE] Apache Hive 2.0.1 Released

2016-05-31 Thread Sergey Shelukhin
The Apache Hive team is proud to announce the release of Apache Hive
version 2.0.1.

The Apache Hive (TM) data warehouse software facilitates querying and
managing large datasets residing in distributed storage. Built on top of
Apache Hadoop (TM), it provides:

* Tools to enable easy data extract/transform/load (ETL)

* A mechanism to impose structure on a variety of data formats

* Access to files stored either directly in Apache HDFS (TM) or in other
data storage systems such as Apache HBase (TM)

* Query execution via Apache Hadoop MapReduce and Apache Tez frameworks.

For Hive release details and downloads, please visit:
https://hive.apache.org/downloads.html

Hive 2.0.1 Release Notes are available here:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12334886&styleName=Text&projectId=12310843

We would like to thank the many contributors who made this release
possible.

Regards,

The Apache Hive Team




Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Mich Talebzadeh
A couple of points if I may, and kindly bear with my remarks.

Whilst it will be very interesting to try TEZ with LLAP, as I read from the LLAP
material:

"Sub-second queries require fast query execution and low setup cost. The
challenge for Hive is to achieve this without giving up on the scale and
flexibility that users depend on. This requires a new approach using a
hybrid engine that leverages Tez and something new called LLAP (Live Long
and Process, #llap online).

LLAP is an optional daemon process running on multiple nodes, that provides
the following:

   - Caching and data reuse across queries with compressed columnar data
   in-memory (off-heap)
   - Multi-threaded execution including reads with predicate pushdown and
   hash joins
   - High throughput IO using Async IO Elevator with dedicated thread and
   core per disk
   - Granular column level security across applications"

OK, so we have added an in-memory capability to TEZ by way of LLAP; in other
words, what Spark does already, and BTW Spark does not require a daemon running
on any host. Don't take me wrong: it is interesting, but this sounds to me
(without testing it myself) like adding a caching capability to TEZ to bring it
on par with Spark.

Remember:

Spark -> DAG + in-memory caching
TEZ = MR on DAG
TEZ + LLAP => DAG + in-memory caching

OK, it is another way of getting the same result. However, my concerns are:


   - Spark has a wide user base. I judge this from Spark user group traffic
   - The TEZ user group has no traffic, I am afraid
   - LLAP I don't know

Sounds like Hortonworks promotes TEZ while Cloudera does not want to know
anything about Hive and promotes Impala instead, but that sounds like a sinking
ship these days.

Having said that I will try TEZ + LLAP :) No pun intended
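
For anyone trying that combination, a rough sketch of the session settings
involved (assuming a Hive 2.x build with LLAP daemons already running; the
property names should be checked against your version):

SET hive.execution.engine=tez;
SET hive.llap.execution.mode=all;   -- e.g. none | map | all | only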

Regards

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 31 May 2016 at 08:19, Jörn Franke  wrote:

> Thanks, very interesting explanation. Looking forward to testing it.
>
> > On 31 May 2016, at 07:51, Gopal Vijayaraghavan 
> wrote:
> >
> >
> >> That being said all systems are evolving. Hive supports tez+llap which
> >> is basically the in-memory support.
> >
> > There is a big difference between LLAP & SparkSQL, which has to do
> > with access pattern needs.
> >
> > The first one is related to the lifetime of the cache - the Spark RDD
> > cache is per-user-session which allows for further operation in that
> > session to be optimized.
> >
> > LLAP is designed to be hammered by multiple user sessions running
> > different queries, designed to automate the cache eviction & selection
> > process. There's no user visible explicit .cache() to remember - it's
> > automatic and concurrent.
> >
> > My team works with both engines, trying to improve it for ORC, but the
> > goals of both are different.
> >
> > I will probably have to write a proper academic paper & get it
> > edited/reviewed instead of sending my ramblings to the user lists like this.
> > Still, this needs an example to talk about.
> >
> > To give a qualified example, let's leave the world of single use clusters
> > and take the use-case detailed here
> >
> > http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
> >
> >
> > There are two distinct problems there - one is that a single day sees up to
> > 100k independent user sessions running queries and that most queries cover
> > the last hour (& possibly join/compare against a similar hour aggregate
> > from the past).
> >
> > The problem with having independent 100k user-sessions from different
> > connections was that the SparkSQL layer drops the RDD lineage & cache
> > whenever a user ends a session.
> >
> > The scale problem in general for Impala was that even though the data size
> > was in multiple terabytes, the actual hot data was approx <20Gb, which
> > resides on <10 machines with locality.
> >
> > The same problem applies when you use RDD caching with something
> > un-replicated like Tachyon/Alluxio, since the same RDD will be so exceedingly
> > popular that the machines which hold those blocks run extra hot.
> >
> > A cache model per-user session is entirely wasteful and a common cache +
> > MPP model effectively overloads 2-3% of cluster, while leaving the other
> > machines idle.
> >
> > LLAP was designed specifically to prevent that hotspotting, while
> > maintaining the common cache model - within a few minutes after an hour
> > ticks over, the whole cluster develops temporal popularity for the hot
> > data and nearly every rack has at least one cached copy of the same data
> > for availability/performance.
> >
> > Since data streams tend to be extremely wide tables (Omniture comes to
> > mind), the cache actually does not hold all columns in a table, and since
> > Zipf distributions are extremely common in these 

RE: How to disable SMB join?

2016-05-31 Thread Markovitz, Dudu
Hi

The documentation describes a scenario where SMB join leads to the same error 
you’ve got.
It claims that changing the order of the tables solves the problem.

Dudu


https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization#LanguageManualJoinOptimization-SMBJoinacrossTableswithDifferentKeys
SMB Join across Tables with Different Keys
If the tables have differing number of keys, for example Table A has 2 SORT 
columns and Table B has 1 SORT column, then you might get an index out of 
bounds exception.
The following query results in an index out of bounds exception because 
emp_person let us say for example has 1 sort column while emp_pay_history has 2 
sort columns.
Error Hive 0.11
SELECT p.*, py.*
FROM emp_person p INNER JOIN emp_pay_history py
ON   p.empid = py.empid

This works fine.
Working query Hive 0.11
SELECT p.*, py.*
FROM emp_pay_history py INNER JOIN emp_person p
ON   p.empid = py.empid




From: Banias H [mailto:banias4sp...@gmail.com]
Sent: Tuesday, May 31, 2016 8:09 PM
To: user@hive.apache.org
Subject: How to disable SMB join?

Hi,

Does anybody know if there is a config setting to disable SMB join?

One of our Hive queries failed with ArrayIndexOutOfBoundsException when Tez is 
the execution engine. The error seems to be addressed by 
https://issues.apache.org/jira/browse/HIVE-13282

We have Hive 1.2 and Tez 0.7 in our cluster and the workaround suggested in the 
ticket is to disable SMB join. I searched around and only found the setting to 
convert to SMB MapJoin. Any help on disabling SMB join altogether would be 
appreciated. Thanks.

-B





How to disable SMB join?

2016-05-31 Thread Banias H
Hi,

Does anybody know if there is a config setting to disable SMB join?

One of our Hive queries failed with ArrayIndexOutOfBoundsException when Tez
is the execution engine. The error seems to be addressed by
https://issues.apache.org/jira/browse/HIVE-13282

We have Hive 1.2 and Tez 0.7 in our cluster and the workaround suggested in
the ticket is to disable SMB join. I searched around and only found the
setting to convert to SMB MapJoin. Any help on disabling SMB join
altogether would be appreciated. Thanks.

-B
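
As a rough sketch, the settings that control sort-merge-bucket join conversion
are along these lines (names should be verified against the Hive 1.2
configuration; they are not taken from the HIVE-13282 ticket itself):

-- Disable automatic conversion of joins to sort-merge-bucket joins.
SET hive.auto.convert.sortmerge.join=false;
SET hive.optimize.bucketmapjoin.sortedmerge=false;
SET hive.optimize.bucketmapjoin=false;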


Re: Why does the user need write permission on the location of external hive table?

2016-05-31 Thread Mich Talebzadeh
Right, that directory belongs to hdfs:hdfs and no one else bar that user can
write to it.

If you are connecting via beeline you need to specify the user and password:

beeline -u jdbc:hive2://rhes564:10010/default -d
org.apache.hive.jdbc.HiveDriver -n hduser -p 

When I look at the permissions I see that only hdfs can write to it, not user
Sandeep.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 31 May 2016 at 09:20, Sandeep Giri  wrote:

> Yes, when I run hadoop fs it gives results correctly.
>
> *hadoop fs -ls /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/*
> *Found 30 items*
> *-rw-r--r--   3 hdfs hdfs   6148 2015-12-04 15:19
> /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/.DS_Store*
> *-rw-r--r--   3 hdfs hdfs 803323 2015-12-04 15:19
> /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670393.gz*
> *-rw-r--r--   3 hdfs hdfs 284355 2015-12-04 15:19
> /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670394.gz*
> **
>
>
>
>
> On Tue, May 31, 2016 at 1:42 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> is this location correct and valid?
>>
>> LOCATION '/data/SentimentFiles/*SentimentFiles*/upload/data/tweets_raw/'
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 31 May 2016 at 08:50, Sandeep Giri  wrote:
>>
>>> Hi Hive Team,
>>>
>>> As per my understanding, in Hive, you can create two kinds of tables:
>>> Managed and External.
>>>
>>> In case of managed table, you own the data and hence when you drop the
>>> table the data is deleted.
>>>
>>> In case of external table, you don't have ownership of the data and
>>> hence when you delete such a table, the underlying data is not deleted.
>>> Only metadata is deleted.
>>>
>>> Now, recently I have observed that you cannot create an external table
>>> over a location on which you don't have write (modification) permissions in
>>> HDFS. I completely fail to understand this.
>>>
>>> Use case: It is quite common that the data you are churning is huge and
>>> read-only. So, to churn such data via Hive, will you have to copy this huge
>>> data to a location on which you have write permissions?
>>>
>>> Please help.
>>>
>>> My data is located in a hdfs folder
>>> (/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/)  on which I
>>> only have readonly permission. And I am trying to execute the following
>>> command
>>>
>>> *CREATE EXTERNAL TABLE tweets_raw (*
>>> *id BIGINT,*
>>> *created_at STRING,*
>>> *source STRING,*
>>> *favorited BOOLEAN,*
>>> *retweet_count INT,*
>>> *retweeted_status STRUCT<*
>>> *text:STRING,*
>>> *users:STRUCT>,*
>>> *entities STRUCT<*
>>> *urls:ARRAY,*
>>> *user_mentions:ARRAY>,*
>>> *hashtags:ARRAY>,*
>>> *text STRING,*
>>> *user1 STRUCT<*
>>> *screen_name:STRING,*
>>> *name:STRING,*
>>> *friends_count:INT,*
>>> *followers_count:INT,*
>>> *statuses_count:INT,*
>>> *verified:BOOLEAN,*
>>> *utc_offset:STRING, -- was INT but nulls are strings*
>>> *time_zone:STRING>,*
>>> *in_reply_to_screen_name STRING,*
>>> *year int,*
>>> *month int,*
>>> *day int,*
>>> *hour int*
>>> *)*
>>> *ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'*
>>> *WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")*
>>> *LOCATION
>>> '/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/'*
>>> *;*
>>>
>>> It throws the following error:
>>>
>>> FAILED: Execution Error, return code 1 from
>>> org.apache.hadoop.hive.ql.exec.DDLTask.
>>> MetaException(message:java.security.AccessControlException: Permission
>>> denied: user=sandeep, access=WRITE,
>>> inode="/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw":hdfs:hdfs:drwxr-xr-x
>>> at
>>> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
>>> at
>>> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
>>> at
>>> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
>>> at
>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1771)
>>> at
>>> 

Re: Why does the user need write permission on the location of external hive table?

2016-05-31 Thread Sandeep Giri
Yes, when I run hadoop fs it gives results correctly.

*hadoop fs -ls /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/*
*Found 30 items*
*-rw-r--r--   3 hdfs hdfs   6148 2015-12-04 15:19
/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/.DS_Store*
*-rw-r--r--   3 hdfs hdfs 803323 2015-12-04 15:19
/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670393.gz*
*-rw-r--r--   3 hdfs hdfs 284355 2015-12-04 15:19
/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670394.gz*
**




On Tue, May 31, 2016 at 1:42 PM, Mich Talebzadeh 
wrote:

> is this location correct and valid?
>
> LOCATION '/data/SentimentFiles/*SentimentFiles*/upload/data/tweets_raw/'
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 31 May 2016 at 08:50, Sandeep Giri  wrote:
>
>> Hi Hive Team,
>>
>> As per my understanding, in Hive, you can create two kinds of tables:
>> Managed and External.
>>
>> In case of managed table, you own the data and hence when you drop the
>> table the data is deleted.
>>
>> In case of external table, you don't have ownership of the data and hence
>> when you delete such a table, the underlying data is not deleted. Only
>> metadata is deleted.
>>
>> Now, recently I have observed that you cannot create an external table
>> over a location on which you don't have write (modification) permissions in
>> HDFS. I completely fail to understand this.
>>
>> Use case: It is quite common that the data you are churning is huge and
>> read-only. So, to churn such data via Hive, will you have to copy this huge
>> data to a location on which you have write permissions?
>>
>> Please help.
>>
>> My data is located in a hdfs folder
>> (/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/)  on which I
>> only have readonly permission. And I am trying to execute the following
>> command
>>
>> *CREATE EXTERNAL TABLE tweets_raw (*
>> *id BIGINT,*
>> *created_at STRING,*
>> *source STRING,*
>> *favorited BOOLEAN,*
>> *retweet_count INT,*
>> *retweeted_status STRUCT<*
>> *text:STRING,*
>> *users:STRUCT>,*
>> *entities STRUCT<*
>> *urls:ARRAY,*
>> *user_mentions:ARRAY>,*
>> *hashtags:ARRAY>,*
>> *text STRING,*
>> *user1 STRUCT<*
>> *screen_name:STRING,*
>> *name:STRING,*
>> *friends_count:INT,*
>> *followers_count:INT,*
>> *statuses_count:INT,*
>> *verified:BOOLEAN,*
>> *utc_offset:STRING, -- was INT but nulls are strings*
>> *time_zone:STRING>,*
>> *in_reply_to_screen_name STRING,*
>> *year int,*
>> *month int,*
>> *day int,*
>> *hour int*
>> *)*
>> *ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'*
>> *WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")*
>> *LOCATION
>> '/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/'*
>> *;*
>>
>> It throws the following error:
>>
>> FAILED: Execution Error, return code 1 from
>> org.apache.hadoop.hive.ql.exec.DDLTask.
>> MetaException(message:java.security.AccessControlException: Permission
>> denied: user=sandeep, access=WRITE,
>> inode="/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw":hdfs:hdfs:drwxr-xr-x
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1771)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1755)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1729)
>> at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8348)
>> at
>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1978)
>> at
>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.ja
>> va:1443)
>> at
>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProto
>> s.java)
>> at
>> 

Re: Why does the user need write permission on the location of external hive table?

2016-05-31 Thread Mich Talebzadeh
is this location correct and valid?

LOCATION '/data/SentimentFiles/*SentimentFiles*/upload/data/tweets_raw/'

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 31 May 2016 at 08:50, Sandeep Giri  wrote:

> Hi Hive Team,
>
> As per my understanding, in Hive, you can create two kinds of tables:
> Managed and External.
>
> In case of managed table, you own the data and hence when you drop the
> table the data is deleted.
>
> In case of external table, you don't have ownership of the data and hence
> when you delete such a table, the underlying data is not deleted. Only
> metadata is deleted.
>
> Now, recently I have observed that you cannot create an external table
> over a location on which you don't have write (modification) permissions in
> HDFS. I completely fail to understand this.
>
> Use case: It is quite common that the data you are churning is huge and
> read-only. So, to churn such data via Hive, will you have to copy this huge
> data to a location on which you have write permissions?
>
> Please help.
>
> My data is located in a hdfs folder
> (/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/)  on which I
> only have readonly permission. And I am trying to execute the following
> command
>
> *CREATE EXTERNAL TABLE tweets_raw (*
> *id BIGINT,*
> *created_at STRING,*
> *source STRING,*
> *favorited BOOLEAN,*
> *retweet_count INT,*
> *retweeted_status STRUCT<*
> *text:STRING,*
> *users:STRUCT>,*
> *entities STRUCT<*
> *urls:ARRAY,*
> *user_mentions:ARRAY>,*
> *hashtags:ARRAY>,*
> *text STRING,*
> *user1 STRUCT<*
> *screen_name:STRING,*
> *name:STRING,*
> *friends_count:INT,*
> *followers_count:INT,*
> *statuses_count:INT,*
> *verified:BOOLEAN,*
> *utc_offset:STRING, -- was INT but nulls are strings*
> *time_zone:STRING>,*
> *in_reply_to_screen_name STRING,*
> *year int,*
> *month int,*
> *day int,*
> *hour int*
> *)*
> *ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'*
> *WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")*
> *LOCATION
> '/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/'*
> *;*
>
> It throws the following error:
>
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.DDLTask.
> MetaException(message:java.security.AccessControlException: Permission
> denied: user=sandeep, access=WRITE,
> inode="/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw":hdfs:hdfs:drwxr-xr-x
> at
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
> at
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
> at
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1771)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1755)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1729)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8348)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1978)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.ja
> va:1443)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProto
> s.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
>
>
>
> --
> Regards,
> Sandeep Giri,
> +1-(347) 781-4573 (US)
> +91-953-899-8962 (IN)
> www.CloudxLab.com  (A Hadoop cluster for practicing)
>
>


Why does the user need write permission on the location of external hive table?

2016-05-31 Thread Sandeep Giri
Hi Hive Team,

As per my understanding, in Hive, you can create two kinds of tables:
Managed and External.

In case of managed table, you own the data and hence when you drop the
table the data is deleted.

In case of external table, you don't have ownership of the data and hence
when you delete such a table, the underlying data is not deleted. Only
metadata is deleted.

Now, recently I have observed that you cannot create an external table
over a location on which you don't have write (modification) permissions in
HDFS. I completely fail to understand this.

Use case: It is quite common that the data you are churning is huge and
read-only. So, to churn such data via Hive, will you have to copy this huge
data to a location on which you have write permissions?

Please help.

My data is located in a hdfs folder
(/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/)  on which I
only have readonly permission. And I am trying to execute the following
command

*CREATE EXTERNAL TABLE tweets_raw (*
*id BIGINT,*
*created_at STRING,*
*source STRING,*
*favorited BOOLEAN,*
*retweet_count INT,*
*retweeted_status STRUCT<*
*text:STRING,*
*users:STRUCT>,*
*entities STRUCT<*
*urls:ARRAY,*
*user_mentions:ARRAY>,*
*hashtags:ARRAY>,*
*text STRING,*
*user1 STRUCT<*
*screen_name:STRING,*
*name:STRING,*
*friends_count:INT,*
*followers_count:INT,*
*statuses_count:INT,*
*verified:BOOLEAN,*
*utc_offset:STRING, -- was INT but nulls are strings*
*time_zone:STRING>,*
*in_reply_to_screen_name STRING,*
*year int,*
*month int,*
*day int,*
*hour int*
*)*
*ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'*
*WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")*
*LOCATION
'/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/'*
*;*

It throws the following error:

FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:java.security.AccessControlException: Permission
denied: user=sandeep, access=WRITE,
inode="/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw":hdfs:hdfs:drwxr-xr-x
at
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
at
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1771)
at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1755)
at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1729)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8348)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1978)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.ja
va:1443)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProto
s.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)



-- 
Regards,
Sandeep Giri,
+1-(347) 781-4573 (US)
+91-953-899-8962 (IN)
www.CloudxLab.com  (A Hadoop cluster for practicing)


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Jörn Franke
Thanks, very interesting explanation. Looking forward to testing it.

> On 31 May 2016, at 07:51, Gopal Vijayaraghavan  wrote:
> 
> 
>> That being said all systems are evolving. Hive supports tez+llap which
>> is basically the in-memory support.
> 
> There is a big difference between LLAP & SparkSQL, which has to do
> with access pattern needs.
> 
> The first one is related to the lifetime of the cache - the Spark RDD
> cache is per-user-session which allows for further operation in that
> session to be optimized.
> 
> LLAP is designed to be hammered by multiple user sessions running
> different queries, designed to automate the cache eviction & selection
> process. There's no user visible explicit .cache() to remember - it's
> automatic and concurrent.
> 
> My team works with both engines, trying to improve it for ORC, but the
> goals of both are different.
> 
> I will probably have to write a proper academic paper & get it
> edited/reviewed instead of sending my ramblings to the user lists like this.
> Still, this needs an example to talk about.
> 
> To give a qualified example, let's leave the world of single use clusters
> and take the use-case detailed here
> 
> http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
> 
> 
> There are two distinct problems there - one is that a single day sees up to
> 100k independent user sessions running queries and that most queries cover
> the last hour (& possibly join/compare against a similar hour aggregate
> from the past).
> 
> The problem with having independent 100k user-sessions from different
> connections was that the SparkSQL layer drops the RDD lineage & cache
> whenever a user ends a session.
> 
> The scale problem in general for Impala was that even though the data size
> was in multiple terabytes, the actual hot data was approx <20Gb, which
> resides on <10 machines with locality.
> 
> The same problem applies when you use RDD caching with something
> un-replicated like Tachyon/Alluxio, since the same RDD will be so exceedingly
> popular that the machines which hold those blocks run extra hot.
> 
> A cache model per-user session is entirely wasteful and a common cache +
> MPP model effectively overloads 2-3% of cluster, while leaving the other
> machines idle.
> 
> LLAP was designed specifically to prevent that hotspotting, while
> maintaining the common cache model - within a few minutes after an hour
> ticks over, the whole cluster develops temporal popularity for the hot
> data and nearly every rack has at least one cached copy of the same data
> for availability/performance.
> 
> Since data streams tend to be extremely wide tables (Omniture comes to
> mind), the cache actually does not hold all columns in a table, and since
> Zipf distributions are extremely common in these real data sets, the cache
> does not hold all rows either.
> 
> select count(clicks) from table where zipcode = 695506;
> 
> with ORC data bucketed + *sorted* by zipcode, the row-groups which are in
> the cache will be the only 2 columns (clicks & zipcode) & all bloomfilter
> indexes for all files will be loaded into memory, all misses on the bloom
> will not even feature in the cache.
> 
> A subsequent query for
> 
> select count(clicks) from table where zipcode = 695586;
> 
> will run against the collected indexes, before deciding which files need
> to be loaded into cache.
> 
> 
> Then again, 
> 
> select count(clicks)/count(impressions) from table where zipcode = 695586;
> 
> will load only impressions out of the table into cache, to add it to the
> columnar cache without producing another complete copy (RDDs are not
> mutable, but LLAP cache is additive).
> 
> The column split cache & index-cache separation allows for this to be
> cheaper than a full rematerialization - both are evicted as they fill up,
> with different priorities.
> 
> Following the same vein, LLAP can do a bit of clairvoyant pre-processing,
> with a bit of input from UX patterns observed from Tableau/Microstrategy
> users to give it the impression of being much faster than the engine
> really can be.
> 
> Illusion of performance is likely to be indistinguishable from actual -
> I'm actually looking for subjects for that experiment :)
> 
> Cheers,
> Gopal
> 
>
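
For illustration of the bucketed and sorted ORC layout referred to above, a
minimal sketch; the table name, columns, bucket count and bloom filter property
are assumed rather than taken from the thread:

-- Hypothetical layout where zipcode predicates can use the sorted row groups
-- and bloom filter indexes that LLAP caches; all names and numbers are
-- illustrative only.
CREATE TABLE clicks_by_zip (
  zipcode     INT,
  clicks      BIGINT,
  impressions BIGINT
)
CLUSTERED BY (zipcode) SORTED BY (zipcode) INTO 64 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.bloom.filter.columns' = 'zipcode');

-- A query of the kind discussed above, touching only the zipcode and clicks
-- columns:
SELECT count(clicks) FROM clicks_by_zip WHERE zipcode = 695506;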