[no subject]

2020-01-14 Thread @Sanjiv Singh
Regards
Sanjiv Singh
Mob :  +1 571-599-5236


Re: Spark SQL Parallelism - While reading from Oracle

2016-08-10 Thread @Sanjiv Singh
You can use the JDBC partitioning options for this.
You can set all of these properties (driver, partitionColumn, lowerBound,
upperBound, numPartitions); you should start by fetching the maximum id
through the driver first.

Once you have the maximum id, you can use it for the upperBound parameter.
Choose numPartitions based on your table's size and the actual system you run
on. With this snippet you read a database table into a DataFrame with Spark.

df = sqlContext.read.format('jdbc').options(
    url="jdbc:mysql://ip-address:3306/sometable?user=username&password=password",
    dbtable="sometable",
    driver="com.mysql.jdbc.Driver",
    partitionColumn="id",
    lowerBound="1",
    upperBound=str(maxId),
    numPartitions="100"
).load()
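
A rough Scala equivalent of the snippet above (matching the DataFrame API used in
Siva's question quoted below) could look like the following sketch. The URL,
database, table and column names are illustrative assumptions, not from the
original mail; max(id) is fetched first through a plain JDBC read and then used as
the upper bound of the partitioned read.

val url = "jdbc:mysql://ip-address:3306/somedb?user=username&password=password"

// Fetch max(id) on the driver through a non-partitioned JDBC read.
val maxId = sqlContext.read.format("jdbc")
  .options(Map(
    "url" -> url,
    "driver" -> "com.mysql.jdbc.Driver",
    "dbtable" -> "(select max(id) as max_id from sometable) tmp"))
  .load()
  .first()
  .getAs[Number](0)
  .longValue()

// Partitioned read: Spark issues numPartitions parallel queries over [lowerBound, upperBound].
val df = sqlContext.read.format("jdbc")
  .options(Map(
    "url" -> url,
    "driver" -> "com.mysql.jdbc.Driver",
    "dbtable" -> "sometable",
    "partitionColumn" -> "id",
    "lowerBound" -> "1",
    "upperBound" -> maxId.toString,
    "numPartitions" -> "100"))
  .load()

// If downstream work still runs with too few tasks, the loaded DataFrame
// can also be repartitioned explicitly, e.g. df.repartition(100).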



Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Wed, Aug 10, 2016 at 6:35 AM, Siva A  wrote:

> Hi Team,
>
> How do we increase the parallelism in Spark SQL.
> In Spark Core, we can re-partition or pass extra arguments part of the
> transformation.
>
> I am trying the below example,
>
> val df1 = sqlContext.read.format("jdbc").options(Map(...)).load
> val df2= df1.cache
> val df2.count
>
> Here count operation using only one task. I couldn't increase the
> parallelism.
> Thanks in advance
>
> Thanks
> Siva
>


Re: Spark SQL is not returning records for HIVE transactional tables on HDP

2016-03-13 Thread @Sanjiv Singh
Hi All,
We are using the following versions with Spark SQL:


   - Hive : 1.2.1
   - Spark : 1.3.1
   - Hadoop : 2.7.1

Let me know if you need other details to debug the issue.


Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Sun, Mar 13, 2016 at 1:07 AM, Mich Talebzadeh 
wrote:

> Hi,
>
> Thanks for the input. I use Hive 2 and still have this issue.
>
>
>
>1. Hive version 2
>2. Hive on Spark engine 1.3.1
>3. Spark 1.5.2
>
>
> I have added Hive user group  to this as well. So hopefully we may get
> some resolution.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 12 March 2016 at 19:25, Timur Shenkao  wrote:
>
>> Hi,
>>
>> I have suffered from Hive Streaming , Transactions enough, so I can share
>> my experience with you.
>>
>> 1) It's not a problem of Spark. It happens because of "peculiarities" /
>> bugs of Hive Streaming.  Hive Streaming, transactions are very raw
>> technologies. If you look at Hive JIRA, you'll see several critical bugs
>> concerning Hive Streaming, transactions. Some of them are resolved in Hive
>> 2+ only. But Cloudera & Hortonworks ship their distributions with outdated
>> & buggy Hive.
>> So use Hive 2+. Earlier versions of Hive didn't run compaction at all.
>>
>> 2) In Hive 1.1, I  issue the following lines
>> ALTER TABLE default.foo COMPACT 'MAJOR';
>> SHOW COMPACTIONS;
>>
>> My manual compaction was shown but it was never fulfilled.
>>
>> 3) If you use Hive Streaming, it's not recommended or even forbidden to
>> insert rows into Hive Streaming tables manually. Only the process that
>> writes to such table should insert incoming rows sequentially. Otherwise
>> you'll get unpredictable behaviour.
>>
>> 4) Ordinary Hive tables are catalogs with text, ORC, etc. files.
>> Hive Streaming / transactional tables are catalogs that have numerous
>> subcatalogs with "delta" prefix. Moreover, there are files with
>> "flush_length" suffix in some delta subfolders. "flush_length" files have 8
>> bytes length. The presence of "flush_length" file in some subfolder means
>> that Hive writes updates to this subfolder right now. When Hive fails or is
>> restarted, it begins to write into new delta subfolder with new
>> "flush_length" file. And old "flush_length" file (that was used before
>> failure) still remains.
>> One of the goal of compaction is to delete outdated "flush_length" files.
>> Not every application / library can read such folder structure or knows
>> details of Hive Streaming / transactions implementation. Most of the
>> software solutions still expect ordinary Hive tables as input.
>> When they encounter subcatalogs or special files "flush_length" file,
>> applications / libraries either "see nothing" (return 0 or empty result
>> set) or stumble over "flush_length" files (return unexplainable errors).
>>
>> For instance, Facebook Presto couldn't read subfolders by default unless
>> you activate special parameters. But it stumbles over "flush_length" files
>> as Presto expect legal ORC files not 8-byte-length text files in folders.
>>
>> So, I don't advise you to use Hive Streaming, transactions right now in
>> real production systems (24 / 7 /365) with hundreds millions of events a
>> day.
>>
>> On Sat, Mar 12, 2016 at 11:24 AM, @Sanjiv Singh 
>> wrote:
>>
>>> Hi All,
>>>
>>> I am facing this issue on HDP setup on which COMPACTION is required only
>>> once for transactional tables to fetch records with Spark SQL.
>>> On the other hand, Apache setup doesn't required compaction even once.
>>>
>>> May be something got triggered on meta-store after compaction, Spark SQL
>>> start recognizing delta files.
>>>
>>> Let know me if needed other details to get root cause.
>>>
>>> Try this,
>>>
>>> *See complete scenario :*
>>>
>>> hive> create table default.foo(id int) clustered by (id) into 2 buckets
>>> STORED AS ORC TBLPROPERTIES ('transactional'='true');
>>> hive> insert into default.foo values(10);
>>>
>>> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
>>> because data is still in delta files
>>>
>>> Now run major compaction:
>>>
>>> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>>>
>>> scala> sqlContext.table("default.foo").count // Gives 1
>>>
>>> hive> insert into foo values(20);
>>>
>>> scala> sqlContext.table("default.foo").count* // Gives 2 , no
>>> compaction required.*
>>>
>>>
>>>
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob :  +091 9990-447-339
>>>
>>
>>
>


Spark SQL is not returning records for HIVE transactional tables on HDP

2016-03-12 Thread @Sanjiv Singh
Hi All,

I am facing this issue on an HDP setup, where COMPACTION is required once for
transactional tables before Spark SQL can fetch records from them.
On the other hand, a plain Apache setup doesn't require compaction even once.

Maybe something gets triggered on the metastore after compaction, after which
Spark SQL starts recognizing the delta files.

Let me know if you need other details to get to the root cause.

Try this,

*See complete scenario :*

hive> create table default.foo(id int) clustered by (id) into 2 buckets
STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> insert into default.foo values(10);

scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
because data is still in delta files

Now run major compaction:

hive> ALTER TABLE default.foo COMPACT 'MAJOR';

scala> sqlContext.table("default.foo").count // Gives 1

hive> insert into foo values(20);

scala> sqlContext.table("default.foo").count // Gives 2, no compaction required.




Regards
Sanjiv Singh
Mob :  +091 9990-447-339


Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-23 Thread @Sanjiv Singh
Yes, it is very strange, and quite contrary to my understanding of Spark SQL on
Hive tables.

I am facing this issue on an HDP setup, where COMPACTION is required once.
On the other hand, a plain Apache setup doesn't require compaction even once.

Maybe something gets triggered on the metastore after compaction, after which
Spark SQL starts recognizing the delta files.

Let me know if you need other details to get to the root cause.



Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Tue, Feb 23, 2016 at 2:28 PM, Varadharajan Mukundan  wrote:

> That's interesting. I'm not sure why first compaction is needed but not on
> the subsequent inserts. May be its just to create few metadata. Thanks for
> clarifying this :)
>
> On Tue, Feb 23, 2016 at 2:15 PM, @Sanjiv Singh 
> wrote:
>
>> Try this,
>>
>>
>> hive> create table default.foo(id int) clustered by (id) into 2 buckets
>> STORED AS ORC TBLPROPERTIES ('transactional'='true');
>> hive> insert into default.foo values(10);
>>
>> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
>> because data is still in delta files
>>
>> Now run major compaction:
>>
>> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>>
>> scala> sqlContext.table("default.foo").count // Gives 1
>>
>> hive> insert into foo values(20);
>>
>> scala> sqlContext.table("default.foo").count* // Gives 2 , no compaction
>> required.*
>>
>>
>>
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>> On Tue, Feb 23, 2016 at 2:02 PM, Varadharajan Mukundan <
>> srinath...@gmail.com> wrote:
>>
>>> This is the scenario i'm mentioning.. I'm not using Spark JDBC. Not sure
>>> if its different.
>>>
>>> Please walkthrough the below commands in the same order to understand
>>> the sequence.
>>>
>>> hive> create table default.foo(id int) clustered by (id) into 2 buckets
>>> STORED AS ORC TBLPROPERTIES ('transactional'='true');
>>> hive> insert into foo values(10);
>>>
>>> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
>>> because data is still in delta files
>>>
>>> Now run major compaction:
>>>
>>> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>>>
>>> scala> sqlContext.table("default.foo").count // Gives 1
>>>
>>>
>>> On Tue, Feb 23, 2016 at 12:35 PM, @Sanjiv Singh 
>>> wrote:
>>>
>>>> Hi Varadharajan,
>>>>
>>>>
>>>> That is the point, Spark SQL is able to recognize delta files. See
>>>> below directory structure, ONE BASE (43 records) and one DELTA (created
>>>> after last insert). And I am able see last insert through Spark SQL.
>>>>
>>>>
>>>> *See below complete scenario :*
>>>>
>>>> *Steps:*
>>>>
>>>>- Inserted 43 records in table.
>>>>- Run major compaction on table.
>>>>- *alter table mytable COMPACT 'major';*
>>>>- Disabled auto compaction on table.
>>>>- *alter table mytable set
>>>>   TBLPROPERTIES("NO_AUTO_COMPACTION"="true");*
>>>>- Inserted 1 record in table.
>>>>
>>>>
>>>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>>>> drwxrwxrwx   - root hdfs  0 2016-02-23 11:43
>>>> /apps/hive/warehouse/mydb.db/mytable/base_087
>>>> drwxr-xr-x   - root hdfs  0 2016-02-23 12:02
>>>> /apps/hive/warehouse/mydb.db/mytable/delta_088_088
>>>>
>>>> *SPARK JDBC :*
>>>>
>>>> 0: jdbc:hive2://myhost:> select count(*) from mytable ;
>>>> +--+
>>>> | _c0  |
>>>> +--+
>>>> | 44   |
>>>> +--+
>>>> 1 row selected (1.196 seconds)
>>>>
>>>> *HIVE JDBC :*
>>>>
>>>> 1: jdbc:hive2://myhost:1> select count(*) from mytable ;
>>>> +--+--+
>>>> | _c0  |
>>>> +--+--+
>>>> | 44   |
>>>> +--+--+
>>>> 1 row selected (0.121 seconds)
>>>>
>>>>
>>>> Regards
>>>> Sanjiv Singh
>>>> Mob :  +091 9990-447-339
>>>>
>>>> On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <
>&g

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-23 Thread @Sanjiv Singh
Try this,


hive> create table default.foo(id int) clustered by (id) into 2 buckets
STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> insert into default.foo values(10);

scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
because data is still in delta files

Now run major compaction:

hive> ALTER TABLE default.foo COMPACT 'MAJOR';

scala> sqlContext.table("default.foo").count // Gives 1

hive> insert into foo values(20);

scala> sqlContext.table("default.foo").count // Gives 2, no compaction required.




Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Tue, Feb 23, 2016 at 2:02 PM, Varadharajan Mukundan  wrote:

> This is the scenario i'm mentioning.. I'm not using Spark JDBC. Not sure
> if its different.
>
> Please walkthrough the below commands in the same order to understand the
> sequence.
>
> hive> create table default.foo(id int) clustered by (id) into 2 buckets
> STORED AS ORC TBLPROPERTIES ('transactional'='true');
> hive> insert into foo values(10);
>
> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
> because data is still in delta files
>
> Now run major compaction:
>
> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>
> scala> sqlContext.table("default.foo").count // Gives 1
>
>
> On Tue, Feb 23, 2016 at 12:35 PM, @Sanjiv Singh 
> wrote:
>
>> Hi Varadharajan,
>>
>>
>> That is the point, Spark SQL is able to recognize delta files. See below
>> directory structure, ONE BASE (43 records) and one DELTA (created after
>> last insert). And I am able see last insert through Spark SQL.
>>
>>
>> *See below complete scenario :*
>>
>> *Steps:*
>>
>>- Inserted 43 records in table.
>>- Run major compaction on table.
>>- *alter table mytable COMPACT 'major';*
>>- Disabled auto compaction on table.
>>- *alter table mytable set
>>   TBLPROPERTIES("NO_AUTO_COMPACTION"="true");*
>>- Inserted 1 record in table.
>>
>>
>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>> drwxrwxrwx   - root hdfs  0 2016-02-23 11:43
>> /apps/hive/warehouse/mydb.db/mytable/base_087
>> drwxr-xr-x   - root hdfs  0 2016-02-23 12:02
>> /apps/hive/warehouse/mydb.db/mytable/delta_088_088
>>
>> *SPARK JDBC :*
>>
>> 0: jdbc:hive2://myhost:> select count(*) from mytable ;
>> +--+
>> | _c0  |
>> +--+
>> | 44   |
>> +--+
>> 1 row selected (1.196 seconds)
>>
>> *HIVE JDBC :*
>>
>> 1: jdbc:hive2://myhost:1> select count(*) from mytable ;
>> +--+--+
>> | _c0  |
>> +--+--+
>> | 44   |
>> +--+--+
>> 1 row selected (0.121 seconds)
>>
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>> On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <
>> srinath...@gmail.com> wrote:
>>
>>> Hi Sanjiv,
>>>
>>> Yes.. If we make use of Hive JDBC we should be able to retrieve all the
>>> rows since it is hive which processes the query. But i think the problem
>>> with Hive JDBC is that there are two layers of processing, hive and then at
>>> spark with the result set. And another one is performance is limited to
>>> that single HiveServer2 node and network.
>>>
>>> But If we make use of sqlContext.table function in spark to access hive
>>> tables, it is supposed to read files directly from HDFS skipping the hive
>>> layer. But it doesn't read delta files and just reads the contents from
>>> base folder. Only after Major compaction, the delta files would be merged
>>> with based folder and be visible for Spark SQL
>>>
>>> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh 
>>> wrote:
>>>
>>>> Hi Varadharajan,
>>>>
>>>> Can you elaborate on (you quoted on previous mail) :
>>>> "I observed that hive transaction storage structure do not work with
>>>> spark yet"
>>>>
>>>>
>>>> If it is related to delta files created after each transaction and
>>>> spark would not be able recognize them. then I have a table *mytable *(ORC
>>>> , BUCKETED , NON-SORTED) , already done lots on insert , update and
>>>> deletes. I can see delta files created in HDFS (see below), Still able to
>>>> fetch consistent records through Spa

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-22 Thread @Sanjiv Singh
Hi Varadharajan,


That is the point: Spark SQL is able to recognize the delta files. See the
directory structure below, ONE BASE (43 records) and one DELTA (created after the
last insert). And I am able to see the last insert through Spark SQL.


*See below complete scenario :*

*Steps:*

   - Inserted 43 records in table.
   - Run major compaction on table.
   - *alter table mytable COMPACT 'major';*
   - Disabled auto compaction on table.
   - *alter table mytable set TBLPROPERTIES("NO_AUTO_COMPACTION"="true");*
   - Inserted 1 record in table.


> *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
drwxrwxrwx   - root hdfs  0 2016-02-23 11:43
/apps/hive/warehouse/mydb.db/mytable/base_087
drwxr-xr-x   - root hdfs  0 2016-02-23 12:02
/apps/hive/warehouse/mydb.db/mytable/delta_088_088

*SPARK JDBC :*

0: jdbc:hive2://myhost:> select count(*) from mytable ;
+--+
| _c0  |
+--+
| 44   |
+--+
1 row selected (1.196 seconds)

*HIVE JDBC :*

1: jdbc:hive2://myhost:1> select count(*) from mytable ;
+--+--+
| _c0  |
+--+--+
| 44   |
+--+--+
1 row selected (0.121 seconds)


Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <
srinath...@gmail.com> wrote:

> Hi Sanjiv,
>
> Yes.. If we make use of Hive JDBC we should be able to retrieve all the
> rows since it is hive which processes the query. But i think the problem
> with Hive JDBC is that there are two layers of processing, hive and then at
> spark with the result set. And another one is performance is limited to
> that single HiveServer2 node and network.
>
> But If we make use of sqlContext.table function in spark to access hive
> tables, it is supposed to read files directly from HDFS skipping the hive
> layer. But it doesn't read delta files and just reads the contents from
> base folder. Only after Major compaction, the delta files would be merged
> with based folder and be visible for Spark SQL
>
> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh 
> wrote:
>
>> Hi Varadharajan,
>>
>> Can you elaborate on (you quoted on previous mail) :
>> "I observed that hive transaction storage structure do not work with
>> spark yet"
>>
>>
>> If it is related to delta files created after each transaction and spark
>> would not be able recognize them. then I have a table *mytable *(ORC ,
>> BUCKETED , NON-SORTED) , already done lots on insert , update and deletes.
>> I can see delta files created in HDFS (see below), Still able to fetch
>> consistent records through Spark JDBC and HIVE JDBC.
>>
>> Not compaction triggered for that table.
>>
>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>>
>> drwxrwxrwx   - root hdfs  0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/base_060
>> drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_061_061
>> drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_062_062
>> drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_063_063
>> drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_064_064
>> drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_065_065
>> drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_066_066
>> drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_067_067
>> drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_068_068
>> drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_069_069
>> drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_070_070
>> drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_071_071
>> drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_072_072
>> drwxr-xr-x   - root hdfs  0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_073_073
>> drwxr-xr-x   - root hdfs  0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_074_074
>> drwxr-xr-x   - root hdfs  0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_075_075
>> drwxr-xr-x   - root hdfs

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-22 Thread @Sanjiv Singh
Hi Varadharajan,

Can you elaborate on this (which you quoted in a previous mail):
"I observed that hive transaction storage structure do not work with spark
yet"


If it is related to the delta files created after each transaction, and Spark not
being able to recognize them: I have a table *mytable* (ORC, BUCKETED,
NON-SORTED) that has already gone through lots of inserts, updates and deletes.
I can see the delta files created in HDFS (see below), and I am still able to
fetch consistent records through both Spark JDBC and Hive JDBC.

No compaction has been triggered for that table.

> *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*

drwxrwxrwx   - root hdfs  0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/base_060
drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_061_061
drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_062_062
drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_063_063
drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_064_064
drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_065_065
drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_066_066
drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_067_067
drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_068_068
drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_069_069
drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_070_070
drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_071_071
drwxr-xr-x   - root hdfs  0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_072_072
drwxr-xr-x   - root hdfs  0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_073_073
drwxr-xr-x   - root hdfs  0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_074_074
drwxr-xr-x   - root hdfs  0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_075_075
drwxr-xr-x   - root hdfs  0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_076_076
drwxr-xr-x   - root hdfs  0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_077_077
drwxr-xr-x   - root hdfs  0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_078_078
drwxr-xr-x   - root hdfs  0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_079_079
drwxr-xr-x   - root hdfs  0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_080_080
drwxr-xr-x   - root hdfs  0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_081_081
drwxr-xr-x   - root hdfs  0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_082_082
drwxr-xr-x   - root hdfs  0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_083_083
drwxr-xr-x   - root hdfs  0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_084_084
drwxr-xr-x   - root hdfs  0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_085_085
drwxr-xr-x   - root hdfs  0 2016-02-23 11:40
/apps/hive/warehouse/mydb.db/mytable/delta_086_086
drwxr-xr-x   - root hdfs  0 2016-02-23 11:41
/apps/hive/warehouse/mydb.db/mytable/delta_087_087



Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan  wrote:

> Actually the auto compaction if enabled is triggered based on the volume
> of changes. It doesn't automatically run after every insert. I think its
> possible to reduce the thresholds but that might reduce performance by a
> big margin. As of now, we do compaction after the batch insert completes.
>
> The only other way to solve this problem as of now is to use Hive JDBC API.
>
> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh 
> wrote:
>
>> Compaction would have been triggered automatically as following
>> properties already set in *hive-site.xml*. and also *NO_AUTO_COMPACTION* 
>> property
>> not been set for these tables.
>>
>>
>> 
>>
>>   hive.compactor.initiator.on
>>
>>   true
>>
>> 
>>
>> 
>>
>>   hive.compactor.worker.threads
>>
>>   1
>>
>> 
>>
>>
>> Documentation is upset sometimes.
>>
>>
>>
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>> On Mo

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-21 Thread @Sanjiv Singh
Compaction should have been triggered automatically, as the following properties
are already set in *hive-site.xml*, and the *NO_AUTO_COMPACTION* property has not
been set for these tables.




<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>

<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>




The documentation is sometimes misleading.




Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan  wrote:

> Yes, I was burned down by this issue couple of weeks back. This also means
> that after every insert job, compaction should be run to access new rows
> from Spark. Sad that this issue is not documented / mentioned anywhere.
>
> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh 
> wrote:
>
>> Hi Varadharajan,
>>
>> Thanks for your response.
>>
>> Yes it is transnational table; See below *show create table. *
>>
>> Table hardly have 3 records , and after triggering minor compaction on
>> tables , it start showing results on spark SQL.
>>
>>
>> > *ALTER TABLE hivespark COMPACT 'major';*
>>
>>
>> > *show create table hivespark;*
>>
>>   CREATE TABLE `hivespark`(
>>
>> `id` int,
>>
>> `name` string)
>>
>>   CLUSTERED BY (
>>
>> id)
>>
>>   INTO 32 BUCKETS
>>
>>   ROW FORMAT SERDE
>>
>> 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>
>>   STORED AS INPUTFORMAT
>>
>> 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>
>>   OUTPUTFORMAT
>>
>> 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>
>>   LOCATION
>>
>> 'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>   TBLPROPERTIES (
>>
>> 'COLUMN_STATS_ACCURATE'='true',
>>
>> 'last_modified_by'='root',
>>
>> 'last_modified_time'='1455859079',
>>
>> 'numFiles'='37',
>>
>> 'numRows'='3',
>>
>> 'rawDataSize'='0',
>>
>> 'totalSize'='11383',
>>
>> 'transactional'='true',
>>
>> 'transient_lastDdlTime'='1455864121') ;
>>
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <
>> srinath...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Is the transaction attribute set on your table? I observed that hive
>>> transaction storage structure do not work with spark yet. You can confirm
>>> this by looking at the transactional attribute in the output of "desc
>>> extended " in hive console.
>>>
>>> If you'd need to access transactional table, consider doing a major
>>> compaction and then try accessing the tables
>>>
>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> I have observed that Spark SQL is not returning records for hive
>>>> bucketed ORC tables on HDP.
>>>>
>>>>
>>>>
>>>> On spark SQL , I am able to list all tables , but queries on hive
>>>> bucketed tables are not returning records.
>>>>
>>>> I have also tried the same for non-bucketed hive tables. it is working
>>>> fine.
>>>>
>>>>
>>>>
>>>> Same is working on plain Apache setup.
>>>>
>>>> Let me know if needs other details.
>>>>
>>>> Regards
>>>> Sanjiv Singh
>>>> Mob :  +091 9990-447-339
>>>>
>>>
>>>
>>>
>>> --
>>> Thanks,
>>> M. Varadharajan
>>>
>>> 
>>>
>>> "Experience is what you get when you didn't get what you wanted"
>>>-By Prof. Randy Pausch in "The Last Lecture"
>>>
>>> My Journal :- http://varadharajan.in
>>>
>>
>>
>
>
> --
> Thanks,
> M. Varadharajan
>
> 
>
> "Experience is what you get when you didn't get what you wanted"
>-By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in
>


Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-21 Thread @Sanjiv Singh
Hi Varadharajan,

Thanks for your response.

Yes, it is a transactional table; see the *show create table* output below.

The table has hardly 3 records, and after triggering a compaction on the table, it
starts showing results in Spark SQL.


> *ALTER TABLE hivespark COMPACT 'major';*


> *show create table hivespark;*

  CREATE TABLE `hivespark`(
    `id` int,
    `name` string)
  CLUSTERED BY (
    id)
  INTO 32 BUCKETS
  ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
  LOCATION
    'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
  TBLPROPERTIES (
    'COLUMN_STATS_ACCURATE'='true',
    'last_modified_by'='root',
    'last_modified_time'='1455859079',
    'numFiles'='37',
    'numRows'='3',
    'rawDataSize'='0',
    'totalSize'='11383',
    'transactional'='true',
    'transient_lastDdlTime'='1455864121') ;


Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan  wrote:

> Hi,
>
> Is the transaction attribute set on your table? I observed that hive
> transaction storage structure do not work with spark yet. You can confirm
> this by looking at the transactional attribute in the output of "desc
> extended " in hive console.
>
> If you'd need to access transactional table, consider doing a major
> compaction and then try accessing the tables
>
> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh 
> wrote:
>
>> Hi,
>>
>>
>> I have observed that Spark SQL is not returning records for hive bucketed
>> ORC tables on HDP.
>>
>>
>>
>> On spark SQL , I am able to list all tables , but queries on hive
>> bucketed tables are not returning records.
>>
>> I have also tried the same for non-bucketed hive tables. it is working
>> fine.
>>
>>
>>
>> Same is working on plain Apache setup.
>>
>> Let me know if needs other details.
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>
>
>
> --
> Thanks,
> M. Varadharajan
>
> 
>
> "Experience is what you get when you didn't get what you wanted"
>-By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in
>


Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-21 Thread @Sanjiv Singh
Hi,


I have observed that Spark SQL is not returning records for Hive bucketed ORC
tables on HDP.

In Spark SQL, I am able to list all tables, but queries on Hive bucketed tables
return no records.

I have also tried the same with non-bucketed Hive tables; those work fine.

The same setup works on a plain Apache installation.

Let me know if you need other details.

Regards
Sanjiv Singh
Mob :  +091 9990-447-339


Re: Having issue with Spark SQL JDBC on hive table !!!

2016-01-28 Thread @Sanjiv Singh
It is working now.

I checked the Spark worker UI; executor startup was failing with the error below,
because JVM initialization failed due to an invalid -Xms value:

Invalid initial heap size: -Xms0M
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

The Thrift server was not picking up the executor memory from *spark-env.sh*, so I
added it explicitly in the Thrift server startup script.

*./sbin/start-thriftserver.sh*

exec "$FWDIR"/sbin/spark-daemon.sh spark-submit $CLASS 1
--executor-memory 512M "$@"

With this, the executors get valid memory and the JDBC queries return results.

*conf/spark-env.sh* (this executor memory configuration was not picked up by the
Thrift server):

export SPARK_JAVA_OPTS="-Dspark.executor.memory=512M"
export SPARK_EXECUTOR_MEMORY=512M
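
Since start-thriftserver.sh hands its arguments on to spark-submit (as the exec
line above shows), passing the setting at launch time, e.g. --executor-memory 512M
or --conf spark.executor.memory=512m on the ./sbin/start-thriftserver.sh command
line, should also work without editing the script.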


Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Thu, Jan 28, 2016 at 10:57 PM, @Sanjiv Singh 
wrote:

> Adding to it
>
> job status at UI :
>
> Stage IdDescriptionSubmittedDurationTasks: Succeeded/TotalInputOutputShuffle
> ReadShuffle Write
> 1 select ename from employeetest(kill
> <http://impetus-d951centos:4040/stages/stage/kill?id=1&terminate=true>)collect
> at SparkPlan.scala:84
> <http://impetus-d951centos:4040/stages/stage?id=1&attempt=0>+details
>
> 2016/01/29 04:20:06 3.0 min
> 0/2
>
> Getting below exception on Spark UI :
>
> org.apache.spark.rdd.RDD.collect(RDD.scala:813)
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
> org.apache.spark.sql.DataFrame.collect(DataFrame.scala:887)
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:178)
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
> org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
> org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
> org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
> org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
> org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:744)
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>
> On Thu, Jan 28, 2016 at 9:57 PM, @Sanjiv Singh 
> wrote:
>
>> Any help on this.
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>> On Wed, Jan 27, 2016 at 10:25 PM, @Sanjiv Singh 
>> wrote:
>>
>>> Hi Ted ,
>>> Its typo.
>>>
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob :  +091 9990-447-339
>>>
>>> On Wed, Jan 27, 2016 at 9:13 PM, Ted Yu  wrote:
>>>
>>>> In the last snippet, temptable is shown by 'show tables' command.
>>>> Yet you queried tampTable.
>>>>
>>>> I believe this just was typo :-)
>>>>
>>>> On Wed, Jan 27, 2016 at 7:07 AM, @Sanjiv Singh 
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I have configured Spark to query on hive table.
>>>>>
>>>>> Run the Thrift JDBC/ODBC server using below command :
>>>>>
>>>>> *cd $SPARK_HOME*
>>>>> *./sbin/start-thriftserver.sh --master spark://myhost:7077 --hiveconf
>>>>> hive.server2.thrift.bind.host=myhost --hiveconf
>>>>> hive.server2.thrift.port=*
>>>>>
>>>>> and also able to connect through beeline
>>>>>
>>>>> *beeline>* !connect jdbc:hive2://192.168.145.20:
>>>>> Enter username for jdbc:hive2://192.168.145.20:: root
>>>>> Enter password for jdbc:hive2://192.168.145.20:: impetus
>>>>> *beeline > *
>>>>>
>>>>> It is not giving query result on hive table through Spark JDBC, but it
>>>>> is working with spark HiveSQLContext. See complete scenario explain below.
>>>>>
>>>>> Help me understand the issue why Spark SQL JDBC is n

Re: Having issue with Spark SQL JDBC on hive table !!!

2016-01-28 Thread @Sanjiv Singh
Adding to it

job status at UI :

Stage Id: 1
Description: select ename from employeetest (collect at SparkPlan.scala:84)
Submitted: 2016/01/29 04:20:06
Duration: 3.0 min
Tasks (Succeeded/Total): 0/2

Getting the below stack trace on the Spark UI:

org.apache.spark.rdd.RDD.collect(RDD.scala:813)
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
org.apache.spark.sql.DataFrame.collect(DataFrame.scala:887)
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:178)
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)


Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Thu, Jan 28, 2016 at 9:57 PM, @Sanjiv Singh 
wrote:

> Any help on this.
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>
> On Wed, Jan 27, 2016 at 10:25 PM, @Sanjiv Singh 
> wrote:
>
>> Hi Ted ,
>> Its typo.
>>
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>> On Wed, Jan 27, 2016 at 9:13 PM, Ted Yu  wrote:
>>
>>> In the last snippet, temptable is shown by 'show tables' command.
>>> Yet you queried tampTable.
>>>
>>> I believe this just was typo :-)
>>>
>>> On Wed, Jan 27, 2016 at 7:07 AM, @Sanjiv Singh 
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have configured Spark to query on hive table.
>>>>
>>>> Run the Thrift JDBC/ODBC server using below command :
>>>>
>>>> *cd $SPARK_HOME*
>>>> *./sbin/start-thriftserver.sh --master spark://myhost:7077 --hiveconf
>>>> hive.server2.thrift.bind.host=myhost --hiveconf
>>>> hive.server2.thrift.port=*
>>>>
>>>> and also able to connect through beeline
>>>>
>>>> *beeline>* !connect jdbc:hive2://192.168.145.20:
>>>> Enter username for jdbc:hive2://192.168.145.20:: root
>>>> Enter password for jdbc:hive2://192.168.145.20:: impetus
>>>> *beeline > *
>>>>
>>>> It is not giving query result on hive table through Spark JDBC, but it
>>>> is working with spark HiveSQLContext. See complete scenario explain below.
>>>>
>>>> Help me understand the issue why Spark SQL JDBC is not giving result ?
>>>>
>>>> Below are version details.
>>>>
>>>> *Hive Version  : 1.2.1*
>>>> *Hadoop Version :  2.6.0*
>>>> *Spark version:  1.3.1*
>>>>
>>>> Let me know if need other details.
>>>>
>>>>
>>>> *Created Hive Table , insert some records and query it :*
>>>>
>>>> *beeline> !connect jdbc:hive2://myhost:1*
>>>> Enter username for jdbc:hive2://myhost:1: root
>>>> Enter password for jdbc:hive2://myhost:1: **
>>>> *beeline> create table tampTable(id int ,name string ) clustered by
>>>> (id) into 2 buckets stored as orc TBLPROPERTIES('transactional'='true');*
>>>> *beeline> insert into table tampTable values
>>>> (1,'row1'),(2,'row2'),(3,'row3');*
>>>> *beeline> select name from tampTable;*
>>>> name
>>>> -
>>>> row1
>>>> row3
>>>> row2
>>>>
>>>> *Query through SparkSQL HiveSQLContext :*
>>>>
>>>> SparkConf sparkConf = new SparkConf().setAppName("JavaSpar

Re: Having issue with Spark SQL JDBC on hive table !!!

2016-01-28 Thread @Sanjiv Singh
Any help on this.

Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Wed, Jan 27, 2016 at 10:25 PM, @Sanjiv Singh 
wrote:

> Hi Ted ,
> Its typo.
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>
> On Wed, Jan 27, 2016 at 9:13 PM, Ted Yu  wrote:
>
>> In the last snippet, temptable is shown by 'show tables' command.
>> Yet you queried tampTable.
>>
>> I believe this just was typo :-)
>>
>> On Wed, Jan 27, 2016 at 7:07 AM, @Sanjiv Singh 
>> wrote:
>>
>>> Hi All,
>>>
>>> I have configured Spark to query on hive table.
>>>
>>> Run the Thrift JDBC/ODBC server using below command :
>>>
>>> *cd $SPARK_HOME*
>>> *./sbin/start-thriftserver.sh --master spark://myhost:7077 --hiveconf
>>> hive.server2.thrift.bind.host=myhost --hiveconf
>>> hive.server2.thrift.port=*
>>>
>>> and also able to connect through beeline
>>>
>>> *beeline>* !connect jdbc:hive2://192.168.145.20:
>>> Enter username for jdbc:hive2://192.168.145.20:: root
>>> Enter password for jdbc:hive2://192.168.145.20:: impetus
>>> *beeline > *
>>>
>>> It is not giving query result on hive table through Spark JDBC, but it
>>> is working with spark HiveSQLContext. See complete scenario explain below.
>>>
>>> Help me understand the issue why Spark SQL JDBC is not giving result ?
>>>
>>> Below are version details.
>>>
>>> *Hive Version  : 1.2.1*
>>> *Hadoop Version :  2.6.0*
>>> *Spark version:  1.3.1*
>>>
>>> Let me know if need other details.
>>>
>>>
>>> *Created Hive Table , insert some records and query it :*
>>>
>>> *beeline> !connect jdbc:hive2://myhost:1*
>>> Enter username for jdbc:hive2://myhost:1: root
>>> Enter password for jdbc:hive2://myhost:1: **
>>> *beeline> create table tampTable(id int ,name string ) clustered by (id)
>>> into 2 buckets stored as orc TBLPROPERTIES('transactional'='true');*
>>> *beeline> insert into table tampTable values
>>> (1,'row1'),(2,'row2'),(3,'row3');*
>>> *beeline> select name from tampTable;*
>>> name
>>> -
>>> row1
>>> row3
>>> row2
>>>
>>> *Query through SparkSQL HiveSQLContext :*
>>>
>>> SparkConf sparkConf = new SparkConf().setAppName("JavaSparkSQL");
>>> SparkContext sc = new SparkContext(sparkConf);
>>> HiveContext hiveContext = new HiveContext(sc);
>>> DataFrame teenagers = hiveContext.sql("*SELECT name FROM tampTable*");
>>> List<String> teenagerNames = teenagers.toJavaRDD().map(new Function<Row, String>() {
>>>  @Override
>>>  public String call(Row row) {
>>>  return "Name: " + row.getString(0);
>>>  }
>>> }).collect();
>>> for (String name: teenagerNames) {
>>>  System.out.println(name);
>>> }
>>> teenagers2.toJavaRDD().saveAsTextFile("/tmp1");
>>> sc.stop();
>>>
>>> which is working perfectly and giving all names from table *tempTable*
>>>
>>> *Query through Spark SQL JDBC :*
>>>
>>> *beeline> !connect jdbc:hive2://myhost:*
>>> Enter username for jdbc:hive2://myhost:: root
>>> Enter password for jdbc:hive2://myhost:: **
>>> *beeline> show tables;*
>>> *temptable*
>>> *..other tables*
>>> beeline> *SELECT name FROM tampTable;*
>>>
>>> I can list the table through "show tables", but I run the query , it is
>>> either hanged or returns nothing.
>>>
>>>
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob :  +091 9990-447-339
>>>
>>
>>
>


Re: Having issue with Spark SQL JDBC on hive table !!!

2016-01-27 Thread @Sanjiv Singh
Hi Ted,
It's a typo.


Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Wed, Jan 27, 2016 at 9:13 PM, Ted Yu  wrote:

> In the last snippet, temptable is shown by 'show tables' command.
> Yet you queried tampTable.
>
> I believe this just was typo :-)
>
> On Wed, Jan 27, 2016 at 7:07 AM, @Sanjiv Singh 
> wrote:
>
>> Hi All,
>>
>> I have configured Spark to query on hive table.
>>
>> Run the Thrift JDBC/ODBC server using below command :
>>
>> *cd $SPARK_HOME*
>> *./sbin/start-thriftserver.sh --master spark://myhost:7077 --hiveconf
>> hive.server2.thrift.bind.host=myhost --hiveconf
>> hive.server2.thrift.port=*
>>
>> and also able to connect through beeline
>>
>> *beeline>* !connect jdbc:hive2://192.168.145.20:
>> Enter username for jdbc:hive2://192.168.145.20:: root
>> Enter password for jdbc:hive2://192.168.145.20:: impetus
>> *beeline > *
>>
>> It is not giving query result on hive table through Spark JDBC, but it is
>> working with spark HiveSQLContext. See complete scenario explain below.
>>
>> Help me understand the issue why Spark SQL JDBC is not giving result ?
>>
>> Below are version details.
>>
>> *Hive Version  : 1.2.1*
>> *Hadoop Version :  2.6.0*
>> *Spark version:  1.3.1*
>>
>> Let me know if need other details.
>>
>>
>> *Created Hive Table , insert some records and query it :*
>>
>> *beeline> !connect jdbc:hive2://myhost:1*
>> Enter username for jdbc:hive2://myhost:1: root
>> Enter password for jdbc:hive2://myhost:1: **
>> *beeline> create table tampTable(id int ,name string ) clustered by (id)
>> into 2 buckets stored as orc TBLPROPERTIES('transactional'='true');*
>> *beeline> insert into table tampTable values
>> (1,'row1'),(2,'row2'),(3,'row3');*
>> *beeline> select name from tampTable;*
>> name
>> -
>> row1
>> row3
>> row2
>>
>> *Query through SparkSQL HiveSQLContext :*
>>
>> SparkConf sparkConf = new SparkConf().setAppName("JavaSparkSQL");
>> SparkContext sc = new SparkContext(sparkConf);
>> HiveContext hiveContext = new HiveContext(sc);
>> DataFrame teenagers = hiveContext.sql("*SELECT name FROM tampTable*");
>> List<String> teenagerNames = teenagers.toJavaRDD().map(new Function<Row, String>() {
>>  @Override
>>  public String call(Row row) {
>>  return "Name: " + row.getString(0);
>>  }
>> }).collect();
>> for (String name: teenagerNames) {
>>  System.out.println(name);
>> }
>> teenagers2.toJavaRDD().saveAsTextFile("/tmp1");
>> sc.stop();
>>
>> which is working perfectly and giving all names from table *tempTable*
>>
>> *Query through Spark SQL JDBC :*
>>
>> *beeline> !connect jdbc:hive2://myhost:*
>> Enter username for jdbc:hive2://myhost:: root
>> Enter password for jdbc:hive2://myhost:: **
>> *beeline> show tables;*
>> *temptable*
>> *..other tables*
>> beeline> *SELECT name FROM tampTable;*
>>
>> I can list the table through "show tables", but I run the query , it is
>> either hanged or returns nothing.
>>
>>
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>
>


Having issue with Spark SQL JDBC on hive table !!!

2016-01-27 Thread @Sanjiv Singh
Hi All,

I have configured Spark SQL to query a Hive table.

I run the Thrift JDBC/ODBC server using the below command:

*cd $SPARK_HOME*
*./sbin/start-thriftserver.sh --master spark://myhost:7077 --hiveconf
hive.server2.thrift.bind.host=myhost --hiveconf
hive.server2.thrift.port=*

and I am also able to connect through beeline:

*beeline>* !connect jdbc:hive2://192.168.145.20:
Enter username for jdbc:hive2://192.168.145.20:: root
Enter password for jdbc:hive2://192.168.145.20:: impetus
*beeline > *

Queries on the Hive table return no results through Spark SQL JDBC, but they work
with the Spark HiveContext. See the complete scenario explained below.

Help me understand why Spark SQL JDBC is not returning results.

Below are version details.

*Hive Version  : 1.2.1*
*Hadoop Version :  2.6.0*
*Spark version:  1.3.1*

Let me know if you need other details.


*Created Hive Table , insert some records and query it :*

*beeline> !connect jdbc:hive2://myhost:1*
Enter username for jdbc:hive2://myhost:1: root
Enter password for jdbc:hive2://myhost:1: **
*beeline> create table tampTable(id int ,name string ) clustered by (id)
into 2 buckets stored as orc TBLPROPERTIES('transactional'='true');*
*beeline> insert into table tampTable values
(1,'row1'),(2,'row2'),(3,'row3');*
*beeline> select name from tampTable;*
name
-
row1
row3
row2

*Query through SparkSQL HiveSQLContext :*

SparkConf sparkConf = new SparkConf().setAppName("JavaSparkSQL");
SparkContext sc = new SparkContext(sparkConf);
HiveContext hiveContext = new HiveContext(sc);
DataFrame teenagers = hiveContext.sql("SELECT name FROM tampTable");
List<String> teenagerNames = teenagers.toJavaRDD().map(new Function<Row, String>() {
  @Override
  public String call(Row row) {
    return "Name: " + row.getString(0);
  }
}).collect();
for (String name : teenagerNames) {
  System.out.println(name);
}
teenagers.toJavaRDD().saveAsTextFile("/tmp1");
sc.stop();

which is working perfectly and giving all names from table *tempTable*

*Query through Spark SQL JDBC :*

*beeline> !connect jdbc:hive2://myhost:*
Enter username for jdbc:hive2://myhost:: root
Enter password for jdbc:hive2://myhost:: **
*beeline> show tables;*
*temptable*
*..other tables*
beeline> *SELECT name FROM tampTable;*

I can list the table through "show tables", but when I run the query, it either
hangs or returns nothing.



Regards
Sanjiv Singh
Mob :  +091 9990-447-339


Re: How to convert a non-rdd data to rdd.

2014-10-12 Thread @Sanjiv Singh
Hi Karthik,

Can you provide us more detail about the dataset "data" that you want to
parallelize with:

SparkContext.parallelize(data);
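
If the underlying goal is simply to persist a small non-RDD String from
driver/scheduler code to HDFS, going through an RDD is not required at all. A
minimal sketch using the Hadoop FileSystem API (the path and the string value are
illustrative assumptions):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val data = "some scheduler state"                        // the non-RDD value to persist
val conf = new Configuration()                           // picks up core-site.xml / hdfs-site.xml from the classpath
val fs = FileSystem.get(conf)
val out = fs.create(new Path("/tmp/scheduler-data.txt")) // illustrative HDFS path
try {
  out.write(data.getBytes("UTF-8"))
} finally {
  out.close()
}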




Regards,
Sanjiv Singh


Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Sun, Oct 12, 2014 at 11:45 AM, rapelly kartheek 
wrote:

> Hi,
>
> I am trying to write a String that is not an rdd to HDFS. This data is a
> variable in Spark Scheduler code. None of the spark File operations are
> working because my data is not rdd.
>
> So, I tried using SparkContext.parallelize(data). But it throws error:
>
> [error]
> /home/karthik/spark-1.0.0/core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala:265:
> not found: value SparkContext
> [error]  SparkContext.parallelize(result)
> [error]  ^
> [error] one error found
>
> I realized that this data is part of the Scheduler. So, the Sparkcontext
> would not have got created yet.
>
> Any help in "writing scheduler variable data to HDFS" is appreciated!!
>
> -Karthik
>