Re: Spark SQL Parallelism - While reading from Oracle
You can set all of the JDBC options (driver, partitionColumn, lowerBound, upperBound, numPartitions). Start by querying the maximum id through the driver; that gives you the value for the upperBound parameter. Choose numPartitions based on the table's size and the capacity of the system you are using. With the snippet below you read a database table into a DataFrame with Spark:

df = sqlContext.read.format('jdbc').options(
    url="jdbc:mysql://ip-address:3306/sometable?user=username&password=password",
    dbtable="sometable",
    driver="com.mysql.jdbc.Driver",
    partitionColumn="id",
    lowerBound=1,
    upperBound=maxId,
    numPartitions=100
).load()

Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Wed, Aug 10, 2016 at 6:35 AM, Siva A wrote:
> Hi Team,
>
> How do we increase the parallelism in Spark SQL?
> In Spark Core, we can repartition or pass extra arguments as part of the
> transformation.
>
> I am trying the below example:
>
> val df1 = sqlContext.read.format("jdbc").options(Map(...)).load
> val df2 = df1.cache
> df2.count
>
> Here the count operation uses only one task; I couldn't increase the
> parallelism.
> Thanks in advance.
>
> Thanks
> Siva
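The way partitionColumn / lowerBound / upperBound / numPartitions drive parallelism is that Spark turns the numeric range into one WHERE predicate per partition, so each task issues its own slice of the query. A rough Python sketch of that range-splitting idea (an illustration only, not Spark's actual implementation):

```python
def column_partition(col, lower, upper, num_partitions):
    """Sketch of how a JDBC source can split a numeric column range into
    per-partition WHERE predicates (illustrative; not Spark's real code).
    The first partition is open below and the last open above, so rows
    outside [lower, upper) are still covered."""
    stride = (upper - lower) // num_partitions
    preds = []
    current = lower
    for i in range(num_partitions):
        # No lower bound on the first slice, no upper bound on the last.
        lo = f"{col} >= {current}" if i > 0 else None
        current += stride
        hi = f"{col} < {current}" if i < num_partitions - 1 else None
        preds.append(" AND ".join(p for p in (lo, hi) if p))
    return preds

# E.g. ids 1..100 over 4 partitions -> 4 independent slice queries.
print(column_partition("id", 1, 101, 4))
```

Each predicate becomes a separate JDBC read, which is why the count in the quoted question runs with numPartitions tasks instead of one once these options are set.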
Re: Spark SQL is not returning records for HIVE transactional tables on HDP
Hi All,

We are using, for Spark SQL:

- Hive : 1.2.1
- Spark : 1.3.1
- Hadoop : 2.7.1

Let me know if you need other details to debug the issue.

Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Sun, Mar 13, 2016 at 1:07 AM, Mich Talebzadeh wrote:
> Hi,
>
> Thanks for the input. I use Hive 2 and still have this issue.
>
> 1. Hive version 2
> 2. Hive on Spark engine 1.3.1
> 3. Spark 1.5.2
>
> I have added the Hive user group to this as well, so hopefully we may get
> some resolution.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 12 March 2016 at 19:25, Timur Shenkao wrote:
>
>> Hi,
>>
>> I have suffered from Hive Streaming and transactions enough, so I can share
>> my experience with you.
>>
>> 1) It's not a problem of Spark. It happens because of "peculiarities" /
>> bugs of Hive Streaming. Hive Streaming and transactions are very raw
>> technologies. If you look at Hive JIRA, you'll see several critical bugs
>> concerning Hive Streaming and transactions. Some of them are resolved in Hive
>> 2+ only, but Cloudera & Hortonworks ship their distributions with outdated
>> & buggy Hive. So use Hive 2+; earlier versions of Hive didn't run compaction
>> at all.
>>
>> 2) In Hive 1.1, I issued the following lines:
>> ALTER TABLE default.foo COMPACT 'MAJOR';
>> SHOW COMPACTIONS;
>>
>> My manual compaction was shown, but it was never fulfilled.
>>
>> 3) If you use Hive Streaming, it's not recommended, or even forbidden, to
>> insert rows into Hive Streaming tables manually. Only the process that
>> writes to such a table should insert incoming rows, sequentially. Otherwise
>> you'll get unpredictable behaviour.
>>
>> 4) Ordinary Hive tables are catalogs with text, ORC, etc. files.
>> Hive Streaming / transactional tables are catalogs that have numerous
>> subcatalogs with a "delta" prefix. Moreover, there are files with a
>> "flush_length" suffix in some delta subfolders; "flush_length" files are 8
>> bytes long. The presence of a "flush_length" file in a subfolder means
>> that Hive is writing updates to that subfolder right now. When Hive fails or
>> is restarted, it begins to write into a new delta subfolder with a new
>> "flush_length" file, and the old "flush_length" file (the one used before the
>> failure) still remains.
>> One of the goals of compaction is to delete outdated "flush_length" files.
>> Not every application / library can read such a folder structure or knows the
>> details of the Hive Streaming / transactions implementation. Most
>> software solutions still expect ordinary Hive tables as input.
>> When they encounter subcatalogs or the special "flush_length" files,
>> applications / libraries either "see nothing" (return 0 or an empty result
>> set) or stumble over the "flush_length" files (return unexplainable errors).
>>
>> For instance, Facebook Presto couldn't read subfolders by default unless
>> you activated special parameters, and it stumbles over "flush_length" files,
>> as Presto expects legal ORC files, not 8-byte text files, in the folders.
>>
>> So, I don't advise you to use Hive Streaming or transactions right now in
>> real production systems (24/7/365) with hundreds of millions of events a
>> day.
>>
>> On Sat, Mar 12, 2016 at 11:24 AM, @Sanjiv Singh wrote:
>>
>>> Hi All,
>>>
>>> I am facing this issue on an HDP setup on which COMPACTION is required,
>>> but only once, for transactional tables to fetch records with Spark SQL.
>>> On the other hand, the Apache setup doesn't require compaction even once.
>>>
>>> Maybe something gets triggered on the metastore after compaction, and
>>> Spark SQL starts recognizing delta files.
>>>
>>> Let me know if you need other details to get to the root cause.
>>>
>>> Try this.
>>>
>>> *See the complete scenario:*
>>>
>>> hive> create table default.foo(id int) clustered by (id) into 2 buckets
>>> STORED AS ORC TBLPROPERTIES ('transactional'='true');
>>> hive> insert into default.foo values(10);
>>>
>>> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
>>> because the data is still in delta files
>>>
>>> Now run major compaction:
>>>
>>> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>>>
>>> scala> sqlContext.table("default.foo").count // Gives 1
>>>
>>> hive> insert into foo values(20);
>>>
>>> scala> sqlContext.table("default.foo").count // Gives 2, no further
>>> compaction required.
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob : +091 9990-447-339
Spark SQL is not returning records for HIVE transactional tables on HDP
Hi All,

I am facing this issue on an HDP setup on which COMPACTION is required, but only once, for transactional tables to fetch records with Spark SQL. On the other hand, the Apache setup doesn't require compaction even once.

Maybe something gets triggered on the metastore after compaction, and Spark SQL starts recognizing delta files.

Let me know if you need other details to get to the root cause.

Try this. *See the complete scenario:*

hive> create table default.foo(id int) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> insert into default.foo values(10);

scala> sqlContext.table("default.foo").count // Gives 0, which is wrong because the data is still in delta files

Now run major compaction:

hive> ALTER TABLE default.foo COMPACT 'MAJOR';

scala> sqlContext.table("default.foo").count // Gives 1

hive> insert into foo values(20);

scala> sqlContext.table("default.foo").count // Gives 2, no further compaction required.

Regards
Sanjiv Singh
Mob : +091 9990-447-339
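The behaviour in the scenario above can be pictured with a small sketch (hypothetical directory names and row counts, mirroring the layout shown later in this thread): a reader that only understands plain ORC files sees just the base_* directory, while an ACID-aware reader also merges the delta_* directories that hold rows inserted since the last major compaction.

```python
# Hypothetical ACID table layout: directory name -> number of live rows in it.
layout = {
    "base_087": 43,      # rows folded into the base by the last major compaction
    "delta_088_088": 1,  # row from an insert made after that compaction
}

def naive_count(dirs):
    """Reader that treats the table as plain ORC files: only base_* is visible."""
    return sum(rows for d, rows in dirs.items() if d.startswith("base_"))

def acid_count(dirs):
    """ACID-aware reader: merges base_* and delta_* directories."""
    return sum(rows for d, rows in dirs.items()
               if d.startswith("base_") or d.startswith("delta_"))
```

Before the first compaction there is no base_* directory at all, so a base-only reader returns 0, matching the count of 0 reported from Spark SQL in the scenario; after a major compaction the deltas are rewritten into a base directory and become visible.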
Re: Spark SQL is not returning records for hive bucketed tables on HDP
Yes, it is very strange, and also quite contrary to my understanding of Spark SQL on Hive tables.

I am facing this issue on an HDP setup on which COMPACTION is required, but only once. On the other hand, the Apache setup doesn't require compaction even once.

Maybe something gets triggered on the metastore after compaction, and Spark SQL starts recognizing delta files.

Let me know if you need other details to get to the root cause.

Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Tue, Feb 23, 2016 at 2:28 PM, Varadharajan Mukundan wrote:
> That's interesting. I'm not sure why the first compaction is needed but not
> on the subsequent inserts. Maybe it's just to create some metadata. Thanks
> for clarifying this :)
>
> On Tue, Feb 23, 2016 at 2:15 PM, @Sanjiv Singh wrote:
>
>> Try this:
>>
>> hive> create table default.foo(id int) clustered by (id) into 2 buckets
>> STORED AS ORC TBLPROPERTIES ('transactional'='true');
>> hive> insert into default.foo values(10);
>>
>> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
>> because the data is still in delta files
>>
>> Now run major compaction:
>>
>> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>>
>> scala> sqlContext.table("default.foo").count // Gives 1
>>
>> hive> insert into foo values(20);
>>
>> scala> sqlContext.table("default.foo").count // Gives 2, no further
>> compaction required.
>>
>> Regards
>> Sanjiv Singh
>> Mob : +091 9990-447-339
>>
>> On Tue, Feb 23, 2016 at 2:02 PM, Varadharajan Mukundan <
>> srinath...@gmail.com> wrote:
>>
>>> This is the scenario I'm mentioning. I'm not using Spark JDBC; not sure
>>> if it's different.
>>>
>>> Please walk through the below commands in the same order to understand
>>> the sequence.
>>> >>> hive> create table default.foo(id int) clustered by (id) into 2 buckets >>> STORED AS ORC TBLPROPERTIES ('transactional'='true'); >>> hive> insert into foo values(10); >>> >>> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong >>> because data is still in delta files >>> >>> Now run major compaction: >>> >>> hive> ALTER TABLE default.foo COMPACT 'MAJOR'; >>> >>> scala> sqlContext.table("default.foo").count // Gives 1 >>> >>> >>> On Tue, Feb 23, 2016 at 12:35 PM, @Sanjiv Singh >>> wrote: >>> >>>> Hi Varadharajan, >>>> >>>> >>>> That is the point, Spark SQL is able to recognize delta files. See >>>> below directory structure, ONE BASE (43 records) and one DELTA (created >>>> after last insert). And I am able see last insert through Spark SQL. >>>> >>>> >>>> *See below complete scenario :* >>>> >>>> *Steps:* >>>> >>>>- Inserted 43 records in table. >>>>- Run major compaction on table. >>>>- *alter table mytable COMPACT 'major';* >>>>- Disabled auto compaction on table. >>>>- *alter table mytable set >>>> TBLPROPERTIES("NO_AUTO_COMPACTION"="true");* >>>>- Inserted 1 record in table. >>>> >>>> >>>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable* >>>> drwxrwxrwx - root hdfs 0 2016-02-23 11:43 >>>> /apps/hive/warehouse/mydb.db/mytable/base_087 >>>> drwxr-xr-x - root hdfs 0 2016-02-23 12:02 >>>> /apps/hive/warehouse/mydb.db/mytable/delta_088_088 >>>> >>>> *SPARK JDBC :* >>>> >>>> 0: jdbc:hive2://myhost:> select count(*) from mytable ; >>>> +--+ >>>> | _c0 | >>>> +--+ >>>> | 44 | >>>> +--+ >>>> 1 row selected (1.196 seconds) >>>> >>>> *HIVE JDBC :* >>>> >>>> 1: jdbc:hive2://myhost:1> select count(*) from mytable ; >>>> +--+--+ >>>> | _c0 | >>>> +--+--+ >>>> | 44 | >>>> +--+--+ >>>> 1 row selected (0.121 seconds) >>>> >>>> >>>> Regards >>>> Sanjiv Singh >>>> Mob : +091 9990-447-339 >>>> >>>> On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan < >&g
Re: Spark SQL is not returning records for hive bucketed tables on HDP
Try this:

hive> create table default.foo(id int) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> insert into default.foo values(10);

scala> sqlContext.table("default.foo").count // Gives 0, which is wrong because the data is still in delta files

Now run major compaction:

hive> ALTER TABLE default.foo COMPACT 'MAJOR';

scala> sqlContext.table("default.foo").count // Gives 1

hive> insert into foo values(20);

scala> sqlContext.table("default.foo").count // Gives 2, no further compaction required.

Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Tue, Feb 23, 2016 at 2:02 PM, Varadharajan Mukundan wrote:
> This is the scenario I'm mentioning. I'm not using Spark JDBC; not sure
> if it's different.
>
> Please walk through the below commands in the same order to understand the
> sequence.
>
> hive> create table default.foo(id int) clustered by (id) into 2 buckets
> STORED AS ORC TBLPROPERTIES ('transactional'='true');
> hive> insert into foo values(10);
>
> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
> because the data is still in delta files
>
> Now run major compaction:
>
> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>
> scala> sqlContext.table("default.foo").count // Gives 1
>
> On Tue, Feb 23, 2016 at 12:35 PM, @Sanjiv Singh wrote:
>
>> Hi Varadharajan,
>>
>> That is the point: Spark SQL is able to recognize delta files. See the
>> directory structure below, ONE BASE (43 records) and one DELTA (created
>> after the last insert). And I am able to see the last insert through
>> Spark SQL.
>>
>> *See the complete scenario below:*
>>
>> *Steps:*
>>
>> - Inserted 43 records in the table.
>> - Ran major compaction on the table:
>>   *alter table mytable COMPACT 'major';*
>> - Disabled auto compaction on the table:
>>   *alter table mytable set TBLPROPERTIES("NO_AUTO_COMPACTION"="true");*
>> - Inserted 1 record in the table.
>>
>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>> drwxrwxrwx - root hdfs 0 2016-02-23 11:43 /apps/hive/warehouse/mydb.db/mytable/base_087
>> drwxr-xr-x - root hdfs 0 2016-02-23 12:02 /apps/hive/warehouse/mydb.db/mytable/delta_088_088
>>
>> *SPARK JDBC:*
>>
>> 0: jdbc:hive2://myhost:> select count(*) from mytable ;
>> +--+
>> | _c0 |
>> +--+
>> | 44 |
>> +--+
>> 1 row selected (1.196 seconds)
>>
>> *HIVE JDBC:*
>>
>> 1: jdbc:hive2://myhost:1> select count(*) from mytable ;
>> +--+--+
>> | _c0 |
>> +--+--+
>> | 44 |
>> +--+--+
>> 1 row selected (0.121 seconds)
>>
>> Regards
>> Sanjiv Singh
>> Mob : +091 9990-447-339
>>
>> On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <
>> srinath...@gmail.com> wrote:
>>
>>> Hi Sanjiv,
>>>
>>> Yes, if we make use of Hive JDBC we should be able to retrieve all the
>>> rows, since it is Hive which processes the query. But I think the problem
>>> with Hive JDBC is that there are two layers of processing, Hive and then
>>> Spark with the result set. Another is that performance is limited to
>>> that single HiveServer2 node and its network.
>>>
>>> But if we make use of the sqlContext.table function in Spark to access
>>> Hive tables, it is supposed to read files directly from HDFS, skipping the
>>> Hive layer. But it doesn't read delta files and just reads the contents
>>> from the base folder. Only after a major compaction are the delta files
>>> merged with the base folder and visible to Spark SQL.
>>>
>>> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh wrote:
>>>
>>>> Hi Varadharajan,
>>>>
>>>> Can you elaborate on (you quoted in a previous mail):
>>>> "I observed that the Hive transactional storage structure does not work
>>>> with Spark yet"
>>>>
>>>> If it is related to the delta files created after each transaction, and
>>>> Spark not being able to recognize them: I have a table *mytable* (ORC,
>>>> BUCKETED, NON-SORTED), which has already had lots of inserts, updates and
>>>> deletes.
I can see delta files created in HDFS (see below), and am still able to
>>>> fetch consistent records through Spa
Re: Spark SQL is not returning records for hive bucketed tables on HDP
Hi Varadharajan,

That is the point: Spark SQL is able to recognize delta files. See the directory structure below, ONE BASE (43 records) and one DELTA (created after the last insert). And I am able to see the last insert through Spark SQL.

*See the complete scenario below:*

*Steps:*

- Inserted 43 records in the table.
- Ran major compaction on the table:
  *alter table mytable COMPACT 'major';*
- Disabled auto compaction on the table:
  *alter table mytable set TBLPROPERTIES("NO_AUTO_COMPACTION"="true");*
- Inserted 1 record in the table.

> *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
drwxrwxrwx - root hdfs 0 2016-02-23 11:43 /apps/hive/warehouse/mydb.db/mytable/base_087
drwxr-xr-x - root hdfs 0 2016-02-23 12:02 /apps/hive/warehouse/mydb.db/mytable/delta_088_088

*SPARK JDBC:*

0: jdbc:hive2://myhost:> select count(*) from mytable ;
+--+
| _c0 |
+--+
| 44 |
+--+
1 row selected (1.196 seconds)

*HIVE JDBC:*

1: jdbc:hive2://myhost:1> select count(*) from mytable ;
+--+--+
| _c0 |
+--+--+
| 44 |
+--+--+
1 row selected (0.121 seconds)

Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <
srinath...@gmail.com> wrote:
> Hi Sanjiv,
>
> Yes, if we make use of Hive JDBC we should be able to retrieve all the
> rows, since it is Hive which processes the query. But I think the problem
> with Hive JDBC is that there are two layers of processing, Hive and then
> Spark with the result set. Another is that performance is limited to that
> single HiveServer2 node and its network.
>
> But if we make use of the sqlContext.table function in Spark to access
> Hive tables, it is supposed to read files directly from HDFS, skipping the
> Hive layer. But it doesn't read delta files and just reads the contents from
> the base folder.
Only after a major compaction are the delta files merged
> with the base folder and visible to Spark SQL.
>
> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh wrote:
>
>> Hi Varadharajan,
>>
>> Can you elaborate on (you quoted in a previous mail):
>> "I observed that the Hive transactional storage structure does not work
>> with Spark yet"
>>
>> If it is related to the delta files created after each transaction, and
>> Spark not being able to recognize them: I have a table *mytable* (ORC,
>> BUCKETED, NON-SORTED), which has already had lots of inserts, updates and
>> deletes. I can see delta files created in HDFS (see below), and am still
>> able to fetch consistent records through Spark JDBC and HIVE JDBC.
>>
>> No compaction has been triggered for that table.
>>
>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>>
>> drwxrwxrwx - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/base_060
>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_061_061
>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_062_062
>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_063_063
>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_064_064
>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_065_065
>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_066_066
>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_067_067
>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_068_068
>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_069_069
>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_070_070
>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_071_071
>>
drwxr-xr-x - root hdfs 0 2016-02-23 11:38 >> /apps/hive/warehouse/mydb.db/mytable/delta_072_072 >> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 >> /apps/hive/warehouse/mydb.db/mytable/delta_073_073 >> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 >> /apps/hive/warehouse/mydb.db/mytable/delta_074_074 >> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 >> /apps/hive/warehouse/mydb.db/mytable/delta_075_075 >> drwxr-xr-x - root hdfs
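The base_NNN / delta_MIN_MAX directory names quoted above encode transaction id ranges: base_NNN holds compacted rows up to transaction NNN, and delta_MIN_MAX holds uncompacted rows for transactions MIN through MAX. A small sketch of classifying such names (illustrative only, derived from the naming pattern shown in the listings, not Hive's actual reader code):

```python
import re

def classify(dirname):
    """Classify an ACID table directory by its name.
    base_N    -> rows up to transaction id N are compacted into the base.
    delta_A_B -> uncompacted rows for transaction ids A through B.
    Anything else is treated as a plain (non-ACID) file or directory."""
    m = re.fullmatch(r"base_(\d+)", dirname)
    if m:
        n = int(m.group(1))
        return ("base", n, n)
    m = re.fullmatch(r"delta_(\d+)_(\d+)", dirname)
    if m:
        return ("delta", int(m.group(1)), int(m.group(2)))
    return ("plain", None, None)

# E.g. the layout quoted above, after compaction plus one more insert:
for d in ["base_087", "delta_088_088"]:
    print(d, "->", classify(d))
```

A reader that only understands the "plain" case explains the symptom in this thread: it never looks inside the delta_* directories, so rows written after the last compaction are invisible to it.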
Re: Spark SQL is not returning records for hive bucketed tables on HDP
Hi Varadharajan,

Can you elaborate on (you quoted in a previous mail):
"I observed that the Hive transactional storage structure does not work with Spark yet"

If it is related to the delta files created after each transaction, and Spark not being able to recognize them: I have a table *mytable* (ORC, BUCKETED, NON-SORTED), which has already had lots of inserts, updates and deletes. I can see delta files created in HDFS (see below), and am still able to fetch consistent records through Spark JDBC and HIVE JDBC.

No compaction has been triggered for that table.

> *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*

drwxrwxrwx - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/base_060
drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_061_061
drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_062_062
drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_063_063
drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_064_064
drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_065_065
drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_066_066
drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_067_067
drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_068_068
drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_069_069
drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_070_070
drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_071_071
drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_072_072
drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_073_073
drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_074_074
drwxr-xr-x - root
hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_075_075 drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_076_076 drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_077_077 drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_078_078 drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_079_079 drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_080_080 drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_081_081 drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_082_082 drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_083_083 drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_084_084 drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_085_085 drwxr-xr-x - root hdfs 0 2016-02-23 11:40 /apps/hive/warehouse/mydb.db/mytable/delta_086_086 drwxr-xr-x - root hdfs 0 2016-02-23 11:41 /apps/hive/warehouse/mydb.db/mytable/delta_087_087 Regards Sanjiv Singh Mob : +091 9990-447-339 On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan wrote: > Actually the auto compaction if enabled is triggered based on the volume > of changes. It doesn't automatically run after every insert. I think its > possible to reduce the thresholds but that might reduce performance by a > big margin. As of now, we do compaction after the batch insert completes. > > The only other way to solve this problem as of now is to use Hive JDBC API. > > On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh > wrote: > >> Compaction would have been triggered automatically as following >> properties already set in *hive-site.xml*. and also *NO_AUTO_COMPACTION* >> property >> not been set for these tables. 
>>
>> hive.compactor.initiator.on
>> true
>>
>> hive.compactor.worker.threads
>> 1
>>
>> The documentation is misleading sometimes.
>>
>> Regards
>> Sanjiv Singh
>> Mob : +091 9990-447-339
>>
>> On Mo
Re: Spark SQL is not returning records for hive bucketed tables on HDP
Compaction would have been triggered automatically, as the following properties are already set in *hive-site.xml*, and the *NO_AUTO_COMPACTION* property has not been set for these tables:

hive.compactor.initiator.on
true

hive.compactor.worker.threads
1

The documentation is misleading sometimes.

Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan wrote:
> Yes, I was burned by this issue a couple of weeks back. This also means
> that after every insert job, a compaction should be run to access the new
> rows from Spark. Sad that this issue is not documented / mentioned anywhere.
>
> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh wrote:
>
>> Hi Varadharajan,
>>
>> Thanks for your response.
>>
>> Yes, it is a transactional table; see the *show create table* output below.
>>
>> The table has hardly 3 records, and after triggering compaction on the
>> table, it starts showing results in Spark SQL.
>>
>> > *ALTER TABLE hivespark COMPACT 'major';*
>>
>> > *show create table hivespark;*
>>
>> CREATE TABLE `hivespark`(
>>   `id` int,
>>   `name` string)
>> CLUSTERED BY (
>>   id)
>> INTO 32 BUCKETS
>> ROW FORMAT SERDE
>>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>> STORED AS INPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>> OUTPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>> LOCATION
>>   'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>> TBLPROPERTIES (
>>   'COLUMN_STATS_ACCURATE'='true',
>>   'last_modified_by'='root',
>>   'last_modified_time'='1455859079',
>>   'numFiles'='37',
>>   'numRows'='3',
>>   'rawDataSize'='0',
>>   'totalSize'='11383',
>>   'transactional'='true',
>>   'transient_lastDdlTime'='1455864121') ;
>>
>> Regards
>> Sanjiv Singh
>> Mob : +091 9990-447-339
>>
>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <
>> srinath...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Is the transaction attribute set on your table?
I observed that the Hive
>>> transactional storage structure does not work with Spark yet. You can
>>> confirm this by looking at the transactional attribute in the output of
>>> "desc extended " in the Hive console.
>>>
>>> If you need to access a transactional table, consider doing a major
>>> compaction and then try accessing the table.
>>>
>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh wrote:
>>>
>>>> Hi,
>>>>
>>>> I have observed that Spark SQL is not returning records for Hive
>>>> bucketed ORC tables on HDP.
>>>>
>>>> In Spark SQL, I am able to list all tables, but queries on Hive
>>>> bucketed tables are not returning records.
>>>>
>>>> I have also tried the same for non-bucketed Hive tables; they work
>>>> fine.
>>>>
>>>> The same works on a plain Apache setup.
>>>>
>>>> Let me know if you need other details.
>>>>
>>>> Regards
>>>> Sanjiv Singh
>>>> Mob : +091 9990-447-339
>>>
>>> --
>>> Thanks,
>>> M. Varadharajan
>>>
>>> "Experience is what you get when you didn't get what you wanted"
>>>    - By Prof. Randy Pausch in "The Last Lecture"
>>>
>>> My Journal :- http://varadharajan.in
>
> --
> Thanks,
> M. Varadharajan
>
> "Experience is what you get when you didn't get what you wanted"
>    - By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in
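For reference, the two compaction settings quoted earlier in this thread (hive.compactor.initiator.on and hive.compactor.worker.threads) would appear in hive-site.xml in the standard property syntax, roughly as below (a sketch using the values from the thread):

```xml
<!-- Enable the compaction initiator thread on the metastore. -->
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>

<!-- Number of worker threads that actually run compactions. -->
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
```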
Re: Spark SQL is not returning records for hive bucketed tables on HDP
Hi Varadharajan,

Thanks for your response.

Yes, it is a transactional table; see the *show create table* output below.

The table has hardly 3 records, and after triggering compaction on the table, it starts showing results in Spark SQL.

> *ALTER TABLE hivespark COMPACT 'major';*

> *show create table hivespark;*

CREATE TABLE `hivespark`(
  `id` int,
  `name` string)
CLUSTERED BY (
  id)
INTO 32 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'last_modified_by'='root',
  'last_modified_time'='1455859079',
  'numFiles'='37',
  'numRows'='3',
  'rawDataSize'='0',
  'totalSize'='11383',
  'transactional'='true',
  'transient_lastDdlTime'='1455864121') ;

Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan wrote:
> Hi,
>
> Is the transaction attribute set on your table? I observed that the Hive
> transactional storage structure does not work with Spark yet. You can
> confirm this by looking at the transactional attribute in the output of
> "desc extended " in the Hive console.
>
> If you need to access a transactional table, consider doing a major
> compaction and then try accessing the table.
>
> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh wrote:
>
>> Hi,
>>
>> I have observed that Spark SQL is not returning records for Hive
>> bucketed ORC tables on HDP.
>>
>> In Spark SQL, I am able to list all tables, but queries on Hive
>> bucketed tables are not returning records.
>>
>> I have also tried the same for non-bucketed Hive tables; they work
>> fine.
>>
>> The same works on a plain Apache setup.
>>
>> Let me know if you need other details.
>>
>> Regards
>> Sanjiv Singh
>> Mob : +091 9990-447-339
>
> --
> Thanks,
> M.
Varadharajan > > > > "Experience is what you get when you didn't get what you wanted" >-By Prof. Randy Pausch in "The Last Lecture" > > My Journal :- http://varadharajan.in >
Spark SQL is not returning records for hive bucketed tables on HDP
Hi,

I have observed that Spark SQL is not returning records for Hive bucketed ORC tables on HDP.

In Spark SQL, I am able to list all tables, but queries on Hive bucketed tables are not returning records.

I have also tried the same for non-bucketed Hive tables; they work fine.

The same works on a plain Apache setup.

Let me know if you need other details.

Regards
Sanjiv Singh
Mob : +091 9990-447-339
Re: Having issue with Spark SQL JDBC on hive table !!!
It is working now. I checked the Spark worker UI; executor startup was failing with the error below, JVM initialization failing because of a wrong -Xms:

Invalid initial heap size: -Xms0M
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

The Thrift server was not picking up the executor memory from *spark-env.sh*, so I added it explicitly in the Thrift server startup script.

*./sbin/start-thriftserver.sh*

exec "$FWDIR"/sbin/spark-daemon.sh spark-submit $CLASS 1 --executor-memory 512M "$@"

With this, executors start with valid memory and JDBC queries return results.

*conf/spark-env.sh* (executor memory configuration not picked up by the Thrift server):

export SPARK_JAVA_OPTS="-Dspark.executor.memory=512M"
export SPARK_EXECUTOR_MEMORY=512M

Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Thu, Jan 28, 2016 at 10:57 PM, @Sanjiv Singh wrote:
> Adding to it.
>
> Job status at the UI:
>
> Stage Id | Description | Submitted | Duration | Tasks: Succeeded/Total
> 1 | select ename from employeetest (kill
> <http://impetus-d951centos:4040/stages/stage/kill?id=1&terminate=true>)
> collect at SparkPlan.scala:84
> <http://impetus-d951centos:4040/stages/stage?id=1&attempt=0> +details
> | 2016/01/29 04:20:06 | 3.0 min | 0/2
>
> Getting the below trace on the Spark UI:
>
> org.apache.spark.rdd.RDD.collect(RDD.scala:813)
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
> org.apache.spark.sql.DataFrame.collect(DataFrame.scala:887)
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:178)
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
> org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
> org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
>
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
> org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
> org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:744)
>
> Regards
> Sanjiv Singh
> Mob : +091 9990-447-339
>
> On Thu, Jan 28, 2016 at 9:57 PM, @Sanjiv Singh wrote:
>
>> Any help on this?
>>
>> Regards
>> Sanjiv Singh
>> Mob : +091 9990-447-339
>>
>> On Wed, Jan 27, 2016 at 10:25 PM, @Sanjiv Singh wrote:
>>
>>> Hi Ted,
>>> It's a typo.
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob : +091 9990-447-339
>>>
>>> On Wed, Jan 27, 2016 at 9:13 PM, Ted Yu wrote:
>>>
>>>> In the last snippet, temptable is shown by the 'show tables' command,
>>>> yet you queried tampTable.
>>>>
>>>> I believe this was just a typo :-)
>>>>
>>>> On Wed, Jan 27, 2016 at 7:07 AM, @Sanjiv Singh wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I have configured Spark to query a Hive table.
>>>>> >>>>> Run the Thrift JDBC/ODBC server using below command : >>>>> >>>>> *cd $SPARK_HOME* >>>>> *./sbin/start-thriftserver.sh --master spark://myhost:7077 --hiveconf >>>>> hive.server2.thrift.bind.host=myhost --hiveconf >>>>> hive.server2.thrift.port=* >>>>> >>>>> and also able to connect through beeline >>>>> >>>>> *beeline>* !connect jdbc:hive2://192.168.145.20: >>>>> Enter username for jdbc:hive2://192.168.145.20:: root >>>>> Enter password for jdbc:hive2://192.168.145.20:: impetus >>>>> *beeline > * >>>>> >>>>> It is not giving query result on hive table through Spark JDBC, but it >>>>> is working with spark HiveSQLContext. See complete scenario explain below. >>>>> >>>>> Help me understand the issue why Spark SQL JDBC is n
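For readers hitting the same -Xms0M failure: the fix above edits the daemon script directly. A less invasive sketch uses Spark's standard configuration mechanisms instead (the 512M value is taken from this thread; whether start-thriftserver.sh forwards spark-submit options depends on the Spark version, which is presumably why the script edit was needed here):

```shell
# Alternative 1: pass the executor memory through to spark-submit at startup
# (works where start-thriftserver.sh forwards its arguments to spark-submit)
./sbin/start-thriftserver.sh --master spark://myhost:7077 --executor-memory 512M

# Alternative 2: set it once in conf/spark-defaults.conf, which spark-submit reads
echo "spark.executor.memory  512m" >> conf/spark-defaults.conf
```

Either route avoids patching spark-daemon.sh, so the change survives a Spark upgrade.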
Re: Having issue with Spark SQL JDBC on hive table !!!
Adding to it.

Job status at UI :

Stage Id | Description | Submitted | Duration | Tasks: Succeeded/Total | Input | Output | Shuffle Read | Shuffle Write
1 | select ename from employeetest - collect at SparkPlan.scala:84 | 2016/01/29 04:20:06 | 3.0 min | 0/2 | | | |

Getting below exception on Spark UI :

org.apache.spark.rdd.RDD.collect(RDD.scala:813)
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
org.apache.spark.sql.DataFrame.collect(DataFrame.scala:887)
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:178)
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)

Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Thu, Jan 28, 2016 at 9:57 PM, @Sanjiv Singh wrote:
> Any help on this.
>
> Regards
> Sanjiv Singh
> Mob : +091 9990-447-339
Re: Having issue with Spark SQL JDBC on hive table !!!
Any help on this.

Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Wed, Jan 27, 2016 at 10:25 PM, @Sanjiv Singh wrote:
> Hi Ted ,
> It's a typo.
>
> Regards
> Sanjiv Singh
> Mob : +091 9990-447-339
Re: Having issue with Spark SQL JDBC on hive table !!!
Hi Ted ,
It's a typo.

Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Wed, Jan 27, 2016 at 9:13 PM, Ted Yu wrote:
> In the last snippet, temptable is shown by the 'show tables' command,
> yet you queried tampTable.
>
> I believe this was just a typo :-)
Having issue with Spark SQL JDBC on hive table !!!
Hi All,

I have configured Spark to query a Hive table.

I run the Thrift JDBC/ODBC server using the command below:

*cd $SPARK_HOME*
*./sbin/start-thriftserver.sh --master spark://myhost:7077 --hiveconf hive.server2.thrift.bind.host=myhost --hiveconf hive.server2.thrift.port=*

and I am also able to connect through beeline:

*beeline>* !connect jdbc:hive2://192.168.145.20:
Enter username for jdbc:hive2://192.168.145.20:: root
Enter password for jdbc:hive2://192.168.145.20:: impetus
*beeline > *

Queries on the Hive table return no results through Spark JDBC, but they work with Spark's HiveContext. The complete scenario is explained below.

Help me understand why Spark SQL over JDBC is not returning results.

Below are the version details.

*Hive Version : 1.2.1*
*Hadoop Version : 2.6.0*
*Spark version : 1.3.1*

Let me know if you need other details.

*Created a Hive table, inserted some records, and queried it:*

*beeline> !connect jdbc:hive2://myhost:1*
Enter username for jdbc:hive2://myhost:1: root
Enter password for jdbc:hive2://myhost:1: **
*beeline> create table tampTable(id int, name string) clustered by (id) into 2 buckets stored as orc TBLPROPERTIES('transactional'='true');*
*beeline> insert into table tampTable values (1,'row1'),(2,'row2'),(3,'row3');*
*beeline> select name from tampTable;*
name
-
row1
row3
row2

*Query through the Spark SQL HiveContext:*

SparkConf sparkConf = new SparkConf().setAppName("JavaSparkSQL");
SparkContext sc = new SparkContext(sparkConf);
HiveContext hiveContext = new HiveContext(sc);
DataFrame teenagers = hiveContext.sql("*SELECT name FROM tampTable*");
List<String> teenagerNames = teenagers.toJavaRDD().map(new Function<Row, String>() {
  @Override
  public String call(Row row) {
    return "Name: " + row.getString(0);
  }
}).collect();
for (String name : teenagerNames) {
  System.out.println(name);
}
teenagers.toJavaRDD().saveAsTextFile("/tmp1");
sc.stop();

which works perfectly and gives all names from table *tempTable*.

*Query through Spark SQL JDBC:*

*beeline> !connect jdbc:hive2://myhost:*
Enter username for jdbc:hive2://myhost:: root
Enter password for jdbc:hive2://myhost:: **
*beeline> show tables;*
*temptable*
*..other tables*
beeline> *SELECT name FROM tampTable;*

I can list the table through "show tables", but when I run the query, it either hangs or returns nothing.

Regards
Sanjiv Singh
Mob : +091 9990-447-339
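A diagnostic worth trying (a sketch, not a confirmed fix): tampTable above is a bucketed ORC table with 'transactional'='true', and Spark versions of this era could not read uncompacted Hive ACID delta files even though Hive itself reads them fine, which would match these symptoms. Creating a plain, non-transactional copy and querying it through the same JDBC endpoint isolates the variable; the table name tampTable_plain is hypothetical:

```shell
beeline> create table tampTable_plain stored as orc as select * from tampTable;
beeline> select name from tampTable_plain;
```

If the plain copy returns rows over the same Spark JDBC endpoint, the transactional table format (or missing compaction) is the likely culprit.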
Re: How to convert a non-rdd data to rdd.
Hi Karthik,

Can you provide more detail about the dataset "data" that you want to parallelize with SparkContext.parallelize(data)?

Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Sun, Oct 12, 2014 at 11:45 AM, rapelly kartheek wrote:
> Hi,
>
> I am trying to write a String that is not an RDD to HDFS. This data is a
> variable in the Spark scheduler code. None of the Spark file operations
> work because my data is not an RDD.
>
> So, I tried using SparkContext.parallelize(data). But it throws an error:
>
> [error] /home/karthik/spark-1.0.0/core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala:265: not found: value SparkContext
> [error]     SparkContext.parallelize(result)
> [error]     ^
> [error] one error found
>
> I realized that this data is part of the scheduler, so the SparkContext
> would not have been created yet.
>
> Any help in "writing scheduler variable data to HDFS" is appreciated!!
>
> -Karthik
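On the compile error itself: "not found: value SparkContext" means the class is not imported at that point in BlockManagerMaster.scala, and even with the import, parallelize is an instance method rather than a static one, so it must be called on a live SparkContext. A minimal sketch of the usual driver-side pattern (the object name, sample string, and HDFS path are hypothetical; this cannot be used inside scheduler internals, where no SparkContext instance exists yet):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WriteStringToHdfs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("write-string"))
    val data = "some scheduler statistic"   // the plain String to persist
    // parallelize is called on the SparkContext instance (sc), not on the class
    sc.parallelize(Seq(data)).saveAsTextFile("hdfs:///tmp/scheduler-data")
    sc.stop()
  }
}
```

For a variable inside the scheduler itself, writing through Hadoop's FileSystem API directly (no RDD involved) is the more natural route, since no SparkContext is required.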