Spark uses HiveContext to access Hive tables on the same Hadoop cluster on which both Hive and Spark are running.

Let us look at an example of code below that uses Spark as an ETL tool to get data from an Oracle table through JDBC and store it in a Hive ORC table.

// 1) Create the Spark conf first
// (sparkAppName, sparkDriverAllowMultipleContextsValue and
// enableHiveSupportValue are assumed to be defined earlier in the program)
val conf = new SparkConf().
             setAppName(sparkAppName).
             set("spark.driver.allowMultipleContexts", sparkDriverAllowMultipleContextsValue).
             set("enableHiveSupport", enableHiveSupportValue)

// 2) Create SparkContext and HiveContext as below
val sc = new SparkContext(conf)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// 3) Access Hive database oraclehadoop as below
hiveContext.sql("use oraclehadoop")

// 4) Connection settings for the Oracle DB, accessed through the JDBC driver from Spark
val dbURL = "jdbc:oracle:thin:@rhes564:1521:mydb12"
val dbUserName = "scratchpad"
val dbPassword = "xxxxx"
val partitionColumnName = "ID"
val lowerBoundValue = "1"
val upperBoundValue = "100000000"
val numPartitionsValue = "100"

// 5) Get data from Oracle table scratchpad.dummy into a Spark DataFrame
val d = hiveContext.read.format("jdbc").options(
  Map("url" -> dbURL,
      "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
      "partitionColumn" -> partitionColumnName,
      "lowerBound" -> lowerBoundValue,
      "upperBound" -> upperBoundValue,
      "numPartitions" -> numPartitionsValue,
      "user" -> dbUserName,
      "password" -> dbPassword)).load
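A note on the options above: "partitionColumn", "lowerBound", "upperBound" and "numPartitions" make Spark issue numPartitions parallel queries, each covering one sub-range of ID. The same read can also be written with the typed DataFrameReader.jdbc() overload, where the bounds are Longs rather than Strings. A minimal sketch, reusing dbURL, dbUserName and dbPassword from above:

import java.util.Properties

// Connection properties for the typed jdbc() overload
val connProps = new Properties()
connProps.put("user", dbUserName)
connProps.put("password", dbPassword)

// Equivalent partitioned read: Spark generates 100 range predicates on ID,
// one per partition, and runs them in parallel against Oracle.
// Note the bounds only control how rows are split across partitions,
// not which rows are read.
val d2 = hiveContext.read.jdbc(
  dbURL,
  "scratchpad.dummy",   // a SELECT subquery in parentheses works here too
  "ID",                 // partitionColumn
  1L,                   // lowerBound
  100000000L,           // upperBound
  100,                  // numPartitions
  connProps)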
// 6) Create a temp table in Spark based on DataFrame "d" above
d.registerTempTable("tmp")

//
// Need to create and populate the target ORC table oraclehadoop.dummy
//
// Drop and create table oraclehadoop.dummy
hiveContext.sql("DROP TABLE IF EXISTS oraclehadoop.dummy")
var sqltext = ""
sqltext = """
CREATE TABLE oraclehadoop.dummy (
     ID INT
   , CLUSTERED INT
   , SCATTERED INT
   , RANDOMISED INT
   , RANDOM_STRING VARCHAR(50)
   , SMALL_VC VARCHAR(10)
   , PADDING VARCHAR(10)
)
--CLUSTERED BY (ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES (
    "orc.create.index"="true",
    "orc.bloom.filter.columns"="ID",
    "orc.bloom.filter.fpp"="0.05",
    "orc.compress"="SNAPPY",
    "orc.stripe.size"="16777216",
    "orc.row.index.stride"="10000"
)
"""
hiveContext.sql(sqltext)

//
// Put data in the Hive table. Clean-up is already done
//
sqltext = """
INSERT INTO TABLE oraclehadoop.dummy
SELECT
    ID
  , CLUSTERED
  , SCATTERED
  , RANDOMISED
  , RANDOM_STRING
  , SMALL_VC
  , PADDING
FROM tmp
"""
hiveContext.sql(sqltext)
….
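As a side note, on Spark 2.x HiveContext is deprecated in favour of SparkSession. A minimal sketch of the equivalent set-up for steps 1 to 3, reusing the same conf values (sparkAppName and sparkDriverAllowMultipleContextsValue as before):

import org.apache.spark.sql.SparkSession

// Spark 2.x equivalent of steps 1-3: a SparkSession with Hive support
// replaces the SparkConf + SparkContext + HiveContext combination
val spark = SparkSession.builder().
              appName(sparkAppName).
              config("spark.driver.allowMultipleContexts", sparkDriverAllowMultipleContextsValue).
              enableHiveSupport().
              getOrCreate()

spark.sql("use oraclehadoop")

// and in step 6, createOrReplaceTempView replaces the deprecated registerTempTable:
// d.createOrReplaceTempView("tmp")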
HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Fri, 24 May 2019 at 17:59, Alan Gates <alanfga...@gmail.com> wrote:

> HCatalog was built as an interface to allow tools such as Pig and
> MapReduce to access Hive tabular data, for both read and write. In more
> recent versions of Hive, HCatalog has not been updated to support the
> newest features, such as reading or writing transactional data or, in Hive
> 3.x, accessing managed tables (that is, tables that Hive owns).
>
> HiveServer2 is an ODBC/JDBC server for Hive. There is no relationship
> between HiveServer2 and HCatalog.
>
> Hive also has a metastore, a data catalog that tracks metadata for Hive
> tables. This can be run as a separate service, in which case it is often
> referred to as HMS, or embedded into another system. For example, in the
> past HiveServer2 was often configured to embed the metastore
> functionality. HCatalog communicates with the metastore to determine what
> physical storage objects (files or objects in an object store) make up a
> table or partition that the non-Hive user wants to interact with.
> Traditionally Spark communicates directly with the Hive metastore (I
> believe it can either embed the metastore functionality or communicate
> with an external HMS, but I'm not sure) and then operates directly on the
> underlying files or objects. This no longer works in Hive 3, and there
> are other ways to connect the two, which I can go into if you're
> interested.
>
> Alan.
>
> On Fri, May 24, 2019 at 1:28 AM 崔苗(未来技术实验室) <0049003...@znv.com> wrote:
>
>> Hi,
>> we have some confusion about Hive:
>> 1. What is the difference between HCatalog and HiveServer2? Does
>> HiveServer2 rely on HCatalog?
>> 2. Where do HCatalog and HiveServer2 sit in the overall Hive
>> architecture?
>> 3. How does Spark SQL read Hive tables, through HCatalog or HiveServer2?
>>
>> Thanks for any replies.
>>
>> 0049003208
>> 0049003...@znv.com
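As a concrete illustration of Alan's answer to question 3 above: Spark SQL goes straight to the Hive metastore for table metadata and then reads the data files itself, bypassing both HCatalog and HiveServer2. A minimal Spark 2.x sketch; the thrift URI below is a placeholder for your own HMS endpoint:

import org.apache.spark.sql.SparkSession

// Point Spark at an external Hive metastore (HMS). Spark asks HMS which
// files/objects make up the table, then scans them directly; neither
// HCatalog nor HiveServer2 takes part in the read.
val spark = SparkSession.builder().
              appName("SparkReadsHive").
              config("hive.metastore.uris", "thrift://hms-host:9083").  // placeholder URI
              enableHiveSupport().
              getOrCreate()

spark.sql("SELECT COUNT(*) FROM oraclehadoop.dummy").show()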