Spark uses HiveContext to access Hive tables on the same Hadoop cluster on which both Hive and Spark are running.

Let us look at an example of code below that uses Spark as an ETL tool to get data from an Oracle table through JDBC and store it in a Hive ORC table.

// 1) Create the Spark conf first
// (sparkAppName, sparkDriverAllowMultipleContextsValue and
// enableHiveSupportValue are assumed to be defined earlier in the program)
val conf = new SparkConf().
             setAppName(sparkAppName).
             set("spark.driver.allowMultipleContexts", sparkDriverAllowMultipleContextsValue).
             set("enableHiveSupport", enableHiveSupportValue)

// 2) Create SparkContext and HiveContext as below
val sc = new SparkContext(conf)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// 3) Access Hive database oraclehadoop as below
hiveContext.sql("use oraclehadoop")

// 4) Connection settings for the Oracle DB, accessed through the JDBC driver from Spark
val dbURL = "jdbc:oracle:thin:@rhes564:1521:mydb12"
val dbUserName = "scratchpad"
val dbPassword = "xxxxx"
val partitionColumnName = "ID"
val lowerBoundValue = "1"
val upperBoundValue = "100000000"
val numPartitionsValue = "100"

// 5) Get data from Oracle table scratchpad.dummy into a Spark DataFrame
val d = hiveContext.read.format("jdbc").options(
  Map("url" -> dbURL,
      "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
      "partitionColumn" -> partitionColumnName,
      "lowerBound" -> lowerBoundValue,
      "upperBound" -> upperBoundValue,
      "numPartitions" -> numPartitionsValue,
      "user" -> dbUserName,
      "password" -> dbPassword)).load
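A note on the options above: "partitionColumn", "lowerBound", "upperBound" and "numPartitions" make Spark issue numPartitions parallel queries, each covering one sub-range of ID. The same read can also be written with the typed DataFrameReader.jdbc() overload, where the bounds are Longs rather than Strings. A minimal sketch, reusing dbURL, dbUserName and dbPassword from above:

import java.util.Properties

// Connection properties for the typed jdbc() overload
val connProps = new Properties()
connProps.put("user", dbUserName)
connProps.put("password", dbPassword)

// Equivalent partitioned read: Spark generates 100 range predicates on ID,
// one per partition, and runs them in parallel against Oracle.
// Note the bounds only control how rows are split across partitions,
// not which rows are read.
val d2 = hiveContext.read.jdbc(
  dbURL,
  "scratchpad.dummy",   // a SELECT subquery in parentheses works here too
  "ID",                 // partitionColumn
  1L,                   // lowerBound
  100000000L,           // upperBound
  100,                  // numPartitions
  connProps)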
// 6) Create a temp table in Spark based on DataFrame "d" above
d.registerTempTable("tmp")

//
// Need to create and populate the target ORC table oraclehadoop.dummy
//
// Drop and create table oraclehadoop.dummy
hiveContext.sql("DROP TABLE IF EXISTS oraclehadoop.dummy")
var sqltext = ""
sqltext = """
CREATE TABLE oraclehadoop.dummy (
     ID INT
   , CLUSTERED INT
   , SCATTERED INT
   , RANDOMISED INT
   , RANDOM_STRING VARCHAR(50)
   , SMALL_VC VARCHAR(10)
   , PADDING VARCHAR(10)
)
--CLUSTERED BY (ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES (
    "orc.create.index"="true",
    "orc.bloom.filter.columns"="ID",
    "orc.bloom.filter.fpp"="0.05",
    "orc.compress"="SNAPPY",
    "orc.stripe.size"="16777216",
    "orc.row.index.stride"="10000"
)
"""
hiveContext.sql(sqltext)

//
// Put data in the Hive table. Clean-up is already done
//
sqltext = """
INSERT INTO TABLE oraclehadoop.dummy
SELECT
    ID
  , CLUSTERED
  , SCATTERED
  , RANDOMISED
  , RANDOM_STRING
  , SMALL_VC
  , PADDING
FROM tmp
"""
hiveContext.sql(sqltext)
….
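As a side note, on Spark 2.x HiveContext is deprecated in favour of SparkSession. A minimal sketch of the equivalent set-up for steps 1 to 3, reusing the same conf values (sparkAppName and sparkDriverAllowMultipleContextsValue as before):

import org.apache.spark.sql.SparkSession

// Spark 2.x equivalent of steps 1-3: a SparkSession with Hive support
// replaces the SparkConf + SparkContext + HiveContext combination
val spark = SparkSession.builder().
              appName(sparkAppName).
              config("spark.driver.allowMultipleContexts", sparkDriverAllowMultipleContextsValue).
              enableHiveSupport().
              getOrCreate()

spark.sql("use oraclehadoop")

// and in step 6, createOrReplaceTempView replaces the deprecated registerTempTable:
// d.createOrReplaceTempView("tmp")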
HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Fri, 24 May 2019 at 17:59, Alan Gates <alanfga...@gmail.com> wrote:

> HCatalog was built as an interface to allow tools such as Pig and
> MapReduce to access Hive tabular data, for both read and write. In more
> recent versions of Hive, HCatalog has not been updated to support the
> newest features, such as reading or writing transactional data or, in Hive
> 3.x, accessing managed tables (that is, tables that Hive owns).
>
> HiveServer2 is an ODBC/JDBC server for Hive. There is no relationship
> between HiveServer2 and HCatalog.
>
> Hive also has a metastore, a data catalog that tracks metadata for Hive
> tables. This can be run as a separate service, in which case it is often
> referred to as HMS, or embedded into another system. For example, in the
> past HiveServer2 was often configured to embed the metastore
> functionality. HCatalog communicates with the metastore to determine what
> physical storage objects (files or objects in an object store) make up a
> table or partition that the non-Hive user wants to interact with.
> Traditionally Spark communicates directly with the Hive metastore (I
> believe it can either embed the metastore functionality or communicate
> with an external HMS, but I'm not sure) and then operates directly on the
> underlying files or objects. This no longer works in Hive 3, and there
> are other ways to connect the two, which I can go into if you're
> interested.
>
> Alan.
>
> On Fri, May 24, 2019 at 1:28 AM 崔苗(未来技术实验室) <0049003...@znv.com> wrote:
>
>> Hi,
>> we have some confusion about Hive:
>> 1. What is the difference between HCatalog and HiveServer2? Does
>> HiveServer2 rely on HCatalog?
>> 2. Where do HCatalog and HiveServer2 sit in the overall Hive
>> architecture?
>> 3. How does Spark SQL read Hive tables, through HCatalog or HiveServer2?
>>
>> Thanks for any replies.
>>
>> 0049003208
>> 0049003...@znv.com
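As a concrete illustration of Alan's answer to question 3 above: Spark SQL goes straight to the Hive metastore for table metadata and then reads the data files itself, bypassing both HCatalog and HiveServer2. A minimal Spark 2.x sketch; the thrift URI below is a placeholder for your own HMS endpoint:

import org.apache.spark.sql.SparkSession

// Point Spark at an external Hive metastore (HMS). Spark asks HMS which
// files/objects make up the table, then scans them directly; neither
// HCatalog nor HiveServer2 takes part in the read.
val spark = SparkSession.builder().
              appName("SparkReadsHive").
              config("hive.metastore.uris", "thrift://hms-host:9083").  // placeholder URI
              enableHiveSupport().
              getOrCreate()

spark.sql("SELECT COUNT(*) FROM oraclehadoop.dummy").show()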