Re: Specifying YARN Node (Label) for LLAP AM

2023-08-19 Thread Mich Talebzadeh
Thank you for the information, Aaron. I explored the MR3 link you provided
and found it intriguing. However, the latest email I received from another
member seemed to deviate from the technical discussion's focus, potentially
leading us off
track and hindering objectivity. Therefore, with regret, I am closing my
argument.

Cheers

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 19 Aug 2023 at 11:33, Aaron Grubb  wrote:

> You might also be interested in knowing that there have been discussions
> about deprecating Hive on Spark:
> https://lists.apache.org/thread/sspltkv3ovbsjmoct72p4m1ooqk2g740
>
> On Sat, 2023-08-19 at 10:17 +, Aaron Grubb wrote:
>
> Hi Mich,
>
> It's not a question of cannot but rather a) is it worth converting our
> pipelines from Hive to Spark and b) is Spark more performant than LLAP, and
> in both cases the answer seems to be no. 2016 is a lifetime ago in
> technological time and since then there's been a major release of Hive as
> well as many minor releases. When we started looking for our "big data
> processor" 2 years ago, we had evaluated Spark, Presto, AWS Athena and Hive
> on LLAP and all literature pointed to Hive on LLAP being the most
> performant, in particular when you're able to take advantage of the ORC
> footer caching. If you'd like to review some benchmarks, you can take a
> look at this [1] but the direct comparison between Spark and LLAP is done
> with a fork of Hive.
>
> Regards,
> Aaron
>
> [1] https://www.datamonad.com/post/2022-04-01-spark-hive-performance-1.4/
>
> On Fri, 2023-08-18 at 16:06 +0100, Mich Talebzadeh wrote:
>
> interesting!
>
> In 2016 I gave a presentation in London at Future of Data, organised by
> Hortonworks on July 20, 2016:
>
> Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
> <https://talebzadehmich.files.wordpress.com/2016/08/hive_on_spark_only.pdf>
>
>
> At the time I thought Spark as an underlying engine for Hive did the best job.
> However, I am not sure there have been many new developments to make Spark
> the underlying engine for Hive. Any particular reason you cannot use
> Spark as the ETL tool with Hive providing the underlying storage? Spark has
> excellent APIs to work with Hive, including the Spark Thrift Server (which is,
> under the bonnet, the Hive Thrift Server).
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 18 Aug 2023 at 15:45, Aaron Grubb  wrote:
>
> Hi Mich,
>
> Yes, that's correct
>
> On Fri, 2023-08-18 at 15:24 +0100, Mich Talebzadeh wrote:
>
> Hi,
>
> Are you using LLAP (Live Long and Process) as a Hive engine?
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 18 Aug 2023 at 15:09, Aaron Grubb  wrote:
>
> For those interested, I managed to define a way to launch the LLAP
> application master and daemons on separate, targeted machines. It was
> inspired by an article I found [1] and implemented using YARN Node Labels
> [2] and Placement Constraints [3] with a modification to the file
> scripts/llap/yarn/templates.py. Here are the basic instructions:
>
>

Re: Specifying YARN Node (Label) for LLAP AM

2023-08-18 Thread Mich Talebzadeh
interesting!

In 2016 I gave a presentation in London at Future of Data, organised by
Hortonworks on July 20, 2016:

Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
<https://talebzadehmich.files.wordpress.com/2016/08/hive_on_spark_only.pdf>


At the time I thought Spark as an underlying engine for Hive did the best job.
However, I am not sure there have been many new developments to make Spark
the underlying engine for Hive. Any particular reason you cannot use
Spark as the ETL tool with Hive providing the underlying storage? Spark has
excellent APIs to work with Hive, including the Spark Thrift Server (which is,
under the bonnet, the Hive Thrift Server).

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 18 Aug 2023 at 15:45, Aaron Grubb  wrote:

> Hi Mich,
>
> Yes, that's correct
>
> On Fri, 2023-08-18 at 15:24 +0100, Mich Talebzadeh wrote:
>
> Hi,
>
> Are you using LLAP (Live Long and Process) as a Hive engine?
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 18 Aug 2023 at 15:09, Aaron Grubb  wrote:
>
> For those interested, I managed to define a way to launch the LLAP
> application master and daemons on separate, targeted machines. It was
> inspired by an article I found [1] and implemented using YARN Node Labels
> [2] and Placement Constraints [3] with a modification to the file
> scripts/llap/yarn/templates.py. Here are the basic instructions:
>
> 1. Configure YARN to enable placement constraints and node labels. You
> have the option of using 2 node labels or 1 node label + the default
> partition. The machines that are intended to run the daemons must have a
> label associated with them. If you choose to use 2 node labels, you must
> set the default label for the queue that you're submitting LLAP to, to the
> node label associated with the machine that will run the application
> master. Note that this affects other applications submitted to the same
> queue. If it's only 1 label, the machine that will run the AM must be
> accessible by the DEFAULT_PARTITION queue, and that machine will not be
> specifically targeted if you have more than one machine accessible by the
> DEFAULT_PARTITION, so this scenario is recommended only if you have a
> single machine intended for application masters, as is my case.
>
> 2. Modify scripts/llap/yarn/templates.py like so:
>
> #SNIP
>
>   "APP_ROOT": "/app/install/",
>
>   "APP_TMP_DIR": "/tmp/"
>
> }
>
>   },
>
>   "placement_policy": {
>
> "constraints": [
>
>   {
>
> "type": "ANTI_AFFINITY",
>
> "scope": "NODE",
>
> "target_tags": [
>
>   "llap"
>
> ],
>
> "node_partitions": [
>
>   ""
>
> ]
>
>   }
>
> ]
>
>   }
>
> }
>
>   ],
>
>   "kerberos_principal" : {
>
> #SNIP
>
> Note that ANTI_AFFINITY means that only 1 daemon will be spawned per
> machine but that should be the desired behaviour anyway. Read more about it
> in [3].
>
> 3. Launch LLAP using the *hive --service llap* command
>
> Hope this helps someone!
> Aaron
>
> [1]
> https://www.gresearch.com/blog/article/hive-llap-in-practice-sizing-setup-and-troubleshooting/
> [2]
> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeLabel.html
> [3]
> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/PlacementConstraints.html
>
> On 2023
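
For illustration, a minimal sketch of the YARN-side setup described in step 1
above; the label, queue and host names are purely illustrative, and the exact
property names should be verified against your Hadoop release:

# yarn-site.xml (ResourceManager and NodeManagers)
#   yarn.node-labels.enabled = true
#   yarn.node-labels.fs-store.root-dir = hdfs:///yarn/node-labels
#   yarn.resourcemanager.placement-constraints.handler = scheduler
# capacity-scheduler.xml, for the queue LLAP is submitted to
#   yarn.scheduler.capacity.root.llap.accessible-node-labels = llap_nodes
#   yarn.scheduler.capacity.root.llap.default-node-label-expression = llap_nodes   (2-label variant only)

# create the label and attach it to the hosts that should run the LLAP daemons
yarn rmadmin -addToClusterNodeLabels "llap_nodes(exclusive=true)"
yarn rmadmin -replaceLabelsOnNode "daemon-host-1=llap_nodes daemon-host-2=llap_nodes"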

Re: Specifying YARN Node (Label) for LLAP AM

2023-08-18 Thread Mich Talebzadeh
Hi,

Are you using LLAP (Live Long and Process) as a Hive engine?

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 18 Aug 2023 at 15:09, Aaron Grubb  wrote:

> For those interested, I managed to define a way to launch the LLAP
> application master and daemons on separate, targeted machines. It was
> inspired by an article I found [1] and implemented using YARN Node Labels
> [2] and Placement Constraints [3] with a modification to the file
> scripts/llap/yarn/templates.py. Here are the basic instructions:
>
> 1. Configure YARN to enable placement constraints and node labels. You
> have the option of using 2 node labels or 1 node label + the default
> partition. The machines that are intended to run the daemons must have a
> label associated with them. If you choose to use 2 node labels, you must
> set the default label for the queue that you're submitting LLAP to, to the
> node label associated with the machine that will run the application
> master. Note that this affects other applications submitted to the same
> queue. If it's only 1 label, the machine that will run the AM must be
> accessible by the DEFAULT_PARTITION queue, and that machine will not be
> specifically targeted if you have more than one machine accessible by the
> DEFAULT_PARTITION, so this scenario is recommended only if you have a
> single machine intended for application masters, as is my case.
>
> 2. Modify scripts/llap/yarn/templates.py like so:
>
> #SNIP
>
>   "APP_ROOT": "/app/install/",
>
>   "APP_TMP_DIR": "/tmp/"
>
> }
>
>   },
>
>   "placement_policy": {
>
> "constraints": [
>
>   {
>
> "type": "ANTI_AFFINITY",
>
> "scope": "NODE",
>
> "target_tags": [
>
>   "llap"
>
> ],
>
> "node_partitions": [
>
>   ""
>
> ]
>
>   }
>
> ]
>
>   }
>
> }
>
>   ],
>
>   "kerberos_principal" : {
>
> #SNIP
>
> Note that ANTI_AFFINITY means that only 1 daemon will be spawned per
> machine but that should be the desired behaviour anyway. Read more about it
> in [3].
>
> 3. Launch LLAP using the *hive --service llap* command
>
> Hope this helps someone!
> Aaron
>
> [1]
> https://www.gresearch.com/blog/article/hive-llap-in-practice-sizing-setup-and-troubleshooting/
> [2]
> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeLabel.html
> [3]
> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/PlacementConstraints.html
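
As a hedged illustration of step 3 above, the launch command usually carries
sizing flags as well; the instance counts and sizes below are placeholders, and
the flag names should be checked against the output of hive --service llap --help:

hive --service llap --name llap0 \
     --instances 3 --executors 4 \
     --size 32g --xmx 26g --cache 4g \
     --queue llap --loglevel INFO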
>
> On 2023/03/22 10:19:57 Aaron Grubb wrote:
> > Hi all,
> >
> > I have a Hadoop cluster (3.3.4) with 6 nodes of equal resource size that
> run HDFS and YARN and 1 node with lower resources which only runs YARN that
> I use for Hive AMs, the LLAP AM, Spark AMs and Hive file merge containers.
> The HDFS nodes are set up such that the queue for LLAP on the YARN
> NodeManager is allocated resources exactly equal to what the LLAP daemons
> consume. However, when I need to re-launch LLAP, I currently have to stop
> the NodeManager processes on each HDFS node, then launch LLAP to guarantee
> that the application master ends up on the YARN-only machine, then start
> the NodeManager processes again to let the daemons start spawning on the
> nodes. This used to not be a problem because only Hive/LLAP was using YARN
> but now we've started using Spark in my company and I'm in a position where
> if LLAP happens to crash, I would need to wait for Spark jobs to finish
> before I can re-launch LLAP, which would put our ETL processes behind,
> potentially to unacceptable delays. I could allocate 1 vcore and 1024mb
> memory extra for the LLAP queue on each machine, however that would mean I
> have 5 vcores and 5gb RAM being reserved and unused at all times, so I was
> wondering if there's a way to specify which node to launch the LLAP AM on,
> perhaps through YARN node labels similar to the Spark
> "spark.yarn.am.nodeLabelExpression" configuration? Or even a way to specify
> the node machine through a different mechanism? My Hive version is 3.1.3.
> >
> > Thanks,
> > Aaron
> >
>
>


Re: Hive 3 has big performance improvement from my test

2023-01-08 Thread Mich Talebzadeh
What bothers me is that you are making sweeping statements about Spark's
inability to handle, quote, "... the key weakness of Spark is 1) its poor
performance when executing concurrent queries and 2) its poor resource
utilization when executing multiple Spark applications concurrently",
and conversely overstating Hive's ability when running on MR.
In fairness, anything published in a public forum is fair game for analysis
or criticism. Then you are expected to back it up. I cannot see how anyone
could object to the statement: if you make a claim, be prepared to prove
it.

I am open minded on this so please clarify the above statement

HTH

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 8 Jan 2023 at 05:21, Sungwoo Park  wrote:

>
>> [image: image.png]
>>
>> from your posting, the result is amazing. glad to know hive on mr3 has
>> that nice performance.
>>
>
> Hive on MR3 is similar to Hive-LLAP in performance, so we can interpret
> the above result as Hive being much faster than SparkSQL. For executing
> concurrent queries, the performance gap is even greater. In my (rather
> biased) opinion, the key weakness of Spark is 1) its poor performance when
> executing concurrent queries and 2) its poor resource utilization when
> executing multiple Spark applications concurrently.
>
> We released Hive on MR3 1.6 a couple of weeks ago. Now we have backported
> about 700 patches to Hive 3.1. If interested, please check it out:
> https://www.datamonad.com/
>
> Sungwoo
>


Re: Hive 3 has big performance improvement from my test

2023-01-07 Thread Mich Talebzadeh
Thanks for this insight guys.

On your point below and I quote:

...  "It's even as fast as Spark by using the default mr engine"

OK, as we are all experimentalists, are we stating that the classic
MapReduce computation can outdo Spark's in-memory computation? I would be
curious to know this.

Thanks



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 6 Jan 2023 at 03:35, ypeng  wrote:

> Hello,
>
> Just from my personal testing, Hive 3.1.3 has much better performance than
> the old ones.
> It's even as fast as Spark by using the default mr engine.
> My test process and dataset,
> https://blog.crypt.pw/Another-10-million-dataset-testing-for-Spark-and-Hive
>
> Thanks.
>


Re: Hive unable to Launch job to spark

2022-05-31 Thread Mich Talebzadeh
Agreed. The last time I presented it was in 2016

Presentation in London: Query Engines for Hive: MR, Spark, Tez with LLAP
– Considerations!
<https://www.slideshare.net/MichTalebzadeh1/query-engines-for-hive-mr-spark-tez-with-llap-considerations>


and sadly they diverged!


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 31 May 2022 at 10:29, Peter Vary  wrote:

> Hi Prasanth,
>
> I would suggest not to invest too heavily in Hive on Spark.
> In recent years there has been no movement around the feature and it will be
> removed in Hive 4.0.0.
>
> Thanks,
> Peter
>
> On 2022. May 27., at 13:00, Prasanth M Sasidharan 
> wrote:
>
> Hello team,
>
> I am trying to use spark as the engine for Hive.
> Hive version : 3.1.1
> Spark: 2.4.0 (have tried with several versions)
>
> I am able to query the hive tables from within pyspark. However, when I
> execute the hive query from the hive shell, it throws the following error:
>
> Launch Command: "/opt/CDS/pre_requisites/jdk1.8.0_141/bin/java" "-cp"
> "/opt/CDS/spark/conf/:/opt/CDS/spark/jars/*:/opt/CDS/hadoop-2.9.2/etc/hadoop/:/etc/tez/conf:/opt/CDS/hive/conf/:/opt/CDS/hive/lib/HikariCP-2.6.1.jar:/opt/CDS/hive/lib/ST4-4.0.4.jar:/opt/CDS/hive/lib/accumulo-core-1.7.3.jar:/opt/CDS/hive/lib/accumulo-fate-1.7.3.jar:/opt/CDS/hive/lib/accumulo-start-1.7.3.jar:/opt/CDS/hive/lib/accumulo-trace-1.7.3.jar:/opt/CDS/hive/lib/aircompressor-0.10.jar:/opt/CDS/hive/lib/ant-1.9.1.jar:/opt/CDS/hive/lib/ant-launcher-1.9.1.jar:/opt/CDS/hive/lib/antlr-runtime-3.5.2.jar:/opt/CDS/hive/lib/antlr4-runtime-4.5.jar:/opt/CDS/hive/lib/aopalliance-repackaged-2.5.0-b32.jar:/opt/CDS/hive/lib/apache-jsp-9.3.20.v20170531.jar:/opt/CDS/hive/lib/apache-jstl-9.3.20.v20170531.jar:/opt/CDS/hive/lib/arrow-format-0.8.0.jar:/opt/CDS/hive/lib/arrow-memory-0.8.0.jar:/opt/CDS/hive/lib/arrow-vector-0.8.0.jar:/opt/CDS/hive/lib/asm-5.0.1.jar:/opt/CDS/hive/lib/asm-commons-5.0.1.jar:/opt/CDS/hive/lib/asm-tree-5.0.1.jar:/opt/CDS/hive/lib/audience-annotations-0.5.0.jar:/opt/CDS/hive/lib/avatica-1.11.0.jar:/opt/CDS/hive/lib/avro-1.7.7.jar:/opt/CDS/hive/lib/bonecp-0.8.0.RELEASE.jar:/opt/CDS/hive/lib/calcite-core-1.16.0.jar:/opt/CDS/hive/lib/calcite-druid-1.16.0.jar:/opt/CDS/hive/lib/calcite-linq4j-1.16.0.jar:/opt/CDS/hive/lib/commons-cli-1.2.jar:/opt/CDS/hive/lib/commons-codec-1.7.jar:/opt/CDS/hive/lib/commons-collections4-4.1.jar:/opt/CDS/hive/lib/commons-compiler-2.7.6.jar:/opt/CDS/hive/lib/commons-compress-1.9.jar:/opt/CDS/hive/lib/commons-crypto-1.0.0.jar:/opt/CDS/hive/lib/commons-dbcp-1.4.jar:/opt/CDS/hive/lib/commons-io-2.4.jar:/opt/CDS/hive/lib/commons-lang-2.6.jar:/opt/CDS/hive/lib/commons-lang3-3.2.jar:/opt/CDS/hive/lib/commons-logging-1.0.4.jar:/opt/CDS/hive/lib/commons-math-2.1.jar:/opt/CDS/hive/lib/commons-math3-3.6.1.jar:/opt/CDS/hive/lib/commons-pool-1.5.4.jar:/opt/CDS/hive/lib/commons-vfs2-2.1.jar:/opt/CDS/hive/lib/curator-client-2.12.0.jar:/opt/CDS/hive/lib/curator-framework-2.12.0.jar:/opt/CDS/hive/lib/curator-recipes-2.12.0.jar:/opt/CDS/hive/lib/datanucleus-api-jdo-4.2.4.jar:/opt/CDS/hive/lib/datanucleus-core-4.1.17.jar:/opt/CDS/hive/lib/datanucleus-rdbms-4.1.19.jar:/opt/CDS/hive/lib/derby-10.14.1.0.jar:/opt/CDS/hive/lib/disruptor-3.3.6.jar:/opt/CDS/hive/lib/dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar:/opt/CDS/hive/lib/druid-hdfs-storage-0.12.0.jar:/opt/CDS/hive/lib/ecj-4.4.2.jar:/opt/CDS/hive/lib/esri-geometry-api-2.0.0.jar:/opt/CDS/hive/lib/findbugs-annotations-1.3.9-1.jar:/opt/CDS/hive/lib/flatbuffers-1.2.0-3f79e055.jar:/opt/CDS/hive/lib/groovy-all-2.4.11.jar:/opt/CDS/hive/lib/gson-2.2.4.jar:/opt/CDS/hive/lib/guava-19.0.jar:/opt/CDS/hive/lib/hbase-client-2.0.0-alpha4.jar:/opt/CDS/hive/lib/hbase-common-2.0.0-alpha4-tests.jar:/opt/CDS/hive/lib/hbase-common-2.0.0-alpha4.jar:/opt/CDS/hive/lib/hbase-hadoop-compat-2.0.0-alpha4.jar:/opt/CDS/hive/lib/hbase-hadoop2-compat-2.0.0-alpha4-tests.jar:/opt/CDS/hive/lib/hbase-hadoop2-compat-2.0.0-alpha4.jar:/opt/CDS/hive/lib/hbase-http-2.0.0-alpha4.jar:/opt/CDS/hive/lib/hbase-mapreduce-2.0.0-alpha4.jar:/opt/CDS/hive/lib/hbase-metrics-2.0.0-alpha4.jar:/opt/CDS/hive/lib/hbase-metrics-api-2.0.0-alpha4.jar:/opt/CDS/hive/lib/hbase-prefix-tree-2.0.0-alpha4.jar:/opt/CDS/hive/lib/hbase-procedure-2.0.0-alpha4.jar:/opt/CDS/hive/lib/hbase-protocol-2.0.0-alpha4.jar:/opt/CDS/hive/lib/hbase-protocol-shaded-2.0.0-alpha4.jar:/opt/CDS/hive/lib/hbase-replication-2.0.0-alpha4.jar:/opt/CDS/hive/lib/hbase-server-2.0.0-alpha4.jar:/opt/CDS/hive/lib/h

Insert from Hive table into Google BigQuery table

2022-03-08 Thread Mich Talebzadeh
Time Spent: 0 msec

2022-03-08 15:51:58,277 INFO  [ec1f7b38-7286-481a-b48b-44f7ed54c20a main]
ql.Driver: Total MapReduce CPU Time Spent: 0 msec

2022-03-08 15:51:58,277 INFO  [ec1f7b38-7286-481a-b48b-44f7ed54c20a
main] ql.Driver:
Completed executing
command(queryId=hduser_20220308155155_9c3bc173-6893-4a5b-a061-226e2ac9b42a);
Time taken: 2.297 seconds

OK

2022-03-08 15:51:58,277 INFO  [ec1f7b38-7286-481a-b48b-44f7ed54c20a main]
ql.Driver: OK

2022-03-08 15:51:58,277 INFO  [ec1f7b38-7286-481a-b48b-44f7ed54c20a main]
lockmgr.DbTxnManager: Stopped heartbeat for query:
hduser_20220308155155_9c3bc173-6893-4a5b-a061-226e2ac9b42a

Time taken: 2.558 seconds

2022-03-08 15:51:58,296 INFO  [ec1f7b38-7286-481a-b48b-44f7ed54c20a main]
CliDriver: Time taken: 2.558 seconds

2022-03-08 15:51:58,296 INFO  [ec1f7b38-7286-481a-b48b-44f7ed54c20a main]
conf.HiveConf: Using the default value passed in for log id:
ec1f7b38-7286-481a-


So it shows OK and no errors


But the record is NOT added to the local Hive table!


So in summary we can read rows added to the BigQuery table through Hive but
cannot add any rows to the Hive table itself?


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: help with beeline connection to hive

2022-02-23 Thread Mich Talebzadeh
and check that the HiveServer2 Thrift server is indeed running (mine runs on
port 10099):

netstat -plten|grep 10099
tcp0  0 0.0.0.0:10099   0.0.0.0:*   LISTEN
1
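
For reference, a complete connect command then looks roughly like this; the
port must match whatever HiveServer2 listens on (hive.server2.thrift.port,
10000 by default, 10099 in the example above), and the user name here is only
a placeholder:

beeline -u jdbc:hive2://localhost:10099/default -n hduser -p hduser_password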



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 23 Feb 2022 at 09:42, Mich Talebzadeh 
wrote:

> beeline -u jdbc:hive2://localhost:1/default
> org.apache.hive.jdbc.HiveDriver -n <user under which hiveserver2
> is running> -p <password>
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 23 Feb 2022 at 09:35, Aaron Grubb  wrote:
>
>> Try username "root" and empty password. That works for me on 3.1.2.
>>
>> On Wed, 2022-02-23 at 10:16 +0800, Bitfox wrote:
>>
>> Hello
>>
>> I have hive 2.3.9 installed by default on localhost for testing.
>> HDFS is also installed on localhost, which works correctly b/c I have
>> already used the file storage feature.
>>
>> I didn't change any configuration files for hive.
>>
>> I can login into hive shell:
>>
>> hive> show databases;
>>
>> OK
>>
>> default
>>
>> Time taken: 4.458 seconds, Fetched: 1 row(s)
>>
>>
>>
>> After starting hiveserver2, which is up and running, I can't connect via beeline:
>>
>>
>> beeline> !connect jdbc:hive2://localhost:1/default
>>
>> Connecting to jdbc:hive2://localhost:1/default
>>
>> Enter username for jdbc:hive2://localhost:1/default: APP
>>
>> Enter password for jdbc:hive2://localhost:1/default: 
>>
>> 22/02/23 10:11:41 [main]: WARN jdbc.HiveConnection: Failed to connect to
>> localhost:1
>>
>> Could not open connection to the HS2 server. Please check the server URI
>> and if the URI is correct, then ask the administrator to check the server
>> status.
>>
>> Error: Could not open client transport with JDBC Uri:
>> jdbc:hive2://localhost:1/default: java.net.ConnectException: Connection
>> refused (Connection refused) (state=08S01,code=0)
>>
>>
>>
>> It prompts me to input username and password.
>>
>> I have tried both empty user/pass, and user "APP"/pass "mine".
>>
>> Neither of them will work.
>>
>>
>>
>> How can I fix this and connect to Hive correctly via beeline?
>>
>>
>> Sorry I am the newbie to Hive.
>>
>>
>> Thanks a lot.
>>
>>
>>


Re: help with beeline connection to hive

2022-02-23 Thread Mich Talebzadeh
beeline -u jdbc:hive2://localhost:1/default
org.apache.hive.jdbc.HiveDriver -n <user under which hiveserver2 is running> -p <password>



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 23 Feb 2022 at 09:35, Aaron Grubb  wrote:

> Try username "root" and empty password. That works for me on 3.1.2.
>
> On Wed, 2022-02-23 at 10:16 +0800, Bitfox wrote:
>
> Hello
>
> I have hive 2.3.9 installed by default on localhost for testing.
> HDFS is also installed on localhost, which works correctly b/c I have
> already used the file storage feature.
>
> I didn't change any configuration files for hive.
>
> I can login into hive shell:
>
> hive> show databases;
>
> OK
>
> default
>
> Time taken: 4.458 seconds, Fetched: 1 row(s)
>
>
>
> After starting hiveserver2, which is up and running, I can't connect via beeline:
>
>
> beeline> !connect jdbc:hive2://localhost:1/default
>
> Connecting to jdbc:hive2://localhost:1/default
>
> Enter username for jdbc:hive2://localhost:1/default: APP
>
> Enter password for jdbc:hive2://localhost:1/default: 
>
> 22/02/23 10:11:41 [main]: WARN jdbc.HiveConnection: Failed to connect to
> localhost:1
>
> Could not open connection to the HS2 server. Please check the server URI
> and if the URI is correct, then ask the administrator to check the server
> status.
>
> Error: Could not open client transport with JDBC Uri:
> jdbc:hive2://localhost:1/default: java.net.ConnectException: Connection
> refused (Connection refused) (state=08S01,code=0)
>
>
>
> It prompts me to input username and password.
>
> I have tried both empty user/pass, and user "APP"/pass "mine".
>
> Neither of them will work.
>
>
>
> How can I fix this and connect to Hive correctly via beeline?
>
>
> Sorry I am the newbie to Hive.
>
>
> Thanks a lot.
>
>
>


Re: metastore bug when hive update spark table ?

2022-01-06 Thread Mich Talebzadeh
Well I have seen this type of error before.

I tend to create the table in hive first and alter it in spark if needed.
This is spark 3.1.1 with Hive (version 3.1.1)

0: jdbc:hive2://rhes75:10099/default> create table my_table2 (col1 int,
col2 int)
0: jdbc:hive2://rhes75:10099/default> describe my_table2;
+---++--+
| col_name  | data_type  | comment  |
+---++--+
| col1  | int|  |
| col2  | int|  |
+---++--+
2 rows selected (0.17 seconds)

in Spark

>>> spark.sql("""ALTER TABLE my_table2 ADD column col3 string""")
DataFrame[]
>>> for c in spark.sql("""describe formatted my_table2 """).collect():
...   print(c)
...
*Row(col_name='col1', data_type='int', comment=None)*
*Row(col_name='col2', data_type='int', comment=None)*
*Row(col_name='col3', data_type='string', comment=None)*
Row(col_name='', data_type='', comment='')
Row(col_name='# Detailed Table Information', data_type='', comment='')
Row(col_name='Database', data_type='default', comment='')
Row(col_name='Table', data_type='my_table2', comment='')
Row(col_name='Owner', data_type='hduser', comment='')
Row(col_name='Created Time', data_type='Thu Jan 06 17:16:37 GMT 2022',
comment='')
Row(col_name='Last Access', data_type='UNKNOWN', comment='')
Row(col_name='Created By', data_type='Spark 2.2 or prior', comment='')
Row(col_name='Type', data_type='MANAGED', comment='')
Row(col_name='Provider', data_type='hive', comment='')
Row(col_name='Table Properties', data_type='[bucketing_version=2,
transient_lastDdlTime=1641489641]', comment='')
Row(col_name='Location',
data_type='hdfs://rhes75:9000/user/hive/warehouse/my_table2', comment='')
Row(col_name='Serde Library',
data_type='org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe', comment='')
Row(col_name='InputFormat',
data_type='org.apache.hadoop.mapred.TextInputFormat', comment='')
Row(col_name='OutputFormat',
data_type='org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
comment='')
Row(col_name='Storage Properties', data_type='[serialization.format=1]',
comment='')
Row(col_name='Partition Provider', data_type='Catalog', comment='')


This is my workaround.

HTH

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 6 Jan 2022 at 16:17, Nicolas Paris  wrote:

> Hi there.
>
> I also posted this problem in the spark list. I am not sure whether this is a
> spark or a hive metastore problem, or if there is some metastore tuning
> configuration as a workaround.
>
>
> Spark can't see hive schema updates partly because it stores the schema
> in a weird way in hive metastore.
>
>
> 1. FROM SPARK: create a table
> 
> >>> spark.sql("select 1 col1, 2
> col2").write.format("parquet").saveAsTable("my_table")
> >>> spark.table("my_table").printSchema()
> root
> |-- col1: integer (nullable = true)
> |-- col2: integer (nullable = true)
>
>
> 2. FROM HIVE: alter the schema
> ==
> 0: jdbc:hive2://localhost:1> ALTER TABLE my_table REPLACE
> COLUMNS(`col1` int, `col2` int, `col3` string);
> 0: jdbc:hive2://localhost:1> describe my_table;
> +---++--+
> | col_name | data_type | comment |
> +---++--+
> | col1 | int | |
> | col2 | int | |
> | col3 | string | |
> +---++--+
>
>
> 3. FROM SPARK: problem, column does not appear
> ==
> >>> spark.table("my_table").printSchema()
> root
> |-- col1: integer (nullable = true)
> |-- col2: integer (nullable = true)
>
>
> 4. FROM METASTORE DB: two ways of storing the columns
> ==
> metastore=# select * from "COLUMNS_V2";
> CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX
> ---+-+-+---+-
> 2 | | col1 | int | 0
> 2 | | col2 | int | 1
> 2 | | col3 | string | 2
>
>
> metastore=# select * from "TABLE_PARAMS";
> TBL_ID | PARAM_KEY | PARAM_VALUE
>
>
> +---+-
> ---
> 1 | spark.sql.sources.provider | parquet
> 1 | spark.sql.sources.schema.part.0 |
>
&g

Re: [ANNOUNCE] Apache Hive 2.3.9 Released

2021-06-10 Thread Mich Talebzadeh
Dear Announcer and the vibrant Hive community

I must congratulate Hive development and the user community for investing,
contributing and keeping a strong Apache Hive Data Warehouse on HDFS.

In July 2016, I made a presentation in conjunction with Hortonworks on the
topic of Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
<https://www.slideshare.net/MichTalebzadeh1/query-engines-for-hive-mr-spark-tez-with-llap-considerations>


That was on Hive 1.3.1. I am very glad to see Apache Hive
<https://en.wikipedia.org/wiki/Apache_Hive> going strong and indeed being so
popular after nearly 11 years from its first release. No doubt Hive is a
flagship product and it is an essential part of any storage and ETL work
with its versatile architecture in Big Data and Cloud world. Additionally,
it is probably the first artifact for many SQL savvy people to make
themselves familiar with Big Data.

Worth noting that Apache Hive has been a pioneer in supporting additional
Big Data tools like Apache Spark <https://spark.apache.org/> whose Spark
SQL <https://spark.apache.org/docs/latest/sql-programming-guide.html> is
largely built on top of Hive SQL. Google BigQuery Data Warehouse
<https://cloud.google.com/bigquery> is built largely from concepts
inherited from our humble Apache Hive. I often refer to BigQuery as Hive on
steroids. Even Databricks Data Lake
<https://databricks.com/glossary/data-lake> has a lot of concepts borrowed
from Apache Hive.

Looking forward to seeing many happy years with Apache Hive.





   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 10 Jun 2021 at 22:11, Chao Sun  wrote:

> The Apache Hive team is proud to announce the release of Apache Hive
> version 2.3.9.
>
> The Apache Hive (TM) data warehouse software facilitates querying and
> managing large datasets residing in distributed storage. Built on top of
> Apache Hadoop (TM), it provides, among others:
>
> * Tools to enable easy data extract/transform/load (ETL)
> * A mechanism to impose structure on a variety of data formats
> * Access to files stored either directly in Apache HDFS (TM) or in other
> data storage systems such as Apache HBase (TM)
> * Query execution via Apache Hadoop MapReduce, Apache Tez and Apache Spark
> frameworks.
>
> For Hive release details and downloads, please visit:
> https://hive.apache.org/downloads.html
> Hive 2.3.9 Release Notes are available here:
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12350009&styleName=Html&projectId=12310843
>
> We would like to thank the many contributors who made this release
> possible.
>
> Regards,
> The Apache Hive Team
>
>


Re: Is a Hive installation necessary for Spark SQL?

2021-04-25 Thread Mich Talebzadeh
Hi,

I don't know much about delta but your statement below

df.createOrReplaceTempView("myTable")
res = spark.sql("select * from myTable")


The so-called TempView is a reference to a hash table in memory. That is,
you are mapping your dataframe *df* to a hash table in memory and it is
transient, meaning that it is not persistent.


I have never heard of partition discovery or optimisation of SQL with a
tempView. Hive comes into play when you persist the data to disk.


df.createOrReplaceTempView("myTable")

Then you can either store it as below on Hive

from pyspark.sql import HiveContext
df.repartition(2).write.partitionBy("ID").mode("overwrite").saveAsTable("test.hiveTable")
-- test database in Hive has to exist


or


fullyQualifiedTableName="test.hiveTable"

sqltext  = ""

if (spark.sql("SHOW TABLES IN test like 'hiveTable'").count() == 1):

  rows = spark.sql(f"""SELECT COUNT(1) FROM
{fullyQualifiedTableName}""").collect()[0][0]

  print ("number of rows is ",rows)

else:

  print(f"""\nTable {fullyQualifiedTableName} does not exist, creating
table ")

  sqltext = f"""

 CREATE TABLE {fullyQualifiedTableName}(

   ID INT

 , CLUSTERED INT

 , SCATTERED INT

 , RANDOMISED INT

 , RANDOM_STRING VARCHAR(50)

 , SMALL_VC VARCHAR(50)

 , PADDING  VARCHAR(4000)

)

STORED AS PARQUET

"""

  spark.sql(sqltext)

sqltext = f"""

  INSERT INTO TABLE {fullyQualifiedTableName}

  SELECT

  ID

, CLUSTERED

, SCATTERED

, RANDOMISED

, RANDOM_STRING

, SMALL_VC

, PADDING

  FROM myTable

  """

spark.sql(sqltext)

In either case there will be a table in the Hive test database called
hiveTable. The advantage of Hive under the hood with Spark SQL is that
Spark-Hive has an efficient API that does not require a JDBC connection from
Spark to the underlying database in Hive. Other databases that support JDBC
connections will need to use the JDBC API from Spark:


import sys  # needed for sys.exit() in the error path below

def writeTableWithJDBC(dataFrame, url, tableName, user, password, driver,
mode):

try:

dataFrame. \

write. \

format("jdbc"). \

option("url", url). \

option("dbtable", tableName). \

option("user", user). \

option("password", password). \

option("driver", driver). \

mode(mode). \

save()

except Exception as e:

print(f"""{e}, quitting""")

sys.exit(1)
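

An illustrative call, with placeholder connection details and driver class:

writeTableWithJDBC(df, "jdbc:postgresql://dbhost:5432/testdb", "public.hive_copy",
                   "dbuser", "dbpassword", "org.postgresql.Driver", "overwrite")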


So in summary it is a good idea to keep Hive there.


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 25 Apr 2021 at 10:57, chia kang ren  wrote:

> Does it make sense to keep a Hive installation when your parquet files
> come with a transactional metadata layer like Delta Lake / Apache Iceberg?
>
> My understanding from this:
> https://github.com/delta-io/delta/issues/85
>
> is that Hive is no longer necessary in a Spark cluster other than
> discovering where the table is stored. Hence, we can simply do something
> like:
> ```
> df = spark.read.delta($LOCATION)
> df.createOrReplaceTempView("myTable")
> res = spark.sql("select * from myTable")
> ```
> and this approach still gets all the benefits of having the metadata for
> partition discovery / SQL optimization? With Delta, the Hive metastore
> should only store a pointer from the table name to the path of the table,
> and all other metadata will come from the Delta log, which will be
> processed in Spark.
>
> One reason i can think of keeping Hive is to keep track of other data
> sources that don't necessarily have a Delta / Iceberg transactional
> metadata layer. But i'm not sure if it's still worth it, are there any use
> cases i might have missed out on keeping a Hive installation after
> migrating to Delta / Iceberg?
>
> Please correct me if i've used any terms wrongly.
>
> On Sun, Apr 25, 2021 at 5:42 PM chia kang ren 
> wrote:
>
>> Does it make sense to keep a Hive installation when your parquet files
>> come with a transactional metadata layer like Delta Lake / Apache Iceberg?
>>
>> My understanding from this:
>> https://github.com/delta-io/delta/issues/85
>>
>> is that Hi

Re: [BUG] Hive 3.1.2 ALTER TABLE statement

2021-04-23 Thread Mich Talebzadeh
Hi Vinicius,

This is an external Hive table so the table is just a placeholder for data
locations.

Try setting:

set hive.msck.path.validation=ignore;
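
That setting is normally applied just before re-syncing partition metadata on
an external table; a hedged sketch, with a placeholder table name:

set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE mydb.my_external_table;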

Also worth checking this stackoverflow

HIVE Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask - Stack Overflow
<https://stackoverflow.com/questions/2075/hive-execution-error-return-code-1-from-org-apache-hadoop-hive-ql-exec-ddltask>

HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 23 Apr 2021 at 21:51, Vinícius Matheus Olivieri <
olivierivi...@gmail.com> wrote:

> Hi Mich! Nice to meet you and thanks for the answer!
>
> I don't know what exactly you mean by the size of the table and
> partition columns, but here is some relevant information about it:
>
> The table that we are trying to add the column to has 600,000 partitions
> and the total size of the partitions table in the metastore is 9,100,000.
>
> The partitions are divided by the following levels
>
> region=/dt=/country=
>
> Which are:
>
> region: 3
>
> dt: yyyy-mm-dd-HH ts string format that is divided around 2160
>
> Country: ~138
>
> The number that varies is countries per region, but multiplying the 3
> numbers above reaches the 600,000 partitions mentioned before.
>
> I tried to add a column to another table with 770,000 partitions and
> it didn't work either.
>
> Also, the tables in question are EXTERNAL, so AFAIK their size
> shouldn't affect the DDL, right?
>
> Trying to execute the DDL in a table, with 4,000 partitions in it, that is
> in the same schema works fine.
>
> I've tried what you suggested and it didn't work; the log is below. It seems
> that the timeout setting wasn't picked up by Hive, because in the error log we
> can see that it reports exceeding the 600s default that you've
> mentioned.
>
>
> Do you have a clue on what could have happened?
>
> ++--+
>
> |set |
>
> ++--+
>
> | hive.metastore.client.socket.timeout=3600  |
>
> ++--+
>
> 1 row selected (0,014 seconds)
>
> 0: jdbc:hive2://emr-prd-etl.jampp.com:1/d> ALTER TABLE schema.table
> ADD COLUMNS (column STRING);
>
> INFO  : Compiling
> command(queryId=hive_20210422131340_3163e6b9-1739-4c4f-8241-442c507e8e4f):
> ALTER TABLE schema.table ADD COLUMNS (column STRING)
>
> INFO  : Concurrency mode is disabled, not creating a lock manager
>
> INFO  : Semantic Analysis Completed (retrial = false)
>
> INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
>
> INFO  : Completed compiling
> command(queryId=hive_20210422131340_3163e6b9-1739-4c4f-8241-442c507e8e4f);
> Time taken: 0.101 seconds
>
> INFO  : Concurrency mode is disabled, not creating a lock manager
>
> INFO  : Executing
> command(queryId=hive_20210422131340_3163e6b9-1739-4c4f-8241-442c507e8e4f):
> ALTER TABLE schema.table ADD COLUMNS (column STRING)
>
> INFO  : Starting task [Stage-0:DDL] in serial mode
>
> ERROR : FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. Timeout when
> executing method: alter_table_with_environment_context; 1611675ms exceeds
> 60ms
>
> INFO  : Completed executing
> command(queryId=hive_20210422131340_3163e6b9-1739-4c4f-8241-442c507e8e4f);
> Time taken: 1617.779 seconds
>
> INFO  : Concurrency mode is disabled, not creating a lock manager
>
> Error: Error while processing statement: FAILED: Execution Error, return
> code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table.
> Timeout when executing method: alter_table_with_environment_context;
> 1611675ms exceeds 60ms (state=08S01,code=1)
>
> Thanks in advance,
>
> Vinícius
>
>
> On Tue, 20 Apr 2021 at 06:26, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> Just a small point, what you are doing is DDL not a query. You are
>> altering the table and adding a new column.
>>
>> First can you confirm that your table has 600,000 partitions!. If so,
>> what is the size of the table and the partition column?
>>
>> The process seems to be timing out on alter. table. Your table has too
>> 

Re: [BUG] Hive 3.1.2 ALTER TABLE statement

2021-04-20 Thread Mich Talebzadeh
Hi,

Just a small point, what you are doing is DDL not a query. You are altering
the table and adding a new column.

First, can you confirm that your table has 600,000 partitions! If so, what
is the size of the table and the partition column?

The process seems to be timing out on ALTER TABLE. If your table has too many
partitions, then the chance is that a timeout happens. You can check this
parameter

hive.metastore.client.socket.timeout

in hive-site.xml. In my case it is set to 600s:

  <property>
    <name>hive.metastore.client.socket.timeout</name>
    <value>600s</value>
    <description>Expects a time value with unit (d/day, h/hour, m/min, s/sec,
    ms/msec, us/usec, ns/nsec), which is sec if not specified.
    MetaStore Client socket timeout in seconds</description>
  </property>

and set it at the session level before running the ALTER TABLE command:

set hive.metastore.client.socket.timeout=600  -- or larger value
. . . . . . . . . . . . . . . . . . > No rows affected (0.002 seconds)

HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 19 Apr 2021 at 15:27, Vinícius Matheus Olivieri <
olivierivi...@gmail.com> wrote:

> Hey guys!
>
> I am using HIVE 3.1.2 in a recently updated cluster of EMR 6.0.0 hosted in
> AWS and I am experiencing some problems when trying to execute a simple
> query in hive.
>
> The query in question is the following:
>
> ALTER TABLE schema.table ADD COLUMNS (new_column STRING);
>
> The table that we are executing the query against has approximately 600k
> partitions.
>
> The version of Hive was updated recently to 3.1.2 as the whole package
> included in EMR 6.0.0
> <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html>.
>
> The curious thing is that when we were using Hive 2.3.6, the query worked
> with no worries or any hard work. So I searched if the version update
> changed something on the execution of an ALTER TABLE but I didn’t find
> anything relevant that could be the root cause of the problem.
>
> Could you guys help me see the light at the end of the tunnel?
>
>
> The log that is showed in the server side is the following:
>
> 2021-04-14T14:34:45,176 ERROR [HiveServer2-Background-Pool:
> Thread-221871([])]: exec.DDLTask (:()) - Failed
>
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table.
> java.net.SocketTimeoutException: Read timed out
>
> at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:721)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:698)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.hadoop.hive.ql.exec.DDLTask.alterTable(DDLTask.java:3966)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:455)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:205)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:97)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2664)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:2335)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:2011)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1709)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1703)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:157)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:224)
> ~[hive-service-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at
> org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:87)
> ~[hive-service-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:316)
> ~[hive-service-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at java.security.AccessController.doPrivileged(Native Method)
> ~[?:1.8.0_242]
>
> at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_242]
>
> at
> org.apache.hadoop.security.UserGroupInfo

Re: Migration of Hadoop Warehouse to newer versions lead to bad performance for JDBC-data-retrieval

2021-02-14 Thread Mich Talebzadeh
park also).
>
> But we have an OLAP tool which retrieves the data through JDBC.
>
> This has been working reliably for years on the old infrastructure,
>
> the poor performance with the new cluster block the migration to the new
> cluster.
>
>
> Thanks,
>
>
> Julien
>
>
>
>
>
> Julien Tane
> Big Data Engineer
>
> [image: Tel.] +49 721 98993-393
> [image: Fax] +49 721 98993-66
> [image: E-Mail] j...@solute.de
>
> solute GmbH
> Zeppelinstraße 15
> 76185 Karlsruhe
> Germany
>
>
> [image: Logo Solute]
>
> Marken der solute GmbH | brands of solute GmbH
> [image: Marken]
> Geschäftsführer | Managing Director: Dr. Thilo Gans, Bernd Vermaaten
> Webseite | www.solute.de
> Sitz | Registered Office: Karlsruhe
> Registergericht | Register Court: Amtsgericht Mannheim
> Registernummer | Register No.: HRB 110579
> USt-ID | VAT ID: DE234663798
>
> *Informationen zum Datenschutz | Information about privacy policy*
> https://www.solute.de/ger/datenschutz/grundsaetze-der-datenverarbeitung.php
>
>
>
> --
> *From:* Mich Talebzadeh 
> *Sent:* Saturday, 13 February 2021 10:09:10
> *To:* user
> *Subject:* Re: Migration of Hadoop Warehouse to newer versions lead to
> bad performance for JDBC-data-retrieval
>
> Hi Julien,
>
> I assume you mean you are using JDBC drivers to retrieve from the source
> table in Hive (older version) to the target table in Hive (newer version).
>
> 1) what JDBC drivers are you using?
> 2) Are these environments kerberized in both cases?
> 3) Have you considered other JDBC drivers for Hive, Example,
>
>  hive_driver: "org.apache.hive.jdbc.HiveDriver"   ## default
> hive_driver: com.cloudera.hive.jdbc41.HS2Driver ## Cloudera
> hive_driver: "com.ddtek.jdbcx.hive.HiveDataSource" ## Progress direct
> hive_driver: "com.ddtek.jdbc.hive.HiveDriver" ## Progress direct
>
> I think besides JDBC there may be other disk read issues (for which you can
> get stats from Unix tools like iostat, etc.).
>
> Another thing is that you try to read from the same table in both old and
> new, find the timings and compare reads.
>
> If the issue is throughput of writes through JDBC then you can test
> another driver for it.
>
> As a matter of interest are you doing all these through Java, Python etc
> interface?
>
> HTH
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 12 Feb 2021 at 15:10, Julien Tane  wrote:
>
>> Dear all,
>>
>> we are in the process of switching from our old cluster with HDP 2.5:
>>
>> HDFS 2.7.3
>> YARN 2.7.3
>> Tez 0.7.0
>> Hive 1.2.1000
>> to a new cluster with HDP 3.1:
>>
>> HDFS 3.1.1.3.1
>> YARN 3.1.0
>> HIVE 3.0.0.3.1
>> Tez 0.9.0.3.1
>>
>> We (1st) query and (2nd) retrieve data from table_0 from old cluster to 
>> target-machine machine_a.
>> We (1st) query and (2nd) retrieve data from table_1 from new cluster to same 
>> target-machine machine_a.
>> table_0 and table_1 are defined in the exact same way (partitioned by 
>> t_date) and hold the exact same data.
>>
>> The querying from table_0 on old cluster and the querying from table_1 from 
>> the new cluster show the same performance. All is good so far.
>>
>> After the query is processed and data is ready to be retrieved, we start 
>> data retrieval with JDBC-driver. The data-retrieval-performance from the old 
>> cluster is ca. 40'000 rows/sec whereas the data-retrieval-performance from 
>> the new cluster is ca. 20'000 rows/sec. This big performance decrease is a 
>> problem!
>>
>> Things we tried:
>> - Made sure that there's no bandwidth-issue with the new version.
>> - We tried downloading and uploading from and to HDFS on both old and 
>> new cluster using HDFScli. We observed a difference in 
>> data-transfer-performance with new cluster being a factor ca. 1.5x slower 
>> than old cluster.
>> - We made following observations while experimenting:
>> When we filled the table only with 3 days worth of data, then the 
>> n

Re: Migration of Hadoop Warehouse to newer versions lead to bad performance for JDBC-data-retrieval

2021-02-13 Thread Mich Talebzadeh
Hi Julien,

I assume you mean you are using JDBC drivers to retrieve from the source
table in Hive (older version) to the target table in Hive (newer version).

1) what JDBC drivers are you using?
2) Are these environments kerberized in both cases?
3) Have you considered other JDBC drivers for Hive, Example,

 hive_driver: "org.apache.hive.jdbc.HiveDriver"   ## default
hive_driver: com.cloudera.hive.jdbc41.HS2Driver ## Cloudera
hive_driver: "com.ddtek.jdbcx.hive.HiveDataSource" ## Progress direct
hive_driver: "com.ddtek.jdbc.hive.HiveDriver" ## Progress direct

I think besides JDBC there may be other disk read issues (for which you can
get stats from Unix tools like iostat, etc.).

Another thing is that you try to read from the same table in both old and
new, find the timings and compare reads.

If the issue is throughput of writes through JDBC then you can test another
driver for it.
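
Another knob worth checking, purely as a suggestion, is the result-set fetch
size HiveServer2 uses per fetch call (and, on the client side, the standard
JDBC Statement.setFetchSize()); the value below is only illustrative and the
property name should be verified against your Hive version:

set hive.server2.thrift.resultset.default.fetch.size=10000;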

As a matter of interest are you doing all these through Java, Python etc
interface?

HTH



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 12 Feb 2021 at 15:10, Julien Tane  wrote:

> Dear all,
>
> we are in the process of switching from our old cluster with HDP 2.5:
>
> HDFS 2.7.3
> YARN 2.7.3
> Tez 0.7.0
> Hive 1.2.1000
> to a new cluster with HDP 3.1:
>
> HDFS 3.1.1.3.1
> YARN 3.1.0
> HIVE 3.0.0.3.1
> Tez 0.9.0.3.1
>
> We (1st) query and (2nd) retrieve data from table_0 from old cluster to 
> target-machine machine_a.
> We (1st) query and (2nd) retrieve data from table_1 from new cluster to same 
> target-machine machine_a.
> table_0 and table_1 are defined in the exact same way (partitioned by t_date) 
> and hold the exact same data.
>
> The querying from table_0 on old cluster and the querying from table_1 from 
> the new cluster show the same performance. All is good so far.
>
> After the query is processed and data is ready to be retrieved, we start data 
> retrieval with JDBC-driver. The data-retrieval-performance from the old 
> cluster is ca. 40'000 rows/sec whereas the data-retrieval-performance from 
> the new cluster is ca. 20'000 rows/sec. This big performance decrease is a 
> problem!
>
> Things we tried:
> - Made sure that there's no bandwidth-issue with the new version.
> - We tried downloading and uploading from and to HDFS on both old and new 
> cluster using HDFScli. We observed a difference in data-transfer-performance 
> with new cluster being a factor ca. 1.5x slower than old cluster.
> - We made following observations while experimenting:
> When we filled the table only with 3 days worth of data, then the new 
> cluster loaded faster than the old one.
>
> When we filled the table with 2 years worth of data and selected in 
> the SQL statement only 3 days, then the new cluster loaded slower than the 
> old one.
>
> The old cluster loaded with the same speed each time (regardless of 
> number of days) whereas the new cluster changed from 25.000 to 42.000 rows/s 
> for low number of days.
>
> So it seems that if number of partitions increases, the 
> data-retrieval-performance from the new cluster decreases whereas the 
> data-retrieval-performance from the old cluster stays approx. the same.
>
> Questions:
> q1) Do you have an idea about what this low data-retrieval-performance 
> could be caused by?
> q2) How do we use the Hive Logging/Debug Infrastructure to find out what 
> the throughput of the rows are?
> q3) How do we use the HDFS Logging/Debug Infrastructure to find out what 
> the throughput of the rows are?
> q4) What are the parameters and settings we could use to make sure the 
> data-retrieval-performance is (as) high (as possible)?
> q5) Could the garbage collector be slowing down the data-retrieval to 
> this extent? How can we find out?
>
> Looking forward to your ideas,
> Julien Tane
>
>
>
> Julien Tane
> Big Data Engineer
>
> [image: Tel.] +49 721 98993-393
> [image: Fax] +49 721 98993-66
> [image: E-Mail] j...@solute.de
>
> solute GmbH
> Zeppelinstraße 15
> 76185 Karlsruhe
> Germany
>
>
> [image: Logo Solute]
>
> Marken der solute GmbH | brands of solute GmbH
> [image: Marken]
> Geschäftsführer | Managing Director: Dr. Thilo Gans, Bernd Vermaaten
> Webseite | www.solute.de
> Sitz | Registered Office: Karlsruhe
> Registergericht | Register Court: Amtsgericht Mannheim
> Registernummer | Register No.: HRB 110579
> USt-ID | VAT ID: DE234663798
>
> *Informationen zum Datenschutz | Information about privacy policy*
> 

Re: Is Insert Overwrite table partition on s3 is an atomic operation ?

2021-01-11 Thread Mich Talebzadeh
Hi Mark,

By atomic operation I gather you mean INSERT/OVERWRITE affects that
partition only?

According to my somewhat dated scripts, yes you can do that. The idea being
that you only want to overwrite data for that partition ONLY.

--show create table marketData;
--Populate target table
select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS
StartTime;
INSERT OVERWRITE TABLE ${DATABASE}.MARKETDATA PARTITION (DateStamp =
"${TODAY}")
SELECT
  KEY
, TICKER
, TIMECREATED
, PRICE
, 1
, CAST(from_unixtime(unix_timestamp()) AS timestamp)
FROM ${DATABASE}.EXTERNALMARKETDATA

So this basically loads data into the Hive partitioned table from an
external Hive table populated by Flume. It overwrites data for *today's
created partition regardless*. The external table has one partition added
daily:

ALTER TABLE ${DATABASE}.EXTERNALMARKETDATA set location
'hdfs://rhes564:9000/data/prices/${TODAY}';
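
As a quick sanity check (a sketch reusing the same variables), you can list
the partitions and count today's rows before and after the load, to confirm
that only that partition was rewritten:

SHOW PARTITIONS ${DATABASE}.MARKETDATA;
SELECT COUNT(*) FROM ${DATABASE}.MARKETDATA WHERE DateStamp = "${TODAY}";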

HTH




LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 11 Jan 2021 at 16:45, Mark Norkin  wrote:

> Hello Hive users,
>
> We are using AWS Glue as Hive compatible metastore when running queries on
> EMR. For Hive external tables we are using AWS S3.
>
> After looking at the docs we didn't find a conclusive answer on whether an
> Insert Overwrite table partition is an atomic operation, maybe we've missed
> it and it is documented somewhere, or maybe someone knows from their
> experience?
>
> If it's an atomic operation, is there any difference whether the table is
> external or a managed one?
>
> Thank you,
>
> Mark
>


Re: How useful are tools for Hive data modeling

2020-11-11 Thread Mich Talebzadeh
Many thanks Austin.

The challenge, I have been told, is how to effectively query a subset of data
while avoiding a full table scan. The tables, I believe, are Parquet.

I know performance in Hive is not that great, so anything that could help
would be great.
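
As a sketch of the usual approach (the table and column names below are
hypothetical): partition the Parquet table on the column most queries filter
on, so that a filter on that column prunes to a few partitions instead of
scanning the whole table:

CREATE TABLE sales_parquet (
  id     BIGINT,
  amount DOUBLE
)
PARTITIONED BY (sale_date DATE)
STORED AS PARQUET;

-- this reads only the matching partition, not the full table
SELECT SUM(amount)
FROM   sales_parquet
WHERE  sale_date = '2020-11-01';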

Cheers,



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 11 Nov 2020 at 19:32, Austin Hackett  wrote:

> Hi Mich
>
> Hive also has non-validated primary key, foreign key etc constraints.
> Whilst I’m not too familiar with the modelling tools you mention, perhaps
> they’re able to use these for generating SQL etc?
>
> ORC files have indexes (min, max, bloom filters) - not particularly
> relevant to the data modelling tools question, but mentioning it for
> completeness…
>
> Thanks
>
> Austin
>
>
> On 11 Nov 2020, at 17:14, Mich Talebzadeh 
> wrote:
>
> Many thanks Peter.
>
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 11 Nov 2020 at 16:58, Peter Vary  wrote:
>
>> Hi Mich,
>>
>> Index support was removed from hive:
>>
>>- https://issues.apache.org/jira/browse/HIVE-21968
>>- https://issues.apache.org/jira/browse/HIVE-18715
>>
>>
>> Thanks,
>> Peter
>>
>> On Nov 11, 2020, at 17:25, Mich Talebzadeh 
>> wrote:
>>
>> Hi all,
>>
>> I wrote these notes earlier this year.
>>
>> I heard today that someone mentioned Hive 1 does not support indexes but
>> hive 2 does.
>>
>> I still believe that Hive does not support indexing as per below. Has
>> this been changed?
>>
>> Regards,
>>
>> Mich
>>
>> -- Forwarded message -
>> From: Mich Talebzadeh 
>> Date: Thu, 2 Apr 2020 at 12:17
>> Subject: How useful are tools for Hive data modeling
>> To: user 
>>
>>
>> Hi,
>>
>> Fundamentally Hive tables have structure and support provided by desc
>> formatted <table> and show partitions <table>.
>>
>> Hive does not support indexes in real HQL operations (I stand to be
>> corrected). So what we have are tables, partitions and clustering (AKA
>> hash partitioning).
>>
>> Hive does not support indexes because Hadoop lacks the block locality
>> necessary for indexes. So if I use a tool like Collibra, Ab Initio etc,
>> what advantage(s) is one going to gain over a simple shell script to get
>> table and partition definitions?
>>
>> Thanks,
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>>
>


Re: How useful are tools for Hive data modeling

2020-11-11 Thread Mich Talebzadeh
Many thanks Peter.




LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 11 Nov 2020 at 16:58, Peter Vary  wrote:

> Hi Mich,
>
> Index support was removed from hive:
>
>- https://issues.apache.org/jira/browse/HIVE-21968
>- https://issues.apache.org/jira/browse/HIVE-18715
>
>
> Thanks,
> Peter
>
> On Nov 11, 2020, at 17:25, Mich Talebzadeh 
> wrote:
>
> Hi all,
>
> I wrote these notes earlier this year.
>
> I heard today that someone mentioned Hive 1 does not support indexes but
> hive 2 does.
>
> I still believe that Hive does not support indexing as per below. Has this
> been changed?
>
> Regards,
>
> Mich
>
> -- Forwarded message -
> From: Mich Talebzadeh 
> Date: Thu, 2 Apr 2020 at 12:17
> Subject: How useful are tools for Hive data modeling
> To: user 
>
>
> Hi,
>
> Fundamentally Hive tables have structure and support provided by desc
> formatted <table> and show partitions <table>.
>
> Hive does not support indexes in real HQL operations (I stand to be
> corrected). So what we have are tables, partitions and clustering (AKA hash
> partitioning).
>
> Hive does not support indexes because Hadoop lacks the block locality
> necessary for indexes. So if I use a tool like Collibra, Ab Initio etc, what
> advantage(s) is one going to gain over a simple shell script to get table
> and partition definitions?
>
> Thanks,
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>


Fwd: How useful are tools for Hive data modeling

2020-11-11 Thread Mich Talebzadeh
Hi all,


I wrote these notes earlier this year.


I heard today that someone mentioned Hive 1 does not support indexes but
hive 2 does.


I still believe that Hive does not support indexing as per below. Has this
been changed?


Regards,


Mich

-- Forwarded message -
From: Mich Talebzadeh 
Date: Thu, 2 Apr 2020 at 12:17
Subject: How useful are tools for Hive data modeling
To: user 


Hi,

Fundamentally Hive tables have structure and support provided by desc
formatted <table> and show partitions <table>.

Hive does not support indexes in real HQL operations (I stand to be
corrected). So what we have are tables, partitions and clustering (AKA hash
partitioning).

Hive does not support indexes because Hadoop lacks the block locality
necessary for indexes. So if I use a tool like Collibra, Ab Initio etc, what
advantage(s) is one going to gain over a simple shell script to get table
and partition definitions?

Thanks,


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: what does MM stand for

2020-11-06 Thread Mich Talebzadeh
MM -> Micro Managed

Hive managed tables supporting insert-only operations with ACID semantics
are called MM (Micro-Managed) or Insert-Only ACID tables. They support all
file formats.

From, say:

https://docs.cloudera.com/HDPDocuments/DLM1/DLM-1.5.1/administration/content/dlm_acid_tables_replication.html
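
A minimal sketch of creating such a table (the table and column names are
hypothetical):

CREATE TABLE events_mm (
  id      INT,
  payload STRING
)
STORED AS PARQUET
TBLPROPERTIES (
  'transactional'='true',
  'transactional_properties'='insert_only'
);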

HTH




LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 6 Nov 2020 at 10:56, Gabriel Balan  wrote:

> Hello
>
> Apologies for the silly question.
>
> What you say "MM tables", what does MM stand for?
>
> HIVE_MM_ALLOW_ORIGINALS("hive.mm.allow.originals", false,
> "Whether to allow original files in MM tables. Conversion to MM
> may be expensive if\n" +
> "this is set to false, however unless MAPREDUCE-7086 fix is
> present (hadoop 3.1.1+),\n" +
> "queries that read non-orc *MM tables* with original files will
> fail. The default in\n" +
> "Hive 3.0 is false."),
>
> thank you
>
> Gabriel Balan
>
> P.S. And since I'm already showing my ignorance, I've also seen these
> mentioned:
>
>- DP tables
>
> 
>(dynamic partitions?)
>- SMB table  (sort
>merge bucket?)
>
>
>
> --
> The statements and opinions expressed here are my own and do not necessarily 
> represent those of Oracle Corporation.
>
>


Re: SQL CTAS query failed on compilation stage

2020-11-03 Thread Mich Talebzadeh
ok fine.



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 3 Nov 2020 at 14:51, Bartek Kotwica  wrote:

> I understand, but it looks strange as a query without the "create table"
> clause works. Obviously I use the workaround, but I think Hive as an
> application should be more predictable in interaction, so I created a JIRA
> for the issue.
>
> https://issues.apache.org/jira/browse/HIVE-24352
>
> Regards,
> Bartosz Kotwica
>
> wt., 3 lis 2020 o 11:57 Mich Talebzadeh 
> napisał(a):
>
>> well you have to be pragmatic. That may well be a bug due to Hive,
>> especially it says  "Also check for circular dependencies"
>>
>> you can raise a JIRA but not sure about its priority as you have a
>> work-around
>>
>> HTH
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 3 Nov 2020 at 10:46, Bartek Kotwica  wrote:
>>
>>> Hi Mich,
>>> Thank you for the reply! Creating a stage table works well, a problem
>>> comes up when CTE or subquery in from clause is used.
>>>
>>> wt., 3 lis 2020 o 10:45 Mich Talebzadeh 
>>> napisał(a):
>>>
>>>> Hm,
>>>>
>>>> Hi Bartosz,
>>>>
>>>> Can you create a temporary table with your sub-query and see it works?
>>>>
>>>> create temporary table tab2 as ...
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, 3 Nov 2020 at 09:28, Bartek Kotwica  wrote:
>>>>
>>>>> Hi!
>>>>> I use Hive 3.1.0 and beeline.
>>>>> I have encountered a compilation error when issue a CTAS query from
>>>>> beeline, but without "create table" query works as expected,* narrowed 
>>>>> query
>>>>> to reproduce:*
>>>>>
>>>>> create table tab_error as
>>>>> with tab2 as (
>>>>> select
>>>>> id,
>>>>> lead(id) over (partition by id order by id) as x
>>>>> from
>>>>> (select 1 id) a
>>>>> )
>>>>> select
>>>>>   lead(x) over (partition by id order by id) = 1
>>>>> from
>>>>> tab2
>>>>> ;
>>>>>
>>>>> *ERROR:*
>>>>> Error: Error while compiling statement: FAILED: SemanticException
>>>>> Failed to breakup Windowing invocations into Groups. At least 1 group must
>>>>> only depend on input columns. Also check for circular dependencies.
>>>>> Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException:
>>>>> Line 0:-1 Invalid column reference '1': (possible column names are: )
>>>>> (state=42000,code=4)
>>>>>
>>>>> Please confirm the issue then I will create a Jira ticket.
>>>>>
>>>>> Kind regards,
>>>>> Bartosz Kotwica
>>>>>
>>>>


Re: SQL CTAS query failed on compilation stage

2020-11-03 Thread Mich Talebzadeh
Well, you have to be pragmatic. That may well be a bug in Hive, especially
as it says "Also check for circular dependencies".

You can raise a JIRA, but I am not sure about its priority as you have a
work-around.

HTH



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 3 Nov 2020 at 10:46, Bartek Kotwica  wrote:

> Hi Mich,
> Thank you for the reply! Creating a stage table works well, a problem
> comes up when CTE or subquery in from clause is used.
>
> wt., 3 lis 2020 o 10:45 Mich Talebzadeh 
> napisał(a):
>
>> Hm,
>>
>> Hi Bartosz,
>>
>> Can you create a temporary table with your sub-query and see it works?
>>
>> create temporary table tab2 as ...
>>
>> HTH
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 3 Nov 2020 at 09:28, Bartek Kotwica  wrote:
>>
>>> Hi!
>>> I use Hive 3.1.0 and beeline.
>>> I have encountered a compilation error when issue a CTAS query from
>>> beeline, but without "create table" query works as expected,* narrowed query
>>> to reproduce:*
>>>
>>> create table tab_error as
>>> with tab2 as (
>>> select
>>> id,
>>> lead(id) over (partition by id order by id) as x
>>> from
>>> (select 1 id) a
>>> )
>>> select
>>>   lead(x) over (partition by id order by id) = 1
>>> from
>>> tab2
>>> ;
>>>
>>> *ERROR:*
>>> Error: Error while compiling statement: FAILED: SemanticException Failed
>>> to breakup Windowing invocations into Groups. At least 1 group must only
>>> depend on input columns. Also check for circular dependencies.
>>> Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException:
>>> Line 0:-1 Invalid column reference '1': (possible column names are: )
>>> (state=42000,code=4)
>>>
>>> Please confirm the issue then I will create a Jira ticket.
>>>
>>> Kind regards,
>>> Bartosz Kotwica
>>>
>>


Re: SQL CTAS query failed on compilation stage

2020-11-03 Thread Mich Talebzadeh
Hm,

Hi Bartosz,

Can you create a temporary table with your sub-query and see it works?

create temporary table tab2 as ...
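
Spelled out against the query you posted, the suggested workaround would look
roughly like this (a sketch only; it sidesteps rather than fixes the issue):

create temporary table tab2 as
select
    id,
    lead(id) over (partition by id order by id) as x
from
    (select 1 id) a;

create table tab_ok as
select
    lead(x) over (partition by id order by id) = 1 as flag
from
    tab2;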

HTH



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 3 Nov 2020 at 09:28, Bartek Kotwica  wrote:

> Hi!
> I use Hive 3.1.0 and beeline.
> I have encountered a compilation error when issue a CTAS query from
> beeline, but without "create table" query works as expected,* narrowed query
> to reproduce:*
>
> create table tab_error as
> with tab2 as (
> select
> id,
> lead(id) over (partition by id order by id) as x
> from
> (select 1 id) a
> )
> select
>   lead(x) over (partition by id order by id) = 1
> from
> tab2
> ;
>
> *ERROR:*
> Error: Error while compiling statement: FAILED: SemanticException Failed
> to breakup Windowing invocations into Groups. At least 1 group must only
> depend on input columns. Also check for circular dependencies.
> Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line
> 0:-1 Invalid column reference '1': (possible column names are: )
> (state=42000,code=4)
>
> Please confirm the issue then I will create a Jira ticket.
>
> Kind regards,
> Bartosz Kotwica
>


Re: Hive using Spark engine vs native spark with hive integration.

2020-10-06 Thread Mich Talebzadeh
Hi Manu,

In the past (July 2016), I made a presentation in London, organised by the
then Hortonworks, titled "Query Engines for Hive: MR, Spark, Tez with
LLAP – Considerations!"

The PDF presentation is here.
With the caveat that it was more than 4 years ago!

However, as of today I would recommend writing the code in Spark with Scala
and running against Spark. You can try it using spark-shell to start with.

If you are reading from Hive table or any other source like CSV etc, there
are plenty of examples in Spark web https://spark.apache.org/examples.html

Also I suggest that you use Scala as Spark itself is written in Scala
(though Python is more popular with Data Science guys).

HTH



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 6 Oct 2020 at 16:47, Manu Jacob  wrote:

> Hi All,
>
>
>
> Not sure if I need to ask this question on hive community or spark
> community.
>
>
>
> We have a set of hive scripts that runs on EMR (Tez engine). We would like
> to experiment by moving some of it onto Spark. We are planning to
> experiment with two options.
>
>
>1. Use the current code based on HQL, with engine set as spark.
>2. Write pure spark code in scala/python using SparkQL and hive
>integration.
>
>
>
> The first approach helps us to transition to Spark quickly but not sure if
> this is the best approach in terms of performance.  Could not find any
> reasonable comparisons of this two approaches.  It looks like writing pure
> Spark code, gives us more control to add logic and also control some of the
> performance features, for example things like caching/evicting etc.
>
>
>
>
>
> Any advise on this is much appreciated.
>
>
>
>
>
> Thanks,
>
> -Manu
>


Re: Measuring the execution time of Hive queries through Ambari

2020-06-30 Thread Mich Talebzadeh
Many thanks to all. We will consider these options one by one.

Regards,
.




LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 30 Jun 2020 at 16:35, Julien Tane  wrote:

> Hi Mich,
>
> Again, Ambari is only a cluster management framework, not a complete GUI.
> It can have plugins, such as views.
>
>
> On Ambari you can have different stacks, which correspond to the available
> services for a given stack, and which you can install, start and stop.
>
>
> First of all, check which stack is installed (Manage Ambari > Versions has
> a list of stacks).
>
> On HDP 3.1, Tez is there and tez-ui is available in the code, but the Tez
> UI (Ambari view) is not supported by Ambari anymore.
>
>
> What we did is that we installed a Tomcat, and in this Tomcat we deployed
> the tez-ui WAR. We then set the right values in the YARN configuration...
>
> And it worked.
>
>
> Under Manage Ambari >Views, you will find the Views which are installed.
> for instance we have:
>
>
>
>
>
> Julien Tane
> Big Data Engineer
>
> [image: Tel.] +49 721 98993-393
> [image: Fax] +49 721 98993-66
> [image: E-Mail] j...@solute.de
>
> solute GmbH
> Zeppelinstraße 15
> 76185 Karlsruhe
> Germany
>
>
> [image: Logo Solute]
>
> Marken der solute GmbH | brands of solute GmbH
> [image: Marken]
> Geschäftsführer | Managing Director: Dr. Thilo Gans, Bernd Vermaaten
> Webseite | www.solute.de
> Sitz | Registered Office: Karlsruhe
> Registergericht | Register Court: Amtsgericht Mannheim
> Registernummer | Register No.: HRB 110579
> USt-ID | VAT ID: DE234663798
>
> *Informationen zum Datenschutz | Information about privacy policy*
> https://www.solute.de/ger/datenschutz/grundsaetze-der-datenverarbeitung.php
>
>
>
> --
> *Von:* Mich Talebzadeh 
> *Gesendet:* Montag, 22. Juni 2020 14:23:51
> *An:* user
> *Betreff:* Re: Measuring the execution time of Hive queries through Ambari
>
> Hi  Julien.
>
> It is as I see is standard Ambari. Has TEZ UI but when I run the query and
> check TEZ UI it says TEZ view is not deployed!
>
> Thanks
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 22 Jun 2020 at 13:10, Julien Tane  wrote:
>
>> Mich,
>>
>>
>> When you say that you are using Ambari to connect to Hive, what do you
>> mean by that?
>>
>> Unless you added a view in ambari to perform query (as far as I know, not
>> in the vanilla ambari)
>>
>>
>> One thing you could more or less do is use the tez.ui (assuming you are
>> using tez)
>>
>> but here again this is not in the standard ambari (at least not the newer
>> versions)
>>
>> one other possibility (depending on how you configured yarn) would be to
>> use the yarn ui
>>
>> which should be accessible in the Yarn Tab from your ambari... But here,
>> it kinds of depends
>>
>> on how you configured your system.
>>
>>
>> Kind Regards,
>>
>>
>> J
>>
>>
>>
>>
>>
>> Julien Tane
>> Big Data Engineer
>>
>> [image: Tel.] +49 721 98993-393
>> [image: Fax] +49 721 98993-66
>> [image: E-Mail] j...@solute.de
>>
>> solute GmbH
>> Zeppelinstraße 15
>> 76185 Karlsruhe
>> Germany
>>
>>
>> [image: Logo Solute]
>>
>> Marken der solute GmbH | brands of solute GmbH
>> [image: Marken]
>> Geschäftsführer | Managing Director: Dr. Thilo Gans, Bernd Vermaaten
>> Webseite | www.solute.de
>> Sitz | Registered Office: Karlsruhe
>> Registergericht | Register Court: Amtsgericht Mannheim
>> Registernummer | Register No.: HRB 11

Re: Measuring the execution time of Hive queries through Ambari

2020-06-22 Thread Mich Talebzadeh
Hi  Julien.

It is, as far as I can see, standard Ambari. It has the Tez UI, but when I
run the query and check the Tez UI it says the Tez view is not deployed!

Thanks



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 22 Jun 2020 at 13:10, Julien Tane  wrote:

> Mich,
>
>
> When you say that you are using Ambari to connect to Hive, what do you
> mean by that?
>
> Unless you added a view in ambari to perform query (as far as I know, not
> in the vanilla ambari)
>
>
> One thing you could more or less do is use the tez.ui (assuming you are
> using tez)
>
> but here again this is not in the standard ambari (at least not the newer
> versions)
>
> one other possibility (depending on how you configured yarn) would be to
> use the yarn ui
>
> which should be accessible in the Yarn Tab from your ambari... But here,
> it kinds of depends
>
> on how you configured your system.
>
>
> Kind Regards,
>
>
> J
>
>
>
>
>
> Julien Tane
> Big Data Engineer
>
> [image: Tel.] +49 721 98993-393
> [image: Fax] +49 721 98993-66
> [image: E-Mail] j...@solute.de
>
> solute GmbH
> Zeppelinstraße 15
> 76185 Karlsruhe
> Germany
>
>
> [image: Logo Solute]
>
> Marken der solute GmbH | brands of solute GmbH
> [image: Marken]
> Geschäftsführer | Managing Director: Dr. Thilo Gans, Bernd Vermaaten
> Webseite | www.solute.de
> Sitz | Registered Office: Karlsruhe
> Registergericht | Register Court: Amtsgericht Mannheim
> Registernummer | Register No.: HRB 110579
> USt-ID | VAT ID: DE234663798
>
> *Informationen zum Datenschutz | Information about privacy policy*
> https://www.solute.de/ger/datenschutz/grundsaetze-der-datenverarbeitung.php
>
>
>
> --
> *Von:* Mich Talebzadeh 
> *Gesendet:* Montag, 22. Juni 2020 12:57:27
> *An:* user
> *Betreff:* Measuring the execution time of Hive queries through Ambari
>
> Hi,
>
> Using Ambari to connect to Hive, is there any way of measuring the query
> time?
>
> Please be aware that this is through Ambari not through beeline etc.
>
> The tool we have at the moment is Ambari to Prod.
>
> We do not have any other luxury!
>
> Thanks
>
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Measuring the execution time of Hive queries through Ambari

2020-06-22 Thread Mich Talebzadeh
Hi,

Using Ambari to connect to Hive, is there any way of measuring the query
time?

Please be aware that this is through Ambari not through beeline etc.

The only tool we have at the moment is Ambari access to Prod.

We do not have any other luxury!

Thanks




LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Running Hive queries from Ambari or from edge node via beeline

2020-06-16 Thread Mich Talebzadeh
many thanks Julien much appreciated.




LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 15 Jun 2020 at 22:09, Julien Tane  wrote:

> ambari is not a GUI interface to Hive... It is an Hadoop cluster managment
> tool.
>
>
> If you need a commandline compatible interface, you can use for instance
> using beeline.
>
>
> you can use a JDBC based GUI (like dbeaver) to access the data as long as
> the port is accessible.
>
>
> In older versions of ambari you could add some ambari views which had a
> query interface,
>
> though I do not know which work with the current versions of ambari.
>
>
> Kind Regards,
>
> J
>
>
> Julien Tane
> Big Data Engineer
>
> [image: Tel.] +49 721 98993-393
> [image: Fax] +49 721 98993-66
> [image: E-Mail] j...@solute.de
>
> solute GmbH
> Zeppelinstraße 15
> 76185 Karlsruhe
> Germany
>
>
> [image: Logo Solute]
>
> Marken der solute GmbH | brands of solute GmbH
> [image: Marken]
> Geschäftsführer | Managing Director: Dr. Thilo Gans, Bernd Vermaaten
> Webseite | www.solute.de
> Sitz | Registered Office: Karlsruhe
> Registergericht | Register Court: Amtsgericht Mannheim
> Registernummer | Register No.: HRB 110579
> USt-ID | VAT ID: DE234663798
>
> *Informationen zum Datenschutz | Information about privacy policy*
> https://www.solute.de/ger/datenschutz/grundsaetze-der-datenverarbeitung.php
>
>
>
> --
> *Von:* Mich Talebzadeh 
> *Gesendet:* Montag, 15. Juni 2020 21:26:20
> *An:* user
> *Betreff:* Running Hive queries from Ambari or from edge node via beeline
>
> Hi,
>
> I am not a user of Ambari but I believe it is a GUI interface to Hive. it
> can be run on your laptop and connect to Hive via ODBC or JDBC.
>
> There is also another tool DB Visualizer Pro that uses JDBC to connect to
> Hive thrift server.
>
> My view is that if one is a developer, the best bet would be to have
> access to the edge node and connect through beeline (Hive thrift server).
> This could be through PuTTY or Tectia SSH SecureShell, but crucially, since
> one is running on the Hadoop cluster (the edge node is part of the cluster
> on the same VLAN), the performance is expected to be better.
>
> also both Tectia SSH and Putty are thin clients so you are effectively
> running the code on the edge node as opposed through the client-server.
>
> Does this make sense?
>
> Thanks
>
>
>
> Mich
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Running Hive queries from Ambari or from edge node via beeline

2020-06-15 Thread Mich Talebzadeh
Hi,

I am not a user of Ambari but I believe it is a GUI interface to Hive. It
can be run on your laptop and connect to Hive via ODBC or JDBC.

There is also another tool DB Visualizer Pro that uses JDBC to connect to
Hive thrift server.

My view is that if one is a developer, the best bet would be to have access
to the edge node and connect through beeline (Hive thrift server). This
could be through PuTTY or Tectia SSH SecureShell, but crucially, since one
is running on the Hadoop cluster (the edge node is part of the cluster on
the same VLAN), the performance is expected to be better.

Also, both Tectia SSH and PuTTY are thin clients, so you are effectively
running the code on the edge node as opposed to going through client-server.

Does this make sense?

Thanks



Mich



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: create transactional table issue

2020-05-17 Thread Mich Talebzadeh
In Hive 3.1.1 the thread owner table creation works fine

 hive --version
Hive 3.1.1

0: jdbc:hive2://rhes75:10099/default> create table test.dllm ( b string )
partitioned by (a int) clustered by (b) into 2 buckets stored as orc
tblproperties('transactional'='true')
. . . . . . . . . . . . . . . . . . > No rows affected (0.246 seconds)

HTH

[image: image.png]



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 17 May 2020 at 17:30, Shawn Weeks  wrote:

> The Hive Streaming API has changed in Hive 3.x and that might have
> something to do with it I’m not sure as I’ve never used Flume. Are you
> using a vendor distribution of Hive 3.x or are you rolling your own?
>
>
>
> What does “show create table your_table;” show? It is possible to turn off
> acid if your setting Hive up manually.
>
>
>
> Thanks
>
>
>
> *From: *Huang Chiming 
> *Reply-To: *"user@hive.apache.org" 
> *Date: *Saturday, May 16, 2020 at 11:09 AM
> *To: *"user@hive.apache.org" 
> *Subject: *Re: create transactional table issue
>
>
>
> Hi @Shawn,
>
>
>
> Can tblproperties('transactional'='true') be omitted starting from 3.x?
>
>
>
> I just create a hive table without tblproperties('transactional'='true')
> and ingest data from flume hive sink (
> https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#hive-sink),
> flume throws an exception like `Caused by:
> org.apache.hive.hcatalog.streaming.InvalidTable: Invalid table db:test,
> table:dllm: is not an Acid table`.
>
>
>
> How can I create a hive ACID table? I'd like to ingest stream data from
> flume to hive.
>
>
>
> Thank you for help!
>
>
>
> Best Wishes,
>
> Chiming HUANG
>
>
>
>
>
> On Sat, May 16, 2020 at 11:43 PM Huang Chiming  wrote:
>
> Thanks for clarifying!
>
>
>
> Best Wishes,
>
> Chiming HUANG
>
>
>
>
>
> On Sat, May 16, 2020 at 11:40 PM Shawn Weeks 
> wrote:
>
> Starting in Hive 3.x all internal tables are transactional by default.
> Unless you need the buckets you should be able to just say create table and
> drop everything after partitioned by.
>
>
>
> Thanks
>
> Shawn
>
>
>
> *From: *Huang Chiming 
> *Reply-To: *"user@hive.apache.org" 
> *Date: *Saturday, May 16, 2020 at 10:32 AM
> *To: *"user@hive.apache.org" 
> *Subject: *create transactional table issue
>
>
>
> Hi guys,
>
>
>
> I use hive 3.1.2 connecting to standalone metastore and execute ddl in
> beeline/hive-cli failed, the error message is really weird to me… Anyone
> ever encounter the same issue?
>
>
>
> ```
>
> create table test.dllm ( b string )
>
> partitioned by (a int)
>
> clustered by (b) into 2 buckets
>
> stored as orc tblproperties('transactional'='true');
>
>
>
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:The table must 
> be stored using an ACID compliant format (such as ORC): test.dllm)
>
> ```
>
>
>
>
>
> Best Wishes,
>
> Chiming HUANG
>
>


Re: Adding a virtual column for a custom input format

2020-05-06 Thread Mich Talebzadeh
Hi Christine.

Virtual column meaning a derived column?

Can you achieve this by creating a view on the Hive table?
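
A minimal sketch of that idea (the table, view and column names are
hypothetical), with the derived value computed in the view:

CREATE VIEW orders_v AS
SELECT o.*,
       price * quantity AS line_total
FROM   orders o;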

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 6 May 2020 at 16:54, Christine Mathiesen 
wrote:

> Hello!
>
> I'm hoping someone can help me shed some light on how Hive deals with
> virtual columns. We are trying to implement some time travel features in a
> custom input format we're building and are considering the approach of
> adding a virtual column, but I haven't been able to find much information
> on Hive's implementation.
>
>
>
> Any help or links on how we can add our own virtual column to a Hive table
> would be very appreciated!
>
> Thank you!
>
>
>
> *Christine Mathiesen *
>
> Software Development Intern
>
> BDP – Hotels.com
>
> Expedia Group
>
>
>


How useful are tools for Hive data modeling

2020-04-02 Thread Mich Talebzadeh
Hi,

Fundamentally Hive tables have structure and support provided by desc
formatted <table> and show partitions <table>.

Hive does not support indexes in real HQL operations (I stand to be
corrected). So what we have are tables, partitions and clustering (AKA hash
partitioning).
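
For example (a sketch with hypothetical names), the structure a modelling
tool has to work with is essentially this:

CREATE TABLE sales (
  id     BIGINT,
  amount DOUBLE
)
PARTITIONED BY (sale_date DATE)
CLUSTERED BY (id) INTO 32 BUCKETS
STORED AS ORC;

DESC FORMATTED sales;
SHOW PARTITIONS sales;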

Hive does not support indexes because Hadoop lacks the block locality
necessary for indexes. So if I use a tool like Collibra, Ab Initio etc, what
advantage(s) is one going to gain over a simple shell script to get table
and partition definitions?

Thanks,

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Mich Talebzadeh
Have you updated partition statistics by any chance?
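
If not, something along these lines would refresh them (a sketch; ExtTable
is the table name from your example, and a partition spec can be added if
the table is partitioned):

ANALYZE TABLE ExtTable COMPUTE STATISTICS;
ANALYZE TABLE ExtTable COMPUTE STATISTICS FOR COLUMNS;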

I assume you can access the table and data through Hive itself?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 7 Aug 2019 at 21:07, Patrick McCarthy 
wrote:

> Do the permissions on the hive table files on HDFS correspond with what
> the spark user is able to read? This might arise from spark being run as
> different users.
>
> On Wed, Aug 7, 2019 at 3:15 PM Rishikesh Gawade 
> wrote:
>
>> Hi,
>> I did not explicitly create a Hive Context. I have been using the
>> spark.sqlContext that gets created upon launching the spark-shell.
>> Isn't this sqlContext same as the hiveContext?
>> Thanks,
>> Rishikesh
>>
>> On Wed, Aug 7, 2019 at 12:43 PM Jörn Franke  wrote:
>>
>>> Do you use the HiveContext in Spark? Do you configure the same options
>>> there? Can you share some code?
>>>
>>> Am 07.08.2019 um 08:50 schrieb Rishikesh Gawade <
>>> rishikeshg1...@gmail.com>:
>>>
>>> Hi.
>>> I am using Spark 2.3.2 and Hive 3.1.0.
>>> Even if i use parquet files the result would be same, because after all
>>> sparkSQL isn't able to descend into the subdirectories over which the table
>>> is created. Could there be any other way?
>>> Thanks,
>>> Rishikesh
>>>
>>> On Tue, Aug 6, 2019, 1:03 PM Mich Talebzadeh 
>>> wrote:
>>>
>>>> which versions of Spark and Hive are you using.
>>>>
>>>> what will happen if you use parquet tables instead?
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade 
>>>> wrote:
>>>>
>>>>> Hi.
>>>>> I have built a Hive external table on top of a directory 'A' which has
>>>>> data stored in ORC format. This directory has several subdirectories 
>>>>> inside
>>>>> it, each of which contains the actual ORC files.
>>>>> These subdirectories are actually created by spark jobs which ingest
>>>>> data from other sources and write it into this directory.
>>>>> I tried creating a table and setting the table properties of the same
>>>>> as *hive.mapred.supports.subdirectories=TRUE* and
>>>>> *mapred.input.dir.recursive**=TRUE*.
>>>>> As a result of this, when i fire the simplest query of *select
>>>>> count(*) from ExtTable* via the Hive CLI, it successfully gives me
>>>>> the expected count of records in the table.
>>>>> However, when i fire the same query via sparkSQL, i get count = 0.
>>>>>
>>>>> I think the sparkSQL isn't able to descend into the subdirectories for
>>>>> getting the data while hive is able to do so.
>>>>> Are there any configurations needed to be set on the spark side so
>>>>> that this works as it does via hive cli?
>>>>> I am using Spark on YARN.
>>>>>
>>>>> Thanks,
>>>>> Rishikesh
>>>>>
>>>>> Tags: subdirectories, subdirectory, recursive, recursion, hive
>>>>> external table, orc, sparksql, yarn
>>>>>
>>>>
>
> --
>
>
> *Patrick McCarthy  *
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
>


Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-06 Thread Mich Talebzadeh
which versions of Spark and Hive are you using.

what will happen if you use parquet tables instead?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 6 Aug 2019 at 07:58, Rishikesh Gawade 
wrote:

> Hi.
> I have built a Hive external table on top of a directory 'A' which has
> data stored in ORC format. This directory has several subdirectories inside
> it, each of which contains the actual ORC files.
> These subdirectories are actually created by spark jobs which ingest data
> from other sources and write it into this directory.
> I tried creating a table and setting the table properties of the same as
> *hive.mapred.supports.subdirectories=TRUE* and
> *mapred.input.dir.recursive**=TRUE*.
> As a result of this, when i fire the simplest query of *select count(*)
> from ExtTable* via the Hive CLI, it successfully gives me the expected
> count of records in the table.
> However, when i fire the same query via sparkSQL, i get count = 0.
>
> I think the sparkSQL isn't able to descend into the subdirectories for
> getting the data while hive is able to do so.
> Are there any configurations needed to be set on the spark side so that
> this works as it does via hive cli?
> I am using Spark on YARN.
>
> Thanks,
> Rishikesh
>
> Tags: subdirectories, subdirectory, recursive, recursion, hive external
> table, orc, sparksql, yarn
>


Re: Error: java.io.IOException: java.lang.RuntimeException: ORC split generation failed with exception: java.lang.NoSuchMethodError

2019-07-19 Thread Mich Talebzadeh
Hi,

Thanks. My Hadoop is:

hadoop version
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of
HADOOP_PREFIX.
*Hadoop 3.1.0*
Source code repository https://github.com/apache/hadoop -r
16b70619a24cdcf5d3b0fcf4b58ca77238ccbe6d
Compiled by centos on 2018-03-30T00:00Z
Compiled with protoc 2.5.0
From source with checksum 14182d20c972b3e2105580a1ad6990

It was installed a year back.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 19 Jul 2019 at 18:01, Gopal Vijayaraghavan 
wrote:

> Hi,
>
> > java.lang.NoSuchMethodError:
> org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
> (state=,code=0)
>
> Are you rolling your own Hadoop install?
>
> https://issues.apache.org/jira/browse/HADOOP-14683
>
> Cheers,
> Gopal
>
>
>


Error: java.io.IOException: java.lang.RuntimeException: ORC split generation failed with exception: java.lang.NoSuchMethodError

2019-07-19 Thread Mich Talebzadeh
Just upgraded Hive  from Hive-3.0 to 3.1.1

Connected to: Apache Hive (version 3.1.1)
Driver: Hive JDBC (version 3.1.1)

Created an ORC table through Spark as below:

sql("use accounts")
//
// Drop and create table ll_18740868
//
sql("DROP TABLE IF EXISTS accounts.ll_18740868")
var sqltext = ""
sqltext = """
CREATE TABLE accounts.ll_18740868 (
TransactionDateDATE
,TransactionType   String
,SortCode  String
,AccountNumber String
,TransactionDescriptionString
,DebitAmount   Double
,CreditAmount  Double
,Balance   Double
)
COMMENT 'from csv file from excel sheet'
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="ZLIB" )
"""
sql(sqltext)

Table is created Ok and populated from CSV files from HDFS

Data is inserted through a Hive temp table created on DataFrame  "a" as
below:

a.toDF.registerTempTable("tmp")

INSERT INTO TABLE accounts.ll_18740868
SELECT
……...
FROM tmp

So the data is there as I can select rows from the ORC table

// example

scala> sql("Select TransactionDate, DebitAmount, CreditAmount, Balance from
ll_18740868 limit 3 ").collect.foreach(println)
[2011-12-30,50.0,null,304.89]
[2011-12-30,19.01,null,354.89]
[2011-12-29,80.1,null,373.9]

However, this select does not work from beeline

0: jdbc:hive2://rhes75:10099/default> Beeline version 3.1.1 by Apache Hive
0: jdbc:hive2://rhes75:10099/default> use accounts;
No rows affected (0.011 seconds)
0: jdbc:hive2://rhes75:10099/default> Select TransactionDate, DebitAmount,
CreditAmount, Balance from ll_18740868 limit 3;
Error: java.io.IOException: java.lang.RuntimeException: ORC split
generation failed with exception: java.lang.NoSuchMethodError:
org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
(state=,code=0)

I thought this problem would have gone away in this release?

So it works through Spark, because it uses the Spark Tungsten optimiser, but
not through Hive!

explain Select TransactionDate, DebitAmount, CreditAmount, Balance from
ll_18740868 limit 3;
++
|  Explain   |
++
| STAGE DEPENDENCIES:|
|   Stage-0 is a root stage  |
||
| STAGE PLANS:   |
|   Stage: Stage-0   |
| Fetch Operator |
|   limit: 3 |
|   Processor Tree:  |
| TableScan  |
|   alias: ll_18740868   |
|   Statistics: Num rows: 80 Data size: 53535 Basic stats: COMPLETE
Column stats: NONE |
|   Select Operator  |
| expressions: transactiondate (type: date), debitamount (type:
double), creditamount (type: double), balance (type: double) |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 80 Data size: 53535 Basic stats:
COMPLETE Column stats: NONE |
| Limit  |
|   Number of rows: 3|
|   Statistics: Num rows: 3 Data size: 2007 Basic stats:
COMPLETE Column stats: NONE |
|   ListSink |
||
++

BTW, I tried different settings for

set hive.exec.orc.split.strategy

None worked
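
For reference, the parameter accepts three values, any of which can be set
per session:

set hive.exec.orc.split.strategy=HYBRID;
set hive.exec.orc.split.strategy=BI;
set hive.exec.orc.split.strategy=ETL;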

Thanks


Re: hcatalog and hiveserver2

2019-05-24 Thread Mich Talebzadeh
Spark uses HiveContext to access Hive tables on the same Hadoop cluster on
which both Hive and Spark are running.

Let us look at the example code below, which uses Spark as an ETL tool to
get data from an Oracle table through JDBC and store it in a Hive ORC table.

// 1) create Spark conf first

  val conf = new SparkConf().
   setAppName(sparkAppName).
   set("spark.driver.allowMultipleContexts",
sparkDriverAllowMultipleContextsValue).
   set("enableHiveSupport",enableHiveSupportValue)

// 2 Create sparkContext and HiveContext as below

val sc = new SparkContext(conf)
val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// 3 Access Hive database oraclehadoop as below

HiveContext.sql("use oraclehadoop")

// 4 Access Oracle DB though JDBC driver from Spark

  private var dbURL = "jdbc:oracle:thin:@rhes564:1521:mydb12"
  private var dbDatabase = null
  private var dbUserName ="scratchpad"
  private var dbPassword = "x"
  private var partitionColumnName = "ID"
  private var lowerBoundValue = "1"
  private var upperBoundValue = "1"
  private var numPartitionsValue = "100"


  // 5 Get data from Oracle table scratchpad.dummy to a Spark DataFrame

 val d = HiveContext.read.format("jdbc").options(
 Map("url" -> dbURL,
 "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED,
to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED,
RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
 "partitionColumn" -> partitionColumnName,
 "lowerBound" -> lowerBoundValue,
 "upperBound" -> upperBoundValue,
 "numPartitions" -> numPartitionsValue,
 "user" -> dbUserName,
 "password" -> dbPassword)).load

//  6 Create a temp table in Spark based on DataFrame "d" above

   d.registerTempTable("tmp")
  //
  // Need to create and populate target ORC table oraclehadoop.dummy
  //
  // Drop and create table oraclehadoop.dummy
  HiveContext.sql("DROP TABLE IF EXISTS oraclehadoop.dummy")
  var sqltext  = ""
  sqltext = """
  CREATE TABLE oraclehadoop.dummy (
 ID INT
   , CLUSTERED INT
   , SCATTERED INT
   , RANDOMISED INT
   , RANDOM_STRING VARCHAR(50)
   , SMALL_VC VARCHAR(10)
   , PADDING  VARCHAR(10)
  )
  --CLUSTERED BY (ID) INTO 256 BUCKETS
  STORED AS ORC
  TBLPROPERTIES (
  "orc.create.index"="true",
  "orc.bloom.filter.columns"="ID",
  "orc.bloom.filter.fpp"="0.05",
  "orc.compress"="SNAPPY",
  "orc.stripe.size"="16777216",
  "orc.row.index.stride"="1" )
  """
   HiveContext.sql(sqltext)
  //
  // Put data in Hive table. Clean up is already done
  //
  sqltext = """
  INSERT INTO TABLE oraclehadoop.dummy
  SELECT
  ID
, CLUSTERED
, SCATTERED
, RANDOMISED
, RANDOM_STRING
, SMALL_VC
, PADDING
  FROM tmp
  """
   HiveContext.sql(sqltext)

….

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 24 May 2019 at 17:59, Alan Gates  wrote:

>
> HCatalog was built as an interface to allow tools such as Pig and
> MapReduce to access Hive tabular data, for both read and write.  In more
> recent versions of Hive, HCatalog has not been updated to support the
> newest features, such as reading or writing transactional data or, in Hive
> 3.x, accessing managed tables (that tables that Hive owns).
>
> HiveServer2 is a ODBC/JDBC server for Hive.  There no relationship between
> HiveServer2 and HCatalog.
>
> Hive also has a metastore, a data catalog that tracks metadata for Hive
> tables.  This can be run as a separate service, in which case it is often
> referred to as HMS, or embedded into another system.  For example in the
> past HiveServer2 was often configured to embed the metastore
> functionality.  HCatalog communicates with the metastore to determine what
> physical storage objects (files or objects in an object store) make up a
> table or partition that the non-Hive user wants to interact with.
> Traditionally Spark communicates directly with the Hive metastore (I

Re: Any HIVE DDL statement takes minutes to execute

2019-05-16 Thread Mich Talebzadeh
Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


I don't know much about MySQL, but assuming it has tools to see the activity
in the back end, what locks are you seeing in the database itself, and how
long does each command actually take on the RDBMS side?

Does this only apply to the DDL statements?

It is either some locking/blocking in the back end, or the network
connection between your Hadoop cluster and the RDBMS, that is causing the issue.

I just tested DDL for an external table in Hive, with Oracle as the backing
metastore database, and there was no issue.
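
For the MariaDB/MySQL side, a couple of quick checks would be (a sketch only;
adjust for your metastore database and version):

-- sessions currently running or blocked against the metastore database
SHOW FULL PROCESSLIST;
-- InnoDB lock waits and long-running transactions
SHOW ENGINE INNODB STATUS;
SELECT * FROM information_schema.INNODB_TRX;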

HTH


On Thu, 16 May 2019 at 08:16, Iulian Mongescu 
wrote:

> Hi Alan,
>
>
>
> I’m using MySQL (Mariadb) for the metastore and I was thinking on this
> possibility too but from all my tests on metastore database that I run,
> every query is almost instant.
>
> For example :
>
> SELECT * FROM `TBLS`  ->  Query took 0.0001 seconds.
>
> INSERT INTO `TBLS` ->  Query took 0.0020 seconds
>
> DELETE FROM `TBLS` -> Query took 0.0021 seconds
>
>
>
> Thank you,
>
> Iulian
>
>
>
> *From:* Alan Gates 
> *Sent:* Wednesday, May 15, 2019 9:51 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Any HIVE DDL statement takes minutes to execute
>
>
>
> What are you using as the RDBMS for your metastore?  A first place I'd
> look is if the communications with the RDBMS are slow for some reason.
>
>
>
> Alan.
>
>
>
> On Wed, May 15, 2019 at 10:34 AM Iulian Mongescu 
> wrote:
>
> Hello,
>
>
>
> I'm working on a HDP-2.6.5.0 cluster with kerberos enabled and I have a
> problem with hive as any DDL statement that I run takes minutes to execute
> but any DML run in normal limits. I checked the logs but I didn’t find
> anything that seems related with this problem and I would appreciate any
> help to debug this issue.
>
>
>
> Please find bellow some examples with DDL queries and their durations:
>
>
>
> 
>
> 0: jdbc:hive2://hdpx03:1/> CREATE EXTERNAL TABLE IF NOT EXISTS agenti1
> (...) STORED AS ORC LOCATION
> '/staging/core/agenti/2019-03-18/29d52a54eecae3731b31a3d6ef45d012';
>
> No rows affected (184.191 seconds)
>
> 
>
> 0: jdbc:hive2://hdpx03:1/> show tables;
>
> +---+--+
>
> | tab_name |
>
> +---+--+
>
> | agenti1 |
>
> +---+--+
>
> 1 row selected (0.358 seconds)
>
> -
>
> 0: jdbc:hive2://hdpx03:1/> select count(*) as total from agenti1 where
> 1;
>
> INFO : Tez session hasn't been created yet. Opening session
>
> INFO : Dag name: select count(*) as total from agenti1 wh...1(Stage-1)
>
> INFO : Status: Running (Executing on YARN cluster with App id
> application_1552674174918_0002)
>
>
>
>
> 
>
> VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
>
>
> 
>
> Map 1 .. SUCCEEDED 1 1 0 0 0 0
>
> Reducer 2 .. SUCCEEDED 1 1 0 0 0 0
>
>
> 
>
> VERTICES: 02/02 [==>>] 100% ELAPSED TIME: 5.48 s
>
>
> 
>
> ++--+
>
> | total |
>
> ++--+
>
> | 1960 |
>
> ++--+
>
> 1 row selected (15.853 seconds)
>
>
>
> ---
>
> 0: jdbc:hive2://hdpx03:1/> drop table agenti1;
>
> No rows affected (184.164 seconds)
>
> 
>
> 0: jdbc:hive2://hdpx03:1/> CREATE EXTERNAL TABLE IF NOT EXISTS agenti1
> (...) STORED AS ORC LOCATION
> '/staging/core/agenti/2019-03-18/29d52a54eecae3731b31a3d6ef45d012';
>
> No rows affected (190.288 seconds)
>
>
>
> Thanks,
>
>
>
> Iulian
>
>
>
>


Re: Hive metastore service

2019-04-16 Thread Mich Talebzadeh
Try this

Assuming that you are talking about Hive Thrift server

beeline -u jdbc:hive2://rhes75:10099/default -d
org.apache.hive.jdbc.HiveDriver *-n USERNAME -p PASSWORD* -i
'/home/hduser/dba/bin/add_jars.hql'

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 16 Apr 2019 at 16:11, Odon Copon  wrote:

> Hi,
> would be possible to add authentication to the Thrift Hive metastore
> service? like user and password?
> I cannot find any documentation on how to protect this endpoint.
> Thanks.
>


Re: Comparing Google Cloud Platform BigQuery with Hive

2019-01-29 Thread Mich Talebzadeh
Hi Furcy,

Thanks.

Apologies for being late on this. You are absolutely correct. I tried and
BQ can read compressed ORC files.

Still referring to my original thread, BQ's handling of Doubles and Dates is
problematic. I tend to create these types of fields as String and do the ETL
in BQ by converting them into the desired type.
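
As an illustration of that ETL step (a sketch only; the project, dataset,
table and column names are made up), BigQuery standard SQL can convert the
staged String fields on the fly:

-- convert staged STRING columns into typed columns
SELECT
  SAFE_CAST(amount_str AS FLOAT64)       AS amount,
  PARSE_DATE('%Y-%m-%d', trade_date_str) AS trade_date
FROM `myproject.staging.trades_raw`;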

I am not much concerned about what Hive itself does. I run Hive on the Spark
execution engine on prem and use Spark for anything on prem interacting
with Hive. On BQ one can achieve the same, although my Spark code (written
in Scala) has to be modified. In general I have found that using Spark both
on prem and in GCP, against Hive and BQ respectively, makes things easier.
Also, as far as my tests go, Spark's analytical functions are identical both
on prem and in Dataproc.

HTH,

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 14 Jan 2019 at 09:18, Furcy Pin  wrote:

> Hi Mich,
>
> Contrary to what you said, I can confirm you that BQ is able to read ORC
> files compressed with Snappy.
> However, BQ requires to perform a loading operation from the ORC file on
> Google Storage and convert it into a BQ table.
>
> The main advantages I see with BQ is the guaranteed very high scalability
> and low query latency without having to manage a Hadoop cluster yourself.
>
> I would not say however, that you can simply plug your existing HQL
> queries into BQ. All useful analytics functions are indeed there, but in
> many cases they have a different name.
> For instance, the equivalent of Hive's UDF *trunc* in BQ is *date_trunc.*
>
> In my use case I use pyspark for complex transformations and use BQ as a
> Warehouse to plug Power BI on it.
> So for a fair comparison, I think you should compare BQ with Vertica,
> Presto, Impala or Hive LLAP rather than just Hive.
>
> Regards,
>
> Furcy
>
>
>
>
> On Fri, 11 Jan 2019 at 11:18, Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> Has anyone got some benchmarks comparing Hive with Google Cloud
>> Platform (GCP) BigQuery (BQ)?
>>
>> From my experience, BQ supports both Avro and ORC file types.
>> There is no support for compressed ORC or AVRO. So if you want to load a
>> Hive table into BQ, you will need to create a table with no compression. In
>> short you need to perform ETL to move a Hive table to BQ.
>>
>> On the other hand BQ seems to support all analytical functions available
>> in Hive so your queries should run without any modification in BQ.
>>
>> On the other hand Dataproc tool in GCP also supports Hive (though I have
>> not tried it myself). So the question is are there any advantages taking a
>> Hive table into BQ itself?
>>
>> Thanks,
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


Comparing Google Cloud Platform BigQuery with Hive

2019-01-11 Thread Mich Talebzadeh
Hi,

Has anyone got some benchmarks comparing Hive with Google Cloud Platform
(GCP) BigQuery (BQ)?

From my experience, BQ supports both Avro and ORC file types.
There is no support for compressed ORC or AVRO. So if you want to load a
Hive table into BQ, you will need to create a table with no compression. In
short you need to perform ETL to move a Hive table to BQ.
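
A typical load step (a sketch only; the bucket, dataset and table names are
assumptions) might look like this:

# copy the uncompressed ORC files to Cloud Storage, then load them into BigQuery
gsutil cp /data/hive/warehouse/mydb.db/mytable/* gs://my-bucket/mytable/
bq load --source_format=ORC mydataset.mytable "gs://my-bucket/mytable/*"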

On the other hand BQ seems to support all analytical functions available in
Hive so your queries should run without any modification in BQ.

On the other hand Dataproc tool in GCP also supports Hive (though I have
not tried it myself). So the question is are there any advantages taking a
Hive table into BQ itself?

Thanks,

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Trying to create an external table in Hive for MongoDB throws error

2018-08-27 Thread Mich Talebzadeh
I sorted this out by spending a fair bit of time on the jar files needed.

These are my versions

Hive 3
Hadoop 3.1
MongoDB 4.0.1

You need the following jar file added to Hive

ADD JAR hdfs://rhes75:9000/jars/mongo-hadoop-core-2.0.2.jar;
ADD JAR hdfs://rhes75:9000/jars/mongo-hadoop-hive-2.0.2.jar;
ADD JAR hdfs://rhes75:9000/jars/mongo-java-driver-3.8.1.jar;
--ADD JAR hdfs://rhes75:9000/jars/hive-serde-3.1.0.jar;

*ADD JAR hdfs://rhes75:9000/jars/hive-serde-0.11.0.jar;*
The jar file hive-serde-0.11.0.jar is the one you need to use!

This code works for mapping a Hive external  table (here
accounts.ll_18740868_mongo) to an existing Mongo collection

ADD JAR hdfs://rhes75:9000/jars/mongo-hadoop-core-2.0.2.jar;
ADD JAR hdfs://rhes75:9000/jars/mongo-hadoop-hive-2.0.2.jar;
ADD JAR hdfs://rhes75:9000/jars/mongo-java-driver-3.8.1.jar;
--ADD JAR hdfs://rhes75:9000/jars/hive-serde-3.1.0.jar;
ADD JAR hdfs://rhes75:9000/jars/hive-serde-0.11.0.jar;
DROP TABLE IF EXISTS accounts.ll_18740868_mongo;
CREATE EXTERNAL TABLE accounts.ll_18740868_mongo (
  id Int
, TRANSACTIONDATE          Date
, TRANSACTIONTYPE   String
, SORTCODE  String
, ACCOUNTNUMBER String
, TRANSACTIONDESCRIPTION   String
, DEBITAMOUNT   Double
, CREDITAMOUNT  Double
, BALANCE   Double
)
COMMENT 'Mongo external table'
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'=
'{
"id":"_id"
   , "TRANSACTIONDATE":"TRANSACTIONDATE"
   , "TRANSACTIONTYPE":"TRANSACTIONTYPE"
   , "SORTCODE":"SORTCODE"
   , "ACCOUNTNUMBER":"ACCOUNTNUMBER"
   , "TRANSACTIONDESCRIPTION":"TRANSACTIONDESCRIPTION"
   , "DEBITAMOUNT":"DEBITAMOUNT"
   , "CREDITAMOUNT":"CREDITAMOUNT"
   , "BALANCE":"BALANCE"
}'
)
TBLPROPERTIES
('mongo.uri'='mongodb://account_user_RO:@rhes75:60100/accounts.ll_18740868')
;
desc accounts.ll_18740868_mongo;

You will need to load the jar files whenever you need to access this table

In MongoDB I have

> use accounts;
switched to db accounts
> db.ll_18740868.count()
3623

And confirmed in Hive

0: jdbc:hive2://rhes75:10099/default> select count(1) from
accounts.ll_18740868_mongo;
+---+
|  _c0  |
+---+
| 3623  |
+---+
1 row selected (1.426 seconds)
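
For a quick sanity check beyond the row count, an ordinary projection and
filter works against the mapped collection as well (the predicate below is
only illustrative):

select id, transactiondate, transactiondescription, debitamount
from accounts.ll_18740868_mongo
where debitamount > 0
limit 10;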

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 26 Aug 2018 at 23:05, Mich Talebzadeh 
wrote:

> Hi,
>
> Trying to create an external table in Hive to be accessed by MongoDB.
>
> This is the code in Hive
>
> ADD JAR /home/hduser/jars/mongo-hadoop-core-2.0.2.jar;
> ADD JAR /home/hduser/jars/mongo-hadoop-hive-2.0.2.jar;
> ADD JAR /home/hduser/jars/mongo-java-driver-3.8.1.jar;
> use accounts;
> DROP TABLE IF EXISTS ll_18740868_mongo;
> CREATE EXTERNAL TABLE ll_18740868_mongo (
> TransactionDate            DATE
> ,TransactionType   String
> ,SortCode  String
> ,AccountNumber String
> ,TransactionDescription    String
> ,DebitAmount   Double
> ,CreditAmount  Double
> ,Balance   Double
> )
> COMMENT 'Mongo external table'
> STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
> WITH SERDEPROPERTIES('mongo.columns.mapping'=
> '{
>"TransactionDate":"TransactionDate",
>"TransactionType":"TransactionType",
>"SortCode":"SortCode",
>"AccountNumber":"AccountNumber",
>"TransactionDescription":"TransactionDescription",
>"DebitAmount":"DebitAmount",
>"CreditAmount":"CreditAmount",
>"Balance":"Balance"
> }'
> )
> TBLPROPERTIES ('mongo.uri'='mongodb://account_user_RO:mongodb@rhes75
> :60100/accounts.ll_18740868_mongo')
> ;
>
> In debug mode it throws this error
>
> CREATE EXTERNAL TABLE ll_18740868_mongo (
> TransactionDate            DATE
> ,TransactionType   String
> ,SortCode  String
> ,AccountNumber String
> ,TransactionDescription    String
> ,DebitAmount   Double
> ,CreditAmount  

Trying to create an external table in Hive for MongoDB throws error

2018-08-26 Thread Mich Talebzadeh
Driver.java:793)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:683)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.hive.serde2.SerDe
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 39 more
2018-08-26 23:01:32,426 INFO  [a928186b-244d-4b81-bda9-dd3a83e306b2 main]
reexec.ReOptimizePlugin: ReOptimization: retryPossible: false
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/hive/serde2/SerDe
2018-08-26 23:01:32,426 ERROR [a928186b-244d-4b81-bda9-dd3a83e306b2 main]
ql.Driver: FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/hive/serde2/SerDe
2018-08-26 23:01:32,426 INFO  [a928186b-244d-4b81-bda9-dd3a83e306b2 main]
ql.Driver: Completed executing
command(queryId=hduser_20180826230132_53320afb-5265-407f-82a3-a1b22e584e20);
Time taken: 0.005 seconds
2018-08-26 23:01:32,426 INFO  [a928186b-244d-4b81-bda9-dd3a83e306b2 main]
lockmgr.DbTxnManager: Stopped heartbeat for query:
hduser_20180826230132_53320afb-5265-407f-82a3-a1b22e584e20
2018-08-26 23:01:32,437 INFO  [a928186b-244d-4b81-bda9-dd3a83e306b2 main]
conf.HiveConf: Using the default value passed in for log id:
a928186b-244d-4b81-bda9-dd3a83e306b2
2018-08-26 23:01:32,437 INFO  [a928186b-244d-4b81-bda9-dd3a83e306b2 main]
session.SessionState: Resetting thread name to  main
2018-08-26 23:01:32,437 INFO  [main] conf.HiveConf: Using the default value
passed in for log id: a928186b-244d-4b81-bda9-dd3a83e306b2
2018-08-26 23:01:32,447 INFO  [main] session.SessionState: Deleted
directory: /tmp/hive/hduser/a928186b-244d-4b81-bda9-dd3a83e306b2 on fs with
scheme hdfs
2018-08-26 23:01:32,447 INFO  [main] session.SessionState: Deleted
directory: /tmp/hive/a928186b-244d-4b81-bda9-dd3a83e306b2 on fs with scheme
file
2018-08-26 23:01:32,447 INFO  [main] metastore.HiveMetaStoreClient: Closed
a connection to metastore, current connections: 0

May be I am using incompatible jars here?


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Hive Metadata as a microservice

2018-07-05 Thread Mich Talebzadeh
Thanks

I believe we can classify it as a microservice, as it provides a metadata
service for various other artefacts and falls into the definition of a loosely
coupled service.
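
For anyone who wants to try it, a rough sketch of running that standalone
metastore artefact (the install path and database type are assumptions, not
something tested here):

# one-off: create the backing schema, then start the Thrift metastore service
export METASTORE_HOME=/opt/apache-hive-metastore-3.0.0-bin
$METASTORE_HOME/bin/schematool -initSchema -dbType mysql
$METASTORE_HOME/bin/start-metastore
# clients (Hive, Spark etc.) then point at it via hive.metastore.uris=thrift://<host>:9083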

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 5 Jul 2018 at 20:14, Alan Gates  wrote:

> In 3.0, you can download the metastore as a separate artifact, either
> source or binary (e.g.
> http://ftp.wayne.edu/apache/hive/hive-standalone-metastore-3.0.0/).  It
> does not require any other parts of Hive beyond what's released in that
> artifact.  I'm not sure if this meets your definition of a loosely coupled
> microservice or not.
>
> Alan.
>
> On Thu, Jul 5, 2018 at 11:49 AM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> My understanding is that in later releases of Hive, the metadata will be
>> a separate offerings. Will this be a type of microservice offering
>> providing loose coupling to various other artefact?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


Hive Metadata as a microservice

2018-07-05 Thread Mich Talebzadeh
Hi,

My understanding is that in later releases of Hive, the metadata will be a
separate offering. Will this be a type of microservice offering, providing
loose coupling to various other artefacts?

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Error Starting hive thrift server, hive 3 on Hadoop 3.1

2018-07-04 Thread Mich Talebzadeh
Sure will do

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 3 Jul 2018 at 23:35, Jörn Franke  wrote:

> It would be good if you can document this on the Hive wiki so that other
> users know it.
>
> On the other hand there is Apache Bigtop which tests integration of
> various Big Data components  - but it is complicated. Behind a big data
> distribution there is a lot of effort.
>
> On 3. Jul 2018, at 23:08, Mich Talebzadeh 
> wrote:
>
> Resolved this by getting rid of HADOOP_CLASSPATH that I had to add to make
> Hbase 2 work with Hadoop 3.1. It did not help and I had to revert back to
> Hbase 1.2.6. But left that CLASSPATH in ENV file.
>
> This is becoming retrofitting issues where making an artefact work with
> Hadoop impacts other artefacts and results in unnecessary waste of time.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 3 Jul 2018 at 17:48, Mich Talebzadeh 
> wrote:
>
>> This is hive 3 on Hadoop 3.1
>>
>> I am getting this error in a loop
>>
>> 2018-07-03 17:43:44,929 INFO  [main] SessionState: Hive Session ID =
>> 5f38c8a3-f269-42e0-99d8-9ddff676f009
>> 2018-07-03 17:43:44,929 INFO  [main] server.HiveServer2: Shutting down
>> HiveServer2
>> 2018-07-03 17:43:44,929 INFO  [main] server.HiveServer2:
>> Stopping/Disconnecting tez sessions.
>> 2018-07-03 17:43:44,930 WARN  [main] server.HiveServer2: Error starting
>> HiveServer2 on attempt 5, will retry in 6ms
>> java.lang.NoSuchMethodError:
>> org.apache.hadoop.tracing.TraceUtils.wrapHadoopConf(Ljava/lang/String;Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/htrace/HTraceConfiguration;
>>
>> Any ideas?
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


Re: Error Starting hive thrift server, hive 3 on Hadoop 3.1

2018-07-03 Thread Mich Talebzadeh
Resolved this by getting rid of the HADOOP_CLASSPATH that I had added to make
HBase 2 work with Hadoop 3.1. It did not help and I had to revert back to
HBase 1.2.6, but had left that CLASSPATH in the env file.

This is becoming a retrofitting exercise, where making one artefact work with
Hadoop impacts other artefacts and results in an unnecessary waste of time.
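
A quick way to spot this kind of clash (a sketch; the grep targets are just
examples) is to look at what actually ends up on the classpath:

# where is HADOOP_CLASSPATH being set, and does it drag in conflicting jars?
echo "$HADOOP_CLASSPATH"
grep -n HADOOP_CLASSPATH $HADOOP_HOME/etc/hadoop/hadoop-env.sh ~/.bashrc
hadoop classpath | tr ':' '\n' | grep -i htrace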

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 3 Jul 2018 at 17:48, Mich Talebzadeh 
wrote:

> This is hive 3 on Hadoop 3.1
>
> I am getting this error in a loop
>
> 2018-07-03 17:43:44,929 INFO  [main] SessionState: Hive Session ID =
> 5f38c8a3-f269-42e0-99d8-9ddff676f009
> 2018-07-03 17:43:44,929 INFO  [main] server.HiveServer2: Shutting down
> HiveServer2
> 2018-07-03 17:43:44,929 INFO  [main] server.HiveServer2:
> Stopping/Disconnecting tez sessions.
> 2018-07-03 17:43:44,930 WARN  [main] server.HiveServer2: Error starting
> HiveServer2 on attempt 5, will retry in 6ms
> java.lang.NoSuchMethodError:
> org.apache.hadoop.tracing.TraceUtils.wrapHadoopConf(Ljava/lang/String;Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/htrace/HTraceConfiguration;
>
> Any ideas?
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Error Starting hive thrift server, hive 3 on Hadoop 3.1

2018-07-03 Thread Mich Talebzadeh
This is hive 3 on Hadoop 3.1

I am getting this error in a loop

2018-07-03 17:43:44,929 INFO  [main] SessionState: Hive Session ID =
5f38c8a3-f269-42e0-99d8-9ddff676f009
2018-07-03 17:43:44,929 INFO  [main] server.HiveServer2: Shutting down
HiveServer2
2018-07-03 17:43:44,929 INFO  [main] server.HiveServer2:
Stopping/Disconnecting tez sessions.
2018-07-03 17:43:44,930 WARN  [main] server.HiveServer2: Error starting
HiveServer2 on attempt 5, will retry in 6ms
java.lang.NoSuchMethodError:
org.apache.hadoop.tracing.TraceUtils.wrapHadoopConf(Ljava/lang/String;Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/htrace/HTraceConfiguration;

Any ideas?


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Reading XML into Hive table, handling null columns

2018-06-28 Thread Mich Talebzadeh
Hi,

One way of handling null values in an XML tag when displaying the results is
to use Outer Lateral Views
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView#LanguageManualLateralView-OuterLateralViews>
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView#LanguageManualLateralView-OuterLateralViews


The explanation below summarises it


The user can specify the optional OUTER keyword to generate rows even when
a LATERAL VIEW usually would not generate a row. This happens when the UDTF
used does not generate any rows which happens easily with explode when the
column to explode is empty. In this case the source row would never appear
in the results. *OUTER can be used to prevent that and rows will be
generated with NULL values in the columns coming from the UDTF.*
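
A minimal illustration of the difference (a made-up one-row table, not the
data below):

-- without OUTER the row disappears because the array is empty;
-- with OUTER the row survives and the exploded column is NULL
select t.id, c
from (select 1 as id, array() as empty_arr) t
lateral view outer explode(t.empty_arr) v as c;
-- returns: 1, NULL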


So for hdfs-site.xml, where description is empty, one can still generate a row
with a NULL value for that column:


insert overwrite table hdfs_site
select
  rownum
, name
, value
, description
from
hdfs_site_temp
lateral view outer explode(Pname) a as name
lateral view outer explode(Pvalue) a as value
lateral view outer explode(Pdescription) a as description
;

0: jdbc:hive2://rhes75:10099/default> select * from hdfs_site
. . . . . . . . . . . . . . . . . . > order by rownum
. . . . . . . . . . . . . . . . . . > ;
+------------------+----------------------------------------+-----------------------------------------------------+------------------------+
| hdfs_site.rownum | hdfs_site.name                         | hdfs_site.value                                     | hdfs_site.description  |
+------------------+----------------------------------------+-----------------------------------------------------+------------------------+
| 1                | dfs.namenode.ec.system.default.policy  | RS-6-3-1024k                                        | NULL                   |
| 2                | dfs.safemode.threshold.pct             | 0                                                   | NULL                   |
| 3                | hadoop.tmp.dir                         | /tmp                                                | NULL                   |
| 4                | dfs.image.transfer.timeout             | 60                                                  | NULL                   |
| 5                | dfs.permissions                        | false                                               | NULL                   |
| 6                | dfs.block.size                         | 134217728                                           | NULL                   |
| 7                | dfs.balance.bandwidthPerSec            | 10                                                  | NULL                   |
| 8                | dfs.datanode.data.dir                  | file:/d4T/hduser/hadoop/hadoop_store/hdfs/datanode  | NULL                   |
| 9                | dfs.namenode.name.dir                  | file:/d4T/hduser/hadoop/hadoop_store/hdfs/namenode  | NULL                   |
+------------------+----------------------------------------+-----------------------------------------------------+------------------------+

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 28 Jun 2018 at 09:40, Mich Talebzadeh 
wrote:

> The classic XML configuration files like hive-site.xml have the following
> patterns:
>
>
>  
>
> In order to read these xml into Hive, you need to do the following:
>
> ### 1. Brings each record to one line (deletes the root element
> configuration tags):
> ### 2. Need to get rid of space between tags
> ###
> 
> ### 3. Each xml record needs to be in one line!
> ### 4. The xml file needs the same start and end tag for each line like
> property tags. Since you have only one root tag for the whole document,
> that is not going to work. Also
> ###need to get rid of root tag at the beginning and end plus the xml
> information at the beginning (i.e ?xml version)
>
> This simple shell command does work (hive-site.xml, hdfs-site.xml etc)
>
> cat ${HIVE_HOME}/conf/hive-site.xml | tr -d '&' | tr '\n' ' ' | tr '\r' '
> ' | sed 's||\n|g'| sed 's///g'|sed
> 's|||g' \
>  | sed -e 's/^[]*//g' | sed -e 's/>[]*/>/g' | grep
> -v " ${XML_FILE}
>
> replace ${XML_FILE} with any name
>
> The below code will work for hive-ste.xml
>
> drop table if exists xml_temp;
> -- create a load table
> create table xml_temp (xmlData string);
> -- load data from local xml 

Reading XML into Hive table, handling null columns

2018-06-28 Thread Mich Talebzadeh
The classic XML configuration files like hive-site.xml have the following
patterns:

<property>
  <name>...</name>
  <value>...</value>
  <description>...</description>
</property>

In order to read this XML into Hive, you need to do the following:

### 1. Bring each record onto one line (delete the root element
configuration tags).
### 2. Get rid of the space between tags.
###

### 3. Each xml record needs to be in one line!
### 4. The xml file needs the same start and end tag for each line like
property tags. Since you have only one root tag for the whole document,
that is not going to work. Also
###need to get rid of root tag at the beginning and end plus the xml
information at the beginning (i.e ?xml version)

This simple shell command does work (hive-site.xml, hdfs-site.xml etc)

cat ${HIVE_HOME}/conf/hive-site.xml | tr -d '&' | tr '\n' ' ' | tr '\r' ' '
| sed 's||\n|g'| sed 's///g'|sed
's|||g' \
 | sed -e 's/^[]*//g' | sed -e 's/>[]*/>/g' | grep
-v " ${XML_FILE}

replace ${XML_FILE} with any name
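
For reference, a sketch of a pipeline that flattens each <property> ...
</property> element onto one line and drops the configuration wrapper and
the XML declaration (the exact expressions here are assumptions, not the
original ones):

cat ${HIVE_HOME}/conf/hive-site.xml \
  | tr -d '&' | tr '\n' ' ' | tr '\r' ' ' \
  | sed 's|<?xml[^>]*?>||g' \
  | sed 's|<configuration>||g; s|</configuration>||g' \
  | sed 's|</property>|</property>\n|g' \
  | sed -e 's/^[[:space:]]*//' -e 's/>[[:space:]]*</></g' \
  | grep -v '^[[:space:]]*$' > ${XML_FILE}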

The code below will work for hive-site.xml:

drop table if exists xml_temp;
-- create a load table
create table xml_temp (xmlData string);
-- load data from local xml file
load data local inpath "${XML_FILE}" into table xml_temp;
select * from xml_temp;
drop table if exists hive_site_temp;
create table if not exists hive_site_temp(
   rownum int,
   Pname array<string>,
   Pvalue array<string>,
   Pdescription array<string>)
   row format delimited
   fields terminated by '|'
;
--desc hive_site_temp;
insert overwrite table hive_site_temp
select
  row_number() over()
, xpath(xmlData,'property/name/text()')
, xpath(xmlData,'property/value/text()')
, xpath(xmlData,'property/description/text()')
from xml_temp
;
drop table if exists hive_site;
create table if not exists hive_site
 (
   rownum int,
   name string,
   value string,
   description string
 )
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY")
;
insert overwrite table hive_site
select
  rownum
, name
, value
, description
from
hive_site_temp



lateral view explode(Pname) a as name
lateral view explode(Pvalue) a as value
lateral view explode(Pdescription) a as description
;
select * from hive_site
order by rownum


However, for hdfs-site.xml there is by default no "description" tag. You can
still read the data in with no problem. However, "lateral view
explode(Pdescription) a as description" will not work and the table will have
no rows!

One way is to exclude that column from the insert. However, I was wondering
whether the explode function can check if the column has any value before
populating it.

For example for hdfs-site.xml these are my values:

0: jdbc:hive2://rhes75:10099/default> select
xpath(xmlData,'property/description/text()') as description from xml_temp;
+--+
| description  |
+--+
| []   |
| []   |
| []   |
| []   |
| []   |
| []   |
| []   |
| []   |
| []   |
+--+
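
One workaround along those lines (a sketch, untested; only the description
column is shown and the hdfs_site_temp staging table is assumed to mirror
hive_site_temp) is to substitute a single NULL element when the array is
empty, so that explode always emits a row:

select
  rownum
, description
from hdfs_site_temp
lateral view explode(
    if(size(Pdescription) = 0, array(cast(null as string)), Pdescription)
  ) d as description;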


Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. ORC split generation failed with exception

2018-06-26 Thread Mich Talebzadeh
Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



With Hadoop 3.0.3 I had the following

0: jdbc:hive2://rhes75:10099/default> select count(1) from sales;
Error: Error while processing statement: FAILED: Execution Error, return
code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. ORC split
generation failed with exception: java.lang.NoSuchMethodError:
org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
(state=08S01,code=1)


I upgraded Hadoop from 3.0.3 to 3.1.0 and the same operation on that ORC
table works now.


0: jdbc:hive2://rhes75:10099/default> select count(1) from sales;
+-+
|   _c0   |
+-+
| 917359  |
+-+




On Mon, 25 Jun 2018 at 18:27, Mich Talebzadeh 
wrote:

> Thanks
>
> I assume two things
>
> Create these ORC tables from scratch and populate them. These are older
> tables from 2.7
>
> Do I need to upgrade to Hadoop 3.1 as suggested as well? Or I can keep the
> current 3.0.3 of Hadoop and just redo these ORC tables again
>
> Sounds like I need to upgrade Hadoop to 3.1?
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 25 Jun 2018 at 18:18, Gopal Vijayaraghavan 
> wrote:
>
>> > This is Hadoop 3.0.3
>> > java.lang.NoSuchMethodError:
>> org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
>> (state=08S01,code=1)
>> > Something is missing here! Is this specific to ORC tables?
>>
>> No, it is a Hadoop BUG.
>>
>> https://issues.apache.org/jira/browse/HADOOP-1468
>>
>> Fixed in Hadoop-2.8.2+ & most likely this is happening because you have
>> built something against Hadoop-2.7.3 (Parquet and ORC default to the 2.7.3,
>> so you need to rebuild them over for Hadoop3).
>>
>> Cheers,
>> Gopal
>>
>>
>>


Re: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. ORC split generation failed with exception

2018-06-25 Thread Mich Talebzadeh
Thanks

I assume two things:

Create these ORC tables from scratch and populate them. These are older
tables from 2.7.

Do I need to upgrade to Hadoop 3.1 as suggested as well? Or can I keep the
current Hadoop 3.0.3 and just redo these ORC tables?

Sounds like I need to upgrade Hadoop to 3.1?


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 25 Jun 2018 at 18:18, Gopal Vijayaraghavan 
wrote:

> > This is Hadoop 3.0.3
> > java.lang.NoSuchMethodError:
> org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
> (state=08S01,code=1)
> > Something is missing here! Is this specific to ORC tables?
>
> No, it is a Hadoop BUG.
>
> https://issues.apache.org/jira/browse/HADOOP-1468
>
> Fixed in Hadoop-2.8.2+ & most likely this is happening because you have
> built something against Hadoop-2.7.3 (Parquet and ORC default to the 2.7.3,
> so you need to rebuild them over for Hadoop3).
>
> Cheers,
> Gopal
>
>
>


Re: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. ORC split generation failed with exception

2018-06-25 Thread Mich Talebzadeh
Thanks

This is Hadoop 3.0.3

hadoop version
Hadoop 3.0.3
Source code repository
https://yjzhan...@git-wip-us.apache.org/repos/asf/hadoop.git -r
37fd7d752db73d984dc31e0cdfd590d252f5e075
Compiled by yzhang on 2018-05-31T17:12Z
Compiled with protoc 2.5.0
From source with checksum 736cdcefa911261ad56d2d120bf1fa
This command was run using
/home/hduser/hadoop-3.0.3/share/hadoop/common/hadoop-common-3.0.3.jar

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 25 Jun 2018 at 18:05, Prasanth Jayachandran <
pjayachand...@hortonworks.com> wrote:

> Hi
> What hadoop version are you using? I am guessing this is from old hadoop
> version. Can you try with hadoop-3.1?
>
> Thanks
> Prasanth
>
>
>
> On Mon, Jun 25, 2018 at 9:55 AM -0700, "Mich Talebzadeh" <
> mich.talebza...@gmail.com> wrote:
>
> Hive version 3
>>
>> An ORC partitioned table
>>
>> 0: jdbc:hive2://rhes75:10099/default> select count(1) from sales;
>> Error: Error while processing statement: FAILED: Execution Error, return
>> code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. ORC split
>> generation failed with exception: java.lang.NoSuchMethodError:
>> org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
>> (state=08S01,code=1)
>>
>> Something is missing here! Is this specific to ORC tables?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. ORC split generation failed with exception

2018-06-25 Thread Mich Talebzadeh
Hive version 3

An ORC partitioned table

0: jdbc:hive2://rhes75:10099/default> select count(1) from sales;
Error: Error while processing statement: FAILED: Execution Error, return
code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. ORC split
generation failed with exception: java.lang.NoSuchMethodError:
org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
(state=08S01,code=1)

Something is missing here! Is this specific to ORC tables?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Update on ORC transactional table fails with org.apache.hadoop.fs.FileStatus.compareTo.. error

2018-06-25 Thread Mich Talebzadeh
Hi,

hadoop version
Hadoop 3.0.3

Hive version
Apache Hive (version 3.0.0)

ORC transactional table is created as follows:

create table t (
 owner   varchar(30)
,object_name varchar(30)
,subobject_name  varchar(30)
,object_id   bigint
,data_object_id  bigint
,object_type varchar(19)
,created timestamp
,last_ddl_time   timestamp
,timestamp2   varchar(19)
,status  varchar(7)
,temporary2  varchar(1)
,generated   varchar(1)
,secondary   varchar(1)
,namespace   bigint
,edition_namevarchar(30)
,padding1varchar(4000)
,padding2varchar(3500)
,attribute   varchar(32)
,op_type int
,op_time timestamp
)
CLUSTERED BY (object_id) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="SNAPPY",
"transactional"="true",
"orc.create.index"="true",
"orc.bloom.filter.columns"="object_id",
"orc.bloom.filter.fpp"="0.05",
"orc.stripe.size"="268435456",
"orc.row.index.stride"="1" )
;

The following a simple update

use asehadoop;
set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=20;
UPDATE t set object_name = 'Mich' WHERE object_id = 594688;

And this is the error I get at the end

Error: Error while processing statement: FAILED: Execution Error, return
code -101 from org.apache.hadoop.hive.ql.exec.StatsTask.
org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
(state=08S01,code=-101)

Appreciate any info.

Regards,

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Hive 3 Thrift server expects Tez to be there and throws error and waits before bouncing back

2018-06-14 Thread Mich Talebzadeh
The problem is that by default Hive runs on MR, although you get a warning that
it is best to use Spark or Tez.

That is fair enough, but Hive 3 expects Tez to be there. Otherwise the
start-up script throws an error and waits a minute before retrying the
connection to the metastore!

2018-06-14 14:48:16,989 INFO  [main] http.HttpServer: Started
HttpServer[hiveserver2] on port 10002
2018-06-14 14:48:16,989 INFO  [main] server.HiveServer2: Web UI has started
on port 10002
2018-06-14 14:48:16,989 INFO  [main] server.HiveServer2: HS2 interactive HA
not enabled. Starting tez sessions..
2018-06-14 14:48:16,989 INFO  [main] server.HiveServer2:
Starting/Reconnecting tez sessions..
2018-06-14 14:48:16,989 INFO  [main] server.HiveServer2: Initializing tez
session pool manager
2018-06-14 14:48:16,993 INFO  [main] server.HiveServer2: Shutting down
HiveServer2
2018-06-14 14:48:16,993 INFO  [main] thrift.ThriftCLIService: Thrift server
has stopped
2018-06-14 14:48:16,993 INFO  [main] service.AbstractService:
Service:ThriftBinaryCLIService is stopped.
2018-06-14 14:48:16,993 INFO  [main] service.AbstractService:
Service:OperationManager is stopped.
2018-06-14 14:48:16,993 INFO  [main] service.AbstractService:
Service:SessionManager is stopped.
2018-06-14 14:48:16,993 INFO  [main] service.AbstractService:
Service:CLIService is stopped.
2018-06-14 14:48:16,993 INFO  [main] metastore.HiveMetaStoreClient: Closed
a connection to metastore, current connections: 0
2018-06-14 14:48:16,993 INFO  [main] service.AbstractService:
Service:HiveServer2 is stopped.
2018-06-14 14:48:16,996 INFO  [main] server.ServerConnector: Stopped
ServerConnector@7b477141{HTTP/1.1,[http/1.1]}{0.0.0.0:10002}
2018-06-14 14:48:16,996 INFO  [main] handler.ContextHandler: Stopped
o.e.j.s.ServletContextHandler@7c0777b5
{/static,jar:file:/data6/hduser/hive-3.0.0/lib/hive-service-3.0.0.jar!/hive-webapps/static,UNAVAILABLE}
2018-06-14 14:48:16,999 INFO  [main] handler.ContextHandler: Stopped
o.e.j.w.WebAppContext@41b13f3d
{/,null,UNAVAILABLE}{jar:file:/data6/hduser/hive-3.0.0/lib/hive-service-3.0.0.jar!/hive-webapps/hiveserver2}
2018-06-14 14:48:17,000 INFO  [main] server.HiveServer2: Web UI has stopped
2018-06-14 14:48:17,000 INFO  [main] server.HiveServer2:
Stopping/Disconnecting tez sessions.
2018-06-14 14:48:17,000 INFO  [main] server.HiveServer2: Stopped tez
session pool manager.
2018-06-14 14:48:17,000 WARN  [main] server.HiveServer2: Error starting
HiveServer2 on attempt 1, will retry in 6ms
java.lang.NoClassDefFoundError: org/apache/tez/dag/api/TezConfiguration
at
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolSession$AbstractTriggerValidator.startTriggerValidator(TezSessionPoolSession.java:74)
at
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.initTriggers(TezSessionPoolManager.java:207)
at
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.startPool(TezSessionPoolManager.java:114)
at
org.apache.hive.service.server.HiveServer2.initAndStartTezSessionPoolManager(HiveServer2.java:831)
at
org.apache.hive.service.server.HiveServer2.startOrReconnectTezSessions(HiveServer2.java:815)
at
org.apache.hive.service.server.HiveServer2.start(HiveServer2.java:739)
at
org.apache.hive.service.server.HiveServer2.startHiveServer2(HiveServer2.java:1014)
at
org.apache.hive.service.server.HiveServer2.access$1800(HiveServer2.java:134)
at
org.apache.hive.service.server.HiveServer2$StartOptionExecutor.execute(HiveServer2.java:1282)
at
org.apache.hive.service.server.HiveServer2.main(HiveServer2.java:1126)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.ClassNotFoundException:
org.apache.tez.dag.api.TezConfiguration

I gather this is a bug, as the Hive install apparently expects either Spark or
Tez to be present as a plugin.

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Hive 3.0 on Hadoop 3.0.3 crashes with org.apache.hadoop.mapreduce.v2.app.MRAppMaster error

2018-06-13 Thread Mich Talebzadeh
Sorted out. I needed to add the following to
${HADOOP_HOME}/etc/hadoop/mapred-site.xml:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 13 June 2018 at 23:49, Mich Talebzadeh  wrote:

>
> *Hadoop 3.0.3Hive (version 3.0.0)*
>
> Running a simple query
>
> select count(1) from sales;
>
> I get the following error in container
>
> Error: Could not find or load main class org.apache.hadoop.mapreduce.
> v2.app.MRAppMaster
>
> The container file launch_container.sh has the following entry
>
> exec /bin/bash -c "$JAVA_HOME/bin/java 
> -Dlog4j.configuration=container-log4j.properties
> -Dyarn.app.container.log.dir=/home/hduser/hadoop-3.0.3/logs/
> userlogs/application_1528926568884_0006/container_1528926568884_0006_02_01
>
>
> -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA
> -Dhadoop.root.logfile=syslog
>
>  -Xmx1024m org.apache.hadoop.mapreduce.v2.app.MRAppMaster
>
> 1>/home/hduser/hadoop-3.0.3/logs/userlogs/application_
> 1528926568884_0006/container_1528926568884_0006_02_01/stdout
> 2>/home/hduser/hadoop-3.0.3/logs/userlogs/application_
> 1528926568884_0006/container_1528926568884_0006_02_01/stderr "
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Hive 3.0 on Hadoop 3.0.3 crashes with org.apache.hadoop.mapreduce.v2.app.MRAppMaster error

2018-06-13 Thread Mich Talebzadeh
*Hadoop 3.0.3Hive (version 3.0.0)*

Running a simple query

select count(1) from sales;

I get the following error in container

Error: Could not find or load main class
org.apache.hadoop.mapreduce.v2.app.MRAppMaster

The container file launch_container.sh has the following entry

exec /bin/bash -c "$JAVA_HOME/bin/java
-Dlog4j.configuration=container-log4j.properties
-Dyarn.app.container.log.dir=/home/hduser/hadoop-3.0.3/logs/userlogs/application_1528926568884_0006/container_1528926568884_0006_02_01


-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA
-Dhadoop.root.logfile=syslog

 -Xmx1024m org.apache.hadoop.mapreduce.v2.app.MRAppMaster

1>/home/hduser/hadoop-3.0.3/logs/userlogs/application_1528926568884_0006/container_1528926568884_0006_02_01/stdout
2>/home/hduser/hadoop-3.0.3/logs/userlogs/application_1528926568884_0006/container_1528926568884_0006_02_01/stderr
"

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Which version of Hive can handle creating an XML table?

2018-06-11 Thread Mich Talebzadeh
Many thanks, but I cannot see any specific product name there?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 June 2018 at 14:10, kristijan berta  wrote:

> The XPath stuff works reasonably well for simple XML files.
>
> However for complex XML files that change frequently and need to be
> ingested in realtime you might look at a 3rd party solution, e.g. here:
> https://dataworkssummit.com/san-jose-2018/session/add-a-spark-to-your-etl/
>
> On Mon, Jun 11, 2018 at 3:05 PM, kristijan berta  wrote:
>
>> thanks Jorn. The only alternative is to use xpath UDF? Works as shown
>> below but tedious
>>
>> Like the example below
>>
>> *$cat employees.xml*
>> <employee>
>> <id>1</id>
>> <name>Satish Kumar</name>
>> <designation>Technical Lead</designation>
>> </employee>
>> <employee>
>> <id>2</id>
>> <name>Ramya</name>
>> <designation>Testing</designation>
>> </employee>
>>
>> *Step:1 Bring each record to one line, by executing below command*
>>
>> $cat employees.xml | tr -d '&' | tr '\n' ' ' | tr '\r' ' ' | sed
>> 's|</employee>|</employee>\n|g' | grep -v '^\s*$' > employees_records.xml
>>
>> *$cat employees_records.xml*
>> <employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical Lead</designation> </employee>
>> <employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> </employee>
>>
>> *tep:2 Load the file to HDFS*
>>
>> *$hadoop fs -mkdir /user/hive/sample-xml-inputs*
>>
>> *$hadoop fs -put employees_records.xml /user/hive/sample-xml-inputs*
>>
>> *$hadoop fs -cat /user/hive/sample-xml-inputs/employees_records.xml*
>> <employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical
>> Lead</designation> </employee>
>> <employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation>
>> </employee>
>>
>> *Step:3 Create a Hive table and point to xml file*
>>
>> *hive>create external table xml_table_org( xmldata string) LOCATION
>> '/user/hive/sample-xml-inputs/';*
>>
>> *hive> select * from xml_table_org;*
>> *OK*
>> <employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical
>> Lead</designation> </employee>
>> <employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation>
>> </employee>
>>
>> *Step 4: From the stage table we can query the elements and load it to
>> other table.*
>>
>> *hive> CREATE TABLE xml_table AS SELECT
>> xpath_int(xmldata,'employee/id'),xpath_string(xmldata,'employee/name'),xpath_string(xmldata,'employee/designation')
>> FROM xml_table_org;*
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 9 June 2018 at 07:42, Jörn Franke  wrote:
>>
>>> Yes.
>>>
>>> Serde must have been removed then in 2.x.
>>>
>>>
>>>
>>> On 8. Jun 2018, at 23:52, Mich Talebzadeh 
>>> wrote:
>>>
>>> Ok I am looking at this jar file
>>>
>>>  jar tf hive-serde-3.0.0.jar|grep -i abstractserde
>>> org/apache/hadoop/hive/serde2/AbstractSerDe.class
>>>
>>> Is this the correct one?
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>

Re: May 2018 Hive User Group Meeting

2018-06-11 Thread Mich Talebzadeh
Yes indeed, I second that.

Regards


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 June 2018 at 09:11,  wrote:

> Thank you very much for sharing Sahil. I found the video of the event and
> the new features of Hive to be very interesting.
>
>
>
> *From:* Sahil Takiar [mailto:takiar.sa...@gmail.com]
> *Sent:* miércoles, 23 de mayo de 2018 23:08
>
> *To:* user@hive.apache.org
> *Cc:* d...@hive.apache.org
> *Subject:* Re: May 2018 Hive User Group Meeting
>
>
>
> Wanted to thank everyone for attending the meetup a few weeks ago, and a
> huge thanks to all of our speakers! Apologies for the delay, but we finally
> have the recording uploaded to YouTube along with all the slides uploaded
> to Slideshare. Below are the links:
>
>
>
> Recording: https://youtu.be/gwX3KpHa2j0
>
>- Hive-on-Spark at Uber: Efficiency & Scale - Xuefu Zhang -
>https://www.slideshare.net/sahiltakiar/hive-on-spark-at-uber-scale
><https://www.slideshare.net/sahiltakiar/hive-on-spark-at-uber-scale>
>- Hive-on-S3 Performance: Past, Present, and Future - Sahil Takiar -
>https://www.slideshare.net/sahiltakiar/hive-ons3-
>performance-past-present-and-future
>
> <https://www.slideshare.net/sahiltakiar/hive-ons3-performance-past-present-and-future>
>- Dali: Data Access Layer at LinkedIn - Adwait Tumbde -
>https://www.slideshare.net/sahiltakiar/dali-data-access-layer
><https://www.slideshare.net/sahiltakiar/dali-data-access-layer>
>- Parquet Vectorization in Hive - Vihang Karajgaonkar -
>https://www.slideshare.net/sahiltakiar/parquet-vectorization-in-hive
><https://www.slideshare.net/sahiltakiar/parquet-vectorization-in-hive>
>- ORC Column Level Encryption - Owen O’Malley -
>https://www.slideshare.net/sahiltakiar/orc-column-encryption
><https://www.slideshare.net/sahiltakiar/orc-column-encryption>
>- Running Hive at Scale @ Lyft - Sharanya Santhanam, Rohit Menon -
>https://www.slideshare.net/sahiltakiar/running-hive-at-scale-lyft
><https://www.slideshare.net/sahiltakiar/running-hive-at-scale-lyft>
>- Materialized Views in Hive - Jesus Camacho Rodriguez -
>https://www.slideshare.net/sahiltakiar/accelerating-
>query-processing-with-materialized-views-in-apache-hive-98333641
>
> <https://www.slideshare.net/sahiltakiar/accelerating-query-processing-with-materialized-views-in-apache-hive-98333641>
>- Hive Metastore Caching - Daniel Dai - https://www.slideshare.net/
>sahiltakiar/hive-metastore-cache
><https://www.slideshare.net/sahiltakiar/hive-metastore-cache>
>- Hive Metastore Separation - Alan Gates - https://www.slideshare.net/
>sahiltakiar/making-the-metastore-standalone
><https://www.slideshare.net/sahiltakiar/making-the-metastore-standalone>
>- Customer Use Cases & Pain Points of (Big) Metadata - Rituparna
>Agrawal - https://www.slideshare.net/sahiltakiar/customer-use-
>cases-pain-points-of-big-metadata
>
> <https://www.slideshare.net/sahiltakiar/customer-use-cases-pain-points-of-big-metadata>
>
> If you have any issues accessing the links, feel free to reach out to me.
>
>
>
> Looking forward to our next Hive Meetup!
>
>
>
> On Mon, May 14, 2018 at 8:45 AM, Sahil Takiar 
> wrote:
>
> Hello,
>
>
>
> Yes, the meetup was recorded. We are in the process of getting it uploaded
> to YouTube. Once it's publicly available I will send out the link on this
> email thread.
>
>
>
> Thanks
>
>
>
> --Sahil
>
>
>
> On Mon, May 14, 2018 at 6:04 AM,  wrote:
>
> Hi,
>
>
>
> If you have recorded the meeting, please share the link. I could not follow it
> online because of the schedule (I live in Spain).
>
>
>
> Kind Regards,
>
>
>
>
>
> *From:* Luis Figueroa [mailto:lef...@outlook.com]
> *Sent:* miércoles, 9 de mayo de 2018 18:01
> *To:* user@hive.apache.org
> *Cc:* d...@hive.apache.org
> *Subject:* Re: May 2018 Hive User Group Meeting
>
>
>
> Hey everyone,
>
>
>
> Was the meeting recorded by any chance?
>
> Luis
>
>
> On May 8, 2018, at 5:31 PM, Sahil Takiar  wrote:
>
> Hey Everyo

reading xml file with xpath into a hive table: does it expect the xml without white space and carriage returns?

2018-06-10 Thread Mich Talebzadeh
Hi,

I have a simple xml file of my own, with records as shown below

<rec><name><fname>Mich</fname><lname>Talebzadeh</lname></name>
<domicile>UK</domicile>
<contact>
<email><personal>mich.talebza...@gmail.com</personal>
<official>m...@peridale.co.uk</official></email>
<phone><mobile>12345</mobile>
<office>12346</office>
<residence>12347</residence></phone></contact></rec>

Try to read it in as follows and I get an error

0: jdbc:hive2://rhes75:10099/default> use test;
OK
No rows affected (0.032 seconds)
0: jdbc:hive2://rhes75:10099/default> drop table if exists xml_temp;
OK
No rows affected (0.229 seconds)
0: jdbc:hive2://rhes75:10099/default> -- create a load table
0: jdbc:hive2://rhes75:10099/default> create table xml_temp (line string);
OK
No rows affected (0.181 seconds)
0: jdbc:hive2://rhes75:10099/default> -- load data from local xml file
0: jdbc:hive2://rhes75:10099/default> load data local inpath
"/home/hduser/dba/bin/xml/test.xml" into table xml_temp;
Loading data to table test.xml_temp
OK
No rows affected (0.487 seconds)
0: jdbc:hive2://rhes75:10099/default> select * from xml_temp;
OK
++
|   xml_temp.line|
++
| MichTalebzadeh |
| UK|
|   |
| mich.talebza...@gmail.com |
| m...@peridale.co.uk   |
| 12345  |
| 12346 |
| 12347 |
++
8 rows selected (0.16 seconds)
0: jdbc:hive2://rhes75:10099/default> select
xpath_string(line,'rec/name/fname')
. . . . . . . . . . . . . . . . . . > ,
xpath(line,'rec/name/lname/text()')
. . . . . . . . . . . . . . . . . . > ,
xpath(line,'rec/domicile/text()')
. . . . . . . . . . . . . . . . . . > ,
xpath(line,'rec/contact/email/personal/text()')
. . . . . . . . . . . . . . . . . . > ,
xpath(line,'rec/contact/email/official/text()')
. . . . . . . . . . . . . . . . . . > ,
xpath(line,'rec/contact/phone/mobile/text()')
. . . . . . . . . . . . . . . . . . > ,
xpath(line,'rec/contact/phone/office/text()')
. . . . . . . . . . . . . . . . . . > ,
xpath(line,'rec/contact/phone/residence/text()')
. . . . . . . . . . . . . . . . . . > from xml_temp;
OK
[Fatal Error] :1:63: XML document structures must start and end within the
same entity.
Error: java.io.IOException:
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method
public org.apache.hadoop.io.Text
org.apache.hadoop.hive.ql.udf.xml.UDFXPathString.evaluate(java.lang.String,java.lang.String)
with arguments
{MichTalebzadeh,rec/name/fname}:Error
loading expression 'rec/name/fname' (state=,code=0)
Closing: 0: jdbc:hive2://rhes75:10099/default


*However, if I go back and get rid of the carriage return in the source xml
file it works!*

<rec><name><fname>Mich</fname><lname>Talebzadeh</lname></name><domicile>UK</domicile>
<contact><email><personal>mich.talebza...@gmail.com</personal><official>m...@peridale.co.uk</official></email>
<phone><mobile>12345</mobile><office>12346</office><residence>12347</residence></phone></contact></rec>
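A minimal way to do that flattening programmatically before the LOAD DATA step
(a sketch, assuming GNU tr; test_oneline.xml is a hypothetical output name).
The xpath UDFs evaluate each table row independently, so every row has to carry
one complete XML document:

# collapse the multi-line record onto a single line
tr -d '\n\r' < /home/hduser/dba/bin/xml/test.xml > /home/hduser/dba/bin/xml/test_oneline.xml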

0: jdbc:hive2://rhes75:10099/default> use test;
OK
No rows affected (0.036 seconds)
0: jdbc:hive2://rhes75:10099/default> drop table if exists xml_temp;
OK
No rows affected (0.222 seconds)
0: jdbc:hive2://rhes75:10099/default> -- create a load table
0: jdbc:hive2://rhes75:10099/default> create table xml_temp (line string);
OK
No rows affected (0.18 seconds)
0: jdbc:hive2://rhes75:10099/default> -- load data from local xml file
0: jdbc:hive2://rhes75:10099/default> load data local inpath
"/home/hduser/dba/bin/xml/test.xml" into table xml_temp;
Loading data to table test.xml_temp
OK
No rows affected (0.407 seconds)
0: jdbc:hive2://rhes75:10099/default> select * from xml_temp;
OK
++
|   xml_temp.line|
++
|
MichTalebzadehUK
mich.talebza...@gmail.comm...@peridale.co.uk123451234612347
|
++
1 row selected (0.192 seconds)
0: jdbc:hive2://rhes75:10099/default> select
xpath_string(line,'rec/name/fname')
. . . . . . . . . . . . . . . . . . > ,
xpath(line,'rec/name/lname/text()')
. . . . . . . . . . . . . . . . . . > ,
xpath(line,'rec/domicile/text()')
. . . . . . . . . . . . . . . . . . > ,
xpath(line,'rec/contact/email/personal/text()')
. . . . . . . . . . . . . . . . . . > ,
xpath(line,'rec/contact/email/official/text()')
. . . . . . . . . . . . . . . . . . > ,
xpath(line,'rec/contact/phone/mobile/text()')
. . . . . . . . . . . . . . . . . . > ,
xpath(line,'rec/contact/phone/office/text()')
. . . . . . . . . . . . . . . . . . > ,
xpath(line,'rec/contact/phone/residence/text()')
. . . . . . . . . . . . . . . . . . > from xml_temp;
OK
+---+-+-++--++++
|  _c0  |   _c1   |   _c2   |  _c3
|   _c4|_c5 |_c6 |_c7 |
+---+-----+-----+----+

Re: Which version of Hive can handle creating XML table?

2018-06-09 Thread Mich Talebzadeh
Thanks Jörn. Is the only alternative to use the xpath UDFs? It works as shown below,
but it is tedious.

Like the example below

*$cat employees.xml*
<employee>
<id>1</id>
<name>Satish Kumar</name>
<designation>Technical Lead</designation>
</employee>
<employee>
<id>2</id>
<name>Ramya</name>
<designation>Testing</designation>
</employee>

*Step:1 Bring each record to one line, by executing below command*

$cat employees.xml | tr -d '&' | tr '\n' ' ' | tr '\r' ' ' | sed
's|</employee>|</employee>\n|g' | grep -v '^\s*$' > employees_records.xml

*$cat employees_records.xml*
<employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical
Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation>
</employee>

*Step:2 Load the file to HDFS*

*$hadoop fs -mkdir /user/hive/sample-xml-inputs*

*$hadoop fs -put employees_records.xml /user/hive/sample-xml-inputs*

*$hadoop fs -cat /user/hive/sample-xml-inputs/employees_records.xml*
<employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical
Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation>
</employee>

*Step:3 Create a Hive table and point to xml file*

*hive>create external table xml_table_org( xmldata string) LOCATION
'/user/hive/sample-xml-inputs/';*

*hive> select * from xml_table_org;*
*OK*
<employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical
Lead</designation> </employee>
<employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation>
</employee>

*Step 4: From the stage table we can query the elements and load it to
other table.*

*hive> CREATE TABLE xml_table AS SELECT
xpath_int(xmldata,'employee/id'),xpath_string(xmldata,'employee/name'),xpath_string(xmldata,'employee/designation')
FROM xml_table_org;*
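A variant of Step 4 with explicit column aliases makes the resulting table
easier to query (a sketch; same xpath UDFs as above):

CREATE TABLE xml_table AS
SELECT xpath_int(xmldata,'employee/id')             AS id,
       xpath_string(xmldata,'employee/name')        AS name,
       xpath_string(xmldata,'employee/designation') AS designation
FROM xml_table_org;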

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 9 June 2018 at 07:42, Jörn Franke  wrote:

> Yes.
>
> Serde must have been removed then in 2.x.
>
>
>
> On 8. Jun 2018, at 23:52, Mich Talebzadeh 
> wrote:
>
> Ok I am looking at this jar file
>
>  jar tf hive-serde-3.0.0.jar|grep -i abstractserde
> org/apache/hadoop/hive/serde2/AbstractSerDe.class
>
> Is this the correct one?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 8 June 2018 at 22:34, Mich Talebzadeh 
> wrote:
>
>> Thanks Jorn so what is the resolution? do I need another jar file?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 8 June 2018 at 21:56, Jörn Franke  wrote:
>>
>>> Ah, I see now that SerDe is a deprecated interface; if I am not wrong it has
>>> been replaced by the abstract class AbstractSerDe.
>>>
>>> On 8. Jun 2018, at 22:22, Mich Talebzadeh 
>>> wrote:
>>>
>>> Thanks Jorn.
>>>
>>> Spark 2.3.3 (labelled as stable)
>>>
>>> First I put the jar file hivexmlserde-1.0.5.3.jar under $HIVE_HOME/lib
>>> and explicitly loaded with ADD JAR as well in hive session
>>>
>>> hive> ADD JAR hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar;
>>> Added 
>>> [/tmp/hive/7feb5165-780b-4ab6-aca8-f516d0388823_resources/hivexmlserde-1.0.5.3.jar]
>>> to class path
>>> Added resources: [hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar]
>>>
>>> Then I ran a simple code given here
>>> <https://github.com/dvasilen/Hive-XML-SerDe/issues/41>
>>>
>>> hive> CREATE  TABLE xml_41 (imap map<string,string>)
>>> > ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
>>>

Re: Which version of Hive can handle creating XML table?

2018-06-08 Thread Mich Talebzadeh
Ok I am looking at this jar file

 jar tf hive-serde-3.0.0.jar|grep -i abstractserde
org/apache/hadoop/hive/serde2/AbstractSerDe.class

Is this the correct one?

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 8 June 2018 at 22:34, Mich Talebzadeh  wrote:

> Thanks Jorn so what is the resolution? do I need another jar file?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 8 June 2018 at 21:56, Jörn Franke  wrote:
>
>> Ah, I see now that SerDe is a deprecated interface; if I am not wrong it has
>> been replaced by the abstract class AbstractSerDe.
>>
>> On 8. Jun 2018, at 22:22, Mich Talebzadeh 
>> wrote:
>>
>> Thanks Jorn.
>>
>> Spark 2.3.3 (labelled as stable)
>>
>> First I put the jar file hivexmlserde-1.0.5.3.jar under $HIVE_HOME/lib
>> and explicitly loaded with ADD JAR as well in hive session
>>
>> hive> ADD JAR hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar;
>> Added 
>> [/tmp/hive/7feb5165-780b-4ab6-aca8-f516d0388823_resources/hivexmlserde-1.0.5.3.jar]
>> to class path
>> Added resources: [hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar]
>>
>> Then I ran a simple code given here
>> <https://github.com/dvasilen/Hive-XML-SerDe/issues/41>
>>
>> hive> CREATE  TABLE xml_41 (imap map<string,string>)
>> > ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
>> > WITH SERDEPROPERTIES (
>> > "column.xpath.imap"="/file-format/data-set/element",
>> > "xml.map.specification.element"="@name->#content"
>> > )
>> > STORED AS
>> > INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
>> > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.
>> IgnoreKeyTextOutputFormat'
>> > TBLPROPERTIES (
>> > "xmlinput.start"="<file-format>",
>> > "xmlinput.end"="</file-format>"
>> > );
>> FAILED: Execution Error, return code 1 from 
>> org.apache.hadoop.hive.ql.exec.DDLTask.
>> org/apache/hadoop/hive/serde2/SerDe
>> And this is full error 2018-06-08T21:17:20,775  INFO
>> [7feb5165-780b-4ab6-aca8-f516d0388823 main] ql.Driver: Starting task
>> [Stage-0:DDL] in serial mode
>> 2018-06-08T21:17:20,776 ERROR [7feb5165-780b-4ab6-aca8-f516d0388823
>> main] exec.DDLTask: java.lang.NoClassDefFoundError:
>> org/apache/hadoop/hive/serde2/SerDe
>> at java.lang.ClassLoader.defineClass1(Native Method)
>> at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>> at java.security.SecureClassLoader.defineClass(SecureClassLoade
>> r.java:142)
>> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:348)
>> at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Co
>> nfiguration.java:2134)
>&

Re: Which version of Hive can handle creating XML table?

2018-06-08 Thread Mich Talebzadeh
Thanks Jorn so what is the resolution? do I need another jar file?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 8 June 2018 at 21:56, Jörn Franke  wrote:

> Ah, I see now that SerDe is a deprecated interface; if I am not wrong it has
> been replaced by the abstract class AbstractSerDe.
>
> On 8. Jun 2018, at 22:22, Mich Talebzadeh 
> wrote:
>
> Thanks Jorn.
>
> Spark 2.3.3 (labelled as stable)
>
> First I put the jar file hivexmlserde-1.0.5.3.jar under $HIVE_HOME/lib and
> explicitly loaded with ADD JAR as well in hive session
>
> hive> ADD JAR hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar;
> Added 
> [/tmp/hive/7feb5165-780b-4ab6-aca8-f516d0388823_resources/hivexmlserde-1.0.5.3.jar]
> to class path
> Added resources: [hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar]
>
> Then I ran a simple code given here
> <https://github.com/dvasilen/Hive-XML-SerDe/issues/41>
>
> hive> CREATE  TABLE xml_41 (imap map<string,string>)
> > ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
> > WITH SERDEPROPERTIES (
> > "column.xpath.imap"="/file-format/data-set/element",
> > "xml.map.specification.element"="@name->#content"
> > )
> > STORED AS
> > INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
> > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.
> IgnoreKeyTextOutputFormat'
> > TBLPROPERTIES (
> > "xmlinput.start"="<file-format>",
> > "xmlinput.end"="</file-format>"
> > );
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask.
> org/apache/hadoop/hive/serde2/SerDe
> And this is full error 2018-06-08T21:17:20,775  INFO
> [7feb5165-780b-4ab6-aca8-f516d0388823 main] ql.Driver: Starting task
> [Stage-0:DDL] in serial mode
> 2018-06-08T21:17:20,776 ERROR [7feb5165-780b-4ab6-aca8-f516d0388823 main]
> exec.DDLTask: java.lang.NoClassDefFoundError:
> org/apache/hadoop/hive/serde2/SerDe
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
> at java.security.SecureClassLoader.defineClass(
> SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(
> Configuration.java:2134)
> at org.apache.hadoop.conf.Configuration.getClassByName(
> Configuration.java:2099)
> at org.apache.hadoop.hive.ql.exec.DDLTask.validateSerDe(
> DDLTask.java:4213)
> at org.apache.hadoop.hive.ql.plan.CreateTableDesc.toTable(
> CreateTableDesc.java:723)
> at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(
> DDLTask.java:4321)
> at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.
> java:354)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
> at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(
> TaskRunner.java:100)
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(
> CliDriver.java:233)
> at o

Re: Which version of Hive can handle creating XML table?

2018-06-08 Thread Mich Talebzadeh
/spss/hive/
com/ibm/spss/hive/serde2/
com/ibm/spss/hive/serde2/xml/
com/ibm/spss/hive/serde2/xml/objectinspector/
com/ibm/spss/hive/serde2/xml/processor/
com/ibm/spss/hive/serde2/xml/processor/java/
com/ibm/spss/hive/serde2/xml/HiveXmlRecordReader.class
com/ibm/spss/hive/serde2/xml/objectinspector/XmlListObjectInspector.class
com/ibm/spss/hive/serde2/xml/objectinspector/XmlMapObjectInspector.class
com/ibm/spss/hive/serde2/xml/objectinspector/XmlObjectInspectorFactory$1.class
com/ibm/spss/hive/serde2/xml/objectinspector/XmlObjectInspectorFactory.class
com/ibm/spss/hive/serde2/xml/objectinspector/XmlStructObjectInspector$1.class
com/ibm/spss/hive/serde2/xml/objectinspector/XmlStructObjectInspector.class
com/ibm/spss/hive/serde2/xml/processor/AbstractXmlProcessor$1.class
com/ibm/spss/hive/serde2/xml/processor/AbstractXmlProcessor$2.class
com/ibm/spss/hive/serde2/xml/processor/AbstractXmlProcessor.class
com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlProcessor$1.class
com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlProcessor$2.class
com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlProcessor.class
com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlQuery.class
com/ibm/spss/hive/serde2/xml/processor/java/NodeArray.class
com/ibm/spss/hive/serde2/xml/processor/SerDeArray.class
com/ibm/spss/hive/serde2/xml/processor/XmlMapEntry.class
com/ibm/spss/hive/serde2/xml/processor/XmlMapFacet$Type.class
com/ibm/spss/hive/serde2/xml/processor/XmlMapFacet.class
com/ibm/spss/hive/serde2/xml/processor/XmlNode$1.class
com/ibm/spss/hive/serde2/xml/processor/XmlNode$2.class
com/ibm/spss/hive/serde2/xml/processor/XmlNode.class
com/ibm/spss/hive/serde2/xml/processor/XmlNodeArray.class
com/ibm/spss/hive/serde2/xml/processor/XmlProcessor.class
com/ibm/spss/hive/serde2/xml/processor/XmlProcessorContext.class
com/ibm/spss/hive/serde2/xml/processor/XmlQuery.class
com/ibm/spss/hive/serde2/xml/processor/XmlTransformer.class
com/ibm/spss/hive/serde2/xml/processor/XmlUtils$1.class
com/ibm/spss/hive/serde2/xml/processor/XmlUtils.class
com/ibm/spss/hive/serde2/xml/SplittableXmlInputFormat.class
com/ibm/spss/hive/serde2/xml/XmlInputFormat$XmlRecordReader.class
com/ibm/spss/hive/serde2/xml/XmlInputFormat.class
com/ibm/spss/hive/serde2/xml/XmlSerDe$1.class
com/ibm/spss/hive/serde2/xml/XmlSerDe.class
META-INF/maven/
META-INF/maven/com.ibm.spss.hive.serde2.xml/
META-INF/maven/com.ibm.spss.hive.serde2.xml/hivexmlserde/
META-INF/maven/com.ibm.spss.hive.serde2.xml/hivexmlserde/pom.xml
META-INF/maven/com.ibm.spss.hive.serde2.xml/hivexmlserde/pom.properties



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 8 June 2018 at 17:58, Jörn Franke  wrote:

> Can you get the log files and start Hive with more detailed logs?
> It could be that not all libraries are loaded (I don't remember anymore,
> but I think this one needs more; I can look next week in my docs) or that
> it does not support maps (not sure).
> You can try first with a simpler extraction with a String field to
> see if it works.
>
> Hive has always had external libraries for xml support and I used the one
> below with Hive 1.x, but it should also work with 2.x (3 not sure, but it
> should if it works in 2.x)
>
>
> On 8. Jun 2018, at 17:53, Mich Talebzadeh 
> wrote:
>
> I tried Hive 2.0.1, 2.3.2 and now Hive 3.
>
> I explicitly added hivexmlserde  jar file as ADD JAR shown below
>
> 0: jdbc:hive2://rhes75:10099/default> ADD JAR hdfs://rhes75:9000/jars/
> hivexmlserde-1.0.5.3.jar;
> No rows affected (0.002 seconds)
>
> But still cannot create an xml table
>
> 0: jdbc:hive2://rhes75:10099/default> CREATE  TABLE xml_41 (imap
> map<string,string>) ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
> WITH SERDEPROPERTIES ("column.xpath.imap"="/file-
> format/data-set/element","xml.map.specification.element"="@name->#content")
> STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
> TBLPROPERTIES ("xmlinput.start"="<file-format>","xmlinput.end"="</file-format>");
>
> Error: Error while processing statement: FAILED: Execution Error, return
> code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
> org/apache/hadoop/hive/serde2/SerDe (state=08S01,code=1)
>
> Does anyone know the cause of this or whi

Re: issues with Hive 3 simple select from an ORC table

2018-06-08 Thread Mich Talebzadeh
Hi Owen,

It is 2.7.3

 hadoop version
Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
baa91f7c6bc9cb92be5982de4719c1c8af91ccff
Compiled by root on 2016-08-18T01:41Z
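One way to confirm the binary mismatch (a sketch; javap ships with the JDK and
the jar path below may differ on your install): Hive 3 is built against Hadoop
3, where FileStatus gained a compareTo(FileStatus) overload, while Hadoop 2.7.x
only exposes compareTo(Object).

# list the compareTo signatures the running Hadoop actually provides
javap -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-2.7.3.jar \
      org.apache.hadoop.fs.FileStatus | grep compareTo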


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 8 June 2018 at 16:59, Owen O'Malley  wrote:

> This looks like there is an API incompatibility between the version of
> hadoop expected and the version used. Which version of hadoop are you using?
>
> .. Owen
>
> On Jun 8, 2018, at 08:31, Mich Talebzadeh 
> wrote:
>
> Just installed and upgraded to Hive 3, where the fun and games started :)
>
> First I had to set below as per this JIRA
> <https://www.mail-archive.com/issues@hive.apache.org/msg121026.html> to
> make hive server 2 start and stay up
>
> setting hive.metastore.event.db.notification.api.auth to false on
> hive-site.xml
>
> Now when I connect via beeline I see this error
>
>
> *0: jdbc:hive2://rhes75:10099/default> select * from sales limit 10;*
> Error: java.io.IOException: java.lang.RuntimeException: ORC split
> generation failed with exception: java.lang.NoSuchMethodError:
> org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
> (state=,code=0)
>
> Table is an ORC table as follows and used to work fine
>
> 0: jdbc:hive2://rhes75:10099/default> desc formatted sales;
> +---+---
> -+--+
> |   col_name|
> data_type  |
> comment|
> +---+---
> -+--+
> | # col_name| data_type
> | comment  |
> | prod_id   | bigint
> |  |
> | cust_id   | bigint
> |  |
> | time_id   | timestamp
> |  |
> | channel_id| bigint
> |  |
> | promo_id  | bigint
> |  |
> | quantity_sold | decimal(10,0)
> |  |
> | amount_sold   | decimal(10,0)
> |  |
> |   | NULL
> | NULL |
> | # Partition Information   | NULL
> | NULL |
> | # col_name| data_type
> | comment  |
> | year  | int
> |  |
> | month | int
> |  |
> |   | NULL
> | NULL |
> | # Detailed Table Information  | NULL
> | NULL |
> | Database: | oraclehadoop
> | NULL |
> | OwnerType:| USER
> | NULL |
> | Owner:| hduser
> | NULL |
> | CreateTime:   | Wed May 31 16:31:47 BST
> 2017   | NULL
> |
> | LastAccessTime:   | UNKNOWN
> | NULL |
> | Retention:| 0
> | NULL |
> | Location: | hdfs://rhes75:9000/user/hive/
> warehouse/oraclehadoop.db/sales | NULL
> |
> | Table Type:   | MANAGED_TABLE
> | NULL |
> | Table Parameters: | NULL
> | NULL |
> |   | COLUMN_STATS_ACCURATE
> | {\"BASIC_STATS\":\"true\"}   |
> |   | numFiles
> | 12544  

Which version of Hive can handle creating XML table?

2018-06-08 Thread Mich Talebzadeh
I tried Hive 2.0.1, 2.3.2 and now Hive 3.

I explicitly added the hivexmlserde jar file with ADD JAR as shown below

0: jdbc:hive2://rhes75:10099/default> ADD JAR
hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar;
No rows affected (0.002 seconds)

But still cannot create an xml table

0: jdbc:hive2://rhes75:10099/default> CREATE  TABLE xml_41 (imap
map<string,string>) ROW FORMAT SERDE
'com.ibm.spss.hive.serde2.xml.XmlSerDe' WITH SERDEPROPERTIES
("column.xpath.imap"="/file-format/data-set/element","xml.map.specification.element"="@name->#content")
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES
("xmlinput.start"="<file-format>","xmlinput.end"="</file-format>");

Error: Error while processing statement: FAILED: Execution Error, return
code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
org/apache/hadoop/hive/serde2/SerDe (state=08S01,code=1)

Does anyone know the cause of this or which version of Hive supports
creating an XML table?

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
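The NoClassDefFoundError itself points at the cause: the packaged XmlSerDe
still references the old org.apache.hadoop.hive.serde2.SerDe interface, which
later Hive releases replaced with the AbstractSerDe base class. A quick way to
verify what the jar was compiled against (a sketch; javap ships with the JDK):

javap -classpath hivexmlserde-1.0.5.3.jar com.ibm.spss.hive.serde2.xml.XmlSerDe \
      | grep -iE 'implements|extends'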


issues with Hive 3 simple select from an ORC table

2018-06-08 Thread Mich Talebzadeh
  |
true |
|   |
orc.row.index.stride   |
1|
|   |
orc.stripe.size|
268435456|
|   |
rawDataSize|
271538264|
|   |
totalSize  |
77602053 |
|   |
transient_lastDdlTime  |
1496244707   |
|   |
NULL   |
NULL |
| # Storage Information |
NULL   |
NULL |
| SerDe Library:|
org.apache.hadoop.hive.ql.io.orc.OrcSerde  |
NULL |
| InputFormat:  |
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat|
NULL |
| OutputFormat: |
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat   |
NULL |
| Compressed:   |
No |
NULL |
| Num Buckets:  |
256|
NULL |
| Bucket Columns:   | [prod_id, cust_id, time_id, channel_id,
promo_id]  | NULL |
| Sort Columns: |
[] |
NULL |
| Storage Desc Params:  |
NULL   |
NULL |
|   |
serialization.format   |
1|
+---++--+
48 rows selected (0.561 seconds)



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Oracle 11g Hive 2.1 metastore backend

2018-06-06 Thread Mich Talebzadeh
My Hive is 2.3.2

my Oracle is 12c

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application
Testing options

and these are hive connections

sessions as on JUN 06 2018 08:21 PM

LOGIN   SID/serial# LOGGED IN S HOST   OS PID Client
PID PROGRAM   MEM/KB  Logical I/O Physical I/O
--- --- --- -- --
-- ---   
ACT INFO
--- ---
HIVEUSER46,3413906/06 07:29 rhes75 oracle/28441
hduser/1234JDBC Thin Clien1,088   460
N
HIVEUSER325,856906/06 08:01 rhes75 oracle/28748
hduser/1234JDBC Thin Clien1,088   440
N
HIVEUSER407,64925   06/06 07:29 rhes75 oracle/28437
hduser/1234JDBC Thin Clien1,088   440
N


I have no issues

Is this your issue?

Caused by: MetaException(message:java.lang.ClassCastException:
org.datanucleus.store.rdbms.mapping.datastore.ClobImpl cannot be cast to
oracle.sql.CLOB)

HTH,

Mich

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 6 June 2018 at 05:10, Arjun kr  wrote:

> Hi All,
>
> Is anyone using Oracle 11g configured as Hive 2.1 metastore backend? I'm
> encountering below exception with Oracle 11g configured as Hive 2.1
> metastore backend. Any help would be appreciated.
>
> 2018-05-23T13:05:03,219 DEBUG [main] transport.TSaslTransport: CLIENT:
> reading data length: 211
> 2018-05-23T13:05:03,220 DEBUG [main] transport.TSaslTransport: data length
> after unwrap: 179
> 2018-05-23T13:05:03,245 ERROR [main] exec.DDLTask:
> org.apache.hadoop.hive.ql.metadata.HiveException:
> MetaException(message:java.lang.ClassCastException:
> org.datanucleus.store.rdbms.mapping.datastore.ClobImpl cannot be cast to
> oracle.sql.CLOB)
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:842)
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:847)
> at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3992)
> at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:332)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:197)
> at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(
> TaskRunner.java:100)
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2074)
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1745)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1454)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1172)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1162)
> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(
> CliDriver.java:234)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:185)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:401)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:337)
> at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:435)
> at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:451)
> at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:763)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:729)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:652)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:647)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> Caused by: MetaException(message:java.lang.ClassCastException:
> org.datanucleus.store.rdbms.mapping.datastore.ClobImpl cannot be cast to
> oracle.sql.CLOB)
> at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_
> table_with_environment_context_result$create_table_
> with_environment_context_resultStandardScheme.read(
> ThriftHiveMetastore.java:41498)
> at org.apache.hadoop.hive.metastore.api.

Re: Cannot send metadata info from Hive 2.0.1 to Hive metastore on Oracle 12c

2018-05-10 Thread Mich Talebzadeh
upgraded from hive 2.0.1 to hive 2.3.2 and it works now!

0: jdbc:hive2://rhes564:10099/default> create table abc(col1 int);
No rows affected (1.21 seconds)
0: jdbc:hive2://rhes564:10099/default> desc abc;
+---++--+
| col_name  | data_type  | comment  |
+---++--+
| col1  | int|  |
+---++--+
1 row selected (0.147 seconds)

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 10 May 2018 at 18:44, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> Running Hive 2.0.1 on Oracle 12c, it seems that Hive cannot perform insert
> metadata info into metadata table "HIVEUSER"."TBLS"
>
> When I try to create a simple table in Hive. It sends null values to
> Oracle metastore.
>
> 0: jdbc:hive2://rhes564:10099/default> use test2;
>
> 0: jdbc:hive2://rhes564:10099/default> create table michboy(col1 int,
> col2 varchar(30));
>
> ERROR : FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.DDLTask. 
> MetaException(message:javax.jdo.JDODataStoreException:
> Insert of object "org.apache.hadoop.hive.metastore.model.MTable@6b511a66"
> using statement "INSERT INTO TBLS
> *(TBL_ID,CREATE_TIME,DB_ID,LAST_ACCESS_TIME,OWNER,RETENTION,SD_ID,TBL_NAME,TBL_TYPE,VIEW_EXPANDED_TEXT,VIEW_ORIGINAL_TEXT)
> VALUES (?,?,?,?,?,?,?,?,?,?,?)" failed : ORA-01400: cannot insert NULL into
> ("HIVEUSER"."TBLS"."IS_REWRITE_ENABLED")*
>
> java.sql.SQLIntegrityConstraintViolationException: ORA-01400: cannot
> insert NULL into ("HIVEUSER"."TBLS"."IS_REWRITE_ENABLED")
>
> Any ideas what can cause this?
>
> The version of Hive in Oracle hive metadata is:
>
> select * from version;
> VER_ID, SCHEMA_VERSION, VERSION_COMMENT
> 1   2.0.0   Hive release version 2.0.0
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>


Cannot send metadata info from Hive 2.0.1 to Hive metastore on Oracle 12c

2018-05-10 Thread Mich Talebzadeh
Hi,

Running Hive 2.0.1 on Oracle 12c, it seems that Hive cannot insert
metadata into the metastore table "HIVEUSER"."TBLS".

When I try to create a simple table in Hive, it sends null values to the Oracle
metastore.

0: jdbc:hive2://rhes564:10099/default> use test2;

0: jdbc:hive2://rhes564:10099/default> create table michboy(col1 int, col2
varchar(30));

ERROR : FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:javax.jdo.JDODataStoreException: Insert of object
"org.apache.hadoop.hive.metastore.model.MTable@6b511a66" using statement
"INSERT INTO TBLS
*(TBL_ID,CREATE_TIME,DB_ID,LAST_ACCESS_TIME,OWNER,RETENTION,SD_ID,TBL_NAME,TBL_TYPE,VIEW_EXPANDED_TEXT,VIEW_ORIGINAL_TEXT)
VALUES (?,?,?,?,?,?,?,?,?,?,?)" failed : ORA-01400: cannot insert NULL into
("HIVEUSER"."TBLS"."IS_REWRITE_ENABLED")*

java.sql.SQLIntegrityConstraintViolationException: ORA-01400: cannot insert
NULL into ("HIVEUSER"."TBLS"."IS_REWRITE_ENABLED")

Any ideas what can cause this?

The version of Hive in Oracle hive metadata is:

select * from version;
VER_ID, SCHEMA_VERSION, VERSION_COMMENT
1   2.0.0   Hive release version 2.0.0

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
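It is also worth checking that the metastore schema in Oracle matches the Hive
release in use; the IS_REWRITE_ENABLED column in TBLS belongs to a newer schema
than 2.0, so a mixed-version install hits exactly this kind of NOT NULL
violation. A sketch using the bundled schematool (connection details are read
from hive-site.xml):

$HIVE_HOME/bin/schematool -dbType oracle -info           # report the schema version Oracle holds
$HIVE_HOME/bin/schematool -dbType oracle -upgradeSchema  # bring the schema up to the installed Hive version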


What has changed in Hive 2.3.2 that it cannot use the Spark engine?

2018-05-04 Thread Mich Talebzadeh
Hi,

my Hive 2.0.1 works fine on Spark 1.3.1 engine.

select count(1) from sales;
Starting Spark Job = dc529d0e-e2d2-431f-8c17-f7867858217f
Query Hive on Spark job[5] stages:
10
11
Status: Running (Hive on Spark job[5])

However, once I upgraded Hive to version 2.3.2, it stopped working!

Any ideas what has changed?

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Query failing in Hive 2.2

2018-01-20 Thread Mich Talebzadeh
Your Hive metadata reflects the new data type of the column changed from
type A to type B.

The old partitions have data stored as type A. These have not changed.

If you run the query on an old partition you would probably need to use the
CAST function to change the column from B back to A so it is compatible with
what is stored in the old partitions. That, I believe, will work.
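A sketch of that workaround (column, type and partition names here are
hypothetical; substitute the column whose type was changed and its original
type):

SELECT CAST(changed_col AS STRING) AS changed_col
FROM   my_table
WHERE  partition_date < '2017-12-01';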

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 19 January 2018 at 21:37, Raghuraman Murugaiyan <
ragu.hive.engin...@gmail.com> wrote:

> Hi All,
>
> I have a table partitioned on a date column. We have changed the data type
> of one of the field for the older partitions and the data type of the newer
> partitions will remain the same as the table. When I tried to run a simple
> Select query on the older partitions , the query is failing with the below
> error :
>
>
> 2018-01-19 16:17:58,766 FATAL [IPC Server handler 23 on 38645]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
> attempt_1497186993127_459726_m_000112_0 - exited : java.io.IOException:
> java.lang.reflect.InvocationTargetException
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.
> handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.
> handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
> at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.
> initNextRecordReader(HadoopShimsSecure.java:269)
> at org.apache.hadoop.hive.shims.HadoopShimsSecure$
> CombineFileRecordReader.(HadoopShimsSecure.java:216)
> at org.apache.hadoop.hive.shims.HadoopShimsSecure$
> CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:343)
> at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(
> CombineHiveInputFormat.java:681)
> at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<
> init>(MapTask.java:169)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1656)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.
> initNextRecordReader(HadoopShimsSecure.java:255)
> ... 11 more
> Caused by: java.io.IOException: Unknown encoding kind: DICTIONARY_V2
> dictionarySize: 6
>  in column 219
>
>
> Can you help me debug this error ? I am using Hive 2.2.
>
> Regards,
> Raghu M
>


Hive 2.3.2 does not execute on Spark engine anymore

2017-12-28 Thread Mich Talebzadeh
Hi,

My previous Hive 2.0 used to work with Spark 1.3.1 as its execution engine.

I recently upgraded Hive to 2.3.2 and it fails to start with spark as its
execution engine as follows:

Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Failed to execute spark task, with exception
'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark
client.)'
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.spark.SparkTask. Failed to create spark
client.
I have looked around and this seems to be a known issue.

As a matter of interest, has anyone tested whether Hive 2.3.2 runs with any
version of the Spark engine?

P.S. Not interested about TEZ LLAP at this stage.

Thanks
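One quick check is to see which Spark version a given Hive release was built
against; the root pom.xml of the Hive source distribution should carry a
spark.version property (a sketch; the source path is hypothetical):

grep -m1 "<spark.version>" apache-hive-2.3.2-src/pom.xml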





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: partitioned hive table

2017-10-30 Thread Mich Talebzadeh
Have you analyzed the table for that partition?

ANALYZE TABLE test_table PARTITION(dt='2017-08-28', bar='hello') COMPUTE
STATISTICS;

and then run count(*) on the table.

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 30 October 2017 at 22:29, Jiewen Shao <fifistorm...@gmail.com> wrote:

> Hi, I have persisted lots of JSON files on S3 under partitioned
> directories such as /bucket/table1/dt=2017-10-28/bar=hello/*
>
> 1. Now I created a hive table:
> CREATE EXTERNAL TABLE table1 ( )
> PARTITIONED BY (dt string, bar string) ROW FORMAT serde
> 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION 's3://bucket/table1';
>
> 2. Under hive commandline
> select * from table1;// return nothing
>
> 3. INSERT INTO TABLE  table1 PARTITION (dt='2017-08-28', bar='hello')
> select ;
>
> 4. now select * from table1;// return all the data from that partition
>
> 5. select count(*) from table1;  // returns 1
>
> Can someone explain what did  I miss?
>
> Thanks a lot!
>
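For data that already sits under partitioned directories on S3 (step 1 above),
the metastore does not know those partitions exist until they are registered,
which is why the first select returned nothing and why the INSERT (which
registers its own partition) made data appear. A sketch of the usual two
options:

-- register every dt=/bar= directory found under the table location
MSCK REPAIR TABLE table1;

-- or register a single partition explicitly
ALTER TABLE table1 ADD IF NOT EXISTS PARTITION (dt='2017-10-28', bar='hello')
  LOCATION 's3://bucket/table1/dt=2017-10-28/bar=hello/';

The count(*) of 1 may also be answered from stale statistics rather than a
scan if hive.compute.query.using.stats is enabled.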


Can one classify Hive as an analytical tool besides storage?

2017-08-14 Thread Mich Talebzadeh
This may be a rather moot point.

We had a discussion in the office about Hive and Impala!

A colleague insisted that Hive and Impala are storage tools. He changed his
mind when I mentioned that one can use Impala on Hive tables for faster
query access.

With regard to Hive, my view is that Hive supports HQL, which in turn has
analytical functions such as RANK built in. So in effect it is not only
storage; it can be used as an analytical tool as well?
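For example, a windowed ranking runs entirely in HiveQL (a sketch against a
hypothetical sales table):

SELECT prod_id,
       amount_sold,
       RANK() OVER (PARTITION BY prod_id ORDER BY amount_sold DESC) AS sale_rank
FROM sales;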

What are your views?

Thanks,

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Hive query on ORC table is really slow compared to Presto

2017-06-21 Thread Mich Talebzadeh
With ORC tables have you tried

set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
SET hive.exec.parallel=true;
--
set hive.optimize.ppd=true;
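To confirm that a query actually vectorizes once those settings are in place,
newer Hive releases (2.3 onwards, if I remember correctly) support an EXPLAIN
variant (a sketch; the table name is hypothetical):

EXPLAIN VECTORIZATION SUMMARY
SELECT count(1) FROM my_orc_table;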

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 June 2017 at 18:24, Premal Shah <premal.j.s...@gmail.com> wrote:

> Gopal,
> Thanx for the debugging steps.
>
> Here's the output
>
> *hive> select count(1) as collisions, hash(ip) from table group by
> hash(ip) order by collisions desc limit 10;*
>
> 4   -1432955330
> 4   -317748560
> 4   -1460629578
> 4   1486313154
> 4   -320519155
> 4   1875999753
> 4   -1410139032
> 4   1596671554
> 4   503687909
> 4   989075923
>
>
> *hive> select count(1) as collisions, hash(id) from table group by
> hash(id) order by collisions desc limit 10;*
>
> 1711647 -1032220119
> 1439738 -1316837863
> 851204  -330948739
> 838145  535385402
> 512621  165206418
> 308968  -36549075
> 306190  -1568034366
> 302932  -1386594327
> 284935  -1991768757
> 218979  922811836
>
>
> *hive> select count(1) as collisions, hash(name) from table group by
> hash(name) order by collisions desc limit 10;*
>
> 1712041 -1906315012
> 1439738 -1583171535
> 512630  193448621
> 340485  2094529
> 308988  68745436
> 306240  79997099
> 289465  -1824055323
> 218263  1074334059
> 216464  -466945424
>
>
> Turning off map side aggregations definitely helped the query on *id . *The
> query time went to 1 minute from the earlier 3+ hours.
>
> Based on the output above, both id and name have a lot of collisions, but
> the name query was fast earlier too which is interesting.
>
>
>
>
> On Wed, Jun 14, 2017 at 10:34 AM, Gopal Vijayaraghavan <gop...@apache.org>
> wrote:
>
>>
>> > SELECT COUNT(DISTINCT ip) FROM table - 71 seconds
>> > SELECT COUNT(DISTINCT id) FROM table - 12,399 seconds
>>
>> Ok, I misunderstood your gist.
>>
>> > While ip is more unique that id, ip runs many times faster than id.
>> >
>> > How can I debug this ?
>>
>> Nearly the same way - just replace "ip" with "id" in my exploratory
>> queries.
>>
>> count(distinct hash(id)) from the table?
>>
>> count count(1) as collisions, hash(id) from table group by hash(id) order
>> by collisions desc limit 10;
>>
>> And, if those show many collisions
>>
>> set tez.runtime.io.sort.mb=640;
>> set hive.map.aggr=false;
>> set tez.runtime.pipelined.shuffle=true; // this reduces failure
>> tolerance (i.e retries are more expensive, happy path is faster)
>>
>> select count(distinct id) from ip_table;
>>
>> Java's hashCode() implementation is pretty horrible (& Hive defaults to
>> using it). If you're seeing a high collision count, I think I might know
>> what's happening here.
>>
>> Cheers,
>> Gopal
>>
>>
>>
>
>
> --
> Regards,
> Premal Shah.
>
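A compact way to quantify the collisions Gopal describes (a sketch; my_table
and id stand in for the real table and column):

SELECT count(DISTINCT id)       AS distinct_ids,
       count(DISTINCT hash(id)) AS distinct_hashes
FROM my_table;

A distinct_hashes figure far below distinct_ids means the hash buckets
degenerate, which is exactly the case where hive.map.aggr=false helps.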


Re: setup spark engine to hive, the hive version and spark build problem

2017-06-19 Thread Mich Talebzadeh
Just to clarify, you mean:

Hive 2.1.1 works with Spark engine 2.X

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 19 June 2017 at 06:53, vergil <vergilc...@foxmail.com> wrote:

> Hi, dear friend.
> 1. After trying it and reading the source code, I find that Hive on Spark 2.x
> does work on the Hive 2.2.0 and 2.2.1 versions, and on the 2.4.x versions that
> include this commit: https://github.com/apache/hive/commit/
> ac977cc88757b49fbbd5c3bb236adcedcaae396c. I use the 2.2.0 version now.
> 2. For Hive on Spark, according to the Hive source code, you can just build
> Spark without the Hive profile and add the essential Spark jars and the Scala
> library into your Hive lib folder, so there is no need to deploy the whole
> Spark distribution. This can meet your demands.
> 3. For Spark SQL, deploying another standard build with Hive alongside can
> meet your Spark SQL demand.
> I hope it can help you.
>
> --
> The harder, more fortunate
>
>


Re: setup spark engine to hive ,the hive version and spark build problem

2017-06-17 Thread Mich Talebzadeh
The only way I managed to get this to work (and I have not tried newer builds
yet) was Hive 2.0.1 on the Spark 1.3.1 engine.

Connected to: Apache Hive (version 2.0.1

0: jdbc:hive2://rhes564:10099/default> select count(1) from passwd;
INFO  : Compiling
command(queryId=hduser_20170617175040_5d4a85d9-a306-4cb3-a191-e675cdd21dcf):
select count(1) from passwd
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0,
type:bigint, comment:null)], properties:null)
INFO  : Completed compiling
command(queryId=hduser_20170617175040_5d4a85d9-a306-4cb3-a191-e675cdd21dcf);
Time taken: 0.064 seconds
INFO  : Executing
command(queryId=hduser_20170617175040_5d4a85d9-a306-4cb3-a191-e675cdd21dcf):
select count(1) from passwd
INFO  : Query ID =
hduser_20170617175040_5d4a85d9-a306-4cb3-a191-e675cdd21dcf
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  :
*Query Hive on Spark job[0] stages:*
INFO  : 0
INFO  : 1
INFO  :
*Status: Running (Hive on Spark job[0])*

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 17 June 2017 at 17:33, Boris Lublinsky <boris.lublin...@lightbend.com>
wrote:

> You need to explicitly build Spark without Hive. Look at the getting started
> doc.
>
> Get Outlook for Android <https://aka.ms/ghei36>
>
>
>
>
> On Sat, Jun 17, 2017 at 5:26 AM -0400, "wuchang" <583424...@qq.com> wrote:
>
> I want to build Hive and Spark so that my Hive runs on the Spark engine. I
>> chose Hive 2.3.0 and Spark 2.0.0, which are claimed to be *compatible*
>> by the official Hive document. According to the official Hive document, I
>> have to build Spark *without the hive profile* to avoid the conflict between
>> the original Hive and the Spark-integrated Hive. Yes, I built it successfully,
>> but then the problem comes: *I cannot use spark-sql anymore* because *spark-sql
>> relies on the Hive library and my Spark is a no-hive build.*
>>
>> [appuser@ab-10-11-22-209 spark]$ spark-sql
>> java.lang.ClassNotFoundException: org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
>>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>   at java.lang.Class.forName0(Native Method)
>>   at java.lang.Class.forName(Class.java:348)
>>   at org.apache.spark.util.Utils$.classForName(Utils.scala:225)
>>   at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:686)
>>   at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
>>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
>>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
>>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>> Failed to load main class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
>> You need to build Spark with *-Phive and -Phive-thriftserver.*
>> How can I build and set up Spark so that Hive on Spark works properly
>> and my spark-sql, pyspark and spark-shell also work properly?
>>
>>
>> I don't know the relationship between the *spark-integrated hive* and the
>> *original hive*. Below are the spark-integrated Hive jars:
>>
>> hive-beeline-1.2.1.spark2.jar
>> hive-cli-1.2.1.spark2.jar
>> hive-exec-1.2.1.spark2.jar
>> hive-jdbc-1.2.1.spark2.jar
>> hive-metastore-1.2.1.spark2.jar
>> spark-hive_2.11-2.0.0.jar
>> spark-hive-thriftserver_2.11-2.0.0.jar
>>
>>
>> It seems that Spark 2.0.0 relies on hive 1.2.1.
>>
>>
>>


Re: Pro and Cons of using HBase table as an external table in HIVE

2017-06-07 Thread Mich Talebzadeh
As far as I know, accessing HBase from Hive can only be done through a Hive
external table using the HBase storage handler.

Example:

hive>  create external table MARKETDATAHBASE (key STRING, TICKER STRING,
TIMECREATED STRING, PRICE STRING)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH
SERDEPROPERTIES ("hbase.columns.mapping" =
":key,PRICE_INFO:TICKER,PRICE_INFO:TIMECREATED,PRICE_INFO:PRICE")

TBLPROPERTIES ("hbase.table.name" = "MARKETDATAHBASE");


The problem here is that, like most Hive external tables, you are creating a
pointer to HBase via the Hive storage handler, and there is very little
optimization that can be done.


In all probability you would be better off using Apache Phoenix on top of
HBase with Phoenix secondary indexes. Granted, the SQL capability in Phoenix
may not be as good as Hive's, but it should do for most purposes.


In Phoenix you can do:



CREATE TABLE MARKETDATAHBASE (PK VARCHAR PRIMARY KEY, PRICE_INFO.TICKER
VARCHAR, PRICE_INFO.TIMECREATED VARCHAR, PRICE_INFO.PRICE VARCHAR);
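
A rough sketch of the secondary-index idea follows (the index name and the
choice of indexed and covered columns are purely illustrative): the index lets
queries that filter on TICKER be served without a full scan of the data table.

CREATE INDEX IDX_MARKETDATA_TICKER ON MARKETDATAHBASE (PRICE_INFO.TICKER)
INCLUDE (PRICE_INFO.PRICE);

SELECT PK, PRICE_INFO.PRICE FROM MARKETDATAHBASE
WHERE PRICE_INFO.TICKER = 'IBM';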



HTH,

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 7 June 2017 at 11:13, Ramasubramanian Narayanan <
ramasubramanian.naraya...@gmail.com> wrote:

> Hi,
>
> Can you please let us know the pros and cons of using an HBase table as an
> external table in Hive.
>
> Will there be any performance degradation when using Hive over HBase instead
> of using a direct Hive table?
>
> The tables that I am planning to use in HBase will be master tables like
> account and customer. We want to achieve Slowly Changing Dimensions. Please
> throw some light on that too if you have done any such implementations.
>
> Thanks and Regards,
> Rams
>


Re: Jimmy Xiang now a Hive PMC member

2017-05-25 Thread Mich Talebzadeh
Best wishes in the new role Jimmy.

Regards,

Mich

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 25 May 2017 at 20:27, Vaibhav Gumashta <vgumas...@hortonworks.com> wrote:

> Congrats Jimmy!
>
> -Vaibhav
>
> On 5/25/17, 10:48 AM, "Vineet Garg" <vg...@hortonworks.com> wrote:
>
> >Congrats Jimmy!
> >
> >> On May 24, 2017, at 9:16 PM, Xuefu Zhang <xu...@apache.org> wrote:
> >>
> >> Hi all,
> >>
> >> It's an honor to announce that the Apache Hive PMC has recently voted to
> >>invite Jimmy Xiang as a new Hive PMC member. Please join me in
> >>congratulating him and looking forward to a bigger role that he will
> >>play in Apache Hive project.
> >>
> >> Thanks,
> >> Xuefu
> >
> >
>
>


Hive handling of ingested data when source column changes size or new column added

2017-05-15 Thread Mich Talebzadeh
Assuming we are ingesting into Hive table from an RDBMS Oracle table.

This is done through a daily mechanism.

My conclusion is this.


   1. The source column has moved from VARCHAR2(50) to VARCHAR2(100). As far as
   I know this should not matter in Hive, as every VARCHAR is stored as a String
   in Hive. So moving from VARCHAR2(50) to VARCHAR2(100) at source, with the Hive
   column defined as Varchar(50), would not matter? Data will be ingested OK.
   2. How about NUMERIC or DECIMAL columns at source? Will they have to be
   modified in Hive?


Also, if we add another column to the source table, there is no alternative but
to add that column to Hive as well. The existing Hive partitions will stay as
before, but new partitions will have space reserved for the additional columns.
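
A rough sketch of the corresponding Hive DDL (table and column names are
illustrative; CASCADE applies the change to the metadata of existing partitions
as well):

-- Widen a VARCHAR column to match the new source definition
-- (not needed if the Hive column is STRING)
ALTER TABLE staging_table CHANGE COLUMN customer_name customer_name VARCHAR(100);

-- Widen a DECIMAL column if the source precision/scale grows
ALTER TABLE staging_table CHANGE COLUMN amount amount DECIMAL(20,2);

-- Add the new source column; old partitions return NULL for it,
-- new partitions populate it
ALTER TABLE staging_table ADD COLUMNS (new_col STRING) CASCADE;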

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Unable to retrieve table metadata from hcatalog

2017-05-13 Thread Mich Talebzadeh
Yes I do; it is there.
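
For reference, a quick sketch of how this can be double-checked from the Hive
side (oraclehadoop.bigtab is the table named in the error quoted below):

SHOW DATABASES LIKE 'oraclehadoop';
USE oraclehadoop;
SHOW TABLES LIKE 'bigtab';
DESCRIBE FORMATTED bigtab;

If the table shows up here but HCatClient still reports NoSuchObjectException,
it is worth checking that the GoldenGate HCat client points at the same
metastore (hive.metastore.uris) as the session used above.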


Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 13 May 2017 at 05:07, Per Ullberg <per.ullb...@klarna.com> wrote:

> And you're sure that you have a database and table in hive matching
> *oraclehadoop.bigtab*?
>
> /Pelle
>
> On Sat 13 May 2017 at 01:08 Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> I am using GoldenGate to send data real time from Oracle table to Hive.
>>
>> It tries to read table metadata using HCat but fails as below:
>>
>> *ERROR 2017-05-12 22:45:52,700 [main] Unable to retrieve table matadata.
>> Table : oraclehadoop.bigtab*
>>
>> *org.apache.hive.hcatalog.common.HCatException : 9001 : Exception
>> occurred while processing HCat request : NoSuchObjectException while
>> fetching table.. Cause : NoSuchObjectException(message:oraclehadoop.bigtab
>> table not found) **at
>> org.apache.hive.hcatalog.api.HCatClientHMSImpl.getTable(HCatClientHMSImpl.java:175)*
>>
>> at oracle.goldengate.mdp.hive.HiveMetaDataProvider.resolve(
>> HiveMetaDataProvider.java:91)
>>
>> at oracle.goldengate.datasource.metadata.provider.
>> TargetMetaDataStore.retrieveMetaData(TargetMetaDataStore.java:73)
>>
>> at oracle.goldengate.datasource.UserExitDataSource.getMetaData(
>> UserExitDataSource.java:2133)
>>
>> Caused by: NoSuchObjectException(message:oraclehadoop.bigtab table not
>> found)
>>
>> at org.apache.hadoop.hive.metastore.HiveMetaStore$
>> HMSHandler.get_table_core(HiveMetaStore.java:1885)
>>
>> at org.apache.hadoop.hive.metastore.HiveMetaStore$
>> HMSHandler.get_table(HiveMetaStore.java:1838)
>>
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>
>> at sun.reflect.NativeMethodAccessorImpl.invoke(
>> NativeMethodAccessorImpl.java:62)
>>
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
>> DelegatingMethodAccessorImpl.java:43)
>>
>> at java.lang.reflect.Method.invoke(Method.java:497)
>>
>> at org.apache.hadoop.hive.metastore.RetryingHMSHandler.
>> invokeInternal(RetryingHMSHandler.java:138)
>>
>> at org.apache.hadoop.hive.metastore.RetryingHMSHandler.
>> invoke(RetryingHMSHandler.java:99)
>>
>> at com.sun.proxy.$Proxy13.get_table(Unknown Source)
>>
>> at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.
>> getTable(HiveMetaStoreClient.java:1228)
>>
>> at org.apache.hive.hcatalog.api.HCatClientHMSImpl.getTable(
>> HCatClientHMSImpl.java:168)
>>
>>
>>
>> I am not that familiar with HCat. Any help will be appreciated.
>>
>>
>> thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
> --
>
> *Per Ullberg*
> Data Vault Tech Lead
> Odin Uppsala
> +46 701612693
>
> Klarna AB (publ)
> Sveavägen 46, 111 34 Stockholm
> Tel: +46 8 120 120 00
> Reg no: 556737-0431
> klarna.com
>
>


Unable to retrieve table metadata from hcatalog

2017-05-12 Thread Mich Talebzadeh
I am using GoldenGate to send data real time from Oracle table to Hive.

It tries to read table metadata using HCat but fails as below:

*ERROR 2017-05-12 22:45:52,700 [main] Unable to retrieve table matadata.
Table : oraclehadoop.bigtab*

*org.apache.hive.hcatalog.common.HCatException : 9001 : Exception occurred
while processing HCat request : NoSuchObjectException while fetching
table.. Cause : NoSuchObjectException(message:oraclehadoop.bigtab table not
found) **at
org.apache.hive.hcatalog.api.HCatClientHMSImpl.getTable(HCatClientHMSImpl.java:175)*

at
oracle.goldengate.mdp.hive.HiveMetaDataProvider.resolve(HiveMetaDataProvider.java:91)

at
oracle.goldengate.datasource.metadata.provider.TargetMetaDataStore.retrieveMetaData(TargetMetaDataStore.java:73)

at
oracle.goldengate.datasource.UserExitDataSource.getMetaData(UserExitDataSource.java:2133)

Caused by: NoSuchObjectException(message:oraclehadoop.bigtab table not
found)

at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table_core(HiveMetaStore.java:1885)

at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1838)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:497)

at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:138)

at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:99)

at com.sun.proxy.$Proxy13.get_table(Unknown Source)

at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1228)

at
org.apache.hive.hcatalog.api.HCatClientHMSImpl.getTable(HCatClientHMSImpl.java:168)



I am not that familiar with HCat. Any help will be appreciated.


thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: beeline connection to Hive using both Kerberos and LDAP with SSL

2017-05-02 Thread Mich Talebzadeh
So it translates to either LDAP or Kerberos; we cannot enable both for the same
HiveServer2. SSL is independent. So the supported configurations are as below.



   1. Anonymous authentication (w/ or w/o SSL)
   2. LDAP authentication (w/ or w/o SSL)
   3. Kerberos

Cheers




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 30 April 2017 at 18:20, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thanks Kapil.
>
> Does this mean that one can have both Kerberos and LDAP (with SSL) and use
> either?
>
> Cheers,
>
> Mich
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 25 April 2017 at 22:42, Kapil Rastogi <krast...@cloudera.com> wrote:
>
>> Starting with CDH 5.7, clusters running LDAP-enabled HiveServer2
>> deployments also accept Kerberos authentication.
>>
>> https://www.cloudera.com/documentation/enterprise/5-8-x/
>> topics/cdh_sg_hiveserver2_security.html
>>
>> Hope that helps.
>> Kapil
>>
>> On Fri, Apr 7, 2017 at 7:58 PM, Gopal Vijayaraghavan <gop...@apache.org>
>> wrote:
>>
>>>
>>> > Is there anyway one can enable both (Kerberos and LDAP with SSL) on
>>> Hive?
>>>
>>> I believe what you're looking for is Apache Knox SSO. And for LDAP
>>> users, Apache Ranger user-sync handles auto-configuration.
>>>
>>> That is how SSL+LDAP+JDBC works in the HD Cloud gateway [1].
>>>
>>> There might be a similar solution from CDH, if you go digging for it.
>>>
>>> Cheers,
>>> Gopal
>>> [1] - https://hortonworks.github.io/hdp-aws/security-network/#prot
>>> ected-gateway
>>>
>>>
>>>
>>
>


Re: beeline connection to Hive using both Kerberos and LDAP with SSL

2017-04-30 Thread Mich Talebzadeh
Thanks Kapil.

Does this mean that one can have both Kerberos and LDAP (with SSL) and use
either?

Cheers,

Mich

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 25 April 2017 at 22:42, Kapil Rastogi <krast...@cloudera.com> wrote:

> Starting with CDH 5.7, clusters running LDAP-enabled HiveServer2
> deployments also accept Kerberos authentication.
>
> https://www.cloudera.com/documentation/enterprise/5-8-
> x/topics/cdh_sg_hiveserver2_security.html
>
> Hope that helps.
> Kapil
>
> On Fri, Apr 7, 2017 at 7:58 PM, Gopal Vijayaraghavan <gop...@apache.org>
> wrote:
>
>>
>> > Is there anyway one can enable both (Kerberos and LDAP with SSL) on
>> Hive?
>>
>> I believe what you're looking for is Apache Knox SSO. And for LDAP users,
>> Apache Ranger user-sync handles auto-configuration.
>>
>> That is how SSL+LDAP+JDBC works in the HD Cloud gateway [1].
>>
>> There might be a similar solution from CDH, if you go digging for it.
>>
>> Cheers,
>> Gopal
>> [1] - https://hortonworks.github.io/hdp-aws/security-network/#prot
>> ected-gateway
>>
>>
>>
>


beeline connection to Hive using both Kerberos and LDAP with SSL

2017-04-07 Thread Mich Talebzadeh
Hi,

Hive on CDH 5.9 uses the Kerberos security mechanism.

We have an application that can connect to Impala with LDAP user/password
and SSL.

The problem is that Impala functionality is limited, so the commands sent via
the application to Impala do not work.

The other problem is that the application does not support Kerberos
connectivity and only works with SSL and an anonymous connection.

This works on Impala because Impala, which runs by default on port 21000,
spawns port 21050 for user connections to the load balancer.

Is there any way one can enable both (Kerberos and LDAP with SSL) on Hive?


Thanks




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Real time data streaming into Hive text table and excessive amount of file numbers

2017-03-26 Thread Mich Talebzadeh
Hi,

I am testing real-time data streaming from an Oracle DB schema table to a Hive
table via SAP Replication Server (SRS).

This works fine and SRS uses DIRECT LOAD to ingest data into the Hive table. By
default this is set to 10K rows per batch load into Hive, as shown below

-- Direct Load Materialization Configuration Parameters
--
-- Parameter "mat_load_tran_size", Default: 1, specifies the optimal
transaction size or batch size for the initial copying of primary data to
the replicate table during direct load materialization.
--

A sample bulk load for updates in table "t" is shown below; it runs in
batches of 10K

T. 2017/03/26 18:25:33. (136): Command sent to 'hiveserver2.asehadoop':
T. 2017/03/26 18:25:33. (136): 'Bulk update table 'rs_ut_136_1' (1 rows
affected)'
T. 2017/03/26 18:25:49. (136): Command sent to 'hiveserver2.asehadoop':
T. 2017/03/26 18:25:49. (136): 'Bulk update table 'rs_ut_136_1' (1 rows
affected)'
T. 2017/03/26 18:26:06. (136): Command sent to 'hiveserver2.asehadoop':
T. 2017/03/26 18:26:06. (136): 'Bulk update table 'rs_ut_136_1' (1 rows
affected)'

Now the issue is that each load creates a new file under the table
directory, like below

-rwxr-xr-x   2 hduser supergroup   11128103 2017-03-26 16:01
/user/hive/warehouse/asehadoop.db/t/00_0_copy_1539
-rwxr-xr-x   2 hduser supergroup   2245 2017-03-25 23:37
/user/hive/warehouse/asehadoop.db/t/00_0_copy_154
-rwxr-xr-x   2 hduser supergroup   11128103 2017-03-26 16:02
/user/hive/warehouse/asehadoop.db/t/00_0_copy_1540
-rwxr-xr-x   2 hduser supergroup   11128118 2017-03-26 16:02
/user/hive/warehouse/asehadoop.db/t/00_0_copy_1541
-rwxr-xr-x   2 hduser supergroup   11128118 2017-03-26 16:02
/user/hive/warehouse/asehadoop.db/t/00_0_copy_1542
-rwxr-xr-x   2 hduser supergroup   11128093 2017-03-26 16:02
/user/hive/warehouse/asehadoop.db/t/00_0_copy_1543
-rwxr-xr-x   2 hduser supergroup   11128093 2017-03-26 16:03
/user/hive/warehouse/asehadoop.db/t/00_0_copy_1544


So I was wondering: what are the best ways of compacting these files? Is
there any detriment when the number of these files grows very high, such as
1000s of them?
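
One approach I can think of for a plain (non-transactional) text table is to
periodically rewrite it with the small-file merge settings enabled, so the
rewrite produces a handful of large files. A rough sketch (the table name
matches the listing above; the thresholds are only illustrative):

-- Encourage Hive to merge small output files on the rewrite
-- (avgsize threshold ~128 MB, target file size ~256 MB; on Tez also set
-- hive.merge.tezfiles=true)
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=134217728;
set hive.merge.size.per.task=268435456;

-- Rewrite the table in place (best done while ingestion is paused)
INSERT OVERWRITE TABLE asehadoop.t SELECT * FROM asehadoop.t;

Note that ALTER TABLE ... CONCATENATE only applies to RCFile and ORC tables, so
it is not an option for a text table. Very large numbers of small files do seem
to be a detriment: each file costs NameNode memory and typically becomes its
own split, so reads slow down as the count grows.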

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Using Sqoop to get data from Impala/Hive to another Hive table

2017-02-21 Thread Mich Talebzadeh
Regardless, there is no point using Sqoop for such a purpose; it is not really
designed for it :)
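
For the Hive-to-Hive copy itself, the EXPORT/IMPORT route Jörn mentions below
would look roughly like this (a sketch; the database, table and path names are
illustrative, and the exported directory still has to be copied to the target
cluster's HDFS, e.g. with distcp):

-- On the source cluster: writes the table data plus its metadata to an HDFS directory
EXPORT TABLE sourcedb.mytable TO '/tmp/export/mytable';

-- After copying /tmp/export/mytable to the target cluster, on the target:
USE targetdb;
IMPORT TABLE mytable FROM '/tmp/export/mytable';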

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 February 2017 at 10:50, Jörn Franke <jornfra...@gmail.com> wrote:

> From hive to hive another cluster I would use Hive Import/Export, if
> possible with Falcon instead of Sqoop. Sqoop always needs to do
> serialization/deserialization which is fine if the systems are different
> (e.g. Oracle -> Hive), but if it is the same system then usually it makes
> sense to use the tool of the system (e.g. Hive import/export) for
> performance/resource usage reasons.
>
>
> On 21 Feb 2017, at 11:31, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Thanks Jörn.
>
> The idea is to test a big data high availability tool by ingesting data
> into the target cluster. I have provisioned an Oracle schema for it so we
> can use Sqoop to get data into Hive.
>
> Someone suggested trying to use Sqoop to ingest from a Hive table in one
> cluster into the target cluster.
>
> This is not really a test, is it?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 21 February 2017 at 10:26, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Hello,
>>
>> I have not tried it, but sqoop supports any jdbc driver. However, since
>> the SQL syntax is not necessarily standardized you may face issues or
>> performance problems. Hive itself has a nice import and export tool that
>> supports also the metadata import/export. It can be orchestrated from Oozie
>> and/or Falcon.
>>
>> Best regards
>>
>> On 21 Feb 2017, at 11:16, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> I have not tried this but someone mentioned that it is possible to use
>> Sqoop to get data from one Impala/Hive table in one cluster to another?
>>
>> The clusters are in different zones. This is to test the cluster. Has
>> anyone done such a thing?
>>
>> Thanks
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>


Re: Using Sqoop to get data from Impala/Hive to another Hive table

2017-02-21 Thread Mich Talebzadeh
Thanks Jörn.

The idea is to test a big data high availability tool by ingesting data into
the target cluster. I have provisioned an Oracle schema for it so we can
use Sqoop to get data into Hive.

Someone suggested trying to use Sqoop to ingest from a Hive table in one
cluster into the target cluster.

This is not really a test, is it?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 February 2017 at 10:26, Jörn Franke <jornfra...@gmail.com> wrote:

> Hello,
>
> I have not tried it, but sqoop supports any jdbc driver. However, since
> the SQL syntax is not necessarily standardized you may face issues or
> performance problems. Hive itself has a nice import and export tool that
> supports also the metadata import/export. It can be orchestrated from Oozie
> and/or Falcon.
>
> Best regards
>
> On 21 Feb 2017, at 11:16, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Hi,
>
> I have not tried this but someone mentioned that it is possible to use
> Sqoop to get data from one Impala/Hive table in one cluster to another?
>
> The clusters are in different zones. This is to test the cluster. Has
> anyone done such a thing?
>
> Thanks
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>


Using Sqoop to get data from Impala/Hive to another Hive table

2017-02-21 Thread Mich Talebzadeh
Hi,

I have not tried this but someone mentioned that it is possible to use
Sqoop to get data from one Impala/Hive table in one cluster to another?

The clusters are in different zones. This is to test the cluster. Has
anyone done such a thing?

Thanks



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Difference between join and inner join

2017-02-13 Thread Mich Talebzadeh
JOIN is by default an inner join, as in Oracle or Sybase.
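
In other words, the two forms below are equivalent (a minimal illustration with
hypothetical tables t1 and t2):

-- Plain JOIN: defaults to an inner join
SELECT t1.id, t2.val
FROM t1 JOIN t2 ON t1.id = t2.id;

-- Explicit INNER JOIN: same result, same plan
SELECT t1.id, t2.val
FROM t1 INNER JOIN t2 ON t1.id = t2.id;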

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 12 February 2017 at 04:13, balajee venkatesh <
balajee.venkatesh1...@gmail.com> wrote:

> Hi Divya,
>
> If you don't specify the join type then by default it performs an inner join
> on the tuples involved.
> So there is nothing specific to point to as a difference between JOIN and
> INNER JOIN in Hadoop.
> Hope it helps.
>
> Thanks,
> Balajee Venkatesh
>
>
> On 12-Feb-2017 9:32 am, "Divya Gehlot" <divya.htco...@gmail.com> wrote:
>
> Hi ,
> What's the difference between" join " and "inner join" in hive ?
>
> Thanks ,
> Divya
>
>
>


Parquet tables with snappy compression

2017-01-25 Thread Mich Talebzadeh
Hi,

Has there been any study, in quantitative terms, of how much compressing Hive
Parquet tables with snappy reduces storage space, i.e. the table size?
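
One rough way to quantify it on a given dataset would be to build the same
table twice, once uncompressed and once with snappy, and compare the totalSize
statistic (a sketch; source_table is a placeholder):

CREATE TABLE t_parquet_plain STORED AS PARQUET
  TBLPROPERTIES ('parquet.compression'='UNCOMPRESSED')
AS SELECT * FROM source_table;

CREATE TABLE t_parquet_snappy STORED AS PARQUET
  TBLPROPERTIES ('parquet.compression'='SNAPPY')
AS SELECT * FROM source_table;

ANALYZE TABLE t_parquet_plain COMPUTE STATISTICS;
ANALYZE TABLE t_parquet_snappy COMPUTE STATISTICS;

-- totalSize in the Table Parameters section gives the on-disk bytes for each
DESCRIBE FORMATTED t_parquet_plain;
DESCRIBE FORMATTED t_parquet_snappy;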

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: VARCHAR or STRING fields in Hive

2017-01-16 Thread Mich Talebzadeh
Sounds like the VARCHAR and CHAR types were created to give Hive ANSI SQL
compliance. Otherwise they seem to be practically the same as the String type.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 16 January 2017 at 17:15, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thanks Elliot for the insight.
>
> Another issue is that Spark does not support "CHAR" types. It supports
> VARCHAR. Often one uses Spark as well on these tables.
>
> This should not really matter. I tend to define CHAR(N) to be VARCHAR(N), as
> the assumption is that the table ingested into Parquet, say, is already
> filled with data, so it does not really matter whether one defines CHAR or
> VARCHAR here.
>
> For compatibility reasons I would rather use VARCHAR as opposed to String, to
> provide the RDBMS developer a more familiar look and feel.
>
> Also, if one uses SNAPPY compression on columns, will a VARCHAR as opposed
> to String make any difference in terms of storage efficiency?
>
>
> Regards
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 16 January 2017 at 17:00, Elliot West <tea...@gmail.com> wrote:
>
>> Internally it looks as though Hive simply represents CHAR/VARCHAR values
>> using a Java String and so I would not expect a significant change in
>> execution performance. The Hive JIRA suggests that these types were added
>> to 'support for more SQL-compliant behavior, such as SQL string
>> comparison semantics, max length, etc.' rather than for performance reasons.
>>
>>- https://issues.apache.org/jira/browse/HIVE-4844
>>- https://issues.apache.org/jira/browse/HIVE-5191
>>
>> In terms of storage I expect it depends on the underlying file format and
>> the types that these values are encoded to. Parquet does appear to
>> support the specific encoding of both CHAR/VARCHAR, however I'm skeptical
>> that there would be significant storage efficiencies gained by using the
>> CHAR types, over String for comparable values. I'd be keen to hear
>> otherwise.
>>
>>- https://issues.apache.org/jira/browse/HIVE-7735
>>
>> Thanks,
>>
>> Elliot.
>>
>> On 16 January 2017 at 15:37, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>>
>>> Coming from a DBMS background I tend to treat the columns in Hive similarly
>>> to an RDBMS table. For example, if a table is created in Hive as Parquet I will
>>> use VARCHAR(30) for a column that has been defined as VARCHAR(30) at source.
>>> If a column is defined as TEXT in the RDBMS table I use STRING in Hive, with a
>>> max size of 2GB I believe.
>>>
>>> My view is that it is more efficient storage-wise to have the Hive table
>>> created as VARCHAR as opposed to STRING.
>>>
>>> I have not really seen any performance difference whether one uses VARCHAR or
>>> STRING. However, I believe there is a reason why one has VARCHAR in Hive as
>>> opposed to STRING.
>>>
>>> What is the thread view on this?
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>
>>
>


Re: VARCHAR or STRING fields in Hive

2017-01-16 Thread Mich Talebzadeh
Thanks Elliot for the insight.

Another issue is that Spark does not support "CHAR" types. It supports
VARCHAR. Often one uses Spark as well on these tables.

This should not really matter. I tend to define CHAR(N) to be VARCHAR(N), as
the assumption is that the table ingested into Parquet, say, is already
filled with data, so it does not really matter whether one defines CHAR or
VARCHAR here.

For compatibility reasons I would rather use VARCHAR as opposed to String, to
provide the RDBMS developer a more familiar look and feel.

Also, if one uses SNAPPY compression on columns, will a VARCHAR as opposed to
String make any difference in terms of storage efficiency?


Regards





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 16 January 2017 at 17:00, Elliot West <tea...@gmail.com> wrote:

> Internally it looks as though Hive simply represents CHAR/VARCHAR values
> using a Java String and so I would not expect a significant change in
> execution performance. The Hive JIRA suggests that these types were added
> to 'support for more SQL-compliant behavior, such as SQL string
> comparison semantics, max length, etc.' rather than for performance reasons.
>
>- https://issues.apache.org/jira/browse/HIVE-4844
>- https://issues.apache.org/jira/browse/HIVE-5191
>
> In terms of storage I expect it depends on the underlying file format and
> the types that these values are encoded to. Parquet does appear to
> support the specific encoding of both CHAR/VARCHAR, however I'm skeptical
> that there would be significant storage efficiencies gained by using the
> CHAR types, over String for comparable values. I'd be keen to hear
> otherwise.
>
>- https://issues.apache.org/jira/browse/HIVE-7735
>
> Thanks,
>
> Elliot.
>
> On 16 January 2017 at 15:37, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>>
>> Coming from a DBMS background I tend to treat the columns in Hive similarly
>> to an RDBMS table. For example, if a table is created in Hive as Parquet I will
>> use VARCHAR(30) for a column that has been defined as VARCHAR(30) at source.
>> If a column is defined as TEXT in the RDBMS table I use STRING in Hive, with a
>> max size of 2GB I believe.
>>
>> My view is that it is more efficient storage-wise to have the Hive table
>> created as VARCHAR as opposed to STRING.
>>
>> I have not really seen any performance difference whether one uses VARCHAR or
>> STRING. However, I believe there is a reason why one has VARCHAR in Hive as
>> opposed to STRING.
>>
>> What is the thread view on this?
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>


Re: VARCHAR or STRING fields in Hive

2017-01-16 Thread Mich Talebzadeh
Thanks both.

STRING has a max length of 2GB, so in MapReduce with a 128MB block size we are
talking about 16 blocks, whereas with VARCHAR(30) we are talking about 1 block.
I have not really experimented with this; however, I assume a table of 100k
rows with VARCHAR columns will have a smaller footprint in HDFS compared to
STRING columns?

Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 16 January 2017 at 15:49, sreebalineni . <sreebalin...@gmail.com> wrote:

> How is that more efficient storage-wise? As far as I see it is in HDFS
> and storage is based on your block size.
>
> Am I missing something here?
>
> On Jan 16, 2017 9:07 PM, "Mich Talebzadeh" <mich.talebza...@gmail.com>
> wrote:
>
>
> Coming from a DBMS background I tend to treat the columns in Hive similarly to
> an RDBMS table. For example, if a table is created in Hive as Parquet I will
> use VARCHAR(30) for a column that has been defined as VARCHAR(30) at source.
> If a column is defined as TEXT in the RDBMS table I use STRING in Hive, with a
> max size of 2GB I believe.
>
> My view is that it is more efficient storage-wise to have the Hive table
> created as VARCHAR as opposed to STRING.
>
> I have not really seen any performance difference whether one uses VARCHAR or
> STRING. However, I believe there is a reason why one has VARCHAR in Hive as
> opposed to STRING.
>
> What is the thread view on this?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>


VARCHAR or STRING fields in Hive

2017-01-16 Thread Mich Talebzadeh
Coming from a DBMS background I tend to treat the columns in Hive similarly to
an RDBMS table. For example, if a table is created in Hive as Parquet I will
use VARCHAR(30) for a column that has been defined as VARCHAR(30) at source.
If a column is defined as TEXT in the RDBMS table I use STRING in Hive, with a
max size of 2GB I believe.

My view is that it is more efficient storage-wise to have the Hive table
created as VARCHAR as opposed to STRING.

I have not really seen any performance difference whether one uses VARCHAR or
STRING. However, I believe there is a reason why one has VARCHAR in Hive as
opposed to STRING.

What is the thread view on this?
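
Setting storage aside, one behavioural difference is easy to demonstrate:
VARCHAR enforces its declared length (values are silently truncated on
assignment), while STRING does not. A small sketch (the table name is
illustrative):

CREATE TABLE t_types (v VARCHAR(5), s STRING);
INSERT INTO t_types VALUES ('abcdefghij', 'abcdefghij');
-- v comes back as 'abcde' (length 5); s keeps all 10 characters
SELECT v, length(v), s, length(s) FROM t_types;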

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Vectorised Queries in Hive

2017-01-10 Thread Mich Talebzadeh
My understanding is that vectorized query execution is only applicable to
ORC tables in Hive, processing blocks of 1024 rows at a time.

I have also noticed that this execution mode only seems to apply to
single-predicate searches; it does not appear to work with multiple-predicate
searches. Can someone confirm this please?
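
One way to check this on a given query is the explain plan: when vectorization
kicks in, the relevant operators are marked "Execution mode: vectorized". A
sketch with an illustrative ORC table and two predicates:

set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;

-- orc_table is a placeholder for any ORC-backed table
EXPLAIN
SELECT count(*) FROM orc_table
WHERE col1 = 'x' AND col2 > 100;
-- Look for "Execution mode: vectorized" against the Map (and Reduce) tasks in the output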

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Specifying orc.stripe.size in Spark

2016-12-18 Thread Mich Talebzadeh
You can use a straight SQL command to create the ORC table in Hive, assuming
you have registered a temp table:

val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// s is an existing DataFrame holding the data to be written
s.registerTempTable("tmp")
var sqltext = """
CREATE TABLE test.dummy2
 (
 ID INT
   , CLUSTERED INT
   , SCATTERED INT
   , RANDOMISED INT
   , RANDOM_STRING VARCHAR(50)
   , SMALL_VC VARCHAR(10)
   , PADDING  VARCHAR(10)
)
--CLUSTERED BY (ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="SNAPPY",
"orc.create.index"="true",
"orc.bloom.filter.columns"="ID",
"orc.bloom.filter.fpp"="0.05",
"orc.stripe.size"="268435456",
"orc.row.index.stride"="1" )
"""
HiveContext.sql(sqltext)

sqltext = """
INSERT INTO TABLE test.dummy2
SELECT
*
FROM tmp
"""
HiveContext.sql(sqltext)

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 18 December 2016 at 09:02, Daniel Haviv <daniel.ha...@veracity-group.com>
wrote:

> Hi,
> When writing a dataframe using:
> df.write.orc("/path/to/orc")
>
> How can I specify orc parameters like orc.stripe.size ?
>
> Thank you,
> Daniel
>

