I am using Hive 1.2.1 with the MR backend.

Ryan, I hear you, and I totally agree. This is not the best approach, and I am in fact restructuring it.
However, I would like to understand what is going on. In my test run, each hive query computes *distinct* on a toy table of 10 records, so we are definitely not running into problems like resource contention. I also increased the (streaming) mappers' task timeout to 1 hour, to give the shell script (i.e., the hive query) ample time to finish.

So, architecturally, is there something that prevents us from spawning such nested MR jobs -- a streaming MR job spawning multiple hive queries that in turn spawn MR jobs?

Shirish

On Mon, Apr 18, 2016 at 1:31 PM, Ryan Harris <ryan.har...@zionsbancorp.com> wrote:

> My $0.02....
>
> If you are running multiple concurrent queries on the data, you are
> probably doing it wrong (or at least inefficiently)....although this
> somewhat depends on what type of files are backing your hive warehouse...
>
> Let's assume that your data is NOT backed by ORC/parquet files, and that
> you are NOT using Tez/Spark as your execution engine....
>
> Generally with HDFS, data I/O is going to be the slowest piece....so, with
> your workflow, each hive query is going to need to read ALL of the source
> data to resolve the query. It would be much more efficient if you could
> write a more complex query that could read the source data 1 time (instead
> of however many parallel operations you are running)....Additionally, from
> an efficiency perspective running queries in parallel might only help
> improve performance if each of your queries requires fewer map tasks than
> the total capacity of your cluster....otherwise it would generally be more
> efficient to run your queries in series.
>
> If you schedule the work in series, and things get backed up, the job will
> still eventually complete. If you attempt to do TOO much work in parallel,
> all of the jobs will start timing out and then everything will fail.
>
> There may be a valid reason for approaching the problem the way that you
> are, but I'd encourage you to look at restructuring your approach to the
> problem to save you more headaches down the road.
>
> *From:* Shirish Tatikonda [mailto:shirish.tatiko...@gmail.com]
> *Sent:* Monday, April 18, 2016 2:00 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Mappers spawning Hive queries
>
> Hi John,
>
> 2) The shell script is invoked in the mappers of a Hadoop streaming job.
>
> 1) The use case is that I have to process multiple entities in parallel.
> Each entity is associated with its own data set. The processing involves a
> few hive queries to do joins and aggregations, which is followed by some
> code in Python. My thought process is to put the hive queries and python
> invocation in a shell script, and invoke the shell script on multiple
> entities in parallel through a streaming mapreduce job.
>
> Shirish
>
> On Sat, Apr 16, 2016 at 12:10 AM, Jörn Franke <jornfra...@gmail.com>
> wrote:
>
> Just out of curiosity, what is the use case behind this?
>
> How do you call the shell script?
>
> On 16 Apr 2016, at 00:24, Shirish Tatikonda <shirish.tatiko...@gmail.com>
> wrote:
>
> Hello,
>
> I am trying to run multiple hive queries in parallel by submitting them
> through a map-reduce job.
>
> More specifically, I have a map-only hadoop streaming job where each
> mapper runs a shell script that does two things -- 1) parses input lines
> obtained via streaming; and 2) submits a very simple hive query (via hive
> -e ...) with parameters computed from step-1.
>
> Now, when I run the streaming job, the mappers seem to be stuck and I
> don't know what is going on. When I looked on resource manager web UI, I
> don't see any new MR Jobs (triggered from the hive query). I am trying to
> understand this behavior.
>
> This may be a bad idea to begin with, and there may be better ways to
> accomplish the same task.
> However, I would like to understand the behavior
> of such a MR job.
>
> Any thoughts?
>
> Thank you,
>
> Shirish
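
For anyone trying to reproduce the setup described in this thread, here is a minimal sketch of a streaming mapper that parses its input lines and shells out to `hive -e`. It is written in Python rather than as the shell script mentioned above, and several details are assumptions made only for illustration: the tab-separated input layout (entity, table name), the query text, the `entity` column, and the timeout value. It does not explain why the mappers hang; it only closes the child's stdin and puts a timeout on each query, so that a stuck `hive` process should show up in the task's stderr log instead of blocking silently.

```python
import subprocess
import sys


def parse_line(line):
    """Split one tab-separated input line into (entity, table).

    The two-column layout is an assumption made for this sketch.
    """
    entity, table = line.rstrip("\n").split("\t")[:2]
    return entity, table


def build_hive_command(entity, table):
    """Build the argument list for `hive -e`.

    The query text and the `entity` column are hypothetical.
    """
    query = "SELECT DISTINCT * FROM {0} WHERE entity = '{1}'".format(table, entity)
    return ["hive", "-e", query]


def run_command(cmd, timeout_seconds):
    """Run cmd with stdin closed.

    Returns (exit_code, stdout), or (None, "") if the timeout expires.
    The child's stderr is inherited, so its log output goes wherever the
    mapper's own stderr goes.
    """
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,  # keep the child away from the mapper's input
            stdout=subprocess.PIPE,
            timeout=timeout_seconds,
            universal_newlines=True,
        )
    except subprocess.TimeoutExpired:
        sys.stderr.write("timed out after {0}s: {1}\n".format(timeout_seconds, cmd))
        return None, ""
    return result.returncode, result.stdout


def run_mapper(lines, timeout_seconds=600):
    """Run one hive query per non-empty input line and emit its exit code."""
    for line in lines:
        if not line.strip():
            continue
        entity, table = parse_line(line)
        code, _ = run_command(build_hive_command(entity, table), timeout_seconds)
        sys.stdout.write("{0}\t{1}\n".format(entity, code))
```

To use it as the mapper script, end the file with a call to `run_mapper(sys.stdin)`. A timeout shorter than the streaming task timeout is a deliberate choice here: the mapper then reports which query got stuck before the framework kills the task.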