Re: trying to figure out number of MR jobs from explain output

2015-12-11 Thread Nicholas Hakobian
You can't find out definitively from the EXPLAIN output, because the
number of jobs depends on the nature of the data being processed,
especially when it comes to map joins. If the output of one stage is
small enough to be map-joined, parts of a stage can be skipped, since
the whole dataset is available on every node.

I'm sure there are other conditions as well, but that is the general idea.
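
To make this concrete for the plan quoted below: Stage-7 is a Conditional Operator that "consists of" Stage-4, Stage-3 and Stage-5, so which of those runs is decided at run time. The following Python sketch is one rough way to bound the job count from the STAGE DEPENDENCIES text. It assumes that exactly one branch of each conditional stage runs and that each branch contributes at most one MR job; neither assumption is guaranteed by Hive in general. The function names and the hand-entered set of Map Reduce stages are illustrative only and are not part of Hive.

```python
import re

# STAGE DEPENDENCIES copied from the EXPLAIN output in this thread.
DEPENDENCIES = """\
Stage-1 is a root stage
Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
Stage-4
Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
Stage-8 depends on stages: Stage-0
Stage-2 depends on stages: Stage-8
Stage-3
Stage-5
Stage-6 depends on stages: Stage-5
"""

# Stages reported as "Map Reduce" in STAGE PLANS (entered by hand here).
MR_STAGES = {"Stage-1", "Stage-3", "Stage-5"}


def conditional_branches(dependencies):
    """Map each conditional stage to the stages listed after 'consists of'."""
    branches = {}
    for line in dependencies.splitlines():
        match = re.match(r"\s*(Stage-\d+)\b.*consists of (.+)$", line)
        if match:
            branches[match.group(1)] = re.findall(r"Stage-\d+", match.group(2))
    return branches


def mr_job_bounds(dependencies, mr_stages):
    """Return (lower, upper) bounds on the number of MR jobs.

    ASSUMPTION: exactly one branch of each conditional stage runs, and a
    branch adds at most one MR job. This is a heuristic, not Hive's logic.
    """
    branches = conditional_branches(dependencies)
    optional = {stage for group in branches.values() for stage in group}
    # MR stages outside any conditional branch always run.
    lower = len(mr_stages - optional)
    # Each conditional with at least one MR branch may add one more job.
    upper = lower + sum(
        1 for group in branches.values() if any(s in mr_stages for s in group)
    )
    return lower, upper


print(mr_job_bounds(DEPENDENCIES, MR_STAGES))  # prints (1, 2)
```

Under those assumptions the bounds come out as 1 to 2 jobs, which is consistent with the 1 or 2 jobs reported in the question, but the actual number is still only known at run time.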

-Nick

Nicholas Szandor Hakobian
Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com



On Fri, Dec 11, 2015 at 2:00 PM, Ophir Etzion wrote:
> Hi,
>
> I've been trying to figure out how to determine the number of MR jobs that
> will be run for a Hive query from its EXPLAIN output.
>
> I haven't found a consistent method for doing that.
>
> For example (in one of my queries, a CTAS query):
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
>   Stage-4
>   Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
>   Stage-8 depends on stages: Stage-0
>   Stage-2 depends on stages: Stage-8
>   Stage-3
>   Stage-5
>   Stage-6 depends on stages: Stage-5
>
> Stage-1, Stage-3, and Stage-5 are listed as Map Reduce stages.
>
> In the end, 2 MR jobs ran.
>
> In other cases only 1 job runs.
>
> I couldn't find a consistent rule for figuring this out.
>
> Can anyone help?
>
> Thank you!

trying to figure out number of MR jobs from explain output

2015-12-11 Thread Ophir Etzion
Hi,

I've been trying to figure out how to determine the number of MR jobs that
will be run for a Hive query from its EXPLAIN output.

I haven't found a consistent method for doing that.

For example (in one of my queries, a CTAS query):
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
  Stage-4
  Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
  Stage-8 depends on stages: Stage-0
  Stage-2 depends on stages: Stage-8
  Stage-3
  Stage-5
  Stage-6 depends on stages: Stage-5

Stage-1, Stage-3, and Stage-5 are listed as Map Reduce stages.

In the end, 2 MR jobs ran.

In other cases only 1 job runs.

I couldn't find a consistent rule for figuring this out.

Can anyone help?

Thank you!

Below is the full output:

explain CREATE TABLE beekeeper_results.test3 ROW FORMAT SERDE
"com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde" WITH
SERDEPROPERTIES ('escape.delim'='\\', 'mapkey.delim'='\;',
'colelction.delim'='|') AS SELECT * FROM beekeeper_results.test2;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
  Stage-4
  Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
  Stage-8 depends on stages: Stage-0
  Stage-2 depends on stages: Stage-8
  Stage-3
  Stage-5
  Stage-6 depends on stages: Stage-5

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test2
            Statistics: Num rows: 112 Data size: 11690 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: blasttag (type: string), actioncounts (type: array>), detailedclicks (type: array>), countsbyclient (type: array>), totalactioncounts (type: array>), actionsbydate (type: array>)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
              Statistics: Num rows: 112 Data size: 11690 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 112 Data size: 11690 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde
                    name: beekeeper_results.test3

  Stage: Stage-7
    Conditional Operator

  Stage: Stage-4
    Move Operator
      files:
          hdfs directory: true
          destination: hdfs://hadoop-alidoro-nn-vip/user/hive/warehouse/.hive-staging_hive_2015-12-11_21-52-35_063_8498858370292854265-1/-ext-10001

  Stage: Stage-0
    Move Operator
      files:
          hdfs directory: true
          destination: ***

  Stage: Stage-8
      Create Table Operator:
        Create Table
          columns: blasttag string, actioncounts array>, detailedclicks array>, countsbyclient array>, totalactioncounts array>, actionsbydate array>
          input format: org.apache.hadoop.mapred.TextInputFormat
          output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
          serde name: com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde
          serde properties:
            colelction.delim |
            escape.delim \
            mapkey.delim ;
          name: beekeeper_results.test3

  Stage: Stage-2
    Stats-Aggr Operator

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde
                  name: beekeeper_results.test3

  Stage: Stage-5
    Map Reduce
      Map Operator Tree:
          TableScan
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde
                  name: beekeeper_results.test3

  Stage: Stage-6
    Move Operator
      files:
          hdfs directory: true
          destination: ***
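
For reference, the stage types in the STAGE PLANS section above can be read off mechanically instead of by eye. The Python sketch below is a minimal, illustrative parser: it treats the first non-blank line after each "Stage: Stage-N" header as that stage's type. The function name is hypothetical, the embedded plan is an abbreviated copy of the headers from the output above, and the approach assumes the EXPLAIN text keeps this header-then-type layout.

```python
import re

# Abbreviated STAGE PLANS text: only the stage headers and the lines
# immediately following them, copied from the EXPLAIN output above.
PLAN = """\
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:

  Stage: Stage-7
    Conditional Operator

  Stage: Stage-4
    Move Operator

  Stage: Stage-0
    Move Operator

  Stage: Stage-8
      Create Table Operator:

  Stage: Stage-2
    Stats-Aggr Operator

  Stage: Stage-3
    Map Reduce

  Stage: Stage-5
    Map Reduce

  Stage: Stage-6
    Move Operator
"""


def stage_types(plan_text):
    """Return {stage name: first non-blank line after its 'Stage:' header}."""
    types = {}
    current = None
    for raw in plan_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = re.fullmatch(r"Stage: (Stage-\d+)", line)
        if header:
            current = header.group(1)
        elif current is not None and current not in types:
            # Only the first line after the header names the stage type.
            types[current] = line.rstrip(":")
    return types


types = stage_types(PLAN)
mr_stages = sorted(name for name, kind in types.items() if kind == "Map Reduce")
print(mr_stages)  # prints ['Stage-1', 'Stage-3', 'Stage-5']
```

This only lists the candidate Map Reduce stages. Stage-3 and Stage-5 are branches of the conditional Stage-7, so the listing alone does not say how many jobs will actually run.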