Hello,

I have a sequence of MR jobs that produce intermediate results: the output of one job is the input to the next.
Also, some data is always used as input to the MR jobs. That data is stored in HBase.

I would like to know which of the following is more performant:

1) Write intermediate results to HBase in one job and read them from HBase in the next job
2) Write intermediate results to HDFS in one job and read them from HDFS in the next job

Also, regarding the data that is always used in the MR jobs:

1) Read the same data from HBase each time (which includes scanning by rowkey)
2) Read the data from HBase only the first time, store it in HDFS, and read it from HDFS every subsequent time (to avoid querying the database each time)

Please elaborate on why you would choose one over the other.

Best regards,

--
Marko Dinic
