TableInputFormat doesn't read the filesystem directly; it essentially issues a scan over the whole table (or the specified range), so it'll read the same data you'd get if you'd done a scan from any client. There is also a TableSnapshotInputFormat that bypasses the HBase servers themselves, going directly to the files:
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormat.html
When using this, your job will read the entire table (as of the snapshot).
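
A minimal sketch of wiring up a job either way, if it helps. Names like "my_table", "my_snapshot", and the restore path are placeholders, not anything from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanVsSnapshotJob {

  // Trivial mapper: just counts rows. Each Result is a full row,
  // exactly as a client-side scan would return it.
  public static class RowCounter extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable key, Result row, Context ctx) {
      ctx.getCounter("hbase", "rows").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "scan-vs-snapshot");
    job.setJarByClass(ScanVsSnapshotJob.class);

    Scan scan = new Scan();       // whole table; set start/stop rows to restrict the range
    scan.setCaching(500);
    scan.setCacheBlocks(false);   // don't pollute the block cache from a full-table job

    // Path 1: TableInputFormat under the hood -- goes through the region
    // servers, so it also sees data still in the memstore (not yet flushed).
    TableMapReduceUtil.initTableMapperJob(
        "my_table", scan, RowCounter.class,
        NullWritable.class, NullWritable.class, job);

    // Path 2 (alternative): TableSnapshotInputFormat -- reads the snapshot's
    // files directly from HDFS, bypassing the region servers. Requires an
    // existing snapshot and a writable temp dir for restoring references.
    // TableMapReduceUtil.initTableSnapshotMapperJob(
    //     "my_snapshot", scan, RowCounter.class,
    //     NullWritable.class, NullWritable.class, job,
    //     true, new Path("/tmp/snapshot-restore"));

    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}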
On Wed, May 29, 2019 at 1:45 AM Guillermo Ortiz Fernández <
guillermo.ortiz.f...@gmail.com> wrote:

> Another little doubt: if I use the class TableInputFormat to read an
> HBase table, am I going to read the whole table, or is data that hasn't
> been flushed to storefiles not going to be read?
>
> On Wed, May 29, 2019 at 0:14, Guillermo Ortiz Fernández (<
> guillermo.ortiz.f...@gmail.com>) wrote:
>
> > It depends on the row; they only share about 5% of the qualifier names.
> > Each row could have about 500-3000 columns across 3 column families.
> > One of them has 80% of the columns.
> >
> > The table has around 75M rows.
> >
> > On Tue, May 28, 2019 at 17:33, <s...@comcast.net> wrote:
> >
> >> Guillermo
> >>
> >> How large is your table? How many columns?
> >>
> >> Sincerely,
> >>
> >> Sean
> >>
> >> > On May 28, 2019 at 10:11 AM Guillermo Ortiz <konstt2...@gmail.com>
> >> > wrote:
> >> >
> >> > I have a doubt. When you process an HBase table with MapReduce you
> >> > can use TableInputFormat. I understand that it goes directly to the
> >> > HDFS files (storeFiles in HDFS), so you could do some filtering in
> >> > the map phase, and it's not the same as going through the region
> >> > servers to do some massive queries. Is it possible to do the same
> >> > using TableInputFormat with Spark, and is it more efficient than
> >> > using a scan with filters and so on (again) when you want to do a
> >> > massive query over the whole table? Am I right?
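
For the Spark part of the question quoted above: reading via TableInputFormat from Spark is still a scan through the region servers, so pushing filters into the Scan (server-side) cuts network traffic, whereas filtering in Spark transformations does not. A minimal sketch, assuming a table named "my_table" (placeholder) and the HBase client jars on the Spark classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkTableScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set(TableInputFormat.INPUT_TABLE, "my_table");  // placeholder table name

    // Optional: push the filtering to the servers by serializing a Scan
    // (with filters, column or range restrictions) into the job conf.
    Scan scan = new Scan();
    scan.setCaching(500);
    scan.setCacheBlocks(false);
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));

    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("hbase-scan"));

    // One Spark partition per table region; each record is a full row (Result).
    JavaPairRDD<ImmutableBytesWritable, Result> rows = sc.newAPIHadoopRDD(
        conf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);

    System.out.println("rows: " + rows.count());
    sc.close();
  }
}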