Hi,

Generally, if you can form the column names as you did in the case above, it is always better to add them explicitly using scan#addColumn(family, qual). I am not sure of the shell syntax for adding multiple columns, but I am sure there is a provision for it.
This will ensure that the scan starts from the given column and fetches only the required columns. In your case you probably need to pass a set of qualifiers (instead of just one).

Regards,
Ram

On Fri, Sep 13, 2019 at 8:45 PM Solvannan R M <[email protected]> wrote:

> Hi Anoop,
>
> We have executed the query with the qualifier set as you advised.
> However, we do not get results for the range; only the cell for the
> specified qualifier is returned.
>
> Query & Result:
>
> hbase(main):008:0> get 'mytable', 'MY_ROW',
> {COLUMN=>["pcf:\x00\x16\xDFx"],
> FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
> true, Bytes.toBytes(1499010.to_java(:int)), false)}
> COLUMN                  CELL
>  pcf:\x00\x16\xDFx      timestamp=1568380663616, value=\x00\x16\xDFx
> 1 row(s) in 0.0080 seconds
>
> hbase(main):009:0>
>
> Is there any other way to get around this?
>
> Regards,
> Solvannan R M
>
> On 2019/09/13 04:53:45, Anoop John wrote:
> > Hi
> >
> > When you did a put with a lower qualifier int (put 'mytable',
> > 'MY_ROW', "pcf:\x0A", "\x00"), the system flow gets a valid cell at
> > the 1st step itself, and that cell gets passed to the Filter. The
> > Filter does a seek, which avoids all the in-between delete and put
> > processing. In the 1st case the Filter won't get into action at all
> > unless the scan flow sees a valid cell. The delete processing happens
> > as the 1st step, before the filter processing step.
> >
> > In this case I am wondering why you cannot add the specific 1st
> > qualifier in the get part itself, along with the column range filter.
> > I mean
> >
> > get 'mytable', 'MY_ROW', {COLUMN=>['pcf:*1499000*'],
> > FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
> > true, Bytes.toBytes(1499010.to_java(:int)), false)}
> >
> > Pardon the syntax, it might not be proper for the shell.
> > Can this be done? This will make the scan seek to the given
> > qualifier at the 1st step itself.
> >
> > Anoop
> >
> > On Thu, Sep 12, 2019 at 10:18 PM Udai Bhan Kashyap
> > (BLOOMBERG/PRINCETON) <[email protected]> wrote:
> >
> > > Are you keeping the deleted cells? Check 'VERSIONS' for the column
> > > family and set it to 1 if you don't want to keep the deleted cells.
> > >
> > > From: [email protected] At: 09/12/19 12:40:01
> > > To: [email protected]
> > > Subject: Re: HBase Scan consumes high cpu
> > >
> > > Hi,
> > >
> > > As said earlier, we have populated the rowkey "MY_ROW" with integers
> > > from 0 to 1500000 as column qualifiers. Then we have deleted the
> > > qualifiers from 0 to 1499000.
> > >
> > > We executed the following query. It took 15.3750 seconds to execute.
> > >
> > > hbase(main):057:0> get 'mytable', 'MY_ROW', {COLUMN=>['pcf'],
> > > FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
> > > true, Bytes.toBytes(1499010.to_java(:int)), false)}
> > > COLUMN                  CELL
> > >  pcf:\x00\x16\xDFx      timestamp=1568123881899, value=\x00\x16\xDFx
> > >  pcf:\x00\x16\xDFy      timestamp=1568123881899, value=\x00\x16\xDFy
> > >  pcf:\x00\x16\xDFz      timestamp=1568123881899, value=\x00\x16\xDFz
> > >  pcf:\x00\x16\xDF{      timestamp=1568123881899, value=\x00\x16\xDF{
> > >  pcf:\x00\x16\xDF|      timestamp=1568123881899, value=\x00\x16\xDF|
> > >  pcf:\x00\x16\xDF}      timestamp=1568123881899, value=\x00\x16\xDF}
> > >  pcf:\x00\x16\xDF~      timestamp=1568123881899, value=\x00\x16\xDF~
> > >  pcf:\x00\x16\xDF\x7F   timestamp=1568123881899, value=\x00\x16\xDF\x7F
> > >  pcf:\x00\x16\xDF\x80   timestamp=1568123881899, value=\x00\x16\xDF\x80
> > >  pcf:\x00\x16\xDF\x81   timestamp=1568123881899, value=\x00\x16\xDF\x81
> > > 1 row(s) in 15.3750 seconds
> > >
> > > Now we inserted a new column with qualifier 10 (\x0A), such that it
> > > comes earlier in
> > > lexicographical order. Now we executed the same query. It only took
> > > 0.0240 seconds.
> > >
> > > hbase(main):058:0> put 'mytable', 'MY_ROW', "pcf:\x0A", "\x00"
> > > 0 row(s) in 0.0150 seconds
> > > hbase(main):059:0> get 'mytable', 'MY_ROW', {COLUMN=>['pcf'],
> > > FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
> > > true, Bytes.toBytes(1499010.to_java(:int)), false)}
> > > COLUMN                  CELL
> > >  pcf:\x00\x16\xDFx      timestamp=1568123881899, value=\x00\x16\xDFx
> > >  pcf:\x00\x16\xDFy      timestamp=1568123881899, value=\x00\x16\xDFy
> > >  pcf:\x00\x16\xDFz      timestamp=1568123881899, value=\x00\x16\xDFz
> > >  pcf:\x00\x16\xDF{      timestamp=1568123881899, value=\x00\x16\xDF{
> > >  pcf:\x00\x16\xDF|      timestamp=1568123881899, value=\x00\x16\xDF|
> > >  pcf:\x00\x16\xDF}      timestamp=1568123881899, value=\x00\x16\xDF}
> > >  pcf:\x00\x16\xDF~      timestamp=1568123881899, value=\x00\x16\xDF~
> > >  pcf:\x00\x16\xDF\x7F   timestamp=1568123881899, value=\x00\x16\xDF\x7F
> > >  pcf:\x00\x16\xDF\x80   timestamp=1568123881899, value=\x00\x16\xDF\x80
> > >  pcf:\x00\x16\xDF\x81   timestamp=1568123881899, value=\x00\x16\xDF\x81
> > > 1 row(s) in 0.0240 seconds
> > > hbase(main):060:0>
> > >
> > > We were able to reproduce the result consistently, the pattern
> > > being a bulk insert followed by a bulk delete of most of the earlier
> > > columns.
> > >
> > > We observed the following behaviour while debugging the StoreScanner
> > > (regionserver).
> > >
> > > Case 1:
> > >
> > > 1. When StoreScanner.next() is called, it starts to iterate over the
> > > cells from the start of the rowkey.
> > >
> > > 2. As all the cells are deleted (from 0 to 1499000), we could see
> > > alternate delete and put type cells.
> > > Now, the NormalUserScanQueryMatcher.match() returns
> > > ScanQueryMatcher.MatchCode.SKIP and
> > > ScanQueryMatcher.MatchCode.SEEK_NEXT_COL for Delete and Put type
> > > cells respectively. This iteration happens throughout the range of
> > > 0 to 1499000.
> > >
> > > 3. This happens until a valid Put type cell is encountered, where the
> > > matcher applies the ColumnRangeFilter to the cell, which in turn
> > > returns ScanQueryMatcher.MatchCode.SEEK_NEXT_USING_HINT. In the next
> > > iteration it seeks directly to the desired column.
> > >
> > > Case 2:
> > >
> > > 1. When StoreScanner.next() is called, it starts to iterate over the
> > > cells from the start of the rowkey.
> > >
> > > 2. When the Put cell of qualifier 10 (\x0A) is encountered, the
> > > matcher returns ScanQueryMatcher.MatchCode.SEEK_NEXT_USING_HINT. In
> > > the next iteration it seeks directly to the desired column.
> > >
> > > Please let us know if this behaviour is intentional or if it could
> > > be avoided.
> > >
> > > Regards,
> > > Solvannan R M
> > >
> > > On 2019/09/10 17:12:36, Josh Elser wrote:
> > > > Deletes are held in memory. They represent data you have to
> > > > traverse until that data is flushed out to disk.
> > > > When you write a new cell with a qualifier of 10, that sorts,
> > > > lexicographically, "early" with respect to the other qualifiers
> > > > you've written.
> > > >
> > > > By that measure, if you are only scanning for the first column in
> > > > this row which you've loaded with deletes, it would make total
> > > > sense to me that the first case is slow and the second is fast.
> > > >
> > > > Can you please share exactly how you execute your "query" for
> > > > both (all) scenarios?
> > > >
> > > > On 9/10/19 11:35 AM, Solvannan R M wrote:
> > > > > Hi,
> > > > >
> > > > > We have been using HBase (1.4.9) for a case where timeseries
> > > > > data is continuously inserted and deleted (high churn) against a
> > > > > single rowkey. The column keys represent timestamps, more or
> > > > > less. When we scan this data using ColumnRangeFilter for a recent
> > > > > time range, the scanner for the stores (memstore & storefiles)
> > > > > has to go through contiguous deletes before it reaches the
> > > > > requested time-range data. While running this scan, we noticed
> > > > > 100% CPU usage on a single core by the regionserver process.
> > > > >
> > > > > So, for our case, most of the cells with older timestamps will be
> > > > > in a deleted state. While traversing these deleted cells, the
> > > > > regionserver process causes 100% CPU usage on a single core.
> > > > >
> > > > > We tried to trace the code for the scan and observed the
> > > > > following behaviour.
> > > > >
> > > > > 1. When the scanner is initialized, it seeks all the store
> > > > > scanners to the start of the rowkey.
> > > > > 2. Then it traverses the deleted cells and discards them (as they
> > > > > were deleted) one by one.
> > > > > 3. When it encounters a valid cell (put type), it applies the
> > > > > filter, which returns SEEK_NEXT_USING_HINT.
> > > > > 4. Now the scanner seeks to the required key directly and then
> > > > > returns the results quickly.
> > > > >
> > > > > To confirm the behaviour described above, we ran a test:
> > > > > 1. We populated a single rowkey with column qualifiers as a
> > > > > range of integers from 0 to 1500000, with random data.
> > > > > 2. We then deleted the column qualifier range of 0 to 1499000.
> > > > > 3. Now the data is only in the memstore. No store file exists.
> > > > > 4. Now we scanned the rowkey with ColumnRangeFilter[1499000,
> > > > > 1499010).
> > > > > 5. The query took 12 seconds to execute. During this query, a
> > > > > single core was completely used.
> > > > > 6. Then we put a new cell with qualifier 10.
> > > > > 7. We executed the same query; it took 0.018 seconds to execute.
> > > > >
> > > > > Kindly check this and advise.
> > > > >
> > > > > Regards,
> > > > > Solvannan R M
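
P.S. On the point at the top of this mail about passing a set of qualifiers: below is a minimal sketch, in plain Ruby (the language of the shell), of how that set could be built for the range [1499000, 1499010). The helper names qualifier_for and column_specs are hypothetical and are not part of HBase or its shell. The sketch assumes the qualifiers were written as 4-byte big-endian ints, which is what the shell output quoted in this thread suggests (1499000 is 0x0016DF78 and shows up as \x00\x16\xDFx). It has not been run against a cluster.

```ruby
# Hypothetical helpers for illustration only; not part of HBase.

# Encode an integer as 4 big-endian bytes, the same byte layout that
# Bytes.toBytes(int) produces for the qualifiers in this thread.
def qualifier_for(n)
  [n].pack('N')
end

# Build one "family:qualifier" string per integer in [start, stop).
def column_specs(family, start, stop)
  (start...stop).map { |n| "#{family}:" + qualifier_for(n) }
end

specs = column_specs('pcf', 1_499_000, 1_499_010)

puts specs.length
# prints 10

puts specs.first.bytes.map { |b| format('%02X', b) }.join(' ')
# prints 70 63 66 3A 00 16 DF 78  ("pcf:" followed by \x00\x16\xDFx)

# Assumed shell usage (check `help 'get'` for your version):
#   get 'mytable', 'MY_ROW', {COLUMN => specs}
```

The last entry ends in byte 0x81, which matches the final row (pcf:\x00\x16\xDF\x81) in the query output quoted in this thread. Listing every qualifier explicitly only makes sense for a small, known range such as this one.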
