Hi
When you did a put with a lower qualifier int (put 'mytable',
'MY_ROW', "pcf:\x0A", "\x00") the system flow is getting a valid cell at
1st step itself and that getting passed to the Filter. The Filter is doing
a seek which just avoids all the in between deletes and puts processing..
In 1st case the Filter wont get into action at all unless the scan flow
sees a valid cell. The delete processing happens as 1st step before the
filter processinf step happening.
In this case I am wondering why you can not add the specific 1st qualifier
in the get part itself along with the column range filter. I mean
get 'mytable', 'MY_ROW', {COLUMN=>['pcf: *1499000 * '],
FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
true, Bytes.toBytes(1499010.to_java(:int)), false)}
Pardon the syntax it might not be proper for the shell.. Can this be done?
This will make the scan to make a seek to the given qualifier at 1st step
itself.
Anoop
On Thu, Sep 12, 2019 at 10:18 PM Udai Bhan Kashyap (BLOOMBERG/ PRINCETON) <
[email protected]> wrote:
> Are you keeping the deleted cells? Check 'VERSIONS' for the column family
> and set it to 1 if you don't want to keep the deleted cells.
>
> From: [email protected] At: 09/12/19 12:40:01To:
> [email protected]
> Subject: Re: HBase Scan consumes high cpu
>
> Hi,
>
> As said earlier, we have populated the rowkey "MY_ROW" with integers
> from 0 to 1500000 as column qualifiers. Then we have deleted the
> qualifiers from 0 to 1499000.
>
> We executed the following query. It took 15.3750 seconds to execute.
>
> hbase(main):057:0> get 'mytable', 'MY_ROW', {COLUMN=>['pcf'],
> FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
> true, Bytes.toBytes(1499010.to_java(:int)), false)}
> COLUMN CELL
> pcf:\x00\x16\xDFx timestamp=1568123881899,
> value=\x00\x16\xDFx
> pcf:\x00\x16\xDFy timestamp=1568123881899,
> value=\x00\x16\xDFy
> pcf:\x00\x16\xDFz timestamp=1568123881899,
> value=\x00\x16\xDFz
> pcf:\x00\x16\xDF{ timestamp=1568123881899,
> value=\x00\x16\xDF{
> pcf:\x00\x16\xDF| timestamp=1568123881899,
> value=\x00\x16\xDF|
> pcf:\x00\x16\xDF} timestamp=1568123881899,
> value=\x00\x16\xDF}
> pcf:\x00\x16\xDF~ timestamp=1568123881899,
> value=\x00\x16\xDF~
> pcf:\x00\x16\xDF\x7F timestamp=1568123881899,
> value=\x00\x16\xDF\x7F
> pcf:\x00\x16\xDF\x80 timestamp=1568123881899,
> value=\x00\x16\xDF\x80
> pcf:\x00\x16\xDF\x81 timestamp=1568123881899,
> value=\x00\x16\xDF\x81
> 1 row(s) in 15.3750 seconds
>
>
> Now we inserted a new column with qualifier 10 (\x0A), such that it
> comes earlier in lexicographical order. Now we executed the same query.
> It only took 0.0240 seconds.
>
> hbase(main):058:0> put 'mytable', 'MY_ROW', "pcf:\x0A", "\x00"
> 0 row(s) in 0.0150 seconds
> hbase(main):059:0> get 'mytable', 'MY_ROW', {COLUMN=>['pcf'],
> FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
> true, Bytes.toBytes(1499010.to_java(:int)), false)}
> COLUMN CELL
> pcf:\x00\x16\xDFx timestamp=1568123881899,
> value=\x00\x16\xDFx
> pcf:\x00\x16\xDFy timestamp=1568123881899,
> value=\x00\x16\xDFy
> pcf:\x00\x16\xDFz timestamp=1568123881899,
> value=\x00\x16\xDFz
> pcf:\x00\x16\xDF{ timestamp=1568123881899,
> value=\x00\x16\xDF{
> pcf:\x00\x16\xDF| timestamp=1568123881899,
> value=\x00\x16\xDF|
> pcf:\x00\x16\xDF} timestamp=1568123881899,
> value=\x00\x16\xDF}
> pcf:\x00\x16\xDF~ timestamp=1568123881899,
> value=\x00\x16\xDF~
> pcf:\x00\x16\xDF\x7F timestamp=1568123881899,
> value=\x00\x16\xDF\x7F
> pcf:\x00\x16\xDF\x80 timestamp=1568123881899,
> value=\x00\x16\xDF\x80
> pcf:\x00\x16\xDF\x81 timestamp=1568123881899,
> value=\x00\x16\xDF\x81
> 1 row(s) in 0.0240 seconds
> hbase(main):060:0>
>
>
> We were able to reproduce the result consistently same, the pattern
> being bulk insert followed by bulk delete of most of the earlier columns.
>
>
> We observed the following behaviour while debugging the StoreScanner
> (regionserver).
>
> Case 1:
>
> 1. When StoreScanner.next() is called, it starts to iterate over the
> cells from the start of the rowkey.
>
> 2. As all the cells are deleted (from 0 to 1499000), we could see
> alternate delete and put type cells. Now, the
> NormalUserScanQueryMatcher.match() returns
> ScanQueryMatcher.MatchCode.SKIP and
> ScanQueryMatcher.MatchCode.SEEK_NEXT_COL for Delete and Put type cell
> respectively. This iteration happens throughout the range of 0 to 1499000.
>
> 3. This happens until a valid Put type cell is encountered, where the
> matcher applies the ColumnRangeFilter to the cell, which in turm returns
> ScanQueryMatcher.MatchCode.SEEK_NEXT_USING_HINT. In the next iteration
> it seeks directly to the desired column.
>
>
> Case 2:
>
> 1. When StoreScanner.next() is called, it starts to iterate over the
> cells from the start of the rowkey.
>
> 2. When the Put cell of qualifier 10 (\x0A) is encountered, the matcher
> returns ScanQueryMatcher.MatchCode.SEEK_NEXT_USING_HINT. In the next
> iteration it seeks directly to the desired column.
>
>
> Please let us know if this behaviour is intentional or it could be avoided.
>
> Regards,
>
> Solvannan R M
>
>
> On 2019/09/10 17:12:36, Josh Elser wrote:
> > Deletes are held in memory. They represent data you have to traverse >
> > until that data is flushed out to disk. When you write a new cell
> with a >
> > qualifier of 10, that sorts, lexicographically, "early" with respect
> to >
> > the other qualifiers you've written.>
> >
> > By that measure, if you are only scanning for the first column in this >
> > row which you've loaded with deletes, it would make total sense to me >
> > that the first case is slow and the second fast is fast>
> >
> > Can you please share exactly how you execute your "query" for
> both(all) >
> > scenarios?>
> >
> > On 9/10/19 11:35 AM, Solvannan R M wrote:>
> > > Hi,>
> > > >
> > > We have been using HBase (1.4.9) for a case where timeseries data
> is continuously inserted and deleted (high churn) against a single
> rowkey. The column keys would represent timestamp more or less. When we
> scan this data using ColumnRangeFilter for a recent time-range, scanner
> for the stores (memstore & storefiles) has to go through contiguous
> deletes, before it reaches the requested timerange data. While using
> this scan, we could notice 100% cpu usages in single core by the
> regionserver process.>
> > > >
> > > So, for our case, most of the cells with older timestamps will be
> in deleted state. While traversing these deleted cells, the regionserver
> process causing 100% cpu usage in single core.>
> > > >
> > > We tried to trace the code for scan and we observed the following
> behaviour.>
> > > >
> > > 1. While scanner is initialized, it seeked all the store-scanners
> to the start of the rowkey.>
> > > 2. Then it traverses the deleted cells and discards it (as it was
> deleted) one by one.>
> > > 3. When it encounters a valid cell (put type), it applies the
> filter and it returns SEEK_TO_NEXT_USING_HINT.>
> > > 4. Now the scanner seeks to the required key directly and returning
> the results quickly then.>
> > > >
> > > For confirming the mentioned behaviour, we have done a test:>
> > > 1. We have populated a single rowkey with column qualifier as a
> range of integers of 0 to 1500000 with random data.>
> > > 2. We then deleted the column qualifier range of 0 to 1499000.>
> > > 3. Now the data is only in memsore. No store file exists.>
> > > 4. Now we scanned the rowkey with ColumnRangeFilter[1499000,
> 1499010).>
> > > 5. The query took 12 seconds to execute. During this query, a
> single core is completely used>
> > > 6. Then we put a new cell with qualifier 10.>
> > > 7. Executed the same query, it took 0.018 seconds to execute.>
> > > >
> > > Kindly check this and advise !.>
> > > >
> > > Regards,>
> > > Solvannan R M>
> > > >
> >
>
>
>