Re: HBase Scan consumes high cpu

ramkrishna vasudevan Fri, 13 Sep 2019 08:54:12 -0700

Hi,
Generally, if you can construct the column names as you did in the case
above, it is always better to add them explicitly using
scan#addColumn(family, qual). I am not sure of the shell syntax for adding
multiple columns, but I am sure there is a provision for it.


This ensures that the scan starts from the given column and fetches only
the required columns. In your case you probably need to pass a set of
qualifiers (instead of just one).
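
For reference, here is a rough sketch of how that set of qualifiers could be
built for the range discussed in this thread. It is plain Python with no
HBase client involved. The helper names (encode_qualifier, shell_escape,
columns_for_range) are made up for illustration, and shell_escape only
approximates how the HBase shell prints binary qualifiers. It assumes the
qualifiers are 4-byte big-endian ints, which is what Bytes.toBytes(int)
produces.

```python
import struct


def encode_qualifier(n):
    """Encode an int as 4 big-endian bytes, like Bytes.toBytes(int)."""
    return struct.pack(">i", n)


def shell_escape(raw):
    """Approximate the shell's rendering: printable ASCII as-is, the rest as \\xNN."""
    out = []
    for b in raw:
        if 0x20 <= b <= 0x7E and b != 0x5C:  # printable, and not a backslash
            out.append(chr(b))
        else:
            out.append("\\x%02X" % b)
    return "".join(out)


def columns_for_range(family, start, stop):
    """One 'family:qualifier' string per int in [start, stop)."""
    return ["%s:%s" % (family, shell_escape(encode_qualifier(n)))
            for n in range(start, stop)]


columns = columns_for_range("pcf", 1499000, 1499010)
for column in columns:
    print(column)
```

The first and last entries come out as pcf:\x00\x16\xDFx and
pcf:\x00\x16\xDF\x81, matching the cells in the quoted shell output. Each
qualifier would then go into its own addColumn(family, qual) call.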

Regards
Ram

On Fri, Sep 13, 2019 at 8:45 PM Solvannan R M <[email protected]>
wrote:

> Hi Anoop,
>
>     We executed the query with the qualifier set as you advised, but we
> do not get results for the whole range; only the cell for the specified
> qualifier is returned.
>
> Query & Result:
>
> hbase(main):008:0> get 'mytable', 'MY_ROW',
> {COLUMN=>["pcf:\x00\x16\xDFx"],
> FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
> true, Bytes.toBytes(1499010.to_java(:int)), false)}
> COLUMN CELL
>   pcf:\x00\x16\xDFx   timestamp=1568380663616, value=\x00\x16\xDFx
> 1 row(s) in 0.0080 seconds
>
> hbase(main):009:0>
>
>
> Is there any other way to get around this?
>
>
> Regards,
>
> Solvannan R M
>
>
> On 2019/09/13 04:53:45, Anoop John wrote:
>  > Hi
>  > When you did a put with a lower qualifier int (put 'mytable',
>  > 'MY_ROW', "pcf:\x0A", "\x00"), the scan flow gets a valid cell at the
>  > first step itself, and that cell is passed to the Filter. The Filter
>  > then does a seek, which avoids processing all the deletes and puts in
>  > between. In the first case the Filter won't come into action at all
>  > unless the scan flow sees a valid cell, because delete processing
>  > happens as the first step, before the filter processing step.
>  >
>  > In this case I am wondering why you cannot add the specific first
>  > qualifier in the get itself, along with the column range filter. I mean
>  >
>  > get 'mytable', 'MY_ROW', {COLUMN=>['pcf: *1499000 * '],
>  > FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
>  > true, Bytes.toBytes(1499010.to_java(:int)), false)}
>  >
>  > Pardon the syntax, it might not be proper for the shell. Can this be
>  > done? This will make the scan seek to the given qualifier at the first
>  > step itself.
>  >
>  > Anoop
>  >
>  > On Thu, Sep 12, 2019 at 10:18 PM Udai Bhan Kashyap (BLOOMBERG/
>  > PRINCETON) <[email protected]> wrote:
>  >
>  > > Are you keeping the deleted cells? Check 'VERSIONS' for the column
>  > > family and set it to 1 if you don't want to keep the deleted cells.
>  > >
>  > > From: [email protected] At: 09/12/19 12:40:01
>  > > To: [email protected]
>  > > Subject: Re: HBase Scan consumes high cpu
>  > >
>  > > Hi,
>  > >
>  > > As said earlier, we populated the rowkey "MY_ROW" with integers
>  > > from 0 to 1500000 as column qualifiers. Then we deleted the
>  > > qualifiers from 0 to 1499000.
>  > >
>  > > We executed the following query. It took 15.3750 seconds to execute.
>  > >
>  > > hbase(main):057:0> get 'mytable', 'MY_ROW', {COLUMN=>['pcf'],
>  > > FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
>  > > true, Bytes.toBytes(1499010.to_java(:int)), false)}
>  > > COLUMN CELL
>  > > pcf:\x00\x16\xDFx timestamp=1568123881899, value=\x00\x16\xDFx
>  > > pcf:\x00\x16\xDFy timestamp=1568123881899, value=\x00\x16\xDFy
>  > > pcf:\x00\x16\xDFz timestamp=1568123881899, value=\x00\x16\xDFz
>  > > pcf:\x00\x16\xDF{ timestamp=1568123881899, value=\x00\x16\xDF{
>  > > pcf:\x00\x16\xDF| timestamp=1568123881899, value=\x00\x16\xDF|
>  > > pcf:\x00\x16\xDF} timestamp=1568123881899, value=\x00\x16\xDF}
>  > > pcf:\x00\x16\xDF~ timestamp=1568123881899, value=\x00\x16\xDF~
>  > > pcf:\x00\x16\xDF\x7F timestamp=1568123881899, value=\x00\x16\xDF\x7F
>  > > pcf:\x00\x16\xDF\x80 timestamp=1568123881899, value=\x00\x16\xDF\x80
>  > > pcf:\x00\x16\xDF\x81 timestamp=1568123881899, value=\x00\x16\xDF\x81
>  > > 1 row(s) in 15.3750 seconds
>  > >
>  > > Now we inserted a new column with qualifier 10 (\x0A), such that it
>  > > comes earlier in lexicographical order, and executed the same query.
>  > > It took only 0.0240 seconds.
>  > >
>  > > hbase(main):058:0> put 'mytable', 'MY_ROW', "pcf:\x0A", "\x00"
>  > > 0 row(s) in 0.0150 seconds
>  > > hbase(main):059:0> get 'mytable', 'MY_ROW', {COLUMN=>['pcf'],
>  > > FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
>  > > true, Bytes.toBytes(1499010.to_java(:int)), false)}
>  > > COLUMN CELL
>  > > pcf:\x00\x16\xDFx timestamp=1568123881899, value=\x00\x16\xDFx
>  > > pcf:\x00\x16\xDFy timestamp=1568123881899, value=\x00\x16\xDFy
>  > > pcf:\x00\x16\xDFz timestamp=1568123881899, value=\x00\x16\xDFz
>  > > pcf:\x00\x16\xDF{ timestamp=1568123881899, value=\x00\x16\xDF{
>  > > pcf:\x00\x16\xDF| timestamp=1568123881899, value=\x00\x16\xDF|
>  > > pcf:\x00\x16\xDF} timestamp=1568123881899, value=\x00\x16\xDF}
>  > > pcf:\x00\x16\xDF~ timestamp=1568123881899, value=\x00\x16\xDF~
>  > > pcf:\x00\x16\xDF\x7F timestamp=1568123881899, value=\x00\x16\xDF\x7F
>  > > pcf:\x00\x16\xDF\x80 timestamp=1568123881899, value=\x00\x16\xDF\x80
>  > > pcf:\x00\x16\xDF\x81 timestamp=1568123881899, value=\x00\x16\xDF\x81
>  > > 1 row(s) in 0.0240 seconds
>  > > hbase(main):060:0>
>  > >
>  > > We were able to reproduce this result consistently; the pattern is a
>  > > bulk insert followed by a bulk delete of most of the earlier columns.
>  > >
>  > > We observed the following behaviour while debugging the StoreScanner
>  > > (regionserver).
>  > >
>  > > Case 1:
>  > >
>  > > 1. When StoreScanner.next() is called, it starts to iterate over the
>  > > cells from the start of the rowkey.
>  > >
>  > > 2. As all the cells from 0 to 1499000 are deleted, we see alternating
>  > > Delete and Put type cells. NormalUserScanQueryMatcher.match() returns
>  > > ScanQueryMatcher.MatchCode.SKIP for each Delete cell and
>  > > ScanQueryMatcher.MatchCode.SEEK_NEXT_COL for each Put cell. This
>  > > iteration continues throughout the range 0 to 1499000.
>  > >
>  > > 3. This goes on until a valid Put type cell is encountered, at which
>  > > point the matcher applies the ColumnRangeFilter to the cell, which in
>  > > turn returns ScanQueryMatcher.MatchCode.SEEK_NEXT_USING_HINT. In the
>  > > next iteration it seeks directly to the desired column.
>  > >
>  > >
>  > > Case 2:
>  > >
>  > > 1. When StoreScanner.next() is called, it starts to iterate over the
>  > > cells from the start of the rowkey.
>  > >
>  > > 2. When the Put cell of qualifier 10 (\x0A) is encountered, the
>  > > matcher returns ScanQueryMatcher.MatchCode.SEEK_NEXT_USING_HINT. In
>  > > the next iteration it seeks directly to the desired column.
>  > >
>  > >
>  > > Please let us know if this behaviour is intentional or if it could be
>  > > avoided.
>  > >
>  > > Regards,
>  > >
>  > > Solvannan R M
>  > >
>  > >
>  > > On 2019/09/10 17:12:36, Josh Elser wrote:
>  > > > Deletes are held in memory. They represent data you have to
>  > > > traverse until that data is flushed out to disk. When you write a
>  > > > new cell with a qualifier of 10, that sorts, lexicographically,
>  > > > "early" with respect to the other qualifiers you've written.
>  > > >
>  > > > By that measure, if you are only scanning for the first column in
>  > > > this row which you've loaded with deletes, it would make total sense
>  > > > to me that the first case is slow and the second case is fast.
>  > > >
>  > > > Can you please share exactly how you execute your "query" for
>  > > > both (all) scenarios?
>  > > >
>  > > > On 9/10/19 11:35 AM, Solvannan R M wrote:
>  > > > > Hi,
>  > > > >
>  > > > > We have been using HBase (1.4.9) for a case where timeseries data
>  > > > > is continuously inserted and deleted (high churn) against a single
>  > > > > rowkey. The column keys more or less represent timestamps. When we
>  > > > > scan this data using ColumnRangeFilter for a recent time range, the
>  > > > > scanner for the stores (memstore & storefiles) has to go through
>  > > > > contiguous deletes before it reaches the requested time-range data.
>  > > > > While running this scan, we noticed 100% CPU usage on a single core
>  > > > > by the regionserver process.
>  > > > >
>  > > > > So, in our case, most of the cells with older timestamps will be in
>  > > > > a deleted state. While traversing these deleted cells, the
>  > > > > regionserver process causes 100% CPU usage on a single core.
>  > > > >
>  > > > > We tried to trace the code for the scan and observed the following
>  > > > > behaviour:
>  > > > >
>  > > > > 1. When the scanner is initialized, it seeks all the store scanners
>  > > > > to the start of the rowkey.
>  > > > > 2. It then traverses the deleted cells and discards them one by one.
>  > > > > 3. When it encounters a valid cell (Put type), it applies the
>  > > > > filter, which returns SEEK_NEXT_USING_HINT.
>  > > > > 4. The scanner then seeks directly to the required key and returns
>  > > > > the results quickly.
>  > > > >
>  > > > > To confirm this behaviour, we ran a test:
>  > > > > 1. We populated a single rowkey with column qualifiers covering the
>  > > > > integers from 0 to 1500000, with random data.
>  > > > > 2. We then deleted the column qualifier range 0 to 1499000.
>  > > > > 3. At this point the data is only in the memstore. No store file
>  > > > > exists.
>  > > > > 4. We scanned the rowkey with ColumnRangeFilter[1499000, 1499010).
>  > > > > 5. The query took 12 seconds to execute. During the query, a single
>  > > > > core was completely used.
>  > > > > 6. We then put a new cell with qualifier 10.
>  > > > > 7. We executed the same query again; it took 0.018 seconds.
>  > > > >
>  > > > > Kindly check this and advise.
>  > > > >
>  > > > > Regards,
>  > > > > Solvannan R M
>  > > > >
>  > > >
>  > >
>  >
>
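
The two cases traced in the quoted messages can be reproduced with a toy
model. The sketch below is plain Python, not HBase code: build_row and
range_scan are hypothetical names, the row sizes are scaled down from the
ones in the thread, and the branches only imitate the SKIP, SEEK_NEXT_COL
and SEEK_NEXT_USING_HINT outcomes described above. The seek_to argument
models only an initial seek to a named qualifier, not the column restriction
that COLUMN=> also applies.

```python
from bisect import bisect_left


def build_row(total, deleted_upto, newer_puts=()):
    """Cells of one row in scan order, as (qualifier, kind) pairs."""
    cells = []
    for q in range(total):
        if q in newer_puts:
            cells.append((q, "put"))     # newer put sorts ahead of the delete marker
        if q < deleted_upto:
            cells.append((q, "delete"))  # delete marker masking the older put
        cells.append((q, "put"))
    return cells


def range_scan(cells, lo, hi, seek_to=None):
    """Return (qualifiers found in [lo, hi), number of cells examined)."""
    quals = [q for q, _ in cells]
    i = 0 if seek_to is None else bisect_left(quals, seek_to)
    masked = None
    examined = 0
    found = []
    while i < len(cells):
        q, kind = cells[i]
        examined += 1
        if kind == "delete":   # like SKIP: remember that this column is deleted
            masked = q
            i += 1
        elif q == masked:      # deleted put: step on, the filter is never consulted
            i += 1
        elif q < lo:           # live put before the range: the filter hints a seek
            i = bisect_left(quals, lo)
        elif q >= hi:          # past the range: stop
            break
        else:
            found.append(q)
            i += 1
    return found, examined


# Scaled-down version of the test in the thread (assumed sizes, not the real ones).
row = build_row(total=15000, deleted_upto=14000)
case1 = range_scan(row, 14000, 14010)
case2 = range_scan(build_row(15000, 14000, newer_puts={10}), 14000, 14010)
seeked = range_scan(row, 14000, 14010, seek_to=14000)
print(case1[1], case2[1], seeked[1])
```

In this model Case 1 examines 28011 cells, Case 2 examines 32, and the run
that seeks first examines 11, while all three return the same ten
qualifiers. The gap comes entirely from walking delete markers before the
filter gets a chance to supply a seek hint.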
