Hi Anoop,
We have executed the query with the qualifier set like you advised.
But we dont get the results for the range but only the specified
qualifier cell is returned.
Query & Result:
hbase(main):008:0> get 'mytable', 'MY_ROW',
{COLUMN=>["pcf:\x00\x16\xDFx"],
FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
true, Bytes.toBytes(1499010.to_java(:int)), false)}
COLUMN CELL
pcf:\x00\x16\xDFx timestamp=1568380663616,
value=\x00\x16\xDFx
1 row(s) in 0.0080 seconds
hbase(main):009:0>
Is there any other way to get arond this ?.
Regards,
Solvannan R M
On 2019/09/13 04:53:45, Anoop John wrote:
> Hi>
> When you did a put with a lower qualifier int (put 'mytable',>
> 'MY_ROW', "pcf:\x0A", "\x00") the system flow is getting a valid cell
at>
> 1st step itself and that getting passed to the Filter. The Filter is
doing>
> a seek which just avoids all the in between deletes and puts
processing..>
> In 1st case the Filter wont get into action at all unless the scan flow>
> sees a valid cell. The delete processing happens as 1st step before the>
> filter processinf step happening.>
>
> In this case I am wondering why you can not add the specific 1st
qualifier>
> in the get part itself along with the column range filter. I mean>
>
> get 'mytable', 'MY_ROW', {COLUMN=>['pcf: *1499000 * '],>
> FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),>
> true, Bytes.toBytes(1499010.to_java(:int)), false)}>
>
> Pardon the syntax it might not be proper for the shell.. Can this be
done?>
> This will make the scan to make a seek to the given qualifier at 1st
step>
> itself.>
>
> Anoop>
>
> On Thu, Sep 12, 2019 at 10:18 PM Udai Bhan Kashyap (BLOOMBERG/
PRINCETON) <>
> [email protected]> wrote:>
>
> > Are you keeping the deleted cells? Check 'VERSIONS' for the column
family>
> > and set it to 1 if you don't want to keep the deleted cells.>
> >>
> > From: [email protected] At: 09/12/19 12:40:01To:>
> > [email protected]>
> > Subject: Re: HBase Scan consumes high cpu>
> >>
> > Hi,>
> >>
> > As said earlier, we have populated the rowkey "MY_ROW" with integers>
> > from 0 to 1500000 as column qualifiers. Then we have deleted the>
> > qualifiers from 0 to 1499000.>
> >>
> > We executed the following query. It took 15.3750 seconds to execute.>
> >>
> > hbase(main):057:0> get 'mytable', 'MY_ROW', {COLUMN=>['pcf'],>
> > FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),>
> > true, Bytes.toBytes(1499010.to_java(:int)), false)}>
> > COLUMN CELL>
> > pcf:\x00\x16\xDFx timestamp=1568123881899,>
> > value=\x00\x16\xDFx>
> > pcf:\x00\x16\xDFy timestamp=1568123881899,>
> > value=\x00\x16\xDFy>
> > pcf:\x00\x16\xDFz timestamp=1568123881899,>
> > value=\x00\x16\xDFz>
> > pcf:\x00\x16\xDF{ timestamp=1568123881899,>
> > value=\x00\x16\xDF{>
> > pcf:\x00\x16\xDF| timestamp=1568123881899,>
> > value=\x00\x16\xDF|>
> > pcf:\x00\x16\xDF} timestamp=1568123881899,>
> > value=\x00\x16\xDF}>
> > pcf:\x00\x16\xDF~ timestamp=1568123881899,>
> > value=\x00\x16\xDF~>
> > pcf:\x00\x16\xDF\x7F timestamp=1568123881899,>
> > value=\x00\x16\xDF\x7F>
> > pcf:\x00\x16\xDF\x80 timestamp=1568123881899,>
> > value=\x00\x16\xDF\x80>
> > pcf:\x00\x16\xDF\x81 timestamp=1568123881899,>
> > value=\x00\x16\xDF\x81>
> > 1 row(s) in 15.3750 seconds>
> >>
> >>
> > Now we inserted a new column with qualifier 10 (\x0A), such that it>
> > comes earlier in lexicographical order. Now we executed the same
query.>
> > It only took 0.0240 seconds.>
> >>
> > hbase(main):058:0> put 'mytable', 'MY_ROW', "pcf:\x0A", "\x00">
> > 0 row(s) in 0.0150 seconds>
> > hbase(main):059:0> get 'mytable', 'MY_ROW', {COLUMN=>['pcf'],>
> > FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),>
> > true, Bytes.toBytes(1499010.to_java(:int)), false)}>
> > COLUMN CELL>
> > pcf:\x00\x16\xDFx timestamp=1568123881899,>
> > value=\x00\x16\xDFx>
> > pcf:\x00\x16\xDFy timestamp=1568123881899,>
> > value=\x00\x16\xDFy>
> > pcf:\x00\x16\xDFz timestamp=1568123881899,>
> > value=\x00\x16\xDFz>
> > pcf:\x00\x16\xDF{ timestamp=1568123881899,>
> > value=\x00\x16\xDF{>
> > pcf:\x00\x16\xDF| timestamp=1568123881899,>
> > value=\x00\x16\xDF|>
> > pcf:\x00\x16\xDF} timestamp=1568123881899,>
> > value=\x00\x16\xDF}>
> > pcf:\x00\x16\xDF~ timestamp=1568123881899,>
> > value=\x00\x16\xDF~>
> > pcf:\x00\x16\xDF\x7F timestamp=1568123881899,>
> > value=\x00\x16\xDF\x7F>
> > pcf:\x00\x16\xDF\x80 timestamp=1568123881899,>
> > value=\x00\x16\xDF\x80>
> > pcf:\x00\x16\xDF\x81 timestamp=1568123881899,>
> > value=\x00\x16\xDF\x81>
> > 1 row(s) in 0.0240 seconds>
> > hbase(main):060:0>>
> >>
> >>
> > We were able to reproduce the result consistently same, the pattern>
> > being bulk insert followed by bulk delete of most of the earlier
columns.>
> >>
> >>
> > We observed the following behaviour while debugging the StoreScanner>
> > (regionserver).>
> >>
> > Case 1:>
> >>
> > 1. When StoreScanner.next() is called, it starts to iterate over the>
> > cells from the start of the rowkey.>
> >>
> > 2. As all the cells are deleted (from 0 to 1499000), we could see>
> > alternate delete and put type cells. Now, the>
> > NormalUserScanQueryMatcher.match() returns>
> > ScanQueryMatcher.MatchCode.SKIP and>
> > ScanQueryMatcher.MatchCode.SEEK_NEXT_COL for Delete and Put type cell>
> > respectively. This iteration happens throughout the range of 0 to
1499000.>
> >>
> > 3. This happens until a valid Put type cell is encountered, where the>
> > matcher applies the ColumnRangeFilter to the cell, which in turm
returns>
> > ScanQueryMatcher.MatchCode.SEEK_NEXT_USING_HINT. In the next
iteration>
> > it seeks directly to the desired column.>
> >>
> >>
> > Case 2:>
> >>
> > 1. When StoreScanner.next() is called, it starts to iterate over the>
> > cells from the start of the rowkey.>
> >>
> > 2. When the Put cell of qualifier 10 (\x0A) is encountered, the
matcher>
> > returns ScanQueryMatcher.MatchCode.SEEK_NEXT_USING_HINT. In the next>
> > iteration it seeks directly to the desired column.>
> >>
> >>
> > Please let us know if this behaviour is intentional or it could be
avoided.>
> >>
> > Regards,>
> >>
> > Solvannan R M>
> >>
> >>
> > On 2019/09/10 17:12:36, Josh Elser wrote:>
> > > Deletes are held in memory. They represent data you have to
traverse >>
> > > until that data is flushed out to disk. When you write a new cell>
> > with a >>
> > > qualifier of 10, that sorts, lexicographically, "early" with
respect>
> > to >>
> > > the other qualifiers you've written.>>
> > >>
> > > By that measure, if you are only scanning for the first column in
this >>
> > > row which you've loaded with deletes, it would make total sense
to me >>
> > > that the first case is slow and the second fast is fast>>
> > >>
> > > Can you please share exactly how you execute your "query" for>
> > both(all) >>
> > > scenarios?>>
> > >>
> > > On 9/10/19 11:35 AM, Solvannan R M wrote:>>
> > > > Hi,>>
> > > > >>
> > > > We have been using HBase (1.4.9) for a case where timeseries data>
> > is continuously inserted and deleted (high churn) against a single>
> > rowkey. The column keys would represent timestamp more or less.
When we>
> > scan this data using ColumnRangeFilter for a recent time-range,
scanner>
> > for the stores (memstore & storefiles) has to go through contiguous>
> > deletes, before it reaches the requested timerange data. While using>
> > this scan, we could notice 100% cpu usages in single core by the>
> > regionserver process.>>
> > > > >>
> > > > So, for our case, most of the cells with older timestamps will be>
> > in deleted state. While traversing these deleted cells, the
regionserver>
> > process causing 100% cpu usage in single core.>>
> > > > >>
> > > > We tried to trace the code for scan and we observed the following>
> > behaviour.>>
> > > > >>
> > > > 1. While scanner is initialized, it seeked all the store-scanners>
> > to the start of the rowkey.>>
> > > > 2. Then it traverses the deleted cells and discards it (as it was>
> > deleted) one by one.>>
> > > > 3. When it encounters a valid cell (put type), it applies the>
> > filter and it returns SEEK_TO_NEXT_USING_HINT.>>
> > > > 4. Now the scanner seeks to the required key directly and
returning>
> > the results quickly then.>>
> > > > >>
> > > > For confirming the mentioned behaviour, we have done a test:>>
> > > > 1. We have populated a single rowkey with column qualifier as a>
> > range of integers of 0 to 1500000 with random data.>>
> > > > 2. We then deleted the column qualifier range of 0 to 1499000.>>
> > > > 3. Now the data is only in memsore. No store file exists.>>
> > > > 4. Now we scanned the rowkey with ColumnRangeFilter[1499000,>
> > 1499010).>>
> > > > 5. The query took 12 seconds to execute. During this query, a>
> > single core is completely used>>
> > > > 6. Then we put a new cell with qualifier 10.>>
> > > > 7. Executed the same query, it took 0.018 seconds to execute.>>
> > > > >>
> > > > Kindly check this and advise !.>>
> > > > >>
> > > > Regards,>>
> > > > Solvannan R M>>
> > > > >>
> > >>
> >>
> >>
> >>
>