Yes, the idea is to add a byte array to the column cf:BLOBDATACOLUMN for
each row key. 

This *should* only ever happen once per row. I do not modify the row keys
given by HBase (Result#getRow) in any way, so the row key in each Put
should be unique during a single run. Subsequent runs of the same program
do not attempt to reload rows that already contain data; this is done by
calling Result#getValue and checking for null (a rough sketch of that
check appears after the two filters below). I know this approach is
inefficient, but I have not found a good Filter strategy for bringing back
only the rows that *do not* have a certain column. Two Filters I have
tried with the Scan object are listed below:

This one ended up returning rows that already had data loaded. The idea
here was to create a comparison that no row would satisfy, so that, with
filterIfMissing=false, I would still get back the rows that did not have
the column being tested:
final SingleColumnValueFilter fltrOnlyColsNoData2
        = new SingleColumnValueFilter(
                COLUMN_FAMILY, BLOBDATACOLUMN,
                CompareFilter.CompareOp.LESS, Bytes.toBytes(0));


I used this one before; I can't remember what problem I had with it, and
I will try it again now that I have switched from next(batchSize) to the
iterator from ResultScanner#iterator:
final SingleColumnValueExcludeFilter fltrOnlyColsNoData
                    = new SingleColumnValueExcludeFilter(
                            COLUMN_FAMILY, BLOBDATACOLUMN,
                            CompareFilter.CompareOp.EQUAL,
                            new NullComparator());
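
For reference, this is roughly what the skip-and-load logic looks like with
the client-side null check (a sketch, not copied verbatim from my program;
COLUMN_FAMILY, BLOBDATACOLUMN, hbaseConfiguration, tblTarget and
getFileContentsForRow() stand in for the real constants and helper), using
the ResultScanner iterator plus the Result#getValue null test:

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

...

Scan scan = new Scan();
scan.setCaching(200);
// scan.setFilter(...) would go here if I find a Filter that reliably
// returns only the rows that do *not* yet have cf:BLOBDATACOLUMN.

HTable table = new HTable(hbaseConfiguration, Bytes.toBytes(tblTarget));
ResultScanner scanner = table.getScanner(scan);
try {
    for (Result row : scanner) {
        // Client-side check: skip rows whose blob column is already populated.
        if (row.getValue(COLUMN_FAMILY, BLOBDATACOLUMN) != null) {
            continue;
        }
        byte[] rowKey = row.getRow(); // used exactly as returned, never modified
        Put put = new Put(rowKey);
        put.add(COLUMN_FAMILY, BLOBDATACOLUMN, getFileContentsForRow(rowKey));
        table.put(put); // auto-flush is on by default, so each Put goes out
    }
} finally {
    scanner.close();
    table.close();
}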



Thanks




On 8/22/14 10:26 AM, "Ted Yu" <[email protected]> wrote:

>For a given rowkey, would there be only one record written per
>cf:BLOBDATACOLUMN column?
>
>Cheers
>
>
>On Fri, Aug 22, 2014 at 10:17 AM, Magana-zook, Steven Alan <
>[email protected]> wrote:
>
>> Hi Ted,
>>
>> For example, if the program reports an average speed of 88 records a
>> second, and I let the program run for 24 hours, then I would expect the
>> RowCounter program to report a number around 88
>> (rows/second)*24(hours)*(60min/hour)*60(seconds/min) = 7,603,200 rows.
>>
>> In actuality, RowCounter returns:
>>
>> org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters
>>         ROWS=1356588
>>
>>
>> The vast difference between ~7 million rows and ~1 million rows has me
>> confused about what happened to the other rows that should have been in
>> the table.
>>
>> Thanks for your reply,
>> Steven
>>
>>
>>
>>
>>
>>
>> On 8/22/14 9:53 AM, "Ted Yu" <[email protected]> wrote:
>>
>> >bq. the result from the RowCounter program is far fewer records than I
>> >expected.
>> >
>> >Can you give more detailed information about the gap ?
>> >
>> >Which hbase release are you running ?
>> >
>> >Cheers
>> >
>> >
>> >On Fri, Aug 22, 2014 at 9:26 AM, Magana-zook, Steven Alan <
>> >[email protected]> wrote:
>> >
>> >> Hello,
>> >>
>> >> I have written a program in Java that is supposed to update rows in an
>> >> HBase table that do not yet have a value in a certain column (blob
>> >> values of between 5k and 50k). The program keeps track of how many puts
>> >> have been added to the table along with how long the program has been
>> >> running. These pieces of information are used to calculate a speed for
>> >> data ingestion (records per second). After running the program for
>> >> multiple days, and based on the average speed reported, the result from
>> >> the RowCounter program is far fewer records than I expected. The
>> >> essential parts of the code are shown below (error handling and other
>> >> potentially unimportant code omitted) along with the command I use to
>> >> see how many rows have been updated.
>> >>
>> >> Is it possible that the put method call on HTable does not actually put
>> >> the record in the database while also not throwing an exception?
>> >> Could the output of RowCounter be incorrect?
>> >> Am I doing something below that is obviously incorrect?
>> >>
>> >> Row counter command (frequently reports OutOfOrderScannerNextException
>> >> during execution):
>> >> hbase org.apache.hadoop.hbase.mapreduce.RowCounter mytable
>> >> cf:BLOBDATACOLUMN
>> >> Code that is essentially what I am doing in my program:
>> >> ...
>> >> Scan scan = new Scan();
>> >> scan.setCaching(200);
>> >>
>> >> HTable targetTable = new HTable(hbaseConfiguration,
>> >>         Bytes.toBytes(tblTarget));
>> >> ResultScanner resultScanner = targetTable.getScanner(scan);
>> >>
>> >> int batchSize = 10;
>> >> Date startTime = new Date();
>> >> numFilesSent = 0;
>> >>
>> >> Result[] rows = resultScanner.next(batchSize);
>> >> while (rows != null) {
>> >>     for (Result row : rows) {
>> >>         byte[] rowKey = row.getRow();
>> >>         byte[] byteArrayBlobData = getFileContentsForRow(rowKey);
>> >>
>> >>         Put put = new Put(rowKey);
>> >>         put.add(COLUMN_FAMILY, BLOB_COLUMN, byteArrayBlobData);
>> >>         targetTable.put(put); // Auto-flush is on by default
>> >>         numFilesSent++;
>> >>
>> >>         float elapsedSeconds =
>> >>                 (new Date().getTime() - startTime.getTime()) / 1000.0f;
>> >>         float speed = numFilesSent / elapsedSeconds;
>> >>         // routinely says from 80 to 200+
>> >>         System.out.println("Speed(rows/sec): " + speed);
>> >>     }
>> >>     rows = resultScanner.next(batchSize);
>> >> }
>> >> ...
>> >>
>> >> Thanks,
>> >> Steven
>> >>
>>
>>
