If I understand SiMa's use case correctly, after the top records for (file A)_ are returned, (file B)_ should come next. Therefore some kind of server-side filter is needed to skip the remaining records for (file A)_.
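Short of writing a custom filter, one workaround is to bound each Scan to a single file prefix, let ColumnPaginationFilter cap the cells returned per row on the server, and stop once the overall limit is hit. A rough, untested sketch against the 0.98-era client API; the class and helper names are mine, and it assumes the rowkey layout from design #b quoted below:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnPaginationFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class PerFileTopN {

  // Smallest row key strictly greater than every key starting with `prefix`:
  // copy it and increment the last non-0xFF byte, carrying on overflow.
  // Returns null if no successor exists (prefix is all 0xFF).
  static byte[] nextPrefix(byte[] prefix) {
    byte[] next = Bytes.copy(prefix);
    for (int i = next.length - 1; i >= 0; i--) {
      if (next[i] != (byte) 0xFF) {
        next[i]++;
        return next;
      }
      next[i] = 0;
    }
    return null;
  }

  // Collect at most `limit` cells for one file prefix. Rowkeys embed the
  // reversed day, so rows for the same file already come back newest-first.
  static List<Cell> topCellsForFile(HTable table, String filePrefix, int limit)
      throws IOException {
    List<Cell> cells = new ArrayList<Cell>(limit);
    byte[] start = Bytes.toBytes(filePrefix);
    byte[] stop = nextPrefix(start);
    Scan scan = (stop == null) ? new Scan(start) : new Scan(start, stop);
    scan.setFilter(new ColumnPaginationFilter(limit, 0)); // cap cells per row
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        for (Cell cell : row.rawCells()) {
          cells.add(cell);
          if (cells.size() >= limit) {
            return cells; // done with (file A)_; caller moves to next prefix
          }
        }
      }
    } finally {
      scanner.close();
    }
    return cells; // corner case: fewer records than the limit
  }
}

Once topCellsForFile returns a full page for (file A)_, the caller restarts at nextPrefix() of the file-name portion of the rowkey, which is exactly the skip described above.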
Another (corner) case is that, for a certain file prefix, there may not be as many records as the preset (per file) limit.

Cheers

On Sat, Jul 19, 2014 at 1:41 PM, Arun Allamsetty <arun.allamse...@gmail.com> wrote:

> Hi,
>
> I have an idea which might be just baloney, but people learn from mistakes, and this is my attempt to learn. If I understand the use case properly, you want to get the first 500 records pertaining to a file, based on its file name. Since you want to limit the number of records returned, I won't recommend writing each record as a column. But what if, instead, we created a composite key consisting of the file name and the timestamp (epoch), in a fashion similar to the one described in Flurry - The Delicate Art of Organizing Data in HBase <http://www.flurry.com/2012/06/12/137492485>. If you want the latest timestamp first, we can use Long.MAX_VALUE - timestamp in the constructor for the composite key. Now, to get the top 500 records for, let's say, *fileA* and *2014/06/14*, convert the date to epoch time and create an object of the composite key class you created. Create a *Scan* object, specifying that key as the start row, and set *Scan#setMaxResultSize* to 500. That should give you only the top 500 records, and I believe the performance won't be bad provided you have the hardware to manage your data volume.
>
> Experts, please correct me wherever I am wrong.
>
> Thanks,
> Arun
>
>
> On Sat, Jul 19, 2014 at 9:23 AM, SiMaYunRui <myl...@hotmail.com> wrote:
>
> > Hi experts,
> >
> > I have a wide-flat table. During a scan, how can I limit the columns returned by a single row, rather than across all rows (which is what ColumnCountGetFilter does)? I need to scan multiple rows at the same time and do the aggregation on the client side.
> >
> > For more background: I am designing an auditing tool, which records the pattern "(who) operates against (what) at (when)". The search pattern is: given a time range from "2014/6/14 13:45" to "2014/6/24 7:15", list all files (the "what" part, matched by a starts-with search) that were operated on, in DESC order of (when).
> >
> > I have tens of millions of records per day, and keep them for 30 - 90 days. So I am considering two designs: a) rowkey as (file name)_(reverse of when). The problem is that people want to use a starts-with search to match multiple files, so the scan has to go through all matched files, which could be huge, and the client then has to re-order them to display the top 500 records. It could be very slow.
> >
> > b) use a wide-flat table, with rowkey as (file name)_(reverse of when, truncated to day, for partitioning) and qualifier as (reverse of when). This design can leverage the fact that qualifiers are ordered, so in my opinion it needs fewer searches than #a. But I cannot put all operations on a single file in one row, because the total number might exceed several million.
> >
> > So I am thinking of grouping data into the following shape using #b. Then, back to my original question: because I only need 500 records, if the row (file A)_(2014/06/14) contains more than that number, can I stop there and continue to scan the next row? And if I already have enough from (file A)_(2014/06/14), can I skip (file A)_(2014/06/13) and continue to scan (file B)_(2014/06/14), which is a different file?
> >
> > Row: (file A)_(2014/06/14)
> >   d:1341069600  value
> >   d:1341069500  value
> >   d:1341069400  value
> >
> > Row: (file A)_(2014/06/13)
> >   d:1341059600  value
> >   d:1341059500  value
> >   d:1341059400  value
> >
> > Row: (file B)_(2014/06/14)
> >   d:1341069700  value
> >   d:1341069580  value
> >   d:1341069401  value
> >
> > Sent from Windows Mail
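For concreteness, a rough sketch of the composite key Arun describes is below (untested; the 0x00 delimiter and the big-endian long encoding via Bytes.toBytes are my assumptions, not something the Flurry post prescribes). One caveat on the earlier suggestion: Scan#setMaxResultSize limits the returned data in bytes, not rows, so to cap the scan at 500 records you would use a PageFilter plus client-side trimming:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class AuditRowKey {

  // Rowkey = fileName + 0x00 + (Long.MAX_VALUE - epochMillis). HBase sorts
  // rowkeys lexicographically and Bytes.toBytes(long) is big-endian, so rows
  // for the same file come back newest-first.
  static byte[] compositeKey(String fileName, long epochMillis) {
    byte[] name = Bytes.toBytes(fileName);
    byte[] reversedTs = Bytes.toBytes(Long.MAX_VALUE - epochMillis);
    byte[] key = new byte[name.length + 1 + reversedTs.length];
    System.arraycopy(name, 0, key, 0, name.length);
    key[name.length] = 0x00; // delimiter: keeps "fileA" from shadowing "fileAB"
    System.arraycopy(reversedTs, 0, key, name.length + 1, reversedTs.length);
    return key;
  }

  // Scan the newest 500 records of `fileName` at or before `notAfterMillis`.
  // PageFilter is applied independently on each region server, so the client
  // must still trim the merged results to exactly 500.
  static Scan top500(String fileName, long notAfterMillis) {
    Scan scan = new Scan(compositeKey(fileName, notAfterMillis));
    scan.setFilter(new PageFilter(500));
    scan.setCaching(500); // fetch the whole page in one round trip if possible
    return scan;
  }
}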