John, thanks for the quick response!

Crazy enough, I'm not doing much differently than the VersioningIterator, since 
it stores the max number of versions that it should be returning, right? And 
that's a scan-time iterator (as well as majc/minc).

I am testing it as a scan-time iterator (set on the table, but using the 
Accumulo shell to scan). Perhaps I should force a couple of compactions and see 
what's left afterwards.
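
To make sure we are describing the same failure mode, here is a minimal 
plain-Java sketch of what I think is happening. It does not use the Accumulo 
API at all: UuidIndexFilter, scan() and the sample entries are hypothetical 
stand-ins for my filter and table, and the reinitAt parameter only simulates 
the iterator being torn down and rebuilt partway through a scan.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StatefulFilterSketch {

    // Hypothetical stand-in for the filter; NOT the Accumulo Filter API.
    static class UuidIndexFilter {
        private final Set<String> indexedUuids = new HashSet<>();

        // Mirrors the symptom: init() starts over with an empty hashset.
        void init() {
            indexedUuids.clear();
        }

        // family is the column family; payload is the value for index
        // entries (uuid\0hash) and the qualifier for key/value entries.
        boolean accept(String family, String payload) {
            if (family.equals("\0index")) {
                int sep = payload.indexOf('\0');
                indexedUuids.add(sep < 0 ? payload : payload.substring(0, sep));
                return true;
            }
            // Key/value entry: keep it only if its uuid was seen in the index.
            return indexedUuids.contains(family);
        }
    }

    // Entries for one row, in sorted order: "\0index" sorts before the uuids.
    static final String[][] SAMPLE = {
        {"\0index", "uuid1\0h1"},
        {"\0index", "uuid2\0h2"},
        {"uuid1", "color\0red"},
        {"uuid2", "color\0blue"},
        {"uuid3", "color\0green"}, // orphan: uuid3 is not in the index
    };

    // Runs the filter over the entries. If reinitAt >= 0, init() is called
    // again just before that position, simulating a rebuilt iterator.
    static List<String> scan(String[][] entries, int reinitAt) {
        UuidIndexFilter filter = new UuidIndexFilter();
        filter.init();
        List<String> keptFamilies = new ArrayList<>();
        for (int i = 0; i < entries.length; i++) {
            if (i == reinitAt) {
                filter.init();
            }
            if (filter.accept(entries[i][0], entries[i][1])) {
                keptFamilies.add(entries[i][0]);
            }
        }
        return keptFamilies;
    }

    public static void main(String[] args) {
        System.out.println("uninterrupted: " + scan(SAMPLE, -1).size() + " kept");
        System.out.println("re-init after index: " + scan(SAMPLE, 2).size() + " kept");
    }
}
```

In the uninterrupted run only the orphaned uuid3 entry is dropped. With the 
simulated re-init after the index entries, every key/value entry is dropped, 
which matches what I'm seeing.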





On Jan 3, 2013, at 5:53 PM, John Vines wrote:

> Are you testing this in scan time or via actual minor/major compactions? I 
> know at scan time, there is no guarantee that the iterator remains intact 
> through the entire scan, and it instead may be reconstructed, causing state 
> to be lost. I don't think this is the case for compaction time iterators, but 
> I'm not positive.
> 
> 
> On Thu, Jan 3, 2013 at 5:41 PM, Corey Nolet <[email protected]> wrote:
> Hey Guys,
> 
> In "Accumulo 1.3.5", I wrote a "Top N" table structure, services and a 
> FilteringIterator that would allow us to drop in several keys/values 
> associated with a UUID (similar to a document id). The UUID was further 
> associated with an "index" (or type). The purpose of the TopN table was to 
> keep the keys/values separated so that they could still be queried back with 
> cell-level tagging, but when I performed a query for an index, I would get 
> the last N UUIDs and further be able to query the keys/values for each of 
> those UUIDs.
> 
> This problem seemed simple to solve in Accumulo 1.3.5, as I was able to 
> provide 2 FilteringIterators for compaction time to perform data cleanup of 
> the table so that any keys/values kept around were guaranteed to be inside of 
> the range of those keys being managed by the versioning iterator. 
> 
> Just to recap, I have the following table structure. I also hash the 
> keys/values and run a filter before the versioning iterator to clean up any 
> duplicates. There are two types of columns: index & key/value.
> 
> 
> Index: 
> 
> R: index (or "type" of data)
> F: '\x00index'
> Q: empty
> V: uuid\x00hashOfKeys&Values
> 
> 
> Key/Value:
> 
> R: index (or "type" of data)
> F: uuid
> Q: key\x00value
> V: empty
> 
> 
> The filtering iterator that makes sure any key/value rows are in the index 
> manages a hashset internally. The index rows are purposefully indexed before 
> the key/value rows so that the filter can build up the hashset containing 
> those uuids in the index. As the filter iterates into the key/value rows, it 
> will return true only if the uuid of the key/value exists inside of the 
> hashset containing the uuids in the index. This worked with older versions of 
> Accumulo, but I'm now getting a weird artifact where init() is called on my 
> Filter in the middle of iterating through an index row.
> 
> More specifically, the Filter will iterate through the index rows of a 
> specific "index" and build up a hashset, then init() will be called which 
> wipes away the hashset of uuids, then the filter goes on to iterate through 
> the key/value rows. Keep in mind, we are talking about maybe 400k entries, 
> not enough to have more than 1 tablet.
> 
> Any idea why this may have worked on 1.3.5 but doesn't work any longer? I 
> know it has got to be a huge no-no to be storing state inside of a filter, but 
> I haven't had any issues until trying to update my code for the new version. 
> If I'm doing this completely wrong, any ideas on how to make this better?
> 
> 
> Thanks!
> 
> 
> -- 
> Corey Nolet
> Senior Software Engineer
> TexelTek, inc.
> [Office] 301.880.7123
> [Cell] 410-903-2110
> 
