On Wed, Jan 8, 2014 at 10:41 AM, Terry P. <[email protected]> wrote:

> Hi Keith,
> The goal of the iterator is to purge data that has expired (or suppress it
> for scans). The goal of the log message is to bring to light any data
> format issues, as otherwise the "bad data" would NOT be purged by the
> iterator and hang around forever, which would be bad, so yes we would purge
> it with a special job. The iterator fires at both Full Major Compaction and
> at Scan time.
>

So you want to use the summary data from scans to know if you should kick
off a full major compaction? In 1.6.0, compaction strategies were added
(ACCUMULO-1451). If scans could provide information to these compaction
strategies, then that would lay the groundwork for ACCUMULO-1266 and what
you are trying to achieve. I am not sure of the best way to do this. Maybe
when a scan iterator is closed it could update counters (maybe counters
encourage small memory usage). The compaction strategy could access the
counters and use them to make a decision about doing a full major
compaction.

>
> Good point on "How did the bad data get there?" -- it shouldn't based on
> how items are indexed and then inserted into Accumulo, but I wanted to
> check for it in case the individual that installs the iterator in Accumulo
> fat-fingers the date format, OR if someone changes it on the other side
> (the app that sends the data to Accumulo). The first one could happen
> easily, but the latter shouldn't happen. But as folks roll off programs and
> others maintain the code, anything can happen.
>
> Looks like ACCUMULO-1280 is exactly what I need! Maybe someday, but until
> then what I have for the iterator will do the job (and thanks again for
> your help on it!).
>
> Best regards,
> Terry
>
> On Wed, Jan 8, 2014 at 9:30 AM, Keith Turner <[email protected]> wrote:
>
>> What is your goal? It seems like you want to produce counts about bad
>> data suppressed at scan time. What will you do with these counts? Will
>> you ever purge the bad data? How did the bad data get there? If you are
>> not bulk importing the data, then maybe you could add constraints to the
>> table?
>>
>>
>> On Mon, Jan 6, 2014 at 7:30 PM, Terry P. <[email protected]> wrote:
>>
>>> Greetings folks,
>>> I have an iterator that extends RowFilter and I have a case where I need
>>> to know when its defined date format doesn't match the format of the data
>>> being scanned by the iterator. I don't want to flood the tserver log with
>>> an error per row (how horrid that would be), but instead keep a counter of
>>> the number of times that error occurs during a scan or major compaction.
>>>
>>> Trouble is, I don't see any way to know when an iterator is on the "last
>>> row" or "last entry" in its scan on a tabletserver, as if I could test for
>>> that, I could then dump my single log message with the count of date format
>>> parse errors for that scan/compaction.
>>>
>>> Anyone know a way to determine if an iterator is at the "last entry" or
>>> "last row" of its execution?
>>>
>>
>> I do not think there is a good way to do this. ACCUMULO-1280
>>
>>
>>>
>>> Many thanks in advance.
>>>
>>
>>
>
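
For what it is worth, the counting half of this can be sketched without
touching the Accumulo API at all. The class below is only an illustration:
DateParseErrorCounter, parseOrCount() and summary() are hypothetical names,
not anything that exists in Accumulo, and it uses java.time for brevity. It
counts date-format parse failures instead of logging each one, and builds a
single summary line. It does not solve the real problem in this thread,
which is that there is currently no good hook for knowing when to emit that
line (ACCUMULO-1280).

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, NOT part of the Accumulo API. A filter could hold one
// instance per scan or compaction and call parseOrCount() for each entry.
class DateParseErrorCounter {
    private final DateTimeFormatter formatter;
    private long seen = 0;         // entries examined so far
    private long parseErrors = 0;  // entries whose date did not match the format

    DateParseErrorCounter(String pattern) {
        this.formatter = DateTimeFormatter.ofPattern(pattern);
    }

    // Returns the parsed date, or null when the value does not match the
    // configured format. A failure is counted, not logged.
    LocalDate parseOrCount(String value) {
        seen++;
        try {
            return LocalDate.parse(value, formatter);
        } catch (DateTimeParseException e) {
            parseErrors++;
            return null;
        }
    }

    long parseErrors() {
        return parseErrors;
    }

    // One line for the whole scan or compaction; empty when nothing failed.
    String summary() {
        if (parseErrors == 0) {
            return "";
        }
        return parseErrors + " of " + seen + " entries had an unparseable date";
    }
}
```

With a pattern of "yyyy-MM-dd", feeding it "2014-01-08" and then
"01/08/2014" leaves parseErrors() at 1. Whatever eventually learns that the
scan has finished would log summary() once, or hand the count to a
compaction strategy as described above.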
