Filters are iterators, and they are configured to run on the server side. fetchColumnFamily will only return entries in the specified column families. If a row has no entries in the specified column families, then no entries for that row will be returned.
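To make that behavior concrete, here is a rough sketch in plain Java. It is not the Accumulo client API: the class FetchColumnFamilySketch, the EntryKey type, and the rowsWithFamily helper are hypothetical names made up for illustration, a TreeMap stands in for the sorted table, and the "term2" rows are invented sample data in the shape of the per-term layout quoted below. It only models the rule that rows with no entry in the requested column family contribute nothing to the scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Plain-Java model of a sorted key/value table. This is NOT the Accumulo client API.
final class FetchColumnFamilySketch {

    // Hypothetical stand-in for the row/family/qualifier portion of a Key.
    static final class EntryKey implements Comparable<EntryKey> {
        final String row;
        final String family;
        final String qualifier;

        EntryKey(String row, String family, String qualifier) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
        }

        // Sort by row, then column family, then column qualifier.
        @Override
        public int compareTo(EntryKey other) {
            int c = row.compareTo(other.row);
            if (c == 0) c = family.compareTo(other.family);
            if (c == 0) c = qualifier.compareTo(other.qualifier);
            return c;
        }
    }

    // Sample data (assumed): one "term" entry per term, plus one "index"
    // entry per term/document pair.
    static TreeMap<EntryKey, String> exampleTable() {
        TreeMap<EntryKey, String> table = new TreeMap<>();
        table.put(new EntryKey("term1", "term", ""), "");
        table.put(new EntryKey("term1=doc_id1", "index", "count"), "10");
        table.put(new EntryKey("term1=doc_id2", "index", "count"), "5");
        table.put(new EntryKey("term2", "term", ""), "");
        table.put(new EntryKey("term2=doc_id3", "index", "count"), "3");
        return table;
    }

    // Models a scan restricted to one column family and one row prefix:
    // entries in other families are skipped, so a row that has no entry
    // in the requested family never shows up in the result.
    static List<String> rowsWithFamily(TreeMap<EntryKey, String> table,
                                       String family, String rowPrefix) {
        List<String> rows = new ArrayList<>();
        for (EntryKey key : table.keySet()) {
            if (key.family.equals(family) && key.row.startsWith(rowPrefix)) {
                rows.add(key.row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        TreeMap<EntryKey, String> table = exampleTable();
        // prints [term1, term2]
        System.out.println(rowsWithFamily(table, "term", "term"));
        // prints [term1=doc_id1, term1=doc_id2, term2=doc_id3]
        System.out.println(rowsWithFamily(table, "index", "term"));
    }
}
```

In the real client the restriction would be set on the scanner with fetchColumnFamily rather than checked in a loop like this; the loop is only there to show which entries come back.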
--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

On Sun, Jan 26, 2014 at 7:39 PM, Jamie Johnson <[email protected]> wrote:
> After a little reading...if I use fetchColumnFamily does that skip any rows
> that does not have the column family?
>
> On Jan 26, 2014 7:27 PM, "Jamie Johnson" <[email protected]> wrote:
>>
>> Thanks for the ideas. Filters are client side right?
>>
>> I need to read the documentation more as I don't know how to just query a
>> column family. Would it be possible to get all terms that start with a
>> particular value? I was thinking that we would need a special prefix for
>> this but if something could be done without needing it that would work well.
>>
>> On Jan 26, 2014 5:44 PM, "Christopher" <[email protected]> wrote:
>>>
>>> Ah, I see. Well, you could do that with a custom filter (iterator),
>>> but otherwise, no, not unless you had some other special per-term
>>> entry to query (rather than per-term/document pair). The design of
>>> this kind of table though, seems focused on finding documents which
>>> contain the given terms, though, not listing all terms seen. If you
>>> need that additional feature and don't want to write a custom filter,
>>> you could achieve that by putting a special entry in its own row for
>>> each term, in addition to the entries per-term/document pair, as in:
>>>
>>> RowID               ColumnFamily   Column Qualifier   Value
>>> <term1>             term           -                  -
>>> <term1>=<doc_id2>   index          count              5
>>>
>>> Then, you could list terms by querying the "term" column family
>>> without getting duplicates. And, you could get decent performance with
>>> this scan if you put the "term" column family and the "index" column
>>> family in separate locality groups. You could even make this entry an
>>> aggregated count for all documents (see documentation for combiners),
>>> in case you want corpus-wide term frequencies (for something like
>>> TF-IDF computations).
>>>
>>> --
>>> Christopher L Tubbs II
>>> http://gravatar.com/ctubbsii
>>>
>>>
>>> On Sun, Jan 26, 2014 at 7:55 AM, Jamie Johnson <[email protected]> wrote:
>>> > I mean if a user asked for all terms that started with "term" is there
>>> > a way to get term1 and term2 just once while scanning or would I get
>>> > each twice, once for each docid and need to filter client side?
>>> >
>>> > On Jan 26, 2014 1:33 AM, "Christopher" <[email protected]> wrote:
>>> >>
>>> >> If you use the Range constructor that takes two arguments, then yes,
>>> >> you'd get two entries. However, "count" would come before "doc_id",
>>> >> though, because the qualifier is part of the Key, and therefore, part
>>> >> of the sort order. There's also a Range constructor that allows you to
>>> >> specify whether you want the startKey and endKey to be inclusive or
>>> >> exclusive.
>>> >>
>>> >> I don't know of a specific document that outlines various strategies
>>> >> that I can link to. Perhaps I'll put one together, when I get some
>>> >> spare time, if nobody else does. I think most people do a lot of
>>> >> experimentation to figure out which strategies work best.
>>> >>
>>> >> I'm not entirely sure what you mean about "getting an iterator over
>>> >> all terms without duplicates". I'm assuming you don't mean duplicate
>>> >> versions of a single entry, which is handled by the
>>> >> VersioningIterator, which should be on new tables by default, and set
>>> >> to retain the recent 1 version, to support updates.
>>> >> With the scheme I suggested, your table would look something like
>>> >> the following, instead:
>>> >>
>>> >> RowID               ColumnFamily   Column Qualifier   Value
>>> >> <term1>=<doc_id1>   index          count              10
>>> >> <term1>=<doc_id2>   index          count              5
>>> >> <term2>=<doc_id3>   index          count              3
>>> >> <term3>=<doc_id1>   index          count              12
>>> >>
>>> >> With this scheme, you'd have only a single entry (a count) for each
>>> >> row, and a single row for each term/document combination, so you
>>> >> wouldn't have any duplicate counts for any given term/document. If
>>> >> that's what you mean by duplicates...
>>> >>
>>> >>
>>> >> --
>>> >> Christopher L Tubbs II
>>> >> http://gravatar.com/ctubbsii
>>> >>
>>> >>
>>> >> On Sat, Jan 25, 2014 at 12:19 AM, Jamie Johnson <[email protected]> wrote:
>>> >> > Thanks for the reply Chris. Say I had the following
>>> >> >
>>> >> > RowID   ColumnFamily   Column Qualifier   Value
>>> >> > term    Occurrence~1   doc_id             1
>>> >> > term    Occurrence~1   count              10
>>> >> > term2   Occurrence~2   doc_id             2
>>> >> > term2   Occurrence~2   count              1
>>> >> >
>>> >> > creating a scanner with start key new Key(new Text("term"), new
>>> >> > Text("Occurrence~1")) and end key new Key(new Text("term"), new
>>> >> > Text("Occurrence~1")) I would get an iterator with two entries, the
>>> >> > first key would be doc_id and the second would be count. Is that
>>> >> > accurate?
>>> >> >
>>> >> > In regards to the other strategies is there anywhere that some of
>>> >> > these are captured? Also in the your example, how would you go about
>>> >> > getting an iterator over all terms without duplicates? Again thanks
>>> >> >
>>> >> >
>>> >> > On Fri, Jan 24, 2014 at 11:34 PM, Christopher <[email protected]> wrote:
>>> >> >>
>>> >> >> It's not quite clear what you mean by "load", but I think you mean
>>> >> >> "iterate over"?
>>> >> >>
>>> >> >> A simplified explanation is this:
>>> >> >>
>>> >> >> When you scan an Accumulo table, you are streaming each entry
>>> >> >> (Key/Value pair), one at a time, through your client code. They are
>>> >> >> only held in memory if you do that yourself in your client code. A
>>> >> >> row in Accumulo is the set of entries that share a particular value
>>> >> >> of the Row portion of the Key. They are logically grouped, but are
>>> >> >> not grouped in memory unless you do that.
>>> >> >>
>>> >> >> One additional note is regarding your index schema of a row being a
>>> >> >> search term and columns being documents. You will likely have issues
>>> >> >> with this strategy, as the number of documents for high frequency
>>> >> >> terms grows, because tablets do not split in the middle of a row.
>>> >> >> With your schema, a row could get too large to manage on a single
>>> >> >> tablet server. A slight variation, like concatenating the search
>>> >> >> term with a document identifier in the row (term=doc1, term=doc2,
>>> >> >> ....) would allow the high frequency terms to split into multiple
>>> >> >> tablets if they get too large. There are better strategies, but
>>> >> >> that's just one simple option.
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Christopher L Tubbs II
>>> >> >> http://gravatar.com/ctubbsii
>>> >> >>
>>> >> >>
>>> >> >> On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <[email protected]> wrote:
>>> >> >> > If I have a row that as the key is a particular term and a set of
>>> >> >> > columns that stores the documents that the term appears in if I
>>> >> >> > load the row is the contents of all of the columns also loaded?
>>> >> >> > Is there a way to page over the columns such that only N columns
>>> >> >> > are in memory at any point?
>>> >> >> > In this particular case the documents are all in a particular
>>> >> >> > column family (say docs) and the column qualifier is created
>>> >> >> > dynamically, for arguments sake we can say they are UUIDs.
>>> >> >
>>> >> >
