Re: scanner question in regards to columns loaded

Jamie Johnson Sun, 26 Jan 2014 16:39:59 -0800

After a little reading... if I use fetchColumnFamily, does that skip any rows
that do not have the column family?
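
A rough sketch of the behaviour in question, for reference: restricting a scan to one column family returns only the entries in that family, so a row with no entries in that family contributes nothing to the results. The snippet is not Accumulo client code. It simulates a sorted table with a plain TreeMap, and the names FamilyFilterSketch, scanFamily and sampleTable (and the sample rows) are made up for illustration. With the real client, the corresponding step would be a call such as scanner.fetchColumnFamily(new Text("term")).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulation only: a TreeMap stands in for a sorted Accumulo table.
// Keys are "row\0family\0qualifier" strings, so they sort by row,
// then column family, then column qualifier.
class FamilyFilterSketch {

    static String key(String row, String family, String qualifier) {
        return row + "\0" + family + "\0" + qualifier;
    }

    // Returns the row IDs of all entries in the given column family, in sorted order.
    // Rows that have no entry in that family simply never show up.
    static List<String> scanFamily(TreeMap<String, String> table, String family) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> e : table.entrySet()) {
            String[] parts = e.getKey().split("\0", -1);
            if (parts[1].equals(family)) {
                rows.add(parts[0]);
            }
        }
        return rows;
    }

    // Hypothetical sample data: one "term" marker row per term,
    // plus one "index" row per term/document pair.
    static TreeMap<String, String> sampleTable() {
        TreeMap<String, String> table = new TreeMap<>();
        table.put(key("term1", "term", ""), "");
        table.put(key("term1=doc1", "index", "count"), "10");
        table.put(key("term1=doc2", "index", "count"), "5");
        table.put(key("term2", "term", ""), "");
        table.put(key("term2=doc3", "index", "count"), "3");
        return table;
    }

    public static void main(String[] args) {
        System.out.println(scanFamily(sampleTable(), "term"));
    }
}
```

In this simulation, asking for the "term" family yields each term once, and the term/document rows (which only have "index" entries) are skipped.
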
On Jan 26, 2014 7:27 PM, "Jamie Johnson" <[email protected]> wrote:


> Thanks for the ideas.  Filters are client side, right?
>
> I need to read the documentation more, as I don't know how to query just a
> column family.  Would it be possible to get all terms that start with a
> particular value?  I was thinking that we would need a special prefix for
> this, but if it could be done without one, that would work well.
> On Jan 26, 2014 5:44 PM, "Christopher" <[email protected]> wrote:
>
>> Ah, I see. Well, you could do that with a custom filter (iterator),
>> but otherwise, no, not unless you had some other special per-term
>> entry to query (rather than per-term/document pair). The design of
>> this kind of table, though, seems focused on finding documents that
>> contain the given terms, not on listing all terms seen. If you
>> need that additional feature and don't want to write a custom filter,
>> you could achieve it by putting a special entry in its own row for
>> each term, in addition to the per-term/document entries, as in:
>>
>> RowID               ColumnFamily   Column Qualifier   Value
>> <term1>             term           -                  -
>> <term1>=<doc_id2>   index          count              5
>>
>> Then, you could list terms by querying the "term" column family
>> without getting duplicates. And, you could get decent performance with
>> this scan if you put the "term" column family and the "index" column
>> family in separate locality groups. You could even make this entry an
>> aggregated count for all documents (see documentation for combiners),
>> in case you want corpus-wide term frequencies (for something like
>> TF-IDF computations).
>>
>> --
>> Christopher L Tubbs II
>> http://gravatar.com/ctubbsii
>>
>>
>> On Sun, Jan 26, 2014 at 7:55 AM, Jamie Johnson <[email protected]> wrote:
>> > I mean, if a user asked for all terms that started with "term", is there a
>> > way to get term1 and term2 just once while scanning, or would I get each
>> > twice, once for each docid, and need to filter client side?
>> >
>> > On Jan 26, 2014 1:33 AM, "Christopher" <[email protected]> wrote:
>> >>
>> >> If you use the Range constructor that takes two arguments, then yes,
>> >> you'd get two entries. However, "count" would come before "doc_id",
>> >> because the qualifier is part of the Key and, therefore, part
>> >> of the sort order. There's also a Range constructor that allows you to
>> >> specify whether you want the startKey and endKey to be inclusive or
>> >> exclusive.
>> >>
>> >> I don't know of a specific document that outlines various strategies
>> >> that I can link to. Perhaps I'll put one together, when I get some
>> >> spare time, if nobody else does. I think most people do a lot of
>> >> experimentation to figure out which strategies work best.
>> >>
>> >> I'm not entirely sure what you mean about "getting an iterator over
>> >> all terms without duplicates". I'm assuming you don't mean duplicate
>> >> versions of a single entry, which is handled by the
>> >> VersioningIterator, which should be on new tables by default and set
>> >> to retain only the most recent version, to support updates. With the scheme I
>> >> suggested, your table would look something like the following,
>> >> instead:
>> >>
>> >> RowID                       ColumnFamily     Column Qualifier     Value
>> >> <term1>=<doc_id1>   index                  count                     10
>> >> <term1>=<doc_id2>   index                  count                     5
>> >> <term2>=<doc_id3>   index                  count                     3
>> >> <term3>=<doc_id1>   index                  count                     12
>> >>
>> >> With this scheme, you'd have only a single entry (a count) for each
>> >> row, and a single row for each term/document combination, so you
>> >> wouldn't have any duplicate counts for any given term/document. If
>> >> that's what you mean by duplicates...
>> >>
>> >>
>> >> --
>> >> Christopher L Tubbs II
>> >> http://gravatar.com/ctubbsii
>> >>
>> >>
>> >> On Sat, Jan 25, 2014 at 12:19 AM, Jamie Johnson <[email protected]> wrote:
>> >> > Thanks for the reply Chris.  Say I had the following
>> >> >
>> >> > RowID     ColumnFamily     Column Qualifier     Value
>> >> > term         Occurrence~1     doc_id                    1
>> >> > term         Occurrence~1     count                      10
>> >> > term2       Occurrence~2      doc_id                     2
>> >> > term2       Occurrence~2      count                      1
>> >> >
>> >> > Creating a scanner with start key new Key(new Text("term"), new
>> >> > Text("Occurrence~1")) and end key new Key(new Text("term"), new
>> >> > Text("Occurrence~1")), I would get an iterator with two entries: the
>> >> > first key would be doc_id and the second would be count.  Is that
>> >> > accurate?
>> >> >
>> >> > Regarding the other strategies, is there anywhere that some of these
>> >> > are captured?  Also, in your example, how would you go about getting an
>> >> > iterator over all terms without duplicates?  Again, thanks.
>> >> >
>> >> >
>> >> > On Fri, Jan 24, 2014 at 11:34 PM, Christopher <[email protected]> wrote:
>> >> >>
>> >> >> It's not quite clear what you mean by "load", but I think you mean
>> >> >> "iterate over"?
>> >> >>
>> >> >> A simplified explanation is this:
>> >> >>
>> >> >> When you scan an Accumulo table, you are streaming each entry
>> >> >> (Key/Value pair), one at a time, through your client code. They are
>> >> >> only held in memory if you do that yourself in your client code. A
>> >> >> row in Accumulo is the set of entries that share a particular value of
>> >> >> the Row portion of the Key. They are logically grouped, but are not
>> >> >> grouped in memory unless you do that.
>> >> >>
>> >> >> One additional note is regarding your index schema of a row being a
>> >> >> search term and columns being documents. You will likely have issues
>> >> >> with this strategy, as the number of documents for high frequency
>> >> >> terms grows, because tablets do not split in the middle of a row.
>> >> >> With your schema, a row could get too large to manage on a single
>> >> >> tablet server. A slight variation, like concatenating the search term
>> >> >> with a document identifier in the row (term=doc1, term=doc2, ....)
>> >> >> would allow the high frequency terms to split into multiple tablets if
>> >> >> they get too large. There are better strategies, but that's just one
>> >> >> simple option.
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Christopher L Tubbs II
>> >> >> http://gravatar.com/ctubbsii
>> >> >>
>> >> >>
>> >> >> On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <[email protected]> wrote:
>> >> >> > If I have a row whose key is a particular term, and a set of columns
>> >> >> > that store the documents the term appears in, then if I load the row,
>> >> >> > are the contents of all of the columns also loaded?  Is there a way to
>> >> >> > page over the columns such that only N columns are in memory at any
>> >> >> > point?  In this particular case the documents are all in a particular
>> >> >> > column family (say docs) and the column qualifier is created
>> >> >> > dynamically; for argument's sake we can say they are UUIDs.
>> >> >
>> >> >
>>
>
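
The row layout suggested in the quoted thread, one row per term/document pair with the row ID written as term=doc_id, can be illustrated with a small range scan. This is a sketch under stated assumptions and not Accumulo client code: a TreeMap of row ID to count stands in for the sorted table, and the names TermDocRangeSketch, rowsForTerm, totalCount and sampleIndex are hypothetical. Because rows are kept in sorted order, all rows for one term form a contiguous range that starts at "term=" (inclusive) and ends just before the next possible prefix (exclusive), which mirrors the inclusive/exclusive choice mentioned for the Range constructor.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Simulation only: a TreeMap of rowId -> count stands in for the
// "<term>=<doc_id>" index rows from the quoted thread.
class TermDocRangeSketch {

    // All rows for one term: start at "<term>=" (inclusive) and stop before
    // "<term>>" (exclusive); '>' is the character right after '=' in sort order,
    // so "term1" does not accidentally match rows for "term10".
    static NavigableMap<String, Integer> rowsForTerm(TreeMap<String, Integer> index, String term) {
        String start = term + "=";
        String end = term + (char) ('=' + 1);
        return index.subMap(start, true, end, false);
    }

    // Sums the per-document counts for a term (a corpus-wide term frequency).
    static int totalCount(TreeMap<String, Integer> index, String term) {
        int total = 0;
        for (int count : rowsForTerm(index, term).values()) {
            total += count;
        }
        return total;
    }

    // Sample rows taken from the table in the quoted thread.
    static TreeMap<String, Integer> sampleIndex() {
        TreeMap<String, Integer> index = new TreeMap<>();
        index.put("term1=doc_id1", 10);
        index.put("term1=doc_id2", 5);
        index.put("term2=doc_id3", 3);
        index.put("term3=doc_id1", 12);
        return index;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> index = sampleIndex();
        System.out.println(rowsForTerm(index, "term1").keySet());
        System.out.println(totalCount(index, "term1"));
    }
}
```

Java compares strings by UTF-16 code unit while Accumulo compares raw bytes; for the ASCII row IDs used here the two orderings agree.
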
