Re: Schema design for filters

Kristoffer Sjögren Fri, 28 Jun 2013 11:54:44 -0700

@Otis

HBase is a natural fit for my use case because it's schemaless. I'm building a
configuration management system and there is no need for advanced
filtering/querying capabilities, just basic predicate logic and pagination
that scales to < 1 million rows with reasonable performance.
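
For reference, the multi-valued "in" predicate discussed further down this
thread can be sketched in plain Java without any HBase classes. This is only a
rough sketch under assumptions: the class name ListPredicates and all of its
methods are made up for illustration, longs are assumed to be stored as 8
big-endian bytes each, and strings are assumed to be stored as a 4-byte length
prefix followed by UTF-8 bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; none of these names come from HBase.
final class ListPredicates {

    // Assumed layout: each long is stored as 8 big-endian bytes.
    static byte[] encodeLongs(long... items) {
        ByteBuffer buf = ByteBuffer.allocate(items.length * 8);
        for (long item : items) {
            buf.putLong(item);
        }
        return buf.array();
    }

    // Reads length / 8 longs, starting at offset.
    static long[] decodeLongs(byte[] value, int offset, int length) {
        ByteBuffer buf = ByteBuffer.wrap(value, offset, length);
        long[] out = new long[length / 8];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getLong();
        }
        return out;
    }

    // Assumed layout: 4-byte length prefix, then the UTF-8 bytes of the string.
    static byte[] encodeStrings(String... items) {
        int total = 0;
        List<byte[]> encoded = new ArrayList<byte[]>();
        for (String item : items) {
            byte[] bytes = item.getBytes(StandardCharsets.UTF_8);
            encoded.add(bytes);
            total += 4 + bytes.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(total);
        for (byte[] bytes : encoded) {
            buf.putInt(bytes.length).put(bytes);
        }
        return buf.array();
    }

    // Reads length-prefixed strings until the given range is consumed.
    static List<String> decodeStrings(byte[] value, int offset, int length) {
        ByteBuffer buf = ByteBuffer.wrap(value, offset, length);
        List<String> out = new ArrayList<String>();
        while (buf.hasRemaining()) {
            byte[] bytes = new byte[buf.getInt()];
            buf.get(bytes);
            out.add(new String(bytes, StandardCharsets.UTF_8));
        }
        return out;
    }

    // in(column, values...): true only if the cell holds every wanted value.
    static boolean containsAll(long[] cell, long... wanted) {
        for (long w : wanted) {
            boolean found = false;
            for (long c : cell) {
                if (c == w) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    static boolean containsAll(List<String> cell, String... wanted) {
        return cell.containsAll(Arrays.asList(wanted));
    }

    public static void main(String[] args) {
        byte[] longs = encodeLongs(1L, 2L, 3L);
        System.out.println(containsAll(decodeLongs(longs, 0, longs.length), 1L, 3L));
        byte[] strings = encodeStrings("a", "b", "c");
        System.out.println(containsAll(decodeStrings(strings, 0, strings.length), "a", "d"));
    }
}
```

Inside a custom comparator the same decode step would run on the cell bytes
handed to compareTo, with the match result mapped to 0 or non-zero.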


Thanks for the tip!


On Fri, Jun 28, 2013 at 8:34 PM, Otis Gospodnetic <
[email protected]> wrote:

> Kristoffer,
>
> You could also consider using something other than HBase, something
> that supports "secondary indices", like anything that is Lucene based
> - Solr and ElasticSearch for example.  We recently compared how we
> aggregate data in HBase (see my signature) and how we would do it if
> we were to use Solr (or ElasticSearch), and so far things look better
> in Solr for our use case.  And our use case involves a lot of
> filtering, slicing and dicing..... something to consider...
>
> Otis
> --
> Solr & ElasticSearch Support -- http://sematext.com/
> Performance Monitoring -- http://sematext.com/spm
>
>
>
> On Fri, Jun 28, 2013 at 5:24 AM, Kristoffer Sjögren <[email protected]>
> wrote:
> > Interesting. I'm actually building something similar.
> >
> > A full-blown SQL implementation is a bit overkill for my particular use
> > case, and the query API is the final piece of the puzzle. But I'll
> > definitely have a look for some inspiration.
> >
> > Thanks!
> >
> >
> >
> > On Fri, Jun 28, 2013 at 3:55 AM, James Taylor <[email protected]
> >wrote:
> >
> >> Hi Kristoffer,
> >> Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)?
> >> You could model your schema much like an O/R mapper and issue SQL queries
> >> through Phoenix for your filtering.
> >>
> >> James
> >> @JamesPlusPlus
> >> http://phoenix-hbase.blogspot.com
> >>
> >> On Jun 27, 2013, at 4:39 PM, "Kristoffer Sjögren" <[email protected]>
> >> wrote:
> >>
> >> > Thanks for your help Mike. Much appreciated.
> >> >
> >> > I don't store rows/columns in JSON format. The schema is exactly that of
> >> > a specific Java class, where the rowkey is a unique object identifier
> >> > with the class type encoded into it. Columns are the field names of the
> >> > class and the values are those of the object instance.
> >> >
> >> > I did think about coprocessors, but the schema is discovered at runtime
> >> > and I can't hard-code it.
> >> >
> >> > However, I still believe that filters might work. I had a look at
> >> > SingleColumnValueFilter, and this filter is able to target specific
> >> > column qualifiers with specific WritableByteArrayComparables.
> >> >
> >> > But list comparators are still missing... So I guess the only way is to
> >> > write these comparators?
> >> >
> >> > Do you follow my reasoning? Will it work?
> >> >
> >> >
> >> >
> >> >
> >> > On Fri, Jun 28, 2013 at 12:58 AM, Michael Segel
> >> > <[email protected]>wrote:
> >> >
> >> >> Ok...
> >> >>
> >> >> If you want to do type checking and schema enforcement...
> >> >>
> >> >> You will need to do this as a coprocessor.
> >> >>
> >> >> The quick and dirty way (not recommended) would be to hard-code the
> >> >> schema into the coprocessor code.
> >> >>
> >> >> A better way... at startup, load up ZK to manage the set of known table
> >> >> schemas, which would be a map of column qualifier to data type.
> >> >> (If JSON, then you need to do a separate lookup to get the record's
> >> >> schema.)
> >> >>
> >> >> Then a single Java class that does the lookup and then handles the
> >> >> known data type comparators.
> >> >>
> >> >> Does this make sense?
> >> >> (Sorry, kinda was thinking this out as I typed the response. But it
> >> >> should work.)
> >> >>
> >> >> At least it would be a design approach I would take. YMMV
> >> >>
> >> >> Having said that, I expect someone to say it's a bad idea and that they
> >> >> have a better solution.
> >> >>
> >> >> HTH
> >> >>
> >> >> -Mike
> >> >>
> >> >> On Jun 27, 2013, at 5:13 PM, Kristoffer Sjögren <[email protected]>
> >> wrote:
> >> >>
> >> >>> I see your point. Everything is just bytes.
> >> >>>
> >> >>> However, the schema is known and every row is formatted according to
> >> >>> this schema, although some columns may not exist, that is, no value
> >> >>> exists for this property on this row.
> >> >>>
> >> >>> So if I'm able to apply these "typed comparators" to the right cell
> >> >>> values it may be possible? But I can't find a filter that targets
> >> >>> specific columns?
> >> >>>
> >> >>> Seems like all filters scan every column/qualifier and there is no way
> >> >>> of knowing what column is currently being evaluated?
> >> >>>
> >> >>>
> >> >>> On Thu, Jun 27, 2013 at 11:51 PM, Michael Segel
> >> >>> <[email protected]>wrote:
> >> >>>
> >> >>>> You have to remember that HBase doesn't enforce any sort of typing.
> >> >>>> That's why this can be difficult.
> >> >>>>
> >> >>>> You'd have to write a coprocessor to enforce a schema on a table.
> >> >>>> Even then YMMV if you're writing JSON structures to a column, because
> >> >>>> while the contents of the structures could be the same, the actual
> >> >>>> strings could differ.
> >> >>>>
> >> >>>> HTH
> >> >>>>
> >> >>>> -Mike
> >> >>>>
> >> >>>> On Jun 27, 2013, at 4:41 PM, Kristoffer Sjögren <[email protected]>
> >> >> wrote:
> >> >>>>
> >> >>>>> I realize standard comparators cannot solve this.
> >> >>>>>
> >> >>>>> However, I do know the type of each column, so writing custom list
> >> >>>>> comparators for boolean, char, byte, short, int, long, float and
> >> >>>>> double seems quite straightforward.
> >> >>>>>
> >> >>>>> Long arrays, for example, are stored as a byte array with 8 bytes per
> >> >>>>> item, so a comparator might look like this.
> >> >>>>>
> >> >>>>> public class LongsComparator extends WritableByteArrayComparable {
> >> >>>>>  // Value to search for (initialization omitted).
> >> >>>>>  private long val;
> >> >>>>>
> >> >>>>>  // Returns 0 if any item in the cell equals val, otherwise 1.
> >> >>>>>  public int compareTo(byte[] value, int offset, int length) {
> >> >>>>>      long[] values = BytesUtils.toLongs(value, offset, length);
> >> >>>>>      for (long longValue : values) {
> >> >>>>>          if (longValue == val) {
> >> >>>>>              return 0;
> >> >>>>>          }
> >> >>>>>      }
> >> >>>>>      return 1;
> >> >>>>>  }
> >> >>>>> }
> >> >>>>>
> >> >>>>> // Decodes length / 8 longs, starting at offset.
> >> >>>>> public static long[] toLongs(byte[] value, int offset, int length) {
> >> >>>>>  int num = length / 8;
> >> >>>>>  long[] values = new long[num];
> >> >>>>>  for (int i = 0; i < num; i++) {
> >> >>>>>      values[i] = getLong(value, offset + i * 8);
> >> >>>>>  }
> >> >>>>>  return values;
> >> >>>>> }
> >> >>>>>
> >> >>>>>
> >> >>>>> Strings are similar but would require charset and length for each
> >> >>>>> string.
> >> >>>>>
> >> >>>>> public class StringsComparator extends WritableByteArrayComparable {
> >> >>>>>  // Value to search for (initialization omitted).
> >> >>>>>  private String val;
> >> >>>>>
> >> >>>>>  // Returns 0 if any string in the cell equals val, otherwise 1.
> >> >>>>>  public int compareTo(byte[] value, int offset, int length) {
> >> >>>>>      String[] values = BytesUtils.toStrings(value, offset, length);
> >> >>>>>      for (String stringValue : values) {
> >> >>>>>          if (val.equals(stringValue)) {
> >> >>>>>              return 0;
> >> >>>>>          }
> >> >>>>>      }
> >> >>>>>      return 1;
> >> >>>>>  }
> >> >>>>> }
> >> >>>>>
> >> >>>>> // Decodes strings stored as a 4-byte length followed by the bytes.
> >> >>>>> public static String[] toStrings(byte[] value, int offset,
> >> >>>>>          int length) {
> >> >>>>>  ArrayList<String> values = new ArrayList<String>();
> >> >>>>>  ByteBuffer buffer = ByteBuffer.wrap(value, offset, length);
> >> >>>>>  while (buffer.hasRemaining()) {
> >> >>>>>      int size = buffer.getInt();
> >> >>>>>      byte[] bytes = new byte[size];
> >> >>>>>      buffer.get(bytes);
> >> >>>>>      // Platform default charset; an explicit charset is safer.
> >> >>>>>      values.add(new String(bytes));
> >> >>>>>  }
> >> >>>>>  return values.toArray(new String[values.size()]);
> >> >>>>> }
> >> >>>>>
> >> >>>>>
> >> >>>>> Am I on the right track, or maybe overlooking some implementation
> >> >>>>> details? Not really sure how to target each comparator to a specific
> >> >>>>> column value?
> >> >>>>>
> >> >>>>>
> >> >>>>> On Thu, Jun 27, 2013 at 9:21 PM, Michael Segel <
> >> >>>> [email protected]>wrote:
> >> >>>>>
> >> >>>>>> Not an easy task.
> >> >>>>>>
> >> >>>>>> You first need to determine how you want to store the data within a
> >> >>>>>> column and/or apply a type constraint to a column.
> >> >>>>>>
> >> >>>>>> Even if you use JSON records to store your data within a column, does
> >> >>>>>> an equality comparator exist? If not, you would have to write one.
> >> >>>>>> (I kinda think that one may already exist...)
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Jun 27, 2013, at 12:59 PM, Kristoffer Sjögren <
> [email protected]>
> >> >>>> wrote:
> >> >>>>>>
> >> >>>>>>> Hi
> >> >>>>>>>
> >> >>>>>>> Working with the standard filtering mechanism to scan rows that
> >> >>>>>>> have columns matching certain criteria.
> >> >>>>>>>
> >> >>>>>>> There are columns of numeric (integer and decimal) and string
> >> >>>>>>> types. These columns are single or multi-valued like "1", "2",
> >> >>>>>>> "1,2,3", "a", "b" or "a,b,c" - not sure what the separator would be
> >> >>>>>>> in the case of list types. Maybe none?
> >> >>>>>>>
> >> >>>>>>> I would like to compose the following queries to filter out rows
> >> >>>>>>> that do not match.
> >> >>>>>>>
> >> >>>>>>> - contains(String column, String value)
> >> >>>>>>> Single-valued column that String.contains() the provided value.
> >> >>>>>>>
> >> >>>>>>> - equal(String column, Object value)
> >> >>>>>>> Single-valued column that Object.equals() the provided value.
> >> >>>>>>> Value is either string or numeric type.
> >> >>>>>>>
> >> >>>>>>> - greaterThan(String column, java.lang.Number value)
> >> >>>>>>> Single-valued column that is > the provided numeric value.
> >> >>>>>>>
> >> >>>>>>> - in(String column, Object... values)
> >> >>>>>>> Multi-valued column that has values that Object.equals() all
> >> >>>>>>> provided values.
> >> >>>>>>> Values are of string or numeric type.
> >> >>>>>>>
> >> >>>>>>> How would I design a schema that can take advantage of the already
> >> >>>>>>> existing filters and comparators to accomplish this?
> >> >>>>>>>
> >> >>>>>>> Already looked at the string and binary comparators but fail to see
> >> >>>>>>> how to solve this in a clean way for multi-valued column values.
> >> >>>>>>>
> >> >>>>>>> I'm aware of custom filters but would like to avoid them if possible.
> >> >>>>>>>
> >> >>>>>>> Cheers,
> >> >>>>>>> -Kristoffer
> >> >>>>>>
> >> >>>>>>
> >> >>>>
> >> >>>>
> >> >>
> >> >>
> >>
>
