Re: Schema design for filters

Otis Gospodnetic Fri, 28 Jun 2013 11:59:14 -0700

Hi,

I see.  By the way, isn't HBase overkill for < 1M rows?
Note that Lucene is schemaless and both Solr and Elasticsearch can
detect field types, so in a way they are schemaless, too.


Otis
--
Performance Monitoring -- http://sematext.com/spm



On Fri, Jun 28, 2013 at 2:53 PM, Kristoffer Sjögren <[email protected]> wrote:
> @Otis
>
> HBase is a natural fit for my use case because it's schemaless. I'm building a
> configuration management system and there is no need for advanced
> filtering/querying capabilities, just basic predicate logic and pagination
> that scales to < 1 million rows with reasonable performance.
>
> Thanks for the tip!
>
>
> On Fri, Jun 28, 2013 at 8:34 PM, Otis Gospodnetic <
> [email protected]> wrote:
>
>> Kristoffer,
>>
>> You could also consider using something other than HBase, something
>> that supports "secondary indices", like anything that is Lucene based
>> - Solr and ElasticSearch for example.  We recently compared how we
>> aggregate data in HBase (see my signature) and how we would do it if
>> we were to use Solr (or ElasticSearch), and so far things look better
>> in Solr for our use case.  And our use case involves a lot of
>> filtering, slicing and dicing..... something to consider...
>>
>> Otis
>> --
>> Solr & ElasticSearch Support -- http://sematext.com/
>> Performance Monitoring -- http://sematext.com/spm
>>
>>
>>
>> On Fri, Jun 28, 2013 at 5:24 AM, Kristoffer Sjögren <[email protected]>
>> wrote:
>> > Interesting. I'm actually building something similar.
>> >
>> > A full-blown SQL implementation is a bit overkill for my particular use case
>> > and the query API is the final piece of the puzzle. But I'll definitely have
>> > a look for some inspiration.
>> >
>> > Thanks!
>> >
>> >
>> >
>> > On Fri, Jun 28, 2013 at 3:55 AM, James Taylor <[email protected]
>> >wrote:
>> >
>> >> Hi Kristoffer,
>> >> Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)?
>> >> You could model your schema much like an O/R mapper and issue SQL queries
>> >> through Phoenix for your filtering.
>> >>
>> >> James
>> >> @JamesPlusPlus
>> >> http://phoenix-hbase.blogspot.com
>> >>
>> >> On Jun 27, 2013, at 4:39 PM, "Kristoffer Sjögren" <[email protected]>
>> >> wrote:
>> >>
>> >> > Thanks for your help Mike. Much appreciated.
>> >> >
>> >> > I don't store rows/columns in JSON format. The schema is exactly that of a
>> >> > specific Java class, where the rowkey is a unique object identifier with
>> >> > the class type encoded into it. Columns are the field names of the class
>> >> > and the values are those of the object instance.
>> >> >
>> >> > I did think about coprocessors, but the schema is discovered at runtime
>> >> > and I can't hard-code it.
>> >> >
>> >> > However, I still believe that filters might work. I had a look
>> >> > at SingleColumnValueFilter, and this filter is able to target specific
>> >> > column qualifiers with specific WritableByteArrayComparables.
>> >> >
>> >> > But list comparators are still missing... So I guess the only way is to
>> >> > write these comparators?
>> >> >
>> >> > Do you follow my reasoning? Will it work?
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Fri, Jun 28, 2013 at 12:58 AM, Michael Segel
>> >> > <[email protected]>wrote:
>> >> >
>> >> >> Ok...
>> >> >>
>> >> >> If you want to do type checking and schema enforcement...
>> >> >>
>> >> >> You will need to do this as a coprocessor.
>> >> >>
>> >> >> The quick and dirty way (not recommended) would be to hard-code the
>> >> >> schema into the coprocessor code.
>> >> >>
>> >> >> A better way... at startup, use ZK to manage the set of known table
>> >> >> schemas, each of which would be a map of column qualifier to data type.
>> >> >> (If JSON, then you need to do a separate lookup to get the record's schema.)
>> >> >>
>> >> >> Then write a single Java class that does the lookup and then handles the
>> >> >> known data type comparators.
>> >> >>
>> >> >> Does this make sense?
>> >> >> (Sorry, I was kinda thinking this out as I typed the response. But it should
>> >> >> work.)
>> >> >>
>> >> >> At least it would be a design approach I would take. YMMV
>> >> >>
>> >> >> Having said that, I expect someone to say it's a bad idea and that they
>> >> >> have a better solution.
>> >> >>
>> >> >> HTH
>> >> >>
>> >> >> -Mike
>> >> >>
>> >> >> On Jun 27, 2013, at 5:13 PM, Kristoffer Sjögren <[email protected]>
>> >> wrote:
>> >> >>
>> >> >>> I see your point. Everything is just bytes.
>> >> >>>
>> >> >>> However, the schema is known and every row is formatted according to this
>> >> >>> schema, although some columns may not exist, that is, no value exists for
>> >> >>> this property on this row.
>> >> >>>
>> >> >>> So if I'm able to apply these "typed comparators" to the right cell values
>> >> >>> it may be possible? But I can't find a filter that targets specific columns.
>> >> >>>
>> >> >>> It seems like all filters scan every column/qualifier and there is no way of
>> >> >>> knowing which column is currently being evaluated?
>> >> >>>
>> >> >>>
>> >> >>> On Thu, Jun 27, 2013 at 11:51 PM, Michael Segel
>> >> >>> <[email protected]>wrote:
>> >> >>>
>> >> >>>> You have to remember that HBase doesn't enforce any sort of typing.
>> >> >>>> That's why this can be difficult.
>> >> >>>>
>> >> >>>> You'd have to write a coprocessor to enforce a schema on a table.
>> >> >>>> Even then YMMV if you're writing JSON structures to a column, because while
>> >> >>>> the contents of the structures could be the same, the actual strings could
>> >> >>>> differ.
>> >> >>>>
>> >> >>>> HTH
>> >> >>>>
>> >> >>>> -Mike
>> >> >>>>
>> >> >>>> On Jun 27, 2013, at 4:41 PM, Kristoffer Sjögren <[email protected]>
>> >> >> wrote:
>> >> >>>>
>> >> >>>>> I realize standard comparators cannot solve this.
>> >> >>>>>
>> >> >>>>> However, I do know the type of each column, so writing custom list
>> >> >>>>> comparators for boolean, char, byte, short, int, long, float, double seems
>> >> >>>>> quite straightforward.
>> >> >>>>>
>> >> >>>>> Long arrays, for example, are stored as a byte array with 8 bytes per item,
>> >> >>>>> so a comparator might look like this.
>> >> >>>>>
>> >> >>>>> public class LongsComparator extends WritableByteArrayComparable {
>> >> >>>>>  // The long to search for; assumed to be set by a constructor (not shown).
>> >> >>>>>  private long val;
>> >> >>>>>
>> >> >>>>>  public int compareTo(byte[] value, int offset, int length) {
>> >> >>>>>      long[] values = BytesUtils.toLongs(value, offset, length);
>> >> >>>>>      for (long longValue : values) {
>> >> >>>>>          if (longValue == val) {
>> >> >>>>>              return 0; // match: the list contains val
>> >> >>>>>          }
>> >> >>>>>      }
>> >> >>>>>      return 1; // no match
>> >> >>>>>  }
>> >> >>>>> }
>> >> >>>>>
>> >> >>>>> public static long[] toLongs(byte[] value, int offset, int length) {
>> >> >>>>>  // length is already the number of bytes to read; do not subtract offset.
>> >> >>>>>  int num = length / 8;
>> >> >>>>>  long[] values = new long[num];
>> >> >>>>>  for (int i = 0; i < num; i++) {
>> >> >>>>>      // Read each 8-byte item relative to offset.
>> >> >>>>>      values[i] = getLong(value, offset + i * 8);
>> >> >>>>>  }
>> >> >>>>>  return values;
>> >> >>>>> }
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> Strings are similar but would require a charset and a length for each string.
>> >> >>>>>
>> >> >>>>> public class StringsComparator extends WritableByteArrayComparable {
>> >> >>>>>  // The string to search for; assumed to be set by a constructor (not shown).
>> >> >>>>>  private String val;
>> >> >>>>>
>> >> >>>>>  public int compareTo(byte[] value, int offset, int length) {
>> >> >>>>>      String[] values = BytesUtils.toStrings(value, offset, length);
>> >> >>>>>      for (String stringValue : values) {
>> >> >>>>>          if (val.equals(stringValue)) {
>> >> >>>>>              return 0; // match: the list contains val
>> >> >>>>>          }
>> >> >>>>>      }
>> >> >>>>>      return 1; // no match
>> >> >>>>>  }
>> >> >>>>> }
>> >> >>>>>
>> >> >>>>> public static String[] toStrings(byte[] value, int offset, int length) {
>> >> >>>>>  ArrayList<String> values = new ArrayList<String>();
>> >> >>>>>  int idx = 0;
>> >> >>>>>  ByteBuffer buffer = ByteBuffer.wrap(value, offset, length);
>> >> >>>>>  // Each item is a 4-byte length followed by that many bytes.
>> >> >>>>>  while (idx < length) {
>> >> >>>>>      int size = buffer.getInt();
>> >> >>>>>      byte[] bytes = new byte[size];
>> >> >>>>>      buffer.get(bytes);
>> >> >>>>>      // Decode with an explicit charset (UTF-8 is an assumption here).
>> >> >>>>>      values.add(new String(bytes, Charset.forName("UTF-8")));
>> >> >>>>>      idx += 4 + size;
>> >> >>>>>  }
>> >> >>>>>  return values.toArray(new String[values.size()]);
>> >> >>>>> }
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> Am I on the right track, or am I overlooking some implementation details?
>> >> >>>>> I'm not really sure how to target each comparator at a specific column value.
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> On Thu, Jun 27, 2013 at 9:21 PM, Michael Segel <
>> >> >>>> [email protected]>wrote:
>> >> >>>>>
>> >> >>>>>> Not an easy task.
>> >> >>>>>>
>> >> >>>>>> You first need to determine how you want to store the data within a column
>> >> >>>>>> and/or apply a type constraint to a column.
>> >> >>>>>>
>> >> >>>>>> Even if you use JSON records to store your data within a column, does an
>> >> >>>>>> equality comparator exist? If not, you would have to write one.
>> >> >>>>>> (I kinda think that one may already exist...)
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> On Jun 27, 2013, at 12:59 PM, Kristoffer Sjögren <
>> [email protected]>
>> >> >>>> wrote:
>> >> >>>>>>
>> >> >>>>>>> Hi
>> >> >>>>>>>
>> >> >>>>>>> I'm working with the standard filtering mechanism to scan rows that have
>> >> >>>>>>> columns matching certain criteria.
>> >> >>>>>>>
>> >> >>>>>>> There are columns of numeric (integer and decimal) and string types. These
>> >> >>>>>>> columns are single or multi-valued like "1", "2", "1,2,3", "a", "b" or
>> >> >>>>>>> "a,b,c" - not sure what the separator would be in the case of list types.
>> >> >>>>>>> Maybe none?
>> >> >>>>>>>
>> >> >>>>>>> I would like to compose the following queries to filter out rows that do
>> >> >>>>>>> not match.
>> >> >>>>>>>
>> >> >>>>>>> - contains(String column, String value)
>> >> >>>>>>> Single-valued column that String.contains() the provided value.
>> >> >>>>>>>
>> >> >>>>>>> - equal(String column, Object value)
>> >> >>>>>>> Single-valued column that Object.equals() the provided value.
>> >> >>>>>>> Value is either string or numeric type.
>> >> >>>>>>>
>> >> >>>>>>> - greaterThan(String column, java.lang.Number value)
>> >> >>>>>>> Single-valued column that is > the provided numeric value.
>> >> >>>>>>>
>> >> >>>>>>> - in(String column, Object... values)
>> >> >>>>>>> Multi-valued column whose values Object.equals() all provided values.
>> >> >>>>>>> Values are of string or numeric type.
>> >> >>>>>>>
>> >> >>>>>>> How would I design a schema that can take advantage of the already existing
>> >> >>>>>>> filters and comparators to accomplish this?
>> >> >>>>>>>
>> >> >>>>>>> I already looked at the string and binary comparators but fail to see how to
>> >> >>>>>>> solve this in a clean way for multi-valued column values.
>> >> >>>>>>>
>> >> >>>>>>> I'm aware of custom filters but would like to avoid them if possible.
>> >> >>>>>>>
>> >> >>>>>>> Cheers,
>> >> >>>>>>> -Kristoffer
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>
>> >> >>>>
>> >> >>
>> >> >>
>> >>
>>
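
For reference, the list layouts discussed in the thread (8 bytes per item for longs, a 4-byte length prefix per string, no separator) can be tried out without HBase. The following is only a minimal sketch under those assumptions: the class ListValueCodec and all of its method names are hypothetical, UTF-8 is an assumed charset, and nothing here is part of the HBase API. A custom comparator or filter would still have to wrap checks like these.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an HBase class): encodes list cells and tests membership. */
final class ListValueCodec {

    private ListValueCodec() {
    }

    /** Encodes longs as consecutive 8-byte big-endian values, so no separator is needed. */
    static byte[] encodeLongs(long... values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Long.BYTES);
        for (long v : values) {
            buffer.putLong(v);
        }
        return buffer.array();
    }

    /** Encodes each string as a 4-byte length followed by its UTF-8 bytes (assumed charset). */
    static byte[] encodeStrings(String... values) {
        List<byte[]> encoded = new ArrayList<>();
        int total = 0;
        for (String s : values) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            encoded.add(bytes);
            total += Integer.BYTES + bytes.length;
        }
        ByteBuffer buffer = ByteBuffer.allocate(total);
        for (byte[] bytes : encoded) {
            buffer.putInt(bytes.length).put(bytes);
        }
        return buffer.array();
    }

    /** Returns true if the slice [offset, offset + length) of cell holds the wanted long. */
    static boolean containsLong(byte[] cell, int offset, int length, long wanted) {
        ByteBuffer buffer = ByteBuffer.wrap(cell, offset, length);
        while (buffer.remaining() >= Long.BYTES) {
            if (buffer.getLong() == wanted) {
                return true;
            }
        }
        return false;
    }

    /** Returns true if the slice [offset, offset + length) of cell holds the wanted string. */
    static boolean containsString(byte[] cell, int offset, int length, String wanted) {
        ByteBuffer buffer = ByteBuffer.wrap(cell, offset, length);
        while (buffer.remaining() >= Integer.BYTES) {
            int size = buffer.getInt();
            if (size < 0 || size > buffer.remaining()) {
                throw new IllegalArgumentException("Corrupt length prefix: " + size);
            }
            byte[] bytes = new byte[size];
            buffer.get(bytes);
            if (wanted.equals(new String(bytes, StandardCharsets.UTF_8))) {
                return true;
            }
        }
        return false;
    }
}
```

An in(column, values...) query could then call containsLong or containsString once per requested value and keep the row only if every call returns true. Both checks read relative to offset and stop at offset + length, so they can be applied to a cell value that sits inside a larger backing array.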
