Now that I see the VIEW solution written down, I think I tried something similar (but *after* all the UPSERTs, not *during* them, as you showed here). I tried to create a view consisting of the dynamic field names because I hated the extra typing of specifying each field name and type after the table name every time. But it did not work and gave me errors. I was using 4.6, but I am anxious to try it again when I upgrade to 4.7 or 4.8. Thanks!
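For anyone stuck on a pre-4.8 release, the fallback James mentions below (catching TableAlreadyExistsException and issuing ALTER VIEW calls) might look roughly like this sketch; the view name and fields here are hypothetical:

-- First attempt; Phoenix throws TableAlreadyExistsException
-- if the view already exists
CREATE VIEW server_event (host VARCHAR, latency_ms INTEGER)
    AS SELECT * FROM json;

-- In the application's catch block for TableAlreadyExistsException,
-- add any newly seen fields instead:
ALTER VIEW server_event ADD IF NOT EXISTS latency_ms INTEGER;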
On Fri, Jun 3, 2016 at 11:28 AM, James Taylor <jamestay...@apache.org> wrote:

> Sure, Steve. Let's say you have a base table you define for your JSON
> object like this:
>
> -- Create a sequence that doles out IDs when a new JSON type
> -- is processed
> CREATE SEQUENCE JSON_ID_SEQ;
>
> -- Create the base table for JSON
> CREATE TABLE json (json_id BIGINT PRIMARY KEY,
>     created_date DATE,
>     raw_json VARCHAR)
>     APPEND_ONLY_SCHEMA = true,
>     AUTO_PARTITION_SEQ = 'JSON_ID_SEQ';
>
> Then, as each JSON object is processed, you'd pull the object name/type
> out of the message and each field (as you're doing now), and dynamically
> create a view with the fields as columns:
>
> CREATE VIEW IF NOT EXISTS <json_type_name> (
>     <field 1> <field 1 type>, ...)
>     AS SELECT * FROM json
>     APPEND_ONLY_SCHEMA = true;
>
> This would then assign a json_id when a new json_type_name is
> encountered, and otherwise add any new columns to the json_type_name
> view. By declaring it as APPEND_ONLY_SCHEMA, we can optimize the RPC
> traffic: no RPCs are made if all field names are already known by the
> metadata cached on the client, and otherwise a single RPC creates or
> alters the view with any new field names.
>
> The nice thing about this, too, is that the metadata is introspectable
> through the standard JDBC metadata APIs - you can look up the fields
> and their types based on the <json_type_name>.
>
> Note that APPEND_ONLY_SCHEMA and AUTO_PARTITION_SEQ are 4.8 features,
> but you can do the same pre-4.8 with a few more RPCs by catching the
> TableAlreadyExistsException and adding each field through ALTER VIEW
> calls.
>
> This is how we've modeled time-series data in support of Argus [1] -
> not as JSON in that case, but as tags and a metric value.
>
> HTH. Thanks,
> James
>
> [1] https://github.com/SalesforceEng/Argus
>
> On Fri, Jun 3, 2016 at 8:41 AM, Steve Terrell <sterr...@oculus360.us>
> wrote:
>
>> James, I don't understand how to use IF NOT EXISTS to track the
>> dynamic columns. Could you elaborate? Thanks!
>>
>> On Fri, Jun 3, 2016 at 10:36 AM, James Taylor <jamestay...@apache.org>
>> wrote:
>>
>>> That's pretty slick, Steve. Another variant along the same lines
>>> would be to create/alter a view over a base table where you use IF
>>> NOT EXISTS. In this way, you let Phoenix track the dynamic columns
>>> for you. We have a new feature in 4.8 for declaring a table as an
>>> APPEND_ONLY_SCHEMA, and we'll minimize the RPCs needed to ensure the
>>> metadata is up to date. It works well for dynamically tracking
>>> schema as data is processed, a pretty common pattern.
>>>
>>> Thanks,
>>> James
>>>
>>> On Friday, June 3, 2016, Steve Terrell <sterr...@oculus360.us> wrote:
>>>
>>>> I have a similar situation. I have records with varying fields that
>>>> I wanted to access individually and also as a group.
>>>>
>>>> My actual records are JSON objects, so they look like this:
>>>> {"field1": value1, "field2": value2, …}
>>>> To make matters harder, the fields are also of varying types: ints,
>>>> strings, booleans, floats.
>>>>
>>>> What I ended up doing, and it has worked really well, is to make
>>>> each field in the JSON record a dynamic field in a Phoenix table.
>>>> And, to preserve the integrity of the original record, I have one
>>>> static string field into which the entire original JSON goes.
>>>>
>>>> Because of the dynamic fields, I have to execute an UPSERT for each
>>>> row instead of some kind of batch loading.
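To make the pattern Steve describes concrete: in Phoenix, a dynamic column is declared inline, with its type, in the UPSERT's column list. A minimal sketch (the table, key, and field names here are hypothetical):

-- record_id and raw_json are static columns; the rest are dynamic
UPSERT INTO json_events (record_id, raw_json,
    user_name VARCHAR, score DOUBLE, active BOOLEAN)
VALUES (12345, '{"user_name":"fred","score":1.5,"active":true}',
    'fred', 1.5, true);

A later query must redeclare whichever dynamic columns it references, in parentheses after the table name:

SELECT user_name, score
FROM json_events (user_name VARCHAR, score DOUBLE, active BOOLEAN)
WHERE active = true;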
>>>> But I routinely load millions of rows like this, sometimes with 20+
>>>> fields, and it has never failed on an upsert and runs fast enough.
>>>>
>>>> A few of the fields are actually always present in each record, so
>>>> I use those for my primary keys.
>>>>
>>>> The end result is great! I can query by any field, quickly pull
>>>> entire records via the key fields, and any JSON object can be
>>>> loaded.
>>>>
>>>> In my code that translates the JSON into UPSERT statements, I also
>>>> keep a running tab of field names, their types, and their number of
>>>> occurrences. Then, at the end of my millions of UPSERTs, I populate
>>>> a simple table that just lists all the fields/types/counts so that
>>>> there is no mystery about what fields are available when I want to
>>>> query the big table. I did this because I don't always know what
>>>> field names are in the JSON, and I don't know how to get a list of
>>>> dynamic field names from Phoenix SQL.
>>>>
>>>> Hope this helps,
>>>> Steve
>>>>
>>>> On Thu, Jun 2, 2016 at 9:28 PM, Stephen Wilcoxon
>>>> <wilco...@gmail.com> wrote:
>>>>
>>>>> The simple solution is to transform the multi-valued field into a
>>>>> separate linked table with multiple rows per "id". If this is what
>>>>> you mean by "exploding the data", why do you not want to do so?
>>>>>
>>>>> On Thu, Jun 2, 2016 at 7:11 PM, Rahul Jain <rja...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Folks,
>>>>>>
>>>>>> I asked this question earlier on the wrong list; I'm posting it
>>>>>> to the Apache Phoenix user group on the advice of James Taylor.
>>>>>>
>>>>>> We have a use case of migrating from a legacy database to
>>>>>> HBase/Phoenix. I'd like to hear your thoughts on how to manage
>>>>>> the following multi-valued field:
>>>>>>
>>>>>> Field name: namesList
>>>>>>
>>>>>> Contains: a comma-separated list of strings,
>>>>>> e.g. "Jack, Richard, Fred"
>>>>>>
>>>>>> The field does not have an inherent limit on the number of
>>>>>> strings; however, for a successful migration we need to support
>>>>>> up to 100 strings in the field.
>>>>>>
>>>>>> This field is currently indexed in our legacy DB technology,
>>>>>> where the user is able to query by any of the contained strings,
>>>>>> e.g.:
>>>>>>
>>>>>> select * from myTable where namesList LIKE '%Fred%';
>>>>>>
>>>>>> Is there any way Phoenix supports generating multiple index
>>>>>> values from a single column in a row? We are looking for a method
>>>>>> where we don't explicitly specify individual offsets for indexing
>>>>>> (namesList[0], namesList[1], etc.); rather, something like a
>>>>>> custom functional index would generate the index entries as a
>>>>>> list, and each entry in the list would be persisted individually.
>>>>>>
>>>>>> Effectively, our goal is to avoid full table scans when running
>>>>>> queries that match values from multi-valued fields such as the
>>>>>> example above. We do want to avoid exploding the data wherever
>>>>>> possible.
>>>>>>
>>>>>> Any comments / suggestions are welcome.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Rahul
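A minimal sketch of the linked-table approach Stephen suggests, assuming the main table keeps an "id" primary key (all table, column, and index names here are hypothetical):

-- One row per (id, name) pair; each contained string becomes
-- part of the primary key of the link table
CREATE TABLE mytable_names (
    id BIGINT NOT NULL,
    name VARCHAR NOT NULL
    CONSTRAINT pk PRIMARY KEY (id, name));

-- A secondary index so lookups by name are range scans,
-- not full table scans
CREATE INDEX mytable_names_idx ON mytable_names (name);

-- Replaces: SELECT * FROM myTable WHERE namesList LIKE '%Fred%'
SELECT t.* FROM myTable t
JOIN mytable_names n ON n.id = t.id
WHERE n.name = 'Fred';

This does multiply the row count by the number of names per record (up to 100 here), which is the "exploding" trade-off Rahul wants to avoid, but it turns the '%Fred%' scan into an indexed lookup.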