Now that I see the VIEW solution written down, I think I tried something similar (but *after* all the UPSERTs, not *during* them, as you showed here). I tried to create a view consisting of the dynamic field names because I hated the extra typing of specifying each field name and type after the table name every time. But it did not work and gave me errors. I was using 4.6, but I am anxious to try it again when I upgrade to 4.7 or 4.8. Thanks!
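For anyone stuck on a pre-4.8 release, the fallback James mentions below (catching TableAlreadyExistsException and issuing ALTER VIEW calls) might look roughly like this sketch; the view name and fields here are hypothetical:

-- First attempt; Phoenix throws TableAlreadyExistsException
-- if the view already exists
CREATE VIEW server_event (host VARCHAR, latency_ms INTEGER)
    AS SELECT * FROM json;

-- In the application's catch block for TableAlreadyExistsException,
-- add any newly seen fields instead:
ALTER VIEW server_event ADD IF NOT EXISTS latency_ms INTEGER;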
On Fri, Jun 3, 2016 at 11:28 AM, James Taylor <jamestay...@apache.org> wrote:

> Sure, Steve. Let's say you have a base table you define for your JSON
> object like this:
>
> -- Create a sequence that doles out IDs when a new JSON type
> -- is processed
> CREATE SEQUENCE JSON_ID_SEQ;
>
> -- Create the base table for JSON
> CREATE TABLE json (json_id BIGINT PRIMARY KEY,
>     created_date DATE,
>     raw_json VARCHAR)
>     APPEND_ONLY_SCHEMA = true,
>     AUTO_PARTITION_SEQ = 'JSON_ID_SEQ';
>
> Then, as each JSON object is processed, you'd pull the object name/type
> out of the message and each field (as you're doing now), and dynamically
> create a view with the fields as columns:
>
> CREATE VIEW IF NOT EXISTS <json_type_name> (
>     <field 1> <field 1 type>, ...)
>     AS SELECT * FROM json
>     APPEND_ONLY_SCHEMA = true;
>
> This would then assign a json_id when a new json_type_name is
> encountered, and otherwise add any new columns to the json_type_name
> view. By declaring it as APPEND_ONLY_SCHEMA, we can optimize the RPC
> traffic: no RPCs are made if all field names are already known by the
> metadata cached on the client, and otherwise a single RPC creates or
> alters the view with any new field names.
>
> The nice thing about this, too, is that the metadata is introspectable
> through the standard JDBC metadata APIs - you can look up the fields
> and their types based on the <json_type_name>.
>
> Note that APPEND_ONLY_SCHEMA and AUTO_PARTITION_SEQ are 4.8 features,
> but you can do the same pre-4.8 with a few more RPCs by catching the
> TableAlreadyExistsException and adding each field through ALTER VIEW
> calls.
>
> This is how we've modeled time-series data in support of Argus [1] -
> not as JSON in that case, but as tags and a metric value.
>
> HTH. Thanks,
> James
>
> [1] https://github.com/SalesforceEng/Argus
>
> On Fri, Jun 3, 2016 at 8:41 AM, Steve Terrell <sterr...@oculus360.us>
> wrote:
>
>> James, I don't understand how to use IF NOT EXISTS to track the
>> dynamic columns. Could you elaborate? Thanks!
>>
>> On Fri, Jun 3, 2016 at 10:36 AM, James Taylor <jamestay...@apache.org>
>> wrote:
>>
>>> That's pretty slick, Steve. Another variant along the same lines
>>> would be to create/alter a view over a base table where you use IF
>>> NOT EXISTS. In this way, you let Phoenix track the dynamic columns
>>> for you. We have a new feature in 4.8 for declaring a table as an
>>> APPEND_ONLY_SCHEMA, and we'll minimize the RPCs needed to ensure the
>>> metadata is up to date. It works well for dynamically tracking
>>> schema as data is processed, a pretty common pattern.
>>>
>>> Thanks,
>>> James
>>>
>>> On Friday, June 3, 2016, Steve Terrell <sterr...@oculus360.us> wrote:
>>>
>>>> I have a similar situation. I have records with varying fields that
>>>> I wanted to access individually and also as a group.
>>>>
>>>> My actual records are JSON objects, so they look like this:
>>>> {"field1": value1, "field2": value2, …}
>>>> To make matters harder, the fields are also of varying types: ints,
>>>> strings, booleans, floats.
>>>>
>>>> What I ended up doing, and it has worked really well, is to make
>>>> each field in the JSON record a dynamic field in a Phoenix table.
>>>> And, to preserve the integrity of the original record, I have one
>>>> static string field into which the entire original JSON goes.
>>>>
>>>> Because of the dynamic fields, I have to execute an UPSERT for each
>>>> row instead of some kind of batch loading.
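To make the pattern Steve describes concrete: in Phoenix, a dynamic column is declared inline, with its type, in the UPSERT's column list. A minimal sketch (the table, key, and field names here are hypothetical):

-- record_id and raw_json are static columns; the rest are dynamic
UPSERT INTO json_events (record_id, raw_json,
    user_name VARCHAR, score DOUBLE, active BOOLEAN)
VALUES (12345, '{"user_name":"fred","score":1.5,"active":true}',
    'fred', 1.5, true);

A later query must redeclare whichever dynamic columns it references, in parentheses after the table name:

SELECT user_name, score
FROM json_events (user_name VARCHAR, score DOUBLE, active BOOLEAN)
WHERE active = true;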
>>>> But I routinely load millions of rows like this, sometimes with 20+
>>>> fields, and it has never failed on an upsert and runs fast enough.
>>>>
>>>> A few of the fields are actually always present in each record, so
>>>> I use those for my primary keys.
>>>>
>>>> The end result is great! I can query by any field, quickly pull
>>>> entire records via the key fields, and any JSON object can be
>>>> loaded.
>>>>
>>>> In my code that translates the JSON into UPSERT statements, I also
>>>> keep a running tab of field names, their types, and their number of
>>>> occurrences. Then, at the end of my millions of UPSERTs, I populate
>>>> a simple table that just lists all the fields/types/counts so that
>>>> there is no mystery about what fields are available when I want to
>>>> query the big table. I did this because I don't always know what
>>>> field names are in the JSON, and I don't know how to get a list of
>>>> dynamic field names from Phoenix SQL.
>>>>
>>>> Hope this helps,
>>>> Steve
>>>>
>>>> On Thu, Jun 2, 2016 at 9:28 PM, Stephen Wilcoxon
>>>> <wilco...@gmail.com> wrote:
>>>>
>>>>> The simple solution is to transform the multi-valued field into a
>>>>> separate linked table with multiple rows per "id". If this is what
>>>>> you mean by "exploding the data", why do you not want to do so?
>>>>>
>>>>> On Thu, Jun 2, 2016 at 7:11 PM, Rahul Jain <rja...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Folks,
>>>>>>
>>>>>> I asked this question earlier on the wrong list; I'm posting it
>>>>>> to the Apache Phoenix user group on the advice of James Taylor.
>>>>>>
>>>>>> We have a use case of migrating from a legacy database to
>>>>>> HBase/Phoenix. I'd like to hear your thoughts on how to manage
>>>>>> the following multi-valued field:
>>>>>>
>>>>>> Field name: namesList
>>>>>>
>>>>>> Contains: a comma-separated list of strings,
>>>>>> e.g. "Jack, Richard, Fred"
>>>>>>
>>>>>> The field does not have an inherent limit on the number of
>>>>>> strings; however, for a successful migration we need to support
>>>>>> up to 100 strings in the field.
>>>>>>
>>>>>> This field is currently indexed in our legacy DB technology,
>>>>>> where the user is able to query by any of the contained strings,
>>>>>> e.g.:
>>>>>>
>>>>>> select * from myTable where namesList LIKE '%Fred%';
>>>>>>
>>>>>> Is there any way Phoenix supports generating multiple index
>>>>>> values from a single column in a row? We are looking for a method
>>>>>> where we don't explicitly specify individual offsets for indexing
>>>>>> (namesList[0], namesList[1], etc.); rather, something like a
>>>>>> custom functional index would generate the index entries as a
>>>>>> list, and each entry in the list would be persisted individually.
>>>>>>
>>>>>> Effectively, our goal is to avoid full table scans when running
>>>>>> queries that match values from multi-valued fields such as the
>>>>>> example above. We do want to avoid exploding the data wherever
>>>>>> possible.
>>>>>>
>>>>>> Any comments / suggestions are welcome.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Rahul
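A minimal sketch of the linked-table approach Stephen suggests, assuming the main table keeps an "id" primary key (all table, column, and index names here are hypothetical):

-- One row per (id, name) pair; each contained string becomes
-- part of the primary key of the link table
CREATE TABLE mytable_names (
    id BIGINT NOT NULL,
    name VARCHAR NOT NULL
    CONSTRAINT pk PRIMARY KEY (id, name));

-- A secondary index so lookups by name are range scans,
-- not full table scans
CREATE INDEX mytable_names_idx ON mytable_names (name);

-- Replaces: SELECT * FROM myTable WHERE namesList LIKE '%Fred%'
SELECT t.* FROM myTable t
JOIN mytable_names n ON n.id = t.id
WHERE n.name = 'Fred';

This does multiply the row count by the number of names per record (up to 100 here), which is the "exploding" trade-off Rahul wants to avoid, but it turns the '%Fred%' scan into an indexed lookup.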