Re: [GENERAL] full text search on hstore or json with materialized view?

2017-04-20 Thread Rj Ewing
On Wed, Apr 19, 2017 at 9:55 PM, George Neuner <gneun...@comcast.net> wrote:

> On Wed, 19 Apr 2017 16:28:13 -0700, Rj Ewing <ewing...@gmail.com>
> wrote:
>
> >Okay, messing around a bit more with the secondary k,v table, it seems like
> >this could be a good solution.
> >
> >I created a keys table to hold the 63 key values, then I dropped and
> >recreated the secondary table, using a FK referencing the keys table. I'm
> >not really sure why, but a basic full text query on 44 million rows is
> >taking approx. 20ms.
>
> That pretty much confirms your statistics were bad ... using the FK
> table or not wouldn't make any difference to the planner.
>
> But if you are getting 20ms on 44M rows, then one or more of the
> following must be true:
>  - your text values must be very short
>  - your FTS queries must be very simple
>  - you aren't reading the results
>

Text is on average very short: 1-3 words per value. It was a count(*)
query with only a single condition.


> For comparison:
>
> I have an application that does FTS on a table of NAICS descriptions
> indexed using tsvectors with an average length of 4.8 tokens per.  It
> does a 3-part All/Any/None term search.
>
> On my 24-core 2.4GHz server, a single threaded query with the whole
> table and index in memory takes ~1 ms to search 20K rows using a
> realistic tsquery:  e.g.,
>
>   SELECT code,description
> FROM naics
> WHERE ts_index @@ to_tsquery('packaged & software & !(wholesale)')
>
> [getting the data out of Postgresql takes longer than the search]
>
>
> GIN indexes don't exactly scale linearly, and tsquery is, in general,
> much more dependent on the lengths of the tsvectors than on the
> complexity of the match, but with 44M rows of similarly distributed
> data, a similarly realistic query would be expected to take well over
> 1 second.
>
>
> My example is genuine but too small to bother parallelizing [mentioned
> in a previous message].  Since you are *testing* with 1M records (that
> create 44M k:v shards), I am assuming you will need to deal with much
> more than that in deployment.  And if you think you need FTS, then you
> must be expecting more than simple word matches [as below], else you
> might do something simpler like
>
>   SELECT ...
> WHERE val ILIKE 
>
>

The 1M records would most likely be the max. On average the tables would
have more like 100,000 records each.

I also realized that I inserted all k:v pairs into the secondary k:v table.
In reality, I would only index strings, which would eliminate approx 25% of
the k:v pairs, bringing that number down closer to 30M.

From my understanding, *ILIKE* doesn't do any text normalization, which is
something we would like to have.
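
For anyone weighing the two approaches, a quick illustration (not from the
thread; the strings are made up) of what that normalization buys:

    -- FTS stems tokens, so 'runs' matches 'Running':
    SELECT to_tsvector('english', 'Running shoes') @@ to_tsquery('english', 'runs');  -- true
    -- ILIKE is only case-insensitive substring matching, with no stemming:
    SELECT 'Running shoes' ILIKE '%runs%';                                            -- false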

>
>
>
> >my table structure is:
> >
> > Table "public.samples_lg_txt"
> > Column |   Type   | Modifiers
> >+--+---
> > id | integer  |
> > key| integer  |
> > val| text |
> > tsv| tsvector |
> >Indexes:
> >"idx_tsv_samples_lg_text" gin (tsv)
> >Foreign-key constraints:
> >"samples_lg_txt_id_fkey" FOREIGN KEY (id) REFERENCES samples_lg(id)
> >ON DELETE CASCADE
> >"samples_lg_txt_key_fkey" FOREIGN KEY (key) REFERENCES keys(id)
> >
> >
> >how would I write an AND query that filtered on 2 separate keys from the
> >samples_lg_txt table?
> >
> >something like:
> >
> >SELECT COUNT(*) FROM samples WHERE id IN ( SELECT DISTINCT(s.id) FROM
> >samples_lg_txt s JOIN keys k ON s.key = k.id WHERE (*name = 'key1' AND
> tsv
> >@@ to_tsquery('value1')) AND (name = 'key2' AND tsv @@
> >to_tsquery('value2'))*;
>
> You're overthinking it
>
>   SELECT count(distinct s.id)
> FROM  samples_lg_txt AS s
> JOIN  keys AS k ON k.id = s.key
> WHERE (k.name = 'key1' AND s.tsv @@ to_tsquery('value1'))
>    OR (k.name = 'key2' AND s.tsv @@ to_tsquery('value2'))
>
But that is an OR query; I'm trying to do an AND query.

> There's actually no need to join if you can use the key name instead
> of an integer id.  You can FK on strings, so you can still maintain an
> identity table of keys.  E.g.,
>
> > id | integer   |
> > key| varchar(32) | FK key(name) ...
> > val| text  |
> > tsv| tsvector  |
>
>
> Then the query could be just
>
>   SELECT count(distinct id)
> FROM  samples_lg_txt
> WHERE (key = 'key1' AND tsv @@ to_tsquery('value1'))
>    OR (key = 'key2' AND tsv @@ to_tsquery('value2'))
>
>
> Just a reminder [it's late here 8-)]: FK columns contai
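
For the archive, one hedged way to express the AND-across-keys requirement
against the varchar-keyed layout sketched above (illustrative only;
'key1'/'value1' are the thread's placeholders): match either per-key
condition row by row, then require that both keys matched for the same id.

      SELECT count(*)
        FROM (SELECT id
                FROM samples_lg_txt
               WHERE (key = 'key1' AND tsv @@ to_tsquery('value1'))
                  OR (key = 'key2' AND tsv @@ to_tsquery('value2'))
               GROUP BY id
              HAVING count(DISTINCT key) = 2   -- both keys matched for this id
             ) AS matched;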

Re: [GENERAL] full text search on hstore or json with materialized view?

2017-04-20 Thread Rj Ewing
On Wed, Apr 19, 2017 at 8:09 PM, Jeff Janes  wrote:

>
> Your best bet might be to ignore the per-field searching in the initial
> (indexed) pass of the query to get everything that has all the search
> terms, regardless of which field they occur in.  And then re-check whether
> each of the found values was found in the appropriate field in a later pass.
>
> Something like
>
> select * from sample where
>  to_tsvector(json_thing->>:key1) @@ :value1
>   and to_tsvector(json_thing->>:key2) @@ :value2
>   and to_tsvector('english', json_thing::text) @@ (:value1 && :value2)
>

That worked pretty well when there was an AND condition with multiple k:v
pairs as you have. However, replacing it with an OR condition across k:v
pairs was pretty slow. I do like the simplicity though. Maybe indexing
the 10ish most common columns would be a "good enough" solution.
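
If that "index the most common columns" route is pursued, a hedged sketch of
the supporting indexes (table and column names follow Jeff's example; 'key1'
is a placeholder): one expression GIN index over the whole document for the
coarse pass, plus per-field expression indexes for the common keys.

    -- whole-document index for the coarse (indexed) first pass
    CREATE INDEX sample_fts_all ON sample
      USING gin (to_tsvector('english', json_thing::text));

    -- per-field expression index for one of the ~10 common keys
    CREATE INDEX sample_fts_key1 ON sample
      USING gin (to_tsvector('english', json_thing->>'key1'));

Note the query has to use exactly the indexed expression (including the
config argument) for the planner to pick these up.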

>
> From the initial email:
>
> > An idea that has come up is to use a materialized view or secondary
> table with triggers, where we would have 3 columns (id, key, value).
>
> How would this be different from the "triple store" you are abandoning?
>

It would be fairly similar. One advantage would be that we could simplify
the backend to just an RDBMS (which we use already), and not have to maintain
a separate "triple store" instance.


Re: [GENERAL] full text search on hstore or json with materialized view?

2017-04-20 Thread Rj Ewing
On Wed, Apr 19, 2017 at 6:44 PM, George Neuner <gneun...@comcast.net> wrote:
>
> On Wed, 19 Apr 2017 11:57:26 -0700, Rj Ewing <ewing...@gmail.com>
> wrote:
>
> >I did some testing using a secondary table with the key, value column.
> >However I don't think this will provide the performance that we need.
> >Queries were taking 60+ seconds just for a count.
>
> SELECT count(*) or filtered?
>
> Either way, your statistics may be way off.  Did you vacuum analyze
> the table after the inserts (and the index creation if it was done
> separately)?


I think my statistics were off. I never ran VACUUM ANALYZE.
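
For the archive, refreshing statistics after a bulk load and index build is
a one-liner (table name from the thread):

    VACUUM ANALYZE samples_lg_txt;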

>
> Without more detail re: your hardware, Postgresql version, what
> indexes are/will be available, the types of queries you want to run,
> etc., it's very hard to give really meaningful suggestions.
>
PostgreSQL 9.6, currently testing on 8GB RAM, but have up to 64GB for
production. 7-core 2.10GHz.

Mostly we want to run ad-hoc queries, returning the entire row, matching 1-3 k:v
conditions.

> The kind of query you have alluded to is pretty easily parallelized:
> it can be spread over multiple sessions with result aggregation done
> on the client side.
>
> Or, if you have 9.6, you might try using backend parallelism:
> https://www.postgresql.org/docs/9.6/static/parallel-query.html
> [I've not used this, but some people have done it successfully.]
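
A minimal sketch of turning that on in 9.6 (parameter names per the linked
docs; the values are illustrative, not tuned for this workload):

    SET max_parallel_workers_per_gather = 4;  -- 9.6 default is 0, i.e. parallel query off
    SET parallel_setup_cost = 100;            -- optional: nudge the planner toward parallel plans
    SET parallel_tuple_cost = 0.05;
    EXPLAIN SELECT count(*) FROM samples_lg_txt WHERE tsv @@ to_tsquery('value1');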


>

> If you can restrict the FTS query to certain keys:
>
>   SELECT id FROM mytable
> WHERE to_tsquery( ... ) @@ to_tsvector(v)
> AND k IN ( ... )
> GROUP BY id
>
>   [note: according to David Rowley, GROUP BY may be parallelized
>  whereas  DISTINCT currently cannot be.]
>
> then given an index on 'k' it may be much faster than just the FTS
> query alone.  Subject to key variability, it also may be improved by
> table partitioning to reduce the search space.
>
> If the FTS query is key restricted, you can parallelize either on the
> client or on the server.  If the FTS query is not key restricted, you
> pretty much are limited to server side (and 9.6 or later).
>
I'll look into parallelism if we can't get the performance we need.

What do you mean by restricting the FTS query to certain keys? I'm not
a SQL expert, but it seems like the above query would match multiple keys
against 1 tsquery value?

We need to be able to do AND conditions with separate k:v pairs. Our keys
are known ahead of time, and would vary for a given table, but would be in
the range of 30-60 keys per table.


Re: [GENERAL] full text search on hstore or json with materialized view?

2017-04-19 Thread Rj Ewing
Okay, messing around a bit more with the secondary k,v table, it seems like
this could be a good solution.

I created a keys table to hold the 63 key values, then I dropped and
recreated the secondary table, using a FK referencing the keys table. I'm
not really sure why, but a basic full text query on 44 million rows is
taking approx. 20ms.

my table structure is:

 Table "public.samples_lg_txt"
 Column |   Type   | Modifiers
+--+---
 id | integer  |
 key| integer  |
 val| text |
 tsv| tsvector |
Indexes:
"idx_tsv_samples_lg_text" gin (tsv)
Foreign-key constraints:
"samples_lg_txt_id_fkey" FOREIGN KEY (id) REFERENCES samples_lg(id) ON
DELETE CASCADE
"samples_lg_txt_key_fkey" FOREIGN KEY (key) REFERENCES keys(id)


how would I write an AND query that filtered on 2 separate keys from the
samples_lg_txt table?

something like:

SELECT COUNT(*) FROM samples WHERE id IN ( SELECT DISTINCT(s.id) FROM
samples_lg_txt s JOIN keys k ON s.key = k.id WHERE (*name = 'key1' AND tsv
@@ to_tsquery('value1')) AND (name = 'key2' AND tsv @@
to_tsquery('value2'))*;

On Wed, Apr 19, 2017 at 11:57 AM, Rj Ewing <ewing...@gmail.com> wrote:

> I did some testing using a secondary table with the key, value column.
> However I don't think this will provide the performance that we need.
> Queries were taking 60+ seconds just for a count.
>
> With 1 million rows in the primary table, this resulted in 44 million rows
> in the secondary k,v table for full text searching. The same query in es
> takes ~50 ms on my local machine with 1/10th the RAM allocated to es than
> was allocated to psql.
>
> I'm gonna test using trigram indexes on approx 10 json fields, and see if
> that gives us what we are looking for.
>
> Any thoughts on getting sub-1-second queries on a table with 44 million rows?
>
> RJ
>
> On Tue, Apr 18, 2017 at 10:35 PM, George Neuner <gneun...@comcast.net>
> wrote:
>
>> On Tue, 18 Apr 2017 14:38:15 -0700, Rj Ewing <ewing...@gmail.com>
>> wrote:
>>
>> >I am evaluating postgres as a datastore for our webapp. We are moving
>> >away from a triple store db due to performance issues.
>> >
>> >Our data model consists of sets of user defined attributes. Approx 10% of
>> >the attributes tend to be 100% filled with 50% of the attributes having
>> >approx 25% filled. This is fairly sparse data, and it seems that jsonb or
>> >hstore will be best for us.
>> >
>> >Unfortunately, from my understanding, postgres doesn't support fulltext
>> >search across hstore or jsonb key:values or even the entire document.
>> While
>> >this is not a deal breaker, this would be a great feature to have. We
>> have
>> >been experimenting w/ elasticsearch a bit, and particularly enjoy this
>> >feature, however we don't really want to involve the complexity and
>> >overhead of adding elasticsearch in front of our datasource right now.
>>
>> hstore and JSON values all really are just formatted text with a
>> custom column type.  You can create tsvectors from the values if you
>> cast them to text.
>>
>> Note that a tsvector can only work on a /flat/ key:value structure: it
>> won't understand nesting, and even with a flat store it won't
>> understand the difference between keys/tags and the associated values.
>>
>> E.g., you will be able to see that a value contains both "foo" and
>> "bar", but to distinguish 'foo:bar' from 'bar:foo' or 'foo:q,bar:z'
>> you either must check the token positions (from the tsvector) or *try*
>> to extract the key(s) you are interested in and check the associated
>> value(s).
>>
>> This might work ok if you search only for keys in a "document" ... but
>> trying to search values, I think would be far too complicated.
>>
>> It might help if you stored a 2D array instead of a flat structure,
>> but even that would be fairly complicated to work with.
>>
>>
>>
>> >An idea that has come up is to use a materialized view or secondary table
>> >with triggers, where we would have 3 columns (id, key, value).
>> >
>> >I think this would allow us to store a tsvector and gin index. Giving us
>> >the ability to use fulltext search on k:v pairs, then join the original
>> >data on the id field to return the entire record.
>>
>> This is a much better idea because it separates the key from the
>> value, and unlike the full "document" case [above], you will know that
>> the FTS index is covering only the values.
>>
>> If you need to preserve key order to reconstruct records, you will
>> need an additional column to maintain that ordering.

Re: [GENERAL] full text search on hstore or json with materialized view?

2017-04-19 Thread Rj Ewing
I did some testing using a secondary table with the key, value column.
However I don't think this will provide the performance that we need.
Queries were taking 60+ seconds just for a count.

With 1 million rows in the primary table, this resulted in 44 million rows
in the secondary k,v table for full text searching. The same query in es
takes ~50 ms on my local machine with 1/10th the RAM allocated to es than
was allocated to psql.

I'm gonna test using trigram indexes on approx 10 json fields, and see if
that gives us what we are looking for.
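
If the trigram route gets tested, a hedged sketch of one such index (this
assumes the pg_trgm extension; table, column, and field names here are
placeholders, not from the thread):

    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    -- expression index over one json field; queries must use the same expression
    CREATE INDEX sample_trgm_field1 ON sample
      USING gin ((json_thing->>'field1') gin_trgm_ops);
    -- ILIKE / similarity searches on that expression can then use the index:
    SELECT * FROM sample WHERE json_thing->>'field1' ILIKE '%term%';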

Any thoughts on getting sub-1-second queries on a table with 44 million rows?

RJ

On Tue, Apr 18, 2017 at 10:35 PM, George Neuner <gneun...@comcast.net>
wrote:

> On Tue, 18 Apr 2017 14:38:15 -0700, Rj Ewing <ewing...@gmail.com>
> wrote:
>
> >I am evaluating postgres as a datastore for our webapp. We are moving
> >away from a triple store db due to performance issues.
> >
> >Our data model consists of sets of user defined attributes. Approx 10% of
> >the attributes tend to be 100% filled with 50% of the attributes having
> >approx 25% filled. This is fairly sparse data, and it seems that jsonb or
> >hstore will be best for us.
> >
> >Unfortunately, from my understanding, postgres doesn't support fulltext
> >search across hstore or jsonb key:values or even the entire document.
> While
> >this is not a deal breaker, this would be a great feature to have. We have
> >been experimenting w/ elasticsearch a bit, and particularly enjoy this
> >feature, however we don't really want to involve the complexity and
> >overhead of adding elasticsearch in front of our datasource right now.
>
> hstore and JSON values all really are just formatted text with a
> custom column type.  You can create tsvectors from the values if you
> cast them to text.
>
> Note that a tsvector can only work on a /flat/ key:value structure: it
> won't understand nesting, and even with a flat store it won't
> understand the difference between keys/tags and the associated values.
>
> E.g., you will be able to see that a value contains both "foo" and
> "bar", but to distinguish 'foo:bar' from 'bar:foo' or 'foo:q,bar:z'
> you either must check the token positions (from the tsvector) or *try*
> to extract the key(s) you are interested in and check the associated
> value(s).
>
> This might work ok if you search only for keys in a "document" ... but
> trying to search values, I think would be far too complicated.
>
> It might help if you stored a 2D array instead of a flat structure,
> but even that would be fairly complicated to work with.
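
A quick illustration of that key/value ambiguity (not from the thread):
casting a flat json document to text and feeding it to to_tsvector turns
keys and values alike into plain lexemes, so the structure is lost.

    SELECT to_tsvector('english', '{"foo": "bar"}'::jsonb::text);
    -- roughly 'bar':2 'foo':1 -- the key and the value are just tokens
    SELECT to_tsvector('english', '{"foo": "bar"}'::jsonb::text) @@ to_tsquery('foo & bar');
    -- true, and a document with the reversed pair {"bar": "foo"} matches the same query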
>
>
>
> >An idea that has come up is to use a materialized view or secondary table
> >with triggers, where we would have 3 columns (id, key, value).
> >
> >I think this would allow us to store a tsvector and gin index. Giving us
> >the ability to use fulltext search on k:v pairs, then join the original
> >data on the id field to return the entire record.
>
> This is a much better idea because it separates the key from the
> value, and unlike the full "document" case [above], you will know that
> the FTS index is covering only the values.
>
> If you need to preserve key order to reconstruct records, you will
> need an additional column to maintain that ordering.
>
>
> >is anyone currently doing this? Is there a better alternative? Any
> >performance issues that immediately jump out ( I realize the writes will
> >take longer)?
> >
> >the nature of our data is "relatively" static with bulk uploads (100 -
> 1000
> >records). So we can sacrifice some write performance.
> >
> >RJ
>
> Having to "reconstruct" records will make reads take longer as well,
> but I think separating the keys and values is the best way to do it.
>
>
> YMMV,
> George
>
>
>
>


Re: [GENERAL] full text search on hstore or json with materialized view?

2017-04-18 Thread Rj Ewing
A step in the right direction for me; however, it doesn't appear to support
per-field full text searching.
It is exciting though!

On Tue, Apr 18, 2017 at 3:00 PM, Bruce Momjian <br...@momjian.us> wrote:

> On Tue, Apr 18, 2017 at 02:38:15PM -0700, Rj Ewing wrote:
> > I am evaluating postgres as a datastore for our webapp. We are
> moving away
> > from a triple store db due to performance issues.
> >
> > Our data model consists of sets of user defined attributes. Approx 10%
> of the
> > attributes tend to be 100% filled with 50% of the attributes having
> approx 25%
> > filled. This is fairly sparse data, and it seems that jsonb or hstore
> will be
> > best for us.
> >
> > Unfortunately, from my understanding, postgres doesn't support fulltext
> search
> > across hstore or jsonb key:values or even the entire document. While
> this is
> > not a deal breaker, this would be a great feature to have. We have been
> > experimenting w/ elasticsearch a bit, and particularly enjoy this
> feature,
> > however we don't really want to involve the complexity and overhead of
> adding
> > elasticsearch in front of our datasource right now.
>
> Full text search of JSON and JSONB data is coming in Postgres 10, which
> is to be released in September of this year:
>
> https://www.depesz.com/2017/04/04/waiting-for-postgresql-10-full-text-search-support-for-json-and-jsonb/
>
> --
>   Bruce Momjian  <br...@momjian.us>  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
> + As you are, so once was I.  As I am, so you will be. +
> +  Ancient Roman grave inscription +
>
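
For reference once 10 lands: the new functions take json/jsonb directly and
index only the string values, not the keys, so per-field restrictions would
still need the ->> approach (a sketch based on the linked post, untested here):

    -- Postgres 10+: to_tsvector over jsonb walks only the values
    SELECT to_tsvector('english', '{"title": "packaged software"}'::jsonb)
           @@ to_tsquery('english', 'software');   -- true; the key "title" is not searched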


[GENERAL] full text search on hstore or json with materialized view?

2017-04-18 Thread Rj Ewing
I am evaluating postgres as a datastore for our webapp. We are moving
away from a triple store db due to performance issues.

Our data model consists of sets of user defined attributes. Approx 10% of
the attributes tend to be 100% filled with 50% of the attributes having
approx 25% filled. This is fairly sparse data, and it seems that jsonb or
hstore will be best for us.

Unfortunately, from my understanding, postgres doesn't support fulltext
search across hstore or jsonb key:values or even the entire document. While
this is not a deal breaker, this would be a great feature to have. We have
been experimenting w/ elasticsearch a bit, and particularly enjoy this
feature, however we don't really want to involve the complexity and
overhead of adding elasticsearch in front of our datasource right now.

An idea that has come up is to use a materialized view or secondary table
with triggers, where we would have 3 columns (id, key, value).

I think this would allow us to store a tsvector and gin index, giving us
the ability to use fulltext search on k:v pairs, then join the original
data on the id field to return the entire record.
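
A hedged sketch of what that secondary table might look like (names are
illustrative, not a finalized design), using the built-in
tsvector_update_trigger to keep the tsvector column current:

    CREATE TABLE sample_kv (
      id   integer NOT NULL,   -- points back to the primary record
      key  text    NOT NULL,
      val  text,
      tsv  tsvector
    );

    CREATE INDEX sample_kv_tsv_idx ON sample_kv USING gin (tsv);

    CREATE TRIGGER sample_kv_tsv_update
      BEFORE INSERT OR UPDATE ON sample_kv
      FOR EACH ROW
      EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', val);

A search would then filter sample_kv on tsv (and optionally key) and join
back to the primary table on id.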

Is anyone currently doing this? Is there a better alternative? Any
performance issues that immediately jump out (I realize the writes will
take longer)?

The nature of our data is "relatively" static with bulk uploads (100 - 1000
records). So we can sacrifice some write performance.

RJ


Re: [GENERAL] dynamic schema modeling and performance

2017-04-14 Thread Rj Ewing
Some example queries are:

give me all attributes for entity1 where entity1.attribute1 > 1000 and
entity1.attribute15 = "someValue"

give me all attributes for entity1 where entity1.parentId = 1

give me all attributes for entity1 & parent_entity where entity1.attribute2
= "this"


Nothing too complicated.
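
For what it's worth, a hedged sketch of how the first of those might look if
the attributes lived in a jsonb column (table and column names are assumptions):

    SELECT attrs
      FROM entity1
     WHERE (attrs->>'attribute1')::numeric > 1000
       AND attrs->>'attribute15' = 'someValue';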

Our application-level checks consist more of validating the uploaded data
against user-defined rules. They are fairly basic.

Scale wise...

Nothing huge. In terms of a more traditional schema, where each entity was
a separate table, the largest table might contain 1,000,000 rows, with the
vast majority of them under 10,000. Another issue is that there might be 30
attributes for a given entity. However, the distribution of data might look
like: 8 of 30 are 100% populated (each row has a value), 4 of 30 are 80%
populated, 15 are 50% populated, and the rest are <25% populated.

On Fri, Apr 14, 2017 at 11:23 AM, Vincent Elschot <vi...@xs4all.nl> wrote:

>
> Op 14/04/2017 om 19:03 schreef Rj Ewing:
>
> We do know where we want to end up. We've had the application running for
> a while using a triple store db. We're looking to move away from the triple
> store due to performance issues. Our core concept is that each project can
> define a set of entities and their relations. Each entity has a set of
> attributes. We need to be able to efficiently run ad-hoc queries across
> multiple entities in a project, filtering via the attribute values
>
> What kind of queries are you talking about?
>
>
> I think our business needs probably eliminate the possibility of data
> integrity at the db level. We currently do application level data
> validation.
>
>
> Application level checks can only be done if you exclusively lock the
> database from before you start the check until the moment you no longer need
> the certainty.
> That's usually a no-go because it makes your data single-user for the
> duration of the check.
>
>
> Regarding EAV, is there a good way to do this? Most everything I read
> online says that EAV is a terrible idea and performance is lackluster.
> However, there are 6NF advocates who say that, done right, it is highly
> performant. Are there any articles on the correct way to implement EAV?
>
> Performance always depends entirely on what you are doing with it, and on
> what scale. This is something that you should experiment with.
>
>
> would jsonb or eav likely provide better query performance?
>
> Can't tell without more information about what you actually do with the
> data.
>
> But dynamic schemas can be just fine, as long as your application is 100%
> in control over what can and cannot be done,
> and as long as you use separate databases per customer/project/whatever.
> You will probably want to scale up at some point
> and move customers to different servers, so you might as well take that
> into account before you start.
>
>
>
> On Wed, Apr 12, 2017 at 7:43 AM, Merlin Moncure <mmonc...@gmail.com>
> wrote:
>
>> On Tue, Apr 11, 2017 at 12:46 PM, Rj Ewing <ewing...@gmail.com> wrote:
>> > I'm looking for thoughts on the best way to handle dynamic schemas.
>> >
>> > The application I am developing revolves around user defined entities.
>> Each
>> > entity is a tabular dataset with user defined columns and data types.
>> > Entities can also be related to each other through Parent-Child
>> > relationships. Some entities will be 100% user driven, while others
>> (such as
>> > an entity representing a photo) will be partially user driven (all photo
>> > entities will have common fields + custom user additions).
>> >
>> > I was hoping to get opinions on whether postgresql would be a suitable
>> > backend. A couple of options I have thought of are:
>> >
>> > 1. Each entity is represented as a table in psql. The schema would be
>> > dynamically updated (with limits) when an entity mapping is updated. I
>> > believe that this would provide the best data constraints and allow the
>> best
>> > data normalization. A concern I have is that there could be an enormous
>> > amount of tables generated and the performance impacts this might have
>> in
>> > the future. I could then run elasticsearch as a denormalized cache for
>> > efficient querying and full-text-search.
>> >
>> > 2. Use a nosql database. This provides the "dynamic" schema aspect. A
>> > concern here is the lack of relation support, thus leading to a more
>> > denormalized data structure and the potential for the data to become
>> > corrupted.
>> >
>> > Any opinions on the use of psql for this case, or other options would be
>> > greatly appreciated!

Re: [GENERAL] dynamic schema modeling and performance

2017-04-14 Thread Rj Ewing
We do know where we want to end up. We've had the application running for a
while using a triple store db. We're looking to move away from the triple
store due to performance issues. Our core concept is that each project can
define a set of entities and their relations. Each entity has a set of
attributes. We need to be able to efficiently run ad-hoc queries across
multiple entities in a project, filtering via the attribute values.

I think our business needs probably eliminate the possibility of data
integrity at the db level. We currently do application level data
validation.

Regarding EAV, is there a good way to do this? Most everything I read
online says that EAV is a terrible idea and performance is lackluster.
However, there are 6NF advocates who say that, done right, it is highly
performant. Are there any articles on the correct way to implement EAV?

Would jsonb or EAV likely provide better query performance?


On Wed, Apr 12, 2017 at 7:43 AM, Merlin Moncure <mmonc...@gmail.com> wrote:

> On Tue, Apr 11, 2017 at 12:46 PM, Rj Ewing <ewing...@gmail.com> wrote:
> > I'm looking for thoughts on the best way to handle dynamic schemas.
> >
> > The application I am developing revolves around user defined entities.
> Each
> > entity is a tabular dataset with user defined columns and data types.
> > Entities can also be related to each other through Parent-Child
> > relationships. Some entities will be 100% user driven, while others
> (such as
> > an entity representing a photo) will be partially user driven (all photo
> > entities will have common fields + custom user additions).
> >
> > I was hoping to get opinions on whether postgresql would be a suitable
> > backend. A couple of options I have thought of are:
> >
> > 1. Each entity is represented as a table in psql. The schema would be
> > dynamically updated (with limits) when an entity mapping is updated. I
> > believe that this would provide the best data constraints and allow the
> best
> > data normalization. A concern I have is that there could be an enormous
> > amount of tables generated and the performance impacts this might have in
> > the future. I could then run elasticsearch as a denormalized cache for
> > efficient querying and full-text-search.
> >
> > 2. Use a nosql database. This provides the "dynamic" schema aspect. A
> > concern here is the lack of relation support, thus leading to a more
> > denormalized data structure and the potential for the data to become
> > corrupted.
> >
> > Any opinions on the use of psql for this case, or other options would be
> > greatly appreciated!
>
> Postgres can function as a nosql database -- you can use jsonb for
> example to archive data in such a way that the data model can be
> changed without making schema adjustments.   Another way to do it is
> EAV pattern as noted.  These might be good strategies if you're not
> sure where you want to end up.
>
> It really comes down to this: how formal do you want your data model
> to be?   Adding formality leads to performance optimizations, exposes
> your data to the fantastic SQL language, and allows rigorous
> assumptions to be made from external dependencies and trusted.
> Formality also brings a degree of inflexibility since your data has to
> be forced into predetermined structures.
>
> merlin
>
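
To make the two options Merlin mentions concrete, a minimal sketch of each
(purely illustrative names, not a finalized design):

    -- Option A, jsonb: attributes kept in a document column
    CREATE TABLE entity_doc (
      id    serial PRIMARY KEY,
      attrs jsonb NOT NULL
    );
    CREATE INDEX entity_doc_attrs_idx ON entity_doc USING gin (attrs jsonb_path_ops);

    -- Option B, EAV: one row per (entity, attribute) pair
    CREATE TABLE entity_base (id serial PRIMARY KEY);
    CREATE TABLE entity_attr (
      entity_id integer NOT NULL REFERENCES entity_base(id),
      key       text    NOT NULL,
      val       text,
      PRIMARY KEY (entity_id, key)
    );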


Re: [GENERAL] dynamic schema modeling and performance

2017-04-11 Thread RJ Ewing
I thought that might be an answer around here :)

I guess I was looking for what might be a better approach. 

Is dynamically creating a table for each entity a bad idea? I can see something 
like creating a schema for each project (group of related entities) and then 
creating a table for each entity within that schema. I don't expect having more than a few 
thousand projects anytime soon. We have a relatively targeted audience. 

Or would it be better to use jsonb data types and create a denormalized index 
elsewhere?



> On Apr 11, 2017, at 5:17 PM, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
> 
> If you are asking if you should go nosql, 99% you should not.
> 
>> On Tue, Apr 11, 2017 at 10:06 PM, Poul Kristensen <bcc5...@gmail.com> wrote:
>> dataverse.org uses Postgresql and is well documented + it is completely user 
>> driven. Maybe the concept could be useful for you. I have installed and 
>> configured a few to be used by researchers.
>> 
>> regards 
>> Poul   
>> 
>> 
>> 2017-04-11 19:46 GMT+02:00 Rj Ewing <ewing...@gmail.com>:
>>> I'm looking for thoughts on the best way to handle dynamic schemas.
>>> 
>>> The application I am developing revolves around user defined entities. Each 
>>> entity is a tabular dataset with user defined columns and data types. 
>>> Entities can also be related to each other through Parent-Child 
>>> relationships. Some entities will be 100% user driven, while others (such 
>>> as an entity representing a photo) will be partially user driven (all photo 
>>> entities will have common fields + custom user additions).
>>> 
>>> I was hoping to get opinions on whether postgresql would be a suitable 
>>> backend. A couple of options I have thought of are:
>>> 
>>> 1. Each entity is represented as a table in psql. The schema would be 
>>> dynamically updated (with limits) when an entity mapping is updated. I 
>>> believe that this would provide the best data constraints and allow the 
>>> best data normalization. A concern I have is that there could be an 
>>> enormous amount of tables generated and the performance impacts this might 
>>> have in the future. I could then run elasticsearch as a denormalized cache 
>>> for efficient querying and full-text-search.
>>> 
>>> 2. Use a nosql database. This provides the "dynamic" schema aspect. A 
>>> concern here is the lack of relation support, thus leading to a more 
>>> denormalized data structure and the potential for the data to become 
>>> corrupted.
>>> 
>>> Any opinions on the use of psql for this case, or other options would be 
>>> greatly appreciated!
>>> 
>>> RJ
>> 
>> 
>> 
>> -- 
>> Med venlig hilsen / Best regards
>> Poul Kristensen
>> Linux-OS/Virtualizationexpert and Oracle DBA
> 


Re: [GENERAL] dynamic schema modeling and performance

2017-04-11 Thread RJ Ewing
Poul,

I took a quick look at the demo site, but didn't see anything where the user 
was defining the fields. It looks like they can choose from a list of 
predetermined metadata fields. Looking at the code, but not actually seeing the 
total db schema, it looks like they might be using the EAV pattern?

> On Apr 11, 2017, at 1:06 PM, Poul Kristensen  wrote:
> 
> concept could be useful




[GENERAL] dynamic schema modeling and performance

2017-04-11 Thread Rj Ewing
I'm looking for thoughts on the best way to handle dynamic schemas.

The application I am developing revolves around user defined entities. Each
entity is a tabular dataset with user defined columns and data types.
Entities can also be related to each other through Parent-Child
relationships. Some entities will be 100% user driven, while others (such
as an entity representing a photo) will be partially user driven (all photo
entities will have common fields + custom user additions).

I was hoping to get opinions on whether postgresql would be a suitable
backend. A couple of options I have thought of are:

1. Each entity is represented as a table in psql. The schema would be
dynamically updated (with limits) when an entity mapping is updated. I
believe that this would provide the best data constraints and allow the
best data normalization. *A concern I have is that there could be an
enormous amount of tables generated and the performance impacts this might
have in the future*. I could then run elasticsearch as a denormalized cache
for efficient querying and full-text-search.

2. Use a nosql database. This provides the "dynamic" schema aspect. A
concern here is the lack of relation support, thus leading to a more
denormalized data structure and the potential for the data to become
corrupted.

Any opinions on the use of psql for this case, or other options would be
greatly appreciated!

RJ