Why can I not get lexemes for Hebrew but can get them for Armenian?

2019-02-27 Thread Sam Saffron
(This is a cross-post from Stack Exchange; it is not getting much traction there.)

On my Mac install of PG:

```
=# select to_tsvector('english', 'abcd สวัสดี');
 to_tsvector
-------------
 'abcd':1
(1 row)

=# select * from ts_debug('hello สวัสดี');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | hello | {english_stem} | english_stem | {hello}
 blank     | Space symbols   | สวัสดี  | {}             |              |
(2 rows)
```

On my Linux install of PG:

```
=# select to_tsvector('english', 'abcd สวัสดี');
      to_tsvector
------------------------
 'abcd':1 'สวัสดี':2
(1 row)

=# select * from ts_debug('hello สวัสดี');
   alias   |    description    | token |  dictionaries  |  dictionary  | lexemes
-----------+-------------------+-------+----------------+--------------+----------
 asciiword | Word, all ASCII   | hello | {english_stem} | english_stem | {hello}
 blank     | Space symbols     |       | {}             |              |
 word      | Word, all letters | สวัสดี  | {english_stem} | english_stem | {สวัสดี}
(3 rows)

```

So something is clearly different about the way tokenisation is defined
in PG. My question is: how do I figure out what is different, and how
do I make my Mac install of PG behave like the Linux one?
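
One thing I can think of is to first confirm that both installs map the
'english' configuration to the same parser; I would expect this catalog
query to return the same row on both machines, so any difference
presumably lies elsewhere:

```
-- Sketch: which parser does the 'english' text-search configuration use?
SELECT cfgname, prsname
FROM pg_ts_config c
JOIN pg_ts_parser p ON p.oid = c.cfgparser
WHERE cfgname = 'english';
```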

On both installs:

```
# SHOW default_text_search_config;
 default_text_search_config
-----------------------------
 pg_catalog.english
(1 row)

# show lc_ctype;
  lc_ctype
-------------
 en_US.UTF-8
(1 row)
```

So somehow this Mac install thinks that Thai letters are space symbols.
How do I debug this and fix the "Space symbols" classification here?
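
My understanding is that the default parser decides what counts as a
"letter" using the C library's character classification under the
database's lc_ctype, so one check is to look at what the database was
actually created with (rather than only the session-level SHOW above).
A sketch:

```
-- Locale recorded for the current database, plus the server encoding.
SELECT datname, datcollate, datctype
FROM pg_database
WHERE datname = current_database();

SHOW server_encoding;
```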

Interestingly, this install works with Armenian but falls over when we
reach Hebrew:

```
=# select * from ts_debug('ԵԵԵ');
 alias |    description    | token |  dictionaries  |  dictionary  | lexemes
-------+-------------------+-------+----------------+--------------+---------
 word  | Word, all letters | ԵԵԵ   | {english_stem} | english_stem | {եեե}
(1 row)

=# select * from ts_debug('אאא');
 alias |  description  | token | dictionaries | dictionary | lexemes
-------+---------------+-------+--------------+------------+---------
 blank | Space symbols | אאא   | {}           |            |
(1 row)
```
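
If it helps, here is a probe I could run on both machines to see how the
default parser classifies a whole range of code points, e.g. the Hebrew
letter block (U+05D0 to U+05EA); the alias column should show 'word'
where letters are recognised and 'blank' where they are not:

```
-- Classify each code point in the Hebrew letter block with the default parser.
-- chr(n) returns the character for Unicode code point n in a UTF-8 database.
SELECT cp, chr(cp) AS ch, d.alias
FROM generate_series(1488, 1514) AS cp
CROSS JOIN LATERAL ts_debug('simple', chr(cp)) AS d;
```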



Faster way of estimating database size

2018-06-14 Thread Sam Saffron
Hi there,

At the moment we are using:

SELECT pg_database.datname, pg_database_size(pg_database.datname) as
size FROM pg_database

to gather the size of each database for monitoring purposes in Prometheus.

Our pg_stat logging now shows this is our number-one query, because for
some crazy reason we seem to be running it 19 times a minute. Clearly
this is too much, and we should (and will) cache the results for a bit.

Nonetheless, I notice it takes 400 ms to run on our cluster (with a few
hundred DBs), and I was wondering if there is a quicker way of
estimating this number?
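
One rough alternative I have been considering (not sure it fits our
monitoring) is to sum pg_class.relpages instead of stat-ing every file;
the figure is only as fresh as the last VACUUM/ANALYZE, and it has to be
run while connected to each database, but it avoids walking the data
directory:

```
-- Approximate size of the current database from catalog statistics.
SELECT current_database() AS datname,
       pg_size_pretty(sum(relpages)::bigint * current_setting('block_size')::bigint) AS approx_size
FROM pg_class;
```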

Sam



Why is tuple_percent so low?

2018-02-26 Thread Sam Saffron
I am trying to refactor a table on disk so it consumes less space:

Original is:

create table post_timings(
   topic_id int not null,
   post_number int not null,
   user_id int not null,
   msecs int not null
)


Target is:

create table post_timings(
  post_id int not null,
  user_id int not null,
  dsecs smallint not null
)


Before I have:

select * from pgstattuple('post_timings2');

-[ RECORD 1 ]------+-----------
table_len          | 5146427392
tuple_count        | 116221695
tuple_len          | 4648867800
tuple_percent      | 90.33
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 15082484
free_percent       | 0.29


After I have:

-[ RECORD 1 ]------+-----------
table_len          | 5142036480
tuple_count        | 116122544
tuple_len          | 3948166496
tuple_percent      | 76.78
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 15069224
free_percent       | 0.29


What I find striking is that the table size on disk remains almost
unchanged, despite each tuple taking 6 fewer bytes.

All the "missing" space is overhead that pgstattuple does not count as
tuple data; in particular, tuple_percent drops from 90.33 to 76.78.
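
A back-of-envelope check on the numbers above (assuming the usual
23-byte heap tuple header padded to 24 bytes, 8-byte MAXALIGN, and a
4-byte line pointer per row) seems to confirm this, if I am reading it
right:

```
-- Average stored tuple length, computed from the pgstattuple output above.
SELECT 4648867800::numeric / 116221695 AS avg_tuple_len_before,  -- 40 bytes: 24-byte header + 4 x int4
       3948166496::numeric / 116122544 AS avg_tuple_len_after;   -- 34 bytes: 24-byte header + 2 x int4 + 1 x int2
-- On the page each tuple is rounded up to an 8-byte boundary (34 -> 40) and
-- gets a 4-byte line pointer, so both layouts cost roughly 44 bytes per row.
```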

I was wondering:

1. Where is all my missing space? Is it in page alignment and per-page
overhead? (See the pageinspect sketch after this list.)

2. Are there any other schemes I could look at for storing this data in
a more efficient yet easily queryable/updatable table?
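
For (1), I guess I could also verify the per-row footprint directly with
the pageinspect extension (superuser needed); a sketch against the first
heap page:

```
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- lp_len is the stored tuple length and lp_off the tuple's offset within the
-- page, so the gaps between consecutive offsets show the 8-byte alignment padding.
SELECT lp, lp_off, lp_len, t_hoff
FROM heap_page_items(get_raw_page('post_timings', 0))
LIMIT 5;
```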

Keep in mind these tables get huge; in many of our cases they span
10-20 GB just to store this information.

Sam