Re: [HACKERS] IF NOT EXISTS option for CREATE SERVER and CREATE USER MAPPING statements

2017-03-13 Thread Artur Zakirov

On 13.03.2017 13:01, Anastasia Lubennikova wrote:


Thanks for catching that.
It was caused by a conflict when applying the patch.
Updated versions of both patches are attached.



I think the code is good and the patches are small. The documentation is 
updated by the patches.


All regression tests pass.

Marked the patch as "Ready for Committer".

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] IF NOT EXISTS option for CREATE SERVER and CREATE USER MAPPING statements

2017-03-13 Thread Artur Zakirov

On 15.02.2017 20:54, Anastasia Lubennikova wrote:


Done.



I have gotten an error that AlterUserMappingStmt doesn't have 
if_not_exists (compiler output translated from Russian):



gram.y: In function 'base_yyparse':
gram.y:4918:7: error: 'AlterUserMappingStmt {aka struct AlterUserMappingStmt}' 
has no member named 'if_not_exists'
  n->if_not_exists = false;
   ^~


After applying the CREATE USER patch in gram.y I have:



AlterUserMappingStmt: ALTER USER MAPPING FOR auth_ident SERVER name alter_generic_options
				{
					AlterUserMappingStmt *n = makeNode(AlterUserMappingStmt);
					n->user = $5;
					n->servername = $7;
					n->options = $8;
					n->if_not_exists = false;
					$$ = (Node *) n;
				}
			| CREATE USER MAPPING IF_P NOT EXISTS FOR auth_ident SERVER name create_generic_options
				{
					CreateUserMappingStmt *n = makeNode(CreateUserMappingStmt);
					n->user = $8;
					n->servername = $10;
					n->options = $11;
					n->if_not_exists = true;
					$$ = (Node *) n;
				}
		;


Here the ALTER USER MAPPING and CREATE USER MAPPING commands were mixed: the CREATE branch was added to the AlterUserMappingStmt rule.
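For reference, here is a sketch of the intended separation (reconstructed from the quoted grammar, not taken from the final patch); the IF_P NOT EXISTS branch belongs under CreateUserMappingStmt:

```yacc
AlterUserMappingStmt: ALTER USER MAPPING FOR auth_ident SERVER name alter_generic_options
				{
					AlterUserMappingStmt *n = makeNode(AlterUserMappingStmt);
					n->user = $5;
					n->servername = $7;
					n->options = $8;
					$$ = (Node *) n;
				}
		;

CreateUserMappingStmt: CREATE USER MAPPING FOR auth_ident SERVER name create_generic_options
				{
					CreateUserMappingStmt *n = makeNode(CreateUserMappingStmt);
					n->user = $5;
					n->servername = $7;
					n->options = $8;
					n->if_not_exists = false;
					$$ = (Node *) n;
				}
			| CREATE USER MAPPING IF_P NOT EXISTS FOR auth_ident SERVER name create_generic_options
				{
					CreateUserMappingStmt *n = makeNode(CreateUserMappingStmt);
					n->user = $8;
					n->servername = $10;
					n->options = $11;
					n->if_not_exists = true;
					$$ = (Node *) n;
				}
		;
```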

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag

2017-03-10 Thread Artur Zakirov

On 10.03.2017 04:00, Michael Paquier wrote:

On Thu, Mar 9, 2017 at 10:25 PM, Artur Zakirov <a.zaki...@postgrespro.ru> wrote:

I think these are good fixes. I've checked them, and in my opinion they are
correct.

The code also looks good.


Having something with conflicts is not nice, so attached is a rebased version.


Thank you!

I've rerun regression and TAP tests. They all passed.

Also, it might be good to update a couple of comments.

In buf_internals.h:

#define BM_PERMANENT			(1U << 31)	/* permanent relation (not
											 * unlogged) */


And in FlushBuffer():

/*
 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
 * rule that log updates must hit disk before any of the data-file changes
 * they describe do.
 *
 * However, this rule does not apply to unlogged relations, which will be
 * lost after a crash anyway.  Most unlogged relation pages do not bear


This is because BM_PERMANENT is now also used for the init forks of unlogged indexes.
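For example, the macro comment could be reworded along these lines (only a suggestion; the exact wording is up to the committer):

```c
/* In buf_internals.h: */
#define BM_PERMANENT			(1U << 31)	/* permanent relation (not unlogged,
											 * or init fork of an unlogged
											 * relation) */
```

And the FlushBuffer() comment could note that init-fork pages of unlogged relations are an exception: they are WAL-logged and must follow the flush rule.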

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Should buffer of initialization fork have a BM_PERMANENT flag

2017-03-09 Thread Artur Zakirov

Hello,

I wanted to review the patch, but it applies with errors. I've 
rebased a local copy and have done the review on it. I'm not sure whether it 
is proper for a reviewer to send a rebased patch, so I haven't sent it to 
avoid confusion.


On 29.01.2017 17:00, Michael Paquier wrote:

Attached is what I have in mind for HEAD. btree, gist, spgist and
bloom indexes are changed so as the init forks created go through the
shared buffers instead of having their empty() routines handle the
flush of the page created. This removes any kind of race conditions
between the checkpointer and the init fork creations, which is I think
a good thing.


I think these are good fixes. I've checked them, and in my opinion they 
are correct.


The code also looks good.



Here are the tests I have done.
First running those commands to create all types of indexes.
create extension bloom;
create extension btree_gist;
create extension btree_gin;
create unlogged table foo (a int);
create index foo_bt on foo(a);
create index foo_bloom on foo using bloom(a);
create index foo_gin on foo using gin (a);
create index foo_gist on foo using gist (a);
create index foo_brin on foo using brin (a);
create unlogged table foo_geo (a box);
create index foo_spgist ON foo_geo using spgist(a);
checkpoint;

Then crash the server, restart it, and the following vacuums are able
to complete.
vacuum foo;
vacuum foo_geo;



I've done these tests. Without the patch the server crashes on the vacuum 
commands; after applying the patch it doesn't.


I have run regression and TAP tests. They all passed without error.

I think the patch can be marked as "Ready for Committer" after rebase.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Bug in to_timestamp().

2017-02-27 Thread Artur Zakirov

On 15.02.2017 15:26, Amul Sul wrote:

The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:tested, passed

Review of v7 patch:
- Patch applies to the top of master HEAD cleanly & make check-world also fine.
- Tom Lane's review comments are fixed.



The new status of this patch is: Ready for Committer



Thank you for your review!

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] [PATCH] Generic type subscription

2017-01-15 Thread Artur Zakirov
> Yes, but it was related to the idea of having `ArrayRef` and `JsonbRef` nodes
> for specific types. Since now there is generic `SubscriptingRef` node, I think
> it should be ok.

Sorry, I misunderstood it.

> Just to be clear - as far as I understood, these compilation problems were
> caused not because the extension knew something about ArrayRef node in
> particular, but because the extension tried to extract all nodes to generate
> code from them. It means any change will require "refetching", so I think it's
> natural for this extension.

Agreed. It would be hard to maintain both nodes, and it is not very elegant
to have two nodes, ArrayRef (deprecated) and SubscriptingRef. It is not
hard to replace the ArrayRef node with SubscriptingRef in other
extensions.

One small note:

>  #include "utils/lsyscache.h"
> +#include "utils/syscache.h"
>  #include "utils/memutils.h"

I think "utils/syscache.h" isn't necessary here; PostgreSQL compiles
without this include.

I suppose that this patch can be marked as "Ready for Committer". Any opinions?

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] [PATCH] Generic type subscription

2017-01-04 Thread Artur Zakirov
2016-12-27 14:42 GMT+05:00 Dmitry Dolgov <9erthali...@gmail.com>:
>> On 27 December 2016 at 16:09, Aleksander Alekseev
>> <a.aleks...@postgrespro.ru> wrote:
>> until it breaks existing extensions.
>
> Hm...I already answered, that I managed to avoid compilation problems for
> this particular extension
> using the `genparser` command again:

I suppose that a separate node type could solve it. But I'm not sure
how to distinguish the ArrayRef node from the new SubscriptingRef node.
Maybe it could be done in the transformIndirection() function, if I
understand everything correctly.

Also, Tom pointed out that he had a bad experience with reusing the ArrayRef node:
https://www.postgresql.org/message-id/518.1439846343%40sss.pgh.pa.us
> No.  Make a new expression node type.
>
> (Salesforce did something similar for an internal feature, and it was a
> disaster both for code modularity and performance.  We had to change it to
> a separate node type, which I just got finished doing.  Don't go down that
> path.  While you're at it, I'd advise that fetch and assignment be two
> different node types rather than copying ArrayRef's bad precedent of using
> only one.)

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] [PATCH] Generic type subscription

2016-12-26 Thread Artur Zakirov
2016-12-26 18:49 GMT+03:00 Dmitry Dolgov <9erthali...@gmail.com>:
>> On 5 December 2016 at 12:03, Haribabu Kommi <kommi.harib...@gmail.com>
>> wrote:
>
>> Moved to next CF with "needs review" status.
>
> Looks like we stuck here little bit. Does anyone else have any
> suggestions/improvements, or this patch is in good enough shape?

Would you rebase the patch, please? It seems necessary; it can't
be applied now.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] [BUG?] pg_event_trigger_ddl_commands() error with ALTER TEXT SEARCH CONFIGURATION

2016-12-21 Thread Artur Zakirov
Thank you for your comments, Stephen.

2016-12-21 20:34 GMT+03:00 Stephen Frost <sfr...@snowman.net>:
>
> Did you happen to look at adding a regression test for this to
> test_ddl_deparse?

Of course. I updated the patch.

>
>> This patch only fixes the bug. But I think I can also do a patch which
>> will expose pg_ts_config_map entries via
>> pg_event_trigger_ddl_commands() if someone wants. Maybe it can be done
>> using a new entry in the CollectedCommandType structure.
>
> While that sounds like a good idea, it seems like it's more a feature
> addition rather than a bugfix, no?
>

Yes, I agree with you. It should be added as a separate patch. I think I
will deal with it.


-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/backend/commands/tsearchcmds.c b/src/backend/commands/tsearchcmds.c
index b24011371c..2b84848cf5 100644
--- a/src/backend/commands/tsearchcmds.c
+++ b/src/backend/commands/tsearchcmds.c
@@ -1215,10 +1215,10 @@ AlterTSConfiguration(AlterTSConfigurationStmt *stmt)
 	/* Update dependencies */
 	makeConfigurationDependencies(tup, true, relMap);
 
-	InvokeObjectPostAlterHook(TSConfigMapRelationId,
+	InvokeObjectPostAlterHook(TSConfigRelationId,
 			  HeapTupleGetOid(tup), 0);
 
-	ObjectAddressSet(address, TSConfigMapRelationId, cfgId);
+	ObjectAddressSet(address, TSConfigRelationId, cfgId);
 
 	heap_close(relMap, RowExclusiveLock);
 
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index fd4eff4907..b7a4c8e531 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1479,7 +1479,8 @@ ProcessUtilitySlow(ParseState *pstate,
 break;
 
 			case T_AlterTSConfigurationStmt:
-address = AlterTSConfiguration((AlterTSConfigurationStmt *) parsetree);
+AlterTSConfiguration((AlterTSConfigurationStmt *) parsetree);
+commandCollected = true;
 break;
 
 			case T_AlterTableMoveAllStmt:
diff --git a/src/test/modules/test_ddl_deparse/Makefile b/src/test/modules/test_ddl_deparse/Makefile
index 8ea6f39afd..3a57a95c84 100644
--- a/src/test/modules/test_ddl_deparse/Makefile
+++ b/src/test/modules/test_ddl_deparse/Makefile
@@ -23,6 +23,7 @@ REGRESS = test_ddl_deparse \
 	comment_on \
 	alter_function \
 	alter_sequence \
+	alter_ts_config \
 	alter_type_enum \
 	opfamily \
 	defprivs \
diff --git a/src/test/modules/test_ddl_deparse/expected/alter_ts_config.out b/src/test/modules/test_ddl_deparse/expected/alter_ts_config.out
new file mode 100644
index 00..afc352fc5f
--- /dev/null
+++ b/src/test/modules/test_ddl_deparse/expected/alter_ts_config.out
@@ -0,0 +1,8 @@
+--
+-- ALTER TEXT SEARCH CONFIGURATION
+--
+CREATE TEXT SEARCH CONFIGURATION en (copy=english);
+NOTICE:  DDL test: type simple, tag CREATE TEXT SEARCH CONFIGURATION
+ALTER TEXT SEARCH CONFIGURATION en
+  ALTER MAPPING FOR host, email, url, sfloat WITH simple;
+NOTICE:  DDL test: type alter text search configuration, tag ALTER TEXT SEARCH CONFIGURATION
diff --git a/src/test/modules/test_ddl_deparse/sql/alter_ts_config.sql b/src/test/modules/test_ddl_deparse/sql/alter_ts_config.sql
new file mode 100644
index 00..ac13e21dda
--- /dev/null
+++ b/src/test/modules/test_ddl_deparse/sql/alter_ts_config.sql
@@ -0,0 +1,8 @@
+--
+-- ALTER TEXT SEARCH CONFIGURATION
+--
+
+CREATE TEXT SEARCH CONFIGURATION en (copy=english);
+
+ALTER TEXT SEARCH CONFIGURATION en
+  ALTER MAPPING FOR host, email, url, sfloat WITH simple;



Re: [HACKERS] Rethinking our fulltext phrase-search implementation

2016-12-21 Thread Artur Zakirov

Hello Tom,

On 17.12.2016 21:36, Tom Lane wrote:


4. The transformations are wrong anyway.  The OR case I showed above is
all right, but as I argued in <24331.1480199...@sss.pgh.pa.us>, the AND
case is not:

regression=# select 'a <-> (b & c)'::tsquery;
  tsquery
---
 'a' <-> 'b' & 'a' <-> 'c'
(1 row)

This matches 'a b a c', because 'a <-> b' and 'a <-> c' can each be
matched at different places in that text; but it seems highly unlikely to
me that that's what the writer of such a query wanted.  (If she did want
that, she would write it that way to start with.)  NOT is not very nice
either:


If I'm not mistaken, PostgreSQL 9.6 and master with the patch 
"fix-phrase-search.patch" return false for the query:


select 'a b a c' @@ 'a <-> (b & c)'::tsquery;
 ?column?
--
 f
(1 row)

I agree that such a query is confusing. Maybe it is better to return true 
for such queries?
Otherwise it seems that queries like 'a <-> (b & c)' will always return 
false. In that case maybe we need some warning message.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] [sqlsmith] Crash in tsquery_rewrite/QTNBinary

2016-12-10 Thread Artur Zakirov
Hi,

2016-12-07 9:06 GMT+03:00 Andreas Seltenreich <seltenre...@gmx.de>:
> Hi,
>
> the following query crashes master as of 4212cb7.
>
> select ts_rewrite(
>   tsquery_phrase(
>  tsquery $$'sanct' & 'peter'$$,
>  tsquery $$'5' <-> '6'$$,
>  42),
>   tsquery $$'5' <-> '6'$$,
>   plainto_tsquery('I') );
>

This also happens for these queries:

select ts_rewrite(
  to_tsquery('5 & (6 | 5)'),
  to_tsquery('5'),
  to_tsquery('I'));

select ts_rewrite(
  to_tsquery('5 & 6 <-> 5'),
  to_tsquery('5'),
  to_tsquery('I'));

It happens because 'I' is a stop word and the substitute query becomes
empty. For the queries above we need a recursive dropvoidsubtree()
function. Without this patch the function cleans only the first level of
the tree, and the query above becomes '6 | void'.

At first I made dropvoidsubtree() recursive. But the attached patch cleans
the query tree in dofindsubquery() to avoid an extra tree scan.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/backend/utils/adt/tsquery_rewrite.c b/src/backend/utils/adt/tsquery_rewrite.c
index e5bed6e..00174bd 100644
--- a/src/backend/utils/adt/tsquery_rewrite.c
+++ b/src/backend/utils/adt/tsquery_rewrite.c
@@ -186,6 +186,15 @@ findeq(QTNode *node, QTNode *ex, QTNode *subs, bool *isfind)
  * Recursive guts of findsubquery(): attempt to replace "ex" with "subs"
  * at the root node, and if we failed to do so, recursively match against
  * child nodes.
+ *
+ * Delete any void subtrees. In the following example '5' is replaced by empty
+ * operand:
+ *
+ *      AND        ->   6
+ *     /   \
+ *    5   PHRASE
+ *         /  \
+ *        6    5
  */
 static QTNode *
 dofindsubquery(QTNode *root, QTNode *ex, QTNode *subs, bool *isfind)
@@ -200,39 +209,20 @@ dofindsubquery(QTNode *root, QTNode *ex, QTNode *subs, bool *isfind)
 
	if (root && (root->flags & QTN_NOCHANGE) == 0 && root->valnode->type == QI_OPR)
{
-   int i;
-
-   for (i = 0; i < root->nchild; i++)
-			root->child[i] = dofindsubquery(root->child[i], ex, subs, isfind);
-   }
-
-   return root;
-}
-
-/*
- * Delete any void subtrees that may have been inserted when the replacement
- * subtree is void.
- */
-static QTNode *
-dropvoidsubtree(QTNode *root)
-{
-   if (!root)
-   return NULL;
-
-   if (root->valnode->type == QI_OPR)
-   {
int i,
j = 0;
 
for (i = 0; i < root->nchild; i++)
{
-   if (root->child[i])
-   {
-   root->child[j] = root->child[i];
+			root->child[j] = dofindsubquery(root->child[i], ex, subs, isfind);
+   if (root->child[j])
j++;
-   }
}
 
+		/*
+		 * Delete any void subtrees that may have been inserted when the
+		 * replacement subtree is void.
+		 */
root->nchild = j;
 
if (root->nchild == 0)
@@ -267,9 +257,6 @@ findsubquery(QTNode *root, QTNode *ex, QTNode *subs, bool *isfind)
 
	root = dofindsubquery(root, ex, subs, &DidFind);
 
-   if (!subs && DidFind)
-   root = dropvoidsubtree(root);
-
if (isfind)
*isfind = DidFind;
 
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index 55d6a85..9c20aaf 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1251,6 +1251,14 @@ SELECT ts_rewrite('5 <-> (6 | 8)', 'SELECT keyword, sample FROM test_tsquery'::t
  '5' <-> '7' | '5' <-> '8'
 (1 row)
 
+-- Check empty substitute
+SELECT ts_rewrite(to_tsquery('5 & 6 <-> 5'), to_tsquery('5'), to_tsquery('I'));
+NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
+ ts_rewrite 
+
+ '6'
+(1 row)
+
 SELECT keyword FROM test_tsquery WHERE keyword @> 'new';
 keyword 
 
diff --git a/src/test/regress/sql/tsearch.sql b/src/test/regress/sql/tsearch.sql
index afd990e..8752344 100644
--- a/src/test/regress/sql/tsearch.sql
+++ b/src/test/regress/sql/tsearch.sql
@@ -418,6 +418,8 @@ SELECT ts_rewrite('1 & (2 <2> 3)', 'SELECT keyword, sample FROM test_tsquery'::t
 SELECT ts_rewrite('5 <-> (1 & (2 <-> 3))', 'SELECT keyword, sample FROM test_tsquery'::text );
 SELECT ts_rewrite('5 <-> (6 | 8)', 'SELECT keyword, sample FROM test_tsquery'::text );
 
+-- Check empty substitute

Re: [HACKERS] [BUG?] pg_event_trigger_ddl_commands() error with ALTER TEXT SEARCH CONFIGURATION

2016-11-28 Thread Artur Zakirov
2016-11-28 21:32 GMT+03:00 Robert Haas <robertmh...@gmail.com>:
>
> You might need to add this patch to https://commitfest.postgresql.org/
> so that it doesn't get forgotten.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

I added the patch to https://commitfest.postgresql.org/12/895/
I added it to the next commitfest, because we are at the end of the
current one.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] proposal: session server side variables

2016-11-28 Thread Artur Zakirov

On 28.11.2016 10:42, Pavel Stehule wrote:


next update - setattr, getattr functions are working now

notes, comments?

Regards

Pavel



This is interesting!

Do you also have plans to support table variables? For example, like this:

create type composite_type_2 as (a int, b text);
create variable var7 composite_type_2;
select insertvar('var7','(10,Hello world\, Hello world\, Hello world)');
select insertvar('var7','(1000,Hola, hola!)');
select * from getvar('var7');
  a   |                   b
------+----------------------------------------
 10   | Hello world, Hello world, Hello world
 1000 | Hola, hola!

Or is it a bad idea? Or is it not related to this patch?

We have an extension (https://github.com/postgrespro/pg_variables) which 
supports table-like variables. It shows better performance than 
temporary tables.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] [BUG?] pg_event_trigger_ddl_commands() error with ALTER TEXT SEARCH CONFIGURATION

2016-11-27 Thread Artur Zakirov
Thank you for the answers.

2016-11-19 21:28 GMT+03:00 Michael Paquier <michael.paqu...@gmail.com>:
> On Thu, Nov 17, 2016 at 1:17 PM, Alvaro Herrera
> <alvhe...@2ndquadrant.com> wrote:
>> It's a bug.  You're right that we need to handle the object class
>> somewhere.  Perhaps I failed to realize that tsconfigs could get
>> altered.
>
> It seems to me that the thing to be careful of here is how a new
> OBJECT_TSCONFIGURATIONMAP should use ObjectAddress. It does not seem
> that complicated, but it needs some work.
> --
> Michael

After some investigation it seems to me that OBJECT_TSCONFIGURATIONMAP
can't be added for pg_ts_config_map, because this catalog doesn't have OIDs.

But this bug can be fixed easily (patch attached). First, I think
TSConfigRelationId should be used instead of TSConfigMapRelationId in
AlterTSConfiguration(). Second, in ProcessUtilitySlow() we can set
commandCollected = true, because the configuration entry is added to
currentEventTriggerState->commandList in
EventTriggerCollectAlterTSConfig().

This patch only fixes the bug. But I think I can also do a patch which
will expose pg_ts_config_map entries via
pg_event_trigger_ddl_commands() if someone wants. Maybe it can be done
using a new entry in the CollectedCommandType structure.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/backend/commands/tsearchcmds.c b/src/backend/commands/tsearchcmds.c
index b240113..2b84848 100644
--- a/src/backend/commands/tsearchcmds.c
+++ b/src/backend/commands/tsearchcmds.c
@@ -1215,10 +1215,10 @@ AlterTSConfiguration(AlterTSConfigurationStmt *stmt)
 	/* Update dependencies */
 	makeConfigurationDependencies(tup, true, relMap);
 
-	InvokeObjectPostAlterHook(TSConfigMapRelationId,
+	InvokeObjectPostAlterHook(TSConfigRelationId,
 			  HeapTupleGetOid(tup), 0);
 
-	ObjectAddressSet(address, TSConfigMapRelationId, cfgId);
+	ObjectAddressSet(address, TSConfigRelationId, cfgId);
 
 	heap_close(relMap, RowExclusiveLock);
 
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index f50ce40..1217d1a 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1477,7 +1477,8 @@ ProcessUtilitySlow(ParseState *pstate,
 break;
 
 			case T_AlterTSConfigurationStmt:
-address = AlterTSConfiguration((AlterTSConfigurationStmt *) parsetree);
+AlterTSConfiguration((AlterTSConfigurationStmt *) parsetree);
+commandCollected = true;
 break;
 
 			case T_AlterTableMoveAllStmt:



[HACKERS] Re: [HACKERS] Re: [PATCH] Tab completion for ALTER TYPE … RENAME VALUE …

2016-11-06 Thread Artur Zakirov
Hello,

2016-09-12 16:16 GMT+03:00 Dagfinn Ilmari Mannsåker <ilm...@ilmari.org>:
>
> I've added it to the 2016-11 commit fest:
> https://commitfest.postgresql.org/11/795/
>
> - ilmari

I've tested your patch.

The patch applies to master; it seems there is no need to rebase
it. PostgreSQL compiled without errors with the patch.

I've tested the patch with type:

CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green',
'blue', 'purple');

And the following completions work as expected:

=> ALTER TYPE rainbow RENAME 
ATTRIBUTE  TO VALUE

=> ALTER TYPE rainbow RENAME VALUE 
'blue''green'   'orange'  'purple'  'red' 'yellow'

It seems that the patch can be committed without any fixes, so I marked
it as "Ready for Committer".

-- 
Sincerely,
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Bug in to_timestamp().

2016-11-06 Thread Artur Zakirov
Thank you for your comments,

2016-11-04 20:36 GMT+02:00 Tom Lane <t...@sss.pgh.pa.us>:
>
> Artur Zakirov <a.zaki...@postgrespro.ru> writes:
> > I attached new version of the patch, which fix is_char_separator()
> > declaration too.
>
> I did some experimenting using
> http://rextester.com/l/oracle_online_compiler
>

>
> which makes it look a lot like Oracle treats separator characters in the
> pattern the same as spaces (but I haven't checked their documentation to
> confirm that).
>
> The proposed patch doesn't seem to me to be trying to follow
> these Oracle behaviors, but I think there is very little reason for
> changing any of this stuff unless we move it closer to Oracle.

Previous versions of the patch didn't try to follow all Oracle
behaviors; they tried to fix the behaviors Amul Sul reported, because I was
confused by how Oracle's to_timestamp() deals with spaces and separators,
and there is no information about it in the Oracle documentation:
https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements004.htm#g195443

I've thought better about it now and fixed the patch. Now the parser
removes spaces before and after fields and insists that the count of
separators in the input string matches the count of spaces or
separators in the formatting string (though the formatting string can
have more separators than the input string).

>
> Some other nitpicking:
>
> * I think the is-separator function would be better coded like
>
> static bool
> is_separator_char(const char *str)
> {
> /* ASCII printable character, but not letter or digit */
> return (*str > 0x20 && *str < 0x7F &&
> !(*str >= 'A' && *str <= 'Z') &&
> !(*str >= 'a' && *str <= 'z') &&
> !(*str >= '0' && *str <= '9'));
> }
>
> The previous way is neither readable nor remarkably efficient, and it
> knows much more about the ASCII character set than it needs to.

Fixed.

>
> * Don't forget the cast to unsigned char when using isspace() or other
>  functions.

Fixed.

>
> * I do not see the reason for throwing an error here:
>
> +		/* Previous character was a backslash */
> +		if (in_backslash)
> +		{
> +			/* After backslash should go non-space character */
> +			if (isspace(*str))
> +				ereport(ERROR,
> +						(errcode(ERRCODE_SYNTAX_ERROR),
> +						 errmsg("invalid escape sequence")));
> +			in_backslash = false;
>
> Why shouldn't backslash-space be a valid quoting sequence?
>

Hm, true. Fixed it.

I've attached the fixed patch.

--
Sincerely,
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2e64cc4..5a4e248 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -6159,7 +6159,8 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
to_timestamp and to_date
skip multiple blank spaces in the input string unless the
FX option is used. For example,
-   to_timestamp('2000JUN', ' MON') works, but
+   to_timestamp('2000JUN', ' MON') and
+   to_timestamp('2000JUN', 'MON') work, but
to_timestamp('2000JUN', 'FX MON') returns an error
because to_timestamp expects one space only.
FX must be specified as the first item in
@@ -6169,6 +6170,19 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
 
  
   
+   to_timestamp and to_date don't
+   skip multiple printable non letter and non digit characters in the input
+   string, but skip them in the formatting string. For example,
+   to_timestamp('2000-JUN', '/MON') and
+   to_timestamp('2000/JUN', '//MON') work, but
+   to_timestamp('2000//JUN', '/MON')
+   returns an error because count of the "/" character in the input string
+   doesn't match count of it in the formatting string.
+  
+ 
+
+ 
+  
Ordinary text is allowed in to_char
templates and will be output literally.  You can put a substring
in double quotes to force it to be interpreted as literal text
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index d4eaa50..d28ceec 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -169,6 +169,8 @@ struct FormatNode
 #define NODE_TYPE_END		1
 #define NODE_TYPE_ACTION	2
 #define NODE_TYPE_CHAR		3
+#define NODE_TYPE_SEPARATOR	4
+#define NODE_TYPE_SPACE		5
 
 #define SUFFTYPE_PREFIX		1
 #define SUFFTYPE_POSTFIX	2

Re: [HACKERS] [PATCH] Generic type subscription

2016-11-03 Thread Artur Zakirov
Hello,

Do you have an updated version of the patch?

2016-10-18 20:41 GMT+03:00 Dmitry Dolgov <9erthali...@gmail.com>:
>
>
> > The term "subscription" is confusing me
>
> Yes, you're right. "container" is too general I think, so I renamed
> everything
> to "subscripting".
>

There are other occurrences of "subscription", including in file names.
Please fix them.

Also, I've sent a personal email noting that we have a performance
degradation of SELECT queries:

create table test (
a int2[],
b int4[][][]);

With patch:

=> insert into test (a[1:5], b[1:1][1:2][1:2])
  select '{1,2,3,4,5}', '{{{0,0},{1,2}}}'
  from generate_series(1,10);
Time: 936,285 ms

=> UPDATE test SET a[0] = '2';
Time: 1605,406 ms (00:01,605)

=> UPDATE test SET b[1:1][1:1][1:2] = '{113, 117}';
Time: 1603,076 ms (00:01,603)

=> explain analyze select a[1], b[1][1][1] from test;
                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..3858.00 rows=10 width=6) (actual time=0.035..241.280 rows=10 loops=1)
 Planning time: 0.087 ms
 Execution time: 246.916 ms
(3 rows)

And without patch:

=> insert into test (a[1:5], b[1:1][1:2][1:2])
  select '{1,2,3,4,5}', '{{{0,0},{1,2}}}'
  from generate_series(1,10);
Time: 971,677 ms

=> UPDATE test SET a[0] = '2';
Time: 1508,262 ms (00:01,508)

=> UPDATE test SET b[1:1][1:1][1:2] = '{113, 117}';
Time: 1473,459 ms (00:01,473)

=> explain analyze select a[1], b[1][1][1] from test;
                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..5286.00 rows=10 width=6) (actual time=0.024..98.475 rows=10 loops=1)
 Planning time: 0.081 ms
 Execution time: 105.055 ms
(3 rows)

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [HACKERS] FTS Configuration option

2016-10-13 Thread Artur Zakirov

On 13.10.2016 11:54, Emre Hasegeli wrote:



Maybe it is also better to use -> instead of AND? AND would have another
behaviour. I could create the following configuration:
=> ALTER TEXT SEARCH CONFIGURATION multi_conf
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH (german_ispell AND english_ispell) OR simple;

which will return both german_ispell and english_ispell results. But
I'm not sure that this is a good solution.


I see your use case for AND.  It might indeed be useful.  AND suits it well.

Maybe THEN can be the keyword instead of -> for passing the results to
subsequent dictionaries.  They are all reserved keywords.  I guess it
wouldn't be a problem to use them.


I agree with THEN. It is better than using ->, I think, and I suppose it 
wouldn't be a problem either. We would need to fix gram.y and 
implement the logic for OR, AND and THEN.





Of course, if this syntax is implemented, the old syntax with commas
should also be maintained.


Yes, we definitely should.  The comma can be interpreted as either one of
the keywords depending on the left-hand-side dictionary.

I would be glad to review it, if you develop this feature.



Then I will develop it :). But I can probably only do it a few days or weeks 
from now, because I have other tasks with higher priority.


BTW, I already implemented a USING option a few weeks ago: 
https://github.com/select-artur/postgres/tree/join_tsconfig . But of 
course it is not useful now.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FTS Configuration option

2016-10-12 Thread Artur Zakirov
Thank you for sharing your thoughts!

2016-10-12 15:08 GMT+03:00 Emre Hasegeli <e...@hasegeli.com>:
> However then the stemmer doesn't do a good job on those words, because
> the changed characters are important for the language.  What I really
> needed was something like this:
>
>> ALTER TEXT SEARCH CONFIGURATION turkish
>> ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, 
>> hword_part
>> WITH (fix_mistyped_characters AND (turkish_hunspell OR turkish_stem) AND 
>> unaccent);

Your syntax looks more flexible and prettier than the JOIN option. As
I understand it, there are three steps here. At each step a dictionary
returns a lexeme and passes it to the next dictionary. If a dictionary
returns NULL, then processing stops.

With such syntax we also don't need the TSL_FILTER flag for lexemes.
Currently the unaccent extension sets this flag to pass a lexeme on to
the next dictionary, and the flag is consumed by the text-search parser.
It looks like a hard-coded solution: users can't change this behaviour.
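As a hedged sketch (hypothetical names; real dictionaries are invoked through their lexize functions and return TSLexeme arrays, not strings), the step-by-step chaining where a NULL result interrupts processing could look like:

```c
#include <stddef.h>

typedef const char *(*dict_fn)(const char *lexeme);

/* Toy dictionaries for illustration only */
static const char *accept_same(const char *s) { return s; }
static const char *reject_all(const char *s) { (void) s; return NULL; }

/* Run a lexeme through a chain of dictionaries (the AND/-> idea):
 * each dictionary transforms the lexeme and passes it on; a NULL
 * result interrupts the whole chain. */
static const char *
chain_lexize(const char *lexeme, const dict_fn *dicts, size_t ndicts)
{
    for (size_t i = 0; i < ndicts && lexeme != NULL; i++)
        lexeme = dicts[i](lexeme);
    return lexeme;
}
```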

Maybe it is also better to use -> instead of AND? AND would have a different
behaviour. I could create the following configuration:

=> ALTER TEXT SEARCH CONFIGURATION multi_conf
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH (german_ispell AND english_ispell) OR simple;

which will return both german_ispell and english_ispell results. But
I'm not sure that this is a good solution.

Of course, if this syntax is implemented, the old syntax with commas
should also be maintained.

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




[HACKERS] FTS Configuration option

2016-10-10 Thread Artur Zakirov
Hello hackers,

Sometimes there is a need to collect lexemes from various dictionaries.
For example, if we have a column with text in various languages.

Let's say there is a new option, JOIN. This option would allow the parser
to append lexemes from the current dictionary and then go on to the next
dictionary to get more lexemes:

=> CREATE TEXT SEARCH CONFIGURATION multi_conf (COPY=simple);
=> ALTER TEXT SEARCH CONFIGURATION multi_conf
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH german_ispell (JOIN), english_ispell, simple;

There are the following cases:
- a lexeme is found in german_ispell but not in english_ispell:
return the lexeme from german_ispell.
- a lexeme is found in english_ispell but not in german_ispell:
return the lexeme from english_ispell.
- no lexeme is found in either dictionary: return the lexeme from simple.
- lexemes are found in both dictionaries: return the lexemes from both.
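These four cases can be sketched as a small merge routine (a hypothetical simplification: real dictionaries return TSLexeme arrays rather than plain strings, and NULL here stands for "no lexeme found"):

```c
#include <stddef.h>

/* Sketch of the proposed JOIN behaviour: collect lexemes from both
 * ispell dictionaries and fall back to the simple dictionary only
 * when neither recognized the word. out must have room for two
 * pointers; returns the number of lexemes produced. */
static size_t
join_lookup(const char *german, const char *english, const char *simple,
            const char **out)
{
    size_t n = 0;

    if (german != NULL)         /* cases 1 and 4 */
        out[n++] = german;
    if (english != NULL)        /* cases 2 and 4 */
        out[n++] = english;
    if (n == 0)                 /* case 3: fall back to simple */
        out[n++] = simple;
    return n;
}
```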

Could such an option be useful to the community? The name of the option is
debatable.

Thank you!

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company






Re: [HACKERS] [PATCH] Generic type subscription

2016-10-05 Thread Artur Zakirov



+JsonbValue *
+JsonbToJsonbValue(Jsonb *jsonb, JsonbValue *val)
+{


In this function we could palloc a JsonbValue every time and drop the val 
argument. And maybe it is better to copy the whole JsonbContainer from 
jsonb->root using memcpy(), instead of assigning a reference to jsonb->root, 
as is done in JsonbValueToJsonb().


The note about JsonbToJsonbValue() in the JsonbValueToJsonb() comments 
also needs to be removed.



-static JsonbValue *
+JsonbValue *
 setPath(JsonbIterator **it, Datum *path_elems,


Why did you remove the static keyword? setPath() is still static.


-   JsonbValue  v;
+   JsonbValue  v, *new = (JsonbValue *) newval;


Is the declaration of the "new" variable necessary here? I think it is a 
redundant declaration. Also, its name may bring some problems: for example, 
there is a thread where people are trying to port PostgreSQL to C++, and 
new is a reserved word there.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Bug in to_timestamp().

2016-09-29 Thread Artur Zakirov
2016-09-29 13:54 GMT+05:00 amul sul <sula...@gmail.com>:
>
> On Thu, Sep 29, 2016 at 2:48 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> >
> > I started looking at your 0001-to-timestamp-format-checking-v4.patch
> > and this point immediately jumped out at me.  Currently the code relies
> > ... without any documentation ... on no elog being thrown out of
> > parse_format().  That's at the very least trouble waiting to happen.
> > There's a hack to deal with errors from within the NUMDesc_prepare
> > subroutine, but it's a pretty ugly and underdocumented hack.  And what
> > you had here was randomly different from that solution, too.
> >
> > After a bit of thought it seemed to me that a much cleaner fix is to add
> > a "valid" flag to the cache entries, which we can leave clear until we
> > have finished parsing the new format string.  That avoids adding extra
> > data copying as you suggested, removes the need for PG_TRY, and just
> > generally seems cleaner and more bulletproof.
> >
> > I've pushed a patch that does it that way.  The 0001 patch will need
> > to be rebased over that (might just require removal of some hunks,
> > not sure).
> >
> > I also pushed 0002-to-timestamp-validation-v2.patch with some revisions
> > (it'd broken acceptance of BC dates, among other things, but I think
> > I fixed everything).

Thank you for committing the 0002 part of the patch! I wanted to fix the
cache functions too, but wasn't sure about it.
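The "valid" flag approach described in the quoted commit message can be illustrated with a simplified, hypothetical cache (the real code uses DCHCacheEntry and an array of entries; this is only the idea that an entry becomes visible to lookups after parsing fully succeeds):

```c
#include <stdbool.h>
#include <string.h>

/* Simplified single-slot cache: the entry is only returned by later
 * searches once parsing finished without error (valid == true). */
typedef struct CacheEntry
{
    char format[64];
    bool valid;
} CacheEntry;

static CacheEntry cache;        /* zero-initialized: valid == false */

static CacheEntry *
cache_search(const char *fmt)
{
    if (cache.valid && strcmp(cache.format, fmt) == 0)
        return &cache;
    return NULL;
}

static CacheEntry *
cache_getnew(const char *fmt)
{
    cache.valid = false;        /* clear before (re)parsing */
    strncpy(cache.format, fmt, sizeof(cache.format) - 1);
    cache.format[sizeof(cache.format) - 1] = '\0';
    /* ... parse_format() would run here and may throw an error,
     * leaving the entry invisible to cache_search() ... */
    cache.valid = true;         /* set only on full success */
    return &cache;
}
```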

> >
> > Since you told us earlier that you'd be on vacation through the end of
> > September, I'm assuming that nothing more will happen on this patch during
> > this commitfest, so I will mark the CF entry Returned With Feedback.
>
> On behalf of Artur I've rebased the patch and removed the hunk dealing
> with broken cache entries by copying them, which is no longer required
> after commit 83bed06.
>
> Commitfest status left untouched; I am not sure how to deal with a
> "Returned With Feedback" patch. Is there any chance that this can be
> considered again for committer review?

Thank you for fixing the patch!
Today I have internet access and am able to fix and test the
patch. I've looked at your 0001-to-timestamp-format-checking-v5.patch.
It looks good to me. I suppose it is necessary to fix the
is_char_separator() declaration.

from:
static bool is_char_separator(char *str);

to:
static bool is_char_separator(const char *str);

because the *str argument in parse_format() is now const.
I've attached a new version of the patch, which fixes the
is_char_separator() declaration too.

Sorry for the confusion!

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


0001-to-timestamp-format-checking-v6.patch
Description: Binary data



Re: [HACKERS] Bug in to_timestamp().

2016-09-16 Thread Artur Zakirov

On 25.08.2016 13:26, amul sul wrote:

Thanks. I've created the entry in https://commitfest.postgresql.org/10/713/
. You can add yourself as a reviewer.



Done, added myself as reviewer & changed status to "Ready for
Committer". Thanks !

Regards,
Amul Sul




It seems there is no need to rebase the patches. But I will not be able to 
fix them for two weeks because of travel, should anyone have a note 
or remark.

So it would be nice if someone else could fix possible issues, for the 
above reason.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in to_timestamp().

2016-08-25 Thread Artur Zakirov

You are right. I assigned NODE_TYPE_SPACE to prev_type to be able to
execute such a query:

SELECT to_timestamp('---2000JUN', ' MON');

Would that be proper behaviour?



Looks good to me; no one will complain if something works on PG but not on 
Oracle.


Thanks. I've created the entry in 
https://commitfest.postgresql.org/10/713/ . You can add yourself as a 
reviewer.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Bug in to_timestamp().

2016-08-25 Thread Artur Zakirov

Hi,


#1.
Whitespace @ line # 317.


Sorry, fixed.


#2. Warning at compilation;

formatting.c: In function ‘do_to_timestamp’:
formatting.c:3049:37: warning: ‘prev_type’ may be used uninitialized in this 
function [-Wmaybe-uninitialized]
if (prev_type == NODE_TYPE_SPACE || prev_type == NODE_TYPE_SEPARATOR)
^
formatting.c:2988:5: note: ‘prev_type’ was declared here
prev_type;
^

You can avoid this by assigning zero (or introducing NODE_TYPE_INVAL) to 
prev_type at the following line:

256 +   prev_type;


You are right. I assigned NODE_TYPE_SPACE to prev_type to be able to 
execute such a query:

SELECT to_timestamp('---2000JUN', ' MON');

Would that be proper behaviour?

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5c1c4f6..36d8b3e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -6146,9 +6146,12 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
  
   
to_timestamp and to_date
-   skip multiple blank spaces in the input string unless the
+   skip multiple blank spaces and printable non letter and non digit
+   characters in the input string and in the formatting string unless the
FX option is used. For example,
-   to_timestamp('2000JUN', ' MON') works, but
+   to_timestamp('2000JUN', ' MON'),
+   to_timestamp('2000JUN', ' MON')
+   and to_timestamp('2000JUN', 'MON') work, but
to_timestamp('2000JUN', 'FX MON') returns an error
because to_timestamp expects one space only.
FX must be specified as the first item in
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index bbd97dc..7430013 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -169,6 +169,8 @@ struct FormatNode
 #define NODE_TYPE_END		1
 #define NODE_TYPE_ACTION	2
 #define NODE_TYPE_CHAR		3
+#define NODE_TYPE_SEPARATOR	4
+#define NODE_TYPE_SPACE		5
 
 #define SUFFTYPE_PREFIX		1
 #define SUFFTYPE_POSTFIX	2
@@ -947,6 +949,7 @@ typedef struct NUMProc
 static const KeyWord *index_seq_search(char *str, const KeyWord *kw,
  const int *index);
 static const KeySuffix *suff_search(char *str, const KeySuffix *suf, int type);
+static bool is_char_separator(char *str);
 static void NUMDesc_prepare(NUMDesc *num, FormatNode *n);
 static void parse_format(FormatNode *node, char *str, const KeyWord *kw,
 			 const KeySuffix *suf, const int *index, int ver, NUMDesc *Num);
@@ -963,7 +966,6 @@ static void dump_node(FormatNode *node, int max);
 static const char *get_th(char *num, int type);
 static char *str_numth(char *dest, char *num, int type);
 static int	adjust_partial_year_to_2020(int year);
-static int	strspace_len(char *str);
 static void from_char_set_mode(TmFromChar *tmfc, const FromCharDateMode mode);
 static void from_char_set_int(int *dest, const int value, const FormatNode *node);
 static int	from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node);
@@ -1036,6 +1038,17 @@ suff_search(char *str, const KeySuffix *suf, int type)
 	return NULL;
 }
 
+static bool
+is_char_separator(char *str)
+{
+	return ((pg_mblen(str) == 1) &&
+			/* printable character, but not letter and digit */
+			((*str >= '!' && *str <= '/') ||
+			 (*str >= ':' && *str <= '@') ||
+			 (*str >= '[' && *str <= '`') ||
+			 (*str >= '{' && *str <= '~')));
+}
+
 /* --
  * Prepare NUMDesc (number description struct) via FormatNode struct
  * --
@@ -1237,9 +1250,10 @@ parse_format(FormatNode *node, char *str, const KeyWord *kw,
 {
 	const KeySuffix *s;
 	FormatNode *n;
+	bool		in_text = false,
+in_backslash = false;
 	int			node_set = 0,
-suffix,
-last = 0;
+suffix;
 
 #ifdef DEBUG_TO_FROM_CHAR
 	elog(DEBUG_elog_output, "to_char/number(): run parser");
@@ -1251,6 +1265,55 @@ parse_format(FormatNode *node, char *str, const KeyWord *kw,
 	{
 		suffix = 0;
 
+		/* Previous character was a backslash */
+		if (in_backslash)
+		{
+			/* After backslash should go non-space character */
+			if (isspace(*str))
+ereport(ERROR,
+		(errcode(ERRCODE_SYNTAX_ERROR),
+		 errmsg("invalid escape sequence")));
+			in_backslash = false;
+
+			n->type = NODE_TYPE_CHAR;
+			n->character = *str;
+			n->key = NULL;
+			n->suffix = 0;
+			n++;
+			str++;
+			continue;
+		}
+		/* Previous character was a quote */
+		else if (in_text)
+		{
+			if (*str == '"')
+			{
+str++;
+in_text = false;
+			}
+			else if (*str == '\\')
+			{
+str++;
+in_backslash = true;
+			}
+			else
+			{
+if (ver == DCH_TYPE && is_char_separator(str))
+	n->type = NODE_TYPE_SEPARATOR;
+else if (isspace(*str))
+	n->type = NODE_TYPE_SPACE;
+else
+	n->type = NODE_T

Re: [HACKERS] Bug in to_timestamp().

2016-08-24 Thread Artur Zakirov
Sorry, I did not get the last two mails from Amul; I don't know why. So I am 
replying to another mail in the thread.



Documented as a working case, but unfortunately it does not work:

postgres=# SELECT to_timestamp('2000JUN', ' MON');
ERROR:  invalid value "---" for "MON"
DETAIL:  The given value did not match any of the allowed values for this field.


Indeed! I fixed it; now this query executes without an error. I added this 
case to the tests.



The NODE_TYPE_CHAR assumption in the else block seems incorrect. What if we 
have a space after the double quote?  It will again produce incorrect output 
without any error; see below:

postgres=# SELECT to_timestamp('Year: 1976, Month: May, Day: 16',
postgres(# '"Year:" , "Month:" FMMonth, "Day:"   DD');
to_timestamp
--
0006-05-16 00:00:00-07:52:58
(1 row)

I guess we might need NODE_TYPE_SEPARATOR and NODE_TYPE_SPACE checks as well?


Agreed. Fixed and added to the tests.


Unnecessary hunk?
We should not touch any code unless it is necessary to implement the proposed 
feature; otherwise it adds an unnecessary diff to the patch and eventually an 
extra burden on reviewing the code. Similarly, the hunk in the patch @ lines 
# 313 - 410 has nothing to do with the to_timestamp behaviour improvement, IIUC.

If you think these changes need to be in, please submit a separate cleanup patch.


Fixed it; it was a typo.
As for lines # 313 - 410: they are necessary to avoid bugs. I wrote about it 
in 
https://www.postgresql.org/message-id/b2a39359-3282-b402-f4a3-057aae500ee7%40postgrespro.ru 
:



- now DCH_cache_getnew() is called after parse_format(). Because now
parse_format() can raise an error and in the next attempt
DCH_cache_search() could return broken cache entry.


For example, you can test incorrect inputs to to_timestamp(), trying to 
execute such a query several times.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 6355300..0fe50e1 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -6146,9 +6146,12 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
  
   
to_timestamp and to_date
-   skip multiple blank spaces in the input string unless the
+   skip multiple blank spaces and printable non letter and non digit
+   characters in the input string and in the formatting string unless the
FX option is used. For example,
-   to_timestamp('2000JUN', ' MON') works, but
+   to_timestamp('2000JUN', ' MON'),
+   to_timestamp('2000JUN', ' MON')
+   and to_timestamp('2000JUN', 'MON') work, but
to_timestamp('2000JUN', 'FX MON') returns an error
because to_timestamp expects one space only.
FX must be specified as the first item in
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index bbd97dc..a3dbcaf 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -169,6 +169,8 @@ struct FormatNode
 #define NODE_TYPE_END		1
 #define NODE_TYPE_ACTION	2
 #define NODE_TYPE_CHAR		3
+#define NODE_TYPE_SEPARATOR	4
+#define NODE_TYPE_SPACE		5
 
 #define SUFFTYPE_PREFIX		1
 #define SUFFTYPE_POSTFIX	2
@@ -947,6 +949,7 @@ typedef struct NUMProc
 static const KeyWord *index_seq_search(char *str, const KeyWord *kw,
  const int *index);
 static const KeySuffix *suff_search(char *str, const KeySuffix *suf, int type);
+static bool is_char_separator(char *str);
 static void NUMDesc_prepare(NUMDesc *num, FormatNode *n);
 static void parse_format(FormatNode *node, char *str, const KeyWord *kw,
 			 const KeySuffix *suf, const int *index, int ver, NUMDesc *Num);
@@ -963,7 +966,6 @@ static void dump_node(FormatNode *node, int max);
 static const char *get_th(char *num, int type);
 static char *str_numth(char *dest, char *num, int type);
 static int	adjust_partial_year_to_2020(int year);
-static int	strspace_len(char *str);
 static void from_char_set_mode(TmFromChar *tmfc, const FromCharDateMode mode);
 static void from_char_set_int(int *dest, const int value, const FormatNode *node);
 static int	from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node);
@@ -1036,6 +1038,17 @@ suff_search(char *str, const KeySuffix *suf, int type)
 	return NULL;
 }
 
+static bool
+is_char_separator(char *str)
+{
+	return ((pg_mblen(str) == 1) &&
+			/* printable character, but not letter and digit */
+			((*str >= '!' && *str <= '/') ||
+			 (*str >= ':' && *str <= '@') ||
+			 (*str >= '[' && *str <= '`') ||
+			 (*str >= '{' && *str <= '~')));
+}
+
 /* --
  * Prepare NUMDesc (number description struct) via FormatNode struct
  * --
@@ -1237,9 +1250,10 @@ parse_format(FormatNode *node, char *str, const KeyWord *kw,
 {
 	const KeySuffix *s;
 	FormatNode *n;
+	bool		in_text = false,
+in_bac

Re: [HACKERS] Bug in to_timestamp().

2016-08-17 Thread Artur Zakirov
I've attached a new patch, "0001-to-timestamp-format-checking-v2.patch". It 
fixes the behaviour for Amul's scenarios:



Following are a few scenarios where we break existing behaviour:

SELECT TO_TIMESTAMP('2015-12-31 13:43:36', ' MM DD HH24 MI SS');
SELECT TO_TIMESTAMP('2011$03!18 23_38_15', '-MM-DD HH24:MI:SS');
SELECT TO_TIMESTAMP('2011*03*18 23^38&15', '-MM-DD HH24:MI:SS');
SELECT TO_TIMESTAMP('2011*03!18 #%23^38$15', '-MM-DD$$$HH24:MI:SS');

But the current patch behaviour is not that bad either; at least we get 
errors. I am not sure about community acceptance, though.

I would like to draw the community's attention to the following case:
SELECT TO_TIMESTAMP('2013--10-01', '-MM-DD');


For the queries above, the patch gives output without any error.


Another question is, shouldn’t we have an error in the following cases?
SELECT TO_TIMESTAMP('2016-06-13 99:99:99', '-MM-DD HH24:MI:SS');
SELECT TO_TIMESTAMP('2016-02-30 15:43:36', '-MM-DD HH24:MI:SS');


I attached a second patch, "0002-to-timestamp-validation-v2.patch". With it, 
PostgreSQL performs additional checks on the date and time. But as I wrote, 
there is another patch in the "to_date_valid()" thread which differs from 
this one.


Sincerely,
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7830334..559c55f 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -6083,9 +6083,12 @@ SELECT regexp_matches('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
  
   
to_timestamp and to_date
-   skip multiple blank spaces in the input string unless the
+   skip multiple blank spaces and printable non letter and non digit
+   characters in the input string and in the formatting string unless the
FX option is used. For example,
-   to_timestamp('2000JUN', ' MON') works, but
+   to_timestamp('2000JUN', ' MON'),
+   to_timestamp('2000JUN', ' MON')
+   and to_timestamp('2000JUN', 'MON') work, but
to_timestamp('2000JUN', 'FX MON') returns an error
because to_timestamp expects one space only.
FX must be specified as the first item in
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index bbd97dc..b14678d 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -169,6 +169,8 @@ struct FormatNode
 #define NODE_TYPE_END		1
 #define NODE_TYPE_ACTION	2
 #define NODE_TYPE_CHAR		3
+#define NODE_TYPE_SEPARATOR	4
+#define NODE_TYPE_SPACE		5
 
 #define SUFFTYPE_PREFIX		1
 #define SUFFTYPE_POSTFIX	2
@@ -947,6 +949,7 @@ typedef struct NUMProc
 static const KeyWord *index_seq_search(char *str, const KeyWord *kw,
  const int *index);
 static const KeySuffix *suff_search(char *str, const KeySuffix *suf, int type);
+static bool is_char_separator(char *str);
 static void NUMDesc_prepare(NUMDesc *num, FormatNode *n);
 static void parse_format(FormatNode *node, char *str, const KeyWord *kw,
 			 const KeySuffix *suf, const int *index, int ver, NUMDesc *Num);
@@ -963,7 +966,6 @@ static void dump_node(FormatNode *node, int max);
 static const char *get_th(char *num, int type);
 static char *str_numth(char *dest, char *num, int type);
 static int	adjust_partial_year_to_2020(int year);
-static int	strspace_len(char *str);
 static void from_char_set_mode(TmFromChar *tmfc, const FromCharDateMode mode);
 static void from_char_set_int(int *dest, const int value, const FormatNode *node);
 static int	from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node);
@@ -1036,6 +1038,17 @@ suff_search(char *str, const KeySuffix *suf, int type)
 	return NULL;
 }
 
+static bool
+is_char_separator(char *str)
+{
+	return ((pg_mblen(str) == 1) &&
+			/* printable character, but not letter and digit */
+			((*str >= '!' && *str <= '/') ||
+			 (*str >= ':' && *str <= '@') ||
+			 (*str >= '[' && *str <= '`') ||
+			 (*str >= '{' && *str <= '~')));
+}
+
 /* --
  * Prepare NUMDesc (number description struct) via FormatNode struct
  * --
@@ -1237,9 +1250,10 @@ parse_format(FormatNode *node, char *str, const KeyWord *kw,
 {
 	const KeySuffix *s;
 	FormatNode *n;
+	bool		in_text = false,
+in_backslash = false;
 	int			node_set = 0,
-suffix,
-last = 0;
+suffix;
 
 #ifdef DEBUG_TO_FROM_CHAR
 	elog(DEBUG_elog_output, "to_char/number(): run parser");
@@ -1251,6 +1265,49 @@ parse_format(FormatNode *node, char *str, const KeyWord *kw,
 	{
 		suffix = 0;
 
+		/* Previous character was a backslash */
+		if (in_backslash)
+		{
+			/* After backslash should go non-space character */
+			if (isspace(*str))
+ereport(ERROR,
+		(errcode(ERRCODE_SYNTAX_ERROR),
+		 errmsg("invalid escape sequence")));
+			in_backslash = false;
+
+			n->type = NODE_TYPE_CHAR;
+			n

Re: [HACKERS] Bug in to_timestamp().

2016-08-16 Thread Artur Zakirov

On 15.08.2016 19:28, Robert Haas wrote:


Well, what's the Oracle behavior in any of these cases?  I don't think
we can decide to change any of this behavior without knowing that.  If
a proposed behavior change is incompatible with our previous releases,
I think it'd better at least be more compatible with Oracle.
Otherwise, we're just changing from an established behavior that we
invented ourselves to a new behavior we invented ourselves, which is
only worthwhile if it's absolutely clear that the new behavior is way
better.



1 - Oracle's output for the first queries is:

-> SELECT TO_TIMESTAMP('2015-12-31 13:43:36', ' MM DD HH24 MI SS') 
FROM dual;


TO_TIMESTAMP('2015-12-3113:43:36','MMDDHH24MISS')
---
31-DEC-15 01.43.36.0 PM

-> SELECT TO_TIMESTAMP('2011$03!18 23_38_15', '-MM-DD HH24:MI:SS') 
FROM dual;


TO_TIMESTAMP('2011$03!1823_38_15','-MM-DDHH24:MI:SS')
---
18-MAR-11 11.38.15.0 PM

-> SELECT TO_TIMESTAMP('2011*03!18 #%23^38$15', 
'-MM-DD$$$HH24:MI:SS') FROM dual;


TO_TIMESTAMP('2011*03!18#%23^38$15','-MM-DD$$$HH24:MI:SS')
---
18-MAR-11 11.38.15.0 PM

PostgreSQL with the patch gives "ERROR:  expected space character in 
given string". I will fix this.



2 - Oracle's output for the query with the hyphen is:

-> SELECT TO_TIMESTAMP('2013--10-01', '-MM-DD') FROM dual;
SELECT TO_TIMESTAMP('2013--10-01', '-MM-DD') FROM dual
*
ERROR at line 1:
ORA-01843: not a valid month

Here PostgreSQL with the patch does not give an error, so I will fix 
this too.



3 - The last two queries give an error. This patch intentionally does not 
handle such queries, because there is the thread 
https://www.postgresql.org/message-id/57786490.9010201%40wars-nicht.de . 
That thread concerns the to_date() function, but it should concern 
to_timestamp() as well. So I suppose there should be a separate patch for 
this last case.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] to_date_valid()

2016-08-15 Thread Artur Zakirov

On 15.08.2016 14:33, Andreas 'ads' Scherbaum wrote:

Is transforming back and comparing really the right and "true" way to
validate the date?

Maybe validate the date using ValidateDate(). A sample patch is attached.


This does not solve the problem at hand, and lets wrong dates/formats
slip through:

./buildclient.py -v -c demo-config-pg.yaml --run-configure --run-make
--run-install --no-clean-at-all --patch
'https://www.postgresql.org/message-id/95738e12-6ed6-daf5-9dcf-6336072e6b15%40postgrespro.ru'



postgres=# SELECT to_date('2011 12  18', ' MM   DD');
  to_date
------------
 2011-12-08
(1 row)


That is from the regression tests, and obviously handles the date
transformation wrong. My attempt catches this, because I compare the
date with the input date, and do not rely on a valid date only.


I suppose that your sample query is a different issue, not a date-validation 
task. I sent a patch to the thread 
https://www.postgresql.org/message-id/b2a39359-3282-b402-f4a3-057aae500...@postgrespro.ru 
; it fixes formatting issues.


I thought it is better to split these into two separate issues:
- validation of the input date/timestamp string and the input format string
- validation of the resulting date/timestamp
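A minimal sketch of the second item, result-date validation in the spirit of ValidateDate() (not the actual backend code, which works with field bit masks and struct pg_tm), rejecting out-of-range months and days such as February 30:

```c
#include <stdbool.h>

/* Validate a Gregorian calendar date: month must be 1..12 and day
 * must fit the month, with February 29 allowed only in leap years. */
static bool
date_is_valid(int year, int month, int day)
{
    static const int days[] = {31, 28, 31, 30, 31, 30,
                               31, 31, 30, 31, 30, 31};
    int maxday;

    if (month < 1 || month > 12 || day < 1)
        return false;
    maxday = days[month - 1];
    if (month == 2 &&
        ((year % 4 == 0 && year % 100 != 0) || year % 400 == 0))
        maxday = 29;            /* leap year */
    return day <= maxday;
}
```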

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] to_date_valid()

2016-08-15 Thread Artur Zakirov

On 14.08.2016 01:52, Andreas 'ads' Scherbaum wrote:


Attached is a patch to "do the right thing". The verification is in
"to_date()" now, the extra function is removed. Regression tests are
updated - two or three of them returned a wrong date before, and still
passed. They fail now. Documentation is also updated.


Regards,



Is transforming back and comparing really the right and "true" way to 
validate the date?

Maybe validate the date using ValidateDate(). A sample patch is attached.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index bbd97dc..c0048c9 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -3563,7 +3563,9 @@ do_to_timestamp(text *date_txt, text *fmt,
 {
 	FormatNode *format;
 	TmFromChar	tmfc;
-	int			fmt_len;
+	int			fmt_len,
+fmask = 0;		/* Bit mask for ValidateDate() */
+	char	   *date_str = NULL;
 
 	ZERO_tmfc();
 	ZERO_tm(tm);
@@ -3574,7 +3576,6 @@ do_to_timestamp(text *date_txt, text *fmt,
 	if (fmt_len)
 	{
 		char	   *fmt_str;
-		char	   *date_str;
 		bool		incache;
 
 		fmt_str = text_to_cstring(fmt);
@@ -3630,7 +3631,6 @@ do_to_timestamp(text *date_txt, text *fmt,
 
 		DCH_from_char(format, date_str, );
 
-		pfree(date_str);
 		pfree(fmt_str);
 		if (!incache)
 			pfree(format);
@@ -3706,6 +3706,7 @@ do_to_timestamp(text *date_txt, text *fmt,
 			if (tmfc.bc && tm->tm_year > 0)
 tm->tm_year = -(tm->tm_year - 1);
 		}
+		fmask |= DTK_M(YEAR);
 	}
 	else if (tmfc.cc)			/* use first year of century */
 	{
@@ -3717,10 +3718,14 @@ do_to_timestamp(text *date_txt, text *fmt,
 		else
 			/* +1 because year == 599 is 600 BC */
 			tm->tm_year = tmfc.cc * 100 + 1;
+		fmask |= DTK_M(YEAR);
 	}
 
 	if (tmfc.j)
+	{
 		j2date(tmfc.j, >tm_year, >tm_mon, >tm_mday);
+		fmask |= DTK_DATE_M;
+	}
 
 	if (tmfc.ww)
 	{
@@ -3734,6 +3739,7 @@ do_to_timestamp(text *date_txt, text *fmt,
 isoweekdate2date(tmfc.ww, tmfc.d, >tm_year, >tm_mon, >tm_mday);
 			else
 isoweek2date(tmfc.ww, >tm_year, >tm_mon, >tm_mday);
+			fmask |= DTK_DATE_M;
 		}
 		else
 			tmfc.ddd = (tmfc.ww - 1) * 7 + 1;
@@ -3744,11 +3750,17 @@ do_to_timestamp(text *date_txt, text *fmt,
 	if (tmfc.d)
 		tm->tm_wday = tmfc.d - 1;		/* convert to native numbering */
 	if (tmfc.dd)
+	{
 		tm->tm_mday = tmfc.dd;
+		fmask |= DTK_M(DAY);
+	}
 	if (tmfc.ddd)
 		tm->tm_yday = tmfc.ddd;
 	if (tmfc.mm)
+	{
 		tm->tm_mon = tmfc.mm;
+		fmask |= DTK_M(MONTH);
+	}
 
 	if (tmfc.ddd && (tm->tm_mon <= 1 || tm->tm_mday <= 1))
 	{
@@ -3771,6 +3783,7 @@ do_to_timestamp(text *date_txt, text *fmt,
 			j0 = isoweek2j(tm->tm_year, 1) - 1;
 
 			j2date(j0 + tmfc.ddd, >tm_year, >tm_mon, >tm_mday);
+			fmask |= DTK_DATE_M;
 		}
 		else
 		{
@@ -3793,9 +3806,24 @@ do_to_timestamp(text *date_txt, text *fmt,
 
 			if (tm->tm_mday <= 1)
 tm->tm_mday = tmfc.ddd - y[i - 1];
+
+			fmask |= DTK_M(MONTH) | DTK_M(DAY);
 		}
 	}
 
+	/* Validate date with bit mask received above */
+	if (fmask != 0 && date_str)
+	{
+		if (ValidateDate(fmask, false, false, false, tm) != 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_DATETIME_FIELD_OVERFLOW),
+	 errmsg("date/time field value out of range: \"%s\"",
+			date_str)));
+	}
+
+	if (date_str)
+		pfree(date_str);
+
 #ifdef HAVE_INT64_TIMESTAMP
 	if (tmfc.ms)
 		*fsec += tmfc.ms * 1000;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in to_timestamp().

2016-08-11 Thread Artur Zakirov

Hello,

On 14.07.2016 12:16, Pavel Stehule wrote:


last point was discussed in thread related to to_date_valid function.

Regards

Pavel


Thank you.

Here is my patch. It is a proof of concept.

Date/Time Formatting


There are changes in date/time formatting rules:

- now to_timestamp() and to_date() skip spaces in the input string and 
in the formatting string unless the FX option is used, as Amul Sul wrote 
in the first message of this thread. But Ex. 2 now gives an error with 
this patch (should we fix this too?).


- in the code, space characters and separator characters now have 
different FormatNode types. The separator characters are ',', '-', '.', 
'/' and ':'. This is done to apply different formatting rules to space 
and separator characters.
If the FX option isn't used, PostgreSQL does not insist that a separator 
in the formatting string match the separator in the input string. But 
the number of separators must be equal with or without the FX option.
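
To make the separator rule concrete, here is a rough Python sketch of 
the matching logic described above (the real patch is C code in 
formatting.c; this is only an illustration, not the patch's code):

```python
# Separator characters recognized by the proposed patch.
SEPARATORS = set(",-./:")

def match_separators(fmt, inp, fx=False):
    """Return True if the separators in `inp` are acceptable for `fmt`.

    Without FX, a separator in the format consumes one input character,
    but it need not be the same separator.  With FX, the separator
    characters must match exactly.  In both modes the *count* of
    separators must agree.
    """
    fmt_seps = [c for c in fmt if c in SEPARATORS]
    inp_seps = [c for c in inp if c in SEPARATORS]
    if len(fmt_seps) != len(inp_seps):
        return False
    if fx:
        return fmt_seps == inp_seps
    return True
```

For example, under these rules 'YYYY-MM-DD' would accept '2000/06/01' 
without FX but reject it with FX.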


- now PostgreSQL checks whether there is a closing quote; otherwise an 
error is raised.


PostgreSQL still does not insist that a text character in the formatting 
string match the text character in the input string. It is not obvious 
whether this should be fixed, because the characters may differ in case 
or in accent marks.
But I suppose it is not right to check only the text character count. 
For example, there is a Unicode variant of the space character, U+00A0.


Code changes


- new defines:

#define NODE_TYPE_SEPARATOR 4
#define NODE_TYPE_SPACE 5

- DCH_cache_getnew() is now called after parse_format(), because 
parse_format() can now raise an error, and on the next attempt 
DCH_cache_search() could return a broken cache entry.



This patch does not handle all of the issues noticed in this thread, 
since there is still no consensus about them. So this patch is in 
proof-of-concept status and can be changed.


Of course this patch can be completely wrong. But it tries to introduce 
more formal rules for formatting.


I will be grateful for notes and remarks.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7830334..5656245 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -6083,9 +6083,10 @@ SELECT regexp_matches('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
  
   
to_timestamp and to_date
-   skip multiple blank spaces in the input string unless the
-   FX option is used. For example,
-   to_timestamp('2000    JUN', 'YYYY MON') works, but
+   skip multiple blank spaces in the input string and in the formatting
+   string unless the FX option is used. For example,
+   to_timestamp('2000    JUN', 'YYYY MON')
+   and to_timestamp('2000JUN', 'YYYY MON') work, but
   to_timestamp('2000    JUN', 'FXYYYY MON') returns an error
because to_timestamp expects one space only.
FX must be specified as the first item in
@@ -6121,6 +6122,19 @@ SELECT regexp_matches('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
skips two input characters.
   
  
+ 
+ 
+  
+   In to_date and to_timestamp separator
+   characters ',', '-',
+   '.', '/' and ':'
+   outside of double-quoted strings skip the number of input characters
+   contained in the string unless the FX option is used.
+   If FX option is specified then consumed separator
+   character in the formatting string must match the separator character
+   in the input string.
+  
+ 
 
  
   
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index bbd97dc..ef39df9 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -169,6 +169,8 @@ struct FormatNode
 #define NODE_TYPE_END		1
 #define NODE_TYPE_ACTION	2
 #define NODE_TYPE_CHAR		3
+#define NODE_TYPE_SEPARATOR	4
+#define NODE_TYPE_SPACE		5
 
 #define SUFFTYPE_PREFIX		1
 #define SUFFTYPE_POSTFIX	2
@@ -947,6 +949,7 @@ typedef struct NUMProc
 static const KeyWord *index_seq_search(char *str, const KeyWord *kw,
  const int *index);
 static const KeySuffix *suff_search(char *str, const KeySuffix *suf, int type);
+static bool is_char_separator(char *str);
 static void NUMDesc_prepare(NUMDesc *num, FormatNode *n);
 static void parse_format(FormatNode *node, char *str, const KeyWord *kw,
 			 const KeySuffix *suf, const int *index, int ver, NUMDesc *Num);
@@ -963,7 +966,6 @@ static void dump_node(FormatNode *node, int max);
 static const char *get_th(char *num, int type);
 static char *str_numth(char *dest, char *num, int type);
 static int	adjust_partial_year_to_2020(int year);
-static int	strspace_len(char *str);
 static void from_char_set_mode(TmFromChar *tmfc, const FromCharDateMode mode);
 static void from_char_set_int(int *dest, const int value, const FormatNode *node);
 static int

[HACKERS] pg_upgrade: exit_hook_registered variable

2016-07-28 Thread Artur Zakirov

Hello hackers,

I noticed that the exit_hook_registered variable in start_postmaster() 
is a local variable. Shouldn't it be static?


I attached a patch. Thank you!
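
For illustration, the bug can be sketched in Python (the real code is C 
in src/bin/pg_upgrade/server.c; module-level state plays the role of a 
C static variable here, and the names are only loosely modeled on it):

```python
registered_hooks = []

# Module-level flag: survives across calls, like a C `static` (the fix).
_exit_hook_registered = False

def start_postmaster():
    global _exit_hook_registered
    # Buggy variant: a local `exit_hook_registered = False` here would be
    # re-initialized on every call, so the condition below would always be
    # true and a duplicate exit hook would be registered on each restart.
    if not _exit_hook_registered:
        registered_hooks.append("stop_postmaster_atexit")
        _exit_hook_registered = True
```

With the flag at module (file) scope, calling start_postmaster() twice 
registers the hook only once.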

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/bin/pg_upgrade/server.c b/src/bin/pg_upgrade/server.c
index 969e5d6..a13800f 100644
--- a/src/bin/pg_upgrade/server.c
+++ b/src/bin/pg_upgrade/server.c
@@ -14,6 +14,8 @@
 
 static PGconn *get_db_conn(ClusterInfo *cluster, const char *db_name);
 
+static bool exit_hook_registered = false;
+
 
 /*
  * connectToServer()
@@ -174,7 +176,6 @@ start_postmaster(ClusterInfo *cluster, bool throw_error)
 {
 	char		cmd[MAXPGPATH * 4 + 1000];
 	PGconn	   *conn;
-	bool		exit_hook_registered = false;
 	bool		pg_ctl_return = false;
 	char		socket_string[MAXPGPATH + 200];
 



Re: [HACKERS] ispell/hunspell imprecision in error message

2016-07-26 Thread Artur Zakirov

On 25.07.2016 19:53, Peter Eisentraut wrote:

I was trying to look up the background of this error message:

"Ispell dictionary supports only \"default\", \"long\", and \"num\" flag
value"

(src/backend/tsearch/spell.c)

But I found that this actually talks about Hunspell format dictionaries.
 (So the man page is hunspell(5) as opposed to ispell(5).)  While as far
as the tsearch interfaces are concerned, those two are lumped together,
should we not change the error message to refer to Hunspell?



Hello,

As I understand it, this error message refers to the Ispell dictionary 
template. This template can handle various dictionary formats, which is 
confusing. Maybe it would be better to change the dictionary template 
name, but that could break backward compatibility...


If we want to change the error message, maybe change it to the following?

"Hunspell dictionary format supports only \"default\", \"long\", and 
\"num\" flag value"


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Bug in to_timestamp().

2016-07-14 Thread Artur Zakirov

On 23.06.2016 21:02, Tom Lane wrote:

Robert Haas <robertmh...@gmail.com> writes:

On Thu, Jun 23, 2016 at 1:40 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

At the very least I'd want to see a thought-through proposal that
addresses all three of these interrelated points:

* what should a space in the format match
* what should a non-space, non-format-code character in the format match
* how should we handle fields that are not exactly the width suggested
by the format



I'm not averse to some further study of those issues, and I think the
first two are closely related.  The third one strikes me as a somewhat
separate consideration that doesn't need to be addressed by the same
patch.


If you think those issues are not interrelated, you have not thought
about it carefully enough.

As an example, what we can do to handle not-expected-width fields is
very different if the format is "DDMMYY" versus if it is "DD-MM-YY".
In the first case we have little choice but to believe that each
field is exactly two digits wide.  In the second case, depending on
how we decide to define matching of "-", we might be able to allow
the field widths to vary so that they're effectively "whatever is
between the dashes".  But that would require insisting that "-"
match a "-", or at least a non-alphanumeric, which is not how it
behaves today.

I don't want to twiddle these behaviors in 9.6 and then again next year.

regards, tom lane




Hi,

I want to start work on this patch.

As a conclusion:
- need a decision about three questions:



* what should a space in the format match
* what should a non-space, non-format-code character in the format match
* how should we handle fields that are not exactly the width suggested
by the format


- nobody wants solve this issue in 9.6.

And I have a question: what about wrong input in the date argument? For 
example, from Alex's message:



postgres=# SELECT TO_TIMESTAMP('2016-02-30 15:43:36', 'YYYY-MM-DD
HH24:MI:SS');
      to_timestamp
------------------------
 2016-03-01 15:43:36+03
(1 row)


Here '2016-02-30' is a wrong date. I didn't see any conclusion about 
this case in the thread.
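
The missing check can be sketched like this (a Python illustration of 
the kind of field validation later added via ValidateDate(); the 
function name here is mine, not PostgreSQL's):

```python
import calendar

def is_valid_date(year, month, day):
    """Reject out-of-range field values instead of letting them wrap.

    With such a check, a date like 2016-02-30 raises an error instead of
    silently becoming 2016-03-01.
    """
    if not 1 <= month <= 12:
        return False
    # monthrange() returns (weekday of first day, number of days in month)
    return 1 <= day <= calendar.monthrange(year, month)[1]
```

For instance, 2016-02-29 is accepted (leap year) while 2016-02-30 and 
2015-02-29 are rejected.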


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




[HACKERS] Patch: fix typo, duplicated word in indexam.sgml

2016-04-04 Thread Artur Zakirov

Hello,

There is a duplicated word in indexam.sgml. The patch is attached.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index b36889b..69edeea 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -58,7 +58,7 @@
   
 
   
-   Index access access methods can be defined and dropped using
+   Index access methods can be defined and dropped using
 and
  SQL commands respectively.
   



Re: [HACKERS] unexpected result from to_tsvector

2016-03-30 Thread Artur Zakirov

On 29.03.2016 19:17, Shulgin, Oleksandr wrote:


Hm, indeed.   Unfortunately, it is not quite easy to find "the" new RFC,
there was quite a number of correcting and extending RFCs issued over
the last (almost) 30 years, which is not that surprising...

Are we going to do something about it?  Is it likely that
relaxing/changing the rules on our side will break any possible
workarounds that people might have employed to make the search work like
they want it to work?


Do you mean workarounds to recognize values such as 't...@123-reg.ro' 
as an email address? Actually I do not see any workaround except a patch 
to PostgreSQL.


By the way, Teodor committed the patch yesterday.



--
Alex




--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] IF (NOT) EXISTS in psql-completion

2016-03-29 Thread Artur Zakirov

On 29.03.2016 10:59, Pavel Stehule wrote:

Hi

2016-03-29 8:43 GMT+02:00 Kyotaro HORIGUCHI
<horiguchi.kyot...@lab.ntt.co.jp <mailto:horiguchi.kyot...@lab.ntt.co.jp>>:

Thank you Pavel, David.

Thank you for pointing syntaxes to be addressed. Most of the are
addressed in the attached patch.


At Tue, 22 Mar 2016 12:57:27 -0400, David Steele
<da...@pgmasters.net <mailto:da...@pgmasters.net>> wrote in
<56f17977.8040...@pgmasters.net <mailto:56f17977.8040...@pgmasters.net>>
> Hi Kyotaro,
>
> On 3/18/16 3:22 AM, Pavel Stehule wrote:
>
> > I am looking this patch. It looks well, but this feature doesn't
> > respect upper or lower chars. It enforce upper chars. This is not
> > consistent with any other autocomplete.

As mentioned before, upper-lower problem is an existing
issue. The case of the words in a query result list cannot be
edited since it may contain words that should not be changed,
such as relation names. So we can address it only before issueing
a query but I haven't found simple way to do it.


This is unpleasant. I am sorry. I had very uncomfortable feeling from
this behave. I am thinking so it should be solvable - you have to
convert only keyword IF EXISTS or IF NOT EXISTS. Maybe there are not
trivial solution, but this should be fixed.



Hello,

Can we do something like the attached patch? It should be applied on 
top of the patch 
"0001-Suggest-IF-NOT-EXISTS-for-tab-completion-of-psql_v3.patch".


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 73c5601..ed4ff09 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -153,6 +153,7 @@ do { \
 do { \
 	completion_squery = &(query); \
 	completion_charp = addon; \
+	completion_case_sensitive = false; \
 	matches = completion_matches(text, complete_from_schema_query); \
 } while (0)
 
@@ -3754,7 +3755,17 @@ _complete_from_query(int is_schema_query, const char *text, int state)
 		while (list_index < PQntuples(result) &&
 			   (item = PQgetvalue(result, list_index++, 0)))
 			if (pg_strncasecmp(text, item, byte_length) == 0)
-return pg_strdup(item);
+			{
+if (completion_case_sensitive)
+	return pg_strdup(item);
+else
+
+	/*
+	 * If case insensitive matching was requested initially,
+	 * adjust the case according to setting.
+	 */
+	return pg_strdup_keyword_case(item, text);
+			}
 	}
 
 	/* If nothing matches, free the db structure and return null */
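
The case adjustment done by pg_strdup_keyword_case() can be sketched 
roughly like this (a Python illustration; the actual psql logic in 
tab-complete.c is more elaborate):

```python
def keyword_case(keyword, user_prefix):
    """Make a completed keyword follow the case of what the user typed.

    If the user's prefix starts with a lower-case letter, offer
    'if not exists' rather than 'IF NOT EXISTS', and vice versa.
    This is an illustrative simplification, not the exact psql code.
    """
    if user_prefix and user_prefix[0].islower():
        return keyword.lower()
    return keyword.upper()
```

So typing "create table if<TAB>" would complete to "if not exists", 
while "IF<TAB>" would complete to "IF NOT EXISTS".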



Re: [HACKERS] [PATCH] Phrase search ported to 9.6

2016-03-25 Thread Artur Zakirov

On 25.03.2016 21:42, Dmitry Ivanov wrote:

Sorry for the delay, I desperately needed some time to finish a bunch of
dangling tasks.

I've added some new comments and clarified the ones that were obscure.
Moreover, I felt an urge to recheck most parts of the code since apparently
nobody (besides myself) has gone so far yet.

On 25.03.16 18:42 MSK, David Steele wrote:

Time is short and it's not encouraging that you say there is "still much

work to be done".

Indeed, I was inaccurate. I am more than interested in the positive outcome.



Hi,

Thank you for your work!

I tested the patch and took a look at it. All regression tests passed. 
The code looks good and the patch introduces great functionality.


I think the patch can be marked as "Ready for Committer". But I do not 
feel it is my place to do that myself.


Also I agree with you about tsvector_setweight(). There is no problem 
with it, because these weights are immutable, so there is no benefit 
from a new function.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] unexpected result from to_tsvector

2016-03-25 Thread Artur Zakirov

On 25.03.2016 19:15, David Steele wrote:

On 3/25/16 12:14 PM, Artur Zakirov wrote:

On 25.03.2016 18:19, David Steele wrote:

Hi Artur,

On 3/20/16 10:42 AM, Tom Lane wrote:

"Shulgin, Oleksandr" <oleksandr.shul...@zalando.de> writes:

On Mar 20, 2016 01:09, "Dmitrii Golub" <dmitrii.go...@gmail.com>
wrote:

Alex, actually subdomain can start with digit,



Not according to the RFC you have linked to.


The powers-that-be relaxed that some time ago; I assume there's a newer
RFC.  For instance, "163.com" is a real domain:


You marked this patch "needs review" and then a few minutes later
changed it to "waiting on author".

If this was a mistake please change it back to "needs review".  If you
really are working on a new patch when can we expect that?

Thanks,


Hi,

The previous patch is still current and can be committed.

I marked this patch as "needs review" because I noticed that the patch
was marked as "waiting on author", and I thought that I had forgotten to
mark it as "needs review".

But then I noticed that Robert Haas marked the patch as "waiting on
author" after my answer, and I returned "waiting on author". But I can't
find any questions or comments to me after my last answer.

Actually I think that this patch should be marked as "needs review".


Done.



Thank you!

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] unexpected result from to_tsvector

2016-03-25 Thread Artur Zakirov

On 25.03.2016 18:19, David Steele wrote:

Hi Artur,

On 3/20/16 10:42 AM, Tom Lane wrote:

"Shulgin, Oleksandr" <oleksandr.shul...@zalando.de> writes:

On Mar 20, 2016 01:09, "Dmitrii Golub" <dmitrii.go...@gmail.com> wrote:

Alex, actually subdomain can start with digit,



Not according to the RFC you have linked to.


The powers-that-be relaxed that some time ago; I assume there's a newer
RFC.  For instance, "163.com" is a real domain:


You marked this patch "needs review" and then a few minutes later
changed it to "waiting on author".

If this was a mistake please change it back to "needs review".  If you
really are working on a new patch when can we expect that?

Thanks,


Hi,

The previous patch is still current and can be committed.

I marked this patch as "needs review" because I noticed that the patch 
was marked as "waiting on author", and I thought that I had forgotten to 
mark it as "needs review".


But then I noticed that Robert Haas marked the patch as "waiting on 
author" after my answer, and I returned "waiting on author". But I can't 
find any questions or comments to me after my last answer.


Actually I think that this patch should be marked as "needs review".

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension

2016-03-22 Thread Artur Zakirov

I attached the patch, which fixes the pg_trgm documentation.

On 19.03.2016 01:18, Artur Zakirov wrote:



2016-03-18 23:46 GMT+03:00 Jeff Janes <jeff.ja...@gmail.com
<mailto:jeff.ja...@gmail.com>>:


<% and <<-> are not documented at all.  Is that a deliberate choice?
Since they were added as convenience functions for the user, I think
they really need to be in the user documentation.


I can send a patch a little bit later. I documented  %>
and <->> because examples of other operators have the following order:

SELECT t, t <-> 'word' AS dist
   FROM test_trgm
   ORDER BY dist LIMIT 10;

and

SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';

I did not include <% and <<-> because I did not know how to document
commutators. But I can fix it.

And honestly the following order:

SELECT 'word' <% t FROM test_trgm;

is more convenient to me too.
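
The argument-order point can be illustrated with a toy model (a Python 
sketch; pg_trgm's real word_similarity() is implemented in C and matches 
continuous trigram extents, not whole words as this simplification does):

```python
def trigrams(s):
    # pg_trgm-style padding: two spaces before and one after the word
    s = "  " + s.lower() + " "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def word_similarity(a, b):
    # greatest similarity between `a` and any single word of `b`
    return max(similarity(a, w) for w in b.split())

# '<%' keeps word_similarity's argument order; '%>' swaps it (commutator).
def op_ltpct(a, b, threshold=0.6):   # a <% b
    return word_similarity(a, b) >= threshold

def op_pctgt(a, b, threshold=0.6):   # a %> b
    return op_ltpct(b, a, threshold)
```

So 'word' <% 'two word phrase' and 'two word phrase' %> 'word' are the 
same test, which is why documenting one operator and mentioning the 
other as its commutator avoids repeating the argument-order explanation.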

Do you know how to avoid breaking the line in the first column of the 
operators table? Now I see:

Operator     | Returns
-------------|---------
text %>      | boolean
text         |

But the following is better:

Operator     | Returns
-------------|---------
text %> text | boolean


Also, the documentation should probably include <% and <<-> as the
"parent" operators and use them in the examples, and only mention %>
and <->> in passing, as the commutators.  That is because <% and <<->
take their arguments in the same order as word_similarity does.  It
would be less confusing if the documentation and the examples didn't
need to keep changing the argument orders.

Cheers,

Jeff




--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/doc/src/sgml/pgtrgm.sgml
--- b/doc/src/sgml/pgtrgm.sgml
***
*** 153,167 

   
   
!text %> text
!boolean
!
! Returns true if its first argument has the similar word in
! the second argument and they have a similarity that is greater than the
! current word similarity threshold set by
! pg_trgm.word_similarity_threshold parameter.
!
!   
   
text - text
real
--- 153,174 

   
   
!   text <% text
!   boolean
!   
!Returns true if its first argument has the similar word in
!the second argument and they have a similarity that is greater than the
!current word similarity threshold set by
!pg_trgm.word_similarity_threshold parameter.
!   
!  
!  
!   text %> text
!   boolean
!   
!Commutator of the <% operator.
!   
!  
   
text - text
real
***
*** 171,184 

   
   
!
! text <->> text
!
!real
!
! Returns the distance between the arguments, that is
! one minus the word_similarity() value.
!
   
  
 
--- 178,200 

   
   
!   
!text <<-> text
!   
!   real
!   
!Returns the distance between the arguments, that is
!one minus the word_similarity() value.
!   
!  
!  
!   
!text <->> text
!   
!   real
!   
!Commutator of the <<-> operator.
!   
   
  
 
***
*** 215,221 
   

 Sets the current word similarity threshold that is used by
!the %> operator.  The threshold must be between
 0 and 1 (default is 0.6).

   
--- 231,237 
   

 Sets the current word similarity threshold that is used by
!the <% operator.  The threshold must be between
 0 and 1 (default is 0.6).

   
***
*** 283,289  SELECT t, t - 'word' AS dist
  
  SELECT t, word_similarity('word', t) AS sml
FROM test_trgm
!   WHERE t %> 'word'
ORDER BY sml DESC, t;
  
 This will return all values in the text column that have a word
--- 299,305 
  
  SELECT t, word_similarity('word', t) AS sml
FROM test_trgm
!   WHERE 'word' <% t
ORDER BY sml DESC, t;
  
 This will return all values in the text column that have a word
***
*** 295,301  SELECT t, word_similarity('word', t) AS sml

 A variant of the above query is
  
! SELECT t, t <->> 'word' AS dist
FROM test_trgm
ORDER BY dist LIMIT 10;
  
--- 311,317 

 A variant of the above query is
  
! SELECT t, 'word' <<-> t AS dist
FROM test_trgm
ORDER BY dist LIMIT 10;
  



Re: [HACKERS] [PATCH] Phrase search ported to 9.6

2016-03-21 Thread Artur Zakirov
I tried to find bugs in the code and could not find any. But that does 
not mean that there are none.


There is still a lack of comments, and there are trailing whitespaces.

On 16.03.2016 19:38, Dmitry Ivanov wrote:

Hi, Artur

I've made an attempt to fix some of the issues you've listed, although there's
still much work to be done. I'll add some comments later.


This function has the duplicated piece from the tsvector_setweight()
from tsvector_op.c. You can define new function for it.


I'm not sure it's worth the trouble. IMO these functions are relatively small
and we won't benefit from extracting the duplicate code.


I think that a separate function would be better. These weights occur 
in other functions as well. But I might be wrong; maybe I am being too 
picky about the code.





These functions was removed in 9acb9007de30b3daaa9efc16763c3bc6e3e0a92d.
It seems that tsvector_op.c was not synchronized with the master.


Got it, thanks!




--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] unexpected result from to_tsvector

2016-03-19 Thread Artur Zakirov
I found the discussion about allowing an underscore in emails 
http://www.postgresql.org/message-id/200908281359.n7sdxfaf044...@wwwmaster.postgresql.org


That bug report is about recognizing an underscore in the local part of 
an email, not about recognizing an underscore in a domain name. But that 
patch also allows an underscore in recognized host names.


I am not good at RFCs, so I quote excerpts from Wikipedia 
https://en.wikipedia.org/wiki/Email_address:



The local-part of the email address may use any of these ASCII characters:

Uppercase and lowercase Latin letters (A–Z, a–z) (ASCII: 65–90, 97–122)
Digits 0 to 9 (ASCII: 48–57)
These special characters: !#$%&'*+-/=?^_`{|}~ (ASCII: 33, 35–39, 42, 43, 45, 
47, 61, 63, 94–96, 123–126)
Character . (dot, period, full stop), ASCII 46, provided that it is not the 
first or last character, and provided also that it does not appear 
consecutively (e.g. john.@example.com is not allowed).
Other special characters are allowed with restrictions (they are only allowed 
inside a quoted string, as described in the paragraph below, and in addition, a 
backslash or double-quote must be preceded by a backslash). These characters 
are:
Space and "(),:;<>@[\] (ASCII: 32, 34, 40, 41, 44, 58, 59, 60, 62, 64, 91–93)
Comments are allowed with parentheses at either end of the local part; e.g. 
john.smith(comment)@example.com and (comment)john.sm...@example.com are both 
equivalent to john.sm...@example.com.


and https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_host_names


The Internet standards (Requests for Comments) for protocols mandate that 
component hostname labels may contain only the ASCII letters 'a' through 'z' 
(in a case-insensitive manner),the digits '0' through '9', and the hyphen 
('-'). The original specification of hostnames in RFC 952, mandated that labels 
could not start with a digit or with a hyphen, and must not end with a hyphen. 
However, a subsequent specification (RFC 1123) permitted hostname labels to 
start with digits. No other symbols, punctuation characters, or white space are 
permitted.


Hence the following emails are valid (I might be wrong):

12...@sample.com
12...@sample.com
1...@123-sample.com
1...@123sample.com

The attached patch allows them to be recognized as an email address. 
But this patch does not prohibit an underscore in recognized host names.
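
The label rule quoted above can be sketched as follows (a Python 
illustration of RFC 952/1123 hostname-label syntax, not the tsearch 
parser's code):

```python
import re

# RFC 952 as amended by RFC 1123: a label contains only letters, digits
# and hyphens; it may start with a digit, but must not start or end with
# a hyphen.  Underscore is not permitted.
LABEL_RE = re.compile(r'^[a-z0-9]([a-z0-9-]*[a-z0-9])?$', re.IGNORECASE)

def is_valid_hostname(name):
    labels = name.split('.')
    return all(bool(LABEL_RE.match(label)) for label in labels)
```

Under these rules '123-sample.com' and '163.com' are valid host names, 
while '123_yyy.zzz' is not.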


As a result this patch gives the following results with underscores:

=# select * from ts_debug('simple', 'aaa@123_yyy.zzz');
 alias |  description  |      token      | dictionaries | dictionary |      lexemes
-------+---------------+-----------------+--------------+------------+-------------------
 email | Email address | aaa@123_yyy.zzz | {simple}     | simple     | {aaa@123_yyy.zzz}
(1 row)

=# select * from ts_debug('simple', '123_yyy.zzz');
 alias | description |    token    | dictionaries | dictionary |    lexemes
-------+-------------+-------------+--------------+------------+---------------
 host  | Host        | 123_yyy.zzz | {simple}     | simple     | {123_yyy.zzz}
(1 row)

On 14.03.2016 17:45, Artur Zakirov wrote:

On 14.03.2016 16:22, Shulgin, Oleksandr wrote:


Hm...  now that doesn't look all that consistent to me (after applying
the patch):

=# select ts_debug('simple', 'a...@123-yyy.zzz');
  ts_debug
---

  (email,"Email
address",a...@123-yyy.zzz,{simple},simple,{a...@123-yyy.zzz})
(1 row)

But:

=# select ts_debug('simple', 'aaa@123_yyy.zzz');
 ts_debug
-
  (asciiword,"Word, all ASCII",aaa,{simple},simple,{aaa})
  (blank,"Space symbols",@,{},,)
  (uint,"Unsigned integer",123,{simple},simple,{123})
  (blank,"Space symbols",_,{},,)
  (host,Host,yyy.zzz,{simple},simple,{yyy.zzz})
(5 rows)

One can also see that if we only keep the domain name, the result is
similar:

=# select ts_debug('simple', '123-yyy.zzz');
ts_debug
---
  (host,Host,123-yyy.zzz,{simple},simple,{123-yyy.zzz})
(1 row)

=# select ts_debug('simple', '123_yyy.zzz');
   ts_debug
-
  (uint,"Unsigned integer",123,{simple},simple,{123})
  (blank,"Space symbols",_,{},,)
  (host,Host,yyy.zzz,{simple},simple,{yyy.zzz})
(3 rows)

But, this only has to do with 123 being recognized as a number, not with
the underscore:

=# select ts_debug('simple', 'abc_yyy.zzz');
ts_debug
---
  (host,Host,abc_yyy.zzz,{simple},simple,{abc_yyy.zzz})
(1 row)

=# select ts_debug('simple', '1abc_yyy.zzz');
ts_debug
---
  (host,Host,1abc_yyy.zzz,{simple},simple,{1abc_yyy.zzz})
(1 row)

Re: [HACKERS] Proposal: Generic WAL logical messages

2016-03-18 Thread Artur Zakirov

On 16.03.2016 18:56, David Steele wrote:


This patch applies cleanly and is ready for review with no outstanding
issues that I can see.  Simon and Artur, you are both signed up as
reviewers.  Care to take a crack at it?

Thanks,



I have tested the patch once again and have looked at the code. It 
looks good to me. I have no further observations.


The patch performs its function correctly. I have tested it with a 
plugin which writes DDL queries to the WAL using a hook and replicates 
these queries to subscribers.


If Simon is not against it, the patch can be marked as "Ready for Committer".

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension

2016-03-18 Thread Artur Zakirov
2016-03-18 23:46 GMT+03:00 Jeff Janes <jeff.ja...@gmail.com>:
>
>
> <% and <<-> are not documented at all.  Is that a deliberate choice?
> Since they were added as convenience functions for the user, I think
> they really need to be in the user documentation.
>

I can send a patch a little bit later. I documented  %>
and <->> because examples of other operators have the following order:

SELECT t, t <-> 'word' AS dist
  FROM test_trgm
  ORDER BY dist LIMIT 10;

and

SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';

I did not include <% and <<-> because I did not know how to document
commutators. But I can fix it.

And honestly the following order:

SELECT 'word' <% t FROM test_trgm;

is more convenient to me too.

Do you know how to avoid breaking the line in the first column of the 
operators table? Now I see:

Operator     | Returns
-------------|---------
text %>      | boolean
text         |

But the following is better:

Operator     | Returns
-------------|---------
text %> text | boolean


>
> Also, the documentation should probably include <% and <<-> as the
> "parent" operators and use them in the examples, and only mention %>
> and <->> in passing, as the commutators.  That is because <% and <<->
> take their arguments in the same order as word_similarity does.  It
> would be less confusing if the documentation and the examples didn't
> need to keep changing the argument orders.
>
> Cheers,
>
> Jeff
>



-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension

2016-03-15 Thread Artur Zakirov

On 15.03.2016 17:28, David Steele wrote:

On 3/14/16 12:27 PM, Artur Zakirov wrote:

On 14.03.2016 18:48, David Steele wrote:

Hi Jeff,

On 2/25/16 5:00 PM, Jeff Janes wrote:


But, It doesn't sound like I am going to win that debate.  Given that,
I don't think we need a different name for the function. I'm fine with
explaining the word-boundary subtlety in the documentation, and
keeping the function name itself simple.


It's not clear to me if you are requesting more documentation here or
stating that you are happy with it as-is.  Care to elaborate?

Other than that I think this patch looks to be ready for committer. Any
objections?



There were some comments about the word-boundary subtlety, but I think the
explanation was not enough.

I rephrased the explanation of word_similarity() and %>. It is better now.

But if it is not correct I can change the explanation.


Since the only change in the latest patch is to documentation I have
marked this "ready for committer".



Thank you!

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension

2016-03-14 Thread Artur Zakirov

On 14.03.2016 18:48, David Steele wrote:

Hi Jeff,

On 2/25/16 5:00 PM, Jeff Janes wrote:


But, it doesn't sound like I am going to win that debate.  Given that,
I don't think we need a different name for the function. I'm fine with
explaining the word-boundary subtlety in the documentation, and
keeping the function name itself simple.


It's not clear to me if you are requesting more documentation here or
stating that you are happy with it as-is.  Care to elaborate?

Other than that I think this patch looks to be ready for committer. Any
objections?



There were some comments about the word-boundary subtlety, but I think the 
explanation was not enough.


I rephrased the explanation of word_similarity() and %>. It is better now.

But if it is not correct I can change the explanation.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/contrib/pg_trgm/pg_trgm--1.2.sql
--- b/contrib/pg_trgm/pg_trgm--1.2.sql
***
*** 3,13 
--- 3,15 
  -- complain if script is sourced in psql, rather than via CREATE EXTENSION
  \echo Use "CREATE EXTENSION pg_trgm" to load this file. \quit
  
+ -- Deprecated function
  CREATE FUNCTION set_limit(float4)
  RETURNS float4
  AS 'MODULE_PATHNAME'
  LANGUAGE C STRICT VOLATILE;
  
+ -- Deprecated function
  CREATE FUNCTION show_limit()
  RETURNS float4
  AS 'MODULE_PATHNAME'
***
*** 26,32  LANGUAGE C STRICT IMMUTABLE;
  CREATE FUNCTION similarity_op(text,text)
  RETURNS bool
  AS 'MODULE_PATHNAME'
! LANGUAGE C STRICT STABLE;  -- stable because depends on trgm_limit
  
  CREATE OPERATOR % (
  LEFTARG = text,
--- 28,34 
  CREATE FUNCTION similarity_op(text,text)
  RETURNS bool
  AS 'MODULE_PATHNAME'
! LANGUAGE C STRICT STABLE;  -- stable because depends on pg_trgm.similarity_threshold
  
  CREATE OPERATOR % (
  LEFTARG = text,
*** a/contrib/pg_trgm/trgm.h
--- b/contrib/pg_trgm/trgm.h
***
*** 105,111  typedef char *BITVECP;
  
  typedef struct TrgmPackedGraph TrgmPackedGraph;
  
! extern float4 trgm_limit;
  
  extern uint32 trgm2int(trgm *ptr);
  extern void compact_trigram(trgm *tptr, char *str, int bytelen);
--- 105,111 
  
  typedef struct TrgmPackedGraph TrgmPackedGraph;
  
! extern double similarity_threshold;
  
  extern uint32 trgm2int(trgm *ptr);
  extern void compact_trigram(trgm *tptr, char *str, int bytelen);
*** a/contrib/pg_trgm/trgm_gin.c
--- b/contrib/pg_trgm/trgm_gin.c
***
*** 206,212  gin_trgm_consistent(PG_FUNCTION_ARGS)
  			 * similarity is just c / len1.
  			 * So, independly on DIVUNION the upper bound formula is the same.
  			 */
! 			res = (nkeys == 0) ? false : ((((((float4) ntrue) / ((float4) nkeys))) >= trgm_limit) ? true : false);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
--- 206,214 
  			 * similarity is just c / len1.
  			 * So, independly on DIVUNION the upper bound formula is the same.
  			 */
! 			res = (nkeys == 0) ? false :
! ((((((float4) ntrue) / ((float4) nkeys))) >= similarity_threshold)
! 	? true : false);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
***
*** 283,289  gin_trgm_triconsistent(PG_FUNCTION_ARGS)
  			/*
  			 * See comment in gin_trgm_consistent() about * upper bound formula
  			 */
! 			res = (nkeys == 0) ? GIN_FALSE : (((((float4) ntrue) / ((float4) nkeys)) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
--- 285,293 
  			/*
  			 * See comment in gin_trgm_consistent() about * upper bound formula
  			 */
! 			res = (nkeys == 0) ? GIN_FALSE :
! (((((float4) ntrue) / ((float4) nkeys)) >= similarity_threshold)
! 	? GIN_MAYBE : GIN_FALSE);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
*** a/contrib/pg_trgm/trgm_gist.c
--- b/contrib/pg_trgm/trgm_gist.c
***
*** 294,300  gtrgm_consistent(PG_FUNCTION_ARGS)
  float4		tmpsml = cnt_sml(key, qtrg);
  
  /* strange bug at freebsd 5.2.1 and gcc 3.3.3 */
! res = (*(int *) &tmpsml == *(int *) &trgm_limit || tmpsml > trgm_limit) ? true : false;
  			}
  			else if (ISALLTRUE(key))
  			{	/* non-leaf contains signature */
--- 294,301 
  float4		tmpsml = cnt_sml(key, qtrg);
  
  /* strange bug at freebsd 5.2.1 and gcc 3.3.3 */
! res = (*(int *) &tmpsml == *(int *) &similarity_threshold
! 		|| tmpsml > similarity_threshold) ? true : false;
  			}
  			else if (ISALLTRUE(key))
  			{	/* non-leaf contains signature */
***
*** 308,314  gtrgm_consistent(PG_FUNCTION_ARGS)
  if (len == 0)
  	res = false;
  else
! 	res = (((((float8) count) / ((float8) len))) >= trgm_limit) ? true : false;
  			}
  			break;
  		case ILikeStrategyNumber:
--- 309,316 
  if (len == 0)
  	res = false;
  else
! 	res = (((((float8) count) / ((float8) len))) >= similarity_threshold)
! 			? true : false;
  			}
  			break;
  		ca

Re: [HACKERS] plpgsql - DECLARE - cannot to use %TYPE or %ROWTYPE for composite types

2016-03-14 Thread Artur Zakirov

On 14.03.2016 17:54, Tom Lane wrote:

Joe Conway <m...@joeconway.com> writes:

This new version of the patch was posted after the commitfest item was
marked ready for committer. Does anyone have further comments or
objections to the concept or syntax before I try to take this forward?


The quoted excerpt fails to say what solution was adopted to the array
syntax issues, so it's impossible to have an opinion without actually
reading the patch.

However ... one thing I was intending to mention on this thread is that
"get the array type over this type" isn't the only extension one might
wish for.  Another likely desire is "get the type of field 'foo' of this
composite type".  I don't suggest that this patch needs to implement
that right now; but it would be a good thing if we could see how the
chosen syntax could be extended in such a direction.  Otherwise we might
be painting ourselves into a corner.

regards, tom lane



I looked at this patch and the previous one. The patch applies correctly to 
HEAD. Regression tests pass successfully, without errors.


In comparison with the previous patch it adds the following functionality:
- %TYPE - now can be used for composite types (same syntax).
- %TYPE[] - new functionality, provides the array type from a
variable or table column (syntax was changed).
- var ELEMENT OF othervar%TYPE - new functionality, provides the element 
type of a given array (syntax was changed).


plpgsql_derive_type() was changed. Now it has the following definition:


PLpgSQL_type *
plpgsql_derive_type(PLpgSQL_type *base_type, bool to_element_type, bool 
to_array_type)


Previously it had the following definition:


static PLpgSQL_type *
derive_type(PLpgSQL_type *base_type, PLpgSQL_reftype reftype)


where PLpgSQL_reftype was the enum:


+ typedef enum
+ {
+   PLPGSQL_REFTYPE_TYPE,   /* use type of some variable */
+   PLPGSQL_REFTYPE_ELEMENT,/* use a element type of referenced 
variable */
+   PLPGSQL_REFTYPE_ARRAY   /* use a array type of referenced 
variable */
+ } PLpgSQL_reftype;


I think the previous version was better, because an enum is better than 
additional function parameters. But that is just my opinion.


Also there is a little typo here:


+  * This routine is used for generating element or array type from base type.
+  * The options to_element_type and to_array_type can be used together, when
+  * we would to ensure valid result. The array array type is original type, so
+  * this direction is safe. The element of scalar type is not allowed, but if
+  * we do "to array" transformation first, then this direction should be safe
+  * too. This design is tolerant, because we should to support a design of
+  * polymorphic parameters, where a array value can be passed as anyelement
+  * or anyarray parameter.
+  */
+ PLpgSQL_type *
+ plpgsql_derive_type(PLpgSQL_type *base_type,


Here the word "array" occurs twice in the third line.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] unexpected result from to_tsvector

2016-03-14 Thread Artur Zakirov

On 14.03.2016 16:22, Shulgin, Oleksandr wrote:


Hm...  now that doesn't look all that consistent to me (after applying
the patch):

=# select ts_debug('simple', 'a...@123-yyy.zzz');
  ts_debug
---
  (email,"Email address",a...@123-yyy.zzz,{simple},simple,{a...@123-yyy.zzz})
(1 row)

But:

=# select ts_debug('simple', 'aaa@123_yyy.zzz');
 ts_debug
-
  (asciiword,"Word, all ASCII",aaa,{simple},simple,{aaa})
  (blank,"Space symbols",@,{},,)
  (uint,"Unsigned integer",123,{simple},simple,{123})
  (blank,"Space symbols",_,{},,)
  (host,Host,yyy.zzz,{simple},simple,{yyy.zzz})
(5 rows)

One can also see that if we only keep the domain name, the result is
similar:

=# select ts_debug('simple', '123-yyy.zzz');
ts_debug
---
  (host,Host,123-yyy.zzz,{simple},simple,{123-yyy.zzz})
(1 row)

=# select ts_debug('simple', '123_yyy.zzz');
   ts_debug
-
  (uint,"Unsigned integer",123,{simple},simple,{123})
  (blank,"Space symbols",_,{},,)
  (host,Host,yyy.zzz,{simple},simple,{yyy.zzz})
(3 rows)

But, this only has to do with 123 being recognized as a number, not with
the underscore:

=# select ts_debug('simple', 'abc_yyy.zzz');
ts_debug
---
  (host,Host,abc_yyy.zzz,{simple},simple,{abc_yyy.zzz})
(1 row)

=# select ts_debug('simple', '1abc_yyy.zzz');
ts_debug
---
  (host,Host,1abc_yyy.zzz,{simple},simple,{1abc_yyy.zzz})
(1 row)

In fact, the 123-yyy.zzz domain is not valid either according to the RFC
(subdomain can't start with a digit), but since we already allow it,
should we not allow 123_yyy.zzz to be recognized as a Host?  Then why
not recognize aaa@123_yyy.zzz as an email address?

Another option is to prohibit underscore in recognized host names, but
this has more breakage potential IMO.

--
Alex



It seems reasonable to me. I like the first option more, but I am not 
confident that we should allow 123_yyy.zzz to be recognized as a Host.


By the way, in this answer http://webmasters.stackexchange.com/a/775 
you can see examples of domain names with numbers (but not subdomains).


If there are no objections from others, I will send a new patch later 
today or tomorrow that recognizes 123_yyy.zzz.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Proposal: Generic WAL logical messages

2016-03-08 Thread Artur Zakirov

I think here


+const char *
+logicalmsg_identify(uint8 info)
+{
+   if (info & ~XLR_INFO_MASK == XLOG_LOGICAL_MESSAGE)
+   return "MESSAGE";
+
+   return NULL;
+}


we should use parentheses:

const char *
logicalmsg_identify(uint8 info)
{
if ((info & ~XLR_INFO_MASK) == XLOG_LOGICAL_MESSAGE)
return "MESSAGE";

return NULL;
}

Because of C operator precedence 
(http://en.cppreference.com/w/c/language/operator_precedence), == binds 
more tightly than &, so the unparenthesized check can give wrong results.


On 01.03.2016 00:10, Petr Jelinek wrote:

Hi,

attached is the newest version of the patch.

I removed the registry, renamed the 'send' to 'emit', documented the
callback parameters properly. I also added the test to ddl.sql for the
serialization and deserialization (and of course found a bug there) and
in general fixed all the stuff Andres reported.

(see more inline)


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] unexpected result from to_tsvector

2016-03-07 Thread Artur Zakirov

Hello,

On 07.03.2016 23:55, Dmitrii Golub wrote:



Hello,

Should we add tests for this case?


I think we should. I have added tests for teo...@123-stack.net and 
1...@stack.net emails.




123_reg.ro is not a valid domain name, because of the
symbol "_"

https://tools.ietf.org/html/rfc1035 page 8.

Dmitrii Golub


Thank you for the information. Fixed.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/src/backend/tsearch/wparser_def.c
--- b/src/backend/tsearch/wparser_def.c
***
*** 1121,1126  static const TParserStateActionItem actionTPS_InUnsignedInt[] = {
--- 1121,1128 
  	{p_iseqC, '.', A_PUSH, TPS_InUDecimalFirst, 0, NULL},
  	{p_iseqC, 'e', A_PUSH, TPS_InMantissaFirst, 0, NULL},
  	{p_iseqC, 'E', A_PUSH, TPS_InMantissaFirst, 0, NULL},
+ 	{p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
+ 	{p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
  	{p_isasclet, 0, A_PUSH, TPS_InHost, 0, NULL},
  	{p_isalpha, 0, A_NEXT, TPS_InNumWord, 0, NULL},
  	{p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
*** a/src/test/regress/expected/tsearch.out
--- b/src/test/regress/expected/tsearch.out
***
*** 264,270  SELECT * FROM ts_token_type('default');
  23 | entity  | XML entity
  (23 rows)
  
! SELECT * FROM ts_parse('default', '345 qwe@efd.r '' http://www.com/ http://aew.werc.ewr/?ad=qwe 1aew.werc.ewr/?ad=qwe 2aew.werc.ewr http://3aew.werc.ewr/?ad=qwe http://4aew.werc.ewr http://5aew.werc.ewr:8100/?  ad=qwe 6aew.werc.ewr:8100/?ad=qwe 7aew.werc.ewr:8100/?ad=qwe=%20%32 +4.0e-10 qwe qwe qwqwe 234.435 455 5.005 teo...@stack.net qwe-wer asdf qwer jf sdjk ewr1> ewri2 
  /usr/local/fff /awdf/dwqe/4325 rewt/ewr wefjn /wqe-324/ewr gist.h gist.h.c gist.c. readline 4.2 4.2. 4.2, readline-4.2 readline-4.2. 234
   wow  < jqw <> qwerty');
   tokid |token 
--- 264,270 
  23 | entity  | XML entity
  (23 rows)
  
! SELECT * FROM ts_parse('default', '345 qwe@efd.r '' http://www.com/ http://aew.werc.ewr/?ad=qwe 1aew.werc.ewr/?ad=qwe 2aew.werc.ewr http://3aew.werc.ewr/?ad=qwe http://4aew.werc.ewr http://5aew.werc.ewr:8100/?  ad=qwe 6aew.werc.ewr:8100/?ad=qwe 7aew.werc.ewr:8100/?ad=qwe=%20%32 +4.0e-10 qwe qwe qwqwe 234.435 455 5.005 teo...@stack.net teo...@123-stack.net 1...@stack.net qwe-wer asdf qwer jf sdjk ewr1> ewri2 
  /usr/local/fff /awdf/dwqe/4325 rewt/ewr wefjn /wqe-324/ewr gist.h gist.h.c gist.c. readline 4.2 4.2. 4.2, readline-4.2 readline-4.2. 234
   wow  < jqw <> qwerty');
   tokid |token 
***
*** 332,337  SELECT * FROM ts_parse('default', '345 qwe@efd.r '' http://www.com/ http://aew.w
--- 332,341 
  12 |  
   4 | teo...@stack.net
  12 |  
+  4 | teo...@123-stack.net
+ 12 |  
+  4 | 1...@stack.net
+ 12 |  
  16 | qwe-wer
  11 | qwe
  12 | -
***
*** 404,425  SELECT * FROM ts_parse('default', '345 qwe@efd.r '' http://www.com/ http://aew.w
  12 |  
  12 | <> 
   1 | qwerty
! (133 rows)
  
! SELECT to_tsvector('english', '345 qwe@efd.r '' http://www.com/ http://aew.werc.ewr/?ad=qwe 1aew.werc.ewr/?ad=qwe 2aew.werc.ewr http://3aew.werc.ewr/?ad=qwe http://4aew.werc.ewr http://5aew.werc.ewr:8100/?  ad=qwe 6aew.werc.ewr:8100/?ad=qwe 7aew.werc.ewr:8100/?ad=qwe=%20%32 +4.0e-10 qwe qwe qwqwe 234.435 455 5.005 teo...@stack.net qwe-wer asdf qwer jf sdjk ewr1> ewri2 
  /usr/local/fff /awdf/dwqe/4325 rewt/ewr wefjn /wqe-324/ewr gist.h gist.h.c gist.c. readline 4.2 4.2. 4.2, readline-4.2 readline-4.2. 234
   wow  < jqw <> qwerty');
! 

Re: [HACKERS] Confusing with commit time usage in logical decoding

2016-03-01 Thread Artur Zakirov

Hello, Andres

Last year you introduced the replication progress tracking infrastructure, 
and there is a problem with it, described at the link in the quote below.


The attached patch fixes this issue. Is the patch correct? I would be grateful 
if it is and if it can be committed.


Thanks.

On 29.02.2016 14:18, Artur Zakirov wrote:

Hello,

I read this message
http://www.postgresql.org/message-id/56d4197e.9050...@informatik.uni-kl.de

Is this a bug or a typo? In DecodeCommit() in decode.c instead of:

if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
{
 origin_lsn = parsed->origin_lsn;
 commit_time = parsed->origin_timestamp;
}

should be:

if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
{
 origin_lsn = parsed->origin_lsn;
 commit_time = parsed->origin_timestamp;
}
else
 commit_time = parsed->xact_time;




--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/src/backend/replication/logical/decode.c
--- b/src/backend/replication/logical/decode.c
***
*** 458,463  DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
--- 458,465 
  		origin_lsn = parsed->origin_lsn;
  		commit_time = parsed->origin_timestamp;
  	}
+ 	else
+ 		commit_time = parsed->xact_time;
  
  	/*
  	 * Process invalidation messages, even if we're not interested in the



Re: [HACKERS] [PATCH] Phrase search ported to 9.6

2016-02-29 Thread Artur Zakirov
n define new function for it.



+/*
+ * Check if datatype is the specified type or equivalent to it.
+ *
+ * Note: we could just do getBaseType() unconditionally, but since that's
+ * a relatively expensive catalog lookup that most users won't need, we
+ * try the straight comparison first.
+ */
+static bool
+is_expected_type(Oid typid, Oid expected_type)
+{
+   if (typid == expected_type)
+   return true;
+   typid = getBaseType(typid);
+   if (typid == expected_type)
+   return true;
+   return false;
+}
+
+/* Check if datatype is TEXT or binary-equivalent to it */
+static bool
+is_text_type(Oid typid)
+{
+   /* varchar(n) and char(n) are binary-compatible with text */
+   if (typid == TEXTOID || typid == VARCHAROID || typid == BPCHAROID)
+   return true;
+   /* Allow domains over these types, too */
+   typid = getBaseType(typid);
+   if (typid == TEXTOID || typid == VARCHAROID || typid == BPCHAROID)
+   return true;
+   return false;
+}


These functions were removed in 9acb9007de30b3daaa9efc16763c3bc6e3e0a92d. 
It seems that tsvector_op.c was not synchronized with master.


Conclusion
--

This patch is large and needs more research. I will keep reviewing it 
and will give further notes later.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




[HACKERS] Confusing with commit time usage in logical decoding

2016-02-29 Thread Artur Zakirov

Hello,

I read this message 
http://www.postgresql.org/message-id/56d4197e.9050...@informatik.uni-kl.de


Is this a bug or a typo? In DecodeCommit() in decode.c instead of:

if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
{
origin_lsn = parsed->origin_lsn;
commit_time = parsed->origin_timestamp;
}

should be:

if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
{
origin_lsn = parsed->origin_lsn;
commit_time = parsed->origin_timestamp;
}
else
commit_time = parsed->xact_time;

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Proposal: Generic WAL logical messages

2016-02-27 Thread Artur Zakirov

Hello,

On 27.02.2016 03:05, Andres Freund wrote:

Hi,

I'm not really convinced by RegisterStandbyMsgPrefix() et al. There's
not much documentation about what it actually is supposed to
acomplish. Afaics you're basically forced to use
shared_preload_libraries with it right now?  Also, iterating through a
linked list everytime something is logged doesn't seem very satisfying?



I have done some tests with a simple plugin, using event triggers to send 
messages. It works, but I agree with Andres: we have problems if the 
plugin is not loaded. For example, if you execute the query:


SELECT 'msg2' FROM pg_logical_send_message(false, 'test', 'msg2');

you will get the following error (if the plugin that should register the 
prefix is not loaded yet):


ERROR:  standby message prefix "test" is not registered

A stylistic note (logical.c):


+static void message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+  XLogRecPtr message_lsn,
+  bool transactional, const char *prefix,
+  Size sz, const char *message)
+{
+   LogicalDecodingContext *ctx = cache->private_data;
+   LogicalErrorCallbackState state;
+   ErrorContextCallback errcallback;


It should be written in the following way:

static void
message_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
   XLogRecPtr message_lsn,
   bool transactional, const char *prefix,
   Size sz, const char *message)
{
LogicalDecodingContext *ctx = cache->private_data;
LogicalErrorCallbackState state;
ErrorContextCallback errcallback;

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] plpgsql - DECLARE - cannot to use %TYPE or %ROWTYPE for composite types

2016-02-24 Thread Artur Zakirov

On 21.02.2016 11:31, Pavel Stehule wrote:

Hi

I am sending an updated version - the changes fix the review comments.



Great.

I am new to reviewing, but I think Pavel took all the comments into account. 
This patch compiles and the regression tests pass, so I change its 
status to "Ready for Committer".





 By the way, these functions are misnamed after this patch.
They are
 called "wordtype" and "cwordtype" originally because they
accept
 "word%TYPE" and "compositeword%TYPE", but after the patch
they not only
 accept TYPE at the right of the percent sign but also
ELEMENTTYPE and
 ARRAYTYPE.  Not sure that this is something we want to be
too strict
 about.


Understood - used the name ***reftype instead of ***type


I am not sure, but it seems that the new names are a little worse. I
think the original names are good too. They accept a word and return the
PLpgSQL_type structure.


The "TYPE" word in this name was related to the %TYPE syntax. And because
the new syntax allows more constructs, the name change is correct, I
think. But choosing names is hard work. The new name shows a little more
strongly the relation to referenced types.



Agree.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




[HACKERS] unexpected result from to_tsvector

2016-02-23 Thread Artur Zakirov

Hello,

Here is a little patch. It fixes this issue 
http://www.postgresql.org/message-id/20160217080048.26357.49...@wrigleys.postgresql.org


Without the patch we get a wrong result for the second email 't...@123-reg.ro':

=> SELECT * FROM ts_debug('simple', 't...@vauban-reg.ro');
 alias |  description  |   token| dictionaries | dictionary 
|   lexemes

---+---++--++--
 email | Email address | t...@vauban-reg.ro | {simple} | simple  | 
{t...@vauban-reg.ro}

(1 row)

=> SELECT * FROM ts_debug('simple', 't...@123-reg.ro');
   alias   |   description| token  | dictionaries | dictionary | 
lexemes

---+--++--++--
 asciiword | Word, all ASCII  | test   | {simple} | simple | {test}
 blank | Space symbols| @  | {}   ||
 uint  | Unsigned integer | 123| {simple} | simple | {123}
 blank | Space symbols| -  | {}   ||
 host  | Host | reg.ro | {simple} | simple | 
{reg.ro}

(5 rows)

After the patch we get the correct result for the second email:

=> SELECT * FROM ts_debug('simple', 't...@123-reg.ro');
 alias |  description  |  token  | dictionaries | dictionary | 
  lexemes

---+---+-+--++--
 email | Email address | t...@123-reg.ro | {simple} | simple  | 
{t...@123-reg.ro}

(1 row)

This patch allows the parser to work correctly with the emails 't...@123-reg.ro', 
'1...@123-reg.ro' and 'test@123_reg.ro'.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/src/backend/tsearch/wparser_def.c
--- b/src/backend/tsearch/wparser_def.c
***
*** 1121,1126  static const TParserStateActionItem actionTPS_InUnsignedInt[] = {
--- 1121,1129 
  	{p_iseqC, '.', A_PUSH, TPS_InUDecimalFirst, 0, NULL},
  	{p_iseqC, 'e', A_PUSH, TPS_InMantissaFirst, 0, NULL},
  	{p_iseqC, 'E', A_PUSH, TPS_InMantissaFirst, 0, NULL},
+ 	{p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
+ 	{p_iseqC, '_', A_PUSH, TPS_InHostFirstAN, 0, NULL},
+ 	{p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
  	{p_isasclet, 0, A_PUSH, TPS_InHost, 0, NULL},
  	{p_isalpha, 0, A_NEXT, TPS_InNumWord, 0, NULL},
  	{p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},



Re: [HACKERS] Proposal: Generic WAL logical messages

2016-02-20 Thread Artur Zakirov

On 23.01.2016 01:22, Petr Jelinek wrote:

Hi,

here is updated version of this patch, calling the messages logical
(decoding) messages consistently everywhere and removing any connection
to standby messages. Moving this to it's own module gave me place to
write some brief explanation about this so the code documentation has
hopefully improved as well.

The functionality itself didn't change.






Hello,

It seems that you forgot the regression tests for test_decoding. There is an 
entry in test_decoding/Makefile, but the files sql/messages.sql and 
expected/messages.out are missing. However, they were included in 
the first version of the patch.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] plpgsql - DECLARE - cannot to use %TYPE or %ROWTYPE for composite types

2016-02-19 Thread Artur Zakirov

It seems all fixes are done. I tested the patch and regression tests passed.

On 27.01.2016 20:58, Pavel Stehule wrote:



 > --- 1681,1687 
 >* --
 >*/
 >   PLpgSQL_type *
 > ! plpgsql_parse_wordtype(char *ident, int reftype_mode)
 >   {
 >   PLpgSQL_type *dtype;
 >   PLpgSQL_nsitem *nse;

Use the typedef'ed enum, as above.

 > --- 1699,1721 
 >   switch (nse->itemtype)
 >   {
 >   case PLPGSQL_NSTYPE_VAR:
 > ! {
 > ! dtype = ((PLpgSQL_var *)
(plpgsql_Datums[nse->itemno]))->datatype;
 > ! return derive_type(dtype,
reftype_mode);
 > ! }
 >
 > ! case PLPGSQL_NSTYPE_ROW:
 > ! {
 > ! dtype = ((PLpgSQL_row *)
(plpgsql_Datums[nse->itemno]))->datatype;
 > ! return derive_type(dtype,
reftype_mode);
 > ! }
 >
 > + /*
 > +  * XXX perhaps allow REC here? Probably it
has not any sense, because
 > +  * in this moment, because PLpgSQL doesn't
support rec parameters, so
 > +  * there should not be any rec polymorphic
parameter, and any work can
 > +  * be done inside function.
 > +  */

I think you should remove from the "?" onwards in that comment, i.e.
just keep what was already in the original comment (minus the ROW)


I tried to fix it, not sure if understood well.


I think here Alvaro means that you should keep the original comment without 
the ROW. Like this:


/* XXX perhaps allow REC here? */



 > *** extern bool plpgsql_parse_dblword(char *
 > *** 961,968 
 > PLwdatum *wdatum, PLcword
*cword);
 >   extern bool plpgsql_parse_tripword(char *word1, char *word2,
char *word3,
 >  PLwdatum *wdatum,
PLcword *cword);
 > ! extern PLpgSQL_type *plpgsql_parse_wordtype(char *ident);
 > ! extern PLpgSQL_type *plpgsql_parse_cwordtype(List *idents);
 >   extern PLpgSQL_type *plpgsql_parse_wordrowtype(char *ident);
 >   extern PLpgSQL_type *plpgsql_parse_cwordrowtype(List *idents);
 >   extern PLpgSQL_type *plpgsql_build_datatype(Oid typeOid, int32
typmod,
 > --- 973,980 
 > PLwdatum *wdatum, PLcword
*cword);
 >   extern bool plpgsql_parse_tripword(char *word1, char *word2,
char *word3,
 >  PLwdatum *wdatum,
PLcword *cword);
 > ! extern PLpgSQL_type *plpgsql_parse_wordtype(char *ident, int
reftype_mode);
 > ! extern PLpgSQL_type *plpgsql_parse_cwordtype(List *idents, int
reftype_mode);
 >   extern PLpgSQL_type *plpgsql_parse_wordrowtype(char *ident);
 >   extern PLpgSQL_type *plpgsql_parse_cwordrowtype(List *idents);
 >   extern PLpgSQL_type *plpgsql_build_datatype(Oid typeOid, int32
typmod,

By the way, these functions are misnamed after this patch.  They are
called "wordtype" and "cwordtype" originally because they accept
"word%TYPE" and "compositeword%TYPE", but after the patch they not only
accept TYPE at the right of the percent sign but also ELEMENTTYPE and
ARRAYTYPE.  Not sure that this is something we want to be too strict
about.


Understand - used name ***reftype instead type


I am not sure, but it seems that the new names are a little worse. I think 
the original names are good too. They accept a word and return the 
PLpgSQL_type structure.




Thank you for your comment

Attached updated patch

Regards

Pavel



--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




I noticed a little typo in the comment in the derive_type():
/* Return base_type, when it is a array already */

should be:
/* Return base_type, when it is an array already */

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension

2016-02-17 Thread Artur Zakirov

On 12.02.2016 20:56, Teodor Sigaev wrote:

On Thu, Feb 11, 2016 at 9:56 AM, Teodor Sigaev <teo...@sigaev.ru> wrote:

1 - sml_limit to similarity_limit. sml_threshold is difficult to
write, I think; similarity_limit is simpler.


It seems to me that threshold is the right word by meaning. sml_threshold
is my choice.


Why abbreviate it like that?  Nobody's going to know that "sml" stands
for "similarity" without consulting the documentation, and that sucks.


Ok, I don't have any objections. I worked a lot on various similarity
modules and sml became usual for me. That's why I was asking.



Hi!

I attached a new version of the patch. It fixes the names of the GUC 
variables and functions.


Now the patch introduces:
1 - functions:
- word_similarity()
2 - operators:
- %> (and <%)
- <->> (and <<->)
3 - GUC variables:
- pg_trgm.similarity_threshold
- pg_trgm.word_similarity_threshold

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/contrib/pg_trgm/pg_trgm--1.2.sql
--- b/contrib/pg_trgm/pg_trgm--1.2.sql
***
*** 3,13 
--- 3,15 
  -- complain if script is sourced in psql, rather than via CREATE EXTENSION
  \echo Use "CREATE EXTENSION pg_trgm" to load this file. \quit
  
+ -- Deprecated function
  CREATE FUNCTION set_limit(float4)
  RETURNS float4
  AS 'MODULE_PATHNAME'
  LANGUAGE C STRICT VOLATILE;
  
+ -- Deprecated function
  CREATE FUNCTION show_limit()
  RETURNS float4
  AS 'MODULE_PATHNAME'
***
*** 26,32  LANGUAGE C STRICT IMMUTABLE;
  CREATE FUNCTION similarity_op(text,text)
  RETURNS bool
  AS 'MODULE_PATHNAME'
! LANGUAGE C STRICT STABLE;  -- stable because depends on trgm_limit
  
  CREATE OPERATOR % (
  LEFTARG = text,
--- 28,34 
  CREATE FUNCTION similarity_op(text,text)
  RETURNS bool
  AS 'MODULE_PATHNAME'
! LANGUAGE C STRICT STABLE;  -- stable because depends on pg_trgm.similarity_threshold
  
  CREATE OPERATOR % (
  LEFTARG = text,
*** a/contrib/pg_trgm/trgm.h
--- b/contrib/pg_trgm/trgm.h
***
*** 105,111  typedef char *BITVECP;
  
  typedef struct TrgmPackedGraph TrgmPackedGraph;
  
! extern float4 trgm_limit;
  
  extern uint32 trgm2int(trgm *ptr);
  extern void compact_trigram(trgm *tptr, char *str, int bytelen);
--- 105,111 
  
  typedef struct TrgmPackedGraph TrgmPackedGraph;
  
! extern double similarity_threshold;
  
  extern uint32 trgm2int(trgm *ptr);
  extern void compact_trigram(trgm *tptr, char *str, int bytelen);
*** a/contrib/pg_trgm/trgm_gin.c
--- b/contrib/pg_trgm/trgm_gin.c
***
*** 206,212  gin_trgm_consistent(PG_FUNCTION_ARGS)
  			 * similarity is just c / len1.
  			 * So, independly on DIVUNION the upper bound formula is the same.
  			 */
! 			res = (nkeys == 0) ? false : ((((((float4) ntrue) / ((float4) nkeys))) >= trgm_limit) ? true : false);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
--- 206,214 
  			 * similarity is just c / len1.
  			 * So, independly on DIVUNION the upper bound formula is the same.
  			 */
! 			res = (nkeys == 0) ? false :
! ((((((float4) ntrue) / ((float4) nkeys))) >= similarity_threshold)
! 	? true : false);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
***
*** 283,289  gin_trgm_triconsistent(PG_FUNCTION_ARGS)
  			/*
  			 * See comment in gin_trgm_consistent() about * upper bound formula
  			 */
! 			res = (nkeys == 0) ? GIN_FALSE : (((((float4) ntrue) / ((float4) nkeys)) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
--- 285,293 
  			/*
  			 * See comment in gin_trgm_consistent() about * upper bound formula
  			 */
! 			res = (nkeys == 0) ? GIN_FALSE :
! (((((float4) ntrue) / ((float4) nkeys)) >= similarity_threshold)
! 	? GIN_MAYBE : GIN_FALSE);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
*** a/contrib/pg_trgm/trgm_gist.c
--- b/contrib/pg_trgm/trgm_gist.c
***
*** 294,300  gtrgm_consistent(PG_FUNCTION_ARGS)
  float4		tmpsml = cnt_sml(key, qtrg);
  
  /* strange bug at freebsd 5.2.1 and gcc 3.3.3 */
! res = (*(int *) &tmpsml == *(int *) &trgm_limit || tmpsml > trgm_limit) ? true : false;
  			}
  			else if (ISALLTRUE(key))
  			{	/* non-leaf contains signature */
--- 294,301 
  float4		tmpsml = cnt_sml(key, qtrg);
  
  /* strange bug at freebsd 5.2.1 and gcc 3.3.3 */
! res = (*(int *) &tmpsml == *(int *) &similarity_threshold
! 		|| tmpsml > similarity_threshold) ? true : false;
  			}
  			else if (ISALLTRUE(key))
  			{	/* non-leaf contains signature */
***
*** 308,314  gtrgm_consistent(PG_FUNCTION_ARGS)
  if (len == 0)
  	res = false;
  else
! 	res = (((((float8) count) / ((float8) len))) >= trgm_limit) ? true : false;
  			}
  			break;
  		case ILikeStrategyNumber:
--- 309,316 
  if (len == 0

Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-02-17 Thread Artur Zakirov

On 16.02.2016 18:14, Artur Zakirov wrote:

I attached a new version of the patch.



Sorry for the noise. I have attached a new version of the patch. I saw 
mistakes in DecodeFlag(); this patch fixes them.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
***
*** 2615,2632  SELECT plainto_tsquery('supernova star');
 
  
 
! To create an Ispell dictionary, use the built-in
! ispell template and specify several parameters:
 
! 
  
! CREATE TEXT SEARCH DICTIONARY english_ispell (
  TEMPLATE = ispell,
! DictFile = english,
! AffFile = english,
! StopWords = english
! );
  
  
 
  Here, DictFile, AffFile, and StopWords
--- 2615,2655 
 
  
 
! To create an Ispell dictionary perform these steps:
 
!
! 
!  
!   download dictionary configuration files. OpenOffice
!   extension files have the .oxt extension. It is necessary
!   to extract .aff and .dic files, change
!   extensions to .affix and .dict. For some
!   dictionary files it is also needed to convert characters to the UTF-8
!   encoding with commands (for example, for norwegian language dictionary):
  
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
! 
!  
! 
! 
!  
!   copy files to the $SHAREDIR/tsearch_data directory
!  
! 
! 
!  
!   load files into PostgreSQL with the following command:
! 
! CREATE TEXT SEARCH DICTIONARY english_hunspell (
  TEMPLATE = ispell,
! DictFile = en_us,
! AffFile = en_us,
! Stopwords = english);
  
+  
+ 
+
  
 
  Here, DictFile, AffFile, and StopWords
***
*** 2643,2648  CREATE TEXT SEARCH DICTIONARY english_ispell (
--- 2666,2721 
 
  
 
+ The .affix file of Ispell has the following
+ structure:
+ 
+ prefixes
+ flag *A:
+ .   >   RE  # As in enter > reenter
+ suffixes
+ flag T:
+ E   >   ST  # As in late > latest
+ [^AEIOU]Y   >   -Y,IEST # As in dirty > dirtiest
+ [AEIOU]Y>   EST # As in gray > grayest
+ [^EY]   >   EST # As in small > smallest
+ 
+
+
+ And the .dict file has the following structure:
+ 
+ lapse/ADGRS
+ lard/DGRS
+ large/PRTY
+ lark/MRS
+ 
+
+ 
+
+ Format of the .dict file is:
+ 
+ basic_form/affix_class_name
+ 
+
+ 
+
+ In the .affix file every affix flag is described in the
+ following format:
+ 
+ condition > [-stripping_letters,] adding_affix
+ 
+
+ 
+
+ Here, condition has a format similar to the format of regular expressions.
+ It can use groupings [...] and [^...].
+ For example, [AEIOU]Y means that the last letter of the word
+ is "y" and the penultimate letter is "a",
+ "e", "i", "o" or "u".
+ [^EY] means that the last letter is neither "e"
+ nor "y".
+
+ 
+
  Ispell dictionaries support splitting compound words;
  a useful feature.
  Notice that the affix file should specify a special flag using the
***
*** 2663,2668  SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
--- 2736,2800 
  
 
  
+
+ MySpell is very similar to Hunspell.
+ The .affix file of Hunspell has the following
+ structure:
+ 
+ PFX A Y 1
+ PFX A   0 re .
+ SFX T N 4
+ SFX T   0 st e
+ SFX T   y iest   [^aeiou]y
+ SFX T   0 est[aeiou]y
+ SFX T   0 est[^ey]
+ 
+
+ 
+
+ The first line of an affix class is the header. Fields of an affix rules are
+ listed after the header:
+
+
+ 
+  
+   parameter name (PFX or SFX)
+  
+ 
+ 
+  
+   flag (name of the affix class)
+  
+ 
+ 
+  
+   stripping characters from beginning (at prefix) or end (at suffix) of the
+   word
+  
+ 
+ 
+  
+   adding affix
+  
+ 
+ 
+  
+   condition that has a format similar to the format of regular expressions.
+  
+ 
+
+ 
+
+ The .dict file looks like the .dict file of
+ Ispell:
+ 
+ larder/M
+ lardy/RT
+ large/RSPMYT
+ largehearted
+ 
+
+ 
 
  
   MySpell does not support compound words.
*** a/src/backend/tsearch/Makefile
--- b/src/backend/tsearch/Makefile
***
*** 13,20  include $(top_builddir)/src/Makefile.global
  
  DICTDIR=tsearch_data
  
! DICTFILES=synonym_sample.syn thesaurus_sample.ths hunspell_sample.affix \
! 	ispell_sample.affix ispell_sample.dict
  
  OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
  	dict_simple.o dict_synonym.o dict_thesaurus.o \
--- 13,23 
  
  DICTDIR=tsearch_data
  
! DICTFILES=di

Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-02-16 Thread Artur Zakirov

I attached a new version of the patch.

On 10.02.2016 19:46, Teodor Sigaev wrote:



I duplicate the patch here.


It's a very good thing to update the dictionaries to support modern versions.
And thank you for improving the documentation. Also I'm impressed by the long
description in the spell.c header.

Some notes about the code:

1
  struct SPELL. Why do you remove union p? You left a comment
  about using the d struct instead of the flag field and, as I can see,
  it's the right comment. It increases the size of the SPELL structure.


Fixed.



2 struct AFFIX. I agree with Alvaro that the sum of the sizes of the bit fields
should be less than or equal to the size of an integer. Otherwise, I suppose,
we can get undefined behavior. Please split the bitfields into two integers.


Fixed.



3 unsigned char flagval[65000];
   Is it forbidden to use a 6-digit number? In any case, decodeFlag() doesn't
   restrict its return value. I suggest enlarging the array to 1<<16 and adding
   a limit on the return value of decodeFlag().


The flagval array was enlarged. Added verification of the return value of 
DecodeFlag() for the various FLAG parameter values (FM_LONG, FM_NUM and FM_CHAR).
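
As a rough illustration of what these FLAG modes mean (the real DecodeFlag() in spell.c also range-checks the decoded values; the function below is only a hypothetical sketch of the splitting rules):

```python
def decode_flags(field, mode="char"):
    """Split an affix-flag field according to the Hunspell FLAG mode:
    'char' - one character per flag (the default),
    'long' - two characters per flag,
    'num'  - comma-separated decimal numbers."""
    if mode == "char":
        return list(field)
    if mode == "long":
        return [field[i:i + 2] for i in range(0, len(field), 2)]
    if mode == "num":
        return [f for f in field.split(",") if f]
    raise ValueError("unknown FLAG mode: %s" % mode)

print(decode_flags("ADGRS"))                # → ['A', 'D', 'G', 'R', 'S']
print(decode_flags("cFdS", mode="long"))    # → ['cF', 'dS']
print(decode_flags("123,456", mode="num"))  # → ['123', '456']
```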




4
  I'd like to see a short comment describing at least new functions


Added some comments which describe the new functions and the old functions for 
loading dictionaries into PostgreSQL. This patch adds new functions and 
modifies functions which are used for loading dictionaries.


At the moment, the comments do not describe the functions used for word 
normalization. But I can add more comments.




5
  Pls, add tests for new code.




Added tests. The old sample dictionary files were moved to the folder 
"dicts". New sample dictionary files were added:

- hunspell_sample_long.affix
- hunspell_sample_long.dict
- hunspell_sample_num.affix
- hunspell_sample_num.dict
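
To make the affix machinery discussed here more concrete, below is a hypothetical Python sketch of applying one Hunspell-style suffix rule (strip/add/condition, as in the SFX lines documented in the attached patch). It only illustrates the rule semantics; the function name and shape are invented, not the spell.c code:

```python
import re

def apply_suffix(word, strip, add, condition):
    """Apply one SFX rule, e.g. 'SFX T   y iest [^aeiou]y'
    turns 'dirty' into 'dirtiest'. The condition is matched
    against the end of the word; '0' means nothing to strip."""
    if not re.search(condition + "$", word):
        return None
    if strip != "0":
        if not word.endswith(strip):
            return None
        word = word[:-len(strip)]
    return word + add

print(apply_suffix("dirty", "y", "iest", "[^aeiou]y"))  # → dirtiest
print(apply_suffix("gray", "0", "est", "[aeiou]y"))     # → grayest
```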

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
***
*** 2615,2632  SELECT plainto_tsquery('supernova star');
 
  
 
! To create an Ispell dictionary, use the built-in
! ispell template and specify several parameters:
 
! 
  
! CREATE TEXT SEARCH DICTIONARY english_ispell (
  TEMPLATE = ispell,
! DictFile = english,
! AffFile = english,
! StopWords = english
! );
  
  
 
  Here, DictFile, AffFile, and StopWords
--- 2615,2655 
 
  
 
! To create an Ispell dictionary perform these steps:
 
!
! 
!  
!   download dictionary configuration files. OpenOffice
!   extension files have the .oxt extension. It is necessary
!   to extract .aff and .dic files, change extensions
!   to .affix and .dict. For some dictionary
!   files it is also needed to convert characters to the UTF-8 encoding
!   with commands (for example, for norwegian language dictionary):
  
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
! 
!  
! 
! 
!  
!   copy files to the $SHAREDIR/tsearch_data directory
!  
! 
! 
!  
!   load files into PostgreSQL with the following command:
! 
! CREATE TEXT SEARCH DICTIONARY english_hunspell (
  TEMPLATE = ispell,
! DictFile = en_us,
! AffFile = en_us,
! Stopwords = english);
  
+  
+ 
+
  
 
  Here, DictFile, AffFile, and StopWords
***
*** 2643,2648  CREATE TEXT SEARCH DICTIONARY english_ispell (
--- 2666,2720 
 
  
 
+ The .affix file of Ispell has the following structure:
+ 
+ prefixes
+ flag *A:
+ .   >   RE  # As in enter > reenter
+ suffixes
+ flag T:
+ E   >   ST  # As in late > latest
+ [^AEIOU]Y   >   -Y,IEST # As in dirty > dirtiest
+ [AEIOU]Y>   EST # As in gray > grayest
+ [^EY]   >   EST # As in small > smallest
+ 
+
+
+ And the .dict file has the following structure:
+ 
+ lapse/ADGRS
+ lard/DGRS
+ large/PRTY
+ lark/MRS
+ 
+
+ 
+
+ Format of the .dict file is:
+ 
+ basic_form/affix_class_name
+ 
+
+ 
+
+ In the .affix file every affix flag is described in the
+ following format:
+ 
+ condition > [-stripping_letters,] adding_affix
+ 
+
+ 
+
+ Here, condition has a format similar to the format of regular expressions.
+ It can use groupings [...] and [^...].
+ For example, [AEIOU]Y means that the last letter of the word
+ is "y" and the penultimate letter is "a",
+ "e", "i", "o" or "u".
+ [^EY] means that the last letter is neither "e"
+ nor "y".
+
+ 
+
  Ispell dictionaries support splitting compound words;
  a useful feature.
  Notice that the affix file should specify a special flag using the
***
*** 2663,2668  SELECT ts_lexize('norwegian_ispell', 'sjokolad

Re: [HACKERS] commitfest problem ?

2016-02-16 Thread Artur Zakirov

On 16.02.2016 17:50, Oleg Bartunov wrote:

This entry https://commitfest.postgresql.org/8/419/ contains very
unrelated patches from another commitfest. I think


Oleg


No, this is not a commitfest problem. The part of the thread "[PROPOSAL] 
Improvements of Hunspell dictionaries support" 
(http://www.postgresql.org/message-id/56aa02ee.6090...@postgrespro.ru) 
was moved to the thread "[PATCH] we have added support for box type in 
SP-GiST index" by mistake.

I noticed it too late. I was writing into the wrong thread. And now the 
commitfest shows the wrong last attachment.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] IF (NOT) EXISTS in psql-completion

2016-02-15 Thread Artur Zakirov

On 05.02.2016 11:09, Kyotaro HORIGUCHI wrote:

Hello,

I considered how to make tab-completion robust for syntactical
noises, in other words, optional words in syntax. Typically "IF
(NOT) EXISTS", UNIQUE and TEMPORARY are words that don't affect
further completion. However, the current delimit-matching
mechanism is not so capable (or is complexty-prone) to live with
such noises. I have proposed to use regular expressions or
simplified one for the robustness but it was too complex to be
applied.

This is another answer for the problem. Removal of such words
on-the-fly makes further matching more robust.

Next, currently some CREATE xxx subsyntaxes of CREATE SCHEMA are
matched using TailMatching but it makes difficult the
options-removal operations, which needs forward matching.

So I introduced two things to resolve them by this patch.



I did some tests with your patch. But I am not confident in tab-complete.c.

And I have some notes:

1 - I executed the git apply command and got the following warning:

../0001-Suggest-IF-NOT-EXISTS-for-tab-completion-of-psql.patch:302: 
trailing whitespace.

/*
warning: 1 line adds whitespace errors.

I think this is because of superfluous whitespace.

2 - In psql I write "create table if" and press Tab. psql adds the 
following:


create table IF NOT EXISTS

I think psql should continue with lower case if the user wrote the query in 
lower case:


create table if not exists

3 - Same with "IF EXISTS". If I write "alter view if" and press Tab, 
psql writes:


alter view IF EXISTS
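
The case-preservation behavior requested in notes 2 and 3 can be sketched like this (a hypothetical helper for illustration, not the actual tab-complete.c code):

```python
def complete(typed, keyword):
    """Complete `typed` to `keyword`, keeping the case style the
    user started with: all-lower-case input gets a lower-case
    completion, anything else gets the canonical form."""
    if not keyword.lower().startswith(typed.lower()):
        return None
    tail = keyword[len(typed):]
    if typed.islower():
        tail = tail.lower()
    return typed + tail

print(complete("if", "IF NOT EXISTS"))  # → if not exists
print(complete("IF", "IF NOT EXISTS"))  # → IF NOT EXISTS
```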

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-11 Thread Artur Zakirov

On 11.02.2016 01:19, Tom Lane wrote:

I wrote:

Artur Zakirov <a.zaki...@postgrespro.ru> writes:

I think this is not a bug. It is a normal behavior. In Mac OS sscanf()
with the %s format reads the string one character at a time. The size of
letter 'х' is 2. And sscanf() separates it into two wrong characters.



That argument might be convincing if OSX behaved that way for all
multibyte characters, but it doesn't seem to be doing that.  Why is
only 'х' affected?


I looked into the OS X sources, and found that indeed you are right:
*scanf processes the input a byte at a time, and applies isspace() to
each byte separately, even when the locale is such that that's a clearly
insane thing to do.  Since this code was derived from FreeBSD, FreeBSD
has or once had the same issue.  (A look at the freebsd project on github
says it still does, assuming that's the authoritative repo.)  Not sure
about other BSDen.

I also verified that in UTF8-based locales, isspace() thinks that 0x85 and
0xA0, and no other high-bit-set values, are spaces.  Not sure exactly why
it thinks that, but that explains why 'х' fails when adjacent code points
don't.

So apparently the coding rule we have to adopt is "don't use *scanf()
on data that might contain multibyte characters".  (There might be corner
cases where it'd work all right for conversion specifiers other than %s,
but probably you might as well just use strtol and friends in such cases.)
Ugh.

regards, tom lane



Yes, I meant this. The second byte divides the word into two wrong pieces.

Sorry for my unclear explanation. I should have explained it more clearly.
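
The effect is easy to reproduce outside of C. Here is a short Python sketch of what a byte-at-a-time scanner does to this letter when it treats 0x85 (and 0xA0) as whitespace, as described above:

```python
word = "х"                      # Cyrillic letter, two bytes in UTF-8
raw = word.encode("utf-8")
print([hex(b) for b in raw])    # → ['0xd1', '0x85']

# Scan byte by byte, treating 0x85 and 0xA0 as "whitespace" the way
# the OS X isspace() does in UTF-8 locales:
bogus_spaces = {0x85, 0xA0}
pieces, cur = [], bytearray()
for b in raw:
    if b in bogus_spaces:
        if cur:
            pieces.append(bytes(cur))
        cur = bytearray()
    else:
        cur.append(b)
if cur:
    pieces.append(bytes(cur))
print(pieces)                   # → [b'\xd1'] - the second byte was eaten
```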

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-11 Thread Artur Zakirov

On 11.02.2016 03:33, Tom Lane wrote:

Artur Zakirov <a.zaki...@postgrespro.ru> writes:

  [ tsearch_aff_parse_v1.patch ]


I've pushed this with some corrections --- notably, I did not like the
lack of buffer-overrun prevention, and it did the wrong thing if a line
had more than one trailing space character.

We still need to look at other uses of *scanf(), but I think that any
other changes that might be needed should be separate patches anyhow.

regards, tom lane



Thank you!

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-02-11 Thread Artur Zakirov

Thank you for the review.

On 10.02.2016 19:46, Teodor Sigaev wrote:



I duplicate the patch here.


It's a very good thing to update the dictionaries to support modern versions.
And thank you for improving the documentation. Also I'm impressed by the long
description in the spell.c header.

Some notes about the code:

1
  struct SPELL. Why do you remove union p? You left a comment
  about using the d struct instead of the flag field and, as I can see,
  it's the right comment. It increases the size of the SPELL structure.


I will fix it. I had misunderstood Alvaro's comment about it.



2 struct AFFIX. I agree with Alvaro that the sum of the sizes of the bit fields
should be less than or equal to the size of an integer. Otherwise, I suppose,
we can get undefined behavior. Please split the bitfields into two integers.


I will fix it. Here I had misunderstood too.



3 unsigned char flagval[65000];
   Is it forbidden to use a 6-digit number? In any case, decodeFlag() doesn't
   restrict its return value. I suggest enlarging the array to 1<<16 and adding
   a limit on the return value of decodeFlag().


I think it can be done.



4
  I'd like to see a short comment describing at least new functions


Now in spell.c there are more comments. I wanted to send the fixed patch 
after adding all the comments that I want to add. But I can send the patch now.
Also I will merge this commit 
http://www.postgresql.org/message-id/e1atf9o-0001ga...@gemulon.postgresql.org




5
  Pls, add tests for new code.




I will add.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension

2016-02-11 Thread Artur Zakirov

On 11.02.2016 16:35, Mike Rylander wrote:

On Thu, Feb 11, 2016 at 8:11 AM, Teodor Sigaev <teo...@sigaev.ru> wrote:

I have attached a new version of the patch. It fixes error of operators
<->> and
%>:
- operator <->> did not pass the regression test in CentOS 32 bit (gcc
4.4.7
20120313).
- operator %> did not pass the regression test in FreeBSD 32 bit (gcc
4.2.1
20070831).

It was because of variable optimization by gcc.



Fixed with volatile modifier, right?

I'm close to pushing these patches, but I still have doubts about the names,
and I'd like to see comments from English speakers:
1 sml_limit GUC variable (options: similarity_limit, sml_threshold)
2 subword_similarity(). Actually, it finds the most similar word (not
substring!) in the whole string. word_similarity? word_in_string_similarity?



At least for this English speaker, substring_similarity is not
confusing even if it's not internally accurate, but English is a
strange language.

Because I want the bike shed to be blue, how does
query_string_similarity sound instead?  If that's overly precise, then
word_similarity would be fine with me.

Thanks,

--
Mike Rylander



Great. I can change the names:
1 - sml_limit to similarity_limit. sml_threshold is difficult to write, I 
think; similarity_limit is simpler.

2 - subword_similarity() to word_similarity().

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension

2016-02-11 Thread Artur Zakirov

On 11.02.2016 16:11, Teodor Sigaev wrote:

I have attached a new version of the patch. It fixes error of
operators <->> and
%>:
- operator <->> did not pass the regression test in CentOS 32 bit (gcc
4.4.7
20120313).
- operator %> did not pass the regression test in FreeBSD 32 bit (gcc
4.2.1
20070831).

It was because of variable optimization by gcc.


Fixed with volatile modifier, right?


Yes, it was fixed with the volatile modifier.



I'm close to pushing these patches, but I still have doubts about the names,
and I'd like to see comments from English speakers:
1 sml_limit GUC variable (options: similarity_limit, sml_threshold)
2 subword_similarity(). Actually, it finds the most similar word (not
substring!) in the whole string. word_similarity? word_in_string_similarity?


substring_similarity_pos() could be a separate patch.




--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-10 Thread Artur Zakirov

On 09.02.2016 20:13, Tom Lane wrote:

I do not like this patch much.  It is basically "let's stop using sscanf()
because it seems to have a bug on one platform".  There are at least two
things wrong with that approach:

1. By my count there are about 80 uses of *scanf() in our code.  Are we
going to replace every one of them with hand-rolled code?  If not, why
is only this instance vulnerable?  How can we know whether future uses
will have a problem?


It seems that *scanf() with the %s format occurs only here:
- check.c - get_bin_version()
- server.c - get_major_server_version()
- filemap.c - isRelDataFile()
- pg_backup_directory.c - _LoadBlobs()
- xlog.c - do_pg_stop_backup()
- mac.c - macaddr_in()
I think here sscanf() does not work with UTF-8 characters. And 
probably this is only a spell.c issue.


I agree that the previous patch is wrong. Instead of using a new 
parse_ooaffentry() function, maybe it is better to use sscanf() with the 
%ls format. The %ls format is used to read a wide character string.




2. We're not being very good citizens of the software universe if we
just install a hack in Postgres rather than nagging Apple to fix the
bug at its true source.

I think the appropriate next step to take is to dig into the OS X
sources (see http://www.opensource.apple.com, I think probably the
relevant code is in the Libc package) and identify exactly what is
causing the misbehavior.  That would both allow an informed answer
to point #1 and greatly increase the odds of getting action on a
bug report to Apple.  Even if we end up applying this patch verbatim,
I think we need that information first.

regards, tom lane



I think this is not a bug; it is normal behavior. In Mac OS sscanf() 
with the %s format reads the string one byte at a time. The letter 
'х' is 2 bytes, and sscanf() separates it into two wrong characters.


In conclusion, I think sscanf() with the %ls format should be used in spell.c.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension

2016-02-10 Thread Artur Zakirov

On 02.02.2016 15:45, Artur Zakirov wrote:

On 01.02.2016 20:12, Artur Zakirov wrote:


I have changed the patch:
1 - trgm2.data was corrected, duplicates were deleted.
2 - I have added operators <<-> and <->> with GiST index supporting. A
regression test will pass only with the patch
http://www.postgresql.org/message-id/capphfdt19fwqxaryjkzxb3oxmv-kan3fluzrooare_u3h3c...@mail.gmail.com


3 - the function substring_similarity() was renamed to
subword_similarity().

But there is not a function substring_similarity_pos() yet. It is not
trivial.



Sorry, there was a typo in the previous patch. Here is the fixed patch.



I have attached a new version of the patch. It fixes errors of the operators 
<->> and %>:
- operator <->> did not pass the regression test on CentOS 32-bit (gcc 
4.4.7 20120313).
- operator %> did not pass the regression test on FreeBSD 32-bit (gcc 
4.2.1 20070831).


It was because of variable optimization by gcc.

In this patch the pg_trgm documentation was corrected. Now the operators are 
written as %> and <->> (not <% and <<->).


There is a problem with adding the substring_similarity_pos() function. It 
can bring additional overhead, because we would additionally need to store 
character positions, including spaces; spaces between words are lost in the 
current implementation.

Is it actually needed?


In conclusion, this patch introduces:
1 - functions:
- subword_similarity()
2 - operators:
- %>
- <->>
3 - GUC variables:
- pg_trgm.sml_limit
- pg_trgm.subword_limit
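
As a rough sketch of what subword_similarity() computes (the best trigram similarity between the query and any contiguous run of words in the string), here is an O(n²) Python illustration. It is not the extension's algorithm; the helpers are simplified stand-ins:

```python
def trigrams(s):
    # pg_trgm-style padding: two leading spaces, one trailing space
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    shared = len(ta & tb)
    return shared / (len(ta) + len(tb) - shared)

def subword_similarity(query, string):
    """Best similarity between `query` and any contiguous run
    of words in `string`."""
    words = string.split()
    best = 0.0
    for i in range(len(words)):
        for j in range(i, len(words)):
            best = max(best, similarity(query, " ".join(words[i:j + 1])))
    return best

print(subword_similarity("word", "two words in a row") >
      similarity("word", "two words in a row"))  # → True
```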

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/contrib/pg_trgm/pg_trgm--1.2.sql
--- b/contrib/pg_trgm/pg_trgm--1.2.sql
***
*** 3,13 
--- 3,15 
  -- complain if script is sourced in psql, rather than via CREATE EXTENSION
  \echo Use "CREATE EXTENSION pg_trgm" to load this file. \quit
  
+ -- Deprecated function
  CREATE FUNCTION set_limit(float4)
  RETURNS float4
  AS 'MODULE_PATHNAME'
  LANGUAGE C STRICT VOLATILE;
  
+ -- Deprecated function
  CREATE FUNCTION show_limit()
  RETURNS float4
  AS 'MODULE_PATHNAME'
***
*** 26,32  LANGUAGE C STRICT IMMUTABLE;
  CREATE FUNCTION similarity_op(text,text)
  RETURNS bool
  AS 'MODULE_PATHNAME'
! LANGUAGE C STRICT STABLE;  -- stable because depends on trgm_limit
  
  CREATE OPERATOR % (
  LEFTARG = text,
--- 28,34 
  CREATE FUNCTION similarity_op(text,text)
  RETURNS bool
  AS 'MODULE_PATHNAME'
! LANGUAGE C STRICT STABLE;  -- stable because depends on pg_trgm.sml_limit
  
  CREATE OPERATOR % (
  LEFTARG = text,
*** a/contrib/pg_trgm/trgm.h
--- b/contrib/pg_trgm/trgm.h
***
*** 105,111  typedef char *BITVECP;
  
  typedef struct TrgmPackedGraph TrgmPackedGraph;
  
! extern float4 trgm_limit;
  
  extern uint32 trgm2int(trgm *ptr);
  extern void compact_trigram(trgm *tptr, char *str, int bytelen);
--- 105,111 
  
  typedef struct TrgmPackedGraph TrgmPackedGraph;
  
! extern double trgm_sml_limit;
  
  extern uint32 trgm2int(trgm *ptr);
  extern void compact_trigram(trgm *tptr, char *str, int bytelen);
*** a/contrib/pg_trgm/trgm_gin.c
--- b/contrib/pg_trgm/trgm_gin.c
***
*** 206,212  gin_trgm_consistent(PG_FUNCTION_ARGS)
  			 * similarity is just c / len1.
  			 * So, independly on DIVUNION the upper bound formula is the same.
  			 */
! 			res = (nkeys == 0) ? false : ((((((float4) ntrue) / ((float4) nkeys))) >= trgm_limit) ? true : false);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
--- 206,213 
  			 * similarity is just c / len1.
  			 * So, independly on DIVUNION the upper bound formula is the same.
  			 */
! 			res = (nkeys == 0) ? false :
! ((((((float4) ntrue) / ((float4) nkeys))) >= trgm_sml_limit) ? true : false);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
***
*** 283,289  gin_trgm_triconsistent(PG_FUNCTION_ARGS)
  			/*
  			 * See comment in gin_trgm_consistent() about * upper bound formula
  			 */
! 			res = (nkeys == 0) ? GIN_FALSE : (((((float4) ntrue) / ((float4) nkeys)) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
--- 284,291 
  			/*
  			 * See comment in gin_trgm_consistent() about * upper bound formula
  			 */
! 			res = (nkeys == 0) ? GIN_FALSE :
! (((((float4) ntrue) / ((float4) nkeys)) >= trgm_sml_limit) ? GIN_MAYBE : GIN_FALSE);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
*** a/contrib/pg_trgm/trgm_gist.c
--- b/contrib/pg_trgm/trgm_gist.c
***
*** 294,300  gtrgm_consistent(PG_FUNCTION_ARGS)
  float4		tmpsml = cnt_sml(key, qtrg);
  
  /* strange bug at freebsd 5.2.1 and gcc 3.3.3 */
! res = (*(int *) &tmpsml == *(int *) &trgm_limit || tmpsml > trgm_limit) ? true : false;
  			}
  			else if (ISALLTRUE(key))
  			{	/* non-leaf contains signature */
--- 294,301

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-02-10 Thread Artur Zakirov

On 10.02.2016 18:51, Teodor Sigaev wrote:

Hmm. Here
src/backend/access/transam/xlog.c read_tablespace_map()
using %s in scanf looks suspicious. I don't fully understand, but it
looks like it tries to read the oid as a string. So, it should be safe in
the usual case

Next, _LoadBlobs() reads the file name (fname) with the help of sscanf. Could
the file name be in UTF-8 encoding here?


This function reads the "blobs.toc" file. Its lines have the following 
format:

<oid> <fname>

Where <fname> is blob_<oid>.dat.

Therefore it should be safe too.
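
For illustration, a line of this format can be parsed without sscanf() at all; the sketch below is hypothetical (not pg_dump code) and just shows that both fields are plain ASCII and split cleanly on whitespace:

```python
def parse_blobs_toc_line(line):
    """Parse one '<oid> <fname>' line of a directory-format
    blobs.toc, e.g. '16385 blob_16385.dat'."""
    oid_text, fname = line.split()
    return int(oid_text), fname

oid, fname = parse_blobs_toc_line("16385 blob_16385.dat\n")
print(oid, fname)  # → 16385 blob_16385.dat
```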

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension

2016-02-02 Thread Artur Zakirov

On 01.02.2016 20:12, Artur Zakirov wrote:


I have changed the patch:
1 - trgm2.data was corrected, duplicates were deleted.
2 - I have added operators <<-> and <->> with GiST index supporting. A
regression test will pass only with the patch
http://www.postgresql.org/message-id/capphfdt19fwqxaryjkzxb3oxmv-kan3fluzrooare_u3h3c...@mail.gmail.com

3 - the function substring_similarity() was renamed to
subword_similarity().

But there is not a function substring_similarity_pos() yet. It is not
trivial.



Sorry, there was a typo in the previous patch. Here is the fixed patch.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/contrib/pg_trgm/pg_trgm--1.2.sql
--- b/contrib/pg_trgm/pg_trgm--1.2.sql
***
*** 3,13 
--- 3,15 
  -- complain if script is sourced in psql, rather than via CREATE EXTENSION
  \echo Use "CREATE EXTENSION pg_trgm" to load this file. \quit
  
+ -- Deprecated function
  CREATE FUNCTION set_limit(float4)
  RETURNS float4
  AS 'MODULE_PATHNAME'
  LANGUAGE C STRICT VOLATILE;
  
+ -- Deprecated function
  CREATE FUNCTION show_limit()
  RETURNS float4
  AS 'MODULE_PATHNAME'
***
*** 26,32  LANGUAGE C STRICT IMMUTABLE;
  CREATE FUNCTION similarity_op(text,text)
  RETURNS bool
  AS 'MODULE_PATHNAME'
! LANGUAGE C STRICT STABLE;  -- stable because depends on trgm_limit
  
  CREATE OPERATOR % (
  LEFTARG = text,
--- 28,34 
  CREATE FUNCTION similarity_op(text,text)
  RETURNS bool
  AS 'MODULE_PATHNAME'
! LANGUAGE C STRICT STABLE;  -- stable because depends on pg_trgm.sml_limit
  
  CREATE OPERATOR % (
  LEFTARG = text,
*** a/contrib/pg_trgm/trgm.h
--- b/contrib/pg_trgm/trgm.h
***
*** 105,111  typedef char *BITVECP;
  
  typedef struct TrgmPackedGraph TrgmPackedGraph;
  
! extern float4 trgm_limit;
  
  extern uint32 trgm2int(trgm *ptr);
  extern void compact_trigram(trgm *tptr, char *str, int bytelen);
--- 105,111 
  
  typedef struct TrgmPackedGraph TrgmPackedGraph;
  
! extern double trgm_sml_limit;
  
  extern uint32 trgm2int(trgm *ptr);
  extern void compact_trigram(trgm *tptr, char *str, int bytelen);
*** a/contrib/pg_trgm/trgm_gin.c
--- b/contrib/pg_trgm/trgm_gin.c
***
*** 206,212  gin_trgm_consistent(PG_FUNCTION_ARGS)
  			 * similarity is just c / len1.
  			 * So, independly on DIVUNION the upper bound formula is the same.
  			 */
! 			res = (nkeys == 0) ? false : (((((float4) ntrue) / ((float4) nkeys)) >= trgm_limit) ? true : false);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
--- 206,213 
  			 * similarity is just c / len1.
  			 * So, independly on DIVUNION the upper bound formula is the same.
  			 */
! 			res = (nkeys == 0) ? false :
! 				(((((float4) ntrue) / ((float4) nkeys)) >= trgm_sml_limit) ? true : false);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
***
*** 283,289  gin_trgm_triconsistent(PG_FUNCTION_ARGS)
  			/*
  			 * See comment in gin_trgm_consistent() about * upper bound formula
  			 */
! 			res = (nkeys == 0) ? GIN_FALSE : (((((float4) ntrue) / ((float4) nkeys)) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
--- 284,291 
  			/*
  			 * See comment in gin_trgm_consistent() about * upper bound formula
  			 */
! 			res = (nkeys == 0) ? GIN_FALSE :
! 				(((((float4) ntrue) / ((float4) nkeys)) >= trgm_sml_limit) ? GIN_MAYBE : GIN_FALSE);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
*** a/contrib/pg_trgm/trgm_gist.c
--- b/contrib/pg_trgm/trgm_gist.c
***
*** 294,300  gtrgm_consistent(PG_FUNCTION_ARGS)
  float4		tmpsml = cnt_sml(key, qtrg);
  
  /* strange bug at freebsd 5.2.1 and gcc 3.3.3 */
! res = (*(int *) &tmpsml == *(int *) &trgm_limit || tmpsml > trgm_limit) ? true : false;
  			}
  			else if (ISALLTRUE(key))
  			{	/* non-leaf contains signature */
--- 294,301 
  float4		tmpsml = cnt_sml(key, qtrg);
  
  /* strange bug at freebsd 5.2.1 and gcc 3.3.3 */
! res = (*(int *) &tmpsml == *(int *) &trgm_sml_limit
! 		|| tmpsml > trgm_sml_limit) ? true : false;
  			}
  			else if (ISALLTRUE(key))
  			{	/* non-leaf contains signature */
***
*** 308,314  gtrgm_consistent(PG_FUNCTION_ARGS)
  if (len == 0)
  	res = false;
  else
! 	res = ((((float8) count) / ((float8) len)) >= trgm_limit) ? true : false;
  			}
  			break;
  		case ILikeStrategyNumber:
--- 309,315 
  if (len == 0)
  	res = false;
  else
! 	res = ((((float8) count) / ((float8) len)) >= trgm_sml_limit) ? true : false;
  			}
  			break;
  		case ILikeStrategyNumber:
*** a/contrib/pg_trgm/trgm_op.c
--- b/contrib/pg_trgm/trgm_op.c
***
*** 14,20 
  
  PG_MODULE_MAGIC;
  
! float4		trgm_limit = 0.3f;
  
  PG_FUNCTION_INFO_V1(set_limit);
  PG_FUNCTION_INFO_V1(show_limit);
--- 14,23 
  

Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension

2016-02-01 Thread Artur Zakirov

On 29.01.2016 18:58, Artur Zakirov wrote:

On 29.01.2016 18:39, Alvaro Herrera wrote:

Teodor Sigaev wrote:

The behavior of this function is surprising to me.

select substring_similarity('dog' ,  'hotdogpound') ;

  substring_similarity
--
  0.25


Substring search was designed to search for a similar word in a string:
contrib_regression=# select substring_similarity('dog' ,  'hot
dogpound') ;
  substring_similarity
--
  0.75

contrib_regression=# select substring_similarity('dog' ,  'hot dog
pound') ;
  substring_similarity
--
 1


Hmm, this behavior looks too much like magic to me.  I mean, a substring
is a substring -- why are we treating the space as a special character
here?



I think I can rename this function to subword_similarity() and correct
the documentation.

The current behavior is designed to find the most similar word in a text.
For example, if we search for just a substring (not a word), we
get the following result:

select substring_similarity('dog', 'dogmatist');
  substring_similarity
-
 1
(1 row)

But this is wrong I think. They are completely different words.

Maybe another function should be added for searching a similar
substring (not a word) in a text?



I have changed the patch:
1 - trgm2.data was corrected, duplicates were deleted.
2 - I have added operators <<-> and <->> with GiST index support. A 
regression test will pass only with the patch 
http://www.postgresql.org/message-id/capphfdt19fwqxaryjkzxb3oxmv-kan3fluzrooare_u3h3c...@mail.gmail.com

3 - the function substring_similarity() was renamed to subword_similarity().

But there is no substring_similarity_pos() function yet. It is not 
trivial.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/contrib/pg_trgm/pg_trgm--1.2.sql
--- b/contrib/pg_trgm/pg_trgm--1.2.sql
***
*** 3,13 
--- 3,15 
  -- complain if script is sourced in psql, rather than via CREATE EXTENSION
  \echo Use "CREATE EXTENSION pg_trgm" to load this file. \quit
  
+ -- Deprecated function
  CREATE FUNCTION set_limit(float4)
  RETURNS float4
  AS 'MODULE_PATHNAME'
  LANGUAGE C STRICT VOLATILE;
  
+ -- Deprecated function
  CREATE FUNCTION show_limit()
  RETURNS float4
  AS 'MODULE_PATHNAME'
***
*** 26,32  LANGUAGE C STRICT IMMUTABLE;
  CREATE FUNCTION similarity_op(text,text)
  RETURNS bool
  AS 'MODULE_PATHNAME'
! LANGUAGE C STRICT STABLE;  -- stable because depends on trgm_limit
  
  CREATE OPERATOR % (
  LEFTARG = text,
--- 28,34 
  CREATE FUNCTION similarity_op(text,text)
  RETURNS bool
  AS 'MODULE_PATHNAME'
! LANGUAGE C STRICT STABLE;  -- stable because depends on pg_trgm.sml_limit
  
  CREATE OPERATOR % (
  LEFTARG = text,
*** a/contrib/pg_trgm/trgm.h
--- b/contrib/pg_trgm/trgm.h
***
*** 105,111  typedef char *BITVECP;
  
  typedef struct TrgmPackedGraph TrgmPackedGraph;
  
! extern float4 trgm_limit;
  
  extern uint32 trgm2int(trgm *ptr);
  extern void compact_trigram(trgm *tptr, char *str, int bytelen);
--- 105,111 
  
  typedef struct TrgmPackedGraph TrgmPackedGraph;
  
! extern double trgm_sml_limit;
  
  extern uint32 trgm2int(trgm *ptr);
  extern void compact_trigram(trgm *tptr, char *str, int bytelen);
*** a/contrib/pg_trgm/trgm_gin.c
--- b/contrib/pg_trgm/trgm_gin.c
***
*** 206,212  gin_trgm_consistent(PG_FUNCTION_ARGS)
  			 * similarity is just c / len1.
  			 * So, independly on DIVUNION the upper bound formula is the same.
  			 */
! 			res = (nkeys == 0) ? false : (((((float4) ntrue) / ((float4) nkeys)) >= trgm_limit) ? true : false);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
--- 206,213 
  			 * similarity is just c / len1.
  			 * So, independly on DIVUNION the upper bound formula is the same.
  			 */
! 			res = (nkeys == 0) ? false :
! 				(((((float4) ntrue) / ((float4) nkeys)) >= trgm_sml_limit) ? true : false);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
***
*** 283,289  gin_trgm_triconsistent(PG_FUNCTION_ARGS)
  			/*
  			 * See comment in gin_trgm_consistent() about * upper bound formula
  			 */
! 			res = (nkeys == 0) ? GIN_FALSE : (((((float4) ntrue) / ((float4) nkeys)) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
--- 284,291 
  			/*
  			 * See comment in gin_trgm_consistent() about * upper bound formula
  			 */
! 			res = (nkeys == 0) ? GIN_FALSE :
! 				(((((float4) ntrue) / ((float4) nkeys)) >= trgm_sml_limit) ? GIN_MAYBE : GIN_FALSE);
  			break;
  		case ILikeStrategyNumber:
  #ifndef IGNORECASE
*** a/contrib/pg_trgm/trgm_gist.c
--- b/contrib/pg_trgm/trgm_gist.c
***
*** 294,300  gtrgm_consistent(PG_FUNCTION_ARGS)
  float4		tmpsml = cnt

Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension

2016-01-29 Thread Artur Zakirov

On 29.01.2016 17:15, Teodor Sigaev wrote:

The behavior of this function is surprising to me.

select substring_similarity('dog' ,  'hotdogpound') ;

  substring_similarity
--
  0.25


Substring search was designed to search for a similar word in a string:
contrib_regression=# select substring_similarity('dog' ,  'hot dogpound') ;
  substring_similarity
--
  0.75

contrib_regression=# select substring_similarity('dog' ,  'hot dog
pound') ;
  substring_similarity
--
 1
It seems to me that users search for words in a long string. But I agree
that a more detailed explanation is needed and maybe we need to change
the feature name to fuzzywordsearch or something else, I can't imagine what.



Thank you for the review. I will rename the function. Maybe to 
subword_similarity()?






Also, should we have a function which indicates the position in the
2nd string at which the most similar match to the 1st argument occurs?

select substring_similarity_pos('dog' ,  'hotdogpound') ;

answering: 4

Interesting, I think, it will be useful in some cases.



We could call them <<-> and <->> , where the first corresponds to <%
and the second to %>

Agree


I will add them.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-01-29 Thread Artur Zakirov

On 28.01.2016 17:42, Artur Zakirov wrote:

On 27.01.2016 15:28, Artur Zakirov wrote:

On 27.01.2016 14:14, Stas Kelvich wrote:

Hi.

I tried that and confirm the strange behaviour. It seems that the problem
is with the small Cyrillic letter ‘х’. (simplest obscene language filter? =)

That can be reproduced with simpler test

Stas




The test program was corrected. Now it uses the wchar_t type, and it works
correctly and gives the right output.

I think the NIImportOOAffixes() in spell.c should be corrected to avoid
this bug.



I have attached a patch. It adds new functions parse_ooaffentry() and
get_nextentry() and fixes a couple of comments.

Now Russian and other supported dictionaries can be used for text search
on Mac OS.

parse_ooaffentry() parses an affix file entry instead of sscanf(). It
has a similar algorithm to the parse_affentry() function.

Should I create a new patch to fix this bug (as I did), or should this
patch go with the patch
http://www.postgresql.org/message-id/56aa02ee.6090...@postgrespro.ru ?



I have created a new entry in the commitfest for this patch 
https://commitfest.postgresql.org/9/496/


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension

2016-01-29 Thread Artur Zakirov

On 21.01.2016 00:25, Alvaro Herrera wrote:

Artur Zakirov wrote:


I don't quite understand why aren't we using a custom GUC variable here.
These already have SHOW and SET support ...



Added GUC variables:
- pg_trgm.limit
- pg_trgm.substring_limit
I added these variables to the documentation.
The show_limit() and set_limit() functions work correctly and they are marked
as deprecated.


Thanks.  I'm willing to commit quickly a small patch that only changes
the existing function to GUCs, then have a look at a separate patch that
adds the new substring operator.  Would you split the patch that way?



What is the status of this patch? In the commitfest it is "Needs review".

Can this patch get the status "Ready for Committer"?

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension

2016-01-29 Thread Artur Zakirov

On 29.01.2016 18:39, Alvaro Herrera wrote:

Teodor Sigaev wrote:

The behavior of this function is surprising to me.

select substring_similarity('dog' ,  'hotdogpound') ;

  substring_similarity
--
  0.25


Substring search was designed to search for a similar word in a string:
contrib_regression=# select substring_similarity('dog' ,  'hot dogpound') ;
  substring_similarity
--
  0.75

contrib_regression=# select substring_similarity('dog' ,  'hot dog pound') ;
  substring_similarity
--
 1


Hmm, this behavior looks too much like magic to me.  I mean, a substring
is a substring -- why are we treating the space as a special character
here?



I think I can rename this function to subword_similarity() and correct 
the documentation.


The current behavior is designed to find the most similar word in a text. 
For example, if we search for just a substring (not a word), we 
get the following result:


select substring_similarity('dog', 'dogmatist');
 substring_similarity
-
1
(1 row)

But this is wrong I think. They are completely different words.

Maybe another function should be added for searching a similar 
substring (not a word) in a text?


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-01-28 Thread Artur Zakirov

On 28.01.2016 14:19, Alvaro Herrera wrote:

Artur Zakirov wrote:


I undo the changes and the error will be raised. I will update the patch
soon.


I don't think you ever did this.  I'm closing it now, but it sounds
useful stuff so please do resubmit for 2016-03.



I'm working on the patch. I wanted to send these changes after all the other changes.

This version of the patch has a top-level comment. I will provide further 
comments soon.


This patch also has some changes to ternary operators.

> I don't think you ever did this.  I'm closing it now, but it sounds
> useful stuff so please do resubmit for 2016-03.

Moved to next CF.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
***
*** 2615,2632  SELECT plainto_tsquery('supernova star');
 
  
 
! To create an Ispell dictionary, use the built-in
! ispell template and specify several parameters:
 
! 
  
! CREATE TEXT SEARCH DICTIONARY english_ispell (
  TEMPLATE = ispell,
! DictFile = english,
! AffFile = english,
! StopWords = english
! );
  
  
 
  Here, DictFile, AffFile, and StopWords
--- 2615,2655 
 
  
 
! To create an Ispell dictionary perform these steps:
 
!
! 
!  
!   download dictionary configuration files. OpenOffice
!   extension files have the .oxt extension. It is necessary
!   to extract .aff and .dic files, change extensions
!   to .affix and .dict. For some dictionary
!   files it is also necessary to convert characters to the UTF-8 encoding
!   with commands (for example, for the Norwegian language dictionary):
  
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
! 
!  
! 
! 
!  
!   copy files to the $SHAREDIR/tsearch_data directory
!  
! 
! 
!  
!   load files into PostgreSQL with the following command:
! 
! CREATE TEXT SEARCH DICTIONARY english_hunspell (
  TEMPLATE = ispell,
! DictFile = en_us,
! AffFile = en_us,
! Stopwords = english);
  
+  
+ 
+
  
 
  Here, DictFile, AffFile, and StopWords
***
*** 2643,2648  CREATE TEXT SEARCH DICTIONARY english_ispell (
--- 2666,2720 
 
  
 
+ The .affix file of Ispell has the following structure:
+ 
+ prefixes
+ flag *A:
+ .   >   RE  # As in enter > reenter
+ suffixes
+ flag T:
+ E   >   ST  # As in late > latest
+ [^AEIOU]Y   >   -Y,IEST # As in dirty > dirtiest
+ [AEIOU]Y>   EST # As in gray > grayest
+ [^EY]   >   EST # As in small > smallest
+ 
+
+
+ And the .dict file has the following structure:
+ 
+ lapse/ADGRS
+ lard/DGRS
+ large/PRTY
+ lark/MRS
+ 
+
+ 
+
+ Format of the .dict file is:
+ 
+ basic_form/affix_class_name
+ 
+
+ 
+
+ In the .affix file every affix flag is described in the
+ following format:
+ 
+ condition > [-stripping_letters,] adding_affix
+ 
+
+ 
+
+ Here, condition has a format similar to the format of regular expressions.
+ It can use groupings [...] and [^...].
+ For example, [AEIOU]Y means that the last letter of the word
+ is "y" and the penultimate letter is "a",
+ "e", "i", "o" or "u".
+ [^EY] means that the last letter is neither "e"
+ nor "y".
+
+ 
+
  Ispell dictionaries support splitting compound words;
  a useful feature.
  Notice that the affix file should specify a special flag using the
***
*** 2663,2668  SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
--- 2735,2796 
  
 
  
+
+ MySpell is very similar to Hunspell.
+ The .affix file of Hunspell has the following structure:
+ 
+ PFX A Y 1
+ PFX A   0 re .
+ SFX T N 4
+ SFX T   0 st e
+ SFX T   y iest   [^aeiou]y
+ SFX T   0 est[aeiou]y
+ SFX T   0 est[^ey]
+ 
+
+ 
+
+ The first line of an affix class is the header. Fields of an affix rule are listed after the header:
+
+
+ 
+  
+   parameter name (PFX or SFX)
+  
+ 
+ 
+  
+   flag (name of the affix class)
+  
+ 
+ 
+  
+   stripping characters from beginning (at prefix) or end (at suffix) of the word
+  
+ 
+ 
+  
+   adding affix
+  
+ 
+ 
+  
+   condition that has a format similar to the format of regular expressions.
+  
+ 
+
+ 
+
+ The .dict file looks like the .dict file of
+ Ispell:
+ 
+ larder/M
+ lardy/RT
+ large/RSPMYT
+ largehearted
+ 
+
+ 
 
  
   MySpell does not support compound words.
*** a/src/backend/tsearch/spell.c
--- b/src/backend/ts

Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-01-28 Thread Artur Zakirov

Sorry, I don't know why this thread was moved to another thread.

I duplicate the patch here.


On 28.01.2016 14:19, Alvaro Herrera wrote:

Artur Zakirov wrote:


I undo the changes and the error will be raised. I will update the patch
soon.


I don't think you ever did this. I'm closing it now, but it sounds
useful stuff so please do resubmit for 2016-03.



I'm working on the patch. I wanted to send these changes after all the other changes.

This version of the patch has a top-level comment. I will provide further 
comments soon.

This patch also has some changes to ternary operators.


I don't think you ever did this. I'm closing it now, but it sounds
useful stuff so please do resubmit for 2016-03.


Moved to next CF.




--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
***
*** 2615,2632  SELECT plainto_tsquery('supernova star');
 
  
 
! To create an Ispell dictionary, use the built-in
! ispell template and specify several parameters:
 
! 
  
! CREATE TEXT SEARCH DICTIONARY english_ispell (
  TEMPLATE = ispell,
! DictFile = english,
! AffFile = english,
! StopWords = english
! );
  
  
 
  Here, DictFile, AffFile, and StopWords
--- 2615,2655 
 
  
 
! To create an Ispell dictionary perform these steps:
 
!
! 
!  
!   download dictionary configuration files. OpenOffice
!   extension files have the .oxt extension. It is necessary
!   to extract .aff and .dic files, change extensions
!   to .affix and .dict. For some dictionary
!   files it is also necessary to convert characters to the UTF-8 encoding
!   with commands (for example, for the Norwegian language dictionary):
  
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
! 
!  
! 
! 
!  
!   copy files to the $SHAREDIR/tsearch_data directory
!  
! 
! 
!  
!   load files into PostgreSQL with the following command:
! 
! CREATE TEXT SEARCH DICTIONARY english_hunspell (
  TEMPLATE = ispell,
! DictFile = en_us,
! AffFile = en_us,
! Stopwords = english);
  
+  
+ 
+
  
 
  Here, DictFile, AffFile, and StopWords
***
*** 2643,2648  CREATE TEXT SEARCH DICTIONARY english_ispell (
--- 2666,2720 
 
  
 
+ The .affix file of Ispell has the following structure:
+ 
+ prefixes
+ flag *A:
+ .   >   RE  # As in enter > reenter
+ suffixes
+ flag T:
+ E   >   ST  # As in late > latest
+ [^AEIOU]Y   >   -Y,IEST # As in dirty > dirtiest
+ [AEIOU]Y>   EST # As in gray > grayest
+ [^EY]   >   EST # As in small > smallest
+ 
+
+
+ And the .dict file has the following structure:
+ 
+ lapse/ADGRS
+ lard/DGRS
+ large/PRTY
+ lark/MRS
+ 
+
+ 
+
+ Format of the .dict file is:
+ 
+ basic_form/affix_class_name
+ 
+
+ 
+
+ In the .affix file every affix flag is described in the
+ following format:
+ 
+ condition > [-stripping_letters,] adding_affix
+ 
+
+ 
+
+ Here, condition has a format similar to the format of regular expressions.
+ It can use groupings [...] and [^...].
+ For example, [AEIOU]Y means that the last letter of the word
+ is "y" and the penultimate letter is "a",
+ "e", "i", "o" or "u".
+ [^EY] means that the last letter is neither "e"
+ nor "y".
+
+ 
+
  Ispell dictionaries support splitting compound words;
  a useful feature.
  Notice that the affix file should specify a special flag using the
***
*** 2663,2668  SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
--- 2735,2796 
  
 
  
+
+ MySpell is very similar to Hunspell.
+ The .affix file of Hunspell has the following structure:
+ 
+ PFX A Y 1
+ PFX A   0 re .
+ SFX T N 4
+ SFX T   0 st e
+ SFX T   y iest   [^aeiou]y
+ SFX T   0 est[aeiou]y
+ SFX T   0 est[^ey]
+ 
+
+ 
+
+ The first line of an affix class is the header. Fields of an affix rule are listed after the header:
+
+
+ 
+  
+   parameter name (PFX or SFX)
+  
+ 
+ 
+  
+   flag (name of the affix class)
+  
+ 
+ 
+  
+   stripping characters from beginning (at prefix) or end (at suffix) of the word
+  
+ 
+ 
+  
+   adding affix
+  
+ 
+ 
+  
+   condition that has a format similar to the format of regular expressions.
+  
+ 
+
+ 
+
+ The .dict file looks like the .dict file of
+ Ispell:
+ 
+ larder/M
+ lardy/RT
+ large/RSPMYT
+ largehearted
+ 
+
+ 
 
  
  

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-01-28 Thread Artur Zakirov

On 27.01.2016 15:28, Artur Zakirov wrote:

On 27.01.2016 14:14, Stas Kelvich wrote:

Hi.

I tried that and confirm the strange behaviour. It seems that the problem
is with the small Cyrillic letter ‘х’. (simplest obscene language filter? =)

That can be reproduced with simpler test

Stas




The test program was corrected. Now it uses the wchar_t type, and it works
correctly and gives the right output.

I think the NIImportOOAffixes() in spell.c should be corrected to avoid
this bug.



I have attached a patch. It adds new functions parse_ooaffentry() and 
get_nextentry() and fixes a couple of comments.


Now Russian and other supported dictionaries can be used for text search 
on Mac OS.


parse_ooaffentry() parses an affix file entry instead of sscanf(). It 
has a similar algorithm to the parse_affentry() function.


Should I create a new patch to fix this bug (as I did), or should this 
patch go with the patch 
http://www.postgresql.org/message-id/56aa02ee.6090...@postgrespro.ru ?


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***
*** 458,469  NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
  }
  
  #define PAE_WAIT_MASK	0
! #define PAE_INMASK	1
  #define PAE_WAIT_FIND	2
! #define PAE_INFIND	3
  #define PAE_WAIT_REPL	4
! #define PAE_INREPL	5
  
  static bool
  parse_affentry(char *str, char *mask, char *find, char *repl)
  {
--- 458,579 
  }
  
  #define PAE_WAIT_MASK	0
! #define PAE_INMASK		1
  #define PAE_WAIT_FIND	2
! #define PAE_INFIND		3
  #define PAE_WAIT_REPL	4
! #define PAE_INREPL		5
! #define PAE_WAIT_TYPE	6
! #define PAE_WAIT_FLAG	7
  
+ /*
+  * Used in parse_ooaffentry() to parse an .affix file entry.
+  */
+ static bool
+ get_nextentry(char **str, char *next)
+ {
+ 	int			state = PAE_WAIT_MASK;
+ 	char	   *pnext = next;
+ 
+ 	*next = '\0';
+ 
+ 	while (**str)
+ 	{
+ 		if (state == PAE_WAIT_MASK)
+ 		{
+ 			if (t_iseq(*str, '#'))
+ return false;
+ 			else if (!t_isspace(*str))
+ 			{
+ COPYCHAR(pnext, *str);
+ pnext += pg_mblen(*str);
+ state = PAE_INMASK;
+ 			}
+ 		}
+ 		else if (state == PAE_INMASK)
+ 		{
+ 			if (t_isspace(*str))
+ 			{
+ *pnext = '\0';
+ return true;
+ 			}
+ 			else
+ 			{
+ COPYCHAR(pnext, *str);
+ pnext += pg_mblen(*str);
+ 			}
+ 		}
+ 		*str += pg_mblen(*str);
+ 	}
+ 
+ 	*pnext = '\0';
+ 
+ 	return *next;
+ }
+ 
+ /*
+  * Parses entry of an .affix file of MySpell or Hunspell format.
+  *
+  * An .affix file entry has the following format:
+  * - header:
+  *   <type> <flag> <cross_flag> <number_of_rules>
+  * - fields after header:
+  *   <type> <flag> <find> <repl> <mask>
+  */
+ static int
+ parse_ooaffentry(char *str, char *type, char *flag, char *find,
+ char *repl, char *mask)
+ {
+ 	int			state = PAE_WAIT_TYPE,
+ next_state = PAE_WAIT_FLAG;
+ 	int			parse_read = 0;
+ 	bool		valid = true;
+ 
+ 	*type = *flag = *find = *repl = *mask = '\0';
+ 
+ 	while (*str && valid)
+ 	{
+ 		switch (state)
+ 		{
+ 			case PAE_WAIT_TYPE:
+ 				valid = get_nextentry(&str, type);
+ 				break;
+ 			case PAE_WAIT_FLAG:
+ 				valid = get_nextentry(&str, flag);
+ 				next_state = PAE_WAIT_FIND;
+ 				break;
+ 			case PAE_WAIT_FIND:
+ 				valid = get_nextentry(&str, find);
+ 				next_state = PAE_WAIT_REPL;
+ 				break;
+ 			case PAE_WAIT_REPL:
+ 				valid = get_nextentry(&str, repl);
+ 				next_state = PAE_WAIT_MASK;
+ 				break;
+ 			case PAE_WAIT_MASK:
+ 				get_nextentry(&str, mask);
+ 				/* break loop */
+ 				valid = false;
+ 				break;
+ 			default:
+ 				elog(ERROR, "unrecognized state in parse_ooaffentry: %d", state);
+ 		}
+ 		state = next_state;
+ 		if (*str)
+ 			str += pg_mblen(str);
+ 
+ 		parse_read++;
+ 	}
+ 
+ 	return parse_read;
+ }
+ 
+ /*
+  * Parses entry of an .affix file of Ispell format
+  *
+  * An .affix file entry has the following format:
+  *   <mask>  >  [-<find>,]<repl>
+  */
  static bool
  parse_affentry(char *str, char *mask, char *find, char *repl)
  {
***
*** 618,625  NIImportOOAffixes(IspellDict *Conf, const char *filename)
  	int			flag = 0;
  	char		flagflags = 0;
  	tsearch_readline_state trst;
! 	int			scanread = 0;
! 	char		scanbuf[BUFSIZ];
  	char	   *recoded;
  
  	/* read file to find any flag */
--- 728,734 
  	int			flag = 0;
  	char		flagflags = 0;
  	tsearch_readline_state trst;
! 	int			parseread = 0;
  	char	   *recoded;
  
  	/* read file to find any flag */
***
*** 682,689  NIImportOOAffixes(IspellDict *Conf, const char *filename)
  	}
  	tsearch_readline_end();
  
- 	sprintf(scanbuf, "%%6s %%%ds %%%ds %%%ds %%%ds", BUFSIZ / 5, BUFSIZ / 5, BUFSIZ / 5, BUFSIZ / 5);
- 
	if (!tsearch_readline_begin(&trst, filename))
  		ereport(ERROR,
  (errcode(ERRCODE_CONFIG_FILE_ERROR),
--- 791,796 
***
*** 695,709  NIImportOOAffixes(IspellDict *Conf, const char *filename)
  		if (*recoded == '\0' || t_isspace(recoded) || t_iseq(recoded, '#'))
  			goto nextline;
  
! 		scanread = sscan

Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-01-27 Thread Artur Zakirov

On 27.01.2016 14:14, Stas Kelvich wrote:

Hi.

I tried that and confirm the strange behaviour. It seems that the problem is with the small 
Cyrillic letter ‘х’. (simplest obscene language filter? =)

That can be reproduced with simpler test

Stas




The test program was corrected. Now it uses the wchar_t type, and it works 
correctly and gives the right output.


I think the NIImportOOAffixes() in spell.c should be corrected to avoid 
this bug.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
#include 
#include 
#include 

char *src = "SFX Y   хаться шутсяхаться";

int
main(int argc, char *argv[])
{
	wchar_t c1[1024], c2[1024], c3[1024], c4[1024], c5[1024];

	setlocale(LC_CTYPE, "ru_RU.UTF-8");

	sscanf(src, "%6ls %204ls %204ls %204ls %204ls", c1, c2, c3, c4, c5);

	printf("%ls/%ls/%ls/%ls/%ls\n", c1, c2, c3, c4, c5);

	return 0;
}



[HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-01-27 Thread Artur Zakirov

Hello.

When a user tries to create a text search dictionary for the Russian 
language on Mac OS, the following error message is raised:


  CREATE EXTENSION hunspell_ru_ru;
+ ERROR:  invalid byte sequence for encoding "UTF8": 0xd1
+ CONTEXT:  line 341 of configuration file 
"/Users/stas/code/postgrespro2/tmp_install/Users/stas/code/postgrespro2/install/share/tsearch_data/ru_ru.affix": 
"SFX Y   хаться шутсяхаться


The Russian dictionary was downloaded from 
http://extensions.openoffice.org/en/project/slovari-dlya-russkogo-yazyka-dictionaries-russian
The affix and dictionary files were extracted from the archive and converted 
to UTF-8. A converted dictionary can also be downloaded from 
https://github.com/select-artur/hunspell_dicts/tree/master/ru_ru


This behavior occurs on:
- Mac OS X 10.10 Yosemite and Mac OS X 10.11 El Capitan.
- latest PostgreSQL version from git and PostgreSQL 9.5 (probably also 
on 9.4.5).


There is also a test to reproduce this bug in the attachment.

Have you encountered this bug? Do you have a solution or a workaround?

Thanks in advance.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
#include 
#include 

char *src = "SFX Y   хаться шутсяхаться";

int
main(int argc, char *argv[])
{
	char c1[1024], c2[1024], c3[1024], c4[1024], c5[1024];

	setlocale(LC_CTYPE, "ru_RU.UTF-8");

	sscanf(src, "%6s %204s %204s %204s %204s", c1, c2, c3, c4, c5);

	printf("%s/%s/%s/%s/%s\n", c1, c2, c3, c4, c5);

	return 0;
}



Re: [HACKERS] Mac OS: invalid byte sequence for encoding "UTF8"

2016-01-27 Thread Artur Zakirov

On 27.01.2016 13:46, Shulgin, Oleksandr wrote:


Not sure why the file uses "SET KOI8-R" directive then?



This directive is used only by the Hunspell program. PostgreSQL ignores it 
and assumes that input affix and dictionary files are in the UTF-8 
encoding.





What error message do you get with this test program?  (I don't get any,
but I'm not on Mac OS.)
--
Alex




With this program you will get wrong output. An error message is not 
raised. You can execute the following commands:


> cc test.c -o test
> ./test

You will get the output:

SFX/Y/?/аться/шутся

Although the output should be:

SFX/Y/хаться/шутся/хаться

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] easy way of copying regex_t

2016-01-25 Thread Artur Zakirov

On 25.01.2016 13:07, Tomas Vondra wrote:


Right, it's definitely not thread-safe so there'd need to be some lock
protecting the regex_t copy. I was thinking about either using a group
of locks, each protecting a small subset of the affixes (thus making it
possible to work in parallel to some extent), or simply using a single
lock and each process would make a private copy at the beginning.

In the end, I've decided to do it differently, and simply parse the
affix list from scratch in each process. The affix list is tiny and
takes less than a millisecond to parse in most cases, and I don't have
to care about the regex stuff at all. The main benefit is from sharing
parsed wordlist anyway.


This is a nice decision, since the affix list is small. For our task I will 
change shared_ispell to use this solution.




It's an old-school shared segment created by the extension at init time.
You're right the size is fixed so it's possible to run out of space by
loading too many dictionaries, but that was not a big deal for the type
of setups it was designed for - in those cases the list of dictionaries
is stable, so it's possible to size the segment accordingly in advance.

But I guess we could do better now that we have dynamic shared memory,
possibly allocating one segment per dictionary as needed, or something
like that.

regards



Yes, it would be better, as we would not need to define the maximum size of 
the shared segment in postgresql.conf.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] easy way of copying regex_t

2016-01-24 Thread Artur Zakirov

Hi all,

I've been working on an extension that allows moving the ispell
dictionaries to a shared segment. It's almost complete; the last FIXME is
about copying the regex_t structure (stored in AFFIX).

According to regex.h the structure is fairly complex and not exactly easy
to understand, so I'd like to know if anyone here already implemented that
or something that may serve the same purpose. Any ideas?

kind regards
Tomas


This message is a reply to the message 
http://www.postgresql.org/message-id/dd02a31fdeffbf5cb24771e34213b40f.squir...@sq.gransy.com

Sorry, I can't reply to it directly; I can't retrieve it from the archive.

Thank you for your shared_ispell extension. It is very useful. I got it 
from https://github.com/tvondra/shared_ispell
With this message I am sending a patch for your repository with a draft 
of code that allows shared_ispell to copy regex_t.


The main ideas of the patch are:
- we don't need to copy the whole regex_t structure
- most of the fields and structures are used only at compile time
- we need to copy the structures: guts, colormap, subre, cnfa
- from the subre structure we need only the cnfa

The colormap represents a directed acyclic graph; the cnfa represents a 
nondeterministic finite automaton.


The following was also done in this patch:
- added regression tests
- deleted spell.h and spell.c, since they contained duplicate code
- added shared_ispell.h, which declares some structures
- fixed an issue where stopFile could be NULL
- fixed countCMPDAffixes, since theoretically there could be zero affixes
- added copyCMPDAffix

A question to hackers: could such a patch be useful as a PostgreSQL 
full-text search patch? Is it needed?


shared_ispell loads dictionaries into shared memory. The main benefits 
are:
- saving memory: every dictionary is loaded only once, rather than once 
per backend. In the current version of PostgreSQL, dictionaries are 
loaded in every backend where they are requested.
- saving time: the first load of a dictionary takes a long time. With 
this patch, dictionaries are loaded only once.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/Makefile
--- b/Makefile
***
*** 1,18 
  MODULE_big = shared_ispell
! OBJS = src/shared_ispell.o src/spell.o
  
  EXTENSION = shared_ispell
! DATA = sql/shared_ispell--1.0.0.sql
! MODULES = shared_ispell
  
! CFLAGS=`pg_config --includedir-server`
  
  PG_CONFIG = pg_config
  PGXS := $(shell $(PG_CONFIG) --pgxs)
  include $(PGXS)
! 
! all: shared_ispell.so
! 
! shared_ispell.so: $(OBJS)
! 
! %.o : src/%.c
--- 1,20 
+ # contrib/shared_ispell/Makefile
+ 
  MODULE_big = shared_ispell
! OBJS = src/shared_ispell.o
  
  EXTENSION = shared_ispell
! DATA = sql/shared_ispell--1.1.0.sql
  
! REGRESS = shared_ispell
  
+ ifdef USE_PGXS
  PG_CONFIG = pg_config
  PGXS := $(shell $(PG_CONFIG) --pgxs)
  include $(PGXS)
! else
! subdir = contrib/shared_ispell
! top_builddir = ../..
! include $(top_builddir)/src/Makefile.global
! include $(top_srcdir)/contrib/contrib-global.mk
! endif
*** a/README.md
--- b/README.md
***
*** 13,28  If you need just snowball-type dictionaries, this extension is not
  really interesting for you. But if you really need an ispell
  dictionary, this may save you a lot of resources.
  
- Warning
- ---
- The extension does not yet handle affixes that require full regular
- expressions (regex_t, implemented in regex.h). This is indicated by
- an error when initializing the dictionary.
- 
- Simple affixes and affixes that can be handled by fast regex subset
- (as implemented in regis.h) are handled just fine.
- 
- 
  Install
  ---
  Installing the extension is quite simple, especially if you're on 9.1.
--- 13,18 
*** /dev/null
--- b/expected/shared_ispell.out
***
*** 0 
--- 1,193 
+ CREATE EXTENSION shared_ispell;
+ -- Test ISpell dictionary with ispell affix file
+ CREATE TEXT SEARCH DICTIONARY shared_ispell (
+ Template=shared_ispell,
+ DictFile=ispell_sample,
+ AffFile=ispell_sample
+ );
+ SELECT ts_lexize('shared_ispell', 'skies');
+  ts_lexize 
+ ---
+  {sky}
+ (1 row)
+ 
+ SELECT ts_lexize('shared_ispell', 'bookings');
+ts_lexize
+ 
+  {booking,book}
+ (1 row)
+ 
+ SELECT ts_lexize('shared_ispell', 'booking');
+ts_lexize
+ 
+  {booking,book}
+ (1 row)
+ 
+ SELECT ts_lexize('shared_ispell', 'foot');
+  ts_lexize 
+ ---
+  {foot}
+ (1 row)
+ 
+ SELECT ts_lexize('shared_ispell', 'foots');
+  ts_lexize 
+ ---
+  {foot}
+ (1 row)
+ 
+ SELECT ts_lexize('shared_ispell', 'rebookings');
+ts_lexize
+ 
+  {booking,book}
+ (1 row)
+ 
+ SELECT ts_lexize('shared_ispell', 'rebooking');
+ts_lexize
+ 
+  {booking,book}
+ (1 row)
+ 
+ SELECT ts_lexize('shared_ispell', 'rebook');
+  ts_lexize 
+ ---
+  
+ (1 row)
+ 
+ SELECT ts_lexize

Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension

2016-01-15 Thread Artur Zakirov

On 12.01.2016 02:31, Alvaro Herrera wrote:

I gave a quick look through the patch and noticed a few minor things
while trying to understand it.

I think the test corpus isn't particularly interesting for how big it
is.  I'd rather have (a) a small corpus (say 100 words) with which to do
detailed regression testing, and (b) some larger document for more
extensive testing.  I'm not sure (b) is actually necessary.

Overall I think the new functions could stand a lot more commentary.



Thank you for the review. I will send a fixed patch in a few days.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-01-09 Thread Artur Zakirov

On 09.01.2016 05:38, Alvaro Herrera wrote:

Artur Zakirov wrote:


Now almost all dictionaries are loaded into PostgreSQL. But the da_dk
dictionary does not load. I see the following error:

ERROR: invalid regular expression: quantifier operand invalid
CONTEXT: line 439 of configuration file
"/home/artur/progs/pgsql/share/tsearch_data/da_dk.affix": "SFX 55 0 s
+GENITIV

If you open the affix file in editor you can see that there is incorrect
format of the affix 55 in 439 line (screen1.png):
  
  [ another email ]



I also had implemented a patch that fixes an error from the e-mail
http://www.postgresql.org/message-id/562e1073.8030...@postgrespro.ru
This patch just ignore that error.

I think it's a bad idea to just ignore these syntax errors.  This affix
file is effectively corrupt, after all, so it seems a bad idea that we
need to cope with it.  I think it would be better to raise the error
normally and instruct the user to fix the file; obviously it's better if
the upstream provider of the file fixes it.

Now, if there is proof somewhere that the file is correct, then the code
must cope in some reasonable way.  But in any case I don't think this
change is acceptable ... it can only cause pain, in the long run.

This error is raised by the Danish dictionary because of an erroneous entry 
in the .affix file. I sent a bug report to the developer, and he fixed the 
bug. The corrected dictionary can be downloaded from the LibreOffice site.


I will undo the changes so that the error is raised again. I will update 
the patch soon.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-01-09 Thread Artur Zakirov

Thanks for review.

On 09.01.2016 02:04, Alvaro Herrera wrote:

Artur Zakirov wrote:

--- 74,80 
   
   typedef struct aff_struct

   {
!   uint32  flag:16,
type:1,
flagflags:7,
issimple:1,

By doing this, you're using 40 bits of a 32-bits-wide field.  What does
this mean?  Are the final 8 bits lost?  Does the compiler allocate a
second uint32 member for those additional bits?  I don't know, but I
don't think this is a very clean idea.
No, the 8 bits are not necessarily lost. These 8 bits are used if a 
dictionary uses the double extended ASCII character flag type 
(Conf->flagMode == FM_LONG) or the decimal number flag type 
(Conf->flagMode == FM_NUM). If a dictionary uses the single extended 
ASCII character flag type (Conf->flagMode == FM_CHAR), then the 8 bits 
are lost.


You can see this in the decodeFlag function, which is used in 
NIImportOOAffixes. It decodes an affix flag from its string 
representation and stores it in the flag field (flag:16).



   typedef struct spell_struct
   {
!   struct
{
!   int affix;
!   int len;
!   }   d;
!   /*
!* flag is filled in by NIImportDictionary. After NISortDictionary, d
!* is valid and flag is invalid.
!*/
!   char   *flag;
charword[FLEXIBLE_ARRAY_MEMBER];
   } SPELL;

Here you removed the union, with no rationale for doing so.  Why did you
do it?  Can it be avoided?  Because of the comment, I'd imagine that d
and flag are valid at different times, so at any time we care about only
one of them; but you haven't updated the comment stating the reason for
that no longer to be the case.  I suspect you need to keep flag valid
after NISortDictionary has been called, but if so why?  If "flag" is
invalid as the comment says, what's the reason for keeping it?
The union was removed because the flag field needs to be dynamically 
sized; it had a fixed size of 16 in the previous version. A flag set is 
stored in this field. For example, if the .dict file has the entry:


mitigate/NDSGny

then "NDSGny" is stored in the flag field.

But in some cases a flag set can be longer than 16 characters. I added 
these changes after this message 
http://www.postgresql.org/message-id/CAE2gYzwom3=11u9g8zxmt5plkzrwb12bwzxh4db3hud89fo...@mail.gmail.com

In that Turkish dictionary there can be large flag sets. For example:

özek/2240,852,749,5026,2242,4455,2594,2597,4963,1608,494,2409

This flag set is 56 characters long.
The flag field is valid all the time. It is used in NISortDictionary and 
is not used after the NISortDictionary function has been called. Maybe 
you are right that there is no reason for keeping it, and all flags 
should be stored in a separate variable that is freed after 
NISortDictionary has been called.



The routines decodeFlag and isAffixFlagInUse could do with more
comments.  Your patch adds zero.  Actually the whole file has not nearly
enough comments; adding some more would be very good.

Actually, after some more reading, I think this code is pretty terrible.
I have a hard time figuring out how the original works, which makes it
even more difficult to figure out whether your changes make sense.  I
would have to take your patch on faith, which doesn't sound so great an
idea.

palloc / cpalloc / tmpalloc make the whole mess even more confusing.
Why does this file have three ways to allocate memory?

Not sure what's a good way to go about this.  I am certainly not going
to commit this as is, because if I do whatever bugs you have are going
to become my problem; and with the severe lack of documentation and
given how fiddly this stuff is, I bet there are going to be a bunch of
bugs.  I suspect most committers are going to be in the same position.
I think you should start by adding a few comments here and there on top
of the original to explain how it works, then your patch on top.  I
suppose it's going to be a lot of work for you but I don't see any other
way.

A top-level overview about it would be good, too.  The current comment
at top of file states:

  * spell.c
  * Normalizing word with ISpell

which is, err, somewhat laconic.

I will add comments and explain in them how it works. Maybe I will also 
add some explanation of the dictionary structures. I will update the 
patch soon.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [HACKERS] plpgsql - DECLARE - cannot to use %TYPE or %ROWTYPE for composite types

2015-12-21 Thread Artur Zakirov

Hi.
I have done some review of this patch. Below are my comments.

Introduction:

This patch fixes and adds the following functionality:
- %TYPE - now can be used for composite types.
- %ARRAYTYPE - new functionality, provides the array type from a 
variable or table column.
- %ELEMENTTYPE - new functionality, provides the element type of a given 
array.


New regression tests are included in the patch. Changes to the 
documentation are not provided.


Initial Run:

The patch applies correctly to HEAD. Regression tests pass successfully, 
without errors. It seems that the patch works as expected.


Performance:

The patch does not seem to introduce performance issues for the existing 
functionality.


Coding:

The style looks fine. I have attached a patch that makes some corrections 
to the code and documentation. I have corrected indentation in pl_comp.c 
and in the "read_datatype" function in pl_gram.y. I think the changes in 
the "read_datatype" function would be better done in a way that avoids 
code duplication. But I could be wrong, of course.


Conclusion:

The patch could be applied on master with documentation corrections. But 
I'm not sure that your task can be resolved only by adding %ARRAYTYPE 
and %ELEMENTTYPE. Could you give some examples?


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/doc/src/sgml/plpgsql.sgml
--- b/doc/src/sgml/plpgsql.sgml
***
*** 710,715  SELECT merge_fields(t.*) FROM table1 t WHERE ... ;
--- 710,756 
 

  
+   
+Array Types
+ 
+ 
+ variable%ARRAYTYPE
+ 
+ 
+
+ %ARRAYTYPE provides the array type from a variable or
+ table column.  You can use this to declare array variables that will
+ hold database values.  To declare an array variable that will hold
+ values from users.user_id you can write:
+ 
+ user_id users.user_id%ARRAYTYPE;
+ 
+
+ 
+
+ By using %ARRAYTYPE you don't need to know the data
+ type of elements you are referencing.
+
+   
+ 
+   
+Array Element Types
+ 
+ 
+ variable%ELEMENTTYPE
+ 
+ 
+
+ %ELEMENTTYPE provides the element type of a given
+ array.  To declare a variable with the same data type as
+ users array element you can write:
+ 
+ user_id users%ELEMENTTYPE;
+ 
+
+ 
+   
+ 

 Collation of PL/pgSQL Variables
  
*** a/src/pl/plpgsql/src/pl_comp.c
--- b/src/pl/plpgsql/src/pl_comp.c
***
*** 1619,1625  plpgsql_parse_tripword(char *word1, char *word2, char *word3,
  
  /*
   * Derive type from ny base type controlled by reftype_mode
-  *
   */
  static PLpgSQL_type *
  derive_type(PLpgSQL_type *base_type, int reftype_mode)
--- 1619,1624 
***
*** 1661,1667  derive_type(PLpgSQL_type *base_type, int reftype_mode)
  ereport(ERROR,
  		(errcode(ERRCODE_DATATYPE_MISMATCH),
  		 errmsg("there are not array type for type %s",
! 	format_type_be(base_type->typoid))));
  
  			return plpgsql_build_datatype(typoid, -1,
  			plpgsql_curr_compile->fn_input_collation);
--- 1660,1666 
  ereport(ERROR,
  		(errcode(ERRCODE_DATATYPE_MISMATCH),
  		 errmsg("there are not array type for type %s",
! format_type_be(base_type->typoid))));
  
  			return plpgsql_build_datatype(typoid, -1,
  			plpgsql_curr_compile->fn_input_collation);
*** a/src/pl/plpgsql/src/pl_gram.y
--- b/src/pl/plpgsql/src/pl_gram.y
***
*** 2694,2700  read_datatype(int tok)
  	StringInfoData		ds;
  	char			   *type_name;
  	int	startlocation;
! 	PLpgSQL_type		*result;
  	int	parenlevel = 0;
  
  	/* Should only be called while parsing DECLARE sections */
--- 2694,2700 
  	StringInfoData		ds;
  	char			   *type_name;
  	int	startlocation;
! 	PLpgSQL_type		*result = 0;
  	int	parenlevel = 0;
  
  	/* Should only be called while parsing DECLARE sections */
***
*** 2720,2751  read_datatype(int tok)
  			tok = yylex();
			if (tok_is_keyword(tok, &yylval,
  			   K_TYPE, "type"))
- 			{
  result = plpgsql_parse_wordtype(dtname, PLPGSQL_REFTYPE_TYPE);
! if (result)
! 	return result;
! 			}
! 			if (tok_is_keyword(tok, &yylval,
! 			   K_ELEMENTTYPE, "elementtype"))
! 			{
  result = plpgsql_parse_wordtype(dtname, PLPGSQL_REFTYPE_ELEMENT);
! if (result)
! 	return result;
! 			}
! 			if (tok_is_keyword(tok, &yylval,
! 			   K_ARRAYTYPE, "arraytype"))
! 			{
  result = plpgsql_parse_wordtype(dtname, PLPGSQL_REFTYPE_ARRAY);
- if (result)
- 	return result;
- 			}
			else if (tok_is_keyword(tok, &yylval,
  	K_ROWTYPE, "rowtype"))
- 			{
  result = plpgsql_parse_wordrowtype(dtname);
! if (result)
! 	return result;
! 			}
  		}
  	}
  	else if (plpgsql_token_is_unreserved_keyword(tok))
--- 2720,2737 
  			tok = yylex();
			if (tok_is_keyword(tok, &yylval,
  			   K_TYPE, "type"))
  result = plpgsql_pars

Re: [HACKERS] Fuzzy substring searching with the pg_trgm extension

2015-12-18 Thread Artur Zakirov

On 18.12.2015 22:43, Artur Zakirov wrote:

Hello.

PostgreSQL has a contrib module named pg_trgm. It is used to the fuzzy 
text search. It provides some functions and operators for determining 
the similarity of the given texts using trigram matching.



Sorry, I forgot to mark the previous message with [PROPOSAL].
I registered the patch in the commitfest:
https://commitfest.postgresql.org/8/448/

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company





Re: [HACKERS] Allow replication roles to use file access functions

2015-12-15 Thread Artur Zakirov

On 03.09.2015 05:40, Michael Paquier wrote:


Ah, OK. I thought that you were referring to a protocol where caller
sends a single LSN from which it gets a differential backup that needs
to scan all the relation files of the source cluster to get the data
blocks with an LSN newer than the one sent, and then sends them back
to the caller.

I guess that what you are suggesting instead is an approach where
caller sends something like that through the replication protocol with
a relation OID and a block list:
BLOCK_DIFF relation_oid BLOCK_LIST m,n,[o, ...]
Which is close to what pg_read_binary_file does now for a superuser.
We would need as well to extend BASE_BACKUP so as it does not include
relation files though for this use case.

Regards,



Hi,

we need to run pg_rewind without using a superuser role too. What if a 
new parameter, EXCLUDE_DATA_FILES, were added to the BASE_BACKUP 
command? This parameter would force the BASE_BACKUP command not to 
include relation files.


And pg_rewind can execute, for example, the following command:

BASE_BACKUP LABEL 'pg_rewind base backup' WAL EXCLUDE_DATA_FILES

This command would be executed if the --source-server parameter is 
defined. Are there any pitfalls in this approach?


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2015-12-01 Thread Artur Zakirov

On 16.11.2015 15:51, Artur Zakirov wrote:

On 10.11.2015 13:23, Artur Zakirov wrote:


Link to patch in commitfest:
https://commitfest.postgresql.org/8/420/

Link to regression tests:
https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz


I have made some changes to the documentation in the section "12.6. 
Dictionaries". I have added a description of how to load Ispell and 
Hunspell dictionaries and a description of the Ispell and Hunspell formats.


Patches for the documentation and for the code are attached separately.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***
*** 153,159  cmpspell(const void *s1, const void *s2)
  static int
  cmpspellaffix(const void *s1, const void *s2)
  {
! 	return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN));
  }
  
  static char *
--- 153,159 
  static int
  cmpspellaffix(const void *s1, const void *s2)
  {
! 	return (strncmp((*(SPELL *const *) s1)->flag, (*(SPELL *const *) s2)->flag, MAXFLAGLEN));
  }
  
  static char *
***
*** 237,242  cmpaffix(const void *s1, const void *s2)
--- 237,309 
  	   (const unsigned char *) a2->repl);
  }
  
+ static unsigned short
+ decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext)
+ {
+ 	unsigned short	s;
+ 	char		   *next;
+ 
+ 	switch (Conf->flagMode)
+ 	{
+ 		case FM_LONG:
+ 			s = (int)sflag[0] << 8 | (int)sflag[1];
+ 			if (sflagnext)
+ *sflagnext = sflag + 2;
+ 			break;
+ 		case FM_NUM:
+ 			s = (unsigned short) strtol(sflag, &next, 10);
+ 			if (sflagnext)
+ 			{
+ if (next)
+ {
+ 	*sflagnext = next;
+ 	while (**sflagnext)
+ 	{
+ 		if (**sflagnext == ',')
+ 		{
+ 			*sflagnext = *sflagnext + 1;
+ 			break;
+ 		}
+ 		*sflagnext = *sflagnext + 1;
+ 	}
+ }
+ else
+ 	*sflagnext = 0;
+ 			}
+ 			break;
+ 		default:
+ 			s = (unsigned short) *((unsigned char *)sflag);
+ 			if (sflagnext)
+ *sflagnext = sflag + 1;
+ 	}
+ 
+ 	return s;
+ }
+ 
+ static bool
+ isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag)
+ {
+ 	char *flagcur;
+ 	char *flagnext = 0;
+ 
+ 	if (affixflag == 0)
+ 		return true;
+ 
+ 	flagcur = Conf->AffixData[affix];
+ 
+ 	while (*flagcur)
+ 	{
+ 		if (decodeFlag(Conf, flagcur, &flagnext) == affixflag)
+ 			return true;
+ 		if (flagnext)
+ 			flagcur = flagnext;
+ 		else
+ 			break;
+ 	}
+ 
+ 	return false;
+ }
+ 
  static void
  NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
  {
***
*** 255,261  NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
  	}
  	Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
  	strcpy(Conf->Spell[Conf->nspell]->word, word);
! 	strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN);
  	Conf->nspell++;
  }
  
--- 322,328 
  	}
  	Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
  	strcpy(Conf->Spell[Conf->nspell]->word, word);
! 	Conf->Spell[Conf->nspell]->flag = (*flag != '\0') ? cpstrdup(Conf, flag) : VoidString;
  	Conf->nspell++;
  }
  
***
*** 355,361  FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
  	else if ((flag & StopMiddle->compoundflag) == 0)
  		return 0;
  
! 	if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL))
  		return 1;
  }
  node = StopMiddle->node;
--- 422,428 
  	else if ((flag & StopMiddle->compoundflag) == 0)
  		return 0;
  
! 	if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag))
  		return 1;
  }
  node = StopMiddle->node;
***
*** 394,400  NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
  
  	Affix = Conf->Affix + Conf->naffixes;
  
! 	if (strcmp(mask, ".") == 0)
  	{
  		Affix->issimple = 1;
  		Affix->isregis = 0;
--- 461,467 
  
  	Affix = Conf->Affix + Conf->naffixes;
  
! 	if (strcmp(mask, ".") == 0 || *mask == '\0')
  	{
  		Affix->issimple = 1;
  		Affix->isregis = 0;
***
*** 429,443  NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
  		err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
  		 REG_ADVANCED | REG_NOSUB,
  		 DEFAULT_COLLATION_OID);
  		if (err)
! 		{
! 			char		errstr[100];
! 
! 			pg_regerror(err, &(Affix->reg.regex), errstr, sizeof(errstr));
! 			ereport(ERROR,
! 	(errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
! 	 errmsg("invalid regular expression: %s", errstr)));
! 		}
  	}
  
  	Affix->flagflags = flagflags;
--- 496,504 
  		err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
  		

Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2015-11-16 Thread Artur Zakirov

On 10.11.2015 13:23, Artur Zakirov wrote:


Link to patch in commitfest:
https://commitfest.postgresql.org/8/420/

Link to regression tests:
https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz



Hello!

Do you have any remarks or comments about my patch?

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2015-11-10 Thread Artur Zakirov

On 08.11.2015 14:23, Artur Zakirov wrote:

Thank you for reply.

This was because of the flag field size of the SPELL struct: long
flags were being truncated in the .dict file.

I attached a new patch. It is a temporary patch, not final. It can be
done better.



I have updated the patch and attached it. Now dynamic memory allocation 
is used for the flag field of the SPELL struct.


I have measured the loading time and the memory usage of dictionaries 
with the new patch. A dictionary is loaded at the first reference to it, 
for example when the ts_lexize function is executed, so the first 
ts_lexize call takes more time than the second.


The following table shows the performance of some dictionaries before 
and after the patch on my machine.


---------------------------------------------------------
| dict  |  loading time, ms  |       memory, MB         |
|       |  before  |  after  |  before     |  after     |
---------------------------------------------------------
| ar    |   700    |   300   |   23.7      |   15.7     |
| br_fr |   410    |   450   |   27.4      |   27.5     |
| ca    |   248    |   245   |   14.7      |   15.4     |
| en_us |   100    |   100   |    5.4      |    6.2     |
| fr    |   160    |   178   |   13.7      |   14.1     |
| gl_es |   160    |   150   |    9.0      |    9.4     |
| is    |   260    |   202   |   16.1      |   16.3     |
---------------------------------------------------------

As you can see, loading time and memory usage are essentially the same 
before and after the patch.


Link to patch in commitfest:
https://commitfest.postgresql.org/8/420/

Link to regression tests:
https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***
*** 153,159  cmpspell(const void *s1, const void *s2)
  static int
  cmpspellaffix(const void *s1, const void *s2)
  {
! 	return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN));
  }
  
  static char *
--- 153,159 
  static int
  cmpspellaffix(const void *s1, const void *s2)
  {
! 	return (strncmp((*(SPELL *const *) s1)->flag, (*(SPELL *const *) s2)->flag, MAXFLAGLEN));
  }
  
  static char *
***
*** 237,242  cmpaffix(const void *s1, const void *s2)
--- 237,309 
  	   (const unsigned char *) a2->repl);
  }
  
+ static unsigned short
+ decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext)
+ {
+ 	unsigned short	s;
+ 	char		   *next;
+ 
+ 	switch (Conf->flagMode)
+ 	{
+ 		case FM_LONG:
+ 			s = (int)sflag[0] << 8 | (int)sflag[1];
+ 			if (sflagnext)
+ *sflagnext = sflag + 2;
+ 			break;
+ 		case FM_NUM:
+ 			s = (unsigned short) strtol(sflag, &next, 10);
+ 			if (sflagnext)
+ 			{
+ if (next)
+ {
+ 	*sflagnext = next;
+ 	while (**sflagnext)
+ 	{
+ 		if (**sflagnext == ',')
+ 		{
+ 			*sflagnext = *sflagnext + 1;
+ 			break;
+ 		}
+ 		*sflagnext = *sflagnext + 1;
+ 	}
+ }
+ else
+ 	*sflagnext = 0;
+ 			}
+ 			break;
+ 		default:
+ 			s = (unsigned short) *((unsigned char *)sflag);
+ 			if (sflagnext)
+ *sflagnext = sflag + 1;
+ 	}
+ 
+ 	return s;
+ }
+ 
+ static bool
+ isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag)
+ {
+ 	char *flagcur;
+ 	char *flagnext = 0;
+ 
+ 	if (affixflag == 0)
+ 		return true;
+ 
+ 	flagcur = Conf->AffixData[affix];
+ 
+ 	while (*flagcur)
+ 	{
+ 		if (decodeFlag(Conf, flagcur, &flagnext) == affixflag)
+ 			return true;
+ 		if (flagnext)
+ 			flagcur = flagnext;
+ 		else
+ 			break;
+ 	}
+ 
+ 	return false;
+ }
+ 
  static void
  NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
  {
***
*** 255,261  NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
  	}
  	Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
  	strcpy(Conf->Spell[Conf->nspell]->word, word);
! 	strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN);
  	Conf->nspell++;
  }
  
--- 322,328 
  	}
  	Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
  	strcpy(Conf->Spell[Conf->nspell]->word, word);
! 	Conf->Spell[Conf->nspell]->flag = (*flag != '\0') ? cpstrdup(Conf, flag) : VoidString;
  	Conf->nspell++;
  }
  
***
*** 355,361  FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
  	else if ((flag & StopMiddle->compoundflag) == 0)
  		return 0;
  
! 	if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL))
  		return 1;
  }
  node = StopMiddle->node;
--- 422,428 
  	else if ((flag & StopMiddle->compoundflag) == 0)
  		return 0;
  
! 	if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag))
  		return 1;
  }
  node = StopMiddle->node;
***
*** 394,400  NIAddAffix(Is

Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2015-11-08 Thread Artur Zakirov

On 07.11.2015 17:20, Emre Hasegeli wrote:

It seems to have something to do with the order of the affixes.  It
works, if I move affix 2646 to the beginning of the list.

[1] https://tr-spell.googlecode.com/files/dict_aff_5000_suffix_113_words.zip

Thank you for reply.

This was because of the flag field size of the SPELL struct: long flags 
were being truncated in the .dict file.


I attached a new patch. It is a temporary patch, not final. It can be 
done better.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***
*** 153,159  cmpspell(const void *s1, const void *s2)
  static int
  cmpspellaffix(const void *s1, const void *s2)
  {
! 	return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN));
  }
  
  static char *
--- 153,159 
  static int
  cmpspellaffix(const void *s1, const void *s2)
  {
! 	return (strcmp((*(SPELL *const *) s1)->flag, (*(SPELL *const *) s2)->flag));
  }
  
  static char *
***
*** 237,242  cmpaffix(const void *s1, const void *s2)
--- 237,309 
  	   (const unsigned char *) a2->repl);
  }
  
+ static unsigned short
+ decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext)
+ {
+ 	unsigned short	s;
+ 	char		   *next;
+ 
+ 	switch (Conf->flagMode)
+ 	{
+ 		case FM_LONG:
+ 			s = (int)sflag[0] << 8 | (int)sflag[1];
+ 			if (sflagnext)
+ *sflagnext = sflag + 2;
+ 			break;
+ 		case FM_NUM:
+ 			s = (unsigned short) strtol(sflag, &next, 10);
+ 			if (sflagnext)
+ 			{
+ if (next)
+ {
+ 	*sflagnext = next;
+ 	while (**sflagnext)
+ 	{
+ 		if (**sflagnext == ',')
+ 		{
+ 			*sflagnext = *sflagnext + 1;
+ 			break;
+ 		}
+ 		*sflagnext = *sflagnext + 1;
+ 	}
+ }
+ else
+ 	*sflagnext = 0;
+ 			}
+ 			break;
+ 		default:
+ 			s = (unsigned short) *((unsigned char *)sflag);
+ 			if (sflagnext)
+ *sflagnext = sflag + 1;
+ 	}
+ 
+ 	return s;
+ }
+ 
+ static bool
+ isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag)
+ {
+ 	char *flagcur;
+ 	char *flagnext = 0;
+ 
+ 	if (affixflag == 0)
+ 		return true;
+ 
+ 	flagcur = Conf->AffixData[affix];
+ 
+ 	while (*flagcur)
+ 	{
+ 		if (decodeFlag(Conf, flagcur, &flagnext) == affixflag)
+ 			return true;
+ 		if (flagnext)
+ 			flagcur = flagnext;
+ 		else
+ 			break;
+ 	}
+ 
+ 	return false;
+ }
+ 
  static void
  NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
  {
***
*** 255,261  NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
  	}
  	Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
  	strcpy(Conf->Spell[Conf->nspell]->word, word);
! 	strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN);
  	Conf->nspell++;
  }
  
--- 322,328 
  	}
  	Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
  	strcpy(Conf->Spell[Conf->nspell]->word, word);
! 	Conf->Spell[Conf->nspell]->flag = cpstrdup(Conf, flag);
  	Conf->nspell++;
  }
  
***
*** 355,361  FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
  	else if ((flag & StopMiddle->compoundflag) == 0)
  		return 0;
  
! 	if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL))
  		return 1;
  }
  node = StopMiddle->node;
--- 422,428 
  	else if ((flag & StopMiddle->compoundflag) == 0)
  		return 0;
  
! 	if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag))
  		return 1;
  }
  node = StopMiddle->node;
***
*** 394,400  NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
  
  	Affix = Conf->Affix + Conf->naffixes;
  
! 	if (strcmp(mask, ".") == 0)
  	{
  		Affix->issimple = 1;
  		Affix->isregis = 0;
--- 461,467 
  
  	Affix = Conf->Affix + Conf->naffixes;
  
! 	if (strcmp(mask, ".") == 0 || *mask == '\0')
  	{
  		Affix->issimple = 1;
  		Affix->isregis = 0;
***
*** 429,443  NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
  		err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
  		 REG_ADVANCED | REG_NOSUB,
  		 DEFAULT_COLLATION_OID);
  		if (err)
! 		{
! 			char		errstr[100];
! 
! 			pg_regerror(err, &(Affix->reg.regex), errstr, sizeof(errstr));
! 			ereport(ERROR,
! 	(errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
! 	 errmsg("invalid regular expression: %s", errstr)));
! 		}
  	}
  
  	Affix->flagflags = flagflags;
--- 496,504 
  		err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
  		 REG_ADVANCED | REG_NOSUB,
  		 DEFAULT_COLLATION_OID);
+ 		/* Ignore regular expression error and do n

Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2015-11-06 Thread Artur Zakirov

Hello again!

Patches
===

I have implemented support for the FLAG long, FLAG num and AF parameters. 
The patch is attached to this e-mail (hunspell-dict.patch).


This patch allows using the Hunspell dictionaries listed in the previous 
e-mail: ar, br_fr, ca, ca_valencia, en_ca, en_gb, en_us, fr, gl_es, 
hu_hu, is, ne_np, nl_nl, si_lk.


Most of the changes are in the affix file parsing code in spell.c.
The dictionary structures changed as follows:
- the useFlagAliases and flagMode fields were added to the IspellDict 
struct;
- the flagval array size was increased from 256 to 65000;
- the flag field of the AFFIX struct was also enlarged.

I have also implemented a patch that fixes the error reported in the e-mail
http://www.postgresql.org/message-id/562e1073.8030...@postgrespro.ru
This patch simply ignores that error.

Tests
=

Extension test dictionaries for loading into PostgreSQL and for 
normalizing with the ts_lexize function can be downloaded from 
https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz


It would be nice if somebody could run additional tests on dictionaries 
of well-known languages, because I do not know many of them.


Other Improvements
==

There are also some parameters for compound words, but I am not sure 
that we want to use them.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***
*** 237,242  cmpaffix(const void *s1, const void *s2)
--- 237,309 
  	   (const unsigned char *) a2->repl);
  }
  
+ static unsigned short
+ decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext)
+ {
+ 	unsigned short	s;
+ 	char		   *next;
+ 
+ 	switch (Conf->flagMode)
+ 	{
+ 		case FM_LONG:
+ 			s = (int)sflag[0] << 8 | (int)sflag[1];
+ 			if (sflagnext)
+ *sflagnext = sflag + 2;
+ 			break;
+ 		case FM_NUM:
+ 			s = (unsigned short) strtol(sflag, &next, 10);
+ 			if (sflagnext)
+ 			{
+ if (next)
+ {
+ 	*sflagnext = next;
+ 	while (**sflagnext)
+ 	{
+ 		if (**sflagnext == ',')
+ 		{
+ 			*sflagnext = *sflagnext + 1;
+ 			break;
+ 		}
+ 		*sflagnext = *sflagnext + 1;
+ 	}
+ }
+ else
+ 	*sflagnext = 0;
+ 			}
+ 			break;
+ 		default:
+ 			s = (unsigned short) *((unsigned char *)sflag);
+ 			if (sflagnext)
+ *sflagnext = sflag + 1;
+ 	}
+ 
+ 	return s;
+ }
+ 
+ static bool
+ isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag)
+ {
+ 	char *flagcur;
+ 	char *flagnext = 0;
+ 
+ 	if (affixflag == 0)
+ 		return true;
+ 
+ 	flagcur = Conf->AffixData[affix];
+ 
+ 	while (*flagcur)
+ 	{
+ 		if (decodeFlag(Conf, flagcur, &flagnext) == affixflag)
+ 			return true;
+ 		if (flagnext)
+ 			flagcur = flagnext;
+ 		else
+ 			break;
+ 	}
+ 
+ 	return false;
+ }
+ 
  static void
  NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
  {
***
*** 355,361  FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
  	else if ((flag & StopMiddle->compoundflag) == 0)
  		return 0;
  
! 	if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL))
  		return 1;
  }
  node = StopMiddle->node;
--- 422,428 
  	else if ((flag & StopMiddle->compoundflag) == 0)
  		return 0;
  
! 	if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag))
  		return 1;
  }
  node = StopMiddle->node;
***
*** 394,400  NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
  
  	Affix = Conf->Affix + Conf->naffixes;
  
! 	if (strcmp(mask, ".") == 0)
  	{
  		Affix->issimple = 1;
  		Affix->isregis = 0;
--- 461,467 
  
  	Affix = Conf->Affix + Conf->naffixes;
  
! 	if (strcmp(mask, ".") == 0 || *mask == '\0')
  	{
  		Affix->issimple = 1;
  		Affix->isregis = 0;
***
*** 595,604  addFlagValue(IspellDict *Conf, char *s, uint32 val)
  (errcode(ERRCODE_CONFIG_FILE_ERROR),
   errmsg("multibyte flag character is not allowed")));
  
! 	Conf->flagval[*(unsigned char *) s] = (unsigned char) val;
  	Conf->usecompound = true;
  }
  
  /*
   * Import an affix file that follows MySpell or Hunspell format
   */
--- 662,719 
  (errcode(ERRCODE_CONFIG_FILE_ERROR),
   errmsg("multibyte flag character is not allowed")));
  
! 	Conf->flagval[decodeFlag(Conf, s, (char **)NULL)] = (unsigned char) val;
  	Conf->usecompound = true;
  }
  
+ static int
+ getFlagValues(IspellDict *Conf, char *s)
+ {
+ 	uint32	 flag = 0;
+ 	char	*flagcur;
+ 	char	*flagnext = 0;
+ 
+ 	flagcur = s;
+ 	while (*flagcur)
+ 	{
+ 		flag |= Conf->flagval[decodeFlag(Conf, flagcur, &flagnext)];
+ 		if (flagnext)
+ 			flagcur = flagnext;
+ 		else
+ 			break;
+ 	}
+ 
+ 	return flag;
+ }
+ 
+ /*
+  * Get flag set from "s".
+  *

Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2015-11-06 Thread Artur Zakirov

On 06.11.2015 12:33, Artur Zakirov wrote:

Hello again!

Patches
===


Link to commitfest:
https://commitfest.postgresql.org/8/420/

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2015-10-26 Thread Artur Zakirov

On 20.10.2015 17:00, Artur Zakirov wrote:

These flag types are used in affix files of such dictionaries as ar,
br_fr, ca, ca_valencia, da_dk, en_ca, en_gb, en_us, fr, gl_es, is,
ne_np, nl_nl, si_lk (from
http://cgit.freedesktop.org/libreoffice/dictionaries/tree/).


Now almost all of these dictionaries load into PostgreSQL, but the da_dk 
dictionary does not. I see the following error:


ERROR: invalid regular expression: quantifier operand invalid
CONTEXT: line 439 of configuration file 
"/home/artur/progs/pgsql/share/tsearch_data/da_dk.affix": "SFX 55 0 s 
+GENITIV


If you open the affix file in an editor, you can see that the format of 
affix 55 on line 439 is incorrect (screen1.png):


SFX 55 0  s +GENITIV

An SFX entry should have 5 fields. There is no field between the "0" 
digit and the "s" symbol. "+GENITIV" is an optional morphological field 
and is ignored by PostgreSQL.


I think it is an error in the affix file. I wrote an e-mail to the 
dictionary authors at i...@stavekontrolden.dk about this error.


What is the right decision in this case? Should PostgreSQL ignore this 
error and not report it?


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


