[HACKERS] text search patch status update?

2008-09-16 Thread Sushant Sinha
Any status updates on the following patches?

1. Fragments in tsearch2 headlines:
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00043.php

2. Bug in hlCover:
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00089.php

-Sushant.




Re: [HACKERS] text search patch status update?

2008-09-16 Thread Sushant Sinha
Patch #1. Teodor was fine with the previous version of the patch. After that
I modified it slightly to allow a FragmentDelimiter option and Teodor may
have to look at that.

Patch #2. I think this is a straightforward bug fix.

-Sushant.

On Tue, Sep 16, 2008 at 11:27 AM, Alvaro Herrera [EMAIL PROTECTED] wrote:

 Sushant Sinha wrote:
  Any status updates on the following patches?
 
  1. Fragments in tsearch2 headlines:
  http://archives.postgresql.org/pgsql-hackers/2008-08/msg00043.php
 
  2. Bug in hlCover:
  http://archives.postgresql.org/pgsql-hackers/2008-08/msg00089.php

 Are these ready for review?  If so, please add them to this commitfest,
 http://wiki.postgresql.org/wiki/CommitFest:2008-09

 --
 Alvaro Herrera
 http://www.CommandPrompt.com/
 PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Very bad FTS performance with the Polish config

2009-11-18 Thread Sushant Sinha
ts_headline calls the equivalent of ts_lexize to break up the text. Of course,
there is also an algorithm that processes the tokens and generates the
headline, but I would be really surprised if that algorithm somehow
depended on the language (as it only processes the tokens). So Oleg is right
when he says ts_lexize is the thing to check.

I will try to replicate what you are doing, but in the meantime, can
you run the same ts_headline under psql multiple times and paste the results?
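
Something along these lines should do it - a minimal sketch, assuming a
hypothetical table articles(id, body):

\timing on
-- run each of these several times and compare the reported times
SELECT ts_headline('english', body, plainto_tsquery('english', 'test query'))
FROM articles WHERE id = 1;
SELECT ts_headline('polish', body, plainto_tsquery('polish', 'test query'))
FROM articles WHERE id = 1;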

-Sushant.

2009/11/19 Wojciech Knapik webmas...@wolniartysci.pl


 Oleg Bartunov wrote:

  Yes, for 4-word texts the results are similar.
 Try that with a longer text and the difference becomes more and more
 significant. For the lorem ipsum text, 'polish' is about 4 times slower
 than 'english'. For 5 repetitions of the text, it's 6 times; for 10
 repetitions, 7.5 times...


  Again, I see nothing unclear here, since dictionaries (as specified
  in the configuration) apply to ALL words in the document. The more words in
  the document, the more overhead.


 You're missing the point. I'm not surprised that the function takes more
 time for larger input texts - that's obvious. The thing is, the computation
 times rise more steeply when the Polish config is used. Steeply enough that
 the difference between the Polish and English configs becomes enormous in
 practical cases.

 Now this may be expected behaviour, but since I don't know if it is, I
 posted to the mailing lists to find out. If you're saying this is ok and
 there's nothing to fix here, then there's nothing more to discuss and we may
 consider the thread closed.
 If not, ts_headline deserves a closer look.

 cheers,
 Wojciech Knapik





[HACKERS] lexeme ordering in tsvector

2009-11-30 Thread Sushant Sinha
It seems like the ordering of lexemes in tsvector has changed from 8.3
to 8.4.

For example in 8.3.1,

postgres=# select to_tsvector('english', 'quit everytime');
      to_tsvector      
-----------------------
 'quit':1 'everytim':2

The lexemes are arranged by length and then by string comparison.

In postgres 8.4.1,

select to_tsvector('english', 'quit everytime');
      to_tsvector      
-----------------------
 'everytim':2 'quit':1

they are arranged by strncmp first and then by length.

I looked in tsvector_op.c: in the function tsCompareString, a memcmp
is done first and then a length comparison.

Was this change in ordering deliberate?

Wouldn't length comparison be cheaper than memcmp?

-Sushant.




Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-05-24 Thread Sushant Sinha
Now I understand the code much better. A few more questions on headline
generation that I was not able to answer from the code:

1. Why is hlparsetext used to parse the document rather than the
parsetext function? Since the words to be included in the headline will be
marked afterwards, it seems more reasonable to just use the parsetext
function.

The main difference I see is the use of hlfinditem and marking whether
some word is repeated.

The reason this is important is that hlparsetext does not seem to
store word positions, which parsetext does. The word positions are
important for generating a headline with fragments.

2.
 I would prefer the signature ts_headline( [regconfig,] text, tsquery
[,text] ) and the function should accept 'NumFragments=N' for the default
parser. Other parsers may use other options.

Does this mean we want a unified ts_headline function that triggers the
fragments code when NumFragments is specified? It seems that introducing a new
function which can take a configuration OID or name is complex, as there
are so many functions handling these issues in wparser.c.

If this is true then we just need to add marking of headline words in
prsd_headline. Otherwise we will need another prsd_headline_with_covers
function.

3. In many cases people may already have a TSVector for a given document
(for search operations). Would it be faster to pass the TSVector to the
headline function, compared to computing the TSVector each time? If that
is the case, then should we have an option to pass a TSVector to the
headline function?

-Sushant.

On Sat, 2008-05-24 at 07:57 +0400, Teodor Sigaev wrote:
 [moved to -hackers, because talk is about implementation details]
 
  I've ported the patch of Sushant Sinha for fragmented headlines to pg8.3.1
  (http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php)
 Thank you.
 
 1  diff -Nrub postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.c
 contrib/tsearch2 is now a compatibility layer for old applications - they don't
 know about new features. So, this part isn't needed.
 
 2  compiling the function (ts_headline_with_fragments) into core but
 using it only from the contrib module looks very odd. So, the new feature could
 be used only through the compatibility layer for the old release :)
 
 3  headline_with_fragments() is hardcoded to use the default parser, but what
 happens when a configuration uses another parser? For example, for the
 Japanese language.
 
 4  I would prefer the signature ts_headline( [regconfig,] text, tsquery
 [,text] ) and the function should accept 'NumFragments=N' for the default
 parser. Other parsers may use other options.
 
 5  it just doesn't work correctly, because the new code doesn't take care of
 parser-specific lexeme types:
 contrib_regression=# select headline_with_fragments('english', 'wow asd-wow
 wow', 'asd', '');
       headline_with_fragments
 ----------------------------------
   ...wow asd-wow<b>asd</b>-wow wow
 (1 row)
 
 
 So, I am inclined to use the existing framework/infrastructure, although it may be
 subject to change.
 
 Some description:
 1  ts_headline determines the correct parser to use
 2  it calls hlparsetext to split the text into a structure suitable for both goals:
 finding the best fragment(s) and concatenating those fragment(s) back into the text
 representation
 3  it calls the parser-specific method prsheadline, which works with the preparsed
 text (parsing was done in hlparsetext). The method should mark the needed
 words/parts/lexemes etc.
 4  ts_headline glues the fragments into text and returns that.
 
 We need a parser's headline method because only the parser knows everything about its
 lexemes.
 
 
 -- 
 Teodor Sigaev   E-mail: [EMAIL PROTECTED]
 WWW: http://www.sigaev.ru/
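
For orientation, the user-facing call that drives steps 1-4 above is plain
ts_headline; a minimal example against the stock API (nothing patch-specific
here):

SELECT ts_headline('english',
                   'The quick brown fox jumps over the lazy dog',
                   to_tsquery('english', 'fox'),
                   'StartSel=<b>, StopSel=</b>, MaxWords=10, MinWords=5');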
 
 




Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-05-31 Thread Sushant Sinha
I have attached a new patch with respect to the current cvs head. This
produces headline in a document for a given query. Basically it
identifies fragments of text that contain the query and displays them.

DESCRIPTION

HeadlineParsedText contains an array of the actual words but no
information about the norms. We need an indexed position vector for each
norm so that we can quickly evaluate a number of possible fragments -
something that tsvector provides.

So this patch changes HeadlineParsedText to contain the norms
(ParsedText). This field is updated while parsing in hlparsetext. The
position information of the norms corresponds to the position of words
in HeadlineParsedText (not to the norms' positions, as is the case in
tsvector). This works correctly with the current parser. If you think
there may be issues with other parsers, please let me know.

This approach does not change any other interface and fits nicely with
the overall framework.

The norms are converted into tsvector and a number of covers are
generated. The best covers are then chosen to be in the headline. The
covers are separated using a hardcoded coversep. Let me know if you want
to expose this as an option.

Covers that overlap with already chosen covers are excluded.

Some options like ShortWord and MinWords are not taken care of right
now. MaxWords is used as maxcoversize. Let me know if you would like to
see other options for fragment generation as well.
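
For illustration, a call exercising the new behavior might look like this
(a sketch only: the table and column are made up, and NumFragments is the
option name as of this version of the patch):

SELECT ts_headline('english', body,
                   to_tsquery('english', 'fragment & headline'),
                   'NumFragments=2, MaxWords=20')
FROM documents;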

Let me know any more changes you would like to see.

-Sushant.

On Tue, 2008-05-27 at 13:30 +0400, Teodor Sigaev wrote:
 Hi!
 
  1. Why is hlparsetext used to parse the document rather than the
  parsetext function? Since the words to be included in the headline will be
  marked afterwards, it seems more reasonable to just use the parsetext
  function.
  The main difference I see is the use of hlfinditem and marking whether
  some word is repeated.
 hlparsetext preserves every kind of lexeme - not indexed, spaces etc.; parsetext 
 doesn't.
 hlparsetext preserves the original form of lexemes; parsetext doesn't.
 
  
  The reason this is important is that hlparsetext does not seem to be
  storing word positions which parsetext does. The word positions are
  important for generating headline with fragments.
 Not needed - hlparsetext preserves the whole text, so the position is just the
 index into the array.
 
  
  2.
  I would prefer the signature ts_headline( [regconfig,] text, tsquery
  [,text] ) and the function should accept 'NumFragments=N' for the default
  parser. Other parsers may use other options.
  
  Does this mean we want a unified function ts_headline and we trigger the
  fragments if NumFragments is specified? 
 
 The trigger should be inside the parser-specific function (pg_ts_parser.prsheadline). 
 Other parsers might not recognize that option.
 
  It seems that introducing a new
  function which can take configuration OID, or name is complex as there
  are so many functions handling these issues in wparser.c.
 No, of course - ts_headline takes care of finding the configuration and calling 
 the correct parser.
 
  
  If this is true then we need to just  add marking of headline words in
  prsd_headline. Otherwise we will need another prsd_headline_with_covers
  function.
 Yeah, pg_ts_parser.prsheadline should mark the lexemes too. It can even change 
 the array of HeadlineParsedText.
 
  
  3. In many cases people may already have TSVector for a given document
  (for search operation). Would it be faster to pass TSVector to headline
  function when compared to computing TSVector each time? If that is the
  case then should we have an option to pass TSVector to headline
  function?
 As I mentioned above, tsvector doesn't contain the whole information about the text.
 
Index: src/backend/tsearch/dict.c
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/tsearch/dict.c,v
retrieving revision 1.5
diff -u -r1.5 dict.c
--- src/backend/tsearch/dict.c	25 Mar 2008 22:42:43 -	1.5
+++ src/backend/tsearch/dict.c	30 May 2008 23:20:57 -
@@ -16,6 +16,7 @@
 #include "catalog/pg_type.h"
 #include "tsearch/ts_cache.h"
 #include "tsearch/ts_utils.h"
+#include "tsearch/ts_public.h"
 #include "utils/builtins.h"
 
 
Index: src/backend/tsearch/to_tsany.c
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/tsearch/to_tsany.c,v
retrieving revision 1.12
diff -u -r1.12 to_tsany.c
--- src/backend/tsearch/to_tsany.c	16 May 2008 16:31:01 -	1.12
+++ src/backend/tsearch/to_tsany.c	31 May 2008 08:43:27 -
@@ -15,6 +15,7 @@
 
 #include "catalog/namespace.h"
 #include "tsearch/ts_cache.h"
+#include "tsearch/ts_public.h"
 #include "tsearch/ts_utils.h"
 #include "utils/builtins.h"
 #include utils/syscache.h
Index: src/backend/tsearch/ts_parse.c
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/tsearch/ts_parse.c,v
retrieving 

[HACKERS] phrase search

2008-05-31 Thread Sushant Sinha
I have attached a patch for phrase search against the CVS head.
Basically it takes a phrase (text) and a TSVector. It checks whether the
relative positions of the lexemes in the phrase are the same as their
positions in the TSVector.

If the configuration for text search is simple, then this will produce
exact phrase search. Otherwise the stopwords in a phrase will be ignored
and the words in the phrase will only be matched against the stemmed lexemes.
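
As a usage sketch (the table and column names are hypothetical;
is_phrase_present(text, tsvector) is the function added by this patch):

SELECT title
FROM   docs
WHERE  is_phrase_present('freedom of speech', docvector);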

For my application I am using this as a separate shared object. I do not
know how to expose this function from the core. Can someone explain how
to do this?

I saw this discussion on phrase search and I am not sure what other
functionality is required.

http://archives.postgresql.org/pgsql-general/2008-02/msg01170.php

-Sushant.
Index: src/backend/utils/adt/Makefile
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/utils/adt/Makefile,v
retrieving revision 1.69
diff -u -r1.69 Makefile
--- src/backend/utils/adt/Makefile	19 Feb 2008 10:30:08 -	1.69
+++ src/backend/utils/adt/Makefile	31 May 2008 19:57:34 -
@@ -29,7 +29,7 @@
 	tsginidx.o tsgistidx.o tsquery.o tsquery_cleanup.o tsquery_gist.o \
 	tsquery_op.o tsquery_rewrite.o tsquery_util.o tsrank.o \
 	tsvector.o tsvector_op.o tsvector_parser.o \
-	txid.o uuid.o xml.o
+	txid.o uuid.o xml.o phrase_search.o
 
 like.o: like.c like_match.c
 
Index: src/backend/utils/adt/phrase_search.c
===
RCS file: src/backend/utils/adt/phrase_search.c
diff -N src/backend/utils/adt/phrase_search.c
--- /dev/null	1 Jan 1970 00:00:00 -
+++ src/backend/utils/adt/phrase_search.c	31 May 2008 19:56:59 -
@@ -0,0 +1,167 @@
+#include "postgres.h"
+
+#include "tsearch/ts_type.h"
+#include "tsearch/ts_utils.h"
+
+#include "fmgr.h"
+
+#ifdef PG_MODULE_MAGIC
+PG_MODULE_MAGIC;
+#endif
+
+PG_FUNCTION_INFO_V1(is_phrase_present);
+Datum is_phrase_present(PG_FUNCTION_ARGS);
+
+typedef struct {
+	WordEntryPosVector 	*posVector;
+	int4	posInPhrase;
+	int4 			curpos;	
+} PhraseInfo;
+
+static int
+WordCompareVectorEntry(char *eval, WordEntry *ptr, ParsedWord *prsdword)
+{
+	if (ptr->len == prsdword->len)
+		return strncmp(
+					   eval + ptr->pos,
+					   prsdword->word,
+					   prsdword->len);
+
+	return (ptr->len > prsdword->len) ? 1 : -1;
+}
+
+/*
+ * Returns a pointer to a WordEntry from tsvector t corresponding to prsdword. 
+ * Returns NULL if not found.
+ */
+static WordEntry *
+find_wordentry_prsdword(TSVector t, ParsedWord *prsdword)
+{
+	WordEntry  *StopLow = ARRPTR(t);
+	WordEntry  *StopHigh = (WordEntry *) STRPTR(t);
+	WordEntry  *StopMiddle;
+	int			difference;
+
+	/* Loop invariant: StopLow <= item < StopHigh */
+
+	while (StopLow < StopHigh)
+	{
+		StopMiddle = StopLow + (StopHigh - StopLow) / 2;
+		difference = WordCompareVectorEntry(STRPTR(t), StopMiddle, prsdword);
+		if (difference == 0)
+			return StopMiddle;
+		else if (difference > 0)
+			StopLow = StopMiddle + 1;
+		else
+			StopHigh = StopMiddle;
+	}
+
+	return NULL;
+}
+
+
+static int4 
+check_and_advance(int4 i, PhraseInfo *phraseInfo)
+{
+	WordEntryPosVector *posvector1, *posvector2;
+	int4 diff;
+
+	posvector1 = phraseInfo[i].posVector;
+	posvector2 = phraseInfo[i+1].posVector;
+	
+	diff = phraseInfo[i+1].posInPhrase - phraseInfo[i].posInPhrase;
+	while (posvector2->pos[phraseInfo[i+1].curpos] - posvector1->pos[phraseInfo[i].curpos] < diff)
+		if (phraseInfo[i+1].curpos >= posvector2->npos - 1)
+			return 2;
+		else
+			phraseInfo[i+1].curpos += 1;
+
+	if (posvector2->pos[phraseInfo[i+1].curpos] - posvector1->pos[phraseInfo[i].curpos] == diff)
+		return 1;
+	else
+		return 0;
+}
+
+int4
+initialize_phraseinfo(ParsedText *prs, TSVector t, PhraseInfo *phraseInfo)
+{
+	WordEntry *entry;
+	int4 i;
+
+	for (i = 0; i < prs->curwords; i++)
+	{
+		phraseInfo[i].posInPhrase = prs->words[i].pos.pos;
+		entry = find_wordentry_prsdword(t, &(prs->words[i]));
+		if (entry == NULL)
+			return 0;
+		else
+			phraseInfo[i].posVector = _POSVECPTR(t, entry);
+	}
+	return 1;
+}
+Datum
+is_phrase_present(PG_FUNCTION_ARGS)
+{
+	ParsedText	prs;
+	int4		numwords, i, retval, found = 0;
+	PhraseInfo  *phraseInfo;
+	text		*phrase	= PG_GETARG_TEXT_P(0);
+	TSVector 	t		= PG_GETARG_TSVECTOR(1);
+	Oid			cfgId	= getTSCurrentConfig(true);
+
+	prs.lenwords = (VARSIZE(phrase) - VARHDRSZ) / 6;	/* just an estimate of the number of words */
+	if (prs.lenwords == 0)
+		prs.lenwords = 2;
+	prs.curwords = 0;
+	prs.pos = 0;
+	prs.words = (ParsedWord *) palloc0(sizeof(ParsedWord) * prs.lenwords);
+
+	parsetext(cfgId, &prs, VARDATA(phrase), VARSIZE(phrase) - VARHDRSZ);
+
+	// allocate & initialize 
+	numwords 	= prs.curwords;
+	phraseInfo	= palloc0(numwords * sizeof(PhraseInfo));
+
+	
+	if (numwords > 0 && initialize_phraseinfo(&prs, t, 

Re: [HACKERS] phrase search

2008-06-02 Thread Sushant Sinha
On Mon, 2008-06-02 at 19:39 +0400, Teodor Sigaev wrote:
 
  I have attached a patch for phrase search with respect to the cvs head.
  Basically it takes a a phrase (text) and a TSVector. It checks if the
  relative positions of lexeme in the phrase are same as in their
  positions in TSVector.
 
 Ideally, phrase search should be implemented as a new operator in tsquery, say #, 
 with an optional distance. So, tsquery 'foo #2 bar' means: find all texts where 
 'bar' is placed no farther than two words from 'foo'. The complexity lies in complex 
 boolean expressions ( 'foo #1 ( bar1 & bar2 )' ) and in languages such 
 as Norwegian or German. German has compound words, like 'footballbar', 
 which have several variants of splitting, so the result of to_tsquery('foo # 
 footballbar') will be 'foo # ( ( football & bar ) | ( foot & ball & bar ) )', 
 where the variants are connected with the OR operation.

This is far more complicated than I thought.

 Of course, phrase search should be able to use indexes.

I can probably look into how to use an index. Any pointers on this?

  
  If the configuration for text search is simple, then this will produce
  exact phrase search. Otherwise the stopwords in a phrase will be ignored
  and the words in the phrase will only be matched against the stemmed lexemes.
 
 Your solution can't be used as-is, because the user has to use a tsquery too, 
 so that an index can be used:
 
 column @@ to_tsquery('phrase search') AND is_phrase_present('phrase search', 
 column)
 
 The first clause will be used for the index scan, and it will quickly find 
 candidate rows.

Yes, this is exactly how I am using it in my application. Do you think this
will solve a lot of the common cases, or should we try to get phrase search to:

1. Use an index
2. Support arbitrary distance between lexemes
3. Support complex boolean queries
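
Concretely, against a hypothetical schema docs(body text, docvector tsvector),
the combined pattern Teodor describes looks like this (all names invented for
illustration):

CREATE INDEX docs_docvector_idx ON docs USING gin(docvector);

SELECT body
FROM   docs
WHERE  docvector @@ to_tsquery('phrase & search')      -- index scan narrows candidates
  AND  is_phrase_present('phrase search', docvector);  -- position check on the survivors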

-Sushant. 

 
  For my application I am using this as a separate shared object. I do not
  know how to expose this function from the core. Can someone explain how
  to do this?
 
 Look at the contrib/ directory in pgsql's source code - make a contrib module
 from your patch. As an example, look at the adminpack module - it's rather simple.
 
 Comments of your code:
 1)
 +#ifdef PG_MODULE_MAGIC
 +PG_MODULE_MAGIC;
 +#endif
 
 That isn't needed for files compiled into core; it's only needed for modules.
 
 2)
   use only /* */ comments, do not use // (C++ style) comments




Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-06-02 Thread Sushant Sinha
Efficiency: I realized that we do not need to store all norms. We need
to store only the norms that are in the query. So I moved the addition
of norms from addHLParsedLex to hlfinditem. This should add very little
memory overhead to existing headline generation.

If this is still not acceptable for default headline generation, then I
can push it into mark_hl_fragments. But I think any headline-marking
function will benefit from having the norms that correspond to the query.
Why do we need norms?

hlCover does exactly what Cover in tsrank does, which is to find
the cover that contains the query. However, hlCover has to go through
words that do not match the query. Cover, on the other hand, operates on
position indexes for just the query words, and so it should be faster.

The main reason why I would like it to be fast is that I want to
generate all covers for a given query, then choose the covers with the smallest
length, as they will be the ones that best explain the relation of a
query to a document, and finally stretch those covers to the specified size.

In my understanding, the current headline generation tries to find the
biggest cover for display in the headline. I personally think that such
a cover does not explain the context of a query in a document. We may
differ on this, and that's why we may need both options.

Let me know what you think of this patch, and I will update it to
respect other options like MinWords and ShortWord.

NumFragments >= 2:
I wanted people to use the new headline marker if they specify
NumFragments >= 1. If they do not specify NumFragments or set it to
0, then the default marker will be used. This becomes a bit of a tricky
parameter, so please send in any ideas on how to trigger the new marker.

On another note, I found that make_tsvector crashes if it receives a
ParsedText with curwords = 0. Specifically, uniqueWORD returns curwords
as 1 even when it gets 0 words. I am not sure if this is the desired
behavior.

-Sushant.


On Mon, 2008-06-02 at 18:10 +0400, Teodor Sigaev wrote:
  I have attached a new patch with respect to the current cvs head. This
  produces headline in a document for a given query. Basically it
  identifies fragments of text that contain the query and displays them.
 New variant is much better, but...
 
   HeadlineParsedText contains an array of  actual words but not
  information about the norms. We need an indexed position vector for each
  norm so that we can quickly evaluate a number of possible fragments.
  Something that tsvector provides.
 
  Why do you need to store norms? The single purpose of norms is identifying words 
  from the query - but that's already done by hlfinditem. It sets 
  HeadlineWordEntry->item to the corresponding QueryOperand in the tsquery.
  Look, the headline function is rather expensive, and your patch adds a lot of extra 
  work - at least in memory usage. And if the user calls it with NumFragments=0, that 
  work is unneeded.
 
  This approach does not change any other interface and fits nicely with
  the overall framework.
  Yeah, it's a really big step forward. Thank you. You are very close to 
  committing, except: did you find the hlCover() function, which produces a cover from 
  the original HeadlineParsedText representation? Is there any reason not to use it?
 
  
  The norms are converted into tsvector and a number of covers are
  generated. The best covers are then chosen to be in the headline. The
  covers are separated using a hardcoded coversep. Let me know if you want
  to expose this as an option.
 
 
  
  Covers that overlap with already chosen covers are excluded.
  
  Some options like ShortWord and MinWords are not taken care of right
  now. MaxWords are used as maxcoversize. Let me know if you would like to
  see other options for fragment generation as well.
  ShortWord, MinWords and MaxWords should keep their meaning, but for each 
  fragment, not for the whole headline.
 
 
  
  Let me know any more changes you would like to see.
 
  if (num_fragments == 0)
  	/* call the default headline generator */
  	mark_hl_words(prs, query, highlight, shortword, min_words, max_words);
  else
  	mark_hl_fragments(prs, query, highlight, num_fragments, max_words);
 
 
 Suppose, num_fragments >= 2?
 
Index: src/backend/tsearch/dict.c
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/tsearch/dict.c,v
retrieving revision 1.5
diff -u -r1.5 dict.c
--- src/backend/tsearch/dict.c	25 Mar 2008 22:42:43 -	1.5
+++ src/backend/tsearch/dict.c	30 May 2008 23:20:57 -
@@ -16,6 +16,7 @@
 #include "catalog/pg_type.h"
 #include "tsearch/ts_cache.h"
 #include "tsearch/ts_utils.h"
+#include "tsearch/ts_public.h"
 #include "utils/builtins.h"
 
 
Index: src/backend/tsearch/to_tsany.c
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/tsearch/to_tsany.c,v
retrieving revision 1.12

Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-06-03 Thread Sushant Sinha
My main argument for using Cover instead of hlCover was that Cover would
be faster. I tested the default headline generation, which uses hlCover,
against the current patch, which uses Cover. There was not much difference.
So I think you are right in that we do not need the norms and we can just
use hlCover.

I also compared the performance of ts_headline with my first patch for
headline generation (the one that was a separate function and took a tsvector
as input). The performance was dramatically different. For one query
ts_headline took roughly 200 ms while headline_with_fragments took just
70 ms. On another query ts_headline took 76 ms while
headline_with_fragments took 24 ms. You can find 'explain analyze' for
the first query at the bottom of this message.

These queries were run multiple times to ensure that I never hit the
disk. This is a machine with a 2.0 GHz Pentium 4 CPU and 512 MB RAM running
Linux 2.6.22-gentoo-r8.

A couple of caveats: 

1. ts_headline testing was done with the current CVS head, whereas
headline_with_fragments was tested with PostgreSQL 8.3.1.

2. For headline_with_fragments, the TSVector for the document was obtained
by joining with another table.

Are these differences understandable?

If you think these caveats are the reason, or there is something I am
missing, then I can repeat the entire experiment under exactly the same
conditions.

-Sushant.


Here is 'explain analyze' for both the functions:


ts_headline


lawdb=# explain analyze SELECT ts_headline('english', doc, q, '')
        FROM    docraw, plainto_tsquery('english', 'freedom of
                speech') as q
        WHERE   docraw.tid = 125596;
                              QUERY PLAN
--------------------------------------------------------------------------
 Nested Loop  (cost=0.00..8.31 rows=1 width=497) (actual
 time=199.692..200.207 rows=1 loops=1)
   ->  Index Scan using docraw_pkey on docraw  (cost=0.00..8.29 rows=1
       width=465) (actual time=0.041..0.065 rows=1 loops=1)
         Index Cond: (tid = 125596)
   ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual
       time=0.010..0.014 rows=1 loops=1)
 Total runtime: 200.311 ms


headline_with_fragments
---

lawdb=# explain analyze SELECT headline_with_fragments('english',
        docvector, doc, q, 'MaxWords=40')
        FROM    docraw, docmeta, plainto_tsquery('english', 'freedom
                of speech') as q
        WHERE   docraw.tid = 125596 and docmeta.tid = 125596;
                              QUERY PLAN
--------------------------------------------------------------------------
 Nested Loop  (cost=0.00..16.61 rows=1 width=883) (actual
 time=70.564..70.949 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..16.59 rows=1 width=851) (actual
       time=0.064..0.094 rows=1 loops=1)
         ->  Index Scan using docraw_pkey on docraw  (cost=0.00..8.29
             rows=1 width=454) (actual time=0.040..0.044 rows=1 loops=1)
               Index Cond: (tid = 125596)
         ->  Index Scan using docmeta_pkey on docmeta  (cost=0.00..8.29
             rows=1 width=397) (actual time=0.017..0.040 rows=1 loops=1)
               Index Cond: (docmeta.tid = 125596)
   ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual
       time=0.012..0.016 rows=1 loops=1)
 Total runtime: 71.076 ms
(8 rows)


On Tue, 2008-06-03 at 22:53 +0400, Teodor Sigaev wrote:
  Why we need norms?
 
  We don't need norms at all - every matched HeadlineWordEntry is already marked by 
  HeadlineWordEntry->item! If it equals NULL then the word isn't contained in 
  the tsquery.
 
  hlCover does the exact thing that Cover in tsrank does which is to find
  the  cover that contains the query. However hlcover has to go through
  words that do not match the query. Cover on the other hand operates on
  position indexes for just the query words and so it should be faster. 
  A cover, by definition, is a minimal continuous piece of text matched by the query. 
  There may be several covers in a text, and hlCover will find all of them. Next, 
  prsd_headline() (for now) tries to determine the best one. Best means: the cover 
  contains a lot of words from the query, not fewer than MinWords, not more than 
  MaxWords, has no words shorter than ShortWord at the beginning and end of the cover, 
  etc.
  
  The main reason why I would like it to be fast is that I want to
  generate all covers for a given query, then choose the covers with the smallest
 hlCover generates all covers.
 
  Let me know what you think on this patch and I will update the patch to
  respect other options like MinWords and ShortWord. 
 
  As I understand it, you really wish to call the Cover() function instead of hlCover() - 
  by design, they should be identical, but they accept different document 
  representations. So, the best way is to generalize them: develop a new one which can 
  be called with some kind of callback or/and opaque structure, to use it in both 
  rank and headline.
 
  
  NumFragments >= 2:
  I wanted people to use the new headline marker if they specify
  NumFragments >= 1. If they do not 

Re: [HACKERS] phrase search

2008-06-03 Thread Sushant Sinha
On Tue, 2008-06-03 at 22:16 +0400, Teodor Sigaev wrote:
  This is far more complicated than I thought.
  Of course, phrase search should be able to use indexes.
  I can probably look into how to use index. Any pointers on this?
 
 src/backend/utils/adt/tsginidx.c; if you invent the operation # in tsquery then 
 you will have index support with minimal effort.
  
  Yes this is exactly how I am using in my application. Do you think this
  will solve a lot of common case or we should try to get phrase search
 
 Yeah, it solves a lot of useful cases. For simple use, a function similar to the 
 existing plainto_tsquery needs to be invented, say phraseto_tsquery. It should 
 produce a correct tsquery with the operations described above.
 

I can add index support and support for arbitrary distance between
lexemes.

It appears to me that supporting arbitrary boolean expressions will be
complicated. Can we pull out something from TSQuery?
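
For what it's worth, a sketch of how the suggested phraseto_tsquery might
behave, assuming the proposed # operator - none of this exists yet:

-- hypothetical: 'of' is a stopword in the english configuration, so it
-- would be dropped and the distance adjusted accordingly
SELECT phraseto_tsquery('english', 'freedom of speech');
-- might produce something like:  'freedom' #2 'speech'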

-Sushant.




Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-06-21 Thread Sushant Sinha
I have attached an updated patch with the following changes:

1. Respects ShortWord and MinWords
2. Uses hlCover instead of Cover
3. Does not store norms (or lexemes) for headline marking
4. Removes ts_rank.h
5. Earlier it counted even NONWORDTOKEN in the headline; now it
only counts the actual words and excludes spaces etc.

I have also changed the NumFragments option to MaxFragments, as there may not
be enough covers to display NumFragments.

Another change that I was thinking about:

Right now, if the cover size > max_words, I just cut the trailing words.
Instead I was thinking that we should split the cover into more
fragments such that each fragment contains a few query words. Then each
fragment will not contain all query words but will show more occurrences
of query words in the headline. I would like to know what your opinion
on this is.

-Sushant.

On Thu, 2008-06-05 at 20:21 +0400, Teodor Sigaev wrote:
  A couple of caveats: 
  
  1. ts_headline testing was done with the current CVS head, whereas
  headline_with_fragments was tested with PostgreSQL 8.3.1.
  2. For headline_with_fragments, TSVector for the document was obtained
  by joining with another table.
  Are these differences understandable?
 
  That is a possible situation, because ts_headline has several criteria for the 'best' 
  cover - length, number of words from the query, good words at the beginning and at 
  the end of the headline - while your fragment algorithm takes care only of the total 
  number of words in all covers. It's not very good, but it's acceptable, I think. 
  Headline generation (and ranking too) has no formal rules defining what is good or 
  bad - just people's opinions.
 
  Next possible reason: the original algorithm had a look at all covers trying to find 
  the best one, while your algorithm tries to find just the shortest covers to fill 
  a headline.
 
  But it's very desirable to respect ShortWord - it's not very comfortable for the user 
  if one option produces an unobvious side effect with another one.
 
  If you think these caveats are the reasons or there is something I am
  missing, then I can repeat the entire experiments with exactly the same
  conditions. 
 
  An interesting test for me would be comparing hlCover with Cover in your patch, i.e. 
  develop a patch which uses hlCover instead of Cover and compare the old patch with 
  the new one.
Index: src/backend/tsearch/wparser_def.c
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.14
diff -c -r1.14 wparser_def.c
*** src/backend/tsearch/wparser_def.c	1 Jan 2008 19:45:52 -	1.14
--- src/backend/tsearch/wparser_def.c	21 Jun 2008 07:59:02 -
***
*** 1684,1701 
  	return false;
  }
  
! Datum
! prsd_headline(PG_FUNCTION_ARGS)
  {
! 	HeadlineParsedText *prs = (HeadlineParsedText *) PG_GETARG_POINTER(0);
! 	List	   *prsoptions = (List *) PG_GETARG_POINTER(1);
! 	TSQuery		query = PG_GETARG_TSQUERY(2);
  
! 	/* from opt + start and and tag */
! 	int			min_words = 15;
! 	int			max_words = 35;
! 	int			shortword = 3;
  
  	int			p = 0,
  q = 0;
  	int			bestb = -1,
--- 1684,1891 
  	return false;
  }
  
! static void 
! mark_fragment(HeadlineParsedText *prs, int highlight, int startpos, int endpos)
  {
! 	int   i;
! 	char *coversep = "...";
! 	int   coverlen = strlen(coversep);
  
! 	for (i = startpos; i <= endpos; i++)
! 	{
! 		if (prs->words[i].item)
! 			prs->words[i].selected = 1;
! 		if (highlight == 0)
! 		{
! 			if (HLIDIGNORE(prs->words[i].type))
! 				prs->words[i].replace = 1;
! 		}
! 		else
! 		{
! 			if (XMLHLIDIGNORE(prs->words[i].type))
! 				prs->words[i].replace = 1;
! 		}
! 
! 		prs->words[i].in = (prs->words[i].repeated) ? 0 : 1;
! 	}
! 	/* add cover separators if needed */ 
! 	if (startpos > 0 && strncmp(prs->words[startpos-1].word, coversep, 
! 		prs->words[startpos-1].len) != 0)
! 	{
! 		
! 		prs->words[startpos-1].word = repalloc(prs->words[startpos-1].word, sizeof(char) * coverlen);
! 		prs->words[startpos-1].in   = 1;
! 		prs->words[startpos-1].len  = coverlen;
! 		memcpy(prs->words[startpos-1].word, coversep, coverlen);
! 	}
! 	if (endpos-1 < prs->curwords && strncmp(prs->words[startpos-1].word, coversep,
! 		prs->words[startpos-1].len) != 0)
! 	{
! 		prs->words[endpos+1].word = repalloc(prs->words[endpos+1].word, sizeof(char) * coverlen);
! 		prs->words[endpos+1].in   = 1;
! 		memcpy(prs->words[endpos+1].word, coversep, coverlen);
! 	}
! }
! }
! 
! typedef struct 
! {
! 	int4 startpos;
! 	int4 endpos;
! 	int2 in;
! 	int2 excluded;
! } CoverPos;
! 
! 
! static void
! mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, int highlight,
! int shortword, int min_words, 
! 			int max_words, int max_fragments)
! {
! 	int4   	curlen, coverlen, i, f, num_f;
! 	int4		stretch, maxstretch;
! 
! 	int4   	startpos = 0, 
!  			endpos   = 0,
! 			p= 0,
! 			q= 0;
! 
! 	int4		numcovers = 0, 
! 			maxcovers = 32;
! 
! 	int4

[HACKERS] initdb in current cvs head broken?

2008-07-10 Thread Sushant Sinha
I am trying to generate a patch against the current CVS head. So
I rsynced the tree, then did cvs up, and installed the db. However, when
I did initdb on a data directory, it got stuck:

It is stuck after printing:
creating template1 database in /home/postgres/data/base/1 ... 

I did strace:

$ strace -p 9852
Process 9852 attached - interrupt to quit
waitpid(9864,

then I straced 9864:

$ strace -p 9864
Process 9864 attached - interrupt to quit
semop(8060958, 0xbff36fee,

 $ ps aux|grep 9864   
postgres  9864  1.5  1.3  37296  6816 pts/1S+   07:51
0:02 /usr/local/pgsql/bin/postgres --boot -x1 -F


This seems like a bug to me. Is the tree stable only after commitfests,
and should I not use the unstable tree for generating patches?

Thanks,
-Sushant.




Re: [HACKERS] initdb in current cvs head broken?

2008-07-10 Thread Sushant Sinha
You are right. I did not do make clean last time. After make clean, make
all, and make install it works fine. 

-Sushant.

On Thu, 2008-07-10 at 17:55 +0530, Pavan Deolasee wrote:
 On Thu, Jul 10, 2008 at 5:36 PM, Sushant Sinha [EMAIL PROTECTED] wrote:
 
 
 
  Seems like a bug to me. Is the tree stable only after commit fests and I
  should not use the unstable tree for generating patches?
 
 
 I quickly tried on my repo and it's working fine. (Well, it could be a
 bit out of sync with the head.)
 
 Usually, the tree may get a bit inconsistent during an active period,
 but it's not very common. I've seen committers doing a good job before
 checking in any code and making sure it works fine (at least initdb and
 regression tests).
 
 I would suggest doing a clean build at your end once again.
 
 Thanks,
 Pavan
 




Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-07-14 Thread Sushant Sinha
Attached is a new patch that:

1. fixes the previous bug
2. better handles the case when the cover size is greater than MaxWords.
Basically it divides a cover greater than MaxWords into fragments of
MaxWords, resizes each such fragment so that each end of the fragment
contains a query word, and then evaluates the best fragments based on the
number of query words in each fragment. In case of a tie it picks the smaller
fragment. This allows more query words to be shown with multiple fragments
when a single cover is larger than MaxWords.

The resizing of a fragment such that each end is a query word provides room
for stretching both sides of the fragment. This (hopefully) better presents
the context in which query words appear in the document. If a cover is
smaller than MaxWords then the cover is treated as a fragment.
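
To illustrate the intended behavior (a sketch only: apod is the test table
from Oleg's dump mentioned below, and the exact fragment boundaries depend
on the document):

SELECT ts_headline(body, plainto_tsquery('black hole'),
                   'MaxFragments=2, MaxWords=12')
FROM apod
WHERE to_tsvector(body) @@ plainto_tsquery('black hole')
LIMIT 1;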

Let me know if you have any more suggestions or anything is not clear.

I have not yet added the regression tests. The regression test suite seemed
to only ensure that the function works. How many tests should I add? Is
there any other place where I need to add different test cases for
the function?

-Sushant.


Nice. But it will be good to resolve the following issues:
 1) The patch contains mistakes; I didn't investigate or carefully read it. Get
 http://www.sai.msu.su/~megera/postgres/fts/apod.dump.gz and load it in a db.

 Queries
 # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
 from apod where to_tsvector(body) @@ plainto_tsquery('black hole');

 and

 # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
 from apod;

 crash postgresql :(

 2) pls, include in your patch documentation and regression tests.


 Another change that I was thinking:

 Right now if cover size > max_words then I just cut the trailing words.
 Instead I was thinking that we should split the cover into more
 fragments such that each fragment contains a few query words. Then each
 fragment will not contain all query words but will show more occurrences
 of query words in the headline. I would like to know what your opinion
 on this is.


 Agreed.


 --
 Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW:
 http://www.sigaev.ru/

Index: src/backend/tsearch/wparser_def.c
===
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.15
diff -c -r1.15 wparser_def.c
*** src/backend/tsearch/wparser_def.c	17 Jun 2008 16:09:06 -	1.15
--- src/backend/tsearch/wparser_def.c	15 Jul 2008 04:30:34 -
***
*** 1684,1701 
  	return false;
  }
  
! Datum
! prsd_headline(PG_FUNCTION_ARGS)
  {
! 	HeadlineParsedText *prs = (HeadlineParsedText *) PG_GETARG_POINTER(0);
! 	List	   *prsoptions = (List *) PG_GETARG_POINTER(1);
! 	TSQuery		query = PG_GETARG_TSQUERY(2);
  
! 	/* from opt + start and and tag */
! 	int			min_words = 15;
! 	int			max_words = 35;
! 	int			shortword = 3;
  
  	int			p = 0,
  q = 0;
  	int			bestb = -1,
--- 1684,1944 
  	return false;
  }
  
! static void 
! mark_fragment(HeadlineParsedText *prs, int highlight, int startpos, int endpos)
  {
! 	int   i;
! 	char *coversep = "...";
! 	int   seplen   = strlen(coversep);
  
! 	for (i = startpos; i <= endpos; i++)
! 	{
! 		if (prs->words[i].item)
! 			prs->words[i].selected = 1;
! 		if (highlight == 0)
! 		{
! 			if (HLIDIGNORE(prs->words[i].type))
! 				prs->words[i].replace = 1;
! 		}
! 		else
! 		{
! 			if (XMLHLIDIGNORE(prs->words[i].type))
! 				prs->words[i].replace = 1;
! 		}
! 
! 		prs->words[i].in = (prs->words[i].repeated) ? 0 : 1;
! 	}
! 	/* add cover separators if needed */ 
! 	if (startpos > 0)
! 	{
! 		
! 		prs->words[startpos-1].word = repalloc(prs->words[startpos-1].word, sizeof(char) * seplen);
! 		prs->words[startpos-1].in   = 1;
! 		prs->words[startpos-1].len  = seplen;
! 		memcpy(prs->words[startpos-1].word, coversep, seplen);
! 	}
! }
! 
! typedef struct 
! {
! 	int4 startpos;
! 	int4 endpos;
! 	int4 poslen;
! 	int4 curlen;
! 	int2 in;
! 	int2 excluded;
! } CoverPos;
! 
! static void 
! get_next_fragment(HeadlineParsedText *prs, int *startpos, int *endpos,
! 			int *curlen, int *poslen, int max_words)
! {
! 	int i;
! 	/* Objective: Generate a fragment of words between startpos and endpos 
! 	 * such that it has at most max_words and both ends have query words. 
! 	 * If startpos and endpos are the endpoints of the cover and the 
! 	 * cover has fewer words than max_words, then this function should 
! 	 * just return the cover 
! 	 */
! 	/* first move startpos to an item */
! 	for(i = *startpos; i <= *endpos; i++)
! 	{
! 		*startpos = i;
! 		if (prs->words[i].item && !prs->words[i].repeated)
! 			break;
! 	}
! 	/* cut endpos to have only max_words */
! 	*curlen = 0;
! 	*poslen = 0;
! 	for(i = *startpos; i <= *endpos && *curlen < max_words; i++) 
! 	{
! 		

Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-07-15 Thread Sushant Sinha
Attached are two patches for headline with fragments:

1. documentation
2. regression tests

-Sushant.

On Tue, 2008-07-15 at 13:29 +0400, Teodor Sigaev wrote:
  Attached a new patch that:
  
  1. fixes previous bug
  2. better handles the case when cover size is greater than the MaxWords. 
 
 Looks good, I'll make some tests with  real-world application.
 
  I have not yet added the regression tests. The regression test suite 
  seemed to be only ensuring that the function works. How many tests 
  should I be adding? Is there any other place that I need to add 
  different test cases for the function?
 
 Just add 3-5 selects to src/test/regress/sql/tsearch.sql checking basic 
 functionality and corner cases like
   - there are no covers in the text
   - a cover is too big
   - and so on
 
 Add some words to the documentation too, please.
 
 
Index: doc/src/sgml/textsearch.sgml
===
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/doc/src/sgml/textsearch.sgml,v
retrieving revision 1.44
diff -c -r1.44 textsearch.sgml
*** doc/src/sgml/textsearch.sgml	16 May 2008 16:31:01 -	1.44
--- doc/src/sgml/textsearch.sgml	16 Jul 2008 02:37:28 -
***
*** 1100,1105 
--- 1100,1117 
   </listitem>
   <listitem>
    <para>
+    <literal>MaxFragments</literal>: maximum number of text excerpts 
+    or fragments that match the query words. It also triggers a 
+    different headline generation function than the default one. This
+    function finds text fragments with as many query words as possible.
+    Each fragment will be of at most MaxWords and will not have words
+    of size less than or equal to ShortWord at the start or end of a 
+    fragment. If not all query words are found in the document, then
+    a single fragment of MinWords will be displayed.
+   </para>
+  </listitem>
+  <listitem>
+   <para>
     <literal>HighlightAll</literal>: Boolean flag; if
     <literal>true</literal> the whole document will be highlighted.
    </para>
***
*** 1109,1115 
   Any unspecified options receive these defaults:
  
  <programlisting>
! StartSel=&lt;b&gt;, StopSel=&lt;/b&gt;, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
  </programlisting>
    </para>
  
--- 1121,1127 ----
   Any unspecified options receive these defaults:
  
  <programlisting>
! StartSel=&lt;b&gt;, StopSel=&lt;/b&gt;, MaxFragments=0, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
  </programlisting>
    </para>
  
Index: src/test/regress/sql/tsearch.sql
===
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/test/regress/sql/tsearch.sql,v
retrieving revision 1.9
diff -c -r1.9 tsearch.sql
*** src/test/regress/sql/tsearch.sql	16 May 2008 16:31:02 -	1.9
--- src/test/regress/sql/tsearch.sql	16 Jul 2008 03:45:24 -
***
*** 208,213 
--- 208,253 
 </html>',
 to_tsquery('english', 'seafoo'), 'HighlightAll=true');
  
+ --Check if headline fragments work 
+ SELECT ts_headline('english', '
+ Day after day, day after day,
+   We stuck, nor breath nor motion,
+ As idle as a painted Ship
+   Upon a painted Ocean.
+ Water, water, every where
+   And all the boards did shrink;
+ Water, water, every where,
+   Nor any drop to drink.
+ S. T. Coleridge (1772-1834)
+ ', to_tsquery('english', 'ocean'), 'MaxFragments=1');
+ 
+ --Check if more than one fragment is displayed
+ SELECT ts_headline('english', '
+ Day after day, day after day,
+   We stuck, nor breath nor motion,
+ As idle as a painted Ship
+   Upon a painted Ocean.
+ Water, water, every where
+   And all the boards did shrink;
+ Water, water, every where,
+   Nor any drop to drink.
+ S. T. Coleridge (1772-1834)
+ ', to_tsquery('english', 'Coleridge & stuck'), 'MaxFragments=2');
+ 
+ --Fragments when not all query words are in the document
+ SELECT ts_headline('english', '
+ Day after day, day after day,
+   We stuck, nor breath nor motion,
+ As idle as a painted Ship
+   Upon a painted Ocean.
+ Water, water, every where
+   And all the boards did shrink;
+ Water, water, every where,
+   Nor any drop to drink.
+ S. T. Coleridge (1772-1834)
+ ', to_tsquery('english', 'ocean & seahorse'), 'MaxFragments=1');
+ 
+ 
  --Rewrite sub system
  
  CREATE TABLE test_tsquery (txtkeyword TEXT, txtsample TEXT);
Index: src/test/regress/expected/tsearch.out
===
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/test/regress/expected/tsearch.out,v
retrieving revision 1.14
diff -c -r1.14 tsearch.out
*** src/test/regress/expected/tsearch.out	16 May 2008 16:31:02 -	1.14
--- src/test/regress/expected/tsearch.out	16 Jul 2008 03:47:46 -
***
*** 632,637 
--- 632,705 
   </html>
  (1 row)
  
+ --Check if headline fragments work 
+ SELECT ts_headline('english', '
+ Day after day, day after day,
+   We 

Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-07-16 Thread Sushant Sinha
I will add test queries and their results for the corner cases in a
separate file. I guess the only thing I am confused about is what the
behavior of headline generation should be when the query items contain
words of size less than ShortWord. I guess the answer is to ignore the
ShortWord parameter, but let me know if the answer is any different.

-Sushant.
 
On Thu, 2008-07-17 at 02:53 +0400, Oleg Bartunov wrote:
 Sushant,
 
 first, please provide simple test queries which demonstrate correct behavior
 in the corner cases. This will help reviewers test your patch and
 help you make sure your new version is ok. For example:
 
 =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery);
                      ts_headline
 -------------------------------------------------------
  <b>1</b> 2 <b>3</b> 4 5 <b>1</b> 2 <b>3</b> <b>1</b>
 
 This select breaks your code:
 
 =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery,'maxfragments=2');
   ts_headline
 --------------
   ...  2 ...
 
 and so on 
 
 
 Oleg
 On Tue, 15 Jul 2008, Sushant Sinha wrote:
 
  Attached a new patch that:
 
  1. fixes previous bug
  2. better handles the case when cover size is greater than the MaxWords.
  Basically it divides a cover greater than MaxWords into fragments of
  MaxWords, resizes each such fragment so that each end of the fragment
  contains a query word and then evaluates best fragments based on number of
  query words in each fragment. In case of tie it picks up the smaller
  fragment. This allows more query words to be shown with multiple fragments
  in case a single cover is larger than the MaxWords.
 
  The resizing of a  fragment such that each end is a query word provides room
  for stretching both sides of the fragment. This (hopefully) better presents
  the context in which query words appear in the document. If a cover is
  smaller than MaxWords then the cover is treated as a fragment.
 
  Let me know if you have any more suggestions or anything is not clear.
 
  I have not yet added the regression tests. The regression test suite seemed
  to be only ensuring that the function works. How many tests should I be
  adding? Is there any other place that I need to add different test cases for
  the function?
 
  -Sushant.
 
 
  Nice. But it will be good to resolve following issues:
  1) Patch contains mistakes, I didn't investigate or carefully read it. Get
  http://www.sai.msu.su/~megera/postgres/fts/apod.dump.gz and load it in a db.
 
  Queries
  # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
  from apod where to_tsvector(body) @@ plainto_tsquery('black hole');
 
  and
 
  # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1')
  from apod;
 
  crash postgresql :(
 
  2) pls, include in your patch documentation and regression tests.
 
 
  Another change that I was thinking:
 
  Right now if cover size > max_words then I just cut the trailing words.
  Instead I was thinking that we should split the cover into more
  fragments such that each fragment contains a few query words. Then each
  fragment will not contain all query words but will show more occurrences
  of query words in the headline. I would  like to know what your opinion
  on this is.
 
 
  Agreed.
 
 
  --
  Teodor Sigaev   E-mail: [EMAIL PROTECTED]
WWW:
  http://www.sigaev.ru/
 
 
 
   Regards,
   Oleg
 _
 Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
 Sternberg Astronomical Institute, Moscow University, Russia
 Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
 phone: +007(495)939-16-83, +007(495)939-23-83




[HACKERS] small bug in hlCover

2008-07-16 Thread Sushant Sinha
I think there is a slight bug in the hlCover function in wparser_def.c.

If there is only one query item and it is the first word in the text,
then hlCover does not return any cover. This is evident in this example,
where ts_headline generates only min_words:

testdb=# select ts_headline('1 2 3 4 5 6 7 8 9 10','1'::tsquery,
'MinWords=5');
   ts_headline    
------------------
 <b>1</b> 2 3 4 5
(1 row)

The problem is that *q is initialized to 0, which is a legitimate value
for a cover. So I have attached a patch that fixes it, and after applying
the patch, here is the result:

testdb=# select ts_headline('1 2 3 4 5 6 7 8 9 10','1'::tsquery,
'MinWords=5');
         ts_headline          
------------------------------
 <b>1</b> 2 3 4 5 6 7 8 9 10
(1 row)

-Sushant.
Index: src/backend/tsearch/wparser_def.c
===
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.15
diff -c -r1.15 wparser_def.c
*** src/backend/tsearch/wparser_def.c	17 Jun 2008 16:09:06 -	1.15
--- src/backend/tsearch/wparser_def.c	17 Jul 2008 02:45:34 -
***
*** 1621,1627 
  	QueryItem  *item = GETQUERY(query);
  	int			pos = *p;
  
! 	*q = 0;
  	*p = 0x7fff;
  
  	for (j = 0; j < query->size; j++)
--- 1621,1627 
  	QueryItem  *item = GETQUERY(query);
  	int			pos = *p;
  
! 	*q = -1;
  	*p = 0x7fff;
  
  	for (j = 0; j < query->size; j++)
***
*** 1643,1649 
  		item++;
  	}
  
! 	if (*q == 0)
  		return false;
  
  	item = GETQUERY(query);
--- 1643,1649 
  		item++;
  	}
  
! 	if (*q < 0)
  		return false;
  
  	item = GETQUERY(query);



Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-07-17 Thread Sushant Sinha
Fixed some off-by-one errors pointed out by Oleg, and errors in excluding
overlapping fragments.
 
Also added test queries and updated the regression tests.

Let me know of any other changes that are needed.

-Sushant.



On Thu, 2008-07-17 at 03:28 +0400, Oleg Bartunov wrote:
 On Wed, 16 Jul 2008, Sushant Sinha wrote:
 
  I will add test queries and their results for the corner cases in a
  separate file. I guess the only thing I am confused about is what should
  be the behavior of headline generation when Query items have words of
  size less than ShortWord. I guess the answer is to ignore ShortWord
  parameter but let me know if the answer is any different.
 
 
 ShortWord applies to the headline text; it doesn't affect the words in the query,
 so you can't discard them from the query.
 
  -Sushant.
 
  On Thu, 2008-07-17 at 02:53 +0400, Oleg Bartunov wrote:
  Sushant,
 
  first, please provide simple test queries which demonstrate correct behavior
  in the corner cases. This will help reviewers test your patch and
  help you make sure your new version is ok. For example:
 
  =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery);
                       ts_headline
  -------------------------------------------------------
    <b>1</b> 2 <b>3</b> 4 5 <b>1</b> 2 <b>3</b> <b>1</b>
 
  This select breaks your code:
 
  =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery,'maxfragments=2');
    ts_headline
  --------------
    ...  2 ...
 
  and so on 
 
 
  Oleg
  On Tue, 15 Jul 2008, Sushant Sinha wrote:
 
  Attached a new patch that:
 
  1. fixes previous bug
  2. better handles the case when cover size is greater than the MaxWords.
  Basically it divides a cover greater than MaxWords into fragments of
  MaxWords, resizes each such fragment so that each end of the fragment
  contains a query word and then evaluates best fragments based on number of
  query words in each fragment. In case of tie it picks up the smaller
  fragment. This allows more query words to be shown with multiple fragments
  in case a single cover is larger than the MaxWords.
 
  The resizing of a  fragment such that each end is a query word provides 
  room
  for stretching both sides of the fragment. This (hopefully) better 
  presents
  the context in which query words appear in the document. If a cover is
  smaller than MaxWords then the cover is treated as a fragment.
 
  Let me know if you have any more suggestions or anything is not clear.
 
  I have not yet added the regression tests. The regression test suite 
  seemed
  to be only ensuring that the function works. How many tests should I be
  adding? Is there any other place that I need to add different test cases 
  for
  the function?
 
  -Sushant.
 
 
  Nice. But it will be good to resolve following issues:
  1) Patch contains mistakes, I didn't investigate or carefully read it. 
  Get
   http://www.sai.msu.su/~megera/postgres/fts/apod.dump.gz and load it in a db.
 
  Queries
  # select ts_headline(body, plainto_tsquery('black hole'), 
  'MaxFragments=1')
  from apod where to_tsvector(body) @@ plainto_tsquery('black hole');
 
  and
 
  # select ts_headline(body, plainto_tsquery('black hole'), 
  'MaxFragments=1')
  from apod;
 
  crash postgresql :(
 
  2) pls, include in your patch documentation and regression tests.
 
 
  Another change that I was thinking:
 
   Right now if cover size > max_words then I just cut the trailing words.
  Instead I was thinking that we should split the cover into more
  fragments such that each fragment contains a few query words. Then each
  fragment will not contain all query words but will show more occurrences
  of query words in the headline. I would  like to know what your opinion
  on this is.
 
 
  Agreed.
 
 
  --
  Teodor Sigaev   E-mail: [EMAIL PROTECTED]
WWW:
  http://www.sigaev.ru/
 
 
 
 Regards,
 Oleg
  _
  Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
  Sternberg Astronomical Institute, Moscow University, Russia
  Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
  phone: +007(495)939-16-83, +007(495)939-23-83
 
 
   Regards,
   Oleg
 _
 Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
 Sternberg Astronomical Institute, Moscow University, Russia
 Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
 phone: +007(495)939-16-83, +007(495)939-23-83
Index: src/backend/tsearch/wparser_def.c
===
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.15
diff -c -r1.15 wparser_def.c
*** src/backend/tsearch/wparser_def.c	17 Jun 2008 16:09:06 -	1.15
--- src/backend/tsearch/wparser_def.c	18 Jul 2008

Re: [HACKERS] phrase search

2008-07-18 Thread Sushant Sinha
I looked at query operators for tsquery and here are some of the new
query operators for position based queries. I am just proposing some
changes and the questions I have.

1. What is the meaning of such a query operator?

foo #5 bar - true if the document has word foo followed by bar at
the 5th position.

foo #<5 bar - true if the document has word foo followed by bar within
5 positions

foo #>5 bar - true if the document has word foo followed by bar after 5
positions

then some other ways it can be used are
!(foo #<5 bar) - true if the document never has any foo followed by bar
within 5 positions.

etc .
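To make the proposal concrete, here is a sketch of how such a query might
look in SQL. The #-style operators are hypothetical syntax from this
proposal only; they do not exist in tsquery today:

-- hypothetical syntax: true if 'bar' follows 'foo' within 5 positions
SELECT to_tsvector('english', 'foo one two bar')
       @@ to_tsquery('english', 'foo #<5 bar');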

2. How to implement such query operators?

Should we modify QueryItem to include additional distance information or
is there any other way to accomplish it?

Is the following list sufficient to accomplish this?
a. Modify to_tsquery
b. Modify TS_execute in tsvector_op.c to check new operator

Is there anything needed in rewrite subsystem?

3. Are these valid uses of the operators and if yes what would they
mean?

foo #5 (bar & cup)

If no then should the operator be applied to only two QI_VAL's?

4. If the operator only applies to two query items, can we create an
index such that (foo, bar) -> documents[min distance, max distance]?
How difficult is it to implement an index like this?


Thanks,
-Sushant.

On Thu, 2008-06-05 at 19:37 +0400, Teodor Sigaev wrote:
  I can add index support and support for arbitrary distance between
  lexeme. 
  It appears to me that supporting arbitrary boolean expression will be
  complicated. Can we pull out something from TSQuery?
 
 I don't much like the idea of a separate interface for phrase search. Your 
 patch could be a module used by people who really want to have phrase 
 search.
 
 Introducing a new operator in tsquery allows reusing the already existing 
 infrastructure of tsquery, such as concatenations (&&, ||, !!), the rewrite 
 subsystem, etc. But new operations/types specially designed for phrase 
 search would require doing that work again.
 




Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-07-23 Thread Sushant Sinha
I guess it is more readable to add the cover separator at the end of a fragment
than at the front. Let me know what you think and I can update it.

I think the right place for the cover separator is in the structure
HeadlineParsedText, just like startsel and stopsel. This will enable users to
specify their own cover separators, but it will require changes to the
structure as well as to the generateHeadline function. This option will also
not play well with the default headline generation function.

The default MaxWords = 35 seems a bit high for this headline generation
function and 20 seems to be more reasonable. Any thoughts?

-Sushant.

On Wed, Jul 23, 2008 at 7:44 AM, Oleg Bartunov [EMAIL PROTECTED] wrote:

 btw, is it intentional to have '...' in the headline?

 =# select ts_headline('1 2 3 4 5 1 2 3 1','1&4'::tsquery,'MaxFragments=1');
        ts_headline
 -------------------------
  ... <b>4</b> 5 <b>1</b>



 Oleg

 On Wed, 23 Jul 2008, Teodor Sigaev wrote:

  Let me know of any other changes that are needed.


 Looks like ready to commit, but documentation is needed.



Regards,
Oleg
 _
 Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
 Sternberg Astronomical Institute, Moscow University, Russia
 Internet: [EMAIL PROTECTED], 
 http://www.sai.msu.su/~megera/http://www.sai.msu.su/%7Emegera/
 phone: +007(495)939-16-83, +007(495)939-23-83



Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2008-08-02 Thread Sushant Sinha
Sorry for the delay. Here is the patch with the FragmentDelimiter option.
It requires an extra field in HeadlineParsedText and uses that option
during generateHeadline.

Implementing the notion of fragments in HeadlineParsedText and a separate
function to join them seems more complicated, so for the time being I
just emit a FragmentDelimiter whenever a new fragment (other than the
first one) starts.

The patch also contains the updated regression tests/results and also a
new test for FragmentDelimiter option. It also contains the
documentation for the new options.
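For example, with the patch applied, something along these lines should join
the two selected fragments with the given delimiter (a sketch; the regression
tests in the patch contain the exact expected output):

SELECT ts_headline('1 2 3 4 5 1 2 3 1', '1&3'::tsquery,
                   'MaxFragments=2, FragmentDelimiter=" ... "');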

I have also attached a separate file that tests different aspects of the
new headline generation function.

Let me know if anything else is needed.

-Sushant.

On Thu, 2008-07-24 at 00:28 +0400, Oleg Bartunov wrote:
 On Wed, 23 Jul 2008, Sushant Sinha wrote:
 
  I guess it is more readable to add cover separator at the end of a fragment
  than in the front. Let me know what you think and I can update it.
 
 FragmentsDelimiter should *separate* fragments and that says all. 
 Not very difficult algorithmic problem, it's like  perl's
 join(FragmentsDelimiter, @array)
 
 
  I think the right place for cover separator is in the structure
  HeadlineParsedText just like startsel and stopsel. This will enable users to
  specify their own cover separators. But this will require changes to the
  structure as well as to the generateHeadline function. This option will not
  also play well with the default headline generation function.
 
 As soon as we introduce FragmentsDelimiter we should make it
 configurable.
 
 
  The default MaxWords = 35 seems a bit high for this headline generation
  function and 20 seems to be more reasonable. Any thoughts?
 
 I think we should not change default value because it could change
 behaviour of existing applications. I'm not sure if it'd be useful and
 possible to define default values in CREATE TEXT SEARCH PARSER
 
 
  -Sushant.
 
  On Wed, Jul 23, 2008 at 7:44 AM, Oleg Bartunov [EMAIL PROTECTED] wrote:
 
  btw, is it intentional to have '...' in the headline?
 
  =# select ts_headline('1 2 3 4 5 1 2 3 1','1&4'::tsquery,'MaxFragments=1');
         ts_headline
  -------------------------
   ... <b>4</b> 5 <b>1</b>
 
 
 
  Oleg
 
  On Wed, 23 Jul 2008, Teodor Sigaev wrote:
 
   Let me know of any other changes that are needed.
 
 
  Looks like ready to commit, but documentation is needed.
 
 
 
 Regards,
 Oleg
  _
  Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
  Sternberg Astronomical Institute, Moscow University, Russia
  Internet: [EMAIL PROTECTED], 
  http://www.sai.msu.su/~megera/http://www.sai.msu.su/%7Emegera/
  phone: +007(495)939-16-83, +007(495)939-23-83
 
 
 
   Regards,
   Oleg
 _
 Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
 Sternberg Astronomical Institute, Moscow University, Russia
 Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
 phone: +007(495)939-16-83, +007(495)939-23-83
Index: src/include/tsearch/ts_public.h
===
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/include/tsearch/ts_public.h,v
retrieving revision 1.10
diff -c -r1.10 ts_public.h
*** src/include/tsearch/ts_public.h	18 Jun 2008 18:42:54 -	1.10
--- src/include/tsearch/ts_public.h	2 Aug 2008 02:40:27 -
***************
*** 52,59 ****
--- 52,61 ----
  	int4		curwords;
  	char	   *startsel;
  	char	   *stopsel;
+ 	char	   *fragdelim;
  	int2		startsellen;
  	int2		stopsellen;
+ 	int2		fragdelimlen; 
  } HeadlineParsedText;
  
  /*
Index: src/backend/tsearch/wparser_def.c
===
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.15
diff -c -r1.15 wparser_def.c
*** src/backend/tsearch/wparser_def.c	17 Jun 2008 16:09:06 -	1.15
--- src/backend/tsearch/wparser_def.c	2 Aug 2008 15:25:46 -
***************
*** 1684,1701 ****
  	return false;
  }
  
! Datum
! prsd_headline(PG_FUNCTION_ARGS)
  {
! 	HeadlineParsedText *prs = (HeadlineParsedText *) PG_GETARG_POINTER(0);
! 	List	   *prsoptions = (List *) PG_GETARG_POINTER(1);
! 	TSQuery		query = PG_GETARG_TSQUERY(2);
  
! 	/* from opt + start and end tag */
! 	int			min_words = 15;
! 	int			max_words = 35;
! 	int			shortword = 3;
  
  	int			p = 0,
  				q = 0;
  	int			bestb = -1,
--- 1684,1930 ----
  	return false;
  }
  
! static void 
! mark_fragment(HeadlineParsedText *prs, int highlight, int startpos, int endpos)
  {
! 	int   i;
  
! 	for (i = startpos; i <= endpos; i++)
! 	{
! 		if (prs->words[i].item)
! 			prs->words[i].selected = 1;
! 		if (highlight == 0)
! 		{
! 			if (HLIDIGNORE(prs->words[i].type))
! 				prs->words[i].replace = 1;
! 		}
! 		else
! 		{
! 			if (XMLHLIDIGNORE(prs->words[i].type

Re: [HACKERS] small bug in hlCover

2008-08-03 Thread Sushant Sinha
Has anyone noticed this?

-Sushant.

On Wed, 2008-07-16 at 23:01 -0400, Sushant Sinha wrote:
 I think there is a slight bug in hlCover function in wparser_def.c
 
 If there is only one query item and that is the first word in the text,
 then hlCover does not return any cover. This is evident in this example
 when ts_headline only generates the min_words:
 
 testdb=# select ts_headline('1 2 3 4 5 6 7 8 9 10','1'::tsquery,
 'MinWords=5');
     ts_headline
  -------------------
   <b>1</b> 2 3 4 5
 (1 row)
 
 The problem is that *q is initialized to 0 which is a legitimate value
 for a cover. So I have attached a patch that fixes it and after applying
 the patch here is the result.
 
 testdb=# select ts_headline('1 2 3 4 5 6 7 8 9 10','1'::tsquery,
 'MinWords=5');
          ts_headline
  ------------------------------
   <b>1</b> 2 3 4 5 6 7 8 9 10
 (1 row)
 
 -Sushant.




Re: [HACKERS] small bug in hlCover

2008-08-03 Thread Sushant Sinha
On Mon, 2008-08-04 at 00:36 -0300, Euler Taveira de Oliveira wrote:
 Sushant Sinha escreveu:
  I think there is a slight bug in hlCover function in wparser_def.c
  
 The bug is not in hlCover. In prsd_headline, if we didn't find a 
 suitable bestlen (i.e. <= 0), then it includes up to the document length or 
 *maxWords* (here is the bug). I'm attaching a small patch that fixes it 
 and some comment typos. Please apply it to 8_3_STABLE too.

Well, hlCover's purpose is to find a cover, and for the document '1 2 3 4 5
6 7 8 9 10' and the query '1'::tsquery, a cover exists. So it should
point it out.

On my source I see that prsd_headline marks only min_words which seems
like the right thing to do.

-Sushant.

 
 euler=# select ts_headline('1 2 3 4 5 6 7 8 9 10','1'::tsquery, 
 'MinWords=5');
          ts_headline
 ------------------------------
   <b>1</b> 2 3 4 5 6 7 8 9 10
 (1 registro)
 
 euler=# select ts_headline('1 2 3 4 5 6 7 8 9 10','1'::tsquery);
          ts_headline
 ------------------------------
   <b>1</b> 2 3 4 5 6 7 8 9 10
 (1 registro)
 
 




[HACKERS] english parser in text search: support for multiple words in the same position

2010-08-01 Thread Sushant Sinha
Currently the english parser in text search does not support multiple
words in the same position. Consider a word like wikipedia.org. The text
search would return a single token wikipedia.org. However, if someone
searches for wikipedia org then there will not be a match. There are
two problems here:

1. We do not have the separate tokens wikipedia and org.
2. If we have the two tokens, we should have them at adjacent positions so
that a phrase search for wikipedia org works.

It would be nice to have the following tokenization and positioning for
wikipedia.org:

position 0: WORD(wikipedia), URL(wikipedia.org)
position 1: WORD(org)

Take the example of wikipedia.org/search?q=sushant

Here is the TSVECTOR:

select to_tsvector('english', 'wikipedia.org/search?q=sushant');

to_tsvector 

'/search?q=sushant':3 'wikipedia.org':2
'wikipedia.org/search?q=sushant':1

And here are the tokens:

select ts_debug('english', 'wikipedia.org/search?q=sushant');

                                 ts_debug
---------------------------------------------------------------------------
 (url,URL,wikipedia.org/search?q=sushant,{simple},simple,{wikipedia.org/search?q=sushant})
 (host,Host,wikipedia.org,{simple},simple,{wikipedia.org})
 (url_path,"URL path",/search?q=sushant,{simple},simple,{/search?q=sushant})

The tokenization I would like to see is:

position 0: WORD(wikipedia), URL(wikipedia.org/search?q=sushant)
position 1: WORD(org)
position 2: WORD(search), URL_PATH(search/?q=sushant)
position 3: WORD(q), URL_QUERY(q=search)
position 4: WORD(sushant)
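Note that the tsvector type itself can already represent several lexemes at
the same position, which is exactly the layout proposed above. For example, a
hand-constructed literal (just to show the layout):

-- two lexemes sharing position 1, written by hand:
SELECT $$wikipedia:1 'wikipedia.org':1 org:2$$::tsvector;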

So what we need is to support multiple tokens at the same position, and
I need help in understanding how to realize this. Currently the position
assignment happens in make_tsvector by walking over the parsed lexemes. The
lexemes are obtained via prsd_nexttoken.

However, prsd_nexttoken only returns a single token. Will it be possible
to store some tokens and return them together? Or can we put a flag on
certain tokens that says the position should not be increased?

-Sushant.





Re: [HACKERS] english parser in text search: support for multiple words in the same position

2010-08-02 Thread Sushant Sinha
 On 08/01/2010 08:04 PM, Sushant Sinha wrote:
  1. We do not have separate tokens wikipedia and org
  2. If we have the two tokens we should have them at adjacent position so
  that a phrase search for wikipedia org should work.
 
 This would needlessly increase the number of tokens. Instead you'd 
 better make it work like compound word support, having just wikipedia 
 and org as tokens.

The current text parser already returns url and url_path. That already
increases the number of unique tokens. I am only asking for adding the
normal english words as well, so that if someone types only wikipedia
he gets a match.

 
 Searching for wikipedia.org or wikipedia org should then result in 
 the same search query with the two tokens: wikipedia and org.

Earlier people have expressed the need to index urls/emails and
currently the text parser already does so. Reverting that would be a
regression of functionality. Further, a ranking function can take
advantage of a direct match of a token.

  position 0: WORD(wikipedia), URL(wikipedia.org/search?q=sushant)
 
 IMO the differentiation between WORDs and URLs is not something the text 
 search engine should have to care about a lot. Let it just do the 
 searching and make it do that well.

The Postgres english parser already emits urls as tokens. The only thing I am
asking for is improving the tokenization and positioning.

 What does a token wikipedia.org/search?q=sushant buy you in terms of 
 text searching? Or even result highlighting? I wouldn't expect anybody 
 to want to search for a full URL, do you?

There has been a need expressed in the past. And an exact token match can
result in better ranking functions. For example, a tf-idf ranking will
rank matches of such unique tokens significantly higher.

-Sushant.

 Regards
 
 Markus Wanner





Re: [HACKERS] english parser in text search: support for multiple words in the same position

2010-08-02 Thread Sushant Sinha
On Mon, 2010-08-02 at 09:32 -0400, Robert Haas wrote:
 On Mon, Aug 2, 2010 at 9:12 AM, Sushant Sinha sushant...@gmail.com wrote:
  The current text parser already returns url and url_path. That already
  increases the number of unique tokens. I am only asking for adding of
  normal english words as well so that if someone types only wikipedia
  he gets a match.
 [...]
  Postgres english parser already emits urls as tokens. Only thing I am
  asking is on improving the tokenization and positioning.
 
 Can you write a patch to implement your idea?
 

Yes, that's what I am planning to do. I just wanted to see if anyone can
help me estimate whether this is doable in the current parser or whether I
need to write a new one. If possible, I'd also like some idea of how to go
about implementing it.

-Sushant.




Re: [HACKERS] english parser in text search: support for multiple words in the same position

2010-09-01 Thread Sushant Sinha
I have attached a patch that emits parts of a host token, a url token,
an email token and a file token. Further, it makes sure that a
host/url/email/file token and the first part-token are at the same
position in tsvector.

The two major changes are:

1. Tokenization changes: The patch exploits the special handlers in the
text parser to reset the parser position to the start of a
host/url/email/file token when it finds one. Special handlers were
already used for extracting host and urlpath from a full url. So this is
more of an extension of the same idea.

2. Position changes: We do not advance position when we encounter a
host/url/email/file token. As a result the first part of that token
aligns with the token itself.

Attachments:

tokens_output.txt: sample queries and results with the patch
token_v1.patch:patch wrt cvs head

Currently, the patch outputs parts of the tokens as normal tokens like
WORD, NUMWORD, etc. Tom argued earlier that this will break
backward compatibility and so they should be output as parts of the
respective tokens. If there is agreement on what Tom says, then the
current patch can be modified to output subtokens as parts. However,
before I complicate the patch with that, I wanted to get feedback on any
other major problems with the patch.

-Sushant.

On Mon, 2010-08-02 at 10:20 -0400, Tom Lane wrote:
 Sushant Sinha sushant...@gmail.com writes:
  This would needlessly increase the number of tokens. Instead you'd 
  better make it work like compound word support, having just wikipedia 
  and org as tokens.
 
  The current text parser already returns url and url_path. That already
  increases the number of unique tokens. I am only asking for adding of
  normal english words as well so that if someone types only wikipedia
  he gets a match. 
 
 The suggestion to make it work like compound words is still a good one,
 ie given wikipedia.org you'd get back
 
   host        wikipedia.org
   host-part   wikipedia
   host-part   org
 
 not just the host token as at present.
 
 Then the user could decide whether he needed to index hostname
 components or not, by choosing whether to forward hostname-part
 tokens to a dictionary or just discard them.
 
 If you submit a patch that tries to force the issue by classifying
 hostname parts as plain words, it'll probably get rejected out of
 hand on backwards-compatibility grounds.
 
   regards, tom lane

1. FILEPATH

testdb=# SELECT ts_debug('/stuff/index.html');
                                     ts_debug
----------------------------------------------------------------------------------
 (file,"File or path name",/stuff/index.html,{simple},simple,{/stuff/index.html})
 (blank,"Space symbols",/,{},,)
 (asciiword,"Word, all ASCII",stuff,{english_stem},english_stem,{stuff})
 (blank,"Space symbols",/,{},,)
 (asciiword,"Word, all ASCII",index,{english_stem},english_stem,{index})
 (blank,"Space symbols",.,{},,)
 (asciiword,"Word, all ASCII",html,{english_stem},english_stem,{html})


SELECT to_tsvector('english', '/stuff/index.html');
                     to_tsvector
-----------------------------------------------------
 '/stuff/index.html':0 'html':2 'index':1 'stuff':0
(1 row)

2. URL

testdb=# SELECT ts_debug('http://example.com/stuff/index.html');
                                        ts_debug
----------------------------------------------------------------------------------------
 (protocol,"Protocol head",http://,{},,)
 (url,URL,example.com/stuff/index.html,{simple},simple,{example.com/stuff/index.html})
 (host,Host,example.com,{simple},simple,{example.com})
 (asciiword,"Word, all ASCII",example,{english_stem},english_stem,{exampl})
 (blank,"Space symbols",.,{},,)
 (asciiword,"Word, all ASCII",com,{english_stem},english_stem,{com})
 (url_path,"URL path",/stuff/index.html,{simple},simple,{/stuff/index.html})
 (blank,"Space symbols",/,{},,)
 (asciiword,"Word, all ASCII",stuff,{english_stem},english_stem,{stuff})
 (blank,"Space symbols",/,{},,)
 (asciiword,"Word, all ASCII",index,{english_stem},english_stem,{index})
 (blank,"Space symbols",.,{},,)
 (asciiword,"Word, all ASCII",html,{english_stem},english_stem,{html})
(13 rows)

testdb=# SELECT to_tsvector('english', 'http://example.com/stuff/index.html');
                                             to_tsvector
------------------------------------------------------------------------------------------------------
 '/stuff/index.html':2 'com':1 'exampl':0 'example.com':0 'example.com/stuff/index.html':0 'html':4 'index':3 'stuff':2

3. EMAIL

testdb=# SELECT ts_debug('sush...@foo.bar');
                         ts_debug
---------------------------------------------------------
 (email,"Email address",sush...@foo.bar,{simple},simple,{sush

Re: [HACKERS] english parser in text search: support for multiple words in the same position

2010-09-04 Thread Sushant Sinha
Updating the patch to emit parttoken and to register it with the
snowball config.

-Sushant.

On Fri, 2010-09-03 at 09:44 -0400, Robert Haas wrote:
 On Wed, Sep 1, 2010 at 2:42 AM, Sushant Sinha sushant...@gmail.com wrote:
  I have attached a patch that emits parts of a host token, a url token,
  an email token and a file token. Further, it makes sure that a
  host/url/email/file token and the first part-token are at the same
  position in tsvector.
 
 You should probably add this patch here:
 
 https://commitfest.postgresql.org/action/commitfest_view/open
 

Index: src/backend/snowball/snowball.sql.in
===
RCS file: /projects/cvsroot/pgsql/src/backend/snowball/snowball.sql.in,v
retrieving revision 1.6
diff -u -r1.6 snowball.sql.in
--- src/backend/snowball/snowball.sql.in	27 Oct 2007 16:01:08 -	1.6
+++ src/backend/snowball/snowball.sql.in	4 Sep 2010 02:59:10 -
@@ -22,6 +22,6 @@
 	WITH _ASCDICTNAME_;
 
 ALTER TEXT SEARCH CONFIGURATION _CFGNAME_ ADD MAPPING
-FOR word, hword_part, hword
+FOR word, hword_part, hword, parttoken
 	WITH _NONASCDICTNAME_;
 
Index: src/backend/tsearch/ts_parse.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/ts_parse.c,v
retrieving revision 1.17
diff -u -r1.17 ts_parse.c
--- src/backend/tsearch/ts_parse.c	26 Feb 2010 02:01:05 -	1.17
+++ src/backend/tsearch/ts_parse.c	4 Sep 2010 02:59:11 -
@@ -19,7 +19,7 @@
 #include tsearch/ts_utils.h
 
 #define IGNORE_LONGLEXEME	1
-
+#define COMPLEX_TOKEN(x) ( x == 4 || x == 5 || x == 6 || x == 18 || x == 17 || x == 18 || x == 19)   
 /*
  * Lexize subsystem
  */
@@ -407,8 +407,6 @@
 		{
 			TSLexeme   *ptr = norms;
 
-			prs->pos++;			/* set pos */
-
 			while (ptr->lexeme)
 			{
 				if (prs->curwords == prs->lenwords)
@@ -429,6 +427,10 @@
 				prs->curwords++;
 			}
 			pfree(norms);
+
+			if (!COMPLEX_TOKEN(type)) 
+				prs->pos++;			/* set pos */
+
 		}
 	} while (type > 0);
 
Index: src/backend/tsearch/wparser_def.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.33
diff -u -r1.33 wparser_def.c
--- src/backend/tsearch/wparser_def.c	19 Aug 2010 05:57:34 -	1.33
+++ src/backend/tsearch/wparser_def.c	4 Sep 2010 02:59:12 -
@@ -23,7 +23,7 @@
 
 
 /* Define me to enable tracing of parser behavior */
-/* #define WPARSER_TRACE */
+//#define WPARSER_TRACE 
 
 
 /* Output token categories */
@@ -51,8 +51,9 @@
 #define SIGNEDINT		21
 #define UNSIGNEDINT		22
 #define XMLENTITY		23
+#define PARTTOKEN		24
 
-#define LASTNUM			23
+#define LASTNUM			24
 
 static const char *const tok_alias[] = {
 	"",
@@ -78,7 +79,8 @@
 	"float",
 	"int",
 	"uint",
-	"entity"
+	"entity",
+	"parttoken"
 };
 
 static const char *const lex_descr[] = {
@@ -105,7 +107,8 @@
 	"Decimal notation",
 	"Signed integer",
 	"Unsigned integer",
-	"XML entity"
+	"XML entity",
+	"Part of file/url/host/email"
 };
 
 
@@ -249,7 +252,8 @@
 	TParserPosition *state;
 	bool		ignore;
 	bool		wanthost;
-
+	int 		partstop;
+	TParserState	afterpart;
 	/* silly char */
 	char		c;
 
@@ -617,8 +621,41 @@
 	}
 	return 1;
 }
+static int
+p_ispartbingo(TParser *prs)
+{
+	int ret = 0;
+	if (prs->partstop > 0)
+	{
+		ret = 1;
+		if (prs->partstop <= prs->state->posbyte)
+		{
+			prs->state->state = prs->afterpart;
+			prs->partstop = 0;
+		}
+		else
+			prs->state->state = TPS_Base;
+	}
+	return ret;
+}
 
+static int
+p_ispart(TParser *prs)
+{
+	if (prs->partstop > 0)
+		return 1;
+	else
+		return 0;
+}
 
+static int
+p_ispartEOF(TParser *prs)
+{
+	if (p_ispart(prs) && p_isEOF(prs))
+		return 1;
+	else
+		return 0;
+}
 /* deliberately suppress unused-function complaints for the above */
 void		_make_compiler_happy(void);
 void
@@ -688,6 +725,21 @@
 }
 
 static void
+SpecialPart(TParser *prs)
+{
+	prs->partstop = prs->state->posbyte;
+	prs->state->posbyte -= prs->state->lenbytetoken;
+	prs->state->poschar -= prs->state->lenchartoken;
+	prs->afterpart = TPS_Base;
+}
+static void
+SpecialUrlPart(TParser *prs)
+{
+	SpecialPart(prs);
+	prs->afterpart = TPS_InURLPathStart;
+}
+
+static void
 SpecialVerVersion(TParser *prs)
 {
 	prs->state->posbyte -= prs->state->lenbytetoken;
@@ -1057,6 +1109,7 @@
 	{p_iseqC, '-', A_PUSH, TPS_InSignedIntFirst, 0, NULL},
 	{p_iseqC, '+', A_PUSH, TPS_InSignedIntFirst, 0, NULL},
 	{p_iseqC, '&', A_PUSH, TPS_InXMLEntityFirst, 0, NULL},
+	{p_ispart, 0, A_NEXT, TPS_InSpace, 0, NULL},
 	{p_iseqC, '~', A_PUSH, TPS_InFileTwiddle, 0, NULL},
 	{p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
 	{p_iseqC, '.', A_PUSH, TPS_InPathFirstFirst, 0, NULL},
@@ -1065,9 +1118,11 @@
 
 
 static const TParserStateActionItem actionTPS_InNumWord[] = {
+	{p_ispartEOF, 0, A_BINGO, TPS_Null, PARTTOKEN, NULL},
 	{p_isEOF, 0, A_BINGO, TPS_Base, NUMWORD, NULL},
 	{p_isalnum, 0, A_NEXT, TPS_InNumWord, 0, NULL},
 	{p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
+	{p_ispartbingo, 0, A_BINGO

Re: [HACKERS] english parser in text search: support for multiple words in the same position

2010-09-08 Thread Sushant Sinha
For the headline generation to work properly, email/file/url/host need
to become skip tokens. Updating the patch with that change.

-Sushant.

On Sat, 2010-09-04 at 13:25 +0530, Sushant Sinha wrote:
 Updating the patch with emitting parttoken and registering it with
 snowball config.
 
 -Sushant.
 
 On Fri, 2010-09-03 at 09:44 -0400, Robert Haas wrote:
  On Wed, Sep 1, 2010 at 2:42 AM, Sushant Sinha sushant...@gmail.com wrote:
   I have attached a patch that emits parts of a host token, a url token,
   an email token and a file token. Further, it makes sure that a
   host/url/email/file token and the first part-token are at the same
   position in tsvector.
  
  You should probably add this patch here:
  
  https://commitfest.postgresql.org/action/commitfest_view/open
  
 

Index: src/backend/snowball/snowball.sql.in
===
RCS file: /projects/cvsroot/pgsql/src/backend/snowball/snowball.sql.in,v
retrieving revision 1.6
diff -u -r1.6 snowball.sql.in
--- src/backend/snowball/snowball.sql.in	27 Oct 2007 16:01:08 -	1.6
+++ src/backend/snowball/snowball.sql.in	7 Sep 2010 01:46:55 -
@@ -22,6 +22,6 @@
 	WITH _ASCDICTNAME_;
 
 ALTER TEXT SEARCH CONFIGURATION _CFGNAME_ ADD MAPPING
-FOR word, hword_part, hword
+FOR word, hword_part, hword, parttoken
 	WITH _NONASCDICTNAME_;
 
Index: src/backend/tsearch/ts_parse.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/ts_parse.c,v
retrieving revision 1.17
diff -u -r1.17 ts_parse.c
--- src/backend/tsearch/ts_parse.c	26 Feb 2010 02:01:05 -	1.17
+++ src/backend/tsearch/ts_parse.c	7 Sep 2010 01:46:55 -
@@ -19,7 +19,7 @@
 #include tsearch/ts_utils.h
 
 #define IGNORE_LONGLEXEME	1
-
+#define COMPLEX_TOKEN(x) ( x == 4 || x == 5 || x == 6 || x == 18 || x == 17 || x == 18 || x == 19)   
 /*
  * Lexize subsystem
  */
@@ -407,8 +407,6 @@
 		{
 			TSLexeme   *ptr = norms;
 
-			prs->pos++;			/* set pos */
-
 			while (ptr->lexeme)
 			{
 				if (prs->curwords == prs->lenwords)
@@ -429,6 +427,10 @@
 				prs->curwords++;
 			}
 			pfree(norms);
+
+			if (!COMPLEX_TOKEN(type)) 
+				prs->pos++;			/* set pos */
+
 		}
 	} while (type > 0);
 
Index: src/backend/tsearch/wparser_def.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.33
diff -u -r1.33 wparser_def.c
--- src/backend/tsearch/wparser_def.c	19 Aug 2010 05:57:34 -	1.33
+++ src/backend/tsearch/wparser_def.c	7 Sep 2010 01:46:56 -
@@ -23,7 +23,7 @@
 
 
 /* Define me to enable tracing of parser behavior */
-/* #define WPARSER_TRACE */
+//#define WPARSER_TRACE 
 
 
 /* Output token categories */
@@ -51,8 +51,9 @@
 #define SIGNEDINT		21
 #define UNSIGNEDINT		22
 #define XMLENTITY		23
+#define PARTTOKEN		24
 
-#define LASTNUM			23
+#define LASTNUM			24
 
 static const char *const tok_alias[] = {
 	"",
@@ -78,7 +79,8 @@
 	"float",
 	"int",
 	"uint",
-	"entity"
+	"entity",
+	"parttoken"
 };
 
 static const char *const lex_descr[] = {
@@ -105,7 +107,8 @@
 	"Decimal notation",
 	"Signed integer",
 	"Unsigned integer",
-	"XML entity"
+	"XML entity",
+	"Part of file/url/host/email"
 };
 
 
@@ -249,7 +252,8 @@
 	TParserPosition *state;
 	bool		ignore;
 	bool		wanthost;
-
+	int 		partstop;
+	TParserState	afterpart;
 	/* silly char */
 	char		c;
 
@@ -617,8 +621,41 @@
 	}
 	return 1;
 }
+static int
+p_ispartbingo(TParser *prs)
+{
+	int ret = 0;
+	if (prs->partstop > 0)
+	{
+		ret = 1;
+		if (prs->partstop <= prs->state->posbyte)
+		{
+			prs->state->state = prs->afterpart;
+			prs->partstop = 0;
+		}
+		else
+			prs->state->state = TPS_Base;
+	}
+	return ret;
+}
 
+static int
+p_ispart(TParser *prs)
+{
+	if (prs->partstop > 0)
+		return 1;
+	else
+		return 0;
+}
 
+static int
+p_ispartEOF(TParser *prs)
+{
+	if (p_ispart(prs) && p_isEOF(prs))
+		return 1;
+	else
+		return 0;
+}
 /* deliberately suppress unused-function complaints for the above */
 void		_make_compiler_happy(void);
 void
@@ -688,6 +725,21 @@
 }
 
 static void
+SpecialPart(TParser *prs)
+{
+	prs->partstop = prs->state->posbyte;
+	prs->state->posbyte -= prs->state->lenbytetoken;
+	prs->state->poschar -= prs->state->lenchartoken;
+	prs->afterpart = TPS_Base;
+}
+static void
+SpecialUrlPart(TParser *prs)
+{
+	SpecialPart(prs);
+	prs->afterpart = TPS_InURLPathStart;
+}
+
+static void
 SpecialVerVersion(TParser *prs)
 {
 	prs->state->posbyte -= prs->state->lenbytetoken;
@@ -1057,6 +1109,7 @@
 	{p_iseqC, '-', A_PUSH, TPS_InSignedIntFirst, 0, NULL},
 	{p_iseqC, '+', A_PUSH, TPS_InSignedIntFirst, 0, NULL},
 	{p_iseqC, '&', A_PUSH, TPS_InXMLEntityFirst, 0, NULL},
+	{p_ispart, 0, A_NEXT, TPS_InSpace, 0, NULL},
 	{p_iseqC, '~', A_PUSH, TPS_InFileTwiddle, 0, NULL},
 	{p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
 	{p_iseqC, '.', A_PUSH, TPS_InPathFirstFirst, 0, NULL},
@@ -1065,9 +1118,11 @@
 
 
 static const TParserStateActionItem actionTPS_InNumWord[] = {
+	{p_ispartEOF, 0

Re: [HACKERS] text search patch status update?

2009-01-07 Thread Sushant Sinha
The default headline generation function is complicated. It checks a lot
of cases to determine the best headline to be displayed. So Heikki's
examples just show that the headline generation function may not be very
intuitive. However, his examples were not affected by the bug.

Because of the bug, hlCover was not returning a cover when the query
item was the first lexeme in the text, and so the headline generation
function would return just MinWords rather than the actual headline as
per the logic.

After the patch you will see the difference in the example:

http://archives.postgresql.org/pgsql-hackers/2008-07/msg00785.php
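For reference, the example from that message (before the fix, a match on the
first word yielded only MinWords; after the fix the whole cover is returned):

SELECT ts_headline('1 2 3 4 5 6 7 8 9 10', '1'::tsquery, 'MinWords=5');
--  buggy:  <b>1</b> 2 3 4 5
--  fixed:  <b>1</b> 2 3 4 5 6 7 8 9 10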

-Sushant.

On Wed, 2009-01-07 at 20:50 -0500, Bruce Momjian wrote:
 Uh, where are we on this?  I see the same output in CVS HEAD as Heikki,
 and I assume he thought at least one of them was wrong.  ;-)
 
 ---
 
 Heikki Linnakangas wrote:
  Sushant Sinha wrote:
   Patch #2. I think this is a straightforward bug fix.
  
  Yes, I think you're right. In hlCover(), *q is 0 when the only match is 
  the first item in the text, and we shouldn't bail out with return 
  false in that case.
  
  But there seems to be something else going on here as well:
  
  postgres=# select ts_headline('1 2 3 4 5', '2'::tsquery, 'MinWords=2, 
  MaxWords=3');
    ts_headline
  ----------------
    <b>2</b> 3 4
  (1 row)
  
  postgres=# select ts_headline('aaa1 aaa2 aaa3 aaa4 
  aaa5','aaa2'::tsquery, 'MinWords=2, MaxWords=3');
      ts_headline
  -------------------
    <b>aaa2</b> aaa3
  (1 row)
  
  In the first example, you get three words, and in the 2nd, just two. It 
  must be because of the default ShortWord setting of 3. Also, if only the 
  last word matches, and it's a short word, you get the whole text:
  
  postgres=# select ts_headline('1 2 3 4 5','5'::tsquery, 'MinWords=2, 
  MaxWords=3');
     ts_headline
  -------------------
    1 2 3 4 <b>5</b>
  (1 row)
  
  -- 
 Heikki Linnakangas
 EnterpriseDB   http://www.enterprisedb.com
  
  -- 
  Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
  To make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-hackers
 




[HACKERS] possible bug in cover density ranking?

2009-01-28 Thread Sushant Sinha
I am running postgres 8.3.1. In tsrank.c I am looking at the cover
density function used for ranking while doing text search:
float4
calc_rank_cd(float4 *arrdata, TSVector txt, TSQuery query, int method)


Here is the excerpt of code that I think may possibly have a bug when
the document is big enough to exceed the 16383 position limit.

CODE
===
Cpos = ((double) (ext.end - ext.begin + 1)) / InvSum;

/*
 * if doc are big enough then ext.q may be equal to ext.p due to limit
 * of posional information. In this case we approximate number of
 * noise word as half cover's length
 */
nNoise = (ext.q - ext.p) - (ext.end - ext.begin);
if (nNoise < 0)
    nNoise = (ext.end - ext.begin) / 2;
Wdoc += Cpos / ((double) (1 + nNoise));
===

As per my understanding, ext.end - ext.begin + 1 is the number of query
items in the cover and ext.q - ext.p gives the length of the cover.

So consider a query with two query items. When we run out of position
information, Cover returns ext.q = 16383 and ext.p = 16383, and the
number of query items = ext.end - ext.begin + 1 = 2.

nNoise becomes -1 and then nNoise is initialized to (ext.end
-ext.begin)/2 = 0
Wdoc becomes Cpos = 2/InvSum = 2/(1/0.1+1/0.1) = 0.1
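For comparison, a document where the two query terms really are adjacent gets
the same score under this formula (assuming default lexeme weights):

SELECT ts_rank_cd(to_tsvector('english', 'abc def'),
                  to_tsquery('english', 'abc & def'));
 ts_rank_cd
------------
        0.1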

Is this what is desired? It seems to me that Wdoc is getting a high
ranking even when we are not sure of the position information. 

The comment above says that "in this case we approximate number of
noise word as half cover's length". But we do not know the cover's
length in this case, as ext.p and ext.q are both unreliable. And ext.end
- ext.begin is not the cover's length; it is the number of query items
found in the cover (minus one).

Any clarification would be useful. 

Thanks,
-Sushant.





Re: [HACKERS] possible bug in cover density ranking?

2009-01-29 Thread Sushant Sinha
On Thu, Jan 29, 2009 at 12:38 PM, Teodor Sigaev teo...@sigaev.ru wrote:

 Is this what is desired? It seems to me that Wdoc is getting a high
 ranking even when we are not sure of the position information.

 0.1 is not very high rank, and we could not suggest any reasonable rank in
 this case. This document may be good, may be bad. rank_cd is not limited by
 1.



For a cover of 2 query items, 0.1 is actually the maximum rank. This is only
possible when both query items are adjacent to each other.

0.1 may not seem too high when we look at its absolute value. But the problem
is we are ranking a document for which we have no positional information
available higher than a document for which we may have positional
information available with, let us suppose, a cover length of 3. I think we
should rank the document with cover length 3 higher than the document for
which we have no positional information (and assume a cover length of 2, as we
are doing now).

I feel that if ext.p = ext.q for query items > 1, then we should not count
that cover for ranking at all. Or, another option would be to significantly
inflate nNoise in this scenario to, say, 100. Putting
nNoise = (ext.end-ext.begin)/2 is way too low for covers that we have no idea
about (it is 0 for query items = 2).

I am not assuming or suggesting that rank_cd is bounded by one. Of course
its rank increases as more and more covers are added.

Thanks,
Sushant.




 The comment above says that In this case we approximate number of
 noise word as half cover's length. But we do not know the cover's
 length in this case as ext.p and ext.q are both unreliable. And ext.end
 -ext.begin is not the cover's length. It is the number of query items
 found in the cover.


 Yeah, but if there is no information then information is absent :), but I
 agree with you to change comment
 --
 Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW:
 http://www.sigaev.ru/



Re: [HACKERS] Ellipses around result fragment of ts_headline

2009-02-14 Thread Sushant Sinha
I think we currently do that. We add ellipses only when we encounter a
new fragment. So there should not be ellipses if we are at the end of
the document or if that is the first fragment (which includes the beginning of
the document). Here is the code in generateHeadline, ts_parse.c that
adds the ellipses:

if (!infrag)
{
    /* start of a new fragment */
    infrag = 1;
    numfragments++;
    /* add a fragment delimiter if this is after the first one */
    if (numfragments > 1)
    {
        memcpy(ptr, prs->fragdelim, prs->fragdelimlen);
        ptr += prs->fragdelimlen;
    }
}

It is possible that there is a bug that needs to be fixed. Can you show
me an example where you found that?

-Sushant.




On Sat, 2009-02-14 at 15:13 -0500, Asher Snyder wrote:
 It would be very useful if there were an option to have ts_headline append
 ellipses before or after a result fragment based on the position of the
 fragment in the source document. For instance, when running ts_headline(doc,
 query) it will correctly return a fragment with words highlighted; however,
 there's no easy way to determine whether this returned fragment is at the
 beginning or end of the original doc, and add the necessary ellipses. 
 
 Searches such as postgresql.org ALWAYS add ellipses before or after the
 fragment regardless of whether or not ellipses are warranted. In my opinion,
 always adding ellipses to the fragment is deceptive to the user; in many of
 my search result cases, the fragment is at the beginning of the doc, and it
 would confuse the user to always see ellipses. So you can see how the
 feature described above would be beneficial to the accuracy of the search
 result fragment.
 
 
 
 
 




Re: [HACKERS] Ellipses around result fragment of ts_headline

2009-02-14 Thread Sushant Sinha
The documentation in 8.4dev has information on FragmentDelimiter
http://developer.postgresql.org/pgdocs/postgres/textsearch-controls.html

If you do not specify MaxFragments > 0, then the default headline
generator kicks in. The default headline generator does not have any
fragment delimiter, so it is correct that you will not see any
delimiter.

I think you are looking for the default headline generator to add
ellipses as well, depending on where the fragment is. I do not know what
other people's opinion on this is.

-Sushant.

On Sat, 2009-02-14 at 16:21 -0500, Asher Snyder wrote:
 Interesting, it could be that you already do it, but the documentation makes
 no reference to a fragment delimiter, so there's no way that I can see to
 add one. The documentation for ts_headline only lists StartSel, StopSel,
 MaxWords, MinWords, ShortWord, and HighlightAll, there appears to be no
 option for a fragment delimiter.
 
 In my case I do:
 
 SELECT v1.id, v1.type_id, v1.title, ts_headline(v1.copy, query, 'MinWords =
 17') as copy, ts_rank(v1.text_search, query) AS rank FROM 
   (SELECT b1.*, (setweight(to_tsvector(coalesce(b1.title,'')), 'A')
 ||
  setweight(to_tsvector(coalesce(b1.copy,'')), 'B')) as text_search
FROM search.v_searchable_content b1) v1,  
   plainto_tsquery($1) query
 WHERE ($2 IS NULL OR (type_id = ANY($2))) AND query @@ v1.text_search ORDER
 BY rank DESC, title
 
 Now, this use of ts_headline correctly returns me highlighted fragmented
 search results, but there will be no fragment delimiter for the headline.
 Some suggestions were to change ts_headline(v1.copy, query, 'MinWords = 17')
 to '...' || ts_headline(v1.copy, query, 'MinWords = 17') || '...', but as you
 can clearly see this would always occur, and not be intelligent regarding
 the fragments. I hope that you're correct and that it is implemented, and
 just not documented.
 
 -Original Message-
 From: Sushant Sinha [mailto:sushant...@gmail.com]
 Sent: Saturday, February 14, 2009 4:07 PM
 To: Asher Snyder
 Cc: pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Ellipses around result fragment of ts_headline
 
 I think we currently do that. We add ellipses only when we encounter a
 new fragment. So there should not be ellipses if we are at the end of
 the document or if that is the first fragment (includes the beginning of
 the document). Here is the code in generateHeadline, ts_parse.c that
 adds the ellipses:
 
 if (!infrag)
 {
     /* start of a new fragment */
     infrag = 1;
     numfragments++;
     /* add a fragment delimiter if this is after the first one */
     if (numfragments > 1)
     {
         memcpy(ptr, prs->fragdelim, prs->fragdelimlen);
         ptr += prs->fragdelimlen;
     }
 }
 
 It is possible that there is a bug that needs to be fixed. Can you show
 me an example where you found that?
 
 -Sushant.
 
 
 
 
 On Sat, 2009-02-14 at 15:13 -0500, Asher Snyder wrote:
  It would be very useful if there were an option to have ts_headline
 append
  ellipses before or after a result fragement based on the position of
 the
  fragment in the source document. For instance, when running
 ts_headline(doc,
  query) it will correctly return a fragment with words highlighted,
 however,
  there's no easy way to determine whether this returned fragment is at
 the
  beginning or end of the original doc, and add the necessary ellipses.
 
  Searches such as postgresql.org ALWAYS add ellipses before or after
 the
  fragment regardless of whether or not ellipses are warranted. In my
 opinion
  always adding ellipses to the fragment is deceptive to the user, in
 many of
  my search result cases, the fragment is at the beginning of the doc,
 and
  would confuse the user to always see ellipses. So you can see how
 useful the
  feature described above would be beneficial to the accuracy of the
 search
  result fragment.
 
 
 
 
 
 
 




Re: [HACKERS] Ellipses around result fragment of ts_headline

2009-02-14 Thread Sushant Sinha
Sorry ... I thought you were running the development branch.

-Sushant.

On Sat, 2009-02-14 at 16:34 -0500, Tom Lane wrote:
 Sushant Sinha sushant...@gmail.com writes:
  I think we currently do that.
 
 ... since about four months ago.
 
 2008-10-17 14:05  teodor
 
   * doc/src/sgml/textsearch.sgml, src/backend/tsearch/ts_parse.c,
   src/backend/tsearch/wparser_def.c, src/include/tsearch/ts_public.h,
   src/test/regress/expected/tsearch.out,
   src/test/regress/sql/tsearch.sql: Improve headeline generation. Now
   headline can contain several fragments a-la Google.
   
   Sushant Sinha sushant...@gmail.com
 
   regards, tom lane




[HACKERS] patch for space around the FragmentDelimiter

2009-03-01 Thread Sushant Sinha
FragmentDelimiter is an argument for the ts_headline function that separates
different headline fragments. The default delimiter is " ... ".
Currently, if someone specifies the delimiter as an option to the
function, no extra space is added around the delimiter. However, it does
not look good without space around the delimiter. 

Since the option parsing function removes any space around the given
value, it is not possible to add any desired space. The attached patch
adds space when a FragmentDelimiter is specified.

QUERY:

SELECT ts_headline('english', '
Day after day, day after day,
  We stuck, nor breath nor motion,
As idle as a painted Ship
  Upon a painted Ocean.
Water, water, every where
  And all the boards did shrink;
Water, water, every where,
  Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', 'Coleridge & stuck'),
'MaxFragments=2,FragmentDelimiter=***');

OLD RESULT
                 ts_headline                 
---------------------------------------------
 after day, day after day,
   We <b>stuck</b>, nor breath nor motion,
 As idle as a painted Ship
   Upon a painted Ocean.
 Water, water, every where
   And all the boards did shrink;
 Water, water, every where***drop to drink.
 S. T. <b>Coleridge</b>
(1 row)




NEW RESULT after the patch

                  ts_headline                  
-----------------------------------------------
 after day, day after day,
   We <b>stuck</b>, nor breath nor motion,
 As idle as a painted Ship
   Upon a painted Ocean.
 Water, water, every where
   And all the boards did shrink;
 Water, water, every where *** drop to drink.
 S. T. <b>Coleridge</b>



Index: src/backend/tsearch/wparser_def.c
===
RCS file: /home/sushant/devel/pgrep/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.20
diff -c -r1.20 wparser_def.c
*** src/backend/tsearch/wparser_def.c	15 Jan 2009 16:33:59 -	1.20
--- src/backend/tsearch/wparser_def.c	2 Mar 2009 06:00:02 -
***
*** 2082,2087 
--- 2082,2088 
  	int			shortword = 3;
  	int			max_fragments = 0;
  	int			highlight = 0;
+ 	int			len;
  	ListCell   *l;
  
  	/* config */
***************
*** 2105,2111 ****
  		else if (pg_strcasecmp(defel->defname, "StopSel") == 0)
  			prs->stopsel = pstrdup(val);
  		else if (pg_strcasecmp(defel->defname, "FragmentDelimiter") == 0)
! 			prs->fragdelim = pstrdup(val);
  		else if (pg_strcasecmp(defel->defname, "HighlightAll") == 0)
  			highlight = (pg_strcasecmp(val, "1") == 0 ||
  						 pg_strcasecmp(val, "on") == 0 ||
--- 2106,2116 ----
  		else if (pg_strcasecmp(defel->defname, "StopSel") == 0)
  			prs->stopsel = pstrdup(val);
  		else if (pg_strcasecmp(defel->defname, "FragmentDelimiter") == 0)
! 		{
! 			len = strlen(val) + 2 + 1;	/* 2 for spaces and 1 for end of string */
! 			prs->fragdelim = palloc(len * sizeof(char));
! 			snprintf(prs->fragdelim, len, " %s ", val);
! 		}
  		else if (pg_strcasecmp(defel->defname, "HighlightAll") == 0)
  			highlight = (pg_strcasecmp(val, "1") == 0 ||
  						 pg_strcasecmp(val, "on") == 0 ||
Index: src/test/regress/expected/tsearch.out
===
RCS file: /home/sushant/devel/pgrep/pgsql/src/test/regress/expected/tsearch.out,v
retrieving revision 1.15
diff -c -r1.15 tsearch.out
*** src/test/regress/expected/tsearch.out	17 Oct 2008 18:05:19 -	1.15
--- src/test/regress/expected/tsearch.out	2 Mar 2009 02:02:38 -
***
*** 624,630 ****
    <body>
    <b>Sea</b> view wow <u><b>foo</b> bar</u> <i>qq</i>
    <a href="http://www.google.com/foo.bar.html" target="_blank">YES &nbsp;</a>
!   ff-bg
    <script>
   document.write(15);
    </script>
--- 624,630 ----
    <body>
    <b>Sea</b> view wow <u><b>foo</b> bar</u> <i>qq</i>
    <a href="http://www.google.com/foo.bar.html" target="_blank">YES &nbsp;</a>
!  ff-bg
    <script>
   document.write(15);
    </script>
***************
*** 712,726 ****
    	Nor any drop to drink.
   S. T. Coleridge (1772-1834)
   ', to_tsquery('english', 'Coleridge & stuck'), 'MaxFragments=2,FragmentDelimiter=***');
!                  ts_headline                  
! ----------------------------------------------
   after day, day after day,
     We <b>stuck</b>, nor breath nor motion,
   As idle as a painted Ship
     Upon a painted Ocean.
   Water, water, every where
     And all the boards did shrink;
!  Water, water, every where***drop to drink.
   S. T. <b>Coleridge</b>
  (1 row)
  
--- 712,726 ----
    	Nor any drop to drink.
   S. T. Coleridge (1772-1834)
   ', to_tsquery('english', 'Coleridge & stuck'), 'MaxFragments=2,FragmentDelimiter=***');
!                   ts_headline                   
! ------------------------------------------------
   after day, day after day,
     We <b>stuck</b>, nor breath nor motion,
   As idle as a painted Ship
     Upon a painted Ocean.
   Water, water, every where
     And all the boards did shrink;
!  Water, water, every where *** drop to drink.
   S. T. <b>Coleridge</b>
  (1 row)
  


Re: [HACKERS] patch for space around the FragmentDelimiter

2009-03-01 Thread Sushant Sinha
Yeah, you are right. I did not know that you can pass a space using double
quotes.
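For the archives, the working form is then something like:

SELECT ts_headline('english', '1 2 3 4 5 1 2 3 1', '1&3'::tsquery,
                   'MaxFragments=2, FragmentDelimiter=" *** "');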

-Sushant.

On Sun, 2009-03-01 at 20:49 -0500, Tom Lane wrote:
 Sushant Sinha sushant...@gmail.com writes:
  FragmentDelimiter is an argument for ts_headline function to separates
  different headline fragments. The default delimiter is  ... .
  Currently if someone specifies the delimiter as an option to the
  function, no extra space is added around the delimiter. However, it does
  not look good without space around the delimter. 
 
 Maybe not to you, for the particular delimiter you happen to be working
 with, but it doesn't follow that spaces are always appropriate.
 
  Since the option parsing function removes any space around the  given
  value, it is not possible to add any desired space. The attached patch
  adds space when a FragmentDelimiter is specified.
 
 I think this is a pretty bad idea.  Better would be to document how to
 get spaces into the delimiter, ie, use double quotes:
 
	... FragmentDelimiter = " ... " ...
 
 Hmm, actually, it looks to me that the documentation already shows this,
 in the example of the default values.
 
   regards, tom lane




Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline

2009-04-13 Thread Sushant Sinha
Headline generation uses hlCover to get fragments of the text with *all*
query items. In case there is no such fragment, it does not return
anything.

What you are asking will either require returning *maximally* matching
covers or handling it as a separate case.

-Sushant.


On Mon, 2009-04-13 at 20:57 -0400, Tom Lane wrote:
 Sushant Sinha sushant...@gmail.com writes:
  Sorry for the delay. Here is the patch with FragmentDelimiter option. 
  It requires an extra option in HeadlineParsedText and uses that option
  during generateHeadline.
 
 I did some editing of the documentation for this patch and noticed that
 the explanation of the fragment-based headline method says
 
    If not all query words are found in the
    document, then a single fragment of the first <literal>MinWords</literal>
    in the document will be displayed.
 
 (That's what it says now, that is, based on my editing and testing of
 the original.)  This seems like a pretty dumb fallback approach ---
 if you have only a partial match, the headline generation suddenly
 becomes about as stupid as it could possibly be.  I could understand
 doing the above if the text actually contains *none* of the query
 words, but surely if it contains some of them we should still select
 fragments centered on those words.
 
   regards, tom lane




Re: [HACKERS] possible bug in cover density ranking?

2009-05-01 Thread Sushant Sinha
I see this in the open items here:

http://wiki.postgresql.org/wiki/PostgreSQL_8.4_Open_Items

Any interest in fixing this?

-Sushant.

On Thu, 2009-01-29 at 13:54 -0500, Sushant Sinha wrote:
 
 
 On Thu, Jan 29, 2009 at 12:38 PM, Teodor Sigaev teo...@sigaev.ru
 wrote:
 Is this what is desired? It seems to me that Wdoc is
 getting a high
 ranking even when we are not sure of the position
 information. 
 0.1 is not very high rank, and we could not suggest any
 reasonable rank in this case. This document may be good, may
 be bad. rank_cd is not limited by 1.
 
  
 For a cover of 2 query items, 0.1 is actually the maximum rank. This
 is only possible when both query items are adjacent to each other.
 
 0.1 may not seem too high when we look at its absoule value. But the
 problem is we are ranking a document for which we have no positional
 information available higher than a document for which we may have
 positional information available with let suppose the cover length of
 3. I think we should rank the document with cover length 3 higher than
 the document for which we have no positional information (and assume
 cover length of 2 as we are doing now).
 
 I feel that if ext.p=ext.q for query items  1, then we should not
 count that cover for ranking at all. Or, another option will be to
 significantly inflate nNoise in this scenrio to  say 100. Putting
 nNoise=(ext.end-ext.begin)/2 is way too low for covers that we have no
 idea on (it is 0 for query items = 2).
 
 I am not assuming or suggesting that rank_cd is bounded by one. Off
 course its rank increases as more and more covers are added.
 
 Thanks,
 Sushant.
 
 
 
 The comment above says that In this case we
 approximate number of
 noise word as half cover's length. But we do not know
 the cover's
 length in this case as ext.p and ext.q are both
 unreliable. And ext.end
 -ext.begin is not the cover's length. It is the
 number of query items
 found in the cover.
 
 
 Yeah, but if there is no information then information is
 absent :), but I agree with you to change comment
 -- 
 Teodor Sigaev   E-mail:
 teo...@sigaev.ru
   WWW:
 http://www.sigaev.ru/
 




[HACKERS] dot to be considered as a word delimiter?

2009-05-30 Thread Sushant Sinha
Currently it seems that dot is not considered a word delimiter
by the english parser.

lawdb=# select to_tsvector('english', 'Mr.J.Sai Deepak');
   to_tsvector   
-
 'deepak':2 'mr.j.sai':1
(1 row)

So the word obtained is mr.j.sai rather than the three words mr, j,
and sai.

It does it correctly if there is a space in between, as space is
definitely a word delimiter.

lawdb=# select to_tsvector('english', 'Mr. J. Sai Deepak');
   to_tsvector   
-
 'j':2 'mr':1 'sai':3 'deepak':4
(1 row)


I think that dot should be considered as a word delimiter because
when a dot is not followed by a space, most of the time it is a typing
error. Besides, there are not many valid english words that have a dot in
between.

-Sushant.




Re: [HACKERS] dot to be considered as a word delimiter?

2009-06-02 Thread Sushant Sinha
Fair enough. I agree that there is a valid need for returning such tokens as
a host. But I think there is definitely a need to break them down into
individual words as well. This will help in cases when a document is missing
a space in between the words.


So what we can do is: return the entire compound word as Host and also break
it down into individual words. I can put up a patch for this if you guys
agree.

Returning multiple tokens for the same word is a feature of the text search
parser as explained in the documentation here:
http://www.postgresql.org/docs/8.3/static/textsearch-parsers.html
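Under that proposal, the earlier example would hypothetically give something
like the following (not current behavior; the host token and its first part
would share a position):

-- hypothetical output once host parts are also emitted:
SELECT to_tsvector('english', 'Mr.J.Sai Deepak');
                  to_tsvector
-----------------------------------------------
 'deepak':4 'j':2 'mr':1 'mr.j.sai':1 'sai':3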

Thanks,
Sushant.

On Tue, Jun 2, 2009 at 8:47 AM, Kenneth Marshall k...@rice.edu wrote:

 On Mon, Jun 01, 2009 at 08:22:23PM -0500, Kevin Grittner wrote:
  Sushant Sinha sushant...@gmail.com wrote:
 
   I think that dot should be considered by as a word delimiter because
   when dot is not followed by a space, most of the time it is an error
   in typing. Beside they are not many valid english words that have
   dot in between.
 
  It's not treating it as an English word, but as a host name.
 
  select ts_debug('english', 'Mr.J.Sai Deepak');
                             ts_debug
  ---------------------------------------------------------------
   (host,Host,Mr.J.Sai,{simple},simple,{mr.j.sai})
   (blank,"Space symbols"," ",{},,)
   (asciiword,"Word, all ASCII",Deepak,{english_stem},english_stem,{deepak})
  (3 rows)
 
  You could run it through a dictionary which would deal with host
  tokens differently.  Just be aware of what you'll be doing to
  www.google.com if you run into it.
 
  I hope this helps.
 
  -Kevin
 

 In our uses for full text indexing, it is much more important to
 be able to find host name and URLs than to find mistyped names.
 My two cents.

 Cheers,
 Ken



Re: [HACKERS] It's June 1; do you know where your release is?

2009-06-02 Thread Sushant Sinha
On Tue, 2009-06-02 at 17:26 -0700, Josh Berkus wrote:
 
  * possible bug in cover density ranking?
 
 -- From Teodor's response, this is maybe a doc patch and not a code 
 patch.  Teodor?  Oleg?


I personally think that this is a bug, because we are assigning very
high rank when we are not sure about the positional information. This is
not a show stopper though.

-Sushant.




Re: [HACKERS] TS: Limited cover density ranking

2012-01-27 Thread Sushant Sinha
The rank counts 1/coversize, so bigger covers will not have much impact
anyway. What is the need for the patch?

-Sushant.

On Fri, 2012-01-27 at 18:06 +0200, karave...@mail.bg wrote:
 Hello, 
 
 I have developed a variation of cover density ranking functions that
 counts only covers that are lesser than a specified limit. It is
 useful for finding combinations of terms that appear nearby one
 another. Here is an example of usage: 
 
 -- normal cover density ranking : not changed 
 luben=> select ts_rank_cd(to_tsvector('a b c d e g h i j k'),
 to_tsquery('a&d')); 
  ts_rank_cd 
 ------------ 
       0.033 
 (1 row) 
 
 -- limited to 2 
 luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k'),
 to_tsquery('a&d')); 
  ts_rank_cd 
 ------------ 
           0 
 (1 row) 
 
 luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k a d'),
 to_tsquery('a&d')); 
  ts_rank_cd 
 ------------ 
         0.1 
 (1 row) 
 
 -- limited to 3 
 luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k'),
 to_tsquery('a&d')); 
  ts_rank_cd 
 ------------ 
       0.033 
 (1 row) 
 
 luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k a d'),
 to_tsquery('a&d')); 
  ts_rank_cd 
 ------------ 
        0.13 
 (1 row) 
 
 Find attached a patch against 9.1.2 sources. I preferred to make a
 patch, not a separate extension, because it is only a 1-statement change
 in the calc_rank_cd function. If I had to make an extension, a lot of code
 would be duplicated between backend/utils/adt/tsrank.c and the
 extension. 
 
 I have some questions: 
 
 1. Is it interesting to develop it further (documentation, cleanup,
 etc) for inclusion in one of the next versions? If this is the case,
 there are some further questions: 
 
 - should I overload ts_rank_cd (as in examples above and the patch) or
 should I define new set of functions, for example ts_rank_lcd ? 
 - should I define define this new sql level functions in core or
 should I go only with this 2 lines change in calc_rank_cd() and define
 the new functions as an extension? If we prefer the later, could I
 overload core functions with functions defined in extensions? 
 - and finally there is always the possibility to duplicate the code
 and make an independent extension. 
 
 2. If I run the patched version on cluster that was initialized with
 unpatched server, is there a way to register the new functions in the
 system catalog without reinitializing the cluster? 
 
 Best regards 
 luben 
 
 -- 
 Luben Karavelov





[HACKERS] bug in ts_rank_cd

2010-12-21 Thread Sushant Sinha
There is a bug in ts_rank_cd. It does not correctly give rank when the
query lexeme is the first one in the tsvector.

Example:

select ts_rank_cd(to_tsvector('english', 'abc sdd'),
plainto_tsquery('english', 'abc'));   
 ts_rank_cd 
------------
          0

select ts_rank_cd(to_tsvector('english', 'bcg abc sdd'),
plainto_tsquery('english', 'abc'));
 ts_rank_cd 
------------
        0.1

The problem is that the Cover finding algorithm ignores the lexeme at
the 0th position, I have attached a patch which fixes it. After the
patch the result is fine.

select ts_rank_cd(to_tsvector('english', 'abc sdd'), plainto_tsquery(
'english', 'abc'));
 ts_rank_cd 
------------
        0.1

--- postgresql-9.0.0/src/backend/utils/adt/tsrank.c	2010-01-02 22:27:55.0 +0530
+++ postgres-9.0.0-tsrankbugfix/src/backend/utils/adt/tsrank.c	2010-12-21 18:39:57.0 +0530
@@ -551,7 +551,7 @@
 	memset(qr->operandexist, 0, sizeof(bool) * qr->query->size);
 
 	ext->p = 0x7fffffff;
-	ext->q = 0;
+	ext->q = -1;
 	ptr = doc + ext->pos;
 
 	/* find upper bound of cover from current position, move up */



[HACKERS] bug in ts_rank_cd

2010-12-21 Thread Sushant Sinha
MY PREV EMAIL HAD A PROBLEM. Please reply to this one
==

There is a bug in ts_rank_cd. It does not correctly give rank when the
query lexeme is the first one in the tsvector.

Example:

select ts_rank_cd(to_tsvector('english', 'abc sdd'),
plainto_tsquery('english', 'abc'));   
 ts_rank_cd 
------------
          0

select ts_rank_cd(to_tsvector('english', 'bcg abc sdd'),
plainto_tsquery('english', 'abc'));
 ts_rank_cd 
------------
        0.1

The problem is that the Cover finding algorithm ignores the lexeme at
the 0th position, I have attached a patch which fixes it. After the
patch the result is fine.

select ts_rank_cd(to_tsvector('english', 'abc sdd'), plainto_tsquery(
'english', 'abc'));
 ts_rank_cd 
------------
        0.1

--- postgresql-9.0.0/src/backend/utils/adt/tsrank.c	2010-01-02 22:27:55.0 +0530
+++ postgres-9.0.0-tsrankbugfix/src/backend/utils/adt/tsrank.c	2010-12-21 18:39:57.0 +0530
@@ -551,7 +551,7 @@
 	memset(qr->operandexist, 0, sizeof(bool) * qr->query->size);
 
 	ext->p = 0x7fffffff;
-	ext->q = 0;
+	ext->q = -1;
 	ptr = doc + ext->pos;
 
 	/* find upper bound of cover from current position, move up */



Re: [HACKERS] bug in ts_rank_cd

2010-12-22 Thread Sushant Sinha
Sorry for sounding a false alarm. I was not running vanilla postgres,
and that is why I was seeing the problem. I should have checked with the
vanilla one.

-Sushant

On Tue, 2010-12-21 at 23:03 -0500, Tom Lane wrote:
 Sushant Sinha sushant...@gmail.com writes:
  There is a bug in ts_rank_cd. It does not correctly give rank when the
  query lexeme is the first one in the tsvector.
 
 Hmm ... I cannot reproduce the behavior you're complaining of.
 You say
 
  select ts_rank_cd(to_tsvector('english', 'abc sdd'),
  plainto_tsquery('english', 'abc'));   
   ts_rank_cd 
  ------------
            0
 
 but I get
 
 regression=# select ts_rank_cd(to_tsvector('english', 'abc sdd'),
 regression(# plainto_tsquery('english', 'abc'));   
  ts_rank_cd 
 ------------
          0.1
 (1 row)
 
  The problem is that the Cover finding algorithm ignores the lexeme at
  the 0th position,
 
 As far as I can tell, there is no 0th position --- tsvector counts
 positions from one.  The only way to see pos == 0 in the input to
 Cover() is if the tsvector has been stripped of position information.
 ts_rank_cd is documented to return 0 in that situation.  Your patch
 would have the effect of causing it to return some nonzero, but quite
 bogus, ranking.
 
   regards, tom lane





Re: [HACKERS] english parser in text search: support for multiple words in the same position

2010-12-22 Thread Sushant Sinha
Just a reminder that this patch is discussing how to break urls, emails,
etc. into their components.

On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 [ sorry for not responding on this sooner, it's been hectic the last
  couple weeks ]

 Sushant Sinha sushant...@gmail.com writes:
  I looked at this patch a bit.  I'm fairly unhappy that it seems to be
  inventing a brand new mechanism to do something the ts parser can
  already do.  Why didn't you code the url-part mechanism using the
  existing support for compound words?

  I am not familiar with compound word implementation and so I am not sure
  how to split a url with compound word support. I looked into the
  documentation for compound words and that does not say much about how to
  identify components of a token.

 IIRC, the way that that works is associated with pushing a sub-state
 of the state machine in order to scan each compound-word part.  I don't
 have the details in my head anymore, though I recall having traced
 through it in the past.  Look at the state machine actions that are
 associated with producing the compound word tokens and sub-tokens.


I did look around for compound word support in postgres. In particular, I
read the documentation and code in tsearch/spell.c that seems to implement
the compound word support.

So in my understanding the way it works is:

1. Specify a dictionary of words in which each word will have applicable
prefix/suffix flags
2. Specify a flag file that provides prefix/suffix operations on those flags
3. flag z indicates that a word in the dictionary can participate in
compound word splitting
4. When a token matches words specified in the dictionary (after applying
affix/suffix operations), the matching words are emitted as sub-words of the
token (i.e., compound word)
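
For reference, a sketch of the SQL-level wiring (file names are
hypothetical; the .dict/.affix files under tsearch_data are what would
have to enumerate the words and their flags):

CREATE TEXT SEARCH DICTIONARY compound_demo (
    TEMPLATE = ispell,
    DictFile = mydict,    -- word list with flags, e.g. words marked "z"
    AffFile = mydict,     -- affix file carrying the compound rules
    StopWords = english
);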

If my above understanding is correct, then I think it will not be possible
to implement url/email splitting using the compound word support.

The main reason is that the compound word support requires a
PRE-DETERMINED dictionary of words. So to split a url/email we would need
to provide a list of *all possible* host names and user names. I do not
think that is a possibility.

Please correct me if I have mis-understood something.

-Sushant.


Re: [HACKERS] english parser in text search: support for multiple words in the same position

2011-01-06 Thread Sushant Sinha
Do not know if this mail got lost in between or no one noticed it!

On Thu, 2010-12-23 at 11:05 +0530, Sushant Sinha wrote:
Just a reminder that this patch is discussing  how to break url, emails
etc into its components.
 
 On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 [ sorry for not responding on this sooner, it's been hectic
 the last
  couple weeks ]
 
 Sushant Sinha sushant...@gmail.com writes:
 
  I looked at this patch a bit.  I'm fairly unhappy that it
 seems to be
  inventing a brand new mechanism to do something the ts
 parser can
  already do.  Why didn't you code the url-part mechanism
 using the
  existing support for compound words?
 
  I am not familiar with compound word implementation and so I
 am not sure
  how to split a url with compound word support. I looked into
 the
  documentation for compound words and that does not say much
 about how to
  identify components of a token.
 
 
 IIRC, the way that that works is associated with pushing a
 sub-state
 of the state machine in order to scan each compound-word
 part.  I don't
 have the details in my head anymore, though I recall having
 traced
 through it in the past.  Look at the state machine actions
 that are
 associated with producing the compound word tokens and
 sub-tokens.
 

I did look around for compound word support in postgres. In particular,
I read the documentation and code in tsearch/spell.c that seems to
implement the compound word support. 

So in my understanding the way it works is:

1. Specify a dictionary of words in which each word will have applicable
prefix/suffix flags

2. Specify a flag file that provides prefix/suffix operations on those
flags

3. flag z indicates that a word in the dictionary can participate in
compound word splitting

4. When a token matches words specified in the dictionary (after
applying affix/suffix operations), the matching words are emitted as
sub-words of the token (i.e., compound word)

If my above understanding is correct, then I think it will not be
possible to implement url/email splitting using the compound word
support.

The main reason is that the compound word support requires the
PRE-DETERMINED dictionary of words. So to split a url/email we will
need to provide a list of *all possible* host names and user names. I do
not think that is a possibility.

Please correct me if I have mis-understood something.

-Sushant. 





Re: [HACKERS] text search: restricting the number of parsed words in headline generation

2012-08-15 Thread Sushant Sinha
I will do the profiling and present the results.

On Wed, 2012-08-15 at 12:45 -0400, Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  Is this a TODO?
 
 AFAIR nothing's been done about the speed issue, so yes.  I didn't
 like the idea of creating a user-visible knob when the speed issue
 might be fixable with internal algorithm improvements, but we never
 followed up on this in either fashion.
 
   regards, tom lane
 
  ---
 
  On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
  Sushant Sinha sushant...@gmail.com writes:
  Doesn't this force the headline to be taken from the first N words of
  the document, independent of where the match was?  That seems rather
  unworkable, or at least unhelpful.
  
  In headline generation function, we don't have any index or knowledge of
  where the match is. We discover the matches by first tokenizing and then
  comparing the matches with the query tokens. So it is hard to do
  anything better than first N words.
  
  After looking at the code in wparser_def.c a bit more, I wonder whether
  this patch is doing what you think it is.  Did you do any profiling to
  confirm that tokenization is where the cost is?  Because it looks to me
  like the match searching in hlCover() is at least O(N^2) in the number
  of tokens in the document, which means it's probably the dominant cost
  for any long document.  I suspect that your patch helps not so much
  because it saves tokenization costs as because it bounds the amount of
  effort spent in hlCover().
  
  I haven't tried to do anything about this, but I wonder whether it
  wouldn't be possible to eliminate the quadratic blowup by saving more
  state across the repeated calls to hlCover().  At the very least, it
  shouldn't be necessary to find the last query-token occurrence in the
  document from scratch on each and every call.
  
  Actually, this code seems probably flat-out wrong: won't every
  successful call of hlCover() on a given document return exactly the same
  q value (end position), namely the last token occurrence in the
  document?  How is that helpful?
  
  regards, tom lane
  
 
  -- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB                           http://enterprisedb.com
 
  + It's impossible for everything to be true. +
 
 






[HACKERS] pg_trgm: unicode string not working

2011-06-12 Thread Sushant Sinha
I am using pg_trgm for spelling correction as prescribed in the
documentation. But I see that it does not work for unicode strings. The
database was initialized with utf8 encoding and the C locale.

Here is the table:
 \d words
     Table "public.words"
 Column |  Type   | Modifiers 
--------+---------+-----------
 word   | text    | 
 ndoc   | integer | 
 nentry | integer | 
Indexes:
    "words_idx" gin (word gin_trgm_ops)

Query: select word from words where word % 'कतद';

I get an error:

ERROR:  GIN indexes do not support whole-index scans


Any idea what is wrong?

-Sushant.




[HACKERS] PL/Python: No stack trace for an exception

2011-07-21 Thread Sushant Sinha
I am using plpythonu on postgres 9.0.2. One of my python functions was
throwing a TypeError exception. However, I only see the exception in the
database and not the stack trace. It becomes difficult to debug if the
stack trace is absent in Python.

logdb=# select get_words(forminput) from fi;   
ERROR:  PL/Python: TypeError: an integer is required
CONTEXT:  PL/Python function get_words


And here is the error if I run that function on the same data in python:

Traceback (most recent call last):
  File "valid.py", line 215, in <module>
    parse_query(result['forminput'])
  File "valid.py", line 132, in parse_query
    dateobj = datestr_to_obj(columnHash[column])
  File "valid.py", line 37, in datestr_to_obj
    dateobj = datetime.date(words[2], words[1], words[0])
TypeError: an integer is required


Is this a known problem or this needs addressing?

Thanks,
Sushant.




Re: [HACKERS] PL/Python: No stack trace for an exception

2011-07-21 Thread Sushant Sinha

On Thu, 2011-07-21 at 15:31 +0200, Jan Urbański wrote:
 On 21/07/11 15:27, Sushant Sinha wrote:
  I am using plpythonu on postgres 9.0.2. One of my python functions was
  throwing a TypeError exception. However, I only see the exception in the
  database and not the stack trace. It becomes difficult to debug if the
  stack trace is absent in Python.
  
  logdb=# select get_words(forminput) from fi;   
  ERROR:  PL/Python: TypeError: an integer is required
  CONTEXT:  PL/Python function get_words
  
  And here is the error if I run that function on the same data in python:
  
  [traceback]
  
  Is this a known problem or this needs addressing?
 
 Yes, traceback support in PL/Python has already been implemented and is
 a new feature that will be available in PostgreSQL 9.1.
 
 Cheers,
 Jan

Thanks Jan! Just one more reason to try 9.1.





[HACKERS] text search: restricting the number of parsed words in headline generation

2011-08-23 Thread Sushant Sinha
Given a document and a query, the goal of headline generation is to
produce text excerpts in which the query appears. Currently the headline
generation in postgres follows the following steps:

1. Tokenize the documents and obtain the lexemes
2. Decide on lexemes that should be the part of the headline
3. Generate the headline

So the time taken by the headline generation is directly dependent on
the size of the document. The longer the document, the more time taken
to tokenize and more lexemes to operate on.

Most of the time is taken during the tokenization phase and for very big
documents, the headline generation is very expensive. 

Here is a simple patch that limits the number of words during the
tokenization phase and puts an upper-bound on the headline generation.
The headline function takes a parameter MaxParsedWords. If this
parameter is negative or not supplied, then the entire document is
tokenized and operated on (the current behavior). However, if the
supplied MaxParsedWords is a positive number, then the tokenization
stops after MaxParsedWords tokens have been obtained. The remaining
headline generation happens on the tokens obtained till that point.
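
With the patch applied, usage would look roughly like this (a sketch;
the table/column and query are hypothetical, and MaxFragments is an
existing headline option):

select ts_headline('english', doc_text,
                   to_tsquery('english', 'postgres'),
                   'MaxParsedWords=10000, MaxFragments=2')
from docs;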

The current patch can be applied to 9.1rc1. It lacks changes to the
documentation and test cases. I will add them if you folks agree on the
functionality.

-Sushant.
diff -ru postgresql-9.1rc1/src/backend/tsearch/ts_parse.c postgresql-9.1rc1-dev/src/backend/tsearch/ts_parse.c
--- postgresql-9.1rc1/src/backend/tsearch/ts_parse.c	2011-08-19 02:53:13.0 +0530
+++ postgresql-9.1rc1-dev/src/backend/tsearch/ts_parse.c	2011-08-23 21:27:10.0 +0530
@@ -525,10 +525,11 @@
 }
 
 void
-hlparsetext(Oid cfgId, HeadlineParsedText *prs, TSQuery query, char *buf, int buflen)
+hlparsetext(Oid cfgId, HeadlineParsedText *prs, TSQuery query, char *buf, int buflen, int max_parsed_words)
 {
 	int			type,
-				lenlemm;
+				lenlemm,
+				numparsed = 0;
 	char	   *lemm = NULL;
 	LexizeData	ldata;
 	TSLexeme   *norms;
@@ -580,8 +581,8 @@
 			else
 				addHLParsedLex(prs, query, lexs, NULL);
 		} while (norms);
-
-	} while (type > 0);
+		numparsed += 1;
+	} while (type > 0 && (max_parsed_words < 0 || numparsed < max_parsed_words));
 
 	FunctionCall1(&(prsobj->prsend), PointerGetDatum(prsdata));
 }
--- postgresql-9.1rc1/src/backend/tsearch/wparser.c	2011-08-19 02:53:13.0 +0530
+++ postgresql-9.1rc1-dev/src/backend/tsearch/wparser.c	2011-08-23 21:30:12.0 +0530
@@ -304,6 +304,8 @@
 	text	   *out;
 	TSConfigCacheEntry *cfg;
 	TSParserCacheEntry *prsobj;
+	ListCell   *l;
+	int			max_parsed_words = -1;
 
 	cfg = lookup_ts_config_cache(PG_GETARG_OID(0));
 	prsobj = lookup_ts_parser_cache(cfg->prsId);
@@ -317,13 +319,21 @@
 	prs.lenwords = 32;
 	prs.words = (HeadlineWordEntry *) palloc(sizeof(HeadlineWordEntry) * prs.lenwords);
 
-	hlparsetext(cfg->cfgId, &prs, query, VARDATA(in), VARSIZE(in) - VARHDRSZ);
 
 	if (opt)
 		prsoptions = deserialize_deflist(PointerGetDatum(opt));
 	else
 		prsoptions = NIL;
 
+	foreach(l, prsoptions)
+	{
+		DefElem    *defel = (DefElem *) lfirst(l);
+		char	   *val = defGetString(defel);
+
+		if (pg_strcasecmp(defel->defname, "MaxParsedWords") == 0)
+			max_parsed_words = pg_atoi(val, sizeof(int32), 0);
+	}
+
+	hlparsetext(cfg->cfgId, &prs, query, VARDATA(in), VARSIZE(in) - VARHDRSZ, max_parsed_words);
 	FunctionCall3(&(prsobj->prsheadline),
 				  PointerGetDatum(&prs),
   PointerGetDatum(prsoptions),
diff -ru postgresql-9.1rc1/src/include/tsearch/ts_utils.h postgresql-9.1rc1-dev/src/include/tsearch/ts_utils.h
--- postgresql-9.1rc1/src/include/tsearch/ts_utils.h	2011-08-19 02:53:13.0 +0530
+++ postgresql-9.1rc1-dev/src/include/tsearch/ts_utils.h	2011-08-23 21:04:14.0 +0530
@@ -98,7 +98,7 @@
  */
 
 extern void hlparsetext(Oid cfgId, HeadlineParsedText *prs, TSQuery query,
-			char *buf, int4 buflen);
+			char *buf, int4 buflen, int max_parsed_words);
 extern text *generateHeadline(HeadlineParsedText *prs);
 
 /*



Re: [HACKERS] text search: restricting the number of parsed words in headline generation

2011-08-23 Thread Sushant Sinha

  Here is a simple patch that limits the number of words during the
  tokenization phase and puts an upper-bound on the headline generation.
 
 Doesn't this force the headline to be taken from the first N words of
 the document, independent of where the match was?  That seems rather
 unworkable, or at least unhelpful.
 
   regards, tom lane

In the headline generation function, we don't have any index or knowledge of
where the match is. We discover the matches by first tokenizing the document
and then comparing the tokens with the query tokens. So it is hard to do
anything better than the first N words.


One option could be to start looking for a good match while
tokenizing and then stop once we have found a good match. Currently the
algorithms that decide a good match operate independently of the
tokenization, and there are two of them. So integrating them would not be
easy.

The patch is very helpful if you believe in the common-case assumption
that most of the time a good match is at the top of the document.
Typically a search application generates headlines for the top matches of
a query, i.e., those in which the query terms appear frequently. So
there should be at least one or two good text excerpt matches at the top
of the document.



-Sushant.




Re: [HACKERS] text search: restricting the number of parsed words in headline generation

2011-08-23 Thread Sushant Sinha

 Actually, this code seems probably flat-out wrong: won't every
 successful call of hlCover() on a given document return exactly the same
 q value (end position), namely the last token occurrence in the
 document?  How is that helpful?

regards, tom lane


There is a line that saves the computation state from the previous call, and
the search only starts from there:

int pos = *p;


Re: [HACKERS] english parser in text search: support for multiple words in the same position

2010-09-21 Thread Sushant Sinha
 I looked at this patch a bit.  I'm fairly unhappy that it seems to be
 inventing a brand new mechanism to do something the ts parser can
 already do.  Why didn't you code the url-part mechanism using the
 existing support for compound words? 

I am not familiar with the compound word implementation, and so I am not sure
how to split a url with compound word support. I looked into the
documentation for compound words and it does not say much about how to
identify the components of a token. Does a compound word get split by matching
against a list of words? If yes, then we will not be able to use that, as we
do not know all the words that can appear in a url/host/email/file.

I think another approach could be to use the dict_regex dictionary
support. However, we would have to match the regex with what the
parser is doing. 

The current patch is not inventing any new mechanism. It uses the
special handler mechanism already present in the parser. For example,
when the current parser finds a URL it runs a special handler called
SpecialFURL which resets the parser position to the start of token to
find hostname. After finding the host it moves to finding the path. So
you first get the URL and then the host and finally the path.
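
For reference, a sketch of the overlapping tokens the stock parser
already emits for a URL (token names as of 8.4):

select alias, token from ts_debug('english', 'http://example.com/a/b.html');
   alias   |        token         
-----------+----------------------
 protocol  | http://
 url       | example.com/a/b.html
 host      | example.com
 url_path  | /a/b.html
(4 rows)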

Similarly, we are resetting the parser to the start of the token on
finding a url to output url parts. Then before entering the state that
can lead to a url we output the url part. The state machine modification
is similar for other tokens like file/email/host.


 The changes made to parsetext()
 seem particularly scary: it's not clear at all that that's not breaking
 unrelated behaviors.  In fact, the changes in the regression test
 results suggest strongly to me that it *is* breaking things.  Why are
 there so many diffs in examples that include no URLs at all?
 

I think some of the difference is coming from the fact that now pos
starts with 0 and it used to be 1 earlier. That is easily fixable
though. 

 An issue that's nearly as bad is the 100% lack of documentation,
 which makes the patch difficult to review because it's hard to tell
 what it intends to accomplish or whether it's met the intent.
 The patch is not committable without documentation anyway, but right
 now I'm not sure it's even usefully reviewable.

I did not provide any explanation as I could not find any place in the
code to put the documentation (it was just a modification of the state
machine). Should I do a separate write-up to explain the desired output
and the changes to achieve it?

 
 In line with the lack of documentation, I would say that the choice of
 the name parttoken for the new token type is not helpful.  Part of
 what?  And none of the other token type names include the word token,
 so that's not a good decision either.  Possibly url_part would be a
 suitable name.
 

I can modify it to output url-part/host-part/email-part/file-part if
there is an agreement over the rest of the issues. So let me know if I
should go ahead with this.

-Sushant.




Re: [HACKERS] Configuring Text Search parser?

2010-09-21 Thread Sushant Sinha
Your changes are somewhat fine. They will get you tokens with _
characters in them. However, it is not nice to mix your new token with an
existing token like NUMWORD. Give a new name to your new type of
token, probably UnderscoreWord. Then, on seeing _, move to a state
that can identify the new token. If you finally recognize that token,
then output it.

In order to extract portions of the newly created token, you can write
a special handler for the token that resets the parser position to the
start of the token to get parts of it. Then modify the state machine
to output the part-token before going into the state that can lead to
the token that was identified earlier.
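
A sketch of the kind of state-table entry involved, following the
wparser_def.c conventions visible in the patch below
(UnderscoreWord/UNDERSCOREWORD are hypothetical names that would also
need a token-type registration):

/* hypothetical state entered after seeing '_' inside an ASCII word */
static const TParserStateActionItem actionTPS_InUnderscoreWord[] = {
	{p_isEOF, 0, A_BINGO, TPS_Base, UNDERSCOREWORD, NULL},
	{p_isasclet, 0, A_NEXT, TPS_Null, 0, NULL},
	{p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
	{NULL, 0, A_BINGO, TPS_Base, UNDERSCOREWORD, NULL}
};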


Look at these changes to the text parser as well:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg4.php

-Sushant.


On Mon, 2010-09-20 at 16:01 +0200, jes...@krogh.cc wrote:
 Hi.
 
 I'm trying to migrate an application off an existing Full Text Search engine
 and onto PostgreSQL .. one of my main (remaining) headaches is the
 fact that PostgreSQL treats _ as a separation character whereas the existing
 behaviour is to not split. That means:
 
 testdb=# select ts_debug('database_tag_number_999');
                                      ts_debug
  ------------------------------------------------------------------------------
   (asciiword,"Word, all ASCII",database,{english_stem},english_stem,{databas})
   (blank,"Space symbols",_,{},,)
   (asciiword,"Word, all ASCII",tag,{english_stem},english_stem,{tag})
   (blank,"Space symbols",_,{},,)
   (asciiword,"Word, all ASCII",number,{english_stem},english_stem,{number})
   (blank,"Space symbols",_,{},,)
   (uint,"Unsigned integer",999,{simple},simple,{999})
  (7 rows)
 
 Where the incoming data, by design, contains a set of tags which include _
 and are expected to be one lexeme.
 
 I've tried patching my way out of this using this patch.
 
 $ diff -w -C 5 src/backend/tsearch/wparser_def.c.orig
 src/backend/tsearch/wparser_def.c
 *** src/backend/tsearch/wparser_def.c.orig2010-09-20 15:58:37.06460
 +0200
 --- src/backend/tsearch/wparser_def.c 2010-09-20 15:58:41.193335577 +0200
 ***
 *** 967,986 
 --- 967,988 
 
   static const TParserStateActionItem actionTPS_InNumWord[] = {
   {p_isEOF, 0, A_BINGO, TPS_Base, NUMWORD, NULL},
   {p_isalnum, 0, A_NEXT, TPS_InNumWord, 0, NULL},
   {p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
 + {p_iseqC, '_', A_NEXT, TPS_InNumWord, 0, NULL},
   {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
   {p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
   {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
   {p_iseqC, '-', A_PUSH, TPS_InHyphenNumWordFirst, 0, NULL},
   {NULL, 0, A_BINGO, TPS_Base, NUMWORD, NULL}
   };
 
   static const TParserStateActionItem actionTPS_InAsciiWord[] = {
   {p_isEOF, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL},
   {p_isasclet, 0, A_NEXT, TPS_Null, 0, NULL},
 + {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
   {p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
   {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
   {p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
   {p_iseqC, '-', A_PUSH, TPS_InHyphenAsciiWordFirst, 0, NULL},
   {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
 ***
 *** 995,1004 
 --- 997,1007 
 
   static const TParserStateActionItem actionTPS_InWord[] = {
   {p_isEOF, 0, A_BINGO, TPS_Base, WORD_T, NULL},
   {p_isalpha, 0, A_NEXT, TPS_Null, 0, NULL},
   {p_isspecial, 0, A_NEXT, TPS_Null, 0, NULL},
 + {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
   {p_isdigit, 0, A_NEXT, TPS_InNumWord, 0, NULL},
   {p_iseqC, '-', A_PUSH, TPS_InHyphenWordFirst, 0, NULL},
   {NULL, 0, A_BINGO, TPS_Base, WORD_T, NULL}
   };
 
 
 
 This will obviously break other people's applications, so my question would
 be: if this should be made configurable, how should it be done?
 
 As a sidenote... Xapian doesn't split on _ .. Lucene does.
 
 Thanks.
 
 -- 
 Jesper
 
 





Re: [HACKERS] english parser in text search: support for multiple words in the same position

2010-09-28 Thread Sushant Sinha
Any updates on this?


On Tue, Sep 21, 2010 at 10:47 PM, Sushant Sinha sushant...@gmail.comwrote:

  I looked at this patch a bit.  I'm fairly unhappy that it seems to be
  inventing a brand new mechanism to do something the ts parser can
  already do.  Why didn't you code the url-part mechanism using the
  existing support for compound words?

 I am not familiar with compound word implementation and so I am not sure
 how to split a url with compound word support. I looked into the
 documentation for compound words and that does not say much about how to
 identify components of a token. Does a compound word split by matching
 with a list of words? If yes, then we will not be able to use that as we
 do not know all the words that can appear in a url/host/email/file.

 I think another approach can be to use the dict_regex dictionary
 support. However, we will have to match the regex with something that
 parser is doing.

 The current patch is not inventing any new mechanism. It uses the
 special handler mechanism already present in the parser. For example,
 when the current parser finds a URL it runs a special handler called
 SpecialFURL which resets the parser position to the start of token to
 find hostname. After finding the host it moves to finding the path. So
 you first get the URL and then the host and finally the path.

 Similarly, we are resetting the parser to the start of the token on
 finding a url to output url parts. Then before entering the state that
 can lead to a url we output the url part. The state machine modification
 is similar for other tokens like file/email/host.


  The changes made to parsetext()
  seem particularly scary: it's not clear at all that that's not breaking
  unrelated behaviors.  In fact, the changes in the regression test
  results suggest strongly to me that it *is* breaking things.  Why are
  there so many diffs in examples that include no URLs at all?
 

 I think some of the difference is coming from the fact that now pos
 starts with 0 and it used to be 1 earlier. That is easily fixable
 though.

  An issue that's nearly as bad is the 100% lack of documentation,
  which makes the patch difficult to review because it's hard to tell
  what it intends to accomplish or whether it's met the intent.
  The patch is not committable without documentation anyway, but right
  now I'm not sure it's even usefully reviewable.

 I did not provide any explanation as I could not find any place in the
 code to provide the documentation (that was just a modification of state
 machine). Should I do a separate write-up to explain the desired output
 and the changes to achieve it?

 
  In line with the lack of documentation, I would say that the choice of
  the name parttoken for the new token type is not helpful.  Part of
  what?  And none of the other token type names include the word token,
  so that's not a good decision either.  Possibly url_part would be a
  suitable name.
 

 I can modify it to output url-part/host-part/email-part/file-part if
 there is an agreement over the rest of the issues. So let me know if I
 should go ahead with this.

 -Sushant.




Re: [HACKERS] Re: [GENERAL] Text search parser's treatment of URLs and emails

2010-10-12 Thread Sushant Sinha

On Tue, 2010-10-12 at 19:31 -0400, Tom Lane wrote:
 This seems much of a piece with the existing proposal to allow
 individual words of a URL to be reported separately:
 https://commitfest.postgresql.org/action/patch_view?id=378
 
 As I said in that thread, this could be done in a backwards-compatible
 way using the tsearch parser's existing ability to report multiple
 overlapping tokens out of the same piece of text.  But I'd like to see
 one unified proposal and patch for this and Sushant's patch, not
 independent hacks changing the behavior in the same area.
 
   regards, tom lane
What Tom has suggested will require me to look into a different piece of
code and so this will take some time before I can update the patch.

-Sushant.





[HACKERS] planner row-estimates for tsvector seems horribly wrong

2010-10-24 Thread Sushant Sinha
I am using a gin index on a tsvector and doing basic search. I see that
the planner's row estimate is horribly wrong. It returns a row estimate
of 4843 for all queries, whether they match zero rows, a medium number
of rows (88,000), or a large number of rows (726,000).

The table has roughly a million docs.

I see a similar problem reported here but thought it was fixed in 9.0
which I am running. 

http://archives.postgresql.org/pgsql-hackers/2010-05/msg01389.php

Here is the version info and detailed planner output for all the three
queries:


select version();

                                   version 
------------------------------------------------------------------------------
 PostgreSQL 9.0.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Gentoo 4.3.4 p1.1, pie-10.1.5) 4.3.4, 64-bit


Case I: FOR A NON-MATCHING WORD
===

explain analyze select count(*) from  docmeta,
plainto_tsquery('english', 'dyfdfdf') as qdoc where  docvector @@ qdoc;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20322.17..20322.18 rows=1 width=0) (actual time=0.058..0.058 rows=1 loops=1)
   ->  Nested Loop  (cost=5300.28..20310.06 rows=4843 width=0) (actual time=0.055..0.055 rows=0 loops=1)
         ->  Function Scan on qdoc  (cost=0.00..0.01 rows=1 width=32) (actual time=0.005..0.005 rows=1 loops=1)
         ->  Bitmap Heap Scan on docmeta  (cost=5300.28..20249.51 rows=4843 width=270) (actual time=0.046..0.046 rows=0 loops=1)
               Recheck Cond: (docmeta.docvector @@ qdoc.qdoc)
               ->  Bitmap Index Scan on doc_index  (cost=0.00..5299.07 rows=4843 width=0) (actual time=0.044..0.044 rows=0 loops=1)
                     Index Cond: (docmeta.docvector @@ qdoc.qdoc)
 Total runtime: 0.092 ms

CASE II: FOR A MEDIUM-MATCHING WORD
===
 explain analyze select count(*) from  docmeta,
plainto_tsquery('english', 'quit') as qdoc where  docvector @@ qdoc;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20322.17..20322.18 rows=1 width=0) (actual time=1222.856..1222.857 rows=1 loops=1)
   ->  Nested Loop  (cost=5300.28..20310.06 rows=4843 width=0) (actual time=639.275..1212.460 rows=88545 loops=1)
         ->  Function Scan on qdoc  (cost=0.00..0.01 rows=1 width=32) (actual time=0.006..0.007 rows=1 loops=1)
         ->  Bitmap Heap Scan on docmeta  (cost=5300.28..20249.51 rows=4843 width=270) (actual time=639.264..1196.542 rows=88545 loops=1)
               Recheck Cond: (docmeta.docvector @@ qdoc.qdoc)
               ->  Bitmap Index Scan on doc_index  (cost=0.00..5299.07 rows=4843 width=0) (actual time=621.877..621.877 rows=88545 loops=1)
                     Index Cond: (docmeta.docvector @@ qdoc.qdoc)
 Total runtime: 1222.907 ms


Case III: FOR A HIGH-MATCHING WORD
=

explain analyze select count(*) from  docmeta,
plainto_tsquery('english', 'j') as qdoc where  docvector @@ qdoc;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20322.17..20322.18 rows=1 width=0) (actual time=742.857..742.858 rows=1 loops=1)
   ->  Nested Loop  (cost=5300.28..20310.06 rows=4843 width=0) (actual time=126.804..660.895 rows=726985 loops=1)
         ->  Function Scan on qdoc  (cost=0.00..0.01 rows=1 width=32) (actual time=0.004..0.006 rows=1 loops=1)
         ->  Bitmap Heap Scan on docmeta  (cost=5300.28..20249.51 rows=4843 width=270) (actual time=126.795..530.422 rows=726985 loops=1)
               Recheck Cond: (docmeta.docvector @@ qdoc.qdoc)
               ->  Bitmap Index Scan on doc_index  (cost=0.00..5299.07 rows=4843 width=0) (actual time=113.742..113.742 rows=726985 loops=1)
                     Index Cond: (docmeta.docvector @@ qdoc.qdoc)
 Total runtime: 742.906 ms

Thanks,
Sushant.




Re: [HACKERS] planner row-estimates for tsvector seems horribly wrong

2010-10-24 Thread Sushant Sinha
Thanks a ton Jan! It works quite correctly. But many tsearch tutorials
ask for the tsquery to be placed in the 'from' clause, and that can cause
a bad plan. Isn't it possible to return the correct number for a join
with the query as well?

-Sushant.

On Sun, 2010-10-24 at 15:07 +0200, Jan Urbański wrote:
 On 24/10/10 14:44, Sushant Sinha wrote:
  I am using gin index on a tsvector and doing basic search. I see the
  row-estimate of the planner to be horribly wrong. It is returning
  row-estimate as 4843 for all queries whether it matches zero rows, a
  medium number of rows (88,000) or a large number of rows (726,000).
  
  The table has roughly a million docs.
 
  explain analyze select count(*) from  docmeta,
  plainto_tsquery('english', 'dyfdfdf') as qdoc where  docvector @@ qdoc;
 
 OK, forget my previous message. The problem is that you are doing a join
 using @@ as the operator for the join condition, so the planner uses the
 operator's join selectivity estimate. For @@ the tsmatchjoinsel function
 simply returns 0.005.
 
 Try doing:
 
 explain analyze select count(*) from docmeta where docvector @@
 plainto_tsquery('english', 'dyfdfdf');
 
 It should help.
 
 Cheers,
 Jan





[HACKERS] lexemes in prefix search going through dictionary modifications

2011-10-25 Thread Sushant Sinha
I am currently using the prefix search feature in text search. I find
that the prefix characters are treated the same as a normal lexeme and
passed through stemming and stopword dictionaries. This seems like a bug
to me. 

db=# select to_tsquery('english', 's:*');
NOTICE:  text-search query contains only stop words or doesn't contain
lexemes, ignored
 to_tsquery 
------------
 
(1 row)

db=# select to_tsquery('simple', 's:*');
 to_tsquery 
------------
 's':*
(1 row)


I also think that this is a mistake. It should only be highlighting s.
db=# select ts_headline('sushant', to_tsquery('simple', 's:*'));
  ts_headline   
----------------
 <b>sushant</b>


Thanks,
Sushant.




Re: [HACKERS] lexemes in prefix search going through dictionary modifications

2011-10-25 Thread Sushant Sinha
On Tue, 2011-10-25 at 18:05 +0200, Florian Pflug wrote:
 On Oct25, 2011, at 17:26 , Sushant Sinha wrote:
  I am currently using the prefix search feature in text search. I find
  that the prefix characters are treated the same as a normal lexeme and
  passed through stemming and stopword dictionaries. This seems like a bug
  to me.
 
 Hm, I don't think so. If they don't pass through stopword dictionaries,
 then queries containing stopwords will fail to find any rows - which is
 probably not what one would expect.

I think what you are calling a feature is really a bug. I am fairly sure
that when someone says to_tsquery('english', 's:*') one is looking for
an entry that has a *non-stopword* word that starts with 's'. And
especially so in a text search configuration that eliminates stop words. 

Does it even make sense to apply stemming, abbreviations, or synonyms to
a few letters? It will be so unpredictable.

-Sushant.




Re: [HACKERS] lexemes in prefix search going through dictionary modifications

2011-10-25 Thread Sushant Sinha
On Tue, 2011-10-25 at 19:27 +0200, Florian Pflug wrote:

 Assume, for example, that the postgres mailing list archive search used
 tsearch (which I think it does, but I'm not sure). It'd then probably make
 sense to add postgres to the list of stopwords, because it's bound to 
 appear in nearly every mail. But wouldn't you want searched which include
 'postgres*' to turn up empty? Quite certainly not.

That improves recall for the postgres:* query, and certainly doesn't help
other queries like post:*. But more importantly it affects precision
for all queries like a:*, an:*, and:*, s:*, 't:*', the:*, etc.
(When that is the only search it also affects recall, as no row matches
an empty tsquery.) Since stopwords are short, it means prefix search
on a few characters is meaningless. And I would argue that is when
prefix search is most important -- only when you know a few characters.


-Sushant.








Re: [HACKERS] a tsearch issue

2011-11-06 Thread Sushant Sinha
On Fri, 2011-11-04 at 11:22 +0100, Pavel Stehule wrote:
 Hello
 
 I found a interesting issue when I checked a tsearch prefix searching.
 
 We use a ispell based dictionary
 
 CREATE TEXT SEARCH DICTIONARY cspell
(template=ispell, dictfile = czech, afffile=czech, stopwords=czech);
 CREATE TEXT SEARCH CONFIGURATION cs (copy=english);
 ALTER TEXT SEARCH CONFIGURATION cs
ALTER MAPPING FOR word, asciiword WITH cspell, simple;
 
 Then I created a table
 
 postgres=# create table n(a varchar);
 CREATE TABLE
 postgres=# insert into n values('Stěhule'),('Chromečka');
 INSERT 0 2
 postgres=# select * from n;
  a
 ───
  Stěhule
  Chromečka
 (2 rows)
 
 and I tested a prefix searching:
 
 I found a following issue
 
 postgres=# select * from n where to_tsvector('cs', a) @@
 to_tsquery('cs','Stě:*') ;
  a
 ───
 (0 rows)

Most likely you are hit by this problem.
http://archives.postgresql.org/pgsql-hackers/2011-10/msg01347.php

'Stě' may be a stopword in czech.

 I expected one row. The problem is in transformation of word 'Stě'
 
 postgres=# select * from ts_debug('cs','Stě:*') ;
 ─[ RECORD 1 ]┬───────────────────
 alias        │ word
 description  │ Word, all letters
 token        │ Stě
 dictionaries │ {cspell,simple}
 dictionary   │ cspell
 lexemes      │ {sto}
 ─[ RECORD 2 ]┼───────────────────
 alias        │ blank
 description  │ Space symbols
 token        │ :*
 dictionaries │ {}
 dictionary   │ [null]
 lexemes      │ [null]
 

':*' is only specific to to_tsquery. ts_debug just invokes the parser.
So this is not correct.

-Sushant.




Re: [HACKERS] lexemes in prefix search going through dictionary modifications

2011-11-08 Thread Sushant Sinha
I think there is a need to provide a prefix search that bypasses
dictionaries. If you folks think that there is some credibility to such a
need, then I can think about implementing it. How about an operator like
:# that does this? The :* operator will continue to mean the same as it
does currently.

-Sushant.

On Tue, 2011-10-25 at 23:45 +0530, Sushant Sinha wrote:
 On Tue, 2011-10-25 at 19:27 +0200, Florian Pflug wrote:
 
  Assume, for example, that the postgres mailing list archive search used
  tsearch (which I think it does, but I'm not sure). It'd then probably make
  sense to add postgres to the list of stopwords, because it's bound to 
  appear in nearly every mail. But wouldn't you want searched which include
  'postgres*' to turn up empty? Quite certainly not.
 
 That improves recall for postgres:* query and certainly doesn't help
 other queries like post:*. But more importantly it affects precision
 for all queries like a:*, an:*, and:*, s:*, 't:*', the:*, etc
 (When that is the only search it also affects recall as no row matches
 an empty tsquery). Since stopwords are smaller, it means prefix search
 for a few characters is meaningless. And I would argue that is when the
 prefix search is more important -- only when you know a few characters.
 
 
 -Sushant





[HACKERS] Postgres 9.1: Adding rows to table causing too much latency in other queries

2011-12-19 Thread Sushant Sinha
I recently upgraded my postgres server from 9.0 to 9.1.2 and I am
seeing a peculiar problem. I have a program that periodically adds rows
to a table using INSERT. Typically the number of rows is just 1-2
thousand when the table already has 500K rows. Whenever the program is
adding rows, the performance of the search query on the same table is
very bad. The query uses the gin index and the tsearch ranking function
ts_rank_cd. 


This never happened earlier with postgres 9.0. Is there a known issue
with Postgres 9.1? Or how should I report this problem?

-Sushant.




Re: [HACKERS] Postgres 9.1: Adding rows to table causing too much latency in other queries

2011-12-19 Thread Sushant Sinha
On Mon, 2011-12-19 at 19:08 +0200, Marti Raudsepp wrote:
 Another thought -- have you read about the GIN fast updates feature?
 This existed in 9.0 too. Instead of updating the index directly, GIN
 appends all changes to a sequential list, which needs to be scanned in
 whole for read queries. The periodic autovacuum process has to merge
 these values back into the index.
 
 Maybe the solution is to tune autovacuum to run more often on the
 table.
 
 http://www.postgresql.org/docs/9.1/static/gin-implementation.html
 
 Regards,
 Marti 

Probably this is the problem. Is running vacuum analyze under psql the
same as autovacuum?
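
For reference, a sketch of the tuning suggested above (table/index names
are hypothetical; fastupdate is the GIN storage parameter behind the
pending-list behavior):

-- make autovacuum visit this table more often
ALTER TABLE docs SET (autovacuum_vacuum_scale_factor = 0.01);
-- or bypass the pending list entirely for this index
ALTER INDEX docs_tsv_idx SET (fastupdate = off);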

-Sushant.




Re: [HACKERS] Postgres 9.1: Adding rows to table causing too much latency in other queries

2011-12-19 Thread Sushant Sinha
On Mon, 2011-12-19 at 12:41 -0300, Euler Taveira de Oliveira wrote:
 On 19-12-2011 12:30, Sushant Sinha wrote:
  I recently upgraded my postgres server from 9.0 to 9.1.2 and I am
  finding a peculiar problem.I have a program that periodically adds
 rows
  to this table using INSERT. Typically the number of rows is just 1-2
  thousand when the table already has 500K rows. Whenever the program
 is
  adding rows, the performance of the search query on the same table
 is
  very bad. The query uses the gin index and the tsearch ranking
 function
  ts_rank_cd. 
  
 How bad is bad? It seems you are suffering from don't-fit-on-cache
 problem, no? 

The memory is 32GB and the entire database is just 22GB. Even vmstat 1
does not show any disk activity. 

I was not able to isolate the performance numbers since I have observed
this only on the production box, where the number of requests keeps
increasing as the box gets loaded. But a query that normally takes 1 sec
is taking more than 10 secs (not sure whether it got the same number of
CPU cycles). Is there a way to find that out?

-Sushant.






Re: [HACKERS] tsearch Parser Hacking

2011-02-14 Thread Sushant Sinha
I agree that it will be a good idea to rewrite the entire thing. However, in
the meantime, I sent a proposal earlier:

http://archives.postgresql.org/pgsql-hackers/2010-08/msg00019.php

And a patch later:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg00476.php

Tom asked me to look into Compound Word support but I found it not usable.
Here was my response:
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00419.php

I have not got any response since then.

-Sushant.


On Tue, Feb 15, 2011 at 9:33 AM, David E. Wheeler da...@kineticode.comwrote:

 On Feb 14, 2011, at 3:57 PM, Tom Lane wrote:

  There is zero, none, nada, provision for modifying the behavior of the
  default parser, other than by changing its compiled-in state transition
  tables.
 
  It doesn't help any that said tables are baroquely designed and utterly
  undocumented.
 
  IMO, sooner or later we need to trash that code and replace it with
  something a bit more modification-friendly.

 I was afraid you'd say that. Thanks.

 David
