Any status updates on the following patches?
1. Fragments in tsearch2 headlines:
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00043.php
2. Bug in hlCover:
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00089.php
-Sushant.
--
Sent via pgsql-hackers mailing list
[EMAIL PROTECTED] wrote:
Sushant Sinha wrote:
Any status updates on the following patches?
1. Fragments in tsearch2 headlines:
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00043.php
2. Bug in hlCover:
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00089.php
ts_headline calls the ts_lexize equivalent to break up the text. Of course there
is an algorithm to process the tokens and generate the headline. I would be
really surprised if the algorithm to generate the headline were somehow
dependent on the language (as it only processes the tokens). So Oleg is right
when
It seems like the ordering of lexemes in tsvector has changed from 8.3
to 8.4.
For example, in 8.3.1:
postgres=# select to_tsvector('english', 'quit everytime');
      to_tsvector
-----------------------
 'quit':1 'everytim':2
The lexemes are arranged by length and then by string
to pass a TSVector to the headline
function?
-Sushant.
On Sat, 2008-05-24 at 07:57 +0400, Teodor Sigaev wrote:
[moved to -hackers, because talk is about implementation details]
I've ported the patch of Sushant Sinha for fragmented headlines to pg8.3.1
(http://archives.postgresql.org/pgsql-general/2007
I have attached a new patch with respect to the current cvs head. This
produces headline in a document for a given query. Basically it
identifies fragments of text that contain the query and displays them.
DESCRIPTION
HeadlineParsedText contains an array of actual words but no
information
I have attached a patch for phrase search with respect to the cvs head.
Basically it takes a phrase (text) and a TSVector. It checks whether the
relative positions of the lexemes in the phrase are the same as their
positions in the TSVector.
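The position check described above can be sketched as follows. This is a hypothetical simplification in Python, assuming a tsvector represented as a dict mapping lexeme to a list of positions; the actual patch operates on PostgreSQL's internal TSVector structures in C.

```python
def phrase_match(tsvector, phrase_lexemes):
    """Return True if the lexemes of the phrase occur at consecutive
    positions in the tsvector (a dict: lexeme -> list of positions).
    Hypothetical sketch of the relative-position check, not the real code."""
    if not phrase_lexemes:
        return True
    # Each occurrence of the first lexeme is a candidate anchor position.
    for start in tsvector.get(phrase_lexemes[0], []):
        # The i-th lexeme of the phrase must appear exactly i positions later.
        if all((start + i) in tsvector.get(lex, [])
               for i, lex in enumerate(phrase_lexemes[1:], start=1)):
            return True
    return False
```

For example, with 'fat':1 'cat':2 'sat':3, the phrase "fat cat" matches while "fat sat" does not.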
If the configuration for text search is simple, then this will
On Mon, 2008-06-02 at 19:39 +0400, Teodor Sigaev wrote:
I have attached a patch for phrase search with respect to the cvs head.
Basically it takes a phrase (text) and a TSVector. It checks whether the
relative positions of the lexemes in the phrase are the same as their
positions in the TSVector.
Efficiency: I realized that we do not need to store all norms. We only
need to store norms that are in the query. So I moved the addition
of norms from addHLParsedLex to hlfinditem. This should add very little
memory overhead to existing headline generation.
If this is still not acceptable
My main argument for using Cover instead of hlCover was that Cover will
be faster. I tested the default headline generation that uses hlCover
with the current patch that uses Cover. There was not much difference.
So I think you are right in that we do not need norms and we can just
use hlCover.
I
On Tue, 2008-06-03 at 22:16 +0400, Teodor Sigaev wrote:
This is far more complicated than I thought.
Of course, phrase search should be able to use indexes.
I can probably look into how to use index. Any pointers on this?
src/backend/utils/adt/tsginidx.c, if you invent operation # in
I have an attached an updated patch with following changes:
1. Respects ShortWord and MinWords
2. Uses hlCover instead of Cover
3. Does not store norm (or lexeme) for headline marking
4. Removes ts_rank.h
5. Earlier it was counting even NONWORDTOKEN in the headline. Now it
only counts the actual
I am trying to generate a patch with respect to the current CVS head. So
I rsynced the tree, then did cvs up and installed the db. However, when
I ran initdb on a data directory it got stuck:
It got stuck after printing creating template1
creating template1 database in /home/postgres/data/base/1
You are right. I did not do make clean last time. After make clean, make
all, and make install it works fine.
-Sushant.
On Thu, 2008-07-10 at 17:55 +0530, Pavan Deolasee wrote:
On Thu, Jul 10, 2008 at 5:36 PM, Sushant Sinha [EMAIL PROTECTED] wrote:
Seems like a bug to me. Is the tree
Attached a new patch that:
1. fixes previous bug
2. better handles the case when the cover size is greater than MaxWords.
Basically it divides a cover greater than MaxWords into fragments of
MaxWords, resizes each such fragment so that each end of the fragment
contains a query word, and then
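The splitting-and-resizing behavior described above can be sketched like this. It is a hypothetical Python model, assuming the cover is given as word indices plus a per-word query-match flag; the real patch works on HeadlineParsedText in C.

```python
def split_cover(start, end, max_words, is_query):
    """Split the cover [start, end] into chunks of at most max_words words,
    then shrink each chunk so both ends land on a query word.
    is_query[i] is True when word i matches the query.
    Hypothetical sketch of the fragment logic, not the real implementation."""
    fragments = []
    pos = start
    while pos <= end:
        frag_end = min(pos + max_words - 1, end)
        s, e = pos, frag_end
        # Trim non-query words from both ends of the chunk.
        while s <= e and not is_query[s]:
            s += 1
        while e >= s and not is_query[e]:
            e -= 1
        if s <= e:
            fragments.append((s, e))
        pos = frag_end + 1
    return fragments
```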
attached are two patches:
1. documentation
2. regression tests
for headline with fragments.
-Sushant.
On Tue, 2008-07-15 at 13:29 +0400, Teodor Sigaev wrote:
Attached a new patch that:
1. fixes previous bug
2. better handles the case when the cover size is greater than MaxWords.
'::tsquery, 'maxfragments=2');
 ts_headline
-------------
 ... 2 ...
and so on
Oleg
On Tue, 15 Jul 2008, Sushant Sinha wrote:
Attached a new patch that:
1. fixes previous bug
2. better handles the case when the cover size is greater than MaxWords.
Basically it divides
I think there is a slight bug in the hlCover function in wparser_def.c.
If there is only one query item and it is the first word in the text,
then hlCover does not return any cover. This is evident in this example,
where ts_headline only generates min_words:
testdb=# select ts_headline('1 2 3 4
, Sushant Sinha wrote:
I will add test queries and their results for the corner cases in a
separate file. I guess the only thing I am confused about is what the
behavior of headline generation should be when query items have words of
size less than ShortWord. I guess the answer is to ignore
I looked at query operators for tsquery and here are some of the new
query operators for position based queries. I am just proposing some
changes and the questions I have.
1. What is the meaning of such a query operator?
foo #5 bar - true if the document has the word foo followed by bar at the
5th
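One plausible reading of the proposed operator can be sketched as follows. This is a hypothetical Python model over a dict-based tsvector; the exact semantics of "#" were precisely the open question here.

```python
def followed_by_at(tsvector, a, b, k):
    """True when some occurrence of lexeme b is exactly k positions after
    an occurrence of lexeme a, i.e. one possible reading of "a #k b".
    Hypothetical sketch; the operator's real semantics were undecided."""
    positions_b = set(tsvector.get(b, []))
    return any(p + k in positions_b for p in tsvector.get(a, []))
```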
I guess it is more readable to add the cover separator at the end of a fragment
than at the front. Let me know what you think and I can update it.
I think the right place for the cover separator is in the structure
HeadlineParsedText, just like startsel and stopsel. This will enable users to
specify their
file that tests different aspects of the
new headline generation function.
Let me know if anything else is needed.
-Sushant.
On Thu, 2008-07-24 at 00:28 +0400, Oleg Bartunov wrote:
On Wed, 23 Jul 2008, Sushant Sinha wrote:
I guess it is more readable to add cover separator at the end
Has anyone noticed this?
-Sushant.
On Wed, 2008-07-16 at 23:01 -0400, Sushant Sinha wrote:
I think there is a slight bug in the hlCover function in wparser_def.c.
If there is only one query item and it is the first word in the text,
then hlCover does not return any cover. This is evident
On Mon, 2008-08-04 at 00:36 -0300, Euler Taveira de Oliveira wrote:
Sushant Sinha wrote:
I think there is a slight bug in hlCover function in wparser_def.c
The bug is not in hlCover. In prsd_headline, if we didn't find a
suitable bestlen (i.e. = 0), then it includes up to document
Currently the English parser in text search does not support multiple
words at the same position. Consider a word like wikipedia.org. The text
search would return a single token, wikipedia.org. However, if someone
searches for wikipedia org there will not be a match. There are
two problems here:
1.
On 08/01/2010 08:04 PM, Sushant Sinha wrote:
1. We do not have separate tokens wikipedia and org
2. If we have the two tokens we should have them at adjacent position so
that a phrase search for wikipedia org should work.
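The token layout being argued for above can be sketched as follows. This is a hypothetical Python illustration, assuming the whole host token shares a position with its first part and the remaining parts get adjacent positions; the real change would live in the C ts parser.

```python
def host_tokens(host, start_pos):
    """Emit (token, position) pairs for a host like 'wikipedia.org':
    the whole token plus its dot-separated parts at adjacent positions,
    so that a phrase search for the parts can match.
    Hypothetical sketch of one possible layout, not the actual parser."""
    tokens = [(host, start_pos)]  # keep the full token for exact matches
    for i, part in enumerate(host.split('.')):
        tokens.append((part, start_pos + i))  # parts at adjacent positions
    return tokens
```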
This would needlessly increase the number of tokens. Instead you'd
On Mon, 2010-08-02 at 09:32 -0400, Robert Haas wrote:
On Mon, Aug 2, 2010 at 9:12 AM, Sushant Sinha sushant...@gmail.com wrote:
The current text parser already returns url and url_path. That already
increases the number of unique tokens. I am only asking for the addition of
normal English words
complicate the patch with that, I wanted to get feedback on any
other major problem with the patch.
-Sushant.
On Mon, 2010-08-02 at 10:20 -0400, Tom Lane wrote:
Sushant Sinha sushant...@gmail.com writes:
This would needlessly increase the number of tokens. Instead you'd
better make it work like
Updating the patch to emit parttoken and register it with the
snowball config.
-Sushant.
On Fri, 2010-09-03 at 09:44 -0400, Robert Haas wrote:
On Wed, Sep 1, 2010 at 2:42 AM, Sushant Sinha sushant...@gmail.com wrote:
I have attached a patch that emits parts of a host token, a url token
For the headline generation to work properly, email/file/url/host need
to become skip tokens. Updating the patch with that change.
-Sushant.
On Sat, 2010-09-04 at 13:25 +0530, Sushant Sinha wrote:
Updating the patch to emit parttoken and register it with the
snowball config.
-Sushant. ;-)
---
Heikki Linnakangas wrote:
Sushant Sinha wrote:
Patch #2. I think this is a straightforward bug fix.
Yes, I think you're right. In hlCover(), *q is 0 when the only match is
the first item in the text, and we shouldn't bail out with return
false in that case
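The sentinel bug described above can be illustrated with a sketch. This is a hypothetical Python model of a cover search, not the C code in wparser_def.c: using position 0 as the "no match" sentinel wrongly rejects a cover whose only match is the very first word, so the sketch uses -1 instead.

```python
def find_cover(words, query):
    """Find a cover (p, q) containing all query terms, or None.
    Hypothetical sketch: q = -1 is the "not found" sentinel, so a
    match at word 0 is a legitimate cover (the bug discussed above
    was bailing out when the end position was 0)."""
    q = -1
    needed = set(query)
    for i, w in enumerate(words):
        if w in needed:
            needed.discard(w)
            q = i
        if not needed:
            break
    if needed:
        return None  # some query term never appears
    # Walk back from q to find the smallest p covering all terms.
    needed = set(query)
    p = q
    for i in range(q, -1, -1):
        if words[i] in needed:
            needed.discard(words[i])
            p = i
        if not needed:
            break
    return (p, q)
```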
I am running postgres 8.3.1. In tsrank.c I am looking at the cover
density function used for ranking while doing text search:
float4
calc_rank_cd(float4 *arrdata, TSVector txt, TSQuery query, int method)
Here is an excerpt of the code that I think may have a bug when the
document is big enough
On Thu, Jan 29, 2009 at 12:38 PM, Teodor Sigaev teo...@sigaev.ru wrote:
Is this what is desired? It seems to me that Wdoc is getting a high
ranking even when we are not sure of the position information.
0.1 is not a very high rank, and we could not suggest any reasonable rank in
this case.
I think we currently do that. We add ellipses only when we encounter a
new fragment. So there should not be ellipses if we are at the end of
the document or if it is the first fragment (which includes the beginning of
the document). Here is the code in generateHeadline, ts_parse.c, that
adds the
the fragments. I hope that you're correct and that it is implemented, and
not documented
-Original Message-
From: Sushant Sinha [mailto:sushant...@gmail.com]
Sent: Saturday, February 14, 2009 4:07 PM
To: Asher Snyder
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Ellipses around
Sorry ... I thought you were running the development branch.
-Sushant.
On Sat, 2009-02-14 at 16:34 -0500, Tom Lane wrote:
Sushant Sinha sushant...@gmail.com writes:
I think we currently do that.
... since about four months ago.
2008-10-17 14:05 teodor
* doc/src/sgml
FragmentDelimiter is an argument of the ts_headline function to separate
different headline fragments. The default delimiter is " ... ".
Currently, if someone specifies the delimiter as an option to the
function, no extra space is added around the delimiter. However, it does
not look good without spaces
Yeah, you are right. I did not know that you can pass a space using double
quotes.
-Sushant.
On Sun, 2009-03-01 at 20:49 -0500, Tom Lane wrote:
Sushant Sinha sushant...@gmail.com writes:
FragmentDelimiter is an argument of the ts_headline function to separate
different headline fragments
:57 -0400, Tom Lane wrote:
Sushant Sinha sushant...@gmail.com writes:
Sorry for the delay. Here is the patch with FragmentDelimiter option.
It requires an extra option in HeadlineParsedText and uses that option
during generateHeadline.
I did some editing of the documentation for this patch
I see this among the open items here:
http://wiki.postgresql.org/wiki/PostgreSQL_8.4_Open_Items
Any interest in fixing this?
-Sushant.
On Thu, 2009-01-29 at 13:54 -0500, Sushant Sinha wrote:
On Thu, Jan 29, 2009 at 12:38 PM, Teodor Sigaev teo...@sigaev.ru
wrote:
Is this what
Currently it seems that the dot is not considered a word delimiter
by the English parser.
lawdb=# select to_tsvector('english', 'Mr.J.Sai Deepak');
       to_tsvector
---------------------------
 'deepak':2 'mr.j.sai':1
(1 row)
So the word obtained is mr.j.sai rather than three words
Thanks,
Sushant.
On Tue, Jun 2, 2009 at 8:47 AM, Kenneth Marshall k...@rice.edu wrote:
On Mon, Jun 01, 2009 at 08:22:23PM -0500, Kevin Grittner wrote:
Sushant Sinha sushant...@gmail.com wrote:
I think that the dot should be considered as a word delimiter because
when the dot is not followed
On Tue, 2009-06-02 at 17:26 -0700, Josh Berkus wrote:
* possible bug in cover density ranking?
-- From Teodor's response, this is maybe a doc patch and not a code
patch. Teodor? Oleg?
I personally think that this is a bug, because we are assigning a very
high rank when we are not
The rank counts 1/coversize, so bigger covers will not have much impact
anyway. What is the need for the patch?
-Sushant.
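The 1/coversize weighting mentioned above can be sketched in a couple of lines. This is a hypothetical simplification: the real calc_rank_cd in tsrank.c also applies lexeme weights and normalization options.

```python
def rank_cd(covers):
    """Simplified cover-density rank: each cover (p, q) contributes
    1 / (q - p + 1), so a large cover adds very little to the rank,
    which is the point made above. Hypothetical sketch only."""
    return sum(1.0 / (q - p + 1) for p, q in covers)
```

For example, a one-word cover contributes 1.0 while a five-word cover contributes only 0.2.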
On Fri, 2012-01-27 at 18:06 +0200, karave...@mail.bg wrote:
Hello,
I have developed a variation of the cover density ranking functions that
counts only covers that are
There is a bug in ts_rank_cd. It does not give the correct rank when the
query lexeme is the first one in the tsvector.
Example:
select ts_rank_cd(to_tsvector('english', 'abc sdd'),
                  plainto_tsquery('english', 'abc'));
 ts_rank_cd
------------
          0
select
MY PREV EMAIL HAD A PROBLEM. Please reply to this one
==
There is a bug in ts_rank_cd. It does not give the correct rank when the
query lexeme is the first one in the tsvector.
Example:
select ts_rank_cd(to_tsvector('english', 'abc sdd'),
Sorry for sounding a false alarm. I was not running vanilla
postgres, and that is why I was seeing that problem. I should have checked
with the vanilla one.
-Sushant
On Tue, 2010-12-21 at 23:03 -0500, Tom Lane wrote:
Sushant Sinha sushant...@gmail.com writes:
There is a bug in ts_rank_cd
Just a reminder that this patch discusses how to break urls, emails, etc.
into their components.
On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane t...@sss.pgh.pa.us wrote:
[ sorry for not responding on this sooner, it's been hectic the last
couple weeks ]
Sushant Sinha sushant...@gmail.com writes
I do not know if this mail got lost in between or no one noticed it!
On Thu, 2010-12-23 at 11:05 +0530, Sushant Sinha wrote:
Just a reminder that this patch discusses how to break urls, emails,
etc. into their components.
On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane t...@sss.pgh.pa.us wrote:
Sushant Sinha sushant...@gmail.com writes:
Doesn't this force the headline to be taken from the first N words of
the document, independent of where the match was? That seems rather
unworkable, or at least unhelpful.
In the headline generation function, we don't have any index or knowledge
I am using pg_trgm for spelling correction as prescribed in the
documentation. But I see that it does not work for Unicode strings. The
database was initialized with UTF8 encoding and the C locale.
Here is the table:
\d words
      Table "public.words"
 Column | Type | Modifiers
I am using plpythonu on postgres 9.0.2. One of my Python functions was
throwing a TypeError exception. However, I only see the exception in the
database and not the stack trace. It becomes difficult to debug if the
stack trace is absent in Python.
logdb=# select get_words(forminput) from fi;
On Thu, 2011-07-21 at 15:31 +0200, Jan Urbański wrote:
On 21/07/11 15:27, Sushant Sinha wrote:
I am using plpythonu on postgres 9.0.2. One of my Python functions was
throwing a TypeError exception. However, I only see the exception in the
database and not the stack trace. It becomes
Given a document and a query, the goal of headline generation is to
produce text excerpts in which the query appears. Currently, headline
generation in postgres follows these steps:
1. Tokenize the document and obtain the lexemes
2. Decide on the lexemes that should be part of the
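The pipeline above can be sketched end to end as a toy model. This is a hypothetical illustration assuming whitespace tokenization and a trivial lowercasing "stemmer"; the real code uses the ts parser, dictionaries, and the cover-finding logic discussed elsewhere in this thread.

```python
def make_headline(document, query_lexemes, max_words=10, stem=str.lower):
    """Toy sketch of the steps above: tokenize, mark tokens whose
    lexeme matches the query, pick a window around the first match,
    and wrap matches in <b> tags. Hypothetical model, not ts_headline."""
    words = document.split()                      # step 1: tokenize
    match = [stem(w) in query_lexemes for w in words]  # step 2: mark lexemes
    try:
        first = match.index(True)
    except ValueError:
        first = 0                                 # no match: start of document
    start = max(0, first - max_words // 2)        # center a window on the match
    window = words[start:start + max_words]
    flags = match[start:start + max_words]
    return ' '.join('<b>%s</b>' % w if f else w
                    for w, f in zip(window, flags))
```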
Here is a simple patch that limits the number of words during the
tokenization phase and puts an upper bound on headline generation.
Doesn't this force the headline to be taken from the first N words of
the document, independent of where the match was? That seems rather
unworkable,
Actually, this code seems probably flat-out wrong: won't every
successful call of hlCover() on a given document return exactly the same
q value (end position), namely the last token occurrence in the
document? How is that helpful?
regards, tom lane
There is a line
I looked at this patch a bit. I'm fairly unhappy that it seems to be
inventing a brand new mechanism to do something the ts parser can
already do. Why didn't you code the url-part mechanism using the
existing support for compound words?
I am not familiar with the compound word implementation
Your changes are somewhat fine. They will get you tokens with _
characters in them. However, it is not nice to mix your new token with an
existing token like NUMWORD. Give a new name to your new type of
token, probably UnderscoreWord. Then, on seeing _, move to a state
that can identify the new token.
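The suggestion above, giving underscore-bearing identifiers their own token type instead of overloading NUMWORD, can be sketched as a tiny classifier. This is a hypothetical Python illustration with made-up token names; the real parser is a flex-style state machine in C.

```python
def classify_token(tok):
    """Classify a token into a (hypothetical) type name, giving tokens
    that contain '_' their own type rather than mixing them into
    NUMWORD, as suggested above. Sketch only; names are illustrative."""
    if not tok:
        return None
    if '_' in tok and all(c.isalnum() or c == '_' for c in tok):
        return 'UNDERSCORE_WORD'   # new dedicated token type
    if tok.isalpha():
        return 'WORD'
    if tok.isalnum():
        return 'NUMWORD'
    return 'BLANK'
```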
Any updates on this?
On Tue, Sep 21, 2010 at 10:47 PM, Sushant Sinha sushant...@gmail.comwrote:
I looked at this patch a bit. I'm fairly unhappy that it seems to be
inventing a brand new mechanism to do something the ts parser can
already do. Why didn't you code the url-part mechanism
On Tue, 2010-10-12 at 19:31 -0400, Tom Lane wrote:
This seems much of a piece with the existing proposal to allow
individual words of a URL to be reported separately:
https://commitfest.postgresql.org/action/patch_view?id=378
As I said in that thread, this could be done in a
I am using a GIN index on a tsvector and doing basic search. I see that the
row estimate of the planner is horribly wrong. It returns a
row estimate of 4843 for all queries, whether they match zero rows, a
medium number of rows (88,000), or a large number of rows (726,000).
The table has roughly a
On 24/10/10 14:44, Sushant Sinha wrote:
I am using a GIN index on a tsvector and doing basic search. I see that the
row estimate of the planner is horribly wrong. It returns a
row estimate of 4843 for all queries, whether they match zero rows, a
medium number
I am currently using the prefix search feature in text search. I find
that the prefix characters are treated the same as a normal lexeme and
passed through the stemming and stopword dictionaries. This seems like a bug
to me.
db=# select to_tsquery('english', 's:*');
NOTICE: text-search query
On Tue, 2011-10-25 at 18:05 +0200, Florian Pflug wrote:
On Oct25, 2011, at 17:26 , Sushant Sinha wrote:
I am currently using the prefix search feature in text search. I find
that the prefix characters are treated the same as a normal lexeme and
passed through stemming and stopword
On Tue, 2011-10-25 at 19:27 +0200, Florian Pflug wrote:
Assume, for example, that the postgres mailing list archive search used
tsearch (which I think it does, but I'm not sure). It'd then probably make
sense to add postgres to the list of stopwords, because it's bound to
appear in nearly
On Fri, 2011-11-04 at 11:22 +0100, Pavel Stehule wrote:
Hello
I found an interesting issue when I checked tsearch prefix searching.
We use an ispell-based dictionary:
CREATE TEXT SEARCH DICTIONARY cspell
(template=ispell, dictfile = czech, afffile=czech, stopwords=czech);
CREATE TEXT
On Tue, 2011-10-25 at 23:45 +0530, Sushant Sinha wrote:
On Tue, 2011-10-25 at 19:27 +0200, Florian Pflug wrote:
Assume, for example, that the postgres mailing list archive search used
tsearch (which I think it does, but I'm not sure). It'd then probably make
sense to add postgres to the list
I recently upgraded my postgres server from 9.0 to 9.1.2 and I am
seeing a peculiar problem. I have a program that periodically adds rows
to this table using INSERT. Typically the number of rows is just 1-2
thousand when the table already has 500K rows. Whenever the program is
adding rows, the
On Mon, 2011-12-19 at 19:08 +0200, Marti Raudsepp wrote:
Another thought -- have you read about the GIN fast updates feature?
This existed in 9.0 too. Instead of updating the index directly, GIN
appends all changes to a sequential list, which needs to be scanned in
whole for read queries. The
On Mon, 2011-12-19 at 12:41 -0300, Euler Taveira de Oliveira wrote:
On 19-12-2011 12:30, Sushant Sinha wrote:
I recently upgraded my postgres server from 9.0 to 9.1.2 and I am
seeing a peculiar problem. I have a program that periodically adds rows
to this table using INSERT. Typically
I agree that it would be a good idea to rewrite the entire thing. However, in
the meantime, I sent a proposal earlier:
http://archives.postgresql.org/pgsql-hackers/2010-08/msg00019.php
And a patch later:
http://archives.postgresql.org/pgsql-hackers/2010-09/msg00476.php
Tom asked me to look into