[HACKERS] tsvector term positions and character offset

2011-11-24 Thread Yoann Moreau
Hello, I'm working on text data, actually some tsvectors of the text. 
The tsvector provides terms and positions for each term, I would need to 
map these positions to the character offsets of the terms in the 
original text.


'This is an example text for example'
tsvector -> 'an':3 'exampl':4,7 'for':6 'is':2 'text':5 'this':1
What I need would be for the first term 'This' : 0, or the term 
'example' : 11,28.


I've searched for anything able to do that without success (also asked 
on general pg list).
As the offset positions seem to be not stored or used at any time in the 
fulltext functions, the only way I figured out would be to parse the 
text again counting terms AND characters read. I coded this function as 
a very very dirty external C function, with many tsearch code copied 
because it can't be used outside of the source file.


My questions
1) Is there any other way to achieve what I need ?
2) Could my need be part of future more general functionality of the 
tsearch module ?
If not, any suggestion about the way to code it as clean and robust as 
possible ?


Regards,
Yoann Moreau

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Term positions in GIN fulltext index

2011-11-04 Thread Yoann Moreau

On 04/11/11 12:15, Florian Pflug wrote:

AFAICS, the internal storage layout of tsvector should allow you to extract an
individual lexem's positions quite efficiently (with time complexity log(N) 
where
N is the number of lexems in the tsvector). Doing so will require you to 
implement
your function in C though - any solution that works from a tsvector's textual
representation will obviously have time complexity N.

best regards,
Florian Pflug

I'll do a pl/pgsql function first, I need to test it with other parts of 
the project. But I will look for more efficient algorithms for a C 
function as soon as possible if we still decide to use the postgresql 
fulltext engine.


Regards,
Yoann Moreau

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Term positions in GIN fulltext index

2011-11-04 Thread Yoann Moreau

On 03/11/11 19:19, Florian Pflug wrote:

There's a difference between values of type tsvector, and what GIN indices
on columns or expressions of type tsvector store.


I was wondering what was the point about storing the tsvector in the 
table, I now understand. I then should use the GIN index to rank my 
documents, and work on the stored tsvectors for positions.



As I pointed out above, you'll first need to make sure to store the result of
to_tsvector in a columns. Then, what you need seems to be a functions that
takes a tsvector value and returns the contained lexems as individual rows.

Postgres doesn't seem to contain such a function currently (don't believe that,
though - go and recheck the documentation. I don't know all thousands of 
built-in
functions by heart). But it's easy to add one. You could either use PL/pgSQL
to parse the tsvector's textual representation, or write a C function. If you
go the PL/pgSQL route, regexp_split_to_table() might come in handy.


This seems easier to program than what I was thinking about, I'm going 
to do that. But I'm wondering about size of database with the GIN index 
plus the tsvector column, and performance about parsing the whole 
tsvectors for each document I need positions from (as I need them for a 
very few terms).


Maybe some external fulltext engine managing lexemes and positions would 
be more efficient for my purpose. I'll try some different things and let 
you know the results.


Thanks all for your help
Regards,
Yoann Moreau


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Term positions in GIN fulltext index

2011-11-03 Thread Yoann Moreau

Hello,
I'm using a GIN index for a text column on a big table. I use it to rank
the rows, but I also need to get the term positions for each document of a
subset of documents for one or more terms. I suppose these positions are stored
in the index as the to_tsvector shows them : 'lexeme':{positions}

I've searched and asked on general postgresql mailing list, and I assume
there is no simple way to get these term positions.

For example, for 2 rows of a 'docs' table with a text column 'text' (indexed 
with GIN) :
'I get lexemes and I get term positions.'
'Did you get the positions ?'

I'd need a function like this :
select term_positions(text, 'get') from docs;
 id_doc | positions
+---
  1 | {2,6}
  2 |   {3}

I'd like to add this function in my database, for experimental purpose.
I got a look at the source code but didn't find some code example using the GIN 
index ;
I can not figure out where the GIN index is read as a tsvector
or where the '@@' operator gets the matching tsvectors for the terms of the 
tsquery.

Any help about where to start reading would be very welcome :)

Regards,
Yoann Moreau


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers