Re: [GENERAL] Tsearch vector not stored by update/set

2005-03-22 Thread Justin L. Kennedy
On Mon, 21 Mar 2005, Andrew J. Kopciuch wrote:
 If you look at your config mappings for the default_english you will notice
 that you have 16 records, as opposed to 19 records like every other
 configuration mapping.  From some more in depth observations, I noticed you
 are missing entries for the 'lword', 'lhword' and ''lpart_hword'.  That means
 that tokens found to be of types 'Latin Words', 'Latin Hyphenated Words' and
 'Latin Part Hyphenated Words' are just dropped because you do not have a
 configuration mapping set up for them.

You were correct.  It was these missing three entries that made the
difference.  Thanks for identifying them for me.

The triggers work properly in both server now.

-- 
Justin Kennedy
Systems Analyst I


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [GENERAL] Tsearch vector not stored by update/set

2005-03-21 Thread Andrew J. Kopciuch
 It seems to be selective of only numbers, words with numbers in them,
 words with '.' or '/' characters.  It completely ignores any other words
 or text in any of the 3 fields.


This is a very big hint to your problem.

 You requested the pg_ts_* tables:
 On the Linux-redhat, pg7.3.2

 pg_ts_cfgmap(73 rows)
 ts_name  tok_alias dict_name
 default lword {en_stem}
 default nlword {simple}
 default word {simple}
 default email {simple}
 default url {simple}
 default host {simple}
 default sfloat {simple}
 default version {simple}
 default part_hword {simple}
 default nlpart_hword {simple}
 default lpart_hword {en_stem}
 default hword {simple}
 default lhword {en_stem}
 default nlhword {simple}
 default uri {simple}
 default file {simple}
 default float {simple}
 default int {simple}
 default uint {simple}
 default_russian lword  {en_stem}
 default_russian nlword {ru_stem}
 default_russian word {ru_stem}
 default_russian email {simple}
 default_russian url {simple}
 default_russian host {simple}
 default_russian sfloat {simple}
 default_russian version {simple}
 default_russian part_hword {simple}
 default_russian nlpart_hword {ru_stem}
 default_russian lpart_hword {en_stem}
 default_russian hword {ru_stem}
 default_russian lhword {en_stem}
 default_russian nlhword {ru_stem}
 default_russian uri {simple}
 default_russian file {simple}
 default_russian float {simple}
 default_russian int {simple}
 default_russian uint {simple}
 simple lword {simple}
 simple nlword {simple}
 simple word {simple}
 simple email {simple}
 simple url {simple}
 simple host {simple}
 simple sfloat {simple}
 simple version {simple}
 simple part_hword {simple}
 simple nlpart_hword {simple}
 simple lpart_hword {simple}
 simple hword {simple}
 simple lhword {simple}
 simple nlhword {simple}
 simple uri {simple}
 simple file {simple}
 simple float {simple}
 simple int {simple}
 simple uint {simple}
 default_english url {simple}
 default_english host {simple}
 default_english sfloat {simple}
 default_english uri {simple}
 default_english int {simple}
 default_english float {simple}
 default_english email {simple}
 default_english word {simple}
 default_english hword {simple}
 default_english nlword {simple}
 default_english nlpart_hword {simple}
 default_english part_hword {simple}
 default_english nlhword {simple}
 default_english file {simple}
 default_english uint {simple}
 default_english version {simple}


I am assuming that your cluster is running created with en_US for the locale, 
and that you have set the matching tsearch2 configuration to be your default 
(Or curcfg for each process running).

If you look at your config mappings for the default_english you will notice 
that you have 16 records, as opposed to 19 records like every other 
configuration mapping.  From some more in depth observations, I noticed you 
are missing entries for the 'lword', 'lhword' and ''lpart_hword'.  That means 
that tokens found to be of types 'Latin Words', 'Latin Hyphenated Words' and 
'Latin Part Hyphenated Words' are just dropped because you do not have a 
configuration mapping set up for them.

This is why only numbers (or other lexem types) would show (They are returned 
as lexem_types : int, uint, float, url, etc. for which you have mappings).  
Most regular words are simply discarded due to missing entries.  If you fix 
your configurations the triggers should work properly.

Your examples worked before, simply because you specified the 'default' 
configuration on the insert statement.  Which is not the same as the 
'default_english' configuration which is used by the trigger based on your 
server encoding (en_US).

 I have made a single change to it from its default installation.  When I
 was working with the rank_cd() function on the 8.0.0 machine, it had
 errors due to a non-existant english stop file, so I changed
 pg_ts_dict.dict_initoption = '' where dict_name = 'en_stem'.  The indexing
 system was working fine both before and after the change to the pg_ts_dict
 table.  I also propagated the change to the 7.3.2 machine even though it
 didn't have the error message (the stop file didn't exist on that computer
 either, but it never gave an error message about it).

I would not recommend this.  The stop file should is most likely on the system 
somewhere.  It will change depending on your installation.  Look for 
english.stop on the computer(s).  If it is not there, you can grab the one 
out of the source distribution and put it wherever you want.  Then just 
update the settings to the location you used.


good luck,


Andy

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [GENERAL] Tsearch vector not stored by update/set

2005-03-21 Thread Justin L. Kennedy
On Sun, 20 Mar 2005, Andrew J. Kopciuch wrote:

 On Thursday 17 March 2005 17:55, you wrote:
  The short question is why does this:
 
  select to_tsvector('default', coalesce(name, '') ||' '||
  coalesce(description, '') ||' '|| coalesce(keywords,'')) from link_items;
 
  give different results than this:
 
  update link_items set linksfti=to_tsvector('default', coalesce(name, '')
 
  ||' '|| coalesce(description, '') ||' '|| coalesce(keywords,''));
 
  select linksfti from link_items;
 


 It shouldn't.  I just tested with some of your data on my machine, and the
 results were fine.  (PG 7.4.6).  I can not see why they would be any
 different for 7.3.2.

 Your data looks rather strange. (re '60':1 '000':2).  Is that really all
 that was inserted? Or have you just left some out for your email?

About 95% of the 1900 insertions ended up with empty strings (not NULLs),
the other 5% looked like that above.  Either just numbers, or occasionally
words (defined by consecutive non-whitespace characters separated by
whitespace) that had numbers or symbols in them.  Like: U.S. Senate was
transformed in such a way that Senate was dropped completely and U.S.
became lowercased u.s..  Another example was a URL that happened to be
in the description column of one was captured, but the rest of the text
was not.  Another had a name of World T.E.A.M. Sports and all that was
stored in the vector was t.e.a.m.

It seems to be selective of only numbers, words with numbers in them,
words with '.' or '/' characters.  It completely ignores any other words
or text in any of the 3 fields.

 I could see this being a configuration issue possibly.  What do your pg_ts
 tables look like?  Have you made modifications there?

This morning, I decided to remove the following trigger from the
link_items table:
CREATE TRIGGER updateprodtbl
  BEFORE INSERT OR UPDATE
  ON link_items
  FOR EACH ROW
  EXECUTE PROCEDURE tsearch2('linksfti', 'name', 'description',
'keywords');

Now the UPDATE command I listed above works, so apparently there is
something about this trigger that is blocking the search vector from being
stored.  This trigger was copied and pasted (with only changes to the
column names) from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html
in the INDEXING FIELDS IN A TABLE section.

This does fix the immediate problem of getting the search results for the
live website we are running, but now we have the problem of keeping it up
to date.  I could run a nightly cronjob to update all the rows, but it
seems inefficient, and I would really like to know why the trigger would
keep it from working on 7.3, but not 8.0.

You requested the pg_ts_* tables:
On the Linux-redhat, pg7.3.2

pg_ts_cfg: (4 rows)
oid ts_name prs_namelocale
106407  default   default   C
988004  default_english   default   en_US
106408  default_russian   default   ru_RU.KOI8-R
106409  simpledefault   NULL

pg_ts_dict: (5 rows)
oid dict_name   dict_initoption dict_init   dict_lexize
106356  en_stem   snb_en_init snb_lexize
106361  ispell_template NULLspell_init  spell_lexize
106358  ru_stem /usr/local/pgsql/share/contrib/russion.stop
snb_ru_init snb_lexize
106353  simple  NULLdex_initdex_lexize
106364  synonym NULLsyn_initsyn_lexize

pg_ts_parser: (1 row)
oid prs_nameprs_start   prs_nexttoken   prs_end
prs_headlineprs_lextype
106389  default   prsd_start  prsd_getlexeme  prsd_end
prsd_headline   prsd_lextype

pg_ts_cfgmap(73 rows)
ts_name tok_alias   dict_name
default   lword {en_stem}
default   nlword{simple}
default   word  {simple}
default   email {simple}
default   url   {simple}
default   host  {simple}
default   sfloat{simple}
default   version   {simple}
default   part_hword{simple}
default   nlpart_hword  {simple}
default   lpart_hword   {en_stem}
default   hword {simple}
default   lhword{en_stem}
default   nlhword   {simple}
default   uri   {simple}
default   file  {simple}
default   float {simple}
default   int   {simple}
default   uint  {simple}
default_russian   lword {en_stem}
default_russian   nlword{ru_stem}
default_russian   word  {ru_stem}
default_russian   email {simple}
default_russian   url   {simple}
default_russian   host  {simple}
default_russian   sfloat{simple}
default_russian   version   {simple}
default_russian   part_hword{simple}
default_russian   nlpart_hword  {ru_stem}
default_russian   lpart_hword   {en_stem}
default_russian   hword {ru_stem}
default_russian   lhword{en_stem}
default_russian   nlhword   {ru_stem}
default_russian   uri   {simple}
default_russian  

[GENERAL] Tsearch vector not stored by update/set

2005-03-20 Thread Justin L. Kennedy
The short question is why does this:

select to_tsvector('default', coalesce(name, '') ||' '||
coalesce(description, '') ||' '|| coalesce(keywords,'')) from link_items;

give different results than this:

update link_items set linksfti=to_tsvector('default', coalesce(name, '')
||' '|| coalesce(description, '') ||' '|| coalesce(keywords,''));
select linksfti from link_items;


Here are more details:

I am working with Tsearch2 on a server with version string:
PostgreSQL 7.3.2 on i686-pc-linux-gnu, compiled by GCC 2.96

I have a table with the following schema:
CREATE TABLE link_items
(
  link_id int4,
  name varchar(255),
  url varchar(255),
  description text,
  spanish int4,
  spanishurl varchar(255),
  lastmod date,
  visible int4,
  state varchar(25),
  promisepractice int4,
  keywords text,
  linksfti tsvector
)
WITH OIDS;
ALTER TABLE link_items OWNER TO gate;

I want linksfti to hold the search engine's indexing data (indexed on
'name', 'description', and 'keywords'), so I run the following command:

update link_items set linksfti=to_tsvector('default', coalesce(name, '')
||' '|| coalesce(description, '') ||' '|| coalesce(keywords,''));

The results are pretty empty.  Most have empty strings for data, other
only index one or two items in the 3 input columns.

For example, after running, my table looks like:
name;description;keywords;linksfti
American Occupational Therapy Association (AOTA) ;Nationally recognized
professional association for over 60,000 occupational therapists and
occupational therapy assistants. ;Rehabilitation Professional
Associations and Councils;'60':1 '000':2
American Physical Therapy Association (APTA);Represents more than
70,000 physical therapists, physical therapist assistants, and students of
physical therapy. ;Rehabilitation Professional Associations and
Councils;'70':1 '000':2
U.S. Deaf Ski  Snowboard Association;Winter sports for people who are
deaf  relevant links.;Recreation Winter Sports;'u.s':1
Texas Adaptive Aquatics;Adaptive water skiing program for people with
physical and/or mental disabilities. ;Recreation Water
Sports;'and/or':1
World T.E.A.M. Sports;Inclusive sports activities.;Recreation Team
Sports;'t.e.a.m':1
Tennessee;Official State Web Site;Legal State Agencies;
Project Vote Smart;By entering zip code, users get list of all their
elected officials. Links to elected officials' and candidates' web sites,
etc. ;Government / Public Policy General;
TRIPOD Captioned Films;Captioned Films for people who are deaf or hard
of hearing.;Recreation Captioned Movies;


When don't do it as an UPDATE and just print the contents to the screen, I
get the full expected output:

select name, description, keywords, to_tsvector('default', coalesce(name,
'') ||' '|| coalesce(description, '') ||' '|| coalesce(keywords,'')) from
link_items;

United States of America Deaf Track and Field;Promotes training of
track athletes who are deaf and coaches who are deaf and hearing.
;Recreation Track;'of':3,11 'and':7,17,22 'are':15,20 'who':14,19
'deaf':5,16,21 'hear':23 'unit':1 'coach':18 'field':8 'state':2
'track':6,12,25 'train':10 'athlet':13 'promot':9 'america':4
'recreat':24
Adventure Pursuit, Inc.;Adventure Pursuit is a group of volunteers who
like spending time with all kinds of people and focus on adventure sports
like kayaking.;Recreation Water Sports;'a':7 'is':6 'of':9,18 'on':22
'all':16 'and':20 'inc':3 'who':11 'kind':17 'like':12,25 'time':14
'with':15 'focus':21 'group':8 'kayak':26 'peopl':19 'spend':13
'sport':24,29 'water':28 'volunt':10 'pursuit':2,5 'recreat':27
'adventur':1,4,23


Using pgAdminIII, I copied (default backup/restore) the database from our
production server and put in on my personal desktop (Windows 2000, PgSQL
8.0.0) and re-ran the update query and it gave proper results.

Is it a known issue with 7.3.2, and is there a workaround without
upgrading the server to 8.0.0?  We will upgrade in a few months, but we
can't take the server offline now because we have too many websites that
depend on it.



-- 
Justin Kennedy

HEADLINE: GAY GUY NOT AROUSED BY UGLY, BITCHY GIRLS
  -ram


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]