Re: [HACKERS] [PATCH] tsearch parser inefficiency if text includes urls or emails - new version

2009-11-08 Thread Andres Freund
On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
 While playing around with/evaluating tsearch I noticed that to_tsvector is
 obscenely slow for some files. After some profiling I found that this is
  due to using a separate TSParser in p_ishost/p_isURLPath in wparser_def.c. If
  a multibyte encoding is in use, TParserInit copies the whole remaining
  input and converts it to wchar_t or pg_wchar - for every email or
  protocol-prefixed url in the document. Which obviously is bad.
 
 I solved the issue by having a separate TParserCopyInit/TParserCopyClose
  which reuses the already converted strings of the original TParser -
  only at different offsets.
 
 Another approach would be to get rid of the separate parser invocations -
 requiring a bunch of additional states. This seemed more complex to me, so
  I wanted to get some feedback first.
 
 Without patch:
 andres=# SELECT to_tsvector('english', document) FROM document WHERE
  filename = '/usr/share/doc/libdrm-nouveau1/changelog';
 
  (tsvector output truncated)
  (1 row)
 
 Time: 5835.676 ms
 
 With patch:
 andres=# SELECT to_tsvector('english', document) FROM document WHERE
  filename = '/usr/share/doc/libdrm-nouveau1/changelog';
 
  (tsvector output truncated)
  (1 row)
 
 Time: 395.341 ms
 
 I'll clean up the patch if it seems like a sensible solution...
As nobody commented, here is a corrected (stupid thinko) and cleaned-up
version. Does anyone care to comment on whether I am the only one who thinks
this is an issue?

Andres
From cbdeb0bb636f3b7619d0a3019854809ea5565dac Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Sun, 8 Nov 2009 16:30:42 +0100
Subject: [PATCH] Fix TSearch inefficiency because of repeated copying of strings

---
 src/backend/tsearch/wparser_def.c |   63 ++--
 1 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 301c1eb..7bbd826 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -328,6 +328,41 @@ TParserInit(char *str, int len)
 	return prs;
 }
 
+/*
+ * As an alternative to a full TParserInit one can create a
+ * TParserCopy, which basically is a normal TParser without a private
+ * copy of the string - instead it uses the one from another TParser.
+ * This is useful because in some places TParsers are created
+ * recursively and the repeated copying around of the strings can
+ * cause major inefficiency.
+ * Obviously one may not close the original TParser before the copy.
+ */
+static TParser *
+TParserCopyInit(const TParser *orig)
+{
+	TParser    *prs = (TParser *) palloc0(sizeof(TParser));
+
+	prs->charmaxlen = orig->charmaxlen;
+	prs->usewide = orig->usewide;
+	prs->lenstr = orig->lenstr - orig->state->posbyte;
+
+	prs->str = orig->str + orig->state->posbyte;
+	if (orig->pgwstr)
+		prs->pgwstr = orig->pgwstr + orig->state->poschar;
+	if (orig->wstr)
+		prs->wstr = orig->wstr + orig->state->poschar;
+
+	prs->state = newTParserPosition(NULL);
+	prs->state->state = TPS_Base;
+
+#ifdef WPARSER_TRACE
+	fprintf(stderr, "parsing copy \"%.*s\"\n", prs->lenstr, prs->str);
+#endif
+
+	return prs;
+}
+
+
 static void
 TParserClose(TParser *prs)
 {
@@ -350,6 +385,26 @@ TParserClose(TParser *prs)
 }
 
 /*
+ * See TParserCopyInit
+ */
+static void
+TParserCopyClose(TParser *prs)
+{
+	while (prs->state)
+	{
+		TParserPosition *ptr = prs->state->prev;
+
+		pfree(prs->state);
+		prs->state = ptr;
+	}
+#ifdef WPARSER_TRACE
+	fprintf(stderr, "closing parser copy\n");
+#endif
+	pfree(prs);
+}
+
+
+/*
  * Character-type support functions, equivalent to is* macros, but
  * working with any possible encodings and locales. Notes:
  *  - with multibyte encoding and C-locale isw* function may fail
@@ -617,7 +672,7 @@ p_isignore(TParser *prs)
 static int
 p_ishost(TParser *prs)
 {
-	TParser    *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte);
+	TParser *tmpprs = TParserCopyInit(prs);
 	int			res = 0;
 
 	tmpprs-wanthost = true;
@@ -631,7 +686,7 @@ p_ishost(TParser *prs)
 		prs->state->charlen = tmpprs->state->charlen;
 		res = 1;
 	}
-	TParserClose(tmpprs);
+	TParserCopyClose(tmpprs);
 
 	return res;
 }
@@ -639,7 +694,7 @@ p_ishost(TParser *prs)
 static int
 p_isURLPath(TParser *prs)
 {
-	TParser    *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte);
+	TParser *tmpprs = TParserCopyInit(prs);
 	int			res = 0;
 
 	tmpprs-state = newTParserPosition(tmpprs-state);
@@ -654,7 +709,7 @@ p_isURLPath(TParser *prs)
 		prs->state->charlen = tmpprs->state->charlen;
 		res = 1;
 	}
-	TParserClose(tmpprs);
+	TParserCopyClose(tmpprs);
 
 	return res;
 }
-- 
1.6.5.12.gd65df24


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] tsearch parser inefficiency if text includes urls or emails - new version

2009-11-08 Thread Kenneth Marshall
On Sun, Nov 08, 2009 at 05:00:53PM +0100, Andres Freund wrote:
 [...]
 As nobody commented, here is a corrected (stupid thinko) and cleaned-up
 version. Does anyone care to comment on whether I am the only one who
 thinks this is an issue?
 
 Andres

+1

As a user of tsearch, I can certainly appreciate the speed-up in parsing --
more CPU for everyone else.

Regards,
Ken



Re: [HACKERS] [PATCH] tsearch parser inefficiency if text includes urls or emails - new version

2009-11-08 Thread Andres Freund
On Sunday 08 November 2009 17:41:15 Kenneth Marshall wrote:
 On Sun, Nov 08, 2009 at 05:00:53PM +0100, Andres Freund wrote:
  As nobody commented, here is a corrected (stupid thinko) and cleaned-up
  version. Does anyone care to comment on whether I am the only one who
  thinks this is an issue?
  Andres
 +1
 As a user of tsearch, I can certainly appreciate to speed-up in parsing --
 more CPU for everyone else.
Please note that this is mostly an issue for rather long documents containing
either email addresses (i.e. an @) or links with protocol prefixes
(like ?+://) - so it might not give you personally a benefit :-(

Andres
