You did not give your queries. I assume that you are searching against the
'entryID' and updating the tag list. 

MySQL has a "fulltext" index.  I assume this is a KWIC index, but I do not
know.  A "fulltext" index on "entryID" should be very fast, since
single-record results are what Lucene does best.

Lance

-----Original Message-----
From: Ryan McKinley [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, October 09, 2007 12:14 PM
To: solr-user@lucene.apache.org
Subject: solr tuple/tag store

Hello-

I am running into some scaling performance problems with SQL that I hope a
clever solr solution could fix.  I've already gone through a bunch of loops,
so I figure I should solicit advice before continuing to chase my tail.

I have a bunch of things (100K-500K+) that are defined by a set of user
tags.  ryan says: (name=xxx, location=yyy, foo=[aaa,bbb,ccc]), and alison
says: (name=zzz, location=bbb) - this list is constantly updating; it is fed
by automated crawlers and user-generated content.  The 'names' can be
arbitrary, but 99% of them will fall into ~25 distinct names.

My approach has been to build a repository of all the 'tags' and then as
things come into that repository, I merge all the tags for that entry into a
single 'flat' document and index it with solr.
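The merge step can be sketched in plain Java (the class, method, and sample
data here are illustrative, not from the thread): each source contributes
name/value pairs, and the flat document is the per-name union across
sources.

```java
import java.util.*;

public class TagMerger {
    // Merge per-source tag maps into one flat, multi-valued map,
    // e.g. {ryan: {name=[xxx]}, alison: {name=[zzz]}} -> {name=[xxx, zzz]}.
    public static Map<String, List<String>> merge(
            Map<String, Map<String, List<String>>> bySource) {
        Map<String, List<String>> flat = new TreeMap<>();
        for (Map<String, List<String>> tags : bySource.values()) {
            for (Map.Entry<String, List<String>> e : tags.entrySet()) {
                flat.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                    .addAll(e.getValue());
            }
        }
        return flat;
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<String>>> bySource = new LinkedHashMap<>();
        bySource.put("ryan", Map.of("name", List.of("xxx"),
                                    "location", List.of("yyy"),
                                    "foo", List.of("aaa", "bbb", "ccc")));
        bySource.put("alison", Map.of("name", List.of("zzz"),
                                      "location", List.of("bbb")));
        // -> {foo=[aaa, bbb, ccc], location=[yyy, bbb], name=[xxx, zzz]}
        System.out.println(merge(bySource));
    }
}
```

The resulting map is what would then be flattened into a single solr
document for indexing.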

When my thing+tag count was small, a simple SQL table with a row for each
tag worked great:

CREATE TABLE `my_tags` (
  entryID varchar(40) NOT NULL,
  source varchar(40) NOT NULL,
  name varchar(40) NOT NULL,
  value TEXT NOT NULL,
  KEY (entryID),
  KEY (source)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

but as the row count gets big (2M+) this gets to be unusable.  To make it
tractable, I am now splitting the tags across a bunch of tables and pushing
the per-user name/value pairs into a single text field (stored as JSON):

CREATE TABLE `my_tags_000` (
  entryID varchar(40) NOT NULL,
  source varchar(40) NOT NULL,
  tags LONGTEXT NOT NULL,
  PRIMARY KEY( entryID, source )
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Then I pick which table an entry goes into using:
  Math.abs( id.hashCode() )%10
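One caveat with that expression: Math.abs(Integer.MIN_VALUE) is still
negative, so an id whose hashCode() is Integer.MIN_VALUE yields a negative
table index.  A sketch of a safe variant (the class name and the table
count of 10 are assumptions based on the post):

```java
public class ShardPicker {
    static final int NUM_TABLES = 10;

    // Math.floorMod keeps the result in [0, NUM_TABLES), even for
    // Integer.MIN_VALUE, where Math.abs() returns a negative value.
    static int tableFor(String id) {
        return Math.floorMod(id.hashCode(), NUM_TABLES);
    }

    public static void main(String[] args) {
        System.out.println("my_tags_00" + tableFor("some-entry-id"));
    }
}
```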

This works OK, but it is still slower than I would like.  DB access is
slow, and it also needs to search across the constantly updating solr
index, and that gets slow since it keeps reopening the searcher
(autowarming is off!)

Soooo...  I see a few paths and would love external feedback before banging
my head on this longer.

1. Get help from someone who knows more SQL than I do and try to make a
pure SQL approach work.  This would need to work with 10M+ tags.  Solr
indexing is then a direct SQL -> solr dump.
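For what it's worth, one pure-SQL variant to benchmark under option 1 (a
sketch of mine, not something proposed in the thread): add a composite key
on the original single table so "fetch all tags for an entry" becomes one
index range scan rather than per-row lookups through the single-column
key.  Table and column names reuse the `my_tags` schema above.

```sql
-- Sketch only: composite key covering the entryID/source/name lookup
-- pattern (value is TEXT and stays in the row).
ALTER TABLE `my_tags`
  ADD KEY entry_source_name (entryID, source, name);
```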

2. Figure out how to keep the base Tuple store in solr.  I think this will
require finishing up SOLR-139.  This would keep the core data in solr
itself - so there would be no good way to 'rebuild' the index.

3. something else?  store input on disk?


Any thoughts / pointers / nay-saying would be really helpful!

thanks
ryan