You did not give your queries. I assume that you are searching against the 'entryID' and updating the tag list.
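For concreteness, that assumed access pattern - fetch the tags for one 'entryID', then rewrite them - might look roughly like the JDBC sketch below, run against the `my_tags` schema quoted further down. The queries and the replaceTags helper are hypothetical; the original message does not include them.

  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.SQLException;

  // Hypothetical sketch of the assumed access pattern: delete the old
  // tag rows for an entry/source pair, then insert the new values.
  // The actual queries were not posted.
  public class TagUpdate {
      static void replaceTags(Connection conn, String entryID, String source,
                              String name, String value) throws SQLException {
          try (PreparedStatement del = conn.prepareStatement(
                   "DELETE FROM my_tags WHERE entryID = ? AND source = ?");
               PreparedStatement ins = conn.prepareStatement(
                   "INSERT INTO my_tags (entryID, source, name, value)"
                   + " VALUES (?, ?, ?, ?)")) {
              del.setString(1, entryID);
              del.setString(2, source);
              del.executeUpdate();
              ins.setString(1, entryID);
              ins.setString(2, source);
              ins.setString(3, name);
              ins.setString(4, value);
              ins.executeUpdate();
          }
      }
  }

If lookups like these are the hot path, the existing KEY( entryID ) already covers the WHERE clause; the interesting question is what the slow queries actually look like.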
MySQL has a "fulltext" index. I assume this is a KWIC index but do not know. A "fulltext" index on "entryID" should be very, very fast, since single-record results are what Lucene does best.

Lance

-----Original Message-----
From: Ryan McKinley [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 09, 2007 12:14 PM
To: solr-user@lucene.apache.org
Subject: solr tuple/tag store

Hello-

I am running into some scaling performance problems with SQL that I hope a clever solr solution could fix. I've already gone through a bunch of loops, so I figure I should solicit advice before continuing to chase my tail.

I have a bunch of things (100K-500K+) that are defined by a set of user tags:

  ryan says: (name=xxx, location=yyy, foo=[aaa,bbb,ccc])
  alison says: (name=zzz, location=bbb)

This list is constantly updating; it is fed from automated crawlers and user-generated content. The 'names' can be arbitrary, but 99% of them will come from ~25 distinct names.

My approach has been to build a repository of all the 'tags'; as things come into that repository, I merge all the tags for an entry into a single 'flat' document and index it with solr.

When my thing+tag count was small, a simple SQL table with a row for each tag works great:

  CREATE TABLE `my_tags` (
    entryID varchar(40) NOT NULL,
    source varchar(40) NOT NULL,
    name varchar(40) NOT NULL,
    value TEXT NOT NULL,
    KEY( entryID ),
    KEY( source )
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

but as the row count gets big (2M+) this becomes unusable. To make it tractable, I am now splitting the tags across a bunch of tables and pushing the per-user name/value pairs into a single text field (stored as JSON):

  CREATE TABLE `my_tags_000` (
    entryID varchar(40) NOT NULL,
    source varchar(40) NOT NULL,
    tags LONGTEXT NOT NULL,
    PRIMARY KEY( entryID, source )
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Then I pick which table an entry goes into using:

  Math.abs( id.hashCode() ) % 10

This works OK, but it is still slower than I would like. DB access is slow, and it also needs to search across the updating solr index, and that gets slow since it keeps reopening the searcher (autowarming is off!).

Soooo... I see a few paths and would love external feedback before banging my head on this longer.

1. Get help from someone who knows more SQL than I do and try to make a pure SQL approach work. This would need to work with 10M+ tags. Solr indexing is then a direct SQL -> solr dump.

2. Figure out how to keep the base tuple store in solr. I think this will require finishing up SOLR-139. This would keep the core data in solr - so there is no good way to 'rebuild' the index.

3. Something else? Store input on disk?

Any thoughts / pointers / nay-saying would be really helpful!

thanks
ryan
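A side note on the Math.abs( id.hashCode() ) % 10 table selection above: in Java, Math.abs(Integer.MIN_VALUE) is still negative (the absolute value overflows back to Integer.MIN_VALUE), so for a string whose hashCode is exactly Integer.MIN_VALUE the expression yields -8 and names a table that does not exist. A minimal sign-safe sketch, assuming the ten shard tables are named my_tags_000 through my_tags_009 as in the schema above (the pickTable helper name is made up):

  // Sign-safe shard selection: masking off the sign bit avoids the
  // Math.abs(Integer.MIN_VALUE) overflow edge case.
  public class TagShards {
      private static final int NUM_SHARDS = 10;

      // Hypothetical helper: maps an entry id to its shard table name.
      static String pickTable(String entryID) {
          int bucket = (entryID.hashCode() & 0x7fffffff) % NUM_SHARDS;
          return String.format("my_tags_%03d", bucket); // my_tags_000 .. my_tags_009
      }

      public static void main(String[] args) {
          System.out.println(pickTable("example-entry"));
      }
  }

The masking trick only keeps the bucket non-negative for every possible hashCode; it does not change how evenly the entries spread across the ten tables.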