[ts] Distributed indexing for large datasets and multi-core boxes

mixonic Mon, 23 Nov 2009 03:41:56 -0800

Hi Pat, friendly folk,

I've got a 375,000-row geo data set indexed by lat and lon.  Searching
works fine, but queries usually take about 650ms on our production
hardware.  I found a huge speed boost from making the main index a
distributed index that fans out over a split set of sources and
indexes.  So instead of:


index datapoint
{
  type = distributed
  local = datapoint_core
}

I have:

index datapoint
{
  type = distributed
  local = datapoint_core_0
  local = datapoint_core_1
  local = datapoint_core_2
  local = datapoint_core_3
}
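
For anyone who wants to generate that block rather than type it, here is a minimal Python sketch. The function name `distributed_index` and the `<name>_core_<n>` naming are illustrative assumptions taken from the example above; this is not riddle's or thinking_sphinx's API.

```python
def distributed_index(name: str, shards: int) -> str:
    """Render a Sphinx 'distributed' index that points at local shard indexes.

    Hypothetical helper: shard indexes are assumed to be named
    <name>_core_0 .. <name>_core_<shards-1>, as in the post.
    """
    if shards < 1:
        raise ValueError("shards must be at least 1")
    lines = [f"index {name}", "{", "  type = distributed"]
    # One 'local' entry per shard index.
    lines += [f"  local = {name}_core_{n}" for n in range(shards)]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(distributed_index("datapoint", 4))
```

The output is plain text meant to be pasted into (or appended to) the Sphinx configuration file; the shard indexes themselves still have to be defined separately.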

Each source then has a range by IDs:

source datapoint_core_0
{
  # (other source settings omitted)
  sql_query_range = SELECT IFNULL(MIN(`id`), 1), IFNULL(MAX(`id`), 1) \
    FROM `restaurant_inspection_datapoints` WHERE `id` BETWEEN 0 AND 93878
}

Where 93878 is 1/4 of the records.  Each index covers 1/4 of the total
IDs.  On my laptop alone, this gave me massive gains: queries that
could take a second took 300ms.
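
The per-shard ID ranges can be computed rather than worked out by hand. The sketch below is only an illustration: `shard_ranges` and `range_query` are hypothetical helper names, and the ID span of 1 to 375000 is an assumed example, not the real table's bounds. Because `BETWEEN` is inclusive on both ends, each range starts one past the previous range's end so that no row lands in two shards.

```python
from typing import List, Tuple


def shard_ranges(min_id: int, max_id: int, shards: int) -> List[Tuple[int, int]]:
    """Split the inclusive span [min_id, max_id] into contiguous,
    non-overlapping inclusive ranges, one per shard."""
    if shards < 1:
        raise ValueError("shards must be at least 1")
    total = max_id - min_id + 1
    base, extra = divmod(total, shards)
    ranges = []
    start = min_id
    for n in range(shards):
        # The first 'extra' shards absorb the remainder, one ID each.
        size = base + (1 if n < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_query(table: str, low: int, high: int) -> str:
    """Render a sql_query_range value in the same shape as the post's query."""
    return (
        "SELECT IFNULL(MIN(`id`), 1), IFNULL(MAX(`id`), 1) "
        f"FROM `{table}` WHERE `id` BETWEEN {low} AND {high}"
    )


if __name__ == "__main__":
    # Assumed ID span for illustration only.
    for n, (low, high) in enumerate(shard_ranges(1, 375000, 4)):
        print(f"datapoint_core_{n}:",
              range_query("restaurant_inspection_datapoints", low, high))
```

Splitting by ID span only gives even shards when IDs are dense; if the table has large gaps from deleted rows, some shards will hold noticeably fewer records than others.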

I would love to get this into riddle or thinking_sphinx, but the
riddle configuration code is really complex.  Is anyone interested in
working with me on this or giving me a starting point?

Pat, does that use of "local" make sense?  All the other examples out
there use agents.

See:

http://www.sphinxsearch.com/docs/current.html#distributed
http://www.sphinxsearch.com/bugs/view.php?id=407 <-- I didn't encounter that bug
http://blog.wasimasif.com/sphinx-distributed-searching/

Thanks all, I'm very excited to get this up!

--
Matthew Beale :: 607 227 0871
Resume & Portfolio @ http://madhatted.com

