Thanks Pat!! That certainly did the trick. ~13.5 million businesses with their associated city and categories indexed in about 15 minutes. Of course the indexing was being done on the same box that the mysqld was being run on, but still -- pretty dang fast.
Thanks again! Now for the fun stuff: actual implementation, deployment concerns, etc. -Steve On Dec 14, 4:53 pm, Pat Allan <[email protected]> wrote: > Yeah, that's my thought too - was wondering if it was MySQL at fault, or > Sphinx. > > Sphinx 0.9.9 (and maybe 0.9.8) has a --enable-id64 flag when you configure - > I think it's worth recompiling and trying indexing again. > > -- > Pat > > On 15/12/2009, at 11:40 AM, Steve H wrote: > > > > > I found one thing very strange... no matter what the sql_range_step is > > set to (default - 10,000,000), the last query always has an upper > > limit of 2365744349. My very astute manager noticed that this is kind > > of in the neighborhood of the largest possible value for an int. > > Maybe that has something to do with it? > > > -Steve > > > On Dec 14, 4:35 pm, Steve H <[email protected]> wrote: > >> mysql> SELECT IFNULL(MIN(`id`), 1), IFNULL(MAX(`id`), 1) > >> -> FROM `businesses`; > >> +----------------------+----------------------+ > >> | IFNULL(MIN(`id`), 1) | IFNULL(MAX(`id`), 1) | > >> +----------------------+----------------------+ > >> | 1 | 15250646237 | > >> +----------------------+----------------------+ > > >> On Dec 14, 4:32 pm, Pat Allan <[email protected]> wrote: > > >>> Hmm, if you run the sql_query_range query yourself, what does it return? > > >>> -- > >>> Pat > > >>> On 15/12/2009, at 11:24 AM, Steve H wrote: > > >>>> Hi Pat, > > >>>> Here is the sql_query: > > >>>> sql_query = SELECT SQL_NO_CACHE `businesses`.`id` * 1 + 0 AS `id` , > >>>> CAST(`businesses`.`name` AS CHAR) AS `name`, CAST(GROUP_CONCAT > >>>> (DISTINCT IFNULL(`cs_categories`.`name`, '0') SEPARATOR ' ') AS CHAR) > >>>> AS `category_names`, CAST(CONCAT_WS(' ', `city`.`city`, > >>>> `city`.`state_code`) AS CHAR) AS `city`, `businesses`.`id` AS > >>>> `sphinx_internal_id`, 4054125678 AS `class_crc`, '4054125678' AS > >>>> `subclass_crcs`, 0 AS `sphinx_deleted`, IFNULL(`businesses`.`name`, > >>>> '') AS `name_sort`, GROUP_CONCAT(DISTINCT IFNULL(IFNULL > >>>> (`cs_categories`.`name`, ''), '0') SEPARATOR ' ') AS > >>>> `category_names_sort`, CONCAT_WS(' ', IFNULL(`city`.`city`, ''), IFNULL > >>>> (`city`.`state_code`, '')) AS `city_sort`, `businesses`.`lat` AS > >>>> `lat`, `businesses`.`lon` AS `lon` FROM `businesses` LEFT OUTER > >>>> JOIN `categorizations` ON (`businesses`.`id` = > >>>> `categorizations`.`business_id`) LEFT OUTER JOIN `cs_categories` ON > >>>> (`cs_categories`.`id` = `categorizations`.`cs_category_id`) LEFT > >>>> OUTER JOIN `city` ON `city`.city_id = `businesses`.city_id WHERE > >>>> `businesses`.`id` >= $start AND `businesses`.`id` <= $end GROUP BY > >>>> `businesses`.`id` ORDER BY NULL > >>>> sql_query_range = SELECT IFNULL(MIN(`id`), 1), IFNULL(MAX(`id`), 1) > >>>> FROM `businesses` > > >>>> We turned on logging for every query in mysqld to troubleshoot this. > >>>> This is the last query run: > > >>>> SELECT SQL_NO_CACHE `businesses`.`id` * 1 + 0 AS `id` , CAST > >>>> (`businesses`.`name` AS CHAR) AS `name`, CAST(GROUP > >>>> _CONCAT(DISTINCT IFNULL(`cs_categories`.`name`, '0') SEPARATOR ' ') AS > >>>> CHAR) AS `category_names`, CAST(CONCAT_WS(' ', `city`.`city`, > >>>> `city`.`state_code`) AS CHAR) AS `city`, `businesses`.`id` AS > >>>> `sphinx_internal_id`, 4054125678 AS `class_crc`, '4054125678' AS > >>>> `subclass_crcs`, 0 AS `sphinx_deleted`, IFNULL(`businesses`.`name`, > >>>> '') AS `name_sort`, GROUP_CONCAT(DISTINCT IFNULL(IFNULL > >>>> (`cs_categories`.`name`, ''), '0') SEPARATOR ' ') AS > >>>> `category_names_sort`, CONCAT_WS(' ', IFNULL(`city`.`city`, ''), IFNULL > >>>> (`city`.`state_code`, '')) AS `city_sort`, `businesses`.`lat` AS > >>>> `lat`, `businesses`.`lon` AS `lon` FROM `businesses` LEFT OUTER > >>>> JOIN `categorizations` ON (`businesses`.`id` = > >>>> `categorizations`.`business_id`) LEFT OUTER JOIN `cs_categories` ON > >>>> (`cs_categories`.`id` = `categorizations`.`cs_category_id`) LEFT > >>>> OUTER JOIN `city` ON `city`.city_id = `businesses`.city_id WHERE > >>>> `businesses`.`id` >= 2365000001 AND `businesses`.`id` <= 2365744349 > >>>> GROUP BY `businesses`.`id` ORDER BY NULL > > >>>> Which returns zero rows. In fact, the last 15+ queries return zero > >>>> rows. The businesses table has no records with ids > >>>> 2034010806...2500480805. And of course, the maximum id is > >>>> 15250646237. We need *all* the rows indexed, of course... :) > > >>>> So should we be increasing the step_size to like 10 million? > > >>>> -Steve > > >>>> On Dec 14, 2:58 pm, Pat Allan <[email protected]> wrote: > >>>>> Hi Steve > > >>>>> Thinking Sphinx uses left outer joins (because that's what ActiveRecord > >>>>> uses, and it's the best fit in this situation), and yes, one doc is one > >>>>> row. > > >>>>> There's nothing odd in your index definition - so let's have a look at > >>>>> the sql_query value of the source business_core_0, in > >>>>> config/development.sphinx.conf (or production - they should be the > >>>>> same). You should be able to run it manually, see if it returns the > >>>>> right results (perhaps over a particular step, to keep things > >>>>> reasonably fast and manageable). > > >>>>> -- > >>>>> Pat > > >>>>> On 15/12/2009, at 3:26 AM, Steve H wrote: > > >>>>>> Thanks for the lightning-fast reply Pat. :) > > >>>>>> define_index do > >>>>>> indexes :name, :sortable => true > >>>>>> indexes categories_and_tags(:name), :as > >>>>>> => :category_names, :sortable => true > >>>>>> indexes [city.city, city.state_code], :as => :city, :sortable => > >>>>>> true > >>>>>> set_property :group_concat_max_len => 8192 > >>>>>> # attributes > >>>>>> has lat, lon > >>>>>> end > > >>>>>> Every business has at least one category or tag, but I'm kind of > >>>>>> assuming here that the indexer is doing a full left join (ie, not an > >>>>>> inner join). > > >>>>>> Could something in this block be causing the problem? And is 1 doc = > >>>>>> 1 row in the db? > > >>>>>> -Steve > > >>>>>> On Dec 13, 7:03 pm, Pat Allan <[email protected]> wrote: > >>>>>>> Hi Steve > > >>>>>>> What does your define_index block for Business look like? > > >>>>>>> -- > >>>>>>> Pat > > >>>>>>> On 14/12/2009, at 1:44 PM, Steve H wrote: > > >>>>>>>> OK first off -- I'm not 100% certain that 1 doc = 1 record. But if > >>>>>>>> it > >>>>>>>> DOES then, sphinx definitely didn't collect enough. We have over 13 > >>>>>>>> million rows in our database, but this was the output after indexing: > > >>>>>>>> ----------------------------------------------- > >>>>>>>> indexing index 'business_core'... > >>>>>>>> collected 1199 docs, 0.7 MB > >>>>>>>> collected 0 attr values > >>>>>>>> sorted 0.0 Mvalues, 100.0% done > >>>>>>>> sorted 0.1 Mhits, 100.0% done > >>>>>>>> total 1199 docs, 685923 bytes > >>>>>>>> total 1381.345 sec, 496.56 bytes/sec, 0.87 docs/sec > >>>>>>>> distributed index 'business' can not be directly indexed; skipping. > >>>>>>>> Started successfully (pid 32011). > >>>>>>>> ----------------------------------------------- > > >>>>>>>> After reading some of the docs, I thought... maybe I have to increase > >>>>>>>> the sql_range_step in the sphinx.yml or increase the > >>>>>>>> group_concat_max_len property. So I did those things, but the output > >>>>>>>> after indexing is the same (albeit much much faster :) > > >>>>>>>> ----------------------------------------------- > >>>>>>>> Stopped search daemon (pid 2810). > >>>>>>>> Generating Configuration to /home/steve/insiderpages/config/ > >>>>>>>> sphinx_dev.sphinx.conf > >>>>>>>> Sphinx 0.9.8.1-release (r1533) > >>>>>>>> Copyright (c) 2001-2008, Andrew Aksyonoff > > >>>>>>>> using config file '/home/steve/insiderpages/config/ > >>>>>>>> sphinx_dev.sphinx.conf'... > >>>>>>>> indexing index 'business_core'... > >>>>>>>> collected 1199 docs, 0.1 MB > >>>>>>>> collected 0 attr values > >>>>>>>> sorted 0.0 Mvalues, 100.0% done > >>>>>>>> sorted 0.0 Mhits, 100.0% done > >>>>>>>> total 1199 docs, 79605 bytes > >>>>>>>> total 0.447 sec, 177901.38 bytes/sec, 2679.53 docs/sec > >>>>>>>> distributed index 'business' can not be directly indexed; skipping. > >>>>>>>> Started successfully (pid 6365). > >>>>>>>> ----------------------------------------------- > > >>>>>>>> What's the problem? > > >>>>>>>> -Steve > > >>>>>>>> PS: > >>>>>>>> mysql> select count(id) from businesses; > >>>>>>>> +-----------+ > >>>>>>>> | count(id) | > >>>>>>>> +-----------+ > >>>>>>>> | 13395683 | > >>>>>>>> +-----------+ > >>>>>>>> 1 row in set (0.00 sec) > > >>>>>>>> mysql> select max(id) from businesses; > >>>>>>>> +-------------+ > >>>>>>>> | max(id) | > >>>>>>>> +-------------+ > >>>>>>>> | 15250646126 | > >>>>>>>> +-------------+ > >>>>>>>> 1 row in set (0.00 sec) > > >>>>>>>> -- > > >>>>>>>> You received this message because you are subscribed to the Google > >>>>>>>> Groups "Thinking Sphinx" group. > >>>>>>>> To post to this group, send email to > >>>>>>>> [email protected]. > >>>>>>>> To unsubscribe from this group, send email to > >>>>>>>> [email protected]. > >>>>>>>> For more options, visit this group > >>>>>>>> athttp://groups.google.com/group/thinking-sphinx?hl=en. > > >>>>>> -- > > >>>>>> You received this message because you are subscribed to the Google > >>>>>> Groups "Thinking Sphinx" group. > >>>>>> To post to this group, send email to [email protected]. > >>>>>> To unsubscribe from this group, send email to > >>>>>> [email protected]. > >>>>>> For more options, visit this group > >>>>>> athttp://groups.google.com/group/thinking-sphinx?hl=en. > > >>>> -- > > >>>> You received this message because you are subscribed to the Google > >>>> Groups "Thinking Sphinx" group. > >>>> To post to this group, send email to [email protected]. > >>>> To unsubscribe from this group, send email to > >>>> [email protected]. > >>>> For more options, visit this group > >>>> athttp://groups.google.com/group/thinking-sphinx?hl=en. > > > -- > > > You received this message because you are subscribed to the Google Groups > > "Thinking Sphinx" group. > > To post to this group, send email to [email protected]. > > To unsubscribe from this group, send email to > > ... > > read more » -- You received this message because you are subscribed to the Google Groups "Thinking Sphinx" group. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/thinking-sphinx?hl=en.
