Hi Steve I've been thinking about this a bit today... it might be possible to default to bigint, although that has performance issues I'm guessing (granted, I doubt they're big ones). Maybe indexes need some flag to default all integers to big ints...
Consider it on my list of things to do. If you'd like to create a ticket for this on GitHub, that'd be fantastic. Cheers -- Pat On 05/01/2010, at 12:00 PM, Steve H wrote: > Pat -- > > For you and anyone else who may run into this issue: > We found that while the indexer successfully indexed all our models, > the sphinx_internal_id was wrong for all models with large ids > (anything 11 digits or more). This, of course, caused results brought > back to be not found in the database. We were wondering why the > "document" number returned by the search binary couldn't be used > instead of sphinx_internal_id... but that aside, we found that > sphinx_attr_bigint was introduced with 0.9.9, so we will be (upgrading > and) modifying our generated config file appropriately. > > Is there a way to specify that the sphinx_internal_id should be a > sphinx_attr_bigint in the sphinx.yml file? > > Thanks, > Steve > > On Dec 14 2009, 5:30 pm, Steve H <[email protected]> wrote: >> Thanks Pat!! That certainly did the trick. ~13.5 million businesses >> with their associated city and categories indexed in about 15 >> minutes. Of course the indexing was being done on the same box that >> the mysqld was being run on, but still -- pretty dang fast. >> >> Thanks again! Now for the fun stuff: actual implementation, >> deployment concerns, etc. >> >> -Steve >> >> On Dec 14, 4:53 pm, Pat Allan <[email protected]> wrote: >> >>> Yeah, that's my thought too - was wondering if it was MySQL at fault, or >>> Sphinx. >> >>> Sphinx 0.9.9 (and maybe 0.9.8) has a --enable-id64 flag when you configure >>> - I think it's worth recompiling and trying indexing again. >> >>> -- >>> Pat >> >>> On 15/12/2009, at 11:40 AM, Steve H wrote: >> >>>> I found one thing very strange... no matter what the sql_range_step is >>>> set to (default - 10,000,000), the last query always has an upper >>>> limit of 2365744349. My very astute manager noticed that this is kind >>>> of in the neighborhood of the largest possible value for an int. >>>> Maybe that has something to do with it? >> >>>> -Steve >> >>>> On Dec 14, 4:35 pm, Steve H <[email protected]> wrote: >>>>> mysql> SELECT IFNULL(MIN(`id`), 1), IFNULL(MAX(`id`), 1) >>>>> -> FROM `businesses`; >>>>> +----------------------+----------------------+ >>>>> | IFNULL(MIN(`id`), 1) | IFNULL(MAX(`id`), 1) | >>>>> +----------------------+----------------------+ >>>>> | 1 | 15250646237 | >>>>> +----------------------+----------------------+ >> >>>>> On Dec 14, 4:32 pm, Pat Allan <[email protected]> wrote: >> >>>>>> Hmm, if you run the sql_query_range query yourself, what does it return? >> >>>>>> -- >>>>>> Pat >> >>>>>> On 15/12/2009, at 11:24 AM, Steve H wrote: >> >>>>>>> Hi Pat, >> >>>>>>> Here is the sql_query: >> >>>>>>> sql_query = SELECT SQL_NO_CACHE `businesses`.`id` * 1 + 0 AS `id` , >>>>>>> CAST(`businesses`.`name` AS CHAR) AS `name`, CAST(GROUP_CONCAT >>>>>>> (DISTINCT IFNULL(`cs_categories`.`name`, '0') SEPARATOR ' ') AS CHAR) >>>>>>> AS `category_names`, CAST(CONCAT_WS(' ', `city`.`city`, >>>>>>> `city`.`state_code`) AS CHAR) AS `city`, `businesses`.`id` AS >>>>>>> `sphinx_internal_id`, 4054125678 AS `class_crc`, '4054125678' AS >>>>>>> `subclass_crcs`, 0 AS `sphinx_deleted`, IFNULL(`businesses`.`name`, >>>>>>> '') AS `name_sort`, GROUP_CONCAT(DISTINCT IFNULL(IFNULL >>>>>>> (`cs_categories`.`name`, ''), '0') SEPARATOR ' ') AS >>>>>>> `category_names_sort`, CONCAT_WS(' ', IFNULL(`city`.`city`, ''), IFNULL >>>>>>> (`city`.`state_code`, '')) AS `city_sort`, `businesses`.`lat` AS >>>>>>> `lat`, `businesses`.`lon` AS `lon` FROM `businesses` LEFT OUTER >>>>>>> JOIN `categorizations` ON (`businesses`.`id` = >>>>>>> `categorizations`.`business_id`) LEFT OUTER JOIN `cs_categories` ON >>>>>>> (`cs_categories`.`id` = `categorizations`.`cs_category_id`) LEFT >>>>>>> OUTER JOIN `city` ON `city`.city_id = `businesses`.city_id WHERE >>>>>>> `businesses`.`id` >= $start AND `businesses`.`id` <= $end GROUP BY >>>>>>> `businesses`.`id` ORDER BY NULL >>>>>>> sql_query_range = SELECT IFNULL(MIN(`id`), 1), IFNULL(MAX(`id`), 1) >>>>>>> FROM `businesses` >> >>>>>>> We turned on logging for every query in mysqld to troubleshoot this. >>>>>>> This is the last query run: >> >>>>>>> SELECT SQL_NO_CACHE `businesses`.`id` * 1 + 0 AS `id` , CAST >>>>>>> (`businesses`.`name` AS CHAR) AS `name`, CAST(GROUP >>>>>>> _CONCAT(DISTINCT IFNULL(`cs_categories`.`name`, '0') SEPARATOR ' ') AS >>>>>>> CHAR) AS `category_names`, CAST(CONCAT_WS(' ', `city`.`city`, >>>>>>> `city`.`state_code`) AS CHAR) AS `city`, `businesses`.`id` AS >>>>>>> `sphinx_internal_id`, 4054125678 AS `class_crc`, '4054125678' AS >>>>>>> `subclass_crcs`, 0 AS `sphinx_deleted`, IFNULL(`businesses`.`name`, >>>>>>> '') AS `name_sort`, GROUP_CONCAT(DISTINCT IFNULL(IFNULL >>>>>>> (`cs_categories`.`name`, ''), '0') SEPARATOR ' ') AS >>>>>>> `category_names_sort`, CONCAT_WS(' ', IFNULL(`city`.`city`, ''), IFNULL >>>>>>> (`city`.`state_code`, '')) AS `city_sort`, `businesses`.`lat` AS >>>>>>> `lat`, `businesses`.`lon` AS `lon` FROM `businesses` LEFT OUTER >>>>>>> JOIN `categorizations` ON (`businesses`.`id` = >>>>>>> `categorizations`.`business_id`) LEFT OUTER JOIN `cs_categories` ON >>>>>>> (`cs_categories`.`id` = `categorizations`.`cs_category_id`) LEFT >>>>>>> OUTER JOIN `city` ON `city`.city_id = `businesses`.city_id WHERE >>>>>>> `businesses`.`id` >= 2365000001 AND `businesses`.`id` <= 2365744349 >>>>>>> GROUP BY `businesses`.`id` ORDER BY NULL >> >>>>>>> Which returns zero rows. In fact, the last 15+ queries return zero >>>>>>> rows. The businesses table has no records with ids >>>>>>> 2034010806...2500480805. And of course, the maximum id is >>>>>>> 15250646237. We need *all* the rows indexed, of course... :) >> >>>>>>> So should we be increasing the step_size to like 10 million? >> >>>>>>> -Steve >> >>>>>>> On Dec 14, 2:58 pm, Pat Allan <[email protected]> wrote: >>>>>>>> Hi Steve >> >>>>>>>> Thinking Sphinx uses left outer joins (because that's what >>>>>>>> ActiveRecord uses, and it's the best fit in this situation), and yes, >>>>>>>> one doc is one row. >> >>>>>>>> There's nothing odd in your index definition - so let's have a look at >>>>>>>> the sql_query value of the source business_core_0, in >>>>>>>> config/development.sphinx.conf (or production - they should be the >>>>>>>> same). You should be able to run it manually, see if it returns the >>>>>>>> right results (perhaps over a particular step, to keep things >>>>>>>> reasonably fast and manageable). >> >>>>>>>> -- >>>>>>>> Pat >> >>>>>>>> On 15/12/2009, at 3:26 AM, Steve H wrote: >> >>>>>>>>> Thanks for the lightning-fast reply Pat. :) >> >>>>>>>>> define_index do >>>>>>>>> indexes :name, :sortable => true >>>>>>>>> indexes categories_and_tags(:name), :as >>>>>>>>> => :category_names, :sortable => true >>>>>>>>> indexes [city.city, city.state_code], :as => :city, :sortable => >>>>>>>>> true >>>>>>>>> set_property :group_concat_max_len => 8192 >>>>>>>>> # attributes >>>>>>>>> has lat, lon >>>>>>>>> end >> >>>>>>>>> Every business has at least one category or tag, but I'm kind of >>>>>>>>> assuming here that the indexer is doing a full left join (ie, not an >>>>>>>>> inner join). >> >>>>>>>>> Could something in this block be causing the problem? And is 1 doc = >>>>>>>>> 1 row in the db? >> >>>>>>>>> -Steve >> >>>>>>>>> On Dec 13, 7:03 pm, Pat Allan <[email protected]> wrote: >>>>>>>>>> Hi Steve >> >>>>>>>>>> What does your define_index block for Business look like? >> >>>>>>>>>> -- >>>>>>>>>> Pat >> >>>>>>>>>> On 14/12/2009, at 1:44 PM, Steve H wrote: >> >>>>>>>>>>> OK first off -- I'm not 100% certain that 1 doc = 1 record. But if >>>>>>>>>>> it >>>>>>>>>>> DOES then, sphinx definitely didn't collect enough. We have over 13 >>>>>>>>>>> million rows in our database, but this was the output after >>>>>>>>>>> indexing: >> >>>>>>>>>>> ----------------------------------------------- >>>>>>>>>>> indexing index 'business_core'... >>>>>>>>>>> collected 1199 docs, 0.7 MB >>>>>>>>>>> collected 0 attr values >>>>>>>>>>> sorted 0.0 Mvalues, 100.0% done >>>>>>>>>>> sorted 0.1 Mhits, 100.0% done >>>>>>>>>>> total 1199 docs, 685923 bytes >>>>>>>>>>> total 1381.345 sec, 496.56 bytes/sec, 0.87 docs/sec >>>>>>>>>>> distributed index 'business' can not be directly indexed; skipping. >>>>>>>>>>> Started successfully (pid 32011). >>>>>>>>>>> ----------------------------------------------- >> >>>>>>>>>>> After reading some of the docs, I thought... maybe I have to >>>>>>>>>>> increase >>>>>>>>>>> the sql_range_step in the sphinx.yml or increase the >>>>>>>>>>> group_concat_max_len property. So I did those things, but the >>>>>>>>>>> output >>>>>>>>>>> after indexing is the same (albeit much much faster :) >> >>>>>>>>>>> ----------------------------------------------- >>>>>>>>>>> Stopped search daemon (pid 2810). >>>>>>>>>>> Generating Configuration to /home/steve/insiderpages/config/ >>>>>>>>>>> sphinx_dev.sphinx.conf >>>>>>>>>>> Sphinx 0.9.8.1-release (r1533) >>>>>>>>>>> Copyright (c) 2001-2008, Andrew Aksyonoff >> >>>>>>>>>>> using config file '/home/steve/insiderpages/config/ >>>>>>>>>>> sphinx_dev.sphinx.conf'... >>>>>>>>>>> indexing index 'business_core'... >>>>>>>>>>> collected 1199 docs, 0.1 MB >>>>>>>>>>> collected 0 attr values >>>>>>>>>>> sorted 0.0 Mvalues, 100.0% done >>>>>>>>>>> sorted 0.0 Mhits, 100.0% done >>>>>>>>>>> total 1199 docs, 79605 bytes >>>>>>>>>>> total 0.447 sec, 177901.38 bytes/sec, 2679.53 docs/sec >>>>>>>>>>> distributed index 'business' can not be directly indexed; skipping. >>>>>>>>>>> Started successfully (pid 6365). >>>>>>>>>>> ----------------------------------------------- >> >>>>>>>>>>> What's the problem? >> >>>>>>>>>>> -Steve >> >>>>>>>>>>> PS: >>>>>>>>>>> mysql> select count(id) from businesses; >>>>>>>>>>> +-----------+ >>>>>>>>>>> | count(id) | >>>>>>>>>>> +-----------+ >>>>>>>>>>> | 13395683 | >>>>>>>>>>> +-----------+ >>>>>>>>>>> 1 row in set (0.00 sec) >> >>>>>>>>>>> mysql> select max(id) from businesses; >>>>>>>>>>> +-------------+ >>>>>>>>>>> | max(id) | >>>>>>>>>>> +-------------+ >>>>>>>>>>> | 15250646126 | >>>>>>>>>>> +-------------+ >>>>>>>>>>> 1 row in set (0.00 sec) >> >>>>>>>>>>> -- >> >>>>>>>>>>> You received this message because you are subscribed to the Google >>>>>>>>>>> Groups "Thinking Sphinx" group. >>>>>>>>>>> To post to this group, send email to >>>>>>>>>>> [email protected]. >>>>>>>>>>> To unsubscribe from this group, send email to >>>>>>>>>>> [email protected]. >>>>>>>>>>> For more options, visit this group >>>>>>>>>>> athttp://groups.google.com/group/thinking-sphinx?hl=en. >> >>>>>>>>> -- >> >>>>>>>>> You received this message because you are subscribed to the Google >>>>>>>>> Groups "Thinking Sphinx" group. >>>>>>>>> To post to this group, send email to [email protected]. >>>>>>>>> To unsubscribe from this group, send email to >>>>>>>>> [email protected]. >>>>>>>>> For more >> >> ... >> >> read more ยป > > -- > > You received this message because you are subscribed to the Google Groups > "Thinking Sphinx" group. > To post to this group, send email to [email protected]. > To unsubscribe from this group, send email to > [email protected]. > For more options, visit this group at > http://groups.google.com/group/thinking-sphinx?hl=en. > > -- You received this message because you are subscribed to the Google Groups "Thinking Sphinx" group. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/thinking-sphinx?hl=en.
