Hi Steve

I've been thinking about this a bit today... it might be possible to default to 
bigint, although I'm guessing that has performance implications (granted, I doubt 
they're significant ones). Maybe indexes need a flag to default all integers to 
bigints...

Consider it added to my list of things to do. If you'd like to create a ticket 
for this on GitHub, that'd be fantastic.
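In the meantime, for anyone hitting this today: as Steve notes below, the generated config file can be edited by hand to widen the attribute. A minimal sketch of what the relevant source section might look like under Sphinx 0.9.9 (section and attribute names taken from the thread; bear in mind Thinking Sphinx regenerates this file whenever the configuration is rebuilt, so a hand edit won't survive the next regeneration):

```
source business_core_0
{
  # ... generated sql_query, sql_query_range, etc. ...

  # declare the primary-key attribute as 64-bit
  # instead of the default 32-bit unsigned int
  sql_attr_bigint = sphinx_internal_id
}
```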

Cheers

-- 
Pat

On 05/01/2010, at 12:00 PM, Steve H wrote:

> Pat --
> 
> For you and anyone else who may run into this issue:
> We found that while the indexer successfully indexed all our models,
> the sphinx_internal_id was wrong for all models with large ids
> (anything 11 digits or more).  This, of course, meant the results
> brought back could not be found in the database.  We were wondering why the
> "document" number returned by the search binary couldn't be used
> instead of sphinx_internal_id...  but that aside, we found that
> sql_attr_bigint was introduced with 0.9.9, so we will be (upgrading
> and) modifying our generated config file appropriately.
> 
> Is there a way to specify that the sphinx_internal_id should be a
> sql_attr_bigint in the sphinx.yml file?
> 
> Thanks,
> Steve
> 
> On Dec 14 2009, 5:30 pm, Steve H <[email protected]> wrote:
>> Thanks Pat!! That certainly did the trick.  ~13.5 million businesses
>> with their associated city and categories indexed in about 15
>> minutes.  Of course the indexing was being done on the same box that
>> the mysqld was being run on, but still -- pretty dang fast.
>> 
>> Thanks again!  Now for the fun stuff: actual implementation,
>> deployment concerns, etc.
>> 
>> -Steve
>> 
>> On Dec 14, 4:53 pm, Pat Allan <[email protected]> wrote:
>> 
>>> Yeah, that's my thought too - was wondering if it was MySQL at fault, or 
>>> Sphinx.
>> 
>>> Sphinx 0.9.9 (and maybe 0.9.8) has a --enable-id64 flag when you configure 
>>> - I think it's worth recompiling and trying indexing again.
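[The recompile Pat describes is a configure-time switch. A sketch of the build steps, with paths and install prefix assumed; treat it as a recipe, not verbatim commands for your system:]

```shell
# Rebuild Sphinx with 64-bit document ids (from the extracted source tree)
./configure --enable-id64
make
sudo make install
```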
>> 
>>> --
>>> Pat
>> 
>>> On 15/12/2009, at 11:40 AM, Steve H wrote:
>> 
>>>> I found one thing very strange... no matter what the sql_range_step is
>>>> set to (default - 10,000,000), the last query always has an upper
>>>> limit of 2365744349.  My very astute manager noticed that this is kind
>>>> of in the neighborhood of the largest possible value for an int.
>>>> Maybe that has something to do with it?
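[Editor's note: that hunch checks out arithmetically. 2365744349 is exactly what 15250646237, the table's MAX(id) quoted below, becomes when truncated to an unsigned 32-bit integer. A quick sketch in plain Ruby, nothing Sphinx-specific:]

```ruby
# Squeezing a 64-bit id through an unsigned 32-bit value drops the high
# bits -- i.e. the id is reduced modulo 2**32.
max_id    = 15_250_646_237  # MAX(id) in the businesses table
truncated = max_id % (2**32)

puts truncated  # => 2365744349, the mysterious upper limit on the range query
```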
>> 
>>>> -Steve
>> 
>>>> On Dec 14, 4:35 pm, Steve H <[email protected]> wrote:
>>>>> mysql> SELECT IFNULL(MIN(`id`), 1), IFNULL(MAX(`id`), 1)
>>>>>     -> FROM `businesses`;
>>>>> +----------------------+----------------------+
>>>>> | IFNULL(MIN(`id`), 1) | IFNULL(MAX(`id`), 1) |
>>>>> +----------------------+----------------------+
>>>>> |                    1 |          15250646237 |
>>>>> +----------------------+----------------------+
>> 
>>>>> On Dec 14, 4:32 pm, Pat Allan <[email protected]> wrote:
>> 
>>>>>> Hmm, if you run the sql_query_range query yourself, what does it return?
>> 
>>>>>> --
>>>>>> Pat
>> 
>>>>>> On 15/12/2009, at 11:24 AM, Steve H wrote:
>> 
>>>>>>> Hi Pat,
>> 
>>>>>>> Here is the sql_query:
>> 
>>>>>>> sql_query = SELECT SQL_NO_CACHE `businesses`.`id` * 1 + 0 AS `id` ,
>>>>>>> CAST(`businesses`.`name` AS CHAR) AS `name`, CAST(GROUP_CONCAT
>>>>>>> (DISTINCT IFNULL(`cs_categories`.`name`, '0') SEPARATOR ' ') AS CHAR)
>>>>>>> AS `category_names`, CAST(CONCAT_WS(' ', `city`.`city`,
>>>>>>> `city`.`state_code`) AS CHAR) AS `city`, `businesses`.`id` AS
>>>>>>> `sphinx_internal_id`, 4054125678 AS `class_crc`, '4054125678' AS
>>>>>>> `subclass_crcs`, 0 AS `sphinx_deleted`, IFNULL(`businesses`.`name`,
>>>>>>> '') AS `name_sort`, GROUP_CONCAT(DISTINCT IFNULL(IFNULL
>>>>>>> (`cs_categories`.`name`, ''), '0') SEPARATOR ' ') AS
>>>>>>> `category_names_sort`, CONCAT_WS(' ', IFNULL(`city`.`city`, ''), IFNULL
>>>>>>> (`city`.`state_code`, '')) AS `city_sort`, `businesses`.`lat` AS
>>>>>>> `lat`, `businesses`.`lon` AS `lon` FROM `businesses`    LEFT OUTER
>>>>>>> JOIN `categorizations` ON (`businesses`.`id` =
>>>>>>> `categorizations`.`business_id`)  LEFT OUTER JOIN `cs_categories` ON
>>>>>>> (`cs_categories`.`id` = `categorizations`.`cs_category_id`)   LEFT
>>>>>>> OUTER JOIN `city` ON `city`.city_id = `businesses`.city_id  WHERE
>>>>>>> `businesses`.`id` >= $start AND `businesses`.`id` <= $end GROUP BY
>>>>>>> `businesses`.`id`  ORDER BY NULL
>>>>>>>  sql_query_range = SELECT IFNULL(MIN(`id`), 1), IFNULL(MAX(`id`), 1)
>>>>>>> FROM `businesses`
>> 
>>>>>>> We turned on logging for every query in mysqld to troubleshoot this.
>>>>>>> This is the last query run:
>> 
>>>>>>> SELECT SQL_NO_CACHE `businesses`.`id` * 1 + 0 AS `id` , CAST
>>>>>>> (`businesses`.`name` AS CHAR) AS `name`, CAST(GROUP
>>>>>>> _CONCAT(DISTINCT IFNULL(`cs_categories`.`name`, '0') SEPARATOR ' ') AS
>>>>>>> CHAR) AS `category_names`, CAST(CONCAT_WS(' ', `city`.`city`,
>>>>>>> `city`.`state_code`) AS CHAR) AS `city`, `businesses`.`id` AS
>>>>>>> `sphinx_internal_id`, 4054125678 AS `class_crc`, '4054125678' AS
>>>>>>> `subclass_crcs`, 0 AS `sphinx_deleted`, IFNULL(`businesses`.`name`,
>>>>>>> '') AS `name_sort`, GROUP_CONCAT(DISTINCT IFNULL(IFNULL
>>>>>>> (`cs_categories`.`name`, ''), '0') SEPARATOR ' ') AS
>>>>>>> `category_names_sort`, CONCAT_WS(' ', IFNULL(`city`.`city`, ''), IFNULL
>>>>>>> (`city`.`state_code`, '')) AS `city_sort`, `businesses`.`lat` AS
>>>>>>> `lat`, `businesses`.`lon` AS `lon` FROM `businesses`    LEFT OUTER
>>>>>>> JOIN `categorizations` ON (`businesses`.`id` =
>>>>>>> `categorizations`.`business_id`)  LEFT OUTER JOIN `cs_categories` ON
>>>>>>> (`cs_categories`.`id` = `categorizations`.`cs_category_id`)   LEFT
>>>>>>> OUTER JOIN `city` ON `city`.city_id = `businesses`.city_id  WHERE
>>>>>>> `businesses`.`id` >= 2365000001 AND `businesses`.`id` <= 2365744349
>>>>>>> GROUP BY `businesses`.`id`  ORDER BY NULL
>> 
>>>>>>> Which returns zero rows.  In fact, the last 15+ queries return zero
>>>>>>> rows.  The businesses table has no records with ids
>>>>>>> 2034010806...2500480805.  And of course, the maximum id is
>>>>>>> 15250646237.  We need *all* the rows indexed, of course...  :)
>> 
>>>>>>> So should we be increasing the step_size to like 10 million?
>> 
>>>>>>> -Steve
>> 
>>>>>>> On Dec 14, 2:58 pm, Pat Allan <[email protected]> wrote:
>>>>>>>> Hi Steve
>> 
>>>>>>>> Thinking Sphinx uses left outer joins (because that's what 
>>>>>>>> ActiveRecord uses, and it's the best fit in this situation), and yes, 
>>>>>>>> one doc is one row.
>> 
>>>>>>>> There's nothing odd in your index definition - so let's have a look at 
>>>>>>>> the sql_query value of the source business_core_0, in 
>>>>>>>> config/development.sphinx.conf (or production - they should be the 
>>>>>>>> same). You should be able to run it manually, see if it returns the 
>>>>>>>> right results (perhaps over a particular step, to keep things 
>>>>>>>> reasonably fast and manageable).
>> 
>>>>>>>> --
>>>>>>>> Pat
>> 
>>>>>>>> On 15/12/2009, at 3:26 AM, Steve H wrote:
>> 
>>>>>>>>> Thanks for the lightning-fast reply Pat.  :)
>> 
>>>>>>>>>  define_index do
>>>>>>>>>    indexes :name, :sortable => true
>>>>>>>>>    indexes categories_and_tags(:name), :as => :category_names, :sortable => true
>>>>>>>>>    indexes [city.city, city.state_code], :as => :city, :sortable => true
>>>>>>>>>    set_property :group_concat_max_len => 8192
>>>>>>>>>    # attributes
>>>>>>>>>    has lat, lon
>>>>>>>>>  end
>> 
>>>>>>>>> Every business has at least one category or tag, but I'm kind of
>>>>>>>>> assuming here that the indexer is doing a full left join (i.e., not an
>>>>>>>>> inner join).
>> 
>>>>>>>>> Could something in this block be causing the problem?  And is 1 doc =
>>>>>>>>> 1 row in the db?
>> 
>>>>>>>>> -Steve
>> 
>>>>>>>>> On Dec 13, 7:03 pm, Pat Allan <[email protected]> wrote:
>>>>>>>>>> Hi Steve
>> 
>>>>>>>>>> What does your define_index block for Business look like?
>> 
>>>>>>>>>> --
>>>>>>>>>> Pat
>> 
>>>>>>>>>> On 14/12/2009, at 1:44 PM, Steve H wrote:
>> 
>>>>>>>>>>> OK first off -- I'm not 100% certain that 1 doc = 1 record.  But if it
>>>>>>>>>>> DOES, then Sphinx definitely didn't collect enough.  We have over 13
>>>>>>>>>>> million rows in our database, but this was the output after 
>>>>>>>>>>> indexing:
>> 
>>>>>>>>>>> -----------------------------------------------
>>>>>>>>>>> indexing index 'business_core'...
>>>>>>>>>>> collected 1199 docs, 0.7 MB
>>>>>>>>>>> collected 0 attr values
>>>>>>>>>>> sorted 0.0 Mvalues, 100.0% done
>>>>>>>>>>> sorted 0.1 Mhits, 100.0% done
>>>>>>>>>>> total 1199 docs, 685923 bytes
>>>>>>>>>>> total 1381.345 sec, 496.56 bytes/sec, 0.87 docs/sec
>>>>>>>>>>> distributed index 'business' can not be directly indexed; skipping.
>>>>>>>>>>> Started successfully (pid 32011).
>>>>>>>>>>> -----------------------------------------------
>> 
>>>>>>>>>>> After reading some of the docs, I thought... maybe I have to 
>>>>>>>>>>> increase
>>>>>>>>>>> the sql_range_step in the sphinx.yml or increase the
>>>>>>>>>>> group_concat_max_len property.  So I did those things, but the 
>>>>>>>>>>> output
>>>>>>>>>>> after indexing is the same (albeit much much faster  :)
>> 
>>>>>>>>>>> -----------------------------------------------
>>>>>>>>>>> Stopped search daemon (pid 2810).
>>>>>>>>>>> Generating Configuration to /home/steve/insiderpages/config/
>>>>>>>>>>> sphinx_dev.sphinx.conf
>>>>>>>>>>> Sphinx 0.9.8.1-release (r1533)
>>>>>>>>>>> Copyright (c) 2001-2008, Andrew Aksyonoff
>> 
>>>>>>>>>>> using config file '/home/steve/insiderpages/config/
>>>>>>>>>>> sphinx_dev.sphinx.conf'...
>>>>>>>>>>> indexing index 'business_core'...
>>>>>>>>>>> collected 1199 docs, 0.1 MB
>>>>>>>>>>> collected 0 attr values
>>>>>>>>>>> sorted 0.0 Mvalues, 100.0% done
>>>>>>>>>>> sorted 0.0 Mhits, 100.0% done
>>>>>>>>>>> total 1199 docs, 79605 bytes
>>>>>>>>>>> total 0.447 sec, 177901.38 bytes/sec, 2679.53 docs/sec
>>>>>>>>>>> distributed index 'business' can not be directly indexed; skipping.
>>>>>>>>>>> Started successfully (pid 6365).
>>>>>>>>>>> -----------------------------------------------
>> 
>>>>>>>>>>> What's the problem?
>> 
>>>>>>>>>>> -Steve
>> 
>>>>>>>>>>> PS:
>>>>>>>>>>> mysql> select count(id) from businesses;
>>>>>>>>>>> +-----------+
>>>>>>>>>>> | count(id) |
>>>>>>>>>>> +-----------+
>>>>>>>>>>> |  13395683 |
>>>>>>>>>>> +-----------+
>>>>>>>>>>> 1 row in set (0.00 sec)
>> 
>>>>>>>>>>> mysql> select max(id) from businesses;
>>>>>>>>>>> +-------------+
>>>>>>>>>>> | max(id)     |
>>>>>>>>>>> +-------------+
>>>>>>>>>>> | 15250646126 |
>>>>>>>>>>> +-------------+
>>>>>>>>>>> 1 row in set (0.00 sec)
>> 
>>>>>>>>>>> --
>> 
>>>>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>>>>> Groups "Thinking Sphinx" group.
>>>>>>>>>>> To post to this group, send email to 
>>>>>>>>>>> [email protected].
>>>>>>>>>>> To unsubscribe from this group, send email to 
>>>>>>>>>>> [email protected].
>>>>>>>>>>> For more options, visit this group 
>>>>>>>>>>> at http://groups.google.com/group/thinking-sphinx?hl=en.
>> 
>> 
>> ...
>> 
> 
> 
> 
