Consider it done.  Ticket created.  :)

Cheers!

-Steve

On Jan 5, 12:36 am, Pat Allan <[email protected]> wrote:
> Hi Steve
>
> I've been thinking about this a bit today... it might be possible to default 
> to bigint, although I'm guessing that has performance implications (granted, 
> I doubt they're significant). Maybe indexes need a flag to default all 
> integers to bigints...
>
> Consider it on my list of things to do. If you'd like to create a ticket for 
> this on GitHub, that'd be fantastic.
>
> Cheers
>
> --
> Pat
>
> On 05/01/2010, at 12:00 PM, Steve H wrote:
>
>
>
> > Pat --
>
> > For you and anyone else who may run into this issue:
> > We found that while the indexer successfully indexed all our models,
> > the sphinx_internal_id was wrong for every model with a large id
> > (anything 11 digits or more).  This, of course, meant the results
> > Sphinx returned couldn't be found in the database.  We were wondering
> > why the "document" number returned by the search binary couldn't be
> > used instead of sphinx_internal_id...  but that aside, we found that
> > sql_attr_bigint was introduced with 0.9.9, so we will be (upgrading
> > and) modifying our generated config file appropriately.
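>
> > For anyone else doing the same, the manual tweak to the generated config
> > looks something like this -- a sketch only (source name from our setup;
> > note that re-generating the config will overwrite the edit):

```
source business_core_0
{
  # ... generated settings unchanged ...

  # declare the primary key attribute as a 64-bit int (Sphinx 0.9.9+):
  sql_attr_bigint = sphinx_internal_id
}
```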
>
> > Is there a way to specify that sphinx_internal_id should be a
> > sql_attr_bigint in the sphinx.yml file?
>
> > Thanks,
> > Steve
>
> > On Dec 14 2009, 5:30 pm, Steve H <[email protected]> wrote:
> >> Thanks Pat!! That certainly did the trick.  ~13.5 million businesses
> >> with their associated city and categories indexed in about 15
> >> minutes.  Of course the indexing was being done on the same box that
> >> the mysqld was being run on, but still -- pretty dang fast.
>
> >> Thanks again!  Now for the fun stuff: actual implementation,
> >> deployment concerns, etc.
>
> >> -Steve
>
> >> On Dec 14, 4:53 pm, Pat Allan <[email protected]> wrote:
>
> >>> Yeah, that's my thought too - was wondering if it was MySQL at fault, or 
> >>> Sphinx.
>
> >>> Sphinx 0.9.9 (and maybe 0.9.8) has a --enable-id64 flag when you 
> >>> configure - I think it's worth recompiling and trying indexing again.
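>
> >>> Roughly, assuming you're building from the 0.9.9 source tarball (paths 
> >>> and versions may differ on your box):

```
tar xzf sphinx-0.9.9.tar.gz
cd sphinx-0.9.9
./configure --enable-id64   # compile indexer/searchd with 64-bit document ids
make
sudo make install
```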
>
> >>> --
> >>> Pat
>
> >>> On 15/12/2009, at 11:40 AM, Steve H wrote:
>
> >>>> I found one thing very strange... no matter what sql_range_step is
> >>>> set to (default: 10,000,000), the last query always has an upper
> >>>> limit of 2365744349.  My very astute manager noticed that this is kind
> >>>> of in the neighborhood of the largest possible value for an int.
> >>>> Maybe that has something to do with it?
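>
> >>>> For the record, the arithmetic (nothing Sphinx-specific, just the
> >>>> 32-bit ceilings next to our ids):

```ruby
# 32-bit integer ceilings next to the ids in play.
signed_32_max   = 2**31 - 1    # 2_147_483_647 -- MySQL signed INT max
unsigned_32_max = 2**32 - 1    # 4_294_967_295 -- unsigned 32-bit max

observed_ceiling = 2_365_744_349   # the upper bound the last query kept using
largest_id       = 15_250_646_237  # MAX(id) in businesses

puts observed_ceiling > signed_32_max   # => true: already past signed INT range
puts largest_id > unsigned_32_max       # => true: past even unsigned 32 bits
```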
>
> >>>> -Steve
>
> >>>> On Dec 14, 4:35 pm, Steve H <[email protected]> wrote:
> >>>>> mysql> SELECT IFNULL(MIN(`id`), 1), IFNULL(MAX(`id`), 1)
> >>>>>     -> FROM `businesses`;
> >>>>> +----------------------+----------------------+
> >>>>> | IFNULL(MIN(`id`), 1) | IFNULL(MAX(`id`), 1) |
> >>>>> +----------------------+----------------------+
> >>>>> |                    1 |          15250646237 |
> >>>>> +----------------------+----------------------+
>
> >>>>> On Dec 14, 4:32 pm, Pat Allan <[email protected]> wrote:
>
> >>>>>> Hmm, if you run the sql_query_range query yourself, what does it 
> >>>>>> return?
>
> >>>>>> --
> >>>>>> Pat
>
> >>>>>> On 15/12/2009, at 11:24 AM, Steve H wrote:
>
> >>>>>>> Hi Pat,
>
> >>>>>>> Here is the sql_query:
>
> >>>>>>> sql_query = SELECT SQL_NO_CACHE `businesses`.`id` * 1 + 0 AS `id` ,
> >>>>>>> CAST(`businesses`.`name` AS CHAR) AS `name`, CAST(GROUP_CONCAT
> >>>>>>> (DISTINCT IFNULL(`cs_categories`.`name`, '0') SEPARATOR ' ') AS CHAR)
> >>>>>>> AS `category_names`, CAST(CONCAT_WS(' ', `city`.`city`,
> >>>>>>> `city`.`state_code`) AS CHAR) AS `city`, `businesses`.`id` AS
> >>>>>>> `sphinx_internal_id`, 4054125678 AS `class_crc`, '4054125678' AS
> >>>>>>> `subclass_crcs`, 0 AS `sphinx_deleted`, IFNULL(`businesses`.`name`,
> >>>>>>> '') AS `name_sort`, GROUP_CONCAT(DISTINCT IFNULL(IFNULL
> >>>>>>> (`cs_categories`.`name`, ''), '0') SEPARATOR ' ') AS
> >>>>>>> `category_names_sort`, CONCAT_WS(' ', IFNULL(`city`.`city`, ''),
> >>>>>>> IFNULL(`city`.`state_code`, '')) AS `city_sort`, `businesses`.`lat` AS
> >>>>>>> `lat`, `businesses`.`lon` AS `lon` FROM `businesses`    LEFT OUTER
> >>>>>>> JOIN `categorizations` ON (`businesses`.`id` =
> >>>>>>> `categorizations`.`business_id`)  LEFT OUTER JOIN `cs_categories` ON
> >>>>>>> (`cs_categories`.`id` = `categorizations`.`cs_category_id`)   LEFT
> >>>>>>> OUTER JOIN `city` ON `city`.city_id = `businesses`.city_id  WHERE
> >>>>>>> `businesses`.`id` >= $start AND `businesses`.`id` <= $end GROUP BY
> >>>>>>> `businesses`.`id`  ORDER BY NULL
> >>>>>>>  sql_query_range = SELECT IFNULL(MIN(`id`), 1), IFNULL(MAX(`id`), 1)
> >>>>>>> FROM `businesses`
>
> >>>>>>> We turned on logging for every query in mysqld to troubleshoot this.
> >>>>>>> This is the last query run:
>
> >>>>>>> SELECT SQL_NO_CACHE `businesses`.`id` * 1 + 0 AS `id` , CAST
> >>>>>>> (`businesses`.`name` AS CHAR) AS `name`, CAST(GROUP_CONCAT
> >>>>>>> (DISTINCT IFNULL(`cs_categories`.`name`, '0') SEPARATOR ' ') AS
> >>>>>>> CHAR) AS `category_names`, CAST(CONCAT_WS(' ', `city`.`city`,
> >>>>>>> `city`.`state_code`) AS CHAR) AS `city`, `businesses`.`id` AS
> >>>>>>> `sphinx_internal_id`, 4054125678 AS `class_crc`, '4054125678' AS
> >>>>>>> `subclass_crcs`, 0 AS `sphinx_deleted`, IFNULL(`businesses`.`name`,
> >>>>>>> '') AS `name_sort`, GROUP_CONCAT(DISTINCT IFNULL(IFNULL
> >>>>>>> (`cs_categories`.`name`, ''), '0') SEPARATOR ' ') AS
> >>>>>>> `category_names_sort`, CONCAT_WS(' ', IFNULL(`city`.`city`, ''),
> >>>>>>> IFNULL(`city`.`state_code`, '')) AS `city_sort`, `businesses`.`lat` AS
> >>>>>>> `lat`, `businesses`.`lon` AS `lon` FROM `businesses`    LEFT OUTER
> >>>>>>> JOIN `categorizations` ON (`businesses`.`id` =
> >>>>>>> `categorizations`.`business_id`)  LEFT OUTER JOIN `cs_categories` ON
> >>>>>>> (`cs_categories`.`id` = `categorizations`.`cs_category_id`)   LEFT
> >>>>>>> OUTER JOIN `city` ON `city`.city_id = `businesses`.city_id  WHERE
> >>>>>>> `businesses`.`id` >= 2365000001 AND `businesses`.`id` <= 2365744349
> >>>>>>> GROUP BY `businesses`.`id`  ORDER BY NULL
>
> >>>>>>> Which returns zero rows.  In fact, the last 15+ queries return zero
> >>>>>>> rows.  The businesses table has no records with ids
> >>>>>>> 2034010806...2500480805.  And of course, the maximum id is
> >>>>>>> 15250646237.  We need *all* the rows indexed, of course...  :)
>
> >>>>>>> So should we be increasing sql_range_step to something like 10 million?
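>
> >>>>>>> (For reference, the knob lives in sphinx.yml, per environment --
> >>>>>>> key name as we've been using it in this thread; value illustrative:)

```
development:
  sql_range_step: 10000000
```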
>
> >>>>>>> -Steve
>
> >>>>>>> On Dec 14, 2:58 pm, Pat Allan <[email protected]> wrote:
> >>>>>>>> Hi Steve
>
> >>>>>>>> Thinking Sphinx uses left outer joins (because that's what 
> >>>>>>>> ActiveRecord uses, and it's the best fit in this situation), and 
> >>>>>>>> yes, one doc is one row.
>
> >>>>>>>> There's nothing odd in your index definition - so let's have a look 
> >>>>>>>> at the sql_query value of the source business_core_0, in 
> >>>>>>>> config/development.sphinx.conf (or production - they should be the 
> >>>>>>>> same). You should be able to run it manually, see if it returns the 
> >>>>>>>> right results (perhaps over a particular step, to keep things 
> >>>>>>>> reasonably fast and manageable).
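>
> >>>>>>>> For example, substituting one step's bounds for $start/$end -- a
> >>>>>>>> quick row count is an easy first sanity check (bounds arbitrary):

```
-- does this step actually match rows?
SELECT COUNT(*) FROM `businesses`
WHERE `businesses`.`id` >= 1 AND `businesses`.`id` <= 10000000;
```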
>
> >>>>>>>> --
> >>>>>>>> Pat
>
> >>>>>>>> On 15/12/2009, at 3:26 AM, Steve H wrote:
>
> >>>>>>>>> Thanks for the lightning-fast reply Pat.  :)
>
> >>>>>>>>>  define_index do
> >>>>>>>>>    indexes :name, :sortable => true
> >>>>>>>>>    indexes categories_and_tags(:name), :as => :category_names,
> >>>>>>>>>      :sortable => true
> >>>>>>>>>    indexes [city.city, city.state_code], :as => :city,
> >>>>>>>>>      :sortable => true
> >>>>>>>>>    set_property :group_concat_max_len => 8192
> >>>>>>>>>    # attributes
> >>>>>>>>>    has lat, lon
> >>>>>>>>>  end
>
> >>>>>>>>> Every business has at least one category or tag, but I'm kind of
> >>>>>>>>> assuming here that the indexer is doing a full left join (ie, not an
> >>>>>>>>> inner join).
>
> >>>>>>>>> Could something in this block be causing the problem?  And is
> >>>>>>>>> 1 doc = 1 row in the db?
>
> >>>>>>>>> -Steve
>
> >>>>>>>>> On Dec 13, 7:03 pm, Pat Allan <[email protected]> wrote:
> >>>>>>>>>> Hi Steve
>
> >>>>>>>>>> What does your define_index block for Business look like?
>
> >>>>>>>>>> --
> >>>>>>>>>> Pat
>
> >>>>>>>>>> On 14/12/2009, at 1:44 PM, Steve H wrote:
>
> >>>>>>>>>>> OK first off -- I'm not 100% certain that 1 doc = 1 record.
> >>>>>>>>>>> But if it DOES, then Sphinx definitely didn't collect enough.
> >>>>>>>>>>> We have over 13 million rows in our database, but this was
> >>>>>>>>>>> the output after indexing:
>
> >>>>>>>>>>> -----------------------------------------------
> >>>>>>>>>>> indexing index 'business_core'...
> >>>>>>>>>>> collected 1199 docs, 0.7 MB
> >>>>>>>>>>> collected 0 attr values
> >>>>>>>>>>> sorted 0.0 Mvalues, 100.0% done
> >>>>>>>>>>> sorted 0.1 Mhits, 100.0% done
> >>>>>>>>>>> total 1199 docs, 685923 bytes
> >>>>>>>>>>> total 1381.345 sec, 496.56 bytes/sec, 0.87 docs/sec
> >>>>>>>>>>> distributed index 'business' can not be directly indexed; 
> >>>>>>>>>>> skipping.
> >>>>>>>>>>> Started successfully (pid 32011).
> >>>>>>>>>>> -----------------------------------------------
>
> >>>>>>>>>>> After reading some of the docs, I thought... maybe I have to 
> >>>>>>>>>>> increase
> >>>>>>>>>>> the sql_range_step in the sphinx.yml or increase the
> >>>>>>>>>>> group_concat_max_len property.  So I did those things, but the 
> >>>>>>>>>>> output
> >>>>>>>>>>> after indexing is the same (albeit much much faster  :)
>
> >>>>>>>>>>> -----------------------------------------------
> >>>>>>>>>>> Stopped search daemon (pid 2810).
> >>>>>>>>>>> Generating Configuration to /home/steve/insiderpages/config/
> >>>>>>>>>>> sphinx_dev.sphinx.conf
> >>>>>>>>>>> Sphinx 0.9.8.1-release (r1533)
> >>>>>>>>>>> Copyright (c) 2001-2008, Andrew Aksyonoff
>
> >>>>>>>>>>> using config file '/home/steve/insiderpages/config/
> >>>>>>>>>>> sphinx_dev.sphinx.conf'...
> >>>>>>>>>>> indexing index 'business_core'...
> >>>>>>>>>>> collected 1199 docs, 0.1 MB
> >>>>>>>>>>> collected 0 attr values
> >>>>>>>>>>> sorted 0.0 Mvalues, 100.0% done
> >>>>>>>>>>> sorted 0.0 Mhits, 100.0% done
> >>>>>>>>>>> total 1199 docs, 79605 bytes
> >>>>>>>>>>> total 0.447 sec, 177901.38 bytes/sec, 2679.53 docs/sec
> >>>>>>>>>>> distributed index 'business' can not be directly indexed; 
> >>>>>>>>>>> skipping.
> >>>>>>>>>>> Started successfully (pid 6365).
> >>>>>>>>>>> -----------------------------------------------
>
> >>>>>>>>>>> What's the problem?
>
> ...
>
-- 
You received this message because you are subscribed to the Google Groups 
"Thinking Sphinx" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/thinking-sphinx?hl=en.

