Thanks, you've been great rubber ducks!

Just after I posted this, I tried "one more thing". (Actually, it was two or 
more things at once -- bad practice, I know.) As I hinted, I changed the 
relationship from has_many to has_one :content, and left the join in place, and 
changed the name of the content.plain from :content to :plain_text.

  define_index do
    set_property :group_concat_max_len => 10.megabytes

    indexes :title, :sortable => true
    indexes teaser
    join title.content
    indexes content.plain, :as => :plain_text
    indexes author_name, :sortable => true
    has created_at, updated_at
    where sanitize_sql(["publish", true])
  end


Then in my search method, I removed a dangling :include => :contents, 
parameter. Somewhere between these changes, the indexing process started making 
sense (about 1GB of index, as predicted) and the search started working -- and 
it started working faster, too. 

As my old creative director used to say of me, "Even a blind squirrel finds a 
nut or two."

Thanks for all you do,

Walter

On Apr 1, 2014, at 9:23 AM, Walter Lee Davis wrote:

> Here's the info from the production.sphinx.conf, including the SQL:
> 
> source title_core_0
> {
>  type = mysql
>  sql_host = 10.12.29.11
>  sql_user = user
>  sql_pass = pass
>  sql_db = oll2
>  sql_query_pre = SET SESSION group_concat_max_len = 10485760
>  sql_query_pre = SET NAMES utf8
>  sql_query_pre = SET TIME_ZONE = '+0:00'
>  sql_query = SELECT SQL_NO_CACHE `titles`.`id` * CAST(4 AS SIGNED) + 3 AS 
> `id` , `titles`.`title` AS `title`, `titles`.`teaser` AS `teaser`, 
> GROUP_CONCAT(DISTINCT IFNULL(`contents`.`plain`, '0') SEPARATOR ' ') AS 
> `content`, `titles`.`author_name` AS `author_name`, `titles`.`id` AS 
> `sphinx_internal_id`, 0 AS `sphinx_deleted`, 3942078319 AS `class_crc`, 
> IFNULL('Title', '') AS `sphinx_internal_class`, IFNULL(`titles`.`title`, '') 
> AS `title_sort`, IFNULL(`titles`.`author_name`, '') AS `author_name_sort`, 
> UNIX_TIMESTAMP(`titles`.`created_at`) AS `created_at`, 
> UNIX_TIMESTAMP(`titles`.`updated_at`) AS `updated_at` FROM `titles` LEFT 
> OUTER JOIN `contents` ON `contents`.`title_id` = `titles`.`id` WHERE 
> (`titles`.`id` >= $start AND `titles`.`id` <= $end AND publish) GROUP BY 
> `titles`.`id` ORDER BY NULL
>  sql_query_range = SELECT IFNULL(MIN(`id`), 1), IFNULL(MAX(`id`), 1) FROM 
> `titles` 
>  sql_attr_uint = sphinx_internal_id
>  sql_attr_uint = sphinx_deleted
>  sql_attr_uint = class_crc
>  sql_attr_timestamp = created_at
>  sql_attr_timestamp = updated_at
>  sql_attr_string = sphinx_internal_class
>  sql_attr_string = title_sort
>  sql_attr_string = author_name_sort
>  sql_query_info = SELECT * FROM `titles` WHERE `id` = (($id - 3) / 4)
> }
> 
> index title_core
> {
>  source = title_core_0
>  path = /data/www/db/sphinx/production/title_core
>  charset_type = utf-8
> }
> 
> index title
> {
>  type = distributed
>  local = title_core
> }
> 
> This is after I added an explicit line to the define_index statement:
> 
> join title.contents
> 
> There was no difference in terms of the index size that I noted earlier, and 
> I'm not astute enough to tell you if the join is correct in the generated 
> SQL. Here's the entirety of the define_index method:
> 
>  define_index do
>    set_property :group_concat_max_len => 10.megabytes
> 
>    indexes :title, :sortable => true
>    indexes teaser
>    join title.contents
>    indexes contents.plain, :as => :content
>    indexes author_name, :sortable => true
>    has created_at, updated_at
>    where sanitize_sql(["publish", true])
>  end
> 
> Do you see anything odd about this? Should I change the relationship to 
> contents from has_many to has_one? (The has_many was a relic from an earlier 
> structure where I was splitting the plain text on an arbitrary length. This 
> was before I found a way to increase the column width in my database to a 
> size guaranteed to hold my largest title.)
> 
> Thanks again for your tireless support!
> 
> Walter
> 
> On Mar 22, 2014, at 7:49 PM, Pat Allan wrote:
> 
>> Your points on the expected size are spot on... the syntax you're using 
>> (association_name.column_name) should ensure the SQL join is made 
>> automatically, but you can confirm that by looking at the sql_query value 
>> for the title_core index and see whether it's joining on contents, and how 
>> the plain column is being used.
>> 
>> Is there anything odd/wrong in that SQL statement?
>> 
>> -- 
>> Pat
>> 
>> On 23 Mar 2014, at 8:25 am, Walter Lee Davis <[email protected]> wrote:
>> 
>>> I've tried both, and when I don't separately index the contents table, I 
>>> don't get any hits on a fairly unique string that I know is in the 
>>> contents.plain field when I search on titles. The indexing process makes it 
>>> clear that this relation is not being touched -- when I index the contents 
>>> table, the result is this:
>>> 
>>> collected 1099 docs, 913.5 MB
>>> 
>>> When I index the titles (including the related field in contents) I only 
>>> get this:
>>> 
>>> collected 1603 docs, 24.6 MB
>>> 
>>> contents.plain is the only large part of contents, it's just title_id and 
>>> timestamps besides the plain column. If contents.plain was being accessed 
>>> as a related column (and thus added to the metadata from the titles table), 
>>> I would expect the result from the title indexing process to be in the 1GB 
>>> neighborhood. That would make sense to me, given the amount of data being 
>>> indexed.
>>> 
>>> What I have done as a fallback for now is added an index to the contents 
>>> table, and rigged up my search results page to display those results as if 
>>> they were to the title. But I'd still like to be searching once across all 
>>> titles, and finding hits whether they were against the metadata (in the 
>>> titles table) or the content (in the contents table). Is there some join 
>>> syntax I need to use here? Remember -- this is Sphinx 2, not 3.
>>> 
>>> Thanks so much for your excellent support!
>>> 
>>> Walter
>>> 
>>> On Mar 21, 2014, at 11:04 PM, Pat Allan wrote:
>>> 
>>>> * The indexes line you're using will work fine - also, it'll work with the 
>>>> arguments being shifted to method calls:
>>>> 
>>>> indexes contents.plain, :as => :content
>>>> 
>>>> If there's still issues, do get in touch.
>>> 
>>> -- 
>>> You received this message because you are subscribed to the Google Groups 
>>> "Thinking Sphinx" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an 
>>> email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at http://groups.google.com/group/thinking-sphinx.
>>> For more options, visit https://groups.google.com/d/optout.
>> 
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Thinking Sphinx" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/thinking-sphinx.
>> For more options, visit https://groups.google.com/d/optout.
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "Thinking Sphinx" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/thinking-sphinx.
> For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups 
"Thinking Sphinx" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/thinking-sphinx.
For more options, visit https://groups.google.com/d/optout.

Reply via email to