I believe I've found the culprit: the "\u0000" Unicode null character. By
narrowing the scope of the index, I was able to isolate the failure to the
two offending issues and their comments, and then index just those issues.
That made the failures much quicker to reproduce. Before that, I was having
to wait 30+ minutes for the indexer to fail, which made it difficult to test
theories. It's obvious in hindsight, but it really helped get to the bottom
of things.

We concatenate all of the comments using a "comment_bodies" method on the
issue for indexing purposes. In both cases, one of the comments for the
issue contains "\u0000". I had previously been using Sequel Pro to examine
the contents of the comments, so I didn't see the character. Once I was
able to definitively identify the offending comments, I checked the value
of "comment_bodies" in console, and the \u0000 was visible.
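
For anyone chasing the same thing, here is a minimal sketch of how the
character can be spotted from a Ruby console even though a GUI client hides
it. The CommentRow struct and the sample data are hypothetical stand-ins for
the real comment records, not our actual models.

```ruby
# Hypothetical stand-in for comment records; in a real app these would be
# ActiveRecord models loaded from the database.
CommentRow = Struct.new(:id, :body)

comments = [
  CommentRow.new(1, "Looks good to me"),
  CommentRow.new(2, "Pasted from elsewhere\u0000 with a stray null"),
  CommentRow.new(3, "Another clean comment")
]

# String#include? finds the null character even though many GUI tools
# render it as nothing at all.
offending = comments.select { |comment| comment.body.include?("\u0000") }

offending.each do |comment|
  # String#inspect escapes non-printable characters, so the null becomes
  # visible in the console output.
  puts "comment #{comment.id}: #{comment.body.inspect}"
end
```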

I added a low-tech ".gsub(/\u0000/, '')" to our comment_bodies, and all was
right again in indexing land. So the \u0000 Unicode null character does
appear to be the culprit. For now, we'll filter it out manually, but I
expect that's something that Thinking Sphinx could handle.
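
Roughly, the workaround looks like the sketch below. The Issue and Comment
classes here are simplified, hypothetical stand-ins (plain Ruby rather than
ActiveRecord), and joining the bodies with a space is an assumption made for
the example.

```ruby
# Hypothetical, simplified stand-ins for the real models.
Comment = Struct.new(:body)

class Issue
  NULL_CHAR = "\u0000"

  def initialize(comments)
    @comments = comments
  end

  # Concatenate all comment bodies for indexing, removing any null
  # characters so they never reach the SphinxQL statement.
  def comment_bodies
    @comments.map(&:body).join(" ").gsub(NULL_CHAR, "")
  end
end

issue = Issue.new([
  Comment.new("First\u0000 comment"),
  Comment.new("Second comment")
])

puts issue.comment_bodies.inspect
```

Stripping at the point where the text is assembled keeps the stored comments
untouched and only cleans what is handed to the indexer.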

I hope this helps!


On Fri, Feb 28, 2014 at 7:19 PM, Pat Allan <[email protected]> wrote:

> From reading through the source a bit more, it seems the `where`
> conditions for a real-time index can be set to either a symbol
> (representing an instance method name within the indexed model) or a proc
> (called with a single argument, the instance of the model), and if any
> condition exists, each must return true for the object to be added to the
> real-time index.
>
> As for monitoring SphinxQL statements that are being sent to Sphinx,
> there's been a recent commit to TS that adds real-time updates to the logs.
> Adding the following to your Gemfile should sort that out:
>
>   gem 'thinking-sphinx', '~> 3.1.0',
>     :git    => 'git://github.com/pat/thinking-sphinx.git',
>     :branch => 'develop',
>     :ref    => '94ee176a7a'
>
> Also: anything within the scope block (includes or otherwise) is only
> applied for the full index generation. It's not used when updating the
> Sphinx indices as a single record is created/edited, nor does it apply
> during searches.
>
> On 1 Mar 2014, at 2:46 am, Garrett Dimon <[email protected]> wrote:
>
> > Thanks! We'll give that a shot. It seems the where *is* being applied at
> > search time, but just not at index time. We're actually considering
> > indexing everything with RT because previously when projects were
> > unarchived or accounts unfrozen, they'd get picked back up by the full
> > reindexes, but it seems with RT, that wouldn't happen because we don't
> > run full reindexes. Is there a way that we could tell TS to reindex in
> > those situations if we use this approach and exclude them from the index?
> >
> > I'm trying to track down the specific content that's tripping up the
> > generation, but I'm not having much luck. I found the record that's
> > printed out when rake fails, and I manually inspected the records near
> > it, but they all check out. Is there a way/location for me to see the SQL
> > query that's being used at the moment the rake task fails?
> >
> >
> > On Fri, Feb 28, 2014 at 9:21 AM, Pat Allan <[email protected]> wrote:
> > The `where` method doesn't apply for real-time indices - but try this
> > instead in the index definition block:
> >
> >   scope { Issue.where "account_status_id IN (1,2) AND archived IS false" }
> >
> > That should ensure only the appropriate issues are indexed as part of
> > the generate call.
> >
> > Beyond that, you may wish to add `includes` within that scope to cover
> > all associations used within the index?
> >
> > Unrelated to any of this: the match_mode defaults to extended with TS v3
> > (indeed, it can't be anything else).
> >
> > --
> > Pat
> >
> > On 1 Mar 2014, at 2:09 am, Garrett Dimon <[email protected]> wrote:
> >
> > > Roger on the generation. Any high-level suggestions for optimization?
> > >
> > > I'll see if I can't figure out the exact record that's tripping up the
> index generation. In the meantime, here's a gist of our index definition:
> > > https://gist.github.com/garrettdimon/25f6c305541f30b3ce39
> > >
> > >
> > > On Thu, Feb 27, 2014 at 9:21 PM, Pat Allan <[email protected]>
> wrote:
> > > Hi Garrett
> > >
> > > Generation can be slow - at the end of the day, it really comes down
> to how much data you're dealing with, and if you're using aggregation
> methods, how quick they are. It's all going through your Rails app (instead
> of just SQL queries), so optimising for that is different to adding db
> indices and such.
> > >
> > > As for the error though... without having a copy of the database, it's
> a little hard to debug, but it sounds like there's a bug in TS with
> something in the data being passed through. Having a look at your app log
> may help identify the record in question... also, what does the index
> definition for that model look like?
> > >
> > > --
> > > Pat
> > >
> > > On 28 Feb 2014, at 9:47 am, Garrett Dimon <[email protected]>
> wrote:
> > >
> > >> Howdy, Pat. We're in the process of upgrading from TS 2 with delayed
> deltas to TS 3 with real time indexing.
> > >>
> > >> We've been able to get everything up and running, but we've run into
> a couple of problems/questions around the indexing. These may ultimately be
> Sphinx questions rather than Thinking Sphinx questions, but I thought I'd
> start here since we're only changing Sphinx from 2.0 to 2.1.
> > >>
> > >> 1. Index Creation Performance
> > >>
> > >> Our production logs show about a 20 minute turnaround to do a
> > >> complete reindex of our production data with TS 2. Running some local
> > >> tests, TS 3 generate is taking at least an hour for that data. (The
> > >> generate is crashing, so it may ultimately take even longer.) Our
> > >> indexing configuration is set up so that a large portion of content is
> > >> excluded from the index. (Inactive accounts, archived projects, etc.)
> > >> We've verified that searching is correctly excluding the relevant
> > >> records, but it appears as if that's happening when the query is run
> > >> rather than when the indexing occurs. Our only theory so far is that
> > >> with TS 2 and traditional indexing, those weren't included in the index
> > >> at all, but that with real time indexing, they're included in the index
> > >> and filtered out when the query is run. Can you provide any insight
> > >> about whether this sounds like normal behavior or whether we've likely
> > >> screwed something up? :)
> > >>
> > >> 2. The Generate is crashing with "rake aborted! sphinxql: syntax
> error, unexpected $undefined, expecting CONST_INT (or 4 other tokens) near
> ''..." (where ... is content from our DB.)
> > >>
> > >> I've done some searching, but haven't had any luck. I've run the
> generate rake task on two separate occasions, and both times it failed with
> the same error message and content, so my gut is leading me to think that
> it's an encoding or unescaped quotation mark problem. Does that problem
> ring any bells?
> > >>
> > >> Thanks!
> > >>
> > >>
> > >>
> > >> --
> > >> You received this message because you are subscribed to the Google
> Groups "Thinking Sphinx" group.
> > >> To unsubscribe from this group and stop receiving emails from it,
> send an email to [email protected].
> > >>
> > >> To post to this group, send email to [email protected]
> .
> > >> Visit this group at http://groups.google.com/group/thinking-sphinx.
> > >> For more options, visit https://groups.google.com/groups/opt_out.
> > >
> > >
> > > --
> > > You received this message because you are subscribed to a topic in the
> Google Groups "Thinking Sphinx" group.
> > > To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/thinking-sphinx/7llAB4zO4bw/unsubscribe.
> > > To unsubscribe from this group and all its topics, send an email to
> [email protected].
> > > To post to this group, send email to [email protected].
> > > Visit this group at http://groups.google.com/group/thinking-sphinx.
> > > For more options, visit https://groups.google.com/groups/opt_out.
> > >
