If you’ve got tag names and their corresponding ids, I think it’d be better
(and more accurate) to query Sphinx by the ids:
# in the index (real-time attributes need an explicit type):
has tag_ids, :type => :integer, :multi => true
# when searching, maybe something like:
tag = Tag.find_by(name: params[:tag_name])
Document.search params[:query], :with => {:tag_ids => tag.id}
It doesn’t answer the question of why octothorps aren’t being indexed/searched
correctly, but it should mean better search results generally.
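
One thing to watch with the lookup above: find_by returns nil when no tag has that name, so tag.id would blow up. A minimal sketch of building the filter separately, assuming nothing beyond the :with => {:tag_ids => ...} filter already shown; FakeTag and tag_search_options are made-up names for illustration, not part of Thinking Sphinx or Gutentag:

```ruby
# Hypothetical stand-in for a tag record (in the app this would be the
# model returned by Tag.find_by).
FakeTag = Struct.new(:id, :name)

# Build the options hash for Document.search.
# Returns an empty hash (no filter) when the tag lookup found nothing.
def tag_search_options(tag)
  return {} if tag.nil?

  { :with => { :tag_ids => tag.id } }
end

p tag_search_options(FakeTag.new(42, "#ruby"))
p tag_search_options(nil)
```

In the app this would be used as Document.search params[:query], tag_search_options(tag). Note that an empty hash means "no tag filter at all", so for an unknown tag name you may prefer to return no results instead.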
Cheers,
—
Pat
> On 24 Feb 2021, at 1:48 am, Walter Lee Davis <[email protected]> wrote:
>
>
>
>> On Feb 23, 2021, at 12:02 AM, Pat Allan <[email protected]> wrote:
>>
>> Having the setting in the default block should be fine - you should be able
>> to see the charset_table setting in the generated Sphinx configuration files.
>>
>> Also: I generally recommend just using ts:rebuild, as that handles both
>> real-time indices and SQL-backed indices (i.e. it’s running the same things
>> as ts:rt:rebuild) - if you’re finding ts:rebuild is not working well for
>> you, I’m keen to hear why!
>
> While I was fighting with this, and fiddling with the configuration to use
> has instead of indexes, I got myself into a state where ts:rebuild would blow
> up with a SQL error (I think it was a Sphinx SQL error) and ts:rt:rebuild
> would work fine. But with the current configuration that I shared with you,
> both work.
>
>>
>> All that said, doesn’t sound like you’re doing anything wrong. I wonder if
>> html_strip is somehow filtering out the octothorps? Though I’m pretty sure
>> it’s looking just for HTML tags… still, may be worth turning that off to
>> double-check.
>>
>> And I’ve just run some quick tests locally - without the custom
>> charset_table value, I find the string “#test” is found by Sphinx when
>> searching by “#test” or “test” (because # is ignored, given it’s not an
>> indexable character - so the two searches are actually identical). Adding in
>> the charset_table setting, rebuilding - searching for #test returns a
>> result, but test doesn’t (as that now doesn’t exist as a standalone word in
>> what’s indexed).
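
To make the behaviour described above concrete, here is a toy model in plain Ruby. It is only a sketch of the idea (characters outside the charset act as word separators) and is not how Sphinx is actually implemented; tokenize, matches? and DEFAULT_CHARS are hypothetical names, and the default character set is simplified to a-z, 0-9 and underscore.

```ruby
# Toy model only: characters that are not "indexable" split words.
DEFAULT_CHARS = "a-z0-9_".freeze

# Split text into lowercase tokens. extra_chars lists additional
# characters to treat as part of a word (e.g. "#").
def tokenize(text, extra_chars = "")
  separators = /[^#{DEFAULT_CHARS}#{Regexp.escape(extra_chars)}]+/
  text.downcase.split(separators).reject(&:empty?)
end

# A query matches when every query token appears in the document tokens.
def matches?(document, query, extra_chars = "")
  doc_tokens = tokenize(document, extra_chars)
  tokenize(query, extra_chars).all? { |token| doc_tokens.include?(token) }
end

doc = "a page about #test"

# Without "#" in the charset, "#test" and "test" are the same search.
p matches?(doc, "#test")
p matches?(doc, "test")

# With "#" added, "#test" is its own word and plain "test" is not indexed.
p matches?(doc, "#test", "#")
p matches?(doc, "test", "#")
```

This also illustrates the trade-off mentioned earlier in the thread: once # is indexable, it affects every field, and plain "test" stops matching "#test".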
>>
>> I doubt it matters, but: which version of Sphinx are you using?
>
> Sphinx 2.2.11-id64-release (95ae9a6), TS 5.0.0.
>
> It's definitely odd. I'm not sure whether re-indexing picks up the tag names
> when it runs en masse; it seems to be something with Gutentag. If I find
> a document in console, the object that I get back shows tag_names as nil,
> but if I then call tag_names on that object, I get back the array of strings
> I am expecting. The nil only appears inside the #<...> output that irb prints
> when it calls inspect on the found object, so I don't know if
> that's significant at all, or is getting in the way of Sphinx extracting the
> values. Again, when I test in console by calling my tags_for_indexing method
> on a found object, I get back the expected string value.
>
> I've told the client that she may need to get rid of her beloved hashtags in
> the tagging interface, or use Gutentag in place of Sphinx to get "everything
> tagged with this tag". I'm not convinced that's a bad idea, either.
>
> Walter
>
>>
>> —
>> Pat
>>
>>> On 23 Feb 2021, at 3:10 pm, Walter Lee Davis <[email protected]> wrote:
>>>
>>> Thanks for the speedy reply. I tried adding the charset table as
>>> recommended, but I am not seeing any difference in my search results. I did
>>> differ from the directions slightly, in that I put the character set in the
>>> default block at the top of my Yaml file, since it's then included in all
>>> of the environments. I figured that should work, but in case it doesn't can
>>> you explain why?
>>>
>>> default: &default
>>>   morphology: stem_en
>>>   html_strip: true
>>>   batch_size: 300
>>>   charset_table: "0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F,
>>>     U+430..U+44F, U+23"
>>>
>>> development:
>>>   <<: *default
>>>
>>> test:
>>>   <<: *default
>>>
>>> production:
>>>   <<: *default
>>>
>>> staging:
>>>   <<: *default
>>>   mysql41: 9320
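
On the question of whether the default block reaches every environment: at the YAML level, the &default anchor and the <<: *default merge key copy those settings into each environment. A quick way to check this outside of Rails is sketched below. It only shows what the YAML parser produces, not how Thinking Sphinx consumes the file, and the CONFIG string is a trimmed copy of the file above (test and production left out for brevity).

```ruby
require "yaml"

# Trimmed copy of the configuration from the message above.
CONFIG = <<~YAML
  default: &default
    morphology: stem_en
    html_strip: true
    batch_size: 300
    charset_table: "0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+23"

  development:
    <<: *default

  staging:
    <<: *default
    mysql41: 9320
YAML

# aliases: true lets safe_load accept the &default anchor and *default alias.
settings = YAML.safe_load(CONFIG, aliases: true)

# Each environment should now carry the merged charset_table.
puts settings["development"]["charset_table"]
puts settings["staging"]["charset_table"]
puts settings["staging"]["mysql41"]
```

If the charset_table prints for each environment here, the merge itself is working, and the next place to look is the generated Sphinx configuration file.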
>>>
>>>
>>> I forced a full rebuild/reindex with rake ts:rt:rebuild. When that didn't
>>> seem to change things, I also ran rake ts:rebuild. My understanding is that
>>> the first of these should be done when you use the Real Time index. If I'm
>>> mistaken, please let me know.
>>>
>>> Thanks again!
>>>
>>> Walter
>>>
>>>> On Feb 22, 2021, at 10:51 PM, Pat Allan <[email protected]> wrote:
>>>>
>>>> Hi Walter,
>>>>
>>>> I’m pretty sure Sphinx doesn’t index punctuation by default. If you want
>>>> octothorps included, you’ll need to define a custom charset_table value
>>>> (per environment in `config/thinking_sphinx.yml`) which includes that
>>>> character. The Sphinx docs outline the default, so best to take that and
>>>> then add in the octothorp (U+23).
>>>> http://sphinxsearch.com/docs/current.html#conf-charset-table
>>>> https://freelancing-gods.com/thinking-sphinx/v5/advanced_config.html#character-sets-and-tables
>>>>
>>>> Keep in mind that this will impact all uses of that character in all
>>>> fields - there’s no way to have it apply to just some fields (or, in this
>>>> case, words that only start with that character).
>>>>
>>>> Once you’ve added this configuration, a full rebuild will be required.
>>>>
>>>> Cheers,
>>>>
>>>> —
>>>> Pat
>>>>
>>>>> On 23 Feb 2021, at 2:41 pm, Walter Lee Davis <[email protected]> wrote:
>>>>>
>>>>> I'm using GutenTag to apply tags to individual pages in a CMS. The
>>>>> Document model uses TS5 with Real-Time Indexing. I've set up my index
>>>>> thusly:
>>>>>
>>>>> # in the model
>>>>> def tags_for_indexing
>>>>>   tag_names.join ' '
>>>>> end
>>>>>
>>>>> # in the index
>>>>> ThinkingSphinx::Index.define :document, :with => :real_time do
>>>>>   scope { Document.where(id: Document.publicly.map{ |d|
>>>>>     [d.id].concat(d.descendants.published.map(&:id)) }.flatten) }
>>>>>
>>>>>   indexes title
>>>>>   indexes teaser
>>>>>   indexes body_html
>>>>>   indexes author_display
>>>>>   indexes tags_for_indexing
>>>>>
>>>>>   has created_at, type: :timestamp
>>>>>   has updated_at, type: :timestamp
>>>>> end
>>>>>
>>>>> I've tested the method, and confirm that it outputs a space-delimited
>>>>> string of words for the tags.
>>>>>
>>>>> I run rake ts:rt:rebuild and everything seems to go fine. But trying to
>>>>> search on some of these tag names is not returning the results I am
>>>>> imagining. The client has insisted on making some of these tags start
>>>>> with an octothorp, because she is writing about "hashtags" on Twitter.
>>>>> Most tags do not have punctuation in them. I am able to find other terms,
>>>>> even very obscure ones, when I don't use punctuation in the tag names.
>>>>>
>>>>> Does this sound like something that I can fix, or should I advise the
>>>>> client to lay off the octothorps?
>>>>>
>>>>> Walter
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google Groups
>>>>> "Thinking Sphinx" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send an
>>>>> email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/thinking-sphinx/EA71574B-9EBF-484E-A5FA-BF7CD53A10BC%40wdstudio.com.