If you’ve got tag names and their corresponding ids, I think it’d be better 
(and more accurate) to query Sphinx by the ids:

  # in the index (real-time indices need an explicit attribute type):
  has tag_ids, :type => :integer, :multi => true

  # when searching, maybe something like:
  tag = Tag.find_by(name: params[:tag_name])
  Document.search params[:query], :with => {:tag_ids => tag.id}

This doesn’t answer the question of why octothorps aren’t being 
indexed/searched correctly, but it should mean better search results generally.
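
As a rough illustration of why filtering on ids sidesteps the problem, here is a plain-Ruby sketch. It is not Sphinx's actual tokenizer: tokenize, matches?, tagged_with? and the two character sets are names invented for this example, and the sets only approximate the Latin part of a charset_table. The idea is that field searches go through the tokenizer, so whether "#" survives depends on the charset, while an attribute filter is just an integer comparison.

```ruby
# Illustration only: a toy stand-in for charset_table-driven tokenizing.
# All names here are hypothetical and not part of Thinking Sphinx or Sphinx.

DEFAULT_CHARS  = (('0'..'9').to_a + ('a'..'z').to_a + ['_']).freeze
WITH_OCTOTHORP = (DEFAULT_CHARS + ['#']).freeze # same set, plus U+23

# Lowercase the text and keep only runs of characters found in `charset`;
# everything else acts as a separator.
def tokenize(text, charset)
  allowed = Regexp.union(charset)
  text.downcase.scan(/(?:#{allowed})+/)
end

# A document matches when every token of the query is one of its tokens.
def matches?(document, query, charset)
  doc_tokens = tokenize(document, charset)
  tokenize(query, charset).all? { |token| doc_tokens.include?(token) }
end

# An attribute filter never sees the tokenizer: it compares integers.
def tagged_with?(tag_ids, tag_id)
  tag_ids.include?(tag_id)
end

matches?('#test', 'test',  DEFAULT_CHARS)   # => true  ("#" dropped on both sides)
matches?('#test', '#test', DEFAULT_CHARS)   # => true  (same search as above)
matches?('#test', 'test',  WITH_OCTOTHORP)  # => false ("test" is no longer a token)
matches?('#test', '#test', WITH_OCTOTHORP)  # => true
tagged_with?([3, 7], 7)                     # => true, whatever the tag is called
```

The four matches? calls mirror the behaviour I saw in my local tests, quoted further down in this thread; the last call is the id-based approach, which gives the same answer regardless of punctuation in the tag name.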

Cheers,

— 
Pat

> On 24 Feb 2021, at 1:48 am, Walter Lee Davis <[email protected]> wrote:
> 
> 
> 
>> On Feb 23, 2021, at 12:02 AM, Pat Allan <[email protected]> wrote:
>> 
>> Having the setting in the default block should be fine - you should be able 
>> to see the charset_table setting in the generated Sphinx configuration files.
>> 
>> Also: I generally recommend just using ts:rebuild, as that handles both 
>> real-time indices and SQL-backed indices (i.e. it’s running the same things 
>> as ts:rt:rebuild) - if you’re finding ts:rebuild is not working well for 
>> you, I’m keen to hear why!
> 
> While I was fighting with this, and fiddling with the configuration to use 
> has instead of indexes, I got myself into a state where ts:rebuild would blow 
> up with a SQL error (I think it was a Sphinx SQL error) and ts:rt:rebuild 
> would work fine. But with the current configuration that I shared with you, 
> both work.
> 
>> 
>> All that said, doesn’t sound like you’re doing anything wrong. I wonder if 
>> html_strip is somehow filtering out the octothorps? Though I’m pretty sure 
>> it’s looking just for HTML tags… still, may be worth turning that off to 
>> double-check.
>> 
>> And I’ve just run some quick tests locally - without the custom 
>> charset_table value, I find the string “#test” is found by Sphinx when 
>> searching by “#test” or “test” (because # is ignored, given it’s not an 
>> indexable character - so the two searches are actually identical). Adding in 
>> the charset_table setting, rebuilding - searching for #test returns a 
>> result, but test doesn’t (as that now doesn’t exist as a standalone word in 
>> what’s indexed).
>> 
>> I doubt it matters, but: which version of Sphinx are you using?
> 
> Sphinx 2.2.11-id64-release (95ae9a6), TS 5.0.0.
> 
> It's definitely odd. I'm not sure if re-indexing is picking up the tag names 
> when it runs en masse, and it seems to be something with GutenTag. If I find 
> a document in console, the object that I get back has tag_names set to nil, 
> but if I then call tag_names on that object, I get back the array of strings 
> I am expecting. It's just the value that I see inside the <> brackets 
> initially when inspect is called on the found object by irb, so I don't know if 
> that's significant at all, or is getting in the way of Sphinx extracting the 
> values. Again, when I test in console by calling my tags_for_indexing method 
> on a found object, I get back the expected string value.
> 
> I've told the client that she may need to get rid of her beloved hashtags in 
> the tagging interface, or use Gutentag in place of Sphinx to get "everything 
> tagged with this tag". I'm not convinced that's a bad idea, either.
> 
> Walter
> 
>> 
>> — 
>> Pat
>> 
>>> On 23 Feb 2021, at 3:10 pm, Walter Lee Davis <[email protected]> wrote:
>>> 
>>> Thanks for the speedy reply. I tried adding the charset table as 
>>> recommended, but I am not seeing any difference in my search results. I did 
>>> differ from the directions slightly, in that I put the character set in the 
>>> default block at the top of my Yaml file, since it's then included in all 
>>> of the environments. I figured that should work, but in case it doesn't can 
>>> you explain why?
>>> 
>>> default: &default
>>>   morphology: stem_en
>>>   html_strip: true
>>>   batch_size: 300
>>>   charset_table: "0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+23"
>>> 
>>> development:
>>>   <<: *default
>>> 
>>> test:
>>>   <<: *default
>>> 
>>> production:
>>>   <<: *default
>>> 
>>> staging:
>>>   <<: *default
>>>   mysql41: 9320
>>> 
>>> 
>>> I forced a full rebuild/reindex with rake ts:rt:rebuild. When that didn't 
>>> seem to change things, I also ran rake ts:rebuild. My understanding is that 
>>> the first of these should be done when you use the Real Time index. If I'm 
>>> mistaken, please let me know.
>>> 
>>> Thanks again!
>>> 
>>> Walter
>>> 
>>>> On Feb 22, 2021, at 10:51 PM, Pat Allan <[email protected]> wrote:
>>>> 
>>>> Hi Walter,
>>>> 
>>>> I’m pretty sure Sphinx doesn’t index punctuation by default. If you want 
>>>> octothorps included, you’ll need to define a custom charset_table value 
>>>> (per environment in `config/thinking_sphinx.yml`) which includes that 
>>>> character. The Sphinx docs outline the default, so best to take that and 
>>>> then add in the octothorp (U+23).
>>>> http://sphinxsearch.com/docs/current.html#conf-charset-table
>>>> https://freelancing-gods.com/thinking-sphinx/v5/advanced_config.html#character-sets-and-tables
>>>> 
>>>> Keep in mind that this will impact all uses of that character in all 
>>>> fields - there’s no way to have it apply to just some fields (or, in this 
>>>> case, words that only start with that character).
>>>> 
>>>> Once you’ve added this configuration, a full rebuild will be required.
>>>> 
>>>> Cheers,
>>>> 
>>>> — 
>>>> Pat
>>>> 
>>>>> On 23 Feb 2021, at 2:41 pm, Walter Lee Davis <[email protected]> wrote:
>>>>> 
>>>>> I'm using GutenTag to apply tags to individual pages in a CMS. The 
>>>>> Document model uses TS5 with Real-Time Indexing. I've set up my index 
>>>>> thusly:
>>>>> 
>>>>> # in the model
>>>>> def tags_for_indexing
>>>>>   tag_names.join ' '
>>>>> end
>>>>> 
>>>>> # in the index
>>>>> ThinkingSphinx::Index.define :document, :with => :real_time do
>>>>>   scope { Document.where(id: Document.publicly.map{ |d| [d.id].concat(d.descendants.published.map(&:id)) }.flatten) }
>>>>> 
>>>>>   indexes title
>>>>>   indexes teaser
>>>>>   indexes body_html
>>>>>   indexes author_display
>>>>>   indexes tags_for_indexing
>>>>> 
>>>>>   has created_at, type: :timestamp
>>>>>   has updated_at, type: :timestamp
>>>>> end
>>>>> 
>>>>> I've tested the method, and confirm that it outputs a space-delimited 
>>>>> string of words for the tags.
>>>>> 
>>>>> I run rake ts:rt:rebuild and everything seems to go fine. But trying to 
>>>>> search on some of these tag names is not returning the results I am 
>>>>> imagining. The client has insisted on making some of these tags start 
>>>>> with an octothorp, because she is writing about "hashtags" on Twitter. 
>>>>> Most tags do not have punctuation in them. I am able to find other terms, 
>>>>> even very obscure ones, when I don't use punctuation in the tag names. 
>>>>> 
>>>>> Does this sound like something that I can fix, or should I advise the 
>>>>> client to lay off the octothorps?
>>>>> 
>>>>> Walter
>>>>> 
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google Groups 
>>>>> "Thinking Sphinx" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send an 
>>>>> email to [email protected].
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/thinking-sphinx/EA71574B-9EBF-484E-A5FA-BF7CD53A10BC%40wdstudio.com.
>>>> 
>>> 
>> 
> 

