Having the setting in the default block should be fine - you should be able to see the charset_table setting in the generated Sphinx configuration files.
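
One way to check this is to scan the generated configuration for the setting. The sketch below is only an illustration and not part of Thinking Sphinx: the helper name `charset_table_lines` is hypothetical, and the path `config/development.sphinx.conf` is an assumption about where the generated file lives in your app, so adjust it as needed.

```ruby
# Hypothetical helper: returns every charset_table line found in the text of
# a Sphinx configuration file, with surrounding whitespace removed.
def charset_table_lines(config_text)
  config_text.each_line
             .map(&:strip)
             .select { |line| line.match?(/\Acharset_table\s*=/) }
end

# Assumed location of the generated file; adjust for your environment.
path = "config/development.sphinx.conf"

if File.exist?(path)
  lines = charset_table_lines(File.read(path))
  if lines.empty?
    puts "No charset_table setting found in #{path}"
  else
    puts lines
  end
else
  puts "#{path} not found (has the configuration been generated yet?)"
end
```

If the setting from the default block has been picked up, each index definition in that file should show a charset_table line that includes U+23.
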

Also: I generally recommend just using ts:rebuild, as that handles both real-time indices and SQL-backed indices (i.e. it’s running the same things as ts:rt:rebuild) - if you’re finding ts:rebuild is not working well for you, I’m keen to hear why!

All that said, doesn’t sound like you’re doing anything wrong. I wonder if html_strip is somehow filtering out the octothorps? Though I’m pretty sure it’s looking just for HTML tags… still, may be worth turning that off to double-check.

And I’ve just run some quick tests locally - without the custom charset_table value, I find the string “#test” is found by Sphinx when searching by “#test” or “test” (because # is ignored, given it’s not an indexable character - so the two searches are actually identical). Adding in the charset_table setting, rebuilding - searching for #test returns a result, but test doesn’t (as that now doesn’t exist as a standalone word in what’s indexed).

I doubt it matters, but: which version of Sphinx are you using?

--
Pat

> On 23 Feb 2021, at 3:10 pm, Walter Lee Davis <[email protected]> wrote:
>
> Thanks for the speedy reply. I tried adding the charset table as recommended,
> but I am not seeing any difference in my search results. I did differ from
> the directions slightly, in that I put the character set in the default block
> at the top of my Yaml file, since it's then included in all of the
> environments. I figured that should work, but in case it doesn't can you
> explain why?
>
> default: &default
>   morphology: stem_en
>   html_strip: true
>   batch_size: 300
>   charset_table: "0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+23"
>
> development:
>   <<: *default
>
> test:
>   <<: *default
>
> production:
>   <<: *default
>
> staging:
>   <<: *default
>   mysql41: 9320
>
> I forced a full rebuild/reindex with rake ts:rt:rebuild. When that didn't
> seem to change things, I also ran rake ts:rebuild.
> My understanding is that the first of these should be done when you use the
> Real Time index. If I'm mistaken, please let me know.
>
> Thanks again!
>
> Walter
>
>> On Feb 22, 2021, at 10:51 PM, Pat Allan <[email protected]> wrote:
>>
>> Hi Walter,
>>
>> I’m pretty sure Sphinx doesn’t index punctuation by default. If you want
>> octothorps included, you’ll need to define a custom charset_table value (per
>> environment in `config/thinking_sphinx.yml`) which includes that character.
>> The Sphinx docs outline the default, so best to take that and then add in
>> the octothorp (U+23).
>> http://sphinxsearch.com/docs/current.html#conf-charset-table
>> https://freelancing-gods.com/thinking-sphinx/v5/advanced_config.html#character-sets-and-tables
>>
>> Keep in mind that this will impact all uses of that character in all fields
>> - there’s no way to have it apply to just some fields (or, in this case,
>> words that only start with that character).
>>
>> Once you’ve added this configuration, a full rebuild will be required.
>>
>> Cheers,
>>
>> --
>> Pat
>>
>>> On 23 Feb 2021, at 2:41 pm, Walter Lee Davis <[email protected]> wrote:
>>>
>>> I'm using GutenTag to apply tags to individual pages in a CMS. The Document
>>> model uses TS5 with Real-Time Indexing. I've set up my index thusly:
>>>
>>> # in the model
>>> def tags_for_indexing
>>>   tag_names.join ' '
>>> end
>>>
>>> # in the index
>>> ThinkingSphinx::Index.define :document, :with => :real_time do
>>>   scope { Document.where(id: Document.publicly.map{ |d|
>>>     [d.id].concat(d.descendants.published.map(&:id)) }.flatten) }
>>>
>>>   indexes title
>>>   indexes teaser
>>>   indexes body_html
>>>   indexes author_display
>>>   indexes tags_for_indexing
>>>
>>>   has created_at, type: :timestamp
>>>   has updated_at, type: :timestamp
>>> end
>>>
>>> I've tested the method, and confirm that it outputs a space-delimited
>>> string of words for the tags.
>>>
>>> I run rake ts:rt:rebuild and everything seems to go fine.
>>> But trying to search on some of these tag names is not returning the
>>> results I am imagining. The client has insisted on making some of these
>>> tags start with an octothorp, because she is writing about "hashtags" on
>>> Twitter. Most tags do not have punctuation in them. I am able to find other
>>> terms, even very obscure ones, when I don't use punctuation in the tag
>>> names.
>>>
>>> Does this sound like something that I can fix, or should I advise the
>>> client to lay off the octothorps?
>>>
>>> Walter
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups
>>> "Thinking Sphinx" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an
>>> email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/thinking-sphinx/EA71574B-9EBF-484E-A5FA-BF7CD53A10BC%40wdstudio.com.
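
The behaviour Pat reports from his local tests (with and without U+23 in charset_table) can be illustrated with a toy tokenizer. This is only a rough model of the idea that characters outside the table act as word separators. It is not how Sphinx is implemented: it approximates case folding with `downcase` and ignores stemming and the Cyrillic ranges. The names `tokenize`, `matches?`, `DEFAULT_CHARS` and `WITH_HASH` are hypothetical and exist only for this illustration.

```ruby
require "set"

# Toy model: any character not in `allowed` acts as a word separator.
def tokenize(text, allowed)
  tokens = []
  current = +""
  text.downcase.each_char do |char|
    if allowed.include?(char)
      current << char
    elsif !current.empty?
      tokens << current
      current = +""
    end
  end
  tokens << current unless current.empty?
  tokens
end

# A query matches when every query token also appears among the document tokens.
def matches?(query, document, allowed)
  (tokenize(query, allowed) - tokenize(document, allowed)).empty?
end

# Simplified stand-ins for the table without and with U+23 ("#").
DEFAULT_CHARS = Set.new([*("0".."9"), *("a".."z"), "_"])
WITH_HASH = DEFAULT_CHARS | ["#"]

p tokenize("#test", DEFAULT_CHARS)           # => ["test"]
p tokenize("#test", WITH_HASH)               # => ["#test"]

p matches?("test", "#test", DEFAULT_CHARS)   # => true
p matches?("#test", "#test", DEFAULT_CHARS)  # => true
p matches?("#test", "#test", WITH_HASH)      # => true
p matches?("test", "#test", WITH_HASH)       # => false
```

In this model, once "#" is an indexable character the stored token is "#test", so a search for plain "test" no longer finds it, which mirrors the result described in the reply.
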
