Re: Defining a phonetic analyzer and searcher via the schema API

Erick Erickson Mon, 12 Mar 2018 10:36:04 -0700

Chris:

LGTM, except maybe ;).....


You'll want to look closely at your admin UI/Analysis page for the
field (or fieldType) once it's defined. Uncheck the "verbose" box when
you look the first time, it'll be less confusing. That'll show you
_exactly_ what the results are and whether they match your
expectations. "right" is such an existential question after all...

When you're using that page, think outside the box. For instance, I
can't say offhand whether the phonetic filter you chose gives
different results when words are capitalized or not. What about when
they have numbers? Put some punctuation in. Try an e-mail address.
Etc. etc. etc.
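
The same experiments can be scripted against the field-analysis request
handler rather than clicked through in the UI. A minimal sketch, assuming a
Solr instance at localhost:8983 and a core named "mycore" (both hypothetical),
the implicit /analysis/field handler, and the "phonetic" field type from the
original question. The sample inputs are made up for illustration:

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumptions (not from the thread): host, port and core name.
SOLR_CORE_URL = "http://localhost:8983/solr/mycore"

# Illustrative inputs that vary case, digits, punctuation and e-mail syntax.
SAMPLES = ["Schultz", "schultz", "SCHULTZ", "Schultz2",
           "O'Brien-Smith", "chris@example.com"]


def analysis_url(field_type, value, base=SOLR_CORE_URL):
    """Build a field-analysis request URL for one sample value."""
    params = {
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": value,
        "wt": "json",
    }
    return base + "/analysis/field?" + urlencode(params)


def analyze(field_type, value, base=SOLR_CORE_URL):
    """Fetch the analysis result for one value (needs a running Solr)."""
    with urllib.request.urlopen(analysis_url(field_type, value, base)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints the URLs; call analyze() to actually hit Solr.
    for sample in SAMPLES:
        print(analysis_url("phonetic", sample))
```

Comparing the output for "Schultz" and "schultz" side by side answers the
capitalization question directly.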

For instance, if you swap out StandardTokenizer for
WhitespaceTokenizer, you'll now have punctuation in the mix. Most
people don't notice if they have WordDelimiterGraphFilterFactory in
the analysis chain too....
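
One way to try the tokenizer swap is to rebuild the field type from the
original question with a different tokenizer class and send it as a Schema
API "replace-field-type" command. A sketch under those assumptions; the
helper names are hypothetical and only the JSON body is built here:

```python
import json


def phonetic_field_type(tokenizer_class="solr.StandardTokenizerFactory"):
    """Field type from the original question, with a pluggable tokenizer."""
    return {
        "name": "phonetic",
        "class": "solr.TextField",
        "analyzer": {
            "tokenizer": {"class": tokenizer_class},
            "filters": [{
                "class": "solr.BeiderMorseFilterFactory",
                "nameType": "GENERIC",
                "ruleType": "APPROX",
                "concat": "true",
                "languageSet": "auto",
            }],
        },
    }


def replace_field_type_body(tokenizer_class):
    """JSON body for a Schema API "replace-field-type" command."""
    command = {"replace-field-type": phonetic_field_type(tokenizer_class)}
    return json.dumps(command, indent=2)


if __name__ == "__main__":
    print(replace_field_type_body("solr.WhitespaceTokenizerFactory"))
```

The printed body would be POSTed to the core's /schema endpoint. After the
swap, re-run the same inputs through the Analysis page and compare.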

bq: Actually, I have the script that builds the schema in VCS, so it's
roughly the same.

We're on the same page here. I don't particularly care how the schema
gets saved, as long as I can back up to the last known good schema and
start over....
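
For the "last known good schema" part, a snapshot of whatever Solr currently
has can sit alongside the build script in VCS. A minimal sketch, assuming a
core named "mycore" on localhost:8983 and an output file name of your
choosing (all hypothetical); it reads the schema through the Schema API and
writes it with sorted keys so diffs stay readable:

```python
import json
import urllib.request
from pathlib import Path

# Assumptions (not from the thread): host, port and core name.
SOLR_CORE_URL = "http://localhost:8983/solr/mycore"


def fetch_schema(base=SOLR_CORE_URL):
    """Return the current schema as a dict via GET /schema (needs Solr)."""
    with urllib.request.urlopen(base + "/schema?wt=json") as resp:
        return json.load(resp)["schema"]


def save_schema(schema, path):
    """Write the schema with stable key order and return the text written."""
    text = json.dumps(schema, indent=2, sort_keys=True) + "\n"
    Path(path).write_text(text, encoding="utf-8")
    return text


if __name__ == "__main__":
    save_schema(fetch_schema(), "schema-snapshot.json")
```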

I'll mention in passing that there's no problem whatsoever with using
the "classic" schema. The managed stuff is cool, and enables spiffy
front-ends etc. Personally I'm comfortable enough with hand-editing
the schemas that I find it faster, so I usually use the classic schema.

BTW, bin/solr has a set of commands that allow you to upload/download
configs; try "bin/solr zk -help".....
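
As a rough illustration of those commands once ZooKeeper is in the picture,
the sketch below assembles "bin/solr zk upconfig" and "bin/solr zk
downconfig" invocations. The ZooKeeper address, config name and directory
are placeholders, and nothing is executed unless dry_run is turned off:

```python
import subprocess


def zk_config_cmd(action, zk_host, config_name, config_dir,
                  solr_bin="bin/solr"):
    """Build the argv list for 'bin/solr zk upconfig' or 'downconfig'."""
    if action not in ("upconfig", "downconfig"):
        raise ValueError("action must be 'upconfig' or 'downconfig'")
    return [solr_bin, "zk", action,
            "-z", zk_host, "-n", config_name, "-d", config_dir]


def run_zk_config(action, zk_host, config_name, config_dir, dry_run=True):
    """Return the command; run it too when dry_run is False."""
    cmd = zk_config_cmd(action, zk_host, config_name, config_dir)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Placeholder values; prints the command without running it.
    cmd = run_zk_config("downconfig", "localhost:2181",
                        "myconfig", "./myconfig-backup")
    print(" ".join(cmd))
```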

Walter:

"I don't usually test my code, but when I do it's in production".

These young whipper-snappers don't appreciate how _very_ many ways
things can go wrong ;)

My tongue-in-cheek way to distinguish novice from "veteran" programmers:

Novice: The code compiles and she's surprised when it doesn't work the
first time.

Veteran: The code ran perfectly the first time. She immediately goes
over it with a fine-tooth comb to see whether it's still running
canned test cases.

Best,
Erick


On Mon, Mar 12, 2018 at 10:14 AM, Christopher Schultz
<ch...@christopherschultz.net> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Erick,
>
> On 3/12/18 1:00 PM, Erick Erickson wrote:
>> bq: which you aren't supposed to edit directly.
>>
>> Well, kind of. Here's why it's "discouraged":
>> https://lucene.apache.org/solr/guide/6_6/schema-api.html.
>>
>> But as long as you don't mix-and-match hand-editing with using the
>> schema API you can hand edit it freely. You're then in charge of
>> pushing it to ZK and reloading your collections that use it
>> yourself however.
>
> No Zookeeper (yet), but I suspect I'll end up there. I'm mostly
> toying around with it right now, but it won't be long before I'll want
> to go live with it and having a single Solr instance isn't going to
> help me sleep well at night. I'm sure I'll end up with two instances
> to begin with, which requires ZK, right?
>
>> As a side note, even if I _never_ hand-edited it I'd make it a
>> practice to regularly pull it from ZK and put it in some VCS system
>> ;)
>
> Actually, I have the script that builds the schema in VCS, so it's
> roughly the same.
>
> As for the schema modifications... did I get those right?
>
> Thanks,
> - -chris
>
>> On Mon, Mar 12, 2018 at 9:51 AM, Christopher Schultz
>> <ch...@christopherschultz.net> wrote:
>>
>> All,
>>
>> I'd like to add a new synthesized field that uses a phonetic
>> analyzer such as Beider-Morse. I'm using Solr 7.2.
>>
>> When I request the current schema via the schema API, I get a list
>> of existing fields, dynamic fields, and analyzers, none of which
>> appear to be what I'm looking for.
>>
>> Conceptually, I think I'd like to do something like this:
>>
>> add-field: { name: phoneticname, type: phonetic, multiValued: true
>> }
>>
>> ... but how do I define what type of data "phonetic" should be?
>>
>> I can see the example XML definition in this document:
>> https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#FilterDescriptions-Beider-MorseFilter
>>
>> But I'm not sure how to add an analyzer to the schema using the
>> schema API:
>> https://lucene.apache.org/solr/guide/7_2/schema-api.html
>>
>> Under "Add a new field type", it says that new analyzers can be
>> defined, but I'm not entirely sure how to do that ... the API docs
>> refer to the field type definitions page[1] which just shows what
>> XML you'd have to put into your schema XML -- which you aren't
>> supposed to edit directly.
>>
>> When looking at the JSON version of my schema, I can see for
>> example this:
>>
>> "fieldTypes":[{ "name":"ancestor_path", "class":"solr.TextField",
>> "indexAnalyzer":{ "tokenizer":{
>> "class":"solr.KeywordTokenizerFactory"}}, "queryAnalyzer":{
>> "tokenizer":{ "class":"solr.PathHierarchyTokenizerFactory",
>> "delimiter":"/"}}},
>>
>> So should I create a new field type like this?
>>
>> "add-field-type" : { "name" : "phonetic", "class" :
>> "solr.TextField",
>>
>> "analyzer" : { "tokenizer": { "class" :
>> "solr.StandardTokenizerFactory" },
>>
>> "filters" : [{ "class": "solr.BeiderMorseFilterFactory",
>> "nameType": "GENERIC", "ruleType": "APPROX", "concat": "true",
>> "languageSet": "auto" }] } }
>>
>> Then, use copy-field as "usual":
>>
>> "add-field":{ "name":"phonetic", "type":"phonetic",
>> "multiValued":true, "stored":false },
>>
>> "add-copy-field":{ "source":"first_name", "dest":"phonetic" },
>>
>> "add-copy-field":{ "source":"last_name", "dest":"phonetic" },
>>
>> This seems to work but I wanted to know if I was doing it the right
>> way.
>>
>> Thanks, -chris
>>
>> [1]
>> https://lucene.apache.org/solr/guide/7_2/field-type-definitions-and-properties.html#field-type-definitions-and-properties
>>
> -----BEGIN PGP SIGNATURE-----
> Comment: GPGTools - http://gpgtools.org
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>
> iQJRBAEBCAA7FiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlqmtY4dHGNocmlzQGNo
> cmlzdG9waGVyc2NodWx0ei5uZXQACgkQHPApP6U8pFhdIA/9GkZ/yimVmkwB725L
> uS4kcy4YJowyYw+eMtvurpIq/ZV/U8H4hFJY/ddsT+bdrjeZMsTdc7B9Tdlha8xt
> dmuj1VcvDn3uyIUGooTOob6ZvZwjeJEZIJrbwUM5gNq7uJW8xpCU0/3+iP6Km7OY
> 1Nia5uCuwarLWcsRFdtjCvR3M7ZppBYHec3kVGGOUL637AC6ISgpxhuzOnuTHAss
> wCjuR1y6AdTjRbHpis3MJdiVIjEENfyzGpEnqvumsu1e+0F/A0DNbhU9nAPv+73d
> aOLfOW9Fs6jjnq96qzIBAkHLWkqU1GHKYNYHql7/59x8rFcjGkGC7ziSY69lKc+f
> ivrIEqLH1Go7kawz+1og3dPyl/n0CFWE3UK+wj5QeTY5XLduq0x6EmFKW6D790BS
> ywmFuqr4cmvKbs3N6BbxHz5QVbjgRsWO4jp4kJi3KDCepd8vKW+2xwHfX/zAcBKY
> rSDuVkM3KtxQal8xgm4tsvyU3g1dXpNEVa7PFXYJzd3uA2yij9OU6s83NS9LHK3N
> 2zssPfNDj7QddAEhYan0O4r4wSUN2UNT9nMhBVXXYRpoD6WzrhC5TdRUDh66rkOB
> AvhAUKsV0rfjct+MUBpQA9W+SUG7i911wNSBJJmB58MYbyxMAJb8NKGk1yEs1MyH
> FQHEgiEEFRCD9ZFd/fqwfuPyKQo=
> =Vqz6
> -----END PGP SIGNATURE-----
