Re: Split words with period in between into separate tokens
Why didn't I think of that? That's another alternative.
Thank you for your suggestion. Appreciate it.

On 10/13/2016 5:41 AM, Georg Sorst wrote:
> You could use a PatternReplaceCharFilter before your tokenizer to replace
> the dot with a space character.
>
> Derek Poh wrote on Wed, 12 Oct 2016 11:38:
>> It seems that LetterTokenizerFactory also splits on numbers and discards
>> them. The field does have values with numbers in them, so it is not
>> applicable.
>> Thank you.
>>
>> On 10/12/2016 4:22 PM, Dheerendra Kulkarni wrote:
>>> You can use LetterTokenizerFactory instead.
>>>
>>> Regards,
>>> Dheerendra Kulkarni
>>>
>>> On Wed, Oct 12, 2016 at 6:24 AM, Derek Poh wrote:
>>>> Hi
>>>>
>>>> How can I split words with a period in between into separate tokens?
>>>> E.g. "Co.Ltd" => "Co" "Ltd".
>>>>
>>>> I am using StandardTokenizerFactory and it does not split on such
>>>> periods: periods (dots) that are not followed by whitespace are kept
>>>> as part of the token, including Internet domain names.
>>>>
>>>> This is the field definition,
>>>>
>>>> positionIncrementGap="100">
>>>> words="stopwords.txt" />
>>>> words="stopwords.txt" />
>>>> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>>>
>>>> Solr version is 10.4.10.
>>>>
>>>> Derek

--
CONFIDENTIALITY NOTICE
This e-mail (including any attachments) may contain confidential and/or privileged information. If you are not the intended recipient or have received this e-mail in error, please inform the sender immediately and delete this e-mail (including any attachments) from your computer, and you must not use, disclose to anyone else or copy this e-mail (including any attachments), whether in whole or in part.
This e-mail and any reply to it may be monitored for security, legal, regulatory compliance and/or other appropriate reasons.
Re: Split words with period in between into separate tokens
You could use a PatternReplaceCharFilter before your tokenizer to replace the dot with a space character.
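A minimal sketch of this suggestion as a Solr field type, assuming a schema.xml-style definition. The field type name `text_dot_split` is hypothetical, and the rest of the original analyzer chain (stop words, synonyms) is left out because it is not fully visible in this thread.

```xml
<!-- Hypothetical field type name; adapt to the existing field definition -->
<fieldType name="text_dot_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run before the tokenizer: replace every "." with a space -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\." replacement=" "/>
    <!-- The tokenizer then sees "Co Ltd" instead of "Co.Ltd" -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```

With this chain, "Co.Ltd" reaches the tokenizer as "Co Ltd" and should come out as the tokens "Co" and "Ltd". Note that the pattern replaces every dot, so decimal numbers such as "1.5" and Internet domain names would be split as well; a narrower pattern, for example one that only matches a dot between two letters, may suit the field better.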
Re: Split words with period in between into separate tokens
It seems that LetterTokenizerFactory also splits on numbers and discards them. The field does have values with numbers in them, so it is not applicable.
Thank you.
Re: Split words with period in between into separate tokens
You can use LetterTokenizerFactory instead.

Regards,
Dheerendra Kulkarni
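For reference, a sketch of what this suggestion could look like in an analyzer, with a hypothetical field type name (`text_letters_only`). LetterTokenizerFactory emits runs of letters and drops everything else, which splits "Co.Ltd" as wanted but also discards digits.

```xml
<!-- Hypothetical field type name, for illustration only -->
<fieldType name="text_letters_only" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Tokens are maximal runs of letters; any non-letter acts as a separator and is discarded -->
    <tokenizer class="solr.LetterTokenizerFactory"/>
  </analyzer>
</fieldType>
```

For example, "Co.Ltd" becomes "Co" and "Ltd", while a value like "A4 size" would lose the "4", so this only fits fields whose values contain no numbers.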