Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory
Hi Ahmet,

Ok. Thanks for your advice.

Regards,
Edwin

On 25 November 2017 at 10:23, Ahmet Arslan wrote:
Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory
Hi Rick,

Neither tokenizer splits on the hyphens in an email address like this: solr-user@lucene.apache.org
The entire email address remains a single, intact token with both tokenizers.

Regards,
Edwin

On 24 November 2017 at 20:19, Rick Leir wrote:
Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory
Hi Zheng,

UAX29URLEmailTokenizer recognizes URLs and e-mails. It does not break them apart; it keeps each one as a single token. StandardTokenizer produces two or more tokens for such an entity.

Please try them both on the analysis page, and use whichever one suits your requirements.

Ahmet

On Friday, November 24, 2017, 11:46:57 AM GMT+3, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
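[For reference, a minimal sketch of the field type Ahmet describes, pairing UAX29URLEmailTokenizerFactory with LowerCaseFilterFactory. The fieldType name is illustrative, not from the thread:

    <!-- Illustrative field type: URLs and e-mail addresses survive as single, lowercased tokens -->
    <fieldType name="text_email" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

With this chain, solr-user@lucene.apache.org is indexed as the single token solr-user@lucene.apache.org. StandardTokenizerFactory would split it at the @ and the dots; ClassicTokenizerFactory also keeps e-mail addresses intact (as Edwin observes below) but, unlike UAX29URLEmailTokenizer, does not recognize full URLs.]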
Re: docValues
Erick, thanks for explaining the memory aspects.

Regarding the end-user perspective, our intention is to provide a first layer of filtering, where data will be rolled up into buckets and displayed in charts and tables. When I mentioned providing access to "full" documents, it was not to display them on the web, but to allow the researcher to download the data so he can dive into it with his own tools (R, SPSS, whatever). With this in mind, using the /select handler is the only solution I can see to also get fields other than the docValues ones.

Now that it is a little clearer to me that memory will not be badly affected if I use docValues, I will start to think about disk usage growth and how much it impacts the infrastructure.

Thanks again,

2017-11-24 16:16 GMT-02:00 Erick Erickson:
Re: Strip out punctuation at the end of token
You need to play with the (many) parameters for WordDelimiterFilterFactory.

For instance, you have preserveOriginal set to 1. That's what's generating the token with the dot. You have catenateAll and catenateNumbers set to zero. That means that someone searching for 61149008 won't get a hit.

The fact that the dot is in the tokens generated doesn't really matter as long as the query tokens produced will match. I think you're getting a bit off track by focusing on the hyphen and dot; you're only seeing them in the index at all because you have preserveOriginal set to 1.

Let's say that you set preserveOriginal to 0 and catenateNumbers to 1. Then you'd get:
61149
008
61149008
in your index. No dots, no hyphens. Note your _query_ analysis also has catenateNumbers as 1 and preserveOriginal as 0. The user searches for 61149-008, the emitted tokens are in the index, and you're OK. The user searches for 61149008 and gets a hit there too. The dot is irrelevant.

Now, all that said, if that isn't comfortable you could certainly add PatternReplaceFilterFactory, but WDFF is really designed for this kind of thing. I think you'll be just fine if you play with the options enough to understand the nuances, which can be tricky, I'll admit.

Best,
Erick

On Fri, Nov 24, 2017 at 7:13 AM, Sergio García Maroto wrote:
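[A concrete sketch of the settings Erick suggests. Sergio's original tokenizer was not shown in full, so solr.WhitespaceTokenizerFactory here is an assumption; use the same chain for index and query analysis:

    <!-- Sketch of Erick's suggestion; the tokenizer choice is an assumption -->
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" generateNumberParts="1"
              catenateWords="1" catenateNumbers="1" catenateAll="0"
              splitOnCaseChange="1" preserveOriginal="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>

Running "Information number 61149-008." through this on the analysis page, WDF treats the hyphen and the trailing dot as delimiters, so the index gets 61149, 008 and the catenated 61149008, and a query for either 61149-008 or 61149008 matches.]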
Re: docValues
Kojo:

bq: My question is, isn't it too expensive in terms of memory consumption to enable docValues on fields that I don't need to facet, search etc?

Well, yes and no. The memory consumed is your OS memory space and a small bit of control structures on your Java heap. It's a bit scary that your _index_ size will increase significantly on disk, but your Java heap requirements won't be correspondingly large.

But there's a bigger issue here. Streaming is built to handle very large result sets in a map/reduce style, i.e. subdividing the work amongst lots of nodes. If you want to return _all_ the records to the user along with description information and the like, what are they going to do with them? 10,000,000 rows (small by some streaming operations' standards) is far too many to, say, display in a browser. And it's an anti-pattern to ask for, say, 10,000,000 rows with the select handler.

You can page through these results, but it'll take a long time. So basically my question is whether this capability is useful enough to spend time on. If it is and you are going to return lots of rows, consider paging through with cursorMark capabilities, see:
https://lucidworks.com/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

Best,
Erick

On Fri, Nov 24, 2017 at 9:38 AM, Kojo wrote:
Re: docValues
I think that I found the solution: after the analysis step, switch from the /export request handler to the /select request handler in order to obtain the other fields. I will try that.

2017-11-24 15:15 GMT-02:00 Kojo:
Re: docValues
Thank you very much for your answer, Shawn.

That is it; I was looking for a way to include non-docValues fields in the filtered result documents. I can enable docValues on the other fields and reindex everything if necessary. Let me tell you about the use case, because I am not sure that I am on the right track.

As I said before, I am using Streaming Expressions to deal with different collections. Up to this moment, it is decided that we will use this approach.

The goal is to provide our users a web interface where they can make some queries. The backend will get Solr data using the Streaming Expressions REST API and will return rolled-up data to the frontend, which will display some charts and aggregated data.

After that, the end user may want to have the data used to generate this aggregated information (not all fields of the filtered documents, but the fields used to aggregate information), combined with some other fields (title, description of the document, for example) which are not docValues. As you said, I need to add docValues to them. My question is, isn't it too expensive in terms of memory consumption to enable docValues on fields that I don't need to facet, search, etc.?

I think that reconstructing a standard query that achieves the results of a complex Streaming Expression is not simple. This is why I want to use the same query used for the analysis to return the full data via the export handler.

I am sorry if this is confusing.

Thank you,

2017-11-24 12:36 GMT-02:00 Shawn Heisey:
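[For context, a minimal sketch of the kind of Streaming Expression in play here; the collection and field names are made up for illustration. Every field named in fl and sort must have docValues when the expression reads from /export:

    search(researchdata,
           q="category:survey",
           fl="id,category,amount",
           sort="id asc",
           qt="/export")

Switching qt to "/select" (Kojo's idea above, with rows set appropriately) lifts the docValues requirement on fl, since /select can return stored fields, but it pages through results rather than streaming the full set.]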
Re: Strip out punctuation at the end of token
Yes. You are right. I understand now. Let me explain my issue a bit better with the exact problem I have.

I have this text: "Information number 61149-008."
Using the tokenizers and filters described previously, I get this list of tokens:
information
number
61149-008.
61149
008

Basically the last token "61149-008." gets tokenized as
61149-008.
61149
008
The user is searching for "61149-008" without the dot, so this is not a match. I don't want to change the tokenization on the query side, to avoid altering the matches for other cases.

I would like to delete the dot at the end, basically generating this extra token:
information
number
61149-008.
61149
008
61149-008

Not sure if what I am saying makes sense or if there is another way to do this right.

Thanks a lot
Sergio

On 24 November 2017 at 15:31, Shawn Heisey wrote:
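[A sketch of the PatternReplaceFilterFactory route Erick mentions above, offered as an assumption rather than a tested fix. Note it rewrites each token in place ("61149-008." becomes "61149-008") instead of emitting an extra token; to keep both forms, WordDelimiterFilterFactory's options (see Erick's reply above) are the better tool:

    <!-- Strips trailing periods from every token; rewrites in place, does not duplicate -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="\.+$" replacement="" replace="all"/>
]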
Re: docValues
On 11/23/2017 1:51 PM, Kojo wrote:
> I am working on Solr to develop a tool to do analysis. I am using the search function of Streaming Expressions, which requires a field to be indexed with docValues enabled, so I can get it.
>
> Suppose that after someone finishes the analysis, and would like to get other fields of the result set that are not docValues enabled. How can it be done?

We did get this message, but it's confusing as to exactly what you're asking, which is why nobody responded.

If you're saying that this theoretical person wants to use another field with the streaming expression analysis you have provided, and that field does not have docValues, then you'll need to add docValues to the field and completely reindex.

If you're asking something else, then you're going to need to provide more details so we can actually know what you want to have happen.

Thanks,
Shawn
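[A sketch of what "add docValues to the field" looks like in a classic schema.xml; the field name is illustrative. Note that docValues is supported on string, numeric and date types, not on tokenized solr.TextField fields, so a free-text title or description would need a string variant (e.g. via copyField) to be usable this way:

    <!-- Illustrative: a string field usable in /export and streaming fl lists -->
    <field name="title_str" type="string" indexed="true" stored="true" docValues="true"/>
]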
Re: Strip out punctuation at the end of token
On 11/24/2017 2:32 AM, marotosg wrote:
> Hi Shawn.
> Thanks for your reply. Actually my issue is with the last token. It looks like, for the last token of a string, it keeps the dot.
>
> In your case: Testing. This is a test. Test.
>
> Keeps the "Test."
>
> Is there any reason I can't see for that behaviour?

I am really not sure what you're saying here.

Every token is duplicated: one has the dot and one doesn't. This is what you wanted, based on what I read in your initial email.

Making a guess as to what you're asking about this time: if you're noticing that there isn't a "Test" as the last token on the line for WDF, then I have to tell you that it actually is there; the display was simply too wide for the browser window. Scrolling horizontally would be required to see the whole thing.

Thanks,
Shawn
Re: Solr7 org.apache.lucene.index.IndexUpgrader
On 11/23/2017 11:31 PM, Leo Prince wrote:
> We were using a bit older version, Solr 4.10.2, and are upgrading to Solr7. We have like 4mil records in one of the cores, which is of course pretty huge, hence re-sourcing the index is nearly impossible and re-querying from the source Solr to Solr7 is also going to be an exhausting effort.

I hate to burst your bubble here ... but 4 million docs is pretty small for a Solr index. I have one index that's a hundred times larger, and there are people with *billions* of documents in SolrCloud.

> Hence, I tried to upgrade the index using org.apache.lucene.index.IndexUpgrader. IndexUpgrader ran just fine without any errors, but I got this error when initializing the core:
>
> java.lang.IllegalStateException: java.lang.IllegalStateException: unexpected docvalues type NONE for field '_version_' (expected=NUMERIC). Re-index with correct docvalues type.
>
> That being said, I am using the classic schema and used the default managed-schema file as my classic schema.xml.

This error means that the existing index didn't have docValues on the _version_ field, but the new version does. At some point in 6.x, a whole bunch of field classes were changed to have docValues by default. You'll need to explicitly add 'docValues="false"' to the field definition to use an older index with a newer version. But based on some things you said later, this may be the least of the problems you're running into.

> When comparing the schema of 4.10.2 with that of 7.1.0, I see the field type names have changed to pint, pfloat, plong and pdouble. Earlier, until Solr6, it was int, float, long and double (*without the P at the beginning*). I read in the docs that the old field type names are deprecated in Solr7 and we have to use everything starting with "P", which enhances performance. Hence, in this context:
>
> 1. The error I got, java.lang.IllegalStateException — is it because my upgraded index data contains old field type names while the new Solr7 schema contains new field type names? That said, my IndexUpgrader completed without any errors.

You *cannot* change the classes being used for your fields (which the fieldType changes you have described will do) on an existing index and expect Solr to work. If you change the class on a field, you must eliminate the current index and reindex from scratch.

> 2. How do I sort out the error in 1, if my assessment is correct? Since my data is too large to re-source or re-query, is there any other workaround to migrate the index if IndexUpgrader is not an option for upgrading the index to 7?

You would need to keep the schema the same for the upgrade, except that you would need to disable docValues on some of your fields to get rid of the error you encountered. You won't be able to take advantage of some of the new capabilities in the new version unless you re-engineer your config/schema and reindex.

Upgrading an index, especially through three major versions, is generally not recommended. I always reindex when upgrading Solr, especially to a new major version, because Solr evolves quickly.

Thanks,
Shawn
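[A sketch of the fix Shawn describes for the upgraded-but-not-reindexed case. This is an assumption about what Leo's 4.x-era schema looks like, not a confirmed snippet; the point is to keep the legacy Trie class and switch docValues off so the upgraded segments still open:

    <!-- Keep the legacy 4.x class; docValues="false" matches what the old index actually contains -->
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0" docValues="false"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>
]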
Fwd: docValues
Hi, yesterday I sent the message below to this list, but just after I sent it I received an e-mail from the mail server saying that my e-mail bounced. I don't know what that means, and since I received no answer to the question, I don't know whether the message reached the list or not.

I appreciate your attention.

Thank you,

-- Forwarded message --
From: Kojo
Date: 2017-11-23 18:51 GMT-02:00
Subject: docValues
To: solr-user@lucene.apache.org

Hi, I am working on Solr to develop a tool to do analysis. I am using the search function of Streaming Expressions, which requires a field to be indexed with docValues enabled, so I can get it.

Suppose that after someone finishes the analysis, and would like to get other fields of the result set that are not docValues enabled. How can it be done?

Thanks
Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory
Edwin,

There is a spec for which characters are acceptable in an email name, and another spec for chars in a domain name. I suspect you will have more success with a tokenizer which is specialized for email, but I have not looked at UAX29URLEmailTokenizerFactory. Does ClassicTokenizerFactory split on hyphens?

Cheers --Rick

On November 24, 2017 3:46:46 AM EST, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

--
Sorry for being brief. Alternate email is rickleir at yahoo dot com
Re: Strip out punctuation at the end of token
Hi Shawn.
Thanks for your reply. Actually my issue is with the last token. It looks like, for the last token of a string, it keeps the dot.

In your case: Testing. This is a test. Test.

Keeps the "Test."

Is there any reason I can't see for that behaviour?

Thanks,
Sergio

Shawn Heisey-2 wrote
> On 11/23/2017 8:06 AM, marotosg wrote:
>> I am trying to strip out any "." at the end of a token but I would like to keep the original token as well.
>> This is my index analyzer:
>>
>>   <tokenizer ... />
>>   <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
>>   <filter ... preserveOriginal="false"/>
>>
>> I was thinking of using the solr.PatternReplaceFilterFactory but I see this one won't keep the original token.
>
> The WordDelimiterFilterFactory that you have configured will do that.
>
> Here I have taken your analysis chain, added it to a test install of Solr, and tried it out. It appears to be doing exactly what you want it to do.
>
> https://www.dropbox.com/s/5puf7rzbypdcspu/wdf-analysis-marotosg.png?dl=0
>
> Thanks,
> Shawn

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory
Hi,

I am indexing email addresses into Solr via EML files. Currently, I am using ClassicTokenizerFactory with LowerCaseFilterFactory. However, I found that we can also use UAX29URLEmailTokenizerFactory with LowerCaseFilterFactory.

Does anyone have any recommendation on which tokenizer is better?

I am currently using Solr 6.5.1, and planning to upgrade to Solr 7.1.0.

Regards,
Edwin