Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory

2017-11-24 Thread Zheng Lin Edwin Yeo
Hi Ahmet,

Ok. Thanks for your advice.

Regards,
Edwin

On 25 November 2017 at 10:23, Ahmet Arslan  wrote:

>
>
> Hi Zheng,
>
> UAX29URLEmailTokenizer recognizes URLs and e-mails. It does not break them
> up; it keeps each one as a single token.
>
> StandardTokenizer produces two or more tokens for such an entity.
>
> Please try them both on the analysis page and use whichever one suits your
> requirements.
>
> Ahmet
>
>
>
> On Friday, November 24, 2017, 11:46:57 AM GMT+3, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com> wrote:
>
>
>
>
>
> Hi,
>
> I am indexing email addresses into Solr via EML files. Currently, I am
> using ClassicTokenizerFactory with LowerCaseFilterFactory. However, I also
> found that we can also use UAX29URLEmailTokenizerFactory with
> LowerCaseFilterFactory.
>
> Does anyone have any recommendation on which Tokenizer is better?
>
> I am currently using Solr 6.5.1, and planning to upgrade to Solr 7.1.0.
>
> Regards,
> Edwin
>


Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory

2017-11-24 Thread Zheng Lin Edwin Yeo
Hi Rick,

Neither of the tokenizers splits on the hyphen for an email address like
this:
solr-user@lucene.apache.org

The entire email address remains intact with both tokenizers.

Regards,
Edwin

On 24 November 2017 at 20:19, Rick Leir  wrote:

> Edwin
> There is a spec for which characters are acceptable in an email name, and
> another spec for chars in a domain name. I suspect you will have more
> success with a tokenizer which is specialized for email, but I have not
> looked at UAX29URLEmailTokenizerFactory. Does ClassicTokenizerFactory split
> on hyphens?
> Cheers --Rick
>
> On November 24, 2017 3:46:46 AM EST, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com> wrote:
> >Hi,
> >
> >I am indexing email addresses into Solr via EML files. Currently, I am
> >using ClassicTokenizerFactory with LowerCaseFilterFactory. However, I
> >also
> >found that we can also use UAX29URLEmailTokenizerFactory with
> >LowerCaseFilterFactory.
> >
> >Does anyone have any recommendation on which Tokenizer is better?
> >
> >I am currently using Solr 6.5.1, and planning to upgrade to Solr 7.1.0.
> >
> >Regards,
> >Edwin
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com


Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory

2017-11-24 Thread Ahmet Arslan


Hi Zheng,

UAX29URLEmailTokenizer recognizes URLs and e-mails. It does not break them up;
it keeps each one as a single token.

StandardTokenizer produces two or more tokens for such an entity.

Please try them both on the analysis page and use whichever one suits your requirements.
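
For reference, a minimal field type wiring UAX29URLEmailTokenizer together with lowercasing might look like this (a sketch; the name text_email is illustrative, not from the thread):

```xml
<fieldType name="text_email" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Keeps URLs and e-mail addresses as single tokens -->
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```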

Ahmet



On Friday, November 24, 2017, 11:46:57 AM GMT+3, Zheng Lin Edwin Yeo 
 wrote: 





Hi,

I am indexing email addresses into Solr via EML files. Currently, I am
using ClassicTokenizerFactory with LowerCaseFilterFactory. However, I also
found that we can also use UAX29URLEmailTokenizerFactory with
LowerCaseFilterFactory.

Does anyone have any recommendation on which Tokenizer is better?

I am currently using Solr 6.5.1, and planning to upgrade to Solr 7.1.0.

Regards,
Edwin


Re: docValues

2017-11-24 Thread Kojo
Erick,
thanks for explaining the memory aspects.

Regarding the end user perspective, our intention is to provide a first
layer of filtering, where data will be rolled up in some buckets and be
displayed in charts and tables.
When I mentioned providing access to "full" documents, it was not to display
them on the web, but to allow the researcher to download the data so he can
dive into it with his own tools (R, SPSS, whatever).

With this in mind, using the /select handler is the only solution I can see
to get data from fields that do not have docValues.

Now that it is a little clearer to me that memory will not be heavily
affected if I use docValues, I will start to think about disk usage growth
and how much it impacts the infrastructure.

Thanks again,









2017-11-24 16:16 GMT-02:00 Erick Erickson :

> Kojo:
>
> bq: My question is, isn't it too
> expensive in terms of memory consumption to enable docValues on fields that
> I don't need to facet, search, etc.?
>
> Well, yes and no. The memory consumed is your OS memory space and a
> small bit of control structures on your Java heap. It's a bit scary
> that your _index_ size will increase significantly on disk, but your
> Java heap requirements won't be correspondingly large.
>
> But there's a bigger issue here. Streaming is built to handle very
> large result sets in a map/reduce style form, i.e. subdivide the work
> amongst lots of nodes. If you want to return _all_ the records to the
> user along with description information and the like, what are they
> going to do with them? 10,000,000 rows (small by some streaming
> operations standards) is far too many to, say, display in a browser.
> And it's an anti-pattern to ask for, say, 10,000,000 rows with the
> select handler.
>
> You can page through these results, but it'll take a long time. So
> basically my question is whether this capability is useful enough to
> spend time on. If it is and you are going to return lots of rows
> consider paging through with cursorMark capabilities, see:
> https://lucidworks.com/2013/12/12/coming-soon-to-solr-
> efficient-cursor-based-iteration-of-large-result-sets/
>
> Best,
> Erick
>
> On Fri, Nov 24, 2017 at 9:38 AM, Kojo  wrote:
> > I think I found the solution: after the analysis, change from the /export
> > request handler to the /select request handler in order to obtain other
> > fields.
> > I will try that.
> >
> >
> >
> > 2017-11-24 15:15 GMT-02:00 Kojo :
> >
> >> Thank you very much for your answer, Shawn.
> >>
> >> That is it: I was looking for another way to include non-docValues
> >> fields in the filtered result documents.
> >> I can enable docValues on other fields and reindex everything if
> >> necessary. I will tell you about the use case, because I am not sure
> >> that I am on the right track.
> >>
> >> As I said before, I am using Streaming Expressions to deal with
> different
> >> collections. Up to this moment, it is decided that we will use this
> >> approach.
> >>
> >> The goal is to provide our users a web interface where they can make
> some
> >> queries. The backend will get Solr data using the Streaming Expressions
> >> rest api and will return rolled up data to the frontend, which will
> display
> >> some charts and aggregated data.
> >> After that, the end user may want to have data used to generate this
> >> aggregated information (not all fields of the filtered documents, but
> the
> >> fields used to aggregate information), combined with some other fields
> >> (title, description of document for example) which are not docValues. As
> >> you said, I need to add docValues to them. My question is, isn't it too
> >> expensive in terms of memory consumption to enable docValues on fields
> >> that I don't need to facet, search, etc.?
> >>
> >> I think that reconstructing a standard query that achieves the results
> >> from a complex Streaming Expression is not simple. This is why I want to
> >> use the same query used to make analysis, to return full data via export
> >> handler.
> >>
> >> I am sorry if this is confusing.
> >>
> >> Thank you,
> >>
> >>
> >>
> >>
> >> 2017-11-24 12:36 GMT-02:00 Shawn Heisey :
> >>
> >>> On 11/23/2017 1:51 PM, Kojo wrote:
> >>>
>  I am working on Solr to develop a tool to do analysis. I am using
>  search
>  function of Streaming Expressions, which requires a field to be
> indexed
>  with docValues enabled, so I can get it.
> 
>  Suppose that after someone finishes the analysis, and would like to
> get
>  other fields of the resultset that are not docValues enabled. How can
> it
>  be
>  done?
> 
> >>>
> >>> We did get this message, but it's confusing as to exactly what you're
> >>> asking, which is why nobody responded.
> >>>
> >>> If you're saying that this theoretical person wants to use another
> field
> >>> with the streaming expression analysis you have provided, and that
> field
> >>> 

Re: Strip out punctuation at the end of token

2017-11-24 Thread Erick Erickson
You need to play with the (many) parameters for WordDelimiterFilterFactory.

For instance, you have preserveOriginal set to 1. That's what's
generating the token with the dot.

You have catenateAll and catenateNumbers set to zero. That means that
someone searching for 61149008 won't get a hit.

The fact that the dot is in the tokens generated doesn't really matter
as long as the query tokens produced will match.

I think you're getting a bit off track by focusing on the hyphen and
dot, you're only seeing them in the index at all since you have
preserveOriginal set to 1. Let's say that you set preserveOriginal to
0 and catenateNumbers to 1. Then you'd get:
61149
008
61149008

in your index. No dots, no hyphens.
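
As a concrete sketch of the index-time configuration Erick describes (the attribute values shown are his suggested changes, not Sergio's actual schema):

```xml
<!-- "61149-008." -> 61149, 008, 61149008 (no dots, no hyphens kept) -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"
        preserveOriginal="0"/>
```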

Note that your _query_ analysis also has catenateNumbers set to 1 and
preserveOriginal set to 0. The user searches for
61149-008

and the emitted tokens are in the index and you're OK. The user
searches for 61149008 and gets a hit there too. The dot is irrelevant.

Now, all that said, if that isn't comfortable you could certainly add a
PatternReplaceFilterFactory, but WDFF is really designed for this kind
of thing. I think you'll be just fine if you play with the options
enough to understand the nuances, which can be tricky, I'll admit.


Best,
Erick

On Fri, Nov 24, 2017 at 7:13 AM, Sergio García Maroto
 wrote:
> Yes. You are right. I understand now.
> Let me explain my issue a bit better with the exact problem I have.
>
> I have this text "Information number  61149-008."
> Using the tokenizers and filters described previously I get this list of
> tokens.
> information
> number
> 61149-008.
> 61149
> 008
>
> Basically the last token "61149-008." gets tokenized as:
> 61149-008.
> 61149
> 008
> The user is searching for "61149-008" without the dot, so this is not a match.
> I don't want to change the tokenization on the query to avoid altering the
> matches for other cases.
>
> I would like to delete the dot at the end and basically generate this extra
> token:
> information
> number
> 61149-008.
> 61149
> 008
> 61149-008
>
> Not sure if what I am saying makes sense, or if there is another way to do
> this right.
>
> Thanks a lot
> Sergio
>
>
> On 24 November 2017 at 15:31, Shawn Heisey  wrote:
>
>> On 11/24/2017 2:32 AM, marotosg wrote:
>>
> >>> Hi Shawn.
> >>> Thanks for your reply. Actually my issue is with the last token: for the
> >>> last token of a string, it keeps the dot.
>>>
>>> In your case Testing. This is a test. Test.
>>>
>>> Keeps the "Test."
>>>
> >>> Is there any reason I can't see for that behaviour?
>>>
>>
>> I am really not sure what you're saying here.
>>
>> Every token is duplicated, one has the dot and one doesn't.  This is what
>> you wanted based on what I read in your initial email.
>>
>> Making a guess as to what you're asking about this time: If you're
>> noticing that there isn't a "Test" as the last token on the line for WDF,
>> then I have to tell you that it actually is there, the display was simply
>> too wide for the browser window. Scrolling horizontally would be required
>> to see the whole thing.
>>
>> Thanks,
>> Shawn
>>
>>


Re: docValues

2017-11-24 Thread Erick Erickson
Kojo:

bq: My question is, isn't it too
expensive in terms of memory consumption to enable docValues on fields that
I don't need to facet, search, etc.?

Well, yes and no. The memory consumed is your OS memory space and a
small bit of control structures on your Java heap. It's a bit scary
that your _index_ size will increase significantly on disk, but your
Java heap requirements won't be correspondingly large.

But there's a bigger issue here. Streaming is built to handle very
large result sets in a map/reduce style form, i.e. subdivide the work
amongst lots of nodes. If you want to return _all_ the records to the
user along with description information and the like, what are they
going to do with them? 10,000,000 rows (small by some streaming
operations standards) is far too many to, say, display in a browser.
And it's an anti-pattern to ask for, say, 10,000,000 rows with the
select handler.

You can page through these results, but it'll take a long time. So
basically my question is whether this capability is useful enough to
spend time on. If it is and you are going to return lots of rows
consider paging through with cursorMark capabilities, see:
https://lucidworks.com/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
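
The cursorMark approach boils down to repeating a /select request while carrying forward the nextCursorMark from each response. A minimal sketch of the request construction (the host, core name, and field names are illustrative, not from this thread):

```python
from urllib.parse import urlencode

def cursor_request_url(base_url, query, sort, cursor_mark, rows=1000):
    """Build a /select URL for one page of cursorMark iteration.

    cursorMark paging requires: a sort clause that includes the uniqueKey
    field as a tie-breaker, no 'start' parameter, and cursorMark=* on the
    first request. Each response returns a nextCursorMark to pass into the
    following request; iteration stops when nextCursorMark equals the
    cursorMark you just sent.
    """
    params = urlencode({
        "q": query,
        "sort": sort,              # must include the uniqueKey, e.g. "id asc"
        "rows": rows,
        "cursorMark": cursor_mark,
        "wt": "json",
    })
    return f"{base_url}/select?{params}"

# First page against a hypothetical collection:
url = cursor_request_url("http://localhost:8983/solr/mycoll", "*:*", "id asc", "*")
print(url)
```

A fetch loop would parse each JSON response, process the docs, and feed `nextCursorMark` back into the next call.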

Best,
Erick

On Fri, Nov 24, 2017 at 9:38 AM, Kojo  wrote:
> I think I found the solution: after the analysis, change from the /export
> request handler to the /select request handler in order to obtain other fields.
> I will try that.
>
>
>
> 2017-11-24 15:15 GMT-02:00 Kojo :
>
>> Thank you very much for your answer, Shawn.
>>
> >> That is it: I was looking for another way to include non-docValues fields
> >> in the filtered result documents.
> >> I can enable docValues on other fields and reindex everything if necessary. I
> >> will tell you about the use case, because I am not sure that I am on the
> >> right track.
>>
>> As I said before, I am using Streaming Expressions to deal with different
>> collections. Up to this moment, it is decided that we will use this
>> approach.
>>
>> The goal is to provide our users a web interface where they can make some
>> queries. The backend will get Solr data using the Streaming Expressions
>> rest api and will return rolled up data to the frontend, which will display
>> some charts and aggregated data.
>> After that, the end user may want to have data used to generate this
>> aggregated information (not all fields of the filtered documents, but the
>> fields used to aggregate information), combined with some other fields
>> (title, description of document for example) which are not docValues. As
> >> you said, I need to add docValues to them. My question is, isn't it too
> >> expensive in terms of memory consumption to enable docValues on fields that
> >> I don't need to facet, search, etc.?
>>
> >> I think that reconstructing a standard query that achieves the results
>> from a complex Streaming Expression is not simple. This is why I want to
>> use the same query used to make analysis, to return full data via export
>> handler.
>>
> >> I am sorry if this is confusing.
>>
>> Thank you,
>>
>>
>>
>>
>> 2017-11-24 12:36 GMT-02:00 Shawn Heisey :
>>
>>> On 11/23/2017 1:51 PM, Kojo wrote:
>>>
 I am working on Solr to develop a tool to do analysis. I am using
 search
 function of Streaming Expressions, which requires a field to be indexed
 with docValues enabled, so I can get it.

 Suppose that after someone finishes the analysis, and would like to get
 other fields of the resultset that are not docValues enabled. How can it
 be
 done?

>>>
>>> We did get this message, but it's confusing as to exactly what you're
>>> asking, which is why nobody responded.
>>>
>>> If you're saying that this theoretical person wants to use another field
>>> with the streaming expression analysis you have provided, and that field
>>> does not have docValues, then you'll need to add docValues to the field and
>>> completely reindex.
>>>
>>> If you're asking something else, then you're going to need to provide
>>> more details so we can actually know what you want to have happen.
>>>
>>> Thanks,
>>> Shawn
>>>
>>
>>


Re: docValues

2017-11-24 Thread Kojo
I think I found the solution: after the analysis, change from the /export
request handler to the /select request handler in order to obtain other fields.
I will try that.



2017-11-24 15:15 GMT-02:00 Kojo :

> Thank you very much for your answer, Shawn.
>
> That is it, I was looking for another way to include fields non docValues
> to the filtered result documents.
> I can enable docValues to other fields and reindex all if necessary. I
> will tell you about the use case, because I am not sure  that I am on the
> right track.
>
> As I said before, I am using Streaming Expressions to deal with different
> collections. Up to this moment, it is decided that we will use this
> approach.
>
> The goal is to provide our users a web interface where they can make some
> queries. The backend will get Solr data using the Streaming Expressions
> rest api and will return rolled up data to the frontend, which will display
> some charts and aggregated data.
> After that, the end user may want to have data used to generate this
> aggregated information (not all fields of the filtered documents, but the
> fields used to aggregate information), combined with some other fields
> (title, description of document for example) which are not docValues. As
> you said, I need to add docValues to them. My question is, isn't it too
> expensive in terms of memory consumption to enable docValues on fields that
> I don't need to facet, search, etc.?
>
> I think that reconstructing a standard query that achieves the results
> from a complex Streaming Expression is not simple. This is why I want to
> use the same query used to make analysis, to return full data via export
> handler.
>
> I am sorry if this is confusing.
>
> Thank you,
>
>
>
>
> 2017-11-24 12:36 GMT-02:00 Shawn Heisey :
>
>> On 11/23/2017 1:51 PM, Kojo wrote:
>>
> >>> I am working on Solr to develop a tool to do analysis. I am using
>>> search
>>> function of Streaming Expressions, which requires a field to be indexed
>>> with docValues enabled, so I can get it.
>>>
>>> Suppose that after someone finishes the analysis, and would like to get
>>> other fields of the resultset that are not docValues enabled. How can it
>>> be
>>> done?
>>>
>>
>> We did get this message, but it's confusing as to exactly what you're
>> asking, which is why nobody responded.
>>
>> If you're saying that this theoretical person wants to use another field
>> with the streaming expression analysis you have provided, and that field
>> does not have docValues, then you'll need to add docValues to the field and
>> completely reindex.
>>
>> If you're asking something else, then you're going to need to provide
>> more details so we can actually know what you want to have happen.
>>
>> Thanks,
>> Shawn
>>
>
>


Re: docValues

2017-11-24 Thread Kojo
Thank you very much for your answer, Shawn.

That is it: I was looking for another way to include non-docValues fields
in the filtered result documents.
I can enable docValues on other fields and reindex everything if necessary. I will
tell you about the use case, because I am not sure that I am on the right
track.

As I said before, I am using Streaming Expressions to deal with different
collections. Up to this moment, it is decided that we will use this
approach.

The goal is to provide our users a web interface where they can make some
queries. The backend will get Solr data using the Streaming Expressions
rest api and will return rolled up data to the frontend, which will display
some charts and aggregated data.
After that, the end user may want to have data used to generate this
aggregated information (not all fields of the filtered documents, but the
fields used to aggregate information), combined with some other fields
(title, description of document for example) which are not docValues. As
you said, I need to add docValues to them. My question is, isn't it too
expensive in terms of memory consumption to enable docValues on fields that
I don't need to facet, search, etc.?

I think that reconstructing a standard query that achieves the results of
a complex Streaming Expression is not simple. This is why I want to use the
same query used for the analysis to return the full data via the export handler.

I am sorry if this is confusing.

Thank you,




2017-11-24 12:36 GMT-02:00 Shawn Heisey :

> On 11/23/2017 1:51 PM, Kojo wrote:
>
> >> I am working on Solr to develop a tool to do analysis. I am using the search
>> function of Streaming Expressions, which requires a field to be indexed
>> with docValues enabled, so I can get it.
>>
>> Suppose that after someone finishes the analysis, and would like to get
>> other fields of the resultset that are not docValues enabled. How can it
>> be
>> done?
>>
>
> We did get this message, but it's confusing as to exactly what you're
> asking, which is why nobody responded.
>
> If you're saying that this theoretical person wants to use another field
> with the streaming expression analysis you have provided, and that field
> does not have docValues, then you'll need to add docValues to the field and
> completely reindex.
>
> If you're asking something else, then you're going to need to provide more
> details so we can actually know what you want to have happen.
>
> Thanks,
> Shawn
>


Re: Strip out punctuation at the end of token

2017-11-24 Thread Sergio García Maroto
Yes. You are right. I understand now.
Let me explain my issue a bit better with the exact problem I have.

I have this text "Information number  61149-008."
Using the tokenizers and filters described previously I get this list of
tokens.
information
number
61149-008.
61149
008

Basically the last token "61149-008." gets tokenized as:
61149-008.
61149
008
The user is searching for "61149-008" without the dot, so this is not a match.
I don't want to change the tokenization on the query to avoid altering the
matches for other cases.

I would like to delete the dot at the end and basically generate this extra
token:
information
number
61149-008.
61149
008
61149-008

Not sure if what I am saying makes sense, or if there is another way to do
this right.

Thanks a lot
Sergio
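
One hedged way to get the dotless form without touching the query side is the filter Erick's reply in this thread mentions: a PatternReplaceFilterFactory, placed before the WordDelimiterFilterFactory, that strips a single trailing dot so "61149-008." enters WDF as "61149-008". Note this rewrites the token in place rather than emitting an extra variant (a sketch; verify against the full analysis chain):

```xml
<!-- Strips one trailing dot: "61149-008." -> "61149-008" -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="\.$" replacement=""/>
```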


On 24 November 2017 at 15:31, Shawn Heisey  wrote:

> On 11/24/2017 2:32 AM, marotosg wrote:
>
> >> Hi Shawn.
> >> Thanks for your reply. Actually my issue is with the last token: for the
> >> last token of a string, it keeps the dot.
>>
>> In your case Testing. This is a test. Test.
>>
>> Keeps the "Test."
>>
> >> Is there any reason I can't see for that behaviour?
>>
>
> I am really not sure what you're saying here.
>
> Every token is duplicated, one has the dot and one doesn't.  This is what
> you wanted based on what I read in your initial email.
>
> Making a guess as to what you're asking about this time: If you're
> noticing that there isn't a "Test" as the last token on the line for WDF,
> then I have to tell you that it actually is there, the display was simply
> too wide for the browser window. Scrolling horizontally would be required
> to see the whole thing.
>
> Thanks,
> Shawn
>
>


Re: docValues

2017-11-24 Thread Shawn Heisey

On 11/23/2017 1:51 PM, Kojo wrote:

I am working on Solr to develop a tool to do analysis. I am using the search
function of Streaming Expressions, which requires a field to be indexed
with docValues enabled, so I can get it.

Suppose that after someone finishes the analysis, and would like to get
other fields of the resultset that are not docValues enabled. How can it be
done?


We did get this message, but it's confusing as to exactly what you're 
asking, which is why nobody responded.


If you're saying that this theoretical person wants to use another field 
with the streaming expression analysis you have provided, and that field 
does not have docValues, then you'll need to add docValues to the field 
and completely reindex.
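
For concreteness, enabling docValues is a one-attribute schema change followed by a full reindex. Note that docValues are supported on non-tokenized types such as string, numeric, and date fields, not on analyzed text (the field and type names here are illustrative):

```xml
<field name="title_s" type="string" indexed="true" stored="true" docValues="true"/>
```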


If you're asking something else, then you're going to need to provide 
more details so we can actually know what you want to have happen.


Thanks,
Shawn


Re: Strip out punctuation at the end of token

2017-11-24 Thread Shawn Heisey

On 11/24/2017 2:32 AM, marotosg wrote:

Hi Shawn.
Thanks for your reply. Actually my issue is with the last token: for the
last token of a string, it keeps the dot.

In your case Testing. This is a test. Test.

Keeps the "Test."

Is there any reason I can't see for that behaviour?


I am really not sure what you're saying here.

Every token is duplicated, one has the dot and one doesn't.  This is 
what you wanted based on what I read in your initial email.


Making a guess as to what you're asking about this time: If you're 
noticing that there isn't a "Test" as the last token on the line for 
WDF, then I have to tell you that it actually is there, the display was 
simply too wide for the browser window. Scrolling horizontally would be 
required to see the whole thing.


Thanks,
Shawn



Re: Solr7 org.apache.lucene.index.IndexUpgrader

2017-11-24 Thread Shawn Heisey

On 11/23/2017 11:31 PM, Leo Prince wrote:

We were using a somewhat older version, Solr 4.10.2, and are upgrading to Solr 7.

We have about 4 million records in one of the cores, which is of course pretty
large, hence re-sourcing the index is nearly impossible and re-querying from
the source Solr into Solr 7 is also going to be an exhausting effort.


I hate to burst your bubble here ... but 4 million docs is pretty small 
for a Solr index.  I have one index that's a hundred times larger, and 
there are people with *billions* of documents in SolrCloud.



Hence, I tried to upgrade the Index using
org.apache.lucene.index.IndexUpgrader.



IndexUpgrader ran just fine without any errors, but I got this error when
initializing the core.

*java.lang.IllegalStateException:java.lang.IllegalStateException:
unexpected docvalues type NONE for field '_version_' (expected=NUMERIC).
Re-index with correct docvalues type.*
Being said, I am using Classic Schema and used default managed-schema file
as classic schema.xml.


This error means that the existing index didn't have docValues on the 
_version_ field, but the new version does.  At some point in 6.x, a 
whole bunch of field classes were changed to have docValues by default. 
You'll need to explicitly add 'docValues="false"' to the field 
definition to use an older index with a newer version.  But based on 
some things you said later, this may be the least of the problems you're 
running into.
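
A sketch of the explicit definition Shawn describes, for an index upgraded in place (assuming the 4.x-era "long" type is still defined in the schema):

```xml
<!-- The old 4.x index has no docValues here, so turn them off explicitly -->
<field name="_version_" type="long" indexed="true" stored="true" docValues="false"/>
```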



When comparing the schema of 4.10.2 with that of 7.1.0, I see the field type
names have changed as follows:


Up until Solr 6, it was int, float, long and double (*without a P at the
beginning*). I read in the docs that the old field type names are deprecated
in Solr 7 and that everything should start with "*p*", which improves
performance. Hence, in this context:

1. The error I got,
*java.lang.IllegalStateException*:
is it because my synced and upgraded index data contains the old field
types while the new Solr 7 schema contains the new field type names?
That said, my IndexUpgrader run
completed without any errors.


You *cannot* change the classes being used for your fields (which the 
fieldType changes you have described will do) on an existing index and 
expect Solr to work.  If you change the class on a field, you must 
eliminate the current index and reindex from scratch.
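
For example, these two "int" definitions use different classes and produce incompatible index encodings (illustrative; check the actual schemas being compared):

```xml
<!-- Solr 4.x-6.x default: Trie-encoded -->
<fieldType name="int" class="solr.TrieIntField" precisionStep="0"/>
<!-- Solr 7.x default: Points-encoded; switching requires a full reindex -->
<fieldType name="pint" class="solr.IntPointField" docValues="true"/>
```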



2. How do I sort out the error in 1, if my assessment is correct? Since my
data is too large to easily re-source or re-query, are there any other
workarounds to migrate the index if IndexUpgrader is not an option for
upgrading the index to 7?


You would need to keep the schema the same for the upgrade, except that 
you would need to disable docValues on some of your fields to get rid of 
the error you encountered.  You won't be able to take advantage of some 
of the new capability in the new version unless you re-engineer your 
config/schema and reindex.


Upgrading an index, especially through three major versions, is 
generally not recommended.  I always reindex when upgrading Solr, 
especially to a new major version, because Solr evolves quickly.


Thanks,
Shawn


Fwd: docValues

2017-11-24 Thread Kojo
Hi,
yesterday I sent the message below to this list, but just after I sent the
message I received an e-mail from the mail server saying that my e-mail
bounced. I don't know what that means, and since I received no answer to
the question, I don't know whether the message arrived at the list
or not.
I appreciate your attention.

Thank you,




-- Forwarded message --
From: Kojo 
Date: 2017-11-23 18:51 GMT-02:00
Subject: docValues
To: solr-user@lucene.apache.org


Hi,
I am working on Solr to develop a tool to do analysis. I am using the search
function of Streaming Expressions, which requires a field to be indexed
with docValues enabled, so I can get it.

Suppose that after someone finishes the analysis, and would like to get
other fields of the resultset that are not docValues enabled. How can it be
done?

Thanks


Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory

2017-11-24 Thread Rick Leir
Edwin
There is a spec for which characters are acceptable in an email name, and 
another spec for chars in a domain name. I suspect you will have more success 
with a tokenizer which is specialized for email, but I have not looked at 
UAX29URLEmailTokenizerFactory. Does ClassicTokenizerFactory split on hyphens? 
Cheers --Rick

On November 24, 2017 3:46:46 AM EST, Zheng Lin Edwin Yeo  
wrote:
>Hi,
>
>I am indexing email addresses into Solr via EML files. Currently, I am
>using ClassicTokenizerFactory with LowerCaseFilterFactory. However, I
>also
>found that we can also use UAX29URLEmailTokenizerFactory with
>LowerCaseFilterFactory.
>
>Does anyone have any recommendation on which Tokenizer is better?
>
>I am currently using Solr 6.5.1, and planning to upgrade to Solr 7.1.0.
>
>Regards,
>Edwin

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: Strip out punctuation at the end of token

2017-11-24 Thread marotosg
Hi Shawn.
Thanks for your reply. Actually my issue is with the last token: for the
last token of a string, it keeps the dot.

In your case Testing. This is a test. Test.

Keeps the "Test." 

Is there any reason I can't see for that behaviour?

Thanks,
Sergio

Shawn Heisey-2 wrote
> On 11/23/2017 8:06 AM, marotosg wrote:
>> I am trying to strip out any "." at the end of a token, but I would like
>> to keep the original token as well.
>> This is my index analyzer:
>> <analyzer type="index">
>>   <tokenizer .../>
>>   <filter class="solr.WordDelimiterFilterFactory"
>>     generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>     catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
>>     preserveOriginal="1"/>
>>   <filter ... preserveOriginal="false"/>
>> </analyzer>
>> I was thinking of using the solr.PatternReplaceFilterFactory but I see
>> this one won't keep the original token.
> 
> The WordDelimiterFilterFactory that you have configured will do that.
> 
> Here I have taken your analysis chain, added it to a test install of 
> Solr, and tried it out.  It appears to be doing exactly what you want it 
> to do.
> 
> https://www.dropbox.com/s/5puf7rzbypdcspu/wdf-analysis-marotosg.png?dl=0
> 
> Thanks,
> Shawn





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory

2017-11-24 Thread Zheng Lin Edwin Yeo
Hi,

I am indexing email addresses into Solr via EML files. Currently, I am
using ClassicTokenizerFactory with LowerCaseFilterFactory. However, I also
found that we can also use UAX29URLEmailTokenizerFactory with
LowerCaseFilterFactory.

Does anyone have any recommendation on which Tokenizer is better?

I am currently using Solr 6.5.1, and planning to upgrade to Solr 7.1.0.

Regards,
Edwin