Auto-Suggest within Tier Architecture

2020-02-03 Thread Moyer, Brett
Hello,

Looking to see how others have accomplished this. We have a 3-tier 
architecture, and Solr sits deep in T3, far from the end user. How do you make 
auto-suggest calls from the browser through the tiers down to Solr in 
T3? We essentially created a step down through each tier, but I'd like to know what 
other approaches people have used. Did you put your Solr in T1? I assume 
not, since that would put it at risk. Thanks!
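
One way to picture the "step down each tier" approach is a middle-tier endpoint that accepts only a cleaned prefix and builds the Solr request itself, so the browser never controls the Solr URL or its parameters. The sketch below is only an illustration: the host name, the collection name "site", the /suggest handler path and the length and count limits are all assumptions, and it presumes a suggester with a default dictionary is already configured on that handler.

```python
import re
from urllib.parse import urlencode

# Hypothetical internal address of Solr in T3; it is never exposed to the browser.
SOLR_SUGGEST_URL = "http://solr.internal.example:8983/solr/site/suggest"

MAX_PREFIX_LENGTH = 50  # assumed limit, tune to taste
_DISALLOWED = re.compile(r"[^\w\s'-]", re.UNICODE)


def build_suggest_request(user_input, count=5):
    """Return the only Solr URL the middle tier is willing to call."""
    # Drop everything except word characters, whitespace, apostrophes and hyphens.
    prefix = _DISALLOWED.sub("", user_input).strip()[:MAX_PREFIX_LENGTH]
    if not prefix:
        raise ValueError("empty suggest prefix")
    params = {
        "suggest": "true",
        "suggest.q": prefix,
        "suggest.count": max(1, min(count, 10)),  # clamp what the caller asks for
        "wt": "json",
    }
    return SOLR_SUGGEST_URL + "?" + urlencode(params)
```

Each tier above this one would then only forward the raw prefix string, and nothing typed by the user reaches Solr except as the value of suggest.q.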

Brett Moyer
*
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and 
then delete it.

TIAA
*


RE: Odd Edge Case for SpellCheck

2019-11-25 Thread Moyer, Brett
This is a great help, thank you!

Brett Moyer

-Original Message-
From: Erick Erickson  
Sent: Monday, November 25, 2019 4:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Odd Edge Case for SpellCheck

If you’re using direct spell checking, it looks for the _indexed_ term. So this 
means you get stemmed corrections if you’re stemming etc. Usually you should 
use a copyField to a field with minimal analysis and use that field for 
spellchecking.

Another way to think about it is that if you use the admin/analysis page for 
terms in a field, the terms in the dictionary are what’s at the end of the 
indexed side of the page.
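
The copyField suggestion can be sketched as a Schema API request body. This is a hedged example rather than the poster's actual setup: the field name "spell", the source field "content" and the fieldType "text_spell" are hypothetical, and the fieldType is assumed to already exist with minimal analysis (no stemming). The spellcheck component in solrconfig.xml would still need to be pointed at the new field.

```python
import json


def spellcheck_copyfield_payload(source_field, spell_field="spell",
                                 field_type="text_spell"):
    """Build a Schema API body that adds an unstemmed copy of source_field."""
    return {
        "add-field": {
            "name": spell_field,
            "type": field_type,   # assumed: a fieldType without a stemming filter
            "indexed": True,
            "stored": False,      # only the indexed terms feed the dictionary
            "multiValued": True,
        },
        "add-copy-field": {"source": source_field, "dest": [spell_field]},
    }


# The body would be POSTed to /solr/<collection>/schema, followed by a reindex.
body = json.dumps(spellcheck_copyfield_payload("content"), indent=2)
```

With the dictionary built from the unstemmed field, a correctly spelled query term should match an indexed term as typed instead of being compared against a stem.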

Best,
Erick

> On Nov 25, 2019, at 4:02 PM, Moyer, Brett  wrote:
> 
> Yes we are stemming, ahh so we shouldn't stem our words to be spelled?
> 
> Brett Moyer
> 
> -Original Message-
> From: Jörn Franke 
> Sent: Friday, November 22, 2019 8:34 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Odd Edge Case for SpellCheck
> 
> Stemming involved ?
> 
>> Am 22.11.2019 um 14:23 schrieb Moyer, Brett :
>> 
>> Hello, we have spellcheck running, using the index as the dictionary. An 
>> odd use case came up today wanted to get your thoughts and see if what we 
>> determined is correct. Use case: User sends a query for q=brokerage, 
>> spellcheck fires and returns "brokerage". Looking at the output I see that 
>> solr must have pulled the root word "brokage" then spellcheck said hey I 
>> need to fix that. Is that correct? There's no issue, it's just an unexpected 
>> outcome. Thanks!
>> 
>> "q":"brokerage",
>> "spellcheck":{
>>   "suggestions":
>>   [
>> {"name":"brokage",{
>>   "type":"str","value":"numFound":1,
>>   "startOffset":0,
>>   "endOffset":9,
>>   "suggestion":["brokerage"]}}],
>>   "collations":
>>   [
>> {"name":"collation","type":"str","value":"brokerage"}]}}
>> 
>> Brett Moyer


RE: Odd Edge Case for SpellCheck

2019-11-25 Thread Moyer, Brett
Yes we are stemming, ahh so we shouldn't stem our words to be spelled?

Brett Moyer

-Original Message-
From: Jörn Franke  
Sent: Friday, November 22, 2019 8:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Odd Edge Case for SpellCheck

Stemming involved ?

> Am 22.11.2019 um 14:23 schrieb Moyer, Brett :
> 
> Hello, we have spellcheck running, using the index as the dictionary. An odd 
> use case came up today wanted to get your thoughts and see if what we 
> determined is correct. Use case: User sends a query for q=brokerage, 
> spellcheck fires and returns "brokerage". Looking at the output I see that 
> solr must have pulled the root word "brokage" then spellcheck said hey I need 
> to fix that. Is that correct? There's no issue, it's just an unexpected 
> outcome. Thanks!
> 
> "q":"brokerage",
> "spellcheck":{
>"suggestions":
>[
>  {"name":"brokage",{
>"type":"str","value":"numFound":1,
>"startOffset":0,
>"endOffset":9,
>"suggestion":["brokerage"]}}],
>"collations":
>[
>  {"name":"collation","type":"str","value":"brokerage"}]}}
> 
> Brett Moyer


Odd Edge Case for SpellCheck

2019-11-22 Thread Moyer, Brett
Hello, we have spellcheck running, using the index as the dictionary. An odd 
use case came up today, and I wanted to get your thoughts and see if what we 
determined is correct. Use case: User sends a query for q=brokerage, spellcheck 
fires and returns "brokerage". Looking at the output I see that solr must have 
pulled the root word "brokage" then spellcheck said hey I need to fix that. Is 
that correct? There's no issue, it's just an unexpected outcome. Thanks!

"q":"brokerage",
"spellcheck":{
"suggestions":
[
  {"name":"brokage",{
"type":"str","value":"numFound":1,
"startOffset":0,
"endOffset":9,
"suggestion":["brokerage"]}}],
"collations":
[
  {"name":"collation","type":"str","value":"brokerage"}]}}

Brett Moyer


RE: Facet Advice

2019-10-15 Thread Moyer, Brett
Hello Shawn, thanks for the reply. The results that come back are correct, but are 
we implementing the query correctly to filter by a selected facet? When I say 
wrong, it's more about the design/use of Facets in the Query. Is it proper to 
do fq=Tags:Retirement? Is using a Multivalued field correct for Facets? Why do 
you say the above are not Facets?

Here is an excerpt from our JSON:

"facet_counts": {
"facet_queries": {},
"facet_fields": {
"Tags": [
"Retirement",
1260,
"Locations & People",
1149,
"Advice and Tools",
1015,
"Careers",
156,
"Annuities",
101,
"Performance",
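
For anyone consuming a response shaped like the excerpt above: the facet_fields entry is a flat list that alternates value and count. A minimal sketch of turning it into pairs on the client follows; the function name is made up for illustration and the sample list is just the complete entries from the excerpt.

```python
def pair_facet_counts(flat):
    """Turn [value, count, value, count, ...] into [(value, count), ...]."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of items")
    return list(zip(flat[0::2], flat[1::2]))


tags = ["Retirement", 1260, "Locations & People", 1149,
        "Advice and Tools", 1015, "Careers", 156, "Annuities", 101]
pairs = pair_facet_counts(tags)
```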

Brett Moyer
Manager, Sr. Technical Lead | TFS Technology
  Public Production Support
  Digital Search & Discovery

8625 Andrew Carnegie Blvd | 4th floor
Charlotte, NC 28263
Tel: 704.988.4508
Fax: 704.988.4907
bmo...@tiaa.org

-Original Message-
From: Shawn Heisey  
Sent: Tuesday, October 15, 2019 5:40 AM
To: solr-user@lucene.apache.org
Subject: Re: Facet Advice

On 10/14/2019 3:25 PM, Moyer, Brett wrote:
> Hello, looking for some advice, I have the suspicion we are doing Facets all 
> wrong. We host financial information and recently "tagged" our pages with 
> appropriate Facets. We have built a Flat design. Are we going at it the wrong 
> way?
> 
> In Solr we have a "Tags" field, based on some magic we tagged each page on 
> the site with a number of the below example Facets. We have the UI team 
> sending queries in the form of 1) q=get a loan&fq=Tags:Retirement, 2) q=get a 
> loan&fq=Tags:Retirement AND Tags:Move Money. This restricts the resultset 
> hopefully guiding the user to their desired result. Something about it 
> doesn’t seem right. Is this right with a flat single level pattern like what 
> we have? Should each doc have multiple Fields to map to different values? Any 
> help is appreciated. Thanks!
> 
> Example Facets:
> Brokerage
> Retirement
> Open an Account
> Move Money
> Estate Planning

The queries you mentioned above do not have facets, only the q and fq 
parameters.  You also have not mentioned what in the results is wrong to you.

If you restrict the query to only a certain value in the tag field, then facets 
will only count documents that match the full query -- users will not be able 
to see the count of documents that do NOT match the query, unless you use 
tagging/excluding with your filters.  This is part of the functionality called 
multi-select faceting.

http://yonik.com/multi-select-faceting/
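
The tagging/excluding idea can be sketched as a set of request parameters. This is an illustration under assumptions: the tag name "tagfilter" is arbitrary, the field name "Tags" is taken from the thread, and the AND operator simply mirrors the example queries in the original message.

```python
from urllib.parse import urlencode


def multiselect_params(user_query, selected_tags, field="Tags", operator="AND"):
    """Build q/fq/facet parameters where the facet ignores its own filter."""
    params = [
        ("q", user_query),
        ("facet", "true"),
        # {!ex=...} tells the facet to skip any filter carrying the same tag.
        ("facet.field", "{!ex=tagfilter}" + field),
    ]
    if selected_tags:
        quoted = ['"%s"' % t.replace('"', '\\"') for t in selected_tags]
        clause = (" %s " % operator).join(quoted)
        # {!tag=...} labels the filter so the facet above can exclude it.
        params.append(("fq", "{!tag=tagfilter}%s:(%s)" % (field, clause)))
    return params


query_string = urlencode(multiselect_params("get a loan", ["Retirement", "Move Money"]))
```

The results are still restricted by the fq, but the counts for the Tags facet are computed as if that filter were absent, so the user can see what the other choices would return.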

Because your message doesn't say what in the results is wrong, we can only 
guess about how to help you.  I do not know if the above information will be 
helpful or not.

Thanks,
Shawn


Facet Advice

2019-10-14 Thread Moyer, Brett
Hello, looking for some advice, I have the suspicion we are doing Facets all 
wrong. We host financial information and recently "tagged" our pages with 
appropriate Facets. We have built a Flat design. Are we going at it the wrong 
way?

In Solr we have a "Tags" field, based on some magic we tagged each page on the 
site with a number of the below example Facets. We have the UI team sending 
queries in the form of 1) q=get a loan&fq=Tags:Retirement, 2) q=get a 
loan&fq=Tags:Retirement AND Tags:Move Money. This restricts the resultset 
hopefully guiding the user to their desired result. Something about it doesn’t 
seem right. Is this right with a flat single level pattern like what we have? 
Should each doc have multiple Fields to map to different values? Any help is 
appreciated. Thanks!

Example Facets:
Brokerage
Retirement
Open an Account
Move Money
Estate Planning
Etc..

Brett


RE: Indexed Data Size

2019-08-13 Thread Moyer, Brett
Turns out this is due to a job that indexes logs. We were able to clear some 
with another job. We are working through the value of these indexed logs. 
Thanks for all your help!

Brett Moyer

-Original Message-
From: Shawn Heisey  
Sent: Friday, August 9, 2019 2:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexed Data Size

On 8/9/2019 12:17 PM, Moyer, Brett wrote:
> The biggest is /data/solr/system_logs_shard1_replica_n1/data/index, files 
> with the extensions I stated previously. Each is 5gb and there are a few 
> hundred. Dated back to the last 3 months. I don’t understand why there are so many 
> files with such small indexes. Not sure how to clean them up.

Can you get a screenshot of the core overview for that particular core? 
Solr should correctly calculate the size on the overview based on what files 
are actually in the index directory.

Thanks,
Shawn


RE: Indexed Data Size

2019-08-09 Thread Moyer, Brett
Correct, our indexes are small document-wise, but for some reason we have a 
years' worth of files in the data/solr folders. There are no index. 
files.

The biggest is /data/solr/system_logs_shard1_replica_n1/data/index, files with 
the extensions I stated previously. Each is 5gb and there are a few hundred. 
Dated back to the last 3 months. I don’t understand why there are so many files with 
such small indexes. Not sure how to clean them up. 

-Original Message-
From: Shawn Heisey  
Sent: Friday, August 9, 2019 9:11 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexed Data Size

On 8/9/2019 6:12 AM, Moyer, Brett wrote:
> Thanks! We update each index nightly, we don’t clear, but bring in New and 
> Deltas, delete expired/404. All our data are basically webpages, so none are 
> very large. Some PDFs but again not too large. We are running Solr 7.5, 
> hopefully you can access the links.

Solr is saying that the entire size of the index directory is 95 MB for one of 
those indexes and the other is 30 MB.  Those sound to me like very small 
indexes, not very large like you indicated.  You were saying that the large 
files were in data/index, and did not mention anything about index. 
directories.

If you do have a bunch of index. directories in the "Data" 
directory mentioned on the Core overview page, you can safely delete all of the 
index and/or index.* directories under that directory EXCEPT the one that is 
indicated as the "Index" directory.  If you delete that one, you're deleting 
the actual live index ... and since you're not on Windows, the OS will let you 
delete it without complaining.
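
Before deleting anything by hand, a dry-run listing may help. The sketch below only reports candidates and never deletes; the live directory name has to be supplied by the operator, read from the "Index" entry on the Core overview page as described above. The function name and the example path are hypothetical.

```python
from pathlib import Path


def stale_index_dirs(data_dir, live_index_name):
    """List index / index.* directories under data_dir other than the live one."""
    data_dir = Path(data_dir)
    if not (data_dir / live_index_name).is_dir():
        # Refuse to guess if the supplied live directory is not there.
        raise FileNotFoundError("live index directory not found: " + live_index_name)
    return sorted(
        p for p in data_dir.iterdir()
        if p.is_dir()
        and (p.name == "index" or p.name.startswith("index."))
        and p.name != live_index_name
    )


# Example call (hypothetical path and directory name):
# for d in stale_index_dirs("/data/solr/mycore/data", "index"):
#     print("candidate for removal:", d)
```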

The directory locations are cut off on both screenshots, so I can't confirm 
anything there.

The larger core has about 2000 deleted docs and the smaller one has 40. 
Doing an optimize will not save much disk space or take very long.

Thanks,
Shawn


RE: Indexed Data Size

2019-08-09 Thread Moyer, Brett
Thanks! We update each index nightly, we don’t clear, but bring in New and 
Deltas, delete expired/404. All our data are basically webpages, so none are 
very large. Some PDFs but again not too large. We are running Solr 7.5, 
hopefully you can access the links.

https://www.dropbox.com/s/lzd6hkoikhagujs/CoreOne.png?dl=0
https://www.dropbox.com/s/ae6rayb38q39u9c/CoreTwo.png?dl=0

Brett
-Original Message-
From: Erick Erickson  
Sent: Thursday, August 8, 2019 5:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexed Data Size

On the surface, this makes no sense at all, so there’s something I don’t 
understand here ;). 

How often do you update your index? Having files from a long time ago is 
perfectly reasonable if you’re not updating regularly.

But your statement that some of these are huge for just a 50K document index is 
odd unless they’re _huge_ documents.

I wouldn’t optimize, unless you’re on Solr 7.5+ as that’ll create a single 
segment, see: 
https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
and
https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/

The extensions you mentioned are perfectly reasonable. Each segment is made up 
of multiple files. .fdt for instance contains stored data. See: 
https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/codecs/lucene62/package-summary.html

Can you give us a long listing of one of your index directories?
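
Alongside a long listing, a per-extension size summary can show quickly which kind of segment file is using the space. A small sketch, assuming it is run against a single index directory; the function name is made up, and files without an extension (for example segments_N) are reported under their full name.

```python
from collections import Counter
from pathlib import Path


def bytes_by_extension(index_dir):
    """Return [(extension, total_bytes), ...], largest total first."""
    totals = Counter()
    for path in Path(index_dir).iterdir():
        if path.is_file():
            # Fall back to the file name when there is no extension.
            totals[path.suffix or path.name] += path.stat().st_size
    return totals.most_common()
```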

Best,
Erick

> On Aug 8, 2019, at 5:17 PM, Moyer, Brett  wrote:
> 
> In our data/solr//data/index on the filesystem, we have files 
> that go back 1 year. I don’t understand why and I doubt they are in use. 
> Files with extensions like fdx,cfe,doc,pos,tip,dvm etc. Some of these are 
> very large and running us out of server space. Our search indexes themselves 
> are not large, in total we might have 50k documents.  How can I reduce this 
> /data/solr space? Is this what the Solr Optimize command is for? Thanks!
> 
> Brett
> 


RE: modify query response plugin

2019-08-08 Thread Moyer, Brett
Highlight? What about using the Highlighter? 
https://lucene.apache.org/solr/guide/6_6/highlighting.html

Brett Moyer


-Original Message-
From: Maria Muslea  
Sent: Thursday, August 8, 2019 1:28 PM
To: solr-user@lucene.apache.org
Subject: Re: modify query response plugin

Thank you for your response. I believe that the Tagger is used for NER, which 
is different than what I am trying to do.
It is also available only with Solr 7 and I would need this to work with 
version 6.5.0.

I am trying to manipulate the data that I already have in the response, and I 
can't find a good example of a plugin that does something similar, so I can see 
how I can access the response and construct a new one.
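
If a server-side plugin turns out not to be strictly required, the same reshaping can be done on the client after the response arrives. The sketch below is that alternative, not a Solr plugin. It assumes the "highlight" value is a semicolon-separated string whose first part is the concept; the meaning of the two numbers is not stated in the thread, so they are ignored here.

```python
def reshape_highlight(doc):
    """Return a copy of doc with the 'highlight' string expanded into an object."""
    raw = doc.get("highlight")
    if not isinstance(raw, str):
        return dict(doc)  # nothing to reshape
    concept, _, _unused_numbers = raw.partition(";")
    reshaped = dict(doc)
    reshaped["highlight"] = {
        "concept": concept,
        "description": doc.get("description"),
    }
    return reshaped
```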

Your help is greatly appreciated.

Thank you,
Maria

On Tue, Aug 6, 2019 at 3:19 PM Erik Hatcher  wrote:

> I think you’re looking for the Solr Tagger, described here:
> https://lucidworks.com/post/solr-tagger-improving-relevancy/
>
> > On Aug 6, 2019, at 16:04, Maria Muslea  wrote:
> >
> > Hi,
> >
> > I am trying to implement a plugin that will modify my query 
> > response. For example, I would like to execute a query that will return 
> > something like:
> >
> > {...
> > "description":"flights at LAX",
> > "highlight":"airport;11;3"
> > ...}
> > This is information that I have in my document, so I can return it.
> >
> > Now, I would like the plugin to intercept the result, do some 
> > processing
> on
> > it, and return something like:
> >
> > {...
> > "description":"flights at LAX",
> > "highlight":{
> >   "concept":"airport",
> >   "description":"flights at LAX"
> > ...}
> >
> > I looked at some RequestHandler implementations, but I can't find 
> > any sample code that would help me with this. Would this type of 
> > plugin be handled by a RequestHandler? Could you maybe point me to a 
> > sample plugin that does something similar?
> >
> > I would really appreciate your help.
> >
> > Thank you,
> > Maria
>


Indexed Data Size

2019-08-08 Thread Moyer, Brett
In our data/solr//data/index on the filesystem, we have files 
that go back 1 year. I don’t understand why and I doubt they are in use. Files 
with extensions like fdx,cfe,doc,pos,tip,dvm etc. Some of these are very large 
and running us out of server space. Our search indexes themselves are not 
large, in total we might have 50k documents.  How can I reduce this /data/solr 
space? Is this what the Solr Optimize command is for? Thanks!

Brett



Solr spellcheck Collation JSON

2019-04-07 Thread Moyer, Brett
Hello,

Looks like a more recent Solr release introduced a bug for collation. 
Does anyone know of a way to correct it, or whether a future release will address it? 
Because of this change we had to make the app teams rewrite their code. It made us 
look bad because, from their perspective, we can't control our code and 
introduced a bug. Thanks

Solr 7.4
--
"spellcheck": {
"suggestions": [
"acount",
{
"numFound": 1,
"startOffset": 0,
"endOffset": 6,
"suggestion": [
"account"
]
}
],
"collations": [
"collation", <-this is the bad line
"account"
]

Previous Solr versions
--
"spellcheck": {
"suggestions": [
"acount",
{
"numFound": 1,
"startOffset": 0,
"endOffset": 6,
"suggestion": [
"account"
]
}
],
"collations": [
"collation":"account" <--correct format
]
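
Until the format question is settled, client code can be written to tolerate more than one shape of "collations". The sketch below is a defensive reader, not a description of what any Solr version guarantees: it accepts a flat alternating list, a map, or a list of single-entry maps. The function name is made up, and it does not handle extended collation results.

```python
def extract_collations(spellcheck):
    """Return the collation strings from several possible response shapes."""
    raw = spellcheck.get("collations", [])
    if isinstance(raw, dict):                      # {"collation": "account"}
        value = raw.get("collation")
        return [] if value is None else [value]
    items = list(raw)
    found = []
    i = 0
    while i < len(items):
        item = items[i]
        if item == "collation" and i + 1 < len(items):
            found.append(items[i + 1])             # flat: "collation", "account"
            i += 2
        elif isinstance(item, dict) and "collation" in item:
            found.append(item["collation"])        # [{"collation": "account"}]
            i += 1
        else:
            i += 1
    return found
```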

Brett Moyer


RE: IRA or IRA the Person

2019-04-01 Thread Moyer, Brett
Wow, thank you Trey, great information! We are a Fusion client, works well for 
us, we are leveraging the Signals Boosting. We were thinking omitNorms might be 
of help here, turning that off actually. The PERSON document ranks #1 always 
because it’s a tiny document with very short fields. I'll take a closer look at 
what you sent, Thank you!

Brett Moyer


-Original Message-
From: Trey Grainger [mailto:solrt...@gmail.com] 
Sent: Monday, April 01, 2019 1:15 PM
To: solr-user@lucene.apache.org
Subject: Re: IRA or IRA the Person

CAUTION: This email originated from outside of the organization. Do not click 
links or open attachments unless you recognize the sender and know the content 
is safe.


Hi Brett,

There are a couple of angles you can take here. If you are only concerned
about this specific term or a small number of other known terms like "IRA"
and want to spot fix it, you can use something like the query elevation
component in Solr (
https://lucene.apache.org/solr/guide/7_7/the-query-elevation-component.html)
to explicitly include or exclude documents.

Otherwise, if you are looking for a more data-driven approach to solving
this, you can leverage the aggregate click-streams for your users across
all of the searches on your platform to boost documents higher that are
more popular for any given search. We do this in our commercial product
(Lucidworks Fusion) through our Signals Boosting feature, but you could
implement something similar yourself with some work, as the general
architecture is fairly well-documented here:
https://doc.lucidworks.com/fusion-ai/4.2/user-guide/signals/index.html
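
To make the click-stream idea concrete, here is a toy version and not how Fusion implements it: count past clicks per query and turn the most-clicked documents into a boost query (the bq parameter of the dismax/edismax parsers). The log format, the "id" field, the document ids and the weighting formula are all assumptions for illustration.

```python
from collections import Counter


def boost_query(click_log, user_query, field="id", top_n=3):
    """Build a bq string favouring documents clicked most for this query."""
    wanted = user_query.strip().lower()
    counts = Counter(doc_id for q, doc_id in click_log
                     if q.strip().lower() == wanted)
    total = sum(counts.values())
    if total == 0:
        return ""  # no signal for this query, so no boost
    # Arbitrary weighting: 1 plus ten times the share of clicks.
    return " ".join(
        f'{field}:"{doc_id}"^{1 + 10 * n / total:.2f}'
        for doc_id, n in counts.most_common(top_n)
    )


log = [("ira", "ira-article")] * 3 + [("IRA", "ira-black"), ("loan", "loan-page")]
bq = boost_query(log, "ira")
```

In this toy log the article collects three of the four clicks for "ira", so it gets the larger boost even though the person page is the shorter document.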

If you do not have long-lived content OR you do not have sufficient
signals history, you could alternatively use something like Solr's Semantic
Knowledge Graph to automatically find term vectors that are the most
related to your terms within your content. In that case, if the "individual
retirement account" meaning is more common across your documents, you'd
probably end up with terms more related to that which could be used to do
data-driven boosts on your query to that concept (instead of the person, in
this case).

I gave a presentation at Activate ("the Search & AI Conference") last year
on some of the more data-driven approaches to parsing and understanding the
meaning of terms within queries, that included things like disambiguation
(similar to what you're doing here) and some additional approaches
leveraging a combination of query log mining, the semantic knowledge graph,
and the Solr Text Tagger. If you start handling these use cases in a more
systematic and data-driven way, you might want to check out some of the
techniques I mentioned there: Video:
https://www.youtube.com/watch?v=4fMZnunTRF8 | Slides:
https://www.slideshare.net/treygrainger/how-to-build-a-semantic-search-system


All the best,

Trey Grainger
Chief Algorithms Officer @ Lucidworks


On Mon, Apr 1, 2019 at 11:45 AM Moyer, Brett  wrote:

> Hello,
>
> Looking for ideas on how to determine intent and drive results to
> a person result or an article result. We are a financial institution and we
> have IRA's Individual Retirement Accounts and we have a page that talks
> about an Advisor, IRA Black.
>
> Our users are in a bad habit of only using single terms for
> search. A very common search term is "ira". The PERSON page ranks higher
> than the article on IRA's. With essentially no information from the user,
> what are some ways we can detect and rank differently? Thanks!
>
> Brett Moyer


IRA or IRA the Person

2019-04-01 Thread Moyer, Brett
Hello,

Looking for ideas on how to determine intent and drive results to a 
person result or an article result. We are a financial institution and we have 
IRA's Individual Retirement Accounts and we have a page that talks about an 
Advisor, IRA Black.

Our users are in a bad habit of only using single terms for search. A 
very common search term is "ira". The PERSON page ranks higher than the article 
on IRA's. With essentially no information from the user, what are some ways we 
can detect and rank differently? Thanks!

Brett Moyer
*
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and 
then delete it.

TIAA
*


RE: FieldTypes and LowerCase

2019-03-14 Thread Moyer, Brett
Ok I think I'm getting it. At Index/Query time the analyzers fire and "do 
stuff". Ex: "the sheep jumped over the MOON" that could be tokenized on spaces, 
lowercased etc. and that is stored in the Inverted Index, something you 
probably can't really see.

In solr the string above is what you see in its original form. When you search 
for "sheep" that would come back because the Inverted Index has it stored in 
that form, separated words based on spaces, right? Further if I searched for 
moon (lowercase) it would be found because the analyzer is also storing in the 
Inverted Index the lowercase form, right?

I'm getting closer I think. Ok so if I want to physically lowercase the URL and 
store it that way, I need to do it before it gets to the Index as you stated. 
Ok got it, Thanks!

Brett Moyer


-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Thursday, March 14, 2019 10:57 AM
To: solr-user@lucene.apache.org
Subject: Re: FieldTypes and LowerCase


On 3/14/2019 8:49 AM, Moyer, Brett wrote:
> Thanks Shawn, " Analysis only happens to indexed data" Being the case when 
> the data gets Indexed, then wouldn't the Analyzer kickoff and lowercase the 
> URL? The analyzer I have defined is not set for Index or Query, so as I 
> understand it will fire during both events. If that is the case I still don't 
> get why the Lowercase doesn't fire when the data is being indexed.

It does happen for both index and query.

It sounds like you are assuming that when index analysis happens, that
what you get back in search results will be affected by that analysis.

What you get back in search results is stored data -- that is never
affected by analysis.

What gets affected by analysis is indexed data -- the data that is
searched by queries.  Not the data that comes back in search results.
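
The stored-versus-indexed distinction can be shown with a toy index. This is not Solr code, only a model of the idea: the analyzer (whitespace split plus lowercase here) decides what can be matched, while the stored text is kept exactly as it was sent.

```python
def analyze(text):
    """Toy analysis chain: split on whitespace, then lowercase each token."""
    return [token.lower() for token in text.split()]


class ToyIndex:
    def __init__(self):
        self.stored = {}    # doc id -> original text, untouched by analysis
        self.postings = {}  # analyzed token -> set of doc ids

    def add(self, doc_id, text):
        self.stored[doc_id] = text
        for token in analyze(text):
            self.postings.setdefault(token, set()).add(doc_id)

    def search(self, query):
        ids = set()
        for token in analyze(query):  # the same analysis runs at query time
            ids |= self.postings.get(token, set())
        return [self.stored[i] for i in sorted(ids)]


idx = ToyIndex()
idx.add(1, "the sheep jumped over the MOON")
hits = idx.search("moon")
```

The search for lowercase "moon" matches because the indexed token was lowercased, yet what comes back is the stored text with "MOON" still in capitals.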

Thanks,
Shawn
*
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and 
then delete it.

TIAA
*


RE: FieldTypes and LowerCase

2019-03-14 Thread Moyer, Brett
Thanks Shawn, " Analysis only happens to indexed data" Being the case when the 
data gets Indexed, then wouldn't the Analyzer kickoff and lowercase the URL? 
The analyzer I have defined is not set for Index or Query, so as I understand 
it will fire during both events. If that is the case I still don't get why the 
Lowercase doesn't fire when the data is being indexed. 

Brett Moyer

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Thursday, March 14, 2019 10:44 AM
To: solr-user@lucene.apache.org
Subject: Re: FieldTypes and LowerCase



On 3/14/2019 7:47 AM, Moyer, Brett wrote:
> I'm using the below FieldType/Field but when I index my documents, the URL is 
> not being lower case. Any ideas? Do I have the below wrong?
>
> Example: http://connect.rightprospectus.com/RSVP/TADF
> Expect: http://connect.rightprospectus.com/rsvp/tadf
>
>  omitNorms="true">
> 
>
>
> 
> 
>
>  stored="true"/>

Analysis only happens to indexed data.

The data that you get back from Solr (stored data) is *always* EXACTLY
what Solr indexes, before analysis.

You'll need to lowercase the data before it reaches analysis.  This is
how it is designed to work ... that will not be changing.

If you were to configure an Update Processor chain that did the
lowercasing, that would affect stored data as well as indexed data.
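
Lowercasing before the data reaches Solr could look like the sketch below, run in whatever client or crawler sends the documents. The field name "url" is an assumption, since the actual field definition did not survive in this message, and lowercasing a whole URL presumes the site serves the same page regardless of path case.

```python
def lowercase_field(docs, field="url"):
    """Return copies of docs with the given string field lowercased."""
    cleaned = []
    for doc in docs:
        copy = dict(doc)
        if isinstance(copy.get(field), str):
            copy[field] = copy[field].lower()  # changes what Solr will store
        cleaned.append(copy)
    return cleaned


docs = lowercase_field([{"id": "1", "url": "http://connect.rightprospectus.com/RSVP/TADF"}])
```

Because the change happens before indexing, both the stored value and the indexed value end up lowercase.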

Thanks,
Shawn



FieldTypes and LowerCase

2019-03-14 Thread Moyer, Brett
I'm using the below FieldType/Field but when I index my documents, the URL is 
not being lower case. Any ideas? Do I have the below wrong?

Example: http://connect.rightprospectus.com/RSVP/TADF
Expect: http://connect.rightprospectus.com/rsvp/tadf



  
  





Brett Moyer




RE: URL Case Sensitive/Insensitive

2018-12-11 Thread Moyer, Brett
https://www.nuveen.com/mutual-funds/nuveen-high-yield-municipal-bond-fund
https://www.nuveen.com/mutual-funds/Nuveen-High-Yield-Municipal-Bond-Fund

Is there any issue if we just lowercase all URLs? I can't think of one, 
but that's why I'm asking the gurus!

Brett Moyer
   

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, December 11, 2018 12:41 PM
To: solr-user
Subject: Re: URL Case Sensitive/Insensitive

What do you mean by "url case"? No, I'm not being snarky.

The value returned in a doc is very different than the value searched.
The stored data is the original input without going through any
filters.

If you mean the value _returned_ by Solr from a stored field, then the
case is exactly whatever was input originally. To get it a consistent
case, I'd change it on the client side before sending  to Solr, or
use, say, a  ScriptUpdateProcessor to change it on the way in to Solr.

If you're talking about _searching_ the URL, you need to put the
appropriate filters in your analysis chain. Most distributions have a
"lowercase" type that is a keywordtokenizer and lowercasefilter. That
still treats the searchable text as a single token, so for instance
you wouldn't be able to search for url:com with pre-and-post wildcards
which is not a good pattern. If you want to search sub-parts of a url,
you'll use one of the text-based types to break it up into tokens.
Even in this case, though, the returned data is still the original
case since it's the stored data that's returned.
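
Before changing anything, it may be worth measuring how many stored URLs differ only by case. A small sketch that groups a list of URLs by their lowercased form; the function name is made up, and the URLs would come from whatever export or query is convenient.

```python
from collections import defaultdict


def case_duplicates(urls):
    """Map each lowercased URL to its variants when more than one casing exists."""
    groups = defaultdict(list)
    for url in urls:
        groups[url.lower()].append(url)
    return {key: variants for key, variants in groups.items()
            if len(set(variants)) > 1}


dupes = case_duplicates([
    "https://www.nuveen.com/mutual-funds/nuveen-high-yield-municipal-bond-fund",
    "https://www.nuveen.com/mutual-funds/Nuveen-High-Yield-Municipal-Bond-Fund",
    "https://www.nuveen.com/mutual-funds",
])
```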

Best,
Erick
On Tue, Dec 11, 2018 at 8:38 AM Moyer, Brett  wrote:
>
> Hello, I'm new to Solr and have been using it for a few months. A recent question came 
> up from our business partners about URL casing. Previously their URLs were 
> upper case, they made a change and now all lower. Both pages/URLs are still 
> accessible so there are duplicates in Solr. They are requesting all URLs be 
> evaluated as lowercase. What is the best practice on URL case? Is there a 
> negative to making all lowercase? I know I can drop the index and re-crawl to 
> fix it, but long term how should URL case be treated? Thanks!
>
> Brett Moyer
>


URL Case Sensitive/Insensitive

2018-12-11 Thread Moyer, Brett
Hello, I'm new to Solr and have been using it for a few months. A recent question came 
up from our business partners about URL casing. Previously their URLs were 
upper case, they made a change and now all lower. Both pages/URLs are still 
accessible so there are duplicates in Solr. They are requesting all URLs be 
evaluated as lowercase. What is the best practice on URL case? Is there a 
negative to making all lowercase? I know I can drop the index and re-crawl to 
fix it, but long term how should URL case be treated? Thanks!

Brett Moyer
