Re: Interesting Grouping/Facet issue

2019-04-09 Thread Shawn Heisey

On 4/9/2019 7:03 AM, Erie Data Systems wrote:

Solr 8.0.0, I have a HASHTAG string field I am trying to facet on to get
the most popular hashtags (top 100) across many sources. (SITE field is
string)

/select?facet.field=hashtag=on=0=%2Bhashtag:*%20%2BDT:[" .
date('Y-m-d') . "T00:00:00Z+TO+" . date('Y-m-d')  .
"T23:59:59Z]=100=1=fc

It works, but not the way I feel it should. For example, if one site
has 1000 rows on today's date and they all have a HASHTAG in common, that
HASHTAG automatically rises to the top simply because one SITE has 1000
pages with the same HASHTAG.


That is exactly what faceting is designed to do.  It is behaving exactly 
as designed.



Is there a way to get a better, more even distribution of top HASHTAGs for a
given date, i.e. facet by a grouping, distinct, or filter of some sort?
I'm more interested in knowing if a HASHTAG is used frequently among SITEs,
not just one.


If you use pivot facets, first on the field you want to classify on, 
then on HASHTAG, that MIGHT get you what you want.
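
As a rough sketch of the pivot idea, assuming the fields are named `site` and `hashtag` (the lowercase names are a guess) and the response is requested as JSON: pivoting on `site,hashtag` lists, for each site, the hashtags used on that site. Counting how many sites each hashtag appears under ranks hashtags by spread across sites rather than by raw document count. The sample entries below are made up for illustration and are not real output.

```python
from collections import Counter
from urllib.parse import urlencode

# Request parameters for a two-level pivot facet.
# Field names "site" and "hashtag" are assumptions about the schema.
params = urlencode({
    "q": "*:*",
    "rows": 0,
    "facet": "on",
    "facet.pivot": "site,hashtag",
    "facet.mincount": 1,
    "facet.limit": -1,   # no limit per level; may be costly on large indexes
    "wt": "json",
})

def sites_per_hashtag(pivot_entries):
    """Count how many distinct sites each hashtag appears on.

    pivot_entries is the list found under
    facet_counts -> facet_pivot -> "site,hashtag" in the JSON response.
    """
    counts = Counter()
    for site_entry in pivot_entries:
        # Each hashtag shows up at most once per site in the nested pivot,
        # so adding 1 per nested entry counts distinct sites.
        for tag_entry in site_entry.get("pivot", []):
            counts[tag_entry["value"]] += 1
    return counts

# Made-up sample shaped like a pivot facet response.
sample = [
    {"field": "site", "value": "a.example", "count": 1000,
     "pivot": [{"field": "hashtag", "value": "sale", "count": 1000}]},
    {"field": "site", "value": "b.example", "count": 2,
     "pivot": [{"field": "hashtag", "value": "news", "count": 2}]},
    {"field": "site", "value": "c.example", "count": 1,
     "pivot": [{"field": "hashtag", "value": "news", "count": 1}]},
]

top = sites_per_hashtag(sample).most_common(100)
```

In this sample, "sale" has 1000 documents but only one site, so it ranks below "news", which appears on two sites.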


You could also try running many different facet queries, each one with a 
specific query and/or filter that achieves the results you want.
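
One possible shape for the separate-queries approach, sketched under the assumption that the list of sites is known up front and that the fields are named `site`, `hashtag`, and `DT`: build one facet request per site, each restricted with filter queries, and merge the per-site results in the client. The helper name and the site values are hypothetical.

```python
from urllib.parse import urlencode

def per_site_facet_params(site, day):
    """Build the query string for one site's hashtag facet on a given day.

    A list of pairs is used so that the fq parameter can be repeated.
    """
    return urlencode([
        ("q", "*:*"),
        ("fq", f'site:"{site}"'),                            # one site only
        ("fq", f"DT:[{day}T00:00:00Z TO {day}T23:59:59Z]"),  # one day only
        ("rows", 0),
        ("facet", "on"),
        ("facet.field", "hashtag"),
        ("facet.limit", 100),
        ("facet.mincount", 1),
    ])

# Hypothetical site names; in practice these could come from a facet on site.
sites = ["a.example", "b.example"]
query_strings = [per_site_facet_params(s, "2019-04-09") for s in sites]
```

Each query string would be appended to `/select?` and sent separately, which costs one request per site but keeps any single site from dominating the counts.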


FYI:  Including "hashtag:*" in your query makes it a wildcard query. 
This is most likely VERY slow.  If you are trying to match all possible 
values in the hashtag field, then take it out; it's unnecessary.  If you 
are trying to match only documents where hashtag contains a value, then 
replace it with this for a performance improvement:


hashtag:[* TO *]

Range queries are almost always faster than wildcards.
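
A minimal sketch of the original request with the wildcard swapped for the range clause. The parameter names other than `facet.field` (`rows`, `facet.limit`, `facet.mincount`, `facet.method`) are assumptions about what the request was meant to contain, and the function name is made up for illustration.

```python
from datetime import date
from urllib.parse import urlencode

def hashtag_facet_url(day):
    """Build a /select URL that facets on hashtag for a single day."""
    d = day.isoformat()
    query = f"+hashtag:[* TO *] +DT:[{d}T00:00:00Z TO {d}T23:59:59Z]"
    return "/select?" + urlencode({
        "q": query,             # range clause instead of hashtag:*
        "rows": 0,              # assumed: only the facet counts are wanted
        "facet": "on",
        "facet.field": "hashtag",
        "facet.limit": 100,     # assumed: top 100 hashtags
        "facet.mincount": 1,
        "facet.method": "fc",
    })

url = hashtag_facet_url(date(2019, 4, 9))
```

`urlencode` takes care of escaping the `+`, brackets, and colons, which avoids hand-encoding pieces of the query string.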

Thanks,
Shawn


Interesting Grouping/Facet issue

2019-04-09 Thread Erie Data Systems
Solr 8.0.0, I have a HASHTAG string field I am trying to facet on to get
the most popular hashtags (top 100) across many sources. (SITE field is
string)

/select?facet.field=hashtag=on=0=%2Bhashtag:*%20%2BDT:[" .
date('Y-m-d') . "T00:00:00Z+TO+" . date('Y-m-d')  .
"T23:59:59Z]=100=1=fc

It works, but not the way I feel it should. For example, if one site
has 1000 rows on today's date and they all have a HASHTAG in common, that
HASHTAG automatically rises to the top simply because one SITE has 1000
pages with the same HASHTAG.

Is there a way to get a better, more even distribution of top HASHTAGs for a
given date, i.e. facet by a grouping, distinct, or filter of some sort?
I'm more interested in knowing if a HASHTAG is used frequently among SITEs,
not just one.

Hope this makes sense... any recommendations welcomed.

Thank you in advance,
-Craig