UAX29 URL Email Tokenizer not working as expected

2019-05-06 Thread Tom Van Cuyck
Hi,

The UAX29 URL Email Tokenizer is not working as expected.
According to the documentation (
https://lucene.apache.org/solr/guide/7_2/tokenizers.html): "Words are split
at hyphens, unless there is a number in the word, in which case the token
is not split and the numbers and hyphen(s) are preserved."

So I expect "ABC-123" to remain "ABC-123".
However, the term is split into two separate tokens, "ABC" and "123".

The same happens for "AB12-CD34", which becomes "AB12" and "CD34", and so on.

Is this behavior to be expected? Or is there a way to get the behavior I
expect?
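
To make the expectation concrete, here is a small Python sketch of the rule as quoted from the documentation, next to the splitting I actually observe. It only illustrates the two behaviors on whitespace-separated words. It is not Lucene's tokenizer algorithm, and both function names are made up for this example.

```python
import re


def tokens_per_quoted_rule(text):
    """Split at hyphens unless the word contains a digit (the rule as quoted)."""
    tokens = []
    for word in text.split():
        if re.search(r"\d", word):
            # The word contains a number: keep it whole, hyphens included.
            tokens.append(word)
        else:
            tokens.extend(part for part in word.split("-") if part)
    return tokens


def tokens_as_observed(text):
    """Split at every hyphen, matching the output reported above."""
    return [part for word in text.split() for part in word.split("-") if part]


for sample in ("ABC-123", "AB12-CD34", "foo-bar"):
    print(sample, tokens_per_quoted_rule(sample), tokens_as_observed(sample))
```

The two functions agree on "foo-bar" and disagree on the two terms containing digits, which is exactly the discrepancy described above.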

Kind regards, Tom

-- 

Would you like to receive our newsletter to stay updated? Please click here
<http://eepurl.com/dwoymH>


Tom Van Cuyck
Software Engineer

<http://www.ontoforce.com>
ONTOFORCE
WINNER of EY scale-up of the year 2018
@: tom.vancu...@ontoforce.com
T: +32 9 292 80 37 <+32+9+292+80+37>
W: http://www.ontoforce.com
W: http://www.disqover.com
AA Tower, Technologiepark 122 (3/F), 9052 Gent, Belgium
<https://goo.gl/maps/UjuekPHVoFK2>
CIC, One Broadway, MA 02142 Cambridge, United States
<https://www.google.com/maps/place/One+Broadway,+1+Broadway,+Cambridge,+MA+02142/@42.3627659,-71.0857549,17z/data=!3m2!4b1!5s0x89e370a5bef53651:0xa9387af4906ce9a3!4m5!3m4!1s0x89e370a5b9258c7b:0x7d922521464507ad!8m2!3d42.3627822!4d-71.0835375>

DISCLAIMER This message (including any attachments) may contain information
which is confidential and/or protected by intellectual property rights and
is intended for the sole use of the recipient(s) named above. Any use of
the information herein (including, but not limited to, total or partial
reproduction, communication or distribution in any form) by persons other
than the designated recipient(s) is prohibited. If you have received it by
mistake, please notify the sender by return email and delete this message
from your system. Please note that emails are susceptible to change.
ONTOFORCE shall not be liable for the improper or incomplete transmission
of the information contained in this communication nor for any delay in its
receipt or damage to your system. ONTOFORCE does not guarantee that the
integrity of this communication is free of viruses, interceptions or
interference.


Limit facet terms based on a substring using the JSON facet API

2019-01-29 Thread Tom Van Cuyck
Hi

In the old Solr facet API there are the facet.contains and
facet.contains.ignoreCase parameters to limit the facet values to those
terms containing the specified substring.
Is there an equivalent option in the JSON facet API? Or is there a way to
obtain the same behavior with the JSON API? I can't find anything in the
official documentation.
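
For now the only fallback I can think of is filtering on the client side. The Python sketch below assumes the usual bucket shape of a JSON terms facet (a list of objects with "val" and "count"); the function name and the sample buckets are made up for illustration. It is a workaround under those assumptions, not an equivalent of facet.contains.

```python
def filter_buckets(buckets, substring, ignore_case=False):
    """Keep only the buckets whose "val" contains the given substring."""
    needle = substring.lower() if ignore_case else substring
    kept = []
    for bucket in buckets:
        value = str(bucket["val"])
        haystack = value.lower() if ignore_case else value
        if needle in haystack:
            kept.append(bucket)
    return kept


# Made-up buckets in the shape returned for a terms facet.
example_buckets = [
    {"val": "Female", "count": 13},
    {"val": "Male", "count": 37},
    {"val": "All", "count": 4076},
]

print(filter_buckets(example_buckets, "male", ignore_case=True))
```

The drawback is that this only sees the buckets Solr has already returned, so the facet limit would have to be large enough (or -1) to include every candidate term, which is why a server-side option would be preferable.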

Kind regards, Tom


Is there a way to sort by conditional function in the Solr 7.2 JSON API?

2018-03-02 Thread Tom Van Cuyck
Hi,

In the Solr 7.2 JSON API, when faceting over terms, I would like to sort
the buckets by the average of a numerical property, as shown below:

curl http://localhost:8983/solr/core/select -d '
q=*:*&
rows=0&
wt=json&
json.facet={
  "field" : {
    "type" : "terms",
    "field" : "string-field",
    "sort" : "avg desc",
    "limit" : 50,
    facet : {
      avg : "avg(number_i)",
      unique : "unique(number_i)"
    }
  }
}'


However, when none of the documents in a bucket has a value for the
numerical property (e.g. unique = 0 in this case), an average value avg = 0
is returned.
This average value of 0 is then used for sorting the buckets.

I would like the buckets with no value for the numerical property to be
sorted last.
Is there a way to e.g. use conditional sorting? E.g.
sort: "if(gt(unique,0),avg,-9) desc"

I can't get this to work, while in the old API this appears to be possible.

Or is there another way to sort the buckets with a missing numeric value
last?
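
As a stopgap, the ordering I am after can be produced on the client side. The Python sketch below assumes each bucket carries the "avg" and "unique" stats from the request above and treats unique = 0 as "no value in this bucket"; the function name and the sample buckets are made up for illustration.

```python
def sort_missing_last(buckets, stat="avg", marker="unique"):
    """Sort by `stat` descending, placing buckets whose `marker` is 0 last."""
    def sort_key(bucket):
        has_value = bucket.get(marker, 0) > 0
        # False sorts before True, so buckets with a value come first.
        # Negating the stat gives descending order within that group.
        return (not has_value, -bucket.get(stat, 0) if has_value else 0)

    return sorted(buckets, key=sort_key)


# Made-up buckets: "b" has no values for number_i, so Solr reports avg = 0.
example_buckets = [
    {"val": "a", "count": 3, "avg": -5.0, "unique": 2},
    {"val": "b", "count": 2, "avg": 0, "unique": 0},
    {"val": "c", "count": 4, "avg": 12.5, "unique": 3},
]

print([bucket["val"] for bucket in sort_missing_last(example_buckets)])
```

With a plain "avg desc" sort, bucket "b" would land between "c" and "a" because of its avg of 0. This only reorders the buckets that were returned, so it does not help when the limit cuts off buckets before sorting, which is why I would prefer to do it in Solr.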

Kind regards, Tom


Issues with refine parameter when subfaceting in a range facet

2018-01-24 Thread Tom Van Cuyck
Hi,

We encountered an issue when using the refine parameter while subfaceting in
a range facet.
When the refine option is enabled, the counts in the response are double
the counts in the response without the refine option.
We are running Solr 6.6.1 in a cloud setup.

If I execute the query:

curl http://localhost:8899/solr/data/select -d '{ "params" :
{"wt":"json","rows":0,"json.facet":"
  {

\"MaximumAge_f\":
{
  \"type\":\"range\",
  \"field\":\"MaximumAge_f\",
  \"start\":0.0,
  \"end\":55000.0,
  \"gap\":1000.0,
  \"other\":\"between\",
  \"facet\":
  {
\"Gender_sf\":
{
  \"type\":\"terms\",
  \"field\":\"Gender_sf\",
  \"missing\":true,
  \"refine\":true,
  \"overrequest\":24,
  \"limit\":12,
  \"offset\":0
}
  }
}
  }",
  "q":"*:*"
}'

I get the following response:

  "facets": {
"count": 379417,
"MaximumAge_f": {
  "buckets": [
{
  "val": 0,
  "count": 8252,
  "Gender_sf": {
"buckets": [
  {
"val": "All",
"count": 8152
  },
  {
"val": "Male",
"count": 74
  },
  {
"val": "Female",
"count": 26
  }
],
"missing": {
  "count": 0
}
  }
},
...

If I execute the same query WITHOUT refine: true in the subfacet, I get the
following response:

  "facets": {
"count": 379417,
"MaximumAge_f": {
  "buckets": [
{
  "val": 0,
  "count": 4126,
  "Gender_sf": {
"buckets": [
  {
"val": "All",
"count": 4076
  },
  {
"val": "Male",
"count": 37
  },
  {
"val": "Female",
"count": 13
  }
],
"missing": {
  "count": 0
}
  }
},
...

Every count in every bucket differs by a factor of 2.
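
The ratio can be checked directly from the two responses above. The Python sketch below copies the counts of the first range bucket (val 0) and its Gender_sf sub-buckets; the dictionary layout and the function name are made up for this check.

```python
from fractions import Fraction

# Counts for the first range bucket (val 0), copied from the responses above.
with_refine = {"bucket": 8252, "All": 8152, "Male": 74, "Female": 26}
without_refine = {"bucket": 4126, "All": 4076, "Male": 37, "Female": 13}


def count_ratios(refined, unrefined):
    """Return the exact refined/unrefined ratio for every count."""
    return {name: Fraction(refined[name], unrefined[name]) for name in unrefined}


ratios = count_ratios(with_refine, without_refine)
print(ratios)
```

In both responses the sub-bucket counts still add up to the count of the range bucket, so the doubling is consistent across the parent bucket and its subfacet.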

If I perform the same queries with a larger range gap, e.g.
  \"start\":0.0,
  \"end\":55000.0,
  \"gap\":5000.0,
there is no difference between the response with and without refine: true.

Is this a known issue, or is there something we are overlooking?
And is there information on whether or not this behavior will be the same
in Solr 7?

Kind regards, Tom