Re: Questions about Solr Search

2020-07-02 Thread Doug Turnbull
I think it's better to think of Solr as a piece of infrastructure or
component for you to build these things, rather than a product that has a
lot of capabilities for some specific use case.

So you can find 'lego pieces' to build some of these things, but with Solr
you need to build these things yourself. You're trading off the targeted
features you'll find in a search product against the depth of configurability
and pluggability of open source search. With Solr you should expect a big
engineering investment, and getting to know the internals, to use it most
effectively.

On topics 2 & 3, you might be interested in AI Powered Search, which has a
strong NLP component: http://aipoweredsearch.com

-Doug

On Thu, Jul 2, 2020 at 10:26 AM Gautam K  wrote:

> Dear Team,
>
> Hope you all are doing well.
>
> Can you please help with the following question? We are using Solr search
> in our organisation and are now checking whether Solr provides search
> capabilities like Google Enterprise Search (Google Knowledge Graph Search).
>
> 1. Does Solr Search provide Voice Search like Google?
> 2. Does Solr Search provide NLP Search (Natural Language Processing)?
> 3. Does Solr have all the capabilities which Google Knowledge Graph
> provides, like those below?
>
>
>- Getting a ranked list of the most notable entities that match
>certain criteria.
>- Predictively completing entities in a search box.
>- Annotating/organizing content using the Knowledge Graph entities.
>
>
> *Your help will be highly appreciated.*
>
> Many thanks
> Gautam Kanaujia
> India
>


-- 
*Doug Turnbull **| CTO* | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>; Contributor: *AI
Powered Search <http://aipoweredsearch.com>*
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Re: Welcome Mayya Sharipova as Lucene/Solr committer

2020-06-08 Thread Doug Turnbull
What a great person to have as a committer - congrats Mayya!

On Mon, Jun 8, 2020 at 1:30 PM Eric Pugh 
wrote:

> Congratulations!  Welcome!
>
> On Jun 8, 2020, at 1:26 PM, Steve Rowe  wrote:
>
> Congrats and welcome, Mayya!
>
> --
> Steve
>
> On Jun 8, 2020, at 12:58 PM, jim ferenczi  wrote:
>
> Hi all,
>
> Please join me in welcoming Mayya Sharipova as the latest Lucene/Solr
> committer.
> Mayya, it's tradition for you to introduce yourself with a brief bio.
>
> Congratulations and Welcome!
>
> Jim
>
>
>
> ___
> *Eric Pugh **| *Founder & CEO | OpenSource Connections, LLC | 434.466.1467
> | http://www.opensourceconnections.com | My Free/Busy
> <http://tinyurl.com/eric-cal>
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.
>
>



Re: [DISCUSS] Lucene-Solr split (Solr promoted to TLP)

2020-05-14 Thread Doug Turnbull
enance as a separate component. The learning curve for people
>> > coming to each project separately is going to be gentler than trying
>> > to dive into the combined codebase.
>> >
>> > 5) Mailing lists, build servers. Mailing lists for users are already
>> > separated. I think this is yet another indication that Solr is
>> > something more than a component within Lucene. It is perceived as an
>> > independent entity and used as an independent product. I would really
>> > like to have separate mailing lists for these two projects (this
>> > includes build and test results) as it would make life easier: if your
>> > focus is more on Lucene (or Solr), you would only need to track half
>> > of the current traffic.
>> >
>> >
>> > As I already mentioned, the discussion among PMC members highlighted
>> > some initial concerns and reasons why the project should perhaps
>> > remain glued together. These are outlined below with some of the
>> > counter-arguments presented under each concern to avoid repetition of
>> > the same content from the PMC mailing list (they’re copied from the
>> > private discussion list).
>> >
>> > 1) Both projects may gradually split their ways after the separation
>> > and even develop “against” each other like it used to be before the
>> > merge.
>> >
>> > Whether this is a legitimate concern is hard to tell. If Solr goes TLP
>> > then all existing Lucene committers will automatically become Solr
>> > committers (unless they opt not to) so there will be both procedural
>> > ways to prevent this from happening (vetoes) as well as common-sense
>> > reasons to just cooperate.
>> >
>> > 2) Some people like parallel version numbering (concurrent Solr and
>> > Lucene releases) as it gives instant clarity which Solr version uses
>> > which version of Lucene.
>> >
>> > This can still be done on the Solr side (it is Solr’s decision to adopt
>> > any versioning scheme the project feels comfortable with). I
>> > personally (DW) think this kind of versioning is actually more
>> > confusing than helpful; Solr should have its own cadence of releases
>> > driven by features, not sub-component changes. If the “backwards
>> > compatibility” is a factor then a solution might be to sync on major
>> > version releases only (e.g., this is how Elasticsearch is handling
>> > this).
>> >
>> > 3) Solr tests are the first “battlefield” test zone for Lucene changes
>> > - if it becomes TLP this part will be gone.
>> >
>> > Yes, true. But realistically Solr will have to adopt some kind of
>> > snapshot-based dependency on Lucene anyway (whether as a git submodule
>> > or a maven snapshot dependency). So if there are bugs in Lucene they
>> > will still be detected by Solr tests (and fairly early).
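
The snapshot-dependency option mentioned here might be sketched, on the Solr side, as a Maven dependency like the following. This is a hypothetical illustration only; the version shown is an example, not actual project configuration:

```xml
<!-- Hypothetical sketch: Solr consuming a Lucene snapshot build.
     The version is illustrative, not real project configuration. -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>9.0.0-SNAPSHOT</version>
</dependency>
```

A git submodule pinned to a Lucene commit would serve the same purpose, trading Maven's snapshot resolution for an explicit, reviewable version bump.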
>> >
>> > 4) Why split now if we merged in the first place?
>> >
>> > Some of you may wonder why split the project that was initially
>> > *merged* from two independent codebases (around 10 years ago). In
>> > short, there was a lot of code duplication and interaction between
>> > Solr and Lucene back then, with patches flying back and forth.
>> > Integration into a single codebase seemed like a great idea to clean
>> > things up and make things easier. In many ways this is exactly what
>> > did happen: we have cleaned up code dependencies and reusable
>> > components (on Lucene side) consumed by not just Solr but also other
>> > projects (downstream from Lucene).
>> >
>> > The situation we find ourselves in now is different from what it was
>> > before: recent and ongoing development for the most part falls within
>> > Solr or Lucene exclusively.
>> >
>> >
>> > This e-mail is for discussing the idea and presenting arguments/
>> > counter-arguments for or against the split. It will be followed by a
>> > separate VOTE thread e-mail next Monday. If the vote passes then there
>> > are many questions about how this process should be arranged and
>> > orchestrated. There are past examples even within Lucene [1] that we
>> > can learn from, and there are people who know how to do it - the
>> > actual process is of lesser concern at the moment; what we mostly want
>> > to do is to reach out to you, signal the idea, and ask for your
>> > opinion. Let us know what you think.
>> >
>> > [1]
>> https://lists.apache.org/thread.html/15bf2dc6d6ccd25459f8a43f0122751eedd3834caa31705f790844d7%401270142638%40%3Cuser.nutch.apache.org%3E
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
>
> --
> Anshum Gupta
>




Re: [DISCUSS] Lucene-Solr split (Solr promoted to TLP)

2020-05-13 Thread Doug Turnbull
Jason, I hear your arguments, and I think of them as arguments FOR a split.

This might sound a bit harsh, but maybe Lucene devs helping with Solr has
let Solr off the hook a bit too much? I actually like the fact that the
split forces Solr to figure out its own situation and focus on
its problems.

Regardless of the split or not, Solr is going to sink or swim based on the
efforts of Solr committers, not Lucene committers. I don't think Lucene
committers are going to be the ones to really address the systemic issues
with Solr. If anything, I imagine their contributions would be at the "let me
fix this so the code compiles" level of maintenance.

"Falling behind Lucene" is counterbalanced, for me, by the question "Should
Solr be on cutting-edge Lucene?"

I'd be OK with a stable, robust Solr that got 1-2 major versions behind
Lucene, but was rock-solid with a lower barrier to entry...

On Wed, May 13, 2020 at 10:07 AM Jason Gerlowski 
wrote:

> Wanted to add my two cents to the mix, though I'm a little late as the
> vote has already progressed pretty far.
>
> I'm against a split.  From the points raised, I agree that Lucene has
> much to gain.  But Solr has a lot to lose.
>
> Lucene devs would be freed from keeping Solr usage up to date.  That's
> a great improvement for Lucene itself.  But that burden doesn't
> disappear - it's just being moved to a different (smaller) group of
> committers - who by definition don't know Lucene as well, and are less
> suited to the task.  (Lucene devs still might help post-split, but
> given that avoiding this burden is one of the arguments made above for
> a split, it seems unwise to assume how much this generosity will
> continue.)
>
> One likely result is that Solr will fall behind Lucene. Possibly
> permanently behind.  Lucene folks are doing great work to improve
> perf, add features etc. so falling behind is a Very Bad Thing.  To
> Solr, Lucene is not the same as Jetty or Jackson which Solr can fall
> behind on without significant detriment.  Lucene and the core search
> functionality it offers is what brings people to Solr (or Elastic).
> Putting ourselves in a position to fall behind on Lucene does a huge
> disservice to our users, and loses Solr one of its greatest
> advantages.
>
> I hope that in the case of a split, the Solr community would rise to
> the occasion and prevent this.  But my personal judgement is that it's
> unlikely.  I hate to be negative, and I hope to be proven wrong, but
> that's how things look to me.  We (Solr folks) have a bad track record
> of addressing things with less-tangible, less-sellable benefits.  Take
> our ongoing test flakiness woes and SolrCloud instability issues as
> examples: both are serious threats to the project, both have been
> around for years, and both are here to stay for the foreseeable
> future.
>
> If conditions were different in a way that made "falling behind" less
> likely, I'd be all for a split.  But given (1) our recent track record
> of addressing these sort of issues, (2) our test flakiness which will
> make identifying "Lucene snapshot upgrade" bugs exceedingly difficult,
> and (3) the current economic conditions which may make it harder for
> committers to negotiate time from their employers to work on Lucene
> updates...now seems like a bad time to attempt a split.  It will harm
> Solr more than it helps Lucene.
>
> On Tue, May 12, 2020 at 3:37 PM Namgyu Kim  wrote:
> >
> > It's hard to make a decision because there are pros and cons.
> > Basically, I agree with separating, but there are some questions.
> > So I won't vote right now.
> >
> > 1) Release version
> > Currently, versions of Lucene and Solr are aligned, how will they be
> managed in the future?
> > Other people took Elasticsearch as an example... But it was an
> independent project from the beginning.
> > So there is no problem with the Lucene version. (Elasticsearch 7.7 and
> Lucene 8.5.1)
> > I'm sure that if we make Solr an independent project, it will create
> cracks in the version structure (like Lucene 8.6.2 and Solr 8.9.1).
> > But it's also strange to suddenly start a new versioning for Solr (Solr
> 1.0).
> > Of course it's a matter of adaptation, but it's likely to cause some
> confusion for existing users.
> >
> > 2) Complementary relationship
> > When Lucene and Solr are built together, Solr can always stay on the
> latest Lucene.
> > In my personal opinion, that's a great advantage of Solr,
> > because Solr doesn't have to suffer from Lucene API changes and always has
> the latest library.
> > But that will be difficult if Solr becomes independent.
> > If Solr tracks the master branch of Lucene on separate
> repository(project), can it always check and reflect Lu

Re: [DISCUSS] Lucene-Solr split (Solr promoted to TLP)

2020-05-12 Thread Doug Turnbull
)
> relationship with the Lucene community as an involved and vested consumer.
> >>
> >> Erik
> >>
> >
> >
> > --
> > Regards,
> >
> > Atri
> > Apache Concerted
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>



Re: [DISCUSS] Lucene-Solr split (Solr promoted to TLP)

2020-05-05 Thread Doug Turnbull
Personally I feel the burden of proof should not be on why they should be
split up, but the other way around: "what arguments can be made for keeping
them together?"

I would be curious if people can make the argument for keeping them
together...

-Doug

On Tue, May 5, 2020 at 10:29 AM Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Mon, May 4, 2020 at 5:28 PM Gézapeti Cseh  wrote:
>
> I think separating the git repository and even the release schedules could
>> be done under the same TLP.
>>
> It would solve most of the technical issues reflected in the first mail
>> and there would be more time and data to
>>
>
> Hmm that is technically true, and in fact that is the way it was before 10
> years ago: Solr was a sub-project of Apache Lucene.
>
> But that is not the proposal here.
>
> Lucene and Solr have become such major efforts, in developers and users
> eyes and keyboard effort/time, that they really are very different entities
> now.  TLP makes sense to me for each project.
>
>>
>
>> see if creating Apache Solr again is something the PMC would want to do
>>
>
> Hmm, just to clarify, this is not an "again" sort of situation: Solr was
> not a top-level project before.  It was and still is a sub-project of
> Apache Lucene.
>
> And the proposal is to now split it out as its own (new) top-level
> project, Apache Solr.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>




Re: Welcome Eric Pugh as a Lucene/Solr committer

2020-04-07 Thread Doug Turnbull
Eric, great work! Congrats!

Yes we need to see a pic of that quilt... ;)

On Tue, Apr 7, 2020 at 4:40 PM Mikhail Khludnev  wrote:

> Welcome, Eric.
>
> On Tue, Apr 7, 2020 at 4:57 PM Eric Pugh 
> wrote:
>
>> Thank you everyone!  I’ll keep it short, otherwise this will be a very
>> long email… ;-).
>>
>> I was first introduced to Solr and Lucene by Erik Hatcher, and today I
>> wonder what my life would be like if he hadn’t taken the time to show me
>> some cool code he was working on and explained to me the way to change the
>> world was through open source contributions!
>>
>> I co-founded OpenSource Connections (http://o19s.com) along with Scott
>> Stults and Jason Hull in 2005.  We found our niche in Solr consulting after
>> I went to the first LuceneRevolution and got inspired (complete with Jerry
>> Maguire style manifesto shared with the company). Through consulting, I get
>> to help onboard organizations into the Solr community - a thriving, healthy
>> ASF is very near & dear to my heart.
>>
>> I’ve been around this community for a long time, with my first JIRA being
>> three digits: SOLR-284.  Today, I’m still contributing to Apache Tika. I’ve
>> gotten to meet and spend some significant time with Tim Allison from that
>> project and learned a LOT about text!
>>
>> I was in the right place at the right time and was able to join David
>> Smiley as co-author on the first Solr book; we went on to do a total of
>> three editions of that book.  Phew!
>>
>> Once I got to sit on stage as a judge for Stump the Chump, it was Erick,
>> Erik, and Eric ;-)
>>
>> After doing Solr for a good while, I got lucky and met Doug Turnbull on
>> the sidewalk one day because he had on a t-shirt that said “My code doesn’t
>> have bugs, it has unexpected features”.   Couple of years later he and
>> fellow colleague John Berryman published Relevant Search and today I’m
>> working in the fascinating intersection of people, Search, and Data Science
>> helping build smarter search experiences as a Relevance Strategist. I'm
>> excited about bringing relevance use cases 'down to earth'. I also steward
>> OSC's contributions to the open source tool Quepid to help fulfill that
>> goal.
>>
>> Oh, and I’ve got a stack of LuceneRevolution and related conference
>> t-shirts that my mother turned into a fantastic quilt ;-).
>>
>> Eric
>>
>>
>>
>> On Apr 6, 2020, at 9:39 PM, Shalin Shekhar Mangar 
>> wrote:
>>
>> Congratulations and welcome Eric!
>>
>> On Mon, Apr 6, 2020 at 5:51 PM Jan Høydahl  wrote:
>>
>>> Hi all,
>>>
>>> Please join me in welcoming Eric Pugh as the latest Lucene/Solr
>>> committer!
>>>
>>> Eric has been part of the Solr community for over a decade, as a code
>>> contributor, book author, company founder, blogger and mailing list
>>> contributor! We look forward to his future contributions!
>>>
>>> Congratulations and welcome! It is a tradition to introduce yourself
>>> with a brief bio, Eric.
>>>
>>> Jan Høydahl
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>>
>> ___
>> *Eric Pugh **| *Founder & CEO | OpenSource Connections, LLC | 434.466.1467
>> | http://www.opensourceconnections.com | My Free/Busy
>> <http://tinyurl.com/eric-cal>
>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
>> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>> This e-mail and all contents, including attachments, is considered to be
>> Company Confidential unless explicitly stated otherwise, regardless
>> of whether attachments are marked as such.
>>
>>
>
> --
> Sincerely yours
> Mikhail Khludnev
>




Re: Change solr/lucene Readme file format

2020-01-22 Thread Doug Turnbull
I think this got lost in the holidays. I wanted to bump this contribution,
as I feel Markdown is the standard format devs expect for READMEs
these days. (And the files were close to Markdown anyway.)

Or if the project doesn't want this contribution, I feel we should at least
let Pinkesh (with his 1st contribution) know that this isn't something the
project wants, and close the PR.

Best!
-Doug

On Thu, Nov 14, 2019 at 12:51 AM Man with No Name 
wrote:

> Hey guys,
> I have created a PR <https://github.com/apache/lucene-solr/pull/908> on
> this, please have a look to see if that's helpful.
>
> Thanks:
> Pinkesh Sharma
>
> On Sun, Nov 10, 2019 at 11:29 AM Uwe Schindler  wrote:
>
>> Hi,
>>
>> When building the documentation (ant documentation), all readme files
>> included in the documentation are parsed as markdown (see flexmark task in
>> ant) and converted to html. This works well, although not everything is
>> markdown. If you have a plain readme file, it will still parse as valid
>> markdown and the HTML output looks fine, so Erick's problem with markdown
>> isn't one.
>>
>> Uwe
>>
>> Am November 10, 2019 4:00:21 PM UTC schrieb Marcus Eagan <
>> marcusea...@gmail.com>:
>>>
>>> Most README files in contemporary open source projects are Markdown
>>> because of the formatting features. I personally favor convention over ease
>>> of use in this case.
>>>
>>> Marcus Eagan
>>>
>>> On Sun, Nov 10, 2019, 8:58 AM Erick Erickson 
>>> wrote:
>>>
>>>> Personally I’d make them text files. The last thing I want to do is
>>>> make reading/updating these have a barrier to entry. We should save
>>>> formatting for the ref guide and/or Wiki.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> > On Nov 10, 2019, at 1:01 AM, Man with No Name <
>>>> pinkeshsharm...@gmail.com> wrote:
>>>> >
>>>> > Hey folks,
>>>> > I have been looking into the solr/lucene source code, and the first
>>>> thing that caught my eye was the different README files. All the files had
>>>> different file and text formats. What do you guys think about converting all
>>>> the READMEs to Markdown rather than text files, with a standard template?
>>>> >
>>>> >
>>>> > --
>>>> > Regards:
>>>> > Pinkesh Sharma
>>>>
>>>>
>>>> -
>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>
>>>>
>> --
>> Uwe Schindler
>> Achterdiek 19, 28357 Bremen
>> https://www.thetaphi.de
>>
>
>
> --
> Regards:
> Pinkesh Sharma
>




Re: Commit / Code Review Policy

2019-12-03 Thread Doug Turnbull
n the Yetus or ZK model with a 72 hour timeout
> is a reasonable compromise, especially because a hard shift from CTR to RTC
> would need a corresponding culture shift that may not happen immediately.
>
> Mike
>
> On Mon, Dec 2, 2019 at 11:19 PM David Smiley 
> wrote:
>
>> https://cwiki.apache.org/confluence/display/LUCENE/Commit+Policy+-+DRAFT
>>
>> Updated:
>> * Suggested new title
>> * Emphasizing "Guidelines" instead of policy
>> * Defines lazy-consensus
>> * Added [PENDING DISCUSSION] to other topics for now
>>
>> Question:
>> * Are we agreeable to my definition of "minor"?
>> * Do we agree we don't even need a JIRA issue for "minor" things?
>> * Do we agree we don't even need a CHANGES.txt entry for "minor" things?
>> Of course it's ultimately up to the committer's discretion but I ask as a
>> general guideline.  If we can imagine some counter examples then they might
>> be good candidates to add to the doc.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Mon, Dec 2, 2019 at 10:15 PM Ishan Chattopadhyaya <
>> ichattopadhy...@gmail.com> wrote:
>>
>>> > Why should I ask for your review? It's not even your code thats
>>> running anymore, its the hackers code :)
>>>
>>> Haha! +1 on moving ahead with RCEs and other security issues without
>>> needing to wait for reviews. Waiting for reviews (esp. if no one has enough
>>> bandwidth for quick reviews) for such crucial issues can risk dragging
>>> those issues on and on needlessly. Reviews can happen after commit too, if
>>> people have the time.
>>>
>>> On Tue, 3 Dec, 2019, 6:51 AM Robert Muir,  wrote:
>>>
>>>>
>>>>
>>>> On Mon, Dec 2, 2019 at 3:33 PM David Smiley 
>>>> wrote:
>>>>
>>>>>
>>>>> Rob wrote:
>>>>>
>>>>>> Why should I wait weeks/months for some explicit review
>>>>>>
>>>>> Ask for a review, which as this document says is really just a LGTM
>>>>> threshold of approval, not even a real code review.  Given your reputation
>>>>> of writing quality code, this isn't going to be an issue for you.  If it's
>>>>> taking multiple weeks for anyone then we have a problem to fix -- and at
>>>>> present we do in Solr.  Explicitly encouraging mere approvals (as the
>>>>> document says) should help a little.  Establishing that we want this
>>>>> standard of conduct as this document says (even if not mandatory) will 
>>>>> also
>>>>> help -- "you scratch my back, I scratch yours".  But I think we should do
>>>>> even more...
>>>>>
>>>>>
>>>>  Why should I ask for your review? It's not even your code thats
>>>> running anymore, its the hackers code :)
>>>>
>>>>
>>>>



Re: Renaming SolrCloud

2019-10-14 Thread Doug Turnbull
I very much agree with normalizing on one mode of running Solr.

So long as the 'cluster mode' hello world is easier than having to think a
lot about ZooKeeper and other hard things. One reason people use standalone
mode is that it's as simple as "point '/bin/solr' at a config directory and
go". If there's just cluster mode, it should all be dead simple, to help
newbies play around with Solr without having to think that hard.

-Doug

On Mon, Oct 14, 2019 at 12:36 PM Houston Putman 
wrote:

> Jan,
>
> I agree strongly with your last point. And in case you haven't seen it
> before, there is a solr k8s operator, with a growing community, under
> development at https://github.com/bloomberg/solr-operator.
>
> I agree that taking control of the solr docker images could be a good
> idea. That way, it could have larger involvement from the community and
> grow more organically with changes in Solr itself.
>
> - Houston
>
> On Tue, Oct 8, 2019 at 8:25 PM Noble Paul  wrote:
>
>> Why even "cluster mode" or "cloud mode"?
>>
>> Solr, by default, should use the cluster mode. So in all our
>> documentation, we should use just "Solr" and it should refer to the
>> "cluster mode of Solr".
>>
>> Wherever we don't mean "cluster mode", it should be explicitly qualified
>> as "standalone Solr".
>>
>> On Wed, Oct 2, 2019 at 1:24 PM David Smiley 
>> wrote:
>> >
>> > I hear you and sympathize but "SolrCloud" has been used long enough
>> that I doubt the trouble is worth it.  I guess that makes me "+0".  That
>> said, I think it wouldn't hurt to formalize "standalone mode" as-such and
>> perhaps say more explicitly that SolrCloud == "cluster mode" even if we
>> don't eliminate SolrCloud terminology.
>> >
>> > And as SolrCloud ... errr... "cluster mode" I mean, gains in usage
>> relative to "standalone mode", perhaps we can reference SolrCloud less
>> often and sorta assume that and instead make exceptions in documentation to
>> standalone mode specifics where we call that out as such.  It's a loose
>> idea; I don't have an example in mind.
>> >
>> > Similar to the above notion, maybe "CloudSolrClient" could be more
>> invisible without renaming it.  Imagine SolrClient.createFromZooKeeper()
>> etc. static methods that instantiate CloudSolrClient by default.  Just a
>> thought.
>> >
>> > ~ David Smiley
>> > Apache Lucene/Solr Search Developer
>> > http://www.linkedin.com/in/davidwsmiley
>> >
>> >
>> > On Mon, Sep 30, 2019 at 11:19 AM Shawn Heisey 
>> wrote:
>> >>
>> >> On 9/30/2019 6:59 AM, Ishan Chattopadhyaya wrote:
>> >> > I propose that we rename SolrCloud mode to "cluster mode" such that
>> >> > there shall be "Apache Solr", running in either "standalone mode" or
>> >> > "cluster mode". We can effect this renaming 9.0 onwards, if we have
>> >> > consensus.
>> >> >
>> >> > I am open to any other proposal as well, so long as we drop the
>> "cloud"
>> >> > in the name.
>> >>
>> >> I see your point, but I think that "cloud" is so entrenched in the
>> >> overall consciousness of the software that changing it will not be
>> easy.
>> >>
>> >> Maybe it might be something we could accomplish slowly, over the rest
>> of
>> >> 8.0's lifetime and the entire 9.0 lifetime.  Begin changing the
>> >> terminology we use in communication, start shifting documentation and
>> >> code, with a hard cutover in a later major version, perhaps 10.0 or
>> 11.0.
>> >>
>> >> The level of effort involved would be considerable, whether it happens
>> >> quickly or slowly.  It might be the kind of thing we just don't want to
>> >> try and do.
>> >>
>> >> I'm not opposed to the idea, and I might even be able to help, but it's
>> >> going to need a lot of buy-in from those of us who work on Solr.
>> >>
>> >> Thanks,
>> >> Shawn
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> >>
>>
>>
>> --
>> -
>> Noble Paul
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>



Re: Separate dev mailing list for automated mails?

2019-08-07 Thread Doug Turnbull
+1 - Just two days ago I created a filter to send [JENKINS] emails
elsewhere... I don't want to completely unsubscribe from Lucene development
emails, but the traffic here is a bit overwhelming and it's hard to see the
signal in the noise sometimes (high recall, low precision you might say!)

On Wed, Aug 7, 2019 at 5:27 PM Noble Paul  wrote:

> +1
>
> The mailing list is sending so many mails that it has become difficult to
> catch up.
>
> On Thu, Aug 8, 2019 at 12:26 AM Michael Sokolov 
> wrote:
> >
> > big +1 -- I'm also curious why the subject lines of many automated
> > emails (from Jira?) start with [CREATED] even though they are
> > generated by comments or other kinds of updates (not creating a new
> > issue). Overall, I think we have way too much comment spam. In
> > particular Github comments are so poorly formatted in email (at least
> > in gmail?) as to be almost unreadable - I think because they always
> > include the complete comment history. I wonder if there is a way to
> > neaten them up (especially the subject lines, so you can scan
> > quickly)?
> >
> > On Tue, Aug 6, 2019 at 7:17 PM Jan Høydahl 
> wrote:
> > >
> > > Hi
> > >
> > > The mail volume on dev@ is fairly high, between 2500-3500/month.
> > > To break down the numbers last month, see
> https://lists.apache.org/trends.html?dev@lucene.apache.org:lte=1M:
> > >
> > > Top 10 participants:
> > > -GitBox: 420 emails
> > > -ASF subversion and git services (JIRA): 351 emails
> > > -Apache Jenkins Server: 261 emails
> > > -Policeman Jenkins Server: 234 emails
> > > -Munendra S N (JIRA): 134 emails
> > > -Joel Bernstein (JIRA): 84 emails
> > > -Tomoko Uchida (JIRA): 77 emails
> > > -Jan Høydahl (JIRA): 52 emails
> > > -Andrzej Bialecki (JIRA): 47 emails
> > > -Adrien Grand (JIRA): 46 emails
> > >
> > > I have especially noticed how every single GitHub PR review comment
> triggers its own email instead of one email per review session.
> > > Also, every commit/push triggers an email since a bot adds a comment
> to JIRA for it.
> > >
> > > Personally I think the ratio of notifications vs human emails is a bit
> too high. I fear external devs who just want to follow the project may get
> overwhelmed and unsubscribe.
> > > One suggestion is therefore to add a new list where detailed JIRA
> comments and Github comments / reviews go. All committers should of course
> subscribe!
> > > I saw the Zookeeper project have a notifications@ list for GitHub
> comments and issues@ for JIRA comments (Except the first [Created] email
> for a JIRA will also go to dev@)
> > > The Maven project follows the same scheme and they also send Jenkins
> mails to the notifications@ list. The Cassandra project seems to divert
> all jira comments to the commits@ list.
> > > The HBase project keeps only [Created]/[Resolved] mails on dev@
> and all others from Jira/GH on the issues@ list and Jenkins mails on a
> separate builds@ list.
> > >
> > > Is it time we did something similar? I propose a single new
> notifications@ list for everything JIRA, GitHub and Jenkins but keep
> [Created|Resolved] mails on dev@
> > >
> > > --
> > > Jan Høydahl, search solution architect
> > > Cominvent AS - www.cominvent.com
> > >
> > >
> > >
> >
> >
>
>
> --
> -
> Noble Paul
>
>
>



[jira] [Commented] (LUCENE-8841) Explore Relevance Based Performance Benchmarks

2019-06-08 Thread Doug Turnbull (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859165#comment-16859165
 ] 

Doug Turnbull commented on LUCENE-8841:
---

Big +1, though I suspect it would be very hard! This could be an Apache project 
in and of itself...

One challenge is that the range of use cases Lucene is used for is tremendously 
diverse: from job search, to e-commerce, to legal search, to enterprise search, 
to news search, to Web search, and everything in between and outside the box. 
You wouldn't want a situation, for example, where you only have an e-commerce 
test set, and so end up harming enterprise search users because of decisions 
made optimizing an e-commerce set. 

Another challenge is getting reliable relevance judgments. Teams go deep into 
developing their methodology for creating a golden set of judgments. This of 
course can be a very domain-specific and challenging problem. There's no 
obvious one-size-fits-all approach. Some teams use human judges, others 
crowdsource, others rely heavily on analytics. Some have access to conversion 
data, others don't. You have all sorts of biases to contend with in every 
situation. 
And the judgments evolve over time. (today's most relevant iPhone isn't the 
same as 2 years ago). So getting it right takes a lot of energy and time from 
mature search orgs. So what judgments/data you choose isn't clear if you want 
to cover a broad range of use cases.

I think the best case is to partner with some organizations that are willing to 
open up this data alongside their corpus. Where we could validate and feel good 
about the methodology they use in generating judgments. You'd need to update 
the relevance judgments and corpus over time. There are of course TREC and 
other academic datasets; that's one data point. Some folks I know at Wikipedia 
have 
talked about this. But you'd want some more commercial datasets (corpus + 
judgments).

But partnering with orgs would also have limits, as this stuff has very high 
value to companies... But perhaps they'd be incentivized to open up their 
data if Lucene was going to make decisions with it that helped them?!?
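To make the evaluation side concrete: a small, self-contained Python sketch 
(illustrative only — not Lucene code; the 0-3 grading scale, the 
unjudged-as-zero convention, and all doc ids are assumptions) of how a golden 
judgment set scores a ranking with NDCG:

```python
import math

def dcg(gains):
    """Discounted cumulative gain over a ranked list of graded judgments."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(ranking, judgments, k=10):
    """ranking: doc ids in retrieved order; judgments: doc id -> grade (0-3).
    Unjudged documents are treated as grade 0 (one common convention)."""
    gains = [judgments.get(doc, 0) for doc in ranking[:k]]
    ideal = dcg(sorted(judgments.values(), reverse=True)[:k])
    return dcg(gains) / ideal if ideal > 0 else 0.0

# A tiny "golden set" for one query, however the judgments were sourced
# (human judges, crowdsourcing, click models, conversions):
judged = {"d1": 3, "d2": 2, "d3": 0, "d4": 1}
print(round(ndcg(["d2", "d1", "d5", "d4"], judged), 3))  # → 0.908
```

Tracking a number like this per query set across Lucene versions is the 
regression signal the issue asks for — the hard part remains sourcing and 
maintaining the judgments.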

 

> Explore Relevance Based Performance Benchmarks
> --
>
> Key: LUCENE-8841
> URL: https://issues.apache.org/jira/browse/LUCENE-8841
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Major
>
> While discussing improvements in relevance of fuzzy queries with [~jimczi], 
> the topic of how to measure impact of changes to relevance of common queries 
> came up. While a non-trivial effort, having such a benchmark will allow us to 
> measure the impact of potential changes and also catch regressions well in 
> time.
>  
> This Jira tracks ideas and efforts in that direction



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




Re: Vector based store and ANN

2019-03-02 Thread Doug Turnbull
>- Advanced document retrieval: Using a numerical vector representation
>of a document, we could improve the search result
>- Nearest neighbor queries: discovering the nearest neighbors to a
>given query could also benefit from these ANN algorithms (although doesn’t
>necessarily need the vector based index)
>
>
> I would be grateful to hear your thoughts and whether the community is
> open to a conversation on this topic with my team.
>
> Thanks,
>
> Pedram
>
> *From:* J. Delgado 
> *Sent:* Thursday, February 28, 2019 7:38 AM
> *To:* dev@lucene.apache.org
> *Cc:* Radhakrishnan Srikanth (SRIKANTH) 
> *Subject:* Re: Vector based store and ANN
>
> Lucene’s scoring function (which I believe is okapi BM25
> https://en.m.wikipedia.org/wiki/Okapi_BM25)
> is a kind of nearest neighbor using the TF-IDF vector representation of
> documents and query. Are you interested in ANN to be applied to a different
> kind of vector representation, say for example Doc2Vec?
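(For readers following along, the nearest-neighbor framing can be made concrete 
with a brute-force sketch — plain Python, illustrative only, not a Lucene API; 
ANN structures such as HNSW approximate this exhaustive scan to avoid its 
linear cost. All vectors and ids below are made up.)

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def knn(query, docs, k=2):
    """Exact top-k by cosine over docs (id -> vector): the baseline an
    approximate nearest-neighbor index trades away for speed."""
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]

vectors = {"doc_a": [1.0, 0.0], "doc_b": [0.9, 0.1], "doc_c": [0.0, 1.0]}
print(knn([1.0, 0.05], vectors))  # → ['doc_a', 'doc_b']
```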
>
> On Thu, Feb 28, 2019 at 5:59 AM Adrien Grand  wrote:
>
> Hi Pedram,
>
> We don't have much in this area, but I'm hearing increasing interest
> so it'd be nice to get better there! The closest that we have is this
> class that can search for nearest neighbors for a vector of up to 8
> dimensions:
> https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java
> .
>
> On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei
>  wrote:
> >
> > Hi there,
> >
> >
> >
> > Is there a way to store numerical vectors (vector based index) and
> perform search based on Approximate Nearest Neighbor class of algorithms in
> Lucene?
> >
> >
> >
> > If not, has there been any interest in the topic so far?
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Pedram
>
>
>
> --
> Adrien
>
>
> --
> Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley
>  | Book: http://www.solrenterprisesearchserver.com
>
>
>



Re: Vector based store and ANN

2019-03-02 Thread Doug Turnbull
or a vector of up to 8
> dimensions:
> https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java
> .
>
> On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei
>  wrote:
> >
> > Hi there,
> >
> >
> >
> > Is there a way to store numerical vectors (vector based index) and
> perform search based on Approximate Nearest Neighbor class of algorithms in
> Lucene?
> >
> >
> >
> > If not, has there been any interest in the topic so far?
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Pedram
>
>
>
> --
> Adrien
>
>
> --
>
> Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
>
> LinkedIn: http://linkedin.com/in/davidwsmiley
>  | Book: http://www.solrenterprisesearchserver.com
>
>
>




Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-28 Thread Doug Turnbull
I like that idea, Alan. The trick is that for QueryBuilder's 'newSynonymQuery'
to be useful in that context, you need to pass terms with metadata down to the
subclass. This is what I started working on a few weeks ago:

https://github.com/o19s/lucene-solr/commit/0fc3930671ef002cfbb5e3d52b6f8edc3715bf14

I don't think it's as simple as overriding analyzeBoolean/analyzeMultiBoolean
as Rob suggests, as there's also analyzeGraphBoolean, which would also need to
collect this metadata. I wouldn't want to copy-paste all this code into a
subclass just to add one token attribute.
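A sketch of the intent (plain Python with hypothetical names — Token, WEIGHTS,
and build_weighted_clauses are not Lucene APIs): if each stacked token carried
its type down to the query builder, a subclass could map types to boosts
instead of treating all stacked terms identically:

```python
from collections import namedtuple

# Hypothetical token-with-metadata shape; "type" plays the role of
# Lucene's TypeAttribute (ORIGINAL / SYNONYM / HYPONYM / ...).
Token = namedtuple("Token", ["term", "type"])

# Assumed boost policy — purely illustrative numbers:
WEIGHTS = {"ORIGINAL": 1.0, "SYNONYM": 0.8, "HYPONYM": 0.6}

def build_weighted_clauses(tokens):
    """Map each stacked token to a (term, boost) clause — the kind of
    decision newSynonymQuery could make if metadata were passed down."""
    return [(t.term, WEIGHTS.get(t.type, 1.0)) for t in tokens]

stacked = [Token("laptop", "ORIGINAL"), Token("notebook", "SYNONYM"),
           Token("macbook", "HYPONYM")]
print(build_weighted_clauses(stacked))
# → [('laptop', 1.0), ('notebook', 0.8), ('macbook', 0.6)]
```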

-Doug



On Wed, Nov 28, 2018 at 12:25 PM Alan Woodward  wrote:

> I think we can expose this information now with a small tweak to the
> SynonymGraphFilter, using the already-existing TypeAttribute.
>
> SGF is hard-coded to set the type attribute to “SYNONYM” on all tokens
> that it inserts into the stream.  It should be simple to add another
> constructor parameter allowing users to change this; then you can chain
> synonym filters, one for each type of expansion you want: synonym, hyponym,
> hypernym, whatever, each setting the type attribute differently.
>
> > On 28 Nov 2018, at 15:59, Michael Gibney 
> wrote:
> >
> > I think the objection to "boosting" in token filters isn't because it
> > is "too much", but rather because it breaks the abstraction of the
> > analysis chain to directly target scoring (as implied by
> > characterizing as "boosting").
> >
> > That said, I'm sympathetic to an approach that would establish an
> > Attribute to expose the kind of information that would be useful in
> > the context of synonyms (or other sorts of derived tokens discussed
> > here, where it could be useful to express information about token
> > derivation). Such an Attribute would not be directly related to
> > scoring/boosting, but would be related to analysis per se, (e.g.,
> > source token text, thesaurus, degree of confidence, etc.); support
> > could be selectively implemented by TokenFilters, and optionally
> > leveraged by query builders (e.g., translated to boosts) or even
> > recorded to index Payloads by a final custom analysis component 
> >
> > "You can look at any attribute on the tokenstream you want", "rely on
> > abstract attributes (type, ...) then it should be easy to sub-class
> > the query builder to access them".  Obviously that works iff analysis
> > components record the relevant information in attributes on the
> > tokenstream, which I think they currently don't (for much of the
> > information that has been discussed here) ... and I know of no
> > standard way to express the relevant information on the tokenstream.
> >
> > I can see that such an Attribute would be out of place (too
> > specialized) in the context of the Attributes in lucene/core; but
> > there are lots of more specialized Attributes in the various
> > submodules under lucene/analysis/* (SynonymGraphFilter lives in
> > analysis-common, FWIW). Again, this doesn't strike me as terribly
> > specialized, if one thinks of it more generally as a
> > "derivation/relationship" Attribute.
> >
> >
>
>
>
> --
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug


Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-21 Thread Doug Turnbull
There's a lot of different topics here and ideas, so we captured the use
cases we see being discussed here as in this google doc
https://docs.google.com/document/d/1w4G9bEICJ1aarr3l7OodwR5aecPkbFTISOgymErpZfQ/edit#heading=h.pszpx5dpxq7a

Basically, we've seen 5 high-level use cases discussed
- Alt Labels (what SynonymQuery does well now)
- Synonyms (looser synonyms with close meaning that need to be scored
somehow - `notebook,laptop`)
- Taxonomies (hierarchies of concepts/terms `dress shoes\oxfords`)
- Ontologies / Knowledge Graphs (networks of concepts)
- Embeddings (distributed representations of a term)

It's a doc in progress; embeddings need more work and are probably the
hardest thing on the list. There are possibly other use cases as well.

The goal isn't so much to make Lucene implement all of these (it would
create a lot of maintenance headaches to shove this all in), but some of it
is just defining practices / patterns / tools that enable these things in
Lucene-based search. Some may require no work, or some may require
supporting functionality.

-Doug

On Wed, Nov 21, 2018 at 9:23 AM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> I agree there is a tension between analysis and query parser
> responsibilities (or external to how queries are constructed). I wonder
> what you'd think of making QueryBuilder more easily subclassable by passing
> more term metadata to newSynonymQuery (such as types etc). This would let
> you select an alt strategy (such as some of the scoring systems used in the
> query expansion paper https://arxiv.org/pdf/1708.00247.pdf). Or doing
> something with a term labeled a hyponym/hypernym in a QueryBuilder
> subclass..
>
> -Doug
>
> On Wed, Nov 21, 2018 at 8:09 AM Robert Muir  wrote:
>
>> I don't think we should put scoring stuff into the analysis chain like
>> this. It already has a laundry list of responsibilities.
>>
>> Analysis chain can tell you the term is stacked or its a certain type
>> or occurs a certain number of times, but it shouldn't be supplying
>> things such as floating point boosts. That kind of scoring
>> manipulation needs to really happen in query parsing/somewhere else.
>>
>> On 11/20/18, jim ferenczi  wrote:
>> > Sorry for the late reply,
>> >
>> >> So perhaps one way forward to contribute this sort of thing into Lucene
>> > is we could implement additional QueryBuilder implementations that
>> provide
>> > such functionality?
>> >
>> > I am not sure, I mentioned Solr and ES because I thought it was about
>> > adding taxonomies and complex expansion mechanisms to query builders
>> but I
>> > wonder if we can have a simple
>> > mechanism to just (de)boost stacked tokens in the QueryBuilder. It
>> could be
>> > a new attribute that token filters would use when they produce stacked
>> > tokens and that the QueryBuilder checks when he builds the
>> SynonymQuery. We
>> > already have a TermFrequencyAttribute to alter the frequency of a term
>> when
>> > indexing so we could have the same mechanism for query term boosting ?
>> >
>> > Le dim. 18 nov. 2018 à 02:24, Doug Turnbull <
>> > dturnb...@opensourceconnections.com> a écrit :
>> >
>> >> Thanks Jim
>> >>
>> >> Yeah, now that I think about it - I agree that perhaps the simplest
>> >> option
>> >> would be to create alternate query builders. I think there are a couple
>> >> of enhancements to the base class that would be nice, such as
>> >> - Some additional token attributes passed to newSynonymQuery, such as
>> the
>> >> type (was this a synonym or hyponym or something else...)
>> >> - The ability to differentiate between the original query term and the
>> >> generated synonym terms
>> >> - Consistent support for phrases
>> >>
>> >> I think part of my goal too is to help people without the use of
>> plugins.
>> >> As we often are in scenarios at OpenSource Connections where people
>> won't
>> >> be able to use a plugin. In this case alternate expansions around
>> >> hypernyms/hyponyms/?... are a pretty frequent gap that search teams
>> have
>> >> using Solr/Lucene/ES.
>> >>
>> >> So perhaps one way forward to contribute this sort of thing into Lucene
>> >> is
>> >> we could implement additional QueryBuilder implementations that provide
>> >> such functionality?
>> >>
>> >> Thanks
>> >> -Doug
>> >>
>> >> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi 
>> >> wrote:
>

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-21 Thread Doug Turnbull
I agree there is a tension between analysis and query parser
responsibilities (or external to how queries are constructed). I wonder
what you'd think of making QueryBuilder more easily subclassable by passing
more term metadata to newSynonymQuery (such as types etc). This would let
you select an alt strategy (such as some of the scoring systems used in the
query expansion paper https://arxiv.org/pdf/1708.00247.pdf). Or doing
something with a term labeled a hyponym/hypernym in a QueryBuilder
subclass..

-Doug

On Wed, Nov 21, 2018 at 8:09 AM Robert Muir  wrote:

> I don't think we should put scoring stuff into the analysis chain like
> this. It already has a laundry list of responsibilities.
>
> Analysis chain can tell you the term is stacked or its a certain type
> or occurs a certain number of times, but it shouldn't be supplying
> things such as floating point boosts. That kind of scoring
> manipulation needs to really happen in query parsing/somewhere else.
>
> On 11/20/18, jim ferenczi  wrote:
> > Sorry for the late reply,
> >
> >> So perhaps one way forward to contribute this sort of thing into Lucene
> > is we could implement additional QueryBuilder implementations that
> provide
> > such functionality?
> >
> > I am not sure, I mentioned Solr and ES because I thought it was about
> > adding taxonomies and complex expansion mechanisms to query builders but
> I
> > wonder if we can have a simple
> > mechanism to just (de)boost stacked tokens in the QueryBuilder. It could
> be
> > a new attribute that token filters would use when they produce stacked
> > tokens and that the QueryBuilder checks when he builds the SynonymQuery.
> We
> > already have a TermFrequencyAttribute to alter the frequency of a term
> when
> > indexing so we could have the same mechanism for query term boosting ?
> >
> > Le dim. 18 nov. 2018 à 02:24, Doug Turnbull <
> > dturnb...@opensourceconnections.com> a écrit :
> >
> >> Thanks Jim
> >>
> >> Yeah, now that I think about it - I agree that perhaps the simplest
> >> option
> >> would be to create alternate query builders. I think there are a couple
> >> of enhancements to the base class that would be nice, such as
> >> - Some additional token attributes passed to newSynonymQuery, such as
> the
> >> type (was this a synonym or hyponym or something else...)
> >> - The ability to differentiate between the original query term and the
> >> generated synonym terms
> >> - Consistent support for phrases
> >>
> >> I think part of my goal too is to help people without the use of
> plugins.
> >> As we often are in scenarios at OpenSource Connections where people
> won't
> >> be able to use a plugin. In this case alternate expansions around
> >> hypernyms/hyponyms/?... are a pretty frequent gap that search teams have
> >> using Solr/Lucene/ES.
> >>
> >> So perhaps one way forward to contribute this sort of thing into Lucene
> >> is
> >> we could implement additional QueryBuilder implementations that provide
> >> such functionality?
> >>
> >> Thanks
> >> -Doug
> >>
> >> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi 
> >> wrote:
> >>
> >>> You can easily customize the query that is used for synonyms in a
> custom
> >>> QueryBuilder. The javadocs of the *newSynonymQuery* says "This is
> >>> intended for subclasses that wish to customize the generated queries."
> so
> >>> I
> >>> don't think we need to do anything there. I agree that it is sometimes
> >>> better to use something different than the SynonymQuery but in the
> >>> general
> >>> case it works as expected and can be combined with other terms
> >>> naturally.
> >>> The kind of customization you want to achieve could be done in a plugin
> >>> (or
> >>> in Solr or ES) that extends the QueryBuilder, you can also use custom
> >>> token
> >>> filters and alter the query the way you want. My point here is that the
> >>> QueryBuilder should remain simple, you can add the complexity you want
> in
> >>> a
> >>> subclass.
> >>> However I think there is another area we need to fix, the scoring of
> >>> multi-terms synonyms is broken (compared to the SynonymQuery) and could
> >>> be
> >>> improved so we need something similar than the SynonymQuery that
> handles
> >>> multi phrases.
> >>>
> >>>
> >>

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-21 Thread Doug Turnbull
Alessandro, reading your post I realized I made a mistake: you'd need to go
both up and down the hierarchy when blending. When a user
searches for dress shoes, going down a level (or two) is just as important.
If a user searches for 'dress shoes' you also need hyponym terms.

This works out if you do an index time expansion (child terms get parent
terms injected) but doesn't work out if you want a 100% query time blending.

In this case, I think I would revise my blending idea to

- Search for the term 'wingtips' (lowest doc freq, smallest set)
- Search for the term 'wingtips' blended with all child terms
- Search for parent & sibling concepts (the set of all dress shoes)
- Search for grandparent, aunt, uncle, cousins... (the set of all shoes,
highest df)

In this case, I don't *think* need any special weighting, as the true doc
freq of each concept recreates the priority ordering you guys came up with.
That's pretty neat!

-Doug

On Wed, Nov 21, 2018 at 7:20 AM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Great thoughts Jim - +1 to your idea
>
> One brainstorm I had is that taxonomies have a kind of 'ideal scoring' that I
> think would lead to a different blending strategy for taxonomies than
> synonyms.
>
> If you have a taxonomy:
>
> \shoes\dress_shoes\oxfords
> \shoes\dress_shoes\wingtips
> \shoes\lazy_shoes\loafers
> \shoes\lazy_shoes\sketchers
>
> This taxonomy states - if a document mentions 'oxfords', it's also
> discussing the concept of dress shoes. If it only mentions 'wingtips' it
> also is discussing dress shoes.
>
> Thus ideally, the true document frequency of the parent concept 'dress
> shoes' is the combination of the children. This is the number of documents
> that discuss this concept.
>
> You can repeat this for grandparent concepts. The number of documents with
> 'shoes' really is all the documents mentioning oxfords, wingtips, loafers,
> sketchers, and the like...
>
> We have implemented this idea at index time, with index-time semantic
> expansion to inject the parent concepts. (manually put dress_shoes into
> documents that just mention wingtips). This is mentioned in this blog post
> https://opensourceconnections.com/blog/2016/12/23/elasticsearch-synonyms-patterns-taxonomies/
>  and
> conference talk https://www.youtube.com/watch?v=90F30PS-884 This is
> annoying and requires reindexing. Though it's the most accurate.
>
> BUT I think a blended query-time query would capture the same semantics.
> You basically want to score a taxonomy like the following. Imagine a user
> query of wingtips, you could imagine 3 should clauses that blend at
> different levels
>
> - Search for the term 'wingtips' (lowest doc freq, smallest set)
> - Search for parent & sibling concepts (the set of all dress shoes)
> - Search for grandparent, aunt, uncle, cousins... (the set of all shoes,
> highest df)
>
> text:wingtips OR Blended(text:wingtips, text:oxfords, text:dress_shoes) OR
> Blended(text:wingtips, text:oxfords, text:dress_shoes, text:sketchers,
> text:loafers, ...)
>
> Right now this can be accomplished by just issuing 3 SHOULD queries with 3
> different query-time analyzers each with different synonym expansions
> (exact user term, child => parent/sibling, child => parent, grandparent,
> etc...). And maybe it should stay that way.
>
> But this is why I think it's a 'yes AND', yes I think it would be a great
> addition to have synonym weighting. AND I think there are blending
> strategies that are specific to the use case.
>
> -Doug
>
>
>
> On Tue, Nov 20, 2018 at 9:34 PM Michael Sokolov 
> wrote:
>
>> This is a great idea. It would also be compelling to modify the term
>> frequency using this deboosting so that stacked indexed terms can be
>> weighted according to their closeness to the original term.
>>
>> On Tue, Nov 20, 2018, 2:19 PM jim ferenczi >
> Sorry for the late reply,
>>>
>>> > So perhaps one way forward to contribute this sort of thing into
>>> Lucene is we could implement additional QueryBuilder implementations that
>>> provide such functionality?
>>>
>>> I am not sure, I mentioned Solr and ES because I thought it was about
>>> adding taxonomies and complex expansion mechanisms to query builders but I
>>> wonder if we can have a simple
>>> mechanism to just (de)boost stacked tokens in the QueryBuilder. It could
>>> be a new attribute that token filters would use when they produce stacked
>>> tokens and that the QueryBuilder checks when he builds the SynonymQuery. We
>>> already have a TermFrequencyAttribute to alter the frequency of a term when
>>> indexing so we could have the same mechani

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-21 Thread Doug Turnbull
Great thoughts Jim - +1 to your idea

One brainstorm I had is that taxonomies have a kind of 'ideal scoring' that I
think would lead to a different blending strategy for taxonomies than
synonyms.

If you have a taxonomy:

\shoes\dress_shoes\oxfords
\shoes\dress_shoes\wingtips
\shoes\lazy_shoes\loafers
\shoes\lazy_shoes\sketchers

This taxonomy states - if a document mentions 'oxfords', it's also
discussing the concept of dress shoes. If it only mentions 'wingtips' it
also is discussing dress shoes.

Thus ideally, the true document frequency of the parent concept 'dress
shoes' is the combination of the children. This is the number of documents
that discuss this concept.

You can repeat this for grandparent concepts. The number of documents with
'shoes' really is all the documents mentioning oxfords, wingtips, loafers,
sketchers, and the like...

We have implemented this idea at index time, with index-time semantic
expansion to inject the parent concepts. (manually put dress_shoes into
documents that just mention wingtips). This is mentioned in this blog post
https://opensourceconnections.com/blog/2016/12/23/elasticsearch-synonyms-patterns-taxonomies/
and
conference talk https://www.youtube.com/watch?v=90F30PS-884 This is
annoying and requires reindexing. Though it's the most accurate.
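That index-time expansion can be sketched as follows (plain Python, not the
actual implementation; the taxonomy paths mirror the example above and the
document shape is hypothetical):

```python
# Map each leaf term to its ancestor concepts, per the taxonomy above:
TAXONOMY = {
    "oxfords":   ["shoes", "dress_shoes"],
    "wingtips":  ["shoes", "dress_shoes"],
    "loafers":   ["shoes", "lazy_shoes"],
    "sketchers": ["shoes", "lazy_shoes"],
}

def expand(doc_terms):
    """Terms to index: the originals plus every ancestor concept, so a
    parent concept's doc freq becomes the union of its children's."""
    expanded = set(doc_terms)
    for term in doc_terms:
        expanded.update(TAXONOMY.get(term, []))
    return expanded

print(sorted(expand({"wingtips"})))  # → ['dress_shoes', 'shoes', 'wingtips']
```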

BUT I think a blended query-time query would capture the same semantics.
You basically want to score a taxonomy like the following. Imagine a user
query of wingtips, you could imagine 3 should clauses that blend at
different levels

- Search for the term 'wingtips' (lowest doc freq, smallest set)
- Search for parent & sibling concepts (the set of all dress shoes)
- Search for grandparent, aunt, uncle, cousins... (the set of all shoes,
highest df)

text:wingtips OR Blended(text:wingtips, text:oxfords, text:dress_shoes) OR
Blended(text:wingtips, text:oxfords, text:dress_shoes, text:sketchers,
text:loafers, ...)

Right now this can be accomplished by just issuing 3 SHOULD queries with 3
different query-time analyzers each with different synonym expansions
(exact user term, child => parent/sibling, child => parent, grandparent,
etc...). And maybe it should stay that way.

But this is why I think it's a 'yes AND', yes I think it would be a great
addition to have synonym weighting. AND I think there are blending
strategies that are specific to the use case.

-Doug



On Tue, Nov 20, 2018 at 9:34 PM Michael Sokolov  wrote:

> This is a great idea. It would also be compelling to modify the term
> frequency using this deboosting so that stacked indexed terms can be
> weighted according to their closeness to the original term.
>
> On Tue, Nov 20, 2018, 2:19 PM jim ferenczi wrote:
>> Sorry for the late reply,
>>
>> > So perhaps one way forward to contribute this sort of thing into Lucene
>> is we could implement additional QueryBuilder implementations that provide
>> such functionality?
>>
>> I am not sure, I mentioned Solr and ES because I thought it was about
>> adding taxonomies and complex expansion mechanisms to query builders but I
>> wonder if we can have a simple
>> mechanism to just (de)boost stacked tokens in the QueryBuilder. It could
>> be a new attribute that token filters would use when they produce stacked
>> tokens and that the QueryBuilder checks when he builds the SynonymQuery. We
>> already have a TermFrequencyAttribute to alter the frequency of a term when
>> indexing so we could have the same mechanism for query term boosting ?
>>
>> On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <
>> dturnb...@opensourceconnections.com> wrote:
>>
> Thanks Jim
>>>
>>> Yeah, now that I think about it - I agree that perhaps the simplest
>>> option would be to create alternate query builders. I think there's a couple
>>> of enhancements to the base class that would be nice, such as
>>> - Some additional token attributes passed to newSynonymQuery, such as
>>> the type (was this a synonym or hyponym or something else...)
>>> - The ability to differentiate between the original query term and the
>>> generated synonym terms
>>> - Consistent support for phrases
>>>
>>> I think part of my goal too is to help people without the use of
>>> plugins, as we are often in scenarios at OpenSource Connections where
>>> people won't be able to use a plugin. In this case alternate expansions
>>> around hypernyms/hyponyms/?... are a pretty frequent gap that search teams
>>> have using Solr/Lucene/ES.
>>>
>>> So perhaps one way forward to contribute this sort of thing into Lucene
>>> is we could implement additional QueryBuilder implementations that provide
>>> such functionality?
>>>
>>> Thanks
>>> -Doug
>>>

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-17 Thread Doug Turnbull
Thanks Jim

Yeah, now that I think about it - I agree that perhaps the simplest option
would be to create alternate query builders. I think there's a couple of
enhancements to the base class that would be nice, such as
- Some additional token attributes passed to newSynonymQuery, such as the
type (was this a synonym or hyponym or something else...)
- The ability to differentiate between the original query term and the
generated synonym terms
- Consistent support for phrases

I think part of my goal too is to help people without the use of plugins,
as we are often in scenarios at OpenSource Connections where people won't
be able to use a plugin. In this case alternate expansions around
hypernyms/hyponyms/?... are a pretty frequent gap that search teams have
using Solr/Lucene/ES.

So perhaps one way forward to contribute this sort of thing into Lucene is
we could implement additional QueryBuilder implementations that provide
such functionality?

Thanks
-Doug

On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi  wrote:

> You can easily customize the query that is used for synonyms in a custom
> QueryBuilder. The javadocs of the *newSynonymQuery* say "This is
> intended for subclasses that wish to customize the generated queries." so I
> don't think we need to do anything there. I agree that it is sometimes
> better to use something different than the SynonymQuery but in the general
> case it works as expected and can be combined with other terms naturally.
> The kind of customization you want to achieve could be done in a plugin (or
> in Solr or ES) that extends the QueryBuilder, you can also use custom token
> filters and alter the query the way you want. My point here is that the
> QueryBuilder should remain simple, you can add the complexity you want in a
> subclass.
> However I think there is another area we need to fix: the scoring of
> multi-term synonyms is broken (compared to the SynonymQuery) and could be
> improved, so we need something similar to the SynonymQuery that handles
> multi-term phrases.
>
>
> On Sat, Nov 17, 2018 at 07:19, Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
>
>> Yes that is another good area (there are many). Although of course
>> embeddings have their own challenges and complexities. (they often capture
>> shared context, but not shared meaning).
>>
>> It's a data point though of something we'd want to include in such a
>> framework, though not sure where it would go on the roadmap...
>>
>> On Sat, Nov 17, 2018 at 1:15 AM J. Delgado 
>> wrote:
>>
>>> What about the use of word embeddings (see
>>>
>>> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
>>> to compute word similarity?
>>>
>>> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
>>> dturnb...@opensourceconnections.com> wrote:
>>>
>>>> Hey folks,
>>>>
>>>> I wanted to open up a discussion about a change to the usage of
>>>> SynonymQuery. The goal here is to have a broader library of queries that
>>>> can address other cases where related terms occupy the same position but
>>>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>>>> ambiguous terms, and other query expansion situations).
>>>>
>>>>
>>>> I bring this up because we've noticed (as I'm sure many of you have)
>>>> the pattern of clients jamming any related term into a synonyms file and
>>>> being surprised with odd results. I like the idea of enforcing "synonyms"
>>>> means exactly-the-same in Lucene-land. It's an easy thing to tell a client
>>>> and set up simple patterns. So for synonyms, I think leaving SynonymQuery in
>>>> place works great.
>>>>
>>>> But I feel if that's the rule, we need to open up discussion of other
>>>> methods of scoring conceptual 'related term' relationships that usually
>>>> come up in the context of query expansion. This paper (
>>>> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2,
>>>> surveys the current thinking for scoring various query expansion scenarios
>>>> like those we deal with in the messy, ambiguous uses of synonyms in prod
>>>> systems (khakis aren't trousers, they're a kind-of trouser).
>>>>
>>>>
>>>> The cool thing is many of the ideas in this paper seem doable with
>>>> existing Lucene index stats. So one might imagine a 'related terms' token
>>>> filter that injected some scoring based on how related it really is to
>>>> the original query term using Jaccard, Dice, or other methods called
>>>> out in this paper.

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-16 Thread Doug Turnbull
Yes that is another good area (there are many). Although of course
embeddings have their own challenges and complexities. (they often capture
shared context, but not shared meaning).

It's a data point though of something we'd want to include in such a
framework, though not sure where it would go on the roadmap...

On Sat, Nov 17, 2018 at 1:15 AM J. Delgado 
wrote:

> What about the use of word embeddings (see
>
> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
> to compute word similarity?
>
> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
>
>> Hey folks,
>>
>> I wanted to open up a discussion about a change to the usage of
>> SynonymQuery. The goal here is to have a broader library of queries that
>> can address other cases where related terms occupy the same position but
>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>> ambiguous terms, and other query expansion situations).
>>
>>
>> I bring this up because we've noticed (as I'm sure many of you have) the
>> pattern of clients jamming any related term into a synonyms file and being
>> surprised with odd results. I like the idea of enforcing "synonyms" means
>> exactly-the-same in Lucene-land. It's an easy thing to tell a client and
>> setup simple patterns. So for synonyms, I think leaving SynonymQuery in
>> place works great.
>>
>> But I feel if that's the rule, we need to open up discussion of other
>> methods of scoring conceptual 'related term' relationships that usually
>> come up in the context of query expansion. This paper (
>> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2, surveys
>> the current thinking for scoring various query expansion scenarios like
>> those we deal with in the messy, ambiguous uses of synonyms in prod systems
>> (khakis aren't trousers, they're a kind-of trouser).
>>
>>
>> The cool thing is many of the ideas in this paper seem doable with
>> existing Lucene index stats. So one might imagine a 'related terms' token
>> filter that injected some scoring based on how related it really is to
>> the original query term using Jaccard, Dice, or other methods called out in
>> this paper.
>>
>>
>> Another insightful set of research is this article on concept scoring (
>> https://usabilityetc.com/articles/information-retrieval-concept-matching/
>> ), which prioritizes related terms by connectedness and other factors.
>>
>> Needless to say, it's an open area how two terms someone has asserted are
>> related to a query term 'should be' scored. It's one of those things that
>> likely will forever depend on a number of domain and application specific
>> factors. It's possibly a big opportunity of improvement for Lucene - but
>> likely is about putting the right framework in place to allow for good
>> default set of query-expansion scoring scenarios with options for
>> customization.
>>
>> What I'm proposing is:
>>
>>
>> - Submit a small patch that restricts SynonymQuery to tokens of type
>>   "SYNONYM" in the same posn, which allows some short-term work to be done
>>   with the current Lucene QueryBuilder. Any additional non-synonym terms
>>   would be appended as a boolean query for now.
>> - Begin work on alternate 'related-term' scoring systems that also key
>>   off the token type in QueryBuilder to create custom scoring using
>>   built-in term stats. The possibilities here are endless, up to weighted
>>   related terms (ie Alessandro's patch), feeding back Rocchio relevance
>>   feedback, etc.
>>
>>
>> I'm curious what folks would think of a patch for bullet one followed by
>> other patches down the road for additional functionality?
>>
>> (related to discussion in this Elasticsearch PR
>>
>> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249
>> )
>>
>> --
>> CTO, OpenSource Connections
>> Author, Relevant Search
>> http://o19s.com/doug
>>
> --
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug


SynonymQuery / Query Expansion Strategies Discussion

2018-11-16 Thread Doug Turnbull
Hey folks,

I wanted to open up a discussion about a change to the usage of
SynonymQuery. The goal here is to have a broader library of queries that
can address other cases where related terms occupy the same position but
don't have the same meaning (such as hypernyms, hyponyms, meronyms,
ambiguous terms, and other query expansion situations).


I bring this up because we've noticed (as I'm sure many of you have) the
pattern of clients jamming any related term into a synonyms file and being
surprised with odd results. I like the idea of enforcing "synonyms" means
exactly-the-same in Lucene-land. It's an easy thing to tell a client and
setup simple patterns. So for synonyms, I think leaving SynonymQuery in
place works great.

But I feel if that's the rule, we need to open up discussion of other
methods of scoring conceptual 'related term' relationships that usually
come up in the context of query expansion. This paper (
https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2, surveys
the current thinking for scoring various query expansion scenarios like
those we deal with in the messy, ambiguous uses of synonyms in prod systems
(khakis aren't trousers, they're a kind-of trouser).


The cool thing is many of the ideas in this paper seem doable with existing
Lucene index stats. So one might imagine a 'related terms' token filter
that injected some scoring based on how related it really is to the
original query term using Jaccard, Dice, or other methods called out in
this paper.
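A sketch of that idea: score how related a candidate term is to the original query term by the overlap of their posting sets (sets of matching doc ids), using the Jaccard and Dice coefficients the paper surveys. The postings here are made up for illustration.

```python
# Relatedness of an expansion term to the query term, measured as posting-set
# overlap. Higher overlap suggests the terms co-occur in the same documents
# and deserve a larger expansion weight. Toy postings, not real index stats.
def jaccard(a, b):
    return len(a & b) / len(a | b)

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

postings = {
    "khakis":   {1, 2, 3, 4},
    "trousers": {3, 4, 5, 6, 7, 8},
    "pants":    {2, 3, 4, 5, 6},
}

q = postings["khakis"]
for term in ("trousers", "pants"):
    print(term, round(jaccard(q, postings[term]), 3),
          round(dice(q, postings[term]), 3))
# 'pants' scores higher than 'trousers' on both measures, so it would get
# the larger weight as an expansion of 'khakis'.
```

In Lucene terms, the doc-id sets would come from index statistics (or be approximated from doc freq and co-occurrence counts), which is why these measures seem doable with what the index already stores.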


Another insightful set of research is this article on concept scoring (
https://usabilityetc.com/articles/information-retrieval-concept-matching/),
which prioritizes related terms by connectedness and other factors.

Needless to say, it's an open area how two terms someone has asserted are
related to a query term 'should be' scored. It's one of those things that
likely will forever depend on a number of domain and application specific
factors. It's possibly a big opportunity of improvement for Lucene - but
likely is about putting the right framework in place to allow for good
default set of query-expansion scoring scenarios with options for
customization.

What I'm proposing is:


   - Submit a small patch that restricts SynonymQuery to tokens of type
     "SYNONYM" in the same posn, which allows some short-term work to be done
     with the current Lucene QueryBuilder. Any additional non-synonym terms
     would be appended as a boolean query for now.
   - Begin work on alternate 'related-term' scoring systems that also key off
     the token type in QueryBuilder to create custom scoring using built-in
     term stats. The possibilities here are endless, up to weighted related
     terms (ie Alessandro's patch), feeding back Rocchio relevance feedback,
     etc.


I'm curious what folks would think of a patch for bullet one followed by
other patches down the road for additional functionality?

(related to discussion in this Elasticsearch PR

https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249)

-- 
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug


[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-15 Thread Doug Turnbull (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16688424#comment-16688424
 ] 

Doug Turnbull commented on LUCENE-8563:
---

Ah... I assumed "Adrien has his performance hat on" which probably colored my 
perception of the issue

Ah yeah my mistake I see that now, I think your strategy makes sense now and 
helps with scoring comparability across queries. :+1: to your approach with the 
LegacyBM25 implementation then!

> Remove k1+1 from the numerator of  BM25Similarity
> -
>
> Key: LUCENE-8563
> URL: https://issues.apache.org/jira/browse/LUCENE-8563
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify 
> ordering. It is often omitted, and I found out that the "The Probabilistic 
> Relevance Framework: BM25 and Beyond" paper by Robertson (BM25's author) and 
> Zaragoza even describes adding (k1+1) to the numerator as a variant whose 
> benefit is to be more comparable with Robertson/Sparck-Jones weighting, which 
> we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
>  numerator of the saturation function. This is the same for all
>  terms, and therefore does not affect the ranking produced.
>  The reason for including it was to make the final formula
>  more compatible with the RSJ weight used on its own
> {quote}
> Should we remove it from BM25Similarity as well?
> A side-effect that I'm interested in is that integrating other score 
> contributions (eg. via oal.document.FeatureField) would be a bit easier to 
> reason about. For instance a weight of 3 in FeatureField#newSaturationQuery 
> would have a similar impact as a term whose IDF is 3 (and thus docFreq ~= 5%) 
> rather than a term whose IDF is 3/(k1 + 1).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-15 Thread Doug Turnbull (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16688382#comment-16688382
 ] 

Doug Turnbull commented on LUCENE-8563:
---

Thanks [~jpountz] - My feeling is that if Lucene has something called "BM25 
Similarity" it should match the traditional definition of BM25, and 
shouldn't be deprecated. But if we want to create a faster version and make it 
the default, I think that would be great.

Or if you want to call the current (what you call legacy) 
"ClassicBM25Similarity" instead of legacy... 

I just don't feel it should be deprecated. As an IR person, I would be 
surprised if I was new to Lucene, looked up BM25 and it wasn't actually BM25...




[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-14 Thread Doug Turnbull (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687268#comment-16687268
 ] 

Doug Turnbull commented on LUCENE-8563:
---

I feel perhaps one way forward is to create a second (default?) similarity - 
FastBM25Similarity? ConstantCeilingBM25Similarity? and leave in place the 
current BM25 similarity as an optional similarity to configure. There may be 
existing practices around tuning BM25 similarity at many places where writing a 
similarity plugin is not an option




[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-12 Thread Doug Turnbull (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684091#comment-16684091
 ] 

Doug Turnbull commented on LUCENE-8563:
---

For the sake of this discussion, here's a desmos graph with BM25 with/without 
k1 in the numerator 

https://www.desmos.com/calculator/cklb27fcn9 
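To make the discussion concrete, here is a small sketch of the issue's formula with and without the (k1+1) factor (idf, k1, and norm values are made up; boost is taken as 1):

```python
# The two BM25 variants under discussion, following the issue's formula
# score = idf * (k1+1) * tf / (tf + norm), where norm already folds in k1
# and length normalization. For a single term the (k1+1) factor is a
# constant multiplier: ordering is unchanged, only the score scale differs.
def bm25(tf, idf, k1=1.2, norm=1.2, with_k1_plus_1=True):
    numerator = (k1 + 1) * tf if with_k1_plus_1 else tf
    return idf * numerator / (tf + norm)

tfs = [1, 2, 5, 10]
print([round(bm25(tf, idf=3.0), 3) for tf in tfs])
print([round(bm25(tf, idf=3.0, with_k1_plus_1=False), 3) for tf in tfs])
# Every score in the first list is exactly (k1+1) = 2.2x the corresponding
# score in the second; the ranking by tf is identical.
```

This is the sense in which dropping (k1+1) "doesn't modify ordering" for one field; the multi-field case below is where it gets interesting.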




[jira] [Comment Edited] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-12 Thread Doug Turnbull (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684080#comment-16684080
 ] 

Doug Turnbull edited comment on LUCENE-8563 at 11/12/18 5:01 PM:
-

It would modify ordering when dealing with multiple fields. Consider one field 
with a different k1 than another because the impact of term frequency is 
calibrated differently. If one calibrates one field to saturate term freq 
faster, and another slower, then ordering would be impacted


was (Author: softwaredoug):
It would modify ordering when dealing with multiple fields. Consider one field 
with a different k1 than another because the impact of term frequency is 
calibrated differently. If one calibrates one field to saturate term freq 
faster, and another slower, then ordering would be impacted

Additionally, currently k1=0 is the only way to disable term frequency without 
also disabling positions.




[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-12 Thread Doug Turnbull (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684080#comment-16684080
 ] 

Doug Turnbull commented on LUCENE-8563:
---

It would modify ordering when dealing with multiple fields. Consider one field 
with a different k1 than another because the impact of term frequency is 
calibrated differently. If one calibrates one field to saturate term freq 
faster, and another slower, then ordering would be impacted

Additionally, currently k1=0 is the only way to disable term frequency without 
also disabling positions.
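A sketch of the ordering argument: once k1 differs per field, (k1+1) is no longer one global constant, so dropping it from the numerator can flip the order of documents that match in different fields. The field names, k1 values, and term frequencies below are made up, with idf and length norm held at 1 for clarity.

```python
# Per-field BM25-style score with and without the (k1+1) numerator factor.
def field_score(tf, k1, with_k1_plus_1):
    if tf == 0:
        return 0.0
    numerator = (k1 + 1) * tf if with_k1_plus_1 else tf
    return numerator / (tf + k1)

def doc_score(tf_by_field, k1_by_field, with_k1_plus_1):
    # Sum of per-field scores, as in a simple multi-field query.
    return sum(field_score(tf_by_field[f], k1, with_k1_plus_1)
               for f, k1 in k1_by_field.items())

k1s = {"title": 0.1, "body": 2.0}   # title calibrated to saturate tf quickly
doc_a = {"title": 5, "body": 0}     # matches only in title
doc_b = {"title": 0, "body": 5}     # matches only in body

print(doc_score(doc_a, k1s, True) > doc_score(doc_b, k1s, True))    # False
print(doc_score(doc_a, k1s, False) > doc_score(doc_b, k1s, False))  # True
```

With (k1+1) in place the body match wins; without it the title match wins, which is exactly the ordering change described in the comment.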




[jira] [Commented] (SOLR-12238) Synonym Query Style Boost By Payload

2018-11-10 Thread Doug Turnbull (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682427#comment-16682427
 ] 

Doug Turnbull commented on SOLR-12238:
--

What can we do to get this functionality into Solr? (My vote would be to make 
Alessandro a committer so he can stop bugging you guys :) )

> Synonym Query Style Boost By Payload
> 
>
> Key: SOLR-12238
> URL: https://issues.apache.org/jira/browse/SOLR-12238
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 7.2
>Reporter: Alessandro Benedetti
>Priority: Major
> Attachments: SOLR-12238.patch, SOLR-12238.patch, SOLR-12238.patch, 
> SOLR-12238.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This improvement is built on top of the Synonym Query Style feature and 
> brings the possibility of boosting synonym queries using the payload 
> associated.
> It introduces two new modalities for the Synonym Query Style :
> PICK_BEST_BOOST_BY_PAYLOAD -> build a Disjunction query with the clauses 
> boosted by payload
> AS_DISTINCT_TERMS_BOOST_BY_PAYLOAD -> build a Boolean query with the clauses 
> boosted by payload
> These new synonym query styles assume payloads are available, so they must 
> be used in conjunction with a token filter able to produce payloads.
> A synonym.txt example could be:
> # Synonyms used by Payload Boost
> tiger => tiger|1.0, Big_Cat|0.8, Shere_Khan|0.9
> leopard => leopard, Big_Cat|0.8, Bagheera|0.9
> lion => lion|1.0, panthera leo|0.99, Simba|0.8
> snow_leopard => panthera uncia|0.99, snow leopard|1.0
> A simple token filter to populate the payloads from such a synonym.txt is:
> <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|"/>
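A sketch of parsing the payload-weighted synonym syntax shown above. The '|' delimiter and default boost of 1.0 are taken from the example; the parser itself is illustrative, not Solr's implementation.

```python
# Parse one payload-weighted synonym rule, e.g.
#   "tiger => tiger|1.0, Big_Cat|0.8, Shere_Khan|0.9"
# Each right-hand term carries an optional boost after the '|' delimiter,
# defaulting to 1.0 when absent (as with "leopard" in the example above).
def parse_rule(line, delimiter="|"):
    lhs, rhs = (part.strip() for part in line.split("=>"))
    expansions = []
    for raw in rhs.split(","):
        term, _, boost = raw.strip().partition(delimiter)
        expansions.append((term, float(boost) if boost else 1.0))
    return lhs, expansions

rule = parse_rule("tiger => tiger|1.0, Big_Cat|0.8, Shere_Khan|0.9")
print(rule)
# ('tiger', [('tiger', 1.0), ('Big_Cat', 0.8), ('Shere_Khan', 0.9)])
```

The parsed boosts are what PICK_BEST_BOOST_BY_PAYLOAD and AS_DISTINCT_TERMS_BOOST_BY_PAYLOAD would apply to the expanded query clauses.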






[jira] [Commented] (SOLR-12238) Synonym Query Style Boost By Payload

2018-05-04 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463830#comment-16463830
 ] 

Doug Turnbull commented on SOLR-12238:
--

Just want to say I've been watching this feature and 

+1 - great feature!

Exactly the kind of thing I was hoping to see after much of [~ehatcher]'s great 
payload work :)

> Synonym Query Style Boost By Payload
> 
>
> Key: SOLR-12238
> URL: https://issues.apache.org/jira/browse/SOLR-12238
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 7.2
>Reporter: Alessandro Benedetti
>Priority: Major
> Attachments: SOLR-12238.patch, SOLR-12238.patch, SOLR-12238.patch, 
> SOLR-12238.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This improvement is built on top of the Synonym Query Style feature and 
> brings the possibility of boosting synonym queries using the payload 
> associated.
> It introduces two new modalities for the Synonym Query Style:
> PICK_BEST_BOOST_BY_PAYLOAD -> build a Disjunction query with the clauses 
> boosted by payload
> AS_DISTINCT_TERMS_BOOST_BY_PAYLOAD -> build a Boolean query with the clauses 
> boosted by payload
> These new synonym query styles assume payloads are available, so they must 
> be used in conjunction with a token filter able to produce payloads.
> A synonym.txt example could be:
> # Synonyms used by Payload Boost
> tiger => tiger|1.0, Big_Cat|0.8, Shere_Khan|0.9
> leopard => leopard, Big_Cat|0.8, Bagheera|0.9
> lion => lion|1.0, panthera leo|0.99, Simba|0.8
> snow_leopard => panthera uncia|0.99, snow leopard|1.0
> A simple token filter to populate the payloads from such a synonym.txt is:
> <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|"/>
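As a rough illustration of the two modalities, here is a small Python sketch of how payload-boosted synonym clauses might combine for a single document. The functions, term scores, and document are invented for illustration; this is not Lucene's actual scoring code.

```python
# Hypothetical sketch of the two payload-boost modalities; clause scores
# and payload weights are invented for illustration.

def pick_best_boost_by_payload(term_scores, payloads):
    # Disjunction (dismax): only the best payload-boosted clause counts.
    return max(score * payloads.get(term, 1.0)
               for term, score in term_scores.items())

def as_distinct_terms_boost_by_payload(term_scores, payloads):
    # Boolean query: payload-boosted clause scores stack.
    return sum(score * payloads.get(term, 1.0)
               for term, score in term_scores.items())

# Expansion from the synonym.txt example above:
# tiger => tiger|1.0, Big_Cat|0.8, Shere_Khan|0.9
payloads = {"tiger": 1.0, "Big_Cat": 0.8, "Shere_Khan": 0.9}
term_scores = {"tiger": 2.0, "Big_Cat": 3.0}  # raw clause scores for one doc

best = pick_best_boost_by_payload(term_scores, payloads)
stacked = as_distinct_terms_boost_by_payload(term_scores, payloads)
assert abs(best - 2.4) < 1e-9     # Big_Cat clause wins: 3.0 * 0.8
assert abs(stacked - 4.4) < 1e-9  # 2.0 * 1.0 + 3.0 * 0.8
```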



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (LUCENE-7996) Should we require positive scores?

2017-12-05 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278893#comment-16278893
 ] 

Doug Turnbull commented on LUCENE-7996:
---

Just FYI for upstream impact, LTR models tend to output negative scores. For 
example Ranklib gradient boosting models range from -100 to 100. Of course this 
can be changed by always adding 100 to the score, but there's appeal in seeing 
the expected score from an LTR query being identical to the score you'd get 
from the model if you ran it outside of Solr/Elasticsearch.
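The "adding 100" workaround above can be sketched as a simple score shift. The lower bound is an assumption for illustration (Ranklib gradient boosting models are described above as ranging roughly -100 to 100); ranking order is preserved, but, as noted, the scores no longer match the raw model output.

```python
# Sketch of the shift workaround: offset model scores into positive
# territory. MODEL_MIN is an assumed lower bound, not a Ranklib constant.

MODEL_MIN = -100.0  # assumed lower bound of the model's score range

def to_positive(model_score):
    return model_score - MODEL_MIN

doc_scores = {"d1": -42.5, "d2": 13.0, "d3": -100.0}
shifted = {doc: to_positive(s) for doc, s in doc_scores.items()}

assert all(s >= 0 for s in shifted.values())
# The relative ordering of documents is unchanged:
assert sorted(doc_scores, key=doc_scores.get) == sorted(shifted, key=shifted.get)
```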

> Should we require positive scores?
> --
>
> Key: LUCENE-7996
> URL: https://issues.apache.org/jira/browse/LUCENE-7996
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7996.patch, LUCENE-7996.patch, LUCENE-7996.patch
>
>
> Having worked on MAXSCORE recently, things would be simpler if we required 
> that scores are positive. Practically, this would mean 
>  - forbidding/fixing similarities that may produce negative scores (we have 
> some of them)
>  - forbidding things like negative boosts
> So I'd be curious to have opinions on whether this would be a sane requirement 
> or whether we need to be able to cope with negative scores, e.g. because some 
> similarities that we want to support produce negative scores by design.






[jira] [Commented] (SOLR-11662) Make overlapping query term scoring configurable per field type

2017-12-04 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277462#comment-16277462
 ] 

Doug Turnbull commented on SOLR-11662:
--

Thanks for helping with the change David!

I would probably personally do something like that. However, I tend to 
restructure most synonyms into a taxonomy. Many people aren't aware of 
hypernymy/hyponymy. It's not uncommon to see, at an e-commerce client for 
example, a synonym entry like `pants,khakis` with another line that's 
`pants,jeans`, which of course creates an unintentional equivalence between 
jeans and khakis. Even when these are mixed in with true synonyms, I tend to 
restructure the whole thing as a taxonomy.

Some people avoid this at query time by expanding the query and relying on 
the "as_distinct_terms" behavior, which biases toward exact matches:

pants => jeans,pants,khakis
jeans => jeans,pants
khakis => jeans,khakis

A search for pants here shows a mix of different kinds of pants (khakis and 
jeans roughly equal).
A search for jeans puts jeans first (low doc freq), followed by various kinds 
of pants (high doc freq).
A search for khakis puts khakis first, followed by various kinds of non-jean 
pants.

I tend to think of synonyms as hyponyms of a canonical name for an idea. For 
jeans, for example, I might expand to:

blue_jeans => blue_jeans,jeans,pants
denim_jeans => denim_jeans,jeans,pants

With multiple analyzer chains, I might recommend controlling how loose the 
search is with different analyzer chains. For example, one could see forcing a 
strong boost for conceptually similar items. Or limiting the semantic expansion 
so that blue_jeans, for example, only expands up to the jeans level.

There's quite a lot of "it depends". The example above presupposes that pants 
have a higher doc freq than jeans, which may not be the case without a similar 
index-time expansion.
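The doc-freq intuition above can be sketched in a few lines. All numbers here are invented for illustration; real Lucene idf and term scoring are more involved.

```python
# Toy sketch: with "as_distinct_terms"-style query expansion, rarer (more
# specific) terms get higher idf, so exact matches rank first.
import math

N = 10_000  # total docs in the toy index
doc_freq = {"pants": 1000, "jeans": 200, "khakis": 150}  # invented counts

def idf(term):
    return math.log(N / doc_freq[term])

def score(doc_terms, expanded_query):
    # Boolean OR over the expanded terms; matching clauses stack.
    return sum(idf(t) for t in expanded_query if t in doc_terms)

jeans_query = ["jeans", "pants"]  # the expansion jeans => jeans,pants
jeans_doc = {"jeans", "pants"}
khaki_doc = {"khakis", "pants"}

# A search for jeans puts jeans-documents first (low doc freq wins):
assert score(jeans_doc, jeans_query) > score(khaki_doc, jeans_query)
```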


> Make overlapping query term scoring configurable per field type
> ---
>
> Key: SOLR-11662
> URL: https://issues.apache.org/jira/browse/SOLR-11662
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
>Assignee: David Smiley
> Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap 
> positions. Right now the only option is SynonymQuery. This is a fantastic 
> default & improvement on past versions. However, there are use cases where 
> terms overlap positions but don't carry exact synonymy relationships. Often 
> synonyms are actually used to model hypernym/hyponym relationships using 
> synonyms (or other analyzers). So the individual term scores matter, with 
> terms with higher specificity (hyponym) scoring higher than terms with lower 
> specificity (hypernym).
> This patch adds the fieldType setting scoreOverlaps, as in:
> {code:java}
> <fieldType name="text_general" scoreOverlaps="pick_best" class="solr.TextField" positionIncrementGap="100" multiValued="true">
> {code}
> Valid values for scoreOverlaps are:
> *as_one_term*
> Default; fits most synonym use cases. Uses SynonymQuery.
> Treats all terms as if they're exactly equivalent, with document frequency 
> from underlying terms blended.
> *pick_best*
> For a given document, score using the best scoring synonym (ie dismax over 
> generated terms).
> Useful when synonyms are not exactly equivalent but instead model 
> hypernym/hyponym relationships, so that term scores reflect each term's 
> specificity.
> IE this query time expansion
> tabby => tabby, cat, animal
> Searching "text", generates the dismax (text:tabby | text:cat | text:animal)
> *as_distinct_terms*
> (The pre-6.0 behavior.)
> A compromise between pick_best and as_one_term.
> Appropriate when synonyms reflect a hypernym/hyponym relationship, but lets 
> scores stack, so documents with more of tabby, cat, or animal score better, 
> with a bias toward the term with the highest specificity.
> Terms are turned into a boolean OR query, with document frequencies not 
> blended.
> IE this query time expansion
> tabby => tabby, cat, animal
> Searching "text", generates the boolean query (text:tabby text:cat 
> text:animal)
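As a hedged toy sketch (invented numbers, not Lucene's actual formulas), the three scoreOverlaps settings can be contrasted by how they combine per-clause scores for one document, given the expansion tabby => tabby, cat, animal:

```python
# Toy contrast of the three scoreOverlaps modes; clause scores are invented.

clause_scores = {"tabby": 3.0, "cat": 1.2, "animal": 0.4}

# pick_best: dismax over generated terms; the single best clause wins.
pick_best = max(clause_scores.values())

# as_distinct_terms: boolean OR; clause scores stack, biased toward the
# highest-scoring (most specific) term.
as_distinct_terms = sum(clause_scores.values())

# as_one_term (SynonymQuery): all clauses score as one pseudo-term with a
# blended document frequency, so no single synonym dominates; modeled here
# as an average purely for illustration.
as_one_term = sum(clause_scores.values()) / len(clause_scores)

assert pick_best == 3.0
assert as_one_term < pick_best < as_distinct_terms
```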






[jira] [Commented] (SOLR-11698) Query-time per-field query settings (ie analyzers, autoGeneratePhraseQueries, etc)

2017-12-02 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275741#comment-16275741
 ] 

Doug Turnbull commented on SOLR-11698:
--

I'm considering adding query time config to field aliases for this 
functionality. It builds on an existing feature, and seems to be the least 
error-prone to implement as edismax's query parser is already alias aware. This 
seems to be simpler than adding a whole new "config" idea.

As an example, to override autoGeneratePhraseQueries for a field "text" one 
would write

{code}
qf=text text_autophrase^10&f.text_autophrase.qf=text&f.text_autophrase.autoGeneratePhraseQueries=true
{code}

Similarly, if we had a query-overridable field type setting for analyzer we 
could write

{code}
qf=text text_synonym_autophrase^10
&f.text_synonym_autophrase.qf=text
&f.text_synonym_autophrase.autoGeneratePhraseQueries=true
&f.text_synonym_autophrase.queryAnalyzer=with_synonyms
{code}
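The alias mechanism above amounts to per-alias parameter lookup with a fallback to the fieldType default. Here is an illustrative sketch (a hypothetical helper, not Solr source code) of resolving `f.<alias>.<param>`-style overrides:

```python
# Hypothetical resolver for per-alias query-time overrides of the form
# f.<alias>.<param>, falling back to a fieldType default when absent.

def resolve(params, alias, name, default):
    return params.get("f.%s.%s" % (alias, name), default)

params = {
    "qf": "text text_autophrase^10",
    "f.text_autophrase.qf": "text",
    "f.text_autophrase.autoGeneratePhraseQueries": "true",
}

assert resolve(params, "text_autophrase",
               "autoGeneratePhraseQueries", "false") == "true"
# No override for the plain "text" field, so the default applies:
assert resolve(params, "text", "autoGeneratePhraseQueries", "false") == "false"
```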

> Query-time per-field query settings (ie analyzers, autoGeneratePhraseQueries, 
> etc)
> --
>
> Key: SOLR-11698
> URL: https://issues.apache.org/jira/browse/SOLR-11698
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
>
> This is an issue wrt to [this email 
> chain](http://lucene.472066.n3.nabble.com/Multiple-Query-Time-Analyzers-in-Solr-td4364540.html)
>  created to discuss the ability to change the query time analyzer in Solr, 
> with input from [~solrtrey], [~dsmiley], [~steve_rowe], and [~mkhludnev]
> Specifically, we ended up with the following
> _
> it seems like there's some consensus around
> - Creating multiple named analyzers per field
> - Referencing those analyzers by name at query time somehow
> I would advocate for refactoring edismax (or making a new query parser) that 
> would allow you to specify per-field query configuration. Then I would 
> advocate refactoring some of the flags autoGeneratePhraseQueries, etc to this 
> query-time config. Then we could follow suit using the same syntax to specify 
> the analyzer to use at query time.
> Perhaps more generally these configuration items can stay on the fieldType, 
> but a syntax could allow them to be overridden per field at query time?
> Finally, another requirement I would add would be the ability to specify the 
> same field twice in qf, but configured to be queried two different ways. 
> Perhaps a syntax like qf=title:config1 title:config2? Where config1 and 
> config2 modify fieldType query flags? Like 
> fieldConfig.config1.autoGeneratePhraseQueries=false&fieldConfig.config2.analyzer=no_synonyms
> This sort of thing would in my opinion help both enhance the power of Solr, 
> but with a more consistent vision around how field-specific query settings 
> could be organized
> _






Re: Solr Ref Guide not building

2017-12-02 Thread Doug Turnbull
For building HTML, the ref guide specifies the following. Here's what I
have with corresponding versions

** Prerequisites: `Ruby` (v2.1 or higher) and the following gems must be
installed:
doug@wiz$~/ws/lucene-solr/solr/solr-ref-guide(mas) $ ruby --version
ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-darwin16]


*** `jekyll`: v3.5, not v4.x. Use `gem install --force --version 3.5.0
jekyll` to force install of Jekyll 3.5.0.
*** `jekyll-asciidoc`: v2.1 or higher. Use `gem install jekyll-asciidoc` to
install.
doug@wiz$~/ws/lucene-solr/solr/solr-ref-guide(mas) $ gem list | grep jekyll
jekyll (3.5.0, 3.0.3)
jekyll-asciidoc (2.1.0)

*** `pygments.rb`: v1.1.2 or higher. Use `gem install pygments.rb` to
install.
doug@wiz$~/ws/lucene-solr/solr/solr-ref-guide(mas) $ gem list | grep
pygments
pygments.rb (1.2.0)

Then I follow the instructions in "Building the Guide"

== Building the Guide
For details on building the ref guide, see `ant -p`.

There are currently four available targets:

* `ant default`: builds both the PDF and HTML versions of the Solr Ref
Guide.
* `ant build-site`: builds only the HTML version.

And run "ant build-site" and get the output from my original email

Thanks,
-Doug


On Sat, Dec 2, 2017 at 1:16 PM Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> I'm facing some errors building the ref-guide as well:
>
> -build-raw-pdf:
> [asciidoctor:convert] Render SolrRefGuide-all.adoc from
> /home/ishan/code/lucene-solr/solr/build/solr-ref-guide/content/pdf to
> /home/ishan/code/lucene-solr/solr/build/solr-ref-guide/pdf-tmp with
> backend=pdf
> [asciidoctor:convert] asciidoctor: ERROR: about-this-guide.adoc: line 1:
> invalid part, must have at least one section (e.g., chapter, appendix, etc.)
> [asciidoctor:convert] asciidoctor: ERROR: solr-glossary.adoc: line 1:
> invalid part, must have at least one section (e.g., chapter, appendix, etc.)
>
>
> --
Consultant, OpenSource Connections. Contact info at
http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)


Re: Solr Ref Guide not building

2017-12-02 Thread Doug Turnbull
Thanks.

I don’t mind poking around the ref guide config, but I'm following the
readme and building master. I'm hesitant to change config files as part of
my PR, which probably will break the ref guide build for others :)

Best
-Doug


On Sat, Dec 2, 2017 at 7:37 AM Martin Gainty <mgai...@hotmail.com> wrote:

>
> MG>see below
>
>
> --
> *From:* Doug Turnbull <dturnb...@opensourceconnections.com>
> *Sent:* Friday, December 1, 2017 9:17 PM
> *To:* dev@lucene.apache.org
> *Subject:* Solr Ref Guide not building
>
> Hello!
>
> I'm trying to update the Solr Ref guide with my change for SOLR-11662. I
> believe I've installed the required dependencies and double checked the
> README in the solr-ref-guide. Unfortunately, running ant build-site I
> immediately get this error, seemingly on the first adoc file encountered:
>
>
>  [exec] jekyll 3.5.0 | Error:  No header received back.
>  [exec]   Conversion error: Jekyll::AsciiDoc::Converter encountered an
> error while converting 'about-filters.adoc':
> MG>possible ant build snafu
> MG>configuration file:
> /Users/doug/ws/lucene-solr/solr/build/solr-ref-guide/content/_config.yml
> MG>Deprecation: The 'gems' configuration option has been renamed to
> 'plugins'.
> MG>Please update your config file accordingly.
> MG>check output config.yml for 'gems' and sub in 'plugins'
> MG>also check ant input solr-ref-guide/config.yml.template for 'gems'
> instead of 'plugins'
>
> I feel like I must be doing something stupid (I'll assume user error on my
> part). But if there's anything obvious I'm doing wrong, please let me know
>
> A more complete log can be found here
> https://gist.github.com/softwaredoug/36fe87f0d63403e7be22d5a2ff8af073
>
>
> Thanks for any help
>
> -Doug
> --
> Consultant, OpenSource Connections. Contact info at
> http://o19s.com/about-us/doug-turnbull/; Free/Busy (
> http://bit.ly/dougs_cal)
>
> --
Consultant, OpenSource Connections. Contact info at
http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)


Solr Ref Guide not building

2017-12-01 Thread Doug Turnbull
Hello!

I'm trying to update the Solr Ref guide with my change for SOLR-11662. I
believe I've installed the required dependencies and double checked the
README in the solr-ref-guide. Unfortunately, running ant build-site I
immediately get this error, seemingly on the first adoc file encountered:


 [exec] jekyll 3.5.0 | Error:  No header received back.
 [exec]   Conversion error: Jekyll::AsciiDoc::Converter encountered an
error while converting 'about-filters.adoc':

I feel like I must be doing something stupid (I'll assume user error on my
part). But if there's anything obvious I'm doing wrong, please let me know

A more complete log can be found here
https://gist.github.com/softwaredoug/36fe87f0d63403e7be22d5a2ff8af073

Thanks for any help

-Doug
-- 
Consultant, OpenSource Connections. Contact info at
http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)


[jira] [Updated] (SOLR-11698) Query-time per-field query settings (ie analyzers, autoGeneratePhraseQueries, etc)

2017-11-28 Thread Doug Turnbull (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Turnbull updated SOLR-11698:
-
Summary: Query-time per-field query settings (ie analyzers, 
autoGeneratePhraseQueries, etc)  (was: Query-time fieldType query settings (ie 
analyzers, autoGeneratePhraseQueries, etc))

> Query-time per-field query settings (ie analyzers, autoGeneratePhraseQueries, 
> etc)
> --
>
> Key: SOLR-11698
> URL: https://issues.apache.org/jira/browse/SOLR-11698
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>    Reporter: Doug Turnbull
>
> This is an issue wrt to [this email 
> chain](http://lucene.472066.n3.nabble.com/Multiple-Query-Time-Analyzers-in-Solr-td4364540.html)
>  created to discuss the ability to change the query time analyzer in Solr, 
> with input from [~solrtrey], [~dsmiley], [~steve_rowe], and [~mkhludnev]
> Specifically, we ended up with the following
> _
> it seems like there's some consensus around
> - Creating multiple named analyzers per field
> - Referencing those analyzers by name at query time somehow
> I would advocate for refactoring edismax (or making a new query parser) that 
> would allow you to specify per-field query configuration. Then I would 
> advocate refactoring some of the flags autoGeneratePhraseQueries, etc to this 
> query-time config. Then we could follow suit using the same syntax to specify 
> the analyzer to use at query time.
> Perhaps more generally these configuration items can stay on the fieldType, 
> but a syntax could allow them to be overridden per field at query time?
> Finally, another requirement I would add would be the ability to specify the 
> same field twice in qf, but configured to be queried two different ways. 
> Perhaps a syntax like qf=title:config1 title:config2? Where config1 and 
> config2 modify fieldType query flags? Like 
> fieldConfig.config1.autoGeneratePhraseQueries=false&fieldConfig.config2.analyzer=no_synonyms
> This sort of thing would in my opinion help both enhance the power of Solr, 
> but with a more consistent vision around how field-specific query settings 
> could be organized
> _






[jira] [Created] (SOLR-11698) Query-time fieldType query settings (ie analyzers, autoGeneratePhraseQueries, etc)

2017-11-28 Thread Doug Turnbull (JIRA)
Doug Turnbull created SOLR-11698:


 Summary: Query-time fieldType query settings (ie analyzers, 
autoGeneratePhraseQueries, etc)
 Key: SOLR-11698
 URL: https://issues.apache.org/jira/browse/SOLR-11698
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Doug Turnbull


This is an issue wrt to [this email 
chain](http://lucene.472066.n3.nabble.com/Multiple-Query-Time-Analyzers-in-Solr-td4364540.html)
 created to discuss the ability to change the query time analyzer in Solr, with 
input from [~solrtrey], [~dsmiley], [~steve_rowe], and [~mkhludnev]

Specifically, we ended up with the following
_
it seems like there's some consensus around

- Creating multiple named analyzers per field
- Referencing those analyzers by name at query time somehow

I would advocate for refactoring edismax (or making a new query parser) that 
would allow you to specify per-field query configuration. Then I would advocate 
refactoring some of the flags autoGeneratePhraseQueries, etc to this query-time 
config. Then we could follow suit using the same syntax to specify the analyzer 
to use at query time.

Perhaps more generally these configuration items can stay on the fieldType, but 
a syntax could allow them to be overridden per field at query time?

Finally, another requirement I would add would be the ability to specify the 
same field twice in qf, but configured to be queried two different ways. 
Perhaps a syntax like qf=title:config1 title:config2? Where config1 and 
config2 modify fieldType query flags? Like 
fieldConfig.config1.autoGeneratePhraseQueries=false&fieldConfig.config2.analyzer=no_synonyms

This sort of thing would in my opinion help both enhance the power of Solr, but 
with a more consistent vision around how field-specific query settings could be 
organized
_






Re: [GitHub] lucene-solr pull request #275: SOLR-11662: Configurable query when terms ove...

2017-11-27 Thread Doug Turnbull
Thanks for the correction, you're quite correct. And if we move forward
with more query time config, we can reuse the same syntax.

On Mon, Nov 27, 2017 at 10:27 PM dsmiley <g...@git.apache.org> wrote:

> Github user dsmiley commented on a diff in the pull request:
>
> https://github.com/apache/lucene-solr/pull/275#discussion_r153387514
>
> --- Diff: solr/core/src/java/org/apache/solr/schema/FieldType.java ---
> @@ -905,6 +905,7 @@ protected void checkSupportsDocValues() {
>protected static final String ENABLE_GRAPH_QUERIES =
> "enableGraphQueries";
>private static final String ARGS = "args";
>private static final String POSITION_INCREMENT_GAP =
> "positionIncrementGap";
> +  protected static final String SCORE_OVERLAPS = "scoreOverlaps";
> --- End diff --
>
> I need to correct you on one point: Solr has had a syntax for
> per-field query parameters for a long time.  The syntax is
> `f.fieldName.parameterName`  e.g. `f.title.hl.snippets`   SolrJ's
> SolrParams has convenience methods for this on the implementation side.
> Perhaps you overlooked this because most users only use it in the context
> of faceting parameters, even though it's certainly not unique to faceting
> (as in the example above for highlighting).  I'm not aware of any query
> parser that uses it yet but they certainly could.
>
> Anyway, I suppose even if we agree we'd like some query time
> customizability of this (and other settings), it would still be nice to
> establish a default fallback on the FieldType.
>
>
> ---
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
> --
Consultant, OpenSource Connections. Contact info at
http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)


Re: Multiple Query-Time Analyzers in Solr

2017-11-27 Thread Doug Turnbull
Thanks Steve, Trey, David, and Mikhail

Lots of great ideas, it seems like there's some consensus around

- Creating multiple named analyzers per field
- Referencing those analyzers by name at query time somehow

I would advocate for refactoring edismax (or making a new query parser)
that would allow you to specify per-field query configuration. Then I would
advocate refactoring some of the flags autoGeneratePhraseQueries, etc to
this query-time config. Then we could follow suit using the same syntax to
specify the analyzer to use at query time.

Perhaps more generally these configuration items can stay on the fieldType,
but a syntax could allow them to be overridden per field at query time?

Finally, another requirement I would add would be the ability to specify
the same field twice in qf, but configured to be queried two different
ways. Perhaps a syntax like qf=title:config1 title:config2? Where config1
and config2 modify fieldType query flags? Like
fieldConfig.config1.autoGeneratePhraseQueries=false&fieldConfig.config2.analyzer=no_synonyms

This sort of thing would in my opinion help both enhance the power of Solr,
but with a more consistent vision around how field-specific query settings
could be organized

Best
-Doug

On Fri, Nov 24, 2017 at 3:25 PM Steve Rowe <sar...@gmail.com> wrote:

> Somewhat orthogonal here, but I’ve long thought that it would be useful to
> introduce named analyzers that could be referenced by name from potentially
> multiple field types.
>
> --
> Steve
> www.lucidworks.com
>
> > On Nov 24, 2017, at 10:17 AM, David Smiley <david.w.smi...@gmail.com>
> wrote:
> >
> > Doug,
> >
> > I think it would be wonderful if a FieldType had N analyzer chains
> instead of exactly 3 (index, query, multiTerm).  Each chain could simply
> have a name.  The query parser could be configured to pick a particular
> chain by name.
> >
> > I worked on a search project that had like a half dozen query analyzers,
> which were also machine generated in code on the custom FieldType.  The
> query parser, also custom, could then communicate with the FieldType to get
> the particular analyzer that was appropriate for the use.
> >
> > It's annoying (hard to maintain) to see repeated chains that are
> slightly different.  I've wondered if it would be more maintainable to have
> one chain, with some qualifier on each element to say to which named chains
> it applies to (if not all)?  I dunno; trade-offs, trade-offs.
> >
> > ~ David
> >

Re: Multiple Query-Time Analyzers in Solr

2017-11-23 Thread Doug Turnbull
An alternate solution could be to create a fieldType that was a
"FacadeTextField" that searches a real TextField field with a different
query time analyzer. IE it would not have a physical representation in the
index, but just provide a handle to a "field" that is searched with a
different query time analyzer.

For example, actor_nosyn is really a facade for searching "actor" with a
different analyzer


  


  




...




...
...


This would allow edismax and other query parsers to remain unchanged
searching, ie:

q=action movies&qf=actor actor_nosyn title text&defType=edismax



On Thu, Nov 23, 2017 at 10:50 AM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> I wonder if there's been any thought by the community to refactoring
> fieldTypes to allow multiple query-time analyzers per indexed field?
> Currently, to get different query-time analysis behavior you have to
> duplicate a field. This is unfortunate duplication if, for example, I want
> to search a field with query time synonyms on/off. For higher scale search
> cases, allowing multiple query time analyzers against a single index field
> can be invaluable. It's one reason I created the Match Query Parser (
> https://github.com/o19s/match-query-parser) and a major feature of
> hon-lucene-synonyms (https://github.com/healthonnet/hon-lucene-synonyms )
>
> What I would propose is the ability to place multiple analyzers under a
> field type. For example:
>
> <fieldType name="text_general" class="solr.TextField">
> <analyzer type="query" name="with_synonyms">...</analyzer>
> <analyzer type="query" name="no_synonyms">...</analyzer>
> <analyzer type="index">...</analyzer>
> </fieldType>
>
> Notice how one query-time analyzer is "default" (and including only one
> would make it the default)
>
> This would require allowing query parsers pass the analyzer to use at
> query time. I would propose introduce a syntax for configuring query
> behavior per-field in edismax. Omitting this would continue to use the
> default behavior/analyzer.
>
> For example, one could query title and text as usual:
>
q=action movies&qf=actor title text&defType=edismax
>
> I would propose introducing a syntax whereby qf could refer to a kind of
> psuedo field, configurable with a syntax similar to per-field facet settings
>
> For example, below "actor_nosyn" and "actor_syn" actually search the same
> physical field, but are configured with different analyzers
>
> q=action movies&qf=actor_syn actor_nosyn^10 title
> text&defType=edismax&f.actor_nosyn.field=actor&f.actor_nosyn.analyzer=without_synonyms&f.actor_syn.field=actor&f.actor_syn.analyzer=with_synonyms
>
> Indeed, I would propose extending this syntax to control some of the
> query-specific properties that currently are tied to the fieldType, such as
>
> q=action movies&qf=actor_syn actor_nosyn^10 title
> text&defType=edismax&f.actor_nosyn.field=actor&f.actor_nosyn.analyzer=without_synonyms&f.actor_syn.field=actor&f.actor_syn.analyzer=with_synonyms&f.actor_syn.autoGeneratePhraseQueries=false
>
> I think this could be a pretty powerful syntax, but would require
> refactoring of the field type and edismax (and possibly other query
> parsers) quite a bit
>
> Any thoughts?
>
> Best
> -Doug
> --
> Consultant, OpenSource Connections. Contact info at
> http://o19s.com/about-us/doug-turnbull/; Free/Busy (
> http://bit.ly/dougs_cal)
>
-- 
Consultant, OpenSource Connections. Contact info at
http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)


Multiple Query-Time Analyzers in Solr

2017-11-23 Thread Doug Turnbull
I wonder if there's been any thought by the community to refactoring
fieldTypes to allow multiple query-time analyzers per indexed field?
Currently, to get different query-time analysis behavior you have to
duplicate a field. This is unfortunate duplication if, for example, I want
to search a field with query time synonyms on/off. For higher scale search
cases, allowing multiple query time analyzers against a single index field
can be invaluable. It's one reason I created the Match Query Parser (
https://github.com/o19s/match-query-parser) and a major feature of
hon-lucene-synonyms (https://github.com/healthonnet/hon-lucene-synonyms )

What I would propose is the ability to place multiple analyzers under a
field type. For example:

<fieldType name="text_general" class="solr.TextField">
  <analyzer type="query" name="with_synonyms">...</analyzer>
  <analyzer type="query" name="default">...</analyzer>
  <analyzer type="index">...</analyzer>
</fieldType>

Notice how one query-time analyzer is "default" (and including only one
would make it the default)

This would require allowing query parsers to pass the analyzer to use at query
time. I would propose introducing a syntax for configuring query behavior
per-field in edismax. Omitting this would continue to use the default
behavior/analyzer.

For example, one could query title and text as usual:

q=action movies&qf=actor title text&defType=edismax

I would propose introducing a syntax whereby qf could refer to a kind of
pseudo-field, configurable with a syntax similar to per-field facet settings.

For example, below "actor_nosyn" and "actor_syn" actually search the same
physical field, but are configured with different analyzers

q=action movies&qf=actor_syn actor_nosyn^10 title
text&defType=edismax&f.actor_nosyn.field=actor&f.actor_nosyn.analyzer=without_synonyms&f.actor_syn.field=actor&f.actor_syn.analyzer=with_synonyms

Indeed, I would propose extending this syntax to control some of the
query-specific properties that currently are tied to the fieldType, such as

q=action movies&qf=actor_syn actor_nosyn^10 title
text&defType=edismax&f.actor_nosyn.field=actor&f.actor_nosyn.analyzer=without_synonyms&f.actor_syn.field=actor&f.actor_syn.analyzer=with_synonyms&f.actor_syn.autoGeneratePhraseQueries=false

I think this could be a pretty powerful syntax, but it would require
refactoring the field type and edismax (and possibly other query parsers)
quite a bit.
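For illustration, here is a rough sketch (plain Python, not Solr code; the parameter names come from the examples above, and the grouping behavior is an assumption about how such facet-style `f.<pseudoField>.<prop>` params would be collected) of parsing the proposed per-pseudo-field settings out of a query string:

```python
from urllib.parse import parse_qsl

def pseudo_field_config(query_string):
    """Group f.<pseudoField>.<prop>=<value> params (facet-style) by pseudo-field."""
    config = {}
    for key, value in parse_qsl(query_string):
        # Only keys shaped like f.<pseudoField>.<prop> are per-field settings
        if key.startswith("f.") and key.count(".") >= 2:
            _, pseudo, prop = key.split(".", 2)
            config.setdefault(pseudo, {})[prop] = value
    return config

params = ("q=action+movies&qf=actor_syn+actor_nosyn%5E10+title+text&defType=edismax"
          "&f.actor_nosyn.field=actor&f.actor_nosyn.analyzer=without_synonyms"
          "&f.actor_syn.field=actor&f.actor_syn.analyzer=with_synonyms")
print(pseudo_field_config(params))
# {'actor_nosyn': {'field': 'actor', 'analyzer': 'without_synonyms'},
#  'actor_syn': {'field': 'actor', 'analyzer': 'with_synonyms'}}
```

The query parser could then look up each pseudo-field's `field` and `analyzer` before analyzing the user's query text against that physical field.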

Any thoughts?

Best
-Doug
-- 
Consultant, OpenSource Connections. Contact info at
http://o19s.com/about-us/doug-turnbull/; Free/Busy (http://bit.ly/dougs_cal)


[jira] [Comment Edited] (SOLR-11662) Make overlapping query term scoring configurable per field type

2017-11-22 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263377#comment-16263377
 ] 

Doug Turnbull edited comment on SOLR-11662 at 11/22/17 9:31 PM:


PR updated with the code at the Solr level; the patch can be viewed here: 
https://github.com/apache/lucene-solr/pull/275.patch


was (Author: softwaredoug):
PR updated, patch can be viewed here 
https://github.com/apache/lucene-solr/pull/275.patch

> Make overlapping query term scoring configurable per field type
> ---
>
> Key: SOLR-11662
> URL: https://issues.apache.org/jira/browse/SOLR-11662
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
> Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap 
> positions. Right now the only option is SynonymQuery. This is a fantastic 
> default & improvement on past versions. However, there are use cases where 
> terms overlap positions but don't carry exact synonymy relationships. Often 
> synonyms are actually used to model hypernym/hyponym relationships using 
> synonyms (or other analyzers). So the individual term scores matter, with 
> terms with higher specificity (hyponym) scoring higher than terms with lower 
> specificity (hypernym).
> This patch adds the fieldType setting scoreOverlaps, as in:
> {code:java}
> <fieldType scoreOverlaps="pick_best" class="solr.TextField" 
> positionIncrementGap="100" multiValued="true">
> {code}
> Valid values for scoreOverlaps are:
> *as_one_term*
> Default; covers most synonym use cases. Uses SynonymQuery.
> Treats all terms as if they're exactly equivalent, with document frequency 
> from the underlying terms blended.
> *pick_best*
> For a given document, score using the best-scoring synonym (i.e. dismax over 
> the generated terms).
> Useful when synonyms are not exactly equivalent but instead model 
> hypernym/hyponym relationships, where term scores should reflect specificity.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the dismax (text:tabby | text:cat | text:animal)
> *as_distinct_terms*
> (The pre-6.0 behavior.)
> A compromise between pick_best and as_one_term.
> Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
> should stack: the more tabby, cat, or animal in a document, the better, with 
> a bias towards the term with highest specificity.
> Terms are turned into a boolean OR query, with document frequencies not 
> blended.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the boolean query (text:tabby text:cat text:animal)
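To make the modes above concrete, a small illustrative sketch (plain Python, not Solr code; the per-term scores are invented, and as_one_term is omitted because SynonymQuery scores the expansion as a single pseudo-term rather than combining per-term scores):

```python
def combine_overlap_scores(term_scores, mode):
    """Combine scores of terms occupying the same position, per scoreOverlaps mode."""
    if mode == "pick_best":          # dismax: the best single synonym wins
        return max(term_scores.values())
    if mode == "as_distinct_terms":  # boolean OR: scores stack
        return sum(term_scores.values())
    raise ValueError(f"unknown mode: {mode}")

# Invented scores for the expansion tabby => tabby, cat, animal in one document
scores = {"tabby": 3.0, "cat": 1.5, "animal": 0.5}
print(combine_overlap_scores(scores, "pick_best"))          # 3.0
print(combine_overlap_scores(scores, "as_distinct_terms"))  # 5.0
```

Note how as_distinct_terms rewards a document matching several of the expanded terms, while pick_best only credits the strongest one.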



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11662) Make overlapping query term scoring configurable per field type

2017-11-22 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263377#comment-16263377
 ] 

Doug Turnbull commented on SOLR-11662:
--

PR updated, patch can be viewed here 
https://github.com/apache/lucene-solr/pull/275.patch

> Make overlapping query term scoring configurable per field type
> ---
>
> Key: SOLR-11662
> URL: https://issues.apache.org/jira/browse/SOLR-11662
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
> Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap 
> positions. Right now the only option is SynonymQuery. This is a fantastic 
> default & improvement on past versions. However, there are use cases where 
> terms overlap positions but don't carry exact synonymy relationships. Often 
> synonyms are actually used to model hypernym/hyponym relationships using 
> synonyms (or other analyzers). So the individual term scores matter, with 
> terms with higher specificity (hyponym) scoring higher than terms with lower 
> specificity (hypernym).
> This patch adds the fieldType setting scoreOverlaps, as in:
> {code:java}
> <fieldType scoreOverlaps="pick_best" class="solr.TextField" 
> positionIncrementGap="100" multiValued="true">
> {code}
> Valid values for scoreOverlaps are:
> *as_one_term*
> Default; covers most synonym use cases. Uses SynonymQuery.
> Treats all terms as if they're exactly equivalent, with document frequency 
> from the underlying terms blended.
> *pick_best*
> For a given document, score using the best-scoring synonym (i.e. dismax over 
> the generated terms).
> Useful when synonyms are not exactly equivalent but instead model 
> hypernym/hyponym relationships, where term scores should reflect specificity.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the dismax (text:tabby | text:cat | text:animal)
> *as_distinct_terms*
> (The pre-6.0 behavior.)
> A compromise between pick_best and as_one_term.
> Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
> should stack: the more tabby, cat, or animal in a document, the better, with 
> a bias towards the term with highest specificity.
> Terms are turned into a boolean OR query, with document frequencies not 
> blended.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the boolean query (text:tabby text:cat text:animal)






[jira] [Commented] (SOLR-11662) Make overlapping query term scoring configurable per field type

2017-11-22 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262567#comment-16262567
 ] 

Doug Turnbull commented on SOLR-11662:
--

Great! And that would actually let me submit an ES patch in parallel... I'll 
update my PR/patch

> Make overlapping query term scoring configurable per field type
> ---
>
> Key: SOLR-11662
> URL: https://issues.apache.org/jira/browse/SOLR-11662
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
> Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap 
> positions. Right now the only option is SynonymQuery. This is a fantastic 
> default & improvement on past versions. However, there are use cases where 
> terms overlap positions but don't carry exact synonymy relationships. Often 
> synonyms are actually used to model hypernym/hyponym relationships using 
> synonyms (or other analyzers). So the individual term scores matter, with 
> terms with higher specificity (hyponym) scoring higher than terms with lower 
> specificity (hypernym).
> This patch adds the fieldType setting scoreOverlaps, as in:
> {code:java}
> <fieldType scoreOverlaps="pick_best" class="solr.TextField" 
> positionIncrementGap="100" multiValued="true">
> {code}
> Valid values for scoreOverlaps are:
> *as_one_term*
> Default; covers most synonym use cases. Uses SynonymQuery.
> Treats all terms as if they're exactly equivalent, with document frequency 
> from the underlying terms blended.
> *pick_best*
> For a given document, score using the best-scoring synonym (i.e. dismax over 
> the generated terms).
> Useful when synonyms are not exactly equivalent but instead model 
> hypernym/hyponym relationships, where term scores should reflect specificity.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the dismax (text:tabby | text:cat | text:animal)
> *as_distinct_terms*
> (The pre-6.0 behavior.)
> A compromise between pick_best and as_one_term.
> Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
> should stack: the more tabby, cat, or animal in a document, the better, with 
> a bias towards the term with highest specificity.
> Terms are turned into a boolean OR query, with document frequencies not 
> blended.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the boolean query (text:tabby text:cat text:animal)






[jira] [Commented] (SOLR-11662) Make overlapping query term scoring configurable per field type

2017-11-22 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262468#comment-16262468
 ] 

Doug Turnbull commented on SOLR-11662:
--

Thanks Adrien! Yes, it could be moved to SolrQueryParser. This would narrow the 
scope to just Solr, however. I would like to see this capability in 
Elasticsearch as well. Though that could be handled differently.

> Make overlapping query term scoring configurable per field type
> ---
>
> Key: SOLR-11662
> URL: https://issues.apache.org/jira/browse/SOLR-11662
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
> Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap 
> positions. Right now the only option is SynonymQuery. This is a fantastic 
> default & improvement on past versions. However, there are use cases where 
> terms overlap positions but don't carry exact synonymy relationships. Often 
> synonyms are actually used to model hypernym/hyponym relationships using 
> synonyms (or other analyzers). So the individual term scores matter, with 
> terms with higher specificity (hyponym) scoring higher than terms with lower 
> specificity (hypernym).
> This patch adds the fieldType setting scoreOverlaps, as in:
> {code:java}
> <fieldType scoreOverlaps="pick_best" class="solr.TextField" 
> positionIncrementGap="100" multiValued="true">
> {code}
> Valid values for scoreOverlaps are:
> *as_one_term*
> Default; covers most synonym use cases. Uses SynonymQuery.
> Treats all terms as if they're exactly equivalent, with document frequency 
> from the underlying terms blended.
> *pick_best*
> For a given document, score using the best-scoring synonym (i.e. dismax over 
> the generated terms).
> Useful when synonyms are not exactly equivalent but instead model 
> hypernym/hyponym relationships, where term scores should reflect specificity.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the dismax (text:tabby | text:cat | text:animal)
> *as_distinct_terms*
> (The pre-6.0 behavior.)
> A compromise between pick_best and as_one_term.
> Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
> should stack: the more tabby, cat, or animal in a document, the better, with 
> a bias towards the term with highest specificity.
> Terms are turned into a boolean OR query, with document frequencies not 
> blended.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the boolean query (text:tabby text:cat text:animal)






[jira] [Updated] (SOLR-11662) Make overlapping query term scoring configurable per field type

2017-11-21 Thread Doug Turnbull (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Turnbull updated SOLR-11662:
-
Description: 
This patch customizes the query-time behavior when query terms overlap 
positions. Right now the only option is SynonymQuery. This is a fantastic 
default & improvement on past versions. However, there are use cases where 
terms overlap positions but don't carry exact synonymy relationships. Often 
synonyms are actually used to model hypernym/hyponym relationships using 
synonyms (or other analyzers). So the individual term scores matter, with terms 
with higher specificity (hyponym) scoring higher than terms with lower 
specificity (hypernym).

This patch adds the fieldType setting scoreOverlaps, as in:

{code:java}
<fieldType scoreOverlaps="pick_best" class="solr.TextField" 
positionIncrementGap="100" multiValued="true">
{code}

Valid values for scoreOverlaps are:

*as_one_term*
Default; covers most synonym use cases. Uses SynonymQuery.
Treats all terms as if they're exactly equivalent, with document frequency from 
the underlying terms blended.

*pick_best*
For a given document, score using the best-scoring synonym (i.e. dismax over 
the generated terms).
Useful when synonyms are not exactly equivalent but instead model 
hypernym/hyponym relationships, where term scores should reflect specificity.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the dismax (text:tabby | text:cat | text:animal)

*as_distinct_terms*
(The pre-6.0 behavior.)
A compromise between pick_best and as_one_term.
Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
should stack: the more tabby, cat, or animal in a document, the better, with a 
bias towards the term with highest specificity.
Terms are turned into a boolean OR query, with document frequencies not blended.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the boolean query (text:tabby text:cat text:animal)


  was:
This patch customizes the query-time behavior when query terms overlap 
positions. Right now the only option is SynonymQuery. This is a fantastic 
default & improvement on past versions. However, there are use cases where 
terms overlap positions but don't carry exact synonymy relationships. Often 
synonyms are actually used to model hypernym/hyponym relationships using 
synonyms (or other analyzers). So the individual term scores matter, with terms 
with higher specificity (hyponym) scoring higher than terms with lower 
specificity (hypernym).

This patch adds the fieldType setting scoreOverlaps, as in:

{code:java}
<fieldType scoreOverlaps="pick_best" class="solr.TextField" 
positionIncrementGap="100" multiValued="true">
{code}

Valid values for scoreOverlaps are:

*as_one_term*
Default; covers most synonym use cases. Uses SynonymQuery.
Treats all terms as if they're exactly equivalent, with document frequency from 
the underlying terms blended.

*pick_best*
For a given document, score using the best-scoring synonym (i.e. dismax over 
the generated terms).
Useful when synonyms are not exactly equivalent but instead model 
hypernym/hyponym relationships, where term scores should reflect specificity.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the dismax (text:tabby | text:cat | text:animal)

*as_distinct_terms*
(The pre-6.0 behavior.)
A compromise between pick_best and as_one_term.
Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
should stack: the more tabby, cat, or animal in a document, the better, with a 
bias towards the term with highest specificity.
Terms are turned into a boolean OR query, with document frequencies not blended.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the boolean query (text:tabby text:cat text:animal)



> Make overlapping query term scoring configurable per field type
> ---
>
> Key: SOLR-11662
> URL: https://issues.apache.org/jira/browse/SOLR-11662
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
> Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap 
> positions. Right now the only option is SynonymQuery. This is a fantastic 
> default & improvement on past versions. However, there are use cases where 
> terms overlap positions but don't carry exact synonymy relationships. Often 
> synonyms are actually used to model hypernym/hyponym relationships using 
> synonyms (or other analyzers). So the individual term scores matter, with 
> terms with higher specificity (hyponym) scoring higher than terms with lower 
> specificity (hypernym).
> This patch adds the fieldType setting scoreOv

[jira] [Commented] (SOLR-11662) Make overlapping query term scoring configurable per field type

2017-11-21 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261549#comment-16261549
 ] 

Doug Turnbull commented on SOLR-11662:
--

Associated pull request https://github.com/apache/lucene-solr/pull/275/files
And Patch 
https://patch-diff.githubusercontent.com/raw/apache/lucene-solr/pull/275.patch

> Make overlapping query term scoring configurable per field type
> ---
>
> Key: SOLR-11662
> URL: https://issues.apache.org/jira/browse/SOLR-11662
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
> Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap 
> positions. Right now the only option is SynonymQuery. This is a fantastic 
> default & improvement on past versions. However, there are use cases where 
> terms overlap positions but don't carry exact synonymy relationships. Often 
> synonyms are actually used to model hypernym/hyponym relationships using 
> synonyms (or other analyzers). So the individual term scores matter, with 
> terms with higher specificity (hyponym) scoring higher than terms with lower 
> specificity (hypernym).
> This patch adds the fieldType setting scoreOverlaps, as in:
> {code:java}
> <fieldType scoreOverlaps="pick_best" class="solr.TextField" 
> positionIncrementGap="100" multiValued="true">
> {code}
> Valid values for scoreOverlaps are:
> *as_one_term*
> Default; covers most synonym use cases. Uses SynonymQuery.
> Treats all terms as if they're exactly equivalent, with document frequency 
> from the underlying terms blended.
> *pick_best*
> For a given document, score using the best-scoring synonym (i.e. dismax over 
> the generated terms).
> Useful when synonyms are not exactly equivalent but instead model 
> hypernym/hyponym relationships, where term scores should reflect specificity.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the dismax (text:tabby | text:cat | text:animal)
> *as_distinct_terms*
> (The pre-6.0 behavior.)
> A compromise between pick_best and as_one_term.
> Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
> should stack: the more tabby, cat, or animal in a document, the better, with 
> a bias towards the term with highest specificity.
> Terms are turned into a boolean OR query, with document frequencies not 
> blended.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the boolean query (text:tabby text:cat text:animal)






[jira] [Updated] (SOLR-11662) Make overlapping query term scoring configurable per field type

2017-11-21 Thread Doug Turnbull (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Turnbull updated SOLR-11662:
-
Summary: Make overlapping query term scoring configurable per field type  
(was: More than SynonymQuery: Let overlapping query terms model 
hypernym/hyponym relationships)

> Make overlapping query term scoring configurable per field type
> ---
>
> Key: SOLR-11662
> URL: https://issues.apache.org/jira/browse/SOLR-11662
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
> Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap 
> positions. Right now the only option is SynonymQuery. This is a fantastic 
> default & improvement on past versions. However, there are use cases where 
> terms overlap positions but don't carry exact synonymy relationships. Often 
> synonyms are actually used to model hypernym/hyponym relationships using 
> synonyms (or other analyzers). So the individual term scores matter, with 
> terms with higher specificity (hyponym) scoring higher than terms with lower 
> specificity (hypernym).
> This patch adds the fieldType setting scoreOverlaps, as in:
> {code:java}
> <fieldType scoreOverlaps="pick_best" class="solr.TextField" 
> positionIncrementGap="100" multiValued="true">
> {code}
> Valid values for scoreOverlaps are:
> *as_one_term*
> Default; covers most synonym use cases. Uses SynonymQuery.
> Treats all terms as if they're exactly equivalent, with document frequency 
> from the underlying terms blended.
> *pick_best*
> For a given document, score using the best-scoring synonym (i.e. dismax over 
> the generated terms).
> Useful when synonyms are not exactly equivalent but instead model 
> hypernym/hyponym relationships, where term scores should reflect specificity.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the dismax (text:tabby | text:cat | text:animal)
> *as_distinct_terms*
> (The pre-6.0 behavior.)
> A compromise between pick_best and as_one_term.
> Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
> should stack: the more tabby, cat, or animal in a document, the better, with 
> a bias towards the term with highest specificity.
> Terms are turned into a boolean OR query, with document frequencies not 
> blended.
> I.e. this query-time expansion
> tabby => tabby, cat, animal
> searching "text" generates the boolean query (text:tabby text:cat text:animal)






[jira] [Updated] (SOLR-11662) Make overlapping query term scoring configurable per field type

2017-11-21 Thread Doug Turnbull (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Turnbull updated SOLR-11662:
-
Description: 
This patch customizes the query-time behavior when query terms overlap 
positions. Right now the only option is SynonymQuery. This is a fantastic 
default & improvement on past versions. However, there are use cases where 
terms overlap positions but don't carry exact synonymy relationships. Often 
synonyms are actually used to model hypernym/hyponym relationships using 
synonyms (or other analyzers). So the individual term scores matter, with terms 
with higher specificity (hyponym) scoring higher than terms with lower 
specificity (hypernym).

This patch adds the fieldType setting scoreOverlaps, as in:

{code:java}
<fieldType scoreOverlaps="pick_best" class="solr.TextField" 
positionIncrementGap="100" multiValued="true">
{code}

Valid values for scoreOverlaps are:

*as_one_term*
Default; covers most synonym use cases. Uses SynonymQuery.
Treats all terms as if they're exactly equivalent, with document frequency from 
the underlying terms blended.

*pick_best*
For a given document, score using the best-scoring synonym (i.e. dismax over 
the generated terms).
Useful when synonyms are not exactly equivalent but instead model 
hypernym/hyponym relationships, where term scores should reflect specificity.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the dismax (text:tabby | text:cat | text:animal)

*as_distinct_terms*
(The pre-6.0 behavior.)
A compromise between pick_best and as_one_term.
Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
should stack: the more tabby, cat, or animal in a document, the better, with a 
bias towards the term with highest specificity.
Terms are turned into a boolean OR query, with document frequencies not blended.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the boolean query (text:tabby text:cat text:animal)


  was:
This patch customizes the query-time behavior when query terms overlap 
positions. Right now the only option is SynonymQuery. This is a fantastic 
default & improvement on past versions. However, there are use cases where 
terms overlap positions but don't carry exact synonymy relationships. Often 
synonyms are actually used to model hypernym/hyponym relationships using 
synonyms (or other analyzers). So the individual term scores matter, with terms 
with higher specificity (hyponym) scoring higher than terms with lower 
specificity (hypernym).

This patch adds the fieldType setting scoreOverlaps, as in:

{code:java}
<fieldType scoreOverlaps="pick_best" class="solr.TextField" 
positionIncrementGap="100" multiValued="true">
{code}

Valid values for scoreOverlaps are:

*as_one_term*
Default; covers most synonym use cases. Uses SynonymQuery.
Treats all terms as if they're exactly equivalent, with document frequency from 
the underlying terms blended.

*pick_best*
For a given document, score using the best-scoring synonym (i.e. dismax over 
the generated terms).
Useful when synonyms are not exactly equivalent but instead model 
hypernym/hyponym relationships, where term scores should reflect specificity.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the dismax (text:tabby | text:cat | text:animal)

*as_distinct_terms*
(The pre-6.0 behavior.)
A compromise between pick_best and as_one_term.
Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
should stack: the more tabby, cat, or animal in a document, the better, with a 
bias towards the term with highest specificity.
Terms are turned into a boolean OR query, with document frequencies not blended.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the boolean query (text:tabby text:cat text:animal)



> Make overlapping query term scoring configurable per field type
> ---
>
> Key: SOLR-11662
> URL: https://issues.apache.org/jira/browse/SOLR-11662
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Doug Turnbull
> Fix For: 7.2, master (8.0)
>
>
> This patch customizes the query-time behavior when query terms overlap 
> positions. Right now the only option is SynonymQuery. This is a fantastic 
> default & improvement on past versions. However, there are use cases where 
> terms overlap positions but don't carry exact synonymy relationships. Often 
> synonyms are actually used to model hypernym/hyponym relationships using 
> synonyms (or other analyzers). So the individual term scores matter, with 
> terms with higher specificity (hyponym) scoring higher than terms with lower 
> specificity (hypernym).
> This patch adds the fieldType setting scoreOv

[jira] [Created] (SOLR-11662) More than SynonymQuery: Let overlapping query terms model hypernym/hyponym relationships

2017-11-21 Thread Doug Turnbull (JIRA)
Doug Turnbull created SOLR-11662:


 Summary: More than SynonymQuery: Let overlapping query terms model 
hypernym/hyponym relationships
 Key: SOLR-11662
 URL: https://issues.apache.org/jira/browse/SOLR-11662
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Doug Turnbull
 Fix For: 7.2, master (8.0)


This patch customizes the query-time behavior when query terms overlap 
positions. Right now the only option is SynonymQuery. This is a fantastic 
default & improvement on past versions. However, there are use cases where 
terms overlap positions but don't carry exact synonymy relationships. Often 
synonyms are actually used to model hypernym/hyponym relationships using 
synonyms (or other analyzers). So the individual term scores matter, with terms 
with higher specificity (hyponym) scoring higher than terms with lower 
specificity (hypernym).

This patch adds the fieldType setting scoreOverlaps, as in:

{code:java}
<fieldType scoreOverlaps="pick_best" class="solr.TextField" 
positionIncrementGap="100" multiValued="true">
{code}

Valid values for scoreOverlaps are:

*as_one_term*
Default; covers most synonym use cases. Uses SynonymQuery.
Treats all terms as if they're exactly equivalent, with document frequency from 
the underlying terms blended.

*pick_best*
For a given document, score using the best-scoring synonym (i.e. dismax over 
the generated terms).
Useful when synonyms are not exactly equivalent but instead model 
hypernym/hyponym relationships, where term scores should reflect specificity.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the dismax (text:tabby | text:cat | text:animal)

*as_distinct_terms*
(The pre-6.0 behavior.)
A compromise between pick_best and as_one_term.
Appropriate when synonyms reflect a hypernym/hyponym relationship but scores 
should stack: the more tabby, cat, or animal in a document, the better, with a 
bias towards the term with highest specificity.
Terms are turned into a boolean OR query, with document frequencies not blended.
I.e. this query-time expansion

tabby => tabby, cat, animal

searching "text" generates the boolean query (text:tabby text:cat text:animal)
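A small numeric sketch of why the modes matter for hypernym/hyponym expansions (plain Python; the document frequencies are invented, and the simplified BM25-style idf and max-df blending are illustrative assumptions, not Lucene's exact formulas):

```python
import math

def idf(df, n_docs):
    # Simplified BM25-style inverse document frequency
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))

N = 100_000
dfs = {"tabby": 50, "cat": 5_000, "animal": 20_000}  # invented frequencies

# as_one_term: every synonym shares one blended statistic (here blended as
# the max df, a simplification), so "tabby" and "animal" weigh the same
blended_idf = idf(max(dfs.values()), N)

# pick_best / as_distinct_terms: each term keeps its own idf, so the rare,
# specific hyponym ("tabby") outweighs the broad hypernym ("animal")
per_term_idf = {t: idf(df, N) for t, df in dfs.items()}

print(blended_idf)
print(per_term_idf)
```

This is the core of the proposal: when the expansion models specificity rather than true synonymy, keeping per-term statistics lets the hyponym score higher.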







Re: Looking for development docs.

2017-04-26 Thread Doug Turnbull
Something I found helpful was to go back to very early Lucene versions.
That lets you see the essential functionality in relatively
straightforward Java code. You can get a sense for how Lucene is
structured. Functionality has been built around this since. The Java has
been battle tested, refactored, and optimized. But those core bits were
really helpful for me to see what Lucene specifically did.

https://sourceforge.net/projects/lucene/

That plus Lucene in Action
On Wed, Apr 26, 2017 at 7:16 PM Erick Erickson 
wrote:

> Solr/Lucene is big. Really big. I'd think seriously about taking
> something you're interested in/know about, finding a JIRA that you'd
> like to work on and diving in. Plus there aren't very many
> architecture docs.
>
> Your characterization of the realms of responsibility is pretty accurate.
>
> Have you seen: https://wiki.apache.org/solr/HowToContribute?
>
> A somewhat painful but "safe" way to get your feet wet is to look at
> the coverage reports on jenkins and see what code is not tested in the
> junit tests and...write a test. At least I think the coverage reports
> are still there.
>
> Best,
> Erick
>
> On Wed, Apr 26, 2017 at 3:12 PM, David Lee 
> wrote:
> > I'd like to have a better understanding of how much of Solr is unique to
> it
> > versus directly extending Lucene.
> >
> > For example, I assume that sharding, replication, etc. is implemented in
> > Solr where-as indexing, querying, etc. would be implemented by Lucene.
> >
> > I'm hoping to learn enough to be able to contribute at some point.
> >
> > Thanks,
> >
> > David
> >
> >
> >


Re: Change Default Response Format (wt) to JSON in Solr 7.0?

2017-04-14 Thread Doug Turnbull
Sounds great. I agree!

I can imagine there might be really old client libraries/integrations that
assume XML without sending a wt, but I think it's ok to break those sorts
of things in a major release. And those folks can learn to send wt=xml

-Doug

On Fri, Apr 14, 2017 at 2:53 PM Trey Grainger  wrote:

> Just wanted to throw this out there for discussion. Solr's default query
> response format is still XML, despite the fact that Solr has supported the
> JSON response format for over a decade, developer mindshare has clearly
> shifted toward JSON over the years, and most modern/competing systems also
> use JSON format now by default.
>
> In fact, Solr's admin UI even explicitly adds wt=json to the request (by
> default in the UI) to override the default of wt=xml, so Solr's Admin UI
> effectively has a different default than the API.
>
> We have now introduced things like the JSON faceting API, and the new more
> modern /V2 apis assume JSON for the areas of Solr they cover, so clearly
> we're moving in the direction of JSON anyway.
>
> I'd like propose that we switch the default response writer to JSON
> (wt=json) instead of XML for Solr 7.0, as this seems to me like the right
> direction and a good time to make this change with the next major version.
>
> Before I create a JIRA and submit a patch, though, I wanted to check here
> make sure there were no strong objections to changing the default.
>
> -Trey Grainger
>


Re: Search Engine question

2017-03-21 Thread Doug Turnbull
Definitely start with Solr unless you have some specialized use case.
Lucene skills can come up in a Solr context (ie if you wanted to write
plugins)

I would also recommend:
- Solr in Action
- Lucene in Action (out of date, but many concepts still valid)
- Apache Solr Ref Guide (
https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide
)
- Solr Start (http://www.solr-start.com/)
- Relevant Search (I wrote this book, email me directly for a discount code)

Slightly shameless plug: I give anyone a free hour of my time for
consulting, so hit me up and I'd be happy to walk you through some
basics/ideas on getting started
http://opensourceconnections.com/blog/2016/08/01/search-for-lunch/

Best
-Doug

On Tue, Mar 21, 2017 at 8:04 PM Bina N Shah  wrote:

> Good Afternoon,
>
>
>
> My name is Bina Shah and I work for University of New Mexico Hospitals,
> non-profit organization.
>
>
>
> We are considering ways to implement Search Engine for our static intranet
> pages. In the second phase, implement search engine for our dynamic web
> applications. I noticed on your web site, there are two different Search
> projects:  Apache Lucene Core and Apache Solr.  I need your guidance as to
> where to start, search engine demo video, and which would be the
> appropriate Search project?
>
>
>
> Thank you in advance for your time and looking forward to hearing from you.
>
>
>
> Thank you,
>
>
>
> Bina Shah
>
> Web Analyst
>
> UNM Hospitals
>
> bns...@salud.unm.edu
>
> (505) 925-4795
>
>
>
>
>


Re: Developer's Guide

2017-03-03 Thread Doug Turnbull
As an aside, I'm pretty sure that if anyone wanted to write a new edition of
Lucene in Action, and they're masochistic enough to write a book for a top-tier
tech book publisher, I'd be happy to introduce them to someone at
Manning :)

And Lucene in Action is a very good read; it will help you get the big ideas,
even if the examples are outdated

-Doug

On Fri, Mar 3, 2017 at 11:23 AM David Smiley 
wrote:

> Hi,
> There is no developer's guide.  There are Javadocs, and there's an
> outdated book (the concepts are still good; it's the details
> that have changed).
> ~ David
>
> On Fri, Mar 3, 2017 at 11:16 AM Nilesh Kamani 
> wrote:
>
> Could anybody please help me with this ?
>
> On Wed, Mar 1, 2017 at 9:22 AM, Nilesh Kamani 
> wrote:
>
> Hello All,
>
> Are there any Developer's Guide to understand various packages and classes
> and their role ?
> I am looking to modify boolean AND search to meet some specific criteria.
>
> Thanks,
> Nilesh Kamani
>
>
>
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
>


[jira] [Commented] (SOLR-9418) Probabilistic-Query-Parser RequestHandler

2016-10-20 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15591921#comment-15591921
 ] 

Doug Turnbull commented on SOLR-9418:
-

Looking at your patch (I'm not a committer just curious about the patch). A few 
things jump out in a shallow reading that would probably need to change for 
this to be accepted:

- Field names and thresholds likely need to be configurable, as most folks 
won't necessarily have a field named exactly "title" or "content." 
- Can this be a qparser plugin instead of a request handler? It's likely I'd 
want to use it alongside other qparsers and SearchComponents (like highlighting 
or facets).
- Can you provide some documentation on how the thresholds work/can be 
configured?

> Probabilistic-Query-Parser RequestHandler
> -
>
> Key: SOLR-9418
> URL: https://issues.apache.org/jira/browse/SOLR-9418
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Akash Mehta
> Attachments: SOLR-9418.zip
>
>
> The main aim of this requestHandler is to get the best parsing for a given 
> query. This basically means recognizing different phrases within the query. 
> We need some kind of training data to generate these phrases. The way this 
> project works is:
> 1.)Generate all possible parsings for the given query
> 2.)For each possible parsing, a naive-bayes like score is calculated.
> 3.)The main scoring is done by going through all the documents in the 
> training set and finding the probability of a bunch of words occurring together 
> as a phrase as compared to them occurring randomly in the same document. Then 
> the score is normalized. Some higher importance is given to the title field 
> as compared to content field which is configurable.
> 4.)Finally after scoring each of the possible parsing, the one with the 
> highest score is returned.
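Step 1 of the quoted algorithm (generating all possible parsings) amounts to enumerating every segmentation of the query into contiguous phrases; a minimal sketch under that assumption (names illustrative, not from the attached patch):

```java
import java.util.ArrayList;
import java.util.List;

public class QueryParsings {
    // Enumerate every way to split the query terms into contiguous phrases.
    // Each parsing would then get a naive-bayes-like score (not shown here)
    // and the highest-scoring parsing wins.
    static List<List<String>> parsings(List<String> terms) {
        List<List<String>> results = new ArrayList<>();
        if (terms.isEmpty()) {
            results.add(new ArrayList<>());
            return results;
        }
        for (int split = 1; split <= terms.size(); split++) {
            String phrase = String.join(" ", terms.subList(0, split));
            for (List<String> rest : parsings(terms.subList(split, terms.size()))) {
                List<String> parsing = new ArrayList<>();
                parsing.add(phrase);
                parsing.addAll(rest);
                results.add(parsing);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // "new york pizza" yields 2^(3-1) = 4 parsings:
        // [new, york, pizza], [new, york pizza], [new york, pizza], [new york pizza]
        System.out.println(parsings(List.of("new", "york", "pizza")).size()); // 4
    }
}
```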






[jira] [Commented] (LUCENE-7436) MinHashFilter has package-local constructor and constants

2016-09-06 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15467891#comment-15467891
 ] 

Doug Turnbull commented on LUCENE-7436:
---

Fix is here https://github.com/apache/lucene-solr/pull/78

> MinHashFilter has package-local constructor and constants
> -
>
> Key: LUCENE-7436
> URL: https://issues.apache.org/jira/browse/LUCENE-7436
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 6.2
>Reporter: Doug Turnbull
>Priority: Minor
>
> Trying to use the MinHashFilter outside of Lucene/Solr. Was it intentional 
> that the constructor and useful defaults are package-private? Seems like an 
> oversight to me, correct me if I'm wrong.






[jira] [Created] (LUCENE-7436) MinHashFilter has package-local constructor and constants

2016-09-06 Thread Doug Turnbull (JIRA)
Doug Turnbull created LUCENE-7436:
-

 Summary: MinHashFilter has package-local constructor and constants
 Key: LUCENE-7436
 URL: https://issues.apache.org/jira/browse/LUCENE-7436
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 6.2
Reporter: Doug Turnbull
Priority: Minor


Trying to use the MinHashFilter outside of Lucene/Solr. Was it intentional that 
the constructor and useful defaults are package-private? Seems like an 
oversight to me, correct me if I'm wrong.






Re: Word stop list in examples (was Re: Default stop word list)

2016-09-04 Thread Doug Turnbull
I see it more as a performance tweak than a relevance thing. Matches on
stopwords introduce the potential for many more documents to be scored.

Large collections should usually have a high min-should-match, so queries will
more than likely contain at least one or two non-stopwords that dramatically
limit the docs that will be scored. And since large collections are where
people have stopword perf problems, this tends to obviate the performance
gains of removing stopwords.

On Sun, Sep 4, 2016 at 12:08 PM Erick Erickson 
wrote:

> Wouldn't most frequent term serve?
>
> On Sep 4, 2016 08:52, "Alexandre Rafalovitch"  wrote:
>
>> On 4 September 2016 at 22:23, Walter Underwood 
>> wrote:
>> > If you do want to use stopwords, I’d index without them, then look at
>> the
>> > words with the lowest IDF to make the list.
>>
>> That's an interesting approach. Is there an easy way to do that (in Solr?)
>>
>> Regards,
>>Alex.
>>
>> 
>> Newsletter and resources for Solr beginners and intermediates:
>> http://www.solr-start.com/
>>
>>
>>


Re: Lucene or Apache Solr : Project Decision making

2016-08-30 Thread Doug Turnbull
Hi Archit, I would make a strong argument for using Solr unless you have
some exotic requirements.

- Solr has distributed indexing and search built in; building your own
distributed system is non-trivial, just ask Mark Miller :)
- Solr comes prebaked with an HTTP API for non search experts to interact
with.
- For hiring, it's more likely you'll find a Solr expert than a Lucene
expert
- Custom capabilities can be handled by Solr plugins that specialize bits
and pieces of Solr to your needs
- You can pretty easily proxy Solr for security, with anything from a dumb
nginx proxy to a tad bit of custom code

I might consider using just Lucene if the consumers of my library don't
realize there's "search" under the hood
- I really just want a Java library that does search-like operations under
the hood, but the consumers of my code don't care about search.
- I'm doing something data-sciency with Lucene, my problem doesn't resemble
search, and I want direct control (ie classification, etc).

(note Elasticsearch would have similar capabilities and pros/cons vs Solr,
but the Solr vs ES is a whole 'nother conversation and I don't want to
hijack your thread)

-Doug



On Tue, Aug 30, 2016 at 1:24 PM Alexandre Rafalovitch 
wrote:

> SolrCloud uses ZooKeeper. As to the rest, Solr is open source. It may be
> more efficient stripping out whatever you don't want than reinventing it on
> top of Lucene again.
>
> Regards,
> Alex
>
> On 31 Aug 2016 12:17 AM, "archit mehta"  wrote:
>
>> Hi,
>>
>> We need to take decision whether to go for lucene or solr. There are few
>> points which I would like to mention.
>>
>> 1. If we use lucene we do not have to worry about security as it is
>> already taken care of, but we need to build our own distributed indexer and
>> searcher; if we use solr then we don't have to worry about the distributed
>> indexer and searcher, but as it is another process we have to put some
>> security controls in place.
>>
>> In our case getting permission for solr is a bit difficult; lucene is
>> already in production (without the distribution stuff)
>>
>> 2. Does solr use kafka or zookeeper or other third party libraries? Can I
>> get a list from somewhere?
>> Server is heavily loaded, new process and running kafka/zookeeper is also
>> an overhead for us.
>> With current implementation we removed kafka and wrote some of our own
>> code.
>>
>> How easy or difficult is it to build a distributed indexer and searcher
>> with core lucene?
>>
>> Kindly share your views based on the point I have mentioned here. In case
>> any more clarification require write me back.
>>
>>
>> Regards,
>> Archit
>>
>>


Re: Proposal to Move Solr Ref Guide off Confluence

2016-08-18 Thread Doug Turnbull
Is there anyway to maintain inbound links to confluence pages with the new
system? I'm just thinking about all the user group questions, stackoverflow
Qs, and the like that link to cwiki pages.

Is it possible to setup the right redirects for cwiki pages into the new
system?

Doug
On Thu, Aug 18, 2016 at 7:30 PM Chris Hostetter 
wrote:

>
> : First, I'm not about to second-guess this. I wouldn't like to lose the
> : ability to download a full doc to search offline, but it looks like
> : this solution allows that since there is a PDF version after all.
>
> I also like being able to officially "release" the guide, and doing so via
> PDF will still be possible.
>
> But the other nice thing is that this will make it easy to
> maintain "branches" of the ref guide in git, and publish those with
> releases as well -- so you can edit the docs on master, and backport the
> docs to the branch_6x at the same you backport the feature, and we can
> publish HTML versions of the guide right along side the javadoc docs for
> each version of solr.
>
> : As you know, every time I try to edit the CWiki I come whimpering to
> : you or Hoss. Sounds like this solution will reduce the volume of my
> : whimpering, which is a good thing. I so loathe Confluence that I find
>
> Ideally yes -- a lot of the problems we have with confluence today stem
> from the "WYSI-kind-of-WYG" mentality of its editor, and the fact that it
> sometimes preserves html styling you can't see until the PDF is published
> (especially when you copy/paste).  Most of that pain should go away
> because the adoc files will be plain text.  (Any markup language has its
> share of "wait, how do I get formatting XYZ?" but being plain text
> files in git will make it a lot easier to spot mistakes in diffs -- as
> opposed to confluence with its "here's a historical diff that is also in
> rendered HTML, so good luck noticing that there is an extra span with a
> css class that affects the PDF but isn't mentioned in the web stylesheet"
>
> : I downloaded AsciidocFX and it looks quite usable. There may be better
> : tools out there but that was fast to find and I could work with it. I
> : see a Chrome extension, IntelliJ plugin etc. so it looks like there
> : are a variety of ways to go about all this.
>
> yeah -- just like java IDE/editor choices can be very personal,
> people will also be free to choose any tooling they want for editing
> asciidoc files -- which is another nice win over the web based confluence
> editor.  The trick will be having good automation in place to build the
> HTML & PDF output formats from the source documents, and give helpful
> feedback/errors about any weirdness that we can detect in scripts.  I plan
> on working to help cassandra with the "ongoing automation" when i get back
> from vacation in a few weeks.
>
> (at the moment, I'm spending my last few days before vacation trying to
> better automate the confluence->(clean)asciidoc conversion so
> cassandra can iterate faster on demos of the full guide)
>
> : > If reaction is positive, my next step will be to expand the demo
> : > online with a full copy of the Ref Guide instead of the current small
> : > set.
>
>
>
> -Hoss
> http://www.lucidworks.com/
>
>
>


[jira] [Commented] (SOLR-9395) Add ceil/floor bounding to stats calculations

2016-08-09 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413994#comment-15413994
 ] 

Doug Turnbull commented on SOLR-9395:
-

Hmm, that won't work -- never mind, as you'd do stats over a relevance score :-/ You 
probably need some way of passing up the exists value and/or declaring 
something as non-existent. I'll have to think on it some more

> Add ceil/floor bounding to stats calculations
> -
>
> Key: SOLR-9395
> URL: https://issues.apache.org/jira/browse/SOLR-9395
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: master (7.0)
>    Reporter: Doug Turnbull
> Fix For: master (7.0)
>
>
> In the pull request to be attached we add optional ceil and floor parameters 
> to a field being computed via the stats component. This bounds the stats 
> calculations to ceil to floor inclusive.
> For example, let's say you're searching over all the employees.
> stats=true&stats.field=employee_age
> But you want to focus on employees aged 18-60 for whatever reason. You can 
> reissue this query as
> stats=true&stats.field={!floor=18 ceil=60}employee_age
> This limits the resulting stats calculations to 18-60 inclusive. This 
> functionality also works on date fields (see test in PR).
> Now one question might be, why not do this with a filter query? In many cases 
> you don't necessarily want to filter these documents from the main search 
> results. You just want to eliminate outliers from a specific stats 
> calculation. For example, you search your employee database for "clerks." You 
> still want to see all the clerks, even little 16 year old Timmy. But for this 
> particular calculation you just want to focus on folks of traditional working 
> age for whatever reason.
> Some notes
> - floor/ceil are only supported as local params.
> - works for date and numeric values
> - date math works!






[jira] [Comment Edited] (SOLR-9395) Add ceil/floor bounding to stats calculations

2016-08-09 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413459#comment-15413459
 ] 

Doug Turnbull edited comment on SOLR-9395 at 8/9/16 12:41 PM:
--

Thanks [~hossman] and [~dsmiley]

Lots of good ideas. I'm going to try with the JSON Facets [~dsmiley]

To your point [~hossman], does {{query}} return exists false if the query 
doesn't match? If that's the case, perhaps this could be achieved with 
combining a query with a filter query with a range? Something like 

{quote}
stats.field=\{!func\}query($someRangeFilter) 
{quote}

I hadn't tried that, but I wonder if it would work. I'll have to try it and 
report back...


was (Author: softwaredoug):
Thanks [~hossman] and [~dsmiley]

Lots of good ideas. I'm going to try with the JSON Facets [~dsmiley]

To your point [~hossman], does {{query}} return exists false if the query 
doesn't match? If that's the case, perhaps this could be achieved with 
combining a query with a filter query with a range? Something like 

{quote}
stats.field={!func}query($someRangeFilter) 
{quote}

I hadn't tried that, but I wonder if it would work. I'll have to try it and 
report back...

> Add ceil/floor bounding to stats calculations
> -
>
> Key: SOLR-9395
> URL: https://issues.apache.org/jira/browse/SOLR-9395
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: master (7.0)
>    Reporter: Doug Turnbull
> Fix For: master (7.0)
>
>
> In the pull request to be attached we add optional ceil and floor parameters 
> to a field being computed via the stats component. This bounds the stats 
> calculations to ceil to floor inclusive.
> For example, let's say you're searching over all the employees.
> stats=true&stats.field=employee_age
> But you want to focus on employees aged 18-60 for whatever reason. You can 
> reissue this query as
> stats=true&stats.field={!floor=18 ceil=60}employee_age
> This limits the resulting stats calculations to 18-60 inclusive. This 
> functionality also works on date fields (see test in PR).
> Now one question might be, why not do this with a filter query? In many cases 
> you don't necessarily want to filter these documents from the main search 
> results. You just want to eliminate outliers from a specific stats 
> calculation. For example, you search your employee database for "clerks." You 
> still want to see all the clerks, even little 16 year old Timmy. But for this 
> particular calculation you just want to focus on folks of traditional working 
> age for whatever reason.
> Some notes
> - floor/ceil are only supported as local params.
> - works for date and numeric values
> - date math works!






[jira] [Comment Edited] (SOLR-9395) Add ceil/floor bounding to stats calculations

2016-08-09 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413459#comment-15413459
 ] 

Doug Turnbull edited comment on SOLR-9395 at 8/9/16 12:41 PM:
--

Thanks [~hossman] and [~dsmiley]

Lots of good ideas. I'm going to try with the JSON Facets [~dsmiley]

To your point [~hossman], does {{query}} return exists false if the query 
doesn't match? If that's the case, perhaps this could be achieved with 
combining a query with a filter query with a range? Something like 

{quote}
stats.field={!func}query($someRangeFilter) 
{quote}

I hadn't tried that, but I wonder if it would work. I'll have to try it and 
report back...


was (Author: softwaredoug):
Thanks [~hossman] and [~dsmiley]

Lots of good ideas. I'm going to try with the JSON Facets [~dsmiley]

To your point [~hossman], does {{query}} return exists false if the query 
doesn't match? If that's the case, perhaps this could be achieved with 
combining a query with a filter query with a range? Something like 
{{stats.field={!func}query($someRangeFilter)}}. I hadn't tried that, but I 
wonder if it would work. I'll have to try it and report back...

> Add ceil/floor bounding to stats calculations
> -
>
> Key: SOLR-9395
> URL: https://issues.apache.org/jira/browse/SOLR-9395
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: master (7.0)
>    Reporter: Doug Turnbull
> Fix For: master (7.0)
>
>
> In the pull request to be attached we add optional ceil and floor parameters 
> to a field being computed via the stats component. This bounds the stats 
> calculations to ceil to floor inclusive.
> For example, let's say you're searching over all the employees.
> stats=true&stats.field=employee_age
> But you want to focus on employees aged 18-60 for whatever reason. You can 
> reissue this query as
> stats=true&stats.field={!floor=18 ceil=60}employee_age
> This limits the resulting stats calculations to 18-60 inclusive. This 
> functionality also works on date fields (see test in PR).
> Now one question might be, why not do this with a filter query? In many cases 
> you don't necessarily want to filter these documents from the main search 
> results. You just want to eliminate outliers from a specific stats 
> calculation. For example, you search your employee database for "clerks." You 
> still want to see all the clerks, even little 16 year old Timmy. But for this 
> particular calculation you just want to focus on folks of traditional working 
> age for whatever reason.
> Some notes
> - floor/ceil are only supported as local params.
> - works for date and numeric values
> - date math works!






[jira] [Commented] (SOLR-9395) Add ceil/floor bounding to stats calculations

2016-08-09 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413459#comment-15413459
 ] 

Doug Turnbull commented on SOLR-9395:
-

Thanks [~hossman] and [~dsmiley]

Lots of good ideas. I'm going to try with the JSON Facets [~dsmiley]

To your point [~hossman], does {{query}} return exists false if the query 
doesn't match? If that's the case, perhaps this could be achieved with 
combining a query with a filter query with a range? Something like 
{{stats.field={!func}query($someRangeFilter)}}. I hadn't tried that, but I 
wonder if it would work. I'll have to try it and report back...

> Add ceil/floor bounding to stats calculations
> -
>
> Key: SOLR-9395
> URL: https://issues.apache.org/jira/browse/SOLR-9395
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: master (7.0)
>    Reporter: Doug Turnbull
> Fix For: master (7.0)
>
>
> In the pull request to be attached we add optional ceil and floor parameters 
> to a field being computed via the stats component. This bounds the stats 
> calculations to ceil to floor inclusive.
> For example, let's say you're searching over all the employees.
> stats=true&stats.field=employee_age
> But you want to focus on employees aged 18-60 for whatever reason. You can 
> reissue this query as
> stats=true&stats.field={!floor=18 ceil=60}employee_age
> This limits the resulting stats calculations to 18-60 inclusive. This 
> functionality also works on date fields (see test in PR).
> Now one question might be, why not do this with a filter query? In many cases 
> you don't necessarily want to filter these documents from the main search 
> results. You just want to eliminate outliers from a specific stats 
> calculation. For example, you search your employee database for "clerks." You 
> still want to see all the clerks, even little 16 year old Timmy. But for this 
> particular calculation you just want to focus on folks of traditional working 
> age for whatever reason.
> Some notes
> - floor/ceil are only supported as local params.
> - works for date and numeric values
> - date math works!






[jira] [Created] (SOLR-9395) Add ceil/floor bounding to stats calculations

2016-08-08 Thread Doug Turnbull (JIRA)
Doug Turnbull created SOLR-9395:
---

 Summary: Add ceil/floor bounding to stats calculations
 Key: SOLR-9395
 URL: https://issues.apache.org/jira/browse/SOLR-9395
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Affects Versions: master (7.0)
Reporter: Doug Turnbull
 Fix For: master (7.0)


In the pull request to be attached we add optional ceil and floor parameters to 
a field being computed via the stats component. This bounds the stats 
calculations to ceil to floor inclusive.

For example, let's say you're searching over all the employees.

stats=true&stats.field=employee_age

But you want to focus on employees aged 18-60 for whatever reason. You can 
reissue this query as

stats=true&stats.field={!floor=18 ceil=60}employee_age

This limits the resulting stats calculations to 18-60 inclusive. This 
functionality also works on date fields (see test in PR).

Now one question might be, why not do this with a filter query? In many cases 
you don't necessarily want to filter these documents from the main search 
results. You just want to eliminate outliers from a specific stats calculation. 
For example, you search your employee database for "clerks." You still want to 
see all the clerks, even little 16 year old Timmy. But for this particular 
calculation you just want to focus on folks of traditional working age for 
whatever reason.

Some notes
- floor/ceil are only supported as local params.
- works for date and numeric values
- date math works!
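The proposed semantics can be sketched in plain Java (boundedMean is an illustrative name, not from the patch): values outside [floor, ceil] are excluded from the stats calculation while the documents themselves stay in the result set:

```java
import java.util.Arrays;

public class BoundedStats {
    // Compute a mean over only the values inside [floor, ceil], mirroring
    // the proposed {!floor=... ceil=...} bounding for stats.field.
    static double boundedMean(double[] values, double floor, double ceil) {
        return Arrays.stream(values)
                     .filter(v -> v >= floor && v <= ceil)
                     .average()
                     .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        double[] employeeAges = {16, 25, 40, 55, 72};
        // Only 25, 40 and 55 fall inside [18, 60], so 16-year-old Timmy
        // and the 72-year-old are excluded from the calculation:
        System.out.println(boundedMean(employeeAges, 18, 60)); // 40.0
    }
}
```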







[jira] [Commented] (SOLR-9279) Add greater than, less than, etc in Solr function queries

2016-07-28 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15397636#comment-15397636
 ] 

Doug Turnbull commented on SOLR-9279:
-

+1

> Add greater than, less than, etc in Solr function queries
> -
>
> Key: SOLR-9279
> URL: https://issues.apache.org/jira/browse/SOLR-9279
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>    Reporter: Doug Turnbull
> Fix For: master (7.0)
>
> Attachments: SOLR-9279.patch
>
>
> If you use the "if" function query, you'll often expect to be able to use 
> greater than/less than functions. For example, you might want to boost books 
> written in the past 7 years. Unfortunately, there's no "greater than" 
> function query that will return non-zero when the lhs > rhs. Instead to get 
> this, you need to create really awkward function queries like I do here 
> (http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/):
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> The pull request attached to this Jira adds the following function queries
> (https://github.com/apache/lucene-solr/pull/49)
> -gt(lhs, rhs) (returns 1 if lhs > rhs, 0 otherwise)
> -lt(lhs, rhs) (returns 1 if lhs < rhs, 0 otherwise)
> -gte
> -lte
> -eq
> So instead of 
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> one could now write
> if(lt(ms(mydatefield),315569259747),0.8,1)
> (if mydatefield < 315569259747 then 0.8 else 1)
> A bit more readable and less puzzling
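The proposed semantics are easy to sketch outside Solr. Below is a toy Java model (not Solr code; the names `gt`, `lt`, and `iff` are invented stand-ins for the function queries, with `iff` used because `if` is reserved in Java) showing how the comparisons return 1/0 and how `if()` then treats any non-zero condition as true:

```java
// Toy model of the proposed comparison function queries (not Solr code).
// gt/lt return 1 when the comparison holds and 0 otherwise; if() treats
// any non-zero condition value as true.
class FunctionQueryModel {
    static double gt(double lhs, double rhs) { return lhs > rhs ? 1 : 0; }
    static double lt(double lhs, double rhs) { return lhs < rhs ? 1 : 0; }

    // "iff" stands in for Solr's if(); 'if' is a reserved word in Java.
    static double iff(double cond, double thenVal, double elseVal) {
        return cond != 0 ? thenVal : elseVal;
    }

    public static void main(String[] args) {
        double msOfDateField = 100;     // hypothetical ms(mydatefield) value
        double threshold = 315569259747.0;
        // Equivalent of if(lt(ms(mydatefield),315569259747),0.8,1)
        double boost = iff(lt(msOfDateField, threshold), 0.8, 1.0);
        System.out.println(boost);      // prints 0.8
    }
}
```

The model makes the readability win concrete: the comparison is a named function rather than an arithmetic trick with min/sub.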






Re: [jira] [Commented] (SOLR-9279) Add greater than, less than, etc in Solr function queries

2016-07-27 Thread Doug Turnbull
Great!

+1
On Wed, Jul 27, 2016 at 3:26 PM David Smiley (JIRA) <j...@apache.org> wrote:

>
> [
> https://issues.apache.org/jira/browse/SOLR-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15396205#comment-15396205
> ]
>
> David Smiley commented on SOLR-9279:
> 
>
> Sure -- trivial enough.  Unless there are further suggestions on this
> issue, I'll commit it with that change later this week.  I'll update Lucene
> & Solr's CHANGES.txt since both get something here.
>
> > Add greater than, less than, etc in Solr function queries
> > -
> >
> > Key: SOLR-9279
> > URL: https://issues.apache.org/jira/browse/SOLR-9279
> > Project: Solr
> >  Issue Type: New Feature
> >  Security Level: Public(Default Security Level. Issues are Public)
> >  Components: search
> >Reporter: Doug Turnbull
> > Fix For: master (7.0)
> >
> > Attachments: SOLR-9279.patch
> >
> >
> > If you use the "if" function query, you'll often expect to be able to
> use greater than/less than functions. For example, you might want to boost
> books written in the past 7 years. Unfortunately, there's no "greater than"
> function query that will return non-zero when the lhs > rhs. Instead to get
> this, you need to create really awkward function queries like I do here (
> http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/
> ):
> > if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> > The pull request attached to this Jira adds the following function
> queries
> > (https://github.com/apache/lucene-solr/pull/49)
> > -gt(lhs, rhs) (returns 1 if lhs > rhs, 0 otherwise)
> > -lt(lhs, rhs) (returns 1 if lhs < rhs, 0 otherwise)
> > -gte
> > -lte
> > -eq
> > So instead of
> > if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> > one could now write
> > if(lt(ms(mydatefield),315569259747),0.8,1)
> > (if mydatefield < 315569259747 then 0.8 else 1)
> > A bit more readable and less puzzling
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


[jira] [Commented] (SOLR-9279) Add greater than, less than, etc in Solr function queries

2016-07-27 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15396118#comment-15396118
 ] 

Doug Turnbull commented on SOLR-9279:
-

Looks great [~dsmiley]! Definitely a big improvement. Appreciate your 
attention, I've learned a lot through this issue.

What do you think about adding an objectValue override as suggested by 
[~hossman]?

{code:java}
 @Override
  public Object objectVal(int doc) {
return exists(doc) ? boolVal(doc) : null;
  }
{code}
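For readers following along, here is a standalone sketch of why such an override is useful (the class names below are invented for illustration; this is not Lucene's actual FunctionValues API): returning null for docs without a value lets callers distinguish "missing" from a stored false.

```java
// Standalone sketch (invented names, not Lucene's FunctionValues) of the
// suggested objectVal behavior: null signals "no value for this doc",
// which is distinct from a stored value of false.
abstract class BoolDocValuesSketch {
    abstract boolean exists(int doc);   // does this doc have a value?
    abstract boolean boolVal(int doc);  // the value, assuming it exists

    Object objectVal(int doc) {
        return exists(doc) ? boolVal(doc) : null;
    }
}

// Toy implementation: only doc 0 carries a value (true).
class OneDocBoolValues extends BoolDocValuesSketch {
    boolean exists(int doc) { return doc == 0; }
    boolean boolVal(int doc) { return true; }
}
```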

> Add greater than, less than, etc in Solr function queries
> -
>
> Key: SOLR-9279
> URL: https://issues.apache.org/jira/browse/SOLR-9279
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>    Reporter: Doug Turnbull
> Fix For: master (7.0)
>
> Attachments: SOLR-9279.patch
>
>
> If you use the "if" function query, you'll often expect to be able to use 
> greater than/less than functions. For example, you might want to boost books 
> written in the past 7 years. Unfortunately, there's no "greater than" 
> function query that will return non-zero when the lhs > rhs. Instead to get 
> this, you need to create really awkward function queries like I do here 
> (http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/):
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> The pull request attached to this Jira adds the following function queries
> (https://github.com/apache/lucene-solr/pull/49)
> -gt(lhs, rhs) (returns 1 if lhs > rhs, 0 otherwise)
> -lt(lhs, rhs) (returns 1 if lhs < rhs, 0 otherwise)
> -gte
> -lte
> -eq
> So instead of 
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> one could now write
> if(lt(ms(mydatefield),315569259747),0.8,1)
> (if mydatefield < 315569259747 then 0.8 else 1)
> A bit more readable and less puzzling






[jira] [Commented] (SOLR-9279) Add greater than, less than, etc in Solr function queries

2016-07-26 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394521#comment-15394521
 ] 

Doug Turnbull commented on SOLR-9279:
-

[~hossman] Thanks for your help! Great points. -- I think I addressed your 
comments other than the Object value one. Is there documentation on an object 
value source? I'm not sure what's expected here.

> Add greater than, less than, etc in Solr function queries
> -
>
> Key: SOLR-9279
> URL: https://issues.apache.org/jira/browse/SOLR-9279
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>    Reporter: Doug Turnbull
> Fix For: master (7.0)
>
>
> If you use the "if" function query, you'll often expect to be able to use 
> greater than/less than functions. For example, you might want to boost books 
> written in the past 7 years. Unfortunately, there's no "greater than" 
> function query that will return non-zero when the lhs > rhs. Instead to get 
> this, you need to create really awkward function queries like I do here 
> (http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/):
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> The pull request attached to this Jira adds the following function queries
> (https://github.com/apache/lucene-solr/pull/49)
> -gt(lhs, rhs) (returns 1 if lhs > rhs, 0 otherwise)
> -lt(lhs, rhs) (returns 1 if lhs < rhs, 0 otherwise)
> -gte
> -lte
> -eq
> So instead of 
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> one could now write
> if(lt(ms(mydatefield),315569259747),0.8,1)
> (if mydatefield < 315569259747 then 0.8 else 1)
> A bit more readable and less puzzling






[jira] [Updated] (SOLR-9279) Add greater than, less than, etc in Solr function queries

2016-07-05 Thread Doug Turnbull (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Turnbull updated SOLR-9279:

Description: 
If you use the "if" function query, you'll often expect to be able to use 
greater than/less than functions. For example, you might want to boost books 
written in the past 7 years. Unfortunately, there's no "greater than" function 
query that will return non-zero when the lhs > rhs. Instead to get this, you 
need to create really awkward function queries like I do here 
(http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/):

if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)

The pull request attached to this Jira adds the following function queries
(https://github.com/apache/lucene-solr/pull/49)

-gt(lhs, rhs) (returns 1 if lhs > rhs, 0 otherwise)
-lt(lhs, rhs) (returns 1 if lhs < rhs, 0 otherwise)
-gte
-lte
-eq

So instead of 

if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)

one could now write

if(lt(ms(mydatefield),315569259747),0.8,1)

(if mydatefield < 315569259747 then 0.8 else 1)

A bit more readable and less puzzling


  was:
If you use the "if" function query, you'll often expect to be able to use 
greater than/less than functions. For example, you might want to boost books 
written in the past 7 years. Unfortunately, there's no "greater than" function 
query that will return non-zero when the lhs > rhs. Instead to get this, you 
need to create really awkward function queries like I do here 
(http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/):

if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)

The pull request to be attached to this Jira adds the following function queries

-gt(lhs, rhs) (returns 1 if lhs > rhs, 0 otherwise)
-lt(lhs, rhs) (returns 1 if lhs < rhs, 0 otherwise)
-gte
-lte
-eq

So instead of 

if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)

one could now write

if(lt(ms(mydatefield),315569259747),0.8,1)

(if mydatefield < 315569259747 then 0.8 else 1)

A bit more readable and less puzzling



> Add greater than, less than, etc in Solr function queries
> -
>
> Key: SOLR-9279
> URL: https://issues.apache.org/jira/browse/SOLR-9279
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>Reporter: Doug Turnbull
> Fix For: master (7.0)
>
>
> If you use the "if" function query, you'll often expect to be able to use 
> greater than/less than functions. For example, you might want to boost books 
> written in the past 7 years. Unfortunately, there's no "greater than" 
> function query that will return non-zero when the lhs > rhs. Instead to get 
> this, you need to create really awkward function queries like I do here 
> (http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/):
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> The pull request attached to this Jira adds the following function queries
> (https://github.com/apache/lucene-solr/pull/49)
> -gt(lhs, rhs) (returns 1 if lhs > rhs, 0 otherwise)
> -lt(lhs, rhs) (returns 1 if lhs < rhs, 0 otherwise)
> -gte
> -lte
> -eq
> So instead of 
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> one could now write
> if(lt(ms(mydatefield),315569259747),0.8,1)
> (if mydatefield < 315569259747 then 0.8 else 1)
> A bit more readable and less puzzling






[jira] [Commented] (SOLR-9279) Add greater than, less than, etc in Solr function queries

2016-07-05 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363563#comment-15363563
 ] 

Doug Turnbull commented on SOLR-9279:
-

Associated pull request: https://github.com/apache/lucene-solr/pull/49

> Add greater than, less than, etc in Solr function queries
> -
>
> Key: SOLR-9279
> URL: https://issues.apache.org/jira/browse/SOLR-9279
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: search
>    Reporter: Doug Turnbull
> Fix For: master (7.0)
>
>
> If you use the "if" function query, you'll often expect to be able to use 
> greater than/less than functions. For example, you might want to boost books 
> written in the past 7 years. Unfortunately, there's no "greater than" 
> function query that will return non-zero when the lhs > rhs. Instead to get 
> this, you need to create really awkward function queries like I do here 
> (http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/):
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> The pull request to be attached to this Jira adds the following function 
> queries
> -gt(lhs, rhs) (returns 1 if lhs > rhs, 0 otherwise)
> -lt(lhs, rhs) (returns 1 if lhs < rhs, 0 otherwise)
> -gte
> -lte
> -eq
> So instead of 
> if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
> one could now write
> if(lt(ms(mydatefield),315569259747),0.8,1)
> (if mydatefield < 315569259747 then 0.8 else 1)
> A bit more readable and less puzzling






[jira] [Created] (SOLR-9279) Add greater than, less than, etc in Solr function queries

2016-07-05 Thread Doug Turnbull (JIRA)
Doug Turnbull created SOLR-9279:
---

 Summary: Add greater than, less than, etc in Solr function queries
 Key: SOLR-9279
 URL: https://issues.apache.org/jira/browse/SOLR-9279
 Project: Solr
  Issue Type: New Feature
  Security Level: Public (Default Security Level. Issues are Public)
  Components: search
Reporter: Doug Turnbull
 Fix For: master (7.0)


If you use the "if" function query, you'll often expect to be able to use 
greater than/less than functions. For example, you might want to boost books 
written in the past 7 years. Unfortunately, there's no "greater than" function 
query that will return non-zero when the lhs > rhs. Instead to get this, you 
need to create really awkward function queries like I do here 
(http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/):

if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)

The pull request to be attached to this Jira adds the following function queries

-gt(lhs, rhs) (returns 1 if lhs > rhs, 0 otherwise)
-lt(lhs, rhs) (returns 1 if lhs < rhs, 0 otherwise)
-gte
-lte
-eq

So instead of 

if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)

one could now write

if(lt(ms(mydatefield),315569259747),0.8,1)

(if mydatefield < 315569259747 then 0.8 else 1)

A bit more readable and less puzzling







Lucene/Solr git mirror will soon turn off

2015-12-16 Thread Doug Turnbull
In defense of more history being immediately available--it is often far more
useful to poke around code history/run blame to figure out some code than
to take it at face value. Putting this in a secondary place like the
Apache SVN repo IMO reduces the readability of the code itself. This is
doubly true for new developers who won't know about Apache's SVN. And
Lucene can be quite intricate code. Further, in my own work poking around in
GitHub mirrors I frequently hit the current cutoff, which is one reason I
stopped using them for anything but casual investigation.

I'm not totally against a cutoff point, but I'd advocate for exhausting
other options first, such as trimming out unrelated projects, binaries, etc.

-Doug


On Wednesday, December 16, 2015, Shawn Heisey <apa...@elyograg.org
<javascript:_e(%7B%7D,'cvml','apa...@elyograg.org');>> wrote:

> On 12/16/2015 5:53 PM, Alexandre Rafalovitch wrote:
> > On 16 December 2015 at 00:44, Dawid Weiss <dawid.we...@gmail.com> wrote:
> >> 4) The size of JARs is really not an issue. The entire SVN repo I
> mirrored
> >> locally (including empty interim commits to cater for svn:mergeinfos)
> is 4G.
> >> If you strip the stuff like javadocs and side projects (Nutch, Tika,
> Mahout)
> >> then I bet the entire history can fit in 1G total. Of course stripping
> JARs
> >> is also doable.
> > I think this answered one of the issues. So, this is not something to
> focus on.
> >
> > The question I had (I am sure a very dumb one): WHY do we care about
> > history preserved perfectly in Git? Because that seems to be the real
> > bottleneck now. Does anybody still checks out an intermediate commit
> > in Solr 1.4 branch?
>
> I do not think we need every bit of history -- at least in the primary
> read/write repository.  I wonder how much of a size difference there
> would be between tossing all history before 5.0 and tossing all history
> before the ivy transition was completed.
>
> In the interests of reducing the size and download time of a clone
> operation, I definitely think we should trim history in the main repo to
> some arbitrary point, as long as the full history is available
> elsewhere.  It's my understanding that it will remain in svn.apache.org
> (possibly forever), and I think we could also create "historical"
> read-only git repos.
>
> Almost every time I am working on the code, I only care about the stable
> branch and trunk.  Sometimes I will check out an older 4.x tag so I can
> see the exact code referenced by a stacktrace in a user's error message,
> but when this is required, I am willing to go to an entirely different
> repository and chew up bandwidth/disk resources to obtain it, and I do
> not care whether it is git or svn.  As time marches on, fewer people
> will have reasons to look at the historical record.
>
> Thanks,
> Shawn
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Re: Lucene/Solr git mirror will soon turn off

2015-12-15 Thread Doug Turnbull
I thought the general consensus at minimum was to investigate a git mirror
that stripped some artifacts out (jars etc) to lighten up the work of the
process. If at some point the project switched to git, such a mirror might
be a suitable git repo for the project with archived older versions in SVN.

I think probably what is lacking is a volunteer to figure it all out.

-Doug

On Tue, Dec 15, 2015 at 11:32 AM, Mark Miller <markrmil...@gmail.com> wrote:

> Anyone willing to lead this discussion to some kind of better resolution?
> Did that whole back and forth help with any ideas on the best path forward?
> I know it's a complicated issue, git / svn, the light side, the dark side,
> but doesn't GitHub also depend on this mirroring? It's going to be super
> annoying when I can no longer pull from a relatively up to date git remote.
>
> Who has boiled down the correct path?
>
> - Mark
>
> On Wed, Dec 9, 2015 at 6:07 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
>
>> FYI.
>>
>> - All of Lucene's SVN, incremental deltas, uncompressed: 5.0G
>> - the above, tar.bz2: 1.2G
>>
>> Sadly, I didn't succeed at recreating a local SVN repo from those
>> incremental dumps. svnadmin load fails with a cryptic error related to
>> the fact that revision number of node-copy operations refer to
>> original SVN numbers and they're apparently renumbered on import.
>> svnadmin isn't smart enough to somehow keep a reference of those
>> original numbers and svndumpfilter can't work with incremental dump
>> files... A seemingly trivial task of splitting a repo on a clean
>> boundary seems incredibly hard with SVN...
>>
>> If anybody wishes to play with the dump files, here they are:
>> http://goo.gl/m6q3J8
>>
>> Dawid
>>
>> On Tue, Dec 8, 2015 at 10:49 PM, Upayavira <u...@odoko.co.uk> wrote:
>> > You can't avoid having the history in SVN. The ASF has one large repo,
>> and
>> > won't be deleting that repo, so the history will survive in perpetuity,
>> > regardless of what we do now.
>> >
>> > Upayavira
>> >
>> > On Tue, Dec 8, 2015, at 09:24 PM, Doug Turnbull wrote:
>> >
>> > It seems you'd want to preserve that history in a frozen/archived
>> Apache Svn
>> > repo for Lucene. Then make the new git repo slimmer before switching.
>> Folks
>> > that want very old versions or doing research can at least go through
>> the
>> > original SVN repo.
>> >
>> > On Tuesday, December 8, 2015, Dawid Weiss <dawid.we...@gmail.com>
>> wrote:
>> >
>> > One more thing, perhaps of importance, the raw Lucene repo contains
>> > all the history of projects that then turned top-level (Nutch,
>> > Mahout). These could also be dropped (or ignored) when converting to
>> > git. If we agree JARs are not relevant, why should projects not
>> > directly related to Lucene/ Solr be?
>> >
>> > Dawid
>> >
>> > On Tue, Dec 8, 2015 at 10:05 PM, Dawid Weiss <dawid.we...@gmail.com>
>> wrote:
>> >>> Don’t know how much we have of historic jars in our history.
>> >>
>> >> I actually do know. Or will know. In about ~10 hours. I wrote a script
>> >> that does the following:
>> >>
>> >> 1) git log all revisions touching
>> https://svn.apache.org/repos/asf/lucene
>> >> 2) grep revision numbers
>> >> 3) use svnrdump to get every single commit (revision) above, in
>> >> incremental mode.
>> >>
>> >> This will allow me to:
>> >>
>> >> 1) recreate only Lucene/ Solr SVN, locally.
>> >> 2) measure the size of SVN repo.
>> >> 3) measure the size of any conversion to git (even if it's one-by-one
>> >> checkout, then-sync with git).
>> >>
>> >> From what I see up until now size should not be an issue at all. Even
>> >> with all binary blobs so far the SVN incremental dumps measure ~3.7G
>> >> (and I'm about 75% done). There is one interesting super-large commit,
>> >> this one:
>> >>
>> >> svn log -r1240618 https://svn.apache.org/repos/asf/lucene
>> >>
>> 
>> >> r1240618 | gsingers | 2012-02-04 22:45:17 +0100 (Sat, 04 Feb 2012) | 1
>> >> line
>> >>
>> >> LUCENE-2748: bring in old Lucene docs
>> >>
>> >> This commit diff weights... wait for it... 1.3G! I didn't check what
>> >> it actually was

Re: Lucene/Solr git mirror will soon turn off

2015-12-08 Thread Doug Turnbull
It seems you'd want to preserve that history in a frozen/archived Apache
SVN repo for Lucene. Then make the new git repo slimmer before switching.
Folks that want very old versions or doing research can at least go through
the original SVN repo.

On Tuesday, December 8, 2015, Dawid Weiss <dawid.we...@gmail.com> wrote:

> One more thing, perhaps of importance, the raw Lucene repo contains
> all the history of projects that then turned top-level (Nutch,
> Mahout). These could also be dropped (or ignored) when converting to
> git. If we agree JARs are not relevant, why should projects not
> directly related to Lucene/ Solr be?
>
> Dawid
>
> On Tue, Dec 8, 2015 at 10:05 PM, Dawid Weiss <dawid.we...@gmail.com
> <javascript:;>> wrote:
> >> Don’t know how much we have of historic jars in our history.
> >
> > I actually do know. Or will know. In about ~10 hours. I wrote a script
> > that does the following:
> >
> > 1) git log all revisions touching
> https://svn.apache.org/repos/asf/lucene
> > 2) grep revision numbers
> > 3) use svnrdump to get every single commit (revision) above, in
> > incremental mode.
> >
> > This will allow me to:
> >
> > 1) recreate only Lucene/ Solr SVN, locally.
> > 2) measure the size of SVN repo.
> > 3) measure the size of any conversion to git (even if it's one-by-one
> > checkout, then-sync with git).
> >
> > From what I see up until now size should not be an issue at all. Even
> > with all binary blobs so far the SVN incremental dumps measure ~3.7G
> > (and I'm about 75% done). There is one interesting super-large commit,
> > this one:
> >
> > svn log -r1240618 https://svn.apache.org/repos/asf/lucene
> > 
> > r1240618 | gsingers | 2012-02-04 22:45:17 +0100 (Sat, 04 Feb 2012) | 1
> line
> >
> > LUCENE-2748: bring in old Lucene docs
> >
> > This commit diff weights... wait for it... 1.3G! I didn't check what
> > it actually was.
> >
> > Will keep you posted.
> >
> > D.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org <javascript:;>
> For additional commands, e-mail: dev-h...@lucene.apache.org <javascript:;>
>
>

-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Re: Lucene/Solr git mirror will soon turn off

2015-12-06 Thread Doug Turnbull
I had not heard of git-lfs; looks promising:

https://git-lfs.github.com/

On Sunday, December 6, 2015, Jan Høydahl <jan@cominvent.com> wrote:

> If the size of historic jars is the problem here, would looking into
> git-lfs for *.jar be one workaround? I might also be totally off here :-)
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> 6. des. 2015 kl. 00.46 skrev Scott Blum <dragonsi...@gmail.com
> <javascript:_e(%7B%7D,'cvml','dragonsi...@gmail.com');>>:
>
> If lucene was a new project being started today, is there any question
> about whether it would be managed in svn or git?  If not, this might be a
> good impetus for moving to a better world.
>
> On Sat, Dec 5, 2015 at 6:19 PM, Yonik Seeley <ysee...@gmail.com
> <javascript:_e(%7B%7D,'cvml','ysee...@gmail.com');>> wrote:
>
>> On Sat, Dec 5, 2015 at 5:53 PM, david.w.smi...@gmail.com
>> <javascript:_e(%7B%7D,'cvml','david.w.smi...@gmail.com');>
>> <david.w.smi...@gmail.com
>> <javascript:_e(%7B%7D,'cvml','david.w.smi...@gmail.com');>> wrote:
>> > I understand Gus; but we’d like to separate the question of wether we
>> should
>> > move from svn to git from fixing the git mirror.
>>
>> Except moving to git is one path to fixing the issue, so it's not
>> really separate.
>> Give the multiple problems that the svn-git bridge seems to have (both
>> memory leaks + history), perhaps the sooner we switch to git, the
>> better.
>>
>> -Yonik
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> <javascript:_e(%7B%7D,'cvml','dev-unsubscr...@lucene.apache.org');>
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>> <javascript:_e(%7B%7D,'cvml','dev-h...@lucene.apache.org');>
>>
>>
>
>

-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Re: Lucene/Solr git mirror will soon turn off

2015-12-04 Thread Doug Turnbull
The only downside is GitHub is a convenient way to run blame, etc. It's
very convenient for sleuthing through code. (If only their search wasn't
abysmal in terms of relevancy, but I digress)

Is the more systemic problem the large binaries checked in in the past? Can we
do any surgery to svn or git to remove these? IIRC this is one reason we
avoided changing from svn to git to begin with. If removing some jars from
an old version of Lucene fixes it, perhaps this is a better long-term
solution. I suppose the issue is having someone with the right svn/git
skills and the time to pull this off.

Doug

On Friday, December 4, 2015, Uwe Schindler <u...@thetaphi.de> wrote:

> Hi,
>
> This looks like a good idea to me. Maybe we just have a limited amount of
> history and branches in Git/Github, so people can work and create pull
> requests. Nobody wants to create pull request on a very old branch or
> against a revision years ago.
>
> Maybe Infra can mirror only the last 2 years of trunk and branch_5x?
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de <javascript:;>
>
> > -Original Message-
> > From: Dyer, James [mailto:james.d...@ingramcontent.com <javascript:;>]
> > Sent: Friday, December 04, 2015 10:48 PM
> > To: dev@lucene.apache.org <javascript:;>
> > Cc: infrastruct...@apache.org <javascript:;>
> > Subject: RE: Lucene/Solr git mirror will soon turn off
> >
> > I know Infra has tried a number of things to resolve this, to no avail.
> But did
> > we try "git-svn --revision=" to only mirror "post-LUCENE-3930" (ivy,
> > r1307099)?  Or if that's not lean enough for the git-svn mirror to work,
> then
> > cut off when 4.x was branched or whenever.  The hope would be to give git
> > users enough of the past that it would be useful for new development but
> > then also we can retain the status quo with svn (which is the best path
> for a
> > 26-day timeframe).
> >
> > James Dyer
> > Ingram Content Group
> >
> >
> > -Original Message-
> > From: Michael McCandless [mailto:luc...@mikemccandless.com
> <javascript:;>]
> > Sent: Friday, December 04, 2015 2:58 PM
> > To: Lucene/Solr dev
> > Cc: infrastruct...@apache.org <javascript:;>
> > Subject: Lucene/Solr git mirror will soon turn off
> >
> > Hello devs,
> >
> > The infra team has notified us (Lucene/Solr) that in 26 days our
> > git-svn mirror will be turned off, because running it consumes too
> > many system resources, affecting other projects, apparently because of
> > a memory leak in git-svn.
> >
> > Does anyone know of a link to this git-svn issue?  Is it a known
> > issue?  If there's something simple we can do (remove old jars from
> > our svn history, remove old branches), maybe we can sidestep the issue
> > and infra will allow it to keep running?
> >
> > Or maybe someone in the Lucene/Solr dev community with prior
> > experience with git-svn could volunteer to play with it to see if
> > there's a viable solution, maybe with command-line options e.g. to
> > only mirror specific branches (trunk, 5.x)?
> >
> > Or maybe it's time for us to switch to git, but there are problems
> > there too, e.g. we are currently missing large parts of our svn
> > history from the mirror now and it's not clear whether that would be
> > fixed if we switched:
> > https://issues.apache.org/jira/browse/INFRA-10828  Also, because we
> > used to add JAR files to svn, the "git clone" would likely take
> > several GBs unless we remove those JARs from our history.
> >
> > Or if anyone has any other ideas, we should explore them, because
> > otherwise in 26 days there will be no more updates to the git mirror
> > of Lucene and Solr sources...
> >
> > Thanks,
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org <javascript:;>
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> <javascript:;>
> >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org <javascript:;>
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> <javascript:;>
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org <javascript:;>
> For additional commands, e-mail: dev-h...@lucene.apache.org <javascript:;>
>
>

-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


[jira] [Commented] (SOLR-8201) Swap space info not showing in new UI (see screenshot)

2015-10-24 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14972743#comment-14972743
 ] 

Doug Turnbull commented on SOLR-8201:
-

+1!

These little indicators in the admin UI can hint at problems before I have to 
use a more robust profiler.

> Swap space info not showing in new UI (see screenshot)
> --
>
> Key: SOLR-8201
> URL: https://issues.apache.org/jira/browse/SOLR-8201
> Project: Solr
>  Issue Type: Bug
>  Components: UI
>Reporter: Youssef Chaker
>Priority: Minor
> Attachments: swap space.png
>
>
> The old UI displays info about the swap space (even if nothing is allocated) 
> whereas the new UI does not (see screenshot).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7341) xjoin - join data from external sources

2015-10-15 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959677#comment-14959677
 ] 

Doug Turnbull commented on SOLR-7341:
-

I am really looking forward to this patch. It has a lot of potential for 
joining search with external ranking systems, such as recommenders or other 
systems that are more appropriate for different use cases.

> xjoin - join data from external sources
> ---
>
> Key: SOLR-7341
> URL: https://issues.apache.org/jira/browse/SOLR-7341
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 4.10.3
>Reporter: Tom Winch
>Priority: Minor
> Fix For: Trunk
>
> Attachments: SOLR-7341.patch, SOLR-7341.patch, SOLR-7341.patch, 
> SOLR-7341.patch, SOLR-7341.patch, SOLR-7341.patch, SOLR-7341.patch-trunk, 
> SOLR-7341.patch-trunk, SOLR-7341.patch-trunk
>
>
> h2. XJoin
> The "xjoin" SOLR contrib allows external results to be joined with SOLR 
> results in a query and the SOLR result set to be filtered by the results of 
> an external query. Values from the external results are made available in the 
> SOLR results and may also be used to boost the scores of corresponding 
> documents during the search. The contrib consists of the Java classes 
> XJoinSearchComponent, XJoinValueSourceParser and XJoinQParserPlugin (and 
> associated classes), which must be configured in solrconfig.xml, and the 
> interfaces XJoinResultsFactory and XJoinResults, which are implemented by the 
> user to provide the link between SOLR and the external results source. 
> External results and SOLR documents are matched via a single configurable 
> attribute (the "join field"). The contrib JAR solr-xjoin-4.10.3.jar contains 
> these classes and interfaces and should be included in SOLR's class path from 
> solrconfig.xml, as should a JAR containing the user implementations of the 
> previously mentioned interfaces. For example:
> {code:xml}
> <config>
>   ..
>   <!-- paths illustrative: the xjoin contrib JAR and the JAR containing
>        the user implementations of the xjoin interfaces -->
>   <lib path="/path/to/solr-xjoin-4.10.3.jar" />
>   ..
>   <lib path="/path/to/my-xjoin-implementations.jar" />
>   ..
> </config>
> {code}
> h2. Java classes and interfaces
> h3. XJoinResultsFactory
> The user implementation of this interface is responsible for connecting to an 
> external source to perform a query (or otherwise collect results). Parameters 
> with prefix "<component name>.external." are passed from the SOLR query URL 
> to parameterise the search. The interface has the following methods:
> * void init(NamedList args) - this is called during SOLR initialisation, and 
> passed parameters from the search component configuration (see below)
> * XJoinResults getResults(SolrParams params) - this is called during a SOLR 
> search to generate external results, and is passed parameters from the SOLR 
> query URL (as above)
> For example, the implementation might perform queries of an external source 
> based on the 'q' SOLR query URL parameter (in full, <component name>.external.q).
> h3. XJoinResults
> A user implementation of this interface is returned by the getResults() 
> method of the XJoinResultsFactory implementation. It has methods:
> * Object getResult(String joinId) - this should return a particular result 
> given the value of the join attribute
> * Iterable<String> getJoinIds() - this should return an ordered (ascending) 
> list of the join attribute values for all results of the external search
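
The two interfaces boil down to a small contract. A minimal Python sketch of that contract (invented names and toy data, not Solr's actual Java API) might look like this:

```python
# Toy stand-in for XJoinResults: an ordered list of join ids plus a
# lookup from join id to the external payload for that id.
class ExternalResults:
    def __init__(self, results):
        self._results = results  # join id -> payload (e.g. an external score)

    def get_join_ids(self):
        # ordered (ascending), as getJoinIds() requires
        return sorted(self._results)

    def get_result(self, join_id):
        # getResult(String joinId) equivalent
        return self._results.get(join_id)

def xjoin(solr_docs, external, join_field="id"):
    # keep docs whose join-field value appears in the external results,
    # attaching the external payload to each match
    ids = set(external.get_join_ids())
    return [
        dict(doc, external=external.get_result(doc[join_field]))
        for doc in solr_docs
        if doc[join_field] in ids
    ]

docs = [{"id": "1", "title": "a"}, {"id": "4", "title": "b"}]
ext = ExternalResults({"1": 0.9, "2": 0.5, "3": 0.1})
joined = xjoin(docs, ext)
print(joined)  # only doc "1" has a matching external result
```

The ordered join ids matter in the real component because they allow an efficient merge against the index's sorted term values rather than the per-doc lookup sketched here.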
> h3. XJoinSearchComponent
> This is the central Java class of the contrib. It is a SOLR search component, 
> configured in solrconfig.xml and included in one or more SOLR request 
> handlers. There is one XJoin search component per external source, and each 
> has two main responsibilities:
> * Before the SOLR search, it connects to the external source and retrieves 
> results, storing them in the SOLR request context
> * After the SOLR search, it matches SOLR documents in the result set and 
> external results via the join field, adding attributes from the external 
> results to documents in the SOLR results set
> It takes the following initialisation parameters:
> * factoryClass - this specifies the user-supplied class implementing 
> XJoinResultsFactory, used to generate external results
> * joinField - this specifies the attribute on which to join between SOLR 
> documents and external results
> * external - this parameter set is passed to configure the 
> XJoinResultsFactory implementation
> For example, in solrconfig.xml:
> {code:xml}
> <searchComponent name="xjoin" class="org.apache.solr.search.xjoin.XJoinSearchComponent">
>   <str name="factoryClass">test.TestXJoinResultsFactory</str>
>   <str name="joinField">id</str>
>   <lst name="external">
>     <str name="values">1,2,3</str>
>   </lst>
> </searchComponent>
> {code}
> Here, the search 

Re: Mention security as a key feature on the web site "Features" page

2015-09-26 Thread Doug Turnbull
I'm glad some of these changes have made it in. And I admit ignorance of
the work done in this area. However...

My 2 cents would be that I'm still more comfortable locking down Solr
behind something that feels rather battle-tested like Nginx or another
proxy instead of letting Solr be in charge of security. I feel like this is
a better division of responsibilities, and I'm not sure you'd want to start
advertising Solr as super secure, locked down, and hardened.

-Doug

On Saturday, September 26, 2015, Jan Høydahl <jan@cominvent.com> wrote:

> Hi,
>
> Any comments on this suggestion?
>
> Jan
>
> > On 25 Aug 2015, at 10:25, Jan Høydahl <jan@cominvent.com> wrote:
> >
> > Idea: If we do not want to draw new icons, perhaps this could work:
> >
> > Use the “schemaless" icon (with a key) as the new security icon:
> >
> http://lucene.apache.org/solr/assets/images/Solr_Icons_a_real_data_schema.svg
> >
> > And for the schema-less feature, we can instead use the icon from the
> removed “External configuration"
> >
> http://lucene.apache.org/solr/assets/images/Solr_Icons_external_configuration.svg
> >
> >
> > Title: Security built right in
> > Subtitle: Secure Solr with Authentication, Role based Authorization and
> SSL. Pluggable of course!
> >
> >
> > See how it looks here:
> http://www.cominvent.com/solr/Apache%20Solr%20-%20Features.html
> >
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> >
> >> On 24 Aug 2015, at 21:25, Jan Høydahl <jan@cominvent.com> wrote:
> >>
> >> On the Solr web site
> http://lucene.staging.apache.org/solr/features.html we list key features.
> >> Now with 5.3 out the door, I think one of those icons should be about
> security.
> >>
> >> Suggest to remove one of the existing icons to make room for a new one.
> Candidates:
> >> - "External Configuration via XML” does perhaps not impress much
> anymore.
> >> - "Extensible Plugin Architecture” is almost a duplicate of "Powerful
> Extensions"
> >>
> >> --
> >> Jan Høydahl, search solution architect
> >> Cominvent AS - www.cominvent.com
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>



Re: discountOverlaps option for QueryParser

2015-09-20 Thread Doug Turnbull
/document/bb99e435ba35f2b1
> >
> > What do you think about this? How difficult to implement this?
> > Would this be a Lucene or Solr issue?
> >
> > Thanks,
> > Ahmet
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >



Re: Moving to git?

2015-05-31 Thread Doug Turnbull
You just made my day with that CVS repo! :)

Though I don't really get a vote -- +1 to your plan Robert.

/polishes history degree
-Doug

On Sun, May 31, 2015 at 3:16 PM, Robert Muir rcm...@gmail.com wrote:

 I totally agree Doug. Losing the jars would have a cost: those old
 branches wouldn't work out of box if you wanted to run tests on
 them.

 But I am not sure how bad that cost really is. It might be zero. I
 haven't tried to run e.g. Lucene 2.x tests with a modern Java 7 or Java
 8, but I bet they probably do not work due to things like hashmap
 failures. And I think Solr before 4.0 will not even compile, because
 of things like wildcard import + base64 clashes.

 So if I had my preference, we'd import as much of the history as we can,
 and nuke the silly jars. And I'd like that sourceforge history there
 too if we can get it, but I don't know if it is really legal.

 The sourceforge CVS works, see IndexWriter:

 http://lucene.cvs.sourceforge.net/viewvc/lucene/lucene/com/lucene/index/IndexWriter.java?view=log


 On Sun, May 31, 2015 at 3:10 PM, Doug Turnbull
 dturnb...@opensourceconnections.com wrote:
  I have no dog in the svn vs git debate honestly.
 
  I want to say how important it is to keep healthy history. I recently
 went
  on a bit of a code archeology dig to figure out why something in
  Lucene was done the way it was. It was handy that the history went as far
  back as it did, but I had to switch around to different places to
 continue
  the history. For example, the abrupt shift that seems to be around when
  Solr/Lucene were put together had me digging for the last pure Lucene
 tag.
  It's over at lucene/java/branches NOT lucene/dev/tags with the other tags.
 
  Then when you get to the branch for lucene-101, the first commit is:
  2001: New repository initialized by cvs2svn.
 
  Unable to find a CVS repo, my hunt stopped (I'd love to hear if anyone has a
 CVS
  repo -- maybe from Jakarta?)
 
  So removing some jars isn't a big deal. But cutting off history and
  restarting at some arbitrary point can be annoying and make it harder to
 dig
  up more about why things are the way they are.
 
  /steps down from soapbox
  -Doug
 
 
 
  On Sunday, May 31, 2015, Dawid Weiss dawid.we...@cs.put.poznan.pl
 wrote:
 
  Yeah, but it misses the point -- history is history, if there were
  jars in it, you shouldn't just strip them, it'd be confusing.
 
  How was it back when Lucene was merging with Solr? Didn't it just
  initiate with a new clean repo? Maybe not all of the history is really
  needed -- if we limited ourselves to, say, all of the history that
  includes ivy then the size of the repo would drop significantly... but
  again, to me size doesn't really matter at all; one initial clone is
  no-cost. Go make yourself a cup of tea, come back and you're set.
 
  Dawid
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org






Re: Moving to git?

2015-05-31 Thread Doug Turnbull
I have no dog in the svn vs git debate honestly.

I want to say how important it is to keep healthy history. I recently went
on a bit of a code archeology dig to figure out why something in
Lucene was done the way it was. It was handy that the history went as far
back as it did, but I had to switch around to different places to continue
the history. For example, the abrupt shift that seems to be around when
Solr/Lucene were put together had me digging for the last pure Lucene tag.
It's over at lucene/java/branches NOT lucene/dev/tags with the other tags.

Then when you get to the branch for lucene-101, the first commit is:
 2001: New repository initialized by cvs2svn.

Unable to find a CVS repo, my hunt stopped (I'd love to hear if anyone has a
CVS repo -- maybe from Jakarta?)

So removing some jars isn't a big deal. But cutting off history and
restarting at some arbitrary point can be annoying and make it harder to
dig up more about why things are the way they are.

/steps down from soapbox
-Doug



On Sunday, May 31, 2015, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:

 Yeah, but it misses the point -- history is history, if there were
 jars in it, you shouldn't just strip them, it'd be confusing.

 How was it back when Lucene was merging with Solr? Didn't it just
 initiate with a new clean repo? Maybe not all of the history is really
 needed -- if we limited ourselves to, say, all of the history that
 includes ivy then the size of the repo would drop significantly... but
 again, to me size doesn't really matter at all; one initial clone is
 no-cost. Go make yourself a cup of tea, come back and you're set.

 Dawid

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




Re: Where Search Meets Machine Learning

2015-05-04 Thread Doug Turnbull
 we tested this with datasets available
 at the UCI Machine Learning Repository
 http://archive.ics.uci.edu/ml/ but I have been using this approach for
 real-life response prediction/bidding problems in advertising and it's very
 powerful. Of course, this is not a panacea, as there are still some
 issues with the approach, especially on the operational side.  Let's keep
 the conversation going as I think we are on to something useful.

 -- Joaquin


 On Thu, Apr 30, 2015 at 6:26 AM, Doug Turnbull 
 dturnb...@opensourceconnections.com wrote:

 Hi Joaquin

 Very neat, thanks for sharing,

 Viewing search relevance as something akin to a classification problem is
 actually a driving narrative in Taming Search
 http://manning.com/turnbull. We generalize the relevance problem as
 one of measuring the similarity between features of content (locations of
 restaurants, price of a product, the words in the body of articles,
 expanded synonyms in articles, etc) and features of a query (the search
 terms, user usage history, any location, etc). What makes search
 interesting is that unlike other classification systems, search has built
 in similarity systems (largely TF*IDF).

 So we actually cut the other direction from your talk. It appears that
 you amend the search engine to change the underlying scoring to be based on
 machine learning constructs. In our book, we work the opposite way. We
 largely enable feature similarity classifications between document and
 query by massaging features into terms and use the built in TF*IDF or other
 relevant similarity approach.

 We feel this plays to the advantages of a search engine. Search engines
 already have some basic text analysis built in. They've also been heavily
 optimized for most forms of text-based similarity. If you can massage text
 such that your TF*IDF similarity reflects a rough proportion of text-based
 features important to your users, this tends to reflect their intuitive
 notions of relevance. A lot of this work involves feature selection, or what
 we term in the book feature modeling: what features should you introduce to
 your documents to generate good signals at ranking time?

 You can read more about our thoughts here
 http://java.dzone.com/articles/solr-and-elasticsearch.

 That all being said, what makes your stuff interesting is when you have
 enough supervised training data over good-enough features. This can be hard
 to do for a broad swatch of middle tier search applications, but
 increasingly useful as scale goes up. I'd be interested to hear your
 thoughts on this article
 http://opensourceconnections.com/blog/2014/10/08/when-click-scoring-can-hurt-search-relevance-a-roadmap-to-better-signals-processing-in-search/
 I wrote about collecting click tracking and other relevance feedback data:

 Good stuff! Again, thanks for sharing,
 -Doug



 On Wed, Apr 29, 2015 at 6:58 PM, J. Delgado joaquin.delg...@gmail.com
 wrote:

 Here is a presentation on the topic:

 http://www.slideshare.net/joaquindelgado1/where-search-meets-machine-learning04252015final

 Search can be viewed as a combination of a) A problem of constraint
 satisfaction, which is the process of finding a solution to a set of
 constraints (query) that impose conditions that the variables (fields) must
 satisfy with a resulting object (document) being a solution in the feasible
 region (result set), plus b) A scoring/ranking problem of assigning values
 to different alternatives, according to some convenient scale. This
 ultimately provides a mechanism to sort various alternatives in the result
 set in order of importance, value or preference. In particular scoring in
 search has evolved from being a document centric calculation (e.g. TF-IDF)
 proper from its information retrieval roots, to a function that is more
 context sensitive (e.g. include geo-distance ranking) or user centric (e.g.
 takes user parameters for personalization) as well as other factors that
 depend on the domain and task at hand. However, most systems that 
 incorporate machine learning techniques to perform classification or
 generate scores for these specialized tasks do so as a post retrieval
 re-ranking function, outside of search! In this talk I show ways of
 incorporating advanced scoring functions, based on supervised learning and
 bid scaling models, into popular search engines such as Elastic Search and
 potentially SOLR. I'll provide practical examples of how to construct such
 ML Scoring plugins in search to generalize the application of a search
 engine as a model evaluator for supervised learning tasks. This will
 facilitate the building of systems that can do computational advertising,
 recommendations and specialized search systems, applicable to many domains.

 Code to support it (only elastic search for now):
 https://github.com/sdhu/elasticsearch-prediction

 -- J







 --
 *Doug Turnbull **| *Search Relevance Consultant | OpenSource
 Connections, LLC | 240.476.9983 | http

Re: Where Search Meets Machine Learning

2015-04-30 Thread Doug Turnbull
Hi Joaquin

Very neat, thanks for sharing,

Viewing search relevance as something akin to a classification problem is
actually a driving narrative in Taming Search http://manning.com/turnbull.
We generalize the relevance problem as one of measuring the similarity
between features of content (locations of restaurants, price of a product,
the words in the body of articles, expanded synonyms in articles, etc) and
features of a query (the search terms, user usage history, any location,
etc). What makes search interesting is that unlike other classification
systems, search has built in similarity systems (largely TF*IDF).

So we actually cut the other direction from your talk. It appears that you
amend the search engine to change the underlying scoring to be based on
machine learning constructs. In our book, we work the opposite way. We
largely enable feature similarity classifications between document and
query by massaging features into terms and use the built in TF*IDF or other
relevant similarity approach.

We feel this plays to the advantages of a search engine. Search engines
already have some basic text analysis built in. They've also been heavily
optimized for most forms of text-based similarity. If you can massage text
such that your TF*IDF similarity reflects a rough proportion of text-based
features important to your users, this tends to reflect their intuitive
notions of relevance. A lot of this work involves feature selection, or what
we term in the book feature modeling: what features should you introduce to
your documents to generate good signals at ranking time?
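
As a rough illustration of that idea (a toy sketch with invented data, not anything from the book or from Solr): encode each document's features as plain terms, and the engine's ordinary TF*IDF term matching scores feature similarity with no custom ranking code.

```python
import math
from collections import Counter

# Each restaurant's cuisine and location are "massaged" into tokens,
# so feature similarity becomes ordinary term matching.
docs = [
    ["italian", "downtown", "pasta", "wine"],
    ["thai", "downtown", "noodles"],
    ["italian", "suburb", "pizza"],
]

def idf(term):
    df = sum(term in d for d in docs)          # document frequency
    return math.log(len(docs) / (1 + df)) + 1  # smoothed IDF

def score(query_terms, doc):
    tf = Counter(doc)                          # term frequencies in this doc
    return sum(tf[t] * idf(t) for t in query_terms)

query = ["italian", "downtown"]                # query features, also as terms
ranked = sorted(range(len(docs)),
                key=lambda i: score(query, docs[i]), reverse=True)
print(ranked)  # doc 0 matches both features, so it ranks first
```

A real engine would use a tuned similarity (BM25, length normalization, boosts), but the point stands: once features are terms, the built-in scorer does the classification-like work.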

You can read more about our thoughts here
http://java.dzone.com/articles/solr-and-elasticsearch.

That all being said, what makes your stuff interesting is when you have
enough supervised training data over good-enough features. This can be hard
to do for a broad swatch of middle tier search applications, but
increasingly useful as scale goes up. I'd be interested to hear your
thoughts on this article
http://opensourceconnections.com/blog/2014/10/08/when-click-scoring-can-hurt-search-relevance-a-roadmap-to-better-signals-processing-in-search/
I wrote about collecting click tracking and other relevance feedback data:

Good stuff! Again, thanks for sharing,
-Doug



On Wed, Apr 29, 2015 at 6:58 PM, J. Delgado joaquin.delg...@gmail.com
wrote:

 Here is a presentation on the topic:

 http://www.slideshare.net/joaquindelgado1/where-search-meets-machine-learning04252015final

 Search can be viewed as a combination of a) A problem of constraint
 satisfaction, which is the process of finding a solution to a set of
 constraints (query) that impose conditions that the variables (fields) must
 satisfy with a resulting object (document) being a solution in the feasible
 region (result set), plus b) A scoring/ranking problem of assigning values
 to different alternatives, according to some convenient scale. This
 ultimately provides a mechanism to sort various alternatives in the result
 set in order of importance, value or preference. In particular scoring in
 search has evolved from being a document centric calculation (e.g. TF-IDF)
 proper from its information retrieval roots, to a function that is more
 context sensitive (e.g. include geo-distance ranking) or user centric (e.g.
 takes user parameters for personalization) as well as other factors that
 depend on the domain and task at hand. However, most systems that
 incorporate machine learning techniques to perform classification or
 generate scores for these specialized tasks do so as a post retrieval
 re-ranking function, outside of search! In this talk I show ways of
 incorporating advanced scoring functions, based on supervised learning and
 bid scaling models, into popular search engines such as Elastic Search and
 potentially SOLR. I'll provide practical examples of how to construct such
 ML Scoring plugins in search to generalize the application of a search
 engine as a model evaluator for supervised learning tasks. This will
 facilitate the building of systems that can do computational advertising,
 recommendations and specialized search systems, applicable to many domains.

 Code to support it (only elastic search for now):
 https://github.com/sdhu/elasticsearch-prediction

 -- J
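
The two-phase view described above can be compressed into a few lines (invented toy data; a hand-written linear score stands in for a trained model): constraint satisfaction selects the feasible set, then a pluggable scoring function ranks it.

```python
# Toy corpus of "documents" with typed fields.
restaurants = [
    {"name": "A", "cuisine": "thai", "dist_km": 1.0, "rating": 4.5},
    {"name": "B", "cuisine": "thai", "dist_km": 6.0, "rating": 4.9},
    {"name": "C", "cuisine": "french", "dist_km": 0.5, "rating": 4.8},
]

def feasible(doc):
    # the query as constraints: thai food within 5 km
    return doc["cuisine"] == "thai" and doc["dist_km"] <= 5.0

def score(doc):
    # stand-in for a learned scorer: rating minus a distance penalty
    return doc["rating"] - 0.2 * doc["dist_km"]

results = sorted((d for d in restaurants if feasible(d)), key=score, reverse=True)
print([d["name"] for d in results])  # only "A" satisfies both constraints
```

Swapping `score` for a model evaluator is exactly the plugin point the talk proposes: the retrieval phase stays the engine's job, the ranking phase becomes pluggable.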









[jira] [Commented] (SOLR-5800) Admin UI - Analysis form doesn't render results correctly when a CharFilter is used.

2014-03-11 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931319#comment-13931319
 ] 

Doug Turnbull commented on SOLR-5800:
-

Thanks for the patch Stefan. Will this be released in a Solr 4.7.1? This is a 
fairly major issue for folks that depend on the analysis UI.

 Admin UI - Analysis form doesn't render results correctly when a CharFilter 
 is used.
 

 Key: SOLR-5800
 URL: https://issues.apache.org/jira/browse/SOLR-5800
 Project: Solr
  Issue Type: Bug
  Components: web gui
Affects Versions: 4.7
Reporter: Timothy Potter
Assignee: Stefan Matheis (steffkes)
Priority: Minor
 Fix For: 4.8, 5.0

 Attachments: SOLR-5800-sample.json, SOLR-5800.patch


 I have an example in Solr In Action that uses the
 PatternReplaceCharFilterFactory and now it doesn't work in 4.7.0.
 Specifically, the fieldType is:
 <fieldType name="text_microblog" class="solr.TextField"
            positionIncrementGap="100">
   <analyzer>
     <charFilter class="solr.PatternReplaceCharFilterFactory"
                 pattern="([a-zA-Z])\1+"
                 replacement="$1$1"/>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1"
             splitOnCaseChange="0"
             splitOnNumerics="0"
             stemEnglishPossessive="1"
             preserveOriginal="0"
             catenateWords="1"
             generateNumberParts="1"
             catenateNumbers="0"
             catenateAll="0"
             types="wdfftypes.txt"/>
     <filter class="solr.StopFilterFactory"
             ignoreCase="true"
             words="lang/stopwords_en.txt"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <filter class="solr.KStemFilterFactory"/>
   </analyzer>
 </fieldType>
 The PatternReplaceCharFilterFactory (PRCF) is used to collapse
 repeated letters in a term down to a max of 2, such as #Yummm would
 be #Yumm
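
The charFilter's pattern/replacement pair behaves the same under Python's `re` module (with `\1\1` in place of Java's `$1$1`), which makes the collapsing rule easy to check outside Solr:

```python
import re

# Any letter followed by one or more repeats collapses to exactly two;
# letters that already appear once or twice are left alone.
def collapse(text):
    return re.sub(r"([a-zA-Z])\1+", r"\1\1", text)

print(collapse("#Yummmm"))      # -> #Yumm
print(collapse("soooo happy"))  # -> soo happy
print(collapse("Grecco"))       # doubled letters stay doubled
```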
 When I run some text through this analyzer using the Analysis form,
 the output is as if the resulting text is unavailable to the
 tokenizer. In other words, the only results being displayed in the
 output on the form is for the PRCF
 This example stopped working in 4.7.0 and I've verified it worked
 correctly in 4.6.1.
 Initially, I thought this might be an issue with the actual analysis,
 but the analyzer actually works when indexing / querying. Then,
 looking at the JSON response in the Developer console with Chrome, I
 see the JSON that comes back includes output for all the components in
 my chain (see below) ... so looks like a UI rendering issue to me?
 {responseHeader:{status:0,QTime:24},analysis:{field_types:{text_microblog:{index:[org.apache.lucene.analysis.pattern.PatternReplaceCharFilter,#Yumm
 :) Drinking a latte at Caffe Grecco in SF's historic North Beach...
 Learning text analysis with #SolrInAction by @ManningBooks on my i-Pad
 foo5,org.apache.lucene.analysis.core.WhitespaceTokenizer,[{text:#Yumm,raw_bytes:[23
 59 75 6d 
 6d],start:0,end:6,position:1,positionHistory:[1],type:word},{text::),raw_bytes:[3a
 29],start:7,end:9,position:2,positionHistory:[2],type:word},{text:Drinking,raw_bytes:[44
 72 69 6e 6b 69 6e
 67],start:10,end:18,position:3,positionHistory:[3],type:word},{text:a,raw_bytes:[61],start:19,end:20,position:4,positionHistory:[4],type:word},{text:latte,raw_bytes:[6c
  ...
 the JSON returned to the browser has evidence that the full analysis chain 
 was applied, so this seems to just be a rendering issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4812) Edismax highlighting query doesn't work.

2013-09-19 Thread Doug Turnbull (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772151#comment-13772151
 ] 

Doug Turnbull commented on SOLR-4812:
-

+1 I've also been able to recreate this.

 Edismax highlighting query doesn't work.
 

 Key: SOLR-4812
 URL: https://issues.apache.org/jira/browse/SOLR-4812
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.2, 4.3
  Environment: When hl.q is an edismax query, highlighting will ignore 
  the query specified in hl.q
Reporter: Nguyen Manh Tien
Priority: Minor
 Fix For: 4.5, 5.0

 Attachments: SOLR-4812.patch


 When hl.q is an edismax query, highlighting will ignore the query specified in 
 hl.q
 edismax highlighting query: hl.q={!edismax qf=title v=Software}
 The getHighlightQuery function in edismax doesn't parse the highlight query, so 
 it always returns null and hl.q is ignored.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-5256) Send multiple queries through highlighter

2013-09-19 Thread Doug Turnbull (JIRA)
Doug Turnbull created SOLR-5256:
---

 Summary: Send multiple queries through highlighter
 Key: SOLR-5256
 URL: https://issues.apache.org/jira/browse/SOLR-5256
 Project: Solr
  Issue Type: New Feature
  Components: highlighter
Affects Versions: 4.4
Reporter: Doug Turnbull
 Attachments: Solr-5256.patch

There have been several times when I wished I could specify multiple queries through 
the highlighter. For example, a search over books may have an option to filter 
by author. If I wanted to highlight both the primary search terms and the 
author match, I'd have to construct an hl.q that created the desired highlight 
query.

This is complicated by the fact that q might be dismax/edismax while the fq is 
likely going to be a lucene query. It might be rather complex to construct a 
single query that reflects the combination of dismax over many fields plus a 
specific lucene query.

What I would prefer to do is be able to specify additional queries (hl.addlq) 
to the highlighter. The highlighter then highlights the results of those 
queries as well. 

(Unfortunately, while this is useful, it's limited somewhat by this bug:
https://issues.apache.org/jira/browse/SOLR-4812#comment-13772151)
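
A sketch of the behavior the proposed hl.addlq parameter would enable (plain Python, nothing Solr-specific): highlight the union of terms from several queries rather than hand-building one combined hl.q.

```python
import re

# Mark every word that appears in ANY of the supplied queries.
def highlight(text, queries, pre="<em>", post="</em>"):
    terms = {t.lower() for q in queries for t in q.split()}
    def mark(m):
        word = m.group(0)
        return pre + word + post if word.lower() in terms else word
    return re.sub(r"\w+", mark, text)

out = highlight("Relevant Search by Doug", ["relevant search", "doug"])
print(out)  # <em>Relevant</em> <em>Search</em> by <em>Doug</em>
```

The real feature would of course run each additional query through the proper query parser (dismax, lucene, etc.) rather than splitting on whitespace; this only illustrates the "union of highlight queries" idea.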

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-5256) Send multiple queries through highlighter

2013-09-19 Thread Doug Turnbull (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Turnbull updated SOLR-5256:


Attachment: Solr-5256.patch

Patch to add hl.addlq

 Send multiple queries through highlighter
 -

 Key: SOLR-5256
 URL: https://issues.apache.org/jira/browse/SOLR-5256
 Project: Solr
  Issue Type: New Feature
  Components: highlighter
Affects Versions: 4.4
Reporter: Doug Turnbull
 Attachments: Solr-5256.patch


 There have been several times when I wished I could specify multiple queries 
 through the highlighter. For example, a search over books may have an option 
 to filter by author. If I wanted to highlight both the primary search terms 
 and the author match, I'd have to construct an hl.q that created the desired 
 highlight query.
 This is complicated by the fact that q might be dismax/edismax while the fq 
 is likely going to be a lucene query. It might be rather complex to construct 
 a single query that reflects the combination of dismax over many fields plus 
 a specific lucene query.
 What I would prefer to do is be able to specify additional queries (hl.addlq) 
 to the highlighter. The highlighter then highlights the results of those 
 queries as well. 
 (Unfortunately, while this is useful, it's limited somewhat by this bug:
 https://issues.apache.org/jira/browse/SOLR-4812#comment-13772151)




[jira] [Updated] (SOLR-5256) Send multiple queries through highlighter

2013-09-19 Thread Doug Turnbull (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Turnbull updated SOLR-5256:


Description: 
There have been several times when I wished I could specify multiple queries through 
the highlighter. For example, a search over books may have an option to filter 
by author. If I wanted to highlight both the primary search terms and the 
author match, I'd have to construct an hl.q that created the desired highlight 
query.

This is complicated by the fact that q might be dismax/edismax while the fq is 
likely going to be a Lucene query. It might be rather complex to construct a 
single query that reflects the combination of dismax over many fields plus a 
specific Lucene query.

What I would prefer to do is be able to specify additional queries (hl.addlq) 
to the highlighter. The highlighter then highlights the results of those 
queries as well. 

(Unfortunately, while this is useful, it's limited somewhat by this bug:
https://issues.apache.org/jira/browse/SOLR-4812#comment-13772151)

  was:
There have been several times when I wished I could specify multiple queries through 
the highlighter. For example, a search over books may have an option to filter 
my author. If I wanted to highlight both the primary search terms and the 
author match, I'd have to construct an hl.q that created the desired highlight 
query.

This is complicated by the fact that q might be dismax/edismax while the fq is 
likely going to be a Lucene query. It might be rather complex to construct a 
single query that reflects the combination of dismax over many fields plus a 
specific Lucene query.

What I would prefer to do is be able to specify additional queries (hl.addlq) 
to the highlighter. The highlighter then highlights the results of those 
queries as well. 

(Unfortunately, while this is useful, it's limited somewhat by this bug:
https://issues.apache.org/jira/browse/SOLR-4812#comment-13772151)
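Under the proposed change, the extra query would instead travel as its own parameter. A sketch of what a request using hl.addlq might look like (hl.addlq comes from the attached patch and is not a parameter in released Solr; field names and query text are again illustrative):

```python
# With the proposed hl.addlq, the main query stays pure edismax and the
# additional lucene-syntax query is handed to the highlighter on its own.
# hl.addlq is the parameter added by the attached patch, not a stock param.
from urllib.parse import urlencode

params = [
    ("q", "moby dick"),
    ("defType", "edismax"),
    ("qf", "title^2 body"),
    ("fq", "author:melville"),
    ("hl", "true"),
    ("hl.fl", "title,body,author"),
    ("hl.addlq", "author:melville"),  # extra highlight query, passed as-is
]

query_string = urlencode(params)
```

No hand-built hl.q is needed: each query keeps its own syntax, which is the point of the feature request.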


