Re: newSearcher autowarming queries in solrconfig.xml run but do not appear to warm cache

2016-10-19 Thread Dalton Gooding
Erick,
Thanks very much for your help so far with this one. I have captured the logs 
from a commit, which show the commit and a new searcher starting.
A few ERRORs appear amongst the logs, along with a few uninverting lines. 
The query is a very basic one, as shown below:
     DataType_s:Product
  WebSections_ms:house
  VisibleOnline_ms:NAT
  SS_Stage_ms:Live
  0
  20
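For reference, the listener that sends this query is normally declared in solrconfig.xml along these lines (the listener/arr/lst element names are the standard QuerySenderListener syntax; which of the four clauses was q versus fq is an assumption, since the original markup was stripped in transit):

```xml
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">DataType_s:Product</str>
      <str name="fq">WebSections_ms:house</str>
      <str name="fq">VisibleOnline_ms:NAT</str>
      <str name="fq">SS_Stage_ms:Live</str>
      <str name="start">0</str>
      <str name="rows">20</str>
    </lst>
  </arr>
</listener>
```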
    


INFO  (qtp755840090-55) [   x:core1] o.a.s.c.SolrCore 
SolrDeletionPolicy.onCommit: commits: num=2
    
commit{dir=NRTCachingDirectory(MMapDirectory@/var/solr/data/core1/data/index.20131212092900012
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@8815140; 
maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_9kgw,generation=446432}
    
commit{dir=NRTCachingDirectory(MMapDirectory@/var/solr/data/core1/data/index.20131212092900012
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@8815140; 
maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_9kgx,generation=446433}
INFO  (qtp755840090-55) [   x:core1] o.a.s.c.SolrCore newest commit generation 
= 446433
INFO  (qtp755840090-55) [   x:core1] o.a.s.s.SolrIndexSearcher Opening 
Searcher@63f1fac[core1] main
INFO  (searcherExecutor-7-thread-1-processing-x:core1) [   x:core1] 
o.a.s.c.SolrCore QuerySenderListener sending requests to 
Searcher@63f1fac[core1] 
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_a3p4(5.3.1):C313761/58754:delGen=2073)
 Uninverting(_a7bs(5.3.1):c22601/9516:delGen=608) 
Uninverting(_a7np(5.3.1):C37794/14987:delGen=463) 
Uninverting(_aa7h(5.3.1):c13906/163:delGen=57) 
Uninverting(_a7u7(5.3.1):c5968/504:delGen=89) 
Uninverting(_a7yt(5.3.1):c2643/426:delGen=39) 
Uninverting(_aafw(5.3.1):c313/27:delGen=2) 
Uninverting(_aajf(5.3.1):c355/14:delGen=3) 
Uninverting(_aaqa(5.3.1):c195/1:delGen=1) 
Uninverting(_aapg(5.3.1):c279/3:delGen=3) 
Uninverting(_aahr(5.3.1):c262/5:delGen=1) 
Uninverting(_aafa(5.3.1):c265/2:delGen=1) 
Uninverting(_aap3(5.3.1):c252/2:delGen=2) Uninverting(_aaqb(5.3.1):C1) 
Uninverting(_aaqd(5.3.1):C1) Uninverting(_aaqh(5.3.1):C1) 
Uninverting(_aaqj(5.3.1):C1) Uninverting(_aaqm(5.3.1):C2/1:delGen=1) 
Uninverting(_aaqo(5.3.1):C1) Uninverting(_aaqq(5.3.1):C1) 
Uninverting(_aaqs(5.3.1):C1)))}
ERROR (searcherExecutor-7-thread-1-processing-x:core1) [   x:core1] 
o.a.s.c.SolrCore Previous SolrRequestInfo was not closed!  req=wt=json
ERROR (searcherExecutor-7-thread-1-processing-x:core1) [   x:core1] 
o.a.s.c.SolrCore prev == info : false
INFO  (searcherExecutor-7-thread-1-processing-x:core1) [   x:core1] 
o.a.s.c.S.Request [core1] webapp=null path=null 
params={start=0=newSearcher=SS_Stage_ms:Live=false=DataType_s:Product=WebSections_ms:house=VisibleOnline_ms:NAT=20}
 hits=2541 status=0 QTime=18
INFO  (searcherExecutor-7-thread-1-processing-x:core1) [   x:core1] 
o.a.s.c.SolrCore QuerySenderListener done.
INFO  (searcherExecutor-7-thread-1-processing-x:core1) [   x:core1] 
o.a.s.c.SolrCore [core1] Registered new searcher Searcher@63f1fac[core1] 
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_a3p4(5.3.1):C313761/58754:delGen=2073)
 Uninverting(_a7bs(5.3.1):c22601/9516:delGen=608) 
Uninverting(_a7np(5.3.1):C37794/14987:delGen=463) 
Uninverting(_aa7h(5.3.1):c13906/163:delGen=57) 
Uninverting(_a7u7(5.3.1):c5968/504:delGen=89) 
Uninverting(_a7yt(5.3.1):c2643/426:delGen=39) 
Uninverting(_aafw(5.3.1):c313/27:delGen=2) 
Uninverting(_aajf(5.3.1):c355/14:delGen=3) 
Uninverting(_aaqa(5.3.1):c195/1:delGen=1) 
Uninverting(_aapg(5.3.1):c279/3:delGen=3) 
Uninverting(_aahr(5.3.1):c262/5:delGen=1) 
Uninverting(_aafa(5.3.1):c265/2:delGen=1) 
Uninverting(_aap3(5.3.1):c252/2:delGen=2) Uninverting(_aaqb(5.3.1):C1) 
Uninverting(_aaqd(5.3.1):C1) Uninverting(_aaqh(5.3.1):C1) 
Uninverting(_aaqj(5.3.1):C1) Uninverting(_aaqm(5.3.1):C2/1:delGen=1) 
Uninverting(_aaqo(5.3.1):C1) Uninverting(_aaqq(5.3.1):C1) 
Uninverting(_aaqs(5.3.1):C1)))}
INFO  (qtp755840090-55) [   x:core1] o.a.s.u.UpdateHandler end_commit_flush


On Tuesday, 18 October 2016, 0:49, Erick Erickson  
wrote:
 

 Wow, wouldn't it be useful if the name of the field was dumped in the
message ;)...

You should see a query happen just before that message in the log
file. It won't quite be in the same format as a URL, but it's
reasonably easy to figure out.

Uninverting happens as a result of
> sorting
> faceting
> grouping
> ???

So the crude approach would be to find the query (or queries) that precede
this, then break it apart, submitting one of the above operations from the
query in question at a time while tailing the logs.

And by "after the warming queries", I'm assuming that the searcher has
successfully opened (there'll be a message in the log).

BTW, DocValues fields are strongly recommended for any field that gets
uninverted.
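As a concrete illustration of that recommendation, a docValues-enabled field definition looks something like this in the schema (the field name is taken from the warming query above; the type and other attributes are assumptions):

```xml
<field name="WebSections_ms" type="string" indexed="true" stored="true"
       multiValued="true" docValues="true"/>
```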

Best,
Erick

On Sun, Oct 16, 2016 at 10:28 PM, Dalton Gooding
 wrote:
> Erick,
>
> I think you might have nailed it.
>
> After 

Re: Result Grouping vs. Collapsing Query Parser -- Can one be deprecated?

2016-10-19 Thread Joel Bernstein
Also as you consider using collapse you'll want to keep in mind the feature
compromises that were made to achieve the higher performance:

1) Collapse does not directly support faceting. It simply collapses the
results and the faceting components compute facets on the collapsed result
set. Grouping has direct support for faceting which can be slow, but it
has options other than just computing facets on the collapsed result set.

2) Originally collapse only supported selecting group heads with min/max
value of a numeric field. It did not support using the sort parameter for
selecting the group head. Recently the sort parameter was added to
collapse, but this likely is not nearly as fast as using the min/max for
selecting group heads.
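For anyone comparing the two, the request syntax looks roughly like this (the field and sort names are hypothetical):

```
# Result Grouping
q=*:*&group=true&group.field=groupId&group.limit=1

# Collapsing Query Parser: min/max group head selection
q=*:*&fq={!collapse field=groupId max=price_f}

# Collapsing Query Parser: the newer sort-based group head selection
q=*:*&fq={!collapse field=groupId sort='price_f asc'}
```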



Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Oct 19, 2016 at 7:20 PM, Joel Bernstein  wrote:

> Originally collapsing was designed with a very small feature set and one
> goal in mind: High performance collapsing on high cardinality fields. To
> avoid having to compromise on that goal, it was developed as a separate
> feature.
>
> The trick in combining grouping and collapsing into one feature, is to do
> it in a way that does not hurt the original performance goal of collapse.
> Otherwise we'll be back to just having slow grouping.
>
> Perhaps the new APIs that are being worked on could have a facade over
> grouping and collapsing so they would share the same API.
>
>
>
>
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Oct 19, 2016 at 6:51 PM, Mike Lissner wrote:
>
>> Hi all,
>>
>> I've had a rotten day today because of Solr. I want to share my experience
>> and perhaps see if we can do something to fix this particular situation in
>> the future.
>>
>> Solr currently has two ways to get grouped results (so far!). You can
>> either use Result Grouping or you can use the Collapsing Query Parser.
>> Result grouping seems like the obvious way to go. It's well documented,
>> the
>> parameters are clear, it doesn't use a bunch of weird syntax (ie,
>> {!collapse blah=foo}), and it uses the feature name from SQL (so it comes
>> up in Google).
>>
>> OTOH, if you use faceting with result grouping, which I imagine many
>> people
>> do, you get terrible performance. In our case it went from subsecond to
>> 10-120 seconds for big queries. Insanely bad.
>>
>> Collapsing Query Parser looks like a good way forward for us, and we'll be
>> investigating that, but it uses the Expand component that our library
>> doesn't support, to say nothing of the truly bizarre syntax. So this will
>> be a fair amount of effort to switch.
>>
>> I'm curious if there is anything we can do to clean up this situation.
>> What
>> I'd really like to do is:
>>
>> 1. Put a HUGE warning on the Result Grouping docs directing people away
>> from the feature if they plan to use faceting (or perhaps directing them
>> away no matter what?)
>>
>> 2. Work towards eliminating one or the other of these features. They're
>> nearly completely compatible, except for their syntax and performance. The
>> collapsing query parser apparently was only written because the result
>> grouping had such bad performance -- In other words, it doesn't exist to
>> provide unique features, it exists to be faster than the old way. Maybe we
>> can get rid of one or the other of these, taking the best parts from each
>> (syntax from Result Grouping, and performance from Collapse Query Parser)?
>>
>> Thanks,
>>
>> Mike
>>
>> PS -- For some extra context, I want to share some other reasons this is
>> frustrating:
>>
>> 1. I just spent a week upgrading a third-party library so it would support
>> grouped results, and another week implementing the feature in our code
>> with
>> tests and everything. That was a waste.
>> 2. It's hard to notice performance issues until after you deploy to a big
>> data environment. This creates a bad situation for users until you detect
>> it and revert the new features.
>> 3. The documentation *could* say something about the fact that a new
>> feature was developed to provide better performance for grouping. It could
>> say that using facets with groups is an anti-feature. It says neither.
>>
>> I only mention these because, like others, I've had a real rough time with
>> solr (again), and these are the kinds of seemingly small things that could
>> have made all the difference.
>>
>
>


Re: Result Grouping vs. Collapsing Query Parser -- Can one be deprecated?

2016-10-19 Thread Joel Bernstein
Originally collapsing was designed with a very small feature set and one
goal in mind: High performance collapsing on high cardinality fields. To
avoid having to compromise on that goal, it was developed as a separate
feature.

The trick in combining grouping and collapsing into one feature, is to do
it in a way that does not hurt the original performance goal of collapse.
Otherwise we'll be back to just having slow grouping.

Perhaps the new APIs that are being worked on could have a facade over
grouping and collapsing so they would share the same API.







Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Oct 19, 2016 at 6:51 PM, Mike Lissner <
mliss...@michaeljaylissner.com> wrote:

> Hi all,
>
> I've had a rotten day today because of Solr. I want to share my experience
> and perhaps see if we can do something to fix this particular situation in
> the future.
>
> Solr currently has two ways to get grouped results (so far!). You can
> either use Result Grouping or you can use the Collapsing Query Parser.
> Result grouping seems like the obvious way to go. It's well documented, the
> parameters are clear, it doesn't use a bunch of weird syntax (ie,
> {!collapse blah=foo}), and it uses the feature name from SQL (so it comes
> up in Google).
>
> OTOH, if you use faceting with result grouping, which I imagine many people
> do, you get terrible performance. In our case it went from subsecond to
> 10-120 seconds for big queries. Insanely bad.
>
> Collapsing Query Parser looks like a good way forward for us, and we'll be
> investigating that, but it uses the Expand component that our library
> doesn't support, to say nothing of the truly bizarre syntax. So this will
> be a fair amount of effort to switch.
>
> I'm curious if there is anything we can do to clean up this situation. What
> I'd really like to do is:
>
> 1. Put a HUGE warning on the Result Grouping docs directing people away
> from the feature if they plan to use faceting (or perhaps directing them
> away no matter what?)
>
> 2. Work towards eliminating one or the other of these features. They're
> nearly completely compatible, except for their syntax and performance. The
> collapsing query parser apparently was only written because the result
> grouping had such bad performance -- In other words, it doesn't exist to
> provide unique features, it exists to be faster than the old way. Maybe we
> can get rid of one or the other of these, taking the best parts from each
> (syntax from Result Grouping, and performance from Collapse Query Parser)?
>
> Thanks,
>
> Mike
>
> PS -- For some extra context, I want to share some other reasons this is
> frustrating:
>
> 1. I just spent a week upgrading a third-party library so it would support
> grouped results, and another week implementing the feature in our code with
> tests and everything. That was a waste.
> 2. It's hard to notice performance issues until after you deploy to a big
> data environment. This creates a bad situation for users until you detect
> it and revert the new features.
> 3. The documentation *could* say something about the fact that a new
> feature was developed to provide better performance for grouping. It could
> say that using facets with groups is an anti-feature. It says neither.
>
> I only mention these because, like others, I've had a real rough time with
> solr (again), and these are the kinds of seemingly small things that could
> have made all the difference.
>


Re: Result Grouping vs. Collapsing Query Parser -- Can one be deprecated?

2016-10-19 Thread John Bickerstaff
Thank you for posting that.  I'll be saving it in my "important painful
lessons learned by others" mail folder.

On Oct 19, 2016 4:51 PM, "Mike Lissner" 
wrote:

> Hi all,
>
> I've had a rotten day today because of Solr. I want to share my experience
> and perhaps see if we can do something to fix this particular situation in
> the future.
>
> Solr currently has two ways to get grouped results (so far!). You can
> either use Result Grouping or you can use the Collapsing Query Parser.
> Result grouping seems like the obvious way to go. It's well documented, the
> parameters are clear, it doesn't use a bunch of weird syntax (ie,
> {!collapse blah=foo}), and it uses the feature name from SQL (so it comes
> up in Google).
>
> OTOH, if you use faceting with result grouping, which I imagine many people
> do, you get terrible performance. In our case it went from subsecond to
> 10-120 seconds for big queries. Insanely bad.
>
> Collapsing Query Parser looks like a good way forward for us, and we'll be
> investigating that, but it uses the Expand component that our library
> doesn't support, to say nothing of the truly bizarre syntax. So this will
> be a fair amount of effort to switch.
>
> I'm curious if there is anything we can do to clean up this situation. What
> I'd really like to do is:
>
> 1. Put a HUGE warning on the Result Grouping docs directing people away
> from the feature if they plan to use faceting (or perhaps directing them
> away no matter what?)
>
> 2. Work towards eliminating one or the other of these features. They're
> nearly completely compatible, except for their syntax and performance. The
> collapsing query parser apparently was only written because the result
> grouping had such bad performance -- In other words, it doesn't exist to
> provide unique features, it exists to be faster than the old way. Maybe we
> can get rid of one or the other of these, taking the best parts from each
> (syntax from Result Grouping, and performance from Collapse Query Parser)?
>
> Thanks,
>
> Mike
>
> PS -- For some extra context, I want to share some other reasons this is
> frustrating:
>
> 1. I just spent a week upgrading a third-party library so it would support
> grouped results, and another week implementing the feature in our code with
> tests and everything. That was a waste.
> 2. It's hard to notice performance issues until after you deploy to a big
> data environment. This creates a bad situation for users until you detect
> it and revert the new features.
> 3. The documentation *could* say something about the fact that a new
> feature was developed to provide better performance for grouping. It could
> say that using facets with groups is an anti-feature. It says neither.
>
> I only mention these because, like others, I've had a real rough time with
> solr (again), and these are the kinds of seemingly small things that could
> have made all the difference.
>


Result Grouping vs. Collapsing Query Parser -- Can one be deprecated?

2016-10-19 Thread Mike Lissner
Hi all,

I've had a rotten day today because of Solr. I want to share my experience
and perhaps see if we can do something to fix this particular situation in
the future.

Solr currently has two ways to get grouped results (so far!). You can
either use Result Grouping or you can use the Collapsing Query Parser.
Result grouping seems like the obvious way to go. It's well documented, the
parameters are clear, it doesn't use a bunch of weird syntax (ie,
{!collapse blah=foo}), and it uses the feature name from SQL (so it comes
up in Google).

OTOH, if you use faceting with result grouping, which I imagine many people
do, you get terrible performance. In our case it went from subsecond to
10-120 seconds for big queries. Insanely bad.

Collapsing Query Parser looks like a good way forward for us, and we'll be
investigating that, but it uses the Expand component that our library
doesn't support, to say nothing of the truly bizarre syntax. So this will
be a fair amount of effort to switch.

I'm curious if there is anything we can do to clean up this situation. What
I'd really like to do is:

1. Put a HUGE warning on the Result Grouping docs directing people away
from the feature if they plan to use faceting (or perhaps directing them
away no matter what?)

2. Work towards eliminating one or the other of these features. They're
nearly completely compatible, except for their syntax and performance. The
collapsing query parser apparently was only written because the result
grouping had such bad performance -- In other words, it doesn't exist to
provide unique features, it exists to be faster than the old way. Maybe we
can get rid of one or the other of these, taking the best parts from each
(syntax from Result Grouping, and performance from Collapse Query Parser)?

Thanks,

Mike

PS -- For some extra context, I want to share some other reasons this is
frustrating:

1. I just spent a week upgrading a third-party library so it would support
grouped results, and another week implementing the feature in our code with
tests and everything. That was a waste.
2. It's hard to notice performance issues until after you deploy to a big
data environment. This creates a bad situation for users until you detect
it and revert the new features.
3. The documentation *could* say something about the fact that a new
feature was developed to provide better performance for grouping. It could
say that using facets with groups is an anti-feature. It says neither.

I only mention these because, like others, I've had a real rough time with
solr (again), and these are the kinds of seemingly small things that could
have made all the difference.


Re: Public/Private data in Solr :: Metadata or ?

2016-10-19 Thread Hrishikesh Gadre
As part of Cloudera Search, we have integrated with Apache Sentry for
document level authorization. Currently we are using custom search
component to implement filtering. Please refer to this blog post for
details,
http://blog.cloudera.com/blog/2014/07/new-in-cdh-5-1-document-level-security-for-cloudera-search/

I am currently working on a Sentry-based plugin implementation which can be
hooked into the Solr authorization framework. Currently the Solr authorization
framework doesn't implement document level security. I filed SOLR-9578 to add
the relevant doc level security support in Solr.

The main drawback of custom search component based mechanism is that it
requires a special solrconfig.xml file (which is using these custom search
components). On the other hand, once Solr provides hooks to implement doc
level security as part of authorization framework, then this restriction
will go away.

If you have any ideas (or concerns) with this feature, please feel free to
comment on the jira.

Thanks
Hrishikesh

On Wed, Oct 19, 2016 at 7:48 AM, Shawn Heisey  wrote:

> On 10/18/2016 3:00 PM, John Bickerstaff wrote:
> > How (or is it even wise) to "segregate data" in Solr so that some data
> > can be seen by some users and some data not be seen?
>
> IMHO, security like this isn't really Solr's job ... but with the right
> data in the index, the system that DOES handle the security can include
> a filter with each user's query to restrict them to only the data they
> are allowed to see.  There are many ways to put data in the index for
> efficient use by a filter.  The simplest would be a boolean field with a
> name like isPublic or isPrivate, where true and false are mapped as
> necessary to public and private.
>
> Naturally, the users must not be able to reach Solr directly ... they
> must be restricted to the software that connects to Solr on their
> behalf.  Blocking end users from direct network access to Solr is a good
> idea even if there are no other security needs.
>
> There are more comprehensive solutions available, as you will notice
> from other replies, but the idea of simple filtering, controlled by your
> application, should work.
>
> Thanks,
> Shawn
>
>
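A minimal sketch of that application-side filtering, assuming the isPublic boolean field suggested above (the function and flag names are illustrative):

```python
def build_search_params(user_query, user_is_privileged):
    """Build Solr request parameters; the application, not Solr,
    decides whether to restrict the user to public documents."""
    params = {"q": user_query, "rows": 20}
    if not user_is_privileged:
        # isPublic is the simple boolean flag described above
        params["fq"] = "isPublic:true"
    return params

# A non-privileged user's query gets the visibility filter appended
print(build_search_params("title:solr", False))
# → {'q': 'title:solr', 'rows': 20, 'fq': 'isPublic:true'}
```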


Re: Public/Private data in Solr :: Metadata or ?

2016-10-19 Thread John Bickerstaff
Thanks Erick - also very helpful.

On Wed, Oct 19, 2016 at 1:24 PM, Erick Erickson 
wrote:

> And for hairy ACL processing, consider a post-filter. It's custom code
> that only evaluates a document _after_ it has made it through the
> primary query and any "lower cost" filters. See:
> http://yonik.com/advanced-filter-caching-in-solr/.
>
> NOTE: this isn't the thing I would do first; it's much more efficient
> to implement some of the suggestions above. Any time you can trade
> query-time work for index-time work, it's almost always better to do
> the work up-front at index time than during queries.
>
> Best,
> Erick
>
> On Wed, Oct 19, 2016 at 12:07 PM, John Bickerstaff
>  wrote:
> > Thank you both!  Very helpful.
> >
> > On Wed, Oct 19, 2016 at 8:48 AM, Shawn Heisey 
> wrote:
> >
> >> On 10/18/2016 3:00 PM, John Bickerstaff wrote:
> >> > How (or is it even wise) to "segregate data" in Solr so that some data
> >> > can be seen by some users and some data not be seen?
> >>
> >> IMHO, security like this isn't really Solr's job ... but with the right
> >> data in the index, the system that DOES handle the security can include
> >> a filter with each user's query to restrict them to only the data they
> >> are allowed to see.  There are many ways to put data in the index for
> >> efficient use by a filter.  The simplest would be a boolean field with a
> >> name like isPublic or isPrivate, where true and false are mapped as
> >> necessary to public and private.
> >>
> >> Naturally, the users must not be able to reach Solr directly ... they
> >> must be restricted to the software that connects to Solr on their
> >> behalf.  Blocking end users from direct network access to Solr is a good
> >> idea even if there are no other security needs.
> >>
> >> There are more comprehensive solutions available, as you will notice
> >> from other replies, but the idea of simple filtering, controlled by your
> >> application, should work.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>


Re: Public/Private data in Solr :: Metadata or ?

2016-10-19 Thread Erick Erickson
And for hairy ACL processing, consider a post-filter. It's custom code
that only evaluates a document _after_ it has made it through the
primary query and any "lower cost" filters. See:
http://yonik.com/advanced-filter-caching-in-solr/.

NOTE: this isn't the thing I would do first; it's much more efficient
to implement some of the suggestions above. Any time you can trade
query-time work for index-time work, it's almost always better to do
the work up-front at index time than during queries.
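As a reminder of how a filter gets demoted to a post-filter in that scheme: a filter query with cache=false and a cost of 100 or more runs after the main query and cheaper filters, e.g. (the frange function here is just a stand-in for real ACL logic, and my_acl_function is hypothetical):

```
fq={!frange l=1 cache=false cost=200}my_acl_function()
```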

Best,
Erick

On Wed, Oct 19, 2016 at 12:07 PM, John Bickerstaff
 wrote:
> Thank you both!  Very helpful.
>
> On Wed, Oct 19, 2016 at 8:48 AM, Shawn Heisey  wrote:
>
>> On 10/18/2016 3:00 PM, John Bickerstaff wrote:
>> > How (or is it even wise) to "segregate data" in Solr so that some data
>> > can be seen by some users and some data not be seen?
>>
>> IMHO, security like this isn't really Solr's job ... but with the right
>> data in the index, the system that DOES handle the security can include
>> a filter with each user's query to restrict them to only the data they
>> are allowed to see.  There are many ways to put data in the index for
>> efficient use by a filter.  The simplest would be a boolean field with a
>> name like isPublic or isPrivate, where true and false are mapped as
>> necessary to public and private.
>>
>> Naturally, the users must not be able to reach Solr directly ... they
>> must be restricted to the software that connects to Solr on their
>> behalf.  Blocking end users from direct network access to Solr is a good
>> idea even if there are no other security needs.
>>
>> There are more comprehensive solutions available, as you will notice
>> from other replies, but the idea of simple filtering, controlled by your
>> application, should work.
>>
>> Thanks,
>> Shawn
>>
>>


Zero value fails to match Positive, Negative, or Zero interval facet

2016-10-19 Thread Andy C
I have a field called "SCALE_double" that is defined as multivalued with
the fieldType "tdouble".

"tdouble" is defined as:



I have a document with the value "0" indexed for this field. I am able to
successfully retrieve the document with the range query "SCALE_double:[0 TO
0]". However it doesn't match any of the interval facets I am trying to
populate that match negative, zero, or positive values:

"{!key=\"Negative\"}(*,0)",
"{!key=\"Positive\"}(0,*]",
"{!key=\"Zero\"}[0,0]"

I assume this is some sort of precision issue with the TrieDoubleField
implementation (if I change the Zero interval to
"(-.01,+.01)" it now considers the document a match).
However the range query works fine (I had assumed that the interval was
just converted to a range query internally), and it fails to show up in the
Negative or Positive intervals either.
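Since the range-query form does match, one possible workaround (untested here) is to express the three buckets as facet.query parameters instead of interval facets, using the exclusive-bound syntax:

```
facet=true
&facet.query={!key=Negative}SCALE_double:[* TO 0}
&facet.query={!key=Zero}SCALE_double:[0 TO 0]
&facet.query={!key=Positive}SCALE_double:{0 TO *]
```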

Any ideas what is going on, and if there is anything I can do to get this
to work correctly? I am using Solr 5.3.1. I've pasted the output from the
Solr Admin UI query below.

Thanks,
- Andy -

{
  "responseHeader": {
"status": 0,
"QTime": 0,
"params": {
  "facet": "true",
  "fl": "SCALE_double",
  "facet.mincount": "1",
  "indent": "true",
  "facet.interval": "SCALE_double",
  "q": "SCALE_double:[0 TO 0]",
  "facet.limit": "100",
  "f.SCALE_double.facet.interval.set": [
"{!key=\"Negative\"}(*,0)",
"{!key=\"Positive\"}(0,*]",
"{!key=\"Zero\"}[0,0]"
  ],
  "_": "1476900130184",
  "wt": "json"
}
  },
  "response": {
"numFound": 1,
"start": 0,
"docs": [
  {
"SCALE_double": [
  0
]
  }
]
  },
  "facet_counts": {
"facet_queries": {},
"facet_fields": {},
"facet_dates": {},
"facet_ranges": {},
"facet_intervals": {
  "SCALE_double": {
"Negative": 0,
"Positive": 0,
"Zero": 0
  }
},
"facet_heatmaps": {}
  }
}


ApacheCon is now less than a month away!

2016-10-19 Thread Rich Bowen
Dear Apache Enthusiast,

ApacheCon Sevilla is now less than a month out, and we need your help
getting the word out. Please tell your colleagues, your friends, and
members of related technical communities, about this event. Rates go up
November 3rd, so register today!

ApacheCon, and Apache Big Data, are the official gatherings of the
Apache Software Foundation, and one of the best places in the world to
meet other members of your project community, gain deeper knowledge
about your favorite Apache projects, and learn about the ASF. Your project
doesn't live in a vacuum - it's part of a larger family of projects that
have a shared set of values, as well as a shared governance model. And
many of our projects have an overlap in developers, in communities, and
in subject matter, making ApacheCon a great place for cross-pollination
of ideas and of communities.

Some highlights of these events will be:

* Many of our board members and project chairs will be present
* The lightning talks are a great place to hear, and give, short
presentations about what you and other members of the community are
working on
* The key signing gets you linked into the web of trust, and better
able to verify our software releases
* Evening receptions and parties where you can meet community
members in a less formal setting
* The State of the Feather, where you can learn what the ASF has
done in the last year, and what's coming next year
* BarCampApache, an informal unconference-style event, is another
venue for discussing your projects at the ASF

We have a great schedule lined up, covering the wide range of ASF
projects, including:

* CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos -
Carlos Sanchez
* Inner sourcing 101 - Jim Jagielski
* Java Memory Leaks in Modular Environments - Mark Thomas

ApacheCon/Apache Big Data will be held in Sevilla, Spain, at the Melia
Sevilla, November 14th through 18th. You can find out more at
http://apachecon.com/  Other ways to stay up to date with ApacheCon are:

* Follow us on Twitter at @apachecon
* Join us on IRC, at #apachecon on the Freenode IRC network
* Join the apachecon-discuss mailing list by sending email to
apachecon-discuss-subscr...@apache.org
* Or contact me directly at rbo...@apache.org with questions,
comments, or to volunteer to help

See you in Sevilla!

-- 
Rich Bowen: VP, Conferences
rbo...@apache.org
http://apachecon.com/
@apachecon


Re: Problem with spellchecker component

2016-10-19 Thread la...@2locos.com
we are using these spellcheckers in our collection configs:

<lst name="spellchecker">
  <str name="name">default</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  ...
</lst>

<lst name="spellchecker">
  <str name="name">wordbreak</str>
  <str name="classname">solr.WordBreakSolrSpellChecker</str>
  ...
</lst>

<lst name="spellchecker">
  <str name="name">jarowinkler</str>
  ...
  <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
  ...
</lst>

it works when I have one word, but it doesn't work when I have a combination
of words with errors.

Ladan Nekuii
Web Developer
2locos
300 Frank H. Ogawa Plaza, Suite 234
Oakland, CA 94612
Tel: 510-465-0101
Fax: 510-465-0104
www.2locos.com 

-Original Message-
From: "Rajesh Hazari" 
Sent: Friday, October 7, 2016 12:27pm
To: solr-user@lucene.apache.org
Subject: Re: Problem with spellchecker component

What spellcheckers do you have in your collection configs?
Do you have any of these:

<lst name="spellchecker">
  <str name="name">wordbreak</str>
  <str name="classname">solr.WordBreakSolrSpellChecker</str>
  ...
</lst>

<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">textSpell</str>
  <str name="classname">solr.IndexBasedSpellChecker</str>
  ...
</lst>

We have come up with these spellcheckers, which work with our schema
definitions.
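For the multi-word case it's also worth checking the request side; collation is what turns per-token suggestions into whole-phrase corrections. A request along these lines exercises it (the parameter names are the standard spellcheck params; the dictionary names assume a default and a wordbreak spellchecker are configured):

```
/select?q=combnation+of+wods
  &spellcheck=true
  &spellcheck.dictionary=default
  &spellcheck.dictionary=wordbreak
  &spellcheck.collate=true
  &spellcheck.maxCollationTries=5
  &spellcheck.collateExtendedResults=true
```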

*Rajesh**.*

On Fri, Oct 7, 2016 at 2:36 PM, la...@2locos.com  wrote:

> I'm using the Spellcheck component and it doesn't show me any error for a
> combination of words with errors. I want to know if it just works on one
> word or also works on combinations of words? And if so, what should I do
> to make it work?
>
> Ladan Nekuii
> Web Developer
> 2locos
> 300 Frank H. Ogawa Plaza, Suite 234
> Oakland, CA 94612
> Tel: 510-465-0101
> Fax: 510-465-0104
> www.2locos.com
>
>




Re: Public/Private data in Solr :: Metadata or ?

2016-10-19 Thread John Bickerstaff
Thank you both!  Very helpful.

On Wed, Oct 19, 2016 at 8:48 AM, Shawn Heisey  wrote:

> On 10/18/2016 3:00 PM, John Bickerstaff wrote:
> > How (or is it even wise) to "segregate data" in Solr so that some data
> > can be seen by some users and some data not be seen?
>
> IMHO, security like this isn't really Solr's job ... but with the right
> data in the index, the system that DOES handle the security can include
> a filter with each user's query to restrict them to only the data they
> are allowed to see.  There are many ways to put data in the index for
> efficient use by a filter.  The simplest would be a boolean field with a
> name like isPublic or isPrivate, where true and false are mapped as
> necessary to public and private.
>
> Naturally, the users must not be able to reach Solr directly ... they
> must be restricted to the software that connects to Solr on their
> behalf.  Blocking end users from direct network access to Solr is a good
> idea even if there are no other security needs.
>
> There are more comprehensive solutions available, as you will notice
> from other replies, but the idea of simple filtering, controlled by your
> application, should work.
>
> Thanks,
> Shawn
>
>


Re: PDF writer

2016-10-19 Thread Shawn Heisey
On 10/17/2016 8:01 AM, Matthew Roth wrote:
> Is there a documented or preferred path to have a PDF response writer?
> I am using solr 5.3.x for an internal project. I have an XSL-FO
> transformation that I am able to return via the XSLT response writer.
> Is there a documented way to produce a PDF via solr? Alternatively, I
> was thinking of passing the response through an eXist-db instance [0]
> we have running. However, a pdf response writer would be ideal.

Solr responses are designed to be processed by a program making a search
query, not read by an end user.  Solr is middleware.  There are multiple
formats (json, xml, javabin) because we do not know what kind of program
will consume the response.

https://en.wikipedia.org/wiki/Middleware

PDF is an end-user format for display and print, not a middleware
response format.  Creating content like that is best handled by other
pieces of software, not Solr.

For best results that fit your needs perfectly, that software is likely
to be something you write yourself.  The Solr project has absolutely no
idea how you will define your schema, or how you would like the data in
a Solr response transformed, integrated, and formatted in a PDF.

Designing the feature you want would be something best handled as a
software project separate from Solr.  The software would take a Solr
response and turn it into a PDF.  It doesn't fit into Solr's core usage,
so making it a part of Solr is not a good fit and unlikely to happen.
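
Given the XSL-FO transform already in place, one hedged glue-layer sketch
(assuming Apache FOP is installed and an fo.xsl stylesheet is registered with
Solr's XSLT response writer -- the URL and file names are illustrative) would be:

```
curl 'http://localhost:8983/solr/core/select?q=*:*&wt=xslt&tr=fo.xsl' -o result.fo
fop result.fo result.pdf
```

That keeps the PDF generation entirely outside Solr, which matches the
middleware point above.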

No matter where the development for a general feature like that happens,
it would likely take weeks or months of work just to reach alpha
quality.  After that, it would take weeks or months of additional work
to reach release quality ... and even then it probably wouldn't produce
the exact results you want without extensive and complicated
configuration.  Handling complicated configuration is itself very
complicated, which is one reason why development would take so long.

Thanks,
Shawn



Re: solr-6.2.0 cannot be launched by systemd service

2016-10-19 Thread Shawn Heisey
On 10/17/2016 9:20 AM, yunjiez wrote:
> solr_systemd.log
>   
>
> There is no problem when launching solr-6.2.0 with the script bin/solr.
> But when I launch it with a systemd service, the Solr instance is soon
> stopped by systemd. I attached the error log. Can anyone help? Thanks.

I had to go to Nabble to see your log.  It didn't make it to the list.

This log line shows when shutdown begins:

2016-10-12 08:20:09.973 INFO (ShutdownMonitor) [ ]
o.e.j.s.ServerConnector Stopped
ServerConnector@62ddbd7e{HTTP/1.1,[http/1.1]}{0.0.0.0:8983}

This is a little more than halfway through the log.  Solr did not stop
because of any error, it was shut down externally.  The errors that show
up later in the log happened because Solr was shutting down.

There is no official systemd config for Solr.  That probably means that
you built it yourself or obtained it somewhere else and modified it. 
Looking in the latest Solr download, I do not see any file or directory
containing "systemd" in its name.  That makes this an unsupported
configuration.  I have absolutely no idea why this is happening.

Is there any particular reason you can't use the included shell script
for service installation?

https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production#TakingSolrtoProduction-RuntheSolrInstallationScript

That script should work on most UNIX-like operating systems, especially
Linux, where it has been pretty thoroughly tested.  It installs an init
script to /etc/init.d/ when it runs.  If you find that the script
doesn't work on your UNIX-like operating system, and you are sure that
recent versions of the required tools (bash, lsof, grep, ps, and similar
-- open source versions highly recommended) are installed, then that
would most likely be considered a bug.  Please discuss any problems
encountered with the install script here before opening an issue in Jira.
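
For reference, the documented install path is roughly the following (the
version number and paths are examples; this installs an init script, not a
systemd unit):

```
tar xzf solr-6.2.0.tgz solr-6.2.0/bin/install_solr_service.sh --strip-components=2
sudo bash ./install_solr_service.sh solr-6.2.0.tgz
sudo service solr status
```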

Thanks,
Shawn



Re: Public/Private data in Solr :: Metadata or ?

2016-10-19 Thread Shawn Heisey
On 10/18/2016 3:00 PM, John Bickerstaff wrote:
> How (or is it even wise) to "segregate data" in Solr so that some data
> can be seen by some users and some data not be seen? 

IMHO, security like this isn't really Solr's job ... but with the right
data in the index, the system that DOES handle the security can include
a filter with each user's query to restrict them to only the data they
are allowed to see.  There are many ways to put data in the index for
efficient use by a filter.  The simplest would be a boolean field with a
name like isPublic or isPrivate, where true and false are mapped as
necessary to public and private.

Naturally, the users must not be able to reach Solr directly ... they
must be restricted to the software that connects to Solr on their
behalf.  Blocking end users from direct network access to Solr is a good
idea even if there are no other security needs.

There are more comprehensive solutions available, as you will notice
from other replies, but the idea of simple filtering, controlled by your
application, should work.
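
A sketch of what the query-building layer might do. The field names isPublic
and owner_s are illustrative, and this assumes every request to Solr passes
through this function:

```python
def security_filter(user_id, is_admin=False):
    """Build a Solr fq clause restricting results to public documents
    or documents owned by the requesting user (illustrative field names)."""
    if is_admin:
        return "*:*"  # no restriction for admins
    # Quote the user id so reserved query characters cannot widen the filter.
    return 'isPublic:true OR owner_s:"{}"'.format(user_id)

# The application appends this fq to every request it proxies to Solr:
params = {"q": "user entered terms", "fq": security_filter("user42")}
```

Because the fq is added server-side by the application, the end user never
controls it -- which only holds if users cannot reach Solr directly, as noted
above.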

Thanks,
Shawn



Re: Public/Private data in Solr :: Metadata or ?

2016-10-19 Thread Jan Høydahl
In practice there should not be much of a delay, but if you change the ACL
permission on a top-level folder with 10 million docs beneath,
it will take some time before all those docs are reindexed. But if you instead
give your friend read access to a new “group” which
already has access to the docs, the change is immediate.

I suppose ManifoldCF could start using DocValues for the ACL info and update 
those atomically much faster than re-indexing the content of every document. 
Anyone know if that would be feasible?
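
As a data point, a single multivalued ACL field can already be flipped with
an atomic update without resending the whole document (the field name acl_ss
is illustrative, and atomic updates require the other fields to be stored or
have docValues) -- a JSON body POSTed to /update:

```json
[
  { "id": "doc1", "acl_ss": { "set": ["group:friends", "user:alice"] } }
]
```

True in-place docValues updates would avoid even that re-index of the single
document, which is the speedup being asked about.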

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 19. okt. 2016 kl. 00.30 skrev Markus Jelsma :
> 
> ManifoldCF can do this really flexibly, with FileNet or SharePoint, or both;
> I don't remember that well. This means a variety of users can have changing
> privileges at any time. The backend determines visibility; ManifoldCF just
> asks how visible it should be.
> 
> This also means you need those backends and ManifoldCF. If broad document and
> user permissions are required (and you have those backends), this is a very
> viable option.
> 
> 
> 
> -Original message-
>> From:John Bickerstaff 
>> Sent: Wednesday 19th October 2016 0:14
>> To: solr-user@lucene.apache.org
>> Subject: Re: Public/Private data in Solr :: Metadata or ?
>> 
>> Thanks Jan --
>> 
>> I did a quick scan on the wiki and here:
>> http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
>> and couldn't find the answer to the following question in the 5 or 10
>> minutes I spent looking.  Admittedly I'm being lazy and hoping you have
>> enough experience with the project to answer easily...
>> 
>> Do you know if ManifoldCF helps with a use case where the security token
>> needs to be changed arbitrarily and a re-index of the collection is not
>> practical?  Or is ManifoldCF an index-time only kind of thing?
>> 
>> 
>> Use Case:  User A changes "record A" from private to public so a friend
>> (User B) can see it.  User B logs in and expects to see what User A changed
>> to public a few minutes earlier.
>> 
>> The security token on "record A" would need to be changed immediately, and
>> that change would have to occur in Solr - yes?
>> 
>> 
>> 
>> On Tue, Oct 18, 2016 at 3:32 PM, Jan Høydahl  wrote:
>> 
>>> https://wiki.apache.org/solr/SolrSecurity#Document_Level_Security <
>>> https://wiki.apache.org/solr/SolrSecurity#Document_Level_Security>
>>> 
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>> 
 18. okt. 2016 kl. 23.00 skrev John Bickerstaff:

 How (or is it even wise) to "segregate data" in Solr so that some data can
 be seen by some users and some data not be seen?

 Taking the case of "public / private" as a (hopefully) simple, binary
 example...

 Let's imagine I have a data set that can be seen by a user.  Some of that
 data can be seen ONLY by the user (this would be the private data) and some
 of it can be seen by others (assume the user gave permission for this in
 some way).

 What is a best practice for handling this type of situation?  I can see
 putting metadata in Solr of course, but the instant I do that, I create the
 obligation to keep it updated (document-level CRUD?) and I start using Solr
 more like a DB than a search engine.

 (Assume the user can change this public/private setting on any one piece of
 "their" data at any time.)

 Of course, I can also see some kind of post-results massaging of data to
 remove private data based on IDs which are stored in a database or similar
 datastore...

 How have others solved this and is there a consensus on whether to keep it
 out of Solr, or how best to handle it in Solr?

 Are there clever implementations of "secondary" collections in Solr for
 this purpose?

 Any advice / hard-won experience is greatly appreciated...
>>> 
>>> 
>> 



Re: How to subtract a numeric value stored in 2 documents related by a one-to-one correlation id

2016-10-19 Thread Kevin Risden
The Parallel SQL support for what you are asking for doesn't exist quite
yet. The use case you described is close to what I was envisioning for the
Solr SQL support. This would allow full text searches and then some
analytics on top of it (like call duration).

I'm not sure whether subtracting fields (c2.time - c1.time) is supported in
streaming expressions yet. The leftOuterJoin is, but I'm not sure about
arbitrary math expressions. The Parallel SQL side has an issue with 1 != 0
right now, so I'm guessing adding/subtracting is also out for now.

The ticket you will want to follow is SOLR-8593 (
https://issues.apache.org/jira/browse/SOLR-8593) This is the Calcite
integration and should enable a lot more SQL syntax as a result.
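
For reference, the shape of the streaming-expression join would be roughly
the following sketch (collection and field names come from the example below;
renaming the colliding time fields via select(), and the subtraction itself,
are exactly the parts whose support is uncertain today):

```
leftOuterJoin(
  search(collection, q="type:REQ", fl="reqid,service,time", sort="reqid asc"),
  search(collection, q="type:RES", fl="reqid,time", sort="reqid asc"),
  on="reqid"
)
```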

Kevin Risden
Apache Lucene/Solr Committer
Hadoop and Search Tech Lead | Avalon Consulting, LLC

M: 732 213 8417

-
This message (including any attachments) contains confidential information
intended for a specific individual and purpose, and is protected by law. If
you are not the intended recipient, you should delete this message. Any
disclosure, copying, or distribution of this message, or the taking of any
action based on it, is strictly prohibited.

On Wed, Oct 19, 2016 at 8:23 AM,  wrote:

> Hello,
> I have 2 documents recorded at request or response of a service call  :
> Entity Request
>  {
>   "type":"REQ",
>   "reqid":"MES0",
>"service":"service0",
>"time":1,
>  }
> Entity response
>  {
>   "type":"RES",
>   "reqid":"MES0",
>"time":10,
>  }
>
> I need to create following statistics:
> Total service call duration for each call (reqid is unique for each
> service call) :
> similar to query :
> select c1.reqid,c1.service,c1.time as REQTime, c2.time as RESTime ,
> c2.time - c1.time as TotalTime from collection c1 left join collection c2
> on c1.reqid = c2.reqid and c2.type = 'RES'
>
>  {
>"reqid":"MES0",
>"service":service0,
>"REQTime":1,
>"RESTime":10,
>"TotalTime":9
>  }
>
> Average service call duration :
> similar to query :
> select c1.service,  avg(c2.time - c1.time) as AvgTime, count(*) from
> collection c1 left join collection c2 on c1.reqid = c2.reqid and c2.type =
> 'RES' group by c1.service
>
>  {
>"service":service0,
>"AvgTime":9,
>"Count": 1
>  }
>
> I tried to find a solution in the archives and experimented with !join,
> subquery, _query_, etc., but did not succeed.
> I can probably use streaming and leftOuterJoin, but in my understanding
> this functionality is not ready for production.
> Is Solr capable of fulfilling these use cases? What are the key functions to
> focus on?
>
> Thanks, Pavel
>
>
>
>
>
>
>
>
>


Re: Facet behavior

2016-10-19 Thread Yonik Seeley
On Wed, Oct 19, 2016 at 6:23 AM, Bastien Latard | MDPI AG
 wrote:
> Hi everybody,
>
> I just had a question about facets.
> *==> Is the facet run on all documents (to pre-process/cache the data) or
> only on returned documents?*

Yes ;-)

There are sometimes per-field data structures that are cached to
support faceting.  This can make the first facet request after a new
searcher take longer.  Unless you're using docValues, then the cost is
much less.

Then there are per-request data structures (like a count array) that
are O(field_cardinality) and not O(matching_docs).
But then for default field-cache faceting, the actual counting part is
O(matching_docs).
So yes, at the end of  the day we only facet on the matching
documents... but what the total field looks like certainly matters.
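
If the slow first facet request after a new searcher is the pain point,
docValues on the facet field is the usual mitigation -- a schema sketch (the
field name is illustrative, and changing this requires a reindex):

```xml
<field name="author" type="string" indexed="true" stored="true" docValues="true"/>
```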

-Yonik


How to subtract a numeric value stored in 2 documents related by a one-to-one correlation id

2016-10-19 Thread kahle

Hello,
I have 2 documents recorded at request or response of a service call  :
Entity Request
 {
  "type":"REQ",
  "reqid":"MES0",
   "service":"service0",
   "time":1,
 }
Entity response
 {
  "type":"RES",
  "reqid":"MES0",
   "time":10,
 }

I need to create following statistics:
Total service call duration for each call (reqid is unique for each 
service call) :

similar to query :
select c1.reqid,c1.service,c1.time as REQTime, c2.time as RESTime , 
c2.time - c1.time as TotalTime from collection c1 left join collection 
c2 on c1.reqid = c2.reqid and c2.type = 'RES'


 {
   "reqid":"MES0",
   "service":service0,
   "REQTime":1,
   "RESTime":10,
   "TotalTime":9
 }

Average service call duration :
similar to query :
select c1.service,  avg(c2.time - c1.time) as AvgTime, count(*) from 
collection c1 left join collection c2 on c1.reqid = c2.reqid and c2.type 
= 'RES' group by c1.service


 {
   "service":service0,
   "AvgTime":9,
   "Count": 1
 }

I tried to find a solution in the archives and experimented with !join,
subquery, _query_, etc., but did not succeed.
I can probably use streaming and leftOuterJoin, but in my understanding
this functionality is not ready for production.
Is Solr capable of fulfilling these use cases? What are the key functions
to focus on?
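
Until join-with-arithmetic support lands in Solr, one workaround is to fetch
both document types and join them client-side; a minimal Python sketch over
the example records (mirroring the SQL above, outside Solr):

```python
def join_req_res(docs):
    """Left-join REQ docs to RES docs on reqid and compute call duration,
    mirroring the SQL left join, but done client-side."""
    responses = {d["reqid"]: d for d in docs if d["type"] == "RES"}
    rows = []
    for d in docs:
        if d["type"] != "REQ":
            continue
        res = responses.get(d["reqid"])
        rows.append({
            "reqid": d["reqid"],
            "service": d.get("service"),
            "REQTime": d["time"],
            "RESTime": res["time"] if res else None,
            "TotalTime": res["time"] - d["time"] if res else None,
        })
    return rows

# The two example documents from the question:
docs = [
    {"type": "REQ", "reqid": "MES0", "service": "service0", "time": 1},
    {"type": "RES", "reqid": "MES0", "time": 10},
]
rows = join_req_res(docs)
# rows[0]["TotalTime"] is 9 for the example pair above
```

Averages per service can then be computed over the joined rows. This only
scales while the result set fits in the client, which is exactly why the
streaming/SQL support is the long-term answer.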


Thanks, Pavel










Facet behavior

2016-10-19 Thread Bastien Latard | MDPI AG

Hi everybody,

I just had a question about facets.
*==> Is the facet run on all documents (to pre-process/cache the data) 
or only on returned documents?*


Because I have exactly the same index locally and on the prod server
(except that my dev instance contains far fewer docs).


When I make a query and request facets for it, it takes much longer on
the production server, even though the query returns fewer
documents...


e.g.:
q=nanoparticles AND gold&facet.limit=5&facet.field=author&start=0&facet=true&wt=xml&rows=0

- live : 4059 documents <=> 11 secs
- local: 22298 documents <=> 1 sec

Thanks in advance.

Kind regards,
Bastien



Re: Query by distance

2016-10-19 Thread Sergio García Maroto
Thanks a lot. I will try it and let you know.

Thanks again
Sergio
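
For anyone landing on this thread later: the underlying problem is that
multi-word entries in a stock synonyms.txt (mappings below are illustrative)
get split per token by the standard query parser before the SynonymFilter
sees them, which is what the plugin mentioned below works around:

```
sea biscuit, sea biscit => seabiscuit
personal computer, pc
```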


On 18 October 2016 at 17:02, John Bickerstaff 
wrote:

> Just in case it helps, I had good success on multi-word synonyms using this
> plugin...
>
> https://github.com/healthonnet/hon-lucene-synonyms
>
> IIRC, the instructions are clear and fairly easy to follow - especially for
> Solr 6.x
>
> Ping back if you run into any problems setting it up...
>
>
>
> On Tue, Oct 18, 2016 at 7:12 AM, marotosg  wrote:
>
> > This is my field type.
> >
> > <fieldType name="..." class="solr.TextField"
> > positionIncrementGap="300">
> >   <analyzer type="index">
> >     <tokenizer class="..."/>
> >     <filter class="solr.WordDelimiterFilterFactory"
> > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
> > preserveOriginal="1" protected="protwordscompany.txt"/>
> >     <filter class="..." preserveOriginal="false"/>
> >     <filter class="solr.SynonymFilterFactory"
> > synonyms="positionsynonyms.txt" ignoreCase="true" expand="true"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="..."/>
> >     <filter class="solr.WordDelimiterFilterFactory"
> > generateWordParts="0" generateNumberParts="0" catenateWords="0"
> > catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
> > preserveOriginal="1" protected="protwordscompany.txt"/>
> >     <filter class="..." preserveOriginal="false"/>
> >   </analyzer>
> > </fieldType>
> >
> >
> > I have been reading and it looks like the issue is about multi term
> > synonym.
> > http://opensourceconnections.com/blog/2013/10/27/why-is-multi-term-synonyms-so-hard-in-solr/
> >
> > I may try this plug in to check if it works.
> >
> >
> >
> >
> > --
> > View this message in context: http://lucene.472066.n3.nabble.com/Query-by-distance-tp4300660p4301697.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>