CDCR - Active-passive model

2016-08-03 Thread Mads Tomasgård Bjørgan
Hello,
I read that it's being worked on in 6.x to fix the limitation of CDCR only covering 
the active-passive scenario. My question is then - does anyone know when we can 
expect the fix to be out?

Thanks,
Mads


Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Walter Underwood
Ah, the difference between open source and a product. With Ultraseek, we chose 
a solid, stable algorithm that worked well for 3000 customers. In open source, 
it is a research project for every single customer.

I love open source. I’ve brought Solr into Netflix and Chegg. But there is a 
clear difference between developer-driven and customer-driven software.

I first learned about bounded binary exponential backoff in the 
Digital/Intel/Xerox (“DIX”) Ethernet spec in 1980. It is a solid algorithm for 
events with a Poisson distribution, like packet arrival times or web page next 
change times. There is no need for configuring algorithms here, especially 
configurations that lead to an unstable estimate. The only meaningful choices 
are the minimum revisit time, the maximum revisit time, and the number of bins. 
Those will be different for CNN (a launch customer for Ultraseek) or Sun 
documentation (another launch customer). CNN news articles changed minute by 
minute; new Sun documentation appeared weekly or monthly.
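
As a sketch in Java, the bounded backoff described above might look like this
(the bounds here are illustrative, not Ultraseek's actual values; per the reply
below, the estimate resets on change):

long nextRevisitSeconds(long currentInterval, boolean pageChanged) {
    final long minRevisit = 15 * 60;          // minimum revisit time: 15 minutes
    final long maxRevisit = 60L * 24 * 3600;  // maximum revisit time: 60 days
    if (pageChanged) {
        return minRevisit;                    // reset the estimate on change
    }
    return Math.min(currentInterval * 2, maxRevisit);  // back off while unchanged
}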

Sorry for the rant, but “you can fix the algorithm yourself” almost always 
means a bad installation, an unhappy admin, and another black eye for open 
source.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 3, 2016, at 4:07 PM, Markus Jelsma  wrote:
> 
> Depending on your settings, Nutch does this as well. It is even possible to 
> set up different increase/decrease factors per MIME type. 
> The algorithms are pluggable and overridable at any point of interest. You 
> can go all the way.  
> 
> -Original message-
>> From:Walter Underwood 
>> Sent: Wednesday 3rd August 2016 20:03
>> To: solr-user@lucene.apache.org
>> Subject: Re: SOLR + Nutch set up (UNCLASSIFIED)
>> 
>> That’s good news.
>> 
>> It should reset the interval estimate on page change instead of slowly 
>> shortening it.
>> 
>> I’m pretty sure that Ultraseek used a bounded exponential backoff when the 
>> page had not changed.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Aug 3, 2016, at 10:51 AM, Marco Scalone  wrote:
>>> 
>>> Nutch also has adaptive strategy:
>>> 
>>> This class implements an adaptive re-fetch algorithm. This works as
 follows:
 
  - for pages that have changed since the last fetchTime, decrease their
  fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
  - for pages that haven't changed since the last fetchTime, increase
  their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
  If SYNC_DELTA property is true, then:
 - calculate a delta = fetchTime - modifiedTime
 - try to synchronize with the time of change, by shifting the next
 fetchTime by a fraction of the difference between the last modification
 time and the last fetch time. I.e. the next fetch time will be set to 
 fetchTime
 + fetchInterval - delta * SYNC_DELTA_RATE
 - if the adjusted fetch interval is bigger than the delta, then 
 fetchInterval
 = delta.
  - the minimum value of fetchInterval may not be smaller than
  MIN_INTERVAL (default is 1 minute).
  - the maximum value of fetchInterval may not be bigger than
  MAX_INTERVAL (default is 365 days).
 
 NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
 the algorithm, so that the fetch interval either increases or decreases
 infinitely, with little relevance to the page changes. Please use
 main(String[])
 
 method to test the values before applying them in a production system.
 
>>> 
>>> From:
>>> https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
>>> 
>>> 
>>> 2016-08-03 14:45 GMT-03:00 Walter Underwood :
>>> 
 I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
 in Ultraseek.
 
 I think we were the only people who built an adaptive crawler for
 enterprise use. I tried to get Ultraseek open-sourced. I made the argument
 to Mike Lynch. He looked at me like I had three heads and didn’t even
 answer me.
 
 Ultraseek also has great support for sites that need login. If you use
 that, you’ll need to find a way to do that with another crawler.
 
 wunder
 Walter Underwood
 Former Ultraseek Principal Engineer
 wun...@wunderwood.org
 http://observer.wunderwood.org/  (my blog)
 
 
> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US)
  wrote:
> 
> CLASSIFICATION: UNCLASSIFIED
> 
> We are currently using ultraseek and looking to deprecate it in favor of
 

RE: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Markus Jelsma
No, just run it continuously, always! By default everything is refetched (if 
possible) every 30 days. Just read the descriptions for adaptive schedule and 
its javadoc. It is simple to use, but sometimes hard to predict its outcome, 
just because you never know what changes, at whatever time.

You will be fine with defaults if you have a small site. Just set the interval 
to a few days, or more if your site is slightly larger.

M.
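
For reference, those knobs live in nutch-site.xml; a sketch (the 3-day value
is just an example, the stock default being 30 days):

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <value>259200</value> <!-- 3 days, in seconds -->
</property>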

 
 
-Original message-
> From:Musshorn, Kris T CTR USARMY RDECOM ARL (US) 
> 
> Sent: Wednesday 3rd August 2016 20:08
> To: solr-user@lucene.apache.org
> Subject: RE: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)
> 
> CLASSIFICATION: UNCLASSIFIED
> 
> Shall I assume that, even though nutch has adaptive capability, I would still 
> have to figure out how to trigger it to go look for content that needs update?
> 
> Thanks,
> Kris
> 
> ~~
> Kris T. Musshorn
> FileMaker Developer - Contractor – Catapult Technology Inc.  
> US Army Research Lab 
> Aberdeen Proving Ground 
> Application Management & Development Branch 
> 410-278-7251
> kris.t.musshorn@mail.mil
> ~~
> 
> 
> -Original Message-
> From: Walter Underwood [mailto:wun...@wunderwood.org] 
> Sent: Wednesday, August 03, 2016 2:03 PM
> To: solr-user@lucene.apache.org
> Subject: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)
> 
> All active links contained in this email were disabled.  Please verify the 
> identity of the sender, and confirm the authenticity of all links contained 
> within the message prior to copying and pasting the address to a Web browser. 
>  
> 
> 
> 
> 
> 
> 
> That’s good news.
> 
> It should reset the interval estimate on page change instead of slowly 
> shortening it.
> 
> I’m pretty sure that Ultraseek used a bounded exponential backoff when the 
> page had not changed.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> Caution-http://observer.wunderwood.org/  (my blog)
> 
> 
> > On Aug 3, 2016, at 10:51 AM, Marco Scalone  wrote:
> > 
> > Nutch also has adaptive strategy:
> > 
> > This class implements an adaptive re-fetch algorithm. This works as
> >> follows:
> >> 
> >>   - for pages that have changed since the last fetchTime, decrease their
> >>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
> >>   - for pages that haven't changed since the last fetchTime, increase
> >>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
> >>   If SYNC_DELTA property is true, then:
> >>  - calculate a delta = fetchTime - modifiedTime
> >>  - try to synchronize with the time of change, by shifting the next
> >>  fetchTime by a fraction of the difference between the last 
> >> modification
> >>  time and the last fetch time. I.e. the next fetch time will be set to 
> >> fetchTime
> >>  + fetchInterval - delta * SYNC_DELTA_RATE
> >>  - if the adjusted fetch interval is bigger than the delta, then 
> >> fetchInterval
> >>  = delta.
> >>   - the minimum value of fetchInterval may not be smaller than
> >>   MIN_INTERVAL (default is 1 minute).
> >>   - the maximum value of fetchInterval may not be bigger than
> >>   MAX_INTERVAL (default is 365 days).
> >> 
> >> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may 
> >> destabilize the algorithm, so that the fetch interval either 
> >> increases or decreases infinitely, with little relevance to the page 
> >> changes. Please use
> >> main(String[])
> >>  >> h/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
> >> method to test the values before applying them in a production system.
> >> 
> > 
> > From:
> > Caution-https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/
> > crawl/AdaptiveFetchSchedule.html
> > 
> > 
> > 2016-08-03 14:45 GMT-03:00 Walter Underwood :
> > 
> >> I’m pretty sure Nutch uses a batch crawler instead of the adaptive 
> >> crawler in Ultraseek.
> >> 
> >> I think we were the only people who built an adaptive crawler for 
> >> enterprise use. I tried to get Ultraseek open-sourced. I made the 
> >> argument to Mike Lynch. He looked at me like I had three heads and 
> >> didn’t even answer me.
> >> 
> >> Ultraseek also has great support for sites that need login. If you 
> >> use that, you’ll need to find a way to do that with another crawler.
> >> 
> >> wunder
> >> Walter Underwood
> >> Former Ultraseek Principal Engineer
> >> wun...@wunderwood.org
> >> Caution-http://observer.wunderwood.org/  (my blog)
> >> 
> >> 
> >>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL 
> >>> (US)
> >>  wrote:
> >>> 
> >>> CLASSIFICATION: UNCLASSIFIED
> >>> 
> >>> We are currently using ultraseek and looking to deprecate it in 
> >>> favor of
> >> solr/nutch.
> >>> Ultraseek 

RE: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Markus Jelsma
Depending on your settings, Nutch does this as well. It is even possible to set 
up different increase/decrease factors per MIME type. 
The algorithms are pluggable and overridable at any point of interest. You can 
go all the way.  
 
-Original message-
> From:Walter Underwood 
> Sent: Wednesday 3rd August 2016 20:03
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR + Nutch set up (UNCLASSIFIED)
> 
> That’s good news.
> 
> It should reset the interval estimate on page change instead of slowly 
> shortening it.
> 
> I’m pretty sure that Ultraseek used a bounded exponential backoff when the 
> page had not changed.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
> > On Aug 3, 2016, at 10:51 AM, Marco Scalone  wrote:
> > 
> > Nutch also has adaptive strategy:
> > 
> > This class implements an adaptive re-fetch algorithm. This works as
> >> follows:
> >> 
> >>   - for pages that have changed since the last fetchTime, decrease their
> >>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
> >>   - for pages that haven't changed since the last fetchTime, increase
> >>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
> >>   If SYNC_DELTA property is true, then:
> >>  - calculate a delta = fetchTime - modifiedTime
> >>  - try to synchronize with the time of change, by shifting the next
> >>  fetchTime by a fraction of the difference between the last 
> >> modification
> >>  time and the last fetch time. I.e. the next fetch time will be set to 
> >> fetchTime
> >>  + fetchInterval - delta * SYNC_DELTA_RATE
> >>  - if the adjusted fetch interval is bigger than the delta, then 
> >> fetchInterval
> >>  = delta.
> >>   - the minimum value of fetchInterval may not be smaller than
> >>   MIN_INTERVAL (default is 1 minute).
> >>   - the maximum value of fetchInterval may not be bigger than
> >>   MAX_INTERVAL (default is 365 days).
> >> 
> >> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
> >> the algorithm, so that the fetch interval either increases or decreases
> >> infinitely, with little relevance to the page changes. Please use
> >> main(String[])
> >> 
> >> method to test the values before applying them in a production system.
> >> 
> > 
> > From:
> > https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
> > 
> > 
> > 2016-08-03 14:45 GMT-03:00 Walter Underwood :
> > 
> >> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
> >> in Ultraseek.
> >> 
> >> I think we were the only people who built an adaptive crawler for
> >> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
> >> to Mike Lynch. He looked at me like I had three heads and didn’t even
> >> answer me.
> >> 
> >> Ultraseek also has great support for sites that need login. If you use
> >> that, you’ll need to find a way to do that with another crawler.
> >> 
> >> wunder
> >> Walter Underwood
> >> Former Ultraseek Principal Engineer
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >> 
> >> 
> >>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US)
> >>  wrote:
> >>> 
> >>> CLASSIFICATION: UNCLASSIFIED
> >>> 
> >>> We are currently using ultraseek and looking to deprecate it in favor of
> >> solr/nutch.
> >>> Ultraseek runs all the time and auto detects when pages have changed and
> >> automatically reindexes them.
> >>> Is this possible with SOLR/nutch?
> >>> 
> >>> Thanks,
> >>> Kris
> >>> 
> >>> ~~
> >>> Kris T. Musshorn
> >>> FileMaker Developer - Contractor - Catapult Technology Inc.
> >>> US Army Research Lab
> >>> Aberdeen Proving Ground
> >>> Application Management & Development Branch
> >>> 410-278-7251
> >>> kris.t.musshorn@mail.mil
> >>> ~~
> >>> 
> >>> 
> >>> 
> >>> CLASSIFICATION: UNCLASSIFIED
> >> 
> >> 
> 
> 


Re: QParsePlugin not working on sharded collection

2016-08-03 Thread Erick Erickson
OK, I'm going to assume that somewhere you're
keeping more complicated structures around to
track all the docs coming through the collector so
you can know whether they're duplicates or not.

I think there are really two ways (at least) to go about
it
1> use a SearchComponent to add a separate section to
the response similar to highlighting or faceting.

2> go ahead and use a DocTransformer to add the data
to each individual doc. But the example you're using adds the
data to the meta-data, not an individual doc.


Best,
Erick

On Wed, Aug 3, 2016 at 2:03 PM, tedsolr  wrote:
> So I notice that if I create the simplest MergeStrategy, I can get my test
> values from the shard responses, and then if I add info to the SolrQueryResponse
> it gets back to the caller. I still must be missing something. I wouldn't
> expect to have different code paths - one for single shard one for multi
> shard. So if the PostFilter is restricting the documents returned, what's
> the correct way to return my analytics info? Should I not be adding data to
> the SolrQueryResponse from within the delegating collector's finish()
> method? Here's what I'm trying to do (still works fine with a single shard
> collection :)
>
> - Use the DelegatingCollector to restrict docs returned (dropping docs that
> are "duplicates" based on my critieria)
> - Calculate 2 stats for each collected doc: a count of "duplicate" docs & a
> sum on a number field from these "duplicate" docs. I am doing the math in
> the collect() method.
> - Return the stats in the response stream. I'm using a TransformerFactory
> now to inject a new field into the results for each doc. Should I be using a
> SearchComponent instead?
>
>
> Erick Erickson wrote
>> Right, I don't have the code in front of me right now, but I think
>> your issue is at the "aggregation" point. You also have to put
>> some code in the aggregation bits that pulls your custom parts
>> from the sub-request packets and puts them in the final packet,
>> "doing the right thing" in terms of assembling them into
>> something meaningful along the way (e.g. averaging "myvar"
>> or putting it in a list identified by shard or..).
>>
>> I think if you fire the query at one of your shards with distrib=false
>> you'll see your additions, which would demonstrate that your
>> filter is being found. I assume your custom jar is on the shards
>> or you'd get an exception (assuming you've pushed your
>> solrconfig to ZK).
>>
>> Best,
>> Erick
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/QParsePlugin-not-working-on-sharded-collection-tp4290249p4290285.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: QParsePlugin not working on sharded collection

2016-08-03 Thread tedsolr
So I notice that if I create the simplest MergeStrategy, I can get my test
values from the shard responses, and then if I add info to the SolrQueryResponse
it gets back to the caller (see the sketch below). I still must be missing
something. I wouldn't
expect to have different code paths - one for single shard one for multi
shard. So if the PostFilter is restricting the documents returned, what's
the correct way to return my analytics info? Should I not be adding data to
the SolrQueryResponse from within the delegating collector's finish()
method? Here's what I'm trying to do (still works fine with a single shard
collection :)

- Use the DelegatingCollector to restrict docs returned (dropping docs that
are "duplicates" based on my critieria)
- Calculate 2 stats for each collected doc: a count of "duplicate" docs & a
sum on a number field from these "duplicate" docs. I am doing the math in
the collect() method.
- Return the stats in the response stream. I'm using a TransformerFactory
now to inject a new field into the results for each doc. Should I be using a
SearchComponent instead?
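
A minimal sketch of the kind of MergeStrategy involved, assuming the collector
adds a per-shard "thecountis" value as in the code elsewhere in this thread
(the class name and cost are made up for illustration):

import java.io.IOException;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.component.MergeStrategy;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.ShardRequest;
import org.apache.solr.handler.component.ShardResponse;
import org.apache.solr.search.SolrIndexSearcher;

public class TedMergeStrategy implements MergeStrategy {

    @Override
    public void merge(ResponseBuilder rb, ShardRequest sreq) {
        long total = 0;
        for (ShardResponse srsp : sreq.responses) {
            NamedList<?> shardRsp = srsp.getSolrResponse().getResponse();
            Object count = shardRsp.get("thecountis");
            if (count != null) {
                total += Long.parseLong(count.toString()); // sum per-shard counts
            }
        }
        rb.rsp.add("thecountis", String.valueOf(total));   // one merged value for the caller
    }

    @Override public boolean mergesIds() { return false; }
    @Override public boolean handlesMergeFields() { return false; }
    @Override public void handleMergeFields(ResponseBuilder rb, SolrIndexSearcher searcher) throws IOException {}
    @Override public int getCost() { return 100; }
}

The AnalyticsQuery subclass would hand this back through the
AnalyticsQuery(MergeStrategy) constructor so the aggregator node knows to call it.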


Erick Erickson wrote
> Right, I don't have the code in front of me right now, but I think
> your issue is at the "aggregation" point. You also have to put
> some code in the aggregation bits that pulls your custom parts
> from the sub-request packets and puts them in the final packet,
> "doing the right thing" in terms of assembling them into
> something meaningful along the way (e.g. averaging "myvar"
> or putting it in a list identified by shard or..).
> 
> I think if you fire the query at one of your shards with distrib=false
> you'll see your additions, which would demonstrate that your
> filter is being found. I assume your custom jar is on the shards
> or you'd get an exception (assuming you've pushed your
> solrconfig to ZK).
> 
> Best,
> Erick





--
View this message in context: 
http://lucene.472066.n3.nabble.com/QParsePlugin-not-working-on-sharded-collection-tp4290249p4290285.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: QParsePlugin not working on sharded collection

2016-08-03 Thread Erick Erickson
Right, I don't have the code in front of me right now, but I think
your issue is at the "aggregation" point. You also have to put
some code in the aggregation bits that pulls your custom parts
from the sub-request packets and puts them in the final packet,
"doing the right thing" in terms of assembling them into
something meaningful along the way (e.g. averaging "myvar"
or putting it in a list identified by shard or..).

I think if you fire the query at one of your shards with distrib=false
you'll see your additions, which would demonstrate that your
filter is being found. I assume your custom jar is on the shards
or you'd get an exception (assuming you've pushed your
solrconfig to ZK).

Best,
Erick

On Wed, Aug 3, 2016 at 9:42 AM, tedsolr  wrote:
> I'm trying to verify that a very simple custom post filter will work on a
> sharded collection. So far it doesn't. Here are the search results on my
> single shard test collection:
>
> {
>   "responseHeader": {
> "status": 0,
> "QTime": 17
>   },
>   "thecountis": "946028",
>   "myvar": "hello",
>   "response": {
> "numFound": 946028,
> "start": 0,
> "docs": [
> ...]
> }
>
> When I run against a two shard collection (same data set) it's as though the
> post filter doesn't exist. The results don't include my additions to the
> response:
>
> {
>   "responseHeader": {
> "status": 0,
> "QTime": 17
>   },
>   "response": {
> "numFound": 946028,
> "start": 0,
> "docs": [
> ...]
> }
>
> Here's the solrconfig.xml:
>
> ...
> <lst name="appends">
>    <str name="fq">{!TedFilter myvar=hello}</str>
> </lst>
> ...
>
> And here's the simplest plugin I could write:
>
> public class TedPlugin extends QParserPlugin {
> @Override
> public void init(NamedList arg0) {
> }
>
> @Override
> public QParser createParser(String arg0, final SolrParams arg1, final
> SolrParams arg2, final SolrQueryRequest arg3) {
> return new QParser(arg0, arg1, arg2, arg3) {
>
> @Override
> public Query parse() throws SyntaxError {
> return new TedQuery(arg1, arg2, arg3);
> }
> };
> }
> }
>
> public class TedQuery extends AnalyticsQuery {
> private final String myvar;
>
> TedQuery(SolrParams localParams, SolrParams params, SolrQueryRequest 
> req) {
> myvar = localParams.get("myvar");
> }
>
> @Override
> public DelegatingCollector getAnalyticsCollector(ResponseBuilder rb,
> IndexSearcher searcher) {
> return new TedCollector(myvar, rb);
> }
>
> @Override
> public boolean equals(Object o) {
> if (o instanceof TedQuery) {
> TedQuery tq = (TedQuery) o;
> return Objects.equals(this.myvar, tq.myvar);
> }
> return false;
> }
>
> @Override
> public int hashCode() {
> return myvar == null ? 1 : myvar.hashCode();
> }
>
>
> class TedCollector extends DelegatingCollector {
> ResponseBuilder rb;
> int count;
> String myvar;
>
> public TedCollector(String myvar, ResponseBuilder rb) {
> this.rb = rb;
> this.myvar = myvar;
> }
>
> @Override
> public void collect(int doc) throws IOException {
> count++;
> super.collect(doc);
> }
>
> @Override
> public void finish() throws IOException {
> rb.rsp.add("thecountis", String.valueOf(count));
> rb.rsp.add("myvar", myvar);
>
> if (super.delegate instanceof DelegatingCollector) {
> ((DelegatingCollector) 
> super.delegate).finish();
> }
> }
> }
> }
>
> What am I doing wrong? Thanks!
> Ted
> v5.2.1 SolrCloud mode
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/QParsePlugin-not-working-on-sharded-collection-tp4290249.html
> Sent from the Solr - User mailing list archive at Nabble.com.


RE: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Musshorn, Kris T CTR USARMY RDECOM ARL (US)
CLASSIFICATION: UNCLASSIFIED

Shall I assume that, even though nutch has adaptive capability, I would still 
have to figure out how to trigger it to go look for content that needs update?

Thanks,
Kris

~~
Kris T. Musshorn
FileMaker Developer - Contractor – Catapult Technology Inc.  
US Army Research Lab 
Aberdeen Proving Ground 
Application Management & Development Branch 
410-278-7251
kris.t.musshorn@mail.mil
~~


-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Wednesday, August 03, 2016 2:03 PM
To: solr-user@lucene.apache.org
Subject: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)

All active links contained in this email were disabled.  Please verify the 
identity of the sender, and confirm the authenticity of all links contained 
within the message prior to copying and pasting the address to a Web browser.  






That’s good news.

It should reset the interval estimate on page change instead of slowly 
shortening it.

I’m pretty sure that Ultraseek used a bounded exponential backoff when the page 
had not changed.

wunder
Walter Underwood
wun...@wunderwood.org
Caution-http://observer.wunderwood.org/  (my blog)


> On Aug 3, 2016, at 10:51 AM, Marco Scalone  wrote:
> 
> Nutch also has adaptive strategy:
> 
> This class implements an adaptive re-fetch algorithm. This works as
>> follows:
>> 
>>   - for pages that have changed since the last fetchTime, decrease their
>>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
>>   - for pages that haven't changed since the last fetchTime, increase
>>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>>   If SYNC_DELTA property is true, then:
>>  - calculate a delta = fetchTime - modifiedTime
>>  - try to synchronize with the time of change, by shifting the next
>>  fetchTime by a fraction of the difference between the last modification
>>  time and the last fetch time. I.e. the next fetch time will be set to 
>> fetchTime
>>  + fetchInterval - delta * SYNC_DELTA_RATE
>>  - if the adjusted fetch interval is bigger than the delta, then 
>> fetchInterval
>>  = delta.
>>   - the minimum value of fetchInterval may not be smaller than
>>   MIN_INTERVAL (default is 1 minute).
>>   - the maximum value of fetchInterval may not be bigger than
>>   MAX_INTERVAL (default is 365 days).
>> 
>> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may 
>> destabilize the algorithm, so that the fetch interval either 
>> increases or decreases infinitely, with little relevance to the page 
>> changes. Please use
>> main(String[])
>> > h/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
>> method to test the values before applying them in a production system.
>> 
> 
> From:
> Caution-https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/
> crawl/AdaptiveFetchSchedule.html
> 
> 
> 2016-08-03 14:45 GMT-03:00 Walter Underwood :
> 
>> I’m pretty sure Nutch uses a batch crawler instead of the adaptive 
>> crawler in Ultraseek.
>> 
>> I think we were the only people who built an adaptive crawler for 
>> enterprise use. I tried to get Ultraseek open-sourced. I made the 
>> argument to Mike Lynch. He looked at me like I had three heads and 
>> didn’t even answer me.
>> 
>> Ultraseek also has great support for sites that need login. If you 
>> use that, you’ll need to find a way to do that with another crawler.
>> 
>> wunder
>> Walter Underwood
>> Former Ultraseek Principal Engineer
>> wun...@wunderwood.org
>> Caution-http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL 
>>> (US)
>>  wrote:
>>> 
>>> CLASSIFICATION: UNCLASSIFIED
>>> 
>>> We are currently using ultraseek and looking to deprecate it in 
>>> favor of
>> solr/nutch.
>>> Ultraseek runs all the time and auto detects when pages have changed 
>>> and
>> automatically reindexes them.
>>> Is this possible with SOLR/nutch?
>>> 
>>> Thanks,
>>> Kris
>>> 
>>> ~~
>>> Kris T. Musshorn
>>> FileMaker Developer - Contractor - Catapult Technology Inc.
>>> US Army Research Lab
>>> Aberdeen Proving Ground
>>> Application Management & Development Branch
>>> 410-278-7251
>>> kris.t.musshorn@mail.mil
>>> ~~
>>> 
>>> 
>>> 
>>> CLASSIFICATION: UNCLASSIFIED
>> 
>> 


CLASSIFICATION: UNCLASSIFIED


Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Walter Underwood
That’s good news.

It should reset the interval estimate on page change instead of slowly 
shortening it.

I’m pretty sure that Ultraseek used a bounded exponential backoff when the page 
had not changed.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 3, 2016, at 10:51 AM, Marco Scalone  wrote:
> 
> Nutch also has adaptive strategy:
> 
> This class implements an adaptive re-fetch algorithm. This works as
>> follows:
>> 
>>   - for pages that have changed since the last fetchTime, decrease their
>>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
>>   - for pages that haven't changed since the last fetchTime, increase
>>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>>   If SYNC_DELTA property is true, then:
>>  - calculate a delta = fetchTime - modifiedTime
>>  - try to synchronize with the time of change, by shifting the next
>>  fetchTime by a fraction of the difference between the last modification
>>  time and the last fetch time. I.e. the next fetch time will be set to 
>> fetchTime
>>  + fetchInterval - delta * SYNC_DELTA_RATE
>>  - if the adjusted fetch interval is bigger than the delta, then 
>> fetchInterval
>>  = delta.
>>   - the minimum value of fetchInterval may not be smaller than
>>   MIN_INTERVAL (default is 1 minute).
>>   - the maximum value of fetchInterval may not be bigger than
>>   MAX_INTERVAL (default is 365 days).
>> 
>> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
>> the algorithm, so that the fetch interval either increases or decreases
>> infinitely, with little relevance to the page changes. Please use
>> main(String[])
>> 
>> method to test the values before applying them in a production system.
>> 
> 
> From:
> https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
> 
> 
> 2016-08-03 14:45 GMT-03:00 Walter Underwood :
> 
>> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
>> in Ultraseek.
>> 
>> I think we were the only people who built an adaptive crawler for
>> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
>> to Mike Lynch. He looked at me like I had three heads and didn’t even
>> answer me.
>> 
>> Ultraseek also has great support for sites that need login. If you use
>> that, you’ll need to find a way to do that with another crawler.
>> 
>> wunder
>> Walter Underwood
>> Former Ultraseek Principal Engineer
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US)
>>  wrote:
>>> 
>>> CLASSIFICATION: UNCLASSIFIED
>>> 
>>> We are currently using ultraseek and looking to deprecate it in favor of
>> solr/nutch.
>>> Ultraseek runs all the time and auto detects when pages have changed and
>> automatically reindexes them.
>>> Is this possible with SOLR/nutch?
>>> 
>>> Thanks,
>>> Kris
>>> 
>>> ~~
>>> Kris T. Musshorn
>>> FileMaker Developer - Contractor - Catapult Technology Inc.
>>> US Army Research Lab
>>> Aberdeen Proving Ground
>>> Application Management & Development Branch
>>> 410-278-7251
>>> kris.t.musshorn@mail.mil
>>> ~~
>>> 
>>> 
>>> 
>>> CLASSIFICATION: UNCLASSIFIED
>> 
>> 



Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Marco Scalone
Nutch also has adaptive strategy:

This class implements an adaptive re-fetch algorithm. This works as
> follows:
>
>- for pages that have changed since the last fetchTime, decrease their
>fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
>- for pages that haven't changed since the last fetchTime, increase
>their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>If SYNC_DELTA property is true, then:
>   - calculate a delta = fetchTime - modifiedTime
>   - try to synchronize with the time of change, by shifting the next
>   fetchTime by a fraction of the difference between the last modification
>   time and the last fetch time. I.e. the next fetch time will be set to 
> fetchTime
>   + fetchInterval - delta * SYNC_DELTA_RATE
>   - if the adjusted fetch interval is bigger than the delta, then 
> fetchInterval
>   = delta.
>- the minimum value of fetchInterval may not be smaller than
>MIN_INTERVAL (default is 1 minute).
>- the maximum value of fetchInterval may not be bigger than
>MAX_INTERVAL (default is 365 days).
>
> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
> the algorithm, so that the fetch interval either increases or decreases
> infinitely, with little relevance to the page changes. Please use
> main(String[])
> 
> method to test the values before applying them in a production system.
>

From:
https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
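
To make the arithmetic concrete, one adjustment step under those rules
(ignoring SYNC_DELTA) works out roughly like this with the default factors; a
sketch, not Nutch's actual code:

float incFactor = 0.2f, decFactor = 0.2f;   // INC_FACTOR / DEC_FACTOR defaults
long minInterval = 60;                      // MIN_INTERVAL: 1 minute, in seconds
long maxInterval = 365L * 24 * 60 * 60;     // MAX_INTERVAL: 365 days
long fetchInterval = 30L * 24 * 60 * 60;    // current interval: 30 days

boolean changed = false;                    // page unchanged since the last fetch
fetchInterval = changed
        ? (long) (fetchInterval * (1.0f - decFactor))  // changed: fetch sooner
        : (long) (fetchInterval * (1.0f + incFactor)); // unchanged: back off
fetchInterval = Math.max(minInterval, Math.min(maxInterval, fetchInterval));
// unchanged case: 30 days * 1.2 = 36 days until the next fetch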


2016-08-03 14:45 GMT-03:00 Walter Underwood :

> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
> in Ultraseek.
>
> I think we were the only people who built an adaptive crawler for
> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
> to Mike Lynch. He looked at me like I had three heads and didn’t even
> answer me.
>
> Ultraseek also has great support for sites that need login. If you use
> that, you’ll need to find a way to do that with another crawler.
>
> wunder
> Walter Underwood
> Former Ultraseek Principal Engineer
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US)
>  wrote:
> >
> > CLASSIFICATION: UNCLASSIFIED
> >
> > We are currently using ultraseek and looking to deprecate it in favor of
> solr/nutch.
> > Ultraseek runs all the time and auto detects when pages have changed and
> automatically reindexes them.
> > Is this possible with SOLR/nutch?
> >
> > Thanks,
> > Kris
> >
> > ~~
> > Kris T. Musshorn
> > FileMaker Developer - Contractor - Catapult Technology Inc.
> > US Army Research Lab
> > Aberdeen Proving Ground
> > Application Management & Development Branch
> > 410-278-7251
> > kris.t.musshorn@mail.mil
> > ~~
> >
> >
> >
> > CLASSIFICATION: UNCLASSIFIED
>
>


Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Walter Underwood
I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler in 
Ultraseek.

I think we were the only people who built an adaptive crawler for enterprise 
use. I tried to get Ultraseek open-sourced. I made the argument to Mike Lynch. 
He looked at me like I had three heads and didn’t even answer me.

Ultraseek also has great support for sites that need login. If you use that, 
you’ll need to find a way to do that with another crawler.

wunder
Walter Underwood
Former Ultraseek Principal Engineer
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US) 
>  wrote:
> 
> CLASSIFICATION: UNCLASSIFIED
> 
> We are currently using ultraseek and looking to deprecate it in favor of 
> solr/nutch.
> Ultraseek runs all the time and auto detects when pages have changed and 
> automatically reindexes them.
> Is this possible with SOLR/nutch?
> 
> Thanks,
> Kris
> 
> ~~
> Kris T. Musshorn
> FileMaker Developer - Contractor - Catapult Technology Inc.  
> US Army Research Lab 
> Aberdeen Proving Ground 
> Application Management & Development Branch 
> 410-278-7251
> kris.t.musshorn@mail.mil
> ~~
> 
> 
> 
> CLASSIFICATION: UNCLASSIFIED



SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Musshorn, Kris T CTR USARMY RDECOM ARL (US)
CLASSIFICATION: UNCLASSIFIED

We are currently using ultraseek and looking to deprecate it in favor of 
solr/nutch.
Ultraseek runs all the time and auto detects when pages have changed and 
automatically reindexes them.
Is this possible with SOLR/nutch?

Thanks,
Kris

~~
Kris T. Musshorn
FileMaker Developer - Contractor - Catapult Technology Inc.  
US Army Research Lab 
Aberdeen Proving Ground 
Application Management & Development Branch 
410-278-7251
kris.t.musshorn@mail.mil
~~



CLASSIFICATION: UNCLASSIFIED

My solr server finishes itself

2016-08-03 Thread Julien VIELLE
Hello,

I'm facing a strange problem: my Solr server stops itself randomly.

with the message:

Graceful shutdown SocketConnector@0.0.0.0:8983

You will find my solr.log attached.


I don't understand why; there is no crontab running. There is nothing in my
log telling why Solr shut itself down. Any help would be very helpful.

Thanks in advance


Suspicious message with attachment

2016-08-03 Thread help
The following message addressed to you was quarantined because it likely 
contains a virus:

Subject: My solr server finishes itself
From: Julien VIELLE 

However, if you know the sender and are expecting an attachment, please reply 
to this message, and we will forward the quarantined message to you.


QParsePlugin not working on sharded collection

2016-08-03 Thread tedsolr
I'm trying to verify that a very simple custom post filter will work on a
sharded collection. So far it doesn't. Here are the search results on my
single shard test collection:

{
  "responseHeader": {
"status": 0,
"QTime": 17
  },
  "thecountis": "946028",
  "myvar": "hello",
  "response": {
"numFound": 946028,
"start": 0,
"docs": [
...]
}

When I run against a two shard collection (same data set) it's as though the
post filter doesn't exist. The results don't include my additions to the
response:

{
  "responseHeader": {
"status": 0,
"QTime": 17
  },
  "response": {
"numFound": 946028,
"start": 0,
"docs": [
...]
}

Here's the solrconfig.xml:

...
<lst name="appends">
   <str name="fq">{!TedFilter myvar=hello}</str>
</lst>
...

And here's the simplest plugin I could write:

public class TedPlugin extends QParserPlugin {
@Override
public void init(NamedList arg0) {
}

@Override
public QParser createParser(String arg0, final SolrParams arg1, final
SolrParams arg2, final SolrQueryRequest arg3) {
return new QParser(arg0, arg1, arg2, arg3) {

@Override
public Query parse() throws SyntaxError {
return new TedQuery(arg1, arg2, arg3);
}
};
}
}

public class TedQuery extends AnalyticsQuery {
private final String myvar;

TedQuery(SolrParams localParams, SolrParams params, SolrQueryRequest 
req) {
myvar = localParams.get("myvar");
}

@Override
public DelegatingCollector getAnalyticsCollector(ResponseBuilder rb,
IndexSearcher searcher) {
return new TedCollector(myvar, rb);
}

@Override
public boolean equals(Object o) {
if (o instanceof TedQuery) {
TedQuery tq = (TedQuery) o;
return Objects.equals(this.myvar, tq.myvar);
}
return false;
}

@Override
public int hashCode() {
return myvar == null ? 1 : myvar.hashCode();
}


class TedCollector extends DelegatingCollector {
ResponseBuilder rb;
int count;
String myvar;

public TedCollector(String myvar, ResponseBuilder rb) {
this.rb = rb;
this.myvar = myvar;
}

@Override
public void collect(int doc) throws IOException {
count++;
super.collect(doc);
}

@Override
public void finish() throws IOException {
rb.rsp.add("thecountis", String.valueOf(count));
rb.rsp.add("myvar", myvar);

if (super.delegate instanceof DelegatingCollector) {
((DelegatingCollector) super.delegate).finish();
}
}
}
}

What am I doing wrong? Thanks!
Ted
v5.2.1 SolrCloud mode



--
View this message in context: 
http://lucene.472066.n3.nabble.com/QParsePlugin-not-working-on-sharded-collection-tp4290249.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Replication with managed resources?

2016-08-03 Thread rosbaldeston
I was just running my own test and it seems it doesn't replicate or reload
the managed schema synonyms file. Not on a manual replication request after
a synonym change and not on an index change triggering an automatic
replication at least.

Used this as the slave's confFiles; not sure if this allows globs for the
language variants?

  solrconfig.xml,managed-schema,_schema_analysis_stopwords_english.json,_schema_analysis_synonyms_english.json
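
For reference, confFiles is declared in the master section of the
ReplicationHandler, and the slave then pulls the listed files; a rough sketch,
with the replicateAfter value assumed:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">solrconfig.xml,managed-schema,_schema_analysis_stopwords_english.json,_schema_analysis_synonyms_english.json</str>
  </lst>
</requestHandler>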

This is with a Solr 5.5, new schemas for both master & slave and all on
Centos 6.5 with Java 7.






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Replication-with-managed-resources-tp4289880p4290248.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Replication with managed resources?

2016-08-03 Thread Erick Erickson
bq: I'm also guessing those _schema and managed_schema files are an
implementation detail for the missing zookeeper functionality. But if I did
add those to a conffiles option it might automate the slave core reloads for
me?

You're getting closer ;). There's nothing Cloud specific about the whole
managed schema functionality, although that is where it's gotten the most
exercise so far.

So if you're saying that you change the managed schema file on the
master and it is _not_ replicated automatically to the slave (you'll
have to have added docs I believe on the master, I don't think replication
happens just because of config changes) then I think that's worth a
JIRA, can you please confirm?

So if this sequence doesn't work:
1> change the managed schema
2> index some docs
3> wait for a replication
4> the managed schema file on the slave has _not_ been updated

then please raise a JIRA. Make sure you identify that this is stand-alone.
NOTE: I'm not sure what the right thing to do in this case is, but the JIRA
would allow a place to discuss what "the right thing" would be.

In the meantime, you should be able to work around that by explicitly listing
them in the conffiles section.

Best,
Erick

On Wed, Aug 3, 2016 at 8:58 AM, rosbaldeston  wrote:
> Erick Erickson wrote
>> It Depends. When running in Cloud mode then "yes". If you're running
>> stand-alone
>> then there is no Zookeeper running so the answer is "no".
>
> Ah that helps, so no zookeeper in my case. I did wonder if it wasn't just
> sharing the same config files between master and slave from sharing the same
> configset. So it would appear I'm not replicating any of the managed files
> and reloading the slave core probably just reread the shared synonyms file.
>
> I'm also guessing those _schema and managed_schema files are an
> implementation detail for the missing zookeeper functionality. But if I did
> add those to a conffiles option it might automate the slave core reloads for
> me?
>
>
>> If a replication involved downloading of at least one configuration file,
>> the ReplicationHandler issues a core-reload command instead of a commit
>> command.
>
> (from https://cwiki.apache.org/confluence/display/solr/Index+Replication)
>
> Currently I've no conffiles set on the slave and I know it didn't get
> reloaded after synonym changes to the master.
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Replication-with-managed-resources-tp4289880p4290242.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Suspicious message with attachment

2016-08-03 Thread help
The following message addressed to you was quarantined because it likely 
contains a virus:

Subject: My solr server finishes itself
From: Julien Vielle 

However, if you know the sender and are expecting an attachment, please reply 
to this message, and we will forward the quarantined message to you.


Re: EmbeddedSolrServer problem when using one-jar-with-dependency including solr

2016-08-03 Thread Steve Rowe
Oh, then likely the problem is that your uberjar packing tool doesn’t know how 
to (or maybe isn’t configured to?) include/merge/translate resources under 
META-INF/services/.  E.g. lucene/core module has SPI files there.

Info on the maven shade plugin’s configuration for this stuff is here: 

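
If it is the shade plugin, the piece that merges META-INF/services/ files is
the ServicesResourceTransformer; a minimal sketch:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <transformers>
      <!-- concatenates META-INF/services/* SPI files instead of keeping only one -->
      <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
    </transformers>
  </configuration>
</plugin>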

--
Steve
www.lucidworks.com

> On Aug 3, 2016, at 5:26 AM, Ziqi Zhang  wrote:
> 
> Thanks
> 
> I am not sure if Steve's suggestion was the right solution. Even when I had 
> not explicitly defined the dependency on lucene, I can see that the 
> packaged jar still contains org.apache.lucene.
> 
> What solved my problem is to not pack a single jar but use a folder of 
> individual jars. I am not sure why though.
> 
> Regards
> 
> 
> On 02/08/2016 21:53, Rohit Kanchan wrote:
>> We also faced the same issue when we were running the embedded Solr 6.1 server.
>> Actually I faced the same in our integration environment after deploying the
>> project. Solr 6.1 is using HttpClient 4.4.1, which I think the embedded Solr
>> server is looking for. I think when the Solr core is getting loaded, an old
>> http client is getting loaded from somewhere in your Maven build. Check the
>> dependency tree of your pom.xml and see if you can exclude this jar getting
>> loaded from anywhere else. Just exclude it in your pom.xml. I hope this
>> solves your issue.
>> 
>> 
>> Thanks
>> Rohit
>> 
>> 
>> On Tue, Aug 2, 2016 at 9:44 AM, Steve Rowe  wrote:
>> 
>>> solr-core[1] and solr-solrj[2] POMs have parent POM solr-parent[3], which
>>> in turn has parent POM lucene-solr-grandparent[4], which has a
>>>  section that specifies dependency versions &
>>> exclusions *for all direct dependencies*.
>>> 
>>> The intent is for all Lucene/Solr’s internal dependencies to be managed
>>> directly, rather than through Maven’s transitive dependency mechanism.  For
>>> background, see summary & comments on JIRA issue LUCENE-5217[5].
>>> 
>>> I haven’t looked into how this affects systems that depend on Lucene/Solr
>>> artifacts, but it appears to be the case that you can’t use Maven’s
>>> transitive dependency mechanism to pull in all required dependencies for
>>> you.
>>> 
>>> BTW, if you look at the grandparent POM, the httpclient version for Solr
>>> 6.1.0 is declared as 4.4.1.  I don’t know if depending on version 4.5.2 is
>>> causing problems, but if you don’t need a feature in 4.5.2, I suggest that
>>> you depend on the same version as Solr does.
>>> 
>>> For error #2, you should depend on lucene-core[6].
>>> 
>>> My suggestion as a place to start: copy/paste the dependencies from
>>> solr-core[1] and solr-solrj[2] POMs, and leave out stuff you know you won’t
>>> need.
>>> 
>>> [1] <
>>> https://repo1.maven.org/maven2/org/apache/solr/solr-core/6.1.0/solr-core-6.1.0.pom
>>> [2] <
>>> https://repo1.maven.org/maven2/org/apache/solr/solr-solrj/6.1.0/solr-solrj-6.1.0.pom
>>> [3] <
>>> https://repo1.maven.org/maven2/org/apache/solr/solr-parent/6.1.0/solr-parent-6.1.0.pom
>>> [4] <
>>> https://repo1.maven.org/maven2/org/apache/lucene/lucene-solr-grandparent/6.1.0/lucene-solr-grandparent-6.1.0.pom
>>> [5] 
>>> [6] <
>>> http://search.maven.org/#artifactdetails|org.apache.lucene|lucene-core|6.1.0|jar
>>> --
>>> Steve
>>> www.lucidworks.com
>>> 
 On Aug 2, 2016, at 12:03 PM, Ziqi Zhang 
>>> wrote:
 Hi, I am using Solr, Solrj 6.1, and Maven to manage my project. I use
>>> maven to build a jar-with-dependency and run a java program pointing its
>>> classpath to this jar. However I keep getting errors even when I just try
>>> to create an instance of EmbeddedSolrServer:
 /* code */
 String solrHome = "/home/solr/";
 String solrCore = "fw";
 solrCores = new EmbeddedSolrServer(
         Paths.get(solrHome), solrCore
 ).getCoreContainer();
 /* end code */
 
 
 My project has dependencies defined in the pom shown below:  **When
>>> block A is not present**, running the code that calls:
 /* pom */
 <dependency>
     <groupId>org.apache.jena</groupId>
     <artifactId>jena-arq</artifactId>
     <version>3.0.1</version>
 </dependency>

 <!-- BLOCK A -->
 <dependency>
     <groupId>org.apache.httpcomponents</groupId>
     <artifactId>httpclient</artifactId>
     <version>4.5.2</version>
 </dependency>
 <!-- BLOCK A ENDS -->

 <dependency>
     <groupId>org.apache.solr</groupId>
     <artifactId>solr-core</artifactId>
     <version>6.1.0</version>
     <exclusions>
         <exclusion>
             <groupId>org.slf4j</groupId>
             <artifactId>slf4j-log4j12</artifactId>
         </exclusion>
         <exclusion>
             <groupId>log4j</groupId>
             <artifactId>log4j</artifactId>
         </exclusion>
     </exclusions>
 </dependency>

Re: Replication with managed resources?

2016-08-03 Thread rosbaldeston
Erick Erickson wrote
> It Depends. When running in Cloud mode then "yes". If you're running
> stand-alone
> then there is no Zookeeper running so the answer is "no".

Ah that helps, so no zookeeper in my case. I did wonder if it wasn't just
sharing the same config files between master and slave from sharing the same
configset. So it would appear I'm not replicating any of the managed files
and reloading the slave core probably just reread the shared synonyms file.

I'm also guessing those _schema and managed_schema files are an
implementation detail for the missing zookeeper functionality. But if I did
add those to a conffiles option it might automate the slave core reloads for
me? 


> If a replication involved downloading of at least one configuration file,
> the ReplicationHandler issues a core-reload command instead of a commit
> command.

(from https://cwiki.apache.org/confluence/display/solr/Index+Replication)

Currently I've no conffiles set on the slave and I know it didn't get
reloaded after synonym changes to the master.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Replication-with-managed-resources-tp4289880p4290242.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Sort Facet Values by "Interestingness"?

2016-08-03 Thread Joel Bernstein
You first gather the candidates and then call the TermsComponent with a
callback. The scoreNodes expression does this and it's very fast because
Streaming expressions are run from a Solr node in the same cluster.

The TermsComponent will return the global docFreq for the terms and global
numDocs for the collection, so you'll be able to compute idf for each term.
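
Per candidate term, the score is then something like this (a sketch; the
variable names are placeholders for counts already fetched):

double idf = Math.log((double) numDocs / (double) (docFreq + 1));  // global idf
double score = bucketDocFreq * idf;  // tf-idf style "interestingness"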










Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Aug 3, 2016 at 11:22 AM, Ben Heuwing 
wrote:

> Hi Joel,
>
> thank you, this sounds great!
>
> As to your first proposal: I am a bit out of my depth here, as I have not
> worked with streaming expressions so far. But I will try out your example
> using the facet() expression on a simple use case as soon as you publish it.
>
> Using the TermsComponent directly, would that imply that I have to
> retrieve all possible candidates and then send them back as a terms.list
> to get their df? However, I assume that this would still be faster than
> having 2 repeated facet-calls. Or did you suggest to use the component in a
> customized RequestHandler?
>
> Regards,
>
> Ben
>
>
> Am 03.08.2016 um 14:57 schrieb Joel Bernstein:
>
>> Also the TermsComponent now can export the docFreq for a list of terms and
>> the numDocs for the index. This can be used as a general purpose mechanism
>> for scoring facets with a callback.
>>
>> https://issues.apache.org/jira/browse/SOLR-9243
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Wed, Aug 3, 2016 at 8:52 AM, Joel Bernstein
>> wrote:
>>
>> What you're describing is implemented with Graph aggregations in this
>>> ticket using tf-idf. Other scoring methods can be implemented as well.
>>>
>>> https://issues.apache.org/jira/browse/SOLR-9193
>>>
>>> I'll update this thread with a description of how this can be used with
>>> the facet() streaming expression as well as with graph queries later
>>> today.
>>>
>>>
>>>
>>> Joel Bernstein
>>> http://joelsolr.blogspot.com/
>>>
>>> On Wed, Aug 3, 2016 at 8:18 AM,  wrote:
>>>
>>> Dear everybody,

 as the JSON-API now makes configuration of facets and sub-facets easier,
 there appears to be a lot of potential to enable instant calculation of
 facet-recommendations for a query, that is, to sort facets by their
  relative importance/interestingness/significance for a current query
 relative
 to the complete collection or relative to a result set defined by a
 different query.

 An example would be to show the most typical terms which are used in
 descriptions of horror-movies, in contrast to the most popular ones for
 this query, as these may include terms that occur as often in other
 genres.

 This feature has been discussed earlier in the context of solr:
 *

 http://stackoverflow.duapp.com/questions/26399264/how-can-i-sort-facets-by-their-tf-idf-score-rather-than-popularity
 *

 http://lucene.472066.n3.nabble.com/Facets-with-an-IDF-concept-td504070.html

 In elasticsearch, the specific feature that I am looking for is called
 Significant Terms Aggregation:

 https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#search-aggregations-bucket-significantterms-aggregation

 As of now, I have two questions:

 a) Are there workarounds in the current solr-implementation or known
 patches that implement such a sort-option for fields with a large
 number of
 possible values, e.g. text-fields? (for smaller vocabularies it is easy
 to
 do this client-side with two queries)
 b) Are there plans to implement this in facet.pivot or in the
 facet.json-API?

 The first step could be to define "interestingness" as a sort-option for
 facets and to define interestingness as facet-count in the result-set as
 compared to the complete collection: documentfrequency_termX(bucket) *
 inverse_documentfrequency_termX(collection)

 As an extension, the JSON-API could be used to change the domain used as
 base for the comparison. Another interesting option would be to compare
 facet-counts against a current parent-facet for nested facets, e.g. the
 5
 most interesting terms by genre for a query on 70s movies, returning the
 terms specific to horror, comedy, action etc. compared to all
 terminology
 at the time (i.e. in the parent-query).

 A call-back-function could be used to define other measures of
 interestingness such as the log-likelihood-ratio (
 http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html).
 Most
 measures need at least the following 4 values: document-frequency for a
 term for the result-set, document-frequency for the result-set,
 document-frequency for a term in the index (or base-domain),
 document-frequency in the index (or 

Re: Sort Facet Values by "Interestingness"?

2016-08-03 Thread Ben Heuwing

Hi Joel,

thank you, this sounds great!

As to your first proposal: I am a bit out of my depth here, as I have 
not worked with streaming expressions so far. But I will try out your 
example using the facet() expression on a simple use case as soon as you 
publish it.


Using the TermsComponent directly, would that imply that I have to 
retrieve all possible candidates and then send them back as a 
terms.list to get their df? However, I assume that this would still be 
faster than having 2 repeated facet-calls. Or did you suggest to use the 
component in a customized RequestHandler?


Regards,

Ben

Am 03.08.2016 um 14:57 schrieb Joel Bernstein:

Also the TermsComponent now can export the docFreq for a list of terms and
the numDocs for the index. This can be used as a general purpose mechanism
for scoring facets with a callback.

https://issues.apache.org/jira/browse/SOLR-9243

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Aug 3, 2016 at 8:52 AM, Joel Bernstein  wrote:


What you're describing is implemented with Graph aggregations in this
ticket using tf-idf. Other scoring methods can be implemented as well.

https://issues.apache.org/jira/browse/SOLR-9193

I'll update this thread with a description of how this can be used with
the facet() streaming expression as well as with graph queries later today.



Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Aug 3, 2016 at 8:18 AM,  wrote:


Dear everybody,

as the JSON-API now makes configuration of facets and sub-facets easier,
there appears to be a lot of potential to enable instant calculation of
facet-recommendations for a query, that is, to sort facets by their
relative importance/interestingness/significance for a current query relative
to the complete collection or relative to a result set defined by a
different query.

An example would be to show the most typical terms which are used in
descriptions of horror-movies, in contrast to the most popular ones for
this query, as these may include terms that occur as often in other genres.

This feature has been discussed earlier in the context of solr:
*
http://stackoverflow.duapp.com/questions/26399264/how-can-i-sort-facets-by-their-tf-idf-score-rather-than-popularity
*
http://lucene.472066.n3.nabble.com/Facets-with-an-IDF-concept-td504070.html

In elasticsearch, the specific feature that I am looking for is called
Significant Terms Aggregation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#search-aggregations-bucket-significantterms-aggregation

As of now, I have two questions:

a) Are there workarounds in the current solr-implementation or known
patches that implement such a sort-option for fields with a large number of
possible values, e.g. text-fields? (for smaller vocabularies it is easy to
do this client-side with two queries)
b) Are there plans to implement this in facet.pivot or in the
facet.json-API?

The first step could be to define "interestingness" as a sort-option for
facets and to define interestingness as facet-count in the result-set as
compared to the complete collection: documentfrequency_termX(bucket) *
inverse_documentfrequency_termX(collection)

As an extension, the JSON-API could be used to change the domain used as
base for the comparison. Another interesting option would be to compare
facet-counts against a current parent-facet for nested facets, e.g. the 5
most interesting terms by genre for a query on 70s movies, returning the
terms specific to horror, comedy, action etc. compared to all terminology
at the time (i.e. in the parent-query).

A call-back-function could be used to define other measures of
interestingness such as the log-likelihood-ratio (
http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html). Most
measures need at least the following 4 values: document-frequency for a
term for the result-set, document-frequency for the result-set,
document-frequency for a term in the index (or base-domain),
document-frequency in the index (or base-domain).
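
For illustration, Dunning's G^2 (the log-likelihood ratio) can be computed from
exactly those four counts; a sketch in Java, not taken from any existing patch:

static double llr(long termInBucket, long bucketDocs, long termInIndex, long indexDocs) {
    long k11 = termInBucket;                  // term present, inside the result set
    long k12 = bucketDocs - termInBucket;     // term absent, inside the result set
    long k21 = termInIndex - termInBucket;    // term present, outside the result set
    long k22 = indexDocs - bucketDocs - k21;  // term absent, outside the result set
    double n = (double) indexDocs;            // the four cells sum to the index size
    return 2.0 * (cell(k11, (double) (k11 + k12) * (k11 + k21) / n)
                + cell(k12, (double) (k11 + k12) * (k12 + k22) / n)
                + cell(k21, (double) (k21 + k22) * (k11 + k21) / n)
                + cell(k22, (double) (k21 + k22) * (k12 + k22) / n));
}

static double cell(double observed, double expected) {
    return observed == 0 ? 0.0 : observed * Math.log(observed / expected);  // O * ln(O/E)
}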

I guess, this feature might be of interest for those who want to do some
small-scale term-analysis in addition to search, e.g. as in my case in
digital humanities projects. But it might also be an interesting navigation
device, e.g. when searching on job-offers to show the skills that are most
distinctive for a category.

It would be great to know, if others are interested in this feature. If
there are any implementations out there or if anybody else is working on
this, a pointer would be a great start. In the absence of existing
solutions: Perhaps somebody has some idea on where and how to start
implementing this?

Best regards,

Ben





--

Ben Heuwing, Dr. phil.
Wissenschaftlicher Mitarbeiter
Institut für Informationswissenschaft und Sprachtechnologie
Universität Hildesheim

Postanschrift:
Universitätsplatz 1
D-31141 Hildesheim


Büro:
Lübeckerstraße 3
Raum L017

+49(0)5121 

RE: Installing Solr with Ivy

2016-08-03 Thread Davis, Daniel (NIH/NLM) [C]
I think the free versions of either Artifactory or Sonatype Nexus could serve as 
this cache in a very effective, cloud-ready way.   This way, you would 
not be dependent on shared directories.   You would just need some task to 
pull down Solr and its checksums and publish them into the repository.

I've done PHP, but I shudder.   I know you do a lot with it - we've looked at VuFind 
for our discovery layer here at NLM.

-Original Message-
From: Demian Katz [mailto:demian.k...@villanova.edu] 
Sent: Wednesday, August 03, 2016 9:31 AM
To: solr-user@lucene.apache.org
Subject: RE: Installing Solr with Ivy

Dan,

In case you, or anyone else, is interested, let me share my current 
solution-in-progress:

https://github.com/vufind-org/vufind/pull/769

I've written a Phing task for my project (Phing is the PHP equivalent of Ant) 
which takes some loose inspiration from your Ant download task. The task uses a 
local directory to cache Solr distributions and only hits Apache servers if the 
cache lacks the requested version. This cache can be retained on my continuous 
integration and development servers, so I think this should get me the effect I 
desire without putting an unreasonable amount of load on the archive servers. 
I'd still love in theory to find a solution that's a little more future-proof 
than "build a URL and download from it," but for now, I think this will get me 
through.

Thanks again!

- Demian

-Original Message-
From: Davis, Daniel (NIH/NLM) [C] [mailto:daniel.da...@nih.gov] 
Sent: Tuesday, August 02, 2016 11:33 AM
To: solr-user@lucene.apache.org
Subject: RE: Installing Solr with Ivy

Demian,

I've long meant to upload my own "automated installation" - it is ant without 
ivy, but with checksums.   I suppose gpg signatures could also be worked in.
It is only semi-automated, because our DevOps group does not have root, but 
here is a clean version - https://github.com/danizen/solr-ant-install

System administrators prepare the environment:
- create a directory for solr (/opt/solr) and logs (/var/logs/solr), maybe a 
different volume for solr data
- create an administrative user with a shell (owns the code)
- create an operational user who runs solr (no shell, cannot modify the code)
- install the initscripts
- set up sudoers rules

The installation this supports is very, very small, and I do not intend to 
support the cleaned version of this going forward.   I will update the 
README.md to make that clear.

I agree with your summary of the difference.   One more aspect of 
maturity/fullness of solution - MySQL/PostgreSQL etc. support multiple projects 
on the same server, at least administratively.   Solr is getting there, but 
until role-based access control (RBAC) is strong enough out-of-the-box, it is 
hard to set up a *shared* Solr server.    Yet it is very common to do that with 
database servers, and in fact doing this is a common way to avoid siloed 
applications.Unfortunately, HTTP auth is not quite good enough for me; but 
it is only my own fault I haven't contributed something more.

Dan Davis, Systems/Applications Architect (Contractor), Office of Computer and 
Communications Systems, National Library of Medicine, NIH







-Original Message-
From: Demian Katz [mailto:demian.k...@villanova.edu]
Sent: Tuesday, August 02, 2016 8:37 AM
To: solr-user@lucene.apache.org
Subject: RE: Installing Solr with Ivy

Thanks, Shawn, for confirming my suspicions.

Regarding your question about how Solr differs from a database server, I agree 
with you in theory, but the problem is in the practice: there are very easy, 
familiar, well-established techniques for installing and maintaining database 
platforms, and these platforms are mature enough that they evolve slowly and 
most versions are closely functionally equivalent to one another. Solr is 
comparatively young (not immature, but young).

Solr still (as far as I can tell) lacks standard package support in the default 
repos of the major Linux distros, and frequently breaks backward compatibility 
between versions in large and small ways (particularly in the internal API, but 
sometimes also in the configuration files). Those are not intended as 
criticisms of Solr -- they're to a large extent positive signs of activity and 
growth -- but they are, as far as I can tell, the current realities of working 
with the software.

For a developer with the right experience and knowledge, it's no big deal to 
navigate these challenges. However, my package is designed to be friendly to a 
less experienced, more generalized non-technical audience, and bundling Solr in 
the package instead of trying to guide the user through a potentially confusing 
manual installation process greatly simplifies the task of getting things up 
and running, saving me from having to field support emails from people who 
can't figure out how to install Solr on their platform, or those who end up 
with a version that's incompatible with my project's configurations and custom 
handlers.

RE: Where should the listeners be defined in solrconfig.xml (Solr 6.0.1)

2016-08-03 Thread Alexandre Drouin
That's good to know, thanks!

I should have thought to check the Java code before asking.


Alexandre Drouin


-Original Message-
From: Mikhail Khludnev [mailto:m...@apache.org] 
Sent: August 3, 2016 10:52 AM
To: solr-user 
Subject: Re: Where should the listeners be defined in solrconfig.xml (Solr 
6.0.1)
Importance: High

As far as I remember the code, it captures <listener> everywhere:
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/core/SolrConfig.java#L331
Double slash "//listener" means "everywhere".

On Wed, Aug 3, 2016 at 4:38 PM, Alexandre Drouin < 
alexandre.dro...@orckestra.com> wrote:

> Does anyone know where the listeners should be defined in solrconfig.xml?
>
>
> Alexandre Drouin
>
>
> -Original Message-
> From: Alexandre Drouin [mailto:alexandre.dro...@orckestra.com]
> Sent: July 29, 2016 10:46 AM
> To: solr-user@lucene.apache.org
> Subject: Where should the listeners be defined in solrconfig.xml (Solr
> 6.0.1)
> Importance: High
>
> Hello,
>
> I was wondering where I should put the <listener> configuration.  I 
> can see from the sample solrconfig.xml that they are defined under the 
> <updateHandler> and <query> elements.
> The Schema API for listeners does not specify a parent of type 
> updateHandler or query, so I wanted to know if I can also define them 
> directly under the root of the XML document (<config>)?
>
> Alexandre Drouin
>
>


--
Sincerely yours
Mikhail Khludnev


Re: Replication with managed resources?

2016-08-03 Thread Erick Erickson
bq: Am I right in saying managed resources are handled by zookeeper rather than
files on the filesystem

It Depends. When running in Cloud mode then "yes". If you're running stand-alone
then there is no Zookeeper running so the answer is "no".

You can run Solr just like you always have in master/slave setups. In
that case you
need to manage your own configurations on every node just like you always have,
probably through replication.

In stand-alone mode, you should send all your managed schema API calls to the
master core and let the replication distribute the changes to the slaves.
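
For illustration, a minimal sketch of that flow in plain Java (hostnames, core name
and synonym payload are made up; it uses the managed synonyms REST endpoint and the
CoreAdmin RELOAD action, and reloading slaves matches the observation elsewhere in
this thread that slaves only pick up managed-resource changes after a reload):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SyncManagedSynonyms {
  public static void main(String[] args) throws Exception {
    // 1) add a synonym mapping on the master via the managed-resource REST API
    URL master = new URL(
        "http://master:8983/solr/mycore/schema/analysis/synonyms/english");
    HttpURLConnection put = (HttpURLConnection) master.openConnection();
    put.setRequestMethod("PUT");
    put.setDoOutput(true);
    put.setRequestProperty("Content-Type", "application/json");
    try (OutputStream os = put.getOutputStream()) {
      os.write("{\"mad\":[\"angry\",\"upset\"]}".getBytes("UTF-8"));
    }
    System.out.println("master: HTTP " + put.getResponseCode());

    // 2) once replication has copied the change, reload each slave core so
    //    its analysis chain re-reads the managed resource
    for (String host : new String[] {"slave1", "slave2"}) {
      URL reload = new URL("http://" + host
          + ":8983/solr/admin/cores?action=RELOAD&core=mycore");
      HttpURLConnection get = (HttpURLConnection) reload.openConnection();
      System.out.println(host + ": HTTP " + get.getResponseCode());
    }
  }
}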

Best,
Erick

On Wed, Aug 3, 2016 at 4:48 AM, rosbaldeston  wrote:
> Am I right in saying managed resources are handled by zookeeper rather than
> files on the filesystem and I should ignore any files such as:
> managed-schema,   _rest_managed.json,
> _schema_analysis_stopwords_english.json,
> _schema_analysis_synonyms_english.json ...
>
> I should not try to copy any of these via the slaves confFiles option?
>
> What I was planning to do was have the master as the indexing source and all
> slaves as query sources. But they need the same synonyms & stopwords.
>
> One thing I am seeing is that when I create my master and slave from a custom
> configset, without any copying of configs, and the master's synonyms are
> changed, the synonyms on the slave don't reflect these changes even some time
> after replication.
>
> It appears I need to reload the slave core(s) before they show the same
> synonyms as the master? Is this because they're sharing the same file? How
> should I keep slaves in sync with managed resources? Do I just have to
> keep reloading all slave cores every so often?
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Replication-with-managed-resources-tp4289880p4290177.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Where should the listeners be defined in solrconfig.xml (Solr 6.0.1)

2016-08-03 Thread Mikhail Khludnev
As far as I remember the code, it captures <listener> everywhere:
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/core/SolrConfig.java#L331
Double slash "//listener" means "everywhere".
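
A tiny self-contained illustration of that XPath behavior (the XML snippet is
invented; "//listener" matches <listener> elements at any depth):

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ListenerXPath {
  public static void main(String[] args) throws Exception {
    String xml = "<config><updateHandler><listener event=\"postCommit\"/></updateHandler>"
               + "<query><listener event=\"firstSearcher\"/></query></config>";
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
    // descendant-or-self axis: finds listeners under both parents
    NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
        .evaluate("//listener", doc, XPathConstants.NODESET);
    System.out.println(nodes.getLength()); // prints 2
  }
}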

On Wed, Aug 3, 2016 at 4:38 PM, Alexandre Drouin <
alexandre.dro...@orckestra.com> wrote:

> Does anyone know where the listeners should be defined in solrconfig.xml?
>
>
> Alexandre Drouin
>
>
> -Original Message-
> From: Alexandre Drouin [mailto:alexandre.dro...@orckestra.com]
> Sent: July 29, 2016 10:46 AM
> To: solr-user@lucene.apache.org
> Subject: Where should the listeners be defined in solrconfig.xml (Solr
> 6.0.1)
> Importance: High
>
> Hello,
>
> I was wondering where I should put the <listener> configuration.  I can
> see from the sample solrconfig.xml that they are defined under the
> <updateHandler> and <query> elements.
> The Schema API for listeners does not specify a parent of type
> updateHandler or query, so I wanted to know if I can also define them directly
> under the root of the XML document (<config>)?
>
> Alexandre Drouin
>
>


-- 
Sincerely yours
Mikhail Khludnev


RE: Where should the listeners be defined in solrconfig.xml (Solr 6.0.1)

2016-08-03 Thread Alexandre Drouin
Does anyone know where the listeners should be defined in solrconfig.xml?


Alexandre Drouin


-Original Message-
From: Alexandre Drouin [mailto:alexandre.dro...@orckestra.com] 
Sent: July 29, 2016 10:46 AM
To: solr-user@lucene.apache.org
Subject: Where should the listeners be defined in solrconfig.xml (Solr 6.0.1)
Importance: High

Hello,

I was wondering where I should put the <listener> configuration.  I can see 
from the sample solrconfig.xml that they are defined under the <updateHandler> 
and <query> elements.  
The Schema API for listeners does not specify a parent of type updateHandler or 
query, so I wanted to know if I can also define them directly under the root of the 
XML document (<config>)? 

Alexandre Drouin



RE: Installing Solr with Ivy

2016-08-03 Thread Demian Katz
Dan,

In case you, or anyone else, is interested, let me share my current 
solution-in-progress:

https://github.com/vufind-org/vufind/pull/769

I've written a Phing task for my project (Phing is the PHP equivalent of Ant) 
which takes some loose inspiration from your Ant download task. The task uses a 
local directory to cache Solr distributions and only hits Apache servers if the 
cache lacks the requested version. This cache can be retained on my continuous 
integration and development servers, so I think this should get me the effect I 
desire without putting an unreasonable amount of load on the archive servers. 
I'd still love in theory to find a solution that's a little more future-proof 
than "build a URL and download from it," but for now, I think this will get me 
through.
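
For anyone wanting the gist without reading the pull request, here is a rough Java
sketch of the cache-then-download idea (cache location and URL pattern are
assumptions based on the Apache archive layout; a real task should also verify the
published checksums, as discussed elsewhere in this thread):

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SolrDistCache {
  static Path fetch(String version, Path cacheDir) throws Exception {
    Path cached = cacheDir.resolve("solr-" + version + ".tgz");
    if (Files.exists(cached)) {
      return cached;                 // cache hit: no load on the archive servers
    }
    // cache miss: build the archive URL and download exactly once
    String url = "https://archive.apache.org/dist/lucene/solr/"
        + version + "/solr-" + version + ".tgz";
    Files.createDirectories(cacheDir);
    try (InputStream in = new URL(url).openStream()) {
      Files.copy(in, cached, StandardCopyOption.REPLACE_EXISTING);
    }
    // TODO: fetch the matching .sha512/.asc and verify before trusting the file
    return cached;
  }
}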

Thanks again!

- Demian

-Original Message-
From: Davis, Daniel (NIH/NLM) [C] [mailto:daniel.da...@nih.gov] 
Sent: Tuesday, August 02, 2016 11:33 AM
To: solr-user@lucene.apache.org
Subject: RE: Installing Solr with Ivy

Demian,

I've long meant to upload my own "automated installation" - it is ant without 
ivy, but with checksums.   I suppose gpg signatures could also be worked in.
It is only semi-automated, because our DevOps group does not have root, but 
here is a clean version - https://github.com/danizen/solr-ant-install

System administrators prepare the environment:
- create a directory for solr (/opt/solr) and logs (/var/logs/solr), maybe a 
different volume for solr data
- create an administrative user with a shell (owns the code)
- create an operational user who runs solr (no shell, cannot modify the code)
- install the initscripts
- set up sudoers rules

The installation this supports is very, very small, and I do not intend to 
support the cleaned version of this going forward.   I will update the 
README.md to make that clear.

I agree with your summary of the difference.   One more aspect of 
maturity/fullness of solution - MySQL/PostgreSQL etc. support multiple projects 
on the same server, at least administratively.   Solr is getting there, but 
until role-based access control (RBAC) is strong enough out-of-the-box, it is 
hard to set up a *shared* Solr server.    Yet it is very common to do that with 
database servers, and in fact doing this is a common way to avoid siloed 
applications.Unfortunately, HTTP auth is not quite good enough for me; but 
it is only my own fault I haven't contributed something more.

Dan Davis, Systems/Applications Architect (Contractor), Office of Computer and 
Communications Systems, National Library of Medicine, NIH







-Original Message-
From: Demian Katz [mailto:demian.k...@villanova.edu]
Sent: Tuesday, August 02, 2016 8:37 AM
To: solr-user@lucene.apache.org
Subject: RE: Installing Solr with Ivy

Thanks, Shawn, for confirming my suspicions.

Regarding your question about how Solr differs from a database server, I agree 
with you in theory, but the problem is in the practice: there are very easy, 
familiar, well-established techniques for installing and maintaining database 
platforms, and these platforms are mature enough that they evolve slowly and 
most versions are closely functionally equivalent to one another. Solr is 
comparatively young (not immature, but young).

Solr still (as far as I can tell) lacks standard package support in the default 
repos of the major Linux distros, and frequently breaks backward compatibility 
between versions in large and small ways (particularly in the internal API, but 
sometimes also in the configuration files). Those are not intended as 
criticisms of Solr -- they're to a large extent positive signs of activity and 
growth -- but they are, as far as I can tell, the current realities of working 
with the software.

For a developer with the right experience and knowledge, it's no big deal to 
navigate these challenges. However, my package is designed to be friendly to a 
less experienced, more generalized non-technical audience, and bundling Solr in 
the package instead of trying to guide the user through a potentially confusing 
manual installation process greatly simplifies the task of getting things up 
and running, saving me from having to field support emails from people who 
can't figure out how to install Solr on their platform, or those who end up 
with a version that's incompatible with my project's configurations and custom 
handlers.

At this point, my main goal is to revise the bundling process so that instead 
of storing Solr in Git, I can install it on-demand with a simple automated 
process during continuous integration builds and packaging for release. In the 
longer term, if the environmental factors change, I'd certainly prefer to stop 
bundling it entirely... but I don't think that is practical for my audience at 
this stage.

In any case, sorry for the long-winded reply, but hopefully that helps clarify 
my situation.

- Demian

-Original Message-

[...snip...]

In a theoretical situation 

Re: Replication with managed resources?

2016-08-03 Thread rosbaldeston
Am I right in saying managed resources are handled by zookeeper rather than
files on the filesystem and I should ignore any files such as:   
managed-schema,   _rest_managed.json,
_schema_analysis_stopwords_english.json,
_schema_analysis_synonyms_english.json ...

I should not try to copy any of these via the slaves confFiles option?

What I was planning to do was have the master as the indexing source and all
slaves as query sources. But they need the same synonyms & stopwords.

One thing I am seeing is that when I create my master and slave from a custom
configset, without any copying of configs, and the master's synonyms are
changed, the synonyms on the slave don't reflect these changes even some time
after replication.

It appears I need to reload the slave core(s) before they show the same
synonyms as the master? Is this because they're sharing the same file? How
should I keep slaves in sync with managed resources? Do I just have to
keep reloading all slave cores every so often?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Replication-with-managed-resources-tp4289880p4290177.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Sort Facet Values by "Interestingness"?

2016-08-03 Thread Joel Bernstein
Also, the TermsComponent can now export the docFreq for a list of terms and
the numDocs for the index. This can be used as a general purpose mechanism
for scoring facets with a callback.

https://issues.apache.org/jira/browse/SOLR-9243
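
A rough SolrJ sketch of that idea (collection "movies", field "plot" and the query
are made up; terms.list is, as far as I can tell, the parameter SOLR-9243 adds, and
the df*idf scoring below is just one possible callback, not the only one):

import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.TermsResponse;

public class SignificantFacetTerms {
  public static void main(String[] args) throws Exception {
    SolrClient client = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/movies").build();

    // df(term, bucket): facet counts inside the query's result set
    SolrQuery fq = new SolrQuery("genre:horror");
    fq.setRows(0).setFacet(true).addFacetField("plot");
    fq.setFacetLimit(100);
    QueryResponse fr = client.query(fq);
    List<FacetField.Count> counts = fr.getFacetField("plot").getValues();

    // numDocs: total documents in the index
    long numDocs = client.query(new SolrQuery("*:*").setRows(0))
        .getResults().getNumFound();

    // df(term, collection): TermsComponent docFreq for the same terms
    SolrQuery tq = new SolrQuery();
    tq.setRequestHandler("/terms");
    tq.set("terms", true);
    tq.set("terms.fl", "plot");
    tq.set("terms.list", counts.stream()
        .map(FacetField.Count::getName).collect(Collectors.joining(",")));
    Map<String, Long> df = new HashMap<>();
    for (TermsResponse.Term t : client.query(tq).getTermsResponse().getTerms("plot")) {
      df.put(t.getTerm(), t.getFrequency());
    }

    // interestingness = df(bucket) * log(numDocs / df(collection))
    counts.stream()
        .sorted(Comparator.comparingDouble((FacetField.Count c) ->
            -(c.getCount() * Math.log((double) numDocs
                / df.getOrDefault(c.getName(), 1L)))))
        .limit(10)
        .forEach(c -> System.out.println(c.getName()));
    client.close();
  }
}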

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Aug 3, 2016 at 8:52 AM, Joel Bernstein  wrote:

> What you're describing is implemented with Graph aggregations in this
> ticket using tf-idf. Other scoring methods can be implemented as well.
>
> https://issues.apache.org/jira/browse/SOLR-9193
>
> I'll update this thread with a description of how this can be used with
> the facet() streaming expression as well as with graph queries later today.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Aug 3, 2016 at 8:18 AM,  wrote:
>
>> Dear everybody,
>>
>> as the JSON-API now makes configuration of facets and sub-facets easier,
>> there appears to be a lot of potential to enable instant calculation of
>> facet-recommendations for a query, that is, to sort facets by their
>> relative importance/interestingness/significance for a current query relative
>> to the complete collection or relative to a result set defined by a
>> different query.
>>
>> An example would be to show the most typical terms which are used in
>> descriptions of horror-movies, in contrast to the most popular ones for
>> this query, as these may include terms that occur just as often in other genres.
>>
>> This feature has been discussed earlier in the context of solr:
>> * http://stackoverflow.duapp.com/questions/26399264/how-can-i-sort-facets-by-their-tf-idf-score-rather-than-popularity
>> * http://lucene.472066.n3.nabble.com/Facets-with-an-IDF-concept-td504070.html
>>
>> In elasticsearch, the specific feature that I am looking for is called
>> Significant Terms Aggregation:
>> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#search-aggregations-bucket-significantterms-aggregation
>>
>> As of now, I have two questions:
>>
>> a) Are there workarounds in the current solr-implementation or known
>> patches that implement such a sort-option for fields with a large number of
>> possible values, e.g. text-fields? (for smaller vocabularies it is easy to
>> do this client-side with two queries)
>> b) Are there plans to implement this in facet.pivot or in the
>> facet.json-API?
>>
>> The first step could be to define "interestingness" as a sort-option for
>> facets and to define interestingness as facet-count in the result-set as
>> compared to the complete collection: documentfrequency_termX(bucket) *
>> inverse_documentfrequency_termX(collection)
>>
>> As an extension, the JSON-API could be used to change the domain used as
>> base for the comparison. Another interesting option would be to compare
>> facet-counts against a current parent-facet for nested facets, e.g. the 5
>> most interesting terms by genre for a query on 70s movies, returning the
>> terms specific to horror, comedy, action etc. compared to all terminology
>> at the time (i.e. in the parent-query).
>>
>> A call-back-function could be used to define other measures of
>> interestingness such as the log-likelihood-ratio (
>> http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html). Most
>> measures need at least the following 4 values: document-frequency for a
>> term for the result-set, document-frequency for the result-set,
>> document-frequency for a term in the index (or base-domain),
>> document-frequency in the index (or base-domain).
>>
>> I guess, this feature might be of interest for those who want to do some
>> small-scale term-analysis in addition to search, e.g. as in my case in
>> digital humanities projects. But it might also be an interesting navigation
>> device, e.g. when searching on job-offers to show the skills that are most
>> distinctive for a category.
>>
>> It would be great to know if others are interested in this feature. If
>> there are any implementations out there or if anybody else is working on
>> this, a pointer would be a great start. In the absence of existing
>> solutions: Perhaps somebody has some idea on where and how to start
>> implementing this?
>>
>> Best regards,
>>
>> Ben
>>
>>
>>
>


Re: Sort Facet Values by "Interestingness"?

2016-08-03 Thread Joel Bernstein
What you're describing is implemented with Graph aggregations in this
ticket using tf-idf. Other scoring methods can be implemented as well.

https://issues.apache.org/jira/browse/SOLR-9193

I'll update this thread with a description of how this can be used with the
facet() streaming expression as well as with graph queries later today.



Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Aug 3, 2016 at 8:18 AM,  wrote:

> Dear everybody,
>
> as the JSON-API now makes configuration of facets and sub-facets easier,
> there appears to be a lot of potential to enable instant calculation of
> facet-recommendations for a query, that is, to sort facets by their
> relative importance/interestingness/significance for a current query relative
> to the complete collection or relative to a result set defined by a
> different query.
>
> An example would be to show the most typical terms which are used in
> descriptions of horror-movies, in contrast to the most popular ones for
> this query, as these may include terms that occur just as often in other genres.
>
> This feature has been discussed earlier in the context of solr:
> * http://stackoverflow.duapp.com/questions/26399264/how-can-i-sort-facets-by-their-tf-idf-score-rather-than-popularity
> * http://lucene.472066.n3.nabble.com/Facets-with-an-IDF-concept-td504070.html
>
> In elasticsearch, the specific feature that I am looking for is called
> Significant Terms Aggregation:
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#search-aggregations-bucket-significantterms-aggregation
>
> As of now, I have two questions:
>
> a) Are there workarounds in the current solr-implementation or known
> patches that implement such a sort-option for fields with a large number of
> possible values, e.g. text-fields? (for smaller vocabularies it is easy to
> do this client-side with two queries)
> b) Are there plans to implement this in facet.pivot or in the
> facet.json-API?
>
> The first step could be to define "interestingness" as a sort-option for
> facets and to define interestingness as facet-count in the result-set as
> compared to the complete collection: documentfrequency_termX(bucket) *
> inverse_documentfrequency_termX(collection)
>
> As an extension, the JSON-API could be used to change the domain used as
> base for the comparison. Another interesting option would be to compare
> facet-counts against a current parent-facet for nested facets, e.g. the 5
> most interesting terms by genre for a query on 70s movies, returning the
> terms specific to horror, comedy, action etc. compared to all terminology
> at the time (i.e. in the parent-query).
>
> A call-back-function could be used to define other measures of
> interestingness such as the log-likelihood-ratio (
> http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html). Most
> measures need at least the following 4 values: document-frequency for a
> term for the result-set, document-frequency for the result-set,
> document-frequency for a term in the index (or base-domain),
> document-frequency in the index (or base-domain).
>
> I guess, this feature might be of interest for those who want to do some
> small-scale term-analysis in addition to search, e.g. as in my case in
> digital humanities projects. But it might also be an interesting navigation
> device, e.g. when searching on job-offers to show the skills that are most
> distinctive for a category.
>
> It would be great to know if others are interested in this feature. If
> there are any implementations out there or if anybody else is working on
> this, a pointer would be a great start. In the absence of existing
> solutions: Perhaps somebody has some idea on where and how to start
> implementing this?
>
> Best regards,
>
> Ben
>
>
>


Sort Facet Values by "Interestingness"?

2016-08-03 Thread heuwing

Dear everybody,

as the JSON-API now makes configuration of facets and sub-facets easier, 
there appears to be a lot of potential to enable instant calculation of 
facet-recommendations for a query, that is, to sort facets by their 
relative importance/interestingness/significance for a current query 
relative to the complete collection or relative to a result set defined 
by a different query.


An example would be to show the most typical terms which are used in 
descriptions of horror-movies, in contrast to the most popular ones for 
this query, as these may include terms that occur just as often in other genres.


This feature has been discussed earlier in the context of solr:
* http://stackoverflow.duapp.com/questions/26399264/how-can-i-sort-facets-by-their-tf-idf-score-rather-than-popularity
* http://lucene.472066.n3.nabble.com/Facets-with-an-IDF-concept-td504070.html


In elasticsearch, the specific feature that I am looking for is called 
Significant Terms Aggregation: 
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#search-aggregations-bucket-significantterms-aggregation


As of now, I have two questions:

a) Are there workarounds in the current solr-implementation or known 
patches that implement such a sort-option for fields with a large number 
of possible values, e.g. text-fields? (for smaller vocabularies it is 
easy to do this client-side with two queries)
b) Are there plans to implement this in facet.pivot or in the 
facet.json-API?


The first step could be to define "interestingness" as a sort-option for 
facets and to define interestingness as facet-count in the result-set as 
compared to the complete collection: documentfrequency_termX(bucket) * 
inverse_documentfrequency_termX(collection)
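
As a quick worked example of that first-step score (all numbers invented, and a
natural log used for the IDF):

\[
\mathrm{score}(t) = \mathrm{df}_{\mathrm{bucket}}(t)\,\ln\!\frac{N}{\mathrm{df}_{\mathrm{collection}}(t)}
= 40 \cdot \ln\frac{10000}{100} \approx 40 \cdot 4.6 \approx 184
\]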


As an extension, the JSON-API could be used to change the domain used as 
base for the comparison. Another interesting option would be to compare 
facet-counts against a current parent-facet for nested facets, e.g. the 
5 most interesting terms by genre for a query on 70s movies, returning 
the terms specific to horror, comedy, action etc. compared to all 
terminology at the time (i.e. in the parent-query).


A call-back-function could be used to define other measures of 
interestingness such as the log-likelihood-ratio 
(http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html). 
Most measures need at least the following 4 values: document-frequency 
for a term for the result-set, document-frequency for the result-set, 
document-frequency for a term in the index (or base-domain), 
document-frequency in the index (or base-domain).
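
For reference, a self-contained sketch of the log-likelihood-ratio over those four
values, following Dunning's formulation (the 2x2 cell mapping in the comments is
one reasonable choice; class and method names are illustrative):

// k11 = df(term, result set)    k12 = docs(result set) - k11
// k21 = df(term, index) - k11   k22 = docs(index) - docs(result set) - k21
public final class Llr {
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    if (rowEntropy + columnEntropy < matrixEntropy) {
      return 0.0; // guard against tiny negative values from rounding
    }
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
  }

  private static double entropy(long... counts) {
    long sum = 0;
    double xLogXSum = 0.0;
    for (long c : counts) {
      xLogXSum += xLogX(c);
      sum += c;
    }
    return xLogX(sum) - xLogXSum;
  }

  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }
}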


I guess, this feature might be of interest for those who want to do some 
small-scale term-analysis in addition to search, e.g. as in my case in 
digital humanities projects. But it might also be an interesting 
navigation device, e.g. when searching on job-offers to show the skills 
that are most distinctive for a category.


It would be great to know if others are interested in this feature. If 
there are any implementations out there or if anybody else is working on 
this, a pointer would be a great start. In the absence of existing 
solutions: Perhaps somebody has some idea on where and how to start 
implementing this?


Best regards,

Ben




Solr 6 more like this

2016-08-03 Thread sara hajili
Hi, I switched from Solr 5 to Solr 6.
I created my own MoreLikeThis handler that uses Solr's MoreLikeThis handler
and expands the query by adding some words to it.
Now my question is about the MLT parameters.
I want to know about mlt.mindf and mlt.mintf: what exactly do they do?
When I don't set mlt.mindf and mlt.mintf, so they keep their default values,
everything is OK and I get the answer from my handler quickly.
But when I set both of them to 1, I get a heap error. When I checked my Solr
with JConsole, I saw that the new query with mlt.mintf=1 and mlt.mindf=1
needs more than 4G of heap, while when I execute my query with the defaults
(mlt.mintf=2 and mlt.mindf=5) I do not get a heap space error, and the query
executes with less than 512M of heap.

How can mintf and mindf affect the memory use (heap size) of my Solr system?
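
For context: mlt.mintf is the minimum term frequency below which terms in the
source document are ignored, and mlt.mindf is the minimum document frequency below
which terms are ignored. Lowering both to 1 admits many more (often rare) terms
into the generated MLT query, which would be consistent with the much larger heap
use. A minimal SolrJ sketch with the two parameters set explicitly (URL, handler
path, field and document id are all made up):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.MoreLikeThisParams;

public class MltProbe {
  public static void main(String[] args) throws Exception {
    SolrClient client = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/mycollection").build();
    SolrQuery q = new SolrQuery("id:12345");
    q.setRequestHandler("/mlt");
    q.set(MoreLikeThisParams.SIMILARITY_FIELDS, "body"); // mlt.fl
    q.set(MoreLikeThisParams.MIN_TERM_FREQ, 2);          // mlt.mintf
    q.set(MoreLikeThisParams.MIN_DOC_FREQ, 5);           // mlt.mindf
    QueryResponse r = client.query(q);
    System.out.println(r.getResults().getNumFound());
    client.close();
  }
}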


Re: EmbeddedSolrServer problem when using one-jar-with-dependency including solr

2016-08-03 Thread Ziqi Zhang

Thanks

I am not sure if Steve's suggestion was the right solution. Even when I 
had not explicitly defined the dependency on Lucene, I could see that the 
packaged jar still contained org.apache.lucene.


What solved my problem was not packing a single jar but using a folder of 
individual jars. I am not sure why, though.


Regards


On 02/08/2016 21:53, Rohit Kanchan wrote:

We also faced the same issue when we were running an embedded Solr 6.1 server.
Actually, I faced the same thing in our integration environment after deploying
the project. Solr 6.1 uses HttpClient 4.4.1, which I think the embedded Solr
server is looking for. I think when the Solr core is loaded, an old HttpClient
is getting loaded from somewhere in your Maven dependencies. Check the
dependency tree of your pom.xml and see if you can exclude this jar from
getting loaded anywhere else. Just exclude it in your pom.xml. I hope this
solves your issue.


Thanks
Rohit


On Tue, Aug 2, 2016 at 9:44 AM, Steve Rowe  wrote:


solr-core[1] and solr-solrj[2] POMs have parent POM solr-parent[3], which
in turn has parent POM lucene-solr-grandparent[4], which has a
<dependencyManagement> section that specifies dependency versions &
exclusions *for all direct dependencies*.

The intent is for all Lucene/Solr’s internal dependencies to be managed
directly, rather than through Maven’s transitive dependency mechanism.  For
background, see summary & comments on JIRA issue LUCENE-5217[5].

I haven’t looked into how this affects systems that depend on Lucene/Solr
artifacts, but it appears to be the case that you can’t use Maven’s
transitive dependency mechanism to pull in all required dependencies for
you.

BTW, if you look at the grandparent POM, the httpclient version for Solr
6.1.0 is declared as 4.4.1.  I don’t know if depending on version 4.5.2 is
causing problems, but if you don’t need a feature in 4.5.2, I suggest that
you depend on the same version as Solr does.

For error #2, you should depend on lucene-core[6].

My suggestion as a place to start: copy/paste the dependencies from
solr-core[1] and solr-solrj[2] POMs, and leave out stuff you know you won’t
need.

[1] <https://repo1.maven.org/maven2/org/apache/solr/solr-core/6.1.0/solr-core-6.1.0.pom>
[2] <https://repo1.maven.org/maven2/org/apache/solr/solr-solrj/6.1.0/solr-solrj-6.1.0.pom>
[3] <https://repo1.maven.org/maven2/org/apache/solr/solr-parent/6.1.0/solr-parent-6.1.0.pom>
[4] <https://repo1.maven.org/maven2/org/apache/lucene/lucene-solr-grandparent/6.1.0/lucene-solr-grandparent-6.1.0.pom>
[5] <https://issues.apache.org/jira/browse/LUCENE-5217>
[6] <http://search.maven.org/#artifactdetails|org.apache.lucene|lucene-core|6.1.0|jar>
--
Steve
www.lucidworks.com


On Aug 2, 2016, at 12:03 PM, Ziqi Zhang wrote:

Hi, I am using Solr, Solrj 6.1, and Maven to manage my project. I use
Maven to build a jar-with-dependency and run a Java program pointing its
classpath to this jar. However, I keep getting errors even when I just try
to create an instance of EmbeddedSolrServer:

/**** code ****/
String solrHome = "/home/solr/";
String solrCore = "fw";
solrCores = new EmbeddedSolrServer(
        Paths.get(solrHome), solrCore
).getCoreContainer();
/****/


My project has dependencies defined in the pom shown below. **When block A
is not present**, running the code above fails (see ERROR 1 further below):

/**** pom ****/
<dependency>
    <groupId>org.apache.jena</groupId>
    <artifactId>jena-arq</artifactId>
    <version>3.0.1</version>
</dependency>

<!-- BLOCK A -->
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>
<!-- BLOCK A ENDS -->

<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-core</artifactId>
    <version>6.1.0</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
        <exclusion>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-jdk14</artifactId>
        </exclusion>
    </exclusions>
</dependency>

<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-solrj</artifactId>
    <version>6.1.0</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
        <exclusion>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-jdk14</artifactId>
        </exclusion>
    </exclusions>
</dependency>
/****/


Block A is added because when it is missing, the following error is thrown
by the java code above:

/**** ERROR 1 ****/
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/http/impl/client/CloseableHttpClient
    at org.apache.solr.handler.component.HttpShardHandlerFactory.init(HttpShardHandlerFactory.java:167)
    at


Re: Sum of all values in Function Query

2016-08-03 Thread Mikhail Khludnev
Edwin,
Did you try something from http://yonik.com/solr-facet-functions/ ?
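
For example, the sum over all documents matching a query can be requested through
the JSON Facet API's sum() aggregation; a minimal SolrJ sketch (the core name is
made up, the field name comes from the question below):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SumAmount {
  public static void main(String[] args) throws Exception {
    SolrClient client = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/mycollection").build();
    SolrQuery q = new SolrQuery("*:*");   // or any narrower query
    q.setRows(0);                         // only the aggregate is needed
    q.add("json.facet", "{ totalAmount : \"sum(Amount)\" }");
    QueryResponse r = client.query(q);
    // the sum comes back under the "facets" section of the response
    System.out.println(r.getResponse().get("facets"));
    client.close();
  }
}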

On Wed, Aug 3, 2016 at 6:58 AM, Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> I would like to find out: is it possible for Solr to compute the sum of all
> the values that are returned by a query?
>
> For example, suppose I do a search where there is a field called "Amount"
> with fieldType=float.
> Is it possible for Solr to return the sum of all the "Amount" values that are
> returned? If the main search query is *:*, then it should return the sum of
> all the "Amount" values present in the entire collection.
>
> I have tried to use sum(Amount), but this doesn't work, as I believe sum()
> just computes a value per individual record, rather than aggregating over
> all of the matching records.
>
> I'm using Solr 6.1.0
>
> Regards,
> Edwin
>



-- 
Sincerely yours
Mikhail Khludnev



Re: TooManyClauses: maxClauseCount is set to 1024

2016-08-03 Thread liubiaoxin1
Set in every core's solrconfig.xml: <maxBooleanClauses>4096</maxBooleanClauses>



--
View this message in context: 
http://lucene.472066.n3.nabble.com/TooManyClauses-maxClauseCount-is-set-to-1024-tp4056965p4290157.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: problems with bulk indexing with concurrent DIH

2016-08-03 Thread Bernd Fehling
Hi Shalin,

Yes, I'm going to set up 5.5.3 to see how that behaves.
Michael McCandless gave me the hint about LUCENE-6161.

We will see... :-)


Am 02.08.2016 um 16:31 schrieb Shalin Shekhar Mangar:
> Hi Bernd,
> 
> I think you are running into
> https://issues.apache.org/jira/browse/LUCENE-6161. Can you upgrade to 5.1
> or newer?
> 
> On Wed, Jul 27, 2016 at 7:29 PM, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de> wrote:
> 
>> After enhancing the server with SSDs I'm trying to speed up indexing.
>>
>> The server has 16 CPUs and more than 100G RAM.
>> JAVA (1.8.0_92) has 24G.
>> SOLR is 4.10.4.
>> Plain XML data to load is 218G with about 96M records.
>> This will result in a single index of 299G.
>>
>> I tried with 4, 8, 12 and 16 concurrent DIHs.
>> 16 and 12 were too much for 16 CPUs, so my test continued with 8
>> concurrent DIHs.
>> Then I was trying different <indexConfig> and <updateHandler> settings but
>> now I'm stuck.
>> I can't figure out what is the best setting for bulk indexing.
>> What I see is that the indexing is "falling asleep" after some time of
>> indexing.
>> It is only producing del-files, like _11_1.del, _w_2.del, _h_3.del,...
>>
>> <indexConfig>
>>   <maxIndexingThreads>8</maxIndexingThreads>
>>   <ramBufferSizeMB>1024</ramBufferSizeMB>
>>   <maxBufferedDocs>-1</maxBufferedDocs>
>>   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>     <int name="maxMergeAtOnce">8</int>
>>     <int name="segmentsPerTier">100</int>
>>     <int name="maxMergedSegmentMB">512</int>
>>   </mergePolicy>
>>   <mergeFactor>8</mergeFactor>
>>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>>   <lockType>${solr.lock.type:native}</lockType>
>>   ...
>> </indexConfig>
>>
>> <updateHandler class="solr.DirectUpdateHandler2">
>>   <!-- ### no autocommit at all -->
>>   <autoSoftCommit>
>>     <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
>>   </autoSoftCommit>
>> </updateHandler>
>>
>>
>>
>> command=full-import&clean=false&commit=false&optimize=false&debug=false
>> After indexing finishes there is a final optimize.
>>
>> My idea is, if 8 DIHs use 8 CPUs then I have 8 CPUs left for merging
>> (maxIndexingThreads/maxMergeAtOnce/mergeFactor).
>> It should do no commit, no optimize.
>> ramBufferSizeMB is high because I have plenty of RAM and I want to make use
>> of the speed of RAM.
>> segmentsPerTier is high to reduce merging.
>>
>> But somewhere is a misconfiguration because indexing gets stalled.
>>
>> Any idea what's going wrong?
>>
>>
>> Bernd
>>
>>
>>
>>
>>
> 
> 

-- 
*
Bernd Fehling                Bielefeld University Library
Dipl.-Inform. (FH)           LibTec - Library Technology
Universitätsstr. 25  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060   bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*