Re: Parallelize Cursor approach

2016-11-04 Thread Erick Erickson
Have you considered the /export functionality?
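For context, the export handler (/export) streams the entire sorted result set back in a single request instead of paging with cursor marks. A request looks roughly like this (collection and field names are placeholders; the sort and fl fields must have docValues enabled):

```
http://localhost:8983/solr/mycollection/export?q=*:*&sort=id asc&fl=id,timestamp
```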

On Fri, Nov 4, 2016 at 5:56 PM, Yonik Seeley  wrote:
> No, you can't get cursor-marks ahead of time.
> They are the serialized representation of the last sort values
> encountered (hence not known ahead of time).
>
> -Yonik
>
>
> On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi  wrote:
>> Hi,
>>
>> I am using the cursor approach to fetch results from Solr (5.5.0). Most of
>> my queries return millions of results. Is there a way I can read the pages
>> in parallel? Is there a way I can get all the cursors well in advance?
>>
>> Let's say my query returns 2M documents and I have set rows=100,000.
>> Can I have multiple threads iterating over different pages like
>> Thread1 -> docs 1 to 100K
>> Thread2 -> docs 101K to 200K
>> ..
>> ..
>>
>> for this to happen, can I get all the cursorMarks for a given query so that
>> I can leverage the following code in parallel
>>
>> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
>> val rsp: QueryResponse = c.query(cursorQ)
>>
>> Thank you,
>> Chetas.


Re: Parallelize Cursor approach

2016-11-04 Thread Yonik Seeley
No, you can't get cursor-marks ahead of time.
They are the serialized representation of the last sort values
encountered (hence not known ahead of time).

-Yonik


On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi  wrote:
> Hi,
>
> I am using the cursor approach to fetch results from Solr (5.5.0). Most of
> my queries return millions of results. Is there a way I can read the pages
> in parallel? Is there a way I can get all the cursors well in advance?
>
> Let's say my query returns 2M documents and I have set rows=100,000.
> Can I have multiple threads iterating over different pages like
> Thread1 -> docs 1 to 100K
> Thread2 -> docs 101K to 200K
> ..
> ..
>
> for this to happen, can I get all the cursorMarks for a given query so that
> I can leverage the following code in parallel
>
> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
> val rsp: QueryResponse = c.query(cursorQ)
>
> Thank you,
> Chetas.
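Yonik's point can be illustrated without a Solr server: the next cursor mark is derived from the last sort value of the page just fetched, so page N+1's mark only exists after page N has been retrieved, and the loop is inherently sequential. A minimal stdlib-only Java sketch (the fetcher below is a stand-in for a SolrJ client.query() call with CursorMarkParams.CURSOR_MARK_PARAM set; all names are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class CursorSketch {
    /** One page of results plus the nextCursorMark returned with it. */
    static final class Page {
        final List<Integer> docs;
        final String nextMark;
        Page(List<Integer> docs, String nextMark) { this.docs = docs; this.nextMark = nextMark; }
    }

    /** Pages through sorted ids; the mark encodes the last id already returned. */
    static Page fetch(List<Integer> sortedIds, String mark, int rows) {
        int after = mark.equals("*") ? Integer.MIN_VALUE : Integer.parseInt(mark);
        List<Integer> docs = new ArrayList<>();
        for (int id : sortedIds) {
            if (id > after && docs.size() < rows) docs.add(id);
        }
        // End of results: signalled by returning the same mark back.
        String next = docs.isEmpty() ? mark : String.valueOf(docs.get(docs.size() - 1));
        return new Page(docs, next);
    }

    /** The standard cursor loop: stop when the mark stops changing. */
    static List<Integer> drain(Function<String, Page> fetcher) {
        List<Integer> all = new ArrayList<>();
        String mark = "*"; // analogous to CursorMarkParams.CURSOR_MARK_START
        while (true) {
            Page p = fetcher.apply(mark);
            all.addAll(p.docs);
            if (p.nextMark.equals(mark)) break; // no progress => finished
            mark = p.nextMark;
        }
        return all;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(1, 3, 5, 7, 9, 11);
        System.out.println(drain(m -> fetch(ids, m, 4))); // [1, 3, 5, 7, 9, 11]
    }
}
```

Because each mark is computed from the previous page, the only way to parallelize is to partition the query itself (e.g. by a filter on an id or hash range), not the cursor.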


Parallelize Cursor approach

2016-11-04 Thread Chetas Joshi
Hi,

I am using the cursor approach to fetch results from Solr (5.5.0). Most of
my queries return millions of results. Is there a way I can read the pages
in parallel? Is there a way I can get all the cursors well in advance?

Let's say my query returns 2M documents and I have set rows=100,000.
Can I have multiple threads iterating over different pages like
Thread1 -> docs 1 to 100K
Thread2 -> docs 101K to 200K
..
..

for this to happen, can I get all the cursorMarks for a given query so that
I can leverage the following code in parallel

cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
val rsp: QueryResponse = c.query(cursorQ)

Thank you,
Chetas.


Re: CodaHale metrics for Solr 6?

2016-11-04 Thread Jeff Wartes
Expanding on my comment on the ticket, I’m really quite happy with using 
codahale/dropwizard metrics with Solr. I don’t know if I’m comfortable just 
sharing a screenshot of the resulting grafana dashboard, but I’ve got, per-host:

- Percentile latencies and rates for GET vs POST (which in solrcloud generally 
maps to top-level-query vs shard-request & update) (From the jetty-metrics 
plugin)
- Log rates by log level (from the logging-metrics plugin)
- GC time (from the jvm-metrics plugin)
- Thread counts (same)
- Percentile latencies and rates per query “performance class” (from my 
metrics-aware backup request handler - https://github.com/whitepages/SOLR-4449) 
- Backup request rates (same)

I’ve been agitating for tighter metrics integration for at 
least a year, maybe longer. I’ve been using the jetty metrics plugin since at 
least Solr 4.9, and have upgraded and re-added it at least twice.


On 11/1/16, 11:56 AM, "Walter Underwood"  wrote:

Anybody?

It seems like this would be a solution for SOLR-4735, which has been open 
for 3.5 years.

https://issues.apache.org/jira/browse/SOLR-4735 


wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 26, 2016, at 12:02 PM, Walter Underwood  wrote:
> 
> Anybody using the CodaHale metrics.jetty9.InstrumentedHandler? It looks a 
> lot like something we built for our own use with Solr 4.
> 
> http://metrics.dropwizard.io/3.1.0/manual/jetty/
> http://metrics.dropwizard.io/3.1.0/apidocs/com/codahale/metrics/jetty9/InstrumentedHandler.html
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org 
> http://observer.wunderwood.org/  (my blog)





Re: Facets based on sampling

2016-11-04 Thread Yonik Seeley
Sampling has been on my TODO list for the JSON Facet API.
How much it would help depends on where the bottlenecks are, but that
in conjunction with a hashing approach to collection (assuming field
cardinality is high) should definitely help.

-Yonik


On Fri, Nov 4, 2016 at 3:02 PM, John Davis  wrote:
> Hi,
> I am trying to improve the performance of queries with facets. I understand
> that for queries with high facet cardinality and large number results the
> current facet computation algorithms can be slow as they are trying to loop
> across all docs and facet values.
>
> Does there exist an option to compute facets by just looking at the top-n
> results instead of all of them or a sample of results based on some query
> parameters? I couldn't find one and if it does not exist, has this come up
> before? This would definitely not be a precise facet count but using
> reasonable sampling algorithms we should be able to extrapolate well.
>
> Thank you in advance for any advice!
>
> John
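John's extrapolation idea can be sketched independently of Solr: facet over a random sample of the matching documents' field values and scale each count by (total results / sample size). A stdlib-only Java sketch (not a Solr API; all names are illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class SampledFacet {
    /**
     * Estimates facet counts: count values over a random sample of the
     * result set, then scale by (total results / sample size).
     */
    static Map<String, Long> estimate(List<String> facetValues, int sampleSize, long seed) {
        List<String> copy = new ArrayList<>(facetValues);
        Collections.shuffle(copy, new Random(seed)); // seeded for reproducibility
        List<String> sample = copy.subList(0, Math.min(sampleSize, copy.size()));
        Map<String, Long> counts = new HashMap<>();
        for (String v : sample) counts.merge(v, 1L, Long::sum);
        double scale = (double) facetValues.size() / sample.size();
        Map<String, Long> est = new HashMap<>();
        counts.forEach((v, c) -> est.put(v, Math.round(c * scale)));
        return est;
    }

    public static void main(String[] args) {
        // Synthetic field values: 70% "A", 30% "B" across 100k docs.
        List<String> vals = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) vals.add(i % 10 < 7 ? "A" : "B");
        System.out.println(estimate(vals, 1000, 42)); // expected close to {A=70000, B=30000}
    }
}
```

The usual caveat applies: sampling works well for frequent values but rare facet values may be missed entirely, which is why exact counts for the long tail still require a full pass.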


Re: Facets based on sampling

2016-11-04 Thread Jeff Wartes
https://issues.apache.org/jira/browse/SOLR-5894 had some pretty interesting 
looking work on heuristic counts for facets, among other things.

Unfortunately, it didn’t get picked up, but if you don’t mind using Solr 4.10, 
there’s a jar.


On 11/4/16, 12:02 PM, "John Davis"  wrote:

Hi,
I am trying to improve the performance of queries with facets. I understand
that for queries with high facet cardinality and large number results the
current facet computation algorithms can be slow as they are trying to loop
across all docs and facet values.

Does there exist an option to compute facets by just looking at the top-n
results instead of all of them or a sample of results based on some query
parameters? I couldn't find one and if it does not exist, has this come up
before? This would definitely not be a precise facet count but using
reasonable sampling algorithms we should be able to extrapolate well.

Thank you in advance for any advice!

John




Re: Facets based on sampling

2016-11-04 Thread Alexandre Rafalovitch
I believe that's what the JSON Facet API does by default. Have you tried that?

Regards,
   Alex.

Solr Example reading group is starting November 2016, join us at
http://j.mp/SolrERG
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 5 November 2016 at 06:02, John Davis  wrote:
> Hi,
> I am trying to improve the performance of queries with facets. I understand
> that for queries with high facet cardinality and large number results the
> current facet computation algorithms can be slow as they are trying to loop
> across all docs and facet values.
>
> Does there exist an option to compute facets by just looking at the top-n
> results instead of all of them or a sample of results based on some query
> parameters? I couldn't find one and if it does not exist, has this come up
> before? This would definitely not be a precise facet count but using
> reasonable sampling algorithms we should be able to extrapolate well.
>
> Thank you in advance for any advice!
>
> John


Re: Custom user web interface for Solr

2016-11-04 Thread Erik Hatcher
What kind of graphical format?

> On Nov 4, 2016, at 14:01, "tesm...@gmail.com"  wrote:
> 
> Hi,
> 
> My search query comprises more than one field: a search string, a date
> field, and an optional field.
> 
> I need to represent these on the web interface to the users.
> 
> Secondly, I need to represent the search data in graphical format.
> 
> Is there some Solr web client that provides the above features, or is there
> a way to modify the default Solr /browse interface and add the above options?
> 
> 
> 
> 
> 
> Regards,


Re: Custom user web interface for Solr

2016-11-04 Thread Alexandre Rafalovitch
Unless you secure your Solr instance well, you should not be exposing
Solr directly to the client. Anyone who can see the Admin UI or the /browse
handler can also delete all your documents. I am mentioning this just
in case.

So, you usually need a middleware that maps your requests to Solr.
Either with something like Spring Data Solr (somewhat restrictive, but
easy to start with) or other approaches.

Still, the other part of your question is whether it is possible to send
Solr several distinct fields and have it compose the corresponding
queries. I've done similar things for a contact database using the
switch query parser, which made the middleware embarrassingly simple. You
can see the configuration for that:
https://gist.github.com/arafalov/5e04884e5aefaf46678c

Regards,
Alex.

Solr Example reading group is starting November 2016, join us at
http://j.mp/SolrERG
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 5 November 2016 at 05:01, tesm...@gmail.com  wrote:
> Hi,
>
> My search query comprises more than one field: a search string, a date
> field, and an optional field.
>
> I need to represent these on the web interface to the users.
>
> Secondly, I need to represent the search data in graphical format.
>
> Is there some Solr web client that provides the above features, or is there
> a way to modify the default Solr /browse interface and add the above options?
>
>
>
>
>
> Regards,
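A sketch of the middleware idea Alexandre describes: the server-side code maps the UI's form fields to a Solr select URL, so the browser never talks to Solr directly. The field names (date, category), parameter mapping, and base URL below are hypothetical placeholders, not from the original posts:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchProxy {
    /**
     * Builds a Solr select URL from the UI's form fields.
     * A null argument means the optional field was left empty.
     */
    static String buildSelectUrl(String base, String text, String fromDate, String category) {
        StringBuilder url = new StringBuilder(base).append("/select?q=")
                .append(enc(text == null || text.isEmpty() ? "*:*" : text));
        // Optional fields become filter queries; fq may repeat.
        if (fromDate != null) url.append("&fq=").append(enc("date:[" + fromDate + " TO *]"));
        if (category != null) url.append("&fq=").append(enc("category:" + category));
        return url.append("&wt=json").toString();
    }

    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildSelectUrl("http://localhost:8983/solr/docs",
                "network", "2016-01-01T00:00:00Z", null));
    }
}
```

Keeping this mapping in middleware also lets you whitelist exactly which parameters users can influence, which addresses the security concern above.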


Re: Aggregate Values Inside a Facet Range

2016-11-04 Thread Furkan KAMACI
Yes, it works with hours too. You can run a sum function on each hourly
facet bucket.

On Nov 4, 2016 10:14 PM, "William Bell"  wrote:

> How about hours?
>
> NOW+1HR
> NOW+2HR
> NOW+12HR
> NOW-4HR
>
> Can we add that?
>
>
> On Fri, Nov 4, 2016 at 12:25 PM, Furkan KAMACI 
> wrote:
>
> > I have documents like that
> >
> > id:5
> > timestamp:NOW //pseudo date representation
> > count:13
> >
> > id:4
> > timestamp:NOW //pseudo date representation
> > count:3
> >
> > id:3
> > timestamp:NOW-1DAY //pseudo date representation
> > count:21
> >
> > id:2
> > timestamp:NOW-1DAY //pseudo date representation
> > count:29
> >
> > id:1
> > timestamp:NOW-3DAY //pseudo date representation
> > count:4
> >
> > When I want to facet last 3 days data by timestamp its OK. However my
> need
> > is that:
> >
> > facets:
> > TODAY: 16 //pseudo representation
> > TODAY - 1: 50 //pseudo date representation
> > TODAY - 2: 0 //pseudo date representation
> > TODAY - 3: 4 //pseudo date representation
> >
> > I mean, I have to facet by dates and aggregate values inside that facet
> > range. Is it possible to do that without multiple queries at Solr?
> >
> > Kind Regards,
> > Furkan KAMACI
> >
>
>
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076
>


Re: Aggregate Values Inside a Facet Range

2016-11-04 Thread William Bell
How about hours?

NOW+1HR
NOW+2HR
NOW+12HR
NOW-4HR

Can we add that?


On Fri, Nov 4, 2016 at 12:25 PM, Furkan KAMACI 
wrote:

> I have documents like that
>
> id:5
> timestamp:NOW //pseudo date representation
> count:13
>
> id:4
> timestamp:NOW //pseudo date representation
> count:3
>
> id:3
> timestamp:NOW-1DAY //pseudo date representation
> count:21
>
> id:2
> timestamp:NOW-1DAY //pseudo date representation
> count:29
>
> id:1
> timestamp:NOW-3DAY //pseudo date representation
> count:4
>
> When I want to facet last 3 days data by timestamp its OK. However my need
> is that:
>
> facets:
> TODAY: 16 //pseudo representation
> TODAY - 1: 50 //pseudo date representation
> TODAY - 2: 0 //pseudo date representation
> TODAY - 3: 4 //pseudo date representation
>
> I mean, I have to facet by dates and aggregate values inside that facet
> range. Is it possible to do that without multiple queries at Solr?
>
> Kind Regards,
> Furkan KAMACI
>



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Aggregate Values Inside a Facet Range

2016-11-04 Thread Furkan KAMACI
Seems that SolrJ doesn't support the JSON Facet API yet.

On Fri, Nov 4, 2016 at 9:08 PM, Furkan KAMACI 
wrote:

> Fantastic! Thanks Yonik, I could do the stuff that I want with JSON Facet
> API.
>
> On Fri, Nov 4, 2016 at 8:42 PM, Yonik Seeley  wrote:
>
>> On Fri, Nov 4, 2016 at 2:25 PM, Furkan KAMACI 
>> wrote:
>> > I mean, I have to facet by dates and aggregate values inside that facet
>> > range. Is it possible to do that without multiple queries at Solr?
>>
>> This (old) blog shows a percentiles calculation under a range facet:
>> http://yonik.com/percentiles-for-solr-faceting/
>>
>> -Yonik
>>
>
>


Re: Aggregate Values Inside a Facet Range

2016-11-04 Thread Furkan KAMACI
Fantastic! Thanks Yonik, I could do the stuff that I want with JSON Facet
API.

On Fri, Nov 4, 2016 at 8:42 PM, Yonik Seeley  wrote:

> On Fri, Nov 4, 2016 at 2:25 PM, Furkan KAMACI 
> wrote:
> > I mean, I have to facet by dates and aggregate values inside that facet
> > range. Is it possible to do that without multiple queries at Solr?
>
> This (old) blog shows a percentiles calculation under a range facet:
> http://yonik.com/percentiles-for-solr-faceting/
>
> -Yonik
>


Facets based on sampling

2016-11-04 Thread John Davis
Hi,
I am trying to improve the performance of queries with facets. I understand
that for queries with high facet cardinality and large number results the
current facet computation algorithms can be slow as they are trying to loop
across all docs and facet values.

Does there exist an option to compute facets by just looking at the top-n
results instead of all of them or a sample of results based on some query
parameters? I couldn't find one and if it does not exist, has this come up
before? This would definitely not be a precise facet count but using
reasonable sampling algorithms we should be able to extrapolate well.

Thank you in advance for any advice!

John


Re: Aggregate Values Inside a Facet Range

2016-11-04 Thread Yonik Seeley
On Fri, Nov 4, 2016 at 2:25 PM, Furkan KAMACI  wrote:
> I mean, I have to facet by dates and aggregate values inside that facet
> range. Is it possible to do that without multiple queries at Solr?

This (old) blog shows a percentiles calculation under a range facet:
http://yonik.com/percentiles-for-solr-faceting/

-Yonik
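Concretely, the aggregation described in this thread maps to a JSON Facet API request: a range facet over the timestamp field with a nested sum aggregation per bucket. A sketch of the json.facet parameter, using the field names from the example above (it can be passed as a plain request parameter even where the client library has no dedicated method):

```
json.facet={
  per_day : {
    type  : range,
    field : timestamp,
    start : "NOW/DAY-3DAYS",
    end   : "NOW/DAY+1DAY",
    gap   : "+1DAY",
    facet : { total : "sum(count)" }
  }
}
```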


Re: Custom user web interface for Solr

2016-11-04 Thread KRIS MUSSHORN
https://cwiki.apache.org/confluence/display/solr/Velocity+Search+UI 

You might be able to customize velocity. 

K 
- Original Message -

From: "Binoy Dalal"  
To: solr-user@lucene.apache.org 
Sent: Friday, November 4, 2016 2:33:24 PM 
Subject: Re: Custom user web interface for Solr 

See this link for more details => 
https://lucidworks.com/blog/2015/12/08/browse-new-improved-solr-5/ 

On Sat, Nov 5, 2016 at 12:02 AM Binoy Dalal  wrote: 

> Have you checked out the /browse handler? It provides a pretty rudimentary 
> UI for displaying the results. It is nowhere close to what you would want 
> to present to your users but it is a good place to start off. 
> 
> On Fri, Nov 4, 2016 at 11:32 PM tesm...@gmail.com  
> wrote: 
> 
> Hi, 
> 
> My search query comprises more than one field: a search string, a date 
> field, and an optional field. 
> 
> I need to represent these on the web interface to the users. 
> 
> Secondly, I need to represent the search data in graphical format. 
> 
> Is there some Solr web client that provides the above features, or is there 
> a way to modify the default Solr /browse interface and add the above options? 
> 
> 
> 
> 
> 
> Regards, 
> 
> -- 
> Regards, 
> Binoy Dalal 
> 
-- 
Regards, 
Binoy Dalal 



Re: Indexing and Disk Writes

2016-11-04 Thread Andrew Dinsmore
Erick,

We currently have ramBufferSizeMB set to 1024. For this indexing activity, the
cluster is "offline" thus no queries coming in so not worried about any
user impact or delays should Solr terminate and need to replay. The
thinking was that increasing these values (ramBuffer, commit times, etc)
would cut down on the amount of merging by writing larger segments from the
start. segmentsPerTier is currently 15 so in theory if we only committed 15
times we would never have to merge (right?). But no real effect on the disk
metrics thus far.

For the most recent test, I was very surprised by the commit activity. Some
commits are logged by the qtp threads and some by the commit scheduler (I'm
ignoring the openSearch=true commits here. I recognize we should disable
softCommits if users are not searching during indexing). I assume the
"commit schedulers" are the autoCommits and the "qtp" are from the
ramBufferSizeMB or maxBufferedDocs thresholds. I observed that our commits
came in pairs (scheduler then qtp) usually within a minute or three of each
other and then nothing for 10 to 15 minutes and then another pair within a
minute or two. Even more surprising is that I observed commits across the
13 node cluster all within the same second. This activity isn't
synchronized, is it? I can't imagine we are indexing that uniformly in
terms of bytes to account for this behavior. And with autoCommit.maxTime
set to 10 minutes I see commits occurring closer to 15 minutes (7 logged by
/update in the admin interface over the 2 hour run) thus my question about
the cluster synchronizing commits.

We measured tlog writes to disk and they come very close to the bytes
coming in over the NIC (makes sense) so they are accounting for only 5 to
10% of the disk writes. Good savings but not enough to significantly change
the load we are putting on the SAN.

Andrew

On Fri, Nov 4, 2016 at 12:00 PM, Erick Erickson 
wrote:

> Every time your ramBufferSizeMB limit is exceeded, a segment is
> created that's eventually merged. In terms of _throughput_, making
> this large usually doesn't help much after about 100M (the default).
> It'd be interesting to see if it changes your I/O activity though.
>
> BTW, I'd hard commit (openSearcher=false) much more frequently. As you
> see that doesn't particularly change IO, but if Solr should terminate
> abnormally the tlog will be replayed on startup and may sit there for
> 10 minutes.
>
> You could also consider disabling tlogs for the duration of your bulk
> indexing, then turn them back on for incremental.
>
> The background merging can be pretty dramatic though, that may well be
> where much of this is coming from.
>
> Best,
> Erick
>
> On Fri, Nov 4, 2016 at 8:51 AM, Andrew Dinsmore 
> wrote:
> > We are using Solr 5.4 to index TBs of documents in a bulk fashion to get
> > the cluster up and running. Indexing is over HTTP round robin as directed
> > by zookeeper.
> >
> > Each of the 13 nodes is receiving about 6-8 MB/s on the NIC but solr is
> > writing around 20 to 25 thousand times per second (4k block size). My
> > question is what is Solr doing writing all this data to disk
> (80-100MB/s)?
> >
> > Over a three hour run with 4.5 million docs we only committed 20 some
> times
> > but disk activity was pretty constant at the above levels.
> >
> > Is there more going on than tlogs, commits and merges? When we moved
> from 1
> > minute autoCommit to 10 we committed less per the log messages but I
> > expected the bigger initial segments to result in less merging thus lower
> > disk activity. But testing showed no significant change in disk writing.
> >
> > Thanks for any help.
> >
> > Andrew
>
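For reference, the settings discussed in this thread map to solrconfig.xml fragments roughly like the following (values are illustrative; openSearcher=false keeps frequent hard commits cheap while capping tlog replay time after a crash):

```
<indexConfig>
  <ramBufferSizeMB>1024</ramBufferSizeMB>
</indexConfig>

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit every minute -->
    <openSearcher>false</openSearcher>  <!-- don't open a new searcher on commit -->
  </autoCommit>
  <!-- omit/disable autoSoftCommit while nobody is searching during the bulk load -->
</updateHandler>
```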


Re: Custom user web interface for Solr

2016-11-04 Thread Binoy Dalal
See this link for more details =>
https://lucidworks.com/blog/2015/12/08/browse-new-improved-solr-5/

On Sat, Nov 5, 2016 at 12:02 AM Binoy Dalal  wrote:

> Have you checked out the /browse handler? It provides a pretty rudimentary
> UI for displaying the results. It is nowhere close to what you would want
> to present to your users but it is a good place to start off.
>
> On Fri, Nov 4, 2016 at 11:32 PM tesm...@gmail.com 
> wrote:
>
> Hi,
>
> My search query comprises more than one field: a search string, a date
> field, and an optional field.
>
> I need to represent these on the web interface to the users.
>
> Secondly, I need to represent the search data in graphical format.
>
> Is there some Solr web client that provides the above features, or is there
> a way to modify the default Solr /browse interface and add the above options?
>
>
>
>
>
> Regards,
>
> --
> Regards,
> Binoy Dalal
>
-- 
Regards,
Binoy Dalal


Re: Custom user web interface for Solr

2016-11-04 Thread Binoy Dalal
Have you checked out the /browse handler? It provides a pretty rudimentary
UI for displaying the results. It is nowhere close to what you would want
to present to your users but it is a good place to start off.

On Fri, Nov 4, 2016 at 11:32 PM tesm...@gmail.com  wrote:

Hi,

My search query comprises more than one field: a search string, a date
field, and an optional field.

I need to represent these on the web interface to the users.

Secondly, I need to represent the search data in graphical format.

Is there some Solr web client that provides the above features, or is there
a way to modify the default Solr /browse interface and add the above options?





Regards,

-- 
Regards,
Binoy Dalal


Re: Aggregate Values Inside a Facet Range

2016-11-04 Thread David Santamauro


I believe your answer is in the subject
  => facet.range
https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-RangeFaceting

//

On 11/04/2016 02:25 PM, Furkan KAMACI wrote:

I have documents like that

id:5
timestamp:NOW //pseudo date representation
count:13

id:4
timestamp:NOW //pseudo date representation
count:3

id:3
timestamp:NOW-1DAY //pseudo date representation
count:21

id:2
timestamp:NOW-1DAY //pseudo date representation
count:29

id:1
timestamp:NOW-3DAY //pseudo date representation
count:4

When I want to facet last 3 days data by timestamp its OK. However my need
is that:

facets:
 TODAY: 16 //pseudo representation
 TODAY - 1: 50 //pseudo date representation
 TODAY - 2: 0 //pseudo date representation
 TODAY - 3: 4 //pseudo date representation

I mean, I have to facet by dates and aggregate values inside that facet
range. Is it possible to do that without multiple queries at Solr?

Kind Regards,
Furkan KAMACI



Aggregate Values Inside a Facet Range

2016-11-04 Thread Furkan KAMACI
I have documents like that

id:5
timestamp:NOW //pseudo date representation
count:13

id:4
timestamp:NOW //pseudo date representation
count:3

id:3
timestamp:NOW-1DAY //pseudo date representation
count:21

id:2
timestamp:NOW-1DAY //pseudo date representation
count:29

id:1
timestamp:NOW-3DAY //pseudo date representation
count:4

When I want to facet last 3 days data by timestamp its OK. However my need
is that:

facets:
TODAY: 16 //pseudo representation
TODAY - 1: 50 //pseudo date representation
TODAY - 2: 0 //pseudo date representation
TODAY - 3: 4 //pseudo date representation

I mean, I have to facet by dates and aggregate values inside that facet
range. Is it possible to do that without multiple queries at Solr?

Kind Regards,
Furkan KAMACI


Custom user web interface for Solr

2016-11-04 Thread tesm...@gmail.com
Hi,

My search query comprises more than one field: a search string, a date
field, and an optional field.

I need to represent these on the web interface to the users.

Secondly, I need to represent the search data in graphical format.

Is there some Solr web client that provides the above features, or is there
a way to modify the default Solr /browse interface and add the above options?





Regards,


Re: Solrj facet.date

2016-11-04 Thread Furkan KAMACI
Hi Shawn,

You are right: ClientUtils.escapeQueryChars() breaks the functionality. My
expectation was that SolrJ has addDateRangeFacet; however, there is no
direct method for a facet.date query.

Kind Regards,
Furkan KAMACI

On Fri, Nov 4, 2016 at 7:04 PM, Shawn Heisey  wrote:

> On 11/4/2016 10:22 AM, Furkan KAMACI wrote:
> > I send a query to Solr to get information about each day of current week
> > via this way:
> >
> > q=*:*
> > fq=type:dps
> > rows=0
> > facet=true
> > facet.date=date
> > facet.date.start=NOW/DAY-6DAYS
> > facet.date.end=NOW/DAY%2B1DAY
> > facet.date.gap=%2B1DAY
> >
> > I want to make that query over Solrj.
>
> This code would do it:
>
>   /*
>* The client creation would probably be elsewhere, just putting it here
>* for a complete example.
>*/
>   SolrClient client = new HttpSolrClient("http://server:8983/solr");
>   SolrQuery query = new SolrQuery();
>   query.setQuery("*:*");
>   query.addFilterQuery("type:dps");
>   query.setRows(0);
>   query.add("facet", "true");
>   query.add("facet.date", "date");
>   query.add("facet.date.start", "NOW/DAY-6DAYS");
>   query.add("facet.date.end", "NOW/DAY+1DAY");
>   query.add("facet.date.gap", "+1DAY");
>   String collection = "gettingstarted";
>   client.query(collection, query);
>
> One possible problem in the attempts you've made:  In the parameters
> you've provided, %2B is a URL-encoded plus sign.  It is shown in the
> documentation that way because a plus sign in a URL is a URL-encoded
> space.  If your SolrJ code tries to use "%2B" like you would need when
> doing the query in a browser, then Solr will not receive a plus sign.
> It would receive the literal string "%2B" which it won't understand.
> SolrJ performs URL encoding on all parameters, so you don't want to do
> the URL encoding yourself.
>
> Thanks,
> Shawn
>
>


Re: Solrj facet.date

2016-11-04 Thread Shawn Heisey
On 11/4/2016 10:22 AM, Furkan KAMACI wrote:
> I send a query to Solr to get information about each day of current week
> via this way:
>
> q=*:*
> fq=type:dps
> rows=0
> facet=true
> facet.date=date
> facet.date.start=NOW/DAY-6DAYS
> facet.date.end=NOW/DAY%2B1DAY
> facet.date.gap=%2B1DAY
>
> I want to make that query over Solrj.

This code would do it:

  /*
   * The client creation would probably be elsewhere, just putting it here
   * for a complete example.
   */
  SolrClient client = new HttpSolrClient("http://server:8983/solr");
  SolrQuery query = new SolrQuery();
  query.setQuery("*:*");
  query.addFilterQuery("type:dps");
  query.setRows(0);
  query.add("facet", "true");
  query.add("facet.date", "date");
  query.add("facet.date.start", "NOW/DAY-6DAYS");
  query.add("facet.date.end", "NOW/DAY+1DAY");
  query.add("facet.date.gap", "+1DAY");
  String collection = "gettingstarted";
  client.query(collection, query);

One possible problem in the attempts you've made:  In the parameters
you've provided, %2B is a URL-encoded plus sign.  It is shown in the
documentation that way because a plus sign in a URL is a URL-encoded
space.  If your SolrJ code tries to use "%2B" like you would need when
doing the query in a browser, then Solr will not receive a plus sign. 
It would receive the literal string "%2B" which it won't understand. 
SolrJ performs URL encoding on all parameters, so you don't want to do
the URL encoding yourself.

Thanks,
Shawn
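Shawn's point about double-encoding can be demonstrated with the JDK alone: if you pre-encode the plus sign yourself, the client's own encoding pass mangles it.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PlusEncoding {
    public static void main(String[] args) {
        // SolrJ URL-encodes every parameter itself, so pass the raw value.
        System.out.println(URLEncoder.encode("NOW/DAY+1DAY", StandardCharsets.UTF_8));
        // -> NOW%2FDAY%2B1DAY   (the plus sign reaches Solr intact)

        // Pre-encoding by hand gets double-encoded: Solr would then see the
        // literal text "%2B" instead of a plus sign.
        System.out.println(URLEncoder.encode("NOW/DAY%2B1DAY", StandardCharsets.UTF_8));
        // -> NOW%2FDAY%252B1DAY
    }
}
```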



Solrj facet.date

2016-11-04 Thread Furkan KAMACI
Hi,

I send a query to Solr to get information about each day of current week
via this way:

q=*:*
fq=type:dps
rows=0
facet=true
facet.date=date
facet.date.start=NOW/DAY-6DAYS
facet.date.end=NOW/DAY%2B1DAY
facet.date.gap=%2B1DAY

I want to make that query over Solrj.

This facet.date definition at source code (5.5.3):

public static final String FACET_DATE = FACET + ".date";

However it is not used at Solrj. How can I make such a query with Solrj? If
I'm not missing anything I can create a patch for such functionality at
Solrj.

Kind Regards,
Furkan KAMACI


Re: How-To: Secure Solr by IP Address

2016-11-04 Thread Fuad Efendi

*Worth* mentioning: I run Solr on port 8080, and the firewall blocks *port* 
8080. That is not really securing by IP address!

“block by IP” vs. “block by port number”

“block *all* services run on a machine by IP address” vs. “block only Jetty”

and etc.



Still need option for Jetty, it will simplify life ;)




On November 4, 2016 at 12:05:13 PM, Fuad Efendi (f...@efendi.ca) wrote:

Yes we need that documented,

http://stackoverflow.com/questions/8924102/restricting-ip-addresses-for-jetty-and-solr


Of course a firewall is a must for highly secured environments / large 
corporations, DMZs, etc.; iptables is the simplest solution if you run Linux; 
my vendor 1and1.com provides firewall functionality too - but I wouldn’t trust 
it: what if local 1and1.com servers (in the same rack, for example) can 
bypass this firewall?


Having the option to configure Jetty minimizes dependencies. In real production I’d 
use all available options: firewall(s) + iptables + Jetty config + DMZ(s)


--
Fuad Efendi
(416) 993-2060
http://www.tokenizer.ca
Search Relevancy, Recommender Systems


On November 4, 2016 at 9:28:21 AM, David Smiley (david.w.smi...@gmail.com) 
wrote:

I was just researching how to secure Solr by IP address and I finally
figured it out. Perhaps this might go in the ref guide but I'd like to
share it here anyhow. The scenario is where only "localhost" should have
full unfettered access to Solr, whereas everyone else (notably web clients)
can only access some whitelisted paths. This setup is intended for a
single instance of Solr (not a member of a cluster); the particular config
below would probably need adaptations for a cluster of Solr instances. The
technique here uses a utility with Jetty called IPAccessHandler --
http://download.eclipse.org/jetty/stable-9/apidocs/org/eclipse/jetty/server/handler/IPAccessHandler.html
For reasons I don't know (and I did search), it was recently deprecated and
there's another InetAccessHandler (not in Solr's current version of Jetty)
but it doesn't support constraints incorporating paths, so it's a
non-option for my needs.

First, Java must be told to insist on its IPv4 stack. This is because
Jetty's IPAccessHandler simply doesn't support IPv6 IP matching; it throws
NPEs in my experience. In recent versions of Solr, this can be easily done
just by adding -Djava.net.preferIPv4Stack=true at the Solr start
invocation. Alternatively put it into SOLR_OPTS perhaps in solr.in.sh.

Edit server/etc/jetty.xml, and replace the line
mentioning ContextHandlerCollection with this:

<Set name="handler">
  <New class="org.eclipse.jetty.server.handler.IPAccessHandler">
    <Set name="white">
      <Array type="String">
        <Item>127.0.0.1</Item>
        <Item>-.-.-.-|/solr/techproducts/select</Item>
      </Array>
    </Set>
    <Set name="whiteListByPath">false</Set>
    <Set name="handler">
      <New id="Contexts" class="org.eclipse.jetty.server.handler.ContextHandlerCollection"/>
    </Set>
  </New>
</Set>

This mechanism wraps ContextHandlerCollection (which ultimately serves
Solr) with this handler that adds the constraints. These constraints above
allow localhost to do anything; other IP addresses can only access
/solr/techproducts/select. That line could be duplicated for other
white-listed paths -- I recommend creating request handlers for your use,
possibly with invariants to further constrain what someone can do.

note: I originally tried inserting the IPAccessHandler in
server/contexts/solr-jetty-context.xml but found that there's a bug in
IPAccessHandler that fails to consider when HttpServletRequest.getPathInfo
is null. And it wound up letting everything through (if I recall). But I
like it up in server.xml anyway as it intercepts everything

~ David

--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: How-To: Secure Solr by IP Address

2016-11-04 Thread Fuad Efendi
Yes we need that documented,

http://stackoverflow.com/questions/8924102/restricting-ip-addresses-for-jetty-and-solr


Of course a firewall is a must for highly secured environments / large 
corporations, DMZs, etc.; iptables is the simplest solution if you run Linux; 
my vendor 1and1.com provides firewall functionality too - but I wouldn’t trust 
it: what if local 1and1.com servers (in the same rack, for example) can 
bypass this firewall?


Having the option to configure Jetty minimizes dependencies. In real production I’d 
use all available options: firewall(s) + iptables + Jetty config + DMZ(s)
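For the iptables route, rules along these lines restrict Solr's port by source IP (the port and the trusted app-server address are placeholders):

```
# Allow localhost and one trusted app server to reach Solr; drop everyone else.
iptables -A INPUT -p tcp --dport 8983 -s 127.0.0.1 -j ACCEPT
iptables -A INPUT -p tcp --dport 8983 -s 10.0.0.5 -j ACCEPT
iptables -A INPUT -p tcp --dport 8983 -j DROP
```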


--
Fuad Efendi
(416) 993-2060
http://www.tokenizer.ca
Search Relevancy, Recommender Systems


On November 4, 2016 at 9:28:21 AM, David Smiley (david.w.smi...@gmail.com) 
wrote:

I was just researching how to secure Solr by IP address and I finally  
figured it out. Perhaps this might go in the ref guide but I'd like to  
share it here anyhow. The scenario is where only "localhost" should have  
full unfettered access to Solr, whereas everyone else (notably web clients)  
can only access some whitelisted paths. This setup is intended for a  
single instance of Solr (not a member of a cluster); the particular config  
below would probably need adaptations for a cluster of Solr instances. The  
technique here uses a utility with Jetty called IPAccessHandler --  
http://download.eclipse.org/jetty/stable-9/apidocs/org/eclipse/jetty/server/handler/IPAccessHandler.html
  
For reasons I don't know (and I did search), it was recently deprecated and  
there's another InetAccessHandler (not in Solr's current version of Jetty)  
but it doesn't support constraints incorporating paths, so it's a  
non-option for my needs.  

First, Java must be told to insist on its IPv4 stack. This is because  
Jetty's IPAccessHandler simply doesn't support IPv6 IP matching; it throws  
NPEs in my experience. In recent versions of Solr, this can be easily done  
just by adding -Djava.net.preferIPv4Stack=true at the Solr start  
invocation. Alternatively put it into SOLR_OPTS perhaps in solr.in.sh.  

Edit server/etc/jetty.xml, and replace the line  
mentioning ContextHandlerCollection with this:  

<Item>
  <New class="org.eclipse.jetty.server.handler.IPAccessHandler">
    <Set name="white">
      <Array type="String">
        <Item>127.0.0.1</Item>
        <Item>-.-.-.-|/solr/techproducts/select</Item>
      </Array>
    </Set>
    <Set name="whiteListByPath">false</Set>
    <Set name="handler">
      <New class="org.eclipse.jetty.server.handler.ContextHandlerCollection"/>
    </Set>
  </New>
</Item>

This mechanism wraps ContextHandlerCollection (which ultimately serves  
Solr) with this handler that adds the constraints. These constraints above  
allow localhost to do anything; other IP addresses can only access  
/solr/techproducts/select. That line could be duplicated for other  
white-listed paths -- I recommend creating request handlers for your use,  
possibly with invariants to further constrain what someone can do.  

note: I originally tried inserting the IPAccessHandler in  
server/contexts/solr-jetty-context.xml but found that there's a bug in  
IPAccessHandler that fails to consider when HttpServletRequest.getPathInfo  
is null, and it wound up letting everything through (if I recall). But I  
like it up in jetty.xml anyway as it intercepts everything.  

~ David  

--  
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker  
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:  
http://www.solrenterprisesearchserver.com  


Re: Indexing and Disk Writes

2016-11-04 Thread Erick Erickson
Every time your ramBufferSizeMB limit is exceeded, a segment is
created that's eventually merged. In terms of _throughput_, making
this larger usually doesn't help much beyond about 100MB (the default).
It'd be interesting to see if it changes your I/O activity though.

BTW, I'd hard commit (openSearcher=false) much more frequently. As you
saw, that doesn't particularly change I/O, but if Solr should terminate
abnormally the tlog will be replayed on startup, and with 10-minute
commits that replay can take a while.
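For reference, both knobs live in solrconfig.xml; a hedged sketch with illustrative values (not a recommendation for every workload):

```xml
<!-- Illustrative values only; tune for your own workload -->
<indexConfig>
  <!-- flush a new segment once the in-memory buffer reaches this size -->
  <ramBufferSizeMB>100</ramBufferSizeMB>
</indexConfig>

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit every 60 seconds -->
    <openSearcher>false</openSearcher>  <!-- truncate the tlog without opening a new searcher -->
  </autoCommit>
</updateHandler>
```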

You could also consider disabling tlogs for the duration of your bulk
indexing, then turn them back on for incremental.

The background merging can be pretty dramatic, though; that may well be
where much of this is coming from.

Best,
Erick

On Fri, Nov 4, 2016 at 8:51 AM, Andrew Dinsmore  wrote:
> We are using Solr 5.4 to index TBs of documents in a bulk fashion to get
> the cluster up and running. Indexing is over HTTP round robin as directed
> by zookeeper.
>
> Each of the 13 nodes is receiving about 6-8 MB/s on the NIC but solr is
> writing around 20 to 25 thousand times per second (4k block size). My
> question is what is Solr doing writing all this data to disk (80-100MB/s)?
>
> Over a three hour run with 4.5 million docs we only committed 20 some times
> but disk activity was pretty constant at the above levels.
>
> Is there more going on than tlogs, commits and merges? When we moved from 1
> minute autoCommit to 10 we committed less per the log messages but I
> expected the bigger initial segments to result in less merging thus lower
> disk activity. But testing showed no significant change in disk writing.
>
> Thanks for any help.
>
> Andrew


Re: Sitecore deleting Solr documents

2016-11-04 Thread Erick Erickson
Hmm, I'm not quite sure we can help you, as this sounds like
Sitecore-specific functionality. Here's my total guess anyway: the docs
are somehow getting indexed directly to CD, and CD is a slave to CM. So
the next time a
replication is triggered (see the settings in solrconfig.xml) the
index from CM overwrites the one on CD and CD no longer has the
documents. So when these docs disappear, you should see some messages
in the solr log about fetching the index from the master.

You could quickly test this by disabling replication which you can do
with the replication API without having to restart Solr or anything,
assuming that the SiteCore isn't issuing the commands instead of
relying on the standard polling mechanism, see:
https://cwiki.apache.org/confluence/display/solr/Index+Replication#IndexReplication-HTTPAPICommandsfortheReplicationHandler
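For reference, the two commands from that page that matter here look like this (host and core name are placeholders):

```
http://cd-host:8983/solr/core-name/replication?command=disablepoll
http://cd-host:8983/solr/core-name/replication?command=enablepoll
```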

This is totally guessing since I know nothing about how Sitecore
works. It's just one path I can imagine for the behavior you're
describing.

Best,
Erick

On Thu, Nov 3, 2016 at 8:18 PM, Joshua Campbell
 wrote:
> Hi All,
> I'm having an odd issue with Solr, and am looking for some help or 
> suggestions.
>
> We're using Solr (on a Sitecore website) for search and some search-driven 
> pages. CM is pointing to a sitecore_master_index in Solr, while CD is 
> pointing to a sitecore_web_index. We're using the OnPublishEndAsync strategy. 
> Something really odd is happening.
>
> When we publish content, it shows up on CD (and the correct Solr index), 
> then, about 90 seconds later, it's gone.
>
> If anybody has any insight into why this might happen, please let me know.
>
>
> I've enabled all of the "Update" logs in solr, and I don't see any records 
> deleting my item.  Please let me know if that's not the right place.
>
>
> Thanks,
>
> Josh


Indexing and Disk Writes

2016-11-04 Thread Andrew Dinsmore
We are using Solr 5.4 to index TBs of documents in a bulk fashion to get
the cluster up and running. Indexing is over HTTP round robin as directed
by zookeeper.

Each of the 13 nodes is receiving about 6-8 MB/s on the NIC but solr is
writing around 20 to 25 thousand times per second (4k block size). My
question is what is Solr doing writing all this data to disk (80-100MB/s)?

Over a three hour run with 4.5 million docs we only committed 20 some times
but disk activity was pretty constant at the above levels.

Is there more going on than tlogs, commits and merges? When we moved from 1
minute autoCommit to 10 we committed less per the log messages but I
expected the bigger initial segments to result in less merging thus lower
disk activity. But testing showed no significant change in disk writing.

Thanks for any help.

Andrew


Re: Different Sorts based on Different Groups

2016-11-04 Thread Fuad Efendi
Hi Gustatec,


Relevancy tuning is really a *huge* area; check out this book when you have a
chance: https://www.manning.com/books/relevant-search

Default Solr sorting is based on the TF/IDF algorithm, and that kind of
sorting is not necessarily 'relevancy'.

A trivial solution for the clothes-store domain would be this one; it's
easier to explain using examples:

Product 1

Name: "Russell Athletic Men's Basic Tank Top"
Categories: “Shirt”, “Sleeveless Shirt”, “Tank Top”

Product 2

Name: "Russell Athletic Men's Cotton Muscle Shirt"
Categories: “Shirt”, “Sleeveless Shirt”, “Tank Top”


You may notice that the first product has "Top" repeated twice across product
name and category, and the second one has "Shirt" repeated twice.

Now, having this real-life example, you can play with a boost query, boosting
results that contain words from the category name in their product name:

category:"Tank Top" & bq:"name:tank^10 OR name:top^5"


Solr provides “boost query” to tune sorting of output results, check “bq”
parameter in the docs at
https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
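Put together as request parameters, a sketch might look like this (field names follow the example above; the weights are illustrative):

```
q=tank top
&defType=dismax
&qf=name category
&fq=category:"Tank Top"
&bq=name:tank^10 OR name:top^5
```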


I started from a real-life scenario; your scenario and possible solutions could
be very different.

I recently had an assignment at a well-known retail shop where we even designed
pre-query custom boosts so that we could customize typical (most important
for the business) queries per business needs.



Thanks,

--

Fuad Efendi

(416) 993-2060

http://www.tokenizer.ca
Search Relevancy, Recommender Systems


On November 4, 2016 at 10:57:02 AM, Gustatec (gusta...@gmail.com) wrote:

Hello everyone!

I'm currently using Solr in a project (pretty much an e-commerce POC) and
came across with the following sort situation:

I have two products one called Product1 and other one called Product2, both
of them belongs to the same categories, Shirt(ID 1) and Tank-Top(ID 2)

When i query for any of these categories, it returns both of the products,
in the same order.

Is it possible to do some kind of grouping sort in query? So when i query
for category Shirt, it returns first Product1 then Product2 and when i do
the same query for category Tank-Top it would return first Product2 then
Product1?

By asking that i wonder if its possible to make a product more relevant,
based on the query.

So product1 relevancy would be
Category ID | Priority
1 | 1
2 | 2

And product2 would be
Category ID | Priority
1 | 2
2 | 1


Is it possible to achieve this "elevate" funcionality in query?

i thought in doing a _sort field for all categories, but we
are actually talking about a few hundred categories, so i dont know if
would
be viable to create one sort field for each one of them in every single
doc...

Ps: I asks if its achievable that in query because i dont know if there is
any other way of changing the elevate.xml file without having to restart my
solr instance

Sorry for my bad english, and thanks in advance!



-- 
View this message in context:
http://lucene.472066.n3.nabble.com/Different-Sorts-based-on-Different-Groups-tp4304516.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Search performance

2016-11-04 Thread Alessandro Benedetti
Seconding Shawn. If your queries will always target the active documents,
at a high level this is what is going to happen:

A) You need to run your query plus a filter query that retrieves only
active documents.
The filter query results will be cached.
Solr will query over the entire document space, and then merge the query
results with the filtered documents.

B) You run your query over the entire (smaller) document space.

So option B will be faster; possibly not massively, but it does less
calculation.

Cheers
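As request parameters, the two options above might look like this (field names are assumptions):

```
# Option A: collection with an active flag; the filter query is cached
q=name:jacket&fq=active:true

# Option B: collection containing only active documents; no filter needed
q=name:jacket
```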

On Fri, Nov 4, 2016 at 2:45 PM, Shawn Heisey  wrote:

> On 11/4/2016 8:22 AM, Vincenzo D'Amore wrote:
> > Given 2 collection A and B:
> >
> > - A collection have 5 M documents with an attribute active: true/false.
> > - B collection have only 2.5 M documents, but all the documents have
> > attribute active:true
> > - in any case, A or B, I can only search upon documents that have
> > active:true
> >
> > Which one perform faster?
>
> This is not backed by knowledge of how the code internals operate, just
> things I've pieced together from my own experience and other things said
> on the list in response to past questions.
>
> Assuming you have the available memory to effectively cache both
> indexes, five million documents is chump change to Solr.  If you don't
> have that memory, it might present a performance issue.
>
> Because query performance is largely dependent on the number of terms
> that Solr must look through, and the active field probably has at most
> three (true, false, and field not present), that part of your query will
> generally be very fast with ANY number of documents.
>
> If you search for all documents and filter on the active field, the
> difference between the two will probably be so small a human being would
> never notice it, but it probably would be a difference that you'd be
> able to measure.
>
> Where you *might* notice a difference is when you do a "real" query
> against other fields in the index, and filter on the active field.
> That's when the document count will usually track with the term count.
> The smaller collection may be noticeably faster for this kind of query.
>
> Thanks,
> Shawn
>
>


-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: Apache Solr Question

2016-11-04 Thread Chien Nguyen
Great! Thank you so much. ^^



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-Solr-Question-tp4304308p4304437.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Different Sorts based on Different Groups

2016-11-04 Thread Alessandro Benedetti
Hi Gustatec,
your problem seems like a fairly basic relevance problem.
Instead of elevating documents, why don't you include the category as part
of the main query?
To make it simple: in Solr, the main query affects the score
while filter queries don't.

If in your case you add the category as part of the main query, documents
matching that category will be more relevant.
Relevancy is a hard topic, but based on your initial requirement I think
you can solve it quite easily.
If I misunderstood anything, let me know!

Cheers
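A sketch of the difference, using the categories from the thread (syntax illustrative):

```
# category as an optional clause in the main query: matching docs score higher
q=name:shirt category:"Tank-Top"^5

# category as a filter query: restricts results but does not affect ordering
q=name:shirt&fq=category:"Tank-Top"
```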

On Fri, Nov 4, 2016 at 2:51 PM, Gustatec  wrote:

> Hello everyone!
>
> I'm currently using Solr in a project (pretty much an e-commerce POC) and
> came across with the following sort situation:
>
> I have two products one called Product1 and other one called Product2, both
> of them belongs to the same categories, Shirt(ID 1) and Tank-Top(ID 2)
>
> When i query for any of these categories, it returns both of the products,
> in the same order.
>
> Is it possible to do some kind of grouping sort in query? So when i query
> for category Shirt, it returns first Product1 then Product2 and when i do
> the same query for category Tank-Top it would return first Product2 then
> Product1?
>
> By asking that i wonder if its possible to make a product more relevant,
> based on the query.
>
> So product1 relevancy would be
> Category ID | Priority
> 1   | 1
> 2   | 2
>
> And product2 would be
> Category ID | Priority
> 1   | 2
> 2   | 1
>
>
> Is it possible to achieve this "elevate" funcionality in query?
>
> i thought in doing a _sort field  for all categories, but we
> are actually talking about a few hundred categories, so i dont know if
> would
> be viable to create one sort field for each one of them in every single
> doc...
>
> Ps: I asks if its achievable that in query because i dont know if there is
> any other way of changing the elevate.xml file without having to restart my
> solr instance
>
> Sorry for my bad english, and thanks in advance!
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Different-Sorts-based-on-Different-Groups-tp4304516.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Different Sorts based on Different Groups

2016-11-04 Thread Gustatec
Hello everyone!

I'm currently using Solr in a project (pretty much an e-commerce POC) and
came across with the following sort situation:

I have two products, one called Product1 and the other called Product2; both
of them belong to the same categories, Shirt (ID 1) and Tank-Top (ID 2).

When I query for either of these categories, it returns both of the products,
in the same order.

Is it possible to do some kind of grouping sort in the query? So when I query
for category Shirt, it returns first Product1 then Product2, and when I do
the same query for category Tank-Top it returns first Product2 then
Product1?

By asking that, I wonder if it's possible to make a product more relevant,
based on the query.

So product1 relevancy would be 
Category ID | Priority
1   | 1
2   | 2

And product2 would be 
Category ID | Priority
1   | 2
2   | 1


Is it possible to achieve this "elevate" functionality in the query?

I thought of doing a per-category _sort field for all categories, but we
are actually talking about a few hundred categories, so I don't know if it
would be viable to create one sort field for each one of them in every single
doc...

P.S.: I ask if it's achievable in the query because I don't know of any
other way of changing the elevate.xml file without having to restart my
Solr instance.

Sorry for my bad English, and thanks in advance!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Different-Sorts-based-on-Different-Groups-tp4304516.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Search performance

2016-11-04 Thread Shawn Heisey
On 11/4/2016 8:22 AM, Vincenzo D'Amore wrote:
> Given 2 collection A and B:
>
> - A collection have 5 M documents with an attribute active: true/false.
> - B collection have only 2.5 M documents, but all the documents have
> attribute active:true
> - in any case, A or B, I can only search upon documents that have
> active:true
>
> Which one perform faster?

This is not backed by knowledge of how the code internals operate, just
things I've pieced together from my own experience and other things said
on the list in response to past questions.

Assuming you have the available memory to effectively cache both
indexes, five million documents is chump change to Solr.  If you don't
have that memory, it might present a performance issue.

Because query performance is largely dependent on the number of terms
that Solr must look through, and the active field probably has at most
three (true, false, and field not present), that part of your query will
generally be very fast with ANY number of documents.

If you search for all documents and filter on the active field, the
difference between the two will probably be so small a human being would
never notice it, but it probably would be a difference that you'd be
able to measure.

Where you *might* notice a difference is when you do a "real" query
against other fields in the index, and filter on the active field. 
That's when the document count will usually track with the term count. 
The smaller collection may be noticeably faster for this kind of query.

Thanks,
Shawn



RE: UpdateProcessor as a batch

2016-11-04 Thread Markus Jelsma
Thanks all for sharing your thoughts! 
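For reference, the buffer-then-batch-lookup pattern discussed in the quoted thread can be sketched independently of the Solr UpdateProcessor API (all class and method names below are hypothetical; a real URP would buffer in processAdd and must also flush on commit/close so buffered docs are not lost on shutdown):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: instead of one remote call per document, buffer ids
// and resolve them against the remote backend in batches.
public class BatchedLookup {
    private final int batchSize;
    private final Function<List<String>, Map<String, String>> remoteBatchLookup;
    private final List<String> buffer = new ArrayList<>();
    private final Map<String, String> resolved = new HashMap<>();

    public BatchedLookup(int batchSize,
                         Function<List<String>, Map<String, String>> remoteBatchLookup) {
        this.batchSize = batchSize;
        this.remoteBatchLookup = remoteBatchLookup;
    }

    // Called once per incoming document (analogous to processAdd).
    public void add(String docId) {
        buffer.add(docId);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Must also be called on commit/close so no buffered docs are lost.
    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // One remote round trip for the whole batch.
        resolved.putAll(remoteBatchLookup.apply(new ArrayList<>(buffer)));
        buffer.clear();
    }

    public Map<String, String> resolved() {
        return resolved;
    }

    public static void main(String[] args) {
        // Stand-in for the remote backend's batched lookup endpoint.
        BatchedLookup lookup = new BatchedLookup(100, ids -> {
            Map<String, String> out = new HashMap<>();
            for (String id : ids) {
                out.put(id, "metadata-for-" + id);
            }
            return out;
        });
        for (int i = 0; i < 250; i++) {
            lookup.add("doc" + i);
        }
        lookup.flush(); // final flush, as on commit
        System.out.println(lookup.resolved().size()); // prints 250
    }
}
```

The same shape maps onto a real UpdateRequestProcessor by calling the equivalent of flush() from processCommit and finish().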
 
-Original message-
> From:Joel Bernstein 
> Sent: Friday 4th November 2016 1:28
> To: solr-user@lucene.apache.org
> Subject: Re: UpdateProcessor as a batch
> 
> This might be useful. In this scenario you load your content into Solr for
> staging and perform your ETL from Solr to Solr:
> 
> http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
> 
> Basically Solr becomes a text processing warehouse.
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/
> 
> On Thu, Nov 3, 2016 at 5:05 PM, Alexandre Rafalovitch 
> wrote:
> 
> > How big a batch we are talking about?
> >
> > Because I believe you could accumulate the docs in the first URP in
> > the processAdd and then do the batch lookup and actually processing of
> > them on processCommit.
> >
> > They are daisy chain, so as long as you are holding on to the chain,
> > the rest of the URPs don't happen.
> >
> > Obviously you are relying on the commit here to trigger the final call.
> >
> > Or you could do a two collection sequence with indexing to first
> > collection, querying for whatever you need to batch lookup and then
> > doing Collection-to-Collection enhanced copy.
> >
> > Regards,
> >Alex.
> > 
> > Solr Example reading group is starting November 2016, join us at
> > http://j.mp/SolrERG
> > Newsletter and resources for Solr beginners and intermediates:
> > http://www.solr-start.com/
> >
> >
> > On 4 November 2016 at 07:35, mike st. john  wrote:
> > > maybe introduce a distributed queue such as apache ignite,  hazelcast or
> > > even redis.   Read from the queue in batches, do your lookup then index
> > the
> > > same batch.
> > >
> > > just a thought.
> > >
> > > Mike St. John.
> > >
> > > On Nov 3, 2016 3:58 PM, "Erick Erickson" 
> > wrote:
> > >
> > >> I thought we might be talking past each other...
> > >>
> > >> I think you're into "roll your own" here. Anything that
> > >> accumulated docs for a while, did a batch lookup
> > >> on the external system, then passed on the docs
> > >> runs the risk of losing docs if the server is abnormally
> > >> shut down.
> > >>
> > >> I guess ideally you'd like to augment the list coming in
> > >> rather than the docs once they're removed from the
> > >> incoming batch and passed on, but I admit I have no
> > >> clue where to do that. Possibly in an update chain? If
> > >> so, you'd need to be careful to only augment when
> > >> they'd reached their final shard leader or all at once
> > >> before distribution to shard leaders.
> > >>
> > >> Is the expense for the external lookup doing the actual
> > >> lookups or establishing the connection? Would
> > >> having some kind of shared connection to the external
> > >> source be worthwhile?
> > >>
> > >> FWIW,
> > >> Erick
> > >>
> > >> On Thu, Nov 3, 2016 at 12:06 PM, Markus Jelsma
> > >>  wrote:
> > >> > Hi - i believe i did not explain myself well enough.
> > >> >
> > >> > Getting the data in Solr is not a problem, various sources index docs
> > to
> > >> Solr, all in fine batches as everyone should do indeed. The thing is
> > that i
> > >> need to do some preprocessing before it is indexed. Normally,
> > >> UpdateProcessors are the way to go. I've made quite a few of them and
> > they
> > >> work fine.
> > >> >
> > >> > The problem is, i need to do a remote lookup for each document being
> > >> indexed. Right now, i make an external connection for each doc being
> > >> indexed in the current UpdateProcessor. This is still fast. But the
> > remote
> > >> backend supports batched lookups, which are faster.
> > >> >
> > >> > This is why i'd love to be able to buffer documents in an
> > >> UpdateProcessor, and if there are enough, i do a remote lookup for all
> > of
> > >> them, do some processing and let them be indexed.
> > >> >
> > >> > Thanks,
> > >> > Markus
> > >> >
> > >> >
> > >> >
> > >> > -Original message-
> > >> >> From:Erick Erickson 
> > >> >> Sent: Thursday 3rd November 2016 19:18
> > >> >> To: solr-user 
> > >> >> Subject: Re: UpdateProcessor as a batch
> > >> >>
> > >> >> I _thought_ you'd been around long enough to know about the options I
> > >> >> mentioned ;).
> > >> >>
> > >> >> Right. I'd guess you're in UpdateHandler.addDoc and there's really no
> > >> >> batching at that level that I know of. I'm pretty sure that even
> > >> >> indexing batches of 1,000 documents from, say, SolrJ go through this
> > >> >> method.
> > >> >>
> > >> >> I don't think there's much to be gained by any batching at this
> > level,
> > >> >> it pretty immediately tells Lucene to index the doc.
> > >> >>
> > >> >> FWIW
> > >> >> Erick
> > >> >>
> > >> >> On Thu, Nov 3, 2016 at 11:10 AM, Markus Jelsma
> > >> >>  wrote:
> > >> >> > Erick - in this case data can come from anywhere. There is one

Search performance

2016-11-04 Thread Vincenzo D'Amore
Hi all,

it's trivia time :) hope you enjoy the question.

Given 2 collections, A and B:

- Collection A has 5M documents, with an attribute active: true/false.
- Collection B has only 2.5M documents, but all of the documents have the
attribute active:true.
- In either case, A or B, I can only search upon documents that have
active:true.

Which one perform faster?

I ask because someone in my office says that it does not matter.

They say it is not even worth the effort to remove all the inactive documents,
because there is no performance gain in this change, apart from the reduced
time during the document ingestion process.

I'm preparing a stress test to verify this, but before that I'm curious
to read your opinions.

Thanks in advance for your time,
Vincenzo



-- 
Vincenzo D'Amore
email: v.dam...@gmail.com
skype: free.dev
mobile: +39 349 8513251


Re: How-To: Secure Solr by IP Address

2016-11-04 Thread David Smiley
Not to knock the other suggestions, but a benefit to securing Jetty like
this is that *everyone* can do this approach.

On Fri, Nov 4, 2016 at 9:54 AM john saylor  wrote:

> hi
>
> any firewall worth it's name should be able to do this. in fact, that is
> one of several things that a firewall was designed to do.
>
> also, you are stopping this traffic at the application, which is good;
> but you'd prolly be better off stopping it at the network interface
> [using a firewall, for instance].
>
> of course, firewalls have their own complexity ...
>
> good luck!
>
> --
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: How-To: Secure Solr by IP Address

2016-11-04 Thread john saylor

hi

any firewall worth its name should be able to do this. in fact, that is 
one of several things that a firewall was designed to do.


also, you are stopping this traffic at the application, which is good; 
but you'd probably be better off stopping it at the network interface 
[using a firewall, for instance].


of course, firewalls have their own complexity ...

good luck!



Re: How-To: Secure Solr by IP Address

2016-11-04 Thread GW
I run a small SolrCloud on a set of internal IP addresses. I connect with a
routed OpenVPN, so I hit Solr on 10.8.0.1:8983 from my desktop. Only my web
clients are on public IPs, and only those clients can talk to the inside
cluster.

That's how I manage things...

On 4 November 2016 at 09:27, David Smiley  wrote:

> I was just researching how to secure Solr by IP address and I finally
> figured it out.  Perhaps this might go in the ref guide but I'd like to
> share it here anyhow.  The scenario is where only "localhost" should have
> full unfettered access to Solr, whereas everyone else (notably web clients)
> can only access some whitelisted paths.  This setup is intended for a
> single instance of Solr (not a member of a cluster); the particular config
> below would probably need adaptations for a cluster of Solr instances.  The
> technique here uses a utility with Jetty called IPAccessHandler --
> http://download.eclipse.org/jetty/stable-9/apidocs/org/
> eclipse/jetty/server/handler/IPAccessHandler.html
> For reasons I don't know (and I did search), it was recently deprecated and
> there's another InetAccessHandler (not in Solr's current version of Jetty)
> but it doesn't support constraints incorporating paths, so it's a
> non-option for my needs.
>
> First, Java must be told to insist on its IPv4 stack. This is because
> Jetty's IPAccessHandler simply doesn't support IPv6 IP matching; it throws
> NPEs in my experience. In recent versions of Solr, this can be easily done
> just by adding -Djava.net.preferIPv4Stack=true at the Solr start
> invocation.  Alternatively put it into SOLR_OPTS perhaps in solr.in.sh.
>
> Edit server/etc/jetty.xml, and replace the line
> mentioning ContextHandlerCollection with this:
>
> <Item>
>   <New class="org.eclipse.jetty.server.handler.IPAccessHandler">
>     <Set name="white">
>       <Array type="String">
>         <Item>127.0.0.1</Item>
>         <Item>-.-.-.-|/solr/techproducts/select</Item>
>       </Array>
>     </Set>
>     <Set name="whiteListByPath">false</Set>
>     <Set name="handler">
>       <New class="org.eclipse.jetty.server.handler.ContextHandlerCollection"/>
>     </Set>
>   </New>
> </Item>
>
> This mechanism wraps ContextHandlerCollection (which ultimately serves
> Solr) with this handler that adds the constraints.  These constraints above
> allow localhost to do anything; other IP addresses can only access
> /solr/techproducts/select.  That line could be duplicated for other
> white-listed paths -- I recommend creating request handlers for your use,
> possibly with invariants to further constrain what someone can do.
>
> note: I originally tried inserting the IPAccessHandler in
> server/contexts/solr-jetty-context.xml but found that there's a bug in
> IPAccessHandler that fails to consider when HttpServletRequest.getPathInfo
> is null, and it wound up letting everything through (if I recall).  But I
> like it up in jetty.xml anyway as it intercepts everything.
>
> ~ David
>
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
>


How-To: Secure Solr by IP Address

2016-11-04 Thread David Smiley
I was just researching how to secure Solr by IP address and I finally
figured it out.  Perhaps this might go in the ref guide but I'd like to
share it here anyhow.  The scenario is where only "localhost" should have
full unfettered access to Solr, whereas everyone else (notably web clients)
can only access some whitelisted paths.  This setup is intended for a
single instance of Solr (not a member of a cluster); the particular config
below would probably need adaptations for a cluster of Solr instances.  The
technique here uses a utility with Jetty called IPAccessHandler --
http://download.eclipse.org/jetty/stable-9/apidocs/org/eclipse/jetty/server/handler/IPAccessHandler.html
For reasons I don't know (and I did search), it was recently deprecated and
there's another InetAccessHandler (not in Solr's current version of Jetty)
but it doesn't support constraints incorporating paths, so it's a
non-option for my needs.

First, Java must be told to insist on its IPv4 stack. This is because
Jetty's IPAccessHandler simply doesn't support IPv6 IP matching; it throws
NPEs in my experience. In recent versions of Solr, this can be easily done
just by adding -Djava.net.preferIPv4Stack=true at the Solr start
invocation.  Alternatively put it into SOLR_OPTS perhaps in solr.in.sh.

Edit server/etc/jetty.xml, and replace the line
mentioning ContextHandlerCollection with this:

<Item>
  <New class="org.eclipse.jetty.server.handler.IPAccessHandler">
    <Set name="white">
      <Array type="String">
        <Item>127.0.0.1</Item>
        <Item>-.-.-.-|/solr/techproducts/select</Item>
      </Array>
    </Set>
    <Set name="whiteListByPath">false</Set>
    <Set name="handler">
      <New class="org.eclipse.jetty.server.handler.ContextHandlerCollection"/>
    </Set>
  </New>
</Item>

This mechanism wraps ContextHandlerCollection (which ultimately serves
Solr) with this handler that adds the constraints.  These constraints above
allow localhost to do anything; other IP addresses can only access
/solr/techproducts/select.  That line could be duplicated for other
white-listed paths -- I recommend creating request handlers for your use,
possibly with invariants to further constrain what someone can do.
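A quick way to sanity-check the result from a non-local machine (host is a placeholder; in my reading of the IPAccessHandler code, rejected requests get a 403):

```
curl -i http://solr-host:8983/solr/techproducts/select?q=*:*   # expected: 200
curl -i http://solr-host:8983/solr/admin/cores                 # expected: 403 Forbidden
```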

note: I originally tried inserting the IPAccessHandler in
server/contexts/solr-jetty-context.xml but found that there's a bug in
IPAccessHandler that fails to consider when HttpServletRequest.getPathInfo
is null, and it wound up letting everything through (if I recall). But I
like it up in jetty.xml anyway as it intercepts everything.

~ David

-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: Fields with stored=false are stored though

2016-11-04 Thread Alexandre Rafalovitch
docValues are enabled (in the field type), and with the latest schema
version docValues can be returned even if stored is off.

You can disable docValues, or disable returning their values unless they
are requested explicitly in the fl param.

Regards,
   Alex.
P.S. I am not saying that was a smart idea to do in the default example
schema..

Solr Example reading group is starting November 2016, join us at
http://j.mp/SolrERG
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/
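A hedged sketch of both remedies as schema.xml field definitions (the exact original definition is assumed; useDocValuesAsStored is the relevant attribute):

```xml
<!-- Option 1: keep docValues but stop returning them unless listed in fl -->
<field name="attributes_size" type="int" indexed="true" stored="false"
       docValues="true" useDocValuesAsStored="false" multiValued="true"/>

<!-- Option 2: drop docValues entirely (also loses sorting/faceting benefits) -->
<field name="attributes_size" type="int" indexed="true" stored="false"
       docValues="false" multiValued="true"/>
```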


On 4 November 2016 at 23:29, Reinhard Budenstecher  wrote:
> I'm using Solr 6.2.1. Schema is static (schema.xml) and some fields look like
>
>   <field name="attributes_size" type="int" indexed="true" stored="false" multiValued="true"/>
>   <field name="attributes_price" type="int" indexed="true" stored="false" multiValued="true"/>
>
> and so on. But when querying in the web browser GUI I can see that these
> fields are stored anyway and values are returned on query. How can this
> happen? Looking into the web schema browser I can see fields with the
> following attributes:
>
> Field: attributes_size
> Field-Type: org.apache.solr.schema.TrieIntField
> Flags: Indexed, DocValues, Multivalued, Omit Norms, Omit Term Frequencies & Positions
> Properties: √ √ √ √ √
>
> Field: attributes_price
> Field-Type: org.apache.solr.schema.TrieIntField
> Flags: Indexed, DocValues, Multivalued, Omit Norms, Omit Term Frequencies & Positions
> Properties: √ √ √ √ √
> Schema: √ √ √ √ √
> Index: (unstored field)
>
> What is wrong there?
>
> __
> Sent with Maills.de - more than just free mail www.maills.de
>
>


Fields with stored=false are stored though

2016-11-04 Thread Reinhard Budenstecher
I'm using Solr 6.2.1. Schema is static (schema.xml) and some fields look like

  <field name="attributes_size" type="int" indexed="true" stored="false" multiValued="true"/>
  <field name="attributes_price" type="int" indexed="true" stored="false" multiValued="true"/>

and so on. But when querying in the web browser GUI I can see that these
fields are stored anyway and values are returned on query. How can this
happen? Looking into the web schema browser I can see fields with the
following attributes:

Field: attributes_size
Field-Type: org.apache.solr.schema.TrieIntField
Flags: Indexed, DocValues, Multivalued, Omit Norms, Omit Term Frequencies & Positions
Properties: √ √ √ √ √

Field: attributes_price
Field-Type: org.apache.solr.schema.TrieIntField
Flags: Indexed, DocValues, Multivalued, Omit Norms, Omit Term Frequencies & Positions
Properties: √ √ √ √ √
Schema: √ √ √ √ √
Index: (unstored field)

What is wrong there?

__
Sent with Maills.de - more than just free mail www.maills.de




Re: Local parameter query and multiple fields

2016-11-04 Thread Gintautas Sulskus
To add: I am passing the parameter defType=edismax.

On Fri, Nov 4, 2016 at 11:41 AM, Gintautas Sulskus <
gintautas.suls...@gmail.com> wrote:

> Hi,
>
> If I search for "London" with the following query, I get London city at
> the top.
>
> name:London^10
> category:City^5
> category:Organization^1
>
> Now I would like to store this query in a SearchHandler with a parameter
> $term instead of the hard-coded word "London". However, I am not sure how
> the query should be constructed to get results identical to the query
> above. The following query ignores the category search altogether:
>
> {!qf="name^10" v=$term}
> category:City^5
> category:Organization^1
>
> To add, what if I wanted to date-boost the category field (not the whole
> query) if the matched $term is of type Organization?
>
> {!qf="name^10" v=$term}
> category:City^5
> category:Organization^(date_boost)
>
> Are you aware of a good book or a source on the Internet regarding query
> construction as specified above?
>
> Best,
> Gintas
>


Local parameter query and multiple fields

2016-11-04 Thread Gintautas Sulskus
Hi,

If I search for "London" with the following query, I get London city at the
top.

name:London^10
category:City^5
category:Organization^1

Now I would like to store this query in a SearchHandler with a parameter
$term instead of the hard-coded word "London". However, I am not sure how
the query should be constructed to get results identical to the query
above. The following query ignores the category search altogether:

{!qf="name^10" v=$term}
category:City^5
category:Organization^1

To add, what if I wanted to date-boost the category field (not the whole
query) if the matched $term is of type Organization?

{!qf="name^10" v=$term}
category:City^5
category:Organization^(date_boost)

Are you aware of a good book or a source on the Internet regarding query
construction as specified above?

Best,
Gintas
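[One possible construction, not from this thread: assuming edismax accepts
qf and bq as local params, the category boosts above could fold into a bq
so the whole query takes a single $term. A sketch, not a verified handler
config:

    q={!edismax qf="name^10" bq="category:City^5 category:Organization^1" v=$term}
    term=London
]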


Re: facet on dynamic field

2016-11-04 Thread Erik Hatcher
You'll have to enumerate them (see the Luke request handler) and specify them 
explicitly. 

> On Nov 4, 2016, at 03:40, Midas A  wrote:
> 
> I want to create a facet on all dynamic fields (by_*). What should the
> query be?
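[The enumerate-then-specify approach could be scripted; a minimal sketch in
Python (the by_ prefix matching and parameter assembly are mine; the Luke
handler path /admin/luke is standard, but the core name would be yours):

```python
from urllib.parse import urlencode

def facet_params(index_fields, prefix="by_"):
    # index_fields: field names as reported by the Luke request handler,
    # e.g. GET /solr/<core>/admin/luke?numTerms=0&wt=json -> keys of "fields"
    matched = sorted(f for f in index_fields if f.startswith(prefix))
    params = [("q", "*:*"), ("rows", "0"), ("facet", "true")]
    # one facet.field parameter per enumerated dynamic field
    params += [("facet.field", f) for f in matched]
    return params

# Example with field names a Luke response might list:
print(urlencode(facet_params(["by_color", "by_size", "id", "title"])))
```

The resulting query string can then be sent to /select as usual.]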


Sitecore deleting Solr documents

2016-11-04 Thread Joshua Campbell
Hi All,
I'm having an odd issue with Solr, and am looking for some help or suggestions.

We're using Solr (on a Sitecore website) for search and some search-driven 
pages. CM is pointing to a sitecore_master_index in Solr, while CD is pointing 
to a sitecore_web_index. We're using the OnPublishEndAsync strategy. Something 
really odd is happening.

When we publish content, it shows up on CD (and the correct Solr index), then, 
about 90 seconds later, it's gone.

If anybody has any insight into why this might happen, please let me know.


I've enabled all of the "Update" logs in Solr, and I don't see any records of
my item being deleted.  Please let me know if that's not the right place.


Thanks,

Josh


facet on dynamic field

2016-11-04 Thread Midas A
I want to create a facet on all dynamic fields (by_*). What should the
query be?