Re: Find groups where at least one item matches a query

2017-02-05 Thread Nick Vasilyev
Check out the group.limit argument.
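
For context, a grouped query along the lines Nick suggests could look like
this (a sketch; the field names come from the example quoted below, and
group.limit=10 is an arbitrary cap on how many documents come back per group):

q=pathology:Normal&group=true&group.field=groupId&group.limit=10

Note that grouping only returns the documents that match q. One alternative
that also pulls back the non-matching members of every matching group is a
self-join on the grouping field (the join parser has its own caveats in
distributed setups):

q={!join from=groupId to=groupId}pathology:Normal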

On Feb 5, 2017 12:10 PM, "Cristian Popovici" 
wrote:

> Erick, thanks for your answer.
>
> Sorry - I forgot to mention that I do not know the group id when I perform
> the query.
> Grouping - I think - does not help for me as it filters out the documents
> that do not meet the filter criteria.
>
> Example:
> *q=pathology:Normal&group=true&group.field=groupId* will miss the
> "pathology": "Metastasis" document.
>
> I need to retrieve both documents in the same group even if only one meets
> the search criteria.
>
> Thanks!
>
> On Sun, Feb 5, 2017 at 6:54 PM, Erick Erickson 
> wrote:
>
> > Isn't this just "fq=groupId:223"?
> >
> > Or do you mean you need multiple _groups_? In which case you can use
> > grouping, see:
> > https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
> > and/or
> > https://cwiki.apache.org/confluence/display/solr/Result+Grouping
> >
> > but do note there are some limitations in distributed mode.
> >
> > Best,
> > Erick
> >
> > On Sun, Feb 5, 2017 at 1:49 AM, Cristian Popovici
> >  wrote:
> > > Hi all,
> > >
> > > I'm new to Solr and I need a bit of help.
> > >
> > > I have a structure of documents indexed in Solr that are grouped
> together
> > > by a property. I need to retrieve all groups where at least one entry
> in
> > > the group matches a query.
> > >
> > > Example:
> > > I have two documents indexed and both share the *groupId *property that
> > > defines the grouping field.
> > >
> > > *{*
> > > *"groupId": "223",*
> > > *"modality": "Computed Tomography",*
> > > *"anatomy": "Subcutaneous fat",*
> > > *"pathology": "Metastasis",*
> > > *}*
> > >
> > > *{*
> > > *"groupId": "223",*
> > > *"modality": "Computed Tomography",*
> > > *"anatomy": "Subcutaneous fat",*
> > > *"pathology": "Normal",*
> > > *}*
> > >
> > > I need to retrieve both entries in the group when performing a query
> > like:
> > >
> > > *(pathology:Normal)*
> > > Is this possible in solr?
> > >
> > > Thanks!
> >
>


Re: Best python 3 client for solrcloud

2016-11-24 Thread Nick Vasilyev
I am a committer for https://github.com/moonlitesolutions/SolrClient.

I think it's pretty good; my aim with it is to provide several reusable
modules for working with Solr in Python. Not just querying, but also working
with collections, indexing, reindexing, etc.

Check it out and let me know what you think.
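
For anyone evaluating it, a minimal query looks something like this (a
sketch; the host, collection, and field names are assumptions):

from SolrClient import SolrClient

solr = SolrClient('http://localhost:8983/solr')
res = solr.query('my_collection', {'q': 'product_name:Lenovo', 'rows': 10})
print(res.get_results_count())  # total number of matching documents
for doc in res.docs:            # parsed result documents
    print(doc)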

On Nov 24, 2016 3:51 PM, "Dorian Hoxha"  wrote:

> Hi searchers,
>
> I see multiple clients for Solr in Python, but each one looks like it misses
> many features. What I need is for at least the low-level API to work with
> cloud (like retries on different nodes and nice exceptions). What is the
> best one that you use currently?
>
> Thank You!
>


Re: How to retrieve 200K documents from Solr 4.10.2

2016-10-12 Thread Nick Vasilyev
Check out cursorMark, it should be available in your release. There is some
good information on this page:

https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
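
A deep-paging loop with cursorMark might look like this (a sketch in Python;
it assumes the uniqueKey field is named id, and process() is a hypothetical
per-document handler):

import requests

params = {
    'q': '*:*',
    'fl': 'url',
    'rows': 1000,
    'sort': 'id asc',  # cursorMark requires a sort on the uniqueKey field
    'cursorMark': '*',
    'wt': 'json',
}
while True:
    resp = requests.get('http://SOLR_HOST/solr/abc/select',
                        params=params).json()
    for doc in resp['response']['docs']:
        process(doc)
    if resp['nextCursorMark'] == params['cursorMark']:
        break  # the cursor stopped advancing, so we are done
    params['cursorMark'] = resp['nextCursorMark']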


On Wed, Oct 12, 2016 at 5:46 PM, Salikeen, Obaid <
obaid.salik...@iacpublishinglabs.com> wrote:

> Hi,
>
> I am using Solr 4.10.2. I have 200K documents sitting on a Solr cluster (it
> has 3 nodes), and let me first state that I am new to Solr. I want to
> retrieve all documents from Solr (essentially just one field from each
> document).
>
> What is the best way of fetching this much data without overloading the Solr
> cluster?
>
>
> Approach I tried:
> I tried using the following API (running every minute) to fetch a batch of
> 1000 documents every minute. On each run, I advance the start parameter by
> 1000.
> http://SOLR_HOST/solr/abc/select?q=*:*&start=0&rows=1000&fl=url&wt=csv
>
> However, with the above approach, I have two issues:
>
> 1.   Solr cluster gets overloaded i.e it slows down
>
> 2.   I am not sure if start=X&rows=1000 would give me the correct
> results (changing rows=2 or rows=4 gives me totally different results,
> which is why I am not confident that I will get the correct results).
>
>
> Thanks
> Obaid
>
>


Re: Miserable Experience Using Solr. Again.

2016-09-15 Thread Nick Vasilyev
Just wanted to chime in on the technical set-up of the Solr "petting zoo",
I think I can help here; just let me know what you need.

Here is the idea: just have a Vagrant box with Ansible provisioning ZooKeeper
and Solr, creating collections, etc. That way anyone starting out can just
clone the repo, run 'vagrant up' and have a fully functional environment in
no time. Setting up Solr is not the hard part, and I think skipping it takes
a little something from the experience, but it could help someone get
started. Just send me an e-mail offline and let me know.

I do some work on an open source Solr python library and I use a similar
instance to run through unit tests on supported versions of python with
some of the latest versions of Solr; it works great and most of the work is
already done.


On Thu, Sep 15, 2016 at 2:39 PM, Shawn Heisey  wrote:

> On 9/15/2016 8:24 AM, Alexandre Rafalovitch wrote:
> > The WIKI may be an official community-contributing forum, but its
> > technological implementation has gotten so bad it is impossible to
> > update. Every time I change the page, it takes minutes (and feels like
> > hours) for the update to come through. No clue what to do about that
> > though.
>
> Interestingly, even though it takes several minutes for the change
> request to finish, the wiki actually updates almost immediately after
> pushing the button.  The page load (and the resulting email to the
> mailing list) just takes forever.  I discovered this by looking at the
> page in another tab while waiting for the page load to get done.
>
> As I understand it, MoinMoin is entirely filesystem-based, a typical
> config doesn't use a database.  Apache has a LOT of MoinMoin installs
> running on wiki.apache.org.  I think the performance woes are a case of
> a technology that's not scalable enough for how it's being used.
>
> > I feel that it would be cool to have a live tutorial. Perhaps a
> > special collection that, when bootstrapped from, provides tutorial,
> > supporting data, smart interface to play with that data against that
> > same instance, etc. It could also have a static read-only export, but
> > the default experience should be interactive ("bin/solr start -e
> > tutorial" or even "bin/solr start -e
> http://www.example.com/tutorial").
>
> That is an interesting idea.  I can envision a tutorial example, a
> canned source directory for indexing data into it, and a third volume of
> documentation, specifically for learning with that index.  It could
> include a section on changing the schema, reindexing, and seeing how
> those changes affect indexing and queries.
>
> > And it should be something that very strongly focuses on teaching new
> > users to fish, not just use the variety of seafood Solr comes with. A
> > narrative showing how different parts of Solr come together and how to
> > troubleshoot those, as opposed to taking each element (e.g. Query
> > Parser) individually and covering them super-comprehensively. That
> > last one is perfect in the reference guide, but less than friendly to
> > a beginner.
>
> Yes, yes, yes.
>
> Thanks,
> Shawn
>
>


Re: Discrepancy in json.facet unique and group.ngroups

2016-09-06 Thread Nick Vasilyev
Thanks Alexandre, that does sound related. I wouldn't have imagined the
discrepancy would be that large, but I also realized that related items
aren't co-located on the same shard. This may be why my grouped counts are
off.

I will do some manual verification of the counts.
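
For what it's worth, a quick way to eyeball the per-shard distinct counts is
to query each core directly with distrib=false (a sketch; the host and core
names are assumptions):

import requests

cores = ['collection_shard1_replica1', 'collection_shard2_replica1']
for core in cores:
    r = requests.get('http://localhost:8983/solr/%s/select' % core,
                     params={'q': '*:*', 'rows': 0, 'wt': 'json',
                             'distrib': 'false',
                             'json.facet': '{"mfr": "unique(mfr)"}'}).json()
    print(core, r['facets']['mfr'])  # distinct mfr values on that shard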

On Mon, Sep 5, 2016 at 12:22 PM, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> Perhaps https://issues.apache.org/jira/browse/SOLR-7452 ?
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 5 September 2016 at 23:07, Nick Vasilyev <nick.vasily...@gmail.com>
> wrote:
> > Hi, I need to get the number of distinct values of a field and I am
> getting
> > different counts between the json.facet interface and group.ngroups. Here
> > are the two queries:
> >
> > {'q': '*:*',
> >  'rows': 0,
> >  'json.facet': '{"mfr": "unique(mfr)"}'
> > })
> >
> > This brings back around 6,000 distinct values for the mfr field.
> >
> > However, if I run the following query, I get around 22,000:
> > {'q': '*:*',
> >  'rows': 0,
> >  'group': 'true',
> >  'group.ngroups': 'true',
> >  'group.field': 'mfr' }
> >
> > I am running solr 6.1.0 with 4 shards, I ran through some estimates and
> it
> > looks like each shard has around 6k manufacturers. Does anyone have any
> > ideas why this is happening?
> >
> > Thanks
>


Discrepancy in json.facet unique and group.ngroups

2016-09-05 Thread Nick Vasilyev
Hi, I need to get the number of distinct values of a field and I am getting
different counts between the json.facet interface and group.ngroups. Here
are the two queries:

{'q': '*:*',
 'rows': 0,
 'json.facet': '{"mfr": "unique(mfr)"}'
})

This brings back around 6,000 distinct values for the mfr field.

However, if I run the following query, I get around 22,000:
{'q': '*:*',
 'rows': 0,
 'group': 'true',
 'group.ngroups': 'true',
 'group.field': 'mfr' }

I am running solr 6.1.0 with 4 shards, I ran through some estimates and it
looks like each shard has around 6k manufacturers. Does anyone have any
ideas why this is happening?

Thanks


Re: How to re-index SOLR data

2016-08-09 Thread Nick Vasilyev
Hi, I work on a Python Solr Client library and there is a reindexing helper
module that you can use if you are on Solr 4.9+. I use it all the time and I
think it works pretty well. You can re-index all documents from a collection
into another collection or dump them to the filesystem as JSON. It also
supports parallel execution and can run independently on each shard. There is
also a way to resume if your job craps out halfway through, provided your
existing schema is set up with a good date field and unique id.

You can read the documentation here:
http://solrclient.readthedocs.io/en/latest/Reindexer.html

Code is pretty short and is here:
https://github.com/moonlitesolutions/SolrClient/blob/master/SolrClient/helpers/reindexer.py

Here is a sample:

from SolrClient import SolrClient
from SolrClient.helpers import Reindexer

r = Reindexer(SolrClient('http://source_solr:8983/solr'),
              SolrClient('http://destination_solr:8983/solr'),
              source_coll='source_collection',
              dest_coll='destination-collection')
r.reindex()

On Tue, Aug 9, 2016 at 9:56 AM, Shawn Heisey  wrote:

> On 8/9/2016 1:48 AM, bharath.mvkumar wrote:
> > What would be the best way to re-index the data in the SOLR cloud? We
> > have around 65 million data and we are planning to change the schema
> > by changing the unique key type from long to string. How long does it
> > take to re-index 65 million documents in SOLR and can you please
> > suggest how to do that?
>
> There is no magic bullet.  And there's no way for anybody but you to
> determine how long it's going to take.  There are people who have
> achieved over 50K inserts per second, and others who have difficulty
> reaching 1000 per second.  Many factors affect indexing speed, including
> the size of your documents, the complexity of your analysis, the
> capabilities of your hardware, and how many threads/processes you are
> using at the same time when you index.
>
> Here's some more detailed info about reindexing, but it's probably not
> what you wanted to hear:
>
> https://wiki.apache.org/solr/HowToReindex
>
> Thanks,
> Shawn
>
>


Re: Solr Rounding Issue On Float fields.

2016-07-21 Thread Nick Vasilyev
Thanks Chris.

Searching for both values and retrieving the documents would be alright as
long as the data was correct. In this case, the data that I am indexing
into Solr is not the same data that I am pulling out at query time. That is
the real impact here.

On Thu, Jul 21, 2016 at 6:12 PM, Chris Hostetter 
wrote:

>
> : Hi, I am running into a weird rounding issue on Solr 5.2.1. I have a
> float
> : field (also tried tfloat), I am indexing 154035.26 into it (confirmed in
> : the data),  but at query time, I get back 154035.27 (.01 more).
> : Additionally when I query for the document and include this number in
> the q
> : parameter, it comes up with both values, .26 and .27.
>
> Pretty sure what you are observing is just the normal consequences of IEEE
> floats (as used by java) being base2 -- not every base10 decimal value
> has a precise base2 representation.
>
> Querying for 154035.27 and 154035.26 will both match the same docs, because
> the String->Float parsing in both cases will produce the closest *legal*
> float value, which is identical for both inputs.
>
> If you need precise decimal values in Solr, you need to either use 2
> ints/longs (ie num_base="154035", num_decimal="26") or use one int/long
> and multiply/divide by a power of 10 corresponding to the number of
> significant digits you want in the client (ie: "15403526" divided by 100)
>
>
> Some good reading linked to from here...
>
> http://perlmonks.org/?node_id=203257
>
> And of course, if you really want to bang Java against your head,
> this is a classic (all of which is still applicable, I believe) ...
>
> https://people.eecs.berkeley.edu/~wkahan/JAVAhurt.pdf
>
>
>
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: Solr Rounding Issue On Float fields.

2016-07-21 Thread Nick Vasilyev
I did a bit more investigating here is something that may help
troubleshooting:

- It seems that numbers above 131071 are impacted: 131071.26 is fine,
but 131072.26 is not. 131071 is 2^17 - 1, a Mersenne prime.

- 131072.24 gets rounded down to 131072.23, while 131072.26 gets rounded up
to 131072.27. Similarly, 131072.76 gets rounded up to 131072.77
and 131072.74 gets rounded down to 131072.73. 131072.49 gets rounded down
to 131072.48 and 131072.51 gets rounded up to 131072.52.

I haven't validated this code, just doing some manual digging.
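
For anyone who wants to reproduce this outside Solr, the sketch below
round-trips values through an IEEE-754 binary32, which is what a Solr
float/tfloat field stores. 131072 is 2^17; from there up to 2^18 the spacing
between adjacent 32-bit floats is 1/64 (0.015625), i.e. wider than 0.01, so
two decimal places can no longer survive the round trip:

import struct

def as_float32(x):
    # pack to a 4-byte IEEE-754 float and unpack it back into a double
    return struct.unpack('f', struct.pack('f', x))[0]

for x in (131071.26, 131072.26, 154035.26):
    print(x, '->', as_float32(x))
# 131071.26 -> 131071.2578125  (still rounds to .26 at two decimals)
# 131072.26 -> 131072.265625   (displays as .27)
# 154035.26 -> 154035.265625   (displays as .27)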



On Thu, Jul 21, 2016 at 1:48 PM, Nick Vasilyev <nick.vasily...@gmail.com>
wrote:

> Hi, I am running into a weird rounding issue on Solr 5.2.1. I have a float
> field (also tried tfloat), I am indexing 154035.26 into it (confirmed in
> the data),  but at query time, I get back 154035.27 (.01 more).
> Additionally when I query for the document and include this number in the q
> parameter, it comes up with both values, .26 and .27.
>
> I've fed the values through the analyzer and I get this bizarre behavior
> per the screenshot below. The field is a single value float or tfloat
> field.
>
> Any help would be much appreciated, thanks in advance
>
> [image: Inline image 1]
>


Solr Rounding Issue On Float fields.

2016-07-21 Thread Nick Vasilyev
Hi, I am running into a weird rounding issue on Solr 5.2.1. I have a float
field (also tried tfloat), I am indexing 154035.26 into it (confirmed in
the data),  but at query time, I get back 154035.27 (.01 more).
Additionally when I query for the document and include this number in the q
parameter, it comes up with both values, .26 and .27.

I've fed the values through the analyzer and I get this bizarre behavior
per the screenshot below. The field is a single value float or tfloat
field.

Any help would be much appreciated, thanks in advance

[image: Inline image 1]


Re: Use of solr + banana for faceted search

2016-07-21 Thread Nick Vasilyev
Not that I know of, but it is an open source project so it's easy to extend.

On Jul 21, 2016 11:01 AM, "Darshan Pandya" <darshanpan...@gmail.com> wrote:

> Thanks Nick, once again.
> I was able to use Facet panel.
>
> I also wanted to ask the group if there is a repository of custom panels
> for Banana which we can benefit from ?
>
> Sincerely,
> Darshan
>
> On Wed, Jul 20, 2016 at 11:55 AM, Darshan Pandya <darshanpan...@gmail.com>
> wrote:
>
> > Nick, Thanks for your help. I'll test it out and respond back.
> >
> > On Wed, Jul 20, 2016 at 11:52 AM, Nick Vasilyev <
> nick.vasily...@gmail.com>
> > wrote:
> >
> >> Banana has a facet panel that allows you to configure several fields to
> >> facet on; you can have multiple fields and they will show up as an
> >> accordion. However, keep in mind that the field needs to be untokenized
> >> for faceting (i.e. a string field), and upon selection the filter is
> >> added to the fq parameter in the Solr query. Let me know if that helps.
> >>
> >> On Wed, Jul 20, 2016 at 12:40 PM, Darshan Pandya <
> darshanpan...@gmail.com
> >> >
> >> wrote:
> >>
> >> > Hello folks,
> >> >
> >> > I am fairly new to solr + banana, especially banana.
> >> >
> >> >
> >> > I am trying to configure banana for faceted search for a collection in
> >> > solr.
> >> > I want to be able to have multiple facets parameters on the left and
> see
> >> > the results of selections on my data table on the right. Exactly like
> >> > guided Nav.
> >> >
> >> > Please let me know if anyone has done this and/or if there is a
> tutorial
> >> > for this.
> >> >
> >> > --
> >> > Sincerely,
> >> > Darshan
> >> >
> >>
> >
> >
> >
> > --
> > Sincerely,
> > Darshan
> >
> >
>
>
> --
> Sincerely,
> Darshan
>


Re: Use of solr + banana for faceted search

2016-07-20 Thread Nick Vasilyev
Banana has a facet panel that allows you to configure several fields to
facet on; you can have multiple fields and they will show up as an
accordion. However, keep in mind that the field needs to be untokenized for
faceting (i.e. a string field), and upon selection the filter is added to the
fq parameter in the Solr query. Let me know if that helps.

On Wed, Jul 20, 2016 at 12:40 PM, Darshan Pandya 
wrote:

> Hello folks,
>
> I am fairly new to solr + banana, especially banana.
>
>
> I am trying to configure banana for faceted search for a collection in
> solr.
> I want to be able to have multiple facets parameters on the left and see
> the results of selections on my data table on the right. Exactly like
> guided Nav.
>
> Please let me know if anyone has done this and/or if there is a tutorial
> for this.
>
> --
> Sincerely,
> Darshan
>


Re: Solr 5.5.2

2016-05-26 Thread Nick Vasilyev
Thanks Erick, option 4 is my favorite so far :)

On Thu, May 26, 2016 at 2:15 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> There is no plan to release 5.5.2, development has moved to trunk and
> 6.x. Also, while there
> is a patch for that JIRA it hasn't been committed even in trunk/6.0.
>
> So I think your choices are:
> 1> find a work-around
> 2> see about moving to Solr 6.0.1 (in release process now),
> assuming that it solves the problem.
> 3> See if the patch supplied with SOLR-8940 works for you and compile
> it locally.
> 4> agitate for a 5.5.2 that includes this fix (after the fix has been
> vetted).
>
> Best,
> Erick
>
> On Thu, May 26, 2016 at 11:08 AM, Nick Vasilyev
> <nick.vasily...@gmail.com> wrote:
> > Is there an anticipated release date for 5.5.2? I know 5.5.1 was just
> > released a while ago and although it fixes the faceting performance
> > (SOLR-8096), distributed grouping is broken (SOLR-8940).
> >
> > I just need a solid 5.x release that is stable and with all core
> > functionality working.
> >
> > Thanks
>


Solr 5.5.2

2016-05-26 Thread Nick Vasilyev
Is there an anticipated release date for 5.5.2? I know 5.5.1 was just
released a while ago and although it fixes the faceting performance
(SOLR-8096), distributed grouping is broken (SOLR-8940).

I just need a solid 5.x release that is stable and with all core
functionality working.

Thanks


Re: API call for optimising a collection

2016-05-17 Thread Nick Vasilyev
As far as I know, you have to run it on each core.
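
For reference, issuing it per core could look like this (a sketch; the host
and core names are assumptions, and maxSegments is optional):

import requests

cores = ['collection1_shard1_replica1', 'collection1_shard2_replica1']
for core in cores:
    # the update handler accepts optimize=true as a URL parameter
    requests.get('http://localhost:8983/solr/%s/update' % core,
                 params={'optimize': 'true', 'maxSegments': 1})
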
On May 18, 2016 1:04 AM, "Binoy Dalal"  wrote:

> Is there no API call that can optimize an entire collection?
>
> I tried the Collections API page on the Confluence wiki but couldn't find
> anything, and a Google search also yielded no meaningful results.
> --
> Regards,
> Binoy Dalal
>


Re: json.facet streaming

2016-05-17 Thread Nick Vasilyev
Got it. Thanks for clarifying.

On Tue, May 17, 2016 at 9:58 AM, Yonik Seeley <ysee...@gmail.com> wrote:

> On Tue, May 17, 2016 at 9:41 AM, Nick Vasilyev <nick.vasily...@gmail.com>
> wrote:
> > Hi Yonik, I do see them in the response, but the JSON format is like
> > standard facet output. I am not sure what streaming facet response would
> > look like, but I expected it to be similar to the streaming API. Is this
> > the case?
>
> Nope.
> The method is an execution hint (calculate the facets via this
> method), and should not normally affect what the response looks like.
>
> -Yonik
>


Re: json.facet streaming

2016-05-17 Thread Nick Vasilyev
Hi Yonik, I do see them in the response, but the JSON format is like
standard facet output. I am not sure what streaming facet response would
look like, but I expected it to be similar to the streaming API. Is this
the case?

On Tue, May 17, 2016 at 9:35 AM, Yonik Seeley <ysee...@gmail.com> wrote:

> So it looks like facets are being computed... do you not see them in
> the response?
> -Yonik
>
>
> On Tue, May 17, 2016 at 9:12 AM, Nick Vasilyev <nick.vasily...@gmail.com>
> wrote:
> > I enabled query debugging, here is the facet-trace snippet.
> >
> > "facet-trace":{
> >   "processor":"FacetQueryProcessor",
> >   "elapse":0,
> >   "query":null,
> >   "domainSize":43046041,
> >   "sub-facet":[{
> >   "processor":"FacetFieldProcessorStream",
> >   "elapse":0,
> >   "field":"group",
> >   "limit":10,
> >   "domainSize":8980542},
> > {
> >   "processor":"FacetFieldProcessorStream",
> >   "elapse":0,
> >   "field":"group",
> >   "limit":10,
> >   "domainSize":9005295},
> > {
> >   "processor":"FacetFieldProcessorStream",
> >   "elapse":0,
> >   "field":"group",
> >   "limit":10,
> >   "domainSize":7555021},
> > {
> >   "processor":"FacetFieldProcessorStream",
> >   "elapse":0,
> >   "field":"group",
> >   "limit":10,
> >   "domainSize":8928379},
> > {
> >   "processor":"FacetFieldProcessorStream",
> >   "elapse":0,
> >   "field":"group",
> >   "limit":10,
> >   "domainSize":8576804}]},
> > "json":{"facet":{"groups":{
> >   "type":"terms",
> >   "field":"group",
> >   "method":"stream"}}},
> >
> > On Tue, May 17, 2016 at 8:42 AM, Yonik Seeley <ysee...@gmail.com> wrote:
> >
> >> Perhaps try turning on request debugging and see what is actually
> >> being received by Solr?
> >>
> >> -Yonik
> >>
> >>
> >> On Tue, May 17, 2016 at 8:33 AM, Nick Vasilyev <
> nick.vasily...@gmail.com>
> >> wrote:
> >> > I am on the nightly build of 6.1 and I am experimenting with
> json.facet
> >> > streaming, however the response I am getting back looks like regular
> >> query
> >> > response. I was expecting something like the streaming api. Is this
> right
> >> > or am I missing something?
> >> >
> >> > Here is the json.facet string.
> >> >
> >> > 'json.facet':str({ "groups":{
> >> > "type": "terms",
> >> > "field": "group",
> >> > "method":"stream"
> >> > }}),
> >> >
> >> > The group field is a string field with DocValues enabled.
> >> >
> >> > Thanks
> >>
>


Re: json.facet streaming

2016-05-17 Thread Nick Vasilyev
I enabled query debugging, here is the facet-trace snippet.

"facet-trace":{
  "processor":"FacetQueryProcessor",
  "elapse":0,
  "query":null,
  "domainSize":43046041,
  "sub-facet":[{
  "processor":"FacetFieldProcessorStream",
  "elapse":0,
  "field":"group",
  "limit":10,
  "domainSize":8980542},
{
  "processor":"FacetFieldProcessorStream",
  "elapse":0,
  "field":"group",
  "limit":10,
  "domainSize":9005295},
{
  "processor":"FacetFieldProcessorStream",
  "elapse":0,
  "field":"group",
  "limit":10,
  "domainSize":7555021},
{
  "processor":"FacetFieldProcessorStream",
  "elapse":0,
  "field":"group",
  "limit":10,
  "domainSize":8928379},
{
  "processor":"FacetFieldProcessorStream",
  "elapse":0,
  "field":"group",
  "limit":10,
  "domainSize":8576804}]},
"json":{"facet":{"groups":{
  "type":"terms",
  "field":"group",
  "method":"stream"}}},

On Tue, May 17, 2016 at 8:42 AM, Yonik Seeley <ysee...@gmail.com> wrote:

> Perhaps try turning on request debugging and see what is actually
> being received by Solr?
>
> -Yonik
>
>
> On Tue, May 17, 2016 at 8:33 AM, Nick Vasilyev <nick.vasily...@gmail.com>
> wrote:
> > I am on the nightly build of 6.1 and I am experimenting with json.facet
> > streaming, however the response I am getting back looks like regular
> query
> > response. I was expecting something like the streaming api. Is this right
> > or am I missing something?
> >
> > Here is the json.facet string.
> >
> > 'json.facet':str({ "groups":{
> > "type": "terms",
> > "field": "group",
> > "method":"stream"
> > }}),
> >
> > The group field is a string field with DocValues enabled.
> >
> > Thanks
>


json.facet streaming

2016-05-17 Thread Nick Vasilyev
I am on the nightly build of 6.1 and I am experimenting with json.facet
streaming; however, the response I am getting back looks like a regular query
response. I was expecting something like the streaming API. Is this right,
or am I missing something?

Here is the json.facet string.

'json.facet':str({ "groups":{
"type": "terms",
"field": "group",
"method":"stream"
}}),

The group field is a string field with DocValues enabled.
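
For completeness, the full request looks something like this (a sketch; the
host and collection name are assumptions, and json.dumps is used instead of
str() so the parameter is strict JSON rather than relying on Solr's lenient
parser):

import json
import requests

params = {
    'q': '*:*',
    'rows': 0,
    'wt': 'json',
    'json.facet': json.dumps({'groups': {'type': 'terms',
                                         'field': 'group',
                                         'method': 'stream'}}),
}
resp = requests.get('http://localhost:8983/solr/collection1/select',
                    params=params).json()
print(resp['facets'])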

Thanks


Re: Re-indexing in SolRCloud while keeping the collection online -- Best practice?

2016-05-11 Thread Nick Vasilyev
Aliasing works great, I implemented it after upgrading to Solr 5 and it
allows us to do this exact thing. The only thing you have to watch out for
is indexing new items (if they overwrite old ones) while you are
re-indexing.

I took it a step further for another collection that stores a lot of
time-based data from logs. I have two aliases for that collection, logs and
logs_indexing. Every month a new collection gets created, called logs_201605
or something like that, and both aliases get updated: logs_indexing now
points only to the newest collection, since that's where all the indexing is
going, and the logs alias gets updated to include the new collection as well
(since aliases can point to multiple collections).

Here is the link to the documentation.
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api4
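
To make the monthly rollover concrete, the Collections API calls could look
like this (a sketch; the collection and alias names follow the example above,
while numShards and the config name are assumptions):

import requests

base = 'http://localhost:8983/solr/admin/collections'
# 1. create next month's collection
requests.get(base, params={'action': 'CREATE', 'name': 'logs_201606',
                           'numShards': 4,
                           'collection.configName': 'logs'})
# 2. repoint the indexing alias at the new collection only
requests.get(base, params={'action': 'CREATEALIAS', 'name': 'logs_indexing',
                           'collections': 'logs_201606'})
# 3. repoint the query alias at all collections, old and new
requests.get(base, params={'action': 'CREATEALIAS', 'name': 'logs',
                           'collections': 'logs_201605,logs_201606'})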

On Tue, May 10, 2016 at 12:55 PM, Horváth Péter Gergely <
peter.gergely.horv...@gmail.com> wrote:

> Hi Erick,
>
> Most of the time we have to do a full re-index: I do love your second idea,
> I will take a look at the details of that. Thank you! :)
>
> Cheers,
> Peter
>
> 2016-05-10 17:10 GMT+02:00 Erick Erickson :
>
> > Peter:
> >
> > Yeah, that would work, but there are a couple of alternatives:
> > 1> If there's any way to know what the subset of docs that's
> >  changed, just re-index _them_. The problem here is
> >  picking up deletes. In the RDBMS case this is often done
> >  by creating a trigger for deletes and then the last step
> >  in your update is to remove the docs since the last time
> >  you indexed using the deleted_docs table (or whatever).
> >  This falls down if a> you require an instantaneous switch
> >  from _all_ the old data to the new or b> you can't get a
> >  list of deleted docs.
> >
> > 2> Use collection aliasing. The pattern is this: you have your
> >  "Hot" collection (col1) serving queries that is pointed to
> >  by alias "hot". You create a new collection (col2) and index
> >  to it in the background. When done, use CREATEALIAS
> >  to point "hot" to "col2". Now you can delete col1. There are
> >  no restrictions on where these collections live, so this
> >  allows you to move your collections around as you want. Plus
> >  this keeps a better separation of old and new data...
> >
> > Best,
> > Erick
> >
> > On Tue, May 10, 2016 at 4:32 AM, Horváth Péter Gergely
> >  wrote:
> > > Hi Everyone,
> > >
> > > I am wondering if there is any best practice regarding re-indexing
> > > documents in SolrCloud 6.0.0 without making the data (or the underlying
> > > collection) temporarily unavailable. Wiping all documents in a
> collection
> > > and performing a full re-indexing is not a viable alternative for us.
> > >
> > > Say we had a massive Solr Cloud cluster with a number of separate nodes
> > > that are used to host *multiple hundreds* of collections, with document
> > > counts ranging from a couple of thousands to multiple (say up to 20)
> > > millions of documents, each with 200-300 fields and a background batch
> > > loader job that fetches data from a variety of source systems.
> > >
> > > We have to retain the cluster and ALL collections online all the time
> > (365
> > > x 24): We cannot allow queries to be blocked while data in a collection
> > is
> > > being updated and we cannot load everything in a single-shot jumbo
> commit
> > > (the replication could overload the cluster).
> > >
> > > One solution I could imagine is storing an additional field "load
> > > time-stamp" in all documents and the client (interactive query)
> > application
> > > extending all queries with an additional restriction, which requires
> > > documents "load time-stamp" to be the latest known completed "load
> > > time-stamp".
> > >
> > > This concept would work according to the following:
> > > 1.) The batch job would simply start loading new documents, with the
> new
> > > "load time-stamp". Existing documents would not be touched.
> > > 2.) The client (interactive query) application would still use the old
> > data
> > > from the previous load (since all queries are restricted with the old
> > "load
> > > time-stamp")
> > > 3.) The batch job would store the new "load time-stamp" as the one to
> be
> > > used (e.g. in a separate collection etc.) -- after this, all queries
> > would
> > > return the most up-to-data documents
> > > 4.) The batch job would purge all documents from the collection, where
> > > the "load time-stamp" is not the same as the last one.
> > >
> > > This approach seems to be implementable, however, I definitely want to
> > > avoid reinventing the wheel myself and wondering if there is any better
> > > solution or built-in Solr Cloud feature to achieve the same or
> something
> > > similar.
> > >
> > > Thanks,
> > > Peter
> >
>


Re: Filtering on nGroups

2016-05-06 Thread Nick Vasilyev
I guess it would also work if I could facet on the group counts. I just
need to know how many groups of different sizes there are.
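
One way to get that with json.facet is to pull every per-group count and
tally the sizes client-side (a sketch; the host and collection are
assumptions, and limit=-1 over millions of groups is only reasonable for a
one-off job like this):

import json
from collections import Counter

import requests

params = {
    'q': '*:*',
    'rows': 0,
    'wt': 'json',
    # limit=-1 returns every bucket: one {val, count} entry per group
    'json.facet': json.dumps({'groups': {'type': 'terms',
                                         'field': 'group',
                                         'limit': -1}}),
}
resp = requests.get('http://localhost:8983/solr/collection1/select',
                    params=params).json()
sizes = Counter(b['count'] for b in resp['facets']['groups']['buckets'])
print(sizes)  # maps group size -> number of groups with that size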

On Fri, May 6, 2016 at 2:10 PM, Nick Vasilyev <nick.vasily...@gmail.com>
wrote:

> I am on the 6.1 preview; I just need this to gather some one-time metrics,
> so performance isn't an issue.
> On May 6, 2016 1:13 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>
> What version of Solr? Regardless, if you can pre-process
> at index time it'll be faster than anything else (probably).
>
> pre-processing isn't very dynamic though so there are lots
> of situations where that's just not viable.
>
> Best,
> Erick
>
> On Thu, May 5, 2016 at 6:05 PM, Nick Vasilyev <nick.vasily...@gmail.com>
> wrote:
> > I am grouping documents on a field and would like to retrieve documents
> > where the number of items in a group matches a specific value or a range.
> >
> > I haven't been able to experiment with all new functionality, but I
> wanted
> > to see if this is possible without having to calculate the count and add
> it
> > at index time as a field.
> >
> > Does anyone have any ideas?
> >
> > Thanks in advance
>
>


Re: Filtering on nGroups

2016-05-06 Thread Nick Vasilyev
I am on the 6.1 preview; I just need this to gather some one-time metrics,
so performance isn't an issue.
On May 6, 2016 1:13 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:

What version of Solr? Regardless, if you can pre-process
at index time it'll be faster than anything else (probably).

pre-processing isn't very dynamic though so there are lots
of situations where that's just not viable.

Best,
Erick

On Thu, May 5, 2016 at 6:05 PM, Nick Vasilyev <nick.vasily...@gmail.com>
wrote:
> I am grouping documents on a field and would like to retrieve documents
> where the number of items in a group matches a specific value or a range.
>
> I haven't been able to experiment with all new functionality, but I wanted
> to see if this is possible without having to calculate the count and add
it
> at index time as a field.
>
> Does anyone have any ideas?
>
> Thanks in advance


Filtering on nGroups

2016-05-05 Thread Nick Vasilyev
I am grouping documents on a field and would like to retrieve documents
where the number of items in a group matches a specific value or a range.

I haven't been able to experiment with all new functionality, but I wanted
to see if this is possible without having to calculate the count and add it
at index time as a field.

Does anyone have any ideas?

Thanks in advance


Re: Solr cloud 6.0.0 with ZooKeeper 3.4.8 Errors

2016-05-05 Thread Nick Vasilyev
Just out of curiosity, are you sharing the ZooKeepers between the
different versions of Solr? If so, are you specifying a ZooKeeper chroot?
On May 5, 2016 2:05 PM, "Susheel Kumar" <susheel2...@gmail.com> wrote:

> Nick, Hoss -  Things are back to normal with ZK 3.4.8 and Solr 6.0.0.  I
> switched to Solr 5.5.0 with ZK 3.4.8, which worked fine, and then installed
> 6.0.0.  I suspect (not 100% sure) I left ZK dataDir / Solr collection
> directory data from the previous ZK/Solr version, which was probably leaving
> Solr 6 in an unstable state.
>
> Thanks,
> Susheel
>
> On Wed, May 4, 2016 at 9:56 PM, Susheel Kumar <susheel2...@gmail.com>
> wrote:
>
> > Thanks, Nick & Hoss.  I am using the exact same machine, have wiped out
> > Solr 5.5.0 and installed Solr 6.0.0 with external ZK 3.4.8.  I checked the
> > file descriptor limit for user solr, which was 12000, and increased it to
> > 52000. I don't see the "too many files open..." error now in the Solr log,
> > but the Solr connection is still getting lost in the Admin panel.
> >
> > Let me do some more tests and install older version back to confirm and
> > will share the findings.
> >
> > Thanks,
> > Susheel
> >
> > On Wed, May 4, 2016 at 8:11 PM, Chris Hostetter <
> hossman_luc...@fucit.org>
> > wrote:
> >
> >>
> >> : Thanks, Nick. Do we know any suggested # for file descriptor limit
> with
> >> : Solr6?  Also wondering why i haven't seen this problem before with
> Solr
> >> 5.x?
> >>
> >> are you running Solr6 on the exact same host OS that you were running
> >> Solr5 on?
> >>
> >> even if you are using the "same OS version" on a diff machine, that
> could
> >> explain the discrepency if you (or someone else) increased the file
> >> descriptor limit on the "old machine" but that neverh appened on the
> 'new
> >> machine"
> >>
> >>
> >>
> >> : On Wed, May 4, 2016 at 4:54 PM, Nick Vasilyev <
> nick.vasily...@gmail.com
> >> >
> >> : wrote:
> >> :
> >> : > It looks like you have too many open files, try increasing the file
> >> : > descriptor limit.
> >> : >
> >> : > On Wed, May 4, 2016 at 3:48 PM, Susheel Kumar <
> susheel2...@gmail.com>
> >> : > wrote:
> >> : >
> >> : > > Hello,
> >> : > >
> >> : > > I am trying to set up a 2-node SolrCloud 6 cluster with ZK 3.4.8
> >> : > > and used the install service to set up Solr.
> >> : > >
> >> : > > After launching Solr Admin Panel on server1, it loses connections
> >> : > > in a few seconds and then comes back, and the other node server2
> >> : > > is marked as Down in the cloud graph. After a few seconds it loses
> >> : > > the connection and comes back.
> >> : > >
> >> : > > Any idea what may be going wrong? Has anyone used Solr 6 with ZK
> >> : > > 3.4.8? I have never seen this error before with Solr 5.x with ZK
> >> : > > 3.4.6.
> >> : > >
> >> : > > Below log from server1 & server2.  The ZK has 3 nodes with chroot
> >> : > enabled.
> >> : > >
> >> : > > Thanks,
> >> : > > Susheel
> >> : > >
> >> : > > server1/solr.log
> >> : > >
> >> : > > 
> >> : > >
> >> : > >
> >> : > > 2016-05-04 19:20:53.804 INFO  (qtp1989972246-14) [   ]
> >> : > > o.a.s.c.c.ZkStateReader path=[/collections/collection1]
> >> : > > [configName]=[collection1] specified config exists in ZooKeeper
> >> : > >
> >> : > > 2016-05-04 19:20:53.806 INFO  (qtp1989972246-14) [   ]
> >> : > o.a.s.s.HttpSolrCall
> >> : > > [admin] webapp=null path=/admin/collections
> >> : > > params={action=CLUSTERSTATUS&wt=json&_=1462389588125} status=0
> >> QTime=25
> >> : > >
> >> : > > 2016-05-04 19:20:53.859 INFO  (qtp1989972246-19) [   ]
> >> : > > o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with
> >> params
> >> : > > action=LIST&wt=json&_=1462389588125 and sendToOCPQueue=true
> >> : > >
> >> : > > 2016-05-04 19:20:53.861 INFO  (qtp1989972246-19) [   ]
> >> : > o.a.s.s.HttpSolrCall
> >> : > > [admin] webapp=null path=/admin/collections

Re: Solr cloud 6.0.0 with ZooKeeper 3.4.8 Errors

2016-05-04 Thread Nick Vasilyev
Not sure about your environment, so it's hard to say why you haven't run
into this issue before.

As for the suggested limit, I am not sure; it would depend on your system
and on whether you really want to limit it. I personally just jack it up to 5.

On Wed, May 4, 2016 at 6:13 PM, Susheel Kumar <susheel2...@gmail.com> wrote:

> Thanks, Nick. Do we know any suggested # for file descriptor limit with
> Solr6?  Also wondering why i haven't seen this problem before with Solr
> 5.x?
>
> On Wed, May 4, 2016 at 4:54 PM, Nick Vasilyev <nick.vasily...@gmail.com>
> wrote:
>
> > It looks like you have too many open files, try increasing the file
> > descriptor limit.
> >
> > On Wed, May 4, 2016 at 3:48 PM, Susheel Kumar <susheel2...@gmail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > I am trying to set up a 2-node SolrCloud 6 cluster with ZK 3.4.8 and
> > > used the install service to set up Solr.
> > >
> > > After launching Solr Admin Panel on server1, it loses connections in a
> > > few seconds and then comes back, and the other node server2 is marked as
> > > Down in the cloud graph. After a few seconds it loses the connection and
> > > comes back.
> > >
> > > Any idea what may be going wrong? Has anyone used Solr 6 with ZK 3.4.8?
> > > I have never seen this error before with Solr 5.x with ZK 3.4.6.
> > >
> > > Below log from server1 & server2.  The ZK has 3 nodes with chroot
> > enabled.
> > >
> > > Thanks,
> > > Susheel
> > >
> > > server1/solr.log
> > >
> > > 
> > >
> > >
> > > 2016-05-04 19:20:53.804 INFO  (qtp1989972246-14) [   ]
> > > o.a.s.c.c.ZkStateReader path=[/collections/collection1]
> > > [configName]=[collection1] specified config exists in ZooKeeper
> > >
> > > 2016-05-04 19:20:53.806 INFO  (qtp1989972246-14) [   ]
> > o.a.s.s.HttpSolrCall
> > > [admin] webapp=null path=/admin/collections
> > > params={action=CLUSTERSTATUS&wt=json&_=1462389588125} status=0 QTime=25
> > >
> > > 2016-05-04 19:20:53.859 INFO  (qtp1989972246-19) [   ]
> > > o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with
> params
> > > action=LIST&wt=json&_=1462389588125 and sendToOCPQueue=true
> > >
> > > 2016-05-04 19:20:53.861 INFO  (qtp1989972246-19) [   ]
> > o.a.s.s.HttpSolrCall
> > > [admin] webapp=null path=/admin/collections
> > > params={action=LIST&wt=json&_=1462389588125} status=0 QTime=2
> > >
> > > 2016-05-04 19:20:57.520 INFO  (qtp1989972246-13) [   ]
> > o.a.s.s.HttpSolrCall
> > > [admin] webapp=null path=/admin/cores
> > > params={indexInfo=false&wt=json&_=1462389588124} status=0 QTime=0
> > >
> > > 2016-05-04 19:20:57.546 INFO  (qtp1989972246-15) [   ]
> > o.a.s.s.HttpSolrCall
> > > [admin] webapp=null path=/admin/info/system
> > > params={wt=json&_=1462389588126} status=0 QTime=25
> > >
> > > 2016-05-04 19:20:57.610 INFO  (qtp1989972246-13) [   ]
> > > o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with
> params
> > > action=LIST&wt=json&_=1462389588125 and sendToOCPQueue=true
> > >
> > > 2016-05-04 19:20:57.613 INFO  (qtp1989972246-13) [   ]
> > o.a.s.s.HttpSolrCall
> > > [admin] webapp=null path=/admin/collections
> > > params={action=LIST&wt=json&_=1462389588125} status=0 QTime=3
> > >
> > > 2016-05-04 19:21:29.139 INFO  (qtp1989972246-5980) [   ]
> > > o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException)
> > caught
> > > when connecting to {}->http://server2:8983: Too many open files
> > >
> > > 2016-05-04 19:21:29.139 INFO  (qtp1989972246-5983) [   ]
> > > o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException)
> > caught
> > > when connecting to {}->http://server2:8983: Too many open files
> > >
> > > 2016-05-04 19:21:29.139 INFO  (qtp1989972246-5984) [   ]
> > > o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException)
> > caught
> > > when connecting to {}->http://server2:8983: Too many open files
> > >
> > > 2016-05-04 19:21:29.141 INFO  (qtp1989972246-5984) [   ]
> > > o.a.h.i.c.DefaultHttpClient Retrying connect to {}->
> http://server2:8983
> > >
> > > 2016-05-04 19:21:29.141 INFO  (qtp1989972246-5984) [   ]
> > > o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException)
> > caught
> > > when connecting to {}->

Re: Solr cloud 6.0.0 with ZooKeeper 3.4.8 Errors

2016-05-04 Thread Nick Vasilyev
It looks like you have too many open files, try increasing the file
descriptor limit.

On Wed, May 4, 2016 at 3:48 PM, Susheel Kumar  wrote:

> Hello,
>
> I am trying to set up a 2-node SolrCloud 6 cluster with ZK 3.4.8 and used the
> install service to set up Solr.
>
> After launching Solr Admin Panel on server1, it loses connections in a few
> seconds and then comes back, and the other node server2 is marked as Down in
> the cloud graph. After a few seconds it loses the connection and comes back.
>
> Any idea what may be going wrong? Has anyone used Solr 6 with ZK 3.4.8?
> I have never seen this error before with Solr 5.x with ZK 3.4.6.
>
> Below log from server1 & server2.  The ZK has 3 nodes with chroot enabled.
>
> Thanks,
> Susheel
>
> server1/solr.log
>
> 
>
>
> 2016-05-04 19:20:53.804 INFO  (qtp1989972246-14) [   ]
> o.a.s.c.c.ZkStateReader path=[/collections/collection1]
> [configName]=[collection1] specified config exists in ZooKeeper
>
> 2016-05-04 19:20:53.806 INFO  (qtp1989972246-14) [   ] o.a.s.s.HttpSolrCall
> [admin] webapp=null path=/admin/collections
> params={action=CLUSTERSTATUS&wt=json&_=1462389588125} status=0 QTime=25
>
> 2016-05-04 19:20:53.859 INFO  (qtp1989972246-19) [   ]
> o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with params
> action=LIST&wt=json&_=1462389588125 and sendToOCPQueue=true
>
> 2016-05-04 19:20:53.861 INFO  (qtp1989972246-19) [   ] o.a.s.s.HttpSolrCall
> [admin] webapp=null path=/admin/collections
> params={action=LIST&wt=json&_=1462389588125} status=0 QTime=2
>
> 2016-05-04 19:20:57.520 INFO  (qtp1989972246-13) [   ] o.a.s.s.HttpSolrCall
> [admin] webapp=null path=/admin/cores
> params={indexInfo=false&wt=json&_=1462389588124} status=0 QTime=0
>
> 2016-05-04 19:20:57.546 INFO  (qtp1989972246-15) [   ] o.a.s.s.HttpSolrCall
> [admin] webapp=null path=/admin/info/system
> params={wt=json&_=1462389588126} status=0 QTime=25
>
> 2016-05-04 19:20:57.610 INFO  (qtp1989972246-13) [   ]
> o.a.s.h.a.CollectionsHandler Invoked Collection Action :list with params
> action=LIST&wt=json&_=1462389588125 and sendToOCPQueue=true
>
> 2016-05-04 19:20:57.613 INFO  (qtp1989972246-13) [   ] o.a.s.s.HttpSolrCall
> [admin] webapp=null path=/admin/collections
> params={action=LIST&wt=json&_=1462389588125} status=0 QTime=3
>
> 2016-05-04 19:21:29.139 INFO  (qtp1989972246-5980) [   ]
> o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
> when connecting to {}->http://server2:8983: Too many open files
>
> 2016-05-04 19:21:29.139 INFO  (qtp1989972246-5983) [   ]
> o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
> when connecting to {}->http://server2:8983: Too many open files
>
> 2016-05-04 19:21:29.139 INFO  (qtp1989972246-5984) [   ]
> o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
> when connecting to {}->http://server2:8983: Too many open files
>
> 2016-05-04 19:21:29.141 INFO  (qtp1989972246-5984) [   ]
> o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983
>
> 2016-05-04 19:21:29.141 INFO  (qtp1989972246-5984) [   ]
> o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
> when connecting to {}->http://server2:8983: Too many open files
>
> 2016-05-04 19:21:29.142 INFO  (qtp1989972246-5984) [   ]
> o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983
>
> 2016-05-04 19:21:29.142 INFO  (qtp1989972246-5984) [   ]
> o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
> when connecting to {}->http://server2:8983: Too many open files
>
> 2016-05-04 19:21:29.142 INFO  (qtp1989972246-5984) [   ]
> o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983
>
> 2016-05-04 19:21:29.140 INFO  (qtp1989972246-5983) [   ]
> o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983
>
> 2016-05-04 19:21:29.140 INFO  (qtp1989972246-5980) [   ]
> o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983
>
> 2016-05-04 19:21:29.143 INFO  (qtp1989972246-5983) [   ]
> o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
> when connecting to {}->http://server2:8983: Too many open files
>
> 2016-05-04 19:21:29.144 INFO  (qtp1989972246-5983) [   ]
> o.a.h.i.c.DefaultHttpClient Retrying connect to {}->http://server2:8983
>
> 2016-05-04 19:21:29.144 INFO  (qtp1989972246-5980) [   ]
> o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
> when connecting to {}->http://server2:8983: Too many open files
>
> 2016-05-04 19:21:29.144 INFO  (qtp1989972246-5983) [   ]
> o.a.h.i.c.DefaultHttpClient I/O exception (java.net.SocketException) caught
> when connecting to {}->http://server2:8983: Too many open files
>
> 2016-05-04 19:20:53.806 INFO  (qtp1989972246-14) [   ] o.a.s.s.HttpSolrCall
> [admin] webapp=null path=/admin/collections
> params={action=CLUSTERSTATUS&wt=json&_=1462389588125} status=0 QTime=25
>
> 2016-05-04 19:20:53.859 INFO  (qtp1989972246-19) [   ]

Re: Solr 5.2.1 on Java 8 GC

2016-05-01 Thread Nick Vasilyev
How do you log GC frequency and time to compare it with other GC
configurations?

Also, do you tweak parameters automatically or is there a set of
configuration that get tested?

Lastly, I was under the impression that G1 is not recommended based
on some issues with Lucene, so I haven't tried it. Are you guys seeing any
significant performance benefits with it and Java 8? Any issues?
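
For reference, the "Total time for which application threads were stopped"
lines quoted later in this digest come from the JVM's GC logging. On
Oracle/OpenJDK 8, flags along these lines produce them (a sketch; the log
path is an assumption):

-Xloggc:/var/solr/logs/solr_gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
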
On May 1, 2016 12:57 PM, "Bram Van Dam"  wrote:

> On 30/04/16 17:34, Davis, Daniel (NIH/NLM) [C] wrote:
> > Bram, on the subject of brute force - if your script is "clever" and
> uses binary first search, I'd love to adapt it to my environment.  I am
> trying to build a truly multi-tenant Solr because each of our indexes is
> tiny, but all together they will eventually be big, and so I'll have to
> repeat this experiment, many, many times.
>
> Sorry to disappoint, the script is very dumb, and it doesn't just
> start/stop Solr, it installs our application suite, picks a GC profile
> at random, indexes a boatload of data and then runs a bunch of query tests.
>
> Three pointers I can give you:
>
> 1) beware of JVM versions, especially when using the G1 collector, it
> behaves horribly on older JVMs but rather nicely on newer versions.
>
> 2) At the very least you'll want to test the G1 and CMS collectors.
>
> 3) One large index vs many small indexes: the behaviour is very
> different. Depending on how many indexes you have, it might be worth to
> run each one in a different JVM. Of course that's not practical if you
> have thousands of indexes.
>
>  - Bram
>
>


Re: Solr5.5:DocValues/CopyField does not work with Atomic updates

2016-04-30 Thread Nick Vasilyev
I am also running into this problem on Solr 6.

On Sun, Apr 24, 2016 at 6:10 PM, Karthik Ramachandran <
kramachand...@commvault.com> wrote:

> I have opened JIRA
>
> https://issues.apache.org/jira/browse/SOLR-9034
>
> I will upload the patch soon.
>
> With Thanks & Regards
> Karthik Ramachandran
> CommVault
> Direct: (732) 923-2197
>  Please don't print this e-mail unless you really need to
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Friday, April 22, 2016 8:24 PM
> To: solr-user 
> Subject: Re: Solr5.5:DocValues/CopyField does not work with Atomic updates
>
> I think I just added the right person, let us know if you don't have
> access and/or if you need access to the LUCENE JIRA.
>
> Erick
>
> On Fri, Apr 22, 2016 at 5:17 PM, Karthik Ramachandran <
> kramachand...@commvault.com> wrote:
> > Eric
> >   I have created a JIRA id (kramachand...@commvault.com).  Once I get
> > access I will create the JIRA and submit the patch.
> >
> > With Thanks & Regards
> > Karthik Ramachandran
> > CommVault
> > Direct: (732) 923-2197
> > P Please don't print this e-mail unless you really need to
> >
> >
> >
> > On 4/22/16, 8:04 PM, "Erick Erickson"  wrote:
> >
> >>Karthik:
> >>
> >>The Apache mailing list is pretty aggressive about removing
> >>attachments. Could you possibly open a JIRA and attach the file as a
> >>patch? If at all possible a patch file with just the diffs would be
> >>best.
> >>
> >>One problem is that it'll be a two-step process. The JIRAs have been
> >>being hit with spam, so you'll have to request access once you create
> >>a JIRA ID (this list would be fine).
> >>
> >>Best,
> >>Erick
> >>
> >>On Thu, Apr 21, 2016 at 9:09 PM, Karthik Ramachandran
> >> wrote:
> >>> We feel the issue is in
> >>>RealTimeGetComponent.getInputDocument(SolrCore
> >>>core,
> >>> BytesRef idBytes) where solr calls getNonStoredDVs and add the
> >>>fields to the  original document without excluding the copyFields.
> >>>
> >>>
> >>>
> >>> We made changes to send the filteredList to
> >>>searcher.decorateDocValueFields
> >>> and it started working.
> >>>
> >>>
> >>>
> >>> Attached is the modified file.
> >>>
> >>>
> >>>
> >>> With Thanks & Regards
> >>> Karthik Ramachandran
> >>> CommVault
> >>> P Please don't print this e-mail unless you really need to
> >>>
> >>>
> >>>
> >>> -Original Message-
> >>> From: Karthik Ramachandran [mailto:mrk...@gmail.com]
> >>> Sent: Friday, April 22, 2016 12:08 AM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: Solr5.5:DocValues/CopyField does not work with Atomic
> >>>updates
> >>>
> >>>
> >>>
> >>> We are trying to update Field A.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> -Karthik
> >>>
> >>>
> >>>
> >>> On Thu, Apr 21, 2016 at 10:36 PM, John Bickerstaff
> >>> >>>
>  wrote:
> >>>
> >>>
> >>>
>  Which field do you try to atomically update?  A or B or some other?
> >>>
>  On Apr 21, 2016 8:29 PM, "Tirthankar Chatterjee" <
> >>>
>  tchatter...@commvault.com>
> >>>
>  wrote:
> >>>
> 
> >>>
>  > Hi,
> >>>
>  > Here is the scenario for SOLR5.5:
> >>>
>  >
> >>>
>  > FieldA type= stored=true indexed=true
> >>>
>  >
> >>>
>  > FieldB type= stored=false indexed=true docValue=true
> >>>
>  > usedocvalueasstored=false
> >>>
>  >
> >>>
>  > FieldA copyTo FieldB
> >>>
>  >
> >>>
>  > Try an Atomic update and we are getting this error:
> >>>
>  >
> >>>
>  > possible analysis error: DocValuesField "mtmround" appears more
>  > than
> >>>
>  > once in this document (only one value is allowed per field)
> >>>
>  >
> >>>
>  > How do we resolve this.
> >>>
>  >
> >>>
>  >
> >>>
>  >
> >>>
>  > ***Legal
> >>>
>  > Disclaimer***
> >>>
>  > "This communication may contain confidential and privileged
>  > material
> >>>
>  > for the sole use of the intended recipient. Any unauthorized
>  > review,
> >>>
>  > use or distribution by others is strictly prohibited. If you have
> >>>
>  > received the message by mistake, please advise the sender by
>  > reply
> >>>
>  > email and delete the message. Thank
> >>>
>  you."
> >>>
>  > *
>  > ***
> >>>
>  > **
> >>>
> 
> >>>
> >>> ***Legal
> >>>Disclaimer***
> >>> "This communication may contain confidential and privileged material
> >>>for the  sole use of the intended recipient. Any unauthorized review,
> >>>use or  distribution  by others is strictly prohibited. If you have
> >>>received the message by  mistake,  please advise the sender by reply
> >>>email and delete the message. Thank you."
> >>>
> >>>*
> >>>*
> >>
> 

Re: Solr 5.2.1 on Java 8 GC

2016-04-29 Thread Nick Vasilyev
Before:
2016-04-26T04:42:36.033-0400: 245437.715: Total time for which application
threads were stopped: 9.9446430 seconds, Stopping threads took: 0.0007500
seconds
2016-04-26T04:43:02.409-0400: 245464.091: Total time for which application
threads were stopped: 10.4197000 seconds, Stopping threads took: 0.260
seconds
2016-04-26T04:43:29.559-0400: 245491.241: Total time for which application
threads were stopped: 9.6712880 seconds, Stopping threads took: 0.0001080
seconds
2016-04-26T04:43:56.648-0400: 245518.330: Total time for which application
threads were stopped: 9.8339590 seconds, Stopping threads took: 0.0011820
seconds
2016-04-26T04:45:35.358-0400: 245617.040: Total time for which application
threads were stopped: 9.5853210 seconds, Stopping threads took: 0.0001760
seconds
2016-04-26T04:54:58.764-0400: 246180.446: Total time for which application
threads were stopped: 2.9048350 seconds, Stopping threads took: 0.0008180
seconds
2016-04-26T04:55:06.107-0400: 246187.789: Total time for which application
threads were stopped: 1.1189760 seconds, Stopping threads took: 0.0011390
seconds

After:
2016-04-29T04:30:05.758-0400: 29962.077: Total time for which application
threads were stopped: 1.0823960 seconds, Stopping threads took: 0.0005840
seconds
2016-04-29T04:30:11.349-0400: 29967.668: Total time for which application
threads were stopped: 1.4147830 seconds, Stopping threads took: 0.0008980
seconds
2016-04-29T04:30:17.198-0400: 29973.517: Total time for which application
threads were stopped: 1.6294590 seconds, Stopping threads took: 0.0009380
seconds
2016-04-29T04:30:22.350-0400: 29978.669: Total time for which application
threads were stopped: 1.6787880 seconds, Stopping threads took: 0.0012320
seconds
2016-04-29T04:30:28.230-0400: 29984.549: Total time for which application
threads were stopped: 1.6895760 seconds, Stopping threads took: 0.0010270
seconds
2016-04-29T04:30:29.944-0400: 29986.263: Total time for which application
threads were stopped: 1.5271500 seconds, Stopping threads took: 0.0009670
seconds
2016-04-29T04:30:35.282-0400: 29991.601: Total time for which application
threads were stopped: 1.6575670 seconds, Stopping threads took: 0.0006200
seconds
2016-04-29T04:30:51.011-0400: 30007.329: Total time for which application
threads were stopped: 2.0383550 seconds, Stopping threads took: 0.0004640
seconds
2016-04-29T04:31:03.032-0400: 30019.351: Total time for which application
threads were stopped: 2.1963570 seconds, Stopping threads took: 0.0004650
seconds
2016-04-29T04:31:07.679-0400: 30023.998: Total time for which application
threads were stopped: 1.2220760 seconds, Stopping threads took: 0.0004720
seconds

On Thu, Apr 28, 2016 at 1:02 PM, Jeff Wartes <jwar...@whitepages.com> wrote:

>
> Shawn Heisey’s page is the usual reference guide for GC settings:
> https://wiki.apache.org/solr/ShawnHeisey
> Most of the learnings from that are in the Solr 5.x startup scripts
> already, but your heap is bigger, so your mileage may vary.
>
> Some tools I’ve used while doing GC tuning:
>
> * VisualVM - Comes with the jdk. It has a Visual GC plug-in that’s pretty
> nice for visualizing what’s going on in realtime, but you need to connect
> it via jstatd for that to work.
> * GCViewer - Visualizes a GC log. The UI leaves a lot to be desired, but
> it’s the best tool I’ve found for this purpose. Use this fork for jdk 6+ -
> https://github.com/chewiebug/GCViewer
> * Swiss Java Knife has a bunch of useful features -
> https://github.com/aragozin/jvm-tools
> * YourKit - I’ve been using this lately to analyze where garbage comes
> from. It’s not free though.
> * Eclipse Memory Analyzer - I used this to analyze heap dumps before I got
> a YourKit license: http://www.eclipse.org/mat/
>
> Good luck!
>
>
>
>
>
>
> On 4/28/16, 9:27 AM, "Yonik Seeley" <ysee...@gmail.com> wrote:
>
> >On Thu, Apr 28, 2016 at 12:21 PM, Nick Vasilyev
> ><nick.vasily...@gmail.com> wrote:
> >> Hi Yonik,
> >>
> >> There are a lot of logistics involved with re-indexing and naturally
> >> upgrading Solr. I was hoping that there is an easier alternative since
> this
> >> is only a single back end script that is having problems.
> >>
> >> Is there any room for improvement with tweaking GC params?
> >
> >There always is ;-)  But I'm not a GC tuning expert.  I prefer to
> >attack memory problems more head-on (i.e. with code to use less
> >memory).
> >
> >-Yonik
>


Re: Solr 5.2.1 on Java 8 GC

2016-04-28 Thread Nick Vasilyev
Hi Yonik,

There are a lot of logistics involved with re-indexing and naturally
upgrading Solr. I was hoping that there is an easier alternative since this
is only a single back end script that is having problems.

Is there any room for improvement with tweaking GC params?

On Thu, Apr 28, 2016 at 12:06 PM, Yonik Seeley <ysee...@gmail.com> wrote:

> On Thu, Apr 28, 2016 at 11:50 AM, Nick Vasilyev
> <nick.vasily...@gmail.com> wrote:
> > mmfr_exact is a string field. key_phrases is a multivalued string field.
>
> One guess is that top-level field caches (and UnInvertedField use)
> were removed in
> https://issues.apache.org/jira/browse/LUCENE-5666
>
> While this is better for NRT (a quickly changing index), it's worse in
> CPU, and can be worse in memory overhead for very static indexes.
>
> Multi-valued string faceting was hit hardest:
> https://issues.apache.org/jira/browse/SOLR-8096
> Although I only measured the CPU impact, and not memory.
>
> The 4.x method of faceting was restored as part of
> https://issues.apache.org/jira/browse/SOLR-8466
>
> If this is the issue, you can:
> - try reindexing with docValues... that should solve memory issues at
> the expense of some speed
> - upgrade to a more recent Solr version and use facet.method=uif for
> your multi-valued string fields
>
> -Yonik
>


Re: Solr 5.2.1 on Java 8 GC

2016-04-28 Thread Nick Vasilyev
Correction, the key_phrases is set up as follows:

   (fieldType definition omitted here; the XML tags were stripped by the
   mailing-list archive)

On Thu, Apr 28, 2016 at 12:03 PM, Nick Vasilyev <nick.vasily...@gmail.com>
wrote:

> The working set is larger than the heap. This is our largest collection
> and all shards combined would probably be around 60GB in total, there are
> also a few other much smaller collections.
>
> During normal operations the JVM memory utilization hangs between 17GB and
> 22GB if we aren't indexing any data.
>
> Either way, this wasn't a problem before. I suspect that it is because we
> are now on Java 8 so I wanted to reach out to the community to see if there
> are any new best practices around GC tuning since the current
> recommendation seems to be for Java 7.
>
>
> On Thu, Apr 28, 2016 at 11:54 AM, Walter Underwood <wun...@wunderwood.org>
> wrote:
>
>> 32 GB is a pretty big heap. If the working set is really smaller than
>> that, the extra heap just makes a full GC take longer.
>>
>> How much heap is used after a full GC? Take the largest value you see
>> there, then add a bit more, maybe 25% more or 2 GB more.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> > On Apr 28, 2016, at 8:50 AM, Nick Vasilyev <nick.vasily...@gmail.com>
>> wrote:
>> >
>> > mmfr_exact is a string field. key_phrases is a multivalued string field.
>> >
>> > On Thu, Apr 28, 2016 at 11:47 AM, Yonik Seeley <ysee...@gmail.com>
>> wrote:
>> >
>> >> What about the field types though... are they single valued or multi
>> >> valued, string, text, numeric?
>> >>
>> >> -Yonik
>> >>
>> >>
>> >> On Thu, Apr 28, 2016 at 11:43 AM, Nick Vasilyev
>> >> <nick.vasily...@gmail.com> wrote:
>> >>> Hi Yonik,
>> >>>
>> >>> I forgot to mention that the index is approximately 50 million docs
>> split
>> >>> across 4 shards (replication factor 2) on 2 solr replicas.
>> >>>
>> >>> This particular script will filter items based on a category
>> >> (10-~1,000,000
>> >>> items in each) and run facets on top X terms for particular fields.
>> Query
>> >>> looks like this:
>> >>>
>> >>> {
>> >>>   q => "cat:$code",
>> >>>   rows => 0,
>> >>>   facet => 'true',
>> >>>   'facet.field' => [ 'key_phrases', 'mmfr_exact' ],
>> >>>   'f.key_phrases.facet.limit' => 100,
>> >>>   'f.mmfr_exact.facet.limit' => 20,
>> >>>   'facet.mincount' => 5,
>> >>>   distrib => 'false',
>> >>> }
>> >>>
>> >>> I know it can be re-worked some, especially considering there are
>> >> thousands
>> >>> of similar requests going out. However we didn't have this issue
>> before
>> >> and
>> >>> I am worried that it may be a symptom of a larger underlying problem.
>> >>>
>> >>> On Thu, Apr 28, 2016 at 11:34 AM, Yonik Seeley <ysee...@gmail.com>
>> >> wrote:
>> >>>
>> >>>> On Thu, Apr 28, 2016 at 11:29 AM, Nick Vasilyev
>> >>>> <nick.vasily...@gmail.com> wrote:
>> >>>>> Hello,
>> >>>>>
>> >>>>> We recently upgraded to Solr 5.2.1 with jre1.8.0_74 and are seeing
>> >> long
>> >>>> GC
>> >>>>> pauses when running jobs that do some hairy faceting. The same jobs
>> >>>> worked
>> >>>>> fine with our previous 4.6 Solr.
>> >>>>
>> >>>> What does a typical request look like, and what are the field types
>> >>>> that faceting is done on?
>> >>>>
>> >>>> -Yonik
>> >>>>
>> >>>>
>> >>>>> The JVM is configured with 32GB heap with default GC settings,
>> however
>> >>>> I've
>> >>>>> been tweaking the GC settings to no avail. The latest version had
>> the
>> >>>>> following differences from the default config:
>> >>>>>
>> >>>>> XX:ConcGCThreads and XX:ParallelGCThreads are increased from 4 to 7
>> >>>>>
>> >>>>> XX:CMSInitiatingOccupancyFraction increased from 50 to 

Re: Solr 5.2.1 on Java 8 GC

2016-04-28 Thread Nick Vasilyev
The working set is larger than the heap. This is our largest collection and
all shards combined would probably be around 60GB in total, there are also
a few other much smaller collections.

During normal operations the JVM memory utilization hangs between 17GB and
22GB if we aren't indexing any data.

Either way, this wasn't a problem before. I suspect that it is because we
are now on Java 8 so I wanted to reach out to the community to see if there
are any new best practices around GC tuning since the current
recommendation seems to be for Java 7.


On Thu, Apr 28, 2016 at 11:54 AM, Walter Underwood <wun...@wunderwood.org>
wrote:

> 32 GB is a pretty big heap. If the working set is really smaller than
> that, the extra heap just makes a full GC take longer.
>
> How much heap is used after a full GC? Take the largest value you see
> there, then add a bit more, maybe 25% more or 2 GB more.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Apr 28, 2016, at 8:50 AM, Nick Vasilyev <nick.vasily...@gmail.com>
> wrote:
> >
> > mmfr_exact is a string field. key_phrases is a multivalued string field.
> >
> > On Thu, Apr 28, 2016 at 11:47 AM, Yonik Seeley <ysee...@gmail.com>
> wrote:
> >
> >> What about the field types though... are they single valued or multi
> >> valued, string, text, numeric?
> >>
> >> -Yonik
> >>
> >>
> >> On Thu, Apr 28, 2016 at 11:43 AM, Nick Vasilyev
> >> <nick.vasily...@gmail.com> wrote:
> >>> Hi Yonik,
> >>>
> >>> I forgot to mention that the index is approximately 50 million docs
> split
> >>> across 4 shards (replication factor 2) on 2 solr replicas.
> >>>
> >>> This particular script will filter items based on a category
> >> (10-~1,000,000
> >>> items in each) and run facets on top X terms for particular fields.
> Query
> >>> looks like this:
> >>>
> >>> {
> >>>   q => "cat:$code",
> >>>   rows => 0,
> >>>   facet => 'true',
> >>>   'facet.field' => [ 'key_phrases', 'mmfr_exact' ],
> >>>   'f.key_phrases.facet.limit' => 100,
> >>>   'f.mmfr_exact.facet.limit' => 20,
> >>>   'facet.mincount' => 5,
> >>>   distrib => 'false',
> >>> }
> >>>
> >>> I know it can be re-worked some, especially considering there are
> >> thousands
> >>> of similar requests going out. However we didn't have this issue before
> >> and
> >>> I am worried that it may be a symptom of a larger underlying problem.
> >>>
> >>> On Thu, Apr 28, 2016 at 11:34 AM, Yonik Seeley <ysee...@gmail.com>
> >> wrote:
> >>>
> >>>> On Thu, Apr 28, 2016 at 11:29 AM, Nick Vasilyev
> >>>> <nick.vasily...@gmail.com> wrote:
> >>>>> Hello,
> >>>>>
> >>>>> We recently upgraded to Solr 5.2.1 with jre1.8.0_74 and are seeing
> >> long
> >>>> GC
> >>>>> pauses when running jobs that do some hairy faceting. The same jobs
> >>>> worked
> >>>>> fine with our previous 4.6 Solr.
> >>>>
> >>>> What does a typical request look like, and what are the field types
> >>>> that faceting is done on?
> >>>>
> >>>> -Yonik
> >>>>
> >>>>
> >>>>> The JVM is configured with 32GB heap with default GC settings,
> however
> >>>> I've
> >>>>> been tweaking the GC settings to no avail. The latest version had the
> >>>>> following differences from the default config:
> >>>>>
> >>>>> XX:ConcGCThreads and XX:ParallelGCThreads are increased from 4 to 7
> >>>>>
> >>>>> XX:CMSInitiatingOccupancyFraction increased from 50 to 70
> >>>>>
> >>>>>
> >>>>> Here is a sample output from the gc_log
> >>>>>
> >>>>> 2016-04-28T04:36:47.240-0400: 27905.535: Total time for which
> >> application
> >>>>> threads were stopped: 0.1667520 seconds, Stopping threads took:
> >> 0.0171900
> >>>>> seconds
> >>>>> {Heap before GC invocations=2051 (full 59):
> >>>>> par new generation   total 6990528K, used 2626705K
> >> [0x2b16c000,
> >>>>> 0x2b18c000, 0

Re: Solr 5.2.1 on Java 8 GC

2016-04-28 Thread Nick Vasilyev
mmfr_exact is a string field. key_phrases is a multivalued string field.

On Thu, Apr 28, 2016 at 11:47 AM, Yonik Seeley <ysee...@gmail.com> wrote:

> What about the field types though... are they single valued or multi
> valued, string, text, numeric?
>
> -Yonik
>
>
> On Thu, Apr 28, 2016 at 11:43 AM, Nick Vasilyev
> <nick.vasily...@gmail.com> wrote:
> > Hi Yonik,
> >
> > I forgot to mention that the index is approximately 50 million docs split
> > across 4 shards (replication factor 2) on 2 solr replicas.
> >
> > This particular script will filter items based on a category
> (10-~1,000,000
> > items in each) and run facets on top X terms for particular fields. Query
> > looks like this:
> >
> > {
> >q => "cat:$code",
> >rows => 0,
> >facet => 'true',
> >'facet.field' => [ 'key_phrases', 'mmfr_exact' ],
> >'f.key_phrases.facet.limit' => 100,
> >'f.mmfr_exact.facet.limit' => 20,
> >'facet.mincount' => 5,
> >distrib => 'false',
> >  }
> >
> > I know it can be re-worked some, especially considering there are
> thousands
> > of similar requests going out. However we didn't have this issue before
> and
> > I am worried that it may be a symptom of a larger underlying problem.
> >
> > On Thu, Apr 28, 2016 at 11:34 AM, Yonik Seeley <ysee...@gmail.com>
> wrote:
> >
> >> On Thu, Apr 28, 2016 at 11:29 AM, Nick Vasilyev
> >> <nick.vasily...@gmail.com> wrote:
> >> > Hello,
> >> >
> >> > We recently upgraded to Solr 5.2.1 with jre1.8.0_74 and are seeing
> long
> >> GC
> >> > pauses when running jobs that do some hairy faceting. The same jobs
> >> worked
> >> > fine with our previous 4.6 Solr.
> >>
> >> What does a typical request look like, and what are the field types
> >> that faceting is done on?
> >>
> >> -Yonik
> >>
> >>
> >> > The JVM is configured with 32GB heap with default GC settings, however
> >> I've
> >> > been tweaking the GC settings to no avail. The latest version had the
> >> > following differences from the default config:
> >> >
> >> > XX:ConcGCThreads and XX:ParallelGCThreads are increased from 4 to 7
> >> >
> >> > XX:CMSInitiatingOccupancyFraction increased from 50 to 70
> >> >
> >> >
> >> > Here is a sample output from the gc_log
> >> >
> >> > 2016-04-28T04:36:47.240-0400: 27905.535: Total time for which
> application
> >> > threads were stopped: 0.1667520 seconds, Stopping threads took:
> 0.0171900
> >> > seconds
> >> > {Heap before GC invocations=2051 (full 59):
> >> >  par new generation   total 6990528K, used 2626705K
> [0x2b16c000,
> >> > 0x2b18c000, 0x2b18c000)
> >> >   eden space 5592448K,  44% used [0x2b16c000,
> 0x2b17571b9948,
> >> > 0x2b181556)
> >> >   from space 1398080K,  10% used [0x2b181556,
> 0x2b181e8cac28,
> >> > 0x2b186aab)
> >> >   to   space 1398080K,   0% used [0x2b186aab,
> 0x2b186aab,
> >> > 0x2b18c000)
> >> >  concurrent mark-sweep generation total 25165824K, used 25122205K
> >> > [0x2b18c000, 0x2b1ec000, 0x2b1ec000)
> >> >  Metaspace   used 41840K, capacity 42284K, committed 42680K,
> reserved
> >> > 43008K
> >> > 2016-04-28T04:36:49.828-0400: 27908.123: [GC (Allocation Failure)
> >> > 2016-04-28T04:36:49.828-0400: 27908.124:
> >> [CMS2016-04-28T04:36:49.912-0400:
> >> > 27908.207: [CMS-concurr
> >> > ent-abortable-preclean: 5.615/5.862 secs] [Times: user=17.70 sys=2.77,
> >> > real=5.86 secs]
> >> >  (concurrent mode failure): 25122205K->15103706K(25165824K), 8.5567560
> >> > secs] 27748910K->15103706K(32156352K), [Metaspace:
> >> 41840K->41840K(43008K)],
> >> > 8.5657830 secs] [
> >> > Times: user=8.56 sys=0.01, real=8.57 secs]
> >> > Heap after GC invocations=2052 (full 60):
> >> >  par new generation   total 6990528K, used 0K [0x2b16c000,
> >> > 0x2b18c000, 0x2b18c000)
> >> >   eden space 5592448K,   0% used [0x2b16c000,
> 0x2b16c000,
> >> > 0x2b181556)
> >> >   from space 1398080K,   0% used [0x2b181556,
> 0x2b181556,
> >> > 0x2b186aab)
> >> >   to   space 1398080K,   0% used [0x2b186aab,
> 0x2b186aab,
> >> > 0x2b18c000)
> >> >  concurrent mark-sweep generation total 25165824K, used 15103706K
> >> > [0x2b18c000, 0x2b1ec000, 0x2b1ec000)
> >> >  Metaspace   used 41840K, capacity 42284K, committed 42680K,
> reserved
> >> > 43008K
> >> > }
> >> > 2016-04-28T04:36:58.395-0400: 27916.690: Total time for which
> application
> >> > threads were stopped: 8.5676090 seconds, Stopping threads took:
> 0.0003930
> >> > seconds
> >> >
> >> > I read the instructions here,
> https://wiki.apache.org/solr/ShawnHeisey,
> >> but
> >> > they seem to be specific to Java 7. Are there any updated
> recommendations
> >> > for Java 8?
> >>
>


Re: Solr 5.2.1 on Java 8 GC

2016-04-28 Thread Nick Vasilyev
Hi Yonik,

I forgot to mention that the index is approximately 50 million docs split
across 4 shards (replication factor 2) on 2 solr replicas.

This particular script will filter items based on a category (10-~1,000,000
items in each) and run facets on top X terms for particular fields. Query
looks like this:

{
   q => "cat:$code",
   rows => 0,
   facet => 'true',
   'facet.field' => [ 'key_phrases', 'mmfr_exact' ],
   'f.key_phrases.facet.limit' => 100,
   'f.mmfr_exact.facet.limit' => 20,
   'facet.mincount' => 5,
   distrib => 'false',
 }

I know it can be re-worked some, especially considering there are thousands
of similar requests going out. However, we didn't have this issue before, and
I am worried that it may be a symptom of a larger underlying problem.

On Thu, Apr 28, 2016 at 11:34 AM, Yonik Seeley <ysee...@gmail.com> wrote:

> On Thu, Apr 28, 2016 at 11:29 AM, Nick Vasilyev
> <nick.vasily...@gmail.com> wrote:
> > Hello,
> >
> > We recently upgraded to Solr 5.2.1 with jre1.8.0_74 and are seeing long
> GC
> > pauses when running jobs that do some hairy faceting. The same jobs
> worked
> > fine with our previous 4.6 Solr.
>
> What does a typical request look like, and what are the field types
> that faceting is done on?
>
> -Yonik
>
>
> > The JVM is configured with 32GB heap with default GC settings, however
> I've
> > been tweaking the GC settings to no avail. The latest version had the
> > following differences from the default config:
> >
> > XX:ConcGCThreads and XX:ParallelGCThreads are increased from 4 to 7
> >
> > XX:CMSInitiatingOccupancyFraction increased from 50 to 70
> >
> >
> > Here is a sample output from the gc_log
> >
> > 2016-04-28T04:36:47.240-0400: 27905.535: Total time for which application
> > threads were stopped: 0.1667520 seconds, Stopping threads took: 0.0171900
> > seconds
> > {Heap before GC invocations=2051 (full 59):
> >  par new generation   total 6990528K, used 2626705K [0x2b16c000,
> > 0x2b18c000, 0x2b18c000)
> >   eden space 5592448K,  44% used [0x2b16c000, 0x2b17571b9948,
> > 0x2b181556)
> >   from space 1398080K,  10% used [0x2b181556, 0x2b181e8cac28,
> > 0x2b186aab)
> >   to   space 1398080K,   0% used [0x2b186aab, 0x2b186aab,
> > 0x2b18c000)
> >  concurrent mark-sweep generation total 25165824K, used 25122205K
> > [0x2b18c000, 0x2b1ec000, 0x2b1ec000)
> >  Metaspace   used 41840K, capacity 42284K, committed 42680K, reserved
> > 43008K
> > 2016-04-28T04:36:49.828-0400: 27908.123: [GC (Allocation Failure)
> > 2016-04-28T04:36:49.828-0400: 27908.124:
> [CMS2016-04-28T04:36:49.912-0400:
> > 27908.207: [CMS-concurr
> > ent-abortable-preclean: 5.615/5.862 secs] [Times: user=17.70 sys=2.77,
> > real=5.86 secs]
> >  (concurrent mode failure): 25122205K->15103706K(25165824K), 8.5567560
> > secs] 27748910K->15103706K(32156352K), [Metaspace:
> 41840K->41840K(43008K)],
> > 8.5657830 secs] [
> > Times: user=8.56 sys=0.01, real=8.57 secs]
> > Heap after GC invocations=2052 (full 60):
> >  par new generation   total 6990528K, used 0K [0x2b16c000,
> > 0x2b18c000, 0x2b18c000)
> >   eden space 5592448K,   0% used [0x2b16c000, 0x2b16c000,
> > 0x2b181556)
> >   from space 1398080K,   0% used [0x2b181556, 0x2b181556,
> > 0x2b186aab)
> >   to   space 1398080K,   0% used [0x2b186aab, 0x2b186aab,
> > 0x2b18c000)
> >  concurrent mark-sweep generation total 25165824K, used 15103706K
> > [0x2b18c000, 0x2b1ec000, 0x2b1ec000)
> >  Metaspace   used 41840K, capacity 42284K, committed 42680K, reserved
> > 43008K
> > }
> > 2016-04-28T04:36:58.395-0400: 27916.690: Total time for which application
> > threads were stopped: 8.5676090 seconds, Stopping threads took: 0.0003930
> > seconds
> >
> > I read the instructions here, https://wiki.apache.org/solr/ShawnHeisey,
> but
> > they seem to be specific to Java 7. Are there any updated recommendations
> > for Java 8?
>


Solr 5.2.1 on Java 8 GC

2016-04-28 Thread Nick Vasilyev
Hello,

We recently upgraded to Solr 5.2.1 with jre1.8.0_74 and are seeing long GC
pauses when running jobs that do some hairy faceting. The same jobs worked
fine with our previous 4.6 Solr.

The JVM is configured with a 32GB heap and default GC settings; however, I've
been tweaking the GC settings to no avail. The latest iteration had the
following differences from the default config:

XX:ConcGCThreads and XX:ParallelGCThreads are increased from 4 to 7

XX:CMSInitiatingOccupancyFraction increased from 50 to 70
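
In solr.in.sh terms, a sketch of what those overrides look like (assuming
the stock 5.x CMS-based settings; these are the values we tried, not a
recommendation):

  GC_TUNE="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
    -XX:ConcGCThreads=7 -XX:ParallelGCThreads=7 \
    -XX:CMSInitiatingOccupancyFraction=70 \
    -XX:+UseCMSInitiatingOccupancyOnly"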


Here is a sample output from the gc_log

2016-04-28T04:36:47.240-0400: 27905.535: Total time for which application
threads were stopped: 0.1667520 seconds, Stopping threads took: 0.0171900
seconds
{Heap before GC invocations=2051 (full 59):
 par new generation   total 6990528K, used 2626705K [0x2b16c000,
0x2b18c000, 0x2b18c000)
  eden space 5592448K,  44% used [0x2b16c000, 0x2b17571b9948,
0x2b181556)
  from space 1398080K,  10% used [0x2b181556, 0x2b181e8cac28,
0x2b186aab)
  to   space 1398080K,   0% used [0x2b186aab, 0x2b186aab,
0x2b18c000)
 concurrent mark-sweep generation total 25165824K, used 25122205K
[0x2b18c000, 0x2b1ec000, 0x2b1ec000)
 Metaspace   used 41840K, capacity 42284K, committed 42680K, reserved
43008K
2016-04-28T04:36:49.828-0400: 27908.123: [GC (Allocation Failure)
2016-04-28T04:36:49.828-0400: 27908.124: [CMS2016-04-28T04:36:49.912-0400:
27908.207: [CMS-concurr
ent-abortable-preclean: 5.615/5.862 secs] [Times: user=17.70 sys=2.77,
real=5.86 secs]
 (concurrent mode failure): 25122205K->15103706K(25165824K), 8.5567560
secs] 27748910K->15103706K(32156352K), [Metaspace: 41840K->41840K(43008K)],
8.5657830 secs] [
Times: user=8.56 sys=0.01, real=8.57 secs]
Heap after GC invocations=2052 (full 60):
 par new generation   total 6990528K, used 0K [0x2b16c000,
0x2b18c000, 0x2b18c000)
  eden space 5592448K,   0% used [0x2b16c000, 0x2b16c000,
0x2b181556)
  from space 1398080K,   0% used [0x2b181556, 0x2b181556,
0x2b186aab)
  to   space 1398080K,   0% used [0x2b186aab, 0x2b186aab,
0x2b18c000)
 concurrent mark-sweep generation total 25165824K, used 15103706K
[0x2b18c000, 0x2b1ec000, 0x2b1ec000)
 Metaspace   used 41840K, capacity 42284K, committed 42680K, reserved
43008K
}
2016-04-28T04:36:58.395-0400: 27916.690: Total time for which application
threads were stopped: 8.5676090 seconds, Stopping threads took: 0.0003930
seconds

I read the instructions here, https://wiki.apache.org/solr/ShawnHeisey, but
they seem to be specific to Java 7. Are there any updated recommendations
for Java 8?


Re: block join rollups

2016-04-18 Thread Nick Vasilyev
Hi Yonik,

Well, no one replied to this yet, so I thought I'd chime in with some of
the use cases that I am working with. Please note that I am lagging a bit
behind the last few releases, so I haven't had time to experiment with Solr
5.3+. I am sure that some of this is included in there already, and I am
very excited to play around with the new streaming API, JSON facets and SQL
interface when I have a bit more time.

I am indexing click stream data into Solr. Each set of records represents a
user's unique visit to our website. They all share a common session id, as
well as several session attributes, such as IP and user attributes if they
log in. Each record represents an individual action, such as a search,
product view or a visit to a particular page, all attributes and data
elements of each request are stored with each record, additionally, session
attributes get copied down to each event item. The current goal of this
system is to provide less tech savvy users with easy access to this data in
a way they can explore it and drill down on particular elements; we are
using Banana for this.

Currently, I have to copy a lot of session fields to each event so I can
filter on them, for example, show all searches for users associated with
organization X. This is super redundant and I am really looking for a
better way. It would be great if I could make parent document fields appear
as if they are a part of child documents.
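
If the sessions and events were indexed as parent/child blocks instead, I
imagine the child query parser would cover this without the copying; a
sketch (the "type" discriminator field and the org value are invented):

  q={!child of="type:session"}type:session AND org:X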

Additionally, I am counting various events for each session during
processing. For example, I count the number of searches, product views, add
to carts, etc... This information is also indexed in each record. This
allows me to pull up specific events (like product views) where the number
of searches in a given session is greater than X. However, again, indexing
this information for each event creates a lot of redundancy.

Finally, a slightly different use case involves running functions on items
in a group (even if they aren't part of the result set) and returning
that as part of the document. Almost like a dynamically generated
document, based on aggregations from child documents. This is currently
somewhat available, but I can't include it in a sort. For example, I am
grouping items on a field; I want to get the minimum value of a field per
group and sort the result (of groups) on that calculated value.

I am not sure if this helps you at all, but wanted to share some of my pain
points, hope it helps.
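
For the grouping case above, the closest thing I have found is the collapse
parser, which can pick each group's head by a min/max field so the result
set can then be sorted on it; a sketch (field names are invented):

  fq={!collapse field=groupId min=price}&sort=price asc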

On Sun, Apr 17, 2016 at 6:50 PM, Yonik Seeley  wrote:

> Hey folks, we're at the point of figuring out the API for block join
> child rollups for the JSON Facet API.
> We already have simple block join faceting:
> http://yonik.com/solr-nested-objects/
> So now we need an API to carry over more information from children to
> parents (say rolling up average rating of all the reviews to the
> corresponding parent book objects).
>
> I've gathered some of my notes/thoughts on the API here:
> https://issues.apache.org/jira/browse/SOLR-8998
>
> Feedback welcome, and we can discuss here in this thread rather than
> cluttering the JIRA.
>
> -Yonik
>


JSON Facet Stats Mincount

2016-04-14 Thread Nick Vasilyev
Hello, I am trying to get a list of items that have more than one
manufacturer using the following json facet query. This works fine without
mincount, but errors out as soon as I add it.

Is this possible or am I doing something wrong?

json.facet={
   groupID: {
  type: terms,
  field: groupID,
  facet:{ y: "unique(mfr)",
mincount: 2}
   }
}

Error:
"error": { "msg": "expected Map but got 2 ,path=facet/groupID", "code": 400
}
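
For anyone hitting the same error: the parser treats the inner mincount as a
sub-facet named "mincount", hence "expected Map but got 2". Moving it up one
level is at least valid syntax, though note that there it filters on each
bucket's document count rather than on the unique(mfr) value:

json.facet={
   groupID: {
      type: terms,
      field: groupID,
      mincount: 2,
      facet: { y: "unique(mfr)" }
   }
}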

Thanks in advance


Re: How fast indexing?

2016-03-20 Thread Nick Vasilyev
There can be a lot of factors, can you provide a bit of additional
information to get started?

- How many items are you indexing per second?
- What does the indexing process look like?
- How large is each item?
- What hardware are you using?
- How is your Solr set up? JVM memory, collection layout, etc...
- What is your current commit frequency?
- What is the query volume while you are indexing?
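
On the autocommit question below: yes, that lives in solrconfig.xml. A
sketch of a common starting point (values are illustrative, not a
recommendation):

  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>300000</maxTime>
  </autoSoftCommit>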

On Sun, Mar 20, 2016 at 6:25 PM, fabigol  wrote:

> Hi,
> I have a Solr project where I index from a Postgres database.
> The indexing is very slow.
> How can I speed it up?
> Can I modify autocommit in the solrconfig.xml file?
> Does anyone have ideas? I looked on Google but found little.
> Help me please
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-fast-indexing-tp4264994.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Boosts for relevancy (shopping products)

2016-03-19 Thread Nick Vasilyev
I work with a similar catalog, except our data is especially bad. We've
found that several things helped:

- Item level grouping (group same item sold by multiple vendors). Rank
items with more vendors a bit higher.
- Include a boost function for other attributes, such as an original image
of the product
- Rank items a bit higher if they have data from an external catalog like
IceCat
- For relevance and performance, we have several fields that we copy data
into. High value fields get copied into a high weighted field, while lower
value fields like description get copied into a lower weighted field. These
fields are the backbone of our qf parameter, with other fields adding
additional boost.
- Play around with the tie parameter for edismax; we found that it makes
quite a big difference.
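
To make the weighted copy-field idea concrete, here is a sketch of the
request handler defaults (field names and weights are invented for
illustration):

  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">text_high^8 text_low^2</str>
    <str name="tie">0.3</str>
    <str name="boost">if(exists(image_original),1.2,1)</str>
  </lst>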

Hope this helps.

On Fri, Mar 18, 2016 at 6:19 AM, Alessandro Benedetti  wrote:

> In a relevancy problem I would repeat what my colleagues already pointed
> out :
> Data is key. We need to understand first of all our data before we can
> understand what is relevant and what is not.
> Once we specify a groundfloor which make sense ( and your basic approach +
> proper schema configuration as suggested + properly configured request
> handler , seems a good start to me ) .
>
> At this point if you are still not happy with the relevancy (i.e. you are
> not happy with the different boosts you assigned ) my strongest suggestion
> at this time is to move to machine learning.
> You need a good amount of data to feed the learner and make it your Super
> Business Expert) .
> I have been recently working with the Learn To Rank Bloomberg Plugin [1] .
> In  my opinion will be key for all the business that have many features in
> the game, that can help to evaluate a proper ranking.
> For that you need to be able to collect and process signals, and you need
> to carefully tune the features of your interest.
> But the results could be surprising .
>
> [1] https://issues.apache.org/jira/browse/SOLR-8542
> [2] Learning to Rank in Solr 
>
> Cheers
>
> On Thu, Mar 17, 2016 at 10:15 AM, Robert Brown 
> wrote:
>
> > Thanks Scott and John,
> >
> > As luck would have it I've got a PhD graduate coming for an interview
> > today, who just happened to do her research thesis on information
> retrieval
> > with quantum theory and machine learning  :)
> >
> > John, it sounds like you're describing my system!  Shopping products from
> > multiple sources.  (De-duplication is going to be fun soon).
> >
> > I already copy fields like merchant, brand, category, to string fields to
> > use them as facets/filters.  I was contemplating removing the description
> > due to the spammy issue you mentioned, I didn't know about the
> > RemoveDuplicatesTokenFilterFactory, so I'm sure that's going to be a huge
> > help.
> >
> > Thanks a lot,
> > Rob
> >
> >
> >
> > On 03/17/2016 10:01 AM, John Smith wrote:
> >
> >> Hi,
> >>
> >> For once I might be of some help: I've had a similar configuration
> >> (large set of products from various sources). It's very difficult to
> >> find the right balance between all parameters and requires a lot of
> >> tweaking, most often in the dark unfortunately.
> >>
> >> What I've found is that omitNorms=true is a real breakthrough: without
> >> it results tend to favor small texts, which is not what's wanted for
> >> product names. I also added a RemoveDuplicatesTokenFilterFactory for the
> >> name as it's a common practice for spammers to repeat some key words in
> >> order to be better placed in results. Stemming and custom stop words
> >> (e.g. "cheap", "sale", ...) are other potential ideas.
> >>
> >> I've also ended up in removing the description field as it's often too
> >> broad, and name is now the only field left: brand, category and merchant
> >> (as well as other fields) are offered as additional filters using
> >> facets. Note that you'd have to re-index them as plain strings.
> >>
> >> It's more difficult to achieve but popularity boost can also be useful:
> >> you can measure it by sales or by number of clicks. I use a combination
> >> of both, and store those values using partial updates.
> >>
> >> Hope it helps,
> >> John
> >>
> >>
> >> On 17/03/16 09:36, Robert Brown wrote:
> >>
> >>> Hi,
> >>>
> >>> I currently have an index of ~50m docs representing shopping products:
> >>> name, description, brand, category, etc.
> >>>
> >>> Our "qf" is currently setup as:
> >>>
> >>> name^5
> >>> brand^2
> >>> category^3
> >>> merchant^2
> >>> description^1
> >>>
> >>> mm: 100%
> >>> ps: 5
> >>>
> >>> I'm getting complaints from the business concerning relevancy, and was
> >>> hoping to get some constructive ideas/thoughts on whether these boosts
> >>> look semi-sensible or not, I think they were put in place pretty much
> >>> at random.
> >>>
> >>> I know it's going to be a case of rounds upon rounds of testing, but
> >>> maybe there's a 

Re: Boosts for relevancy (shopping products)

2016-03-18 Thread Nick Vasilyev
Tie does quite a bit. Without it, only the highest-weighted field that
matches the term is included in the relevance score; tie lets you include
the other matching fields as well.
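
Concretely, the dismax scoring for a term is

  score = max(per-field scores) + tie * sum(scores of the other matching fields)

so with invented numbers, if name scores 2.0 and description scores 0.5,
tie=0 gives 2.0, tie=0.5 gives 2.25, and tie=1 gives the straight sum, 2.5.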
On Mar 18, 2016 10:40 AM, "Robert Brown" <r...@intelcompute.com> wrote:

> Thanks for the added input.
>
> I'll certainly look into the machine learning aspect, will be good to put
> some basic knowledge I have into practice.
>
> I'd been led to believe the tie parameter didn't actually do a lot. :-/
>
>
>
> On 03/18/2016 12:07 PM, Nick Vasilyev wrote:
>
>> I work with a similar catalog; except our data is especially bad.  We've
>> found that several things helped:
>>
>> - Item level grouping (group same item sold by multiple vendors). Rank
>> items with more vendors a bit higher.
>> - Include a boost function for other attributes, such as an original image
>> of the product
>> - Rank items a bit higher if they have data from an external catalog like
>> IceCat
>> - For relevance and performance, we have several fields that we copy data
>> into. High value fields get copied into a high weighted field, while lower
>> value fields like description get copied into a lower weighted field.
>> These
>> fields are the backbone of our qf parameter, with other fields adding
>> additional boost.
>> - Play around with the tie parameter for edismax, we found that it makes
>> quite a big difference.
>>
>> Hope this helps.
>>
>> On Fri, Mar 18, 2016 at 6:19 AM, Alessandro Benedetti <
>> abenede...@apache.org
>>
>>> wrote:
>>> In a relevancy problem I would repeat what my colleagues already pointed
>>> out :
>>> Data is key. We need to understand first of all our data before we can
>>> understand what is relevant and what is not.
>>> Once we specify a groundfloor which make sense ( and your basic approach
>>> +
>>> proper schema configuration as suggested + properly configured request
>>> handler , seems a good start to me ) .
>>>
>>> At this point if you are still not happy with the relevancy (i.e. you are
>>> not happy with the different boosts you assigned ) my strongest
>>> suggestion
>>> at this time is to move to machine learning.
>>> You need a good amount of data to feed the learner and make it your Super
>>> Business Expert) .
>>> I have been recently working with the Learn To Rank Bloomberg Plugin [1]
>>> .
>>> In  my opinion will be key for all the business that have many features
>>> in
>>> the game, that can help to evaluate a proper ranking.
>>> For that you need to be able to collect and process signals, and you need
>>> to carefully tune the features of your interest.
>>> But the results could be surprising .
>>>
>>> [1] https://issues.apache.org/jira/browse/SOLR-8542
>>> [2] Learning to Rank in Solr <
>>> https://www.youtube.com/watch?v=M7BKwJoh96s>
>>>
>>> Cheers
>>>
>>> On Thu, Mar 17, 2016 at 10:15 AM, Robert Brown <r...@intelcompute.com>
>>> wrote:
>>>
>>> Thanks Scott and John,
>>>>
>>>> As luck would have it I've got a PhD graduate coming for an interview
>>>> today, who just happened to do her research thesis on information
>>>>
>>> retrieval
>>>
>>>> with quantum theory and machine learning  :)
>>>>
>>>> John, it sounds like you're describing my system!  Shopping products
>>>> from
>>>> multiple sources.  (De-duplication is going to be fun soon).
>>>>
>>>> I already copy fields like merchant, brand, category, to string fields
>>>> to
>>>> use them as facets/filters.  I was contemplating removing the
>>>> description
>>>> due to the spammy issue you mentioned, I didn't know about the
>>>> RemoveDuplicatesTokenFilterFactory, so I'm sure that's going to be a
>>>> huge
>>>> help.
>>>>
>>>> Thanks a lot,
>>>> Rob
>>>>
>>>>
>>>>
>>>> On 03/17/2016 10:01 AM, John Smith wrote:
>>>>
>>>> Hi,
>>>>>
>>>>> For once I might be of some help: I've had a similar configuration
>>>>> (large set of products from various sources). It's very difficult to
>>>>> find the right balance between all parameters and requires a lot of
>>>>> tweaking, most often in the dark unfortunately.
>>>>>
>>>>> What

Re: Inconsistent Shard Usage for Distributed Queries

2016-03-15 Thread Nick Vasilyev
I had another collection I was running into this issue with, so I decided
to play around with it. This one had active indexing going on, so I was
able to confirm how the counts get updated. Basically, it looks like
clicking the reload button will only send a commit to that one core; it
will not be propagated to other shards or to the same shard on the other
replica. A full commit (update?commit=true&openSearcher=true) works fine. I
know that the reload button was not intended to issue commits, but it's
quicker than typing out the command.
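
A minimal example of that collection-level commit (host and port taken from
the log lines in this thread; openSearcher=true is the default and is shown
only for clarity):

  curl 'http://192.168.1.211:9000/solr/products/update?commit=true&openSearcher=true'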

On Tue, Mar 15, 2016 at 12:24 PM, Nick Vasilyev <nick.vasily...@gmail.com>
wrote:

> Yea, the code sends actual commits, but I hate typing so usually just
> click the reload button unless it's production.
> On Mar 15, 2016 12:22 PM, "Erick Erickson" <erickerick...@gmail.com>
> wrote:
>
>> bq: Not sure what the issue was, in previous versions of Solr, clicking
>> reload
>> would send a commit to all replicas, right
>>
>> Reloading doesn't really have anything to do with commits. Reload
>> would certainly
>> cause a new searcher to be opened and thus would pick up any changes
>> that hat been hard-committed (openSearcher=false), but that's a complete
>> side-effect. Simply issuing a commit on the url to the _collection_ will
>> cause
>> commits to happen on all replicas, as:
>>
>> blah/solr/collection/update?commit=true
>>
>> Best,
>> Erick
>>
>> On Tue, Mar 15, 2016 at 9:11 AM, Nick Vasilyev <nick.vasily...@gmail.com>
>> wrote:
>> > I reloaded the collection and ran distrib=false query for several
>> shards on
>> > both replicas. The counts matched exactly.
>> >
>> > I then reloaded the second replica (through the UI) and now it seems
>> like
>> > it is working fine, I am getting consistent matches.
>> >
>> > Not sure what the issue was, in previous versions of Solr, clicking
>> reload
>> > would send a commit to all replicas, right? Is that still the case?
>> >
>> >
>> >
>> > On Tue, Mar 15, 2016 at 11:53 AM, Erick Erickson <
>> erickerick...@gmail.com>
>> > wrote:
>> >
>> >> This is very strange. What are the results you get when
>> >> you compare replicas in th e_same_ shard? It doesn't really
>> >> mean anything when you say
>> >> "shard1 has X docs, shard2 has Y docs". The only way
>> >> you should be getting different results from
>> >> the match all docs query is if different replicas within the
>> >> _same_ shard have different counts.
>> >>
>> >> And just as a sanity check, issue a commit. It's highly unlikely
>> >> that you have uncommitted changes, but it never hurts to try.
>> >>
>> >> All distributed queries should have a sub query sent to one
>> >> replica of each shard, is that what you're seeing? And I'd ping
>> >> the cores  directly rather than provide shards parameters,
>> >> something like:
>> >>
>> >> blha blah blah/products/query/shard1_core3/query?q=*:*. That
>> >> addresses the specific core rather than rely on any internal query
>> >> routing logic..
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Tue, Mar 15, 2016 at 8:43 AM, Nick Vasilyev <
>> nick.vasily...@gmail.com>
>> >> wrote:
>> >> > Hello,
>> >> >
>> >> > I have a brand new installation of Solr 5.4.1 and I am running into a
>> >> > strange problem with one of my collections. Collection *products*
>> has 5
>> >> > shards and replication factor of two. Both replicas are up and show
>> green
>> >> > status on the Cloud page in the UI.
>> >> >
>> >> > When I run a default search on the query page (q=*:*) I always get a
>> >> > different numFound although there is no active indexing and
>> everything is
>> >> > committed. I checked the logs and it looks like every time it runs a
>> >> > search, it is sent to different shards. Below, search1 went to shard
>> 5, 2
>> >> > and 4, search2 went to shard 5, 3, 1 and search 3 went to shard 3,
>> 4, 1,
>> >> 5.
>> >> >
>> >> > To confirm this, I ran a distrib=false query on shard 5 and got
>> >> 8,928,379
>> >> > items, 8,917,318 for shard 2, and 9,005,295 for shard 4. The results
>> from
>> >> > shard 2 distrib=false query did not match th

Re: Inconsistent Shard Usage for Distributed Queries

2016-03-15 Thread Nick Vasilyev
Yea, the code sends actual commits, but I hate typing so usually just click
the reload button unless it's production.
On Mar 15, 2016 12:22 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:

> bq: Not sure what the issue was, in previous versions of Solr, clicking
> reload
> would send a commit to all replicas, right
>
> Reloading doesn't really have anything to do with commits. Reload
> would certainly
> cause a new searcher to be opened and thus would pick up any changes
> that hat been hard-committed (openSearcher=false), but that's a complete
> side-effect. Simply issuing a commit on the url to the _collection_ will
> cause
> commits to happen on all replicas, as:
>
> blah/solr/collection/update?commit=true
>
> Best,
> Erick
>
> On Tue, Mar 15, 2016 at 9:11 AM, Nick Vasilyev <nick.vasily...@gmail.com>
> wrote:
> > I reloaded the collection and ran distrib=false query for several shards
> on
> > both replicas. The counts matched exactly.
> >
> > I then reloaded the second replica (through the UI) and now it seems like
> > it is working fine, I am getting consistent matches.
> >
> > Not sure what the issue was, in previous versions of Solr, clicking
> reload
> > would send a commit to all replicas, right? Is that still the case?
> >
> >
> >
> > On Tue, Mar 15, 2016 at 11:53 AM, Erick Erickson <
> erickerick...@gmail.com>
> > wrote:
> >
> >> This is very strange. What are the results you get when
> >> you compare replicas in th e_same_ shard? It doesn't really
> >> mean anything when you say
> >> "shard1 has X docs, shard2 has Y docs". The only way
> >> you should be getting different results from
> >> the match all docs query is if different replicas within the
> >> _same_ shard have different counts.
> >>
> >> And just as a sanity check, issue a commit. It's highly unlikely
> >> that you have uncommitted changes, but it never hurts to try.
> >>
> >> All distributed queries should have a sub query sent to one
> >> replica of each shard, is that what you're seeing? And I'd ping
> >> the cores  directly rather than provide shards parameters,
> >> something like:
> >>
> >> blha blah blah/products/query/shard1_core3/query?q=*:*. That
> >> addresses the specific core rather than rely on any internal query
> >> routing logic..
> >>
> >> Best,
> >> Erick
> >>
> >> On Tue, Mar 15, 2016 at 8:43 AM, Nick Vasilyev <
> nick.vasily...@gmail.com>
> >> wrote:
> >> > Hello,
> >> >
> >> > I have a brand new installation of Solr 5.4.1 and I am running into a
> >> > strange problem with one of my collections. Collection *products* has
> 5
> >> > shards and replication factor of two. Both replicas are up and show
> green
> >> > status on the Cloud page in the UI.
> >> >
> >> > When I run a default search on the query page (q=*:*) I always get a
> >> > different numFound although there is no active indexing and
> everything is
> >> > committed. I checked the logs and it looks like every time it runs a
> >> > search, it is sent to different shards. Below, search1 went to shard
> 5, 2
> >> > and 4, search2 went to shard 5, 3, 1 and search 3 went to shard 3, 4,
> 1,
> >> 5.
> >> >
> >> > To confirm this, I ran a distrib=false query on shard 5 and got
> >> 8,928,379
> >> > items, 8,917,318 for shard 2, and 9,005,295 for shard 4. The results
> from
> >> > shard 2 distrib=false query did not match the results that were in the
> >> > distributed query (from the logs). The query returned 8917318. Here is
> >> the
> >> > log entry for the query.
> >> >
> >> > 214467874 INFO  (qtp1013423070-21019) [c:products s:shard2
> r:core_node7
> >> > x:products_shard2_replica2] o.a.s.c.S.Request
> [products_shard2_replica2]
> >> > webapp=/solr path=/select
> >> > params={q=*:*&distrib=false&indent=true&wt=json&_=1458056340020}
> >> > hits=8917318 status=0 QTime=0
> >> >
> >> >
> >> > Here are the logs from other queries.
> >> >
> >> > Search 1 - numFound 18309764
> >> >
> >> > 213941984 INFO  (qtp1013423070-21046) [c:products s:shard5
> r:core_node4
> >> > x:products_shard5_replica2] o.a.s.c.S.Request
> [products_shard5_replica2]
> >> > webapp=/solr path=/select
> >> >
> >>
>

Re: Inconsistent Shard Usage for Distributed Queries

2016-03-15 Thread Nick Vasilyev
I reloaded the collection and ran a distrib=false query for several shards
on both replicas. The counts matched exactly.

I then reloaded the second replica (through the UI) and now it seems like
it is working fine; I am getting consistent matches.

Not sure what the issue was. In previous versions of Solr, clicking reload
would send a commit to all replicas, right? Is that still the case?



On Tue, Mar 15, 2016 at 11:53 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> This is very strange. What are the results you get when
> you compare replicas in th e_same_ shard? It doesn't really
> mean anything when you say
> "shard1 has X docs, shard2 has Y docs". The only way
> you should be getting different results from
> the match all docs query is if different replicas within the
> _same_ shard have different counts.
>
> And just as a sanity check, issue a commit. It's highly unlikely
> that you have uncommitted changes, but it never hurts to try.
>
> All distributed queries should have a sub query sent to one
> replica of each shard, is that what you're seeing? And I'd ping
> the cores  directly rather than provide shards parameters,
> something like:
>
> blha blah blah/products/query/shard1_core3/query?q=*:*. That
> addresses the specific core rather than rely on any internal query
> routing logic..
>
> Best,
> Erick
>
> On Tue, Mar 15, 2016 at 8:43 AM, Nick Vasilyev <nick.vasily...@gmail.com>
> wrote:
> > Hello,
> >
> > I have a brand new installation of Solr 5.4.1 and I am running into a
> > strange problem with one of my collections. Collection *products* has 5
> > shards and replication factor of two. Both replicas are up and show green
> > status on the Cloud page in the UI.
> >
> > When I run a default search on the query page (q=*:*) I always get a
> > different numFound although there is no active indexing and everything is
> > committed. I checked the logs and it looks like every time it runs a
> > search, it is sent to different shards. Below, search1 went to shard 5, 2
> > and 4, search2 went to shard 5, 3, 1 and search 3 went to shard 3, 4, 1,
> 5.
> >
> > To confirm this, I ran a distrib=false query on shard 5 and got
> 8,928,379
> > items, 8,917,318 for shard 2, and 9,005,295 for shard 4. The results from
> > shard 2 distrib=false query did not match the results that were in the
> > distributed query (from the logs). The query returned 8917318. Here is
> the
> > log entry for the query.
> >
> > 214467874 INFO  (qtp1013423070-21019) [c:products s:shard2 r:core_node7
> > x:products_shard2_replica2] o.a.s.c.S.Request [products_shard2_replica2]
> > webapp=/solr path=/select
> > params={q=*:*&distrib=false&indent=true&wt=json&_=1458056340020}
> > hits=8917318 status=0 QTime=0
> >
> >
> > Here are the logs from other queries.
> >
> > Search 1 - numFound 18309764
> >
> > 213941984 INFO  (qtp1013423070-21046) [c:products s:shard5 r:core_node4
> > x:products_shard5_replica2] o.a.s.c.S.Request [products_shard5_replica2]
> > webapp=/solr path=/select
> >
> params={df=text=false=id=score=4=0=true=
> >
> http://192.168.1.211:9000/solr/products_shard5_replica2/|http://192.168.1.212:9000/solr/products_shard5_replica1/=10=2=*:*=1458055805759=true=javabin&_=1458055814096
> }
> > hits=8928379 status=0 QTime=3
> > 213941985 INFO  (qtp1013423070-21028) [c:products s:shard4 r:core_node6
> > x:products_shard4_replica2] o.a.s.c.S.Request [products_shard4_replica2]
> > webapp=/solr path=/select
> >
> params={df=text=false=id=score=4=0=true=
> >
> http://192.168.1.212:9000/solr/products_shard4_replica1/|http://192.168.1.211:9000/solr/products_shard4_replica2/=10=2=*:*=1458055805759=true=javabin&_=1458055814096
> }
> > hits=9005295 status=0 QTime=3
> > 213942045 INFO  (qtp1013423070-21042) [c:products s:shard2 r:core_node7
> > x:products_shard2_replica2] o.a.s.c.S.Request [products_shard2_replica2]
> > webapp=/solr path=/select
> > params={q=*:*=true=json&_=1458055814096} hits=18309764 status=0
> > QTime=81
> >
> >
> > Search 2 - numFound 27072144
> > 213995779 INFO  (qtp1013423070-21046) [c:products s:shard5 r:core_node4
> > x:products_shard5_replica2] o.a.s.c.S.Request [products_shard5_replica2]
> > webapp=/solr path=/select
> >
> params={df=text=false=id=score=4=0=true=
> >
> http://192.168.1.211:9000/solr/products_shard5_replica2/|http://192.168.1.212:9000/solr/products_shard5_replica1/=10=2=*:*=1458055859563=true=javabin&_=1458055867894
> }
> > hits=8928379 status=0 QTime=1
> > 213995781 INFO  (qtp1013423070-20985) [c:products s:shard3 r:core_n

Inconsistent Shard Usage for Distributed Queries

2016-03-15 Thread Nick Vasilyev
Hello,

I have a brand new installation of Solr 5.4.1 and I am running into a
strange problem with one of my collections. Collection *products* has 5
shards and a replication factor of two. Both replicas are up and show green
status on the Cloud page in the UI.

When I run a default search on the query page (q=*:*) I always get a
different numFound although there is no active indexing and everything is
committed. I checked the logs and it looks like every time it runs a
search, it is sent to different shards. Below, search 1 went to shards 5, 2,
and 4; search 2 went to shards 5, 3, 1; and search 3 went to shards 3, 4, 1, 5.

To confirm this, I ran a distrib=false query on shard 5 and got 8,928,379
items, 8,917,318 for shard 2, and 9,005,295 for shard 4. The results from
the shard 2 distrib=false query did not match the results that were in the
distributed query (from the logs). The query returned 8,917,318. Here is the
log entry for the query.

214467874 INFO  (qtp1013423070-21019) [c:products s:shard2 r:core_node7
x:products_shard2_replica2] o.a.s.c.S.Request [products_shard2_replica2]
webapp=/solr path=/select
params={q=*:*&distrib=false&indent=true&wt=json&_=1458056340020}
hits=8917318 status=0 QTime=0


Here are the logs from other queries.

Search 1 - numFound 18309764

213941984 INFO  (qtp1013423070-21046) [c:products s:shard5 r:core_node4
x:products_shard5_replica2] o.a.s.c.S.Request [products_shard5_replica2]
webapp=/solr path=/select
params={df=text&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
http://192.168.1.211:9000/solr/products_shard5_replica2/|http://192.168.1.212:9000/solr/products_shard5_replica1/&rows=10&version=2&q=*:*&NOW=1458055805759&isShard=true&wt=javabin&_=1458055814096}
hits=8928379 status=0 QTime=3
213941985 INFO  (qtp1013423070-21028) [c:products s:shard4 r:core_node6
x:products_shard4_replica2] o.a.s.c.S.Request [products_shard4_replica2]
webapp=/solr path=/select
params={df=text&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
http://192.168.1.212:9000/solr/products_shard4_replica1/|http://192.168.1.211:9000/solr/products_shard4_replica2/&rows=10&version=2&q=*:*&NOW=1458055805759&isShard=true&wt=javabin&_=1458055814096}
hits=9005295 status=0 QTime=3
213942045 INFO  (qtp1013423070-21042) [c:products s:shard2 r:core_node7
x:products_shard2_replica2] o.a.s.c.S.Request [products_shard2_replica2]
webapp=/solr path=/select
params={q=*:*&indent=true&wt=json&_=1458055814096} hits=18309764 status=0
QTime=81


Search 2 - numFound 27072144
213995779 INFO  (qtp1013423070-21046) [c:products s:shard5 r:core_node4
x:products_shard5_replica2] o.a.s.c.S.Request [products_shard5_replica2]
webapp=/solr path=/select
params={df=text&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
http://192.168.1.211:9000/solr/products_shard5_replica2/|http://192.168.1.212:9000/solr/products_shard5_replica1/&rows=10&version=2&q=*:*&NOW=1458055859563&isShard=true&wt=javabin&_=1458055867894}
hits=8928379 status=0 QTime=1
213995781 INFO  (qtp1013423070-20985) [c:products s:shard3 r:core_node10
x:products_shard3_replica2] o.a.s.c.S.Request [products_shard3_replica2]
webapp=/solr path=/select
params={df=text&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
http://192.168.1.212:9000/solr/products_shard3_replica1/|http://192.168.1.211:9000/solr/products_shard3_replica2/&rows=10&version=2&q=*:*&NOW=1458055859563&isShard=true&wt=javabin&_=1458055867894}
hits=8980542 status=0 QTime=3
213995785 INFO  (qtp1013423070-21042) [c:products s:shard1 r:core_node9
x:products_shard1_replica2] o.a.s.c.S.Request [products_shard1_replica2]
webapp=/solr path=/select
params={df=text&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
http://192.168.1.212:9000/solr/products_shard1_replica1/|http://192.168.1.211:9000/solr/products_shard1_replica2/&rows=10&version=2&q=*:*&NOW=1458055859563&isShard=true&wt=javabin&_=1458055867894}
hits=8914801 status=0 QTime=3
213995798 INFO  (qtp1013423070-21028) [c:products s:shard2 r:core_node7
x:products_shard2_replica2] o.a.s.c.S.Request [products_shard2_replica2]
webapp=/solr path=/select
params={q=*:*&indent=true&wt=json&_=1458055867894} hits=27072144 status=0
QTime=30


Search 3 - numFound 35953734

214022457 INFO  (qtp1013423070-21019) [c:products s:shard3 r:core_node10
x:products_shard3_replica2] o.a.s.c.S.Request [products_shard3_replica2]
webapp=/solr path=/select
params={df=text&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
http://192.168.1.212:9000/solr/products_shard3_replica1/|http://192.168.1.211:9000/solr/products_shard3_replica2/&rows=10&version=2&q=*:*&NOW=1458055886247&isShard=true&wt=javabin&_=1458055894580}
hits=8980542 status=0 QTime=0
214022458 INFO  (qtp1013423070-21036) [c:products s:shard4 r:core_node6
x:products_shard4_replica2] o.a.s.c.S.Request [products_shard4_replica2]
webapp=/solr path=/select
params={df=text&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
http://192.168.1.212:9000/solr/products_shard4_replica1/|http://192.168.1.211:9000/solr/products_shard4_replica2/&rows=10&version=2&q=*:*&NOW=1458055886247&isShard=true&wt=javabin&_=1458055894580}
hits=9005295 status=0 QTime=1
214022459 INFO  (qtp1013423070-21046) [c:products s:shard1 r:core_node9
x:products_shard1_replica2] o.a.s.c.S.Request [products_shard1_replica2]
webapp=/solr path=/select
params={df=text&distrib=false&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=

Re: Solr Managed Schema by Default in 5.5

2016-03-11 Thread Nick Vasilyev
Got it.

Thank you for clarifying this; I was under the impression that I would only
be able to make changes via the API. I will look into this some more.

On Fri, Mar 11, 2016 at 11:51 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 3/11/2016 9:28 AM, Nick Vasilyev wrote:
> > Maybe I am missing something, if that is the case what is the difference
> > between data_driven_schema_configs and basic_configs? I thought that the
> > only difference was that the data_driven_schema_configs comes with the
> > managed schema and the basic_configs come with regular?
> >
> > Also, I haven't really dived into the schema less mode so far, I know
> > elastic uses it and it has been kind of a turn off for me. Can you
> provide
> > some guidance around best practices on how to use it?
>
> Schemaless mode is implemented with an update processor chain.  If you
> look in the data_driven_schema_configs solrconfig.xml file, you will
> find an updateRequestProcessorChain named
> "add-unknown-fields-to-the-schema".  This update chain is then enabled
> with an initParams config.
>
> I personally would not recommend using it.  It would be fine to use
> during prototyping, but I would definitely turn it off for production.
>
> > For example, now I have all of my configuration files in version control,
> > if I need to make a change, I upload a new schema to version control,
> then
> > the server pulls them down, uploads to zk and reloads collections. This
> is
> > almost fully automated and since all configuration is in a single file it
> > is easy to review and track previous changes. I like this process and it
> > works well; if I have to start using managed schemas; I would like some
> > feedback on how to implement it with minimal disruption to this.
>
> There's no reason you can't continue to use this method, even with the
> managed schema.  Editing the managed-schema is discouraged if you
> actually intend to use the Schema API, but there's nothing in place to
> prevent you from doing it that way.
>
> > If I am sending all schema changes via the API, I would need to have
> still
> > have some file with the schema configuration, it would just be a
> different
> > format. I would then need to have some code to read it and send specific
> > items to Solr, right?  When I need to make a change, do I have to then
> make
> > this change individually and include that configuration as part of the
> > config file? Or should I be able to just send the entire schema in again?
>
> Using the Schema API changes the managed-schema file in place.  You
> wouldn't need to upload anything to zookeeper, the change would already
> be there -- but you'd have to take an extra step (retrieving from
> zookeeper) to make sure it's in version control.
>
> My recommendation is to just keep using version control as you have
> been, which you can do with either the Classic or Managed schema.  The
> filename for the schema would change with the managed version, but
> nothing else.
>
> Thanks,
> Shawn
>
>


Re: Solr Managed Schema by Default in 5.5

2016-03-11 Thread Nick Vasilyev
Hi Shawn,

Maybe I am missing something; if that is the case, what is the difference
between data_driven_schema_configs and basic_configs? I thought that the
only difference was that data_driven_schema_configs comes with the
managed schema and basic_configs comes with the regular one.

Also, I haven't really dived into schemaless mode so far; I know
Elasticsearch uses it and it has been kind of a turn-off for me. Can you
provide some guidance around best practices on how to use it?

For example, now I have all of my configuration files in version control,
if I need to make a change, I upload a new schema to version control, then
the server pulls them down, uploads to zk and reloads collections. This is
almost fully automated and since all configuration is in a single file it
is easy to review and track previous changes. I like this process and it
works well; if I have to start using managed schemas; I would like some
feedback on how to implement it with minimal disruption to this.

If I am sending all schema changes via the API, I would still need to have
some file with the schema configuration; it would just be in a different
format. I would then need to have some code to read it and send specific
items to Solr, right? When I need to make a change, do I then have to make
that change individually and include it as part of the config file? Or
should I be able to just send the entire schema in again?

Previously, when I tried to upload the entire schema again, I ran into
problems; for example, if there is already a copyField from field1 to
field2, resending the config would add another copyField directive, so
copying would occur twice and error out if the field is not multi-valued.
If future changes need to be made atomically and then included back into
this other config, it just introduces more room for error.
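
For what it's worth, a sketch of handling that copyField case through the
Schema API instead (collection name is invented; field names are from the
example above; assumes a version with the full Schema API):

  curl -X POST -H 'Content-type:application/json' \
    'http://localhost:8983/solr/mycollection/schema' --data-binary '{
      "delete-copy-field": { "source": "field1", "dest": "field2" },
      "add-copy-field":    { "source": "field1", "dest": "field2" }
    }'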

Also, with the classic schema, if I wanted to revert a change or delete a
field, I would simply remove it from the schema and re-upload. Now it looks
like I need to add additional functionality to whatever my new process will
be to delete fields, copy fields, etc.

I know the point of this is to be able to easily make a UI for these
changes, but UI changes are hard to automate and version control. Please
let me know if I am missing something.

On Fri, Mar 11, 2016 at 10:41 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 3/11/2016 7:01 AM, Nick Vasilyev wrote:
> > Is this now the default behavior for basic_configs? I would really like
> to
> > maintain an option to easily create collection with classic schema
> settings
> > without jumping through all of these hoops.
>
> Starting in 5.5, all examples now use the managed schema.
>
> https://issues.apache.org/jira/browse/SOLR-8131
>
> The classic schema factory still exists, and probably will exist for all
> 6.x versions, so you will not need to migrate any existing setup yet.
>
> I don't mind putting more emphasis on the new factory or using it by
> default.  I expect that eventually the classic factory will get
> deprecated.  When that happens, I would like to see an option to mimic
> the classic version, where making changes via API won't work.  One
> person has already come into the IRC channel and asked how they can
> disable schema editing.
>
> Although I don't have a problem with the managed schema, I still don't
> like schemaless mode, which requires the managed schema.  It looks like
> the basic_configs and sample_techproducts_configs examples have NOT
> enabled that feature.
>
> Thanks,
> Shawn
>
>


Solr Managed Schema by Default in 5.5

2016-03-11 Thread Nick Vasilyev
Hi,

I started playing around with Solr 5.5 and created a collection using the
following:

./solr create_collection -c test -p 9000 -replicationFactor 2 -d
basic_configs -shards 2

The collection was created fine; however, I see that although I specified
basic_configs, it was deployed in managed-schema mode.

I was able to follow the instructions here:
https://cwiki.apache.org/confluence/display/solr/Managed+Schema+Definition+in+SolrConfig

to get it back to the classic schema, which required me to modify
solrconfig.xml and remove the managed-schema file from ZooKeeper manually.
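
For anyone else doing this, the solrconfig.xml line in question is the
classic factory:

  <schemaFactory class="ClassicIndexSchemaFactory"/>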

I checked the basic_configs configuration files for Solr 5.5, and it looks
like the managed schema is now the default, whereas Solr 5.4 still defaults
to the classic schema.

Is this now the default behavior for basic_configs? I would really like to
keep an option to easily create a collection with classic schema settings
without jumping through all of these hoops.

Thanks