Re: Programmatic Basic Auth on CloudSolrClient

2021-03-04 Thread Tomás Fernández Löbbe
Ah, right, now I remember that something like this was possible with the
"http1" version of the clients, which is why I created the Jira issues for
the http2 ones. Maybe you can even skip the "LBHttpSolrClient" step, I
believe you can just pass the HttpClient to the CloudSolrClient. You will
have to make sure to close all the clients that are created externally
once you're done, since the Solr client won't close them in this case.
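
For illustration, a minimal sketch of passing a pre-configured HttpClient
(with its own CredentialsProvider, as Mark describes below) straight to the
CloudSolrClient builder. This assumes Solr 8.x SolrJ with the Apache
HttpClient based builders; the ZooKeeper address, user and password are
placeholders:

import java.util.Collections;
import java.util.Optional;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.solr.client.solrj.impl.CloudSolrClient;

CloudSolrClient buildClient() {
  // Keep a reference to the credentials provider if you need to change the
  // credentials it returns later on.
  BasicCredentialsProvider credentialsProvider = new BasicCredentialsProvider();
  credentialsProvider.setCredentials(AuthScope.ANY,
      new UsernamePasswordCredentials("user", "pass"));

  // Externally created client: the caller is responsible for closing it,
  // CloudSolrClient.close() won't do that.
  CloseableHttpClient httpClient = HttpClientBuilder.create()
      .setDefaultCredentialsProvider(credentialsProvider)
      .build();

  return new CloudSolrClient.Builder(
          Collections.singletonList("zkhost:2181"), Optional.empty())
      .withHttpClient(httpClient)
      .build();
}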

On Thu, Mar 4, 2021 at 1:22 PM Mark H. Wood  wrote:

> On Wed, Mar 03, 2021 at 10:34:50AM -0800, Tomás Fernández Löbbe wrote:
> > As far as I know the current OOTB options are system properties or
> > per-request (which would allow you to use different per collection, but
> > probably not ideal if you do different types of requests from different
> > parts of your code). A workaround (which I've used in the past) is to
> have
> > a custom client that overrides and sets the credentials in the "request"
> > method (you can put whatever logic there to identify which credentials to
> > use). I recently created
> https://issues.apache.org/jira/browse/SOLR-15154
> > and https://issues.apache.org/jira/browse/SOLR-15155 to try to address
> this
> > issue in future releases.
>
> I have not tried it, but could you not:
>
> 1. set up an HttpClient with an appropriate CredentialsProvider;
> 2. pass it to HttpSolrClient.Builder.withHttpClient();
> 3. pass that Builder to
> LBHttpSolrClient.Builder.withHttpSolrClientBuilder();
> 4. pass *that* Builder to
> CloudSolrClient.Builder.withLBHttpSolrClientBuilder();
>
> Now you have control of the CredentialsProvider and can have it return
> whatever credentials you wish, so long as you still have a reference
> to it.
>
> > On Wed, Mar 3, 2021 at 5:42 AM Subhajit Das 
> wrote:
> >
> > >
> > > Hi There,
> > >
> > > Is there any way to programmatically set basic authentication
> credential
> > > on CloudSolrClient?
> > >
> > > The only documentation available is to use a system property. This is not
> > > useful if two collections require two separate sets of credentials and
> > > they are accessed in parallel.
> > > Thanks in advance.
> > >
>
> --
> Mark H. Wood
> Lead Technology Analyst
>
> University Library
> Indiana University - Purdue University Indianapolis
> 755 W. Michigan Street
> Indianapolis, IN 46202
> 317-274-0749
> www.ulib.iupui.edu
>


Re: Programmatic Basic Auth on CloudSolrClient

2021-03-03 Thread Tomás Fernández Löbbe
Maybe something like this (I omitted a lot of things you'll have to do,
like passing zk or the list of hosts):

static class CustomCloudSolrClient extends CloudSolrClient {

  protected CustomCloudSolrClient(CustomCloudSolrClientBuilder builder) {
    super(builder);
  }

  @Override
  public NamedList<Object> request(SolrRequest request, String collection)
      throws SolrServerException, IOException {
    // your logic here to figure out which credentials to use...
    String user = "user";
    String pass = "pass";
    request.setBasicAuthCredentials(user, pass);
    return super.request(request, collection);
  }
}

static class CustomCloudSolrClientBuilder extends CloudSolrClient.Builder {

  @Override
  public CloudSolrClient build() {
    return new CustomCloudSolrClient(this);
  }
}

public static void main(String[] args) {
  CloudSolrClient c = new CustomCloudSolrClientBuilder().build();
  ...
}

Do consider that the "request" method is called once per request, so make sure
whatever logic you have there is not too expensive.
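
For completeness, one possible shape for the omitted pieces (a sketch only,
assuming Solr 8.x SolrJ, where CloudSolrClient.Builder takes the ZooKeeper
hosts in its constructor; the host string is a placeholder):

static class CustomCloudSolrClientBuilder extends CloudSolrClient.Builder {

  CustomCloudSolrClientBuilder(List<String> zkHosts, Optional<String> zkChroot) {
    super(zkHosts, zkChroot);
  }

  @Override
  public CloudSolrClient build() {
    return new CustomCloudSolrClient(this);
  }
}

CloudSolrClient c = new CustomCloudSolrClientBuilder(
    Collections.singletonList("zkhost:2181"), Optional.empty()).build();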

On Wed, Mar 3, 2021 at 10:48 AM Subhajit Das 
wrote:

> Hi Thomas,
>
> Thanks. Can you please also share a sample of code to configure the client
> with your workaround?
>
> From: Tomás Fernández Löbbe<mailto:tomasflo...@gmail.com>
> Sent: 04 March 2021 12:05 AM
> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> Subject: Re: Programmatic Basic Auth on CloudSolrClient
>
> As far as I know the current OOTB options are system properties or
> per-request (which would allow you to use different per collection, but
> probably not ideal if you do different types of requests from different
> parts of your code). A workaround (which I've used in the past) is to have
> a custom client that overrides and sets the credentials in the "request"
> method (you can put whatever logic there to identify which credentials to
> use). I recently created https://issues.apache.org/jira/browse/SOLR-15154
> and https://issues.apache.org/jira/browse/SOLR-15155 to try to address
> this
> issue in future releases.
>
> On Wed, Mar 3, 2021 at 5:42 AM Subhajit Das 
> wrote:
>
> >
> > Hi There,
> >
> > Is there any way to programmatically set basic authentication credential
> > on CloudSolrClient?
> >
> > The only documentation available is to use a system property. This is not
> > useful if two collections require two separate sets of credentials and
> > they are accessed in parallel.
> > Thanks in advance.
> >
>
>


Re: NPE in QueryComponent.mergeIds when using timeAllowed and sorting SOLR 8.7

2021-03-03 Thread Tomás Fernández Löbbe
Patch looks good to me. Since it's a bugfix it can be committed to the 8_8
branch and released in the next bugfix release, though I don't think it
should trigger one. In the meantime, if you can patch your environment and
confirm that it fixes your problem, that's a good comment to leave in
SOLR-14758.

On Mon, Mar 1, 2021 at 3:12 PM Phill Campbell 
wrote:

> Anyone?
>
> > On Feb 24, 2021, at 7:47 AM, Phill Campbell
>  wrote:
> >
> > Last week I switched to Solr 8.7 from a “special” build of Solr 6.6
> >
> > The system has a timeout set for querying. I am now seeing this bug.
> >
> > https://issues.apache.org/jira/browse/SOLR-14758 <
> https://issues.apache.org/jira/browse/SOLR-14758>
> >
> > Max Query Time goes from 1.6 seconds to 20 seconds and affects the
> entire system for about 2 minutes as reported in New Relic.
> >
> > null:java.lang.NullPointerException
> >   at
> org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:935)
> >   at
> org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:626)
> >   at
> org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:605)
> >   at
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:486)
> >   at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:214)
> >   at org.apache.solr.core.SolrCore.execute(SolrCore.java:2627)
> >
> >
> > Can this be fixed in a patch for Solr 8.8? I do not want to have to go
> back to Solr 6 and reindex the system, that takes 2 days using 180 EMR
> instances.
> >
> > Please advise. Thank you.
>
>


Re: Programmatic Basic Auth on CloudSolrClient

2021-03-03 Thread Tomás Fernández Löbbe
As far as I know the current OOTB options are system properties or
per-request (which would allow you to use different per collection, but
probably not ideal if you do different types of requests from different
parts of your code). A workaround (which I've used in the past) is to have
a custom client that overrides and sets the credentials in the "request"
method (you can put whatever logic there to identify which credentials to
use). I recently created https://issues.apache.org/jira/browse/SOLR-15154
and https://issues.apache.org/jira/browse/SOLR-15155 to try to address this
issue in future releases.

On Wed, Mar 3, 2021 at 5:42 AM Subhajit Das  wrote:

>
> Hi There,
>
> Is there any way to programmatically set basic authentication credential
> on CloudSolrClient?
>
> The only documentation available is to use system property. This is not
> useful if two collection required two separate set of credentials and they
> are parallelly accessed.
> Thanks in advance.
>


Re: How pull replica works

2021-01-06 Thread Tomás Fernández Löbbe
Hi Abhishek,
Pull replicas use the "/replication" endpoint to copy full segment
files (sections of the index) from the leader. It works in a similar way to
the legacy leader/follower replication. This talk[1] tries to explain the
different replica types and how they work.

HTH,

Tomás

[1] https://www.youtube.com/watch?v=C8C9GRTCSzY

On Tue, Jan 5, 2021 at 10:29 PM Abhishek Mishra 
wrote:

> I want to know how pull replica replicate from leader in real? Does
> internally admin API get data from the leader in form of batches?
>
> Regards,
> Abhishek
>


Re: [CVE-2020-13957] The checks added to unauthenticated configset uploads in Apache Solr can be circumvented

2020-10-13 Thread Tomás Fernández Löbbe
Thanks Bernd, I missed 6.6.6 because it's not marked as a released version
in Jira. 6.6.6 is also affected.

On Mon, Oct 12, 2020 at 11:47 PM Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:

> Good to know that Version 6.6.6 is not affected, so I am safe ;-)
>
> Regards
> Bernd
>
> Am 12.10.20 um 20:38 schrieb Tomas Fernandez Lobbe:
> > Severity: High
> >
> > Vendor: The Apache Software Foundation
> >
> > Versions Affected:
> > 6.6.0 to 6.6.5
> > 7.0.0 to 7.7.3
> > 8.0.0 to 8.6.2
> >
> > Description:
> > Solr prevents some features considered dangerous (which could be used for
> > remote code execution) from being configured in a ConfigSet that's uploaded
> > via the API without authentication/authorization. The checks in place to prevent
> > such features can be circumvented by using a combination of UPLOAD/CREATE
> > actions.
> >
> > Mitigation:
> > Any of the following are enough to prevent this vulnerability:
> > * Disable UPLOAD command in ConfigSets API if not used by setting the
> > system property: "configset.upload.enabled" to "false" [1]
> > * Use Authentication/Authorization and make sure unknown requests aren't
> > allowed [2]
> > * Upgrade to Solr 8.6.3 or greater.
> > * If upgrading is not an option, consider applying the patch in
> SOLR-14663
> > ([3])
> > * No Solr API, including the Admin UI, is designed to be exposed to
> > non-trusted parties. Tune your firewall so that only trusted computers
> and
> > people are allowed access
> >
> > Credit:
> > Tomás Fernández Löbbe, András Salamon
> >
> > References:
> > [1] https://lucene.apache.org/solr/guide/8_6/configsets-api.html
> > [2]
> >
> https://lucene.apache.org/solr/guide/8_6/authentication-and-authorization-plugins.html
> > [3] https://issues.apache.org/jira/browse/SOLR-14663
> > [4] https://issues.apache.org/jira/browse/SOLR-14925
> > [5] https://wiki.apache.org/solr/SolrSecurity
> >
>


Re: Updating configset

2020-09-11 Thread Tomás Fernández Löbbe
I created https://github.com/apache/lucene-solr/pull/1861

On Fri, Sep 11, 2020 at 11:43 AM Walter Underwood 
wrote:

> I wrote some Python to get the Zookeeper address from CLUSTERSTATUS, then
> use the Kazoo library to upload a configset. Then it goes back to the
> cluster and
> runs an async command to RELOAD.
>
> I really should open source that thing (in my copious free time).
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Sep 11, 2020, at 9:35 AM, Tomás Fernández Löbbe <
> tomasflo...@gmail.com> wrote:
> >
> > I was in the same situation recently. I think it would be nice to have
> > the configset UPLOAD command be able to override the existing configset
> > instead of just failing (with a parameter such as override=true or
> > something). We need to be careful with the trusted/untrusted flag there,
> > but that should be possible.
> >
> >> If we can’t modify the configset wholesale this way, is it possible to
> > create a new configset and swap the old collection to it?
> > You can create a new one and then call MODIFYCOLLECTION on the collection
> > that uses it:
> >
> https://lucene.apache.org/solr/guide/8_6/collection-management.html#modifycollection-parameters
> .
> > I've never used that though.
> >
> > On Fri, Sep 11, 2020 at 7:26 AM Carroll, Michael (ELS-PHI) <
> > m.carr...@elsevier.com> wrote:
> >
> >> Hello,
> >>
> >> I am running SolrCloud in Kubernetes with Solr version 8.5.2.
> >>
> >> Is it possible to update a configset being used by a collection using a
> >> SolrCloud API directly? I know that this is possible using the zkcli
> and a
> >> collection RELOAD. We essentially want to be able to checkout our
> configset
> >> from source control, and then replace everything in the active
> configset in
> >> SolrCloud (other than the schema.xml).
> >>
> >> We have a couple of custom plugins that use config files that reside in
> >> the configset, and we don’t want to have to rebuild the collection or
> >> access zookeeper directly if we don’t have to. If we can’t modify the
> >> configset wholesale this way, is it possible to create a new configset
> and
> >> swap the old collection to it?
> >>
> >> Best,
> >> Michael Carroll
> >>
>
>


Re: Updating configset

2020-09-11 Thread Tomás Fernández Löbbe
Right, the problem is that both bin/solr zk and ZkConfigManager require
"direct access" to ZooKeeper (you have to have ZooKeeper exposed). I
believe the original question was about how to achieve this without
exposing ZooKeeper.
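
For reference, a minimal sketch of that ZkConfigManager usage (assuming Solr
8.x SolrJ; the ZooKeeper address, local path and configset name are
placeholders). As noted above, this needs ZooKeeper to be reachable from the
client:

import java.nio.file.Paths;
import org.apache.solr.common.cloud.SolrZkClient;
import org.apache.solr.common.cloud.ZkConfigManager;

try (SolrZkClient zkClient = new SolrZkClient("zkhost:2181", 30000)) {
  // Uploads (and overwrites) the local conf directory as the "myconfig"
  // configset in ZooKeeper.
  new ZkConfigManager(zkClient)
      .uploadConfigDir(Paths.get("/path/to/conf"), "myconfig");
}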

On Fri, Sep 11, 2020 at 11:00 AM Andy C  wrote:

> Don't know if this is an option for you but the SolrJ Java Client library
> has support for uploading a config set. If the config set already exists it
> will overwrite it, and automatically RELOAD the dependent collection.
>
> See
>
> https://lucene.apache.org/solr/8_5_0/solr-solrj/org/apache/solr/common/cloud/ZkConfigManager.html
>
> On Fri, Sep 11, 2020 at 1:45 PM Jörn Franke  wrote:
>
> > I would go for the Solr rest api ... especially if you have a secured zk
> > (eg with Kerberos). Then you need to manage access for humans only in
> Solr
> > and not also in ZK.
> >
> > > Am 11.09.2020 um 19:41 schrieb Erick Erickson  >:
> > >
> > > Bin/solr zk upconfig...
> > > Bin/solr zk cp... For individual files.
> > >
> > > Not as convenient as a nice API, but might let you get by...
> > >
> > >> On Fri, Sep 11, 2020, 13:26 Houston Putman 
> > wrote:
> > >>
> > >> I completely agree, there should be a way to overwrite an existing
> > >> configSet.
> > >>
> > >> Looks like https://issues.apache.org/jira/browse/SOLR-10391 already
> > >> exists,
> > >> so the work could be tracked there.
> > >>
> > >> On Fri, Sep 11, 2020 at 12:36 PM Tomás Fernández Löbbe <
> > >> tomasflo...@gmail.com> wrote:
> > >>
> > >>> I was in the same situation recently. I think it would be nice to
> > >>> have the configset UPLOAD command be able to override the existing
> > >>> configset instead of just failing (with a parameter such as
> > >>> override=true or something). We need to be careful with the
> > >>> trusted/untrusted flag there, but that should be possible.
> > >>>
> > >>>> If we can’t modify the configset wholesale this way, is it possible
> to
> > >>> create a new configset and swap the old collection to it?
> > >>> You can create a new one and then call MODIFYCOLLECTION on the
> > collection
> > >>> that uses it:
> > >>>
> > >>>
> > >>
> >
> https://lucene.apache.org/solr/guide/8_6/collection-management.html#modifycollection-parameters
> > >>> .
> > >>> I've never used that though.
> > >>>
> > >>> On Fri, Sep 11, 2020 at 7:26 AM Carroll, Michael (ELS-PHI) <
> > >>> m.carr...@elsevier.com> wrote:
> > >>>
> > >>>> Hello,
> > >>>>
> > >>>> I am running SolrCloud in Kubernetes with Solr version 8.5.2.
> > >>>>
> > >>>> Is it possible to update a configset being used by a collection
> using
> > a
> > >>>> SolrCloud API directly? I know that this is possible using the zkcli
> > >> and
> > >>> a
> > >>>> collection RELOAD. We essentially want to be able to checkout our
> > >>> configset
> > >>>> from source control, and then replace everything in the active
> > >> configset
> > >>> in
> > >>>> SolrCloud (other than the schema.xml).
> > >>>>
> > >>>> We have a couple of custom plugins that use config files that reside
> > in
> > >>>> the configset, and we don’t want to have to rebuild the collection
> or
> > >>>> access zookeeper directly if we don’t have to. If we can’t modify
> the
> > >>>> configset wholesale this way, is it possible to create a new
> configset
> > >>> and
> > >>>> swap the old collection to it?
> > >>>>
> > >>>> Best,
> > >>>> Michael Carroll
> > >>>>
> > >>>
> > >>
> >
>


Re: Updating configset

2020-09-11 Thread Tomás Fernández Löbbe
I was in the same situation recently. I think it would be nice to have the
configset UPLOAD command be able to override the existing configset instead
of just failing (with a parameter such as override=true or something). We need
to be careful with the trusted/untrusted flag there, but that should be
possible.

> If we can’t modify the configset wholesale this way, is it possible to
create a new configset and swap the old collection to it?
You can create a new one and then call MODIFYCOLLECTION on the collection
that uses it:
https://lucene.apache.org/solr/guide/8_6/collection-management.html#modifycollection-parameters.
I've never used that though.
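
For example, something along these lines (illustrative only, see the linked
page for the exact parameters):

/admin/collections?action=MODIFYCOLLECTION&collection=mycollection&collection.configName=mynewconfig

probably followed by a RELOAD of the collection so the new configset takes
effect.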

On Fri, Sep 11, 2020 at 7:26 AM Carroll, Michael (ELS-PHI) <
m.carr...@elsevier.com> wrote:

> Hello,
>
> I am running SolrCloud in Kubernetes with Solr version 8.5.2.
>
> Is it possible to update a configset being used by a collection using a
> SolrCloud API directly? I know that this is possible using the zkcli and a
> collection RELOAD. We essentially want to be able to checkout our configset
> from source control, and then replace everything in the active configset in
> SolrCloud (other than the schema.xml).
>
> We have a couple of custom plugins that use config files that reside in
> the configset, and we don’t want to have to rebuild the collection or
> access zookeeper directly if we don’t have to. If we can’t modify the
> configset wholesale this way, is it possible to create a new configset and
> swap the old collection to it?
>
> Best,
> Michael Carroll
>


Re: Pull Replica complaints about UpdateLog being disabled when DocBasedVersionConstraintsProcessorFactory

2020-08-05 Thread Tomás Fernández Löbbe
This is an interesting bug. I’m wondering if we can completely skip the
initialization of UpdateRequestProcessorFactories in PULL replicas...

On Wed, Aug 5, 2020 at 8:40 AM Erick Erickson 
wrote:

> Offhand, this looks like a bug, please raise a JIRA.
>
> You said: " We also have DocBasedVersionConstraintsProcessorFactory in our
> UpdateProcessorChain for optimistic Concurrency.”
>
> Optimistic concurrency is automatically enforced on the _version_ field.
> The intent of this processor factory is to allow you finer control over
> optimistic concurrency by explicitly defining/populating fields. I do
> wonder whether you need this factory at all. If the intent is that any
> document with the same uniqueKey is updated with optimistic concurrency,
> you don’t need it at all.
>
> Best,
> Erick
>
> > On Aug 4, 2020, at 2:17 PM, harjags
>  wrote:
> >
> > DocBasedVersionConstraintsProcessorFactory
>
>


Re: Multiple fq vs combined fq performance

2020-07-10 Thread Tomás Fernández Löbbe
All non-cached filters will be executed together (leapfrog between them)
and will be sorted by the filter cost (I guess that, since you aren't
setting a cost, the order of the input matters). You can try setting
a cost in your filters (lower than 100, so that they don't become post
filters).
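
For example (cost values purely illustrative), something like:

fq={!cache=false cost=1}taggedTickets_ticketId:100241
fq={!cache=false cost=20}_class:taggedTickets
fq={!cache=false cost=50}companyId:22476

gives the most selective filter (the ticketId, which matches a single
document) the lowest cost, while keeping all of them below 100 so none of
them turn into post filters.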

One other thing though, I guess you are using Point fields? If you
typically query for a single value like in this example (vs. ranges), you
may want to use string fields for those. See
https://issues.apache.org/jira/browse/SOLR-11078.




On Fri, Jul 10, 2020 at 7:51 AM Chris Dempsey  wrote:

> Thanks for the suggestion, Alex. It doesn't appear that
> IndexOrDocValuesQuery (at least in Solr 7.7.1) supports the PostFilter
> interface. I've tried various values for cost on each of the fq and it
> doesn't change the QTime.
>
> So, after digging around a bit even though
> {!cache=false}taggedTickets_ticketId:100241 only matches one and only
> one document in the collection that doesn't matter for the other two fq who
> continue to look over the index of the collection, correct?
>
> On Thu, Jul 9, 2020 at 4:24 PM Alexandre Rafalovitch 
> wrote:
>
> > I _think_ it will run all 3 and then do index hopping. But if you know
> one
> > fq is super expensive, you could assign it a cost
> > Value over 100 will try to use PostFilter then and apply the query on top
> > of results from other queries.
> >
> >
> >
> >
> https://lucene.apache.org/solr/guide/8_4/common-query-parameters.html#cache-parameter
> >
> > Hope it helps,
> > Alex.
> >
> > On Thu., Jul. 9, 2020, 2:05 p.m. Chris Dempsey, 
> wrote:
> >
> > > Hi all! In a collection where we have ~54 million documents we've
> noticed
> > > running a query with the following:
> > >
> > > "fq":["{!cache=false}_class:taggedTickets",
> > >   "{!cache=false}taggedTickets_ticketId:100241",
> > >   "{!cache=false}companyId:22476"]
> > >
> > > when I debugQuery I see:
> > >
> > > "parsed_filter_queries":[
> > >   "{!cache=false}_class:taggedTickets",
> > >
>  "{!cache=false}IndexOrDocValuesQuery(taggedTickets_ticketId:[100241
> > > TO 100241])",
> > >   "{!cache=false}IndexOrDocValuesQuery(companyId:[22476 TO 22476])"
> > > ]
> > >
> > > runs in roughly ~450ms but if we remove `{!cache=false}companyId:22476`
> > it
> > > drops down to ~5ms (it's important to note that
> `taggedTickets_ticketId`
> > is
> > > globally unique).
> > >
> > > If we change the fqs to:
> > >
> > > "fq":["{!cache=false}_class:taggedTickets",
> > >   "{!cache=false}+companyId:22476
> > +taggedTickets_ticketId:100241"]
> > >
> > > when I debugQuery I see:
> > >
> > > "parsed_filter_queries":[
> > >"{!cache=false}_class:taggedTickets",
> > >"{!cache=false}+IndexOrDocValuesQuery(companyId:[22476 TO 22476])
> > > +IndexOrDocValuesQuery(taggedTickets_ticketId:[100241 TO
> > 100241])"
> > > ]
> > >
> > > we get the correct result back in ~5ms.
> > >
> > > My current thought is that in the slow scenario Solr is still running
> > > `{!cache=false}IndexOrDocValuesQuery(companyId:[22476
> > > TO 22476])` even though it "has the answer" from the first two fq.
> > >
> > > Am I off-base or misunderstanding how `fq` are processed?
> > >
> >
>


Re: [EXTERNAL] Getting rid of Master/Slave nomenclature in Solr

2020-06-23 Thread Tomás Fernández Löbbe
I agree in general with what Trey and Jan said and have suggested. I
personally like to use "leader/follower". It's true that it somewhat collides
with SolrCloud terminology, but that's not a problem IMO, now that replica
types exist, the “role” of the replica (leader vs. non-leader/follower)
doesn’t specify the internals of how they behave, the replica type defines
that. So, in a non-SolrCloud world, they would still be leader/followers
regardless of how they perform that role.

I also agree that the name of the role is not that important, more the
"mode" of the architecture needs to be renamed. We tend to refer to
"SolrCloud mode" and "Master/Slave mode", the main part in all this (IMO)
is to change that "mode" name. I kind of like Trey's suggestion of "Managed
Clustering" vs. "Manual Clustering" Mode (Or "managed" vs "manual"), but
still haven't made up my mind (especially the fact that "manual" usually
doesn't really mean "manual", is just "you build your tools”)…

On Fri, Jun 19, 2020 at 1:38 PM Walter Underwood 
wrote:

> > On Jun 19, 2020, at 7:48 AM, Phill Campbell
>  wrote:
> >
> > Delegator - Handler
> >
> > A common pattern we are all aware of. Pretty simple.
>
> The Solr master does not delegate and the slave does not handle.
> The master is a server that handles replication requests from the
> slave.
>
> Delegator/handler is a common pattern, but it is not the pattern
> that describes traditional Solr replication.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>


Re: Unbalanced shard requests

2020-05-16 Thread Tomás Fernández Löbbe
I just backported Michael’s fix to be released in 8.5.2

On Fri, May 15, 2020 at 6:38 AM Michael Gibney 
wrote:

> Hi Wei,
> SOLR-14471 has been merged, so this issue should be fixed in 8.6.
> Thanks for reporting the problem!
> Michael
>
> On Mon, May 11, 2020 at 7:51 PM Wei  wrote:
> >
> > Thanks Michael!  Yes in each shard I have 10 Tlog replicas,  no other
> type
> > of replicas, and each Tlog replica is an individual solr instance on its
> > own physical machine.  In the jira you mentioned 'when "last place
> matches"
> > == "first place matches" – e.g. when shards.preference specified matches
> > *all* available replicas'.   My setting is
> > shards.preference=replica.location:local,replica.type:TLOG,
> > I also tried just shards.preference=replica.location:local and it still
> has
> > the issue. Can you explain a bit more?
> >
> > On Mon, May 11, 2020 at 12:26 PM Michael Gibney <
> mich...@michaelgibney.net>
> > wrote:
> >
> > > FYI: https://issues.apache.org/jira/browse/SOLR-14471
> > > Wei, assuming you have only TLOG replicas, your "last place" matches
> > > (to which the random fallback ordering would not be applied -- see
> > > above issue) would be the same as the "first place" matches selected
> > > for executing distributed requests.
> > >
> > >
> > > On Mon, May 11, 2020 at 1:49 PM Michael Gibney
> > >  wrote:
> > > >
> > > > Wei, probably no need to answer my earlier questions; I think I see
> > > > the problem here, and believe it is indeed a bug, introduced in 8.3.
> > > > Will file an issue and submit a patch shortly.
> > > > Michael
> > > >
> > > > On Mon, May 11, 2020 at 12:49 PM Michael Gibney
> > > >  wrote:
> > > > >
> > > > > Hi Wei,
> > > > >
> > > > > In considering this problem, I'm stumbling a bit on terminology
> > > > > (particularly, where you mention "nodes", I think you're referring
> to
> > > > > "replicas"?). Could you confirm that you have 10 TLOG replicas per
> > > > > shard, for each of 6 shards? How many *nodes* (i.e., running solr
> > > > > server instances) do you have, and what is the replica placement
> like
> > > > > across those nodes? What, if any, non-TLOG replicas do you have per
> > > > > shard (not that it's necessarily relevant, but just to get a
> complete
> > > > > picture of the situation)?
> > > > >
> > > > > If you're able without too much trouble, can you determine what the
> > > > > behavior is like on Solr 8.3? (there were different changes
> introduced
> > > > > to potentially relevant code in 8.3 and 8.4, and knowing whether
> the
> > > > > behavior you're observing manifests on 8.3 would help narrow down
> > > > > where to look for an explanation).
> > > > >
> > > > > Michael
> > > > >
> > > > > On Fri, May 8, 2020 at 7:34 PM Wei  wrote:
> > > > > >
> > > > > > Update:  after I remove the shards.preference parameter from
> > > > > > solrconfig.xml,  issue is gone and internal shard requests are
> now
> > > > > > balanced. The same parameter works fine with solr 7.6.  Still not
> > > sure of
> > > > > > the root cause, but I observed a strange coincidence: the nodes
> that
> > > are
> > > > > > most frequently picked for shard requests are the first node in
> each
> > > shard
> > > > > > returned from the CLUSTERSTATUS api.  Seems something wrong with
> > > shuffling
> > > > > > equally compared nodes when shards.preference is set.  Will
> report
> > > back if
> > > > > > I find more.
> > > > > >
> > > > > > On Mon, Apr 27, 2020 at 5:59 PM Wei  wrote:
> > > > > >
> > > > > > > Hi Eric,
> > > > > > >
> > > > > > > I am measuring the number of shard requests, and it's for query
> > > only, no
> > > > > > > indexing requests.  I have an external load balancer and see
> each
> > > node
> > > > > > > received about the equal number of external queries. However
> for
> > > the
> > > > > > > internal shard queries,  the distribution is uneven:6 nodes
> > > (one in
> > > > > > > each shard,  some of them are leaders and some are non-leaders
> )
> > > gets about
> > > > > > > 80% of the shard requests, the other 54 nodes gets about 20% of
> > > the shard
> > > > > > > requests.   I checked a few other parameters set:
> > > > > > >
> > > > > > > -Dsolr.disable.shardsWhitelist=true
> > > > > > > shards.preference=replica.location:local,replica.type:TLOG
> > > > > > >
> > > > > > > Nothing seems to cause the strange behavior.  Any suggestions
> how
> > > to
> > > > > > > debug this?
> > > > > > >
> > > > > > > -Wei
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <
> > > erickerick...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Wei:
> > > > > > >>
> > > > > > >> How are you measuring utilization here? The number of incoming
> > > requests
> > > > > > >> or CPU?
> > > > > > >>
> > > > > > >> The leader for each shard are certainly handling all of the
> > > indexing
> > > > > > >> requests since they’re TLOG replicas, so that’s one thing that
> > > might
> > > > > > >> skewing your measurements.
> 

Re: shard.preference for single shard queries

2019-12-05 Thread Tomás Fernández Löbbe
Look at SOLR-12217, it explains the limitation and has a patch for SolrJ
cases. Should be merged soon.

Note that the combination of replica types you are describing is not
recommended. See
https://lucene.apache.org/solr/guide/8_1/shards-and-indexing-data-in-solrcloud.html#combining-replica-types-in-a-cluster


On Thu, Dec 5, 2019 at 5:58 AM spanchal 
wrote:

> Hi all, Thanks to  SOLR-11982
>    we can now give solr
> parameter to sort replicas while giving results but ONLY for distributed
> queries as per documentation. May I know why this limitation?
>
> As my setup, I have 3 replicas(2 NRT, 1 PULL) of a single shard on 3
> different machines. Since NRT replicas might be busy with indexing, I would
> like my queries to land on PULL replica as a preferred option. And
> shard.preference=replica.type:PULL is not working in my case.
> Please help, thanks.
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Lucene optimization to disable hit count

2019-11-20 Thread Tomás Fernández Löbbe
Not yet:
https://issues.apache.org/jira/browse/SOLR-13289

On Wed, Nov 20, 2019 at 4:57 PM Wei  wrote:

> Hi,
>
> I see this lucene optimization to disable hit counts for better query
> performance:
>
> https://issues.apache.org/jira/browse/LUCENE-8060
>
> Is the feature available in Solr 8.3?
>
> Thanks,
> Wei
>


Re: fq pfloat_field:* returns no documents, tfloat:* does

2019-11-20 Thread Tomás Fernández Löbbe
Hi Webster,
> The fq  facet_melting_point:*
"Point" numeric fields don't support that syntax currently, and the way to
retrieve "docs with any value in field foo" is "foo:[* TO *]". See
https://issues.apache.org/jira/browse/SOLR-11746
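
So in your case the filter would be something like
facet_melting_point:[* TO *], which matches all documents that have some
value in that field.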


On Wed, Nov 20, 2019 at 2:21 PM Webster Homer <
webster.ho...@milliporesigma.com> wrote:

> The fq facet_melting_point:*
> returns 0 rows. However, the field clearly has data in it, so why doesn't
> this query return the rows where there is data?
>
> I am trying to update our solr schemas to use the point fields instead of
> the trie fields.
>
> We have a number of pfloat fields. These fields are indexed and I can
> facet on them
>
> This is a typical definition
> <field name="facet_melting_point" type="pfloat" indexed="true"
> stored="true" required="false" multiValued="true" docValues="true"/>
>
> Another odd behavior is that when I use the Schema Browser the "Load Term
> Info" loads no data.
>
> I am using Solr 7.2
>


Re: NPE during spell checking when result collapsing is activated and local parameters are used

2019-11-15 Thread Tomás Fernández Löbbe
Would you create a Jira issue anyway to fix the fact that it NPEs instead of
throwing a bad request?

On Fri, Nov 15, 2019 at 2:31 AM Stefan Walter  wrote:

> Indeed, you are right. Interestingly, it generally worked with the two {!
> ..} in the filter query - besides the problem with the collations, of
> course. Therefore I never questioned it...
>
> Thank you!
> Stefan
>
>
> Am 15. November 2019 um 00:01:52, Tomás Fernández Löbbe (
> tomasflo...@gmail.com) schrieb:
>
> I believe your syntax is incorrect. I believe local params must all be
> included in between the same {!...}, and "{!" can only be at the beginning
>
> have you tried:
>
fq={!collapse tag=collapser field=productId sort='merchantOrder asc,
> price asc, id asc'}
>
>
>
> On Thu, Nov 14, 2019 at 4:54 AM Stefan Walter  wrote:
>
> > Hi!
> >
> > I have an issue with Solr 7.3.1 in the spell checking component:
> >
> > java.lang.NullPointerException at
> >
> >
>
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1021)
>
> > at
> >
> >
>
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1081)
>
> > at
> >
> >
>
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:230)
>
> > at
> >
> >
>
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1602)
>
> > at
> >
> >
>
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1419)
>
> > at
> >
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:584)
> > ...
> >
> > I have found an issue that addresses a similiar problem:
> > https://issues.apache.org/jira/browse/SOLR-8807
> >
> > The fix, which was introduced with this issue seems to miss our
> situation,
> > though. The relevant part of the query is this:
> >
> > fq={!tag=collapser}{!collapse field=productId sort='merchantOrder asc,
> > price asc, id asc'}
> >
> > When I remove the local parameter {!tag=collapser} the collation works
> > fine. Looking at the diff of the commit of the issue mentioned above, it
> > seems that the "startsWith" could be the problem:
> >
> > + // Collate testing does not support the Collapse QParser (See
> > SOLR-8807)
> > + params.remove("expand");
> > + String[] filters = params.getParams(CommonParams.FQ);
> > + if (filters != null) {
> > + List<String> filtersToApply = new ArrayList<>(filters.length);
> > + for (String fq : filters) {
> > + if (!fq.startsWith("{!collapse")) {
> > + filtersToApply.add(fq);
> > + }
> > + }
> > + params.set("fq", filtersToApply.toArray(new
> > String[filtersToApply.size()]));
> > + }
> >
> > Can someone confirm this? I would open a bug ticket then. (Since the code
> > is unchanged in the latest version.)
> >
> > Thanks,
> > Stefan
> >
>


Re: NPE during spell checking when result collapsing is activated and local parameters are used

2019-11-14 Thread Tomás Fernández Löbbe
I believe your syntax is incorrect. I believe local params must all be
included in between the same {!...}, and "{!" can only be at the beginning

have you tried:

fq={!collapse tag=collapser field=productId sort='merchantOrder asc,
price asc, id asc'}



On Thu, Nov 14, 2019 at 4:54 AM Stefan Walter  wrote:

> Hi!
>
> I have an issue with Solr 7.3.1 in the spell checking component:
>
> java.lang.NullPointerException at
>
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1021)
> at
>
> org.apache.solr.search.CollapsingQParserPlugin$OrdFieldValueCollector.finish(CollapsingQParserPlugin.java:1081)
> at
>
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:230)
> at
>
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1602)
> at
>
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1419)
> at
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:584)
> ...
>
> I have found an issue that addresses a similiar problem:
> https://issues.apache.org/jira/browse/SOLR-8807
>
> The fix, which was introduced with this issue seems to miss our situation,
> though. The relevant part of the query is this:
>
> fq={!tag=collapser}{!collapse field=productId sort='merchantOrder asc,
> price asc, id asc'}
>
> When I remove the local parameter {!tag=collapser} the collation works
> fine. Looking at the diff of the commit of the issue mentioned above, it
> seems that the "startsWith" could be the problem:
>
> +// Collate testing does not support the Collapse QParser (See
> SOLR-8807)
> +params.remove("expand");
> +String[] filters = params.getParams(CommonParams.FQ);
> +if (filters != null) {
> +  List<String> filtersToApply = new ArrayList<>(filters.length);
> +  for (String fq : filters) {
> +if (!fq.startsWith("{!collapse")) {
> +  filtersToApply.add(fq);
> +}
> +  }
> +  params.set("fq", filtersToApply.toArray(new
> String[filtersToApply.size()]));
> +}
>
> Can someone confirm this? I would open a bug ticket then. (Since the code
> is unchanged in the latest version.)
>
> Thanks,
> Stefan
>


Re: Does Solr replicate data securely

2019-11-13 Thread Tomás Fernández Löbbe
Yes, if you are using TLS for running Solr, the replication will happen
using TLS

On Wed, Nov 13, 2019 at 2:45 PM Pushkar Raste 
wrote:

> Hi,
> Can someone help me with my question.
>
> On Tue, Nov 12, 2019 at 10:20 AM Pushkar Raste 
> wrote:
>
> > Hi,
> > How about in the master/slave set up. If I enable ssl in master/slave
> > setup would the segment and config files be copied using TLS.
> >
> > On Sat, Nov 9, 2019 at 3:31 PM Jan Høydahl 
> wrote:
> >
> >> You choose. If you use solr cloud and have enabled ssl in your cluster,
> >> then all requests including replication will be secure (https). This it
> is
> >> still tcp but using TLS :)
> >>
> >> Jan Høydahl
> >>
> >> > 6. nov. 2019 kl. 00:03 skrev Pushkar Raste :
> >> >
> >> > Hi,
> >> > When slaves/pull replicas copy index files from the master, is it done
> >> > using a secure protocol or just over plain TCP?
> >> > --
> >> > — Pushkar Raste
> >>
> > --
> — Pushkar Raste
>


Lucene/Solr swag

2019-09-10 Thread Tomás Fernández Löbbe
If you are interested, Apache Comdev team added Lucene and Solr items to
RedBubble:

Lucene:
https://www.redbubble.com/people/comdev/works/40953165-apache-lucene?asc=u

Solr:
https://www.redbubble.com/people/comdev/works/40952682-apache-solr?asc=u


Re: Mistake assert tips in FST builder ?

2019-04-18 Thread Tomás Fernández Löbbe
The Lucene list is probably better for this question. I'd try
java-u...@lucene.apache.org

On Mon, Apr 15, 2019 at 9:04 PM zhenyuan wei  wrote:

> Hi,
>    With the current newest version, 9.0.0-snapshot, in the
> Builder.UnCompileNode.addArc() function, I found this line:
>
> assert numArcs == 0 || label > arcs[numArcs-1].label: "arc[-1].label="
> + arcs[numArcs-1].label + " new label=" + label + " numArcs=" +
> numArcs;
>
> Maybe the assert message should be:
>
> assert numArcs == 0 || label > arcs[numArcs-1].label:
> "arc[numArc-1].label=" + arcs[numArcs-1].label + " new label=" + label
> + " numArcs=" + numArcs;
>
> Is it a personal code style, or a small mistake?
>
> Just curious about it.
>


Re: Intervals vs Span guidance

2019-03-26 Thread Tomás Fernández Löbbe
While solr-user is not a bad place to ask this question, I suspect you'll
get more answers in java-u...@lucene.apache.org since there is a lot going
on at the Lucene level right now.

On Tue, Mar 26, 2019 at 9:09 AM Ramsey Haddad (BLOOMBERG/ LONDON) <
rhadda...@bloomberg.net> wrote:

> We are building our needed customizations/extensions on Solr/Lucene 7.7 or
> 8.0 or later. We are unclear on whether/when to use Intervals vs Span.
>
> We know that Intervals is still maturing (new functionality in 8.0 and
> probably on-going for a while?)
>
> But what is the overall intention/guidance? "If you need X, then use
> Spans." "If you need Y, then use Intervals." "After the year 20xy, we
> expect everyone to be using Intervals." ??
>
> Any opinions valued.
>
> Thanks,
> Ramsey.


Re: cve-2017-

2019-02-28 Thread Tomás Fernández Löbbe
I updated the description of SOLR-12770 a bit. The problem
stated is that, since the "shards" parameter allows any URL, someone could
stated is that, since the "shards" parameter allows any URL, someone could
make an insecure Solr instance hit some other (secure) web endpoint. Solr
would throw an exception, but the error may include information from such
endpoint (parsing error). I don't believe this would allow access to a
local file (though, if you know of a way, please report to
secur...@lucene.apache.org)

The only way to know (to my knowledge) if your Solr instance was affected
is by looking at your Solr logs. If you log queries, you should be able to
see what's being included in the "shards" parameter and detect something
that's not looking right. Also, if Solr is fooled to hit some other
endpoint, it would fail with a parsing error, so you should probably see
exceptions in your logs. The worst case, I guess, depends on how much
access the Solr process has and how much damage it can cause to an adjacent
web endpoint via a GET request.

Note that this can only impact you if your Solr instance can be directly
accessed by untrusted sources.

HTH

On Thu, Feb 28, 2019 at 11:54 AM Jeff Courtade 
wrote:

> This particular cve came out in the mailing list. Fed 12th
>
>
> CVE-2017-3164 SSRF issue in Apache Solr
>
>  I need to know what the exploit for this could be?
>
>
> can a user send a bogus shards param via a web request and get a local
> file?
>
>
> What does an attack vector look like for this?
>
>
> I am being asked specifically this...
>
>
> -  How would we know if the vulnerability in the Solr CVE was
> taking advantage of? What are signs of us being exploited? What is the
> worst case scenario with this CVE?
>
> Could someone help me answer this please?
>
>
>
>
> http://mail-archives.apache.org/mod_mbox/www-announce/201902.mbox/%3CCAECwjAVjBN=wO5rYs6ktAX-5=-f5jdfwbbtsm2ttjebgo5j...@mail.gmail.com%3E
>
>
>
> the bug is
>
>
>
> https://issues.apache.org/jira/browse/SOLR-12770
>
>
>
> the mitigation is upgrading to solr 7.7
>


Re: Re: High CPU usage with Solr 7.7.0

2019-02-27 Thread Tomás Fernández Löbbe
Maybe a thread dump would be useful if you still have some instance running
on 7.7

On Wed, Feb 27, 2019 at 7:28 AM Lukas Weiss 
wrote:

> I can confirm this. Downgrading to 7.6.0 solved the issue.
> Thanks for the hint.
>
>
>
> Von:"Joe Obernberger" 
> An: solr-user@lucene.apache.org, "Lukas Weiss"
> ,
> Datum:  27.02.2019 15:59
> Betreff:Re: High CPU usage with Solr 7.7.0
>
>
>
> Just to add to this.  We upgraded to 7.7.0 and saw very large CPU usage
> on multi core boxes - sustained in the 1200% range.  We then switched to
> 7.6.0 (no other configuration changes) and the problem went away.
>
> We have a 40 node cluster and all 40 nodes had high CPU usage with 3
> indexes stored on HDFS.
>
> -Joe
>
> On 2/27/2019 5:04 AM, Lukas Weiss wrote:
> > Hello,
> >
> > we recently updated our Solr server from 6.6.5 to 7.7.0. Since then, we
> > have problems with the server's CPU usage.
> > We have two Solr cores configured, but even if we clear all indexes and do
> > not start the index process, we see 100% CPU usage for both cores.
> >
> > Here's what our top says:
> >
> > root@solr:~ # top
> > top - 09:25:24 up 17:40,  1 user,  load average: 2,28, 2,56, 2,68
> > Threads:  74 total,   3 running,  71 sleeping,   0 stopped,   0 zombie
> > %Cpu0  :100,0 us,  0,0 sy,  0,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,0 si,
> > 0,0 st
> > %Cpu1  :100,0 us,  0,0 sy,  0,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,0 si,
> > 0,0 st
> > %Cpu2  : 11,3 us,  1,0 sy,  0,0 ni, 86,7 id,  0,7 wa,  0,0 hi,  0,3 si,
> > 0,0 st
> > %Cpu3  :  3,0 us,  3,0 sy,  0,0 ni, 93,7 id,  0,3 wa,  0,0 hi,  0,0 si,
> > 0,0 st
> > KiB Mem :  8388608 total,  7859168 free,   496744 used,32696
> > buff/cache
> > KiB Swap:  2097152 total,  2097152 free,0 used.  7859168 avail
> Mem
> >
> >
> >   PID USER   PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                       P
> > 10209 solr  20   0 6138468 452520  25740 R 99,9  5,4  29:43.45 java
> > -server -Xms1024m -Xmx1024m -XX:NewRatio=3 -XX:SurvivorRatio=4
> > -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8
> > -XX:+UseConcMarkSweepGC -XX:ConcGCThreads=4 + 24
> > 10214 solr  20   0 6138468 452520  25740 R 99,9  5,4  28:42.91 java
> > -server -Xms1024m -Xmx1024m -XX:NewRatio=3 -XX:SurvivorRatio=4
> > -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8
> > -XX:+UseConcMarkSweepGC -XX:ConcGCThreads=4 + 25
> >
> > The solr server is installed on a Debian Stretch 9.8 (64bit) on Linux
> LXC
> > dedicated Container.
> >
> > Some more server info:
> >
> > root@solr:~ # java -version
> > openjdk version "1.8.0_181"
> > OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-2~deb9u1-b13)
> > OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
> >
> > root@solr:~ # free -m
> >               total        used        free      shared  buff/cache   available
> > Mem:           8192         484        7675         701          31        7675
> > Swap:          2048           0        2048
> >
> > We also found something strange if we do an strace of the main process,
> we
> > get lots of ongoing connection timeouts:
> >
> > root@solr:~ # strace -F -p 4136
> > strace: Process 4136 attached with 48 threads
> > strace: [ Process PID=11089 runs in x32 mode. ]
> > [pid  4937] epoll_wait(139,  
> > [pid  4936] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4909] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4618] epoll_wait(136,  
> > [pid  4576] futex(0x7ff61ce66474, FUTEX_WAIT_PRIVATE, 1, NULL
>  > ...>
> > [pid  4279] futex(0x7ff61ce62b34, FUTEX_WAIT_PRIVATE, 2203, NULL
> > 
> > [pid  4244] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4227] futex(0x7ff56c71ae14, FUTEX_WAIT_PRIVATE, 2237, NULL
> > 
> > [pid  4243] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4228] futex(0x7ff5608331a4, FUTEX_WAIT_PRIVATE, 2237, NULL
> > 
> > [pid  4208] futex(0x7ff61ce63e54, FUTEX_WAIT_PRIVATE, 5, NULL
>  > ...>
> > [pid  4205] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4204] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4196] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4195] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4194] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4193] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4187] restart_syscall(<... resuming interrupted restart_syscall
> ...>
> > 
> > [pid  4180] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4179] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4177] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4174] accept(133,  
> > [pid  4173] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4172] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4171] restart_syscall(<... resuming interrupted restart_syscall
> ...>

Re: Reporting security vulnerability in Solr

2019-02-20 Thread Tomás Fernández Löbbe
Hi Krzysztof,
There is some information on the past CVEs and dependency issues in
https://wiki.apache.org/solr/SolrSecurity. For reporting, creating a
private Jira is good, or following the guidelines here:
https://www.apache.org/security/ (email secur...@apache.org or
secur...@lucene.apache.org)

On Wed, Feb 20, 2019 at 9:16 AM Erick Erickson 
wrote:

> You did the right thing, but there will be no new versions of the 6x code
> line released. Meanwhile, the versions of jar files in the two JIRAs you
> created have been replaced with newer versions.
>
> You could get the source code and upgrade the jar files (see
> lucene/ivy-versions.properties) if you can’t upgrade to a newer Solr
> release.
>
> Best,
> Erick
>
> > On Feb 20, 2019, at 5:48 AM, Krzysztof Dębski 
> wrote:
> >
> > Hi,
> >
> > What is the right way to report a security vulnerability in Solr?
> >
> > A few days ago I created two issues:
> > https://issues.apache.org/jira/browse/SOLR-13250
> > https://issues.apache.org/jira/browse/SOLR-13251
> >
> > I chose Security Level: Private (Security Issue) and added "security"
> label.
> >
> > Do I need to do anything else to report a security issue?
> >
> > Regards,
> > Krzysztof
>
>


Re: Soft commit and new replica types

2018-12-14 Thread Tomás Fernández Löbbe
Yes, that would be great.

Thanks

On Fri, Dec 14, 2018 at 5:38 PM Edward Ribeiro 
wrote:

> Indeed! It clarified a lot, thank you. :) Now I know I messed with the
> reload core config, but the other aspects were more or less what I have
> been expecting.
>
> Do you think it's worth to submit a PR to the Reference Guide with those
> explanations? I can take a stab at it.
>
> Regards,
> Edward
>
> On Fri, Dec 14, 2018 at 3:08 AM Tomás Fernández Löbbe <
> tomasflo...@gmail.com>
> wrote:
>
> > > >
> > > > No, I am not seeing reloads.
> >
> > Ah, good.
> >
> >
> > > > I am trying to understand the interactions
> > > > between hard commit, soft commit, transaction log update with a TLOG
> > > > cluster for both leader and follower replicas. For example, after
> > getting
> > > > new segments from the leader the follower replica will still apply
> the
> > > > hard/soft commit?
> > >
> >
> > Think about the hard commit as a flush of the latest updates to a segment
> > plus checkpoint pointing to all the current valid segments. That
> checkpoint
> > is also a file. The soft commit is similar to the hard commit in the
> sense
> > that it creates a segment and a pointer to the valid segments, however,
> > those segments may not be flushed to disk yet, and the checkpoint is not
> on
> > a file. *In addition* to creating segments, the commits in Solr create
> > searchers to get the latest view of the index (hard-commits only when
> > openSearcher=true and soft-commits always), but that doesn't really
> matter
> > in the context of replication.
> >
> > The follower replica (a TLOG/PULL) will ask the leader for the last hard
> > commit and replicate all the segments and the file indicating the commit.
> > All the TLOG/PULL replica does after it replicates is open a searcher
> with
> > all the segments in that checkpoint. Two important notes here: 1) the
> > follower replica doesn't "perform" a commit, it copied it from the leader
> > and 2) this "open a searcher" is not a soft/hard commit, is just opening
> a
> > searcher (a "commit" usually involves creating segments).
> >
> > * If in the leader (a TLOG replica) you do a soft commit, it'll never
> make
> > it to the follower, because the follower only replicates the latest hard
> > commit (see ReplicationHandler.indexCommitPoint).
> > * If in the follower (a TLOG replica) you do a soft commit, it won't do
> any
> > difference, because in the TLOG case, documents are not added to the
> index
> > (only to the transaction log). (See UpdateCommand.IGNORE_INDEXWRITER
> flag)
> > * If in the follower (a PULL replica) you do a soft commit, it also
> > wouldn't do any difference, because it doesn't receive the documents
> anyway
> > (only replicates). Commit is skipped anyway (see
> > DistributedUpdateProcessor.processCommit)
> >
> > The transaction log is only used for recovery purposes (or realtime get).
> >
> > I hope that clarifies things.
> >
> > >
> > > > PS: congratulations on the Berlin Buzzwords' talk. :)
> > >
> > Thanks!
> >
> > > >
> > > > Thanks!
> > > >
> > > > On Mon, Dec 10, 2018 at 9:24 PM Tomás Fernández Löbbe
> > > > 
> > > > wrote:
> > > >
> > > > > I think this is a good point. The tricky part is that if TLOG
> > replicas
> > > > > don't replicate often, their transaction logs will get too big too,
> > so
> > > you
> > > > > want the replication interval of TLOG replicas to be tied to the
> > > > > auto(hard)Commit interval (by default at least). If you are using
> > them
> > > for
> > > > > search, you may also not want to open a searcher for each fetch...
> > for
> > > PULL
> > > > > replicas, maybe the best way is to use the autoSoftCommit interval
> to
> > > > > define the polling interval. That said, I'm not sure using
> different
> > > > > configurations is a good idea, some people may be mixing TLOG and
> > PULL
> > > > and
> > > > > querying them both alike.
> > > > >
> > > > > In the meantime, if you have different hosts for TLOG and PULL
> > > replicas,
> > > > > one workaround you can have is to define the autoCommit time with a
> > > > system
> > > > > property, and use different 

Re: Soft commit and new replica types

2018-12-13 Thread Tomás Fernández Löbbe
> >
> > No, I am not seeing reloads.

Ah, good.


> > I am trying to understand the interactions
> > between hard commit, soft commit, transaction log update with a TLOG
> > cluster for both leader and follower replicas. For example, after getting
> > new segments from the leader the follower replica will still apply the
> > hard/soft commit?
>

Think about the hard commit as a flush of the latest updates to a segment
plus checkpoint pointing to all the current valid segments. That checkpoint
is also a file. The soft commit is similar to the hard commit in the sense
that it creates a segment and a pointer to the valid segments, however,
those segments may not be flushed to disk yet, and the checkpoint is not on
a file. *In addition* to creating segments, the commits in Solr create
searchers to get the latest view of the index (hard-commits only when
openSearcher=true and soft-commits always), but that doesn't really matter
in the context of replication.

The follower replica (a TLOG/PULL) will ask the leader for the last hard
commit and replicate all the segments and the file indicating the commit.
All the TLOG/PULL replica does after it replicates is open a searcher with
all the segments in that checkpoint. Two important notes here: 1) the
follower replica doesn't "perform" a commit, it copies it from the leader,
and 2) this "open a searcher" is not a soft/hard commit, it is just opening a
searcher (a "commit" usually involves creating segments).

* If in the leader (a TLOG replica) you do a soft commit, it'll never make
it to the follower, because the follower only replicates the latest hard
commit (see ReplicationHandler.indexCommitPoint).
* If in the follower (a TLOG replica) you do a soft commit, it won't make any
difference, because in the TLOG case, documents are not added to the index
(only to the transaction log). (See the UpdateCommand.IGNORE_INDEXWRITER flag)
* If in the follower (a PULL replica) you do a soft commit, it also
won't make any difference, because it doesn't receive the documents anyway
(only replicates). Commit is skipped anyway (see
DistributedUpdateProcessor.processCommit)

The transaction log is only used for recovery purposes (or realtime get).

I hope that clarifies things.

>
> > PS: congratulations on the Berlin Buzzwords' talk. :)
>
Thanks!

> >
> > Thanks!
> >
> > On Mon, Dec 10, 2018 at 9:24 PM Tomás Fernández Löbbe
> > 
> > wrote:
> >
> > > I think this is a good point. The tricky part is that if TLOG replicas
> > > don't replicate often, their transaction logs will get too big too, so
> you
> > > want the replication interval of TLOG replicas to be tied to the
> > > auto(hard)Commit interval (by default at least). If you are using them
> for
> > > search, you may also not want to open a searcher for each fetch... for
> PULL
> > > replicas, maybe the best way is to use the autoSoftCommit interval to
> > > define the polling interval. That said, I'm not sure using different
> > > configurations is a good idea, some people may be mixing TLOG and PULL
> > and
> > > querying them both alike.
> > >
> > > In the meantime, if you have different hosts for TLOG and PULL
> replicas,
> > > one workaround you can have is to define the autoCommit time with a
> > system
> > > property, and use different properties for TLOGs vs PULL nodes.
> > >
> > > > There is no commit on TLOG/PULL  follower replicas, only on the
> leader.
> > > > Followers fetch the segments and **reload the core** every 150
> seconds
> > >
> > > Edward, "reload" shouldn't really happen in regular TLOG/PULL fetches.
> Are
> > > you seeing reloads?
> > >
> > > On Mon, Dec 10, 2018 at 4:41 PM Erick Erickson <
> erickerick...@gmail.com>
> > > wrote:
> > >
> > > > bq. but not every poll attempt they fetch new segment from the leader
> > > >
> > > > Ah, right. Ignore my comment. Commit will only occur on the followers
> > > > when there are new segments to pull down, so you're right, roughly
> > > > every second poll would find things to bring down and open a
> > > > new searcher.
> > > > On Sun, Dec 9, 2018 at 4:14 PM Edward Ribeiro
> > 
> > > > wrote:
> > > > >
> > > > > Hi Vadim,
> > > > >
> > > > > There is no commit on TLOG/PULL  follower replicas, only on the
> leader.
> > > > > Followers fetch the segments and **reload the core** every 150
> seconds
> > > > (if
> > > > > there were new segments, I suppose

Re: Soft commit and new replica types

2018-12-10 Thread Tomás Fernández Löbbe
I think this is a good point. The tricky part is that if TLOG replicas
don't replicate often, their transaction logs will get too big too, so you
want the replication interval of TLOG replicas to be tied to the
auto(hard)Commit interval (by default at least). If you are using them for
search, you may also not want to open a searcher for each fetch... for PULL
replicas, maybe the best way is to use the autoSoftCommit interval to
define the polling interval. That said, I'm not sure using different
configurations is a good idea, some people may be mixing TLOG and PULL and
querying them both alike.

In the meantime, if you have different hosts for TLOG and PULL replicas,
one workaround you can have is to define the autoCommit time with a system
property, and use different properties for TLOGs vs PULL nodes.
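
For example, a sketch only (the property name and the 300000 ms fallback below are illustrative, not a recommendation):

<!-- solrconfig.xml: read the hard commit interval from a system property -->
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

Then start the TLOG hosts with something like -Dsolr.autoCommit.maxTime=60000 and the PULL hosts with a different value; since the configset is shared across the collection, the per-node system property is what makes them differ.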

> There is no commit on TLOG/PULL  follower replicas, only on the leader.
> Followers fetch the segments and **reload the core** every 150 seconds

Edward, "reload" shouldn't really happen in regular TLOG/PULL fetches. Are
you seeing reloads?

On Mon, Dec 10, 2018 at 4:41 PM Erick Erickson 
wrote:

> bq. but not every poll attempt they fetch new segment from the leader
>
> Ah, right. Ignore my comment. Commit will only occur on the followers
> when there are new segments to pull down, so you're right, roughly
> every second poll would find things to bring down and open a
> new searcher.
> On Sun, Dec 9, 2018 at 4:14 PM Edward Ribeiro 
> wrote:
> >
> > Hi Vadim,
> >
> > There is no commit on TLOG/PULL  follower replicas, only on the leader.
> > Followers fetch the segments and **reload the core** every 150 seconds
> (if
> > there were new segments, I suppose). Yeah, followers don't pay the CPU
> > price of indexing, but there are still cache invalidation, autowarming,
> > etc., in addition to network and IO demand. Is that right, Erick?
> >
> > Besides that, Erick is pointing out that under a heavy indexing workload
> > you could either have:
> >
> > 1. Very large transaction logs;
> >
> > 2. Very large numbers of segments. If that is the case, you could have
> the
> > following scenario numerous times:
> >2.1. follower replica downloads segment A and B from leader;
> >2.2 leader merges segments A + B into C;
> >2.3. follower replicas discard A and B and download C on next poll;
> >
> > Under the second condition followers needlessly downloaded segments that
> > would eventually be merged.
> >
> > IMO, you should carefully evaluate if the use of TLOG/PULL is really
> > recommended for your cluster setup, plus indexing and querying workload.
> > You can very much stay with a NRT setup if it suits you better. The
> videos
> > below provide a nice set of hints for when to choose between NRT or some
> > combination of TLOG and PULL.
> >
> > https://youtu.be/XIb8X3MwVKc
> >
> > https://youtu.be/dkWy2ykzAv0
> >
> > https://youtu.be/XqfTjd9KDWU
> >
> > Regards,
> > Edward
> >
> > On Sun, Dec 9, 2018, 16:56,  wrote:
> >
> > >
> > >  If hard commit max time is 300 sec then commit happens every 300 sec
> on
> > > tlog leader. And new segments pop up on the leader every 300 sec,
> during
> > > indexing. The polling interval on other replicas is 150 sec, but they don't fetch
> > > a new segment from the leader on every poll attempt, AFAIU. Erick, do you mean
> > > that on all other tlog replicas (not leaders) a commit occurs on every poll?
> > > Sunday, December 9, 2018, 19:21 +03:00 from Erick Erickson
> > > erickerick...@gmail.com :
> > >
> > > >Not quite, 60. The polling interval is half the commit
> interval
> > > >
> > > >This has always bothered me a little bit, I wonder at the utility of a
> > > >config param. We already have old-style replication with a
> > > >configurable polling interval. Under very heavy indexing loads, it
> > > >seems to me that either the tlogs will grow quite large or we'll be
> > > >pulling a lot of unnecessary segments across the wire, segments
> > > >that'll soon be merged away and the merged segment re-pulled.
> > > >
> > > >Apparently, though, nobody's seen this "in the wild", so it's
> > > >theoretical at this point.
> > > >On Sun, Dec 9, 2018 at 1:48 AM Vadim Ivanov
> > > < vadim.iva...@spb.ntk-intourist.ru> wrote:
> > > >
> > > > Thanks, Edward, for clues.
> > > > What bothers me is newSearcher start, warming, cache clear... all
> that
> > > CPU consuming stuff in my heavy-indexing scenario.
> > > > With NRT I had autoSoftCommit:  30 .
> > > > So I had new Searcher no more than  every 5 min on every replica.
> > > > To have more or less  the same effect with TLOG - PULL collection,
> > > > I suppose, I have to have  :  30
> > > > (yes, I understand that newSearchers start asynchronously on leader
> and
> > > replicas)
> > > > Am I right?
> > > > --
> > > > Vadim
> > > >
> > > >
> > > >> -Original Message-
> > > >> From: Edward Ribeiro [mailto:edward.ribe...@gmail.com]
> > > >> Sent: Sunday, December 09, 2018 12:42 AM
> > > >> To:  

Re: TolerantUpdateProcessorFactory maxErrors=-1 issue

2018-09-21 Thread Tomás Fernández Löbbe
Hi Derek,
I suspect you need to move the TolerantUpdateProcessorFactory to the
beginning of the chain
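
For example, a sketch only (the chain name and the other processors here are placeholders, not the actual configuration from this thread):

<updateRequestProcessorChain name="tolerant-chain">
  <!-- tolerant processor first, so it can catch failures from everything that runs after it -->
  <processor class="solr.TolerantUpdateProcessorFactory">
    <int name="maxErrors">-1</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>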

On Thu, Sep 20, 2018 at 6:17 PM Derek Poh  wrote:

> Does anyone have any idea what could be the cause of this?
>
> On 19/9/2018 11:40 AM, Derek Poh wrote:
> > In addition, I tried with maxErrors=3 and with only 1 error document,
> > the indexing process still gets aborted.
> >
> > Could it be the way I defined the TolerantUpdateProcessorFactory in
> > solrconfig.xml?
> >
> > On 18/9/2018 3:13 PM, Derek Poh wrote:
> >> Hi
> >>
> >> I am using CSV-formatted index updates to index a tab-delimited file.
> >>
> >> I have defined "TolerantUpdateProcessorFactory" with "maxErrors=-1" in
> >> the solrconfig.xml to skip any document update error and proceed to
> >> update the remaining documents without failing.
> >> However, it does not seem to be working, as there is a document in the
> >> tab-delimited file with an additional number of fields and this caused
> >> the indexing to abort instead.
> >>
> >> This is how I start the indexing,
> >> curl -o /apps/search/logs/indexing.log
> >> "
> http://localhost:8983/solr/$collection/update?update.chain=$updateChainName=true=%09=^=$fieldnames$splitOptions;
>
> >> --data-binary "@/apps/search/feed/$csvFilePath/$csvFileName" -H
> >> 'Content-type:application/csv'
> >>
> >> This is how the TolerantUpdateProcessorFactory is defined in the
> >> solrconfig.xml,
> >> 
> >>   
> >> P_SupplierId
> >> P_TradeShowId
> >> P_ProductId
> >> id
> >>   
> >>   
> >> id
> >> 
> >>   
> >>   
> >>  -1
> >>   
> >>   
> >> 
> >> 
> >> 43200
> >> P_TradeShowOnlineEndDateUTC
> >>   
> >>   
> >>   
> >> 
> >>
> >> Solr version is 6.6.2.
> >>
> >> Derek
> >>


Re: 7.3 appears to leak

2018-09-05 Thread Tomás Fernández Löbbe
I created SOLR-12743 to track this.

On Mon, Jul 16, 2018 at 12:30 PM Markus Jelsma 
wrote:

> Hello Thomas,
>
> To be absolutely sure you suffer from the same problem as one of our
> collections, can you confirm that your Solr cores are leaking a
> SolrIndexSearcher instance on each commit? If not, there may be a second
> problem.
>
> Also, do you run any custom plugins or apply patches to your Solr
> instances? Or is your Solr a 100 % official build?
>
> Thanks,
> Markus
>
>
>
> -Original message-
> > From:Thomas Scheffler 
> > Sent: Monday 16th July 2018 13:39
> > To: solr-user@lucene.apache.org
> > Subject: Re: 7.3 appears to leak
> >
> > Hi,
> >
> > we noticed the same problems here in a rather small setup. 40.000
> > metadata documents with nearly as many files that have „literal.*“ fields
> > with them. While 7.2.1 had brought some Tika issues, the real problems started
> > to appear with version 7.3.0 and are currently unresolved in 7.4.0.
> > Memory consumption is through the roof. Where previously 512MB heap was enough,
> now 6G aren’t enough to index all files.
> >
> > kind regards,
> >
> > Thomas
> >
> > > Am 04.07.2018 um 15:03 schrieb Markus Jelsma <
> markus.jel...@openindex.io>:
> > >
> > > Hello Andrey,
> > >
> > > I didn't think of that! I will try it when i have the courage again,
> probably next week or so.
> > >
> > > Many thanks,
> > > Markus
> > >
> > >
> > > -Original message-
> > >> From:Kydryavtsev Andrey 
> > >> Sent: Wednesday 4th July 2018 14:48
> > >> To: solr-user@lucene.apache.org
> > >> Subject: Re: 7.3 appears to leak
> > >>
> > >> If it is not possible to find a resource leak by code analysis and
> there are no better ideas, I can suggest a brute force approach:
> > >> - Clone Solr's sources from appropriate branch
> https://github.com/apache/lucene-solr/tree/branch_7_3
> > >> - Log every searcher's holder increment/decrement operation in a way
> to catch every caller name (use Thread.currentThread().getStackTrace() or
> something)
> https://github.com/apache/lucene-solr/blob/branch_7_3/solr/core/src/java/org/apache/solr/util/RefCounted.java
> > >> - Build custom artefacts and upload them on prod
> > >> - After memory leak happened - analyse logs to see what part of
> functionality doesn't decrement searcher after counter was incremented. If
> searchers are leaked - there should be such code I guess.
> > >>
> > >> This is not something someone would like to do, but it is what it is.
> > >>
> > >>
> > >>
> > >> Thank you,
> > >>
> > >> Andrey Kudryavtsev
> > >>
> > >>
> > >> 03.07.2018, 14:26, "Markus Jelsma" :
> > >>> Hello Erick,
> > >>>
> > >>> Even the silliest ideas may help us, but unfortunately this is not
> the case. All our Solr nodes run binaries from the same source from our
> central build server, with the same libraries thanks to provisioning. Only
> schema and config are different, but the  directive is the same all
> over.
> > >>>
> > >>> Are there any other ideas, speculations, whatever, on why only our
> main text collection leaks a SolrIndexSearcher instance on commit since
> 7.3.0 and every version up?
> > >>>
> > >>> Many thanks?
> > >>> Markus
> > >>>
> > >>> -Original message-
> >   From:Erick Erickson 
> >   Sent: Friday 29th June 2018 19:34
> >   To: solr-user 
> >   Subject: Re: 7.3 appears to leak
> > 
> >   This is truly puzzling then, I'm clueless. It's hard to imagine
> this
> >   is lurking out there and nobody else notices, but you've eliminated
> >   the custom code. And this is also very peculiar:
> > 
> >   * it occurs only in our main text search collection, all other
> >   collections are unaffected;
> >   * despite what i said earlier, it is so far unreproducible outside
> >   production, even when mimicking production as good as we can;
> > 
> >   Here's a tedious idea. Restart Solr with the -v option, I _think_
> that
> >   shows you each and every jar file Solr loads. Is it "somehow"
> possible
> >   that your main collection is loading some jar from somewhere that's
> >   different than you expect? 'cause silly ideas like this are all I
> can
> >   come up with.
> > 
> >   Erick
> > 
> >   On Fri, Jun 29, 2018 at 9:56 AM, Markus Jelsma
> >    wrote:
> >   > Hello Erick,
> >   >
> >   > The custom search handler doesn't interact with
> SolrIndexSearcher, this is really all it does:
> >   >
> >   >   public void handleRequestBody(SolrQueryRequest req,
> SolrQueryResponse rsp) throws Exception {
> >   > super.handleRequestBody(req, rsp);
> >   >
> >   > if (rsp.getToLog().get("hits") instanceof Integer) {
> >   >   rsp.addHttpHeader("X-Solr-Hits",
> String.valueOf((Integer)rsp.getToLog().get("hits")));
> >   > }
> >   > if (rsp.getToLog().get("hits") instanceof Long) {
> >   >   rsp.addHttpHeader("X-Solr-Hits",
> 

Re: Heap Memory Problem after Upgrading to 7.4.0

2018-09-05 Thread Tomás Fernández Löbbe
I think this is pretty bad. I created
https://issues.apache.org/jira/browse/SOLR-12743. Feel free to add any more
details you have there.

On Mon, Sep 3, 2018 at 1:50 PM Markus Jelsma 
wrote:

> Hello Björn,
>
> Take great care, 7.2.1 cannot read an index written by 7.4.0, so you
> cannot roll back but need to reindex!
>
> Andrey Kudryavtsev made a good suggestion in the thread on how to find the
> culprit, but it will be a tedious task. I have not yet had the time or
> courage to venture there.
>
> Hope it helps,
> Markus
>
>
>
> -Original message-
> > From:Björn Häuser 
> > Sent: Monday 3rd September 2018 22:28
> > To: solr-user@lucene.apache.org
> > Subject: Re: Heap Memory Problem after Upgrading to 7.4.0
> >
> > Hi Markus,
> >
> > this reads exactly like what we have. Were you able to figure out
> anything? Currently thinking about rolling back to 7.2.1.
> >
> >
> >
> > > On 3. Sep 2018, at 21:54, Markus Jelsma 
> wrote:
> > >
> > > Hello,
> > >
> > > Getting an OOM plus the fact you are having a lot of IndexSearcher
> instances rings a familiar bell. One of our collections has the same issue
> [1] when we attempted an upgrade 7.2.1 > 7.3.0. I managed to rule out all
> our custom Solr code but had to keep our Lucene filters in the schema, the
> problem persisted.
> > >
> > > The odd thing, however, is that you appear to have the same problem,
> but not with 7.3.0? Since you shortly after 7.3.0 upgraded to 7.4.0, can
> you confirm the problem is not also in 7.3.0?
> > >
> >
> > We had very similar problems with 7.3.0 but never analyzed them and just
> updated to 7.4.0 because I thought that was the bug we hit:
> https://issues.apache.org/jira/browse/SOLR-11882 <
> https://issues.apache.org/jira/browse/SOLR-11882>
> >
> >
> > > You should see the instance count for IndexSearcher increase by one
> for each replica on each commit.
> >
> >
> > Sorry, where can I find this? ;) Sorry, did not find anything.
> >
> > Thanks
> > Björn
> >
> > >
> > > Regards,
> > > Markus
> > >
> > > [1]
> http://lucene.472066.n3.nabble.com/RE-7-3-appears-to-leak-td4396232.html
> > >
> > >
> > >
> > > -Original message-
> > >> From:Erick Erickson 
> > >> Sent: Monday 3rd September 2018 20:49
> > >> To: solr-user 
> > >> Subject: Re: Heap Memory Problem after Upgrading to 7.4.0
> > >>
> > >> I would expect at least 1 IndexSearcher per replica, how many total
> > >> replicas hosted in your JVM?
> > >>
> > >> Plus, if you're actively indexing, there may temporarily be 2
> > >> IndexSearchers open while the new searcher warms.
> > >>
> > >> And there may be quite a few caches, at least queryResultCache and
> > >> filterCache and documentCache, one of each per replica and maybe two
> > >> (for queryResultCache and filterCache) if you have a background
> > >> searcher autowarming.
> > >>
> > >> At a glance, your autowarm counts are very high, so it may take some
> > >> time to autowarm leading to multiple IndexSearchers and caches open
> > >> per replica when you happen to hit a commit point. I usually start
> > >> with 16-20 as an autowarm count, the benefit decreases rapidly as you
> > >> increase the count.
> > >>
> > >> I'm not quite sure why it would be different in 7x .vs. 6x. How much
> > >> heap do you allocate to the JVM? And do you see similar heap dumps in
> > >> 6.6?
> > >>
> > >> Best,
> > >> Erick
> > >> On Mon, Sep 3, 2018 at 10:33 AM Björn Häuser 
> wrote:
> > >>>
> > >>> Hello,
> > >>>
> > >>> we recently upgraded our solrcloud (5 nodes, 25 collections, 1 shard
> each, 4 replicas each) from 6.6.0 to 7.3.0 and shortly after to 7.4.0. We
> are running Zookeeper 4.1.13.
> > >>>
> > >>> Since the upgrade to 7.3.0 and also 7.4.0 we are encountering heap space
> exhaustion. After obtaining a heap dump it looks like we have a lot of
> > >>> IndexSearchers open for our largest collection.
> > >>>
> > >>> The dump contains around ~60 IndexSearchers, and each containing
> around ~40mb heap. Another 500MB of heap is the fieldcache, which is
> expected in my opinion.
> > >>>
> > >>> The current config can be found here:
> https://gist.github.com/bjoernhaeuser/327a65291ac9793e744b87f0a561e844 <
> https://gist.github.com/bjoernhaeuser/327a65291ac9793e744b87f0a561e844>
> > >>>
> > >>> Analyzing the heap dump eclipse MAT says this:
> > >>>
> > >>> Problem Suspect 1
> > >>>
> > >>> 91 instances of "org.apache.solr.search.SolrIndexSearcher", loaded
> by "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x6807d1048" occupy
> 1.981.148.336 (38,26%) bytes.
> > >>>
> > >>> Biggest instances:
> > >>>
> > >>>• org.apache.solr.search.SolrIndexSearcher @ 0x6ffd47ea8 -
> 70.087.272 (1,35%) bytes.
> > >>>• org.apache.solr.search.SolrIndexSearcher @ 0x79ea9c040 -
> 65.678.264 (1,27%) bytes.
> > >>>• org.apache.solr.search.SolrIndexSearcher @ 0x6855ad680 -
> 63.050.600 (1,22%) bytes.
> > >>>
> > >>>
> > >>> Problem Suspect 2
> > >>>
> > >>> 223 instances of "org.apache.solr.util.ConcurrentLRUCache", loaded
> by 

Re: Solr Cloud not routing to PULL replicas

2018-08-28 Thread Tomás Fernández Löbbe
Hi Ash,
Do you see all shard queries going to the TLOG replicas, or just "most" (are
there some going to the PULL replicas?)? You can confirm this by looking in
the logs for queries with the "isShard=true" parameter. Are the PULL replicas
active (since you are using a load balancer I'm guessing you are not using
CloudSolrClient for queries)?
Did you look at metrics other than the CPU utilization? For example, are the
"/select" request metrics (or whatever handler path you are using)
confirming the issue (high in the TLOG replicas and low in the PULL
replicas)?

Can you share a query from your logs (the main query and the shard queries
if possible)

Tomás


On Tue, Aug 28, 2018 at 6:22 AM Ash Ramesh  wrote:

> Hi again,
>
> We are currently using Solr 7.3.1 and have a 8 shard collection. All our
> TLOGs are in seperate machines & PULLs in others. Since not all shards are
> in the same machine, the request will be distributed. However, we are
> seeing that most of the 'distributed' parts of the requests are being
> routed to the TLOG machines. This is evident as the TLOGs are saturated at
> 80%+ CPU while the PULL machines are sitting at 25% even through the load
> balancer only routes to the PULL machines. I know we can use
> 'preferLocalShards', but that still doesn't solve the problem.
>
> Is there something we have configured incorrectly? We are currently rushing
> to upgrade to 7.4.0 so we can take advantage of
> 'shards.preference=replica.location:local,replica.type:PULL' parameter. In
> the meantime, we would like to know if there is a reason for this behavior
> and if there is anything we can do to avoid it.
>
> Thank you & regards,
>
> Ash
>


Re: code v.s. schema for BEST_COMPRESSION mode

2018-06-17 Thread Tomás Fernández Löbbe
The schema configuration in your first link is the way to tell Solr to use
a particular compression mode in the Lucene index (the second link).
So yes, the schema change should be enough.
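
For reference, a minimal sketch of that configuration (this element is typically placed in solrconfig.xml):

<codecFactory class="solr.SchemaCodecFactory">
  <str name="compressionMode">BEST_COMPRESSION</str>
</codecFactory>

This trades some indexing/retrieval CPU for smaller stored-field (.fdt) files; the inverted index itself is not affected.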

On Sun, Jun 17, 2018 at 6:39 AM Zahra Aminolroaya 
wrote:

> I want to reduce the size of indexed and stored documents in Solr. I found
> two ways. In the first solution,
> https://lucene.apache.org/solr/guide/6_6/codec-factory.html#solr-schemacodecfactory
>
> it is only needed to change the compressionMode in the codecFactory section
> of the schema. In the other way,
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.0/lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressingStoredFieldsFormat.java
>
> it is needed to use Java code to compress stored fields.
>
> What is the difference between these solutions? Is it enough to use the
> schema editing way?
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: collection properties

2018-04-13 Thread Tomás Fernández Löbbe
Yes... Unfortunately there is no GET API :S Can you open a Jira? Patch
should be trivial
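
In the meantime, a workaround sketch that reads the znode directly (the ZooKeeper address, timeout and collection name below are placeholders):

import java.nio.charset.StandardCharsets;
import org.apache.solr.common.cloud.SolrZkClient;

public class ReadCollectionProps {
  public static void main(String[] args) throws Exception {
    SolrZkClient zk = new SolrZkClient("localhost:2181", 10000);
    try {
      // collection properties are stored under /collections/<name>/collectionprops.json
      byte[] data = zk.getData("/collections/myCollection/collectionprops.json", null, null, true);
      System.out.println(new String(data, StandardCharsets.UTF_8));
    } finally {
      zk.close();
    }
  }
}

Note that if no properties have been set yet the znode may not exist, so a NoNodeException is possible.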

On Fri, Apr 13, 2018 at 3:05 PM, Hendrik Haddorp 
wrote:

> Hi,
>
> with Solr 7.3 it is possible to set arbitrary collection properties using
> https://lucene.apache.org/solr/guide/7_3/collections-api.html#collectionprop
> But how do I read out the properties again? So far I could not find a REST
> call that would return the properties. I do see my property in the ZK file
> collectionprops.json below my collection though.
>
> thanks,
> Hendrik
>


Re: Solr performance on EC2 linux

2017-05-02 Thread Tomás Fernández Löbbe
I remember seeing some performance impact (even when not using it) and it
was attributed to the calls to System.nanoTime. See SOLR-7875 and SOLR-7876
(fixed for 5.3 and 5.4). Those two Jiras fix the impact when timeAllowed is
not used, but I don't know if there were more changes to improve the
performance of the feature itself. The problem was that System.nanoTime may
be called too many times on indices with many different terms. If this is
the problem Jeff is seeing, a small degradation of System.nanoTime could
have a big impact.

Tomás

On Tue, May 2, 2017 at 10:23 AM, Walter Underwood 
wrote:

> Hmm, has anyone measured the overhead of timeAllowed? We use it all the
> time.
>
> If nobody has, I’ll run a benchmark with and without it.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On May 2, 2017, at 9:52 AM, Chris Hostetter 
> wrote:
> >
> >
> > : I specify a timeout on all queries, 
> >
> > Ah -- ok, yeah -- you mean using "timeAllowed" correct?
> >
> > If the root issue you were seeing is in fact clocksource related,
> > then using timeAllowed would probably be a significant compounding
> > factor there since it would involve a lot of time checks in a single
> > request (even w/o any debugging enabled)
> >
> > (did your coworker's experiements with ES use any sort of equivilent
> > timeout feature?)
> >
> >
> >
> >
> >
> > -Hoss
> > http://www.lucidworks.com/
>
>


Re: Interval Facets with JSON

2017-02-23 Thread Tomás Fernández Löbbe
Hi Deniz,
Interval Facets is currently not supported with JSON Facets as Tom said.
Could you create a Jira issue?

On Fri, Feb 10, 2017 at 6:16 AM, Tom Evans  wrote:

> On Wed, Feb 8, 2017 at 11:26 PM, deniz  wrote:
> > Tom Evans-2 wrote
> >> I don't think there is such a thing as an interval JSON facet.
> >> Whereabouts in the documentation are you seeing an "interval" as JSON
> >> facet type?
> >>
> >>
> >> You want a range facet surely?
> >>
> >> One thing with range facets is that the gap is fixed size. You can
> >> actually do your example however:
> >>
> >> json.facet={height_facet:{type:range, gap:20, start:160, end:190,
> >> hardend:true, field:height}}
> >>
> >> If you do require arbitrary bucket sizes, you will need to do it by
> >> specifying query facets instead, I believe.
> >>
> >> Cheers
> >>
> >> Tom
> >
> >
> > nothing other than
> > https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-IntervalFaceting
> > for documentation on intervals...  i am ok with range queries as well but
> > intervals would fit better because of different sizes...
>
> That documentation is not for JSON facets though. You can't pick and
> choose features from the old facet system and use them in JSON facets
> unless they are mentioned in the JSON facet documentation:
>
> https://cwiki.apache.org/confluence/display/solr/JSON+Request+API
>
> and (not official documentation)
>
> http://yonik.com/json-facet-api/
>
> Cheers
>
> Tom
>


Re: NumericDocValues only supports long?

2017-02-14 Thread Tomás Fernández Löbbe
I think you should use FloatFieldSource. Solr uses
Float.floatToIntBits(floatValue) when adding the DV field, so you could use
Float.intBitsToFloat((int)longValue) when reading (See
TrieField.createFields(...)), but FloatFieldSource is already doing that
for you.
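
A small sketch of both options, assuming the pre-7.0, random-access NumericDocValues API that matches this Solr 6.x-era thread (the field name and docId come from the caller):

import java.io.IOException;
import java.util.HashMap;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.valuesource.FloatFieldSource;

public class FloatDocValuesExample {

  // Option 1: let FloatFieldSource do the decoding for you
  static float viaValueSource(LeafReaderContext ctx, String field, int docId) throws IOException {
    FunctionValues values = new FloatFieldSource(field).getValues(new HashMap<>(), ctx);
    return values.floatVal(docId);
  }

  // Option 2: read the raw long and undo Float.floatToIntBits yourself
  static float viaDocValues(LeafReaderContext ctx, String field, int docId) throws IOException {
    NumericDocValues dv = ctx.reader().getNumericDocValues(field);
    if (dv == null) {
      return 0f; // field has no docValues in this segment
    }
    return Float.intBitsToFloat((int) dv.get(docId));
  }
}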

On Tue, Feb 14, 2017 at 10:37 AM, Ugo Matrangolo 
wrote:

> Hi,
>
> I have a corpus where each document contains a field of type Float.
>
> I'm trying to write a PostFilter that returns a DelegatingCollector to
> filter all the docs where the value of a function applied to this float
> value is lower than a given threshold. I can't precompute/index anything
> here.
>
> I have just found that calling IndexReader to read a single document and
> then read the stored field of type float that I need is quite expensive so
> I was thinking to load all of them at once using:
>
>  LeafReader.numericDocValues($my_float_field_name_here)
>
> Turns out that NumericDocValues only supports long!
>
> How can I access all my float values in the same way ???
>
> Best
> Ugo
>


Re: Limit = 0? Does it still calculate facet ?

2016-12-22 Thread Tomás Fernández Löbbe
Yes, setting facet.limit=0 will short-circuit and not calculate the facet for the
field. I'm assuming you can't just use facet=false?

Tomas

On Thu, Dec 22, 2016 at 1:00 PM, William Bell  wrote:

> We have a qt=provider and it sets facets.
>
> We want to short circuit the facet. Can we set limit=0 and will it NOT
> calculate it?
>
> Or does it calculate it and not return results? Can we make it faster ?
>
> f..facet.limit = 0
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076
>


CREATEALIAS to non-existing collections

2016-12-09 Thread Tomás Fernández Löbbe
We currently support requests to CREATEALIAS to collections that don’t
exist. Requests to this alias later result in 404s. If the target
collection is later created, requests to the alias will begin to work. I’m
wondering if someone is relying on this behavior, or if we should validate
the existence of the target collections when creating the alias (and thus,
fail fast in cases of typos or unexpected cluster state)

Tomás


Re: Solr suddenly starts creating .cfs (compound) segments during indexing

2016-09-27 Thread Tomás Fernández Löbbe
By default, TieredMergePolicy uses CFS for segments that are less than 10%
of the index[1]. If you set the "useCompoundFile" element in solrconfig (to
either true or false) you can override this[2].
TMP also has some other limits and logic on when to (and when not to) use
CFS. You can take a look at the code if you are interested to see those.
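
For example (a solrconfig.xml sketch; "false" here just illustrates forcing compound files off regardless of segment size):

<indexConfig>
  <useCompoundFile>false</useCompoundFile>
</indexConfig>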

Tomás

[1]
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java#L93
[2]
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/update/SolrIndexConfig.java#L384-L410

On Tue, Sep 27, 2016 at 1:25 PM, simon  wrote:

> Our index builds take around 6 hours, and I've noticed recently that
> segments created towards the end of the build (in the last hour or so)  use
> the compound file format (.cfs). I assumed that this might be due to the
> number of open files approaching a maximum, but both the hard and soft open
> file limits for the Solr JVM process are set to 65536, so that doesn't seem
> very likely. It's obviously not a problem, but I'm curious as to why this
> might be happening.
>
>
> Environment:
> OS = Centos 7 Linux
>
> Java:
> java -version =>
> openjdk version "1.8.0_45"
> OpenJDK Runtime Environment (build 1.8.0_45-b13)
> OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)
>
> Solr 5.4 started with the bin/solr script: ps shows
>
> java -server -Xms5g -Xmx5g -XX:NewRatio=3 -XX:SurvivorRatio=4
> -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8
> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ConcGCThreads=4
> -XX:ParallelGCThreads=4 -XX:+CMSScavengeBeforeRemark
> -XX:PretenureSizeThreshold=64m -XX:+UseCMSInitiatingOccupancyOnly
> -XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000
> -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -Djetty.port=8983
> -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Duser.timezone=EST
> -Djetty.home=/home/srosenthal/defsolr/server
> -Dsolr.solr.home=/home/srosenthal/defsolr/server/solr
> -Dsolr.install.dir=/home/srosenthal/defsolr -Xss256k -jar start.jar
> -XX:OnOutOfMemoryError=/home/srosenthal/defsolr/bin/oom_solr.sh 8983
> /home/srosenthal/defsolr/server/logs --module=http
>
> solrconfig.xml: basically the default with some minor tweaks in the
> indexConfig section
> 5.0
> 
>  
> 
> 
> 200
> 1
>
> 
>   20
>   60
>   20
> 
>
> 
> 
> ... everything else is default
> 
> Insights as to why this is happening would be welcome.
>
> -Simon
>


Re: Faceting search issues

2016-09-27 Thread Tomás Fernández Löbbe
I wonder why in the "facet_field" section of the first query it says:
"facet_fields": {"id": []}
 when it should be saying
"facet_fields": {"name": []}

Also, why is the second query not including the fq in the echoParams
section?
What is that other query with fq=aggregationname:story?

This is not in a SolrCloud environment, right? just a single host with no
replication?

Tomás

On Tue, Sep 27, 2016 at 1:10 AM, Jan Høydahl  wrote:

> Please tell some more
> - Solr version
> - Add to your query: =true=all and paste the result
> - How is “string_ci” defined ()?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 26. sep. 2016 kl. 23.59 skrev Beyene, Iyob :
> >
> > Hi,
> >
> > When I query solr using faceted search to check for duplicates using the
> following ,
> >
> > 'http://localhost:8983/solr/core/
> select?q=*:*=true=name=2`,
> >
> > I get the following response with no facet data.
> >
> >
> > {"responseHeader": {"status": 0,"QTime": 541,"params": {"q": "*:*",
> > "facet.field": "name","facet.mincount": "2","rows": "0","facet":
> "true"}},"response": {"numFound": 316544,"start": 0,"maxScore": 1,"docs":
> []},"facet_counts": {"facet_queries": {},"facet_fields": {"name":
> []},"facet_dates": {},"facet_ranges": {},"facet_intervals":
> {},"facet_heatmaps": {}}}
> >
> >
> > but when I specify the name in fq
> >
> > 'http://localhost:8983/solr/core/
> select?q=*:*=true=name=2&
> fq=name:elephant`
> >
> > I get a facet result like these
> >
> > {"responseHeader": {"status": 0,"QTime": 541,"params": {"q":
> "*:*","facet.field": "name","fq": "name:elephant","facet.mincount":
> "2","rows": "0","facet": "true"}},"response": {"numFound": 2,"start":
> 0,"maxScore": 1,"docs": []},"facet_counts": {"facet_queries":
> {},"facet_fields": {"name": ["elephant",4]},"facet_dates":
> {},"facet_ranges": {},"facet_intervals": {},"facet_heatmaps": {}}}
> >
> >
> > The field I am basing the facet search on is defined like below
> >
> >  required="true" multiValued="true"/>
> >
> >
> > Is there some variation of faceting that could help me analyze the
> difference?
> >
> > Thanks
> >
> > Iyob
> >
> >
> >
> >
> >
>
>


Re: Viewing the Cache Stats [SOLR 6.1.0]

2016-09-24 Thread Tomás Fernández Löbbe
That thread is pretty old and probably talking about the old(est) admin UI
(before 4.0). The cache stats can be found by selecting the core in the
dropdown and then "Plugin/Stats".


See
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=32604180

Tomás


On Sat, Sep 24, 2016 at 12:14 PM, slee  wrote:

> I'm trying to view the Cache Stats.
> After reading this thread: Cache Stats, I can't
> seem to find the Statistics page in the Solr Admin UI.
>
> Should I be installing some plug-in or do some configuration?
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Viewing-the-Cache-Stats-SOLR-6-1-0-tp4297861.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: AW: group.facet=true and facet on field of type int -> org.apache.solr.common.SolrException: Exception during facet.field

2016-07-19 Thread Tomás Fernández Löbbe
Hi Sebastian,
This looks like https://issues.apache.org/jira/browse/SOLR-7495

On Jul 19, 2016 3:46 AM, "Sebastian Riemer"  wrote:

> May I respectfully refer again to a question I posted last week?
>
> Thank you very much and a nice day to you all!
>
> Sebastian
> -
>
>
>
>
>
>
>
> Hi all,
>
> Tested on Solr 6.1.0 (as well as 5.4.0 and 5.5.0) using the "techproducts"
> example the following query throws the same exception as in my original
> question:
>
> To reproduce:
> 1) set up the techproducts example:
> solr start -e techproducts -noprompt
> 2) go to Solr Admin:
> http://localhost:8983/solr/#/techproducts/query
> 3) in "Raw Query Parameters" enter:
>
> group=true=true=true=manu_id_s=true=popularity
> 4) Hit "Execute Query"
>
> [..]
> "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","java.lang.IllegalStateException"],
> "msg":"Exception during facet.field: popularity",
> "trace":"org.apache.solr.common.SolrException: Exception during
> facet.field: popularity\r\n\tat
> org.apache.solr.request.SimpleFacets.lambda$getFacetFieldCounts$50(SimpleFacets.java:739)\r\n\tat
> org.apache.solr.request.SimpleFacets$$Lambda$37/2022187546.call(Unknown
> Source)\r\n\tat
> java.util.concurrent.FutureTask.run(FutureTask.java:266)\r\n\tat
> org.apache.solr.request.SimpleFacets$2.execute(SimpleFacets.java:672)\r\n\tat
> org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:748)\r\n\tat
> org.apache.solr.handler.component.FacetComponent.getFacetCounts(FacetComponent.java:321)\r\n\tat
> org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:265)\r\n\tat
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:293)\r\n\tat
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)\r\n\tat
> org.apache.solr.core.SolrCore.execute(SolrCore.java:2036)\r\n\tat
> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:657)\r\n\tat
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:464)\r\n\tat
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)\r\n\tat
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)\r\n\tat
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)\r\n\tat
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)\r\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\r\n\tat
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\r\n\tat
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\r\n\tat
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)\r\n\tat
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)\r\n\tat
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\r\n\tat
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)\r\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\r\n\tat
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)\r\n\tat
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)\r\n\tat
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\r\n\tat
> org.eclipse.jetty.server.Server.handle(Server.java:518)\r\n\tat
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)\r\n\tat
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)\r\n\tat
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)\r\n\tat
> org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)\r\n\tat
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\r\n\tat
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)\r\n\tat
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)\r\n\tat
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)\r\n\tat
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)\r\n\tat
> java.lang.Thread.run(Thread.java:745)\r\nCaused by:
> java.lang.IllegalStateException: unexpected docvalues type NUMERIC for
> field 'popularity' (expected=SORTED). Use UninvertingReader or index with
> docvalues.\r\n\tat
> org.apache.lucene.index.DocValues.checkField(DocValues.java:212)\r\n\tat
> org.apache.lucene.index.DocValues.getSorted(DocValues.java:264)\r\n\tat
> org.apache.lucene.search.grouping.term.TermGroupFacetCollector$SV.doSetNextReader(TermGroupFacetCollector.java:129)\r\n\tat
> 

Re: deploy solr on cloud providers

2016-07-06 Thread Tomás Fernández Löbbe
On Wed, Jul 6, 2016 at 2:30 AM, Lorenzo Fundaró <
lorenzo.fund...@dawandamail.com> wrote:

> On 6 July 2016 at 00:00, Tomás Fernández Löbbe <tomasflo...@gmail.com>
> wrote:
>
> > The leader will do the replication before responding to the client, so
> lets
> > say the leader gets to update it's local copy, but it's terminated before
> > sending the request to the replicas, the client should get either an HTTP
> > 500 or no http response. From the client code you can take action (log,
> > retry, etc).
> >
>
> If this is true then whenever I ask for min_rf having three nodes (1 leader +
> 2 replicas)
> I should get rf = 3, but in reality I don't.
>
>
> > The "min_rf" is useful for the case where replicas may be down or not
> > accessible. Again, you can use this for retrying or take any necessary
> > action on the client side if the desired rf is not achieved.
> >
>
>
> I think both paragraphs are contradictory. If the leader does the
> replication before responding to the client, then
> why is there a need to use the min_rf? I don't think it is true that you get
> a 200 when the update has been passed to all replicas.
>

The reason why "min_rf" is there is because:
* If there are no replicas at the time of the request (e.g. if replicas are
unreachable and disconnected from ZK)
* Replicas could fail to ACK the update request from the leader; in that
case the leader will mark them as unhealthy but will still return HTTP 200 to the
client.

So, it could happen that you think your data is being replicated to 3
replicas, but 2 of them are currently out of service. This means that your
doc is on a single host, and if that one dies, then you lose that data. In
order to prevent this, you can ask Solr to tell you how many replicas
succeeded that update request. You can read more about this in
https://issues.apache.org/jira/browse/SOLR-5468
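
A SolrJ sketch of that check (the collection name, document and threshold are placeholders; per my reading of SOLR-5468 the achieved rf is reported back in the response header):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

public class MinRfExample {
  static void indexWithMinRf(SolrClient client, String collection, SolrInputDocument doc) throws Exception {
    UpdateRequest req = new UpdateRequest();
    req.add(doc);
    req.setParam("min_rf", "2");                    // ask Solr to compute/report the achieved replication factor
    UpdateResponse rsp = req.process(client, collection);
    Object rf = rsp.getResponseHeader().get("rf");  // achieved rf for this request (assumed to be in the header)
    if (rf == null || Integer.parseInt(rf.toString()) < 2) {
      // fewer than 2 copies acknowledged this update: log, retry or alert here
    }
  }
}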


>
> The thing is that, when you have persistent storage you shouldn't worry
> about this because you know when the node comes back
> the rest of the index will be synced; the problem is when you don't have
> persistent storage. For my particular case I have to be extra careful and
> always
> make sure that all my replicas have all the data I sent.
>
> In any case you should assume that storage on a host can be completely
lost, no matter if you are deploying on premises or in the cloud. Consider
that once that host comes back (could be hours later) it could already be
out of date, and will replicate from the current leader, possibly dropping
part or all of its current index.

Tomás


>
> > Tomás
> >
> > On Tue, Jul 5, 2016 at 11:39 AM, Lorenzo Fundaró <
> > lorenzo.fund...@dawandamail.com> wrote:
> >
> > > @Tomas and @Steven
> > >
> > > I am a bit skeptical about this two statements:
> > >
> > > If a node just disappears you should be fine in terms of data
> > > > availability, since Solr in "SolrCloud" replicates the data as it
> comes
> > > it
> > > > (before sending the http response)
> > >
> > >
> > > and
> > >
> > > >
> > > > You shouldn't "need" to move the storage as SolrCloud will replicate
> > all
> > > > data to the new node and anything in the transaction log will already
> > be
> > > > distributed through the rest of the machines..
> > >
> > >
> > > because according to the official documentation here
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
> > > >:
> > > (Write side fault tolerant -> recovery)
> > >
> > > If a leader goes down, it may have sent requests to some replicas and
> not
> > > > others. So when a new potential leader is identified, it runs a synch
> > > > process against the other replicas. If this is successful, everything
> > > > should be consistent, the leader registers as active, and normal
> > actions
> > > > proceed
> > >
> > >
> > > I think there is a possibility that an update is not sent by the leader
> > but
> > > is kept in the local disk and after it comes up again it can sync the
> > > non-sent data.
> > >
> > > Furthermore:
> > >
> > > Achieved Replication Factor
> > > > When using a replication factor greater than one, an update request
> may
> > > > succeed on the shard leader but fail on one or more of the replicas.
> > For
> > > > instance, consider a collection with one shard and replication factor
> > of

Re: deploy solr on cloud providers

2016-07-05 Thread Tomás Fernández Löbbe
The leader will do the replication before responding to the client, so let's
say the leader gets to update its local copy, but it's terminated before
sending the request to the replicas, the client should get either an HTTP
500 or no http response. From the client code you can take action (log,
retry, etc).
The "min_rf" is useful for the case where replicas may be down or not
accessible. Again, you can use this for retrying or take any necessary
action on the client side if the desired rf is not achieved.

Tomás

On Tue, Jul 5, 2016 at 11:39 AM, Lorenzo Fundaró <
lorenzo.fund...@dawandamail.com> wrote:

> @Tomas and @Steven
>
> I am a bit skeptical about this two statements:
>
> If a node just disappears you should be fine in terms of data
> > availability, since Solr in "SolrCloud" replicates the data as it comes
> it
> > (before sending the http response)
>
>
> and
>
> >
> > You shouldn't "need" to move the storage as SolrCloud will replicate all
> > data to the new node and anything in the transaction log will already be
> > distributed through the rest of the machines..
>
>
> because according to the official documentation here
> <
> https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
> >:
> (Write side fault tolerant -> recovery)
>
> If a leader goes down, it may have sent requests to some replicas and not
> > others. So when a new potential leader is identified, it runs a synch
> > process against the other replicas. If this is successful, everything
> > should be consistent, the leader registers as active, and normal actions
> > proceed
>
>
> I think there is a possibility that an update is not sent by the leader but
> is kept in the local disk and after it comes up again it can sync the
> non-sent data.
>
> Furthermore:
>
> Achieved Replication Factor
> > When using a replication factor greater than one, an update request may
> > succeed on the shard leader but fail on one or more of the replicas. For
> > instance, consider a collection with one shard and replication factor of
> > three. In this case, you have a shard leader and two additional replicas.
> > If an update request succeeds on the leader but fails on both replicas,
> for
> > whatever reason, the update request is still considered successful from
> the
> > perspective of the client. The replicas that missed the update will sync
> > with the leader when they recover.
>
>
> They have implemented this parameter called *min_rf* that you can use
> (client-side) to make sure that your update was replicated to at least one
> replica (e.g.: min_rf > 1).
>
> This is why my concern about moving storage around, because then I know
> when the shard leader comes back, solrcloud will run sync process for those
> documents that couldn't be sent to the replicas.
>
> Am I missing something or misunderstood the documentation ?
>
> Cheers !
>
>
>
>
>
>
>
> On 5 July 2016 at 19:49, Davis, Daniel (NIH/NLM) [C]  >
> wrote:
>
> > Lorenzo, this probably comes late, but my systems guys just don't want to
> > give me real disk.   Although RAID-5 or LVM on-top of JBOD may be better
> > than Amazon EBS, Amazon EBS is still much closer to real disk in terms of
> > IOPS and latency than NFS ;)I even ran a mini test (not an official
> > benchmark), and found the response time for random reads to be better.
> >
> > If you are a young/smallish company, this may be all in the cloud, but if
> > you are in a large organization like mine, you may also need to allow for
> > other architectures, such as a "virtual" Netapp in the cloud that
> > communicates with a physical Netapp on-premises, and the
> throughput/latency
> > of that.   The most important thing is to actually measure the numbers
> you
> > are getting, both for search and for simply raw I/O, or to get your
> > systems/storage guys to measure those numbers. If you get your
> > systems/storage guys to just measure storage - you will want to care
> about
> > three things for indexing primarily:
> >
> > Sequential Write Throughput
> > Random Read Throughput
> > Random Read Response Time/Latency
> >
> > Hope this helps,
> >
> > Dan Davis, Systems/Applications Architect (Contractor),
> > Office of Computer and Communications Systems,
> > National Library of Medicine, NIH
> >
> >
> >
> > -Original Message-
> > From: Lorenzo Fundaró [mailto:lorenzo.fund...@dawandamail.com]
> > Sent: Tuesday, July 05, 2016 3:20 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: deploy solr on cloud providers
> >
> > Hi Shawn. Actually what I'm trying to find out is whether this is the best
> > approach for deploying Solr in the cloud. I believe SolrCloud solves a
> lot
> > of problems in terms of High Availability but when it comes to storage
> > there seems to be a limitation that can be worked around of course but it's
> a
> > bit cumbersome and I was wondering if there is a better option for this
> or
> > if I'm missing something with the 

Re: deploy solr on cloud providers

2016-07-05 Thread Tomás Fernández Löbbe
I think there are two parts to this question:
* If a node just disappears you should be fine in terms of data
availability, since Solr in "SolrCloud" replicates the data as it comes in
(before sending the http response). Even if the leader disappears and never
comes back as long as you have one replica alive for that shard of that
collection there should be no data lost. A new leader will be elected and
you can continue adding docs or querying.
* If the node doesn't recover and a new one joins the cluster, currently
Solr won't automatically realize that replicas have disappeared and create
them, so you need to take some action. Some good responses about this issue
are in this other thread
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3ccap_wmbugkdujin1unb_arvxq9vh3f5x6ybpgu7iqckawv9b...@mail.gmail.com%3E

I hope this helps,

Tomás

On Tue, Jul 5, 2016 at 8:55 AM, Steven Bower  wrote:

> You shouldn't "need" to move the storage as SolrCloud will replicate all
> data to the new node and anything in the transaction log will already be
> distributed through the rest of the machines..
>
> One option to keep all your data attached to nodes might be to use Amazon
> EFS (pretty new) to store your data.. However I've not seen any good perf
> testing done against it so not sure how it will scale..
>
> steve
>
> On Tue, Jul 5, 2016 at 11:46 AM Lorenzo Fundaró <
> lorenzo.fund...@dawandamail.com> wrote:
>
> > On 5 July 2016 at 15:55, Shawn Heisey  wrote:
> >
> > > On 7/5/2016 1:19 AM, Lorenzo Fundaró wrote:
> > > > Hi Shawn. Actually what im trying to find out is whether this is the
> > best
> > > > approach for deploying solr in the cloud. I believe solrcloud solves
> a
> > > lot
> > > > of problems in terms of High Availability but when it comes to
> storage
> > > > there seems to be a limitation that can be workaround of course but
> > it's
> > > a
> > > > bit cumbersome and i was wondering if there is a better option for
> this
> > > or
> > > > if im missing something with the way I'm doing it. I wonder if there
> > are
> > > > some proved experience about how to solve the storage problem when
> > > > deploying in the cloud. Any advise or point to some enlightening
> > > > documentation will be appreciated. Thanks.
> > >
> > > When you ask whether "this is the best approach" ... you need to define
> > > what "this" is.  You mention a "storage problem" that needs solving ...
> > > but haven't actually described that problem in a way that I can
> > > understand.
> >
> >
> > So, I'm trying to put SolrCloud in a cloud provider where a node can
> > disappear at any time
> > because of hardware failure. In order to preserve any non-replicated
> > updates I need to
> > make the storage of that dead node go to the newly spawned node. I am not
> > having a problem with this
> > approach actually, I just want to know if there is a better way of doing
> > this. I know there is HDFS support that makes
> > all this easier but this is not an option for me. Thank you and I
> apologise
> > for the unclear mails.
> >
> >
> > >
> > > Let's back up and cover some basics:
> > >
> > > What steps are you taking?
> >
> > What do you expect (or want) to happen?
> >
> > What actually happens?
> > >
> > > The answers to these questions need to be very detailed.
> > >
> > > Thanks,
> > > Shawn
> > >
> > >
> >
> >
> > --
> >
> > --
> > Lorenzo Fundaro
> > Backend Engineer
> > E-Mail: lorenzo.fund...@dawandamail.com
> >
> > Fax   + 49 - (0)30 - 25 76 08 52
> > Tel+ 49 - (0)179 - 51 10 982
> >
> > DaWanda GmbH
> > Windscheidstraße 18
> > 10627 Berlin
> >
> > Geschäftsführer: Claudia Helming und Niels Nüssler
> > AG Charlottenburg HRB 104695 B http://www.dawanda.com
> >
>


Re: OOM script executed

2016-05-03 Thread Tomás Fernández Löbbe
You could use some memory analyzer tools (e.g. jmap), that could give you a
hint. But if you are migrating, I'd start to see if you changed something
from the previous version, including jvm settings, schema/solrconfig.
If nothing is different, I'd try to identify which feature is consuming
more memory. If you use faceting/stats/suggester, or you have big caches or
request big pages (e.g. 100k docs) or use Solr Cell for extracting content,
those are some usual suspects. Try to narrow it down, it could be many
things. Turn on/off features as you look at the memory (you could use
something like jconsole/jvisualvm/jstat) and see when it spikes, compare
with the previous version. That's what I would do, at least.
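
For example (the Solr pid is a placeholder; the exact output format varies by JDK):

# class histogram of live objects (note: triggers a full GC), to see what dominates the heap
jmap -histo:live <solr-pid> | head -40

# heap occupancy and GC activity over time, one sample per second
jstat -gcutil <solr-pid> 1000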

If you get to narrow it down to a specific feature, then you can come back
to the users list and ask with some more specifics, that way someone could
point you to the solution, or maybe file a JIRA if it turns out to be a bug.

Tomás

On Mon, May 2, 2016 at 11:34 PM, Bastien Latard - MDPI AG <
lat...@mdpi.com.invalid> wrote:

> Hi Tomás,
>
> Thanks for your answer.
> How could I see what's using memory?
> I tried to add "-XX:+HeapDumpOnOutOfMemoryError
> -XX:HeapDumpPath=/var/solr/logs/OOM_Heap_dump/"
> ...but this doesn't seem to be really helpful...
>
> Kind regards,
> Bastien
>
>
> On 02/05/2016 22:55, Tomás Fernández Löbbe wrote:
>
>> You could, but before that I'd try to see what's using your memory and see
>> if you can decrease that. Maybe identify why you are running OOM now and
>> not with your previous Solr version (assuming you weren't, and that you
>> are
>> running with the same JVM settings). A bigger heap usually means more work
>> to the GC and less memory available for the OS cache.
>>
>> Tomás
>>
>> On Sun, May 1, 2016 at 11:20 PM, Bastien Latard - MDPI AG <
>> lat...@mdpi.com.invalid> wrote:
>>
>> Hi Guys,
>>>
>>> I got several times the OOM script executed since I upgraded to Solr6.0:
>>>
>>> $ cat solr_oom_killer-8983-2016-04-29_15_16_51.log
>>> Running OOM killer script for process 26044 for Solr on port 8983
>>>
>>> Does it mean that I need to increase my JAVA Heap?
>>> Or should I do anything else?
>>>
>>> Here are some further logs:
>>> $ cat solr_gc_log_20160502_0730:
>>> }
>>> {Heap before GC invocations=1674 (full 91):
>>>   par new generation   total 1747648K, used 1747135K [0x0005c000,
>>> 0x00064000, 0x00064000)
>>>eden space 1398144K, 100% used [0x0005c000,
>>> 0x00061556,
>>> 0x00061556)
>>>from space 349504K,  99% used [0x00061556, 0x00062aa2fc30,
>>> 0x00062aab)
>>>to   space 349504K,   0% used [0x00062aab, 0x00062aab,
>>> 0x00064000)
>>>   concurrent mark-sweep generation total 6291456K, used 6291455K
>>> [0x00064000, 0x0007c000, 0x0007c000)
>>>   Metaspace   used 39845K, capacity 40346K, committed 40704K,
>>> reserved
>>> 1085440K
>>>class spaceused 4142K, capacity 4273K, committed 4368K, reserved
>>> 1048576K
>>> 2016-04-29T21:15:41.970+0200: 20356.359: [Full GC (Allocation Failure)
>>> 2016-04-29T21:15:41.970+0200: 20356.359: [CMS:
>>> 6291455K->6291456K(6291456K), 12.5694653 secs]
>>> 8038591K->8038590K(8039104K), [Metaspace: 39845K->39845K(1085440K)],
>>> 12.5695497 secs] [Times: user=12.57 sys=0.00, real=12.57 secs]
>>>
>>>
>>> Kind regards,
>>> Bastien
>>>
>>>
>>>
> Kind regards,
> Bastien Latard
> Web engineer
> --
> MDPI AG
> Postfach, CH-4005 Basel, Switzerland
> Office: Klybeckstrasse 64, CH-4057
> Tel. +41 61 683 77 35
> Fax: +41 61 302 89 18
> E-mail:
> lat...@mdpi.com
> http://www.mdpi.com/
>
>


Re: OOM script executed

2016-05-02 Thread Tomás Fernández Löbbe
You could, but before that I'd try to see what's using your memory and see
if you can decrease that. Maybe identify why you are running OOM now and
not with your previous Solr version (assuming you weren't, and that you are
running with the same JVM settings). A bigger heap usually means more work
to the GC and less memory available for the OS cache.

Tomás

On Sun, May 1, 2016 at 11:20 PM, Bastien Latard - MDPI AG <
lat...@mdpi.com.invalid> wrote:

> Hi Guys,
>
> I got several times the OOM script executed since I upgraded to Solr6.0:
>
> $ cat solr_oom_killer-8983-2016-04-29_15_16_51.log
> Running OOM killer script for process 26044 for Solr on port 8983
>
> Does it mean that I need to increase my JAVA Heap?
> Or should I do anything else?
>
> Here are some further logs:
> $ cat solr_gc_log_20160502_0730:
> }
> {Heap before GC invocations=1674 (full 91):
>  par new generation   total 1747648K, used 1747135K [0x0005c000,
> 0x00064000, 0x00064000)
>   eden space 1398144K, 100% used [0x0005c000, 0x00061556,
> 0x00061556)
>   from space 349504K,  99% used [0x00061556, 0x00062aa2fc30,
> 0x00062aab)
>   to   space 349504K,   0% used [0x00062aab, 0x00062aab,
> 0x00064000)
>  concurrent mark-sweep generation total 6291456K, used 6291455K
> [0x00064000, 0x0007c000, 0x0007c000)
>  Metaspace   used 39845K, capacity 40346K, committed 40704K, reserved
> 1085440K
>   class spaceused 4142K, capacity 4273K, committed 4368K, reserved
> 1048576K
> 2016-04-29T21:15:41.970+0200: 20356.359: [Full GC (Allocation Failure)
> 2016-04-29T21:15:41.970+0200: 20356.359: [CMS:
> 6291455K->6291456K(6291456K), 12.5694653 secs]
> 8038591K->8038590K(8039104K), [Metaspace: 39845K->39845K(1085440K)],
> 12.5695497 secs] [Times: user=12.57 sys=0.00, real=12.57 secs]
>
>
> Kind regards,
> Bastien
>
>


Re: What does the "Max Doc" means in Admin interface?

2016-05-02 Thread Tomás Fernández Löbbe
"Max Docs" is a confusing. It's not really the maximum number of docs you
can have, it's just the total amount of docs in your index INCLUDING
DELETED DOCS that haven't been cleared by a merge.
"Heap Memory Usage" is currently broken. See
https://issues.apache.org/jira/browse/SOLR-7475
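If you want to check the numbers for a core outside the UI, the CoreAdmin STATUS call reports them (a sketch; host, port and core name are placeholders):

    http://localhost:8983/solr/admin/cores?action=STATUS&core=collection1&wt=json&indent=true

In the returned index section, numDocs is the live document count while maxDoc also includes deleted documents that haven't been merged away yet; a forceMerge/optimize brings maxDoc back down to numDocs.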

On Sun, May 1, 2016 at 11:25 PM, Bastien Latard - MDPI AG <
lat...@mdpi.com.invalid> wrote:

> Hi All,
>
> Everything is in the title...
>
>
> Can this value be modified?
> Or is it because of my environment?
>
> Also, what does "Heap Memory Usage: -1" mean?
>
> Kind regards,
> Bastien Latard
> Web engineer
> --
> MDPI AG
> Postfach, CH-4005 Basel, Switzerland
> Office: Klybeckstrasse 64, CH-4057
> Tel. +41 61 683 77 35
> Fax: +41 61 302 89 18
> E-mail: latard@mdpi.com
> http://www.mdpi.com/
>
>


Re: Next Solr Release - 5.5.1 or 6.0 ?

2016-03-24 Thread Tomás Fernández Löbbe
>
>
> Not to mention the fact that Solr 6 is using deprecated Lucene 6
> numeric types if those are removed in Lucene 7, then what?
>
I believe this is going to be an issue. We have SOLR-8396
<https://issues.apache.org/jira/browse/SOLR-8396> open, but it doesn't look
like it's going to make it to 6.0 (I tried to look at it but I didn't have
time the past weeks). We'll have to support it until Solr 8 I guess.

Tomás


Re: SolrCloud: published host/port

2016-03-24 Thread Tomás Fernández Löbbe
I believe this can be done by setting the "host" and "hostPort" elements in
solr.xml. In the default solr.xml they are configured in a way to support
also setting them via System properties:

<str name="host">${host:}</str>
<int name="hostPort">${jetty.port:8983}</int>
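
For the NAT/Docker case described here, one option (a sketch, untested; publishHost/publishPort are example property names I made up, not built-in Solr properties) is to point those elements at your own properties and pass the externally visible values at startup:

    <solr>
      <solrcloud>
        <str name="host">${publishHost:}</str>
        <int name="hostPort">${publishPort:8983}</int>
        ...
      </solrcloud>
    </solr>

    bin/solr start -c -p 8983 -DpublishHost=docker-host.example.com -DpublishPort=18983

Jetty still binds to 8983 inside the container, but the node registers docker-host.example.com:18983 in ZooKeeper.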

Tomás

On Wed, Mar 23, 2016 at 11:26 PM, Hendrik Haddorp 
wrote:

> Hi,
>
> is it possible to instruct Solr to publish a different host/port into
> ZooKeeper then it is actually running on? This is required if the Solr
> node is not directly reachable on its port from outside due to a NAT
> setup or when running Solr as a Docker container with a mapped port.
>
> For what its worth ElasticSearch is supporting this as documented here [1]:
> - transport.publish_port
> - transport.publish_host
>
> regards,
> Hendrik
>
> [1]
>
> https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-transport.html
>


Re: New comer - Benoit Vanalderweireldt

2016-02-27 Thread Tomás Fernández Löbbe
Yes, you can create a new Jira if there isn't one already. I believe you
can create a new pull request with the Jira number in the title, and that
gets automatically appended to the Jira issue. Other option is to create a
patch and upload it to the Jira manually.

On Sat, Feb 27, 2016 at 11:52 AM, Benoit Vanalderweireldt <
benoi...@hybhub.com> wrote:

> Thank you guys,
>
> I have chosen to add a test case class after reading the code coverage
> report, I have pushed a first pull requests adding test cases for
> org.apache.solr.util.hll.NumberUtil (GitHub PR <
> https://github.com/apache/lucene-solr/pull/15>).
>
> What I should do next, should I create a new task on Jira ?
>
> Benoit
>
> > On Feb 26, 2016, at 1:31 AM, Erick Erickson 
> wrote:
> >
> > There are also other ways to help than coding. Documentation, Java docs,
> > writing test cases (look at the code coverage reports and pick something
> > not already covered). Review or comment on patches, work on the new
> Angular
> > JS UI.
> >
> > Try installing Solr and note any ambiguous docs and suggest better ones,
> > the sky is the limit.
> >
> > There's a chronic need for better JavaDocs.
> >
> > In short, pick something about Solr/Lucene that bugs you and see what you
> > can do to improve it ;)
> >
> > Welcome!
> > Erick
> > On Feb 26, 2016 14:07, "Shawn Heisey"  wrote:
> >
> >> On 2/25/2016 4:34 PM, Benoit Vanalderweireldt wrote:
> >>> I have just joined this mailing list, I would love to contribute to
> >> Apache SOLR (I am a certified Java developer OCA and OCP)
> >>>
> >>> Can someone guide me and assign me a first task on Jira (my username is
> >> : b.vanalderweireldt) ?
> >>
> >> Thanks for stepping up and offering to help out!  Jan has given you some
> >> good starting points.  I had mostly written this message before that
> >> reply came through, so here's some more info:
> >>
> >> You'll want to join the dev list.  Most of the communication for a
> >> specific issue will happen in Jira, but the dev list offers a place for
> >> larger and miscellaneous discussions.  Warning: Because all Jira
> >> activity is sent to the dev list, it is a high-traffic list.  Having the
> >> ability to use filters on your email to direct messages to different
> >> folders is a life-saver.
> >>
> >> Your initial message would have been more at home on the dev list, but
> >> we're not terribly formal about enforcing that kind of separation.
> >> Initial discussion for many issues is welcome on this list, and often
> >> preferred before going to Jira.
> >>
> >> Normally issues are assigned to the committer that agrees to take on the
> >> change and commit it.
> >>
> >> Take a look at the many open issues on Solr.  You'll probably want to
> >> start with an issue that's recently filed, not one that was filed years
> >> ago.  After you become more comfortable with the codebase and the
> >> process, you'll be in a better position to tackle older issues.
> >>
> >>
> >>
> https://issues.apache.org/jira/browse/SOLR/?selectedTab=com.atlassian.jira.jira-projects-plugin:issues-panel
> >>
> >> A highly filtered and likely more relevant list:
> >>
> >>
> >>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SOLR%20AND%20labels%20in%20%28beginners%2C%20newdev%29%20AND%20status%20not%20in%20%28resolved%2C%20closed%29
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>
>


Re: SolrCloud replicas out of sync

2016-01-28 Thread Tomás Fernández Löbbe
Maybe you are hitting the reordering issue described in SOLR-8129?

Tomás

On Wed, Jan 27, 2016 at 11:32 AM, David Smith 
wrote:

> Sure.  Here is our SolrCloud cluster:
>
>+ Three (3) instances of Zookeeper on three separate (physical)
> servers.  The ZK servers are beefy and fairly recently built, with 2x10
> GigE (bonded) Ethernet connectivity to the rest of the data center.  We
> recognize importance of the stability and responsiveness of ZK to the
> stability of SolrCloud as a whole.
>
>+ 364 collections, all with single shards and a replication factor of
> 3.  Currently housing only 100,000,000 documents in aggregate.  Expected to
> grow to 25 billion+.  The size of a single document would be considered
> “large”, by the standards of what I’ve seen posted elsewhere on this
> mailing list.
>
> We are always open to ZK recommendations from you or anyone else,
> particularly for running a SolrCloud cluster of this size.
>
> Kind Regards,
>
> David
>
>
>
> On 1/27/16, 12:46 PM, "Jeff Wartes"  wrote:
>
> >
> >If you can identify the problem documents, you can just re-index those
> after forcing a sync. Might save a full rebuild and downtime.
> >
> >You might describe your cluster setup, including ZK. it sounds like
> you’ve done your research, but improper ZK node distribution could
> certainly invalidate some of Solr’s assumptions.
> >
> >
> >
> >
> >On 1/27/16, 7:59 AM, "David Smith"  wrote:
> >
> >>Jeff, again, very much appreciate your feedback.
> >>
> >>It is interesting — the article you linked to by Shalin is exactly why
> we picked SolrCloud over ES, because (eventual) consistency is critical for
> our application and we will sacrifice availability for it.  To be clear,
> after the outage, NONE of our three replicas are correct or complete.
> >>
> >>So we definitely don’t have CP yet — our very first network outage
> resulted in multiple overlapped lost updates.  As a result, I can’t pick
> one replica and make it the new “master”.  I must rebuild this collection
> from scratch, which I can do, but that requires downtime which is a problem
> in our app (24/7 High Availability with few maintenance windows).
> >>
> >>
> >>So, I definitely need to “fix” this somehow.  I wish I could outline a
> reproducible test case, but as the root cause is likely very tight timing
> issues and complicated interactions with Zookeeper, that is not really an
> option.  I’m happy to share the full logs of all 3 replicas though if that
> helps.
> >>
> >>I am curious though if the thoughts have changed since
> https://issues.apache.org/jira/browse/SOLR-5468 of seriously considering
> a “majority quorum” model, with rollback?  Done properly, this should be
> free of all lost update problems, at the cost of availability.  Some
> SolrCloud users (like us!!!) would gladly accept that tradeoff.
> >>
> >>Regards
> >>
> >>David
> >>
> >>
>
>


Re: SOLR replicas performance

2016-01-08 Thread Tomás Fernández Löbbe
Hi Luca,
It looks like your queries are complex wildcard queries. My theory is that
you are CPU-bounded, for a single query one CPU core for each shard will be
at 100% for the duration of the sub-query. Smaller shards make these
sub-queries faster which is why 16 shards is better than 8 in your case.
* In your 16x1 configuration, you have exactly one shard per CPU core, so
in a single query, 16 subqueries will go to both nodes evenly and use one
of the CPU cores.
* In your 8x2 configuration, you still get to use one CPU core per shard,
but the shards are bigger, so maybe each subquery takes longer (for the
single query thread and 8x2 scenario I would expect CPU utilization to be
lower?).
* In your 16x2 case, 16 subqueries will be distributed unevenly, and some
node will get more than 8 subqueries, which means that some of the
subqueries will have to wait for their turn on a CPU core. In addition,
more Solr cores will be competing for resources.
If this theory is correct, adding more replicas won't speed up your queries;
you need to either get faster CPUs or simplify your queries/configuration in
some way. Adding more replicas should improve your query throughput, but
only if you add them on additional hardware, not the same nodes.

...anyway, just a theory

Tomás

On Fri, Jan 8, 2016 at 7:40 AM, Shawn Heisey  wrote:

> On 1/8/2016 7:55 AM, Luca Quarello wrote:
> > I used solr5.3.1 and I sincerely expected response times with replica
> > configuration near to response times without replica configuration.
> >
> > Do you agree with me?
> >
> > I read here
> >
> http://lucene.472066.n3.nabble.com/Solr-Cloud-Query-Scaling-td4110516.html
> > that "Queries do not need to be routed to leaders; they can be handled by
> > any replica in a shard. Leaders are only needed for handling update
> > requests. "
> >
> > I haven't found this behaviour. In my case CONF2 and CONF3 have all
> > replicas on VM2, but analyzing core utilization during a request, it is
> > 100% on both machines. Why?
>
> Indexing is a little bit slower with replication -- the update must
> happen on all replicas.
>
> If your index is sharded (which I believe you did indicate in your
> initial message), you may find that all replicas get used even for
> queries.  It is entirely possible that some of the shard subqueries will
> be processed on one replica and some of them will be processed on other
> replicas.  I do not know if this commonly happens, but I would not be
> surprised if it does.  If the machines are sized appropriately for the
> index, this separation should speed up queries, because you have the
> resources of multiple machines handling one query.
>
> That phrase "sized appropriately" is very important.  Your initial
> message indicated that you have a 90GB index, and that you are running
> in virtual machines.  Typically VMs have fairly small memory sizes.  It
> is very possible that you simply don't have enough memory in the VM for
> good performance with an index that large.  With 90GB of index data on
> one machine, I would hope for at least 64GB of RAM, and I would prefer
> to have 128GB.  If there is more than 90GB of data on one machine, then
> even more memory would be needed.
>
> Thanks,
> Shawn
>
>


Re: Facet shows deleted values...

2015-12-29 Thread Tomás Fernández Löbbe
I believe the problem here is that terms from the deleted docs still appear
in the facets, even with a doc count of 0, is that it? Can you use
facet.mincount=1 or would that not be a good fit for your use case?

https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-Thefacet.mincountParameter
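
For example (a sketch; the field name is made up):

    ...&facet=true&facet.field=category&facet.mincount=1

With facet.mincount=1, terms that now only exist in deleted documents (and therefore have a count of 0) are dropped from the facet response.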

Tomás

On Tue, Dec 29, 2015 at 5:23 PM, Erick Erickson 
wrote:

> Let's be sure we're using terms similarly
>
> That article is from 2010, so is unreliable in the 5.2 world, I'd ignore
> that.
>
> First, facets should always reflect the latest commit, regardless of
> expungeDeletes or optimizes/forcemerges.
>
> _commits_ are definitely recommended. Optimize/forcemerge (or
> expungedeletes) are rarely necessary and
> should _not_ be necessary for facets to not count omitted documents.
>
> Is it possible that your autowarm period is long and you're still
> getting an old searcher when you run your tests?
>
> Assuming that you commit(), then wait a few minutes, do you see
> inaccurate facets? If so, what are the
> exact steps you follow?
>
> Best,
> Erick
>
> On Tue, Dec 29, 2015 at 12:54 PM, Don Bosco Durai 
> wrote:
> > I am purging some of my data on regular basis, but when I run a facet
> query, the deleted values are still shown in the facet list.
> >
> > Seems, commit with expunge resolves this issue (
> http://grokbase.com/t/lucene/solr-user/106313v302/deleted-documents-appearing-in-facet-fields
> ). But it seems, commit is no more recommended. Also, I am running Solr 5.2
> in SolrCloud mode.
> >
> > What is the recommendation here?
> >
> > Thanks
> >
> > Bosco
> >
> >
>


Re: Solr index segment level merge

2015-12-29 Thread Tomás Fernández Löbbe
Would collection aliases be an option (assuming you are using SolrCloud
mode)?


https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api4
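
A sketch of how that could work (collection and alias names are made up): keep the existing 100GB collection as-is, load each day's new documents into their own small collection, and have the application query an alias that spans all of them:

    http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=live-index&collections=docs_main,docs_20151230

A query alias can point at several collections at once, and re-running CREATEALIAS with the same name just repoints it, so adding the new day's data never requires copying the primary index.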

On Tue, Dec 29, 2015 at 9:21 PM, Erick Erickson 
wrote:

> Could you simply add the new documents to the current index?
>
> That aside, merging does not need to create a new core or a new
> folder. The form:
>
>
> admin/cores?action=mergeindexes&core=core0&indexDir=/opt/solr/core1/data/index&indexDir=/opt/solr/core2/data/index
>
> Should merge the indexes from the two directories into the pre-existing
> core's index.
>
> Best,
> Erick
>
> On Tue, Dec 29, 2015 at 9:00 PM, Walter Underwood 
> wrote:
> > You probably do not NEED to merge your indexes. Have you tried not
> merging the indexes?
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >> On Dec 29, 2015, at 7:31 PM, jeba earnest 
> wrote:
> >>
> >> I have a scenario that I need to merge the solr indexes online. I have a
> >> primary solr index of 100 Gb and it is serving the end users and it
> can't
> >> go offline for a moment. Everyday new lucene indexes(2 GB) are generated
> >> separately.
> >>
> >> I have tried coreadmin
> >> https://cwiki.apache.org/confluence/display/solr/Merging+Indexes
> >>
> >> And it will create a new core or new folder. which means it will copy
> 100Gb
> >> every time to a new folder.
> >>
> >> Is there a way I can do a segment level merging?
> >>
> >> Jeba
> >
>


Re: How turn on logging for segment merging

2015-11-01 Thread Tomás Fernández Löbbe
You can turn on "infoStream" from the solrconfig:
https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig#IndexConfiginSolrConfig-OtherIndexingSettings
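
For example, in solrconfig.xml (a minimal sketch):

    <indexConfig>
      <infoStream>true</infoStream>
    </indexConfig>

Lucene's IndexWriter will then log its merge and flush activity to the normal Solr log.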

Tomás

On Sun, Nov 1, 2015 at 8:59 AM, Pushkar Raste 
wrote:

> Is segment merging information logged at level finer than INFO? I have
> application setup with INFO level logging and I am indexing documents at
> rate of about few hundred a min. I am using default merge policy
> parameters. However I never see logs that can give me information about
> segment merging.
>
> Is there special operation I have to set to turn on segment merging
> information?
>
> -- Pushkar Raste
>


Re: SolrCloud Admin UI shows node is Down, but state.json says it's active/up

2015-09-08 Thread Tomás Fernández Löbbe
I believe this is expected in the current code. From Replica.State javadoc:


  /**
   * The replica's state. In general, if the node the replica is hosted on
is
   * not under {@code /live_nodes} in ZK, the replica's state should be
   * discarded.
   */
  public enum State {

/**
 * The replica is ready to receive updates and queries.
 * 
 * NOTE: when the node the replica is hosted on crashes, the
 * replica's state may remain ACTIVE in ZK. To determine if the replica
is
 * truly active, you must also verify that its {@link
Replica#getNodeName()
 * node} is under {@code /live_nodes} in ZK (or use
 * {@link ClusterState#liveNodesContain(String)}).
 * 
 */
ACTIVE,
...
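
So when checking this from SolrJ, something along these lines is needed rather than trusting the stored state alone (a rough sketch, assuming you already have a CloudSolrClient and a Replica in hand):

    ClusterState clusterState = cloudSolrClient.getZkStateReader().getClusterState();
    boolean trulyActive = replica.getState() == Replica.State.ACTIVE
        && clusterState.liveNodesContain(replica.getNodeName());

That is essentially what the Admin UI graph does by combining /live_nodes with state.json.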

On Tue, Sep 8, 2015 at 9:51 AM, Erick Erickson 
wrote:

> Arcadius:
>
> Hmmm. It may take a while for the cluster state to change, but I'm
> assuming that this state persists for minutes/hours/days.
>
> So to recap: If dump the entire ZK node from the root, you have
> 1> liveNodes has N nodes listed (correctly)
> 2> clusterstate.json has N+M nodes listed as "active"
>
> Doesn't sound right to me, but I'll have to let people who are deep
> into that code speculate from here.
>
> Best,
> Erick
>
> On Tue, Sep 8, 2015 at 1:13 AM, Arcadius Ahouansou 
> wrote:
> > On Sep 8, 2015 6:25 AM, "Erick Erickson" 
> wrote:
> >>
> >> Perhaps the browser cache? What happens if you, say, use
> >> Zookeeper client tools to bring down the the cluster state in
> >> question? Or perhaps just refresh the admin UI when showing
> >> the cluster status
> >>
> >
> > Hello Erick.
> >
> > Thank you very much for answering.
> > I did use the ZooInspector tool to check the state.json in all 5 zk nodes
> > and they are all out of date and identical to what I get through the tree
> > view in the Solr admin UI.
> >
> > Looking at the source code of cloud.js, which correctly displays nodes as
> > "gone" in the graph view, it calls the endpoint /zookeeper?wt=json and
> > relies on the live nodes to mark a node as down, instead of state.json.
> >
> > Thanks.
> >
> >> Shot in the dark,
> >> Erick
> >>
> >> On Mon, Sep 7, 2015 at 6:09 PM, Arcadius Ahouansou <
> arcad...@menelic.com>
> > wrote:
> >> > We are running the latest Solr 5.3.0
> >> >
> >> > Thanks.
>


Re: Order of hosts in zkHost

2015-09-04 Thread Tomás Fernández Löbbe
I believe Arcadius has a point, but I still think the answer is no.
ZooKeeper clients (Solr/SolrJ)  connect to a single ZooKeeper server
instance at a time, and keep that session open to that same server as long
as they can/need. During this time, all interactions between the client and
the ZK ensemble will be done to the same ZK server instance (yes, some
operations will require that server to talk with the leader, but not all,
reads are served locally for example). They will only switch to a different
ZooKeeper server instance if the connection is lost for some reason. If all
the clients are connected to the same ZK server, the load wouldn't be
evenly distributed.

However, according to ZooKeeper documentation [1] (and I haven't tested
this), ZK clients don't chose the servers from the connection string in
order:
"To create a client session the application code must provide a connection
string containing a comma separated list of host:port pairs, each
corresponding to a ZooKeeper server (e.g. "127.0.0.1:4545" or "
127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002"). The ZooKeeper client
library will pick an arbitrary server and try to connect to it."


Tomás

[1] http://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html


On Fri, Sep 4, 2015 at 9:12 AM, Erick Erickson 
wrote:

> Arcadius:
>
> Note that one of the more recent changes is "per collection states" in
> ZK. So rather
> than have one huge clusterstate.json that gets passed out to to all
> collection on any
> change, the listeners can now listen only to specific collections.
>
> Reduces the "thundering herd" problem.
>
> Best,
> Erick
>
> On Fri, Sep 4, 2015 at 12:39 AM, Arcadius Ahouansou
>  wrote:
> > Hello Shawn.
> > This question was raised because IMHO, apart from leader election, there
> > are other load-generating activities such as all 10 solrj
> > clients+solrCloudNodes listening to changes on
> clusterstate.json/state.json
> > and downloading the whole file in case there is a change... And this
> would
> > have  happened on zk1 only if we did not shuffle... That's the theory.
> > I could test this and see.
> > On Sep 4, 2015 6:27 AM, "Shawn Heisey"  wrote:
> >
> >> On 9/3/2015 9:47 PM, Arcadius Ahouansou wrote:
> >> > Let's say we have 10 SolrJ clients all configured with
> >> > zkhost=zk1:port,zk2:port,zk3:port
> >> >
> >> > For each of the 10 SolrJ clients, would it make a difference in term
> of
> >> > load on zk1 (the server on the list) if we shuffle around the order of
> >> the
> >> > ZK servers in zkHost or is it all the same?
> >> >
> >> > I would have thought that shuffling would lower load on zk1.
> >>
> >> I don't think this is going to make much difference.  Here's why,
> >> assuming that my understanding of how it all works is correct:
> >>
> >> One of the things zookeeper does is manage elections.  It helps figure
> >> out which member of a cluster is the leader.  I think Zookeeper uses
> >> this concept internally, too.  One of the hosts in the ensemble will be
> >> elected to be the leader, which accepts all input and replicates it to
> >> the other members of the cluster.  All of the clients will be talking to
> >> the leader first, no matter what order the hosts are listed.
> >>
> >> If my understanding of how this works is flawed, then what I just said
> >> is probably wrong.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>


Re: Lucene/Solr 5.0 and custom FieldCahe implementation

2015-08-31 Thread Tomás Fernández Löbbe
Sorry Jamie, I totally missed this email. There was no Jira that I could
find. I created SOLR-7996

On Sat, Aug 29, 2015 at 5:26 AM, Jamie Johnson <jej2...@gmail.com> wrote:

> This sounds like a good idea, I'm assuming I'd need to make my own
> UnInvertingReader (or subclass) to do this right?  Is there a way to do
> this on the 5.x codebase or would I still need the solrindexer factory work
> that Tomás mentioned previously?
>
> Tomás, is there a ticket for the SolrIndexer factory?  I'd like to follow
> it's work to know what version of 5.x (or later) I should be looking for
> this in.
>
> On Thu, Aug 27, 2015 at 1:06 PM, Yonik Seeley <ysee...@gmail.com> wrote:
>
> > UnInvertingReader makes indexed fields look like docvalues fields.
> > The caching itself is still done in FieldCache/FieldCacheImpl
> > but you could perhaps wrap what is cached there to either screen out
> > stuff or construct a new entry based on the user.
> >
> > -Yonik
> >
> >
> > On Thu, Aug 27, 2015 at 12:55 PM, Jamie Johnson <jej2...@gmail.com>
> wrote:
> > > I think a custom UnInvertingReader would work as I could skip the
> process
> > > of putting things in the cache.  Right now in Solr 4.x though I am
> > caching
> > > based but including the users authorities in the key of the cache so
> > we're
> > > not rebuilding the UnivertedField on every request.  Where in 5.x is
> the
> > > object actually cached?  Will this be possible in 5.x?
> > >
> > > On Thu, Aug 27, 2015 at 12:32 PM, Yonik Seeley <ysee...@gmail.com>
> > wrote:
> > >
> > >> The FieldCache has become implementation rather than interface, so I
> > >> don't think you're going to see plugins at that level (it's all
> > >> package protected now).
> > >>
> > >> One could either subclass or re-implement UnInvertingReader though.
> > >>
> > >> -Yonik
> > >>
> > >>
> > >> On Thu, Aug 27, 2015 at 12:09 PM, Jamie Johnson <jej2...@gmail.com>
> > wrote:
> > >> > Also in this vein I think that Lucene should support factories for
> the
> > >> > cache creation as described @
> > >> > https://issues.apache.org/jira/browse/LUCENE-2394.  I'm not
> endorsing
> > >> the
> > >> > patch that is provided (I haven't even looked at it) just the
> concept
> > in
> > >> > general.
> > >> >
> > >> > On Thu, Aug 27, 2015 at 12:01 PM, Jamie Johnson <jej2...@gmail.com>
> > >> wrote:
> > >> >
> > >> >> That makes sense, then I could extend the SolrIndexSearcher by
> > creating
> > >> a
> > >> >> different factory class that did whatever magic I needed.  If you
> > >> create a
> > >> >> Jira ticket for this please link it here so I can track it!  Again
> > >> thanks
> > >> >>
> > >> >> On Thu, Aug 27, 2015 at 11:59 AM, Tomás Fernández Löbbe <
> > >> >> tomasflo...@gmail.com> wrote:
> > >> >>
> > >> >>> I don't think there is a way to do this now. Maybe we should
> > separate
> > >> the
> > >> >>> logic of creating the SolrIndexSearcher to a factory. Moving this
> > logic
> > >> >>> away from SolrCore is already a win, plus it will make it easier
> to
> > >> unit
> > >> >>> test and extend for advanced use cases.
> > >> >>>
> > >> >>> Tomás
> > >> >>>
> > >> >>> On Wed, Aug 26, 2015 at 8:10 PM, Jamie Johnson <jej2...@gmail.com
> >
> > >> wrote:
> > >> >>>
> > >> >>> > Sorry to poke this again but I'm not following the last comment
> of
> > >> how I
> > >> >>> > could go about extending the solr index searcher and have the
> > >> extension
> > >> >>> > used.  Is there an example of this?  Again thanks
> > >> >>> >
> > >> >>> > Jamie
> > >> >>> > On Aug 25, 2015 7:18 AM, "Jamie Johnson" <jej2...@gmail.com>
> > wrote:
> > >> >>> >
> > >> >>> > > I had seen this as well, if I over wrote this by extending
> > >> >>> > > SolrIndexSearcher how do I have my extension used?  I didn't
> > see a
> > >> wa

Re: Lucene/Solr 5.0 and custom FieldCahe implementation

2015-08-27 Thread Tomás Fernández Löbbe
I don't think there is a way to do this now. Maybe we should separate the
logic of creating the SolrIndexSearcher to a factory. Moving this logic
away from SolrCore is already a win, plus it will make it easier to unit
test and extend for advanced use cases.

Tomás

On Wed, Aug 26, 2015 at 8:10 PM, Jamie Johnson jej2...@gmail.com wrote:

 Sorry to poke this again but I'm not following the last comment of how I
 could go about extending the solr index searcher and have the extension
 used.  Is there an example of this?  Again thanks

 Jamie
 On Aug 25, 2015 7:18 AM, Jamie Johnson jej2...@gmail.com wrote:

  I had seen this as well, if I over wrote this by extending
  SolrIndexSearcher how do I have my extension used?  I didn't see a way
 that
  could be plugged in.
  On Aug 25, 2015 7:15 AM, Mikhail Khludnev mkhlud...@griddynamics.com
  wrote:
 
  On Tue, Aug 25, 2015 at 2:03 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
   Thanks Mikhail.  If I'm reading the SimpleFacets class correctly, out
   delegates to DocValuesFacets when facet method is FC, what used to be
   FieldCache I believe.  DocValuesFacets either uses DocValues or builds
  then
   using the UninvertingReader.
  
 
  Ah.. got it. Thanks for reminding this details.It seems like even
  docValues=true doesn't help with your custom implementation.
 
 
  
   I am not seeing a clean extension point to add a custom
  UninvertingReader
   to Solr, would the only way be to copy the FacetComponent and
  SimpleFacets
   and modify as needed?
  
  Sadly, yes. There is no proper extension point. Also, consider
 overriding
  SolrIndexSearcher.wrapReader(SolrCore, DirectoryReader) where the
  particular UninvertingReader is created, there you can pass the own one,
  which refers to custom FieldCache.
 
 
   On Aug 25, 2015 12:42 AM, Mikhail Khludnev 
  mkhlud...@griddynamics.com
   wrote:
  
Hello Jamie,
I don't understand how it could choose DocValuesFacets (it occurs on
docValues=true) field, but then switches to
  UninvertingReader/FieldCache
which means docValues=false. If you can provide more details it
 would
  be
great.
Beside of that, I suppose you can only implement and inject your own
UninvertingReader, I don't think there is an extension point for
 this.
   It's
too specific requirement.
   
On Tue, Aug 25, 2015 at 3:50 AM, Jamie Johnson jej2...@gmail.com
   wrote:
   
 as mentioned in a previous email I have a need to provide security
controls
 at the term level.  I know that Lucene/Solr doesn't support this
 so
  I
   had
 baked something onto a 4.x baseline that was sufficient for my use
   cases.
 I am now looking to move that implementation to 5.x and am running
  into
an
 issue around faceting.  Previously we were able to provide a
 custom
   cache
 implementation that would create separate cache entries given a
particular
 set of security controls, but in Solr 5 some faceting is delegated
  to
 DocValuesFacets which delegates to UninvertingReader in my case
 (we
  are
not
 storing DocValues).  The issue I am running into is that before
 5.x
  I
   had
 the ability to influence the FieldCache that was used at the Solr
  level
to
 also include a security token into the key so each cache entry was
   scoped
 to a particular level.  With the current implementation the
  FieldCache
 seems to be an internal detail that I can't influence in anyway.
 Is
   this
 correct?  I had noticed this Jira ticket
 https://issues.apache.org/jira/browse/LUCENE-5427, is there any
   movement
 on
 this?  Is there another way to influence the information that is
 put
   into
 these caches?  As always thanks in advance for any suggestions.

 -Jamie

   
   
   
--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics
   
http://www.griddynamics.com
mkhlud...@griddynamics.com
   
  
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  http://www.griddynamics.com
  mkhlud...@griddynamics.com
 
 



Re: SolrCloud Core Reload

2015-04-17 Thread Tomás Fernández Löbbe
Optimize will be distributed to all shards/replicas.
I believe reload will only reload the specific core. For reloading the
complete collection use the Collections API:
https://cwiki.apache.org/confluence/display/solr/Collections+API
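
For example (a sketch; the collection name is a placeholder):

    http://localhost:8983/solr/admin/collections?action=RELOAD&name=collection1

This reloads every core of that collection on all nodes, while the core-level reload button in the Admin UI only reloads the single core you selected.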


On Thu, Apr 16, 2015 at 5:15 PM, Vincenzo D'Amore v.dam...@gmail.com
wrote:

 Hi all,

 I have a SolrCloud cluster with 3 servers and there are many cores.
 Using the SolrCloud Admin UI Core page, if I execute a core optimize (or
 reload), will all the cores in the cluster be optimized or reloaded, or
 only the selected core?

 Best regards,
 Vincenzo



Re: 5.1 'unique' facet function / calcDistinct

2015-04-17 Thread Tomás Fernández Löbbe

 II. Is there a way to use the stats.calcdistinct functionality and only
 return the countDistinct portion of the response and not the full list of
 distinct values -- as provided in the distinctValues portion of the
 response. In a field with high cardinality the response size becomes too
 large.


I don't think this is currently supported.


 If there is no such option, could someone point me in the right direction
 for implementing a custom solution?


The problem is how to calculate this in distributed requests. Even if the
final response doesn't include the distinct values, the shard responses
will probably have to.

Look at StatsComponent.java and AbstractStatsValues in
StatsValuesFactory.java

Tomás



 Thank you for your time,
 Levan



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/5-1-unique-facet-function-calcDistinct-tp4200110.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Range facets in sharded search

2015-04-16 Thread Tomás Fernández Löbbe
This looks like a bug. The logic to merge range facets from shards seems to
only be merging counts, not the first level elements.
Could you create a Jira?

On Thu, Apr 16, 2015 at 2:38 PM, Will Miller wmil...@fbbrands.com wrote:

 I am seeing some odd behavior with range facets across multiple
 shards. When querying each node directly with distrib=false the facet
 returned matches what is expected. When doing the same query against the
 collection and it spans the two shards, the facet after and between buckets
 are wrong.


 I can re-create a similar problem using the out of the box example scripts
 and data. I am running on Windows and tested both Solr 5.0.0 and 5.1.0.
 This is the steps to reproduce:


 c:\solr-5.1.0\solr -e cloud

 These are the selections I made:


 (specify 1-4 nodes) [2]: 2
 Please enter the port for node1 [8983]: 8983
 Please enter the port for node2 [7574]: 7574
 Please provide a name for your new collection: [gettingstarted]
 gettingstarted
 How many shards would you like to split gettingstarted into? [2] 2
 How many replicas per shard would you like to create? [2] 1
 Please choose a configuration ...  [data_driven_schema_configs]
 sample_techproducts_configs


 I then posted some of the sample XMLs:

 C:\solr-5.1.0\example\exampledocs java -Dc=gettingstarted -jar post.jar
 vidcard.xml, hd.xml, ipod_other.xml, ipod_video.xml, mem.xml, monitor.xml,
 monitor2.xml,mp500.xml, sd500.xml


 This first query is against node1 with distrib=false:


 http://localhost:8983/solr/gettingstarted/select/?q=*:*&wt=json&indent=true&distrib=false&facet=true&facet.range=price&f.price.facet.range.start=0.00&f.price.facet.range.end=100.00&f.price.facet.range.gap=20&f.price.facet.range.other=all&defType=edismax&q.op=AND

 There are 7 Results (results ommited).
 facet_ranges:{
   price:{
 counts:[
   0.0,1,
   20.0,0,
   40.0,0,
   60.0,0,
   80.0,1],
 gap:20.0,
 start:0.0,
 end:100.0,
 before:0,
 after:5,
 between:2}},


 This second query is against node2 with distrib=false:

 http://localhost:7574/solr/gettingstarted/select/?q=*:*&wt=json&indent=true&distrib=false&facet=true&facet.range=price&f.price.facet.range.start=0.00&f.price.facet.range.end=100.00&f.price.facet.range.gap=20&f.price.facet.range.other=all&defType=edismax&q.op=AND

 7 Results (one product does not have a price):
 facet_ranges:{
   price:{
 counts:[
   0.0,1,
   20.0,0,
   40.0,0,
   60.0,1,
   80.0,0],
 gap:20.0,
 start:0.0,
 end:100.0,
 before:0,
 after:4,
 between:2}},


 Finally querying the entire collection:

 http://localhost:7574/solr/gettingstarted/select/?q=*:*&wt=json&indent=true&facet=true&facet.range=price&f.price.facet.range.start=0.00&f.price.facet.range.end=100.00&f.price.facet.range.gap=20&f.price.facet.range.other=all&defType=edismax&q.op=AND

 14 results (one without a price range):
 facet_ranges:{
   price:{
 counts:[
   0.0,2,
   20.0,0,
   40.0,0,
   60.0,1,
   80.0,1],
 gap:20.0,
 start:0.0,
 end:100.0,
 before:0,
 after:5,
 between:2}},


 Notice that both the after and the between are wrong here. The actual
 buckets do correctly represent the right values but I would expect
 between to be 5 and after to be 13.


 There appears to be a recently fixed issue (
 https://issues.apache.org/jira/browse/SOLR-6154) with range facet in
 distributed queries but it was related to buckets not always appearing with
 mincount=1 for the field. This looks like it is a different problem.


 Anyone have any suggestions or notice anythign wrong with my query
 parameters? I can open a Jira ticket but wanted to run it by the larger
 audience first to see if I am missing anything obvious.


 Thanks,

 Will



Re: Range facets in sharded search

2015-04-16 Thread Tomás Fernández Löbbe
Should be fixed in 5.2. See https://issues.apache.org/jira/browse/SOLR-7412

On Thu, Apr 16, 2015 at 3:18 PM, Tomás Fernández Löbbe 
tomasflo...@gmail.com wrote:

 This looks like a bug. The logic to merge range facets from shards seems
 to only be merging counts, not the first level elements.
 Could you create a Jira?

 On Thu, Apr 16, 2015 at 2:38 PM, Will Miller wmil...@fbbrands.com wrote:

 I am seeing some odd behavior with range facets across multiple
 shards. When querying each node directly with distrib=false the facet
 returned matches what is expected. When doing the same query against the
 collection and it spans the two shards, the facet after and between buckets
 are wrong.


 I can re-create a similar problem using the out of the box example
 scripts and data. I am running on Windows and tested both Solr 5.0.0 and
 5.1.0. This is the steps to reproduce:


 c:\solr-5.1.0\solr -e cloud

 These are the selections I made:


 (specify 1-4 nodes) [2]: 2
 Please enter the port for node1 [8983]: 8983
 Please enter the port for node2 [7574]: 7574
 Please provide a name for your new collection: [gettingstarted]
 gettingstarted
 How many shards would you like to split gettingstarted into? [2] 2
 How many replicas per shard would you like to create? [2] 1
 Please choose a configuration ...  [data_driven_schema_configs]
 sample_techproducts_configs


 I then posted some of the sample XMLs:

 C:\solr-5.1.0\example\exampledocs java -Dc=gettingstarted -jar post.jar
 vidcard.xml, hd.xml, ipod_other.xml, ipod_video.xml, mem.xml, monitor.xml,
 monitor2.xml,mp500.xml, sd500.xml


 This first query is against node1 with distrib=false:


 http://localhost:8983/solr/gettingstarted/select/?q=*:*&wt=json&indent=true&distrib=false&facet=true&facet.range=price&f.price.facet.range.start=0.00&f.price.facet.range.end=100.00&f.price.facet.range.gap=20&f.price.facet.range.other=all&defType=edismax&q.op=AND

 There are 7 Results (results ommited).
 facet_ranges:{
   price:{
 counts:[
   0.0,1,
   20.0,0,
   40.0,0,
   60.0,0,
   80.0,1],
 gap:20.0,
 start:0.0,
 end:100.0,
 before:0,
 after:5,
 between:2}},


 This second query is against node2 with distrib=false:

 http://localhost:7574/solr/gettingstarted/select/?q=*:*&wt=json&indent=true&distrib=false&facet=true&facet.range=price&f.price.facet.range.start=0.00&f.price.facet.range.end=100.00&f.price.facet.range.gap=20&f.price.facet.range.other=all&defType=edismax&q.op=AND

 7 Results (one product does not have a price):
 facet_ranges:{
   price:{
 counts:[
   0.0,1,
   20.0,0,
   40.0,0,
   60.0,1,
   80.0,0],
 gap:20.0,
 start:0.0,
 end:100.0,
 before:0,
 after:4,
 between:2}},


 Finally querying the entire collection:

 http://localhost:7574/solr/gettingstarted/select/?q=*:*&wt=json&indent=true&facet=true&facet.range=price&f.price.facet.range.start=0.00&f.price.facet.range.end=100.00&f.price.facet.range.gap=20&f.price.facet.range.other=all&defType=edismax&q.op=AND

 14 results (one without a price range):
 facet_ranges:{
   price:{
 counts:[
   0.0,2,
   20.0,0,
   40.0,0,
   60.0,1,
   80.0,1],
 gap:20.0,
 start:0.0,
 end:100.0,
 before:0,
 after:5,
 between:2}},


 Notice that both the after and the between are wrong here. The actual
 buckets do correctly represent the right values but I would expect
 between to be 5 and after to be 13.


 There appears to be a recently fixed issue (
 https://issues.apache.org/jira/browse/SOLR-6154) with range facet in
 distributed queries but it was related to buckets not always appearing with
 mincount=1 for the field. This looks like it is a different problem.


 Anyone have any suggestions or notice anythign wrong with my query
 parameters? I can open a Jira ticket but wanted to run it by the larger
 audience first to see if I am missing anything obvious.


 Thanks,

 Will





Re: Suggester Example In Documentation Not Working

2015-01-22 Thread Tomás Fernández Löbbe
I see that the docs say the field only needs to be indexed, but for the
Fuzzy or Analyzed lookups, I think the field needs to be stored. On the other hand,
I'm not sure how much sense it makes to use either of those two implementations if
the field type you want to use is string.
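
So the first thing to try (a sketch of the change) would be to make the source field stored and reindex, since DocumentDictionaryFactory builds the dictionary from the stored values of your documents:

    <field name="sugg_allText" type="string" indexed="true" stored="true" multiValued="true"/>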

Tomás

On Thu, Jan 22, 2015 at 8:14 AM, Charles Sanders csand...@redhat.com
wrote:

 Attempting to follow the documentation found here:
 https://cwiki.apache.org/confluence/display/solr/Suggester

 The example given in the documentation is not working. See below my
 configuration. I only changed the field names to those in my schema. Can
 anyone provide an example for this component that actually works?

 <searchComponent name="suggest" class="solr.SuggestComponent">
   <lst name="suggester">
     <str name="name">mySuggester</str>
     <str name="lookupImpl">FuzzyLookupFactory</str>
     <str name="dictionaryImpl">DocumentDictionaryFactory</str>
     <str name="field">sugg_allText</str>
     <str name="weightField">suggestWeight</str>
     <str name="suggestAnalyzerFieldType">string</str>
   </lst>
 </searchComponent>

 <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
   <lst name="defaults">
     <str name="suggest">true</str>
     <str name="suggest.count">10</str>
     <str name="suggest.build">true</str>
   </lst>
   <arr name="components">
     <str>suggest</str>
   </arr>
 </requestHandler>

 <field name="sugg_allText" type="string" indexed="true" multiValued="true"
 stored="false"/>
 <field name="suggestWeight" type="long" indexed="true" stored="true"
 default="1" />



 http://localhost:/solr/collection1/suggest?suggest=true&suggest.build=true&suggest.dictionary=mySuggester&wt=json&suggest.q=kern


 {"responseHeader":{"status":0,"QTime":4},"command":"build","suggest":{"mySuggester":{"kern":{"numFound":0,"suggestions":[]}}}}



Re: Slow faceting performance on a docValues field

2015-01-13 Thread Tomás Fernández Löbbe
Range faceting won't use the DocValues even if they are set; it
translates each gap to a filter. This means that it will end up using the
FilterCache, which should cause faster followup queries if you repeat the
same gaps (and don't commit).
You may also want to try interval faceting, it will use DocValues instead
of filters. The API is different, you'll have to provide the intervals
yourself.
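
A request would look something like this (a sketch; the interval bounds are made up, and each interval has to be spelled out explicitly):

    ...&facet=true&facet.interval=eventDate&f.eventDate.facet.interval.set=[2015-01-01T00:00:00Z,2015-01-08T00:00:00Z)&f.eventDate.facet.interval.set=[2015-01-08T00:00:00Z,2015-01-15T00:00:00Z)

A square bracket makes the endpoint inclusive and a parenthesis makes it exclusive, and the counts are computed from the DocValues of the field rather than one filter per gap.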

Tomás

On Tue, Jan 13, 2015 at 10:01 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 1/13/2015 10:35 AM, David Smith wrote:
  I have a query against a single 50M doc index (175GB) using Solr 4.10.2,
 that exhibits the following response times (via the debugQuery option in
 Solr Admin):
  process: {
   time: 24709,
   query: { time: 54 }, facet: { time: 24574 },
 
 
  The query time of 54ms is great and exactly as expected -- this example
 was a single-term search that returned 3 hits.
  I am trying to get the facet time (24.5 seconds) to be sub-second, and
 am having no luck.  The facet part of the query is as follows:
 
  params: { facet.range: eventDate,
   f.eventDate.facet.range.end: 2015-05-13T16:37:18.000Z,
   f.eventDate.facet.range.gap: +1DAY,
   start: 0,
 
   rows: 10,
 
   f.eventDate.facet.range.start: 2005-03-13T16:37:18.000Z,
 
   f.eventDate.facet.mincount: 1,
 
   facet: true,
 
   debugQuery: true,
   _: 1421169383802
   }
 
  And, the relevant schema definition is as follows:
 
 field name=eventDate type=tdate indexed=true stored=true
 multiValued=false docValues=true/
 
  !-- A Trie based date field for faster date range queries and date
 faceting. --
  fieldType name=tdate class=solr.TrieDateField precisionStep=6
 positionIncrementGap=0/
 
 
  During the 25-second query, the Solr JVM pegs one CPU, with little or no
 I/O activity detected on the drive that holds the 175GB index.  I have 48GB
 of RAM, 1/2 of that dedicated to the OS and the other to the Solr JVM.
 
  I do NOT have any fieldValue caches configured as yet, because my
 (perhaps too simplistic?) reading of the documentation was that DocValues
 eliminates the need for a field-level cache on this facet field.

 24GB of RAM to cache 175GB is probably not enough in the general case,
 but if you're seeing very little disk I/O activity for this query, then
 we'll leave that alone and you can worry about it later.

 What I would try immediately is setting the facet.method parameter to
 enum and seeing what that does to the facet time.  I've had good luck
 generally with that, even in situations where the docs indicated that
 the default (fc) was supposed to work better.  I have never explored the
 relationship between facet.method and docValues, though.

 I'm out of ideas after this.  I don't have enough experience with
 faceting to help much.

 Thanks,
 Shawn




Re: Slow faceting performance on a docValues field

2015-01-13 Thread Tomás Fernández Löbbe
Just a side question: in your first example you have dates set with time,
but in the second (where you set intervals) time is not set.
Is this something that could be resolved by having a field that only stores the date
(without time), and then using regular field faceting with facet.sort=index?
If that's possible in your use case, it may be faster.
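
Something like this is what I have in mind (a sketch; "eventDay" is a new field you would have to populate at index time with the day-truncated date):

    <field name="eventDay" type="string" indexed="true" stored="false" docValues="true"/>

    ...&facet=true&facet.field=eventDay&facet.sort=index&facet.mincount=1

With one term per day, field faceting only returns buckets for days that actually have documents, and facet.sort=index keeps them in chronological order.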

Tomás

On Tue, Jan 13, 2015 at 11:12 AM, Tomás Fernández Löbbe 
tomasflo...@gmail.com wrote:

 No, you are not misreading, right now there is no automatic way of
 generating the intervals on the server side similar to range faceting... I
 guess it won't work in your case. Maybe you should create a Jira to add
 this feature to interval faceting.

 Tomás

 On Tue, Jan 13, 2015 at 10:44 AM, David Smith 
 dsmiths...@yahoo.com.invalid wrote:

 Tomás,


 Thanks for the response -- the performance of my query makes perfect
 sense in light of your information.
 I looked at Interval faceting.  My required interval is 1 day.  I cannot
 change that requirement.  Unless I am mis-reading the doc, that means to
 facet a 10 year range, the query needs to specify over 3,600 intervals ??


 f.eventDate.facet.interval.set=[2005-01-01T00:00:00.000Z,2005-01-01T23:59:59.999Z]&f.eventDate.facet.interval.set=[2005-01-02T00:00:00.000Z,2005-01-02T23:59:59.999Z]&etc,etc


 Each query would be 185MB in size if I structure it this way.

 I assume I must be mis-understanding how to use Interval faceting with
 dates.  Are there any concrete examples you know of?  A google search did
 not come up with much.

 Kind regards,
 Dave

  On Tuesday, January 13, 2015 12:16 PM, Tomás Fernández Löbbe 
 tomasflo...@gmail.com wrote:


  Range Faceting won't use the DocValues even if they are there set, it
 translates each gap to a filter. This means that it will end up using the
 FilterCache, which should cause faster followup queries if you repeat the
 same gaps (and don't commit).
 You may also want to try interval faceting, it will use DocValues instead
 of filters. The API is different, you'll have to provide the intervals
 yourself.

 Tomás

 On Tue, Jan 13, 2015 at 10:01 AM, Shawn Heisey apa...@elyograg.org
 wrote:

  On 1/13/2015 10:35 AM, David Smith wrote:
   I have a query against a single 50M doc index (175GB) using Solr
 4.10.2,
  that exhibits the following response times (via the debugQuery option in
  Solr Admin):
   process: {
time: 24709,
query: { time: 54 }, facet: { time: 24574 },
  
  
   The query time of 54ms is great and exactly as expected -- this
 example
  was a single-term search that returned 3 hits.
   I am trying to get the facet time (24.5 seconds) to be sub-second, and
  am having no luck.  The facet part of the query is as follows:
  
   params: { facet.range: eventDate,
f.eventDate.facet.range.end: 2015-05-13T16:37:18.000Z,
f.eventDate.facet.range.gap: +1DAY,
start: 0,
  
rows: 10,
  
f.eventDate.facet.range.start: 2005-03-13T16:37:18.000Z,
  
f.eventDate.facet.mincount: 1,
  
facet: true,
  
debugQuery: true,
_: 1421169383802
}
  
   And, the relevant schema definition is as follows:
  
  field name=eventDate type=tdate indexed=true stored=true
  multiValued=false docValues=true/
  
  !-- A Trie based date field for faster date range queries and date
  faceting. --
  fieldType name=tdate class=solr.TrieDateField
 precisionStep=6
  positionIncrementGap=0/
  
  
   During the 25-second query, the Solr JVM pegs one CPU, with little or
 no
  I/O activity detected on the drive that holds the 175GB index.  I have
 48GB
  of RAM, 1/2 of that dedicated to the OS and the other to the Solr JVM.
  
   I do NOT have any fieldValue caches configured as yet, because my
  (perhaps too simplistic?) reading of the documentation was that
 DocValues
  eliminates the need for a field-level cache on this facet field.
 
  24GB of RAM to cache 175GB is probably not enough in the general case,
  but if you're seeing very little disk I/O activity for this query, then
  we'll leave that alone and you can worry about it later.
 
  What I would try immediately is setting the facet.method parameter to
  enum and seeing what that does to the facet time.  I've had good luck
  generally with that, even in situations where the docs indicated that
  the default (fc) was supposed to work better.  I have never explored the
  relationship between facet.method and docValues, though.
 
  I'm out of ideas after this.  I don't have enough experience with
  faceting to help much.
 
  Thanks,
  Shawn
 
 







Re: Slow faceting performance on a docValues field

2015-01-13 Thread Tomás Fernández Löbbe
No, you are not misreading, right now there is no automatic way of
generating the intervals on the server side similar to range faceting... I
guess it won't work in your case. Maybe you should create a Jira to add
this feature to interval faceting.

Tomás

On Tue, Jan 13, 2015 at 10:44 AM, David Smith dsmiths...@yahoo.com.invalid
wrote:

 Tomás,


 Thanks for the response -- the performance of my query makes perfect sense
 in light of your information.
 I looked at Interval faceting.  My required interval is 1 day.  I cannot
 change that requirement.  Unless I am mis-reading the doc, that means to
 facet a 10 year range, the query needs to specify over 3,600 intervals ??


  f.eventDate.facet.interval.set=[2005-01-01T00:00:00.000Z,2005-01-01T23:59:59.999Z]&f.eventDate.facet.interval.set=[2005-01-02T00:00:00.000Z,2005-01-02T23:59:59.999Z]&etc,etc


 Each query would be 185MB in size if I structure it this way.

 I assume I must be mis-understanding how to use Interval faceting with
 dates.  Are there any concrete examples you know of?  A google search did
 not come up with much.

 Kind regards,
 Dave

  On Tuesday, January 13, 2015 12:16 PM, Tomás Fernández Löbbe 
 tomasflo...@gmail.com wrote:


  Range Faceting won't use the DocValues even if they are there set, it
 translates each gap to a filter. This means that it will end up using the
 FilterCache, which should cause faster followup queries if you repeat the
 same gaps (and don't commit).
 You may also want to try interval faceting, it will use DocValues instead
 of filters. The API is different, you'll have to provide the intervals
 yourself.

 Tomás

 On Tue, Jan 13, 2015 at 10:01 AM, Shawn Heisey apa...@elyograg.org
 wrote:

  On 1/13/2015 10:35 AM, David Smith wrote:
   I have a query against a single 50M doc index (175GB) using Solr
 4.10.2,
  that exhibits the following response times (via the debugQuery option in
  Solr Admin):
   process: {
time: 24709,
query: { time: 54 }, facet: { time: 24574 },
  
  
   The query time of 54ms is great and exactly as expected -- this example
  was a single-term search that returned 3 hits.
   I am trying to get the facet time (24.5 seconds) to be sub-second, and
  am having no luck.  The facet part of the query is as follows:
  
   params: { facet.range: eventDate,
f.eventDate.facet.range.end: 2015-05-13T16:37:18.000Z,
f.eventDate.facet.range.gap: +1DAY,
start: 0,
  
rows: 10,
  
f.eventDate.facet.range.start: 2005-03-13T16:37:18.000Z,
  
f.eventDate.facet.mincount: 1,
  
facet: true,
  
debugQuery: true,
_: 1421169383802
}
  
   And, the relevant schema definition is as follows:
  
  field name=eventDate type=tdate indexed=true stored=true
  multiValued=false docValues=true/
  
  !-- A Trie based date field for faster date range queries and date
  faceting. --
  fieldType name=tdate class=solr.TrieDateField precisionStep=6
  positionIncrementGap=0/
  
  
   During the 25-second query, the Solr JVM pegs one CPU, with little or
 no
  I/O activity detected on the drive that holds the 175GB index.  I have
 48GB
  of RAM, 1/2 of that dedicated to the OS and the other to the Solr JVM.
  
   I do NOT have any fieldValue caches configured as yet, because my
  (perhaps too simplistic?) reading of the documentation was that DocValues
  eliminates the need for a field-level cache on this facet field.
 
  24GB of RAM to cache 175GB is probably not enough in the general case,
  but if you're seeing very little disk I/O activity for this query, then
  we'll leave that alone and you can worry about it later.
 
  What I would try immediately is setting the facet.method parameter to
  enum and seeing what that does to the facet time.  I've had good luck
  generally with that, even in situations where the docs indicated that
  the default (fc) was supposed to work better.  I have never explored the
  relationship between facet.method and docValues, though.
 
  I'm out of ideas after this.  I don't have enough experience with
  faceting to help much.
 
  Thanks,
  Shawn
 
 





Re: Slow faceting performance on a docValues field

2015-01-13 Thread Tomás Fernández Löbbe
fc, fcs and enum only apply for field faceting, not range faceting.

Tomás

On Tue, Jan 13, 2015 at 11:24 AM, David Smith dsmiths...@yahoo.com.invalid
wrote:

 What is stumping me is that the search result has 3 hits, yet faceting
 those 3 hits takes 24 seconds.  The documentation for facet.method=fc is
 quite explicit about how Solr does faceting:


 fc (stands for Field Cache) The facet counts are calculated by iterating
 over documents that match the query and summing the terms that appear in
 each document. This was the default method for single valued fields prior
 to Solr 1.4.

 If a search yielded millions of hits, I could understand 24 seconds to
 calculate the facets.  But not for a search with only 3 hits.


 What am I missing?

 Regards,
 David





  On Tuesday, January 13, 2015 1:12 PM, Tomás Fernández Löbbe 
 tomasflo...@gmail.com wrote:


  No, you are not misreading, right now there is no automatic way of
 generating the intervals on the server side similar to range faceting... I
 guess it won't work in your case. Maybe you should create a Jira to add
 this feature to interval faceting.

 Tomás

 On Tue, Jan 13, 2015 at 10:44 AM, David Smith dsmiths...@yahoo.com.invalid
 
 wrote:

  Tomás,
 
 
  Thanks for the response -- the performance of my query makes perfect
 sense
  in light of your information.
  I looked at Interval faceting.  My required interval is 1 day.  I cannot
  change that requirement.  Unless I am mis-reading the doc, that means to
  facet a 10 year range, the query needs to specify over 3,600 intervals ??
 
 
 
  f.eventDate.facet.interval.set=[2005-01-01T00:00:00.000Z,2005-01-01T23:59:59.999Z]&f.eventDate.facet.interval.set=[2005-01-02T00:00:00.000Z,2005-01-02T23:59:59.999Z]&etc,etc
 
 
  Each query would be 185MB in size if I structure it this way.
 
  I assume I must be mis-understanding how to use Interval faceting with
  dates.  Are there any concrete examples you know of?  A google search did
  not come up with much.
 
  Kind regards,
  Dave
 
   On Tuesday, January 13, 2015 12:16 PM, Tomás Fernández Löbbe 
  tomasflo...@gmail.com wrote:
 
 
   Range Faceting won't use the DocValues even if they are there set, it
  translates each gap to a filter. This means that it will end up using the
  FilterCache, which should cause faster followup queries if you repeat the
  same gaps (and don't commit).
  You may also want to try interval faceting, it will use DocValues instead
  of filters. The API is different, you'll have to provide the intervals
  yourself.
 
  Tomás
 
  On Tue, Jan 13, 2015 at 10:01 AM, Shawn Heisey apa...@elyograg.org
  wrote:
 
   On 1/13/2015 10:35 AM, David Smith wrote:
I have a query against a single 50M doc index (175GB) using Solr
  4.10.2,
   that exhibits the following response times (via the debugQuery option
 in
   Solr Admin):
process: {
 time: 24709,
 query: { time: 54 }, facet: { time: 24574 },
   
   
The query time of 54ms is great and exactly as expected -- this
 example
   was a single-term search that returned 3 hits.
I am trying to get the facet time (24.5 seconds) to be sub-second,
 and
   am having no luck.  The facet part of the query is as follows:
   
params: { facet.range: eventDate,
 f.eventDate.facet.range.end: 2015-05-13T16:37:18.000Z,
 f.eventDate.facet.range.gap: +1DAY,
 start: 0,
   
 rows: 10,
   
 f.eventDate.facet.range.start: 2005-03-13T16:37:18.000Z,
   
 f.eventDate.facet.mincount: 1,
   
 facet: true,
   
 debugQuery: true,
 _: 1421169383802
 }
   
And, the relevant schema definition is as follows:
   
   field name=eventDate type=tdate indexed=true stored=true
   multiValued=false docValues=true/
   
   !-- A Trie based date field for faster date range queries and
 date
   faceting. --
   fieldType name=tdate class=solr.TrieDateField
 precisionStep=6
   positionIncrementGap=0/
   
   
During the 25-second query, the Solr JVM pegs one CPU, with little or
  no
   I/O activity detected on the drive that holds the 175GB index.  I have
  48GB
   of RAM, 1/2 of that dedicated to the OS and the other to the Solr JVM.
   
I do NOT have any fieldValue caches configured as yet, because my
   (perhaps too simplistic?) reading of the documentation was that
 DocValues
   eliminates the need for a field-level cache on this facet field.
  
   24GB of RAM to cache 175GB is probably not enough in the general case,
   but if you're seeing very little disk I/O activity for this query, then
   we'll leave that alone and you can worry about it later.
  
   What I would try immediately is setting the facet.method parameter to
   enum and seeing what that does to the facet time.  I've had good luck
   generally with that, even in situations where the docs indicated that
   the default (fc) was supposed to work better.  I have never explored
 the
   relationship between facet.method and docValues, though.
  
   I'm out of ideas after this.  I

Re: import solr source to eclipse

2014-10-12 Thread Tomás Fernández Löbbe
The way I do this:
From a terminal:
svn checkout https://svn.apache.org/repos/asf/lucene/dev/trunk/
lucene-solr-trunk
cd lucene-solr-trunk
ant eclipse

... And then, from your Eclipse import existing java project, and select
the directory where you placed lucene-solr-trunk

On Sun, Oct 12, 2014 at 7:09 AM, Ali Nazemian alinazem...@gmail.com wrote:

 Hi,
 I am going to import solr source code to eclipse for some development
 purpose. Unfortunately every tutorial that I found for this purpose is
 outdated and did not work. So would you please give me some hint about how
 can I import solr source code to eclipse?
 Thank you very much.

 --
 A.Nazemian



Re: Turn off suggester

2014-09-25 Thread Tomás Fernández Löbbe
The SuggestComponent is not in the default components list. There must be a
request handler with this component added explicitly in the solrconfig.xml

Tomás

On Thu, Sep 25, 2014 at 12:22 PM, Alexandre Rafalovitch arafa...@gmail.com
wrote:

 Isn't it one of the Solr components? Can it be just removed from the
 default chain? Random poking in the dark here.
 Personal: http://www.outerthoughts.com/ and @arafalov
 Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
 Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


 On 25 September 2014 10:45, Erick Erickson erickerick...@gmail.com
 wrote:
  Well, tell us more about the suggester configuration, the number
  of unique terms in the field you're using, what version of Solr, etc.
 
  As Hoss says, details matter.
 
  Best,
  Erick
 
  On Thu, Sep 25, 2014 at 4:18 AM, PeriS peri.subrahma...@htcinc.com
 wrote:
 
  Is there a way to turn off the solr suggester? I have about 30M records
  and when tomcat starts up, it takes a long time (~10 minutes) for the
  suggester to decompress the data or it's doing something, as it hangs on
  SolrSuggester.build(); Any ideas please?
 
  Thanks
  -Peri
 
 
 
  *** DISCLAIMER *** This is a PRIVATE message. If you are not the
 intended
  recipient, please delete without copying and kindly advise us by e-mail
 of
  the mistake in delivery.
  NOTE: Regardless of content, this e-mail shall not operate to bind HTC
  Global Services to any order or other contract unless pursuant to
 explicit
  written agreement or government initiative expressly permitting the use
 of
  e-mail for such purpose.
 
 
 



Re: Solr Faceting issue

2014-08-04 Thread Tomás Fernández Löbbe
If I understand correctly, you are looking for multi-select faceting:
https://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParams
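
To make that concrete, a minimal SolrJ sketch of the tag/exclude syntax from that
wiki page, using the field names from this thread (the Solr URL is a placeholder):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MultiSelectFacetExample {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8080/solr/collection1");
    SolrQuery q = new SolrQuery("solr facets");
    // tag each filter so it can be excluded when its own field is faceted
    q.addFilterQuery("{!tag=proj}project:\"Solr\"");
    q.addFilterQuery("{!tag=type}type:\"mail # user\"");
    q.setFacet(true);
    q.setFacetMinCount(1);
    // excluding the tagged filter keeps the counts of the non-selected values
    q.addFacetField("{!ex=proj}project", "{!ex=type}type");
    QueryResponse rsp = server.query(q);
    System.out.println(rsp.getFacetFields());
    server.shutdown();
  }
}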


On Mon, Aug 4, 2014 at 9:46 PM, Smitha Rajiv smitharaji...@gmail.com
wrote:

 Hi Solr Experts,


 Request you to please help me in fixing the below facets problem. Thanks in
 advance. Also pls let me know if you do not understand any part of my
 explanation below.



 How do I facet on more than two categories (let’s say ‘project’ and ‘type’
 as discussed below) at the same time to get combination facets and their
 count ?

 When you open URL,http://search-lucene.com/?q=facets you can see the
 facets on right hand side as 'Project','type','date','author' and their
 corresponding values with count in brackets.



 For instance, let’s say you select *'solr(3366)'* under 'Project' facet,
 still I can see other values under 'Project' facet like ElasticSearch etc.
 along with their respective count.

 Project:

 solr(3366) -- selected

 ElasticSearch (1650)

 Lucene (1255)

 Lucene.Net (43)

 Nutch (20)

 PyLucene (17)

 Mahout (16)

 ManifoldCF (8)

 Tika (4)

 OpenRelevance (3)

 Lucy (2)

 type:

 mail # user (2791)

 issue (303)

 mail # dev (134)

 source code (82)

 javadoc (37)

 wiki (36)

 web site (2)



 2. Further when I  Select 'mail # user(2791)' under “type” section , again
 I can see other values under “type” section with their corresponding count
 in brackets and their corresponding values in “Project” facet gets changed
 accordingly (namely the count ).

 project:

 Solr (2784) -- selected

 ElasticSearch (1056)

 Lucene (237)

 Lucene.Net (24)

 Nutch (14)

 Mahout (10)

 ManifoldCF (4)

 Lucy (2)

 OpenRelevance (1)

 type

 mail # user (2791) -- selected

 issue (303)

 mail # dev (134)

 source code (82)

 javadoc (37)

 wiki (36)

 web site (2)



   Observe how solr(3366) changed to   Solr (2784) post selection of mail #
 user along with the other values of ‘Project’ (like ElasticSearch etc.) and
 ‘type’ (issue, javadoc etc.,) with a change in their count values.



 I want to achieve similar working functionality.  Can you pls let me know
 if the below query is in the correct direction. Pls let me know if I have
 to modify this. if yes, what and how. Probably an explanation on why would
 do a huge help.




  http://localhost:8080/solr/collection1/select?q=solr%20facets&fq=Project%3A(%22solr%22)&fq=type%3A(%22mailhashuser%22)&facet=true&facet.mincount=1
  &facet.field=project&facet.field=type&wt=json&indent=true&defType=edismax
  &json.nl=map



 If the above query is not in the correct direction. Pls help in
 constructing the same. Thanks in advance.


 Regards,

 Smitha



Re: Memory leak for debugQuery?

2014-07-16 Thread Tomás Fernández Löbbe
Also, is this trunk? Solr 4.x? Single shard, right?


On Wed, Jul 16, 2014 at 2:24 PM, Erik Hatcher erik.hatc...@gmail.com
wrote:

 Tom -

  You could maybe isolate it a little further by using the "debug"
  parameter with values of timing|query|results

 Erik

 On May 15, 2014, at 5:50 PM, Tom Burton-West tburt...@umich.edu wrote:

  Hello all,
 
  I'm trying to get relevance scoring information for each of 1,000 docs
 returned for each of 250 queries.If I run the query (appended below)
 without debugQuery=on, I have no problem with getting all the results with
 under 4GB of memory use.  If I add the parameter debugQuery=on, memory use
 goes up continuously and after about 20 queries (with 1,000 results each),
 memory use reaches about 29.1 GB and the garbage collector gives up:
 
   org.apache.solr.common.SolrException; null:java.lang.RuntimeException:
 java.lang.OutOfMemoryError: GC overhead limit exceeded
 
  I've attached a jmap -histo, exgerpt below.
 
  Is this a known issue with debugQuery?
 
  Tom
  
  query:
 
 
  q=Abraham+Lincoln&fl=id,score&indent=on&wt=json&start=0&rows=1000&version=2.2&debugQuery=on
 
  without debugQuery=on:
 
 
  q=Abraham+Lincoln&fl=id,score&indent=on&wt=json&start=0&rows=1000&version=2.2
 
  num   #instances#bytes  Class description
 
 --
  1:  585,559 10,292,067,456  byte[]
  2:  743,639 18,874,349,592  char[]
  3:  53,821  91,936,328  long[]
  4:  70,430  69,234,400  int[]
  5:  51,348  27,111,744
  org.apache.lucene.util.fst.FST$Arc[]
  6:  286,357 20,617,704
  org.apache.lucene.util.fst.FST$Arc
  7:  715,364 17,168,736  java.lang.String
  8:  79,561  12,547,792  * ConstMethodKlass
  9:  18,909  11,404,696  short[]
  10: 345,854 11,067,328  java.util.HashMap$Entry
  11: 8,823   10,351,024  * ConstantPoolKlass
  12: 79,561  10,193,328  * MethodKlass
  13: 228,587 9,143,480
 org.apache.lucene.document.FieldType
  14: 228,584 9,143,360   org.apache.lucene.document.Field
  15: 368,423 8,842,152   org.apache.lucene.util.BytesRef
  16: 210,342 8,413,680   java.util.TreeMap$Entry
  17: 81,576  8,204,648   java.util.HashMap$Entry[]
  18: 107,921 7,770,312
 org.apache.lucene.util.fst.FST$Arc
  19: 13,020  6,874,560
 org.apache.lucene.util.fst.FST$Arc[]
 
  debugQuery_jmap.txt




Re: Continue indexing doc after error

2014-07-01 Thread Tomás Fernández Löbbe
I think what you want is what’s described in
https://issues.apache.org/jira/browse/SOLR-445 This has not been committed
because it still doesn’t work with SolrCloud. Hoss gave me the hint to look
at DistributingUpdateProcessorFactory to solve the problem described in the
last comments, but I haven’t had time to get back to this yet.
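
Until something like SOLR-445 is available, one workaround is to validate each row on
the client side before calling add(), so that all field errors for a document can be
reported at once. A rough sketch (the field names and validators below are
hypothetical, not a Solr API):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import org.apache.solr.common.SolrInputDocument;

public class RowValidator {
  // hypothetical per-field checks mirroring the schema; adjust to your own fields
  private static final Map<String, Predicate<String>> VALIDATORS = new LinkedHashMap<>();
  static {
    VALIDATORS.put("price", s -> s.matches("-?\\d+(\\.\\d+)?"));  // must look like a number
    VALIDATORS.put("quantity", s -> s.matches("\\d+"));           // must be a non-negative integer
  }

  /** Returns all field errors for one row; an empty list means the doc is safe to add. */
  public static List<String> validate(SolrInputDocument doc) {
    List<String> errors = new ArrayList<>();
    for (Map.Entry<String, Predicate<String>> e : VALIDATORS.entrySet()) {
      Object value = doc.getFieldValue(e.getKey());
      if (value != null && !e.getValue().test(value.toString())) {
        errors.add("field '" + e.getKey() + "' has an invalid value: " + value);
      }
    }
    return errors;
  }
}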


On Tue, Jul 1, 2014 at 1:37 PM, tedsolr tsm...@sciquest.com wrote:

 I need to index documents from a csv file that will have 1000s of rows and
 100+ columns. To help the user loading the file I must return useful errors
 when indexing fails (schema violations). I'm using SolrJ to read the files
 line by line, build the document, and index/commit. This approach allows me
 to index the docs that have no schema validation errors, skipping over the
 docs that do. However, I really want to report errors field by field. As
 the
 user makes corrections to the file, this would prevent the same doc from
  failing multiple times if there are several fields that are busted. I have
 not seen a configuration setting that tells solr to keep indexing the doc
 after it encounters the first error, reporting back all the field errors
  (multiple exceptions). Does anyone know if that's possible? Using Solr 4.8.1



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Continue-indexing-doc-after-error-tp4145081.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to Configure Solr For Test Purposes?

2014-05-27 Thread Tomás Fernández Löbbe
 What do you suggest for my purpose? If a test case fails, would re-running it a
  few times be a solution? What kind of configuration do you suggest for
  my Solr setup?


From the snippet of test that you showed, it looks like it's testing only
Solr functionality. So, first make sure this is a test that you really
need. Solr has its own tests, and if you feel it could use more (for some
specific case or context), I'd open a Jira and try to get the test inside
Solr.
If my impression is wrong and your test is actually testing your code, then
I'd suggest you to use a specific soft commit call with waitSearcher = true
on your test instead of relying on the autocommit (and remove the
autocommit completely from your solrconfig).
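
In SolrJ that explicit call could look like the sketch below (written against the 4.x
SolrServer API; the method and variable names are only illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ExplicitSoftCommitHelper {
  // 'server' is whatever SolrServer instance the test suite already builds
  static long deleteAndCount(SolrServer server, String id) throws Exception {
    server.deleteById(id);
    // waitFlush=true, waitSearcher=true, softCommit=true: returns once a new
    // searcher that can see the delete has been opened, no Thread.sleep() needed
    server.commit(true, true, true);
    QueryResponse rsp = server.query(new SolrQuery("*:*"));
    return rsp.getResults().getNumFound();
  }
}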

Tomás



Thanks;
 Furkan KAMACI
 26 May 2014 21:03 tarihinde Shawn Heisey s...@elyograg.org yazdı:

  On 5/26/2014 10:57 AM, Furkan KAMACI wrote:
   Hi;
  
   I run Solr within my Test Suite. I delete documents or atomically
 update
   them and check whether if it works or not. I know that I have to setup
 a
   hard/soft commit timing for my test Solr. However even I have that
  settings:
  
 <autoCommit>
   <maxTime>1</maxTime>
   <openSearcher>true</openSearcher>
 </autoCommit>

 <autoSoftCommit>
   <maxTime>1</maxTime>
 </autoSoftCommit>
 
  I hope you know that this is BAD configuration.  Doing automatic commits
  on an interval of 1 millisecond is asking for a whole host of problems.
   In some cases, this could do a commit after every single document that
  is indexed, which is NOT recommended at all.  The openSearcher setting
  of true on autoCommit makes it even worse.  There's no reason to do
  both autoSoftCommit and autoCommit with openSearcher=true.  I don't know
  which one wins between autoCommit and autoSoftCommit if they both have
  the same config, but I would guess the hard commit does.
 
   and even I wait (Thread.sleep()) for a time to wait Solr *sometimes* my
   tests are failed. I get fail error even I increase wait time.  Example
  of a
   sometimes failed code piece:
  
    for (int i = 0; i < dummyDocumentSize; i++) {
        deleteById(id + i);
        dummyDocumentSize--;
        queryResponse = query(solrParams);
        assertTrue(queryResponse.getResults().size() == dummyDocumentSize);
    }
  
   at debug mode if I wait for Solr to reflect changes I see that I do not
  get
   error. What do you think, what kind of configuration I should have for
  such
   kind of purposes?
 
  Chances are that commits are going to take longer than 1 millisecond.
  If you're actively indexing, the system is going to be trying to stack
  up lots of commits at the same time.  The maxWarmingSearchers value will
  limit the number of new searchers that can be opened, but it will not
  stop the commits themselves.  When lots of commits are going on, each
  one will take *even longer* to complete, which probably explains the
  problem.
 
  Thanks,
  Shawn
 
 



Re: SolrMeter is dead?

2014-05-16 Thread Tomás Fernández Löbbe
It hasn't had any improvements in a long time now (it doesn't have any
SolrCloud-related features, for example). I just added a note on the Solr wiki to
alert users about that. Feel free to ask on the solrmeter mailing list if
you have any other questions.

Tomás


On Wed, May 14, 2014 at 3:56 AM, Ahmet Arslan iori...@yahoo.com wrote:

 Hi Al,

 http://jmeter.apache.org

 Ahmet





 On Wednesday, May 14, 2014 1:11 PM, Al Krinker al.krin...@gmail.com
 wrote:
 I am trying to test performance of my cluster (solr 4.8).

 SolrMeter looked promising... small and standalone. Plus, open source so
 that I could make tweaks if needed.

 However, I see that the last update date was in Oct 2012. Is it dead? Any
 better non commercial and preferably open sourced projects out there?

 Thanks,
 Al




Re: Search Suggestion Filtering

2014-01-15 Thread Tomás Fernández Löbbe
I think your use case is the one described in LUCENE-5350, maybe you want
to take a look to the patch and comments there.

Tomás


On Wed, Jan 15, 2014 at 12:58 PM, Hamish Campbell 
hamish.campb...@koordinates.com wrote:

 Hi all,

 I'm looking into options for filtering the search suggestions dictionary.

 Using Solr 4.6.0, Suggester component and fst.FuzzyLookupFactory using a
 field based dictionary, we're indexing records for a multi-tenanted SaaS
 platform. SearchHandler records are always filtered by the particular
 client warehouse (e.g. by domain), however we need a way to apply a similar
 filter to the spell check dictionary to prevent leaking terms between
 clients. In other words: when client A searches for a document title they
 should not receive spelling suggestions for client B's document titles.

 This has been asked a couple of times, on the mailing list and on
 StackOverflow. Some of the suggested approaches:

 1. Use dynamic fields to create dictionaries per-warehouse (mentioned here:

 http://lucene.472066.n3.nabble.com/Filtering-down-terms-in-suggest-tt4069627.html
 )

 That might be a reasonable option for us (we already considered a similar
 approach), but at what point does this stop scaling efficiently? How many
 dynamic fields are too many?

 2. Run a query to populate the suggestion list (also mentioned in that
 thread)

 If I understand this correctly, this would give us a lot of flexibility and
 power: for example to give a more nuanced result set using the users
 permissions to expose private documents in their spelling suggestions.

 I expect this would be a slow query, but our total document count is
 currently relatively small (on the order of 10^3 objects) and I imagine you
 could create a specific word index with the appropriate fields to keep this
 in check. Is this a feasible approach, and if so, how do you build a
 dynamic suggestion list?

 3. Other options:

 It seems like this is a common problem - and we could through some
 resources at building an extension to provide some limited suggestion
 dictionary filtering. Is anyone already doing something similar, or has
 found a clever hack around this, or can suggest a starting point?

 Thanks everyone!

 --
 Hamish Campbell
 Koordinates Ltd http://koordinates.com/?_bzhc=esig
 PH   +64 9 966 0433
 FAX +64 9 966 0045



Re: distributed search is significantly slower than direct search

2013-11-17 Thread Tomás Fernández Löbbe
Hi Yuval, quick question. You say that your core has 750k docs and around
400mb? Is this some kind of test dataset and you expect it to grow
significantly? For an index of this size, I wouldn't use distributed
search, single shard should be fine.


Tomás


On Sun, Nov 17, 2013 at 6:50 AM, Yuval Dotan yuvaldo...@gmail.com wrote:

 Hi,

 I isolated the case

 Installed on a new machine (2 x Xeon E5410 2.33GHz)

 I have an environment with 12Gb of memory.

 I assigned 6gb of memory to Solr and I’m not running any other memory
 consuming process so no memory issues should arise.

 Removed all indexes apart from two:

 emptyCore – empty – used for routing

 core1 – holds the stored data – has ~750,000 docs and size of 400Mb

 Again this is a single machine that holds both indexes.

 The query

  http://localhost:8210/solr/emptyCore/select?rows=5000&q=*:*&shards=127.0.0.1:8210/solr/core1&wt=json
  QTime takes ~3 seconds

  and the direct query
  http://localhost:8210/solr/core1/select?rows=5000&q=*:*&wt=json
  QTime takes ~15 ms - a magnitude difference.

 I ran the long query several times and got an improvement of about a sec
 (33%) but that’s it.

 I need to better understand why this is happening.

 I tried looking at Solr code and debugging the issue but with no success.

 The one thing I did notice is that the getFirstMatch method which receives
 the doc id, searches the term dict and returns the internal id takes most
 of the time for some reason.

 I am pretty stuck and would appreciate any ideas

 My only solution for the moment is to bypass the distributed query,
 implement code in my own app that directly queries the relevant cores and
 handles the sorting etc..

 Thanks




 On Sat, Nov 16, 2013 at 2:39 PM, Michael Sokolov 
 msoko...@safaribooksonline.com wrote:

  Did you say what the memory profile of your machine is?  How much memory,
  and how large are the shards? This is just a random guess, but it might
 be
  that if you are memory-constrained, there is a lot of thrashing caused by
  paging (swapping?) in and out the sharded indexes while a single index
 can
  be scanned linearly, even if it does need to be paged in.
 
  -Mike
 
 
  On 11/14/2013 8:10 AM, Elran Dvir wrote:
 
  Hi,
 
  We tried returning just the id field and got exactly the same
 performance.
  Our system is distributed but all shards are in a single machine so
  network issues are not a factor.
  The code we found where Solr is spending its time is on the shard and
 not
  on the routing core, again all shards are local.
  We investigated the getFirstMatch() method and noticed that the
  MultiTermEnum.reset (inside MultiTerm.iterator) and MultiTerm.seekExact
  take 99% of the time.
  Inside these methods, the call to BlockTreeTermsReader$
  FieldReader$SegmentTermsEnum$Frame.loadBlock  takes most of the time.
  Out of the 7 seconds  run these methods take ~5 and
  BinaryResponseWriter.write takes the rest(~ 2 seconds).
 
  We tried increasing cache sizes and got hits, but it only improved the
  query time by a second (~6), so no major effect.
  We are not indexing during our tests. The performance is similar.
  (How do we measure doc size? Is it important due to the fact that the
  performance is the same when returning only id field?)
 
  We still don't completely understand why the query takes this much
 longer
  although the cores are on the same machine.
 
  Is there a way to improve the performance (code, configuration, query)?
 
  -Original Message-
  From: idokis...@gmail.com [mailto:idokis...@gmail.com] On Behalf Of
  Manuel Le Normand
  Sent: Thursday, November 14, 2013 1:30 AM
  To: solr-user@lucene.apache.org
  Subject: Re: distributed search is significantly slower than direct
 search
 
  It's surprising such a query takes a long time, I would assume that
 after
  trying consistently q=*:* you should be getting cache hits and times
 should
  be faster. Try see in the adminUI how do your query/doc cache perform.
  Moreover, the query in itself is just asking the first 5000 docs that
   were indexed (returning the first [docid]), so it seems all this time is
 wasted
  on transfer. Out of these 7 secs how much is spent on the above method?
  What do you return by default? How big is every doc you display in your
  results?
   It might be that both collections work on the same resources.
  Try elaborating your use-case.
 
  Anyway, it seems like you just made a test to see what will be the
  performance hit in a distributed environment so I'll try to explain some
   things we encountered in our benchmarks, with a case that is at least
   similar in the number of docs fetched.
 
  We reclaim 2000 docs every query, running over 40 shards. This means
   every shard is actually transferring to our frontend 2000 docs every
  document-match request (the first you were referring to). Even if lazily
  loaded, reading 2000 id's (on 40 servers) and lazy loading the fields
 is a
  tough job. Waiting for the slowest shard to 

Re: limiting deep pagination

2013-10-08 Thread Tomás Fernández Löbbe
I don't know of any OOTB way to do that, I'd write a custom request handler
as you suggested.

Tomás


On Tue, Oct 8, 2013 at 3:51 PM, Peter Keegan peterlkee...@gmail.com wrote:

 Is there a way to configure Solr 'defaults/appends/invariants' such that
 the product of the 'start' and 'rows' parameters doesn't exceed a given
 value? This would be to prevent deep pagination.  Or would this require a
 custom requestHandler?

 Peter



Re: SolrCloud distribute search question.

2013-10-04 Thread Tomás Fernández Löbbe
Yes, the machine that gets the initial request is the one that distributes
the query to the shards and then aggregates the results.
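
In other words, you can pick the aggregating node simply by sending the request to it.
A small SolrJ sketch (host and collection names are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PickAggregatorNode {
  public static void main(String[] args) throws Exception {
    // node3 receives the request, fans the query out to all shards of the
    // collection and merges/sorts the per-shard results itself
    HttpSolrServer aggregator = new HttpSolrServer("http://node3:8983/solr/collection1");
    QueryResponse rsp = aggregator.query(new SolrQuery("*:*"));
    System.out.println(rsp.getResults().getNumFound());
    aggregator.shutdown();
  }
}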



On Fri, Oct 4, 2013 at 9:55 AM, yriveiro yago.rive...@gmail.com wrote:

 Hi,

 When a distributed search is done, the initial query is forwarded to all
 shards that are part of the specific collection that we are querying.

 My question here is, Which is the machine that does the aggregation for
 results from shards?

 Is the machine which receives the initial request?

 I need to have the control of the machine that does the aggregation.


 /Yago



 -
 Best regards
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/SolrCloud-distribute-search-question-tp4093523.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: App server?

2013-10-03 Thread Tomás Fernández Löbbe
You may also want to take a look at this Jira:
https://issues.apache.org/jira/browse/SOLR-4792 for Solr 5.0 (trunk)

Tomás


On Thu, Oct 3, 2013 at 10:41 AM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 On 10/02/2013 06:44 PM, Mark wrote:

 Is Jetty sufficient for running Solr or should I go with something a
 little more enterprise like tomcat?

 Any others?

 FWIW we use tomcat for all of our installs, and it works fine.  I don't
 claim it's any better than Jetty, but it doesn't cause any problems, either.

 -Mike



Re: Top 10 Terms in Index (by date)

2013-04-02 Thread Tomás Fernández Löbbe
Oh, I see, essentially you want to get the sum of the term frequencies for
every term in a subset of documents (instead of the document frequency as
the FacetComponent would give you). I don't know of an easy/out of the box
solution for this. I know the TermVectorComponent will give you the tf for
every term in a document, but I'm not sure if you can filter or sort on it.
Maybe you can do something like:
https://issues.apache.org/jira/browse/LUCENE-2393
or what's suggested here:
http://search-lucene.com/m/of5Fn1PUOHU/
but I have never used something like that.
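
For reference, a sketch of querying the TermVectorComponent from SolrJ, assuming a
/tvrh handler with that component is configured in solrconfig.xml (as in the example
config); summing the per-document tf values would still have to happen on the client:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.NamedList;

public class TermVectorExample {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("dateCreated:[2013-03-01T00:00:00Z TO 2013-04-01T00:00:00Z]");
    q.setRequestHandler("/tvrh");  // handler that includes the TermVectorComponent
    q.set("tv", "true");
    q.set("tv.tf", "true");        // return term frequencies per document
    q.set("tv.fl", "content");
    q.setRows(1000);
    QueryResponse rsp = server.query(q);
    // the component adds a "termVectors" section: doc -> field -> term -> tf
    NamedList<?> termVectors = (NamedList<?>) rsp.getResponse().get("termVectors");
    System.out.println(termVectors);
    server.shutdown();
  }
}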

Tomás



On Mon, Apr 1, 2013 at 9:58 PM, Andy Pickler andy.pick...@gmail.com wrote:

 I need total number of occurrences across all documents for each term.
 Imagine this...

 Post #1: I think, therefore I am like you
 Reply #1: You think too much
 Reply #2 I think that I think much as you

 Each of those documents are put into 'content'.  Pretending I don't have
 stop words, the top term query (not considering dateCreated in this
 example) would result in something like...

 think: 4
 I: 4
 you: 3
 much: 2
 ...

 Thus, just a number of documents approach doesn't work, because if a word
 occurs more than one time in a document it needs to be counted that many
 times.  That seemed to rule out faceting like you mentioned as well as the
 TermsComponent (which as I understand also only counts documents).

 Thanks,
 Andy Pickler

 On Mon, Apr 1, 2013 at 4:31 PM, Tomás Fernández Löbbe 
 tomasflo...@gmail.com
  wrote:

  So you have one document per user comment? Why not use faceting plus
  filtering on the dateCreated field? That would count number of
  documents for each term (so, in your case, if a term is used twice in
 one
  comment it would only count once). Is that what you are looking for?
 
  Tomás
 
 
  On Mon, Apr 1, 2013 at 6:32 PM, Andy Pickler andy.pick...@gmail.com
  wrote:
 
   Our company has an application that is Facebook-like for usage by
   enterprise customers.  We'd like to do a report of top 10 terms
 entered
  by
   users over (some time period).  With that in mind I'm using the
   DataImportHandler to put all the relevant data from our database into a
   Solr 'content' field:
  
    <field name="content" type="text_general" indexed="true" stored="false"
           multiValued="false" required="true" termVectors="true"/>

    Along with the content is the 'dateCreated' for that content:

    <field name="dateCreated" type="tdate" indexed="true" stored="false"
           multiValued="false" required="true"/>
  
   I'm struggling with the TermVectorComponent documentation to understand
  how
   I can put together a query that answers the 'report' mentioned above.
   For
   each document I need each term counted however many times it is entered
    (content of "I think what I think" would report 'think' as used twice).
Does anyone have any insight as to whether I'm headed in the right
   direction and then what my query would be?
  
   Thanks,
   Andy Pickler
  
 



Re: Top 10 Terms in Index (by date)

2013-04-01 Thread Tomás Fernández Löbbe
So you have one document per user comment? Why not use faceting plus
filtering on the dateCreated field? That would count number of
documents for each term (so, in your case, if a term is used twice in one
comment it would only count once). Is that what you are looking for?

Tomás


On Mon, Apr 1, 2013 at 6:32 PM, Andy Pickler andy.pick...@gmail.com wrote:

 Our company has an application that is Facebook-like for usage by
 enterprise customers.  We'd like to do a report of top 10 terms entered by
 users over (some time period).  With that in mind I'm using the
 DataImportHandler to put all the relevant data from our database into a
 Solr 'content' field:

  <field name="content" type="text_general" indexed="true" stored="false"
         multiValued="false" required="true" termVectors="true"/>

  Along with the content is the 'dateCreated' for that content:

  <field name="dateCreated" type="tdate" indexed="true" stored="false"
         multiValued="false" required="true"/>

 I'm struggling with the TermVectorComponent documentation to understand how
 I can put together a query that answers the 'report' mentioned above.  For
 each document I need each term counted however many times it is entered
  (content of "I think what I think" would report 'think' as used twice).
  Does anyone have any insight as to whether I'm headed in the right
 direction and then what my query would be?

 Thanks,
 Andy Pickler



Re: Urgent:Solr cloud issue

2013-03-28 Thread Tomás Fernández Löbbe
Could you give more details on what's not working? Have you followed the
instructions here: http://wiki.apache.org/solr/SolrCloud#Getting_Started
Are you using an embedded Zookeeper or an external server? How many of
them? Are you using numShards=1?2?

What do you see in the Solr UI, in the cloud section?

Tomás


On Thu, Mar 28, 2013 at 8:44 AM, anuj vats vats_a...@rediffmail.com wrote:

  Waiting for your assistance to get config entries for a 3-server Solr cloud
  setup.


 Thanks in advance


  Anuj
  From: anuj vats <vats_a...@rediffmail.com>
  Sent: Fri, 22 Mar 2013 17:32:10
  To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
  Cc: mayank...@gmail.com <mayank...@gmail.com>
  Subject: <Urgent:Solr cloud issue>
 Hi Shawan,

 I have seen your post on solr cloud Master-Master configuration on two
  servers. I have to use the same Solr structure, but for a long time I have not been
  able to configure it to communicate between two servers; on a single server it works
  fine.
  Can you please help me out by providing the required config changes, so that Solr
  can communicate between two servers.

 http://grokbase.com/t/lucene/solr-user/132pb1pe34/solrcloud-master-master

 Regards
 Anuj Vats


Re: [ANNOUNCE] Solr wiki editing change

2013-03-28 Thread Tomás Fernández Löbbe
Steve, could you add me to the contrib group? TomasFernandezLobbe

Thanks!

Tomás


On Thu, Mar 28, 2013 at 1:04 PM, Steve Rowe sar...@gmail.com wrote:

 On Mar 28, 2013, at 11:57 AM, Jilal Oussama jilal.ouss...@gmail.com
 wrote:
  Please add OussamaJilal to the group.

 Added to solr ContributorsGroup.



Re: [Beginner] wants to contribute in open source project

2013-03-11 Thread Tomás Fernández Löbbe
You can also take a look at http://wiki.apache.org/solr/HowToContribute

Tomás


On Mon, Mar 11, 2013 at 9:20 AM, Andy Lester a...@petdance.com wrote:


 On Mar 11, 2013, at 11:14 AM, chandresh pancholi 
 chandreshpancholi...@gmail.com wrote:

  I am beginner in this field. It would be great if you help me out. I love
  to code in java.
  can you guys share some link so that i can start contributing in
  solr/lucene project.


 This article I wrote about getting started contributing to projects may
 give you some ideas.


 http://blog.smartbear.com/software-quality/bid/167051/14-Ways-to-Contribute-to-Open-Source-without-Being-a-Programming-Genius-or-a-Rock-Star

 I don't have tasks specifically for the Solr project (does Solr have such
 a list for newcomers to help on?) but I hope that you'll get some ideas.

 xoa

 --
 Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance




Re: Upgrade Solr3.5 to Solr4.1 - Index Reformat ?

2013-03-11 Thread Tomás Fernández Löbbe
Hi Feroz, due to Lucene's backward compatibility policy (
http://wiki.apache.org/lucene-java/BackwardsCompatibility ), a Solr 4.1
instance should be able to read an index generated by a Solr 3.5 instance.
This would not be true if you need to change the schema. Also, be careful
because Solr 4.1 could and will change the index files and will make them
unreadable by Solr 3.5 (so you should make a backup in case you need to
revert to 3.5 for some reason).
This means that if you can't shut down your whole application all at once,
you could update the slaves first, and then the masters. Replacing all
servers together will also work.


That said, you should not use 4.1 if you are using Master/Slave: there are
some known bugs in that specific feature in 4.1 that were fixed for 4.2.

Tomás


On Mon, Mar 11, 2013 at 10:56 AM, feroz_kh feroz.kh2...@gmail.com wrote:

 Hello,

 We are planning to upgrade our solr servers from version 3.5 to 4.1.
 We have master slave configuration and the index size is quite big (i.e.
 around 14 GB ).
 1. Do we really need to re-format the whole index , when we upgrade to 4.1
 ?
 2. What will be the consequences - if we do not re-format and simply
 upgrade
 war file and config files ( solrconfig.xml, schema.xml ) on all slaves and
  master together. (Shutdown all master & slaves and then upgrade & startup)
 ?
 3. If re-formatting is neccessary - then what is the best tool to achieve
 it. ( How long does it usually take to re-format the index of size around
 14GB ) ?

 Thanks,
 Feroz




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Upgrade-Solr3-5-to-Solr4-1-Index-Reformat-tp4046391.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: SolrCloud: port out of range:-1

2013-03-08 Thread Tomás Fernández Löbbe
A couple of comments about your deployment architecture too. You'll need to
change the zoo.cfg to make the ZooKeeper ensemble work with two instances
as you are trying to do; have you done that? The example zoo.cfg configuration
is intended for a single ZK instance, as described in the SolrCloud
example. That said, a two-instance ZK ensemble like the one you are
intending to have doesn't make much sense: if ANY of your Solr servers
breaks (and since you are running embedded, ZK will also stop), the whole
cluster will be useless until you start the server again.

Tomás


On Fri, Mar 8, 2013 at 12:26 PM, Shawn Heisey s...@elyograg.org wrote:

 On 3/8/2013 7:37 AM, roySolr wrote:

 java -Djetty.port=4110 -DzkRun=10.100.10.101:5110
  -DzkHost=10.100.10.101:5110,10.100.10.102:5120 -Dbootstrap_conf=true
 -DnumShards=1 -Xmx1024M -Xms512M -jar start.jar

 It runs Solr on port 4110, the embedded zk on 5110.

 The -DzkHost gives the urls of the localhost zk(5110) and the url of the
 other server(zk port). When i try to start this it give the error: port
 out
 of range:-1.


 The full log line, ideally with several lines above and below for context,
 is going to be crucial for figuring this out.  Also, the contents of your
 solr.xml file may be important.

 Thanks,
 Shawn




Re: Query parsing issue

2013-03-06 Thread Tomás Fernández Löbbe
It should be easy to extend ExtendedDismaxQParser and do your
pre-processing in the parse() method before calling edismax's parse. Or
maybe you could change the way EDismax is splitting the input query into
clauses by extending the splitIntoClauses method?
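
A rough sketch of the first option (pre-processing the raw query string before edismax
parses it); the plugin would still have to be registered as a queryParser in
solrconfig.xml and selected with defType:

import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.ExtendedDismaxQParser;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

public class PreprocessingEDismaxQParserPlugin extends QParserPlugin {
  @Override
  public void init(NamedList args) {}

  @Override
  public QParser createParser(String qstr, SolrParams localParams, SolrParams params,
                              SolrQueryRequest req) {
    return new ExtendedDismaxQParser(qstr, localParams, params, req) {
      @Override
      public Query parse() throws SyntaxError {
        // run the custom analysis/pre-processing on the whole raw query string
        qstr = preprocess(getString());
        return super.parse();
      }
    };
  }

  private String preprocess(String rawQuery) {
    // placeholder for the custom pre-processing described in the thread
    return rawQuery.trim();
  }
}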

Tomás


On Wed, Mar 6, 2013 at 6:37 AM, Francesco Valentini 
francesco.valent...@altiliagroup.com wrote:

 Hi,



 I’ve written my own analyzer to index and query a set of documents. At
 indexing time everything goes well but

 now I have a problem in  query phase.

 I need to pass  the whole query string to my analyzer before the edismax
 query parser begins its tasks.

 In other words I have to preprocess the raw query string.

 The phrase querying does not fit my needs because I don’t have to match
 the entire set of terms/tokens.

 How can I achieve this?



 Thank you in advance.





 Francesco






Re: Solr cloud distributed queries, what goes on in the consolidation step?

2013-02-15 Thread Tomás Fernández Löbbe
In step 4, once node 1 gets all the responses, it merges and sorts
them: let's say you requested 15 docs from each shard (because the rows
parameter is 15); at this point node 1 merges the results from all the
responses and gets the top 15 across all of them. The second request is
only to get the requested fl from those top 15 docs. This request will
only be sent to those nodes that have at least 1 of the top 15. You'll see
that the second request has the parameter ids, with the list of ids of
the documents that have to be retrieved.



 In terms of querying + scoring, clearly that has to happen in the shards
 (since only the shard knows the IDF), and the shards only return the
 requested number of documents (15 each in our case).  So it seems like the
 final step 5 just has to sort the 15 x 4 = 60 documents it has been given
 and return the top 15 of those.  However, we are seeing a dis-proportionate
 amount of time in that step (admittedly we are only looking at query times
 in the logs, don't have debug on this system yet).

This is correct, but it happens in step 4, not 5.



 So I'm thinking what about filtering?  We have some FilterQueries (Post
 Filter implementations) and it seems to be combinations of those which
 cause the massive query times, is it possible the consolidation is then
 trying to run (or re-run ) the filters?


All filters are applied in each node before responding to node1


 I can include logs and specifics if necessary, but in essence for a
 particular set of queries, step 3 takes about 400ms (on all 4 shards), step
 4 is 5ms, yet the user response isn't sent out for about 13s(!)


The second request is usually much faster than the first one. In this case
the problem may be due to network latency. You can compare the times you
see in Solr logs vs the request log of your servlet container.


 In terms of the log entries for the distributed queries, I'm assuming the
 logs are written as the queries complete, and the QTime is the time taken
 to run that query?


Yes, everything is logged after the fact. You should see: a log entry for
the search request in the nodes, a log entry for the fetch request in
the logs (may not be in all nodes, if some of them didn't match or didn't
have any doc of the top 15), and finally the main search entry, including
all of the above.


Tomás


Re: Eject a node from SolrCloud

2013-02-07 Thread Tomás Fernández Löbbe
Yes, currently the only option is to shutdown the node. Maybe not the
cleanest way to remove a node. See this jira too:
https://issues.apache.org/jira/browse/SOLR-3512


On Thu, Feb 7, 2013 at 7:20 AM, yriveiro yago.rive...@gmail.com wrote:

 Hi,

  Is there any way to eject a node from a Solr cluster?

  If I shut down a node in the cluster, ZooKeeper tags the node as down.

 Thanks

 /Yago



 -
 Best regards
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Eject-a-node-from-SolrCloud-tp4038950.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Is Solr Cloud will be helpful in case of Load balancing

2013-02-01 Thread Tomás Fernández Löbbe
Yes and no. SolrCloud won't do it automatically. But it will make it easier
for you to add/remove nodes from a collection. And if you use
CloudSolrServer for queries, the new nodes will automatically be used for
queries once they are ready to respond.
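
For example, with the 4.x SolrJ API (the ZooKeeper addresses and collection name are
placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CloudQueryExample {
  public static void main(String[] args) throws Exception {
    // CloudSolrServer watches the cluster state in ZooKeeper and load balances
    // queries across the live nodes of the collection
    CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");
    QueryResponse rsp = server.query(new SolrQuery("*:*"));
    System.out.println(rsp.getResults().getNumFound());
    server.shutdown();
  }
}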

Tomás


On Fri, Feb 1, 2013 at 7:35 AM, dharmendra jaiswal 
dharmendra.jais...@gmail.com wrote:

 Hello,

  I am using the multi-core mechanism with Solr, and each core is dedicated to a
 particular client.

 Like If we search data from SiteA, it will provide search result from CoreA
 And if we search data from SiteB, it will provide search result from CoreB

 and similar case with other client.

  We have created N cores on a single node of the Solr server.
  My question is whether SolrCloud will be helpful for load balancing,
  as in my case all requests for the different clients come to a single
  server node.
  Any pointer or link will be helpful.
  Note: I am using a Windows machine for deployment of Solr.

 Thanks,
 Dharmendra jaiswal



Re: copyField - copy only specific words

2013-01-25 Thread Tomás Fernández Löbbe
I think the best way would be to pre-process the document (or use a custom
UpdateRequestProcessor). Another option, if you'll only use the cities
field for faceting/sorting/searching (i.e. you don't need the stored content),
would be to use a regular copyField and a KeepWordFilter for the
cities field. However, with this approach it will be difficult to handle
multi-word cities like New York or Buenos Aires.
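
If you go the UpdateRequestProcessor route, a minimal sketch could look like the
following (the field names "keywords" and "cities" and the city list are assumptions
based on this thread, and the factory still has to be added to an
updateRequestProcessorChain in solrconfig.xml):

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class CityExtractorProcessorFactory extends UpdateRequestProcessorFactory {
  // the fixed, known-in-advance list of cities (lower-cased); multi-word names work too
  private static final Set<String> CITIES =
      new HashSet<>(Arrays.asList("new york", "buenos aires", "berlin"));

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object keywords = doc.getFieldValue("keywords");
        if (keywords != null) {
          String text = keywords.toString().toLowerCase(Locale.ROOT);
          for (String city : CITIES) {
            if (text.contains(city)) {
              doc.addField("cities", city);  // copy only the matching cities
            }
          }
        }
        super.processAdd(cmd);  // continue with the rest of the chain
      }
    };
  }
}

Unlike the KeepWordFilter approach, this keeps stored content for the cities field and
handles multi-word names, at the cost of a little custom code.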

Tomás


On Fri, Jan 25, 2013 at 7:33 AM, b.riez...@pixel-ink.de 
b.riez...@pixel-ink.de wrote:

 Hi,

  I'd like to copy specific words from the keywords field to another field.
  Because the data I get is all in one field, I'd like to extract the cities
  (they are fixed, so I'll know them in advance) and put them in a separate
  field.

 Can i generate a whitelist file and tell the copy field to check this file
 and only copy matching words to a new field?

 Thanks for your help
 Ben



Re: Solr cache considerations

2013-01-18 Thread Tomás Fernández Löbbe
No, the fieldValueCache is not used for resolving queries. Only for
multi-token faceting and apparently for the stats component too. The
document cache maintains in memory the stored content of the fields you are
retrieving or highlighting on. It'll hit if the same document matches the
query multiple times and the same fields are requested, but as Erick said,
it is important for cases when multiple components in the same request need
to access the same data.

I think soft committing every 10 minutes is totally fine, but you should
hard commit more often if you are going to be using the transaction log.
openSearcher=false will essentially tell Solr not to open a new searcher
after the (hard) commit, so you won't see the newly indexed data and caches
won't be flushed. openSearcher=false makes sense when you are using
hard commits together with soft commits: as the soft commit is dealing
with opening/closing searchers, you don't need hard commits to do it.

Tomás


On Fri, Jan 18, 2013 at 2:20 AM, Isaac Hebsh isaac.he...@gmail.com wrote:

 Unfortunately, it seems (
 http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html) that
 these caches are not per-segment. In this case, I want to (soft) commit
 less frequently. Am I right?

 Tomás, as the fieldValueCache is very similar to lucene's FieldCache, I
 guess it has a big contribution to standard (not only faceted) queries
 time. SolrWiki claims that it primarily used by faceting. What that says
 about complex textual queries?

 documentCache:
  Erick, after query processing is finished, don't some documents stay in
  the documentCache? Can't I use it to accelerate queries that should
 retrieve stored fields of documents? In this case, a big documentCache can
 hold more documents..

 About commit frequency:
  HardCommit: openSearcher=false seems like a nice solution. Where can I read
 about this? (found nothing but one unexplained sentence in SolrWiki).
 SoftCommit: In my case, the required index freshness is 10 minutes. The
 plan to soft commit every 10 minutes is similar to storing all of the
  documents in a queue (outside of Solr), and indexing a bulk every 10
 minutes.

 Thanks.


 On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe 
 tomasflo...@gmail.com wrote:

  I think fieldValueCache is not per segment, only fieldCache is. However,
  unless I'm missing something, this cache is only used for faceting on
  multivalued fields
 
 
  On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson erickerick...@gmail.com
  wrote:
 
   filterCache: This is bounded by 1M * (maxDoc) / 8 * (num filters in
   cache). Notice the /8. This reflects the fact that the filters are
   represented by a bitset on the _internal_ Lucene ID. UniqueId has no
   bearing here whatsoever. This is, in a nutshell, why warming is
   required, the internal Lucene IDs may change. Note also that it's
   maxDoc, the internal arrays have holes for deleted documents.
  
   Note this is an _upper_ bound, if there are only a few docs that
   match, the size will be (num of matching docs) * sizeof(int)).
  
   fieldValueCache. I don't think so, although I'm a bit fuzzy on this.
   It depends on whether these are per-segment caches or not. Any per
   segment cache is still valid.
  
   Think of documentCache as intended to hold the stored fields while
   various components operate on it, thus avoiding repeatedly fetching
   the data from disk. It's _usually_ not too big a worry.
  
   About hard-commits once a day. That's _extremely_ long. Think instead
   of committing more frequently with openSearcher=false. If nothing
    else, your transaction log will grow lots and lots and lots. I'm
   thinking on the order of 15 minutes, or possibly even much less. With
   softCommits happening more often, maybe every 15 seconds. In fact, I'd
   start out with soft commits every 15 seconds and hard commits
   (openSearcher=false) every 5 minutes. The problem with hard commits
   being once a day is that, if for any reason the server is interrupted,
   on startup Solr will try to replay the entire transaction log to
   assure index integrity. Not to mention that your tlog will be huge.
   Not to mention that there is some memory usage for each document in
   the tlog. Hard commits roll over the tlog, flush the in-memory tlog
   pointers, close index segments, etc.
  
   Best
   Erick
  
   On Thu, Jan 17, 2013 at 1:29 PM, Isaac Hebsh isaac.he...@gmail.com
   wrote:
Hi,
   
I am going to build a big Solr (4.0?) index, which holds some dozens
 of
millions of documents. Each document has some dozens of fields, and
 one
   big
textual field.
The queries on the index are non-trivial, and a little-bit long
 (might
  be
hundreds of terms). No query is identical to another.
   
Now, I want to analyze the cache performance (before setting up the
  whole
environment), in order to estimate how much RAM will I need.
   
filterCache:
In my scenario, every query has some filters. Let's say

Re: group.ngroups behavior in response

2013-01-17 Thread Tomás Fernández Löbbe
But Amit is right: when you use group.main, the number of groups is not
displayed, even if you set group.ngroups.

I think in this case numFound should display the number of groups instead
of the number of docs matching. Another option would be to keep numFound as
the number of docs matching and add another attribute to the response that
shows the number of groups.


On Thu, Jan 17, 2013 at 11:51 AM, denl0 david.vandendriess...@gmail.comwrote:

 There's a parameter to enable that. :D

 In solrJ

 solrQuery.setParam(group.ngroups, true);

 http://wiki.apache.org/solr/FieldCollapsing



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/group-ngroups-behavior-in-response-tp4033924p4034187.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr cache considerations

2013-01-17 Thread Tomás Fernández Löbbe
I think fieldValueCache is not per segment, only fieldCache is. However,
unless I'm missing something, this cache is only used for faceting on
multivalued fields


On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson erickerick...@gmail.comwrote:

 filterCache: This is bounded by 1M * (maxDoc) / 8 * (num filters in
 cache). Notice the /8. This reflects the fact that the filters are
 represented by a bitset on the _internal_ Lucene ID. UniqueId has no
 bearing here whatsoever. This is, in a nutshell, why warming is
 required, the internal Lucene IDs may change. Note also that it's
 maxDoc, the internal arrays have holes for deleted documents.

 Note this is an _upper_ bound, if there are only a few docs that
 match, the size will be (num of matching docs) * sizeof(int)).

 fieldValueCache. I don't think so, although I'm a bit fuzzy on this.
 It depends on whether these are per-segment caches or not. Any per
 segment cache is still valid.

 Think of documentCache as intended to hold the stored fields while
 various components operate on it, thus avoiding repeatedly fetching
 the data from disk. It's _usually_ not too big a worry.

 About hard-commits once a day. That's _extremely_ long. Think instead
 of committing more frequently with openSearcher=false. If nothing
  else, your transaction log will grow lots and lots and lots. I'm
 thinking on the order of 15 minutes, or possibly even much less. With
 softCommits happening more often, maybe every 15 seconds. In fact, I'd
 start out with soft commits every 15 seconds and hard commits
 (openSearcher=false) every 5 minutes. The problem with hard commits
 being once a day is that, if for any reason the server is interrupted,
 on startup Solr will try to replay the entire transaction log to
 assure index integrity. Not to mention that your tlog will be huge.
 Not to mention that there is some memory usage for each document in
 the tlog. Hard commits roll over the tlog, flush the in-memory tlog
 pointers, close index segments, etc.

 Best
 Erick

 On Thu, Jan 17, 2013 at 1:29 PM, Isaac Hebsh isaac.he...@gmail.com
 wrote:
  Hi,
 
  I am going to build a big Solr (4.0?) index, which holds some dozens of
  millions of documents. Each document has some dozens of fields, and one
 big
  textual field.
  The queries on the index are non-trivial, and a little-bit long (might be
  hundreds of terms). No query is identical to another.
 
  Now, I want to analyze the cache performance (before setting up the whole
  environment), in order to estimate how much RAM will I need.
 
  filterCache:
  In my scenario, every query has some filters. Let's say that each filter
  matches 1M documents, out of 10M. Should the estimated memory usage
 be
  1M * sizeof(uniqueId) * num-of-filters-in-cache?
 
  fieldValueCache:
  Due to the difference between queries, I guess that fieldValueCache is
 the
  most important factor on query performance. Here comes a generic
 question:
  I'm indexing new documents to the index constantly. Soft commits will be
  performed every 10 mins. Does it say that the cache is meaningless, after
  every 10 minutes?
 
  documentCache:
  enableLazyFieldLoading will be enabled, and fl contains a very small
 set
  of fields. BUT, I need to return highlighting on about (possibly) 20
  fields. Does the highlighting component use the documentCache? I guess
 that
  highlighting requires the whole field to be loaded into the
 documentCache.
  Will it happen only for fields that matched a term from the query?
 
  And one more question: I'm planning to hard-commit once a day. Should I
  prepare to a significant RAM usage growth between hard-commits?
 (consider a
  lot of new documents in this period...)
  Does this RAM comes from the same pool as the caches? An OutOfMemory
  exception can happen is this scenario?
 
  Thanks a lot.



Re: Large transaction logs

2013-01-10 Thread Tomás Fernández Löbbe
Yes, you must issue hard commits. You can use autocommit with
openSearcher=false. Autocommit is not distributed; it has to be configured
in every node (which it will be automatically, because you are using the exact
same solrconfig for all your nodes).

Another option is to issue an explicit hard commit command; those ARE
distributed across all shards and replicas. You should also use the
openSearcher=false option for explicit hard commits (the searcher is already
being opened by the soft commits).
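
If you go with the explicit commit from SolrJ, a sketch of what it could look like;
I'm passing openSearcher as a raw request parameter, so treat that detail as an
assumption and verify it against your Solr version (host and collection are
placeholders):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.UpdateRequest;

public class HardCommitNoNewSearcher {
  public static void main(String[] args) throws Exception {
    SolrServer server = new HttpSolrServer("http://leader-host:8983/solr/collection1");
    UpdateRequest commit = new UpdateRequest();
    // hard commit: flush segments and roll over the transaction log...
    commit.setAction(AbstractUpdateRequest.ACTION.COMMIT, false, false);
    // ...but don't open a new searcher; visibility is handled by the soft commits
    commit.setParam("openSearcher", "false");
    commit.process(server);
    server.shutdown();
  }
}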

Both options are fine. Personally I prefer autocommit because then you can
just forget about commits.

Tomás


On Thu, Jan 10, 2013 at 7:51 AM, gadde gadde@gmail.com wrote:

 we have a SolrCloud with 3 nodes. we add documents to leader node and use
 commitwithin(100secs) option in SolrJ to add documents. AutoSoftCommit in
 SolrConfig is 1000ms.

 Transaction logs on replicas grew bigger than the index and we ran out of
 disk space in few days. Leader's tlogs are very small in few hundred MBs.

 The following post suggest hard commit is required for relieving the
 memory
 pressure of the transactionlog

 http://lucene.472066.n3.nabble.com/SolrCloud-is-softcommit-cluster-wide-for-the-collection-td4021584.html#a4021631

 what is the best way to do a hard commit on this setup in SolrCloud?

 a. Through autoCommit in SolrConfig? which would cause hard commit on all
 the nodes at different times
 b. Trigger hard commit on leader while updating through SolrJ?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Large-transaction-logs-tp4032144.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr4.0 problem zkHost with multiple hosts throws out of range exception

2013-01-03 Thread Tomás Fernández Löbbe
I think it should be

-DzkHost=zoo1:8983,zoo2:8983,zoo3:8983/solrroot


Tomás


On Thu, Jan 3, 2013 at 2:14 PM, Mark Miller markrmil...@gmail.com wrote:

 I don't really understand your question. More than one what?

 More than one external zk node? Start up an ensemble, and pass a comma sep
 list of the addresses as the zkhost - each one should have the same chroot
 on it.

 - Mark

 On Jan 3, 2013, at 4:32 AM, cmuarg cmu...@gmail.com wrote:

  Hello
 
  I have a ZooKeeper ensemble that is also used for other purposes, and I don't
  want the ZooKeeper root to get messed up with SolrCloud things, so I try to
  use 'chroot'.

  One external ZooKeeper node works fine with -DzkHost=zoo1:8983/solrroot
  (solrroot must exist), but how do I specify more than one?
 
  Thanks
  /C
 
 
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/solr4-0-problem-zkHost-with-multiple-hosts-throws-out-of-range-exception-tp4014440p4030230.html
  Sent from the Solr - User mailing list archive at Nabble.com.




Re: Upgrading from 3.6 to 4.0

2013-01-02 Thread Tomás Fernández Löbbe
AFAIK Solr 4 should be able to read Solr 3.6 indexes. Soon those files will
be updated to 4.0 format and will not be readable by Solr 3.6 anymore. See
http://wiki.apache.org/lucene-java/BackwardsCompatibility
You should not use a 3.6 SolrJ client with a Solr 4 server.

Tomás


On Wed, Jan 2, 2013 at 3:04 PM, Benjamin, Roy rbenja...@ebay.com wrote:

 Will the existing 3.6 indexes work with 4.0 binary ?

 Will 3.6 solrJ clients work with 4.0 servers ?


 Thanks
 Roy


