Re: MLT Java example for Solr 6.3

2016-12-27 Thread Ere Maijala
Just a note that field boosting with the MLT Query Parser is broken, and 
for SolrCloud the whole thing is practically unusable if you index content 
in English, because CloudMLTQParser includes strings from the field 
definitions (such as "stored" and "indexed") in the query. I'm still 
hoping someone will review 
https://issues.apache.org/jira/browse/SOLR-9644, which contains a fix, 
at some point.


--Ere

24.12.2016, 1.26, Anshum Gupta kirjoitti:

Hi Todd,

You can query for similar documents using the MLT Query Parser. The code
would look something like:

// Assuming you want to use CloudSolrClient
CloudSolrClient client = new CloudSolrClient.Builder()
    .withZkHost(zkHost)
    .build();
client.setDefaultCollection(COLLECTION_NAME);
QueryResponse queryResponse = client.query(new SolrQuery("{!mlt qf=foo}docId"));

Notice the *docId*, *qf*, and the *!mlt* part.
docId - External document ID/unique ID of the document you want to query for
qf - fields that you want to use for similarity (you can read more about it
here:
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-MoreLikeThisQueryParser
)
!mlt - the query parser you want to use.
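For completeness, the same request can be made without SolrJ: the `{!mlt}` query is just a `q` parameter on `/select`. Below is a minimal JDK-only sketch of building that URL; the host, collection name, and the `foo` field are placeholders, not details from the original post.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class MltQueryUrl {
    // Build a select URL that asks the MLT query parser for documents
    // similar to the document whose unique ID is docId, using field "foo".
    // Host/collection/field names here are illustrative assumptions.
    static String buildUrl(String host, String collection, String docId) {
        String q = "{!mlt qf=foo}" + docId;
        return host + "/solr/" + collection + "/select?q="
                + URLEncoder.encode(q, StandardCharsets.UTF_8)
                + "&fl=id,score&rows=10";
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://localhost:8983", "mycollection", "docId"));
    }
}
```

Keep in mind that MLT generally needs the qf fields to be stored (or to have term vectors) so it can extract interesting terms from the source document.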


On Thu, Dec 22, 2016 at 3:01 PM  wrote:


I am having trouble locating a decent example for using the MLT Java API
in Solr 6.3. What I want is to retrieve document IDs that are similar to a
given document ID.

Todd Peterson
Chief Embedded Systems Engineer
Management Sciences, Inc.
6022 Constitution Ave NE
Albuquerque, NM 87144
505-255-8611 (office)
505-205-7057 (cell)




--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


How to solve?

2016-12-27 Thread William Bell
We are entering entries into SOLR like the following, and we want to see if 
my pt matches any of these radii.

1. Red, pt=39,-107, radius=10km
2. Blue, pt=39,-108, radius=50km

I want to run a SOLR select with pt=39,-104 and see whether it is within 10km
of point 1 and 50km of point 2.

Usually I know you can:

http://localhost:8983/select?q=*:*&pt=39,-104&sfield=solr_geohash&... ??

One idea was to use bbox and find the N,S,E,W pt for point 1 and point 2.
But this is not ideal; we want to use great-circle distance.

Thoughts?
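For the fixed-radius case, `{!geofilt}` already uses great-circle (haversine) distance against a `location`-type field, so no bbox workaround is needed. A JDK-only sketch of building such a request follows; the `coords` field name, host, and collection are assumptions, not from the original post.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class GeoFilterUrl {
    // geofilt keeps documents whose "coords" point lies within dKm
    // kilometers (great-circle distance) of the given pt.
    // Field/host/collection names are illustrative assumptions.
    static String buildUrl(String host, String collection,
                           double lat, double lon, double dKm) {
        String fq = "{!geofilt sfield=coords pt=" + lat + "," + lon + " d=" + dKm + "}";
        return host + "/solr/" + collection + "/select?q=*:*&fq="
                + URLEncoder.encode(fq, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://localhost:8983", "places", 39, -104, 10));
    }
}
```

The per-document radius part of the question (10km for point 1, 50km for point 2) is a different problem: geofilt applies one radius per query. Matching a query point against stored radii typically means indexing each entry as a circle shape in a spatial RPT field and querying with an Intersects predicate; treat that as a direction to investigate rather than a recipe.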


-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: How to make Solr FuzzyLookupFactory exactMatch case insensitive

2016-12-27 Thread diwakar bhardwaj
I did; that returns exact matches first, but only when I query with
case-matched terms. For example, results for mumbai are different
than those for Mumbai.

On Wed, Dec 28, 2016 at 2:32 AM, Susheel Kumar 
wrote:

> Did you try exactMatchFirst param of FuzzyLookupFactory ?  See
> https://cwiki.apache.org/confluence/display/solr/Suggester
>
> Thanks,
> Susheel
>
> On Sun, Dec 25, 2016 at 6:52 PM, diwakar bhardwaj <
> bhardwajdiwa...@gmail.com
> > wrote:
>
> > Hi,
> >
> > I've implemented a solr suggester with FuzzyLookupFactory and it's working
> > perfectly, except for a minor glitch: it only treats case-sensitive
> > searches as an exact match.
> > For example, results for "mumbai" vs "Mumbai" are different.
> >
> > This is too restrictive and kind of defeating the purpose of the
> suggester.
> >
> > I've posted this on stackoverflow:
> >
> > http://stackoverflow.com/questions/41320424/solr-fuzzylookupfactory-exactmatch-is-case-sensitive
> >
> > Following is the text I posted on stackoverflow
> >
> > I have implemented a solr suggester for a list of cities and areas. I have
> > used FuzzyLookupFactory for this. My schema looks like this:
> >
> >  > positionIncrementGap="100">
> > 
> >  > pattern="[^a-zA-Z0-9]" replacement=" " />
> > 
> >  > ignoreCase="true" expand="true"/>
> > 
> > 
> > 
> >
> > synonym.txt is used for mapping older city names with new ones, like
> > Madras=>Chennai, Saigon=>Ho Chi Minh city
> >
> > My suggester definition looks like this:
> >
> >   
> > 
> >   suggestions
> >   FuzzyLookupFactory
> >   DocumentDictionaryFactory
> >   searchfield
> >   searchscore
> >   suggestTypeLc
> >   false
> >   false
> >   autosuggest_dict
> > 
> >   
> >
> > My request handler looks like this:
> >
> >> startup="lazy">
> > 
> > true
> > 10
> > suggestions
> > results
> > 
> > 
> > suggest
> > 
> >   
> >
> > Now the problem is that the suggester shows the exact match first, but it
> > is case sensitive. For example,
> >
> > /suggest?suggest.q=mumbai (starting with a lower case "m")
> >
> > will give, exact result at 4th place:
> >
> > {
> >   "responseHeader":{
> > "status":0,
> > "QTime":19},
> >   "suggest":{
> > "suggestions":{
> >   "mumbai":{
> > "numFound":10,
> > "suggestions":[{
> > "term":"Mumbai Domestic Airport",
> > "weight":11536},
> >   {
> > "term":"Mumbai Chhatrapati Shivaji Intl Airport",
> > "weight":11376},
> >   {
> > "term":"Mumbai Pune Highway",
> > "weight":2850},
> >   {
> > "term":"Mumbai",
> > "weight":2248},
> > .
> >
> > Whereas, calling /suggest?suggest.q=Mumbai (starting with an upper case
> > "M")
> >
> > is giving exact result at 1st place:
> >
> > {
> >   "responseHeader":{
> > "status":0,
> > "QTime":16},
> >   "suggest":{
> > "suggestions":{
> >   "Mumbai":{
> > "numFound":10,
> > "suggestions":[{
> > "term":"Mumbai",
> > "weight":2248},
> >   {
> > "term":"Mumbai Domestic Airport",
> > "weight":11536},
> >   {
> > "term":"Mumbai Chhatrapati Shivaji Intl Airport",
> > "weight":11376},
> >   {
> > "term":"Mumbai Pune Highway",
> > "weight":2850},
> > ...
> >
> > What am I missing here? What can be done to make Mumbai the first
> > result even when it is queried with a lower case "mumbai"? I thought
> > the case sensitivity was being handled by the "suggestTypeLc" field
> > type I've defined.
> > --
> > Ciao
> > Diwakar
> >
>



-- 
Ciao
Diwakar
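One workaround often suggested for this situation is to normalize case on both sides: build the suggestion dictionary from a lowercased copy of the field, and lowercase the user's input before querying. Below is a JDK-only sketch of the client side; the host and collection names are made up, and this assumes the dictionary itself is also lowercased (otherwise lowercasing the query changes nothing).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class SuggestUrl {
    // Lowercase the user's input so "Mumbai" and "mumbai" reach the
    // suggester as the same query string. Host/collection names are
    // illustrative assumptions.
    static String buildUrl(String host, String collection, String userInput) {
        String normalized = userInput.toLowerCase(Locale.ROOT);
        return host + "/solr/" + collection + "/suggest?suggest=true&suggest.q="
                + URLEncoder.encode(normalized, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://localhost:8983", "cities", "Mumbai"));
    }
}
```

For display purposes the original-cased suggestion can still be returned by the suggester; only the lookup key needs to be normalized.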


Re: Easy way to preserve Solr Admin form input

2016-12-27 Thread Alexandre Rafalovitch
I think there may be a ticket for something similar, or one related to
rerunning the same query/configuration on a new core.

Worth having a quick look anyway.

The challenge would be to write the infrastructure that will unpack
those parameters back into the boxes. Because some go into raw query,
some go into specific boxes, etc.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 27 December 2016 at 16:02, Stefan Matheis  wrote:
> Sebastian,
>
> currently not - I'm sorry to say. We did it for the analysis screen but not
> for the query screen. Shouldn't be too hard to add this kind of persistence.
>
> Would you mind opening a ticket, so we can track the progress? Depending on
> your knowledge, you might be willing to give it a first whirl?
>
> -Stefan
>
> On Dec 27, 2016 12:09 PM, "Sebastian Riemer"  wrote:
>
> Hi,
>
> is there an easy way to preserve the query data I input in SolrAdmin?
>
> E.g. when debugging a query, I often have the desire to reopen the current
> query in solrAdmin in a new browser tab to make slight adaptations to the
> query without losing the original query.  What happens instead is the form
> is opened blank in the new tab and I have to manually copy/paste the
> entered form values.
>
> This is not such a big problem, when I only use the "Raw Query Parameters"
> field, but editing something in that tiny input is a real pain ...
>
> I wonder how others get around this?
>
> Sebastian


Re: Uncaught exception java.lang.StackOverflowError in 6.3.0

2016-12-27 Thread Yago Riveiro
bq: "That is really a job for streaming, not simple faceting.”

True, it’s the next step to improve our performance (right now we are using 
JSON facets), and 6.3.0 has a lot of useful tools to work with streaming 
expressions. Our last release before 6.3 was 5.3.1 and the streaming 
expressions were buggy in some scenarios.

bq: "Okay. You could create a new collection with the wanted amount of shards 
and do a full re-index into that.”

True, you are right but we are trying to avoid that (this point falls into 
“keep management low”).

Solr it’s a amazing tool, with a lack of auto magic management stuff. You have 
all the power and therefore all the work :p

Following your advices I will try to review the topology of my collection and 
try to point the oversharded collections.

--

/Yago Riveiro

On 27 Dec 2016 21:54 +, Toke Eskildsen , wrote:
> Yago Riveiro  wrote:
> > One thing that I forgot to mention is that my clients can aggregate
> > by any field in the schema with limit=-1. This is not a problem with
> > 99% of the fields, but 2 or 3 of them are URLs. URLs have very
> > high cardinality, and one of the reasons for sharding collections is
> > to lower the memory footprint, to not blow up the node and to do the
> > last merge on a big machine.
>
> That is really a job for streaming, not simple faceting.
>
> Even if you insist on faceting, the problem remains that your merger needs to 
> be powerful enough to process the full result set. Using that machine with a 
> single shard collection instead would eliminate the excessive overhead of 
> doing distributed faceting on millions of values, sparing a lot of hardware 
> allocation, which could be used to beef up the single-shard hardware even 
> more.
>
> [Toke: You can always split later]
>
> > Every time I run the SPLITSHARD command, the command fails
> > in a different way. IMHO right now Solr doesn’t have an efficient
> > way to rebalance collection’s shard.
>
> Okay. You could create a new collection with the desired number of shards and 
> do a full re-index into that.
>
> [Toke: "And yes, more logistics on your part as one size no longer fits all”]
>
> > The key point of this deploy is reduce the amount of management
> > as much as possible,
>
> That is your prerogative. I hope my suggestions can be used by other people 
> with similar challenges then.
>
> - Toke Eskildsen


Re: Uncaught exception java.lang.StackOverflowError in 6.3.0

2016-12-27 Thread Toke Eskildsen
Yago Riveiro  wrote:
> One thing that I forgot to mention is that my clients can aggregate
> by any field in the schema with limit=-1. This is not a problem with
> 99% of the fields, but 2 or 3 of them are URLs. URLs have very
> high cardinality, and one of the reasons for sharding collections is
> to lower the memory footprint, to not blow up the node and to do the
> last merge on a big machine.

That is really a job for streaming, not simple faceting.

Even if you insist on faceting, the problem remains that your merger needs to 
be powerful enough to process the full result set. Using that machine with a 
single shard collection instead would eliminate the excessive overhead of doing 
distributed faceting on millions of values, sparing a lot of hardware 
allocation, which could be used to beef up the single-shard hardware even more.

[Toke: You can always split later]

> Every time I run the SPLITSHARD command, the command fails
> in a different way. IMHO right now Solr doesn’t have an efficient
> way to rebalance collection’s shard.

Okay. You could create a new collection with the desired number of shards and do 
a full re-index into that.

[Toke: "And yes, more logistics on your part as one size no longer fits all”]

> The key point of this deploy is reduce the amount of management
> as much as possible,

That is your prerogative. I hope my suggestions can be used by other people 
with similar challenges then.

- Toke Eskildsen
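The streaming approach suggested here could be sketched as a rollup over the /export handler, which aggregates per-value counts while tuples stream past in sorted order, instead of materializing every facet bucket at once. The collection and field names below are placeholders, not the real schema, and this is a JDK-only sketch of the request, not a tested recipe.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class StreamRollupUrl {
    // rollup(search(...)) counts occurrences per "url" value as tuples
    // stream by in sorted order from the /export handler.
    // Collection and field names are illustrative assumptions.
    static String buildUrl(String host, String collection) {
        String expr = "rollup("
                + "search(" + collection + ",q=\"*:*\",fl=\"url\",sort=\"url asc\",qt=\"/export\"),"
                + "over=\"url\",count(*))";
        return host + "/solr/" + collection + "/stream?expr="
                + URLEncoder.encode(expr, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://localhost:8983", "logs"));
    }
}
```

The /export handler requires the sorted/returned fields to have docValues, which is worth checking before relying on this for high-cardinality URL fields.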


Re: Easy way to preserve Solr Admin form input

2016-12-27 Thread Stefan Matheis
Sebastian,

currently not - I'm sorry to say. We did it for the analysis screen but not
for the query screen. Shouldn't be too hard to add this kind of persistence.

Would you mind opening a ticket, so we can track the progress? Depending on
your knowledge, you might be willing to give it a first whirl?

-Stefan

On Dec 27, 2016 12:09 PM, "Sebastian Riemer"  wrote:

Hi,

is there an easy way to preserve the query data I input in SolrAdmin?

E.g. when debugging a query, I often have the desire to reopen the current
query in solrAdmin in a new browser tab to make slight adaptations to the
query without losing the original query.  What happens instead is the form
is opened blank in the new tab and I have to manually copy/paste the
entered form values.

This is not such a big problem, when I only use the "Raw Query Parameters"
field, but editing something in that tiny input is a real pain ...

I wonder how others get around this?

Sebastian


Re: How to make Solr FuzzyLookupFactory exactMatch case insensitive

2016-12-27 Thread Susheel Kumar
Did you try exactMatchFirst param of FuzzyLookupFactory ?  See
https://cwiki.apache.org/confluence/display/solr/Suggester

Thanks,
Susheel

On Sun, Dec 25, 2016 at 6:52 PM, diwakar bhardwaj  wrote:

> Hi,
>
> I've implemented a solr suggester with FuzzyLookupFactory and it's working
> perfectly, except for a minor glitch: it only treats case-sensitive
> searches as an exact match.
> For example, results for "mumbai" vs "Mumbai" are different.
>
> This is too restrictive and kind of defeating the purpose of the suggester.
>
> I've posted this on stackoverflow:
>
> http://stackoverflow.com/questions/41320424/solr-fuzzylookupfactory-exactmatch-is-case-sensitive
>
> Following is the text I posted on stackoverflow
>
> I have implemented a solr suggester for a list of cities and areas. I have
> used FuzzyLookupFactory for this. My schema looks like this:
>
>  positionIncrementGap="100">
> 
>  pattern="[^a-zA-Z0-9]" replacement=" " />
> 
>  ignoreCase="true" expand="true"/>
> 
> 
> 
>
> synonym.txt is used for mapping older city names with new ones, like
> Madras=>Chennai, Saigon=>Ho Chi Minh city
>
> My suggester definition looks like this:
>
>   
> 
>   suggestions
>   FuzzyLookupFactory
>   DocumentDictionaryFactory
>   searchfield
>   searchscore
>   suggestTypeLc
>   false
>   false
>   autosuggest_dict
> 
>   
>
> My request handler looks like this:
>
>startup="lazy">
> 
> true
> 10
> suggestions
> results
> 
> 
> suggest
> 
>   
>
> Now the problem is that the suggester shows the exact match first, but it
> is case sensitive. For example,
>
> /suggest?suggest.q=mumbai (starting with a lower case "m")
>
> will give, exact result at 4th place:
>
> {
>   "responseHeader":{
> "status":0,
> "QTime":19},
>   "suggest":{
> "suggestions":{
>   "mumbai":{
> "numFound":10,
> "suggestions":[{
> "term":"Mumbai Domestic Airport",
> "weight":11536},
>   {
> "term":"Mumbai Chhatrapati Shivaji Intl Airport",
> "weight":11376},
>   {
> "term":"Mumbai Pune Highway",
> "weight":2850},
>   {
> "term":"Mumbai",
> "weight":2248},
> .
>
> Whereas, calling /suggest?suggest.q=Mumbai (starting with an upper case
> "M")
>
> is giving exact result at 1st place:
>
> {
>   "responseHeader":{
> "status":0,
> "QTime":16},
>   "suggest":{
> "suggestions":{
>   "Mumbai":{
> "numFound":10,
> "suggestions":[{
> "term":"Mumbai",
> "weight":2248},
>   {
> "term":"Mumbai Domestic Airport",
> "weight":11536},
>   {
> "term":"Mumbai Chhatrapati Shivaji Intl Airport",
> "weight":11376},
>   {
> "term":"Mumbai Pune Highway",
> "weight":2850},
> ...
>
> What am I missing here? What can be done to make Mumbai the first
> result even when it is queried with a lower case "mumbai"? I thought
> the case sensitivity was being handled by the "suggestTypeLc" field
> type I've defined.
> --
> Ciao
> Diwakar
>


Re: Cloud Behavior when using numShards=1

2016-12-27 Thread Dave Seltzer
Thanks Erick,

That's pretty much where I'd landed on the issue. To me Solr Cloud is
clearly the preferable option here - especially when it comes to indexing
and cluster management. I'll give "preferLocalShards" a try and see what
happens.

Many thanks for your in-depth analysis!

-Dave

Dave Seltzer 
Chief Systems Architect
TVEyes
(203) 254-3600 x222

On Tue, Dec 27, 2016 at 12:22 PM, Erick Erickson 
wrote:

> The form of the query doesn't enter into whether query is passed on to
> a different replica IIUC. preferLocalShards was created to keep this
> from happening though. There's discussion at
> https://issues.apache.org/jira/browse/SOLR-6832.
>
> BTW, "it's just a parameter". At root, the sugar methods (SolrJ, but I
> assume SolrNet too) for setting specific options (rows say) are just
> adding to an underlying map. A SolrJ example SolrQuery.setRows()
> eventually resolves itself to a Map.put("rows", ###). I'm pretty sure
> SolrNet has a generic "setParam" or similar that also just adds a
> value to the list of parameters.
>
> As for whether traditional master/slave would be a better choice...
> Since you only have one shard it's more ambiguous than if you had a
> bunch of shards.
>
> The biggest advantage you get with SolrCloud in your setup is that all
> the pesky issues about failover are handled for you.
>
> The other advantage of SolrCloud is that the client (CloudSolrClient)
> is aware of Zookeeper and can "do the right thing" when nodes come and
> go. In that setup, you don't necessarily even need a load balancer.
> AFAIK, SolrNet hasn't implemented that capability so that's
> irrelevant, and you're using HAProxy anyway so I doubt you care much.
>
> Say you're using M/S rather than SolrCloud. Now say you're indexing
> and the master fails. How difficult is it to recover? How mission
> critical is uninterrupted up-to-date service? How long can recovery
> take and not impact business unduly?
>
> A few scenarios.
>
> 1> worst case. You can't re-index from some arbitrary point in the
> past because the system-of-record isn't available. Thus if your master
> dies you may have lost documents. You really don't want M/S in this
> case.
>
> 2> next worst case, The master dies. Can you have an unchanging index
> on the replicas that you're querying while you spin up a new master
> and then re-index all your data and point your slaves at the new
> master? then M/S is fine.
>
> 3> less bad case. You're indexing and the master dies. Can you stand
> an unchanging index while you promote one of the slaves to be master
> and can pick up indexing from some time X where you're guaranteed that
> the newly-promoted master replicated from the old master? then M/S is
> fine.
>
> 4> Best case. You index once per day (or week or month or...).
> Rebuilding your entire index from the system of record takes X hours,
> and business can wait X hours (possibly using the old index and
> not serving as many queries). M/S is simpler in this case than
> SolrCloud.
>
> So really, IMO, it's a question of whether the failover goodness you
> get with SolrCloud outweighs the complexity of maintaining Zookeeper
> and questions like you're asking now.
>
> IOW "It Depends" (tm).
>
> Best,
> Erick
>
> On Tue, Dec 27, 2016 at 7:59 AM, Dave Seltzer  wrote:
> > Hehe Good Tip :-)
> >
> > preferLocalShards may indeed be a good solution. I'll have to figure out
> > how to pass that parameter using SolrNet.
> >
> > The queries are quite complex. We're sampling audio, calculating hashes
> and
> > comparing them to known hashes.
> >
> > I'll paste an example below.
> >
> > Are nested queries more likely to be distributed in this fashion?
> >
> > -Dave
> >
> > q=_query_:"{!edismax mm=5}hashTable_0:359079936 hashTable_1:440999735
> > hashTable_2:1376147226 hashTable_3:35668745 hashTable_4:671810129
> > hashTable_5:536885545 hashTable_6:453337089 hashTable_7:1279281410
> > hashTable_8:772478009 hashTable_9:806096663 hashTable_10:1779768130
> > hashTable_11:1699416602 hashTable_12:135229216 hashTable_13:68107537
> > hashTable_14:134963224 hashTable_15:772210781 hashTable_16:51315463
> > hashTable_17:306522185 hashTable_18:575080513 hashTable_19:623118387
> > hashTable_20:1159227396 hashTable_21:907954972 hashTable_22:219782400
> > hashTable_23:268848920 hashTable_24:185729340" _query_:"{!edismax
> > mm=5}hashTable_0:830515738 hashTable_1:135401527 hashTable_2:2098135824
> > hashTable_3:2065698563 hashTable_4:672596488 hashTable_5:470813767
> > hashTable_6:453977870 hashTable_7:906104066 hashTable_8:21772611
> > hashTable_9:813630732 hashTable_10:-1973675256 hashTable_11:1577323034
> > hashTable_12:135152649 hashTable_13:236264215 hashTable_14:68300817
> > hashTable_15:85790523 hashTable_16:186191879 hashTable_17:306083351
> > hashTable_18:2011629862 hashTable_19:1364872503 hashTable_20:4128772
> > hashTable_21:689650435 hashTable_22:222499855 hashTable_23:17187346
> > 

Re: Easy way to preserve Solr Admin form input

2016-12-27 Thread Erik Hatcher
How's /browse fare for you?   What params are you adjusting regularly?

> On Dec 27, 2016, at 06:09, Sebastian Riemer  wrote:
> 
> Hi,
> 
> is there an easy way to preserve the query data I input in SolrAdmin?
> 
> E.g. when debugging a query, I often have the desire to reopen the current 
> query in solrAdmin in a new browser tab to make slight adaptations to the query 
> without losing the original query.  What happens instead is the form is 
> opened blank in the new tab and I have to manually copy/paste the entered 
> form values.
> 
> This is not such a big problem, when I only use the "Raw Query Parameters" 
> field, but editing something in that tiny input is a real pain ...
> 
> I wonder how others get around this?
> 
> Sebastian
> 


Re: Easy way to preserve Solr Admin form input

2016-12-27 Thread Erick Erickson
Use curl and edit a file maybe?

Or edit a file and copy/paste into the address bar? I often do this,
and combined with browser format tools for XML or JSON it lets me get
by.

IOW, there's really nothing I know of that allows you to save/retrieve
the contents of the admin UI form. It'd be a neat contribution though.

Best,
Erick



On Tue, Dec 27, 2016 at 3:09 AM, Sebastian Riemer  wrote:
> Hi,
>
> is there an easy way to preserve the query data I input in SolrAdmin?
>
> E.g. when debugging a query, I often have the desire to reopen the current 
> query in solrAdmin in a new browser tab to make slight adaptations to the query 
> without losing the original query.  What happens instead is the form is 
> opened blank in the new tab and I have to manually copy/paste the entered 
> form values.
>
> This is not such a big problem, when I only use the "Raw Query Parameters" 
> field, but editing something in that tiny input is a real pain ...
>
> I wonder how others get around this?
>
> Sebastian
>


Re: Cloud Behavior when using numShards=1

2016-12-27 Thread Erick Erickson
The form of the query doesn't enter into whether query is passed on to
a different replica IIUC. preferLocalShards was created to keep this
from happening though. There's discussion at
https://issues.apache.org/jira/browse/SOLR-6832.

BTW, "it's just a parameter". At root, the sugar methods (SolrJ, but I
assume SolrNet too) for setting specific options (rows say) are just
adding to an underlying map. A SolrJ example SolrQuery.setRows()
eventually resolves itself to a Map.put("rows", ###). I'm pretty sure
SolrNet has a generic "setParam" or similar that also just adds a
value to the list of parameters.

As for whether traditional master/slave would be a better choice...
Since you only have one shard it's more ambiguous than if you had a
bunch of shards.

The biggest advantage you get with SolrCloud in your setup is that all
the pesky issues about failover are handled for you.

The other advantage of SolrCloud is that the client (CloudSolrClient)
is aware of Zookeeper and can "do the right thing" when nodes come and
go. In that setup, you don't necessarily even need a load balancer.
AFAIK, SolrNet hasn't implemented that capability so that's
irrelevant, and you're using HAProxy anyway so I doubt you care much.

Say you're using M/S rather than SolrCloud. Now say you're indexing
and the master fails. How difficult is it to recover? How mission
critical is uninterrupted up-to-date service? How long can recovery
take and not impact business unduly?

A few scenarios.

1> worst case. You can't re-index from some arbitrary point in the
past because the system-of-record isn't available. Thus if your master
dies you may have lost documents. You really don't want M/S in this
case.

2> next worst case, The master dies. Can you have an unchanging index
on the replicas that you're querying while you spin up a new master
and then re-index all your data and point your slaves at the new
master? then M/S is fine.

3> less bad case. You're indexing and the master dies. Can you stand
an unchanging index while you promote one of the slaves to be master
and can pick up indexing from some time X where you're guaranteed that
the newly-promoted master replicated from the old master? then M/S is
fine.

4> Best case. You index once per day (or week or month or...).
Rebuilding your entire index from the system of record takes X hours,
and business can wait X hours (possibly using the old index and
not serving as many queries). M/S is simpler in this case than
SolrCloud.

So really, IMO, it's a question of whether the failover goodness you
get with SolrCloud outweighs the complexity of maintaining Zookeeper
and questions like you're asking now.

IOW "It Depends" (tm).

Best,
Erick

On Tue, Dec 27, 2016 at 7:59 AM, Dave Seltzer  wrote:
> Hehe Good Tip :-)
>
> preferLocalShards may indeed be a good solution. I'll have to figure out
> how to pass that parameter using SolrNet.
>
> The queries are quite complex. We're sampling audio, calculating hashes and
> comparing them to known hashes.
>
> I'll paste an example below.
>
> Are nested queries more likely to be distributed in this fashion?
>
> -Dave
>
> q=_query_:"{!edismax mm=5}hashTable_0:359079936 hashTable_1:440999735
> hashTable_2:1376147226 hashTable_3:35668745 hashTable_4:671810129
> hashTable_5:536885545 hashTable_6:453337089 hashTable_7:1279281410
> hashTable_8:772478009 hashTable_9:806096663 hashTable_10:1779768130
> hashTable_11:1699416602 hashTable_12:135229216 hashTable_13:68107537
> hashTable_14:134963224 hashTable_15:772210781 hashTable_16:51315463
> hashTable_17:306522185 hashTable_18:575080513 hashTable_19:623118387
> hashTable_20:1159227396 hashTable_21:907954972 hashTable_22:219782400
> hashTable_23:268848920 hashTable_24:185729340" _query_:"{!edismax
> mm=5}hashTable_0:830515738 hashTable_1:135401527 hashTable_2:2098135824
> hashTable_3:2065698563 hashTable_4:672596488 hashTable_5:470813767
> hashTable_6:453977870 hashTable_7:906104066 hashTable_8:21772611
> hashTable_9:813630732 hashTable_10:-1973675256 hashTable_11:1577323034
> hashTable_12:135152649 hashTable_13:236264215 hashTable_14:68300817
> hashTable_15:85790523 hashTable_16:186191879 hashTable_17:306083351
> hashTable_18:2011629862 hashTable_19:1364872503 hashTable_20:4128772
> hashTable_21:689650435 hashTable_22:222499855 hashTable_23:17187346
> hashTable_24:1913783558" _query_:"{!edismax mm=5}hashTable_0:622538010
> hashTable_1:337383479 hashTable_2:-1272249576 hashTable_3:271847194
> hashTable_4:522322513 hashTable_5:1110312368 hashTable_6:-1757546994
> hashTable_7:-1939467262 hashTable_8:20196637 hashTable_9:572261655
> hashTable_10:-702476280 hashTable_11:453716754 hashTable_12:134877193
> hashTable_13:169152357 hashTable_14:136117838 hashTable_15:875044907
> hashTable_16:1797459972 hashTable_17:303711774 hashTable_18:1847132476
> hashTable_19:978126878 hashTable_20:120193028 hashTable_21:487858837
> hashTable_22:223803151 hashTable_23:-2079961818 hashTable_24:387645702"
> 

Re: Cloud Behavior when using numShards=1

2016-12-27 Thread Dave Seltzer
Hehe Good Tip :-)

preferLocalShards may indeed be a good solution. I'll have to figure out
how to pass that parameter using SolrNet.

The queries are quite complex. We're sampling audio, calculating hashes and
comparing them to known hashes.

I'll paste an example below.

Are nested queries more likely to be distributed in this fashion?

-Dave

q=_query_:"{!edismax mm=5}hashTable_0:359079936 hashTable_1:440999735
hashTable_2:1376147226 hashTable_3:35668745 hashTable_4:671810129
hashTable_5:536885545 hashTable_6:453337089 hashTable_7:1279281410
hashTable_8:772478009 hashTable_9:806096663 hashTable_10:1779768130
hashTable_11:1699416602 hashTable_12:135229216 hashTable_13:68107537
hashTable_14:134963224 hashTable_15:772210781 hashTable_16:51315463
hashTable_17:306522185 hashTable_18:575080513 hashTable_19:623118387
hashTable_20:1159227396 hashTable_21:907954972 hashTable_22:219782400
hashTable_23:268848920 hashTable_24:185729340" _query_:"{!edismax
mm=5}hashTable_0:830515738 hashTable_1:135401527 hashTable_2:2098135824
hashTable_3:2065698563 hashTable_4:672596488 hashTable_5:470813767
hashTable_6:453977870 hashTable_7:906104066 hashTable_8:21772611
hashTable_9:813630732 hashTable_10:-1973675256 hashTable_11:1577323034
hashTable_12:135152649 hashTable_13:236264215 hashTable_14:68300817
hashTable_15:85790523 hashTable_16:186191879 hashTable_17:306083351
hashTable_18:2011629862 hashTable_19:1364872503 hashTable_20:4128772
hashTable_21:689650435 hashTable_22:222499855 hashTable_23:17187346
hashTable_24:1913783558" _query_:"{!edismax mm=5}hashTable_0:622538010
hashTable_1:337383479 hashTable_2:-1272249576 hashTable_3:271847194
hashTable_4:522322513 hashTable_5:1110312368 hashTable_6:-1757546994
hashTable_7:-1939467262 hashTable_8:20196637 hashTable_9:572261655
hashTable_10:-702476280 hashTable_11:453716754 hashTable_12:134877193
hashTable_13:169152357 hashTable_14:136117838 hashTable_15:875044907
hashTable_16:1797459972 hashTable_17:303711774 hashTable_18:1847132476
hashTable_19:978126878 hashTable_20:120193028 hashTable_21:487858837
hashTable_22:223803151 hashTable_23:-2079961818 hashTable_24:387645702"
_query_:"{!edismax mm=5}hashTable_0:269046593 hashTable_1:202510337
hashTable_2:-1908118760 hashTable_3:557125123 hashTable_4:622985745
hashTable_5:1112540520 hashTable_6:-1760619239 hashTable_7:302584834
hashTable_8:774853149 hashTable_9:407637521 hashTable_10:503842575
hashTable_11:973810450 hashTable_12:386551297 hashTable_13:520687392
hashTable_14:2031254298 hashTable_15:253050461 hashTable_16:1697657095
hashTable_17:307316254 hashTable_18:321716292 hashTable_19:887500833
hashTable_20:120193028 hashTable_21:353632786 hashTable_22:221726992
hashTable_23:1359367954 hashTable_24:218981212" _query_:"{!edismax
mm=5}hashTable_0:354102618 hashTable_1:440534785 hashTable_2:1780351770
hashTable_3:35596035 hashTable_4:371327546 hashTable_5:620958505
hashTable_6:823926785 hashTable_7:106959874 hashTable_8:775171357
hashTable_9:570891537 hashTable_10:470295321 hashTable_11:823007555
hashTable_12:459162889 hashTable_13:163586959 hashTable_14:-1065149104
hashTable_15:422450690 hashTable_16:487142404 hashTable_17:222040067
hashTable_18:323450677 hashTable_19:36375841 hashTable_20:244600580
hashTable_21:1510146588 hashTable_22:571998720 hashTable_23:235287562
hashTable_24:1981482410" _query_:"{!edismax mm=5}hashTable_0:443429471
hashTable_1:437060151 hashTable_2:1145334291 hashTable_3:269043481
hashTable_4:371327531 hashTable_5:288896278 hashTable_6:19277121
hashTable_7:419565314 hashTable_8:1375944989 hashTable_9:571285015
hashTable_10:1728606735 hashTable_11:1560438339 hashTable_12:1263078657
hashTable_13:639901719 hashTable_14:980304657 hashTable_15:889786370
hashTable_16:288954532 hashTable_17:69543944 hashTable_18:52866077
hashTable_19:1174882581 hashTable_20:159002116 hashTable_21:218507036
hashTable_22:286916626 hashTable_23:17128202 hashTable_24:-1235483301"
_query_:"{!edismax mm=5}hashTable_0:1578862134 hashTable_1:439820032
hashTable_2:1715571972 hashTable_3:51184175 hashTable_4:371655241
hashTable_5:473500713 hashTable_6:20579091 hashTable_7:67600402
hashTable_8:336281885 hashTable_9:218958103 hashTable_10:170691901
hashTable_11:153224477 hashTable_12:941347926 hashTable_13:335611671
hashTable_14:352541245 hashTable_15:87010585 hashTable_16:36323236
hashTable_17:304437256 hashTable_18:1850568961 hashTable_19:34031890
hashTable_20:544884996 hashTable_21:588907548 hashTable_22:204955669
hashTable_23:1510304271 hashTable_24:555417973" _query_:"{!edismax
mm=5}hashTable_0:-1844085066 hashTable_1:441982775 hashTable_2:1176983556
hashTable_3:118293016 hashTable_4:374481425 hashTable_5:439943942
hashTable_6:19079169 hashTable_7:321782530 hashTable_8:538016737
hashTable_9:813316631 hashTable_10:169561147 hashTable_11:973210906
hashTable_12:1547197978 hashTable_13:957701387 hashTable_14:1679907747
hashTable_15:356169241 hashTable_16:1378732772 hashTable_17:313198851
hashTable_18:624714050 hashTable_19:67582263 

Re: Cloud Behavior when using numShards=1

2016-12-27 Thread Dorian Hoxha
I think Solr itself tries to load balance. Read this page:
https://cwiki.apache.org/confluence/display/solr/Distributed+Requests
(see preferLocalShards!)
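As a concrete sketch (host and collection names are the ones from the thread, everything else is hypothetical), favoring local replicas just means adding that parameter to the select request:

```python
from urllib.parse import urlencode

# preferLocalShards is the Solr 6.x parameter described on the
# Distributed Requests wiki page linked above; hosts are placeholders.
params = {
    "q": "*:*",
    "preferLocalShards": "true",  # favor the receiving node's own replica
}
url = "http://SERVER1:8983/solr/sf_fingerprints/select?" + urlencode(params)
print(url)
```

With a single shard and a replica on every node, this should keep each node answering from its local index instead of proxying.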

Also please write the query.

Tip: fill in the "to" address only after completing the email.

On Tue, Dec 27, 2016 at 4:31 PM, Dave Seltzer  wrote:

> [Forgive the repeat here, I accidentally clicked send too early]
>
> Hi Everyone,
>
> I have a Solr index which is quite small (400,000 documents totaling 157
> MB) with a query load which is quite large. I therefore want to spread the
> load across multiple Solr servers.
>
> To accomplish this I've created a Solr Cloud cluster with two collections.
> The collections are configured with only 1 shard, but with 3 replicas in
> order to make sure that each of the three Solr servers has all of the data
> and can therefore answer any query without having to request data from
> another server. I use the following command:
>
> solr create -c sf_fingerprints -shards 1 -n fingerprints -replicationFactor 3
>
> I use HAProxy to spread the load across the three servers by directing the
> query to the server with the fewest current connections.
>
> However, when I turn up the load during testing I'm seeing some stuff in
> the logs of SERVER1 which makes me question my understanding of Solr Cloud:
>
> SERVER1: HttpSolrCall null:org.apache.solr.common.SolrException: Error
> trying to proxy request for url: http://SERVER3:8983/solr/sf_fingerprints/select
>
> I'm curious why SERVER1 would be proxying requests to SERVER3 in a
> situation where the sf_fingerprints index is completely present on the
> local system.
>
> Is this a situation where I should be using generic replication rather than
> Cloud?
>
> Many thanks!
>
> -Dave
>


Cloud Behavior when using numShards=1

2016-12-27 Thread Dave Seltzer
[Forgive the repeat here, I accidentally clicked send too early]

Hi Everyone,

I have a Solr index which is quite small (400,000 documents totaling 157
MB) with a query load which is quite large. I therefore want to spread the
load across multiple Solr servers.

To accomplish this I've created a Solr Cloud cluster with two collections.
The collections are configured with only 1 shard, but with 3 replicas in
order to make sure that each of the three Solr servers has all of the data
and can therefore answer any query without having to request data from
another server. I use the following command:

solr create -c sf_fingerprints -shards 1 -n fingerprints -replicationFactor 3

I use HAProxy to spread the load across the three servers by directing the
query to the server with the fewest current connections.
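For reference, the matching HAProxy stanza for least-connections balancing might look like this sketch (backend and server names are placeholders):

```
backend solr_select
    balance leastconn
    server solr1 SERVER1:8983 check
    server solr2 SERVER2:8983 check
    server solr3 SERVER3:8983 check
```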

However, when I turn up the load during testing I'm seeing some stuff in
the logs of SERVER1 which makes me question my understanding of Solr Cloud:

SERVER1: HttpSolrCall null:org.apache.solr.common.SolrException: Error
trying to proxy request for url: http://SERVER3:8983/solr/sf_fingerprints/select

I'm curious why SERVER1 would be proxying requests to SERVER3 in a
situation where the sf_fingerprints index is completely present on the
local system.

Is this a situation where I should be using generic replication rather than
Cloud?

Many thanks!

-Dave


Cloud Behavior when using numShards=1

2016-12-27 Thread Dave Seltzer
Hi Everyone,

I have a Solr index which is quite small (400,000 documents totaling 157
MB) with a query load which is quite large. I therefore want to spread the
load across multiple Solr servers.

To accomplish this I've created a Solr Cloud cluster with two collections.
The collections are configured with only 1 shard, but with 3 replicas in
order to make sure that each of the three Solr servers has all of the data
and can therefore answer any query without having to request data from
another server. I use the following command:



I use HAProxy to spread the load across the three servers by directing the
query to the server with the fewest current connections.

However, when I turn up the load during testing I'm seeing some stuff in
the logs of SERVER1 which makes me question my understanding of Solr Cloud:

SERVER1: HttpSolrCall null:org.apache.solr.common.SolrException: Error
trying to proxy request for url:
http://SERVER3:8983/solr/sf_fingerprints/select

I'm curious why SERVER1 would be proxying requests to SERVER3 in a
situation where the sf_fingerprints index is completely present on the
local system.

Is this a situation where I should be using generic replication rather than
Cloud?


Dave Seltzer 
Chief Systems Architect
TVEyes
(203) 254-3600 x222


Easy way to preserve Solr Admin form input

2016-12-27 Thread Sebastian Riemer
Hi,

is there an easy way to preserve the query data I enter in Solr Admin?

E.g. when debugging a query, I often want to reopen the current query in Solr 
Admin in a new browser tab, to make slight adaptations to the query without 
losing the original one. Instead, the form opens blank in the new tab and I 
have to copy and paste the entered form values manually.

This is not such a big problem when I only use the "Raw Query Parameters" 
field, but editing something in that tiny input is a real pain ...

I wonder how others work around this?
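One workaround (just a sketch, not an admin-UI feature): the query screen displays the generated request URL above the results, and copying that URL into a new tab preserves the whole query. The parameters are then easier to tweak as a URL string than in the tiny input box, for example with Python's standard library and a hypothetical URL:

```python
from urllib.parse import urlsplit, parse_qs, urlencode, urlunsplit

# A request URL copied from the admin UI's query screen (hypothetical example).
url = "http://localhost:8983/solr/techproducts/select?q=ipod&rows=10&fl=id,name"

parts = urlsplit(url)
params = parse_qs(parts.query)

# Tweak one parameter without retyping the rest of the form.
params["rows"] = ["20"]

new_url = urlunsplit(parts._replace(query=urlencode(params, doseq=True)))
print(new_url)
```

Bookmarking the copied URL also preserves the query across browser sessions.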

Sebastian



Re: Uncaught exception java.lang.StackOverflowError in 6.3.0

2016-12-27 Thread Yago Riveiro
One thing that I forgot to mention is that my clients can aggregate by any 
field in the schema with limit=-1. This is not a problem for 99% of the 
fields, but 2 or 3 of them are URLs. URLs have very high cardinality, and one 
of the reasons for sharding collections is to lower the memory footprint so as 
not to blow up a node, and to do the last merge on a big machine.

"Should a collection grow past whatever threshold you determine, you can always 
split it.”

Every time I run the SPLITSHARD command, it fails in a different way. IMHO 
Solr doesn't currently have an efficient way to rebalance a collection's 
shards.
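For context, a SPLITSHARD invocation is just a Collections API request like the sketch below (collection, shard, and request-id names are placeholders); building the URL is the easy part, which doesn't change the point that the operation itself can fail:

```python
from urllib.parse import urlencode

# SPLITSHARD is the documented Collections API action for splitting a
# shard in two; the names used here are hypothetical.
params = {
    "action": "SPLITSHARD",
    "collection": "mycollection",
    "shard": "shard1",
    "async": "split-1",  # run asynchronously; poll with action=REQUESTSTATUS
}
url = "http://localhost:8983/solr/admin/collections?" + urlencode(params)
print(url)
```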

"And yes, more logistics on your part as one size no longer fits all”

The key point of this deployment is to reduce the amount of management as much 
as possible. Solr has improved cluster management a lot compared with the 4.x 
releases. Even so, it remains difficult to manage a big cluster without custom 
tools.

Solr continues to improve with each version, and I've seen a lot of nice stuff 
in issues like SOLR-9735 and SOLR-9241.

--

/Yago Riveiro

On 26 Dec 2016 22:10 +, Toke Eskildsen , wrote:
> Yago Riveiro  wrote:
> > My cluster holds more than 10B documents stored in 15T.
> >
> > The size of my collections is variable but I have collections with 800M
> > documents distributed over the 12 nodes, the amount of documents per shard
> > is ~66M and indeed the performance is good.
>
> The math supports Erick's point about over-sharding. On average you have:
> 15 TB/ 1200 collections / 12 shards ~= 1GB / shard.
> 10B docs / 1200 collections / 12 shards ~= 700K docs/shard
>
> While your 12 shards fits well with your large collections, such as the one 
> you described above, they are a very poor match for your average collection. 
> Assuming your collections behave roughly the same way as each other, your 
> average and smaller than average collections would be much better off with 
> just 1 shard (and 2 replicas). That eliminates the overhead of distributed 
> search-requests (for that collection) and lowers your overall shard-count 
> significantly. Should a collection grow past whatever threshold you 
> determine, you can always split it.
>
> Better performance, lower hardware requirements, more manageable shard 
> amount. And yes, more logistics on your part as one size no longer fits all.
>
> - Toke Eskildsen
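The quoted per-shard averages are easy to verify:

```python
total_bytes = 15e12        # 15 TB across the cluster
total_docs = 10e9          # 10B documents
collections = 1200
shards_per_collection = 12

bytes_per_shard = total_bytes / collections / shards_per_collection
docs_per_shard = total_docs / collections / shards_per_collection

print(round(bytes_per_shard / 1e9, 2))  # ~1.04 GB per shard
print(round(docs_per_shard))            # ~694444 docs per shard
```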