Re: A feature idea for discussion -- fields that can only be explicitly retrieved

2017-01-13 Thread Erick Erickson
bq: Is my understanding about stored fields correct, that even if excluded
from fl, the data on the disk for a given field would still be read as
part of decompression..

Assuming any stored field (NOT docValues) was read, then this is, indeed,
correct. To be pedantic about it, enough 16K blocks will be read/decompressed
to get all the fields of the doc, then the necessary fields will be extracted.

I gather it's kind of hard to index into a compressed blob and extract just a
specific field.
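[Editor's note: the point above can be sketched in toy form. The snippet below illustrates the general idea only — serialized docs packed into fixed 16K chunks, each chunk compressed as a unit with zlib — and is not Lucene's actual stored-fields format.]

```python
import zlib

# Toy illustration of block-compressed stored fields: documents are
# serialized back to back, the byte stream is cut into fixed 16K chunks,
# and each chunk is compressed as a unit.
BLOCK_SIZE = 16 * 1024

def pack_blocks(docs):
    """Serialize docs contiguously, record each doc's (offset, length),
    then compress the stream block by block."""
    raw = b""
    locations = []
    for doc in docs:
        data = repr(doc).encode()
        locations.append((len(raw), len(data)))
        raw += data
    blocks = [zlib.compress(raw[i:i + BLOCK_SIZE])
              for i in range(0, len(raw), BLOCK_SIZE)]
    return blocks, locations

def read_doc(blocks, locations, doc_id):
    """Fetching one doc (even a single field of it) means decompressing
    every 16K block the doc overlaps -- the cost described above."""
    start, length = locations[doc_id]
    first = start // BLOCK_SIZE
    last = (start + length - 1) // BLOCK_SIZE
    raw = b"".join(zlib.decompress(blocks[b]) for b in range(first, last + 1))
    offset = start - first * BLOCK_SIZE
    return raw[offset:offset + length]

docs = [{"id": i, "title": "doc %d" % i, "body": "x" * 5000} for i in range(10)]
blocks, locations = pack_blocks(docs)
print(read_doc(blocks, locations, 3)[:18])  # b"{'id': 3, 'title':"
```

Even to read just the small "id" field of one document, every block that document touches must be decompressed first.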

Alexandre:

I think Shawn's looking for the opposite: if useDocValuesAsStored="true",
do _not_ fetch any fields where docValues=false and stored=true for
fl=*.

Best,
Erick



On Fri, Jan 13, 2017 at 6:03 PM, Shawn Heisey  wrote:
> On 1/13/2017 1:02 PM, Erick Erickson wrote:
>> What about using the defaults in requestHandlers along with SOLR-3191
>> to accomplish this? Let's say that there was an fl-exclusion
>> parameter. Now you'd be able to define an exclusion default that would
>> exclude your field(s) unless overridden in your request handler. This
>> could be either a default or invariant depending on how strictly you
>> wanted to enforce not being able to retrieve the field.
>
> If it's done with a parameter, I would want the parameter to work
> correctly if included multiple times, then add an exclusion default to
> the appends section rather than defaults or invariants.
>
>> And one thing about your notion. docValues are only primitive types,
>> i.e. string in this case. There's a limit I believe on how big these
>> can be, 32K? Which seems rather restrictive in this case so we're back
>> to stored.
>
> Oh, fun.  32K might be enough for my index, but it is not enough for
> general usage.
>
> Is my understanding about stored fields correct, that even if excluded
> from fl, the data on the disk for a given field would still be read as
> part of decompression?  That's what I was hoping to avoid by using
> docValues.  Just how much pain would be involved in implementing an
> option to disable stored field compression, and if that became possible,
> would it avoid the need to read field data that isn't used?
>
> Thanks,
> Shawn
>


Re: A feature idea for discussion -- fields that can only be explicitly retrieved

2017-01-13 Thread Alexandre Rafalovitch
On 13 January 2017 at 14:40, Shawn Heisey  wrote:
> What if there were a schema option that would skip docValue retrieval
> for a field unless the fl parameter were to *explicitly* ask for that
> field?  With a typical wildcard value in fl, fields with this option
> enabled would not be retrieved.

Isn't that what useDocValuesAsStored="false" does? As per the Ref Guide:
When useDocValuesAsStored="false", non-stored DocValues fields can
still be explicitly requested by name in the fl param, but will not
match glob patterns ("*").
https://cwiki.apache.org/confluence/display/solr/DocValues
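[Editor's note: for reference, a docValues-only field opting out of glob retrieval would be declared along these lines in the schema — a sketch only; the field name and type here are made up.]

```xml
<!-- Returned only when named explicitly in fl; never matched by fl=* -->
<field name="internal_blob" type="string" indexed="false" stored="false"
       docValues="true" useDocValuesAsStored="false"/>
```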

Regards,
   Alex.


http://www.solr-start.com/ - Resources for Solr users, new and experienced


Re: A feature idea for discussion -- fields that can only be explicitly retrieved

2017-01-13 Thread Shawn Heisey
On 1/13/2017 1:02 PM, Erick Erickson wrote:
> What about using the defaults in requestHandlers along with SOLR-3191
> to accomplish this? Let's say that there was an fl-exclusion
> parameter. Now you'd be able to define an exclusion default that would
> exclude your field(s) unless overridden in your request handler. This
> could be either a default or invariant depending on how strictly you
> wanted to enforce not being able to retrieve the field. 

If it's done with a parameter, I would want the parameter to work
correctly if included multiple times, then add an exclusion default to
the appends section rather than defaults or invariants.

> And one thing about your notion. docValues are only primitive types,
> i.e. string in this case. There's a limit I believe on how big these
> can be, 32K? Which seems rather restrictive in this case so we're back
> to stored.

Oh, fun.  32K might be enough for my index, but it is not enough for
general usage.

Is my understanding about stored fields correct, that even if excluded
from fl, the data on the disk for a given field would still be read as
part of decompression?  That's what I was hoping to avoid by using 
docValues.  Just how much pain would be involved in implementing an
option to disable stored field compression, and if that became possible,
would it avoid the need to read field data that isn't used?

Thanks,
Shawn



Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-13 Thread Shawn Heisey
On 1/13/2017 5:46 PM, Chetas Joshi wrote:
> One of the things I have observed is: if I use the collection API to
> create a replica for that shard, it does not complain about the config
> which has been set to ReplicationFactor=1. If replication factor was
> the issue as suggested by Shawn, shouldn't it complain? 

The replicationFactor value is used by exactly two things:  initial
collection creation, and autoAddReplicas.  It will not affect ANY other
command or operation, including ADDREPLICA.  You can create MORE
replicas than replicationFactor indicates, and there will be no error
messages or warnings.

In order to have a replica automatically added, your replicationFactor
must be at least two, and the number of active replicas in the cloud for
a shard must be less than that number.  If that's the case and the
expiration times have been reached without recovery, then Solr will
automatically add replicas until there are at least as many replicas
operational as specified in replicationFactor.
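[Editor's note: the rule above can be restated as a tiny predicate. This is a sketch of the described behavior, not Solr's actual overseer code.]

```python
def should_auto_add_replica(replication_factor, active_replicas):
    """autoAddReplicas only fires when the collection was created with
    replicationFactor >= 2 AND a shard currently has fewer active
    replicas than that number."""
    return replication_factor >= 2 and active_replicas < replication_factor

# replicationFactor=1: a lost replica is never replaced automatically
print(should_auto_add_replica(1, 0))   # False
# replicationFactor=3 with one replica down: the overseer adds one
print(should_auto_add_replica(3, 2))   # True
```

This matches the thread: with replicationFactor=1, losing the only replica never triggers an automatic replacement.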

> I would also like to mention that I experience some instance dirs
> getting deleted and also found this open bug
> (https://issues.apache.org/jira/browse/SOLR-8905) 

The description on that issue is incomprehensible.  I can't make any
sense out of it.  It mentions the core.properties file, but the error
message shown doesn't talk about the properties file at all.  The error
and issue description seem to have nothing at all to do with the code
lines that were quoted.  Also, it was reported on version 4.10.3 ... but
this is going to be significantly different from current 6.x versions,
and the 4.x versions will NOT be updated with bugfixes.

Thanks,
Shawn



Re: Deleting a shard in solr 4.10.4

2017-01-13 Thread Rachid Bouacheria
Thank you so much Erick!


On Fri, Jan 13, 2017 at 4:40 PM, Erick Erickson 
wrote:

> Here's what I'd do
> 1> create a new collection with a single shard
> 2> use the MERGEINDEXES core admin API command to merge the indexes
> from the old 2-shard collection
>
> That way you have a chance to verify that the merged collection is OK
> before deleting the old 2-shard collection.
>
> On Fri, Jan 13, 2017 at 4:08 PM, Rachid Bouacheria 
> wrote:
> > Hi All,
> >
> > I have a collection that has 2 shards, and I am finding that the 2 shards
> > are unnecessary.
> > So I would like to delete one of the shards without losing its data.
> >
> > Illustration:
> > Before: the collection has Shard 1 and Shard 2
> > After: the collection has a single shard containing the data of both
> > Shard 1 and Shard 2
> >
> > I would think that someone must have done this already. So I would
> > appreciate any input.
> >
> > Thank you!
> > Rachid
>


Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-13 Thread Chetas Joshi
Erick, I have not changed any config. I have autoAddReplicas=true for the
individual collection config as well as the overall cluster config. Still,
it does not add a replica when I decommission a node.

Adding a replica is overseer's job. I looked at the logs of the overseer of
the solrCloud but could not find anything there as well.

I am doing some testing using different configs. I would be happy to share
my findings.

One of the things I have observed is: if I use the collection API to create
a replica for that shard, it does not complain about the config which has
been set to ReplicationFactor=1. If replication factor was the issue as
suggested by Shawn, shouldn't it complain?

I would also like to mention that I experience some instance dirs getting
deleted and also found this open bug (
https://issues.apache.org/jira/browse/SOLR-8905)

Thanks!

On Thu, Jan 12, 2017 at 9:50 AM, Erick Erickson 
wrote:

> Hmmm, have you changed any of the settings for autoAddReplcia? There
> are several parameters that govern how long before a replica would be
> added.
>
> But I suggest you use the Cloudera resources for this question, not
> only did they write this functionality, but Cloudera support is deeply
> embedded in HDFS and I suspect has _by far_ the most experience with
> it.
>
> And that said, anything you find out that would suggest good ways to
> clarify the docs would be most welcome!
>
> Best,
> Erick
>
> On Thu, Jan 12, 2017 at 8:42 AM, Shawn Heisey  wrote:
> > On 1/11/2017 7:14 PM, Chetas Joshi wrote:
> >> This is what I understand about how Solr works on HDFS. Please correct
> >> me if I am wrong.
> >>
> >> Although solr shard replicationFactor = 1, HDFS default replication = 3.
> >> When the node goes down, the solr server running on that node goes down
> >> and hence the instance (core) representing the replica goes down. The
> >> data is on HDFS (distributed across all the datanodes of the hadoop
> >> cluster with 3X replication).  This is the reason why I have kept
> >> replicationFactor=1.
> >>
> >> As per the link:
> >> https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
> >> One benefit to running Solr in HDFS is the ability to automatically add
> >> new replicas when the Overseer notices that a shard has gone down.
> >> Because the "gone" index shards are stored in HDFS, a new core will be
> >> created and the new core will point to the existing indexes in HDFS.
> >>
> >> This is the expected behavior of the Solr overseer which I am not able
> >> to see. After a couple of hours a node was assigned to host the shard
> >> but the status of the shard is still "down" and the instance dir is
> >> missing on that node for that particular shard_replica.
> >
> > As I said before, I know very little about HDFS, so the following could
> > be wrong, but it makes sense so I'll say it:
> >
> > I would imagine that Solr doesn't know or care what your HDFS
> > replication is ... the only replicas it knows about are the ones that it
> > is managing itself.  The autoAddReplicas feature manages *SolrCloud*
> > replicas, not HDFS replicas.
> >
> > I have seen people say that multiple SolrCloud replicas will take up
> > additional space in HDFS -- they do not point at the same index files.
> > This is because proper Lucene operation requires that it lock an index
> > and prevent any other thread/process from writing to the index at the
> > same time.  When you index, SolrCloud updates all replicas independently
> > -- the only time indexes are replicated is when you add a new replica or
> > a serious problem has occurred and an index needs to be recovered.
> >
> > Thanks,
> > Shawn
> >
>


Re: Deleting a shard in solr 4.10.4

2017-01-13 Thread Erick Erickson
Here's what I'd do
1> create a new collection with a single shard
2> use the MERGEINDEXES core admin API command to merge the indexes
from the old 2-shard collection

That way you have a chance to verify that the merged collection is OK
before deleting the old 2-shard collection.

On Fri, Jan 13, 2017 at 4:08 PM, Rachid Bouacheria  wrote:
> Hi All,
>
> I have a collection that has 2 shards, and I am finding that the 2 shards
> are unnecessary.
> So I would like to delete one of the shards without losing its data.
>
> Illustration:
> Before: the collection has Shard 1 and Shard 2
> After: the collection has a single shard containing the data of both
> Shard 1 and Shard 2
>
> I would think that someone must have done this already. So I would
> appreciate any input.
>
> Thank you!
> Rachid


Deleting a shard in solr 4.10.4

2017-01-13 Thread Rachid Bouacheria
Hi All,

I have a collection that has 2 shards, and I am finding that the 2 shards
are unnecessary.
So I would like to delete one of the shards without losing its data.

Illustration:
Before: the collection has Shard 1 and Shard 2
After: the collection has a single shard containing the data of both
Shard 1 and Shard 2

I would think that someone must have done this already. So I would
appreciate any input.

Thank you!
Rachid


Re: Large index recommendation

2017-01-13 Thread Toke Eskildsen
Joe Obernberger  wrote:

[3 billion docs / 16TB / 27 shards on HDFS times 3 for replication]

> Each shard is then hosting about 610GBytes of index.  The HDFS cache
> size is very low at about 8GBytes.  Suffice it to say, performance isn't
> very good, but again, this is for experimentation.

We are running a setup with local SSDs where our shards are 900GB with ~6GB 
free for disk cache for each. But the shards are static and fully optimized. If 
your data are non-changing time-series, you might want to consider a model with 
dedicated search-only and shard-build-only nodes to lower hardware requirements.

> If we were to redo this, would it be better to create many shards -
> maybe 200 with 3 replicas each (600 in all) with the goal being to
> withstand a server going out, and future expansion as more hardware is
> added?  I know this is a very general question.  Thanks very much in advance!

As Erick says, you are in the fortunate position that it is reasonably easy
for you to prototype and extrapolate, as you are scaling out. I will add that
you should keep in mind that under the constraint of constant hardware, you
might win latency by sharding, but you will lose maximum throughput (due to
increased duplicate information and increased logistics). If you can simply
add hardware, then do so. If you have under-utilized CPUs and a need for lower
latency, then try more shards on existing hardware (the Scott@FullStory slides
that Susheel mentions seem fitting here).

- Toke Eskildsen


Re: A feature idea for discussion -- fields that can only be explicitly retrieved

2017-01-13 Thread Erick Erickson
What about using the defaults in requestHandlers
along with SOLR-3191 to accomplish this? Let's
say that there was an fl-exclusion parameter. Now
you'd be able to define an exclusion default that
would exclude your field(s) unless overridden in your
request handler. This could be either a default or
invariant depending on how strictly you wanted to
enforce not being able to retrieve the field.

I'm not entirely sure how I feel about this option, but
wanted to throw it out for discussion. It does seem
easier to keep track of than another schema field
option.

I see no reason to make a distinction between
docValues only and stored-only though.

And one thing about your notion. docValues are only
primitive types, i.e. string in this case. There's a limit
I believe on how big these can be, 32K? Which seems
rather restrictive in this case so we're back to stored.

Not sure if that limit is configurable or not.

Erick



On Fri, Jan 13, 2017 at 11:40 AM, Shawn Heisey  wrote:
> I've got an idea for a feature that I think could be very useful.  I'd
> like to get some community feedback about it, see whether it's worth
> opening an issue for discussion.
>
> First, some background info:
>
> As I understand it, the fact that stored fields are compressed means
> that even if a particular stored field is not requested in the fl
> parameter, the data on disk for that field must still be read, in order
> to decompress the data and find the fields that ARE desired.  If one of
> the stored fields that's NOT requested is really large, that would
> pollute the OS disk cache with useless data.
>
> If the data for a field in the results comes from docValues instead of
> stored fields, I don't think it is compressed, which hopefully means
> that if a field is NOT requested, the corresponding docValues data is
> never read.
>
> And now for the idea:
>
> What if there were a schema option that would skip docValue retrieval
> for a field unless the fl parameter were to *explicitly* ask for that
> field?  With a typical wildcard value in fl, fields with this option
> enabled would not be retrieved.  If the field is not stored, not
> indexed, but has docValues, I *think* its presence on the disk would not
> affect performance (OS disk cache efficiency) unless its data is
> returned in results.
>
> One practical application, should my theory about docValues prove to be
> accurate:  Implementing a field that contains all the data sent for
> indexing, which could then be used for completely internal reindexing.
> A field like this would probably be detrimental to performance unless it
> could be automatically excluded without the client asking for the exclusion.
>
> SOLR-3191 is a sort-of related issue.  This links to SOLR-9467, which
> made me think of another potential use -- making it so certain fields
> are semi-secure because they aren't returned unless they are explicitly
> requested.  It wouldn't be TRULY secure, of course.
>
> Thanks,
> Shawn
>


A feature idea for discussion -- fields that can only be explicitly retrieved

2017-01-13 Thread Shawn Heisey
I've got an idea for a feature that I think could be very useful.  I'd
like to get some community feedback about it, see whether it's worth
opening an issue for discussion.

First, some background info:

As I understand it, the fact that stored fields are compressed means
that even if a particular stored field is not requested in the fl
parameter, the data on disk for that field must still be read, in order
to decompress the data and find the fields that ARE desired.  If one of
the stored fields that's NOT requested is really large, that would
pollute the OS disk cache with useless data.

If the data for a field in the results comes from docValues instead of
stored fields, I don't think it is compressed, which hopefully means
that if a field is NOT requested, the corresponding docValues data is
never read.

And now for the idea:

What if there were a schema option that would skip docValue retrieval
for a field unless the fl parameter were to *explicitly* ask for that
field?  With a typical wildcard value in fl, fields with this option
enabled would not be retrieved.  If the field is not stored, not
indexed, but has docValues, I *think* its presence on the disk would not
affect performance (OS disk cache efficiency) unless its data is
returned in results.
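[Editor's note: the proposed semantics could be sketched like this. This is a hypothetical feature, not existing Solr behavior, and the flag name "explicit_only" and field names below are made up.]

```python
import fnmatch

def resolve_fl(fl, schema):
    """Sketch of the proposal: a field flagged explicit_only is returned
    when named literally in fl, but never via a wildcard pattern."""
    requested = [p.strip() for p in fl.split(",")]
    selected = []
    for name, opts in schema.items():
        for pat in requested:
            literal = pat == name
            wildcard = "*" in pat and fnmatch.fnmatch(name, pat)
            if literal or (wildcard and not opts.get("explicit_only")):
                selected.append(name)
                break
    return selected

schema = {"id": {}, "title": {}, "all_source": {"explicit_only": True}}
print(resolve_fl("*", schema))            # ['id', 'title']
print(resolve_fl("*,all_source", schema)) # ['id', 'title', 'all_source']
```

With fl=* the "reindex source" field stays untouched on disk; only an explicit request pulls it in.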

One practical application, should my theory about docValues prove to be
accurate:  Implementing a field that contains all the data sent for
indexing, which could then be used for completely internal reindexing. 
A field like this would probably be detrimental to performance unless it
could be automatically excluded without the client asking for the exclusion.

SOLR-3191 is a sort-of related issue.  This links to SOLR-9467, which
made me think of another potential use -- making it so certain fields
are semi-secure because they aren't returned unless they are explicitly
requested.  It wouldn't be TRULY secure, of course.

Thanks,
Shawn



Re: Large index recommendation

2017-01-13 Thread Erick Erickson
In any case, this is really "the sizing question" and generic answers
are not reliable. Here's a long blog about why, but the net-net is
"prototype and measure". Fortunately you can prototype with just a few
nodes (I usually want at least 2 shards) and extrapolate reasonably
well.

https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick

On Fri, Jan 13, 2017 at 10:29 AM, Susheel Kumar  wrote:
> As per Scott@FullStory, you should see benefits with many smaller shards
> rather than a few bigger ones. Also, upgrading to Solr 6.2 would be better,
> as there are many improvements in handling multiple shards. See the
> presentation below:
>
> http://www.slideshare.net/lucidworks/large-scale-solr-at-fullstory-presented-by-scott-blum-fullstory
>
>
> Thnx
> Susheel
>
> On Fri, Jan 13, 2017 at 12:56 PM, Joe Obernberger <
> joseph.obernber...@gmail.com> wrote:
>
>> Hi All - we've been experimenting with Solr Cloud 5.5.0 with a 27 shard
>> (no replication - each shard runs on a physical host) cluster on top of
>> HDFS.  It currently just crossed 3 billion documents indexed with an index
>> size of 16.1TBytes.  In HDFS with 3x replication this takes up 48.2TBytes.
>>
>> Each shard is then hosting about 610GBytes of index.  The HDFS cache size
>> is very low at about 8GBytes.  Suffice it to say, performance isn't very
>> good, but again, this is for experimentation.
>>
>> If we were to redo this, would it be better to create many shards - maybe
>> 200 with 3 replicas each (600 in all) with the goal being to withstand a
>> server going out, and future expansion as more hardware is added?  I know
>> this is a very general question.  Thanks very much in advance!
>>
>> -Joe
>>
>>


Re: Large index recommendation

2017-01-13 Thread Susheel Kumar
As per Scott@FullStory, you should see benefits with many smaller shards
rather than a few bigger ones. Also, upgrading to Solr 6.2 would be better,
as there are many improvements in handling multiple shards. See the
presentation below:

http://www.slideshare.net/lucidworks/large-scale-solr-at-fullstory-presented-by-scott-blum-fullstory


Thnx
Susheel

On Fri, Jan 13, 2017 at 12:56 PM, Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Hi All - we've been experimenting with Solr Cloud 5.5.0 with a 27 shard
> (no replication - each shard runs on a physical host) cluster on top of
> HDFS.  It currently just crossed 3 billion documents indexed with an index
> size of 16.1TBytes.  In HDFS with 3x replication this takes up 48.2TBytes.
>
> Each shard is then hosting about 610GBytes of index.  The HDFS cache size
> is very low at about 8GBytes.  Suffice it to say, performance isn't very
> good, but again, this is for experimentation.
>
> If we were to redo this, would it be better to create many shards - maybe
> 200 with 3 replicas each (600 in all) with the goal being to withstand a
> server going out, and future expansion as more hardware is added?  I know
> this is a very general question.  Thanks very much in advance!
>
> -Joe
>
>


Large index recommendation

2017-01-13 Thread Joe Obernberger
Hi All - we've been experimenting with Solr Cloud 5.5.0 with a 27 shard 
(no replication - each shard runs on a physical host) cluster on top of 
HDFS.  It currently just crossed 3 billion documents indexed with an 
index size of 16.1TBytes.  In HDFS with 3x replication this takes up 
48.2TBytes.


Each shard is then hosting about 610GBytes of index.  The HDFS cache 
size is very low at about 8GBytes.  Suffice it to say, performance isn't 
very good, but again, this is for experimentation.


If we were to redo this, would it be better to create many shards - 
maybe 200 with 3 replicas each (600 in all) with the goal being to 
withstand a server going out, and future expansion as more hardware is 
added?  I know this is a very general question.  Thanks very much in advance!


-Joe



Re: equivalent of json.facet's "gap" keyword in /sql

2017-01-13 Thread Joel Bernstein
The time functions aren't supported in the SQL interface currently.

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Jan 13, 2017 at 10:44 AM, radha krishnan  wrote:

> Hi,
>
> can we write an SQL statement and use the /sql handler to get the
> json.facet's "gap" functionality.
>
> Ex facet query :
>
> json.facet: {
>   my_histogram: {
>     type: range,
>     field: i_timestamp,
>     start: "2016-10-21T01:00:00Z",
>     end: "2016-10-21T02:00:00Z",
>     gap: "+1MINUTE",
>     mincount: 0
>   }
> }
>
>
> Thanks,
> Radhakrishnan
>


Re: Trouble boosting a field

2017-01-13 Thread Tom Chiverton
Well, I've tried much larger values than 8, and it still doesn't seem to
do the job?


For now, assume my users are searching for exact substrings of a real
title.


Tom


On 13/01/17 16:22, Walter Underwood wrote:

I use a boost of 8 for title with no boost on the content. Both Infoseek and 
Inktomi settled on the 8X boost, getting there with completely different 
methodologies.

You might not want the title to completely trump the content. That causes some 
odd anomalies. If someone searches for “ice age 2”, do you really want every 
title with “2” to come before “ice age two”? Or a search for “steve jobs” to 
return every article with “job” or “jobs” in the title first?

Also, use “edismax”, not “dismax”. Dismax was obsolete in Solr 3.x, five years 
ago.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Jan 13, 2017, at 7:10 AM, Tom Chiverton  wrote:

I have a few hundred documents with title and content fields.

I want a match in title to trump matches in content. If I search for "connected 
vehicle" then a news article that has that in the content shouldn't be ranked higher 
than the page with that in the title is essentially what I want.

I have tried dismax with qf=title^2 as well as several other variants with the standard query parser 
(like q="title:"foo"^2 OR content:"foo") but documents without the search term 
in the title still come out before those with the term in the title when ordered by score.

Is there something I am missing ?

 From the docs, something like q=title:"connected vehicle"^2 OR content:"connected 
vehicle" should have worked ? Even using ^100 didn't help.

I tried with the dismax parser using

   "q": "Connected Vehicle",
   "defType": "dismax",
   "indent": "true",
   "qf": "title^2000 content",
   "pf": "pf=title^4000 content^2",
   "sort": "score desc",
   "wt": "json",

but that was not better. if I remove content from pf/qf then documents seem to 
rank correctly.
Example query and results (content omitted) : http://pastebin.com/5EhrRJP8 
 with managed-schema http://pastebin.com/mdraWQWE 


--



Tom Chiverton
Lead Developer

e:   t...@extravision.com 

p:  0161 817 2922
t:  @extravision 
w:   www.extravision.com 


 

Registered in the UK at: 107 Timber Wharf, 33 Worsley Street, Manchester, M15 
4LD.
Company Reg No: 0‌‌5017214 VAT: GB 8‌‌24 5386 19

This e-mail is intended solely for the person to whom it is addressed and may 
contain confidential or privileged information.
Any views or opinions presented in this e-mail are solely of the author and do 
not necessarily represent those of Extravision Ltd.







Re: Trouble boosting a field

2017-01-13 Thread Walter Underwood
I use a boost of 8 for title with no boost on the content. Both Infoseek and 
Inktomi settled on the 8X boost, getting there with completely different 
methodologies.

You might not want the title to completely trump the content. That causes some 
odd anomalies. If someone searches for “ice age 2”, do you really want every 
title with “2” to come before “ice age two”? Or a search for “steve jobs” to 
return every article with “job” or “jobs” in the title first?

Also, use “edismax”, not “dismax”. Dismax was obsolete in Solr 3.x, five years 
ago.
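[Editor's note: the effect of a qf boost like title^8 can be sketched with a toy scoring function in the spirit of dismax's max-over-fields combination. Illustrative only; real per-field scores come from Lucene's similarity implementation, and the numbers below are made up.]

```python
def dismax_score(field_scores, boosts, tie=0.0):
    """Toy model of dismax scoring: each field's raw score is multiplied
    by its qf boost, the best field wins, and `tie` mixes in the rest."""
    weighted = [field_scores.get(f, 0.0) * b for f, b in boosts.items()]
    return max(weighted) + tie * (sum(weighted) - max(weighted))

boosts = {"title": 8.0, "content": 1.0}  # the 8x title boost discussed above

# A modest title match now outranks a strong content-only match:
title_hit = {"title": 1.2, "content": 0.3}
content_only = {"content": 2.5}
print(dismax_score(title_hit, boosts))     # 9.6
print(dismax_score(content_only, boosts))  # 2.5
```

Note the boost tilts the ranking but guarantees nothing absolute: a large enough content score can still win.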

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jan 13, 2017, at 7:10 AM, Tom Chiverton  wrote:
> 
> I have a few hundred documents with title and content fields. 
> 
> I want a match in title to trump matches in content. If I search for 
> "connected vehicle" then a news article that has that in the content 
> shouldn't be ranked higher than the page with that in the title is 
> essentially what I want.
> 
> I have tried dismax with qf=title^2 as well as several other variants with 
> the standard query parser (like q="title:"foo"^2 OR content:"foo") but 
> documents without the search term in the title still come out before those 
> with the term in the title when ordered by score.
> 
> Is there something I am missing ?
> 
> From the docs, something like q=title:"connected vehicle"^2 OR 
> content:"connected vehicle" should have worked ? Even using ^100 didn't help.
> 
> I tried with the dismax parser using 
> 
>   "q": "Connected Vehicle",
>   "defType": "dismax",
>   "indent": "true",
>   "qf": "title^2000 content",
>   "pf": "pf=title^4000 content^2",
>   "sort": "score desc",
>   "wt": "json",
> 
> but that was not better. if I remove content from pf/qf then documents seem 
> to rank correctly.
> Example query and results (content omitted) : http://pastebin.com/5EhrRJP8 
>  with managed-schema 
> http://pastebin.com/mdraWQWE 
> 



Re: Trouble boosting a field

2017-01-13 Thread Erick Erickson
Tom:

The output is numbing, but add debug=true to your query and you'll see
exactly what contributed to the score and why. Otherwise you're flying
blind. Obviously something's trumping your boosting, but you can't pin down
what without the numbers.

You can get an overall sense of what's happening if you return "score" as
an additional field, but that just gives you the result, not how it was
calculated. However, if you notice your boosting has changed the scores in
the right direction but just not enough it's an indication that bigger
boosts may help.

And do note that boosting _influences_ the score; it'll never guarantee an
absolute ordering where all docs with the terms in the title appear
before any doc where the terms appear only in the content.

It'll also be easier to read if you output it structured, see:
https://wiki.apache.org/solr/CommonQueryParameters#debug.explain.structured


Best,
Erick

On Fri, Jan 13, 2017 at 7:10 AM, Tom Chiverton  wrote:

> I have a few hundred documents with title and content fields.
>
> I want a match in title to trump matches in content. If I search for
> "connected vehicle" then a news article that has that in the content
> shouldn't be ranked higher than the page with that in the title is
> essentially what I want.
>
> I have tried dismax with qf=title^2 as well as several other variants with
> the standard query parser (like q="title:"foo"^2 OR content:"foo") but
> documents without the search term in the title still come out before those
> with the term in the title when ordered by score.
>
> Is there something I am missing ?
>
> From the docs, something like q=title:"connected vehicle"^2 OR
> content:"connected vehicle" should have worked ? Even using ^100 didn't
> help.
>
> I tried with the dismax parser using
>
>   "q": "Connected Vehicle",
>   "defType": "dismax",
>   "indent": "true",
>   "qf": "title^2000 content",
>   "pf": "pf=title^4000 content^2",
>   "sort": "score desc",
>   "wt": "json",
> but that was not better. if I remove content from pf/qf then documents seem 
> to rank correctly.
>
> Example query and results (content omitted) : http://pastebin.com/5EhrRJP8
> with managed-schema http://pastebin.com/mdraWQWE
>
>


equivalent of json.facet's "gap" keyword in /sql

2017-01-13 Thread radha krishnan
Hi,

Can we write an SQL statement and use the /sql handler to get json.facet's
"gap" functionality?

Example facet query:

json.facet: {
my_histogram: {
type: range,
field: i_timestamp,
start: "2016-10-21T01:00:00Z",
end: "2016-10-21T02:00:00Z",
gap: "+1MINUTE",
mincount: 0
}
}
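For reference, the gap bucketing this facet computes can be sketched client-side like this (plain Java with made-up timestamps, not Solr API; just an illustration of start/end/gap/mincount semantics):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GapHistogram {
    // Buckets timestamps into fixed-gap ranges, mimicking the json.facet
    // range facet above: start, end, gap = +1MINUTE, mincount = 0.
    static Map<Instant, Integer> histogram(List<Instant> docs, Instant start,
                                           Instant end, Duration gap) {
        Map<Instant, Integer> buckets = new TreeMap<>();
        for (Instant b = start; b.isBefore(end); b = b.plus(gap)) {
            buckets.put(b, 0); // mincount=0: empty buckets are emitted too
        }
        for (Instant d : docs) {
            if (d.isBefore(start) || !d.isBefore(end)) continue; // out of range
            long idx = Duration.between(start, d).toMillis() / gap.toMillis();
            buckets.merge(start.plus(gap.multipliedBy(idx)), 1, Integer::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2016-10-21T01:00:00Z");
        Instant end = Instant.parse("2016-10-21T02:00:00Z");
        Map<Instant, Integer> h = histogram(
                List.of(Instant.parse("2016-10-21T01:00:30Z"),
                        Instant.parse("2016-10-21T01:05:10Z")),
                start, end, Duration.ofMinutes(1));
        System.out.println(h.size());     // 60 one-minute buckets
        System.out.println(h.get(start)); // 1
    }
}
```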


Thanks,
Radhakrishnan


Re: regarding extending classes in org.apache.solr.client.solrj.io.stream.metrics package

2017-01-13 Thread radha krishnan
Hi Scott,

I have created a JIRA ticket
(https://issues.apache.org/jira/browse/SOLR-9962). I will figure out the
patch process.


Thanks,
Radhakrishnan D


On Thu, Jan 12, 2017 at 8:57 AM, Scott Stults <
sstu...@opensourceconnections.com> wrote:

> Radhakrishnan,
>
> That would be an appropriate Jira ticket. You can submit it here:
>
> https://issues.apache.org/jira/browse/solr
>
> Also, if you want to submit a patch, check out the guidelines (it's pretty
> easy):
>
> https://wiki.apache.org/solr/HowToContribute
>
>
> k/r,
> Scott
>
>
> On Tue, Jan 10, 2017 at 7:12 PM, radha krishnan <
> dradhakrishna...@gmail.com>
> wrote:
>
> >  Hi,
> >
> > I want to extend the update(Tuple tuple) method in the MaxMetric, MinMetric,
> > SumMetric and MeanMetric classes.
> >
> > Can you please make the below mentioned variables and methods in the above
> > mentioned classes protected, so that they are easy to extend?
> >
> > variables
> > ---
> >
> > longMax
> >
> > doubleMax
> >
> > columnName
> >
> >
> > and
> >
> > methods
> >
> > ---
> >
> > init
> >
> >
> >
> > Thanks,
> >
> > Radhakrishnan D
> >
>
>
>
> --
> Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
> | 434.409.2780
> http://www.opensourceconnections.com
>



-- 
D.Radhakrishnan
B.E Computer Science and Engineering
Madras Institute of Technology
Anna University
When GOD is with us.. who can be against us ??


Trouble boosting a field

2017-01-13 Thread Tom Chiverton

I have a few hundred documents with title and content fields.

Essentially, I want a match in title to trump matches in content: if I 
search for "connected vehicle", a news article that has that phrase only in 
the content shouldn't be ranked higher than a page that has it in the 
title.


I have tried dismax with qf=title^2 as well as several other variants 
with the standard query parser (like q="title:"foo"^2 OR content:"foo") 
but documents without the search term in the title still come out before 
those with the term in the title when ordered by score.


Is there something I am missing?

From the docs, something like q=title:"connected vehicle"^2 OR 
content:"connected vehicle" should have worked ? Even using ^100 didn't 
help.


I tried with the dismax parser using

  "q": "Connected Vehicle",
  "defType": "dismax",
  "indent": "true",
  "qf": "title^2000 content",
  "pf": "pf=title^4000 content^2",
  "sort": "score desc",
  "wt": "json",

but that was not better. If I remove content from pf/qf then documents seem 
to rank correctly.


Example query and results (content omitted) : 
http://pastebin.com/5EhrRJP8 with managed-schema 
http://pastebin.com/mdraWQWE


--
*Tom Chiverton*
Lead Developer
e:  t...@extravision.com 
p:  0161 817 2922
t:  @extravision 
w:  www.extravision.com 





AW: AW: FacetField-Result on String-Field contains value with count 0?

2017-01-13 Thread Sebastian Riemer
Thanks @Toke for pointing out these options. I'll have a read about 
expungeDeletes.

It sounds even more like having Solr filter out 0-counts is a good idea and I 
should handle my use case outside of Solr.

Thanks again,
Sebastian

On Fri, 2017-01-13 at 14:19 +, Sebastian Riemer wrote:
> the second search should have been this:
> http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%221%22=on=*:*=0=0=json
> (or in other words, give me all documents having value "1" for field
> "m_mediaType_s")
> 
> Since this search gives zero results, why is it included in the 
> facet.fields result-count list?

Qualified guess (I don't know the JSON faceting code in detail):
The list of possible facet values is extracted from the DocValues structure in 
the segment files, without respect to documents marked as deleted. At some 
point you had one or more documents with m_mediaType_s:1, which were later 
deleted.

If your index is not too large, you can verify this by optimizing down to 1 
segment, which will remove all traces of deleted documents (unless the index is 
already 1 segment).

If you cannot live with the false terms, committing with expungeDeletes=true 
should do the trick, although it is likely to make your indexing process a lot 
heavier.

The reason for this inaccuracy is that it is quite heavy to verify whether a 
docvalue is referenced by a document: Each time one or more documents in a 
segment are deleted, all references from all documents in that segment would 
have to be checked to create a correct mapping.
As this only affects mincount=0 combined with your use case where _all_ 
documents with a certain docvalue are deleted, my guess is that it is seen as 
too much of an edge case to handle.
--
Toke Eskildsen, Royal Danish Library



AW: FacetField-Result on String-Field contains value with count 0?

2017-01-13 Thread Sebastian Riemer
Nice, thank you very much for your explanation!

>> Solr returns all fields as facet results where there was some value at 
some time, as long as the documents are somewhere in the index, even when 
they're marked as deleted. So there must have been a document with 
m_mediaType_s=1. Even if all these documents are deleted already, its values 
still appear in the facet result.

I did not know about that! That makes perfect sense. I am quite sure there has 
been a time when that field contained the value "1". What's more, now that I 
have rebuilt my index, the value "1" is no longer present in the facet.field result.

I'll think about how to deal with my situation then, maybe it would be better 
to keep solr filtering out 0-count facet-fields and insert the filterquery 
leading to 0 results into the select-dropdown "manually".
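A minimal sketch of that client-side workaround (plain Java; the alternating value/count list mirrors the facet_fields response shown earlier in this thread, and the method names are made up):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FacetDropdown {
    // Solr returns facet counts as an alternating [value, count, ...] list.
    // Build the dropdown from it, and force the currently selected filter
    // value in with count 0 if Solr (facet.mincount=1) filtered it out.
    static Map<String, Integer> dropdownEntries(List<Object> facetField, String selected) {
        LinkedHashMap<String, Integer> entries = new LinkedHashMap<>();
        for (int i = 0; i + 1 < facetField.size(); i += 2) {
            entries.put((String) facetField.get(i), (Integer) facetField.get(i + 1));
        }
        entries.putIfAbsent(selected, 0); // keep the active filter visible
        return entries;
    }

    public static void main(String[] args) {
        List<Object> facet = List.of("2", 25561, "3", 19027); // mincount=1 output
        System.out.println(dropdownEntries(facet, "1")); // {2=25561, 3=19027, 1=0}
    }
}
```

This way Solr keeps mincount=1 (no false terms from deleted documents), and the UI still shows the active "book" filter with a zero count.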

-Ursprüngliche Nachricht-
Von: Michael Kuhlmann [mailto:k...@solr.info] 
Gesendet: Freitag, 13. Januar 2017 15:43
An: solr-user@lucene.apache.org
Betreff: Re: FacetField-Result on String-Field contains value with count 0?

Then I don't understand your problem. Solr already does exactly what you want.

Maybe the problem is different: I assume that there never was a value of "1" in 
the index, leading to your confusion.

Solr returns all fields as facet results where there was some value at some 
time, as long as the documents are somewhere in the index, even when they're 
marked as deleted. So there must have been a document with m_mediaType_s=1. 
Even if all these documents are deleted already, its values still appear in the 
facet result.

This holds true until segments get merged so that all deleted documents are 
pruned. So if you send a forceMerge request, chances are good that "1" won't 
come up any more.

-Michael

Am 13.01.2017 um 15:36 schrieb Sebastian Riemer:
> Hi Bill,
>
> Thanks, that's actually where I come from. But I don't want to exclude values 
> leading to a count of zero.
>
> Background to this: A user searched for mediaType "book" which gave him 10 
> results. Now some other task/routine whatever changes all those 10 books to 
> be say 10 ebooks, because the type has been incorrect. The user makes a 
> refresh, still looking for "book" gets 0 results (which is expected) and 
> because we rule out facet.fields having count 0, I don't get back the 
> selected mediaType "book" and thus I cannot select this value in the 
> select-dropdown-filter for the mediaType. This leads to confusion for the 
> user, since he has no results, but doesn't see that it's because he still 
> has that mediaType-filter set to the value "books", which now actually leads to 
> 0 results.
>
> -Ursprüngliche Nachricht-
> Von: billnb...@gmail.com [mailto:billnb...@gmail.com]
> Gesendet: Freitag, 13. Januar 2017 15:23
> An: solr-user@lucene.apache.org
> Betreff: Re: AW: FacetField-Result on String-Field contains value with count 
> 0?
>
> Set mincount to 1
>
> Bill Bell
> Sent from mobile
>
>
>> On Jan 13, 2017, at 7:19 AM, Sebastian Riemer  wrote:
>>
>> Pardon me,
>> the second search should have been this: 
>> http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%221%22=on=*:*=0=0=json
>> (or in other words, give me all 
>> documents having value "1" for field "m_mediaType_s")
>>
>> Since this search gives zero results, why is it included in the facet.fields 
>> result-count list?
>>
>> 
>>
>> Hi,
>>
>> Please help me understand: 
>> http://localhost:8983/solr/wemi/select?facet.field=m_mediaType_s=on=on=*:*=json
>>  returns:
>>
>> "facet_counts":{
>>"facet_queries":{},
>>"facet_fields":{
>>  "m_mediaType_s":[
>>"2",25561,
>>"3",19027,
>>"10",1966,
>>"11",1705,
>>"12",1067,
>>"4",1056,
>>"5",291,
>>"8",68,
>>"13",2,
>>"6",2,
>>"7",1,
>>"9",1,
>>"1",0]},
>>"facet_ranges":{},
>>"facet_intervals":{},
>>"facet_heatmaps":{}}}
>>
>> http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%222%22=on=*:*=0=0=json
>>
>>
>> ?  "response":{"numFound":25561,"start":0,"docs":[]
>>
>> http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%220%22=on=*:*=0=0=json
>>
>>
>> ?  "response":{"numFound":0,"start":0,"docs":[]
>>
>> So why does the search for facet.field even contain the value "1", if it 
>> does not exist?
>>
>> And why does it e.g. not contain
>> "SomeReallyCrazyOtherValueWhichLikeValue"1"DoesNotExistButLetsInclude
>> I tInTheFacetFieldsResultListAnywaysWithCountZero" : 0
>>
>> Best regards,
>> Sebastian
>>
>> Additional info, field m_mediaType_s is a string;
>> > stored="true" />
>> > />
>>



Re: AW: FacetField-Result on String-Field contains value with count 0?

2017-01-13 Thread Toke Eskildsen
On Fri, 2017-01-13 at 14:19 +, Sebastian Riemer wrote:
> the second search should have been this:
> http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%221%22=on=*:*=0=0=json
> (or in other words, give me all documents having value "1" for field
> "m_mediaType_s")
> 
> Since this search gives zero results, why is it included in the
> facet.fields result-count list?

Qualified guess (I don't know the JSON faceting code in detail):
The list of possible facet values is extracted from the DocValues
structure in the segment files, without respect to documents marked as
deleted. At some point you had one or more documents with
m_mediaType_s:1, which were later deleted.

If your index is not too large, you can verify this by optimizing down
to 1 segment, which will remove all traces of deleted documents (unless
the index is already 1 segment).

If you cannot live with the false terms, committing with
expungeDeletes=true should do the trick, although it is likely to make
your indexing process a lot heavier.
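For reference, such a commit can be sent through the update handler as a standard update command (the collection name below is just a placeholder):

```xml
<!-- POST this to http://localhost:8983/solr/<collection>/update -->
<commit expungeDeletes="true"/>
```

The same flag can also be passed as a request parameter on an update request; either way, expect the heavier merge activity mentioned above.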

The reason for this inaccuracy is that it is quite heavy to verify
whether a docvalue is referenced by a document: Each time one or more
documents in a segment are deleted, all references from all documents
in that segment would have to be checked to create a correct mapping.
As this only affects mincount=0 combined with your use case where
_all_ documents with a certain docvalue are deleted, my guess is that
it is seen as too much of an edge case to handle.
-- 
Toke Eskildsen, Royal Danish Library



Re: FacetField-Result on String-Field contains value with count 0?

2017-01-13 Thread Michael Kuhlmann
Then I don't understand your problem. Solr already does exactly what you
want.

Maybe the problem is different: I assume that there never was a value of
"1" in the index, leading to your confusion.

Solr returns all fields as facet results where there was some value at
some time, as long as the documents are somewhere in the index, even
when they're marked as deleted. So there must have been a document with
m_mediaType_s=1. Even if all these documents are deleted already, its
values still appear in the facet result.

This holds true until segments get merged so that all deleted documents
are pruned. So if you send a forceMerge request, chances are good that
"1" won't come up any more.

-Michael

Am 13.01.2017 um 15:36 schrieb Sebastian Riemer:
> Hi Bill,
>
> Thanks, that's actually where I come from. But I don't want to exclude values 
> leading to a count of zero.
>
> Background to this: A user searched for mediaType "book" which gave him 10 
> results. Now some other task/routine whatever changes all those 10 books to 
> be say 10 ebooks, because the type has been incorrect. The user makes a 
> refresh, still looking for "book" gets 0 results (which is expected) and 
> because we rule out facet.fields having count 0, I don't get back the 
> selected mediaType "book" and thus I cannot select this value in the 
> select-dropdown-filter for the mediaType. This leads to confusion for the 
> user, since he has no results, but doesn't see that it's because he still 
> has that mediaType-filter set to the value "books", which now actually leads to 
> 0 results.
>
> -Ursprüngliche Nachricht-
> Von: billnb...@gmail.com [mailto:billnb...@gmail.com] 
> Gesendet: Freitag, 13. Januar 2017 15:23
> An: solr-user@lucene.apache.org
> Betreff: Re: AW: FacetField-Result on String-Field contains value with count 
> 0?
>
> Set mincount to 1
>
> Bill Bell
> Sent from mobile
>
>
>> On Jan 13, 2017, at 7:19 AM, Sebastian Riemer  wrote:
>>
>> Pardon me,
>> the second search should have been this: 
>> http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%221%22=on=*:*=0=0=json
>> (or in other words, give me all 
>> documents having value "1" for field "m_mediaType_s")
>>
>> Since this search gives zero results, why is it included in the facet.fields 
>> result-count list?
>>
>> 
>>
>> Hi,
>>
>> Please help me understand: 
>> http://localhost:8983/solr/wemi/select?facet.field=m_mediaType_s=on=on=*:*=json
>>  returns:
>>
>> "facet_counts":{
>>"facet_queries":{},
>>"facet_fields":{
>>  "m_mediaType_s":[
>>"2",25561,
>>"3",19027,
>>"10",1966,
>>"11",1705,
>>"12",1067,
>>"4",1056,
>>"5",291,
>>"8",68,
>>"13",2,
>>"6",2,
>>"7",1,
>>"9",1,
>>"1",0]},
>>"facet_ranges":{},
>>"facet_intervals":{},
>>"facet_heatmaps":{}}}
>>
>> http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%222%22=on=*:*=0=0=json
>>
>>
>> ?  "response":{"numFound":25561,"start":0,"docs":[]
>>
>> http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%220%22=on=*:*=0=0=json
>>
>>
>> ?  "response":{"numFound":0,"start":0,"docs":[]
>>
>> So why does the search for facet.field even contain the value "1", if it 
>> does not exist?
>>
>> And why does it e.g. not contain 
>> "SomeReallyCrazyOtherValueWhichLikeValue"1"DoesNotExistButLetsIncludeI
>> tInTheFacetFieldsResultListAnywaysWithCountZero" : 0
>>
>> Best regards,
>> Sebastian
>>
>> Additional info, field m_mediaType_s is a string;
>> > stored="true" />
>> > />
>>



AW: AW: FacetField-Result on String-Field contains value with count 0?

2017-01-13 Thread Sebastian Riemer
Hi Bill,

Thanks, that's actually where I come from. But I don't want to exclude values 
leading to a count of zero.

Background to this: A user searched for mediaType "book" which gave him 10 
results. Now some other task/routine whatever changes all those 10 books to be 
say 10 ebooks, because the type has been incorrect. The user makes a refresh, 
still looking for "book" gets 0 results (which is expected) and because we rule 
out facet.fields having count 0, I don't get back the selected mediaType "book" 
and thus I cannot select this value in the select-dropdown-filter for the 
mediaType. This leads to confusion for the user, since he has no results, but 
doesn't see that it's because he still has that mediaType-filter set to the 
value "books", which now actually leads to 0 results.

-Ursprüngliche Nachricht-
Von: billnb...@gmail.com [mailto:billnb...@gmail.com] 
Gesendet: Freitag, 13. Januar 2017 15:23
An: solr-user@lucene.apache.org
Betreff: Re: AW: FacetField-Result on String-Field contains value with count 0?

Set mincount to 1

Bill Bell
Sent from mobile


> On Jan 13, 2017, at 7:19 AM, Sebastian Riemer  wrote:
> 
> Pardon me,
> the second search should have been this: 
> http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%221%22=on=*:*=0=0=json
> (or in other words, give me all 
> documents having value "1" for field "m_mediaType_s")
> 
> Since this search gives zero results, why is it included in the facet.fields 
> result-count list?
> 
> 
> 
> Hi,
> 
> Please help me understand: 
> http://localhost:8983/solr/wemi/select?facet.field=m_mediaType_s=on=on=*:*=json
>  returns:
> 
> "facet_counts":{
>"facet_queries":{},
>"facet_fields":{
>  "m_mediaType_s":[
>"2",25561,
>"3",19027,
>"10",1966,
>"11",1705,
>"12",1067,
>"4",1056,
>"5",291,
>"8",68,
>"13",2,
>"6",2,
>"7",1,
>"9",1,
>"1",0]},
>"facet_ranges":{},
>"facet_intervals":{},
>"facet_heatmaps":{}}}
> 
> http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%222%22=on=*:*=0=0=json
> 
> 
> ?  "response":{"numFound":25561,"start":0,"docs":[]
> 
> http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%220%22=on=*:*=0=0=json
> 
> 
> ?  "response":{"numFound":0,"start":0,"docs":[]
> 
> So why does the search for facet.field even contain the value "1", if it does 
> not exist?
> 
> And why does it e.g. not contain 
> "SomeReallyCrazyOtherValueWhichLikeValue"1"DoesNotExistButLetsIncludeI
> tInTheFacetFieldsResultListAnywaysWithCountZero" : 0
> 
> Best regards,
> Sebastian
> 
> Additional info, field m_mediaType_s is a string;
>  stored="true" />
>  />
> 


Re: AW: FacetField-Result on String-Field contains value with count 0?

2017-01-13 Thread billnbell
Set mincount to 1

Bill Bell
Sent from mobile


> On Jan 13, 2017, at 7:19 AM, Sebastian Riemer  wrote:
> 
> Pardon me, 
> the second search should have been this: 
> http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%221%22=on=*:*=0=0=json
>  
> (or in other words, give me all documents having value "1" for field 
> "m_mediaType_s")
> 
> Since this search gives zero results, why is it included in the facet.fields 
> result-count list?
> 
> 
> 
> Hi,
> 
> Please help me understand: 
> http://localhost:8983/solr/wemi/select?facet.field=m_mediaType_s=on=on=*:*=json
>  returns:
> 
> "facet_counts":{
>"facet_queries":{},
>"facet_fields":{
>  "m_mediaType_s":[
>"2",25561,
>"3",19027,
>"10",1966,
>"11",1705,
>"12",1067,
>"4",1056,
>"5",291,
>"8",68,
>"13",2,
>"6",2,
>"7",1,
>"9",1,
>"1",0]},
>"facet_ranges":{},
>"facet_intervals":{},
>"facet_heatmaps":{}}}
> 
> http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%222%22=on=*:*=0=0=json
> 
> 
> ?  "response":{"numFound":25561,"start":0,"docs":[]
> 
> http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%220%22=on=*:*=0=0=json
> 
> 
> ?  "response":{"numFound":0,"start":0,"docs":[]
> 
> So why does the search for facet.field even contain the value "1", if it does 
> not exist?
> 
> And why does it e.g. not contain 
> "SomeReallyCrazyOtherValueWhichLikeValue"1"DoesNotExistButLetsIncludeItInTheFacetFieldsResultListAnywaysWithCountZero"
>  : 0
> 
> Best regards,
> Sebastian
> 
> Additional info, field m_mediaType_s is a string;
>  stored="true" />
> 
> 


AW: FacetField-Result on String-Field contains value with count 0?

2017-01-13 Thread Sebastian Riemer
Pardon me, 
the second search should have been this: 
http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%221%22=on=*:*=0=0=json
 
(or in other words, give me all documents having value "1" for field 
"m_mediaType_s")

Since this search gives zero results, why is it included in the facet.fields 
result-count list?



Hi,

Please help me understand: 
http://localhost:8983/solr/wemi/select?facet.field=m_mediaType_s=on=on=*:*=json
 returns:

"facet_counts":{
"facet_queries":{},
"facet_fields":{
  "m_mediaType_s":[
"2",25561,
"3",19027,
"10",1966,
"11",1705,
"12",1067,
"4",1056,
"5",291,
"8",68,
"13",2,
"6",2,
"7",1,
"9",1,
"1",0]},
"facet_ranges":{},
"facet_intervals":{},
"facet_heatmaps":{}}}

http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%222%22=on=*:*=0=0=json


?  "response":{"numFound":25561,"start":0,"docs":[]

http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%220%22=on=*:*=0=0=json


?  "response":{"numFound":0,"start":0,"docs":[]

So why does the search for facet.field even contain the value "1", if it does 
not exist?

And why does it e.g. not contain 
"SomeReallyCrazyOtherValueWhichLikeValue"1"DoesNotExistButLetsIncludeItInTheFacetFieldsResultListAnywaysWithCountZero"
 : 0

Best regards,
Sebastian

Additional info, field m_mediaType_s is a string;





FacetField-Result on String-Field contains value with count 0?

2017-01-13 Thread Sebastian Riemer
Hi,

Please help me understand: 
http://localhost:8983/solr/wemi/select?facet.field=m_mediaType_s=on=on=*:*=json
 returns:

"facet_counts":{
"facet_queries":{},
"facet_fields":{
  "m_mediaType_s":[
"2",25561,
"3",19027,
"10",1966,
"11",1705,
"12",1067,
"4",1056,
"5",291,
"8",68,
"13",2,
"6",2,
"7",1,
"9",1,
"1",0]},
"facet_ranges":{},
"facet_intervals":{},
"facet_heatmaps":{}}}

http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%222%22=on=*:*=0=0=json


?  "response":{"numFound":25561,"start":0,"docs":[]

http://localhost:8983/solr/wemi/select?fq=m_mediaType_s:%220%22=on=*:*=0=0=json


?  "response":{"numFound":0,"start":0,"docs":[]

So why does the search for facet.field even contain the value "1", if it does 
not exist?

And why does it e.g. not contain 
"SomeReallyCrazyOtherValueWhichLikeValue"1"DoesNotExistButLetsIncludeItInTheFacetFieldsResultListAnywaysWithCountZero"
 : 0

Best regards,
Sebastian

Additional info, field m_mediaType_s is a string;





RE: Can't get spelling suggestions to work properly

2017-01-13 Thread jimi.hullegard
I just noticed why setting maxResultsForSuggest to a high value was not a good 
thing: now it shows spelling suggestions even on correctly spelled words.

I think what I would need is the logic of SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX, 
but with a configurable limit instead of it being hard-coded to 0, i.e. just as 
maxQueryFrequency works.

/Jimi

-Original Message-
From: jimi.hulleg...@svensktnaringsliv.se 
[mailto:jimi.hulleg...@svensktnaringsliv.se] 
Sent: Friday, January 13, 2017 5:56 PM
To: solr-user@lucene.apache.org
Subject: RE: Can't get spelling suggestions to work properly

Hi Alessandro,

Thanks for your explanation. It helped a lot. Although setting 
"spellcheck.maxResultsForSuggest" to a value higher than zero was not enough. I 
also had to set "spellcheck.alternativeTermCount". With that done, I now get 
suggestions when searching for 'mycet' (a misspelling of the Swedish word 
'mycket', that didn't return suggestions before).

Although I'm still not able to fully understand how to configure this 
properly, because with this change there are now other misspelled searches that 
no longer give suggestions. The problem here is stemming, I suspect: the main 
search fields use stemming, so in some cases one can get lots of results for 
spellings that don't exist in the index at all (or at least not in the 
spelling field). How can I configure this component so that those suggestions 
are still included? Do I need to set maxResultsForSuggest to a really high 
number, like Integer.MAX_VALUE? I feel that such a setting would defeat the 
purpose of that parameter, in a way. But I'm not sure how else to solve this.

Also, there is one other thing I wonder about the spelling suggestions that 
you might have the answer to. Is there a way to make the matching logic case 
insensitive, but the presentation case sensitive? For example, a search for 
'georg washington' now would return 'george washington' as a suggestion, but 
'Georg Washington' would be even better.

Regards
/Jimi


-Original Message-
From: alessandro.benedetti [mailto:abenede...@apache.org] 
Sent: Thursday, January 12, 2017 5:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Can't get spelling suggestions to work properly

Hi Jimi,
taking a look at the *maxQueryFrequency* param:

Your understanding is correct.

1) We don't provide misspelled suggestions if we set the param to 1 and the 
term has a doc frequency of at least 1.

2) We don't provide misspelled suggestions if the doc frequency of the term is 
greater than the max limit set.

Let us explore the code :

if (suggestMode==SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX && docfreq > 0) {
  return new SuggestWord[0];
}
// If we are working in "not in index" mode with a document frequency > 0,
// we get no misspelled corrections.

int maxDoc = ir.maxDoc();

if (maxQueryFrequency >= 1f && docfreq > maxQueryFrequency) {
  return new SuggestWord[0];
} else if (docfreq > (int) Math.ceil(maxQueryFrequency * (float)maxDoc)) {
  return new SuggestWord[0];
}
// then the MaxQueryFrequency as you correctly stated enters the game

...
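That threshold logic can be restated as a standalone sketch (my rewrite for illustration, not the actual Lucene DirectSpellChecker code): values >= 1 act as an absolute document-frequency cap, values < 1 as a fraction of maxDoc.

```java
public class MaxQueryFrequencyCheck {
    // Standalone rewrite of the two checks quoted above: suggestions are
    // suppressed when the query term's doc frequency exceeds the threshold.
    static boolean suppressSuggestions(int docfreq, float maxQueryFrequency, int maxDoc) {
        if (maxQueryFrequency >= 1f) {
            // Absolute threshold: maxQueryFrequency is a doc count.
            return docfreq > maxQueryFrequency;
        }
        // Fractional threshold: maxQueryFrequency is a fraction of maxDoc.
        return docfreq > (int) Math.ceil(maxQueryFrequency * (float) maxDoc);
    }

    public static void main(String[] args) {
        // A term in 2 docs with maxQueryFrequency=1 is considered "known": suppressed.
        System.out.println(suppressSuggestions(2, 1f, 1000));    // true
        // 0.01 of 1000 docs = 10; docfreq 5 is below that, so suggestions still come.
        System.out.println(suppressSuggestions(5, 0.01f, 1000)); // false
    }
}
```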

Let's explore how you can end up in the first scenario :

if (maxResultsForSuggest == null || hits <= maxResultsForSuggest) {
  SuggestMode suggestMode = SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX;
  if (onlyMorePopular) {
suggestMode = SuggestMode.SUGGEST_MORE_POPULAR;
  } else if (alternativeTermCount > 0) {
suggestMode = SuggestMode.SUGGEST_ALWAYS;
  }

You did not set maxResultsForSuggest (nor onlyMorePopular or 
alternativeTermCount), so you ended up with:
SuggestMode suggestMode = SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX;

From the Solr Javadoc:

If left unspecified, the default behavior will prevail.  That is, 
"correctlySpelled" will be false and suggestions
   * will be returned only if one or more of the query terms are absent from 
the dictionary and/or index.  If set to zero,
   * the "correctlySpelled" flag will be false only if the response returns 
zero hits.  If set to a value greater than zero, 
   * suggestions will be returned even if hits are returned (up to the 
specified number).  This number also will serve as
   * the threshold in determining the value of "correctlySpelled". 
Specifying a value greater than zero is useful 
   * for creating "did-you-mean" suggestions for queries that return a low 
number of hits.
   * 
   */
  public static final String SPELLCHECK_MAX_RESULTS_FOR_SUGGEST = 
SPELLCHECK_PREFIX + "maxResultsForSuggest";

You probably want to bypass the other parameters and just set the proper 
maxResultsForSuggest param for your spellchecker.

Cheers



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-t-get-spelling-suggestions-to-work-properly-tp4310079p4313685.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Can't get spelling suggestions to work properly

2017-01-13 Thread jimi.hullegard
Hi Alessandro,

Thanks for your explanation. It helped a lot. Although setting 
"spellcheck.maxResultsForSuggest" to a value higher than zero was not enough. I 
also had to set "spellcheck.alternativeTermCount". With that done, I now get 
suggestions when searching for 'mycet' (a misspelling of the Swedish word 
'mycket', that didn't return suggestions before).

Although I'm still not able to fully understand how to configure this 
properly, because with this change there are now other misspelled searches that 
no longer give suggestions. The problem here is stemming, I suspect: the main 
search fields use stemming, so in some cases one can get lots of results for 
spellings that don't exist in the index at all (or at least not in the 
spelling field). How can I configure this component so that those suggestions 
are still included? Do I need to set maxResultsForSuggest to a really high 
number, like Integer.MAX_VALUE? I feel that such a setting would defeat the 
purpose of that parameter, in a way. But I'm not sure how else to solve this.

Also, there is one other thing I wonder about the spelling suggestions that 
you might have the answer to. Is there a way to make the matching logic case 
insensitive, but the presentation case sensitive? For example, a search for 
'georg washington' now would return 'george washington' as a suggestion, but 
'Georg Washington' would be even better.

Regards
/Jimi


-Original Message-
From: alessandro.benedetti [mailto:abenede...@apache.org] 
Sent: Thursday, January 12, 2017 5:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Can't get spelling suggestions to work properly

Hi Jimi,
taking a look at the *maxQueryFrequency* param:

Your understanding is correct.

1) We don't provide misspelled suggestions if we set the param to 1 and the 
term has a doc frequency of at least 1.

2) We don't provide misspelled suggestions if the doc frequency of the term is 
greater than the max limit set.

Let us explore the code :

if (suggestMode==SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX && docfreq > 0) {
  return new SuggestWord[0];
}
// If we are working in "not in index" mode with a document frequency > 0,
// we get no misspelled corrections.

int maxDoc = ir.maxDoc();

if (maxQueryFrequency >= 1f && docfreq > maxQueryFrequency) {
  return new SuggestWord[0];
} else if (docfreq > (int) Math.ceil(maxQueryFrequency * (float)maxDoc)) {
  return new SuggestWord[0];
}
// then the MaxQueryFrequency as you correctly stated enters the game

...

Let's explore how you can end up in the first scenario :

if (maxResultsForSuggest == null || hits <= maxResultsForSuggest) {
  SuggestMode suggestMode = SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX;
  if (onlyMorePopular) {
suggestMode = SuggestMode.SUGGEST_MORE_POPULAR;
  } else if (alternativeTermCount > 0) {
suggestMode = SuggestMode.SUGGEST_ALWAYS;
  }

You did not set maxResultsForSuggest (nor onlyMorePopular or 
alternativeTermCount), so you ended up with:
SuggestMode suggestMode = SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX;

From the Solr Javadoc:

If left unspecified, the default behavior will prevail.  That is, 
"correctlySpelled" will be false and suggestions
   * will be returned only if one or more of the query terms are absent from 
the dictionary and/or index.  If set to zero,
   * the "correctlySpelled" flag will be false only if the response returns 
zero hits.  If set to a value greater than zero, 
   * suggestions will be returned even if hits are returned (up to the 
specified number).  This number also will serve as
   * the threshold in determining the value of "correctlySpelled". 
Specifying a value greater than zero is useful 
   * for creating "did-you-mean" suggestions for queries that return a low 
number of hits.
   * 
   */
  public static final String SPELLCHECK_MAX_RESULTS_FOR_SUGGEST = 
SPELLCHECK_PREFIX + "maxResultsForSuggest";

You probably want to bypass the other parameters and just set the proper 
maxResultsForSuggest param for your spellchecker.

Cheers





Re: SolrCloud different score for same document on different replicas.

2017-01-13 Thread Morten Bøgeskov
On Thu, 5 Jan 2017 16:31:35 +
Charlie Hull  wrote:

> On 05/01/2017 13:30, Morten Bøgeskov wrote:
> >
> >
> > Hi.
> >
> > We've got a SolrCloud which is sharded and has a replication factor of
> > 2.
> >
> > The 2 replicas of a shard may look like this:
> >
> > Num Docs:5401023
> > Max Doc:6388614
> > Deleted Docs:987591
> >
> >
> > Num Docs:5401023
> > Max Doc:5948122
> > Deleted Docs:547099
> >
> > We've seen >10% difference in Max Doc at times with same Num Docs.
> > Our use case is a few documents that are searched against and many small
> > ones that are filtered against (often updated multiple times a day), so
> > the difference in deleted docs isn't surprising.
> >
> > This results in a different score for a document depending on which
> > replica it comes from. As I see it: it has to do with the different
> > maxDoc value when calculating idf.
> >
> > This in turn alters a specific document's position in the search
> > result over reloads. This is quite confusing (duplicates in pagination).
> >
> > What is the trick to get homogeneous score from different replicas.
> > We've tried using ExactStatsCache & ExactSharedStatsCache, but that
> > didn't seem to make any difference.
> >
> > Any hints to this will be greatly appreciated.
> >
> 
> This was one of things we looked at during our recent Lucene London 
> Hackday (see item 3) https://github.com/flaxsearch/london-hackday-2016
> 
> I'm not sure there is a way to get a homogenous score - this patch tries 
> to keep you connected to the same replica during a session so you don't 
> see results jumping over pagination.
> 

Sorry for the late reply.

I went with a new searcher, that inherits from SearchHandler.
This hashes the query, and uses that to select replicas to put in the
shards parameter (if it's a cloud, and a distributed query where shards
isn't already set), then passes it onto the original searcher.
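A rough sketch of that hashing idea (plain Java, not the actual SearchHandler subclass; the replica URLs and the modulus are made up):

```java
import java.util.List;

public class ReplicaPinning {
    // Pick one replica per shard deterministically from the query hash, so
    // the same query always hits the same replicas and its scores stay
    // stable across page requests, while diverse queries spread the load.
    static String shardsParam(String query, List<List<String>> replicasPerShard) {
        int h = Math.abs(query.hashCode() % 997); // stable per query string
        StringBuilder shards = new StringBuilder();
        for (List<String> replicas : replicasPerShard) {
            if (shards.length() > 0) shards.append(',');
            shards.append(replicas.get(h % replicas.size()));
        }
        return shards.toString();
    }

    public static void main(String[] args) {
        List<List<String>> cloud = List.of(
                List.of("host1:8983/solr/c_shard1_r1", "host2:8983/solr/c_shard1_r2"),
                List.of("host1:8983/solr/c_shard2_r1", "host2:8983/solr/c_shard2_r2"));
        // Same query -> same shards string on every call.
        System.out.println(shardsParam("connected vehicle", cloud)
                .equals(shardsParam("connected vehicle", cloud))); // true
    }
}
```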

Given sufficiently diverse end-user queries, this gives an even load
across the cloud. It could put a skewed load on some nodes if a query
suddenly becomes very popular or you have an opening-page default query
(in our use case, quite unlikely).

Thanks for the input.
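The maxDoc-driven idf drift described in the quoted message can be sketched like this (assuming Lucene's classic TF-IDF idf formula, idf = 1 + ln(numDocs / (docFreq + 1)); the docFreq value is made up):

```java
public class IdfDrift {
    // Classic TF-IDF inverse document frequency, as in Lucene's
    // ClassicSimilarity: idf = 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long numDocs, long docFreq) {
        return 1 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        long docFreq = 100_000; // hypothetical term frequency across docs
        // maxDoc values from the two replicas quoted above (live docs + deletes).
        double replicaA = idf(6_388_614, docFreq);
        double replicaB = idf(5_948_122, docFreq);
        // Same term, same live docs, yet the idf (and hence the score) differs.
        System.out.printf("%.4f vs %.4f%n", replicaA, replicaB);
    }
}
```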


-- 
 Morten Bøgeskov