Re: Performance if there is a large number of field

2018-05-11 Thread Erick Erickson
Deepak:

I would strongly urge you to consider changing your solution so that it
does _not_ need 35,000 fields. Needing that many usually indicates that
there are much better ways of tackling the problem. As Shawn says, 35,000
fields won't make much difference for an individual search. But 35,000
fields _do_ take up metadata space; there has to be a catalog of all
the possibilities somewhere.

The question about missing fields is tricky. Consider the structure of
the inverted index. For each _field_ it looks like this:

term, doc1, doc45, doc93

So really, a doc not having the field is pretty much the same as a doc
not having a term in that field; it's just missing.

But back to your problem. Think hard about _why_ you think you need
35,000 fields. Could you tag your values instead? Say you are storing
prices for stores for some item. Instead of having a field for
store1_price, store2_price..., what about a single multiValued field
with values like store1_price_1.53, store2_price_2.35, etc.?

Or consider payloads: store1_price|1.53, store2_price|2.35, and use the
payload support to read the values back. See:
https://lucidworks.com/2017/09/14/solr-payloads/
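
For what it's worth, a rough sketch of the payload approach, assuming
Solr 6.6 or later for the payload() function; the field and type names
here are made up:

  <!-- managed-schema -->
  <fieldType name="payload_prices" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
    </analyzer>
  </fieldType>
  <field name="store_prices" type="payload_prices" indexed="true" stored="false"/>

  Index a doc with:   store_prices = "store1|1.53 store2|2.35"
  Read a price back:  fl=id,store1_price:payload(store_prices,store1)

The payload() function returns the float payload attached to the given
term, so a single field can carry all of the per-store numbers.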

I've rarely seen situations where having that many fields is an
optimal solution.

Best,
Erick

On Fri, May 11, 2018 at 12:20 PM, Shawn Heisey  wrote:
> On 5/11/2018 9:26 AM, Andy C wrote:
>> Why are range searches more efficient than wildcard searches? I guess I
> >> would have expected that they just provide different mechanisms for defining
>> the range of unique terms that are of interest, and that the merge
>> processing would be identical.
>
> I hope I can explain the reason that wildcard queries tend to be slow.
> I will use an example field from one of my own indexes.
>
> Choosing one of the shards of my main index, and focusing on the
> "keywords" field for that Solr core:  Here's the histogram data that the
> Luke handler gives for this field:
>
>   "histogram":[
> "1",14095268,
> "2",76,
> "4",425610,
> "8",312156,
> "16",236743,
> "32",177718,
> "64",122603,
> "128",80513,
> "256",52746,
> "512",34925,
> "1024",24770,
> "2048",17516,
> "4096",11467,
> "8192",7748,
> "16384",5210,
> "32768",3433,
> "65536",2164,
> "131072",1280,
> "262144",688,
> "524288",355,
> "1048576",163,
> "2097152",53,
> "4194304",12]}},
>
>
> The first entry means that there are 14 million terms that only appear
> once in the keywords field across the whole index. The last entry means
> that there are twelve terms that appear 4 million times in the keywords
> field across the whole index.
>
> Adding this all up, I can see that there are a little more than 16
> million unique terms in this field.
>
> This means that when I do a "keywords:*" query, that Solr/Lucene will
> expand this query such that the query literally contains 16 million
> individual terms.  It's going to take time just to make the query.  And
> then that query will have to be executed.  No matter how quickly each
> term in the query executes, doing 16 million of them is going to be slow.
>
> Just for giggles, I used my dev server to execute that "keywords:*"
> query on this single shard.  The reported QTime in the response was
> 18017 milliseconds.  Then I ran the full range query.  The reported
> QTime for that was 14569 milliseconds.  Which is honestly slower than I
> thought it would be, but faster than the wildcard.  The number of unique
> terms in the field affects both kinds of queries, but the effect of a
> large number of terms on the wildcard is usually greater than the effect
> on the range.
>
>> Would a search such as:
>>
>> field:c*
>>
>> be more efficient if rewritten as:
>>
>> field:[c TO d}
>
> On most indexes, probably.  That would depend on the number of terms in
> the field, I think.  But there's something to consider:  Not every
> wildcard query can be easily rewritten as a range.  I think this one is
> impossible to rewrite as a range:  field:abc*xyz
>
> I tried your c* example as well on my keywords field.  The wildcard had
> a QTime of 1702 milliseconds.  The range query had a QTime of 1434
> milliseconds.  The numFound on both queries was identical, at 16399711.
>
> Thanks,
> Shawn
>


Re: FilterQueries with NRT

2018-05-11 Thread Erick Erickson
bq. does it make even sense to cache anything

In a word, "no". Now only would I set the cache entry size to zero,
I'd form my filer queries with {!cache=false}... There's no particular
point in computing the entire cache entry in this case, possibly even
with a cost=101. See:
http://yonik.com/advanced-filter-caching-in-solr/.
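
For anyone following along, a minimal sketch of what that looks like in
practice (the field and range are illustrative, not from the original post):

  fq={!cache=false cost=101}price:[10 TO 100]

  <!-- solrconfig.xml: effectively disable the filterCache -->
  <filterCache class="solr.FastLRUCache" size="0" initialSize="0" autowarmCount="0"/>

With cache=false the entry is never stored, and a cost of 100 or more lets
query types that support post-filtering (frange, for example) run after the
cheaper clauses.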

And I'll spare you the rant about 500 ms being unreasonable as you
seem to already get that ;)

Best,
Erick

On Fri, May 11, 2018 at 12:46 PM, root23  wrote:
> Hi all,
> We have a requirement for NRT search. Our autoSoftCommit time is set to 500
> ms (I know it's low, but that's another story). We use filter queries
> extensively for most of our queries.
>
> But I am trying to understand how filter query caching works with NRT.
> As I understand it, we use fq for clauses that are common across most of
> our searches, so that they get cached and reused across multiple queries.
> However, the cache is only good for the lifetime of an indexSearcher.
> When we do a soft commit, a new searcher opens and creates its own cache,
> and if autowarmCount is set to 0, it essentially starts with an empty
> filter cache.
> Since the autoSoftCommit time is 500 ms, in theory we are opening a new
> searcher every 0.5 seconds.
>
> So in this case, does it even make sense to cache anything if we are going
> to throw it away 0.5 seconds later and start again?
> Does adding stuff to fq give us anything in this particular scenario?
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: ZKPropertiesWriter Could not read DIH properties

2018-05-11 Thread Shawn Heisey
On 5/11/2018 11:20 AM, tayitu wrote:
> I am using Solr 6.6.0.  I have created a collection and uploaded the config
> files to ZooKeeper.  I can see the collection and config files from the Solr
> Admin UI.  When I try to run a dataimport, I get the following error:
>
> ZKPropertiesWriter Could not read DIH properties from /configs/collection
> name/dataimport.properties :class
> org.apache.zookeeper.KeeperException$NoNodeException 

This message means /configs/collectionname/dataimport.properties does
not exist within the ZooKeeper database.  I know this because of the
KeeperException$NoNodeException part of the message.

Best guess about what's happening: You're trying to do a delta-import
without ever running a full-import.  That properties file is created or
updated whenever an import succeeds.  I think that a delta-import cannot
run if the properties file doesn't exist, so before your first
delta-import, a full-import must be successfully completed.  Once that's
done, all the rest can be delta if that fits your needs.
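
In case it helps, a rough sequence of DIH calls (collection name and handler
path are placeholders; the handler path depends on how DIH is registered in
solrconfig.xml):

  curl "http://localhost:8983/solr/mycollection/dataimport?command=full-import&clean=true&commit=true"
  curl "http://localhost:8983/solr/mycollection/dataimport?command=status"
  curl "http://localhost:8983/solr/mycollection/dataimport?command=delta-import&commit=true"

Once the first full-import finishes successfully, dataimport.properties should
appear under the collection's config in ZooKeeper and delta-imports can take
over.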

Thanks,
Shawn



FilterQueries with NRT

2018-05-11 Thread root23
Hi all,
We have a requirement for NRT search. Our autoSoftCommit time is set to 500
ms (I know it's low, but that's another story). We use filter queries
extensively for most of our queries.

But I am trying to understand how filter query caching works with NRT.
As I understand it, we use fq for clauses that are common across most of
our searches, so that they get cached and reused across multiple queries.
However, the cache is only good for the lifetime of an indexSearcher.
When we do a soft commit, a new searcher opens and creates its own cache,
and if autowarmCount is set to 0, it essentially starts with an empty
filter cache.
Since the autoSoftCommit time is 500 ms, in theory we are opening a new
searcher every 0.5 seconds.

So in this case, does it even make sense to cache anything if we are going
to throw it away 0.5 seconds later and start again?
Does adding stuff to fq give us anything in this particular scenario?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Must clause with filter queries

2018-05-11 Thread root23
Hey Shawn, I tried debugging the actual Solr code locally with the
following two different forms of frange, to see if Solr is somehow
parsing it wrong. But the parsed query that gets put in the filter
query is pretty much the same.

query1: +_val_:{!frange cost=200 l=30 u=100 incl=true incu=false}price
The above gets parsed into the following:
+ConstantScore(frange(float(price)):[30 TO 100})

query2: {!frange cost=200 l=30 u=100 incl=true incu=false}price
This gets parsed into the following:
ConstantScore(frange(float(price)):[30 TO 100})

As you can see, the only difference is the leading + (MUST clause). But
since this is a single clause, I assume it doesn't make any difference.


I saw that in the QParser.getParser() method there is a check for local
params, as shown below:

if (allowLocalParams && qstr != null &&
    qstr.startsWith(QueryParsing.LOCALPARAM_START))

But it seems like Solr eventually figures out how to parse it correctly.
Not sure if I am missing something.
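
For reference, a quick way to compare how both forms are parsed without
stepping through the code is to send them as filter queries with debugging
enabled (a sketch; collection and handler details omitted):

  q=*:*
  fq={!frange cost=200 l=30 u=100 incl=true incu=false}price
  debug=query

The debug section of the response lists filter_queries and
parsed_filter_queries, which is where ConstantScore(frange(...)) strings
like the ones above come from.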




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Performance if there is a large number of field

2018-05-11 Thread Shawn Heisey
On 5/11/2018 9:26 AM, Andy C wrote:
> Why are range searches more efficient than wildcard searches? I guess I
> would have expected that they just provide different mechanisms for defining
> the range of unique terms that are of interest, and that the merge
> processing would be identical.

I hope I can explain the reason that wildcard queries tend to be slow. 
I will use an example field from one of my own indexes.

Choosing one of the shards of my main index, and focusing on the
"keywords" field for that Solr core:  Here's the histogram data that the
Luke handler gives for this field:

  "histogram":[
"1",14095268,
"2",76,
"4",425610,
"8",312156,
"16",236743,
"32",177718,
"64",122603,
"128",80513,
"256",52746,
"512",34925,
"1024",24770,
"2048",17516,
"4096",11467,
"8192",7748,
"16384",5210,
"32768",3433,
"65536",2164,
"131072",1280,
"262144",688,
"524288",355,
"1048576",163,
"2097152",53,
"4194304",12]}},


The first entry means that there are 14 million terms that only appear
once in the keywords field across the whole index. The last entry means
that there are twelve terms that appear 4 million times in the keywords
field across the whole index.

Adding this all up, I can see that there are a little more than 16
million unique terms in this field.
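
For anyone who wants to reproduce this, the histogram above comes from the
Luke request handler; something like this (core name is a placeholder):

  curl "http://localhost:8983/solr/corename/admin/luke?fl=keywords&wt=json"

The per-field section of the response includes the term histogram along with
the field's type and flags.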

This means that when I do a "keywords:*" query, that Solr/Lucene will
expand this query such that the query literally contains 16 million
individual terms.  It's going to take time just to make the query.  And
then that query will have to be executed.  No matter how quickly each
term in the query executes, doing 16 million of them is going to be slow.

Just for giggles, I used my dev server to execute that "keywords:*"
query on this single shard.  The reported QTime in the response was
18017 milliseconds.  Then I ran the full range query.  The reported
QTime for that was 14569 milliseconds.  Which is honestly slower than I
thought it would be, but faster than the wildcard.  The number of unique
terms in the field affects both kinds of queries, but the effect of a
large number of terms on the wildcard is usually greater than the effect
on the range.

> Would a search such as:
>
> field:c*
>
> be more efficient if rewritten as:
>
> field:[c TO d}

On most indexes, probably.  That would depend on the number of terms in
the field, I think.  But there's something to consider:  Not every
wildcard query can be easily rewritten as a range.  I think this one is
impossible to rewrite as a range:  field:abc*xyz
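
As a rule of thumb, a pure prefix query can be rewritten as a range by taking
the prefix as the lower bound and incrementing its last character for the
exclusive upper bound, for example:

  field:abc*   ->   field:[abc TO abd}
  field:sol*   ->   field:[sol TO som}

That trick only works when the wildcard is a single trailing *, which is why
something like field:abc*xyz has no range equivalent.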

I tried your c* example as well on my keywords field.  The wildcard had
a QTime of 1702 milliseconds.  The range query had a QTime of 1434
milliseconds.  The numFound on both queries was identical, at 16399711.

Thanks,
Shawn



ZKPropertiesWriter Could not read DIH properties

2018-05-11 Thread tayitu
I am using Solr 6.6.0.  I have created a collection and uploaded the config
files to ZooKeeper.  I can see the collection and config files from the Solr
Admin UI.  When I try to run a dataimport, I get the following error:

ZKPropertiesWriter Could not read DIH properties from /configs/collection
name/dataimport.properties :class
org.apache.zookeeper.KeeperException$NoNodeException 

Appreciate any help you can provide.  Thank you!



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Facet Range with Stats

2018-05-11 Thread Jim Freeby
Correction. The solution below did not quite get what we need.
I need the stats reports for the range.
I'll keep digging on this one.

On Friday, May 11, 2018 10:59:45 AM PDT, Jim Freeby wrote:
 
I found a solution.
If I use tags for the facet range definition and the stats definition, I can
include them in the facet pivot:

stats=true
stats.field={!tag=piv1 percentiles='50'}price
facet=true
facet.range={!tag=r1}someDate
f.someDate.facet.range.start=2018-01-01T00:00:00Z
f.someDate.facet.range.end=2018-04-30T00:00:00Z
f.someDate.facet.range.gap=+1MONTH
facet.pivot={!stats=piv1 range=r1}category

Please let me know if there's a better way to achieve this.
Cheers,

Jim

On Friday, May 11, 2018 09:23:39 AM PDT, Jim Freeby wrote:
 
 All,
I'd like to generate stats for the results of a facet range.
For example, calculate the mean sold price over a range of months.
Does anyone know how to do this?
This Jira issue seems to indicate it's not yet possible:
[SOLR-6352] Let Stats Hang off of Range Facets - ASF JIRA


Thanks,  

Jim    

Re: Facet Range with Stats

2018-05-11 Thread Jim Freeby
I found a solution.
If I use tags for the facet range definition and the stats definition, I can
include them in the facet pivot:

stats=true
stats.field={!tag=piv1 percentiles='50'}price
facet=true
facet.range={!tag=r1}someDate
f.someDate.facet.range.start=2018-01-01T00:00:00Z
f.someDate.facet.range.end=2018-04-30T00:00:00Z
f.someDate.facet.range.gap=+1MONTH
facet.pivot={!stats=piv1 range=r1}category

Please let me know if there's a better way to achieve this.
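
Possibly a cleaner alternative (a sketch only, using the same field names as
above and assuming the JSON Facet API available in recent Solr releases):

  json.facet={
    by_month: {
      type: range,
      field: someDate,
      start: "2018-01-01T00:00:00Z",
      end:   "2018-04-30T00:00:00Z",
      gap:   "+1MONTH",
      facet: {
        mean_price:   "avg(price)",
        median_price: "percentile(price,50)"
      }
    }
  }

The JSON Facet API lets aggregations hang directly off a range facet, which is
essentially what SOLR-6352 asks for in the old stats component.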
Cheers,

Jim

On Friday, May 11, 2018 09:23:39 AM PDT, Jim Freeby wrote:
 
 All,
I'd like to generate stats for the results of a facet range.
For example, calculate the mean sold price over a range of months.
Does anyone know how to do this?
This Jira issue seems to indicate it's not yet possible:
[SOLR-6352] Let Stats Hang off of Range Facets - ASF JIRA


Thanks,  

Jim  

Re: Performance if there is a large number of field

2018-05-11 Thread Deepak Goel
Deepak
"The greatness of a nation can be judged by the way its animals are
treated. Please stop cruelty to Animals, become a Vegan"

+91 73500 12833
deic...@gmail.com

Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool

"Plant a Tree, Go Green"

Make In India : http://www.makeinindia.com/home

On Fri, May 11, 2018 at 8:15 PM, Shawn Heisey  wrote:

> On 5/10/2018 2:22 PM, Deepak Goel wrote:
>
> >> Are there any benchmarks for this approach? If not, I can give it a spin.
> >> Also wondering if there are any alternative approaches (I guess Lucene
> >> stores data in an inverted index format).
>>
>
> Here is the only other query I know of that can find documents missing a
> field:
>
> q=*:* -field:*
>
> The potential problem with this query is that it uses a wildcard.  On
> non-point fields with very low cardinality, the performance might be
> similar.  But if the field is a Point type, or has a large number of unique
> values, then performance would be a lot worse than the range query I
> mentioned before.  The range query is the best general purpose option.
>
>
I wonder if giving a default value would help. Since Lucene stores all the
document IDs which contain the default value (not changed by the user) in a
single block (inverted index format), this could be retrieved much faster.
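
If that route is worth exploring, a default can be declared in the schema (a
sketch; the field name and type are illustrative):

  <field name="price" type="float" indexed="true" stored="true" default="0.0"/>

Documents indexed without the field then get the default value, so the
"missing field" case turns into an ordinary value lookup.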


> The *:* query, despite appearances, does not use wildcards.  It is special
> query syntax.
>
> Thanks,
> Shawn
>
>


Facet Range with Stats

2018-05-11 Thread Jim Freeby
All,
I'd like to generate stats for the results of a facet range.
For example, calculate the mean sold price over a range of months.
Does anyone know how to do this?
This Jira issue seems to indicate it's not yet possible:
[SOLR-6352] Let Stats Hang off of Range Facets - ASF JIRA


Thanks,  

Jim


Re: Performance if there is a large number of field

2018-05-11 Thread Andy C
Shawn,

Why are range searches more efficient than wildcard searches? I guess I
would have expected that they just provide different mechanisms for defining
the range of unique terms that are of interest, and that the merge
processing would be identical.

Would a search such as:

field:c*

be more efficient if rewritten as:

field:[c TO d}

then?

On Fri, May 11, 2018 at 10:45 AM, Shawn Heisey  wrote:

> On 5/10/2018 2:22 PM, Deepak Goel wrote:
>
>> Are there any benchmarks for this approach? If not, I can give it a spin.
>> Also wondering if there are any alternative approaches (I guess Lucene
>> stores data in an inverted index format).
>>
>
> Here is the only other query I know of that can find documents missing a
> field:
>
> q=*:* -field:*
>
> The potential problem with this query is that it uses a wildcard.  On
> non-point fields with very low cardinality, the performance might be
> similar.  But if the field is a Point type, or has a large number of unique
> values, then performance would be a lot worse than the range query I
> mentioned before.  The range query is the best general purpose option.
>
> The *:* query, despite appearances, does not use wildcards.  It is special
> query syntax.
>
> Thanks,
> Shawn
>
>


Re: Solr soft commits

2018-05-11 Thread Shawn Heisey

On 5/10/2018 8:28 PM, Shivam Omar wrote:

Thanks Shawn. So there are cases when a soft commit will not be faster than a
hard commit with openSearcher=true. We have a case where we have to do bulk
deletions; in that case, will a soft commit be faster than hard commits?


I actually have no idea whether deletions get put in memory by the 
NRTCachingDirectory or not.  If they don't, then soft commits with 
deletes would have no performance advantages over hard commits.  
Somebody who knows the Lucene code REALLY well will need to comment here.



Does it mean that, once the memory threshold is crossed, soft commits will
lead Lucene to flush data to disk as in a hard commit? Also, does a soft
commit have a higher query-time performance cost than a hard commit?


If the machine has enough memory to effectively cache the index, then a 
query after a hard commit should be just as fast as a query after a soft 
commit.  When Solr must actually read the disk to process a query, 
that's when things get slow.  If the machine has enough memory (not 
assigned to any program) for effective disk caching, then the data it 
needs to process a query will be in memory regardless of what kind of 
commit is done.
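
For context, the commit cadence being discussed is controlled in
solrconfig.xml along these lines (the times here are illustrative, not a
recommendation):

  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>5000</maxTime>
  </autoSoftCommit>

Hard commits flush to disk and (with openSearcher=false) leave the current
searcher alone; soft commits open a new searcher so changes become visible.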


Thanks,
Shawn



Re: Performance if there is a large number of field

2018-05-11 Thread Shawn Heisey

On 5/10/2018 2:22 PM, Deepak Goel wrote:

Are there any benchmarks for this approach? If not, I can give it a spin.
Also wondering if there are any alternative approaches (I guess Lucene stores
data in an inverted index format).


Here is the only other query I know of that can find documents missing a 
field:


q=*:* -field:*

The potential problem with this query is that it uses a wildcard.  On 
non-point fields with very low cardinality, the performance might be 
similar.  But if the field is a Point type, or has a large number of 
unique values, then performance would be a lot worse than the range 
query I mentioned before.  The range query is the best general purpose 
option.
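
To make the two options concrete (the field name is a placeholder), both of
these find documents that have no value in the field:

  q=*:* -field:*              wildcard form
  q=*:* -field:[* TO *]       range form over all values

The [* TO *] form is the usual range idiom for "field has any value" and
avoids enumerating every term the way the wildcard form can.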


The *:* query, despite appearances, does not use wildcards.  It is 
special query syntax.


Thanks,
Shawn



Issue in Authentication in solr 7.3

2018-05-11 Thread Prashant Thorat
*security.json*

{
"authentication":{
   "class":"solr.BasicAuthPlugin",
   "blockUnknown": true,
   "credentials":{"solr":"IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0=
Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="}
},
"authorization":{
   "class":"solr.RuleBasedAuthorizationPlugin",
   "permissions":[{
"name": "all",
"role": "admin"
},
{
"name": "security-edit",
"role": "admin"
},
{
"name": "read",
"role": "admin"
},
{
"name": "update",
"role": "admin"
},
{
"name": "collection-admin-read",
"role": "admin"
},
{
"name": "config-read",
"role": "admin"
}],
   "user-role":{"solr":"admin"}
}}

And uploaded it to ZooKeeper:

cd /opt/solr-7.3.0/server/scripts/cloud-scripts
sudo ./zkcli.sh -zkhost 192.168.1.120:2181,192.168.1.100:2181,
192.168.1.105:2181 -cmd putfile /security.json
/home/pc2/Desktop/security.json
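
As a quick sanity check after the upload (host and password are placeholders),
the active authentication configuration can be read back over HTTP:

  curl -u solr:yourPassword "http://solr-host:8983/solr/admin/authentication"

If the BasicAuthPlugin is active, this returns the authentication section that
was uploaded; with blockUnknown=true the request must carry valid credentials.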

Authentication is enabled, but it looks like an inter-node communication
issue.
3 nodes are running, and the logging shows this error:


HTTP ERROR 401 Problem accessing
/solr/locationList_shard1_replica_n4/update. Reason: Unauthorized
request, Response code: 401