Re: Custom post filter with support for 'OR' queries

2019-05-05 Thread alexpusch
Thanks for the quick reply.

The real data is a representation of an HTML element, e.g. "body div.class1
div.b.a". My goal is to match documents by a CSS selector, e.g. ".class1 .a.b".

The field I'm querying on is a tokenized text field. The post filter takes
the doc value of the field (which is not tokenized - the whole string),
transforms it a bit, and matches it against an input regex.
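For illustration, the kind of matching involved could look something like the
following. This is a simplified sketch, not the actual filter: it ignores tag
names, only handles the descendant combinator, and all class names are made up.

```java
import java.util.HashSet;
import java.util.Set;

public class SelectorMatcher {
    // Extract the class set of one element token, e.g. "div.b.a" -> {b, a}.
    // parts[0] is the tag name (possibly empty for ".class1") and is ignored.
    private static Set<String> classesOf(String token) {
        Set<String> classes = new HashSet<>();
        String[] parts = token.split("\\.");
        for (int i = 1; i < parts.length; i++) {
            if (!parts[i].isEmpty()) classes.add(parts[i]);
        }
        return classes;
    }

    // Match a descendant selector such as ".class1 .a.b" against an element
    // path such as "body div.class1 div.b.a": each selector step must match
    // some element, in order, with arbitrary gaps in between (class order
    // within a step does not matter).
    public static boolean matches(String selector, String elementPath) {
        String[] steps = selector.trim().split("\\s+");
        String[] elements = elementPath.trim().split("\\s+");
        int s = 0;
        for (int e = 0; e < elements.length && s < steps.length; e++) {
            if (classesOf(elements[e]).containsAll(classesOf(steps[s]))) {
                s++;
            }
        }
        return s == steps.length;
    }

    public static void main(String[] args) {
        System.out.println(matches(".class1 .a.b", "body div.class1 div.b.a")); // true
        System.out.println(matches(".class1 .c", "body div.class1 div.b.a"));   // false
    }
}
```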

My alternative is to create a non-tokenized copy field, write a custom
filter that applies the transformation at index time, and use a regular regex
query on it. Writing the post filter was my first choice, since I wanted to
avoid copying data and I'm not sure about regular regex query performance.
The collection I'm querying is quite huge.





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Custom post filter with support for 'OR' queries

2019-05-05 Thread alexpusch
Hi, 

I'm trying to write my own custom post filter, following this guide:
http://qaware.blogspot.com/2014/11/how-to-write-postfilter-for-solr-49.html

My implementation works for a simple query:
{!myFilter}query

But I need to perform OR queries in addition to my post filter:
field:value OR {!myFilter}query

I'm getting the following error:
java.lang.UnsupportedOperationException: Query {!cache=false cost=100} does
not implement createWeight

As I only want this query parser to run on results in a post-filter manner,
I presume I neither need nor can implement createWeight.

Can a post filter be applied like this? Or should I look for a different
approach?





Re: Changing merge policy config on production

2017-12-16 Thread alexpusch
Thanks Erick, good point on maxMergedSegmentMB - many of my segments really
are maxed out.
My index isn't 800G, but it's not far from it - it's about 250G per server.
I have high confidence in Solr and my EC2 i3-2xl instances; so far I've gotten
pretty good results.





Re: How to restart solr in docker?

2017-12-16 Thread alexpusch
While I don't know which exact Solr image you use, I can tell you this:

1. The command of your Dockerfile probably starts Solr. A Docker container
will automatically shut down if the process started by its command is killed.
This means you should never 'restart' a process inside a container, but
restart the container as a whole.
2. You need to make sure your solrconfig.xml is under a Docker volume of
some kind. If it is not, your changes will not take effect, since after the
container restart solrconfig.xml will revert to the version baked into the
image.
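As a sketch, a bind mount in docker-compose could look like the following.
The in-container path is illustrative and depends on the image and how the
core's config directory is laid out:

```yaml
# Keep the core's conf/ (including solrconfig.xml) on a host bind mount
# so edits survive a container restart. Adjust paths to your image.
services:
  solr:
    image: solr:6
    ports:
      - "8983:8983"
    volumes:
      - ./myconfig:/opt/solr/server/solr/mycore/conf
```

With this in place, `docker restart` (or recreating the container) picks up
the edited solrconfig.xml from the host.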






Re: Changing merge policy config on production

2017-12-16 Thread alexpusch
To be clear - I'm talking about query performance, not indexing performance.





Re: Changing merge policy config on production

2017-12-16 Thread alexpusch
Thanks for the quick answer Erick,

I'm hoping to improve performance by reducing the number of segments.

Currently I have ~160 segments. Am I wrong in thinking this might improve
performance?





Changing merge policy config on production

2017-12-15 Thread alexpusch
Hi,
Is it safe to change the mergePolicyFactory config on production servers?
Specifically maxMergeAtOnce and segmentsPerTier. How will Solr reconcile the
current state of the segments with the new config? In case of setting
segmentsPerTier to a lower number - will subsequent merges be particularly
heavy and possibly cause performance issues?
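For reference, these are the settings in question as they appear under
<indexConfig> in solrconfig.xml (the values here are illustrative):

```xml
<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicyFactory>
</indexConfig>
```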

Thanks,
Alex.





Performance issues with 'unique' function in json facets over a high cardinality field

2017-12-12 Thread alexpusch
Hi,
I have a surprising performance issue with the 'unique' function in a json
facet

My setup holds a large number of docs (~1B). Despite this large number, I only
facet on a small result set of a query - only a few docs. The query itself
returns as fast as expected, but when I try to do a unique count on one of
the fields using json.facet, the query takes much longer.

Facet time remains constant when I try it over a much larger set of docs.
This leads me to believe that this unique count actually depends on the
overall field cardinality and not on the cardinality within the result set.
Am I right?

This phenomenon occurs both in a top-level facet and in a sub-facet
calculation, which is what I'm actually interested in.

Is there a way to facet and sub-facet over a field with high overall
cardinality but small cardinality within the result set?

My setup is Solr 6.0 in a Datastax Enterprise cluster

example queries:
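(The original queries did not come through; as a purely illustrative stand-in,
with hypothetical field names, a request of the shape described might be a
JSON body POSTed to the collection's /query endpoint:)

```json
{
  "query": "status:active",
  "limit": 0,
  "facet": {
    "by_category": {
      "type": "terms",
      "field": "category",
      "facet": { "distinct_users": "unique(user_id)" }
    }
  }
}
```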





Thanks,
Alex





Re: Keeping the index naturally ordered by some field

2017-10-02 Thread alexpusch
The reason I'm interested in this is kind of unique. I'm writing a custom
query parser and search component. These components go over the search
results and perform some calculation over them. This calculation depends on
input sorted by a certain value. In this scenario, regular Solr sorting is
insufficient, as it's performed post-search and only collects the rows needed
to satisfy the query. The alternative to a naturally sorted index is to sort
all the docs myself, which I wish to avoid. I use docValues extensively;
it really is a great help.

Erick, I've tried using SortingMergePolicyFactory. It brings me close to my
goal, but it's not quite there. The problem with this approach is that while
each segment is sorted by itself, the segments' ranges may overlap. For
example, let's say that some query results lie in segments A, B, and C. Each
of the segments is sorted, so the docs coming from segment A will be sorted
in the range 0-50, docs coming from segment B will be sorted in the range
20-70, and segment C will hold values in the 50-90 range. The query result
will be 0-50, 20-70, 50-90. Almost sorted, but not quite there.

A helpful detail about my data is that the field I'm interested in sorting
the index by is a timestamp, and docs are indexed more or less in the correct
order. As a result, a merge policy that merges only consecutive segments
should satisfy my need. TieredMergePolicy does merge non-consecutive segments,
so it's clearly a bad fit. I'm hoping to get some insight into additional
steps I might take so that SortingMergePolicyFactory can get all the way there.
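For context, the setup being discussed is roughly the following solrconfig.xml
fragment. The sort field and wrapped policy are illustrative, and the
wrapped.prefix wiring follows the wrapper-factory convention described in the
Solr reference guide:

```xml
<mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
  <str name="sort">timestamp asc</str>
  <str name="wrapped.prefix">inner</str>
  <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
</mergePolicyFactory>
```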

Thanks!





Keeping the index naturally ordered by some field

2017-10-01 Thread alexpusch
Hello,
We've got a pretty big index (~1B small docs). I'm interested in managing
the index so that the search results are naturally sorted by a certain
numeric field, without specifying the actual sort field at query time.

My first attempt was using SortingMergePolicyFactory. I found that this
provides only partial success: the results were occasionally sorted, but
overall there were 'jumps' in the ordering.

After some research I found an excellent blog post that taught me that
TieredMergePolicy merges non-consecutive segments, and thus creates several
segments with interlacing ordering. I've tried replacing the merge policy
with LogByteSizeMergePolicy, but the results are still inconsistent.

The post is from 2011, and it's not clear to me whether today
LogByteSizeMergePolicy merges only consecutive segments, or whether it can
merge non-consecutive segments as well.

Is there an approach that will allow me to achieve this goal?

Solr version: 6.0

Thanks, Alex.





Re: Iterating sorted result docs in a custom search component

2017-03-14 Thread alexpusch
I ended up using ValueSource and FunctionValues (as used in StatsComponent):

// Resolve a FunctionValues view of the field for the current leaf
// context, then read each doc's value by its (leaf-relative) docId
FieldType fieldType = schemaField.getType();
ValueSource valueSource = fieldType.getValueSource(schemaField, null);
FunctionValues values = valueSource.getValues(Collections.emptyMap(), ctx);

String value = values.strVal(docId);

I hope that's analogous to your suggested method.

Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Iterating-sorted-result-docs-in-a-custom-search-component-tp4324497p4324947.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Iterating sorted result docs in a custom search component

2017-03-14 Thread alexpusch
Single field. I'm iterating over the results once, and need each doc in
memory only for that single iteration. I need different fields from each doc
according to the algorithm state.



--


Re: Iterating sorted result docs in a custom search component

2017-03-13 Thread alexpusch
As has been said, only the top N results are collected - but in order to find
out which of the results are the top ones, all the results must be sorted,
no? Can't the docs somehow be accessed at that stage?

Anyway, I see SortingResponseWriter does its own manual sorting using a
priority queue. So shall I.
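For what it's worth, the priority-queue merge idea can be sketched in plain
Java, with int arrays standing in for per-segment sorted doc values (the
values are made up):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

public class SegmentMerge {
    // K-way merge of individually sorted "segments" into one globally sorted
    // stream, via a priority queue keyed on the current head of each segment.
    // Queue entries are {value, segmentIndex, offsetWithinSegment}.
    static int[] merge(int[][] segments) {
        PriorityQueue<int[]> pq =
                new PriorityQueue<>(Comparator.comparingInt(e -> e[0]));
        int total = 0;
        for (int i = 0; i < segments.length; i++) {
            total += segments[i].length;
            if (segments[i].length > 0) {
                pq.add(new int[]{segments[i][0], i, 0});
            }
        }
        int[] out = new int[total];
        for (int n = 0; n < total; n++) {
            int[] top = pq.poll();          // smallest head across all segments
            out[n] = top[0];
            int seg = top[1], next = top[2] + 1;
            if (next < segments[seg].length) {
                pq.add(new int[]{segments[seg][next], seg, next});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Three per-segment sorted runs with overlapping ranges
        int[][] segments = {{0, 20, 50}, {20, 40, 70}, {50, 60, 90}};
        System.out.println(Arrays.toString(merge(segments)));
        // prints [0, 20, 20, 40, 50, 50, 60, 70, 90]
    }
}
```

Each poll costs O(log k) for k segments, so the whole merge is O(n log k)
rather than a full O(n log n) re-sort.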

Thanks Joel and Erick! 



--


Iterating sorted result docs in a custom search component

2017-03-12 Thread alexpusch
I hope this is the right place to ask about custom search components.

I'm writing a custom search component. My aim is to iterate over the entire
result set and do some aggregate computation. In order to implement my
algorithm I need to iterate over the result set in the order declared in
the search query.

I've taken StatsComponent as a relevant example. It iterates over the
results using rb.getResults().docSet and searcher.getIndexReader().leaves(),
but it seems that these methods do not respect the query sort order.

I've tried creating a new TopCollector and requesting it to collect all the
docs. It works, but it takes too long.

Is there a way to iterate over the sorted result set efficiently?
I'm working on solr 4.11, but upgrading to a newer version is acceptable if
necessary.

Thanks!



--