Lucene two-phase iteration question

2017-12-22 Thread Wei
Hi,

I noticed that Lucene introduced a new two-phase iteration API in version 5,
but could not get a good understanding of how it works. Is there any
detailed documentation, or are there examples?  Does two-phase iteration
result in better query performance?  Appreciate your help.

Thanks,
Wei
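For what it's worth until better documentation turns up, here is a toy model of the idea in plain Java (no Lucene dependency; the names and data are illustrative, not Lucene's actual API): a TwoPhaseIterator-style query splits matching into a cheap approximation that streams candidate doc-IDs and an expensive matches() check that runs only on those candidates.

```java
import java.util.function.IntPredicate;

// Toy model of two-phase iteration: a cheap approximation enumerates
// candidate doc-IDs (it may over-match), and an expensive per-document
// check confirms each candidate before it is counted as a hit.
public class TwoPhaseSketch {
    static int countMatches(int[] approximation, IntPredicate matches) {
        int count = 0;
        for (int docId : approximation) {     // phase 1: cheap candidates
            if (matches.test(docId)) {        // phase 2: costly confirmation
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Pretend the expensive check (e.g. verifying term positions for a
        // phrase) only confirms even doc-IDs among the candidates.
        int[] candidates = {0, 1, 2, 3, 4, 5};
        System.out.println(countMatches(candidates, d -> d % 2 == 0)); // prints 3
    }
}
```

The performance benefit in Lucene itself comes from conjunctions: the cheap approximations of all clauses are advanced to a common doc-ID first, so the costly confirmation runs only on documents every approximation agrees on.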


Re: Debugging custom RequestHander: spinning up a core for debugging

2017-12-22 Thread Tod Olson
Thanks, that pointed me in the right direction! The problem was an ancient ICU 
library in the distributed code.

-Tod

On Dec 15, 2017, at 5:15 PM, Erick Erickson wrote:

My guess is this isn't a Solr issue at all; you are somehow using an old Java.

RBBIDataWrapper is from

com.ibm.icu.text;

I saw on a quick Google that this was cured by re-installing Eclipse,
but that was from 5 years ago.

You say your Java and IDE skills are a bit rusty, maybe you haven't
updated your Java JDK or Eclipse in a while? I don't know if Eclipse
somehow has its own Java (I haven't used Eclipse for quite a while).

I take it this runs outside Eclipse OK? (well, with problems otherwise
you wouldn't be stepping through it.)

Best,
Erick

On Fri, Dec 15, 2017 at 1:16 PM, Tod Olson wrote:
Hi everyone,

I need to do some step-wise debugging on a custom RequestHandler. I'm trying to 
spin up a core in a Junit test, with the idea of running it inside of Eclipse 
for debugging. (If there's an easier way, I'd like to see a walk through!) 
Problem is the core fails to spin up with:

java.io.IOException: Break Iterator Rule Data Magic Number Incorrect, or 
unsupported data version

Here's the code, just trying to load (cribbed and adapted from 
https://stackoverflow.com/questions/45506381/how-to-debug-solr-plugin):

import java.util.logging.Logger;

import org.apache.solr.core.CoreContainer;
import org.apache.solr.core.SolrCore;
import org.junit.AfterClass;
import org.junit.BeforeClass;

public class BrowseHandlerTest
{
    private static CoreContainer container;
    private static SolrCore core;

    private static final Logger logger = Logger.getGlobal();

    @BeforeClass
    public static void prepareClass() throws Exception
    {
        String solrHomeProp = "solr.solr.home";
        System.out.println(solrHomeProp + "= " + System.getProperty(solrHomeProp));
        // create the core container from the solr.solr.home system property
        container = new CoreContainer();
        container.load();
        core = container.getCore("biblio");
        logger.info("Solr core loaded!");
    }

    @AfterClass
    public static void cleanUpClass()
    {
        core.close();
        container.shutdown();
        logger.info("Solr core shut down!");
    }
}

The test, run through ant, fails as follows:

   [junit] solr.solr.home= /Users/tod/src/vufind/solr/vufind
   [junit] SLF4J: Defaulting to no-operation (NOP) logger implementation
   [junit] SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for 
further details.
   [junit] SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
   [junit] SLF4J: Defaulting to no-operation MDCAdapter implementation.
   [junit] SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for 
further details.
   [junit] Tests run: 0, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 
1.299 sec
   [junit]
   [junit] - Standard Error -
   [junit] SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
   [junit] SLF4J: Defaulting to no-operation (NOP) logger implementation
   [junit] SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for 
further details.
   [junit] SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
   [junit] SLF4J: Defaulting to no-operation MDCAdapter implementation.
   [junit] SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for 
further details.
   [junit] -  ---
   [junit] Testcase: org.vufind.solr.handler.tests.BrowseHandlerTest: Caused an 
ERROR
   [junit] SolrCore 'biblio' is not available due to init failure: JVM Error 
creating core [biblio]: null
   [junit] org.apache.solr.common.SolrException: SolrCore 'biblio' is not 
available due to init failure: JVM Error creating core [biblio]: null
   [junit]  at 
org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:1066)
   [junit]  at 
org.vufind.solr.handler.tests.BrowseHandlerTest.prepareClass(BrowseHandlerTest.java:45)
   [junit] Caused by: org.apache.solr.common.SolrException: JVM Error creating 
core [biblio]: null
   [junit]  at org.apache.solr.core.CoreContainer.create(CoreContainer.java:833)
   [junit]  at 
org.apache.solr.core.CoreContainer.access$000(CoreContainer.java:87)
   [junit]  at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:467)
   [junit]  at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:458)
   [junit]  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   [junit]  at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
   [junit]  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   [junit]  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   [junit]  at java.lang.Thread.run(Thread.java:745)
   [junit] Caused by: java.lang.ExceptionInInitializerError
   [junit]  at 

Re: Confusing DocValues documentation

2017-12-22 Thread Erick Erickson
About the docs: recently we've changed the documentation to asciidoc format.

One of the ways to contribute is to raise a JIRA and submit a
documentation patch.
See: https://wiki.apache.org/solr/HowToContribute

It's valuable to have people reading docs and trying to understand
them help update them with fresh eyes.

Best,
Erick

On Fri, Dec 22, 2017 at 11:20 AM, Emir Arnautović
 wrote:
> Your questions are already more or less answered:
>> 1) If the docValues are that good, can we get rid of the stored values
>> altogether?
> You can if you want - just configure your field with stored=“false” and 
> docValues=“true”. Note that you can do that only if:
> * field is not analyzed (you cannot enable docValues for analyzed field)
> * you do not care about order of your values
>
>> 2) And why the docValues are not enabled by default for multi-valued fields?
> Because it is overhead when it comes to indexing and it is not used in all 
> cases - only if field is used for faceting, sorting or in functions.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
>> On 22 Dec 2017, at 19:51, Tech Id  wrote:
>>
>> Very interesting discussion SG and Erick.
>> I wish these details were part of the official Solr documentation as well.
>> And yes, "columnar format" did not give any useful information to me either.
>>
>>
>> "A good explanation increases contributions to the project as more people
>> become empowered to improvise."
>>   - Self, LOL
>>
>>
>> I was expecting the sorting, faceting, pivoting to be a bit more optimized for
>> docValues, something like a pre-calculated bit of information.
>> However, now it seems that the major benefit of docValues is to optimize
>> the lookup time of stored fields.
>> Here is the sorting function I wrote as pseudo-code from the discussion:
>>
>>
>> int docIDs[] = filterDocsOnQuery (query);
>> T docValues[] = loadDocValues (sortField);
>> TreeMap<T, Integer> sortFieldValues = new TreeMap<>();
>> for (int docId : docIDs) {
>>T val = docValues[docId];
>>sortFieldValues.put(val, docId);
>> }
>> // return docIDs sorted by value
>> return sortFieldValues.values();
>>
>>
>> It is indeed difficult to pre-compute the sorts and facets because we do
>> not know what docIDs will be returned by the filtering.
>>
>> Two last questions I have are:
>> 1) If the docValues are that good, can we get rid of the stored values
>> altogether?
>> 2) And why the docValues are not enabled by default for multi-valued fields?
>>
>>
>> -T
>>
>>
>>
>>
>> On Thu, Dec 21, 2017 at 9:02 PM, Erick Erickson wrote:
>>
>>> OK, last bit of the tutorial.
>>>
>>> bq: But that does not seem to be helping with sorting or faceting of any
>>> kind.
>>> This seems to be like a good way to speed up a stored field's retrieval.
>>>
>>> These are the same thing. I have two docs. I have to know how they
>>> sort. Therefore I need the value in the sort field for each. This is the
>>> same thing as getting the stored value, no?
>>>
>>> As for facets it's the same problem. To count facet buckets I have to
>>> find the values for the field for each document in the results list
>>> and tally them. This is also getting the stored value, right? You're
>>> asking "for the docs in my result set, how many of them have val1, how
>>> many have val2, how many have val54, etc."
>>>
>>> And as an aside the docValues can also be used to return the stored value.
>>>
>>> Best,
>>> Erick
>>>
>>> On Thu, Dec 21, 2017 at 8:23 PM, S G  wrote:
 Thank you Eric.

 I guess the biggest piece I was missing was the sort on a field other
>>> than
 the search field.
 Once you have filtered a list of documents and then you want to sort, the
 inverted index cannot be used for lookup.
 You just have doc-IDs which are values in inverted index, not the keys.
 Hence they cannot be "looked" up - only option is to loop through all the
 entries of that key's inverted index.

 DocValues come to the rescue by reducing that looping operation to a lookup
 again.
 Because in docValues, the key (i.e. array-index) is the document-index
>>> and
 gives an O(1) lookup for any doc-ID.


 But that does not seem to be helping with sorting or faceting of any
>>> kind.
 This seems to be like a good way to speed up a stored field's retrieval.

 DocValues in the current example are:
 FieldA
 doc1 = 1
 doc2 = 2
 doc3 =

 FieldB
 doc1 = 2
 doc2 = 4
 doc3 = 5

 FieldC
 doc1 = 5
 doc2 =
 doc3 = 5

 So if I have to run a query:
q=fieldA:*&sort=fieldB asc
 I will get all the documents due to filter and then I will lookup the
 values of field-B from the docValues lookup.
 That will give me 2,4,5
 This is sorted in this case, 

Re: Confusing DocValues documentation

2017-12-22 Thread Tech Id
Thanks Emir,

It seems that stored="false" docValues="true" is the default in Solr's
github and the recommended way to go.


grep "docValues=\"true\""
./server/solr/configsets/_default/conf/managed-schema


  <!-- Point fields don't support FieldCache, so they must have
       docValues="true" if needed for sorting, faceting, functions, etc. -->

So I assume the default for all the basic field types (single- and
multi-valued) is docValues="true" with stored="false".
But I do not get why the "id" field and the dynamic fields have
stored="true" in Solr 7:



grep "stored=\"true\""
./server/solr/configsets/_default/conf/managed-schema | grep -v "\*_txt_"

That is perhaps a bug?



Booleans seem to care neither about stored nor docValues:


grep -i boolean ./server/solr/configsets/_default/conf/managed-schema

-T



On Fri, Dec 22, 2017 at 11:20 AM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Your questions are already more or less answered:
> > 1) If the docValues are that good, can we get rid of the stored values
> > altogether?
> You can if you want - just configure your field with stored=“false” and
> docValues=“true”. Note that you can do that only if:
> * field is not analyzed (you cannot enable docValues for analyzed field)
> * you do not care about order of your values
>
> > 2) And why the docValues are not enabled by default for multi-valued
> fields?
> Because it is overhead when it comes to indexing and it is not used in all
> cases - only if field is used for faceting, sorting or in functions.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 22 Dec 2017, at 19:51, Tech Id  wrote:
> >
> > Very interesting discussion SG and Erick.
> > I wish these details were part of the official Solr documentation as
> well.
> > And yes, "columnar format" did not give any useful information to me
> either.
> >
> >
> > "A good explanation increases contributions to the project as more people
> > become empowered to improvise."
> >   - Self, LOL
> >
> >
> > I was expecting the sorting, faceting, pivoting to be a bit more optimized
> for
> > docValues, something like a pre-calculated bit of information.
> > However, now it seems that the major benefit of docValues is to optimize
> > the lookup time of stored fields.
> > Here is the sorting function I wrote as pseudo-code from the discussion:
> >
> >
> > int docIDs[] = filterDocsOnQuery (query);
> > T docValues[] = loadDocValues (sortField);
> > TreeMap<T, Integer> sortFieldValues = new TreeMap<>();
> > for (int docId : docIDs) {
> >T val = docValues[docId];
> >sortFieldValues.put(val, docId);
> > }
> > // return docIDs sorted by value
> > return sortFieldValues.values();
> >
> >
> > It is indeed difficult to pre-compute the sorts and facets because we do
> > not know what docIDs will be returned by the filtering.
> >
> > Two last questions I have are:
> > 1) If the docValues are that good, can we get rid of the stored values
> > altogether?
> > 2) And why the docValues are not enabled by default for multi-valued
> fields?
> >
> >
> > -T
> >
> >
> >
> >
> > On Thu, Dec 21, 2017 at 9:02 PM, Erick Erickson wrote:
> >
> >> OK, last bit of the tutorial.
> >>
> >> bq: But that does not seem to be helping with sorting or faceting of any
> >> kind.
> >> This seems to be like a good way to speed up a stored field's retrieval.
> >>
> >> These are the same thing. I have two docs. I have to know how they
> >> sort. Therefore I need the value in the sort field for each. This is the
> >> same thing as getting the stored value, no?
> >>
> >> As for facets it's the same problem. To count facet buckets I have to
> >> find the values for the field for each document in the results list
> >> and tally them. This is also getting the stored value, right? You're
> >> asking "for the docs in my result set, how many of them have val1, how
> >> many have val2, how many have val54, etc."
> >>
> >> And as an aside the docValues can also be used to return the stored
> value.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Dec 21, 2017 at 8:23 PM, S G  wrote:
> >>> Thank you Eric.
> >>>
> >>> I guess the biggest piece I was missing was the sort on a field other
> >> than
> >>> the search field.
> >>> Once you have filtered a list of documents and then you want to sort,
> the
> >>> inverted index cannot be used for lookup.
> >>> You just have doc-IDs which are values in inverted index, not the keys.
> >>> Hence they cannot be "looked" up - only option is to loop through all
> the
> >>> entries of that key's inverted index.
> 

Re: Confusing DocValues documentation

2017-12-22 Thread Emir Arnautović
Your questions are already more or less answered:
> 1) If the docValues are that good, can we get rid of the stored values
> altogether?
You can if you want - just configure your field with stored=“false” and 
docValues=“true”. Note that you can do that only if:
* field is not analyzed (you cannot enable docValues for analyzed field)
* you do not care about order of your values

> 2) And why the docValues are not enabled by default for multi-valued fields?
Because it is overhead when it comes to indexing and it is not used in all 
cases - only if field is used for faceting, sorting or in functions.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
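To make the stored="false" docValues="true" combination concrete, a field configured that way might look like the following schema sketch (the field name and type here are made up for illustration; useDocValuesAsStored, if my reading of the schema attributes is right, lets Solr return the value in results even though the field is not stored):

```xml
<!-- Illustrative sketch only, not taken from the default configset -->
<field name="price" type="plong" indexed="true" stored="false"
       docValues="true" useDocValuesAsStored="true"/>
```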



> On 22 Dec 2017, at 19:51, Tech Id  wrote:
> 
> Very interesting discussion SG and Erick.
> I wish these details were part of the official Solr documentation as well.
> And yes, "columnar format" did not give any useful information to me either.
> 
> 
> "A good explanation increases contributions to the project as more people
> become empowered to improvise."
>   - Self, LOL
> 
> 
> I was expecting the sorting, faceting, pivoting to be a bit more optimized for
> docValues, something like a pre-calculated bit of information.
> However, now it seems that the major benefit of docValues is to optimize
> the lookup time of stored fields.
> Here is the sorting function I wrote as pseudo-code from the discussion:
> 
> 
> int docIDs[] = filterDocsOnQuery (query);
> T docValues[] = loadDocValues (sortField);
> TreeMap<T, Integer> sortFieldValues = new TreeMap<>();
> for (int docId : docIDs) {
>T val = docValues[docId];
>sortFieldValues.put(val, docId);
> }
> // return docIDs sorted by value
> return sortFieldValues.values();
> 
> 
> It is indeed difficult to pre-compute the sorts and facets because we do
> not know what docIDs will be returned by the filtering.
> 
> Two last questions I have are:
> 1) If the docValues are that good, can we get rid of the stored values
> altogether?
> 2) And why the docValues are not enabled by default for multi-valued fields?
> 
> 
> -T
> 
> 
> 
> 
> On Thu, Dec 21, 2017 at 9:02 PM, Erick Erickson wrote:
> 
>> OK, last bit of the tutorial.
>> 
>> bq: But that does not seem to be helping with sorting or faceting of any
>> kind.
>> This seems to be like a good way to speed up a stored field's retrieval.
>> 
>> These are the same thing. I have two docs. I have to know how they
>> sort. Therefore I need the value in the sort field for each. This is the
>> same thing as getting the stored value, no?
>> 
>> As for facets it's the same problem. To count facet buckets I have to
>> find the values for the field for each document in the results list
>> and tally them. This is also getting the stored value, right? You're
>> asking "for the docs in my result set, how many of them have val1, how
>> many have val2, how many have val54, etc."
>> 
>> And as an aside the docValues can also be used to return the stored value.
>> 
>> Best,
>> Erick
>> 
>> On Thu, Dec 21, 2017 at 8:23 PM, S G  wrote:
>>> Thank you Eric.
>>> 
>>> I guess the biggest piece I was missing was the sort on a field other
>> than
>>> the search field.
>>> Once you have filtered a list of documents and then you want to sort, the
>>> inverted index cannot be used for lookup.
>>> You just have doc-IDs which are values in inverted index, not the keys.
>>> Hence they cannot be "looked" up - only option is to loop through all the
>>> entries of that key's inverted index.
>>> 
>>> DocValues come to the rescue by reducing that looping operation to a lookup
>>> again.
>>> Because in docValues, the key (i.e. array-index) is the document-index
>> and
>>> gives an O(1) lookup for any doc-ID.
>>> 
>>> 
>>> But that does not seem to be helping with sorting or faceting of any
>> kind.
>>> This seems to be like a good way to speed up a stored field's retrieval.
>>> 
>>> DocValues in the current example are:
>>> FieldA
>>> doc1 = 1
>>> doc2 = 2
>>> doc3 =
>>> 
>>> FieldB
>>> doc1 = 2
>>> doc2 = 4
>>> doc3 = 5
>>> 
>>> FieldC
>>> doc1 = 5
>>> doc2 =
>>> doc3 = 5
>>> 
>>> So if I have to run a query:
>>>    q=fieldA:*&sort=fieldB asc
>>> I will get all the documents due to filter and then I will lookup the
>>> values of field-B from the docValues lookup.
>>> That will give me 2,4,5
>>> This is sorted in this case, but assume that this was not sorted.
>>> (The docValues array is indexed by Lucene's doc-ID not the field-value
>>> after all, right?)
>>> 
>>> Then does Lucene/Solr still sort them like a regular array of values?
>>> That does not seem very efficient.
>>> And it does not seem to be helping with faceting or pivoting either.
>>> What did I miss?
>>> 
>>> Thanks
>>> SG
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Dec 21, 2017 at 5:31 PM, Erick Erickson wrote:
>>> 
 Here's where you're going off the rails: 

Re: Confusing DocValues documentation

2017-12-22 Thread Tech Id
Very interesting discussion SG and Erick.
I wish these details were part of the official Solr documentation as well.
And yes, "columnar format" did not give any useful information to me either.


"A good explanation increases contributions to the project as more people
become empowered to improvise."
   - Self, LOL


I was expecting the sorting, faceting, pivoting to be a bit more optimized for
docValues, something like a pre-calculated bit of information.
However, now it seems that the major benefit of docValues is to optimize
the lookup time of stored fields.
Here is the sorting function I wrote as pseudo-code from the discussion:


int docIDs[] = filterDocsOnQuery (query);
T docValues[] = loadDocValues (sortField);
TreeMap<T, Integer> sortFieldValues = new TreeMap<>();
for (int docId : docIDs) {
T val = docValues[docId];
sortFieldValues.put(val, docId);
}
// return docIDs sorted by value
return sortFieldValues.values();


It is indeed difficult to pre-compute the sorts and facets because we do
not know what docIDs will be returned by the filtering.
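The pseudo-code above can be turned into a runnable sketch (plain Java arrays stand in for a docValues column; the data is made up). One fix worth noting: a TreeMap keyed by value silently drops documents whose values are equal, so sorting the doc-IDs with a comparator is safer:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DocValuesSort {
    // docValues modeled as a plain column: index = Lucene doc-ID,
    // value = the sort field's value (null = document has no value).
    static List<Integer> sortByDocValues(int[] docIDs, Integer[] docValues) {
        List<Integer> sorted = new ArrayList<>();
        for (int id : docIDs) {
            sorted.add(id);
        }
        // Each comparison is an O(1) array lookup into the column;
        // no inverted-index scan is needed to find a doc's value.
        sorted.sort(Comparator.comparing((Integer id) -> docValues[id],
                Comparator.nullsLast(Comparator.<Integer>naturalOrder())));
        return sorted;
    }

    public static void main(String[] args) {
        Integer[] fieldB = {2, 4, 5};   // doc0 = 2, doc1 = 4, doc2 = 5
        int[] hits = {2, 0, 1};         // doc-IDs that matched the filter
        System.out.println(sortByDocValues(hits, fieldB)); // prints [0, 1, 2]
    }
}
```

The sort itself is still an ordinary O(n log n) sort over the matched doc-IDs; what docValues buy is the cheap per-document value lookup inside each comparison.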

Two last questions I have are:
1) If the docValues are that good, can we get rid of the stored values
altogether?
2) And why the docValues are not enabled by default for multi-valued fields?


-T




On Thu, Dec 21, 2017 at 9:02 PM, Erick Erickson wrote:

> OK, last bit of the tutorial.
>
> bq: But that does not seem to be helping with sorting or faceting of any
> kind.
> This seems to be like a good way to speed up a stored field's retrieval.
>
> These are the same thing. I have two docs. I have to know how they
> sort. Therefore I need the value in the sort field for each. This is the
> same thing as getting the stored value, no?
>
> As for facets it's the same problem. To count facet buckets I have to
> find the values for the field for each document in the results list
> and tally them. This is also getting the stored value, right? You're
> asking "for the docs in my result set, how many of them have val1, how
> many have val2, how many have val54, etc."
>
> And as an aside the docValues can also be used to return the stored value.
>
> Best,
> Erick
>
> On Thu, Dec 21, 2017 at 8:23 PM, S G  wrote:
> > Thank you Eric.
> >
> > I guess the biggest piece I was missing was the sort on a field other
> than
> > the search field.
> > Once you have filtered a list of documents and then you want to sort, the
> > inverted index cannot be used for lookup.
> > You just have doc-IDs which are values in inverted index, not the keys.
> > Hence they cannot be "looked" up - only option is to loop through all the
> > entries of that key's inverted index.
> >
> > DocValues come to the rescue by reducing that looping operation to a lookup
> > again.
> > Because in docValues, the key (i.e. array-index) is the document-index
> and
> > gives an O(1) lookup for any doc-ID.
> >
> >
> > But that does not seem to be helping with sorting or faceting of any
> kind.
> > This seems to be like a good way to speed up a stored field's retrieval.
> >
> > DocValues in the current example are:
> > FieldA
> > doc1 = 1
> > doc2 = 2
> > doc3 =
> >
> > FieldB
> > doc1 = 2
> > doc2 = 4
> > doc3 = 5
> >
> > FieldC
> > doc1 = 5
> > doc2 =
> > doc3 = 5
> >
> > So if I have to run a query:
> > q=fieldA:*&sort=fieldB asc
> > I will get all the documents due to filter and then I will lookup the
> > values of field-B from the docValues lookup.
> > That will give me 2,4,5
> > This is sorted in this case, but assume that this was not sorted.
> > (The docValues array is indexed by Lucene's doc-ID not the field-value
> > after all, right?)
> >
> > Then does Lucene/Solr still sort them like a regular array of values?
> > That does not seem very efficient.
> > And it does not seem to be helping with faceting or pivoting either.
> > What did I miss?
> >
> > Thanks
> > SG
> >
> >
> >
> >
> >
> >
> > On Thu, Dec 21, 2017 at 5:31 PM, Erick Erickson wrote:
> >
> >> Here's where you're going off the rails: "I can just look at the
> >> map-for-field-A"
> >>
> >> As I said before, you're totally right, all the information you need
> >> is there. But
> >> you're thinking of this as though speed weren't a premium when you say.
> >> "I can just look". Consider that there are single replicas out there
> with
> >> 300M
> >> (or more) docs in them. "Just looking" in a list 300M items long 300M
> times
> >> (q=*:*&sort=whatever) is simply not going to be performant compared to
> >> 300M indexing operations which is what DV does.
> >>
> >> Faceting is much worse.
> >>
> >> Plus space is also at a premium. Java takes 40+ bytes to store the first
> >> character. So any Java structure you use is going to be enormous. 300M
> ints
> >> is bad enough. And if you spoof this by using ordinals as Lucene does,
> >> you're
> >> well on your way to reinventing docValues.
> >>
> >> Maybe this will help. Imagine you have a phone book in your hands. It
> >> consists of documents 

RE: Trouble with mm and SynonymQuery and KeywordRepeatFilter

2017-12-22 Thread Markus Jelsma
Hello Walter, Steve,

That is not going to be that easy; we have many Germanic languages in the
index, all with support for splitting compound words.

I also do not like the idea of adding all inflections to the synonyms file; it
blows up our queries N-fold, and they are already very big due to searching
over many fields in many languages. And I believe it is counter-intuitive; I
have a stemmer for that.

Ideally I would want to fix this in mm, something like mm.autoRelax does.

Many thanks,
Markus
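For readers following along, the kind of query-time chain being discussed looks roughly like this sketch (illustrative only, not Markus's actual configuration; the stemmer and synonym file names are made up). KeywordRepeatFilterFactory emits each token twice, once keyword-marked so the stemmer skips it, and RemoveDuplicatesTokenFilterFactory collapses the pair when stemming changed nothing:

```xml
<fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

With a chain like this, one user-typed term can become several query terms, which is exactly what trips up mm as described in the thread.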

-Original message-
> From: Walter Underwood
> Sent: Thursday 21st December 2017 17:13
> To: solr-user@lucene.apache.org
> Subject: Re: Trouble with mm and SynonymQuery and KeywordRepeatFilter
> 
> You can find all the inflected forms that are in your index. Search for the 
> root form, use highlighting to pull out matches, and collect them. It is a 
> bother, but not that hard for a program to do.
> 
> In the synonym file, you don’t need to list an inflected form of the synonym, 
> because it will be stemmed. So:
> 
> traject => verbind
> trajecten => verbind
> 
> If you want an algorithmic solution, look for a “morphological generator”. 
> That is the inverse of a morphological analyzer. In the olden days, query 
> time generation was an alternative to stemming (analysis) at index time. But 
> that makes the query much larger and much slower.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
> > On Dec 21, 2017, at 6:28 AM, Markus Jelsma wrote:
> > 
> > Hello Steve,
> > 
> > Well, that is an interesting approach to the topic indeed. But i do not 
> > think it is possible to obtain a list of all inflected forms for all words 
> > that also have roots in some synonym file, the stemmers are not reversible. 
> > 
> > Any other ideas?
> > 
> > Thanks,
> > Markus
> > 
> > -Original message-
> >> From: Steve Rowe
> >> Sent: Thursday 21st December 2017 0:10
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Trouble with mm and SynonymQuery and KeywordRepeatFilter
> >> 
> >> Hi Markus,
> >> 
> >> My suggestion: rewrite your synonyms to include the triggering word in the 
> >> expanded synonyms list.  That way you won’t need 
> >> KeywordRepeat/RemoveDuplicates filters, and mm=100% will work as you 
> >> expect.
> >> 
> >> I don’t think this situation is a bug, since mm applies to the built 
> >> query, not to the original query terms.
> >> 
> >> --
> >> Steve
> >> www.lucidworks.com
> >> 
> >>> On Dec 20, 2017, at 5:02 PM, Markus Jelsma wrote:
> >>> 
> >>> Hello,
> >>> 
> >>> Yes of course, index time synonyms lessens the query time complexity and 
> >>> will solve the mm problem. It also screws IDF and the flexibility of 
> >>> adding synonyms on demand. The first we do not want, the second is 
> >>> impossible for us (very large main search index).
> >>> 
> >>> We are looking for a solution with mm that takes KeywordRepeat, stemming 
> >>> and synonym expansion into consideration. To me the current working of mm 
> >>> in this case is a bug, i input one term so treat it as one term in mm, 
> >>> regardless of expanded query terms.
> >>> 
> >>> Any query time ideas to share? I am not well versed with the actual code 
> >>> dealing with this specific subject, the code doesn't like me. I am fine 
> >>> if someone points me to the code that tells mm about the number of 
> >>> original input terms, and what to do. If someone does, please also 
> >>> explain why the change i want to make is a bad one, what to be aware of 
> >>> or what to beware of, or what to take into account.
> >>> 
> >>> Also, am i the only one who regards this behaviour as a bug, or more 
> >>> subtle, a weird unexpected behaviour?
> >>> 
> >>> Many many thanks!
> >>> Markus
> >>> 
> >>> -Original message-
>  From: Shawn Heisey
>  Sent: Wednesday 20th December 2017 22:39
>  To: solr-user@lucene.apache.org
>  Subject: Re: Trouble with mm and SynonymQuery and KeywordRepeatFilter
>  
>  On 12/19/2017 4:38 AM, Markus Jelsma wrote:
> > I have an interesting issue with mm and SynonymQuery and 
> > KeywordRepeatFilter. We do query time synonym expansion and use 
> > KeywordRepeat for not only finding stemmed tokens. Our synonyms are 
> > already preprocessed and contain only stemmed tokens. Synonym file 
> > contains: traject,verbind
> > 
> > So, any non-root stem that ends up in a synonym is actually a search 
> > for three terms: +DisjunctionMaxQuery(((title_nl:trajecten 
> > Synonym(title_nl:traject title_nl:verbind
> > 
> > But, our default mm requires that two terms must match if the input 
> > query consists of two terms: 2<-1 5<-2 6<90%
> > 
> > So, a simple query looking for a plural (trajecten) will not match a 
> > document where the title contains 

Re: Edismax leading wildcard search

2017-12-22 Thread Michael Kuhlmann
On 22.12.2017 at 11:57, Selvam Raman wrote:
> 1) How can I disable leading wildcard searches?

Do it on the client side. Just don't allow leading asterisks or question
marks in your query term.

> 2) Why do leading wildcard searches take so much time to return a response?
> 

Because with a leading wildcard, Lucene can't simply seek to all terms
beginning with a prefix; it has to scan every term in the field instead.
Indexed terms are stored in sorted order, which makes prefix lookups fast
but doesn't help with leading wildcards.

There's a ReversedWildcardFilterFactory in Solr to address this issue.

-Michael
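A sketch of how that filter is typically wired in, on the index-time analyzer only (the parameter values here are illustrative, not tuned recommendations):

```xml
<fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- also index a reversed form of each term so leading wildcards
         can be rewritten into fast trailing-wildcard lookups -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The tradeoff is a larger index, since many terms end up stored twice (forward and reversed).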


Edismax leading wildcard search

2017-12-22 Thread Selvam Raman
Hi,

Solr version - 6.4

Parser - Edismax

Leading wildcard search is allowed in edismax.

1) How can I disable leading wildcard searches?
2) Why do leading wildcard searches take so much time to return a response?

-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"