[jira] [Created] (LUCENE-10680) UnifiedHighlighter's term extraction not working for some query rewrites

2022-08-17 Thread Yannick Welsch (Jira)
Yannick Welsch created LUCENE-10680:
---

 Summary: UnifiedHighlighter's term extraction not working for some 
query rewrites
 Key: LUCENE-10680
 URL: https://issues.apache.org/jira/browse/LUCENE-10680
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/highlighter
Reporter: Yannick Welsch


UnifiedHighlighter rewrites the query against an empty index when extracting 
the terms from the query (see 
https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java#L149).

The rewrite step can unfortunately drop the terms that are to be extracted.

Take for example the boolean query "+field:value 
-ConstantScore(FieldExistsQuery [field=other_field])" when highlighting on 
"field".

The `FieldExistsQuery` rewrites on an empty index to a `MatchAllDocsQuery`. 
Because that clause is a `MUST_NOT` clause, the overall boolean query is then 
rewritten to a `MatchNoDocsQuery`, which drops the `MUST` clause in the 
process, so the `field:value` term is never extracted.
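
The chain of rewrites above can be followed in a small stand-alone model. This 
is only a sketch: the classes below (TermQ, FieldExists, Bool, ...) are 
hypothetical stand-ins, not Lucene's query classes, and they hard-code the 
rewrite behaviour described in this report rather than reproducing Lucene's 
actual rewrite rules.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (NOT Lucene's classes) of term extraction after an empty-index rewrite. */
final class RewriteDropsTerms {
    interface Query {
        Query rewriteOnEmptyIndex();
        void collectTerms(List<String> out);
    }

    static final class TermQ implements Query {
        final String field, value;
        TermQ(String field, String value) { this.field = field; this.value = value; }
        public Query rewriteOnEmptyIndex() { return this; }
        public void collectTerms(List<String> out) { out.add(field + ":" + value); }
    }

    static final class MatchAll implements Query {
        public Query rewriteOnEmptyIndex() { return this; }
        public void collectTerms(List<String> out) { /* no terms */ }
    }

    static final class MatchNone implements Query {
        public Query rewriteOnEmptyIndex() { return this; }
        public void collectTerms(List<String> out) { /* no terms */ }
    }

    static final class FieldExists implements Query {
        // Assumption taken from the report: on an empty index this becomes match-all.
        public Query rewriteOnEmptyIndex() { return new MatchAll(); }
        public void collectTerms(List<String> out) { /* no terms */ }
    }

    /** Boolean query with one MUST and one MUST_NOT clause. */
    static final class Bool implements Query {
        final Query must, mustNot;
        Bool(Query must, Query mustNot) { this.must = must; this.mustNot = mustNot; }

        public Query rewriteOnEmptyIndex() {
            Query rewrittenMustNot = mustNot.rewriteOnEmptyIndex();
            if (rewrittenMustNot instanceof MatchAll) {
                // Excluding every document means nothing can match; the MUST clause is lost here.
                return new MatchNone();
            }
            return new Bool(must.rewriteOnEmptyIndex(), rewrittenMustNot);
        }

        // Only the MUST clause contributes terms to highlight.
        public void collectTerms(List<String> out) { must.collectTerms(out); }
    }

    static List<String> extractTerms(Query query, boolean rewriteFirst) {
        Query target = rewriteFirst ? query.rewriteOnEmptyIndex() : query;
        List<String> terms = new ArrayList<>();
        target.collectTerms(terms);
        return terms;
    }

    /** Models "+field:value -FieldExistsQuery [field=other_field]". */
    static Query example() {
        return new Bool(new TermQ("field", "value"), new FieldExists());
    }
}
```

In this model, extracting from the original query yields field:value, while 
extracting after the rewrite yields nothing, which matches the behaviour 
reported here.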



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10582) CombinedFieldQuery fails with distributed field statistics

2022-05-20 Thread Yannick Welsch (Jira)
Yannick Welsch created LUCENE-10582:
---

 Summary: CombinedFieldQuery fails with distributed field statistics
 Key: LUCENE-10582
 URL: https://issues.apache.org/jira/browse/LUCENE-10582
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/sandbox
Reporter: Yannick Welsch


CombinedFieldQuery does not properly combine distributed collection statistics, 
resulting in an IllegalArgumentException during searches.

Originally surfaced in this Elasticsearch issue: 
https://github.com/elastic/elasticsearch/issues/82817






[jira] [Created] (LUCENE-10474) Avoid throwing StackOverflowError when creating RegExp

2022-03-18 Thread Yannick Welsch (Jira)
Yannick Welsch created LUCENE-10474:
---

 Summary: Avoid throwing StackOverflowError when creating RegExp
 Key: LUCENE-10474
 URL: https://issues.apache.org/jira/browse/LUCENE-10474
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Reporter: Yannick Welsch


Creating a regular expression using Lucene's RegExp class can easily result in 
a StackOverflowError being thrown, for example when the input is larger than 
the maximum stack depth. Throwing a StackOverflowError isn't something a user 
would expect, and it isn't documented either. StackOverflowError is a 
user-unfriendly exception as it does not convey any intent that the user has 
done something wrong, but suggests a bug in the implementation.

I would like Lucene to follow the [approach taken by the 
JDK|https://github.com/openjdk/jdk/blob/cab4ff64541393a974ea91e35167668ef0036804/src/java.base/share/classes/java/util/regex/Pattern.java#L1441]
 and throw an IllegalArgumentException instead to clearly mark this as an input 
that the implementation can't handle.
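
A minimal sketch of that pattern, assuming a recursive parser: the entry point 
catches the StackOverflowError and rethrows it as an IllegalArgumentException. 
The class and method names below are hypothetical and this is not RegExp's 
actual parsing code; the toy "parser" merely counts leading '(' characters 
recursively.

```java
/** Hypothetical recursive parser that reports overly deep input as an IllegalArgumentException. */
final class SafeRecursiveParse {
    // One stack frame per leading '(' character, so very long input exhausts the stack.
    private static int nesting(String input, int pos) {
        if (pos < input.length() && input.charAt(pos) == '(') {
            return 1 + nesting(input, pos + 1);
        }
        return 0;
    }

    /** Entry point: converts the JVM error into an exception that blames the input. */
    static int parse(String input) {
        try {
            return nesting(input, 0);
        } catch (StackOverflowError e) {
            throw new IllegalArgumentException(
                "input is too complex to parse (nesting too deep)", e);
        }
    }
}
```

Callers that already handle IllegalArgumentException for malformed input would 
then handle this case as well, without having to catch an Error.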






[jira] [Created] (LUCENE-10235) LRUQueryCache should not count never-cacheable queries as a miss

2021-11-15 Thread Yannick Welsch (Jira)
Yannick Welsch created LUCENE-10235:
---

 Summary: LRUQueryCache should not count never-cacheable queries as 
a miss
 Key: LUCENE-10235
 URL: https://issues.apache.org/jira/browse/LUCENE-10235
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Yannick Welsch


Hit and miss counts of a cache are typically used to check how effective a 
caching layer is. While looking at a system that exhibited a very high 
miss-to-hit ratio, I took a closer look at Lucene's LRUQueryCache and noticed 
that it counts as a miss even those queries that it would never consider 
caching in the first place (e.g. TermQuery and the others listed in 
UsageTrackingQueryCachingPolicy.shouldNeverCache).

The reason these are counted as a miss is that LRUQueryCache (scorerSupplier 
and bulkScorer methods) first does a lookup on the cache, incrementing hit or 
miss counters, and upon miss, only then checks QueryCachingPolicy.shouldCache 
to decide whether that query should be put into the cache.
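
The ordering described above can be modelled roughly as follows. This is a 
simplified stand-alone sketch, not LRUQueryCache's code: queries are plain 
strings, the policy is a Predicate, and the class name is made up for 
illustration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Simplified model of "look up first, consult the policy only after a miss". */
final class LookupFirstAccounting {
    private final Set<String> cached = new HashSet<>();
    private final Predicate<String> shouldCache; // stand-in for the caching policy
    long hits;
    long misses;

    LookupFirstAccounting(Predicate<String> shouldCache) {
        this.shouldCache = shouldCache;
    }

    void onQuery(String query) {
        if (cached.contains(query)) {
            hits++;
            return;
        }
        misses++; // counted before the policy has been asked at all
        if (shouldCache.test(query)) {
            cached.add(query);
        }
    }
}
```

With a policy that never caches a given query, every execution of that query 
adds to the miss count, which is the effect described in this issue.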

This issue is made more complex by the fact that QueryCachingPolicy.shouldCache 
is a stateful method, and cacheability of a query can change over time (e.g. 
after appearing N times).

I'm opening this issue to discuss whether others also feel that the current way 
of accounting misses is unintuitive / confusing. I would also like to put 
forward a proposal to:
 * generalize the boolean QueryCachingPolicy.shouldCache method to return an 
enum instead (one of YES, NOT_RIGHT_NOW, NEVER), and only account queries that 
are (eventually) cacheable and not in the cache as a miss,
 * optionally introduce another metric for queries that are never cacheable, 
e.g. "ignored", and
 * optionally refine miss count into a count for items that are cacheable right 
away, and those that will eventually be cacheable.
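
To make the proposal more concrete, here is a rough stand-alone sketch of the 
accounting. Only the enum values YES, NOT_RIGHT_NOW and NEVER and the 
"ignored" metric come from the proposal above; the type names, the 
string-based queries and the "cacheable from the second sighting" policy are 
assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch of three-valued cacheability with a separate "ignored" counter. */
final class ThreeStateAccounting {
    enum Cacheability { YES, NOT_RIGHT_NOW, NEVER }

    interface Policy {
        Cacheability shouldCache(String query);
    }

    /** Hypothetical policy: "term:" queries are never cached, others from the 2nd sighting on. */
    static final class SecondSightingPolicy implements Policy {
        private final Map<String, Integer> seen = new HashMap<>();

        public Cacheability shouldCache(String query) {
            if (query.startsWith("term:")) {
                return Cacheability.NEVER;
            }
            int count = seen.merge(query, 1, Integer::sum);
            return count >= 2 ? Cacheability.YES : Cacheability.NOT_RIGHT_NOW;
        }
    }

    private final Set<String> cached = new HashSet<>();
    private final Policy policy;
    long hits;
    long misses;
    long ignored;

    ThreeStateAccounting(Policy policy) {
        this.policy = policy;
    }

    void onQuery(String query) {
        if (cached.contains(query)) {
            hits++;
            return;
        }
        Cacheability cacheability = policy.shouldCache(query);
        if (cacheability == Cacheability.NEVER) {
            ignored++; // never cacheable: neither a hit nor a miss
            return;
        }
        misses++; // cacheable now or eventually, but not in the cache yet
        if (cacheability == Cacheability.YES) {
            cached.add(query);
        }
    }
}
```

Because the policy is stateful, this sketch consults it on every lookup that 
is not a hit; whether that fits the real call pattern is part of what this 
issue is meant to discuss.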






[jira] [Commented] (LUCENE-9264) Remove SimpleFSDirectory in favor of NIOFsDirectory

2020-03-06 Thread Yannick Welsch (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053233#comment-17053233
 ] 

Yannick Welsch commented on LUCENE-9264:


I've opened a pull request for the removal (linked in this issue) and one for 
the deprecation (see sub-task).

> Remove SimpleFSDirectory in favor of NIOFsDirectory
> ---
>
> Key: LUCENE-9264
> URL: https://issues.apache.org/jira/browse/LUCENE-9264
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Yannick Welsch
>Priority: Minor
> Fix For: master (9.0)
>
>  Time Spent: 10m
>  Remaining Estimate: 0h






[jira] [Created] (LUCENE-9265) Deprecate SimpleFSDirectory

2020-03-06 Thread Yannick Welsch (Jira)
Yannick Welsch created LUCENE-9265:
--

 Summary: Deprecate SimpleFSDirectory
 Key: LUCENE-9265
 URL: https://issues.apache.org/jira/browse/LUCENE-9265
 Project: Lucene - Core
  Issue Type: Sub-task
Reporter: Yannick Welsch









[jira] [Updated] (LUCENE-9264) Remove SimpleFSDirectory in favor of NIOFsDirectory

2020-03-06 Thread Yannick Welsch (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yannick Welsch updated LUCENE-9264:
---
Fix Version/s: master (9.0)







[jira] [Created] (LUCENE-9264) Remove SimpleFSDirectory in favor of NIOFsDirectory

2020-03-05 Thread Yannick Welsch (Jira)
Yannick Welsch created LUCENE-9264:
--

 Summary: Remove SimpleFSDirectory in favor of NIOFsDirectory
 Key: LUCENE-9264
 URL: https://issues.apache.org/jira/browse/LUCENE-9264
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Yannick Welsch


{{SimpleFSDirectory}} appears to duplicate what {{NIOFSDirectory}} already 
offers. The only difference is that {{SimpleFSDirectory}} uses non-positional 
reads on the {{FileChannel}} (i.e., stateful reads that change the current 
position), so it has to externally synchronize access to the read method.

On Windows, positional reads are not supported. {{FileChannel}} therefore 
already synchronizes internally so that only one thread at a time performs a 
positional read (see {{read(ByteBuffer dst, long position)}} in 
{{FileChannelImpl}}, and {{FileDispatcher.needsPositionLock}}, which returns 
true on Windows), and the JDK implementation for Windows emulates positional 
reads with non-positional ones, see 
[http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/windows/native/sun/nio/ch/FileDispatcherImpl.c#l139].

This means that on Windows there should be no performance difference between 
{{NIOFSDirectory}} and {{SimpleFSDirectory}} (both should be equally poor, as 
each implementation allows only one thread at a time to read). On Linux/Mac, 
however, {{NIOFSDirectory}} is superior to {{SimpleFSDirectory}}, because 
positional reads (pread) can be done concurrently.

My proposal is to remove {{SimpleFSDirectory}} and replace its uses with 
{{NIOFSDirectory}}, given how similar these two directory implementations are 
({{SimpleFSDirectory}} isn't really simpler).
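
For reference, the two read styles discussed above look roughly like this with 
the plain JDK {{FileChannel}} API. This is an illustrative sketch, not code 
from either directory implementation; the class name, the helper methods and 
the temp-file demo are made up for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Contrasts stateful (non-positional) and positional reads on a FileChannel. */
final class ReadStyles {
    /** Non-positional read: moves the channel's shared position, so access must be serialized. */
    static String readStateful(FileChannel channel, long offset, int length) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(length);
        synchronized (channel) {
            channel.position(offset);
            while (buffer.hasRemaining() && channel.read(buffer) != -1) {
                // keep reading until the buffer is full or end of file is reached
            }
        }
        return new String(buffer.array(), 0, buffer.position(), StandardCharsets.US_ASCII);
    }

    /** Positional read: the offset is passed explicitly and the channel position is untouched. */
    static String readPositional(FileChannel channel, long offset, int length) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(length);
        long position = offset;
        while (buffer.hasRemaining()) {
            int read = channel.read(buffer, position);
            if (read == -1) {
                break;
            }
            position += read;
        }
        return new String(buffer.array(), 0, buffer.position(), StandardCharsets.US_ASCII);
    }

    /** Returns {stateful result, position after it, positional result, position after it}. */
    static String[] demo() {
        try {
            Path file = Files.createTempFile("read-styles", ".bin");
            try {
                Files.write(file, "0123456789".getBytes(StandardCharsets.US_ASCII));
                try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                    String stateful = readStateful(channel, 2, 3);
                    long afterStateful = channel.position();
                    String positional = readPositional(channel, 7, 3);
                    long afterPositional = channel.position();
                    return new String[] {
                        stateful, Long.toString(afterStateful),
                        positional, Long.toString(afterPositional)
                    };
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The positional variant needs no external lock because no shared position is 
modified, which is what makes concurrent reads possible where the platform 
supports them.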


