Re: [External] Re: How to highlight fields that are not stored?
Hi Michael. Thanks for the reply.

As I said in the opening statement, I need to move away from reading a file into memory before indexing it. The use case here is files 2+ GB in size. I thought streaming the file to be indexed was the only alternative to reading the full file into RAM and then indexing. I would be happy to be directed to another way to get 2+ GB files indexed.

> highlighting requires
> the document in its uninverted form. Otherwise what text would you
> highlight?

Highlighting the, possibly changed, terms from the index is my goal if I can't store the entire document due to RAM size constraints. Not having the original file text in the highlight isn't ideal, but it is better than not being able to highlight text in large documents.

David Shifflett

On 2/16/23, 4:01 PM, "Michael Sokolov" <msoko...@gmail.com> wrote:
> Sorry, your problem statement makes no sense: you should be able to
> store field data in the index without loading all your documents into
> RAM while indexing. Maybe there is some constraint you are not telling
> us about? Or you may be confused. In any case highlighting requires
> the document in its uninverted form. Otherwise what text would you
> highlight?
>
> On Mon, Feb 13, 2023 at 3:46 PM Shifflett, David [USA] <shifflett_da...@bah.com.invalid> wrote:
> >
> > Hi,
> > I am converting my application from
> > reading documents into memory, then indexing the documents
> > to streaming the documents to be indexed.
> >
> > I quickly found out this required that the field NOT be stored.
> > I then quickly found out that my highlighting code requires the field to be stored.
> >
> > I’ve been searching for an existing highlighter that doesn’t require the field to be stored,
> > and thought I’d found one in the FastVectorHighlighter,
> > but tests revealed this highlighter also requires the field to be stored,
> > though this requirement isn’t documented, or reflected in any returned exception.
> > I have been investigating using code like
> >
> >     Terms terms = reader.getTermVector(docID, fieldName);
> >     TermsEnum termsEnum = terms.iterator();
> >     BytesRef bytesRef = termsEnum.next();
> >     PostingsEnum pe = termsEnum.postings(null, PostingsEnum.OFFSETS);
> >
> > While this gives me the terms from the document, and the positions,
> > iterating over this, and matching to the queries I’m running,
> > seems cumbersome, and inefficient.
> >
> > Any suggestions for highlighting query matches without the searched field being stored?
> >
> > Thanks,
> > David Shifflett
> > Senior Lead Technologist
> > Enterprise Cross Domain Solutions (ECDS)
> > Booz Allen Hamilton
How to highlight fields that are not stored?
Hi,
I am converting my application from reading documents into memory, then indexing the documents, to streaming the documents to be indexed.

I quickly found out this required that the field NOT be stored. I then quickly found out that my highlighting code requires the field to be stored.

I’ve been searching for an existing highlighter that doesn’t require the field to be stored, and thought I’d found one in the FastVectorHighlighter, but tests revealed this highlighter also requires the field to be stored, though this requirement isn’t documented, or reflected in any returned exception.

I have been investigating using code like

    Terms terms = reader.getTermVector(docID, fieldName);
    TermsEnum termsEnum = terms.iterator();
    BytesRef bytesRef = termsEnum.next();
    PostingsEnum pe = termsEnum.postings(null, PostingsEnum.OFFSETS);

While this gives me the terms from the document, and the positions, iterating over this, and matching to the queries I’m running, seems cumbersome and inefficient.

Any suggestions for highlighting query matches without the searched field being stored?

Thanks,
David Shifflett
Senior Lead Technologist
Enterprise Cross Domain Solutions (ECDS)
Booz Allen Hamilton
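The matching step described above — walking the term vector's terms and intersecting them with the query's terms by offset — can be sketched in plain Java. The term-to-offsets map below stands in for what `getTermVector`/`PostingsEnum.OFFSETS` would yield; all names here are illustrative helpers, not Lucene APIs:

```java
import java.util.*;

public class OffsetMatcher {
    /** One matched region of the original text, by character offsets. */
    public record Span(int start, int end) {}

    /**
     * Given one document's term vector summarised as term -> offsets, and the
     * analyzed terms of a query, return the offsets a highlighter would mark,
     * sorted by start offset.
     */
    public static List<Span> matchOffsets(Map<String, List<Span>> docTerms,
                                          Set<String> queryTerms) {
        List<Span> hits = new ArrayList<>();
        for (String t : queryTerms) {
            hits.addAll(docTerms.getOrDefault(t, List.of()));
        }
        hits.sort(Comparator.comparingInt(Span::start));
        return hits;
    }

    public static void main(String[] args) {
        // Offsets as PostingsEnum.startOffset()/endOffset() would report them
        // for the text "the quick brown fox" (illustrative data, not a real index).
        Map<String, List<Span>> doc = Map.of(
                "quick", List.of(new Span(4, 9)),
                "brown", List.of(new Span(10, 15)),
                "fox", List.of(new Span(16, 19)));
        System.out.println(matchOffsets(doc, Set.of("fox", "quick")));
    }
}
```

Without the stored text, the spans can only be rendered as the (possibly analyzed) index terms themselves, which matches the trade-off described in the reply above.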
Re: [External] Streaming documents into the index breaks highlighting
Just to clarify:
Is there a highlighting option that doesn't require the text from the matched document?

David Shifflett

On 11/17/22, 1:57 PM, "Shifflett, David [USA]" wrote:
> Hi,
> I am converting my application from reading documents into memory, then
> indexing the documents, to streaming the documents to be indexed.
>
> I quickly found out this required that the field NOT be stored.
> I then quickly found out that my highlighting code requires the field to be stored.
>
> I’ve been searching for an existing highlighter that doesn’t require the field to be stored,
> and thought I’d found one in the FastVectorHighlighter,
> but tests revealed this highlighter also requires the field to be stored,
> though this requirement isn’t documented, or reflected in any returned exception.
>
> Any suggestions for highlighting query matches without the searched field being stored?
> I was hoping storing the offsets and positions would be enough to enable highlighting.
>
> David Shifflett
Streaming documents into the index breaks highlighting
Hi,
I am converting my application from reading documents into memory, then indexing the documents, to streaming the documents to be indexed.

I quickly found out this required that the field NOT be stored. I then quickly found out that my highlighting code requires the field to be stored.

I’ve been searching for an existing highlighter that doesn’t require the field to be stored, and thought I’d found one in the FastVectorHighlighter, but tests revealed this highlighter also requires the field to be stored, though this requirement isn’t documented, or reflected in any returned exception.

Any suggestions for highlighting query matches without the searched field being stored?
I was hoping storing the offsets and positions would be enough to enable highlighting.

David Shifflett
Senior Lead Technologist
Enterprise Cross Domain Solutions (ECDS)
Booz Allen Hamilton
M: 831-920-8341
Migrating WhitespaceTokenizerFactory from 8.2 to 9.4
I am migrating my project’s usage of Lucene from 8.2 to 9.4. The migration documentation has been very helpful, but doesn’t help me resolve this exception:

    Caused by: java.lang.IllegalArgumentException: A SPI class of type
    org.apache.lucene.analysis.TokenizerFactory with name 'whitespace' does not exist.
    You need to add the corresponding JAR file supporting this SPI to your classpath.
    The current classpath supports the following names: [standard]

My project includes the lucene-analysis-common JAR, and my JAR includes org/apache/lucene/analysis/core/WhitespaceTokenizerFactory.class.

I am not familiar with how Java SPI is configured and built. I tried creating META-INF/services/org.apache.lucene.analysis.TokenizerFactory containing:

    org.apache.lucene.analysis.core.WhitespaceTokenizerFactory

What am I missing? Any help would be appreciated.

Thanks,
David Shifflett
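Lucene discovers analysis factories through the standard Java SPI mechanism, so the service file described above depends on the META-INF/services entries surviving into the final artifact. If the project builds a single shaded/uber JAR (an assumption about the build setup), a common cause of "only [standard] is available" is that the service file from lucene-core overwrites the identically-named one from lucene-analysis-common. With the Maven Shade plugin the entries can be merged instead:

```xml
<!-- Merge META-INF/services/* entries from all dependency JARs instead of
     keeping only the first copy encountered: -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <transformers>
      <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
    </transformers>
  </configuration>
</plugin>
```

As a quick check, listing the final JAR's META-INF/services/org.apache.lucene.analysis.TokenizerFactory should show entries for both the standard and whitespace factories once the merge is in place.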
Re: [External] Re: Can lucene be used in Android ?
Hi Uwe,
I am a little confused by your 2 statements.

> Lucene 9.x series requires JDK 11 to run
> The main branch is already on JDK 17

Will Lucene 9.x run on JDK 17? Is 9.x 'the main branch'?

Thanks,
David Shifflett
Senior Lead Technologist
Enterprise Cross Domain Solutions (ECDS)
Booz Allen Hamilton

On 9/10/22, 5:30 AM, "Uwe Schindler" wrote:
> Hi Jie,
>
> actually the Lucene 9.x series requires JDK 11 to run; previous versions
> also work with Java 8. The main branch is already on JDK 17. From my
> knowledge, you may only use Lucene versions up to 8 to have at least a
> chance to run it. But with older Android versions you may even need to go
> back to Lucene builds targeting JDK 7 (Lucene 5?, don't know).
>
> But this is only half of the story: Lucene actually uses many modern JDK
> and JVM features that are partly not implemented in Dalvik. It uses
> MethodHandles instead of reflection, and the Java 8+ versions use lambdas,
> which were not compatible with older Android SDKs.
>
> So in short: use an older version and hope, but we offer no support and
> are not keen to apply changes to Lucene so it can be used with Android at
> all - because Android is not really compatible with any Java spec, like
> the API or memory model.
>
> Uwe
>
> Am 09.09.2022 um 09:10 schrieb Jie Wang:
> > Hey,
> >
> > Recently, I am trying to compile Lucene to get a jar that can be used in Android, but failed.
> >
> > Is there an official version that supports the use of Lucene on Android?
> >
> > Thanks!
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
I am getting an exception in ComplexPhraseQueryParser when fuzzy searching
I am using Lucene 8.2, but have also verified this on 8.9 and 8.10.1.

My query string is either "by~1 word~1", or "ky~1 word~1". I am looking for a phrase of these 2 words, with a potential 1 character misspelling, or fuzziness. I realize that 'by' is usually a stop word; that is why I also tested with 'ky'.

My simplified test content is either "AC-2.b word", "AC-2.k word", or "AC-2.y word". The first part of the test content is pulled from actual data my customers are trying to search.

For the query with 'by~1' the exception occurs if the content has '.b' or '.y', but not '.k'.
For the query with 'ky~1' the exception occurs if the content has '.k' or '.y', but not '.b'.

Here is the test code:

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.core.*;
    import org.apache.lucene.analysis.standard.*;
    import org.apache.lucene.analysis.tokenattributes.*;
    import org.apache.lucene.analysis.util.*;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexOptions;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;

    public class phraseTest {
        public static Analyzer analyzer = new StandardAnalyzer();
        public static IndexWriterConfig config = new IndexWriterConfig(analyzer);
        public static RAMDirectory ramDirectory = new RAMDirectory();
        public static IndexWriter indexWriter;
        public static Query queryToSearch = null;
        public static IndexReader idxReader;
        public static IndexSearcher idxSearcher;
        public static TopDocs hits;

        public static String query_field = "Content";

        // Pick only one content string
        // public static String content = "AC-2.b word";
        public static String content = "AC-2.k word";
        // public static String content = "AC-2.y word";

        // Pick only one query string
        // public static String queryString = "\"by~1 word~1\"";
        public static String queryString = "\"ky~1 word~1\"";

        @SuppressWarnings("deprecation")
        public static void main(String[] args) throws IOException {
            System.out.println("Content is\n " + content);
            System.out.println("Query field is " + query_field);
            System.out.println("Query String is '" + queryString + "'");

            Document doc = new Document(); // create a new document

            /**
             * Create a field with term vector enabled
             */
            FieldType type = new FieldType();
            type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
            type.setStored(true);
            type.setStoreTermVectors(true);
            type.setTokenized(true);
            type.setStoreTermVectorOffsets(true); // term vector enabled
            Field cField = new Field(query_field, content, type);
            doc.add(cField);

            try {
                indexWriter = new IndexWriter(ramDirectory, config);
                indexWriter.addDocument(doc);
                indexWriter.close();

                idxReader = DirectoryReader.open(ramDirectory);
                idxSearcher = new IndexSearcher(idxReader);
                ComplexPhraseQueryParser qp =
                    new ComplexPhraseQueryParser(query_field, analyzer);
                queryToSearch = qp.parse(queryString);

                // Here is where the searching, etc. starts
                hits = idxSearcher.search(queryToSearch, idxReader.maxDoc());
                System.out.println("scoreDoc size: " + hits.scoreDocs.length);
                // highlight the hits ...
            } catch (IOException e) {
                e.printStackTrace();
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }
    }

Here is the exception (using Lucene 8.2):

    Exception in thread "main" java.lang.IllegalArgumentException: Unknown query type "org.apache.lucene.search.ConstantScoreQuery" found in phrase query string "ky~1 word~1"
        at org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser$ComplexPhraseQuery.rewrite(ComplexPhraseQueryParser.java:325)
        at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:666)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:439)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:564)
        at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:416)
        at
Why would a search using a ComplexPhraseQueryParser throw an exception for some content, but not all content?
I am using Lucene 8.2, but have also verified this on 8.9.

My query string is either "by~1 word~1", or "ky~1 word~1". I am looking for a phrase of these 2 words, with a potential 1 character misspelling, or fuzziness. I realize that 'by' is usually a stop word; that is why I also tested with 'ky'.

My simplified test content is either "AC-2.b word", "AC-2.k word", or "AC-2.y word". The first part of the test content is pulled from actual data my customers are trying to search.

For the query with 'by~1' the exception occurs if the content has '.b' or '.y', but not '.k'.
For the query with 'ky~1' the exception occurs if the content has '.k' or '.y', but not '.b'.

Here is the test code:

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.core.*;
    import org.apache.lucene.analysis.standard.*;
    import org.apache.lucene.analysis.tokenattributes.*;
    import org.apache.lucene.analysis.util.*;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexOptions;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;

    public class phraseTest {
        public static Analyzer analyzer = new StandardAnalyzer();
        public static IndexWriterConfig config = new IndexWriterConfig(analyzer);
        public static RAMDirectory ramDirectory = new RAMDirectory();
        public static IndexWriter indexWriter;
        public static Query queryToSearch = null;
        public static IndexReader idxReader;
        public static IndexSearcher idxSearcher;
        public static TopDocs hits;

        public static String query_field = "Content";

        // Pick only one content string
        // public static String content = "AC-2.b word";
        public static String content = "AC-2.k word";
        // public static String content = "AC-2.y word";

        // Pick only one query string
        // public static String queryString = "\"by~1 word~1\"";
        public static String queryString = "\"ky~1 word~1\"";

        @SuppressWarnings("deprecation")
        public static void main(String[] args) throws IOException {
            System.out.println("Content is\n " + content);
            System.out.println("Query field is " + query_field);
            System.out.println("Query String is '" + queryString + "'");

            Document doc = new Document(); // create a new document

            /**
             * Create a field with term vector enabled
             */
            FieldType type = new FieldType();
            type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
            type.setStored(true);
            type.setStoreTermVectors(true);
            type.setTokenized(true);
            type.setStoreTermVectorOffsets(true); // term vector enabled
            Field cField = new Field(query_field, content, type);
            doc.add(cField);

            try {
                indexWriter = new IndexWriter(ramDirectory, config);
                indexWriter.addDocument(doc);
                indexWriter.close();

                idxReader = DirectoryReader.open(ramDirectory);
                idxSearcher = new IndexSearcher(idxReader);
                ComplexPhraseQueryParser qp =
                    new ComplexPhraseQueryParser(query_field, analyzer);
                queryToSearch = qp.parse(queryString);

                // Here is where the searching, etc. starts
                hits = idxSearcher.search(queryToSearch, idxReader.maxDoc());
                System.out.println("scoreDoc size: " + hits.scoreDocs.length);
                // highlight the hits ...
            } catch (IOException e) {
                e.printStackTrace();
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }
    }

Here is the exception (using Lucene 8.2):

    Exception in thread "main" java.lang.IllegalArgumentException: Unknown query type "org.apache.lucene.search.ConstantScoreQuery" found in phrase query string "ky~1 word~1"
        at org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser$ComplexPhraseQuery.rewrite(ComplexPhraseQueryParser.java:325)
        at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:666)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:439)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:564)
        at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:416)
        at
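A workaround often suggested for this class of rewrite failure is to force the fuzzy terms to rewrite into a plain scoring BooleanQuery, a query type that ComplexPhraseQuery.rewrite does know how to unpack, instead of the default blended/constant-score form. This is a sketch under assumptions, not a confirmed fix: it assumes Lucene 8.x, where MultiTermQuery.setRewriteMethod still exists (newer 9.x releases move the rewrite method into constructor arguments), and it simplifies the float-to-edits conversion, which is adequate for the ~1 syntax used here:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.Query;

// Subclass the parser so fuzzy terms inside phrases rewrite to a
// scoring BooleanQuery rather than a ConstantScoreQuery.
public class BooleanRewriteComplexPhraseQueryParser extends ComplexPhraseQueryParser {
    public BooleanRewriteComplexPhraseQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
    }

    @Override
    protected Query newFuzzyQuery(Term term, float minimumSimilarity, int prefixLength) {
        // Treats "~N" as N edits directly; fine for "~1" as in the report above.
        FuzzyQuery fq = new FuzzyQuery(term, (int) minimumSimilarity, prefixLength);
        fq.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_REWRITE);
        return fq;
    }
}
```

Whether this avoids the exception for the exact content above would need to be verified against the reporter's test case; the intent is only to illustrate where the ConstantScoreQuery enters the picture.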
Re: [External] Re: ComplexPhraseQueryParser isn't switching search terms to lowercase with StandardAnalyzer
I saw the changes in the diff. But without looking into the test, I am asking to confirm if it matches my conditions:
1) Uses a StandardAnalyzer
2) Does the actual query.toString() return lowercase J and S

David Shifflett

On 10/22/19, 10:44 AM, "Mikhail Khludnev" wrote:
> On Tue, Oct 22, 2019 at 5:26 PM Shifflett, David [USA] <shifflett_da...@bah.com> wrote:
> > Mikhail,
> >
> > Thanks for running those tests.
> > I haven’t looked into the test, but can you confirm it uses an analyzer
> > with the lowercase filter?
>
> Look at this diff. It's a diff on the test, not a test:
>
>     -    checkMatches("\"john smith\"", "1"); // Simple multi-term still works
>     -    checkMatches("\"j* smyth~\"", "1,2"); // wildcards and fuzzies are OK in
>     +    checkMatches("\"John Smith\"", "1"); // Simple multi-term still works
>     +    checkMatches("\"J* Smyth~\"", "1,2"); // wildcards and fuzzies are OK in
>
> Here I flip to capital letters, and it still matches what it matched before in lower case.
>
> > Also can you confirm whether the actual query being used contains upper
> > or lower case J and S (in your John Smith case)?
> >
> > Apologies on the 'content:foo'.
> > I changed the code snippet to "somefield", and missed changing that part
> > of the output
> >
> > David Shifflett
> >
> > On 10/22/19, 5:51 AM, "Mikhail Khludnev" wrote:
> >
> >     Hello,
> >     I wonder how it came up with this particular field: content:foo
> >     Anyway, I added some uppercase in the test and it passed despite of it
> >
> >     diff --git a/lucene/queryparser/src/test/org/apache/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java b/lucene/queryparser/src/test/org/apache/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java
> >     index 5935da9..9baa492 100644
> >     --- a/lucene/queryparser/src/test/org/apache/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java
> >     +++ b/lucene/queryparser/src/test/org/apache/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java
> >     @@ -55,8 +55,8 @@
> >        boolean inOrder = true;
> >
> >        public void testComplexPhrases() throws Exception {
> >     -    checkMatches("\"john smith\"", "1"); // Simple multi-term still works
> >     -    checkMatches("\"j* smyth~\"", "1,2"); // wildcards and fuzzies are OK in
> >     +    checkMatches("\"John Smith\"", "1"); // Simple multi-term still works
> >     +    checkMatches("\"J* Smyth~\"", "1,2"); // wildcards and fuzzies are OK in
> >          // phrases
> >          checkMatches("\"(jo* -john) smith\"", "2"); // boolean logic works
> >          checkMatches("\"jo* smith\"~2", "1,2,3"); // position logic works.
> >     @@ -161,11 +161,11 @@
> >          checkMatches("name:\"j* smyth~\"", "1,2");
> >          checkMatches("role:\"developer\"", "1,2");
> >          checkMatches("role:\"p* manager\"", "4");
> >     -    checkMatches("role:de*", "1,2,3");
> >     +    checkMatches("role:De*", "1,2,3");
> >          checkMatches("name:\"j* smyth~\"~5", "1,2,3");
> >          checkMatches("role:\"p* manager\" AND name:jack*", "4");
> >          checkMatches("+role:developer +name:jack*", "");
> >     -    checkMatches("name:\"john smith\"~2 AND role:designer AND id:3", "3");
> >     +    checkMatches("name:\"john smith\"~2 AND role:Designer AND id:3", "3");
> >        }
> >
> >        public void testToStringContainsSlop() throws Exception {
> >
> >     Problem seems way odd (assuming CPQP does analysis); it seems like
> >     debugging is the last resort in this particular case.
> >
> >     On Mon, Oct 21, 2019 at 8:31 PM Shifflett, David [USA] <shifflett_da...@bah.com> wrote:
> >
> >     > Hi all,
> >     > Using the code snippet:
> >     >     ComplexPhraseQueryParser qp = new ComplexPhraseQueryParser(“somefield”, new Stand
Re: [External] Re: ComplexPhraseQueryParser isn't switching search terms to lowercase with StandardAnalyzer
Mikhail,

Thanks for running those tests.
I haven’t looked into the test, but can you confirm it uses an analyzer with the lowercase filter?
Also can you confirm whether the actual query being used contains upper or lower case J and S (in your John Smith case)?

Apologies on the 'content:foo'. I changed the code snippet to "somefield", and missed changing that part of the output.

David Shifflett

On 10/22/19, 5:51 AM, "Mikhail Khludnev" wrote:
> Hello,
> I wonder how it came up with this particular field: content:foo
> Anyway, I added some uppercase in the test and it passed despite of it
>
>     diff --git a/lucene/queryparser/src/test/org/apache/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java b/lucene/queryparser/src/test/org/apache/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java
>     index 5935da9..9baa492 100644
>     --- a/lucene/queryparser/src/test/org/apache/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java
>     +++ b/lucene/queryparser/src/test/org/apache/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java
>     @@ -55,8 +55,8 @@
>        boolean inOrder = true;
>
>        public void testComplexPhrases() throws Exception {
>     -    checkMatches("\"john smith\"", "1"); // Simple multi-term still works
>     -    checkMatches("\"j* smyth~\"", "1,2"); // wildcards and fuzzies are OK in
>     +    checkMatches("\"John Smith\"", "1"); // Simple multi-term still works
>     +    checkMatches("\"J* Smyth~\"", "1,2"); // wildcards and fuzzies are OK in
>          // phrases
>          checkMatches("\"(jo* -john) smith\"", "2"); // boolean logic works
>          checkMatches("\"jo* smith\"~2", "1,2,3"); // position logic works.
>     @@ -161,11 +161,11 @@
>          checkMatches("name:\"j* smyth~\"", "1,2");
>          checkMatches("role:\"developer\"", "1,2");
>          checkMatches("role:\"p* manager\"", "4");
>     -    checkMatches("role:de*", "1,2,3");
>     +    checkMatches("role:De*", "1,2,3");
>          checkMatches("name:\"j* smyth~\"~5", "1,2,3");
>          checkMatches("role:\"p* manager\" AND name:jack*", "4");
>          checkMatches("+role:developer +name:jack*", "");
>     -    checkMatches("name:\"john smith\"~2 AND role:designer AND id:3", "3");
>     +    checkMatches("name:\"john smith\"~2 AND role:Designer AND id:3", "3");
>        }
>
>        public void testToStringContainsSlop() throws Exception {
>
> Problem seems way odd (assuming CPQP does analysis); it seems like
> debugging is the last resort in this particular case.
>
> On Mon, Oct 21, 2019 at 8:31 PM Shifflett, David [USA] <shifflett_da...@bah.com> wrote:
> > Hi all,
> > Using the code snippet:
> >     ComplexPhraseQueryParser qp = new ComplexPhraseQueryParser(“somefield”, new StandardAnalyzer());
> >     String teststr = "\"Foo Bar\"~2";
> >     Query queryToSearch = qp.parse(teststr);
> >     System.out.println("Query : " + queryToSearch.toString());
> >     System.out.println("Type of query : " + queryToSearch.getClass().getSimpleName());
> >
> > I am getting the output
> >     Query : "Foo Bar"~2
> >     Type of query : ComplexPhraseQuery
> >
> > If I change teststr to "\"Foo Bar\"" I get
> >     Query : "Foo Bar"
> >     Type of query : ComplexPhraseQuery
> >
> > If I change teststr to "Foo Bar" I get
> >     Query : content:foo content:bar
> >     Type of query : BooleanQuery
> >
> > In the first two cases I was expecting the search terms to be switched to lowercase.
> >
> > Were the Foo and Bar left as originally specified because the terms are inside double quotes?
> >
> > How can I specify a search term that I want treated as a Phrase,
> > but also have the query parser apply the LowerCaseFilter?
> >
> > I am hoping to avoid the need to handle this using PhraseQuery,
> > and continue to use the QueryParser.
> >
> > Thanks in advance for any help you can give me,
> > David Shifflett
>
> --
> Sincerely yours
> Mikhail Khludnev
Re: [External] Re: ComplexPhraseQueryParser isn't switching search terms to lowercase with StandardAnalyzer
Baris,
Sorry, I neglected to add that piece.
This test was run against 8.0.0, but I also want it to work in later versions. Another piece of my project is using 8.2.0.

Thanks again for any info,
David Shifflett

On 10/21/19, 3:23 PM, "baris.ka...@oracle.com" wrote:
> David, which version of Lucene are you using?
> Best regards
>
> On 10/21/19 1:31 PM, Shifflett, David [USA] wrote:
> > Hi all,
> > Using the code snippet:
> >     ComplexPhraseQueryParser qp = new ComplexPhraseQueryParser(“somefield”, new StandardAnalyzer());
> >     String teststr = "\"Foo Bar\"~2";
> >     Query queryToSearch = qp.parse(teststr);
> >     System.out.println("Query : " + queryToSearch.toString());
> >     System.out.println("Type of query : " + queryToSearch.getClass().getSimpleName());
> >
> > I am getting the output
> >     Query : "Foo Bar"~2
> >     Type of query : ComplexPhraseQuery
> >
> > If I change teststr to "\"Foo Bar\"" I get
> >     Query : "Foo Bar"
> >     Type of query : ComplexPhraseQuery
> >
> > If I change teststr to "Foo Bar" I get
> >     Query : content:foo content:bar
> >     Type of query : BooleanQuery
> >
> > In the first two cases I was expecting the search terms to be switched to lowercase.
> >
> > Were the Foo and Bar left as originally specified because the terms are inside double quotes?
> >
> > How can I specify a search term that I want treated as a Phrase,
> > but also have the query parser apply the LowerCaseFilter?
> >
> > I am hoping to avoid the need to handle this using PhraseQuery,
> > and continue to use the QueryParser.
> >
> > Thanks in advance for any help you can give me,
> > David Shifflett
ComplexPhraseQueryParser isn't switching search terms to lowercase with StandardAnalyzer
Hi all,
Using the code snippet:

    ComplexPhraseQueryParser qp = new ComplexPhraseQueryParser(“somefield”, new StandardAnalyzer());
    String teststr = "\"Foo Bar\"~2";
    Query queryToSearch = qp.parse(teststr);
    System.out.println("Query : " + queryToSearch.toString());
    System.out.println("Type of query : " + queryToSearch.getClass().getSimpleName());

I am getting the output

    Query : "Foo Bar"~2
    Type of query : ComplexPhraseQuery

If I change teststr to "\"Foo Bar\"" I get

    Query : "Foo Bar"
    Type of query : ComplexPhraseQuery

If I change teststr to "Foo Bar" I get

    Query : content:foo content:bar
    Type of query : BooleanQuery

In the first two cases I was expecting the search terms to be switched to lowercase.

Were the Foo and Bar left as originally specified because the terms are inside double quotes?

How can I specify a search term that I want treated as a Phrase, but also have the query parser apply the LowerCaseFilter?

I am hoping to avoid the need to handle this using PhraseQuery, and continue to use the QueryParser.

Thanks in advance for any help you can give me,
David Shifflett
Re: [External] Re: How to ignore certain words based on query specifics
Evert,
It is definitely not a bug. I was asking about how to do something I couldn't quite figure out. Stop words is the way to go.

David Shifflett

On 7/11/19, 11:26 AM, "evert.wagenaar" wrote:

I see it as a feature, not a bug. The appearance of stop words in the search summary makes it clearer what the hit is about. Not sure, but I think Google does the same in search summaries.
-Evert

---- Original message ----
From: "Shifflett, David [USA]"
Date: 7/11/19 8:38 PM (GMT+08:00)
To: java-user@lucene.apache.org
Subject: Re: [External] Re: How to ignore certain words based on query specifics

I just tested this with the search.highlight.Highlighter class. Is this the 'old default highlighter'?
I phrased my question badly. Of course the stop words shouldn't be highlighted, as they wouldn't match any query.
My question was really: would the stop words be available for inclusion in the highlight context (surrounding a match)?
The answer is yes, the stop words do appear in the context, and are not highlighted.

Thanks,
David Shifflett

On 7/10/19, 9:12 PM, "Michael Sokolov" wrote:

I'm not as au courant with highlighters as I used to be. I think some of them work using postings, and for those, no, you wouldn't be able to highlight stop words. But maybe you can use the old default highlighter that would reanalyze the document from a stored field, using an Analyzer that doesn't remove stop words? Sorry, I'm not sure if that exists any more; maybe someone else will know.

On Tue, Jul 9, 2019, 10:17 AM Shifflett, David [USA] <shifflett_da...@bah.com> wrote:
> Michael,
> Thanks for your reply.
>
> You are correct, the desired effect is to not match 'freedom ...'.
> I hadn't considered the case where both free* and freedom match.
>
> My solution 'free* and not freedom' would NOT match either of your examples.
>
> I think what I really want is:
> get every matching term from a matching document,
> and if the term also matches an ignore word, then ignore the match.
>
> I hadn't considered the stopwords approach; I'll look into that.
> If I add all the ignore words as stop words, will that affect highlighting?
> Are the stopwords still available for highlighting?
>
> Thanks,
> David Shifflett
>
> On 7/9/19, 11:58 AM, "Michael Sokolov" wrote:
>
> I think what you're saying in your example is that "free*" should
> match anything with a term matching that pattern, but not *only*
> freedom. In other words, if a document has "freedom from stupidity"
> then it should not match, but if the document has "free freedom from
> stupidity" then it should.
>
> Is that correct?
>
> You could apply stopwords, except that it sounds as if this is a
> per-user blacklist, and you want them to share the same index?
>
> On Tue, Jul 9, 2019 at 11:29 AM Shifflett, David [USA] wrote:
> >
> > Sorry for the weird reply path, but I couldn’t find an easy reply method via the list archive.
> >
> > Anyway …
> >
> > The use case is as follows:
> > Allow the user to specify queries such as ‘free*’
> > and also include similar words to be ignored, such as freedom.
> > Another example would be ‘secret*’ and secretary.
> >
> > I want to keep the ignore words separate so they apply to all queries,
> > but then realized the ignore words should only apply to relevant (matching) queries.
> >
> > I don’t want the users to be required to add ‘and not WORD’ many times to each of the listed queries.
> >
> > David Shifflett
> >
> > From: Diego Ceccarelli
> >
> > Could you please describe the use case? Maybe there is an easier solution.
> >
> > From: "Shifflett, David [USA]"
> > Date: Tuesday, July 9, 2019 at 8:02 AM
> > To: "java-user@lucene.apache.org"
> > Subject: How to ignore certain words based on query specifics
> >
> > Hi all,
> > I have a configuration file that lists multiple queries, of all different types,
> > and that lists words to be ignored.
> >
> > Each of these lists is user configured, variable in length and content.
> >
> > I know that, in general, unless the ignore word is in
Re: [External] Re: How to ignore certain words based on query specifics
I just tested this with the search.highlight.Highlighter class. Is this the 'old default highlighter'?
I phrased my question badly. Of course the stop words shouldn't be highlighted, as they wouldn't match any query.
My question was really: would the stop words be available for inclusion in the highlight context (surrounding a match)?
The answer is yes, the stop words do appear in the context, and are not highlighted.

Thanks,
David Shifflett

On 7/10/19, 9:12 PM, "Michael Sokolov" wrote:

I'm not as au courant with highlighters as I used to be. I think some of them work using postings, and for those, no, you wouldn't be able to highlight stop words. But maybe you can use the old default highlighter that would reanalyze the document from a stored field, using an Analyzer that doesn't remove stop words? Sorry, I'm not sure if that exists any more; maybe someone else will know.

On Tue, Jul 9, 2019, 10:17 AM Shifflett, David [USA] <shifflett_da...@bah.com> wrote:
> Michael,
> Thanks for your reply.
>
> You are correct, the desired effect is to not match 'freedom ...'.
> I hadn't considered the case where both free* and freedom match.
>
> My solution 'free* and not freedom' would NOT match either of your examples.
>
> I think what I really want is:
> get every matching term from a matching document,
> and if the term also matches an ignore word, then ignore the match.
>
> I hadn't considered the stopwords approach; I'll look into that.
> If I add all the ignore words as stop words, will that affect highlighting?
> Are the stopwords still available for highlighting?
>
> Thanks,
> David Shifflett
>
> On 7/9/19, 11:58 AM, "Michael Sokolov" wrote:
>
> I think what you're saying in your example is that "free*" should
> match anything with a term matching that pattern, but not *only*
> freedom. In other words, if a document has "freedom from stupidity"
> then it should not match, but if the document has "free freedom from
> stupidity" then it should.
>
> Is that correct?
>
> You could apply stopwords, except that it sounds as if this is a
> per-user blacklist, and you want them to share the same index?
>
> On Tue, Jul 9, 2019 at 11:29 AM Shifflett, David [USA] wrote:
> >
> > Sorry for the weird reply path, but I couldn’t find an easy reply method via the list archive.
> >
> > Anyway …
> >
> > The use case is as follows:
> > Allow the user to specify queries such as ‘free*’
> > and also include similar words to be ignored, such as freedom.
> > Another example would be ‘secret*’ and secretary.
> >
> > I want to keep the ignore words separate so they apply to all queries,
> > but then realized the ignore words should only apply to relevant (matching) queries.
> >
> > I don’t want the users to be required to add ‘and not WORD’ many times to each of the listed queries.
> >
> > David Shifflett
> >
> > From: Diego Ceccarelli
> >
> > Could you please describe the use case? Maybe there is an easier solution.
> >
> > From: "Shifflett, David [USA]"
> > Date: Tuesday, July 9, 2019 at 8:02 AM
> > To: "java-user@lucene.apache.org"
> > Subject: How to ignore certain words based on query specifics
> >
> > Hi all,
> > I have a configuration file that lists multiple queries, of all different types,
> > and that lists words to be ignored.
> >
> > Each of these lists is user configured, variable in length and content.
> >
> > I know that, in general, unless the ignore word is in the query it won’t match,
> > but I need to be able to handle wildcard, fuzzy, and Regex queries which might match.
> >
> > What I need to be able to do is ignore the words in the ignore list,
> > but only when they match terms the query would match.
> >
> > For example: if the query is ‘free*’ and ‘freedom’ should be ignored,
> > I could modify the query to be ‘free*’ and not freedom.
> >
> > But if ‘liberty’ is also to be ignored, I don’t wa
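The behavior David observed (stop words appearing in the highlight context but never marked as matches) can be illustrated with a toy, Lucene-free sketch. The class name, tag format, whitespace tokenization, and stop-word set below are all illustrative assumptions, not the behavior of any actual Lucene highlighter:

```java
import java.util.Set;

public class ContextHighlighter {
    // Illustrative stop-word set; a real analyzer would supply this.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "of", "to");

    /** Wraps matched terms in <B>..</B>; stop words stay in the context unhighlighted. */
    public static String highlight(String text, Set<String> matchedTerms) {
        StringBuilder out = new StringBuilder();
        for (String token : text.split(" ")) {
            String norm = token.toLowerCase();
            if (matchedTerms.contains(norm) && !STOP_WORDS.contains(norm)) {
                out.append("<B>").append(token).append("</B>");
            } else {
                // Stop words and non-matches are kept verbatim as context.
                out.append(token);
            }
            out.append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(highlight("the freedom of speech", Set.of("freedom", "speech")));
        // the <B>freedom</B> of <B>speech</B>
    }
}
```

The point of the sketch is only that context and match-marking are separate concerns: removing a word from the matchable term set does not remove it from the surrounding text.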
Re: [External] Re: How to ignore certain words based on query specifics
Michael,
Thanks for your reply.

You are correct, the desired effect is to not match 'freedom ...'.
I hadn't considered the case where both free* and freedom match.

My solution 'free* and not freedom' would NOT match either of your examples.

I think what I really want is:
get every matching term from a matching document,
and if the term also matches an ignore word, then ignore the match.

I hadn't considered the stopwords approach; I'll look into that.
If I add all the ignore words as stop words, will that affect highlighting?
Are the stopwords still available for highlighting?

Thanks,
David Shifflett

On 7/9/19, 11:58 AM, "Michael Sokolov" wrote:

I think what you're saying in your example is that "free*" should match anything with a term matching that pattern, but not *only* freedom. In other words, if a document has "freedom from stupidity" then it should not match, but if the document has "free freedom from stupidity" then it should.

Is that correct?

You could apply stopwords, except that it sounds as if this is a per-user blacklist, and you want them to share the same index?

On Tue, Jul 9, 2019 at 11:29 AM Shifflett, David [USA] wrote:
>
> Sorry for the weird reply path, but I couldn’t find an easy reply method via the list archive.
>
> Anyway …
>
> The use case is as follows:
> Allow the user to specify queries such as ‘free*’
> and also include similar words to be ignored, such as freedom.
> Another example would be ‘secret*’ and secretary.
>
> I want to keep the ignore words separate so they apply to all queries,
> but then realized the ignore words should only apply to relevant (matching) queries.
>
> I don’t want the users to be required to add ‘and not WORD’ many times to each of the listed queries.
>
> David Shifflett
>
> From: Diego Ceccarelli
>
> Could you please describe the use case? Maybe there is an easier solution.
>
> From: "Shifflett, David [USA]"
> Date: Tuesday, July 9, 2019 at 8:02 AM
> To: "java-user@lucene.apache.org"
> Subject: How to ignore certain words based on query specifics
>
> Hi all,
> I have a configuration file that lists multiple queries, of all different types,
> and that lists words to be ignored.
>
> Each of these lists is user configured, variable in length and content.
>
> I know that, in general, unless the ignore word is in the query it won’t match,
> but I need to be able to handle wildcard, fuzzy, and Regex queries which might match.
>
> What I need to be able to do is ignore the words in the ignore list,
> but only when they match terms the query would match.
>
> For example: if the query is ‘free*’ and ‘freedom’ should be ignored,
> I could modify the query to be ‘free*’ and not freedom.
>
> But if ‘liberty’ is also to be ignored, I don’t want to add ‘and not liberty’ to that query,
> because that could produce false negatives for documents containing free and liberty.
>
> I think what I need to do is:
>
>     for each query
>         for each ignore word
>             if the query would match the ignore word,
>                 add ‘and not ignore word’ to the query
>
> How can I test if a query would match an ignore word without putting the ignore words into an index and searching the index? This seems like overkill.
>
> To make matters worse, for a query like A and B and C,
> this won’t match an index of ignore words that contains C, but not A or B.
>
> Thanks in advance, for any suggestions or advice,
> David Shifflett
Re: How to ignore certain words based on query specifics
Sorry for the weird reply path, but I couldn’t find an easy reply method via the list archive.

Anyway …

The use case is as follows:
Allow the user to specify queries such as ‘free*’
and also include similar words to be ignored, such as freedom.
Another example would be ‘secret*’ and secretary.

I want to keep the ignore words separate so they apply to all queries,
but then realized the ignore words should only apply to relevant (matching) queries.

I don’t want the users to be required to add ‘and not WORD’ many times to each of the listed queries.

David Shifflett

From: Diego Ceccarelli

Could you please describe the use case? Maybe there is an easier solution.

From: "Shifflett, David [USA]"
Date: Tuesday, July 9, 2019 at 8:02 AM
To: "java-user@lucene.apache.org"
Subject: How to ignore certain words based on query specifics

Hi all,
I have a configuration file that lists multiple queries, of all different types,
and that lists words to be ignored.

Each of these lists is user configured, variable in length and content.

I know that, in general, unless the ignore word is in the query it won’t match,
but I need to be able to handle wildcard, fuzzy, and Regex queries which might match.

What I need to be able to do is ignore the words in the ignore list,
but only when they match terms the query would match.

For example: if the query is ‘free*’ and ‘freedom’ should be ignored,
I could modify the query to be ‘free*’ and not freedom.

But if ‘liberty’ is also to be ignored, I don’t want to add ‘and not liberty’ to that query,
because that could produce false negatives for documents containing free and liberty.

I think what I need to do is:

    for each query
        for each ignore word
            if the query would match the ignore word,
                add ‘and not ignore word’ to the query

How can I test if a query would match an ignore word without putting the ignore words into an index and searching the index? This seems like overkill.

To make matters worse, for a query like A and B and C,
this won’t match an index of ignore words that contains C, but not A or B.

Thanks in advance, for any suggestions or advice,
David Shifflett
How to ignore certain words based on query specifics
Hi all,
I have a configuration file that lists multiple queries, of all different types,
and that lists words to be ignored.

Each of these lists is user configured, variable in length and content.

I know that, in general, unless the ignore word is in the query it won’t match,
but I need to be able to handle wildcard, fuzzy, and Regex queries which might match.

What I need to be able to do is ignore the words in the ignore list,
but only when they match terms the query would match.

For example: if the query is ‘free*’ and ‘freedom’ should be ignored,
I could modify the query to be ‘free*’ and not freedom.

But if ‘liberty’ is also to be ignored, I don’t want to add ‘and not liberty’ to that query,
because that could produce false negatives for documents containing free and liberty.

I think what I need to do is:

    for each query
        for each ignore word
            if the query would match the ignore word,
                add ‘and not ignore word’ to the query

How can I test if a query would match an ignore word without putting the ignore words into an index and searching the index? This seems like overkill.

To make matters worse, for a query like A and B and C,
this won’t match an index of ignore words that contains C, but not A or B.

Thanks in advance, for any suggestions or advice,
David Shifflett
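For the "would this query match this ignore word?" test, one lightweight sketch (covering only simple * and ? wildcard terms, not fuzzy or regex queries, and operating on query strings rather than real Lucene Query objects) is to translate the wildcard term to a java.util.regex pattern and test each ignore word against it, appending the exclusion only on a match. The class and method names here are illustrative:

```java
import java.util.List;
import java.util.regex.Pattern;

public class IgnoreWordFilter {
    /** Converts a wildcard term (e.g. "free*") to a regex: * -> .*, ? -> ., rest quoted. */
    static Pattern wildcardToPattern(String term) {
        StringBuilder re = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (c == '*') re.append(".*");
            else if (c == '?') re.append('.');
            else re.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(re.toString());
    }

    /** Appends "AND NOT w" only for ignore words the query term would actually match. */
    public static String restrict(String queryTerm, List<String> ignoreWords) {
        Pattern p = wildcardToPattern(queryTerm);
        StringBuilder q = new StringBuilder(queryTerm);
        for (String w : ignoreWords) {
            if (p.matcher(w).matches()) {
                // Uppercase operators, as Lucene's QueryParser expects by default.
                q.append(" AND NOT ").append(w);
            }
        }
        return q.toString();
    }

    public static void main(String[] args) {
        // "freedom" matches free*, "liberty" does not, so only freedom is excluded.
        System.out.println(restrict("free*", List.of("freedom", "liberty")));
        // free* AND NOT freedom
    }
}
```

This avoids building an ignore-word index, but it deliberately punts on the harder cases from the thread: fuzzy matching and compound queries like "A and B and C" would need per-clause handling, for which matching each ignore word against a one-document in-memory index per word is the more general (if heavier) route.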