Re: [External] Re: How to highlight fields that are not stored?

2023-02-16 Thread Shifflett, David [USA]
Hi Michael.

Thanks for the reply.

As I said in the opening statement,
I need to move away from reading a file into memory before indexing it.
The use case here is files 2+ GB in size.

I thought streaming the file to be indexed was the only alternative
to reading the full file into RAM and then indexing it.

I would be happy to be directed to another way to get 2+ GB files indexed.


> highlighting requires
> the document in its uninverted form. Otherwise what text would you
> highlight?

Highlighting the (possibly analyzer-changed) terms from the index is my goal
if I can't store the entire document due to RAM size constraints.

Not having the original file text in the highlight isn't ideal,
but it is better than not being able to highlight text in large documents.

David Shifflett


On 2/16/23, 4:01 PM, "Michael Sokolov" <msoko...@gmail.com> wrote:


Sorry, your problem statement makes no sense: you should be able to
store field data in the index without loading all your documents into
RAM while indexing. Maybe there is some constraint you are not telling
us about? Or you may be confused. In any case, highlighting requires
the document in its uninverted form. Otherwise what text would you
highlight?


On Mon, Feb 13, 2023 at 3:46 PM Shifflett, David [USA] <shifflett_da...@bah.com.invalid> wrote:
>
> Hi,
> I am converting my application from
> reading documents into memory, then indexing the documents
> to streaming the documents to be indexed.
>
> I quickly found out this required that the field NOT be stored.
> I then quickly found out that my highlighting code requires the field to be 
> stored.
>
> I’ve been searching for an existing highlighter that doesn’t require the 
> field to be stored,
> and thought I’d found one in the FastVectorHighlighter,
> but tests revealed this highlighter also requires the field to be stored,
> though this requirement isn’t documented, or reflected in any returned 
> exception.
>
> I have been investigating using code like
> Terms terms = reader.getTermVector(docID, fieldName);
> TermsEnum termsEnum = terms.iterator();
> BytesRef bytesRef = termsEnum.next();
> PostingsEnum pe = termsEnum.postings(null, PostingsEnum.OFFSETS);
>
> While this gives me the terms from the document, and the positions,
> iterating over this, and matching to the queries I’m running,
> seems cumbersome, and inefficient.
>
> Any suggestions for highlighting query matches without the searched field 
> being stored?
>
> Thanks,
> David Shifflett
> Senior Lead Technologist
> Enterprise Cross Domain Solutions (ECDS)
> Booz Allen Hamilton
>









How to highlight fields that are not stored?

2023-02-13 Thread Shifflett, David [USA]
Hi,
I am converting my application from
reading documents into memory and then indexing them,
to streaming the documents to be indexed.

I quickly found out this required that the field NOT be stored.
I then quickly found out that my highlighting code requires the field to be 
stored.

I’ve been searching for an existing highlighter that doesn’t require the 
field to be stored,
and thought I’d found one in the FastVectorHighlighter,
but tests revealed this highlighter also requires the field to be stored,
though this requirement isn’t documented, or reflected in any returned 
exception.

  I have been investigating using code like
Terms terms = reader.getTermVector(docID, fieldName);
TermsEnum termsEnum = terms.iterator();
BytesRef bytesRef = termsEnum.next();
PostingsEnum pe = termsEnum.postings(null, PostingsEnum.OFFSETS);

While this gives me the terms from the document, and their positions,
iterating over them and matching against the queries I’m running
seems cumbersome and inefficient.
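
For reference, a rough sketch of what that term-vector walk expands to when collecting offsets (reader, docID, and fieldName are from the snippet above; the queryTerms set and the way regions are recorded are illustrative, not part of the original code; it needs the Terms, TermsEnum, PostingsEnum imports from org.apache.lucene.index and BytesRef from org.apache.lucene.util):

Terms terms = reader.getTermVector(docID, fieldName);
if (terms != null) {
    TermsEnum termsEnum = terms.iterator();
    PostingsEnum postings = null;
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
        // Only consider terms the query actually matched (queryTerms is hypothetical).
        if (!queryTerms.contains(term.utf8ToString())) {
            continue;
        }
        postings = termsEnum.postings(postings, PostingsEnum.OFFSETS);
        postings.nextDoc(); // a term vector contains a single document
        for (int i = 0; i < postings.freq(); i++) {
            postings.nextPosition();
            int start = postings.startOffset();
            int end = postings.endOffset();
            // record [start, end) as a highlight region for this document
        }
    }
}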

Any suggestions for highlighting query matches without the searched field 
being stored?

Thanks,
David Shifflett
Senior Lead Technologist
Enterprise Cross Domain Solutions (ECDS)
Booz Allen Hamilton



Re: [External] Streaming documents into the index breaks highlighting

2022-11-17 Thread Shifflett, David [USA]
Just to clarify,

Is there a highlighting option that doesn't require the text from the matched 
document?

David Shifflett

On 11/17/22, 1:57 PM, "Shifflett, David [USA]" wrote:

Hi,
I am converting my application from
reading documents into memory and then indexing them,
to streaming the documents to be indexed.

I quickly found out this required that the field NOT be stored.
I then quickly found out that my highlighting code requires the field to be 
stored.

I’ve been searching for an existing highlighter that doesn’t require the 
field to be stored,
and thought I’d found one in the FastVectorHighlighter,
but tests revealed this highlighter also requires the field to be stored,
though this requirement isn’t documented, or reflected in any returned 
exception.

Any suggestions for highlighting query matches without the searched field 
being stored?
I was hoping storing the offsets and positions would be enough to enable 
highlighting.

David Shifflett





Streaming documents into the index breaks highlighting

2022-11-17 Thread Shifflett, David [USA]
Hi,
I am converting my application from
reading documents into memory and then indexing them,
to streaming the documents to be indexed.

I quickly found out this required that the field NOT be stored.
I then quickly found out that my highlighting code requires the field to be 
stored.

I’ve been searching for an existing highlighter that doesn’t require the field 
to be stored,
and thought I’d found one in the FastVectorHighlighter,
but tests revealed this highlighter also requires the field to be stored,
though this requirement isn’t documented, or reflected in any returned 
exception.

Any suggestions for highlighting query matches without the searched field being 
stored?
I was hoping storing the offsets and positions would be enough to enable 
highlighting.
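
For reference, the kind of field setup being described here (offsets and term vectors indexed, but the field value itself not stored) looks roughly like the sketch below. The field name, doc, and contentReader are illustrative, and whether a given highlighter can work from this alone is exactly the open question in this thread:

FieldType type = new FieldType();
type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
type.setTokenized(true);
type.setStored(false); // the original text is not kept in the index
type.setStoreTermVectors(true);
type.setStoreTermVectorPositions(true);
type.setStoreTermVectorOffsets(true);
type.freeze();
// A Reader-valued field allows streaming the content instead of holding it in RAM.
doc.add(new Field("content", contentReader, type));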

David Shifflett
Senior Lead Technologist
Enterprise Cross Domain Solutions (ECDS)
Booz Allen Hamilton
M: 831-920-8341


Migrating WhitespaceTokenizerFactory from 8.2 to 9.4

2022-10-28 Thread Shifflett, David [USA]
I am migrating my project’s usage of Lucene from 8.2 to 9.4.
The migration documentation has been very helpful,
but doesn’t help me resolve this exception:

‘Caused by: java.lang.IllegalArgumentException: A SPI class of type 
org.apache.lucene.analysis.TokenizerFactory with name 'whitespace' does not 
exist. You need to add the corresponding JAR file supporting this SPI to your 
classpath. The current classpath supports the following names: [standard]’

My project includes the lucene-analysis-common JAR,
and my JAR includes 
org/apache/lucene/analysis/core/WhitespaceTokenizerFactory.class.

I am not familiar with how Java SPI is configured and built.

I tried creating META-INF/services/org.apache.lucene.analysis.TokenizerFactory
containing: org.apache.lucene.analysis.core.WhitespaceTokenizerFactory
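
For what it's worth, a quick way to check what the running classpath actually registers via SPI is something like the sketch below (standard TokenizerFactory API in 9.x; note that if the project is packaged as a shaded/fat JAR, the META-INF/services entries from lucene-analysis-common may need to be merged rather than overwritten):

import java.util.HashMap;
import org.apache.lucene.analysis.TokenizerFactory;

public class SpiCheck {
    public static void main(String[] args) {
        // Lists every tokenizer name registered via SPI on the current classpath.
        System.out.println(TokenizerFactory.availableTokenizers());
        // Throws the same "SPI class ... does not exist" exception if 'whitespace' is not visible.
        TokenizerFactory whitespace = TokenizerFactory.forName("whitespace", new HashMap<>());
        System.out.println(whitespace.getClass().getName());
    }
}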

What am I missing?

Any help would be appreciated.

Thanks,
David Shifflett



Re: [External] Re: Can lucene be used in Android ?

2022-09-11 Thread Shifflett, David [USA]
Hi Uwe,

I am a little confused by your 2 statements.

> Lucene 9.x series requires JDK 11 to run
> The main branch is already on JDK 17

Will Lucene 9.x run on JDK 17?
Is 9.x 'the main branch'?

Thanks,
David Shifflett
Senior Lead Technologist
Enterprise Cross Domain Solutions (ECDS)
Booz Allen Hamilton

On 9/10/22, 5:30 AM, "Uwe Schindler"  wrote:

Hi Jie,

actually the Lucene 9.x series requires JDK 11 to run; previous versions 
also work with Java 8. The main branch is already on JDK 17. From my 
knowledge, you may only use Lucene versions up to 8 to have at least a 
chance to run it. But with older Android versions you may even need to go 
back to Lucene builds targeting JDK 7 (Lucene 5?, don't know).

But this is only half of the story: Lucene actually uses many 
modern JDK and JVM features that are partly not implemented in Dalvik. 
It uses MethodHandles instead of reflection, and the Java 8+ versions use 
lambdas, which were not compatible with older Android SDKs.

So in short: use an older version and hope, but we offer no support and are 
not keen to apply changes to Lucene so it can be used with Android at 
all, because Android is not really compatible with any Java spec, like the API 
or memory model.

Uwe

Am 09.09.2022 um 09:10 schrieb Jie Wang:
> Hey,
>
> Recently, I have been trying to compile Lucene to get a jar that can be used 
in Android, but failed.
>
> Is there an official version that supports the use of Lucene on Android?
>
>
> Thanks!
-- 
Uwe Schindler
Achterdiek 19, D-28357 Bremen

https://www.thetaphi.de
eMail: u...@thetaphi.de






I am getting an exception in ComplexPhraseQueryParser when fuzzy searching

2021-11-12 Thread Shifflett, David [USA]
I am using Lucene 8.2, but have also verified this on 8.9 and 8.10.1.

My query string is either "by~1 word~1" or "ky~1 word~1".
I am looking for a phrase of these 2 words, with a potential 1-character 
misspelling, or fuzziness.
I realize that 'by' is usually a stop word; that is why I also tested with 'ky'.

My simplified test content is either "AC-2.b word", "AC-2.k word", or "AC-2.y 
word".

The first part of the test content is pulled from actual data my customers are 
trying to search.
For the query with 'by~1' the exception occurs if the content has '.b' or '.y', 
but not '.k'.
For the query with 'ky~1' the exception occurs if the content has '.k' or '.y', 
but not '.b'.

Here is the test code:
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.analysis.tokenattributes.*;
import org.apache.lucene.analysis.util.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class phraseTest {

    public static Analyzer analyzer = new StandardAnalyzer();
    public static IndexWriterConfig config = new IndexWriterConfig(analyzer);
    public static RAMDirectory ramDirectory = new RAMDirectory();
    public static IndexWriter indexWriter;
    public static Query queryToSearch = null;
    public static IndexReader idxReader;
    public static IndexSearcher idxSearcher;
    public static TopDocs hits;
    public static String query_field = "Content";

    // Pick only one content string
    // public static String content = "AC-2.b word";
    public static String content = "AC-2.k word";
    // public static String content = "AC-2.y word";

    // Pick only one query string
    // public static String queryString = "\"by~1 word~1\"";
    public static String queryString = "\"ky~1 word~1\"";

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException {

        System.out.println("Content   is\n  " + content);
        System.out.println("Query field   is " + query_field);
        System.out.println("Query String  is '" + queryString + "'");

        Document doc = new Document(); // create a new document

        /**
         * Create a field with term vector enabled
         */
        FieldType type = new FieldType();
        type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        type.setStored(true);
        type.setStoreTermVectors(true);
        type.setTokenized(true);
        type.setStoreTermVectorOffsets(true);

        // term vector enabled
        Field cField = new Field(query_field, content, type);
        doc.add(cField);

        try {
            indexWriter = new IndexWriter(ramDirectory, config);
            indexWriter.addDocument(doc);
            indexWriter.close();

            idxReader = DirectoryReader.open(ramDirectory);
            idxSearcher = new IndexSearcher(idxReader);
            ComplexPhraseQueryParser qp =
                    new ComplexPhraseQueryParser(query_field, analyzer);
            queryToSearch = qp.parse(queryString);

            // Here is where the searching, etc starts
            hits = idxSearcher.search(queryToSearch, idxReader.maxDoc());
            System.out.println("scoreDoc size: " + hits.scoreDocs.length);

            // highlight the hits ...

        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (ParseException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

Here is the exception (using Lucene 8.2):

Exception in thread "main" java.lang.IllegalArgumentException: Unknown query type "org.apache.lucene.search.ConstantScoreQuery" found in phrase query string "ky~1 word~1"
at org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser$ComplexPhraseQuery.rewrite(ComplexPhraseQueryParser.java:325)
at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:666)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:439)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:564)
at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:416)
at ...
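
As a diagnostic (not part of the original post), rewriting one of the fuzzy terms by itself against the same reader shows which Query class it becomes, which is presumably where the ConstantScoreQuery in the message above comes from. This fragment would go inside the try block of the test code and needs the FuzzyQuery and Term imports; the "ky" term and maxEdits=1 mirror the query string above:

// Rewrite a single fuzzy term to see what ComplexPhraseQuery.rewrite() will encounter.
Query fuzzy = new FuzzyQuery(new Term(query_field, "ky"), 1);
Query rewritten = fuzzy.rewrite(idxReader); // Lucene 8.x rewrite(IndexReader)
System.out.println(rewritten.getClass().getName() + " : " + rewritten);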

I am getting an exception in ComplexPhraseQueryParser when fuzzy searching

2021-11-01 Thread Shifflett, David [USA]
I am using Lucene 8.2, but have also verified this on 8.9 and 8.10.1.
My query string is either "by~1 word~1" or "ky~1 word~1".
I am looking for a phrase of these 2 words, with a potential 1-character 
misspelling, or fuzziness.
I realize that 'by' is usually a stop word; that is why I also tested with 'ky'.
My simplified test content is either "AC-2.b word", "AC-2.k word", or "AC-2.y 
word".
The first part of the test content is pulled from actual data my customers are 
trying to search.
For the query with 'by~1' the exception occurs if the content has '.b' or '.y', 
but not '.k'.
For the query with 'ky~1' the exception occurs if the content has '.k' or '.y', 
but not '.b'.
Here is the test code:
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.analysis.tokenattributes.*;
import org.apache.lucene.analysis.util.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class phraseTest {

    public static Analyzer analyzer = new StandardAnalyzer();
    public static IndexWriterConfig config = new IndexWriterConfig(analyzer);
    public static RAMDirectory ramDirectory = new RAMDirectory();
    public static IndexWriter indexWriter;
    public static Query queryToSearch = null;
    public static IndexReader idxReader;
    public static IndexSearcher idxSearcher;
    public static TopDocs hits;
    public static String query_field = "Content";

    // Pick only one content string
    // public static String content = "AC-2.b word";
    public static String content = "AC-2.k word";
    // public static String content = "AC-2.y word";

    // Pick only one query string
    // public static String queryString = "\"by~1 word~1\"";
    public static String queryString = "\"ky~1 word~1\"";

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException {

        System.out.println("Content   is\n  " + content);
        System.out.println("Query field   is " + query_field);
        System.out.println("Query String  is '" + queryString + "'");

        Document doc = new Document(); // create a new document

        /**
         * Create a field with term vector enabled
         */
        FieldType type = new FieldType();
        type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        type.setStored(true);
        type.setStoreTermVectors(true);
        type.setTokenized(true);
        type.setStoreTermVectorOffsets(true);

        // term vector enabled
        Field cField = new Field(query_field, content, type);
        doc.add(cField);

        try {
            indexWriter = new IndexWriter(ramDirectory, config);
            indexWriter.addDocument(doc);
            indexWriter.close();

            idxReader = DirectoryReader.open(ramDirectory);
            idxSearcher = new IndexSearcher(idxReader);
            ComplexPhraseQueryParser qp =
                    new ComplexPhraseQueryParser(query_field, analyzer);
            queryToSearch = qp.parse(queryString);

            // Here is where the searching, etc starts
            hits = idxSearcher.search(queryToSearch, idxReader.maxDoc());
            System.out.println("scoreDoc size: " + hits.scoreDocs.length);

            // highlight the hits ...

        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (ParseException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

Here is the exception (using Lucene 8.2):

Exception in thread "main" java.lang.IllegalArgumentException: Unknown query type "org.apache.lucene.search.ConstantScoreQuery" found in phrase query string "ky~1 word~1"
at org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser$ComplexPhraseQuery.rewrite(ComplexPhraseQueryParser.java:325)
at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:666)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:439)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:564)
at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:416)
at ...

Why would a search using a ComplexPhraseQueryParser throw an exception for some content, but not all content?

2021-08-17 Thread Shifflett, David [USA]
I am using Lucene 8.2, but have also verified this on 8.9.

My query string is either "by~1 word~1" or "ky~1 word~1".

I am looking for a phrase of these 2 words, with a potential 1-character 
misspelling, or fuzziness.

I realize that 'by' is usually a stop word; that is why I also tested with 'ky'.

My simplified test content is either "AC-2.b word", "AC-2.k word", or "AC-2.y 
word".

The first part of the test content is pulled from actual data my customers are 
trying to search.

For the query with 'by~1' the exception occurs if the content has '.b' or '.y', 
but not '.k'.

For the query with 'ky~1' the exception occurs if the content has '.k' or '.y', 
but not '.b'.

Here is the test code:
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.analysis.tokenattributes.*;
import org.apache.lucene.analysis.util.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class phraseTest {

    public static Analyzer analyzer = new StandardAnalyzer();
    public static IndexWriterConfig config = new IndexWriterConfig(analyzer);
    public static RAMDirectory ramDirectory = new RAMDirectory();
    public static IndexWriter indexWriter;
    public static Query queryToSearch = null;
    public static IndexReader idxReader;
    public static IndexSearcher idxSearcher;
    public static TopDocs hits;
    public static String query_field = "Content";

    // Pick only one content string
    // public static String content = "AC-2.b word";
    public static String content = "AC-2.k word";
    // public static String content = "AC-2.y word";

    // Pick only one query string
    // public static String queryString = "\"by~1 word~1\"";
    public static String queryString = "\"ky~1 word~1\"";

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException {

        System.out.println("Content   is\n  " + content);
        System.out.println("Query field   is " + query_field);
        System.out.println("Query String  is '" + queryString + "'");

        Document doc = new Document(); // create a new document

        /**
         * Create a field with term vector enabled
         */
        FieldType type = new FieldType();
        type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        type.setStored(true);
        type.setStoreTermVectors(true);
        type.setTokenized(true);
        type.setStoreTermVectorOffsets(true);

        // term vector enabled
        Field cField = new Field(query_field, content, type);
        doc.add(cField);

        try {
            indexWriter = new IndexWriter(ramDirectory, config);
            indexWriter.addDocument(doc);
            indexWriter.close();

            idxReader = DirectoryReader.open(ramDirectory);
            idxSearcher = new IndexSearcher(idxReader);
            ComplexPhraseQueryParser qp =
                    new ComplexPhraseQueryParser(query_field, analyzer);
            queryToSearch = qp.parse(queryString);

            // Here is where the searching, etc starts
            hits = idxSearcher.search(queryToSearch, idxReader.maxDoc());
            System.out.println("scoreDoc size: " + hits.scoreDocs.length);

            // highlight the hits ...

        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (ParseException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

Here is the exception (using Lucene 8.2):


Exception in thread "main" java.lang.IllegalArgumentException: Unknown query type "org.apache.lucene.search.ConstantScoreQuery" found in phrase query string "ky~1 word~1"
at org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser$ComplexPhraseQuery.rewrite(ComplexPhraseQueryParser.java:325)
at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:666)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:439)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:564)
at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:416)
at ...

Re: [External] Re: ComplexPhraseQueryParser isn't switching search terms to lowercase with StandardAnalyzer

2019-10-22 Thread Shifflett, David [USA]
I saw the changes in the diff.
But without looking into the test, I am asking to confirm if it
matches my conditions:
1) Uses a StandardAnalyzer
2) Does the actual query.toString() return lowercase J and S

David Shifflett


On 10/22/19, 10:44 AM, "Mikhail Khludnev"  wrote:

On Tue, Oct 22, 2019 at 5:26 PM Shifflett, David [USA] <
shifflett_da...@bah.com> wrote:

> Mikhail,
>
> Thanks for running those tests.
> I haven’t looked into the test, but can you confirm it uses an analyzer
> with the lowercase filter?
>
Look at the diff. It's a diff on an existing test, not a new test.

-checkMatches("\"john smith\"", "1"); // Simple multi-term still works
-checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are
OK in
+checkMatches("\"John Smith\"", "1"); // Simple multi-term still works
+checkMatches("\"J*   Smyth~\"", "1,2"); // wildcards and fuzzies are
OK in

Here I flip to capital letters, and it still matches what it matched before
in lowercase.


> Also can you confirm whether the actual query being used contains upper or
> lower case J and S (in your John Smith case)?
>
> Apologies for the 'content:foo'.
> I changed the code snippet to "somefield", and missed changing that part
> of the output.
>
> David Shifflett
>
>
> On 10/22/19, 5:51 AM, "Mikhail Khludnev"  wrote:
>
> Hello,
I wonder how it came up with this particular field:
content:foo
Anyway, I added some uppercase in the test and it passed despite that:
>
> diff --git
>
> 
a/lucene/queryparser/src/test/org/apache/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java
>
> 
b/lucene/queryparser/src/test/org/apache/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java
> index 5935da9..9baa492 100644
> ---
>
> 
a/lucene/queryparser/src/test/org/apache/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java
> +++
>
> 
b/lucene/queryparser/src/test/org/apache/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java
> @@ -55,8 +55,8 @@
>boolean inOrder = true;
>
>public void testComplexPhrases() throws Exception {
> -checkMatches("\"john smith\"", "1"); // Simple multi-term still
> works
> -checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies
> are
> OK in
> +checkMatches("\"John Smith\"", "1"); // Simple multi-term still
> works
> +checkMatches("\"J*   Smyth~\"", "1,2"); // wildcards and fuzzies
> are
> OK in
>  // phrases
>  checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic
> works
>  checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic
> works.
> @@ -161,11 +161,11 @@
>  checkMatches("name:\"j*   smyth~\"", "1,2");
>  checkMatches("role:\"developer\"", "1,2");
>  checkMatches("role:\"p* manager\"", "4");
> -checkMatches("role:de*", "1,2,3");
> +checkMatches("role:De*", "1,2,3");
>  checkMatches("name:\"j* smyth~\"~5", "1,2,3");
>  checkMatches("role:\"p* manager\" AND name:jack*", "4");
>  checkMatches("+role:developer +name:jack*", "");
> -checkMatches("name:\"john smith\"~2 AND role:designer AND id:3",
> "3");
> +checkMatches("name:\"john smith\"~2 AND role:Designer AND id:3",
> "3");
>}
>
>public void testToStringContainsSlop() throws Exception {
>
The problem seems quite odd (assuming CPQP does analysis); it seems like
debugging is the last resort in this particular case.
>
> On Mon, Oct 21, 2019 at 8:31 PM Shifflett, David [USA] <
> shifflett_da...@bah.com> wrote:
>
> > Hi all,
> > Using the code snippet:
> > ComplexPhraseQueryParser qp = new
> > ComplexPhraseQueryParser(“somefield”, new Stand

Re: [External] Re: ComplexPhraseQueryParser isn't switching search terms to lowercase with StandardAnalyzer

2019-10-22 Thread Shifflett, David [USA]
Mikhail,

Thanks for running those tests.
I haven’t looked into the test, but can you confirm it uses an analyzer with 
the lowercase filter?
Also can you confirm whether the actual query being used contains upper or 
lower case J and S (in your John Smith case)?

Apologies for the 'content:foo'.
I changed the code snippet to "somefield", and missed changing that part of the 
output.

David Shifflett


On 10/22/19, 5:51 AM, "Mikhail Khludnev"  wrote:

Hello,
I wonder how it came up with this particular field:
content:foo
Anyway, I added some uppercase in the test and it passed despite that:

diff --git

a/lucene/queryparser/src/test/org/apache/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java

b/lucene/queryparser/src/test/org/apache/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java
index 5935da9..9baa492 100644
---

a/lucene/queryparser/src/test/org/apache/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java
+++

b/lucene/queryparser/src/test/org/apache/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java
@@ -55,8 +55,8 @@
   boolean inOrder = true;

   public void testComplexPhrases() throws Exception {
-checkMatches("\"john smith\"", "1"); // Simple multi-term still works
-checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies are
OK in
+checkMatches("\"John Smith\"", "1"); // Simple multi-term still works
+checkMatches("\"J*   Smyth~\"", "1,2"); // wildcards and fuzzies are
OK in
 // phrases
 checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
 checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.
@@ -161,11 +161,11 @@
 checkMatches("name:\"j*   smyth~\"", "1,2");
 checkMatches("role:\"developer\"", "1,2");
 checkMatches("role:\"p* manager\"", "4");
-checkMatches("role:de*", "1,2,3");
+checkMatches("role:De*", "1,2,3");
 checkMatches("name:\"j* smyth~\"~5", "1,2,3");
 checkMatches("role:\"p* manager\" AND name:jack*", "4");
 checkMatches("+role:developer +name:jack*", "");
-    checkMatches("name:\"john smith\"~2 AND role:designer AND id:3", "3");
+checkMatches("name:\"john smith\"~2 AND role:Designer AND id:3", "3");
   }

   public void testToStringContainsSlop() throws Exception {

The problem seems quite odd (assuming CPQP does analysis); it seems like
debugging is the last resort in this particular case.

On Mon, Oct 21, 2019 at 8:31 PM Shifflett, David [USA] <
shifflett_da...@bah.com> wrote:

> Hi all,
> Using the code snippet:
> ComplexPhraseQueryParser qp = new
> ComplexPhraseQueryParser(“somefield”, new StandardAnalyzer());
> String teststr = "\"Foo Bar\"~2";
> Query queryToSearch = qp.parse(teststr);
> System.out.println("Query : " + queryToSearch.toString());
> System.out.println("Type of query : " +
> queryToSearch.getClass().getSimpleName());
>
> I am getting the output
> Query : "Foo Bar"~2
> Type of query : ComplexPhraseQuery
>
> If I change teststr to "\"Foo Bar\""
> I get
> Query : "Foo Bar"
> Type of query : ComplexPhraseQuery
>
> If I change teststr to "Foo Bar"
> I get
> Query : content:foo content:bar
> Type of query : BooleanQuery
>
>
> In the first two cases I was expecting the search terms to be switched to
> lowercase.
>
> Were the Foo and Bar left as originally specified because the terms are
> inside double quotes?
>
> How can I specify a search term that I want treated as a Phrase,
> but also have the query parser apply the LowerCaseFilter?
>
> I am hoping to avoid the need to handle this using PhraseQuery,
> and continue to use the QueryParser.
>
>
> Thanks in advance for any help you can give me,
> David Shifflett
>
>

-- 
Sincerely yours
Mikhail Khludnev






Re: [External] Re: ComplexPhraseQueryParser isn't switching search terms to lowercase with StandardAnalyzer

2019-10-21 Thread Shifflett, David [USA]
Baris,

Sorry I neglected to add that piece.
This test was run against 8.0.0,
but I also want it to work in later versions.

Another piece of my project is using 8.2.0.

Thanks again for any info,
David Shifflett


On 10/21/19, 3:23 PM, "baris.ka...@oracle.com"  wrote:

David,-

  which version of Lucene are You using?

Best regards


On 10/21/19 1:31 PM, Shifflett, David [USA] wrote:
> Hi all,
> Using the code snippet:
>  ComplexPhraseQueryParser qp = new 
ComplexPhraseQueryParser(“somefield”, new StandardAnalyzer());
>  String teststr = "\"Foo Bar\"~2";
>  Query queryToSearch = qp.parse(teststr);
>  System.out.println("Query : " + queryToSearch.toString());
>  System.out.println("Type of query : " + 
queryToSearch.getClass().getSimpleName());
>
> I am getting the output
>  Query : "Foo Bar"~2
>  Type of query : ComplexPhraseQuery
>
> If I change teststr to "\"Foo Bar\""
> I get
>  Query : "Foo Bar"
>  Type of query : ComplexPhraseQuery
>
> If I change teststr to "Foo Bar"
> I get
>  Query : content:foo content:bar
>  Type of query : BooleanQuery
>
>
> In the first two cases I was expecting the search terms to be switched to 
lowercase.
>
> Were the Foo and Bar left as originally specified because the terms are 
inside double quotes?
>
> How can I specify a search term that I want treated as a Phrase,
> but also have the query parser apply the LowerCaseFilter?
>
> I am hoping to avoid the need to handle this using PhraseQuery,
> and continue to use the QueryParser.
>
>
> Thanks in advance for any help you can give me,
> David Shifflett
>






ComplexPhraseQueryParser isn't switching search terms to lowercase with StandardAnalyzer

2019-10-21 Thread Shifflett, David [USA]
Hi all,
Using the code snippet:
ComplexPhraseQueryParser qp = new ComplexPhraseQueryParser("somefield", new StandardAnalyzer());
String teststr = "\"Foo Bar\"~2";
Query queryToSearch = qp.parse(teststr);
System.out.println("Query : " + queryToSearch.toString());
System.out.println("Type of query : " + queryToSearch.getClass().getSimpleName());

I am getting the output
Query : "Foo Bar"~2
Type of query : ComplexPhraseQuery

If I change teststr to "\"Foo Bar\""
I get
Query : "Foo Bar"
Type of query : ComplexPhraseQuery

If I change teststr to "Foo Bar"
I get
Query : content:foo content:bar
Type of query : BooleanQuery


In the first two cases I was expecting the search terms to be switched to 
lowercase.

Were the Foo and Bar left as originally specified because the terms are inside 
double quotes?

How can I specify a search term that I want treated as a Phrase,
but also have the query parser apply the LowerCaseFilter?

I am hoping to avoid the need to handle this using PhraseQuery,
and continue to use the QueryParser.
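
One way to check this (a sketch, assuming an IndexReader named reader over an index that contains "somefield"): the unrewritten ComplexPhraseQuery.toString() appears to just echo the original phrase text, while rewriting the parsed query against the reader shows the analyzed terms:

ComplexPhraseQueryParser qp = new ComplexPhraseQueryParser("somefield", new StandardAnalyzer());
Query parsed = qp.parse("\"Foo Bar\"~2");
System.out.println("parsed   : " + parsed);   // echoes the source text: "Foo Bar"~2
Query rewritten = parsed.rewrite(reader);     // analysis (lowercasing) appears to be applied at this stage
System.out.println("rewritten: " + rewritten);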


Thanks in advance for any help you can give me,
David Shifflett



Re: [External] Re: How to ignore certain words based on query specifics

2019-07-11 Thread Shifflett, David [USA]
Evert,
It is definitely not a bug.
I was asking about how to do something, I couldn't quite figure out.
Stop words is the way to go.

David Shifflett
 

On 7/11/19, 11:26 AM, "evert.wagenaar"  wrote:

I see it as a feature, not a bug. The appearance of stop words in the
Search Summary makes it more clear what the Hit is about. Not sure but I think
Google does the same in search summaries. -Evert


Re: [External] Re: How to ignore certain words based on query specifics

2019-07-11 Thread Shifflett, David [USA]
I just tested this with the search.highlight.Highlighter class.
Is this the 'old default highlighter'?

I phrased my question badly.
Of course the stop words shouldn't be highlighted,
as they wouldn't match any query.

My question was really, would the stop words be available for
inclusion in the highlight context (surrounding a match)?

The answer is yes the stop words do appear in the context,
and are not highlighted.
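
For anyone following along, the usage being described is roughly the sketch below (the "content" field name and surrounding variables are illustrative; the classes are from org.apache.lucene.search.highlight):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

public class HighlightSketch {
    // Returns the best fragment for a stored "content" field; because the stored
    // text is re-analyzed, stop words still appear (unhighlighted) in the fragment.
    static String bestFragment(Query query, Analyzer analyzer, Document doc)
            throws IOException, InvalidTokenOffsetsException {
        QueryScorer scorer = new QueryScorer(query, "content");
        Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter("<b>", "</b>"), scorer);
        return highlighter.getBestFragment(analyzer, "content", doc.get("content"));
    }
}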

Thanks,
David Shifflett 

On 7/10/19, 9:12 PM, "Michael Sokolov"  wrote:

I'm not as au courant with highlighters as I used to be. I think some of them
work using postings, and for those, no, you wouldn't be able to highlight
stop words. But maybe you can use the old default highlighter that would
reanalyze the document from a stored field, using an Analyzer that doesn't
remove stop words? Sorry I'm not sure if that exists any more, maybe
someone else will know.

On Tue, Jul 9, 2019, 10:17 AM Shifflett, David [USA] <
shifflett_da...@bah.com> wrote:

> Michael,
> Thanks for your reply.
>
> You are correct, the desired effect is to not match 'freedom ...'.
> I hadn't considered the case where both free* and freedom match.
>
> My solution 'free* and not freedom' would NOT match either of your
> examples.
>
> I think what I really want is
> Get every matching term from a matching document,
> and if the term also matches an ignore word, then ignore the match.
>
> I hadn't considered the stopwords approach, I'll look into that.
> If I add all the ignore words as stop words, will that effect 
highlighting?
> Are the stopwords still available for highlighting?
>
> Thanks,
> David Shifflett
>
>
> On 7/9/19, 11:58 AM, "Michael Sokolov"  wrote:
>
> I think what you're saying in you're example is that "free*" should
> match anything with a term matching that pattern, but not *only*
> freedom. In other words, if a document has "freedom from stupidity"
>  then it should not match, but if the document has "free freedom from
> stupidity" than it should.
>
> Is that correct?
>
> You could apply stopwords, except that it sounds as if this is a
> per-user blacklist, and you want them to share the same index?
>
> On Tue, Jul 9, 2019 at 11:29 AM Shifflett, David [USA]
>  wrote:
> >
> > Sorry for the weird reply path, but I couldn’t find an easy reply
> method via the list archive.
> >
> > Anyway …
> >
> > The use case is as follows:
> > Allow the user to specify queries such as ‘free*’
> > and also include similar words to be ignored, such as freedom.
> > Another example would be ‘secret*’ and secretary.
> >
> > I want to keep the ignore words separate so they apply to all
> queries,
> > but then realized the ignore words should only apply to relevant
> (matching) queries.
> >
> > I don’t want the users to be required to add ‘and not WORD’ many
> times to each of the listed queries.
    > >
> > David Shifflett
> >
> > From: Diego Ceccarelli
> >
> > Could you please describe the use case? maybe there is an easier
> solution
> >
> >
> >
> > From: "Shifflett, David [USA]" 
> > Date: Tuesday, July 9, 2019 at 8:02 AM
> > To: "java-user@lucene.apache.org" 
> > Subject: How to ignore certain words based on query specifics
> >
> > Hi all,
> > I have a configuration file that lists multiple queries, of all
> different types,
> > and that lists words to be ignored.
> >
> > Each of these lists is user configured, variable in length and
> content.
> >
> > I know that, in general, unless the ignore word is in the query it
> won’t match,
> > but I need to be able to handle wildcard, fuzzy, and Regex, queries
> which might match.
> >
> > What I need to be able to do is ignore the words in the ignore list,
> > but only when they match terms the query would match.
> >
> > For example: if the query is ‘free*’ and ‘freedom’ should be 
ignored,
> > I could modify the query to be ‘free*’ and not freedom.
> >
> > But if ‘liberty’ is also to be ignored, I don’t wa

Re: [External] Re: How to ignore certain words based on query specifics

2019-07-09 Thread Shifflett, David [USA]
Michael,
Thanks for your reply.

You are correct, the desired effect is to not match 'freedom ...'.
I hadn't considered the case where both free* and freedom match.

My solution 'free* and not freedom' would NOT match either of your examples.

I think what I really want is
Get every matching term from a matching document,
and if the term also matches an ignore word, then ignore the match.

I hadn't considered the stopwords approach, I'll look into that.
If I add all the ignore words as stop words, will that affect highlighting?
Are the stopwords still available for highlighting?

Thanks,
David Shifflett
 

On 7/9/19, 11:58 AM, "Michael Sokolov"  wrote:

I think what you're saying in your example is that "free*" should
match anything with a term matching that pattern, but not *only*
freedom. In other words, if a document has "freedom from stupidity"
 then it should not match, but if the document has "free freedom from
stupidity" then it should.

Is that correct?

You could apply stopwords, except that it sounds as if this is a
per-user blacklist, and you want them to share the same index?
    
On Tue, Jul 9, 2019 at 11:29 AM Shifflett, David [USA]
 wrote:
>
> Sorry for the weird reply path, but I couldn’t find an easy reply method 
via the list archive.
>
> Anyway …
>
> The use case is as follows:
> Allow the user to specify queries such as ‘free*’
> and also include similar words to be ignored, such as freedom.
> Another example would be ‘secret*’ and secretary.
>
> I want to keep the ignore words separate so they apply to all queries,
> but then realized the ignore words should only apply to relevant 
(matching) queries.
>
> I don’t want the users to be required to add ‘and not WORD’ many times to 
each of the listed queries.
>
> David Shifflett
>
> From: Diego Ceccarelli
>
> Could you please describe the use case? maybe there is an easier solution
>
>
>
> From: "Shifflett, David [USA]" 
> Date: Tuesday, July 9, 2019 at 8:02 AM
> To: "java-user@lucene.apache.org" 
> Subject: How to ignore certain words based on query specifics
>
> Hi all,
> I have a configuration file that lists multiple queries, of all different 
types,
> and that lists words to be ignored.
>
> Each of these lists is user configured, variable in length and content.
>
> I know that, in general, unless the ignore word is in the query it won’t 
match,
> but I need to be able to handle wildcard, fuzzy, and Regex, queries which 
might match.
>
> What I need to be able to do is ignore the words in the ignore list,
> but only when they match terms the query would match.
>
> For example: if the query is ‘free*’ and ‘freedom’ should be ignored,
> I could modify the query to be ‘free*’ and not freedom.
>
> But if ‘liberty’ is also to be ignored, I don’t want to add ‘and not 
liberty’ to that query
> because that could produce false negatives for documents containing free 
and liberty.
>
> I think what I need to do is:
> for each query
>   for each ignore word
> if the query would match the ignore word,
>   add ‘and not ignore word’ to the query
>
> How can I test if a query would match an ignore word without putting the 
ignore words into an index
> and searching the index?
> This seems like overkill.
>
> To make matters worse, for a query like A and B and C,
> this won’t match an index of ignore words that contains C, but not A or B.
>
> Thanks in advance, for any suggestions or advice,
> David Shifflett
>






Re: How to ignore certain words based on query specifics

2019-07-09 Thread Shifflett, David [USA]
Sorry for the weird reply path, but I couldn’t find an easy reply method via 
the list archive.

Anyway …

The use case is as follows:
Allow the user to specify queries such as ‘free*’
and also include similar words to be ignored, such as freedom.
Another example would be ‘secret*’ and secretary.

I want to keep the ignore words separate so they apply to all queries,
but then realized the ignore words should only apply to relevant (matching) 
queries.

I don’t want the users to be required to add ‘and not WORD’ many times to each 
of the listed queries.

David Shifflett

From: Diego Ceccarelli

Could you please describe the use case? maybe there is an easier solution



From: "Shifflett, David [USA]" 
Date: Tuesday, July 9, 2019 at 8:02 AM
To: "java-user@lucene.apache.org" 
Subject: How to ignore certain words based on query specifics

Hi all,
I have a configuration file that lists multiple queries, of all different types,
and that lists words to be ignored.

Each of these lists is user configured, variable in length and content.

I know that, in general, unless the ignore word is in the query it won’t match,
but I need to be able to handle wildcard, fuzzy, and Regex, queries which might 
match.

What I need to be able to do is ignore the words in the ignore list,
but only when they match terms the query would match.

For example: if the query is ‘free*’ and ‘freedom’ should be ignored,
I could modify the query to be ‘free*’ and not freedom.

But if ‘liberty’ is also to be ignored, I don’t want to add ‘and not liberty’ 
to that query
because that could produce false negatives for documents containing free and 
liberty.

I think what I need to do is:
for each query
  for each ignore word
    if the query would match the ignore word,
      add ‘and not ignore word’ to the query

How can I test if a query would match an ignore word without putting the ignore 
words into an index
and searching the index?
This seems like overkill.

To make matters worse, for a query like A and B and C,
this won’t match an index of ignore words that contains C, but not A or B.

Thanks in advance, for any suggestions or advice,
David Shifflett



How to ignore certain words based on query specifics

2019-07-09 Thread Shifflett, David [USA]
Hi all,
I have a configuration file that lists multiple queries, of all different types,
and that lists words to be ignored.

Each of these lists is user configured, variable in length and content.

I know that, in general, unless the ignore word is in the query it won’t match,
but I need to be able to handle wildcard, fuzzy, and Regex, queries which might 
match.

What I need to be able to do is ignore the words in the ignore list,
but only when they match terms the query would match.

For example: if the query is ‘free*’ and ‘freedom’ should be ignored,
I could modify the query to be ‘free*’ and not freedom.

But if ‘liberty’ is also to be ignored, I don’t want to add ‘and not liberty’ 
to that query
because that could produce false negatives for documents containing free and 
liberty.

I think what I need to do is:
for each query
  for each ignore word
    if the query would match the ignore word,
      add ‘and not ignore word’ to the query

How can I test if a query would match an ignore word without putting the ignore 
words into an index
and searching the index?
This seems like overkill.
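
One possible way to do that matching without a real index (a sketch, not from the thread): Lucene's MemoryIndex, in the lucene-memory module, indexes a single document in RAM, so each ignore word can be tested directly against the parsed query, including wildcard and fuzzy queries. The field name and analyzer are illustrative:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;

public class IgnoreWordCheck {
    // Returns true if the parsed query would match an index containing only the ignore word.
    public static boolean queryMatches(Query query, String ignoreWord, Analyzer analyzer) {
        MemoryIndex index = new MemoryIndex();
        index.addField("content", ignoreWord, analyzer);
        return index.search(query) > 0.0f; // score > 0 means the query matched
    }
}

As noted below, this has the same limitation for compound queries like A and B and C: a single ignore word cannot satisfy all required clauses.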

To make matters worse, for a query like A and B and C,
this won’t match an index of ignore words that contains C, but not A or B.

Thanks in advance, for any suggestions or advice,
David Shifflett