Re: [External] Re: How to ignore certain words based on query specifics

2019-07-09 Thread Shifflett, David [USA]
Michael,
Thanks for your reply.

You are correct, the desired effect is to not match 'freedom ...'.
I hadn't considered the case where both free* and freedom match.

My solution 'free* and not freedom' would NOT match either of your examples.

I think what I really want is
Get every matching term from a matching document,
and if the term also matches an ignore word, then ignore the match.
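
A minimal sketch of that term-level rule, in plain Java rather than Lucene API. The class name TermLevelIgnore, the split on non-word characters, and the regex "free.*" standing in for the wildcard free* are all assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
final class TermLevelIgnore {

    // Returns the terms in the text that match the query pattern
    // and are not on the ignore list.
    static List<String> survivingMatches(String text, Pattern query, Set<String> ignoreWords) {
        List<String> kept = new ArrayList<>();
        // Naive tokenization (an assumption): lowercase, split on non-word characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (query.matcher(token).matches() && !ignoreWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Pattern free = Pattern.compile("free.*"); // regex stand-in for free*
        Set<String> ignore = new HashSet<>(Arrays.asList("freedom"));
        // Only an ignored term matches, so nothing survives: prints []
        System.out.println(survivingMatches("freedom from stupidity", free, ignore));
        // "free" matches and is not ignored: prints [free]
        System.out.println(survivingMatches("free freedom from stupidity", free, ignore));
    }
}
```

Under this rule a document counts as a hit only if the returned list is non-empty, which gives the behavior Michael described for both of his examples.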

I hadn't considered the stopwords approach, I'll look into that.
If I add all the ignore words as stop words, will that affect highlighting?
Are the stopwords still available for highlighting?

Thanks,
David Shifflett
 

On 7/9/19, 11:58 AM, "Michael Sokolov" wrote:

I think what you're saying in your example is that "free*" should
match anything with a term matching that pattern, but not *only*
freedom. In other words, if a document has "freedom from stupidity"
then it should not match, but if the document has "free freedom from
stupidity" then it should.

Is that correct?

You could apply stopwords, except that it sounds as if this is a
per-user blacklist, and you want them to share the same index?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





Re: How to ignore certain words based on query specifics

2019-07-09 Thread Michael Sokolov
I think what you're saying in your example is that "free*" should
match anything with a term matching that pattern, but not *only*
freedom. In other words, if a document has "freedom from stupidity"
then it should not match, but if the document has "free freedom from
stupidity" then it should.

Is that correct?

You could apply stopwords, except that it sounds as if this is a
per-user blacklist, and you want them to share the same index?

On Tue, Jul 9, 2019 at 11:29 AM Shifflett, David [USA] wrote:
>
> Sorry for the weird reply path, but I couldn’t find an easy reply method via 
> the list archive.
>
> Anyway …
>
> The use case is as follows:
> Allow the user to specify queries such as ‘free*’
> and also include similar words to be ignored, such as freedom.
> Another example would be ‘secret*’ and secretary.
>
> I want to keep the ignore words separate so they apply to all queries,
> but then realized the ignore words should only apply to relevant (matching) 
> queries.
>
> I don’t want the users to be required to add ‘and not WORD’ many times to 
> each of the listed queries.
>
> David Shifflett
>



Re: How to ignore certain words based on query specifics

2019-07-09 Thread Shifflett, David [USA]
Sorry for the weird reply path, but I couldn’t find an easy reply method via 
the list archive.

Anyway …

The use case is as follows:
Allow the user to specify queries such as ‘free*’
and also include similar words to be ignored, such as freedom.
Another example would be ‘secret*’ and secretary.

I want to keep the ignore words separate so they apply to all queries,
but then realized the ignore words should only apply to relevant (matching) 
queries.

I don’t want the users to be required to add ‘and not WORD’ many times to each 
of the listed queries.

David Shifflett

From: Diego Ceccarelli

Could you please describe the use case? Maybe there is an easier solution.






Re: How to ignore certain words based on query specifics

2019-07-09 Thread Diego Ceccarelli (BLOOMBERG/ LONDON)
Could you please describe the use case? Maybe there is an easier solution.





How to ignore certain words based on query specifics

2019-07-09 Thread Shifflett, David [USA]
Hi all,
I have a configuration file that lists multiple queries, of all different types,
and that lists words to be ignored.

Each of these lists is user configured, variable in length and content.

I know that, in general, unless the ignore word is in the query it won’t match,
but I need to be able to handle wildcard, fuzzy, and regex queries, which might
match.

What I need to be able to do is ignore the words in the ignore list,
but only when they match terms the query would match.

For example: if the query is ‘free*’ and ‘freedom’ should be ignored,
I could modify the query to be ‘free*’ and not freedom.

But if ‘liberty’ is also to be ignored, I don’t want to add ‘and not liberty’
to that query, because that could produce false negatives for documents
containing free and liberty.

I think what I need to do is:
for each query
  for each ignore word
    if the query would match the ignore word,
      add ‘and not ignore word’ to the query

How can I test if a query would match an ignore word without putting the
ignore words into an index and searching the index?
This seems like overkill.

To make matters worse, for a query like A and B and C,
this won’t match an index of ignore words that contains C, but not A or B.
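
For the single-term wildcard case, the loop above can be sketched without any index. This is plain Java, not Lucene API: the class IgnoreWordExpander and its methods are hypothetical names, the wildcard is translated to a java.util.regex pattern by hand, and "AND NOT" is simply the 'and not WORD' wording from this message written as a query string. Fuzzy and regex queries would need their own matchers, and multi-clause queries such as A and B and C are not handled:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
final class IgnoreWordExpander {

    // Translates a wildcard term (* = any run of characters, ? = one character)
    // into an equivalent regular expression; every other character is quoted.
    static Pattern wildcardToPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Appends a NOT clause only for the ignore words the wildcard would match.
    static String expand(String wildcard, List<String> ignoreWords) {
        Pattern pattern = wildcardToPattern(wildcard);
        List<String> clauses = new ArrayList<>();
        for (String word : ignoreWords) {
            if (pattern.matcher(word).matches()) {
                clauses.add(word);
            }
        }
        StringBuilder query = new StringBuilder(wildcard);
        for (String word : clauses) {
            query.append(" AND NOT ").append(word);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        List<String> ignore = Arrays.asList("freedom", "liberty", "secretary");
        // 'liberty' and 'secretary' do not match free*, so they are left out.
        System.out.println(expand("free*", ignore)); // prints: free* AND NOT freedom
    }
}
```

Note that an expanded query of this form excludes every document containing the ignore word, including documents that also contain another matching term such as 'free'.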

Thanks in advance for any suggestions or advice,
David Shifflett



Re: Lucene Index Cloud Replication

2019-07-09 Thread Michael McCandless
+1 to sharing code for 1) and 3), both of which are tricky!

Safely moving / copying bytes around is a notoriously difficult problem ...
but Lucene's "end to end checksums" and per-segment-file-GUID make this
safer.

I think Lucene's replicator module is a good place for this?

Mike McCandless

http://blog.mikemccandless.com


On Wed, Jul 3, 2019 at 4:15 PM Michael Froh wrote:

> Hi there,
>
> I was talking with Varun at Berlin Buzzwords a couple of weeks ago about
> storing and retrieving Lucene indexes in S3, and realized that "uploading a
> Lucene directory to the cloud and downloading it on other machines" is a
> pretty common problem and one that's surprisingly easy to do poorly. In my
> current job, I'm on my third team that needed to do this.
>
> In my experience, there are three main pieces that need to be implemented:
>
> 1. Uploading/downloading individual files (i.e. the blob store), which can
> be eventually consistent if you write once.
> 2. Describing the metadata for a specific commit point (basically what the
> Replicator module does with the "Revision" class). In particular, we want a
> downloader to reliably be able to know if they already have specific files
> (and don't need to download them again).
> 3. Sharing metadata with some degree of consistency, so that multiple
> writers don't clobber each other's metadata, and so readers can discover
> the metadata for the latest commit/revision and trust that they'll
> (eventually) be able to download the relevant files.
>
> I'd like to share what I've got for 1 and 3, based on S3 and DynamoDB, but
> I'd like to do it with interfaces that lend themselves to other
> implementations for blob and metadata storage.
>
> Is it worth opening a Jira issue for this? Is this something that would
> benefit the Lucene community?
>
> Thanks,
> Michael Froh
>
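
For illustration only, the three pieces listed above might be separated along these lines. This is not Michael Froh's code and not the replicator module's API: every name here (BlobStore, CommitMetadata, MetadataStore and the in-memory classes) is hypothetical, and the in-memory implementations merely stand in for S3 and DynamoDB.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Piece 1 (hypothetical): write-once blob storage.
interface BlobStore {
    // Returns false if the name already exists, so blobs are never overwritten.
    boolean putIfAbsent(String name, byte[] bytes);
    byte[] get(String name); // null if absent
}

// Piece 2 (hypothetical): metadata for one commit point.
final class CommitMetadata {
    final long version;
    final Set<String> fileNames;

    CommitMetadata(long version, Set<String> fileNames) {
        this.version = version;
        this.fileNames = Collections.unmodifiableSet(new HashSet<>(fileNames));
    }

    // Files a downloader still needs, given what it already has locally.
    Set<String> filesToDownload(Set<String> alreadyHave) {
        Set<String> missing = new HashSet<>(fileNames);
        missing.removeAll(alreadyHave);
        return missing;
    }
}

// Piece 3 (hypothetical): metadata sharing with a conditional write.
interface MetadataStore {
    CommitMetadata latest(); // null if nothing has been published
    // Succeeds only if expectedCurrent is still the latest published metadata.
    boolean publish(CommitMetadata expectedCurrent, CommitMetadata next);
}

final class InMemoryBlobStore implements BlobStore {
    private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();

    public boolean putIfAbsent(String name, byte[] bytes) {
        return blobs.putIfAbsent(name, bytes.clone()) == null;
    }

    public byte[] get(String name) {
        byte[] bytes = blobs.get(name);
        return bytes == null ? null : bytes.clone();
    }
}

final class InMemoryMetadataStore implements MetadataStore {
    private final AtomicReference<CommitMetadata> current = new AtomicReference<>();

    public CommitMetadata latest() {
        return current.get();
    }

    public boolean publish(CommitMetadata expectedCurrent, CommitMetadata next) {
        // Compare-and-set on the reference: a writer holding stale metadata loses.
        return current.compareAndSet(expectedCurrent, next);
    }
}
```

The conditional publish is one way to keep concurrent writers from clobbering each other's metadata; a real metadata store would need its own equivalent of that conditional write.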