Re: analyzer context during search

2018-04-13 Thread Chris Tomlinson
Hi,

Thanks for the thoughts. I agree that a combinatorial explosion of fields and 
index size would “solve” the problem, but the cost is rather absurd. Hence, I 
posed the problem to prompt some discussion about what a plausible/reasonable 
solution might be.

It has seemed to me for some time that there really should be an extension of 
the Analyzer API to include a generic argument, say an abstract class 
AnalyzerContext, that could optionally be supplied via IndexWriter and 
IndexSearcher to carry useful context information from the caller.

This would require threading the parameter throughout, much as was done 
versions ago with the Version argument.
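
To make the idea concrete, here is the rough shape I have in mind. This is 
purely hypothetical: neither class nor the two-argument getWrappedAnalyzer() 
exists in Lucene today.

    import org.apache.lucene.analysis.Analyzer;

    // Hypothetical sketch only; nothing below exists in Lucene.
    abstract class AnalyzerContext {
        /** e.g. the encoding tag the caller knows applies to the text, such as "zh-pinyin" */
        public abstract String inputTag();
    }

    abstract class ContextAwareAnalyzerWrapper extends Analyzer {
        /** A two-argument variant of DelegatingAnalyzerWrapper.getWrappedAnalyzer(). */
        protected abstract Analyzer getWrappedAnalyzer(String fieldName, AnalyzerContext context);
    }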

Another approach might be to instantiate a new Analyzer on each use of (at 
least) IndexSearcher, so that a custom analyzer carrying context information 
could be provided; however, the cost of frequently instantiating analyzers 
seems likely to be prohibitive.

LUCENE-8240 did not appear to me to point in the direction of a solution.

Thanks,
Chris


> On Apr 12, 2018, at 5:24 AM, Michael Sokolov <msoko...@gmail.com> wrote:
> 
> I think you can achieve what you are asking by having a field for every
> possible combination of pairs of input and output. Obviously this would
> explode the size of your index, so it's not ideal.
> 
> Another alternative would be indexing all variants into a single field,
> using different analyzers for different inputs. Doing this requires extra
> context when choosing the analyzer (or the token streams that it
> generates), as you say. See http://issues.apache.org/jira/browse/LUCENE-8240
> for one idea of how to accomplish this.
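
For concreteness, here is my reading of the single-field idea as a sketch 
(this is not the LUCENE-8240 patch; the analyzer registry is hypothetical):

    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.TextField;

    class SingleFieldVariants {
        // At index time the encoding of the source text is known, so the
        // matching analyzer (e.g. "zh-hans" -> a hans-to-pinyin analyzer)
        // can normalize everything into one unsuffixed "label" field.
        static Document makeDoc(String text, String encodingTag,
                                Map<String, Analyzer> analyzerPerEncoding) {
            Document doc = new Document();
            Analyzer a = analyzerPerEncoding.get(encodingTag);
            // Pre-analyzed field: the tokens come from the encoding-specific
            // analyzer, but they all land in the same "label" field.
            doc.add(new TextField("label", a.tokenStream("label", text)));
            return doc;
        }
    }

The unsolved part is the query side: the searcher still has to be told, out of 
band, which encoding the query string is in.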
> 
> 
> 
> On Wed, Apr 11, 2018, 9:34 AM Chris Tomlinson <chris.j.tomlin...@gmail.com>
> wrote:
> 
>> Hello,
>> 
>> I’m working on a project where it would be most helpful for
>> getWrappedAnalyzer() in an extension to DelegatingAnalyzerWrapper to have
>> access to more than just the fieldName.
>> 
>> The scenario is that we are working with several languages (Tibetan,
>> Sanskrit and Chinese), each of which has several encodings, e.g., Simplified
>> Chinese (zh-hans), Traditional Chinese (zh-hant), Pinyin with diacritics
>> (zh-pinyin) and Pinyin without diacritics (zh-pinyin-ndia). Our data comes
>> from many sources, each using a variety of encodings, and we wish to
>> preserve the original encodings used in the data.
>> 
>> For Chinese, for example, we have an analyzer that creates a TokenStream
>> of Pinyin with diacritics for any of the input encodings. Thus it is
>> possible in some situations to retrieve documents originally input as
>> zh-hans and so on.
>> 
>> The same applies to the other languages.
>> 
>> One objective is to allow the user to input a query in zh-pinyin, for
>> example, and to retrieve documents that were originally indexed in any of
>> the variant encodings.
>> 
>> The current scheme, in Apache Jena + Lucene, is to create a fieldName that
>> includes the original name plus a language tag, e.g., label_zh-hans, so
>> that the getWrappedAnalyzer() can then retrieve a registered analyzer for
>> zh-hans that will then index using Pinyin tokens as mentioned above.
>> 
>> For Chinese, we end up with documents that have four different fields:
>> label_zh-hans, label_zh-hant, label_zh-pinyin, and label_zh-pinyin-ndia, so
>> that at indexing time we know which input encoding was used and can choose
>> an appropriate analyzer configuration, since the analyzer has to be aware of
>> the incoming encoding.
>> 
>> At search time we could try a search like:
>> 
>>(label_zh-hans:a-query-in-pinyin OR label_zh-hant:a-query-in-pinyin OR
>> label_zh-pinyin:a-query-in-pinyin OR label_zh-pinyin-ndia:a-query-in-pinyin)
>> 
>> But this cannot work: the information that the query is in zh-pinyin is
>> not available to getWrappedAnalyzer(); only the original encoding is
>> available, as part of the field name, so the query string cannot be
>> tokenized correctly when querying the other fields.
>> 
>> I’m probably over-thinking things, but it seems to me that what I need is a
>> way of accessing additional context when choosing an analyzer, so that the
>> information that the query string is in pinyin would be available alongside
>> the field names as usual.
>> 
>> I don’t see how a custom query analyzer would help here. We would know that
>> the context of the call to the analyzer wrapper was querying rather than
>> indexing, but we would still know only the field name, not the encoding of
>> the query.
>> 
>> I imagine this sort of scenario has been solved by others numerous times,
>> but I’m stumped as to how to implement it.
>> 
>> Thanks in advance for any help,
>> Chris
>> 
>> 





analyzer context during search

2018-04-11 Thread Chris Tomlinson
Hello,

I’m working on a project where it would be most helpful for 
getWrappedAnalyzer() in an extension to DelegatingAnalyzerWrapper to have 
access to more than just the fieldName.

The scenario is that we are working with several languages (Tibetan, Sanskrit 
and Chinese), each of which has several encodings, e.g., Simplified Chinese 
(zh-hans), Traditional Chinese (zh-hant), Pinyin with diacritics (zh-pinyin) 
and Pinyin without diacritics (zh-pinyin-ndia). Our data comes from many 
sources, each using a variety of encodings, and we wish to preserve the 
original encodings used in the data.

For Chinese, for example, we have an analyzer that creates a TokenStream of 
Pinyin with diacritics for any of the input encodings. Thus it is possible in 
some situations to retrieve documents originally input as zh-hans and so on.

The same applies to the other languages.

One objective is to allow the user to input a query in zh-pinyin, for example, 
and to retrieve documents that were originally indexed in any of the variant 
encodings.

The current scheme, in Apache Jena + Lucene, is to create a fieldName that 
includes the original name plus a language tag, e.g., label_zh-hans, so that 
the getWrappedAnalyzer() can then retrieve a registered analyzer for zh-hans 
that will then index using Pinyin tokens as mentioned above.
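
For concreteness, a minimal sketch of that scheme (the class and the analyzer 
registry are simplifications of our own, not the actual Jena integration code):

    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.DelegatingAnalyzerWrapper;

    class LangTagAnalyzerWrapper extends DelegatingAnalyzerWrapper {
        private final Map<String, Analyzer> byLangTag; // e.g. "zh-hans" -> hans-to-pinyin analyzer
        private final Analyzer fallback;

        LangTagAnalyzerWrapper(Map<String, Analyzer> byLangTag, Analyzer fallback) {
            super(PER_FIELD_REUSE_STRATEGY);
            this.byLangTag = byLangTag;
            this.fallback = fallback;
        }

        @Override
        protected Analyzer getWrappedAnalyzer(String fieldName) {
            // The field name is the only context available here:
            // "label_zh-hans" -> "zh-hans".
            int sep = fieldName.indexOf('_');
            if (sep >= 0) {
                Analyzer a = byLangTag.get(fieldName.substring(sep + 1));
                if (a != null) {
                    return a;
                }
            }
            return fallback;
        }
    }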

For Chinese, we end up with documents that have four different fields: 
label_zh-hans, label_zh-hant, label_zh-pinyin, and label_zh-pinyin-ndia, so 
that at indexing time we know which input encoding was used and can choose an 
appropriate analyzer configuration, since the analyzer has to be aware of the 
incoming encoding.

At search time we could try a search like:

(label_zh-hans:a-query-in-pinyin OR label_zh-hant:a-query-in-pinyin OR 
label_zh-pinyin:a-query-in-pinyin OR label_zh-pinyin-ndia:a-query-in-pinyin)

But this cannot work: the information that the query is in zh-pinyin is not 
available to getWrappedAnalyzer(); only the original encoding is available, as 
part of the field name, so the query string cannot be tokenized correctly when 
querying the other fields.
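
If we controlled query construction directly (inside Jena we do not), the 
intent could at least be expressed by parsing every field's clause with the 
analyzer matching the query's encoding rather than the field's. A sketch, with 
names of our own invention:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;

    class PinyinQueryBuilder {
        // One SHOULD clause per encoding-specific field, all tokenized by the
        // analyzer for the *query's* encoding (here: pinyin).
        static Query pinyinQuery(String queryText, Analyzer pinyinAnalyzer,
                                 String... fields) throws ParseException {
            BooleanQuery.Builder b = new BooleanQuery.Builder();
            for (String field : fields) {
                QueryParser qp = new QueryParser(field, pinyinAnalyzer);
                b.add(qp.parse(QueryParser.escape(queryText)), BooleanClause.Occur.SHOULD);
            }
            return b.build();
        }
    }

But that only works where the application, not the analyzer wrapper, gets to 
choose; the wrapper itself still sees nothing but the field name.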

I’m probably over-thinking things, but it seems to me that what I need is a way 
of accessing additional context when choosing an analyzer, so that the 
information that the query string is in pinyin would be available alongside the 
field names as usual.

I don’t see how a custom query analyzer would help here. We would know that the 
context of the call to the analyzer wrapper was querying rather than indexing, 
but we would still know only the field name, not the encoding of the query.

I imagine this sort of scenario has been solved by others numerous times, but 
I’m stumped as to how to implement it.

Thanks in advance for any help,
Chris



Re: What is the proper use of stop words in Lucene?

2014-04-28 Thread Chris Tomlinson
Hello Uwe,

Thank you for the reply. I see that there is a version check on the use of 
setEnablePositionIncrements(false), and I think I may be able to use an earlier 
API with the eXist-db embedding of Lucene 4.4 to avoid the version check.

However, my question was intended to improve my understanding of how to 
properly use stop words and/or how to properly achieve the use case that I 
outlined.

My naive understanding of the purpose of stop words is:

to remove from indexing words that are not helpful in discriminating or 
selecting documents since they occur so frequently.

The use case that I intended to illustrate is meant to ignore the occurrence or 
non-occurrence of stop words in a query w.r.t. selection of documents during 
search.

As I understand the situation currently, occurrences of stop words in a query 
phrase are replaced by ?s to indicate the presence of an otherwise 
unspecified word in the query. So the phrase:

blue is the moon

with "is" and "the" as stop words, would be indexed effectively as:

blue ? ? moon

and the query phrase:

blue was a moon

would be treated as:

blue ? ? moon

and would retrieve a document containing:

blue is the moon

But in the use case that I presented we really want the query:

blue moon

to also select the document, without the user having to indicate the possible 
presence of stop words.
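
A small self-contained demonstration of the gap behavior as I understand it 
(Lucene 4.4-era APIs; the whitespace tokenizer is just for illustration):

    import java.io.StringReader;
    import java.util.Arrays;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.analysis.util.CharArraySet;
    import org.apache.lucene.util.Version;

    public class StopGapDemo {
        public static void main(String[] args) throws Exception {
            CharArraySet stops =
                new CharArraySet(Version.LUCENE_44, Arrays.asList("is", "the"), true);
            TokenStream ts =
                new WhitespaceTokenizer(Version.LUCENE_44, new StringReader("blue is the moon"));
            ts = new StopFilter(Version.LUCENE_44, ts, stops);
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute inc = ts.addAttribute(PositionIncrementAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // Expect: "blue (+1)" then "moon (+3)"; the +3 is the
                // two-position gap left by the removed stop words.
                System.out.println(term + " (+" + inc.getPositionIncrement() + ")");
            }
            ts.end();
            ts.close();
        }
    }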

So my question is:

How is one supposed to achieve this use case in Lucene 4.4+?

Thank you,
Chris




On Apr 24, 2014, at 5:52 AM, Uwe Schindler u...@thetaphi.de wrote:

 Hi,
 
 You can still change the setting on the TokenFilter after creating it: 
 StopFilter#setEnablePositionIncrements(false) - this method was *not* removed!
 This fails only if you pass matchVersion=Version.LUCENE_44. Just use an 
 older matchVersion parameter to the constructor and you can still enable this 
 broken behavior (for backwards compatibility).
 
 This is no longer officially supported, but can be a workaround. To me it 
 looks like you misunderstood stopwords.
 
 Uwe
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
 
 -Original Message-
 From: Tincu Gabriel [mailto:tincu.gabr...@gmail.com]
 Sent: Thursday, April 24, 2014 12:27 PM
 To: java-user@lucene.apache.org
 Subject: Re: What is the proper use of stop words in Lucene?
 
 Hi there,
 The StopFilterFactory can be used to produce StopFilters with the desired
 stop-words inside of it. As a constructor argument it takes a
 Map<String,String>, and one of the valid keys you can pass inside of that is
 enablePositionIncrements. If you don't pass that in, then it defaults to
 true.
 Is this what you were looking for?
 
 
 On Wed, Apr 23, 2014 at 12:36 PM, Chris Tomlinson 
 chris.j.tomlin...@gmail.com wrote:
 
 Hello,
 
 I've written several times now on the list with this question/problem, and
 no one has yet replied, so I don't know if the question is too wrong-headed
 or if there is simply no one reading the list who can comment on it.
 
 The question that I'm trying to get answered is: what is the correct
 way of ignoring stop word gaps in Lucene 4.4+?
 
 While we are using Lucene 4.4 embedded in eXist-db (exist-db.org), I
 think the question is a proper Lucene question and really has nothing
 to do with the fact that we're using it in an embedded manner.
 
 The problem to be solved is how to ignore stop word gaps in queries -
 without the user having to indicate where such gaps might occur at
 query time.
 
 Since Lucene 4.4, FilteringTokenFilter.setEnablePositionIncrements(false)
 is no longer available. None of the resources, such as Lucene in Action,
 explain how to use Lucene to get the desired effect now that 4.4+ has
 removed the previous approach.
 
 Prior to Lucene 4.4 it was possible to call
 setEnablePositionIncrements(false)
 so that during indexing and querying the number and position of stop
 word gaps would be ignored (as mentioned on pp. 138-139 of Lucene in
 Action).
 
 This meant that a document with a phrase such as:
 
   blue is the sky
 
 with stop words "is" and "the" would be selected by the query:
 
   blue sky
 
 This is what we want to achieve.
 
 Why? We are working with Tibetan and elisions are not uncommon so
 that,
 e.g.:
 
   rin po che
 
 on some occasions might be shortened to
 
   rin che
 
 and we would like to have a query of
 
   rin po che
 
 or
 
   rin che
 
 find all occurrences of
 
   rin po che
 
 and
 
   rin che
 
 without the user having to mark where elisions might occur.
 
 The
 org.apache.lucene.queryparser.flexible.standard.CommonQueryParserConfiguration
 provides setEnablePositionIncrements, but that does not seem to allow the
 desired query behavior that was possible prior to Lucene 4.4.
 
 What is the proper way to ignore stop word gaps?
 
 Thank you,
 Chris

Re: What is the proper use of stop words in Lucene?

2014-04-28 Thread Chris Tomlinson
Hi,

On Apr 28, 2014, at 11:45 AM, Uwe Schindler u...@thetaphi.de wrote:

 Hello Uwe,
 
 Thank you for the reply. I see that there is a version check on the use of
 setEnablePositionIncrements(false), and I think I may be able to use an
 earlier API with the eXist-db embedding of Lucene 4.4 to avoid the version
 check.
 
 Hi,
 
 you don't need an older version of the Lucene library. It is enough to pass 
 the constant, also with Lucene 4.7 or 4.8 (about to be released):
 sf = new StopFilter(Version.LUCENE_43, ...);
 sf.setEnablePositionIncrements(false);
 
 The version constant exists exactly so that components that changed in an 
 incompatible way can still be used with later versions, preserving 
 index/behavior compatibility.

Thank you for the explanation.
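
For the record, here is how I read the suggestion as a complete analyzer (my 
own sketch; the class name and tokenizer choice are mine, not from this thread):

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.util.CharArraySet;
    import org.apache.lucene.util.Version;

    public class GaplessStopAnalyzer extends Analyzer {
        private final CharArraySet stops;

        public GaplessStopAnalyzer(CharArraySet stops) {
            this.stops = stops;
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_43, reader);
            // Passing the pre-4.4 constant keeps the deprecated setter usable.
            StopFilter stop = new StopFilter(Version.LUCENE_43, source, stops);
            stop.setEnablePositionIncrements(false); // removed stop words leave no gaps
            return new TokenStreamComponents(source, stop);
        }
    }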


 About stop words: what you are doing is not really a stop words use case. The 
 main reasons for stop words are the following:
 - Stop words are in almost every document, so it makes no sense to query for 
 them.

This was my understanding.


 - The only relevant information behind the stop word is that there was a word 
 at this position.

I didn't realize that this was a necessary aspect. I can certainly understand 
that it may be relevant in some (most) cases, and it makes sense to me that it 
would be appropriate to always preserve the information when indexing. I was 
looking for a solution that would essentially work at query time, and had 
initially thought that CommonQueryParserConfiguration#setEnablePositionIncrements() 
was intended to work this way, but it does not.


 If the second item were not taken care of, this information would be lost, 
 too.
 
 If every document really contains a specific stop word (which is almost 
 always the case), there must be no difference between running a phrase query 
 that mentions the stop word against an index with all stop words indexed and 
 against one with stop words left out. This can only be achieved if the stop 
 word reserves a position.
 
 What you intend to do is not a stopword use case. You want to ignore some 
 words - Lucene has no support for this, because in natural language 
 processing this makes no sense.

Thank you for the information. I was unaware that ignoring some words makes no 
sense. I thought I gave a reasonable example of exactly this situation in the 
natural-language processing of Tibetan. Perhaps I am still not understanding.


 One way to do this is to:
 a) write your own TokenFilter, violating the TokenStream contracts
 b) use the backwards compatibility layer with matchVersion=LUCENE_43
 c) maybe remove the words before tokenizing (e.g. MappingCharFilter, mapping 
 the ignored words to the empty string)

Thank you for these useful approaches to solving the use case.
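
For my own notes, option (c) might look something like this (my sketch; the 
mapping is illustrative only and far too crude for real Tibetan text):

    import java.io.Reader;
    import java.io.StringReader;

    import org.apache.lucene.analysis.charfilter.MappingCharFilter;
    import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

    public class ElisionMapDemo {
        public static void main(String[] args) throws Exception {
            NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
            // Erase the elidable syllable before the tokenizer ever sees it,
            // keeping a single separator.
            builder.add(" po ", " ");
            NormalizeCharMap map = builder.build();
            Reader filtered = new MappingCharFilter(map, new StringReader("rin po che"));
            // Hand `filtered` to the Tokenizer instead of the raw reader:
            // "rin po che" now tokenizes identically to "rin che", so no
            // position gap ever exists.
            int c;
            while ((c = filtered.read()) != -1) {
                System.out.print((char) c); // prints: rin che
            }
            System.out.println();
        }
    }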

ciao,
Chris



 
 Uwe
 





Re: What is the proper use of stop words in Lucene?

2014-04-28 Thread Chris Tomlinson

On Apr 28, 2014, at 3:36 PM, Uwe Schindler u...@thetaphi.de wrote:

 Hi,
 
 What you intend to do is not a stopword use case. You want to ignore
 some words - Lucene has no support for this, because in natural language
 processing this makes no sense.
 
 Thank you for the information. I was unaware that ignoring some words
 makes no sense. I thought I gave a reasonable example of exactly this
 situation in the natural-language processing of Tibetan. Perhaps I am still
 not understanding.
 
 Elisions are a bit different from stopwords (although I don't know about them 
 in the Tibetan language). The Tokenizer should *not* split elisions from the 
 terms (initially the term is the full word including the elision). In most 
 languages those are separated by (for example) an apostrophe (e.g. French: le 
 + arbre → l’arbre). The Tokenizer would keep those parts together (l’arbre). 
 A later TokenFilter would then edit the token and remove the elision (if 
 needed): arbre. This is how the French Analyzer in Lucene works.

Tibetan has no markers for elisions. They can be quite idiosyncratic to a 
school or tradition. It seems that the most flexible approach is to ignore stop 
words and their positions, however incorrect that may be; otherwise one gets 
into ever more complex analysis that may not yield cost-effective results. 
We'll work with the pre-4.4 setEnablePositionIncrements(false) approach for now.

Tibetan has no sentence or phrase markers per se. There are no word boundary 
markers. Essentially, Tibetan is a sequence of syllables with occasional 
markers indicating some break in a thought or the like.


 Lucene currently does not have a Tibetan Analyzer, so you have to make your 
 own (I think this is what you tried to do). You should carefully choose the 
 Tokenizer and add something like a TibetanElisionFilter that removes the 
 unwanted parts from the tokens.

We have developed a pair of analyzers and associated filters and tokenizers for 
both Tibetan Unicode and the Extended Wylie transliteration system. If there is 
interest we will be happy to donate this work to Apache Lucene. This includes 
paying attention to the myriad punctuation characters, stemming and so on.

Thank you,
Chris

 
 Uwe
 





What is the proper use of stop words in Lucene?

2014-04-23 Thread Chris Tomlinson
Hello,

I've written several times now on the list with this question/problem, and no 
one has yet replied, so I don't know if the question is too wrong-headed or if 
there is simply no one reading the list who can comment on it.

The question that I'm trying to get answered is: what is the correct way of 
ignoring stop word gaps in Lucene 4.4+?

While we are using Lucene 4.4 embedded in eXist-db (exist-db.org), I think the 
question is a proper Lucene question and really has nothing to do with the fact 
that we're using it in an embedded manner.

The problem to be solved is how to ignore stop word gaps in queries - without 
the user having to indicate where such gaps might occur at query time.

Since Lucene 4.4, FilteringTokenFilter.setEnablePositionIncrements(false) is no 
longer available. None of the resources, such as Lucene in Action, explain how 
to use Lucene to get the desired effect now that 4.4+ has removed the previous 
approach.

Prior to Lucene 4.4 it was possible to call setEnablePositionIncrements(false) 
so that during indexing and querying the number and position of stop word gaps 
would be ignored (as mentioned on pp. 138-139 of Lucene in Action).

This meant that a document with a phrase such as:

   blue is the sky

with stop words "is" and "the" would be selected by the query:

   blue sky

This is what we want to achieve. 

Why? We are working with Tibetan and elisions are not uncommon so that, e.g.:

   rin po che

on some occasions might be shortened to

   rin che

and we would like to have a query of

   rin po che

or

   rin che

find all occurrences of

   rin po che

and

   rin che

without the user having to mark where elisions might occur.

The 
org.apache.lucene.queryparser.flexible.standard.CommonQueryParserConfiguration 
provides setEnablePositionIncrements, but that does not seem to allow the 
desired query behavior that was possible prior to Lucene 4.4.
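
The only other query-side idea I can see is phrase slop, which is too blunt 
(a sketch, Lucene 4.x API; the field name is illustrative):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    class SloppyPhrase {
        static PhraseQuery blueSky() {
            PhraseQuery pq = new PhraseQuery();
            pq.add(new Term("content", "blue"));
            pq.add(new Term("content", "sky"));
            // With gaps indexed, "sky" sits at position 3 in "blue is the sky",
            // so the exact phrase (slop 0) misses. Slop 2 tolerates the
            // two-position gap, but it also admits reordered terms and real
            // intervening words.
            pq.setSlop(2);
            return pq;
        }
    }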

What is the proper way to ignore stop word gaps?

Thank you,
Chris





What is the proper way to ignore stop words in queries with Lucene 4.4+

2014-04-15 Thread Chris Tomlinson
Hello,

We're using Lucene 4.4 embedded in eXist-db (exist-db.org), and as the subject 
indicates we want to ignore stop word gaps in queries - without the user having 
to indicate where such gaps might occur at query time.

Since Lucene 4.4, FilteringTokenFilter.setEnablePositionIncrements(false) is no 
longer available.

Prior to Lucene 4.4 it was possible to call setEnablePositionIncrements(false) 
so that during indexing and querying the number and position of stop word gaps 
would be ignored.

This meant that a phrase such as:

   blue is the sky

with stop words "is" and "the" would be selected by the query:

   blue sky

We are working with Tibetan and elisions are not uncommon so that, e.g.:

   rin po che

on some occasions might be shortened to

   rin che

and we would like to have a query of

   rin po che

or

   rin che

find all occurrences of

   rin po che

and

   rin che

without the user having to mark where elisions might occur.

The 
org.apache.lucene.queryparser.flexible.standard.CommonQueryParserConfiguration 
provides setEnablePositionIncrements, but that does not seem to allow the 
desired query behavior that was possible prior to Lucene 4.4.

What is the proper way to ignore stop word gaps?

Thank you,
Chris

How to ignore stop word gaps in queries? Lucene 4.4+

2014-04-10 Thread Chris Tomlinson
Hello,

We're using Lucene 4.4 embedded in eXist-db (exist-db.org), and as the subject 
indicates we want to ignore stop word gaps in queries - without the user having 
to indicate where such gaps might occur at query time.

Since Lucene 4.4, FilteringTokenFilter.setEnablePositionIncrements(false) is no 
longer available.

Prior to Lucene 4.4 it was possible to call setEnablePositionIncrements(false) 
so that during indexing and querying the number and position of stop word gaps 
would be ignored.

This meant that a phrase such as:

blue is the sky

with stop words "is" and "the" would be selected by the query:

blue sky

We are working with Tibetan and elisions are not uncommon so that, e.g.:

rin po che

on some occasions might be shortened to

rin che

and we would like to have a query of

rin po che

or

rin che

find all occurrences of

rin po che

and

rin che

without the user having to mark where elisions might occur.

The 
org.apache.lucene.queryparser.flexible.standard.CommonQueryParserConfiguration 
provides setEnablePositionIncrements, but that does not seem to allow the 
desired query behavior that was possible prior to Lucene 4.4.

What is the proper way to ignore stop word gaps?

Thank you,
Chris

