Re: analyzer context during search
Hi,

Thanks for the thoughts. I agree a combinatorial explosion of fields and index size would “solve” the problem, but the cost is rather absurd. Hence, I posed the problem to prompt some discussion about what a plausible/reasonable solution might be.

It has seemed to me for some time that there really should be an extension of the Analyzer API to include a generic argument of an abstract class AnalyzerContext that could optionally be used via IndexWriter and IndexSearcher to supply useful context information from the caller. This would require threading the parameter throughout, much as was done versions ago with the Version argument.

Another approach might be to instantiate an Analyzer on each use of at least IndexSearcher, so that a custom analyzer with context information could be provided; however, the cost of frequent instantiation of analyzers seems likely to be non-performant.

LUCENE-8240 did not appear to me to be a solution direction.

Thanks,
Chris

> On Apr 12, 2018, at 5:24 AM, Michael Sokolov <msoko...@gmail.com> wrote:
>
> I think you can achieve what you are asking by having a field for every
> possible combination of pairs of input and output. Obviously this would
> explode the size of your index, so it's not ideal.
>
> Another alternative would be indexing all variants into a single field,
> using different analyzers for different inputs. Doing this requires extra
> context when choosing the analyzer (or the token streams that it
> generates), as you say. See http://issues.apache.org/jira/browse/LUCENE-8240
> for one idea of how to accomplish this.
>
> On Wed, Apr 11, 2018, 9:34 AM Chris Tomlinson <chris.j.tomlin...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I’m working on a project where it would be most helpful for
>> getWrappedAnalyzer() in an extension to DelegatingAnalyzerWrapper to have
>> access to more than just the fieldName.
>>
>> The scenario is that we are working with several languages: Tibetan,
>> Sanskrit and Chinese, each of which has several encodings, e.g., Simplified
>> Chinese (zh-hans), Traditional Chinese (zh-hant), Pinyin with diacritics
>> (zh-pinyin) and Pinyin without diacritics (zh-pinyin-ndia). Our data comes
>> from many sources, each of which uses a variety of encodings, and we wish
>> to preserve the original encodings used in the data.
>>
>> For Chinese, for example, we have an analyzer that creates a TokenStream
>> of Pinyin with diacritics for any of the input encodings. Thus it is
>> possible in some situations to retrieve documents originally input as
>> zh-hans and so on. The same applies to the other languages.
>>
>> One objective is to allow the user to input a query in zh-pinyin, for
>> example, and to retrieve documents that were originally indexed in any of
>> the variant encodings.
>>
>> The current scheme, in Apache Jena + Lucene, is to create a fieldName that
>> includes the original name plus a language tag, e.g., label_zh-hans, so
>> that getWrappedAnalyzer() can then retrieve a registered analyzer for
>> zh-hans that will index using Pinyin tokens as mentioned above.
>>
>> For Chinese, we end up with documents that have four different fields:
>> label_zh-hans, label_zh-hant, label_zh-pinyin, and label_zh-pinyin-ndia.
>> When indexing, we thus know what input encoding was used, so that an
>> appropriate analyzer configuration can be chosen, since the analyzer has
>> to be aware of the incoming encoding.
>>
>> At search time we could try a search like:
>>
>>    (label_zh-hans:a-query-in-pinyin OR label_zh-hant:a-query-in-pinyin OR
>>    label_zh-pinyin:a-query-in-pinyin OR label_zh-pinyin-ndia:a-query-in-pinyin)
>>
>> But this cannot work, since the information that the query is in zh-pinyin
>> is not available to getWrappedAnalyzer(); only the original encoding is
>> available, as part of the field name, and so it is not possible to know
>> that the query string is in zh-pinyin so that it is tokenized correctly
>> when querying the other fields.
>>
>> I’m probably over-thinking things, but it seems to me that the problem
>> would be solved if I had a way of accessing additional context when
>> choosing an analyzer, so that the information that the query string is in
>> Pinyin would be available along with the field names as usual.
>>
>> I don’t see how a custom query analyzer would help here. We would know
>> that the context of the call to the analyzer wrapper was for querying
>> versus indexing, but we still wouldn’t know the encoding of the query as
>> distinct from the field name.
>>
>> I imagine this sort of scenario has been solved by others numerous times,
>> but I’m stumped as to how to implement it.
>>
>> Thanks in advance for any help,
>> Chris
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
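The field-name dispatch discussed in this thread can be modeled without any Lucene dependency. The sketch below is a schematic stand-in for an extension of DelegatingAnalyzerWrapper whose getWrappedAnalyzer(fieldName) recovers the language tag from the field name and looks up a registered analyzer. The names here (AnalyzerRegistry, tagOf, forField) are illustrative, and a per-language "analyzer" is reduced to a plain function; the point is that, as the thread notes, only the field name reaches the lookup.

```java
import java.util.Map;
import java.util.function.Function;

// Schematic model (no Lucene dependency) of choosing an analyzer from the
// field name alone, as a DelegatingAnalyzerWrapper extension would do in
// getWrappedAnalyzer(). Class and method names are hypothetical.
public class AnalyzerRegistry {
    // Stand-in for per-language analyzers: a function from raw text to tokens.
    private final Map<String, Function<String, String>> byTag;

    public AnalyzerRegistry(Map<String, Function<String, String>> byTag) {
        this.byTag = byTag;
    }

    // "label_zh-hans" -> "zh-hans": the part after the first '_' is the tag.
    public static String tagOf(String fieldName) {
        int i = fieldName.indexOf('_');
        return i < 0 ? "" : fieldName.substring(i + 1);
    }

    // Mirrors getWrappedAnalyzer(fieldName): only the field name is
    // available here, which is exactly the limitation discussed above.
    public Function<String, String> forField(String fieldName) {
        return byTag.getOrDefault(tagOf(fieldName), s -> s);
    }
}
```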
analyzer context during search
Hello,

I’m working on a project where it would be most helpful for getWrappedAnalyzer() in an extension to DelegatingAnalyzerWrapper to have access to more than just the fieldName.

The scenario is that we are working with several languages: Tibetan, Sanskrit and Chinese, each of which has several encodings, e.g., Simplified Chinese (zh-hans), Traditional Chinese (zh-hant), Pinyin with diacritics (zh-pinyin) and Pinyin without diacritics (zh-pinyin-ndia). Our data comes from many sources, each of which uses a variety of encodings, and we wish to preserve the original encodings used in the data.

For Chinese, for example, we have an analyzer that creates a TokenStream of Pinyin with diacritics for any of the input encodings. Thus it is possible in some situations to retrieve documents originally input as zh-hans and so on. The same applies to the other languages.

One objective is to allow the user to input a query in zh-pinyin, for example, and to retrieve documents that were originally indexed in any of the variant encodings.

The current scheme, in Apache Jena + Lucene, is to create a fieldName that includes the original name plus a language tag, e.g., label_zh-hans, so that getWrappedAnalyzer() can then retrieve a registered analyzer for zh-hans that will index using Pinyin tokens as mentioned above.

For Chinese, we end up with documents that have four different fields: label_zh-hans, label_zh-hant, label_zh-pinyin, and label_zh-pinyin-ndia. When indexing, we thus know what input encoding was used, so that an appropriate analyzer configuration can be chosen, since the analyzer has to be aware of the incoming encoding.
At search time we could try a search like:

    (label_zh-hans:a-query-in-pinyin OR label_zh-hant:a-query-in-pinyin OR
    label_zh-pinyin:a-query-in-pinyin OR label_zh-pinyin-ndia:a-query-in-pinyin)

But this cannot work, since the information that the query is in zh-pinyin is not available to getWrappedAnalyzer(); only the original encoding is available, as part of the field name, and so it is not possible to know that the query string is in zh-pinyin so that it is tokenized correctly when querying the other fields.

I’m probably over-thinking things, but it seems to me that the problem would be solved if I had a way of accessing additional context when choosing an analyzer, so that the information that the query string is in Pinyin would be available along with the field names as usual.

I don’t see how a custom query analyzer would help here. We would know that the context of the call to the analyzer wrapper was for querying versus indexing, but we still wouldn’t know the encoding of the query as distinct from the field name.

I imagine this sort of scenario has been solved by others numerous times, but I’m stumped as to how to implement it.

Thanks in advance for any help,
Chris
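One workaround for getting caller context (e.g. "this query string is in zh-pinyin") down to an analyzer wrapper without changing the Analyzer API is a thread-local that the search code sets before parsing and that a getWrappedAnalyzer() override reads in addition to the field name. This is only a sketch, under the assumption that analysis runs on the same thread that set the context; the class and method names are hypothetical, and this is not the AnalyzerContext extension proposed in the thread.

```java
// Hypothetical thread-local channel for caller-supplied analyzer context.
// The search code sets the query encoding before running the query and
// clears it afterwards; a wrapper's getWrappedAnalyzer() could consult it.
public class QueryContext {
    private static final ThreadLocal<String> QUERY_ENCODING = new ThreadLocal<>();

    // Called by the search code before query parsing/analysis.
    public static void setQueryEncoding(String tag) { QUERY_ENCODING.set(tag); }

    // Called in a finally-block after the search completes.
    public static void clear() { QUERY_ENCODING.remove(); }

    // What a getWrappedAnalyzer() override could read alongside the field
    // name; null means "no caller context, fall back to the field tag".
    public static String queryEncoding() { return QUERY_ENCODING.get(); }
}
```

The obvious caveat is that this couples correctness to threading: if the query is analyzed on a different thread than the one that set the context, the lookup silently falls back to the field tag.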
Re: What is the proper use of stop words in Lucene?
Hello Uwe,

Thank you for the reply. I see that there is a version check for the use of setEnablePositionIncrements(false); and I think I may be able to use an earlier API with the eXist-db embedding of Lucene 4.4 to avoid the version check.

However, my question was intended to improve my understanding of how to properly use stop words and/or how to properly achieve the use case that I outlined. My naive understanding of the purpose of stop words is: to remove from indexing words that are not helpful in discriminating or selecting documents, since they occur so frequently.

The use case that I intended to illustrate is meant to ignore the occurrence or non-occurrence of stop words in a query w.r.t. selection of documents during search. As I understand the situation currently, occurrences of stop words in a query phrase are replaced by ?s to indicate the presence of an otherwise unspecified word in the query. So the phrase:

    blue is the moon

with "is" and "the" as stop words, would be indexed effectively as:

    blue ? ? moon

and the query phrase:

    blue was a moon

would be treated as:

    blue ? ? moon

and would retrieve a document containing:

    blue is the moon

But in the use case that I presented we really want the query:

    blue moon

to also select the document, without the user having to indicate the possible presence of stop words. So my question is: how is one supposed to achieve this use case in Lucene 4.4+?

Thank you,
Chris

On Apr 24, 2014, at 5:52 AM, Uwe Schindler u...@thetaphi.de wrote:

Hi,

You can still change the setting on the TokenFilter after creating it: StopFilter#setEnablePositionIncrements(false) - this method was *not* removed! This fails only if you pass matchVersion=Version.LUCENE_44. Just use an older matchVersion parameter to the constructor and you can still enable this broken behavior (for backwards compatibility). This is no longer officially supported, but can be a workaround.

To me it looks like you misunderstood stopwords.
Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: Tincu Gabriel [mailto:tincu.gabr...@gmail.com]
Sent: Thursday, April 24, 2014 12:27 PM
To: java-user@lucene.apache.org
Subject: Re: What is the proper use of stop words in Lucene?

Hi there,

The StopFilterFactory can be used to produce StopFilters with the desired stop words inside of it. As a constructor argument it takes a Map<String,String>, and one of the valid keys you can pass is enablePositionIncrements. If you don't pass that in, it defaults to true. Is this what you were looking for?

On Wed, Apr 23, 2014 at 12:36 PM, Chris Tomlinson chris.j.tomlin...@gmail.com wrote:

Hello,

I've written several times now on the list with this question/problem and no one has yet replied, so I don't know if the question is too wrong-headed or if there is simply no one reading the list who can comment on it.

The question that I'm trying to get answered is: what is the correct way of ignoring stop word gaps in Lucene 4.4+? While we are using Lucene 4.4 embedded in eXist-db (exist-db.org), I think the question is a proper Lucene question and really has nothing to do with the fact that we're using it in an embedded manner.

The problem to be solved is how to ignore stop word gaps in queries - without the user having to indicate where such gaps might occur at query time. Since Lucene 4.4, FilteringTokenFilter.setEnablePositionIncrements(false) is not available. None of the resources such as Lucene in Action explain how to use Lucene to get the desired effect now that 4.4+ has removed the previous approach.

Prior to Lucene 4.4 it was possible to setEnablePositionIncrements(false) so that during indexing and querying the number and position of stop word gaps would be ignored (as mentioned on pp. 138-139 of Lucene in Action).
This meant that a document with a phrase such as:

    blue is the sky

with stop words "is" and "the", would be selected by the query:

    blue sky

This is what we want to achieve. Why? We are working with Tibetan, and elisions are not uncommon, so that, e.g.:

    rin po che

on some occasions might be shortened to:

    rin che

and we would like to have a query of "rin po che" or "rin che" find all occurrences of "rin po che" and "rin che" without having the user mark where elisions might occur.

The org.apache.lucene.queryparser.flexible.standard.CommonQueryParserConfiguration provides a setEnablePositionIncrements, but that does not seem to allow for the above desired query behavior that was possible prior to Lucene 4.4.

What is the proper way to ignore stop word gaps?

Thank you,
Chris
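The behavior the thread is discussing can be illustrated without Lucene. The plain-Java model below mimics what StopFilter does to token positions: with position increments enabled (the only behavior from Lucene 4.4 on), a removed stop word leaves a positional gap; with them disabled (the pre-4.4 option), the surviving tokens close ranks. This is a schematic illustration, not Lucene code; the class name and the "term@position" output format are inventions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Schematic model of stop-word removal with and without position gaps.
public class StopPositions {
    public static List<String> tokens(String text, Set<String> stop, boolean keepGaps) {
        List<String> out = new ArrayList<>();
        int pos = -1;
        for (String t : text.split("\\s+")) {
            pos++;                          // every input token advances the position
            if (stop.contains(t)) continue; // stop word: emit nothing
            // record "term@position"; without gaps, positions renumber densely
            out.add(t + "@" + (keepGaps ? pos : out.size()));
        }
        return out;
    }
}
```

So "blue is the sky" with stop set {is, the} indexes as blue@0 sky@3 when gaps are kept, and blue@0 sky@1 when they are dropped; only the latter lets the exact phrase "blue sky" match.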
Re: What is the proper use of stop words in Lucene?
Hi,

On Apr 28, 2014, at 11:45 AM, Uwe Schindler u...@thetaphi.de wrote:

>> Hello Uwe,
>>
>> Thank you for the reply. I see that there is a version check for the use
>> of setEnablePositionIncrements(false); and I think I may be able to use
>> an earlier API with the eXist-db embedding of Lucene 4.4 to avoid the
>> version check.
>
> Hi,
>
> you don't need an older version of the Lucene library. It is enough to
> pass the constant, also with Lucene 4.7 or 4.8 (to be released in a
> moment):
>
>     sf = new StopFilter(Version.LUCENE_43, ...);
>     sf.setEnablePositionIncrements(false);
>
> The version constant exists exactly so that components that changed in an
> incompatible way can still be used in later versions, preserving
> index/behavior compatibility.

Thank you for the explanation.

> About stop words: What you are doing is not really stop words. The main
> reason for stop words is the following:
>
> - Stop words are in almost every document, so it makes no sense to query
> for them.

This was my understanding.

> - The only relevant information behind the stop word is "there was a word
> at this position".

I didn't realize that this was a necessary aspect. I can certainly understand that it may be relevant in some (most) cases, and it makes sense to me that it would be appropriate to always preserve the information in indexing. I was looking for a solution that would essentially work at query time, and had initially thought that CommonQueryParserConfiguration#setEnablePositionIncrements() was intended to work this way, but it does not.

> If the second item were not taken care of, this information would get
> lost, too. If every document really contains a specific stop word (which
> is almost always the case), there must be no difference between a phrase
> query with the mentioned stop word, using an index with all stop words
> indexed and one with stop words left out. This can only be done if the
> stop word reserves a position.
>
> What you intend to do is not a stopword use case.
> You want to ignore some words - Lucene has no support for this, because
> in natural language processing this makes no sense.

Thank you for the information. I was unaware that ignoring some words makes no sense. I thought I gave a reasonable example of exactly this situation in the natural-language processing of Tibetan. Perhaps I am still not understanding.

> One way to do this is to:
>
> a) write your own TokenFilter, violating the TokenStream contracts
> b) use the backwards-compatibility layer with matchVersion=LUCENE_43
> c) maybe remove the words before tokenizing (e.g. MappingCharFilter,
>    mapping the ignore words to the empty string)

Thank you for these useful approaches to solving the use case.

ciao,
Chris

> Uwe
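Option (c) above can be sketched in plain Java: remove the ignorable words before tokenization, analogous to a MappingCharFilter that maps each ignore-word to the empty string. The sketch uses plain string replacement for illustration only; a real char filter operates on the character stream and maintains offset corrections, which this model deliberately omits. The class name and map-based API are inventions for the example.

```java
import java.util.Map;

// Schematic stand-in for mapping ignore-words to "" before tokenizing.
public class IgnoreWords {
    public static String strip(String text, Map<String, String> mappings) {
        String out = text;
        for (Map.Entry<String, String> e : mappings.entrySet()) {
            // \b word boundaries keep us from deleting substrings
            // inside longer words (assumes regex-safe keys)
            out = out.replaceAll("\\b" + e.getKey() + "\\b", e.getValue());
        }
        // collapse whitespace left behind by the removals
        return out.trim().replaceAll("\\s+", " ");
    }
}
```

With the mapping {"po" -> ""}, the Tibetan example from the thread, "rin po che", reduces to "rin che" before it ever reaches the tokenizer, so both forms index and query identically.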
Re: What is the proper use of stop words in Lucene?
On Apr 28, 2014, at 3:36 PM, Uwe Schindler u...@thetaphi.de wrote:

> Hi,
>
>>> What you intend to do is not a stopword use case. You want to ignore
>>> some words - Lucene has no support for this, because in natural
>>> language processing this makes no sense.
>>
>> Thank you for the information. I was unaware that ignoring some words
>> makes no sense. I thought I gave a reasonable example of exactly this
>> situation in the natural-language processing of Tibetan. Perhaps I am
>> still not understanding.
>
> Elisions are a bit different from stopwords (although I don't know about
> them in the Tibetan language). The Tokenizer should *not* split elisions
> from the terms (initially the term is the full word including the
> elision). In most languages those are separated by (for example) an
> apostrophe (e.g. French: le + arbre → l’arbre). The Tokenizer would keep
> those parts together (l’arbre). A later TokenFilter would then edit the
> token and remove the elision (if needed): arbre. This is how the French
> Analyzer in Lucene works.

Tibetan has no markers for elisions. They can be quite idiosyncratic to a school or tradition. It seems that the most flexible approach is to ignore stop words and their positions - however incorrect that may be; otherwise, one gets into ever more complex analysis that may not yield cost-effective results. We'll work with the pre-4.4 setEnablePositionIncrements(false) approach for now.

Tibetan has no sentence or phrase markers per se. There are no word boundary markers. Essentially, Tibetan is a sequence of syllables with occasional markers indicating some break in a thought or what not.

> Lucene currently does not have a Tibetan Analyzer, so you have to make
> your own (I think this is what you tried to do). You should carefully
> choose the Tokenizer and add something like a TibetanElisionFilter that
> removes the unwanted parts from the tokens.

We have developed a pair of analyzers and associated filters and tokenizers for both Tibetan Unicode and the Extended Wylie transliteration system.
If there is interest we will be happy to donate this work to Apache Lucene. It includes paying attention to the myriad punctuation characters, stemming and so on.

Thank you,
Chris

> Uwe
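The French-style elision handling Uwe describes (tokenize "l’arbre" whole, then have a filter strip the elided article to leave "arbre") can be sketched in plain Java. This models the token-editing step only, not a Lucene TokenFilter; the class name and the small article set are assumptions for the example, and a hypothetical TibetanElisionFilter would do the language-appropriate analogue.

```java
import java.util.Set;

// Schematic model of an elision-stripping token filter, French-style.
public class ElisionStrip {
    // Common French elided articles/pronouns (illustrative subset).
    private static final Set<String> ARTICLES =
        Set.of("l", "d", "j", "m", "n", "s", "t", "c", "qu");

    public static String filter(String token) {
        int i = token.indexOf('\u2019');     // typographic apostrophe ’
        if (i < 0) i = token.indexOf('\'');  // or ASCII apostrophe
        if (i > 0 && ARTICLES.contains(token.substring(0, i).toLowerCase())) {
            return token.substring(i + 1);   // drop the elided prefix
        }
        return token;                        // no elision: token unchanged
    }
}
```

The design point matches the thread: the tokenizer keeps the full surface form, and the filter, which knows the language's elision inventory, decides what to remove.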
What is the proper use of stop words in Lucene?
Hello,

I've written several times now on the list with this question/problem and no one has yet replied, so I don't know if the question is too wrong-headed or if there is simply no one reading the list who can comment on it.

The question that I'm trying to get answered is: what is the correct way of ignoring stop word gaps in Lucene 4.4+? While we are using Lucene 4.4 embedded in eXist-db (exist-db.org), I think the question is a proper Lucene question and really has nothing to do with the fact that we're using it in an embedded manner.

The problem to be solved is how to ignore stop word gaps in queries - without the user having to indicate where such gaps might occur at query time. Since Lucene 4.4, FilteringTokenFilter.setEnablePositionIncrements(false) is not available. None of the resources such as Lucene in Action explain how to use Lucene to get the desired effect now that 4.4+ has removed the previous approach.

Prior to Lucene 4.4 it was possible to setEnablePositionIncrements(false) so that during indexing and querying the number and position of stop word gaps would be ignored (as mentioned on pp. 138-139 of Lucene in Action). This meant that a document with a phrase such as:

    blue is the sky

with stop words "is" and "the", would be selected by the query:

    blue sky

This is what we want to achieve. Why? We are working with Tibetan, and elisions are not uncommon, so that, e.g.:

    rin po che

on some occasions might be shortened to:

    rin che

and we would like to have a query of "rin po che" or "rin che" find all occurrences of "rin po che" and "rin che" without having the user mark where elisions might occur.

The org.apache.lucene.queryparser.flexible.standard.CommonQueryParserConfiguration provides a setEnablePositionIncrements, but that does not seem to allow for the above desired query behavior that was possible prior to Lucene 4.4.

What is the proper way to ignore stop word gaps?
Thank you,
Chris
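Why the gap matters at query time can be shown with a minimal model of exact phrase matching over indexed (term, position) pairs. With gaps preserved, "blue is the sky" indexes as blue@0 sky@3, so a two-term phrase [blue, sky], which expects consecutive positions, does not match; with gaps dropped (blue@0 sky@1) it does. This is an illustrative model only, not Lucene's PhraseQuery, and it assumes a single occurrence per term.

```java
import java.util.List;
import java.util.Map;

// Minimal exact-phrase matcher over term -> first-position maps.
public class PhraseMatch {
    public static boolean matches(List<String> phrase, Map<String, Integer> positions) {
        for (int i = 1; i < phrase.size(); i++) {
            Integer prev = positions.get(phrase.get(i - 1));
            Integer cur = positions.get(phrase.get(i));
            // each term must sit exactly one position after its predecessor
            if (prev == null || cur == null || cur != prev + 1) return false;
        }
        return true;
    }
}
```

A sloppy phrase query can also bridge such gaps, but only with a slop chosen per query, which is exactly the per-query knowledge the poster wants to avoid requiring from users.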
What is the proper way to ignore stop words in queries with Lucene 4.4+
Hello,

We're using Lucene 4.4 embedded in eXist-db (exist-db.org), and as the subject indicates we want to ignore stop word gaps in queries - without the user having to indicate where such gaps might occur at query time. Since Lucene 4.4, FilteringTokenFilter.setEnablePositionIncrements(false) is not available.

Prior to Lucene 4.4 it was possible to setEnablePositionIncrements(false) so that during indexing and querying the number and position of stop word gaps would be ignored. This meant that a phrase such as:

    blue is the sky

with stop words "is" and "the", would be selected by the query:

    blue sky

We are working with Tibetan, and elisions are not uncommon, so that, e.g., "rin po che" on some occasions might be shortened to "rin che", and we would like to have a query of "rin po che" or "rin che" find all occurrences of "rin po che" and "rin che" without having the user mark where elisions might occur.

The org.apache.lucene.queryparser.flexible.standard.CommonQueryParserConfiguration provides a setEnablePositionIncrements, but that does not seem to allow for the above desired query behavior that was possible prior to Lucene 4.4.

What is the proper way to ignore stop word gaps?

Thank you,
Chris
How to ignore stop word gaps in queries? Lucene 4.4+
Hello,

We're using Lucene 4.4 embedded in eXist-db (exist-db.org), and as the subject indicates we want to ignore stop word gaps in queries - without the user having to indicate where such gaps might occur at query time. Since Lucene 4.4, FilteringTokenFilter.setEnablePositionIncrements(false) is not available.

Prior to Lucene 4.4 it was possible to setEnablePositionIncrements(false) so that during indexing and querying the number and position of stop word gaps would be ignored. This meant that a phrase such as:

    blue is the sky

with stop words "is" and "the", would be selected by the query:

    blue sky

We are working with Tibetan, and elisions are not uncommon, so that, e.g., "rin po che" on some occasions might be shortened to "rin che", and we would like to have a query of "rin po che" or "rin che" find all occurrences of "rin po che" and "rin che" without having the user mark where elisions might occur.

The org.apache.lucene.queryparser.flexible.standard.CommonQueryParserConfiguration provides a setEnablePositionIncrements, but that does not seem to allow for the above desired query behavior that was possible prior to Lucene 4.4.

What is the proper way to ignore stop word gaps?

Thank you,
Chris