Re: Custom scoring

2012-02-23 Thread Ahmet Arslan
> The problem is that coord() method is not used (or at least > so that i understand) neither in searching nor in indexing > What do i do wrong? If you want to see coord() values, use a multi-word query (two or more query terms) and go to the last page of the result set.

Re: How to construct this query ?

2012-03-07 Thread Ahmet Arslan
> I'm trying to programmatically create a query but don't get > it working. > > The query should return all results that match some prefix, > but not any > results that /exactly/ match the prefix (in the same field). > So only the > results where the field contents are longer than the > prefix. >

Re: Reverse keyword search?

2012-04-27 Thread Ahmet Arslan
> This appears to be somewhat the reverse of the typical > Lucene use case -- rather than having a set of say 1000 of > articles which are indexed, then issuing a query using a few > keywords to search on those articles, I have a set of say > 1000 keywords, and a single article, and I want to deter

Re: Approches/semantics for arbitrarily combining boolean and proximity search operators?

2012-05-16 Thread Ahmet Arslan
> medical w/5 agreement > (medical w/5 agreement) and (doctor w/10 rights) > > but also crazier ones, perhaps like > > agreement w/5 (medical and companion) > (dog or dragon) w/5 (cat and cow) > (daisy and (dog or dragon)) w/25 (cat not cow) This syntax reminds me of Surround. http://wiki.apache.o

Re: Store a query in a database for later use

2012-05-18 Thread Ahmet Arslan
> 2. toString() doesn't always generate a query that the > QueryParser can parse. I remember a similar discussion; I think the XML Query Parser is more suitable for this use case. http://www.lucidimagination.com/blog/2009/02/22/exploring-query-parsers/

Re: Boosting numerical field

2012-05-19 Thread Ahmet Arslan
> Is there anyway in a query, I can boost the relevance of a > hit based on the value of a numerical field in the index. > i.e higher the value of the field, more relevant the hit > is. Yes, it is possible, e.g. for view count, popularity, etc. You can use (e)dismax's bf boosting function (additive or

Re: using phrase query with wildcard

2012-07-23 Thread Ahmet Arslan
> I'm trying to create a phrase query with wildcard, from the > forums it seems that the solution is not trivial. > I'm trying to create the following queries: "this is a > phrase*"  OR  "*This is a phrase" and > Get hits on every possibility where the * resides. > What is the best way to achieve t

Re: how to put multiplue proximity search in lucene??

2012-07-26 Thread Ahmet Arslan
> fear2dark tight3free is one single > query and im using query parser. If i > will pass >   "fear dark"~2 "tight free"~3  then i will not > get result in which dark > and tight near to eachother. So you want dark and tight to be adjacent to each other. SurroundQueryParser supports nested proxim

Re: easy way to figure out most common tokens?

2012-08-15 Thread Ahmet Arslan
> Is there an easy way to figure out > the most common tokens and then remove those tokens from the > documents. Probably this: http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/misc/HighFreqTerms.html

Re: Efficient string lookup using Lucene

2012-08-24 Thread Ahmet Arslan
> search for a string "run", I do not need to find "ran" but I > do want to find it in all of these strings below: > > Fox is running fast > !%#^&$run!$!%@&$# > run,run With the NGramFilter you can do that, but it creates a lot of tokens. For example, "Fox is running fast" becomes F o

Re: Stemming and Wildcard Queries

2010-05-20 Thread Ahmet Arslan
> Is there a good way to combine the > wildcard queries and stemming?  > > As is, the field which is stemmed at index time, won't work > with some wildcard queries. org.apache.lucene.queryParser.analyzing.AnalyzingQueryParser may help?

Re: Surround QueryParser and PhraseQuery

2010-05-28 Thread Ahmet Arslan
> I'm having problem with searching phrase and using Surround > Query Parser, so > let look at input surround queries (test examples) >    1. "yellow orange" >    2. lemon 2n ("yellow orange") 4n banana > where 2n, 4n are within connectors. You don't need phrasequery when you already have spannear

Re: PhraseQuery vs MultiPhraseQuery

2010-05-28 Thread Ahmet Arslan
> Is there a fundamental difference between > > PhraseQuery query = new PhraseQuery(); > query.add(term1, 0); > query.add(term2, 0); > > and > > MultiPhraseQuery query = new MultiPhraseQuery(); > query.add( new Term[] { term1, term2 } ); > > The only different I could think of is that MPQ som
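The difference can be sketched like this (pre-5.x mutable API; the field and terms are made-up examples):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiPhraseQuery;
import org.apache.lucene.search.PhraseQuery;

public class PhraseVsMultiPhrase {
    public static void main(String[] args) {
        // PhraseQuery: exactly one term per position.
        PhraseQuery pq = new PhraseQuery();
        pq.add(new Term("body", "quick"));
        pq.add(new Term("body", "fox"));

        // MultiPhraseQuery: several alternative terms can occupy one position,
        // e.g. to match "quick fox" or "quick foxes" as a phrase.
        MultiPhraseQuery mpq = new MultiPhraseQuery();
        mpq.add(new Term("body", "quick"));
        mpq.add(new Term[] { new Term("body", "fox"), new Term("body", "foxes") });

        System.out.println(pq);
        System.out.println(mpq);
    }
}
```

With a single term per position the two behave the same; MultiPhraseQuery only pays off when some position has alternatives, such as synonyms or stemming variants.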

Re: StandardAnalyzer specifications

2010-06-03 Thread Ahmet Arslan
> I am sorry if this is posted somewhere else, but I think I > sent it to > the wrong list and I am trying again. > > Is there anywhere I can find specifications for > StandardAnalyzer? > > I am looking for specs that tell just how StandardAnalyzer > tokenizes > search terms, and how it deals wit

Re: Are there any tokenizers that ignore HTML tags but keep the offsets so they can be used for highlighting in the original document?

2010-06-07 Thread Ahmet Arslan
> I need to index HTML documents and one of the requirements > is to highlight > documents while maintaining all of the original formatting. > The documents > are relatively simple HTML, meaning no JavaScript code that > changes elements > at runtime or too fancy CSS styling. > > I think it should

Re: Exact match with fuzzy query

2010-06-12 Thread Ahmet Arslan
> I am using lucene 3.0.1. I use a MultiFieldQueryParser with > a GermanAnalyzer. In my index are some values among others > one document with the title "bauer". I append to every word > in my query a ~0.8 (here I am not sure if this is the way to > do it). If I try to search now, I will not get th

Re: Exact match with fuzzy query

2010-06-12 Thread Ahmet Arslan
> > Yes bauer~0.8 bauer as query will bring you both exact > and fuzzy matches. > > Is this the normal way to do it? Somehow. 'bauer~0.8 OR bauer' is the easiest way to do a fuzzy search that also finds exact matches. > Unfortunately this parser seems to be missing in 3.0.1 http://lucene.apache.org/

Re: Strange behaviour of StandardTokenizer

2010-06-17 Thread Ahmet Arslan
> I ran into a strange behaviour of the StandardTokenizer. > Terms containing a '-' are tokenized differently depending > on the context. > For example, the term 'nl-lt' is split into 'nl' and 'lt'. > The term 'nl-lt0' is tokenized into 'nl-lt0'. > Is this a bug or a feature? It is designed tha

Re: Strange behaviour of StandardTokenizer

2010-06-18 Thread Ahmet Arslan
> okay, so it is recognized as a number? Yes. You can see the token type definitions in the *.jflex file. > Maybe I'll have to use another tokenizer. The MappingCharFilter with StandardTokenizer option exists. NormalizeCharMap map = new NormalizeCharMap(); map.add("-", " "); TokenStream stream = new Sta
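A fuller sketch of that suggestion, using the Lucene 3.x-era API discussed in the thread (NormalizeCharMap gained a Builder in 4.x, so newer versions differ); the class and method names here are illustrative:

```java
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class HyphenToSpace {
    public static TokenStream tokenize(String text) {
        NormalizeCharMap map = new NormalizeCharMap();
        map.add("-", " "); // map '-' to a space before tokenization
        Reader mapped = new MappingCharFilter(map, CharReader.get(new StringReader(text)));
        // StandardTokenizer now sees "nl lt0" instead of "nl-lt0", so it splits consistently.
        return new StandardTokenizer(Version.LUCENE_30, mapped);
    }
}
```

Because the char filter runs before the tokenizer, both 'nl-lt' and 'nl-lt0' are split the same way, regardless of the number-recognition rule.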

Re: search for a string which begins with a '$' character

2010-07-03 Thread Ahmet Arslan
> I am using this analyzer: > @Analyzer(impl = > org.apache.lucene.analysis.standard.StandardAnalyzer.class) > > "$" is not inlcluded in the STOP_WORDS for this > analyzer.  Is there > somewhere else i should be looking?  When i use Luke > with the > standardAnalyzer, it does not parse the query. 

Re: multi-term synonym expansion

2010-07-06 Thread Ahmet Arslan
> My custom SKOSAnalyzer already performs synonym expansion > based on the labels defined in a given SKOS model. But now I > have the problem that real-world thesauri often define > (multi terms) synonyms for mult-term words. Here is an > example that defines the abbreviation "UN" as synonym for >

Re: search for a string which begins with a '$' character

2010-07-09 Thread Ahmet Arslan
> WhitespaceAnalyzer is case sensitive.  Is there a way > to > make it case insensitive? You can build your custom analyzer using WhitespaceTokenizer + LowerCaseFilter. The source code of an existing analyzer will help you. public TokenStream tokenStream(String fieldName, Reader reader) { Whites
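A minimal sketch of such an analyzer, using the 3.0-era API this thread is about (later versions replaced tokenStream() with createComponents()); the class name is made up:

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public final class LowercasingWhitespaceAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split on whitespace only, then lowercase each token.
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}
```

The same analyzer must be used at both index and query time, otherwise the lowercased index terms will not match mixed-case query terms.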

Re: A full-text tokenizer for the NGramTokenFilter

2010-07-17 Thread Ahmet Arslan
> and I'm just wondering if there is a tokenizer > that would return me the whole text. KeywordTokenizer does this.

Re: Fuzzy Phrase

2010-09-27 Thread Ahmet Arslan
> I want to > use just one string like -- head:"hello~ world"~3 AND > contents:"colorless~ > green~ ideas~". > > When I this string query within ComplexPhraseQuery, I get > the exception: > -- ParseException: Cannot parse 'hello~ world': Cannot > have clause for > field "content" nested in phras

Re: Changing QueryParser operator images

2010-09-28 Thread Ahmet Arslan
> How can this be done, if at all? has anyone ever did > something like this? I did it by modifying QueryParser.jj and regenerating the corresponding Java files. But it is better to use (teach users) the universal + and - operators. http://wiki.apache.org/lucene-java/BooleanQuerySyntax

Re: Copying Payload from one Token to the next

2010-10-17 Thread Ahmet Arslan
org.apache.solr.analysis.BufferedTokenStream.java (which can peek n tokens ahead in the buffered input stream without modifying the stream) and CommonGramsFilter.java may help.

Re: FW: Use of hyphens in StandardAnalyzer

2010-10-24 Thread Ahmet Arslan
How about replacing "-" with some arbitrary character sequence with MappingCharFilter before the tokenizer and then restoring that '-' with PatternReplaceFilter after the tokenizer? Or maybe you can just eat the '-' with the charFilter so that Lawton-Browne becomes LawtonBrowne.

Re: What is the best Analyzer and Parser for this type of question?

2010-11-15 Thread Ahmet Arslan
> Example of Question: > - What is the role of PrnP in mad cow disease? The first thing is: do not query questions directly. Manually formulate queries: remove 'what', 'is', 'the', 'of', '?', etc. For example, I would convert this question into: "mad cow"^5 "cow disease"^3 "mad cow disease"^15 "role PrnP"

Re: Dismax in Lucene

2010-11-20 Thread Ahmet Arslan
> I heard Yonik talk about a better dismax query parser for > Solr so I > was wondering if Lucene already has this functionality > contributed to > its contrib modules? Dismax uses Lucene's DisjunctionMaxQuery: http://lucene.apache.org/java/2_9_3/api/core/org/apache/lucene/search/DisjunctionMaxQuer

Re: Analyzer

2010-12-02 Thread Ahmet Arslan
> By the way, is there an analyzer > which splites each letter of a word? > e.g. > hello world => h/e/l/l/o/w/o/r/l/d There are classes for this under the package org.apache.lucene.analysis.ngram.
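For example, an NGramTokenizer with minimum and maximum gram size 1 emits each character as its own token (3.x-era constructor shown; newer versions moved to a setReader() style):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LetterSplitter {
    public static void printLetters(String text) throws Exception {
        // min gram = max gram = 1: every single character becomes a token.
        TokenStream ts = new NGramTokenizer(new StringReader(text), 1, 1);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}
```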

Re: FunctionQuery

2010-12-12 Thread Ahmet Arslan
--- On Sun, 12/12/10, Lev Alyshayev wrote: > Hello there, > > I am trying to solve a problem where I use a new > FunctionQuery to sort the > results by changing the score

Re: The logic of QueryParser

2010-12-13 Thread Ahmet Arslan
> I have googled the mailing list archives and didn't find > anything.  But if > this has been discussed to death, please just point me to > the threads in the > archive. rather than stirring up some old flame war.  > Or just tell me what > to google for (the terms I've tried haven't yielded > anyt

Re: Multifield query parser

2010-12-18 Thread Ahmet Arslan
> While searching across multiple fields using > MultiFieldQueryParser, when a > doc is returned how do I know in this doc which field(among > the multiple > fields i queried over) contained the query term? You can extract that info from org.apache.lucene.search.Explanation. http://lucene.apache.
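A sketch of pulling that information out; the searcher and parsed multi-field query are assumed to already exist, and the field name shows up in the explanation's description lines:

```java
import java.io.IOException;

import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class MatchingFields {
    // Prints the explanation tree for each hit; lines such as
    // "weight(title:lucene ...)" reveal which field(s) matched.
    public static void explainHits(IndexSearcher searcher, Query query) throws IOException {
        TopDocs hits = searcher.search(query, 10);
        for (ScoreDoc sd : hits.scoreDocs) {
            Explanation explanation = searcher.explain(query, sd.doc);
            System.out.println("doc=" + sd.doc + "\n" + explanation.toString());
        }
    }
}
```

Note that explain() is relatively expensive, so it is best used for the page of hits being displayed rather than the whole result set.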

Re: Can I generate two word phrases from Lucene Index

2010-12-22 Thread Ahmet Arslan
> > 2. *Getting Two Word Phrase ==>* index contents, > using lucene etc... > > You can add ShingleFilter to your analyzer chain. http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/analysis/shingle/ShingleFilter.html
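A minimal sketch of that chain (3.x-era constructors; the tokenizer choice and method name are illustrative):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;

public class TwoWordPhrases {
    public static TokenStream shingles(String text) {
        // Two-word shingles over a whitespace-tokenized stream, e.g.
        // "please divide this" -> "please divide", "divide this".
        ShingleFilter filter =
                new ShingleFilter(new WhitespaceTokenizer(new StringReader(text)), 2);
        filter.setOutputUnigrams(false); // emit only the two-word combinations
        return filter;
    }
}
```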

Re: relevant score calculation

2010-12-29 Thread Ahmet Arslan
> Test case >     doc1 :   test -- one two > three >     doc2 :   test, one two three >     doc3 :   one two three > > Search query :  "one two three" by QueryParser and > StandardAnalyzer > > Question:  why all of three documents have the same > score?  As Ian said, length norm values of your a

Re: Search Score percentage, Should not be relative to the highest score

2011-01-03 Thread Ahmet Arslan
It is generally not recommended to convert scores to percentages. http://wiki.apache.org/lucene-java/ScoresAsPercentages > When using lucene to search documents, the results have a > score based on their relativity to the search term. Inside > lucene, the score > percentage is calculated as a p

Re: Search Score percentage, Should not be relative to the highest score

2011-01-03 Thread Ahmet Arslan
> I had read the link and I understand the concern, however, > the normalization > is happening inside lucene. Where the normalizing value is > the inverse of > the maxScore. > > I can alter the code to leave the original score, however > it is a business > requirements to view the matching percen

Re: Search Score percentage, Should not be relative to the highest score

2011-01-03 Thread Ahmet Arslan
So, can we say that if you have something that gives you the "how many query terms matched" info, that will satisfy your requirement?

Query: term1 term2
Doc1: term1 term2 => n=2 => 100%
Doc2: term1 term2 term3 term4 => n=2 => 100%
Doc3: term1 term1 term3 => n=1 => 50%
Doc4: term2 term3 ter
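The "how many query terms matched" percentage sketched above can be computed outside of scoring altogether; a plain-Java sketch (names are made up):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MatchPercent {
    /** Percentage of distinct query terms that occur among the document's tokens. */
    public static int matchPercent(Set<String> queryTerms, List<String> docTokens) {
        if (queryTerms.isEmpty()) {
            return 0;
        }
        Set<String> matched = new HashSet<>(queryTerms);
        matched.retainAll(new HashSet<>(docTokens)); // keep only terms present in the doc
        return 100 * matched.size() / queryTerms.size();
    }

    public static void main(String[] args) {
        Set<String> query = new HashSet<>(Arrays.asList("term1", "term2"));
        System.out.println(matchPercent(query, Arrays.asList("term1", "term2")));                   // 100
        System.out.println(matchPercent(query, Arrays.asList("term1", "term2", "term3", "term4"))); // 100
        System.out.println(matchPercent(query, Arrays.asList("term1", "term1", "term3")));          // 50
    }
}
```

This reproduces the Doc1/Doc2/Doc3 figures from the table: the percentage depends only on which query terms occur, not on how often or on document length.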

Re: Use of PrefixQuery to create multi-word queries

2011-01-05 Thread Ahmet Arslan
> I am trying to implement a "progressive search" with > Lucene. What I mean is that > something like what Google does: you type a few letters and > google searches for > matches as you type. The more letters you enter, the more > precise your search > becomes. > > I decided to use a prefix query

Re: Search Score percentage, Should not be relative to the highest score

2011-01-05 Thread Ahmet Arslan
> Did not work, > > I am using my own Similarity and the coord method is not > called, because the > disableCoord variable is set to true from FuzzyQuery > > > public Similarity getSimilarity(Searcher searcher) { >     Similarity result = > super.getSimilarity(searcher); >     if (disableCoord

Re: Frequent updates lead to "Too many open files"

2011-01-08 Thread Ahmet Arslan
--- On Sat, 1/8/11, Andreas Harth wrote: > Hi, > > I have a single IndexWriter object which I use to update > the index.  After each upda

RE: Creating an index with multiple values for a single field

2011-01-10 Thread Ahmet Arslan
> We do leverage synonyms but they are not appropriate for > this case. We use synonyms for words that are truly > synonymous for the entire index such as "inc" and > "incorporated". Those words are always interchangeable. > However, many of the employer alternate names are only valid > for a singl

Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-18 Thread Ahmet Arslan
[ ] ASF Mirrors (linked in our release announcements or via the Lucene website)

[X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)

[ ] I/we build them from source via an SVN/Git checkout.

[ ] Other (someone in your company mirrors them internally or via a downst

RE: Lucene paid support

2011-03-03 Thread Ahmet Arslan
> Thanks for the quick reply. Sorry I was vague in my > message. We are considering using Lucene in a commercial > product that we sell and as we do with other third-party > products we want to make sure that we have timely (i.e. > priority) access to technical support that can help us > resolve is

Re: About query parser

2011-03-15 Thread Ahmet Arslan
> For example, i wanna search for 'great sum', and 'great > sum', 'greater sum', ... may be found, and sum great must > not be found. It means I need not only exact word, but also > prefix search (or some other search criteria like fuzzy...). > In my app, I used "great* sum*", but it does not work

Re: About ComplexPhraseQueryParser highlight prob

2011-03-15 Thread Ahmet Arslan
--- On Tue, 3/15/11, Cescy wrote: > hi > > My app can find the document but cannot highlight the > keywords. > > ComplexPhraseQueryParser parser = new

Re: Am I correctly parsing the strings ? Terms or Phrases ?

2011-03-21 Thread Ahmet Arslan
>     description = new TermQuery(new > Term("description", "my string")); > > I ask Lucene to consider "my string" as unique word, right? Correct. > I actually need to consider each word, should I use > PhraseQuery instead ? If the description field is tokenized/analyzed during indexing, you need

Re: Am I correctly parsing the strings ? Terms or Phrases ?

2011-03-21 Thread Ahmet Arslan
> Date: Monday, March 21, 2011, 7:39 PM > One more thing: It is actually not > clear to me how to use PhraseQuery... I > thought I can just pass a phrase to it, but I see only > add(Term) method... > should I parse the string by myself to single terms ? Yes, you need to do it. QueryParser transf
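Splitting the string yourself then looks roughly like this (classic mutable PhraseQuery API; the field name and whitespace split are illustrative):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class ManualPhrase {
    public static PhraseQuery buildPhrase(String field, String phrase) {
        PhraseQuery query = new PhraseQuery();
        // Naive lowercase + whitespace split; real code should run the same
        // analyzer that was used at index time so the terms actually match.
        for (String word : phrase.toLowerCase().split("\\s+")) {
            query.add(new Term(field, word));
        }
        return query;
    }
}
```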

Re: Sorting by multiple dependent fields

2011-03-23 Thread Ahmet Arslan
> I'm searching for things near your location (as specified > by longitude and latitude).  I've got the search > working correctly (with the help of NumericField), but now I > need to sort the results by distance from you.  The > closer things appear at the top of the list.  There is a contrib pac

Re: ComplexPhraseQueryParser with multiple fields

2011-05-02 Thread Ahmet Arslan
Hi, I've just started using the ComplexPhraseQueryParser and it works great with one field but is there a way for it to work with multiple fields?  For example, right now the query: job_title: "sales man*" AND NOT contact_name: "Chris Salem" throws this exception Caused by: org.apache.lucene.q

Re: Anyway to not bother scoring less good matches ?

2011-05-04 Thread Ahmet Arslan
I'm receiving a number of searches with many ORs, so the total number of matches is huge (> 1 million) although only the first 20 results are required. Analysis shows most time is spent scoring the results. Now it seems to me that if you send a query with 10 OR components, documents that mat

Re: Anyway to not bother scoring less good matches ?

2011-05-04 Thread Ahmet Arslan
> Thanks for the hint, so this could be done by overriding getBooleanQuery() in > QueryParser? I think something like this should do the trick, without overriding anything: Query query = QueryParser.parse("User Entered String"); if (query instanceof BooleanQuery) ((BooleanQuery)query).se
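Spelled out a bit more as a helper (parser construction is version-dependent, so only the post-parse step is shown; the name is made up):

```java
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class HalfShouldMatch {
    // Require roughly half of the optional (OR/SHOULD) clauses to match,
    // so documents matching only one of many terms are never scored as hits.
    public static Query tighten(Query query) {
        if (query instanceof BooleanQuery) {
            BooleanQuery bq = (BooleanQuery) query;
            bq.setMinimumNumberShouldMatch(bq.clauses().size() / 2);
        }
        return query;
    }
}
```

Note this changes which documents match, not just how fast they are scored, so it only fits when low-overlap matches are genuinely unwanted.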

Re: AW: Higher scoring if term is at the beginning of a field/document

2011-05-04 Thread Ahmet Arslan
Besides my "real index" (which is being analyzed through a ShingleAnalyzerWrapper) I implicitly/transparently build up a "search term index" which I populate with the terms (being shingles) of my "real index". The "search term index" is being used to provide search term suggestions when the u

Re: Anyway to not bother scoring less good matches ?

2011-05-04 Thread Ahmet Arslan
> Thanks again, now done that but still not having much > effect on total > ime, So your main concern is improving the running time, not decreasing the number of returned results? Additionally: http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

Re: Anyway to not bother scoring less good matches ?

2011-05-05 Thread Ahmet Arslan
> Yes correct, but I have looked and the list of > optimizations before. What was clear from profiling was that > it wasnt the searching part that was slow (a query run on > the same index with only a few matching docs ran super fast) > the slowness only occurs when there are loads of matching > do

Re: ComplexPhraseQueryParser with multiple fields

2011-06-22 Thread Ahmet Arslan
> Which of the solutions did you find to work better? > Can you please say which package should I change it to if I > choose to do it > that way? I think changing the package name of ComplexPhraseQueryParser is easier. This way you can use the existing patch directly. Plus, do you mind voting for https://issues.a

Re: ComplexPhraseQueryParser with multiple fields

2011-06-23 Thread Ahmet Arslan
> By the way - I'm using the > ComplexPhraseQueryParser that I've downloaded > from: > > https://issues.apache.org/jira/browse/SOLR-1604 > > And I've tried to use packages: > > - org.apache.lucene.search > - org.apache.lucene.queryParser > > Both, when compiled and added to the SOLR lib dir,

Re: ComplexPhraseQueryParser with multiple fields

2011-06-23 Thread Ahmet Arslan
> But now there's another issue. > I'm using SOLR and Lucene 3.1.0 and when sending a query > "Wildcard* phrase*" > it works as expected - but, when sending the query > "wildcard*" (Only one > word withing the phrase) I'm getting another exception: > > HTTP ERROR: 500 > Unknown query type "org.ap

Re: Reverse Matching

2014-02-14 Thread Ahmet Arslan
Hi Siraj, MemoryIndex is used for such a use case. Here are a couple of pointers: http://www.slideshare.net/jdhok/diy-percolator http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html

Re: Reverse Matching

2014-02-14 Thread Ahmet Arslan
Hi, Here are two more relevant links: https://github.com/flaxsearch/luwak http://www.lucenerevolution.org/2013/Turning-Search-Upside-Down-Using-Lucene-for-Very-Fast-Stored-Queries Ahmet

Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Ahmet Arslan
Hi Greet, I suggest doing this kind of transformation at query time only. Don't interfere with the index; this way is more flexible. You can disable/enable it on the fly and change your list without re-indexing. Just an imaginary example: when the user passes a String such as International Businessma

Re: Phrase search with ComplexPhraseQueryParser/SpanQueryParser.

2014-03-05 Thread Ahmet Arslan
Hi Modassar, Can you post your request (with an example if possible) to the LUCENE-5205 jira ticket too? If you don't have a jira account, anyone can create one. Thanks, Ahmet On Wednesday, March 5, 2014 9:40 AM, Modassar Ather wrote: Hi, Phrases with stop words in them are not getting searc

Re: Ranking Function based on Probabilistic Retrieval Framework

2014-04-03 Thread Ahmet Arslan
Hi Prakash, Have you seen Robert's write-up? http://java.dzone.com/news/flexible-ranking-lucene-4 Ahmet On Thursday, April 3, 2014 2:30 PM, Prakash Dubey wrote: Dear all, Why is there no ranking function based on the Probabilistic Retrieval Framework

Re: What is the proper use of stop words in Lucene?

2014-04-23 Thread Ahmet Arslan
Hi, I think your final goal is not fully related to stop word elimination. I would use synonyms instead of setEnablePositionIncrements. Alternatively, assuming that you have a list of stop words, you may simulate the previous behavior setEnablePositionIncrements(false) via org.apache.lucene.analysis.Ma

Re: How to add machine learning to Apache lucene

2014-05-15 Thread Ahmet Arslan
Hi Priyanka, There are existing tools that can feed from a Lucene index, for example http://mahout.apache.org. Why not use them? Ahmet On Wednesday, May 7, 2014 11:05 PM, Priyanka Tufchi wrote: Hello All How can I add a machine learning part to Apache Lucene? Thanks Priyanka

Re: How to add machine learning to Apache lucene

2014-05-19 Thread Ahmet Arslan
Hi Diego, There is no such thing in the Lucene ecosystem yet, although some ideas http://search-lucene.com/m/WwzTb2nt1Tk1 http://search-lucene.com/m/WwzTb2d9o2m float around from time to time. I would like to integrate https://code.google.com/p/jforests/ and create a prototype myself in the future. New a

Re: Relevancy tests

2014-06-12 Thread Ahmet Arslan
Hi, Relevance judgments are labor intensive and expensive. Some Information Retrieval forums (TREC, CLEF, etc.) provide these golden sets, but they are not public. http://rosenfeldmedia.com/books/search-analytics/ talks about how to create a "golden set" for your top n queries. Also there ar

Re: Two-pass TokenFilter

2014-08-24 Thread Ahmet Arslan
Hi, Can you elaborate on what you mean by "I need to know all tokens in advance"? Ahmet On Wednesday, August 20, 2014 6:48 PM, Christian Beil wrote: Hey guys, I need a TokenFilter that filters some tokens like the FilteringTokenFilter. The problem is, in order to do the filtering I ne

Re: custom token filter generates empty tokens

2014-10-09 Thread Ahmet Arslan
Hi G.Long, You can use TrimFilter+LengthFilter to remove empty/whitespace tokens. Ahmet On Thursday, October 9, 2014 5:54 PM, G.Long wrote: Hi :) I wrote a custom token filter which removes special characters. Sometimes, all characters of the token are removed so the filter procudes an empt
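A sketch of that wrapping (constructor signatures changed a few times across 4.x; this follows the simpler 5.x-style form, and the custom filter name is hypothetical):

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.miscellaneous.TrimFilter;

public class DropEmptyTokens {
    // Wrap the output of the character-removing custom filter so that
    // tokens reduced to whitespace or nothing are discarded.
    public static TokenStream clean(TokenStream input) {
        TokenStream ts = new TrimFilter(input);            // strip surrounding whitespace
        return new LengthFilter(ts, 1, Integer.MAX_VALUE); // drop zero-length tokens
    }
}
```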

Re: analyzers for Thai, Telugu, Vietnamese, Korean, Urdu,...

2014-11-09 Thread Ahmet Arslan
Hi, Thai has this for example : org.apache.lucene.analysis.th.ThaiAnalyzer Ahmet On Saturday, November 8, 2014 12:48 PM, Olivier Binda wrote: Hello What should I use for analysing languages like Thai, Telugu, Vietnamese, Korean, Urdu ? The StandardAnalyzer ? The ICUAnalyzer ? It doesn't l

Re: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-10 Thread Ahmet Arslan
Hi, Regarding Uwe's warning, "NOTE: SnowballFilter expects lowercased text." [1] [1] https://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/snowball/SnowballFilter.html On Monday, November 10, 2014 4:43 PM, Uwe Schindler wrote: Hi, > Uwe > > Thanks for the reply

Re: How to improve the performance in Lucene when query is long?

2014-11-11 Thread Ahmet Arslan
Hi Harry, Maybe you can use the BooleanQuery#setMinimumNumberShouldMatch method. What happens when you set it to half of numTerms? Ahmet On Tuesday, November 11, 2014 8:35 AM, Harry Yu <502437...@qq.com> wrote: Hi everyone, I have been using Lucene to build a POI searching & geocoding

Re: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-11 Thread Ahmet Arslan
o the LowerCaseFilter. This seems to work.

Re: Document Term matrix

2014-11-11 Thread Ahmet Arslan
Hi, Mahout and Carrot2 can cluster the documents from a Lucene index. Ahmet On Tuesday, November 11, 2014 10:37 PM, Elshaimaa Ali wrote: Hi All, I have a Lucene index built with Lucene 4.9 for 584 text documents. I need to extract a document-term matrix and document-document similarity matri

Re: lucene query with additional clause field not null

2014-12-01 Thread Ahmet Arslan
Hi Sascha, Generally a RangeQuery is used for that, e.g. fieldName:[* TO *] Ahmet On Monday, December 1, 2014 9:44 PM, Sascha Janz wrote: Hi, is there a chance to add an additional clause to a query for a field that should not be null? greetings sascha
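Programmatically, the same open-ended range can be built and AND'ed with the main query (4.x-style API; the helper and field names are illustrative):

```java
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermRangeQuery;

public class FieldNotNull {
    public static Query withFieldPresent(Query mainQuery, String fieldName) {
        // Open-ended range [* TO *]: matches every document that has at least
        // one indexed term in fieldName, i.e. the field is "not null".
        Query fieldPresent = TermRangeQuery.newStringRange(fieldName, null, null, true, true);
        BooleanQuery combined = new BooleanQuery();
        combined.add(mainQuery, BooleanClause.Occur.MUST);
        combined.add(fieldPresent, BooleanClause.Occur.MUST);
        return combined;
    }
}
```

Note this only sees indexed terms: a document whose field was stored but not indexed, or analyzed away to nothing, still counts as "null" here.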

IndexSearcher.setSimilarity thread-safety

2014-12-25 Thread Ahmet Arslan
Hi all, The Javadocs say "IndexSearcher instances are completely thread safe, meaning multiple threads can call any of its methods, concurrently". Is this true for the setSimilarity() method? What happens when every thread uses a different Similarity implementation? Thanks, Ahmet

Re: IndexSearcher.setSimilarity thread-safety

2015-01-05 Thread Ahmet Arslan
anyone? On Thursday, December 25, 2014 4:42 PM, Ahmet Arslan wrote: Hi all, Javadocs says "IndexSearcher instances are completely thread safe, meaning multiple threads can call any of its methods, concurrently" Is this true for setSimilarity() method? What happens when every t

Re: IndexSearcher.setSimilarity thread-safety

2015-01-05 Thread Ahmet Arslan
an use a single IndexReader for the IndexSearchers Barry On Mon, Jan 5, 2015 at 1:10 PM, Ahmet Arslan wrote: > > > anyone? > > > > On Thursday, December 25, 2014 4:42 PM, Ahmet Arslan > wrote: > Hi all, > > Javadocs says "IndexSearcher instances are completely th

Re: IndexSearcher.setSimilarity thread-safety

2015-01-05 Thread Ahmet Arslan
hetaphi.de > -Original Message- > From: Barry Coughlan [mailto:b.coughl...@gmail.com] > Sent: Monday, January 05, 2015 3:40 PM > To: java-user@lucene.apache.org; Ahmet Arslan > Subject: Re: IndexSearcher.setSimilarity thread-safety > > Hi Ahmet, > > The IndexSearcher is "t

Re: Looking for docs that have certain fields empty (an/or not set)

2015-01-07 Thread Ahmet Arslan
Hi Clemens, Since you are a lucene user, you might be interested in Uwe's response on a similar topic : http://find.searchhub.org/document/abb73b45a48cb89e Ahmet On Wednesday, January 7, 2015 6:30 PM, Erick Erickson wrote: Should be, but it's a bit confusing because the query syntax is not

Re: AW: LowercaseFilter, preserveOriginal?

2015-01-27 Thread Ahmet Arslan
Hi Clemens, Please see : https://issues.apache.org/jira/browse/LUCENE-5620 Ahmet On Tuesday, January 27, 2015 10:56 AM, Clemens Wyss DEV wrote: > I very much preserveOriginal="true" when applying the >ASCIIFoldingFilter for (german)suggestions Must revise my statement, as I just noticed tha

Re: Analyzer: Access to document?

2015-02-04 Thread Ahmet Arslan
Hi Ralf, Does the following code fragment work for you?

/**
 * Modified from: http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/analysis/package-summary.html
 */
public List getAnalyzedTokens(String text) throws IOException {
    final List list = new ArrayList<>();
    try (TokenStream ts = analy

Re: disabling all scoring?

2015-02-05 Thread Ahmet Arslan
Hi Rob, Maybe you can wrap your query in a ConstantScoreQuery? Ahmet On Thursday, February 5, 2015 9:17 AM, Rob Audenaerde wrote: Hi all, I'm doing some analytics with a custom Collector on a fairly large number of search results (+-100,000, all the hits that return from a query). I need to retr
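A sketch of the suggestion (searcher, query, and collector are assumed to exist; the helper name is made up):

```java
import java.io.IOException;

import org.apache.lucene.search.Collector;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class UnscoredSearch {
    // Every hit gets the same constant score, so per-document scoring work
    // is avoided while the custom Collector still sees every matching doc.
    public static void collectAll(IndexSearcher searcher, Query query, Collector collector)
            throws IOException {
        searcher.search(new ConstantScoreQuery(query), collector);
    }
}
```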

getting number of terms in a document/field

2015-02-05 Thread Ahmet Arslan
Hello Lucene Users, I am traversing all documents that contain a given term with the following code:

Term term = new Term(field, word);
Bits bits = MultiFields.getLiveDocs(reader);
DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader, bits, field, term.bytes());
while (docsEnum.nextDoc() != Doc

Re: getting number of terms in a document/field

2015-02-06 Thread Ahmet Arslan
approximately in the doc's norm value. Maybe you can use that? Alternatively, you can store this statistic yourself, e.g as a doc value. Mike McCandless http://blog.mikemccandless.com On Thu, Feb 5, 2015 at 7:24 PM, Ahmet Arslan wrote: > Hello Lucene Users, > > I am traversing all

Re: getting number of terms in a document/field

2015-02-08 Thread Ahmet Arslan
ll compute length of fields by myself. Thanks, Ahmet On Friday, February 6, 2015 5:31 PM, Michael McCandless wrote: On Fri, Feb 6, 2015 at 8:51 AM, Ahmet Arslan wrote: > Hi Michael, > > Thanks for the explanation. I am working with a TREC dataset, > since it is static, I

Re: understanding the norm encode and decode

2015-03-04 Thread Ahmet Arslan
Hi Adrien, I read somewhere that norms are stored using docValues. In my understanding, docValues can store lossless float values. So the question is, why do several decode/encode methods still exist in similarity implementations? Intuitively, switching to docValues for norms should prevent prec

Re: understanding the norm encode and decode

2015-03-05 Thread Ahmet Arslan
>>> s full float precision, but scoring being
>>> fuzzy anyway this would multiply your memory needs for norms by 4
>>> while not really improving the quality of the scores of your
>>> documents. This precision loss is the right trade-off for most
>>> use-cases.

Re: Would Like to contribute to Lucene

2015-03-19 Thread Ahmet Arslan
Hi Gimanta, Not sure about the lucene internals, but here are some pointers : http://find.searchhub.org/document/a81b4c9af49c3d0f http://find.searchhub.org/?q=contribute#%2Fp%3Alucene%2Fs%3Aemail Ahmet On Thursday, March 19, 2015 3:58 PM, Gimantha Bandara wrote: Any clue on where to start

Re: CachingTokenFilter tests fail when using MockTokenizer

2015-03-23 Thread Ahmet Arslan
Hi Spyros, Not 100% sure but I think you should override the reset method. @Override public void reset() throws IOException { super.reset(); cachedInput = null; } Ahmet On Monday, March 23, 2015 1:29 PM, Spyros Kapnissis wrote: Hello, We have a couple of custom token filters that use CachingTo

Re: Text dependent analyzer

2015-04-14 Thread Ahmet Arslan
Hi Hummel, You can perform sentence detection outside of Solr, using OpenNLP for instance, and then feed the sentences to Solr. https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect Ahmet On Tuesday, April 14, 2015 8:12 PM, Shay Hummel wrote: Hi I would l
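The OpenNLP pre-processing step suggested above could be sketched as follows (OpenNLP 1.5.x API; the model file path is a hypothetical example, and the code is untested here):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

// Load a pre-trained sentence model and split raw text into sentences
// before indexing.
try (InputStream modelIn = new FileInputStream("en-sent.bin")) {
    SentenceModel model = new SentenceModel(modelIn);
    SentenceDetectorME detector = new SentenceDetectorME(model);
    String[] sentences = detector.sentDetect("First sentence. Second sentence.");
    // Feed `sentences` to Solr, e.g. one value per sentence in a multiValued field.
}
```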

Re: Text dependent analyzer

2015-04-17 Thread Ahmet Arslan
ed, Apr 15, 2015 at 3:50 AM Ahmet Arslan wrote: > Hi Hummel, > > You can perform sentence detection outside of the solr, using opennlp for > instance, and then feed them to solr. > > https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect

Re: Changing analyzer in an indexwriter

2015-04-19 Thread Ahmet Arslan
Hi Lisa, I think AnalyzerWrapper is what you need: https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/AnalyzerWrapper.html Ahmet On Sunday, April 19, 2015 1:37 PM, Lisa Ziri wrote: Hi, I'm upgrading to lucene 5.1.0 from lucene 4. In our index we have documents in different languages which are
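A common ready-made subclass of AnalyzerWrapper for the multi-language case is PerFieldAnalyzerWrapper, which picks an analyzer by field name (a sketch against the Lucene 5.x API, untested here; the field names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;

// One language-specific field per language; each gets its own analyzer.
Map<String, Analyzer> perField = new HashMap<>();
perField.put("body_en", new EnglishAnalyzer());
perField.put("body_de", new GermanAnalyzer());

// StandardAnalyzer handles any field not listed above.
Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);

IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(directory, config);
```

This keeps a single IndexWriter while still analyzing each language with the appropriate chain.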

Re: Phrase query given a word

2015-04-23 Thread Ahmet Arslan
Hi, Maybe LUCENE-5317 is relevant? Ahmet On Thursday, April 23, 2015 8:33 PM, Shashidhar Rao wrote: Hi, I have a large text and from that I need to calculate the top frequencies of words, say 'Driving' occurs the most. Now, I need to find phrases containing 'Driving' in the given text and th

intersection of two posting lists

2015-05-08 Thread Ahmet Arslan
Hello All, I am traversing posting list of a single term by following code. (not sure if there is a better way) Now I need to handle/aggregate multiple terms. Traverse intersection of multiple posting lists and obtain summed freq() of multiple terms per document. What is the easiest way to obta
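One way to intersect two posting lists and sum the per-document frequencies is the classic leapfrog walk over two DocsEnums (a sketch in the Lucene 4.x API used above, untested here; `termA`/`termB` are assumed to be built like the single-term case):

```java
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.search.DocIdSetIterator;

DocsEnum a = MultiFields.getTermDocsEnum(reader, liveDocs, field, termA.bytes());
DocsEnum b = MultiFields.getTermDocsEnum(reader, liveDocs, field, termB.bytes());
if (a != null && b != null) {
    int docA = a.nextDoc();
    int docB = b.nextDoc();
    while (docA != DocIdSetIterator.NO_MORE_DOCS
            && docB != DocIdSetIterator.NO_MORE_DOCS) {
        if (docA == docB) {
            // Both terms occur in this document: aggregate their freqs.
            int summedFreq = a.freq() + b.freq();
            System.out.println("doc=" + docA + " summedFreq=" + summedFreq);
            docA = a.nextDoc();
            docB = b.nextDoc();
        } else if (docA < docB) {
            docA = a.advance(docB);   // skip a forward to docB or beyond
        } else {
            docB = b.advance(docA);   // skip b forward to docA or beyond
        }
    }
}
```

advance() uses the skip data in the postings, so this is much cheaper than stepping both enums one document at a time.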

access query term in similarity calculation

2015-05-23 Thread Ahmet Arslan
Hi, I have a number of similarity implementations that extend SimilarityBase. I need to learn which term I am scoring inside the method: abstract float score(BasicStats stats, float freq, float docLen); What is the easiest way to access the query term that I am scoring in the similarity class? Th

IllegalArgumentException: docID must be >= 0 and < maxDoc=48736112 (got docID=2147483647)

2015-05-29 Thread Ahmet Arslan
Hello List, When a similarity returns NEGATIVE_INFINITY, hits[i].doc becomes 2147483647. Thus, an exception is thrown in the following code: for (int i = 0; i < hits.length; i++) { int docId = hits[i].doc; Document doc = searcher.doc(docId); } I know it is awkward to return infinity (comes from

Re: IllegalArgumentException: docID must be >= 0 and < maxDoc=48736112 (got docID=2147483647)

2015-05-30 Thread Ahmet Arslan
re if collectors could easily have the same performance without them. To me, such scores seem always undesirable and only bugs, and the current assertions are a good tradeoff. On Fri, May 29, 2015 at 8:18 AM, Ahmet Arslan wrote: > Hello List, > > When a similarity returns NEGATIVE_INFINIT

Re: Tf and Df in lucene

2015-06-15 Thread Ahmet Arslan
Hi Hummel, regarding df, Term term = new Term(field, word); TermStatistics termStatistics = searcher.termStatistics(term, TermContext.build(reader.getContext(), term)); System.out.println(query + "\t totalTermFreq \t " + termStatistics.totalTermFreq()); System.out.println(query + "\t docFreq \t
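A fuller version of the statistics snippet above, including the collection-level counts that often accompany df (Lucene 4.x/5.x API, untested here):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermContext;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.TermStatistics;

Term term = new Term(field, word);
TermContext context = TermContext.build(reader.getContext(), term);
TermStatistics termStats = searcher.termStatistics(term, context);

System.out.println("totalTermFreq = " + termStats.totalTermFreq()); // tf summed over all docs
System.out.println("docFreq       = " + termStats.docFreq());       // number of docs containing the term

CollectionStatistics collStats = searcher.collectionStatistics(field);
System.out.println("docCount      = " + collStats.docCount());      // docs that have this field
System.out.println("sumTotalTF    = " + collStats.sumTotalTermFreq());
```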

Re: Tf and Df in lucene

2015-06-15 Thread Ahmet Arslan
tates" (two terms) or "free speech zones" (three terms). Shay On Mon, Jun 15, 2015 at 4:55 PM Ahmet Arslan wrote: > Hi Hummel, > > regarding df, > > Term term = new Term(field, word); > TermStatistics termStatistics = searcher.termStatistics(term, > Te
