Re: Phrase Queries vs. SpanTermQueries exact phrases vs. stop words

2012-02-01 Thread Doron Cohen
int gap = (pp[pp.length-1] - pp[0]) - (pp.length - 1); "Don't want to cause an IndexOutOfBoundsException" Right... that's what I meant by (boundary cases)...
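The corrected boundary handling from this exchange can be sketched as a small standalone helper (the class and method names are hypothetical; in a real program the positions array would come from `PhraseQuery.getPositions()`):

```java
class PhraseGap {
    // Extra positional gap introduced by removed stop words: the span
    // between the first and last phrase positions, minus the minimal
    // span (pp.length - 1) of an unbroken phrase.
    static int extraGap(int[] pp) {
        if (pp.length == 0) {
            return 0; // guard against the boundary case discussed above
        }
        // last valid index is pp.length - 1, not pp.length
        return (pp[pp.length - 1] - pp[0]) - (pp.length - 1);
    }
}
```

For a phrase whose surviving terms sit at positions {0, 2, 5} (two stop words removed), the extra gap is 3, which could then be added to the slop.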

Re: When does Query Parser do its analysis ?

2012-02-01 Thread Doron Cohen
In my particular case I add album catalog nos to my index as a keyword field, but of course if the catalog number contains a space, as they often do (i.e. cad 6), there is a mismatch. I've now changed my indexing to index the value as 'cad6', removing spaces. Now if the query sent to the query
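The workaround described (stripping spaces so "cad 6" and "cad6" match) might be factored into a tiny normalizer applied identically at index and query time; the class and method names here are hypothetical:

```java
class CatalogNumbers {
    // Normalize a catalog number by removing whitespace and lowercasing,
    // so "CAD 6", "cad 6" and "cad6" all index and query identically.
    static String normalize(String catalogNo) {
        return catalogNo.replaceAll("\\s+", "").toLowerCase();
    }
}
```

The key design point is that the same normalization must run on both sides: a keyword field bypasses the analyzer, so any mismatch between indexing and querying is permanent.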

Re: Phrase Queries vs. SpanTermQueries exact phrases vs. stop words

2012-01-31 Thread Doron Cohen
Hi, Code here ignores the PhraseQuery's (PQ) positions: int[] pp = PQ.getPositions(); These positions have extra gaps when stop words are removed. To accommodate this, the overall extra gap can be added to the slop: int gap = (pp[pp.length] - pp[0]) - (pp.length - 1); // (+/-

Re: Taxonomy indexer debug

2011-11-28 Thread Doron Cohen
Sequence of operations seems logical; I don't see right away why this does not work. Could you minimize this to a small stand-alone program that does not work as expected? This will allow us to recreate the problem here and debug it. It is interesting that facet 3.5 is used with core 3.4 and queries

Re: Taxonomy indexer debug

2011-11-28 Thread Doron Cohen
Could you minimize this to a small stand-alone program that does not work as expected? This will be hard, because the bug only appears after a couple of days or more, and I'm starting to think that it is triggered by high data volumes. I'll try to minimize the code and serve more data

Re: Taxonomy indexer debug

2011-11-26 Thread Doron Cohen
However there are at least two issues with this: 1) the info would be in the lower level of the internal index writer, and not in that of the categories logic. 2) one cannot just call super.openIndexWriter(directory, openMode) and modify the result before returning it, because once IW is

Re: Taxonomy indexer debug

2011-11-25 Thread Doron Cohen
I'm having an issue with using NRT and taxonomy. After a couple of days of running continuously, the TaxonomyReader doesn't return results anymore (but the taxonomy index has them). TaxonomyReader does not support NRT - see https://issues.apache.org/jira/browse/LUCENE-3441 (Add NRT support to

Re: Scoring in Lucene

2011-10-07 Thread Doron Cohen
To my understanding this stems from V(q) · V(d) (see the *Conceptual Scoring Formula*) - the elements in those vectors are *Tf-idf* values, and so, implementation-wise (see the *Practical Scoring Function*), idf(t) is multiplied by itself: once for the query and once for the document. HTH, Doron
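That double multiplication can be sketched numerically, using the DefaultSimilarity-style idf formula; boosts, norms and coord are omitted, and the helper names are assumptions:

```java
class IdfSquared {
    // idf in the style of Lucene's DefaultSimilarity:
    // 1 + ln(numDocs / (docFreq + 1))
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Term contribution to V(q) · V(d): idf appears once in the query
    // vector and once in the document vector, so it is squared.
    static double termScore(double tf, int docFreq, int numDocs) {
        double x = idf(docFreq, numDocs);
        return tf * x * x;
    }
}
```

This is why rare terms dominate so strongly: halving a term's document frequency roughly quadruples, not doubles, its relative weight.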

Re: TaxWriter leakage?

2011-10-04 Thread Doron Cohen
Looking into this with Shai I think we see how this can happen, in this code of LTW: private ParentArray getParentArray() throws IOException { if (parentArray==null) { // [1] if (reader == null) { reader = openReader(); } parentArray = new ParentArray(); // [2]

Re: TaxWriter leakage?

2011-10-04 Thread Doron Cohen
On Tue, Oct 4, 2011 at 11:29 AM, Mihai Caraman caraman.mi...@gmail.comwrote: I also think that there is nothing special in the second restart, except that by that time there were other servlets up (?) which were able to trigger simultaneous AddDoc requests, exposing this bug...

Re: TaxWriter leakage?

2011-10-04 Thread Doron Cohen
LUCENE-3484 is resolved. Mihai, could you give it a try and see if this solves the NPE problem in your setup? You would need to download a nightly build that contains the fix - see the issue for revision numbers... On Tue, Oct 4, 2011 at 7:51 PM, Mihai Caraman caraman.mi...@gmail.comwrote:

Re: Please help me with a basic question...

2011-05-20 Thread Doron Cohen
Hi Rich, SeetSpotSimilarity looks promising. Does it not favor shorter docs by not normalizing, or does it make some attempt to standardize? - using e.g. SeetSpotSimilarity which does not favor shorter documents. SweetSpotSimilarity (I misspelled it previously) defines a range of lengths

Re: SpanNearQuery - inOrder parameter

2011-05-19 Thread Doron Cohen
Hi Greg, I created http://issues.apache.org/jira/browse/LUCENE-3120 for this problem, and attached there a more general test that exposes this problem, based on your test case. I am not sure yet that this is indeed a problem to be fixed with regard to span queries (see more there in JIRA) but

Re: SpanNearQuery - inOrder parameter

2011-05-19 Thread Doron Cohen
Hi Greg, On Thu, May 19, 2011 at 12:26 PM, Gregory Tarr gregory.t...@detica.comwrote: We let our users decide whether they want to force the order or not, so in effect they pass in inOrder. I would have to detect a repeated term and change the parameter as a result of that in order to

Re: Please help me with a basic question...

2011-05-19 Thread Doron Cohen
Hi Rich, If I understand correctly you are concerned that short documents are preferred too much over long ones, is this really the case? It would help to understand what goes on to look at the Explanation of the score for say two result documents - one that you think is ranked too low, and one

Re: How to implement a proximity search using LINES as slop

2011-02-10 Thread Doron Cohen
IIUC what you are trying to achieve I think the following could help, without setting all words in a line to be in the same position: At indexing, set a position increment of N (e.g. 100) at line start tokens. This would set a position gap of N between last token of line x to first token of line
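The position arithmetic behind this encoding can be sketched in plain Java. The helper below is a hypothetical model (in a real setup the increment would be applied by a custom TokenFilter via Token.setPositionIncrement); N = 100 follows the example in the message:

```java
class LinePositions {
    static final int LINE_GAP = 100; // assumed increment at each line-start token

    // Compute token positions for a document given tokens-per-line counts,
    // applying a position increment of LINE_GAP at each line start
    // (ordinary tokens get the usual increment of 1).
    static int[] positions(int[] tokensPerLine) {
        int total = 0;
        for (int n : tokensPerLine) total += n;
        int[] pos = new int[total];
        int p = -1, i = 0;
        for (int line = 0; line < tokensPerLine.length; line++) {
            for (int t = 0; t < tokensPerLine[line]; t++) {
                p += (t == 0 && line > 0) ? LINE_GAP : 1;
                pos[i++] = p;
            }
        }
        return pos;
    }
}
```

Since the boundary gap is exactly LINE_GAP, a sloppy phrase or span with slop below LINE_GAP - 1 can never match across a line break, while within-line proximity behaves as usual.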

Re: Creating an index with multiple values for a single field

2011-01-10 Thread Doron Cohen
On Mon, Jan 10, 2011 at 7:44 PM, Ryan Aylward r...@glassdoor.com wrote: We do leverage synonyms but they are not appropriate for this case. We use synonyms for words that are truly synonymous for the entire index such as inc and incorporated. Those words are always interchangeable. However,

Re: lucene locking

2010-12-16 Thread Doron Cohen
I have an app that seems to be locking on some search calls. I am including the stacktrace for the blocked and blocker thread. Is it deadlock for sure? No search deadlock fixes were done since 2.1.0, so perhaps it is something else... TP-Processor177 daemon prio=10

Re: Forcing specific index file names

2010-12-15 Thread Doron Cohen
I could make an exception in the patch creation program to detect that there is a lucene directory, and diff the .cfs files, even if they have different names, but I was seeing if I could avoid that so the patch program can be agnostic about the contents of the directory tree. Doing only this is

Re: Custom scoring for searhing geographic objects

2010-12-15 Thread Doron Cohen
Also, when taking the Similarity suggestion below note two things in Lucene's default behavior that you seem to wish to avoid: The first is IDF - but only for multi-term queries - otherwise ignore this comment. For multi term queries to only consider term frequency and doc length, you may want to

Re: Searcher#setSimilarity clarifications

2009-04-28 Thread Doron Cohen
Searcher is quite light. It is the index reader that is heavier. So create a single index reader, and for each of the similarities to be used concurrently, create a searcher over that single reader, set its similarity, and so on. Doron On Mon, Apr 27, 2009 at 7:53 PM, Rakesh Sinha

Re: exponential boosts

2009-04-24 Thread Doron Cohen
On Fri, Apr 24, 2009 at 12:28 AM, Steven Bethard beth...@stanford.eduwrote: On 4/23/2009 2:08 PM, Marcus Herou wrote: But perhaps one could use a FieldCache somehow ? Some code snippets that may help. I add the PageRank value as a field of the documents I index with Lucene like this:

Re: Error: there are more terms than documents...

2009-04-24 Thread Doron Cohen
On Thu, Apr 23, 2009 at 11:52 PM, bill.che...@sungard.com wrote: I figured it out. We are using Hibernate Search and in my ORM class I am doing the following: @Field(index=Index.TOKENIZED,store=Store.YES) protected String objectId; So when I persisted a new object to our database I was

Re: Error: there are more terms than documents...

2009-04-23 Thread Doron Cohen
On Thu, Apr 23, 2009 at 10:39 PM, bill.che...@sungard.com wrote: I'm getting a strange error when I make a Lucene (2.2.0) query: java.lang.RuntimeException: there are more terms than documents in field objectId, but it's impossible to sort on tokenized fields Is it possible that, for at

Re: exponential boosts

2009-04-23 Thread Doron Cohen
I think we are doing similar things; at least I am trying to implement document boosting with pagerank. Having issues of how to apply the scoring of specific docs without actually reindexing them. I feel something should be done at query time which looks at external data but do not know how to

Re: Why is CustomScoreQuery limited to ValueSourceQuery type?

2009-04-22 Thread Doron Cohen
: On 4/21/2009 10:09 AM, Doron Cohen wrote: It could, but (historically and) currently it doesn't... :) I actually have code for this. Would you like to open a JIRA issue for this - I'll attach my wrapper there? Done. https://issues.apache.org/jira/browse/LUCENE-1608 Steve On Tue, Apr 21

Re: Why is CustomScoreQuery limited to ValueSourceQuery type?

2009-04-21 Thread Doron Cohen
CustomScoreQuery expects the VSQs to have a score for document matching the (main) subQuery - this does not hold for arbitrary queries. On Sat, Apr 18, 2009 at 2:35 AM, Steven Bethard beth...@stanford.eduwrote: CustomScoreQuery only allows the secondary queries to be of type ValueSourceQuery

Re: IndexWriter update method

2009-04-21 Thread Doron Cohen
IndexWriter.deleteDocuments(Query query) (see http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/index/IndexWriter.html#deleteDocuments%28org.apache.lucene.search.Query%29) may be handy too (but note that it

Re: changing term freq in indexing time

2009-04-21 Thread Doron Cohen
Depending on the problem you are trying to solve there may be other solutions to it, not requiring setting wrong (?) values for term frequencies. If you can explain what you are trying to solve, people on the list may be able to suggest such alternatives. - Doron On Sun, Apr 19, 2009 at 2:39 PM,

Re: changing term freq in indexing time

2009-04-21 Thread Doron Cohen
didn't quite understand how to use it. Is there a better way to approach it? I hope I explained it well. Thanks, Liat 2009/4/21 Doron Cohen cdor...@gmail.com Depending on the problem you are trying to solve there may be other solutions to it, not requiring setting wrong (?) values

Re: Why is CustomScoreQuery limited to ValueSourceQuery type?

2009-04-21 Thread Doron Cohen
It could, but (historically and) currently it doesn't... :) I actually have code for this. Would you like to open a JIRA issue for this - I'll attach my wrapper there? Doron On Tue, Apr 21, 2009 at 7:58 PM, Steven Bethard beth...@stanford.eduwrote: On 4/21/2009 12:47 AM, Doron Cohen wrote

Re: Fields with the same name?? - Was Re: Payloads and tokenizers

2008-08-20 Thread Doron Cohen
On Tue, Aug 19, 2008 at 2:15 AM, Antony Bowesman [EMAIL PROTECTED] wrote: Thanks for your time and I appreciate your valuable insight Doron. Antony I'm glad I could help! Doron

Re: Fields with the same name?? - Was Re: Payloads and tokenizers

2008-08-18 Thread Doron Cohen
payload and the other part for storing, i.e. something like this: Token token = new Token(...); token.setPayload(...); SingleTokenTokenStream ts = new SingleTokenTokenStream(token); Field f1 = new Field("f", "some-stored-content", Store.YES, Index.NO); Field f2 = new Field("f", ts);

Re: Index of Lucene

2008-08-18 Thread Doron Cohen
On Mon, Aug 18, 2008 at 7:28 AM, blazingwolf7 [EMAIL PROTECTED]wrote: Thanks for the info. But do you know where this is actually performed in Lucene? I mean the method involved, that will calculate the value before storing it into the index. I tracked it to one method known as lengthNorm() in

Re: Payloads and tokenizers

2008-08-17 Thread Doron Cohen
Implementing payloads via Tokens explicitly prevents the use of payloads for untokenized fields, as they only support field.stringValue(). There seems no way to override this. I assume you already know this but just to make sure what I meant was clear - on tokenization but still indexing

Re: Index of Lucene

2008-08-17 Thread Doron Cohen
Norms information comes mainly from lengths of documents - allowing the search time scoring to take into account the effect of document lengths (actually field length within a document). In practice, norms stored within the index may include other information, such as index time boosts - for a

Re: Case Sensitivity

2008-08-16 Thread Doron Cohen
Hi Sergey, seems like cases 4 and 5 are equivalent, both meaning case insensitive, right? Otherwise please explain the difference. If it is required to support both case sensitive (cases 1,2,3) and case insensitive (case 4/5) then both forms must be saved in the index - in two separate fields (as

Re: Payloads and tokenizers

2008-08-14 Thread Doron Cohen
IIRC first versions of patches that added payloads support had this notion of payload by field rather than by token, but later it was modified to be by token only. I have seen two code patterns to add payloads to tokens. The first one created the field text with a reserved separator/delimiter

Re: Case Sensitivity

2008-08-14 Thread Doron Cohen
In the example I want to show that I stored the field as Field.Index.NO_NORMS. As I understand it, it means the field contains the original string regardless of which analyzer I chose (StandardAnalyzer by default). This would be achieved by UN_TOKENIZED. The NO_NORMS just guides Lucene to avoid normalizing

Re: Number range search

2008-08-13 Thread Doron Cohen
The code seems correct (although it doesn't show which analyzer was used at indexing). Note that when adding numbers like this there's no real point in analyzing them, so I would add that field as UN_TOKENIZED. This would be more efficient, and would also comply with the query parser, which does not
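A common companion precaution when indexing numbers as UN_TOKENIZED (not stated in the message, but closely related) is zero-padding to a fixed width, since range queries compare terms lexicographically and "10" would otherwise sort before "2":

```java
class PaddedNumbers {
    // Left-pad to a fixed width so string order matches numeric order:
    // "2" > "10" lexicographically, but "002" < "010" as intended.
    static String pad(long n, int width) {
        return String.format("%0" + width + "d", n);
    }
}
```

The width is an assumption that must cover the largest value ever indexed, and the same padding must be applied to the range bounds at query time.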

Re: Query to ignore certain phrases

2008-08-12 Thread Doron Cohen
I can't see how to accomplish this without writing some special code, and not just because of query parsing. Phrases are searched by iterating the participating term positions, and when a match is found, say for "b c", there is no way to know whether another query "a b c d" matches exactly the

Re: Query to ignore certain phrases

2008-08-12 Thread Doron Cohen
I think it should look something like this: "white house" NOT "russian white house"~1. "a b c"~1 just matches more 'easily' than "a b c". It will match for instance a b d c. The NOT however excludes all documents which match this, unlike the requested logic. In fact, Q1: "a b" NOT "a b c"~1 is worse

Re: Highlight huge documents

2008-08-11 Thread Doron Cohen
I believe Highlighter.setMaxDocBytesToAnalyze(int byteCount) should be used for this. On Mon, Aug 11, 2008 at 11:40 AM, [EMAIL PROTECTED] wrote: Hello I am using Highlighter to highlight query terms in documents getting from a database founded from lucene search. My problem is that when i

Re: Deleting and adding docs

2008-08-09 Thread Doron Cohen
doc.add(new Field(ID_FIELD, id, Field.Store.YES, Field.Index.NO)); writer.deleteDocuments(new Term(ID_FIELD, id)); int i = reader.deleteDocuments(new Term(ID_FIELD, id)); //i returns 0 Both failed. I try to delete one id value that I know for sure it was added in the first step. For

Re: Need help searching

2008-08-09 Thread Doron Cohen
writer = new IndexWriter("C:\\", new StandardAnalyzer(), true); Term term = new Term("line", "KOREA"); PhraseQuery query = new PhraseQuery(); query.add(term); StandardAnalyzer - used here while indexing - applies lowercasing. The query is created programmatically - i.e. without a QueryParser

Re: Re : Stop search process when a given number of hits is reached

2008-08-09 Thread Doron Cohen
Ok, I'm not near any documentation now, but I think throwing an exception is overkill. As I remember all you have to do is return false from your collector and that'll stop the search. But verify that. That would have been much cleaner; however collect() returns void, so throwing a (runtime)
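The exception-based early termination discussed in this thread can be sketched in plain Java. HitCollector and the search loop below are simplified stand-ins for Lucene's real classes; since collect() returns void, an unchecked exception is the only way to break out of the collection loop:

```java
class EarlyStop {
    // Unchecked, because collect() cannot return a "stop" signal.
    static class StopException extends RuntimeException {}

    interface HitCollector {
        void collect(int doc, float score);
    }

    static class CountingCollector implements HitCollector {
        final int max;
        int count = 0;
        CountingCollector(int max) { this.max = max; }
        public void collect(int doc, float score) {
            if (++count >= max) throw new StopException();
        }
    }

    // Stand-in for the search loop; returns how many docs were collected.
    static int search(int totalMatches, CountingCollector c) {
        try {
            for (int doc = 0; doc < totalMatches; doc++) {
                c.collect(doc, 1.0f);
            }
        } catch (StopException e) {
            // expected: the collector requested a stop after max hits
        }
        return c.count;
    }
}
```

The caller owns the try/catch, so a stop request is indistinguishable from normal completion from the application's point of view.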

Re: CustomScoreQuery and BooleanQuery

2008-08-07 Thread Doron Cohen
When combining any sub-queries, a scorer has at least two things to decide: which docs to match, and once matched, how to score. BooleanQuery applies specific logic for this, and some queries allow some control of the way to score. For the current CustomScoreQuery things are more straightforward -

Re: Stop search process when a given number of hits is reached

2008-08-07 Thread Doron Cohen
Nothing built in that I'm aware of will do this, but it can be done by searching with your own HitCollector. There is a related feature - stop search after a specified time - using TimeLimitedCollector. It is not released yet, see issue LUCENE-997. In short, the collector's collect() method is

Re: Concurrent query benchmarks

2008-06-10 Thread Doron Cohen
On Tue, Jun 10, 2008 at 3:50 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hi Glen, Thanks for sharing. Does your benchmarking tool build on top of contrib/benchmark? (not sure if that one lets you specify the number of concurrent threads -- if it does not, perhaps this is an opportunity

Re: IndexReader.reopen memory leak

2008-06-01 Thread Doron Cohen
Hi John, IndexReader newInner = in.reopen(); if (in != newInner) { in.close(); this.in = newInner; /* code to clean up my data */ _cache.clear(); _indexData.load(this, true); init(_fieldConfig); } Just to be sure on this, could you

Re: Opening an index directory inside a jar

2008-06-01 Thread Doron Cohen
: The crux of the issue seems to be that lucene cannot open segments file that : is inside the jar (under luceneFiles/index directory) i'm not entirely sure why it would have problems finding the segments file, but a larger problem is that Lucene needs random access which (last time i

Re: How to add PageRank score with lucene's relevant score in sorting

2008-06-01 Thread Doron Cohen
Hi Jarvis, I have a problem that how to combine two scores to sort the search result documents. for example I have 10 million pages in the lucene index, and i know their pagerank scores. i give a query to it, every doc returned has a lucene-score, mark it as R (relevance score), and

Re: LUCENE-933 / SOLR-261

2008-03-18 Thread Doron Cohen
hi Jake, yes it was committed in Lucene - this is visible in the JIRA issue when you switch to the Subversion Commits tab, where you can also see the actual diffs that took place. Best, Doron On Tue, Mar 18, 2008 at 7:14 PM, Jake Mannix [EMAIL PROTECTED] wrote: Hey folks, I was wondering

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-13 Thread Doron Cohen
Hi Daniel, LUCENE-1228 fixes a problem in IndexWriter.commit(). I suspect this can be related to the problem you see though I am not sure. Could you try with the patch there? Thanks, Doron On Thu, Mar 13, 2008 at 10:46 AM, Michael McCandless [EMAIL PROTECTED] wrote: Daniel Noll wrote: On

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-13 Thread Doron Cohen
On Thu, Mar 13, 2008 at 9:30 PM, Doron Cohen [EMAIL PROTECTED] wrote: Hi Daniel, LUCENE-1228 fixes a problem in IndexWriter.commit(). I suspect this can be related to the problem you see though I am not sure. Could you try with the patch there? Thanks, Doron Daniel, I was wrong about

Re: recall/precision with lucene

2008-02-10 Thread Doron Cohen
Take a look at the quality package under contrib/benchmark. Regards, Doron On Sat, Feb 9, 2008 at 2:59 AM, Panos Konstantinidis [EMAIL PROTECTED] wrote: Hello I am a new lucene user. I am trying to calculate the recall/precision of a query and I was wondering if lucene provides an easy way

Re: problem with Whitespace analyzer

2008-02-10 Thread Doron Cohen
It should be the parentheses, which are part of the query syntax. Try escaping them - \( \) Also see http://lucene.apache.org/java/2_3_0/queryparsersyntax.html#Escaping%20Special%20Characters Doron On Sun, Feb 10, 2008 at 9:03 AM, saikrishna venkata pendyala [EMAIL PROTECTED] wrote: Hi, I am facing
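Backslash-escaping can be done in the spirit of QueryParser.escape(); the sketch below is a hypothetical standalone version, with the character list taken from the query parser syntax page linked above:

```java
class QueryEscaper {
    // Special characters per the query parser syntax documentation.
    static final String SPECIALS = "+-&|!(){}[]^\"~*?:\\";

    // Prefix each query-syntax character with a backslash so it is
    // treated as a literal term character.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (SPECIALS.indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }
}
```

Escaping user input this way lets literal parentheses, colons, and the like survive parsing instead of being read as grouping or field syntax.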

Re: Extracting terms from a query splitting a phrase.

2008-02-10 Thread Doron Cohen
PhraseQuery.extractTerms() returns the terms making up the phrase, and so it is not adequate for 'finding' a single term that represents the phrase query, one that represents the entire searched text. It seems you are trying to obtain a string that can be matched against the displayed text for

Re: Performance guarantees and index format

2008-02-08 Thread Doron Cohen
I was once involved in modifying a search index implementation (not Lucene) to write posting lists so that they can be traversed (only) in reverse order. Docids were preserved but you got higher IDs first. This was a non-trivial code change. Now the suggestion to (optionally) order merged segments

Re: appending field to an existing index

2008-01-31 Thread Doron Cohen
This may help: http://www.nabble.com/Updating-Lucene-Index-with-Unstored-fields-tt15188818.html#a15188818 Doron On Thu, Jan 31, 2008 at 2:42 AM, John Wang [EMAIL PROTECTED] wrote: Hi all: We have a large index and it is difficult to reindex. We want to add another field to the index

Re: contrib/benchmark Quality

2008-01-30 Thread Doron Cohen
Hi Grant, I initially thought of doing so, but after working on the Million Queries Track where running the 10,000 queries could take more than a day (depending on the settings) and where indexing was done once and took a few days, I felt that tighter control is needed than that provided by the

Re: A small doubt related to write.lock

2008-01-30 Thread Doron Cohen
Hi Ajay, IndexReader.unlock() is a brute force call to be used by applications/users knowing that a lock can be safely removed. finalize() on the other hand is a method that Java will call when garbage collecting a no-more-referenced object. So it is often a place for cleanup code. However the

Re: Some Help needed in search.

2008-01-29 Thread Doron Cohen
You can add a phrase on the writer field. I.e. with a high boost of 3 and a low boost of 2, writing 'h' for 'heading' and 'w' for 'writer', try this query: h:sachin^3 h:tendulkar^3 w:sachin^2 w:tendulkar^2 w:"Sachin Tendulkar"^6 On Jan 29, 2008 9:23 AM, Sure [EMAIL PROTECTED] wrote: Hi All,

Re: Query processing with Lucene

2008-01-08 Thread Doron Cohen
Hi Marjan, Lucene processes the query in what can be called one-doc-at-a-time. For the example query - x y - (not the phrase query "x y") - all documents containing either x or y are considered a match. When processing the query - x y - the posting lists of these two index terms are traversed, and

Re: Basic Named Entity Indexing

2008-01-08 Thread Doron Cohen
Hi Chris, A null pointer exception can be caused by not checking newToken for null after this line: Token newToken = input.next(); I think Hoss meant to call next() on the input as long as returned tokens do not satisfy the check for being a named entity. Also, this code assumes white space

Re: Sorting on tokenized fields

2008-01-08 Thread Doron Cohen
Hi Michael, I think you mean the exception thrown when you search and sort with a field that was not yet indexed: RuntimeException: field BBC does not appear to be indexed I think the current behavior is correct, otherwise an application might (by a bug) attempt to sort by a wrong field,

Re: Query processing with Lucene

2008-01-08 Thread Doron Cohen
This is done by Lucene's scorers. You should however start at http://lucene.apache.org/java/docs/scoring.html - scorers are described in the Algorithm section. Offsets are used by Phrase Scorers and by Span Scorer. Doron On Jan 8, 2008 11:24 PM, Marjan Celikik [EMAIL PROTECTED] wrote: Doron

Re: Basic Named Entity Indexing

2008-01-08 Thread Doron Cohen
On Jan 8, 2008 11:48 PM, chris.b [EMAIL PROTECTED] wrote: Wrapping the whitespaceanalyzer with the ngramfilter, it creates unigrams and the ngrams that i indicate, while maintaining the whitespaces. :) The reason i'm doing this is because I only wish to index names with more than one token.

Re: Question regarding adding documents

2008-01-07 Thread Doron Cohen
Or, very similar, wrap the 'real' analyzer A with your analyzer that delegates to A but also keeps the returned tokens, possibly by using a CachingTokenFilter. On Jan 7, 2008 7:11 AM, Daniel Noll [EMAIL PROTECTED] wrote: On Monday 07 January 2008 11:35:59 chris.b wrote: is it possible to add

Re: StopWords problem

2007-12-27 Thread Doron Cohen
This is not a self contained program - it is incomplete, and it depends on files on *your* disk... Still, can you show why you're saying it indexes stopwords? Can you print here few samples of IndexReader.terms().term()? BR, Doron On Dec 27, 2007 10:22 AM, Liaqat Ali [EMAIL PROTECTED] wrote:

Re: StopWords problem

2007-12-27 Thread Doron Cohen
On Dec 27, 2007 11:49 AM, Liaqat Ali [EMAIL PROTECTED] wrote: I got your point. The program given does not give any error during compilation and it is interpreted well. But it does not create any index. When the StandardAnalyzer() is called without a stopwords list it works well, but

Re: StopWords problem

2007-12-27 Thread Doron Cohen
PROTECTED] wrote: Doron Cohen wrote: On Dec 27, 2007 11:49 AM, Liaqat Ali [EMAIL PROTECTED] wrote: I got your point. The program given does not give any error during compilation and it is interpreted well. But it does not create any index. When the StandardAnalyzer() is called

Re: Modifying StopAnalyzer

2007-12-26 Thread Doron Cohen
can we modify the StopAnalyzer to insert stop words of another language, instead of English, like Urdu given below: public static final String[] URDU_STOP_WORDS = { پر, کا, کی, کو }; new StandardAnalyzer(URDU_STOP_WORDS) should work. Regards, Doron

Re: StopWords problem

2007-12-26 Thread Doron Cohen
On Dec 26, 2007 10:33 PM, Liaqat Ali [EMAIL PROTECTED] wrote: Using javac -encoding UTF-8 still raises the following error. urduIndexer.java : illegal character: \65279 ? ^ 1 error What am I doing wrong? If you have the stop-words in a file, say one word in a line, they can be read like
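The \65279 character is the Unicode byte-order mark (U+FEFF) that some editors prepend to UTF-8 files, which javac rejects inside source code. Keeping the stop words in a data file, as suggested, sidesteps this; a minimal reader (hypothetical helper, one word per line) might look like:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

class StopWordsFile {
    // Read one stop word per line from a UTF-8 file; the explicit charset
    // avoids platform-default-encoding surprises with non-Latin scripts.
    static String[] read(String path) throws IOException {
        List<String> words = new ArrayList<String>();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), "UTF-8"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.length() > 0) words.add(line); // skip blank lines
            }
        } finally {
            in.close();
        }
        return words.toArray(new String[0]);
    }
}
```

The resulting array can then be passed to an analyzer constructor such as new StandardAnalyzer(words).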

Re: problem in indexing documents

2007-12-25 Thread Doron Cohen
document.add(new Field("contents", sb.toString(), Field.Store.NO, Field.Index.TOKENIZED)); In addition, for a field that is tokenized but not stored, like here, the Field() constructor that takes a Reader param can be handy. Regards, Doron

Re: document deletion problem

2007-12-20 Thread Doron Cohen
etc... when will they be gone? thanks - Original Message From: Doron Cohen [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Wednesday, December 19, 2007 1:13:56 PM Subject: Re: document deletion problem On Dec 19, 2007 5:45 PM, Tushar B wrote: Hi Doron, I

Re: document deletion problem

2007-12-19 Thread Doron Cohen
On Dec 19, 2007 5:45 PM, Tushar B [EMAIL PROTECTED] wrote: Hi Doron, I was just playing around with deletion because I wanted to delete documents due to spurious entries in one particular field. Could you tell me how do I file a JIRA issue? See Lucene's wiki, at page HowToContribute.

Re: Lucene multifield query problem

2007-12-18 Thread Doron Cohen
Hi Rakesh, Perhaps the confusion comes from the asymmetry between +X and -X. I.e., for the query: A B -C +D one might think that, similar to how -C only disqualifies docs containing C (but not qualifying docs not containing C), also +D only disqualifies docs not containing D. But this is

Re: Lucene multifield query problem

2007-12-18 Thread Doron Cohen
Hi Rakesh, It just occurred to me that your code has String searchCriteria = "Indoor*"; Assuming StandardAnalyzer was used at indexing time, all text words were lowercased. Now, QueryParser by default does not lowercase wildcard queries. You can however instruct it to do so by calling:

Re: FuzzyQuery + QueryParser - I'm puzzled

2007-12-17 Thread Doron Cohen
See in Lucene FAQ: Are Wildcard, Prefix, and Fuzzy queries case sensitive? On Dec 17, 2007 11:27 AM, Helmut Jarausch [EMAIL PROTECTED] wrote: Hi, please help I am totally puzzled. The same query, once with a direct call to FuzzyQuery succeeds while the same query with QueryParser fails.

Re: Field weights

2007-12-14 Thread Doron Cohen
It seems that documents having fewer fields satisfying the query are worth more than those satisfying more fields of the query, because the first ones are more to the point. At least it seems like it in the example. If this makes sense I would try to compose a top level boolean query out of the

Re: Applying SpellChecker to a phrase

2007-12-11 Thread Doron Cohen
Yes that's right, my mistake. In fact even after reading your comment I was puzzled because PhraseScorer indeed requires *all* phrase-positions to be satisfied in order to match. The answer is that the OR logic is taken care of by MultipleTermPositions, so the scorer does not need to be aware of

Re: Problem with termdocs.freq and other

2007-12-10 Thread Doron Cohen
while (termDocs.next()) { termDocs.next(); } For one, this loop calls next() twice in each iteration, so every second is skipped... ? chris.b [EMAIL PROTECTED] wrote on 10/12/2007 12:58:15: Here goes, I'm developing an application using lucene which will

Re: Problem with termdocs.freq and other

2007-12-10 Thread Doron Cohen
Seeing as that solved all my problems (I think), Glad it helped! (btw it's always like this with debugging - others see stuff in my code that I don't)

Re: Applying SpellChecker to a phrase

2007-12-07 Thread Doron Cohen
smokey [EMAIL PROTECTED] wrote on 04/12/2007 16:54:32: Thanks for the information on o.a.l.search.spans. I was thinking of parsing the phrase query string into a sequence of terms, then constructing a phrase query object using add(Term term, int position) method in

Re: Errors while running LIA code.

2007-12-06 Thread Doron Cohen
1) Downloaded http://www.ehatchersolutions.com/downloads/ LuceneInAction.zip - sorry, lucenebook.com is broken at the moment :( This one works too - http://www.manning.com/hatcher2/ -- Downloads -- Source Code - To

Re: SpellChecker performance and usage

2007-12-03 Thread Doron Cohen
I didn't have performance issues when using the spell checker. Can you describe what you tried and how long it took, so people can relate to that. AFAIK the spell checker in o.a.l.search.spell does not expand a query by adding all the permutations of potentially misspelled word. It is based on

Re: can we do partial optimization?

2007-12-03 Thread Doron Cohen
It doesn't make sense to optimize() after every document add. Lucene in fact implements logic in the spirit of what you describe below, when it decides to merge segments on the fly. There are various ways to tell Lucene how often to flush recently added/updated documents and what to merge. But

Re: Applying SpellChecker to a phrase

2007-12-03 Thread Doron Cohen
See below - smokey [EMAIL PROTECTED] wrote on 03/12/2007 05:14:23: Suppose I have an index containing the terms impostor, imposter, fraud, and fruad, then presumably regardless of whether I spell impostor and fraud correctly, Lucene SpellChecker will offer the improperly spelled versions as

Re: multireader vs multisearcher

2007-12-02 Thread Doron Cohen
MultiReader is more efficient and is preferred when possible. MultiSearcher allows further functionality. Every time an index has more than a single segment (which is to say almost every index, except right after calling optimize()), opening an IndexReader (or an IndexSearcher) above that index,

Re: FSDirectory Again

2007-12-02 Thread Doron Cohen
This is from Lucene's CHANGES.txt: LUCENE-773: Deprecate the FSDirectory.getDirectory(*) methods that take a boolean create argument. Instead you should use IndexWriter's create argument to create a new index. (Mike McCandless) So you should create the FSDir with

Re: Scoring for all the documents in the index relative to a query

2007-11-20 Thread Doron Cohen
You can also rely on the fact that by default documents are collected in docid order. You can therefore use your own hit collector that, when collecting doc with id n2, assuming the previous doc collected had id n1, would (know to) assign score 0 to all docs with n1 < id < n2. In other words, you can know
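The gap-filling idea can be sketched in plain Java; the collector below is a simplified stand-in for Lucene's HitCollector, and all names are hypothetical:

```java
class GapFillingCollector {
    private final float[] scores;
    private int prevDoc = -1;

    GapFillingCollector(int maxDoc) {
        scores = new float[maxDoc];
    }

    // Called in increasing docid order: every id strictly between the
    // previously collected doc and this one was not collected, so its
    // score is known to be 0.
    void collect(int doc, float score) {
        for (int skipped = prevDoc + 1; skipped < doc; skipped++) {
            scores[skipped] = 0f; // explicit for clarity; already 0
        }
        scores[doc] = score;
        prevDoc = doc;
    }

    float[] scores() {
        return scores;
    }
}
```

This only works because of the in-order guarantee; a collector receiving docs out of order could not infer anything about the ids it has not yet seen.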

Re: Problem in Running Lucene Demo

2007-11-19 Thread Doron Cohen
Try java -verbose to see more info on class loading. Also try java -classpath yourClassPath from the command line. Note that separators in the classpath may differ between operating systems - e.g. ; in Windows but : in Linux... Doron Liaqat Ali [EMAIL PROTECTED] wrote on 19/11/2007 15:43:30: Hi

Re: Customized search with Lucene?

2007-10-25 Thread Doron Cohen
is that the network can give you reasonable results even for words which haven't been used before (as least that is what the book seems to claim). Regards, Lukas On 10/16/07, Doron Cohen [EMAIL PROTECTED] wrote: Where and how do you store this type of info: If user U1 search for query Q7 boost

Re: Customized search with Lucene?

2007-10-25 Thread Doron Cohen
Lukas Vlcek [EMAIL PROTECTED] wrote on 25/10/2007 10:25:23: Doron, You definitely added a few important (crucial) questions. These are important concerns and I am glad to hear that the Lucene community is debating them. I am not a Lucene viscera expert thus I can hardly compare simple search

Re: How to do RangeQuery on a Computed Value of a Field?

2007-10-21 Thread Doron Cohen
You could use ValueSourceQuery for this - see o.a.l.search.function. The trick is to create your ValueSource class that is using two FieldCacheSource objects - one for each location. See http://issues.apache.org/jira/browse/LUCENE-1019 for a related example. Note however that this solution would

Re: contrib/benchmark Parallel tasks ?

2007-10-18 Thread Doron Cohen
Hi Grant, Grant Ingersoll wrote: I think the answer is: [{ MAddDocs AddDoc } : 5000] : 4 Is this the functional equivalent of doing: { MAddDocs AddDoc } : 2 in parallel? Yes, this is correct, it reads as create 4 threads, each adding 5000 docs to the index, and start/run the 4

Re: Customized search with Lucene?

2007-10-16 Thread Doron Cohen
Where and how do you store this type of info: If user U1 searches for query Q7, boost doc D5 by B17. If user U2 searches for query Q3, boost doc D15 by B2. Seems like lots of info, and it must be persistent. Perhaps o.a.l.search.function can help - assuming you have this info available at search time, and
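As a toy illustration of the bookkeeping those rules imply (all names hypothetical, with a plain in-memory map standing in for the persistent store the message calls for, and ignoring how the boost would be wired into o.a.l.search.function):

```java
import java.util.HashMap;
import java.util.Map;

class UserBoosts {
    private final Map<String, Float> boosts = new HashMap<String, Float>();

    private static String key(String user, String query, String doc) {
        return user + "|" + query + "|" + doc;
    }

    // Record a rule like "if user U1 searches for Q7, boost D5 by B17".
    void put(String user, String query, String doc, float boost) {
        boosts.put(key(user, query, doc), boost);
    }

    // Neutral boost of 1 when no rule exists for this (user, query, doc).
    float boost(String user, String query, String doc) {
        Float b = boosts.get(key(user, query, doc));
        return b == null ? 1f : b.floatValue();
    }
}
```

Even this toy shows the scale problem raised in the message: the table grows with users × queries × docs, which is why it must live outside the index and be consulted lazily at search time.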

Re: Field rank?

2007-10-10 Thread Doron Cohen
Hi Scott, Would indexing time field boosts work for you? http://lucene.apache.org/java/docs/scoring.html#Score%20Boosting Doron Scott Phillips wrote: Hi everyone, I have a question that I can't quite seem to find the answer to by googling or searching the archives of this mailing list. The

Re: index conversion

2007-09-24 Thread Doron Cohen
For an already optimized index calling optimize() is a no-op. You may try this: after opening the writer and setting compound=false, add a dummy (even empty) document to the index, then optimize(), and finally optionally remove the dummy document. Note that calling optimize() might be lengthy as
