Re: highlighting performance

2011-06-22 Thread Itamar Syn-Hershko
I'm not intimately familiar with FVH myself, but that sounds reasonable. Tests usually don't lie. I'd definitely like to see a patched version that avoids that! Itamar. On 22/06/2011 05:29, Michael Sokolov wrote: OK - it seems as if there is a blow-up in FieldPhraseList if a document has a la

Re: Coloring search results based on score?

2011-06-18 Thread Itamar Syn-Hershko
Thanks. That's very abstract and old, but perhaps I could work something out using this. Any other pointers / opinions welcome... Itamar. On 17/06/2011 03:26, Andrzej Bialecki wrote: On 6/17/11 12:29 AM, Itamar Syn-Hershko wrote: No, that was not what I meant. I'm not int

Re: Coloring search results based on score?

2011-06-16 Thread Itamar Syn-Hershko
See Highlighter's GradientFormatter Cheers Mark On 16 Jun 2011, at 22:01, Itamar Syn-Hershko wrote: Hi all, Interesting question: is it possible to color search results in a web-page based on their score? e.g. most relevant results in green, and then different shades through orange, y

Coloring search results based on score?

2011-06-16 Thread Itamar Syn-Hershko
Hi all, Interesting question: is it possible to color search results in a web-page based on their score? e.g. most relevant results in green, and then different shades through orange, yellow, red and then white. Theoretically, one could take the highest score and color based on proximity /
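
As a rough illustration of the idea in this post, here is a minimal sketch in plain Python (no Lucene API involved) that colors a hit by how close its score is to the highest score in the result set. The function name, the RGB values of the gradient stops and the linear interpolation are all assumptions made for this example; only the order of colors (green, orange, yellow, red, white) comes from the post.

```python
# Gradient stops from most to least relevant, in the order given in the post.
# The exact RGB values are illustrative assumptions.
STOPS = [
    (0, 128, 0),      # green
    (255, 165, 0),    # orange
    (255, 255, 0),    # yellow
    (255, 0, 0),      # red
    (255, 255, 255),  # white
]


def score_to_hex(score, top_score):
    """Map a score to a CSS hex color, relative to the highest score."""
    if top_score <= 0:
        return "#ffffff"
    # 0.0 means "as relevant as the top hit", 1.0 means "least relevant".
    distance = 1.0 - max(0.0, min(1.0, score / top_score))
    scaled = distance * (len(STOPS) - 1)
    index = min(int(scaled), len(STOPS) - 2)
    fraction = scaled - index
    lo, hi = STOPS[index], STOPS[index + 1]
    # Linear interpolation between the two neighbouring stops.
    rgb = [round(a + (b - a) * fraction) for a, b in zip(lo, hi)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)
```

A caller would compute `top_score` once per result page and pass each hit's score through `score_to_hex` when rendering. The colors are only meaningful within one result set, since they are relative to that set's best hit.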

Re: Index size and performance degradation

2011-06-14 Thread Itamar Syn-Hershko
to be failing quite a lot. For example see: http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html On 14/06/2011 10:28, Toke Eskildsen wrote: On Sun, 2011-06-12 at 10:10 +0200, Itamar Syn-Hershko wrote: The whole point of my question was to find out if and how to

Re: Index size and performance degradation

2011-06-13 Thread Itamar Syn-Hershko
However, turning around changes from the adds should be faster (no segment gets flushed). Mike McCandless http://blog.mikemccandless.com On Mon, Jun 13, 2011 at 5:06 PM, Itamar Syn-Hershko wrote: Thanks Mike, much appreciated. Wouldn't Twitter's approach fall for the exact same pi

Re: Index size and performance degradation

2011-06-13 Thread Itamar Syn-Hershko
ally require it. Mike McCandless http://blog.mikemccandless.com On Sun, Jun 12, 2011 at 4:46 PM, Itamar Syn-Hershko wrote: Thanks for your detailed answer. We'll have to tackle this and see what's more important to us then. I'd definitely love to hear Zoie has overcome all that... Any

Re: Index size and performance degradation

2011-06-13 Thread Itamar Syn-Hershko
On 13/06/2011 06:23, Shai Erera wrote: A Language filter is one -- different users search in different languages and want to view pages in those languages only. If you have a field attach to your documents that identifies the language of the document, you can use it to filter the queries to retur

Re: Index size and performance degradation

2011-06-12 Thread Itamar Syn-Hershko
as done (though, those changes are not simple either!). Mike McCandless http://blog.mikemccandless.com On Sun, Jun 12, 2011 at 2:25 PM, Itamar Syn-Hershko wrote: Mike, Speaking of NRT, and completely off-topic, I know: Lucene's NRT apparently isn't fast enough if Zoie was needed, and no

Re: Index size and performance degradation

2011-06-12 Thread Itamar Syn-Hershko
Our problem is a bit different. There aren't always common searches so if we cache blindly we could end up having too much RAM allocated for virtually nothing. And we need to allow for real-time search so caching will hardly help. We enforce some client-side caching, but again - the real-time r

Re: Index size and performance degradation

2011-06-12 Thread Itamar Syn-Hershko
ndless.com On Sun, Jun 12, 2011 at 4:10 AM, Itamar Syn-Hershko wrote: Thanks. The whole point of my question was to find out if and how to make balancing on the SAME machine. Apparently that's not going to help and at a certain point we will just have to prompt the user to buy more hardware...

Re: Index size and performance degradation

2011-06-12 Thread Itamar Syn-Hershko
ays relates to the characteristics of the underlying hardware. I think the best you can do is actually test on various configurations, then at least you can say "on configuration X this is the tipping point". Sorry there isn't a better answer that I know of, but... Best Erick On

Re: Index size and performance degradation

2011-06-12 Thread Itamar Syn-Hershko
istics of the underlying hardware. I think the best you can do is actually test on various configurations, then at least you can say "on configuration X this is the tipping point". Sorry there isn't a better answer that I know of, but... Best Erick On Sat, Jun 11, 2011 at 3:37 PM

Re: Index size and performance degradation

2011-06-12 Thread Itamar Syn-Hershko
on various configurations, then at least you can say "on configuration X this is the tipping point". Sorry there isn't a better answer that I know of, but... Best Erick On Sat, Jun 11, 2011 at 3:37 PM, Itamar Syn-Hershko wrote: Hi all, I know Lucene indexes to be at their optimum

Index size and performance degradation

2011-06-11 Thread Itamar Syn-Hershko
Hi all, I know Lucene indexes to be at their optimum up to a certain size - said to be around several GBs. I haven't found a good discussion over this, but it's my understanding that at some point it's better to split an index into parts (a la sharding) than to continue searching on a huge-size

Re: multiple small indexes or one big index?

2011-06-10 Thread Itamar Syn-Hershko
Erick, Sorry about reopening this more than a week late... You were asking about the size of each index; at what index size would you consider splitting to several indices with multiple searches etc, for what reasons, and does it matter which Lucene version is used? Thanks :) Itamar.

Re: Recent Content - Lucene vs. DB SELECT / DB Triggers / Memcached

2011-03-09 Thread Itamar Syn-Hershko
(sorry for picking this up so late...) This sounds like a perfect fit for document DBs like CouchDB and MongoDB - based on your architecture and data structure. They are designed for multi-server applications, and use Map/Reduce which will give you Lucene operations directly from your DB, n

Re: MultiFieldQueryParser

2010-10-16 Thread Itamar Syn-Hershko
Perhaps you met this issue which I have already reported? https://issues.apache.org/jira/browse/LUCENE-2518 Itamar. On 14/10/2010 3:40 AM, Erick Erickson wrote: I'm not quite sure what you mean by "run a query against multiple fields". But would creating your own BooleanQuery where each claus

Changing QueryParser operator images

2010-09-28 Thread Itamar Syn-Hershko
Hi all, I'm trying to customize the "AND", "OR" and "NOT" operators being used by the QP, without changing anything in the core. I noticed a previous attempt, but it seems to have died quietly a few years ago [1]. Unfortunately, even changing the hardcoded values seems impossible, as they

Re: finding the analyzer for a language...

2010-09-26 Thread Itamar Syn-Hershko
Shai, I was referring to your #2, which you already indicated in your reply wasn't part of the discussion. Itamar. On 26/9/2010 10:10 AM, Shai Erera wrote: The mapping is simply about returning the right Analyzer for the given Locale. You decide up front (as the Factory developer) what Analyze

Re: finding the analyzer for a language...

2010-09-25 Thread Itamar Syn-Hershko
I may be missing the point here, but how do you define an analyzer <-> language match? What do you do in cases of mixed content, for example? Itamar. On 25/9/2010 10:27 PM, Shai Erera wrote: Shai Erera brought a similar idea up before, to use Locale, but my concerns are it would be limited by

Re: get wordno, lineno, pageno for term/phrase

2010-08-04 Thread Itamar Syn-Hershko
I quite liked the idea Erick brought up in his last response - using a special field for storing this data. See if you can define its structure in a way that would help you do that and save both performance and index size. Each term in it signaling lineno and pageno (term text is "p1", "p2"...

Re: get wordno, lineno, pageno for term/phrase

2010-08-04 Thread Itamar Syn-Hershko
Storing all that info per-token as payloads will bloat the index. Wouldn't it be wiser to use a special token to mark page feed and end of paragraph (numbers of which could be then stored as payloads), and scan the token stream per document to retrieve them back? Some extra operations for retri

Re: Scoring exact matches higher in a stemmed field

2010-07-22 Thread Itamar Syn-Hershko
On 22/7/2010 9:20 PM, Shai Erera wrote: How is that different than extending QP? Mainly because the problem I'm having isn't there, and doing it from there doesn't feel right, and definitely not like solving the issue. I want to explore what other options there are before doing anything, an

Re: Scoring exact matches higher in a stemmed field

2010-07-19 Thread Itamar Syn-Hershko
On 19/7/2010 5:50 PM, Shai Erera wrote: If your analyzer outputs b and b$ in the same position, then the below query will already be what the QP outputs today. If you want to incorporate boosting, I can suggest that you extend QP, override newTermQuery for example, and if the term is a stemmed term

Re: Scoring exact matches higher in a stemmed field

2010-07-17 Thread Itamar Syn-Hershko
d your question, then please correct me. Shai On Friday, July 16, 2010, Itamar Syn-Hershko wrote: Hi all, Consider the following string: "the buffalo buffaloes" [1]. When passed through a stemming analyzer, the resulting token would be "buffalo buffalo" (assuming a good s

Scoring exact matches higher in a stemmed field

2010-07-16 Thread Itamar Syn-Hershko
Hi all, Consider the following string: "the buffalo buffaloes" [1]. When passed through a stemming analyzer, the resulting token would be "buffalo buffalo" (assuming a good stemmer). To enable exact searches, say I mark the original term and index it at the same term position. So "the buf
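
The indexing scheme described here can be sketched as follows. This is plain Python that only simulates an analyzer's output as (term, position increment) pairs; it does not use any Lucene API. The `toy_stem` function is a hypothetical stand-in for a real stemmer, the stop-word list is an assumption, and the "$" suffix for the exact form is borrowed from the "b and b$" notation used in a reply in this thread.

```python
def toy_stem(term):
    # Hypothetical stand-in for a real stemmer: strips a plural suffix.
    for suffix in ("es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term


def analyze(text, stop_words=("the",), exact_marker="$"):
    """Return (term, position_increment) pairs for a piece of text.

    Each surviving word yields its stem with increment 1, followed by the
    marked original form with increment 0, i.e. at the same position.
    For simplicity, dropped stop words leave no position gap here.
    """
    tokens = []
    for word in text.lower().split():
        if word in stop_words:
            continue
        tokens.append((toy_stem(word), 1))       # stem advances the position
        tokens.append((word + exact_marker, 0))  # exact form, same position
    return tokens
```

For "the buffalo buffaloes" this produces the stem "buffalo" twice, with "buffalo$" and "buffaloes$" stacked on the respective positions, so an exact search could target the marked terms while a stemmed search targets the plain ones.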

Re: Best way to use Lucene from perl

2010-07-09 Thread Itamar Syn-Hershko
CLucene is a complete port of Java Lucene to C++, and it has Perl bindings, although I'm not sure how up to date they are - you'll have to check with their author. CLucene development branch currently supports the Lucene 2.3.2 API and index format. See http://clucene.sourceforge.net/ for more det

Lucene In Action free chapter on CLucene

2010-06-28 Thread Itamar Syn-Hershko
Hi, Just to let everyone know Manning have released an extra chapter from the excellent LIA 2E book, discussing CLucene - the C++ port of Lucene. It is available for free at http://www.code972.com/blog/2010/06/lucene-in-action-free-chapter-coupon-code/. 35% discount for CLucene users is av

RE: arguments in favour of lucene over commercial competition

2010-06-24 Thread Itamar Syn-Hershko
> -Original Message- > From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] > Sent: Friday, June 25, 2010 1:09 AM > To: java-user@lucene.apache.org > Subject: Re: arguments in favour of lucene over commercial competition > > And I was just thinking the other day how it would be cool

RE: arguments in favour of lucene over commercial competition

2010-06-23 Thread Itamar Syn-Hershko
Otis, I'm 99% sure Attivio is just a wrapper around Lucene... And I personally wouldn't count full text search solutions such as Oracle's. Itamar. > -Original Message- > From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] > Sent: Thursday, June 24, 2010 12:42 AM > To: java-user@

[ANN] First of a kind open-source effort for advancing Hebrew IR

2010-06-07 Thread Itamar Syn-Hershko
http://www.code972.com/blog/hebmorph/. As we progress, updates will be posted to that blog, to our mailing list, and on twitter (#HebMorph). If this is of interest to you, we would appreciate your feedback and help. Please use our mailing list, or contact me privately, for any inquiries. Itamar Syn-He

RE: recommendation for deprecated StandardTokenizer.next() method?

2010-06-03 Thread Itamar Syn-Hershko
That would be next(Token) I believe. The reason it was deprecated afaik was to force a reuse of the Token object, to gain more performance. Itamar. -Original Message- From: allasso [mailto:allassopra...@gmail.com] Sent: Thursday, June 03, 2010 10:52 PM To: java-user@lucene.apache.org S

RE: What's DisjunctionMaxQuery ?

2010-06-01 Thread Itamar Syn-Hershko
See slide 18 in http://www.cnlp.org/presentations/slides/advancedluceneeu.pdf, and http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/search/DisjunctionMaxQuery.html. Itamar. -Original Message- From: Li Li [mailto:fancye...@gmail.com] Sent: Tuesday, June 01, 2010 11:42 AM To: jav

Options of Field constructor accepting Reader

2008-06-09 Thread Itamar Syn-Hershko
Hi all, I was wondering why only the Field constructor which accepts a String offers Store and Index options? I understand there might be no logic in offering them for the TokenStream constructor, but what's wrong in Storing an input from a Reader, that 2.3.2 does not allow it? Itamar.

RE: Version 2.3 Does Not Index/Digest All Document Tokens

2008-05-20 Thread Itamar Syn-Hershko
Just a thought - are the files you're indexing larger than 10,000 words (MAX_FIELD_LENGTH)? If so, maybe either your code or Lucene 2.3.* have changed something in maxFieldLength implementation... Itamar. -Original Message- From: Dan Rugg [mailto:[EMAIL PROTECTED] Sent: Friday, May 16,

Getting Terms Position Gaps

2008-05-18 Thread Itamar Syn-Hershko
Hi all, How can I see the position gaps in my indexed field? I've set up some sort of mechanism to increment position gap for specific terms in specific circumstances, and I want to make sure it is working as expected. I've tried Luke but it doesn't seem to be able to view this info. Thanks in

RE: setPositionIncrement questions

2008-05-11 Thread Itamar Syn-Hershko
Chris, I ended up hacking StandardTokenizer::next() to check for $^$^$, and if it is there then set the current Token PositionIncrement to 500 and resume the tokenizing loop (so the word which will be read into that Term will have position increment of 500). As far as I can tell it is working wel
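
The effect of the hack described above can be simulated in a few lines. This is a plain Python sketch, not the actual StandardTokenizer change: it assumes the input is already split into words, drops the `$^$^$` marker, and gives the word that follows it a position increment of 500. The function name and the list-of-words input are assumptions for illustration.

```python
GAP_MARKER = "$^$^$"  # marker string mentioned in the post
GAP = 500             # position increment mentioned in the post


def positions(words):
    """Return (term, absolute_position) pairs.

    The marker itself is not emitted; the next real term receives a
    position increment of GAP instead of the usual 1.
    """
    result = []
    position = -1  # so that the first term lands on position 0
    pending = 1
    for word in words:
        if word == GAP_MARKER:
            pending = GAP
            continue
        position += pending
        result.append((word, position))
        pending = 1
    return result
```

With input `["a", "b", "$^$^$", "c", "d"]` the terms "b" and "c" end up 500 positions apart, so a phrase or proximity query with a small slop should no longer match across the marker.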

RE: Why Lucene has to rewrite queries prior to actual searching?

2008-04-09 Thread Itamar Syn-Hershko
IL PROTECTED] Sent: Tuesday, April 08, 2008 5:57 PM To: java-user@lucene.apache.org Subject: Re: Why Lucene has to rewrite queries prior to actual searching? On Tuesday 08 April 2008 15:18:34 Itamar Syn-Hershko wrote: > Paul, > > I don't see how this answers the question. Towards the e

RE: Why Lucene has to rewrite queries prior to actual searching?

2008-04-08 Thread Itamar Syn-Hershko
On Tuesday 08 April 2008 00:34:48 Itamar Syn-Hershko wrote: > Paul and John, > > Thanks for your quick reply. > > The problem with query rewriting is the aforementioned > MaxClauseException. Instead of inflating the query and passing a > deterministic list of terms to the

RE: Why Lucene has to rewrite queries prior to actual searching?

2008-04-07 Thread Itamar Syn-Hershko
rts (AND like), Scorer.skipTo() is used, and that could well be the filter mechanism you are referring to; have a look at the javadocs of Scorer, and, if necessary, at the actual code of ConjunctionScorer. Regards, Paul Elschot On Monday 07 April 2008 23:13:09 Itamar Syn-Hershko wrote: >

Why Lucene has to rewrite queries prior to actual searching?

2008-04-07 Thread Itamar Syn-Hershko
Hi all, Can one of the experts here explain why Lucene has to get a "rewritten" query for the Searcher - so Phrase or Wildcard queries have to rewrite themselves into a "primitive" query, that is then passed to Lucene to look for? I'm probably not too familiar with the internals of L

RE: setPositionIncrement questions

2008-03-31 Thread Itamar Syn-Hershko
me query inflation, or as I first suggested, auto-apply synonyms. The only question is, I guess, are there any drawbacks for using this? Thanks. Itamar. -Original Message- From: Erick Erickson [mailto:[EMAIL PROTECTED] Sent: Monday, March 31, 2008 4:25 PM To: java-user@lucene.apache.org

RE: setPositionIncrement questions

2008-03-31 Thread Itamar Syn-Hershko
Chris, Thanks for your input. Please let me make sure that I get this right: while iterating through the words in a document, I can use my tokenizer to setPositionIncrement(150) on a specific token, which would make it more distant from the previous token than it should have been. The next tok

setPositionIncrement questions

2008-03-26 Thread Itamar Syn-Hershko
Hi all, Breaking proximity data has been discussed several times before, and it was concluded that setPositionIncrement is the way to go. Regarding that: 1. Where should it be called exactly to create the gap properly? 2. Is there a way to call it directly somehow while indexing (e.g. after adding

RE: Contrib Highlighter and Phrase search

2008-03-19 Thread Itamar Syn-Hershko
(since I'm inflating the query). Does this make sense? Itamar. -Original Message- From: Daniel Noll [mailto:[EMAIL PROTECTED] Sent: Thursday, March 20, 2008 12:44 AM To: java-user@lucene.apache.org Subject: Re: Contrib Highlighter and Phrase search On Wednesday 19 March 2008 18:28:15 Ita

RE: Contrib Highlighter and Phrase search

2008-03-19 Thread Itamar Syn-Hershko
I'm not sure how the current Highlighter works - haven't had the time to look into it yet - but I thought about the following implementation. Judging by your question, this works in a slightly different way than the current Highlighter: 1. Build a Radix tree (PATRICIA) and populate it with all se

RE: Language identification ??

2008-03-14 Thread Itamar Syn-Hershko
For what it's worth, I did something similar in my BidiAnalyzer so I can index both Hebrew/Semitic texts and English/Latin words without switching analyzers, giving each the proper treatment. I did it simply by testing the first char and looking at its numeric value - so it falls between Hebrew Ale
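
A first-character test of this kind might look like the sketch below. It is plain Python and not the BidiAnalyzer code. The post is cut off before it names the upper bound of the range, so the range used here, U+05D0 (Hebrew letter Alef) through U+05EA (Hebrew letter Tav), is an assumption; the function names and the "hebrew" / "latin" labels are also made up for the example.

```python
HEBREW_ALEF = 0x05D0  # first Hebrew letter in Unicode
HEBREW_TAV = 0x05EA   # last Hebrew letter; assumed upper bound of the test


def is_hebrew_token(token):
    """Decide by the numeric value of the first character only."""
    if not token:
        return False
    return HEBREW_ALEF <= ord(token[0]) <= HEBREW_TAV


def route(token):
    """Pick which (hypothetical) processing path a token should take."""
    return "hebrew" if is_hebrew_token(token) else "latin"
```

Looking only at the first character is cheap, but a token that starts with a digit, punctuation or a Hebrew vowel point would fall through to the non-Hebrew path, so a real analyzer may need a more careful check.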

Best way to do Query inflation?

2008-03-10 Thread Itamar Syn-Hershko
Hi all, I'm looking for the best way to inflate a query, so a query like: "synchronous AND colour" -- will become something like this: "(synchronous OR asynchronous OR bsynchronous OR synchronos OR asynchronos OR bsynchronos) AND (colour OR acolour OR bcolour OR color OR acolor OR bcolor)". I'
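
One simple way to picture the inflation is as string rewriting before the query text reaches the query parser. The sketch below is plain Python under that assumption; it does not build Lucene query objects. `SPELLINGS` and `toy_variants` are hypothetical helpers that mimic the pattern in the example above (an alternative spelling, plus "a" and "b" prefixes).

```python
# Hypothetical spelling table, just enough to reproduce the example.
SPELLINGS = {"colour": "color", "synchronous": "synchronos"}


def toy_variants(term):
    """Return the term, its alternative spelling, and prefixed forms."""
    bases = [term]
    if term in SPELLINGS:
        bases.append(SPELLINGS[term])
    variants = []
    for base in bases:
        variants.extend([base, "a" + base, "b" + base])
    return variants


def inflate(query_terms, variants_for):
    """Turn AND-ed terms into AND-ed groups of OR-ed variants."""
    groups = []
    for term in query_terms:
        # Keep the original term first and avoid listing it twice.
        variants = [term] + [v for v in variants_for(term) if v != term]
        groups.append("(" + " OR ".join(variants) + ")")
    return " AND ".join(groups)
```

Calling `inflate(["synchronous", "colour"], toy_variants)` yields a string of the same shape as the inflated query quoted above. Note that the number of clauses grows with every variant added per term, which is worth keeping in mind for long queries.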

RE: Rebuilding Document from index?

2008-03-01 Thread Itamar Syn-Hershko
uary 2008 03:33:53 Itamar Syn-Hershko wrote: > I'm still trying to engineer the best possible solution for Lucene > with Hebrew, right now my path is NOT using a stemmer by default, only > by explicit request of the user. MoreLikeThis would only return > relevant results if I

RE: Rebuilding Document from index?

2008-02-26 Thread Itamar Syn-Hershko
rogne.net/subversion/revuedepresse/trunk/src/java/lexicon And the web version : https://admin.garambrogne.net/projets/revuedepresse/browser/trunk/src/java/lexicon On 26 Feb 08 at 17:33, Itamar Syn-Hershko wrote: > > Implementing something like MoreLikeThis for Hebrew. Non-Hebrew > implement

RE: Rebuilding Document from index?

2008-02-26 Thread Itamar Syn-Hershko
mDocs/TermEnum. Or perhaps TermFreqVector. I admit I haven't used that last, but that family of methods ought to fix you up. What problem are you trying to solve? Perhaps there are better solutions to suggest Best Erick On Mon, Feb 25, 2008 at 6:04 PM, Itamar Syn-Hershko <[EMAIL PROT

RE: Rebuilding Document from index?

2008-02-25 Thread Itamar Syn-Hershko
correctly. > -Original Message- > From: Itamar Syn-Hershko [mailto:[EMAIL PROTECTED] > Sent: Friday, 22 February 2008 14:02 > To: java-user@lucene.apache.org > Subject: Rebuilding Document from index? > > Hi, > > Is it possible to re-create a document from an

RE: Rebuilding Document from index?

2008-02-22 Thread Itamar Syn-Hershko
lding Document from index? You can use Luke to rebuild the document. It will show you the terms of the analyzed document, not the original content. And this is what you want, if I understood you correctly. > -Original Message- > From: Itamar Syn-Hershko [mailto:[EMAIL PROTECTED] >

Rebuilding Document from index?

2008-02-22 Thread Itamar Syn-Hershko
Hi, Is it possible to re-create a document from an index, if it's not stored? What I'm looking for is a way to have a text document with the text AFTER it was analyzed, so I can see how my analyzer handles certain cases. So that means I don't care if I will not get the original document. I want to

RE: Retrieving documents that match atleast n query terms

2008-02-04 Thread Itamar Syn-Hershko
I'm not 100% sure, but I think you could use Lucene's scoring for this. So if you ran your query and received N results, loop through them and check the scoring explanation (which I'm not quite sure how to acquire). This should tell you how many terms out of the query were found. This approach shou

Having 2 fields, each using different analyzers?

2008-01-31 Thread Itamar Syn-Hershko
Hi all, Since the Analyzer is set per IndexWriter, to which a Document with several fields is added, I was wondering how I would store 2 different fields in a Document, each being passed through a different Analyzer? The idea is to have 2 fields of the same content, one stemmed and one is not

RE: Lucene, HTML and Hebrew

2008-01-30 Thread Itamar Syn-Hershko
OK, I've been processing things for a while. I came up with an idea that I want your advice on -- is there a way I could stem the Hebrew words in my analyzer yet keep a note of some sort of the original term which was assembled by this stem, WITHOUT affecting frequency/proximity data? This is I gu

RE: Lucene to index OCR text

2008-01-25 Thread Itamar Syn-Hershko
In our (very) small project (several thousands of pages), we scan what we can scan (and type what is not scannable), and then have someone proofread the OCR'd material. Precision matters in our case, and this seemed to be the only way. One thought I had on your case - maybe there's an OCR librar

RE: Lucene, HTML and Hebrew

2008-01-24 Thread Itamar Syn-Hershko
where, I suspect). Also, it helps if there is some indication that the questioner has attempted to answer the question for themselves using readily available resources, but failed. On 01/21/2008 at 2:59 PM, Itamar Syn-Hershko wrote: > 1) How would Lucene treat the "normal" paragraph when t

Lucene help?

2008-01-22 Thread Itamar Syn-Hershko
Hi all, Yesterday I sent an email to this group querying about some very important (to me...) features of Lucene. I'm giving it another chance before it goes unnoticed or forgotten. If it was too long please let me know and I will email a shorter list of questions. The original post can be f

Lucene, HTML and Hebrew

2008-01-21 Thread Itamar Syn-Hershko
Hi all, I'm starting the process of creating Hebrew support for Lucene. Specifically I'm using CLucene (which is an awesome and strong port), but that shouldn't matter for my questions. Please, if you know of any info or similar project let me know, it can save me loads of time and headaches.