I'm not intimately familiar with FVH myself, but that sounds reasonable.
Tests usually don't lie. I'd definitely like to see a patched version
that avoids that!
Itamar.
On 22/06/2011 05:29, Michael Sokolov wrote:
OK - it seems as if there is a blow-up in FieldPhraseList if a
document has a la
Thanks. That's very abstract and old, but perhaps I could work something
out using this.
Any other pointers / opinions welcome...
Itamar.
On 17/06/2011 03:26, Andrzej Bialecki wrote:
On 6/17/11 12:29 AM, Itamar Syn-Hershko wrote:
No, that was not what I meant.
I'm not int
See Highlighter's GradientFormatter
Cheers
Mark
On 16 Jun 2011, at 22:01, Itamar Syn-Hershko wrote:
Hi all,
Interesting question: is it possible to color search results in a web-page
based on their score? e.g. most relevant results in green, and then different
shades through orange, yellow, red and then white.
Hi all,
Interesting question: is it possible to color search results in a
web-page based on their score? e.g. most relevant results in green, and
then different shades through orange, yellow, red and then white.
Theoretically, one could take the highest score and color based on
proximity /
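For what it's worth, a minimal sketch of that interpolation idea (plain Java, not a Lucene API; the class name and color endpoints are my own choices - GradientFormatter in org.apache.lucene.search.highlight does this same kind of linear color blending, just per highlighted term):

public class ScoreColorizer {
  // Linearly blend each RGB channel from white (lowest) to green (highest).
  public static String colorFor(float score, float maxScore) {
    float t = maxScore <= 0 ? 0f : Math.min(score / maxScore, 1f);
    int[] lo = {0xFF, 0xFF, 0xFF}; // least relevant
    int[] hi = {0x00, 0x80, 0x00}; // most relevant
    StringBuilder css = new StringBuilder("#");
    for (int i = 0; i < 3; i++) {
      css.append(String.format("%02X", Math.round(lo[i] + (hi[i] - lo[i]) * t)));
    }
    return css.toString();
  }
}

The per-search maximum is available from TopDocs.getMaxScore().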
to be
failing quite a lot. For example see:
http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html
On 14/06/2011 10:28, Toke Eskildsen wrote:
On Sun, 2011-06-12 at 10:10 +0200, Itamar Syn-Hershko wrote:
The whole point of my question was to find out if and how to
However, turning around changes from the adds should be faster (no
segment gets flushed).
Mike McCandless
http://blog.mikemccandless.com
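For reference, a sketch of the NRT reader turnaround being discussed (3.1+ API; assumes an open IndexWriter named writer):

import org.apache.lucene.index.IndexReader;

IndexReader r = IndexReader.open(writer, true); // true = apply deletes
// ... search with new IndexSearcher(r) ...
IndexReader r2 = r.reopen(); // cheap refresh; re-sees recent adds/deletes
if (r2 != r) { r.close(); r = r2; }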
On Mon, Jun 13, 2011 at 5:06 PM, Itamar Syn-Hershko wrote:
Thanks Mike, much appreciated.
Wouldn't Twitter's approach fall for the exact same pi
ally require it.
Mike McCandless
http://blog.mikemccandless.com
On Sun, Jun 12, 2011 at 4:46 PM, Itamar Syn-Hershko wrote:
Thanks for your detailed answer. We'll have to tackle this and see what's
more important to us then. I'd definitely love to hear Zoie has overcome all
that...
Any
On 13/06/2011 06:23, Shai Erera wrote:
A Language filter is one -- different users search in different languages
and want to view pages in those languages only. If you have a field attached
to your documents that identifies the language of the document, you can use
it to filter the queries to return only documents in that language.
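A sketch of such a filter (3.x API; the "lang" field name and values are assumptions - the field should be indexed untokenized, and searcher/userQuery are assumed to exist):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Restrict any user query to English documents; the caching wrapper keeps
// the filter's bitset per reader so it isn't recomputed for every query.
Filter langFilter = new CachingWrapperFilter(
    new QueryWrapperFilter(new TermQuery(new Term("lang", "en"))));
TopDocs hits = searcher.search(userQuery, langFilter, 10);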
as done
(though, those changes are not simple either!).
Mike McCandless
http://blog.mikemccandless.com
On Sun, Jun 12, 2011 at 2:25 PM, Itamar Syn-Hershko wrote:
Mike,
Speaking of NRT, and completely off-topic, I know: Lucene's NRT apparently
isn't fast enough if Zoie was needed, and no
Our problem is a bit different. There aren't always common searches so
if we cache blindly we could end up having too much RAM allocated for
virtually nothing. And we need to allow for real-time search so caching
will hardly help. We enforce some client-side caching, but again - the
real-time r
http://blog.mikemccandless.com
On Sun, Jun 12, 2011 at 4:10 AM, Itamar Syn-Hershko wrote:
Thanks.
The whole point of my question was to find out if and how to make balancing
on the SAME machine. Apparently that's not going to help and at a certain
point we will just have to prompt the user to buy more hardware...
always relates to the characteristics of the
underlying hardware.
I think the best you can do is actually test on various
configurations, then at least you can say "on configuration
X this is the tipping point".
Sorry there isn't a better answer that I know of, but...
Best
Erick
On Sat, Jun 11, 2011 at 3:37 PM, Itamar Syn-Hershko
wrote:
Hi all,
I know Lucene indexes to be at their optimum up to a certain size - said
to be around several GBs. I haven't found a good discussion over this,
but it's my understanding that at some point it's better to split an index
into parts (a la sharding) than to continue searching on a huge-sized one.
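For reference, a sketch of searching split indexes as one (3.x API; the shard paths are made up):

import java.io.File;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.store.FSDirectory;

// Each shard is an ordinary index; the multi-searcher merges and
// re-scores hits across them, searching the shards in parallel.
Searchable[] shards = {
    new IndexSearcher(FSDirectory.open(new File("/indexes/shard1"))),
    new IndexSearcher(FSDirectory.open(new File("/indexes/shard2")))
};
ParallelMultiSearcher searcher = new ParallelMultiSearcher(shards);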
Erick,
Sorry about reopening this more than a week late...
You were asking about the size of each index; at what index size would
you consider splitting into several indices with multiple searches, etc.,
for what reasons, and does it matter which Lucene version is used?
Thanks :)
Itamar.
(sorry for picking this up so late...)
This sounds like a perfect fit for document DBs like CouchDB and MongoDB
- based on your architecture and data structure.
They are designed for multi-server applications, and use Map/Reduce
which will give you Lucene operations directly from your DB, n
Perhaps you met this issue which I have already reported?
https://issues.apache.org/jira/browse/LUCENE-2518
Itamar.
On 14/10/2010 3:40 AM, Erick Erickson wrote:
I'm not quite sure what you mean by "run a query against multiple fields".
But would creating your own BooleanQuery, where each clause queries one of
the fields, work?
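A sketch of that suggestion (3.x API; field names and the term are placeholders):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// One optional clause per field; a document matches if any field has the term.
String[] fields = {"title", "body", "keywords"};
BooleanQuery bq = new BooleanQuery();
for (String f : fields) {
  bq.add(new TermQuery(new Term(f, "lucene")), BooleanClause.Occur.SHOULD);
}

MultiFieldQueryParser builds roughly this shape from a parsed query string.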
Hi all,
I'm trying to customize the "AND", "OR" and "NOT" operators being used
by the QP, without changing anything in the core. I noticed a previous
attempt, but it seems to have died quietly a few years ago [1].
Unfortunately, even changing the hardcoded values seems impossible, as
they
Shai, I was referring to your #2, which you already indicated in your
reply wasn't part of the discussion.
Itamar.
On 26/9/2010 10:10 AM, Shai Erera wrote:
The mapping is simply about returning the right Analyzer for the given
Locale. You decide up front (as the Factory developer) what Analyzer to
return for each Locale.
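A sketch of such a factory, assuming 3.x and made-up analyzer choices (mixed-content documents would still need per-field treatment, as the next message asks about):

import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// Returns a preconfigured Analyzer per language, with a general fallback.
public class LocaleAnalyzerFactory {
  private final Map<String, Analyzer> byLang = new HashMap<String, Analyzer>();
  private final Analyzer fallback = new StandardAnalyzer(Version.LUCENE_30);
  public LocaleAnalyzerFactory() {
    byLang.put("en", new StandardAnalyzer(Version.LUCENE_30));
    // byLang.put("he", new HebrewAnalyzer()); // hypothetical per-language choices
  }
  public Analyzer forLocale(Locale locale) {
    Analyzer a = byLang.get(locale.getLanguage());
    return a != null ? a : fallback;
  }
}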
I may be missing the point here, but how do you define an analyzer <->
language match? What do you do in cases of mixed content, for example?
Itamar.
On 25/9/2010 10:27 PM, Shai Erera wrote:
Shai Erera brought a similar idea up before, to use Locale, but my concerns
are it would be limited by
I quite liked the idea Erick brought up in his last response - using a
special field for storing this data. See if you can define its structure
in a way that would help you do that and save both performance and index
size. Each term in it signaling lineno and pageno (term text is "p1",
"p2"...
Storing all that info per-token as payloads will bloat the index.
Wouldn't it be wiser to use a special token to mark page feed and end of
paragraph (numbers of which could then be stored as payloads), and scan
the token stream per document to retrieve them back? Some extra
operations for retri
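A sketch of the payload variant (3.1+ attribute API; the "PAGE" marker token and the one-byte page encoding are simplifications of mine, not an established scheme):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

// Attaches a running page number as a payload to each synthetic "PAGE"
// marker token injected upstream; ordinary tokens pass through untouched.
public final class PageMarkerFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);
  private int page = 1;
  public PageMarkerFilter(TokenStream in) { super(in); }
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;
    if (termAtt.toString().equals("PAGE")) {
      payAtt.setPayload(new Payload(new byte[] { (byte) page++ }));
    }
    return true;
  }
}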
On 22/7/2010 9:20 PM, Shai Erera wrote:
How is that different than extending QP?
Mainly because the problem I'm having isn't there, and doing it from
there doesn't feel right, and definitely not like solving the issue. I
want to explore what other options there are before doing anything, an
On 19/7/2010 5:50 PM, Shai Erera wrote:
If your analyzer outputs b and b$ in the same position, then the below query
will already be what the QP outputs today. If you want to incorporate
boosting, I can suggest that you extend QP, override newTermQuery for
example, and if the term is a stemmed term
If I misunderstood your question, then please correct me.
Shai
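A sketch of that override (3.x queryParser package; the "$" marker follows the convention above, and the boost value is an arbitrary choice):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

// Boosts exact-form terms (marked with a trailing "$") over their stems.
public class StemAwareQueryParser extends QueryParser {
  public StemAwareQueryParser(Version v, String field, Analyzer a) {
    super(v, field, a);
  }
  protected Query newTermQuery(Term term) {
    Query q = super.newTermQuery(term);
    if (term.text().endsWith("$")) q.setBoost(2.0f); // exact match counts more
    return q;
  }
}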
On Friday, July 16, 2010, Itamar Syn-Hershko wrote:
Hi all,
Consider the following string: "the buffalo buffaloes" [1].
When passed through a stemming analyzer, the resulting tokens would be
"buffalo buffalo" (assuming a good stemmer).
To enable exact searches, say I mark the original term and index it at
the same term position. So "the buf
CLucene is a complete port of Java Lucene to C++, and it has Perl
bindings, although I'm not sure how up to date they are - you'll have to
check with their author. CLucene's development branch currently supports the
Lucene 2.3.2 API and index format.
See http://clucene.sourceforge.net/ for more det
Hi,
Just to let everyone know, Manning has released an extra chapter from
the excellent LIA 2E book, discussing CLucene - the C++ port of Lucene.
It is available for free at
http://www.code972.com/blog/2010/06/lucene-in-action-free-chapter-coupon-code/.
A 35% discount for CLucene users is available.
> -Original Message-
> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
> Sent: Friday, June 25, 2010 1:09 AM
> To: java-user@lucene.apache.org
> Subject: Re: arguments in favour of lucene over commercial competition
>
> And I was just thinking the other day how it would be cool
Otis, I'm 99% sure Attivio is just a wrapper around Lucene...
And I personally wouldn't count full text search solutions such as Oracle's.
Itamar.
> -Original Message-
> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
> Sent: Thursday, June 24, 2010 12:42 AM
> To: java-user@
http://www.code972.com/blog/hebmorph/. As
we progress, updates will be posted to that blog, to our mailing list, and
on Twitter (#HebMorph).
If this is of interest to you, we would appreciate your feedback and
help. Please use our mailing list, or contact me privately, for any
inquiries.
Itamar Syn-Hershko
That would be next(Token) I believe. The reason it was deprecated, afaik, was
to force reuse of the Token object and gain more performance.
Itamar.
-Original Message-
From: allasso [mailto:allassopra...@gmail.com]
Sent: Thursday, June 03, 2010 10:52 PM
To: java-user@lucene.apache.org
S
See slide 18 in
http://www.cnlp.org/presentations/slides/advancedluceneeu.pdf, and
http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/search/DisjunctionMaxQuery.html.
Itamar.
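For reference, minimal usage (the term and field names are placeholders):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.TermQuery;

// Scores each doc by its best-matching field rather than the sum of fields;
// the 0.1 tie-breaker gives a small credit to the other matching fields.
DisjunctionMaxQuery dmq = new DisjunctionMaxQuery(0.1f);
dmq.add(new TermQuery(new Term("title", "lucene")));
dmq.add(new TermQuery(new Term("body", "lucene")));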
-Original Message-
From: Li Li [mailto:fancye...@gmail.com]
Sent: Tuesday, June 01, 2010 11:42 AM
To: jav
Hi all,
I was wondering why only the Field constructor which accepts a String offers
Store and Index options? I understand there might be no logic in offering
them for the TokenStream constructor, but what's wrong with storing input
from a Reader, which 2.3.2 does not allow?
Itamar.
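A workaround sketch, assuming you can afford to buffer the Reader in memory (the constant names here are the later ANALYZED spelling; older versions call it TOKENIZED):

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.document.Field;

// Drain the Reader into a String first; the String constructor then
// lets you pick Store/Index options freely.
public static Field storedField(String name, Reader reader) throws IOException {
  StringBuilder sb = new StringBuilder();
  char[] buf = new char[4096];
  for (int n; (n = reader.read(buf)) != -1; ) sb.append(buf, 0, n);
  return new Field(name, sb.toString(), Field.Store.YES, Field.Index.ANALYZED);
}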
Just a thought - are the files you're indexing larger than 10,000 words
(MAX_FIELD_LENGTH)? If so, maybe either your code or Lucene 2.3.* has
changed something in the maxFieldLength implementation...
Itamar.
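If that is the culprit, the cap can be lifted (2.4+/3.x spellings; dir and analyzer are assumed to exist):

import org.apache.lucene.index.IndexWriter;

// Either construct the writer without the 10,000-term default cap...
IndexWriter writer = new IndexWriter(dir, analyzer,
    IndexWriter.MaxFieldLength.UNLIMITED);
// ...or raise it on an existing writer:
writer.setMaxFieldLength(Integer.MAX_VALUE);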
-Original Message-
From: Dan Rugg [mailto:[EMAIL PROTECTED]
Sent: Friday, May 16,
Hi all,
How can I see the position gaps in my indexed field? I've set up some sort of
mechanism to increment position gap for specific terms in specific
circumstances, and I want to make sure it is working as expected. I've tried
Luke but it doesn't seem to be able to view this info.
Thanks in
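Failing Luke, a small dump utility can show the gaps directly (2.x/3.x API; the field and term are placeholders):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

// Prints every position of a term per document; artificial gaps show up
// as large jumps between consecutive numbers.
public static void dumpPositions(IndexReader reader, Term t) throws IOException {
  TermPositions tp = reader.termPositions(t);
  while (tp.next()) {
    System.out.print("doc " + tp.doc() + ":");
    for (int i = 0; i < tp.freq(); i++) System.out.print(" " + tp.nextPosition());
    System.out.println();
  }
  tp.close();
}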
Chris,
I ended up hacking StandardTokenizer::next() to check for $^$^$, and if it
is there then set the current Token's PositionIncrement to 500 and resume the
tokenizing loop (so the word which will be read into that Term will have a
position increment of 500). As far as I can tell it is working wel
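An alternative sketch that leaves the tokenizer stock and applies the gap in a TokenFilter instead (3.1+ attributes; note StandardTokenizer would strip a punctuation marker like $^$^$, so this assumes a word-like marker token of your choosing):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Swallows marker tokens and adds their gap to the next real token.
public final class GapMarkerFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posAtt =
      addAttribute(PositionIncrementAttribute.class);
  public GapMarkerFilter(TokenStream in) { super(in); }
  public boolean incrementToken() throws IOException {
    int gap = 0;
    while (input.incrementToken()) {
      if ("PAGEBREAK".equals(termAtt.toString())) { gap += 500; continue; }
      if (gap > 0) posAtt.setPositionIncrement(posAtt.getPositionIncrement() + gap);
      return true;
    }
    return false;
  }
}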
IL PROTECTED]
Sent: Tuesday, April 08, 2008 5:57 PM
To: java-user@lucene.apache.org
Subject: Re: Why Lucene has to rewrite queries prior to actual searching?
On Tuesday 08 April 2008 15:18:34, Itamar Syn-Hershko wrote:
> Paul,
>
> I don't see how this answers the question.
Towards the e
On Tuesday 08 April 2008 00:34:48, Itamar Syn-Hershko wrote:
> Paul and John,
>
> Thanks for your quick reply.
>
> The problem with query rewriting is the aforementioned
> MaxClauseException. Instead of inflating the query and passing a
> deterministic list of terms to the
rts (AND like), Scorer.skipTo() is used, and that
could well be the filter mechanism you are referring to; have a look at the
javadocs of Scorer, and, if necessary, at the actual code of
ConjunctionScorer.
Regards,
Paul Elschot
On Monday 07 April 2008 23:13:09, Itamar Syn-Hershko wrote:
>
Hi all,
Can someone from the experts here explain why Lucene has to get a "rewritten"
query for the Searcher - so Phrase or Wildcard queries have to rewrite
themselves into a "primitive" query, which is then passed to Lucene to search
for? I'm probably not too familiar with the internals of L
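The short answer is that scoring (Weight/Scorer) is only defined for primitive queries, so multi-term queries first expand themselves against the index. You can watch it happen (reader is an open IndexReader; field and term are placeholders):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

// rewrite() resolves the wildcard against the actual index terms...
Query wq = new WildcardQuery(new Term("body", "synchron*"));
Query primitive = wq.rewrite(reader); // typically a BooleanQuery of TermQueries
System.out.println(primitive);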
me query inflation, or as I first
suggested, auto-apply synonyms. The only question is, I guess, are there any
drawbacks to using this?
Thanks.
Itamar.
-Original Message-
From: Erick Erickson [mailto:[EMAIL PROTECTED]
Sent: Monday, March 31, 2008 4:25 PM
To: java-user@lucene.apache.org
Chris,
Thanks for your input.
Please let me make sure that I get this right: while iterating through the
words in a document, I can use my tokenizer to setPositionIncrement(150) on
a specific token, which would make it more distant from the previous token
than it would otherwise be. The next tok
Hi all,
Breaking proximity data has been discussed several times before, and the
conclusion was that setPositionIncrement is the way to go. Regarding it:
1. Where should it be called exactly to create the gap properly?
2. Is there a way to call it directly somehow while indexing (e.g. after adding
(since I'm inflating the query). Does this make sense?
Itamar.
-Original Message-
From: Daniel Noll [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 20, 2008 12:44 AM
To: java-user@lucene.apache.org
Subject: Re: Contrib Highlighter and Phrase search
On Wednesday 19 March 2008 18:28:15 Ita
I'm not sure how the current Highlighter works - haven't had the time to
look into it yet - but I thought about the following implementation. Judging
by your question, this works in a slightly different way than the current
Highlighter:
1. Build a Radix tree (PATRICIA) and populate it with all se
For what it's worth, I did something similar in my BidiAnalyzer so I can
index both Hebrew/Semitic texts and English/Latin words without switching
analyzers, giving each the proper treatment. I did it simply by testing the
first char and looking at its numeric value - so it falls between Hebrew
Aleph
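For reference, that check can be written either way (the Hebrew letters occupy U+05D0, Aleph, through U+05EA, Tav):

// Range test against the Hebrew letter block...
static boolean isHebrewLetter(char c) {
  return c >= '\u05D0' && c <= '\u05EA';
}
// ...or the more general Unicode-block test:
static boolean isHebrew(char c) {
  return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.HEBREW;
}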
Hi all,
I'm looking for the best way to inflate a query, so a query like: "synchronous
AND colour" -- will become something like this:
"(synchronous OR asynchronous OR bsynchornous OR synchronos OR asynchronos OR
bsynchornos) AND (colour OR acolour OR bcolour OR color OR acolor OR bcolor)".
I'
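A sketch of building that shape programmatically (one required OR-group per original term; the field name is a placeholder):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// variants[i] holds the alternatives for the i-th original term.
public static BooleanQuery inflate(String field, String[][] variants) {
  BooleanQuery outer = new BooleanQuery();
  for (String[] group : variants) {
    BooleanQuery inner = new BooleanQuery();
    for (String v : group)
      inner.add(new TermQuery(new Term(field, v)), BooleanClause.Occur.SHOULD);
    outer.add(inner, BooleanClause.Occur.MUST); // AND between the groups
  }
  return outer;
}

Beyond a certain size this will hit BooleanQuery's clause limit, which can be raised via BooleanQuery.setMaxClauseCount().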
uary 2008 03:33:53 Itamar Syn-Hershko wrote:
> I'm still trying to engineer the best possible solution for Lucene
> with Hebrew, right now my path is NOT using a stemmer by default, only
> by explicit request of the user. MoreLikeThis would only return
> relevant results if I
rogne.net/subversion/revuedepresse/trunk/src/java/lexicon
And the web version :
https://admin.garambrogne.net/projets/revuedepresse/browser/trunk/src/java/lexicon
On 26 Feb 2008 at 17:33, Itamar Syn-Hershko wrote:
>
> Implementing something like MoreLikeThis for Hebrew. Non-Hebrew
> implement
TermDocs/TermEnum. Or perhaps TermFreqVector. I admit I haven't used
that last, but that family of methods ought to fix you up.
What problem are you trying to solve? Perhaps there are better solutions to
suggest
Best
Erick
On Mon, Feb 25, 2008 at 6:04 PM, Itamar Syn-Hershko <[EMAIL PROT
correctly.
> -Original Message-
> From: Itamar Syn-Hershko [mailto:[EMAIL PROTECTED]
> Sent: Friday, 22 February 2008 14:02
> To: java-user@lucene.apache.org
> Subject: Rebuilding Document from index?
>
> Hi,
>
> Is it possible to re-create a document from an
Rebuilding Document from index?
You can use Luke to rebuild the document. It will show you the terms of the
analyzed document, not the original content.
And this is what you want, if I understood you correctly.
> -Original Message-
> From: Itamar Syn-Hershko [mailto:[EMAIL PROTECTED]
>
Hi,
Is it possible to re-create a document from an index, if it's not stored?
What I'm looking for is a way to have a text document with the text AFTER it
was analyzed, so I can see how my analyzer handles certain cases. So that
means I don't care if I will not get the original document. I want to
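No index is needed for that; you can run text straight through the analyzer and print the tokens (3.1+ attribute spelling; the field name is irrelevant here):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Prints one analyzed term per line for the given text.
public static void dumpTokens(Analyzer analyzer, String text) throws Exception {
  TokenStream ts = analyzer.tokenStream("f", new StringReader(text));
  CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
  ts.reset();
  while (ts.incrementToken()) System.out.println(term.toString());
  ts.close();
}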
I'm not 100% sure, but I think you could use Lucene's scoring for this. So
if you ran your query and received N results, loop through them and check
the scoring explanation (which I'm not quite sure how to acquire). This
should tell you how many terms out of the query were found. This approach
shou
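The explanation mentioned is acquired per hit from the searcher (searcher, query and hits are assumed to exist):

import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.ScoreDoc;

// For each hit, explain() details which clauses matched and their weights.
for (ScoreDoc sd : hits.scoreDocs) {
  Explanation exp = searcher.explain(query, sd.doc);
  System.out.println(exp.toString());
}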
Hi all,
Since the Analyzer is set per IndexWriter, to which Documents (each with
several fields) are added, I was wondering how I would store 2 different
fields in a Document, each being passed through a different Analyzer? The
idea is to have 2 fields of the same content, one stemmed and one is not
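PerFieldAnalyzerWrapper exists for exactly this; a sketch with made-up field names (the contrib SnowballAnalyzer stands in for whatever stemming analyzer you use):

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// The default analyzer handles every field not explicitly mapped.
PerFieldAnalyzerWrapper analyzer =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30));
analyzer.addAnalyzer("body_stemmed", new SnowballAnalyzer(Version.LUCENE_30, "English"));
// Index the same text once into "body" (unstemmed) and once into "body_stemmed".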
OK, I've been processing things for a while. I came up with an idea that I
want your advice on -- is there a way I could stem the Hebrew words in my
analyzer yet keep a note of some sort of the original term from which this
stem was derived, WITHOUT affecting frequency/proximity data? This is
I gu
In our (very) small project (several thousands of pages), we scan what we
can scan (and type what is not scannable), and then have someone proofread
the OCR'd material. Precision matters in our case, and this seemed
to be the only way. One thought I had on your case - maybe there's an OCR
librar
where, I suspect). Also, it helps if there is some indication that the
questioner has attempted to answer the question for themselves using readily
available resources, but failed.
On 01/21/2008 at 2:59 PM, Itamar Syn-Hershko wrote:
> 1) How would Lucene treat the "normal" paragraph when t
Hi all,
Yesterday I sent an email to this group querying about some very important
(to me...) features of Lucene. I'm giving it another chance before it goes
unnoticed or forgotten. If it was too long please let me know and I will
email a shorter list of questions.
The original post can be f
Hi all,
I'm starting the process of creating Hebrew support for Lucene.
Specifically I'm using CLucene (which is an awesome and strong port), but
that shouldn't matter for my questions. Please, if you know of any info or
similar project let me know, it can save me loads of time and headaches.