On Wed, Feb 17, 2010 at 10:55 AM, Erick Erickson wrote:
> Well, Query *does* implement the Serializable interface, so that
> might work. WARNING: I haven't personally used the Serializable
> interface on Query, so I have no real clue whether it's applicable!
>
Query is serializable (lots of peopl
On Mon, Feb 8, 2010 at 9:33 AM, Chris Lu wrote:
> Since you already have RMI interface, maybe you can parallel search on
> several nodes, collect the data, pick top ones, and send back results via
> RMI.
>
One thing to be careful of here, which you might already be aware of:
Query (and subcla
On Wed, Feb 3, 2010 at 1:33 PM, tsuraan wrote:
> > The FieldCache loads per segment, and the NRT reader is reloading only
> > new segments from disk, so yes, it's "smarter" about this caching in this
> > case.
>
> Ok, so the cache is tied to the index, and not to any particular
> reader. The act
The FieldCache loads per segment, and the NRT reader is reloading only
new segments from disk, so yes, it's "smarter" about this caching in this
case.
-jake
On Wed, Feb 3, 2010 at 1:07 PM, tsuraan wrote:
> Is the cache used by sorting on strings separated by reader, or is it
> a global thing?
coord won't help him, I don't think.
Doesn't he just want a DisjunctionMaxQuery instead of BooleanQuery?
-jake
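For context: a DisjunctionMaxQuery scores each document by its single best-matching clause (plus an optional tie-breaker fraction of the others), instead of summing clause scores the way BooleanQuery does - usually what you want when the same term can match in several fields. A minimal sketch, assuming Lucene 2.9/3.x APIs; the field names and the term "albino" are just illustrative:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.TermQuery;

public class DisMaxSketch {
    public static DisjunctionMaxQuery build() {
        // 0.0f would mean "pure max"; a small tieBreaker like 0.1f adds
        // back a fraction of the non-best clause scores.
        DisjunctionMaxQuery q = new DisjunctionMaxQuery(0.1f);
        q.add(new TermQuery(new Term("title", "albino")));
        q.add(new TermQuery(new Term("body", "albino")));
        return q;
    }
}
```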
On Fri, Jan 29, 2010 at 9:28 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:
> Paul,
>
> Custom Similarity perhaps, oui. Not 100% sure, maybe have this always
> return 1.0
On Wed, Jan 27, 2010 at 12:17 AM, Jamie wrote:
> Hi Jake
>
>
> You were indexing but not searching? So you are never calling getReader()
>> in the first place?
>>
>>
> Of course, the call exists; it's just that during testing we did not execute
> any searches at all.
Oh! Re-reading your initi
On Tue, Jan 26, 2010 at 11:13 PM, Jamie wrote:
>
> Hi Jake
>
> Thanks for the info. Are you specifically referring to
> http://issues.apache.org/jira/browse/LUCENE-2120?
>
Yep, that's the issue I'm referring to.
> Our app indexes about 170 50k documents per second in heavy load. In any
> case,
Hi Jamie,
How fast are you indexing (number of documents per second)? We also ran
into this
when trying to perf test heavy query throughput while doing rapid indexing
under exactly
these conditions: call getReader() every time a search is executed (so that
it's "really
real time").
The answe
Merry Christmas to you, Weiwei.
If you want to release your software under *exactly* the Apache License
(version 2.0 is the most current form of it), you may do so very easily -
just read the appendix at the end of this page:
http://www.apache.org/licenses/LICENSE-2.0
In particular, note that
What kind of queries are these? I.e. How much work goes into step 4? Is
this
a fairly standard combination of Boolean/Phrase/other stock Lucene queries
built
up out of tokenizing the text?
If so, it's going to be nowhere near the bottleneck in your runtime (we're
talking
often way less than a mi
Peter,
You want to do a facet query. This kind of functionality is not in
Lucene-core (sadly), but both Solr (the fully featured search application
built on Lucene) and bobo-browse (just a library, like Lucene itself) are
open-source and work with Lucene to provide faceting capabilities for yo
You will want to have one Lucene field which contains this composite key -
it could be the un-tokenized concatenation of all of the subkeys, for example, and
then one Term
would have the full composite key, and the updateDocument technique would
work fine.
-jake
On Mon, Nov 16, 2009 at 11:09
The usual way to do this is to use:
IndexWriter.updateDocument(Term, Document)
This method deletes all documents with the given Term in it (this would be
your primary key), and then adds the Document you want to add. This is the
traditional way to do updates, and it is fast.
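A minimal sketch of that pattern (assuming Lucene 2.x/3.x APIs; the "id" field name is hypothetical - use whatever untokenized field holds your primary key):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateByKey {
    // Deletes every document whose "id" field equals the given key,
    // then adds the new document.
    public static void update(IndexWriter writer, String id, Document doc)
            throws Exception {
        writer.updateDocument(new Term("id", id), doc);
    }
}
```

Note the new Document must itself carry the "id" field (indexed, not analyzed) so the next update can find it.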
-jake
On Mo
On Sun, Nov 15, 2009 at 11:02 PM, Uwe Schindler wrote:
> the second approach is slower, when deleted docs
> are involved and 0 is inside the range (need to consult TermDocs).
>
This is a good point (and should be mentioned in your blog, John) - for
while
custom FieldCache-like implementations (
On Fri, Nov 13, 2009 at 4:21 PM, Max Lynch wrote:
> Well already, without doing any boosting, documents matching more of the
> > terms
> > in your query will score higher. If you really want to make this effect
> > more
> > pronounced, yes, you can boost the more important query terms higher.
>
On Fri, Nov 13, 2009 at 4:02 PM, Max Lynch wrote:
> > > Now, I would like to know exactly what term was found. For example, if
> a
> > > result comes back from the query above, how do I know whether John
> Smith
> > > was
> > > found, or both John Smith and his company, or just John Smith
> > Ma
On Fri, Nov 13, 2009 at 3:35 PM, Max Lynch wrote:
> > query: "San Francisco" "California" +("John Smith" "John Smith
> > Manufacturing")
> >
> > Here the San Fran and CA clauses are optional, and the ("John Smith" OR
> > "John Smith Manufacturing") is required.
> >
>
> Thanks Jake, that works nic
Did I do that wrong? I always mess up the AND/OR human-readable form
of this - it's clearer when you use +/- unary operators instead:
query: "San Francisco" "California" +("John Smith" "John Smith
Manufacturing")
Here the San Fran and CA clauses are optional, and the ("John Smith" OR
"John Smith
Hi Max,
You want a query like
("San Francisco" OR "California") AND ("John Smith" OR "John Smith
Manufacturing")
essentially? You can give Lucene exactly this query and it will require
that
either "John Smith" or "John Smith Manufacturing" be present, but will score
results which have these
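Built programmatically rather than through the query parser, that structure looks something like this (a sketch against the Lucene 2.x/3.x BooleanQuery API; the "contents" field name and lowercased terms are assumptions about the analyzer):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class NameQuery {
    static PhraseQuery phrase(String field, String... words) {
        PhraseQuery pq = new PhraseQuery();
        for (String w : words) pq.add(new Term(field, w));
        return pq;
    }

    public static BooleanQuery build() {
        BooleanQuery q = new BooleanQuery();
        // Optional (SHOULD) clauses: raise the score when present.
        q.add(phrase("contents", "san", "francisco"), Occur.SHOULD);
        q.add(new TermQuery(new Term("contents", "california")), Occur.SHOULD);
        // Required (MUST) nested disjunction: one of the names must match.
        BooleanQuery names = new BooleanQuery();
        names.add(phrase("contents", "john", "smith"), Occur.SHOULD);
        names.add(phrase("contents", "john", "smith", "manufacturing"), Occur.SHOULD);
        q.add(names, Occur.MUST);
        return q;
    }
}
```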
>
> I understood that only the hits (50 in this) for the current search would
> be sorted...
> I'll just do the ordering afterwards. Thank you for clarifying this issue.
>
>
> --
> Nuno Seco
>
>
>
> Jake Mannix wrote:
>
>> Sorting
Sorting utilizes a FieldCache: the forward lookup - the value a document has
for a particular field (as opposed to the usual "inverted" view of all
documents which contain a given term) - which lives in memory, and takes up
roughly 4 bytes * numDocs of space for a single int or float field.
If you've indexed the en
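The arithmetic above is worth making concrete: sorting on a single int or float field costs about 4 bytes per document of FieldCache memory, regardless of how many queries you run. A quick sketch:

```java
public class FieldCacheSize {
    // FieldCache for an int/float sort field: one 4-byte entry per document.
    static long fieldCacheBytes(long numDocs) {
        return 4L * numDocs;
    }

    public static void main(String[] args) {
        // e.g. a 10 million document index needs ~40 MB per sorted field
        System.out.println(fieldCacheBytes(10_000_000L) + " bytes");
    }
}
```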
On Fri, Nov 6, 2009 at 12:25 AM, Mathias Bank wrote:
> Well, it could be a facet search if tags were available, but if you just
> want a "tag cloud" generated from full text, I don't see how a facet
> search could help to generate this cloud.
> Unfortunately, I don't have tags in my
Well, you can do it as a facet search, but in addition to doing multi-valued
faceting, you can also normalize the counts by dividing by the docFreq of
each term, so that instead of getting the most popular tags which overlap
your query, you get the tags which are more popular among documents matching
If you need faceting on top of Lucene and you're not using Solr, Bobo-browse
( http://bobo-browse.googlecode.com ) is a high-performance open source
faceting library which may suit your needs. You're asking for "all facet
values", which in bobo isn't terribly hard to get: because of the way bobo
k
Hi Michel,
I don't have time to look in too much detail right now, but I'll bet ya $5
it's because
your query is for "sector:IT" - 'IT' lowercases to 'it' which is in the
default stopword
list, and if you're not careful about how you query with this, you'll end up
with TermQuery
instances which
On Tue, Oct 27, 2009 at 6:12 PM, Erick Erickson wrote:
> Could you go into your use case a bit more? Because I'm confused.
> Why don't you want your text tokenized? You say you want to search it,
> which means you have to analyze it.
I think Will is suggesting that he doesn't want to have to ana
er there is any limit as such. And obviously whether
> such a huge index files can be searched at all.
>
> From your response it appears that 1 TB of 1 index file is too much. Is
> there any guideline to what kind of hardware will be required to handle
> (10GB, 50GB, 100GB, 500GB et
On Thu, Oct 22, 2009 at 10:29 PM, Hrishikesh Agashe <
hrishikesh_aga...@persistent.co.in> wrote:
> Can I create an index file with very large size, like 1 TB or so? Is there
> any limit on how large index file one can create? Also, will I be able to
> search on this 1 TB index file at all?
>
Leav
> * open a new reader. But the turnaround time of this
> * method should be faster since it avoids the potentially
> * costly {@link #commit}.
>
> Mike
>
> On Mon, Oct 12, 2009 at 4:35 PM, Jake Mannix
> wrote:
> > Thanks Yonik,
> >
> > It may be sur
On Mon, Oct 12, 2009 at 1:57 PM, Yonik Seeley wrote:
> On Mon, Oct 12, 2009 at 4:35 PM, Jake Mannix
> wrote:
> > It may be surprising, but in fact I have read that
> > javadoc.
>
> It was not your email I responded to.
>
Sorry, my bad then - you said "guys"
> * costly {@link #commit}.
>
> Mike
>
> On Mon, Oct 12, 2009 at 4:35 PM, Jake Mannix
> wrote:
> > Thanks Yonik,
> >
> > It may be surprising, but in fact I have read that
> > javadoc. It talks about not needing to close the
> > writer, but doesn
Thanks Yonik,
It may be surprising, but in fact I have read that
javadoc. It talks about not needing to close the
writer, but doesn't specifically talk about what
the relationship between commit() calls and
getReader() calls is. I suppose I should have
interpreted:
"@returns a new reader
Or else just make sure that you use PhraseQuery to hit this field when you
want "value1 aaa". If you don't tokenize these pairs, then you will have to
do prefix/wildcard matching to hit just "value1" by itself (if this is
allowed
by your business logic).
-jake
On Mon, Oct 12, 2009 at 1:21 PM,
On Mon, Oct 12, 2009 at 12:26 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:
> On Mon, Oct 12, 2009 at 3:17 PM, Jake Mannix
> wrote:
>
> > Wait, so according to the javadocs, the IndexReader which you got from
> > the IndexWriter forward
Wait, so according to the javadocs, the IndexReader which you got from
the IndexWriter forwards calls to reopen() back to IndexWriter.getReader(),
which means that if the user has a NRT reader, and the user keeps calling
reopen() on it, they're getting uncommitted changes as well, while if they
cal
Hey Chris,
On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz <
christoph.bo...@googlemail.com> wrote:
> Thanks for your reply.
> Yes, it's likely that many terms occur in few documents.
>
> If I understand you right, I should do the following:
> -Write a HitCollector that simply increments a coun
Hi Cedric,
I don't know of anyone with a substantial throughput production system who
is doing realtime search with the 2.9 improvements yet (and in fact, no
serious performance analysis has been done on these even "in the lab" so to
speak: follow https://issues.apache.org/jira/browse/LUCENE-157
-jake
On Sun, Oct 11, 2009 at 3:36 PM, Jake Mannix wrote:
> Hey Eric,
>
> One clarification before letting the rest of this discussion sneak over
> to the zoie list:
>
> On Sun, Oct 11, 2009 at 1:51 PM, Angel, Eric wrote:
>
> * Am I wrong to assume that the RAMDir hol
Hey Eric,
One clarification before letting the rest of this discussion sneak over to
the zoie list:
On Sun, Oct 11, 2009 at 1:51 PM, Angel, Eric wrote:
* Am I wrong to assume that the RAMDir holds the entire index - just as the
> FSDir? Or does RAMDir only hold a portion of the index that ha
d it yet but looking at it closer it looks like it's not
> something I can plug in on top of my original query. I am definitely happy
> using an approximation for the sake of performance but I do need to be able
> to have the original results stay the same.
>
> On Fri, Oct 9, 2
Hi Michael,
If you just want the top "n" hits (the way you used to use the Hits
class), just call
TopDocs topDocs = Searcher.search(query, n);
Don't worry about the Collector interface unless you actually need it.
-jake
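A sketch of that usage (Lucene 2.9/3.x API; the method returns the TopDocs so the caller can inspect hits beyond just printing them):

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class TopNSearch {
    // Replaces the old Hits-style loop: just ask for the top n directly.
    public static TopDocs topN(IndexSearcher searcher, Query query, int n)
            throws IOException {
        TopDocs topDocs = searcher.search(query, n);
        for (ScoreDoc sd : topDocs.scoreDocs) {
            // sd.doc is the internal docId; sd.score the relevance score
            System.out.println("doc=" + sd.doc + " score=" + sd.score);
        }
        return topDocs;
    }
}
```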
On Sat, Oct 10, 2009 at 1:12 PM, M R wrote:
> Hi
>
> This is the
ote:
> Hi Jake,
>
> Zoie looks like a a really cool project. I'd like to learn more about
> the distributed part of the setup. Any way you could describe that
> here or on the wiki?
>
> -Mike
>
> On Thu, Oct 8, 2009 at 9:24 PM, Jake Mannix wrote:
> > On
ccidentally hang onto references to those IndexReaders past
when needed.
-jake
On Fri, Oct 9, 2009 at 3:52 PM, scott w wrote:
> Thanks Jake! I will test this out and report back soon in case it's helpful
> to others. Definitely appreciate the help.
>
> Scott
>
> On Fri
On Fri, Oct 9, 2009 at 3:07 PM, scott w wrote:
> Example Document:
> model_1_score = 0.9
> model_2_score = 0.3
> model_3_score = 0.7
>
> I want to be able to pass in the following map at query time:
> {model_1_score=0.4, model_2_score=0.7} and have that map get used as input
> to a custom score f
ou
> have a default set of weights and you want to adjust them on the fly
> although our use case is a little different.
>
> thanks,
> Scott
>
> On Fri, Oct 9, 2009 at 10:40 AM, Jake Mannix
> wrote:
>
> > Scott,
> >
> > To reiterate what Erick and Andrzej
If you are really using all of that precision (down to the second) the
short answer is YES.
If you can remove much of that precision (only keep down to the day,
for example), then you may be able to get perfectly good performance
with strings alone when the range is only over a small set of terms,
est way to go about
> this is to post benchmarks that others may run in their
> environment which can then be tweaked for their unique edge
> cases. I wish I had more time to work on it.
>
> -J
>
> On Thu, Oct 8, 2009 at 8:18 PM, Jake Mannix wrote:
> > Jason,
> >
>
Scott,
To reiterate what Erick and Andrzej said: calling
IndexReader.document(docId)
in your inner scoring loop is the source of your performance problem -
iterating
over all these stored fields is what is killing you.
To do this a better way, can you try to explain exactly what this Scorer
On Thu, Oct 8, 2009 at 9:32 PM, Chris Were wrote:
> Zoie looks very close to what I'm after, however my whole app is written in
> Python and uses PyLucene, so there is a non-trivial amount of work to make
> things work with Zoie.
>
I've never used PyLucene before, but since it's a wrapper, plugg
On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen wrote:
> There is the Zoie system which uses the RAMDir
> solution,
>
Also, to clarify: zoie does not index into a RAMDir and then periodically
merge that
down to disk, as for one thing, this has a bad failure mode when the system
crashes,
as you
On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric wrote:
>
> Does anyone have any recommendations? I've looked at Katta, but it doesn't
> seem to support realtime searching. It also uses hdfs, which I've heard can
> be slow. I'm looking to serve 40gb of indexes and support about 1 million
> updates
Jason,
On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen wrote:
> Today near realtime search (with or without SSDs) comes at a
> price, that is reduced indexing speed due to continued in RAM
> merging. People typically hack something together where indexes
> are held in a RAMDir until being flush
taphi.de
>
>
> > -Original Message-
> > From: Jake Mannix [mailto:jake.man...@gmail.com]
> > Sent: Thursday, October 08, 2009 2:35 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: Help needed figuring out reason for maxClauseCount is set
Hi Eric,
Different Query classes have different options on whether they can score
docs out of order, or if they always proceed in order, so the way to make
sure
you're choosing the right value, if you don't know which you need, is to ask
your Query (or more appropriately, its Weight):
Query
On Wed, Oct 7, 2009 at 4:42 PM, mitu2009 wrote:
>
> Hi,
>
> I've two sets of search indexes. TestIndex (used in our test environment)
> and ProdIndex(used in PRODUCTION environment). Lucene search query:
> +date:[20090410184806 TO 20091007184806] works fine for test index but
> gives
> this error
stantly updating the index with new info, we're also reopening it very
> frequently to make the new info appear in query results. Would that
> disqualify the update method? And what do you mean by "not very
> frequently".
> Is every 5 min too much?
>
> Thanks agai
As long as you don't have to split up a fully optimized index, or
one with the wrong number of segments for the division you
want to do, that would be useful. Of course, sometimes you
need to split up the big segments into smaller ones too, but
the only way I've done that in the past is basically:
I think a Hadoop cluster is maybe a bit overkill for this kind of
thing - it's pretty common to have to do "grandfathering" of an
index when you have new features, and just doing it in place
with IndexWriter.update() can work just fine as long as you
are not very frequently reopening your index.
T
of queries a day in real time (meaning milliseconds, even under
fairly high indexing load) for the past year.
-jake mannix
Hi Klaus,
If you've really still got 500MB of changes to your index since the last
time you commit()'ed, then the call to commit() will be costly and take a
while to complete. If in another thread, you reopen() an IndexReader
pointing to that index, it will only see changes since the most recen
We started doing the same thing (pooling 1 searcher per core) at my
work when profiling showed a lot of time hitting synchronized blocks
deep inside the SegmentTermReader (I might be getting the class name wrong)
under high load, due to file read()'s using instance variables for
seeking. I could dig up the
org
> Betreff: Re: feedback: Indexing speed improvement lucene 2.2->2.3.1
>
> Hi Uwe,
>
> Could you tell what Analyzer do you use when you marked so big indexing
> speedup?
> If you use StandardAnalyzer (that uses StandardTokenizer) may be the
> reason is in it. You can see the pr
I think the way I've seen it done most often is to either index some
bi-grams which
contain stop words (so "the database" and "search the" are in the index as
individual
tokens), or else to index that piece of content twice - once with stop words
removed
(and stemming, if you use it), and then agai
u can also
> see the actual diffs that took place.
>
> Best,
> Doron
>
> On Tue, Mar 18, 2008 at 7:14 PM, Jake Mannix <[EMAIL PROTECTED]>
> wrote:
>
> > Hey folks,
> > I was wondering what the status of LUCENE-933 (stop words can cause the
> > q
Hey folks,
I was wondering what the status of LUCENE-933 (stop words can cause the
queryparser to end up with no results, due to an e.g. +(the) clause in the
resultant BooleanQuery). According to the tracking bug, it's resolved, and
there's a patch, but where has that patch been applied? I trie
Gabriel,
You can make this search much more efficient as follows: say that you have
a method
public BooleanQuery createQuery(Collection allowedUUIDs);
that works as you describe. Then you can easily create a useful reusable
filter as follows:
Filter filter = new CachingWrapperFilter(new
Q
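The snippet is cut off above, but the idea presumably continues along these lines (a sketch with Lucene 2.x/3.x classes; the "uuid" field name is an assumption, and createQuery stands in for the method described in the thread):

```java
import java.util.Collection;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;

public class UUIDFilterFactory {
    // One SHOULD clause per allowed UUID, as in the thread's createQuery.
    static BooleanQuery createQuery(String field, Collection<String> allowedUUIDs) {
        BooleanQuery q = new BooleanQuery();
        for (String uuid : allowedUUIDs) {
            q.add(new TermQuery(new Term(field, uuid)), BooleanClause.Occur.SHOULD);
        }
        return q;
    }

    // CachingWrapperFilter caches the matching-document bitset per reader,
    // so repeated searches don't re-evaluate all the UUID clauses.
    static Filter cachedFilter(String field, Collection<String> allowedUUIDs) {
        return new CachingWrapperFilter(
                new QueryWrapperFilter(createQuery(field, allowedUUIDs)));
    }
}
```

Keep BooleanQuery's maxClauseCount (1024 by default) in mind if the UUID collection can grow large.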
What the other posters are referring to is that you will have to
probably write some java code to do lucene indexing: you can get
access to your model objects (with all their dependent data) in java.
- since you are using hibernate, this should be easy - then create
lucene documents from your mode
The way I've always done this was to index two fields: say, "contents"
and "contents_unstemmed", (using a PerFieldAnalyzer) and then query
on both of them. This has the double effect of a) boosting unstemmed
hits, because every unstemmed match is also a stemmed one, so the
BooleanQuery combining
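At indexing time, that dual-field scheme can be wired up with a PerFieldAnalyzerWrapper, sketched here against pre-3.1 Lucene APIs (the field names and the SimpleAnalyzer choice are assumptions - substitute your actual stemming analyzer):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DualFieldIndexing {
    // Default analyzer stems; "contents_unstemmed" gets one that doesn't.
    static Analyzer analyzer(Analyzer stemmingAnalyzer) {
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(stemmingAnalyzer);
        wrapper.addAnalyzer("contents_unstemmed", new SimpleAnalyzer());
        return wrapper;
    }

    // Index the same text into both fields.
    static Document makeDoc(String text) {
        Document doc = new Document();
        doc.add(new Field("contents", text, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field("contents_unstemmed", text, Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }
}
```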
te:
> Damn, really? I haven't had the opportunity to test this yet. Has
> anyone else seen this kind of improvement?
>
>
>
> On Feb 3, 2008 2:57 PM, Jake Mannix <[EMAIL PROTECTED]> wrote:
> > Hello all,
> > I know you lucene devs did a lot of work on ind
ge of
multiple threads / cores? If so, I could rerun it again multithreaded and
see if that's even better...
-jake
On Feb 3, 2008 9:02 PM, ajay_garg <[EMAIL PROTECTED]>
wrote:
>
> Hi Jake.
>
> Was the test conducted with a single indexing thread, or multiple on
ood. :)
-jake
On Feb 3, 2008 2:11 PM, Michael McCandless <[EMAIL PROTECTED]>
wrote:
>
> Awesome! We are glad to hear that :)
>
> You might be able to make it even faster with the steps here:
>
> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
>
> Mi
Hello all,
I know you lucene devs did a lot of work on indexing performance in 2.3,
and I just tested it out last thursday, so I thought I'd let you know how it
fared:
On a 2.17 million document index, a recent test gave indexing time to be:
* lucene 2.2: 4.83 hours
* lucene 2.3: 26 m