Re: frequent keyword computation within a search ( and timeinterval )

2012-01-05 Thread Jason Rutherglen
> Short answer is that no, there isn't an aggregate > function. And you shouldn't even try If that is the case why does a 'stats' component exist for Solr with the SUM function built in? http://wiki.apache.org/solr/StatsComponent On Thu, Jan 5, 2012 at 1:37 PM, Erick Erickson wrote: > You will

Re: frequent keyword computation within a search ( and timeinterval )

2012-01-05 Thread Jason Rutherglen
red > SUM, stats would do it. > > Erick > > On Thu, Jan 5, 2012 at 7:23 PM, Jason Rutherglen > wrote: >>> Short answer is that no, there isn't an aggregate >>> function. And you shouldn't even try >> >> If that is the case why does a 'st

Re: RAMDirectory unexpectedly slows

2012-06-04 Thread Jason Rutherglen
If you want the index to be stored completely in RAM, there is the ByteBuffer directory [1]. I do not see the point in putting an index in RAM, though, since the OS IO cache will hold it in RAM regardless. 1. https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/ap

Re: RAMDirectory unexpectedly slows

2012-06-04 Thread Jason Rutherglen
t. Is that right? > > What about the ByteBufferDirectory? Can this specific directory utilize the > 2GB memory I grant to the app? > > On Mon, Jun 4, 2012 at 10:58 PM, Jason Rutherglen < > jason.rutherg...@gmail.com> wrote: > >> If you want the index to be stored

Looking for case studies for 'Lucene and Solr: The Definitive Guide' from O'Reilly

2012-12-17 Thread Jason Rutherglen
Cloud * Hadoop integration Thanks, Jason Rutherglen, Jack Krupansky, and Ryan Tabora http://shop.oreilly.com/product/0636920028765.do - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-ma

Monitoring low level IO

2010-06-03 Thread Jason Rutherglen
This is more of a Unix-related question than a Lucene-specific one; however, because Lucene is being used, I'm asking here as perhaps other people have run into a similar issue. On Amazon EC2, merge, read, and write operations are possibly blocking due to underlying IO. Is there a tool that you have use

Re: Last Call: Lucene Revolution CFP Closes Tomorrow Wednesday, June 23, 2010, 12 Midnight PDT

2010-06-22 Thread Jason Rutherglen
Grant, I can probably do the 3 billion document one from Prague, or a realtime search one... I spaced on submitting for ApacheCon. Are there cool places in the Carolinas to hang? Cheers bro, Jason On Tue, Jun 22, 2010 at 10:51 AM, Grant Ingersoll wrote: > Lucene Revolution Call For Particip

Recreate segment infos

2010-10-04 Thread Jason Rutherglen
Let's say the segment infos file is missing. I'm aware of CheckIndex; however, is there a tool to recreate a segment infos file?

Re: Recreate segment infos

2010-10-05 Thread Jason Rutherglen
egment is given the same name as the first segment that > shares it. However, unfortunately, because of merging, it's possible > that this mapping is not easy (maybe not possible, depending on the > merge policy...) to reconstruct. I think this'll be the hardest part > :) >

Re: API access to in-memory tii file (3.x not flex).

2010-11-10 Thread Jason Rutherglen
In a word, no. You'd need to customize the Lucene source to accomplish this. On Wed, Nov 10, 2010 at 1:02 PM, Burton-West, Tom wrote: > Hello all, > > We have an extremely large number of terms in our indexes.  I want to be able > to extract a sample of the terms, say something like every 128th

Re: API access to in-memory tii file (3.x not flex).

2010-11-10 Thread Jason Rutherglen
Yeah that's customizing the Lucene source. :) I should have gone into more detail, I will next time. On Wed, Nov 10, 2010 at 2:10 PM, Michael McCandless wrote: > Actually, the .tii file pre-flex (3.x) is nearly identical to the .tis > file, just that it only contains every 128th term. > > If you

Storing an ID alongside a document

2011-02-02 Thread Jason Rutherglen
I'm curious if there's a new way (using flex or term states) to store IDs alongside a document and retrieve the IDs of the top N results? The goal would be to minimize HD seeks, and not use field caches (because they consume too much heap space) or the doc stores (which require two seeks). One pos

Re: Storing an ID alongside a document

2011-02-02 Thread Jason Rutherglen
s branch) > > -Yonik > http://lucidimagination.com > > > On Wed, Feb 2, 2011 at 1:03 PM, Jason Rutherglen > wrote: > >> I'm curious if there's a new way (using flex or term states) to store >> IDs alongside a document and retrieve the IDs of the top N resul

Re: Storing an ID alongside a document

2011-02-03 Thread Jason Rutherglen
> there is a entire RAM resident part and a Iterator API that reads / > streams data directly from disk. > look at DocValuesEnum vs, Source Nice, thanks! On Thu, Feb 3, 2011 at 12:20 AM, Simon Willnauer wrote: > On Thu, Feb 3, 2011 at 3:23 AM, Jason Rutherglen > wrote: >>

Last/max term in Lucene 4.x

2011-02-18 Thread Jason Rutherglen
This could be a rhetorical question. The way to find the last/max term that is unique per document is to use TermsEnum to seek to the first term of a field, then call seek to the docFreq-1 for the last ord, then get the term, or is there a better/faster way?

Re: Last/max term in Lucene 4.x

2011-02-19 Thread Jason Rutherglen
that supports ord (eg FixedGap). > > Mike > > On Fri, Feb 18, 2011 at 9:24 PM, Jason Rutherglen > wrote: >> This could be a rhetorical question.  The way to find the last/max >> term that is a unique per document is to use TermsEnum to seek to the >> first term of a

Re: Last/max term in Lucene 4.x

2011-02-20 Thread Jason Rutherglen
rd. How would I seek to the last term in the index using VarGaps? Or do I need to interact directly with the FST class (and if so I'm not sure what to do there either). Thanks Mike. On Sun, Feb 20, 2011 at 2:51 PM, Michael McCandless wrote: > On Sat, Feb 19, 2011 at 8:42 AM, Jason Rutherg

Re: Last/max term in Lucene 4.x

2011-02-21 Thread Jason Rutherglen
ordered IDs stored in the index, so that remaining documents (that, let's say, were left in RAM prior to process termination) can be indexed. It's an inferred transaction checkpoint. On Mon, Feb 21, 2011 at 5:31 AM, Michael McCandless wrote: > On Sun, Feb 20, 2011 at 8:47 PM, Jason Rutherglen

Is ConcurrentMergeScheduler useful for multiple running IndexWriter's?

2011-03-04 Thread Jason Rutherglen
ConcurrentMergeScheduler is tied to a specific IndexWriter, however if we're running in an environment (such as Solr's multiple cores, and other similar scenarios) then we'd have a CMS per IW. I think this effectively disables CMS's max thread merge throttling feature?

Append Codec random testing

2011-03-21 Thread Jason Rutherglen
I'm seeing an error when using the misc Append codec. java.lang.AssertionError at org.apache.lucene.store.ByteArrayDataInput.readBytes(ByteArrayDataInput.java:107) at org.apache.lucene.index.codecs.BlockTermsReader$FieldReader$SegmentTermsEnum._next(BlockTermsReader.java:661) at org.apache.luce

Re: DocIdSet to represent small number of hits in large Document set

2011-04-05 Thread Jason Rutherglen
I think Solr has a HashDocSet implementation? On Tue, Apr 5, 2011 at 3:19 AM, Michael McCandless wrote: > Can we simply factor out (poach!) those useful-sounding classes from > Nutch into Lucene? > > Mike > > http://blog.mikemccandless.com > > On Tue, Apr 5, 2011 at 2:24 AM, Antony Bowesman > w

Lucene Util question

2011-04-08 Thread Jason Rutherglen
Is http://code.google.com/a/apache-extras.org/p/luceneutil/ designed to replace or augment the contrib benchmark? For example it looks like SearchPerfTest would be useful for executing queries over a pre-built index. Though there's no indexing tool in the code tree?

Re: Index size and performance degradation

2011-06-13 Thread Jason Rutherglen
> I don't think we'd do the post-filtering solution, but instead maybe > resolve the deletes "live" and store them in a transactional data I think Michael B. aptly described the sequence ID approach for 'live' deletes? On Mon, Jun 13, 2011 at 3:00 PM, Michael McCandless wrote: > Yes, adding dele

Re: Index size and performance degradation

2011-06-13 Thread Jason Rutherglen
> deletions made by readers merely mark it for > deletion, and once a doc has been marked for deletions it is deleted for all > intents and purposes, right? There's the point-in-timeness of a reader to consider. > Does the N in NRT represent only the cost of reopening a searcher? Aptly put, and

Re: ElasticSearch

2011-11-16 Thread Jason Rutherglen
> even high complexity as ES supports lucene-like query nesting via JSON That sounds interesting. Where is it described in the ES docs? Thanks. On Wed, Nov 16, 2011 at 1:36 PM, Peter Karich wrote: >  Hi, > > its not really fair to compare NRT of Solr to ElasticSearch. > ElasticSearch provides

Re: ElasticSearch

2011-11-16 Thread Jason Rutherglen
The docs are slim on examples. On Wed, Nov 16, 2011 at 3:35 PM, Peter Karich wrote: > >>> even high complexity as ES supports lucene-like query nesting via JSON >> That sounds interesting.  Where is it described in the ES docs?  Thanks. > > "Think of the Query DSL as an AST of queries" > http://w

BigInteger usage in numeric Trie range queries

2011-11-28 Thread Jason Rutherglen
Even though the NumericRangeQuery.new* methods do not support BigInteger, the underlying recursive algorithm supports any sized number. Has this been explored?

Re: Replicating Lucene Index with out SOLR

2008-08-28 Thread Jason Rutherglen
te: > > Yes, I think you pinpointed what I see over and over with Solr. The two > desires pull in opposite directions. I think Jason Rutherglen is very keen > to start talking about Lucene clusters and index replication in such clusters > without using the classic master/slave appr

Realtime Search for Social Networks Collaboration

2008-09-03 Thread Jason Rutherglen
Hello all, I don't mean this to sound like a solicitation. I've been working on realtime search and created some Lucene patches etc. I am wondering if there are social networks (or anyone else) out there who would be interested in collaborating with Apache on realtime search to get it to the poi

Re: Realtime Search for Social Networks Collaboration

2008-09-03 Thread Jason Rutherglen
for social networks interested in realtime search to get involved as it may be something that is difficult for one company to have enough resources to implement to a production level. I think this is where open source collaboration is particularly useful. Cheers, Jason Rutherglen [EMAIL PROTECTED] On W

Re: Realtime Search for Social Networks Collaboration

2008-09-04 Thread Jason Rutherglen
ections. and before a > indexwrite/delete i would sync the cache with index. > > I am waiting for lucene 2.4 to proceed. (query by delete) > > Best. > > On Wed, Sep 3, 2008 at 10:20 PM, Jason Rutherglen < > [EMAIL PROTECTED]> wrote: > >> Hello all, >> >

Re: How can we know if 2 lucene indexes are same?

2008-09-05 Thread Jason Rutherglen
In Ocean I had to use a transaction log and execute everything that way, like SQL database replication, then let each node handle its own merging process. Syncing the indexes is used to get a new node up to speed, otherwise it's avoided for the reasons mentioned in the previous email. On Fri, Se

Re: Incremental Indexing.

2008-09-08 Thread Jason Rutherglen
Hi Jang, I've been working on Tag Index to address this issue. It seems like a popular feature and I have not had time to fully implement it yet. http://issues.apache.org/jira/browse/LUCENE-1292 To be technical it handles UN_TOKENIZED fields (did this name change now?) and some specialized thing

Re: Incremental Indexing.

2008-09-09 Thread Jason Rutherglen
Hi Jang, Yes, and I have not completed it either... Perhaps when I do you can use it. Best regards, Jason On Tue, Sep 9, 2008 at 9:20 PM, 장용석 <[EMAIL PROTECTED]> wrote: > Thanks for your helps. > I have about 40 documents in my index and it is constant update (price > or name.. etc). > I wil

Re: Frequently updated fields

2008-09-12 Thread Jason Rutherglen
Yes Tag Index will work. I have not had time to complete it however if you are interested in working on it please feel free to contact me. On Fri, Sep 12, 2008 at 3:48 PM, Mark Miller <[EMAIL PROTECTED]> wrote: > You might check out the tagindex issue in jira as well. Havn't looked at it > myself

Re: Frequently updated fields

2008-09-14 Thread Jason Rutherglen
It would be good to allow users to use their own Filter subclasses in SOLR. This will help with RMI based implementations that use SOLR, and will allow all of the open source Filter work to be used in SOLR, without needing to recreate it with DocSets. 2008/9/14 Gerardo Segura <[EMAIL PROTECTED]>:

Re: patching lucene-1314

2008-09-15 Thread Jason Rutherglen
I am updating it to work with trunk. On Mon, Sep 15, 2008 at 2:11 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > Yes, probably out of sync with the 2.3.2 code. Have you tried applying it to > the trunk? > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > - Or

Re: Re[2]: Frequently updated fields

2008-09-16 Thread Jason Rutherglen
Hi Wojciech, The code isn't ready, it is a major project and I am trying to also complete the realtime indexing patches and look for a job. I believe that the tag indexing stuff is of interest to many people so if there is someone who can pay to get it completed feel free to contact me as I am av

Re: Re[4]: Frequently updated fields

2008-09-17 Thread Jason Rutherglen
Hi Wojciech, Integration with SOLR would be ideal. However that would take more time. It depends on the exact features. There is at least one patch to IndexWriter. The merging is the part that needs to be synchronized and this is where I am hesitant because Ocean/realtime search performs merge

Re: Sorting posting lists before intersection

2008-09-17 Thread Jason Rutherglen
It would be a good feature in Lucene to be able to sort, or perhaps store the postings in term frequency sorted order. Thoughts? On Wed, Sep 17, 2008 at 9:33 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > Renaud Delbru wrote: >> >> Hi all, >> >> I am wondering if Lucene implements the query op

Re: How to restore corrupted index

2008-09-26 Thread Jason Rutherglen
Mike, As part of my goal of trying to use Lucene as primary storage mechanism (perhaps not the best idea), what do you think is the best way to handle storing data in Lucene and preventing corrupted data the way something like an SQL database handles corrupted data? Or is there simply no good way

Re: How to restore corrupted index

2008-09-26 Thread Jason Rutherglen
2008 at 12:13 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > > Corrupted data in what sense? > > EG if you don't trust your IO system to store data properly? > > Mike > > Jason Rutherglen wrote: > >> Mike, >> >> As part of my goal of tryi

Re: triplet store

2008-09-29 Thread Jason Rutherglen
What is that? On Mon, Sep 29, 2008 at 8:51 AM, Cam Bazz <[EMAIL PROTECTED]> wrote: > Has anyone tried to implement a triplet store with lucene? > > Best, > -C.B. > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional c

serialVersionUID issue between 2.3 and 2.4

2008-12-01 Thread Jason Rutherglen
Seeing the following issue between Lucene 2.3 and 2.4. A 2.3 serialized Term object cannot be deserialized by 2.4. I would guess it has something to do with a different Java compiler being used for the Lucene 2.4 build as serialVersionUID is not defined in the Term class. Fixing the issue is crit

Re: Which is faster/better

2008-12-01 Thread Jason Rutherglen
It would be nice to have a pluggable solution for deleteddocs in IndexReader that accepts a Filter, and have BitVector implement Filter. This way I do not have to implement IndexReader.clone. On Mon, Dec 1, 2008 at 5:04 PM, Michael McCandless < [EMAIL PROTECTED]> wrote: > > So in your UI, you'd

Re: serialVersionUID issue between 2.3 and 2.4

2008-12-01 Thread Jason Rutherglen
ssign one ourselves, and then we have to remember to change > it if we ever make a big enough change to Term, to allow serialize in > one version of Lucene & deserialize in another. > > Mike > > > Jason Rutherglen wrote: > > Seeing the following issue between Lucene 2.

Re: serialVersionUID issue between 2.3 and 2.4

2008-12-01 Thread Jason Rutherglen
< [EMAIL PROTECTED]> wrote: > > Jason Rutherglen wrote: > > if you don't set serialVersionUID yourself, then java assigns a >>> >> rather volatile one for you >> >> True however the Java specification defines how the serialVersionUID >> shoul

Re: serialVersionUID issue between 2.3 and 2.4

2008-12-02 Thread Jason Rutherglen
I prefer Externalizable as well as it makes Serialization faster. Perhaps also for Query and its subclasses to start? I had code to do this for Analyzer as well which could be useful, perhaps a different patch though. On Tue, Dec 2, 2008 at 2:22 AM, Michael McCandless < [EMAIL PROTECTED]> wrote

Re: Suggestions for drill downs

2008-12-04 Thread Jason Rutherglen
The field cache is completely reloaded. LUCENE-831 solves this by merging the field caches of the segments. For realtime search systems, merging the field caches is not desirable though. On Thu, Dec 4, 2008 at 6:45 PM, John Wang <[EMAIL PROTECTED]> wrote: > Glad to be of help. > Understand that

Re: Issue upgrading from lucene 2.3.2 to 2.4 (moving from bitset to docidset)

2008-12-10 Thread Jason Rutherglen
Hi M.S., Do you think it would be cool to have some faceting built into Lucene at some point? -J On Tue, Dec 9, 2008 at 10:11 PM, Michael Stoppelman <[EMAIL PROTECTED]>wrote: > Yeah looks similar to what we've implemented for ourselves (although I > haven't looked at the implementation). We've

FastSSFuzzy for faster fuzzy queries in Lucene

2009-01-05 Thread Jason Rutherglen
Hello, I'm interested in getting FastSSFuzzy into Lucene, perhaps as a contrib module. One question is how much would the index grow? We've got a list of people's names we want to do spellchecking on for example. -J

contrib Benchmark enwiki problem

2009-01-21 Thread Jason Rutherglen
I downloaded trunk via SVN. Went to trunk/contrib/benchmark. Executed ant enwiki. I'm not sure what else needs to be done. Received this error: enwiki: [echo] Working Directory: /Users/jrutherg/dev/lucenetrunk/trunk/contrib/benchmark/work [java] Running algorithm from: /Users/jrutherg

Re: contrib Benchmark enwiki problem

2009-01-21 Thread Jason Rutherglen
1, 2009 at 5:03 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > > You should download Wikipedia's XML file manually yourself, uncompress it, > and then edit docs.file in that alg to point to it. > > Mike > > > Jason Rutherglen wrote: > >

Re: contrib Benchmark enwiki problem

2009-01-23 Thread Jason Rutherglen
pendix. >> >> But you are close... oh, it actually looks like the output file >> (work/enwiki.txt) could not be written. Does that directory (../work) >> exist? (I think build.xml should have created it). >> >> Mike >> >> Jason Rutherglen

Re: Poor QPS with highlighting

2009-02-05 Thread Jason Rutherglen
Google uses dedicated highlighting servers. Maybe this architecture would work for you. On Mon, Feb 2, 2009 at 11:24 PM, Michael Stoppelman wrote: > Hi all, > > My search backends are only able to eek out 13-15 qps even with the entire > index in memory (this makes it very expensive to scale). A

Re: Poor QPS with highlighting

2009-02-05 Thread Jason Rutherglen
http://en.wikipedia.org/wiki/Google_platform Document server summarization On Thu, Feb 5, 2009 at 12:57 PM, Michael Stoppelman wrote: > On Thu, Feb 5, 2009 at 12:47 PM, Michael Stoppelman >wrote: > > > > > > > On Thu, Feb 5, 2009 at 9:05 AM, Jason Rutherglen <

Assertion Error in TermsHashPerField.comparePostings - Lucene 2.4

2009-03-24 Thread Jason Rutherglen
While indexing using contrib/org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker. The assertion error is from TermsHashPerField.comparePostings(RawPostingList p1, RawPostingList p2). A Payload is added to the document representing a UID. Only 1-2 out of 1 million documents indexed generates th

MergePolicy public but SegmentInfos package protected?

2009-03-24 Thread Jason Rutherglen
I'm overriding MergePolicy which is public, however SegmentInfos is package protected which means the MergePolicy subclass must be in the org.apache.lucene.index package. Can we make SegmentInfos public?

Re: Assertion Error in TermsHashPerField.comparePostings - Lucene 2.4

2009-03-24 Thread Jason Rutherglen
> > H. > > > > Jason is this easily/compactly repeated? EG, try to index the N docs > > before that one. > > > > If you remove the SinglePayloadTokenStream field, does the exception > > still happen? > > > > Mike > > > > Jas

Re: Assertion Error in TermsHashPerField.comparePostings - Lucene 2.4

2009-03-24 Thread Jason Rutherglen
12:25 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > H. > > Jason is this easily/compactly repeated? EG, try to index the N docs > before that one. > > If you remove the SinglePayloadTokenStream field, does the exception > still happen? > > Mike

Re: Assertion Error in TermsHashPerField.comparePostings - Lucene 2.4

2009-03-25 Thread Jason Rutherglen
It looks like you are reusing a Field (the f.setValue(...) calls); are > you sure you're not changing a Document/Field while another thread is > adding it to the index? > > If you can post the full code, then I can try to run it on my > wikipedia dump locally. > > Mi

Re: Assertion Error in TermsHashPerField.comparePostings - Lucene 2.4

2009-03-25 Thread Jason Rutherglen
LuceneError when executed should reproduce the failure. The contrib/benchmark libraries are required. MultiThreadDocAdd is a multithreaded indexing utility class. On Wed, Mar 25, 2009 at 1:06 PM, Jason Rutherglen < jason.rutherg...@gmail.com> wrote: > Each document is being created in

Re: Assertion Error in TermsHashPerField.comparePostings - Lucene 2.4

2009-03-26 Thread Jason Rutherglen
e segments with enough deletes need to be merged away in 1-2 hours. Meaning optimizing may not be best as it requires later large merges. Also an interleaving system that does not perform merges if a flush is occurring could be useful for minimizing disk thrash. On Wed, Mar 25, 2009 at 3:39 PM, J

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread Jason Rutherglen
John, We looked at implementing delete by doc id for LUCENE-1516, however it seemed to be something that if enough people wanted we could implement it as a later patch. The implementation involves maintaining a genealogy of SegmentReaders within IndexWriter so that deletes to a reader that has

Re: Getting an IndexReader from a committed IndexWriter

2009-05-14 Thread Jason Rutherglen
Hi Shay, I think IndexWriter.getReader from LUCENE-1516 in trunk is what you're talking about? It pools readers internally so there's no need to call IndexReader.reopen, one simply calls IW.getReader to get new readers containing recent updates. -J BTW I replied to the message on java-u...@lucen

Re: is there a way to control when merges happen?

2009-05-15 Thread Jason Rutherglen
Hi Dan, You are looking to throttle the merging? I'd recommend setting ConcurrentMergeScheduler.setMaxThreadCount(1). This way IW.addDocument doesn't wait while a merge occurs (like SerialMergeScheduler) however it should not use as much CPU as only one merge will occur at a time. In regards to

Bay Area Lucene Group?

2009-05-19 Thread Jason Rutherglen
On the topic of user groups, is there a Bay Area Lucene users group?

Re: Lucene memory usage

2009-06-10 Thread Jason Rutherglen
> LUCENE-1458 (flexible indexing) has these improvements, Mike, can you explain how it's different? I looked through the code once but yeah, it's in with a lot of other changes. On Wed, Jun 10, 2009 at 5:40 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > This (very large number of

Re: Lucene memory usage

2009-06-10 Thread Jason Rutherglen
d >terms, and is slurped into the arrays on init. > > This is a sizable RAM savings over what's done now because you save 2 > objects, 3 pointers, 2 longs, 2 ints (I think), per indexed term. > > Mike > > On Wed, Jun 10, 2009 at 2:02 PM, Jason > Rutherglen wrote:

Re: caching an indexreader

2009-06-19 Thread Jason Rutherglen
> As I understand it, the user won't see any changes to the index until a new Searcher is created. Correct. > How much memory will caching the searcher cost? Are there other tradeoff's I need to consider? If you're updating the index frequently (every N seconds) and the searcher/reader is closed

Re: caching an indexreader

2009-06-19 Thread Jason Rutherglen
On the topic of RAM consumption, it seems like field caches could return estimated RAM usage (given they're arrays of standard Java types)? There are methods of calculating per platform (I believe relatively accurately). On Fri, Jun 19, 2009 at 12:11 PM, Michael McCandless < luc...@mikemccandless.co

Re: Delete by docId in IndexWriter

2009-06-28 Thread Jason Rutherglen
This requires tracking the genealogy of docids as they are merged inside IndexWriter. It's doable, so if you're particularly interested feel free to open a jira issue. On Sun, Jun 28, 2009 at 2:21 AM, Shay Banon wrote: > > Hi, > > I have a case where deleting documents by doc id make sense (I

Re: Optimizing unordered queries

2009-07-07 Thread Jason Rutherglen
Ah ok, I was thinking we'd wait for the new flex indexing patch. I had started working along these lines before and will take it on as a project (which is I believe reducing the memory consumption of the term dictionary). I plan to segue it into the tag index at some point. On Tue, Jul 7, 2009 at

Anyone used org.apache.lucene.analysis.compound.hyphenation.TernaryTree?

2009-07-14 Thread Jason Rutherglen
Just wondering if it works and if it's a good fit for autosuggest?

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread Jason Rutherglen
Do we think that we'll be able to support indexing stop words using PFOR (with relaxation on the compression to gain performance?) Today it seems like the best approach to indexing stop words is to use shingles? However this blows up the term dict because shingles concatenates phrases together. On

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread Jason Rutherglen
be honest, I do not know is anyone today runs high volume search from disk > (maybe SSD), even than, significant portion has to be in RAM... > > One day we could throw many CPUs at Query... but this is not an easy one... > > > > > > - Original Message >> F

New more affordable and performant Intel SSDs

2009-07-22 Thread Jason Rutherglen
http://arstechnica.com/hardware/news/2009/07/intels-new-34nm-ssds-cut-prices-by-60-percent-boost-speed.ars For me the price on the 80GB is now within reason for a $1300 SuperMicro quad-core 12GB RAM type of server.

Complexity of PhraseQuery slop?

2009-08-12 Thread Jason Rutherglen
In trying to calculate the cost of various slop settings for phrase queries, what's the time complexity? O(n) or O(n^2)?

Re: Bizarre indexing issue where thousands of files get created

2009-08-18 Thread Jason Rutherglen
Micah, If you can post some of your code, it may be easier to identify the problem you're experiencing. -J On Tue, Aug 18, 2009 at 9:55 AM, Micah Jaffe wrote: > Hi, thanks for the response!  The (custom) searchers that are falling out of > cache are indeed calling close on their IndexReader in f

Re: Lucene SORT does a sort on entire index..how do I filter SORT?

2009-08-21 Thread Jason Rutherglen
Take a look at contrib/spatial. On Fri, Aug 21, 2009 at 7:00 AM, javaguy44 wrote: > > Hi, > > I'm currently looking at sorting in lucene, and to get started I took a look > at the distance sorting example from the Lucene in Action book. > > Working through the test DistanceSortingTest, I've notice

Re: Lucene SORT does a sort on entire index..how do I filter SORT?

2009-08-21 Thread Jason Rutherglen
even hits. > > Is there no way to limit the sorting to only the documents that were found > in the query? > > Thanks > > > > Jason Rutherglen-2 wrote: >> >> Take a look at contrib/spatial. >> >> On Fri, Aug 21, 2009 at 7:00 AM, javaguy44 wrot

Re: Is there a way to check for field "uniqueness" when indexing?

2009-08-26 Thread Jason Rutherglen
Daniel, You may want to look at SOLR-1375 which enables ID checking using a BloomFilter (with a specified error rate of false positives). Otherwise for what you're trying to do, you'd need to create a hash map? -J On Thu, Aug 13, 2009 at 7:33 AM, Daniel Shane wrote: > Hi all! > > I'm currently ru

JVM bug?

2009-08-28 Thread Jason Rutherglen
While indexing with the latest nightly build of Solr on Amazon EC2 the following JVM bug has occurred twice on two different servers. Post the log to a Jira issue? java version "1.6.0_07" Java(TM) SE Runtime Environment (build 1.6.0_07-b06) Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed

Re: JVM bug?

2009-08-28 Thread Jason Rutherglen
> - Mark > > http://www.lucidimagination.com > > > > Jason Rutherglen wrote: >> While indexing with the latest nightly build of Solr on Amazon EC2 the >> following JVM bug has occurred twice on two different servers. >> >> Post the log to a Jira issue? >>

Re: Extending Sort/FieldCache

2009-09-10 Thread Jason Rutherglen
I think CSF hasn't been implemented because it's only marginally useful yet requires fairly significant rewrites of core code (i.e. SegmentMerger) so no one's picked it up including myself. An interim solution that fulfills the same function (quickly loading field cache values) using what works rel

Index docstore flush problem

2009-09-10 Thread Jason Rutherglen
I'm seeing a strange exception when indexing using the latest Solr rev on EC2. org.apache.solr.client.solrj.SolrServerException: org.apache.solr.client.solrj.SolrServerException: java.lang.RuntimeException: after flush: fdx size mismatch: 468 docs vs 298404 length in bytes of _0.fdx at or

Re: Index docstore flush problem

2009-09-10 Thread Jason Rutherglen
he fdx file > size is 3748 (= 4 + 468*8), yet the file size is far larger than that > (298404). > > How repeatable is it?  Can you turn on infoStream, get the exception > to happen, then post the resulting output? > > Mike > > On Thu, Sep 10, 2009 at 7:19 PM, Jason Ruther

Re: Concurrent Indexing and Searching

2009-09-25 Thread Jason Rutherglen
It depends on whether or not the commit completes before the reopen. Lucene 2.9 adds an IndexWriter.getReader method that will always return with the latest modifications to your index. So if you're adding many documents, you can at anytime, call IW.getReader and you will be able to search the cha

Re: Efficiently reopening remotely-distributed indexes in 2.9?

2009-10-05 Thread Jason Rutherglen
I'm not sure I understand the question. You're trying to reopen the segments that you've replicated and you're wondering what's changed in Lucene? On Mon, Oct 5, 2009 at 5:30 PM, Nigel wrote: > Anyone have any ideas here?  I imagine a lot of other people will have a > similar question when trying

Re: How to setup a scalable deployment?

2009-10-06 Thread Jason Rutherglen
Chris, It sounds like you're on the right track. Have you looked at Solr which uses the rsync/Java replication method you mentioned? Replication and near realtime in Solr aren't quite there yet, however it wouldn't be too hard to add it. -J On Tue, Oct 6, 2009 at 3:57 PM, Chris Were wrote: > Hi

Re: Best strategy for reindexing large amount of data

2009-10-07 Thread Jason Rutherglen
Maarten, Depending on the hardware available you can use a Hadoop cluster to reindex more quickly. With Amazon EC2 one can spin up several nodes, reindex, then tear them down when they're no longer needed. Also you can simply update in place the existing documents in the index, though you'd need t

Index splitter

2009-10-07 Thread Jason Rutherglen
We have a way to merge indexes together with IW.addIndexes, however not the opposite: splitting up an index with multiple segments. I think I can simply manufacture a new segmentinfos in a new directory, copy over the segments files from those segments, delete the copied segments from the source, and

Re: Reverse stemmer?

2009-10-08 Thread Jason Rutherglen
Out of curiosity and perhaps for practical purposes, how does one handle mixed language documents? I suppose one could extract the words of a particular language and place them in a lang specific field? Are there libraries to perform this (yet)? On Thu, Oct 8, 2009 at 6:32 AM, Christian Reuschling

Re: Realtime & distributed

2009-10-08 Thread Jason Rutherglen
Eric, Katta doesn't require HDFS which would be slow to search on, though Katta can be used to copy indexes out of HDFS onto local servers. The best bet is hardware that uses SSDs because merges and update latency will greatly decrease and there won't be a synchronous IO issue as there is with har

Re: Realtime & distributed

2009-10-09 Thread Jason Rutherglen
on it. -J On Thu, Oct 8, 2009 at 8:18 PM, Jake Mannix wrote: > Jason, > > On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen > wrote: > >> Today near realtime search (with or without SSDs) comes at a >> price, that is reduced indexing speed due to continued in RAM >

Re: Realtime & distributed

2009-10-09 Thread Jason Rutherglen
variety of configurations. The best way to go about >> this is to post benchmarks that others may run in their >> environment which can then be tweaked for their unique edge >> cases. I wish I had more time to work on it. >> >> -J >> >> On Thu, Oct 8, 2009

Re: Realtime & distributed

2009-10-10 Thread Jason Rutherglen
ust plain > disappointing.* > >        Thanks Jake for the clarification, and Eric, let me know if you to > know more in detail with how we are dealing with realtime indexing/search > with Zoie here at linkedin in a production environment powering a real > internet company with real

Re: Realtime search best practices

2009-10-12 Thread Jason Rutherglen
Hi Cedric, There is a wiki page on NRT at: http://wiki.apache.org/lucene-java/NearRealtimeSearch Feel free to ask questions if there's not enough information. -J On Mon, Oct 12, 2009 at 2:24 AM, melix wrote: > > Hi, > > I'm going to replace an old reader/writer synchronization mechanism we had

Re: IndexWriter.close() no longer seems to close everything

2009-11-12 Thread Jason Rutherglen
If there's a bug you're seeing, it's helpful to open an issue and post code reproducing it. On Wed, Nov 11, 2009 at 3:41 AM, Albert Juhe wrote: > > I think that this is the best way to proceed. > > thank you Mike > > > > Michael McCandless-2 wrote: >> >> Can you narrow the leak down to a small se

Verbose logging via ant, get an OOM

2009-11-12 Thread Jason Rutherglen
Is there a setting to fix this? [junit] Exception in thread "main" java.lang.OutOfMemoryError: Java heap space [junit] at java.util.Arrays.copyOf(Arrays.java:2882) [junit] at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100) [junit] at java.lang
