2012-03-07 Thread Phillip Farber
unsubscribe

Re: solr optimize - no space left on device

2009-10-09 Thread Phillip Farber
Thanks Hoss. Yes, in a separate thread on the list I reported that doing a multi-stage optimize worked around the out of space problem. We use mergefactor=10 and optimize iteratively with maxSegments = 16, 8, 4, 2, 1, starting at the closest power of two below the number of segments to merge. Works nicely
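
The multi-stage optimize described above could be scripted roughly as follows. This is only a sketch: the helper names are made up for illustration, and "closest power of two below" is assumed to mean strictly below the current segment count. It only builds the update messages; sending them to /solr/update is left out.

```python
def optimize_stages(segment_count):
    """Return successive maxSegments targets, halving down to 1.

    Assumption: the first target is the largest power of two strictly
    below segment_count.
    """
    if segment_count < 2:
        return []  # nothing to merge
    target = 1
    while target * 2 < segment_count:
        target *= 2
    stages = []
    while target >= 1:
        stages.append(target)
        target //= 2
    return stages


def optimize_message(max_segments):
    """Build the XML update message for one optimize stage."""
    return '<optimize maxSegments="%d"/>' % max_segments


if __name__ == "__main__":
    # Hypothetical shard with 20 segments before optimizing.
    for n in optimize_stages(20):
        print(optimize_message(n))
```

Each message would be POSTed in turn, waiting for one stage to finish before starting the next, so that no single merge has to rewrite the whole index at once.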

Re: solr optimize - no space left on device

2009-10-07 Thread Phillip Farber
:28 PM, Phillip Farber pfar...@umich.edu wrote: I am attempting to optimize a large shard on solr 1.4 and repeatedly get java.io.IOException: No space left on device. The shard, after a final commit before optimize, shows a size of about 192GB on a 400GB volume. I have successfully optimized 2

Re: How much disk space does optimize really take

2009-10-07 Thread Phillip Farber
Yonik Seeley wrote: Does this mean that there's always a lucene IndexReader holding segment files open so they can't be deleted during an optimize so we run out of disk space 2x? Yes. A feature could probably now be developed that avoids opening a reader until it's requested. That

Re: How much disk space does optimize really take

2009-10-07 Thread Phillip Farber
2x. And after the optimize fails, if we then do a commit or bounce tomcat, a bunch of segments disappear. I am stumped. Yonik Seeley wrote: On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber pfar...@umich.edu wrote: So this implies that for a normal optimize, in every case, due to the Searcher

solr optimize - no space left on device

2009-10-06 Thread Phillip Farber
I am attempting to optimize a large shard on solr 1.4 and repeatedly get java.io.IOException: No space left on device. The shard, after a final commit before optimize, shows a size of about 192GB on a 400GB volume. I have successfully optimized 2 other shards that were similarly large without

Re: best way to get the size of an index

2009-10-02 Thread Phillip Farber
Thanks, Mark. I really appreciate your confirmation. Phil Mark Miller wrote: Phillip Farber wrote: Resuming this discussion in a new thread to focus only on this question: What is the best way to get the size of an index so it does not get too big to be optimized (or to allow a very large

index size before and after commit

2009-10-01 Thread Phillip Farber
I am trying to automate a build process that adds documents to 10 shards over 5 machines and need to limit the size of a shard to no more than 200GB because I only have 400GB of disk available to optimize a given shard. Why does the size (du) of an index typically decrease after a commit?
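
A build script could check a shard against the disk budget before adding more documents. The sketch below is an assumption-laden illustration: the function names are invented, the size is summed from file lengths (so it approximates rather than reproduces du), and the 2x overhead factor is simply the rule of thumb used in this thread, exposed as a parameter because the real requirement may be higher.

```python
import os

GB = 1024 ** 3


def index_size_bytes(index_dir):
    """Sum the sizes of all files under index_dir (roughly what du reports)."""
    total = 0
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def fits_optimize_budget(index_bytes, volume_bytes, overhead_factor=2.0):
    """True if the index times the assumed optimize overhead fits on the volume."""
    return index_bytes * overhead_factor <= volume_bytes


if __name__ == "__main__":
    # 200GB shard on a 400GB volume, as in the message above.
    print(fits_optimize_budget(200 * GB, 400 * GB))
```

Measuring after a commit rather than before is probably the safer input to such a check, since the size reported by du changes across a commit.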

mergefactor=1 questions

2009-09-30 Thread Phillip Farber
In order to make maximal use of our storage by avoiding the dead 2x overhead needed to optimize the index we are considering setting mergefactor=1 and living with the slow indexing performance which is not a problem in our use case. Some questions: 1) Does mergefactor=1 mean that the size

Re: Writing optimized index to different storage?

2009-09-30 Thread Phillip Farber
Sorry, I should have given more background. We have, at the moment, 3.8 million documents of 0.7MB/doc average, so we have extremely large shards. We build about 400,000 documents to a shard, resulting in 200GB/shard. We are also using LVM snapshots to manage a snapshot of the shard which we serve

Re: Writing optimized index to different storage?

2009-09-28 Thread Phillip Farber
files via a FileSwitchDirectory like implementation that knows which new files are optimized and should underneath go to a different physical path. On Mon, Sep 28, 2009 at 7:57 AM, Phillip Farber wrote: Is it possible to tell Solr or Lucene, when optimizing, to write the files that constitute

com.ctc.wstx.exc.WstxUnexpectedCharException error

2009-08-25 Thread Phillip Farber
I have a valid xml document that begins: <add><doc><field name="id">mdp.39015052775379</field> <field name="rights">2</field> <field name="title">Technology transfer and in-house R&amp;D in Indian industry : in the later 1990s / edited and with an introduction by Binay Kumar Pattnaik. v.1</field> <field

Multi-shard query with error on one shard

2009-08-20 Thread Phillip Farber
What will the client receive from the primary solr instance if that instance doesn't get HTTP 200 from all the shards in a multi-shard query? Thanks, Phil

Is there a multi-shard optimize message?

2009-07-28 Thread Phillip Farber
Normally to optimize an index you POST <optimize/> to /solr/update. Is there any way to POST an optimize message to one instance and have it propagate to all shards, sort of like the select? /solr-shard-1/select?q=dog...&shards=shard-1,shard2 Thanks, Phil
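
Whether or not an optimize can be propagated from one instance, a client can always send the message to each shard itself. A minimal sketch of that fallback, assuming each shard exposes its own /update handler; the host names in the usage example are placeholders, and the requests are only built here, not sent.

```python
import urllib.request


def build_optimize_requests(shard_base_urls):
    """Build one POST request per shard carrying the <optimize/> message."""
    body = b"<optimize/>"
    requests = []
    for base in shard_base_urls:
        req = urllib.request.Request(
            base.rstrip("/") + "/update",
            data=body,
            headers={"Content-Type": "text/xml; charset=utf-8"},
            method="POST",
        )
        requests.append(req)
    return requests


if __name__ == "__main__":
    # Hypothetical shard locations.
    shards = ["http://host1:8080/solr-shard-1", "http://host2:8080/solr-shard-2"]
    for req in build_optimize_requests(shards):
        print(req.get_method(), req.full_url)
        # To actually send: urllib.request.urlopen(req), with a long timeout.
```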

Rotating the primary shard in /solr/select

2009-07-28 Thread Phillip Farber
Is there any value in a round-robin scheme to cycle through the Solr instances supporting a multi-shard index over several machines when sending queries or is it better to just pick one instance and stick with it. I'm assuming all machines in the cluster have the same hardware specs. So
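
For reference, the round-robin scheme being asked about could look like the sketch below. It does not settle whether rotating is better than sticking with one instance; it only shows the mechanics. The instance URLs and shard names are placeholders.

```python
import itertools
from urllib.parse import urlencode


def round_robin_select_urls(instances, shards, query):
    """Yield select URLs, rotating the instance that coordinates the query."""
    params = urlencode({"q": query, "shards": ",".join(shards)})
    for instance in itertools.cycle(instances):
        yield "%s/select?%s" % (instance.rstrip("/"), params)


if __name__ == "__main__":
    instances = ["http://host1:8080/solr-shard-1", "http://host2:8080/solr-shard-2"]
    urls = round_robin_select_urls(instances, ["shard-1", "shard-2"], "dog")
    for url in itertools.islice(urls, 4):
        print(url)
```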

Re: Entire heap consumed to answer initial ping()

2009-06-30 Thread Phillip Farber
- Original Message From: Phillip Farber pfar...@umich.edu To: solr-user solr-user@lucene.apache.org Sent: Monday, June 29, 2009 4:20:26 PM Subject: Entire heap consumed to answer initial ping() Jconsole shows the entire 2.1g heap consumed on the first request (a simple ping) to Solr after

initialSize of queryResultCache and documentCache

2009-06-30 Thread Phillip Farber
I'm trying to understand the purpose of the initialSize parameter for the queryResultCache and documentCache. Is it correct that it controls how much heap is allocated to each cache at startup? I can see how it makes sense for queryResultCache since it is documented as an ordered list of

Entire heap consumed to answer initial ping()

2009-06-29 Thread Phillip Farber
Jconsole shows the entire 2.1g heap consumed on the first request (a simple ping) to Solr after a Tomcat restart. After a Tomcat restart: 13140 tomcat virtual=2255m resident=183m ... jsvc After the ping(): 13140 tomcat virtual=2255m resident=2.0g ... jsvc Jconsole says my Tenured Gen

Re: Programatic way to know when an optimize is finished?

2008-11-18 Thread Phillip Farber
and then the next command in the script runs after the optimize finishes. Hours later, in our case. Lance -Original Message- From: Phillip Farber [mailto:[EMAIL PROTECTED] Sent: Friday, November 14, 2008 10:04 AM To: solr-user@lucene.apache.org Subject: Programatic way to know when an optimize

Programatic way to know when an optimize is finished?

2008-11-14 Thread Phillip Farber
I'd like to automate my indexing processes. Is there a slick method to know when an optimize on an index has completed? Thanks, Phil

Re: Huge increase in index size adding just 2 fields

2008-11-06 Thread Phillip Farber
to the large number of terms. Thanks, Phil Phillip Farber wrote: Hi, We're indexing a lot of dirty OCR. So the index is really huge due to the size of the position file. We still get ok response time though with a median of 100ms. Phrase queries are a different matter obviously. But we're

Re: Huge increase in index size adding just 2 fields

2008-11-06 Thread Phillip Farber
Hi Otis and Hoss, My dates are not too granular. They're always YYYY-MM-DD 00:00:00 but I see that I did not omitNorms on the date field and hlb field. Thanks for pointing me in the right direction. Phil Chris Hostetter wrote: : We added the following 2 fields to the above schema as

Huge increase in index size adding just 2 fields

2008-11-03 Thread Phillip Farber
Hi, We're indexing a lot of dirty OCR. So the index is really huge due to the size of the position file. We still get ok response time though with a median of 100ms. Phrase queries are a different matter obviously. But we're seeing some really large increases in index size as we add a

Re: Practical number of Solr instances per machine

2008-10-14 Thread Phillip Farber
. So there is no super clear cut answer. If you have some concrete numbers, that will be easier to answer :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Phillip Farber [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Wednesday

Practical number of Solr instances per machine

2008-10-08 Thread Phillip Farber
Hello everyone, What is the generally accepted number of solr instances it makes sense to run on a single machine given solr/lucene threading? Servers now commonly have 4 or 8 cpus. Obviously the more instances you run the bigger your JVM heap needs to be and that takes away from OS cache.

Re: Testing query response time

2008-08-21 Thread Phillip Farber
understand what you are trying to test... Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Phillip Farber [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Wednesday, August 20, 2008 1:34:20 PM Subject: Testing query response time I

Testing query response time

2008-08-20 Thread Phillip Farber
). Speaking of empty solr caches, is there a way to flush those while solr is running? What other system states do I need to control for to get a handle on response time? Thanks and regards, Phil -- Phillip Farber - http://www.umdl.umich.edu

shards and performance

2008-08-19 Thread Phillip Farber
that we would want any protocols around distributed search to be as stable as possible? Or just wait for the 1.3 release? Thanks very much, Phil -- Phillip Farber - http://www.umdl.umich.edu

Re: shards and performance

2008-08-19 Thread Phillip Farber
[EMAIL PROTECTED] wrote: On 19-Aug-08, at 10:18 AM, Phillip Farber wrote: I'm trying to understand how splitting a monolithic index into shards improves query response time. Please tell me if I'm on the right track here. Where does the increase in performance come from? Is it that in-memory

Re: Index size vs. number of documents

2008-08-15 Thread Phillip Farber
By "Index size almost never grows linearly with the number of documents" are you saying it increases more slowly than the number of documents, i.e. sub-linearly, or more rapidly? With dirty OCR the number of unique terms is always increasing due to the garbage words -Phil Chris Hostetter

Re: Index size vs. number of documents

2008-08-14 Thread Phillip Farber
. yes. I'd be interested in how this changes your index size if you do decide to try it. There's nothing like having somebody else do research for me G. Best Erick On Wed, Aug 13, 2008 at 1:45 PM, Phillip Farber [EMAIL PROTECTED] wrote: We're indexing the ocr for a large number of books

Index size vs. number of documents

2008-08-13 Thread Phillip Farber
We're indexing the ocr for a large number of books. Our experimental schema is simple: an id field and an ocr text field (not stored). Currently we just have two data points: 3005 documents = 723 MB index; 174237 documents = 51460 MB index. These indexes are not optimized. If the index size
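
The two data points above can be compared per document to see whether growth looks linear. A small sketch of that arithmetic (the function name is just for illustration); with only two unoptimized indexes this is a rough indication, not a measurement.

```python
def mb_per_document(document_count, index_mb):
    """Average index size per document, in MB."""
    return index_mb / document_count


if __name__ == "__main__":
    small = mb_per_document(3005, 723)        # about 0.24 MB/doc
    large = mb_per_document(174237, 51460)    # about 0.30 MB/doc
    print("small index: %.3f MB/doc" % small)
    print("large index: %.3f MB/doc" % large)
    # A ratio above 1 means the larger index costs more per document,
    # i.e. the growth between these two points is faster than linear.
    print("ratio: %.2f" % (large / small))
```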

Re: scaling / sharding questions

2008-06-18 Thread Phillip Farber
This may be slightly off topic, for which I apologize, but is related to the question of searching several indexes as Lance describes below, quoting: We also found that searching a few smaller indexes via the Solr 1.3 Distributed Search feature is actually faster than searching one large

field normalization and omitNorms

2008-05-27 Thread Phillip Farber
Hi all, I've been looking without success for a simple explanation of the effect of omitNorms=false for a text field. Can someone point me to the relevant doc? What is the effect of omitNorms=false on index size and query performance for say 200K documents that have a single large text

Queuing adds and commits

2008-04-27 Thread Phillip Farber
A while back Hoss described Solr queuing behavior: searches can go on happily while commits/adds are happening, and multiple adds can happen in parallel, ... but all adds block while a commit is taking place. i just give all of clients that update the index a really large timeout value

Solr queuing behavior

2008-04-27 Thread Phillip Farber
Hello, I have a quasi-realtime indexing application where documents are grouped into collections and documents can be added or removed from collections. The document has an id and multiple collection id (collid) fields reflecting the collections that contain that document. The collid field

Re: limit on number of values in a filter query?

2008-04-03 Thread Phillip Farber
! Perhaps you hit a jetty limit on the size of a GET request or something? Perhaps try POST? -Yonik On Thu, Apr 3, 2008 at 11:03 AM, Phillip Farber [EMAIL PROTECTED] wrote: I use a filter query (fq) parameter in my requests to limit the select response to a subset of all document ids. I'm getting
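
Following the suggestion to try POST, the same select parameters can be sent as a form-encoded request body instead of in the URL. A sketch under assumptions: the id field name, the OR-joined fq syntax, and the URL in the usage example are illustrative, and the request is built but not sent.

```python
import urllib.request
from urllib.parse import urlencode


def build_select_post(select_url, query, ids, id_field="id"):
    """Build a POST to the select handler with a long fq in the body."""
    fq = "%s:(%s)" % (id_field, " OR ".join(str(i) for i in ids))
    body = urlencode({"q": query, "fq": fq}).encode("utf-8")
    return urllib.request.Request(
        select_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical endpoint and a filter over many document ids.
    req = build_select_post("http://localhost:8983/solr/select", "dog", range(1, 2001))
    print(req.get_method(), req.full_url, len(req.data), "bytes in body")
```

Moving the parameters into the body sidesteps any limit on the length of a GET request line, though other limits (such as the maximum number of boolean clauses in a query) may still apply.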

Help with XmlPullParserException

2008-04-02 Thread Phillip Farber
Hello all, I'm indexing a body of OCR and encountered this exception. Apparently it's some kind of XML parser error. Out of thousands of documents, which I create with significant processing to make sure they are XML compliant, only this one appears to have a problem. But can anyone tell

Re: Help with XmlPullParserException

2008-04-02 Thread Phillip Farber
. Phil Phillip Farber wrote: Hello all, I'm indexing a body of OCR and encountered this exception. Apparently it's some kind of XML parser error. Out of thousands of documents, which I create with significant processing to make sure they are XML compliant, only this one appears to have

stopwords and phrase queries

2008-03-21 Thread Phillip Farber
Am I correct that if I index with stop words "to", "be", "or" and "not", then the phrase query "to be or not to be" will not retrieve any documents? Is there any documentation that discusses the interaction of stop words and phrase queries? Thanks. Phil

Re: Solr feasibility with terabyte-scale data

2008-01-23 Thread Phillip Farber
attempt to OCR a family tree. As in a stylized tree with the data hand-written along the various branches in every orientation. Not a recognizable word in the bunch G Best Erick On Jan 22, 2008 2:05 PM, Phillip Farber [EMAIL PROTECTED] wrote: Ryan McKinley wrote: We are considering Solr 1.2

Re: Solr feasibility with terabyte-scale data

2008-01-22 Thread Phillip Farber
Ryan McKinley wrote: We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id, stored and indexed and a large

Re: Solr feasibility with terabyte-scale data

2008-01-22 Thread Phillip Farber
to? Yes we're thinking a single copy of the index using hardware-based snapshot technology for the readers a dedicated indexing solr instance updates the index. Reasonable? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Phillip Farber

Re: Searching for two terms together in a multiValued TextField

2007-12-06 Thread Phillip Farber
Hello Hoss, I appreciate your detailed response. I think I like your second alternative because I'd like to score whole books rather than pages in books. It seems to me that the more words one has to work with in a document the better the scoring would be for the entire book. Here's a

Searching for two terms together in a multiValued TextField

2007-12-05 Thread Phillip Farber
Hello, I'm still new to Solr/Lucene. I want to search documents for 2 or more terms that must appear together on a page. I have a multiValued TextField called page in a document with uniqueId called id that represents an OCR'd book. My default operator is AND. My default field is page. My

Re: Document field data not getting indexed

2007-11-30 Thread Phillip Farber
Well this one falls into the category of bald faced embarrassment. It's a bug in my process. Thanks to all for taking the time to respond. Have I said how great solr support is? :-) Phil Phillip Farber wrote: Hi Yonik, Hoss, et. al. I'm using numItems=2000 in the luke url so I am seeing

Re: Document field data not getting indexed

2007-11-30 Thread Phillip Farber
name="campeau">1</int> <int name="can">1</int> <int name="canadian">1</int> Yonik Seeley wrote: On Nov 29, 2007 7:29 PM, Phillip Farber [EMAIL PROTECTED] wrote: One of my documents (id=44) contains the word Campeau in the ocr field. But according to luke this term does not appear in the index. AFAIK

Document field data not getting indexed

2007-11-29 Thread Phillip Farber
Hi, I have 22 documents. I index these by posting them using LWP::UserAgent all with http status 200 OK. One of my documents (id=44) contains the word Campeau in the ocr field. But according to luke this term does not appear in the index. Yet when I delete the index (delete by query *:*

Re: Help with Debian solr/jetty install?

2007-11-21 Thread Phillip Farber
.bashrc ... export JAVA_HOME=/home/thorsten/opt/java export PATH=$JAVA_HOME/bin:$PATH The important thing is that $JAVA_HOME points to the JDK and it is first in your path! salu2 Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Phillip Farber

Re: Help with Debian solr/jetty install?

2007-11-21 Thread Phillip Farber
Chris Hostetter wrote: : After following Otis' and Thorsten's advice, I still get: : : HTTP ERROR: 500 No Java compiler available Just so i'm clear, you: 1) downloaded solr, tried out the tutorial, and had the url http://localhost:8983/solr/admin/ work when you ran: cd

Help with Debian solr/jetty install?

2007-11-20 Thread Phillip Farber
Hi, I've successfully run as far as the example admin page on Debian linux 2.6. So I installed the solr-jetty packaged for Debian testing which gives me Jetty 5.1.14-1 and Solr 1.2.0+ds1-1. Jetty starts fine and so does the Solr home page at http://localhost:8280/solr But I get an error

Multiple collections of items

2007-05-17 Thread Phillip Farber
Hello, I'm yet another new solr user and I'll confess that I haven't read the documentation in great depth but hope someone can at least point me in the right direction. I have an application that manages documents in real-time into collections where a given document can live in more than