Re: how to make sure a particular query is ALWAYS cached

2007-10-09 Thread Britske

separating requests over two ports is a nice solution when having multiple
user-types. I like that, although I don't think I need it for this case. 

I'm just going to go the 'normal' caching-route and see where that takes me,
instead of thinking it can't be done upfront :-) 

Thanks!



hossman wrote:
 
 
 : Although I haven't tried yet, I can't imagine that this request returns in
 : sub-second time, which is what I want (having an index of about 1M docs with
 : 6000 fields/doc and about 10 complex facet queries per request). 
 
 i wouldn't necessarily assume that :)  
 
 If you have a request handler which does a query with a facet.field, and 
 then does a followup query for the top N constraints in that facet.field, 
 the time needed to execute that handler on a cold index should primarily 
 depend on the faceting aspect and how many unique terms there are in that 
 field.  try it and see.
 
 : The navigation-pages are pretty important for, eh well, navigation ;-) and
 : although I can rely on frequent access of these pages most of the time, it
 : is not guaranteed (so neither is the caching)
 
 if i were in your shoes: i wouldn't worry about it.  i would setup 
 cold cache warming of the important queries using a firstSearcher event 
 listener, i would setup autowarming on the caches, i would setup explicit 
 warming of queries using sort fields i care about in a newSearcher event 
 listener, and i would make sure to tune my caches so that they were big 
 enough to contain a much larger number of entries than are used by my 
 custom request handler for the queries i care about (especially if my index 
 only changed a few times a day, the caches become a huge win in that case, 
 so throw everything you've got at them)
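 
 As a concrete sketch of that kind of warming setup in solrconfig.xml -- the 
 listener syntax matches the stock Solr example config, but the queries, 
 field names and sizes below are made-up placeholders, not values from this 
 thread:
 
   <!-- warm the important navigation queries on a cold index -->
   <listener event="firstSearcher" class="solr.QuerySenderListener">
     <arr name="queries">
       <lst>
         <str name="q">*:*</str>
         <str name="facet">true</str>
         <str name="facet.field">category</str>   <!-- hypothetical field -->
       </lst>
     </arr>
   </listener>
 
   <!-- warm sort fields after every commit -->
   <listener event="newSearcher" class="solr.QuerySenderListener">
     <arr name="queries">
       <lst>
         <str name="q">solr</str>
         <str name="sort">price asc</str>         <!-- hypothetical sort field -->
       </lst>
     </arr>
   </listener>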
 
 and for the record: i've been in your shoes.
 
 From a purely theoretical standpoint: if enough other requests are coming 
 in fast enough to expunge the objects used by your important navigation 
 pages from the caches ... then those pages aren't that important (at least 
 not to your end users as an aggregate)
 
 on the other hand: if you've got discrete pools of users (like say: 
 customers who do searches, vs your boss who thinks navigation pages are 
 really important) then another approach is to have two ports serving 
 queries -- one that you send your navigation type queries to (with the 
 caches tuned appropriately) and one that you send other traffic to (with 
 caches tuned appropriately) ... i do that for one major index, it makes a 
 lot of sense when you have very distinct usage profiles and you want to 
 get the most bang for your buck cache wise.
 
 
 :  #1 wouldn't really accomplish what you want without #2 as well.
 
 : regarding #1. 
 : Wouldn't making a user-cache for the sole purpose of storing these queries
 : be enough? I could then reference this user-cache by name, and extract the
 
 only if you also write a custom request handler ... that was my point 
 before it was clear that you were already doing that no matter what (you 
 had custom request handler listed in #2)
 
 you could definitely make sure to explicitly put all of your DocLists in 
 your own usercache, that will certainly work.  but frankly, based on 
 what you've described about your use case, and how often your data 
 changes, it would probably be easier to set up a layer of caching in front 
 of Solr (since you are concerned with ensuring *all* of the data 
 for these important pages gets cached) ... something like an HTTP reverse 
 proxy cache (aka: accelerator proxy) would help you ensure that these whole 
 pages were getting cached.
 
 i've never tried it, but in theory: you could even setup a newSearcher 
 event listener to trigger a little script to ping your proxy with a 
 request that forced it to revalidate the query when your index changes.
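 
 (a sketch of how that could be wired up, assuming the stock 
 solr.RunExecutableListener is used -- the script name and proxy URL are 
 made up for illustration:
 
   <listener event="newSearcher" class="solr.RunExecutableListener">
     <str name="exe">./refresh-proxy.sh</str>
     <str name="dir">solr/bin</str>
     <bool name="wait">false</bool>
   </listener>
 
 where refresh-proxy.sh might simply run something like
   curl -s -H "Cache-Control: no-cache" "http://proxy.example.com/solr/select?q=..." 
 against each important navigation query.)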
 
 
 
 -Hoss
 
 
 




Re: Solr deployment in tomcat

2007-10-09 Thread Jérôme Etévé
Hi,

Here's what I've got (multiple Solr instances within the same Tomcat server):

In
/var/tomcat/conf/Catalina/localhost/

For an instance 'foo' :

foo.xml :
<Context path="foo" docBase="/var/tomcat/solrapp/solr.war" debug="0"
crossContext="true">
   <Environment name="solr/home" type="java.lang.String"
value="/var/solr/foo/" override="true" />
</Context>

/var/tomcat/solrapp/solr.war is the path to the solr war file. It can
be anywhere on the disk.
/var/solr/foo/ is the solr home for this instance (where you'll put
your schema.xml, solrconfig.xml, etc.).


Restart tomcat and you should see your foo app appear in your deployed apps.


Jerome.

On 10/9/07, Chris Laux [EMAIL PROTECTED] wrote:
  Hello Group,
   Is anyone able to deploy solr.war on Tomcat? I just tried to deploy it as 
  per the wiki and it gives a bunch of exceptions, and I don't think those exceptions 
  have any relevance to the actual cause. I was wondering if there is any 
  special configuration needed?

 I had that very same problem while trying to set solr up with tomcat
 (and multiple instances). I have given up for now and am working with
 Jetty instead.

 Chris Laux




-- 
Jerome Eteve.
[EMAIL PROTECTED]
http://jerome.eteve.free.fr/


Re: Solr deployment in tomcat

2007-10-09 Thread Chris Laux
Jérôme Etévé wrote:
[...]
 /var/solr/foo/ is the solr home for this instance (where you'll put
 your schema.xml , solrconfig.xml etc.. ) .

Thanks for the input Jérôme, I gave it another try and discovered that
what I was doing wrong was copying the solr/example/ directory to what
you call /var/solr/foo/, while copying solr/example/solr/ is what
works now.

Maybe I should add a note to the Wiki...

Chris



Re: Solr deployment in tomcat

2007-10-09 Thread Jérôme Etévé
On 10/9/07, Chris Laux [EMAIL PROTECTED] wrote:
 Jérôme Etévé wrote:
 [...]
  /var/solr/foo/ is the solr home for this instance (where you'll put
  your schema.xml , solrconfig.xml etc.. ) .

 Thanks for the input Jérôme, I gave it another try and discovered that
 what I was doing wrong was copying the solr/example/ directory to what
 you call /var/solr/foo/, while copying solr/example/solr/ is what
 works now.

 Maybe I should add a note to the Wiki...

Sounds like a good idea! Actually I remember struggling a bit to have
multiple instances of Solr in Tomcat.

-- 
Jerome Eteve.
[EMAIL PROTECTED]
http://jerome.eteve.free.fr/


Re: High-Availability deployment

2007-10-09 Thread Daniel Alheiros
Hi Hoss,

Yes I know that, but I want to have a proper dummy backup (something that
could be kept in a very controlled environment). I thought about using this
approach (a slave just for this purpose), but if I'm using it just as a
backup node there is no reason I don't use a proper backup structure (as I
have all the needed infrastructure in place for that). It's just an extra
redundancy level as I'm going to have a Master/Slaves structure and the
index is replicated amongst them anyway.

Yes, I got it. I have implemented ways to re-index stuff in an incremental
way so I can just re-index a slice of my content (based on dates or id's)
which should be enough to keep my index up-to-date quickly after a possible
disaster.

Thank you for your considerations,
Daniel


On 8/10/07 18:29, Chris Hostetter [EMAIL PROTECTED] wrote:

 : I'm setting up a backup task to keep a copy of my master index, just to
 : avoid having to re-build my index from scratch. And other important issue is
 
  every slave is a backup of the master, so you don't usually need a
  separate backup mechanism.
  
  re-building the index is more about peace of mind when asking why did it
  crash?  what did/didn't get written to the index before it crashed?
 
 
 
 
 -Hoss
 





problems with arabic search

2007-10-09 Thread Heba Farouk
 

Hello

I’m a newbie to Solr and I need your help in developing an Arabic search engine 
using Solr.

I succeeded in building the index but failed searching it. I get this error when I 
submit a query like “محمد”.

 

XML Parsing Error: mismatched tag. Expected: </HR>.

Location: 
http://localhost:8080/solrServlet/searchServlet?query=%D9%85%D8%AD%D9%85%D8%AF&cmdSearch=Search%21

Line Number 1, Column 1260 (Apache Tomcat/6.0.13 error page):

HTTP Status 400 - Query parsing error: Cannot parse '': '*' or '?' not allowed 
as first character in WildcardQuery

description: The request sent by the client was syntactically incorrect (Query 
parsing error: Cannot parse '': '*' or '?' not allowed as first character in 
WildcardQuery).

- 

 

The Apache server URIEncoding, JSP and servlet encodings are all set to UTF-8, 
but it still doesn't work.

 

Thanks in advance 

 

 Best regards,

 

Heba Farouk

Software Engineer

Bibliotheca Alexandrina

 



RE: Availability Issues

2007-10-09 Thread David Whalen
Chris:

We're using Jetty also, so I get the sense I'm looking at the
wrong log file.

On that note -- I've read that Jetty isn't the best servlet
container to use in these situations, is that your experience?

Dave


 -Original Message-
 From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
 Sent: Monday, October 08, 2007 11:20 PM
 To: solr-user
 Subject: RE: Availability Issues
 
 
 : My logs don't look anything like that.  They look like HTTP
 : requests.  Am I looking in the wrong place?
 
 what servlet container are you using?  
 
 every servlet container handles application logs differently 
 -- it's especially tricky because even the format can be 
 changed, the examples i gave before are in the default format 
 you get if you use the jetty setup in the solr example (which 
 logs to stdout), but many servlet containers won't include 
 that much detail by default (they typically leave out the 
 classname and method name).  there's also typically a setting 
 that controls the verbosity -- so in some configurations only 
 the SEVERE messages are logged and in others the INFO 
 messages are logged ... you're going to want at least the 
 INFO level to debug stuff.
 
 grep all the log files you can find for Solr home set to 
 ... that's one of the first messages Solr logs.  if you can 
 find that, you'll find the other messages i was talking about.
 
 
 -Hoss
 
 
 


RE: Availability Issues

2007-10-09 Thread David Whalen
All:

How can I break up my install onto more than one box?  We've
hit a learning curve here and we don't understand how best to
proceed.  Right now we have everything crammed onto one box
because we don't know any better.

So, how would you build it if you could?  Here are the specs:

a) the index needs to hold at least 25 million articles
b) the index is constantly updated at a rate of 10,000 articles
per minute
c) we need to have faceted queries

Again, real-world experience is preferred here over book knowledge.
We've tried to read the docs and it's only made us more confused.

TIA

Dave W
  

 -Original Message-
 From: Yonik Seeley [mailto:[EMAIL PROTECTED] 
 Sent: Monday, October 08, 2007 3:42 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Availability Issues
 
 On 10/8/07, David Whalen [EMAIL PROTECTED] wrote:
   Do you see any requests that took a really long time to finish?
 
  The requests that take a long time to finish are just 
 simple queries.  
  And the same queries run at a later time come back much faster.
 
  Our logs contain 99% inserts and 1% queries.  We are 
 constantly adding 
  documents to the index at a rate of 10,000 per minute, so the logs 
  show mostly that.
 
 Oh, so you are using the same boxes for updating and querying?
 When you insert, are you using multiple threads?  If so, how many?
 
 What is the full URL of those slow query requests?
 Do the slow requests start after a commit?
 
   Start with the thread dump.
   I bet it's multiple queries piling up around some synchronization 
   points in lucene (sometimes caused by multiple threads generating 
   the same big filter that isn't yet cached).
 
  What would be my next steps after that?  I'm not sure I'd 
 understand 
  enough from the dump to make heads-or-tails of it.  Can I 
 share that 
  here?
 
 Yes, post it here.  Most likely a majority of the threads 
 will be blocked somewhere deep in lucene code, and you will 
 probably need help from people here to figure it out.
 
 -Yonik
 
 


Re: extending StandardRequestHandler gives ClassCastException

2007-10-09 Thread Erik Hatcher
Are you compiling your custom request handler against the same  
version of Solr that you are deploying with?   My hunch is that  
you're compiling against an older version.


Erik


On Oct 9, 2007, at 9:04 AM, Britske wrote:



I'm trying to add a new requestHandler-plugin to Solr by extending
StandardRequestHandler.
However, when starting solr-server after configuration i get a
ClassCastException:

SEVERE: java.lang.ClassCastException:
wrappt.solr.requesthandler.TopListRequestHandler cannot be cast to
org.apache.solr.request.SolrRequestHandler  at
org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java: 
149)


I can't get my head around what might be wrong, as I am extending
org.apache.solr.handler.StandardRequestHandler, which already implements
org.apache.solr.request.SolrRequestHandler, so it must be able to cast, I
figure.

Anyone any ideas? below is the code / setup I used.

My handler:
---
package wrappt.solr.requesthandler;

import org.apache.solr.handler.StandardRequestHandler;
import org.apache.solr.request.SolrRequestHandler;

public class TopListRequestHandler extends StandardRequestHandler
    implements SolrRequestHandler
{
    //no code here (so it mimics StandardRequestHandler)
}
--

configured in solrconfig as:
<requestHandler name="toplist"
    class="wrappt.solr.requesthandler.TopListRequestHandler"/>

added this handler to a jar called solrRequestHandler1.jar and added this
jar along with apache-solr-nightly.jar to the \lib directory of my server.
(It needs the last jar for resolving the StandardRequestHandler. Isn't this
strange btw, because I thought that it would be resolved from solr.war
automatically.)

general solr-info of the server:
Solr Specification Version: 1.2.2007.10.07.08.05.52
Solr Implementation Version: nightly ${svnversion} - yonik - 2007-10-07 08:05:52

I double-checked that the included apache-solr-nightly.jar is the same
version as the deployed server by getting the latest nightly build and
getting the .jars and .war from it.

Furthermore, I noticed that org.apache.solr.request.StandardRequestHandler
is deprecated. Note that I'm extending
org.apache.solr.handler.StandardRequestHandler. Is it possible that this has
anything to do with it?

with regards,
Geert-Jan






Re: Solr deployment in tomcat

2007-10-09 Thread Cool Coder
It worked. Thanks a lot. I just updated the value attribute of the Environment tag in 
solr.xml. Maybe you should update the wiki with both Unix and Windows examples.

<Context path="solr" docBase="C:/apache-solr-1.2.0/example/webapps/solr.war" 
    debug="0" crossContext="true">
    <Environment name="solr/home" type="java.lang.String" 
        value="C:/apache-solr-1.2.0/example/solr" override="true" />
</Context>



- Original Message 
From: Jérôme Etévé [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Tuesday, October 9, 2007 6:49:38 AM
Subject: Re: Solr deployment in tomcat


On 10/9/07, Chris Laux [EMAIL PROTECTED] wrote:
 Jérôme Etévé wrote:
 [...]
  /var/solr/foo/ is the solr home for this instance (where you'll put
  your schema.xml , solrconfig.xml etc.. ) .

 Thanks for the input Jérôme, I gave it another try and discovered that
 what I was doing wrong was copying the solr/example/ directory to what
 you call /var/solr/foo/, while copying solr/example/solr/ is what
 works now.

 Maybe I should add a note to the Wiki...

Sounds like a good idea ! Actually I remember struggling a bit to have
multiple instance of solr in tomcat.

-- 
Jerome Eteve.
[EMAIL PROTECTED]
http://jerome.eteve.free.fr/


   


indexing problem

2007-10-09 Thread Urvashi Gadi
Hi All,

I'm trying to index my data using post.jar and I get the following error:


Error 500
HTTP ERROR: 500
name and value cannot both be empty

java.lang.IllegalArgumentException: name and value cannot both be empty
        at org.apache.lucene.document.Field.<init>(Field.java:197)


The only required field in my schema is identifier (I started with the
default schema.xml and made my changes on that).

How do I debug this? Is there a better way to index data?

Best regards,

Urvashi


Re: extending StandardRequestHandler gives ClassCastException

2007-10-09 Thread Britske

Yeah, I'm compiling with a reference to apache-solr-nightly.jar, which is from
the same nightly build (7 October 2007) as the apache-solr-nightly.war I'm
deploying against. I include this same apache-solr-nightly.jar in the lib
folder of my deployed server. 

It still seems odd that I have to include the jar, since the
StandardRequestHandler should be picked up in the war right? Is this also a
sign that there must be something wrong with the deployment?

btw: I deployed by copying a directory which contains the example
deployment, and swapped in  the apache.solr-nightly.war in the 'webapps'-dir
after renaming it to solr.war. This enables me to start the new server
using: java -jar start.jar. I don't know if this is common practice or
considered 'exotic', but it might just be causing the problem.. Anyway,
after deploying the server picks up the correct war, as solr/admin shows the
correct Solr Specification Version: 1.2.2007.10.07.08.05.52.

other options?
Geert-Jan



Erik Hatcher wrote:
 
 Are you compiling your custom request handler against the same  
 version of Solr that you are deploying with?   My hunch is that  
 you're compiling against an older version.
 
   Erik
 
 
 On Oct 9, 2007, at 9:04 AM, Britske wrote:
 

 I'm trying to add a new requestHandler-plugin to Solr by extending
 StandardRequestHandler.
 However, when starting solr-server after configuration i get a
 ClassCastException:

 SEVERE: java.lang.ClassCastException:
 wrappt.solr.requesthandler.TopListRequestHandler cannot be cast to
 org.apache.solr.request.SolrRequestHandler  at
 org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java: 
 149)

 I can't get my head around what might be wrong, as I am extending
 org.apache.solr.handler.StandardRequestHandler which already  
 implements
 org.apache.solr.request.SolrRequestHandler so it must be able to  
 cast i
 figure.

 Anyone any ideas? below is the code / setup I used.

 My handler:
 ---
 package wrappt.solr.requesthandler;

 import org.apache.solr.handler.StandardRequestHandler;
 import org.apache.solr.request.SolrRequestHandler;

 public class TopListRequestHandler extends StandardRequestHandler  
 implements
 SolrRequestHandler
 {
  //no code here (so it mimicks StandardRequestHandler)
 }
 --

 configured in solrconfig as:
 requestHandler name=toplist
 class=wrappt.solr.requesthandler.TopListRequestHandler/

 added this handler to a jar called: solrRequestHandler1.jar and  
 added this
 jar along with  apache-solr-nightly.jar to the \lib directory of my  
 server.
 (It needs the last jar  for resolving the StandardRequestHandler.  
 Isnt this
 strange btw, because I thought that it would be resolved from solr.war
 automatically. )

 general solr-info of the server:
 Solr Specification Version: 1.2.2007.10.07.08.05.52
  Solr Implementation Version: nightly ${svnversion} - yonik -  
 2007-10-07
 08:05:52

 I double-checked that the included apache-solr-nightly.jar are the  
 same
 version as the deployed server by getting the latest nightly build and
 getting the .jars and .war from it.

 Furthermore, I noticed that  
 org.apache.solr.request.StandardRequestHandler
 is deprecated. Note that I'm extending
 org.apache.solr.handler.StandardRequestHandler. Is it possible that  
 this has
 anything to do with it?

 with regards,
 Geert-Jan


 -- 
 View this message in context: http://www.nabble.com/extending- 
 StandardRequestHandler-gives-ClassCastException- 
 tf4594102.html#a13115182
 Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 




Re: Availability Issues

2007-10-09 Thread Matthew Runo
The way I'd do it would be to buy more servers, set up Tomcat on  
each, and get SOLR replicating from your current machine to the  
others. Then, throw them all behind a load balancer, and there you go.


You could also post your updates to every machine. Then you don't  
need to worry about getting replication running.


++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++


On Oct 9, 2007, at 7:12 AM, David Whalen wrote:


All:

How can I break up my install onto more than one box?  We've
hit a learning curve here and we don't understand how best to
proceed.  Right now we have everything crammed onto one box
because we don't know any better.

So, how would you build it if you could?  Here are the specs:

a) the index needs to hold at least 25 million articles
b) the index is constantly updated at a rate of 10,000 articles
per minute
c) we need to have faceted queries

Again, real-world experience is preferred here over book knowledge.
We've tried to read the docs and it's only made us more confused.

TIA

Dave W



-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Monday, October 08, 2007 3:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Availability Issues

On 10/8/07, David Whalen [EMAIL PROTECTED] wrote:

Do you see any requests that took a really long time to finish?


The requests that take a long time to finish are just

simple queries.

And the same queries run at a later time come back much faster.

Our logs contain 99% inserts and 1% queries.  We are

constantly adding

documents to the index at a rate of 10,000 per minute, so the logs
show mostly that.


Oh, so you are using the same boxes for updating and querying?
When you insert, are you using multiple threads?  If so, how many?

What is the full URL of those slow query requests?
Do the slow requests start after a commit?


Start with the thread dump.
I bet it's multiple queries piling up around some synchronization
points in lucene (sometimes caused by multiple threads generating
the same big filter that isn't yet cached).


What would be my next steps after that?  I'm not sure I'd

understand

enough from the dump to make heads-or-tails of it.  Can I

share that

here?


Yes, post it here.  Most likely a majority of the threads
will be blocked somewhere deep in lucene code, and you will
probably need help from people here to figure it out.

-Yonik








Re: indexing problem

2007-10-09 Thread Erik Hatcher

What is the XML you POSTed into Solr?

It looks like somehow you've sent in a field with no name or value,  
though this is an error that probably should be caught higher up in  
Solr.


Erik


On Oct 9, 2007, at 11:06 AM, Urvashi Gadi wrote:


Hi All,

i m trying to index my data using post.jar and i get the following  
error



Error 500
HTTP ERROR: 500
name and value cannot both be empty

java.lang.IllegalArgumentException: name and value cannot both be empty
        at org.apache.lucene.document.Field.<init>(Field.java:197)



the only required field in my schema is identifier (i started with the
default schema.xml and made my changes on that)

How do i debug this? Is there a better way to index data?

Best regards,

Urvashi




Facets and running out of Heap Space

2007-10-09 Thread David Whalen
Hi All.

I run a faceted query against a very large index on a 
regular schedule.  Every now and then the query throws
an out of heap space error, and we're sunk.

So, naturally we increased the heap size and things worked
well for a while and then the errors would happen again.
We've increased the initial heap size to 2.5GB and it's
still happening.

Is there anything we can do about this?

Thanks in advance,

Dave W


Re: indexing problem

2007-10-09 Thread Urvashi Gadi
Is there a way to find out the line number in the XML file? The XML file I'm
using is quite large.



On 10/9/07, Erik Hatcher [EMAIL PROTECTED] wrote:

 What is the XML you POSTed into Solr?

 It looks like somehow you've sent in a field with no name or value,
 though this is an error that probably should be caught higher up in
 Solr.

Erik


 On Oct 9, 2007, at 11:06 AM, Urvashi Gadi wrote:

  Hi All,
 
  i m trying to index my data using post.jar and i get the following
  error
 
 
  Error 500
  HTTP ERROR: 500
  name and value cannot both be empty
 
  java.lang.IllegalArgumentException: name and value cannot both be empty
          at org.apache.lucene.document.Field.<init>(Field.java:197)
 
 
  the only required field in my schema is identifier (i started with the
  default schema.xml and made my changes on that)
 
  How do i debug this? Is there a better way to index data?
 
  Best regards,
 
  Urvashi




Re: Facets and running out of Heap Space

2007-10-09 Thread Yonik Seeley
On 10/9/07, David Whalen [EMAIL PROTECTED] wrote:
 I run a faceted query against a very large index on a
 regular schedule.  Every now and then the query throws
 an out of heap space error, and we're sunk.

 So, naturally we increased the heap size and things worked
 well for a while and then the errors would happen again.
 We've increased the initial heap size to 2.5GB and it's
 still happening.

 Is there anything we can do about this?

Try facet.enum.cache.minDf param:
http://wiki.apache.org/solr/SimpleFacetParameters
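
For example (the field name and threshold here are purely illustrative):

http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=journalist_id&facet.enum.cache.minDf=25

This tells the term-enumeration faceting code not to use the filterCache for
terms that match fewer than 25 documents, which keeps the cache (and heap)
from filling up with filters for rare terms.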

-Yonik


RE: Facets and running out of Heap Space

2007-10-09 Thread David Whalen
Hi Yonik.

According to the doc:


 This is only used during the term enumeration method of
 faceting (facet.field type faceting on multi-valued or
 full-text fields). 

What if I'm faceting on just a plain String field?  It's
not full-text, and I don't have multiValued set for it.

Dave


 -Original Message-
 From: Yonik Seeley [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, October 09, 2007 12:47 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Facets and running out of Heap Space
 
 On 10/9/07, David Whalen [EMAIL PROTECTED] wrote:
  I run a faceted query against a very large index on a regular 
  schedule.  Every now and then the query throws an out of heap space 
  error, and we're sunk.
 
  So, naturally we increased the heap size and things worked 
 well for a 
  while and then the errors would happen again.
  We've increased the initial heap size to 2.5GB and it's still 
  happening.
 
  Is there anything we can do about this?
 
 Try facet.enum.cache.minDf param:
 http://wiki.apache.org/solr/SimpleFacetParameters
 
 -Yonik
 
 


Re: Availability Issues

2007-10-09 Thread Charles Hornberger
I'm about to do a prototype deployment of Solr for a pretty
high-volume site, and I've been following this thread with some
interest.

One thing I want to confirm: It's really possible for Solr to handle a
constant stream of 10K updates/min (150 updates/sec) to a
25M-document index? I knew Solr and Lucene were good, but that seems
like a pretty tall order. From the responses I'm seeing to David
Whalen's inquiries, it seems like people think that's possible.

Thanks,
Charlie

On 10/9/07, Matthew Runo [EMAIL PROTECTED] wrote:
 The way I'd do it would be to buy more servers, set up Tomcat on
 each, and get SOLR replicating from your current machine to the
 others. Then, throw them all behind a load balancer, and there you go.

 You could also post your updates to every machine. Then you don't
 need to worry about getting replication running.

 ++
   | Matthew Runo
   | Zappos Development
   | [EMAIL PROTECTED]
   | 702-943-7833
 ++


 On Oct 9, 2007, at 7:12 AM, David Whalen wrote:

  All:
 
  How can I break up my install onto more than one box?  We've
  hit a learning curve here and we don't understand how best to
  proceed.  Right now we have everything crammed onto one box
  because we don't know any better.
 
  So, how would you build it if you could?  Here are the specs:
 
  a) the index needs to hold at least 25 million articles
  b) the index is constantly updated at a rate of 10,000 articles
  per minute
  c) we need to have faceted queries
 
  Again, real-world experience is preferred here over book knowledge.
  We've tried to read the docs and it's only made us more confused.
 
  TIA
 
  Dave W
 
 
  -Original Message-
  From: Yonik Seeley [mailto:[EMAIL PROTECTED]
  Sent: Monday, October 08, 2007 3:42 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Availability Issues
 
  On 10/8/07, David Whalen [EMAIL PROTECTED] wrote:
  Do you see any requests that took a really long time to finish?
 
  The requests that take a long time to finish are just
  simple queries.
  And the same queries run at a later time come back much faster.
 
  Our logs contain 99% inserts and 1% queries.  We are
  constantly adding
  documents to the index at a rate of 10,000 per minute, so the logs
  show mostly that.
 
  Oh, so you are using the same boxes for updating and querying?
  When you insert, are you using multiple threads?  If so, how many?
 
  What is the full URL of those slow query requests?
  Do the slow requests start after a commit?
 
  Start with the thread dump.
  I bet it's multiple queries piling up around some synchronization
  points in lucene (sometimes caused by multiple threads generating
  the same big filter that isn't yet cached).
 
  What would be my next steps after that?  I'm not sure I'd
  understand
  enough from the dump to make heads-or-tails of it.  Can I
  share that
  here?
 
  Yes, post it here.  Most likely a majority of the threads
  will be blocked somewhere deep in lucene code, and you will
  probably need help from people here to figure it out.
 
  -Yonik
 
 
 




Re: Availability Issues

2007-10-09 Thread Matthew Runo
When we are doing a reindex (1x a day), we post around 150-200  
documents per second, on average. Our index is not as large though,  
about 200k docs. During this import, the search service (with faceted  
page navigation) remains available for front-end searches and  
performance does not noticeably change. You can see this install  
running at http://www.6pm.com, where SOLR is in use for every part of  
the navigation and search.


I believe that a sustained load of 150+ posts per second is very  
possible. At that load though, it does make sense to consider  
multiple machines.


++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++


On Oct 9, 2007, at 10:16 AM, Charles Hornberger wrote:


I'm about to do a prototype deployment of Solr for a pretty
high-volume site, and I've been following this thread with some
interest.

One thing I want to confirm: It's really possible for Solr to handle a
constant stream of 10K updates/min (150 updates/sec) to a
25M-document index? I knew Solr and Lucene were good, but that seems
like a pretty tall order. From the responses I'm seeing to David
Whalen's inquiries, it seems like people think that's possible.

Thanks,
Charlie

On 10/9/07, Matthew Runo [EMAIL PROTECTED] wrote:

The way I'd do it would be to buy more servers, set up Tomcat on
each, and get SOLR replicating from your current machine to the
others. Then, throw them all behind a load balancer, and there you  
go.


You could also post your updates to every machine. Then you don't
need to worry about getting replication running.

++
  | Matthew Runo
  | Zappos Development
  | [EMAIL PROTECTED]
  | 702-943-7833
++


On Oct 9, 2007, at 7:12 AM, David Whalen wrote:


All:

How can I break up my install onto more than one box?  We've
hit a learning curve here and we don't understand how best to
proceed.  Right now we have everything crammed onto one box
because we don't know any better.

So, how would you build it if you could?  Here are the specs:

a) the index needs to hold at least 25 million articles
b) the index is constantly updated at a rate of 10,000 articles
per minute
c) we need to have faceted queries

Again, real-world experience is preferred here over book knowledge.
We've tried to read the docs and it's only made us more confused.

TIA

Dave W



-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Monday, October 08, 2007 3:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Availability Issues

On 10/8/07, David Whalen [EMAIL PROTECTED] wrote:

Do you see any requests that took a really long time to finish?


The requests that take a long time to finish are just

simple queries.

And the same queries run at a later time come back much faster.

Our logs contain 99% inserts and 1% queries.  We are

constantly adding

documents to the index at a rate of 10,000 per minute, so the logs
show mostly that.


Oh, so you are using the same boxes for updating and querying?
When you insert, are you using multiple threads?  If so, how many?

What is the full URL of those slow query requests?
Do the slow requests start after a commit?


Start with the thread dump.
I bet it's multiple queries piling up around some synchronization
points in lucene (sometimes caused by multiple threads generating
the same big filter that isn't yet cached).


What would be my next steps after that?  I'm not sure I'd

understand

enough from the dump to make heads-or-tails of it.  Can I

share that

here?


Yes, post it here.  Most likely a majority of the threads
will be blocked somewhere deep in lucene code, and you will
probably need help from people here to figure it out.

-Yonik













Re: extending StandardRequestHandler gives ClassCastException

2007-10-09 Thread Ryan McKinley



It still seems odd that I have to include the jar, since the
StandardRequestHandler should be picked up in the war right? Is this also a
sign that there must be something wrong with the deployment?



Note that in 1.3, the StandardRequestHandler was moved from 
o.a.s.request to o.a.s.handler:


http://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/request/StandardRequestHandler.java
http://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/handler/StandardRequestHandler.java

If you are subclassing StandardRequestHandler, make sure you are using 
consistent versions.


ryan


Re: indexing problem

2007-10-09 Thread Erik Hatcher
Does all your XML look like this sample here - 
http://wiki.apache.org/solr/UpdateXmlMessages ?


Are you sending in any field elements without a name attribute or  
with a blank value?
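
For reference, a well-formed update message looks roughly like this -- the 
field names and values below are just illustrative, with identifier being 
the required field mentioned earlier in this thread:

<add>
  <doc>
    <field name="identifier">doc-001</field>
    <field name="title">an example title</field>
  </doc>
</add>

Every field element needs a non-empty name attribute and/or a value; an 
element missing both would trigger exactly this IllegalArgumentException.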


Erik


On Oct 9, 2007, at 12:45 PM, Urvashi Gadi wrote:
is there a way to find out the line number in the xml file? the xml  
file i m

using is quite large.



On 10/9/07, Erik Hatcher [EMAIL PROTECTED] wrote:


What is the XML you POSTed into Solr?

It looks like somehow you've sent in a field with no name or value,
though this is an error that probably should be caught higher up in
Solr.

   Erik


On Oct 9, 2007, at 11:06 AM, Urvashi Gadi wrote:


Hi All,

i m trying to index my data using post.jar and i get the following
error


Error 500
HTTP ERROR: 500
name and value cannot both be empty

java.lang.IllegalArgumentException: name and value cannot both be empty
        at org.apache.lucene.document.Field.<init>(Field.java:197)


the only required field in my schema is identifier (i started  
with the

default schema.xml and made my changes on that)

How do i debug this? Is there a better way to index data?

Best regards,

Urvashi







Re: extending StandardRequestHandler gives ClassCastException

2007-10-09 Thread Chris Hostetter

: SEVERE: java.lang.ClassCastException:
: wrappt.solr.requesthandler.TopListRequestHandler cannot be cast to
: org.apache.solr.request.SolrRequestHandler  at
: org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:149)

: added this handler to a jar called: solrRequestHandler1.jar and added this
: jar along with  apache-solr-nightly.jar to the \lib directory of my server.
: (It needs the last jar  for resolving the StandardRequestHandler. Isnt this
: strange btw, because I thought that it would be resolved from solr.war
: automatically. ) 

classpaths are very very very tricky and annoying.  i believe the problem 
you are seeing is that the SolrCore knows about the copy of 
StandardRequestHandler in the classloader for your war, but because of 
where you put your custom request handler, the war's classloader is 
delegating up to its parent (the container's class loader) to find it, 
at which point the container's class loader also needs to resolve 
StandardRequestHandler (hence you put apache-solr-nightly.jar in that lib 
so that classloader can find it).  now the container classloader has 
resolved all of the classes it needs for Solr to finish constructing your 
handler -- except that your handler doesn't extend the copy
of StandardRequestHandler Solr knows about -- it extends the one up in the 
parent classloader.

try creating a lib directory in your solrhome and putting your jar there 
... make sure you get rid of your jar (and the solr-nightly jar) that you 
put in the container's main lib directory.  they will cause you nothing but 
problems.  if that *still* doesn't work, try unpacking the Solr war, and 
adding your class directly to it ... that *completely* eliminates any 
possibility of classpath issues and will help identify if it's some other 
random problem (but it's a last resort since it makes upgrading later 
hard)
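
(a sketch of the layout being described -- paths are hypothetical, with 
<solr.home> standing in for wherever your solr home actually lives:

   <solr.home>/conf/solrconfig.xml
   <solr.home>/conf/schema.xml
   <solr.home>/lib/solrRequestHandler1.jar      custom handler jar goes here

 ...and then remove solrRequestHandler1.jar and apache-solr-nightly.jar from 
 the servlet container's own lib directory.)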

http://wiki.apache.org/solr/SolrPlugins


-Hoss



Re: Facets and running out of Heap Space

2007-10-09 Thread Yonik Seeley
On 10/9/07, David Whalen [EMAIL PROTECTED] wrote:
  This is only used during the term enumeration method of
  faceting (facet.field type faceting on multi-valued or
  full-text fields).

 What if I'm faceting on just a plain String field?  It's
 not full-text, and I don't have multiValued set for it

Then you will be using the FieldCache counting method, and this param
is not applicable :-)
Are all the fields that you facet on like this?

The FieldCache entry might be taking up too much room, esp if the
number of entries is high, and the entries are big.  The requests
themselves can take up a good chunk of memory temporarily (4 bytes *
nValuesInField).

You could try a memory profiling tool and see where all the memory is
being taken up too.

-Yonik


Re: extending StandardRequestHandler gives ClassCastException

2007-10-09 Thread Britske

Thanks, but I'm using the updated o.a.s.handler.StandardRequestHandler. I'm
going to try on 1.2 instead to see if it changes things. 

Geert-Jan



ryantxu wrote:
 
 
 It still seems odd that I have to include the jar, since the
 StandardRequestHandler should be picked up in the war right? Is this also
 a
 sign that there must be something wrong with the deployment?
 
 
 Note that in 1.3, the StandardRequestHandler was moved from 
 o.a.s.request to o.a.s.handler:
 
 http://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/request/StandardRequestHandler.java
 http://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/handler/StandardRequestHandler.java
 
 If you are subclassing StandardRequestHandler, make sure you are using a 
 consistent versions
 
 ryan
 
 




RE: Availability Issues

2007-10-09 Thread Chris Hostetter

: We're using Jetty also, so I get the sense I'm looking at the
: wrong log file.

if you are using the jetty configs that come in the solr downloads, it 
writes all of the solr log messages to stdout (ie: when you run it on the 
commandline, the messages come to your terminal).  i don't know off the 
top of my head how to configure Jetty to log application log messages to a 
specific file ... there may be jetty specific config options for 
controlling this, or jetty may expect you to explicitly set the system 
properties that tell the JVM default log manager what you want it to do...

http://java.sun.com/j2se/1.5.0/docs/guide/logging/overview.html

: On that note -- I've read that Jetty isn't the best servlet
: container to use in these situations, is that your experience?

i can't make any specific recommendations ... i use Resin because someone 
else at my work did some research and decided it's worth paying for.  From 
what i've seen tomcat seems easier to configure than jetty and i had an 
easier time understanding its docs, but i've never done any performance 
tests.



-Hoss



Re: Facets and running out of Heap Space

2007-10-09 Thread Chris Hostetter

: So, naturally we increased the heap size and things worked
: well for a while and then the errors would happen again.
: We've increased the initial heap size to 2.5GB and it's
: still happening.

is this the same 25,000,000 document index you mentioned before?

2.5GB of heap doesn't seem like much if you are also doing faceting ... 
even if you were faceting on an int field, there's going to be 95MB of 
FieldCache for that field.  you said this was a string field, so it's going 
to be 95MB + however much space is needed for all the terms 
(presumably if you are faceting on this field every doc doesn't have a 
unique value, but even assuming a conservative 10% unique values of 10 
characters each, that's another ~50MB).  so we're up to about 150MB of 
FieldCache to facet that field -- and we haven't even started talking 
about how big the index is itself (or how big the filterCache gets, or 
how many other fields you are faceting on).
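
(a rough back-of-envelope of that arithmetic, assuming the 25 million 
document index mentioned earlier in this thread; the 10% / 10-character 
assumptions are the same illustrative ones as above:

public class FieldCacheEstimate {
    public static void main(String[] args) {
        long docs = 25000000L;              // index size from this thread
        long ordBytes = docs * 4L;          // one int ord per document: ~95MB
        long uniqueTerms = docs / 10L;      // assume ~10% unique values
        long termBytes = uniqueTerms * 20L; // ~10 chars per term, ~2 bytes/char: ~48MB
        long totalMb = (ordBytes + termBytes) / (1024L * 1024L);
        System.out.println("approx FieldCache for one String field: " + totalMb + " MB");
    }
}

which prints roughly 143 MB, i.e. the "about 150MB" figure above, for a 
single faceted String field.)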

how big is your index on disk? are you faceting or sorting on other fields 
as well?

 what does the LukeRequest Handler tell you about the # of distinct terms 
in each field that you facet on?




-Hoss



solr tuple/tag store

2007-10-09 Thread Ryan McKinley

Hello-

I am running into some scaling performance problems with SQL that I hope 
a clever solr solution could fix.  I've already gone through a bunch of 
loops, so I figure I should solicit advice before continuing to chase my 
tail.


I have a bunch of things (100K-500K+) that are defined by a set of user 
tags.  ryan says: (name=xxx, location=yyy, foo=[aaa,bbb,ccc]), and 
alison says (name:zzz, location=bbb) - this list is constantly updating, 
it is fed from automated crawlers and user generated content.  The 
'names' can be arbitrary, but 99% of them will be ~25 distinct names.


My approach has been to build a repository of all the 'tags' and then as 
things come into that repository, I merge all the tags for that entry 
into a single 'flat' document and index it with solr.


When my thing+tag count was small, a simple SQL table with a row for 
each tag works great:


CREATE TABLE `my_tags` (
  entryID varchar(40) NOT NULL,
  source varchar(40) NOT NULL,
  name varchar(40) NOT NULL,
  value TEXT NOT NULL,
  KEY( entryID ),
  KEY( source )
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

but as the row count gets big (2M+) this gets to be unusable.  To make it 
tractable, I am now splitting the tags across a bunch of tables and 
pushing the per-user name/value pairs into a single text field (stored 
with JSON):


CREATE TABLE `my_tags_000` (
 entryID varchar(40) NOT NULL,
 source varchar(40) NOT NULL,
 tags LONGTEXT NOT NULL,
 PRIMARY KEY( entryID, source )
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Then I pick what table that goes into using:
 Math.abs( id.hashCode() )%10
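
(a minimal sketch of that routing; the table prefix and shard count come 
from the message above, while the class and method names are made up for 
illustration:

public class TagTableRouter {
    private static final int NUM_TABLES = 10;

    /** pick the shard table that holds the tags for a given entry id */
    public static String tableFor(String entryId) {
        // note: Math.abs(Integer.MIN_VALUE) is still negative -- a known edge case
        int shard = Math.abs(entryId.hashCode()) % NUM_TABLES;
        return String.format("my_tags_%03d", shard);  // e.g. my_tags_000 .. my_tags_009
    }
}
)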

This works OK, but it is still slower than I would like.  DB access is 
slow, and it also needs to search across the updating solr index, and 
that gets slow since it keeps reopening the searcher (autowarming is off!)


S...  I see a few paths and would love external feedback before 
banging my head on this longer.


1. Get help from someone who knows more SQL than me and try to make a 
pure SQL approach work.  This would need to work with 10M+ tags.  Solr 
indexing is then a direct SQL-to-Solr dump.


2. Figure out how to keep the base Tuple store in solr.  I think this 
will require finishing up SOLR-139.  This would keep the the core data 
in solr - so there is no good way to 'rebuild' the index.


3. something else?  store input on disk?


Any thoughts / pointers / nay-saying would be really helpful!

thanks
ryan














RE: Facets and running out of Heap Space

2007-10-09 Thread David Whalen
 Make sure you have:
 <requestHandler name="/admin/luke" 
  class="org.apache.solr.handler.admin.LukeRequestHandler" /> 
 defined in solrconfig.xml

What's the consequence of me changing the solrconfig.xml file?
Doesn't that cause a restart of solr?

 for a large index, this can be very slow but the results are valuable.

In what way?  I'm still not clear on what this does for me


 -Original Message-
 From: Ryan McKinley [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, October 09, 2007 4:01 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Facets and running out of Heap Space
 
  
  what does the LukeReqeust Handler tell you about the # of distinct 
  terms in each field that you facet on?
  
  Where would I find that?  
 
 check:
 http://wiki.apache.org/solr/LukeRequestHandler
 
 Make sure you have:
 <requestHandler name="/admin/luke" 
  class="org.apache.solr.handler.admin.LukeRequestHandler" /> 
 defined in solrconfig.xml
 
 for a large index, this can be very slow but the results are valuable.
 
 ryan
 
 


Re: Facets and running out of Heap Space

2007-10-09 Thread Ryan McKinley

David Whalen wrote:

Make sure you have:
<requestHandler name="/admin/luke" 
class="org.apache.solr.handler.admin.LukeRequestHandler" /> 
defined in solrconfig.xml


What's the consequence of me changing the solrconfig.xml file?
Doesn't that cause a restart of solr?



editing solrconfig.xml does *not* restart solr.

But you need to restart solr to see any changes to solrconfig.



for a large index, this can be very slow but the results are valuable.


In what way?  I'm still not clear on what this does for me



It gives you all kinds of index statistics - that may or may not be 
useful in figuring out how big field caches will need to be.


It is just a diagnostics tool, not a fix.

ryan



Re: index size

2007-10-09 Thread Kevin Lewandowski
Late reply on this but I just wanted to say thanks for the
suggestions. I went through my whole schema and was storing things
that didn't need to be stored and indexing a lot of things that didn't
need to be indexed. Just completed a full reindex and it's a much more
reasonable size now.

Kevin

On 8/20/07, Mike Klaas [EMAIL PROTECTED] wrote:

 On 17-Aug-07, at 2:03 PM, Kevin Lewandowski wrote:

  Are there any tips on reducing the index size or what factors most
  impact index size?
 
  My index has 2.7 million documents and is 200 gigabytes and growing.
  Most documents are around 2-3kb and there are about 30 indexed fields.

 An ls -sh will tell you roughly where the space is being
 occupied.  There is something strange going on: 2.5kB * 2.7m is only
 6GB, and I have trouble imagining where the 30-fold index size
 expansion is coming from.

 -Mike



Re: solr tuple/tag store

2007-10-09 Thread Erik Hatcher


On Oct 9, 2007, at 3:14 PM, Ryan McKinley wrote:
2. Figure out how to keep the base Tuple store in solr.  I think  
this will require finishing up SOLR-139.  This would keep the the  
core data in solr - so there is no good way to 'rebuild' the index.


With SOLR-139, cool stuff can be done to 'rebuild' an index  
actually.  Obviously if your store is Solr you'll be using stored  
fields.  So store the most basic stuff, and copyField things around.   
With SOLR-139, to rebuild an index you simply reconfigure the  
copyField settings and basically `touch` each document to reindex it.


I did this with Collex recently as I refactored all of my old Collex  
tag architecture into SOLR-139.   My tag design is nowhere near as  
scalable as the one you're after, I don't think.  Yonik has some  
pretty prescient design ideas here:


http://wiki.apache.org/solr/UserTagDesign

Particularly interesting are the parts about leveraging intra Lucene  
Field matching capability (Phrase/SpanQuery possibilities are pretty  
neat) to reduce the number of fields.



3. something else?  store input on disk?


  *gasp*  Inconceivable!  :)

Erik




Re: solr tuple/tag store

2007-10-09 Thread Pieter Berkel
Given that the tables are of type InnoDB, I think it's safe to assume that
you're not planning to use MySQL full-text search (only supported on MyISAM
tables).  If you are not concerned about transactional integrity provided by
InnoDB, perhaps you could try using MyISAM tables (although most people
report speed improvements for insert operations (on relatively small data
sets) rather than selects).

Without seeing the actual queries that are slow, it's difficult to determine
what the problem is.  Have you tried using EXPLAIN (
http://dev.mysql.com/doc/refman/5.0/en/explain.html) to check if your query
is using the table indexes effectively?
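
For instance, against the my_tags table from earlier in the thread (the 
literal id value is made up):

EXPLAIN SELECT source, name, value
FROM my_tags
WHERE entryID = 'some-entry-id';

If the key column of the output doesn't show the index on entryID, the 
query is falling back to a table scan.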

Pieter



On 10/10/2007, Lance Norskog [EMAIL PROTECTED] wrote:

 You did not give your queries. I assume that you are searching against the
 'entryID' and updating the tag list.

 MySQL has a fulltext index. I assume this is a KWIC index but do not
 know.
 A fulltext index on entryID should be very very fast since
 single-record
 results are what Lucene does best.

 Lance



Re: solr tuple/tag store

2007-10-09 Thread Lance Norskog
You could just make a separate Lucene index with the document ID unique and
with multiple tag values.  Your schema would have the entryID as the unique
field and multiple tag values per entryID.

I just made a phrase-suggesting clone of the Spellchecker class that is
almost exactly the same. It indexes multiple second words for each single
first word.  It was my first Lucene project and was very easy to code.

Lance

On 10/9/07, Pieter Berkel [EMAIL PROTECTED] wrote:

 Given that the tables are of type InnoDB, I think it's safe to assume that
 you're not planning to use MySQL full-text search (only supported on
 MyISAM
 tables).  If you are not concerned about transactional integrity provided
 by
 InnoDB, perhaps you could try using MyISAM tables (although most people
 report speed improvements for insert operations (on relatively small data
 sets) rather than selects).

 Without seeing the actual queries that are slow, it's difficult to
 determine
 what the problem is.  Have you tried using EXPLAIN (
 http://dev.mysql.com/doc/refman/5.0/en/explain.html) to check if your
 query
 is using the table indexes effectively?

 Pieter



 On 10/10/2007, Lance Norskog [EMAIL PROTECTED] wrote:
 
  You did not give your queries. I assume that you are searching against
 the
  'entryID' and updating the tag list.
 
  MySQL has a fulltext index. I assume this is a KWIC index but do not
  know.
  A fulltext index on entryID should be very very fast since
  single-record
  results are what Lucene does best.
 
  Lance
 



Re: Facets and running out of Heap Space

2007-10-09 Thread Mike Klaas

On 9-Oct-07, at 12:36 PM, David Whalen wrote:


<field name="id" type="string" indexed="true" stored="true" />
<field name="content_date" type="date" indexed="true" stored="true" />
<field name="media_type" type="string" indexed="true" stored="true" />
<field name="location" type="string" indexed="true" stored="true" />
<field name="country_code" type="string" indexed="true" stored="true" />
<field name="text" type="text" indexed="true" stored="true" multiValued="true" />
<field name="content_source" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="site_id" type="string" indexed="true" stored="true" />
<field name="journalist_id" type="string" indexed="true" stored="true" />
<field name="blog_url" type="string" indexed="true" stored="true" />
<field name="created_date" type="date" indexed="true" stored="true" />

I'm sure we could stop storing many of these columns, especially
if someone told me that would make a big difference.


I don't think that it would make a difference in memory consumption,  
but storage is certainly not necessary for faceting.  Extra stored  
fields can slow down search if they are large (in terms of bytes),  
but don't really occupy extra memory, unless they are polluting the  
doc cache.  Does 'text' need to be stored?



what does the LukeReqeust Handler tell you about the # of
distinct terms in each field that you facet on?


Where would I find that?  I could probably estimate that myself
on a per-column basis.  it ranges from 4 distinct values for
media_type to 30-ish for location to 200-ish for country_code
to almost 10,000 for site_id to almost 100,000 for journalist_id.


Using the filter cache method on the things like media type and  
location; this will occupy ~2.3MB of memory _per unique value_, so it  
should be a net win for those (although quite close in space  
requirements for a 30-ary field on your index size).


-Mike


Re: Index files not being deleted

2007-10-09 Thread AgentHubcap

So, this problem came up again.  Now it only happens in a Linux environment
when searches are being conducted while indexing is running.

Does anything need to be closed on the searching side?



AgentHubcap wrote:
 
 As it turns out I was modifying code that wasn't being run.  Running an
 optimize after deleting did solve my problem.  =)
 
 
 
 AgentHubcap wrote:
 
 I'm running 1.2.
 
  Actually, I am doing an optimize after I delete the indexes.  (twice, as
 I read there was an issue with the optimize).  Do I need to close
 something manually?
 
 Here's my optimize code:
 
  private void optimize() throws IOException
  {
      UpdateHandler updateHandler = SolrCore.getSolrCore().getUpdateHandler();
      CommitUpdateCommand commitcmd = new CommitUpdateCommand(false);
      commitcmd.optimize = true;
      updateHandler.commit(commitcmd);
      updateHandler.close();
  }
 
 
 
 
 ryantxu wrote:
 
 
 - Delete all index files via a delete command
 
 make sure to optimize after deleting the docs -- optimize has lucene get 
  rid of deleted files rather than appending them to the end of the index.
 
 what version of solr are you running?  if you are running 1.3-dev 
 deleting *:* is fast -- if you aren't using 1.3, i don't suggest moving 
 there just for that though
 
 ryan
 
 
 
 
 
 




Re: Facets and running out of Heap Space

2007-10-09 Thread Stu Hood
 Using the filter cache method on the things like media type and
 location; this will occupy ~2.3MB of memory _per unique value_

Mike, how did you calculate that value? I'm trying to tune my caches, and any 
equations that could be used to determine some balanced settings would be 
extremely helpful. I'm in a memory limited environment, so I can't afford to 
throw a ton of cache at the problem.

(I don't want to thread-jack, but I'm also wondering whether anyone has any 
notes on how to tune cache sizes for the filterCache, queryResultCache and 
documentCache).

Thanks,
Stu


-Original Message-
From: Mike Klaas [EMAIL PROTECTED]
Sent: Tuesday, October 9, 2007 9:30pm
To: solr-user@lucene.apache.org
Subject: Re: Facets and running out of Heap Space

On 9-Oct-07, at 12:36 PM, David Whalen wrote:

(snip)
 I'm sure we could stop storing many of these columns, especially
 if someone told me that would make a big difference.

I don't think that it would make a difference in memory consumption,  
but storage is certainly not necessary for faceting.  Extra stored  
fields can slow down search if they are large (in terms of bytes),  
but don't really occupy extra memory, unless they are polluting the  
doc cache.  Does 'text' need to be stored?

 what does the LukeRequest Handler tell you about the # of
 distinct terms in each field that you facet on?

 Where would I find that?  I could probably estimate that myself
 on a per-column basis.  it ranges from 4 distinct values for
 media_type to 30-ish for location to 200-ish for country_code
 to almost 10,000 for site_id to almost 100,000 for journalist_id.

Using the filter cache method on the things like media type and  
location; this will occupy ~2.3MB of memory _per unique value_, so it  
should be a net win for those (although quite close in space  
requirements for a 30-ary field on your index size).

-Mike


Re: Facets and running out of Heap Space

2007-10-09 Thread Mike Klaas

On 9-Oct-07, at 7:53 PM, Stu Hood wrote:


Using the filter cache method on the things like media type and
location; this will occupy ~2.3MB of memory _per unique value_


Mike, how did you calculate that value? I'm trying to tune my  
caches, and any equations that could be used to determine some  
balanced settings would be extremely helpful. I'm in a memory  
limited environment, so I can't afford to throw a ton of cache at  
the problem.


8bits * 25m docs.  Note that HashSet filters will be smaller  
(cardinality < 3000).


(I don't want to thread-jack, but I'm also wondering whether anyone  
has any notes on how to tune cache sizes for the filterCache,  
queryResultCache and documentCache).


I'll give the usual Solr answer: it depends <g>.  For me:

The filterCache is the most important.  I want my faceting filters to  
be there at all times, as well as the common fq's I throw at Solr.  I  
have this bumped up to 4096 or so.


The queryResultCache isn't too important.  I'm mostly interested in  
keeping around a few recent queries since they tend to be  
reexecuted.  There is generally not a whole lot of overlap, though,  
and I never page very far into the results (10 results over 100  
slaves is more than I typically would ever need).  Memory usage is  
quite low, though, so you might have success going nuts with this cache.


docCache? Make sure this is set to at least maxResults*max  
concurrent queries, since the query processing sometimes assumes  
fetching a document earlier in the request will let us retrieve it  
for free later in the request from the cache.  Other than that, it  
depends on your document usage overlap.  It you have a set of  
documents needed for meta-data storage, it behooves you to make sure  
these are always cached.
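
Pulling those rules of thumb together, a sketch of the <query> section of
solrconfig.xml -- the sizes here are purely illustrative and need tuning
against your own hit rates and heap:

  <query>
    <!-- faceting filters + common fq's; size this past the number of distinct
         values you facet on so warmed entries don't get evicted -->
    <filterCache      class="solr.LRUCache" size="4096" initialSize="4096" autowarmCount="4096"/>

    <!-- only the top-N doc ids/scores per query are kept, so entries are cheap -->
    <queryResultCache class="solr.LRUCache" size="512"  initialSize="512"  autowarmCount="256"/>

    <!-- at least rows * (max concurrent queries), plus any hot metadata docs;
         documents can't be autowarmed across searchers -->
    <documentCache    class="solr.LRUCache" size="2048" initialSize="2048" autowarmCount="0"/>

    <queryResultWindowSize>50</queryResultWindowSize>
  </query>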


cheers,
-Mike


Cache Memory Usage (was: Facets and running out of Heap Space)

2007-10-09 Thread Stu Hood
Sorry... where do the unique values come into the equation?



Also, you say that the queryResultCache memory usage is very low... how
could this be when it is storing the same information as the
filterCache, but with the addition of sorting?



Your answers are very helpful, thanks!

Stu Hood
Webmail.us
You manage your business. We'll manage your email.®

Re: proximity search not working in solr lucene

2007-10-09 Thread Chris Hostetter

: I have installed solr lucene for my website: clickindia.com, but I am
: unable to apply proximity search for the same over there.
: 
: Please help me that how should I index solrconfig.xml & schema.xml
: after providing an option of proximity search.

in order for us to help you, you're going to have to elaborate on what 
you've tried, and what results you get.  there's nothing special you need 
to do in either file to get proximity queries ... just use quotes.

what do your query URLs look like?
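
For example (field name and terms made up), a proximity query is just a quoted
phrase with a slop value:

  q=title:"heap space"~10

or, URL-encoded inside the request:

  http://localhost:8983/solr/select?q=title:%22heap+space%22~10

Without the ~N it is an exact phrase match; with it, the terms may be up to
N positions apart.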



-Hoss



Re: problems with arabic search

2007-10-09 Thread Chris Hostetter

FYI: you don't need to resend your question just because you didn't get a 
reply within a day, either people haven't had a chance to reply, or they 
don't know the answer.

: XML Parsing Error: mismatched tag. Expected: /HR.
: 
: Location: 
http://localhost:8080/solrServlet/searchServlet?query=%D9%85%D8%AD%D9%85%D8%AF&cmdSearch=Search%21

this doesn't look like a query error .. and that doesn't look like a solr 
URL, this looks something you have in front of Solr.

: </head><body><h1>HTTP Status 400 - Query parsing error: Cannot parse 
: '': '*' or '?' not allowed as first character in 

that looks like a Solr error.  i'm guessing that your app isn't dealing 
with the UTF8 correctly, something is substituting ? characters in place 
of any character it doesn't understand - and Solr thinks you are trying to 
do a wildcard query.

have you tried querying solr directly (in your browser or using curl) for 
your arabic word?
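
a hedged example, reusing the URL-encoded word from the error above and
assuming Solr itself answers on the usual /solr/select path (adjust host, port
and field name to your setup):

  curl 'http://localhost:8080/solr/select?q=text:%D9%85%D8%AD%D9%85%D8%AF'

if that returns hits, the problem is in the servlet/front-end encoding rather
than in Solr itself.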


-Hoss



Re: index become bigger and the only way seems to add hardware, another way?

2007-10-09 Thread Otis Gospodnetic
Here are some ways:



Index less data, store fewer fields and less data, compress fields,
change Lucene's term index interval (default 128; increasing it
will make your index a little bit smaller, but will slow down
queries)... But in general, the bigger your index, the more hardware you'll
need.  I saw 1TB disks for ~$300 USD the other day.  You are in China
and this stuff is even cheaper there.
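
A hedged sketch of what a couple of those options look like in schema.xml
(field names are made up; the compressed flag only helps for large stored text
fields, and omitNorms saves one byte per document per field where you don't
need length/boost norms):

  <field name="body"     type="text"   indexed="true" stored="true"  compressed="true" />
  <field name="raw_html" type="text"   indexed="true" stored="false" />
  <field name="category" type="string" indexed="true" stored="true"  omitNorms="true" />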
 
Otis

--

Lucene - Solr - Nutch - Consulting -- http://sematext.com/




- Original Message 
From: James liu [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Tuesday, October 9, 2007 11:15:56 PM
Subject: index become bigger and the only way seems to add hardware, another 
way?


I just wanna know if there is anything that can decrease index size, other than
increasing hardware or optimizing lucene params.

-- 
regards
jl





Re: Cache Memory Usage (was: Facets and running out of Heap Space)

2007-10-09 Thread Mike Klaas

On 9-Oct-07, at 8:28 PM, Stu Hood wrote:


Sorry... where do the unique values come into the equation?


Faceting.  You should have a filterCache size > the # of unique values in all
fields faceted-on (using the filterCache method).


Also, you say that the queryResultCache memory usage is very low...  
how

could this be when it is storing the same information as the
filterCache, but with the addition of sorting?


Solr caches only the top N documents in the queryResultCache (boosted  
by queryResultWindowSize), which amounts to 40-odd ints, 40-odd  
floats, and change (a few hundred bytes per cached query).


-Mike


Re: solr tuple/tag store

2007-10-09 Thread Ryan McKinley


the most basic stuff, and copyField things around.  With SOLR-139, to 
rebuild an index you simply reconfigure the copyField settings and 
basically `touch` each document to reindex it.




had not thought of that... yes, that would work


Yonik has some pretty prescient design ideas here:

http://wiki.apache.org/solr/UserTagDesign



Yonik is quite clever!  this does not even involve bit operations.

In the example:
add to A10, field utag=erik#lucene   // or erik lucene, single token
add to A10, field user=erik  // via copyField
add to A10, field tag=lucene // via copyField

I take it 'user' needs a fieldType that would only keep the first part 
of what is passed in, and 'tag' would be a different type (or params) 
that only keeps the latter.
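
A rough sketch of what those two types could look like in schema.xml, assuming
a build that has solr.PatternReplaceFilterFactory (otherwise a small custom
TokenFilter does the same job); the names and patterns are only illustrative:

  <fieldType name="utag_user" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <!-- keep only what comes before the '#' -->
      <filter class="solr.PatternReplaceFilterFactory" pattern="#.*$" replacement="" replace="all"/>
    </analyzer>
  </fieldType>

  <fieldType name="utag_tag" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <!-- keep only what comes after the '#' -->
      <filter class="solr.PatternReplaceFilterFactory" pattern="^[^#]*#" replacement="" replace="all"/>
    </analyzer>
  </fieldType>

  <field name="user" type="utag_user" indexed="true" stored="false" multiValued="true"/>
  <field name="tag"  type="utag_tag"  indexed="true" stored="false" multiValued="true"/>
  <copyField source="utag" dest="user"/>
  <copyField source="utag" dest="tag"/>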


To add a 'name', I guess the best approach is to use a dynamic field:
 utag_name=erik#lucene
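
e.g. a hedged one-liner in schema.xml -- type and flags only illustrative:

  <dynamicField name="utag_*" type="string" indexed="true" stored="false" multiValued="true"/>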

I'll give this a try.

thanks
ryan