problems with SpellCheckComponent

2008-07-08 Thread Roberto Nieto
Hi,

I downloaded the trunk version today and I'm having problems with the
SpellCheckComponent. Is there any known bug?

This is my configuration:
#
<searchComponent name="spellcheck"
  class="org.apache.solr.handler.component.SpellCheckComponent">
  <lst name="defaults">
   <!-- omp = Only More Popular -->
   <str name="spellcheck.onlyMorePopular">false</str>
   <!-- exr = Extended Results -->
   <str name="spellcheck.extendedResults">false</str>
   <!-- The number of suggestions to return -->
   <str name="spellcheck.count">1</str>
  </lst>
  <str name="queryAnalyzerFieldType">text</str>

  <lst name="spellchecker">
   <str name="name">default</str>
   <str name="field">title</str>
   <str name="spellcheckIndexDir">spellchecker_defaultXX</str>
  </lst>
 </searchComponent>

 <queryConverter name="queryConverter"
  class="org.apache.solr.spelling.SpellingQueryConverter" />

 <requestHandler name="/spellCheckCompRH"
  class="org.apache.solr.handler.component.SearchHandler">
  <arr name="last-components">
   <str>spellcheck</str>
  </arr>
 </requestHandler>
##

SCHEMA.XML: ... <field name="title" type="text" indexed="true" stored="true" /> ...

When I make the request:
http://localhost:8080/solr/spellCheckCompRH?q=*:*&spellcheck.q=ruck&spellcheck=true

I get this exception:

HTTP Status 500 - null java.lang.NullPointerException
        at org.apache.solr.handler.component.SpellCheckComponent.getTokens(SpellCheckComponent.java:217)
        at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:184)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:156)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:128)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1025)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:263)
        at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:852)
        at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:584)
        at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1508)
        at java.lang.Thread.run(Unknown Source)

Any help would be very useful. Thanks for your attention.

Rober


RAM never deallocated...during search

2008-06-18 Thread Roberto Nieto
Hi users,

Some days ago I asked a question about RAM use during searches, but I couldn't
solve my problem with the ideas that some expert users gave me. After running
some tests I can ask a more specific question, hoping someone can help me.

My problem is that I need highlighting and I have quite big docs (txt files of
40MB). The conclusion of my tests is that if I set rows to 10, the content of
the first 10 results is cached. This is probably normal because it is likely
needed for highlighting, but this memory is never deallocated even though I
set Solr's caches to 0. As a result, memory grows until it is close to the
heap limit, then the GC starts to deallocate memory... but at that point the
searches are quite slow. Is this normal behavior? Can I configure some Solr
parameter to force the deallocation of results after each search? [I'm using
Solr 1.2]

Another thing I found is that although I comment out (in solrconfig.xml) all of
these options:
  filterCache, queryResultCache, documentCache, enableLazyFieldLoading,
useFilterForSortedQuery, boolTofilterOptimizer
the stats page still always shows caching:true. (A sketch of what I mean by
disabling the caches is below.)
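
For reference, a minimal sketch of what I mean by disabling the caches. This is
an illustration based on the stock Solr 1.2 solrconfig.xml, not my exact file;
the attribute values are assumptions:

  <!-- query section of solrconfig.xml: caches sized to 0 (illustrative sketch) -->
  <query>
    <!-- a size/initialSize of 0 effectively disables each cache -->
    <filterCache      class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>
    <documentCache    class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>

    <enableLazyFieldLoading>false</enableLazyFieldLoading>
    <useFilterForSortedQuery>false</useFilterForSortedQuery>
    <boolTofilterOptimizer enabled="false" cacheSize="0" threshold="0.05"/>
  </query>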

I'm probably missing something obvious, but I can't find it.

If anyone can help me... I'm quite desperate.


Rober.


Re: doubt with an index of 300gb

2008-06-15 Thread Roberto Nieto
Hi Otis,

Thanks a lot for your interest.

The main thing I can't understand very well is this: if I have, for example,
8 machines that will be searchers, why would they have a higher hardware cost
if I have one big index? If I have 10 smaller indexes I will need to search
over all of them, so... wouldn't that require the same hardware? I understand
that if I can search only a subset of the index it would be better to split
it, but what if I must search the entire index?

I can add new searcher machines, so I think my hardware constraint is the RAM;
is that right?

I'm probably missing something; sorry if my question has an obvious answer.




2008/6/15 Otis Gospodnetic [EMAIL PROTECTED]:

 Hi Roberto,

 SAN is a fine choice, if that's what you were worried about.  There is no
 way to tell exactly how fast your searches will be, as that depends on a lot
 of factors -- benchmarking with your own data and hardware and queries is
 the best way to go.

 As for the cost of multiple smaller machines and one large one (if that's
 what's needed) is that, I *think*, the price of hw goes up significantly
 when you start working with high-end hw, and that cost may be higher than
 the cost of N smaller servers combined.  That's the cost difference that I
 was trying to point out.  That's for your IT people to figure out after you
 tell them what type of hw you need and what the options are.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


 - Original Message 
  From: Roberto Nieto [EMAIL PROTECTED]
  To: solr-user@lucene.apache.org
   Sent: Saturday, June 14, 2008 5:05:54 PM
  Subject: Re: doubt with an index of 300gb
 
  Hi Otis,
 
  Thanks for your fast answer.
 
  I understand perfectly your points. I will explain my limitations ...
 
  --Multiple smaller indices you can split them across several servers, but
  you can't do that with a monolithic index.
  The index will be allocated in a SAN that is not under my election. I can
  decide to split the index or use a monolithic one but not the allocation
 
  --With multiple smaller indices you can choose to search only a subset of
  them, should that make sense for your app.
  --How much does it cost to have 1 server with a LOT of RAM that serving
 this
  index will need?  Maybe it's cheaper to have multiple smaller machines.
  This index will be an index public and i will always need to search in
 the
  entire index. I understand the problem of the RAM, but if I use multiple
  index and then i search in all of them i will use less RAM? The index
 will
  have 10 fields, all of them excepting the content will be small and I
 will
  only sort be score. If someone have any experience of how much ram i will
  need or something about the response times with this kind of index it
 would
  be very usefull for me.
 
  --How long does it take you to rebuild one big index, should it get
  corrupted vs. rebuilding only a subset of your data?
  This is a very important aspect, but my primary objective must be the
  response time. I thought about using different index with different solr
 but
  the problem is the mixture of results and how to sort them...so i think
 (but
  not sure) that using only one index it will be faster knowing that i will
  always need to search in the entire index.
 
 
  Any help or suggestion will be very usefull.
 
  Thank you very much for your attention
 
 
  2008/6/14 Otis Gospodnetic :
 
   Roberto,
  
   Here is some food for thought...
  
   Multiple smaller indices you can split them across several servers, but
 you
   can't do that with a monolithic index.
  
   With multiple smaller indices you can choose to search only a subset of
   them, should that make sense for your app.
   How much does it cost to have 1 server with a LOT of RAM that serving
 this
   index will need?  Maybe it's cheaper to have multiple smaller machines.
  
   How long does it take you to rebuild one big index, should it get
 corrupted
   vs. rebuilding only a subset of your data?
   How long does it take you to copy the index around the network after
 you
   optimize it vs. copying only a subset, or multiple subsets in parallel?
  
   etc.
  
   Otis --
   Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
  
   - Original Message 
From: Roberto Nieto
 To: solr-user@lucene.apache.org
Sent: Saturday, June 14, 2008 7:31:28 AM
Subject: doubt with an index of 300gb
   
Hi users,
   
I´m going to create a big index of 300gb in a SAN where i have 4TB. I
   read
many entries in the mail list talking about using multiple index with
multicore. I would like to know what kind of benefit can i have
using multiple index instead of one big index if i dont have problems
   with
the disk? I know that the optimizes and the commits would be faster
 with
smaller indexs, but in search? The RAM use would be the same using 10
indexes of 30gb than using 1 index of 300gb? Any suggestion or
 experience

Re: Memory problems when highlighting with a not-very-big index

2008-06-15 Thread Roberto Nieto
Hi Yonik,

I think you are right, it must be that.
If I activate highlighting on a field that I'm not specifying in fl, will it
use the same amount of RAM as if I returned it?
Internally, will it be as if I added it to fl?



2008/6/13 Yonik Seeley [EMAIL PROTECTED]:

 On Fri, Jun 13, 2008 at 3:30 PM, Roberto Nieto [EMAIL PROTECTED]
 wrote:
  The part that i can't understand very well is why if i desactivate
  highlighting the memory doesnt grows.
  It only uses doc cache if highlighting is used or if content retrieve is
  activated?

 Perhaps you are highlighting some fields that you normally don't
 return?  What is fl vs hl.fl?

 -Yonik
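
(For illustration, a hypothetical request showing the distinction Yonik is
asking about: fl controls which stored fields are returned, while hl.fl
controls which fields are highlighted. The field names below are made up.)

  http://localhost:8080/solr/select?q=ruck&fl=id,title&hl=true&hl.fl=content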



Re: doubt with an index of 300gb

2008-06-15 Thread Roberto Nieto
Hi Otis,

I think my questions were not very well formulated.

We have dedicated machines for parsing, 2 machines (active/passive) for
indexing, the index allocated on a SAN filesystem, and dedicated machines for
searching.
All of my questions come down to this: if I have an index of 300GB, I don't
know how much RAM I will need for searching it. I can't find documents
anywhere about memory use in Solr, and I'm a bit worried because I don't know
how much memory I will need to serve each search. I don't have many problems
with concurrent searches because I can add machines to a cluster.

I read about the filterCache, queryResultCache and documentCache, but if I
don't use those caches (set them to 0) I don't know how much memory Solr will
need (if any) to store the docSets, order them, etc., and serve a search.

If some document explains this, it would be very useful to me.



2008/6/15 Otis Gospodnetic [EMAIL PROTECTED]:

 Roberto,

 All I was trying to say that it *might* be cheaper to buy:

 10 smaller servers with 4 GB RAM each, for a total of 40 GB RAM
 than
 1 big server with 40 GB RAM and the CPU matching the CPU power of 10
 smaller servers

 Of course, there are other things to consider, too - power usage, hosting
 space, management, etc.
 There is no single answer, you'll have to evaluate pros and cons yourself.
  I simply wanted to point out various factors that you and your IT team will
 need to consider.


 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


 - Original Message 
  From: Roberto Nieto [EMAIL PROTECTED]
  To: solr-user@lucene.apache.org
  Sent: Sunday, June 15, 2008 8:38:15 AM
  Subject: Re: doubt with an index of 300gb
 
  Hi Otis,
 
  Thanks a lot for your interest.
 
  The main thing i cant understand very well is that if I have 8 maquines
 that
  will be searchers, for example, why they will have a higher cost of hw if
 I
  have one big index. If I have 10 smaller indexes I will need
  to search over all of them so...that won´t requiere the same hw? I
  understand that if i can search in a subset of the index it would be
 better
  to split the index but if i must search in the entire index?
 
  I can add new searcher maquines so i think that my hw problem is the ram,
  its that right?
 
  Probably i'm missing something, sorry if my question have an obvious
 answer.
 
 
 
 
  2008/6/15 Otis Gospodnetic :
 
   Hi Roberto,
  
   SAN is a fine choice, if that's what you were worried about.  There is
 no
   way to tell exactly how fast your searches will be, as that depends on
 a lot
   of factors -- benchmarking with your own data and hardware and queries
 is
   the best way to go.
  
   As for the cost of multiple smaller machines and one large one (if
 that's
   what's needed) is that, I *think*, the price of hw goes up
 significantly
   when you start working with high-end hw, and that cost may be higher
 than
   the cost of N smaller servers combined.  That's the cost difference
 that I
   was trying to point out.  That's for your IT people to figure out after
 you
   tell them what type of hw you need and what the options are.
  
   Otis
   --
   Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
  
   - Original Message 
From: Roberto Nieto
 To: solr-user@lucene.apache.org
 Sent: Saturday, June 14, 2008 5:05:54 PM
Subject: Re: doubt with an index of 300gb
   
Hi Otis,
   
Thanks for your fast answer.
   
I understand perfectly your points. I will explain my limitations ...
   
--Multiple smaller indices you can split them across several servers,
 but
you can't do that with a monolithic index.
The index will be allocated in a SAN that is not under my election. I
 can
decide to split the index or use a monolithic one but not the
 allocation
   
--With multiple smaller indices you can choose to search only a
 subset of
them, should that make sense for your app.
--How much does it cost to have 1 server with a LOT of RAM that
 serving
   this
index will need?  Maybe it's cheaper to have multiple smaller
 machines.
This index will be an index public and i will always need to search
 in
   the
entire index. I understand the problem of the RAM, but if I use
 multiple
index and then i search in all of them i will use less RAM? The index
   will
have 10 fields, all of them excepting the content will be small and I
   will
only sort be score. If someone have any experience of how much ram i
 will
need or something about the response times with this kind of index it
   would
be very usefull for me.
   
--How long does it take you to rebuild one big index, should it get
corrupted vs. rebuilding only a subset of your data?
This is a very important aspect, but my primary objective must be the
response time. I thought about using different index with different
 solr
   but
the problem is the mixture of results and how to sort them

doubt with an index of 300gb

2008-06-14 Thread Roberto Nieto
Hi users,

I'm going to create a big index of 300GB on a SAN where I have 4TB. I have
read many entries on the mailing list talking about using multiple indexes
with multicore. I would like to know what kind of benefit I can get from using
multiple indexes instead of one big index if I don't have problems with the
disk. I know that optimizes and commits would be faster with smaller indexes,
but what about search? Would RAM use be the same with 10 indexes of 30GB as
with 1 index of 300GB? Any suggestion or experience will be very useful to me.

Thanks in advance.

Rober.
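
(For reference, a minimal sketch of the multicore layout being asked about: a
Solr 1.3-style solr.xml with one core per sub-index. The core names and
directories below are placeholders, not a recommendation.)

  <!-- solr.xml: hypothetical layout with one core per 30GB sub-index -->
  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="part1" instanceDir="part1" />
      <core name="part2" instanceDir="part2" />
      <!-- ... one entry per sub-index, up to part10 ... -->
    </cores>
  </solr>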


Re: doubt with an index of 300gb

2008-06-14 Thread Roberto Nieto
Hi Otis,

Thanks for your fast answer.

I understand your points perfectly. I will explain my limitations ...

--Multiple smaller indices you can split them across several servers, but
you can't do that with a monolithic index.
The index will be allocated on a SAN that is not my choice. I can decide to
split the index or use a monolithic one, but not where it is stored.

--With multiple smaller indices you can choose to search only a subset of
them, should that make sense for your app.
--How much does it cost to have 1 server with a LOT of RAM that serving this
index will need?  Maybe it's cheaper to have multiple smaller machines.
This index will be public and I will always need to search the entire index.
I understand the RAM problem, but if I use multiple indexes and then search in
all of them, will I use less RAM? The index will have 10 fields; all of them
except the content will be small, and I will only sort by score. If anyone has
any experience of how much RAM I will need, or anything about the response
times with this kind of index, it would be very useful to me.

--How long does it take you to rebuild one big index, should it get
corrupted vs. rebuilding only a subset of your data?
This is a very important aspect, but my primary objective must be response
time. I thought about using different indexes with different Solr instances,
but the problem is how to merge the results and sort them... so I think (but
am not sure) that using only one index will be faster, given that I will
always need to search the entire index.


Any help or suggestion will be very useful.

Thank you very much for your attention.


2008/6/14 Otis Gospodnetic [EMAIL PROTECTED]:

 Roberto,

 Here is some food for thought...

 Multiple smaller indices you can split them across several servers, but you
 can't do that with a monolithic index.

 With multiple smaller indices you can choose to search only a subset of
 them, should that make sense for your app.
 How much does it cost to have 1 server with a LOT of RAM that serving this
 index will need?  Maybe it's cheaper to have multiple smaller machines.

 How long does it take you to rebuild one big index, should it get corrupted
 vs. rebuilding only a subset of your data?
 How long does it take you to copy the index around the network after you
 optimize it vs. copying only a subset, or multiple subsets in parallel?

 etc.

 Otis --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


 - Original Message 
  From: Roberto Nieto [EMAIL PROTECTED]
  To: solr-user@lucene.apache.org
  Sent: Saturday, June 14, 2008 7:31:28 AM
  Subject: doubt with an index of 300gb
 
  Hi users,
 
  I´m going to create a big index of 300gb in a SAN where i have 4TB. I
 read
  many entries in the mail list talking about using multiple index with
  multicore. I would like to know what kind of benefit can i have
  using multiple index instead of one big index if i dont have problems
 with
  the disk? I know that the optimizes and the commits would be faster with
  smaller indexs, but in search? The RAM use would be the same using 10
  indexes of 30gb than using 1 index of 300gb? Any suggestion or experience
  will be very usefull for me.
 
  Thanks in advance.
 
  Rober.




Memory problems when highlighting with a not-very-big index

2008-06-13 Thread Roberto Nieto
Hi users/developers,

I'm new to Solr and I have been reading the list for a few hours, but I
didn't find anything that solves my doubt.
I'm using a 5GB index on a 2GB RAM machine, and I'm trying to optimize the
Solr configuration for searching. I get good search times, but when I activate
highlighting the RAM usage grows a lot; it grows as much as if I wanted to
retrieve the content of the files found. I'm not sure whether, for
highlighting, Solr needs to load the full content of the resulting documents
in order to highlight them. How does it work? Is it possible to load only the
first 10 results, build snippets for only those results, and use less memory?

Thanks in advance.

Rober.


Re: Memory problems when highlighting with a not-very-big index

2008-06-13 Thread Roberto Nieto
Thanks for your fast answer,

I think I tried setting the default size to 0 and the problem persisted, but
I will try it again on Monday.
The part that I can't understand very well is why, if I deactivate
highlighting, the memory doesn't grow.
Does it only use the document cache if highlighting is used or if content
retrieval is enabled?

Thnx

Rober.

2008/6/13 Yonik Seeley [EMAIL PROTECTED]:

 On Fri, Jun 13, 2008 at 1:07 PM, Roberto Nieto [EMAIL PROTECTED]
 wrote:
  It´s possible to only
  allocate the 10 first results to make the snippet of only those results
 and
  use less memory?

 That's how it currently works.

 But there is a Document cache to make things more efficient.
 If you have large documents, you might want to decrease this from it's
 default size (see solrconfig.xml) which is currently 512.  Perhaps
 move it down to 60 (which would allow for 6 concurrent requests of 10
 docs each w/o re-fetching the doc between highlighting and response
 writing).

 -Yonik
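
(For anyone finding this thread later: Yonik's suggestion corresponds roughly
to the following documentCache entry in solrconfig.xml. The size of 60 comes
from his example above; the rest mirrors the stock configuration and is an
assumption, not a tested setting.)

  <!-- documentCache sized down for large documents: ~6 concurrent requests
       of 10 rows each without re-fetching docs between highlighting and
       response writing -->
  <documentCache
    class="solr.LRUCache"
    size="60"
    initialSize="60"
    autowarmCount="0"/>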