Re: Greater-than and less-than in data import SQL queries

2009-11-02 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Mon, Nov 2, 2009 at 11:34 AM, Amit Nithian anith...@gmail.com wrote:
 A thought I had on this from a DIH design perspective. Would it be better to
 have the SQL queries stored in an element rather than an attribute, so that
 you can wrap them in a CDATA block without having to mess up the look of the
 query with &lt; and &gt;? It makes debugging easier (I know find-and-replace is
 trivial, but it can be annoying when debugging SQL issues :-)).

Actually most of the parsers are forgiving in this aspect. I mean '<'
and '>' are OK in the XML parser shipped with the JDK.
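
For reference, the escaped-attribute form being discussed looks like this (a
minimal sketch reusing Andrew's entity from below, not a full data-config):

   <entity name="higher_node" dataSource="database"
           query="select *, title as keywords from cathnode_text where node_depth &lt; 4"/>

&lt; keeps the attribute value legal XML; a bare > does not strictly need
escaping in attribute values, which is why the greater-than version works
unescaped.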


 On Wed, Oct 28, 2009 at 5:15 PM, Lance Norskog goks...@gmail.com wrote:

 It is easier to put SQL select statements in a view, and just use that
 view from the DIH configuration file.

 On Tue, Oct 27, 2009 at 12:30 PM, Andrew Clegg andrew.cl...@gmail.com
 wrote:
 
 
  Heh, eventually I decided
 
   where 4 > node_depth
 
  was the most pleasing (if slightly WTF-ish) way of writing it...
 
  Cheers,
 
  Andrew.
 
 
  Erik Hatcher-4 wrote:
 
   Use &lt; instead of < in that attribute.  That should fix the issue.
   Remember, it's an XML file, so it has to obey XML encoding rules, which
   make it ugly, but whatcha gonna do?
 
        Erik
 
  On Oct 27, 2009, at 11:50 AM, Andrew Clegg wrote:
 
 
  Hi,
 
  If I have a DataImportHandler query with a greater-than sign in,
  like this:
 
          <entity name="higher_node" dataSource="database"
   query="select *,
   title as keywords from cathnode_text where node_depth > 4">
 
  Everything's fine. However, if it contains a less-than sign:
 
          <entity name="higher_node" dataSource="database"
   query="select *,
   title as keywords from cathnode_text where node_depth < 4">
 
  I get this exception:
 
   INFO: Processing configuration from solrconfig.xml:
   {config=dataconfig.xml}
   [Fatal Error] :240:129: The value of attribute "query" associated with an
   element type "null" must not contain the '<' character.
  27-Oct-2009 15:30:49
  org.apache.solr.handler.dataimport.DataImportHandler
  inform
  SEVERE: Exception while loading DataImporter
   org.apache.solr.handler.dataimport.DataImportHandlerException: Exception occurred while initializing context
          at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:184)
          at org.apache.solr.handler.dataimport.DataImporter.<init>(DataImporter.java:101)
          at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:113)
          at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:424)
          at org.apache.solr.core.SolrCore.<init>(SolrCore.java:588)
          at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
          at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
          at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
          at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
          at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:108)
          at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3709)
          at org.apache.catalina.core.StandardContext.start(StandardContext.java:4356)
          at org.apache.catalina.manager.ManagerServlet.start(ManagerServlet.java:1244)
          at org.apache.catalina.manager.HTMLManagerServlet.start(HTMLManagerServlet.java:604)
          at org.apache.catalina.manager.HTMLManagerServlet.doGet(HTMLManagerServlet.java:129)
          at javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
          at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
          at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
          at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
          at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
          at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
          at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:525)
          at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:568)
          at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
          at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
          at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
          at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
          at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
          at

Problems downloading lucene 2.9.1

2009-11-02 Thread Licinio Fernández Maurelo
Hi folks,

as we are using a snapshot dependency on solr 1.4, today we are getting
problems when maven tries to download lucene 2.9.1 (there isn't any 2.9.1
there).

Which repository can I use to download it?

Thx

-- 
Lici


RE: CPU utilization and query time high on Solr slave when snapshot install

2009-11-02 Thread biku...@sapient.com
Hi Solr Gurus,

We have solr in a 1 master, 2 slave configuration. A snapshot is created post
commit and post optimization. We have autocommit after 50 documents or 5 minutes.
The snapshot puller runs as a cron every 10 minutes. What we have observed is that
whenever a snapshot is installed on a slave, the solrj client used to query the
slave solr gets timed out and there is high CPU usage/load avg. on the slave
server. If we stop the snapshot puller, then the slaves work with no issues. The
system has been running for 2 months and this issue has started to occur only now
when load on the website is increasing.

Following are some details:

Solr Details:
apache-solr Version: 1.3.0
Lucene - 2.4-dev

Master/Slave configurations:

Master:
- for indexing, data HTTP requests are made to the Solr server.
- the autocommit feature is enabled for 50 docs and 5 minutes
- caching params are disabled for this server
- a mergeFactor of 10 is set
- we were running the optimize script every 2 hours, but have now reduced that to
twice a day; the issue still persists

Slave1/Slave2:
- standard requestHandler is being used
- default values of caching are set
Machine Specifications:

Master:
- 4GB RAM
- 1GB JVM Heap memory is allocated to Solr

Slave1/Slave2:
- 4GB RAM
- 2GB JVM Heap memory is allocated to Solr

Master and Slave1 (solr1)are on single box and Slave2(solr2) on different box. 
We use HAProxy to load balance query requests between 2 slaves. Master is only 
used for indexing.
Please let us know if somebody has ever faced a similar kind of issue or has some
insight into it, as we are literally stuck at the moment with a very
unstable production environment.

As a workaround, we have started running optimize on the master every 7 minutes.
This seems to have reduced the severity of the problem, but the issue still occurs
every 2 days now. Please suggest what could be the root cause of this.

Thanks,
Bipul






Re: Indexing multiple entities

2009-11-02 Thread Chantal Ackermann

I'm using a code generator for my entities, and I cannot modify the generation.
I need to work out another option :(


shouldn't code generators help development and not make it more complex 
and difficult? oO


(sry off topic)

chantal


Re: StreamingUpdateSolrServer - indexing process stops in a couple of hours

2009-11-02 Thread Shalin Shekhar Mangar
I'm able to reproduce this issue consistently using JDK 1.6.0_16

After an optimize is called, only one thread keeps adding documents and the
rest wait on StreamingUpdateSolrServer line 196.

On Sun, Oct 25, 2009 at 8:03 AM, Dadasheva, Olga olga_dadash...@harvard.edu
 wrote:

 I am using java 1.6.0_05

 To illustrate what is happening I wrote this test program that has 10
 threads adding a collection of documents and one thread optimizing the index
 every 10 sec.

 I am seeing that after the first optimize there is only one thread that
 keeps adding documents. The other ones are locked.

 In the real code I ended up adding synchronized around add on optimize to
 avoid this.

 public static void main(String[] args) {

    final JettySolrRunner jetty = new JettySolrRunner("/solr", 8983);
    try {
        jetty.start();
        // setup the server...
        String url = "http://localhost:8983/solr";
        final StreamingUpdateSolrServer server =
                new StreamingUpdateSolrServer(url, 2, 5) {
            @Override
            public void handleError(Throwable ex) {
                // do something...
            }
        };
        server.setConnectionTimeout(1000);
        server.setDefaultMaxConnectionsPerHost(100);
        server.setMaxTotalConnections(100);
        int i = 0;
        while (i++ < 10) {
            new Thread("add-thread" + i) {
                public void run() {
                    int j = 0;
                    while (true) {
                        try {
                            List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
                            for (int n = 0; n < 50; n++) {
                                SolrInputDocument doc = new SolrInputDocument();
                                String docID = this.getName() + "_doc_" + j++;
                                doc.addField("id", docID);
                                doc.addField("content", "document_" + docID);
                                docs.add(doc);
                            }
                            server.add(docs);
                            System.out.println(this.getName() + " added " + docs.size() + " documents");
                            Thread.sleep(100);
                        } catch (Exception e) {
                            e.printStackTrace();
                            System.err.println(this.getName() + " " + e.getLocalizedMessage());
                            System.exit(0);
                        }
                    }
                }
            }.start();
        }

        new Thread("optimizer-thread") {
            public void run() {
                while (true) {
                    try {
                        Thread.sleep(1);
                        server.optimize();
                        System.out.println(this.getName() + " optimized");
                    } catch (Exception e) {
                        e.printStackTrace();
                        System.err.println("optimizer " + e.getLocalizedMessage());
                        System.exit(0);
                    }
                }
            }
        }.start();


    } catch (Exception e) {
        e.printStackTrace();
    }

 }
 -Original Message-
 From: Lance Norskog [mailto:goks...@gmail.com]
 Sent: Tuesday, October 13, 2009 8:59 PM
 To: solr-user@lucene.apache.org
 Subject: Re: StreamingUpdateSolrServer - indexing process stops in a couple
 of hours

 Which Java release is this?  There are known thread-blocking problems in
 Java 1.5.

 Also, what sockets are used during this time? Try 'netstat -s | fgrep 8983'
 (or your Solr URL port #) and watch the active, TIME_WAIT, CLOSE_WAIT
 sockets build up. This may give a hint.

 On Tue, Oct 13, 2009 at 8:47 AM, Dadasheva, Olga 
 olga_dadash...@harvard.edu wrote:
  Hi,
 
  I am indexing documents using StreamingUpdateSolrServer. My 'setup'
  code is almost a copy of the junit test of the Solr trunk.
 
 try {
     StreamingUpdateSolrServer streamingServer = new StreamingUpdateSolrServer( url, 2, 5 ) {
         @Override
         public void handleError(Throwable ex) {
             System.out.println( "new StreamingUpdateSolrServer error " + ex);
 

Lock problems: Lock obtain timed out

2009-11-02 Thread Jérôme Etévé
Hi,

  I've got a few machines who post documents concurrently to a solr
instance. They do not issue the commit themselves, instead, I've got
autocommit set up at solr server side:
   <autoCommit>
     <maxDocs>5</maxDocs> <!-- commit at least every 5 docs -->
     <maxTime>6</maxTime> <!-- Stays max 60s without commit -->
   </autoCommit>

This usually works fine, but sometimes the server goes into a deadlock
state. Here are the errors I get from the log (these go on forever
until I delete the index and restart everything from zero):

02-Nov-2009 10:35:27 org.apache.solr.update.SolrIndexWriter finalize
SEVERE: SolrIndexWriter was not closed prior to finalize(), indicates
a bug -- POSSIBLE RESOURCE LEAK!!!
...
[ multiple messages like this ]
...
02-Nov-2009 10:35:27 org.apache.solr.common.SolrException log
SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain
timed out: 
NativeFSLock@/home/solrdata/jobs/index/lucene-703db99881e56205cb910a2e5fd816d3-write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:85)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1538)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1395)
at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:190)
at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:98)
at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:220)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)


I'm wondering what could be the reason for this (if a commit takes
more than 60 seconds, for instance?), and whether I should use better
locking or autocommitting options.

Here's the locking conf I've got at the moment:
   <writeLockTimeout>1000</writeLockTimeout>
   <commitLockTimeout>1</commitLockTimeout>
   <lockType>native</lockType>

I'm using solr trunk from 12 oct 2009 within tomcat.

Thanks for any help.

Jerome.

-- 
Jerome Eteve.
http://www.eteve.net
jer...@eteve.net


Re: Spell check suggestion and correct way of implementation and some Questions

2009-11-02 Thread Shalin Shekhar Mangar
On Wed, Oct 28, 2009 at 8:57 PM, darniz rnizamud...@edmunds.com wrote:


 Question: should I build the dictionary only once, and after that, as new
 words are indexed, will the dictionary be updated? Or do I have to do that
 manually at certain intervals?


No. The dictionary is built only when spellcheck.build=true is specified as
a request parameter. You will need to explicitly send spellcheck.build=true
again when the document changes or you can use the buildOnCommit or
buildOnOptimize parameters to re-build the index automatically.

http://wiki.apache.org/solr/SpellCheckComponent#Building_on_Commits
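
For example, buildOnCommit goes on the spellchecker definition itself; a minimal
sketch (the dictionary name matches the one used below, the source field "spell"
is just a placeholder):

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">mySpellChecker</str>
      <str name="field">spell</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>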



 add the spellcheck component to the handler, in my case as of now the standard
 request handler. I might also start adding some more dismax handlers
 depending on my requirements.
  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <!--
      <int name="rows">10</int>
      <str name="fl">*</str>
      <str name="version">2.1</str>
      -->
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>

 run the query with parameter spell.check=true, and also specify against
 which dictionary you want to run the spell check; in my case my
 spellcheck.dictionary parameter is mySpellChecker.


The parameter is spellcheck=true not spell.check=true. If you do not give a
name to your dictionary then you do not need to add the
spellcheck.dictionary parameter.

-- 
Regards,
Shalin Shekhar Mangar.


tracking solr response time

2009-11-02 Thread bharath venkatesh
Hi,

     We are using solr for many of our products and it is doing quite well.
But since the number of hits is becoming high, we are experiencing latency
in certain requests; about 15% of our requests are suffering a latency.
We are trying to identify the problem. It may be due to a network
issue, or the solr server is taking time to process the request. Other
than qtime, which is returned along with the response, is there any
other way to track the solr server's performance? How is qtime calculated;
is it the total time from when the solr server got the request till it
gave the response? Can we do some extra logging to track the solr server's
performance? Ideally I would want to pass some log id along with the
request (query) to the solr server, and the solr server must log the
response time along with that log id.

Thanks in advance ..
Bharath


Re: Problems downloading lucene 2.9.1

2009-11-02 Thread Grant Ingersoll


On Nov 2, 2009, at 12:12 AM, Licinio Fernández Maurelo wrote:


Hi folks,

as we are using a snapshot dependency on solr 1.4, today we are getting
problems when maven tries to download lucene 2.9.1 (there isn't any 2.9.1
there).

Which repository can i use to download it?


They won't be there until 2.9.1 is officially released.  We are trying  
to speed up the Solr release by piggybacking on the Lucene release,  
but this little bit is the one downside.



-Grant

NullPointerException with TermVectorComponent

2009-11-02 Thread Andrew Clegg

Hi,

I've recently added the TermVectorComponent as a separate handler, following
the example in the supplied config file, i.e.:

  <searchComponent name="tvComponent"
    class="org.apache.solr.handler.component.TermVectorComponent"/>

  <requestHandler name="/tvrh"
    class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
      <bool name="tv">true</bool>
    </lst>
    <arr name="last-components">
      <str>tvComponent</str>
    </arr>
  </requestHandler>

It works, but with one quirk. When you use tv.all=true, you get the tf*idf
scores in the output just fine (along with tf and df). But if you use
tv.tf_idf=true you get an NPE:

http://server:8080/solr/tvrh/?q=1cuk&version=2.2&indent=on&tv.tf_idf=true

HTTP Status 500 - null java.lang.NullPointerException
at org.apache.solr.handler.component.TermVectorComponent$TVMapper.getDocFreq(TermVectorComponent.java:253)
at org.apache.solr.handler.component.TermVectorComponent$TVMapper.map(TermVectorComponent.java:245)
at org.apache.lucene.index.TermVectorsReader.readTermVector(TermVectorsReader.java:522)
at org.apache.lucene.index.TermVectorsReader.readTermVectors(TermVectorsReader.java:401)
at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:378)
at org.apache.lucene.index.SegmentReader.getTermFreqVector(SegmentReader.java:1253)
at org.apache.lucene.index.DirectoryReader.getTermFreqVector(DirectoryReader.java:474)
at org.apache.solr.search.SolrIndexReader.getTermFreqVector(SolrIndexReader.java:244)
at org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:125)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at 
(etc.)

Is this a bug, or am I doing it wrong?

Cheers,

Andrew.

-- 
View this message in context: 
http://old.nabble.com/NullPointerException-with-TermVectorComponent-tp26156903p26156903.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: tracking solr response time

2009-11-02 Thread Yonik Seeley
On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh
bharathv6.proj...@gmail.com wrote:
    We are using solr for many of ur products  it is doing quite well
 .  But since no of hits are becoming high we are experiencing latency
 in certain requests ,about 15% of our requests are suffering a latency

How much of a latency compared to normal, and what version of Solr are
you using?

  . We are trying to identify  the problem .  It may be due to  network
 issue or solr server is taking time to process the request  .   other
 than  qtime which is returned along with the response is there any
 other way to track solr servers performance ?
 how is qtime calculated
 , is it the total time from when solr server got the request till it
 gave the response ?

QTime is the time spent in generating the in-memory representation for
the response before the response writer starts streaming it back in
whatever format was requested.  The stored fields of returned
documents are also loaded at this point (to enable handling of huge
response lists w/o storing all in memory).

There are normally servlet container logs that can be configured to
spit out the real total request time.
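
If you happen to be running under Tomcat, for instance, an access-log valve with
%D in its pattern logs the time each request took in milliseconds; a minimal
sketch for server.xml (file names and pattern details are just an example, not
from this thread):

  <Valve className="org.apache.catalina.valves.AccessLogValve"
         directory="logs" prefix="solr_access." suffix=".log"
         pattern="%h %t %r %s %b %D"/>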

 can we do some extra logging to track solr servers
 performance . ideally I would want to pass some log id along with the
 request (query ) to  solr server  and solr server must log the
 response time along with that log id .

Yep - Solr isn't bothered by params it doesn't know about, so just put
logid=xxx and it should also be logged with the other request
params.
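
For example (the logid value here is just a made-up illustration):

http://localhost:8983/solr/select?q=monitor&wt=json&logid=req-12345

The extra parameter should then show up in Solr's request log line next to the
usual params and QTime.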

-Yonik
http://www.lucidimagination.com


Re: tracking solr response time

2009-11-02 Thread Israel Ekpo
On Mon, Nov 2, 2009 at 8:41 AM, Yonik Seeley yo...@lucidimagination.comwrote:

 On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh
 bharathv6.proj...@gmail.com wrote:
 We are using solr for many of ur products  it is doing quite well
  .  But since no of hits are becoming high we are experiencing latency
  in certain requests ,about 15% of our requests are suffering a latency

 How much of a latency compared to normal, and what version of Solr are
 you using?

   . We are trying to identify  the problem .  It may be due to  network
  issue or solr server is taking time to process the request  .   other
  than  qtime which is returned along with the response is there any
  other way to track solr servers performance ?
  how is qtime calculated
  , is it the total time from when solr server got the request till it
  gave the response ?

 QTime is the time spent in generating the in-memory representation for
 the response before the response writer starts streaming it back in
 whatever format was requested.  The stored fields of returned
 documents are also loaded at this point (to enable handling of huge
 response lists w/o storing all in memory).

 There are normally servlet container logs that can be configured to
 spit out the real total request time.

  can we do some extra logging to track solr servers
  performance . ideally I would want to pass some log id along with the
  request (query ) to  solr server  and solr server must log the
  response time along with that log id .

 Yep - Solr isn't bothered by params it doesn't know about, so just put
 logid=xxx and it should also be logged with the other request
 params.

 -Yonik
 http://www.lucidimagination.com




If you are not using Java then you may have to track the elapsed time
manually.

If you are using the SolrJ Java client you may have the following options:

There is a method called getElapsedTime() in
org.apache.solr.client.solrj.response.SolrResponseBase which is available to
all the subclasses

I have not used it personally but I think this should return the time spent
on the client side for that request.

The QTime is not the time on the client side but the time spent internally
at the Solr server to process the request.

http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/SolrResponseBase.html

http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/QueryResponse.html

Most likely it is a result of an internal network issue between the
two servers, or the Solr server is competing with other applications for
resources.

What operating system is the Solr server running on? Is your client
application connecting to a Solr server on the same network or over the
internet? Are there other applications like database servers etc. running on
the same machine? If so, then the DB server (or any other application) and
the Solr server could be competing for resources like CPU, memory etc.

If you are using Tomcat, you can take a look in
$CATALINA_HOME/logs/catalina.out, there are timestamps there that can also
guide you.

-- 
Good Enough is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.


Re: tracking solr response time

2009-11-02 Thread Grant Ingersoll


On Nov 2, 2009, at 5:41 AM, Yonik Seeley wrote:


QTime is the time spent in generating the in-memory representation for
the response before the response writer starts streaming it back in
whatever format was requested.  The stored fields of returned
documents are also loaded at this point (to enable handling of huge
response lists w/o storing all in memory).

There are normally servlet container logs that can be configured to
spit out the real total request time.


It might be nice to add a flag to DebugComponent to spit out timings  
only.  Thus, one could skip the explains, etc. and just see the  
timings.  Seems like that would have pretty low overhead and still see  
the timings.

Re: NullPointerException with TermVectorComponent

2009-11-02 Thread david.stu...@progressivealliance.co.uk
I think it might be to do with the library itself

I downloaded semanticvectors-1.22 and compiled it from source. Then I created a demo
corpus by running
java org.apache.lucene.demo.IndexFiles against the lucene src directory.
I then ran java pitt.search.semanticvectors.BuildIndex against the index and
got the following:

Seedlength = 10
Dimension = 200
Minimum frequency = 0
Number non-alphabet characters = 0
Contents fields are: [contents]
Creating semantic term vectors ...
Populating basic sparse doc vector store, number of vectors: 774
Creating store of sparse vectors  ...
Created 774 sparse random vectors.
Creating term vectors ...
There are 36881 terms (and 774 docs)
0 ... 1000 ... 2000 ... 3000 ... 4000 ... Exception in thread "main"
java.lang.NullPointerException
    at org.apache.lucene.index.DirectoryReader$MultiTermDocs.freq(DirectoryReader.java:1068)
    at pitt.search.semanticvectors.LuceneUtils.getGlobalTermFreq(LuceneUtils.java:70)
    at pitt.search.semanticvectors.LuceneUtils.termFilter(LuceneUtils.java:187)
    at pitt.search.semanticvectors.TermVectorsFromLucene.<init>(TermVectorsFromLucene.java:163)
    at pitt.search.semanticvectors.BuildIndex.main(BuildIndex.java:138)
I am still digging, but when you look at the source code it references lucene
calls dating back to lucene 2.4, a lot of which are deprecated; it might need some
refreshing.

Cheers,

Dave

 
On 02 November 2009 at 14:40 Andrew Clegg andrew.cl...@gmail.com wrote:

 
 Hi,
 
 I've recently added the TermVectorComponent as a separate handler, following
 the example in the supplied config file, i.e.:
 
   <searchComponent name="tvComponent"
     class="org.apache.solr.handler.component.TermVectorComponent"/>
 
   <requestHandler name="/tvrh"
     class="org.apache.solr.handler.component.SearchHandler">
           <lst name="defaults">
                   <bool name="tv">true</bool>
           </lst>
           <arr name="last-components">
                   <str>tvComponent</str>
           </arr>
   </requestHandler>
 
  It works, but with one quirk. When you use tv.all=true, you get the tf*idf
  scores in the output just fine (along with tf and df). But if you use
  tv.tf_idf=true you get an NPE:
 
  http://server:8080/solr/tvrh/?q=1cuk&version=2.2&indent=on&tv.tf_idf=true
 
  HTTP Status 500 - null java.lang.NullPointerException
  at org.apache.solr.handler.component.TermVectorComponent$TVMapper.getDocFreq(TermVectorComponent.java:253)
  at org.apache.solr.handler.component.TermVectorComponent$TVMapper.map(TermVectorComponent.java:245)
  at org.apache.lucene.index.TermVectorsReader.readTermVector(TermVectorsReader.java:522)
  at org.apache.lucene.index.TermVectorsReader.readTermVectors(TermVectorsReader.java:401)
  at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:378)
  at org.apache.lucene.index.SegmentReader.getTermFreqVector(SegmentReader.java:1253)
  at org.apache.lucene.index.DirectoryReader.getTermFreqVector(DirectoryReader.java:474)
  at org.apache.solr.search.SolrIndexReader.getTermFreqVector(SolrIndexReader.java:244)
  at org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:125)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
  at 
  (etc.)
 
 Is this a bug, or am I doing it wrong?
 
 Cheers,
 
 Andrew.
 
 -- 
 View this message in context:
 http://old.nabble.com/NullPointerException-with-TermVectorComponent-tp26156903p26156903.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Problems downloading lucene 2.9.1

2009-11-02 Thread Ryan McKinley


On Nov 2, 2009, at 8:29 AM, Grant Ingersoll wrote:



On Nov 2, 2009, at 12:12 AM, Licinio Fernández Maurelo wrote:


Hi folks,

as we are using a snapshot dependency on solr 1.4, today we are getting
problems when maven tries to download lucene 2.9.1 (there isn't any 2.9.1
there).

Which repository can i use to download it?


They won't be there until 2.9.1 is officially released.  We are  
trying to speed up the Solr release by piggybacking on the Lucene  
release, but this little bit is the one downside.


Until then, you can add a repo to:

http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/maven/
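
To use it from Maven, that staging area can be declared as an extra repository in
the pom; a minimal sketch (the repository id is arbitrary):

  <repositories>
    <repository>
      <id>lucene-2.9.1-staging</id>
      <url>http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/maven/</url>
    </repository>
  </repositories>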




Re: adding and updating a lot of document to Solr, metadata extraction etc

2009-11-02 Thread Alexey Serba
Hi Eugene,

 - ability to iterate over all documents, returned in search, as Lucene does
  provide within a HitCollector instance. We would need to extract and
  aggregate various fields, stored in index, to group results and aggregate 
 them
  in some way.
 
 Also I did not find any way in the tutorial to access the search results with
 all fields to be processed by our application.

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr
Check out Faceted Search, probably you can achieve your goal by using
Facet Component
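
As a rough illustration of the Facet Component (the field names here are
placeholders, not from your schema), a request like

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=category&facet.field=author

returns counts aggregated per value of each facet.field without returning the
documents themselves.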

There's also Field Collapsing patch
http://wiki.apache.org/solr/FieldCollapsing


Alex


RE: Solr YUI autocomplete

2009-11-02 Thread Ankit Bhatnagar


Hey Amit,

My index (i.e. Solr) was on a different domain, so I can't use XHR (as XHR does
not work with cross-domain proxyless data fetches).

I tried using YUI's DS_ScriptNode but it didn't work.

I completed my task by using jQuery and it worked well with solr.

-Ankit

-Original Message-
From: Amit Nithian [mailto:anith...@gmail.com] 
Sent: Monday, November 02, 2009 1:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr YUI autocomplete

I've used the YUI auto complete (albeit not with Solr which shouldn't matter
here) and it should work with JSON. I did one that simply made XHR calls
over to a method on my server which returned pipe delimited text which
worked fine.

Are you using the XHR data source, and if so, what type are you telling it to
expect? One of the examples on the YUI site is text based and I'm sure you
can specify TYPE_JSON or JS_ARRAY too.

- Amit

On Fri, Oct 30, 2009 at 7:04 AM, Ankit Bhatnagar abhatna...@vantage.comwrote:


 Does Solr supports JSONP (JSON with Padding) in the response?

 -Ankit



 -Original Message-
 From: Ankit Bhatnagar [mailto:abhatna...@vantage.com]
 Sent: Friday, October 30, 2009 10:27 AM
 To: 'solr-user@lucene.apache.org'
 Subject: Solr YUI autocomplete

 Hi Guys,

 I have a question regarding - how to specify the

 I am using YUI autocomplete widget and it expects the JSONP response.


 http://localhost:8983/solr/select/?q=monitor&version=2.2&start=0&rows=10&indent=on&wt=json&json.wrf=

 I am not sure how I should specify the json.wrf=function

 Thanks
 Ankit



question about collapse.type = adjacent

2009-11-02 Thread michael8

Hi,

I would like to confirm if 'adjacent' in collapse.type means the documents
(with the same collapse field value) are considered adjacent *after* the
'sort' param from the query has been applied, or *before*?  I would think it
would be *after* since collapse feature primarily is meant for presentation
use.

Thanks,
Michael
-- 
View this message in context: 
http://old.nabble.com/question-about-collapse.type-%3D-adjacent-tp26157114p26157114.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: tracking solr response time

2009-11-02 Thread bharath venkatesh
Thanks for the quick response
@yonik

How much of a latency compared to normal, and what version of Solr are
you using?

latency is usually around 2-4 secs (sometimes it goes higher than that),
which happens to only 15-20% of the requests; the other 80-85% of
requests are very fast, in the milliseconds (around 200,000 requests
happen every day).

@Israel  we are not using the java client; we are using python at the
client, with the response formatted in json.

@yonik @Israel   does qtime measure the total time taken at the solr
server? I am already measuring the time to get the response at the
client end. I would want a means to know how much time the solr
server is taking to respond (process) once it gets the request, so
that I could identify whether it is a solr server issue or an internal
network issue.


@Israel  we are using rhel server 5 on both client and server. We
have 6 solr servers; one is acting as master. Both the client and the solr
servers are on the same network. Those servers are dedicated solr
servers except 2 servers which have a DB and memcache running; we have
adjusted the load accordingly.







On 11/2/09, Israel Ekpo israele...@gmail.com wrote:
 On Mon, Nov 2, 2009 at 8:41 AM, Yonik Seeley
 yo...@lucidimagination.comwrote:

 On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh
 bharathv6.proj...@gmail.com wrote:
 We are using solr for many of ur products  it is doing quite well
  .  But since no of hits are becoming high we are experiencing latency
  in certain requests ,about 15% of our requests are suffering a latency

 How much of a latency compared to normal, and what version of Solr are
 you using?

   . We are trying to identify  the problem .  It may be due to  network
  issue or solr server is taking time to process the request  .   other
  than  qtime which is returned along with the response is there any
  other way to track solr servers performance ?
  how is qtime calculated
  , is it the total time from when solr server got the request till it
  gave the response ?

 QTime is the time spent in generating the in-memory representation for
 the response before the response writer starts streaming it back in
 whatever format was requested.  The stored fields of returned
 documents are also loaded at this point (to enable handling of huge
 response lists w/o storing all in memory).

 There are normally servlet container logs that can be configured to
 spit out the real total request time.

  can we do some extra logging to track solr servers
  performance . ideally I would want to pass some log id along with the
  request (query ) to  solr server  and solr server must log the
  response time along with that log id .

 Yep - Solr isn't bothered by params it doesn't know about, so just put
 logid=xxx and it should also be logged with the other request
 params.

 -Yonik
 http://www.lucidimagination.com




 If you are not using Java then you may have to track the elapsed time
 manually.

 If you are using the SolrJ Java client you may have the following options:

 There is a method called getElapsedTime() in
 org.apache.solr.client.solrj.response.SolrResponseBase which is available to
 all the subclasses

 I have not used it personally but I think this should return the time spent
 on the client side for that request.

 The QTime is not the time on the client side but the time spent internally
 at the Solr server to process the request.

 http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/SolrResponseBase.html

 http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/QueryResponse.html

 Most likely it could be as a result of an internal network issue between the
 two servers or the Solr server is competing with other applications for
 resources.

 What operating system is the Solr server running on? Is you client
 application connection to a Solr server on the same network or over the
 internet? Are there other applications like database servers etc running on
 the same machine? If so, then the DB server (or any other application) and
 the Solr server could be competing for resources like CPU, memory etc.

 If you are using Tomcat, you can take a look in
 $CATALINA_HOME/logs/catalina.out, there are timestamps there that can also
 guide you.

 --
 Good Enough is not good enough.
 To give anything less than your best is to sacrifice the gift.
 Quality First. Measure Twice. Cut Once.



Re: Solr YUI autocomplete

2009-11-02 Thread Eric Pugh

It does, have you looked at
http://wiki.apache.org/solr/SolJSON?highlight=%28json%29#Using_Solr.27s_JSON_output_for_AJAX.
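
For instance, with a made-up callback name, a request like

http://localhost:8983/solr/select/?q=monitor&wt=json&json.wrf=handleSuggestions

comes back wrapped as handleSuggestions({...}), which is the JSONP form an
autocomplete widget can load via a script tag.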
 
Also, in my book on Solr, there is an example, but using the jquery
autocomplete, which I think was answered earlier on the thread!  Hope that
helps.



ANKITBHATNAGAR wrote:
 
 
 Does Solr supports JSONP (JSON with Padding) in the response?
 
 -Ankit
  
 
 
 -Original Message-
 From: Ankit Bhatnagar [mailto:abhatna...@vantage.com] 
 Sent: Friday, October 30, 2009 10:27 AM
 To: 'solr-user@lucene.apache.org'
 Subject: Solr YUI autocomplete
 
 Hi Guys,
 
 I have question regarding - how to specify the 
 
 I am using YUI autocomplete widget and it expects the JSONP response.
 
  http://localhost:8983/solr/select/?q=monitor&version=2.2&start=0&rows=10&indent=on&wt=json&json.wrf=
 
 I am not sure how should I specify the json.wrf=function
 
 Thanks
 Ankit
 
 

-- 
View this message in context: 
http://old.nabble.com/JQuery-and-autosuggest-tp26130209p26157130.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Cell on web-based files?

2009-11-02 Thread Alexey Serba
 e.g (doesn't work)
 curl "http://localhost:8983/solr/update/extract?extractOnly=true"
 --data-binary @http://myweb.com/mylocalfile.htm -H "Content-type:text/html"

 You might try remote streaming with Solr (see
 http://wiki.apache.org/solr/SolrConfigXml).

Yes, curl example

curl 
'http://localhost:8080/solr/main_index/extract/?extractOnly=true&indent=on&resource.name=lecture12&stream.url=http%3A//myweb.com/lecture12.ppt'

It works great for me.
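
For stream.url to be accepted, remote streaming has to be switched on in
solrconfig.xml; a minimal sketch (the upload limit is just an example value):

  <requestDispatcher handleSelect="true">
    <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048"/>
  </requestDispatcher>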

Alex


RE: Solr YUI autocomplete

2009-11-02 Thread Ankit Bhatnagar

Hey Eric,

That's correct; however it didn't work with the YUI widget.

I changed my approach to use jQuery for now.
 


-Ankit

-Original Message-
From: Eric Pugh [mailto:ep...@opensourceconnections.com] 
Sent: Monday, November 02, 2009 10:20 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr YUI autocomplete


It does, have you looked at
http://wiki.apache.org/solr/SolJSON?highlight=%28json%29#Using_Solr.27s_JSON_output_for_AJAX.
 
Also, in my book on Solr, there is an example, but using the jquery
autocomplete, which I think was answered earlier on the thread!  Hope that
helps.



ANKITBHATNAGAR wrote:
 
 
 Does Solr supports JSONP (JSON with Padding) in the response?
 
 -Ankit
  
 
 
 -Original Message-
 From: Ankit Bhatnagar [mailto:abhatna...@vantage.com] 
 Sent: Friday, October 30, 2009 10:27 AM
 To: 'solr-user@lucene.apache.org'
 Subject: Solr YUI autocomplete
 
 Hi Guys,
 
 I have question regarding - how to specify the 
 
 I am using YUI autocomplete widget and it expects the JSONP response.
 
  http://localhost:8983/solr/select/?q=monitor&version=2.2&start=0&rows=10&indent=on&wt=json&json.wrf=
 
 I am not sure how should I specify the json.wrf=function
 
 Thanks
 Ankit
 
 

-- 
View this message in context: 
http://old.nabble.com/JQuery-and-autosuggest-tp26130209p26157130.html
Sent from the Solr - User mailing list archive at Nabble.com.



storing other files in index directory

2009-11-02 Thread Paul Rosen
Are there any pitfalls to storing an arbitrary text file in the same 
directory as the solr index?


We're slinging different versions of the index around while we're 
testing and it's hard to keep them straight.


I'd like to put a readme.txt file in the directory that contains some 
history about how that index came to be. Is that harmless? Will it be 
ignored by solr, including during optimizations and any other operation, 
and will solr not delete it?


Re: tracking solr response time

2009-11-02 Thread Erick Erickson
Also, how about a sample of a fast and slow query? And is a slow
query only slow the first time it's executed or every time?

Best
Erick

On Mon, Nov 2, 2009 at 9:52 AM, bharath venkatesh 
bharathv6.proj...@gmail.com wrote:

 Thanks for the quick response
 @yonik

 How much of a latency compared to normal, and what version of Solr are
 you using?

 latency is usually around 2-4 secs (some times it goes more than that
 )  which happens  to  only 15-20%  of the request  other  80-85% of
 request are very fast it is in  milli secs ( around 200,000 requests
 happens every day )

 @Israel  we are not using java client ..  we  r using  python at the
 client with response formatted in json

 @yonikn @Israel   does qtime measure the total time taken at the solr
 server ? I am already measuring the time to get the response  at
 client  end . I would want  a means to know how much time the solr
 server is taking to respond (process ) once it gets the request  . so
 that I could identify whether it is a solr server issue or internal
 network issue


 @Israel  we are using rhel server  5 on both client and server .. we
 have 6 solr sever . one is acting as master . both client and solr
 sever are on the same network . those servers are dedicated solr
 server except 2 severs which have DB and memcahce running .. we have
 adjusted the load accordingly







 On 11/2/09, Israel Ekpo israele...@gmail.com wrote:
  On Mon, Nov 2, 2009 at 8:41 AM, Yonik Seeley
  yo...@lucidimagination.comwrote:
 
  On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh
  bharathv6.proj...@gmail.com wrote:
  We are using solr for many of ur products  it is doing quite well
   .  But since no of hits are becoming high we are experiencing latency
   in certain requests ,about 15% of our requests are suffering a latency
 
  How much of a latency compared to normal, and what version of Solr are
  you using?
 
. We are trying to identify  the problem .  It may be due to  network
   issue or solr server is taking time to process the request  .   other
   than  qtime which is returned along with the response is there any
   other way to track solr servers performance ?
   how is qtime calculated
   , is it the total time from when solr server got the request till it
   gave the response ?
 
  QTime is the time spent in generating the in-memory representation for
  the response before the response writer starts streaming it back in
  whatever format was requested.  The stored fields of returned
  documents are also loaded at this point (to enable handling of huge
  response lists w/o storing all in memory).
 
  There are normally servlet container logs that can be configured to
  spit out the real total request time.
 
   can we do some extra logging to track solr servers
   performance . ideally I would want to pass some log id along with the
   request (query ) to  solr server  and solr server must log the
   response time along with that log id .
 
  Yep - Solr isn't bothered by params it doesn't know about, so just put
  logid=xxx and it should also be logged with the other request
  params.
 
  -Yonik
  http://www.lucidimagination.com
 
 
 
 
  If you are not using Java then you may have to track the elapsed time
  manually.
 
  If you are using the SolrJ Java client you may have the following
 options:
 
  There is a method called getElapsedTime() in
  org.apache.solr.client.solrj.response.SolrResponseBase which is available
 to
  all the subclasses
 
  I have not used it personally but I think this should return the time
 spent
  on the client side for that request.
 
  The QTime is not the time on the client side but the time spent
 internally
  at the Solr server to process the request.
 
 
 http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/SolrResponseBase.html
 
 
 http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/QueryResponse.html
 
  Most likely it could be as a result of an internal network issue between
 the
  two servers or the Solr server is competing with other applications for
  resources.
 
  What operating system is the Solr server running on? Is you client
  application connection to a Solr server on the same network or over the
  internet? Are there other applications like database servers etc running
 on
  the same machine? If so, then the DB server (or any other application)
 and
  the Solr server could be competing for resources like CPU, memory etc.
 
  If you are using Tomcat, you can take a look in
  $CATALINA_HOME/logs/catalina.out, there are timestamps there that can
 also
  guide you.
 
  --
  Good Enough is not good enough.
  To give anything less than your best is to sacrifice the gift.
  Quality First. Measure Twice. Cut Once.
 



tokenize after filters

2009-11-02 Thread Joe Calderon
 Is it possible to tokenize a field on whitespace after some filters
have been applied?

ex: "A + W Root Beer"
The field uses a keyword tokenizer to keep the string together; then
it gets converted to "aw root beer" by a custom filter I've made. I
now want to split that up into 3 tokens (aw, root, beer), but it seems
like you can't use a tokenizer after a filter ... so what's the best way
of accomplishing this?

thx much

--joe


Re: Annotations and reference types

2009-11-02 Thread Shalin Shekhar Mangar
On Thu, Oct 29, 2009 at 7:57 PM, M. Tinnemeyer marc-...@gmx.net wrote:

 Dear listusers,

 Is there a way to store an instance of class A (including the fields from
 myB) via solr using annotations ?
 The index should look like : id; name; b_id; b_name

 --
 Class A {

 @Field
 private String id;
 @Field
 private String name;
 @Field
 private B myB;
 }

 --
 Class B {

 @Field("b_id")
 private String id;
 @Field("B_name")
 private String name;
 }


No.

I guess you want to represent certain fields of class B as attributes of
class A (but with all fields belonging to the same schema); that could be a
worthwhile addition to Solrj. Can you open an issue? A patch would be
even better :)

-- 
Regards,
Shalin Shekhar Mangar.


Re: Question about DIH execution order

2009-11-02 Thread Bertie Shen
Hi Noble,

   I tried to understand your suggestions and played with different variations
according to your reply, but none of them work. Can you explain it in more
detail?
   Thanks a lot!




BTW, do you mean your solution as follows?

<document>
   <entity name="Course" transformer="TemplateTransformer" query="select * from Course">
     <field column="TmpCourseId" name="CourseId" template="Course:${Course.CourseId}" name="id"/>
     <entity name="Rating" query="select comment from Rating where Rating.CourseId = ${Course.CourseId}">
       <field column="comment" name="review"/>
     </entity>
   </entity>
</document>

 But
   1) There is no TmpCourseId field column.
   2) Can we put two names, CourseId and id, in the same map? It seems not.





2009/11/1 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com

 On Sun, Nov 1, 2009 at 11:59 PM, Bertie Shen bertie.s...@gmail.com
 wrote:
  Hi folks,
 
   I have the following data-config.xml. Is there a way to
  let transformation take place after executing SQL select comment from
  Rating where Rating.CourseId = ${Course.CourseId}?  In MySQL database,
  column CourseId in table Course is integer 1, 2, etc;
  template transformation will make them like Course:1, Course:2; column
  CourseId in table Rating is also integer 1, 2, etc.
 
   If transformation happens before executing select comment from Rating
  where Rating.CourseId = ${Course.CourseId}, then there will no match for
  the SQL statement execution.
 
    <document>
      <entity name="Course" transformer="TemplateTransformer" query="select * from Course">
        <field column="CourseId" template="Course:${Course.CourseId}" name="id"/>
        <entity name="Rating" query="select comment from Rating where Rating.CourseId = ${Course.CourseId}">
          <field column="comment" name="review"/>
        </entity>
      </entity>
    </document>
 

 keep the field as follows
   <field column="TmpCourseId" name="CourseId" template="Course:${Course.CourseId}" name="id"/>




 --
 -
 Noble Paul | Principal Engineer| AOL | http://aol.com



Re: tracking solr response time

2009-11-02 Thread Israel Ekpo
On Mon, Nov 2, 2009 at 9:52 AM, bharath venkatesh 
bharathv6.proj...@gmail.com wrote:

 Thanks for the quick response
 @yonik

 How much of a latency compared to normal, and what version of Solr are
 you using?

 latency is usually around 2-4 secs (some times it goes more than that
 )  which happens  to  only 15-20%  of the request  other  80-85% of
 request are very fast it is in  milli secs ( around 200,000 requests
 happens every day )

 @Israel  we are not using java client ..  we  r using  python at the
 client with response formatted in json

 @yonikn @Israel   does qtime measure the total time taken at the solr
 server ? I am already measuring the time to get the response  at
 client  end . I would want  a means to know how much time the solr
 server is taking to respond (process ) once it gets the request  . so
 that I could identify whether it is a solr server issue or internal
 network issue


It is the time spent at the Solr server.

I think Yonik already answered this part in his response to your thread :

This is what he said :

QTime is the time spent in generating the in-memory representation for
the response before the response writer starts streaming it back in
whatever format was requested.  The stored fields of returned
documents are also loaded at this point (to enable handling of huge
response lists w/o storing all in memory).



 @Israel  we are using rhel server  5 on both client and server .. we
 have 6 solr sever . one is acting as master . both client and solr
 sever are on the same network . those servers are dedicated solr
 server except 2 severs which have DB and memcahce running .. we have
 adjusted the load accordingly







 On 11/2/09, Israel Ekpo israele...@gmail.com wrote:
  On Mon, Nov 2, 2009 at 8:41 AM, Yonik Seeley
  yo...@lucidimagination.comwrote:
 
  On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh
  bharathv6.proj...@gmail.com wrote:
  We are using solr for many of ur products  it is doing quite well
   .  But since no of hits are becoming high we are experiencing latency
   in certain requests ,about 15% of our requests are suffering a latency
 
  How much of a latency compared to normal, and what version of Solr are
  you using?
 
. We are trying to identify  the problem .  It may be due to  network
   issue or solr server is taking time to process the request  .   other
   than  qtime which is returned along with the response is there any
   other way to track solr servers performance ?
   how is qtime calculated
   , is it the total time from when solr server got the request till it
   gave the response ?
 
  QTime is the time spent in generating the in-memory representation for
  the response before the response writer starts streaming it back in
  whatever format was requested.  The stored fields of returned
  documents are also loaded at this point (to enable handling of huge
  response lists w/o storing all in memory).
 
  There are normally servlet container logs that can be configured to
  spit out the real total request time.
 
   can we do some extra logging to track solr servers
   performance . ideally I would want to pass some log id along with the
   request (query ) to  solr server  and solr server must log the
   response time along with that log id .
 
  Yep - Solr isn't bothered by params it doesn't know about, so just put
  logid=xxx and it should also be logged with the other request
  params.
 
  -Yonik
  http://www.lucidimagination.com
 
 
 
 
  If you are not using Java then you may have to track the elapsed time
  manually.
 
  If you are using the SolrJ Java client you may have the following
 options:
 
  There is a method called getElapsedTime() in
  org.apache.solr.client.solrj.response.SolrResponseBase which is available
 to
  all the subclasses
 
  I have not used it personally but I think this should return the time
 spent
  on the client side for that request.
 
  The QTime is not the time on the client side but the time spent
 internally
  at the Solr server to process the request.
 
 
 http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/SolrResponseBase.html
 
 
 http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/QueryResponse.html
 
  Most likely it could be as a result of an internal network issue between
 the
  two servers or the Solr server is competing with other applications for
  resources.
 
  What operating system is the Solr server running on? Is you client
  application connection to a Solr server on the same network or over the
  internet? Are there other applications like database servers etc running
 on
  the same machine? If so, then the DB server (or any other application)
 and
  the Solr server could be competing for resources like CPU, memory etc.
 
  If you are using Tomcat, you can take a look in
  $CATALINA_HOME/logs/catalina.out, there are timestamps there that can
 also
  guide you.
 
  --
  Good Enough is not good enough.
  To give 

Re: CPU utilization and query time high on Solr slave when snapshot install

2009-11-02 Thread Walter Underwood
If you are going to pull a new index every 10 minutes, try turning off  
cache autowarming.


Your caches are never more than 10 minutes old, so spending a minute  
warming each new cache is a waste of CPU. Autowarm submits queries to  
the new Searcher before putting it in service. This will create a  
burst of query load on the new Searcher, often keeping one CPU pretty  
busy for several seconds.


In solrconfig.xml, set autowarmCount to 0.
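
Concretely, that is the autowarmCount attribute on the caches in
solrconfig.xml; a minimal sketch (the sizes are just placeholder values):

  <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>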

Also, if you want the slaves to always have an optimized index, create  
the snapshot only in post-optimize. If you create snapshots in both  
post-commit and post-optimize, you are creating a non-optimized index  
(post-commit), then replacing it with an optimized one a few minutes  
later. A slave might get a non-optimized index one time, then an  
optimized one the next.


wunder

On Nov 2, 2009, at 1:45 AM, biku...@sapient.com wrote:


Hi Solr Gurus,

We have solr in 1 master, 2 slave configuration. Snapshot is created  
post commit, post optimization. We have autocommit after 50  
documents or 5 minutes. Snapshot puller runs as a cron every 10  
minutes. What we have observed is that whenever snapshot is  
installed on the slave, we see solrj client used to query slave  
solr, gets timedout and there is high CPU usage/load avg. on slave  
server. If we stop snapshot puller, then slaves work with no issues.  
The system has been running since 2 months and this issue has  
started to occur only now  when load on website is increasing.


Following are some details:

Solr Details:
apache-solr Version: 1.3.0
Lucene - 2.4-dev

Master/Slave configurations:

Master:
- for indexing data HTTPRequests are made on Solr server.
- autocommit feature is enabled for 50 docs and 5 minutes
- caching params are disable for this server
- mergeFactor of 10 is set
- we were running optimize script after every 2 hours, but now have  
reduced the duration to twice a day but issue still persists


Slave1/Slave2:
- standard requestHandler is being used
- default values of caching are set
Machine Specifications:

Master:
- 4GB RAM
- 1GB JVM Heap memory is allocated to Solr

Slave1/Slave2:
- 4GB RAM
- 2GB JVM Heap memory is allocated to Solr

Master and Slave1 (solr1)are on single box and Slave2(solr2) on  
different box. We use HAProxy to load balance query requests between  
2 slaves. Master is only used for indexing.
Please let us know if somebody has ever faced similar kind of issue  
or has some insight into it as we guys are literally struck at the  
moment with a very unstable production environment.


As a workaround, we have started running optimize on master every 7  
minutes. This seems to have reduced the severity of the problem but  
still issue occurs every 2days now. please suggest what could be the  
root cause of this.


Thanks,
Bipul








Re: tracking solr response time

2009-11-02 Thread bharath venkatesh
@Israel: yes, I got the point which yonik mentioned. But is qtime the
total time taken by the solr server for that request, or is it only part of the
time taken by solr for that request (is there anything the solr server
does for that particular request which is not included in the qtime
bracket)? I am sorry for dragging this qtime discussion on. I just want to be
sure, as we have observed many times a huge mismatch between qtime and the
time measured at the client for the response (does this imply it is due to an
internal network issue?).

@Erick: yes, many times query is slow first time its executed is there any
solution to improve upon this factor .. for querying we use
DisMaxRequestHandler , queries are quite long with many faceting parameters
.


On Mon, Nov 2, 2009 at 10:46 PM, Israel Ekpo israele...@gmail.com wrote:

 On Mon, Nov 2, 2009 at 9:52 AM, bharath venkatesh 
 bharathv6.proj...@gmail.com wrote:

  Thanks for the quick response
  @yonik
 
  How much of a latency compared to normal, and what version of Solr are
  you using?
 
  latency is usually around 2-4 secs (some times it goes more than that
  )  which happens  to  only 15-20%  of the request  other  80-85% of
  request are very fast it is in  milli secs ( around 200,000 requests
  happens every day )
 
  @Israel  we are not using java client ..  we  r using  python at the
  client with response formatted in json
 
  @yonikn @Israel   does qtime measure the total time taken at the solr
  server ? I am already measuring the time to get the response  at
  client  end . I would want  a means to know how much time the solr
  server is taking to respond (process ) once it gets the request  . so
  that I could identify whether it is a solr server issue or internal
  network issue
 

 It is the time spent at the Solr server.

 I think Yonik already answered this part in his response to your thread :

 This is what he said :

 QTime is the time spent in generating the in-memory representation for
 the response before the response writer starts streaming it back in
 whatever format was requested.  The stored fields of returned
 documents are also loaded at this point (to enable handling of huge
 response lists w/o storing all in memory).


 
  @Israel  we are using rhel server  5 on both client and server .. we
  have 6 solr sever . one is acting as master . both client and solr
  sever are on the same network . those servers are dedicated solr
  server except 2 severs which have DB and memcahce running .. we have
  adjusted the load accordingly
 
 
 
 
 
 
 
  On 11/2/09, Israel Ekpo israele...@gmail.com wrote:
   On Mon, Nov 2, 2009 at 8:41 AM, Yonik Seeley
   yo...@lucidimagination.comwrote:
  
   On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh
   bharathv6.proj...@gmail.com wrote:
   We are using solr for many of ur products  it is doing quite well
.  But since no of hits are becoming high we are experiencing
 latency
in certain requests ,about 15% of our requests are suffering a
 latency
  
   How much of a latency compared to normal, and what version of Solr are
   you using?
  
 . We are trying to identify  the problem .  It may be due to
  network
issue or solr server is taking time to process the request  .
 other
than  qtime which is returned along with the response is there any
other way to track solr servers performance ?
how is qtime calculated
, is it the total time from when solr server got the request till it
gave the response ?
  
   QTime is the time spent in generating the in-memory representation for
   the response before the response writer starts streaming it back in
   whatever format was requested.  The stored fields of returned
   documents are also loaded at this point (to enable handling of huge
   response lists w/o storing all in memory).
  
   There are normally servlet container logs that can be configured to
   spit out the real total request time.
  
can we do some extra logging to track solr servers
performance . ideally I would want to pass some log id along with
 the
request (query ) to  solr server  and solr server must log the
response time along with that log id .
  
   Yep - Solr isn't bothered by params it doesn't know about, so just put
   logid=xxx and it should also be logged with the other request
   params.
  
   -Yonik
   http://www.lucidimagination.com
  
  
  
  
   If you are not using Java then you may have to track the elapsed time
   manually.
  
   If you are using the SolrJ Java client you may have the following
  options:
  
   There is a method called getElapsedTime() in
   org.apache.solr.client.solrj.response.SolrResponseBase which is
 available
  to
   all the subclasses
  
   I have not used it personally but I think this should return the time
  spent
   on the client side for that request.
  
   The QTime is not the time on the client side but the time spent
  internally
   at the Solr server to process the request.
  
  
 
 

RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
Any thoughts regarding the subject? I hope FieldCache doesn't use more than
6 bytes per document-field instance... I am too lazy to research the Lucene
source code; I hope someone can provide an exact answer... Thanks


 Subject: Lucene FieldCache memory requirements
 
 Hi,
 
 
 Can anyone confirm Lucene FieldCache memory requirements? I have 100
 millions docs with non-tokenized field country (10 different countries);
I
 expect it requires array of (int, long), size of array 100,000,000,
 without any impact of country field length;
 
 it requires 600,000,000 bytes: int is pointer to document (Lucene
document
 ID),  and long is pointer to String value...
 
 Am I right, is it 600Mb just for this country (indexed, non-tokenized,
 non-boolean) field and 100 millions docs? I need to calculate exact
minimum RAM
 requirements...
 
 I believe it shouldn't depend on cardinality (distribution) of field...
 
 Thanks,
 Fuad
 
 
 
 





Re: Lucene FieldCache memory requirements

2009-11-02 Thread Michael McCandless
Which FieldCache API are you using?  getStrings?  or getStringIndex
(which is used, under the hood, if you sort by this field).

Mike

On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi f...@efendi.ca wrote:
 Any thoughts regarding the subject? I hope FieldCache doesn't use more than
 6 bytes per document-field instance... I am too lazy to research Lucene
 source code, I hope someone can provide exact answer... Thanks


 Subject: Lucene FieldCache memory requirements

 Hi,


 Can anyone confirm Lucene FieldCache memory requirements? I have 100
 millions docs with non-tokenized field country (10 different countries);
 I
 expect it requires array of (int, long), size of array 100,000,000,
 without any impact of country field length;

 it requires 600,000,000 bytes: int is pointer to document (Lucene
 document
 ID),  and long is pointer to String value...

 Am I right, is it 600Mb just for this country (indexed, non-tokenized,
 non-boolean) field and 100 millions docs? I need to calculate exact
 minimum RAM
 requirements...

 I believe it shouldn't depend on cardinality (distribution) of field...

 Thanks,
 Fuad










LocalSolr, Maven, build files and release candidates (Just for info) and spatial radius (A question)

2009-11-02 Thread Ian Ibbotson
Hallo All. I've been trying to prepare a project using localsolr for the
impending (I hope) arrival of solr 1.4 and Lucene 2.9.1.. Here are some
notes in case anyone else is suffering similarly. Obviously everything here
may change by next week.

First problem has been the lack of any stable maven based lucene and solr
artifacts to wire into my poms. Because of that, and as an interim only
measure, I've built the latest branches of the lucene 2.9.1 and solr 1.4
trees and made them into a *temporary* maven repository at
http://developer.k-int.com/m2snapshots/. In there you can find all the jar
artifacts tagged as xxx-ki-rc1 (For solr) and xxx-ki-rc3 (For lucene) and
finally, a localsolr.localsolr build tagged as 1.5.2-rc1. Sorry for the
naming, but I don't want these artifacts to clash with the real ones when
they come along. This is really just for my own use, but I've seen messages
and spoken to people who are really struggling to get their maven deps
right, if this helps anyone, please feel free to use these until the real
apache artifacts appear. I can't take any responsibility for their quality.
All the poms have been altered to look for the correct dependent artifacts
in the same repository, adding the stanza

  <!-- Emergency repository for storing interim builds of lucene and solr
       whilst they sort their act out -->
  <repositories>
    <repository>
      <id>k-int-m2-snapshots</id>
      <name>K-int M2 Snapshots</name>
      <url>http://developer.k-int.com/m2snapshots</url>
      <releases>
        <enabled>true</enabled>
      </releases>
    </repository>
  </repositories>

to your pom will let you use these deps temporarily until we see an official
build. If you're a maven developer and I've gone way around the houses with
this, please tell me of an easier solution :) This repo *will* go away when
the real builds turn up.

The localsolr in this repo also contains the patches I've submitted (A good
while ago) to the localsolr project to make it build with the lucene 2.9.1
rc3 as the downloadable dist is currently built against an older 2.9 release
that had a different API (i.e. it won't work with the new lucene and solr).

All this means that there is a working localsolr build.

Second up, I've also seen emails (And seen the exception myself) around
asking about the following when trying to get all these revisions working
together.

java.lang.NumberFormatException: Invalid shift value in prefixCoded string
(is encoded value really a LONG?)

There are some threads out there telling you that the Lucene indexes are not
binary compatible between versions, but if you're using localsolr, what you
really need to know is:

1) Make sure that your schema.xml contains at least the following fieldType
defs

   <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8"
              omitNorms="true" positionIncrementGap="0"/>

2) Convert your old solr sdouble fields to tdoubles:

  <field name="lat" type="tdouble" indexed="true" stored="true"/>
  <field name="lng" type="tdouble" indexed="true" stored="true"/>
  <dynamicField name="_local*" type="tdouble" indexed="true" stored="true"/>

Pretty sure you would need to rebuild your indexes.

Ok, with those changes I managed to get a working spatial search.

My only problem now is that the radius param on the command line seems to
need to be way bigger than it should be in order to find anything.
Specifically, if I search with a radius of 220 I get a record back which
marks its geo_distance as 83.76888211666025. Shuffling the radius around
shows that a radius of 205 returns that doc, while at 204 it's filtered out. I'm
going to dig into this now, but if anyone knows about this I'd really
appreciate any help.

Cheers all, hope this is of use to someone out there, if anyone has
corrections/comments I'd really appreciate any info.

Best,
Ian.


Re: question about collapse.type = adjacent

2009-11-02 Thread Martijn v Groningen
Hi Michael,

Field collapsing is basically done in two steps. The first step is to
get the uncollapsed, sorted (whether by score or by a field value)
documents, and the second step is to apply the collapse algorithm on
the uncollapsed documents. So yes, when specifying
collapse.type=adjacent the documents get collapsed after the sort
has been applied, but this is also the case when not specifying
collapse.type=adjacent.
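
For illustration, with the SOLR-236 patch a request along these lines (the
host and field names are only placeholders):

    http://localhost:8983/solr/select?q=foo&sort=price+asc&collapse.field=site&collapse.type=adjacent

first sorts the uncollapsed result by price, and only then collapses adjacent
documents that share the same site value.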
I hope this answers your question.

Cheers,

Martijn

2009/11/2 michael8 mich...@saracatech.com:

 Hi,

 I would like to confirm if 'adjacent' in collapse.type means the documents
 (with the same collapse field value) are considered adjacent *after* the
 'sort' param from the query has been applied, or *before*?  I would think it
 would be *after* since collapse feature primarily is meant for presentation
 use.

 Thanks,
 Michael
 --
 View this message in context: 
 http://old.nabble.com/question-about-collapse.type-%3D-adjacent-tp26157114p26157114.html
 Sent from the Solr - User mailing list archive at Nabble.com.





-- 
Met vriendelijke groet,

Martijn van Groningen


apply a patch on solr

2009-11-02 Thread michael8

Hi,

First, please pardon my novice question on patching solr (1.4).  What I'd
like to know is, given a patch, like the one for collapse field, how would
one go about knowing what solr source that patch is meant for, since this is
a source-level patch?  Wouldn't the exact versions of the set of java files to
be patched be critical for the patch to work properly?

So far what I have done is to pull the latest collapse field patch down from
http://issues.apache.org/jira/browse/SOLR-236 (field-collapse-5.patch), and
then svn up the latest trunk from
http://svn.apache.org/repos/asf/lucene/solr/trunk/, then patch and build. 
Intuitively I was thinking I should be doing svn up to a specific
revision/tag instead of just latest.  So far everything seems fine, but I
just want to make sure I'm doing the right thing and not just being lucky.

Thanks,
Michael
-- 
View this message in context: 
http://old.nabble.com/apply-a-patch-on-solr-tp26157826p26157826.html
Sent from the Solr - User mailing list archive at Nabble.com.



apply a patch on solr

2009-11-02 Thread michael8

Hi,

First, please pardon my novice question on patching solr (1.4).  What I'd
like to know is, given a patch, like the one for collapse field, how would
one go about knowing what solr source that patch is meant for, since this is
a source-level patch?  Wouldn't the exact versions of the set of java files to
be patched be critical for the patch to work properly?

So far what I have done is to pull the latest collapse field patch down from
http://issues.apache.org/jira/browse/SOLR-236 (field-collapse-5.patch), and
then svn up the latest trunk from
http://svn.apache.org/repos/asf/lucene/solr/trunk/, then patch and build. 
Intuitively I was thinking I should be doing svn up to a specific
revision/tag instead of just latest.  So far everything seems fine, but I
just want to make sure I'm doing the right thing and not just being lucky.

Thanks,
Michael
-- 
View this message in context: 
http://old.nabble.com/apply-a-patch-on-solr-tp26157827p26157827.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
I am not using the Lucene API directly; I am using SOLR, which uses the Lucene
FieldCache for faceting on non-tokenized fields...
I think this cache will be lazily loaded until a user executes a sorted (by
this field) SOLR query for all documents *:* - in that case it will be fully
populated...


 Subject: Re: Lucene FieldCache memory requirements
 
 Which FieldCache API are you using?  getStrings?  or getStringIndex
 (which is used, under the hood, if you sort by this field).
 
 Mike
 
 On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi f...@efendi.ca wrote:
  Any thoughts regarding the subject? I hope FieldCache doesn't use more
than
  6 bytes per document-field instance... I am too lazy to research Lucene
  source code, I hope someone can provide exact answer... Thanks
 
 
  Subject: Lucene FieldCache memory requirements
 
  Hi,
 
 
  Can anyone confirm Lucene FieldCache memory requirements? I have 100
  millions docs with non-tokenized field country (10 different
countries);
  I
  expect it requires array of (int, long), size of array 100,000,000,
  without any impact of country field length;
 
  it requires 600,000,000 bytes: int is pointer to document (Lucene
  document
  ID),  and long is pointer to String value...
 
  Am I right, is it 600Mb just for this country (indexed,
non-tokenized,
  non-boolean) field and 100 millions docs? I need to calculate exact
  minimum RAM
  requirements...
 
  I believe it shouldn't depend on cardinality (distribution) of field...
 
  Thanks,
  Fuad
 
 
 
 
 
 
 
 




Dismax and Standard Queries together

2009-11-02 Thread ram_sj

Hi,

I have three fields, business_name, category_name, sub_category_name in my
solrconfig file.

my query = pet clinic

example sub_category_names: Veterinarians, Kennels, Veterinary Clinics &
Hospitals, Pet Grooming, Pet Stores, Clinics

my ideal requirement is dismax searching along these lines:

a. dismax over two or three fields
b. followed by a Boolean match, where a match on any one of the fields is acceptable.

I played around with the minimum match (mm) attribute, but it doesn't seem to be
helpful; I guess dismax requires at least two fields.

The nested queries take only one qf field, so they don't help much either.

Any suggestions will be helpful.

Thanks
Ram
-- 
View this message in context: 
http://old.nabble.com/Dismax-and-Standard-Queries-together-tp26157830p26157830.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: tokenize after filters

2009-11-02 Thread Steven A Rowe
I think you want Koji Sekiguchi's Char Filters:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?highlight=char+filters#Char_Filters
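
A rough sketch of what such a field type could look like in schema.xml (the
type name and mapping file are made up for illustration; whether a mapping
file can express the same rules as your custom filter depends on what that
filter does):

    <fieldType name="text_mapped" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- char filters run before the tokenizer, so the cleaned-up text
             is what gets split on whitespace -->
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-chars.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>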

Steve

 -Original Message-
 From: Joe Calderon [mailto:calderon@gmail.com]
 Sent: Monday, November 02, 2009 11:25 AM
 To: solr-user@lucene.apache.org
 Subject: tokenize after filters
 
  is it possible to tokenize a field on whitespace after some filters
 have been applied:
 
 ex: A + W Root Beer
the field uses a keyword tokenizer to keep the string together, then
it gets converted to aw root beer by a custom filter I've made. I
now want to split that up into 3 tokens (aw, root, beer), but it seems
like you can't use a tokenizer after a filter ... so what's the best way
of accomplishing this?
 
 thx much
 
 --joe


field queries seem slow

2009-11-02 Thread mike anderson
I took a look through my Solr logs this weekend and noticed that the longest
queries were on particular fields, like author:"albert einstein". Is this a
result consistent with other setups out there? If not, is there a trick to
make these go faster? I've read up on filter queries and use those when
applicable, but they don't really solve all my problems.

If anybody wants to take a shot at it but needs to see my solrconfig, etc
just let me know.

Cheers,
Mike


manually creating indices to speed up indexing with app-knowledge

2009-11-02 Thread Britske

This may seem like a strange question, but here it goes anyway. 

I'm considering the possibility of low-level construction of indices for about
20.000 indexed fields (type sInt), if at all possible. (With indices in this
context I mean the inverted indices from term to document id, just to be 100%
complete.)
These indices have to be recreated each night, along with the normal
reindex. 

Globally it should go something like this (each night) : 
 - documents (consisting of about 20 stored fields and about 10 stored &
indexed fields) are indexed through the normal 'code-path' (solrJ in my
case)
- After all docs are persisted (max 200.000) I want to extract the mapping
from 'lucene docid' --> 'stored/indexed product key'.
I believe this should work, because after all docs are persisted the
internal docids aren't altered, so the relationship between 'lucene docid'
--> 'stored/indexed product key' is invariant from that point forward.
(please correct me if wrong)
- construct the 20.000 inverted indices at a low enough level that I do
not have to go through IndexWriter if possible, so I do not need to
construct Documents, I only need to construct the native format of the
indices themselves. Ideally this should work on multiple servers so that the
indices can be created in parallel and the index-files later simply copied
to the index-directory of the master. 

Basically what it boils down to is that indexing time (a reindex should be
done each night) is a big show-stopper at the moment, although we've tried
and tested all the more standard optimization tricks & techniques, as well
as having built a home-grown shard-like indexing strategy which uses 20
pretty big servers in parallel. The 20.000 indexed fields are still simply
killing us.

At the same time the app has a lot of knowledge of the 20.000 indices. 
- All indices consist of prices (ints) between 0 and 10.000
- and most importantly: as part of the document construction process the
ordering of each of the 20.000 indices is known for all documents that are
processed by the document-construction server in question. (This part is
needed, and is already performing at light speed) 

for sake of argument say we have 5 document-construction servers. Each
server processes 40.000 documents. Each server has 20.000 ordered indices in
its own format readily available for the 40.000 documents it's processing. 
Something like: LinkedHashMap<Integer,Set<Integer>> -->
price,{productids}

Say we have 20 indexing servers. Each server has to calculate 1.000 indices
(totalling the 20.000) 
We have the 5 doc-construction servers distribute the ordered sub-indices to
the correct servers. 
Each server constructs an index from 5 ordered sub-indices coming from 5
different construction-servers. This can be done efficiently using a
mergesort (since the sub-indices are already sorted) 

All that is missing (oversimplifying here) is going from the ordered
indices in application format to the index format of lucene (substituting
the productids by the lucene docids along the way) and streaming it to disk.
I believe this would quite possibly give a really big indexing improvement.

Is my thinking correct in the steps involved?
Do you believe that this would indeed give a big speedup for this specific
situation?
Where would I hook in the SOlr / lucene code to construct the native format?


Thanks in advance (and for making it to here) 

Geert-Jan

-- 
View this message in context: 
http://old.nabble.com/manually-creating-indices-to-speed-up-indexing-with-app-knowledge-tp26157851p26157851.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: apply a patch on solr

2009-11-02 Thread mike anderson
You can see what revision the patch was written for at the top of the patch,
it will look like this:

Index: org/apache/solr/handler/MoreLikeThisHandler.java
===
--- org/apache/solr/handler/MoreLikeThisHandler.java (revision 772437)
+++ org/apache/solr/handler/MoreLikeThisHandler.java (working copy)

now check out revision 772437 using the --revision switch in svn, patch
away, and then svn up to make sure everything merges cleanly.  This is a
good guide to follow as well:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg10189.html

cheers,
-mike

On Mon, Nov 2, 2009 at 3:55 PM, michael8 mich...@saracatech.com wrote:


 Hi,

 First I like to pardon my novice question on patching solr (1.4).  What I
 like to know is, given a patch, like the one for collapse field, how would
 one go about knowing what solr source that patch is meant for since this is
 a source level patch?  Wouldn't the exact versions of a set of java files
 to
 be patched critical for the patch to work properly?

 So far what I have done is to pull the latest collapse field patch down
 from
 http://issues.apache.org/jira/browse/SOLR-236 (field-collapse-5.patch),
 and
 then svn up the latest trunk from
 http://svn.apache.org/repos/asf/lucene/solr/trunk/, then patch and build.
 Intuitively I was thinking I should be doing svn up to a specific
 revision/tag instead of just latest.  So far everything seems fine, but I
 just want to make sure I'm doing the right thing and not just being lucky.

 Thanks,
 Michael
 --
 View this message in context:
 http://old.nabble.com/apply-a-patch-on-solr-tp26157827p26157827.html
 Sent from the Solr - User mailing list archive at Nabble.com.




highlighting error using 1.4rc

2009-11-02 Thread Jake Brownell
Hi,

I've tried installing the latest (3rd) RC for Solr 1.4 and Lucene 2.9.1. One of
our integration tests, which runs against an embedded server, appears to be
failing on highlighting. I've included the stack trace and the configuration 
from solrconf. I'd appreciate any insights. Please let me know what additional 
information would be useful.


Caused by: org.apache.solr.client.solrj.SolrServerException: 
org.apache.solr.client.solrj.SolrServerException: java.lang.ClassCastException: 
org.apache.lucene.search.spans.SpanOrQuery cannot be cast to 
org.apache.lucene.search.spans.SpanNearQuery
at 
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:153)
at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
at 
org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
at 
org.bookshare.search.solr.SolrSearchServerWrapper.query(SolrSearchServerWrapper.java:96)
... 29 more
Caused by: org.apache.solr.client.solrj.SolrServerException: 
java.lang.ClassCastException: org.apache.lucene.search.spans.SpanOrQuery cannot 
be cast to org.apache.lucene.search.spans.SpanNearQuery
at 
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:141)
... 32 more
Caused by: java.lang.ClassCastException: 
org.apache.lucene.search.spans.SpanOrQuery cannot be cast to 
org.apache.lucene.search.spans.SpanNearQuery
at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.collectSpanQueryFields(WeightedSpanTermExtractor.java:489)
at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.collectSpanQueryFields(WeightedSpanTermExtractor.java:484)
at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms(WeightedSpanTermExtractor.java:249)
at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:230)
at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158)
at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:414)
at 
org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:216)
at 
org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:184)
at 
org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:226)
at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:335)
at 
org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:89)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:203)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139)
... 32 more

I see in our solrconf the following for highlighting.

  <highlighting>
   <!-- Configure the standard fragmenter -->
   <!-- This could most likely be commented out in the default case -->
   <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter"
               default="true">
    <lst name="defaults">
     <int name="hl.fragsize">100</int>
    </lst>
   </fragmenter>

   <!-- A regular-expression-based fragmenter (f.i., for sentence extraction) -->
   <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <!-- slightly smaller fragsizes work better because of slop -->
      <int name="hl.fragsize">70</int>
      <!-- allow 50% slop on fragment sizes -->
      <float name="hl.regex.slop">0.5</float>
      <!-- a basic sentence pattern -->
      <str name="hl.regex.pattern">[-\w ,/\n\']{20,200}</str>
    </lst>
   </fragmenter>

   <!-- Configure the standard formatter -->
   <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter"
              default="true">
    <lst name="defaults">
     <str name="hl.simple.pre"><![CDATA[<strong>]]></str>
     <str name="hl.simple.post"><![CDATA[</strong>]]></str>
    </lst>
   </formatter>
  </highlighting>



Thanks,
Jake


Question regarding snapinstaller

2009-11-02 Thread Prasanna Ranganathan

 It looks like the snapinstaller script does an atomic remove and replace of
the entire solr_home/data_dir/index folder with the contents of the new
snapshot before issuing a commit command. I am trying to understand the
implication of the same.

 What happens to queries that come during the time interval between the
instant the existing directory is removed and the commit command gets
finalized? Does a currently running instance of Solr not need the files in
the index folder to serve the query results? Are all the contents of the
index folder loaded into memory?
 
 Thanks in advance for any help.

Regards,

Prasanna.


Re: tracking solr response time

2009-11-02 Thread Erick Erickson
So I need someone with better knowledge to chime in here with an opinion
on whether autowarming would help since the whole faceting thing is
something
I'm not very comfortable with...

hint, hint, hint

Erick

On Mon, Nov 2, 2009 at 2:21 PM, bharath venkatesh 
bharathv6.proj...@gmail.com wrote:

 @Israel: yes I got that point which yonik mentioned .. but is qtime the
 total time taken by solr server for that request or  is it  part of time
 taken by the solr for that request ( is there any thing that a solr server
 does for that particulcar request which is not included in that qtime
 bracket ) ?  I am sorry for dragging in to this qtime. I just want to be
 sure, as we observed many times there is huge mismatch between qtime and
 time measured at the client for the response ( does this imply it is due to
 internal network issue )

 @Erick: yes, many times query is slow first time its executed is there any
 solution to improve upon this factor .. for querying we use
 DisMaxRequestHandler , queries are quite long with many faceting parameters
 .


 On Mon, Nov 2, 2009 at 10:46 PM, Israel Ekpo israele...@gmail.com wrote:

  On Mon, Nov 2, 2009 at 9:52 AM, bharath venkatesh 
  bharathv6.proj...@gmail.com wrote:
 
   Thanks for the quick response
   @yonik
  
   How much of a latency compared to normal, and what version of Solr are
   you using?
  
   latency is usually around 2-4 secs (some times it goes more than that
   )  which happens  to  only 15-20%  of the request  other  80-85% of
   request are very fast it is in  milli secs ( around 200,000 requests
   happens every day )
  
   @Israel  we are not using java client ..  we  r using  python at the
   client with response formatted in json
  
   @yonikn @Israel   does qtime measure the total time taken at the solr
   server ? I am already measuring the time to get the response  at
   client  end . I would want  a means to know how much time the solr
   server is taking to respond (process ) once it gets the request  . so
   that I could identify whether it is a solr server issue or internal
   network issue
  
 
  It is the time spent at the Solr server.
 
  I think Yonik already answered this part in his response to your thread :
 
  This is what he said :
 
  QTime is the time spent in generating the in-memory representation for
  the response before the response writer starts streaming it back in
  whatever format was requested.  The stored fields of returned
  documents are also loaded at this point (to enable handling of huge
  response lists w/o storing all in memory).
 
 
  
   @Israel  we are using rhel server  5 on both client and server .. we
   have 6 solr sever . one is acting as master . both client and solr
   sever are on the same network . those servers are dedicated solr
   server except 2 severs which have DB and memcahce running .. we have
   adjusted the load accordingly
  
  
  
  
  
  
  
   On 11/2/09, Israel Ekpo israele...@gmail.com wrote:
On Mon, Nov 2, 2009 at 8:41 AM, Yonik Seeley
yo...@lucidimagination.comwrote:
   
On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh
bharathv6.proj...@gmail.com wrote:
We are using solr for many of ur products  it is doing quite
 well
 .  But since no of hits are becoming high we are experiencing
  latency
 in certain requests ,about 15% of our requests are suffering a
  latency
   
How much of a latency compared to normal, and what version of Solr
 are
you using?
   
  . We are trying to identify  the problem .  It may be due to
   network
 issue or solr server is taking time to process the request  .
  other
 than  qtime which is returned along with the response is there any
 other way to track solr servers performance ?
 how is qtime calculated
 , is it the total time from when solr server got the request till
 it
 gave the response ?
   
QTime is the time spent in generating the in-memory representation
 for
the response before the response writer starts streaming it back in
whatever format was requested.  The stored fields of returned
documents are also loaded at this point (to enable handling of huge
response lists w/o storing all in memory).
   
There are normally servlet container logs that can be configured to
spit out the real total request time.
   
 can we do some extra logging to track solr servers
 performance . ideally I would want to pass some log id along with
  the
 request (query ) to  solr server  and solr server must log the
 response time along with that log id .
   
Yep - Solr isn't bothered by params it doesn't know about, so just
 put
logid=xxx and it should also be logged with the other request
params.
   
-Yonik
http://www.lucidimagination.com
   
   
   
   
If you are not using Java then you may have to track the elapsed time
manually.
   
If you are using the SolrJ Java client you may have the following
   options:
   

Re: field queries seem slow

2009-11-02 Thread Erick Erickson
H, are you sorting? And have your readers been reopened? Is the
second query of that sort also slow? If the answer to this last question is
no,
have you tried some autowarming queries?

Best
Erick

On Mon, Nov 2, 2009 at 4:34 PM, mike anderson saidthero...@gmail.comwrote:

 I took a look through my Solr logs this weekend and noticed that the
 longest
 queries were on particular fields, like author:albert einstein. Is this a
 result consistent with other setups out there? If not, Is there a trick to
 make these go faster? I've read up on filter queries and use those when
 applicable, but they don't really solve all my problems.

 If anybody wants to take a shot at it but needs to see my solrconfig, etc
 just let me know.

 Cheers,
 Mike



Re: Question about DIH execution order

2009-11-02 Thread Fergus McMenemie
Bertie,

Not sure what you are trying to do; we need a clearer description of
what select * returns and what you want to end up in the index. But
to answer your question: the transformations happen after DIH has
performed the SQL statement. In fact the rows output from the SQL
command are assigned to the DIH fields and then any transformations
are applied. The examples in
http://wiki.apache.org/solr/DataImportHandler
are quite good.

Hi Noble,

   I tried to understand your suggestions and played with different variations
according to your reply.  But none of them worked. Can you explain it in more
detail?
   Thanks a lot!




BTW, do you mean your solution is as follows?

<document>
  <entity name="Course" transformer="TemplateTransformer"
          query="select * from Course">
    <field column="TmpCourseId" name="CourseId"
           template="Course:${Course.CourseId}" name="id"/>
    <entity name="Rating"
            query="select comment from Rating where Rating.CourseId = ${Course.CourseId}">
      <field column="comment" name="review"/>
    </entity>
  </entity>
</document>

 But
   1) There is no TmpCourseId field column.
   2) Can we put the two names CourseId and id in the same field mapping? It seems not.





2009/11/1 Noble Paul noble.p...@corp.aol.com

 On Sun, Nov 1, 2009 at 11:59 PM, Bertie Shen bertie.s...@gmail.com
 wrote:
  Hi folks,
 
   I have the following data-config.xml. Is there a way to
  let transformation take place after executing SQL select comment from
  Rating where Rating.CourseId = ${Course.CourseId}?  In MySQL database,
  column CourseId in table Course is integer 1, 2, etc;
  template transformation will make them like Course:1, Course:2; column
  CourseId in table Rating is also integer 1, 2, etc.
 
   If transformation happens before executing select comment from Rating
  where Rating.CourseId = ${Course.CourseId}, then there will no match for
  the SQL statement execution.
 
    <document>
      <entity name="Course" transformer="TemplateTransformer"
              query="select * from Course">
        <field column="CourseId"
               template="Course:${Course.CourseId}" name="id"/>
        <entity name="Rating"
                query="select comment from Rating where Rating.CourseId = ${Course.CourseId}">
          <field column="comment" name="review"/>
        </entity>
      </entity>
    </document>
 

 keep the field as follows
   <field column="TmpCourseId" name="CourseId"
          template="Course:${Course.CourseId}" name="id"/>




 --
 -
 Noble Paul | Principal Engineer| AOL | http://aol.com


-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Lucene FieldCache memory requirements

2009-11-02 Thread Michael McCandless
OK I think someone who knows how Solr uses the fieldCache for this
type of field will have to pipe up.

For Lucene directly, simple strings would consume a pointer (4 or 8
bytes depending on whether your JRE is 64bit) per doc, and the string
index would consume an int (4 bytes) per doc.  (Each also consumes
negligible (for your case) memory to hold the actual string values).

Note that for your use case, this is exceptionally wasteful.  If
Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this)
then it'd take much fewer bits to reference the values, since you have
only 10 unique string values.
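
As a rough back-of-the-envelope for the 100-million-doc case above, assuming
getStringIndex is what ends up being used:

    order array : 4 bytes x 100,000,000 docs  ~ 400 MB
    lookup array: ~10 unique country strings  ~ negligible

so on the order of 400 MB for that one field rather than 600 MB, and far less
once something like LUCENE-1990 exists.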

Mike

On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi f...@efendi.ca wrote:
 I am not using Lucene API directly; I am using SOLR which uses Lucene
 FieldCache for faceting on non-tokenized fields...
 I think this cache will be lazily loaded, until user executes sorted (by
 this field) SOLR query for all documents *:* - in this case it will be fully
 populated...


 Subject: Re: Lucene FieldCache memory requirements

 Which FieldCache API are you using?  getStrings?  or getStringIndex
 (which is used, under the hood, if you sort by this field).

 Mike

 On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi f...@efendi.ca wrote:
  Any thoughts regarding the subject? I hope FieldCache doesn't use more
 than
  6 bytes per document-field instance... I am too lazy to research Lucene
  source code, I hope someone can provide exact answer... Thanks
 
 
  Subject: Lucene FieldCache memory requirements
 
  Hi,
 
 
  Can anyone confirm Lucene FieldCache memory requirements? I have 100
  millions docs with non-tokenized field country (10 different
 countries);
  I
  expect it requires array of (int, long), size of array 100,000,000,
  without any impact of country field length;
 
  it requires 600,000,000 bytes: int is pointer to document (Lucene
  document
  ID),  and long is pointer to String value...
 
  Am I right, is it 600Mb just for this country (indexed,
 non-tokenized,
  non-boolean) field and 100 millions docs? I need to calculate exact
  minimum RAM
  requirements...
 
  I believe it shouldn't depend on cardinality (distribution) of field...
 
  Thanks,
  Fuad
 
 
 
 
 
 
 
 





Re: highlighting error using 1.4rc

2009-11-02 Thread Mark Miller
Umm - crap. This looks like a bug in a fix that just went in. My  
fault on the review. I'll fix it tonight when I get home -  
unfortunately, both lucene and solr are about to be released...


- Mark

http://www.lucidimagination.com (mobile)

On Nov 2, 2009, at 5:17 PM, Jake Brownell ja...@benetech.org wrote:


Hi,

I've tried installing the latest (3rd) RC for Solr 1.4 and Lucene  
2.9.1. One of our integration tests, which runs against and embedded  
server appears to be failing on highlighting. I've included the  
stack trace and the configuration from solrconf. I'd appreciate any  
insights. Please let me know what additional information would be  
useful.



Caused by: org.apache.solr.client.solrj.SolrServerException:  
org.apache.solr.client.solrj.SolrServerException:  
java.lang.ClassCastException:  
org.apache.lucene.search.spans.SpanOrQuery cannot be cast to  
org.apache.lucene.search.spans.SpanNearQuery
   at  
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request 
(EmbeddedSolrServer.java:153)
   at  
org.apache.solr.client.solrj.request.QueryRequest.process 
(QueryRequest.java:89)
   at org.apache.solr.client.solrj.SolrServer.query 
(SolrServer.java:118)
   at org.bookshare.search.solr.SolrSearchServerWrapper.query 
(SolrSearchServerWrapper.java:96)

   ... 29 more
Caused by: org.apache.solr.client.solrj.SolrServerException:  
java.lang.ClassCastException:  
org.apache.lucene.search.spans.SpanOrQuery cannot be cast to  
org.apache.lucene.search.spans.SpanNearQuery
   at  
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request 
(EmbeddedSolrServer.java:141)

   ... 32 more
Caused by: java.lang.ClassCastException:  
org.apache.lucene.search.spans.SpanOrQuery cannot be cast to  
org.apache.lucene.search.spans.SpanNearQuery
   at  
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.collectSpanQueryFields( 
WeightedSpanTermExtractor.java:489)
   at  
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.collectSpanQueryFields( 
WeightedSpanTermExtractor.java:484)
   at  
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms( 
WeightedSpanTermExtractor.java:249)
   at  
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract 
(WeightedSpanTermExtractor.java:230)
   at  
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract 
(WeightedSpanTermExtractor.java:158)
   at  
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms( 
WeightedSpanTermExtractor.java:414)
   at  
org.apache.lucene.search.highlight.QueryScorer.initExtractor 
(QueryScorer.java:216)
   at org.apache.lucene.search.highlight.QueryScorer.init 
(QueryScorer.java:184)
   at  
org.apache.lucene.search.highlight.Highlighter.getBestTextFragments 
(Highlighter.java:226)
   at  
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting 
(DefaultSolrHighlighter.java:335)
   at  
org.apache.solr.handler.component.HighlightComponent.process 
(HighlightComponent.java:89)
   at  
org.apache.solr.handler.component.SearchHandler.handleRequestBody 
(SearchHandler.java:203)
   at  
org.apache.solr.handler.RequestHandlerBase.handleRequest 
(RequestHandlerBase.java:131)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java: 
1316)
   at  
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request 
(EmbeddedSolrServer.java:139)

   ... 32 more

I see in our solrconf the following for highlighting.

 highlighting
  !-- Configure the standard fragmenter --
  !-- This could most likely be commented out in the default case  
--
  fragmenter name=gap  
class=org.apache.solr.highlight.GapFragmenter default=true

   lst name=defaults
int name=hl.fragsize100/int
   /lst
  /fragmenter

  !-- A regular-expression-based fragmenter (f.i., for sentence  
extraction) --
  fragmenter name=regex  
class=org.apache.solr.highlight.RegexFragmenter

   lst name=defaults
 !-- slightly smaller fragsizes work better because of slop --
 int name=hl.fragsize70/int
 !-- allow 50% slop on fragment sizes --
 float name=hl.regex.slop0.5/float
 !-- a basic sentence pattern --
 str name=hl.regex.pattern[-\w ,/\n\']{20,200}/str
   /lst
  /fragmenter

  !-- Configure the standard formatter --
  formatter name=html  
class=org.apache.solr.highlight.HtmlFormatter default=true

   lst name=defaults
str name=hl.simple.pre![CDATA[strong]]/str
str name=hl.simple.post![CDATA[/strong]]/str
   /lst
  /formatter
 /highlighting



Thanks,
Jake


Re: Spell check suggestion and correct way of implementation and some Questions

2009-11-02 Thread darniz

Hello everybody
I am able to use the spell checker, but I have some questions if someone can
answer them.
If I search the free-text word waranty then I get back the suggestion warranty, which
is fine.
But if I do a search on a field, for example
description:waranty, the output collation element is description:warranty,
which I don't want; I want to get back only the text, i.e. warranty.

We are using collation to return the results since, if a user types
three words, we use the collation in the response element to display the
spelling suggestion.

Any advice

darniz



-- 
View this message in context: 
http://old.nabble.com/Spell-check-suggestion-and-correct-way-of-implementation-and-some-Questions-tp26096664p26157893.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Spell check suggestion and correct way of implementation and some Questions

2009-11-02 Thread darniz

Hello everybody
I am able to use the spell checker, but I have some questions if someone can
answer them.
If I search the free-text word waranty then I get back the suggestion warranty, which
is fine.
But if I do a search on a field, for example
description:waranty, the output collation element is description:warranty,
which I don't want; I want to get back only the text, i.e. warranty.

We are using collation to return the results since, if a user types
three words, we use the collation in the response element to display the
spelling suggestion.

Any advice

darniz

-- 
View this message in context: 
http://old.nabble.com/Spell-check-suggestion-and-correct-way-of-implementation-and-some-Questions-tp26096664p26157895.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: CPU utilization and query time high on Solr slave when snapshot install

2009-11-02 Thread Mark Miller
Hmm...I think you have to setup warming queries yourself and that  
autowarm just copies entries from the old cache to the new cache,  
rather than issuing queries - the value is how many entries it will  
copy. Though that's still going to take CPU and time.


- Mark

http://www.lucidimagination.com (mobile)

On Nov 2, 2009, at 12:47 PM, Walter Underwood wun...@wunderwood.org  
wrote:


If you are going to pull a new index every 10 minutes, try turning  
off cache autowarming.


Your caches are never more than 10 minutes old, so spending a minute  
warming each new cache is a waste of CPU. Autowarm submits queries  
to the new Searcher before putting it in service. This will create a  
burst of query load on the new Searcher, often keeping one CPU  
pretty busy for several seconds.


In solrconfig.xml, set autowarmCount to 0.

Also, if you want the slaves to always have an optimized index,  
create the snapshot only in post-optimize. If you create snapshots  
in both post-commit and post-optimize, you are creating a non- 
optimized index (post-commit), then replacing it with an optimized  
one a few minutes later. A slave might get a non-optimized index one  
time, then an optimized one the next.


wunder

On Nov 2, 2009, at 1:45 AM, biku...@sapient.com wrote:


Hi Solr Gurus,

We have solr in 1 master, 2 slave configuration. Snapshot is  
created post commit, post optimization. We have autocommit after 50  
documents or 5 minutes. Snapshot puller runs as a cron every 10  
minutes. What we have observed is that whenever snapshot is  
installed on the slave, we see solrj client used to query slave  
solr, gets timedout and there is high CPU usage/load avg. on slave  
server. If we stop snapshot puller, then slaves work with no  
issues. The system has been running since 2 months and this issue  
has started to occur only now  when load on website is increasing.


Following are some details:

Solr Details:
apache-solr Version: 1.3.0
Lucene - 2.4-dev

Master/Slave configurations:

Master:
- for indexing data HTTPRequests are made on Solr server.
- autocommit feature is enabled for 50 docs and 5 minutes
- caching params are disable for this server
- mergeFactor of 10 is set
- we were running optimize script after every 2 hours, but now have  
reduced the duration to twice a day but issue still persists


Slave1/Slave2:
- standard requestHandler is being used
- default values of caching are set
Machine Specifications:

Master:
- 4GB RAM
- 1GB JVM Heap memory is allocated to Solr

Slave1/Slave2:
- 4GB RAM
- 2GB JVM Heap memory is allocated to Solr

Master and Slave1 (solr1)are on single box and Slave2(solr2) on  
different box. We use HAProxy to load balance query requests  
between 2 slaves. Master is only used for indexing.
Please let us know if somebody has ever faced similar kind of issue  
or has some insight into it as we guys are literally struck at the  
moment with a very unstable production environment.


As a workaround, we have started running optimize on master every 7  
minutes. This seems to have reduced the severity of the problem but  
still issue occurs every 2days now. please suggest what could be  
the root cause of this.


Thanks,
Bipul








Re: solr search

2009-11-02 Thread Lance Norskog
The problem is in db-dataconfig.xml. You should start with the example
DataImportHandler configuration fles.

The structure is wrong. First there is a datasource, then there are
'entities' which fetch a document's fields from the datasource.
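
Roughly, the shape the examples use is the following (values are taken from
your config just for illustration; only the nesting changes, with dataSource
as a sibling of document rather than inside it):

    <dataConfig>
      <dataSource type="JdbcDataSource"
                  driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                  url="jdbc:sqlserver://servername:1433/databasename"
                  user="sa" password="..."/>
      <document>
        <entity name="be" query="select id from be">
          <field column="id" name="id1"/>
        </entity>
      </document>
    </dataConfig>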

On Fri, Oct 30, 2009 at 9:03 PM, manishkbawne manish.ba...@gmail.com wrote:

 Hi,
 I have made following changes in solrconfig.xml

   <requestHandler name="/dataimport"
     class="org.apache.solr.handler.dataimport.DataImportHandler">
     <lst name="defaults">
       <str name="config">C:/Apache-Tomcat/apache-tomcat-6.0.20/solr/conf/db-data-config.xml</str>
     </lst>
   </requestHandler>


 in db-dataconfig.xml
 <dataConfig>
   <document name="id1">
     <dataSource type="JdbcDataSource"
       driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
       url="jdbc:sqlserver://servername:1433/databasename" user="sa"
       password="p...@123"/>
     <entity name="id1" query="select id from be">
       <field column="id" name="id1" />
     </entity>
   </document>
 </dataConfig>

 in schema.xml files
 <field name="id1" type="string" indexes="true" default="none"/>

 Please suggest me the possible cause of error??




 Lance Norskog-2 wrote:

 Please post your dataimporthandler configuration file.

 On Fri, Oct 30, 2009 at 4:17 AM, manishkbawne manish.ba...@gmail.com
 wrote:

 Thanks for your reply .. I am trying to use the database for solr search
 but
 getting this error..

 <abortOnConfigurationError>false</abortOnConfigurationError> in null
 -
 java.lang.NullPointerException at
 org.apache.solr.handler.dataimport.DataImporter.init(DataImporter.java:95)
 at
 org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:106)
 at org.apache.solr.core.SolrResourceLoader

 Can you please suggest me some possible solution?








 Karsten F. wrote:

 hi manishkbawne,

  unspecific ideas of search improvements are here:
 http://wiki.apache.org/solr/SolrPerformanceFactors

 I really like the last idea in
 http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
 :
 Use a profiler and ask a more specific question in this forum.

 Best regards
   Karsten



 manishkbawne wrote:

 I am using solr search to search through xml files. As I am working on
 millions of data, the result output is slower. Can anyone please
 suggest
 me some way, by which I can increase the search result output?




 --
 View this message in context:
 http://old.nabble.com/solr-search-tp26125183p26128341.html
 Sent from the Solr - User mailing list archive at Nabble.com.





 --
 Lance Norskog
 goks...@gmail.com



 --
 View this message in context: 
 http://old.nabble.com/solr-search-tp26125183p26139946.html
 Sent from the Solr - User mailing list archive at Nabble.com.





-- 
Lance Norskog
goks...@gmail.com


Re: solr web ui

2009-11-02 Thread Lance Norskog
This is what I meant to mention - Uri's GWT browser, not the Velocity toolkit.

On Fri, Oct 30, 2009 at 1:20 PM, Grant Ingersoll gsing...@apache.org wrote:
 There is also a GWT contribution in JIRA that is pretty handy and will
 likely be added in 1.5.  See http://issues.apache.org/jira/browse/SOLR-1163

 -Grant
 On Oct 29, 2009, at 9:17 PM, scabbage wrote:


 Hi,

 I'm a new solr user. I would like to know if there are any easy to setup
 web
 UIs for solr. It can be as simple as a search box, term highlighting and
 basic faceting. Basically I'm using solr to store all our automation
 testing
 logs and would like to have a simple searchable UI. I don't wanna spent
 too
 much time writing my own.

 Thanks.
 --
 View this message in context:
 http://www.nabble.com/solr-web-ui-tp26123604p26123604.html
 Sent from the Solr - User mailing list archive at Nabble.com.







-- 
Lance Norskog
goks...@gmail.com


RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi

Thank you very much Mike,

I found it:
org.apache.solr.request.SimpleFacets
...
// TODO: future logic could use filters instead of the fieldcache if
// the number of terms in the field is small enough.
counts = getFieldCacheCounts(searcher, base, field, offset,limit,
mincount, missing, sort, prefix);
...
FieldCache.StringIndex si =
FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
final String[] terms = si.lookup;
final int[] termNum = si.order;
...


So that 64-bit requires more memory :)


Mike, am I right here?
[(8-byte pointer) + (4-byte DocID)] x [number of documents (100 million)]
(64-bit JVM)
= 1.2Gb RAM for this...

Or, may be I am wrong:
 For Lucene directly, simple strings would consume an pointer (4 or 8
 bytes depending on whether your JRE is 64bit) per doc, and the string
 index would consume an int (4 bytes) per doc.

[8 bytes (64-bit)] x [number of documents (100 million)]?
= 0.8Gb

Kind of Map between String and DocSet, saving 4 bytes... Key is String,
and Value is array of 64-bit pointers to Document. Why 64-bit (for 64-bit
JVM)? I always thought it is (int) documentId...

Am I right?


Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990!

 Note that for your use case, this is exceptionally wasteful.  
This is probably a very common case... I think it should be confirmed by
Lucene developers too... FieldCache is warmed anyway, even when we don't use
SOLR...

 
-Fuad







 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: November-02-09 6:00 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Lucene FieldCache memory requirements
 
 OK I think someone who knows how Solr uses the fieldCache for this
 type of field will have to pipe up.
 
 For Lucene directly, simple strings would consume an pointer (4 or 8
 bytes depending on whether your JRE is 64bit) per doc, and the string
 index would consume an int (4 bytes) per doc.  (Each also consume
 negligible (for your case) memory to hold the actual string values).
 
 Note that for your use case, this is exceptionally wasteful.  If
 Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this)
 then it'd take much fewer bits to reference the values, since you have
 only 10 unique string values.
 
 Mike
 
 On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi f...@efendi.ca wrote:
  I am not using Lucene API directly; I am using SOLR which uses Lucene
  FieldCache for faceting on non-tokenized fields...
  I think this cache will be lazily loaded, until user executes sorted (by
  this field) SOLR query for all documents *:* - in this case it will be
fully
  populated...
 
 
  Subject: Re: Lucene FieldCache memory requirements
 
  Which FieldCache API are you using?  getStrings?  or getStringIndex
  (which is used, under the hood, if you sort by this field).
 
  Mike
 
  On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi f...@efendi.ca wrote:
   Any thoughts regarding the subject? I hope FieldCache doesn't use
more
  than
   6 bytes per document-field instance... I am too lazy to research
Lucene
   source code, I hope someone can provide exact answer... Thanks
  
  
   Subject: Lucene FieldCache memory requirements
  
   Hi,
  
  
   Can anyone confirm Lucene FieldCache memory requirements? I have 100
   millions docs with non-tokenized field country (10 different
  countries);
   I
   expect it requires array of (int, long), size of array
100,000,000,
   without any impact of country field length;
  
   it requires 600,000,000 bytes: int is pointer to document (Lucene
   document
   ID),  and long is pointer to String value...
  
   Am I right, is it 600Mb just for this country (indexed,
  non-tokenized,
   non-boolean) field and 100 millions docs? I need to calculate exact
   minimum RAM
   requirements...
  
   I believe it shouldn't depend on cardinality (distribution) of
field...
  
   Thanks,
   Fuad
  
  
  
  
  
  
  
  
 
 
 




Re: CPU utilization and query time high on Solr slave when snapshot install

2009-11-02 Thread Jay Hill
So assuming you set up a few sample sort queries to run in the firstSearcher
config, and had very low query volume during that ten minutes so that there
were no evictions before a new Searcher was loaded, would those queries run
by the firstSearcher be passed along to the cache for the next Searcher as
part of the autowarm? If so, it seems like you might want to load a few sort
queries for the firstSearcher, but might not need any included in the
newSearcher?

-Jay


On Mon, Nov 2, 2009 at 4:26 PM, Mark Miller markrmil...@gmail.com wrote:

 Hmm...I think you have to setup warming queries yourself and that autowarm
 just copies entries from the old cache to the new cache, rather than issuing
 queries - the value is how many entries it will copy. Though that's still
 going to take CPU and time.

 - Mark

 http://www.lucidimagination.com (mobile)


 On Nov 2, 2009, at 12:47 PM, Walter Underwood wun...@wunderwood.org
 wrote:

  If you are going to pull a new index every 10 minutes, try turning off
 cache autowarming.

 Your caches are never more than 10 minutes old, so spending a minute
 warming each new cache is a waste of CPU. Autowarm submits queries to the
 new Searcher before putting it in service. This will create a burst of query
 load on the new Searcher, often keeping one CPU pretty busy for several
 seconds.

 In solrconfig.xml, set autowarmCount to 0.

 Also, if you want the slaves to always have an optimized index, create the
 snapshot only in post-optimize. If you create snapshots in both post-commit
 and post-optimize, you are creating a non-optimized index (post-commit),
 then replacing it with an optimized one a few minutes later. A slave might
 get a non-optimized index one time, then an optimized one the next.

 wunder

 On Nov 2, 2009, at 1:45 AM, biku...@sapient.com wrote:

  Hi Solr Gurus,

 We have solr in 1 master, 2 slave configuration. Snapshot is created post
 commit, post optimization. We have autocommit after 50 documents or 5
 minutes. Snapshot puller runs as a cron every 10 minutes. What we have
 observed is that whenever snapshot is installed on the slave, we see solrj
 client used to query slave solr, gets timedout and there is high CPU
 usage/load avg. on slave server. If we stop snapshot puller, then slaves
 work with no issues. The system has been running since 2 months and this
 issue has started to occur only now  when load on website is increasing.

 Following are some details:

 Solr Details:
 apache-solr Version: 1.3.0
 Lucene - 2.4-dev

 Master/Slave configurations:

 Master:
 - for indexing data HTTPRequests are made on Solr server.
 - autocommit feature is enabled for 50 docs and 5 minutes
 - caching params are disable for this server
 - mergeFactor of 10 is set
 - we were running optimize script after every 2 hours, but now have
 reduced the duration to twice a day but issue still persists

 Slave1/Slave2:
 - standard requestHandler is being used
 - default values of caching are set
 Machine Specifications:

 Master:
 - 4GB RAM
 - 1GB JVM Heap memory is allocated to Solr

 Slave1/Slave2:
 - 4GB RAM
 - 2GB JVM Heap memory is allocated to Solr

 Master and Slave1 (solr1)are on single box and Slave2(solr2) on different
 box. We use HAProxy to load balance query requests between 2 slaves. Master
 is only used for indexing.
 Please let us know if somebody has ever faced similar kind of issue or
 has some insight into it as we guys are literally struck at the moment with
 a very unstable production environment.

 As a workaround, we have started running optimize on master every 7
 minutes. This seems to have reduced the severity of the problem but still
 issue occurs every 2days now. please suggest what could be the root cause of
 this.

 Thanks,
 Bipul








Re: Lucene FieldCache memory requirements

2009-11-02 Thread Mark Miller
It also briefly requires more memory than just that - it allocates an
array the size of maxdoc+1 to hold the unique terms - and then sizes down.

Possibly we can use the getUniqueTermCount method in the flexible
indexing branch to get rid of that - which is why I was thinking it
might be a good idea to drop the unsupported exception in that method
for things like MultiReader and just do the work to get the right
number (currently there is a comment saying the user should do that work
if necessary, which makes the call unreliable for this purpose).

Fuad Efendi wrote:
 Thank you very much Mike,

 I found it:
 org.apache.solr.request.SimpleFacets
 ...
 // TODO: future logic could use filters instead of the fieldcache if
 // the number of terms in the field is small enough.
 counts = getFieldCacheCounts(searcher, base, field, offset,limit,
 mincount, missing, sort, prefix);
 ...
 FieldCache.StringIndex si =
 FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
 final String[] terms = si.lookup;
 final int[] termNum = si.order;
 ...


 So that 64-bit requires more memory :)


 Mike, am I right here?
 [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)]
 (64-bit JVM)
 1.2Gb RAM for this...

 Or, may be I am wrong:
   
 For Lucene directly, simple strings would consume an pointer (4 or 8
 bytes depending on whether your JRE is 64bit) per doc, and the string
 index would consume an int (4 bytes) per doc.
 

 [8 bytes (64bit)] x [number of documents (100mlns)]? 
 0.8Gb

 Kind of Map between String and DocSet, saving 4 bytes... Key is String,
 and Value is array of 64-bit pointers to Document. Why 64-bit (for 64-bit
 JVM)? I always thought it is (int) documentId...

 Am I right?


 Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990!

   
 Note that for your use case, this is exceptionally wasteful.  
   
 This is probably very common case... I think it should be confirmed by
 Lucene developers too... FieldCache is warmed anyway, even when we don't use
 SOLR...

  
 -Fuad







   
 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: November-02-09 6:00 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Lucene FieldCache memory requirements

 OK I think someone who knows how Solr uses the fieldCache for this
 type of field will have to pipe up.

 For Lucene directly, simple strings would consume an pointer (4 or 8
 bytes depending on whether your JRE is 64bit) per doc, and the string
 index would consume an int (4 bytes) per doc.  (Each also consume
 negligible (for your case) memory to hold the actual string values).

 Note that for your use case, this is exceptionally wasteful.  If
 Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this)
 then it'd take much fewer bits to reference the values, since you have
 only 10 unique string values.

 Mike

 On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi f...@efendi.ca wrote:
 
 I am not using Lucene API directly; I am using SOLR which uses Lucene
 FieldCache for faceting on non-tokenized fields...
 I think this cache will be lazily loaded, until user executes sorted (by
 this field) SOLR query for all documents *:* - in this case it will be
   
 fully
   
 populated...


   
 Subject: Re: Lucene FieldCache memory requirements

 Which FieldCache API are you using?  getStrings?  or getStringIndex
 (which is used, under the hood, if you sort by this field).

 Mike

 On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi f...@efendi.ca wrote:
 
 Any thoughts regarding the subject? I hope FieldCache doesn't use
   
 more
   
 than
   
 6 bytes per document-field instance... I am too lazy to research
   
 Lucene
   
 source code, I hope someone can provide exact answer... Thanks


   
 Subject: Lucene FieldCache memory requirements

 Hi,


 Can anyone confirm Lucene FieldCache memory requirements? I have 100
 millions docs with non-tokenized field country (10 different
 
 countries);
   
 I
   
 expect it requires array of (int, long), size of array
 
 100,000,000,
   
 without any impact of country field length;

 it requires 600,000,000 bytes: int is pointer to document (Lucene
 
 document
   
 ID),  and long is pointer to String value...

 Am I right, is it 600Mb just for this country (indexed,
 
 non-tokenized,
   
 non-boolean) field and 100 millions docs? I need to calculate exact
 
 minimum RAM
   
 requirements...

 I believe it shouldn't depend on cardinality (distribution) of
 
 field...
   
 Thanks,
 Fuad




 


   

   


   


-- 
- Mark

http://www.lucidimagination.com





Why does BinaryRequestWriter force the path to be base URL + /update/javabin

2009-11-02 Thread Stuart Tettemer
Hi folks,
First of all, thanks for Solr.  It is a great piece of work.

I have a question about BinaryRequestWriter in the solrj project.  Why does
it force the path of UpdateRequests to be /update/javabin (see
BinaryRequestWriter.getPath(String) starting on line 109)?

I am extending BinaryRequestWriter specifically to remove this requirement
and am interested to know the reasoning behind the initial choice.

Thanks for your time,
Stuart


Re: highlighting error using 1.4rc

2009-11-02 Thread Mark Miller
Sorry - it was a bug in the backport from trunk to 2.9.1 - didn't
realize that code didn't get hit because we didn't pass a null field -
else the tests would have caught it. Fix has been committed but I don't
know whether it will make 2.9.1 or 1.4 because both have gotten the
votes and time needed for release.

Mark Miller wrote:
 Umm - crap. This looks like a bug in a fix that just went in. My
 fault on the review. I'll fix it tonight when I get home -
 unfortunately, both lucene and solr are about to be released...

 - Mark

 http://www.lucidimagination.com (mobile)

 On Nov 2, 2009, at 5:17 PM, Jake Brownell ja...@benetech.org wrote:

 Hi,

 I've tried installing the latest (3rd) RC for Solr 1.4 and Lucene
 2.9.1. One of our integration tests, which runs against and embedded
 server appears to be failing on highlighting. I've included the stack
 trace and the configuration from solrconf. I'd appreciate any
 insights. Please let me know what additional information would be
 useful.


 Caused by: org.apache.solr.client.solrj.SolrServerException:
 org.apache.solr.client.solrj.SolrServerException:
 java.lang.ClassCastException:
 org.apache.lucene.search.spans.SpanOrQuery cannot be cast to
 org.apache.lucene.search.spans.SpanNearQuery
at
 org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:153)

at
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)

at
 org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
at
 org.bookshare.search.solr.SolrSearchServerWrapper.query(SolrSearchServerWrapper.java:96)

... 29 more
 Caused by: org.apache.solr.client.solrj.SolrServerException:
 java.lang.ClassCastException:
 org.apache.lucene.search.spans.SpanOrQuery cannot be cast to
 org.apache.lucene.search.spans.SpanNearQuery
at
 org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:141)

... 32 more
 Caused by: java.lang.ClassCastException:
 org.apache.lucene.search.spans.SpanOrQuery cannot be cast to
 org.apache.lucene.search.spans.SpanNearQuery
at
 org.apache.lucene.search.highlight.WeightedSpanTermExtractor.collectSpanQueryFields(WeightedSpanTermExtractor.java:489)

at
 org.apache.lucene.search.highlight.WeightedSpanTermExtractor.collectSpanQueryFields(WeightedSpanTermExtractor.java:484)

at
 org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms(WeightedSpanTermExtractor.java:249)

at
 org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:230)

at
 org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158)

at
 org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:414)

at
 org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:216)

at
 org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:184)

at
 org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:226)

at
 org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:335)

at
 org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:89)

at
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:203)

at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)

at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at
 org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139)

... 32 more

 I see in our solrconf the following for highlighting.

  <highlighting>
   <!-- Configure the standard fragmenter -->
   <!-- This could most likely be commented out in the default case -->
   <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
    <lst name="defaults">
     <int name="hl.fragsize">100</int>
    </lst>
   </fragmenter>

   <!-- A regular-expression-based fragmenter (f.i., for sentence extraction) -->
   <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <!-- slightly smaller fragsizes work better because of slop -->
      <int name="hl.fragsize">70</int>
      <!-- allow 50% slop on fragment sizes -->
      <float name="hl.regex.slop">0.5</float>
      <!-- a basic sentence pattern -->
      <str name="hl.regex.pattern">[-\w ,/\n\']{20,200}</str>
    </lst>
   </fragmenter>

   <!-- Configure the standard formatter -->
   <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
    <lst name="defaults">
     <str name="hl.simple.pre"><![CDATA[<strong>]]></str>
     <str name="hl.simple.post"><![CDATA[</strong>]]></str>
    </lst>
   </formatter>
  </highlighting>

Re: Programmatically configuring SLF4J for Solr 1.4?

2009-11-02 Thread Don Werve
2009/11/1 Ryan McKinley ryan...@gmail.com

 I'm sure it is possible to configure JDK logging (java.util.loging)
 programatically... but I have never had much luck with it.

 It is very easy to configure log4j programatically, and this works great
 with solr.


Don't suppose I could trouble you for an example?  I'm not terribly familiar
with Java logging frameworks just yet.
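For illustration, a minimal programmatic log4j setup along the lines Ryan describes (not
from the thread; it assumes log4j 1.2.x and the slf4j-log4j12 binding are on the classpath
so Solr's SLF4J logging routes to log4j):

import org.apache.log4j.ConsoleAppender;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

public class LoggingSetup {
  // Call once, before the first Solr/SolrJ call that logs anything.
  public static void configure() {
    Logger root = Logger.getRootLogger();
    root.removeAllAppenders();
    root.addAppender(new ConsoleAppender(new PatternLayout("%d %-5p [%c{1}] %m%n")));
    root.setLevel(Level.INFO);
    // Example: quiet down one chatty package.
    Logger.getLogger("org.apache.solr.core").setLevel(Level.WARN);
  }
}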


RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi

Simple field (10 different values: Canada, USA, UK, ...), 64-bit JVM... no
difference between maxdoc and maxdoc + 1 for such estimate... difference is
between 0.4Gb and 1.2Gb...


So, let's vote ;)

A. [maxdoc] x [8 bytes ~ pointer to String object]

B. [maxdoc] x [8 bytes ~ pointer to Document object]

C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID] 
- same as [String1_Document_Count + ... + String10_Document_Count] x [4
bytes ~ DocumentID]

D. [maxdoc] x [4 bytes + 8 bytes ~ my initial naive thinking...]


Please confirm that it is Pointer to Object and not Lucene Document ID... I
hope it is (int) Document ID...





 -Original Message-
 From: Mark Miller [mailto:markrmil...@gmail.com]
 Sent: November-02-09 6:52 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Lucene FieldCache memory requirements
 
 It also briefly requires more memory than just that - it allocates an
 array the size of maxdoc+1 to hold the unique terms - and then sizes down.
 
 Possibly we can use the getUnuiqeTermCount method in the flexible
 indexing branch to get rid of that - which is why I was thinking it
 might be a good idea to drop the unsupported exception in that method
 for things like multi reader and just do the work to get the right
 number (currently there is a comment that the user should do that work
 if necessary, making the call unreliable for this).
 
 Fuad Efendi wrote:
  Thank you very much Mike,
 
  I found it:
  org.apache.solr.request.SimpleFacets
  ...
  // TODO: future logic could use filters instead of the
fieldcache if
  // the number of terms in the field is small enough.
  counts = getFieldCacheCounts(searcher, base, field,
offset,limit,
  mincount, missing, sort, prefix);
  ...
  FieldCache.StringIndex si =
  FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
  final String[] terms = si.lookup;
  final int[] termNum = si.order;
  ...
 
 
  So that 64-bit requires more memory :)
 
 
  Mike, am I right here?
  [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)]
  (64-bit JVM)
  1.2Gb RAM for this...
 
  Or, may be I am wrong:
 
  For Lucene directly, simple strings would consume an pointer (4 or 8
  bytes depending on whether your JRE is 64bit) per doc, and the string
  index would consume an int (4 bytes) per doc.
 
 
  [8 bytes (64bit)] x [number of documents (100mlns)]?
  0.8Gb
 
  Kind of Map between String and DocSet, saving 4 bytes... Key is
String,
  and Value is array of 64-bit pointers to Document. Why 64-bit (for
64-bit
  JVM)? I always thought it is (int) documentId...
 
  Am I right?
 
 
  Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990!
 
 
  Note that for your use case, this is exceptionally wasteful.
 
  This is probably very common case... I think it should be confirmed by
  Lucene developers too... FieldCache is warmed anyway, even when we don't
use
  SOLR...
 
 
  -Fuad
 
 
 
 
 
 
 
 
  -Original Message-
  From: Michael McCandless [mailto:luc...@mikemccandless.com]
  Sent: November-02-09 6:00 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Lucene FieldCache memory requirements
 
  OK I think someone who knows how Solr uses the fieldCache for this
  type of field will have to pipe up.
 
  For Lucene directly, simple strings would consume an pointer (4 or 8
  bytes depending on whether your JRE is 64bit) per doc, and the string
  index would consume an int (4 bytes) per doc.  (Each also consume
  negligible (for your case) memory to hold the actual string values).
 
  Note that for your use case, this is exceptionally wasteful.  If
  Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this)
  then it'd take much fewer bits to reference the values, since you have
  only 10 unique string values.
 
  Mike
 
  On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi f...@efendi.ca wrote:
 
  I am not using Lucene API directly; I am using SOLR which uses Lucene
  FieldCache for faceting on non-tokenized fields...
  I think this cache will be lazily loaded, until user executes sorted
(by
  this field) SOLR query for all documents *:* - in this case it will be
 
  fully
 
  populated...
 
 
 
  Subject: Re: Lucene FieldCache memory requirements
 
  Which FieldCache API are you using?  getStrings?  or getStringIndex
  (which is used, under the hood, if you sort by this field).
 
  Mike
 
  On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi f...@efendi.ca wrote:
 
  Any thoughts regarding the subject? I hope FieldCache doesn't use
 
  more
 
  than
 
  6 bytes per document-field instance... I am too lazy to research
 
  Lucene
 
  source code, I hope someone can provide exact answer... Thanks
 
 
 
  Subject: Lucene FieldCache memory requirements
 
  Hi,
 
 
  Can anyone confirm Lucene FieldCache memory requirements? I have
100
  millions docs with non-tokenized field country (10 different
 
  countries);
 
  I
 
  expect it requires array of (int, long), size of array
 
  100,000,000,

Getting update/extract RequestHandler to work under Tomcat

2009-11-02 Thread Glock, Thomas

Hoping someone might help with getting /update/extract RequestHandler to
work under Tomcat.

Error 500 happens when trying to access
http://localhost:8080/apache-solr-1.4-dev/update/extract/  (see below)

Note /update/extract DOES work correctly under the Jetty provided
example.

I think I must have a directory path incorrectly specified but not sure
where.

No errors in the Catalina log on startup - only this: 

Nov 2, 2009 7:10:49 PM org.apache.solr.core.RequestHandlers
initHandlersFromConfig
INFO: created /update/extract:
org.apache.solr.handler.extraction.ExtractingRequestHandler

Solrconfig.xml under tomcat is slightly changed from the example with
regards to lib elements:

  <lib dir="../contrib/extraction/lib" />
  <lib dir="../dist/" regex="apache-solr-cell-\d.*\.jar" />
  <lib dir="../dist/" regex="apache-solr-clustering-\d.*\.jar" />

The \contrib and \dist directories were copied directly below the
webapps\apache-solr-1.4-dev unchanged from the example.

In the catalina log I see all the Adding specified lib dirs... entries added
without error:

INFO: Adding specified lib dirs to ClassLoader
Nov 2, 2009 7:31:20 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader
INFO: Adding
'file:/C:/Program%20Files/Apache%20Software%20Foundation/Tomcat%206.0/we
bapps/apache-solr-1.4-dev/contrib/extraction/lib/asm-3.1.jar' to
classloader
Nov 2, 2009 7:31:20 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader
INFO: Adding
'file:/C:/Program%20Files/Apache%20Software%20Foundation/Tomcat%206.0/we
bapps/apache-solr-1.4-dev/contrib/extraction/lib/bcmail-jdk14-136.jar'
to classloader
Nov 2, 2009 7:31:20 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader
INFO: Adding
'file:/C:/Program%20Files/Apache%20Software%20Foundation/Tomcat%206.0/we
bapps/apache-solr-1.4-dev/contrib/extraction/lib/bcprov-jdk14-136.jar'
to classloader

(...many more...)

Solr Home is mapped to:

INFO: SolrDispatchFilter.init()
Nov 2, 2009 7:10:47 PM org.apache.solr.core.SolrResourceLoader
locateSolrHome
INFO: Using JNDI solr.home: .\webapps\apache-solr-1.4-dev\solr
Nov 2, 2009 7:10:47 PM
org.apache.solr.core.CoreContainer$Initializer initialize
INFO: looking for solr.xml: C:\Program Files\Apache Software
Foundation\Tomcat 6.0\.\webapps\apache-solr-1.4-dev\solr\solr.xml
Nov 2, 2009 7:10:47 PM org.apache.solr.core.SolrResourceLoader
init
INFO: Solr home set to '.\webapps\apache-solr-1.4-dev\solr\' 

500 Error:

HTTP Status 500 - lazy loading error
org.apache.solr.common.SolrException: lazy loading error at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappe
dHandler(RequestHandlers.java:249) at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleReq
uest(RequestHandlers.java:231) at
org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.ja
va:338) at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.j
ava:241) at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Applica
tionFilterChain.java:235) at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilt
erChain.java:206) at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValv
e.java:233) at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValv
e.java:191) at
org.apache.catalina.authenticator.AuthenticatorBase.invoke(Authenticator
Base.java:433) at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java
:128) at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java
:102) at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.
java:109) at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:2
93) at
org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.j
ava:859) at
org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.proce
ss(Http11AprProtocol.java:574) at
org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1527)
at java.lang.Thread.run(Unknown Source) Caused by:
org.apache.solr.common.SolrException: Error loading class
'org.apache.solr.handler.extraction.ExtractingRequestHandler' at
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.jav
a:373) at
org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413) at
org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:449) at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappe
dHandler(RequestHandlers.java:240) ... 17 more Caused by:
java.lang.ClassNotFoundException:
org.apache.solr.handler.extraction.ExtractingRequestHandler at
java.net.URLClassLoader$1.run(Unknown Source) at
java.security.AccessController.doPrivileged(Native Method) at
java.net.URLClassLoader.findClass(Unknown Source) at
java.lang.ClassLoader.loadClass(Unknown Source) at

Re: Lucene FieldCache memory requirements

2009-11-02 Thread Mark Miller
Fuad Efendi wrote:
 Simple field (10 different values: Canada, USA, UK, ...), 64-bit JVM... no
 difference between maxdoc and maxdoc + 1 for such estimate... difference is
 between 0.4Gb and 1.2Gb...

   
I'm not sure I understand - but I didn't mean to imply the +1 on maxdoc
meant anything. The issue is that in the end, it only needs a String
array the size of String[UniqueTerms] - but because it can't easily
figure out that number, it first creates an array of String[MaxDoc+1] -
so with a ton of docs and a few uniques, you get a temp boost in the RAM
reqs until it sizes it down. A pointer for each doc.

-- 
- Mark

http://www.lucidimagination.com





SolrJ looping until I get all the results

2009-11-02 Thread Paul Tomblin
If I want to do a query and only return X number of rows at a time,
but I want to keep querying until I get all the rows, how do I do that?
 Can I just keep advancing query.setStart(...) and then checking if
server.query(query) returns any rows?  Or is there a better way?

Here's what I'm thinking

final static int MAX_ROWS = 100;
int start = 0;
query.setRows(MAX_ROWS);
while (true)
{
    QueryResponse resp = solrChunkServer.query(query);
    SolrDocumentList docs = resp.getResults();
    if (docs.size() == 0)
        break;
    start += MAX_ROWS;
    query.setStart(start);
}



-- 
http://www.linkedin.com/in/paultomblin
http://careers.stackoverflow.com/ptomblin


RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
I just did some tests on a completely new index (Slave): sorting by a
non-tokenized field with few distinct values (such as Country) takes
milliseconds, but the first sort (ascending) on a heavily distributed
tokenized field took 30 seconds. A second sort (descending) took milliseconds.
Generic query *:*; FieldCache is not used for tokenized fields... so how is it
sorted? :)
Fortunately, no OOM.
-Fuad




RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
Mark,

I don't understand this: 
 so with a ton of docs and a few uniques, you get a temp boost in the RAM
 reqs until it sizes it down.

Sizes down??? Why is it called a Cache then? And how does SOLR use it if it is
not a cache?


And this:
 A pointer for each doc.

Why can't we use the (int) DocumentID? For me, it is natural; a 64-bit pointer to
an object in RAM is not natural (in the Lucene world)...


So, is it [maxdoc]x[4-bytes], or [maxdoc]x[8-bytes]?... 
-Fuad







Re: adding and updating a lot of document to Solr, metadata extraction etc

2009-11-02 Thread Lance Norskog
About large XML files and http overhead: you can tell solr to load the
file directly from a file system. This will stream thousands of
documents in one XML file without loading everything in memory at
once.
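A hedged sketch of that approach, driven from Java (it assumes remote streaming is
enabled via enableRemoteStreaming="true" in solrconfig.xml and that the file path is
readable by the Solr server process; host, port and path are illustrative):

import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class StreamFileImport {
  public static void main(String[] args) throws Exception {
    // The path is resolved on the Solr server, not on this client.
    String path = URLEncoder.encode("/data/export/docs.xml", "UTF-8");
    URL url = new URL("http://localhost:8983/solr/update?stream.file=" + path + "&commit=true");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    System.out.println("Solr responded with HTTP " + conn.getResponseCode());
    conn.disconnect();
  }
}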

This is a new book on Solr. It will help you through this early learning phase.

http://www.packtpub.com/solr-1-4-enterprise-search-server

On Mon, Nov 2, 2009 at 6:24 AM, Alexey Serba ase...@gmail.com wrote:
 Hi Eugene,

 - ability to iterate over all documents, returned in search, as Lucene does
  provide within a HitCollector instance. We would need to extract and
  aggregate various fields, stored in index, to group results and aggregate 
 them
  in some way.
 
 Also I did not find any way in the tutorial to access the search results with
 all fields to be processed by our application.

 http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr
 Check out Faceted Search, probably you can achieve your goal by using
 Facet Component

 There's also Field Collapsing patch
 http://wiki.apache.org/solr/FieldCollapsing


 Alex




-- 
Lance Norskog
goks...@gmail.com


Re: SolrJ looping until I get all the results

2009-11-02 Thread Paul Tomblin
On Mon, Nov 2, 2009 at 8:47 PM, Avlesh Singh avl...@gmail.com wrote:

 I was doing it that way, but what I'm doing with the documents is do
 some manipulation and put the new classes into a different list.
 Because I basically have two times the number of documents in lists,
 I'm running out of memory.  So I figured if I do it 1000 documents at
 a time, the SolrDocumentList will get garbage collected at least.

 You are right w.r.t to all that but I am surprised that you would need ALL
 the documents from the index for a search requirement.

This isn't a search, this is a search and destroy.  Basically I need
the file names of all the documents that I've indexed in Solr so that
I can delete them.

-- 
http://www.linkedin.com/in/paultomblin
http://careers.stackoverflow.com/ptomblin


Re: SolrJ looping until I get all the results

2009-11-02 Thread Avlesh Singh

 This isn't a search, this is a search and destroy.  Basically I need the
 file names of all the documents that I've indexed in Solr so that I can
 delete them.

Okay. I am sure you are aware of the fl parameter, which restricts which
fields are returned in the response. If you only need limited info, it
might be a good idea to use this parameter.
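A sketch of that combination in SolrJ (the field name and page size are illustrative,
not from the thread):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class FileNameCollector {
  public static void collect(SolrServer server) throws Exception {
    SolrQuery query = new SolrQuery("*:*");
    query.setFields("filename");   // only fetch the field needed for the delete pass
    query.setRows(1000);
    int start = 0;
    while (true) {
      query.setStart(start);
      QueryResponse resp = server.query(query);
      if (resp.getResults().isEmpty()) break;
      for (SolrDocument doc : resp.getResults()) {
        String name = (String) doc.getFieldValue("filename");
        // collect the name, or issue the delete, here
      }
      start += 1000;
    }
  }
}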

Cheers
Avlesh

On Tue, Nov 3, 2009 at 7:23 AM, Paul Tomblin ptomb...@xcski.com wrote:

 On Mon, Nov 2, 2009 at 8:47 PM, Avlesh Singh avl...@gmail.com wrote:
 
  I was doing it that way, but what I'm doing with the documents is do
  some manipulation and put the new classes into a different list.
  Because I basically have two times the number of documents in lists,
  I'm running out of memory.  So I figured if I do it 1000 documents at
  a time, the SolrDocumentList will get garbage collected at least.
 
  You are right w.r.t to all that but I am surprised that you would need
 ALL
  the documents from the index for a search requirement.

 This isn't a search, this is a search and destroy.  Basically I need
 the file names of all the documents that I've indexed in Solr so that
 I can delete them.

 --
 http://www.linkedin.com/in/paultomblin
 http://careers.stackoverflow.com/ptomblin



RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
I believe this is the correct estimate:

 C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID]

   same as 
 [String1_Document_Count + ... + String10_Document_Count + ...] 
 x [4 bytes per DocumentID]


So, for 100 million docs we need 400Mb for each(!) non-tokenized field.
Although FieldCacheImpl is based on a WeakHashMap (somewhere...), we can't
rely on it sizing down when SOLR faceting features are in use.


I think I finally found the answer...

  /** Expert: Stores term text values and document ordering data. */
  public static class StringIndex {
...   
/** All the term values, in natural order. */
public final String[] lookup;

/** For each document, an index into the lookup array. */
public final int[] order;
...
  }



Another API:
  /** Checks the internal cache for an appropriate entry, and if none
   * is found, reads the term values in <code>field</code> and returns an
array
   * of size <code>reader.maxDoc()</code> containing the value each document
   * has in the given field.
   * @param reader  Used to get field values.
   * @param field   Which field contains the strings.
   * @return The values in the given field for each document.
   * @throws IOException  If any error occurs.
   */
  public String[] getStrings (IndexReader reader, String field)
  throws IOException;


Looks similar; the cache size is [maxdoc]; however, the values stored are 8-byte
pointers on a 64-bit JVM.


  private Map<Class<?>,Cache> caches;
  private synchronized void init() {
    caches = new HashMap<Class<?>,Cache>(7);
    ...
    caches.put(String.class, new StringCache(this));
    caches.put(StringIndex.class, new StringIndexCache(this));
    ...
  }


StringCache and StringIndexCache use a WeakHashMap internally... but the objects
won't ever be garbage collected in a faceted production system...

SOLR SimpleFacets doesn't use the getStrings API, so the hope is that memory
requirements are minimized.


However, Lucene may use it internally for some queries (or, for instance, to
get access to a non-tokenized cached field without reading the index)... to be
safe, use this in your basic memory estimates:


[512Mb ~ 1Gb] + [non_tokenized_fields_count] x [maxdoc] x [8 bytes]
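As a worked example of that formula (the numbers are illustrative, not from the thread):

public class FieldCacheEstimate {
  public static void main(String[] args) {
    long maxDoc = 100000000L;     // 100 million documents
    int nonTokenizedFields = 3;   // e.g. country plus two similar fields
    long bytes = (1L << 30)       // ~1Gb baseline from the formula above
               + (long) nonTokenizedFields * maxDoc * 8L;
    System.out.println(bytes / (1024L * 1024L) + " Mb");   // prints 3312
  }
}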


-Fuad



 -Original Message-
 From: Fuad Efendi [mailto:f...@efendi.ca]
 Sent: November-02-09 7:37 PM
 To: solr-user@lucene.apache.org
 Subject: RE: Lucene FieldCache memory requirements
 
 
 Simple field (10 different values: Canada, USA, UK, ...), 64-bit JVM... no
 difference between maxdoc and maxdoc + 1 for such estimate... difference
is
 between 0.4Gb and 1.2Gb...
 
 
 So, let's vote ;)
 
 A. [maxdoc] x [8 bytes ~ pointer to String object]
 
 B. [maxdoc] x [8 bytes ~ pointer to Document object]
 
 C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID]
 - same as [String1_Document_Count + ... + String10_Document_Count] x [4
 bytes ~ DocumentID]
 
 D. [maxdoc] x [4 bytes + 8 bytes ~ my initial naive thinking...]
 
 
 Please confirm that it is Pointer to Object and not Lucene Document ID...
I
 hope it is (int) Document ID...
 
 
 
 
 
  -Original Message-
  From: Mark Miller [mailto:markrmil...@gmail.com]
  Sent: November-02-09 6:52 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Lucene FieldCache memory requirements
 
  It also briefly requires more memory than just that - it allocates an
  array the size of maxdoc+1 to hold the unique terms - and then sizes
down.
 
  Possibly we can use the getUnuiqeTermCount method in the flexible
  indexing branch to get rid of that - which is why I was thinking it
  might be a good idea to drop the unsupported exception in that method
  for things like multi reader and just do the work to get the right
  number (currently there is a comment that the user should do that work
  if necessary, making the call unreliable for this).
 
  Fuad Efendi wrote:
   Thank you very much Mike,
  
   I found it:
   org.apache.solr.request.SimpleFacets
   ...
   // TODO: future logic could use filters instead of the
 fieldcache if
   // the number of terms in the field is small enough.
   counts = getFieldCacheCounts(searcher, base, field,
 offset,limit,
   mincount, missing, sort, prefix);
   ...
   FieldCache.StringIndex si =
   FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName);
   final String[] terms = si.lookup;
   final int[] termNum = si.order;
   ...
  
  
   So that 64-bit requires more memory :)
  
  
   Mike, am I right here?
   [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents
(100mlns)]
   (64-bit JVM)
   1.2Gb RAM for this...
  
   Or, may be I am wrong:
  
   For Lucene directly, simple strings would consume an pointer (4 or 8
   bytes depending on whether your JRE is 64bit) per doc, and the string
   index would consume an int (4 bytes) per doc.
  
  
   [8 bytes (64bit)] x [number of documents (100mlns)]?
   0.8Gb
  
   Kind of Map between String and DocSet, saving 4 bytes... Key is
 String,
   and Value is array of 64-bit pointers to Document. Why 64-bit (for
 

RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
Hi Mark,

Yes, I understand it now; however, how will StringIndexCache size down in a
production system faceting by Country on a homepage? This is SOLR
specific...


Lucene specific: Lucene doesn't read from disk if it can retrieve the field
value for a specific document ID from the cache. How will it size down in a purely
Lucene-based, heavily loaded production system? Especially if this cache is
used for query optimizations.



 -Original Message-
 From: Mark Miller [mailto:markrmil...@gmail.com]
 Sent: November-02-09 8:53 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Lucene FieldCache memory requirements
 
  static final class StringIndexCache extends Cache {
 StringIndexCache(FieldCache wrapper) {
   super(wrapper);
 }
 
 @Override
 protected Object createValue(IndexReader reader, Entry entryKey)
 throws IOException {
   String field = StringHelper.intern(entryKey.field);
   final int[] retArray = new int[reader.maxDoc()];
   String[] mterms = new String[reader.maxDoc()+1];
   TermDocs termDocs = reader.termDocs();
   TermEnum termEnum = reader.terms (new Term (field));
   int t = 0;  // current term number
 
   // an entry for documents that have no terms in this field
   // should a document with no terms be at top or bottom?
   // this puts them at the top - if it is changed,
 FieldDocSortedHitQueue
   // needs to change as well.
   mterms[t++] = null;
 
   try {
 do {
   Term term = termEnum.term();
   if (term==null || term.field() != field) break;
 
   // store term text
   // we expect that there is at most one term per document
    if (t >= mterms.length) throw new RuntimeException ("there are more terms than " +
            "documents in field \"" + field + "\", but it's impossible to sort on " +
            "tokenized fields");
   mterms[t] = term.text();
 
   termDocs.seek (termEnum);
   while (termDocs.next()) {
 retArray[termDocs.doc()] = t;
   }
 
   t++;
 } while (termEnum.next());
   } finally {
 termDocs.close();
 termEnum.close();
   }
 
   if (t == 0) {
 // if there are no terms, make the term array
 // have a single null entry
 mterms = new String[1];
    } else if (t < mterms.length) {
 // if there are less terms than documents,
 // trim off the dead array space
 String[] terms = new String[t];
 System.arraycopy (mterms, 0, terms, 0, t);
 mterms = terms;
   }
 
   StringIndex value = new StringIndex (retArray, mterms);
   return value;
 }
   };
 
 The formula for a String Index fieldcache is essentially the String
 array of unique terms (which does indeed size down at the bottom) and
 the int array indexing into the String array.
 
 
 Fuad Efendi wrote:
  To be correct, I analyzed FieldCache awhile ago and I believed it never
  sizes down...
 
  /**
   * Expert: The default cache implementation, storing all values in
memory.
   * A WeakHashMap is used for storage.
   *
   * pCreated: May 19, 2004 4:40:36 PM
   *
   * @since   lucene 1.4
   */
 
 
  Will it size down? Only if we are not faceting (as in SOLR v.1.3)...
 
  And I am still unsure, Document ID vs. Object Pointer.
 
 
 
 
 
  I don't understand this:
 
  so with a ton of docs and a few uniques, you get a temp boost in the
RAM
  reqs until it sizes it down.
 
  Sizes down??? Why is it called Cache indeed? And how SOLR uses it if it
is
  not cache?
 
 
 
 
 
 
 
 --
 - Mark
 
 http://www.lucidimagination.com
 
 





RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
Even in the simplistic scenario where it is garbage collected, we still
need to be able to allocate enough RAM for the FieldCache on demand... a linear
dependency on document count...


 
 Hi Mark,
 
 Yes, I understand it now; however, how will StringIndexCache size down in
a
 production system faceting by Country on a homepage? This is SOLR
 specific...
 
 
 Lucene specific: Lucene doesn't read from disk if it can retrieve field
 value for a specific document ID from cache. How will it size down in
purely
 Lucene-based heavy-loaded production system? Especially if this cache is
 used for query optimizations.
 




Re: Why does BinaryRequestWriter force the path to be base URL + /update/javabin

2009-11-02 Thread Noble Paul നോബിള്‍ नोब्ळ्
yup, that can be relaxed. It was just a convention.

On Tue, Nov 3, 2009 at 5:24 AM, Stuart Tettemer stette...@gmail.com wrote:
 Hi folks,
 First of all, thanks for Solr.  It is a great piece of work.

 I have a question about BinaryRequestWriter in the solrj project.  Why does
 it force the path of UpdateRequests to have be /update/javabin (see
 BinaryRequestWriter.getPath(String) starting on line 109)?

 I am extending BinaryRequestWriter specifically to remove this requirement
 and am interested to know the reasoning behind in the inital choice.

 Thanks for your time,
 Stuart




-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: Question regarding snapinstaller

2009-11-02 Thread Lance Norskog
In POSIX-compliant systems (basically Unix system calls) a file exists
independently of its file names, and there can be multiple names for a file.
If a program has a file open, that file can be deleted but it will
still exist until the program closes it (or the program exits).

In the snapinstaller cycle, Solr holds the old index files open while
snapinstaller swaps in the new set. The 'commit' operation causes Solr
to (eventually) close all of the old index files and at that point
they will go away.
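A small JDK sketch of the unlink-while-open behaviour Lance describes (the temp path is
illustrative; the delete-then-read step behaves this way on POSIX systems, not on Windows):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;

public class UnlinkWhileOpen {
  public static void main(String[] args) throws Exception {
    File f = new File("/tmp/unlink-demo.txt");
    FileOutputStream out = new FileOutputStream(f);
    out.write("still readable".getBytes("UTF-8"));
    out.close();

    FileInputStream in = new FileInputStream(f);             // hold the file open
    System.out.println("deleted: " + f.delete());            // the name goes away
    System.out.println("first byte: " + (char) in.read());   // the data is still readable
    in.close();                                              // now the blocks are freed
  }
}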

On Mon, Nov 2, 2009 at 1:26 PM, Prasanna Ranganathan
pranganat...@netflix.com wrote:

  It looks like the snapinstaller script does an atomic remove and replace of
 the entire solr_home/data_dir/index folder with the contents of the new
 snapshot before issuing a commit command. I am trying to understand the
 implication of the same.

  What happens to queries that come during the time interval between the
 instant the existing directory is removed and the commit command gets
 finalized? Does a currently running instance of Solr not need the files in
 the index folder to serve the query results? Are all the contents of the
 index folder loaded into memory?

  Thanks in advance for any help.

 Regards,

 Prasanna.




-- 
Lance Norskog
goks...@gmail.com


Re: Annotations and reference types

2009-11-02 Thread Noble Paul നോബിള്‍ नोब्ळ्
I guess this is not a very good idea.

The document itself is a flat data structure. It is hard to treat it as a
nested data structure. If we allowed it, how deep would we wish to make
it?

The simple solution would be to write setters for b_id and b_name in class A
and have those setters inject the values into B.
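A sketch of that setter-based flattening (the names follow the original post; whether your
SolrJ version accepts @Field on setter methods, and whether the write direction also needs
matching getters, should be verified):

import org.apache.solr.client.solrj.beans.Field;

class B {
  private String id;
  private String name;
  void setId(String id) { this.id = id; }
  void setName(String name) { this.name = name; }
}

class A {
  @Field private String id;
  @Field private String name;

  private final B myB = new B();

  @Field("b_id")
  public void setBId(String bId) { myB.setId(bId); }

  @Field("b_name")
  public void setBName(String bName) { myB.setName(bName); }
}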

On Mon, Nov 2, 2009 at 10:05 PM, Shalin Shekhar Mangar
shalinman...@gmail.com wrote:
 On Thu, Oct 29, 2009 at 7:57 PM, M. Tinnemeyer marc-...@gmx.net wrote:

 Dear listusers,

 Is there a way to store an instance of class A (including the fields from
 myB) via solr using annotations ?
 The index should look like : id; name; b_id; b_name

 --
 Class A {

 @Field
 private String id;
 @Field
 private String name;
 @Field
 private B myB;
 }

 --
 Class B {

 @Field("b_id")
 private String id;
 @Field("b_name")
 private String name;
 }


 No.

 I guess you want to represent certain fields in class B and have them as an
 attribute in Class A (but all fields belong to the same schema), then it can
 be a worthwhile addition to Solrj. Can you open an issue? A patch would be
 even better :)

 --
 Regards,
 Shalin Shekhar Mangar.




-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: field queries seem slow

2009-11-02 Thread Lance Norskog
This searches author:albert and (default text field): einstein. This
may not be what you expect?

On Mon, Nov 2, 2009 at 2:30 PM, Erick Erickson erickerick...@gmail.com wrote:
 H, are you sorting? And has your readers been reopened? Is the
 second query of that sort also slow? If the answer to this last question is
 no,
 have you tried some autowarming queries?

 Best
 Erick

 On Mon, Nov 2, 2009 at 4:34 PM, mike anderson saidthero...@gmail.comwrote:

 I took a look through my Solr logs this weekend and noticed that the
 longest
 queries were on particular fields, like author:albert einstein. Is this a
 result consistent with other setups out there? If not, Is there a trick to
 make these go faster? I've read up on filter queries and use those when
 applicable, but they don't really solve all my problems.

 If anybody wants to take a shot at it but needs to see my solrconfig, etc
 just let me know.

 Cheers,
 Mike





-- 
Lance Norskog
goks...@gmail.com


Re: tracking solr response time

2009-11-02 Thread Yonik Seeley
On Mon, Nov 2, 2009 at 2:21 PM, bharath venkatesh
bharathv6.proj...@gmail.com wrote:
 we observed many times there is huge mismatch between qtime and
 time measured at the client for the response

Long times to stream back the result to the client could be due to
 - client not reading fast enough
 - network congestion
 - reading the stored fields takes a long time
- this can happen with really big indexes that can't all fit in
memory, and stored fields tend to not be cached well by the OS
(essentially random access patterns over a huge area).  This ends up
causing a disk seek per document being
streamed back.
 - locking contention for reading the index (under Solr 1.3, but not
under 1.4 on non-windows platforms)

I didn't see where you said what Solr version you were using.  There
are some pretty big concurrency differences between 1.3 and 1.4 too
(if your tests involve many concurrent requests).

-Yonik
http://www.lucidimagination.com


RE: Lucene FieldCache memory requirements

2009-11-02 Thread Fuad Efendi
FieldCache internally uses a WeakHashMap... nothing wrong with that, but... no
Garbage Collection tuning will help if the allocated RAM is not enough
for replacing Weak** with Strong**, especially for SOLR faceting... 10%-15% of
CPU taken by GC has been reported...
-Fuad





Proper way to set up Multi Core / Core admin

2009-11-02 Thread Jonathan Hendler
Getting started with multi core setup following http://wiki.apache.org/solr/CoreAdmin 
 and the book. Generally everything makes sense, but I have one  
question.


Here's how easy it was:

1. place the solr.war into the server
2. create your core directories in the newly created solr/ directory
3. set up solr.xml, the config files for a data import handler, the
   [core]/conf/solrconfig.xml, [core]/conf/schema.xml, etc.
4. copy the /admin directory present in /solr into each /solr/[core]
   directory


Is step 4 a correct step in the setting up of a multi core environment?

TIA

Re: Proper way to set up Multi Core / Core admin

2009-11-02 Thread Jonathan Hendler

Sorry for the confusion - step four is to be avoided, obviously.


On Nov 2, 2009, at 11:46 PM, Jonathan Hendler wrote:

Getting started with multi core setup following http://wiki.apache.org/solr/CoreAdmin 
 and the book. Generally everything makes sense, but I have one  
question.


Here's how easy it was:

place the solr.war into the server
create your core directories in the newly created solr/ directory
set up solr.xml, the config files for a data import handler, the  
[core]/conf/solrconfig.xml [core]/conf/schema.xml, etc
copy the /admin directory present in /solr into each /solr/[core]  
directory


Is step 4 a correct step in the setting up of a multi core  
environment?


TIA




Re: Match all terms in doc

2009-11-02 Thread Shalin Shekhar Mangar
On Sun, Nov 1, 2009 at 3:33 AM, Magnus Eklund magnus.ekl...@gmail.comwrote:

 Hi

 How do I restrict hits to documents containing all words (regardless of
 order) of a query in particular field?

 Suppose I have two documents with a field called name in my index:

 doc1 = name: Pink
 doc2 = name: Pink Floyd

 When querying for Pink I want only doc1 and when querying for Pink
 Floyd or Floyd Pink I want doc2.


You can query like:
+name:Floyd +name:Pink

The + character means a must have condition. This will match documents which
have Floyd as well as Pink in any order.
-- 
Regards,
Shalin Shekhar Mangar.


Re: solrj query size limit?

2009-11-02 Thread Avlesh Singh
Did you hit the limit for maximum number of characters in a GET request?

Cheers
Avlesh
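If the GET URL length is indeed the culprit, one possible workaround (a sketch, not from
the thread; it assumes the QueryRequest constructor that takes a METHOD, which SolrJ
exposes for switching to POST) is:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BigOrQuery {
  public static QueryResponse run(SolrServer server, String accountIdClause) throws Exception {
    SolrQuery query = new SolrQuery(accountIdClause);   // e.g. "accountId:(a OR b OR ...)"
    // Send the parameters in the POST body instead of the URL.
    QueryRequest req = new QueryRequest(query, SolrRequest.METHOD.POST);
    return req.process(server);
  }
}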

On Tue, Nov 3, 2009 at 9:36 AM, Gregg Horan greggho...@gmail.com wrote:

 I'm constructing a query using solrj that has a fairly large number of 'OR'
 clauses.  I'm just adding it as a big string to setQuery(), in the format
 accountId:(this OR that OR yada).

 This works all day long with 300 values.  When I push it up to 350-400
 values, I get a Bad Request SolrServerException.  It appears to just be a
 client error - nothing reaching the server logs.  Very repeatable... dial
 it
 back down and it goes through again fine.

 The total string length of the query (including a handful of other faceting
 entries) is about 9500chars.   I do have the maxBooleanClauses jacked up to
 2048.  Using javabin.  1.4-dev.

 Are there any other options or settings I might be overlooking?

 -Gregg



Re: Problems downloading lucene 2.9.1

2009-11-02 Thread Licinio Fernández Maurelo
Thanks guys !!!

2009/11/2 Ryan McKinley ryan...@gmail.com


 On Nov 2, 2009, at 8:29 AM, Grant Ingersoll wrote:


 On Nov 2, 2009, at 12:12 AM, Licinio Fernández Maurelo wrote:

  Hi folks,

 as we are using an snapshot dependecy to solr1.4, today we are getting
 problems when maven try to download lucene 2.9.1 (there isn't a any 2.9.1
 there).

 Which repository can i use to download it?


 They won't be there until 2.9.1 is officially released.  We are trying to
 speed up the Solr release by piggybacking on the Lucene release, but this
 little bit is the one downside.


 Until then, you can add a repo to:

 http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/maven/





-- 
Lici


Re: Problems downloading lucene 2.9.1

2009-11-02 Thread Licinio Fernández Maurelo
Well, I've solved this problem by executing

  mvn install:install-file -DgroupId=org.apache.lucene -DartifactId=lucene-analyzers \
      -Dversion=2.9.1 -Dpackaging=jar -Dfile=path_to_jar

for each lucene-* artifact.

I think there must be an easier way to do this, am I wrong?

Hope it helps

Thx

El 3 de noviembre de 2009 08:03, Licinio Fernández Maurelo 
licinio.fernan...@gmail.com escribió:

 Thanks guys !!!

 2009/11/2 Ryan McKinley ryan...@gmail.com


 On Nov 2, 2009, at 8:29 AM, Grant Ingersoll wrote:


 On Nov 2, 2009, at 12:12 AM, Licinio Fernández Maurelo wrote:

  Hi folks,

 as we are using an snapshot dependecy to solr1.4, today we are getting
 problems when maven try to download lucene 2.9.1 (there isn't a any
 2.9.1
 there).

 Which repository can i use to download it?


 They won't be there until 2.9.1 is officially released.  We are trying to
 speed up the Solr release by piggybacking on the Lucene release, but this
 little bit is the one downside.


 Until then, you can add a repo to:

 http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/maven/





 --
 Lici




-- 
Lici