Re: search for a number within a range, where range values are mentioned in documents

2010-12-16 Thread lee carroll
During data import, can you update a record with min and max fields? These
would be equal in the case of a single non-range value.

I know this is not a Solr solution but a data pre-processing one, but would it
work?

Failing the above, I've seen in the docs a reference to a compound value field
(in the context of points, i.e. point = lat,lon), which would be a nice way to
store your range fields, although I still think you will need to pre-process
your data.
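
Roughly what I have in mind, as a SolrJ sketch of the query side (the field
names range_min / range_max are made up and assumed to be numeric/trie fields;
note that with several ranges per record you'd probably index one document per
range, so that a multi-valued min/max can't cross-match):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RangeContainsQuery {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        long value = 5003; // the number the user searched for
        // match documents whose [range_min, range_max] interval contains the value;
        // single values would have been stored with range_min == range_max
        SolrQuery q = new SolrQuery(
                "range_min:[* TO " + value + "] AND range_max:[" + value + " TO *]");
        QueryResponse rsp = server.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}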

cheers lee

On 15 December 2010 18:22, Jonathan Rochkind rochk...@jhu.edu wrote:

 I'm not sure you're right that it will result in an out-of-memory error if
 the range is too large. I don't think it will, I think it'll be fine as far
 as memory goes, because of how Lucene works. Or do you actually have reason
 to believe it was causing you memory issues?  Or do you just mean memory
 issues in your transformer, not actually in Solr?

 Using Trie fields should also make it fine as far as CPU time goes.  Using
 a trie int field with a non-zero precision should likely be helpful in
 this case.

 It _will_ increase the on-disk size of your indexes.

 I'm not sure if there's a better approach, i can't think of one, but maybe
 someone else knows one.


 On 12/15/2010 12:56 PM, Arunkumar Ayyavu wrote:

 Hi!

 I have a typical case where an attribute (in a DB record) can
 contain different ranges of numeric values. Let us say the range
 values in this attribute for record1 are
 (2-4,5000-8000,45000-5,454,231,1000). As you can see this
 attribute can also contain isolated numeric values such as 454, 231
 and 1000. Now, I want to return record1 if the user searches for
 20001 or 5003 or 231 or 5. Right now, I'm exploding the range
 values (within a transformer) and indexing record1 for each of the
 values within a range. But this could result in an out-of-memory error if
 the range is too large. Could you help me figure out a better way of
 addressing this type of query using Solr?

 Thanks a ton.




Re: Memory use during merges (OOM)

2010-12-16 Thread Upayavira
How long does it take to reach this OOM situation? Is it possible for
you to try a merge with each setting in turn, and evaluate what impact
they each have? That is, indexing speed and memory consumption? It might
be interesting to watch garbage collection too while it is running with
jstat, as that could be your speed bottleneck.

Upayavira

On Wed, 15 Dec 2010 18:52 -0500, Burton-West, Tom tburt...@umich.edu
wrote:
 Hello all,
 
 Are there any general guidelines for determining the main factors in
 memory use during merges?
 
 We recently changed our indexing configuration to speed up indexing but
 in the process of doing a very large merge we are running out of memory.
 Below is a list of the changes and part of the indexwriter log.  The
 changes increased the indexing throughput by almost an order of
 magnitude.
 (about 600 documents per hour to about 6000 documents per hour.  Our
 documents are about 800K)
 
 We are trying to determine which of the changes to tweak to avoid the
 OOM, but still keep the benefit of the increased indexing throughput
 
 Is it likely that the changes to ramBufferSizeMB are the culprit or could
 it be the mergeFactor change from 10-20?
 
  Is there any obvious relationship between ramBufferSizeMB and the memory
  consumed by Solr?
  Are there rules of thumb for the memory needed in terms of the number or
  size of segments?
 
 Our largest segments prior to the failed merge attempt were between 5GB
 and 30GB.  The memory allocated to the Solr/tomcat JVM is 10GB.
 
 Tom Burton-West
 -
 
 Changes to indexing configuration:
 mergeScheduler
 before: serialMergeScheduler
 after:concurrentMergeScheduler
 mergeFactor
 before: 10
 after : 20
 ramBufferSizeMB
 before: 32
   after: 320
 
 excerpt from indexWriter.log
 
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010;
 http-8091-Processor70]: LMP: findMerges: 40 segments
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010;
 http-8091-Processor70]: LMP:   level 7.23609 to 7.98609: 20 segments
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010;
 http-8091-Processor70]: LMP: 0 to 20: add this merge
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010;
 http-8091-Processor70]: LMP:   level 5.44878 to 6.19878: 20 segments
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010;
 http-8091-Processor70]: LMP: 20 to 40: add this merge
 
 ...
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010;
 http-8091-Processor70]: applyDeletes
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010;
 http-8091-Processor70]: DW: apply 1320 buffered deleted terms and 0
 deleted docIDs and 0 deleted queries on 40 segments.
 Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010;
 http-8091-Processor70]: hit exception flushing deletes
 Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010;
 http-8091-Processor70]: hit OutOfMemoryError inside updateDocument
 tom
 


RE: Dataimport performance

2010-12-16 Thread Ephraim Ofir
Check out 
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3c9f8b39cb3b7c6d4594293ea29ccf438b01702...@icq-mail.icq.il.office.aol.com%3e
This approach of not using sub entities really improved our load time.

Ephraim Ofir

-Original Message-
From: Robert Gründler [mailto:rob...@dubture.com] 
Sent: Wednesday, December 15, 2010 4:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Dataimport performance

i've benchmarked the import already with 500k records, one time without the 
artists subquery, and one time without the join in the main query:


Without subquery: 500k in 3 min 30 sec

Without join and without subquery: 500k in 2 min 30.

With subquery and with left join:   320k in 6 Min 30


so the joins / subqueries are definitely a bottleneck. 

How exactly did you implement the custom data import? 

In our case, we need to de-normalize the relations of the sql data for the 
index, 
so i fear i can't really get rid of the join / subquery.


-robert





On Dec 15, 2010, at 15:43 , Tim Heckman wrote:

 2010/12/15 Robert Gründler rob...@dubture.com:
 The data-config.xml looks like this (only 1 entity):
 
  <entity name="track" query="select t.id as id, t.title as title,
    l.title as label from track t left join label l on (l.id = t.label_id) where
    t.deleted = 0" transformer="TemplateTransformer">
    <field column="title" name="title_t" />
    <field column="label" name="label_t" />
    <field column="id" name="sf_meta_id" />
    <field column="metaclass" template="Track" name="sf_meta_class"/>
    <field column="metaid" template="${track.id}" name="sf_meta_id"/>
    <field column="uniqueid" template="Track_${track.id}"
      name="sf_unique_id"/>

    <entity name="artists" query="select a.name as artist from artist a
      left join track_artist ta on (ta.artist_id = a.id) where
      ta.track_id=${track.id}">
      <field column="artist" name="artists_t" />
    </entity>

  </entity>
 
 So there's one track entity with an artist sub-entity. My (admittedly
 rather limited) experience has been that sub-entities, where you have
 to run a separate query for every row in the parent entity, really
 slow down data import. For my own purposes, I wrote a custom data
 import using SolrJ to improve the performance (from 3 hours to 10
 minutes).
 
 Just as a test, how long does it take if you comment out the artists entity?



PHPSolrClient

2010-12-16 Thread Dennis Gearon
First of all, it's a very nice piece of work.

I am just getting my feet wet with Solr in general, so I'm not even sure how a
document is NORMALLY deleted.

The library's PHPDocs say 'add', 'get', 'delete', but does anyone know about
'update'?
 (obviously one can read-delete-modify-create)

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



Re: Thank you!

2010-12-16 Thread Dennis Gearon
I feel the same way about this group and the Postgres group.

VERY helpful people. All of us helping each other.

 Dennis Gearon


Signature Warning


- Original Message 
From: Adam Estrada estrada.a...@gmail.com
Subject: Thank you!

I just want to say that this mailing list has been invaluable to a newbie like
me ;-) I posted a question earlier today and literally 10 minutes later I
got an answer that helped me solve my problem. This is proof that there is an
experienced and energetic community behind this FOSS group of projects and I
really appreciate everyone who has put up with my otherwise trivial
questions!  More importantly, thanks to all of the contributors who make the
whole thing possible!  I attended the Lucene Revolution conference in Boston
this year and the information that I was able to take away from the whole
thing has made me and my vocation a lot more valuable. Keep up the
outstanding work in the discovery of useful information from a sea of bleh
;-)

Kindest regards,
Adam



Re: PHPSolrClient

2010-12-16 Thread Tanguy Moal
Hi Dennis,

This isn't particular to the client you use (solr-php-client) for sending
documents: think of an update as an overwrite.

This means that if you update a particular document, the previous
version indexed is lost.
Therefore, when updating a document, make sure that all the fields to
be indexed and retrieved are present in the update.

For an update to occur, only the uniqueKey id (as specified in your
schema.xml) has to be the same as the document you want to update.

In short, an update is like an add (and performed the same way), except
that the added document was previously indexed. It simply gets
replaced by the update.
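
For illustration, a minimal SolrJ sketch of the same overwrite semantics (the
field names are invented; the PHP client behaves the same way, just with its
own API):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class OverwriteExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "42");            // the uniqueKey from schema.xml
        doc.addField("title", "new title");  // include *all* fields you want kept,
        doc.addField("body", "new body");    // the previous version is discarded
        server.add(doc);   // same id as an existing doc => it gets replaced
        server.commit();
    }
}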

Hope that helps,

--
Tanguy

2010/12/16 Dennis Gearon gear...@sbcglobal.net:
 First of all, it's a very nice piece of work.

 I am just getting my feet wet with Solr in general. So I 'am not even sure 
 how a
 document is NORMALLY deleted.

 The library PHPDocs say 'add', 'get' 'delete', But does anyone know about
 'update'?
  (obviously one can read-delete-modify-create)

  Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a 
 better
 idea to learn from others’ mistakes, so you do not have to make them yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.




indexing a lot of XML documents

2010-12-16 Thread Jörg Agatz
Hi users, I'm searching for a way to index a lot of XML documents as fast as
possible.

I have more than 1 million docs on server 1 and a Solr multicore setup on server 2
with Tomcat.

I don't know how I can do it easily and fast.

I can't find an idea in the wiki; maybe you have some ideas?

King


Re: Memory use during merges (OOM)

2010-12-16 Thread Michael McCandless
RAM usage for merging is tricky.

First off, merging must hold open a SegmentReader for each segment
being merged.  However, it's not necessarily a full segment reader;
for example, merging doesn't need the terms index nor norms.  But it
will load deleted docs.

But, if you are doing deletions (or updateDocument, which is just a
delete + add under-the-hood), then this will force the terms index of
the segment readers to be loaded, thus consuming more RAM.
Furthermore, if the deletions you (by Term/Query) do in fact result in
deleted documents (ie they were not false deletions), then the
merging allocates an int[maxDoc()] for each SegmentReader that has
deletions.

Finally, if you have multiple merges running at once (see
CSM.setMaxMergeCount) that means RAM for each currently running merge
is tied up.

So I think the gist is... the RAM usage will be in proportion to the
net size of the merge (mergeFactor + how big each merged segment is),
how many merges you allow concurrently, and whether you do false or
true deletions.

If you are doing false deletions (calling .updateDocument when in fact
the Term you are replacing cannot exist) it'd be best if possible to
change the app to not call .updateDocument if you know the Term
doesn't exist.
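
A rough sketch of that idea in plain Lucene terms (the "id" field is just an
example, and it assumes you have a reasonably fresh IndexReader at hand; in
practice the app often knows from its own bookkeeping whether the id exists):

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class AvoidFalseDeletes {
    // Only pay the updateDocument (delete + add) cost when the id term can
    // actually exist in the index; otherwise a plain addDocument is enough.
    public static void addOrUpdate(IndexWriter writer, IndexReader reader,
                                   String id, Document doc) throws Exception {
        Term idTerm = new Term("id", id);
        if (reader.docFreq(idTerm) > 0) {
            writer.updateDocument(idTerm, doc); // true delete + add
        } else {
            writer.addDocument(doc);            // skips the buffered delete entirely
        }
    }
}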

Mike

On Wed, Dec 15, 2010 at 6:52 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Hello all,

 Are there any general guidelines for determining the main factors in memory 
 use during merges?

 We recently changed our indexing configuration to speed up indexing but in 
 the process of doing a very large merge we are running out of memory.
 Below is a list of the changes and part of the indexwriter log.  The changes 
 increased the indexing throughput by almost an order of magnitude.
 (about 600 documents per hour to about 6000 documents per hour.  Our 
 documents are about 800K)

 We are trying to determine which of the changes to tweak to avoid the OOM, 
 but still keep the benefit of the increased indexing throughput

 Is it likely that the changes to ramBufferSizeMB are the culprit or could it 
 be the mergeFactor change from 10-20?

  Is there any obvious relationship between ramBufferSizeMB and the memory 
 consumed by Solr?
  Are there rules of thumb for the memory needed in terms of the number or 
 size of segments?

 Our largest segments prior to the failed merge attempt were between 5GB and 
 30GB.  The memory allocated to the Solr/tomcat JVM is 10GB.

 Tom Burton-West
 -

 Changes to indexing configuration:
 mergeScheduler
        before: serialMergeScheduler
        after:    concurrentMergeScheduler
 mergeFactor
        before: 10
            after : 20
 ramBufferSizeMB
        before: 32
              after: 320

 excerpt from indexWriter.log

 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP: findMerges: 40 segments
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP:   level 7.23609 to 7.98609: 20 segments
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP:     0 to 20: add this merge
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP:   level 5.44878 to 6.19878: 20 segments
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP:     20 to 40: add this merge

 ...
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: applyDeletes
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: DW: apply 1320 buffered deleted terms and 0 deleted 
 docIDs and 0 deleted queries on 40 segments.
 Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; 
 http-8091-Processor70]: hit exception flushing deletes
 Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; 
 http-8091-Processor70]: hit OutOfMemoryError inside updateDocument
 tom




Re: Results from More than One Core?

2010-12-16 Thread Jörg Agatz
OK, it worked great at the beginning, but now I get a big error :-(


HTTP Status 500 - null java.lang.NullPointerException at
org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:462)
at
org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:298)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:290)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:636)


Determining core name from a result?

2010-12-16 Thread Mark Allan

Hi all,

I've been bashing my head against the wall for a few hours now, trying  
to get mlt (more-like-this) queries working across multiple cores.  
I've since seen a JIRA issue and documentation saying that multicore  
doesn't yet support mlt queries.  Oops!


Anyway, to get around this, I was planning to send the mlt query just  
to the specific core that a particular result came from, but I can't  
see a way to obtain that information from the results.  If I figure it  
out by hand, I can get a MLT query to produce similar documents from  
that core which is probably good enough for the time being.


Does anyone know how, after performing a multi-core search to retrieve  
a single document, I can then find out which core that result came from?


I'm using Solr branch_3x.

Many thanks

Mark


--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: Google like search

2010-12-16 Thread satya swaroop
Hi All,

 Thanks for your suggestions. I got the result I expected.

Cheers,
Satya


Re: PHPSolrClient

2010-12-16 Thread Erick Erickson
As Tanguy says, simply re-adding a document with the same
uniqueKey will automatically delete/readd the doc.

But I wanted to add a caution about your phrase read-delete-modify-create:
you only get back what you *stored*. So generally the update is done
from the original source rather than from the index.

So it's simple: if the uniqueKey is there, just add the doc.

Best
Erick

On Thu, Dec 16, 2010 at 4:14 AM, Dennis Gearon gear...@sbcglobal.netwrote:

 First of all, it's a very nice piece of work.

 I am just getting my feet wet with Solr in general. So I 'am not even sure
 how a
 document is NORMALLY deleted.

 The library PHPDocs say 'add', 'get' 'delete', But does anyone know about
 'update'?
  (obviously one can read-delete-modify-create)

  Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a
 better
 idea to learn from others’ mistakes, so you do not have to make them
 yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.




Re: Determining core name from a result?

2010-12-16 Thread Grant Ingersoll
How are you querying the core to begin with?

On Dec 16, 2010, at 6:46 AM, Mark Allan wrote:

 Hi all,
 
 I've been bashing my head against the wall for a few hours now, trying to get 
 mlt (more-like-this) queries working across multiple cores. I've since seen a 
 JIRA issue and documentation saying that multicore doesn't yet support mlt 
 queries.  Oops!
 
 Anyway, to get around this, I was planning to send the mlt query just to the 
 specific core that a particular result came from, but I can't see a way to 
 obtain that information from the results.  If I figure it out by hand, I can 
 get a MLT query to produce similar documents from that core which is probably 
 good enough for the time being.
 
 Does anyone know how, after performing a multi-core search to retrieve a 
 single document, I can then find out which core that result came from?
 
 I'm using Solr branch_3x.
 
 Many thanks
 
 Mark
 
 
 -- 
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.
 

--
Grant Ingersoll
http://www.lucidimagination.com/



Re: Thank you!

2010-12-16 Thread kenf_nc

Hear hear! In the beginning of my journey with Solr/Lucene I couldn't have
done it without this site. Smiley and Pugh's book was useful, but this forum
was invaluable.  I don't have as many questions now, but each new venture,
Geospatial searching, replication and redundancy, performance tuning, brings
me back again and again. This and stackoverflow.com have to be two of the
most useful destinations on the internet for developers. Communities are so
much more relevant than reference materials, and the consistent activity in
this community is impressive.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Thank-you-tp2096329p2098512.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Determining core name from a result?

2010-12-16 Thread Mark Allan

Hi Grant,

Thanks for your reply. I'm using solrj to connect via http, which  
eventually sends this query


http://localhost:8984/solr/core0/select/?q=id:022-80633905&version=2&start=0&rows=1&fl=*&indent=on&shards=localhost:8984/solr/core0,localhost:8984/solr/core1,localhost:8984/solr/core2,localhost:8984/solr/core3,localhost:8984/solr/core4

I subsequently send the MLT query which ends up looking like:

http://localhost:8984/solr/core0/mlt/?q=id:022-80633905&version=2&start=0&rows=5&fl=id&indent=on&mlt.fl=description&mlt.match.include=false&mlt.minwl=3&mlt.mintf=1&mlt.mindf=1&shards=localhost:8984/solr/core0,localhost:8984/solr/core1,localhost:8984/solr/core2,localhost:8984/solr/core3,localhost:8984/solr/core4

If I run that query in a browser, the response returned is
<response>
  <responseHeader>
    <status>0</status>
    <QTime>3</QTime>
  </responseHeader>
  <null name='response'/>
</response>

Now, because I know the document with id 022-80633905 went into
core 1, I get the correct results if I change the first part of the  
URL to http://localhost:8984/solr/core1/mlt but doing so requires my  
app (not just me!) to know which core the result came from.


Thanks
Mark

On 16 Dec 2010, at 1:44 pm, Grant Ingersoll wrote:


How are you querying the core to begin with?

On Dec 16, 2010, at 6:46 AM, Mark Allan wrote:


Hi all,

I've been bashing my head against the wall for a few hours now,  
trying to get mlt (more-like-this) queries working across multiple  
cores. I've since seen a JIRA issue and documentation saying that  
multicore doesn't yet support mlt queries.  Oops!


Anyway, to get around this, I was planning to send the mlt query  
just to the specific core that a particular result came from, but I  
can't see a way to obtain that information from the results.  If I  
figure it out by hand, I can get a MLT query to produce similar  
documents from that core which is probably good enough for the time  
being.


Does anyone know how, after performing a multi-core search to  
retrieve a single document, I can then find out which core that  
result came from?


I'm using Solr branch_3x.

Many thanks

Mark



--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



STUCK Threads at org.apache.lucene.document.CompressionTools.decompress

2010-12-16 Thread Alexander Ramos Jardim
Hello guys,

I am getting threads stuck forever at *
org.apache.lucene.document.CompressionTools.decompress*. I am using
Weblogic 10.02, with solr deployed as ear and no work manager specifically
configured for this instance.

I'm only doing simple queries at this node (q=itemId:9 or q=skuId:9). My
index is 3 GB.

Below is a thread dump of the stuck threads. Has anyone ever had this
kind of problem?


'[STUCK] ExecuteThread: '0' for queue: 'weblogic.kernel.Default
(self-tuning)'' Id=19, RUNNABLE on lock=, total cpu time=187228990.ms
user time=186506940.ms
at java.util.zip.Inflater.inflateFast(Native Method)
at java.util.zip.Inflater.inflateBytes(Inflater.java:360)
at java.util.zip.Inflater.inflate(Inflater.java:218)
at java.util.zip.Inflater.inflate(Inflater.java:235)
at
org.apache.lucene.document.CompressionTools.decompress(CompressionTools.java:108)
at org.apache.lucene.index.FieldsReader.uncompress(FieldsReader.java:607)
at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:368)
at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:229)
at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948)
at
org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506)
at org.apache.lucene.index.IndexReader.document(IndexReader.java:947)
at
org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:444)
at org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:427)
at
org.apache.solr.util.SolrPluginUtils.optimizePreFetchDocs(SolrPluginUtils.java:267)
at
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:269)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at
weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
at
weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.run(WebAppServletContext.java:3402)
at
weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:321)
at weblogic.security.service.SecurityManager.runAs(Unknown Source)
at
weblogic.servlet.internal.WebAppServletContext.securedExecute(WebAppServletContext.java:2140)
at
weblogic.servlet.internal.WebAppServletContext.execute(WebAppServletContext.java:2046)
at
weblogic.servlet.internal.ServletRequestImpl.run(ServletRequestImpl.java:1398)
at weblogic.work.ExecuteThread.execute(ExecuteThread.java:200)
at weblogic.work.ExecuteThread.run(ExecuteThread.java:172)
'weblogic.time.TimeEventGenerator' Id=20, TIMED_WAITING on
lock=weblogic.time.common.internal.timeta...@f051231a, total cpu
time=60.ms user time=60.ms
at java.lang.Object.wait(Native Method)
at weblogic.time.common.internal.TimeTable.snooze(TimeTable.java:286)
at
weblogic.time.common.internal.TimeEventGenerator.run(TimeEventGenerator.java:117)
at java.lang.Thread.run(Thread.java:595)
'JMAPI event thread' Id=21, RUNNABLE on lock=, total cpu time=1220.ms
user time=880.ms
'weblogic.timers.TimerThread' Id=22, TIMED_WAITING on
lock=weblogic.timers.internal.timerthr...@f050f3e4, total cpu
time=1390.ms user time=1080.ms
at java.lang.Object.wait(Native Method)
at weblogic.timers.internal.TimerThread$Thread.run(TimerThread.java:265)
'[STUCK] ExecuteThread: '4' for queue: 'weblogic.kernel.Default
(self-tuning)'' Id=74, RUNNABLE on lock=, total cpu time=180761590.ms
user time=180706770.ms
at java.util.zip.Inflater.inflateFast(Native Method)
at java.util.zip.Inflater.inflateBytes(Inflater.java:360)
at java.util.zip.Inflater.inflate(Inflater.java:218)
at java.util.zip.Inflater.inflate(Inflater.java:235)
at
org.apache.lucene.document.CompressionTools.decompress(CompressionTools.java:108)
at org.apache.lucene.index.FieldsReader.uncompress(FieldsReader.java:607)
at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:383)
at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:229)
at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948)
at
org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506)
at org.apache.lucene.index.IndexReader.document(IndexReader.java:947)
at
org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:444)
at org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:427)
at
org.apache.solr.util.SolrPluginUtils.optimizePreFetchDocs(SolrPluginUtils.java:267)
at
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:269)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at 

Why does Solr commit block indexing?

2010-12-16 Thread Renaud Delbru

Hi,

See log at [1].
We are using the latest snapshot of lucene_branch3.1. We have configured 
Solr to use the ConcurrentMergeScheduler:

mergeScheduler class=org.apache.lucene.index.ConcurrentMergeScheduler/

When a commit() runs, it blocks indexing (all incoming update requests
are blocked until the commit operation is finished) ... at the end of
the log we notice a 4 minute gap during which none of the Solr clients
trying to add data receive any attention.
This is a bit annoying as it leads to timeout exceptions on the client
side. Here, the commit time is only 4 minutes, but it can be larger if
there are merges of large segments.
I thought Solr was able to handle commits and updates at the same time:
the commit operation should be done in the background, and the server
should still continue to receive update requests (maybe at a slower rate
than normal). But it looks like that is not the case. Is this normal behaviour?


[1] http://pastebin.com/KPkusyVb

Regards
--
Renaud Delbru


Re: indexing a lot of XML documents

2010-12-16 Thread Adam Estrada
I have been very successful in following this example
http://wiki.apache.org/solr/DataImportHandler#HttpDataSource_Example
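
If DIH doesn't fit, another route that has come up on this list is a small
SolrJ client that batches the adds from your own parser -- a rough sketch (the
URL, directory and field names are invented, and the XML parsing is left as a
placeholder):

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // StreamingUpdateSolrServer queues adds and streams them with several threads
        SolrServer server = new StreamingUpdateSolrServer(
                "http://server2:8080/solr/core0", 1000 /* queue */, 4 /* threads */);

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (File f : new File("/data/xml").listFiles()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", f.getName());
            doc.addField("text", readAndParse(f)); // your own XML parsing goes here
            batch.add(doc);
            if (batch.size() == 500) { server.add(batch); batch.clear(); }
        }
        if (!batch.isEmpty()) server.add(batch);
        server.commit(); // commit once at the end, not per document
    }

    private static String readAndParse(File f) { return ""; /* placeholder */ }
}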

Adam

On Thu, Dec 16, 2010 at 5:44 AM, Jörg Agatz joerg.ag...@googlemail.comwrote:

 Hi users, I'm searching for a way to index a lot of XML documents as fast as
 possible.

 I have more than 1 million docs on server 1 and a Solr multicore setup on server 2
 with Tomcat.

 I don't know how I can do it easily and fast.

 I can't find an idea in the wiki; maybe you have some ideas?

 King



Multicore Search broken

2010-12-16 Thread Jörg Agatz
Hello users,

I have created a multicore instance of Solr with Tomcat 6.
I created two cores, "mail" and "index2". At first, mail and index2 had the
same config; after that, I changed the mail config and indexed 30 XML files.

Now when I search across the cores:

http://localhost:8080/solr/mail/select?q=*:*&shards=localhost:8080/solr/mail,localhost:8080/solr/index2

I get an error:

__


HTTP Status 500 - null java.lang.NullPointerException at
org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:462)
at
org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:298)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:290)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:636)
__


When I search in one of the cores, it works:

http://localhost:8080/solr/mail/select?q=*:*  =  30 results
http://localhost:8080/solr/index2/select?q=*:*  =  one result


Does someone have an idea of what is wrong?


Re: how to config DataImport Scheduling

2010-12-16 Thread do3do3

I also have the same problem. I configured the dataimport.properties file as shown
in
http://wiki.apache.org/solr/DataImportHandler#dataimport.properties_example
but no change occurs. Can anyone help me?


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-config-DataImport-Scheduling-tp2032000p2097768.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: STUCK Threads at org.apache.lucene.document.CompressionTools.decompress

2010-12-16 Thread Erick Erickson
What are you trying to do? It sounds like you're storing fields compressed, is
that true (i.e. defining compressed=true in your field defs)? If so, why? It may
be costing you more than you benefit.

A quick test would be to stop returning anything except the score
by specifying fl=score. Or at least stop returning the largest
compressed fields... Make sure you've set enableLazyFieldLoading
in solrconfig.xml appropriately.
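
In SolrJ terms the quick test would look like the snippet below (or just append
&fl=score to your query URL); the id is taken from your example query:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class ScoreOnlyTest {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("itemId:9");
        q.setFields("score"); // return only the score, so no stored fields are fetched
        System.out.println(server.query(q).getResults().getNumFound());
    }
}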

If there's no joy here, please post your field definitions and an example or
two (with debugQuery=on) of offending queries.

Best
Erick

On Thu, Dec 16, 2010 at 9:31 AM, Alexander Ramos Jardim 
alexander.ramos.jar...@gmail.com wrote:

 Hello guys,

 I am getting threads stuck forever at *
 org.apache.lucene.document.CompressionTools.decompress*. I am using
 Weblogic 10.02, with solr deployed as ear and no work manager specifically
 configured for this instance.

 Only doing simple queries at this node (q=itemId:9 or q:skuId:9).
 My
 index has 3Giga.

 Now i send the thread dump of the stuck threads. Does anyone ever had this
 kind of problem?


 '[STUCK] ExecuteThread: '0' for queue: 'weblogic.kernel.Default
 (self-tuning)'' Id=19, RUNNABLE on lock=, total cpu time=187228990.ms
 user time=186506940.ms
 at java.util.zip.Inflater.inflateFast(Native Method)
 at java.util.zip.Inflater.inflateBytes(Inflater.java:360)
 at java.util.zip.Inflater.inflate(Inflater.java:218)
 at java.util.zip.Inflater.inflate(Inflater.java:235)
 at

 org.apache.lucene.document.CompressionTools.decompress(CompressionTools.java:108)
 at org.apache.lucene.index.FieldsReader.uncompress(FieldsReader.java:607)
 at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:368)
 at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:229)
 at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948)
 at
 org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506)
 at org.apache.lucene.index.IndexReader.document(IndexReader.java:947)
 at
 org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:444)
 at org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:427)
 at

 org.apache.solr.util.SolrPluginUtils.optimizePreFetchDocs(SolrPluginUtils.java:267)
 at

 org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:269)
 at

 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
 at

 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
 at

 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
 at
 weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
 at

 weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.run(WebAppServletContext.java:3402)
 at

 weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:321)
 at weblogic.security.service.SecurityManager.runAs(Unknown Source)
 at

 weblogic.servlet.internal.WebAppServletContext.securedExecute(WebAppServletContext.java:2140)
 at

 weblogic.servlet.internal.WebAppServletContext.execute(WebAppServletContext.java:2046)
 at

 weblogic.servlet.internal.ServletRequestImpl.run(ServletRequestImpl.java:1398)
 at weblogic.work.ExecuteThread.execute(ExecuteThread.java:200)
 at weblogic.work.ExecuteThread.run(ExecuteThread.java:172)
 'weblogic.time.TimeEventGenerator' Id=20, TIMED_WAITING on
 lock=weblogic.time.common.internal.timeta...@f051231a, total cpu
 time=60.ms user time=60.ms
 at java.lang.Object.wait(Native Method)
 at weblogic.time.common.internal.TimeTable.snooze(TimeTable.java:286)
 at

 weblogic.time.common.internal.TimeEventGenerator.run(TimeEventGenerator.java:117)
 at java.lang.Thread.run(Thread.java:595)
 'JMAPI event thread' Id=21, RUNNABLE on lock=, total cpu time=1220.ms
 user time=880.ms
 'weblogic.timers.TimerThread' Id=22, TIMED_WAITING on
 lock=weblogic.timers.internal.timerthr...@f050f3e4, total cpu
 time=1390.ms user time=1080.ms
 at java.lang.Object.wait(Native Method)
 at weblogic.timers.internal.TimerThread$Thread.run(TimerThread.java:265)
 '[STUCK] ExecuteThread: '4' for queue: 'weblogic.kernel.Default
 (self-tuning)'' Id=74, RUNNABLE on lock=, total cpu time=180761590.ms
 user time=180706770.ms
 at java.util.zip.Inflater.inflateFast(Native Method)
 at java.util.zip.Inflater.inflateBytes(Inflater.java:360)
 at java.util.zip.Inflater.inflate(Inflater.java:218)
 at java.util.zip.Inflater.inflate(Inflater.java:235)
 at

 org.apache.lucene.document.CompressionTools.decompress(CompressionTools.java:108)
 at org.apache.lucene.index.FieldsReader.uncompress(FieldsReader.java:607)
 at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:383)
 at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:229)
 at 

Re: Determining core name from a result?

2010-12-16 Thread Chris Hostetter

: Subject: Determining core name from a result?

FYI: some people may be confused because of terminology -- I think what 
you are asking is how to know which *shard* a document came from when 
doing a distributed search.

This isn't currently supported, there is an open issue tracking it...

https://issues.apache.org/jira/browse/SOLR-705

-Hoss


Re: Query performance issue while using EdgeNGram

2010-12-16 Thread Erick Erickson
A couple of observations:

1> Your regex at query time is interesting. You're using KeywordTokenizer,
so input of "search me" becomes "searchme" before it goes through the
parser. Is this your intent?
2> Why are you using EdgeNGrams for auto suggest? The TermsComponent is
an easier, more efficient solution unless you have some special needs, see
here: http://wiki.apache.org/solr/TermsComponent (a small sketch follows
this list). Are you trying to suggest *terms* or complete queries? Because
if it's just on a term basis, TermsComponent seems much simpler. Jay's
example on the Lucid web site (if that's where you started down this path)
is for implementing *query* selection.
3> I'd think about checking your caches. I'm not real comfortable with a
min gram size of 1 and then warming all the alphabet. See the admin stats
page and look for evictions. You're also bloating the size of your index
pretty significantly because of the huge number of unique terms you'll be
generating.
4> Optimizing is not all that useful unless you've deleted a bunch of
documents, despite the name. What it does do is force a complete reload of
the underlying index/caches, so possibly you're seeing resource contention
because of that. *After* the index is warmed, do you see performance
differences between optimized and un-optimized indexes? If not, think
about only optimizing during off hours.
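
Here's the sketch I mentioned in 2> -- roughly what a TermsComponent-based
suggester call looks like from SolrJ. It assumes a /terms request handler wired
to TermsComponent in solrconfig.xml (as in the wiki example) and a made-up
field name:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TermsSuggest {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery();
        q.setQueryType("/terms");           // sends the request to the /terms handler
        q.set("terms", true);
        q.set("terms.fl", "suggest_field"); // made-up field name
        q.set("terms.prefix", "peo");       // what the user has typed so far
        q.set("terms.limit", 10);
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getResponse().get("terms"));
    }
}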

Best
Erick

On Thu, Dec 16, 2010 at 2:47 AM, Shanmugavel SRD
srdshanmuga...@gmail.comwrote:


 While using auto suggest with EdgeNGramFilterFactory in Solr 1.4.1, we are
 having a performance issue with query response time.
 For example, even though 'p' is in auto warming, if I search for 'people'
 immediately after optimization is completed, then the search on 'people'
 takes 11-15 secs to respond. But a subsequent search on 'people' responds
 in less than 1 sec. I want to understand why it takes 11 secs to respond
 and how to reduce it to 1 sec.

 Below are the configurations. Could anyone suggest what I am
 missing here?

 1) Added query warming
 2) Decreased mergeFactor to '3'
 3) Increased HashDocSet maxSize as '7000' (which is 1432735 * 0.005)
 4) Optimized after the data import.

 Data is indexed from a CSV file; optimize is called immediately after data
 import.

 No of docs : 1432735

 solrconfig.xml
  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>3</mergeFactor>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>1</commitLockTimeout>
    <lockType>single</lockType>
  </indexDefaults>
  <mainIndex>
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <mergeFactor>3</mergeFactor>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <unlockOnStartup>false</unlockOnStartup>
  </mainIndex>
  <updateHandler class="solr.DirectUpdateHandler2">
    <maxPendingDeletes>10</maxPendingDeletes>
  </updateHandler>

  <query>
    <maxBooleanClauses>1024</maxBooleanClauses>
    <filterCache
      class="solr.LRUCache"
      size="16384"
      initialSize="4096"
      autowarmCount="4096"/>

    <queryResultCache
      class="solr.LRUCache"
      size="16384"
      initialSize="4096"
      autowarmCount="4096"/>

    <documentCache
      class="solr.LRUCache"
      size="5000"
      initialSize="5000"
      />

    <enableLazyFieldLoading>true</enableLazyFieldLoading>
    <queryResultWindowSize>50</queryResultWindowSize>
    <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
    <HashDocSet maxSize="7000" loadFactor="0.75"/>
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">a</str><str name="qt">typeahead</str><str name="start">0</str><str name="rows">100</str></lst>
        <lst><str name="q">b</str><str name="qt">typeahead</str><str name="start">0</str><str name="rows">100</str></lst>
        <lst><str name="q">c</str><str name="qt">typeahead</str><str name="start">0</str><str name="rows">100</str></lst>
        <lst><str name="q">d</str><str name="qt">typeahead</str><str name="start">0</str><str name="rows">100</str></lst>
        <lst><str name="q">e</str><str name="qt">typeahead</str><str name="start">0</str><str name="rows">100</str></lst>
        <lst><str name="q">f</str><str name="qt">typeahead</str><str name="start">0</str><str name="rows">100</str></lst>
        <lst><str name="q">g</str><str name="qt">typeahead</str><str name="start">0</str><str name="rows">100</str></lst>
        <lst><str name="q">h</str><str name="qt">typeahead</str><str name="start">0</str><str name="rows">100</str></lst>
        <lst><str name="q">i</str><str name="qt">typeahead</str><str name="start">0</str><str name="rows">100</str></lst>
        <lst><str name="q">j</str><str name="qt">typeahead</str><str name="start">0</str><str name="rows">100</str></lst>
        <lst><str name="q">k</str><str name="qt">typeahead</str><str name="start">0</str><str name="rows">100</str></lst>
        <lst><str name="q">l</str><str name="qt">typeahead</str><str name="start">0</str><str name="rows">100</str></lst>
        <lst><str name="q">m</str><str name="qt">typeahead</str><str name="start">0</str><str name="rows">100</str></lst>
        <lst><str name="q">n</str><str name="qt">typeahead</str><str

Re: Determining core name from a result?

2010-12-16 Thread Mark Allan
Oops! Sorry, I thought shard and core were one and the same and that the
terms could be used interchangeably - I've got a multicore setup which
I'm able to search across by using the shards parameter.  I think
you're right, that *is* the question I was asking.


Thanks for letting me know it's not supported yet.  I guess the  
easiest thing for me to do right now is to add another field in each  
document saying which core it was inserted into.
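
Roughly what I have in mind, as a SolrJ sketch ("core_name" is a field I'd
populate at index time with the name of the core each document goes into):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class MltRouting {
    public static void main(String[] args) throws Exception {
        // distributed search across the cores, as before
        SolrServer search = new CommonsHttpSolrServer("http://localhost:8984/solr/core0");
        SolrQuery q = new SolrQuery("id:022-80633905");
        q.set("shards", "localhost:8984/solr/core0,localhost:8984/solr/core1");
        SolrDocument doc = search.query(q).getResults().get(0);

        // route the MLT request straight to the core the document came from
        String core = (String) doc.getFieldValue("core_name"); // e.g. "core1"
        SolrServer mltServer = new CommonsHttpSolrServer("http://localhost:8984/solr/" + core);
        SolrQuery mlt = new SolrQuery("id:022-80633905");
        mlt.setQueryType("/mlt");
        mlt.set("mlt.fl", "description");
        System.out.println(mltServer.query(mlt).getResults());
    }
}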


Thanks again
Mark

On 16 Dec 2010, at 3:46 pm, Chris Hostetter wrote:


: Subject: Determining core name from a result?

FYI: some people may be confused because of terminoligy -- i think  
what

you are asking is how to know which *shard* a document came from when
doing a distributed search.

This isn't currently supported, there is an open issue tracking it...

https://issues.apache.org/jira/browse/SOLR-705

-Hoss




--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Case Insensitive sorting while preserving case during faceted search

2010-12-16 Thread shan2812

Hi,

I am trying to do a facet search and sort the facet values too.

First I tried with 'solr.TextField' as field type. But this does not return
sorted facet values.

After referring to the
FAQ (http://wiki.apache.org/solr/FAQ#Why_Isn.27t_Sorting_Working_on_my_Text_Fields.3F),
I changed it to 'solr.StrField' and it did work. But sorting was not always
correct, e.g. 'ALPHA' was sorted above 'Abacus'.

Then I followed the sample example schema.xml and created a copyField with
the following type:

<fieldType name="alphaOnlySort" class="solr.TextField"
    sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
        pattern="([^a-z])" replacement=""
        replace="all"/>
  </analyzer>
</fieldType>

But this gave another problem if the data contains any non-alpha
characters (they were replaced). Hence I removed the PatternReplaceFilterFactory
from the above definition and it worked well.

But the sorted facet values don't have their case preserved anymore.

How can I get around this?

Thank You.

Regards,
Shan 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Case-Insensitive-sorting-while-preserving-case-during-faceted-search-tp2099248p2099248.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: STUCK Threads at org.apache.lucene.document.CompressionTools.decompress

2010-12-16 Thread Alexander Ramos Jardim
2010/12/16 Erick Erickson erickerick...@gmail.com

 What are you trying to do? It sounds like you're storing fields compressed,
 is
 that true (i.e. defining compressed=true in your field defs)? If so, why?
 It
 may be
 costing you more than you benefit.


No compressed fields in my schema


 A quick test would be to stop returning anything except the score
 by specifying fl=score. Or at least stop returning the largest
 compressed fields... Make sure you've set enableLazyFieldLoading
 in solrconfig.xml appropriately.


lazy loading set TRUE


 If there's no joy here, please post your field definitions and an example
 or
 two (with debugQuery=on) of offending queries.


The only type of query I do in this instance: q=itemId:7288407 (obviously,
id may vary)

debug result:

lst name=debug str name=rawquerystringitemId:7288407/str str name
=querystringitemId:7288407/str str name=parsedqueryitemId:7288407
/str str name=parsedquery_toStringitemId:#8;#0;#0;Þ㙗/str lst name=
explain str name=7288407 11.873255 = (MATCH)
fieldWeight(itemId:#8;#0;#0;Þ㙗 in 187), product of: 1.0 =
tf(termFreq(itemId:#8;#0;#0;Þ㙗)=1) 11.873255 = idf(docFreq=4,
maxDocs=263733) 1.0 = fieldNorm(field=itemId, doc=187)/str /lst str
name=QParserLuceneQParser/str lst name=timing double name=time
26.0/double lst name=prepare double name=time3.0/double lst
name=org.apache.solr.handler.component.QueryComponent double name=time
1.0/double /lst lst name=
org.apache.solr.handler.component.FacetComponent double name=time0.0
/double /lst lst name=
org.apache.solr.handler.component.MoreLikeThisComponent double name=time
0.0/double /lst lst name=
org.apache.solr.handler.component.HighlightComponent double name=time
0.0/double /lst lst name=
org.apache.solr.handler.component.StatsComponent double name=time0.0
/double /lst lst name=
org.apache.solr.handler.component.SpellCheckComponent double name=time
0.0/double /lst lst name=
org.apache.solr.handler.component.DebugComponent double name=time0.0
/double /lst /lst lst name=process double name=time21.0
/double lst name=org.apache.solr.handler.component.QueryComponent double
name=time0.0/double /lst lst name=
org.apache.solr.handler.component.FacetComponent double name=time0.0
/double /lst lst name=
org.apache.solr.handler.component.MoreLikeThisComponent double name=time
0.0/double /lst lst name=
org.apache.solr.handler.component.HighlightComponent double name=time
0.0/double /lst lst name=
org.apache.solr.handler.component.StatsComponent double name=time0.0
/double /lst lst name=
org.apache.solr.handler.component.SpellCheckComponent double name=time
0.0/double /lst lst name=
org.apache.solr.handler.component.DebugComponent double name=time21.0
/double /lst /lst



 Best
 Erick

 On Thu, Dec 16, 2010 at 9:31 AM, Alexander Ramos Jardim 
 alexander.ramos.jar...@gmail.com wrote:

  Hello guys,
 
  I am getting threads stuck forever at *
  org.apache.lucene.document.CompressionTools.decompress*. I am using
  Weblogic 10.02, with solr deployed as ear and no work manager
 specifically
  configured for this instance.
 
  Only doing simple queries at this node (q=itemId:9 or q:skuId:9).
  My
  index has 3Giga.
 
  Now i send the thread dump of the stuck threads. Does anyone ever had
 this
  kind of problem?
 
 
  '[STUCK] ExecuteThread: '0' for queue: 'weblogic.kernel.Default
  (self-tuning)'' Id=19, RUNNABLE on lock=, total cpu time=187228990.ms
  user time=186506940.ms
  at java.util.zip.Inflater.inflateFast(Native Method)
  at java.util.zip.Inflater.inflateBytes(Inflater.java:360)
  at java.util.zip.Inflater.inflate(Inflater.java:218)
  at java.util.zip.Inflater.inflate(Inflater.java:235)
  at
 
 
 org.apache.lucene.document.CompressionTools.decompress(CompressionTools.java:108)
  at org.apache.lucene.index.FieldsReader.uncompress(FieldsReader.java:607)
  at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:368)
  at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:229)
  at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948)
  at
 
 org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506)
  at org.apache.lucene.index.IndexReader.document(IndexReader.java:947)
  at
  org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:444)
  at
 org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:427)
  at
 
 
 org.apache.solr.util.SolrPluginUtils.optimizePreFetchDocs(SolrPluginUtils.java:267)
  at
 
 
 org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:269)
  at
 
 
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
  at
 
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
  at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
  at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
  at
 

Re: Multicore Search broken

2010-12-16 Thread Jörg Agatz
I have tried some things; now I have news.

When I search in:

http://localhost:8080/solr/mail/select?q=*:*&shards=localhost:8080/solr/mail

it works, so it looks like it is not a problem with Java or something
like that.

I have an idea: is it possible that the problem is the differing configs?

Please, if you have an idea, let me know...


Re: Why does Solr commit block indexing?

2010-12-16 Thread Michael McCandless
Unfortunately, (I think?) Solr currently commits by closing the
IndexWriter, which must wait for any running merges to complete, and
then opening a new one.

This is really rather silly because IndexWriter has had its own commit
method (which does not block ongoing indexing nor merging) for quite
some time now.

I'm not sure why we haven't switched over already... there must be
some trickiness involved.
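
For reference, the difference in plain Lucene terms (a sketch of the two
styles, not the actual Solr code path):

import org.apache.lucene.index.IndexWriter;

public class CommitStyles {
    static void commitByClose(IndexWriter writer) throws Exception {
        writer.close();  // waits for any running merges to complete;
                         // a new IndexWriter must be opened before indexing continues
    }

    static void commitInPlace(IndexWriter writer) throws Exception {
        writer.commit(); // makes the changes durable without closing the writer,
                         // so indexing and background merges keep running
    }
}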

Mike

On Thu, Dec 16, 2010 at 9:39 AM, Renaud Delbru renaud.del...@deri.org wrote:
 Hi,

 See log at [1].
 We are using the latest snapshot of lucene_branch3.1. We have configured
 Solr to use the ConcurrentMergeScheduler:
 mergeScheduler class=org.apache.lucene.index.ConcurrentMergeScheduler/

 When a commit() runs, it blocks indexing (all imcoming update requests are
 blocked until the commit operation is finished) ... at the end of the log we
 notice a 4 minute gap during which none of the solr cients trying to add
 data receive any attention.
 This is a bit annoying as it leads to timeout exception on the client side.
 Here, the commit time is only 4 minutes, but it can be larger if there are
 merges of large segments
 I thought Solr was able to handle commits and updates at the same time: the
 commit operation should be done in the background, and the server still
 continue to receive update requests (maybe at a slower rate than normal).
 But it looks like it is not the case. Is it a normal behaviour ?

 [1] http://pastebin.com/KPkusyVb

 Regards
 --
 Renaud Delbru



RE: Dataimport performance

2010-12-16 Thread Dyer, James
We have ~50 long-running SQL queries that need to be joined and denormalized.  
Not all of the queries are to the same db, and some data comes from fixed-width 
data feeds.  Our current search engine (that we are converting to SOLR) has a 
fast disk-caching mechanism that lets you cache all of these data sources and 
then it will join them locally prior to indexing.  

I'm in the process of developing something similar for DIH that uses 
Berkeley DB to do the same thing.  It's good enough that I can do nightly full 
re-indexes of all our data while developing the front-end, but it is still very 
rough.  Possibly I would like to get this refined enough to eventually submit 
as a JIRA ticket / patch, as it seems this is a somewhat common problem that 
needs solving.

Even with our current search engine, the join & denormalize step is always the 
longest-running part of the process.  However, I have it running fairly fast by 
partitioning the data by a modulus of the primary key and then running several 
jobs in parallel.  The trick is not to get I/O bound.  Things run fast if you 
can set it up to maximize CPU.
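
The partitioning idea, as a rough sketch (table, column and connection details
are invented; each partition runs its own copy of the extract query in
parallel):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PartitionedExtract implements Runnable {
    private final int partition;
    private final int partitions;

    PartitionedExtract(int partition, int partitions) {
        this.partition = partition;
        this.partitions = partitions;
    }

    public void run() {
        try {
            Connection con = DriverManager.getConnection("jdbc:mysql://db/catalog", "user", "pw");
            PreparedStatement ps = con.prepareStatement(
                    "SELECT id, title FROM item WHERE MOD(id, ?) = ?");
            ps.setInt(1, partitions);
            ps.setInt(2, partition);
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                // join with the locally cached feeds and send the document to Solr
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        int n = 4;                       // number of parallel jobs
        for (int k = 0; k < n; k++) {
            new Thread(new PartitionedExtract(k, n)).start();
        }
    }
}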

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: Ephraim Ofir [mailto:ephra...@icq.com] 
Sent: Thursday, December 16, 2010 3:04 AM
To: solr-user@lucene.apache.org
Subject: RE: Dataimport performance

Check out 
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3c9f8b39cb3b7c6d4594293ea29ccf438b01702...@icq-mail.icq.il.office.aol.com%3e
This approach of not using sub entities really improved our load time.

Ephraim Ofir

-Original Message-
From: Robert Gründler [mailto:rob...@dubture.com] 
Sent: Wednesday, December 15, 2010 4:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Dataimport performance

i've benchmarked the import already with 500k records, one time without the 
artists subquery, and one time without the join in the main query:


Without subquery: 500k in 3 min 30 sec

Without join and without subquery: 500k in 2 min 30.

With subquery and with left join:   320k in 6 Min 30


so the joins / subqueries are definitely a bottleneck. 

How exactly did you implement the custom data import? 

In our case, we need to de-normalize the relations of the sql data for the 
index, 
so i fear i can't really get rid of the join / subquery.


-robert





On Dec 15, 2010, at 15:43 , Tim Heckman wrote:

 2010/12/15 Robert Gründler rob...@dubture.com:
 The data-config.xml looks like this (only 1 entity):
 
  <entity name="track" query="select t.id as id, t.title as title,
    l.title as label from track t left join label l on (l.id = t.label_id) where
    t.deleted = 0" transformer="TemplateTransformer">
    <field column="title" name="title_t" />
    <field column="label" name="label_t" />
    <field column="id" name="sf_meta_id" />
    <field column="metaclass" template="Track" name="sf_meta_class"/>
    <field column="metaid" template="${track.id}" name="sf_meta_id"/>
    <field column="uniqueid" template="Track_${track.id}"
      name="sf_unique_id"/>

    <entity name="artists" query="select a.name as artist from artist a
      left join track_artist ta on (ta.artist_id = a.id) where
      ta.track_id=${track.id}">
      <field column="artist" name="artists_t" />
    </entity>

  </entity>
 
 So there's one track entity with an artist sub-entity. My (admittedly
 rather limited) experience has been that sub-entities, where you have
 to run a separate query for every row in the parent entity, really
 slow down data import. For my own purposes, I wrote a custom data
 import using SolrJ to improve the performance (from 3 hours to 10
 minutes).
 
 Just as a test, how long does it take if you comment out the artists entity?



Re: PHPSolrClient

2010-12-16 Thread Dennis Gearon
So just use add and overwrite. OK, thanks

 Dennis Gearon


Signature Warning
-


- Original Message 
From: Tanguy Moal tanguy.m...@gmail.com
To: solr-user@lucene.apache.org
Sent: Thu, December 16, 2010 1:33:36 AM
Subject: Re: PHPSolrClient

Hi Dennis,

Not particular to the client you use (solr-php-client) for sending
documents, think of update as an overwrite.

This means that if you update a particular document, the previous
version indexed is lost.
Therefore, when updating a document, make sure that all the fields to
be indexed and retrieved are present in the update.

For an update to occur, only the uniqueKey id (as specified in your
schema.xml) has to be the same as the document you want to update.

In short, an update is like an add (and is performed the same way), except
that the added document was previously indexed. It simply gets
replaced by the update.
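
A small SolrJ illustration of the same point (the behaviour is server-side, so it
applies to the PHP client as well; the field names are made up):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class UpdateIsOverwrite {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // "update" document 42: resend ALL of its fields, not just the changed one,
    // because the add replaces the previously indexed document entirely
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "42");     // uniqueKey match is what triggers the overwrite
    doc.addField("title", "New title");
    doc.addField("body", "The unchanged body text has to be sent again too");
    solr.add(doc);
    solr.commit();
  }
}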

Hope that helps,

--
Tanguy

2010/12/16 Dennis Gearon gear...@sbcglobal.net:
 First of all, it's a very nice piece of work.

 I am just getting my feet wet with Solr in general. So I'm not even sure how
 a document is NORMALLY deleted.

 The library PHPDocs say 'add', 'get', 'delete', but does anyone know about
 'update'?
  (obviously one can read-delete-modify-create)

  Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a 
better
 idea to learn from others’ mistakes, so you do not have to make them yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.





RE: Memory use during merges (OOM)

2010-12-16 Thread Robert Petersen
Hello we occasionally bump into the OOM issue during merging after propagation 
too, and from the discussion below I guess we are doing thousands of 'false 
deletions' by unique id to make sure certain documents are *not* in the index.  
Could anyone explain why that is bad?  I didn't really understand the 
conclusion below. 

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Thursday, December 16, 2010 2:51 AM
To: solr-user@lucene.apache.org
Subject: Re: Memory use during merges (OOM)

RAM usage for merging is tricky.

First off, merging must hold open a SegmentReader for each segment
being merged.  However, it's not necessarily a full segment reader;
for example, merging doesn't need the terms index nor norms.  But it
will load deleted docs.

But, if you are doing deletions (or updateDocument, which is just a
delete + add under-the-hood), then this will force the terms index of
the segment readers to be loaded, thus consuming more RAM.
Furthermore, if the deletions you (by Term/Query) do in fact result in
deleted documents (ie they were not false deletions), then the
merging allocates an int[maxDoc()] for each SegmentReader that has
deletions.

Finally, if you have multiple merges running at once (see
CSM.setMaxMergeCount) that means RAM for each currently running merge
is tied up.

So I think the gist is... the RAM usage will be in proportion to the
net size of the merge (mergeFactor + how big each merged segment is),
how many merges you allow concurrently, and whether you do false or
true deletions.

If you are doing false deletions (calling .updateDocument when in fact
the Term you are replacing cannot exist) it'd be best if possible to
change the app to not call .updateDocument if you know the Term
doesn't exist.
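
(A sketch of what that looks like at the Lucene level; knownToBeNew stands in for
whatever application-level knowledge you have about the document:)

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class AddOrUpdate {
  static void index(IndexWriter writer, Document doc, String id, boolean knownToBeNew)
      throws IOException {
    if (knownToBeNew) {
      writer.addDocument(doc);                         // plain add: no delete-by-term
    } else {
      writer.updateDocument(new Term("id", id), doc);  // delete + add under the hood
    }
  }
}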

Mike

On Wed, Dec 15, 2010 at 6:52 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Hello all,

 Are there any general guidelines for determining the main factors in memory 
 use during merges?

 We recently changed our indexing configuration to speed up indexing but in 
 the process of doing a very large merge we are running out of memory.
 Below is a list of the changes and part of the indexwriter log.  The changes 
 increased the indexing though-put by almost an order of magnitude.
 (about 600 documents per hour to about 6000 documents per hour.  Our 
 documents are about 800K)

 We are trying to determine which of the changes to tweak to avoid the OOM, 
 but still keep the benefit of the increased indexing throughput

 Is it likely that the changes to ramBufferSizeMB are the culprit or could it 
 be the mergeFactor change from 10-20?

  Is there any obvious relationship between ramBufferSizeMB and the memory 
 consumed by Solr?
  Are there rules of thumb for the memory needed in terms of the number or 
 size of segments?

 Our largest segments prior to the failed merge attempt were between 5GB and 
 30GB.  The memory allocated to the Solr/tomcat JVM is 10GB.

 Tom Burton-West
 -

 Changes to indexing configuration:
 mergeScheduler
        before: serialMergeScheduler
        after:    concurrentMergeScheduler
 mergeFactor
        before: 10
            after : 20
 ramBufferSizeMB
        before: 32
              after: 320

 excerpt from indexWriter.log

 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP: findMerges: 40 segments
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP:   level 7.23609 to 7.98609: 20 segments
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP:     0 to 20: add this merge
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP:   level 5.44878 to 6.19878: 20 segments
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP:     20 to 40: add this merge

 ...
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: applyDeletes
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: DW: apply 1320 buffered deleted terms and 0 deleted 
 docIDs and 0 deleted queries on 40 segments.
 Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; 
 http-8091-Processor70]: hit exception flushing deletes
 Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; 
 http-8091-Processor70]: hit OutOfMemoryError inside updateDocument
 tom




Re: Thank you!

2010-12-16 Thread Dennis Gearon
If I ever make it, wikipedia, stackoverflow, PHP, Symfony, Doctrine, Apache are 
all going to get donations.

I already sent $20 to Wikipedia; they're hurting now.

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: kenf_nc ken.fos...@realestate.com
To: solr-user@lucene.apache.org
Sent: Thu, December 16, 2010 6:11:24 AM
Subject: Re: Thank you!


Hear hear! In the beginning of my journey with Solr/Lucene I couldn't have
done it without this site. Smiley and Pugh's book was useful, but this forum
was invaluable.  I don't have as many questions now, but each new venture,
Geospatial searching, replication and redundancy, performance tuning, brings
me back again and again. This and stackoverflow.com have to be two of the
most useful destinations on the internet for developers. Communities are so
much more relevant than reference materials, and the consistent activity in
this community is impressive.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Thank-you-tp2096329p2098512.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Memory use during merges (OOM)

2010-12-16 Thread Michael McCandless
It's not that it's bad, it's just that Lucene must do extra work to
check if these deletes are real or not, and that extra work requires
loading the terms index which will consume additional RAM.

For most apps, though, the terms index is relatively small and so this
isn't really an issue.  But if your terms index is large this can
explain the added RAM usage.

One workaround for large terms index is to set the terms index divisor
that IndexWriter should use whenever it loads a terms index (this is
IndexWriter.setReaderTermsIndexDivisor).
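
(On the Lucene side that is a one-liner; the divisor value below is only an example:)

import org.apache.lucene.index.IndexWriter;

public class TermsIndexDivisorExample {
  static void configure(IndexWriter writer) {
    // load only every 8th indexed term when IW opens readers internally
    // (e.g. to apply deletes during merges): less RAM, slower term seeks
    writer.setReaderTermsIndexDivisor(8);
  }
}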

Mike

On Thu, Dec 16, 2010 at 12:17 PM, Robert Petersen rober...@buy.com wrote:
 Hello we occasionally bump into the OOM issue during merging after 
 propagation too, and from the discussion below I guess we are doing thousands 
 of 'false deletions' by unique id to make sure certain documents are *not* in 
 the index.  Could anyone explain why that is bad?  I didn't really understand 
 the conclusion below.

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Thursday, December 16, 2010 2:51 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Memory use during merges (OOM)

 RAM usage for merging is tricky.

 First off, merging must hold open a SegmentReader for each segment
 being merged.  However, it's not necessarily a full segment reader;
 for example, merging doesn't need the terms index nor norms.  But it
 will load deleted docs.

 But, if you are doing deletions (or updateDocument, which is just a
 delete + add under-the-hood), then this will force the terms index of
 the segment readers to be loaded, thus consuming more RAM.
 Furthermore, if the deletions you (by Term/Query) do in fact result in
 deleted documents (ie they were not false deletions), then the
 merging allocates an int[maxDoc()] for each SegmentReader that has
 deletions.

 Finally, if you have multiple merges running at once (see
 CSM.setMaxMergeCount) that means RAM for each currently running merge
 is tied up.

 So I think the gist is... the RAM usage will be in proportion to the
 net size of the merge (mergeFactor + how big each merged segment is),
 how many merges you allow concurrently, and whether you do false or
 true deletions.

 If you are doing false deletions (calling .updateDocument when in fact
 the Term you are replacing cannot exist) it'd be best if possible to
 change the app to not call .updateDocument if you know the Term
 doesn't exist.

 Mike

 On Wed, Dec 15, 2010 at 6:52 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Hello all,

 Are there any general guidelines for determining the main factors in memory 
 use during merges?

 We recently changed our indexing configuration to speed up indexing but in 
 the process of doing a very large merge we are running out of memory.
 Below is a list of the changes and part of the indexwriter log.  The changes 
 increased the indexing though-put by almost an order of magnitude.
 (about 600 documents per hour to about 6000 documents per hour.  Our 
 documents are about 800K)

 We are trying to determine which of the changes to tweak to avoid the OOM, 
 but still keep the benefit of the increased indexing throughput

 Is it likely that the changes to ramBufferSizeMB are the culprit or could it 
 be the mergeFactor change from 10-20?

  Is there any obvious relationship between ramBufferSizeMB and the memory 
 consumed by Solr?
  Are there rules of thumb for the memory needed in terms of the number or 
 size of segments?

 Our largest segments prior to the failed merge attempt were between 5GB and 
 30GB.  The memory allocated to the Solr/tomcat JVM is 10GB.

 Tom Burton-West
 -

 Changes to indexing configuration:
 mergeScheduler
        before: serialMergeScheduler
        after:    concurrentMergeScheduler
 mergeFactor
        before: 10
            after : 20
 ramBufferSizeMB
        before: 32
              after: 320

 excerpt from indexWriter.log

 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP: findMerges: 40 segments
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP:   level 7.23609 to 7.98609: 20 segments
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP:     0 to 20: add this merge
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP:   level 5.44878 to 6.19878: 20 segments
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP:     20 to 40: add this merge

 ...
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: applyDeletes
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: DW: apply 1320 buffered deleted terms and 0 deleted 
 docIDs and 0 deleted queries on 40 segments.
 Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; 
 http-8091-Processor70]: 

Re: bulk commits

2010-12-16 Thread Adam Estrada
what is it that you are trying to commit?

a

On Thu, Dec 16, 2010 at 1:03 PM, Dennis Gearon gear...@sbcglobal.netwrote:

 What have people found as the best way to do bulk commits either from the
 web or
 from a file on the system?

  Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a
 better
 idea to learn from others’ mistakes, so you do not have to make them
 yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.




Re: bulk commits

2010-12-16 Thread Adam Estrada
This is how I import a lot of data from a csv file. There are close to 100k
records in there. Note that you can either pre-define the column names using
the fieldnames param like I did here *or* include header=true which will
automatically pick up the column header if your file has it.

curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,cat&stream.file=C:\tmp\cities1000.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"

This seems to load everything in to some kind of temporary location before
it's actually committed. If something goes wrong there is a rollback feature
that will undo anything that happened before the commit.
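
(The same add / commit / rollback flow is available from SolrJ; a rough sketch,
assuming the 1.4-era SolrJ classes, including SolrServer.rollback(), and omitting
real error handling:)

import java.util.Arrays;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkWithRollback {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "1");
    doc.addField("name", "example");
    try {
      solr.add(Arrays.asList(doc));  // buffered on the server until commit
      solr.commit();                 // make the batch visible
    } catch (Exception e) {
      solr.rollback();               // undo everything since the last commit
      throw e;
    }
  }
}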

As far as batching a bunch of files, I copied and pasted the following in to
Cygwin and it worked just fine.

curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,cat&stream.file=C:\tmp\cities1000.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xab.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xac.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xad.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xae.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xaf.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xag.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xah.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xai.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xaj.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xak.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xal.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl 

Query Problem

2010-12-16 Thread Ezequiel Calderara
Hi all, I have the following problems.
I have this set of data (View data (Pastebin) http://pastebin.com/jKbUhjVS
)
If i do a search for: *SectionName:Programas_Home* i have no results: Returned
Data (PasteBin) http://pastebin.com/wnPdHqBm
If i do a search for: *Programas_Home* i have only 1 result: Result Returned
(Pastebin) http://pastebin.com/fMZkLvYK
if i do a search for: SectionName:Programa* i have 1 result: Result Returned
(Pastebin) http://pastebin.com/kLLnVp4b

This is my *schema* http://pastebin.com/PQM8uap4 (Pastebin) and this is my
*solrconfig* http://%3c/?xml version=1.0 encoding=UTF-8 ?(PasteBin)

I don't understand why when searching for SectionName:Programas_Home isn't
returning any results at all...

Can someone send some light on this?
-- 
__
Ezequiel.

Http://www.ironicnet.com


RE: Memory use during merges (OOM)

2010-12-16 Thread Burton-West, Tom
Thanks Mike,

But, if you are doing deletions (or updateDocument, which is just a
delete + add under-the-hood), then this will force the terms index of
the segment readers to be loaded, thus consuming more RAM.

Out of 700,000 docs, by the time we get to doc 600,000, there is a good chance 
a few documents have been updated, which would cause a delete +add.  


One workaround for large terms index is to set the terms index divisor
that IndexWriter should use whenever it loads a terms index (this is
IndexWriter.setReaderTermsIndexDivisor).

I always get confused about the two different divisors and their names in the 
solrconfig.xml file

We are setting  termInfosIndexDivisor, which I think translates to the Lucene 
IndexWriter.setReaderTermsIndexDivisor

<indexReaderFactory name="IndexReaderFactory"
    class="org.apache.solr.core.StandardIndexReaderFactory">
  <int name="termInfosIndexDivisor">8</int>
</indexReaderFactory>

The other one is termIndexInterval which is set on the writer and determines 
what gets written to the tii file.  I don't remember how to set this in Solr.

Are we setting the right one to reduce RAM usage during merging?


 So I think the gist is... the RAM usage will be in proportion to the
 net size of the merge (mergeFactor + how big each merged segment is),
 how many merges you allow concurrently, and whether you do false or
 true deletions

Does an optimize do something differently?  

Tom



RE: Memory use during merges (OOM)

2010-12-16 Thread Robert Petersen
Thanks Mike!  When you say 'term index of the segment readers', are you 
referring to the term vectors?

In our case our index of 8 million docs holds pretty 'skinny' docs containing 
searchable product titles and keywords, with the rest of the doc only holding 
Ids for faceting upon.  Docs typically only have unique terms per doc, with a 
lot of overlap of the terms across categories of docs (all similar products).  
I'm thinking that our unique terms are low vs the size of our index.  The way 
we spin out deletes and adds should keep the terms loaded all the time.  Seems 
like once in a couple weeks a propagation happens which kills the slave farm 
with OOMs.  We are bumping the heap up a couple gigs every time this happens 
and hoping it goes away at this point.  That is why I jumped into this 
discussion, sorry for butting in like that.  you guys are discussing very 
interesting settings I had not considered before.

Rob


-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Thursday, December 16, 2010 10:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Memory use during merges (OOM)

It's not that it's bad, it's just that Lucene must do extra work to
check if these deletes are real or not, and that extra work requires
loading the terms index which will consume additional RAM.

For most apps, though, the terms index is relatively small and so this
isn't really an issue.  But if your terms index is large this can
explain the added RAM usage.

One workaround for large terms index is to set the terms index divisor
that IndexWriter should use whenever it loads a terms index (this is
IndexWriter.setReaderTermsIndexDivisor).

Mike

On Thu, Dec 16, 2010 at 12:17 PM, Robert Petersen rober...@buy.com wrote:
 Hello we occasionally bump into the OOM issue during merging after 
 propagation too, and from the discussion below I guess we are doing thousands 
 of 'false deletions' by unique id to make sure certain documents are *not* in 
 the index.  Could anyone explain why that is bad?  I didn't really understand 
 the conclusion below.

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Thursday, December 16, 2010 2:51 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Memory use during merges (OOM)

 RAM usage for merging is tricky.

 First off, merging must hold open a SegmentReader for each segment
 being merged.  However, it's not necessarily a full segment reader;
 for example, merging doesn't need the terms index nor norms.  But it
 will load deleted docs.

 But, if you are doing deletions (or updateDocument, which is just a
 delete + add under-the-hood), then this will force the terms index of
 the segment readers to be loaded, thus consuming more RAM.
 Furthermore, if the deletions you (by Term/Query) do in fact result in
 deleted documents (ie they were not false deletions), then the
 merging allocates an int[maxDoc()] for each SegmentReader that has
 deletions.

 Finally, if you have multiple merges running at once (see
 CSM.setMaxMergeCount) that means RAM for each currently running merge
 is tied up.

 So I think the gist is... the RAM usage will be in proportion to the
 net size of the merge (mergeFactor + how big each merged segment is),
 how many merges you allow concurrently, and whether you do false or
 true deletions.

 If you are doing false deletions (calling .updateDocument when in fact
 the Term you are replacing cannot exist) it'd be best if possible to
 change the app to not call .updateDocument if you know the Term
 doesn't exist.

 Mike

 On Wed, Dec 15, 2010 at 6:52 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Hello all,

 Are there any general guidelines for determining the main factors in memory 
 use during merges?

 We recently changed our indexing configuration to speed up indexing but in 
 the process of doing a very large merge we are running out of memory.
 Below is a list of the changes and part of the indexwriter log.  The changes 
 increased the indexing though-put by almost an order of magnitude.
 (about 600 documents per hour to about 6000 documents per hour.  Our 
 documents are about 800K)

 We are trying to determine which of the changes to tweak to avoid the OOM, 
 but still keep the benefit of the increased indexing throughput

 Is it likely that the changes to ramBufferSizeMB are the culprit or could it 
 be the mergeFactor change from 10-20?

  Is there any obvious relationship between ramBufferSizeMB and the memory 
 consumed by Solr?
  Are there rules of thumb for the memory needed in terms of the number or 
 size of segments?

 Our largest segments prior to the failed merge attempt were between 5GB and 
 30GB.  The memory allocated to the Solr/tomcat JVM is 10GB.

 Tom Burton-West
 -

 Changes to indexing configuration:
 mergeScheduler
        before: serialMergeScheduler
        after:    

Re: Memory use during merges (OOM)

2010-12-16 Thread Michael McCandless
On Thu, Dec 16, 2010 at 2:09 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Thanks Mike,

But, if you are doing deletions (or updateDocument, which is just a
delete + add under-the-hood), then this will force the terms index of
the segment readers to be loaded, thus consuming more RAM.

 Out of 700,000 docs, by the time we get to doc 600,000, there is a good 
 chance a few documents have been updated, which would cause a delete +add.

OK so you should do the .updateDocument not .addDocument.

One workaround for large terms index is to set the terms index divisor
that IndexWriter should use whenever it loads a terms index (this is
IndexWriter.setReaderTermsIndexDivisor).

 I always get confused about the two different divisors and their names in the 
 solrconfig.xml file

 We are setting  termInfosIndexDivisor, which I think translates to the Lucene 
 IndexWriter.setReaderTermsIndexDivisor

 <indexReaderFactory name="IndexReaderFactory"
     class="org.apache.solr.core.StandardIndexReaderFactory">
   <int name="termInfosIndexDivisor">8</int>
 </indexReaderFactory>

 The other one is termIndexInterval which is set on the writer and determines 
 what gets written to the tii file.  I don't remember how to set this in Solr.

 Are we setting the right one to reduce RAM usage during merging?

It's even more confusing!

There are three settings.  First tells IW how frequent the index terms
are (default is 128).  Second tells IndexReader whether to sub-sample
these on load (default is 1, meaning load all indexed terms; but if
you set it to 2 then 2*128 = every 256th term is loaded).  Third, IW
has the same setting (subsampling) to be used whenever it internally
must open a reader (eg to apply deletes).

The last two are really the same setting, just that one is passed when
you open IndexReader yourself, and the other is passed whenever IW
needs to open a reader.

But, I'm not sure how these settings are named in solrconfig.xml.
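
(A sketch of the three knobs at the Lucene level, with illustrative values; the
pre-4.0 API names are used here, and the exact IndexReader.open overload may differ
by version:)

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class TermsIndexSettings {
  static IndexReader illustrate(IndexWriter writer, Directory dir) throws Exception {
    // 1) writer side: write every 128th term to the terms index (the default interval)
    writer.setTermIndexInterval(128);

    // 2) reader side: sub-sample the terms index at load time
    //    (divisor 2 -> only every 2 * 128th indexed term is held in RAM)
    IndexReader reader = IndexReader.open(dir, null, true, 2);

    // 3) the same sub-sampling, but for readers the writer opens internally
    //    (e.g. to apply deletes during merges)
    writer.setReaderTermsIndexDivisor(2);

    return reader;
  }
}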

 So I think the gist is... the RAM usage will be in proportion to the
 net size of the merge (mergeFactor + how big each merged segment is),
 how many merges you allow concurrently, and whether you do false or
 true deletions

 Does an optimize do something differently?

No, optimize is the same deal.  But, because it's a big merge
(especially the last one), it's the highest RAM usage of all merges.

Mike


Re: Memory use during merges (OOM)

2010-12-16 Thread Michael McCandless
Actually terms index is something different.

If you don't use CFS, go and look at the size of *.tii in your index
directory -- those are the terms index.  The terms index picks a
subset of the terms (by default 128) to hold in RAM (plus some
metadata) in order to make seeking to a specific term faster.

Unfortunately they are held in a RAM intensive way, but in the
upcoming 4.0 release we've greatly reduced that.

Mike

On Thu, Dec 16, 2010 at 2:27 PM, Robert Petersen rober...@buy.com wrote:
 Thanks Mike!  When you say 'term index of the segment readers', are you 
 referring to the term vectors?

 In our case our index of 8 million docs holds pretty 'skinny' docs containing 
 searchable product titles and keywords, with the rest of the doc only holding 
 Ids for faceting upon.  Docs typically only have unique terms per doc, with a 
 lot of overlap of the terms across categories of docs (all similar products). 
  I'm thinking that our unique terms are low vs the size of our index.  The 
 way we spin out deletes and adds should keep the terms loaded all the time.  
 Seems like once in a couple weeks a propagation happens which kills the slave 
 farm with OOMs.  We are bumping the heap up a couple gigs every time this 
 happens and hoping it goes away at this point.  That is why I jumped into 
 this discussion, sorry for butting in like that.  you guys are discussing 
 very interesting settings I had not considered before.

 Rob


 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Thursday, December 16, 2010 10:24 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Memory use during merges (OOM)

 It's not that it's bad, it's just that Lucene must do extra work to
 check if these deletes are real or not, and that extra work requires
 loading the terms index which will consume additional RAM.

 For most apps, though, the terms index is relatively small and so this
 isn't really an issue.  But if your terms index is large this can
 explain the added RAM usage.

 One workaround for large terms index is to set the terms index divisor
 that IndexWriter should use whenever it loads a terms index (this is
 IndexWriter.setReaderTermsIndexDivisor).

 Mike

 On Thu, Dec 16, 2010 at 12:17 PM, Robert Petersen rober...@buy.com wrote:
 Hello we occasionally bump into the OOM issue during merging after 
 propagation too, and from the discussion below I guess we are doing 
 thousands of 'false deletions' by unique id to make sure certain documents 
 are *not* in the index.  Could anyone explain why that is bad?  I didn't 
 really understand the conclusion below.

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Thursday, December 16, 2010 2:51 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Memory use during merges (OOM)

 RAM usage for merging is tricky.

 First off, merging must hold open a SegmentReader for each segment
 being merged.  However, it's not necessarily a full segment reader;
 for example, merging doesn't need the terms index nor norms.  But it
 will load deleted docs.

 But, if you are doing deletions (or updateDocument, which is just a
 delete + add under-the-hood), then this will force the terms index of
 the segment readers to be loaded, thus consuming more RAM.
 Furthermore, if the deletions you (by Term/Query) do in fact result in
 deleted documents (ie they were not false deletions), then the
 merging allocates an int[maxDoc()] for each SegmentReader that has
 deletions.

 Finally, if you have multiple merges running at once (see
 CSM.setMaxMergeCount) that means RAM for each currently running merge
 is tied up.

 So I think the gist is... the RAM usage will be in proportion to the
 net size of the merge (mergeFactor + how big each merged segment is),
 how many merges you allow concurrently, and whether you do false or
 true deletions.

 If you are doing false deletions (calling .updateDocument when in fact
 the Term you are replacing cannot exist) it'd be best if possible to
 change the app to not call .updateDocument if you know the Term
 doesn't exist.

 Mike

 On Wed, Dec 15, 2010 at 6:52 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Hello all,

 Are there any general guidelines for determining the main factors in memory 
 use during merges?

 We recently changed our indexing configuration to speed up indexing but in 
 the process of doing a very large merge we are running out of memory.
 Below is a list of the changes and part of the indexwriter log.  The 
 changes increased the indexing though-put by almost an order of magnitude.
 (about 600 documents per hour to about 6000 documents per hour.  Our 
 documents are about 800K)

 We are trying to determine which of the changes to tweak to avoid the OOM, 
 but still keep the benefit of the increased indexing throughput

 Is it likely that the changes to ramBufferSizeMB are the culprit or could 
 it be the mergeFactor change from 10-20?

  Is there any 

Re: Memory use during merges (OOM)

2010-12-16 Thread Robert Muir
On Thu, Dec 16, 2010 at 2:09 PM, Burton-West, Tom tburt...@umich.edu wrote:

 I always get confused about the two different divisors and their names in the 
 solrconfig.xml file

This one (for the writer) isnt configurable by Solr. want to open an issue?


 We are setting  termInfosIndexDivisor, which I think translates to the Lucene 
 IndexWriter.setReaderTermsIndexDivisor

 <indexReaderFactory name="IndexReaderFactory"
     class="org.apache.solr.core.StandardIndexReaderFactory">
   <int name="termInfosIndexDivisor">8</int>
 </indexReaderFactory>

 The other one is termIndexInterval which is set on the writer and determines 
 what gets written to the tii file.  I don't remember how to set this in Solr.

 Are we setting the right one to reduce RAM usage during merging?


When you write the terms, it creates a terms dictionary, and a terms
index. The termsIndexInterval (default 128) controls how many terms go
into the index.
For example every 128th term.

The divisor just samples this at runtime... e.g. with your divisor of
8 it's only reading every 8th term from the index [or every 8*128th
term is read into RAM, another way to see it].

Your setting isn't being applied to the reader IW uses during
merging... its only for readers Solr opens from directories
explicitly.
I think you should open a jira issue!


Re: Dataimport performance

2010-12-16 Thread Glen Newton
Hi,

LuSqlv2 beta comes out in the next few weeks, and is designed to
address this issue (among others).

LuSql original 
(http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
now moved to: https://code.google.com/p/lusql/) is a JDBC--Lucene
high performance loader.

You may have seen my posts on this list suggesting LuSql as high
performance alternative to DIH, for a subset of use cases.

LuSqlV2 has evolved into a full extract-transform-load (ETL) high
performance engine, focusing on many of the issues of interest to the
Lucene/SOLR community.
It has a pipelined, pluggable, multithreaded architecture.
It is basically: pluggable source -- 0 or more pluggable filters --
pluggable sink

Source plugins implemented:
- JDBC, Lucene, SOLR (SolrJ), BDB, CSV, RMI, Java Serialization
Sink plugins implemented:
- JDBC, Lucene, SOLR (SolrJ), BDB, XML, RMI, Java Serialization, Tee,
NullSink [I am working on a memcached Sink]
A number of different filters implemented (i.e. get PDF file from
filesystem based on SQL field and convert & get text, etc.) including:
BDBJoinFIlter, JDBCJoinFilter

--

This particular problem is one of the unit tests I have: given a
simple database of:
1- table Name
2- table City
3- table nameCityJoin
4- table Job
5- table nameJobJoin

run a JDBC--BDB LuSql instance each for of City+nameCityJoin and
Job+nameJobJoin; then run a JDBC--SolrJ on table Name, adding 2
BDBJoinFIlters, each which take the BDB generated earlier and do the
join (you just tell the filters which field from the JDBC-generated to
use against the BDB key).

So your use case use a larger example of this.

Also of interest:
- Java RMI (Remote Method Invocation): both an RMISink(Server) and
RMISource(Client) are implemented. This means you can set up N
machines which are doing something, and have one or more clients (on
their own machines) that are pulling this data and doing something
with it. For example, JDBC--PDFToTextFilter--RMI (converting PDF
files to text based on the contents of a SQL database, with text files
in the file system): basically doing some heavy lifting, and then
start up an RMI--SolrJ (or Lucene) which is a client to the N PDF
converting machines, doing only the Lucene/SOLR indexing. The client
does a pull when it needs more data. You can have N servers x M
clients! Oh, string fields length  1024 are automatically gzipped by
the RMI Sink(Server), to reduce network (at the cost of cpu:
selectable). I am looking into RMI alternatives, like Thrift, ProtoBuf
for my next Source/Sinks to implement. Another example is the reverse
use case: when the indexing is more expensive getting the data.
Example: One JDBC--RMISink(Server) instance, N
RMISource(Client)--Lucene instances; this allows multiple Lucenes to
be fed from a single JDBC source, across machines.

- TeeSink: the Tee sink hides N sinks, so you can split the pipeline
into multiple Sinks. I've used it to send the same content to Lucene
as well as BDB in one fell swoop. Can you say index and content store
in one step?

I am working on cleaning up the code, writing docs (I made the mistake
of making great docs for LusqlV1, so I have work to do...!), and
making a couple more tests.

I will announce the beta on this and the Lucene list.

If you have any questions, please contact me.

Thanks,
Glen Newton
http://zzzoot.blogspot.com

-- Old LuSql benchmarks:
http://zzzoot.blogspot.com/2008/11/lucene-231-vs-24-benchmarks-using-lusql.html

On Thu, Dec 16, 2010 at 12:04 PM, Dyer, James james.d...@ingrambook.com wrote:
 We have ~50 long-running SQL queries that need to be joined and denormalized. 
  Not all of the queries are to the same db, and some data comes from 
 fixed-width data feeds.  Our current search engine (that we are converting to 
 SOLR) has a fast disk-caching mechanism that lets you cache all of these data 
 sources and then it will join them locally prior to indexing.

 I'm in the process of developing something similar for DIH that uses the 
 Berkley db to do the same thing.  Its good enough that I can do nightly full 
 re-indexes of all our data while developing the front-end, but it is still 
 very rough.  Possibly I would like to get this refined enough to eventually 
 submit as a jira ticket / patch as it seems this is a somewhat common problem 
 that needs solving.

 Even with our current search engine, the join & denormalize step is always 
 the longest-running part of the process.  However, I have it running fairly 
 fast by partitioning the data by a modulus of the primary key and then 
 running several jobs in parallel.  The trick is not to get I/O bound.  Things 
 run fast if you can set it up to maximize CPU.

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Ephraim Ofir [mailto:ephra...@icq.com]
 Sent: Thursday, December 16, 2010 3:04 AM
 To: solr-user@lucene.apache.org
 Subject: RE: Dataimport performance

 Check out 
 

Re: bulk commits

2010-12-16 Thread Dennis Gearon
That easy, huh? Heck, this gets better and better.

BTW, how about escaping?


 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Adam Estrada estrada.adam.gro...@gmail.com
To: Dennis Gearon gear...@sbcglobal.net; solr-user@lucene.apache.org
Sent: Thu, December 16, 2010 10:58:47 AM
Subject: Re: bulk commits

This is how I import a lot of data from a cvs file. There are close to 100k
records in there. Note that you can either pre-define the column names using
the fieldnames param like I did here *or* include header=true which will
automatically pick up the column header if your file has it.

curl 
http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,catstream.file=C

:\tmp\cities1000.csvoverwrite=truestream.contentType=text/plain;charset=utf-8

This seems to load everything in to some kind of temporary location before
it's actually committed. If something goes wrong there is a rollback feature
that will undo anything that happened before the commit.

As far as batching a bunch of files, I copied and pasted the following in to
Cygwin and it worked just fine.

curl 
http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,catstream.file=C

:\tmp\cities1000.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
curl 
http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C

:\tmp\xab.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
curl 
http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C

:\tmp\xac.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
curl 
http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C

:\tmp\xad.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
curl 
http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C

:\tmp\xae.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
curl 
http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C

:\tmp\xaf.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
curl 
http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C

:\tmp\xag.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
curl 
http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C

:\tmp\xah.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
curl 
http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C

:\tmp\xai.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
curl 
http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C

:\tmp\xaj.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
curl 

Re: bulk commits

2010-12-16 Thread Yonik Seeley
On Thu, Dec 16, 2010 at 3:06 PM, Dennis Gearon gear...@sbcglobal.net wrote:
 That easy, huh? Heck, this gets better and better.

 BTW, how about escaping?

The CSV escaping?  It's configurable to allow for loading different
CSV dialects.

http://wiki.apache.org/solr/UpdateCSV

By default it uses double quote encapsulation, like excel would.
The bottom of the wiki page shows how to configure tab separators and
backslash escaping like MySQL produces by default.

-Yonik
http://www.lucidimagination.com



  Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a 
 better
 idea to learn from others’ mistakes, so you do not have to make them yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.



 - Original Message 
 From: Adam Estrada estrada.adam.gro...@gmail.com
 To: Dennis Gearon gear...@sbcglobal.net; solr-user@lucene.apache.org
 Sent: Thu, December 16, 2010 10:58:47 AM
 Subject: Re: bulk commits

 This is how I import a lot of data from a cvs file. There are close to 100k
 records in there. Note that you can either pre-define the column names using
 the fieldnames param like I did here *or* include header=true which will
 automatically pick up the column header if your file has it.

 curl 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,catstream.file=C

 :\tmp\cities1000.csvoverwrite=truestream.contentType=text/plain;charset=utf-8

 This seems to load everything in to some kind of temporary location before
 it's actually committed. If something goes wrong there is a rollback feature
 that will undo anything that happened before the commit.

 As far as batching a bunch of files, I copied and pasted the following in to
 Cygwin and it worked just fine.

 curl 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,catstream.file=C

 :\tmp\cities1000.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
 curl 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C

 :\tmp\xab.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
 curl 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C

 :\tmp\xac.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
 curl 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C

 :\tmp\xad.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
 curl 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C

 :\tmp\xae.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
 curl 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C

 :\tmp\xaf.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
 curl 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C

 :\tmp\xag.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
 curl 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C

 :\tmp\xah.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
 curl 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C

 :\tmp\xai.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
 curl 
 

Re: Query Problem

2010-12-16 Thread Erick Erickson
Ezequiel:

Nice job of including relevant details, by the way. Unfortunately I'm
puzzled too. Your SectionName is a string type, so it should
be placed in the index as-is. Be a bit cautious about looking at
returned results (as I see in one of your xml files) because the returned
values are the verbatim, stored field NOT what's tokenized, and the
tokenized data is what's searched..

That said, your SectionName should not be tokenized at all because
it's a string type. Take a look at the admin page's schema browser and
see what the values for SectionName look like (these will be the tokenized
values). They should be exactly
Programas_Home, complete with underscore, case changes, etc. Is that
the case?

Another place that might help is the admin/analysis page. Check the debug
boxes, put in your values, and it'll show you what transformations
are applied. But a quick look leaves me completely baffled.

Sorry I can't be more help
Erick

On Thu, Dec 16, 2010 at 2:07 PM, Ezequiel Calderara ezech...@gmail.comwrote:

 Hi all, I have the following problems.
 I have this set of data (View data (Pastebin) 
 http://pastebin.com/jKbUhjVS
 )
 If i do a search for: *SectionName:Programas_Home* i have no results:
 Returned
 Data (PasteBin) http://pastebin.com/wnPdHqBm
 If i do a search for: *Programas_Home* i have only 1 result: Result
 Returned
 (Pastebin) http://pastebin.com/fMZkLvYK
 if i do a search for: SectionName:Programa* i have 1 result: Result
 Returned
 (Pastebin) http://pastebin.com/kLLnVp4b

 This is my *schema* http://pastebin.com/PQM8uap4 (Pastebin) and this is
 my
 *solrconfig* http://%3c/?xml version=1.0 encoding=UTF-8 ?(PasteBin)

 I don't understand why when searching for SectionName:Programas_Home
 isn't
 returning any results at all...

 Can someone send some light on this?
 --
 __
 Ezequiel.

 Http://www.ironicnet.com



RE: Memory use during merges (OOM)

2010-12-16 Thread Burton-West, Tom
Your setting isn't being applied to the reader IW uses during
merging... its only for readers Solr opens from directories
explicitly.
I think you should open a jira issue!

Do I understand correctly that this setting in theory could be applied to the 
reader IW uses during merging but is not currently being applied?   

<indexReaderFactory name="IndexReaderFactory"
    class="org.apache.solr.core.StandardIndexReaderFactory">
  <int name="termInfosIndexDivisor">8</int>
</indexReaderFactory>

I understand the tradeoffs for doing this during searching, but not the 
trade-offs for doing this during merging.  Is the use during merging the 
similar to the use during searching? 

 i.e. Some process has to look up data for a particular term as opposed to 
having to iterate through all the terms?  
 (Haven't yet dug into the merging/indexing code).   

Tom


-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 

 We are setting  termInfosIndexDivisor, which I think translates to the Lucene 
 IndexWriter.setReaderTermsIndexDivisor




Re: Memory use during merges (OOM)

2010-12-16 Thread Robert Muir
On Thu, Dec 16, 2010 at 4:03 PM, Burton-West, Tom tburt...@umich.edu wrote:
Your setting isn't being applied to the reader IW uses during
merging... its only for readers Solr opens from directories
explicitly.
I think you should open a jira issue!

 Do I understand correctly that this setting in theory could be applied to the 
 reader IW uses during merging but is not currently being applied?

yes, i'm not really sure (especially given the name=) if you can/or
it was planned to have multiple IR factories in solr, e.g. a separate
one for spellchecking.
so i'm not sure if we should (hackishly) steal this parameter from the
IR factory (it is common to all IRFactories, not just
StandardIRFactory) and apply it to IW..

but we could at least expose the divisor param separately to the IW
config so you have some way of setting it.


 <indexReaderFactory name="IndexReaderFactory"
     class="org.apache.solr.core.StandardIndexReaderFactory">
   <int name="termInfosIndexDivisor">8</int>
 </indexReaderFactory>

 I understand the tradeoffs for doing this during searching, but not the 
 trade-offs for doing this during merging.  Is the use during merging the 
 similar to the use during searching?

  i.e. Some process has to look up data for a particular term as opposed to 
 having to iterate through all the terms?
  (Haven't yet dug into the merging/indexing code).

it needs it for applying deletes...

as a workaround (if you are reindexing), maybe instead of using the
Terms Index Divisor=8 you could set the Terms Index Interval = 1024 (8
* 128) ?

this will solve your merging problem, and have the same perf
characteristics of divisor=8, except you cant go back down like you
can with the divisor without reindexing with a smaller interval...

if you've already tested that performance with the divisor of 8 is
acceptable, or in your case maybe necessary!, it sort of makes sense
to 'bake it in' by setting your divisor back to 1 and your interval =
1024 instead...
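
(In Lucene API terms the 'bake it in' option is just the following, applied before
re-indexing; 1024 is only an example value:)

import org.apache.lucene.index.IndexWriter;

public class BakedInTermsIndex {
  static void configure(IndexWriter writer) {
    // write only every 1024th term to the terms index (.tii) from now on,
    // instead of sub-sampling at read time with a divisor of 8;
    // this only affects newly written segments, hence the re-index
    writer.setTermIndexInterval(1024);
  }
}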


Re: Memory use during merges (OOM)

2010-12-16 Thread Yonik Seeley
On Thu, Dec 16, 2010 at 5:51 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 If you are doing false deletions (calling .updateDocument when in fact
 the Term you are replacing cannot exist) it'd be best if possible to
 change the app to not call .updateDocument if you know the Term
 doesn't exist.

FWIW, if you're going to add a batch of documents you know aren't
already in the index,
you can use the overwrite=false parameter for that Solr update request.
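
(One way to pass that parameter from Java is to send the raw XML update yourself;
a rough sketch, assuming SolrJ's DirectXmlRequest is available in your version:)

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.DirectXmlRequest;

public class AddWithoutOverwrite {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    // overwrite="false" skips the delete-by-id lookup for documents
    // that are known not to be in the index yet
    String xml = "<add overwrite=\"false\">"
               + "<doc><field name=\"id\">99</field>"
               + "<field name=\"name\">brand new doc</field></doc>"
               + "</add>";
    solr.request(new DirectXmlRequest("/update", xml));
    solr.commit();
  }
}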

-Yonik
http://www.lucidimagination.com


Re: Query Problem

2010-12-16 Thread Ezequiel Calderara
I'll check the Tokenizer to see if that's the problem.
The results of the Analysis page for SectionName:Programas_Home:

Query Analyzer: org.apache.solr.schema.FieldType$DefaultAnalyzer {}
  term position:    1
  term text:        Programas_Home
  term type:        word
  source start,end: 0,14
  payload:

So it's not having problems with that... Also in the debug you can see that
the parsed query is correct...
So i don't know where to look...

I know nothing about Stemming or tokenizing, but i will look if that has
anything to do.

If anyone can help me out, please do :D




On Thu, Dec 16, 2010 at 5:55 PM, Erick Erickson erickerick...@gmail.comwrote:

 Ezequiel:

 Nice job of including relevant details, by the way. Unfortunately I'm
 puzzled too. Your SectionName is a string type, so it should
 be placed in the index as-is. Be a bit cautious about looking at
 returned results (as I see in one of your xml files) because the returned
 values are the verbatim, stored field NOT what's tokenized, and the
 tokenized data is what's searched..

 That said, you SectionName should not be tokenized at all because
 it's a string type. Take a look at the admin page, schema browser and
 see what values for SectionName look (these will be the tokenized
 values. They should be exactly
 Programas_Name, complete with underscore, case changes, etc. Is that
 the case?

 Another place that might help is the admin/analysis page. Check the debug
 boxes and input your steps and it'll show you what the transformations
 are applied. But a quick look leaves me completely baffled.

 Sorry I can't be more help
 Erick

 On Thu, Dec 16, 2010 at 2:07 PM, Ezequiel Calderara ezech...@gmail.com
 wrote:

  Hi all, I have the following problems.
  I have this set of data (View data (Pastebin) 
  http://pastebin.com/jKbUhjVS
  )
  If i do a search for: *SectionName:Programas_Home* i have no results:
  Returned
  Data (PasteBin) http://pastebin.com/wnPdHqBm
  If i do a search for: *Programas_Home* i have only 1 result: Result
  Returned
  (Pastebin) http://pastebin.com/fMZkLvYK
  if i do a search for: SectionName:Programa* i have 1 result: Result
  Returned
  (Pastebin) http://pastebin.com/kLLnVp4b
 
  This is my *schema* http://pastebin.com/PQM8uap4 (Pastebin) and this
 is
  my
  *solrconfig* http://%3c/?xml version=1.0 encoding=UTF-8
 ?(PasteBin)
  
  I don't understand why when searching for SectionName:Programas_Home
  isn't
  returning any results at all...
 
  Can someone send some light on this?
  --
  __
  Ezequiel.
 
  Http://www.ironicnet.com http://www.ironicnet.com/
 




-- 
__
Ezequiel.

Http://www.ironicnet.com


Re: Query Problem

2010-12-16 Thread Erick Erickson
OK, what version of Solr are you using? I can take a quick check to see
what behavior I get

Erick

On Thu, Dec 16, 2010 at 4:44 PM, Ezequiel Calderara ezech...@gmail.comwrote:

 I'll check the Tokenizer to see if that's the problem.
 The results of Analysis Page for SectionName:Programas_Home
  Query Analyzer org.apache.solr.schema.FieldType$DefaultAnalyzer {}  term
 position 1 term text Programas_Home term type word source start,end 0,14
 payload

 So it's not having problems with that... Also in the debug you can see that
 the parsed query is correct...
 So i don't know where to look...

 I know nothing about Stemming or tokenizing, but i will look if that has
 anything to do.

 If anyone can help me out, please do :D




 On Thu, Dec 16, 2010 at 5:55 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Ezequiel:
 
  Nice job of including relevant details, by the way. Unfortunately I'm
  puzzled too. Your SectionName is a string type, so it should
  be placed in the index as-is. Be a bit cautious about looking at
  returned results (as I see in one of your xml files) because the returned
  values are the verbatim, stored field NOT what's tokenized, and the
  tokenized data is what's searched..
 
  That said, you SectionName should not be tokenized at all because
  it's a string type. Take a look at the admin page, schema browser and
  see what values for SectionName look (these will be the tokenized
  values. They should be exactly
  Programas_Name, complete with underscore, case changes, etc. Is that
  the case?
 
  Another place that might help is the admin/analysis page. Check the debug
  boxes and input your steps and it'll show you what the transformations
  are applied. But a quick look leaves me completely baffled.
 
  Sorry I can't be more help
  Erick
 
  On Thu, Dec 16, 2010 at 2:07 PM, Ezequiel Calderara ezech...@gmail.com
  wrote:
 
   Hi all, I have the following problems.
   I have this set of data (View data (Pastebin) 
   http://pastebin.com/jKbUhjVS
   )
   If i do a search for: *SectionName:Programas_Home* i have no results:
   Returned
   Data (PasteBin) http://pastebin.com/wnPdHqBm
   If i do a search for: *Programas_Home* i have only 1 result: Result
   Returned
   (Pastebin) http://pastebin.com/fMZkLvYK
   if i do a search for: SectionName:Programa* i have 1 result: Result
   Returned
   (Pastebin) http://pastebin.com/kLLnVp4b
  
   This is my *schema* http://pastebin.com/PQM8uap4 (Pastebin) and this
  is
   my
   *solrconfig* http://%3c/?xml version=1.0 encoding=UTF-8
  ?(PasteBin)
   
   I don't understand why when searching for SectionName:Programas_Home
   isn't
   returning any results at all...
  
   Can someone send some light on this?
   --
   __
   Ezequiel.
  
   Http://www.ironicnet.com http://www.ironicnet.com/
  
 



 --
 __
 Ezequiel.

 Http://www.ironicnet.com



Re: Query Problem

2010-12-16 Thread Ezequiel Calderara
The jars are named like *1.4.1*, so I suppose it's version 1.4.1.

Thanks!

On Thu, Dec 16, 2010 at 6:54 PM, Erick Erickson erickerick...@gmail.comwrote:

 OK, what version of Solr are you using? I can take a quick check to see
 what behavior I get

 Erick

 On Thu, Dec 16, 2010 at 4:44 PM, Ezequiel Calderara ezech...@gmail.com
 wrote:

  I'll check the Tokenizer to see if that's the problem.
  The results of Analysis Page for SectionName:Programas_Home
   Query Analyzer org.apache.solr.schema.FieldType$DefaultAnalyzer {}  term
  position 1 term text Programas_Home term type word source start,end 0,14
  payload
 
  So it's not having problems with that... Also in the debug you can see
 that
  the parsed query is correct...
  So i don't know where to look...
 
  I know nothing about Stemming or tokenizing, but i will look if that
 has
  anything to do.
 
  If anyone can help me out, please do :D
 
 
 
 
  On Thu, Dec 16, 2010 at 5:55 PM, Erick Erickson erickerick...@gmail.com
  wrote:
 
   Ezequiel:
  
   Nice job of including relevant details, by the way. Unfortunately I'm
   puzzled too. Your SectionName is a string type, so it should
   be placed in the index as-is. Be a bit cautious about looking at
   returned results (as I see in one of your xml files) because the
 returned
   values are the verbatim, stored field NOT what's tokenized, and the
   tokenized data is what's searched..
  
   That said, you SectionName should not be tokenized at all because
   it's a string type. Take a look at the admin page, schema browser and
   see what values for SectionName look (these will be the tokenized
   values. They should be exactly
   Programas_Name, complete with underscore, case changes, etc. Is that
   the case?
  
   Another place that might help is the admin/analysis page. Check the
 debug
   boxes and input your steps and it'll show you what the transformations
   are applied. But a quick look leaves me completely baffled.
  
   Sorry I can't be more help
   Erick
  
   On Thu, Dec 16, 2010 at 2:07 PM, Ezequiel Calderara 
 ezech...@gmail.com
   wrote:
  
Hi all, I have the following problems.
I have this set of data (View data (Pastebin) 
http://pastebin.com/jKbUhjVS
)
If i do a search for: *SectionName:Programas_Home* i have no results:
Returned
Data (PasteBin) http://pastebin.com/wnPdHqBm
If i do a search for: *Programas_Home* i have only 1 result: Result
Returned
(Pastebin) http://pastebin.com/fMZkLvYK
if i do a search for: SectionName:Programa* i have 1 result: Result
Returned
(Pastebin) http://pastebin.com/kLLnVp4b
   
This is my *schema* http://pastebin.com/PQM8uap4 (Pastebin) and
 this
   is
my
*solrconfig* http://%3c/?xml version=1.0 encoding=UTF-8
   ?(PasteBin)

I don't understand why when searching for
 SectionName:Programas_Home
isn't
returning any results at all...
   
Can someone send some light on this?
--
__
Ezequiel.
   
Http://www.ironicnet.com http://www.ironicnet.com/ 
 http://www.ironicnet.com/

  
 
 
 
  --
  __
  Ezequiel.
 
  Http://www.ironicnet.com http://www.ironicnet.com/
 




-- 
__
Ezequiel.

Http://www.ironicnet.com


Re: Jquery Autocomplete Json formatting ?

2010-12-16 Thread Anurag

Installed Firebug

Now getting the following error
4139 matches.call( document.documentElement, [test!='']:sizzle );

Though my Solr server is running on port 8983, I am not using any server to
run this jQuery; it's just an HTML file in my home folder that I am opening
in my Firefox browser.



-
Kumar Anurag

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Jquery-Autocomplete-Json-formatting-tp2101346p2101595.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Faceted Search Slows Down as index gets larger

2010-12-16 Thread Furkan Kuru
I am sorry for raising up this thread after 6 months.

But we still have problems with faceted search on full-text fields.

We try to get the most frequent words in a text field from documents created in
the last hour. The faceted search takes too much time: even though the number of
matching documents (created_at within 1 HOUR) stays constant (10-20K), the query
gets slower as the total number of documents increases (now 20M). Solr throws
exceptions and does not respond, and we have to restart and delete old docs.
(3G RAM) The index is around 2.2 GB.
And we store the data in Solr as well. The documents are small.

$response = $solr->search('created_at:[NOW-'.$hours.'HOUR TO NOW]', 0, 1,
array( 'facet' => 'true', 'facet.field' => $field, 'facet.mincount' => 1,
'facet.method' => 'enum', 'facet.enum.cache.minDf' => 100 ));

Yonik had suggested distributed search, but I am not sure whether we have set every
configuration option correctly; for example, the Solr caches, if they are relevant
to faceted searching.

We use default values:

<filterCache
  class="solr.FastLRUCache"
  size="512"
  initialSize="512"
  autowarmCount="0"/>


<queryResultCache
  class="solr.LRUCache"
  size="512"
  initialSize="512"
  autowarmCount="0"/>



Any help is appreciated.
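
For the distributed-search route mentioned above, the query side is just a shards
parameter listing the cores; a rough sketch with placeholder core and field names
(say, bigcore for the stable index and freshcore for the last hour of documents):

curl 'http://localhost:8983/solr/bigcore/select?q=created_at:[NOW-1HOUR%20TO%20NOW]&rows=0&facet=true&facet.field=text&facet.mincount=1&shards=localhost:8983/solr/bigcore,localhost:8983/solr/freshcore'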



On Sun, Jun 6, 2010 at 8:54 PM, Yonik Seeley yo...@lucidimagination.comwrote:

 On Sun, Jun 6, 2010 at 1:12 PM, Furkan Kuru furkank...@gmail.com wrote:
  We try to provide real-time search. So the index is changing almost in
 every
  minute.
 
  We commit for every 100 documents received.
 
  The facet search is executed every 5 mins.

 OK, that's the problem - pretty much every facet search is rebuilding
 the facet cache, which takes most of the time (and facet.fc is more
 expensive than facet.enum in this regard).

 One strategy is to use distributed search... have some big cores that
 don't change often, and then small cores for the new stuff that
 changes rapidly.

 -Yonik
 http://www.lucidimagination.com




-- 
Furkan Kuru


Re: how to config DataImport Scheduling

2010-12-16 Thread Ahmet Arslan
 I also have the same problem, i configure
 dataimport.properties file as shown
 in 
 http://wiki.apache.org/solr/DataImportHandler#dataimport.properties_example
 but no change occur, can any one help me

What version of Solr are you using? This seems to be a new feature, so it won't work
on Solr 1.4.1.


  


Re: Jquery Autocomplete Json formatting ?

2010-12-16 Thread lee carroll
I think this could be down to the browser's same-origin rule applied to ajax requests.

You're not allowed to load content from a server other than the one the page came from :-(

The good news: Solr supports JSONP, which is a neat trick around this. Try this
(pasted from another thread):

var queryString = "*:*";
$.getJSON(
    "http://[server]:[port]/solr/select/?jsoncallback=?",
    {
        q: queryString,
        version: "2.2",
        start: "0",
        rows: "10",
        indent: "on",
        "json.wrf": "callbackFunctionToDoSomethingWithOurData",
        wt: "json",
        fl: "field1"
    }
);

and the callback function

function callbackFunctionToDoSomethingWithOurData(solrData) {
    // do stuff with your nice data
}




cheers lee c

On 16 December 2010 23:18, Anurag anurag.it.jo...@gmail.com wrote:


 Installed Firebug

 Now getting the following error
 4139 matches.call( document.documentElement, [test!='']:sizzle );

 Though my solr server is running on port8983, I am not using any server to
 run this jquery, its just an html file in my home folder that i am opening
 in my firefox browser.



 -
 Kumar Anurag

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Jquery-Autocomplete-Json-formatting-tp2101346p2101595.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Query Problem

2010-12-16 Thread Erick Erickson
OK, it works perfectly for me on a 1.4.1 instance. I've looked over your
files a couple of times and see nothing obvious (but you'll never find
anyone better at overlooking the obvious than me!).

Tokenizing and stemming are irrelevant in this case because your
type is string, which is an untokenized type, so you don't need to
go there.

The way your query parses and analyzes backs this up, so you're
getting to the right schema definition.

Which may bring us to whether what's in the index is what you *think* is
in there. I'm betting not. Either you changed the schema and didn't re-index
(say changed index=false to index=true), you didn't commit the documents
after indexing or other such-like, or changed the field type and didn't
reindex.

So go into /solr/admin. Click on schema browser, then on fields. Along
the left you should see SectionName; click on that. That will show you the
#indexed# terms, and you should see exactly Programas_Home in there, just
like in your returned documents. Let us know if that's in fact what you do
see. It's possible you're being misled by the difference between seeing the
value in a returned document (the stored value) and what's searched on (the
indexed token(s)).

And I'm assuming that some asterisks in your mails were really there for
bolding and
you are NOT doing wildcard searches for, for instance,
 *SectionName:Programas_Home*.

But we're at a point where my 1.4.1 instance produces the results you're
expecting, at least as I understand them, so I don't think it's a problem with
Solr, but some change you've made is producing results that you don't expect
but that are correct. Like I said, look at the indexed terms. If you see
Programas_Home in the admin console after following the steps above, then I
don't know what to suggest.
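
If the schema browser is awkward to read, the Luke request handler exposes the
same per-field view of indexed terms; a quick sketch, assuming the default
/admin/luke mapping from the example config:

curl 'http://localhost:8983/solr/admin/luke?fl=SectionName&numTerms=50&wt=json'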

Best
Erick

On Thu, Dec 16, 2010 at 5:12 PM, Ezequiel Calderara ezech...@gmail.comwrote:

 The jars are named like *1.4.1* . So i suppose its the version 1.4.1

 Thanks!

 On Thu, Dec 16, 2010 at 6:54 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  OK, what version of Solr are you using? I can take a quick check to see
  what behavior I get
 
  Erick
 
  On Thu, Dec 16, 2010 at 4:44 PM, Ezequiel Calderara ezech...@gmail.com
  wrote:
 
   I'll check the Tokenizer to see if that's the problem.
   The results of Analysis Page for SectionName:Programas_Home
Query Analyzer org.apache.solr.schema.FieldType$DefaultAnalyzer {}
  term
   position 1 term text Programas_Home term type word source start,end
 0,14
   payload
  
   So it's not having problems with that... Also in the debug you can see
  that
   the parsed query is correct...
   So i don't know where to look...
  
   I know nothing about Stemming or tokenizing, but i will look if that
  has
   anything to do.
  
   If anyone can help me out, please do :D
  
  
  
  
   On Thu, Dec 16, 2010 at 5:55 PM, Erick Erickson 
 erickerick...@gmail.com
   wrote:
  
Ezequiel:
   
Nice job of including relevant details, by the way. Unfortunately I'm
puzzled too. Your SectionName is a string type, so it should
be placed in the index as-is. Be a bit cautious about looking at
returned results (as I see in one of your xml files) because the
  returned
values are the verbatim, stored field NOT what's tokenized, and the
tokenized data is what's searched..
   
That said, you SectionName should not be tokenized at all because
it's a string type. Take a look at the admin page, schema browser
 and
see what values for SectionName look (these will be the tokenized
values. They should be exactly
Programas_Name, complete with underscore, case changes, etc. Is that
the case?
   
Another place that might help is the admin/analysis page. Check the
  debug
boxes and input your steps and it'll show you what the
 transformations
are applied. But a quick look leaves me completely baffled.
   
Sorry I can't be more help
Erick
   
On Thu, Dec 16, 2010 at 2:07 PM, Ezequiel Calderara 
  ezech...@gmail.com
wrote:
   
 Hi all, I have the following problems.
 I have this set of data (View data (Pastebin) 
 http://pastebin.com/jKbUhjVS
 )
 If i do a search for: *SectionName:Programas_Home* i have no
 results:
 Returned
 Data (PasteBin) http://pastebin.com/wnPdHqBm
 If i do a search for: *Programas_Home* i have only 1 result: Result
 Returned
 (Pastebin) http://pastebin.com/fMZkLvYK
 if i do a search for: SectionName:Programa* i have 1 result: Result
 Returned
 (Pastebin) http://pastebin.com/kLLnVp4b

 This is my *schema* http://pastebin.com/PQM8uap4 (Pastebin) and
  this
is
 my
 *solrconfig* http://%3c/?xml version=1.0 encoding=UTF-8
?(PasteBin)
 
 I don't understand why when searching for
  SectionName:Programas_Home
 isn't
 returning any results at all...

 Can someone send some light on this?
 --
 __
 Ezequiel.

 

Re: Faceted Search Slows Down as index gets larger

2010-12-16 Thread Yonik Seeley
Another thing you can try is trunk.  This specific case has been
improved by an order of magnitude recently.
The case that has been sped up is initial population of the
filterCache, or when the filterCache can't hold all of the unique
values, or when faceting is configured to not use the filterCache much
of the time via facet.enum.cache.minDf.

-Yonik
http://www.lucidimagination.com

On Thu, Dec 16, 2010 at 6:39 PM, Furkan Kuru furkank...@gmail.com wrote:
 I am sorry for raising up this thread after 6 months.

 But we have still problems with faceted search on full-text fields.

 We try to get most frequent words in a text field that is created in 1 hour.
 The faceted search takes too much time even the matching number of documents
 (created_at within 1 HOUR) is constant (10-20K) as the total number of
 documents increases (now 20M) the query gets slower. Solr throws exceptions
 and does not respond. We have to restart and delete old docs. (3G RAM) Index
 is around 2.2 GB.
 And we store the data in solr as well. The documents are small.

 $response = $solr-search('created_at:[NOW-'.$hours.'HOUR TO NOW]', 0, 1,
 array( 'facet' = 'true', 'facet.field'= $field, 'facet.mincount' = 1,
 'facet.method' = 'enum', 'facet.enum.cache.minDf' = 100 ));

 Yonik had suggested distributed search. But I am not sure if we set every
 configuration correctly. For example the solr caches if they are related
 with faceted searching.

 We use default values:

 filterCache
   class=solr.FastLRUCache
   size=512
   initialSize=512
   autowarmCount=0/


 queryResultCache
   class=solr.LRUCache
   size=512
   initialSize=512
   autowarmCount=0/



 Any help is appreciated.



 On Sun, Jun 6, 2010 at 8:54 PM, Yonik Seeley yo...@lucidimagination.com
 wrote:

 On Sun, Jun 6, 2010 at 1:12 PM, Furkan Kuru furkank...@gmail.com wrote:
  We try to provide real-time search. So the index is changing almost in
  every
  minute.
 
  We commit for every 100 documents received.
 
  The facet search is executed every 5 mins.

 OK, that's the problem - pretty much every facet search is rebuilding
 the facet cache, which takes most of the time (and facet.fc is more
 expensive than facet.enum in this regard).

 One strategy is to use distributed search... have some big cores that
 don't change often, and then small cores for the new stuff that
 changes rapidly.

 -Yonik
 http://www.lucidimagination.com



 --
 Furkan Kuru



Re: bulk commits

2010-12-16 Thread Adam Estrada
One very important thing I forgot to mention is that you will have to
increase the JAVA heap size for larger data sets.

Set your JVM heap options (e.g. JAVA_OPTS) to something acceptable.
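
For example (the values are only illustrative; tune -Xmx to your data and container):

java -Xms512m -Xmx2048m -jar start.jar          # Jetty example server
export JAVA_OPTS="-Xms512m -Xmx2048m"           # if running under Tomcat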

Adam

On Thu, Dec 16, 2010 at 3:27 PM, Yonik Seeley yo...@lucidimagination.comwrote:

 On Thu, Dec 16, 2010 at 3:06 PM, Dennis Gearon gear...@sbcglobal.net
 wrote:
  That easy, huh? Heck, this gets better and better.
 
  BTW, how about escaping?

 The CSV escaping?  It's configurable to allow for loading different
 CSV dialects.

 http://wiki.apache.org/solr/UpdateCSV

 By default it uses double quote encapsulation, like excel would.
 The bottom of the wiki page shows how to configure tab separators and
 backslash escaping like MySQL produces by default.

 -Yonik
 http://www.lucidimagination.com


 
   Dennis Gearon
 
 
  Signature Warning
  
  It is always a good idea to learn from your own mistakes. It is usually a
 better
  idea to learn from others’ mistakes, so you do not have to make them
 yourself.
  from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'
 
 
  EARTH has a Right To Life,
  otherwise we all die.
 
 
 
  - Original Message 
  From: Adam Estrada estrada.adam.gro...@gmail.com
  To: Dennis Gearon gear...@sbcglobal.net; solr-user@lucene.apache.org
  Sent: Thu, December 16, 2010 10:58:47 AM
  Subject: Re: bulk commits
 
  This is how I import a lot of data from a cvs file. There are close to
 100k
  records in there. Note that you can either pre-define the column names
 using
  the fieldnames param like I did here *or* include header=true which will
  automatically pick up the column header if your file has it.
 
  curl 
 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,catstream.file=C
 
 
 :\tmp\cities1000.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
 
  This seems to load everything in to some kind of temporary location
 before
  it's actually committed. If something goes wrong there is a rollback
 feature
  that will undo anything that happened before the commit.
 
  As far as batching a bunch of files, I copied and pasted the following in
 to
  Cygwin and it worked just fine.
 
  curl 
 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,catstream.file=C
 
 
 :\tmp\cities1000.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
  curl 
 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C
 
  :\tmp\xab.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
  curl 
 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C
 
  :\tmp\xac.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
  curl 
 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C
 
  :\tmp\xad.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
  curl 
 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C
 
  :\tmp\xae.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
  curl 
 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C
 
  :\tmp\xaf.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
  curl 
 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C
 
  :\tmp\xag.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
  curl 
 
 http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C
 
  :\tmp\xah.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
  curl 
 
 

Re: facet.pivot for date fields

2010-12-16 Thread Adeel Qureshi
I guess one last call for help .. I am assuming that for people who wrote or have
used pivot faceting, this should be a yes/no question .. are date
fields supported?
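
(For reference, a sketch of the facet.query workaround discussed in the replies
quoted below, with placeholder field names; it gives fixed date buckets alongside
the pivot rather than buckets nested inside each pivot value.)

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true
    &facet.pivot=fieldA,fieldB
    &facet.query=dateField:[NOW-1MONTH%20TO%20NOW]
    &facet.query=dateField:[NOW-2MONTHS%20TO%20NOW-1MONTH]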

On Wed, Dec 15, 2010 at 12:58 PM, Adeel Qureshi adeelmahm...@gmail.comwrote:

 Thanks Pankaj - that was useful to know. I havent used the query stuff
 before for facets .. so that was good to know .. but the problem is still
 there because I want the hierarchical counts which is exactly what
 facet.pivot does ..

 so e.g. i want to count for fieldC within fieldB and even fieldB within
 fieldA .. that kind of stuff .. for string based fields .. facet.pivot does
 exactly that and does it very well .. but it doesnt seems to work for date
 ranges .. so in this case I want counts to be broken down by fieldA and
 fieldB and then fieldB counts for monthly ranges .. I understand that I
 might be able to use facet.query to construct several queries to get these
 counts .. e.g. *facet.query=fieldA:someValue AND fieldB:someValue AND
 fieldC:[NOW-1YEAR TO NOW]* .. but there could be thousand of possible
 combinations for fieldA and fieldB which will require as many facet.queries
 which I am assuming is not the way to go ..

 it might be confusing what I have explained above so the simple question
 still is if there is a way to get date range counts included in facet.pivot

 Adeel



 On Tue, Dec 14, 2010 at 10:53 PM, pankaj bhatt panbh...@gmail.com wrote:

 Hi Adeel,
  You can make use of facet.query attribute to make the Faceting work
 across a range of dates. Here i am using the duration, just replace the
 field with a field date and Range values as the DATE in SOLR Format.
 so your query parameter will be like this ( you can pass multiple
 parameter
 of facet.query name)

 http://blasdsdfsd/q?q=asdfasd&facet.query=itemduration:[0 TO
 49]&facet.query=itemduration:[50 TO 99]&facet.query=itemduration:[100 TO
 149]

 Hope, it helps.

 / Pankaj Bhatt.

 On Wed, Dec 15, 2010 at 2:01 AM, Adeel Qureshi adeelmahm...@gmail.com
 wrote:

  It doesnt seems like pivot facetting works on dates .. I was just
 curious
  if
  thats how its supposed to be or I am doing something wrong .. if I
 include
  a
  datefield in the pivot list .. i simply dont get any facet results back
 for
  that datefield
 
  Thanks
  Adeel
 





Re: bulk commits

2010-12-16 Thread Dennis Gearon
Thanks Adam!
Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Thu, 12/16/10, Adam Estrada estrada.a...@gmail.com wrote:

 From: Adam Estrada estrada.a...@gmail.com
 Subject: Re: bulk commits
 To: solr-user@lucene.apache.org
 Date: Thursday, December 16, 2010, 6:18 PM
 One very important thing I forgot to
 mention is that you will have to
 increase the JAVA heap size for larger data sets.
 
 Set JAVA_OPT to something acceptable.
 
 Adam
 
 On Thu, Dec 16, 2010 at 3:27 PM, Yonik Seeley 
 yo...@lucidimagination.comwrote:
 
  On Thu, Dec 16, 2010 at 3:06 PM, Dennis Gearon gear...@sbcglobal.net
  wrote:
   That easy, huh? Heck, this gets better and
 better.
  
   BTW, how about escaping?
 
  The CSV escaping?  It's configurable to allow for
 loading different
  CSV dialects.
 
  http://wiki.apache.org/solr/UpdateCSV
 
  By default it uses double quote encapsulation, like
 excel would.
  The bottom of the wiki page shows how to configure tab
 separators and
  backslash escaping like MySQL produces by default.
 
  -Yonik
  http://www.lucidimagination.com
 
 
  
    Dennis Gearon
  
  
   Signature Warning
   
   It is always a good idea to learn from your own
 mistakes. It is usually a
  better
   idea to learn from others’ mistakes, so you do
 not have to make them
  yourself.
   from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'
  
  
   EARTH has a Right To Life,
   otherwise we all die.
  
  
  
   - Original Message 
   From: Adam Estrada estrada.adam.gro...@gmail.com
   To: Dennis Gearon gear...@sbcglobal.net;
 solr-user@lucene.apache.org
   Sent: Thu, December 16, 2010 10:58:47 AM
   Subject: Re: bulk commits
  
   This is how I import a lot of data from a cvs
 file. There are close to
  100k
   records in there. Note that you can either
 pre-define the column names
  using
   the fieldnames param like I did here *or* include
 header=true which will
   automatically pick up the column header if your
 file has it.
  
   curl 
  
  http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,catstream.file=C
  
  
 
 :\tmp\cities1000.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
  
   This seems to load everything in to some kind of
 temporary location
  before
   it's actually committed. If something goes wrong
 there is a rollback
  feature
   that will undo anything that happened before the
 commit.
  
   As far as batching a bunch of files, I copied and
 pasted the following in
  to
   Cygwin and it worked just fine.
  
   curl 
  
  http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,catstream.file=C
  
  
 
 :\tmp\cities1000.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
   curl 
  
  http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C
  
  
 :\tmp\xab.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
   curl 
  
  http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C
  
  
 :\tmp\xac.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
   curl 
  
  http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C
  
  
 :\tmp\xad.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
   curl 
  
  http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C
  
  
 :\tmp\xae.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
   curl 
  
  http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C
  
  
 :\tmp\xaf.csvoverwrite=truestream.contentType=text/plain;charset=utf-8
   curl 
  
  

Got error when range query and highlight

2010-12-16 Thread Qi Ouyang
Hello all,

I got an error as follows when I do a range query search ([1 TO *])
on a numeric field while highlighting is enabled on another text field.

2010/12/15 10:58:55 org.apache.solr.common.SolrException log
Fatal: org.apache.lucene.search.BooleanQuery$TooManyClauses:
maxClauseCount is set to 1024
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:153)
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:144)
at 
org.apache.lucene.search.MultiTermQuery$ScoringBooleanQueryRewrite.rewrite(MultiTermQuery.java:110)
at 
org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:382)
at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:178)
at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:111)
at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:111)
at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:414)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
jp.co.spectrum.insight.hooserver.core.solrext.dispatcher.HooDispatchFilter.doFilter(Unknown
Source)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at 
org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:216)
at 
org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:184)
at 
org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:226)
at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:335)
at 
org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:89)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
jp.co.spectrum.insight.hooserver.core.solrext.dispatcher.HooDispatchFilter.execute(Unknown
Source)

at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Unknown Source)

Could anyone give me any suggestions?


Ouyang

https://sites.google.com/a/spectrum.co.jp/openinsight/


Re: Got error when range query and highlight

2010-12-16 Thread Ahmet Arslan
 I got an error as follows when I do a range query search
 ([1 TO *])
 on an numeric field and highlight is set on another text
 field.

Are you using hl.highlightMultiTerm=true? Pasting your search URL can give more 
hints.

Adding hl.requireFieldMatch=true should probably solve your problem.
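
Roughly, that just means adding one parameter to the request; a sketch against a
made-up query:

curl 'http://localhost:8983/solr/select?q=somenumericfield:[1%20TO%20*]&hl=true&hl.fl=title,body&hl.requireFieldMatch=true'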


  


Re: Got error when range query and highlight

2010-12-16 Thread Qi Ouyang
Thank you for reply.

 Are you using hl.highlightMultiTerm=true? Pasting your search URL can give 
 more hints.

Yes, I am using hl.highlightMultiTerm=true; my search query is as follows:
start=0rows=10facet.mincount=1facet.field=authornavfacet.field=contentsearchkeywordnavfacet.field=contentstypenavfacet.field=copyrightnavfacet.field=docdatetimenavfacet.field=downloadpathnavfacet.field=filenamenavfacet.field=folderchecksumnavfacet.field=folderpathnavfacet.field=groupidnavfacet.field=kcmeta%2Fbookmark%2Fcountnavfacet.field=kcmeta%2Fcomment%2Fcountnavfacet.field=kcmeta%2Fenterprisetag%2Fcountnavfacet.field=kcmeta%2Fenterprisetag%2Fvaluenavfacet.field=kcmeta%2Fusertag%2Fcountnavfacet.field=kcmeta%2Fusertag%2Fvaluenavfacet.field=kcmeta%2Fview%2Fcountnavfacet.field=kcmeta%2Fvote%2Fcountnavfacet.field=lastmodifiernavfacet.field=mimetypenavfacet.field=orgidnavfacet.field=originalcontentstypenavfacet.field=processingtimenavfacet.field=roleidnavfacet.field=sizenavfacet.field=sourcenavfacet.field=titlenavfacet.field=useridnavfacet=truehl=truehl.fl=bodyhl.fl=titlehl.simple.pre=%3Cspan+class%3D%22highlight-solr%22%3Ehl.simple.post=%3C%2Fspan%3Eq=%28%28kcmeta%2Fview%2Fcount%3A%5B+1+TO+*+%5D+AND+contentstype%3AA9000B0001*%29+AND+%28issecure%3A%220%22+OR+userid%3A%22c6305dc4%5C-cbba%5C-bf48%5C-97d5%5C-dcfe6f2430ef%22%29%29facet.sort=counthl.highlightMultiTerm=true

 Adding hl.requireFieldMatch=true should probably solve your problem.

Yes, adding hl.requireFieldMatch=true can solve my problem, but in my
solution I have a content field that indexes all of the other fields' contents
to support full-text search, and I also have two other fields, title and body,
which support highlighting. When I search on content, I expect title and
body to be highlighted, so using hl.requireFieldMatch=true may not work
for me.
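
(A catch-all field like that is usually wired up with copyField in schema.xml;
a hedged sketch, since the actual schema isn't shown in the thread:)

<field name="content" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="content"/>
<copyField source="body" dest="content"/>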




Ouyang

https://sites.google.com/a/spectrum.co.jp/openinsight/


Re: Got error when range query and highlight

2010-12-16 Thread Ahmet Arslan
  Adding hl.requireFieldMatch=true should probably
 solve your problem.
 
 Yes, adding hl.requireFieldMatch=true can solve my
 problem, but in my
 solution , I have a content field indexing all fields'
 contents to
 support full text search, but I also have another 2 fields
 title and
 body which support highlight, when I do search on
 content, I expect
 the title and body can be high-lighted. So using the
 hl.requireFieldMatch=true may be not work.

So you can increase the number of max boolean clauses in solrconfig.xml. 
Default is 1024. Or you can use hl.highlightMultiTerm=false.
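
If you go the first route, the setting sits in the query section of solrconfig.xml;
for example (the value is only illustrative):

<query>
  <!-- default is 1024; multi-term highlighting rewrites range/prefix queries into boolean clauses -->
  <maxBooleanClauses>4096</maxBooleanClauses>
</query>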

By the way, I couldn't see a full-text query in your URL. It is better to move
your non-full-text clauses into filter queries; that can also solve your
problem. For example:

fq=kcmeta/view/count:[1 TO *]&fq=contentstype:A9000B0001*&fq=(issecure:0 OR
userid:c6305dc4\-cbba\-bf48\-97d5\-dcfe6f2430ef)

You can define multiple fq's. And you can benefit caching.
http://wiki.apache.org/solr/CommonQueryParameters#fq
http://wiki.apache.org/solr/FilterQueryGuidance


  


Solr (and maybe Java?) version numbering systems

2010-12-16 Thread Dennis Gearon
I've inferred from a bunch of posts that Solr 1.4 is actually the upcoming 4.x 
release?

And the numbering systems on other Java products don't seem to match what's 
really out there, e.g. Eclipse and Sun Java.

So what IS the Solr versioning number system? Can anyone give a (maybe 
possible) chronological list?

Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


A schema inside a Solr Schema (Schema in a can)

2010-12-16 Thread Dennis Gearon
Is it possible to put name/value pairs of any type in a native Solr index field 
type? Like JSON/XML/YAML?

The reason that I ask, since you asked, is that I want my main index schema to be a 
base object, and another multivalued field to hold the attributes of the descendants 
that inherit from that base object.

Is there any other way to do this?

What are the limitations in searching and indexing documents with multivalued 
fields?

Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


Re: Best practice for Delta every 2 Minutes.

2010-12-16 Thread Li Li
I think it will not, because the default configuration allows only 2
warming newSearcher threads, but the delay will get longer and longer. The
newer newSearcher will wait for the 2 earlier ones to finish.
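
The limit being referred to is maxWarmingSearchers in solrconfig.xml (shown here
with its usual default of 2; raising it tends to hide the overlap rather than fix it):

<maxWarmingSearchers>2</maxWarmingSearchers>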

2010/12/1 Jonathan Rochkind rochk...@jhu.edu:
 If your index warmings take longer than two minutes, but you're doing a
 commit every two minutes -- you're going to run into trouble with
 overlapping index preperations, eventually leading to an OOM.  Could this be
 it?

 On 11/30/2010 11:36 AM, Erick Erickson wrote:

 I don't know, you'll have to debug it to see if it's the thing that takes
 so
 long. Solr
 should be able to handle 1,200 updates in a very short time unless there's
 something
 else going on, like you're committing after every update or something.

 This may help you track down performance with DIH

 http://wiki.apache.org/solr/DataImportHandler#interactive

 http://wiki.apache.org/solr/DataImportHandler#interactiveBest
 Erick

 On Tue, Nov 30, 2010 at 9:01 AM, stockiist...@shopgate.com  wrote:

 how do you think is the deltaQuery better ? XD
 --
 View this message in context:

 http://lucene.472066.n3.nabble.com/Best-practice-for-Delta-every-2-Minutes-tp1992714p1992774.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Testing Solr

2010-12-16 Thread satya swaroop
Hi All,

 I built Solr successfully and I am thinking of testing it with nearly
300 PDF files, 300 Word docs, 300 Excel files, and so on, with about 300 files
of each type.
 Is there any dummy data available for testing Solr? Otherwise I need to
download each and every file individually..??
Another question: are there any benchmarks of Solr...??

Regards,
satya


Re: Best practice for Delta every 2 Minutes.

2010-12-16 Thread Li Li
We now face the same situation and want to implement it like this:
we add new documents to a RAMDirectory and search two indices: the
index on disk and the RAM index.
Regularly (e.g. every hour) we flush the RAMDirectory to disk as a new segment.
To guard against failures, before adding a document to the RAMDirectory we write it
to a log file, and after flushing we delete the corresponding lines from the log file.
If the program crashes, we replay the log and add the documents back into the
RAMDirectory.
Has anyone done similar work?

2010/12/1 Li Li fancye...@gmail.com:
 you may implement your own MergePolicy to keep on large index and
 merge all other small ones
 or simply set merge factor to 2 and the largest index not be merged by
 set maxMergeDocs less than the docs in the largest one.
 So there is one large index and a small one. when adding a little
 docs, they will be merged into the small one. and you can, e.g. weekly
 optimize the index and merge all indice into one index.

 2010/11/30 stockii st...@shopgate.com:

 Hello.

 index is about 28 Million documents large. When i starts an delta-import is
 look at modified. but delta import takes to long. over an hour need solr for
 delta.

 thats my query. all sessions from the last hour should updated and all
 changed. i think its normal that solr need long time for the querys. how can
 i optimize this ?

 deltaQuery=SELECT id FROM sessions
 WHERE created BETWEEN DATE_ADD( NOW(), INTERVAL - 10 HOUR ) AND NOW()
 OR modified BETWEEN '${dataimporter.last_index_time}' AND DATE_ADD( NOW(),
 INTERVAL - 1 HOUR  ) 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Best-practice-for-Delta-every-2-Minutes-tp1992714p1992714.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Best practice for Delta every 2 Minutes.

2010-12-16 Thread Dennis Gearon
BTW, what is a Delta  (in this context, not an equipment line or a rocket, 
please :-)
Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Thu, 12/16/10, Li Li fancye...@gmail.com wrote:

 From: Li Li fancye...@gmail.com
 Subject: Re: Best practice for Delta every 2 Minutes.
 To: solr-user@lucene.apache.org
 Date: Thursday, December 16, 2010, 10:54 PM
 I think it will not because default
 configuration can only have 2
 newSearcher threads but the delay will be more and more
 long. The
 newer newSearcher will wait these 2 ealier one to finish.
 
 2010/12/1 Jonathan Rochkind rochk...@jhu.edu:
  If your index warmings take longer than two minutes,
 but you're doing a
  commit every two minutes -- you're going to run into
 trouble with
  overlapping index preperations, eventually leading to
 an OOM.  Could this be
  it?
 
  On 11/30/2010 11:36 AM, Erick Erickson wrote:
 
  I don't know, you'll have to debug it to see if
 it's the thing that takes
  so
  long. Solr
  should be able to handle 1,200 updates in a very
 short time unless there's
  something
  else going on, like you're committing after every
 update or something.
 
  This may help you track down performance with DIH
 
  http://wiki.apache.org/solr/DataImportHandler#interactive
 
  http://wiki.apache.org/solr/DataImportHandler#interactiveBest
  Erick
 
  On Tue, Nov 30, 2010 at 9:01 AM, stockiist...@shopgate.com
  wrote:
 
  how do you think is the deltaQuery better ?
 XD
  --
  View this message in context:
 
  http://lucene.472066.n3.nabble.com/Best-practice-for-Delta-every-2-Minutes-tp1992714p1992774.html
  Sent from the Solr - User mailing list archive
 at Nabble.com.
 
 



Re: Testing Solr

2010-12-16 Thread Dennis Gearon
There are websites with data sets out there. 'Data sets' may not be the right 
search terms, but it's something like that.

Exactly what you want, though, I couldn't guess.
Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Thu, 12/16/10, satya swaroop satya.yada...@gmail.com wrote:

 From: satya swaroop satya.yada...@gmail.com
 Subject: Testing Solr
 To: solr-user@lucene.apache.org
 Date: Thursday, December 16, 2010, 10:55 PM
 Hi All,
 
          I built solr
 successfully and i am thinking to test it  with nearly
 300 pdf files, 300 docs, 300 excel files,...and so on of
 each type with 300
 files nearly
  Is there any dummy data available to test for
 solr,Otherwise i need to
 download each and every file individually..??
 Another question is there any Benchmarks of solr...??
 
 Regards,
 satya