RE: Making my QParserPlugin the default one, with cores
Thanks, Ahmet. Yes, my solrconfig.xml file is very similar to what you wrote. When I use echoparams=all and defType=myqp, I get:

  <lst name="params">
    <str name="q">hi</str>
    <str name="echoparams">all</str>
    <str name="defType">myqp</str>
  </lst>

However, when I do not use the defType (hoping it will be automatically inserted from solrconfig), I get:

  <lst name="params">
    <str name="q">hi</str>
    <str name="echoparams">all</str>
  </lst>

Can you see what I am doing wrong?
Thanks,
Yuval

-----Original Message-----
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Tuesday, June 08, 2010 3:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Making my QParserPlugin the default one, with cores

It appears that the defType parameter is not being set by the request handler. What do you get when you append echoParams=all to your search url? So you have something like this entry in solrconfig.xml:

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="defType">myqp</str>
    </lst>
  </requestHandler>
Re: question about the fieldCollapseCache
The fieldCollapseCache should not be used as it is now; it uses too much memory. It stores all information relevant to a field-collapse search: document collapse counts, collapsed document ids / fields, the collapsed docset and the uncollapsed docset (everything per unique search). So the memory usage will grow for each unique query (and fast, with all this information). So it's best, I think, to disable this cache for now.

Martijn

On 8 June 2010 19:05, Jean-Sebastien Vachon <js.vac...@videotron.ca> wrote:

Hi All,

I've been running some tests using 6 shards, each one containing about 1 million documents. Each shard is running in its own virtual machine with 7 GB of RAM (5 GB allocated to the JVM). After about 1100 unique queries the shards start to struggle and run out of memory. I've reduced all other caches without significant impact. When I remove the fieldCollapseCache completely, the server can keep up for hours and use only 2 GB of RAM. (I'm even considering returning to a 32-bit JVM.)

The size of the fieldCollapseCache was set to 5000 items. How can 5000 items eat 3 GB of RAM? Can someone tell me what is put in this cache? Has anyone experienced this kind of problem?

I am running Solr 1.4.1 with patch 236. All requests are collapsing on a single field (pint) with collapse.maxdocs set to 200000.

Thanks for any hints...
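For readers wondering what "disable this cache" means in practice: with the SOLR-236 field-collapsing patch the cache is declared in solrconfig.xml much like Solr's other caches, so disabling it means removing or commenting out that declaration. A rough sketch follows; the exact element names and attributes vary between patch versions, so treat this as an assumption to check against your own patched config:

  <!-- field-collapsing section of solrconfig.xml (SOLR-236 patch);
       commenting out the cache declaration disables the cache -->
  <fieldCollapsing>
    <!--
    <fieldCollapseCache
        class="solr.FastLRUCache"
        size="5000"
        initialSize="512"
        autowarmCount="0"/>
    -->
  </fieldCollapsing>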
Re: Filtering near-duplicates using TextProfileSignature
Here's my config for the updateProcessor. It uses another signature method, but I've used TextProfileSignature as well and it works - sort of.

  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">sig</str>
      <bool name="overwriteDupes">true</bool>
      <str name="fields">content</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

Of course, you must define the updateProcessor in your requestHandler; it's commented out in mine at the moment.

  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
    <!--
    <lst name="defaults">
      <str name="update.processor">dedupe</str>
    </lst>
    -->
  </requestHandler>

Also, I see you define minTokenLen = 3. Where does that come from? I haven't seen anything on the wiki specifying such a parameter.

On Tuesday 08 June 2010 19:45:35 Neeb wrote:

Hey Andrew,

Just wondering if you ever managed to run TextProfileSignature-based deduplication. I would appreciate it if you could send me the code fragment for it from solrconfig. I currently have something like this, but I'm not sure if I am doing it right:

  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <bool name="overwriteDupes">true</bool>
      <str name="fields">title,author,abstract</str>
      <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
      <str name="minTokenLen">3</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

--
Thanks in advance,
-Ali

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Filtering near-duplicates using TextProfileSignature
Well, it got me too! KMail didn't properly order this thread. Can't seem to find Hatcher's reply anywhere. ??!!?

On Tuesday 08 June 2010 22:00:06 Andrew Clegg wrote:

Andrew Clegg wrote:

Re. your config, I don't see a minTokenLength in the wiki page for deduplication - is this a recent addition that's not documented yet?

Sorry about this -- stupid question -- I should have read back through the thread and refreshed my memory.

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: [Blacklight-development] facet data cleanup
On Jun 8, 2010, at 1:57 PM, Naomi Dushay wrote:

Missing Facet Values:
---------------------
To find how many documents are missing values: facet.missing=true&facet.mincount=<really big>

http://your.solr.baseurl/select?rows=0&facet.field=ffldname&facet.mincount=1000&facet.missing=true

To find the documents with missing values:

http://your.solr.baseurl/select?qt=standard&q=+uniquekey:[* TO *] -ffldname:[* TO *]

You could shorten that query to just q=-field_name:[* TO *] - Solr's lucene query parser supports top-level negative clauses. And I'm assuming every doc has a unique key, so you could use *:* instead of uniquekey:[* TO *] - but I doubt one is really better than the other.

Erik
Re: Making my QParserPlugin the default one, with cores
Yuval - my only hunch is that you're hitting a different request handler than the one where you configured the default defType. Send us the URL you're hitting Solr with, and the full request handler mapping. And are you sure the core you're hitting (since you mention multicore) is the one you think it is? Look at Solr's admin to see where the solr home directory is and ensure you're looking at the right solrconfig.xml.

Erik

On Jun 9, 2010, at 12:52 AM, Yuval Feinstein wrote:

Thanks, Ahmet. Yes, my solrconfig.xml file is very similar to what you wrote. When I use echoparams=all and defType=myqp, I get:

  <lst name="params">
    <str name="q">hi</str>
    <str name="echoparams">all</str>
    <str name="defType">myqp</str>
  </lst>

However, when I do not use the defType (hoping it will be automatically inserted from solrconfig), I get:

  <lst name="params">
    <str name="q">hi</str>
    <str name="echoparams">all</str>
  </lst>

Can you see what I am doing wrong?
Thanks,
Yuval

-----Original Message-----
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Tuesday, June 08, 2010 3:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Making my QParserPlugin the default one, with cores

It appears that the defType parameter is not being set by the request handler. What do you get when you append echoParams=all to your search url? So you have something like this entry in solrconfig.xml:

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="defType">myqp</str>
    </lst>
  </requestHandler>
Re: Filtering near-duplicates using TextProfileSignature
Markus Jelsma wrote:

Well, it got me too! KMail didn't properly order this thread. Can't seem to find Hatcher's reply anywhere. ??!!?

Whole thread here:
http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tt479039.html
Re: Index search optimization for fulltext remote streaming
We have the following Solr configuration:

  java -Xms512M -Xmx1024M -Dsolr.solr.home=<solr home directory> -jar start.jar

In solrconfig.xml:

  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>4</mergeFactor>
    <maxBufferedDocs>20</maxBufferedDocs>
    <ramBufferSizeMB>1024</ramBufferSizeMB>
    <maxFieldLength>1</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>1</commitLockTimeout>
    <lockType>native</lockType>
  </indexDefaults>

  <mainIndex>
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>1024</ramBufferSizeMB>
    <mergeFactor>4</mergeFactor>
    <!-- Deprecated -->
    <!--<maxBufferedDocs>10</maxBufferedDocs>-->
    <!--<maxMergeDocs>2147483647</maxMergeDocs>-->
    <unlockOnStartup>false</unlockOnStartup>
    <reopenReaders>true</reopenReaders>
    <deletionPolicy class="solr.SolrDeletionPolicy">
      <str name="maxCommitsToKeep">1</str>
      <str name="maxOptimizedCommitsToKeep">0</str>
    </deletionPolicy>
    <infoStream file="INFOSTREAM.txt">false</infoStream>
  </mainIndex>

Also, we have set autoCommit=false.

Our PC spec: Core2 Duo, 2 GB RAM. The Solr server runs on localhost, the index directory is also on the local filesystem, and the input full-text files arrive via remoteStreaming from another PC.

Here, when we indexed 10 full-text documents, the total time taken was 40 minutes. We want to reduce this time. We have been studying the UpdateRequestProcessorChain section:

  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
    <lst name="defaults">
      <str name="update.processor">dedupe</str>
    </lst>
  </requestHandler>

How do we use this UpdateRequestProcessorChain in /update/extract/ to run indexing in multiple chains (i.e., multiple threads)? Can you suggest whether we can optimize the process by changing any of these configurations?

with regards,
Danyal Mark
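A note for readers: an update processor chain is a per-request pipeline, not a thread pool, so it won't parallelize indexing by itself; concurrency normally comes from sending several extraction requests at once. Attaching a chain to the extracting handler, though, looks much like the /update example above. A sketch, assuming the standard ExtractingRequestHandler mapping (in Solr 1.4 the parameter is update.processor; later releases renamed it update.chain):

  <requestHandler name="/update/extract"
                  class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <!-- run extracted documents through the dedupe chain -->
      <str name="update.processor">dedupe</str>
    </lst>
  </requestHandler>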
how to get multicore to work?
Hi - I can't seem to get multicores to work. I have a Solr installation which does not have a solr.xml file - I assume this means it is not multicore. If I create a solr.xml, as described on http://wiki.apache.org/solr/CoreAdmin, my Solr installation fails - for example, I get 404 errors when trying to search, and solr/admin does not work.

Is there more to getting multicores to work than simply creating solr.xml?

Thanks, Peter
Re: Filtering near-duplicates using TextProfileSignature
Thanks guys. I will try this with some test documents, fingers crossed.

And by the way, I got the minTokenLen parameter from one of the thread replies (from Erik).

Cheerz,
Ali
Re: how to get multicore to work?
If you take a look in the examples directory, there is a directory called multicore. This is an example of the solr home of a multicore setup. Otherwise, take a look at the logged output of Solr itself; it should tell you what is wrong with the setup.

On 9 June 2010 11:08, xdzgor <p...@alphasolutions.dk> wrote:

Hi - I can't seem to get multicores to work. I have a Solr installation which does not have a solr.xml file - I assume this means it is not multicore. If I create a solr.xml, as described on http://wiki.apache.org/solr/CoreAdmin, my Solr installation fails - for example, I get 404 errors when trying to search, and solr/admin does not work. Is there more to getting multicores to work than simply creating solr.xml?

Thanks, Peter
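For reference, the example/multicore directory in the 1.4 distribution ships a solr.xml along these lines (core names and instanceDir values are placeholders for your own layout):

  <solr persistent="false">
    <cores adminPath="/admin/cores">
      <core name="core0" instanceDir="core0" />
      <core name="core1" instanceDir="core1" />
    </cores>
  </solr>

Each instanceDir then needs its own conf/solrconfig.xml and conf/schema.xml, and the per-core admin pages live at solr/core0/admin, solr/core1/admin, and so on.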
RE: Making my QParserPlugin the default one, with cores
Thanks, Ahmet. Yes, my solrconfig.xml file is very similar to what you wrote. When I use echoparams=all and defType=myqp, I get:

  <lst name="params">
    <str name="q">hi</str>
    <str name="echoparams">all</str>
    <str name="defType">myqp</str>
  </lst>

However, when I do not use the defType (hoping it will be automatically inserted from solrconfig), I get:

  <lst name="params">
    <str name="q">hi</str>
    <str name="echoparams">all</str>
  </lst>

In echoParams=all the P should be capitalized. Just use echoParams=all and don't include defType explicitly. echoParams=all will display the default parameters that you specify in solrconfig.xml. You can debug this way:

http://wiki.apache.org/solr/CoreQueryParameters#echoParams

If you don't see <str name="defType">myqp</str> listed under <lst name="params">, then it is not written in solrconfig.xml. Maybe you forgot to restart the core after editing solrconfig.xml?
Copyfield multi valued to single value
Hello,

Is there a way to copy a multivalued field to a single-valued one, by taking, for example, the first entry of the multivalued field?

I am actually trying to sort my index by title, and my index contains Tika-extracted titles which come in as multivalued - hence why my title field is multivalued. However, when I do a sort on the title field, it crashes, because it cannot compare two arrays, I guess, which is logical. So my thought was to copy only one value from the array to another field. Maybe there is another way to do that?

Can anyone help me? Thanks in advance!

Marc
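A copyField can't do this (it copies every value), but one avenue - offered here as a sketch, not a confirmed recipe - is a small custom update processor that grabs the first value at index time. The class and field names below (FirstValueUpdateProcessorFactory, title, title_sort) are made up for illustration:

import java.io.IOException;
import java.util.Collection;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Hypothetical processor: copies the first "title" value into a
// single-valued "title_sort" field before the document is indexed.
public class FirstValueUpdateProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Collection<Object> values = doc.getFieldValues("title");
        if (values != null && !values.isEmpty()) {
          // keep only the first extracted title for sorting
          doc.setField("title_sort", values.iterator().next());
        }
        super.processAdd(cmd); // continue down the chain
      }
    };
  }
}

Wired into an updateRequestProcessorChain like the dedupe examples earlier in this digest, plus a single-valued title_sort field in schema.xml, sorting would then target title_sort instead of title.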
requesthandler, variable ...
Hello. I want to call the TermsComponent with this request:

http://host/solr/app/select/?q=har

and I want the same result as when I use this request:

http://host/solr/app/terms/?q=har&terms.prefix=har

  <lst name="terms">
    <lst name="suggest">
      <int name="hardcore">9</int>
      <int name="hardcore evo">9</int>
      <int name="hardcore evo 2010">9</int>
      ...

This is my solrconfig.xml requestHandler setup:

  <searchComponent name="termsComponent" class="org.apache.solr.handler.component.TermsComponent"/>

  <requestHandler name="standard" class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
      <str name="qt">terms</str>
    </lst>
    <arr name="components">
      <str>termsComponent</str>
    </arr>
  </requestHandler>

  <!-- qt=terms -->
  <requestHandler name="terms" class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
      <bool name="terms">true</bool>
      <str name="terms.fl">suggest</str>
      <str name="terms.sort">index</str>
      <str name="terms.prefix"><str name="q"/></str>
    </lst>
    <arr name="components">
      <str>termsComponent</str>
    </arr>
  </requestHandler>

Is this possible: <str name="terms.prefix"><str name="q"/></str>? Or how can I put the q value in the place of terms.prefix?
RE: Making my QParserPlugin the default one, with cores
Thanks again, Ahmet and Erik. It turns out that this was calling the correct query parser all along. The real problem was a combination of the query cache and my hacking the query to enable BM25 scoring. When I use a standard BooleanQuery, this behaves as published. Now I have to understand how to tweak my Lucene query data structure so that the query caching works like it does for the standard Lucene queries.

Cheers,
Yuval

-----Original Message-----
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Wednesday, June 09, 2010 1:36 PM
To: solr-user@lucene.apache.org
Subject: RE: Making my QParserPlugin the default one, with cores

Thanks, Ahmet. Yes, my solrconfig.xml file is very similar to what you wrote. When I use echoparams=all and defType=myqp, I get:

  <lst name="params">
    <str name="q">hi</str>
    <str name="echoparams">all</str>
    <str name="defType">myqp</str>
  </lst>

However, when I do not use the defType (hoping it will be automatically inserted from solrconfig), I get:

  <lst name="params">
    <str name="q">hi</str>
    <str name="echoparams">all</str>
  </lst>

In echoParams=all the P should be capitalized. Just use echoParams=all and don't include defType explicitly. echoParams=all will display the default parameters that you specify in solrconfig.xml. You can debug this way:

http://wiki.apache.org/solr/CoreQueryParameters#echoParams

If you don't see <str name="defType">myqp</str> listed under <lst name="params">, then it is not written in solrconfig.xml. Maybe you forgot to restart the core after editing solrconfig.xml?
how to test solr's performance?
Are there any built-in tools for performance testing? Thanks.
AW: how to get multicore to work?
- solr.xml has to reside in the solr home dir. You can set this up with the java option -Dsolr.solr.home=
- admin is per core, so solr/CORENAME/admin will work

It is quite simple to set up.

-----Original Message-----
From: xdzgor [mailto:p...@alphasolutions.dk]
Sent: Wednesday, June 9, 2010 12:08 PM
To: solr-user@lucene.apache.org
Subject: how to get multicore to work?

Hi - I can't seem to get multicores to work. I have a Solr installation which does not have a solr.xml file - I assume this means it is not multicore. If I create a solr.xml, as described on http://wiki.apache.org/solr/CoreAdmin, my Solr installation fails - for example, I get 404 errors when trying to search, and solr/admin does not work. Is there more to getting multicores to work than simply creating solr.xml?

Thanks, Peter
Re: question about the fieldCollapseCache
ok great. I believe this should be mentioned in the wiki.

Later

On 2010-06-09, at 4:06 AM, Martijn v Groningen wrote:

The fieldCollapseCache should not be used as it is now; it uses too much memory. It stores all information relevant to a field-collapse search: document collapse counts, collapsed document ids / fields, the collapsed docset and the uncollapsed docset (everything per unique search). So the memory usage will grow for each unique query (and fast, with all this information). So it's best, I think, to disable this cache for now.

Martijn

On 8 June 2010 19:05, Jean-Sebastien Vachon <js.vac...@videotron.ca> wrote:

Hi All,

I've been running some tests using 6 shards, each one containing about 1 million documents. Each shard is running in its own virtual machine with 7 GB of RAM (5 GB allocated to the JVM). After about 1100 unique queries the shards start to struggle and run out of memory. I've reduced all other caches without significant impact. When I remove the fieldCollapseCache completely, the server can keep up for hours and use only 2 GB of RAM. (I'm even considering returning to a 32-bit JVM.)

The size of the fieldCollapseCache was set to 5000 items. How can 5000 items eat 3 GB of RAM? Can someone tell me what is put in this cache? Has anyone experienced this kind of problem?

I am running Solr 1.4.1 with patch 236. All requests are collapsing on a single field (pint) with collapse.maxdocs set to 200000.

Thanks for any hints...
Re: question about the fieldCollapseCache
I agree. I'll add this information to the wiki.

On 9 June 2010 14:32, Jean-Sebastien Vachon <js.vac...@videotron.ca> wrote:

ok great. I believe this should be mentioned in the wiki.

Later

On 2010-06-09, at 4:06 AM, Martijn v Groningen wrote:

The fieldCollapseCache should not be used as it is now; it uses too much memory. It stores all information relevant to a field-collapse search: document collapse counts, collapsed document ids / fields, the collapsed docset and the uncollapsed docset (everything per unique search). So the memory usage will grow for each unique query (and fast, with all this information). So it's best, I think, to disable this cache for now.

Martijn

On 8 June 2010 19:05, Jean-Sebastien Vachon <js.vac...@videotron.ca> wrote:

Hi All,

I've been running some tests using 6 shards, each one containing about 1 million documents. Each shard is running in its own virtual machine with 7 GB of RAM (5 GB allocated to the JVM). After about 1100 unique queries the shards start to struggle and run out of memory. I've reduced all other caches without significant impact. When I remove the fieldCollapseCache completely, the server can keep up for hours and use only 2 GB of RAM. (I'm even considering returning to a 32-bit JVM.)

The size of the fieldCollapseCache was set to 5000 items. How can 5000 items eat 3 GB of RAM? Can someone tell me what is put in this cache? Has anyone experienced this kind of problem?

I am running Solr 1.4.1 with patch 236. All requests are collapsing on a single field (pint) with collapse.maxdocs set to 200000.

Thanks for any hints...

--
Met vriendelijke groet,

Martijn van Groningen
Solr spellcheck config
Hi everyone,

I am trying to build the spellcheck index with IndexBasedSpellChecker:

  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">text</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>

And I want to specify the dynamic field *_text as the field option:

  <dynamicField name="*_text" stored="false" type="text" multiValued="true" indexed="true"/>

How can this be done?

Thanks,
Bogdan

--
Bogdan Gusiev.
agre...@gmail.com
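No answer appears in this digest, but since the spellchecker reads from a single source field, one common workaround - sketched here as an assumption, not a confirmed fix - is to funnel all *_text fields into one aggregate field with a wildcard copyField and point the spellchecker at that. The field name "spell" and type "textSpell" are placeholders:

  <!-- schema.xml: collect every *_text field into one spell source -->
  <field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>
  <copyField source="*_text" dest="spell"/>

  <!-- solrconfig.xml: point the spellchecker at the aggregate field -->
  <str name="field">spell</str>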
Issue with response header in SOLR running on Linux instance
Hi,

I have been using SOLR for some time now and had no issues while I was using it on Windows. Yesterday I moved the SOLR code to Linux servers and started to index the data. Indexing completed successfully on the Linux servers, but when I queried the index, the response header returned by the SOLR instance running on the Linux server is different from the response header returned by the SOLR instance running on Windows.

Response header returned by the SOLR instance running on the Windows machine:

  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2219</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">credit</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>

Response header returned by the SOLR instance running on the Linux machine:

  <response>
    <responseHeader>
      <status>0</status>
      <QTime>26</QTime>
      <lst name="params">
        <str name="q">credit</str>
      </lst>
    </responseHeader>

Any idea why this happens?

Thanks,
Barani
Solr Core Unload
Referring to http://lucene.472066.n3.nabble.com/unloading-a-solr-core-doesn-t-free-any-memory-td501246.html#a501246 - do we have any solution to free up memory after a Solr core unload?

Ankit
custom scorer in Solr
Hi all,

We are currently working on a proof-of-concept for a client using Solr and have been able to configure all the features they want except the scoring. The problem is that they want scores that make results fall into buckets:

* Bucket 1: exact match on category (score = 4)
* Bucket 2: exact match on name (score = 3)
* Bucket 3: partial match on category (score = 2)
* Bucket 4: partial match on name (score = 1)

The first thing we did was develop a custom similarity class that would return the correct score depending on the field and an exact or partial match. The only problem now is that when a document matches on both the category and the name, the scores are added together. Example: searching for "restaurant" returns documents in the category restaurant that also have the word restaurant in their name, and those get a score of 5 (4+1), but they should only get 4.

I assume for this to work we would need to develop a custom Scorer class, but we have no clue how to incorporate this in Solr. Maybe there is even a simpler solution that we don't know about. All suggestions welcome!

Thanks,
Tom
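One avenue worth checking before writing a custom Scorer - noted here as a suggestion, not a confirmed fit for this exact bucket scheme: the dismax query parser wraps the per-field queries in a DisjunctionMaxQuery, and with tie=0.0 a document's score is the maximum of its field scores rather than their sum. It won't reproduce the exact 4/3/2/1 values, but it does stop category and name matches from adding up. A sketch (handler name is arbitrary):

  <requestHandler name="bucketed" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <!-- weight category matches above name matches -->
      <str name="qf">category^4 name^1</str>
      <!-- tie=0.0: take the best field score, do not sum them -->
      <float name="tie">0.0</float>
    </lst>
  </requestHandler>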
Re: Issue with response header in SOLR running on Linux instance
Hi,

Check your requestHandler. It may preset some values that you don't see. Your echoParams setting may be explicit instead of all [1]. Alternatively, you could add the echoParams parameter to your query if it isn't set as an invariant in your requestHandler.

[1]: http://wiki.apache.org/solr/CoreQueryParameters

Cheers,

On Wednesday 09 June 2010 15:25:09 bbarani wrote:

Hi,

I have been using SOLR for some time now and had no issues while I was using it on Windows. Yesterday I moved the SOLR code to Linux servers and started to index the data. Indexing completed successfully on the Linux servers, but when I queried the index, the response header returned by the SOLR instance running on the Linux server is different from the response header returned by the SOLR instance running on Windows.

Response header returned by the SOLR instance running on the Windows machine:

  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2219</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">credit</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>

Response header returned by the SOLR instance running on the Linux machine:

  <response>
    <responseHeader>
      <status>0</status>
      <QTime>26</QTime>
      <lst name="params">
        <str name="q">credit</str>
      </lst>
    </responseHeader>

Any idea why this happens?

Thanks,
Barani

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Anyone using Solr spatial from trunk?
... but decided not to use it anyway?

That's pretty much correct. The huge commercial scale of the project dictates that we need as much system stability as possible from the outset; thus the tools we use must be established, community-tested and trusted versions. We also noticed that some of the regular non-geospatial queries seemed to run slightly slower than on 1.4, with only a fraction of the total number of records we'd be searching in production (but that wasn't the main reason for our decision). I would perhaps use it for a much smaller [private] project where speed, scaling and reliability weren't such critical issues.

Future-proofing was also a consideration: "With all the changes currently occurring with Solr, I would go so far as to say that users should continue to use Solr 1.4. However, if you need access to one of the many new features introduced in Solr 1.5+ or Lucene 3.x, then give Solr 3.1 a shot, and report back your experiences." (from http://blog.jteam.nl/2010/04/14/state-of-solr/)

On 8 June 2010 21:09, Darren Govoni <dar...@ontrenet.com> wrote:

So let me understand what you said. You went through the trouble to implement a geospatial solution using Solr 1.5, and it worked really well. You saw no signs of instability, but decided not to use it anyway? Did you put it through a routine of tests and witness some stability problem? Or are you just guessing it had them? I'm just curious about the reasoning behind your comment.

On Tue, 2010-06-08 at 09:05 +0100, Rob Ganly wrote:

I used the 1.5 build a few weeks ago, implemented the geospatial functionality, and it worked really well. However, due to the unknown quantity in terms of stability (and the uncertain future of 1.5) etc., we decided not to use it in production.

rob ganly

On 8 June 2010 03:50, Darren Govoni <dar...@ontrenet.com> wrote:

I've been experimenting with it, but haven't quite gotten it to work as yet.

On Mon, 2010-06-07 at 17:47 -0700, Jay Hill wrote:

I was wondering about the production readiness of the new-in-trunk spatial functionality. Is anyone using this in a production environment?

-Jay
Re: AW: XSLT for JSON
Help me please =(
How Solr Manages Connected Database Updates
Hey All,

I am new to the Solr area; I just started exploring it and have done the basic stuff. Now I am stuck on a question of logic: how does Solr manage connected database updates?

Scenario: I wrote an indexing program which runs on Tomcat. When run, it reads data from a connected MySQL database and then performs indexing.

Use case: the database is not fixed. It's the database for a web application, where users keep inserting data, so the database has frequent updates - almost every minute.

How should Solr automatically grab those changes and update the index? Do I need to write a cron job kind of thing? Or use the Data Import Handler? (There could be several ways.)

Can anyone provide comments or share their experience if they have gone through a similar situation?

Thanks,
-Sumit
Diagnosing solr timeout
Hi all,

In my app, it seems like Solr has become slower over time. The index has grown a bit, and there are probably a few more people using the site, but the changes are not drastic. I notice that when a Solr search is made, CPU and RAM usage spike precipitously. I notice in the Solr log a bunch of entries in the same second that end in:

  status=0 QTime=212
  status=0 QTime=96
  status=0 QTime=44
  status=0 QTime=276
  status=0 QTime=8552
  status=0 QTime=16
  status=0 QTime=20
  status=0 QTime=56

and then:

  status=0 QTime=315919
  status=0 QTime=325071

My questions: How do I figure out what to fix? Do I need to start Java with more memory? How do I tell what the correct amount of memory is? Is there something particularly inefficient about something else in my configuration, or about the way I'm formulating the Solr request, and how would I narrow down what it could be? I can't tell, but it seems like it happens after Solr has been running unattended for a little while. Should I have a cron job that restarts Solr every day? Could the Solr process be starved by something else on the server (although the only other thing that is particularly running is an apache/passenger/rails app)? In other words, I'm at a total loss about how to fix this.

Thanks!

P.S. In case this helps, here's the exact log entry for the first item that failed:

Jun 9, 2010 1:02:52 PM org.apache.solr.core.SolrCore execute
INFO: [resources] webapp=/solr path=/select params={hl.fragsize=600&facet.missing=true&facet=false&facet.mincount=1&ids=http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.04.xml;chunk.id%3Ddiv.ww.shelleyworks.v4.44,http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.06.xml;chunk.id%3Ddiv.ww.shelleyworks.v6.67,http://pm.nlx.com/xtf/view?docId%3Dtennyson_c/tennyson_c.02.xml;chunk.id%3Ddiv.tennyson.v2.1115,http://pm.nlx.com/xtf/view?docId%3Dmarx/marx.39.xml;chunk.id%3Ddiv.marx.engels.39.325,http://pm.nlx.com/xtf/view?docId%3Dshelley_j/shelley_j.01.xml;chunk.id%3Ddiv.ww.shelley.journals.v1.80,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.116,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.115,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.75,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.76,http://pm.nlx.com/xtf/view?docId%3Demerson/emerson.05.xml;chunk.id%3Dralph.waldo.v5.d083,http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.04.xml;chunk.id%3Ddiv.ww.shelleyworks.v4.31,http://pm.nlx.com/xtf/view?docId%3Dshelley_j/shelley_j.01.xml;chunk.id%3Ddiv.ww.shelley.journals.v1.88,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.03.xml;chunk.id%3Ddiv.eliot.romola.48&facet.limit=-1&hl.fl=text&hl.maxAnalyzedChars=512000&wt=javabin&hl=true&rows=30&version=1&fl=uri,archive,date_label,genre,source,image,thumbnail,title,alternative,url,role_ART,role_AUT,role_EDT,role_PBL,role_TRL,role_EGR,role_ETR,role_CRE,freeculture,is_ocr,federation,has_full_text,source_xml,uri&start=0&q=(*:*+AND+(life)+AND+(death)+AND+(of)+AND+(jason)+AND+federation:NINES)+OR+(*:*+AND+(life)+AND+(death)+AND+(of)+AND+(jason)+AND+federation:NINES+-genre:Citation)^5&facet.field=genre&facet.field=archive&facet.field=freeculture&facet.field=has_full_text&facet.field=federation&isShard=true&fq=year:1882} status=0 QTime=315919
Dataimport in debug mode store a last index date
Hi,

When using the Data Import Handler and clicking on 'Debug now', it stores the current date as 'last_index_time' in the dataimport.properties file. Is this the right behaviour, given that debug doesn't do a commit?

Thanks,
marc
Re: Diagnosing solr timeout
Have you looked at the garbage collector statistics? I've experienced this kind of issue in the past, and I was getting huge spikes when the GC was doing its job.

On 2010-06-09, at 10:52 AM, Paul wrote:

Hi all,

In my app, it seems like Solr has become slower over time. The index has grown a bit, and there are probably a few more people using the site, but the changes are not drastic. I notice that when a Solr search is made, CPU and RAM usage spike precipitously. I notice in the Solr log a bunch of entries in the same second that end in:

  status=0 QTime=212
  status=0 QTime=96
  status=0 QTime=44
  status=0 QTime=276
  status=0 QTime=8552
  status=0 QTime=16
  status=0 QTime=20
  status=0 QTime=56

and then:

  status=0 QTime=315919
  status=0 QTime=325071

My questions: How do I figure out what to fix? Do I need to start Java with more memory? How do I tell what the correct amount of memory is? Is there something particularly inefficient about something else in my configuration, or about the way I'm formulating the Solr request, and how would I narrow down what it could be? I can't tell, but it seems like it happens after Solr has been running unattended for a little while. Should I have a cron job that restarts Solr every day? Could the Solr process be starved by something else on the server (although the only other thing that is particularly running is an apache/passenger/rails app)? In other words, I'm at a total loss about how to fix this.

Thanks!

P.S. In case this helps, here's the exact log entry for the first item that failed:

Jun 9, 2010 1:02:52 PM org.apache.solr.core.SolrCore execute
INFO: [resources] webapp=/solr path=/select params={hl.fragsize=600&facet.missing=true&facet=false&facet.mincount=1&ids=http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.04.xml;chunk.id%3Ddiv.ww.shelleyworks.v4.44,http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.06.xml;chunk.id%3Ddiv.ww.shelleyworks.v6.67,http://pm.nlx.com/xtf/view?docId%3Dtennyson_c/tennyson_c.02.xml;chunk.id%3Ddiv.tennyson.v2.1115,http://pm.nlx.com/xtf/view?docId%3Dmarx/marx.39.xml;chunk.id%3Ddiv.marx.engels.39.325,http://pm.nlx.com/xtf/view?docId%3Dshelley_j/shelley_j.01.xml;chunk.id%3Ddiv.ww.shelley.journals.v1.80,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.116,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.115,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.75,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.76,http://pm.nlx.com/xtf/view?docId%3Demerson/emerson.05.xml;chunk.id%3Dralph.waldo.v5.d083,http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.04.xml;chunk.id%3Ddiv.ww.shelleyworks.v4.31,http://pm.nlx.com/xtf/view?docId%3Dshelley_j/shelley_j.01.xml;chunk.id%3Ddiv.ww.shelley.journals.v1.88,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.03.xml;chunk.id%3Ddiv.eliot.romola.48&facet.limit=-1&hl.fl=text&hl.maxAnalyzedChars=512000&wt=javabin&hl=true&rows=30&version=1&fl=uri,archive,date_label,genre,source,image,thumbnail,title,alternative,url,role_ART,role_AUT,role_EDT,role_PBL,role_TRL,role_EGR,role_ETR,role_CRE,freeculture,is_ocr,federation,has_full_text,source_xml,uri&start=0&q=(*:*+AND+(life)+AND+(death)+AND+(of)+AND+(jason)+AND+federation:NINES)+OR+(*:*+AND+(life)+AND+(death)+AND+(of)+AND+(jason)+AND+federation:NINES+-genre:Citation)^5&facet.field=genre&facet.field=archive&facet.field=freeculture&facet.field=has_full_text&facet.field=federation&isShard=true&fq=year:1882} status=0 QTime=315919
Re: Diagnosing solr timeout
Have you looked at the garbage collector statistics? I've experienced this kind of issue in the past, and I was getting huge spikes when the GC was doing its job.

I haven't, and I'm not sure what a good way to monitor this is. The problem occurs maybe once a week on a server. Should I run jstat the whole time and redirect the output to a log file? Is there another way to get that info?

Also, I was suspecting GC myself. So, if it is the problem, what do I do about it? It seems like increasing RAM might make the problem worse, because it would wait longer to GC, and then it would have more to do.
TrieRange for storage of dates
What is the best practice for storing dates with TrieRange? Perhaps we can amend the article at http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/ to include the recommendation (i.e., dates are commonly unique). I'm assuming using a long is the best choice.
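For context, Solr 1.4 already wraps the trie machinery in dedicated field types, so a raw long is not the only option. A sketch of the usual schema.xml declarations, mirroring the 1.4 example schema (precisionStep is the knob that trades index size for range-query speed; the field name "created" is a placeholder):

  <!-- smaller precisionStep = more terms indexed, faster range queries -->
  <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6"
             omitNorms="true" positionIncrementGap="0"/>
  <field name="created" type="tdate" indexed="true" stored="true"/>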
Re: Tomcat startup script
On Tue, Jun 8, 2010 at 4:18 PM, <cbenn...@job.com> wrote:

The following should work on centos/redhat; don't forget to edit the paths, user, and java options for your environment. You can use chkconfig to add it to your startup.

Thanks, Colin.

Sixten
Some questions about ability of solr.
I am keeping some data in JSON format in an HBase table. I would like to index this data with Solr. Are there any examples of indexing an HBase table?

Every node in HBase has an attribute that saves the date when it was written into the table. Is there any option to search not only by text, but also by date, for the period of time when it was written into HBase?
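On the second question: assuming the write timestamp is indexed in the Solr schema as a date field (write_date is a name made up here), a standard range filter covers the period-of-time case:

  fq=write_date:[2010-01-01T00:00:00Z TO 2010-06-01T00:00:00Z]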
Re: general debugging techniques?
On Fri, Jun 4, 2010 at 3:14 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

: That is still really small for 5MB documents. I think the default solr
: document cache is 512 items, so you would need at least 3 GB of memory
: if you didn't change that and the cache filled up.

that assumes that the extracted text tika extracts from each document is the same size as the original raw files *and* that he's configured that content field to be stored ... in practice if you only stored=true the

Most times the extracted text is much smaller, though there are occasional zip files that may expand in size (and, in an unrelated note, multi-file zip archives cause Tika 0.7 to hang currently).

fast, 128MB is really, really, really small for a typical Solr instance.

In any case, I bumped up the heap to 3G as suggested, which has helped stability. I have found that in practice I need to commit after every extraction, because a crash or error will wipe out all extractions after the last commit.

if you are only seeing one log line per request, then you are just looking at the request log ... there should be more logs with messages from all over the code base with various levels of severity -- and using standard java log level controls you can turn these up/down for various components.

Unfortunately, I'm not very familiar with Java deploys, so I don't know where the standard controls are yet. As a concrete example, I do see INFO-level logs, but haven't found a way to move up to DEBUG level in either Solr or Tomcat. I was hopeful debug statements would point to where extraction/indexing hangs were occurring. I will keep poking around; thanks for the tips.

Jim
Re: Diagnosing solr timeout
I use the following article as a reference when dealing with GC-related issues: http://www.petefreitag.com/articles/gctuning/

I suggest you activate the verbose option and send the GC stats to a file. I don't remember exactly what the option was, but you should find the information easily.

Good luck

On 2010-06-09, at 11:35 AM, Paul wrote:

Have you looked at the garbage collector statistics? I've experienced this kind of issue in the past, and I was getting huge spikes when the GC was doing its job.

I haven't, and I'm not sure what a good way to monitor this is. The problem occurs maybe once a week on a server. Should I run jstat the whole time and redirect the output to a log file? Is there another way to get that info?

Also, I was suspecting GC myself. So, if it is the problem, what do I do about it? It seems like increasing RAM might make the problem worse, because it would wait longer to GC, and then it would have more to do.
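For reference, the option alluded to is the standard HotSpot GC-logging set of that era. A sketch of a startup line (heap sizes and log path are placeholders):

  java -Xms512M -Xmx1024M \
       -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
       -Xloggc:gc.log \
       -jar start.jar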
Re: AW: how to get multicore to work?
Thanks for the comments. I still can't get this multicore thing to work!

Here is my directory structure:

  d:
    apachesolr
      lucidworks
        lucidworks
          solr
            bin
            conf
            lib
        tomcat

There is no solr.xml, and solr.solr.home points to d:\apachesolr\lucidworks\lucidworks\solr. As it stands, Solr works fine, and pages like http://localhost:8983/solr/admin also work.

As soon as I put a solr.xml in the solr directory and restart the Tomcat service, it all stops working:

  <solr persistent="false">
    <cores adminPath="/admin/cores">
      <core name="core0" instanceDir="." />
    </cores>
  </solr>

Any idea where I can look? Where is the Solr startup log written?

Thanks, Peter
Re: general debugging techniques?
https://issues.apache.org/jira/browse/LUCENE-2387

There is a memory leak that causes the last PDF binary image to stick around while working on the next binary image. When you commit after every extraction, you clear up this memory leak. This is fixed in trunk and should make it into a 'bug fix' Solr 1.4.1, if such a thing happens.

Lance

On Wed, Jun 9, 2010 at 10:13 AM, Jim Blomo <jim.bl...@pbworks.com> wrote:

Most times the extracted text is much smaller, though there are occasional zip files that may expand in size (and, in an unrelated note, multi-file zip archives cause Tika 0.7 to hang currently).

In any case, I bumped up the heap to 3G as suggested, which has helped stability. I have found that in practice I need to commit after every extraction, because a crash or error will wipe out all extractions after the last commit.

Unfortunately, I'm not very familiar with Java deploys, so I don't know where the standard controls are yet. As a concrete example, I do see INFO-level logs, but haven't found a way to move up to DEBUG level in either Solr or Tomcat. I was hopeful debug statements would point to where extraction/indexing hangs were occurring. I will keep poking around; thanks for the tips.

Jim

--
Lance Norskog
goks...@gmail.com
Re: How Solr Manages Connected Database Updates
The DataImportHandler has a tool for fetching recent updates in the database and indexing only those new/changed records. It has no scheduler; you would set up the DIH configuration and then write a cron job to run it at regular intervals.

Lance

On Wed, Jun 9, 2010 at 7:51 AM, Sumit Arora <sumit1...@gmail.com> wrote:

Hey All,

I am new to the Solr area; I just started exploring it and have done the basic stuff. Now I am stuck on a question of logic: how does Solr manage connected database updates?

Scenario: I wrote an indexing program which runs on Tomcat. When run, it reads data from a connected MySQL database and then performs indexing.

Use case: the database is not fixed. It's the database for a web application, where users keep inserting data, so the database has frequent updates - almost every minute.

How should Solr automatically grab those changes and update the index? Do I need to write a cron job kind of thing? Or use the Data Import Handler? (There could be several ways.)

Can anyone provide comments or share their experience if they have gone through a similar situation?

Thanks,
-Sumit

--
Lance Norskog
goks...@gmail.com
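To make that concrete - a sketch assuming DIH is mapped at /dataimport and a delta query is already defined in data-config.xml; the five-minute interval is just an example:

  # crontab entry: trigger a DIH delta-import every 5 minutes
  */5 * * * * curl -s 'http://localhost:8983/solr/dataimport?command=delta-import' > /dev/null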
Master master?
Does Solr handle having two masters that are also slaves to each other (i.e., in a cycle)?

Regards,
Glen
Re: Index-time vs. search-time boosting performance
Is it necessary that a document 1 year old be more relevant than one that's 1 year and 1 hour old? In other words, can the boosting be logarithmic with respect to time instead of linear?

A schema design tip: you can store a separate date field which is rounded down to the hour. This will make for a much smaller term dictionary and therefore faster range queries.

On Mon, Jun 7, 2010 at 4:08 AM, Asif Rahman <a...@newscred.com> wrote:

I still need a relatively precise boost - no less precise than hourly. I think that would make for a pretty messy field query.

On Mon, Jun 7, 2010 at 2:15 AM, Lance Norskog <goks...@gmail.com> wrote:

If you are unhappy with the performance overhead of a function boost, you can push it into a field query by boosting date ranges. You would group in date ranges: documents in September would be boosted 1.0, October 2.0, November 3.0, etc.

On 6/5/10, Asif Rahman <a...@newscred.com> wrote:

Thanks everyone for your help so far. I'm still trying to get to the bottom of whether switching over to index-time boosts will give me a performance improvement, and if so, whether it will be noticeable. This is all under the assumption that I can achieve the scoring functionality that I need with either index-time or search-time boosting (given the loss of precision). I can always dust off the old profiler to see what's going on with the search-time boosts, but testing the index-time boosts will require a full reindex, which could take days with our dataset.

On Sat, Jun 5, 2010 at 9:17 AM, Robert Muir <rcm...@gmail.com> wrote:

On Fri, Jun 4, 2010 at 7:50 PM, Asif Rahman <a...@newscred.com> wrote:

Perhaps I should have been more specific in my initial post. I'm doing date-based boosting on the documents in my index, so as to assign a higher score to more recent documents. Currently I'm using a boost function to achieve this. I'm wondering if there would be a performance improvement if, instead of using the boost function at search time, I indexed the documents with a date-based boost.

Asif, without knowing more details, before you look at performance you might want to consider the relevance impacts of switching to index-time boosting for your use case too. You can read more about the differences here: http://lucene.apache.org/java/3_0_1/scoring.html

But I think the most important point for this date-influenced use case is: indexing-time boosts are preprocessed for storage efficiency and written to the directory (when writing the document) in a single byte (!)

If you do this as an index-time boost, your boosts will lose lots of precision for this reason.

--
Robert Muir
rcm...@gmail.com

--
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com
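To illustrate both suggestions: Solr's date math can do the hour rounding at query time, and the wiki's classic freshness boost uses a reciprocal decay rather than a linear one. A sketch, where pub_date_hour is a placeholder for the hour-rounded field (it must be a TrieDateField for ms() to work in 1.4, and bf is a dismax parameter):

  # filter on the hour-rounded field using date math
  fq=pub_date_hour:[NOW/HOUR-30DAYS TO NOW/HOUR]

  # reciprocal date boost from the Solr wiki
  bf=recip(ms(NOW/HOUR,pub_date_hour),3.16e-11,1,1)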
Re: Need help with document format
This is what Field Collapsing does. It is a complex feature and is not in the Solr trunk yet.

On Tue, Jun 8, 2010 at 9:15 AM, Moazzam Khan <moazz...@gmail.com> wrote:

How would I do a facet search if I did this and not get duplicates?

Thanks,
Moazzam

On Mon, Jun 7, 2010 at 10:07 AM, Israel Ekpo <israele...@gmail.com> wrote:

I think you need a 1:1 mapping between the consultant and the company; otherwise, how are you going to run your queries for, let's say, consultants that worked for Google or AOL between March 1999 and August 2004? If the mapping is 1:1, your life will be easier and you will not need to do extra parsing of the results you retrieve. Unfortunately, it looks like you are going to have a lot of records. With an RDBMS it is easier to do joins, but with Lucene and Solr you have to denormalize all the relationships. Hence, in this particular scenario, if you have 5 consultants that worked for 4 distinct companies, you will have to send 20 documents to Solr.

On Mon, Jun 7, 2010 at 10:15 AM, Moazzam Khan <moazz...@gmail.com> wrote:

Thanks for the replies guys. I am currently storing consultants like this:

  <doc>
    <id>123</id>
    <FirstName>tony</FirstName>
    <LastName>marjo</LastName>
    <Company>Google</Company>
    <Company>AOL</Company>
  </doc>

I have a few multi-valued fields, so if I do it the way Israel suggested, I will have tons of records. Do you think it would be better if I did this instead?

  <doc>
    <id>123</id>
    <FirstName>tony</FirstName>
    <LastName>marjo</LastName>
    <Company>Google_StartDate_EndDate</Company>
    <Company>AOL_StartDate_EndDate</Company>
  </doc>

Or is what you guys said better? Thanks for all the help.

Moazzam

On Mon, Jun 7, 2010 at 1:10 AM, Lance Norskog <goks...@gmail.com> wrote:

And for 'present', you would pick some time far in the future: 2100-01-01T00:00:00Z

On 6/5/10, Israel Ekpo <israele...@gmail.com> wrote:

You need to make each document added to the index a 1:1 mapping for each company and consultant combo:

  <schema>
    <fields>
      <!-- Concatenation of company and consultant id -->
      <field name="consultant_id_company_id" type="string" indexed="true" stored="true" required="true"/>
      <field name="consultant_firstname" type="string" indexed="true" stored="true" multiValued="false"/>
      <field name="consultant_lastname" type="string" indexed="true" stored="true" multiValued="false"/>
      <!-- The name of the company the consultant worked for -->
      <field name="company" type="text" indexed="true" stored="true" multiValued="false"/>
      <field name="start_date" type="tdate" indexed="true" stored="true" multiValued="false"/>
      <field name="end_date" type="tdate" indexed="true" stored="true" multiValued="false"/>
    </fields>
    <defaultSearchField>text</defaultSearchField>
    <copyField source="consultant_firstname" dest="text"/>
    <copyField source="consultant_lastname" dest="text"/>
    <copyField source="company" dest="text"/>
  </schema>

So, for instance, you have 2 consultants, Michael Davis and Tom Anderson, who worked for AOL, Microsoft, Yahoo, Google and Facebook.
  Michael Davis = 1
  Tom Anderson = 2
  AOL = 1
  Microsoft = 2
  Yahoo = 3
  Google = 4
  Facebook = 5

This is how you would add the documents to the index:

  <doc>
    <consultant_id_company_id>1_1</consultant_id_company_id>
    <consultant_firstname>Michael</consultant_firstname>
    <consultant_lastname>Davis</consultant_lastname>
    <company>AOL</company>
    <start_date>2006-02-13T15:26:37Z</start_date>
    <end_date>2008-02-13T15:26:37Z</end_date>
  </doc>

  <doc>
    <consultant_id_company_id>1_4</consultant_id_company_id>
    <consultant_firstname>Michael</consultant_firstname>
    <consultant_lastname>Davis</consultant_lastname>
    <company>Google</company>
    <start_date>2006-02-13T15:26:37Z</start_date>
    <end_date>2009-02-13T15:26:37Z</end_date>
  </doc>

  <doc>
    <consultant_id_company_id>2_3</consultant_id_company_id>
    <consultant_firstname>Tom</consultant_firstname>
    <consultant_lastname>Anderson</consultant_lastname>
    <company>Yahoo</company>
    <start_date>2001-01-13T15:26:37Z</start_date>
    <end_date>2009-02-13T15:26:37Z</end_date>
  </doc>

  <doc>
    <consultant_id_company_id>2_4</consultant_id_company_id>
    <consultant_firstname>Tom</consultant_firstname>
    <consultant_lastname>Anderson</consultant_lastname>
    <company>Google</company>
    <start_date>1999-02-13T15:26:37Z</start_date>
    <end_date>2010-02-13T15:26:37Z</end_date>
  </doc>

Then you can search as: q=company:X AND start_date:[X TO *] AND end_date:[* TO Z]

On Fri, Jun 4, 2010 at 4:58 PM, Moazzam Khan <moazz...@gmail.com> wrote:

Hi guys,

I have a list of consultants, and the users (people who work for the company) are supposed to be able to search for consultants based on the time frame they worked for a company. For example, I should be able to search for all consultants who worked for Bear Stearns in the month of July. What is the best way of accomplishing this? I was thinking of formatting the document like this: company name
Indexing HTML
What is the preferred way to index HTML using DIH? (My HTML is stored in a blob field in our database.) I know there is the built-in HTMLStripTransformer, but that doesn't seem to work well with malformed/incomplete HTML. I've created a custom transformer to first tidy up the HTML using JTidy, then I pass it to the HTMLStripTransformer like so:

  <field column="description" name="description" tidy="true" ignoreErrors="true" propertiesFile="config/tidy.properties"/>
  <field column="description" name="description" stripHTML="true"/>

However, this method isn't fool-proof, as you can see by my ignoreErrors option. I quickly took a peek at Tika and noticed that it has its own HtmlParser. Is this something I should look into? Are there any alternatives that deal with malformed/incomplete HTML?

Thanks
Can query boosting be used with a custom request handlers?
I want to try out the Bobo plugin for Solr, which is a custom request handler (http://code.google.com/p/bobo-browse/wiki/SolrIntegration). At the same time, I want to use BoostQParserPlugin to boost my queries, something like:

  {!boost b=log(popularity)}foo

Can I use the {!boost} feature in conjunction with an external custom request handler like the Bobo plugin, or does {!boost} only work with the standard request handler?
Re: Diagnosing solr timeout
Every time you reload the index, the facet cache data has to be rebuilt. Could that be it?

Also, how big are the fields being highlighted? And are they indexed with term vectors? (If not, the text is re-analyzed in flight.) How big are the caches? Are they growing and growing?

On Wed, Jun 9, 2010 at 11:12 AM, Jean-Sebastien Vachon <js.vac...@videotron.ca> wrote:

I use the following article as a reference when dealing with GC-related issues: http://www.petefreitag.com/articles/gctuning/

I suggest you activate the verbose option and send the GC stats to a file. I don't remember exactly what the option was, but you should find the information easily.

Good luck

On 2010-06-09, at 11:35 AM, Paul wrote:

Have you looked at the garbage collector statistics? I've experienced this kind of issue in the past, and I was getting huge spikes when the GC was doing its job.

I haven't, and I'm not sure what a good way to monitor this is. The problem occurs maybe once a week on a server. Should I run jstat the whole time and redirect the output to a log file? Is there another way to get that info?

Also, I was suspecting GC myself. So, if it is the problem, what do I do about it? It seems like increasing RAM might make the problem worse, because it would wait longer to GC, and then it would have more to do.

--
Lance Norskog
goks...@gmail.com
Re: Indexing HTML
The HTMLStripChar variants are newer and might work better.

On Wed, Jun 9, 2010 at 8:38 PM, Blargy <zman...@hotmail.com> wrote:

What is the preferred way to index HTML using DIH? (My HTML is stored in a blob field in our database.) I know there is the built-in HTMLStripTransformer, but that doesn't seem to work well with malformed/incomplete HTML. I've created a custom transformer to first tidy up the HTML using JTidy, then I pass it to the HTMLStripTransformer like so:

  <field column="description" name="description" tidy="true" ignoreErrors="true" propertiesFile="config/tidy.properties"/>
  <field column="description" name="description" stripHTML="true"/>

However, this method isn't fool-proof, as you can see by my ignoreErrors option. I quickly took a peek at Tika and noticed that it has its own HtmlParser. Is this something I should look into? Are there any alternatives that deal with malformed/incomplete HTML?

Thanks

--
Lance Norskog
goks...@gmail.com
how to have shards parameter by default
Hi,

I am running distributed search on Solr. I have 70 Solr instances, so each time I want to search I need to use:

  ?shards=localhost:7500/solr,localhost..7620/solr

It is a very long URL, so how can I encode the shards into a config file so that I don't need to type them each time?

Thanks,
Scott
Re: how to have shards parameter by default
I tried putting shards into the default request handler, but now each time I search, Solr hangs forever. So what's the correct solution? Thanks.

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="fl">*</str>
      <str name="version">2.1</str>
      <str name="shards">localhost:7500/solr,localhost:7501/solr,localhost:7502/solr,localhost:7503/solr,localhost:7504/solr,localhost:7505/solr,localhost:7506/solr</str>
    </lst>
  </requestHandler>

On Thu, Jun 10, 2010 at 11:48 AM, Scott Zhang <macromars...@gmail.com> wrote:

Hi,

I am running distributed search on Solr. I have 70 Solr instances, so each time I want to search I need to use:

  ?shards=localhost:7500/solr,localhost..7620/solr

It is a very long URL, so how can I encode the shards into a config file so that I don't need to type them each time?

Thanks,
Scott
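No answer appears in this digest, but the hang has a plausible, well-known cause worth noting: with shards in the defaults of the handler that is itself the default, every shard sub-request comes back in through that same handler and fans out again, so the search never terminates (a distributed deadlock). The usual arrangement - sketched here as an assumption, with an arbitrary handler name and a shortened shard list - is a separate handler for distributed queries, leaving the default handler shard-free for the sub-requests:

  <!-- clients query /select?qt=distrib ; shard sub-requests still hit
       the plain "standard" handler, which has no shards default -->
  <requestHandler name="distrib" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="shards">localhost:7500/solr,localhost:7501/solr,localhost:7502/solr</str>
    </lst>
  </requestHandler>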
Re: Indexing HTML
Does the HTMLStripChar filter apply at index time or query time? Would it matter to use one over the other?

As a side question: if I want to generate highlighter summaries against this field, do I need to store the whole field, or just index it with TermVector.WITH_POSITIONS_OFFSETS?
Re: Indexing HTML
Wait... do you mean I should try the HTMLStripCharFilterFactory analyzer at index time?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
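For readers following along, wiring the char filter into a field type looks like the sketch below. A char filter runs whenever its analyzer runs (index time, query time, or both, depending on where it's declared), and it only affects the indexed tokens; a stored field keeps its raw HTML, which matters for the highlighting question above. The field type name and the tokenizer/filter choices are placeholders:

  <fieldType name="html_text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- strips markup before tokenization -->
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>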
Re: Indexing HTML
On Jun 9, 2010, at 8:38 PM, Blargy wrote:

What is the preferred way to index HTML using DIH? (My HTML is stored in a blob field in our database.) I know there is the built-in HTMLStripTransformer, but that doesn't seem to work well with malformed/incomplete HTML. I've created a custom transformer to first tidy up the HTML using JTidy, then I pass it to the HTMLStripTransformer like so:

  <field column="description" name="description" tidy="true" ignoreErrors="true" propertiesFile="config/tidy.properties"/>
  <field column="description" name="description" stripHTML="true"/>

However, this method isn't fool-proof, as you can see by my ignoreErrors option. I quickly took a peek at Tika and noticed that it has its own HtmlParser. Is this something I should look into? Are there any alternatives that deal with malformed/incomplete HTML?

Thanks

Actually, the Tika HtmlParser just wraps TagSoup - that's a good option for cleaning up busted HTML.

--
Ken
http://ken-blog.krugler.org
+1 530-265-2225

Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g