Re: schema.xml in other than conf folder
Chris, our Solr conf folder is on a read-only file system, but the data directory (the index) is not. Per our production environment guidelines, configuration files must live on a read-only file system. Thanks, SRD
Tuning StatsComponent
Hello. I'm using the StatsComponent to get the sum of amounts, but the StatsComponent is very slow on a large index of 30 million documents. How can I tune it? The problem is that I have 5 currencies and need to send a separate request for each currency, which sometimes makes the Solr search very slow. =( Any ideas?
Re: DIH load only selected documents with XPathEntityProcessor
Hi Gora, thanks a lot, very nice solution, works perfectly. I will dig more into ScriptTransformer, it seems to be very powerful. Regards, Bernd

Am 08.01.2011 14:38, schrieb Gora Mohanty:
On Fri, Jan 7, 2011 at 12:30 PM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:
Hello list, is it possible to load only selected documents with XPathEntityProcessor? While loading docs I want to drop/skip/ignore documents with a missing URL. Example:

<documents>
  <document>
    <title>first title</title>
    <id>identifier_01</id>
    <link>http://www.foo.com/path/bar.html</link>
  </document>
  <document>
    <title>second title</title>
    <id>identifier_02</id>
    <link></link>
  </document>
</documents>

The first document should be loaded; the second document should be ignored because it has an empty link (this should also work for a missing link field). [...]

You can use a ScriptTransformer, along with $skipRow/$skipDoc. E.g., something like this for your data import configuration file:

<dataConfig>
  <script><![CDATA[
    function skipRow(row) {
      var link = row.get('link');
      if (link == null || link == '') {
        row.put('$skipRow', 'true');
      }
      return row;
    }
  ]]></script>
  <dataSource type="FileDataSource" />
  <document>
    <entity name="f" processor="FileListEntityProcessor" baseDir="/home/gora/test"
            fileName=".*xml" newerThan="'NOW-3DAYS'" recursive="true"
            rootEntity="false" dataSource="null">
      <entity name="top" processor="XPathEntityProcessor" forEach="/documents/document"
              url="${f.fileAbsolutePath}" transformer="script:skipRow">
        <field column="link" xpath="/documents/document/link" />
        <field column="title" xpath="/documents/document/title" />
        <field column="id" xpath="/documents/document/id" />
      </entity>
    </entity>
  </document>
</dataConfig>

Regards, Gora
Re: Tuning StatsComponent
On Mon, Jan 10, 2011 at 2:28 PM, stockii st...@shopgate.com wrote:
> Hello. I'm using the StatsComponent to get the sum of amounts, but the StatsComponent is very slow on a large index of 30 million documents. How can I tune it?

Not sure about this problem.

> The problem is that I have 5 currencies and need to send a separate request for each currency, which sometimes makes the Solr search very slow. =( [...]

I guess that you mean the search from the front-end is slow. It is difficult to make a guess without details of your index and of your queries, but one thing that immediately jumps out is that you could shard the Solr index by currency, and have your front-end direct queries for each currency to the appropriate Solr server. Please do share a description of what you are indexing, how large your index is, and what kind of queries you are running. I take it that you have already taken a look at http://wiki.apache.org/solr/SolrPerformanceFactors

Regards, Gora
Re: Tuning StatsComponent
Oh, thanks for your fast reply. I will try the suggestions. In the meantime, more information about my index: I have 2 Solr instances with 6 cores. Each core has its own index, and one core's index holds about 30 million documents. Each document has these stats-relevant fields: amount, amount_euro, currency_id, user_costs, user_costs_euro, currency_id_user_costs. So I send a request to the StatsComponent for each currency, like this:

stats=true&json.nl=map&wt=javabin&rows=0&version=2&fl=uniquekey,score&start=0&stats.field=amount&q=QUERY&isShard=true&fq=product:bla+currency_id:EUR&fsv=true

The stats.field and filter change for each of my 5 currencies, so for ONE search request I need to send 10 requests to get the sums, and that is veeery slow. =( I am searching over two shards, sometimes more than two.
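One approach that may reduce the request count (a sketch, assuming currency_id is a single-valued field; see http://wiki.apache.org/solr/StatsComponent): the StatsComponent's stats.facet parameter breaks the stats down per facet value, so a single request can return the sum of amount for every currency at once:

stats=true&rows=0&q=QUERY&fq=product:bla&stats.field=amount&stats.facet=currency_id

Each currency_id value then gets its own sum/min/max block in the response, replacing the five per-currency requests with one.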
segment gets corrupted (after background merge ?)
Hi, We are using:

Solr Specification Version: 1.4.1
Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
Lucene Specification Version: 2.9.3
Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55

# java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

We want to index 4M docs in one core (and once that works fine we will add other cores with 2M docs on the same server); 1 doc is roughly 1 kB. We use Solr replication every 5 minutes to update the slave server (queries are executed on the slave only). Documents change very quickly; during a normal day we will have approximately:

* 200,000 updated docs
* 1,000 new docs
* 200 deleted docs

I attached the last good checkIndex output (solr20110107.txt) and the corrupted one (solr20110110.txt). This is not the first time a segment has become corrupted on this server, which is why I run checkIndex frequently. (As you can see, the first segment holds 1,800,000 docs and it checks out fine!) I can't find any SEVERE or FATAL messages, or exceptions, in the Solr logs. I also attached my schema.xml and solrconfig.xml. Is there something wrong with what we are doing? Do you need other info? Thanks.

Opening index @ /solr/multicore/core1/data/index/
Segments file=segments_i7t numSegments=9 version=FORMAT_DIAGNOSTICS [Lucene 2.9]

1 of 9: name=_ncc docCount=1841685
  compound=false hasProx=true numFiles=9 size (MB)=6,683.447
  diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
  has deletions [delFileName=_ncc_13m.del]
  test: open reader.........OK [105940 deleted docs]
  test: fields..............OK [51 fields]
  test: field norms.........OK [51 fields]
  test: terms, freq, prox...OK [17952652 terms; 174113812 terms/docs pairs; 248678841 tokens]
  test: stored fields.......OK [51585300 total field count; avg 29.719 fields per doc]
  test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

2 of 9: name=_nqt docCount=431889
  compound=false hasProx=true numFiles=9 size (MB)=1,671.375
  diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
  has deletions [delFileName=_nqt_gt.del]
  test: open reader.........OK [10736 deleted docs]
  test: fields..............OK [51 fields]
  test: field norms.........OK [51 fields]
  test: terms, freq, prox...OK [5211271 terms; 39824029 terms/docs pairs; 67787288 tokens]
  test: stored fields.......OK [12562924 total field count; avg 29.83 fields per doc]
  test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

3 of 9: name=_ol7 docCount=913886
  compound=false hasProx=true numFiles=9 size (MB)=3,567.63
  diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
  has deletions [delFileName=_ol7_3.del]
  test: open reader.........OK [11 deleted docs]
  test: fields..............OK [51 fields]
  test: field norms.........OK [51 fields]
  test: terms, freq, prox...OK [9825896 terms; 93954470 terms/docs pairs; 152947518 tokens]
  test: stored fields.......OK [29587930 total field count; avg 32.376 fields per doc]
  test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

4 of 9: name=_ol2 docCount=1011
  compound=false hasProx=true numFiles=8 size (MB)=6.959
  diagnostics = {os.version=2.6.26-2-amd64, os=Linux, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=flush, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
  no deletions
  test: open reader.........OK
  test: fields..............OK [38 fields]
  test: field norms.........OK [38 fields]
  test: terms, freq, prox...OK [54205 terms; 220705 terms/docs pairs; 389336 tokens]
  test: stored fields.......OK [27402 total field count; avg 27.104 fields per doc]
  test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

5 of 9: name=_ol3 docCount=1000
  compound=false hasProx=true numFiles=8 size (MB)=6.944
  diagnostics = {os.version=2.6.26-2-amd64, os=Linux, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=flush, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
  no deletions
  test: open reader.........OK
  test: fields..............OK [33 fields]
  test:
Re: Internal Server Error when indexing a pdf file
Check your libraries for the Tika-related jar files. The Tika jars must be on Solr's classpath. - Grijesh
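A sketch of what that can look like in solrconfig.xml (the dir paths follow the stock Solr 1.4 example layout and will differ in other installations):

<lib dir="../../contrib/extraction/lib" />
<lib dir="../../dist/" />

Alternatively, copying the Tika jars and the solr-cell jar into the lib/ directory under the Solr home (next to conf/) also puts them on the classpath.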
Re: Tuning StatsComponent
When I use the StatsComponent I get this message:

INFO: UnInverted multi-valued field {field=product,memSize=4336,tindexSize=46,time=0,phase1=0,nTerms=1,bigTerms=1,termInstances=0,uses=0}

What does this mean?
Re: Creating Solr index from map/reduce
Thanks Alexander

2011/1/3 Alexander Kanarsky kanarsky2...@gmail.com:
Joan, the current version of the patch assumes the location and names of the schema and solrconfig files ($SOLR_HOME/conf); this is hardcoded (see the SolrRecordWriter constructor). Multi-core configuration with separate configuration locations via solr.xml is not supported for now. As a workaround, you could link or copy the schema and solrconfig files to match the hardcoded assumption. Thanks, -Alexander

On Wed, Dec 29, 2010 at 2:50 AM, Joan joan.monp...@gmail.com wrote:
If I rename my custom schema file (schema-xx.xml), which is located in SOLR_HOME/schema/, copy it to the conf folder, and then try to run CSVIndexer, it shows me another error:

Caused by: java.lang.RuntimeException: Can't find resource 'solrconfig.xml' in classpath or '/tmp/hadoop-root/mapred/local/taskTracker/archive/localhost/tmp/b7611d6d-9cc7-4237-a240-96ecaab9f21a.solr.zip/conf/'

I don't understand, because I have a Solr configuration file (solr.xml) where I define all the cores:

<core name="core_name" instanceDir="solr-data/index"
      config="solr/conf/solrconfig_xx.xml"
      schema="solr/schema/schema_xx.xml"
      properties="solr/conf/solrcore.properties"/>

But I think that when I run CSVIndexer, it doesn't know that solr.xml exists, and it tries to look for schema.xml and solrconfig.xml in the default folder (conf).

2010/12/29 Joan joan.monp...@gmail.com:
Hi, I'm trying to generate a Solr index from Hadoop (map/reduce), so I'm using the SOLR-1301 patch (https://issues.apache.org/jira/browse/SOLR-1301), but I can't get it to work. I try to run CSVIndexer with some arguments: <directory for the Solr index> -solr <Solr home> <input, in this case a CSV>. I'm running CSVIndexer like this:

HADOOP_INSTALL/bin/hadoop jar my.jar CSVIndexer INDEX_FOLDER -solr /SOLR_HOME CSV_FILE_PATH

Before running CSVIndexer, I put the CSV file into HDFS. My Solr home doesn't use the default configuration layout; it is divided into multiple folders (/conf, /schema). I have custom Solr configuration files, so CSVIndexer can't find schema.xml. Obviously it won't be able to find it: in my case this file is named schema-xx.xml, and CSVIndexer looks for schema.xml inside the conf folder and doesn't know that the schema folder exists. And I have a Solr configuration file (solr.xml) where I configure multiple cores. I tried to modify Solr's paths but it still doesn't work. I understand that CSVIndexer copies the specified Solr home into HDFS (/tmp/hadoop-user/mapred/local/taskTracker/archive/...), and when it tries to find schema.xml there, it doesn't exist:

10/12/29 10:18:11 INFO mapred.JobClient: Task Id : attempt_201012291016_0002_r_00_1, Status : FAILED
java.lang.IllegalStateException: Failed to initialize record writer for my.jar, attempt_201012291016_0002_r_00_1
    at org.apache.solr.hadoop.SolrRecordWriter.init(SolrRecordWriter.java:253)
    at org.apache.solr.hadoop.SolrOutputFormat.getRecordWriter(SolrOutputFormat.java:152)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:553)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.FileNotFoundException: Source '/tmp/hadoop-guest/mapred/local/taskTracker/archive/localhost/tmp/e8be5bb1-e910-47a1-b5a7-1352dfec2b1f.solr.zip/conf/schema.xml' does not exist
    at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:636)
    at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:606)
    at org.apache.solr.hadoop.SolrRecordWriter.init(SolrRecordWriter.java:222)
    ... 4 more
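A sketch of the link-or-copy workaround Alexander describes, using the file names from Joan's setup (adjust the paths to your own layout):

cp $SOLR_HOME/schema/schema_xx.xml $SOLR_HOME/conf/schema.xml
cp $SOLR_HOME/conf/solrconfig_xx.xml $SOLR_HOME/conf/solrconfig.xml

(ln -s works equally well), so that SolrRecordWriter finds schema.xml and solrconfig.xml at the hardcoded $SOLR_HOME/conf location.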
Solr trunk for production
Hello, Are people using Solr trunk in serious production environments? I suspect the answer is yes, just want to see if there are any gotchas/warnings. Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: Replication: abort-fetch and restarting
Any thoughts on this one? Should I add a ticket?

On Tuesday 04 January 2011 20:08:40 Markus Jelsma wrote:
Hi, it seems abort-fetch nicely removes the index directory which I'm replicating to, which is fine. Restarting, however, does not trigger the same cleanup that the abort-fetch command does; at least, that's what my tests seem to tell me. Shouldn't a restart of Solr nicely clean up the mess before exiting? And shouldn't starting Solr also look for mess left behind by a possible sudden shutdown of the server, in which case the mess obviously could not have been cleaned up? If I now stop, clean, and start my slave, it will attempt to download an existing index. If I abort-fetch, it will clean up the mess and (due to a low polling interval) make another attempt. If I, however, restart (instead of abort-fetch), the old temporary directory stays around and needs to be deleted manually. Cheers,
-- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
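For reference, the command under discussion is issued against the slave's ReplicationHandler (host, port, and core path are illustrative):

curl 'http://slave:8983/solr/replication?command=abortfetch'

It is this abortfetch handling that removes the in-progress download directory; the report above is that a plain restart performs no equivalent cleanup.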
How to let crawlers in, but prevent their damage?
Hi, How do people with public search services deal with bots/crawlers? And I don't mean to ask how one bans them (robots.txt), slows them down (the Delay stuff in robots.txt), or prevents them from digging too deep into search results... What I mean is that when you have publicly exposed search that bots crawl, they issue all kinds of crazy queries that result in errors, that add noise to Solr caches, increase Solr cache evictions, etc. etc. Are there some known recipes for dealing with them, minimizing their negative side-effects, while still letting them crawl you? Thanks, Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
Re: DIH - Closing ResultSet in JdbcDataSource
Gora, thanks for the response. After taking another look, you are correct about hasnext() closing the ResultSet object (in 1.4.1 as well as 1.4.0). I didn't notice the case difference between the two function calls, so I missed it. I'll keep looking into the original issue and reply if I find a cause/solution. Shane

On Sat, Jan 8, 2011 at 4:04 AM, Gora Mohanty g...@mimirtech.com wrote:
On Sat, Jan 8, 2011 at 1:10 AM, Shane Perry thry...@gmail.com wrote:
Hi, I am in the process of migrating our system from Postgres 8.4 to Solr 1.4.1. Our system is fairly complex, and as a result I have had to define 19 base entities in the data-config.xml definition file. Each of these entities executes 5 queries. When doing a full-import, as each entity completes, the server hosting Postgres shows 5 "idle in transaction" connections for the entity. In digging through the code, I found that the JdbcDataSource wraps the ResultSet object in a custom ResultSetIterator object, leaving the ResultSet open. Walking through the code, I can't find a close() call anywhere on the ResultSet. I believe this results in the "idle in transaction" processes. [...]

I have not examined the "idle in transaction" issue that you mention, but the ResultSet object in a ResultSetIterator is closed in the private hasnext() method, when there are no more results or if there is an exception. hasnext() is called by the public hasNext() method that should be used when iterating over the results, so I see no issue there. Regards, Gora

P.S. This is from the Solr 1.4.0 code, but I would not think that this part of the code has changed.
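For readers following along, a minimal sketch of the close-on-exhaustion pattern Gora describes (illustrative code, not the actual Solr source):

import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// The ResultSet is closed inside the private hasnext() once rows run out
// or an exception occurs, mirroring the behaviour described above.
class ClosingResultSetIterator implements Iterator<Object> {
    private final ResultSet rs;
    private Object next;
    private boolean hasNextCached;

    ClosingResultSetIterator(ResultSet rs) { this.rs = rs; }

    private boolean hasnext() {
        try {
            if (rs.next()) {
                next = rs.getObject(1);  // fetch a value from the current row
                return true;
            }
            rs.close();                  // normal exhaustion: release the cursor
            return false;
        } catch (SQLException e) {
            try { rs.close(); } catch (SQLException ignored) {}
            throw new RuntimeException(e);
        }
    }

    @Override public boolean hasNext() {
        return hasNextCached || (hasNextCached = hasnext());
    }

    @Override public Object next() {
        if (!hasNext()) throw new NoSuchElementException();
        hasNextCached = false;
        return next;
    }
}

Note that the close only happens when iteration runs to completion; a query abandoned part-way through would leave the ResultSet, and with it the transaction, open.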
strange SOLR behavior with required field attribute
Dear list, while trying different options with DIH and ScriptTransformer, I also tried using the required="true" option for a field. I have 3 records:

<documents>
  <document>
    <title>first title</title>
    <id>identifier_01</id>
    <link>http://www.foo.com/path/bar.html</link>
  </document>
  <document>
    <title>second title</title>
    <id>identifier_02</id>
    <link></link>
  </document>
  <document>
    <title>third title</title>
    <id>identifier_03</id>
  </document>
</documents>

schema.xml snippet:

<field name="title" type="string" indexed="true" stored="true" />
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="link" type="string" indexed="true" stored="true" required="true" />

After loading I have 2 records in the index:

<str name="title">first title</str>
<str name="id">identifier_01</str>
<str name="link">http://www.foo.com/path/bar.html</str>

<str name="title">second title</str>
<str name="id">identifier_02</str>
<str name="link"/>

Sure, I get a SolrException in the logs saying "missing required field: link", but this is for the third record, whereas the second record gets loaded even though its link is empty. So I guess this is a feature of Solr? And the required attribute means the presence of the tag, not the presence of content in the tag, right? Regards Bernd
Re: How to let crawlers in, but prevent their damage?
Hi Otis, From what I learned at Krugle, the approach that worked for us was: 1. Block all bots on the search page. 2. Expose the target content via statically linked pages that are separately generated from the same backing store, and optimized for target search terms (extracted from your own search logs). -- Ken

On Jan 10, 2011, at 5:41am, Otis Gospodnetic wrote: [...]

-- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: How to let crawlers in, but prevent their damage?
Sorry, not an answer but a +1 vote for finding out best practice for this. Related to it is DOS attacks. We have rewrite rules between the proxy server and Solr which attempt to filter out undesirable stuff, but would it be better to have a query app doing this? Any standard rewrite rules which drop invalid or potentially malicious queries would be very nice :-) lee c

On 10 January 2011 13:41, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: [...]
Re: Multivalued fields and facet performance
Hi Howard, this is normal. Your first query reads a bunch of index data from disk, and your RAM then caches it. If your first query involves sorting, more data for the FieldCache is read and stored; if there are multiple sort fields, one such structure for each. If facets are involved, more of that stuff. If you are optimizing your index, you are likely forcing more disk IO.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

- Original Message From: Howard Lee how...@workdigital.co.uk To: solr-user@lucene.apache.org Sent: Mon, January 10, 2011 8:59:03 AM Subject: Multivalued fields and facet performance

Hi, I'd appreciate some explanation of what may be going on in the following scenario using multivalued fields and facets. Solr version: 1.5. Our index contains 35 million docs, and our search uses 2 multivalued fields as facets. There are approx 5 million different values in one field and 5000 in the other. We are seeing the following, and I'm curious as to what is actually happening in the background. The first search can take up to 5 minutes; all subsequent queries, for any q, return in under a second. This is fine unless you are the first search or hit a new searcher. I plan on adding a firstSearcher and newSearcher in the config to avoid long delays every time the index is updated (once a day), but I have concerns about the length of the delay in launching a new searcher, and whether this causes too much overhead. Can someone explain to me what processes are going on in the background that cause this behaviour, so I can understand the implications or make some adjustments in the config to compensate? thanx Howard
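Since firstSearcher/newSearcher came up, a sketch of the warming listeners in solrconfig.xml (the query and field names are placeholders; use your real facet fields and sorts so the caches are populated before the searcher goes live):

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="rows">0</str>
      <str name="facet">true</str>
      <str name="facet.field">your_facet_field</str>
    </lst>
  </arr>
</listener>
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="rows">0</str>
      <str name="facet">true</str>
      <str name="facet.field">your_facet_field</str>
    </lst>
  </arr>
</listener>

With warming of this sort, the expensive first request is paid once per commit, in the background, rather than by the first user.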
Re: How to let crawlers in, but prevent their damage?
Hi Ken, thanks Ken. :) The problem with this approach is that it exposes very limited content to bots/web search engines. Take http://search-lucene.com/ for example. People enter all kinds of queries in web search engines and end up on that site. People who visit the site directly don't necessarily search for those same things. Plus, new terms are entered to get to search-lucene.com every day, so keeping up with that would mean constantly generating more and more of those static pages. Basically, the tail is super long. On top of that, new content is constantly being generated, so one would have to also constantly both add and update those static pages.

I have a feeling there is not a good solution for this, because on one hand people don't like the negative bot side effects, while on the other hand people want as much of their sites indexed by the big guys. The only half-solution that comes to mind involves looking at who's actually crawling you and who's bringing you visitors, then blocking those with a bad ratio of the two - bots that crawl a lot but don't bring a lot of value. Any other ideas?

Thanks, Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

- Original Message From: Ken Krugler kkrugler_li...@transpac.com To: solr-user@lucene.apache.org Sent: Mon, January 10, 2011 9:43:49 AM Subject: Re: How to let crawlers in, but prevent their damage? [...]
Re: strange SOLR behavior with required field attribute
(11/01/10 23:26), Bernd Fehling wrote: [...]

Bernd, this seems like the same problem as SOLR-1973, which I recently fixed in trunk and 3x, but I'm not sure. Which version are you using? Can you try trunk or 3x? If you still get the same error with trunk/3x, please open a JIRA issue.

Koji
--
http://www.rondhuit.com/en/
Re: Multivalued fields and facet performance
Otis, the reason I ask is that I run a number of sites on Solr, some with 10 million+ docs faceting on similar types of data, and have not seen anywhere near this length of initial delay. The main differences are that those sites facet on single-valued fields rather than multivalued ones, and that this site is searching three times the volume of data. Would switching to single-valued fields (I'd rather not) make much of a difference? I've also noticed that multivalued fields aren't populating the Lucene FieldCache. Is this the correct behaviour?

Regards Howard

On 10 January 2011 14:55, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: [...]

--
WORKDIGITAL LTD workdigital.co.uk 32-34 Broadwick Street W1A 2HG London, UK
Howard Lee CEO M +44(0)7931 476 766 E how...@workdigital.co.uk
workhound.co.uk - salarytrack.co.uk - twitterjobsearch.com - dreamjobalert.co.uk - recruitmentadnetwork.com
Token Counter
Hello, I would like to know if there is a trivial procedure/tool for displaying the number of appearances of each token in query results. Thanks
Re: strange SOLR behavior with required field attribute
Hi Koji, I'm using apache-solr-4.0-2010-11-24_09-25-17 from trunk. A grep for SOLR-1973 in CHANGES.txt says that it should have been fixed. Strange... Regards, Bernd

Am 10.01.2011 16:14, schrieb Koji Sekiguchi: [...]
Storing metadata from post parameters and XML
I'm very unclear on how to associate the metadata I need with a Solr index entry. Based on what I've read thus far, you can extract data from text files and store that in a Solr document. I have hundreds of thousands of documents in a database/svn-type system. When I index a file, it will likely be local to the filesystem, and I know the location it will take on in the database. So, when I index, I want to provide a path by which it can be found when someone else does a search. 123.xml may look like:

<mydoc>
  <title>my title</title>
  <para>Every foobar has its day</para>
  <figure href="/abc/xxx.gif"><caption>My caption</caption></figure>
</mydoc>

and the proprietary location I want it to be associated with is:

/abc/def/ghi/123.xml

So, when a user does a search for foobar, it returns some information about 123.xml, but most importantly the location should be available. I have yet to find (in schema.xml or otherwise) where you can define that path to store, and how you would pass that parameter along when indexing the document. Instead, from the examples I can find, including the book, you store fields from your data in the index. In the book's examples (a music database), searching for "Cherub Rock" returns a list of tracks with their duration, track name, album name, and artist. In other words, the full-text data you retrieve is the only information the search index has to offer. Just for example, using the exampledocs post.jar, I'm envisioning something like this:

java -jar post.jar 123.xml -dblocation /abc/def/ghi/123.xml -othermeta1 xxx -othermeta2 zzz

Then the Solr doc would look like:

<doc>
  <field name="id">123</field>
  <field name="dblocation">/abc/def/ghi/123.xml</field>
  <field name="othermeta1">xxx</field>
  <field name="othermeta2">zzz</field>
  <field name="title">my title</field>
  <field name="graphic">/abc/xxx.gif</field>
  <field name="text">Every foobar has its day My caption</field>
</doc>

This way, when a user searches for foobar, they get item 123 back, review the search result, and if they decide that's the data they want, they can use the dblocation field to retrieve the data for editing purposes (and then re-index it following their edits). I'm guessing I just haven't found the right terms to look into yet, as I'm very new to this. Thanks for any direction you can provide. Also, if Solr appears to be the wrong tool for what I need, let me know as well! Thank you, Walter
Re: Storing metadata from post parameters and XML
Hey Walter, what's against just putting your db location in a 'string' field and using it like any other value? There is no special field type for path/directory/location information, afaik. Regards Stefan

On Mon, Jan 10, 2011 at 4:50 PM, Walter Closenfleight walter.p.closenflei...@gmail.com wrote: [...]
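A sketch of that in practice, reusing Walter's field names (the <field/> line goes in schema.xml; the document is what post.jar would send):

<field name="dblocation" type="string" indexed="true" stored="true" />

<add>
  <doc>
    <field name="id">123</field>
    <field name="dblocation">/abc/def/ghi/123.xml</field>
    <field name="title">my title</field>
    <field name="text">Every foobar has its day My caption</field>
  </doc>
</add>

A search for foobar then returns dblocation along with the other stored fields, and the application can use it to fetch the source document for editing.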
Re: Storing metadata from post parameters and XML
Stefan, you're right. I was attempting to post some quick pseudo-code, but that <doc/> is pretty misleading; those should have been str elements, like <str name="dblocation">/abc/def/ghi/123.xml</str>, or something to that effect. Thanks, Walter

On Mon, Jan 10, 2011 at 10:08 AM, Stefan Matheis matheis.ste...@googlemail.com wrote: [...]
Help needed in handling plurals
Hi, I am currently facing the following problematic scenario. At index time, I index a field with the value "Laptop". At index time, I index another field with the value "Laptops". At query time, I search for "Laptops". What happens right now is that I only get back "Laptops" in the results, whereas I would like both "Laptop" and "Laptops" to be included. I do not want to use the Porter stemmer due to its aggressive nature, and I have tried to set up the Pling stemmer as a custom filter in my analyzer, but to no avail. Can anyone guide me as to: 1. Where to put the PlingStemmer.class file. 2. How to set up the custom filter in the schema.xml file. Thanks in advance. Regards, Taimur
Re: Token Counter
On 1/10/2011 8:38 AM, supersoft wrote:
> Hello, I would like to know if there is a trivial procedure/tool for displaying the number of appearances of each token in query results. Thanks

Unless I'm misunderstanding what you mean, this sounds exactly like facets. http://wiki.apache.org/solr/SolrFacetingOverview

An example URL (rows=0 for less distraction):

http://HOST:8983/solr/CORE/select/?q=horse&rows=0&facet=true&facet.field=keywords

Am I misunderstanding your question? Thanks, Shawn
Re: How to let crawlers in, but prevent their damage?
I don't know about stopping the problems with the issues that you've raised. But I do know that web sites that aren't idempotent with GET requests are in the hurt locker. That seems to be WAY too many of them. This means: don't do anything with GET that changes the contents of your web site. Regarding a more direct answer to your question, you'd probably have to have some sort of filtering applied. And anyway, crawlers only issue 'queries' based on the URLs found in the site, right? So are you going to have weird URLs embedded in your site?

Dennis Gearon

Signature Warning
It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others' mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.

- Original Message From: Otis Gospodnetic otis_gospodne...@yahoo.com To: solr-user@lucene.apache.org Sent: Mon, January 10, 2011 5:41:17 AM Subject: How to let crawlers in, but prevent their damage? [...]
Re: How to let crawlers in, but prevent their damage?
On Jan 10, 2011, at 7:02am, Otis Gospodnetic wrote:
> The problem with this approach is that it exposes very limited content to bots/web search engines. [...] Basically, the tail is super long.

To clarify - the issue of using actual user search traffic is one of SEO, not what content you expose. If, for example, people commonly do a search for "java something", then that's a hint that the URL to the static content, and the page title, should have the language as part of it. So you shouldn't be generating static pages based on search traffic, though you might want to decide what content to favor (see below) based on popularity.

> On top of that, new content is constantly being generated, so one would have to also constantly both add and update those static pages.

Yes, but that's why you need to automate that content generation, and do it on a regular (e.g. weekly) basis. The big challenges we ran into were:

1. Dealing with badly behaved bots that would hammer the site. We wound up putting this content on a separate system, so it wouldn't impact users on the main system, and generating a regular report by user agent and IP address, so that we could block by robots.txt and IP when necessary.

2. Figuring out how to structure the static content so that it didn't look like spam to Google/Yahoo/Bing. You don't want to have too many links per page, or too much depth, but that constrains how many pages you can reasonably expose. We had project scores based on code, activity, and usage - so we used those to rank the content and focus on exposing the good stuff early (at low depth). You could do the same based on popularity, from search logs.

Anyway, there's a lot to this topic, but it doesn't feel very Solr specific. So apologies for reducing the signal-to-noise ratio with talk about SEO :)

-- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: How to let crawlers in, but prevent their damage?
- Original Message From: lee carroll lee.a.carr...@googlemail.com To: solr-user@lucene.apache.org Sent: Mon, January 10, 2011 6:48:12 AM Subject: Re: How to let crawlers in, but prevent their damage?
> Sorry, not an answer but a +1 vote for finding out best practice for this. Related to it is DOS attacks. We have rewrite rules between the proxy server and Solr which attempt to filter out undesirable stuff, but would it be better to have a query app doing this? Any standard rewrite rules which drop invalid or potentially malicious queries would be very nice :-)

What exactly are malicious queries (besides scraping)? What's the problem with invalid queries? Unless someone is doing a custom crawl/scrape of your site, how are they going to issue queries that aren't already on the site as URLs?

On 10 January 2011 13:41, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: [...]
Re: PHP PECL solr API library
Yeah, it doesn't look like an easy, CRUD-based interface.

- Original Message From: Lukas Kahwe Smith m...@pooteeweet.org To: solr-user@lucene.apache.org Sent: Sun, January 9, 2011 11:33:16 PM Subject: Re: PHP PECL solr API library

On 10.01.2011, at 08:16, Dennis Gearon wrote:
> Anyone have any experience using this library? http://us3.php.net/solr

Yeah, it works quite well. However, imho the API is a maze. Also it's lacking critical stuff like escaping, and nice-to-have stuff like Lucene query parsing/rewriting. regards, Lukas Kahwe Smith m...@pooteeweet.org
Re: How to let crawlers in, but prevent their damage?
Hmm, so if someone says they have SEO skills on their resume, they COULD be talking about optimizing the SEARCH engine at some site, not just a web site to be crawled by search engines?

- Original Message From: Ken Krugler kkrugler_li...@transpac.com To: solr-user@lucene.apache.org Sent: Mon, January 10, 2011 9:07:43 AM Subject: Re: How to let crawlers in, but prevent their damage? [...]
Re: Help needed in handling plurals
--- On Mon, 1/10/11, taimurAQ taimur_qure...@hotmail.com wrote: [...]

For an alternative to PlingStemmer see: http://search-lucene.com/m/uHzMd2h5uDK1/

To integrate Pling into Solr you need to write a custom TokenFilterFactory: http://wiki.apache.org/solr/SolrPlugins

public class PlingStemFilterFactory extends BaseTokenFilterFactory { }

You can do this by modifying existing subclasses. You need to create a jar file and put it into the solrhome/lib directory. All custom code must be included as jar files.
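A sketch of such a factory (PlingStemFilter is assumed to be a TokenFilter you write around PlingStemmer; the names here are illustrative):

package com.example;

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

// Solr instantiates this from a <filter class="..."/> element in schema.xml.
public class PlingStemFilterFactory extends BaseTokenFilterFactory {
    public TokenStream create(TokenStream input) {
        // Assumed: PlingStemFilter applies PlingStemmer.stem() to each token.
        return new PlingStemFilter(input);
    }
}

And a matching field type in schema.xml; apply the filter at both index and query time so "Laptop" and "Laptops" normalize to the same term:

<fieldType name="text_pling" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.example.PlingStemFilterFactory"/>
  </analyzer>
</fieldType>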
first steps in nlp
Hi, I'm indexing a set of documents which have a conversational writing style. In particular, the authors are very fond of listing facts in a variety of ways (this is to keep a human reader interested), but it's causing my index trouble. For example, instead of listing facts like "the house is white, the castle is pretty", we get "the house is the complete opposite of black" and "the castle is not ugly". What are the best approaches to resolving these sorts of issues? Even handling just "not" correctly would be a good start. cheers lee c
Re: Token Counter
As I understand it, a faceted search would be useful if keywords were a multivalued field whose value is just a token. I want to display the occurrences of the tokens which appear in an indexed (and stored) text field.
Re: Token Counter
Faceting will do this for you. Check out: http://wiki.apache.org/solr/SimpleFacetParameters#facet.field

This param allows you to specify a field which should be treated as a facet. It will iterate over each Term in the field and generate a facet count using that Term as the constraint. For a text field, it actually does go over each of the indexed tokens.

On Mon, Jan 10, 2011 at 10:11 AM, supersoft elarab...@gmail.com wrote: [...]
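An illustrative request against a text field (host, core, and field names are placeholders):

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=text&facet.limit=100

One caveat: each facet count is the number of documents containing the token, not the total number of occurrences, and faceting on a high-cardinality text field can be memory-hungry.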
Re: first steps in nlp
On Jan 10, 2011, at 12:42 PM, lee carroll wrote: Hi I'm indexing a set of documents which have a conversational writing style. In particular the authors are very fond of listing facts in a variety of ways (this is to keep a human reader interested) but its causing my index trouble. For example instead of listing facts like: the house is white, the castle is pretty. We get the house is the complete opposite of black and the castle is not ugly. What are the best approaches to resolve these sorts of issues. Even if its just handling not correctly would be a good start Hmm, good problem. I guess I'd start by stepping back and ask what is the problem you are trying to solve? You've stated, I think, one half of the problem, namely that your authors have a conversational style, but you haven't stated what your users are expecting to do with this information? Is this a pure search app? Is it something else that is just backed by Solr but the user would never do a search? Do you have a relevance problem? Also, what is your notion of handling not correctly? In other words, more details are welcome! -Grant -- Grant Ingersoll http://www.lucidimagination.com
Box occasionally pegs one cpu at 100%
I have a fairly classic master/slave set up. Response times on the slave are generally good with periodic blips, apparently when replication is happening. Occasionally, however, the process will have one incredibly slow query and will peg the CPU at 100%. The weird thing is that it will remain that way even if we stop querying it and stop replication and then wait for over 20 minutes. The only way to fix the problem at that point is to restart Tomcat. Looking at slow queries around the time of the incident, they don't look particularly bad - they're predominantly filter queries running under dismax and there doesn't seem to be anything unusual about them. The index file is about 266G and has 30G of disk free. The machine has 50G of RAM and is running with -Xmx35G. Looking at the processes running, it appears to be the main Java thread that's CPU bound, not the child threads. Stracing the process shows a lot of brk calls (presumably some sort of wait loop) with occasional blips of:
mprotect(0x7fc5721d9000, 4096, PROT_READ) = 0
futex(0x451c24a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x451c24a0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x4269dd14, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4269dd10, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fbc941603b4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 325, {1294683789, 614186000}, ) = 0
futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
mprotect(0x7fc5721d8000, 4096, PROT_READ) = 0
mprotect(0x7fc5721d8000, 4096, PROT_READ|PROT_WRITE) = 0
futex(0x7fbc94eeb5b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fbc94eeb5b0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x426a6a28, FUTEX_WAKE_PRIVATE, 1) = 1
mprotect(0x7fc5721d9000, 4096, PROT_NONE) = 0
futex(0x41cae8f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41cae8f0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x41cae328, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fbc941603b4, FUTEX_WAIT_PRIVATE, 327, NULL) = 0
futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
mmap(0x7fc2e023, 121962496, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fc2e023
mmap(0x7fbca58e, 237568, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fbca58e
Any ideas about what's happening and if there's any way to mitigate it? If the box at least recovered then I could run another slave and load balance between them, working on the principle that the second box would pick up the slack whilst the first box restabilised, but, as it is, that's not reliable. Thanks, Simon
Re: Box occasionally pegs one cpu at 100%
This sounds like it could be garbage collection related, especially with a heap that large. Depending on your jvm tuning, a FGC could take quite a while, effectively 'pausing' the JVM. Have you looked at something like jstat -gcutil or similar to monitor the garbage collection? On Jan 10, 2011, at 1:36 PM, Simon Wistow wrote: I have a fairly classic master/slave set up. Response times on the slave are generally good with blips periodically, apparently when replication is happening. Occasionally however the process will have one incredibly slow query and will peg the CPU at 100%. The weird thing is that it will remain that way even if we stop querying it and stop replication and then wait for over 20 minutes. The only way to fix the problem at that point is to restart tomcat. Looking at slow queries around the time of the incident they don't look particularly bad - they're predominantly filter queries running under dismax and there doesn't seem to be anything unusual about them. The index file is about 266G and has 30G of disk free. The machine has 50G of RAM and is running with -Xmx35G. Looking at the processes running it appears to be the main Java thread that's CPU bound, not the child threads. Stracing the process gives a lot of brk instructions (presumably some sort of wait loop) with occasional blips of: mprotect(0x7fc5721d9000, 4096, PROT_READ) = 0 futex(0x451c24a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x451c24a0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 futex(0x4269dd14, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4269dd10, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 futex(0x7fbc941603b4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 325, {1294683789, 614186000}, ) = 0 futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0 mprotect(0x7fc5721d8000, 4096, PROT_READ) = 0 mprotect(0x7fc5721d8000, 4096, PROT_READ|PROT_WRITE) = 0 futex(0x7fbc94eeb5b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fbc94eeb5b0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 futex(0x426a6a28, FUTEX_WAKE_PRIVATE, 1) = 1 mprotect(0x7fc5721d9000, 4096, PROT_NONE) = 0 futex(0x41cae8f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41cae8f0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 futex(0x41cae328, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x7fbc941603b4, FUTEX_WAIT_PRIVATE, 327, NULL) = 0 futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0 mmap(0x7fc2e023, 121962496, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fc2e023 mmap(0x7fbca58e, 237568, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fbca58e Any ideas about what's happening and if there's anyway to mitigate it? If the box at least recovered then I could run another slave and load balance between them working on the principle that the second box would pick up the slack whilst the first box restabilised but, as it is, that's not reliable. Thanks, Simon
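For reference, a minimal way to watch GC behavior on a running Solr (the pid is the JVM's process id):

    jstat -gcutil <pid> 5000

This prints heap-occupancy percentages and GC counters every five seconds; steadily climbing FGC/FGCT columns during a stall would point at full collections as the culprit.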
Re: How to let crawlers in, but prevent their damage?
: What I mean is that when you have publicly exposed search that bots crawl, they : issue all kinds of crazy queries that result in errors, that add noise to Solr : caches, increase Solr cache evictions, etc. etc. I dealt with this type of thing a few years back by having my front-end app execute queries against different Solr tiers based on the User-Agent: typical users to the main tier, known bots of partners to their own alt tier, known bots of public crawlers to a third alt tier. In some cases these alternate tiers had the same configs as my normal search tier, but by being distinct, the unusual and erratic query volume and number of unique queries didn't screw up the cache rates or the user stats generated by log parsing that I would use on my regular search tier. In other cases the tiers had slightly different configs, ie: the bots of my known partners ran twice a day at predictable times, didn't do any faceting, and used a very predictable set of filters -- so I did snappulling only twice a day, and force-warmed those filters. I advocate this kind of distinct search tier per user base even for human users -- assuming your volume is high enough and you have the budget for the hardware -- users who do similar queries on a certain subset of documents (with tons of faceting on a certain subset of fields) should all use the same set of query servers -- but if a different group of users tends to issue different types of queries (and facet on different fields) and you know this in advance -- you might as well have that second group of people query different boxes. It's essentially session affinity, except it's not about sessions -- it's about expected behavior based on what you know about the user. -Hoss
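As an illustration of that User-Agent routing, a sketch for an Apache httpd front end (the hostnames and bot list are hypothetical; requires mod_rewrite and mod_proxy):

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|yandex) [NC]
    RewriteRule ^/search(.*)$ http://solr-bots.internal:8983/solr/select$1 [P,L]

Everything not matched falls through to the main tier, so bot traffic never touches the caches that serve human users.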
Re: Box occasionally pegs one cpu at 100%
One other possibility is that the OS or BIOS is doing that, at least on a laptop. There is a new feature where, if the load is low enough, non-multithreaded applications can be assigned to one processor and that processor has its clock boosted so the older software will run faster on the new processors - otherwise they run SLOWER! My brother has a CAD program that runs slower on his new quad core because the base clock speed is slower than a single-processor CPU. The software company is not taking the time to rewrite their code, except where they add features or fixes. - Original Message From: Brian Burke bbu...@techtarget.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Mon, January 10, 2011 10:56:27 AM Subject: Re: Box occasionally pegs one cpu at 100% This sounds like it could be garbage collection related, especially with a heap that large. Depending on your jvm tuning, a FGC could take quite a while, effectively 'pausing' the JVM. Have you looked at something like jstat -gcutil or similar to monitor the garbage collection? On Jan 10, 2011, at 1:36 PM, Simon Wistow wrote: I have a fairly classic master/slave set up. Response times on the slave are generally good with blips periodically, apparently when replication is happening. Occasionally however the process will have one incredibly slow query and will peg the CPU at 100%. The weird thing is that it will remain that way even if we stop querying it and stop replication and then wait for over 20 minutes. The only way to fix the problem at that point is to restart tomcat. Looking at slow queries around the time of the incident they don't look particularly bad - they're predominantly filter queries running under dismax and there doesn't seem to be anything unusual about them. The index file is about 266G and has 30G of disk free. The machine has 50G of RAM and is running with -Xmx35G. Looking at the processes running it appears to be the main Java thread that's CPU bound, not the child threads. Stracing the process gives a lot of brk instructions (presumably some sort of wait loop) with occasional blips of: mprotect(0x7fc5721d9000, 4096, PROT_READ) = 0 futex(0x451c24a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x451c24a0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 futex(0x4269dd14, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4269dd10, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 futex(0x7fbc941603b4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 325, {1294683789, 614186000}, ) = 0 futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0 mprotect(0x7fc5721d8000, 4096, PROT_READ) = 0 mprotect(0x7fc5721d8000, 4096, PROT_READ|PROT_WRITE) = 0 futex(0x7fbc94eeb5b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fbc94eeb5b0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 futex(0x426a6a28, FUTEX_WAKE_PRIVATE, 1) = 1 mprotect(0x7fc5721d9000, 4096, PROT_NONE) = 0 futex(0x41cae8f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41cae8f0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 futex(0x41cae328, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x7fbc941603b4, FUTEX_WAIT_PRIVATE, 327, NULL) = 0 futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0 mmap(0x7fc2e023, 121962496, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fc2e023 mmap(0x7fbca58e, 237568, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fbca58e Any ideas about what's happening and if there's anyway to mitigate it? 
If the box at least recovered then I could run another slave and load balance between them working on the principle that the second box would pick up the slack whilst the first box restabilised but, as it is, that's not reliable. Thanks, Simon
Re: Improving Solr performance
I see from your other messages that these indexes all live on the same machine. You're almost certainly I/O bound, because you don't have enough memory for the OS to cache your index files. With 100GB of total index size, you'll get best results with between 64GB and 128GB of total RAM. Is that a general rule of thumb? That it is best to have about the same amount of RAM as the size of your index? So, with a 5GB index, I should have between 4GB and 8GB of RAM dedicated to Solr?
Re: Improving Solr performance
No, it also depends on the queries you execute (sorting is a big consumer) and the number of concurrent users. Is that a general rule of thumb? That it is best to have about the same amount of RAM as the size of your index? So, with a 5GB index, I should have between 4GB and 8GB of RAM dedicated to solr?
Re: Improving Solr performance
I see a lot of people using shards to hold different types of documents, and it almost always seems to be a bad solution. Shards are intended for distributing a large index over multiple hosts -- that's it. Not for some kind of federated search over multiple schemas, not for access control. Why not put everything in the same index, without shards, and just use an 'fq' limit in order to restrict a given search to the specific documents you'd like to search over? I think that would achieve your goal a lot more simply than shards -- then you use sharding only if and when your index grows to be so large that you'd like to distribute it over multiple hosts, and when you do so you choose a shard key that will have more or less equal distribution across shards. Using shards for access control or schema management just leads to headaches. [Apparently Solr could use some highlighted documentation on what shards are really for, as it seems to be a very common issue on this list: someone trying to use them for something else and then inevitably finding problems with that approach.] Jonathan On 1/7/2011 6:48 AM, supersoft wrote: The reason of this distribution is the kind of the documents. In spite of having the same schema structure (and solr conf), a document belongs to 1 of 5 different kinds. Each kind corresponds to a concrete shard and due to this, the implemented client tool avoids searching in all the shards when the users selects just one or a few of kinds. The tool runs a multisharded query of the proper shards. I guess this is a right approach but correct me if I am wrong. The real problem of this architecture is the correlation between concurrent users and response time: 1 query: n seconds 2 queries: 2*n second each query 3 queries: 3*n seconds each query and so... This is being a real headache because 1 single query has an acceptable response time but when many users are accessing to the server the performance goes hardly down.
Re: Tuning StatsComponent
I found StatsComponent to be slow only when I didn't have enough RAM allocated to the JVM. I'm not sure exactly what was causing it, but it was pathologically slow -- and then adding more RAM to the JVM made it incredibly fast. On 1/10/2011 4:58 AM, Gora Mohanty wrote: On Mon, Jan 10, 2011 at 2:28 PM, stockiist...@shopgate.com wrote: Hello. i`m using the StatsComponent to get the sum of amounts. but solr statscomponent is very slow on a huge index of 30 Million documents. how can i tune the statscomponent ? Not sure about this problem. the problem is, that i have 5 currencys and i need to send for each currency a new request. thats make the solr search sometimes very slow. =( [...] I guess that you mean the search from the front-end is slow. It is difficult to make a guess without details of your index, and of your queries, but one thing that immediately jumps out is that you could shard the Solr index by currency, and have your front-end direct queries for each currency to the appropriate Solr server. Please do share a description of what all you are indexing, how large your index is, and what kind of queries you are running. I take it that you have already taken a look at http://wiki.apache.org/solr/SolrPerformanceFactors Regards, Gora
Re: Including Small Amounts of New Data in Searches (MultiSearcher ?)
most of the Solr sites I know of have much larger indexes than ram and expect everything to work smoothly Hmm... In that case, throttling the merges would probably help most, though, yes, that's not available today. In lieu of that, I'd run large merges during off-peak hours, or better yet, use Solr's replication, eg, merge on the master where queries aren't hitting anything. Perhaps that'd throw off the NRT interval though. On Sun, Jan 9, 2011 at 8:55 PM, Lance Norskog goks...@gmail.com wrote: Ok. I was talking about what tools are available now- much better things are in the NRT work. I don't know how merges work now, in re multitasking and thread contention. Most of the Solr sites I know of have much larger indexes than ram and expect everything to work smoothly. Lance On Sun, Jan 9, 2011 at 9:18 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: The older MergePolicies followed a strategy which is quite disruptive in an NRT environment. Can you elaborate as to why (maybe we need to place this in a wiki)? If large merges are running in their own thread, they should not disrupt queries, eg, there won't be CPU contention. The IO contention can be disruptive, depending on the size and type of hardware, however in the ideal case of the index 'fitting' into RAM/IO cache, then a large merge should not affect queries (or indexing). I think what's useful that is being developed for not disrupting NRT with merges is DirectIOLinuxDirectory: https://issues.apache.org/jira/browse/LUCENE-2500 It's also useful for the non-NRT use case because anytime IO cache pages are evicted, queries will slow down (unless the index is too large to fit in RAM anyways). On Sat, Jan 8, 2011 at 7:55 PM, Lance Norskog goks...@gmail.com wrote: There are always slowdowns when merging new segments during indexing. A MergePolicy decides when to merge segments. The older MergePolicies followed a strategy which is quite disruptive in an NRT environment. There is a new feature in 3.x the trunk called 'BalancedSegmentMergePolicy'. This new MergePolicy is designed for the near-real-time use case. It was contributed by LinkedIn. You may find it works well enough for your case. Lance On Thu, Jan 6, 2011 at 10:21 AM, Stephen Boesch java...@gmail.com wrote: Thanks Yonik, Using a stable release of Solr what would you suggest to do - given MultiSearch's demise and the other work is still ongoing? 2011/1/6 Yonik Seeley yo...@lucidimagination.com On Thu, Jan 6, 2011 at 12:37 PM, Stephen Boesch java...@gmail.com wrote: Solr/lucene newbie here .. We would like searches against a solr/lucene index to immediately be able to view data that was added. I stress small amount of new data given that any significant amount would require excessive latency. There has been significant ongoing work in lucene-core for NRT (near real time). We need to overhaul Solr's DirectUpdateHandler2 to take advantage of all this work. Mark Miller took a first crack at it (sharing a single IndexWriter, letting lucene handle the concurrency issues, etc) but if there's a JIRA issue, I'm having trouble finding it. Looking around, i'm wondering if the direction would be a MultiSearcher living on top of our standard directory-based IndexReader as well as a custom Searchable that handles the newest documents - and then combines the two results? If you look at trunk, MultiSearcher has already gone away. -Yonik http://www.lucidimagination.com -- Lance Norskog goks...@gmail.com -- Lance Norskog goks...@gmail.com
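For anyone wanting to experiment with the BalancedSegmentMergePolicy Lance mentions, the merge policy is pluggable in solrconfig.xml. A sketch for 3.x/trunk, assuming the class (from Lucene's misc contrib) is on the classpath; note the exact element syntax differs between versions (older releases take the class name as element text rather than a class attribute):

    <mergePolicy class="org.apache.lucene.index.BalancedSegmentMergePolicy"/>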
Re: Tuning StatsComponent
StatsComponent, like many things, relies on FieldCache (and the related uninverted version in Solr for multivalued fields), which takes up memory and is related to the number of documents in the index. Strings in FieldCache can also be expensive. -Grant On Jan 10, 2011, at 4:10 PM, Jonathan Rochkind wrote: I found StatsComponent to be slow only when I didn't have enough RAM allocated to the JVM. I'm not sure exactly what was causing it, but it was pathologically slow -- and then adding more RAM to the JVM made it incredibly fast. On 1/10/2011 4:58 AM, Gora Mohanty wrote: On Mon, Jan 10, 2011 at 2:28 PM, stockiist...@shopgate.com wrote: Hello. i`m using the StatsComponent to get the sum of amounts. but solr statscomponent is very slow on a huge index of 30 Million documents. how can i tune the statscomponent ? Not sure about this problem. the problem is, that i have 5 currencys and i need to send for each currency a new request. thats make the solr search sometimes very slow. =( [...] I guess that you mean the search from the front-end is slow. It is difficult to make a guess without details of your index, and of your queries, but one thing that immediately jumps out is that you could shard the Solr index by currency, and have your front-end direct queries for each currency to the appropriate Solr server. Please do share a description of what all you are indexing, how large your index is, and what kind of queries you are running. I take it that you have already taken a look at http://wiki.apache.org/solr/SolrPerformanceFactors Regards, Gora -- Grant Ingersoll http://www.lucidimagination.com/
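One concrete option for the original per-currency problem: StatsComponent can break stats down by another field's values in a single request via stats.facet (field names here assumed), e.g.:

    http://localhost:8983/solr/select?q=*:*&rows=0&stats=true&stats.field=amount&stats.facet=currency

That returns the sum (plus min/max/count, etc.) of amount for each distinct currency value in one pass, instead of five separate queries - though it still builds on the same FieldCache structures, so the memory observations above apply.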
Re: Improving Solr performance
On Mon, 2011-01-10 at 21:43 +0100, Paul wrote: I see from your other messages that these indexes all live on the same machine. You're almost certainly I/O bound, because you don't have enough memory for the OS to cache your index files. With 100GB of total index size, you'll get best results with between 64GB and 128GB of total RAM. Is that a general rule of thumb? That it is best to have about the same amount of RAM as the size of your index? It does not seem like there is a clear current consensus on hardware to handle IO problems. I am firmly in the SSD camp, but as you can see from the current thread, other people recommend RAM and/or extra machines. I can say that our tests with RAM and spinning disks showed us that a lot of RAM certainly helps a lot, but also that it takes a considerable amount of time to warm the index before the performance is satisfactory. It might be helped with disk cache tricks, such as copying the whole index to /dev/null before opening it in Solr. So, with a 5GB index, I should have between 4GB and 8GB of RAM dedicated to solr? Not as -Xmx, but free for disk cache, yes. If you follow the RAM ~= index size recommendation.
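That disk cache trick can be as simple as (index path hypothetical):

    cat /var/solr/data/index/* > /dev/null

which reads every index file once so the OS page cache is hot before Solr opens a searcher; it only helps, of course, if there is enough free RAM to hold what was read.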
Re: Improving Solr performance
What I seem to see suggested here is to use different cores for the things you suggested: different types of documents, and Access Control Lists. I wonder how sharding would work in that scenario? Me, I plan on: For security: using a permissions field. For different schemas: dynamic fields, with enough premade fields to handle it. The one thing I don't think my approach does well with is statistics. Dennis Gearon - Original Message From: Jonathan Rochkind rochk...@jhu.edu To: solr-user@lucene.apache.org solr-user@lucene.apache.org Cc: supersoft elarab...@gmail.com Sent: Mon, January 10, 2011 1:08:00 PM Subject: Re: Improving Solr performance I see a lot of people using shards to hold different types of documents, and it almost always seems to be a bad solution. Shards are intended for distributing a large index over multiple hosts -- that's it. Not for some kind of federated search over multiple schemas, not for access control. Why not put everything in the same index, without shards, and just use an 'fq' limit in order to limit to the specific document you'd like to search over in a given search?I think that would achieve your goal a lot more simply than shards -- then you use sharding only if and when your index grows to be so large you'd like to distribute it over multiple hosts, and when you do so you choose a shard key that will have more or less equal distribution accross shards. Using shards for access control or schema management just leads to headaches. [Apparently Solr could use some highlighted documentation on what shards are really for, as it seems to be a very common issue on this list, someone trying to use them for something else and then inevitably finding problems with that approach.] Jonathan On 1/7/2011 6:48 AM, supersoft wrote: The reason of this distribution is the kind of the documents. In spite of having the same schema structure (and solr conf), a document belongs to 1 of 5 different kinds. Each kind corresponds to a concrete shard and due to this, the implemented client tool avoids searching in all the shards when the users selects just one or a few of kinds. The tool runs a multisharded query of the proper shards. I guess this is a right approach but correct me if I am wrong. The real problem of this architecture is the correlation between concurrent users and response time: 1 query: n seconds 2 queries: 2*n second each query 3 queries: 3*n seconds each query and so... This is being a real headache because 1 single query has an acceptable response time but when many users are accessing to the server the performance goes hardly down.
Re: first steps in nlp
Hi Grant, It's a search relevancy problem. For example: a document about London reads like London is not very good for a peaceful break. We analyse this at the (I can't remember the technical term) is it lexical level? (bloody hell, I think you may have written the book!) anyway, which produces tokens in our index of, say: London good peaceful holiday. Users search for cities which would be nice for them to take a holiday in. Say the search is good for a peaceful break and bang, London is top. Talk about a relevancy problem :-) Now I was thinking of using phrase matches in the synonyms file, but is that the best approach or could NLP help here? cheers lee On 10 January 2011 18:21, Grant Ingersoll gsing...@apache.org wrote: On Jan 10, 2011, at 12:42 PM, lee carroll wrote: Hi I'm indexing a set of documents which have a conversational writing style. In particular the authors are very fond of listing facts in a variety of ways (this is to keep a human reader interested) but its causing my index trouble. For example instead of listing facts like: the house is white, the castle is pretty. We get the house is the complete opposite of black and the castle is not ugly. What are the best approaches to resolve these sorts of issues. Even if its just handling not correctly would be a good start Hmm, good problem. I guess I'd start by stepping back and ask what is the problem you are trying to solve? You've stated, I think, one half of the problem, namely that your authors have a conversational style, but you haven't stated what your users are expecting to do with this information? Is this a pure search app? Is it something else that is just backed by Solr but the user would never do a search? Do you have a relevance problem? Also, what is your notion of handling not correctly? In other words, more details are welcome! -Grant -- Grant Ingersoll http://www.lucidimagination.com
Re: Improving Solr performance
Not sure if this was mentioned yet, but if you are doing slave/master replication you'll need 2x the RAM at replication time. Just something to keep in mind. -mike On Mon, Jan 10, 2011 at 5:01 PM, Toke Eskildsen t...@statsbiblioteket.dkwrote: On Mon, 2011-01-10 at 21:43 +0100, Paul wrote: I see from your other messages that these indexes all live on the same machine. You're almost certainly I/O bound, because you don't have enough memory for the OS to cache your index files. With 100GB of total index size, you'll get best results with between 64GB and 128GB of total RAM. Is that a general rule of thumb? That it is best to have about the same amount of RAM as the size of your index? I does not seems like there is a clear current consensus on hardware to handle IO problems. I am firmly in the SSD camp, but as you can see from the current thread, other people recommend RAM and/or extra machines. I can say that our tests with RAM and spinning disks showed us that a lot of RAM certainly helps a lot, but also that it takes a considerable amount of time to warm the index before the performance is satisfactory. It might be helped with disk cache tricks, such as copying the whole index to /dev/null before opening it in Solr. So, with a 5GB index, I should have between 4GB and 8GB of RAM dedicated to solr? Not as -Xmx, but free for disk cache, yes. If you follow the RAM ~= index size recommendation.
Re: Box occasionally pegs one cpu at 100%
On Mon, Jan 10, 2011 at 01:56:27PM -0500, Brian Burke said: This sounds like it could be garbage collection related, especially with a heap that large. Depending on your jvm tuning, a FGC could take quite a while, effectively 'pausing' the JVM. Have you looked at something like jstat -gcutil or similar to monitor the garbage collection? I think you may have hit the nail on the head. Having checked the configuration again I noticed that the -server flag didn't appear to be present in the options passed to Java (I'm convinced it used to be there). As I understand it, this would mean that the Parallel GC wouldn't be implicitly enabled. If that's true then that's a definite strong candidate for causing the root process and only the root process to peg a single CPU. Anybody have any experience of the differences between -XX:+UseParallelGC and -XX:+UseConcMarkSweepGC with -XX:+UseParNewGC ? I believe -XX:+UseParallelGC is the default with -server so I suppose that's a good place to start but I'd appreciate any anecdotes or experiences.
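For anyone comparing, a low-pause starting point might look like the following (heap size taken from the earlier post; start.jar assumes the Jetty example layout, so adjust for Tomcat):

    java -server -Xmx35g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log -jar start.jar

The CMS collector trades some throughput for shorter stop-the-world pauses, which is usually the right trade for a query-serving slave.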
Re: Improving Solr performance
On 1/10/2011 5:03 PM, Dennis Gearon wrote: What I seem to see suggested here is to use different cores for the things you suggested: different types of documents Access Control Lists I wonder how sharding would work in that scenario? Sharding has nothing to do with that scenario at all. Different cores are essentially _entirely separate_. While it can be convenient to use different cores like this, it means you don't get ANY searches that 'join' over multiple 'kinds' of data in different cores. Solr is not great at handling heterogeneous data like that. Putting it in separate cores is one solution, although then they are entirely separate. If that works, great. Another solution is putting them in the same index, but using mostly different fields, and perhaps having a 'type' field shared amongst all of your 'kinds' of data, and then always querying with an 'fq' for the right 'kind', as sketched below. Or if the fields they use are entirely different, you don't even need the fq, since a query on a certain field will only match a certain 'kind' of document. Solr is not great at handling complex queries over data with heterogeneous schemata. Solr wants you to flatten all your data into one single set of documents. Sharding is a way of splitting up a single index (multiple cores are _multiple indexes_) amongst several hosts for performance reasons, mostly when you have a very large index. That is it. The end. If you have multiple cores, that's the same as having multiple Solr indexes (which may or may not happen to be on the same machine). Any one or more of those cores could be sharded if you want. This is a separate issue.
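A sketch of that pattern, with hypothetical names: give every document a doctype field, then restrict searches with a filter query, e.g.

    q=history&fq=doctype:book
    q=history&fq=doctype:(book OR course)

Filter queries are cached independently of the main query, so repeating the same doctype filter across requests is cheap.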
Re: Improving Solr performance
And I don't think I've seen anyone suggest a separate core just for Access Control Lists. I'm not sure what that would get you. Perhaps a separate store that isn't Solr at all, in some cases. On 1/10/2011 5:36 PM, Jonathan Rochkind wrote: Access Control Lists
Re: Improving Solr performance
Any sources to cite for this statement? And are you talking about RAM allocated to the JVM or available for OS cache? Not sure if this was mentioned yet, but if you are doing slave/master replication you'll need 2x the RAM at replication time. Just something to keep in mind. -mike On Mon, Jan 10, 2011 at 5:01 PM, Toke Eskildsen t...@statsbiblioteket.dkwrote: On Mon, 2011-01-10 at 21:43 +0100, Paul wrote: I see from your other messages that these indexes all live on the same machine. You're almost certainly I/O bound, because you don't have enough memory for the OS to cache your index files. With 100GB of total index size, you'll get best results with between 64GB and 128GB of total RAM. Is that a general rule of thumb? That it is best to have about the same amount of RAM as the size of your index? I does not seems like there is a clear current consensus on hardware to handle IO problems. I am firmly in the SSD camp, but as you can see from the current thread, other people recommend RAM and/or extra machines. I can say that our tests with RAM and spinning disks showed us that a lot of RAM certainly helps a lot, but also that it takes a considerable amount of time to warm the index before the performance is satisfactory. It might be helped with disk cache tricks, such as copying the whole index to /dev/null before opening it in Solr. So, with a 5GB index, I should have between 4GB and 8GB of RAM dedicated to solr? Not as -Xmx, but free for disk cache, yes. If you follow the RAM ~= index size recommendation.
Re: Solr Spellcheker automatically tokenizes on period marks
I've noticed that the spellcheck component also seems to tokenize by itself on question marks, not only period marks. Based on the spellcheck definition above, does anyone know how to stop Solr from tokenizing strings on queries such as www.sometest.com (which causes suggestions of the form www.www.sometest.com.com) It gets really messy if the user then clicks the above suggestion, which causes a suggestion such as www.www.www.sometest.com.com.com to be given. Thanks in advance! Sebastian -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Spellcheker-automatically-tokenizes-on-period-marks-tp2131844p2231170.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Box occasionally pegs one cpu at 100%
This reminded me of a situation I ran into in the past where the JVM was being rendered useless because it was calling FGC repeatedly. Effectively what was going on is that a very large array was allocated which swamped the JVM memory and caused it to thrash, much like an OS. Here are some links which will help (at least they helped me): http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html (you need to read this one) http://java.sun.com/performance/reference/whitepapers/tuning.html (and this one). http://www.oracle.com/technetwork/java/javase/tech/index-jsp-136373.html http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html http://java.sun.com/performance/jvmstat/ http://blogs.sun.com/watt/resource/jvm-options-list.html jstat is also very good for seeing what is going on in the JVM. I also recall there was a way to trace GC in the JVM but can't recall how off the top of my head; maybe it was a JVM option. Hope this helps. Cheers François On Jan 10, 2011, at 5:13 PM, Simon Wistow wrote: On Mon, Jan 10, 2011 at 01:56:27PM -0500, Brian Burke said: This sounds like it could be garbage collection related, especially with a heap that large. Depending on your jvm tuning, a FGC could take quite a while, effectively 'pausing' the JVM. Have you looked at something like jstat -gcutil or similar to monitor the garbage collection? I think you may have hit the nail on the head. Having checked the configuration again I noticed that the -server flag didn't appear to be present in the options passed to Java (I'm convinced it used to be there). As I understand it, this would mean that the Parallel GC wouldn't be implicitly enabled. If that's true then that's a definite strong candidate for causing the root process and only the root process to peg a single CPU. Anybody have any experience of the differences between -XX:+UseParallelGC and -XX:+UseConcMarkSweepGC with -XX:+UseParNewGC ? I believe -XX:+UseParallelGC is the default with -server so I suppose that's a good place to start but I'd appreciate any anecdotes or experiences.
Re: Box occasionally pegs one cpu at 100%
On Mon, Jan 10, 2011 at 05:58:42PM -0500, François Schiettecatte said: http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html (you need to read this one) http://java.sun.com/performance/reference/whitepapers/tuning.html (and this one). Yeah, I have these two pages bookmarked :) jstat is also very good for seeing what is going on in the JVM. I also recall there was a way to trace GC in the JVM but cant recall how off the top of my head, maybe it was a JVM option. You can use -XX:+PrintGC and -XX:+PrintGCDetails (and -XX:+PrintGCTimeStamps) as well as -Xloggc:gc.log to log to a file. I'm also finding NewRelic's RPM system great for monitoring Solr - the integration is really good, I give it two thumbs up.
RE: Empty value/string matching
Anyone know why this would not be working in Solr? Just to recap, we are trying to exclude documents whose fields are missing values from the search results. I have tried the following and none of it seems to be working: 1. *:* -field:[* TO *] 2. -field:[* TO *] 3. field: The fields are either typed string or custom, and the query parser used is the LuceneQParser. The below suggested solutions of using some default values do not work for our use case. Thanks, Viswa From: bob.sandif...@sirsidynix.com To: solr-user@lucene.apache.org Date: Mon, 22 Nov 2010 08:35:22 -0700 Subject: RE: Empty value/string matching One possibility to consider - if you really need documents with specifically empty or non-defined values (if that's not an oxymoron :)), and you have control over the values you send into the indexing, you could set a special value that means 'no value'. We've done that in a similar vein, using something like '@@EMPTY@@' for a given field, meaning that the original document didn't actually have a value for that field. I.E. it is something very unlikely to be a 'real' value - and then we can easily select on documents by querying for the field:@@EMPTY@@ instead of the negated form of the select... However, we haven't considered things like what it does to index size. It's relatively rare for us (that there not be a value), so our 'gut feel' is that it's not impacting the indexes very much size-wise or performance-wise. Bob Sandiford | Lead Software Engineer | SirsiDynix P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com www.sirsidynix.com -Original Message- From: Viswa S [mailto:svis...@hotmail.com] Sent: Saturday, November 20, 2010 5:38 PM To: solr-user@lucene.apache.org Subject: RE: Empty value/string matching Erick, Thanks for the quick response. The output I showed is on a test instance I created to simulate this issue. I intentionally tried to create documents with no values by creating XML nodes like <field name="fieldName"></field>, but having values in the other fields in a document. Are you saying that there is no way to have a field with no value? With text fields that seems to make more sense than for string. You are right on the fieldName:[* TO *] results, which basically returned all the documents, including the couple of documents in question. -Viswa Date: Sat, 20 Nov 2010 17:20:53 -0500 Subject: Re: Empty value/string matching From: erickerick...@gmail.com To: solr-user@lucene.apache.org I don't think that's correct. The documents wouldn't be showing up in the facets if they had no value for the field. So I think you're being mislead by the printout from the faceting. Perhaps you have unprintable characters in there or some such. Certainly the name:" " is actually a value, admittedly just a space. As for the other, I suspect something similar. What results do you get back when you just search for FieldName:[* TO *]? I'm betting you get all the docs back, but I've been very wrong before. Best Erick On Sat, Nov 20, 2010 at 5:02 PM, Viswa S svis...@hotmail.com wrote: Yes I do have a couple of documents with no values and one with an empty string. Find below the output of a facet on the fieldName. Thanks, Viswa <int name="">2</int><int name="CASTIGO.430">2</int><int name="GDOGPRODY.424">2</int><int name="QMAGIC.412">2</int><int name=" ">1</int> Date: Sat, 20 Nov 2010 15:29:06 -0500 Subject: Re: Empty value/string matching From: erickerick...@gmail.com To: solr-user@lucene.apache.org Are you absolutely sure your documents really don't have any values for FieldName? 
Because your results are perfectly correct if every doc has a value for FieldName. Or are you saying there is no such field as FieldName? Best Erick On Sat, Nov 20, 2010 at 3:12 PM, Viswa S svis...@hotmail.com wrote: Folks, Am trying to query documents which have no values present. I have used the following constructs and they don't seem to work on the Solr dev tip (as of 09/22) or the 1.4 builds. 1. (*:* AND -FieldName:[* TO *]) - returns no documents, parsedquery was +MatchAllDocsQuery(*:*) -FieldName:[* TO *] 2. -FieldName:[* TO *] - returns no documents, parsedquery was -FieldName:[* TO *] 3. FieldName: - returns no documents, parsedquery was empty (<str name="parsedquery"/>) The field is type string, using the LuceneQParser. I have also tried with FieldName:[* TO *] to see if the documents with no terms are ignored, and that didn't seem to be the case; the result set was everything. Any help would be appreciated. -Viswa
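A minimal sketch of the sentinel-value workaround Bob describes above (field name hypothetical): at index time, send the marker whenever the source document has no value,

    <field name="link">@@EMPTY@@</field>

and then select the empty ones with q=link:@@EMPTY@@, or exclude them with q=*:* -link:@@EMPTY@@. That turns the problem into a plain positive term query and sidesteps the pure-negative range query entirely.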
Solr highlighting is botching output
Hi all, I'm implementing Solr for a course and book search service for college students, and I'm running into some issues with the highlighting plugin. After a few minutes of tinkering, searching on Google, searching the group archives and not finding anything, I thought I would see if anyone else is having this problem and if not what I am doing to cause it. Basically, the issue is that whenever I turn on highlighting for a certain field, I get either (1) inconsistent highlights or (2) bizarre highlight output for some of the results. A few of the results look correct. Here's my solrconfig.xml: http://pastie.org/private/iz3fd77innxb5r2v63zpa Broken output: http://pastie.org/private/pyptpektckitp2piqvcgw As you can see, I searched for history. In the results, a few times that the query is highlighted, you'll see that the name fields contain strings such as <span>History</span><span>History</span><span>History</span>, instead of just highlighting it once. I don't have the knowledge to understand why Solr would treat African American History: From Emancipation to the Present differently than African American Women's History, other than one is longer than the other, or why it would double or quadruple the highlighted response. I tried to figure out what configuration option could change this, to no avail. If anyone has any input, I would be very grateful. Thank you! Dan
Post size limit to Solr?
Is there a max POST size limit when sending documents over to Solr's update handler to be indexed? Right now I've self-imposed a limit of sending a max of 50 docs per request to Solr in my PHP code, and that seems to work fine. I was just curious as to whether there is a limit somewhere at which Solr will complain? Thanks Stephen
Re: Post size limit to Solr?
Is there a max POST size limit when sending documents over to Solrs update handler to be indexed? Right now I've self imposed a limit of sending a max of 50 docs per request to solr in my PHP code..and that seems to work fine. I was just curious as to if there was a limit somewhere at which Solr will complain? I think this is related to the servlet container. The default maxPostSize for Tomcat is 2 megabytes. http://tomcat.apache.org/tomcat-5.5-doc/config/http.html
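If you need to raise it, maxPostSize is set on the HTTP Connector in Tomcat's server.xml; for example, a hypothetical 10 MB limit (other Connector attributes elided; the value is in bytes):

    <Connector port="8080" maxPostSize="10485760" ... />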
Re: Solr highlighting is botching output
I'm implementing Solr for a course and book search service for college students, and I'm running into some issues with the highlighting plugin. After a few minutes of tinkering, searching on Google, searching the group archives and not finding anything, I thought I would see if anyone else is having this problem and if not what I am doing to cause it. Basically, the issue is that whenever I turn on highlighting for a certain field, I get either (1) inconsistent highlights or (2) bizarre highlight output for some of the results. A few of the results look correct. Here's my solrconfig.xml: http://pastie.org/private/iz3fd77innxb5r2v63zpa Broken output: http://pastie.org/private/pyptpektckitp2piqvcgw As you can see, I searched for history. In the results, a few times that the query is highlighted, you'll see that the name fields contain strings such as <span>History</span><span>History</span><span>History</span>, instead of just highlighting it once. I don't have the knowledge to understand why Solr would treat African American History: From Emancipation to the Present differently than African American Women's History, other than one is longer than the other, or why it would double or quadruple the highlighted response. I tried to figure out what configuration option could change this, to no avail. If anyone has any input, I would be very grateful. Thank you! That's really strange. Can you provide us the field type definition of the text field, the full search URL that caused that output, and the Solr version?
Re: Solr highlighting is botching output
That's really strange. Can you provide us the field type definition of the text field, the full search URL that caused that output, and the Solr version? Also, did you enable term vectors on the text field?
Re: Post size limit to Solr?
Thanks! On Mon, Jan 10, 2011 at 9:27 PM, Ahmet Arslan iori...@yahoo.com wrote: Is there a max POST size limit when sending documents over to Solrs update handler to be indexed? Right now I've self imposed a limit of sending a max of 50 docs per request to solr in my PHP code..and that seems to work fine. I was just curious as to if there was a limit somewhere at which Solr will complain? I think this is related to servlet container. Default maxPostSize for tamcat is 2 megabytes. http://tomcat.apache.org/tomcat-5.5-doc/config/http.html
Synonyms at index time
Hi, I'm not sure if this question is better posted in Solr - User or Solr - Dev, but I'll start here. I'm trying to find documentation that describes in detail how synonym expansion is handled at index time. http://www.lucidimagination.com/blog/2009/03/18/exploring-lucenes-indexing-code-part-2/ This article explains what the index looks like for three example documents. However, I'm looking for documentation about what the index (the inverted index) looks like when synonyms are thrown into the mix. Thanks in advance for your help. -Mark -- View this message in context: http://lucene.472066.n3.nabble.com/Synonyms-at-index-time-tp2232470p2232470.html Sent from the Solr - User mailing list archive at Nabble.com.
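For what it's worth, a small worked example of the usual behavior: with expand=true and a synonym rule like tv, television, index-time analysis of the text tv show emits the injected synonym at the same position as the original token (position increment 0):

    position:  1              2
    tokens:    tv             show
               television

So the inverted index ends up with postings for both tv and television pointing at position 1 of the document, and a query for either term - or even the phrase television show - will match.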
Re: Input raw log file
Can you give an example, like something that is currently being used? I'm an engineering student and my project is to index all the real-time log files from different devices, use some artificial intelligence, and produce useful data out of it. I'm doing this for my college. I've been struggling for more than a month even to get started. -- View this message in context: http://lucene.472066.n3.nabble.com/Input-raw-log-file-tp2210043p2232604.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr highlighting is botching output
On Mon, Jan 10, 2011 at 6:48 PM, Ahmet Arslan iori...@yahoo.com wrote: Thats really strange. Can you provide us field type definition of text field. And full search URL that caused that output. And the solr version. Sure. Full search URL: /solr/select?indent=on&version=2.2&q=history&fq=&start=0&rows=10&fl=*,score&qt=standard&wt=standard&explainOther=&hl=on&hl.fl= Here's the type definition:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="1" maxGramSize="114"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Re: Solr highlighting is botching output
On Mon, Jan 10, 2011 at 6:51 PM, Ahmet Arslan iori...@yahoo.com wrote: Thats really strange. Can you provide us field type definition of text field. And full search URL that caused that output. And the solr version. Also, did you enable term vectors on text field? Not sure what those are, so I'm guessing no :)
icq or other 'instant gratification' communication forums for Solr
Are there any chatrooms or ICQ rooms for asking questions late at night, to people who stay up or are on the other side of the planet? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
How to insert this using Solr PHP?
I am switching between building the query to a Solr instance by hand and doing it with the PHP Solr extension. I have this query that my dev partner said to insert before all the other column searches. What kind of query is it, and how do I get it into the query in an 'OOP' style using the PHP Solr extension? In particular, I'm interested in the 'q={!.}' part of the query. Is that a filter query? How do I put it into the query . . . I already asked that ;-) URL_BASE?wt=json&indent=true&start=0&rows=20&q={!spatial lat=xx.x long=xxx.x radius=10 unit=km threadCount=3} OTHER COLUMNS, blah blah bcc: my partner Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: Solr highlighting is botching output
Not sure about your Solr version, but it can probably be https://issues.apache.org/jira/browse/LUCENE-2266 Is there a special reason for using EdgeNGramTokenizerFactory? Replacing this tokenizer with WhitespaceTokenizer should solve this. Or upgrade your Solr version. And I don't see span either in your search URL or solrconfig.xml - how is span popping up in the response? Sure. Full search URL: /solr/select?indent=on&version=2.2&q=history&fq=&start=0&rows=10&fl=*,score&qt=standard&wt=standard&explainOther=&hl=on&hl.fl= Here's the type definition:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="1" maxGramSize="114"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Re: Solr highlighting is botching output
On Mon, Jan 10, 2011 at 9:19 PM, Ahmet Arslan iori...@yahoo.com wrote: Not sure about your solr version but probably it can be https://issues.apache.org/jira/browse/LUCENE-2266 Is there a special reason for using EdgeNGramTokenizerFactory? Replacing this tokenizer with WhiteSpaceTokenizer should solve this. I'm trying to implement autocomplete, so I need to be able to search within words. Maybe I was using it incorrectly, but the WhitespaceTokenizer would only index whole words. econ needs to match economics, econometrics, etc. Or upgrade solr version. Oops, forgot to mention the version. I'm running Solr 1.4.1. And I don't see span either in your search URL or solrconfig.xml, how span is popping up in the response? My mistake. I was playing around with the pre/post parameters. Everything else is the same.
Re: Solr highlighting is botching output
Replacing the EdgeNGramTokenizerFactory with the
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="114"/>
combination should solve your problem, while preserving your search within words. Searching histo will return: African American <em>Histo</em>ry --- On Tue, 1/11/11, Dan Loewenherz dloewenh...@gmail.com wrote: From: Dan Loewenherz dloewenh...@gmail.com Subject: Re: Solr highlighting is botching output To: solr-user@lucene.apache.org Date: Tuesday, January 11, 2011, 7:30 AM On Mon, Jan 10, 2011 at 9:19 PM, Ahmet Arslan iori...@yahoo.com wrote: Not sure about your solr version but probably it can be https://issues.apache.org/jira/browse/LUCENE-2266 Is there a special reason for using EdgeNGramTokenizerFactory? Replacing this tokenizer with WhiteSpaceTokenizer should solve this. I'm trying to implement autocomplete, so I need to be able to search within words. Maybe I was using it incorrectly, but the WhiteSpaceTokenizer would only index on whole words. econ needs to match economics, econometrics, etc. Or upgrade solr version. Oops, forgot to mention the version. I'm running Solr 1.4.1. And I don't see span either in your search URL or solrconfig.xml, how span is popping up in the response? My mistake. I was playing around with the pre/post parameters. Everything else is the same.
Re: Solr highlighting is botching output
Awesome, thank you so much! That did the trick. On Mon, Jan 10, 2011 at 10:02 PM, Ahmet Arslan iori...@yahoo.com wrote: Replacing with EdgeNGramTokenizerFactory with <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="114"/> combination should solve your problem. Preserving your search within words. Searching histo will return : African American <em>Histo</em>ry --- On Tue, 1/11/11, Dan Loewenherz dloewenh...@gmail.com wrote: From: Dan Loewenherz dloewenh...@gmail.com Subject: Re: Solr highlighting is botching output To: solr-user@lucene.apache.org Date: Tuesday, January 11, 2011, 7:30 AM On Mon, Jan 10, 2011 at 9:19 PM, Ahmet Arslan iori...@yahoo.com wrote: Not sure about your solr version but probably it can be https://issues.apache.org/jira/browse/LUCENE-2266 Is there a special reason for using EdgeNGramTokenizerFactory? Replacing this tokenizer with WhiteSpaceTokenizer should solve this. I'm trying to implement autocomplete, so I need to be able to search within words. Maybe I was using it incorrectly, but the WhiteSpaceTokenizer would only index on whole words. econ needs to match economics, econometrics, etc. Or upgrade solr version. Oops, forgot to mention the version. I'm running Solr 1.4.1. And I don't see span either in your search URL or solrconfig.xml, how span is popping up in the response? My mistake. I was playing around with the pre/post parameters. Everything else is the same.
Re: How to insert this using Solr PHP?
I'm interested in what is the part in the query 'q={!.}. Is that a filter query? It is in local params syntax. http://wiki.apache.org/solr/LocalParams
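For the PHP side, a minimal sketch with the PECL Solr extension (the host, core path, spatial parameters, and trailing field query are placeholders; note that {!spatial} itself comes from a third-party plugin, not stock Solr):

    $client = new SolrClient(array('hostname' => 'localhost', 'port' => 8983, 'path' => '/solr'));
    $query = new SolrQuery();
    // the whole LocalParams-prefixed string goes in as the main query
    $query->setQuery('{!spatial lat=37.77 long=-122.42 radius=10 unit=km threadCount=3}type:restaurant');
    $query->setStart(0)->setRows(20);
    $response = $client->query($query);
    print_r($response->getResponse());

The SolrQuery setters return the query object, so chaining works; if in doubt, call them separately.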