java.lang.NullPointerException with stats component and shards
Hi,

I have a problem with the Stats component in a sharded environment. Solr throws java.lang.NullPointerException when there are no results and statistics are computed over a date field.

<arr name="stats.field">
  <str>price</str>
  <str>ddate</str>
</arr>
<str name="stats">true</str>
<str name="q">*:*</str>
<arr name="fq">
  <str>date:[2013-03-23T00:00:00Z TO *]</str>
  <str>price:[5000 TO *]</str>
</arr>
<str name="rows">10</str>
</lst>
</lst>
<lst name="error">
<str name="trace">java.lang.NullPointerException
    at org.apache.solr.handler.component.DateStatsValues.updateTypeSpecificStats(StatsValuesFactory.java:340)
    at org.apache.solr.handler.component.AbstractStatsValues.accumulate(StatsValuesFactory.java:106)
    at org.apache.solr.handler.component.StatsComponent.handleResponses(StatsComponent.java:112)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:311)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:365)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:926)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:988)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:635)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:662)
</str>
<int name="code">500</int>
</lst>
</response>

The same query without shards behaves normally:

<arr name="stats.field">
  <str>price</str>
  <str>ddate</str>
</arr>
<str name="stats">true</str>
<str name="q">*:*</str>
<arr name="fq">
  <str>ddate:[2013-03-23T00:00:00Z TO *]</str>
  <str>price:[5000 TO *]</str>
</arr>
<str name="rows">10</str>
</lst>
</lst>
<lst name="grouped">
<lst name="id">
<int name="matches">0</int>
<int name="ngroups">0</int>
<arr name="groups"/>
</lst>
</lst>

I've tested it on Solr 4.0 and then on Solr 4.2, and the problem still exists.

Regards
Agnieszka Kukałowicz
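The trace points at the step where per-shard statistics are merged (StatsComponent.handleResponses calling accumulate). One possible reading, which is an assumption and is not confirmed by the trace itself, is that a shard with zero matches reports no min/max for the date field and the merge step dereferences the missing value. The Python sketch below only illustrates that kind of null-safe merge; `merge_date_stats` and its tuple layout are hypothetical and are not Solr's API or its actual fix.

```python
from datetime import datetime, timezone


def merge_date_stats(shard_stats):
    """Merge (min, max, count) tuples reported by several shards.

    Hypothetical helper: a shard with no matching documents is assumed
    to report (None, None, 0) and is skipped instead of dereferenced.
    """
    merged_min, merged_max, total = None, None, 0
    for shard_min, shard_max, count in shard_stats:
        if shard_min is None or shard_max is None:
            continue  # empty shard: nothing to accumulate
        merged_min = shard_min if merged_min is None else min(merged_min, shard_min)
        merged_max = shard_max if merged_max is None else max(merged_max, shard_max)
        total += count
    return merged_min, merged_max, total


early = datetime(2013, 3, 23, tzinfo=timezone.utc)
late = datetime(2013, 3, 25, tzinfo=timezone.utc)

# One shard has no results, the other has three documents.
some_empty = merge_date_stats([(None, None, 0), (early, late, 3)])
# No shard has results, as in the failing query above.
all_empty = merge_date_stats([(None, None, 0), (None, None, 0)])
```

With every shard empty, the merge simply returns no min/max and a count of zero, rather than failing.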
Grouping performance problem
Hi,

Is there any way to make grouping searches more efficient? My queries look like:

/select?q=query&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1

For an index with 3 million documents, a query for all docs with group=true takes almost 4000 ms. Because queryResultCache is not used, subsequent queries also take a long time.

When I remove group=true and leave only faceting, the query for all docs takes much less time: ~700 ms the first time and only 200 ms on subsequent runs, because queryResultCache is used.

So with group=true the query is about 20 times slower than without it. Is there any way to improve performance with grouping?

My application needs the grouping feature and all of the queries use it, but their performance is too low for production use.

I use Solr 4.x from trunk.

Agnieszka Kukalowicz
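For reference, the query above can be assembled parameter by parameter, which makes it easy to toggle individual grouping options while timing. This is a minimal sketch: `grouping_query` and its `with_ngroups` flag are hypothetical names used only for illustration, and the timings are the ones quoted in the message.

```python
from urllib.parse import urlencode


def grouping_query(q, with_ngroups=True):
    """Build the /select query string used in the message.

    with_ngroups is a hypothetical switch for leaving out
    group.ngroups=true when comparing response times.
    """
    params = [
        ("q", q),
        ("group", "true"),
        ("group.field", "id"),
        ("group.facet", "true"),
    ]
    if with_ngroups:
        params.append(("group.ngroups", "true"))
    params += [
        ("facet.field", "category1"),
        ("facet.missing", "false"),
        ("facet.mincount", "1"),
    ]
    return "/select?" + urlencode(params)


url = grouping_query("query")
url_without_ngroups = grouping_query("query", with_ngroups=False)

# Timings from the message: ~4000 ms with grouping, 200 ms cached without it.
slowdown = 4000 / 200
```

The ratio of the two quoted timings is where the "about 20 times slower" figure comes from.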
Re: Grouping performance problem
Hi Pavel,

I tried with group.ngroups=false but didn't notice a big improvement. The times were still about 4000 ms, so it doesn't solve my problem. Maybe this is because of the shape of my index: I have millions of documents but only about 20 000 groups.

Cheers
Agnieszka

2012/7/16 Pavel Goncharik pavel.goncha...@gmail.com

Hi Agnieszka,

if you don't need the number of groups, you can try leaving out the group.ngroups=true param. In this case Solr apparently skips calculating all groups and delivers results much faster. At least for our application the difference in performance with/without group.ngroups=true is significant (I have to say, we use Solr 3.6).

WBR, Pavel

On Mon, Jul 16, 2012 at 1:00 PM, Agnieszka Kukałowicz agnieszka.kukalow...@usable.pl wrote:

Hi,

Is there any way to make grouping searches more efficient? My queries look like:

/select?q=query&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1

For an index with 3 million documents, a query for all docs with group=true takes almost 4000 ms. Because queryResultCache is not used, subsequent queries also take a long time.

When I remove group=true and leave only faceting, the query for all docs takes much less time: ~700 ms the first time and only 200 ms on subsequent runs, because queryResultCache is used.

So with group=true the query is about 20 times slower than without it. Is there any way to improve performance with grouping?

My application needs the grouping feature and all of the queries use it, but their performance is too low for production use.

I use Solr 4.x from trunk.

Agnieszka Kukalowicz
Re: Grouping performance problem
I have a server with 24 GB RAM. It hosts 4 shards, each with 4 GB RAM for Java:

JAVA_OPTIONS=-server -Xms4096M -Xmx4096M

The index size is about 15 GB per shard (I use an SSD disk for the index data).

Agnieszka

2012/7/16 alx...@aim.com

What are the RAM of your server and the size of the data folder?

-----Original Message-----
From: Agnieszka Kukałowicz agnieszka.kukalow...@usable.pl
To: solr-user solr-user@lucene.apache.org
Sent: Mon, Jul 16, 2012 6:16 am
Subject: Re: Grouping performance problem

Hi Pavel,

I tried with group.ngroups=false but didn't notice a big improvement. The times were still about 4000 ms, so it doesn't solve my problem. Maybe this is because of the shape of my index: I have millions of documents but only about 20 000 groups.

Cheers
Agnieszka

2012/7/16 Pavel Goncharik pavel.goncha...@gmail.com

Hi Agnieszka,

if you don't need the number of groups, you can try leaving out the group.ngroups=true param. In this case Solr apparently skips calculating all groups and delivers results much faster. At least for our application the difference in performance with/without group.ngroups=true is significant (I have to say, we use Solr 3.6).

WBR, Pavel

On Mon, Jul 16, 2012 at 1:00 PM, Agnieszka Kukałowicz agnieszka.kukalow...@usable.pl wrote:

Hi,

Is there any way to make grouping searches more efficient? My queries look like:

/select?q=query&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1

For an index with 3 million documents, a query for all docs with group=true takes almost 4000 ms. Because queryResultCache is not used, subsequent queries also take a long time.

When I remove group=true and leave only faceting, the query for all docs takes much less time: ~700 ms the first time and only 200 ms on subsequent runs, because queryResultCache is used.

So with group=true the query is about 20 times slower than without it. Is there any way to improve performance with grouping?

My application needs the grouping feature and all of the queries use it, but their performance is too low for production use.

I use Solr 4.x from trunk.

Agnieszka Kukalowicz
Re: Groups count in distributed grouping is wrong in some case
Hi, I'm using SOLR 4.x from trunk. This was the version from 2012-07-10. So this is one of the latest versions. I searched mailing list and jira but found only this https://issues.apache.org/jira/browse/SOLR-3436 It was committed in May to trunk so my version of SOLR has this fix. But the problem still exists. Cheers Agnieszka 2012/7/15 Erick Erickson erickerick...@gmail.com what version of Solr are you using? There's been quite a bit of work on this lately, I'm not even sure how much has made it into 3.6. You might try searching the JIRA list, Martijn van Groningen has done a bunch of work lately, look for his name. Fortunately, it's not likely to get a bunch of false hits G.. Best Erick On Fri, Jul 13, 2012 at 7:50 AM, Agnieszka Kukałowicz agnieszka.kukalow...@usable.pl wrote: Hi, I have problem with faceting count in distributed grouping. It appears only when I make query that returns almost all of the documents. My SOLR implementation has 4 shards and my queries looks like: http://host:port /select/q?=*:*shards=shard1,shard2,shard3,shard4group=truegroup.field=idgroup.facet=truegroup.ngroups=truefacet.field=category1facet.missing=falsefacet.mincount=1 With query like above I get strange counts for field category1. The counts for values are very big: int name=val19659/int int name=val27015/int int name=val35676/int int name=val41180/int int name=val51105/int int name=val6979/int int name=val7770/int int name=val8701/int int name=612/int int name=val9422/int int name=val10358/int When I make query to narrow the results adding to query fq=category1:val1, etc. I get different counts than facet category1 shows for a few first values: fq=category1:val1 - counts: 22 fq=category1:val2 - counts: 22 fq=category1:val3 - counts: 21 fq=category1:val4 - counts: 19 fq=category1:val5 - counts: 19 fq=category1:val6 - counts: 20 fq=category1:val7 - counts: 20 fq=category1:val8 - counts: 25 fq=category1:val9 - counts: 422 fq=category1:val10 - counts: 358 From val9 the count is ok. 
First I thought that for some values in facet category1 groups count does not work and it returns counts of all documents not group by field id. But the number of all documents matches query fq=category1:val1 is 45468. So the numbers are not the same. I check the queries on each shard for val1 and the results are: shard1: query: http://shard1/select/?q=*:*group=truegroup.field=idgroup.facet=truegroup.ngroups=truefacet.field=category1facet.missing=falsefacet.mincount=1 lst name=fcategory int name=val111/int query: http://shard1/select/?q=*:*group=truegroup.field=idgroup.facet=truegroup.ngroups=truefacet.field=category1facet.missing=falsefacet.mincount=1fq=category1 :val1 shard 2: query: http://shard2/select/?q=*:*group=truegroup.field=idgroup.facet=truegroup.ngroups=truefacet.field=category1facet.missing=falsefacet.mincount=1 there is no value val1 in category1 facet. query: http://shard2/select/?q=*:*group=truegroup.field=idgroup.facet=truegroup.ngroups=truefacet.field=category1facet.missing=falsefacet.mincount=1fq=category1 :val1 int name=ngroups7/int shard3: query: http://shard3/select/?q=*:*group=truegroup.field=idgroup.facet=truegroup.ngroups=truefacet.field=category1facet.missing=falsefacet.mincount=1 there is no value val1 in category1 facet query: http://shard3/select/?q=*:*group=truegroup.field=idgroup.facet=truegroup.ngroups=truefacet.field=category1facet.missing=falsefacet.mincount=1fq=category1 :val1 int name=ngroups4/int So it looks that detail query with fq=category1:val1 returns the relevant results. But Solr has problem with faceting counts when one of the shard does not return the faceting value (in this scenario val1) that exists on other shards. I checked shards for val10 and I got: shard1: count for val10 - 142 shard2: count for val10 - 131 shard3: count for val10 - 149 sum of counts 422 - ok. I'm not sure how to resolve that situation. 
For sure the counts of val1 to val9 should be different and they should not be on the top of the category1 facet because this is very confusing. Do you have any idea how to fix this problem? Best regards Agnieszka
Groups count in distributed grouping is wrong in some case
Hi,

I have a problem with facet counts in distributed grouping. It appears only when I run a query that returns almost all of the documents.

My Solr setup has 4 shards and my queries look like:

http://host:port/select/?q=*:*&shards=shard1,shard2,shard3,shard4&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1

With a query like the one above I get strange counts for the field category1. The counts for the values are very big:

<int name="val1">9659</int>
<int name="val2">7015</int>
<int name="val3">5676</int>
<int name="val4">1180</int>
<int name="val5">1105</int>
<int name="val6">979</int>
<int name="val7">770</int>
<int name="val8">701</int>
<int name="">612</int>
<int name="val9">422</int>
<int name="val10">358</int>

When I narrow the results by adding fq=category1:val1, etc. to the query, I get counts that differ from what the category1 facet shows for the first few values:

fq=category1:val1 - counts: 22
fq=category1:val2 - counts: 22
fq=category1:val3 - counts: 21
fq=category1:val4 - counts: 19
fq=category1:val5 - counts: 19
fq=category1:val6 - counts: 20
fq=category1:val7 - counts: 20
fq=category1:val8 - counts: 25
fq=category1:val9 - counts: 422
fq=category1:val10 - counts: 358

From val9 onwards the count is OK.

At first I thought that for some values in the category1 facet the group count does not work and it returns the count of all documents, not grouped by the field id. But the number of all documents matching fq=category1:val1 is 45468, so the numbers are not the same.

I checked the queries on each shard for val1 and the results are:

shard1:
query: http://shard1/select/?q=*:*&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1
<lst name="fcategory">
<int name="val1">11</int>
query: http://shard1/select/?q=*:*&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1&fq=category1:val1

shard2:
query: http://shard2/select/?q=*:*&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1
there is no value val1 in the category1 facet.
query: http://shard2/select/?q=*:*&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1&fq=category1:val1
<int name="ngroups">7</int>

shard3:
query: http://shard3/select/?q=*:*&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1
there is no value val1 in the category1 facet.
query: http://shard3/select/?q=*:*&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1&fq=category1:val1
<int name="ngroups">4</int>

So it looks like the detailed query with fq=category1:val1 returns the correct results. But Solr has a problem with facet counts when one of the shards does not return a facet value (in this scenario val1) that exists on other shards.

I checked the shards for val10 and I got:

shard1: count for val10 - 142
shard2: count for val10 - 131
shard3: count for val10 - 149
sum of counts 422 - ok.

I'm not sure how to resolve this situation. The counts for val1 to val9 should certainly be different, and these values should not be at the top of the category1 facet, because this is very confusing.

Do you have any idea how to fix this problem?

Best regards
Agnieszka
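The cross-check described above amounts to summing the per-shard counts for one facet value and comparing the total with what the distributed query reports. A minimal sketch using only the figures quoted in the message; `merged_facet_count` is a hypothetical helper, and plain summation is only valid under the assumption that each group (each id) lives on exactly one shard.

```python
def merged_facet_count(per_shard_counts):
    """Sum the counts each shard reports for one facet value.

    Hypothetical helper: a shard that does not report the value
    (None) contributes 0. Assumes no group spans several shards.
    """
    return sum(count for count in per_shard_counts.values() if count is not None)


# val1: facet count on shard1, ngroups from the fq queries on shard2 and shard3.
val1_counts = {"shard1": 11, "shard2": 7, "shard3": 4}
val1_total = merged_facet_count(val1_counts)

# Per-shard counts from the final check in the message.
last_check_counts = {"shard1": 142, "shard2": 131, "shard3": 149}
last_check_total = merged_facet_count(last_check_counts)
```

The val1 total agrees with the count of 22 returned for fq=category1:val1, and the last check reproduces the sum of 422 given in the message.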
NPE with 500 error
Hi,

I recently got an NPE with a 500 status for my search:

SEVERE: java.lang.NullPointerException
    at org.apache.lucene.index.DocTermOrds$TermOrdsIterator.reset(DocTermOrds.java:623)
    at org.apache.lucene.index.DocTermOrds.lookup(DocTermOrds.java:649)
    at org.apache.lucene.search.grouping.term.TermGroupFacetCollector$MV.collect(TermGroupFacetCollector.java:191)
    at org.apache.lucene.search.Scorer.score(Scorer.java:60)
    at org.apache.lucene.search.ConstantScoreQuery$ConstantScorer.score(ConstantScoreQuery.java:232)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:572)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:274)
    at org.apache.solr.request.SimpleFacets.getGroupedCounts(SimpleFacets.java:341)
    at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:292)
    at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:396)
    at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:205)
    at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:85)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
    at org.eclipse.jetty.server.Server.handle(Server.java:351)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
    at java.lang.Thread.run(Thread.java:662)

This happens with Solr from trunk 4.x (2012-06-13) with distributed search; the index is split into 4 shards. The query uses grouping (group.facet) and faceting.

The problem occurred on only one shard, after a few days of normal operation. Documents were being sent to this shard and indexed, and many documents were also deleted from it.

Do you know how to fix this problem?
Best regards Agnieszka Kukalowicz
RE: solr 3.5 and indexing performance
Bug ticket created: https://issues.apache.org/jira/browse/SOLR-3245 I also made test you ask with english dictionary. The results are in the ticket. Agnieszka -Original Message- From: Jan Høydahl [mailto:jan@cominvent.com] Sent: Wednesday, March 14, 2012 12:54 AM To: solr-user@lucene.apache.org Subject: Re: solr 3.5 and indexing performance Hi, Thanks a lot for your detailed problem description. It definitely is an error. Would you be so kind to register it as a bug ticket, including your descriptions from this email? http://wiki.apache.org/solr/HowToContribute#JIRA_tips_.28our_issue.2BAC8 -bug_tracker.29. Also please attach to the issue your polish hunspell dictionaries. Then we'll try to reproduce the error. I wonder if this performance decrease is also seen for English dictionaries? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 13. mars 2012, at 16:42, Agnieszka Kukałowicz wrote: Hi, I did some more tests for Hunspell in solr 3.4, 4.0: Solr 3.4, full import 489017 documents: StempelPolishStemFilterFactory - 2908 seconds, 168 docs/sec HunspellStemFilterFactory - 3922 seconds, 125 docs/sec Solr 4.0, full import 489017 documents: StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11 docs/sec Server specification and Java settings are the same as before. Cheers Agnieszka -Original Message- From: Agnieszka Kukałowicz [mailto:agnieszka.kukalow...@usable.pl] Sent: Tuesday, March 13, 2012 10:39 AM To: 'solr-user@lucene.apache.org' Subject: RE: solr 3.5 and indexing performance Hi, Yes, I confirmed that without Hunspell indexing has normal speed. I did tests in solr 4.0 with Hunspell and PolishStemmer. With StempelPolishStemFilterFactory the speed is normal. My schema is quit easy. 
For Hunspell I have one text field I copy 14 text fields to: field name=text type=text_pl_hunspell indexed=true stored=false multiValued=true/ copyField source=field1 dest=text/ copyField source=field2 dest=text/ copyField source=field3 dest=text/ copyField source=field4 dest=text/ copyField source=field5 dest=text/ copyField source=field6 dest=text/ copyField source=field7 dest=text/ copyField source=field8 dest=text/ copyField source=field9 dest=text/ copyField source=field10 dest=text/ copyField source=field11 dest=text/ copyField source=field12 dest=text/ copyField source=field13 dest=text/ copyField source=field14 dest=text/ The text_pl_hunspell configuration: fieldType name=text_pl_hunspell class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=dict/stopwords_pl.txt enablePositionIncrements=true / filter class=solr.LowerCaseFilterFactory/ filter class=solr.HunspellStemFilterFactory dictionary=dict/pl_PL.dic affix=dict/pl_PL.aff ignoreCase=true !--filter class=solr.KeywordMarkerFilterFactory protected=protwords_pl.txt/-- /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=dict/synonyms_pl.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=dict/stopwords_pl.txt enablePositionIncrements=true / filter class=solr.LowerCaseFilterFactory/ filter class=solr.HunspellStemFilterFactory dictionary=dict/pl_PL.dic affix=dict/pl_PL.aff ignoreCase=true filter class=solr.KeywordMarkerFilterFactory protected=dict/protwords_pl.txt/ /analyzer /fieldType I use Polish dictionary (files stopwords_pl.txt, protwords_pl.txt, synonyms_pl.txt are empy)- pl_PL.dic, pl_PL.aff. These are the same files I used in 3.4 version. 
For Polish Stemmer the diffrence is only in definion text field: field name=text type=text_pl indexed=true stored=false multiValued=true/ fieldType name=text_pl class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=dict/stopwords_pl.txt enablePositionIncrements=true / filter class=solr.LowerCaseFilterFactory/ filter class=solr.StempelPolishStemFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=dict/protwords_pl.txt/ /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory/ filter class
RE: solr 3.5 and indexing performance
. VisualVM and run the profiler to see what part of the code takes up the time http://docs.oracle.com/javase/6/docs/technotes/tools/share/jvisualvm.ht ml -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 12. mars 2012, at 16:42, Agnieszka Kukałowicz wrote: Hi guys, I have hit the same problem with Hunspell. Doing a few tests for 500 000 documents, I've got: Hunspell from http://code.google.com/p/lucene-hunspell/ with 3.4 version - 125 documents per second Build Hunspell from 4.0 trunk - 11 documents per second. All the tests were made on 8 core CPU with 32 GB RAM and index on SSD disks. For Solr 3.5 I've tried to change JVM heap size, rambuffersize, mergefactor but the speed of indexing was about 10 -20 documents per second. Is it possible that there is some performance bug with Solr 4.0? According to previous post the problem exists in 3.5 version. Best regards Agnieszka Kukałowicz -Original Message- From: mizayah [mailto:miza...@gmail.com] Sent: Thursday, February 23, 2012 10:19 AM To: solr-user@lucene.apache.org Subject: Re: solr 3.5 and indexing performance Ok i found it. Its becouse of Hunspell which now is in solr. Somehow when im using it by myself in 3.4 it is a lot of faster then one from 3.5. Dont know about differences, but is there any way i use my old Google Hunspell jar? -- View this message in context: http://lucene.472066.n3.nabble.com/solr- 3-5-and-indexing-performance-tp3766653p3769139.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: solr 3.5 and indexing performance
Hi, I did some more tests for Hunspell in solr 3.4, 4.0: Solr 3.4, full import 489017 documents: StempelPolishStemFilterFactory - 2908 seconds, 168 docs/sec HunspellStemFilterFactory - 3922 seconds, 125 docs/sec Solr 4.0, full import 489017 documents: StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11 docs/sec Server specification and Java settings are the same as before. Cheers Agnieszka -Original Message- From: Agnieszka Kukałowicz [mailto:agnieszka.kukalow...@usable.pl] Sent: Tuesday, March 13, 2012 10:39 AM To: 'solr-user@lucene.apache.org' Subject: RE: solr 3.5 and indexing performance Hi, Yes, I confirmed that without Hunspell indexing has normal speed. I did tests in solr 4.0 with Hunspell and PolishStemmer. With StempelPolishStemFilterFactory the speed is normal. My schema is quit easy. For Hunspell I have one text field I copy 14 text fields to: field name=text type=text_pl_hunspell indexed=true stored=false multiValued=true/ copyField source=field1 dest=text/ copyField source=field2 dest=text/ copyField source=field3 dest=text/ copyField source=field4 dest=text/ copyField source=field5 dest=text/ copyField source=field6 dest=text/ copyField source=field7 dest=text/ copyField source=field8 dest=text/ copyField source=field9 dest=text/ copyField source=field10 dest=text/ copyField source=field11 dest=text/ copyField source=field12 dest=text/ copyField source=field13 dest=text/ copyField source=field14 dest=text/ The text_pl_hunspell configuration: fieldType name=text_pl_hunspell class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=dict/stopwords_pl.txt enablePositionIncrements=true / filter class=solr.LowerCaseFilterFactory/ filter class=solr.HunspellStemFilterFactory dictionary=dict/pl_PL.dic affix=dict/pl_PL.aff ignoreCase=true !--filter 
class=solr.KeywordMarkerFilterFactory protected=protwords_pl.txt/-- /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=dict/synonyms_pl.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=dict/stopwords_pl.txt enablePositionIncrements=true / filter class=solr.LowerCaseFilterFactory/ filter class=solr.HunspellStemFilterFactory dictionary=dict/pl_PL.dic affix=dict/pl_PL.aff ignoreCase=true filter class=solr.KeywordMarkerFilterFactory protected=dict/protwords_pl.txt/ /analyzer /fieldType I use Polish dictionary (files stopwords_pl.txt, protwords_pl.txt, synonyms_pl.txt are empy)- pl_PL.dic, pl_PL.aff. These are the same files I used in 3.4 version. For Polish Stemmer the diffrence is only in definion text field: field name=text type=text_pl indexed=true stored=false multiValued=true/ fieldType name=text_pl class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=dict/stopwords_pl.txt enablePositionIncrements=true / filter class=solr.LowerCaseFilterFactory/ filter class=solr.StempelPolishStemFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=dict/protwords_pl.txt/ /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=dict/synonyms_pl.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=dict/stopwords_pl.txt enablePositionIncrements=true / filter class=solr.LowerCaseFilterFactory/ filter class=solr.StempelPolishStemFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=dict/protwords_pl.txt/ /analyzer /fieldType One document has 23 fields: - 14 text fields copy to one text field (above) that is only indexed - 8 other indexed fields (2 strings, 2 tdates, 3 tint, 1 tfloat) The size of one document is 3-4 kB. 
So, I think this is not very complicated schema. My environment is: - Linux, RedHat 6.2, kernel 2.6.32 - 2 physical CPU Xeon 5606 (4 cores each) - 32 GB RAM - 2 SSD disks in RAID 0 - java version: java -version java version 1.6.0_26 Java(TM) SE Runtime Environment (build 1.6.0_26-b03) Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode) - java is running with -server
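The docs/sec figures quoted for the four full imports follow directly from the document count and the elapsed seconds. A minimal sketch that recomputes them from the numbers in the message; the dictionary layout and the `docs_per_second` helper are only illustrative.

```python
NUM_DOCS = 489017  # documents in each full import, as reported


def docs_per_second(num_docs, seconds):
    """Average indexing throughput for one import run."""
    return num_docs / seconds


# Elapsed seconds per (Solr version, stem filter), as reported.
import_seconds = {
    ("3.4", "StempelPolishStemFilterFactory"): 2908,
    ("3.4", "HunspellStemFilterFactory"): 3922,
    ("4.0", "StempelPolishStemFilterFactory"): 3016,
    ("4.0", "HunspellStemFilterFactory"): 44580,
}

throughput = {
    run: round(docs_per_second(NUM_DOCS, seconds))
    for run, seconds in import_seconds.items()
}

# How much longer the Hunspell import takes on 4.0 than on 3.4.
hunspell_slowdown = (
    import_seconds[("4.0", "HunspellStemFilterFactory")]
    / import_seconds[("3.4", "HunspellStemFilterFactory")]
)
hunspell_40_hours = import_seconds[("4.0", "HunspellStemFilterFactory")] / 3600
```

The rounded throughputs match the reported 168, 125, 162 and 11 docs/sec, and the Hunspell import on 4.0 takes a little over 11 times as long as on 3.4.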
RE: solr 3.5 and indexing performance
Hi guys,

I have hit the same problem with Hunspell. Running a few tests on 500 000 documents, I got:

Hunspell from http://code.google.com/p/lucene-hunspell/ with version 3.4 - 125 documents per second
Hunspell built from 4.0 trunk - 11 documents per second

All the tests were run on an 8-core CPU with 32 GB RAM and the index on SSD disks.

For Solr 3.5 I tried changing the JVM heap size, rambuffersize and mergefactor, but the indexing speed stayed at about 10-20 documents per second.

Is it possible that there is a performance bug in Solr 4.0? According to the previous post, the problem already exists in version 3.5.

Best regards
Agnieszka Kukałowicz

-----Original Message-----
From: mizayah [mailto:miza...@gmail.com]
Sent: Thursday, February 23, 2012 10:19 AM
To: solr-user@lucene.apache.org
Subject: Re: solr 3.5 and indexing performance

OK, I found it. It's because of Hunspell, which is now in Solr. Somehow, when I use it by myself in 3.4 it is a lot faster than the one from 3.5. I don't know about the differences, but is there any way I can use my old Google Hunspell jar?

--
View this message in context: http://lucene.472066.n3.nabble.com/solr-3-5-and-indexing-performance-tp3766653p3769139.html
Sent from the Solr - User mailing list archive at Nabble.com.
Polish language in Solr
Hi,

I have a question about Polish language support in Solr. There are 2 options: StempelPolishStemFilterFactory, or HunspellStemFilterFactory with a Polish dictionary. I've run some tests but the results don't satisfy me.

StempelPolishStemFilterFactory is very fast during indexing, but the search quality is not quite what I expect. HunspellStemFilterFactory, in turn, gives better search results, but indexing Polish text is very slow.

For example, indexing 100k documents with StempelPolishStemFilterFactory takes only 10 min (150 docs/sec); with HunspellStemFilterFactory it takes 1 h 20 min, so only 18-20 docs/sec (server with 8 cores, 24 GB RAM, index on SSD disk).

Is it possible to speed up indexing with Hunspell? What should I optimize? Do you have any experience with Hunspell?

I use Solr 4.0.

Best regards
Agnieszka
RE: Problem with SolrCloud + Zookeeper + DataImportHandler
Hi,

As you asked: https://issues.apache.org/jira/browse/SOLR-3165

If you have any questions or need more details, I can debug this problem further.

Agnieszka

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Friday, February 24, 2012 10:11 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with SolrCloud + Zookeeper + DataImportHandler

The key piece is "ZkSolrResourceLoader does not support getConfigDir()".

Apparently DIH is doing something that requires getting the local config dir path - but this is on ZK in SolrCloud mode, not the local filesystem.

Could you make a JIRA issue for this? I could look into a workaround depending on why DIH needs to do this.

- Mark

On Feb 20, 2012, at 7:28 AM, Agnieszka Kukałowicz wrote:

Hi All,

I've recently downloaded the latest Solr trunk to configure SolrCloud with ZooKeeper using the standard configuration from the wiki: http://wiki.apache.org/solr/SolrCloud.

The problem occurred when I tried to configure DataImportHandler in solrconfig.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>

After starting Solr with ZooKeeper I got these errors:

Feb 20, 2012 11:30:12 AM org.apache.solr.common.SolrException log
SEVERE: null:org.apache.solr.common.SolrException
    at org.apache.solr.core.SolrCore.init(SolrCore.java:606)
    at org.apache.solr.core.SolrCore.init(SolrCore.java:490)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:705)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:442)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:313)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:262)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:98)
    at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
    at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
    at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
    at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
    at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
    at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
    at org.mortbay.jetty.Server.doStart(Server.java:224)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.mortbay.start.Main.invokeMain(Main.java:194)
    at org.mortbay.start.Main.start(Main.java:534)
    at org.mortbay.start.Main.start(Main.java:441)
    at org.mortbay.start.Main.main(Main.java:119)
Caused by: org.apache.solr.common.SolrException: FATAL: Could not create importer. DataImporter config invalid
    at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:120)
    at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:542)
    at org.apache.solr.core.SolrCore.init(SolrCore.java:601)
    ... 31 more
Caused by: org.apache.solr.common.cloud.ZooKeeperException: ZkSolrResourceLoader does not support getConfigDir() - likely, w
    at org.apache.solr.cloud.ZkSolrResourceLoader.getConfigDir(ZkSolrResourceLoader.java:99)
    at org.apache.solr.handler.dataimport.SimplePropertiesWriter.init(SimplePropertiesWriter.java:47)
    at org.apache.solr.handler.dataimport.DataImporter.init(DataImporter.java:112)
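To illustrate the failure mode Mark describes, here is a small self-contained sketch. It is not Solr code: ConfigSource, RemoteConfigSource and ConfigDemo are hypothetical names invented for this example, and an in-memory map stands in for ZooKeeper. The point is only that a component which opens its config by name keeps working when the files are not on disk, while one that asks for a local directory path cannot. The sketch assumes Java 9 or later (Map.of, InputStream.readAllBytes).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical stand-in for a config loader; NOT Solr's actual interface. */
interface ConfigSource {
    /** Opens a config file by name, wherever it is stored. */
    InputStream open(String name) throws IOException;

    /** Returns a local directory path; may be unsupported for remote stores. */
    String localConfigDir();
}

/** Simulates a loader whose files live in a remote store such as ZooKeeper. */
final class RemoteConfigSource implements ConfigSource {
    private final Map<String, String> files;

    RemoteConfigSource(Map<String, String> files) {
        this.files = files;
    }

    @Override
    public InputStream open(String name) throws IOException {
        String body = files.get(name);
        if (body == null) {
            throw new IOException("no such config file: " + name);
        }
        return new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public String localConfigDir() {
        // There is no directory on disk to point at.
        throw new UnsupportedOperationException("config is not on the local filesystem");
    }
}

final class ConfigDemo {
    /** Reads a config file by name only, so it works with any ConfigSource. */
    static String read(ConfigSource source, String name) throws IOException {
        try (InputStream in = source.open(name)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        ConfigSource source =
                new RemoteConfigSource(Map.of("db-data-config.xml", "<dataConfig/>"));

        // Stream-based access works regardless of where the file lives.
        System.out.println(read(source, "db-data-config.xml"));

        // Path-based access fails, analogous to the exception in the trace.
        try {
            source.localConfigDir();
        } catch (UnsupportedOperationException e) {
            System.out.println("unsupported: " + e.getMessage());
        }
    }
}
```

Whether DIH can be changed along these lines depends on why it needs the directory in the first place, which is the open question in Mark's reply.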
Problem with SolrCloud + Zookeeper + DataImportHandler
Hi All,

I've recently downloaded the latest Solr trunk to configure SolrCloud with ZooKeeper using the standard configuration from the wiki: http://wiki.apache.org/solr/SolrCloud.

The problem occurred when I tried to configure DataImportHandler in solrconfig.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>

After starting Solr with ZooKeeper I got these errors:

Feb 20, 2012 11:30:12 AM org.apache.solr.common.SolrException log
SEVERE: null:org.apache.solr.common.SolrException
    at org.apache.solr.core.SolrCore.init(SolrCore.java:606)
    at org.apache.solr.core.SolrCore.init(SolrCore.java:490)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:705)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:442)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:313)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:262)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:98)
    at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
    at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
    at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
    at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
    at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
    at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
    at org.mortbay.jetty.Server.doStart(Server.java:224)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
    at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.mortbay.start.Main.invokeMain(Main.java:194)
    at org.mortbay.start.Main.start(Main.java:534)
    at org.mortbay.start.Main.start(Main.java:441)
    at org.mortbay.start.Main.main(Main.java:119)
Caused by: org.apache.solr.common.SolrException: FATAL: Could not create importer. DataImporter config invalid
    at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:120)
    at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:542)
    at org.apache.solr.core.SolrCore.init(SolrCore.java:601)
    ... 31 more
Caused by: org.apache.solr.common.cloud.ZooKeeperException: ZkSolrResourceLoader does not support getConfigDir() - likely, w
    at org.apache.solr.cloud.ZkSolrResourceLoader.getConfigDir(ZkSolrResourceLoader.java:99)
    at org.apache.solr.handler.dataimport.SimplePropertiesWriter.init(SimplePropertiesWriter.java:47)
    at org.apache.solr.handler.dataimport.DataImporter.init(DataImporter.java:112)
    at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:114)
    ... 33 more

I've checked that the file db-data-config.xml is available in ZooKeeper:

[zk: localhost:2181(CONNECTED) 0] ls /configs/conf1
[admin-extra.menu-top.html, dict, solrconfig.xml, dataimport.properties, admin-extra.html, solrconfig.xml.old, solrconfig.xml.new, solrconfig.xml~, xslt, db-data-config.xml, velocity, elevate.xml, admin-extra.menu-bottom.html, solrconfig.xml.dataimport, schema.xml]
[zk: localhost:2181(CONNECTED) 1]

Is it possible to configure DIH with ZooKeeper, and if so, how? I'm a little confused by this.

Regards
Agnieszka Kukalowicz