java.lang.NullPointerException with stats component and shards

2013-03-21 Thread Agnieszka Kukałowicz
Hi,



I have a problem with the Stats component in a sharded environment.



Solr throws a java.lang.NullPointerException when there are no results and
statistics are computed over a date field.
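
For reference, a request of roughly this shape triggers it (the host and shard
list are placeholders; the parameters are essentially the ones echoed in the
responses below):

http://host:port/select/?q=*:*&fq=ddate:[2013-03-23T00:00:00Z%20TO%20*]&fq=price:[5000%20TO%20*]&stats=true&stats.field=price&stats.field=ddate&shards=shard1,shard2&rows=10

The params echoed for the distributed request and the resulting error are: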



<arr name="stats.field">
  <str>price</str>
  <str>ddate</str>
</arr>
<str name="stats">true</str>
<str name="q">*:*</str>
<arr name="fq">
  <str>date:[2013-03-23T00:00:00Z TO *]</str>
  <str>price:[5000 TO *]</str>
</arr>
<str name="rows">10</str>
</lst>
</lst>

<lst name="error">
  <str name="trace">java.lang.NullPointerException
        at org.apache.solr.handler.component.DateStatsValues.updateTypeSpecificStats(StatsValuesFactory.java:340)
        at org.apache.solr.handler.component.AbstractStatsValues.accumulate(StatsValuesFactory.java:106)
        at org.apache.solr.handler.component.StatsComponent.handleResponses(StatsComponent.java:112)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:311)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:365)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
        at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:926)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:988)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:635)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
        at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
        at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Thread.java:662)
  </str>
  <int name="code">500</int>
</lst>
</response>





The same query without shards acts normally:



<arr name="stats.field">
  <str>price</str>
  <str>ddate</str>
</arr>
<str name="stats">true</str>
<str name="q">*:*</str>
<arr name="fq">
  <str>ddate:[2013-03-23T00:00:00Z TO *]</str>
  <str>price:[5000 TO *]</str>
</arr>
<str name="rows">10</str>
</lst>
</lst>
<lst name="grouped">
  <lst name="id">
    <int name="matches">0</int>
    <int name="ngroups">0</int>
    <arr name="groups"/>
  </lst>
</lst>



I've tested it on Solr 4.0 and then on Solr 4.2, and the problem still exists.



Regards

Agnieszka Kukałowicz


Grouping performance problem

2012-07-16 Thread Agnieszka Kukałowicz
Hi,

Is there any way to make grouping searches more efficient?

My queries look like:
/select?q=query&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1

For an index with 3 million documents, a query for all docs with group=true
takes almost 4000 ms. Because the queryResultCache is not used, subsequent
queries also take a long time.

When I remove group=true and leave only faceting, the query for all docs takes
much less time: ~700 ms the first time and only about 200 ms on subsequent runs,
because the queryResultCache is used.
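
For comparison, the faceting-only variant is simply the same request with the
grouping parameters dropped, roughly:

/select?q=query&facet.field=category1&facet.missing=false&facet.mincount=1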

So with group=true the query is about 20 times slower than without it.
Is this expected, or is there any way to improve performance with grouping?

My application needs the grouping feature and all of its queries use it, but
their performance is too low for production use.

I use Solr 4.x from trunk

Agnieszka Kukalowicz


Re: Grouping performance problem

2012-07-16 Thread Agnieszka Kukałowicz
Hi Pavel,

I tried with group.ngroups=false but didn't notice a big improvement; the times
were still about 4000 ms, so it doesn't solve my problem.
Maybe this is because of the shape of my index: I have millions of documents
but only about 20 000 groups.
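
Concretely, the variant I timed was the same query as before with only the
ngroups flag changed, roughly:

/select?q=query&group=true&group.field=id&group.facet=true&group.ngroups=false&facet.field=category1&facet.missing=false&facet.mincount=1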

 Cheers
 Agnieszka

2012/7/16 Pavel Goncharik pavel.goncha...@gmail.com

 Hi Agnieszka ,

 if you don't need number of groups, you can try leaving out
 group.ngroups=true param.
 In this case Solr apparently skips calculating all groups and delivers
 results much faster.
 At least for our application the difference in performance
 with/without group.ngroups=true is significant (have to say, we use
 Solr 3.6).

 WBR,
 Pavel

 On Mon, Jul 16, 2012 at 1:00 PM, Agnieszka Kukałowicz
 agnieszka.kukalow...@usable.pl wrote:
  Hi,
 
  Is the any way to make grouping searches more efficient?
 
  My queries look like:
 
 /select?q=query&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1
 
  For index with 3 mln documents query for all docs with group=true takes
  almost 4000ms. Because queryResultCache is not used next queries take a
  long time also.
 
  When I remove group=true and leave only faceting the query for all docs
  takes much more less time: for first time ~ 700ms and next runs only
 200ms
  because of queryResultCache being used.
 
  So with group=true the query is about 20 time slower than without it.
  Is it possible or is there any way to improve performance with grouping?
 
  My application needs grouping feature and all of the queries use it but
 the
  performance of them is to low for production use.
 
  I use Solr 4.x from trunk
 
  Agnieszka Kukalowicz



Re: Grouping performance problem

2012-07-16 Thread Agnieszka Kukałowicz
I have a server with 24GB RAM. I have 4 shards on it, each of them with 4GB of
heap for Java:
JAVA_OPTIONS=-server -Xms4096M -Xmx4096M
The data folder size is about 15GB for one shard (I use an SSD disk for the
index data).

Agnieszka


2012/7/16 alx...@aim.com

 What are the RAM of your server and size of the data folder?



 -Original Message-
 From: Agnieszka Kukałowicz agnieszka.kukalow...@usable.pl
 To: solr-user solr-user@lucene.apache.org
 Sent: Mon, Jul 16, 2012 6:16 am
 Subject: Re: Grouping performance problem


 Hi Pavel,

 I tried with group.ngroups=false but didn't notice a big improvement.
 The times were still about 4000 ms. It doesn't solve my problem.
 Maybe this is because of my index type. I have millions of documents but
 only about 20 000 groups.

  Cheers
  Agnieszka

 2012/7/16 Pavel Goncharik pavel.goncha...@gmail.com

  Hi Agnieszka ,
 
  if you don't need number of groups, you can try leaving out
  group.ngroups=true param.
  In this case Solr apparently skips calculating all groups and delivers
  results much faster.
  At least for our application the difference in performance
  with/without group.ngroups=true is significant (have to say, we use
  Solr 3.6).
 
  WBR,
  Pavel
 
  On Mon, Jul 16, 2012 at 1:00 PM, Agnieszka Kukałowicz
  agnieszka.kukalow...@usable.pl wrote:
   Hi,
  
   Is the any way to make grouping searches more efficient?
  
   My queries look like:
  
 
  /select?q=query&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1
  
   For index with 3 mln documents query for all docs with group=true takes
   almost 4000ms. Because queryResultCache is not used next queries take a
   long time also.
  
   When I remove group=true and leave only faceting the query for all docs
   takes much more less time: for first time ~ 700ms and next runs only
  200ms
   because of queryResultCache being used.
  
   So with group=true the query is about 20 time slower than without it.
   Is it possible or is there any way to improve performance with
 grouping?
  
   My application needs grouping feature and all of the queries use it but
  the
   performance of them is to low for production use.
  
   I use Solr 4.x from trunk
  
   Agnieszka Kukalowicz
 





Re: Groups count in distributed grouping is wrong in some case

2012-07-15 Thread Agnieszka Kukałowicz
Hi,

I'm using Solr 4.x from trunk, the version from 2012-07-10, so it is one of the
latest builds.

I searched the mailing list and JIRA but found only this:
https://issues.apache.org/jira/browse/SOLR-3436

It was committed to trunk in May, so my version already has that fix, but the
problem still exists.

Cheers
Agnieszka

2012/7/15 Erick Erickson erickerick...@gmail.com

 what version of Solr are you using? There's been quite a bit of work
 on this lately,
 I'm not even sure how much has made it into 3.6. You might try searching
 the
 JIRA list, Martijn van Groningen has done a bunch of work lately, look for
 his name. Fortunately, it's not likely to get a bunch of false hits G..

 Best
 Erick

 On Fri, Jul 13, 2012 at 7:50 AM, Agnieszka Kukałowicz
 agnieszka.kukalow...@usable.pl wrote:
  Hi,
 
  I have problem with faceting count in distributed grouping. It appears
 only
  when I make query that returns almost all of the documents.
 
  My SOLR implementation has 4 shards and my queries looks like:
 
  http://host:port/select/?q=*:*&shards=shard1,shard2,shard3,shard4&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1
 
  With query like above I get strange counts for field category1.
  The counts for values are very big:
   <int name="val1">9659</int>
   <int name="val2">7015</int>
   <int name="val3">5676</int>
   <int name="val4">1180</int>
   <int name="val5">1105</int>
   <int name="val6">979</int>
   <int name="val7">770</int>
   <int name="val8">701</int>
   <int name=...>612</int>
   <int name="val9">422</int>
   <int name="val10">358</int>
 
  When I make query to narrow the results adding to query
  fq=category1:val1, etc. I get different counts than facet category1
 shows
  for a few first values:
 
  fq=category1:val1 - counts: 22
  fq=category1:val2 - counts: 22
  fq=category1:val3 - counts: 21
  fq=category1:val4 - counts: 19
  fq=category1:val5 - counts: 19
  fq=category1:val6 - counts: 20
  fq=category1:val7 - counts: 20
  fq=category1:val8 - counts: 25
  fq=category1:val9 - counts: 422
  fq=category1:val10 - counts: 358
 
  From val9 the count is ok.
 
  First I thought that for some values in facet category1 groups count
 does
  not work and it returns counts of all documents not group by field id.
  But the number of all documents matches query  fq=category1:val1 is
  45468. So the numbers are not the same.
 
  I check the queries on each shard for val1 and the results are:
 
  shard1:
  query:
 
 http://shard1/select/?q=*:*&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1
 
  <lst name="fcategory">
    <int name="val1">11</int>
 
  query:
 
 http://shard1/select/?q=*:*&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1&fq=category1:val1
 
  shard 2:
  query:
 
 http://shard2/select/?q=*:*&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1
 
  there is no value val1 in category1 facet.
 
  query:
 
 http://shard2/select/?q=*:*&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1&fq=category1:val1
 
  <int name="ngroups">7</int>
 
  shard3:
  query:
 
 http://shard3/select/?q=*:*&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1
 
  there is no value val1 in category1 facet
 
  query:
 
 http://shard3/select/?q=*:*&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1&fq=category1:val1
 
  <int name="ngroups">4</int>
 
  So it looks that detail query with fq=category1:val1 returns the
 relevant
  results. But Solr has problem with faceting counts when one of the shard
  does not return the faceting value (in this scenario val1) that exists
 on
  other shards.
 
  I checked shards for val10 and I got:
 
  shard1: count for val10 - 142
  shard2: count for val10 - 131
  shard3: count for val10 -  149
  sum of counts 422 - ok.
 
  I'm not sure how to resolve that situation. For sure the counts of val1
 to
  val9 should be different and they should not be on the top of the
 category1
  facet because this is very confusing. Do you have any idea how to fix
 this
  problem?
 
  Best regards
  Agnieszka



Groups count in distributed grouping is wrong in some case

2012-07-13 Thread Agnieszka Kukałowicz
Hi,

I have a problem with facet counts in distributed grouping. It appears only
when I make a query that returns almost all of the documents.

My Solr setup has 4 shards and my queries look like:

http://host:port/select/?q=*:*&shards=shard1,shard2,shard3,shard4&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1

With a query like the one above I get strange counts for the field category1.
The counts for the values are very big:
<int name="val1">9659</int>
<int name="val2">7015</int>
<int name="val3">5676</int>
<int name="val4">1180</int>
<int name="val5">1105</int>
<int name="val6">979</int>
<int name="val7">770</int>
<int name="val8">701</int>
<int name=...>612</int>
<int name="val9">422</int>
<int name="val10">358</int>

When I narrow the results by adding fq=category1:val1, etc., to the query, I get
different counts than the category1 facet shows for the first few values:

fq=category1:val1 - counts: 22
fq=category1:val2 - counts: 22
fq=category1:val3 - counts: 21
fq=category1:val4 - counts: 19
fq=category1:val5 - counts: 19
fq=category1:val6 - counts: 20
fq=category1:val7 - counts: 20
fq=category1:val8 - counts: 25
fq=category1:val9 - counts: 422
fq=category1:val10 - counts: 358

From val9 the count is ok.
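
To be explicit, each of the narrowing queries above is just the base query from
the top of this mail with a single filter appended, e.g.:

http://host:port/select/?q=*:*&shards=shard1,shard2,shard3,shard4&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1&fq=category1:val1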

At first I thought that for some values in the category1 facet the group count
simply does not work and Solr returns the count of all documents instead of
grouping by the id field. But the number of all documents matching the query
with fq=category1:val1 is 45468, so the numbers are not the same.

I checked the queries on each shard for val1 and the results are:

shard1:
query:
http://shard1/select/?q=*:*&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1

<lst name="fcategory">
  <int name="val1">11</int>

query:
http://shard1/select/?q=*:*&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1&fq=category1:val1

shard 2:
query:
http://shard2/select/?q=*:*&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1

there is no value val1 in category1 facet.

query:
http://shard2/select/?q=*:*&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1&fq=category1:val1

<int name="ngroups">7</int>

shard3:
query:
http://shard3/select/?q=*:*&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1

there is no value val1 in category1 facet

query:
http://shard3/select/?q=*:*&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1&fq=category1:val1

<int name="ngroups">4</int>

So it looks like the detailed query with fq=category1:val1 returns the correct
results, but Solr has a problem with facet counts when one of the shards does
not return a facet value (in this scenario val1) that exists on the other
shards.

I checked shards for val10 and I got:

shard1: count for val10 - 142
shard2: count for val10 - 131
shard3: count for val10 -  149
sum of counts 422 - ok.

I'm not sure how to resolve this situation. The counts for val1 to val9 should
certainly be different, and those values should not end up at the top of the
category1 facet, because this is very confusing. Do you have any idea how to
fix this problem?

Best regards
Agnieszka


NPE with 500 error

2012-07-11 Thread Agnieszka Kukałowicz
Hi,

I've recently got an NPE with a 500 status from my search:

SEVERE: java.lang.NullPointerException
        at org.apache.lucene.index.DocTermOrds$TermOrdsIterator.reset(DocTermOrds.java:623)
        at org.apache.lucene.index.DocTermOrds.lookup(DocTermOrds.java:649)
        at org.apache.lucene.search.grouping.term.TermGroupFacetCollector$MV.collect(TermGroupFacetCollector.java:191)
        at org.apache.lucene.search.Scorer.score(Scorer.java:60)
        at org.apache.lucene.search.ConstantScoreQuery$ConstantScorer.score(ConstantScoreQuery.java:232)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:572)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:274)
        at org.apache.solr.request.SimpleFacets.getGroupedCounts(SimpleFacets.java:341)
        at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:292)
        at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:396)
        at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:205)
        at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:85)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
        at org.eclipse.jetty.server.Server.handle(Server.java:351)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
        at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
        at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
        at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
        at java.lang.Thread.run(Thread.java:662)


This happens with Solr 4.x from trunk (2012-06-13) with distributed search -
the index is split into 4 shards.
The query uses grouping (group.facet) and faceting.
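
The failing requests combine group.facet with facet.field, roughly like the
queries in my other mails (the field names here are only illustrative):

/select?q=*:*&shards=shard1,shard2,shard3,shard4&group=true&group.field=id&group.facet=true&group.ngroups=true&facet.field=category1&facet.missing=false&facet.mincount=1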

This problem occurred on only one shard, after a few days of normal operation.
Documents were being sent to this shard and indexed, and many documents were
also deleted from it.

Do you know how to fix this problem?

Best regards
Agnieszka Kukalowicz


RE: solr 3.5 and indexing performance

2012-03-14 Thread Agnieszka Kukałowicz
Bug ticket created:
https://issues.apache.org/jira/browse/SOLR-3245

I also ran the test you asked for with an English dictionary.
The results are in the ticket.

Agnieszka

 -Original Message-
 From: Jan Høydahl [mailto:jan@cominvent.com]
 Sent: Wednesday, March 14, 2012 12:54 AM
 To: solr-user@lucene.apache.org
 Subject: Re: solr 3.5 and indexing performance

 Hi,

 Thanks a lot for your detailed problem description. It definitely is an
 error. Would you be so kind to register it as a bug ticket, including
 your descriptions from this email?
 http://wiki.apache.org/solr/HowToContribute#JIRA_tips_.28our_issue.2BAC8
 -bug_tracker.29. Also please attach to the issue your polish hunspell
 dictionaries. Then we'll try to reproduce the error.

 I wonder if this performance decrease is also seen for English
 dictionaries?

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Solr Training - www.solrtraining.com

 On 13. mars 2012, at 16:42, Agnieszka Kukałowicz wrote:

  Hi,
 
  I did some more tests for Hunspell in solr 3.4, 4.0:
 
  Solr 3.4, full import 489017 documents:
 
  StempelPolishStemFilterFactory -  2908 seconds, 168 docs/sec
  HunspellStemFilterFactory - 3922 seconds, 125 docs/sec
 
  Solr 4.0, full import 489017 documents:
 
  StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec
  HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11
 docs/sec
 
  Server specification and Java settings are the same as before.
 
  Cheers
  Agnieszka
 
 
  -Original Message-
  From: Agnieszka Kukałowicz [mailto:agnieszka.kukalow...@usable.pl]
  Sent: Tuesday, March 13, 2012 10:39 AM
  To: 'solr-user@lucene.apache.org'
  Subject: RE: solr 3.5 and indexing performance
 
  Hi,
 
  Yes, I confirmed that without Hunspell indexing has normal speed.
  I did tests in solr 4.0 with Hunspell and PolishStemmer.
  With StempelPolishStemFilterFactory the speed is normal.
 
  My schema is quit easy. For Hunspell I have one text field I copy 14
  text fields to:
 
  field name=text type=text_pl_hunspell indexed=true
  stored=false multiValued=true/
 
 
  copyField source=field1 dest=text/  copyField source=field2
  dest=text/  copyField source=field3 dest=text/  copyField
  source=field4 dest=text/  copyField source=field5
 dest=text/
  copyField source=field6 dest=text/  copyField source=field7
  dest=text/  copyField source=field8 dest=text/  copyField
  source=field9 dest=text/  copyField source=field10
 dest=text/
  copyField source=field11 dest=text/  copyField
 source=field12
  dest=text/  copyField source=field13 dest=text/  copyField
  source=field14 dest=text/
 
  The text_pl_hunspell configuration:
 
  fieldType name=text_pl_hunspell class=solr.TextField
  positionIncrementGap=100
   analyzer type=index
 tokenizer class=solr.StandardTokenizerFactory/
 filter class=solr.StopFilterFactory
 ignoreCase=true
 words=dict/stopwords_pl.txt
 enablePositionIncrements=true
 /
 filter class=solr.LowerCaseFilterFactory/
 filter class=solr.HunspellStemFilterFactory
  dictionary=dict/pl_PL.dic affix=dict/pl_PL.aff ignoreCase=true
 !--filter class=solr.KeywordMarkerFilterFactory
  protected=protwords_pl.txt/--
   /analyzer
   analyzer type=query
 tokenizer class=solr.StandardTokenizerFactory/
 filter class=solr.SynonymFilterFactory
  synonyms=dict/synonyms_pl.txt ignoreCase=true expand=true/
 filter class=solr.StopFilterFactory
 ignoreCase=true
 words=dict/stopwords_pl.txt
 enablePositionIncrements=true
 /
 filter class=solr.LowerCaseFilterFactory/
 filter class=solr.HunspellStemFilterFactory
  dictionary=dict/pl_PL.dic affix=dict/pl_PL.aff ignoreCase=true
 filter class=solr.KeywordMarkerFilterFactory
  protected=dict/protwords_pl.txt/
   /analyzer
 /fieldType
 
  I use Polish dictionary (files stopwords_pl.txt, protwords_pl.txt,
  synonyms_pl.txt are empy)- pl_PL.dic, pl_PL.aff. These are the same
  files I used in 3.4 version.
 
  For Polish Stemmer the diffrence is only in definion text field:
 
  field name=text type=text_pl indexed=true stored=false
  multiValued=true/
 
 fieldType name=text_pl class=solr.TextField
  positionIncrementGap=100
   analyzer type=index
 tokenizer class=solr.StandardTokenizerFactory/
 filter class=solr.StopFilterFactory
 ignoreCase=true
 words=dict/stopwords_pl.txt
 enablePositionIncrements=true
 /
 filter class=solr.LowerCaseFilterFactory/
 filter class=solr.StempelPolishStemFilterFactory/
 filter class=solr.KeywordMarkerFilterFactory
  protected=dict/protwords_pl.txt/
   /analyzer
   analyzer type=query
 tokenizer class=solr.StandardTokenizerFactory/
 filter class

RE: solr 3.5 and indexing performance

2012-03-13 Thread Agnieszka Kukałowicz
.
 VisualVM and run the profiler to see what part of the code takes up the
 time
 http://docs.oracle.com/javase/6/docs/technotes/tools/share/jvisualvm.html

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Solr Training - www.solrtraining.com

 On 12. mars 2012, at 16:42, Agnieszka Kukałowicz wrote:

  Hi guys,
 
  I have hit the same problem with Hunspell.
  Doing a few tests for 500 000 documents, I've got:
 
  Hunspell from http://code.google.com/p/lucene-hunspell/ with 3.4
  version -
  125 documents per second
  Build Hunspell from 4.0 trunk - 11 documents per second.
 
  All the tests were made on 8 core CPU with 32 GB RAM and index on SSD
  disks.
  For Solr 3.5 I've tried to change JVM heap size, rambuffersize,
  mergefactor but the speed of indexing was about 10 -20 documents per
  second.
 
  Is it possible that there is some performance bug with Solr 4.0?
  According to previous post the problem exists in 3.5 version.
 
  Best regards
  Agnieszka Kukałowicz
 
 
  -Original Message-
  From: mizayah [mailto:miza...@gmail.com]
  Sent: Thursday, February 23, 2012 10:19 AM
  To: solr-user@lucene.apache.org
  Subject: Re: solr 3.5 and indexing performance
 
  Ok i found it.
 
  Its becouse of Hunspell which now is in solr. Somehow when im using
  it by myself in 3.4 it is a lot of faster then one from 3.5.
 
  Dont know about differences, but is there any way i use my old
 Google
  Hunspell jar?
 
  --
  View this message in context:
  http://lucene.472066.n3.nabble.com/solr-3-5-and-indexing-performance-tp3766653p3769139.html
  Sent from the Solr - User mailing list archive at Nabble.com.


RE: solr 3.5 and indexing performance

2012-03-13 Thread Agnieszka Kukałowicz
Hi,

I did some more tests for Hunspell in Solr 3.4 and 4.0:

Solr 3.4, full import 489017 documents:

StempelPolishStemFilterFactory -  2908 seconds, 168 docs/sec
HunspellStemFilterFactory - 3922 seconds, 125 docs/sec

Solr 4.0, full import 489017 documents:

StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec
HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11 docs/sec

Server specification and Java settings are the same as before.

Cheers
Agnieszka


 -Original Message-
 From: Agnieszka Kukałowicz [mailto:agnieszka.kukalow...@usable.pl]
 Sent: Tuesday, March 13, 2012 10:39 AM
 To: 'solr-user@lucene.apache.org'
 Subject: RE: solr 3.5 and indexing performance

 Hi,

 Yes, I confirmed that without Hunspell indexing has normal speed.
 I did tests in solr 4.0 with Hunspell and PolishStemmer.
 With StempelPolishStemFilterFactory the speed is normal.

 My schema is quite simple. For Hunspell I have one text field that 14 other
 text fields are copied to:

 <field name="text" type="text_pl_hunspell" indexed="true" stored="false"
        multiValued="true"/>


 <copyField source="field1" dest="text"/>
 <copyField source="field2" dest="text"/>
 <copyField source="field3" dest="text"/>
 <copyField source="field4" dest="text"/>
 <copyField source="field5" dest="text"/>
 <copyField source="field6" dest="text"/>
 <copyField source="field7" dest="text"/>
 <copyField source="field8" dest="text"/>
 <copyField source="field9" dest="text"/>
 <copyField source="field10" dest="text"/>
 <copyField source="field11" dest="text"/>
 <copyField source="field12" dest="text"/>
 <copyField source="field13" dest="text"/>
 <copyField source="field14" dest="text"/>

 The text_pl_hunspell configuration:

 <fieldType name="text_pl_hunspell" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="dict/stopwords_pl.txt" enablePositionIncrements="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.HunspellStemFilterFactory"
             dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
     <!--<filter class="solr.KeywordMarkerFilterFactory" protected="protwords_pl.txt"/>-->
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms_pl.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="dict/stopwords_pl.txt" enablePositionIncrements="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.HunspellStemFilterFactory"
             dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
     <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
   </analyzer>
 </fieldType>

 I use a Polish dictionary - pl_PL.dic and pl_PL.aff (the files stopwords_pl.txt,
 protwords_pl.txt and synonyms_pl.txt are empty). These are the same files I
 used with version 3.4.

 For the Polish stemmer the difference is only in the definition of the text field:

 <field name="text" type="text_pl" indexed="true" stored="false"
        multiValued="true"/>

 <fieldType name="text_pl" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="dict/stopwords_pl.txt" enablePositionIncrements="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StempelPolishStemFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms_pl.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="dict/stopwords_pl.txt" enablePositionIncrements="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StempelPolishStemFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
   </analyzer>
 </fieldType>

 One document has 23 fields:
 - 14 text fields copied to one text field (above) that is only indexed
 - 8 other indexed fields (2 strings, 2 tdates, 3 tints, 1 tfloat)
 The size of one document is 3-4 kB.
 So I think this is not a very complicated schema.

 My environment is:
 - Linux, RedHat 6.2, kernel 2.6.32
 - 2 physical CPU Xeon 5606 (4 cores each)
 - 32 GB RAM
 - 2 SSD disks in RAID 0
 - java version:

 java -version
 java version 1.6.0_26
 Java(TM) SE Runtime Environment (build 1.6.0_26-b03) Java HotSpot(TM)
 64-Bit Server VM (build 20.1-b02, mixed mode)

 - java is running with -server

RE: solr 3.5 and indexing performance

2012-03-12 Thread Agnieszka Kukałowicz
Hi guys,

I have hit the same problem with Hunspell.
Doing a few tests for 500 000 documents, I've got:

Hunspell from http://code.google.com/p/lucene-hunspell/ with the 3.4 version -
125 documents per second
Hunspell built from the 4.0 trunk - 11 documents per second.

All the tests were made on 8 core CPU with 32 GB RAM and index on SSD
disks.
For Solr 3.5 I've tried changing the JVM heap size, ramBufferSizeMB and
mergeFactor, but the indexing speed was still only about 10-20 documents per
second.
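
For reference, these are the usual solrconfig.xml knobs I was changing; the
values shown are only illustrative, not what I recommend:

<indexDefaults>
  <!-- illustrative values only -->
  <ramBufferSizeMB>256</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>
</indexDefaults>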

Is it possible that there is a performance bug in Solr 4.0? According to the
previous post the problem also exists in version 3.5.

Best regards
Agnieszka Kukałowicz


 -Original Message-
 From: mizayah [mailto:miza...@gmail.com]
 Sent: Thursday, February 23, 2012 10:19 AM
 To: solr-user@lucene.apache.org
 Subject: Re: solr 3.5 and indexing performance

 Ok i found it.

 Its becouse of Hunspell which now is in solr. Somehow when im using it
 by myself in 3.4 it is a lot of faster then one from 3.5.

 Dont know about differences, but is there any way i use my old Google
 Hunspell jar?

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/solr-3-5-and-indexing-performance-tp3766653p3769139.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Polish language in Solr

2012-03-05 Thread Agnieszka Kukałowicz
Hi,

I have a question about the Polish language in Solr.

There are two options: StempelPolishStemFilterFactory or HunspellStemFilterFactory
with a Polish dictionary. I've made some tests but the results do not satisfy me.
StempelPolishStemFilterFactory is very fast during indexing but the quality of
the searches is not exactly what I expect. In turn, HunspellStemFilterFactory is
better for searching, but indexing Polish text is very slow.
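
For reference, the two index-time stemming filters I am comparing are configured
roughly like this (the dictionary paths are my local files):

<!-- option 1: Stempel -->
<filter class="solr.StempelPolishStemFilterFactory"/>

<!-- option 2: Hunspell with a Polish dictionary -->
<filter class="solr.HunspellStemFilterFactory"
        dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>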

For example, indexing 100k documents with StempelPolishStemFilterFactory takes
only 10 min (~150 docs/sec), while with HunspellStemFilterFactory it takes
1 h 20 min, which is only 18-20 docs/sec (server with 8 cores, 24GB RAM, index
on an SSD disk).

Is it possible to speed up indexing with Hunspell? What should I optimize?

Do you have any experience with Hunspell?

I use Solr 4.0.

Best regards
Agnieszka


RE: Problem with SolrCloud + Zookeeper + DataImportHandler

2012-02-25 Thread Agnieszka Kukałowicz
Hi,

As you've asked.
https://issues.apache.org/jira/browse/SOLR-3165

If you have any questions or need more details, I can debug this problem
further.

Agnieszka

 -Original Message-
 From: Mark Miller [mailto:markrmil...@gmail.com]
 Sent: Friday, February 24, 2012 10:11 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Problem with SolrCloud + Zookeeper + DataImportHandler

 The key piece is ZkSolrResourceLoader does not support getConfigDir()
 

 Apparently DIH is doing something that requires getting the local
 config dir path - but this is on ZK in SolrCloud mode, not the local
 filesystem.

 Could you make a JIRA issue for this? I could look into a work around
 depending on why DIH needs to do this.

 - Mark

 On Feb 20, 2012, at 7:28 AM, Agnieszka Kukałowicz wrote:

  Hi All,
 
  I've recently downloaded latest solr trunk to configure solrcloud
 with
  zookeeper
  using standard configuration from wiki:
  http://wiki.apache.org/solr/SolrCloud.
 
  The problem occurred when I tried to configure DataImportHandler in
  solrconfig.xml:
 
   <requestHandler name="/dataimport"
       class="org.apache.solr.handler.dataimport.DataImportHandler">
     <lst name="defaults">
       <str name="config">db-data-config.xml</str>
     </lst>
   </requestHandler>
 
 
  After starting solr with zookeeper I've got errors:
 
  Feb 20, 2012 11:30:12 AM org.apache.solr.common.SolrException log
  SEVERE: null:org.apache.solr.common.SolrException
 at org.apache.solr.core.SolrCore.init(SolrCore.java:606)
 at org.apache.solr.core.SolrCore.init(SolrCore.java:490)
 at
  org.apache.solr.core.CoreContainer.create(CoreContainer.java:705)
 at
 org.apache.solr.core.CoreContainer.load(CoreContainer.java:442)
 at
 org.apache.solr.core.CoreContainer.load(CoreContainer.java:313)
 at
 
 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer
 .ja
  va:262)
 at
 
 org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java
 :98
  )
 at
  org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
 at
 
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50
 )
 at
 
 org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java
 :71
  3)
 at
  org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
 at
 
 org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:
 128
  2)
 at
 
 org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:51
 8)
 at
 
 org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
 at
 
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50
 )
 at
 
 org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.j
 ava
  :152)
 at
 
 org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandl
 erC
  ollection.java:156)
 at
 
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50
 )
 at
 
 org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.j
 ava
  :152)
 at
 
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50
 )
 at
 
 org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:13
 0)
 at org.mortbay.jetty.Server.doStart(Server.java:224)
 at
 
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50
 )
 at
  org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.ja
 va:
  39)
 at
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccesso
 rIm
  pl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.mortbay.start.Main.invokeMain(Main.java:194)
 at org.mortbay.start.Main.start(Main.java:534)
 at org.mortbay.start.Main.start(Main.java:441)
 at org.mortbay.start.Main.main(Main.java:119)
  Caused by: org.apache.solr.common.SolrException: FATAL: Could not
 create
  importer. DataImporter config invalid
 at
 
 org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportH
 and
  ler.java:120)
 at
 
 org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:
 542
  )
 at org.apache.solr.core.SolrCore.init(SolrCore.java:601)
 ... 31 more
  Caused by: org.apache.solr.common.cloud.ZooKeeperException:
  ZkSolrResourceLoader does not support getConfigDir() - likely, w
 at
 
 org.apache.solr.cloud.ZkSolrResourceLoader.getConfigDir(ZkSolrResourceL
 oad
  er.java:99)
 at
 
 org.apache.solr.handler.dataimport.SimplePropertiesWriter.init(SimplePr
 ope
  rtiesWriter.java:47)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.init(DataImporter.jav
 a:1
  12

Problem with SolrCloud + Zookeeper + DataImportHandler

2012-02-20 Thread Agnieszka Kukałowicz
Hi All,

I've recently downloaded the latest Solr trunk to configure SolrCloud with
ZooKeeper, using the standard configuration from the wiki:
http://wiki.apache.org/solr/SolrCloud.

The problem occurred when I tried to configure DataImportHandler in
solrconfig.xml:

  <requestHandler name="/dataimport"
      class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </requestHandler>


After starting Solr with ZooKeeper I got these errors:

Feb 20, 2012 11:30:12 AM org.apache.solr.common.SolrException log
SEVERE: null:org.apache.solr.common.SolrException
        at org.apache.solr.core.SolrCore.init(SolrCore.java:606)
        at org.apache.solr.core.SolrCore.init(SolrCore.java:490)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:705)
        at org.apache.solr.core.CoreContainer.load(CoreContainer.java:442)
        at org.apache.solr.core.CoreContainer.load(CoreContainer.java:313)
        at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:262)
        at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:98)
        at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
        at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
        at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
        at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
        at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
        at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
        at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
        at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
        at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
        at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
        at org.mortbay.jetty.Server.doStart(Server.java:224)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
        at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.mortbay.start.Main.invokeMain(Main.java:194)
        at org.mortbay.start.Main.start(Main.java:534)
        at org.mortbay.start.Main.start(Main.java:441)
        at org.mortbay.start.Main.main(Main.java:119)
Caused by: org.apache.solr.common.SolrException: FATAL: Could not create
importer. DataImporter config invalid
        at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:120)
        at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:542)
        at org.apache.solr.core.SolrCore.init(SolrCore.java:601)
        ... 31 more
Caused by: org.apache.solr.common.cloud.ZooKeeperException:
ZkSolrResourceLoader does not support getConfigDir() - likely, w
        at org.apache.solr.cloud.ZkSolrResourceLoader.getConfigDir(ZkSolrResourceLoader.java:99)
        at org.apache.solr.handler.dataimport.SimplePropertiesWriter.init(SimplePropertiesWriter.java:47)
        at org.apache.solr.handler.dataimport.DataImporter.init(DataImporter.java:112)
        at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:114)
        ... 33 more

I've checked that the file db-data-config.xml is available in ZooKeeper:

[zk: localhost:2181(CONNECTED) 0] ls /configs/conf1
[admin-extra.menu-top.html, dict, solrconfig.xml, dataimport.properties,
admin-extra.html, solrconfig.xml.old, solrconfig.xml.new, solrconfig.xml~,
xslt, db-data-config.xml, velocity, elevate.xml,
admin-extra.menu-bottom.html, solrconfig.xml.dataimport, schema.xml]
[zk: localhost:2181(CONNECTED) 1]

Is it possible to configure DIH with ZooKeeper, and if so, how?
I'm a little confused about this.

Regards
Agnieszka Kukalowicz