solr admin form query (full interface) - unknown handler: select

2012-05-07 Thread Robert Petersen
Hi solr users and solr dev guys, 

I just wanted to point out that the admin form in solr 3.6 seems to have
a bug in the 'full interface' link off 'Make a Query'...  I couldn't
find any mention of this on markmail under solr-user so I thought I'd
bring it up.  I just upgraded from solr 1.4 so I don't know if this was
an issue in previous 3.x versions of solr.

The full interface query form throws an error 'unknown handler: /select'
where it appears that there is no trailing '/' character in the url.
The qt parameter seems to cause problems also.  

Form generated url:
http://myserverurl:8983/solr/1/select?indent=on&version=2.2&qt=%2Fselect&q=gps&fq=&start=0&rows=10&fl=*%2Cscore&wt=&explainOther=&hl.fl=

If I manually fix the url and remove the qt parameter then it works of
course.
http://myserverurl:8983/solr/1/select/?indent=on&version=2.2&q=gps&fq=&start=0&rows=10&fl=*%2Cscore&wt=&explainOther=&hl.fl=

I just wanted to mention this for the benefit of others.

Thanks,
Robi



RE: solr admin form query (full interface) - unknown handler: select

2012-05-07 Thread Robert Petersen
Hi Jack,

That is interesting!  I hadn't realized but I guess mine varies slightly
from the example.  I show my version below.  It is like this because I
basically merged my 1.4 schema and solr configs with the example 3.6
configs (btw everything else is working fine):

<requestDispatcher handleSelect="true">
  <httpCaching never304="true">
    <!-- a bunch of comments -->
  </httpCaching>
</requestDispatcher>


<requestHandler name="standard" class="solr.SearchHandler"
    default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="fl">*,score</str>
  </lst>
</requestHandler>
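
For comparison, the stock 3.6 example solrconfig.xml registers the
handler under the path name instead, which is what lets
handleSelect="true" resolve the form's qt=/select parameter.  A minimal
sketch (defaults trimmed; check the example config for the full set):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="rows">10</str>
  </lst>
</requestHandler>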

Thanks
Robi

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Monday, May 07, 2012 11:12 AM
To: solr-user@lucene.apache.org
Subject: Re: solr admin form query (full interface) - unknown handler:
select

I don't see that problem with a fresh, unmodified 3.6 using the example.
The qt parameter doesn't show up in the query URL unless I modify the
Request Handler to something other than /select.

Here's the query URL I get:

http://localhost:8983/solr/select?indent=on&version=2.2&q=gps&fq=&start=0&rows=10&fl=*%2Cscore&wt=&explainOther=&hl.fl=

Have you modified your solrconfig, for example requestDispatcher, 
handleSelect, the /select request handler, etc.?

-- Jack Krupansky

-Original Message- 
From: Robert Petersen
Sent: Monday, May 07, 2012 1:44 PM
To: solr-user@lucene.apache.org
Subject: solr admin form query (full interface) - unknown handler:
select

Hi solr users and solr dev guys,

I just wanted to point out that the admin form in solr 3.6 seems to have
a bug in the 'full interface' link off 'Make a Query'...  I couldn't
find any mention of this on markmail under solr-user so I thought I'd
bring it up.  I just upgraded from solr 1.4 so I don't know if this was
an issue in previous 3.x versions of solr.

The full interface query form throws an error 'unknown handler: /select'
where it appears that there is no trailing '/' character in the url.
The qt parameter seems to cause problems also.

Form generated url:
http://myserverurl:8983/solr/1/select?indent=on&version=2.2&qt=%2Fselect&q=gps&fq=&start=0&rows=10&fl=*%2Cscore&wt=&explainOther=&hl.fl=

If I manually fix the url and remove the qt parameter then it works of
course.
http://myserverurl:8983/solr/1/select/?indent=on&version=2.2&q=gps&fq=&start=0&rows=10&fl=*%2Cscore&wt=&explainOther=&hl.fl=

I just wanted to mention this for the benefit of others.

Thanks,
Robi 



RE: need some help with a multicore config of solr3.6.0+tomcat7. mine reports: Severe errors in solr configuration.

2012-05-02 Thread Robert Petersen
I don't know if this will help but I usually add a dataDir element to
each core's solrconfig.xml to point at a local data folder for the core,
like this:


<!-- Used to specify an alternate directory to hold all index data
     other than the default ./data under the Solr home.
     If replication is in use, this should match the replication
     configuration. -->
<dataDir>${solr.data.dir:./solr/core0/data}</dataDir>
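
And the same idea in the second core's solrconfig.xml (a sketch; point
it at wherever core1's index should actually live):

<dataDir>${solr.data.dir:./solr/core1/data}</dataDir>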


-Original Message-
From: loc...@mm.st [mailto:loc...@mm.st] 
Sent: Wednesday, May 02, 2012 1:06 PM
To: solr-user@lucene.apache.org
Subject: need some help with a multicore config of solr3.6.0+tomcat7.
mine reports: Severe errors in solr configuration.


i've installed tomcat7 and solr 3.6.0 on linux/64

i'm trying to get a single webapp + multicore setup working.  my efforts
have gone off the rails :-/ i suspect i've followed too many of the
wrong examples.

i'd appreciate some help/direction getting this working.

so far, i've configured

grep Connector /etc/tomcat7/server.xml -A2 -B2
 Java AJP  Connector: /docs/config/ajp.html
 APR (HTTP/AJP) Connector: /docs/apr.html
 Define a non-SSL HTTP/1.1 Connector on port
--
<Connector port= protocol=HTTP/1.1
   connectionTimeout=2
   redirectPort=8443 />
--
<!--
<Connector executor=tomcatThreadPool
   port= protocol=HTTP/1.1
   connectionTimeout=2
   redirectPort=8443 />

cat /etc/tomcat7/Catalina/localhost/solr.xml
<Context docBase="/srv/tomcat7/webapps/solr.war"
    debug="0" privileged="true" allowLinking="true"
    crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
      value="/srv/www/solrbase" override="true" />
</Context>

after tomcat restart,

ps ax | grep tomcat
 6129 pts/4  Sl  0:06 /etc/alternatives/jre/bin/java
   -classpath :/usr/share/tomcat7/bin/bootstrap.jar:/usr/share/tomcat7/bin/tomcat-juli.jar:/usr/share/java/commons-daemon.jar
   -Dcatalina.base=/usr/share/tomcat7
   -Dcatalina.home=/usr/share/tomcat7 -Djava.endorsed.dirs=
   -Djava.io.tmpdir=/var/cache/tomcat7/temp
   -Djava.util.logging.config.file=/usr/share/tomcat7/conf/logging.properties
   -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
   org.apache.catalina.startup.Bootstrap start

if i nav to

 http://127.0.0.1:

i see as expected

 Server Information
  Tomcat Version:   Apache Tomcat/7.0.26
  JVM Version:      1.7.0_147-icedtea-b147
  JVM Vendor:       Oracle Corporation
  OS Name:          Linux
  OS Version:       3.1.10-1.9-desktop
  OS Architecture:  amd64

now, i'm trying to set up multicore properly.  i configured,

cat /srv/www/solrbase/solr.xml
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="false">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>

then

mkdir -p /srv/www/solrbase/{core0,core1}
cp -a /srv/www/solrbase/conf /srv/www/solrbase/core0/
cp -a /srv/www/solrbase/conf /srv/www/solrbase/core1/

if i nav to

http://localhost:/solr/core0

i get,

HTTP Status 500 - Severe errors in solr configuration. Check
your log files for more detailed information on what may be
wrong. If you want solr to continue after configuration errors,
change:
<abortOnConfigurationError>false</abortOnConfigurationError> in
solr.xml
-
org.apache.solr.common.SolrException: No cores were created,
please check the logs for errors
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:172)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:96)
    at org.apache.catalina.core.ApplicationFilterConfig.initFilter(ApplicationFilterConfig.java:277)
    at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:258)
    at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:382)
    at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:103)
    at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4638)
    at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5294)
    ...

solr broke a pipe

2012-05-02 Thread Robert Petersen
Anyone have any clues about this exception?  It happened during the
course of normal indexing.  This is new to me (we're running solr 3.6 on
tomcat 6/redhat RHEL) and we've been running smoothly for some time now
until this showed up:

Red Hat Enterprise Linux Server release 5.3 (Tikanga)
Apache Tomcat Version 6.0.20
java.runtime.version = 1.6.0_25-b06
java.vm.name = Java HotSpot(TM) 64-Bit Server VM

May 2, 2012 4:07:48 PM org.apache.solr.handler.ReplicationHandler$FileStream write
WARNING: Exception while writing response for params:
indexversion=1276893500358&file=_1uca.frq&command=filecontent&checksum=true&wt=filestream

ClientAbortException:  java.net.SocketException: Broken pipe
    at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:358)
    at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:354)
    at org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:381)
    at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:370)
    at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:89)
    at org.apache.solr.common.util.FastOutputStream.write(FastOutputStream.java:87)
    at org.apache.solr.handler.ReplicationHandler$FileStream.write(ReplicationHandler.java:1076)
    at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:936)
    at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:345)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(Unknown Source)
    at java.net.SocketOutputStream.write(Unknown Source)
    at org.apache.coyote.http11.InternalOutputBuffer.realWriteBytes(InternalOutputBuffer.java:740)
    at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:434)
    at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:349)
    at org.apache.coyote.http11.InternalOutputBuffer$OutputStreamOutputBuffer.doWrite(InternalOutputBuffer.java:764)
    at org.apache.coyote.http11.filters.ChunkedOutputFilter.doWrite(ChunkedOutputFilter.java:126)
    at org.apache.coyote.http11.InternalOutputBuffer.doWrite(InternalOutputBuffer.java:573)
    at org.apache.coyote.Response.doWrite(Response.java:560)
    at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:353)
    ... 21 more



what's best to use for monitoring solr 3.6 farm on redhat/tomcat

2012-04-17 Thread Robert Petersen
Hello solr users,

 

Is there any lightweight tool of choice for monitoring multiple solr
boxes for memory consumption, heap usage, and other statistics?  We have
a pretty large farm of RHEL servers running solr now.  Up until migrating
from 1.4 to 3.6 we were running the Lucid Gaze component on each box for
these stats, but it doesn't function under solr 3.x, and it was
cumbersome anyway since we had to hit each box separately.  What do the
rest of you guys use to keep tabs on your servers?  We're running solr
3.6 in tomcat on RHEL:

 

Red Hat Enterprise Linux Server release 5.3 (Tikanga)

Apache Tomcat Version 6.0.20

java.runtime.version = 1.6.0_25-b06

java.vm.name = Java HotSpot(TM) 64-Bit Server VM
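
For what it's worth, one generic fallback would be exposing Solr's
statistics beans over JMX so any JMX poller can scrape each box; a
minimal sketch of the solrconfig.xml switch (this also assumes Tomcat is
started with the usual com.sun.management.jmxremote options):

<!-- in solrconfig.xml: register Solr's MBeans with the platform MBean server -->
<jmx />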

 

 

Thanks,

 

Robert (Robi) Petersen

Senior Software Engineer

Site Search Specialist

 



RE: what's best to use for monitoring solr 3.6 farm on redhat/tomcat

2012-04-17 Thread Robert Petersen
Wow that looks like just what the doctor ordered!  Thanks Otis

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Tuesday, April 17, 2012 1:29 PM
To: solr-user@lucene.apache.org
Subject: Re: what's best to use for monitoring solr 3.6 farm on redhat/tomcat

Hi Robert,

Have a look at SPM for 
Solr: http://sematext.com/spm/solr-performance-monitoring/index.html

It has all Solr metrics, works with 3.*, has a bunch of system metrics,
filtering, alerting, email subscriptions, no loss of granularity, and you can
use it to monitor other types of systems (e.g. HBase, ElasticSearch, Sensei...)
and, starting with the next version, pretty much any Java app (not necessarily
a webapp).

Otis

Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html




 From: Robert Petersen rober...@buy.com
To: solr-user@lucene.apache.org 
Sent: Tuesday, April 17, 2012 12:02 PM
Subject: what's best to use for monitoring solr 3.6 farm on redhat/tomcat
 
Hello solr users,



Is there any lightweight tool of choice for monitoring multiple solr
boxes for memory consumption, heap usage, and other statistics?  We have
a pretty large farm of RHEL servers running solr now.  Up until migrating
from 1.4 to 3.6 we were running the Lucid Gaze component on each box for
these stats, but it doesn't function under solr 3.x, and it was
cumbersome anyway since we had to hit each box separately.  What do the
rest of you guys use to keep tabs on your servers?  We're running solr
3.6 in tomcat on RHEL:



Red Hat Enterprise Linux Server release 5.3 (Tikanga)

Apache Tomcat Version 6.0.20

java.runtime.version = 1.6.0_25-b06

java.vm.name = Java HotSpot(TM) 64-Bit Server VM





Thanks,



Robert (Robi) Petersen

Senior Software Engineer

Site Search Specialist








RE: [ANNOUNCE] Apache Solr 3.6 released

2012-04-12 Thread Robert Petersen
I think this page needs updating...  it says it's not out yet.  

https://wiki.apache.org/solr/Solr3.6


-Original Message-
From: Robert Muir [mailto:rm...@apache.org] 
Sent: Thursday, April 12, 2012 1:33 PM
To: d...@lucene.apache.org; solr-user@lucene.apache.org; Lucene mailing list; 
announce
Subject: [ANNOUNCE] Apache Solr 3.6 released

12 April 2012, Apache Solr™ 3.6.0 available
The Lucene PMC is pleased to announce the release of Apache Solr 3.6.0.

Solr is the popular, blazing fast open source enterprise search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search, dynamic clustering, database
integration, rich document (e.g., Word, PDF) handling, and geospatial search.
Solr is highly scalable, providing distributed search and index replication,
and it powers the search and navigation features of many of the world's
largest internet sites.

This release contains numerous bug fixes, optimizations, and
improvements, some of which are highlighted below.  The release
is available for immediate download at:
   http://lucene.apache.org/solr/mirrors-solr-latest-redir.html (see
note below).

See the CHANGES.txt file included with the release for a full list of
details.

Solr 3.6.0 Release Highlights:

 * New SolrJ client connector using Apache Http Components http client
   (SOLR-2020)

 * Many analyzer factories are now multi term query aware allowing for things
   like field type aware lowercasing when building prefix & wildcard queries.
   (SOLR-2438)

 * New Kuromoji morphological analyzer tokenizes Japanese text, producing
   both compound words and their segmentation. (SOLR-3056)

 * Range Faceting (Dates & Numbers) is now supported in distributed search
   (SOLR-1709)

 * HTMLStripCharFilter has been completely re-implemented, fixing many bugs
   and greatly improving the performance (LUCENE-3690)

 * StreamingUpdateSolrServer now supports the javabin format (SOLR-1565)

 * New LFU Cache option for use in Solr's internal caches. (SOLR-2906)

 * Memory performance improvements to all FST based suggesters (SOLR-2888)

 * New WFSTLookupFactory suggester supports finer-grained ranking for
   suggestions. (LUCENE-3714)

 * New options for configuring the amount of concurrency used in distributed
   searches (SOLR-3221)

 * Many bug fixes

Note: The Apache Software Foundation uses an extensive mirroring network for
distributing releases.  It is possible that the mirror you are using may not
have replicated the release yet.  If that is the case, please try another
mirror.  This also goes for Maven access.

Happy searching,

Lucene/Solr developers


RE: how to correctly facet clothing multiple sizes and colors?

2012-04-10 Thread Robert Petersen
Well yes but in my experience people generally search for something
particular... then select colors and sizes thereafter.

-Original Message-
From: danjfoley [mailto:d...@micamedia.com] 
Sent: Monday, April 09, 2012 4:18 PM
To: solr-user@lucene.apache.org
Subject: Re: how to correctly facet clothing multiple sizes and colors?

The problem with that approach is that if you selected say large and red
you'd get back all the products with large and red as variants, not the
products with red in the large size, as would be expected.
Sent from my phone

- Reply message -
From: Andrew Harvey [via Lucene]
ml-node+s472066n3898049...@n3.nabble.com
Date: Mon, Apr 9, 2012 5:21 pm
Subject: how to correctly facet clothing multiple sizes and colors?
To: danjfoley d...@micamedia.com



What we do in our application is exactly what Robert described. We index
Products, not variants. The variant data (colour, size etc.) is
denormalised into the product document at index time. We then facet on
the variant attributes and get product count instead of variant count. 

What you're seeing are correct results. You are indexing 6 documents, as
you said before. You actually only want to index one document with
multi-valued fields. 

Hope that's somehow helpful,

Andrew

On 10/04/2012, at 3:01, Robert Petersen rober...@buy.com wrote:

 You *could* do it by making one and only one solr document for each
 clothing item, then just have the front end render all the sizes and
 colors available for that item as size/color pickers on the product
 page.  You can add all the colors and sizes to the one document in the
 index so they are searchable also, but the caveat is that they won't
 show up as a facet.  This is just one simple approach.
 
 -Original Message-
 From: danjfoley [mailto:d...@micamedia.com] 
 Sent: Saturday, April 07, 2012 7:04 PM
 To: solr-user@lucene.apache.org
 Subject: how to correctly facet clothing multiple sizes and colors?
 
 I've been searching for a solution to my issue, and this seems to come
 closest to it. But not exactly. 
 
 I am indexing clothing. Each article of clothing comes in many sizes
and
 colors, and can belong to any number of categories. 
 
 For example take the following: I add 6 documents to solr as follows: 
 
 product, color, size, category 
 
 shirt A, red, small, valentines day 
 shirt A, red, large, valentines day 
 shirt A, blue, small, valentines day 
 shirt A, blue, large, valentines day 
 shirt A, green, small, valentines day 
 shirt A, green, large, valentines day 
 
 I'd like my facet counts to return as follows: 
 
 color 
 
 red (1) 
 blue (1) 
 green (1) 
 
 size 
 
 small (1) 
 large (1) 
 
 category 
 
 valentines day (1) 
 
 But they come back like this: 
 
 color: 
 red (2) 
 blue (2) 
 green (2) 
 
 size: 
 small (2) 
 large (2) 
 
 category 
 valentines day (6) 
 
 I see the group.facet parameter in version 4.0 does exactly this.
 However, how can I make this happen now?  There are all sorts of
 ecommerce systems out there that facet exactly how I'm asking.  I
 thought solr is supposed to be the very best, fastest search system,
 yet it doesn't seem to be able to facet correctly for items with
 multiple values?
 
 Am i indexing my data wrong? 
 
 how can i make this happen?
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/how-to-correctly-facet-clothing-multiple-sizes-and-colors-tp3893747p3893747.html
 Sent from the Solr - User mailing list archive at Nabble.com.



--
View this message in context:
http://lucene.472066.n3.nabble.com/how-to-correctly-facet-clothing-multiple-sizes-and-colors-tp3893747p3898271.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: how to correctly facet clothing multiple sizes and colors?

2012-04-09 Thread Robert Petersen
You *could* do it by making one and only one solr document for each
clothing item, then just have the front end render all the sizes and
colors available for that item as size/color pickers on the product
page.  You can add all the colors and sizes to the one document in the
index so they are searchable also, but the caveat is that they won't
show up as a facet.  This is just one simple approach.
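
For illustration, a sketch of what that single document could look like
in Solr's XML update format (the field names here are made up; color,
size and category would be multiValued in schema.xml):

<add>
  <doc>
    <field name="product">shirt A</field>
    <field name="color">red</field>
    <field name="color">blue</field>
    <field name="color">green</field>
    <field name="size">small</field>
    <field name="size">large</field>
    <field name="category">valentines day</field>
  </doc>
</add>

With one document per product, faceting on color or size counts products
rather than variants, which is the red (1) / small (1) style of counts
asked about below.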

-Original Message-
From: danjfoley [mailto:d...@micamedia.com] 
Sent: Saturday, April 07, 2012 7:04 PM
To: solr-user@lucene.apache.org
Subject: how to correctly facet clothing multiple sizes and colors?

I've been searching for a solution to my issue, and this seems to come
closest to it. But not exactly. 

I am indexing clothing. Each article of clothing comes in many sizes and
colors, and can belong to any number of categories. 

For example take the following: I add 6 documents to solr as follows: 

product, color, size, category 

shirt A, red, small, valentines day 
shirt A, red, large, valentines day 
shirt A, blue, small, valentines day 
shirt A, blue, large, valentines day 
shirt A, green, small, valentines day 
shirt A, green, large, valentines day 

I'd like my facet counts to return as follows: 

color 

red (1) 
blue (1) 
green (1) 

size 

small (1) 
large (1) 

category 

valentines day (1) 

But they come back like this: 

color: 
red (2) 
blue (2) 
green (2) 

size: 
small (2) 
large (2) 

category 
valentines day (6) 

I see the group.facet parameter in version 4.0 does exactly this.
However, how can I make this happen now?  There are all sorts of
ecommerce systems out there that facet exactly how I'm asking.  I thought
solr is supposed to be the very best, fastest search system, yet it
doesn't seem to be able to facet correctly for items with multiple
values?

Am i indexing my data wrong? 

how can i make this happen?

--
View this message in context:
http://lucene.472066.n3.nabble.com/how-to-correctly-facet-clothing-multiple-sizes-and-colors-tp3893747p3893747.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: upgrade solr from 1.4 to 3.5 not working

2012-04-06 Thread Robert Petersen
Note that I am trying to upgrade from the Lucid Imagination distribution
of Solr 1.4, dunno if that makes a difference.  We have an existing
index of 11 million documents which I am trying to preserve in the
upgrade process.

-Original Message-
From: Robert Petersen [mailto:rober...@buy.com] 
Sent: Thursday, April 05, 2012 2:21 PM
To: solr-user@lucene.apache.org
Subject: upgrade solr from 1.4 to 3.5 not working

Hi folks, I'm a little stumped here.

 

I have an existing solr 1.4 setup which is well configured.  I want to
upgrade to the latest solr release, and after reading release notes, the
wiki, etc, I concluded the correct path would be to not change any
config items and just replace the solr.war file in tomcat's webapps
folder with the new one and then start tomcat back up.

 

This worked fine, solr came up.  The problem is that on the solr info
page it still says that I am running solr 1.4 even after several
restarts and even a server reboot.  Am I missing something?  Info says
this even though there is no solr 1.4 war file anywhere under tomcat root:

 

Solr Specification Version: 1.4.0.2009.12.10.10.34.34

Solr Implementation Version: 1.4 exported - sam - 2009-12-10
10:34:34

Lucene Specification Version: 2.9.1

Lucene Implementation Version: 2.9.1 exported - 2009-12-10
10:32:14

Current Time: Thu Apr 05 12:56:12 PDT 2012

Server Start Time: Thu Apr 05 12:52:25 PDT 2012

 

Any help would be appreciated.

Thanks

Robi



RE: upgrade solr from 1.4 to 3.5 not working

2012-04-06 Thread Robert Petersen
OK I found in the tomcat documentation that I not only have to drop the
war file into webapps but also have to delete the expanded version of
the war that tomcat makes.  Now tomcat doesn't find the velocity
response writer which I seem to recall seeing some note about.  I'll try
to find that again.  Thanks for the help?  Oh well...

-Original Message-
From: Robert Petersen [mailto:rober...@buy.com] 
Sent: Friday, April 06, 2012 8:27 AM
To: solr-user@lucene.apache.org
Subject: RE: upgrade solr from 1.4 to 3.5 not working

Note that I am trying to upgrade from the Lucid Imagination distribution
of Solr 1.4, dunno if that makes a difference.  We have an existing
index of 11 million documents which I am trying to preserve in the
upgrade process.

-Original Message-
From: Robert Petersen [mailto:rober...@buy.com] 
Sent: Thursday, April 05, 2012 2:21 PM
To: solr-user@lucene.apache.org
Subject: upgrade solr from 1.4 to 3.5 not working

Hi folks, I'm a little stumped here.

 

I have an existing solr 1.4 setup which is well configured.  I want to
upgrade to the latest solr release, and after reading release notes, the
wiki, etc, I concluded the correct path would be to not change any
config items and just replace the solr.war file in tomcat's webapps
folder with the new one and then start tomcat back up.

 

This worked fine, solr came up.  The problem is that on the solr info
page it still says that I am running solr 1.4 even after several
restarts and even a server reboot.  Am I missing something?  Info says
this even though there is no solr 1.4 war file anywhere under tomcat root:

 

Solr Specification Version: 1.4.0.2009.12.10.10.34.34

Solr Implementation Version: 1.4 exported - sam - 2009-12-10
10:34:34

Lucene Specification Version: 2.9.1

Lucene Implementation Version: 2.9.1 exported - 2009-12-10
10:32:14

Current Time: Thu Apr 05 12:56:12 PDT 2012

Server Start Time: Thu Apr 05 12:52:25 PDT 2012

 

Any help would be appreciated.

Thanks

Robi



upgrade solr from 1.4 to 3.5 not working

2012-04-05 Thread Robert Petersen
Hi folks, I'm a little stumped here.

 

I have an existing solr 1.4 setup which is well configured.  I want to
upgrade to the latest solr release, and after reading release notes, the
wiki, etc, I concluded the correct path would be to not change any
config items and just replace the solr.war file in tomcat's webapps
folder with the new one and then start tomcat back up.

 

This worked fine, solr came up.  The problem is that on the solr info
page it still says that I am running solr 1.4 even after several
restarts and even a server reboot.  Am I missing something?  Info says
this even though there is no solr 1.4 war file anywhere under tomcat root:

 

Solr Specification Version: 1.4.0.2009.12.10.10.34.34

Solr Implementation Version: 1.4 exported - sam - 2009-12-10
10:34:34

Lucene Specification Version: 2.9.1

Lucene Implementation Version: 2.9.1 exported - 2009-12-10
10:32:14

Current Time: Thu Apr 05 12:56:12 PDT 2012

Server Start Time: Thu Apr 05 12:52:25 PDT 2012

 

Any help would be appreciated.

Thanks

Robi



RE: Core overhead

2011-12-15 Thread Robert Petersen
I am running eight cores, each core serves up different types of
searches so there is no overlap in their function.  Some cores have
millions of documents.  My search times are quite fast.  I don't see any
real slowdown from multiple cores, but you just have to have enough
memory for them. Memory simply has to be big enough to hold what you are
loading.  Try it out, but make sure that the functionality you are
actually looking for isn't sharding instead of multiple cores...  

http://wiki.apache.org/solr/DistributedSearch
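
Distributed search itself is just an extra parameter on an ordinary
query; a sketch with hypothetical host names:

http://box1:8983/solr/select?q=ipod&shards=box1:8983/solr,box2:8983/solr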


-Original Message-
From: Yury Kats [mailto:yuryk...@yahoo.com] 
Sent: Thursday, December 15, 2011 10:31 AM
To: solr-user@lucene.apache.org
Subject: Re: Core overhead

On 12/15/2011 1:07 PM, Robert Stewart wrote:

 I think overall memory usage would be close to the same.

Is this really so? I suspect that the consumed memory is in direct
proportion to the number of terms in the index. I also suspect that
if I divided 1 core with N terms into 10 smaller cores, each smaller
core would have much more than N/10 terms. Let's say I'm indexing
English texts, it's likely that all smaller cores would have almost
the same number of terms, close to the original N. Not so?


RE: Core overhead

2011-12-15 Thread Robert Petersen
Sure that is possible, but doesn't that defeat the purpose of sharding?
Why distribute across one machine?  Just keep all in one index in that
case is my thought there...

-Original Message-
From: Yury Kats [mailto:yuryk...@yahoo.com] 
Sent: Thursday, December 15, 2011 11:47 AM
To: solr-user@lucene.apache.org
Subject: Re: Core overhead

On 12/15/2011 1:41 PM, Robert Petersen wrote:
 loading.  Try it out, but make sure that the functionality you are
 actually looking for isn't sharding instead of multiple cores...  

Yes, but the way to achieve sharding is to have multiple cores.
The question is then becomes -- how many cores (shards)?


RE: Core overhead

2011-12-15 Thread Robert Petersen
I see there are a lot of discussions about micro-sharding, I'll have to
read them.  I'm on an older version of solr and just use master index
replicating out to a farm of slaves.  It always seemed like sharding
causes a lot of background traffic to me when I read about it, but I
never tried it out.  Thanks for the heads up on that topic...  :)

-Original Message-
From: Yury Kats [mailto:yuryk...@yahoo.com] 
Sent: Thursday, December 15, 2011 2:16 PM
To: solr-user@lucene.apache.org
Subject: Re: Core overhead

On 12/15/2011 4:46 PM, Robert Petersen wrote:
 Sure that is possible, but doesn't that defeat the purpose of
sharding?
 Why distribute across one machine?  Just keep all in one index in that
 case is my thought there...

To be able to scale w/o re-indexing. Also often referred to as
micro-sharding.


RE: Questions about Solr's security

2011-11-03 Thread Robert Petersen
Me too!

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Tuesday, November 01, 2011 1:02 PM
To: solr-user@lucene.apache.org
Subject: Re: Questions about Solr's security

I once had to deal with a severe performance problem caused by a bot
that was requesting results starting at 5000. We disallowed requests
over a certain number of pages in the front end to fix it.

wunder

On Nov 1, 2011, at 12:57 PM, Erik Hatcher wrote:

 Be aware that even /select could have some harmful effects, see
https://issues.apache.org/jira/browse/SOLR-2854 (addressed on trunk).
 
 Even disregarding that issue, /select is a potential gateway to any
request handler defined via /select?qt=/req_handler
 
 Again, in general it's not a good idea to expose Solr to anything but
a controlled app server.  
 
   Erik
 
 On Nov 1, 2011, at 15:51 , Alireza Salimi wrote:
 
 What if we just expose '/select' paths - by firewalls and load
balancers -
 and
 also use SSL and HTTP basic or digest access control?
 
 On Tue, Nov 1, 2011 at 2:20 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:
 
 
 : I was wondering if it's a good idea to expose Solr to the outside
world,
 : so that our clients running on smart phones will be able to use
Solr.
 
 As a general rule of thumb, i would say that it is not a good idea
to
 expose solr directly to the public internet.
 
 there are exceptions to this rule -- AOL hosted some live solr
instances
 of the Sarah Palin emails for HufPo -- but it is definitely an
expert
 level type thing for people who are so familiar with solr they know
 exactly what to lock down to make it safe
 
 for typical users: put an application between your untrusted users and
 solr and only let that application generate safe, well-formed requests
 to Solr...
 
 https://wiki.apache.org/solr/SolrSecurity
 
 
 -Hoss
 
 
 
 
 -- 
 Alireza Salimi
 Java EE Developer
 

--
Walter Underwood
Venture Asst. Scoutmaster
Troop 14, Palo Alto, CA





difference between analysis output and searches

2011-10-29 Thread Robert Petersen
Why is it that I can see in the analysis admin page an obvious match
between terms, yet sometimes they don't come back in searches?  Debug
output on the searches indicate a non-match yet the analysis page shows
an obvious match.  I don't get it.



i don't get why this says non-match

2011-10-28 Thread Robert Petersen
It looks to me like everything matches down the line but top level says
otherQuery is a non-match... I don't get it?
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">77</int>
  <lst name="params">
    <str name="explainOther">SyncMaster</str>
    <str name="fl">*,score</str>
    <str name="debugQuery">on</str>
    <str name="indent">on</str>
    <str name="start">0</str>
    <str name="q">+syncmaster -SyncMaster</str>
    <str name="hl.fl"/>
    <str name="qt">standard</str>
    <str name="wt">standard</str>
    <str name="fq"/>
    <str name="rows">41</str>
    <str name="version">2.2</str>
  </lst>
</lst>
<result name="response" numFound="26" start="0" maxScore="1.6049292">...</result>
<lst name="debug">
  <str name="rawquerystring">+syncmaster -SyncMaster</str>
  <str name="querystring">+syncmaster -SyncMaster</str>
  <str name="parsedquery">+moreWords:syncmaster -MultiPhraseQuery(moreWords:"sync (master syncmaster)")</str>
  <str name="parsedquery_toString">+moreWords:syncmaster -moreWords:"sync (master syncmaster)"</str>
  <str name="otherQuery">SyncMaster</str>
  <lst name="explainOther">
    <str name="209730998">
0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
  1.4043131 = (MATCH) fieldWeight(moreWords:syncmaster in 46710), product of:
    1.4142135 = tf(termFreq(moreWords:syncmaster)=2)
    9.078851 = idf(docFreq=41, maxDocs=135472)
    0.109375 = fieldNorm(field=moreWords, doc=46710)
  0.0 = match on prohibited clause (moreWords:"sync (master syncmaster)")
    9.393997 = (MATCH) weight(moreWords:"sync (master syncmaster)" in 46710), product of:
      2.5863855 = queryWeight(moreWords:"sync (master syncmaster)"), product of:
        23.481407 = idf(moreWords:"sync (master syncmaster)")
        0.1101461 = queryNorm
      3.6320949 = (MATCH) fieldWeight(moreWords:"sync (master syncmaster)" in 46710), product of:
        1.4142135 = tf(phraseFreq=2.0)
        23.481407 = idf(moreWords:"sync (master syncmaster)")
        0.109375 = fieldNorm(field=moreWords, doc=46710)
    </str>



RE: Trouble configuring multicore / accessing admin page

2011-09-28 Thread Robert Petersen
Just go to localhost:8983 (or whatever other port you are using) and use
this path to see all the cores available on the box:

In your example this should give you a core list:

http://solrhost:8080/solr/
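
For reference, a sketch of the solr.xml tweak Shawn suggests below (the
defaultCoreName attribute comes from the CoreAdmin wiki page he links),
which should let the bare /solr/admin URL resolve again:

<cores adminPath="/admin/cores" defaultCoreName="core0">
  <core name="core0" instanceDir="core0" />
</cores>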

-Original Message-
From: Joshua Miller [mailto:jos...@itsecureadmin.com] 
Sent: Wednesday, September 28, 2011 1:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Trouble configuring multicore / accessing admin page

On Sep 28, 2011, at 1:03 PM, Shawn Heisey wrote:

 On 9/28/2011 1:40 PM, Joshua Miller wrote:
 I am trying to get SOLR working with multiple cores and have a
problem accessing the admin page once I configure multiple cores.
 
 Problem:
 When accessing the admin page via http://solrhost:8080/solr/admin, I
get a 404, missing core name in path.
 
 Question:  when using the multicore option, is the standard admin
page still available?
 
 When you enable multiple cores, the URL syntax becomes a little
different.  On 1.4.1 and 3.2.0, I ran into a problem where the trailing
/ is required on this URL, but that problem seems to be fixed in 3.4.0:
 
 http://host:port/solr/corename/admin/
 
 If you put a defaultCoreName="somecore" into the <cores> tag in
solr.xml, the original /solr/admin URL should work as well.  I just
tried it on Solr 3.4.0 and it does work.  According to the wiki, it
should work in 1.4 as well.  I don't have a 1.4.1 server any more, so I
can't verify that.
 
 http://wiki.apache.org/solr/CoreAdmin#cores

Hi Shawn,

Thanks for the quick response.

I can't get any of those combinations to work.

I've added the defaultCoreName="core0" into the solr.xml and restarted
and tried the following combinations:

http://host:port/solr/admin
http://host:port/solr/admin/
http://host:port/solr/core0/admin/
...
(and many others)

I'm stuck on 1.4.1 at least temporarily as I'm taking over an
application from another resource and need to get it up and running
before modifying anything so any help here would be greatly appreciated.

Thanks, 

Josh Miller
Open Source Solutions Architect
(425) 737-2590
http://itsecureadmin.com/


synonyms vs replacements

2011-08-26 Thread Robert Petersen
Hello all,

 

Which is better?   Say you add an index time synonym between nunchuck
and nunchuk and then both words will be in the document and both will be
searchable.   I can get the same exact behavior by putting an index time
replacement of nunchuck => nunchuk and a search time replacement of the
same.

 

I figured the replacement strategy keeps the index size slightly
smaller by only having the one term in the index, but the synonym
strategy only requires you to update the master, not the slave farm, and
requires slightly less work for the searchers during a user query.  Are
there any other considerations I should be aware of?
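
For concreteness, a sketch of the two setups as schema.xml analyzer
lines (the file names are just illustrative):

<!-- synonym strategy: expand at index time so both terms get indexed -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
<!-- synonyms.txt line:  nunchuck, nunchuk -->

<!-- replacement strategy: explicit mapping in both the index- and query-time analyzers -->
<filter class="solr.SynonymFilterFactory" synonyms="replacements.txt"
        ignoreCase="true" expand="false"/>
<!-- replacements.txt line:  nunchuck => nunchuk -->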

 

Thanks

 

BTW nunchuk is the correct spelling.  :)

 

 



RE: please help explaining debug output

2011-07-26 Thread Robert Petersen
That didn't help.  Seems like another case where I should get matches but don't 
and this time it is only for some documents.  Others with similar content do 
match just fine.  The debug output 'explain other' section for a non-matching 
document seems to say the term frequency is 0 for my problematic term, although 
I know it is in the content.  

I ended up making a synonym to do what the analysis stack *should* be doing:
splitting LaserJet on case changes.  I.e. putting "LaserJet, laser jet" in
synonyms at index time makes this work.  I don't know why though.
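
For reference, the piece of the stock text field type that is supposed
to handle this is the word delimiter filter; a sketch of the relevant
index-time line from the 1.4 example schema.xml (splitOnCaseChange="1"
is what should turn LaserJet into laser + jet):

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>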

Question:  Does this debug output mean it is matching the terms but the term
frequency vector is returning 0 for the frequency of this term?  I.e. does
this mean the term is in the doc but not in the tf array?

0.0 = no match on required clause (moreWords:"laser jet")
  0.0 = weight(moreWords:"laser jet" in 32497), product of:
    0.60590804 = queryWeight(moreWords:"laser jet"), product of:
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.041507367 = queryNorm
    0.0 = fieldWeight(moreWords:"laser jet" in 32497), product of:
      0.0 = tf(phraseFreq=0.0)
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.078125 = fieldNorm(field=moreWords, doc=32497)




-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Monday, July 25, 2011 3:28 PM
To: solr-user@lucene.apache.org
Subject: Re: please help explaining debug output

Hmmm, I can't find a convenient 1.4.0 to download, but re-indexing is a good
idea since this seems like it *should* work.

Erick

On Mon, Jul 25, 2011 at 5:32 PM, Robert Petersen rober...@buy.com wrote:
 I'm still on solr 1.4.0 and the analysis page looks like they should match, 
 and other products with the same content do in fact match.  I'm reindexing 
 the non-matching ones to rule that out.

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Monday, July 25, 2011 1:58 PM
 To: solr-user@lucene.apache.org
 Subject: Re: please help explaining debug output

 Hmmm, I'm assuming that moreWords is your default text field, yes?

 But it works for me (tm), using 1.4.1. What version of Solr are you on?

 Also, take a glance at the admin/analysis page, that might help...

 Gotta run

 Erick

 On Mon, Jul 25, 2011 at 4:52 PM, Robert Petersen rober...@buy.com wrote:
 Sorry, to clarify a search for P1102W matches all three docs but a
 search for p1102w LaserJet only matches the second two.  Someone asked
 me a question while I was typing and I got distracted, apologies for any
 confusion.

 -Original Message-
 From: Robert Petersen [mailto:rober...@buy.com]
 Sent: Monday, July 25, 2011 1:42 PM
 To: solr-user@lucene.apache.org
 Subject: please help explaining debug output

 I have three documents with the following product titles in a text field
 called moreWords with analysis stack matching the solr example text
 field definition.



 1.       HP LaserJet P1102W Monochrome Laser Printer
          http://www.buy.com/prod/hp-laserjet-p1102w-monochrome-laser-printer/q/loc/101/213824965.html

 2.       HP CE285A (85A) Remanufactured Black Toner Cartridge for
          LaserJet M1212nf, P1102, P1102W Series
          http://www.buy.com/prod/hp-ce285a-85a-remanufactured-black-toner-cartridge-for-laserjet/q/loc/101/217145536.html

 3.       Black HP CE285A Toner Cartridge For LaserJet P1102W, LaserJet
          M1130, LaserJet M1132, LaserJet M1210
          http://www.buy.com/prod/black-hp-ce285a-toner-cartridge-for-laserjet-p1102w-laserjet-m1130/q/loc/101/222045267.html



 A search for P1102W matches (2) and (3), but not (1) above.  Can someone
 explain the debug output?  It looks like I am getting a non-match on (1)
 because term frequency is zero?  Am I reading that right?  If so, how
 could that be? the searched terms are equivalently in all three docs.  I
 don't get it.





 <lst name="debug">
 <str name="rawquerystring">p1102w LaserJet</str>
 <str name="querystring">p1102w LaserJet</str>
 <str name="parsedquery">+PhraseQuery(moreWords:"p 1102 w") +PhraseQuery(moreWords:"laser jet")</str>
 <str name="parsedquery_toString">+moreWords:"p 1102 w" +moreWords:"laser jet"</str>
 <lst name="explain">
 <str name="222045267">
 3.64852 = (MATCH) sum of:
   2.4758534 = weight(moreWords:"p 1102 w" in 6667236), product of:
     0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
       19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
       0.041507367 = queryNorm
     3.1121879 = fieldWeight(moreWords:"p 1102 w" in 6667236), product of:
       1.7320508 = tf(phraseFreq=3.0)
       19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
       0.09375 = fieldNorm(field=moreWords, doc=6667236)
   1.1726664 = weight(moreWords:"laser jet" in 6667236), product of:
     0.60590804 = queryWeight(moreWords:"laser jet"), product of:
       14.597603 = idf(moreWords: laser=26731 jet=12685)
       0.041507367 = queryNorm
     1.9353869 = fieldWeight(moreWords:"laser jet" in 6667236), product

please help explaining debug output

2011-07-25 Thread Robert Petersen
I have three documents with the following product titles in a text field
called moreWords with analysis stack matching the solr example text
field definition.

 

1.   HP LaserJet P1102W Monochrome Laser Printer
     http://www.buy.com/prod/hp-laserjet-p1102w-monochrome-laser-printer/q/loc/101/213824965.html

2.   HP CE285A (85A) Remanufactured Black Toner Cartridge for
     LaserJet M1212nf, P1102, P1102W Series
     http://www.buy.com/prod/hp-ce285a-85a-remanufactured-black-toner-cartridge-for-laserjet/q/loc/101/217145536.html

3.   Black HP CE285A Toner Cartridge For LaserJet P1102W, LaserJet
     M1130, LaserJet M1132, LaserJet M1210
     http://www.buy.com/prod/black-hp-ce285a-toner-cartridge-for-laserjet-p1102w-laserjet-m1130/q/loc/101/222045267.html

 

A search for P1102W matches (2) and (3), but not (1) above.  Can someone
explain the debug output?  It looks like I am getting a non-match on (1)
because term frequency is zero?  Am I reading that right?  If so, how
could that be? the searched terms are equivalently in all three docs.  I
don't get it.

 

 

<lst name="debug">
<str name="rawquerystring">p1102w LaserJet</str>
<str name="querystring">p1102w LaserJet</str>
<str name="parsedquery">+PhraseQuery(moreWords:"p 1102 w") +PhraseQuery(moreWords:"laser jet")</str>
<str name="parsedquery_toString">+moreWords:"p 1102 w" +moreWords:"laser jet"</str>
<lst name="explain">
<str name="222045267">
3.64852 = (MATCH) sum of:
  2.4758534 = weight(moreWords:"p 1102 w" in 6667236), product of:
    0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.041507367 = queryNorm
    3.1121879 = fieldWeight(moreWords:"p 1102 w" in 6667236), product of:
      1.7320508 = tf(phraseFreq=3.0)
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.09375 = fieldNorm(field=moreWords, doc=6667236)
  1.1726664 = weight(moreWords:"laser jet" in 6667236), product of:
    0.60590804 = queryWeight(moreWords:"laser jet"), product of:
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.041507367 = queryNorm
    1.9353869 = fieldWeight(moreWords:"laser jet" in 6667236), product of:
      1.4142135 = tf(phraseFreq=2.0)
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.09375 = fieldNorm(field=moreWords, doc=6667236)
</str>
<str name="222045265">
2.8656518 = (MATCH) sum of:
  1.4294347 = weight(moreWords:"p 1102 w" in 6684158), product of:
    0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.041507367 = queryNorm
    1.7968225 = fieldWeight(moreWords:"p 1102 w" in 6684158), product of:
      1.0 = tf(phraseFreq=1.0)
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.09375 = fieldNorm(field=moreWords, doc=6684158)
  1.4362172 = weight(moreWords:"laser jet" in 6684158), product of:
    0.60590804 = queryWeight(moreWords:"laser jet"), product of:
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.041507367 = queryNorm
    2.3703551 = fieldWeight(moreWords:"laser jet" in 6684158), product of:
      1.7320508 = tf(phraseFreq=3.0)
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.09375 = fieldNorm(field=moreWords, doc=6684158)
</str>
</lst>
<str name="otherQuery">sku:213824965</str>
<lst name="explainOther">
<str name="213824965">
0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
  1.1911955 = weight(moreWords:"p 1102 w" in 32497), product of:
    0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.041507367 = queryNorm
    1.4973521 = fieldWeight(moreWords:"p 1102 w" in 32497), product of:
      1.0 = tf(phraseFreq=1.0)
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.078125 = fieldNorm(field=moreWords, doc=32497)
  0.0 = no match on required clause (moreWords:"laser jet")
    0.0 = weight(moreWords:"laser jet" in 32497), product of:
      0.60590804 = queryWeight(moreWords:"laser jet"), product of:
        14.597603 = idf(moreWords: laser=26731 jet=12685)
        0.041507367 = queryNorm
      0.0 = fieldWeight(moreWords:"laser jet" in 32497), product of:
        0.0 = tf(phraseFreq=0.0)
        14.597603 = idf(moreWords: laser=26731 jet=12685)
        0.078125 = fieldNorm(field=moreWords, doc=32497)
</str>
</lst>



RE: please help explaining debug output

2011-07-25 Thread Robert Petersen
Sorry, to clarify a search for P1102W matches all three docs but a
search for p1102w LaserJet only matches the second two.  Someone asked
me a question while I was typing and I got distracted, apologies for any
confusion.

-Original Message-
From: Robert Petersen [mailto:rober...@buy.com] 
Sent: Monday, July 25, 2011 1:42 PM
To: solr-user@lucene.apache.org
Subject: please help explaining debug output

I have three documents with the following product titles in a text field
called moreWords with analysis stack matching the solr example text
field definition.

 

1.   HP LaserJet P1102W Monochrome Laser Printer
     http://www.buy.com/prod/hp-laserjet-p1102w-monochrome-laser-printer/q/loc/101/213824965.html

2.   HP CE285A (85A) Remanufactured Black Toner Cartridge for
     LaserJet M1212nf, P1102, P1102W Series
     http://www.buy.com/prod/hp-ce285a-85a-remanufactured-black-toner-cartridge-for-laserjet/q/loc/101/217145536.html

3.   Black HP CE285A Toner Cartridge For LaserJet P1102W, LaserJet
     M1130, LaserJet M1132, LaserJet M1210
     http://www.buy.com/prod/black-hp-ce285a-toner-cartridge-for-laserjet-p1102w-laserjet-m1130/q/loc/101/222045267.html

 

A search for P1102W matches (2) and (3), but not (1) above.  Can someone
explain the debug output?  It looks like I am getting a non-match on (1)
because term frequency is zero?  Am I reading that right?  If so, how
could that be? the searched terms are equivalently in all three docs.  I
don't get it.

 

 

<lst name="debug">
<str name="rawquerystring">p1102w LaserJet</str>
<str name="querystring">p1102w LaserJet</str>
<str name="parsedquery">+PhraseQuery(moreWords:"p 1102 w") +PhraseQuery(moreWords:"laser jet")</str>
<str name="parsedquery_toString">+moreWords:"p 1102 w" +moreWords:"laser jet"</str>
<lst name="explain">
<str name="222045267">
3.64852 = (MATCH) sum of:
  2.4758534 = weight(moreWords:"p 1102 w" in 6667236), product of:
    0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.041507367 = queryNorm
    3.1121879 = fieldWeight(moreWords:"p 1102 w" in 6667236), product of:
      1.7320508 = tf(phraseFreq=3.0)
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.09375 = fieldNorm(field=moreWords, doc=6667236)
  1.1726664 = weight(moreWords:"laser jet" in 6667236), product of:
    0.60590804 = queryWeight(moreWords:"laser jet"), product of:
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.041507367 = queryNorm
    1.9353869 = fieldWeight(moreWords:"laser jet" in 6667236), product of:
      1.4142135 = tf(phraseFreq=2.0)
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.09375 = fieldNorm(field=moreWords, doc=6667236)
</str>
<str name="222045265">
2.8656518 = (MATCH) sum of:
  1.4294347 = weight(moreWords:"p 1102 w" in 6684158), product of:
    0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.041507367 = queryNorm
    1.7968225 = fieldWeight(moreWords:"p 1102 w" in 6684158), product of:
      1.0 = tf(phraseFreq=1.0)
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.09375 = fieldNorm(field=moreWords, doc=6684158)
  1.4362172 = weight(moreWords:"laser jet" in 6684158), product of:
    0.60590804 = queryWeight(moreWords:"laser jet"), product of:
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.041507367 = queryNorm
    2.3703551 = fieldWeight(moreWords:"laser jet" in 6684158), product of:
      1.7320508 = tf(phraseFreq=3.0)
      14.597603 = idf(moreWords: laser=26731 jet=12685)
      0.09375 = fieldNorm(field=moreWords, doc=6684158)
</str>
</lst>
<str name="otherQuery">sku:213824965</str>
<lst name="explainOther">
<str name="213824965">
0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
  1.1911955 = weight(moreWords:"p 1102 w" in 32497), product of:
    0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.041507367 = queryNorm
    1.4973521 = fieldWeight(moreWords:"p 1102 w" in 32497), product of:
      1.0 = tf(phraseFreq=1.0)
      19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
      0.078125 = fieldNorm(field=moreWords, doc=32497)
  0.0 = no match on required clause (moreWords:"laser jet")
    0.0 = weight(moreWords:"laser jet" in 32497), product of:
      0.60590804 = queryWeight(moreWords:"laser jet"), product of:
        14.597603 = idf(moreWords: laser=26731 jet=12685)
        0.041507367 = queryNorm
      0.0 = fieldWeight(moreWords:"laser jet" in 32497), product of:
        0.0 = tf(phraseFreq=0.0)
        14.597603 = idf(moreWords: laser=26731 jet=12685)
        0.078125 = fieldNorm(field=moreWords, doc=32497)
</str>
</lst>



RE: please help explaining debug output

2011-07-25 Thread Robert Petersen
I'm still on solr 1.4.0 and the analysis page looks like they should match, and 
other products with the same content do in fact match.  I'm reindexing the 
non-matching ones to rule that out.

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Monday, July 25, 2011 1:58 PM
To: solr-user@lucene.apache.org
Subject: Re: please help explaining debug output

Hmmm, I'm assuming that moreWords is your default text field, yes?

But it works for me (tm), using 1.4.1. What version of Solr are you on?

Also, take a glance at the admin/analysis page, that might help...

Gotta run

Erick

On Mon, Jul 25, 2011 at 4:52 PM, Robert Petersen rober...@buy.com wrote:
 Sorry, to clarify a search for P1102W matches all three docs but a
 search for p1102w LaserJet only matches the second two.  Someone asked
 me a question while I was typing and I got distracted, apologies for any
 confusion.

 -Original Message-
 From: Robert Petersen [mailto:rober...@buy.com]
 Sent: Monday, July 25, 2011 1:42 PM
 To: solr-user@lucene.apache.org
 Subject: please help explaining debug output

 I have three documents with the following product titles in a text field
 called moreWords with analysis stack matching the solr example text
 field definition.



 1.       HP LaserJet P1102W Monochrome Laser Printer
          http://www.buy.com/prod/hp-laserjet-p1102w-monochrome-laser-printer/q/loc/101/213824965.html

 2.       HP CE285A (85A) Remanufactured Black Toner Cartridge for
          LaserJet M1212nf, P1102, P1102W Series
          http://www.buy.com/prod/hp-ce285a-85a-remanufactured-black-toner-cartridge-for-laserjet/q/loc/101/217145536.html

 3.       Black HP CE285A Toner Cartridge For LaserJet P1102W, LaserJet
          M1130, LaserJet M1132, LaserJet M1210
          http://www.buy.com/prod/black-hp-ce285a-toner-cartridge-for-laserjet-p1102w-laserjet-m1130/q/loc/101/222045267.html



 A search for P1102W matches (2) and (3), but not (1) above.  Can someone
 explain the debug output?  It looks like I am getting a non-match on (1)
 because term frequency is zero?  Am I reading that right?  If so, how
 could that be? the searched terms are equivalently in all three docs.  I
 don't get it.





 <lst name="debug">

 <str name="rawquerystring">p1102w LaserJet</str>

 <str name="querystring">p1102w LaserJet</str>

 <str name="parsedquery">+PhraseQuery(moreWords:"p 1102 w") +PhraseQuery(moreWords:"laser jet")</str>

 <str name="parsedquery_toString">+moreWords:"p 1102 w" +moreWords:"laser jet"</str>

 <lst name="explain">

 <str name="222045267">

 3.64852 = (MATCH) sum of:
   2.4758534 = weight(moreWords:"p 1102 w" in 6667236), product of:
     0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
       19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
       0.041507367 = queryNorm
     3.1121879 = fieldWeight(moreWords:"p 1102 w" in 6667236), product of:
       1.7320508 = tf(phraseFreq=3.0)
       19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
       0.09375 = fieldNorm(field=moreWords, doc=6667236)
   1.1726664 = weight(moreWords:"laser jet" in 6667236), product of:
     0.60590804 = queryWeight(moreWords:"laser jet"), product of:
       14.597603 = idf(moreWords: laser=26731 jet=12685)
       0.041507367 = queryNorm
     1.9353869 = fieldWeight(moreWords:"laser jet" in 6667236), product of:
       1.4142135 = tf(phraseFreq=2.0)
       14.597603 = idf(moreWords: laser=26731 jet=12685)
       0.09375 = fieldNorm(field=moreWords, doc=6667236)

 </str>

 <str name="222045265">

 2.8656518 = (MATCH) sum of:
   1.4294347 = weight(moreWords:"p 1102 w" in 6684158), product of:
     0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
       19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
       0.041507367 = queryNorm
     1.7968225 = fieldWeight(moreWords:"p 1102 w" in 6684158), product of:
       1.0 = tf(phraseFreq=1.0)
       19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
       0.09375 = fieldNorm(field=moreWords, doc=6684158)
   1.4362172 = weight(moreWords:"laser jet" in 6684158), product of:
     0.60590804 = queryWeight(moreWords:"laser jet"), product of:
       14.597603 = idf(moreWords: laser=26731 jet=12685)
       0.041507367 = queryNorm
     2.3703551 = fieldWeight(moreWords:"laser jet" in 6684158), product of:
       1.7320508 = tf(phraseFreq=3.0)
       14.597603 = idf(moreWords: laser=26731 jet=12685)
       0.09375 = fieldNorm(field=moreWords, doc=6684158)

 </str>

 </lst>

 <str name="otherQuery">sku:213824965</str>

 <lst name="explainOther">

 <str name="213824965">

 0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited
 clause(s)
   1.1911955 = weight(moreWords:"p 1102 w" in 32497), product of:
     0.7955347 = queryWeight(moreWords:"p 1102 w"), product of:
       19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
       0.041507367 = queryNorm
     1.4973521 = fieldWeight(moreWords:"p 1102 w" in 32497), product of:
       1.0 = tf(phraseFreq=1.0)
       19.166107 = idf(moreWords: p=189166 1102=1135 w=445720)
       0.078125 = fieldNorm
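
 (For reference, the explain arithmetic above is self-consistent with the
 Lucene 3.x DefaultSimilarity formula, where tf for a phrase clause is the
 square root of phraseFreq:

   sqrt(3.0) = 1.7320508,  sqrt(2.0) = 1.4142135,  sqrt(1.0) = 1.0
   doc 222045267:  2.4758534 + 1.1726664 = 3.6485198, i.e. the 3.64852 total
   doc 222045265:  1.4294347 + 1.4362172 = 2.8656519, i.e. the 2.8656518 total

 So on sku 213824965 the "p 1102 w" clause did match (phraseFreq=1.0); the
 NON-MATCH comes from the other required clause, "laser jet", which is absent
 from that explain output, suggesting that phrase was never produced for that
 doc's moreWords field.)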

RE: Solr 3.3: Exception in thread Lucene Merge Thread #1

2011-07-20 Thread Robert Petersen
Says it is caused by a Java out of memory error, no?  

-Original Message-
From: mdz-munich [mailto:sebastian.lu...@bsb-muenchen.de] 
Sent: Wednesday, July 20, 2011 9:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 3.3: Exception in thread Lucene Merge Thread #1

Here we go ...

This time we tried to use the old LogByteSizeMergePolicy and
SerialMergeScheduler:

<mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy"/>
<mergeScheduler class="org.apache.lucene.index.SerialMergeScheduler"/>

We did this before, just to be sure ... 

~300 Documents:

SEVERE: java.io.IOException: Map failed
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:782)
        at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(MMapDirectory.java:264)
        at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:216)
        at org.apache.lucene.index.FieldsReader.<init>(FieldsReader.java:129)
        at org.apache.lucene.index.SegmentCoreReaders.openDocStores(SegmentCoreReaders.java:244)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:116)
        at org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:702)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4192)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3859)
        at org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:37)
        at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2714)
        at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2709)
        at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2705)
        at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3509)
        at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1850)
        at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1814)
        at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1778)
        at org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:143)
        at org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:183)
        at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:416)
        at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85)
        at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:98)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:240)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:164)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:462)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:563)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:403)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:301)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:162)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:140)
        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:309)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:897)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:919)
        at java.lang.Thread.run(Thread.java:736)
Caused by: java.lang.OutOfMemoryError: Map failed
        at sun.nio.ch.FileChannelImpl.map0(Native Method)
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:779)
        ... 44 more

20.07.2011 18:07:30 org.apache.solr.core.SolrCore execute
INFO: [core.digi20] webapp=/solr path=/update params={} status=500 QTime=12302
20.07.2011 18:07:30 org.apache.solr.common.SolrException log
SEVERE: 
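
("Map failed" OutOfMemoryErrors come from MMapDirectory running out of
virtual address space rather than Java heap. One hedged workaround sketch for
solrconfig.xml, assuming your Solr 3.x build ships the non-mmap factory named
below; verify the class exists in your version:

<!-- sketch: avoid memory-mapped index files so large merges do not
     exhaust virtual address space -->
<directoryFactory name="DirectoryFactory" class="solr.NIOFSDirectoryFactory"/>

Raising the per-process virtual memory limit, e.g. ulimit -v unlimited, is
the other usual remedy.)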

RE: Analysis page output vs. actually getting search matches, a discrepency?

2011-07-19 Thread Robert Petersen
Thanks Erick,

Unfortunately I'm stemming the same on both sides, similar to the SOLR example 
settings for the text type field.  And yes, the default search field is 
moreWords, as I want.

I don't have this problem with any other mfg names at all in our index of 
almost 10 MM product docs, and the analysis shows it should match, by my best 
estimation.

Note:  LucidKStemFilterFactory does not take 'Sterling' down to 'Sterl' in 
indexing or searching; it stays as 'Sterling'.

I have given up on this.  I've decided it is just an unexplainable anomaly, and 
have solved it by inserting a LucidKStemFilterFactory and just modifying that 
word to its searchable form before hitting the WhitespaceTokenizerFactory, 
which is kind of hackish but solves my problem at least.  This seller only has 
a couple hundred cheap products on our site, so I have bigger fish to fry at 
this point.  I've wasted too much time trying to chase this down.

Cheers all
Robi

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Monday, July 18, 2011 5:33 PM
To: solr-user@lucene.apache.org
Subject: Re: Analysis page output vs. actually getting search matches, a 
discrepency?

Hmmm, is there any chance that you're stemming one place and
not the other?
And I infer from your output that your default search field is
moreWords, is that true and expected?

You might use luke or the TermsComponent to see what's actually in
the index, I'm going to guess that you'll find sterl but not sterling as
an indexed term and your problem is stemming, but that's
a shot in the dark.

Best
Erick


RE: Analysis page output vs. actually getting search matches, a discrepency?

2011-07-19 Thread Robert Petersen
Um sorry for any confusion.  I meant to say I solved my issue by inserting a 
charFilter before the WhitespaceTokenizerFactory to convert my problem word to 
a searchable form.  I had a cut n paste malfunction below.  Thanks guys.
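
A minimal sketch of that kind of charFilter mapping, assuming
solr.PatternReplaceCharFilterFactory (the pattern and replacement here are
illustrative, not the production values):

<analyzer type="index">
  <!-- sketch: rewrite the problem word before tokenization so the
       downstream WordDelimiterFilter never sees the camel case -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="SterlingTek" replacement="Sterlingtek"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  ...
</analyzer>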

-Original Message-
From: Robert Petersen [mailto:rober...@buy.com] 
Sent: Tuesday, July 19, 2011 11:06 AM
To: solr-user@lucene.apache.org
Subject: RE: Analysis page output vs. actually getting search matches, a 
discrepency?

Thanks Erick,

Unfortunately I'm stemming the same on both sides, similar to the SOLR example 
settings for the text type field.  And yes, the default search field is 
moreWords, as I want.

I don't have this problem with any other mfg names at all in our index of 
almost 10 MM product docs, and the analysis shows it should match, by my best 
estimation.

Note:  LucidKStemFilterFactory does not take 'Sterling' down to 'Sterl' in 
indexing or searching; it stays as 'Sterling'.

I have given up on this.  I've decided it is just an unexplainable anomaly, and 
have solved it by inserting a LucidKStemFilterFactory and just modifying that 
word to its searchable form before hitting the WhitespaceTokenizerFactory, 
which is kind of hackish but solves my problem at least.  This seller only has 
a couple hundred cheap products on our site, so I have bigger fish to fry at 
this point.  I've wasted too much time trying to chase this down.

Cheers all
Robi


RE: Analysis page output vs. actually getting search matches, a discrepency?

2011-07-18 Thread Robert Petersen
OK I did what Hoss said; it only confirms that I don't get a match when I
should, and that the query parser is doing the expected thing.  Here are the
details for one test sku.

My analysis page output is shown in my email starting this thread and
here is my query debug output.  This absolutely should match but
doesn't.  Both the indexing side and the query side are splitting on
case changes.  This actually isn't a problem for any of our other
content, for instance there is no issue searching for 'VideoSecu'.
Their products come up fine in our searches regardless of casing in the
query.  Only SterlingTek's products seem to be causing us issues.

Indexed content has camel case, stored in the text field 'moreWords':
SterlingTek's NB-2LH 2 Pack Batteries + Charger Combo for Canon DC301
Search term not matching with camel case: SterlingTek's
Search term matching if no case changes: Sterlingtek's

Indexing:
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="1"
        catenateNumbers="1"
        catenateAll="0"
        splitOnCaseChange="1"
        preserveOriginal="0"
/>
Searching:
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="0"
        catenateNumbers="0"
        catenateAll="0"
        splitOnCaseChange="1"
        preserveOriginal="0"
/>

Thanks

http://ssdevrh01.buy.com:8983/solr/1/select?indent=on&version=2.2&q=SterlingTek%27s&fq=&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&debugQuery=on&explainOther=sku%3A216473417&hl=on&hl.fl=&echoHandler=true

<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">4</int>
  <str name="handler">org.apache.solr.handler.component.SearchHandler</str>
  <lst name="params">
    <str name="explainOther">sku:216473417</str>
    <str name="indent">on</str>
    <str name="echoHandler">true</str>
    <str name="hl.fl"/>
    <str name="wt">standard</str>
    <str name="hl">on</str>
    <str name="rows">1</str>
    <str name="version">2.2</str>
    <str name="fl">*,score</str>
    <str name="debugQuery">on</str>
    <str name="start">0</str>
    <str name="q">SterlingTek's</str>
    <str name="qt">standard</str>
    <str name="fq"/>
  </lst>
</lst>
<result name="response" numFound="0" start="0" maxScore="0.0"/>
<lst name="highlighting"/>
<lst name="debug">
  <str name="rawquerystring">SterlingTek's</str>
  <str name="querystring">SterlingTek's</str>
  <str name="parsedquery">PhraseQuery(moreWords:"sterling tek")</str>
  <str name="parsedquery_toString">moreWords:"sterling tek"</str>
  <lst name="explain"/>
  <str name="otherQuery">sku:216473417</str>
  <lst name="explainOther">
    <str name="216473417">
0.0 = fieldWeight(moreWords:"sterling tek" in 76351), product of:
  0.0 = tf(phraseFreq=0.0)
  19.502613 = idf(moreWords: sterling=1 tek=72)
  0.15625 = fieldNorm(field=moreWords, doc=76351)
    </str>
  </lst>
  <str name="QParser">LuceneQParser</str>
  <arr name="filter_queries">
    <str/>
  </arr>



-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Friday, July 15, 2011 4:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Analysis page output vs. actually getting search matches, a
discrepency?


: Subject: Analysis page output vs. actually getting search matches,
: a discrepency?

99% of the time when people ask questions like this, it's because of 
confusion about how/when QueryParsing comes into play (as opposed to 
analysis) -- analysis.jsp only shows you part of the equation, it doesn't 
know what query parser you are using.

you mentioned that you aren't getting matches when you expect them, and 
you provided the analysis.jsp output, but you didn't mention anything 
about the request you are making, the query parser used, etc...  it would 
be good to know the full query URL, along with the debugQuery output 
showing the final query toString info.

if that info doesn't clear up the discrepancy, you should also take a look 
at the explainOther info for the doc that you expect to match that isn't 
-- if you still aren't sure what's going on, post all of that info to 
solr-user and folks can probably help you make sense of it.

(all that said: in some instances this type of problem is simply that 
someone changed the schema and didn't reindex everything, so the indexed 
terms don't really match what you think they do)


-Hoss


RE: ' invisible ' words

2011-07-18 Thread Robert Petersen
Read my thread "RE: Analysis page output vs. actually getting search
matches, a discrepancy?" and see if it is not somewhat like your
problem... even if not, there might be something there to help you
figure out what is going on in your case...

-Original Message-
From: deniz [mailto:denizdurmu...@gmail.com] 
Sent: Sunday, July 17, 2011 6:24 PM
To: solr-user@lucene.apache.org
Subject: RE: ' invisible ' words

Hi Jagdish,

thank you very much for the tool that you have sent... It is really useful
for this problem...

After using the tool, I just got interesting results... for some words,
when I use the tool it returns the matched docs; on the other hand, when I
use the Solr admin page to make a search I can't get any matches with the
same words... now I am more confused and honestly have no idea what to
do...

has anyone ever faced such a problem?

anyone has ever faced such a problem?

-
Zeki ama calismiyor... Calissa yapar...
--
View this message in context:
http://lucene.472066.n3.nabble.com/invisible-words-tp3158060p3177907.html
Sent from the Solr - User mailing list archive at Nabble.com.


Analysis page output vs. actually getting search matches, a discrepency?

2011-07-15 Thread Robert Petersen
I have a problem searching for one mfg name (out of our 10 MM product titles);
it is indexed in a text type field having about the same analyzer
settings as the Solr example text field definition.  Most everything
works fine, but we found this one example which I cannot get a direct hit
on.  In the Field Analysis page, it sure looks like it would *have* to
match, but sadly during searches it just doesn't.  I can get it to match
by turning off 'split on case change', but that breaks many other
searches, like 'appleTV', which need to split on case change to match
'apple tv' in our content!

 

If I search for SterlingTek's anything, I get zero results.

If I change the casing to Sterlingtek's in my query, I get all the
results.

If I turn off 'split on case change' then the first gets results also.

 

See the verbose analysis output for the actual filter settings; I put the
non-verbose output first for easier reading (hope the tables don't get lost
during posting to this group).  The analysis shows a complete matchup, and
that is what I don't get:

 

Field Analysis

Top of Form
Field value (Index), verbose output, highlight matches:  SterlingTek's NB-2LH
Field value (Query), verbose output:  SterlingTek's NB-2LH
Bottom of Form

Index Analyzer (tokens after each stage):

  whitespace tokenizer:  SterlingTek's | NB-2LH
  synonym filter:        SterlingTek's | NB-2LH
  stop filter:           SterlingTek's | NB-2LH
  word delimiter:        Sterling | Tek | NB | 2 | LH | SterlingTek
  lowercase:             sterling | tek | nb | 2 | lh | sterlingtek
  kstem:                 sterling | tek | nb | 2 | lh | sterlingtek
  remove duplicates:     sterling | tek | nb | 2 | lh | sterlingtek

Note every field is highlighted in the last line above meaning all have
a match, right???

Query Analyzer (tokens after each stage):

  whitespace tokenizer:  SterlingTek's | NB-2LH
  synonym filter:        SterlingTek's | NB-2LH
  stop filter:           SterlingTek's | NB-2LH
  word delimiter:        Sterling | Tek | NB | 2 | LH
  lowercase:             sterling | tek | nb | 2 | lh
  kstem:                 sterling | tek | nb | 2 | lh
  remove duplicates:     sterling | tek | nb | 2 | lh

VERBOSE OUTPUT FOLLOWS:


Index Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  position | term text     | type | start,end
  1        | SterlingTek's | word | 0,13
  2        | NB-2LH        | word | 14,20

org.apache.solr.analysis.SynonymFilterFactory
{synonyms=index_synonyms.txt, expand=true, ignoreCase=true}
  position | term text     | type | start,end
  1        | SterlingTek's | word | 0,13
  2        | NB-2LH        | word | 14,20

org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
ignoreCase=true}
  position | term text     | type | start,end
  1        | SterlingTek's | word | 0,13
  2        | NB-2LH        | word | 14,20

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0,
splitOnCaseChange=1, generateNumberParts=1, catenateWords=1,
generateWordParts=1, catenateAll=0, catenateNumbers=1}
  position | term text   | type | start,end
  1        | Sterling    | word | 0,8
  2        | Tek         | word | 8,11
  3        | NB          | word | 14,16
  4        | 2           | word | 17,18
  5        | LH          | word | 18,20
           | SterlingTek | word | 0,11

org.apache.solr.analysis.LowerCaseFilterFactory {}
  position | term text   | type | start,end
  1        | sterling    | word | 0,8
  2        | tek         | word | 8,11
  3        | nb          | word | 14,16
  4        | 2           | word | 17,18
  5        | lh          | word | 18,20
           | sterlingtek | word | 0,11

com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory
{protected=protwords.txt}
  position | term text   | type | start,end
  1        | sterling    | word | 0,8
  2        | tek         | word | 8,11
  3        | nb          | word | 14,16
  4        | 2           | word | 17,18
  5        | lh          | word | 18,20
           | sterlingtek | word | 0,11

org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
  position | term text   | type | start,end
  1        | sterling    | word | 0,8
  2        | tek         | word | 8,11
  3        | nb          | word | 14,16
  4        | 2           | word | 17,18
  5        | lh          | word | 18,20
           | sterlingtek | word | 0,11


Query Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  position | term text     | type | start,end
  1        | SterlingTek's | word | 0,13
  2        | NB-2LH        | word | 14,20

org.apache.solr.analysis.SynonymFilterFactory
{synonyms=query_synonyms.txt, expand=true, ignoreCase=true}
  position | term text     | type | start,end
  1        | SterlingTek's | word | 0,13
  2        | NB-2LH        | word | 14,20

org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
ignoreCase=true}
  position | term text     | type | start,end
  1        | SterlingTek's | word | 0,13
  2        | NB-2LH        | word | 14,20

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0,
splitOnCaseChange=1, generateNumberParts=1, catenateWords=0,
generateWordParts=1, catenateAll=0, catenateNumbers=0}
  position | term text | type | start,end
  1        | Sterling  | word | 0,8
  2        | Tek       | word | 8,11
  3        | NB        | word | 14,16
  4        | 2         | word | 17,18
  5        | LH        | word | 18,20




RE: Analysis page output vs. actually getting search matches, a discrepency?

2011-07-15 Thread Robert Petersen
Hi Chris, 

Well to start from the bottom of your list there, I restrict my testing
to one sku while continuously reindexing the sku after every indexer
side change, and reload the core every time also.  I just search from
the admin page using the word in question and the exact match on the sku
field (the unique one) like this:
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">6</int>
  <lst name="params">
    <str name="indent">on</str>
    <str name="start">0</str>
    <str name="q">SterlingTek's NB-2LH sku:216473417</str>
    <str name="bbb">a</str>
    <str name="rows">10</str>
    <str name="version">2.2</str>
  </lst>
</lst>

I will have to find out more about query parsers before I can answer the
rest.  Will reply to that later... and it's Friday after all!  :)

Thanks


-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Friday, July 15, 2011 4:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Analysis page output vs. actually getting search matches, a
discrepency?


: Subject: Analysis page output vs. actually getting search matches,
: a discrepency?

99% of the time when people ask questions like this, it's because of 
confusion about how/when QueryParsing comes into play (as opposed to 
analysis) -- analysis.jsp only shows you part of the equation, it doesn't 
know what query parser you are using.

you mentioned that you aren't getting matches when you expect them, and 
you provided the analysis.jsp output, but you didn't mention anything 
about the request you are making, the query parser used, etc...  it would 
be good to know the full query URL, along with the debugQuery output 
showing the final query toString info.

if that info doesn't clear up the discrepancy, you should also take a look 
at the explainOther info for the doc that you expect to match that isn't 
-- if you still aren't sure what's going on, post all of that info to 
solr-user and folks can probably help you make sense of it.

(all that said: in some instances this type of problem is simply that 
someone changed the schema and didn't reindex everything, so the indexed 
terms don't really match what you think they do)


-Hoss


RE: Feature: skipping caches and info about cache use

2011-06-03 Thread Robert Petersen
Why, I'm just wondering?

For a case where you know the next query could not possibly be in the
cache already because it is so different from the norm?

Just for timing information, for instrumentation used in tuning (i.e. so
you can compare cached response times vs. non-cached response times)?


-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Friday, June 03, 2011 10:02 AM
To: solr-user@lucene.apache.org
Subject: Feature: skipping caches and info about cache use

Hi,

Is it just me, or would others like things like:
* The ability to tell Solr (by passing some URL param?) to skip one or
more of 
its caches and get data from the index
* An additional attrib in the Solr response that shows whether the query
came 
from the cache or not

* Maybe something else along these lines?

Or maybe some of this is already there and I just don't know about it?
:)

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



RE: Anyway to know changed documents?

2011-06-02 Thread Robert Petersen
...and it works really well!!!  :)

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Wednesday, June 01, 2011 5:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Anyway to know changed documents?

On 6/1/2011 6:12 AM, pravesh wrote:
 SOLR wiki will provide help on this. You might be interested in pure Java
 based replication too. I'm not sure whether SOLR operational will have this
 feature (synch'ing only changed segments). You might need to change
 configuration in searchconfig.xml

Yes, this feature is there in the Java/HTTP based replication since Solr
1.4
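
For reference, a minimal slave-side sketch of that Java replication setup in
solrconfig.xml (host name, core name, and poll interval are placeholders):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- sketch: point at the master core's replication handler -->
    <str name="masterUrl">http://master-host:8983/solr/core/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>

Only files that changed since the last replicated index version are pulled
on each poll, which is the synching-only-changed-segments behavior asked
about above.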



RE: Newbie question: how to deal with different # of search results per page due to pagination then grouping

2011-06-01 Thread Robert Petersen
Don't manually group by author from your results; the list will always
be incomplete.  Use faceting instead to show the authors of the books
you have found in your search.

http://wiki.apache.org/solr/SolrFacetingOverview
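
A hedged example of the kind of request that implies (host and field names
are placeholders; author would need to be a non-tokenized field):

http://localhost:8983/solr/select?q=your+keywords&rows=0&facet=true&facet.field=author&facet.limit=25&facet.mincount=1

That returns the top 25 matching authors with their document counts in one
query, independent of how the documents themselves are paginated.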

-Original Message-
From: beccax [mailto:bec...@gmail.com] 
Sent: Wednesday, June 01, 2011 11:56 AM
To: solr-user@lucene.apache.org
Subject: Newbie question: how to deal with different # of search results
per page due to pagination then grouping

Apologize if this question has already been raised.  I tried searching
but
couldn't find the relevant posts.

We've indexed a bunch of documents by different authors.  Then for
search
results, we'd like to show the authors that have 1 or more documents
matching the search keywords.  

The problem is right now our solr search method first paginates results to
100 documents per page, then we take the results and group by authors.  This
results in a different number of authors per page.  (Some authors may only
have one matching document and others 5 or 10.)

How do we change it to somehow show the same number of authors (say 25)
per
page?

I mean alternatively we could just show all the documents themselves
ordered
by author, but it's not the user experience we're looking for.

Thanks so much.  And please let me know if you need more details not
provided here.
B

--
View this message in context:
http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-different-of-search-results-per-page-due-to-pagination-then-grouping-tp3012168p3012168.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Newbie question: how to deal with different # of search results per page due to pagination then grouping

2011-06-01 Thread Robert Petersen
I think facet.offset allows facet paging nicely by letting you index
into the list of facet values.  It is working for me...

http://wiki.apache.org/solr/SimpleFacetParameters#facet.offset
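
For example (host and field names are placeholders), page two of a
25-per-page author facet list would be:

http://localhost:8983/solr/select?q=your+keywords&rows=0&facet=true&facet.field=author&facet.limit=25&facet.offset=25&facet.mincount=1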


-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Wednesday, June 01, 2011 12:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Newbie question: how to deal with different # of search
results per page due to pagination then grouping

There's no great way to do that.

One approach would be using facets, but that will just get you the 
author names (as stored in fields), and not the documents under it. If 
you really only want to show the author names, facets could work. One 
issue with facets though is Solr won't tell you the total number of 
facet values for your query, so it's tricky to provide next/prev paging 
through them.

There is also a 'field collapsing' feature that I think is not in a 
released Solr, but may be in the Solr repo. I'm not sure it will quite 
do what you want either though, although it's related and worth a look. 
http://wiki.apache.org/solr/FieldCollapsing

Another vaguely related thing that is also not yet in a released Solr, 
is a 'join' function. That could possibly be used to do what you want, 
although it'd be tricky too.
https://issues.apache.org/jira/browse/SOLR-2272

Jonathan

On 6/1/2011 2:56 PM, beccax wrote:
 Apologize if this question has already been raised.  I tried searching
but
 couldn't find the relevant posts.

 We've indexed a bunch of documents by different authors.  Then for
search
 results, we'd like to show the authors that have 1 or more documents
 matching the search keywords.

 The problem is right now our solr search method first paginates
results to
 100 documents per page, then we take the results and group by authors.
This
 results in different number of authors per page.  (Some authors may
only
 have one matching document and others 5 or 10.)

 How do we change it to somehow show the same number of authors (say
25) per
 page?

 I mean alternatively we could just show all the documents themselves
ordered
 by author, but it's not the user experience we're looking for.

 Thanks so much.  And please let me know if you need more details not
 provided here.
 B

 --
 View this message in context:
http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-different-of-search-results-per-page-due-to-pagination-then-grouping-tp3012168p3012168.html
 Sent from the Solr - User mailing list archive at Nabble.com.



RE: Newbie question: how to deal with different # of search results per page due to pagination then grouping

2011-06-01 Thread Robert Petersen
Yes, that is exactly the issue... we're thinking maybe just always have a
next button, and if you go too far you just get zero results.  The user gets
what the user asks for, and so the user could simply back up, if desired, to
where the facet still has values.  You could also detect an empty facet
result on the front end.  And you can expand one facet at a time, paging
only the facet pane and not the whole page, using an ajax call.
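
One cheap trick along those lines: ask for one more facet value than you
display, e.g. facet.limit=26 when showing 25 per page (facet.offset=50 plus
facet.limit=26 for page three, and so on).  If a 26th value comes back you
know there is a next page; if not, you are at the end.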



-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Wednesday, June 01, 2011 2:30 PM
To: solr-user@lucene.apache.org
Cc: Robert Petersen
Subject: Re: Newbie question: how to deal with different # of search
results per page due to pagination then grouping

How do you know whether to provide a 'next' button, or whether you are at 
the end of your facet list?

On 6/1/2011 4:47 PM, Robert Petersen wrote:
 I think facet.offset allows facet paging nicely by letting you index
 into the list of facet values.  It is working for me...

 http://wiki.apache.org/solr/SimpleFacetParameters#facet.offset


 -Original Message-
 From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
 Sent: Wednesday, June 01, 2011 12:41 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Newbie question: how to deal with different # of search
 results per page due to pagination then grouping

 There's no great way to do that.

 One approach would be using facets, but that will just get you the
 author names (as stored in fields), and not the documents under it. If
 you really only want to show the author names, facets could work. One
 issue with facets though is Solr won't tell you the total number of
 facet values for your query, so it's tricky to provide next/prev
paging
 through them.

 There is also a 'field collapsing' feature that I think is not in a
 released Solr, but may be in the Solr repo. I'm not sure it will quite
 do what you want either though, although it's related and worth a
look.
 http://wiki.apache.org/solr/FieldCollapsing

 Another vaguely related thing that is also not yet in a released Solr,
 is a 'join' function. That could possibly be used to do what you want,
 although it'd be tricky too.
 https://issues.apache.org/jira/browse/SOLR-2272

 Jonathan

 On 6/1/2011 2:56 PM, beccax wrote:
 Apologize if this question has already been raised.  I tried
searching
 but
 couldn't find the relevant posts.

 We've indexed a bunch of documents by different authors.  Then for
 search
 results, we'd like to show the authors that have 1 or more documents
 matching the search keywords.

 The problem is right now our solr search method first paginates
 results to
 100 documents per page, then we take the results and group by
authors.
 This
 results in different number of authors per page.  (Some authors may
 only
 have one matching document and others 5 or 10.)

 How do we change it to somehow show the same number of authors (say
 25) per
 page?

 I mean alternatively we could just show all the documents themselves
 ordered
 by author, but it's not the user experience we're looking for.

 Thanks so much.  And please let me know if you need more details not
 provided here.
 B

 --
 View this message in context:

http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-different-of-search-results-per-page-due-to-pagination-then-grouping-tp3012168p3012168.html
 Sent from the Solr - User mailing list archive at Nabble.com.



RE: How to index and query C# as whole term?

2011-05-16 Thread Robert Petersen
I have always just converted terms like 'C#' or 'C++' into 'csharp' and
'cplusplus' before indexing them and similarly converted those terms if
someone searched on them.  That always has worked just fine for me...
:)

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Monday, May 16, 2011 8:28 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index and query C# as whole term?

I don't think you'd want to use the string type here. String type is 
almost never appropriate for a field you want to actually search on (it 
is appropriate for fields to facet on).

But you may want to use Text type with different analyzers selected.  
You probably want Text type so the value is still split into different 
tokens on word boundaries; you just don't want an analyzer set that 
removes punctuation.

On 5/16/2011 10:46 AM, Gora Mohanty wrote:
 On Mon, May 16, 2011 at 7:05 PM, Gnanakumargna...@zoniac.com  wrote:
 Hi,

 I'm using Apache Solr v3.1.

 How do I configure/allow Solr to both index and query the term "c#" as a
 whole word/term?  From the Analysis page, I could see that the term "c#" is
 being reduced/converted into just "c" by solr.WordDelimiterFilterFactory.
 [...]

 Yes, as you have discovered the analyzers for the field type in
 question will affect the values indexed.

 To index "c#" exactly as is, you can use the "string" type, instead
 of the "text" type. However, you probably want some filters
 to be applied, e.g., LowerCaseFilterFactory. Take a look at the
 definition of the fieldType "text" in schema.xml, define a new field
 type that has only the tokenizers and analyzers that you need, and
 use that type for your field. This Wiki page should be helpful:
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

 Regards,
 Gora



RE: How to index and query C# as whole term?

2011-05-16 Thread Robert Petersen
Sorry, I am also using a synonyms.txt for this in the analysis stack.  I
was not clear, sorry for any confusion.  I am not doing it outside of
Solr; it is converted on the way into the index...  :)
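
For the record, a sketch of what that kind of mapping can look like in
synonyms.txt (the => form makes it a one-way rewrite):

c# => csharp
c++ => cplusplus

Assuming the SynonymFilter sits before WordDelimiterFilterFactory in the
analyzer chain, the rewritten token survives the punctuation stripping; the
same mapping has to be in the query-side synonyms file so the two sides meet.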

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Monday, May 16, 2011 8:51 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index and query C# as whole term?

Before indexing so outside Solr? Using the SynonymFilter would be easier
i 
guess.

On Monday 16 May 2011 17:44:24 Robert Petersen wrote:
 I have always just converted terms like 'C#' or 'C++' into 'csharp'
and
 'cplusplus' before indexing them and similarly converted those terms
if
 someone searched on them.  That always has worked just fine for me...
 
 :)
 
 -Original Message-
 From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
 Sent: Monday, May 16, 2011 8:28 AM
 To: solr-user@lucene.apache.org
 Subject: Re: How to index and query C# as whole term?
 
 I don't think you'd want to use the string type here. String type is
 almost never appropriate for a field you want to actually search on
(it
 is appropriate for fields to facet on).
 
 But you may want to use Text type with different analyzers selected.
 You probably want Text type so the value is still split into different
 tokens on word boundaries; you just don't want an analyzer set that
 removes punctuation.
 
 On 5/16/2011 10:46 AM, Gora Mohanty wrote:
  On Mon, May 16, 2011 at 7:05 PM, Gnanakumargna...@zoniac.com
wrote:
  Hi,
  
  I'm using Apache Solr v3.1.
  
  How do I configure/allow Solr to both index and query the term c#
 
 as a
 
  whole word/term?  From Analysis page, I could see that the term
 
 c# is
 
  being reduced/converted into just c by
 
 solr.WordDelimiterFilterFactory.
 
  [...]
  
  Yes, as you have discovered the analyzers for the field type in
  question will affect the values indexed.
  
  To index c# exactly as is, you can use the string type, instead
  of the text type. However, what you probably want some filters
  to be applied, e.g., LowerCaseFilterFactory. Take a look at the
  definition of the fieldType text in schema.xml, define a new field
  type that has only the tokenizers and analyzers that you need, and
  use that type for your field. This Wiki page should be helpful:
  http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
  
  Regards,
  Gora

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


RE: Synonym Filter disable at query time

2011-05-10 Thread Robert Petersen
Very nice! Good job! :)

-Original Message-
From: mtraynham [mailto:mtrayn...@digitalsmiths.com] 
Sent: Tuesday, May 10, 2011 9:44 AM
To: solr-user@lucene.apache.org
Subject: RE: Synonym Filter disable at query time

Just a heads up on a solution.

copyField wasn't needed, but a new fieldType and a non-indexed, non-stored
field was added.

Within a new Synonym processor that executes right before the
AnalyzerQueryNodeProcessor, I was able to modify the field name for each
node to point at the new field.  Therefore I could build out the necessary
synonym values from the tokenizer and then reassign them all back to the
original field with whatever boosts they needed.  This allowed me to retain
the original value match, to keep its boost at 1, and then boost the
synonyms according to a user-specified boost value.  Works perfectly.

Thanks again for the help.

--
View this message in context:
http://lucene.472066.n3.nabble.com/Synonym-Filter-disable-at-query-time-tp2919876p2923775.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Synonym Filter disable at query time

2011-05-09 Thread Robert Petersen
Just make another field using copyField, with a field type that does not
apply synonyms to the text, and then search either the field with synonyms
or the one without from the front end...  that will be your selector.  :)
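
A schema.xml sketch of that layout (field and type names are placeholders;
text_syn would include SynonymFilterFactory in its analyzer chain and
text_nosyn would not):

<field name="title"       type="text_syn"   indexed="true" stored="true"/>
<field name="title_nosyn" type="text_nosyn" indexed="true" stored="false"/>
<copyField source="title" dest="title_nosyn"/>

Then synonyms=false on the front end just means querying title_nosyn
instead of title.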

-Original Message-
From: mtraynham [mailto:mtrayn...@digitalsmiths.com] 
Sent: Monday, May 09, 2011 11:17 AM
To: solr-user@lucene.apache.org
Subject: Synonym Filter disable at query time

I would like to be able to disable the synonym filter during runtime based on
a query parameter, say 'synonyms=true' or 'synonyms=false'.

Is there a way within the AnalyzerQueryNodeProcessor or QParser that I can
remove the SynonymFilter from the AnalyzerAttributes?

It seems that the Analyzer has a hashmap for its 'analyzers' but I cannot
find the declaration of this item.

Whether I am going about this wrong is also another question I had...


--
View this message in context:
http://lucene.472066.n3.nabble.com/Synonym-Filter-disable-at-query-time-tp2919876p2919876.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Synonym Filter disable at query time

2011-05-09 Thread Robert Petersen
I was thinking: search both and boost the non-synonym field, perhaps?

-Original Message-
From: mtraynham [mailto:mtrayn...@digitalsmiths.com] 
Sent: Monday, May 09, 2011 1:20 PM
To: solr-user@lucene.apache.org
Subject: RE: Synonym Filter disable at query time

Awesome thanks!  Also, you wouldn't happen to have any insight on
boosting
synonyms lower than the original query after they were stemmed, would
you?

Say if I had synonyms turned on:

The TokenStream is set up to do Synonyms -> StopFilter -> LowerCaseFilter ->
SnowballPorter.

Say I search for Thomas, synonyms produces Thomas, Tom, Tommy.
The SnowballPorter produces Tom, Tommi, Thoma.

Is there a way to know Thoma would match the original term, so it could
be
boosted higher?



--
View this message in context:
http://lucene.472066.n3.nabble.com/Synonym-Filter-disable-at-query-time-tp2919876p2920342.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Synonym Filter disable at query time

2011-05-09 Thread Robert Petersen
Yay!   :)

-Original Message-
From: mtraynham [mailto:mtrayn...@digitalsmiths.com] 
Sent: Monday, May 09, 2011 1:59 PM
To: solr-user@lucene.apache.org
Subject: RE: Synonym Filter disable at query time

Actually now that I think about it, with copy fields I can just single
out
the Synonym reader and boost from an earlier processor.

Thanks again though, that solved a lot of headache!

--
View this message in context:
http://lucene.472066.n3.nabble.com/Synonym-Filter-disable-at-query-time-tp2919876p2920510.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: stemming for English

2011-05-03 Thread Robert Petersen
From what I have seen, adding a second field with the same terms as the first 
does *not* double your index size at all.

-Original Message-
From: Dmitry Kan [mailto:dmitry@gmail.com] 
Sent: Tuesday, May 03, 2011 4:06 AM
To: solr-user@lucene.apache.org
Subject: Re: stemming for English

Yes, Ludovic. Thus effectively we get the index doubled. Given the volume of
data we store, we very carefully consider such cases, where doubling the
index is a must.

Dmitry

On Tue, May 3, 2011 at 1:08 PM, lboutros boutr...@gmail.com wrote:

 Dmitry,

 I don't know any way to keep both stemming and consistent wildcard support
 in the same field.
 To me, you have to create 2 different fields.

 Ludovic.

 2011/5/3 Dmitry Kan [via Lucene] 
 ml-node+2893628-993677979-383...@n3.nabble.com

  Hi Ludovic,
 
  That's an option we had before we decided to go for a full-blown support
 of
 
  wildcards.
 
  Do you know of a way to keep both stemming and consistent wildcard
 support
  in the same field?`
 
  Dmitry
 
 


 -
 Jouve
 France.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/stemming-for-English-tp2893599p2893652.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Regards,

Dmitry Kan


RE: boost fields which have value

2011-04-28 Thread Robert Petersen
I believe the sortMissingLast fieldType attribute is what you want:
<fieldType ... sortMissingLast="true" ... />


http://wiki.apache.org/solr/SchemaXml
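
A sketch of how that looks in schema.xml, with placeholder names:

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

With that in place, sorting (sort=field1 asc or desc) keeps documents with no
value in field1 at the end of the results.  Note this addresses sorting;
boosting by whether a field has a value is a separate exercise with bf.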


-Original Message-
From: Zoltán Altfatter [mailto:altfatt...@gmail.com] 
Sent: Thursday, April 28, 2011 6:11 AM
To: solr-user@lucene.apache.org
Subject: boost fields which have value

Hi,

How can I achieve it so that documents which don't have field1 and field2
filled in are returned at the end of the search results?

I have tried with the *bf* parameter, which seems to work, but just with one
field.

Is there any function query which I can use in the bf value to boost two fields?

Thank you.

Regards,
Zoltan


RE: SynonymFilterFactory case changes

2011-04-27 Thread Robert Petersen
Yes I did, but that's cool because it is useful to make the final determination 
explicit here on the group for the benefit of other users.  :)

Thanks
Robi

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, April 26, 2011 5:10 PM
To: solr-user@lucene.apache.org
Subject: Re: SynonymFilterFactory case changes

Ahhh, I mis-read your post..

First, it's not the SynonymFilterFactory that's lowercasing anything. The
ignoreCase=true affects the matching, not the output. The output is
probably lowercased because you have it that way in the synonyms.txt
file. At least that's what I just saw using the analysis page from the
Solr admin page.

So yes, if you want the WDF to do anything on tokens put into the input
stream by SynonymFilterFactory, you need to make the
replacement be the accurate case.

But I think you already figured all that out

Best
Erick

On Tue, Apr 26, 2011 at 7:19 PM, Robert Petersen rober...@buy.com wrote:
 But in this case lowercase is after WDF.  The question is: when you get a 
 hit in the SynonymFilter on a synonym, and the entries in the synonyms.txt 
 file are all in lower case, do I need to add the case-changing versions to 
 make WDF work on case changes?  It appears the synonym text is replaced 
 verbatim by what is in the txt file, and so that defeats the WDF filter.  In 
 fact, adding the case-changing versions of this term to the synonyms.txt file 
 makes this use case work.  (yay)

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Tuesday, April 26, 2011 3:39 PM
 To: solr-user@lucene.apache.org
 Subject: Re: SynonymFilterFactory case changes

 Yes, order does matter.  You're right, putting, say, lowercase in front
 of WordDelimiter... will mess up the operations of WDFF.

 The admin/analysis page is *extremely* useful for understanding what
 happens in the analysis of input. Make sure to check the verbose
 checkbox.

 Best
 Erick

 On Tue, Apr 26, 2011 at 5:10 PM, Robert Petersen rober...@buy.com wrote:
 So if there is a hit in the synonym filter factory, do I need to put in the
 various case changes for a term so that the following
 WordDelimiterFilter analyzer can do its 'split on case changes' work?
 Here we see SynonymFilterFactory makes all terms lowercase, because this
 is what is in my synonyms.txt file and I have ignoreCase=true:
 macafee, mcafee

 Index Analyzer
 org.apache.solr.analysis.WhitespaceTokenizerFactory {}
   position 1: McAfee (word, 0,6)
 org.apache.solr.analysis.SynonymFilterFactory
 {synonyms=index_synonyms.txt, expand=true, ignoreCase=true}
   position 1: macafee (word, 0,6) | mcafee (word, 0,6)





RE: term position question from analyzer stack for WordDelimiterFilterFactory

2011-04-26 Thread Robert Petersen
OK this is even more weird... everything is working much better except
for one thing: I was testing use cases with our top query terms to make
sure the below query settings wouldn't break any existing behavior, and
got this most unusual result.  The analyzer stack completely eliminated
the word McAfee from the query terms!  I'm like huh?  Here is the
analyzer page output for that search term:

Query Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  position 1: McAfee (word, 0,6)
org.apache.solr.analysis.SynonymFilterFactory
{synonyms=query_synonyms.txt, expand=true, ignoreCase=true}
  position 1: McAfee (word, 0,6)
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
ignoreCase=true}
  position 1: McAfee (word, 0,6)
org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0,
generateNumberParts=0, catenateWords=0, generateWordParts=0,
catenateAll=0, catenateNumbers=0}
  (no terms -- the token is dropped here)
org.apache.solr.analysis.LowerCaseFilterFactory {}
  (no terms)
com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory
{protected=protwords.txt}
  (no terms)
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
  (no terms)



-Original Message-
From: Robert Petersen [mailto:rober...@buy.com] 
Sent: Monday, April 25, 2011 11:27 AM
To: solr-user@lucene.apache.org; yo...@lucidimagination.com
Subject: RE: term position question from analyzer stack for
WordDelimiterFilterFactory

Aha!  I knew something must be awry, but when I looked at the analysis
page output, well, it sure looked like it should match.  :)

OK, here is the query side WDF that finally works; I just turned
everything off.  (yay)  First I tried just completely removing WDF from
the query side analyzer stack, but that didn't work.  So anyway I suppose
I should turn off the catenate-all plus the preserve-original settings,
reindex, and see if I still get a match, huh?  (PS: thank you very much
for the help!!!)

  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="0"
          generateNumberParts="0"
          catenateWords="0"
          catenateNumbers="0"
          catenateAll="0"
          preserveOriginal="0"
  />



-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
Seeley
Sent: Monday, April 25, 2011 9:24 AM
To: solr-user@lucene.apache.org
Subject: Re: term position question from analyzer stack for
WordDelimiterFilterFactory

On Mon, Apr 25, 2011 at 12:15 PM, Robert Petersen rober...@buy.com
wrote:
 The search and index analyzer stack are the same.

Ahhh, they should not be!
Using both generate and catenate in WDF at query time is a no-no.
Same reason you can't have multi-word synonyms at query time:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

I'd recommend going back to the WDF settings in the solr example
server as a starting point.


-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


SynonymFilterFactory case changes

2011-04-26 Thread Robert Petersen
So if there is a hit in the synonym filter factory, do I need to put in the
various case changes for a term so that the following
WordDelimiterFilter analyzer can do its 'split on case changes' work?
Here we see SynonymFilterFactory makes all terms lowercase, because this
is what is in my synonyms.txt file and I have ignoreCase=true:
macafee, mcafee

Index Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  position 1: McAfee (word, 0,6)
org.apache.solr.analysis.SynonymFilterFactory
{synonyms=index_synonyms.txt, expand=true, ignoreCase=true}
  position 1: macafee (word, 0,6) | mcafee (word, 0,6)



RE: term position question from analyzer stack for WordDelimiterFilterFactory

2011-04-26 Thread Robert Petersen
Yeah I am about to try turning one on at a time and see what happens.  I
had a meeting so couldn't do it yet...  (darn those meetings)  (lol)


-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Tuesday, April 26, 2011 2:37 PM
To: solr-user@lucene.apache.org
Subject: Re: term position question from analyzer stack for
WordDelimiterFilterFactory

Hi Robert,

I'm no WDFF expert, but all these zeros look suspicious:

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0,
generateNumberParts=0, catenateWords=0, generateWordParts=0,
catenateAll=0, catenateNumbers=0}

A quick visit to
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
makes me think you want:

splitOnCaseChange=1  (if you want Mc Afee for some reason?)
generateWordParts=1 (if you want Mc Afee for some reason?)
preserveOriginal=1


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Robert Petersen rober...@buy.com
 To: solr-user@lucene.apache.org; yo...@lucidimagination.com
 Sent: Tue, April 26, 2011 4:39:49 PM
 Subject: RE: term position question from analyzer stack for 
WordDelimiterFilterFactory
 
 OK this is even more weird... everything is working much better except
 for  one thing: I was testing use cases with our top query terms to
make
 sure the  below query settings wouldn't break any existing behavior,
and
 got this most  unusual result.  The analyzer stack completely
eliminated
 the word  McAfee from the query terms!  I'm like huh?  Here is the
 analyzer  page output for that search term:
 
 Query Analyzer
 org.apache.solr.analysis.WhitespaceTokenizerFactory {}
 term position      1
 term text          McAfee
 term type          word
 source start,end   0,6
 payload
 org.apache.solr.analysis.SynonymFilterFactory
 {synonyms=query_synonyms.txt, expand=true, ignoreCase=true}
 term position      1
 term text          McAfee
 term type          word
 source start,end   0,6
 payload
 org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
 ignoreCase=true}
 term position      1
 term text          McAfee
 term type          word
 source start,end   0,6
 payload
 org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0,
 generateNumberParts=0, catenateWords=0, generateWordParts=0,
 catenateAll=0, catenateNumbers=0}
 (no tokens -- the term is eliminated at this stage)
 org.apache.solr.analysis.LowerCaseFilterFactory {}
 (no tokens)
 com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory
 {protected=protwords.txt}
 (no tokens)
 org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
 (no tokens)
 
 
 
 -Original Message-
 From: Robert  Petersen [mailto:rober...@buy.com] 
 Sent: Monday, April 25,  2011 11:27 AM
 To: solr-user@lucene.apache.org; yo...@lucidimagination.com
 Subject:  RE: term position question from analyzer stack  for
 WordDelimiterFilterFactory
 
 Aha!  I knew something must be awry, but when I looked at the analysis
 page output, well it sure looked like it should match.  :)
 
 OK here is the query side WDF that finally works, I just turned
 everything off.  (yay)  First I tried just completely removing WDF from
 the query side analyzer stack but that didn't work.  So anyway I suppose
 I should turn off the catenate all plus the preserve original settings,
 reindex, and see if I still get a match, huh?  (PS thank you very much
 for the help!!!)
 
 <filter class="solr.WordDelimiterFilterFactory"
         generateWordParts="0"
         generateNumberParts="0"
         catenateWords="0"
         catenateNumbers="0"
         catenateAll="0"
         preserveOriginal="0"
         />
 
 
 
 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of  Yonik
 Seeley
 Sent: Monday, April 25, 2011 9:24 AM
 To: solr-user@lucene.apache.org
 Subject:  Re: term position question from analyzer stack  for
 WordDelimiterFilterFactory
 
 On Mon, Apr 25, 2011 at 12:15 PM,  Robert Petersen rober...@buy.com
 wrote:
  The search and index analyzer stack are the same.
 
 Ahhh, they should not be!
 Using both generate and catenate in WDF at query time is a no-no.
 Same reason you can't have multi-word synonyms at query time:
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
 
 I'd recommend going back to the WDF settings in the solr example
 server as a starting point.
 
 
 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco
 


RE: SynonymFilterFactory case changes

2011-04-26 Thread Robert Petersen
But in this case lowercase is after WDF.  The question is: when you get a 
hit in the SynonymFilter on a synonym, and the entries in the synonyms.txt 
file are all in lower case, do I need to add the case-changing versions to 
make WDF work on case changes?  It appears the synonym text is replaced 
verbatim by what is in the txt file, and that defeats the WDF filter.  In 
fact, adding the case-changing versions of this term to the synonyms.txt file 
makes this use case work.  (yay)
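
For reference, a minimal sketch of such an entry (the exact case variants
are hypothetical and depend on which forms users actually type):

# synonyms.txt -- list the case variants explicitly, since the synonym
# replacement is verbatim and WDF needs a case change left to split on
macafee, MacAfee, McAfee, mcafee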

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, April 26, 2011 3:39 PM
To: solr-user@lucene.apache.org
Subject: Re: SynonymFilterFactory case changes

Yes, order does matter.  You're right, putting, say, lowercase in front
of WordDelimiter... will mess up the operations of WDFF.

The admin/analysis page is *extremely* useful for understanding what
happens in the analysis of input. Make sure to check the verbose
checkbox.

Best
Erick

On Tue, Apr 26, 2011 at 5:10 PM, Robert Petersen rober...@buy.com wrote:
 So if there is a hit in the synonym filter factory, do I need to put in
 the various case changes for a term so that the following
 WordDelimiterFilter can do its 'split on case changes' work?
 Here we see SynonymFilterFactory makes all terms lowercase, because this
 is what is in my synonyms.txt file and I have ignoreCase=true:
 macafee, mcafee

 Index Analyzer
 org.apache.solr.analysis.WhitespaceTokenizerFactory {}
 term position      1
 term text          McAfee
 term type          word
 source start,end   0,6
 payload
 org.apache.solr.analysis.SynonymFilterFactory
 {synonyms=index_synonyms.txt, expand=true, ignoreCase=true}
 term position      1         1
 term text          macafee   mcafee
 term type          word      word
 source start,end   0,6       0,6
 payload




RE: term position question from analyzer stack for WordDelimiterFilterFactory

2011-04-25 Thread Robert Petersen
Sorry, that was supposed to be just another way to say the same thing...
OK, here is my current situation.  Even with preserveOriginal and
catenateAll set, I am still getting an even odder result.

I set up sku=218078624 with title="Beanbag AppleTV Friction Dash Mount
for GPS" and index it in dev.

The search and index analyzer stacks are the same.  When I do the search
sku:218078624 title:AppleTV in the solr admin page I get zero results,
but when I do the search sku:218078624 title:appletv I get one result.
This is the opposite of what was happening before I added the preserve
original setting.  In the analysis page I plug in that title and term,
and it looks to me like it should match... which is why I started asking
about term positions and such.  I don't understand why I don't get a hit
in both cases.  It is so weird.



-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
Seeley
Sent: Friday, April 22, 2011 5:55 PM
To: Robert Petersen
Cc: solr-user@lucene.apache.org
Subject: Re: term position question from analyzer stack for
WordDelimiterFilterFactory

On Fri, Apr 22, 2011 at 8:24 PM, Robert Petersen rober...@buy.com
wrote:
 I can repeatedly demonstrate this in my dev environment, where I get
 entirely different results searching for AppleTV vs. appletv

You originally said "I cannot get a match between AppleTV on the
indexing side and appletv on the search side."
Getting different numbers of results or different results is slightly
different.

For example, if there were a document with "Apple TV" in it, then a
query of "AppleTV" would match that doc, but a query of "appletv"
would not.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


RE: term position question from analyzer stack for WordDelimiterFilterFactory

2011-04-25 Thread Robert Petersen
Aha!  I knew something must be awry, but when I looked at the analysis
page output, well it sure looked like it should match.  :)

OK here is the query side WDF that finally works, I just turned
everything off.  (yay)  First I tried just completely removing WDF from
the query side analyzer stack but that didn't work.  So anyway I suppose
I should turn off the catenate all plus the preserve original settings,
reindex, and see if I still get a match, huh?  (PS thank you very much
for the help!!!)

  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="0"
          generateNumberParts="0"
          catenateWords="0"
          catenateNumbers="0"
          catenateAll="0"
          preserveOriginal="0"
          />
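
As a reference point for Yonik's suggestion below, this is roughly the shape
of the example server's settings -- generate plus catenate at index time,
generate only at query time.  Values are from memory of the 1.4/3.x example
schema, so treat this as a sketch rather than the exact shipped config:

  <!-- index-time: generate parts and catenated forms -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="1" catenateNumbers="1" catenateAll="0"
          splitOnCaseChange="1"/>

  <!-- query-time: generate parts only, no catenation -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="0" catenateNumbers="0" catenateAll="0"
          splitOnCaseChange="1"/>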



-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
Seeley
Sent: Monday, April 25, 2011 9:24 AM
To: solr-user@lucene.apache.org
Subject: Re: term position question from analyzer stack for
WordDelimiterFilterFactory

On Mon, Apr 25, 2011 at 12:15 PM, Robert Petersen rober...@buy.com
wrote:
 The search and index analyzer stack are the same.

Ahhh, they should not be!
Using both generate and catenate in WDF at query time is a no-no.
Same reason you can't have multi-word synonyms at query time:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

I'd recommend going back to the WDF settings in the solr example
server as a starting point.


-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


RE: term position question from analyzer stack for WordDelimiterFilterFactory

2011-04-22 Thread Robert Petersen
I can repeatedly demonstrate this in my dev environment, where I get
entirely different results searching for AppleTV vs. appletv and I
really just don't get it.  I set up a specific sku in dev with AppleTV
in its title to experiment with.  What can I provide to help diagnose?
I need to make this work...  thanks for the help!


-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
Seeley
Sent: Thursday, April 21, 2011 5:54 PM
To: solr-user@lucene.apache.org
Subject: Re: term position question from analyzer stack for
WordDelimiterFilterFactory

On Thu, Apr 21, 2011 at 8:06 PM, Robert Petersen rober...@buy.com
wrote:
 So if I don't put preserveOriginal=1 in my WordDelimiterFilterFactory
settings I cannot get a match between AppleTV on the indexing side and
appletv on the search side.

Hmmm, that shouldn't be the case.  The text field in the solr
example config doesn't use preserveOriginal, and AppleTV is indexed as

appl, tv/appletv

And a search for appletv does match fine.

Perhaps on the search side there is actually a phrase query like "big
appletv"?  One workaround for that is to add a little slop... "big
appletv"~1

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


RE: stemming filter analyzers, any favorites?

2011-04-21 Thread Robert Petersen
Adding another field with another stemmer and searching both???  Wow never 
thought of doing that.  I guess that doesn't really double the size of your 
index tho because all the terms are almost the same right?  Let me look into 
that.  I'll raise the other issue in a separate thread and thanks.

-Original Message-
From: Em [mailto:mailformailingli...@yahoo.de] 
Sent: Thursday, April 21, 2011 1:55 AM
To: solr-user@lucene.apache.org
Subject: RE: stemming filter analyzers, any favorites?

Hi Robert,

we often ran into the same issue with stemmers. This is why we created more
than one field, each field with different stemmers. It adds some overhead
but worked quite well.

Regarding your off-topic-question:
Look at the debugging-output of your searches. Sometimes you configured your
tools, especially the WDF, wrong and the queryParser creates an unexpected
result which leads to unmatched but still relevant documents.

Please, show us your debugging-output and the field-definition so that we
can provide you some help!

Regards,
Em


Robert Petersen-3 wrote:
 
 I have been doing that, and for Bags example the trailing 's' is not being
 removed by the Kstemmer so if indexing the word bags and searching on bag
 you get no matches.  Why wouldn't the trailing 's' get stemmed off? 
 Kstemmer is dictionary based so bags isn't in the dictionary?   That
 trailing 's' should always be dropped no?  That seems like it would be
 better, we don't want to make synonyms for basic use cases like this.  I
 fear I will have to return to the Porter stemmer.  Are there other better
 ones is my main question.
 
 Off topic secondary question: sometimes I am puzzled by the output of the
 analysis page.  It seems like there should be a match, but I don't get the
 results during a search that I'd expect...  
 
 Like in the case if the WordDelimiterFilterFactory splits up a term into a
 bunch of terms before the K-stemmer is applied, sometimes if the matching
 term is in position two of the final analysis but the searcher had the
 partial term just alone and so thereby in position 1 in the analysis stack
 then when searching there wasn't a match.  Am I reading this correctly? 
 Is that right or should that match and I am misreading my analysis output?  
 
 Thanks!
 
 Robi
 
 PS  I have a category named Bags and am catching flack for it not coming
 up in a search for bag.  hah
 PPS the term is not in protwords.txt
 
 
 com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory
 {protected=protwords.txt}
 term position 1
 term text bags
 term type word
 source start,end  0,4
 payload   
 
 
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com] 
 Sent: Wednesday, April 20, 2011 10:55 AM
 To: solr-user@lucene.apache.org
 Subject: Re: stemming filter analyzers, any favorites?
 
 You can get a better sense of exactly which transformations occur, and
 when, if you look at the analysis page (be sure to check the verbose
 checkbox).
 
 I'm surprised that bags doesn't match bag, what does the analysis
 page say?
 
 Best
 Erick
 
 On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen rober...@buy.com
 wrote:
 Stemming filter analyzers... anyone have any favorites for particular
 search domains?  Just wondering what people are using.  I'm using the Lucid
 K Stemmer and having issues.  It seems like it misses a lot of common
 stems.  We went to it because of excessively loose matches from
 solr.PorterStemFilterFactory.


 I understand K Stemmer is a dictionary based stemmer.  It seems to me like
 it is missing a lot of common stem reductions.  I.e. Bags does not match
 Bag in our searches.

 Here is my analyzer stack:

 <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1"
             generateNumberParts="1"
             catenateWords="1"
             catenateNumbers="1"
             catenateAll="1"
             preserveOriginal="1"
             />
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
             protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms=

RE: stemming filter analyzers, any favorites?

2011-04-21 Thread Robert Petersen
Nice!  Thanks!

-Original Message-
From: Em [mailto:mailformailingli...@yahoo.de] 
Sent: Thursday, April 21, 2011 9:23 AM
To: solr-user@lucene.apache.org
Subject: RE: stemming filter analyzers, any favorites?

As far as I know Lucene does not store an inverted index per field, so no, it
would not double the size of the index.

However, it could influence the score a little bit.

For example: If both stemmers reduce schools to school and you are
searching for all schools in america, the term school has more weight in
the resulting score, since it definitely occurs in two fields which consist
of nearly the same value.

To reduce this effect you could write your own queryParser which creates a
disjunctionMaxQuery consisting of two boolean queries and a tie-break of 0 -
so only the better scoring stemmed-field contributes to the total score of
your document.

Regards,
Em


Robert Petersen-3 wrote:
 
 Adding another field with another stemmer and searching both???  Wow never
 thought of doing that.  I guess that doesn't really double the size of
 your index tho because all the terms are almost the same right?  Let me
 look into that.  I'll raise the other issue in a separate thread and
 thanks.
 
 -Original Message-
 From: Em [mailto:mailformailingli...@yahoo.de] 
 Sent: Thursday, April 21, 2011 1:55 AM
 To: solr-user@lucene.apache.org
 Subject: RE: stemming filter analyzers, any favorites?
 
 Hi Robert,
 
 we often ran into the same issue with stemmers. This is why we created
 more
 than one field, each field with different stemmers. It adds some overhead
 but worked quite well.
 
 Regarding your off-topic-question:
 Look at the debugging-output of your searches. Sometimes you configured
 your
 tools, especially the WDF, wrong and the queryParser creates an unexpected
 result which leads to unmatched but still relevant documents.
 
 Please, show us your debugging-output and the field-definition so that we
 can provide you some help!
 
 Regards,
 Em
 
 
 Robert Petersen-3 wrote:
 
 I have been doing that, and for Bags example the trailing 's' is not
 being
 removed by the Kstemmer so if indexing the word bags and searching on bag
 you get no matches.  Why wouldn't the trailing 's' get stemmed off? 
 Kstemmer is dictionary based so bags isn't in the dictionary?   That
 trailing 's' should always be dropped no?  That seems like it would be
 better, we don't want to make synonyms for basic use cases like this.  I
 fear I will have to return to the Porter stemmer.  Are there other better
 ones is my main question.
 
 Off topic secondary question: sometimes I am puzzled by the output of the
 analysis page.  It seems like there should be a match, but I don't get
 the
 results during a search that I'd expect...  
 
 Like in the case if the WordDelimiterFilterFactory splits up a term into
 a
 bunch of terms before the K-stemmer is applied, sometimes if the matching
 term is in position two of the final analysis but the searcher had the
 partial term just alone and so thereby in position 1 in the analysis
 stack
 then when searching there wasn't a match.  Am I reading this correctly? 
 Is that right or should that match and I am misreading my analysis
 output?  
 
 Thanks!
 
 Robi
 
 PS  I have a category named Bags and am catching flack for it not coming
 up in a search for bag.  hah
 PPS the term is not in protwords.txt
 
 
 com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory
 {protected=protwords.txt}
 term position      1
 term text          bags
 term type          word
 source start,end   0,4
 payload
 
 
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com] 
 Sent: Wednesday, April 20, 2011 10:55 AM
 To: solr-user@lucene.apache.org
 Subject: Re: stemming filter analyzers, any favorites?
 
 You can get a better sense of exactly which transformations occur, and
 when, if you look at the analysis page (be sure to check the verbose
 checkbox).
 
 I'm surprised that bags doesn't match bag, what does the analysis
 page say?
 
 Best
 Erick
 
 On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen rober...@buy.com
 wrote:
 Stemming filter analyzers... anyone have any favorites for particular
 search domains?  Just wondering what people are using.  I'm using the Lucid
 K Stemmer and having issues.  It seems like it misses a lot of common
 stems.  We went to it because of excessively loose matches from
 solr.PorterStemFilterFactory.


 I understand K Stemmer is a dictionary based stemmer.  It seems to me like
 it is missing a lot of common stem reductions.  I.e. Bags does not match
 Bag in our searches.

 Here is my analyzer stack:

 <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
             ignoreCase="true" expand="true"

term position question from analyzer stack for WordDelimiterFilterFactory

2011-04-21 Thread Robert Petersen
So if I don't put preserveOriginal=1 in my WordDelimiterFilterFactory settings 
I cannot get a match between AppleTV on the indexing side and appletv on the 
search side.  Without that setting the all lowercase version of AppleTV is in 
term position two due to the catenateWords=1 or the catenateAll=1 settings.  I 
am surprised.  How does term position affect searching?  Here is my analysis 
with preserveOriginal=1 to make the lower case occur in both term position 1 
and 2:

Index Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position      1
term text          AppleTV
term type          word
source start,end   0,7
payload

org.apache.solr.analysis.SynonymFilterFactory {synonyms=index_synonyms.txt, 
expand=true, ignoreCase=true}
term position      1
term text          AppleTV
term type          word
source start,end   0,7
payload

org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, 
ignoreCase=true}
term position      1
term text          AppleTV
term type          word
source start,end   0,7
payload

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1, 
generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=1, 
catenateNumbers=1}
term position      1          2
term text          AppleTV    TV
                   Apple      AppleTV
term type          word       word
                   word       word
source start,end   0,7        5,7
                   0,5        0,7
payload

org.apache.solr.analysis.LowerCaseFilterFactory {}
term position      1          2
term text          appletv    tv
                   apple      appletv
term type          word       word
                   word       word
source start,end   0,7        5,7
                   0,5        0,7
payload

com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory 
{protected=protwords.txt}
term position      1          2
term text          appletv    tv
                   apple      appletv
term type          word       word
                   word       word
source start,end   0,7        5,7
                   0,5        0,7
payload

org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
term position      1          2
term text          appletv    tv
                   apple      appletv
term type          word       word
                   word       word
source start,end   0,7        5,7
                   0,5        0,7
payload

Query Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position      1
term text          appletv
term type          word
source start,end   0,7
payload

org.apache.solr.analysis.SynonymFilterFactory {synonyms=query_synonyms.txt, 
expand=true, ignoreCase=true}
term position      1
term text          appletv
term type          word
source start,end   0,7
payload

org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, 
ignoreCase=true}
term position      1
term text          appletv
term type          word
source start,end   0,7
payload

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1, 
generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=1, 
catenateNumbers=1}
term position      1
term text          appletv
term type          word
source start,end   0,7
payload

org.apache.solr.analysis.LowerCaseFilterFactory {}
term position      1
term text          appletv
term type          word
source start,end   0,7
payload

com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory 
{protected=protwords.txt}
term position      1
term text          appletv
term type          word
source start,end   0,7
payload

org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
term position      1
term text          appletv
term type          word
source start,end   0,7
payload


stemming filter analyzers, any favorites?

2011-04-20 Thread Robert Petersen
Stemming filter analyzers... anyone have any favorites for particular
search domains?  Just wondering what people are using.  I'm using the Lucid
K Stemmer and having issues.  It seems like it misses a lot of common
stems.  We went to it because of excessively loose matches from
solr.PorterStemFilterFactory.


I understand K Stemmer is a dictionary based stemmer.  It seems to me like
it is missing a lot of common stem reductions.  I.e. Bags does not match
Bag in our searches.

Here is my analyzer stack:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            preserveOriginal="1"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            preserveOriginal="1"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>


RE: stemming filter analyzers, any favorites?

2011-04-20 Thread Robert Petersen
I have been doing that, and for the Bags example the trailing 's' is not being 
removed by the Kstemmer, so if indexing the word bags and searching on bag you 
get no matches.  Why wouldn't the trailing 's' get stemmed off?  Kstemmer is 
dictionary based, so bags isn't in the dictionary?  That trailing 's' should 
always be dropped, no?  That seems like it would be better; we don't want to 
make synonyms for basic use cases like this.  I fear I will have to return to 
the Porter stemmer.  Are there other better ones is my main question.

Off topic secondary question: sometimes I am puzzled by the output of the 
analysis page.  It seems like there should be a match, but I don't get the 
results during a search that I'd expect...  

Like in the case where the WordDelimiterFilterFactory splits up a term into a 
bunch of terms before the K-stemmer is applied: sometimes the matching term 
is in position two of the final analysis, but the searcher had the partial term 
alone, and so thereby in position 1 in the analysis stack, and then when 
searching there wasn't a match.  Am I reading this correctly?  Is that right, or 
should that match and I am misreading my analysis output?  

Thanks!

Robi

PS  I have a category named Bags and am catching flack for it not coming up in 
a search for bag.  hah
PPS the term is not in protwords.txt


com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory 
{protected=protwords.txt}
term position   1
term text   bags
term type   word
source start,end0,4
payload 


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, April 20, 2011 10:55 AM
To: solr-user@lucene.apache.org
Subject: Re: stemming filter analyzers, any favorites?

You can get a better sense of exactly which transformations occur, and
when, if you look at the analysis page (be sure to check the verbose
checkbox).

I'm surprised that bags doesn't match bag, what does the analysis
page say?

Best
Erick

On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen rober...@buy.com wrote:
 Stemming filter analyzers... anyone have any favorites for particular
 search domains?  Just wondering what people are using.  I'm using the Lucid
 K Stemmer and having issues.  It seems like it misses a lot of common
 stems.  We went to it because of excessively loose matches from
 solr.PorterStemFilterFactory.


 I understand K Stemmer is a dictionary based stemmer.  It seems to me like
 it is missing a lot of common stem reductions.  I.e. Bags does not match
 Bag in our searches.

 Here is my analyzer stack:

 <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1"
             generateNumberParts="1"
             catenateWords="1"
             catenateNumbers="1"
             catenateAll="1"
             preserveOriginal="1"
             />
     <filter class="solr.LowerCaseFilterFactory"/>
     <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
     <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
             protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1"
             generateNumberParts="1"
             catenateWords="1"
             catenateNumbers="1"
             catenateAll="1"
             preserveOriginal="1"
             />
     <filter class="solr.LowerCaseFilterFactory"/>
     <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
     <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
             protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>



RE: what happens to docsPending if stop solr before commit

2011-04-06 Thread Robert Petersen
Oh woe is me...  lol NP good to know.  I'll get them on the next go
'round.  :) 

Thanks for the answer!



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, April 06, 2011 6:05 AM
To: solr-user@lucene.apache.org
Subject: Re: what happens to docsPending if stop solr before commit

They're lost, never to be seen again. You'll have to reindex them.

Best
Erick

On Tue, Apr 5, 2011 at 4:25 PM, Robert Petersen rober...@buy.com
wrote:

 Hello fellow enthusiastic solr users,



 I tried to find the answer to this simple question online, but failed.
 I was wondering about this, what happens to uncommitted docsPending if
I
 stop solr and then restart solr?  Are they lost?  Are they still there
 but still uncommitted?  Do they get committed at startup?  I noticed
 after a restart my 250K pending doc count went to 0 is what got me
 wondering.



 TIA!

 Robi




RE: what happens to docsPending if stop solr before commit

2011-04-06 Thread Robert Petersen
Really?  Great!  I was wondering if there was some cleanup cycle like
that which would occur upon shutdown.  That sounds like much more
logical behavior! 

-Original Message-
From: Koji Sekiguchi [mailto:k...@r.email.ne.jp] 
Sent: Wednesday, April 06, 2011 4:03 PM
To: solr-user@lucene.apache.org
Subject: Re: what happens to docsPending if stop solr before commit

(11/04/06 5:25), Robert Petersen wrote:
 I tried to find the answer to this simple question online, but failed.
 I was wondering about this, what happens to uncommitted docsPending if
I
 stop solr and then restart solr?  Are they lost?  Are they still there
 but still uncommitted?  Do they get committed at startup?  I noticed
 after a restart my 250K pending doc count went to 0 is what got me
 wondering.

Robi,

Usually they are not lost; they get committed.

When you stop Solr, the servlet container (Jetty) calls the servlet/filter
destroy() methods. This causes all SolrCores to close. Then SolrCore.close()
calls UpdateHandler.close(), which calls SolrIndexWriter.close(). The
pending docs are then flushed and committed.

Koji
-- 
http://www.rondhuit.com/en/


what happens to docsPending if stop solr before commit

2011-04-05 Thread Robert Petersen
Hello fellow enthusiastic solr users,

 

I tried to find the answer to this simple question online, but failed.
I was wondering: what happens to uncommitted docsPending if I
stop solr and then restart solr?  Are they lost?  Are they still there
but still uncommitted?  Do they get committed at startup?  Noticing that
after a restart my 250K pending doc count went to 0 is what got me
wondering.

 

TIA!

Robi



RE: FW: no results searching for stadium seating chairs

2011-03-30 Thread Robert Petersen
Thanks for the input!  We've discussed using synonyms to help here.  We
have product managers who are supposed to add keywords on to skus also
which our indexer will automatically consume.  Getting them to do that
is a different matter!  haha

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Tuesday, March 29, 2011 11:19 AM
To: solr-user@lucene.apache.org
Subject: Re: FW: no results searching for stadium seating chairs

It seems unlikely you are going to find something that stems everything 
exactly how you want it, and nothing how you don't want it. This is very 
domain dependent, as you've discovered. I doubt there's even such a 
thing as the way everyone doing a 'retail product title search' would 
want it; it's going to vary.

You could use the synonym feature to make your own stemming dictionary, 
tell it to stem seating to seat (a sketch of this follows below).

Of course, that's also very expensive in terms of your time, to create 
your own custom dictionary.  But you're going to have to live with one 
of the compromises; software can't do magic!

For particular titles, you could also, in your own metadata control, add 
alternate titles that you want it to match on, before it even gets 
indexed.
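
A minimal sketch of that synonym-as-stemming-dictionary idea; the file name
and filter placement here are assumptions, not part of the original mail:

# synonyms.txt -- map the unstemmed form onto the stem
seating => seat

<!-- schema.xml: applied like any other synonym file, in both analyzers -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="false"/>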

On 3/29/2011 1:43 PM, Robert Petersen wrote:
 For retail product title search, would there be a better stemmer to
use?  We wanted a less aggressive stemmer, but I would expect the term
seating to stem.  I have found several other words which end in ing and
do not get stemmed.  Amongst our product lines are four million books
with all kinds of crazy titles, like the following oddity!  Here
counseling stems and unknowing doesn't:

 1. The Cloud of Unknowing and the Book of Privy Counseling
 Buy New: $29.95 $18.30
 3 New and Used from $18.30


 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
Seeley
 Sent: Tuesday, March 29, 2011 10:27 AM
 To: solr-user@lucene.apache.org
 Cc: Robert Petersen
 Subject: Re: FW: no results searching for stadium seating chairs

 On Tue, Mar 29, 2011 at 1:17 PM, Robert Petersenrober...@buy.com
wrote:
 Very interestingly, LucidKStemFilterFactory is stemming 'ing's
differently for different words.  The word 'seating' doesn't lose the
'ing' but the word 'counseling' does!  Can anyone explain the difference
here?  protwords.txt is empty btw.
 KStem is dictionary driven, so seating is probably in the
 dictionary.  I guess the author decided that seating and seat were
 sufficiently different.


 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco



RE: FW: no results searching for stadium seating chairs

2011-03-30 Thread Robert Petersen
Wow that sounds rad!

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Wednesday, March 30, 2011 9:39 AM
To: solr-user@lucene.apache.org
Subject: Re: FW: no results searching for stadium seating chairs

There are some new features in 3.1 to make it easier to tune this
stuff, especially:

http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_1/solr/src/java/org/apache/solr/analysis/StemmerOverrideFilterFactory.java

This takes a tab-separated list of word/stem pairs, and sets a flag so any
downstream stemmer will not mess with any of your mappings (thus the
name: StemmerOverrideFilter).

So the idea is you pick a stemmer that's close to what you want, then
you put this filter before it to tune it to your needs.
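
A hedged sketch of how that wiring might look in 3.1 -- the dictionary file
name and the choice of Porter as the downstream stemmer are assumptions:

<!-- stemdict.txt holds TAB-separated pairs, e.g.:  seating<TAB>seat -->
<filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt"
        ignoreCase="true"/>
<filter class="solr.PorterStemFilterFactory"/>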


On Wed, Mar 30, 2011 at 12:05 PM, Robert Petersen rober...@buy.com wrote:
 Thanks for the input!  We've discussed using synonyms to help here.  We
 have product managers who are supposed to add keywords on to skus also
 which our indexer will automatically consume.  Getting them to do that
 is a different matter!  haha

 -Original Message-
 From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
 Sent: Tuesday, March 29, 2011 11:19 AM
 To: solr-user@lucene.apache.org
 Subject: Re: FW: no results searching for stadium seating chairs

 It seems unlikely you are going to find something that stems everything
 exactly how you want it, and nothing how you don't want it. This is very
 domain dependent, as you've discovered. I doubt there's even such a
 thing as the way everyone doing a 'retail product title search' would
 want it; it's going to vary.

 You could use the synonym feature to make your own stemming dictionary,
 tell it to stem seating to seat.

 Of course, that's also very expensive in terms of your time, to create
 your own custom dictionary.  But you're going to have to live with one
 of the compromises; software can't do magic!

 For particular titles, you could also, in your own metadata control, add
 alternate titles that you want it to match on, before it even gets
 indexed.

 On 3/29/2011 1:43 PM, Robert Petersen wrote:
 For retail product title search, would there be a better stemmer to
 use?  We wanted a less aggressive stemmer, but I would expect the term
 seating to stem.  I have found several other words which end in ing and
 do not get stemmed.  Amongst our product lines are four million books
 with all kinds of crazy titles, like the following oddity!  Here
 counseling stems and unknowing doesn't:

 1. The Cloud of Unknowing and the Book of Privy Counseling
 Buy New: $29.95 $18.30
 3 New and Used from $18.30


 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: Tuesday, March 29, 2011 10:27 AM
 To: solr-user@lucene.apache.org
 Cc: Robert Petersen
 Subject: Re: FW: no results searching for stadium seating chairs

 On Tue, Mar 29, 2011 at 1:17 PM, Robert Petersenrober...@buy.com
 wrote:
 Very interestingly, LucidKStemFilterFactory is stemming 'ing's
 differently for different words.  The word 'seating' doesn't lose the
 'ing' but the word 'counseling' does!  Can anyone explain the difference
 here?  protwords.txt is empty btw.
 KStem is dictionary driven, so seating is probably in the
 dictionary.  I guess the author decided that seating and seat were
 sufficiently different.


 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco




FW: no results searching for stadium seating chairs

2011-03-29 Thread Robert Petersen
 

Very interestingly, LucidKStemFilterFactory is stemming ‘ing’s differently for 
different words.  The word ‘seating’ doesn't lose the 'ing' but the word 
‘counseling’ does!  Can anyone explain the difference here?  protwords.txt is 
empty btw.

 

com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory 
{protected=protwords.txt}
term position      1        2
term text          privy    counsel
term type          word     word
source start,end   0,5      6,16
payload


com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory 
{protected=protwords.txt}
term position      1
term text          seating
term type          word
source start,end   0,7

 



RE: FW: no results searching for stadium seating chairs

2011-03-29 Thread Robert Petersen
For retail product title search, would there be a better stemmer to use?  We 
wanted a less aggressive stemmer, but I would expect the term seating to stem.  
I have found several other words which end in 'ing' and do not get stemmed.  
Amongst our product lines are four million books with all kinds of crazy 
titles, like the following oddity!  Here counseling stems and unknowing doesn't:

1. The Cloud of Unknowing and the Book of Privy Counseling 
Buy New: $29.95 $18.30
3 New and Used from $18.30


-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Tuesday, March 29, 2011 10:27 AM
To: solr-user@lucene.apache.org
Cc: Robert Petersen
Subject: Re: FW: no results searching for stadium seating chairs

On Tue, Mar 29, 2011 at 1:17 PM, Robert Petersen rober...@buy.com wrote:
 Very interestingly, LucidKStemFilterFactory is stemming 'ing's differently 
 for different words.  The word 'seating' doesn't lose the 'ing' but the word 
 'counseling' does!  Can anyone explain the difference here?  protwords.txt is 
 empty btw.

KStem is dictionary driven, so seating is probably in the
dictionary.  I guess the author decided that seating and seat were
sufficiently different.


-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


RE: Different options for autocomplete/autosuggestion

2011-03-16 Thread Robert Petersen
I take raw user search term data, 'collapse' it into a form where I have
only unique terms, per store, ordered by frequency of searches over some
time period.  The suggestions are then grouped and presented with store
breakouts.  That sounds kind of like what this page is talking about
here, but I could be using the wrong terminology:
http://wiki.apache.org/solr/FieldCollapsing


-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Tuesday, March 15, 2011 9:00 PM
To: solr-user@lucene.apache.org
Subject: Re: Different options for autocomplete/autosuggestion

Hi,

I actually don't follow how field collapsing helps with
autocompletion...?

Over at http://search-lucene.com we eat our own autocomplete dog food: 
http://sematext.com/products/autocomplete/index.html .  Tasty stuff.

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Kai Schlamp schl...@gmx.de
 To: solr-user@lucene.apache.org
 Sent: Mon, March 14, 2011 11:52:48 PM
 Subject: Re: Different options for autocomplete/autosuggestion
 
 @Robert: That sounds interesting and very flexible, but also like a
 lot of work. This approach also doesn't seem to allow querying Solr
 directly by using Ajax ... one of the big benefits in my opinion when
 using Solr.
 @Bill: There are some things I don't like about the Suggester
 component. It doesn't seem to allow infix searches (at least it is not
 mentioned in the Wiki or elsewhere). It also uses a separate index
 that has to be rebuilt independently of the main index. And it doesn't
 support any filter queries.
 
 The Lucid Imagination blog also describes a further autosuggest
 approach
 (http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/).
 The disadvantage here is that the source documents must have distinct
 fields (resp. the dih selects must provide distinct data). Otherwise
 duplications would come up in the Solr query result, because of the
 document nature of Solr.
 
 In my opinion field collapsing seems to be most promising for a full
 featured autosuggestion solution. Unfortunately it is not available
 for Solr 1.4.x or 3.x (I tried patching those branches several times
 without success).
 
 2011/3/15 Bill Bell billnb...@gmail.com:
 
 http://lucidworks.lucidimagination.com/display/LWEUG/Spell+Checking+and+Automatic+Completion+of+User+Queries
 
  For Auto-Complete, find the following section in the solrconfig.xml file
  for the collection:
 
  <!-- Auto-Complete component -->
  <searchComponent name="autocomplete" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">autocomplete</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
      <str name="field">autocomplete</str>
      <str name="buildOnCommit">true</str>
      <!--
      <str name="sourceLocation">american-english</str>
      -->
    </lst>
 
 
 
 
  On 3/14/11 8:16 PM, Andy angelf...@yahoo.com wrote:
 
  Can you provide more details? Or a link?
 
  --- On Mon, 3/14/11, Bill Bell billnb...@gmail.com wrote:
 
   See how Lucid Enterprise does it... A bit differently.
 
   On 3/14/11 12:14 AM, Kai Schlamp kai.schl...@googlemail.com wrote:
 
   Hi.
 
   There seems to be several options for implementing an
   autocomplete/autosuggestions feature with Solr. I am trying to
   summarize those possibilities together with their advantages and
   disadvantages. It would be really nice to read some of your opinions.
 
   * Using N-Gram filter + text field query
   + available in stable 1.4.x
   + results can be boosted
   + sorted by best matches
   - may return duplicate results
 
   * Facets
   + available in stable 1.4.x
   + no duplicate entries
   - sorted by count
   - may need an extra N-Gram field for infix queries
 
   * Terms
   + available in stable 1.4.x
   + infix query by using regex in 3.x
   - only prefix query in 1.4.x
   - regexp may be slow (just a guess)
 
   * Suggestions
   ? Did not try that yet. Does it allow infix queries?
 
   * Field Collapsing
   + no duplications
   - only available in 4.x branch
   ? Does it work together with highlighting? That would be a big plus.
 
   What are your experiences regarding autocomplete/autosuggestion with
   Solr? Any additions, suggestions or corrections? What do you prefer?
 
   Kai
 
 
 
 
 
 
 
 
 
 
 
 
 -- 
 Dr. med. Kai Schlamp
 Am Fort Elisabeth 17
 55131  Mainz
 Germany
 Phone +49-177-7402778
 Email: schl...@gmx.de
 


i don't get why my index didn't grow more...

2011-03-16 Thread Robert Petersen
OK I have a 30 gb index where there are lots of sparsely populated int
fields and then one title field and one catchall field with title and
everything else we want as keywords.  I figure the catchall field is
the biggest field in our documents, which as I mentioned are otherwise
composed of a variety of int fields and a title.

So my puzzlement is this: my biggest field is copied into a double
metaphone field, and now I added another copyField to also copy the
catchall field into a newly created soundex field, for an experiment to
compare the effectiveness of the two.  I expected the index to grow by
at least 25% to 30%, but it barely grew at all.  Can someone explain
this to me?  Thanks!  :)
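
A sketch of the schema change being described; the field and type names here
are hypothetical, since the mail doesn't show the actual config:

<!-- a phonetic field fed by copyField, indexed but not stored -->
<fieldType name="text_soundex" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="Soundex" inject="false"/>
  </analyzer>
</fieldType>

<field name="catchall_soundex" type="text_soundex" indexed="true" stored="false"/>
<copyField source="catchall" dest="catchall_soundex"/>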

 



RE: Different options for autocomplete/autosuggestion

2011-03-14 Thread Robert Petersen
I like field collapsing because that way my suggestions give phrase
results (i.e. the suggestion starts with what the user has typed so far)
and thus I limit suggestions to be in the order of the words typed.  I
think that looks better for our retail oriented site.  I populate the
index with previous user queries.  I just put wildcards on the end of
the collapsed version of what the user has typed so far.  It is very fast.
I make suggestions for every keystroke as a user types in his query on
our site.  Hope that helps.
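
A hypothetical sketch of the per-keystroke request this describes.  The field
names are made up; the assumption is that the collapsed field holds a
normalized (whitespace-stripped) form of the phrase, while the stored,
uncollapsed phrase is what gets displayed:

# user has typed "stadium sea" so far
http://localhost:8983/solr/select?q=phrase_collapsed:stadiumsea*&fl=phrase_display&rows=10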


-Original Message-
From: Kai Schlamp [mailto:kai.schl...@googlemail.com] 
Sent: Sunday, March 13, 2011 11:14 PM
To: solr-user@lucene.apache.org
Subject: Different options for autocomplete/autosuggestion

Hi.

There seems to be several options for implementing an
autocomplete/autosuggestions feature with Solr. I am trying to
summarize those possibilities together with their advantages and
disadvantages. It would be really nice to read some of your opinions.

* Using N-Gram filter + text field query
+ available in stable 1.4.x
+ results can be boosted
+ sorted by best matches
- may return duplicate results

* Facets
+ available in stable 1.4.x
+ no duplicate entries
- sorted by count
- may need an extra N-Gram field for infix queries

* Terms
+ available in stable 1.4.x
+ infix query by using regex in 3.x
- only prefix query in 1.4.x
- regexp may be slow (just a guess)

* Suggestions
? Did not try that yet. Does it allow infix queries?

* Field Collapsing
+ no duplications
- only available in 4.x branch
? Does it work together with highlighting? That would be a big plus.

What are your experiences regarding autocomplete/autosuggestion with
Solr? Any additions, suggestions or corrections? What do you prefer?

Kai
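
For the Terms option listed above, a minimal sketch of a prefix request
against the 1.4.x TermsComponent (the field name is an assumption):

http://localhost:8983/solr/terms?terms.fl=suggest&terms.prefix=sta&terms.limit=10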


RE: Different options for autocomplete/autosuggestion

2011-03-14 Thread Robert Petersen
I am doing this very differently.  We are on solr 1.4.0 and I accomplish the 
collapsing in my wrapper layer.  I have written a layer of code around SOLR: an 
indexer on one end and a search service wrapping solr on the other end.  I 
manually collapse the field in my code.  I keep both a collapsed and uncollapsed 
version of the phrase in my index; the uncollapsed one is the only one stored for 
retrieval btw.  I do this on both ends so I have complete control here...  
works well!  Different than a patch of course tho.

-Original Message-
From: kai.schl...@googlemail.com [mailto:kai.schl...@googlemail.com] On Behalf 
Of Kai Schlamp
Sent: Monday, March 14, 2011 2:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Different options for autocomplete/autosuggestion

Robert, thanks for your answer. What Solr version do you use? 4.0?
As mentioned in my other post here, I tried to patch 1.4 for using
field collapsing, but couldn't get it to work (it compiled fine, but
the collapse parameters seem to be completely ignored).

2011/3/14 Robert Petersen rober...@buy.com:
 I like field collapsing because that way my suggestions give phrase
 results (i.e. the suggestion starts with what the user has typed so far)
 and thus I limit suggestions to be in the order of the words typed.  I
 think that looks better for our retail oriented site.  I populate the
 index with previous user queries.  I just put wildcards on the end of
 the collapsed version of what the user has typed so far.  It is very fast.
 I make suggestions for every keystroke as a user types in his query on
 our site.  Hope that helps.


 -Original Message-
 From: Kai Schlamp [mailto:kai.schl...@googlemail.com]
 Sent: Sunday, March 13, 2011 11:14 PM
 To: solr-user@lucene.apache.org
 Subject: Different options for autocomplete/autosuggestion

 Hi.

 There seems to be several options for implementing an
 autocomplete/autosuggestions feature with Solr. I am trying to
 summarize those possibilities together with their advantages and
 disadvantages. It would be really nice to read some of your opinions.

 * Using N-Gram filter + text field query
 + available in stable 1.4.x
 + results can be boosted
 + sorted by best matches
 - may return duplicate results

 * Facets
 + available in stable 1.4.x
 + no duplicate entries
 - sorted by count
 - may need an extra N-Gram field for infix queries

 * Terms
 + available in stable 1.4.x
 + infix query by using regex in 3.x
 - only prefix query in 1.4.x
 - regexp may be slow (just a guess)

 * Suggestions
 ? Did not try that yet. Does it allow infix queries?

 * Field Collapsing
 + no duplications
 - only available in 4.x branch
 ? Does it work together with highlighting? That would be a big plus.

 What are your experiences regarding autocomplete/autosuggestion with
 Solr? Any additions, suggestions or corrections? What do you prefer?

 Kai




-- 
Dr. med. Kai Schlamp
Am Fort Elisabeth 17
55131 Mainz
Germany
Phone +49-177-7402778
Email: schl...@gmx.de


RE: Different options for autocomplete/autosuggestion

2011-03-14 Thread Robert Petersen
Note that due to the 'raw' nature of my source data, I also have to heavily 
filter my data before collapsing it.  I don't want to suggest garbage 
phrases just because a lot of people searched on them.  We store auxiliary data 
in the index for filtering on to perform the grouping.

-Original Message-
From: Robert Petersen [mailto:rober...@buy.com] 
Sent: Monday, March 14, 2011 4:25 PM
To: solr-user@lucene.apache.org
Subject: RE: Different options for autocomplete/autosuggestion

I am doing this very differently.  We are on solr 1.4.0 and I accomplish the 
collapsing in my wrapper layer.  I have written a layer of code around SOLR: an 
indexer on one end and a search service wrapping solr on the other end.  I 
manually collapse the field in my code.  I keep both a collapsed and uncollapsed 
version of the phrase in my index; the uncollapsed one is the only one stored for 
retrieval btw.  I do this on both ends so I have complete control here...  
works well!  Different than a patch of course tho.

-Original Message-
From: kai.schl...@googlemail.com [mailto:kai.schl...@googlemail.com] On Behalf 
Of Kai Schlamp
Sent: Monday, March 14, 2011 2:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Different options for autocomplete/autosuggestion

Robert, thanks for your answer. What Solr version do you use? 4.0?
As mentioned in my other post here, I tried to patch 1.4 for using
field collapsing, but couldn't get it to work (it compiled fine, but
the collapse parameters seem to be completely ignored).

2011/3/14 Robert Petersen rober...@buy.com:
 I like field collapsing because that way my suggestions give phrase
 results (i.e. the suggestion starts with what the user has typed so far)
 and thus I limit suggestions to be in the order of the words typed.  I
 think that looks better for our retail oriented site.  I populate the
 index with previous user queries.  I just put wildcards on the end of
 the collapsed version of what the user has typed so far.  It is very fast.
 I make suggestions for every keystroke as a user types in his query on
 our site.  Hope that helps.


 -Original Message-
 From: Kai Schlamp [mailto:kai.schl...@googlemail.com]
 Sent: Sunday, March 13, 2011 11:14 PM
 To: solr-user@lucene.apache.org
 Subject: Different options for autocomplete/autosuggestion

 Hi.

 There seems to be several options for implementing an
 autocomplete/autosuggestions feature with Solr. I am trying to
 summarize those possibilities together with their advantages and
 disadvantages. It would be really nice to read some of your opinions.

 * Using N-Gram filter + text field query
 + available in stable 1.4.x
 + results can be boosted
 + sorted by best matches
 - may return duplicate results

 * Facets
 + available in stable 1.4.x
 + no duplicate entries
 - sorted by count
 - may need an extra N-Gram field for infix queries

 * Terms
 + available in stable 1.4.x
 + infix query by using regex in 3.x
 - only prefix query in 1.4.x
 - regexp may be slow (just a guess)

 * Suggestions
 ? Did not try that yet. Does it allow infix queries?

 * Field Collapsing
 + no duplications
 - only available in 4.x branch
 ? Does it work together with highlighting? That would be a big plus.

 What are your experiences regarding autocomplete/autosuggestion with
 Solr? Any additions, suggestions or corrections? What do you prefer?

 Kai




-- 
Dr. med. Kai Schlamp
Am Fort Elisabeth 17
55131 Mainz
Germany
Phone +49-177-7402778
Email: schl...@gmx.de


RE: True master-master fail-over without data gaps

2011-03-09 Thread Robert Petersen
If you have a wrapper, like an indexer app which prepares solr docs and
sends them into solr, then it is simple.  The wrapper is your 'tee' and
it can send docs to both (or N) masters.

-Original Message-
From: Michael Sokolov [mailto:soko...@ifactory.com] 
Sent: Wednesday, March 09, 2011 4:14 AM
To: solr-user@lucene.apache.org
Cc: Jonathan Rochkind
Subject: Re: True master-master fail-over without data gaps

Yes, I think this should be pushed upstream - insert a tee in the 
document stream so that all documents go to both masters.
Then use a load balancer to make requests of the masters.

The tee itself then becomes a possible single point of failure, but 
you didn't say anything about the architecture of the document feed.  Is

that also fault-tolerant?

-Mike

On 3/9/2011 1:06 AM, Jonathan Rochkind wrote:
 I'd honestly think about buffering the incoming documents in some store
that's actually made for fail-over persistence reliability, maybe
CouchDB or something. And then that's taking care of not losing
anything, and the problem becomes how we make sure that our solr master
indexes are kept in sync with the actual persistent store; which I'm
still not sure about, but I'm thinking it's a simpler problem. The right
tool for the right job; that kind of failover persistence is not solr's
specialty.
 
 From: Otis Gospodnetic [otis_gospodne...@yahoo.com]
 Sent: Tuesday, March 08, 2011 11:45 PM
 To: solr-user@lucene.apache.org
 Subject: True master-master fail-over without data gaps

 Hello,

 What are some common or good ways to handle indexing (master)
fail-over?
 Imagine you have a continuous stream of incoming documents that you
have to
 index without losing any of them (or with losing as few of them as
possible).
 How do you set up you masters?
 In other words, you can't just have 2 masters where the secondary is
the
 Repeater (or Slave) of the primary master and replicates the index
periodically:
 you need to have 2 masters that are in sync at all times!
 How do you achieve that?

 * Do you just put N masters behind a LB VIP, configure them both to
point to the
 index on some shared storage (e.g. SAN), and count on the LB to
fail-over to the
 secondary master when the primary becomes unreachable?
 If so, how do you deal with index locks?  You use the Native lock and
count on
 it disappearing when the primary master goes down?  That means you
count on the
 whole JVM process dying, which may not be the case...

 * Or do you use tools like DRBD, Corosync, Pacemaker, etc. to keep 2
masters
 with 2 separate indices in sync, while making sure you write to only 1
of them
 via LB VIP or otherwise?

 * Or ...


 This thread is on a similar topic, but is inconclusive:
http://search-lucene.com/m/aOsyN15f1qd1

 Here is another similar thread, but this one doesn't cover how 2
masters are
 kept in sync at all times:
http://search-lucene.com/m/aOsyN15f1qd1

 Thanks,
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/




RE: True master-master fail-over without data gaps

2011-03-09 Thread Robert Petersen
Currently I use an application connected to a queue containing incoming
data which my indexer app turns into solr docs.  I log everything to a
log table and have never had an issue with losing anything.  I can trace
incoming docs exactly, and keep timing data in there also. If I added a
second solr url for a second master and resent the same doc to master02
that I sent to master01, I would expect near 100% synchronization.  The
problem here is how to get the slave farm to start replicating from the
second master if and when the first goes down.  I can only see that as
being a manual operation, repointing the slaves to master02 and
restarting or reloading them etc...
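
(Unless the replication handler's fetchindex command with a masterUrl
override could be scripted for that - that part is an assumption on my
side from the SolrReplication wiki, untested:
http://slave01:8983/solr/replication?command=fetchindex&masterUrl=http://master02:8983/solr/replication)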



-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Wednesday, March 09, 2011 8:52 AM
To: solr-user@lucene.apache.org
Subject: Re: True master-master fail-over without data gaps

Hi,


- Original Message 
 From: Robert Petersen rober...@buy.com
 To: solr-user@lucene.apache.org
 Sent: Wed, March 9, 2011 11:40:56 AM
 Subject: RE: True master-master fail-over without data gaps
 
 If you have a wrapper, like an indexer app which prepares solr docs and
 sends them into solr, then it is simple.  The wrapper is your 'tee' and
 it can send docs to both (or N) masters.

Doesn't this make it too easy for 2 masters to get out of sync even if
the 
problem is not with them?
e.g. something happens in this tee component and it indexes a doc to
master A, 
but not master B.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/





RE: True master-master fail-over without data gaps

2011-03-09 Thread Robert Petersen
...but the index resides on disk, doesn't it???  lol

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Wednesday, March 09, 2011 9:06 AM
To: solr-user@lucene.apache.org
Subject: Re: True master-master fail-over without data gaps

Hi,



- Original Message 

 I'd honestly think about buffering the incoming documents in some store
 that's actually made for fail-over persistence reliability, maybe CouchDB
 or something. And then that's taking care of not losing anything, and the
 problem becomes how we make sure that our solr master indexes are kept in
 sync with the actual persistent store; which I'm still not sure about, but
 I'm thinking it's a simpler problem. The right tool for the right job,
 that kind of failover persistence is not solr's specialty. 


But check this!  In some cases one is not allowed to save content to disk
(think copyrights).  I'm not making this up - we actually have a customer
with this "cannot save to disk (but can index)" requirement.

So buffering to disk is not an option, and buffering in memory is not
practical 
because of the input document rate and their size.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/





RE: True master-master fail-over without data gaps

2011-03-09 Thread Robert Petersen
I guess you could put a LB between slaves and masters, never thought of
that!  :)

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Wednesday, March 09, 2011 9:10 AM
To: solr-user@lucene.apache.org
Subject: Re: True master-master fail-over without data gaps

Hi,



- Original Message 

 Currently I use an application connected to a queue containing incoming
 data which my indexer app turns into solr docs.  I log everything to a
 log table and have never had an issue with losing anything.  

Yeah, if everything goes through some storage that can be polled (either
a DB or a durable JMS Topic or some such), then N masters could connect
to it, not miss anything, and be more or less in near real-time sync.

 I can trace incoming docs exactly, and keep timing data in there also.
 If I added a second solr url for a second master and resent the same doc
 to master02 that I sent to master01, I would expect near 100%
 synchronization.  The problem here is how to get the slave farm to start
 replicating from the second master if and when the first goes down.  I
 can only see that as being a manual operation, repointing the slaves to
 master02 and restarting or reloading them etc...

Actually, you can configure a LB to handle that, so that's less of a
problem, I 
think.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



RE: True master-master fail-over without data gaps (choosing CA in CAP)

2011-03-09 Thread Robert Petersen
Can't you skip the SAN and keep the indexes locally?  Then you would
have two redundant copies of the index and no lock issues.  

Also, can't master02 just be a slave to master01 (in the master farm and
separate from the slave farm) until such time as master01 fails?  Then
master02 would start receiving the new documents with an index complete
up to the last replication at least, and the other slaves would be
directed by the LB to poll master02 also...

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Wednesday, March 09, 2011 9:47 AM
To: solr-user@lucene.apache.org
Subject: Re: True master-master fail-over without data gaps (choosing CA
in CAP)

Hi,

 
- Original Message 
 From: Walter Underwood wun...@wunderwood.org

 On Mar 9, 2011, at 9:02 AM, Otis Gospodnetic wrote:
 
  You mean it's not possible to have 2 masters that are in nearly
 real-time sync?
  How about with DRBD?  I know people use DRBD to keep 2 Hadoop NNs
 (their edit logs) in sync to avoid the current NN SPOF, for example,
 so I'm thinking this could be doable with Solr masters, too, no?
 
 If you add fault-tolerant, you run into the CAP Theorem. Consistency,
 availability, partition: choose two. You cannot have it all.

Right, so I'll take Consistency and Availability, and I'll put my 2
masters in 
the same rack (which has redundant switches, power supply, etc.) and
thus 
minimize/avoid partitioning.
Assuming the above actually works, I think my Q remains:

How do you set up 2 Solr masters so they are in near real-time sync?
DRBD?

But here is maybe a simpler scenario that more people may be
considering:

Imagine 2 masters on 2 different servers in 1 rack, pointing to the same
index 
on the shared storage (SAN) that also happens to live in the same rack.
2 Solr masters are behind 1 LB VIP that the indexer talks to.
The VIP is configured so that all requests always get routed to the
primary 
master (because only 1 master can be modifying an index at a time),
except when 
this primary is down, in which case the requests are sent to the
secondary 
master.

So in this case my Q is around automation of this, around Lucene index
locks, 
around the need for manual intervention, and such.
Concretely, if you have these 2 master instances, the primary master has
the 
Lucene index lock in the index dir.  When the secondary master needs to
take 
over (i.e., when it starts receiving documents via LB), it needs to be
able to 
write to that same index.  But what if that lock is still around?  One
could use 
the Native lock to make the lock disappear if the primary master's JVM
exited 
unexpectedly, and in that case everything *should* work and be
completely 
transparent, right?  That is, the secondary will start getting new docs,
it will 
use its IndexWriter to write to that same shared index, which won't be
locked 
for writes because the lock is gone, and everyone will be happy.  Did I
miss 
something important here?

Assuming the above is correct, what if the lock is *not* gone because
the 
primary master's JVM is actually not dead, although maybe unresponsive,
so LB 
thinks the primary master is dead.  Then the LB will route indexing
requests to 
the secondary master, which will attempt to write to the index, but be
denied 
because of the lock.  So a human needs to jump in, remove the lock, and
manually 
reindex failed docs if the upstream component doesn't buffer docs that
failed to 
get indexed and doesn't retry indexing them automatically.  Is this
correct or 
is there a way to avoid humans here?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


RE: Memory use during merges (OOM)

2010-12-16 Thread Robert Petersen
Hello, we occasionally bump into the OOM issue during merging after propagation 
too, and from the discussion below I guess we are doing thousands of 'false 
deletions' by unique id to make sure certain documents are *not* in the index.  
Could anyone explain why that is bad?  I didn't really understand the 
conclusion below. 

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Thursday, December 16, 2010 2:51 AM
To: solr-user@lucene.apache.org
Subject: Re: Memory use during merges (OOM)

RAM usage for merging is tricky.

First off, merging must hold open a SegmentReader for each segment
being merged.  However, it's not necessarily a full segment reader;
for example, merging doesn't need the terms index nor norms.  But it
will load deleted docs.

But, if you are doing deletions (or updateDocument, which is just a
delete + add under-the-hood), then this will force the terms index of
the segment readers to be loaded, thus consuming more RAM.
Furthermore, if the deletions you (by Term/Query) do in fact result in
deleted documents (ie they were not false deletions), then the
merging allocates an int[maxDoc()] for each SegmentReader that has
deletions.

Finally, if you have multiple merges running at once (see
CSM.setMaxMergeCount) that means RAM for each currently running merge
is tied up.

So I think the gist is... the RAM usage will be in proportion to the
net size of the merge (mergeFactor + how big each merged segment is),
how many merges you allow concurrently, and whether you do false or
true deletions.

If you are doing false deletions (calling .updateDocument when in fact
the Term you are replacing cannot exist) it'd be best if possible to
change the app to not call .updateDocument if you know the Term
doesn't exist.
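
For example, the guard might look something like this (a minimal
sketch; the knownIds set stands in for however the app tracks which ids
already exist in the index):

import java.io.IOException;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class GuardedIndexer {
    // knownIds is a hypothetical app-side record of ids already indexed
    void addOrUpdate(IndexWriter writer, Set<String> knownIds,
                     String id, Document doc) throws IOException {
        if (knownIds.contains(id)) {
            writer.updateDocument(new Term("id", id), doc); // true replace
        } else {
            writer.addDocument(doc); // brand new doc: no delete issued
            knownIds.add(id);
        }
    }
}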

Mike

On Wed, Dec 15, 2010 at 6:52 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Hello all,

 Are there any general guidelines for determining the main factors in memory 
 use during merges?

 We recently changed our indexing configuration to speed up indexing but in 
 the process of doing a very large merge we are running out of memory.
 Below is a list of the changes and part of the indexwriter log.  The changes 
 increased the indexing throughput by almost an order of magnitude 
 (from about 600 documents per hour to about 6000 documents per hour; our 
 documents are about 800K).

 We are trying to determine which of the changes to tweak to avoid the OOM, 
 but still keep the benefit of the increased indexing throughput.

 Is it likely that the changes to ramBufferSizeMB are the culprit or could it 
 be the mergeFactor change from 10 to 20?

  Is there any obvious relationship between ramBufferSizeMB and the memory 
 consumed by Solr?
  Are there rules of thumb for the memory needed in terms of the number or 
 size of segments?

 Our largest segments prior to the failed merge attempt were between 5GB and 
 30GB.  The memory allocated to the Solr/tomcat JVM is 10GB.

 Tom Burton-West
 -

 Changes to indexing configuration:
 mergeScheduler
        before: serialMergeScheduler
        after:  concurrentMergeScheduler
 mergeFactor
        before: 10
        after:  20
 ramBufferSizeMB
        before: 32
        after:  320

 excerpt from indexWriter.log

 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP: findMerges: 40 segments
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP:   level 7.23609 to 7.98609: 20 segments
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP:     0 to 20: add this merge
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP:   level 5.44878 to 6.19878: 20 segments
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: LMP:     20 to 40: add this merge

 ...
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: applyDeletes
 Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
 http-8091-Processor70]: DW: apply 1320 buffered deleted terms and 0 deleted 
 docIDs and 0 deleted queries on 40 segments.
 Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; 
 http-8091-Processor70]: hit exception flushing deletes
 Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; 
 http-8091-Processor70]: hit OutOfMemoryError inside updateDocument
 tom




RE: Memory use during merges (OOM)

2010-12-16 Thread Robert Petersen
Thanks Mike!  When you say 'term index of the segment readers', are you 
referring to the term vectors?

In our case our index of 8 million docs holds pretty 'skinny' docs containing 
searchable product titles and keywords, with the rest of the doc only holding 
Ids for faceting upon.  Docs typically only have unique terms per doc, with a 
lot of overlap of the terms across categories of docs (all similar products).  
I'm thinking that our unique terms are low vs the size of our index.  The way 
we spin out deletes and adds should keep the terms loaded all the time.  Seems 
like once in a couple weeks a propagation happens which kills the slave farm 
with OOMs.  We are bumping the heap up a couple gigs every time this happens 
and hoping it goes away at this point.  That is why I jumped into this 
discussion, sorry for butting in like that.  you guys are discussing very 
interesting settings I had not considered before.

Rob


-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Thursday, December 16, 2010 10:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Memory use during merges (OOM)

It's not that it's bad, it's just that Lucene must do extra work to
check if these deletes are real or not, and that extra work requires
loading the terms index which will consume additional RAM.

For most apps, though, the terms index is relatively small and so this
isn't really an issue.  But if your terms index is large this can
explain the added RAM usage.

One workaround for large terms index is to set the terms index divisor
that IndexWriter should use whenever it loads a terms index (this is
IndexWriter.setReaderTermsIndexDivisor).

Mike


RE: entire farm fails at the same time with OOM issues

2010-12-01 Thread Robert Petersen
It has typically been when query traffic was lowest!  We are at 12 GB heap, so 
I will try to bump it to 14 GB.  We have 64GB main memory installed now.  Here 
is our settings, do these look OK?

export JAVA_OPTS="-Xmx12228m -Xms12228m -XX:+UseConcMarkSweepGC 
-XX:+CMSIncrementalMode"
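
For the once-a-day optimize, I guess a cron'd SolrJ call like this would
do it (untested sketch, made-up master URL; posting <optimize/> to
/update with curl would work just as well):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class NightlyOptimize {
    public static void main(String[] args) throws Exception {
        // run during off-peak hours (hypothetical master URL)
        SolrServer solr = new CommonsHttpSolrServer("http://master01:8983/solr");
        solr.optimize();  // merge the index down to a single segment
    }
}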



-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Tuesday, November 30, 2010 6:44 PM
To: solr-user@lucene.apache.org
Subject: Re: entire farm fails at the same time with OOM issues

On Tue, Nov 30, 2010 at 6:04 PM, Robert Petersen rober...@buy.com wrote:
 My question is this.  Why in the world would all of my slaves, after
 running fine for some days, suddenly all at the exact same minute
 experience OOM heap errors and go dead?

If there is no change in query traffic when this happens, then it's
due to what the index looks like.

My guess is a large index merge happened, which means that when the
searchers re-open on the new index, it requires more memory than
normal (much less can be shared with the previous index).

I'd try bumping the heap a little bit, and then optimizing once a day
during off-peak hours.
If you still get OOM errors, bump the heap a little more.

-Yonik
http://www.lucidimagination.com


RE: entire farm fails at the same time with OOM issues

2010-12-01 Thread Robert Petersen
Good idea.  Our farm is behind Akamai so that should be ok to do.

-Original Message-
From: Peter Karich [mailto:peat...@yahoo.de] 
Sent: Wednesday, December 01, 2010 12:21 PM
To: solr-user@lucene.apache.org
Subject: Re: entire farm fails at the same time with OOM issues


  also try to minimize maxWarmingSearchers to 1(?) or 2.
And decrease cache usage (especially autowarming) if possible at all. 
But again: only if it doesn't affect performance ...

Regards,
Peter.

 On Tue, Nov 30, 2010 at 6:04 PM, Robert Petersen rober...@buy.com wrote:
 My question is this.  Why in the world would all of my slaves, after
 running fine for some days, suddenly all at the exact same minute
 experience OOM heap errors and go dead?
 If there is no change in query traffic when this happens, then it's
 due to what the index looks like.

 My guess is a large index merge happened, which means that when the
 searchers re-open on the new index, it requires more memory than
 normal (much less can be shared with the previous index).

 I'd try bumping the heap a little bit, and then optimizing once a day
 during off-peak hours.
 If you still get OOM errors, bump the heap a little more.

 -Yonik
 http://www.lucidimagination.com



shutdown.sh does not kill the tomcat process running solr?

2010-11-30 Thread Robert Petersen
Greetings, we're wondering why, when we issue the command to shut down
tomcat/solr, the process remains visible in memory (per the top command)
and we have to manually kill the PID for it to release its memory before
we can (re)start tomcat/solr.  Anybody have any ideas?
The process is using 12+ GB main memory typically but can go up to 40 GB
on the master where we index.  We have 64GB main memory on these
servers.  I set the heap at 12 GB and use the concurrent garbage
collector too.

 

That raises another question:  top shows only 20 GB free out of 64, but
the tomcat/solr process only shows it's using half of that.  What is
using the rest?  The numbers don't add up...

 

Environment:  Lucid Imagination distro of Solr 1.4 on Tomcat  

Platform: RHEL with Sun JRE 1.6.0_18 



entire farm fails at the same time with OOM issues

2010-11-30 Thread Robert Petersen
Greetings, we are running one master and four slaves of our multicore
solr setup.  We just served searches for our catalog of 8 million
products with this farm during Black Friday and Cyber Monday, our
busiest days of the year, and the servers did not break a sweat!  Index
size is about 28GB.

 

However, twice now recently during a time of low load we have had a fire
drill where I have seen tomcat/solr fail and become unresponsive after
some OOM heap errors.  Solr wouldn't even serve up its admin pages.
I've had to go in and manually knock tomcat out of memory and then
restart it.  These solr slaves are load balanced and the load balancers
always probe the solr slaves so if they stop serving up searches they
are automatically removed from the load balancer.  When all four fail at
the same time we have an issue!

 

My question is this.  Why in the world would all of my slaves, after
running fine for some days, suddenly all at the exact same minute
experience OOM heap errors and go dead?  The load balancer kicks them
all out at the same time each time.  Each slave only talks to the master
and not to each other, but the master shows no errors in the logs at all.
Something must be triggering this though.  The only other odd thing I
saw in the logs was after the first OOM errors were recorded, the slaves
started occasionally not being able to get to the master.

 

This behavior makes me a little nervous...=:-o  eek!

 

 

Environment:  Lucid Imagination distro of Solr 1.4 on Tomcat  

 

Platform: RHEL with Sun JRE 1.6.0_18 on dual quad xeon machines with
64GB memory etc etc

 

 

 



RE: entire farm fails at the same time with OOM issues

2010-11-30 Thread Robert Petersen
What would I do with the heap dump though?  Run one of those java heap
analyzers looking for memory leaks or something?  I have no experience
with those.  I saw there was a bug fix in solr 1.4.1 for a 100 byte memory
leak occurring on each commit, but it would take thousands of commits to
make that add up to anything, right?

-Original Message-
From: Ken Krugler [mailto:kkrugler_li...@transpac.com] 
Sent: Tuesday, November 30, 2010 3:12 PM
To: solr-user@lucene.apache.org
Subject: Re: entire farm fails at the same time with OOM issues

Hi Robert,

I'd recommend launching Tomcat with -XX:+HeapDumpOnOutOfMemoryError  
and -XX:HeapDumpPath=<path to where you want the file to go>, so then  
you have something to look at versus a Gedankenexperiment :)

-- Ken


http://ken-blog.krugler.org
+1 530-265-2225






--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







RE: Adding new field after data is already indexed

2010-11-10 Thread Robert Petersen
1)  Just put the new field in the schema and stop/start solr.  Documents
in the index will not have the field until you reindex them but it won't
hurt anything.

2)  Just turn off those handlers in solrconfig; I think that's all it
takes.
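
(Re #1: if you want new records to pick up a value automatically, I
believe the schema field can also carry a default attribute, e.g.
default="0", but that only applies to docs indexed after the change, so
the existing records would still need a reindex.)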

-Original Message-
From: gauravshetti [mailto:gaurav.she...@tcs.com] 
Sent: Monday, November 08, 2010 5:21 AM
To: solr-user@lucene.apache.org
Subject: Adding new field after data is already indexed


Hi,
 
 I had a few questions regarding Solr.
Say my schema file looks like

<field name="folder_id" type="long" indexed="true" stored="true"/>
<field name="indexed" type="boolean" indexed="true" stored="true"/>

and I index data on the basis of these fields. Now, in case I need to add a
new field, is there a way I can add the field without corrupting the
previous data? Is there any feature which adds a new field with a default
value to the existing records?


2) Is there any security mechanism/authorization check to restrict urls
like /admin and /update to only a few users?

-- 
View this message in context:
http://lucene.472066.n3.nabble.com/Adding-new-field-after-data-is-alread
y-indexed-tp1862575p1862575.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: phrase query with autosuggest (SOLR-1316)

2010-10-06 Thread Robert Petersen
My simple but effective solution to that problem was to replace the
white spaces in the items you index for autosuggest with some special
character; then your wildcarding will work across the whole phrase as
you desire.

Index: mike_shaffer
Query: mike_sha*  
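
In code the trick is just a replace on both sides (a sketch; field
names are for illustration only):

import org.apache.solr.common.SolrInputDocument;

class SuggestHelper {
    // index side: store the display phrase plus a space-collapsed copy
    static SolrInputDocument toDoc(String phrase) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("phrase", phrase);                           // for display
        doc.addField("phraseNoSpaces", phrase.replace(' ', '_')); // for matching
        return doc;
    }

    // query side: apply the same replacement before wildcarding
    static String toQuery(String userPrefix) {
        return "phraseNoSpaces:" + userPrefix.replace(' ', '_') + "*";
    }
}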

-Original Message-
From: mike anderson [mailto:saidthero...@gmail.com] 
Sent: Wednesday, October 06, 2010 7:33 AM
To: solr-user@lucene.apache.org
Subject: phrase query with autosuggest (SOLR-1316)

It seemed like SOLR-1316 was a little too long to continue the
conversation.

Is there support for quotes indicating a phrase query? For example, my
autosuggest query for "mike sha" ought to return "mike shaffer", "mike
sharp", etc. Instead I get suggestions for "mike" and for "sha",
resulting in a collated result like "mike r meyer shaw".

Cheers,
Mike


RE: Do commits block updates in SOLR 1.4?

2010-09-03 Thread Robert Petersen
So you are saying we definitely do not need to pause ADD activity on other 
threads while we send the COMMIT?  And the same goes with AUTOCOMMIT right?

We are using SOLR 1.4 now.  We were on 1.3 previously.  We pretty much just 
assumed pausing ADDs during COMMITs was required by SOLR when we designed our 
indexing system, mostly due to our experience with an older and different 
search engine.  

-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com] 
Sent: Thursday, September 02, 2010 6:10 PM
To: solr-user@lucene.apache.org
Subject: Re: Do commits block updates in SOLR 1.4?

Yes, indexing is synchronized during commits. You can call commit all you
want, and index docs, and commit will finish and then indexing will
restart.  Previous Solr release did this also; how far back is your
existing Solr?



-- 
Lance Norskog
goks...@gmail.com


RE: Do commits block updates in SOLR 1.4?

2010-09-03 Thread Robert Petersen
Thanks guys!  I will be quite happy to remove the unnecessary complexity from 
our code.
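
So the indexer threads can end up as simple as this, with no app-side
lock around the commit (an untested SolrJ sketch, made-up URL):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class UnpausedIndexer {
    public static void main(String[] args) throws Exception {
        final SolrServer solr = new CommonsHttpSolrServer("http://master01:8983/solr");
        for (int i = 0; i < 10; i++) {
            final int n = i;
            new Thread(new Runnable() {
                public void run() {
                    try {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", "doc-" + n);
                        solr.add(doc);  // fine even while a commit is in flight
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }).start();
        }
        solr.commit();  // no need to pause the adders first
    }
}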

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Friday, September 03, 2010 10:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Do commits block updates in SOLR 1.4?

Solr handles all of this concurrency for you - it's actually even a
little too aggressive about that these days, as Lucene has changed a lot
- but yes - you can add while committing and commit while adding - Solr
will block itself as needed.

- Mark



RE: Do commits block updates in SOLR 1.4?

2010-09-02 Thread Robert Petersen
Hello, sorry to bother, but does anyone know the answer to this?  This is
the closest thing I can find on the subject:

http://lucene.472066.n3.nabble.com/Autocommit-blocking-adds-AutoCommit-S
peedup-td498465.html



Do commits block updates in SOLR 1.4?

2010-09-01 Thread Robert Petersen
I can't seem to find a definitive answer.  I have ten threads doing my
indexing and I block all the threads when one is ready to do a commit so
no adds are done until the commit finishes.  Is this still required in
SOLR 1.4 or could I take it out?  I tried testing this on a separate
small index where I set autocommit in solrconfig and seem to have no
issues just continuously adding documents from multiple threads to it
despite its commit activity.  I'd like to do the same in my big main
index, is it safe?

 

Also, is there any difference in behavior between autocommits and
explicit commits in this regard?

 

 



RE: Auto Suggest

2010-09-01 Thread Robert Petersen
I do this by replacing the spaces with a '%' in a separate search field
which is neither parsed nor tokenized; then you can wildcard across the
whole phrase like you want and the spaces don't mess you up.  Just store
the original phrase with spaces in a separate field for returning to the
front end for display.

-Original Message-
From: Jazz Globe [mailto:jazzgl...@hotmail.com] 
Sent: Wednesday, September 01, 2010 7:33 AM
To: solr-user@lucene.apache.org
Subject: Auto Suggest


Hallo

How would one implement a multiple term auto-suggest feature in Solr
that is filter sensitive?
For example, a user enters :
mp3
  and solr might suggest:
  -   mp3 player
  -   mp3 nano
  -   mp3 sony
and then the user starts the second word :
mp3 n
and that narrows it down to:
  - mp3 nano

I had a quick look at the Terms Component.
I suppose it just returns term totals for the entire index and cannot be
used with a filter or query?

Thanks
Johan

  


RE: Auto Suggest

2010-09-01 Thread Robert Petersen
We don't have that many, just a hundred thousand, and solr response
times (since the index's docs are small and not complex) are logged as
typically 1 ms if not 0 ms.  It's funny but sometimes it is so fast no
milliseconds have elapsed.  Incredible if you ask me...  :)

Once you get SOLR to consider the whole phrase as just one big term, the
wildcard is very fast.

-Original Message-
From: Eric Grobler [mailto:impalah...@googlemail.com] 
Sent: Wednesday, September 01, 2010 12:35 PM
To: solr-user@lucene.apache.org
Subject: Re: Auto Suggest

Hi Robert,

Interesting approach, how many documents do you have in Solr?
I have about 2 million and I just wonder if it might be a bit slow.

Regards
Johan






It seems like using a wildcard causes lowercase filter to not do the lowercasing?

2010-08-09 Thread Robert Petersen
I have a field with lowercase filter on search and index sides, and
searching in this field works fine with uppercase or lowercase terms,
except if I wildcard!  So searching for 'gps' or 'GPS' returns the same
result set, but searching for 'gps*' returns results as expected and
searching for 'GPS*' returns nothing.  It seems the asterisk blocks the
lower case filter operation and then no matches occur because the index
is all lowercased.

 

This is a very simple index with very simple docs, and the field is
defined like this in the schema:

 

<field name="phraseNoSpaces" type="alphaOnlySort" indexed="true"
stored="false" required="true"/>

<fieldType name="alphaOnlySort" class="solr.TextField"
sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
</fieldType>

 



RE: It seems like using a wildcard causes lowercase filter to not do the lowercasing?

2010-08-09 Thread Robert Petersen
Aha, I overlooked that.  Thank you.
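
For anyone else who hits this: the workaround I'll use is to lowercase
the term client-side before tacking on the wildcard (a sketch; it
assumes the field is lowercased at index time, as in my alphaOnlySort
type above):

import java.util.Locale;

class PrefixQueryBuilder {
    static String prefixQuery(String field, String userInput) {
        // wildcard terms bypass the analyzer, so lowercase them here
        return field + ":" + userInput.toLowerCase(Locale.ROOT) + "*";
    }
}

prefixQuery("phraseNoSpaces", "GPS") then yields phraseNoSpaces:gps*,
which matches the lowercased index.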

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Monday, August 09, 2010 1:28 PM
To: solr-user@lucene.apache.org
Subject: Re: It seems like using a wildcard causes lowercase filter to not do 
the lowercasing?

 I have a field with lowercase filter
 on search and index sides, and
 searching in this field works fine with uppercase or
 lowercase terms,
 except if I wildcard!  So searching for 'gps' or 'GPS'
 returns the same
 result set, but searching for 'gps*' returns results as
 expected and
 searching for 'GPS*' returns nothing.  It seems the
 asterisk blocks the
 lower case filter operation and then no matches occur
 because the index
 is all lowercased.

Unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries are 
not passed through the Analyzer [1]

[1]http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F


  


does this indicate a commit happened for every add?

2010-07-27 Thread Robert Petersen
I'm adding lots of small docs with several threads to solr and the adds
start fast but then slow down.  I didn't do any explicit commits and
autocommit is turned off but the logs show lots of commit activity on
this core and restarting this solr core logged the below.  Where did all
these commits come from, the exact same number as my adds?  I'm
stumped...

Jul 27, 2010 10:07:17 AM org.apache.solr.update.DirectUpdateHandler2
close
INFO: closed
DirectUpdateHandler2{commits=456389,autocommits=0,optimizes=0,rollbacks=
0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,e
rrors=0,cumulative_adds=456393,cumulative_deletesById=0,cumulative_delet
esByQuery=0,cumulative_errors=0}


RE: CommonsHttpSolrServer add document hangs

2010-07-12 Thread Robert Petersen
Maybe solr is busy doing a commit or optimize?

-Original Message-
From: Max Lynch [mailto:ihas...@gmail.com] 
Sent: Monday, July 12, 2010 9:59 AM
To: solr-user@lucene.apache.org
Subject: CommonsHttpSolrServer add document hangs

Hey guys,
I'm using Solr 1.4.1 and I've been having some problems lately with code
that adds documents through a CommonsHttpSolrServer.  It seems that
randomly
the call to theserver.add() will hang.  I am currently running my code
in a
single thread, but I noticed this would happen in multi threaded code as
well.  The jar version of commons-httpclient is 3.1.

I got a thread dump of the process, and one thread seems to be waiting
on
the org.apache.commons.httpclient.MultiThreadedHttpConnectionManager as
shown below.  All other threads are in a RUNNABLE state (besides the
Finalizer daemon).

 [java] Full thread dump Java HotSpot(TM) 64-Bit Server VM (16.3-b01
mixed mode):
 [java]
 [java] MultiThreadedHttpConnectionManager cleanup daemon prio=10
tid=0x7f441051c800 nid=0x527c in Object.wait() [0x7f4417e2f000]
 [java]java.lang.Thread.State: WAITING (on object monitor)
 [java] at java.lang.Object.wait(Native Method)
 [java] - waiting on 0x7f443ae5b290 (a
java.lang.ref.ReferenceQueue$Lock)
 [java] at
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
 [java] - locked 0x7f443ae5b290 (a
java.lang.ref.ReferenceQueue$Lock)
 [java] at
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
 [java] at
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$Referen
ceQueueThread.run(MultiThreadedHttpConnectionManager.java:1122)

Any ideas?

Thanks.


RE: CommonsHttpSolrServer add document hangs

2010-07-12 Thread Robert Petersen
You could try a master-slave setup using replication, perhaps; then the
slave serves searches, and indexing commits on the master won't hang up
searches at least...

Here is the description:  http://wiki.apache.org/solr/SolrReplication


-Original Message-
From: Max Lynch [mailto:ihas...@gmail.com] 
Sent: Monday, July 12, 2010 11:57 AM
To: solr-user@lucene.apache.org
Subject: Re: CommonsHttpSolrServer add document hangs

Thanks Robert,

My script did start going again, but it was waiting for about half an
hour
which seems a bit excessive to me.  Is there some tuning I can do on the
solr end to optimize for my use case, which is very heavy on commits and
very light on searches (I do most of my searches on the raw Lucene index
in
the background)?

Thanks.




GC tuning - heap size autoranging

2010-06-30 Thread Robert Petersen

Is this a true statement???  This seems to contradict other statements 
regarding setting the heap size I have seen here...

Default Heap Size
If not otherwise set on the command line, the initial and maximum heap sizes 
are calculated based on the amount of memory on the machine. The proportion of 
memory to use for the heap is controlled by the command line 
options DefaultInitialRAMFraction and DefaultMaxRAMFraction, as shown in the 
table below. (In the table, memory represents the amount of memory on the 
machine.)

Pasted from 
http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#available_collectors.selecting


RE: OOM on uninvert field request

2010-06-30 Thread Robert Petersen
Hey so after adding those GC options, I was able to incrementally push my max 
(and min) memory settings up and when we got to max=min=12GB we started looking 
much better!  One slave handles all the load with no OOMs at all!  I'm watching 
the live tomcat log using 'tail'.  Next I will convert that field type to 
(trie) int and reindex.  I'll have to start a new index from scratch with a 
field type change like that so I'll have to delete the old one first on our 
master... It takes us a couple days to index 15 million products (some are sets 
so the final index size is only 8 million) so I don't want to do *that* too 
often as the slaves will be quite stale by the time it's done!  :)

Thanks for the help!

-Original Message-
From: Robert Petersen [mailto:rober...@buy.com] 
Sent: Wednesday, June 30, 2010 9:49 AM
To: solr-user@lucene.apache.org
Subject: RE: OOM on uninvert field request

At and above 4GB we get those GC errors though!  Should I switch to something 
like this?

Recommended Options
To use i-cms in Java SE 6, use the following command line options:

-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps


Caused by: java.lang.RuntimeException: java.lang.OutOfMemoryError: GC overhead 
limit exceeded
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068)
at 
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:418)
at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:467)
at 
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:319)
... 11 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded


-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com] 
Sent: Tuesday, June 29, 2010 8:42 PM
To: solr-user@lucene.apache.org
Subject: Re: OOM on uninvert field request

Yes, it is better to use ints for ids than strings. Also, the Trie int
fields have a compressed format that may cut the storage needs even
more. 8M docs * 4 bytes = 32MB per field; times a few hundred fields,
we'll say 300, that is roughly 9.6GB of IDs.  I don't know how these
fields are stored, but if they are separate objects we've blown up even
further (per-object overheads are surprising).

4G is probably not enough for what you want. If you watch the total
memory with 'top' and hit it with different queries, you will get a
stronger sense of how much memory your use cases need.

On Tue, Jun 29, 2010 at 4:32 PM, Robert Petersen rober...@buy.com wrote:
 Hello, I am trying to find the right max and min settings for Java 1.6 on a 
 20GB index with 8 million docs, running the 1.6_018 JVM with solr 1.4.  I 
 currently have java set to an even 4GB (export JAVA_OPTS="-Xmx4096m 
 -Xms4096m") for both min and max, which is doing pretty well, but we are 
 occasionally still getting the below OOM errors.  We're running on dual quad 
 core xeons with 16GB memory installed.

 Is the memsize mentioned in the INFO for the uninvert in bytes?  I.e., does 
 memSize=29604020 mean 29MB?  We have a few hundred of these fields and they 
 contain ints used as IDs, so I guess they could eat all the memory once they 
 all get uninverted after we apply load and enough queries are performed.  Does 
 the field type matter?  Would int be better than string if these are lookup 
 ids sparsely populated across the index?  BTW these are used for faceting and 
 filtering only.

                <dynamicField name="*_contentAttributeToken" type="string" indexed="true" multiValued="true" stored="true" required="false"/>

 Jun 29, 2010 3:54:50 PM org.apache.solr.request.UnInvertedField uninvert
 INFO: UnInverted multi-valued field {field=768_contentAttributeToken,memSize=29604014,tindexSize=50,time=1841,phase1=1824,nTerms=1,bigTerms=0,termInstances=18,uses=0}
 Jun 29, 2010 3:54:52 PM org.apache.solr.request.UnInvertedField uninvert
 INFO: UnInverted multi-valued field {field=749_contentAttributeToken,memSize=29604020,tindexSize=56,time=1847,phase1=1829,nTerms=143,bigTerms=0,termInstances=951,uses=0}
 Jun 29, 2010 3:54:59 PM org.apache.solr.common.SolrException log
 SEVERE: java.lang.OutOfMemoryError: Java heap space
        at org.apache.solr.request.UnInvertedField.uninvert(UnInvertedField.java:191)
        at org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:178)
        at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:839)
        at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:250)
        at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:283)
        at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:166)




-- 
Lance Norskog
goks...@gmail.com


tomcat solr logs

2010-06-30 Thread Robert Petersen
Sorry if this is at all off topic.  Our solr log files need grooming, and we 
would also like to analyze them, perhaps pulling various data points into a DB 
table.  Is there a preferred app for doing log file analysis, and/or an easy way 
to delete the old log files?
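
Not Solr-specific, but for the grooming half plain logrotate is one common 
choice. A minimal sketch, assuming Tomcat appends to /var/log/tomcat/catalina.out 
(the path and retention values are illustrative):

/var/log/tomcat/catalina.out {
    daily
    rotate 14
    compress
    missingok
    # copytruncate truncates in place, since Tomcat keeps the file handle open
    copytruncate
}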


RE: OOM on uninvert field request

2010-06-30 Thread Robert Petersen
Most of these hundreds of facet fields have tens of values, but a couple have 
thousands.  Is thousands of distinct values too many to do enum, or is that 
still OK?  If so I could apply it carte blanche to all the fields...
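
For reference, facet.method can be set globally or per field at request time, 
so enum could be applied selectively; a sketch using a field name from the logs 
in this thread (host and query are illustrative):

# enum for this one field only, leaving other facet fields on the default method
http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=749_contentAttributeToken&f.749_contentAttributeToken.facet.method=enum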

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Wednesday, June 30, 2010 1:38 PM
To: solr-user@lucene.apache.org
Subject: Re: OOM on uninvert field request

On Tue, Jun 29, 2010 at 7:32 PM, Robert Petersen rober...@buy.com wrote:
 Hello, I am trying to find the right max and min heap settings for Java 1.6 on a 20GB 
 index with 8 million docs, running the 1.6_018 JVM with solr 1.4.  I currently have 
 java set to an even 4GB (export JAVA_OPTS=-Xmx4096m -Xms4096m) for both min and max, 
 which is doing pretty well, but I am occasionally still getting the below OOM errors.  
 We're running on dual quad core xeons with 16GB memory installed.

 Is the memSize mentioned in the INFO for the uninvert in bytes?  I.e., does 
 memSize=29604020 mean 29MB?

Yes.

 We have a few hundred of these fields and they contain ints used as IDs, so I 
 guess they could eat all the memory once they are all uninverted after we apply 
 load and enough queries are performed.  Does the field type matter?  Would int 
 be better than string if these are lookup ids sparsely populated across the index?

No, using UnInvertedField faceting, the fieldType won't matter much at
all for the space it takes up.

The key here is that it looks like the number of unique terms in these
fields is low - you would probably do much better with
facet.method=enum (which iterates over terms rather than documents).

-Yonik
http://www.lucidimagination.com


OOM on uninvert field request

2010-06-29 Thread Robert Petersen
Hello, I am trying to find the right max and min heap settings for Java 1.6 on a 20GB 
index with 8 million docs, running the 1.6_018 JVM with solr 1.4.  I currently have 
java set to an even 4GB (export JAVA_OPTS=-Xmx4096m -Xms4096m) for both min and max, 
which is doing pretty well, but I am occasionally still getting the below OOM errors.  
We're running on dual quad core xeons with 16GB memory installed.

Is the memSize mentioned in the INFO for the uninvert in bytes?  I.e., does 
memSize=29604020 mean 29MB?  We have a few hundred of these fields and they contain 
ints used as IDs, so I guess they could eat all the memory once they are all 
uninverted after we apply load and enough queries are performed.  Does the field 
type matter?  Would int be better than string if these are lookup ids sparsely 
populated across the index?  BTW these are used for faceting and filtering only.

<dynamicField name="*_contentAttributeToken" type="string" indexed="true" multiValued="true" stored="true" required="false"/>

Jun 29, 2010 3:54:50 PM org.apache.solr.request.UnInvertedField uninvert
INFO: UnInverted multi-valued field {field=768_contentAttributeToken,memSize=29604014,tindexSize=50,time=1841,phase1=1824,nTerms=1,bigTerms=0,termInstances=18,uses=0}
Jun 29, 2010 3:54:52 PM org.apache.solr.request.UnInvertedField uninvert
INFO: UnInverted multi-valued field {field=749_contentAttributeToken,memSize=29604020,tindexSize=56,time=1847,phase1=1829,nTerms=143,bigTerms=0,termInstances=951,uses=0}
Jun 29, 2010 3:54:59 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.OutOfMemoryError: Java heap space
        at org.apache.solr.request.UnInvertedField.uninvert(UnInvertedField.java:191)
        at org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:178)
        at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:839)
        at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:250)
        at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:283)
        at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:166)


RE: 99.9% uptime requirement

2009-08-06 Thread Robert Petersen
Here is another idea: with solr multicore you can dynamically spin up
extra cores and bring them online.  I'm not sure how well this would
work for us, though, since we have hard-coded the names of the cores we
are hitting in our config files.

-Original Message-
From: Brian Klippel [mailto:br...@theport.com] 
Sent: Thursday, August 06, 2009 8:38 AM
To: solr-user@lucene.apache.org
Subject: RE: 99.9% uptime requirement

You could create a new working core, then call the swap command once it
is ready.  Then unload the swapped-out core and delete the appropriate
index folder at your convenience.
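
A sketch of that sequence using the CoreAdmin HTTP API (core names and the
instanceDir are illustrative):

# build and warm a fresh core alongside the live one
http://localhost:8983/solr/admin/cores?action=CREATE&name=core1-staging&instanceDir=core1-staging
# swap it with the live core once it is ready
http://localhost:8983/solr/admin/cores?action=SWAP&core=core1&other=core1-staging
# unload the now-stale core and delete its index folder at your convenience
http://localhost:8983/solr/admin/cores?action=UNLOAD&core=core1-staging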


-Original Message-
From: Robert Petersen [mailto:rober...@buy.com] 
Sent: Wednesday, August 05, 2009 6:41 PM
To: solr-user@lucene.apache.org
Subject: RE: 99.9% uptime requirement

Maintenance questions: in a two-slave, one-master setup where the two
slaves are behind load balancers, what happens if I have to restart solr?
If I have to restart solr, say for a schema update where I have added a
new field, what is the recommended procedure?

If I can guarantee that no commits or optimizes happen on the master during
the schema update, so that no new snapshots become available, can I safely
leave rsyncd enabled?  When I stop and start a slave server, should I
first pull it out of the load balancer's list, or will solr gracefully
release connections as it shuts down so no searches are lost?

What do you guys do to push out updates?

Thanks for any thoughts,
Robi


-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Tuesday, August 04, 2009 8:57 AM
To: solr-user@lucene.apache.org
Subject: Re: 99.9% uptime requirement

Right. You don't get to 99.9% by assuming that an 8 hour outage is OK.  
Design for continuous uptime, with plans for how long it takes to  
patch around a single point of failure. For example, if your load  
balancer is a single point of failure, make sure that you can redirect  
the front end servers to a single Solr server in much less than 8 hours.

Also, think about your SLA. Can the search index be more than 8 hours  
stale? How quickly do you need to be able to replace a failed indexing  
server? You might be able to run indexing locally on each search  
server if they are lightly loaded.

wunder

On Aug 4, 2009, at 7:11 AM, Norberto Meijome wrote:

 On Mon, 3 Aug 2009 13:15:44 -0700
 Robert Petersen rober...@buy.com wrote:

 Thanks all, I figured there would be more talk about daemontools if there
 were really a need.  I appreciate the input, and for starters we'll put two
 slaves behind a load balancer and grow it from there.


 Robert,
 not taking away from daemon tools, but daemon tools won't help you if your
 whole server goes down.

 don't put all your eggs in one basket: use several servers and a load
 balancer (hardware load balancers x 2, haproxy, etc.).
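
A minimal haproxy sketch along those lines (hostnames, port, and the
health-check path are illustrative):

frontend solr_front
    bind *:8983
    mode http
    default_backend solr_slaves

backend solr_slaves
    mode http
    balance roundrobin
    # mark a slave down if its ping handler stops answering
    option httpchk GET /solr/admin/ping
    server slave1 solr-slave1:8983 check
    server slave2 solr-slave2:8983 check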

 and sure, use daemon tools to keep your services running within each server...

 B
 _
 {Beto|Norberto|Numard} Meijome

 Why do you sit there looking like an envelope without any address on it?
  - Mark Twain

 I speak for myself, not my employer. Contents may be hot. Slippery when wet.
 Reading disclaimers makes you go blind. Writing them is worse. You have been warned.



