Re: TokenStream contract violation: close() call missing error in 4.9.0

2015-06-09 Thread Benson Margulies
What tokenizer are you using? I think, but I'm not entirely sure, that
this would require a bug in a tokenizer.
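
For context, the contract that check enforces: a consumer must call reset(),
then incrementToken() until it returns false, then end() and close(); the next
setReader() refuses to run if close() was skipped. A minimal sketch of a
well-behaved consumer against the Lucene 4.x API (field and text are
placeholders):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ConsumeLoop {
    static void consume(Analyzer analyzer, String field, String text) throws IOException {
        TokenStream ts = analyzer.tokenStream(field, text);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        try {
            ts.reset();                 // mandatory before the first incrementToken()
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();                   // records the final offset state
        } finally {
            ts.close();                 // skipping this is exactly the reported violation
        }
    }
}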


On Tue, Jun 9, 2015 at 10:21 AM, Ryan, Michael F. (LNG-DAY)
michael.r...@lexisnexis.com wrote:
 I'm using Solr 4.9.0. I'm trying to figure out what would cause an error like
 this to occur in a rare, non-deterministic manner:

 java.lang.IllegalStateException: TokenStream contract violation: close() call 
 missing
 at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:90)
 at 
 org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:307)
 at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:183)

 Are there any known bugs that would cause this, or unusual conditions? I'm 
 thinking crazy things like a corrupted index, or a hardware issue.

 I don't directly use TokenStream, so I'm wondering if there is something that 
 could indirectly cause this (i.e., me doing something wrong that causes 
 Lucene itself to not close the TokenStream).

 I can provide more details later. Right now I'm just grasping at straws, 
 hoping someone has encountered this.

 -Michael


Re: Korean script conversion

2015-03-30 Thread Benson Margulies
Why do you think that this is a good idea? Hanja are used for special
purposes; they are not trivially convertible to Hangul due to ambiguity, and
it's not at all clear that a typical search user wants to treat them as
equivalent.
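
For what it's worth, ICU's Transliterator accepts compound IDs, so a
pronunciation-based pipe can at least be prototyped in ICU4J before wiring
anything into solr.ICUTransformFilterFactory. A hedged sketch -- note that
Han-Latin produces Mandarin readings, not Korean Hanja readings, which is
exactly the ambiguity concern above:

import com.ibm.icu.text.Transliterator;

public class HanToHangulSketch {
    public static void main(String[] args) {
        // Compound transform: Han -> Latin reading, then Latin -> Hangul.
        // Lossy and pronunciation-based; verify the output before treating
        // the two scripts as equivalent for search.
        Transliterator t = Transliterator.getInstance("Han-Latin; Latin-Hangul");
        System.out.println(t.transliterate("韓國"));
    }
}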

On Sun, Mar 29, 2015 at 1:52 AM, Eyal Naamati 
eyal.naam...@exlibrisgroup.com wrote:

  Hi,



 We are starting to index records in Korean. Korean text can be written in
 two scripts: Han characters (Chinese) and Hangul characters (Korean).

 We are looking for some solr filter or another built in solr component
 that converts between Han and Hangul characters (transliteration).

 I know there is the ICUTransformFilterFactory that can convert between
 Japanese or Chinese scripts, for example:

 <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
 for Japanese script conversions.

 So far I couldn't find anything readymade for Korean scripts, but perhaps
 someone knows of one?



 Thanks!

 Eyal Naamati
 Alma Developer
 Tel: +972-2-6499313
 Mobile: +972-547915255
 eyal.naam...@exlibrisgroup.com
 www.exlibrisgroup.com





shards.qt in solrconfig.xml

2015-02-26 Thread Benson Margulies
A query I posted yesterday amounted to me forgetting that I have to
set shards.qt when I use a URL other than plain old '/select' with
SolrCloud. Is there any way to configure a query handler to automate
this, so that all queries addressed to '/RNI' get that added in?


Re: shards.qt in solrconfig.xml

2015-02-26 Thread Benson Margulies
I apparently am feeling dense; the following does not work.

  <requestHandler name="/RNI" class="solr.SearchHandler" default="false">
    <lst name="defaults">
      <str name="shards.qt">/RNI</str>
    </lst>
    <arr name="components">
      <str>name-indexing-query</str>
      <str>name-indexing-rescore</str>
      <str>facet</str>
      <str>mlt</str>
      <str>highlight</str>
      <str>stats</str>
      <str>debug</str>
    </arr>
  </requestHandler>


On Thu, Feb 26, 2015 at 11:33 AM, Jack Krupansky
jack.krupan...@gmail.com wrote:
 I was hoping that Benson was hinting at adding a shards.qt.auto=true
 parameter so that it would magically use the path from the incoming
 request - and that this would be the default, since that's what most people
 would expect.

 Or, maybe just add a commented-out custom handler that has the shards.qt
 parameter as suggested, to re-emphasize to people that if they want to use
 a custom handler in distributed mode, then they will most likely need this
 parameter.

 -- Jack Krupansky

 On Thu, Feb 26, 2015 at 11:28 AM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:

 Hello,

 Given

 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201301.mbox/%3c711daae5-c366-4349-b644-8e29e80e2...@gmail.com%3E
 you can add shards.qt into the handler defaults/invariants.

 On Thu, Feb 26, 2015 at 5:40 PM, Benson Margulies bimargul...@gmail.com
 wrote:

  A query I posted yesterday amounted to me forgetting that I have to
  set shards.qt when I use a URL other than plain old '/select' with
  SolrCloud. Is there any way to configure a query handler to automate
  this, so that all queries addressed to '/RNI' get that added in?
 



 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com



Re: 8 Shards of Cloud with 4.10.3.

2015-02-25 Thread Benson Margulies
On Wed, Feb 25, 2015 at 8:04 AM, Shawn Heisey apa...@elyograg.org wrote:
 On 2/25/2015 5:50 AM, Benson Margulies wrote:
 So, found the following line in the guide:

java -DzkRun -DnumShards=2
 -Dbootstrap_confdir=./solr/collection1/conf
 -Dcollection.configName=myconf -jar start.jar

 using a completely clean, new, solr_home.

 In my own bootstrap dir, I have my own solrconfig.xml and schema.xml,
 and I modified to have:

  -DnumShards=8 -DmaxShardsPerNode=8

 When I went to start loading data into this, I failed:

 Caused by: 
 org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
 No registered leader was found after waiting for 4000ms , collection:
 rni slice: shard4
 at 
 org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554)
 at 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
 at 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
 at 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
 at 
 org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:285)
 at 
 org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:271)
 at 
 com.basistech.rni.index.internal.SolrCloudEvaluationNameIndex.init(SolrCloudEvaluationNameIndex.java:53)

 with corresponding log traffic in the solr log.

 The cloud page in the Solr admin app shows the IP address in green.
 It's a bit hard to read in general, it's all squished up to the top.

 The way I would do it would be to start Solr *only* with the zkHost
 parameter.  If you're going to use embedded zookeeper, I guess you would
 use zkRun instead.

 Once I had Solr running in cloud mode, I would upload the config to
 zookeeper using zkcli, and create the collection using the Collections
 API, including things like numShards and maxShardsPerNode on that CREATE
 call, not as startup properties.  Then I would completely reindex my
 data into the new collection.  It's a whole lot cleaner than trying to
 convert non-cloud to cloud and split shards.

Shawn, I _am_ starting from clean. However, I didn't find a recipe for
what you suggest as a process, and (following Hoss' suggestion) I
found the recipe above with the bootstrap_confdir scheme.

I am mostly confused as to how I supply my solrconfig.xml and
schema.xml when I follow the process you are suggesting. I know I'm
verging on vampirism here, but if you could possibly find the time to
turn your paragraph into either a pointer to a recipe or the command
lines in a bit more detail, I'd be exceedingly grateful.

Thanks,
benson




 Thanks,
 Shawn



Re: 8 Shards of Cloud with 4.10.3.

2015-02-25 Thread Benson Margulies
So, found the following line in the guide:

   java -DzkRun -DnumShards=2
-Dbootstrap_confdir=./solr/collection1/conf
-Dcollection.configName=myconf -jar start.jar

using a completely clean, new, solr_home.

In my own bootstrap dir, I have my own solrconfig.xml and schema.xml,
and I modified to have:

 -DnumShards=8 -DmaxShardsPerNode=8

When I went to start loading data into this, I failed:

Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
No registered leader was found after waiting for 4000ms , collection:
rni slice: shard4
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at 
org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:285)
at 
org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:271)
at 
com.basistech.rni.index.internal.SolrCloudEvaluationNameIndex.init(SolrCloudEvaluationNameIndex.java:53)

with corresponding log traffic in the solr log.

The cloud page in the Solr admin app shows the IP address in green.
It's a bit hard to read in general, it's all squished up to the top.




On Tue, Feb 24, 2015 at 4:33 PM, Benson Margulies bimargul...@gmail.com wrote:
 On Tue, Feb 24, 2015 at 4:27 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:

 : Unfortunately, this is all 5.1 and instructs me to run the 'start from
 : scratch' process.

 a) check out the left nav of any ref guide webpage, which has a link to
 Older Versions of this Guide (PDF)

 b) i'm not entirely sure i understand what you're asking, but i'm guessing
 you mean...

 * you have a fully functional individual instance of Solr, with a single
 core
 * you only want to run that one single instance of the Solr process
 * you want that single Solr process to be a SolrCloud of one node, but
 replace your single core with a collection that is divided into 8
 shards.
 * presumably: you don't care about replication since you are only trying
 to run one node.

 what you want to look into (in the 4.10 ref guide) is how to bootstrap a
 SolrCloud instance from a non-SolrCloud node -- ie: start up zk, tell solr
 to take the configs from your single core and upload them to zk as a
 configset, and register that single core as a collection.

 That should give you a single instance of solrcloud, with a single
 collection, consisting of one shard (your original core)

 Then you should be able to use the SPLITSHARD command to split your
 single shard into 2 shards, and then split them again, etc... (i don't
 think you can split directly to 8 sub-shards with a single command)



 FWIW: unless you no longer have access to the original data, it would
 almost certainly be a lot easier to just start with a clean install of
 Solr in cloud mode, then create a collection with 8 shards, then re-index
 your data.

 OK, now I'm good to go. Thanks.




 -Hoss
 http://www.lucidworks.com/


Re: 8 Shards of Cloud with 4.10.3.

2015-02-25 Thread Benson Margulies
A little more data. Note that the cloud status shows the black bubble
for a leader. See http://i.imgur.com/k2MhGPM.png.

org.apache.solr.common.SolrException: No registered leader was found
after waiting for 4000ms , collection: rni slice: shard4
at 
org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:568)
at 
org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:551)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.doDeleteByQuery(DistributedUpdateProcessor.java:1358)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:1226)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processDelete(UpdateRequestProcessor.java:55)
at 
org.apache.solr.update.processor.LogUpdateProcessor.processDelete(LogUpdateProcessorFactory.java:121)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processDelete(UpdateRequestProcessor.java:55)


On Wed, Feb 25, 2015 at 9:44 AM, Benson Margulies bimargul...@gmail.com wrote:
 On Wed, Feb 25, 2015 at 8:04 AM, Shawn Heisey apa...@elyograg.org wrote:
 On 2/25/2015 5:50 AM, Benson Margulies wrote:
 So, found the following line in the guide:

java -DzkRun -DnumShards=2
 -Dbootstrap_confdir=./solr/collection1/conf
 -Dcollection.configName=myconf -jar start.jar

 using a completely clean, new, solr_home.

 In my own bootstrap dir, I have my own solrconfig.xml and schema.xml,
 and I modified to have:

  -DnumShards=8 -DmaxShardsPerNode=8

 When I went to start loading data into this, I failed:

 Caused by: 
 org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
 No registered leader was found after waiting for 4000ms , collection:
 rni slice: shard4
 at 
 org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554)
 at 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
 at 
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
 at 
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
 at 
 org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:285)
 at 
 org.apache.solr.client.solrj.SolrServer.deleteByQuery(SolrServer.java:271)
 at 
 com.basistech.rni.index.internal.SolrCloudEvaluationNameIndex.init(SolrCloudEvaluationNameIndex.java:53)

 with corresponding log traffic in the solr log.

 The cloud page in the Solr admin app shows the IP address in green.
 It's a bit hard to read in general, it's all squished up to the top.

 The way I would do it would be to start Solr *only* with the zkHost
 parameter.  If you're going to use embedded zookeeper, I guess you would
 use zkRun instead.

 Once I had Solr running in cloud mode, I would upload the config to
 zookeeper using zkcli, and create the collection using the Collections
 API, including things like numShards and maxShardsPerNode on that CREATE
 call, not as startup properties.  Then I would completely reindex my
 data into the new collection.  It's a whole lot cleaner than trying to
 convert non-cloud to cloud and split shards.

 Shawn, I _am_ starting from clean. However, I didn't find a recipe for
 what you suggest as a process, and (following Hoss' suggestion) I
 found the recipe above with the bootstrap_confdir scheme.

 I am mostly confused as to how I supply my solrconfig.xml and
 schema.xml when I follow the process you are suggesting. I know I'm
 verging on vampirism here, but if you could possibly find the time to
 turn your paragraph into either a pointer to a recipe or the command
 lines in a bit more detail, I'd be exceedingly grateful.

 Thanks,
 benson




 Thanks,
 Shawn



Re: 8 Shards of Cloud with 4.10.3.

2015-02-25 Thread Benson Margulies
It's the zkcli options that are on my mind. zkcli's usage shows me 'bootstrap',
'upconfig', and uploading a solr.xml.

When I use upconfig, it might work, but it sure is noisy:

benson@ip-10-111-1-103:/data/solr+rni$ 554331
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:9983] WARN
org.apache.zookeeper.server.NIOServerCnxn  – caught end of stream
exception
EndOfStreamException: Unable to read additional data from client
sessionid 0x14bc16c5e660003, likely client has closed socket
at 
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
at 
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)

On Wed, Feb 25, 2015 at 10:52 AM, Shawn Heisey apa...@elyograg.org wrote:
 On 2/25/2015 8:35 AM, Benson Margulies wrote:
 Do I need a zkcli bootstrap or do I start with upconfig? What port does
 zkRun put zookeeper on?

 I personally would not use bootstrap options.  They are only meant to be
 used once, when converting from non-cloud, but many people who use them
 do NOT use them only once -- they include them in their startup scripts
 and use them on every startup.  The whole thing becomes extremely
 confusing.  I would just use zkcli and the Collections API, so nothing
 ever happens that you don't explicitly request.

 I believe that the port for embedded zookeeper (zkRun) is the jetty
 listen port plus 1000, so 9983 if jetty.port is 8983 or not set.

 Thanks,
 Shawn



Re: 8 Shards of Cloud with 4.10.3.

2015-02-25 Thread Benson Margulies
Do I need a zkcli bootstrap or do I start with upconfig? What port does
zkRun put zookeeper on?
On Feb 25, 2015 10:15 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 2/25/2015 7:44 AM, Benson Margulies wrote:
  Shawn, I _am_ starting from clean. However, I didn't find a recipe for
  what you suggest as a process, and (following Hoss' suggestion) I
  found the recipe above with the bootstrap_confdir scheme.
 
  I am mostly confused as to how I supply my solrconfig.xml and
  schema.xml when I follow the process you are suggesting. I know I'm
  verging on vampirism here, but if you could possibly find the time to
  turn your paragraph into either a pointer to a recipe or the command
  lines in a bit more detail, I'd be exceedingly grateful.

 I'm willing to help in any way that I can.

 Normally in the conf directory for a non-cloud core you have
 solrconfig.xml and schema.xml, plus any other configs referenced by
 those files, like synonyms.txt, dih-config.xml, etc.  In cloud terms,
 the directory containing these files is a confdir.  It's best to keep
 the on-disk copy of your configs completely outside of the solr home so
 there's no confusion about what configurations are active.  On-disk
 cores for solrcloud do not need or use a conf directory.

 The cloud-scripts/zkcli.sh (or zkcli.bat) script has an upconfig
 command with -confdir and -confname options.

 When doing upconfig, the zkHost value goes on the -z option to zkcli,
 and you only need to list one of your zookeeper hosts, although it is
 perfectly happy if you list them all.  You would point -confdir at a
 directory containing the config files mentioned earlier, and -confname
 is the name that the config has in zookeeper, which you would then use
 on the collection.configName parameter for the Collections API call.
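
 A concrete invocation matching that description (host, paths, and names
 are examples):

 cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig \
     -confdir /path/to/myconf -confname testcfg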
 Once the config is uploaded, here's an example call to that API for
 creating a collection:

 http://server:port
 /solr/admin/collections?action=CREATE&name=test&numShards=8&replicationFactor=1&collection.configName=testcfg&maxShardsPerNode=8

 If this is not enough detail, please let me know which part you need
 help with.

 Thanks,
 Shawn




Re: 8 Shards of Cloud with 4.10.3.

2015-02-25 Thread Benson Margulies
Bingo!

Here's the recipe for the record:

 gcopts holds the pile of GC options.

First, set up shop:

DIR=$PWD
cd ../solr-4.10.3/example
java -Xmx200g $gcopts -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks \
-Djetty.port=8983 -Dsolr.solr.home=/data/solr+rni/cloud_solr_home \
-Dsolr.install.dir=/data/solr-4.10.3 -Duser.timezone=UTC \
-Djava.net.preferIPv4Stack=true -DzkRun -jar start.jar &

and then:

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=rni&numShards=8&replicationFactor=1&collection.configName=rni&maxShardsPerNode=8'



On Wed, Feb 25, 2015 at 11:03 AM, Benson Margulies
bimargul...@gmail.com wrote:
 It's the zkcli options that are on my mind. zkcli's usage shows me 'bootstrap',
 'upconfig', and uploading a solr.xml.

 When I use upconfig, it might work, but it sure is noisy:

 benson@ip-10-111-1-103:/data/solr+rni$ 554331
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:9983] WARN
 org.apache.zookeeper.server.NIOServerCnxn  – caught end of stream
 exception
 EndOfStreamException: Unable to read additional data from client
 sessionid 0x14bc16c5e660003, likely client has closed socket
 at 
 org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
 at 
 org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
 at java.lang.Thread.run(Thread.java:745)

 On Wed, Feb 25, 2015 at 10:52 AM, Shawn Heisey apa...@elyograg.org wrote:
 On 2/25/2015 8:35 AM, Benson Margulies wrote:
 Do I need a zkcli bootstrap or do I start with upconfig? What port does
 zkRun put zookeeper on?

 I personally would not use bootstrap options.  They are only meant to be
 used once, when converting from non-cloud, but many people who use them
 do NOT use them only once -- they include them in their startup scripts
 and use them on every startup.  The whole thing becomes extremely
 confusing.  I would just use zkcli and the Collections API, so nothing
 ever happens that you don't explicitly request.

 I believe that the port for embedded zookeeper (zkRun) is the jetty
 listen port plus 1000, so 9983 if jetty.port is 8983 or not set.

 Thanks,
 Shawn



Customized search handler components and cloud

2015-02-25 Thread Benson Margulies
We have a pair of customized search components which we used
successfully with SolrCloud some releases back (4.x). In 4.10.3, I am
trying to find the point of departure in debugging why we get no
results back when querying to them with a sharded index.

If I query the regular /select, all is swell. Obviously, there's a
debugger in my future, but I wonder if this rings any bells for
anyone.


Here's what we add to solrconfig.xml.

  <searchComponent name="name-indexing-query"
    class="com.basistech.rni.solr.NameIndexingQueryComponent"/>
  <searchComponent name="name-indexing-rescore"
    class="com.basistech.rni.solr.NameIndexingRescoreComponent"/>

  <requestHandler name="/RNI" class="solr.SearchHandler" default="false">
    <arr name="first-components">
      <str>name-indexing-query</str>
      <str>name-indexing-rescore</str>
    </arr>
  </requestHandler>
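
The shards.qt thread above records how this one was eventually diagnosed:
distributed sub-requests go to /select unless the handler sets a shards.qt
default, so a sharded /RNI query never reaches the custom components. The
likely fix, per that thread:

  <requestHandler name="/RNI" class="solr.SearchHandler" default="false">
    <lst name="defaults">
      <str name="shards.qt">/RNI</str>
    </lst>
    <arr name="first-components">
      <str>name-indexing-query</str>
      <str>name-indexing-rescore</str>
    </arr>
  </requestHandler>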


Re: 8 Shards of Cloud with 4.10.3.

2015-02-24 Thread Benson Margulies
On Tue, Feb 24, 2015 at 4:27 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : Unfortunately, this is all 5.1 and instructs me to run the 'start from
 : scratch' process.

 a) check out the left nav of any ref guide webpage, which has a link to
 Older Versions of this Guide (PDF)

 b) i'm not entirely sure i understand what you're asking, but i'm guessing
 you mean...

 * you have a fully functional individual instance of Solr, with a single
 core
 * you only want to run that one single instance of the Solr process
 * you want that single Solr process to be a SolrCloud of one node, but
 replace your single core with a collection that is divided into 8
 shards.
 * presumably: you don't care about replication since you are only trying
 to run one node.

 what you want to look into (in the 4.10 ref guide) is how to bootstrap a
 SolrCloud instance from a non-SolrCloud node -- ie: start up zk, tell solr
 to take the configs from your single core and upload them to zk as a
 configset, and register that single core as a collection.

 That should give you a single instance of solrcloud, with a single
 collection, consisting of one shard (your original core)

 Then you should be able to use the SPLITSHARD command to split your
 single shard into 2 shards, and then split them again, etc... (i don't
 think you can split directly to 8 sub-shards with a single command)



 FWIW: unless you no longer have access to the original data, it would
 almost certainly be a lot easier to just start with a clean install of
 Solr in cloud mode, then create a collection with 8 shards, then re-index
 your data.

OK, now I'm good to go. Thanks.




 -Hoss
 http://www.lucidworks.com/


Re: 8 Shards of Cloud with 4.10.3.

2015-02-24 Thread Benson Margulies
On Tue, Feb 24, 2015 at 1:30 PM, Michael Della Bitta
michael.della.bi...@appinions.com wrote:
 Benson:

 Are you trying to run independent invocations of Solr for every node?
 Otherwise, you'd just want to create a 8 shard collection with
 maxShardsPerNode set to 8 (or more I guess).

Michael Della Bitta,

I don't want to run multiple invocations. I just want to exploit
hardware cores with shards. Can you point me at doc for the process
you are referencing here? I confess to some ongoing confusion between
cores and collections.

--benson



 Michael Della Bitta

 Senior Software Engineer

 o: +1 646 532 3062

 appinions inc.

 18 East 41st Street

 New York, NY 10017

 w: appinions.com

 On Tue, Feb 24, 2015 at 1:27 PM, Benson Margulies bimargul...@gmail.com
 wrote:

 With so much of the site shifted to 5.0, I'm having a bit of trouble
 finding what I need, and so I'm hoping that someone can give me a push
 in the right direction.

 On a big multi-core machine, I want to set up a configuration with 8
 (or perhaps more) nodes treated as shards. I have some very particular
 solrconfig.xml and schema.xml that I need to use.

 Could some kind person point me at a relatively step-by-step layout?
 This is all on Linux, I'm happy to explicitly run Zookeeper.



Re: 8 Shards of Cloud with 4.10.3.

2015-02-24 Thread Benson Margulies
On Tue, Feb 24, 2015 at 3:32 PM, Michael Della Bitta
michael.della.bi...@appinions.com wrote:
 https://cwiki.apache.org/confluence/display/solr/SolrCloud

Unfortunately, this is all 5.1 and instructs me to run the 'start from
scratch' process.

I wish that I could take my existing one-core no-cloud config and
convert it into a cloud, 8-shard config.


8 Shards of Cloud with 4.10.3.

2015-02-24 Thread Benson Margulies
With so much of the site shifted to 5.0, I'm having a bit of trouble
finding what I need, and so I'm hoping that someone can give me a push
in the right direction.

On a big multi-core machine, I want to set up a configuration with 8
(or perhaps more) nodes treated as shards. I have some very particular
solrconfig.xml and schema.xml that I need to use.

Could some kind person point me at a relatively step-by-step layout?
This is all on Linux, I'm happy to explicitly run Zookeeper.


Having a spot of trouble setting up /browse

2015-02-16 Thread Benson Margulies
So, I had set up a solr core modelled on the 'multicore' example in 4.10.3,
which has no /browse.

Upon request, I went to set up /browse.

I copied in a minimal version. When I go there, I just get some XML back:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params"/>
  </lst>
  <result name="response" numFound="0" start="0" maxScore="0.0"/>
</response>

What else does /browse depend upon?


codec factory versus posting format versus documentation

2015-02-10 Thread Benson Margulies
I think perhaps there is a minor doc drought, or perhaps I'm just
having an SEO bad hair day.

I'm trying to understand the relationship of codecFactory and postingFormat.

Experiment 1: I just want to use my own codec. So, I make a
CodecFactory, declare it in solrconfig.xml, and stand back? If so, why
does codecFactory take a name attribute?

Experiment 2: I want something per field. I can have a postingsFormat
per field by using the SchemaCodecFactory and then naming ... some
class in postingsFormat=. A postings format class? A Codec class?

I will improve documentation when I have this all straight.
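
For reference, my current understanding of the experiment-2 wiring in 4.x --
the postingsFormat attribute names a PostingsFormat (the "Memory" value below
is just an example), not a Codec, and SchemaCodecFactory is what makes Solr
honor it:

  <!-- solrconfig.xml -->
  <codecFactory class="solr.SchemaCodecFactory"/>

  <!-- schema.xml -->
  <fieldType name="string_mem" class="solr.StrField" postingsFormat="Memory"/>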


Re: Complaint of multiple /updates but solrconfig.xml has one

2015-02-09 Thread Benson Margulies
OK, I see, I forgot to include the core name in the URL.
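
That is, pointing the post tool at the core (core name below is an example):

java -Durl=http://localhost:8983/solr/mycore/update -jar post.jar docs.xml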

On Mon, Feb 9, 2015 at 8:27 PM, Benson Margulies ben...@basistech.com wrote:
 I see https://issues.apache.org/jira/browse/SOLR-6302 but I don't see
 what I am supposed to do about it.

 On Mon, Feb 9, 2015 at 8:19 PM, Benson Margulies ben...@basistech.com wrote:
 4.10.3: Customized solrconfig.xml.

 My log shows:

 2/9/2015, 8:14:44 PM  WARN  RequestHandlers  Multiple requestHandler
 registered to the same name: /update ignoring:
 org.apache.solr.handler.UpdateRequestHandler

 But there is only one:

   <requestHandler name="/update" class="solr.UpdateRequestHandler">
     <lst name="defaults">
       <str name="update.chain">RNI</str>
     </lst>
   </requestHandler>

 And all attempts to post with the simple post tool yield:

 SimplePostTool: WARNING: IOException while reading response:
 java.io.FileNotFoundException: http://localhost:8983/solr/update
 1 files indexed.
 COMMITting Solr index changes to http://localhost:8983/solr/update..
 SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for
 url: http://localhost:8983/solr/update?commit=true

 The admin UI is alive and kicking. When I look at the solrconfig.xml
 file from there, I see only one handler on /update.


Re: Complaint of multiple /updates but solrconfig.xml has one

2015-02-09 Thread Benson Margulies
I see https://issues.apache.org/jira/browse/SOLR-6302 but I don't see
what I am supposed to do about it.

On Mon, Feb 9, 2015 at 8:19 PM, Benson Margulies ben...@basistech.com wrote:
 4.10.3: Customized solrconfig.xml.

 My log shows:

 2/9/2015, 8:14:44 PM  WARN  RequestHandlers  Multiple requestHandler
 registered to the same name: /update ignoring:
 org.apache.solr.handler.UpdateRequestHandler

 But there is only one:

   <requestHandler name="/update" class="solr.UpdateRequestHandler">
     <lst name="defaults">
       <str name="update.chain">RNI</str>
     </lst>
   </requestHandler>

 And all attempts to post with the simple post tool yield:

 SimplePostTool: WARNING: IOException while reading response:
 java.io.FileNotFoundException: http://localhost:8983/solr/update
 1 files indexed.
 COMMITting Solr index changes to http://localhost:8983/solr/update..
 SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for
 url: http://localhost:8983/solr/update?commit=true

 The admin UI is alive and kicking. When I look at the solrconfig.xml
 file from there, I see only one handler on /update.


Complaint of multiple /updates but solrconfig.xml has one

2015-02-09 Thread Benson Margulies
4.10.3: Customized solrconfig.xml.

My log shows:

2/9/2015, 8:14:44 PM  WARN  RequestHandlers  Multiple requestHandler
registered to the same name: /update ignoring:
org.apache.solr.handler.UpdateRequestHandler

But there is only one:

  <requestHandler name="/update" class="solr.UpdateRequestHandler">
    <lst name="defaults">
      <str name="update.chain">RNI</str>
    </lst>
  </requestHandler>

And all attempts to post with the simple post tool yield:

SimplePostTool: WARNING: IOException while reading response:
java.io.FileNotFoundException: http://localhost:8983/solr/update
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/update..
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for
url: http://localhost:8983/solr/update?commit=true

The admin UI is alive and kicking. When I look at the solrconfig.xml
file from there, I see only one handler on /update.


log location when using bin/start

2015-02-09 Thread Benson Margulies
Running bin/solr start with a command like:

/data/solr-4.10.3/bin/solr start -s $PWD/solr_home -a
-Djava.library.path=$libdir -Dbt.root=$bt_root\
 $@

I note that the logs are ending up in the solr install
dir/examples/logs. Can I move them?
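
If memory serves (worth verifying against the shipped file), the stock 4.10
log4j.properties builds the file appender path from a solr.log property
defined at the top of that file, so editing it relocates the logs:

# example/resources/log4j.properties (stock 4.10 layout assumed)
solr.log=/data/solr-logs
log4j.appender.file.File=${solr.log}/solr.log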


Re: Is there any sentence tokenizers in sold 4.9.0?

2014-09-12 Thread Benson Margulies
Basis Technology's toolset includes sentence boundary detectors. Please
contact me for more details.

On Fri, Sep 12, 2014 at 1:15 AM, Sandeep B A belgavi.sand...@gmail.com
wrote:

 Hi All,
 Sorry for the delayed response.
 I was out of office for last few days and was not able to reply.
 Thanks for the information.

 We have a use case where one sentence is the unit token on which we need
 to do normalization and semantic analysis.

 We need to finalize the type of normalizer and analyzer, but I was trying
 to see whether Solr has any built-in libraries, so that no cross-language
 integration would be required.

 Again, I will get back on whether or not it works.

 @susheel,
 Thanks will try to see if that works.

 Thanks,
 Sandeep.
 On Sep 8, 2014 12:54 PM, Sandeep B A belgavi.sand...@gmail.com wrote:

  Hi Susheel ,
  Thanks for the information.
  I have crawled a few websites, and all I need is a sentence tokenizer for
  the data I have collected.
  These websites are English only.
 
  Well, I don't have experience in writing custom sentence tokenizers for
  Solr. Is there any tutorial link which tells how to do it?
 
  Is it possible to integrate NLTK with Solr? If yes, how? Because I
  found a sentence tokenizer for English in NLTK.
 
  Thanks,
  Sandeep
  On Sep 5, 2014 8:10 PM, Sandeep B A belgavi.sand...@gmail.com wrote:
 
  Sorry for typo it is solr 4.9.0 instead of sold 4.9.0
   On Sep 5, 2014 7:48 PM, Sandeep B A belgavi.sand...@gmail.com
 wrote:
 
  Hi,
 
  I was looking for a default sentence tokenizer option in Solr
  but could not find one. Has anyone used or integrated a tokenizer from
  another language (Python, for example) with Solr? Please let me know.
 
 
  Thanks and regards,
  Sandeep
 
 



Re: Business Name spell check

2014-08-31 Thread Benson Margulies
Trying to shoehorn business name resolution or correction purely into
Solr tokenization and spell checking is not, in my opinion, a viable
approach. It seems to me that you need a query parser that does
something very different from pure tokenization, and you might also
need a more complex approach to matching names. Full disclosure: I
work for a company that builds one of those. You could talk to us, or
you could at least look at the problem from the point of view of our
approach: take the business names, index them in some way that allows
for fuzzy matching (which is _not_ just treating them as ordinary
tokenized text), then take the queries, and map them to fuzzy
matching. The whole business is comparable to the geo support in Solr:
a special data type that is treated with domain-specific techniques.


Re: Solr Japanese support

2014-03-16 Thread Benson Margulies
Your problem has nothing to do with Japanese. Perhaps a content-type
for CSV would work better?
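
Concretely, that suggestion amounts to something like the following (whether
the CSV handler prefers text/csv or application/csv is worth checking):

curl 'http://localhost:8983/solr/collection1/update/csv?separator=,&commit=true' \
  -H 'Content-Type: text/csv; charset=utf-8' --data-binary @insert.csv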

On Sat, Mar 15, 2014 at 12:50 PM, Bala Iyer grb...@yahoo.com wrote:
 Hi,

 I am new to Solr Japanese support.
 I added support for Japanese in schema.xml.
 How can I insert Japanese text into that field, either via a Solr client (Java /
 PHP / Ruby) or via curl?


 schema.xml
 
 <field name="username" type="string" indexed="true" stored="true"
   multiValued="true" omitNorms="true" termVectors="true"/>
 <field name="timestamp" type="date" indexed="true" stored="true"
   multiValued="true" omitNorms="true" termVectors="true"/>
 <field name="jtxt" type="text_ja" indexed="true" stored="true"
   multiValued="true" omitNorms="true" termVectors="true"/>

 <fieldType name="text_ja" class="solr.TextField"
   positionIncrementGap="100" autoGeneratePhraseQueries="false">
   <analyzer>
     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
     <!--<tokenizer class="solr.JapaneseTokenizerFactory" mode="search"
       userDictionary="lang/userdict_ja.txt"/>-->
     <!-- Reduces inflected verbs and adjectives to their base/dictionary
       forms (辞書形) -->
     <filter class="solr.JapaneseBaseFormFilterFactory"/>
     <!-- Removes tokens with certain part-of-speech tags -->
     <filter class="solr.JapanesePartOfSpeechStopFilterFactory"
       tags="lang/stoptags_ja.txt"/>
     <!-- Normalizes full-width romaji to half-width and half-width kana
       to full-width (Unicode NFKC subset) -->
     <filter class="solr.CJKWidthFilterFactory"/>
     <!-- Removes common tokens typically not useful for search, but which
       have a negative effect on ranking -->
     <filter class="solr.StopFilterFactory" ignoreCase="true"
       words="lang/stopwords_ja.txt"/>
     <!-- Normalizes common katakana spelling variations by removing any
       last long sound character (U+30FC) -->
     <filter class="solr.JapaneseKatakanaStemFilterFactory"
       minimumLength="4"/>
     <!-- Lower-cases romaji characters -->
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>
 

 my insert.csv file

 id,username,timestamp,content,jtxt
 9,x,2013-12-26T10:14:26Z,Hello ,マイ ドキュメント
 =
 I am trying to insert through curl and it gives me an error:
 curl 'http://localhost:8983/solr/collection1/update/csv?separator=,&commit=true' \
   -H 'Content-Type: text/plain; charset=utf-8' --data-binary @insert.csv


 ERROR
 
 <?xml version="1.0" encoding="UTF-8"?>
 <response>
 <lst name="responseHeader"><int name="status">400</int><int
 name="QTime">23</int></lst>
 <lst name="error"><str name="msg">Document is missing mandatory uniqueKey
 field: id</str><int name="code">400</int></lst>
 </response>

 I know I should not use Content-Type text/plain.
 =


 Thanks


Mixing lucene scoring and other scoring

2014-03-06 Thread Benson Margulies
Some months ago, I talked to some people at LR about this, but I can't
find my notes.

Imagine a function of some fields that produces a score between 0 and 1.

Imagine that you want to combine this score with relevance over some
more or less complex ordinary query.

What are the options, given the arbitrary nature of Lucene scores?
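
One standard option, for reference: store the 0-1 score in a field and apply
it as a multiplicative boost, e.g. via edismax's boost parameter (the field
name below is a placeholder). Multiplying sidesteps the arbitrary scale of
the Lucene score in a way that adding to it cannot:

curl 'http://localhost:8983/solr/select?defType=edismax&q=complex+query&boost=my_score_f'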


A bit lost in the land of schemaless Solr

2014-02-08 Thread Benson Margulies
Say that I have 10 fieldTypes for 10 languages. Is there a way to associate
a naming convention from field names to field types so that I can avoid
bothering with all those dynamic fields?
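
For contrast, the mechanism being avoided -- one dynamicField rule per
language suffix -- looks like this (type names are examples):

  <dynamicField name="*_txt_en" type="text_en" indexed="true" stored="true"/>
  <dynamicField name="*_txt_ja" type="text_ja" indexed="true" stored="true"/>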


(lack) of error for missing library?

2014-02-08 Thread Benson Margulies
  <!-- an exact 'path' can be used instead of a 'dir' to specify a
       specific jar file.  This will cause a serious error to be logged
       if it can't be loaded.
  -->

is the comment, but when I put a completely missing path in there -- no
error. Should I file a JIRA?
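
For reference, the construct in question, with a deliberately bogus path:

  <lib path="/completely/missing/path.jar"/>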


Re: Multi Lingual Analyzer

2014-01-20 Thread Benson Margulies
MT is not nearly good enough to allow approach 1 to work.

On Mon, Jan 20, 2014 at 9:25 AM, Erick Erickson erickerick...@gmail.com wrote:
 It Depends (tm). Approach (2) will give you better, more specific
 search results. (1) is simpler to implement and might be good
 enough...



 On Mon, Jan 20, 2014 at 5:21 AM, David Philip
 davidphilipshe...@gmail.com wrote:
 Hi,



   I have a query on Multi-Lingual Analyser.


  Which one of the  below is the best approach?


  1. To develop a translator that translates a/any language to
  English and then use the standard English analyzer, using the translator
  both at index time and at search time?

  2. To develop a language-specific analyzer and use that by
  creating a specific field only for that language?

  We have client data coming in different languages, Kannada and Telugu and
  others later. This data is basically the text written by the customer in that
  language.


  The requirement is to develop analyzers particular to these languages.



 Thanks - David


Re: Tracking down the input that hits an analysis chain bug

2014-01-16 Thread Benson Margulies
I think that https://issues.apache.org/jira/browse/SOLR-5623 should be
ready to go. Would someone please commit from the PR? If there's a
preference, I can attach a patch as well.

On Fri, Jan 10, 2014 at 1:37 PM, Benson Margulies bimargul...@gmail.com wrote:
 Thanks, that's the recipe that I need.

 On Fri, Jan 10, 2014 at 11:40 AM, Chris Hostetter
 hossman_luc...@fucit.org wrote:

 : Is there a neighborhood of existing tests I should be visiting here?

  You'll need a custom schema that refers to your new
  MockFailOnCertainTokensFilterFactory, so i would create a completely new
  test class somewhere in ...solr.update (you're testing that an update
  fails with a clean error)


 -Hoss
 http://www.lucidworks.com/


Analyzers versus Tokenizers/TokenFilters

2014-01-15 Thread Benson Margulies
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters never
mentions an Analyzer class.

http://wiki.apache.org/solr/SolrPlugins talks about subclasses of
SolrAnalyzer as ways of delivering an entire analysis chain and still
'minding the gap'.

Anyone care to offer a comparison of the viewpoints?


Re: Analyzers versus Tokenizers/TokenFilters

2014-01-15 Thread Benson Margulies
Ahmet,

So, this is an interesting difference between Lucene (and ES) and
Solr. In Lucene, the idea seems to be that you package up a reusable
analysis chain as an analyzer. Saying 'use analyzer X' is less complex
than saying 'use tokenizer T and filters F1, F2, ...'.
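
Concretely, the Lucene 4.x packaging looks like this -- a minimal sketch of
an Analyzer bundling a tokenizer with one filter:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public final class PackagedAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // The reusable chain: tokenizer T plus filters F1, F2, ... behind one name.
        Tokenizer source = new StandardTokenizer(Version.LUCENE_43, reader);
        return new TokenStreamComponents(source,
                new LowerCaseFilter(Version.LUCENE_43, source));
    }
}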

thanks,
benson


On Wed, Jan 15, 2014 at 5:09 PM, Ahmet Arslan iori...@yahoo.com wrote:
 Hi Benson,

 Using a Lucene analyzer in schema.xml should be a last resort, for very specific
 reasons: if you have an existing analyzer, etc.

 Ahmet


 On Wednesday, January 15, 2014 11:52 PM, Benson Margulies 
 ben...@basistech.com wrote:
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters never
 mentions an Analyzer class.

 http://wiki.apache.org/solr/SolrPlugins talks about subclasses of
 SolrAnalyzer as ways of delivering an entire analysis chain and still
 'minding the gap'.

 Anyone care to offer a comparison of the viewpoints?



Re: Tracking down the input that hits an analysis chain bug

2014-01-10 Thread Benson Margulies
OK, patch forthcoming.

On Fri, Jan 10, 2014 at 11:23 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : The problem manifests as this sort of thing:
 :
 : Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
 : SEVERE: java.lang.IllegalArgumentException: startOffset must be
 : non-negative, and endOffset must be >= startOffset,
 : startOffset=-1811581632,endOffset=-1811581632

 Is there a stack trace in the log to go along with that?  there should be.

 My suspicion is that since analysis errors like these are
 RuntimeExceptions, they may not be getting caught & re-thrown with as much
 context as they should -- so by the time they get logged (or returned to
 the client) there isn't any info about the problematic field value, let
 alone the uniqueKey.

 If we had a test case that reproduces (ie: with a mock tokenfilter that
 always throws a RuntimeException when a token matches "fail_now" or
 something) we could have some tests that assert indexing a doc with that
 token results in a useful error -- which should help ensure that useful
 error also gets logged (although i don't think we really have any
 easy way of asserting specific log messages at the moment)
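
 Such a mock filter is tiny -- a sketch, with the fail_now trigger token as
 suggested:

 import java.io.IOException;
 import org.apache.lucene.analysis.TokenFilter;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

 public final class MockFailOnCertainTokensFilter extends TokenFilter {
     private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

     public MockFailOnCertainTokensFilter(TokenStream input) {
         super(input);
     }

     @Override
     public boolean incrementToken() throws IOException {
         if (!input.incrementToken()) {
             return false;
         }
         if ("fail_now".contentEquals(termAtt)) {
             throw new RuntimeException("simulated analysis-chain failure");
         }
         return true;
     }
 }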


 -Hoss
 http://www.lucidworks.com/


Re: Tracking down the input that hits an analysis chain bug

2014-01-10 Thread Benson Margulies
Is there a neighborhood of existing tests I should be visiting here?


On Fri, Jan 10, 2014 at 11:27 AM, Benson Margulies
bimargul...@gmail.com wrote:
 OK, patch forthcoming.

 On Fri, Jan 10, 2014 at 11:23 AM, Chris Hostetter
 hossman_luc...@fucit.org wrote:

 : The problem manifests as this sort of thing:
 :
 : Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
 : SEVERE: java.lang.IllegalArgumentException: startOffset must be
 : non-negative, and endOffset must be >= startOffset,
 : startOffset=-1811581632,endOffset=-1811581632

 Is there a stack trace in the log to go along with that?  there should be.

 My suspicion is that since analysis errors like these are
 RuntimeExceptions, they may not be getting caught & re-thrown with as much
 context as they should -- so by the time they get logged (or returned to
 the client) there isn't any info about the problematic field value, let
 alone the uniqueKey.

 If we had a test case that reproduces (ie: with a mock tokenfilter that
 always throws a RuntimeException when a token matches "fail_now" or
 something) we could have some tests that assert indexing a doc with that
 token results in a useful error -- which should help ensure that useful
 error also gets logged (although i don't think we really have any
 easy way of asserting specific log messages at the moment)


 -Hoss
 http://www.lucidworks.com/


Re: Tracking down the input that hits an analysis chain bug

2014-01-10 Thread Benson Margulies
Thanks, that's the recipe that I need.

On Fri, Jan 10, 2014 at 11:40 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : Is there a neighborhood of existing tests I should be visiting here?

  You'll need a custom schema that refers to your new
  MockFailOnCertainTokensFilterFactory, so i would create a completely new
  test class somewhere in ...solr.update (you're testing that an update
  fails with a clean error)


 -Hoss
 http://www.lucidworks.com/


Re: Tracking down the input that hits an analysis chain bug

2014-01-04 Thread Benson Margulies
I rather assumed that there was some log4j-ish config to be set that
would do this for me. Lacking one, I guess I'll end up there.

On Fri, Jan 3, 2014 at 8:23 PM, Michael Sokolov
msoko...@safaribooksonline.com wrote:
 Have you considered using a custom UpdateProcessor to catch the exception
 and provide more context in the logs?

 -Mike


 On 01/03/2014 03:33 PM, Benson Margulies wrote:

 Robert,

 Yes, if the problem was not data-dependent, indeed I wouldn't need to
 index anything. However, I've run a small mountain of data through our
 tokenizer on my machine, and never seen the error, but my customer
 gets these errors in the middle of a giant spew of data. As it
 happens, I _was_ missing that call to clearAttributes(), (and the
 usual implementation of end()), but I found and fixed that problem
 precisely by creating a random data test case using checkRandomData().
 Unfortunately, fixing that didn't make the customer's errors go away.

 So I'm left needing to help them identify the data that provokes this,
 because I've so far failed to come up with any.

 --benson


 On Fri, Jan 3, 2014 at 2:16 PM, Robert Muir rcm...@gmail.com wrote:

 This exception comes from OffsetAttributeImpl (e.g. you don't need to
 index anything to reproduce it).

 Maybe you have a missing clearAttributes() call (your tokenizer
 'returns true' without calling that first)? This could explain it, if
 something like a StopFilter is also present in the chain: basically
 the offsets overflow.

 the test stuff in BaseTokenStreamTestCase should be able to detect
 this as well...

 On Fri, Jan 3, 2014 at 1:56 PM, Benson Margulies ben...@basistech.com
 wrote:

 Using Solr Cloud with 4.3.1.

 We've got a problem with a tokenizer that manifests as calling
 OffsetAtt.setOffsets() with invalid inputs. OK, so, we want to figure
 out
 what input provokes our code into getting into this pickle.

 The problem happens on SolrCloud nodes.

 The problem manifests as this sort of thing:

 Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
 SEVERE: java.lang.IllegalArgumentException: startOffset must be
 non-negative, and endOffset must be >= startOffset,
 startOffset=-1811581632,endOffset=-1811581632

 How could we get a document ID so that we can tell which document was
 being
 processed?




Tracking down the input that hits an analysis chain bug

2014-01-03 Thread Benson Margulies
Using Solr Cloud with 4.3.1.

We've got a problem with a tokenizer that manifests as calling
OffsetAtt.setOffsets() with invalid inputs. OK, so, we want to figure out
what input provokes our code into getting into this pickle.

The problem happens on SolrCloud nodes.

The problem manifests as this sort of thing:

Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.IllegalArgumentException: startOffset must be
non-negative, and endOffset must be >= startOffset,
startOffset=-1811581632,endOffset=-1811581632

How could we get a document ID so that we can tell which document was being
processed?


Re: Tracking down the input that hits an analysis chain bug

2014-01-03 Thread Benson Margulies
Robert,

Yes, if the problem was not data-dependent, indeed I wouldn't need to
index anything. However, I've run a small mountain of data through our
tokenizer on my machine, and never seen the error, but my customer
gets these errors in the middle of a giant spew of data. As it
happens, I _was_ missing that call to clearAttributes(), (and the
usual implementation of end()), but I found and fixed that problem
precisely by creating a random data test case using checkRandomData().
Unfortunately, fixing that didn't make the customer's errors go away.

So I'm left needing to help them identify the data that provokes this,
because I've so far failed to come up with any.

--benson


On Fri, Jan 3, 2014 at 2:16 PM, Robert Muir rcm...@gmail.com wrote:
 This exception comes from OffsetAttributeImpl (e.g. you don't need to
 index anything to reproduce it).

 Maybe you have a missing clearAttributes() call (your tokenizer
 'returns true' without calling that first)? This could explain it, if
 something like a StopFilter is also present in the chain: basically
 the offsets overflow.

 the test stuff in BaseTokenStreamTestCase should be able to detect
 this as well...
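
 For the record, the shape of a correct incrementToken() -- everything named
 below except clearAttributes() and correctOffset() is a placeholder for the
 tokenizer's own state:

 @Override
 public final boolean incrementToken() throws IOException {
     clearAttributes();              // reset attribute state before every emitted token
     if (!advance()) {               // placeholder: find the next token in the buffer
         return false;
     }
     termAtt.copyBuffer(buffer, tokenStart, tokenLength);
     offsetAtt.setOffset(correctOffset(startChar), correctOffset(endChar));
     return true;
 }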

 On Fri, Jan 3, 2014 at 1:56 PM, Benson Margulies ben...@basistech.com wrote:
 Using Solr Cloud with 4.3.1.

 We've got a problem with a tokenizer that manifests as calling
 OffsetAtt.setOffsets() with invalid inputs. OK, so, we want to figure out
 what input provokes our code into getting into this pickle.

 The problem happens on SolrCloud nodes.

 The problem manifests as this sort of thing:

 Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
 SEVERE: java.lang.IllegalArgumentException: startOffset must be
 non-negative, and endOffset must be >= startOffset,
 startOffset=-1811581632,endOffset=-1811581632

 How could we get a document ID so that we can tell which document was being
 processed?


TokenizerFactory from 4.2.0 to 4.3.0

2013-09-16 Thread Benson Margulies
TokenizerFactory changed, incompatibly with subclasses, from 4.2.0 to
4.3.0. Subclasses must now implement a different overload of create, and
may not implement the old one.

Has anyone got any devious strategies other than multiple copies of code to
deal with this when supporting multiple versions of Solr?


Re: Solr Patent

2013-09-15 Thread Benson Margulies
I am not a lawyer.

The Apache Software Foundation cannot 'protect Solr developers.'

Patent infringement is a claim made against someone who derived economic
benefit from an invention, not someone who writes code.

The patent clause in the Apache License requires people who contribute code
to grant certain licenses. It does not, and cannot, prevent someone else
from asserting that a user of Apache Solr is infringing on some patent
owned by someone who has never contributed to the project.


SOLR-4872 and LUCENE-2145 (or, how to clean up a Tokenizer)

2013-06-12 Thread Benson Margulies
Could I have some help on the combination of these two? Right now, it
appears that I'm stuck with a finalizer to chase after native
resources in a Tokenizer. Am I missing something?


How can a Tokenizer be CoreAware?

2013-05-29 Thread Benson Margulies
I am currently testing some things with Solr 4.0.0. I tried to make a
tokenizer CoreAware, and was rewarded with:

Caused by: org.apache.solr.common.SolrException: Invalid 'Aware'
object: com.basistech.rlp.solr.RLPTokenizerFactory@19336006 --
org.apache.solr.util.plugin.SolrCoreAware must be an instance of:
[org.apache.solr.request.SolrRequestHandler]
[org.apache.solr.response.QueryResponseWriter]
[org.apache.solr.handler.component.SearchComponent]
[org.apache.solr.update.processor.UpdateRequestProcessorFactory]
[org.apache.solr.handler.component.ShardHandlerFactory]

I need this to allow cleanup of some cached items in the tokenizer.

Questions:

1: will a newer version allow me to do this directly?
2: is there some other approach that anyone would recommend? I could,
for example, make a fake object in the list above to act as a
singleton with a static accessor, but that seems pretty ugly.
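
For what it's worth, the "fake object" workaround in question 2 can be made a
little less ugly by having the SolrCoreAware stand-in register a close hook;
everything below except the Solr types is hypothetical:

import org.apache.solr.core.CloseHook;
import org.apache.solr.core.SolrCore;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.util.plugin.SolrCoreAware;

public class TokenizerCleanupComponent extends SearchComponent implements SolrCoreAware {
    @Override
    public void inform(SolrCore core) {
        // Hand the core-close event to the tokenizer's cache.
        core.addCloseHook(new CloseHook() {
            @Override
            public void preClose(SolrCore c) {
                RLPTokenizerCache.clear(); // hypothetical static accessor
            }
            @Override
            public void postClose(SolrCore c) { }
        });
    }
    @Override public void prepare(ResponseBuilder rb) { }
    @Override public void process(ResponseBuilder rb) { }
    @Override public String getDescription() { return "tokenizer cache cleanup hook"; }
    @Override public String getSource() { return null; }
}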


Seeming bug in ConcurrentUpdateSolrServer

2013-05-29 Thread Benson Margulies
The comment here is clearly wrong, since there is no division by two.

I think that the code is wrong, because this results in not starting
runners when it should start runners. Am I misanalyzing?

if (runners.isEmpty()
    || (queue.remainingCapacity() < queue.size() // queue is half full and we can add more runners
        && runners.size() < threadCount)) {


Re: Seeming bug in ConcurrentUpdateSolrServer

2013-05-29 Thread Benson Margulies
Ah. So now I have to find some other explanation of why it never
creates more than one thread, even when I make a very deep queue and
specify 6 threads.

On Wed, May 29, 2013 at 2:25 PM, Shalin Shekhar Mangar
shalinman...@gmail.com wrote:
 On Wed, May 29, 2013 at 11:29 PM, Benson Margulies 
 bimargul...@gmail.comwrote:

 The comment here is clearly wrong, since there is no division by two.

 I think that the code is wrong, because this results in not starting
 runners when it should start runners. Am I misanalyzing?

 if (runners.isEmpty()
     || (queue.remainingCapacity() < queue.size() // queue is half full and we can add more runners
         && runners.size() < threadCount)) {



 queue.remainingCapacity() returns capacity - queue.size() so the comment is
 correct.

 --
 Regards,
 Shalin Shekhar Mangar.


Re: Seeming bug in ConcurrentUpdateSolrServer

2013-05-29 Thread Benson Margulies
I now understand the algorithm, but I don't understand why is the way it is.

Consider one of these objects configured with a handful of threads and
a pretty big queue.

When the first request comes in, the object creates one runner. It
then won't create a second runner until the Q reaches 1/2-full.

If the idea is that we want to pile up 'a lot' (1/2-of-a-q) of work
before sending any of it, why start that first runner?

On Wed, May 29, 2013 at 2:45 PM, Benson Margulies bimargul...@gmail.com wrote:
 Ah. So now I have to find some other explanation of why it never
 creates more than one thread, even when I make a very deep queue and
 specify 6 threads.

 On Wed, May 29, 2013 at 2:25 PM, Shalin Shekhar Mangar
 shalinman...@gmail.com wrote:
 On Wed, May 29, 2013 at 11:29 PM, Benson Margulies 
 bimargul...@gmail.comwrote:

 The comment here is clearly wrong, since there is no division by two.

 I think that the code is wrong, because this results in not starting
 runners when it should start runners. Am I misanalyzing?

  if (runners.isEmpty()
      || (queue.remainingCapacity() < queue.size() // queue is half full and we can add more runners
          && runners.size() < threadCount)) {



 queue.remainingCapacity() returns capacity - queue.size() so the comment is
 correct.

 --
 Regards,
 Shalin Shekhar Mangar.


Not so concurrent concurrency

2013-05-28 Thread Benson Margulies
 I can't quite apply SolrMeter to my problem, so I did something of my
own. The brains of the operation are the function here.

This feeds a ConcurrentUpdateSolrServer about 95 documents, each about
10mb, and 'threads' is six. Yet Solr just barely uses more than one
core.

    private long doIteration(File[] filesToRead) throws IOException,
            SolrServerException {
        ConcurrentUpdateSolrServer concurrentServer = new
                ConcurrentUpdateSolrServer(launcher.getSolrServer().getBaseURL(),
                        1000, threads);
        UpdateRequest updateRequest = new UpdateRequest(updateUrl);
        updateRequest.setCommitWithin(1);
        Stopwatch stopwatch = new Stopwatch();

        List<File> allFiles = Arrays.asList(filesToRead);
        Iterator<File> fileIterator = allFiles.iterator();
        while (fileIterator.hasNext()) {
            List<File> thisBatch = Lists.newArrayList();
            int batchByteCount = 0;
            while (batchByteCount < BATCH_LIMIT && fileIterator.hasNext()) {
                File thisFile = fileIterator.next();
                thisBatch.add(thisFile);
                batchByteCount += thisFile.length();
            }
            LOG.info(String.format("update %s files", thisBatch.size()));
            updateRequest.setDocIterator(new StreamingDocumentIterator(thisBatch));
            stopwatch.start();
            concurrentServer.request(updateRequest);
            concurrentServer.blockUntilFinished();
            stopwatch.stop();
        }


Benchmarking Solr

2013-05-26 Thread Benson Margulies
I'd like to run a repeatable test of having Solr ingest a corpus of
docs on disk, to measure the speed of some alternative things plugged
in.

Anyone have some advice to share? One approach would be a quick SolrJ
program that pushed the entire stack as one giant collection with a
commit at the end.
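
A skeleton of that quick SolrJ program -- field names and corpus layout are
placeholders, and the single commit at the end keeps commit cost out of the
measurement:

import java.io.File;
import java.nio.charset.StandardCharsets;
import com.google.common.io.Files;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IngestBench {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        File[] corpus = new File(args[0]).listFiles();
        long start = System.nanoTime();
        for (File f : corpus) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", f.getName());
            doc.addField("text", Files.toString(f, StandardCharsets.UTF_8));
            server.add(doc);
        }
        server.commit(); // one commit at the end, per the approach above
        System.out.printf("indexed %d docs in %.1fs%n", corpus.length,
                (System.nanoTime() - start) / 1e9);
        server.shutdown();
    }
}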


Re: solr.xml or its successor in the wiki

2013-05-20 Thread Benson Margulies
I suppose you saw my JIRA suggesting that solr.xml might have
the same repertoire of 'lib' elements as solrconfig.xml, instead of
just a single 'str'.

On Mon, May 20, 2013 at 11:16 AM, Erick Erickson
erickerick...@gmail.com wrote:
 What's supposed to happen (not guaranteeing it is completely correct,
 mind you) is that the presence of a <cores> tag defines which checks
 are performed. Errors are thrown on old-style constructs when no
 <cores> tag is present and vice-versa.

 Best
 Erick


 On Sun, May 19, 2013 at 7:20 PM, Benson Margulies bimargul...@gmail.com 
 wrote:
 One point of confusion: Is the compatibility code I hit trying to
 prohibit the 'str' form when it sees old-fangled cores? Or when the
 current running version is pre-5.0? I hope it's the former.

 On Sun, May 19, 2013 at 6:47 PM, Shawn Heisey s...@elyograg.org wrote:
 On 5/19/2013 4:38 PM, Benson Margulies wrote:
 Shawn, thanks. need any more jiras on this?

 I don't think so, but if you grab the 4.3 branch or branch_4x and find
 any bugs, let us know.

 Thanks,
 Shawn



solr.xml or its successor in the wiki

2013-05-19 Thread Benson Margulies
http://wiki.apache.org/solr/ConfiguringSolr

does not point to any information on solr.xml.

Given https://issues.apache.org/jira/browse/SOLR-4791, I'm a bit
confused, and I need to set up a sharedLib directory for 4.3.0.

I would do some writing or linking if I had some raw material ...


Re: solr.xml or its successor in the wiki

2013-05-19 Thread Benson Margulies
I found http://wiki.apache.org/solr/Solr.xml%204.3%20and%20beyond, but
it doesn't mention the successor to sharedLib.

On Sun, May 19, 2013 at 12:02 PM, Benson Margulies
bimargul...@gmail.com wrote:
 http://wiki.apache.org/solr/ConfiguringSolr

 does not point to any information on solr.xml.

 Given https://issues.apache.org/jira/browse/SOLR-4791, I'm a bit
 confused, and I need to set up a sharedLib directory for 4.3.0.

 I would do some writing or linking if I had some raw material ...


Re: solr.xml or its successor in the wiki

2013-05-19 Thread Benson Margulies
OK, I found the successor.

On Sun, May 19, 2013 at 12:40 PM, Benson Margulies
bimargul...@gmail.com wrote:
 I found http://wiki.apache.org/solr/Solr.xml%204.3%20and%20beyond, but
 it doesn't mention the successor to sharedLib.

 On Sun, May 19, 2013 at 12:02 PM, Benson Margulies
 bimargul...@gmail.com wrote:
 http://wiki.apache.org/solr/ConfiguringSolr

 does not point to any information on solr.xml.

 Given https://issues.apache.org/jira/browse/SOLR-4791, I'm a bit
 confused, and I need to set up a sharedLib directory for 4.3.0.

 I would do some writing or linking if I had some raw material ...


Re: solr.xml or its successor in the wiki

2013-05-19 Thread Benson Margulies
Starting with the shipped solr.xml, I added a new-style str child to
configure a shared lib, and i was rewarded with:

Caused by: org.apache.solr.common.SolrException: Should not have found
solr/str[@name='sharedLib'] solr.xml may be a mix of old and new style
formats.
at org.apache.solr.core.ConfigSolrXml.failIfFound(ConfigSolrXml.java:169)
at org.apache.solr.core.ConfigSolrXml.<init>(ConfigSolrXml.java:150)
at org.apache.solr.core.ConfigSolrXml.<init>(ConfigSolrXml.java:94)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:387)
... 42 more

Is this a bug? I seem to now be caught on a fork between 4791 and this.

On Sun, May 19, 2013 at 12:52 PM, Benson Margulies
bimargul...@gmail.com wrote:
 OK, I found the successor.

 On Sun, May 19, 2013 at 12:40 PM, Benson Margulies
 bimargul...@gmail.com wrote:
 I found http://wiki.apache.org/solr/Solr.xml%204.3%20and%20beyond, but
 it doesn't mention the successor to sharedLib.

 On Sun, May 19, 2013 at 12:02 PM, Benson Margulies
 bimargul...@gmail.com wrote:
 http://wiki.apache.org/solr/ConfiguringSolr

 does not point to any information on solr.xml.

 Given https://issues.apache.org/jira/browse/SOLR-4791, I'm a bit
 confused, and I need to set up a sharedLib directory for 4.3.0.

 I would do some writing or linking if I had some raw material ...


Re: solr.xml or its successor in the wiki

2013-05-19 Thread Benson Margulies
Shawn, thanks. need any more jiras on this?

On May 19, 2013, at 6:37 PM, Shawn Heisey s...@elyograg.org wrote:

 On 5/19/2013 11:27 AM, Benson Margulies wrote:
 Starting with the shipped solr.xml, I added a new-style str child to
 configure a shared lib, and i was rewarded with:

 Caused by: org.apache.solr.common.SolrException: Should not have found
 solr/str[@name='sharedLib'] solr.xml may be a mix of old and new style
 formats.
 at org.apache.solr.core.ConfigSolrXml.failIfFound(ConfigSolrXml.java:169)
 at org.apache.solr.core.ConfigSolrXml.<init>(ConfigSolrXml.java:150)
 at org.apache.solr.core.ConfigSolrXml.<init>(ConfigSolrXml.java:94)
 at org.apache.solr.core.CoreContainer.load(CoreContainer.java:387)
 ... 42 more

 There are serious problems with the new solr.xml format in 4.3.  Due to
 major changes in the code between 4.3 and 4.4, the problems will not be
 fixed in 4.3.1.  You'll need to wait for 4.4 before attempting to use
 it.  The new format will be used in the example in 4.4.

 I have updated the ConfiguringSolr page with some additional info, and
 reorganized it.  I believe the 4.3 and beyond page should be changed
 to 4.4 and beyond.

 The sharedLib attribute is broken in 4.3.0, fixed in 4.3.1 with
 SOLR-4791, which should be out very soon.  A workaround is to put your
 jars in ${solr.solr.home}/lib which does not require configuration.

 After 4.3.1 comes out (or if you a use dev version), if you want to use
 sharedLib in the old-style solr.xml file, it will not be a str tag, it
 is an attribute on the solr tag.  The sharedLib values are relative to
 solr.solr.home:

 <solr persistent="true" sharedLib="libextra">
   <cores adminPath="/admin/cores">

 Thanks,
 Shawn



Re: solr.xml or its successor in the wiki

2013-05-19 Thread Benson Margulies
One point of confusion: Is the compatibility code I hit trying to
prohibit the 'str' form when it sees old-fangled cores? Or when the
current running version is pre-5.0? I hope it's the former.

On Sun, May 19, 2013 at 6:47 PM, Shawn Heisey s...@elyograg.org wrote:
 On 5/19/2013 4:38 PM, Benson Margulies wrote:
 Shawn, thanks. need any more jiras on this?

 I don't think so, but if you grab the 4.3 branch or branch_4x and find
 any bugs, let us know.

 Thanks,
 Shawn



wiki versus downloads versus archives

2013-05-16 Thread Benson Margulies
http://wiki.apache.org/solr/Solr3.1 claims that Solr3.1 is available in a
place where it is not, and I can't find a link on the front page to the
archive for old releases.


Re: wiki versus downloads versus archives

2013-05-16 Thread Benson Margulies
Thanks.


On Thu, May 16, 2013 at 4:28 PM, Shawn Heisey s...@elyograg.org wrote:

 On 5/16/2013 2:21 PM, Benson Margulies wrote:

 http://wiki.apache.org/solr/Solr3.1 claims that Solr3.1 is available in a
 place where it is not, and I can't find a link on the front page to the
 archive for old releases.


 Download links fixed on the wiki pages for 3.1 and 3.2.

 Thanks,
 Shawn




A request handler that manipulated the index

2013-04-02 Thread Benson Margulies
I am thinking about trying to structure a problem as a Solr plugin. The
nature of the plugin is that it would need to read and write the lucene
index to do its work. It could not be cleanly split into URP 'over here'
and a Search Component 'over there'.

Are there invariants of Solr that would preclude this, like assumptions in
the implementation of the cache?


Solr1.4 and threads ....

2012-06-13 Thread Benson Margulies
We've got a tokenizer which is quite explicitly coded on the
assumption that it will only be called from one thread at a time.
After all, what would it mean for two threads to make interleaved
calls to the hasNext() function?

Yet, a customer of ours with a gigantic instance of Solr 1.4 reports
incidents in which we throw an exception that indicates (we think),
that two different threads made interleaved calls.

Does this suggest anything to anyone? Other than that we've
misanalyzed the logic in the tokenizer and there's a way to make it
burp on one thread?
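
One detector sketch (not what the tokenizer does today) that tolerates
sequential reuse across threads but trips on true interleaving, using
java.util.concurrent.atomic.AtomicReference:

    private final AtomicReference<Thread> active = new AtomicReference<Thread>();

    private void enterCall() {   // invoked at the top of each tokenizer call
        if (!active.compareAndSet(null, Thread.currentThread())) {
            throw new IllegalStateException(
                    "interleaved call; busy with " + active.get());
        }
    }

    private void exitCall() {    // invoked on every exit path, including exceptions
        active.set(null);
    }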


Re: Why would solr norms come up different from Lucene norms?

2012-05-05 Thread Benson Margulies
On Sat, May 5, 2012 at 7:59 PM, Lance Norskog goks...@gmail.com wrote:
 Which Similarity class do you use for the Lucene code? Solr has a custom one.

I am embarassed to report that I also have a custom similarity that I
didn't know about, and once I configured that into Solr all was well.
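
For anyone else who lands here: in Lucene/Solr 3.x the index-time norm is
fieldBoost * lengthNorm, where DefaultSimilarity computes

    lengthNorm(field) = 1 / sqrt(numTerms)

and the product is quantized to a single byte on write. A stable
0.5-versus-0.625 difference on identical documents therefore almost has to
be a different Similarity or boost, not anything exotic.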



 On Fri, May 4, 2012 at 6:30 AM, Benson Margulies bimargul...@gmail.com 
 wrote:
 So, I've got some code that stores the same documents in a Lucene
 3.5.0 index and a Solr 3.5.0 instance. It's only five documents.

 For a particular field, the Solr norm is always 0.625, while the
 Lucene norm is .5.

 I've watched the code in NormsWriterPerField in both cases.

 In Solr we've got .577, in naked Lucene it's .5.

 I tried to check for boosts, and I don't see any non-1.0 document or
 field boosts.

 The Solr field is:

 <field name="bt_rni_NameHRK_encodedName" type="text_ws" indexed="true"
 stored="true" multiValued="false" />



 --
 Lance Norskog
 goks...@gmail.com


Why would solr norms come up different from Lucene norms?

2012-05-04 Thread Benson Margulies
So, I've got some code that stores the same documents in a Lucene
3.5.0 index and a Solr 3.5.0 instance. It's only five documents.

For a particular field, the Solr norm is always 0.625, while the
Lucene norm is .5.

I've watched the code in NormsWriterPerField in both cases.

In Solr we've got .577, in naked Lucene it's .5.

I tried to check for boosts, and I don't see any non-1.0 document or
field boosts.

The Solr field is:

<field name="bt_rni_NameHRK_encodedName" type="text_ws" indexed="true"
stored="true" multiValued="false" />


Latest solr4 snapshot seems to be giving me a lot of unhappy logging about 'Log4j', should I be concerned?

2012-05-01 Thread Benson Margulies
CoreContainer.java, in the method 'load', finds itself calling
loader.newInstance with an 'fname' of Log4j if the slf4j backend is
'Log4j'.

e.g.:

2012-05-01 10:40:32,367 org.apache.solr.core.CoreContainer  - Unable
to load LogWatcher
org.apache.solr.common.SolrException: Error loading class 'Log4j'

What is it actually looking for? Have I misplaced something?


Re: Latest solr4 snapshot seems to be giving me a lot of unhappy logging about 'Log4j', should I be concerned?

2012-05-01 Thread Benson Margulies
On Tue, May 1, 2012 at 12:16 PM, Mark Miller markrmil...@gmail.com wrote:
 There is a recent JIRA issue about keeping the last n logs to display in the 
 admin UI.

 That introduced a problem - and then the fix introduced a problem - and then 
 the fix mitigated the problem but left that ugly logging as a by product.

 Don't remember the issue # offhand. I think there was a dispute about what 
 should be done with it.

 On May 1, 2012, at 11:14 AM, Benson Margulies wrote:

  CoreContainer.java, in the method 'load', finds itself calling
  loader.newInstance with an 'fname' of Log4j if the slf4j backend is
  'Log4j'.

Couldn't someone just fix the if statement to say, 'OK, if we're doing
log4j, we have no log watcher' and skip all the loud failing on the
way?




 e.g.:

 2012-05-01 10:40:32,367 org.apache.solr.core.CoreContainer  - Unable
 to load LogWatcher
 org.apache.solr.common.SolrException: Error loading class 'Log4j'

 What is it actually looking for? Have I misplaced something?

 - Mark Miller
 lucidimagination.com













Re: Latest solr4 snapshot seems to be giving me a lot of unhappy logging about 'Log4j', should I be concerned?

2012-05-01 Thread Benson Margulies
Yes, I'm the author of that JIRA.

On Tue, May 1, 2012 at 8:45 PM, Ryan McKinley ryan...@gmail.com wrote:
 check a release since r1332752

 If things still look problematic, post a comment on:
 https://issues.apache.org/jira/browse/SOLR-3426

 this should now have a less verbose message with an older SLF4j and with Log4j


 On Tue, May 1, 2012 at 10:14 AM, Gopal Patwa gopalpa...@gmail.com wrote:
 I have a similar issue using log4j for logging with the trunk build; the
 CoreContainer class prints a big stack trace on our jboss 4.2.2 startup. I am
 using slf4j 1.5.2

 10:07:45,918 WARN  [CoreContainer] Unable to read SLF4J version
 java.lang.NoSuchMethodError:
 org.slf4j.impl.StaticLoggerBinder.getSingleton()Lorg/slf4j/impl/StaticLoggerBinder;
 at org.apache.solr.core.CoreContainer.load(CoreContainer.java:395)
 at org.apache.solr.core.CoreContainer.load(CoreContainer.java:355)
 at
 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:304)
 at
 org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:101)



 On Tue, May 1, 2012 at 9:25 AM, Benson Margulies 
  bimargul...@gmail.com wrote:

 On Tue, May 1, 2012 at 12:16 PM, Mark Miller markrmil...@gmail.com
 wrote:
  There is a recent JIRA issue about keeping the last n logs to display in
 the admin UI.
 
  That introduced a problem - and then the fix introduced a problem - and
 then the fix mitigated the problem but left that ugly logging as a by
 product.
 
  Don't remember the issue # offhand. I think there was a dispute about
 what should be done with it.
 
  On May 1, 2012, at 11:14 AM, Benson Margulies wrote:
 
   CoreContainer.java, in the method 'load', finds itself calling
   loader.newInstance with an 'fname' of Log4j if the slf4j backend is
   'Log4j'.

 Couldn't someone just fix the if statement to say, 'OK, if we're doing
 log4j, we have no log watcher' and skip all the loud failing on the
 way?



 
  e.g.:
 
  2012-05-01 10:40:32,367 org.apache.solr.core.CoreContainer  - Unable
  to load LogWatcher
  org.apache.solr.common.SolrException: Error loading class 'Log4j'
 
  What is it actually looking for? Have I misplaced something?
 
  - Mark Miller
  lucidimagination.com
 
 
 
 
 
 
 
 
 
 
 



Re: Unsubscribe does not appear to be working

2012-04-27 Thread Benson Margulies
There is no such thing as a 'solr forum' or a 'solr forum account.'

If you are subscribed to this list, an email to the unsubscribe
address will unsubscribe you. If some intermediary or third party is
forwarding email from this list to you, no one here can help you.

On Fri, Apr 27, 2012 at 12:09 PM, Kevin Bootz kbo...@caci.com wrote:
 I have tried the unsubscribe process but believe it to be broken as I've gone 
 as far as deleting my solr forum account and yet continue to receive emails. 
 Is there a moderator that can remove my email from the list please?

 Thanks


Re: Query parsing VS marshalling/unmarshalling

2012-04-24 Thread Benson Margulies
2012/4/24 Mindaugas Žakšauskas min...@gmail.com:
 Hi,

 I maintain a distributed system which Solr is part of. The data which
 is kept is Solr is permissioned and permissions are currently
 implemented by taking the original user query, adding certain bits to
 it which would make it return less data in the search results. Now I
 am at the point where I need to go over this functionality and try to
 improve it.

 Changing this to send two separate queries (q=...fq=...) would be the
 first logical thing to do, however I was thinking of an extra
 improvement. Instead of generating filter query, converting it into a
 String, sending over the HTTP just to parse it by Solr again - would
 it not be better to take the generated Lucene fq query, serialize it using
 Java serialization, convert it to, say, Base64 and then send and
 deserialize it on the Solr end? Has anyone tried doing any performance
 comparisons on this topic?

I'm about to try out a contribution for serializing queries as JSON using
Jackson. I've previously done this by serializing my own data structure
and putting the JSON into a custom query parameter.
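
A sketch of the custom-parameter route (names are made up; filterSpec
stands in for whatever app-side model describes the permissions):

    ObjectMapper mapper = new ObjectMapper();             // Jackson; exceptions elided
    String json = mapper.writeValueAsString(filterSpec);
    solrQuery.set("myFilterJson", json);                  // rebuilt server-side by a QParserPlugin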



 I am particularly concerned about this because in extreme cases my
 filter queries can be very large (1000s of characters long) and we
 already had to do tweaks as the size of GET requests would exceed
 default limits. And yes, we could move to POST but I would like to
 minimize both the amount of data that is sent over and the time taken
 to parse large queries.

 Thanks in advance.

 m.


Is there such as thing as FQ on a subquery?

2012-04-16 Thread Benson Margulies
I found myself wanting to write ...

   OR _query_:"{!lucene fq=\"a:b\"}c:d"

And then I started looking at query trees in the debugger, and found
myself thinking that there's no possible representation for this -- a
subquery with a filter, since the filters are part of the
RequestBuilder, not part of the query.

Am I missing something?


Questions about the query function

2012-04-15 Thread Benson Margulies
I've been pestering you all with a series of questions about
disassembling and partially rescoring queries. Every helpful response
(thanks) has led me to further reading, and this leads to more
questions. If I haven't before, I'll apologize now for the high level
of ignorance at which I'm starting. This morning I'm wading in the
pool of subqueries.

An example from the wiki:

q=product(popularity, query({!dismax qf=text v='solr rocks'}))

What's the use/net effect of having a Q that amounts to a number? The
final number becomes the score, yes? Why doesn't this example have to
use _val_? Is there an assumed defType of func?

Reading QueryValueSource.java, I'm wondering about the speed of
something like this, or, more to the point, something like a
calculation on two queries. Does this iteratively run the subquery for
each document otherwise under consideration? I see some code that
might be an optimization here.

I'm also wondering, is there a way to express that I only want to see
results that meet some threshold value for a subquery?

Does _val_ (or the local param syntax) manufacture a field that
appears in returned documents? If so, is its name _val_? In which
case, can there be more than one?


Re: Questions about the query function

2012-04-15 Thread Benson Margulies
On Sun, Apr 15, 2012 at 9:03 AM, Erik Hatcher erik.hatc...@gmail.com wrote:
 Why doesn't this example have to
 use _val_? Is there an assumed defType of fund?

 Yeah, that wiki page is misleading there, as it is implying a non-specified 
 defType=func.  Wanna fix up the wiki to make this clear?

Yup.
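
For the record, the two explicit spellings:

    defType=func&q=product(popularity, query({!dismax qf=text v='solr rocks'}))

or, inline:

    q={!func}product(popularity, query({!dismax qf=text v='solr rocks'}))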


 _val_ would work too, or of course using that function as a parameter to 
 (e)dismay's bf, or dismay's boost params.

        Erik



 On Apr 15, 2012, at 08:43 , Benson Margulies wrote:

 I've been pestering you all with a series of questions about
 disassembling and partially rescoring queries. Every helpful response
 (thanks) has led me to further reading, and this leads to more
 questions. If I haven't before, I'll apologize now for the high level
 of ignorance at which I'm starting. This morning I'm wading in the
 pool of subqueries.

 An example from the wiki:

 q=product(popularity, query({!dismax qf=text v='solr rocks'}))

 What's the use/net effect of having a Q that amounts to a number? The
 final number becomes the score, yes? Why doesn't this example have to
 use _val_? Is there an assumed defType of func?

 Reading QueryValueSource.java, I'm wondering about the speed of
 something like this, or, more to the point, something like a
 calculation on two queries. Does this iteratively run the subquery for
 each document otherwise under consideration? I see some code that
 might be an optimization here.

 I'm also wondering, is there a way to express that I only want to see
 results that meet some threshold value for a subquery?

 Does _val_ (or the local param syntax) manufacture a field that
 appears in returned documents? If so, is its name _val_? In which
 case, can there be more than one?



Re: Questions about the query function

2012-04-15 Thread Benson Margulies
Since I ended up with 'fund' instead of 'func' we're even. I made the
edit. I'd make some more if you answered more of my questions :-)

On Sun, Apr 15, 2012 at 9:42 AM, Erik Hatcher erik.hatc...@gmail.com wrote:

 _val_ would work too, or of course using that function as a parameter to 
 (e)dismay's bf, or dismay's boost params.

 oops damn you autocorrect.  I've been fighting this one since upgrading 
 to Lion and will turn it off.  s/dismay/dismax/!  :)

        Erik



It's hard to google on _val_

2012-04-15 Thread Benson Margulies
So, I've been experimenting to learn how the _val_ participates in scores.

It seems to me that http://wiki.apache.org/solr/FunctionQuery should
explain the *effect* of including an _val_ term in an ordinary query,
starting with a constant.

http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_change_the_score_of_a_document_based_on_the_.2Avalue.2A_of_a_field_.28say.2C_.22popularity.22.29
poses exactly my question, but does not explain the math. It just
says, 'they get a boost'.

I tried some experiments. Positive values of _val_ did lead to
positive increments in the score, but clearly not by simple addition.

Presumably, the brains of the operation are
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/scoring.html.
However, it seems to me that it would be kind to us dumb animals if
the Solr pages gave a 'for idiots' summary of the net effect. Left to
my own devices, I'll eventually work my way through this, but if
someone hands me a shortcut, I'll cheerfully play tech writer here and
there.


Re: It's hard to google on _val_

2012-04-15 Thread Benson Margulies
On Sun, Apr 15, 2012 at 12:14 PM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Sun, Apr 15, 2012 at 11:34 AM, Benson Margulies
 bimargul...@gmail.com wrote:
 So, I've been experimenting to learn how the _val_ participates in scores.

 It seems to me that http://wiki.apache.org/solr/FunctionQuery should
 explain the *effect* of including an _val_ term in an ordinary query,
 starting with a constant.

 It's simply added to the score as any other clause in a boolean query would 
 be.

 Positive values of _val_ did lead to
 positive increments in the score, but clearly not by simple addition.

 That's just because Lucene normalizes scores.  By default, this is
 really just multiplying scores by a magic constant (by default the
 inverse of the square root of the sum of squared weights) and doesn't
 change relative orderings of docs.  If you add debugQuery=true and look
 at the scoring explanations, you'll see that queryNorm component.

 If you want to go down the rabbit hole on trunk, see
 IndexSearcher.createNormalizedWeight()

I think I should be able to add some text to the wiki that would help
fellow Alices merely by looking at the debugQuery result. Thanks.
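
A back-of-the-envelope for that wiki text, assuming DefaultSimilarity:
queryNorm = 1 / sqrt(sum of squared clause weights), so in a query like
text:foo _val_:3.0 the constant clause contributes roughly 3.0 * queryNorm
to each matching document - added, but not by simple addition.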



 -Yonik
 lucenerevolution.com - Lucene/Solr Open Source Search Conference.
 Boston May 7-10


Re: Can I discover what part of a score is attributable to a subquery?

2012-04-14 Thread Benson Margulies
yes please

On Apr 14, 2012, at 2:40 AM, Paul Libbrecht p...@hoplahup.net wrote:

 Benson,
 In mid 2009, I had such a question answered with a nifty score bitwise
 manipulation, and a little precision loss. For each result I could pick the
 language of a multilingual match.
 If interested, I can dig.
 Paul
 --
  Sent from my Android phone with K-9 Mail. Please excuse my brevity.


  Benson Margulies bimargul...@gmail.com wrote:

 Given a query including a subquery, is there any way for me to learn
 that subquery's contribution to the overall document score?

 I can provide 'why on earth would anyone ...' if someone wants to know.



Re: Can I discover what part of a score is attributable to a subquery?

2012-04-14 Thread Benson Margulies
On Sat, Apr 14, 2012 at 12:37 PM, Paul Libbrecht p...@hoplahup.net wrote:
 Benson,

 it was in the Lucene world in May 2010:
        
 http://mail-archives.apache.org/mod_mbox/lucene-java-user/201005.mbox/%3c469705.48901...@web29016.mail.ird.yahoo.com%3E
 Mark Harwood pointed me to a FlagQuery which was exactly what I needed.
 His contribution sounds not to have been taken up, it worked for me in 
 Lucene, 2.4.1.
 We used this to create an auto-completion popup which selected the right 
 language by flagging the right sub-query that was most matched.

Paul, it seems to me that the criticism in the JIRA (do you really
want this calculation for every single document that matches?) applies
to me. In our stuff, we run a query, and we look at the top 200 items,
rearranging their order based on a name similarity metric that is too
expensive to run in bulk. If the overall query is 'just us', we can
discard the Lucene scores and reorder based on our own. If our query
is combined with other terms, then we need to subtract out the
contribution our part of the initial query. However, sending in a
second query with (I suppose) ids=id1,id2,... and just our query, to
retrieve the scores, should be pretty speedy for a mere 200 items.
Maybe I'm missing some even easier way, given a DocList and a query,
to obtain scores for those docs for that query?
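
Roughly, in SolrJ terms (illustrative; rniClause and joinedIds are
placeholders for our subquery and the ~200 page ids):

    SolrQuery rescore = new SolrQuery(rniClause);        // our clause by itself
    rescore.addFilterQuery("id:(" + joinedIds + ")");    // constrain to the page
    rescore.setFields("id", "score");
    rescore.setRows(200);
    QueryResponse rsp = server.query(rescore);           // scores isolate our part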


 paul

 Le 14 avr. 2012 à 15:34, Benson Margulies a écrit :

 yes please

 On Apr 14, 2012, at 2:40 AM, Paul Libbrecht p...@hoplahup.net wrote:

 Benson,
 In mid 2009, I had such a question answered with a nifty score bitwise
 manipulation, and a little precision loss. For each result I could pick the
 language of a multilingual match.
 If interested, I can dig.
 Paul
 --
  Sent from my Android phone with K-9 Mail. Please excuse my brevity.


  Benson Margulies bimargul...@gmail.com wrote:

 Given a query including a subquery, is there any way for me to learn
 that subquery's contribution to the overall document score?

 I can provide 'why on earth would anyone ...' if someone wants to know.




Realtime /get versus SearchHandler

2012-04-13 Thread Benson Margulies
A discussion over on the dev list led me to expect that the by-id
field retrievals in a SolrCloud query would come through the get
handler. In fact, I've seen them turn up in my search component in the
search handler that is configured with my custom QT. (I have a
'prepare' method that sets ShardParams.QT to my QT to get my
processing involved in the first of the two queries.) Did I overthink
this?


Re: Can I discover what part of a score is attributable to a subquery?

2012-04-13 Thread Benson Margulies
On Fri, Apr 13, 2012 at 6:43 PM, John Chee johnc...@mylife.com wrote:
 On Fri, Apr 13, 2012 at 2:40 PM, Benson Margulies bimargul...@gmail.com 
 wrote:
 Given a query including a subquery, is there any way for me to learn
 that subquery's contribution to the overall document score?

I need this number to be available in a SearchComponent that runs
after QueryComponent.



 I can provide 'why on earth would anyone ...' if someone wants to know.

 Have you tried debugQuery=true?
 http://wiki.apache.org/solr/CommonQueryParameters#debugQuery The
 'explain' field of the result explains the scoring of each document.


Re: Can I discover what part of a score is attributable to a subquery?

2012-04-13 Thread Benson Margulies
On Fri, Apr 13, 2012 at 7:07 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : Given a query including a subquery, is there any way for me to learn
 : that subquery's contribution to the overall document score?

 You have to just execute the subquery itself ... doc collection
 and score calculation doesn't keep track of the subscores.

 you could do this using functions in the fl but since you mentioned
 wanting this in a SearchComponent just pass the subquery to
 SolrIndexSearcher using a DocSet filter of the current page (ie: make your
 own DocSet based on the current DocList)

I get it. Some fairly intricate dancing then can ensue with SolrCloud. Thanks.
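
For the archive, a sketch of that suggestion inside a SearchComponent
(rb is the ResponseBuilder, subQuery our clause; API per Solr 3.x/4.x):

    DocList page = rb.getResults().docList;
    int[] ids = new int[page.size()];
    DocIterator it = page.iterator();
    for (int i = 0; it.hasNext(); i++) {
        ids[i] = it.nextDoc();               // lucene docids of the current page
    }
    DocSet pageSet = new HashDocSet(ids, 0, ids.length);
    DocList rescored = rb.req.getSearcher().getDocList(
            subQuery, pageSet, null, 0, ids.length, SolrIndexSearcher.GET_SCORES);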



 -Hoss


Re: I've broken delete in SolrCloud and I'm a bit clueless as to how

2012-04-12 Thread Benson Margulies
On Thu, Apr 12, 2012 at 11:56 AM, Mark Miller markrmil...@gmail.com wrote:
 Please see the documentation: 
 http://wiki.apache.org/solr/SolrCloud#Required_Config

Did I fail to find this in google or did I just goad you into a writing job?

I'm inclined to write a JIRA asking for _version_ to be configurable
just like the uniqueKey in the schema.




 schema.xml

 You must have a _version_ field defined:

 <field name="_version_" type="long" indexed="true" stored="true"/>

 On Apr 11, 2012, at 9:10 AM, Benson Margulies wrote:

 I didn't have a _version_ field, since nothing in the schema says that
 it's required!

 On Wed, Apr 11, 2012 at 6:35 AM, Darren Govoni dar...@ontrenet.com wrote:
 Hard to say why its not working for you. Start with a fresh Solr and
 work forward from there or back out your configs and plugins until it
 works again.

 On Tue, 2012-04-10 at 17:15 -0400, Benson Margulies wrote:
 In my cloud configuration, if I push

 <delete>
   <query>*:*</query>
 </delete>

 followed by:

 <commit/>

 I get no errors, the log looks happy enough, but the documents remain
 in the index, visible to /query.

 Here's what seems my relevant bit of solrconfig.xml. My URP only
 implements processAdd.

    <updateRequestProcessorChain name="RNI">
      <!-- some day, add parameters when we have some -->
      <processor class="com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory"/>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.DistributedUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

      <!-- activate RNI processing by adding the RNI URP to the chain
 for xml updates -->
    <requestHandler name="/update"
                    class="solr.XmlUpdateRequestHandler">
      <lst name="defaults">
        <str name="update.chain">RNI</str>
      </lst>
      </requestHandler>




 - Mark Miller
 lucidimagination.com













Re: I've broken delete in SolrCloud and I'm a bit clueless as to how

2012-04-12 Thread Benson Margulies
I'm probably confused, but it seems to me that the case I hit does not
meet any of Yonik's criteria.

I have no replicas. I'm running SolrCloud in the simple mode where
each doc ends up in exactly one place.

I think that it's just a bug that the code refuses to do the local
deletion when there's no version info.

However, if I am confused, it sure seems like a candidate for the 'at
least throw instead of failing silently' policy.


Re: I've broken delete in SolrCloud and I'm a bit clueless as to how

2012-04-12 Thread Benson Margulies
On Thu, Apr 12, 2012 at 2:14 PM, Mark Miller markrmil...@gmail.com wrote:
 Google must not have found it - I put that in a month or so ago, I believe -
 at least weeks. As you can see, there is still a bit to fill in, but it
 covers the high level. I'd like to add example snippets for the rest soon.

Mark, is it all true? I don't have an update log or a replication
handler, and neither does the default, and it all works fine in the
simple case from the top of that wiki page.


Re: I've broken delete in SolrCloud and I'm a bit clueless as to how

2012-04-11 Thread Benson Margulies
See https://issues.apache.org/jira/browse/SOLR-3347. I can replace the
solrconfig.xml with the vanilla solrconfig.xml and the problem
remains.

On Wed, Apr 11, 2012 at 6:35 AM, Darren Govoni dar...@ontrenet.com wrote:
 Hard to say why its not working for you. Start with a fresh Solr and
 work forward from there or back out your configs and plugins until it
 works again.

 On Tue, 2012-04-10 at 17:15 -0400, Benson Margulies wrote:
 In my cloud configuration, if I push

 <delete>
   <query>*:*</query>
 </delete>

 followed by:

 <commit/>

 I get no errors, the log looks happy enough, but the documents remain
 in the index, visible to /query.

 Here's what seems my relevant bit of solrconfig.xml. My URP only
 implements processAdd.

    <updateRequestProcessorChain name="RNI">
      <!-- some day, add parameters when we have some -->
      <processor class="com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory"/>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.DistributedUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

      <!-- activate RNI processing by adding the RNI URP to the chain
 for xml updates -->
    <requestHandler name="/update"
                    class="solr.XmlUpdateRequestHandler">
      <lst name="defaults">
        <str name="update.chain">RNI</str>
      </lst>
      </requestHandler>





Re: I've broken delete in SolrCloud and I'm a bit clueless as to how

2012-04-11 Thread Benson Margulies
I didn't have a _version_ field, since nothing in the schema says that
it's required!

On Wed, Apr 11, 2012 at 6:35 AM, Darren Govoni dar...@ontrenet.com wrote:
 Hard to say why its not working for you. Start with a fresh Solr and
 work forward from there or back out your configs and plugins until it
 works again.

 On Tue, 2012-04-10 at 17:15 -0400, Benson Margulies wrote:
 In my cloud configuration, if I push

 <delete>
   <query>*:*</query>
 </delete>

 followed by:

 <commit/>

 I get no errors, the log looks happy enough, but the documents remain
 in the index, visible to /query.

 Here's what seems my relevant bit of solrconfig.xml. My URP only
 implements processAdd.

    <updateRequestProcessorChain name="RNI">
      <!-- some day, add parameters when we have some -->
      <processor class="com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory"/>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.DistributedUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

      <!-- activate RNI processing by adding the RNI URP to the chain
 for xml updates -->
    <requestHandler name="/update"
                    class="solr.XmlUpdateRequestHandler">
      <lst name="defaults">
        <str name="update.chain">RNI</str>
      </lst>
      </requestHandler>





Re: Default qt on SolrCloud

2012-04-11 Thread Benson Margulies
On Wed, Apr 11, 2012 at 11:19 AM, Erick Erickson
erickerick...@gmail.com wrote:
 What does your query request handler look like? By adding qt=standard
 you're specifying the standard request handler, whereas your
 ...solr/query?q=*:* format goes at the request handler you named
 query which presumably you've defined in solrconfig.xml...

 What does debugQuery=on show?


It turned out that I had left an extra(eous) declaration for /query
with my custom RT, and when I removed it all was well.

thanks, benson



 Best
 Erick

 On Tue, Apr 10, 2012 at 12:31 PM, Benson Margulies
 bimargul...@gmail.com wrote:
 After I load documents into my cloud instance, a URL like:

 http://localhost:PORT/solr/query?q=*:*

 finds nothing.

 http://localhost:PORT/solr/query?q=*:*qt=standard

 finds everything.

 My custom request handlers have 'default=false'.

 What have I done?


Re: SolrCloud versus a SearchComponent that rescores

2012-04-10 Thread Benson Margulies
On Mon, Apr 9, 2012 at 9:36 PM, Mark Miller markrmil...@gmail.com wrote:
 Yeah, that's how it works - it ends up hitting the select request handler
 (this might be overridable with shards.qt). All the params are passed along,
 so in general, it will act the same as the top level req handler - but it can
 then remove the shards param so you don't have an infinite recursion of
 distrib requests (say in the case you put shards in the req handler in
 solrconfig).

 I think you have to investigate shards.qt
 Or look at adding those components to the std select handler as well.

Thanks.



 Sent from my iPhone

 On Apr 9, 2012, at 9:26 PM, Benson Margulies bimargul...@gmail.com wrote:

 Um, maybe I've hit a quirk?

 In my solrconfig.xml, my special SearchComponents are installed only
 for a specific QT. So, it looks to me as if that QT is not propagated
 into the request out to the shards, and so they run the ordinary
 request handler without my components in it.

 Is this intended behavior I have to tweak via a distribution-aware
 component, or perhaps a bug, or does it make no sense at all and I
 need to look for some mistake of mine?


Re: SolrCloud versus a SearchComponent that rescores

2012-04-10 Thread Benson Margulies
Another thought: currently I'm using qt=ME to indicate this process. I
could, in theory, use some ME=true and make my components check for it
to avoid this process, but it seems kind of peculiar from an end-user
standpoint.


Re: SolrCloud versus a SearchComponent that rescores

2012-04-10 Thread Benson Margulies
I've updated the doc with my findings. Thanks for the pointer.


URP's versus Cloud

2012-04-10 Thread Benson Margulies
How are URP's managed with respect to cloud deployment? Given some
solrconfig.xml like the below, do I expect it to be in the chain on
the leader, the shards, or both?

   <updateRequestProcessorChain name="RNI">
     <!-- some day, add parameters when we have some -->
     <processor class="com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory"/>
     <processor class="solr.LogUpdateProcessorFactory" />
     <processor class="solr.DistributedUpdateProcessorFactory"/>
     <processor class="solr.RunUpdateProcessorFactory" />
   </updateRequestProcessorChain>

   <!-- activate RNI processing by adding the RNI URP to the chain
   for xml updates -->
   <requestHandler name="/update"
                   class="solr.XmlUpdateRequestHandler">
     <lst name="defaults">
       <str name="update.chain">RNI</str>
     </lst>
   </requestHandler>


Re: URP's versus Cloud

2012-04-10 Thread Benson Margulies
On Tue, Apr 10, 2012 at 1:08 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
 In this case on each node, order matters. If you, for example, define a
 standard SignatureUpdateProcessorFactory before the
 DistributedUpdateProcessorFactory you will end up with multiple values for
 the signature field.

That seems to imply that 'before' processors run both on the leader
and on the shards. Where do the afters run? Just on the leader or just
on the shards?



 On Tue, 10 Apr 2012 12:43:36 -0400, Benson Margulies bimargul...@gmail.com
 wrote:

 How are URP's managed with respect to cloud deployment? Given some
 solrconfig.xml like the below, do I expect it to be in the chain on
 the leader, the shards, or both?

   <updateRequestProcessorChain name="RNI">
     <!-- some day, add parameters when we have some -->
     <processor class="com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory"/>
     <processor class="solr.LogUpdateProcessorFactory" />
     <processor class="solr.DistributedUpdateProcessorFactory"/>
     <processor class="solr.RunUpdateProcessorFactory" />
   </updateRequestProcessorChain>

     <!-- activate RNI processing by adding the RNI URP to the chain
 for xml updates -->
   <requestHandler name="/update"
                   class="solr.XmlUpdateRequestHandler">
     <lst name="defaults">
       <str name="update.chain">RNI</str>
     </lst>
     </requestHandler>


 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536600 / 06-50258350


Default qt on SolrCloud

2012-04-10 Thread Benson Margulies
After I load documents into my cloud instance, a URL like:

http://localhost:PORT/solr/query?q=*:*

finds nothing.

http://localhost:PORT/solr/query?q=*:*qt=standard

finds everything.

My custom request handlers have 'default=false'.

What have I done?


I've broken delete in SolrCloud and I'm a bit clueless as to how

2012-04-10 Thread Benson Margulies
In my cloud configuration, if I push

<delete>
  <query>*:*</query>
</delete>

followed by:

<commit/>

I get no errors, the log looks happy enough, but the documents remain
in the index, visible to /query.

Here's what seems my relevant bit of solrconfig.xml. My URP only
implements processAdd.

   <updateRequestProcessorChain name="RNI">
     <!-- some day, add parameters when we have some -->
     <processor class="com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory"/>
     <processor class="solr.LogUpdateProcessorFactory" />
     <processor class="solr.DistributedUpdateProcessorFactory"/>
     <processor class="solr.RunUpdateProcessorFactory" />
   </updateRequestProcessorChain>

   <!-- activate RNI processing by adding the RNI URP to the chain
   for xml updates -->
   <requestHandler name="/update"
                   class="solr.XmlUpdateRequestHandler">
     <lst name="defaults">
       <str name="update.chain">RNI</str>
     </lst>
   </requestHandler>


Cloud-aware request processing?

2012-04-09 Thread Benson Margulies
I'm working on a prototype of a scheme that uses SolrCloud to, in
effect, distribute a computation by running it inside of a request
processor.

If there are N shards and M operations, I want each node to perform
M/N operations. That, of course, implies that I know N.

Is that fact available anyplace inside Solr, or do I need to just configure it?
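
One illustrative route to N, assuming the ZooKeeper-aware pieces (these
names were still moving around in the 4.0 snapshots, so treat this as a
sketch rather than gospel):

    ZkController zk = req.getCore().getCoreDescriptor()
            .getCoreContainer().getZkController();   // null unless in cloud mode
    int numShards = zk.getCloudState().getSlices("collection1").size();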


'No JSP support' error in embedded Jetty for solrCloud as of apache-solr-4.0-2012-04-02_11-54-55

2012-04-09 Thread Benson Margulies
Starting the leader with:

 java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=rnicloud
-DzkRun -DnumShards=3 -Djetty.port=9167  -jar start.jar

and browsing to

http://localhost:9167/solr/rnicloud/admin/zookeeper.jsp

I get:

HTTP ERROR 500

Problem accessing /solr/rnicloud/admin/zookeeper.jsp. Reason:

JSP support not configured
Powered by Jetty://


Re: Cloud-aware request processing?

2012-04-09 Thread Benson Margulies
 Jan Høydahl,

My problem is intimately connected to Solr. it is not a batch job for
hadoop, it is a distributed real-time query scheme. I hate to add yet
another complex framework if a Solr RP can do the job simply.

For this problem, I can transform a Solr query into a subset query on
each shard, and then let the SolrCloud mechanism combine the shard results.

I am well aware of the 'zoo' of alternatives, and I will be evaluating
them if I can't get what I want from Solr.

On Mon, Apr 9, 2012 at 9:34 AM, Jan Høydahl jan@cominvent.com wrote:
 Hi,

 Instead of using Solr, you may want to have a look at Hadoop or another 
 framework for distributed computation, see e.g. 
 http://java.dzone.com/articles/comparison-gridcloud-computing

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Solr Training - www.solrtraining.com

 On 9. apr. 2012, at 13:41, Benson Margulies wrote:

 I'm working on a prototype of a scheme that uses SolrCloud to, in
 effect, distribute a computation by running it inside of a request
 processor.

 If there are N shards and M operations, I want each node to perform
 M/N operations. That, of course, implies that I know N.

 Is that fact available anyplace inside Solr, or do I need to just configure 
 it?



Is http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster up to date?

2012-04-09 Thread Benson Margulies
I specify -Dcollection.configName=rnicloud, but the admin gui tells me
that I have a collection named 'collection1'.

And, as reported in a prior email, the admin UI URL in there seems wrong.


Re: Re: Cloud-aware request processing?

2012-04-09 Thread Benson Margulies
On Mon, Apr 9, 2012 at 9:50 AM, Darren Govoni ontre...@ontrenet.com wrote:
 ...it is a distributed real-time query scheme...

 SolrCloud does this already. It treats all the shards like one-big-index,
 and you can query it normally to get subset results from each shard. Why
 do you have to re-write the query for each shard? Seems unnecessary.

For reasons described in previous email that I won't repeat here.


 --- Original Message ---
 On 4/9/2012 08:45 AM Benson Margulies wrote:
  Jan Høydahl,
 
  My problem is intimately connected to Solr. it is not a batch job for
  hadoop, it is a distributed real-time query scheme. I hate to add yet
  another complex framework if a Solr RP can do the job simply.
 
  For this problem, I can transform a Solr query into a subset query on
  each shard, and then let the SolrCloud mechanism combine the shard results.
 
  I am well aware of the 'zoo' of alternatives, and I will be evaluating
  them if I can't get what I want from Solr.
 
  On Mon, Apr 9, 2012 at 9:34 AM, Jan Høydahl jan@cominvent.com
  wrote:
   Hi,
  
   Instead of using Solr, you may want to have a look at Hadoop or
   another framework for distributed computation, see e.g.
   http://java.dzone.com/articles/comparison-gridcloud-computing
  
   --
   Jan Høydahl, search solution architect
   Cominvent AS - www.cominvent.com
   Solr Training - www.solrtraining.com
  
   On 9. apr. 2012, at 13:41, Benson Margulies wrote:
  
   I'm working on a prototype of a scheme that uses SolrCloud to, in
   effect, distribute a computation by running it inside of a request
   processor.
  
   If there are N shards and M operations, I want each node to perform
   M/N operations. That, of course, implies that I know N.
  
   Is that fact available anyplace inside Solr, or do I need to just
   configure it?


Stumped on using a custom update request processor with SolrCloud

2012-04-09 Thread Benson Margulies
If you would be so kind as to look at
https://issues.apache.org/jira/browse/SOLR-3342, you will see that I
tried to use a working configuration for a URP of mine with SolrCloud,
and received in return an NPE.

Somehow or another, by default, the XmlUpdateRequestHandler ends up
using (I think) the PeerSync class to establish the indexibleId. When
I add in my URP, I am somehow turning this off, and I'm currently
stumped as to how to turn it back on.

If you don't care to read the JIRA, my relevant configuration is right
here. Is there something else I need in the 'defaults' list, or some
other processor I need to put in my chain?

   <updateRequestProcessorChain name="RNI">
     <!-- some day, add parameters when we have some -->
     <processor class="com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory"/>
     <processor class="solr.LogUpdateProcessorFactory" />
     <processor class="solr.RunUpdateProcessorFactory" />
   </updateRequestProcessorChain>

   <!-- activate RNI processing by adding the RNI URP to the chain
   for xml updates -->
   <requestHandler name="/update"
                   class="solr.XmlUpdateRequestHandler">
     <lst name="defaults">
       <str name="update.chain">RNI</str>
     </lst>
   </requestHandler>
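
(For the archives: the configuration that eventually worked, visible in the
follow-up messages elsewhere in this digest, adds the distributed processor
to the chain explicitly:)

   <updateRequestProcessorChain name="RNI">
     <processor class="com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory"/>
     <processor class="solr.LogUpdateProcessorFactory" />
     <processor class="solr.DistributedUpdateProcessorFactory"/>
     <processor class="solr.RunUpdateProcessorFactory" />
   </updateRequestProcessorChain>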


SolrCloud versus a SearchComponent that rescores

2012-04-09 Thread Benson Margulies
Those of you insomniacs who have read my messages here over the last
few weeks might recall that I've been working on a request handler
that wraps the SearchHandler to rewrite queries and then reorder
results.

(I haven't quite worked out how to apply Grant's alternative
suggestions without losing the performance advantages I was looking
for in the first place.)

Today, I realized that the RequestHandler approach, as opposed to
search components, wasn't going to be viable. I was growing too much
dependency on internal Solr quirks.

So I refactored it into a pair of SearchComponents -- one to go first
and rewrite the query, and one to go after query and rescore.

And it works just fine - until I configure it into a SolrCloud
cluster. At which point it started coming up with very wrong answers.

I think that the reason is that I don't have an implementation of the
distributedProcess method, or, more generally, that I don't understand
the protocol on a SearchComponent when distributed processing is
happening. Has anyone written anything yet about these considerations?
I can put multiple processes in the debugging and see who gets called
with what, but I was hoping for some sort of short cut.


Re: SolrCloud versus a SearchComponent that rescores

2012-04-09 Thread Benson Margulies
That page seems to be saying that the 'distributed' APIs take place on
the leader, and the ordinary prepare/process APIs out at the leaves.
I'll set out to prove or disprove that tomorrow.


On Mon, Apr 9, 2012 at 8:17 PM, Mark Miller markrmil...@gmail.com wrote:

 On Apr 9, 2012, at 7:34 PM, Benson Margulies wrote:

 Those of you insomniacs who have read my messages here over the last
 few weeks might recall that I've been working on a request handler
 that wraps the SearchHandler to rewrite queries and then reorder
 results.

 (I haven't quite worked out how to apply Grant's alternative
 suggestions without losing the performance advantages I was looking
 for in the first place.)

 Today, I realized that the RequestHandler approach, as opposed to
 search components, wasn't going to be viable. I was growing too much
 dependency on internal Solr quirks.

 So I refactored it into a pair of SearchComponents -- one to go first
 and rewrite the query, and one to go after query and rescore.

 And it works just fine - until I configure it into a SolrCloud
 cluster. At which point it started coming up with very wrong answers.

 I think that the reason is that I don't have an implementation of the
 distributedProcess method, or, more generally, that I don't understand
 the protocol on a SearchComponent when distributed processing is
 happening. Has anyone written anything yet about these considerations?
 I can put multiple processes in the debugging and see who gets called
 with what, but I was hoping for some sort of short cut.



 Grant started something on this once: 
 http://wiki.apache.org/solr/WritingDistributedSearchComponents
 It's only a start though.

 Unfortunately, to this point, adventurous souls have had to debug and study
 their way to understanding the distrib process solo, mostly.

 Perhaps we can encourage anyone that has written a distributed component to 
 help jump in on that wiki page. Any takers?

 - Mark Miller
 lucidimagination.com













Re: SolrCloud versus a SearchComponent that rescores

2012-04-09 Thread Benson Margulies
Um, maybe I've hit a quirk?

In my solrconfig.xml, my special SearchComponents are installed only
for a specific QT. So, it looks to me as if that QT is not propagated
into the request out to the shards, and so they run the ordinary
request handler without my components in it.

Is this intended behavior I have to tweak via a distribution-aware
component, or perhaps a bug, or does it make no sense at all and I
need to look for some mistake of mine?


A curious request about a curious request handler

2012-04-03 Thread Benson Margulies
I've made a RequestHandler class that acts as follows:

1. At its initialization, it creates a StandardRequestHandler and hangs onto it.
2. When a query comes to it (I configure it to a custom qt value), it:
  a. creates a new query based on the query that arrived
  b. creates a LocalSolrQueryRequest for the current core and a param
set containing the derived query
  c. runs this request through the SearchHandler
  d. uses a searcher to retrieve all the docs
  e. rescores/reorders them using our code
  f. attaches the result of this process to the response. The
rescoring code creates a DocSlice containing the usual items, and that
becomes the response.

By and large, this works, but I've a few points of mystification at
hand, and I'd be most grateful for some illumination.

1. Is there any reason to pass FL to the inner query?
StandardRequestHandler.handleRequest never seems to fill fields, since
that's a response-write job anyway.

2. My 'rescoring' operates on the assumption that the entire relevancy
is determined by the output of our code. If we wanted to combine our
ranking (which produces similarity numbers between 0 and 1) with
ordinary scores, any advice on scaling? (We'd want to subtract out the
contribution of the initial search on our funny fields, and then
combine in our rescoring value instead).

3. In 3.5.0, the admin gui never shows me scores other than 0 when I
fire this up. SolrJ works just fine, scores come back, but with the
admin gui, if I put my special QT in the query type and include score
in the field list, the value displayed is always 0. I'd be grateful
for any clue as to what I might have gummed up.
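
On question 2, one common back-of-the-envelope blend (illustrative only,
not a recommendation): min-max rescale the Lucene scores over the
candidate set, then mix linearly with our [0,1] similarity:

    static float blend(float ourSim, float lucene,
                       float min, float max, float alpha) {
        float rescaled = (max == min) ? 0f : (lucene - min) / (max - min);
        return alpha * ourSim + (1f - alpha) * rescaled;
    }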


Re: A curious request about a curious request handler

2012-04-03 Thread Benson Margulies
On Tue, Apr 3, 2012 at 12:27 PM, Grant Ingersoll gsing...@apache.org wrote:

 On Apr 3, 2012, at 9:43 AM, Benson Margulies wrote:

 I've made a RequestHandler class that acts as follows:

 1. At its initialization, it creates a StandardRequestHandler and hangs onto 
 it.
 2. When a query comes to it (I configure it to a custom qt value), it:
  a. creates a new query based on the query that arrived
  b. creates a LocalSolrQueryRequest for the current core and a param
 set containing the derived query
  c. runs this request through the SearchHandler
  d. uses a searcher to retrieve all the docs
  e. rescores/reorders them using our code
  f. attaches the result of this process to the response. The
 rescoring code creates a DocSlice containing the usual items, and that
 becomes response.


 Couldn't you just implement a Function (that calls your code) and use sort by 
 function and/or use that value as part of the broader match?  Lot less moving 
 parts, etc.

I don't know. Feel free to point me at docs at any point, but here are
the questions that spring to mind:

Starting with something in 'q' like:

   bt_rni_name:"Mortimer Q Snerd" bt_rni_Name_Language:eng

code of mine eats those two fields (in some sense, pseudo-fields), and
spits out many other fields that we actually want to query on.

Then, when the results come back, a whole slew of other fields are
used to calculate the 'real' score.

Do Functions do that?




 -Grant


Re: A curious request about a curious request handler

2012-04-03 Thread Benson Margulies
Grant, let me see if I can expand this, as it were:

{!benson f1:v1 f2:v2 f3:v3} (or do I mean {!query defType='benson' ...}?)

I see how that could expand to be anything else I like.

However, the Function side has me a little more puzzled.

The information from the fields inside my {! ... } gets turned into an
object, and that object goes into the code that scores a document
based on the values of a small slew of other fields, and it's too costly to
reconstruct for each result. I'm thinking that this still calls for a
request handler just to hold the state, but perhaps I'm missing
something?

Thanks for the help,
benson

