Re: Issue with large html indexing

2013-10-24 Thread Raheel Hasan
ok. see this:
http://s23.postimg.org/yck2s5k1n/html_indexing.png



On Wed, Oct 23, 2013 at 10:45 PM, Erick Erickson erickerick...@gmail.com wrote:

 Attachments and images are often eaten by the mail server, your image is
 not visible at least to me. Can you describe what you're seeing? Or post
 the image somewhere and provide a link?

 Best,
 Erick


 On Wed, Oct 23, 2013 at 11:07 AM, Raheel Hasan raheelhasan@gmail.com
 wrote:

  Hi,
 
  I have an issue here while indexing large HTML. Here is the configuration
  for that:
 
  1) Data is imported via URLDataSource / PlainTextEntityProcessor (DIH)
 
  2) Schema has this for the field:
  type="text_en_splitting" indexed="true" stored="false" required="false"
 
  3) text_en_splitting has the following work done for indexing (a rough schema.xml sketch of such an analyzer follows the list):
  HTMLStripCharFilterFactory
  WhitespaceTokenizerFactory (create tokens)
  StopFilterFactory
  WordDelimiterFilterFactory
  ICUFoldingFilterFactory
  PorterStemFilterFactory
  RemoveDuplicatesTokenFilterFactory
  LengthFilterFactory
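
  (For reference, a minimal schema.xml sketch of a field type wired this way;
  the attribute values here are illustrative assumptions, not the poster's
  exact config:

  <fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      <filter class="solr.LengthFilterFactory" min="2" max="100"/>
    </analyzer>
  </fieldType>)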
 
  However, the indexed data is like this (as in the attached image):
  [image: Inline image 1]
 
 
  so what are these numbers?
  If I index a small HTML file, it works fine, but as the size of the HTML file
  increases, this is what happens...
 
  --
  Regards,
  Raheel Hasan
 




-- 
Regards,
Raheel Hasan


Re: Minor bug with CloudSolrServer and collection-alias.

2013-10-24 Thread Thomas Egense
Thanks to both of you for fixing the bug. Impressive response time for the
fix (7 hours).

Thomas Egense


On Wed, Oct 23, 2013 at 7:16 PM, Mark Miller markrmil...@gmail.com wrote:

 I filed https://issues.apache.org/jira/browse/SOLR-5380 and just
 committed a fix.

 - Mark

 On Oct 23, 2013, at 11:15 AM, Shawn Heisey s...@elyograg.org wrote:

  On 10/23/2013 3:59 AM, Thomas Egense wrote:
   Using cloudSolrServer.setDefaultCollection(collectionId) does not work as
   intended for an alias spanning more than 1 collection.
   The virtual collection-alias collectionID is recognized as an existing
   collection, but it only queries one of the collections it is mapped to.
 
   You can confirm this easily in AliasIntegrationTest.
 
   The test class AliasIntegrationTest creates two cores with 2 and 3 different
   documents, and then creates an alias pointing to both of them.
 
   Line 153:
      // search with new cloud client
      CloudSolrServer cloudSolrServer = new
          CloudSolrServer(zkServer.getZkAddress(), random().nextBoolean());
      cloudSolrServer.setParallelUpdates(random().nextBoolean());
      query = new SolrQuery("*:*");
      query.set("collection", "testalias");
      res = cloudSolrServer.query(query);
      cloudSolrServer.shutdown();
      assertEquals(5, res.getResults().getNumFound());
 
   No unit-test bug here; however, if you change it so the collection id is
   set on CloudSolrServer instead of on the query, it will produce the bug:

      // search with new cloud client
      CloudSolrServer cloudSolrServer = new
          CloudSolrServer(zkServer.getZkAddress(), random().nextBoolean());
      cloudSolrServer.setDefaultCollection("testalias");
      cloudSolrServer.setParallelUpdates(random().nextBoolean());
      query = new SolrQuery("*:*");
      // query.set("collection", "testalias");
      res = cloudSolrServer.query(query);
      cloudSolrServer.shutdown();
      assertEquals(5, res.getResults().getNumFound());  // <-- assertion failure
 
  Should I create a Jira issue for this?
 
  Thomas,
 
  I have confirmed this with the following test patch, which adds to the
  test rather than changing what's already there:
 
  http://apaste.info/9ke5
 
  I'm about to head off to the train station to start my commute, so I
  will be unavailable for a little while.  If you haven't gotten the jira
  filed by the time I get to another computer, I will create it.
 
  Thanks,
  Shawn
 




RE: New shard leaders or existing shard replicas depends on zookeeper?

2013-10-24 Thread Hoggarth, Gil
Absolutely, the scenario I'm seeing does _sound_ like I've not specified
the number of shards, but I think I have - the evidence is:
- -DnumShards=24 defined within the /etc/sysconfig/solrnode* files

- -DnumShards=24 seen on each 'ps' line (two nodes listed here):
 tomcat   26135 1  5 09:51 ?  00:00:22 /opt/java/bin/java
 -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode1/conf/logging.properties
 -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
 -Xms512m -Xmx5120m -Dsolr.solr.home=/opt/solrnode1 -Duser.language=en
 -Duser.country=uk -Dbootstrap_confdir=/opt/solrnode1/ldwa01/conf
 -Dcollection.configName=ldwa01cfg -DnumShards=24
 -Dsolr.data.dir=/opt/data/solrnode1/ldwa01/data
 -DzkHost=zk01.solr.wa.bl.uk:9983,zk02.solr.wa.bl.uk:9983,zk03.solr.wa.bl.uk:9983
 -Djava.endorsed.dirs=/opt/tomcat/endorsed
 -classpath /opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/tomcat-juli.jar
 -Dcatalina.base=/opt/tomcat_instances/solrnode1 -Dcatalina.home=/opt/tomcat
 -Djava.io.tmpdir=/opt/tomcat_instances/solrnode1/tmp
 org.apache.catalina.startup.Bootstrap start
 tomcat   26225 1  5 09:51 ?  00:00:19 /opt/java/bin/java
 -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode2/conf/logging.properties
 -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
 -Xms512m -Xmx5120m -Dsolr.solr.home=/opt/solrnode2 -Duser.language=en
 -Duser.country=uk -Dbootstrap_confdir=/opt/solrnode2/ldwa01/conf
 -Dcollection.configName=ldwa01cfg -DnumShards=24
 -Dsolr.data.dir=/opt/data/solrnode2/ldwa01/data
 -DzkHost=zk01.solr.wa.bl.uk:9983,zk02.solr.wa.bl.uk:9983,zk03.solr.wa.bl.uk:9983
 -Djava.endorsed.dirs=/opt/tomcat/endorsed
 -classpath /opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/tomcat-juli.jar
 -Dcatalina.base=/opt/tomcat_instances/solrnode2 -Dcatalina.home=/opt/tomcat
 -Djava.io.tmpdir=/opt/tomcat_instances/solrnode2/tmp
 org.apache.catalina.startup.Bootstrap start

- The Solr node dashboard shows -DnumShards=24 in its list of Args for
each node

And yet, the ldwa01 nodes are leader and replica of shard 17 and there
are no other shard leaders created. Plus, if I only change the ZK
ensemble declarations in /etc/sysconfig/solrnode* to the different dev ZK
servers, all 24 leaders are created before any replicas are added.

I can also mention, when I browse the Cloud view, I can see both the
ldwa01 collection and the ukdomain collection listed, suggesting that
this information comes from the ZKs - I assume this is as expected.
Plus, the correct node addresses (e.g., 192.168.45.17:8984) are listed
for ldwa01 but these addresses are also listed as 'Down' in the ukdomain
collection (except for :8983 which only shows in the ldwa01 collection).

Any help very gratefully received.
Gil

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 23 October 2013 18:50
To: solr-user@lucene.apache.org
Subject: Re: New shard leaders or existing shard replicas depends on
zookeeper?

My first impulse would be to ask how you created the collection. It sure
_sounds_ like you didn't specify 24 shards and thus have only a single
shard, one leader and 23 replicas

bq: ...to point to the zookeeper ensemble also used for the ukdomain
collection...

so my guess is that this ZK ensemble has the ldwa01 collection defined
as having only one shard

I admit I pretty much skimmed your post though...

Best,
Erick


On Wed, Oct 23, 2013 at 12:54 PM, Hoggarth, Gil gil.hogga...@bl.uk
wrote:

 Hi solr-users,



 I'm seeing some confusing behaviour in Solr/zookeeper and hope you can

 shed some light on what's happening/how I can correct it.



 We have two physical servers running automated builds of RedHat 6.4 
 and Solr 4.4.0 that host two separate Solr services. The first server 
 (called ld01) has 24 shards and hosts a collection called 'ukdomain'; 
 the second server (ld02) also has 24 shards and hosts a different 
 collection called 'ldwa01'. It's evidently important to note that 
 previously both of these physical servers provided the 'ukdomain'
 collection, but the 'ldwa01' server has been rebuilt for the new 
 collection.



 When I start the ldwa01 solr nodes with their zookeeper configuration 
 (defined in /etc/sysconfig/solrnode* and with collection.configName as
 'ldwa01cfg') pointing to the development zookeeper ensemble, all nodes

 initially become shard leaders and then replicas as I'd expect. But if

 I change the ldwa01 solr nodes to point to the zookeeper ensemble also

 used for the ukdomain collection, all ldwa01 solr nodes start on the 
 same shard (that is, the first ldwa01 solr node becomes the shard 
 leader, then every other solr node becomes a replica for this shard). 
 The significant point here is no other ldwa01 shards gain leaders (or
replicas).



 The ukdomain collection uses a zookeeper collection.configName of 
 'ukdomaincfg', and prior to the creation of this ldwa01 service the 
 collection.configName of 'ldwa01cfg' has never previously been used.
So 
 I'm 

Re: Terms function join with a Select function ?

2013-10-24 Thread Bruno Mannina

Dear All,

Ok, I have an answer concerning the first question (limit):
it's the terms.limit parameter.
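
(For example, something like this should return the top 50 terms; the field
name and the /terms handler mapping are assumptions:

http://localhost:8983/solr/terms?terms.fl=myfield&terms.limit=50)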

But I can't find how to apply a Terms request to a query result

any idea ?

Bruno

On 23/10/2013 23:19, Bruno Mannina wrote:

Dear Solr users,

I use the Terms function to see the term frequency data in a field, but it's
for the whole database.


I have 2 questions:
- Is it possible to increase the number of statistics? Actually I only get
the 10 most frequent terms.


- Is it possible to limit these statistics to the results of a query?

PS: the second question is very important for me.

Many thanks








SolrCloud: optimizing a core triggers optimizations of all cores in that collection?

2013-10-24 Thread michael.boom
Hi!

I have a SolrCloud setup on two servers: 3 shards, replicationFactor=2.
Today I triggered an optimize on core *shard2_replica2*, which only
contained 3M docs and 2.7G.
The sizes of the other shards were shard3=2.7G and shard1=48G (the routing is
implicit, but after some update deadlocks and restarts the shard range in
ZooKeeper got set to null, and everything since then apparently got indexed to
shard1).

So, half an hour after I triggered the optimize via the Admin UI, I
noticed that used space was increasing a lot on *both servers* for cores
*shard1_replica1 and shard1_replica2*.
It was now 67G and increasing. In the end, about 40 minutes after the
start of the operation, shard1 was done optimizing on both servers, leaving
shard1_replica1 and shard1_replica2 at about 33G.

Any idea what is happening, and why the core on which I wanted the
optimize to happen got no optimization, while another shard instead got
optimized on both servers?



-
Thanks,
Michael


Re: Spellcheck with Distributed Search (sharding).

2013-10-24 Thread Luis Cappa Banda
Any idea?


2013/10/23 Luis Cappa Banda luisca...@gmail.com

 More info:

 When executing the query against a single Solr server it works:
 http://solr1:8080/events/data/suggest?q=m&wt=json

 {
   "responseHeader": { "status": 0, "QTime": 1 },
   "response": { "numFound": 0, "start": 0, "docs": [] },
   "spellcheck": {
     "suggestions": [
       "m",
       {
         "numFound": 4,
         "startOffset": 0,
         "endOffset": 1,
         "suggestion": [ "marca", "marcacom", "mis", "mispelotas" ]
       }
     ]
   }
 }


 But when choosing the request handler this way it doesn't:
 http://solr1:8080/events/data/select?qt=/suggest&wt=json&q=*:*




 2013/10/23 Luis Cappa Banda luisca...@gmail.com

 Hello!

 I've been trying to enable spellchecking using sharding, following the
 steps from the Wiki, but I failed. :-( What I do is:

 Solrconfig.xml:

 <searchComponent name="suggest" class="solr.SpellCheckComponent">
   <lst name="spellchecker">
     <str name="name">suggest</str>
     <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
     <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
     <str name="field">suggestion</str>
     <str name="buildOnOptimize">true</str>
   </lst>
 </searchComponent>

 <requestHandler name="/suggest" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="df">suggestion</str>
     <str name="spellcheck">true</str>
     <str name="spellcheck.dictionary">suggest</str>
     <str name="spellcheck.count">10</str>
   </lst>
   <arr name="last-components">
     <str>suggest</str>
   </arr>
 </requestHandler>


 Note: I have two shards (solr1 and solr2) and both have the same
 solrconfig.xml. Also, both indexes were optimized to create the spellchecker
 indexes.

 Query:

 solr1:8080/events/data/select?q=m&qt=/suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data

 Response:

 {
   "responseHeader": {
     "status": 404,
     "QTime": 12,
     "params": {
       "shards": "solr1:8080/events/data,solr2:8080/events/data",
       "shards.qt": "/suggestion",
       "q": "m",
       "wt": "json",
       "qt": "/suggestion"
     }
   },
   "error": {
     "msg": "Server at http://solr1:8080/events/data returned non ok status:404, message:Not Found",
     "code": 404
   }
 }

 More query syntaxes that I used and that don't work:

 http://solr1:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data

 http://solr1:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data


 Any idea of what I'm doing wrong?

 Thank you very much in advance!

 Best regards,

 --
 - Luis Cappa




 --
 - Luis Cappa




-- 
- Luis Cappa


Proposal for new feature, cold replicas, brainstorming

2013-10-24 Thread yriveiro
I've been wondering for some time whether it's possible to have replicas of a
shard synchronized but in a state where they can't accept queries, only updates.

Such a replica, in replication mode, only awakes to accept queries if it's the
last replica alive, and goes back to replication mode when another replica
becomes alive and synchronized.

The motivation of this is simple: I want to have replication, but I don't want
to have n replicas active with full resources allocated (caches and so on).
This is useful in environments where replication is needed but a high query
throughput is not fundamental and resources are limited.

I know that right now it is not possible, but I think it's a feature that
can be implemented in an easy way by creating a new status for shards.

The bottom-line question is: am I the only one with this kind of
requirement? Does a functionality like this make sense?



-
Best regards


Query result caching with custom functions

2013-10-24 Thread Mathias Lux
Hi all!

Got a question on the Solr cache :)

I've written a custom function which is able to provide a distance
based on some DocValues to re-sort result lists. This basically works
great, but we've got the problem that if I don't change the query, only
the function parameters, Solr delivers a cached result without
re-ordering. I turned off caching and, lo and behold, problem solved. But
of course this is not an avenue I want to pursue further, as it doesn't
make sense for a productive system.

Do you have any ideas (beyond fake query modification and turning off
caching) to counteract this?

btw. I'm using Solr 4.4 (so if you are aware of the issue and it has
been resolved in 4.5 I'll port it :) The code I'm using is at
https://bitbucket.org/dermotte/liresolr

regards,
Mathias

-- 
Dr. Mathias Lux
Assistant Professor, Klagenfurt University, Austria
http://tinyurl.com/mlux-itec


Solr subset searching in 100-million document index

2013-10-24 Thread Sandeep Gupta
Hi,

We have a Solr index of around 100 million documents, each document being
given a region id, growing at a rate of about 10 million documents per
month - the average document size being around 10KB of pure text. The total
number of region ids is itself in the range of 2.5 million.

We want to search for a query with a given list of region ids. The number
of region ids in this list is usually around 250-300 (most of the time),
but can be up to 500, with a maximum cap of around 2000 ids in one request.


What is the best way to model such queries besides using an IN param in the
query, or using a Filter FQ in the query? Are there any other faster
methods available?
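
(For concreteness, the filter-query form referred to above would be something
like the following; the field name is an assumption:

fq=region_id:(17 OR 42 OR 131 OR ... up to ~2000 ids))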


If it helps, the index is on a VM with 4 virtual cores and currently has
4GB of Java memory allocated out of 16GB in the machine. The number of
queries does not exceed 1 per minute for now. If needed, we can
throw more hardware at the index - but the index will still be on a
single machine for at least 6 months.

Regards,
Sandeep Gupta


Re: Terms function join with a Select function ?

2013-10-24 Thread Erik Hatcher
That would be called faceting :)

http://wiki.apache.org/solr/SimpleFacetParameters
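
For instance (field name assumed), restricting the term counts to a query's
result set would look something like:

http://localhost:8983/solr/select?q=your+query&rows=0&facet=true&facet.field=myfield&facet.limit=50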




On Oct 24, 2013, at 5:23 AM, Bruno Mannina bmann...@free.fr wrote:

 Dear All,
 
 Ok, I have an answer concerning the first question (limit):
 it's the terms.limit parameter.

 But I can't find how to apply a Terms request to a query result

 any idea?

 Bruno

 On 23/10/2013 23:19, Bruno Mannina wrote:
 Dear Solr users,

 I use the Terms function to see the term frequency data in a field, but it's
 for the whole database.

 I have 2 questions:
 - Is it possible to increase the number of statistics? Actually I only get
 the 10 most frequent terms.

 - Is it possible to limit these statistics to the results of a query?

 PS: the second question is very important for me.
 
 Many thanks
 
 
 
 
 



Basic query process question with fl=id

2013-10-24 Thread Manuel Le Normand
Hi

Any distributed lookup is basically composed of two stages: a first that
collects all the matching documents from every shard, and a second that
fetches additional information about specific ids (i.e., stored fields, term
vectors).

It can be seen in the logs of each shard (isShard=true), where the first
request logs the number of hits received for the query by the
specific shard, and the second contains the ids field (ids=...) for the
additional fetch.
At the end of both I get the total QTime of the query and the total number of
hits.

My question is about the case where only ids are requested (fl=id). This query
should make only one request against a shard, while it actually does both of
them.

Looks like the response builder has to go through these two stages no
matter what is the kind of query.

My questions:
1. Is it normal that the response builder has to go through both stages?
2. Does the first request get internal Lucene docids or the actual
uniqueKey id?
3. For a query as above (fl=id), where is the id read from? Is it fetched from
the stored fields, or the docValues file if it exists? Because if fetched from
the stored fields, a high rows param (say 1000 in my case) would need 1000
lookups, which could badly hurt performance.

Thanks
Manuel


RE: Spellcheck with Distributed Search (sharding).

2013-10-24 Thread Dyer, James
Is it that your request handler is named /suggest but you are setting 
shards.qt to /suggestion ?

James Dyer
Ingram Content Group
(615) 213-4311
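
(If that is the mismatch, the corrected query would presumably look like the
following - a sketch assuming the handler really is named /suggest:

http://solr1:8080/events/data/select?q=m&qt=/suggest&shards.qt=/suggest&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data)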


-Original Message-
From: Luis Cappa Banda [mailto:luisca...@gmail.com] 
Sent: Thursday, October 24, 2013 6:22 AM
To: solr-user@lucene.apache.org
Subject: Re: Spellcheck with Distributed Search (sharding).

Any idea?


2013/10/23 Luis Cappa Banda luisca...@gmail.com

 More info:

 When executing the query against a single Solr server it works:
 http://solr1:8080/events/data/suggest?q=m&wt=json

 {
   "responseHeader": { "status": 0, "QTime": 1 },
   "response": { "numFound": 0, "start": 0, "docs": [] },
   "spellcheck": {
     "suggestions": [
       "m",
       {
         "numFound": 4,
         "startOffset": 0,
         "endOffset": 1,
         "suggestion": [ "marca", "marcacom", "mis", "mispelotas" ]
       }
     ]
   }
 }


 But when choosing the request handler this way it doesn't:
 http://solr1:8080/events/data/select?qt=/suggest&wt=json&q=*:*




 2013/10/23 Luis Cappa Banda luisca...@gmail.com

 Hello!

 I've been trying to enable spellchecking using sharding, following the
 steps from the Wiki, but I failed. :-( What I do is:

 Solrconfig.xml:

 <searchComponent name="suggest" class="solr.SpellCheckComponent">
   <lst name="spellchecker">
     <str name="name">suggest</str>
     <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
     <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
     <str name="field">suggestion</str>
     <str name="buildOnOptimize">true</str>
   </lst>
 </searchComponent>

 <requestHandler name="/suggest" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="df">suggestion</str>
     <str name="spellcheck">true</str>
     <str name="spellcheck.dictionary">suggest</str>
     <str name="spellcheck.count">10</str>
   </lst>
   <arr name="last-components">
     <str>suggest</str>
   </arr>
 </requestHandler>


 Note: I have two shards (solr1 and solr2) and both have the same
 solrconfig.xml. Also, both indexes were optimized to create the spellchecker
 indexes.

 Query:

 solr1:8080/events/data/select?q=m&qt=/suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data

 Response:

 {
   "responseHeader": {
     "status": 404,
     "QTime": 12,
     "params": {
       "shards": "solr1:8080/events/data,solr2:8080/events/data",
       "shards.qt": "/suggestion",
       "q": "m",
       "wt": "json",
       "qt": "/suggestion"
     }
   },
   "error": {
     "msg": "Server at http://solr1:8080/events/data returned non ok status:404, message:Not Found",
     "code": 404
   }
 }

 More query syntaxes that I used and that don't work:

 http://solr1:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data

 http://solr1:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data


 Any idea of what I'm doing wrong?

 Thank you very much in advance!

 Best regards,

 --
 - Luis Cappa




 --
 - Luis Cappa




-- 
- Luis Cappa



Re: Proposal for new feature, cold replicas, brainstorming

2013-10-24 Thread Toke Eskildsen
On Thu, 2013-10-24 at 13:27 +0200, yriveiro wrote:
 The motivation of this is simple: I want to have replication, but I don't want
 to have n replicas active with full resources allocated (caches and so on).
 This is useful in environments where replication is needed but a high query
 throughput is not fundamental and resources are limited.

Coincidentally we recently talked about the exact same setup.

We are looking at sharding a 20 TB index into 20 * 1 TB shards, each
located on their own dedicated physical SSD, which has more than enough
horsepower for our needs. For replication, we have a remote storage
system capable of serving requests for 2-4 shards with acceptable
latency.

Projected performance for the SSD setup is superior (5-10 times) to our
remote storage, so we would like to hit only the SSDs if possible.
Setting up a cloud that issues all requests to the SSD shards unless a
catastrophic failure happens to one of them, and in that case falls back to
the remote storage replica for only that shard, would be perfect.

 I know that right now it is not possible, but I think it's a feature that
 can be implemented in an easy way by creating a new status for shards.

shardIsLastResort=true? On paper it seems like a simple addition, but I
am not familiar enough with the SolrCloud code to guess whether it is easy
to implement.

- Toke Eskildsen, State and University Library, Denmark




Searching on special characters

2013-10-24 Thread johnmunir
Hi,


How should I set up Solr so I can search and get hits on special characters
such as: + - && || ! ( ) { } [ ] ^ " ~ * ? : \


My need is, if a user has text like so:


Doc-#1: (Solr)
Doc-#2: Solr


And if they type (solr), I want a hit on (solr) only in document #1, with the
brackets matching.  And if they type solr, they will get a hit in document #2
only.


An additional nice-to-have is: if they type solr, I want a hit in both
document #1 and #2.


Here is what my current schema.xml looks like:



  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="1" splitOnCaseChange="0" splitOnNumerics="1"
            stemEnglishPossessive="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>



Currently, special characters are being stripped.



Any idea how I can configure Solr to do this?  I'm using Solr 3.6.



Thanks !!


-MJ


Re: Searching on special characters

2013-10-24 Thread Jack Krupansky
Have two or three copies of the text. One field could be a raw string,
boosted heavily for exact match; a second could be text using the keyword
tokenizer but with a lowercase filter, also heavily boosted; and the third
field general, tokenized text with a lower boost. You could also have a copy
that uses the keyword tokenizer to maintain a single token but also applies
a regex filter to strip special characters plus a lowercase filter,
and give that an intermediate boost.
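
A minimal schema.xml sketch of that layout (field and type names here are
assumptions, not a definitive implementation):

<field name="text_raw"     type="string"       indexed="true" stored="false"/>
<field name="text_keyword" type="text_keyword" indexed="true" stored="false"/>
<field name="text_general" type="text_en"      indexed="true" stored="true"/>
<copyField source="text_general" dest="text_raw"/>
<copyField source="text_general" dest="text_keyword"/>

<fieldType name="text_keyword" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Then query with something like (edismax) qf=text_raw^10 text_keyword^5 text_general.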


-- Jack Krupansky

-Original Message- 
From: johnmu...@aol.com

Sent: Thursday, October 24, 2013 9:20 AM
To: solr-user@lucene.apache.org
Subject: Searching on special characters

Hi,


How should I set up Solr so I can search and get hits on special characters
such as: + - && || ! ( ) { } [ ] ^ " ~ * ? : \



My need is, if a user has text like so:


Doc-#1: (Solr)
Doc-#2: Solr


And if they type (solr), I want a hit on (solr) only in document #1, with
the brackets matching.  And if they type solr, they will get a hit in
document #2 only.



An additional nice-to-have is: if they type solr, I want a hit in both
document #1 and #2.



Here is what my current schema.xml looks like:



 <analyzer>
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.StopFilterFactory" ignoreCase="true"
           words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
   <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
           generateNumberParts="1" catenateWords="1" catenateNumbers="1"
           catenateAll="1" splitOnCaseChange="0" splitOnNumerics="1"
           stemEnglishPossessive="1" preserveOriginal="1"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
   <filter class="solr.PorterStemFilterFactory"/>
   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>



Currently, special characters are being stripped.



Any idea how I can configure Solr to do this?  I'm using Solr 3.6.



Thanks !!


-MJ 



Re: Spellcheck with Distributed Search (sharding).

2013-10-24 Thread Luis Cappa Banda
It's just a typo in my mail, sorry about that! The request handler name is
spelled correctly and it still doesn't work.


2013/10/24 Dyer, James james.d...@ingramcontent.com

 Is it that your request handler is named /suggest but you are setting
 shards.qt to /suggestion ?

 James Dyer
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Luis Cappa Banda [mailto:luisca...@gmail.com]
 Sent: Thursday, October 24, 2013 6:22 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Spellcheck with Distributed Search (sharding).

 Any idea?


 2013/10/23 Luis Cappa Banda luisca...@gmail.com

  More info:
 
  When executing the query against a single Solr server it works:
  http://solr1:8080/events/data/suggest?q=m&wt=json
 
  {
    "responseHeader": { "status": 0, "QTime": 1 },
    "response": { "numFound": 0, "start": 0, "docs": [] },
    "spellcheck": {
      "suggestions": [
        "m",
        {
          "numFound": 4,
          "startOffset": 0,
          "endOffset": 1,
          "suggestion": [ "marca", "marcacom", "mis", "mispelotas" ]
        }
      ]
    }
  }
 
 
  But when choosing the request handler this way it doesn't:
  http://solr1:8080/events/data/select?qt=/suggest&wt=json&q=*:*
 
 
 
 
 
  2013/10/23 Luis Cappa Banda luisca...@gmail.com
 
  Hello!
 
  I've been trying to enable spellchecking using sharding, following the
  steps from the Wiki, but I failed. :-( What I do is:
 
  Solrconfig.xml:

  <searchComponent name="suggest" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">suggest</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
      <str name="field">suggestion</str>
      <str name="buildOnOptimize">true</str>
    </lst>
  </searchComponent>

  <requestHandler name="/suggest" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="df">suggestion</str>
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">suggest</str>
      <str name="spellcheck.count">10</str>
    </lst>
    <arr name="last-components">
      <str>suggest</str>
    </arr>
  </requestHandler>
 
 
  Note: I have two shards (solr1 and solr2) and both have the same
  solrconfig.xml. Also, both indexes were optimized to create the
  spellchecker indexes.
 
  Query:

  solr1:8080/events/data/select?q=m&qt=/suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data
 
  Response:

  {
    "responseHeader": {
      "status": 404,
      "QTime": 12,
      "params": {
        "shards": "solr1:8080/events/data,solr2:8080/events/data",
        "shards.qt": "/suggestion",
        "q": "m",
        "wt": "json",
        "qt": "/suggestion"
      }
    },
    "error": {
      "msg": "Server at http://solr1:8080/events/data returned non ok status:404, message:Not Found",
      "code": 404
    }
  }
 
  More query syntaxes that I used and that don't work:

  http://solr1:8080/events/data/select?q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data

  http://solr1:8080/events/data/select?q=*:*&spellcheck.q=m&qt=suggestion&shards.qt=/suggestion&wt=json&shards=solr1:8080/events/data,solr2:8080/events/data
 
 
 
  Any idea of what I'm doing wrong?
 
  Thank you very much in advance!
 
  Best regards,
 
  --
  - Luis Cappa
 
 
 
 
  --
  - Luis Cappa
 



 --
 - Luis Cappa




-- 
- Luis Cappa


Re: Searching on special characters

2013-10-24 Thread johnmunir
I'm not sure what you mean.  Based on what you are saying, is there an example
of how I can set up my schema.xml to get the result I need?


Also, the way I execute a search is using 
http://localhost:8080/solr/select/?q=search-term  Does your solution require 
me to change this?  If so, in what way?


It would be great if all this is documented somewhere, so I won't have to bug 
you guys !!!



--MJ



-Original Message-
From: Jack Krupansky j...@basetechnology.com
To: solr-user solr-user@lucene.apache.org
Sent: Thu, Oct 24, 2013 9:39 am
Subject: Re: Searching on special characters


Have two or three copies of the text. One field could be a raw string,
boosted heavily for exact match; a second could be text using the keyword
tokenizer but with a lowercase filter, also heavily boosted; and the third
field general, tokenized text with a lower boost. You could also have a copy
that uses the keyword tokenizer to maintain a single token but also applies
a regex filter to strip special characters plus a lowercase filter,
and give that an intermediate boost.

-- Jack Krupansky

-Original Message- 
From: johnmu...@aol.com
Sent: Thursday, October 24, 2013 9:20 AM
To: solr-user@lucene.apache.org
Subject: Searching on special characters

Hi,


How should I set up Solr so I can search and get hits on special characters
such as: + - && || ! ( ) { } [ ] ^ " ~ * ? : \


My need is, if a user has text like so:


Doc-#1: (Solr)
Doc-#2: Solr


And if they type (solr), I want a hit on (solr) only in document #1, with
the brackets matching.  And if they type solr, they will get a hit in
document #2 only.


An additional nice-to-have is: if they type solr, I want a hit in both
document #1 and #2.


Here is what my current schema.xml looks like:



  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="1" splitOnCaseChange="0" splitOnNumerics="1"
            stemEnglishPossessive="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>



Currently, special characters are being stripped.



Any idea how I can configure Solr to do this?  I'm using Solr 3.6.



Thanks !!


-MJ 


 



Re: Issue with large html indexing

2013-10-24 Thread Shawn Heisey
On 10/24/2013 2:11 AM, Raheel Hasan wrote:
 ok. see this:
 http://s23.postimg.org/yck2s5k1n/html_indexing.png

A recap.  You said your index analysis chain is this:

HTMLStripCharFilterFactory
WhitespaceTokenizerFactory (create tokens)
StopFilterFactory
WordDelimiterFilterFactory
ICUFoldingFilterFactory
PorterStemFilterFactory
RemoveDuplicatesTokenFilterFactory
LengthFilterFactory

Your picture says you have 1 document, and this field contains 1036
terms. The numbers are likely numbers that are in your HTML document.
You never showed us the input document.  It is likely that the
whitespace tokenizer and/or the WordDelimiter filter are producing these
numbers as standalone tokens.  The tokenizer is pretty easy to
understand - it splits on whitespace.  Please see the following to know
what the options for WordDelimiterFilterFactory will do:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
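
(As an illustration - the input here is an assumption, not the poster's
actual document: after tag stripping, visible text such as
"Revision 2013-10-24, build 1.4.2" yields whitespace tokens like 2013-10-24
and 1.4.2, which WordDelimiterFilterFactory with generateNumberParts="1"
splits into the standalone number tokens 2013, 10, 24, 1, 4, and 2 -
exactly the kind of numbers visible in the screenshot.)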

Thanks,
Shawn



Re: Query result caching with custom functions

2013-10-24 Thread Shawn Heisey
On 10/24/2013 5:35 AM, Mathias Lux wrote:
 I've written a custom function which is able to provide a distance
 based on some DocValues to re-sort result lists. This basically works
 great, but we've got the problem that if I don't change the query, only
 the function parameters, Solr delivers a cached result without
 re-ordering. I turned off caching and, lo and behold, problem solved. But
 of course this is not an avenue I want to pursue further, as it doesn't
 make sense for a productive system.
 
 Do you have any ideas (beyond fake query modification and turning off
 caching) to counteract?
 
 btw. I'm using Solr 4.4 (so if you are aware of the issue and it has
 been resolved in 4.5 I'll port it :) The code I'm using is at
 https://bitbucket.org/dermotte/liresolr

I suspect that the queryResultCache is not paying attention to the fact
that parameters for your plugin have changed.  This probably means that
your plugin must somehow inform the cache check code that something
HAS changed.

How you actually do this is a mystery to me because it involves parts of
the code that are beyond my understanding, but it MIGHT involve making
sure that parameters related to your code are saved as part of the entry
that goes into the cache.

Thanks,
Shawn



Re: Solr subset searching in 100-million document index

2013-10-24 Thread Joel Bernstein
Sandeep,

This type of operation can often be expressed as a PostFilter very
efficiently. This is particularly true if the region id's are integer keys.

Joel
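
A rough sketch of what that could look like (class and field names here are
assumptions, not a definitive implementation; it reads the region id from a
numeric DocValues field and keeps only documents whose id is in the requested
set):

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class RegionIdPostFilter extends ExtendedQueryBase implements PostFilter {
  private final String field;
  private final Set<Long> regionIds;

  public RegionIdPostFilter(String field, Set<Long> regionIds) {
    this.field = field;
    this.regionIds = regionIds;
    setCache(false); // post filters must not be cached
    setCost(100);    // cost >= 100 runs it after the main query and filters
  }

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      private NumericDocValues regionDv;

      @Override
      public void setNextReader(AtomicReaderContext context) throws IOException {
        // grab the per-segment DocValues for the region id field
        regionDv = context.reader().getNumericDocValues(field);
        super.setNextReader(context);
      }

      @Override
      public void collect(int doc) throws IOException {
        // only pass documents whose region id is in the requested set
        if (regionDv != null && regionIds.contains(regionDv.get(doc))) {
          super.collect(doc);
        }
      }
    };
  }
}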

On Thu, Oct 24, 2013 at 7:46 AM, Sandeep Gupta sandy@gmail.com wrote:

 Hi,

 We have a Solr index of around 100 million documents, each document being
 given a region id, growing at a rate of about 10 million documents per
 month - the average document size being around 10KB of pure text. The total
 number of region ids is itself in the range of 2.5 million.

 We want to search for a query with a given list of region ids. The number
 of region ids in this list is usually around 250-300 (most of the time),
 but can be up to 500, with a maximum cap of around 2000 ids in one request.


 What is the best way to model such queries besides using an IN param in the
 query, or using a Filter FQ in the query? Are there any other faster
 methods available?


 If it helps, the index is on a VM with 4 virtual cores and currently has
 4GB of Java memory allocated out of 16GB in the machine. The number of
 queries does not exceed 1 per minute for now. If needed, we can
 throw more hardware at the index - but the index will still be on a
 single machine for at least 6 months.

 Regards,
 Sandeep Gupta




--


Re: Query result caching with custom functions

2013-10-24 Thread Joel Bernstein
Mathias,

I'd have to do a close review of the function sort code to be sure, but I
suspect if you implement the equals() method on the ValueSource it should
solve your caching issue. Also implement hashCode().

Joel
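
A minimal sketch of what that means (class and parameter names here are
assumptions; the point is that equals()/hashCode() must include the function
parameters so the queryResultCache can tell two invocations apart):

import java.io.IOException;
import java.util.Arrays;
import java.util.Map;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.DoubleDocValues;

public class ParamDistanceValueSource extends ValueSource {
  private final String field;
  private final double[] reference; // the function parameters

  public ParamDistanceValueSource(String field, double[] reference) {
    this.field = field;
    this.reference = reference;
  }

  @Override
  public FunctionValues getValues(Map context, AtomicReaderContext readerContext)
      throws IOException {
    return new DoubleDocValues(this) {
      @Override
      public double doubleVal(int doc) {
        // a real implementation would read the DocValues and compute a distance
        return 0d;
      }
    };
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof ParamDistanceValueSource)) return false;
    ParamDistanceValueSource other = (ParamDistanceValueSource) o;
    // include the parameters, not just the class and field
    return field.equals(other.field) && Arrays.equals(reference, other.reference);
  }

  @Override
  public int hashCode() {
    return 31 * field.hashCode() + Arrays.hashCode(reference);
  }

  @Override
  public String description() {
    return "paramdist(" + field + "," + Arrays.toString(reference) + ")";
  }
}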


On Thu, Oct 24, 2013 at 10:35 AM, Shawn Heisey s...@elyograg.org wrote:

 On 10/24/2013 5:35 AM, Mathias Lux wrote:
  I've written a custom function which is able to provide a distance
  based on some DocValues to re-sort result lists. This basically works
  great, but we've got the problem that if I don't change the query, only
  the function parameters, Solr delivers a cached result without
  re-ordering. I turned off caching and, lo and behold, problem solved. But
  of course this is not an avenue I want to pursue further, as it doesn't
  make sense for a productive system.
 
  Do you have any ideas (beyond fake query modification and turning off
  caching) to counteract?
 
  btw. I'm using Solr 4.4 (so if you are aware of the issue and it has
  been resolved in 4.5 I'll port it :) The code I'm using is at
  https://bitbucket.org/dermotte/liresolr

 I suspect that the queryResultCache is not paying attention to the fact
 that parameters for your plugin have changed.  This probably means that
 your plugin must somehow inform the cache check code that something
 HAS changed.

 How you actually do this is a mystery to me because it involves parts of
 the code that are beyond my understanding, but it MIGHT involve making
 sure that parameters related to your code are saved as part of the entry
 that goes into the cache.

 Thanks,
 Shawn




Re: Solr not indexing everything from MongoDB

2013-10-24 Thread Michael Della Bitta
That's typical for an index that receives updates to the same document. Are
you sure your keys are unique?

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions (https://twitter.com/Appinions) | g+: plus.google.com/appinions
w: appinions.com (http://www.appinions.com/)


On Wed, Oct 23, 2013 at 5:57 PM, gohome190 gohome...@gmail.com wrote:

 numFound is 10.
 numDocs is 10, maxDoc is 23.  Yeah, Solr 4.x!

 Thanks!






RE: New shard leaders or existing shard replicas depends on zookeeper?

2013-10-24 Thread Hoggarth, Gil
I think my question is now simpler: I believe the problem below was caused
because the very first startup of the 'ldwa01' collection / 'ldwa01cfg' zk
config name didn't specify the number of shards (and thus defaulted to 1).

So, how can I change the number of shards for an existing collection/zk
config name, especially when the ZK ensemble in question is the
production version and supports other Solr collections that I do not
want to interrupt? (Which I think means I can't just delete
clusterstate.json and restart the ZKs, as this would also lose the other
Solr collections' information.)

Thanks in advance, Gil

-Original Message-
From: Hoggarth, Gil [mailto:gil.hogga...@bl.uk] 
Sent: 24 October 2013 10:13
To: solr-user@lucene.apache.org
Subject: RE: New shard leaders or existing shard replicas depends on
zookeeper?

Absolutely, the scenario I'm seeing does _sound_ like I've not specified
the number of shards, but I think I have - the evidence is:
- -DnumShards=24 defined within the /etc/sysconfig/solrnode* files

- -DnumShards=24 seen on each 'ps' line (two nodes listed here):
 tomcat   26135 1  5 09:51 ?  00:00:22 /opt/java/bin/java
 -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode1/conf/logging.properties
 -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
 -Xms512m -Xmx5120m -Dsolr.solr.home=/opt/solrnode1 -Duser.language=en
 -Duser.country=uk -Dbootstrap_confdir=/opt/solrnode1/ldwa01/conf
 -Dcollection.configName=ldwa01cfg -DnumShards=24
 -Dsolr.data.dir=/opt/data/solrnode1/ldwa01/data
 -DzkHost=zk01.solr.wa.bl.uk:9983,zk02.solr.wa.bl.uk:9983,zk03.solr.wa.bl.uk:9983
 -Djava.endorsed.dirs=/opt/tomcat/endorsed
 -classpath /opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/tomcat-juli.jar
 -Dcatalina.base=/opt/tomcat_instances/solrnode1 -Dcatalina.home=/opt/tomcat
 -Djava.io.tmpdir=/opt/tomcat_instances/solrnode1/tmp
 org.apache.catalina.startup.Bootstrap start
 tomcat   26225 1  5 09:51 ?  00:00:19 /opt/java/bin/java
 -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode2/conf/logging.properties
 -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
 -Xms512m -Xmx5120m -Dsolr.solr.home=/opt/solrnode2 -Duser.language=en
 -Duser.country=uk -Dbootstrap_confdir=/opt/solrnode2/ldwa01/conf
 -Dcollection.configName=ldwa01cfg -DnumShards=24
 -Dsolr.data.dir=/opt/data/solrnode2/ldwa01/data
 -DzkHost=zk01.solr.wa.bl.uk:9983,zk02.solr.wa.bl.uk:9983,zk03.solr.wa.bl.uk:9983
 -Djava.endorsed.dirs=/opt/tomcat/endorsed
 -classpath /opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/tomcat-juli.jar
 -Dcatalina.base=/opt/tomcat_instances/solrnode2 -Dcatalina.home=/opt/tomcat
 -Djava.io.tmpdir=/opt/tomcat_instances/solrnode2/tmp
 org.apache.catalina.startup.Bootstrap start

- The Solr node dashboard shows -DnumShards=24 in its list of Args for
each node

And yet, the ldwa01 nodes are leader and replica of shard 17 and there
are no other shard leaders created. Plus, if I only change the ZK
ensemble declarations in /etc/sysconfig/solrnode* to the different dev ZK
servers, all 24 leaders are created before any replicas are added.

I can also mention, when I browse the Cloud view, I can see both the
ldwa01 collection and the ukdomain collection listed, suggesting that
this information comes from the ZKs - I assume this is as expected.
Plus, the correct node addresses (e.g., 192.168.45.17:8984) are listed
for ldwa01 but these addresses are also listed as 'Down' in the ukdomain
collection (except for :8983 which only shows in the ldwa01 collection).

Any help very gratefully received.
Gil

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 23 October 2013 18:50
To: solr-user@lucene.apache.org
Subject: Re: New shard leaders or existing shard replicas depends on
zookeeper?

My first impulse would be to ask how you created the collection. It sure
_sounds_ like you didn't specify 24 shards and thus have only a single
shard, one leader and 23 replicas

bq: ...to point to the zookeeper ensemble also used for the ukdomain
collection...

so my guess is that this ZK ensemble has the ldwa01 collection defined
as having only one shard

I admit I pretty much skimmed your post though...

Best,
Erick


On Wed, Oct 23, 2013 at 12:54 PM, Hoggarth, Gil gil.hogga...@bl.uk
wrote:

 Hi solr-users,



 I'm seeing some confusing behaviour in Solr/zookeeper and hope you can

 shed some light on what's happening/how I can correct it.



 We have two physical servers running automated builds of RedHat 6.4 
 and Solr 4.4.0 that host two separate Solr services. The first server 
 (called ld01) has 24 shards and hosts a collection called 'ukdomain'; 
 the second server (ld02) also has 24 shards and hosts a different 
 collection called 'ldwa01'. It's evidently important to note that 
 previously both of these physical servers provided the 'ukdomain'
 collection, but the 'ldwa01' server has been rebuilt for the new 
 collection.



 When I start the ldwa01 solr 

Re: New shard leaders or existing shard replicas depends on zookeeper?

2013-10-24 Thread Daniel Collins
Ah yes, I was about to mention that: -DnumShards is only actually used when
the collection is being created for the first time.  After that point (i.e.
once the collection exists in ZK), passing it along the command line is
redundant (Solr won't actually read it).  The preferred mechanism for
creating collections is to use the Collections API, in which case you never
use -DnumShards at all.  Having it on the command line can be confusing
(we've fallen into that trap too!)

The only way to change the number of shards in a collection is to use the
Collections API to split a shard (and currently you can only do that in
steps of 2, so you'd need to go 1 -> 2, 2 -> 4, 4 -> 8, 8 -> 16.  You can't
get from 1 -> 24 as it's not a power of 2 :(   What you want is
https://issues.apache.org/jira/browse/SOLR-5004

Otherwise, you'll need to create a new collection and re-index everything
into that. (A sketch of both routes follows.)
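
For example (host, collection, and shard names are assumed):

# create a new collection with the desired shard count up front
http://host:8983/solr/admin/collections?action=CREATE&name=ldwa01&numShards=24&replicationFactor=2&collection.configName=ldwa01cfg

# or split an existing shard in two
http://host:8983/solr/admin/collections?action=SPLITSHARD&collection=ldwa01&shard=shard1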


On 24 October 2013 16:35, Hoggarth, Gil gil.hogga...@bl.uk wrote:

  I think my question is now simpler: I believe the problem below was caused
  because the very first startup of the 'ldwa01' collection / 'ldwa01cfg' zk
  config name didn't specify the number of shards (and thus defaulted to 1).

 So, how can I change the number of shards for an existing collection/zk
 collection name, especially when the ZK ensemble in question is the
 production version and supporting other Solr collections that I do not
 want to interrupt. (Which I think means that I can't just delete the
 clusterstate.json and restart the ZKs as this will also lose the other
 Solr collection information.)

 Thanks in advance, Gil

 -Original Message-
 From: Hoggarth, Gil [mailto:gil.hogga...@bl.uk]
 Sent: 24 October 2013 10:13
 To: solr-user@lucene.apache.org
 Subject: RE: New shard leaders or existing shard replicas depends on
 zookeeper?

 Absolutely, the scenario I'm seeing does _sound_ like I've not specified
 the number of shards, but I think I have - the evidence is:
 - -DnumShards=24 defined within the /etc/sysconfig/solrnode* files

 - -DnumShards=24 seen on each 'ps' line (two nodes listed here):
  tomcat   26135 1  5 09:51 ?  00:00:22 /opt/java/bin/java
  -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode1/conf/logging.properties
  -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
  -Xms512m -Xmx5120m -Dsolr.solr.home=/opt/solrnode1 -Duser.language=en
  -Duser.country=uk -Dbootstrap_confdir=/opt/solrnode1/ldwa01/conf
  -Dcollection.configName=ldwa01cfg -DnumShards=24
  -Dsolr.data.dir=/opt/data/solrnode1/ldwa01/data
  -DzkHost=zk01.solr.wa.bl.uk:9983,zk02.solr.wa.bl.uk:9983,zk03.solr.wa.bl.uk:9983
  -Djava.endorsed.dirs=/opt/tomcat/endorsed
  -classpath /opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/tomcat-juli.jar
  -Dcatalina.base=/opt/tomcat_instances/solrnode1 -Dcatalina.home=/opt/tomcat
  -Djava.io.tmpdir=/opt/tomcat_instances/solrnode1/tmp
  org.apache.catalina.startup.Bootstrap start
  tomcat   26225 1  5 09:51 ?  00:00:19 /opt/java/bin/java
  -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode2/conf/logging.properties
  -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
  -Xms512m -Xmx5120m -Dsolr.solr.home=/opt/solrnode2 -Duser.language=en
  -Duser.country=uk -Dbootstrap_confdir=/opt/solrnode2/ldwa01/conf
  -Dcollection.configName=ldwa01cfg -DnumShards=24
  -Dsolr.data.dir=/opt/data/solrnode2/ldwa01/data
  -DzkHost=zk01.solr.wa.bl.uk:9983,zk02.solr.wa.bl.uk:9983,zk03.solr.wa.bl.uk:9983
  -Djava.endorsed.dirs=/opt/tomcat/endorsed
  -classpath /opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/tomcat-juli.jar
  -Dcatalina.base=/opt/tomcat_instances/solrnode2 -Dcatalina.home=/opt/tomcat
  -Djava.io.tmpdir=/opt/tomcat_instances/solrnode2/tmp
  org.apache.catalina.startup.Bootstrap start

 - The Solr node dashboard shows -DnumShards=24 in its list of Args for
 each node

 And yet, the ldwa01 nodes are leader and replica of shard 17 and there
 are no other shard leaders created. Plus, if I only change the ZK
 ensemble declarations in /etc/sysconfig/solrnode* to the different dev ZK
 servers, all 24 leaders are created before any replicas are added.

 I can also mention, when I browse the Cloud view, I can see both the
 ldwa01 collection and the ukdomain collection listed, suggesting that
 this information comes from the ZKs - I assume this is as expected.
 Plus, the correct node addresses (e.g., 192.168.45.17:8984) are listed
 for ldwa01 but these addresses are also listed as 'Down' in the ukdomain
 collection (except for :8983 which only shows in the ldwa01 collection).

 Any help very gratefully received.
 Gil

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: 23 October 2013 18:50
 To: solr-user@lucene.apache.org
 Subject: Re: New shard leaders or existing shard replicas depends on
 zookeeper?

 My first impulse would be to ask how you created the collection. It sure
 _sounds_ like you didn't specify 24 shards and thus have only a single
 shard, one leader and 

Re: Query result caching with custom functions

2013-10-24 Thread Mathias Lux
That's a possibility,  I'll try that and report on the effects.  Thanks,
Mathias
On 24.10.2013 16:52, Joel Bernstein joels...@gmail.com wrote:

 Mathias,

 I'd have to do a close review of the function sort code to be sure, but I
 suspect if you implement the equals() method on the ValueSource it should
 solve your caching issue. Also implement hashCode().

 Joel


 On Thu, Oct 24, 2013 at 10:35 AM, Shawn Heisey s...@elyograg.org wrote:

  On 10/24/2013 5:35 AM, Mathias Lux wrote:
   I've written a custom function which is able to provide a distance
   based on some DocValues to re-sort result lists. This basically works
   great, but we've got the problem that if I don't change the query, only
   the function parameters, Solr delivers a cached result without
   re-ordering. I turned off caching and, lo and behold, problem solved. But
   of course this is not an avenue I want to pursue further, as it doesn't
   make sense for a productive system.
  
   Do you have any ideas (beyond fake query modification and turning off
   caching) to counteract?
  
   btw. I'm using Solr 4.4 (so if you are aware of the issue and it has
   been resolved in 4.5 I'll port it :) The code I'm using is at
   https://bitbucket.org/dermotte/liresolr
 
  I suspect that the queryResultCache is not paying attention to the fact
  that parameters for your plugin have changed.  This probably means that
  your plugin must somehow inform the cache check code that something
  HAS changed.
 
  How you actually do this is a mystery to me because it involves parts of
  the code that are beyond my understanding, but it MIGHT involve making
  sure that parameters related to your code are saved as part of the entry
  that goes into the cache.
 
  Thanks,
  Shawn
 
 



Re: Proposal for new feature, cold replicas, brainstorming

2013-10-24 Thread Yago Riveiro
With a shard in a listening status and some logic in the mechanism that does
the load balancing between replicas, we can achieve the goal.

The SPLITSHARD action creates replicas of the original shard which are in an
inactive state; these shards buffer the updates, and when the operation
ends, the parent shard becomes inactive and the new replicas are promoted to
the active state.

Like the inactive state, we could have a listening state that never becomes
active unless a leader election happens and the shard with listening status
is the only one still alive.

In addition, it is necessary to add new metadata for the shard in the
clusterstate.json file to mark that replica as one kept for replication
purposes only, which resigns when another replica becomes active.


-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Thursday, October 24, 2013 at 2:16 PM, Toke Eskildsen wrote:

 On Thu, 2013-10-24 at 13:27 +0200, yriveiro wrote:
  The motivation of this is simple: I want to have replication, but I don't want
  to have n replicas active with full resources allocated (caches and so on).
  This is useful in environments where replication is needed but a high query
  throughput is not fundamental and resources are limited.
  
 
 
 Coincidentally we recently talked about the exact same setup.
 
 We are looking at sharding a 20 TB index into 20 * 1 TB shards, each
 located on their own dedicated physical SSD, which has more than enough
 horsepower for our needs. For replication, we have a remote storage
 system capable of serving requests for 2-4 shards with acceptable
 latency.
 
 Projected performance for the SSD setup is superior (5-10 times) to our
 remote storage, so we would like to hit only the SSDs if possible.
 Setting up a cloud that issues all requests to the SSD shards unless a
 catastrophic failure happens to one of them, and in that case falls back to
 the remote storage replica for only that shard, would be perfect.
 
  I know that right now it is not possible, but I think it's a feature that
  can be implemented in an easy way by creating a new status for shards.



 shardIsLastResort=true? On paper it seems like a simple addition, but I
 am not familiar enough with the SolrCloud code to guess whether it is easy
 to implement.
 
 - Toke Eskildsen, State and University Library, Denmark 



[ANNOUNCE] Apache Solr 4.5.1 released.

2013-10-24 Thread Mark Miller
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

October 2013, Apache Solr™ 4.5.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.5.1

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic clustering,
database integration, rich document (e.g., Word, PDF) handling, and
geospatial search. Solr is highly scalable, providing fault tolerant
distributed search and indexing, and powers the search and navigation
features of many of the world's largest internet sites.

Solr 4.5.1 includes 16 bug fixes as well as Lucene 4.5.1 and its bug
fixes. The release is available for immediate download at:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html


See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases. It is possible that the mirror you are using
may not have replicated the release yet. If that is the case, please try
another mirror. This also goes for Maven access.

Happy searching,

Lucene/Solr developers
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.14 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBAgAGBQJSaUdSAAoJED+/0YJ4eWrI90UP/RGSmLBdvrc/5NZEb7LSCSjW
z4D3wJ2i4a0rLpiW2qA547y/NZ5KZcmrDSzJu0itf8Q/0q+tm7/d30uPg/cdRlgl
wGERcxsyfPfTqBjzdSNNGgNm++tnkkqRJbYEfsG5ApWrKicitU7cPb82m8oCdlnn
4wnhYt6tfu/EPCglt9ixF7Ukv5o7txMnwWGmkGTbUt8ugp9oOMN/FfGHex/FVxcF
xHhWBLymIJy24APEEF/Mq3UW12hQT+aRof66xBch0fEPVlbDitBa9wNuRNQ98M90
ZpTl8o0ITMUKjTKNkxZJCO5LQeNwhYaOcM5nIykGadWrXBZo5Ob611ZKeYPZBWCW
Ei88dwJQkXaDcVNLZ/HVcAePjmcALHd3nc4uNfcJB8zvgZOPagMpXW2rRSXFACHM
FdaRezTdH8Uh5zp2n3hsqYCbpDreRoXGXaiOgVZ+8EekVMGYUnMFKdqNlqhVnF6r
tzp+aaCBhGDUD5xUw2w2fb5c9Jh1oIQ9f7fsVH78kgsHShySnte3NbfoFWUClPMX
PwrfWuZpmu9In2ZiJVYSOD6MBqmJ+z3N1bnf1kqsitv7MonkvQkOoDIafW835vG9
3aajknE1vazOATSGHIxCtJfqzTEqeqFqVbjG/qS72XIhMey8tVAwjrjcgFnayk9Z
xrG1W1o2sjrYkioJ7nZK
=8++G
-END PGP SIGNATURE-


Re: [ANNOUNCE] Apache Solr 4.5.1 released.

2013-10-24 Thread Jack Park
Download redirects to 4.5.0
Is there a typo in the server path?



Re: [ANNOUNCE] Apache Solr 4.5.1 released.

2013-10-24 Thread Jack Park
Using a different server than the default gets 4.5.1.

On Thu, Oct 24, 2013 at 9:35 AM, Jack Park jackp...@topicquests.org wrote:
 Download redirects to 4.5.0
 Is there a typo in the server path?



Re: Changing indexed property on a field from false to true

2013-10-24 Thread Aloke Ghoshal
Upayavira - Nice idea, pushing in a nominal update when all fields are
stored, and it does work. The nominal update could be sent to a boolean-type
dynamic field that's not used for anything other than maybe
identifying documents that are done re-indexing.


On Wed, Oct 23, 2013 at 7:47 PM, Upayavira u...@odoko.co.uk wrote:

 The content needs to be re-indexed, the question is whether you can use
 the info in the index to do it rather than pushing fresh copies of the
 documents to the index.

 I've often wondered whether atomic updates could be used to handle this
 sort of thing. If all fields are stored, push a nominal update to cause
 the document to be re-indexed. I've never tried it though. I'd be
 curious to know if it works.

 Upayavira

 On Wed, Oct 23, 2013, at 02:25 PM, michael.boom wrote:
  Being given
   <field name="title" type="string" indexed="false" stored="true"
   multiValued="false" />
   Changed to
   <field name="title" type="string" indexed="true" stored="true"
   multiValued="false" />
 
  Once the above is done and the collection reloaded, is there a way I can
   build the index on that field without reindexing everything?
 
  Thank you!
 
 
 
  -
  Thanks,
  Michael
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Changing-indexed-property-on-a-field-from-false-to-true-tp4097213.html
  Sent from the Solr - User mailing list archive at Nabble.com.
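
For reference, the nominal atomic update discussed above might be sent like this against Solr 4.x (a sketch only: reindexed_b is an assumed boolean dynamic field and the document id is illustrative):

curl 'http://localhost:8983/solr/collection1/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"doc1","reindexed_b":{"set":true}}]'

Because it is an atomic update, Solr re-reads all stored fields of doc1 and rewrites the whole document, which re-applies the current schema; hence the requirement that every field be stored.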



Re: New query-time multi-word synonym expander

2013-10-24 Thread Otis Gospodnetic
Jack - watch https://issues.apache.org/jira/browse/SOLR-5379 -
comments from the author are there.
Markus - ah, yes.  I see I even managed to (re)name SOLR-5379
*exactly* the same as SOLR-4381 :)  But the author of SOLR-5379 points
out its advantages over SOLR-4381.

Would be great if people could try it and leave comments with any
issues, so we can iterate on the patch to make it committable.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Oct 23, 2013 at 1:13 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
 Nice, but now we've got three multi-word synonym parsers? Didn't the LUCENE-4499
 or SOLR-4381 patches work? I know the latter has had a reasonable number of
 users and committers on GitHub, but it seems it was never brought back to the ASF.

 -Original message-
 From:Otis Gospodnetic otis.gospodne...@gmail.com
 Sent: Wednesday 23rd October 2013 18:54
 To: solr-user@lucene.apache.org
 Subject: New query-time multi-word synonym expander

 Hi,

 Heads up that there is new query-time multi-word synonym expander
 patch in https://issues.apache.org/jira/browse/SOLR-5379

 This worked for our customer and we hope it works for others.

 Any feedback would be greatly appreciated.

 Thanks,
 Otis
 --
 Performance Monitoring * Log Analytics * Search Analytics
  Solr & Elasticsearch Support * http://sematext.com



Re: Changing indexed property on a field from false to true

2013-10-24 Thread Upayavira
When this gets interesting is if we had batch atomic updates. Imagine
you could do indexCount++ for all docs matching the query
category:sport. Could be really useful. /dreaming.

Upayavira

On Thu, Oct 24, 2013, at 05:40 PM, Aloke Ghoshal wrote:
  Upayavira - Nice idea pushing in a nominal update when all fields are
 stored, and it does work. The nominal update could be sent to a boolean
 type dynamic field, that's not to be used for anything other than maybe
 identifying documents that are done re-indexing.
 
 


Re: Multiple facet fields in defaults section of a Request Handler

2013-10-24 Thread Chris Hostetter

: Now a client wants to use multi-select faceting. He calls the following API:
: http://localhost:8983/solr/collection1/search?q=*:*&facet.field={!ex=foo}category&fq={!tag=foo}category:cat

: Putting the facet definitions in appends causes it to facet category 2
: times.
: 
: Is there a way where he does not have to provide all the facet.field
: parameters in the API call?

What you are asking is essentially "I want to configure faceting on X and
Y by default, but I want clients to be able to add faceting on Z and have
that disable faceting on X while still faceting on Y".

It doesn't matter that X and Z are both field facets based around the
field name "category" -- the tag exclusion makes them completely
different.

The basic default/invariants/appends logic doesn't give you any easy
mechanism to ignore arbitrary params like that - you could probably write
a custom component that inspected the params and dropped ones you don't
want, but this wouldn't make sense as generalized logic in the
FacetComponent, since faceting on a field both with and w/o a tag
exclusion at the same time is a very common use case.




-Hoss
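
For context, the "appends" configuration being described would look roughly like this in solrconfig.xml (a sketch; the handler name is illustrative):

<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="appends">
    <str name="facet">true</str>
    <str name="facet.field">category</str>
  </lst>
</requestHandler>

Params in "appends" are always added to the request, so a client that also passes facet.field={!ex=foo}category ends up with two facets over the same field: one tag-excluded, one not. Moving the field to "defaults" avoids the duplication, but then any client-supplied facet.field replaces the configured ones entirely.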


Re: Solr subset searching in 100-million document index

2013-10-24 Thread Sandeep Gupta
Hi Joel,

Thanks a lot for the information - I haven't worked with PostFilters
before but found an example at
http://java.dzone.com/articles/custom-security-filtering-solr.

Will try it over the next few days and come back if still have questions.

Thanks again!



Keep Walking,
~ Sandeep


On Thu, Oct 24, 2013 at 8:25 PM, Joel Bernstein joels...@gmail.com wrote:

 Sandeep,

 This type of operation can often be expressed as a PostFilter very
 efficiently. This is particularly true if the region id's are integer keys.

 Joel

 On Thu, Oct 24, 2013 at 7:46 AM, Sandeep Gupta sandy@gmail.com
 wrote:

  Hi,
 
  We have a Solr index of around 100 million documents with each document
  being given a region id, growing at a rate of about 10 million documents
  per month - the average document size being around 10KB of pure text. The
  total number of region ids is itself in the range of 2.5 million.
 
  We want to search for a query with a given list of region ids. The number
  of region ids in this list is usually around 250-300 (most of the time),
  but can be up to 500, with a maximum cap of around 2000 ids in one
 request.
 
 
  What is the best way to model such queries besides using an IN param in
 the
  query, or using a Filter FQ in the query? Are there any other faster
  methods available?
 
 
  If it helps, the index is on a VM with 4 virtual cores and currently has
  4GB of Java memory allocated out of 16GB in the machine. The number of
  queries does not exceed 1 per minute for now. If needed, we can
  throw more hardware at the index - but the index will still be on a
  single machine for at least 6 months.
 
  Regards,
  Sandeep Gupta
 



 --
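
Following up for the archives: a minimal PostFilter in the style of the DZone article linked above might look like the sketch below on Solr 4.x. This is only a sketch: the field name region_id_i, the assumption of a single-valued integer field, and the omitted QParserPlugin that would parse the ids out of the fq parameter are all illustrative.

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

// Keeps only documents whose region_id_i value is in the requested set.
public class RegionIdFilter extends ExtendedQueryBase implements PostFilter {

  private final Set<Integer> regionIds;

  public RegionIdFilter(Set<Integer> regionIds) {
    this.regionIds = regionIds;
  }

  @Override
  public boolean getCache() {
    return false;  // post filters must not be cached
  }

  @Override
  public int getCost() {
    return Math.max(super.getCost(), 100);  // cost >= 100 means "run after all other filters"
  }

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      private FieldCache.Ints regionIdValues;

      @Override
      public void setNextReader(AtomicReaderContext context) throws IOException {
        // Per-segment lookup table: docid -> region id.
        regionIdValues = FieldCache.DEFAULT.getInts(context.reader(), "region_id_i", false);
        super.setNextReader(context);
      }

      @Override
      public void collect(int doc) throws IOException {
        // Only documents that already matched the query and all cheaper
        // filters reach this point; forward just the ones we want to keep.
        if (regionIds.contains(regionIdValues.get(doc))) {
          super.collect(doc);
        }
      }
    };
  }
}

Run this way, the 250-2000 region ids are tested only against the documents that survived the main query, instead of being turned into a giant boolean filter over 100 million documents.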



Re: Terms function join with a Select function ?

2013-10-24 Thread Bruno Mannina

Dear,

Hum... I don't know how I can use it.

I tried:

my query:
ti:snowboard (3095 results)

I would like to have, at the end of my XML, the terms statistics for the
field AP (the applicant field of a patent notice).


but I don't get that...

Please help,
Bruno

/select?q=ti%3Asnowboard&version=2.2&start=0&rows=10&indent=on&facet=true&f.ap.facet.limit=10

On 24/10/2013 14:04, Erik Hatcher wrote:

That would be called faceting :)

 http://wiki.apache.org/solr/SimpleFacetParameters




On Oct 24, 2013, at 5:23 AM, Bruno Mannina bmann...@free.fr wrote:


Dear All,

Ok I have an answer concerning the first question (limit)
It's the terms.limit parameter.

But I can't find how to apply a Terms request on a query result

any idea ?

Bruno

On 23/10/2013 23:19, Bruno Mannina wrote:

Dear Solr users,

I use the Terms function to see the frequency data in a field but it's for the 
whole database.

I have 2 questions:
- Is it possible to increase the number of statistics? Currently I get the 10
most frequent terms.

- Is it possible to limit these statistics to the result of a request?

PS: the second question is very important for me.

Many thanks














Re: Terms function join with a Select function ?

2013-10-24 Thread Bruno Mannina

Hmm, facet performance is very bad (Solr 3.6.0).
My index is around 87,000,000 docs (4 dual-core processors, 24GB RAM).

I thought facets would work only on the result, but it seems that's not the
case.


My request:
http://localhost:2727/solr/select?q=ti:snowboard&rows=0&facet=true&facet.field=ap&facet.limit=5

Do you think my request is wrong?

Maybe it's not possible to get statistics on a field (like the Terms
function) for a query.


Thx for your help,

Bruno





Re: Terms function join with a Select function ?

2013-10-24 Thread Bruno Mannina

Just a small precision: Solr went down after running my URL :( so bad...




Join Query Behavior

2013-10-24 Thread Andy Pickler
We're attempting to upgrade from Solr 4.2 to 4.5 but are finding that 4.5
is not honoring this join query:

first part of query...

fq={!join from=project_id_i to=project_id_im}user_id_i:65615 -role_id_i:18
type:UserRole

last part of query

On our Solr 4.2 instance adding/removing that query gives us different (and
expected) results, while the query doesn't affect the results at all in
4.5.  Is there any known join query behavior differences/fixes between 4.2
and 4.5 that might explain this, or should I be looking at other factors?

Thanks,
Andy Pickler


Post filter cache question

2013-10-24 Thread Eric Grobler
Hi

If I run this query it is very fast (10 ms) because it uses a TopList
filter:
q=*:*
fl=adr_geopoint,adr_city,filterflags
*fq=(filterflags:TopList) *
and the number of relevant documents is 3000 out of 7 million.

If I run the same query but add a spatial filter with cost:
q=*:*
fl=adr_geopoint,adr_city,filterflags
fq=(filterflags:TopList)
pt=49.594,8.468
sfield=adr_geopoint
fq={!bbox d=30}
fq={!frange l=15 u=30 cache=false cost=200}geodist()

It takes over 3 seconds even though it should only scan around 3000
documents from the first cached filter?
Could it be a problem with my cache settings in solrconfig.xml (solr 3.1)
or is my query wrong?

Thanks & regards
Ericz


Re: measure result set quality

2013-10-24 Thread Chris Hostetter

: As a first approach I will evaluate (manually :( ) hits that are out of the
: intersection set for every query in each system. Anyway I will keep

FYI: LucidWorks has a Relevancy Workbench tool that serves as a simple 
UI designed explicitly for the purpose of comparing the result sets
from different Solr query configurations...

http://www.lucidworks.com/market_app/lucidworks-relevancy-workbench/


-Hoss


Re: difference between apache tomcat vs Jetty

2013-10-24 Thread Jonathan Rochkind
This is good to know, and I find it welcome advice; I would recommend 
making sure this advice is clearly highlighted in the relevant Solr 
docs, such as any getting started docs.


I'm not sure everyone realizes this, and some go down the Tomcat route 
without realizing the Solr committers recommend Jetty -- or use a stock 
Jetty without realizing the 'example' Jetty is recommended and actually 
intended to be used by Solr users in production!  I think it's easy to 
miss this advice.


On 10/20/13 5:55 PM, Shawn Heisey wrote:

On 10/20/2013 2:57 PM, Shawn Heisey wrote:

We recommend jetty.  The solr example uses jetty.


I have a clarification for this statement.  We actually recommend using
the jetty that's included in the Solr 4.x example.  It is stripped of
all unnecessary features and its config has had some minor tuning so
it's optimized for Solr.  The jetty binaries in 4.x are completely
unmodified from the upstream download, we just don't include all of
them.  On the 1.x and 3.x examples, there was a small bug in Jetty 6, so
those versions included modified binaries.

If you download jetty from eclipse.org or install it from your operating
system's repository, it will include components you don't need and its
config won't be optimized for Solr, but it will still be a lot closer to
what's actually tested than tomcat is.

Thanks,
Shawn



Re: Post filter cache question

2013-10-24 Thread Chris Hostetter

: Could it be a problem with my cache settings in solrconfig.xml (solr 3.1)
: or is my query wrong?

3.1? ouch ... PostFilter wasn't even added until 3.4...
https://wiki.apache.org/solr/CommonQueryParameters#Caching_of_filters

...so your spatial filter is definitely being applied to the entire index 
and then getting cached.

. . .

Below is what i wrote before i saw that 3.4 comment at the end of your 
email...

: If I run the same query but add a spatial filter with cost:
: q=*:*
: fl=adr_geopoint,adr_city,filterflags
: *fq=(filterflags:TopList) *
: pt=49.594,8.468
: sfield=adr_geopoint
: fq={!bbox d=30}
: fq={!frange l=15 u=30 *cache=false *cost=200}geodist()
: 
: It takes over 3 seconds even though it should only scan around 3000
: documents from the first cached filter?

You've also added a bbox filter, which will be computed against the 
entire index and cached.

I'm not sure what FieldType you are using, and I don't know a lot of the 
details about the spatial queries -- but things you should look into...

1) does the bbox gain you anything if you are already doing the geodist 
filter as a post filter?  (my hunch would be that the only point of a bbox 
fq is if you are *scoring* documents by distance and you want to ignore 
things beyond a set distance)

2) does {!bbox} support PostFilter on your FieldType? does 
adding cache=false cost=150 to the bbox filter improve things?



-Hoss
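
For later readers: the two ingredients that make a filter run as a post filter (on 3.4 or later) are cache=false and a cost of 100 or more, and the query type must actually implement PostFilter, which {!frange} does:

fq={!frange l=15 u=30 cache=false cost=200}geodist()

With a cost below 100, or with a query type that has no PostFilter support, the filter is computed against the whole index like any other fq, which matches the behavior Eric is seeing on 3.1.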


Re: difference between apache tomcat vs Jetty

2013-10-24 Thread Tim Vaillancourt
I agree with Jonathan (and Shawn on the Jetty explanation), I think the
docs should make this a bit more clear - I notice many people choosing
Tomcat and then learning these details after, possibly regretting it.

I'd be glad to modify the docs but I want to be careful how it is worded.
Is it fair to go as far as saying Jetty is 100% THE recommended container
for Solr, or should a recommendation be avoided, and maybe just a list of
pros/cons?

Cheers,

Tim


Re: Reclaiming disk space from (large, optimized) segments

2013-10-24 Thread Chris Hostetter

I didn't dig into the details of your mail too much, but a few things 
jumped out at me...

: - At some time in the past, a manual force merge / optimize with
: maxSegments=2 was run to troubleshoot high disk i/o and remove too many

Have you tried a simple commit using expungeDeletes=true?  It should be a 
little less intensive than optimizing.  (under the covers it does 
IndexWriter.forceMergeDeletes())


: - Merge policies are all at Solr 4 defaults. Index size is currently ~50M
: maxDocs, ~35M numDocs, 276GB.

"Solr 4 defaults" is way too vague to be meaningful: 4.0? 4.1? ... 4.4? 

Do you mean you are using the example configs that came with that version 
of Solr, or do you mean you have no mergePolicy configured and you are 
getting the hardcoded defaults? .. either way it's important to specify 
exactly which version of Solr you are running and exactly what your 
entire <indexConfig> section looks like, since both the example configs 
and the hardcoded default behavior when configs aren't specified have 
evolved since 4.0-ALPHA.



-Hoss
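
For reference, an expungeDeletes commit can be sent straight to the update handler; a sketch against a stock 4.x install (host, port and core name are illustrative):

curl 'http://localhost:8983/solr/collection1/update?commit=true&expungeDeletes=true'

Unlike a full optimize, this only merges segments that carry deletions, so it can reclaim space from delete-heavy segments without rewriting the whole index into one or two giant segments.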


Problem with glassfish and zookeeper 3.4.5

2013-10-24 Thread kaustubh147
Hi,

Glassfish 3.1.2.2
Solr 4.5
Zookeeper 3.4.5

We have set up a SolrCloud with 4 Solr nodes and 3 zookeeper instances. It
seems to be working fine from Solr admin page.

but when I try to connect to it from a web application using SolrJ 4.5, I run
into problems. I am creating my SolrCloud server as suggested on the wiki page:

LBHttpSolrServer lbHttpSolrServer = new LBHttpSolrServer(
    SOLR_INSTANCE01,
    SOLR_INSTANCE02,
    SOLR_INSTANCE03,
    SOLR_INSTANCE04);
solrServer = new CloudSolrServer("zk1:p1,zk2:p1,zk3:p1", lbHttpSolrServer);
solrServer.setDefaultCollection("collection");


It seems to be working fine for a while even though I am getting a WARNING
as below
-
SASL configuration failed: javax.security.auth.login.LoginException: No JAAS
configuration section named 'Client' was found in specified JAAS
configuration file: 'XYZ_path/SolrCloud_04/config/login.conf'. Will continue
connection to Zookeeper server without SASL authentication,​ if Zookeeper
server allows it.
--

The application is deployed on a single node cluster on glassfish. 

as soon as my application has made some queries to the Solr server, it will
start throwing errors in the solrServer.runQuery() method. The reason for the
error is not clear...

Application logs shows following error trace many times...

-
[#|2013-10-24T14:07:53.750-0700|WARNING|glassfish3.1.2|org.apache.zookeeper.ClientCnxn|_ThreadID=1434;_ThreadName=Thread-2;|SASL
configuration failed: javax.security.auth.login.LoginException: No JAAS
configuration section named 'Client' was found in specified JAAS
configuration file: 'XYZ_PATH/config/login.conf'. Will continue connection
to Zookeeper server without SASL authentication, if Zookeeper server allows
it.|#]

[#|2013-10-24T14:07:53.750-0700|INFO|glassfish3.1.2|org.apache.zookeeper.ClientCnxn|_ThreadID=1434;_ThreadName=Thread-2;|Opening
socket connection to server server_name/IP3:2181|#]

[#|2013-10-24T14:07:53.750-0700|INFO|glassfish3.1.2|org.apache.solr.common.cloud.ConnectionManager|_ThreadID=1435;_ThreadName=Thread-2;|Watcher
org.apache.solr.common.cloud.ConnectionManager@187eaada
name:ZooKeeperConnection Watcher:IP1:2181,IP2:2181,IP3:2181 got event
WatchedEvent state:AuthFailed type:None path:null path:null type:None|#]

[#|2013-10-24T14:07:53.750-0700|INFO|glassfish3.1.2|org.apache.solr.common.cloud.ConnectionManager|_ThreadID=1435;_ThreadName=Thread-2;|Client-ZooKeeper
status change trigger but we are already closed|#]

[#|2013-10-24T14:07:53.751-0700|INFO|glassfish3.1.2|org.apache.zookeeper.ClientCnxn|_ThreadID=1434;_ThreadName=Thread-2;|Socket
connection established to server_name/IP3:2181, initiating session|#]

[#|2013-10-24T14:07:53.751-0700|INFO|glassfish3.1.2|org.apache.solr.common.cloud.ConnectionManager|_ThreadID=1420;_ThreadName=Thread-2;|Watcher
org.apache.solr.common.cloud.ConnectionManager@4ba50169
name:ZooKeeperConnection Watcher:IP1:2181,IP2:2181,IP3:2181 got event
WatchedEvent state:Disconnected type:None path:null path:null type:None|#]

[#|2013-10-24T14:07:53.751-0700|WARNING|glassfish3.1.2|org.apache.zookeeper.ClientCnxn|_ThreadID=1434;_ThreadName=Thread-2;|Session
0x0 for server server_name/IP3:2181, unexpected error, closing socket
connection and attempting reconnect
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
at sun.nio.ch.IOUtil.read(IOUtil.java:166)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
|#]


--

before this happen the zookeeper logs on all the 3 instances starts showing
following warning

2013-10-24 14:05:55,200 [myid:3] - WARN 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too
many connections from /IP_APPLICATION_SEVER - max is 200

This means that my application is making too many connections to
ZooKeeper, exceeding the limit, which is set to 200.


Is there a way I can control the number of connections my application is
making to ZooKeeper?
The only component which is connecting to zookeeper in my application is
CloudSolrServer object.

As per my investigation, the SASL warning is related to an existing bug in
ZooKeeper 3.4.5 that is being solved for ZooKeeper 3.5, and it should not
cause this issue.

I need help and guidance..

Thanks,
Kaustubh














Re: Problem with glassfish and zookeeper 3.4.5

2013-10-24 Thread Shawn Heisey

On 10/24/2013 4:30 PM, kaustubh147 wrote:

Glassfish 3.1.2.2
Solr 4.5
Zookeeper 3.4.5

We have set up a SolrCloud with 4 Solr nodes and 3 zookeeper instances. It
seems to be working fine from Solr admin page.

but when I am trying to connect it to web application using Solrj 4.5.
I am creating my Solr Cloud Server as suggested on the wiki page

LBHttpSolrServer lbHttpSolrServer = new LBHttpSolrServer(
    SOLR_INSTANCE01,
    SOLR_INSTANCE02,
    SOLR_INSTANCE03,
    SOLR_INSTANCE04);
solrServer = new CloudSolrServer("zk1:p1,zk2:p1,zk3:p1", lbHttpSolrServer);
solrServer.setDefaultCollection("collection");


If this is what you are seeing as instructions for connecting from SolrJ 
to SolrCloud, then something's really screwy.  Can you give me the URL 
that shows this, so I can see about getting it changed?  The following 
code example is how you should be doing that.  For this example, 
zookeeper is using the default port of 2181 and the zookeeper hosts are 
zoo1, zoo2, and zoo3.


String zkHost = "zoo1:2181,zoo2:2181,zoo3:2181";
// If you are using a chroot, use something like this instead:
// String zkHost = "zoo1:2181,zoo2:2181,zoo3:2181/chroot";
CloudSolrServer server = new CloudSolrServer(zkHost);
server.setDefaultCollection("collection1");


It seems to be working fine for a while even though I am getting a WARNING
as below
-
SASL configuration failed: javax.security.auth.login.LoginException: No JAAS
configuration section named 'Client' was found in specified JAAS
configuration file: 'XYZ_path/SolrCloud_04/config/login.conf'. Will continue
connection to Zookeeper server without SASL authentication,​ if Zookeeper
server allows it.
--


Later you said this sounds like a bug you saw in ZK 3.4.5. It may be 
that glassfish turns on some system-wide setting related to 
authentication that zookeeper picks up on.  I would tend to agree that 
this probably is not related to the other problems mentioned below.



before this happen the zookeeper logs on all the 3 instances starts showing
following warning

2013-10-24 14:05:55,200 [myid:3] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too
many connections from /IP_APPLICATION_SEVER - max is 200

it means that my application is making too many connections with the
zookeeper and it is exceeding the limit which is set to 200.


Are you creating one CloudSolrServer object (static would be OK) and 
using it for all interaction with SolrCloud, or are you creating many 
CloudSolrServer objects over the life of your application?  Is there 
more than one thread or instance of your application running, and each 
one has its own CloudSolrServer object?  It is strongly recommended that 
you only create one object for your entire application and use it for 
all queries, updates, etc.  You can set the collection parameter on 
each query or request object that you use, if you need to use more than one.


If you *DO* create many CloudSolrServer objects over the life of your 
application and cannot immediately change your code so that it uses one 
object, be sure to shutdown() each one when it is no longer required.  
Depending on the exact nature of your application, you may also need to 
increase the maximum number of connections allowed in your zookeeper config.


Thanks,
Shawn
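
A minimal sketch of the one shared object being described (the class name, ZooKeeper hosts and collection name are all illustrative; CloudSolrServer is designed to be shared by many threads):

import java.net.MalformedURLException;

import org.apache.solr.client.solrj.impl.CloudSolrServer;

public final class SolrClientHolder {

  // One CloudSolrServer for the whole application.
  private static final CloudSolrServer SERVER;

  static {
    try {
      SERVER = new CloudSolrServer("zoo1:2181,zoo2:2181,zoo3:2181");
    } catch (MalformedURLException e) {
      throw new ExceptionInInitializerError(e);
    }
    SERVER.setDefaultCollection("collection1");
  }

  private SolrClientHolder() {}

  public static CloudSolrServer get() {
    return SERVER;
  }

  // Call exactly once, at application shutdown, to release the
  // ZooKeeper connection held by this client.
  public static void close() {
    SERVER.shutdown();
  }
}

If the application genuinely needs many short-lived clients instead, raising maxClientCnxns in zoo.cfg is the ZooKeeper-side knob, but fixing the client lifecycle is the better cure.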



Re: difference between apache tomcat vs Jetty

2013-10-24 Thread Anshum Gupta
Thought you may want to have a look at this:

https://issues.apache.org/jira/browse/SOLR-4792

P.S: There are no timelines for 5.0 for now, but it's the future
nevertheless.



On Fri, Oct 25, 2013 at 3:39 AM, Tim Vaillancourt t...@elementspace.com wrote:

 I agree with Jonathan (and Shawn on the Jetty explanation), I think the
 docs should make this a bit more clear - I notice many people choosing
 Tomcat and then learning these details after, possibly regretting it.

 I'd be glad to modify the docs but I want to be careful how it is worded.
 Is it fair to go as far as saying Jetty is 100% THE recommended container
 for Solr, or should a recommendation be avoided, and maybe just a list of
 pros/cons?

 Cheers,

 Tim




-- 

Anshum Gupta
http://www.anshumgupta.net


Re: difference between apache tomcat vs Jetty

2013-10-24 Thread Tim Vaillancourt
Hmm, that's an interesting move. I'm on the fence on that one, but it surely
simplifies some things. Good info, thanks!

Tim


On 24 October 2013 16:46, Anshum Gupta ans...@anshumgupta.net wrote:

 Thought you may want to have a look at this:

 https://issues.apache.org/jira/browse/SOLR-4792

 P.S: There are no timelines for 5.0 for now, but it's the future
 nevertheless.






 --

 Anshum Gupta
 http://www.anshumgupta.net



Re: Post filter cache question

2013-10-24 Thread Eric Grobler
Hi Chris

Thank you for your response.
I will try to migrate to Solr 4.4 first!

Best regards






First test cloud error question...

2013-10-24 Thread Jack Park
Background: all testing done on a Win7 platform. This is my first
migration from a single Solr server to a simple cloud. Everything is
configured exactly as specified in the wiki.

I created a simple 3-node cluster, all on localhost with different server
URLs, and a lone external ZooKeeper.  The online admin shows they are
all up.

I then start an agent which sends in documents to bootstrap the
index. That's when issues start.  A clip from the log shows this:
First, I create a SolrDocument with this JSON data:

DEBUG 2013-10-24 18:00:09,143 [main] - SolrCloudClient.mapToDocument-
{"locator":"EssayNodeType","smallIcon":"\/images\/cogwheel.png","subOf":["NodeType"],"details":["The
TopicQuests NodeTypes typology essay
type."],"isPrivate":false,"creatorId":"SystemUser","label":["Essay
Type"],"largeIcon":"\/images\/cogwheel_sm.png","lastEditDate":"Thu Oct
24 18:00:09 PDT 2013","createdDate":"Thu Oct 24 18:00:09 PDT 2013"}

Then, send it in from SolrJ which has a CloudSolrServer initialized
with localhost:2181 and an instance of LBHttpSolrServer initialized
with http://localhost:8983/solr/

That trace follows

INFO  2013-10-24 18:00:09,145 [main] - Initiating client connection,
connectString=localhost:2181 sessionTimeout=1
watcher=org.apache.solr.common.cloud.ConnectionManager@e6c
INFO  2013-10-24 18:00:09,148 [main] - Waiting for client to connect
to ZooKeeper
INFO  2013-10-24 18:00:09,150 [main-SendThread(0:0:0:0:0:0:0:1:2181)]
- Opening socket connection to server
0:0:0:0:0:0:0:1/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate
using SASL (Unable to locate a login configuration)
ERROR 2013-10-24 18:00:09,151 [main-SendThread(0:0:0:0:0:0:0:1:2181)]
- Unable to open socket to 0:0:0:0:0:0:0:1/0:0:0:0:0:0:0:1:2181
WARN  2013-10-24 18:00:09,151 [main-SendThread(0:0:0:0:0:0:0:1:2181)]
- Session 0x0 for server null, unexpected error, closing socket
connection and attempting reconnect
java.net.SocketException: Address family not supported by protocol
family: connect
at sun.nio.ch.Net.connect(Native Method)
at sun.nio.ch.SocketChannelImpl.connect(Unknown Source)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.registerAndConnect(ClientCnxnSocketNIO.java:266)

I can watch the Zookeeper console running; it's mostly complaining
about too many connections from /127.0.0.1 ; I am seeing the errors in
the agent's log file.

Following that trace in the log is this:

INFO  2013-10-24 18:00:09,447 [main-SendThread(127.0.0.1:2181)] -
Opening socket connection to server 127.0.0.1/127.0.0.1:2181. Will not
attempt to authenticate using SASL (Unable to locate a login
configuration)
INFO  2013-10-24 18:00:09,448 [main-SendThread(127.0.0.1:2181)] -
Socket connection established to 127.0.0.1/127.0.0.1:2181, initiating
session
DEBUG 2013-10-24 18:00:09,449 [main-SendThread(127.0.0.1:2181)] -
Session establishment request sent on 127.0.0.1/127.0.0.1:2181
DEBUG 2013-10-24 18:00:09,449 [main-SendThread(127.0.0.1:2181)] -
Could not retrieve login configuration: java.lang.SecurityException:
Unable to locate a login configuration
INFO  2013-10-24 18:00:09,501 [main-SendThread(127.0.0.1:2181)] -
Session establishment complete on server 127.0.0.1/127.0.0.1:2181,
sessionid = 0x141ece7e6160017, negotiated timeout = 1
INFO  2013-10-24 18:00:09,501 [main-EventThread] - Watcher
org.apache.solr.common.cloud.ConnectionManager@42bad8a8
name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent
state:SyncConnected type:None path:null path:null type:None
INFO  2013-10-24 18:00:09,502 [main] - Client is connected to ZooKeeper
DEBUG 2013-10-24 18:00:09,502 [main-SendThread(127.0.0.1:2181)] -
Could not retrieve login configuration: java.lang.SecurityException:
Unable to locate a login configuration
DEBUG 2013-10-24 18:00:09,502 [main-SendThread(127.0.0.1:2181)] -
Could not retrieve login configuration: java.lang.SecurityException:
Unable to locate a login configuration
DEBUG 2013-10-24 18:00:09,503 [main-SendThread(127.0.0.1:2181)] -
Could not retrieve login configuration: java.lang.SecurityException:
Unable to locate a login configuration
DEBUG 2013-10-24 18:00:09,503 [main-SendThread(127.0.0.1:2181)] -
Could not retrieve login configuration: java.lang.SecurityException:
Unable to locate a login configuration
DEBUG 2013-10-24 18:00:09,504 [main-SendThread(127.0.0.1:2181)] -
Could not retrieve login configuration: java.lang.SecurityException:
Unable to locate a login configuration
DEBUG 2013-10-24 18:00:09,504 [main-SendThread(127.0.0.1:2181)] -
Could not retrieve login configuration: java.lang.SecurityException:
Unable to locate a login configuration
DEBUG 2013-10-24 18:00:09,505 [main-SendThread(127.0.0.1:2181)] -
Could not retrieve login configuration: java.lang.SecurityException:
Unable to locate a login configuration
DEBUG 2013-10-24 18:00:09,506 [main-SendThread(127.0.0.1:2181)] -
Reading reply sessionid:0x141ece7e6160017, packet:: clientPath:null
serverPath:null finished:false header:: 1,3  replyHeader:: 1,541,0
request:: '/clusterstate.json,F  response::

Solr 4.5.1 and Illegal to have multiple roots (start tag in epilog?). (perhaps SOLR-4327 bug?)

2013-10-24 Thread Michael Tracey
Hey Solr-users,

I've got a single Solr 4.5.1 node with 96GB RAM, a 65GB index (105 million 
records) and a lot of daily churn of newly indexed files (auto softcommit and 
commits).  I'm trying to bring another matching node into the mix, and am 
getting these errors on the new node:

org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: 
Illegal to have multiple roots (start tag in epilog?).

On the old server, still running, I'm getting: 

shard update error StdNode: 
http://server1:/solr/collection/:org.apache.solr.client.solrj.SolrServerException:
 Server refused connection at: http://server2:/solr/collection

The new core never actually comes online; it stays in recovery mode.  The other 
two tiny cores (100,000+ records each, not updated frequently) work just 
fine.

Is this the SOLR-4327 bug?  https://issues.apache.org/jira/browse/SOLR-5331   And 
if so, how can I get the new node up and running so I can get back into 
production with some redundancy and speed?

I'm running an external zookeeper, and that is all running just fine.  Also 
internal Solrj/jetty with little to no modifications.  

Any ideas would be appreciated, thanks, 

M.


Solr indexing on email mime body and attachment

2013-10-24 Thread neerajp
Hi,
I am integrating the Solr search engine with my email clients. I am sending POST
requests to Solr using REST.
I am successfully able to post an email's To, From, Subject, etc. headers to
Solr for indexing.
Since emails can have MIME-type bodies and attachments, I am not able to
understand how to post the email body and attachments so that Solr can index
them.

Any help is highly appreciated.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-indexing-on-email-mime-body-and-attachment-tp4097692.html
Sent from the Solr - User mailing list archive at Nabble.com.
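
One common way to handle body and attachment content (a sketch, not from the thread: the handler path, ids and field names are illustrative, and assume the ExtractingRequestHandler is configured as in the stock example) is Solr Cell, which runs Apache Tika over an uploaded file:

curl 'http://localhost:8983/solr/update/extract?literal.id=msg-001&literal.subject_s=hello&commit=true' \
  -F 'myfile=@attachment.pdf'

The literal.* parameters attach the already-parsed header fields to the same document, and Tika's extracted body text is mapped into a schema field by the handler's configuration.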


Re: Reclaiming disk space from (large, optimized) segments

2013-10-24 Thread Otis Gospodnetic
Only skimmed your email, but "purge every 4 hours" jumped out at me. Would it
make sense to have time-based indices that can be periodically dropped
instead of being purged?

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Oct 23, 2013 10:33 AM, Scott Lundgren scott.lundg...@carbonblack.com
wrote:

 *Background:*

 - Our use case is to use SOLR as a massive FIFO queue.

 - Document additions and updates happen continuously.

 - Documents are being added at a sustained rate of 50-100 documents
 per second.

 - About 50% of these document are updates to existing docs, indexed
 using atomic updates: the original doc is thus deleted and re-added.

 - There is a separate purge operation running every four hours that deletes
 the oldest docs, if required based on a number of unrelated configuration
 parameters.

 - At some time in the past, a manual force merge / optimize with
 maxSegments=2 was run to troubleshoot high disk i/o and remove "too many
 segments" as a potential variable.  Currently, the largest .fdt files are 74G and
 43G.  There are 47 total segments; the other largest sizes are all around
 2G.

 - Merge policies are all at Solr 4 defaults. Index size is currently ~50M
 maxDocs, ~35M numDocs, 276GB.

 *Issue:*

 The background purge operation is deleting docs on schedule, but the disk
 space is not being recovered.

 *Presumptions:*
 I presume, but have not confirmed (how?), that the 15M deleted documents are
 predominantly in the two large segments.  Because they are largely in the
 two large segments, and those large segments still have (some/many) live
 documents, the segment backing files are not deleted.

 *Questions:*

 - When will those segments get merged and documents recovered?  Does it
 happen when _all_ the documents in those segments are deleted?  Some
 percentage of the segment is filled with deleted documents?
 - Is there a way to do it right now vs. just waiting?
 - In some cases, the purge delete conditional is _just_ free disk space:
  when index > free space, delete oldest.  Those setups are now in scenarios
 where index > free space, and getting worse.  How does low disk space
 effect above two questions?
 - Is there a way for me to determine stats on a per-segment basis?
- for example, how many deleted documents in a particular segment?
 - On the flip side, can I determine in what segment a particular document
 is located?

 Thank you,

 Scott

 --
 Scott Lundgren
 Director of Engineering
 Carbon Black, Inc.
 (210) 204-0483 | scott.lundg...@carbonblack.com