MLT in SolrJ vs. URL?
Hi, I compose an MLT query as a URL and, in my browser, get back the queried result along with a list of documents in the moreLikeThis section. When I execute the same query through SolrJ with the same parameters, I only get the queried result document back and no MLT docs. What's the trick here? Thanks, Darren
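A common cause of this symptom is that the SolrJ request is not carrying the exact MoreLikeThis parameters the browser URL did (mlt=true, mlt.fl, mlt.count), or is hitting a handler without the MLT component. A minimal sketch for comparing the two side by side: it assembles the parameter set as a query string in plain Java so it can be diffed against the URL that worked. The field name "text" and the count are assumptions, not taken from the thread; in SolrJ you would set the same keys via SolrQuery.set().

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MltParams {
    // Build the query-string form of a MoreLikeThis request so the SolrJ
    // parameters can be compared 1:1 with the URL that worked in the browser.
    static String buildQueryString(String q) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", q);
        params.put("mlt", "true");    // enable the MoreLikeThis component
        params.put("mlt.fl", "text"); // assumed similarity field
        params.put("mlt.count", "5"); // similar docs to return per result
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildQueryString("id:42"));
    }
}
```

If the assembled string matches the browser URL's parameters and MLT docs still do not come back, the difference is likely the request handler the SolrJ client is addressing.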
Re: zk Config URL?
(AbstractInhabitantImpl.java:78)
    at com.sun.enterprise.v3.server.AppServerStartup.run(AppServerStartup.java:253)
    at com.sun.enterprise.v3.server.AppServerStartup.doStart(AppServerStartup.java:145)
    at com.sun.enterprise.v3.server.AppServerStartup.start(AppServerStartup.java:136)
    at com.sun.enterprise.glassfish.bootstrap.GlassFishImpl.start(GlassFishImpl.java:79)
    at com.sun.enterprise.glassfish.bootstrap.GlassFishDecorator.start(GlassFishDecorator.java:63)
    at com.sun.enterprise.glassfish.bootstrap.osgi.OSGiGlassFishImpl.start(OSGiGlassFishImpl.java:69)
    at com.sun.enterprise.glassfish.bootstrap.GlassFishMain$Launcher.launch(GlassFishMain.java:117)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.sun.enterprise.glassfish.bootstrap.GlassFishMain.main(GlassFishMain.java:97)
    at com.sun.enterprise.glassfish.bootstrap.ASMain.main(ASMain.java:55)
Caused by: java.lang.ClassNotFoundException: javax.servlet.Filter
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at sun.misc.Launcher$ExtClassLoader.findClass(Launcher.java:229)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    ... 55 more

On 02/24/2013 08:32 PM, Mark Miller wrote:
You either have to specifically upload a config set or use one of the bootstrap sys props. Are you doing either?
- Mark

On Feb 24, 2013, at 8:15 PM, Darren Govoni dar...@ontrenet.com wrote:
Thanks Michael. I went ahead and just started an external zookeeper, but my solr node throws exceptions from it.
Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection collection1 found:null
...
[#|2013-02-24T20:13:58.451-0500|SEVERE|glassfish3.1.2|org.apache.solr.core.CoreContainer|_ThreadID=28;_ThreadName=Thread-2;|null:org.apache.solr.common.SolrException: Unable to create core: collection1
    at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1654)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1039)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection collection1 found:null
    at org.apache.solr.cloud.ZkController.getConfName(ZkController.java:1097)
    at org.apache.solr.cloud.ZkController.createCollectionZkNode(ZkController.java:1016)
    at org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:937)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1031)
    ... 10 more

On 02/24/2013 07:21 PM, Michael Della Bitta wrote:
Hello Darren, If you go into the admin and click on Cloud, you'll see that information represented in a number of ways. Both Dump and Tree (especially the clusterstate.json file) have this information represented as a document in JSON format.
If you don't see the Cloud navigation on the left side of the admin screen, that's a good indication that Solr hasn't connected to Zookeeper.

Michael Della Bitta
Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271
www.appinions.com
Where Influence Isn’t a Game

On Sun, Feb 24, 2013 at 6:34 PM, Darren Govoni dar...@ontrenet.com wrote:
Hi, I'm trying the latest SolrCloud 4.1. Is there a button (or URL) I can't find that shows me the ZooKeeper config XML, so I can check which other nodes are connected? Can't seem to find it. I deploy my SolrCloud war into GlassFish and set jetty.port (among other properties) to the GF domain port (e.g. 8181). It starts successfully. I want ZooKeeper to run automatically within (as needed). How can I verify this, or refer to the first/master server using zkHost from another node (e.g. {host}:{port}) to form a cluster? I did this before a while ago, before Solr 4.x was released, but things have changed. Tips appreciated. Thank you. Darren
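Mark's two options for getting a config into ZooKeeper can be sketched as startup commands. These are illustrative only: the ZooKeeper host, paths, config name, and the stock Solr 4.x jetty-style layout are assumptions, not taken from the thread; deploying inside GlassFish would pass the same system properties to the domain's JVM instead.

```shell
# Option 1: bootstrap system properties on the first node. This pushes the
# local conf directory into ZooKeeper and links it to the collection.
java -DzkHost=zkhost:2181 \
     -Dbootstrap_confdir=./solr/collection1/conf \
     -Dcollection.configName=myconf \
     -jar start.jar

# Option 2: upload a config set explicitly with the zkcli tool shipped in
# Solr's cloud-scripts, then start every node with only -DzkHost.
./zkcli.sh -zkhost zkhost:2181 -cmd upconfig \
           -confdir ./solr/collection1/conf -confname myconf
java -DzkHost=zkhost:2181 -jar start.jar
```

Either way, the "Could not find configName for collection collection1" exception above indicates neither step had happened before the core was created.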
zk Config URL?
Hi, I'm trying the latest SolrCloud 4.1. Is there a button (or URL) I can't find that shows me the ZooKeeper config XML, so I can check which other nodes are connected? Can't seem to find it. I deploy my SolrCloud war into GlassFish and set jetty.port (among other properties) to the GF domain port (e.g. 8181). It starts successfully. I want ZooKeeper to run automatically within (as needed). How can I verify this, or refer to the first/master server using zkHost from another node (e.g. {host}:{port}) to form a cluster? I did this before a while ago, before Solr 4.x was released, but things have changed. Tips appreciated. Thank you. Darren
Re: zk Config URL?
Thanks Michael. I went ahead and just started an external zookeeper, but my solr node throws exceptions from it.

Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection collection1 found:null
...
[#|2013-02-24T20:13:58.451-0500|SEVERE|glassfish3.1.2|org.apache.solr.core.CoreContainer|_ThreadID=28;_ThreadName=Thread-2;|null:org.apache.solr.common.SolrException: Unable to create core: collection1
    at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1654)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1039)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection collection1 found:null
    at org.apache.solr.cloud.ZkController.getConfName(ZkController.java:1097)
    at org.apache.solr.cloud.ZkController.createCollectionZkNode(ZkController.java:1016)
    at org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:937)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1031)
    ... 10 more

On 02/24/2013 07:21 PM, Michael Della Bitta wrote:
Hello Darren, If you go into the admin and click on Cloud, you'll see that information represented in a number of ways.
Both Dump and Tree (especially the clusterstate.json file) have this information represented as a document in JSON format. If you don't see the Cloud navigation on the left side of the admin screen, that's a good indication that Solr hasn't connected to Zookeeper.

Michael Della Bitta
Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271
www.appinions.com
Where Influence Isn’t a Game

On Sun, Feb 24, 2013 at 6:34 PM, Darren Govoni dar...@ontrenet.com wrote:
Hi, I'm trying the latest SolrCloud 4.1. Is there a button (or URL) I can't find that shows me the ZooKeeper config XML, so I can check which other nodes are connected? Can't seem to find it. I deploy my SolrCloud war into GlassFish and set jetty.port (among other properties) to the GF domain port (e.g. 8181). It starts successfully. I want ZooKeeper to run automatically within (as needed). How can I verify this, or refer to the first/master server using zkHost from another node (e.g. {host}:{port}) to form a cluster? I did this before a while ago, before Solr 4.x was released, but things have changed. Tips appreciated. Thank you. Darren
RE: SolrJ and Solr 4.0 | doc.getFieldValue() returns String instead of Date
SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.S'Z'");
Date dateObj = df.parse("2009-10-29T00:00:00.9Z");

--- Original Message --- On 1/8/2013 09:34 AM uwe72 wrote:
A Lucene 4.0 document now returns a String value for a Date field, instead of a Date object.

<field name="ModuleImpl.versionAsDate" view="Datenstand" type="date"/>

Solr 4.0 -- 2009-10-29T00:00:009Z
Solr 3.6 -- Date instance

Can this be set somewhere in the config? I prefer to receive a Date instance.

--
View this message in context: http://lucene.472066.n3.nabble.com/SolrJ-and-Solr-4-0-doc-getFieldValue-returns-String-instead-of-Date-tp4031588.html
Sent from the Solr - User mailing list archive at Nabble.com.
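The reply's one-liner, expanded into a self-contained sketch. The class name and sample value are illustrative; the pattern letters matter: yyyy for the year and HH for hours 0-23 (hh is the 12-hour clock and silently misparses midnight), and the time zone must be pinned to UTC since that is what Solr's date strings represent.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SolrDateParse {
    // Parse a Solr-style date string back into a java.util.Date.
    public static Date parse(String s) throws ParseException {
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.S'Z'");
        df.setTimeZone(TimeZone.getTimeZone("UTC")); // Solr dates are UTC
        return df.parse(s);
    }

    public static void main(String[] args) throws ParseException {
        Date d = parse("2009-10-29T00:00:00.0Z");
        System.out.println(d.getTime()); // epoch millis at UTC midnight
    }
}
```

SimpleDateFormat is not thread-safe, so in client code it should be created per call or per thread rather than shared.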
RE: RE: Max number of core in Solr multi-core
This should be clarified some. In the client API, SolrServer represents a connection to a single server backend/endpoint and should be re-used where possible. The approach being discussed is to have one client connection (represented by the SolrServer class) per Solr core, all residing in a single Solr server (as is the case below, but not required).

--- Original Message --- On 1/7/2013 08:06 AM Jay Parashar wrote:
This is the exact approach we use in our multithreaded env. One server per core. I think this is the recommended approach.

-----Original Message-----
From: Parvin Gasimzade [mailto:parvin.gasimz...@gmail.com]
Sent: Monday, January 07, 2013 7:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Max number of core in Solr multi-core

I know that, but my question is different. Let me ask it in this way.

I have a Solr with base URL localhost:8998/solr and two Solr cores at localhost:8998/solr/core1 and localhost:8998/solr/core2.

I have one base Solr instance initialized as:
SolrServer server = new HttpSolrServer(url);

I have also created SolrServers for each core:
SolrServer core1 = new HttpSolrServer(url + "/core1");
SolrServer core2 = new HttpSolrServer(url + "/core2");

Since there are many cores, I have to initialize a SolrServer as shown above. Is there a way to create only one SolrServer with the base URL and access each core using it? If it is possible, then I don't need to create a new SolrServer for each core.

On Mon, Jan 7, 2013 at 2:39 PM, Erick Erickson erickerick...@gmail.com wrote:
This might help: https://wiki.apache.org/solr/Solrj#HttpSolrServer
Note that the associated SolrRequest takes the path, I presume relative to the base URL you initialized the HttpSolrServer with.
Best, Erick

On Mon, Jan 7, 2013 at 7:02 AM, Parvin Gasimzade parvin.gasimz...@gmail.com wrote:
Thank you for your responses. I have one more question related to Solr multi-core. By using SolrJ I create a new core for each application. When a user wants to add data or make a query on his application, I create a new HttpSolrServer for this core. In this scenario there will be many running HttpSolrServer instances. Is there a better solution? Does it cause a problem to run many instances at the same time?

On Wed, Jan 2, 2013 at 5:35 PM, Per Steffensen st...@designware.dk wrote:
...using a collection per application instead of a core
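The advice above, one long-lived client per core rather than a new HttpSolrServer per request, can be sketched as a small cache keyed by core name. To keep the sketch dependency-free, the cached value here is just the per-core endpoint URL; in real SolrJ code the map values would be HttpSolrServer instances, which are thread-safe and meant to be created once and shared. The base URL and core names are the ones from the thread.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CoreClientCache {
    private final String baseUrl;
    // One cached entry per core; in SolrJ these would be HttpSolrServer
    // instances created once and reused across threads.
    private final Map<String, String> clients = new ConcurrentHashMap<>();

    public CoreClientCache(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    public String clientFor(String core) {
        // computeIfAbsent builds the endpoint once, then reuses it
        return clients.computeIfAbsent(core, c -> baseUrl + "/" + c);
    }

    public int size() {
        return clients.size();
    }

    public static void main(String[] args) {
        CoreClientCache cache = new CoreClientCache("http://localhost:8998/solr");
        String a = cache.clientFor("core1");
        String b = cache.clientFor("core1"); // second lookup hits the cache
        System.out.println(a.equals(b) && cache.size() == 1);
    }
}
```

This bounds the number of client objects at one per core no matter how many requests arrive, which is the point being made in the reply.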
Re: Terminology question: Core vs. Collection vs...
Yes. In that case, core should best be described as a logical Solr entity with various managed attributes and qualities above the physical layer (sorry, not trying to perpetuate this thread so much).

On 01/04/2013 01:55 PM, Mark Miller wrote:
Currently a SolrCore is 1:1 with a low-level Lucene index. There is no reason that needs to always be that way. It's possible that we may at some point add built-in micro sharding support that means a SolrCore could have multiple underlying Lucene indexes. Or we may not.
- Mark

On Jan 4, 2013, at 1:49 PM, darren dar...@ontrenet.com wrote:
Good point. Agree.
Sent from my Verizon Wireless 4G LTE Smartphone

-------- Original message --------
From: Upayavira u...@odoko.co.uk
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...
Using your terminology, I'd say core is a physical Solr term, and index is a physical Lucene term. A collection or a shard is a logical Solr term.
Upayavira

On Fri, Jan 4, 2013, at 06:28 PM, darren wrote:
My understanding is core is a logical Solr term. Index is a physical Lucene term. A Solr core is backed by a physical Lucene index. One index per core. Solr team can correct me if it's not accurate. :)

-------- Original message --------
From: Alexandre Rafalovitch arafa...@gmail.com
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...
Can I just start by saying that this was AMAZING. :-) When I asked the question, I certainly did not expect this level of detail. And I vote for the cake diagram for the WIKI as well. Perhaps two, with the first one showing the trivial collapsed state of a single collection/shard/replica/core. The trivial one will also help to explain why the example is now called 'collection1'. I think I followed everything, except for the just-added term 'index'. Isn't that the same as 'core'? Or can we have several indexes in one core?
Regards,
Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Fri, Jan 4, 2013 at 10:11 AM, darren dar...@ontrenet.com wrote:
This is the containment hierarchy I understand, but it includes both physical and logical.

-------- Original message --------
From: darren dar...@ontrenet.com
To: dar...@ontrenet.com, yo...@lucidworks.com, solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...
Actually: Node/collection/shard/replica/core/index

-------- Original message --------
From: darren dar...@ontrenet.com
To: yo...@lucidworks.com, solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...
Agreed. But for completeness can it be node/collection/shard/replica/core?
RE: Re: Terminology question: Core vs. Collection vs...
Good write up. And what about node?

I think there needs to be an official glossary of terms that is sanctioned by the Solr team, and some terms still in use may need to be labeled deprecated. After so many years, it's still confusing.

--- Original Message --- On 1/3/2013 08:07 AM Jack Krupansky wrote:
Collection is the more modern term and incorporates the fact that the collection may be sharded, with each shard on one or more cores, with each core being a replica of the other cores within that shard of that collection.

Instance is a general term, but is commonly used to refer to a running Solr server, each of which can service any number of cores. A sharded collection would typically require multiple instances of Solr, each with a shard of the collection.

Multiple collections can be supported on a single instance of Solr. They don't have to be sharded or replicated. But if they are, each Solr instance will have a copy or replica of the data (index) of one shard of each sharded collection - to the degree that each collection needs that many shards.

At the API level, you talk to a Solr instance, using a host and port, and giving the collection name. Some operations will refer only to the portion of a multi-shard collection on that Solr instance, but typically Solr will distribute the operation, whether it be an update or a query, to all of the shards of the named collection. In the case of an update, the update will be distributed to all replicas as well, but in the case of a query only one replica of each shard of the collection is needed.

Before SolrCloud, Solr had master and slave and the slaves were replicas of the master, but with SolrCloud there is no master and all the replicas of the shard are peers, although at any moment in time one of them will be considered the leader for coordination purposes, but not in the sense that it is a master of the other replicas in that shard. A SolrCloud replica is a replica of the data, in an abstract sense, for a single shard of a collection. A SolrCloud replica is more of an instance of the data/index.

An index exists at two levels: the portion of a collection on a single Solr core will have a Lucene index, but collectively the Lucene indexes for the shards of a collection can be referred to as the index of the collection. Each replica is a copy or instance of a portion of the collection's index.

The term slice is sometimes used to refer collectively to all of the cores/replicas of a single shard, or sometimes to a single replica as it contains only a slice of the full collection data.

-- Jack Krupansky

-----Original Message-----
From: Alexandre Rafalovitch
Sent: Thursday, January 03, 2013 4:42 AM
To: solr-user@lucene.apache.org
Subject: Terminology question: Core vs. Collection vs...

Hello,

I am trying to understand the core Solr terminology. I am looking for correct rather than loose meaning, as I am trying to teach an example that starts from an easy scenario and may scale to a multi-core, multi-machine situation.

Here are the terms that seem to be all overlapping and/or crossing over in my mind at the moment.

1) Index
2) Core
3) Collection
4) Instance
5) Replica (Replica of _what_?)
6) Others?

I tried looking through the documentation, but either there is terminology drift or I am having trouble understanding the distinctions.

If anybody has a clear picture in their mind, I would appreciate a clarification.

Regards,
Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
RE: Re: Terminology question: Core vs. Collection vs...
Thanks again. (And sorry to jump into this convo.) But I had a question on your statement. On 1/3/2013 08:07 AM Jack Krupansky wrote:
Collection is the more modern term and incorporates the fact that the collection may be sharded, with each shard on one or more cores, with each core being a replica of the other cores within that shard of that collection.

A collection is sharded, meaning it is distributed across cores. A shard itself is not distributed across cores in the same sense. Rather, a shard exists on a single core and is replicated on other cores. Is that right? The way it's worded above, it sounds like a shard can also be sharded...

--- Original Message --- On 1/3/2013 08:28 AM Jack Krupansky wrote:
A node is a machine in a cluster or cloud (graph). It could be a real machine or a virtualized machine. Technically, you could have multiple virtual nodes on the same physical box. Each Solr replica would be on a different node.

Technically, you could have multiple Solr instances running on a single hardware node, each with a different port. They are simply instances of Solr, although you could consider each Solr instance a node in a Solr cloud as well, a virtual node. So, technically, you could have multiple replicas on the same node, but that sort of defeats most of the purpose of having replicas in the first place - to distribute the data for performance and fault tolerance. But, you could have replicas of different shards on the same node/box for a partial improvement of performance and fault tolerance.

A Solr cloud is really a cluster.

-- Jack Krupansky

-----Original Message-----
From: Darren Govoni
Sent: Thursday, January 03, 2013 8:16 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Good write up. And what about node?
RE: Re: Terminology question: Core vs. Collection vs...
Thanks. I got that part. A group of shards (and therefore cores) represents a collection, yes. But does a single shard exist only on a single core?

--- Original Message --- On 1/3/2013 09:03 AM Jack Krupansky wrote:
No, a shard is a subset (or slice) of the collection. Sharding is a way of slicing the original data, before we talk about how the shards get stored and replicated on actual Solr cores. Replicas are instances of the data for a shard.

Sometimes people may loosely speak of a replica as being a shard, but that's just loose use of the terminology.

So, we're not sharding shards, but we are replicating shards.

-- Jack Krupansky
RE: Re: Terminology question: Core vs. Collection vs...
I think what's confusing about your explanation below is when you have a situation where there is no replication factor. That's possible too, yes? So in that case, is each core of a shard of a collection still referred to as a replica? To me a replica is a duplicate/backup of a shard's core, not the sharded core itself. Or is there just no difference, and even a non-replicated core is called a replica?

--- Original Message --- On 1/3/2013 09:08 AM Jack Krupansky wrote:
Oops... let me word that a little more carefully:

...we are replicating the data of each shard.

-- Jack Krupansky
Rather a shard brexist on a single core and is replicated on other cores. Is that right? The brway its worded above, it sounds like a shard can also be sharded... br br brbrbrbr--- Original Message --- brOn 1/3/2013 08:28 AM Jack Krupansky wrote:brA node is a machine in a brcluster or cloud (graph). It could be a real brbrmachine or a virtualized machine. Technically, you could have multiple brbrvirtual nodes on the same physical box. Each Solr replica would be on bra brbrdifferent node. brbr brbrTechnically, you could have multiple Solr instances running on a single brbrhardware node, each with a different port. They are simply instances of brbrSolr, although you could consider each Solr instance a node in a Solr brcloud brbras well, a virtual node. So, technically, you could have multiple brreplicas brbron the same node, but that sort of defeats most of the purpose of having brbrreplicas in the first place - to distribute the data for performance and brbrfault tolerance. But, you could have replicas of different shards on the brbrsame node/box for a partial improvement of performance and fault brtolerance. brbr brbrA Solr cloud' is really a cluster. brbr brbr-- Jack Krupansky brbr brbr-Original Message- brbrFrom: Darren Govoni brbrSent: Thursday, January 03, 2013 8:16 AM brbrTo: solr-user@lucene.apache.org brbrSubject: RE: Re: Terminology question: Core vs. Collection vs... brbr brbrGood write up. brbr brbrAnd what about node? brbr brbrI think there needs to be an official glossary of terms that is brsanctioned brbrby the solr team and some terms still ni use may need to be labeled brbrdeprecated. After so many years, its still confusing. 
brbr brbrbrbrbr--- Original Message --- brbrOn 1/3/2013 08:07 AM Jack Krupansky wrote:brCollection is the more brmodern brbrterm and incorporates the fact that the brbrbrcollection may be sharded, with each shard on one or more cores, brwith brbreach brbrbrcore being a replica of the other cores within that shard of that brbrbrcollection. brbrbr brbrbrInstance is a general term, but is commonly used to refer to a brrunning brbrSolr brbrbrserver, each of which can service any number of cores. A sharded brbrcollection brbrbrwould typically require multiple instances of Solr, each with a brshard of brbrthe brbrbrcollection. brbrbr brbrbrMultiple collections can be supported on a single instance of Solr. brThey brbrbrdon't have to be sharded or replicated. But if they are, each Solr brbrinstance brbrbrwill have a copy or replica of the data (index) of one shard of each brbrsharded brbrbrcollection - to the degree that each collection needs that many brshards. brbrbr brbrbrAt the API level, you talk to a Solr instance, using a host and brport, brbrand brbrbrgiving the collection name. Some operations will refer only to the brbrportion brbrbrof a multi-shard collection on that Solr instance, but typically brSolr brbrwill brbrbrdistribute the operation, whether it be an update
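The vocabulary being hashed out above (collection → shard → replica/core) can be sketched as a tiny model. This is illustrative Python, not Solr code, and every name in it is made up; the one real convention it encodes is that even a non-replicated shard's single core is still called a "replica".

```python
# Illustrative model (not Solr code): a collection is sliced into shards by
# hashing the document's unique key, and each shard's data is stored as one
# or more replica cores. With replicationFactor=1 the single core holding a
# shard is still a "replica" -- the term covers the non-replicated case too.

NUM_SHARDS = 2
REPLICATION_FACTOR = 2  # total copies of each shard, leader included

def shard_for(doc_id: str) -> int:
    """Route a document to a shard by hashing its unique key."""
    return hash(doc_id) % NUM_SHARDS

# Logical collection -> physical layout: each shard maps to its replica cores.
collection = {
    s: [f"shard{s}_replica{r + 1}" for r in range(REPLICATION_FACTOR)]
    for s in range(NUM_SHARDS)
}
total_cores = sum(len(replicas) for replicas in collection.values())
```

With two shards and a replication factor of 2, the collection occupies four cores in total, spread across however many nodes you have.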
RE: Re: Terminology question: Core vs. Collection vs...
Yes. And it's worth noting that when you have multiple shards in a single node (@deprecated), they are shards of different collections...

--- Original Message ---
On 1/3/2013 09:16 AM Jack Krupansky wrote:
> And I would revise node to note that in SolrCloud a node is simply an instance of a Solr server.
>
> And, technically, you can have multiple shards in a single instance of Solr, separating the logical sharding of keys from the distribution of the data.
>
> -- Jack Krupansky
> [...]
RE: Re: Terminology question: Core vs. Collection vs...
Ah, ok. Good. Makes sense. I think I will draw all this up in a UML diagram that includes the distinction between the logical terms and the physical terms (and their mapping), as they do get intertwined. I'll post it here when I'm done.

--- Original Message ---
On 1/3/2013 09:19 AM Jack Krupansky wrote:
> A single shard MAY exist on a single core, but only if it is not replicated. Generally, a single shard will exist on multiple cores, each a replica of the source data as it comes into the update handler.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Darren Govoni
> Sent: Thursday, January 03, 2013 9:10 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Re: Terminology question: Core vs. Collection vs...
>
> Thanks. I got that part.
>
> A group of shards (and therefore cores) represents a collection, yes. But does a single shard exist only on a single core?
> [...]
RE: Re: Terminology question: Core vs. Collection vs...
Great point.

--- Original Message ---
On 1/3/2013 10:42 AM Per Steffensen wrote:
> On 1/3/13 4:33 PM, Mark Miller wrote:
> > This has pretty much become the standard across other distributed systems and in the literat…err…books.
> Hmmm, I'm not sure you are right about that. Maybe more than one distributed system calls them Replica, but there are also a lot that don't. But if you are right, that's at least a good valid argument to do it this way, even though I generally prefer good logical naming over following bad naming from the industry :-) Just because there is a lot of crap out there doesn't mean that we also want to make crap. Maybe good logical naming could even be a small entry in the "Why Solr is better than its competitors" list :-)
RE: Re: Terminology question: Core vs. Collection vs...
And based on the previous explanation there is never a copy of a shard. A shard represents and contains only replicas for itself, replicas being copies of cores within the shard.

--- Original Message ---
On 1/3/2013 11:58 AM Walter Underwood wrote:
> A factor is multiplied, so multiplying the leader by a replicationFactor of 1 means you have exactly one copy of that shard.
>
> I think that recycling the term replication within Solr was confusing, but it is a bit late to change that.
>
> wunder
>
> On Jan 3, 2013, at 7:33 AM, Mark Miller wrote:
> > This has pretty much become the standard across other distributed systems and in the literat…err…books.
> >
> > I first implemented it as you mention you'd like, but Yonik correctly pointed out that we were going against the grain.
> >
> > - Mark
> >
> > On Jan 3, 2013, at 10:01 AM, Per Steffensen st...@designware.dk wrote:
> > > For the same reasons that Replica shouldn't be called Replica (it requires too long an explanation to agree that it is an ok name), replicationFactor shouldn't be called replicationFactor as long as it refers to the TOTAL number of cores you get for your shard. replicationFactor would be an ok name if replicationFactor=0 meant one core, replicationFactor=1 meant two cores, etc., but as long as replicationFactor=1 means one core and replicationFactor=2 means two cores, it is bad naming (you will not get any replication with replicationFactor=1 - WTF!?!?). If we want to insist that you specify the total number of cores, at least use replicaPerShard instead of replicationFactor, or even better rename Replica to Shard-instance and use instancesPerShard instead of replicationFactor.
> > >
> > > Regards, Per Steffensen
> > >
> > > On 1/3/13 3:52 PM, Per Steffensen wrote:
> > > Hi
> > >
> > > Here is my version - I do not believe the explanations have been very clear.
> > >
> > > We have the following concepts (here I will try to explain what each concept covers without naming it - it's hard):
> > > 1) Machines (virtual or physical) running Solr server JVMs (one machine can run several Solr server JVMs if you like)
> > > 2) Solr server JVMs
> > > 3) Logical stores where you can add/update/delete data-instances (closest to logical tables in an RDBMS)
> > > 4) Logical slices of a store (closest to non-overlapping logical sets of rows for the logical table in an RDBMS)
> > > 5) Physical instances of slices (a physical (disk/memory) instance of a logical slice). This is where data actually goes on disk - the logical stores and slices above are just non-physical concepts
> > >
> > > Terminology:
> > > 1) Believe we have no name for this (except of course "machine" :-) ), even though Jack claims that this is called a node. Maybe sometimes it is called a node, but I believe node is more often used to refer to a Solr server JVM.
> > > 2) Node
> > > 3) Collection
> > > 4) Shard. Used to be called Slice, but I believe now it is officially called Shard. I agree with that change, because I believe most of the industry also uses the term Shard for this logical/non-physical concept - it just needs to be reflected across documentation and code.
> > > 5) Replica. Used to be called Shard, but I believe now it is officially called Replica. I certainly do not agree with the name Replica, because it suggests that it is a copy of an original, but it isn't. I would prefer Shard-instance here, to avoid the confusion. I understand that you can argue (if you argue long enough) that Replica is a fine name, but you really need the explanation to understand why Replica can be defended as the name for this. It is not immediately obvious what this is as long as it is called Replica. A Replica is basically a SolrCloud-managed Core (so Replica=Core), and behind every Replica/Core lives a physical Lucene index. The term Replica also needs to be reflected across documentation and code.
> > >
> > > Regards, Per Steffensen
>
> --
> Walter Underwood
> wun...@wunderwood.org
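Walter's "a factor is multiplied" point above can be made concrete with one line of arithmetic. This is just a sketch of the convention the thread describes: replicationFactor counts ALL copies of a shard, leader included, so replicationFactor=1 means no redundancy.

```python
# Total cores a collection occupies under Solr's replicationFactor naming:
# replicationFactor is the TOTAL number of copies of each shard (leader
# included), so cores = shards x replicationFactor.
def total_cores(num_shards: int, replication_factor: int) -> int:
    return num_shards * replication_factor

print(total_cores(3, 1))  # 3 cores: one per shard, no extra copies
print(total_cores(3, 2))  # 6 cores: one leader + one replica per shard
```

This is exactly the naming complaint in the thread: with replicationFactor=1 you get zero replication, just the three leader cores.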
Re: Terminology question: Core vs. Collection vs...
I see. So sharding and distributing/replicating can have separate and different advantages.

On 01/03/2013 01:06 PM, Lance Norskog wrote:
> Also, searching can be much faster if you put all of the shards on one machine, along with the search distributor. That way, you search with multiple simultaneous threads inside one machine. I've seen this make searches several times faster.
>
> On 01/03/2013 06:36 AM, Jack Krupansky wrote:
> > Ah... the multiple shards (of the same collection) in a single node is about planning for future expansion of your cluster - create more shards than you need today, put more of them on a single node, and then migrate them to their own nodes as the data outgrows the smaller number of nodes. In other words, add nodes incrementally without having to reindex all the data.
> >
> > -- Jack Krupansky
> >
> > -----Original Message-----
> > From: Darren Govoni
> > Sent: Thursday, January 03, 2013 9:18 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Re: Terminology question: Core vs. Collection vs...
> >
> > Yes. And it's worth noting that when you have multiple shards in a single node (@deprecated), they are shards of different collections...
> > [...]
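Jack's "plan for expansion" idea above (create more shards than you need today, migrate them to their own nodes later) can be sketched as a round-robin layout. Node and shard names here are hypothetical; the point is that adding nodes just moves whole shards, with no reindexing.

```python
# Sketch of over-sharding for future growth: more shards than nodes,
# assigned round-robin. When nodes are added, whole shards migrate to
# them -- the shard count (and thus the document routing) never changes.
def assign(num_shards, nodes):
    layout = {n: [] for n in nodes}
    for s in range(num_shards):
        layout[nodes[s % len(nodes)]].append(f"shard{s + 1}")
    return layout

today = assign(8, ["nodeA", "nodeB"])                    # 4 shards per node
later = assign(8, ["nodeA", "nodeB", "nodeC", "nodeD"])  # 2 shards per node
```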
RE: Does SolrCloud supports MoreLikeThis?
There is a ticket for that with some recent activity (sorry I don't have it handy right now), but I'm not sure if that work made it into the trunk, so probably SolrCloud does not support MLT... yet. Would love an update from the dev team though!

--- Original Message ---
On 11/5/2012 10:37 AM Luis Cappa Banda wrote:
> That's the question, :-)
>
> Regards,
>
> Luis Cappa.
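For what it's worth, a (non-distributed) MoreLikeThis request reduces to a handful of standard query parameters, which also have to be set explicitly when going through SolrJ rather than a raw URL. A sketch of that parameter set, with made-up field names ("title", "body"):

```python
from urllib.parse import urlencode

# The standard parameters behind a MoreLikeThis query. In SolrJ the same
# params must be set explicitly on the query object; once they are, the
# moreLikeThis section appears in the response just as it does in a browser.
params = {
    "q": "id:doc1",
    "mlt": "true",           # enable the MoreLikeThis component
    "mlt.fl": "title,body",  # fields to mine for "interesting" terms
    "mlt.mindf": 1,          # min doc frequency for a term to qualify
    "mlt.mintf": 1,          # min term frequency in the source document
}
query_string = urlencode(params)
```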
Re: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT available for download
It certainly seems to be a rogue project, but I can't understand the meaning of "realtime near realtime (NRT)" either. At best, it's oxymoronic.

On 10/29/2012 10:30 AM, Jack Krupansky wrote: Could any of the committers here confirm whether this is a legitimate effort? I mean, how could anything labeled "Apache ABC with XYZ" be an external project and be sanctioned/licensed by Apache? In fact, the linked web page doesn't even acknowledge the ownership of the Apache trademarks or ASL. And the term "Realtime NRT" is nonsensical. Even worse: "Realtime NRT makes available a near realtime view." Equally nonsensical. Who knows, maybe it is legit, but it sure comes across as a scam/spam. -- Jack Krupansky -Original Message- From: Nagendra Nagarajayya Sent: Monday, October 29, 2012 10:06 AM To: solr-user@lucene.apache.org Subject: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT available for download Hi! I am very excited to announce the availability of Apache Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT. Realtime NRT is a high performance and more granular NRT implementation as to soft commit. The update performance is about 70,000 documents / sec* (almost 1.5-2x performance improvement over soft-commit). You can also scale up to 2 billion documents* in a single core, and query half a billion documents index in ms**. Realtime NRT is different from realtime-get. realtime-get does not have search capability and is a lookup by id. Realtime NRT allows full search, see here http://solr-ra.tgels.org/realtime-nrt.jsp for more info. Realtime NRT has been contributed back to Solr, see JIRA: https://issues.apache.org/jira/browse/SOLR-3816 RankingAlgorithm 1.4.4 supports the entire Lucene Query Syntax, ± and/or boolean/dismax/boost queries and is compatible with the new Lucene 4.0 api.
You can get more information about Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT performance from here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x You can download Solr 4.0 with RankingAlgorithm 1.4.4 from here: http://solr-ra.tgels.org Please download and give the new version a try. Note: 1. Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external project Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org * performance is a real use case of Apache Solr with RankingAlgorithm as seen at a user installation ** performance seen when using the age feature
Re: Cloud terminology clarification
I agree it needs updating, and I've always gotten confused at some point by the use (misuse) of terms. For example, the term 'node' is thrown around a lot too. What is it??! Hehe.

On Sat, 2012-09-08 at 22:26 -0700, JesseBuesking wrote:
> It's been a while since the terminology at http://wiki.apache.org/solr/SolrTerminology has been updated, so I'm wondering how these terms apply to solr cloud setups. My take on what the terms mean:
>
> Collection: Basically the highest-level container that bundles together the other pieces for servicing a particular search setup
> Core: An individual solr instance (represents entire indexes)
> Shard: A portion of a core (represents a subset of an index)
>
> Therefore:
> - increasing the number of shards allows for indexing more documents (aka scaling the amount of data that can be indexed)
> - increasing the number of cores increases the potential throughput of requests (aka cores mirror each other, allowing you to distribute requests to multiple servers)
>
> Does this sound right? If so, then my follow-up question would be: does the following directory structure look right/standard?
>
> .../solr # = solr home
> .../solr/collection-01
> .../solr/collection-01/core-01
> .../solr/collection-01/core-02
>
> And if this is right, I'm on a roll :D My next question would then be: given we're using zookeeper (separate machine), do we need 1 conf folder at collection-01's level? Or do we need 1 conf folder per core?
Re: Map/Reduce directly against solr4 index.
Of course you can do it, but the question is whether this will produce the performance results you expect. I've seen talk about this in other forums, so you might find some prior work here. Solr and HDFS serve somewhat different purposes. The key issue would be if your map and reduce code overloads the Solr endpoint. Even using SolrCloud, I believe all requests will have to go through a single URL (to be routed), so if you have thousands of map/reduce jobs all running simultaneously, the question is whether your Solr is architected to handle that amount of throughput. On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote: Is it possible to run map reduce jobs directly on Solr4? I'm asking this because I want to use Solr4 as the primary storage engine. And I want to be able to run near real time analytics against it as well. Rather than export solr4 data out to a hadoop cluster.
Re: Map/Reduce directly against solr4 index.
You raise an interesting possibility. A map/reduce solr handler over solrcloud...

On Thu, 2012-07-26 at 18:52 -0700, Trung Pham wrote:
> I think the performance should be close to Hadoop running on HDFS, if somehow a Hadoop job can directly read the Solr index file while executing the job on the local solr node. Kinda like how HBase and Cassandra integrate with Hadoop.
>
> Plus, we can run the map reduce job on a standby Solr4 cluster. This way, the documents in Solr will be our primary source of truth. And we have the ability to run near real time search queries and analytics on it. No need to export data around. Solr4 is becoming a very interesting solution to many web scale problems. Just missing the map/reduce component. :)
>
> On Thu, Jul 26, 2012 at 3:01 PM, Darren Govoni dar...@ontrenet.com wrote:
> [...]
Re: [Announce] Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 with Realtime NRT available for download
What exactly is Realtime NRT (Near Real Time)? On Sun, 2012-07-22 at 14:07 -0700, Nagendra Nagarajayya wrote: Hi! I am very excited to announce the availability of Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 with Realtime NRT. The Realtime NRT implementation now supports both RankingAlgorithm and Lucene. Realtime NRT is a high performance and more granular NRT implementation as to soft commit. The update performance is about 70,000 documents / sec*. You can also scale up to 2 billion documents* in a single core, and query half a billion documents index in ms**. RankingAlgorithm 1.4.4 supports the entire Lucene Query Syntax, ± and/or boolean queries and is compatible with the new Lucene 4.0-ALPHA api. You can get more information about Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 Realtime performance from here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x You can download Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 from here: http://solr-ra.tgels.org Please download and give the new version a try. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org * performance seen at a user installation of Solr 4.0 with RankingAlgorithm 1.4.3 ** performance seen when using the age feature
Re: Facet on all the dynamic fields with *_s feature
You'll have to query the index for the fields and sift out the _s ones, and cache them or something.

On Mon, 2012-07-16 at 16:52 +0530, Rajani Maski wrote:
> Yes, this feature would solve the below problem very neatly. All, is there any approach to achieve this for now? --Rajani
>
> On Sun, Jul 15, 2012 at 6:02 PM, Jack Krupansky j...@basetechnology.com wrote:
> > The answer appears to be No, but it's good to hear people express an interest in proposed features.
> >
> > -- Jack Krupansky
> >
> > -----Original Message-----
> > From: Rajani Maski
> > Sent: Sunday, July 15, 2012 12:02 AM
> > To: solr-user@lucene.apache.org
> > Subject: Facet on all the dynamic fields with *_s feature
> >
> > Hi All, is this issue fixed in solr 3.6 or 4.0: faceting on all dynamic fields with facet.field=*_s. Link: https://issues.apache.org/jira/browse/SOLR-247
> > If it is not fixed, any suggestion on how do I achieve this? My requirement is just the same as this one: http://lucene.472066.n3.nabble.com/Dynamic-facet-field-tc2979407.html#none
> >
> > Regards, Rajani
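The workaround above can be sketched as: fetch the live field names (e.g. from the Luke handler), keep the *_s ones, and facet on each. A minimal sketch with a made-up field list:

```python
# Workaround sketch: until facet.field=*_s is supported, select the "*_s"
# dynamic fields from a field-name list (as could be obtained from Solr's
# Luke handler) and build one facet.field parameter per match.
def string_facet_fields(field_names):
    return sorted(f for f in field_names if f.endswith("_s"))

fields = ["id", "price_f", "color_s", "brand_s", "title"]
facet_params = [("facet.field", f) for f in string_facet_fields(fields)]
```

Caching the resulting list, as suggested, avoids re-querying the field set on every request.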
Re: Solr Faceting
I don't think it comes at any added cost for solr to return that facet so you can filter it out in your business logic. On Sat, 2012-07-07 at 15:18 +0530, Shanu Jha wrote: Hi, I am generating facet for a field which has one of the value NA and I want solr should not create facet(or ignore) for this(NA) value. Is there any way to in solr to do that. Thanks
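Filtering the unwanted value in business logic, as suggested, is a couple of lines over the facet counts. This sketch assumes the flat [value, count, value, count, ...] list Solr returns for a facet field by default (the data here is illustrative):

```python
# Client-side filtering sketch: drop the "NA" bucket from a facet result
# before display, leaving the rest of the buckets intact.
def drop_facet_value(facet_counts, unwanted="NA"):
    pairs = zip(facet_counts[::2], facet_counts[1::2])
    return [(value, count) for value, count in pairs if value != unwanted]

facet_counts = ["red", 10, "NA", 7, "blue", 3]
filtered = drop_facet_value(facet_counts)  # [("red", 10), ("blue", 3)]
```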
Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support
I don't recall anyone being able to get acceptable performance with a single index that large with solr/lucene. The conventional wisdom is that parallel searching across cores (or shards in SolrCloud) is the best way to handle index sizes in the billions. So it's of great interest how you did. Anyone else gotten an index(es) with billions of documents to perform well? I'm greatly interested in how.

On Mon, 2012-05-28 at 05:12 -0700, Nagendra Nagarajayya wrote: It is a single node. I am trying to find out if the performance can be referenced. Regarding information on Solr with RankingAlgorithm, you can find all the information here: http://solr-ra.tgels.org On RankingAlgorithm: http://rankingalgorithm.tgels.org Regards, - NN

On 5/27/2012 4:50 PM, Li Li wrote: yes, I am also interested in good performance with 2 billion docs. how many search nodes do you use? what's the average response time and qps? another question: where can I find related paper or resources of your algorithm which explains the algorithm in detail? why is it better than google site (better than lucene is not very interesting, because lucene is not originally designed to provide search function like google)?

On Mon, May 28, 2012 at 1:06 AM, Darren Govoni dar...@ontrenet.com wrote: I think people on this list would be more interested in your approach to scaling 2 billion documents than modifying solr/lucene scoring (which is already top notch). So given that, can you share any references or otherwise substantiate good performance with 2 billion documents? Thanks.

On Sun, 2012-05-27 at 08:29 -0700, Nagendra Nagarajayya wrote: Actually, RankingAlgorithm 1.4.2 has been scaled to more than 2 billion docs. With RankingAlgorithm 1.4.3, using the parameters age=latestdocs=number feature, you can retrieve the NRT inserted documents in milliseconds from such a huge index, improving query and faceting performance and using very little resources ...
Currently, RankingAlgorithm 1.4.3 is only available with Solr 4.0, and the NRT insert performance with Solr 4.0 is about 70,000 docs / sec. RankingAlgorithm 1.4.3 should become available with Solr 3.6 soon. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org On 5/27/2012 7:32 AM, Darren Govoni wrote: Hi, Have you tested this with a billion documents? Darren On Sun, 2012-05-27 at 07:24 -0700, Nagendra Nagarajayya wrote: Hi! I am very excited to announce the availability of Solr 3.6 with RankingAlgorithm 1.4.2. This NRT support now works with both RankingAlgorithm and Lucene. The insert/update performance should be about 5000 docs in about 490 ms with the MbArtists index. RankingAlgorithm 1.4.2 has multiple algorithms, improved performance over the earlier releases, supports the entire Lucene Query Syntax, ± and/or boolean queries, and can scale to more than a billion documents. You can get more information about NRT performance here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x You can download Solr 3.6 with RankingAlgorithm 1.4.2 here: http://solr-ra.tgels.org Please download and give the new version a try. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org ps. The MbArtists index is the example index used in the Solr 1.4 Enterprise book
Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support
Hi, Have you tested this with a billion documents? Darren On Sun, 2012-05-27 at 07:24 -0700, Nagendra Nagarajayya wrote: Hi! I am very excited to announce the availability of Solr 3.6 with RankingAlgorithm 1.4.2. This NRT support now works with both RankingAlgorithm and Lucene. The insert/update performance should be about 5000 docs in about 490 ms with the MbArtists index. RankingAlgorithm 1.4.2 has multiple algorithms, improved performance over the earlier releases, supports the entire Lucene Query Syntax, ± and/or boolean queries, and can scale to more than a billion documents. You can get more information about NRT performance here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x You can download Solr 3.6 with RankingAlgorithm 1.4.2 here: http://solr-ra.tgels.org Please download and give the new version a try. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org ps. The MbArtists index is the example index used in the Solr 1.4 Enterprise book
Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support
I think people on this list would be more interested in your approach to scaling 2 billion documents than modifying solr/lucene scoring (which is already top notch). So given that, can you share any references or otherwise substantiate good performance with 2 billion documents? Thanks. On Sun, 2012-05-27 at 08:29 -0700, Nagendra Nagarajayya wrote: Actually, RankingAlgorithm 1.4.2 has been scaled to more than 2 billion docs. With RankingAlgorithm 1.4.3, using the age=latest&docs=number parameters, you can retrieve the NRT inserted documents in milliseconds from such a huge index, improving query and faceting performance and using very little resources ... Currently, RankingAlgorithm 1.4.3 is only available with Solr 4.0, and the NRT insert performance with Solr 4.0 is about 70,000 docs / sec. RankingAlgorithm 1.4.3 should become available with Solr 3.6 soon. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org On 5/27/2012 7:32 AM, Darren Govoni wrote: Hi, Have you tested this with a billion documents? Darren On Sun, 2012-05-27 at 07:24 -0700, Nagendra Nagarajayya wrote: Hi! I am very excited to announce the availability of Solr 3.6 with RankingAlgorithm 1.4.2. This NRT support now works with both RankingAlgorithm and Lucene. The insert/update performance should be about 5000 docs in about 490 ms with the MbArtists index. RankingAlgorithm 1.4.2 has multiple algorithms, improved performance over the earlier releases, supports the entire Lucene Query Syntax, ± and/or boolean queries, and can scale to more than a billion documents. You can get more information about NRT performance here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x You can download Solr 3.6 with RankingAlgorithm 1.4.2 here: http://solr-ra.tgels.org Please download and give the new version a try. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org ps. The MbArtists index is the example index used in the Solr 1.4 Enterprise book
SolrCloud war context name?
Hi, I am running my solrcloud nodes in an app server deployed into the context path 'solr' and zookeeper sees all of them. I want to deploy a second solrcloud war into the same app server (thus same IP:port) in a different context like 'solrrep' with the same config (cloned). Will this work? Or does zookeeper (or solrcloud leader) require all connected solr shards to have context url with ip:port/solr? Or will the correct URL be registered from the replica shard? thanks!
Re: SolrCloud war context name?
It's not really clear from the wiki how to use cores as shard replicas within the same solr server. In my mind, having a separate JVM/solr node acting as a replica makes sense because the replication traffic will be on a different channel in a different vm and won't interfere with search/indexing traffic on the primary shards. Or am I missing something easy about using cores with solr cloud? It was mentioned on the list recently that managing cores with solrcloud isn't really the best practice for it. On Sat, 2012-05-26 at 16:12 -0300, Marcelo Carvalho Fernandes wrote: Why not use multicore? Marcelo Carvalho Fernandes +55 21 8272-7970 On Sat, May 26, 2012 at 12:56 PM, Darren Govoni ontre...@ontrenet.com wrote: Hi, I am running my solrcloud nodes in an app server deployed into the context path 'solr' and zookeeper sees all of them. I want to deploy a second solrcloud war into the same app server (thus same IP:port) in a different context like 'solrrep' with the same config (cloned). Will this work? Or does zookeeper (or the solrcloud leader) require all connected solr shards to have a context url of ip:port/solr? Or will the correct URL be registered from the replica shard? thanks!
RE: Re: SolrCloud: how to index documents into a specific core and how to search against that core?
I'm curious what the solrcloud experts say, but my suggestion is to try not to over-engineer the search architecture on solrcloud. For example, what is the benefit of managing which cores are indexed and searched? Having to know those details, in my mind, works against the automation in solrcloud, but maybe there's a good reason you want to do it this way. --- Original Message --- On 5/22/2012 07:35 AM Yandong Yao wrote: Hi Darren, Thanks very much for your reply. The reason I want to control core indexing/searching is that I want to use one core to store one customer's data (all customers share the same config): e.g. customer 1 uses coreForCustomer1 and customer 2 uses coreForCustomer2. Is there any better way than using a different core for each customer? Another way may be to use a different collection for each customer, though I'm not sure how many collections solr cloud can support. Which way is better in terms of flexibility/scalability? (Suppose there are tens of thousands of customers.) Regards, Yandong 2012/5/22 Darren Govoni dar...@ontrenet.com Why do you want to control what gets indexed into a core and then know which core to search? That's the kind of knowing that SolrCloud solves. In SolrCloud, it handles the distribution of documents across shards and retrieves them regardless of which node is searched from. That is the point of cloud: you don't know the details of where exactly documents are being managed (i.e. they are cloudy). It can change and re-balance from time to time. SolrCloud performs the distributed search for you, therefore when you try to search a node/core with no documents, all the results from the cloud are retrieved regardless. This is considered A Good Thing. It requires a change in thinking about indexing and searching. On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote: Hi Guys, I use the following commands to start solr cloud according to the solr cloud wiki: yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar Then I have created several cores using the CoreAdmin API (http://localhost:8983/solr/admin/cores?action=CREATE&name=coreName&collection=collection1), and clusterstate.json shows the following topology: collection1: -- shard1: -- collection1 -- CoreForCustomer1 -- CoreForCustomer3 -- CoreForCustomer5 -- shard2: -- collection1 -- CoreForCustomer2 -- CoreForCustomer4 1) Index: Using the following command to index the mem.xml file in the exampledocs directory: yydzero:exampledocs bjcoe$ java -Durl=http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml SimplePostTool: version 1.4 SimplePostTool: POSTing files to http://localhost:8983/solr/coreForCustomer3/update.. SimplePostTool: POSTing file mem.xml SimplePostTool: COMMITting Solr index changes. And now the SolrAdmin UI shows that 'coreForCustomer1', 'coreForCustomer3', and 'coreForCustomer5' have 3 documents (mem.xml has 3 documents) and the other 2 cores have 0 documents. *Question 1:* Is this expected behavior? How do I index documents into a specific core? *Question 2:* If SolrCloud doesn't support this yet, how could I extend it to support this feature (index a document to a particular core)? Where should I start, the hashing algorithm? *Question 3:* Why are the documents also indexed into 'coreForCustomer1' and 'coreForCustomer5'? The default replica for documents is 1, right? Then I try to index some documents to 'coreForCustomer2': $ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar post.jar ipod_video.xml While 'coreForCustomer2' still has 0 documents, and the documents in ipod_video are indexed to the cores for customers 1/3/5. *Question 4:* Why does this happen? 2) Search: I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml to search against 'CoreForCustomer2', while it will return all documents in the whole collection even though this core has no documents at all. Then I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2, and it will return 0 documents. *Question 5:* So if I want to search against a particular core, I need to use the 'shards' parameter with the solr core name as the parameter value, right? Thanks very much in advance! Regards, Yandong
Re: SolrCloud: how to index documents into a specific core and how to search against that core?
Why do you want to control what gets indexed into a core and then know which core to search? That's the kind of knowing that SolrCloud solves. In SolrCloud, it handles the distribution of documents across shards and retrieves them regardless of which node is searched from. That is the point of cloud: you don't know the details of where exactly documents are being managed (i.e. they are cloudy). It can change and re-balance from time to time. SolrCloud performs the distributed search for you, therefore when you try to search a node/core with no documents, all the results from the cloud are retrieved regardless. This is considered A Good Thing. It requires a change in thinking about indexing and searching. On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote: Hi Guys, I use the following commands to start solr cloud according to the solr cloud wiki: yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar Then I have created several cores using the CoreAdmin API (http://localhost:8983/solr/admin/cores?action=CREATE&name=coreName&collection=collection1), and clusterstate.json shows the following topology: collection1: -- shard1: -- collection1 -- CoreForCustomer1 -- CoreForCustomer3 -- CoreForCustomer5 -- shard2: -- collection1 -- CoreForCustomer2 -- CoreForCustomer4 1) Index: Using the following command to index the mem.xml file in the exampledocs directory: yydzero:exampledocs bjcoe$ java -Durl=http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml SimplePostTool: version 1.4 SimplePostTool: POSTing files to http://localhost:8983/solr/coreForCustomer3/update.. SimplePostTool: POSTing file mem.xml SimplePostTool: COMMITting Solr index changes. And now the SolrAdmin UI shows that 'coreForCustomer1', 'coreForCustomer3', and 'coreForCustomer5' have 3 documents (mem.xml has 3 documents) and the other 2 cores have 0 documents. *Question 1:* Is this expected behavior? How do I index documents into a specific core? *Question 2:* If SolrCloud doesn't support this yet, how could I extend it to support this feature (index a document to a particular core)? Where should I start, the hashing algorithm? *Question 3:* Why are the documents also indexed into 'coreForCustomer1' and 'coreForCustomer5'? The default replica for documents is 1, right? Then I try to index some documents to 'coreForCustomer2': $ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar post.jar ipod_video.xml While 'coreForCustomer2' still has 0 documents, and the documents in ipod_video are indexed to the cores for customers 1/3/5. *Question 4:* Why does this happen? 2) Search: I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml to search against 'CoreForCustomer2', while it will return all documents in the whole collection even though this core has no documents at all. Then I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2, and it will return 0 documents. *Question 5:* So if I want to search against a particular core, I need to use the 'shards' parameter with the solr core name as the parameter value, right? Thanks very much in advance! Regards, Yandong
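On the hashing question above: distributed indexing routes each document to a shard by hashing its unique key, which is why the same id always lands on the same shard and why hand-picking a core fights the design. A simplified illustration of the idea in Python -- not Solr's actual hash function, just the general technique:

```python
import hashlib

def shard_for(doc_id, num_shards):
    # Hash the unique key so a given id always routes to the same
    # shard -- later updates and deletes then find the original copy.
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Deterministic: re-indexing "mem-1" hits the same shard every time.
assert shard_for("mem-1", 2) == shard_for("mem-1", 2)
print({d: shard_for(d, 2) for d in ["mem-1", "mem-2", "mem-3"]})
```

This also explains Question 1: the client posts to one core's URL, but the cloud layer forwards each document to whichever shard its id hashes to, regardless of which core received the request.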
Re: Distributed search between solrclouds?
The thought here is to distribute a search between two different solrcloud clusters and get ordered, ranked results across them. Is it possible? On Tue, 2012-05-15 at 18:47 -0400, Darren Govoni wrote: Hi, Would distributed search (the old way where you provide the solr host IP's etc.) still work between different solrclouds? thanks, Darren
Distributed search between solrclouds?
Hi, Would distributed search (the old way where you provide the solr host IP's etc.) still work between different solrclouds? thanks, Darren
Re: Documents With large number of fields
Was there a response to this? On Fri, 2012-05-04 at 10:27 -0400, Keswani, Nitin - BLS CTR wrote: Hi, My data model consists of different types of data, and each data type has its own characteristics. If I include the unique characteristics of each type of data, a single Solr document could end up containing 300-400 fields. In order to drill down into this data set I would have to provide faceting on most of these fields so that I can drill down to a very small set of documents. Here are some of the questions: 1) What's the best approach when dealing with documents with a large number of fields? Should I keep a single document with a large number of fields, or split my document into a number of smaller documents where each document would consist of some of the fields? 2) From an operational point of view, what's the drawback of having a single document with a very large number of fields? Can Solr support documents with a large number of fields (say 300 to 400)? Thanks. Regards, Nitin Keswani
Re: Documents With large number of fields
I'm also interested in this. Same situation. On Fri, 2012-05-04 at 10:27 -0400, Keswani, Nitin - BLS CTR wrote: Hi, My data model consists of different types of data, and each data type has its own characteristics. If I include the unique characteristics of each type of data, a single Solr document could end up containing 300-400 fields. In order to drill down into this data set I would have to provide faceting on most of these fields so that I can drill down to a very small set of documents. Here are some of the questions: 1) What's the best approach when dealing with documents with a large number of fields? Should I keep a single document with a large number of fields, or split my document into a number of smaller documents where each document would consist of some of the fields? 2) From an operational point of view, what's the drawback of having a single document with a very large number of fields? Can Solr support documents with a large number of fields (say 300 to 400)? Thanks. Regards, Nitin Keswani
SolrCloud indexing question
Hi, I just wanted to make sure I understand how distributed indexing works in solrcloud. Can I index locally at each shard to avoid throttling a central port? Or all the indexing has to go through a single shard leader? thanks
Re: SolrCloud indexing question
Gotcha. Now, if I have 5 threads all writing to a local shard, will that shard piggyback those index requests onto a SINGLE connection to the leader? Or will they spawn 5 connections from the shard to the leader? I really hope the former; the latter won't scale well. On Fri, 2012-04-20 at 10:28 -0400, Jamie Johnson wrote: my understanding is that you can send your updates/deletes to any shard and they will be forwarded to the leader automatically. That being said, your leader will always be the place where the indexing happens, and it is then distributed to the other replicas. On Fri, Apr 20, 2012 at 7:54 AM, Darren Govoni dar...@ontrenet.com wrote: Hi, I just wanted to make sure I understand how distributed indexing works in solrcloud. Can I index locally at each shard to avoid throttling a central port? Or does all the indexing have to go through a single shard leader? thanks
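Whatever the forwarding layer does with connections, the standard client-side way to keep the connection count to the leader low is to batch many small updates into a few large requests rather than one request per document. A hedged sketch of the batching step in plain Python (not SolrJ; the posting itself is omitted):

```python
def batch(docs, size=100):
    # Group many small updates into a few large requests, so the
    # sender opens far fewer connections to the leader.
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

updates = [{"id": str(n)} for n in range(250)]
print([len(b) for b in batch(updates)])  # [100, 100, 50]
```

Each yielded group would then be posted as a single multi-document update request, so 250 documents cost 3 requests instead of 250.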
Re: Opposite to MoreLikeThis?
You could run the MLT query for the document in question, then gather all the doc ids in the MLT results and negate them in a subsequent query. Not sure how robust that would be with very large result sets, but it's something to try. Another approach would be to gather the interesting terms from the document in question and then negate those terms in subsequent queries. Perhaps with many negated terms, Solr will rank results with the most negated terms above those with fewer, simulating a ranked 'less like' effect. On Fri, 2012-04-20 at 15:38 -0700, Charlie Maroto wrote: Hi all, Is there a way to implement the opposite of MoreLikeThis (LessLikeThis, I guess :)? The requirement we have is to remove all documents with content like that of a given document id or a text provided by the end-user. In the current index implementation (not using Solr), the user can narrow results by indicating which document(s) are not relevant to him and then request the removal from the search results of any document whose content is like that of the selected document(s). Our index has close to 100 million documents and they cover multiple topics that are not related to one another. So, a search for some broad terms may retrieve documents about engineering, agriculture, communications, etc. As the user is trying to discover the relevant documents, he may select an agriculture-related document to exclude it, and those documents like it, from the results set; same with engineering-like content, etc., until most of the documents are about communications. Of course, some exclusions may actually remove relevant content, but those filters can be removed to go back to the previous set of results. Any ideas from similar implementations or suggestions are welcomed! Thanks, Carlos
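The first suggestion above -- negate the ids returned by an MLT run -- is just query assembly on the client. A sketch in Python, assuming the unique key field is named id and standard Lucene query syntax (both assumptions, adjust for the actual schema):

```python
def less_like_this(base_query, mlt_ids):
    # Append a NOT clause excluding the ids an MLT run returned,
    # approximating a "less like this" search.
    if not mlt_ids:
        return base_query
    clause = " ".join("id:%s" % doc_id for doc_id in mlt_ids)
    return "%s -(%s)" % (base_query, clause)

print(less_like_this("engineering", ["12", "99"]))
# engineering -(id:12 id:99)
```

As noted above, with very large MLT result sets the exclusion list (and thus the query) can grow unwieldy, so capping the number of excluded ids per round is probably wise.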
Re: hierarchical faceting?
Put the parent term in all the child documents at index time and then re-issue the facet query when you expand the parent, using the parent's term. Works perfectly. On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote: I have hierarchical colors: <field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/> text_path is a TextField with PathHierarchyTokenizerFactory as the tokenizer. Given these two documents, Doc1: red Doc2: red/pink I want the result to be the following: ?fq=red == Doc1, Doc2 ?fq=red/pink == Doc2 But with PathHierarchyTokenizer, Doc1 is included for the query: ?fq=red/pink == Doc1, Doc2 How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix, but it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
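The index-time trick described above -- store every ancestor of a path in the child document, so faceting on the parent term matches all children -- can be sketched as a small expansion step (the taxonomy management the reply mentions happening outside Solr):

```python
def ancestor_paths(path, sep="/"):
    # Expand "red/pink" into ["red", "red/pink"] so a document indexed
    # with all its ancestors matches a facet query on any parent term.
    parts = path.split(sep)
    return [sep.join(parts[:i + 1]) for i in range(len(parts))]

print(ancestor_paths("red/pink"))       # ['red', 'red/pink']
print(ancestor_paths("red/pink/rose"))  # ['red', 'red/pink', 'red/pink/rose']
```

Index the returned list into a multivalued string field: Doc2 then carries both "red" and "red/pink", so fq on "red" matches Doc1 and Doc2 while fq on "red/pink" matches only Doc2 -- exactly the behavior the question asks for, with no special tokenizer.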
Re: hierarchical faceting?
I don't use any of that stuff in my app, so I'm not sure how it works. I just manage my taxonomy outside of solr at index time and don't need any special fields or tokenizers. I use a string field type and insert the proper field at index time and query it normally. Nothing special required. On Wed, 2012-04-18 at 13:00 -0400, sam ” wrote: It looks like TextField is the problem. This fixed it: <fieldType name="text_path" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> </analyzer> </fieldType> I am assuming the text_path fields won't include whitespace characters. ?q=colors:red/pink == Doc2 (Doc1, which has colors = red, isn't included!) Is there a tokenizer that tokenizes the string as one token? I tried to extend Tokenizer myself but it fails: public class AsIsTokenizer extends Tokenizer { @Override public boolean incrementToken() throws IOException { return true; //or false; } } On Wed, Apr 18, 2012 at 11:33 AM, sam ” skyn...@gmail.com wrote: Yah, that's exactly what PathHierarchyTokenizer does. <fieldType name="text_path" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.PathHierarchyTokenizerFactory"/> </analyzer> </fieldType> I think I have a query time tokenizer that tokenizes at / ?q=colors:red == Doc1, Doc2 ?q=colors:redfoobar == ?q=colors:red/foobarasdfoaijao == Doc1, Doc2 On Wed, Apr 18, 2012 at 11:10 AM, Darren Govoni dar...@ontrenet.com wrote: Put the parent term in all the child documents at index time and then re-issue the facet query when you expand the parent, using the parent's term. Works perfectly. On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote: I have hierarchical colors: <field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/> text_path is a TextField with PathHierarchyTokenizerFactory as the tokenizer.
Given these two documents, Doc1: red Doc2: red/pink I want the result to be the following: ?fq=red == Doc1, Doc2 ?fq=red/pink == Doc2 But, with PathHierarchyTokenizer, Doc1 is included for the query: ?fq=red/pink == Doc1, Doc2 How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix.. But it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
Re: Monitoring SolrCloud health
Can you be more specific about health? On Sat, 2012-04-14 at 00:03 -0400, Jamie Johnson wrote: How do people currently monitor the health of a solr cluster? Are there any good tools which can show the health across the entire cluster? Is this something which is planned for the new admin user interface?
RE: Realtime /get versus SearchHandler
Yes --- Original Message --- On 4/13/2012 06:25 AM Benson Margulies wrote: A discussion over on the dev list led me to expect that the by-id field retrievals in a SolrCloud query would come through the get handler. In fact, I've seen them turn up in my search component in the search handler that is configured with my custom QT. (I have a 'prepare' method that sets ShardParams.QT to my QT to get my processing involved in the first of the two queries.) Did I overthink this?
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
You could use SolrCloud (for the automatic scaling) and just mount a fuse[1] HDFS directory and configure solr to use that directory for its data. [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote: Hi, I'm trying to setup a large scale *Crawl + Index + Search *infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, crawled + indexed every *4 weeks, *with a search latency of less than 0.5 seconds. Needless to mention, the search index needs to scale to 5Billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment was flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts). However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above. 
Lastly, how much hardware (assuming a medium sized EC2 instance) would you estimate my needing with this setup, for regular web-data (HTML text) at this scale? Any architectural guidance would be greatly appreciated. The more details provided, the wider my grin :). Many many thanks in advance. Thanks, Safdar
RE: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
Solrcloud or any other tech-specific replication isn't going to 'just work' with hadoop replication. But with some significant custom coding anything should be possible. Interesting idea. --- Original Message --- On 4/12/2012 09:21 AM Ali S Kureishy wrote: Thanks Darren. Actually, I would like the system to be homogenous - i.e., use Hadoop-based tools that already provide all the necessary scaling for the lucene index (in terms of throughput, latency of writes/reads etc). Since SolrCloud adds its own layer of sharding/replication that is outside Hadoop, I feel that using SolrCloud would be redundant, and a step in the opposite direction, which is what I'm trying to avoid in the first place. Or am I mistaken? Thanks, Safdar On Thu, Apr 12, 2012 at 4:27 PM, Darren Govoni dar...@ontrenet.com wrote: You could use SolrCloud (for the automatic scaling) and just mount a fuse[1] HDFS directory and configure solr to use that directory for its data. [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote: Hi, I'm trying to set up a large scale *Crawl + Index + Search* infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 billion web pages*, crawled + indexed every *4 weeks*, with a search latency of less than 0.5 seconds. Needless to mention, the search index needs to scale to 5 billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment was flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts). However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above. Lastly, how much hardware (assuming a medium sized EC2 instance) would you estimate my needing with this setup, for regular web-data (HTML text) at this scale? Any architectural guidance would be greatly appreciated. The more details provided, the wider my grin :). Many many thanks in advance. Thanks, Safdar
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
Hard to say why it's not working for you. Start with a fresh Solr and work forward from there, or back out your configs and plugins until it works again. On Tue, 2012-04-10 at 17:15 -0400, Benson Margulies wrote: In my cloud configuration, if I push <delete><query>*:*</query></delete> followed by <commit/>, I get no errors and the log looks happy enough, but the documents remain in the index, visible to /query. Here's what seems the relevant bit of my solrconfig.xml. My URP only implements processAdd. <updateRequestProcessorChain name="RNI"> <!-- some day, add parameters when we have some --> <processor class="com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory"/> <processor class="solr.LogUpdateProcessorFactory"/> <processor class="solr.DistributedUpdateProcessorFactory"/> <processor class="solr.RunUpdateProcessorFactory"/> </updateRequestProcessorChain> <!-- activate RNI processing by adding the RNI URP to the chain for xml updates --> <requestHandler name="/update" class="solr.XmlUpdateRequestHandler"> <lst name="defaults"> <str name="update.chain">RNI</str> </lst> </requestHandler>
RE: SOLR issue - too many search queries
My first reaction to your question is: why are you running thousands of queries in a loop? Immediately, I think this will not scale well and the design probably needs to be re-visited. Second, if you need that many requests, then you need to seriously consider an architecture that supports it. This will require a complex design involving load balancers, multiple servers, replication, etc. People have achieved this with Solr, but it's beyond the scope of Solr itself to provide this, as it's a matter of system architecture. Also, there are limits to the number of app server threads allowed, OS threads allowed, OS sockets, OS file descriptors, etc. All of these need to be understood, designed for, and configured properly. --- Original Message --- On 4/10/2012 07:51 AM arunssasidhar wrote: We have a PHP web application which is using SOLR for searching. The app is using CURL to connect to the SOLR server, and it runs in a loop with thousands of predefined keywords. That will create thousands of different search queries to SOLR at a given time. My issue is that when a single user is logged into the app, everything works as expected. When more than one user tries to run the app, we get this response from the server: Failed to connect to xxx.xxx.xxx.xxx: Cannot assign requested address Failed to connect to xxx.xxx.xxx.xxx: Cannot assign requested address Failed ... Our assumption is that the SOLR server is unable to handle this many search queries at a given time. If so, what is the solution to overcome this? Is there any setting like keep-alive in SOLR? Any help would be highly appreciated. Thanks, Arun S -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-issue-too-many-search-queries-tp3899518p3899518.html Sent from the Solr - User mailing list archive at Nabble.com.
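Beyond the architectural points above, one immediate client-side mitigation for the quoted problem is to stop issuing one HTTP request per keyword: OR-ing keywords into combined queries collapses thousands of connections into a handful. A hedged sketch of the query-building step in Python (batch size and quoting are assumptions; the original app is PHP/CURL):

```python
def keyword_batches(keywords, per_query=20):
    # Combine many single-keyword searches into far fewer boolean OR
    # queries, cutting the number of simultaneous HTTP connections.
    for i in range(0, len(keywords), per_query):
        group = keywords[i:i + per_query]
        yield " OR ".join('"%s"' % kw for kw in group)

print(list(keyword_batches(["solr", "lucene", "search"], per_query=2)))
# ['"solr" OR "lucene"', '"search"']
```

The "Cannot assign requested address" error typically means the client has exhausted local ephemeral ports by opening a new connection per query, so fewer, reused connections attack the symptom directly as well.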
RE: Re: Cloud-aware request processing?
...it is a distributed real-time query scheme... SolrCloud does this already. It treats all the shards like one big index, and you can query it normally to get subset results from each shard. Why do you have to re-write the query for each shard? Seems unnecessary. --- Original Message --- On 4/9/2012 08:45 AM Benson Margulies wrote: Jan Høydahl, My problem is intimately connected to Solr. It is not a batch job for Hadoop, it is a distributed real-time query scheme. I hate to add yet another complex framework if a Solr RP can do the job simply. For this problem, I can transform a Solr query into a subset query on each shard, and then let the SolrCloud mechanism. I am well aware of the 'zoo' of alternatives, and I will be evaluating them if I can't get what I want from Solr. On Mon, Apr 9, 2012 at 9:34 AM, Jan Høydahl jan@cominvent.com wrote: Hi, Instead of using Solr, you may want to have a look at Hadoop or another framework for distributed computation, see e.g. http://java.dzone.com/articles/comparison-gridcloud-computing -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 9. apr. 2012, at 13:41, Benson Margulies wrote: I'm working on a prototype of a scheme that uses SolrCloud to, in effect, distribute a computation by running it inside of a request processor. If there are N shards and M operations, I want each node to perform M/N operations. That, of course, implies that I know N. Is that fact available anyplace inside Solr, or do I need to just configure it?
Re: How to facet data from a multivalued field?
The field type for that field should be looked at. Try not using a type that tokenizes or stems the field; you want to leave the text as is. I forget the exact setting, but it's documented in there somewhere. On Mon, 2012-04-09 at 13:02 -0700, Thiago wrote: Hello everybody, I've already searched this topic in the forum, but I didn't find any case like this. I apologize if this topic has already been discussed. I'm having a problem faceting a multivalued field. My field is called series, and it has names of TV series like The Big Bang Theory, Two and a Half Men... In this field I can have a lot of TV series names. For example: <arr name="series"> <str>Two and a Half Men</str> <str>How I Met Your Mother</str> <str>The Big Bang Theory</str> </arr> What I want to do is: search and count how many documents relate to each series. I'm doing it using facet search on this field. But it's returning each word separately, like this: <lst name="facet_counts"> <lst name="facet_queries"/> <lst name="facet_fields"> <lst name="series"> <int name="bang">91</int> <int name="big">91</int> <int name="half">21</int> <int name="how">45</int> <int name="i">45</int> <int name="men">21</int> <int name="met">45</int> <int name="mother">45</int> <int name="theori">91</int> <int name="two">21</int> <int name="your">45</int> </lst> </lst> <lst name="facet_dates"/> <lst name="facet_ranges"/> </lst> And what I want is something like: <lst name="facet_counts"> <lst name="facet_queries"/> <lst name="facet_fields"> <lst name="series"> <int name="Two and a Half Men">21</int> <int name="How I Met Your Mother">45</int> <int name="The Big Bang Theory">91</int> </lst> </lst> <lst name="facet_dates"/> <lst name="facet_ranges"/> </lst> Is there any possible way to do it with facet search? I don't want the terms, I just want each string including the white spaces. Do I have to change my fieldtype to do this? Thanks to everybody. Thiago -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-facet-data-from-a-multivalued-field-tp3897853p3897853.html Sent from the Solr - User mailing list archive at Nabble.com.
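For reference, the standard fix here is to facet on an untokenized copy of the field. A sketch of the schema.xml change (field names other than `series` are assumptions, not from the thread):

```xml
<!-- Hypothetical schema.xml fragment: keep the analyzed field for full-text
     search, and facet on an untokenized string copy instead. -->
<field name="series"       type="text"   indexed="true" stored="true"  multiValued="true"/>
<field name="series_facet" type="string" indexed="true" stored="false" multiValued="true"/>
<copyField source="series" dest="series_facet"/>
```

Faceting with facet.field=series_facet then returns whole values such as "The Big Bang Theory" instead of individual stemmed tokens like "theori".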
No webadmin for trunk?
Hi, Just updated Solr trunk, tried java -jar start.jar, and localhost:8983/solr/admin ... not found. Where did it go? thanks.
Re: No webadmin for trunk?
HTTP ERROR: 404 Problem accessing /solr. Reason: Not Found Powered by Jetty:// On Sat, 2012-04-07 at 09:04 -0400, Jamie Johnson wrote: just go to localhost:8983/solr and you'll see the updated interface. On Sat, Apr 7, 2012 at 8:23 AM, Darren Govoni dar...@ontrenet.com wrote: Hi, Just updated solr trunk and tried the java -jar start.jar and localhost:8983/solr/admin.not found. Where did it go? thanks.
Re: No webadmin for trunk?
start.jar has no apps in it at all. On Sat, 2012-04-07 at 09:47 -0400, Darren Govoni wrote: HTTP ERROR: 404 Problem accessing /solr. Reason: Not Found Powered by Jetty:// On Sat, 2012-04-07 at 09:04 -0400, Jamie Johnson wrote: just go to localhost:8983/solr and you'll see the updated interface. On Sat, Apr 7, 2012 at 8:23 AM, Darren Govoni dar...@ontrenet.com wrote: Hi, Just updated solr trunk and tried the java -jar start.jar and localhost:8983/solr/admin.not found. Where did it go? thanks.
Re: No webadmin for trunk?
Yep. I did all kinds of ant clean, ant dist, ant example, etc. My trunk rev: At revision 1310773. The example start.jar is broken. No webapp inside. :( On Sat, 2012-04-07 at 16:11 +0200, Rafał Kuć wrote: Hello! Did you run 'ant example'?
Re: No webadmin for trunk?
K. There is a solr.war in the webapps directory. But I still get the 404. On Sat, 2012-04-07 at 16:19 +0200, Rafał Kuć wrote: Hello! start.jar shouldn't contain any webapp. If you look at the 'example' directory, you'll notice that there is a 'webapps' directory which should contain the solr.war file. Btw. revision 1307647 works without a problem. I'll checkout trunk in a few and try with the newest revision.
Re: No webadmin for trunk?
Now it comes up. Not sure why it's acting weird. Will continue to look at it. On Sat, 2012-04-07 at 10:23 -0400, Darren Govoni wrote: K. There is a solr.war in the webapps directory. But still get the 404. On Sat, 2012-04-07 at 16:19 +0200, Rafał Kuć wrote: Hello! start.jar shouldn't contain any webapp. If you look at the 'example' directory, you'll notice that there is a 'webapps' directory which should contain the solr.war file. Btw. revision 1307647 works without a problem. I'll checkout trunk in a few and try with the newest revision.
Re: upgrade 3.5 to 4.0
In my opinion, it's never a good idea to overwrite files of a previous version with a new version. The easiest thing would be to just deploy the Solr war file into Tomcat and let Tomcat manage the webapp, files, etc. On Sat, 2012-04-07 at 22:39 -0400, Dan Foley wrote: I have downloaded the nightly snapshot of v4.0 and would like to install it to my Tomcat install of Solr 3.5. Can I simply overwrite the current files, or is there a correct method for doing so? please advise.. thanks
Re: Does any one know when Solr 4.0 will be released.
No one knows. But if you ask the devs, they will say 'when it's done'. One clue might be to monitor the bugs/issues scheduled for 4.0; when they are all resolved, then it's ready. On Wed, 2012-04-04 at 09:41 -0700, srinivas konchada wrote: Hello everyone, Does anyone know when Solr 4.0 will be released? There is a specific feature that exists in 4.0 which we want to take advantage of. The problem is we cannot deploy something into production from trunk. We need to use an official release. Thanks Srinivas Konchada
Re: Duplicates in Facets
Try using Luke to look at your index and see if there are multiple similar TFVs. You can browse them easily in Luke. On Wed, 2012-04-04 at 23:35 -0400, Jamie Johnson wrote: I am currently indexing some information and am wondering why I am getting duplicates in facets. From what I can tell they are the same, but is there any case that could cause this that I may not be thinking of? Could this be some non-printable character making its way into the index? Sample output from Luke: <lst name="fields"> <lst name="organization_umvs"> <str name="type">string</str> <str name="schema">I--M---OFl</str> <str name="dynamicBase">*_umvs</str> <str name="index">(unstored field)</str> <int name="docs">332</int> <int name="distinct">-1</int> <lst name="topTerms"> <int name="ORGANIZATION 1">328</int> <int name="ORGANIZATION 2">124</int> <int name="ORGANIZATION 2">36</int> <int name="ORGANIZATION 2">20</int> <int name="ORGANIZATION 3">4</int> </lst>
Custom scoring question
Hi, I have a situation where I want to re-score document relevance. Let's say I have two fields: text: The quick brown fox jumped over the white fence. terms: fox fence Now my queries come in as: terms:[* TO *] and Solr scores them on that field. What I want is to rank them according to the distribution of field terms within field text, which is a per-document calculation. Can this be done with any kind of dismax? I'm not searching for known terms at query time. If not, what is the best way to implement a custom scoring handler to perform this calculation and re-score/sort the results? thanks for any tips!!!
Re: Custom scoring question
I'm going to try index time per-field boosting and do the boost computation at index time and see if that helps. On Thu, 2012-03-29 at 10:08 -0400, Darren Govoni wrote: Hi, I have a situation I want to re-score document relevance. Let's say I have two fields: text: The quick brown fox jumped over the white fence. terms: fox fence Now my queries come in as: terms:[* TO *] and Solr scores them on that field. What I want is to rank them according to the distribution of field terms within field text. Which is a per document calculation. Can this be done with any kind of dismax? I'm not searching for known terms at query time. If not, what is the best way to implement a custom scoring handler to perform this calculation and re-score/sort the results? thanks for any tips!!!
Re: Custom scoring question
Yeah, I guess that would work. I wasn't sure if it would change relative to other documents. But if it were to be combined with other fields, that approach may not work, because the calculation wouldn't include the scoring for other parts of the query. So then you have the dynamic score and what to do with it. On Thu, 2012-03-29 at 16:29 -0300, Tomás Fernández Löbbe wrote: Can't you simply calculate that at index time and assign the result to a field, then sort by that field? On Thu, Mar 29, 2012 at 12:07 PM, Darren Govoni dar...@ontrenet.com wrote: I'm going to try index time per-field boosting and do the boost computation at index time and see if that helps. On Thu, 2012-03-29 at 10:08 -0400, Darren Govoni wrote: Hi, I have a situation I want to re-score document relevance. Let's say I have two fields: text: The quick brown fox jumped over the white fence. terms: fox fence Now my queries come in as: terms:[* TO *] and Solr scores them on that field. What I want is to rank them according to the distribution of field terms within field text. Which is a per document calculation. Can this be done with any kind of dismax? I'm not searching for known terms at query time. If not, what is the best way to implement a custom scoring handler to perform this calculation and re-score/sort the results? thanks for any tips!!!
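The per-document calculation discussed in this thread can be sketched as a small index-time helper (a hypothetical illustration, not Solr API; the idea, per the index-time suggestion above, is to compute the value once per document and store it in a sortable field):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical index-time helper: score a document by how densely the
// tokens of its "terms" field occur within its "text" field.
public class TermDensity {
    public static double score(String text, String terms) {
        String[] textToks = text.toLowerCase().split("\\W+");
        Set<String> termSet =
            new HashSet<String>(Arrays.asList(terms.toLowerCase().split("\\W+")));
        int hits = 0;
        for (String tok : textToks) {
            if (!tok.isEmpty() && termSet.contains(tok)) hits++;
        }
        // Fraction of text tokens that are query terms; store this in a
        // numeric field and sort or boost on it at query time.
        return textToks.length == 0 ? 0.0 : (double) hits / textToks.length;
    }
}
```

For the thread's example ("fox fence" against the fence sentence) this yields 2 matching tokens out of 9. The value would be written into, say, a `terms_density` field (name assumed) at index time and used with sort=terms_density desc, sidestepping the problem of the dynamic score interacting with other query clauses.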
MLT and solrcloud?
Hi, It was mentioned before that SolrCloud has all the capability of regular Solr (including handlers) with the exception of the MLT handler. As this is a key capability for Solr, is there work planned to include MLT in SolrCloud? If so, when? Our efforts greatly depend on it. As such, I'm happy to help any way possible. thanks, Darren
Re: MLT and solrcloud?
Ok, I'll do what I can to help! As always, appreciate the hard work, Mark. On Thu, 2012-03-22 at 17:31 -0400, Mark Miller wrote: On Mar 22, 2012, at 5:22 PM, Darren Govoni wrote: Hi, It was mentioned before that SolrCloud has all the capability of regular Solr (including handlers) with the exception of the MLT handler. As this is a key capability for Solr, is there work planned to include MLT in SolrCloud? If so, when? Our efforts greatly depend on it. As such, I'm happy to help any way possible. thanks, Darren Usually no real timetables here :) Depends on who jumps in when. Some work has already gone on for this here: https://issues.apache.org/jira/browse/SOLR-788 You might just try and jump-start that issue again? As I get a free moment or two, I'm happy to help commit a solution. - Mark Miller lucidimagination.com
RE: Re: maxClauseCount Exception
True, but how can you find documents containing that field without expanding 1000 clauses? --- Original Message --- On 3/19/2012 07:24 AM Erick Erickson wrote: bq: So all I want to do is a simple all docs with something in this field, and to highlight the field But that doesn't really make sense to do at the Solr/Lucene level. All you're saying is that you want that field highlighted. Wouldn't it be much easier to just do this at the app level whenever your field had anything returned in it? Best Erick On Sat, Mar 17, 2012 at 8:07 PM, Darren Govoni dar...@ontrenet.com wrote: Thanks for the tip Hoss. I notice that it appears sometimes and was varying because my index runs would sometimes have different amounts of docs, etc. So all I want to do is a simple all docs with something in this field, and to highlight the field. Is the query expansion to all possible terms in the index really necessary? I could have 100's of thousands of possible terms. Why should they all become explicit query elements? Seems overkill and underperformant. Is there another way with Lucene, or not really? On Thu, 2012-03-08 at 16:18 -0800, Chris Hostetter wrote: : I am suddenly getting a maxClauseCount exception for no reason. I am : using Solr 3.5. I have only 206 documents in my index. Unless things have changed, the reason you are seeing this is because _highlighting_ a query (clause) like type_s:[*+TO+*] requires rewriting it into a giant boolean query of all the terms in that field -- so even if you only have 206 docs, if you have more than 206 values in that field in your index, you're going to go over 1024 terms.
(you don't get this problem in a basic query, because it doesn't need to enumerate all the terms, it rewrites it to a ConstantScoreQuery) what you most likely want to do, is move some of those clauses like type_s:[*+TO+*] and usergroup_sm:admin out of your main q query and into fq filters ... so they can be cached independently, won't contribute to scoring (just matching) and won't be used in highlighting. : params={hl=true&hl.snippets=4&hl.simple.pre=<b>&hl.simple.post=</b>&fl=*,score&hl.mergeContiguous=true&hl.usePhraseHighlighter=true&hl.requireFieldMatch=true&echoParams=all&hl.fl=text_t&q={!lucene+q.op%3DOR+df%3Dtext_t}+(+kind_s:doc+OR+kind_s:xml)+AND+(type_s:[*+TO+*])+AND+(usergroup_sm:admin)&rows=20&start=0&wt=javabin&version=2} hits=204 status=500 QTime=166 |#] : [#|2012-02-22T13:40:13.131-0500|SEVERE|glassfish3.1.1| : org.apache.solr.servlet.SolrDispatchFilter| : _ThreadID=22;_ThreadName=Thread-2;|org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024 : at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136) : ... : at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:304) : at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158) -Hoss
Re: Inconsistent Results with ZooKeeper Ensemble and Four SOLR Cloud Nodes
I think he's asking if all the nodes (same machine or not) return a response. Presumably you have different ports for each node since they are on the same machine. On Sun, 2012-03-18 at 14:44 -0400, Matthew Parker wrote: The cluster is running on one machine. On Sun, Mar 18, 2012 at 2:07 PM, Mark Miller markrmil...@gmail.com wrote: From every node in your cluster you can hit http://MACHINE1:8084/solr in your browser and get a response? On Mar 18, 2012, at 1:46 PM, Matthew Parker wrote: My cloud instance finally tried to sync. It looks like it's having connection issues, but I can bring the SOLR instance up in the browser so I'm not sure why it cannot connect to it. I got the following condensed log output: org.apache.commons.httpclient.HttpMethodDirector executeWithRetry I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect org.apache.commons.httpclient.HttpMethodDirector executeWithRetry I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect org.apache.commons.httpclient.HttpMethodDirector executeWithRetry I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect Retrying request shard update error StdNode: http://MACHINE1:8084/solr/:org.apache.solr.client.solrj.SolrServerException: http://MACHINE1:8084/solr at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java: 483) .. .. .. Caused by: java.net.ConnectException: Connection refused: connect at java.net.DualStackPlainSocketImpl.connect0(Native Method) .. .. .. try and ask http://MACHINE1:8084/solr to recover Could not tell a replica to recover org.apache.solr.client.solrj.SolrServerException: http://MACHINE1:8084/solr at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:483) ... ... ... 
Caused by: java.net.ConnectException: Connection refused: connect at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method) .. .. .. On Sat, Mar 17, 2012 at 10:10 PM, Mark Miller markrmil...@gmail.com wrote: Nodes talk to ZooKeeper as well as to each other. You can see the addresses they are trying to use to communicate with each other in the 'cloud' view of the Solr Admin UI. Sometimes you have to override these, as the detected default may not be an address that other nodes can reach. As a limited example: for some reason my mac cannot talk to my linux box with its default detected host address of halfmetal:8983/solr - but the mac can reach my linux box if I use halfmetal.Local - so I have to override the published address of my linux box using the host attribute if I want to set up a cluster between my macbook and linux box. Each node talks to ZooKeeper to learn about the other nodes, including their addresses. Recovery is then done node to node using the appropriate addresses. - Mark Miller lucidimagination.com On Mar 16, 2012, at 3:00 PM, Matthew Parker wrote: I'm still having issues replicating in my work environment. Can anyone explain how the replication mechanism works? Is it communicating across ports or through zookeeper to manage the process? On Thu, Mar 8, 2012 at 10:57 PM, Matthew Parker mpar...@apogeeintegration.com wrote: All, I recreated the cluster on my machine at home (Windows 7, Java 1.6.0.23, apache-solr-4.0-2012-02-29_09-07-30), sent some documents through Manifold using its crawler, and it looks like it's replicating fine once the documents are committed. This must be related to my environment somehow. Thanks for your help.
Regards, Matt On Fri, Mar 2, 2012 at 9:06 AM, Erick Erickson erickerick...@gmail.com wrote: Matt: Just for paranoia's sake, when I was playing around with this (the _version_ thing was one of my problems too) I removed the entire data directory as well as the zoo_data directory between experiments (and recreated just the data dir). This included various index.2012 files and the tlog directory, on the theory that *maybe* there was some confusion happening on startup with an already-wonky index. If you have the energy and tried that, it might be helpful information, but it may also be a total red herring. FWIW, Erick On Thu, Mar 1, 2012 at 8:28 PM, Mark Miller markrmil...@gmail.com wrote: I'm assuming the Windows configuration looked correct? Yeah, so far I cannot spot any smoking gun... I'm confounded at the moment. I'll re-read through everything once more... - Mark
Re: maxClauseCount Exception
Thanks for the tip Hoss. I notice that it appears sometimes and was varying because my index runs would sometimes have different amounts of docs, etc. So all I want to do is a simple all docs with something in this field, and to highlight the field. Is the query expansion to all possible terms in the index really necessary? I could have 100's of thousands of possible terms. Why should they all become explicit query elements? Seems overkill and underperformant. Is there another way with Lucene, or not really? On Thu, 2012-03-08 at 16:18 -0800, Chris Hostetter wrote: : I am suddenly getting a maxClauseCount exception for no reason. I am : using Solr 3.5. I have only 206 documents in my index. Unless things have changed, the reason you are seeing this is because _highlighting_ a query (clause) like type_s:[*+TO+*] requires rewriting it into a giant boolean query of all the terms in that field -- so even if you only have 206 docs, if you have more than 206 values in that field in your index, you're going to go over 1024 terms. (you don't get this problem in a basic query, because it doesn't need to enumerate all the terms, it rewrites it to a ConstantScoreQuery) what you most likely want to do, is move some of those clauses like type_s:[*+TO+*] and usergroup_sm:admin out of your main q query and into fq filters ... so they can be cached independently, won't contribute to scoring (just matching) and won't be used in highlighting.
: params={hl=true&hl.snippets=4&hl.simple.pre=<b>&hl.simple.post=</b>&fl=*,score&hl.mergeContiguous=true&hl.usePhraseHighlighter=true&hl.requireFieldMatch=true&echoParams=all&hl.fl=text_t&q={!lucene+q.op%3DOR+df%3Dtext_t}+(+kind_s:doc+OR+kind_s:xml)+AND+(type_s:[*+TO+*])+AND+(usergroup_sm:admin)&rows=20&start=0&wt=javabin&version=2} hits=204 status=500 QTime=166 |#] : [#|2012-02-22T13:40:13.131-0500|SEVERE|glassfish3.1.1| : org.apache.solr.servlet.SolrDispatchFilter| : _ThreadID=22;_ThreadName=Thread-2;|org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024 : at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136) : ... : at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:304) : at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158) -Hoss
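Hoss's q-vs-fq suggestion, sketched concretely as request parameters (reconstructed from the logged query in this thread; exact values are assumptions): keep only the scored clauses in q and move the match-only clauses into cached fq filters, which are never expanded for highlighting.

```
# before: everything in q (range clause gets expanded for highlighting)
q={!lucene q.op=OR df=text_t}(kind_s:doc OR kind_s:xml) AND (type_s:[* TO *]) AND (usergroup_sm:admin)

# after: match-only clauses as filter queries
q={!lucene q.op=OR df=text_t}(kind_s:doc OR kind_s:xml)
fq=type_s:[* TO *]
fq=usergroup_sm:admin
```

The fq clauses are cached independently, do not contribute to scoring, and do not participate in highlighting, so the TooManyClauses rewrite never happens.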
RE: Solr 4.0 and production environments
As a rule of thumb, many will say not to go to production with a pre-release baseline. So until Solr 4 goes final and stable, it's best not to assume too much about it. Second suggestion is to properly stage new technologies in your product such that they go through their own validation. And so to that end, jump right in and start using Solr 4 and see for yourself! It's a great technology. --- Original Message --- On 3/7/2012 11:47 AM Dirceu Vieira wrote: Hi All, Has anybody started using Solr 4.0 in production environments? Is it stable enough? I'm planning to create a proof of concept using Solr 4.0; we have some projects that will gain a lot from features such as near-real-time search, joins and others, that are available only in version 4. Is it too risky to think of using it right now? What are your thoughts and experiences with that? Best regards, -- Dirceu Vieira Júnior --- +47 9753 2473 dirceuvjr.blogspot.com twitter.com/dirceuvjr
Re: Building a resilient cluster
What I think was mentioned on this a bit ago is that the index stops working if one of the nodes goes down, unless it's a replica. You have 2 nodes running with numShards=2? Thus if one goes down, the entire index is inoperable. In the future I'm hoping this changes such that the index cluster continues to operate but will lack results from the downed node. Maybe this has changed in recent trunk updates though. Not sure. On Mon, 2012-03-05 at 20:49 -0800, Ranjan Bagchi wrote: Hi Mark, So I tried this: started up one instance w/ zookeeper, and started a second instance defining a shard name in solr.xml -- it worked, searching would search both indices, and looking at the zookeeper ui, I'd see the second shard. However, when I brought the second server down -- the first one stopped working: it didn't kick the second shard out of the cluster. Any way to do this? Thanks, Ranjan From: Mark Miller markrmil...@gmail.com To: solr-user@lucene.apache.org Cc: Date: Wed, 29 Feb 2012 22:57:26 -0500 Subject: Re: Building a resilient cluster Doh! Sorry - this was broken - I need to fix the doc or add it back. The shard id is actually set in solr.xml since it's per core - the sys prop was a sugar option we had set up. So either add 'shard' to the core in solr.xml, or to make it work like it does in the doc, do: <core name="collection1" shard="${shard:}" instanceDir="." /> That sets shard to the 'shard' system property if it's set, or as a default, acts as if it wasn't set. I've been working with custom shard ids mainly through solrj, so I hadn't noticed this. - Mark On Wed, Feb 29, 2012 at 10:36 AM, Ranjan Bagchi ranjan.bag...@gmail.com wrote: Hi, At this point I'm ok with one zk instance being a point of failure, I just want to create sharded solr instances, bring them into the cluster, and be able to shut them down without bringing down the whole cluster.
According to the wiki page, I should be able to bring up a new shard by using shardId [-D shardId], but when I did that, the logs showed it replicating an existing shard. Ranjan Andre Bois-Crettez wrote: You have to run ZK on at least 3 different machines for fault tolerance (a ZK ensemble). http://wiki.apache.org/solr/SolrCloud#Example_C:_Two_shard_cluster_with_shard_replicas_and_zookeeper_ensemble Ranjan Bagchi wrote: Hi, I'm interested in setting up a solr cluster where each machine [at least initially] hosts a separate shard of a big index [too big to sit on the machine]. I'm able to put a cloud together by telling it that I have (to start out with) 4 nodes, and then starting up nodes on 3 machines pointing at the zkInstance. I'm able to load my sharded data onto each machine individually and it seems to work. My concern is that it's not fault tolerant: if one of the non-zookeeper machines falls over, the whole cluster won't work. Also, I can't create a shard with more data and have it work within the existing cloud. I tried using -DshardId=shard5 [on an existing 4-shard cluster], but it just started replicating, which doesn't seem right. Are there ways around this? Thanks, Ranjan Bagchi -- - Mark http://www.lucidimagination.com
maxClauseCount error
Hi, I am suddenly getting a maxClauseCount error and don't know why. I am using Solr 3.5.
maxClauseCount Exception
Hi, I am suddenly getting a maxClauseCount exception for no reason. I am using Solr 3.5. I have only 206 documents in my index. Any ideas? This is weird. QUERY PARAMS: [hl, hl.snippets, hl.simple.pre, hl.simple.post, fl, hl.mergeContiguous, hl.usePhraseHighlighter, hl.requireFieldMatch, echoParams, hl.fl, q, rows, start]|#] [#|2012-02-22T13:40:13.129-0500|INFO|glassfish3.1.1| org.apache.solr.core.SolrCore|_ThreadID=22;_ThreadName=Thread-2;|[] webapp=/solr3 path=/select params={hl=true&hl.snippets=4&hl.simple.pre=<b>&hl.simple.post=</b>&fl=*,score&hl.mergeContiguous=true&hl.usePhraseHighlighter=true&hl.requireFieldMatch=true&echoParams=all&hl.fl=text_t&q={!lucene+q.op%3DOR+df%3Dtext_t}+(+kind_s:doc+OR+kind_s:xml)+AND+(type_s:[*+TO+*])+AND+(usergroup_sm:admin)&rows=20&start=0&wt=javabin&version=2} hits=204 status=500 QTime=166 |#] [#|2012-02-22T13:40:13.131-0500|SEVERE|glassfish3.1.1| org.apache.solr.servlet.SolrDispatchFilter| _ThreadID=22;_ThreadName=Thread-2;|org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024 at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136) at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:127) at org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:51) at org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:41) at org.apache.lucene.search.ScoringRewrite$3.collect(ScoringRewrite.java:95) at org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:38) at org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:93) at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:304) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:98) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:385) at
org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:217) at org.apache.lucene.search.highlight.QueryScorer.<init>(QueryScorer.java:185) at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:205) at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:490) at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401) at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:131) at org.apache.so
Trunk build errors
Hi, I am getting numerous errors preventing a build of SolrCloud trunk: [licenses] MISSING LICENSE for the following file: Any tips to get a clean build working? thanks
Re: SolrJ + SolrCloud
Thanks Mark. Is there any plan to make all the Solr search handlers work with SolrCloud, like MLT? That missing feature would prohibit us from using SolrCloud at the moment. :( On Sat, 2012-02-11 at 18:24 -0500, Mark Miller wrote: On Feb 11, 2012, at 6:02 PM, Darren Govoni wrote: Hi, Do all the normal facilities of Solr work with SolrCloud from SolrJ? Things like /mlt, /cluster, facets , tvf's, etc. Darren SolrJ works the same in SolrCloud mode as it does in non SolrCloud mode - it's fully supported. There is even a new SolrJ client called CloudSolrServer that has built in cluster awareness and load balancing. In terms of what is supported - anything that is supported with distributed search - that is most things, but there is the odd man out - like MLT - looks like an issue is open here: https://issues.apache.org/jira/browse/SOLR-788 but it's not resolved yet. - Mark Miller lucidimagination.com
SolrJ + SolrCloud
Hi, Do all the normal facilities of Solr work with SolrCloud from SolrJ? Things like /mlt, /cluster, facets, tvf's, etc. Darren
Re: Range facet - Count in facet menu != Count in search results
Double-check your default operator for a faceted search vs. a regular search. I caught that difference in my own work, and it explained a discrepancy like this one. On Fri, 2012-02-10 at 07:45 -0800, Yuhao wrote: Jan, Was the curly closing bracket } intentional? I'm using 3.4, which also supports fq=price:[10 TO 20]. The problem is the results are not working properly. From: Jan Høydahl jan@cominvent.com To: solr-user@lucene.apache.org; Yuhao nfsvi...@yahoo.com Sent: Thursday, February 9, 2012 7:45 PM Subject: Re: Range facet - Count in facet menu != Count in search results Hi, If you use the trunk (4.0) version, you can say fq=price:[10 TO 20} and have the upper bound be exclusive. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 10. feb. 2012, at 00:58, Yuhao wrote: I've changed the facet.range.include option to every possible value (lower, upper, edge, outer, all)**. It only changes the count shown in the Ranges facet menu on the left. It has no effect on the count and results shown in search results, which ALWAYS is inclusive of both the lower AND upper bounds (which is equivalent to include = all). Is this by design? I would like to make the search results include the lower bound, but not the upper bound. Can I do that? My range field is multi-valued, but I don't think that should be the problem. ** Actually, it doesn't like outer for some reason, which leaves the facet completely empty.
Re: SolrCloud war?
UPDATE: I set my app server's[1] system property jetty.port to be equal to the app server's open port and was able to get two Solr shards to talk. The overall properties I set are: App server domain 1: bootstrap_confdir, collection.configName, jetty.port, solr.solr.home, zkRun. App server domain 2: bootstrap_confdir, collection.configName, jetty.port, solr.solr.home, zkHost. I deployed each war app into the /solr context; I presume it's needed by remote URL addressing. I checked the zookeeper config page and it shows both shards. Awesome. [1] Glassfish 3.1.1 On 02/01/2012 08:50 PM, Mark Miller wrote: I have not yet tried to run SolrCloud in another app server, but it shouldn't be a problem. One issue you might have is the fact that we count on hostPort coming from the system property jetty.port. This is set in the default solr.xml - the hostPort defaults to jetty.port. You probably want to explicitly pass -DhostPort= if you are not going to use jetty.port. - Mark Miller lucidimagination.com On Feb 1, 2012, at 2:44 PM, Darren Govoni wrote: Hi, I'm trying to get the SolrCloud2 examples to work using a war-deployed Solr in Glassfish. The startup properties must be different in this case, because it's having trouble connecting to zookeeper when I deploy the solr war file. Perhaps the embedded zookeeper has trouble running in an app server? Any tips appreciated! Darren On 01/30/2012 06:58 PM, Darren Govoni wrote: Hi, Is there any issue with running the new SolrCloud deployed as a war in another app server? Has anyone tried this yet? thanks.
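For anyone repeating this setup, the properties listed above translate into JVM system properties on each app server domain. A sketch (paths, ports, and the ZooKeeper address are assumptions, not from the thread):

```
# JVM system properties for domain 1 (runs the embedded ZooKeeper via zkRun):
-Dsolr.solr.home=/path/to/solr1
-Dbootstrap_confdir=/path/to/solr1/conf
-Dcollection.configName=myconf
-Djetty.port=8080
-DzkRun

# JVM system properties for domain 2 (points at domain 1's ZooKeeper;
# the embedded ZooKeeper listens on the Solr port + 1000):
-Dsolr.solr.home=/path/to/solr2
-Dbootstrap_confdir=/path/to/solr2/conf
-Dcollection.configName=myconf
-Djetty.port=8181
-DzkHost=localhost:9080
```

jetty.port must match each domain's actual HTTP listener port, since (as Mark notes above) Solr derives hostPort from jetty.port unless -DhostPort= is passed explicitly.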
Re: Federation in SolrCloud?
Thanks for the reply Mark. I did example A. One of the instances had zookeeper. If I shut down the other instance, all searches on the other (running) instance produced an error in the browser. I don't have the error handy but it was one line. Something like missing shard in collection IIRC. What I'm hoping to achieve is this. Shard A: DocA, DocB Shard B: DocC, DocD if I do a query with both shards running I get DocA,DocB,DocC,DocD. If Shard B goes down, I only get DocA, DocB. After that I will fold replication in to understand it. On 02/02/2012 04:22 PM, Mark Miller wrote: On Feb 2, 2012, at 9:51 AM, dar...@ontrenet.com wrote: Hi, I want to use SolrCloud in a more federated mode rather than replication. The failover is nice, but I am more interested in increasing capacity of an index through horizontal scaling (shards). How can I configure shards such that they retain their own documents and don't replicate (or replicate to some shards and not all)? Thus, when I search from any shard I want results from all shards (being different results from each). Currently, if I kill a shard (using the example provided), no search works and it errors out. thanks! What example are you trying? Are you following it exactly? In order to serve requests at least one instance has to be up for every shard - but what you describe is how things work if you have enough replicas. Example A splits the index across two shards, but there are no replicas - if an instance goes down, search will not work. Example B and C add replicas. This means that one instance can die per shard and you will still be able to serve requests. Keep in mind that if you are running ZooKeeper with Solr (as the examples do), you have to make sure at least half the nodes running ZooKeeper are up. If that is only one node, you cannot kill that node - it will be a single point of failure unless you create a ZooKeeper ensemble. - Mark Miller lucidimagination.com
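The two-shard setup described here corresponds to an explicit shards= parameter on the query. A sketch of building such a request URL (host names hypothetical); note that with no replicas, the request fails if any listed shard is down, which matches the "missing shard" error seen above:

```python
from urllib.parse import urlencode

def distributed_query_url(base, shards, q, rows=10):
    # Every shard in the list must be reachable (or have a live
    # replica) for the request to succeed.
    params = {"q": q, "rows": rows, "shards": ",".join(shards)}
    return base + "?" + urlencode(params)

url = distributed_query_url(
    "http://hostA:8080/solr/select",
    ["hostA:8080/solr", "hostB:8080/solr"],  # shard A holds DocA/DocB, shard B holds DocC/DocD
    "*:*")
```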
Re: SolrCloud war?
Hi, I'm trying to get the SolrCloud2 examples to work using a war deployed solr into glassfish. The startup properties must be different in this case, because its having trouble connecting to zookeeper when I deploy the solr war file. Perhaps the embedded zookeeper has trouble running in an app server? Any tips appreciated! Darren On 01/30/2012 06:58 PM, Darren Govoni wrote: Hi, Is there any issue with running the new SolrCloud deployed as a war in another app server? Has anyone tried this yet? thanks.
SolrCloud war?
Hi, Is there any issue with running the new SolrCloud deployed as a war in another app server? Has anyone tried this yet? thanks.
Re: Hierarchical faceting in UI
Yuhao, Ok, let me think about this. A term can have multiple parents. Each of those parents would be 'different', yes? In this case, use a multivalued field for the parent and add all the parent names or id's to it. The relations should be unique. Your UI will associate the correct parent id to build the facet query from and return the correct children because the user is descending down a specific path in the UI and the parent node unique id's are returned along the way. Now, if you are having parent names/id's that themselves can appear in multiple locations (vs. just terms 'the leafs'), then perhaps your hierarchy needs refactoring for redundancy? Happy to help with more details. Darren On 01/24/2012 11:22 AM, Yuhao wrote: Darren, One challenge for me is that a term can appear in multiple places of the hierarchy. So it's not safe to simply use the term as it appears to get its children; I probably need to include the entire tree path up to this term. For example, if the hierarchy is Cardiovascular Diseases Arteriosclerosis Coronary Artery Disease, and I'm getting the children of the middle term Arteriosclerosi, I need to filter on something like parent:Cardiovascular Diseases/Arteriosclerosis. I'm having trouble figuring out how I can get the complete path per above to add to the URL of each facet term. I know velocity/facet_field.vm is where I build the URL. I know how to simply add a parent:term filter to the URL. But I don't know how to access a document field, like the complete parent path, in facet_field.vm. Any help would be great. Yuhao From: dar...@ontrenet.comdar...@ontrenet.com To: Yuhaonfsvi...@yahoo.com Cc: solr-user@lucene.apache.org Sent: Monday, January 23, 2012 7:16 PM Subject: Re: Hierarchical faceting in UI On Mon, 23 Jan 2012 14:33:00 -0800 (PST), Yuhaonfsvi...@yahoo.com wrote: Programmatically, something like this might work: for each facet field, add another hidden field that identifies its parent. 
Then, program additional logic in the UI to show only the facet terms at the currently selected level. For example, if one filters on cat:electronics, the new UI logic would apply the additional filter cat_parent:electronics. Can this be done? Yes. This is how I do it. Would it be a lot of work? No. Its not a lot of work, simply represent your hierarchy as parent/child relations in the document fields and in your UI drill down by issuing new faceted searches. Use the current facet (tree level) as the parent:level in the next query. Its much easier than other suggestions for this. Is there a better way? Not in my opinion, there isn't. This is the simplest to implement and understand. By the way, Flamenco (another faceted browser) has built-in support for hierarchies, and it has worked well for my data in this aspect (but less well than Solr in others). I'm looking for the same kind of hierarchical UI feature in Solr.
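The parent-path scheme discussed in this thread can be sketched at indexing time: for each term, also store the full path of its ancestors, so a term that appears in several branches of the hierarchy stays unambiguous. A minimal illustration; the field names are made up:

```python
def parent_paths(tree_path):
    # Split a document's full category path into one entry per level,
    # each carrying the complete ancestor path ("" for a root term).
    parts = tree_path.split("/")
    entries = []
    for depth, term in enumerate(parts):
        entries.append({"term": term, "parent": "/".join(parts[:depth])})
    return entries

path = "Cardiovascular Diseases/Arteriosclerosis/Coronary Artery Disease"
entries = parent_paths(path)

# The filter the UI would issue when descending into Arteriosclerosis:
fq = 'parent:"%s"' % entries[2]["parent"]
```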
Re: How to accelerate your Solr-Lucene application by 4x
I think the occassional Hey, we made something cool you might be interested in! notice, even if commercial, is ok because it addresses numerous issues we struggle with on this list. Now, if it were something completely off-base or unrelated (e.g. male enhancement pills), then yeah, I agree. On 01/18/2012 11:08 PM, Steven A Rowe wrote: Hi Darren, I think it's rare because it's rare: if this were found to be a useful advertising space, rare would cease to be descriptive of it. But I could be wrong. Steve -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: Wednesday, January 18, 2012 8:40 PM To: solr-user@lucene.apache.org Subject: Re: How to accelerate your Solr-Lucene appication by 4x And to be honest, many people on this list are professionals who not only build their own solutions, but also buy tools and tech. I don't see what the big deal is if some clever company has something of imminent value here to share it. Considering that its a rare event. On 01/18/2012 08:28 PM, Jason Rutherglen wrote: Steven, If you are going to admonish people for advertising, it should be equally dished out or not at all. On Wed, Jan 18, 2012 at 6:38 PM, Steven A Rowesar...@syr.edu wrote: Hi Peter, Commercial solicitations are taboo here, except in the context of a request for help that is directly relevant to a product or service. Please don’t do this again. Steve Rowe From: Peter Velikin [mailto:pe...@velobit.com] Sent: Wednesday, January 18, 2012 6:33 PM To: solr-user@lucene.apache.org Subject: How to accelerate your Solr-Lucene appication by 4x Hello Solr users, Did you know that you can boost the performance of your Solr application using your existing servers? All you need is commodity SSD and plug-and-play software like VeloBit. At ZoomInfo, a leading business information provider, VeloBit increased the performance of the Solr-Lucene-powered application by 4x. 
I would love to tell you more about VeloBit and find out if we can deliver the same business benefits at your company. Click here (http://www.velobit.com/15-minute-brief) for a 15-minute briefing on the VeloBit technology. Here is more information on how VeloBit helped ZoomInfo: * Increased Solr-Lucene performance by 4x using existing servers and commodity SSD * Installed VeloBit plug-and-play SSD caching software in 5 minutes, transparent to running applications and storage infrastructure * Reduced by 75% the hardware and monthly operating costs required to support service level agreements Technical Details: * Environment: Solr-Lucene indexed directory search service fronted by J2EE web application technology * Index size: 600 GB * Number of items indexed: 50 million * Primary storage: 6 x SAS HDD * SSD Cache: VeloBit software + OCZ Vertex 3 Click here (http://www.velobit.com/use-cases/enterprise-search/) to read more about the ZoomInfo Solr-Lucene case study. You can also sign up (http://www.velobit.com/early-access-program-accelerate-application) for our Early Access Program and try VeloBit HyperCache for free. Also, feel free to write to me directly at pe...@velobit.com. Best regards, Peter Velikin VP Online Marketing, VeloBit, Inc. pe...@velobit.com tel. 978-263-4800 mob. 617-306-7165 VeloBit provides plug-and-play SSD caching software that dramatically accelerates applications at a remarkably low cost. The software installs seamlessly in less than 10 minutes and automatically tunes for fastest application speed. Visit www.velobit.com for details.
Re: How to accelerate your Solr-Lucene application by 4x
Agree. There's probably some unwritten etiquette there. On 01/19/2012 05:52 AM, Patrick Plaatje wrote: Partially agree. If just the facts are given, and not a complete sales talk instead, it'll be fine. Don't overdo it like this though. Cheers, Patrick 2012/1/19 Darren Govonidar...@ontrenet.com I think the occassional Hey, we made something cool you might be interested in! notice, even if commercial, is ok because it addresses numerous issues we struggle with on this list. Now, if it were something completely off-base or unrelated (e.g. male enhancement pills), then yeah, I agree. On 01/18/2012 11:08 PM, Steven A Rowe wrote: Hi Darren, I think it's rare because it's rare: if this were found to be a useful advertising space, rare would cease to be descriptive of it. But I could be wrong. Steve -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: Wednesday, January 18, 2012 8:40 PM To: solr-user@lucene.apache.org Subject: Re: How to accelerate your Solr-Lucene appication by 4x And to be honest, many people on this list are professionals who not only build their own solutions, but also buy tools and tech. I don't see what the big deal is if some clever company has something of imminent value here to share it. Considering that its a rare event. On 01/18/2012 08:28 PM, Jason Rutherglen wrote: Steven, If you are going to admonish people for advertising, it should be equally dished out or not at all. On Wed, Jan 18, 2012 at 6:38 PM, Steven A Rowesar...@syr.eduwrote: Hi Peter, Commercial solicitations are taboo here, except in the context of a request for help that is directly relevant to a product or service. Please don’t do this again. Steve Rowe From: Peter Velikin [mailto:pe...@velobit.com] Sent: Wednesday, January 18, 2012 6:33 PM To: solr-user@lucene.apache.org Subject: How to accelerate your Solr-Lucene appication by 4x Hello Solr users, Did you know that you can boost the performance of your Solr application using your existing servers? 
Re: How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?
Try changing the URI/HTTP/GET size limitation on your app server. On 01/18/2012 05:59 PM, Daniel Bruegge wrote: Hi, I am just wondering how I can 'grow' a distributed Solr setup to an index size of a couple of terabytes, when one of the distributed Solr limitations is max. 4000 characters in URI limitation. See: *The number of shards is limited by number of characters allowed for GET method's URI; most Web servers generally support at least 4000 characters, but many servers limit URI length to reduce their vulnerability to Denial of Service (DoS) attacks. * *(via http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding )* Is the only way then to make multiple distributed solr clusters and query them independently and merge them in application code? Thanks. Daniel
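Besides raising the server-side limit, a client can guard against it. A sketch of checking the assembled URI against a 4000-character budget and falling back to an HTTP POST of the parameters instead; the limit, hosts, and helper are illustrative only:

```python
MAX_URI = 4000  # typical default GET limit cited in the thread

def choose_method(base_url, query_string):
    # Long shards= lists can push the GET URI past the server's
    # limit; POSTing the same parameters avoids that cap.
    uri = base_url + "?" + query_string
    return "POST" if len(uri) > MAX_URI else "GET"

many_shards = ",".join(f"host{i}:8080/solr" for i in range(300))
method = choose_method("http://host:8080/solr/select",
                       "q=*:*&shards=" + many_shards)
```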
Re: How to accelerate your Solr-Lucene application by 4x
And to be honest, many people on this list are professionals who not only build their own solutions, but also buy tools and tech. I don't see what the big deal is if some clever company has something of imminent value here to share it. Considering that its a rare event. On 01/18/2012 08:28 PM, Jason Rutherglen wrote: Steven, If you are going to admonish people for advertising, it should be equally dished out or not at all. On Wed, Jan 18, 2012 at 6:38 PM, Steven A Rowesar...@syr.edu wrote: Hi Peter, Commercial solicitations are taboo here, except in the context of a request for help that is directly relevant to a product or service. Please don’t do this again. Steve Rowe From: Peter Velikin [mailto:pe...@velobit.com] Sent: Wednesday, January 18, 2012 6:33 PM To: solr-user@lucene.apache.org Subject: How to accelerate your Solr-Lucene appication by 4x Hello Solr users, Did you know that you can boost the performance of your Solr application using your existing servers? All you need is commodity SSD and plug-and-play software like VeloBit. At ZoomInfo, a leading business information provider, VeloBit increased the performance of the Solr-Lucene-powered application by 4x. I would love to tell you more about VeloBit and find out if we can deliver same business benefits at your company. Click herehttp://www.velobit.com/15-minute-brief for a 15-minute briefinghttp://www.velobit.com/15-minute-brief on the VeloBit technology. 
Highlighting in 3.5?
Hi, Can someone tell me if this is correct behavior from Solr. I search on a dynamic field: field_t:[* TO *] I set highlight fields to field_t,text_t but I am not searching specifically inside text_t field. The highlights for text_t come back with EVERY WORD. Maybe because of the [* TO *], but the query semantics indicate not searching on text_t even though highlighting is enabled. Is this correct behavior? it produces unwanted highlight results. I would expect Solr to know what fields are participating in the query and only highlight those that are involved in the result set. Thanks, Darren
Re: Highlighting in 3.5?
Hi Juan, Setting that parameter produces the same extraneous results. Here is my query: {!lucene q.op=OR df=text_t} kind_s:doc AND (( field_t:[* TO *] )) Clearly, the default field (text_t) is not being searched by this query and highlighting it would be semantically incongruent with the query. Is it a bug? Darren On 01/02/2012 04:39 PM, Juan Grande wrote: Hi Darren, This is the expected behavior. Have you tried setting the hl.requireFieldMatch parameter to true? See: http://wiki.apache.org/solr/HighlightingParameters#hl.requireFieldMatch *Juan* On Mon, Jan 2, 2012 at 10:54 AM, Darren Govonidar...@ontrenet.com wrote: Hi, Can someone tell me if this is correct behavior from Solr. I search on a dynamic field: field_t:[* TO *] I set highlight fields to field_t,text_t but I am not searching specifically inside text_t field. The highlights for text_t come back with EVERY WORD. Maybe because of the [* TO *], but the query semantics indicate not searching on text_t even though highlighting is enabled. Is this correct behavior? it produces unwanted highlight results. I would expect Solr to know what fields are participating in the query and only highlight those that are involved in the result set. Thanks, Darren
Re: Highlighting in 3.5?
Forgot to add, that the time when I DO want the highlight to appear would be with a query that DOES match the default field. {!lucene q.op=OR df=text_t} kind_s:doc AND (( field_t:[* TO *] )) cars Where the term 'cars' would be matched against the df. Then I want the highlight for it. If there are no query term matches for the df, then getting ALL the field terms highlighted (as it does now) is rather perplexing feature. Darren On 01/02/2012 06:28 PM, Darren Govoni wrote: Hi Juan, Setting that parameter produces the same extraneous results. Here is my query: {!lucene q.op=OR df=text_t} kind_s:doc AND (( field_t:[* TO *] )) Clearly, the default field (text_t) is not being searched by this query and highlighting it would be semantically incongruent with the query. Is it a bug? Darren On 01/02/2012 04:39 PM, Juan Grande wrote: Hi Darren, This is the expected behavior. Have you tried setting the hl.requireFieldMatch parameter to true? See: http://wiki.apache.org/solr/HighlightingParameters#hl.requireFieldMatch *Juan* On Mon, Jan 2, 2012 at 10:54 AM, Darren Govonidar...@ontrenet.com wrote: Hi, Can someone tell me if this is correct behavior from Solr. I search on a dynamic field: field_t:[* TO *] I set highlight fields to field_t,text_t but I am not searching specifically inside text_t field. The highlights for text_t come back with EVERY WORD. Maybe because of the [* TO *], but the query semantics indicate not searching on text_t even though highlighting is enabled. Is this correct behavior? it produces unwanted highlight results. I would expect Solr to know what fields are participating in the query and only highlight those that are involved in the result set. Thanks, Darren
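Until the server-side behavior is resolved, the extraneous snippets can be dropped client-side. A sketch that keeps highlights only for fields that actually took part in the query; the response shape follows Solr's highlighting section, {doc_id: {field: [snippets]}}:

```python
def relevant_highlights(highlighting, queried_fields):
    # Filter the highlighting section down to fields the query
    # actually searched, discarding e.g. a fully highlighted
    # default field that no query term matched.
    return {
        doc_id: {f: snips for f, snips in fields.items()
                 if f in queried_fields}
        for doc_id, fields in highlighting.items()
    }

resp = {"doc1": {"field_t": ["<em>x</em>"], "text_t": ["every", "word"]}}
clean = relevant_highlights(resp, {"field_t"})
```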
Re: Poor performance on distributed search
I see what you are asking. This is an interesting question. It seems inefficient for Solr to apply the requested rows to all shards only to discard most of the results on merge. That would consume lots of resources not used in the final result set. On 12/19/2011 04:32 PM, ku3ia wrote: Uhm, either I misunderstand your question or you're doing a lot of extra work for nothing The whole point of sharding it exactly to collect the top N docs from each shard and merge them into a single result. So if you want 10 docs, just specify rows=10. Solr will query all the shards, get the top 10 docs from each and then merge them into a final list 10 items long. Both the initial fetch and the final merge are based on the sort criteria are respected. Score is the default sort. If you specify other sort criteria, i.e. a field, then that sort is respected by the merge process. So why do you have this 2,000 requirement in the first place? This really sounds like an XY problem. As I wrote it is a minimum for me. I can't change it. Final response must has top 2K docs from all shards by query, so I specify rows=2000. Yeah, it collects top N docs from each shard. In my case N=2000, so on production I have 2000x30=60K, and on my own machine 2000x4=8K docs. Its true, this is an extra work, but in other case, seems it's only way to get top 2K docs from all shards, am I right? P.S. Is any mechanism, for example, to get top 100 rows from each shard, only merge it, sort by defined at query filed or score and pull result to the user? Uhm, either I misunderstand your question For example I have 4 shards. Finally, I need 2000 docs. Now, when I'm using shards=127.0.0.1:8080/solr/shard1,127.0.0.1:8080/solr/shard2,127.0.0.1:8080/solr/shard3,127.0.0.1:8080/solr/shard4 Solr gets 2000 docs from each shard (shard1,2,3,4, summary we have 8000 docs) merge and sort it, for example, by default field (score), and returns me only 2000 rows (not all 8000), which I specified at request. 
So, my question was about, is any mechanism in Solr, which gets not 2000 rows from each shard, and say, If I specified 2000 docs at request, Solr calculates how much shards I have (four shards), divides total rows onto shards (2000/4=500) and sends to each shard queries with rows=500, but not rows=2000, so finally, summary after merging and sorting I'll have 2000 rows (maybe less), but not 8000... That was my question. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Poor-performance-on-distributed-search-tp3590028p3599636.html Sent from the Solr - User mailing list archive at Nabble.com.
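The merge being described (and why asking each shard for rows/num_shards can return wrong results) can be sketched as:

```python
import heapq

def merge_shard_results(shard_results, rows):
    # Each shard has already returned its own top `rows` docs sorted
    # by descending score; the coordinator keeps the global top
    # `rows` and discards the rest. With S shards that is S*rows
    # docs fetched to return rows -- the overhead discussed above.
    merged = heapq.merge(*shard_results, key=lambda d: -d["score"])
    return [d["id"] for d in merged][:rows]

shard1 = [{"id": "a", "score": 9.0}, {"id": "b", "score": 8.0}]
shard2 = [{"id": "c", "score": 7.0}, {"id": "d", "score": 1.0}]

top = merge_shard_results([shard1, shard2], 2)

# Fetching only rows/num_shards (here 1) from each shard would
# instead yield ["a", "c"], missing "b": the global top docs need
# not be spread evenly across shards.
naive = merge_shard_results([shard1[:1], shard2[:1]], 2)
```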
Re: Grouping or Facet ?
Yes. That's what I would expect. I guess I didn't understand when you said The facet counts are the counts of the *values* in that field Because it seems its the count of the number of matching documents irrespective if one document has 20 values for that field and another 10, the facet count will be 2, one for each document in the results. On 12/07/2011 09:04 AM, Erick Erickson wrote: In your example you'll have 10 facets returned each with a value of 1. Best Erick On Tue, Dec 6, 2011 at 9:54 AM,dar...@ontrenet.com wrote: Sorry to jump into this thread, but are you saying that the facet count is not # of result hits? So if I have 1 document with field CAT that has 10 values and I do a query that returns this 1 document with faceting, that the CAT facet count will be 10 not 1? I don't seem to be seeing that behavior in my app (Solr 3.5). Thanks. OK, I'm not understanding here. You get the counts and the results if you facet on a single category field. The facet counts are the counts of the *values* in that field. So it would help me if you showed the output of faceting on a single category field and why that didn't work for you But either way, faceting will probably outperform grouping. Best Erick On Mon, Dec 5, 2011 at 9:05 AM, Juan Pablo Morajua...@informa.es wrote: Because I need the count and the result to return back to the client side. Both the grouping and the facet offers me a solution to do that, but my doubt is about performance ... With Grouping my results are: grouped:{ category:{ matches: ..., groups:[{ groupValue:categoryXX, doclist:{numFound:Important_number,start:0,docs:[ { doc:id category:XX } groupValue:categoryYY, doclist:{numFound:Important_number,start:0,docs:[ { doc: id category:YY } And with faceting my results are : facet.prefix=whatever facet_counts:{ facet_queries:{}, facet_fields:{ namesXX:[ whatever_name_in_category,76, ... namesYY:[ whatever_name_in_category,76, ... Both results are OK to me. 
De: Erick Erickson [erickerick...@gmail.com] Enviado el: lunes, 05 de diciembre de 2011 14:48 Para: solr-user@lucene.apache.org Asunto: Re: Grouping or Facet ? Why not just use the first form of the document and just facet.field=category? You'll get two different facet counts for XX and YY that way. I don't think grouping is the way to go here. Best Erick On Sat, Dec 3, 2011 at 6:43 AM, Juan Pablo Morajua...@informa.es wrote: I need to do some counts on a StrField field to suggest options from two different categories, and I don´t know what option is the best: My schema looks: - id - name - category: XX or YY with Grouping I do: http://localhost:8983/?q=name:prefix*group=truegroup.field=category But I can change my schema to to: - id - nameXX - nameYY - category: XX or YY (only 1 value in nameXX or nameYY) With facet: http://localhost:8983/?q=*:*facet=truefacet.field=nameXXfacet.field=nameYYfacet.prefix=prefix What option have the best performance ? Best, Juampa.
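The counting behavior settled on in this thread (a facet count is the number of matching documents containing the value, so a multivalued field contributes at most 1 per document per distinct value) can be sketched as:

```python
def facet_counts(docs, field):
    # One document with ten values for the field bumps ten different
    # value counts by one each -- not one count by ten.
    counts = {}
    for doc in docs:
        for value in set(doc.get(field, [])):
            counts[value] = counts.get(value, 0) + 1
    return counts

docs = [{"cat": ["a", "b"]}, {"cat": ["a"]}]
counts = facet_counts(docs, "cat")

# Erick's example: one doc, ten values -> ten facets, each with count 1.
one_doc = facet_counts([{"cat": [str(i) for i in range(10)]}], "cat")
```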
Re: Solr 3.5 very slow (performance)
Monitoring this thread make me ask the question of whether there are standardized performance benchmarks for Solr. Such that they are run and published with each new release. This would affirm its performance under known circumstances, with which people can try in their own environments and compare to their application behavior. I think it would be a good idea. On 11/30/2011 04:12 PM, Pawel Rog wrote: On Wed, Nov 30, 2011 at 9:05 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I tried to use index from 1.4 (load was the same as on index from 3.5) : but there was problem with synchronization with master (invalid : javabin format) : Then I built new index on 3.5 with luceneMatchVersion LUCENE_35 why would you need to re-replicate from the master? You already have a copy of the Solr 1.4 index on the slave machine where you are doing testing correct? Just (make sure Solr 1.4 isn't running and) point Solr 3.5 at that solr home directory for the configs and data and time that. (Just because Solr 3.5 can't replicate from Solr 1.4 over HTTP doesn't mean it can't open indexes built by Solr 1.4) I made It before sending earlier e-mail. Efect was the same. It's important to understand if the discrepencies you are seeing have to do with *building* the index under Solr 3.5, or *searching* in Solr 3.5. 
: reader : SolrIndexReader{this=8cca36c,r=ReadOnlyDirectoryReader@8cca36c,refCnt=1,segments=4} : readerDir : org.apache.lucene.store.NIOFSDirectory@/data/solr_data/itemsfull/index : : solr 3.5 : reader : SolrIndexReader{this=3d01e178,r=ReadOnlyDirectoryReader@3d01e178,refCnt=1,segments=14} : readerDir : org.apache.lucene.store.MMapDirectory@/data/solr_data_350/itemsfull/index : lockFactory=org.apache.lucene.store.NativeFSLockFactory@294ce5eb As mentioned, the difference in the number of segments may be contributing to the perf differences you are seeing, so optimizing both indexes (or doing a partial optimize of your 3.5 index down to 4 segments) for comparison would probably be worthwhile. (and if that is the entirety of hte problem, then explicitly configuring a MergePolicy may help you in the long run) but independent of that I would like to suggest that you first try explicitly configuring Solr 3.5 to use NIOFSDirectory so it's consistent with what Solr 1.4 was doing (I'm told MMapDirectory should be faster, but maybe there's something about your setup that makes that not true) So it would be helpful to also try adding this to your 3.5 solrconfig.xml and testing ... directoryFactory name=DirectoryFactory class=solr.NIOFSDirectoryFactory/ : I made some test with quiet heavy query (with frange). In both cases : (1.4 and 3.5) I used the same newSearcher queries and started solr : without any load. : Results of debug timing Ok, well ... honestly: giving us *one* example of the timing data for *one* query (w/o even telling us what the exact query was) ins't really anything we can use to help you ... the crux of the question was: was the slow performance you are seeing only under heavy load or was it also slow when you did manual testing? : When I send fewer than 60 rps I see that in comparsion to 1.4 median : response time is worse, avarage is worse but maximum time is better. : It doesn't change propotion of cpu usage (3.5 uses much more cpu). 
How much fewer then 60 rps ? ... I'm trying to understand if the problems you are seeing are solely happening under heavy concurrent load, or if you are seeing Solr 3.5 consistently respond much slower then Solr 1.4 even with a single client? Also: I may still be missunderstanding how you are generating load, and wether you are throttling the clients, but seeing higher CPU utilization in Solr 3.5 isn't neccessarily an indication of something going wrong -- in some cases higher CPU% (particularly under heavy concurrent load on a multi-core machine) could just mean that Solr is now capable of utilizing more CPU to process parallel request, where as previous versions might have been hitting other bottle necks. -- but that doesn't explain the slower response times. that's what concerns me the most. I don't think that 1200% CPU usage with the same traffic is better then 200%. I think you are wrong :) Using solr 1.4 I can reach 300rps and then reach 1200% on cpu and only 60rps in solr 3.5 FWIW: I'm still wondering what the stats from your caches wound up looking like on both Solr 1.4 and Solr 3.5... 7) What do the cache stats look like on your Solr 3.5 instance after you've done some of this timing testing? the output of... http://localhost:8983/solr/admin/mbeans?cat=CACHEstats=truewt=jsonindent=true ...would be helpful. NOTE: you may need to add this to your solrconfig.xml for that URL to work... requestHandler name=/admin/ class=solr.admin.AdminHandlers /' ...but i don't think /admin/mbeans exists in Solr 1.4, so you may just have to get the details from stats.jsp. I forgot to write it earlier. QueryCache hit rate was about 0.03 (in solr
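On the standardized-benchmark idea raised above, the comparison this thread keeps making (median, average, and max response time at a given request rate) is straightforward to make repeatable. A minimal sketch of summarizing a run's latency samples; the sample values are invented:

```python
import statistics

def latency_report(samples_ms):
    # Summarize response-time samples the way the thread compares
    # 1.4 vs 3.5: median, average, and worst case. Median and max
    # can diverge sharply when a few requests stall.
    return {
        "median": statistics.median(samples_ms),
        "avg": statistics.mean(samples_ms),
        "max": max(samples_ms),
    }

report = latency_report([12.0, 15.0, 11.0, 240.0])
```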
Re: Solr 3.5 very slow (performance)
Any suspicous activity in the logs? what about disk activity? On 11/29/2011 05:22 PM, Pawel Rog wrote: On Tue, Nov 29, 2011 at 9:13 PM, Chris Hostetter hossman_luc...@fucit.org wrote: Let's back up a minute and cover some basics... 1) You said that you built a brand new index on a brand new master server, using Solr 3.5 -- how do you build your indexes? did the source data change at all? does your new index have the same number of docs as your previous Solr 1.4 index? what does a directory listing (including file sizes) look like for both your old and new indexes? Yes, both indexes have same data. Indexes are build using some C++ programm which reads data from database and inserts it into Solr (using XML). Both indexes have about 8GB size and 18milions documents. 2) Did you try using your Solr 1.4 index (and configs) directly in Solr 3.5 w/o rebuilding from scratch? Yes I used the same configs in solr 1.4 and solr 3.5 (adding only line about luceneMatchVersion) As I see in example of solr 3.5 in repository (solrconfig.xml) there are not many diffrences. 3) You said you build the new index on a new mmachine, but then you said you used a slave where the performanne was worse then Solr 1.4 on the same machine ... are you running both the Solr 1.4 and Solr 3.5 instances concurrently on your slave machine? How much physical ram is on that machine? what JVM options are using when running the Solr 3.5 instance? what servlet container are you using? Mayby I didn't wrote precisely enough. I have some machine on which there is master node. I have second machine on which there is slave. I tested solr 1.4 on that machine, then turned it off and turned on solr-3.5. I have 36GB RAM on that machine. On both - solr 1.4 and 3.5 configuration of JVM is the same, and the same servlet container ... 
jetty-6 JVM options: -server -Xms12000m -Xmx12000m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:NewSize=1500m -XX:ParallelGCThreads=8 -XX:CMSInitiatingOccupancyFraction=60 4) what does your request handler configuration look like? do you have any default/invariant/appended request params?

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>
<requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- fully qualified url for the replication handler of master. It is possible to pass this as a request param for the fetchindex command -->
    <str name="masterUrl">http://${masterHost}:${masterPort}/solr-3.5/${solr.core.instanceDir}replication</str>
    <str name="pollInterval">00:00:02</str>
    <str name="httpConnTimeout">5000</str>
    <str name="httpReadTimeout">1</str>
  </lst>
</requestHandler>

5) The descriptions you've given of how the performance has changed sound like you are doing concurrent load testing -- did you do cache warming before you started your testing? how many client threads are hitting the solr server at one time? Maybe I wasn't precise enough again. CPU on Solr 1.4 was 200% and on Solr 3.5 it was 1200%. Yes, there is cache warming. There are 50-100 client threads on both 1.4 and 3.5. There are about 60 requests per second on 3.5 and on 1.4, but on 3.5 responses are slower and CPU usage is much higher. 6) have you tried doing some basic manual testing to see how individual requests perform? ie: single client at a time, loading a URL, then requesting the same URL again to verify that your Solr caches are in use and the QTime is low. If you see slow response times even when manually executing single requests at a time, have you tried using debug=timing to see which search components are contributing the most to the slow QTimes? 
Most time is spent in org.apache.solr.handler.component.QueryComponent and org.apache.solr.handler.component.DebugComponent in process. I didn't compare individual request performance. 7) What do the cache stats look like on your Solr 3.5 instance after you've done some of this timing testing? the output of... http://localhost:8983/solr/admin/mbeans?cat=CACHE&stats=true&wt=json&indent=true ...would be helpful. NOTE: you may need to add this to your solrconfig.xml for that URL to work... <requestHandler name="/admin/" class="solr.admin.AdminHandlers" /> Will check it :) : in my last post I meant : default operator AND : promoted - int : ending - int : b_count - int : name - text : cat1 - int : cat2 - int : : On Tue, Nov 29, 2011 at 7:54 PM, Pawel Rog pawelro...@gmail.com wrote: : examples : : facet=true&sort=promoted+desc,ending+asc,b_count+desc&facet.mincount=1&start=0&q=name:(kurtka+skóry+brazowe42)&facet.limit=500&facet.field=cat1&facet.field=cat2&wt=json&rows=50 : :
Query time help
Hi, I am running Solr 3.4 in its own Glassfish domain. I have about 12,500 documents with 100 or so fields with the works (stored, term vectors, etc). In my webtier code, I use SolrJ and execute a query as such:

long querystart = new Date().getTime();
System.out.println("BEFORE QUERY TIME: " + (querystart - startime) + " milliseconds.");
1. QueryResponse qr = solr.query(aquery, METHOD.POST);
long queryend = new Date().getTime();
System.out.println("QUERY TIME: " + (queryend - querystart) + " milliseconds. Before QUERY TIME: " + (querystart - startime));

The QTime in the response reads 50-77, but line 1. takes anywhere from 5-13 seconds to complete. Here is the query: {!lucene q.op=OR df=text_t} ( kind_s:doc OR kind_s:xml) AND (( item_sm_t:[* TO *] )) AND (usergroup_sm:admin) What could be causing this delay? The server has 15GB RAM. Responses are not unreasonably large. I use paging. Many thanks, Darren
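A likely place to look for a gap this large is the work QTime does not cover: QTime measures only the server-side search, while the wall-clock time around `solr.query(...)` also includes network transfer, response serialization, and client-side parsing (which can be substantial with 100 stored fields plus term vectors). A minimal standalone sketch of the measurement pattern, where `simulatedQuery` is a hypothetical stand-in for the real SolrJ call so the example runs without a server:

```java
// Standalone sketch: QTime covers only server-side search time; wall-clock
// time around the client call also includes network + parsing overhead.
// simulatedQuery() is a hypothetical stand-in for solr.query(aquery, METHOD.POST).
public class QueryTiming {
    static int simulatedQuery() throws InterruptedException {
        Thread.sleep(200);  // stand-in for network + serialization + search
        return 55;          // pretend this is the QTime reported by Solr
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        int qtime = simulatedQuery();
        long elapsed = System.currentTimeMillis() - start;
        // The difference is everything QTime does not account for.
        System.out.println("QTime=" + qtime + " clientElapsed=" + elapsed
                + " overheadMs=" + (elapsed - qtime));
    }
}
```

If clientElapsed dwarfs QTime on the real call, it is worth testing whether trimming the returned field list (fl=) shrinks the gap.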
Re: inconsistent results when faceting on multivalued field
My interpretation of your results is that your fq found 1281 documents with the value 1213206 in the sou_codeMetier field. Of those results, 476 also had 1212104 as a value... and so on. Since ALL the results will have the field value in your fq, I would expect the other values to occur equally or less often in the result set, which they appear to. On 10/21/2011 03:55 AM, Alain Rogister wrote: Pravesh, Not exactly. Here is the search I do, in more detail (different field name, but same issue). I want to get a count for a specific value of the sou_codeMetier field, which is multivalued. I expressed this by including an fq clause: /select/?q=*:*&facet=true&facet.field=sou_codeMetier&fq=sou_codeMetier:1213206&rows=0 The response (excerpt only):

<lst name="facet_fields">
  <lst name="sou_codeMetier">
    <int name="1213206">1281</int>
    <int name="1212104">476</int>
    <int name="121320603">285</int>
    <int name="1213101">260</int>
    <int name="121320602">208</int>
    <int name="121320605">171</int>
    <int name="1212201">152</int>
    ...

As you see, I get back both the expected results and extra results I would expect to be filtered out by the fq clause. I can eliminate the extra results with an 'f.sou_codeMetier.facet.prefix=1213206' clause. But I wonder if Solr's behavior is correct and how the fq filtering works exactly. If I replace the facet.field clause with a facet.query clause, like this: /select/?q=*:*&facet=true&facet.query=sou_codeMetier:[1213206 TO 1213206]&rows=0 The results contain a single item:

<lst name="facet_queries">
  <int name="sou_codeMetier:[1213206 TO 1213206]">1281</int>
</lst>

The 'fq=sou_codeMetier:1213206' clause isn't necessary here and does not affect the results. Thanks, Alain On Fri, Oct 21, 2011 at 9:18 AM, pravesh suyalprav...@yahoo.com wrote: Could you clarify the below: When I make a search on facet.qua_code=1234567 ?? Are you trying to say that you fire a fresh search for a facet item, like q=qua_code:1234567?? 
This would fetch documents where the qua_code field contains either the term 1234567 alone OR both terms (1234567 AND 9384738, among other terms). This is because it is a multivalued field, and hence if you look at the facets, they are shown for both terms. If I reword the query as 'facet.query=qua_code:[1234567 TO 1234567]', I only get the expected counts. You will get facets for documents which have the term 1234567 only (facet.query applies to the facets, i.e. to which facets are picked/shown). Regds Pravesh -- View this message in context: http://lucene.472066.n3.nabble.com/inconsistent-results-when-faceting-on-multivalued-field-tp3438991p3440128.html Sent from the Solr - User mailing list archive at Nabble.com.
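The behavior Pravesh describes — fq narrowing the *document* set while facet.field then counts every value those documents carry — can be sketched standalone. The values and counts below are illustrative, not Alain's real data:

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch of why fq=sou_codeMetier:1213206 does not prune the facet list:
// fq restricts which documents match, but facet.field then counts every
// value of the multivalued field across the surviving documents.
public class FacetSketch {
    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("1213206", "1212104"),   // matches the filter
            Arrays.asList("1213206"),              // matches the filter
            Arrays.asList("1212104"));             // filtered out by fq
        // fq: keep only documents containing the filtered value
        List<List<String>> filtered = docs.stream()
            .filter(d -> d.contains("1213206"))
            .collect(Collectors.toList());
        // facet.field: count every value present in the surviving documents
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> d : filtered)
            for (String v : d)
                counts.merge(v, 1, Integer::sum);
        System.out.println(counts); // → {1212104=1, 1213206=2}
    }
}
```

The other values of matching documents still appear, with counts less than or equal to the filtered value's count — exactly the pattern in Alain's response excerpt.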
Re: Merging Remote Solr Indexes?
Interesting, Yury. Thanks. On 10/20/2011 11:00 AM, Yury Kats wrote: On 10/19/2011 5:15 PM, Darren Govoni wrote: Hi Otis, Yeah, I saw that page, but it says it's for merging cores, which I presume must reside locally to the Solr instance doing the merging? What I'm interested in doing is merging across Solr instances running on different machines into a single Solr running on another machine (programmatically). Is it still possible, or did I misread the wiki? Possible, but in a few steps. 1. Create new cores on the target machine. 2. Replicate to them from the different source machines. 3. Merge on the target machine. All 3 steps can be done programmatically.
Re: Merging Remote Solr Indexes?
Hi Otis, Yeah, I saw that page, but it says it's for merging cores, which I presume must reside locally to the Solr instance doing the merging? What I'm interested in doing is merging across Solr instances running on different machines into a single Solr running on another machine (programmatically). Is it still possible, or did I misread the wiki? Thanks! Darren On 10/19/2011 11:57 AM, Otis Gospodnetic wrote: Hi Darren, http://search-lucene.com/?q=solr+merge&fc_project=Solr Check hit #1 Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: dar...@ontrenet.com dar...@ontrenet.com To: solr-user@lucene.apache.org Sent: Wednesday, October 19, 2011 10:04 AM Subject: Merging Remote Solr Indexes? Hi, I thought of a useful capability if it doesn't already exist. Is it possible to do an index merge between two remote Solrs? To handle massive index-time scalability, wouldn't it be useful to have distributed indexes accepting local input, then merge them into one central index after? Darren
Re: Merging Remote Solr Indexes?
Actually, yeah. If you think about it, a remote merge is like the inverse of replication. Where replication is one-to-many away from an index, the inverse would be merging many back into one. Sorta like a recall. I think it would be a great analog to replication. On 10/19/2011 06:18 PM, Otis Gospodnetic wrote: Darren, No, that is not possible without copying an index/shard to a single machine on which you would then merge indices as described on the Wiki. Hmm, wouldn't it be nice to make use of existing replication code to make it possible to move shards around the cluster? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Darren Govoni dar...@ontrenet.com To: solr-user@lucene.apache.org Sent: Wednesday, October 19, 2011 5:15 PM Subject: Re: Merging Remote Solr Indexes? Hi Otis, Yeah, I saw that page, but it says it's for merging cores, which I presume must reside locally to the Solr instance doing the merging? What I'm interested in doing is merging across Solr instances running on different machines into a single Solr running on another machine (programmatically). Is it still possible, or did I misread the wiki? Thanks! Darren On 10/19/2011 11:57 AM, Otis Gospodnetic wrote: Hi Darren, http://search-lucene.com/?q=solr+merge&fc_project=Solr Check hit #1 Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: dar...@ontrenet.com dar...@ontrenet.com To: solr-user@lucene.apache.org Sent: Wednesday, October 19, 2011 10:04 AM Subject: Merging Remote Solr Indexes? Hi, I thought of a useful capability if it doesn't already exist. Is it possible to do an index merge between two remote Solrs? To handle massive index-time scalability, wouldn't it be useful to have distributed indexes accepting local input, then merge them into one central index after? Darren
Re: basic solr cloud questions
That was kinda my point. The new cloud implementation is not about replication, nor should it be, but rather about horizontal scalability where nodes manage different parts of a unified index. One of the design goals of the new cloud implementation is for this to happen more or less automatically. To me that means one does not have to manually distribute documents or enforce replication as Yury suggests. Replication is different to me than what was being asked. And perhaps I misunderstood the original question. Yury's response introduced the term core where the original person was referring to nodes. For all I know, those are two different things in the new cloud design terminology (I believe they are). I guess understanding cores vs. nodes vs. shards is helpful. :) cheers! Darren On 09/29/2011 12:00 AM, Pulkit Singhal wrote: @Darren: I feel that the question itself is misleading. Creating shards is meant to separate out the data ... not keep the exact same copy of it. I think the two-node setup that was attempted by Sam misled him and us into thinking that configuring two nodes which are to be named shard1 ... somehow means that they are instantly replicated too ... this is not the case! I can see how this misunderstanding can develop, as I too was confused until Yury cleared it up. @Sam: If you are interested in performing a quick exercise to understand the pieces involved for replication rather than sharding ... perhaps this link would be of help in taking you through it: http://pulkitsinghal.blogspot.com/2011/09/setup-solr-master-slave-replication.html - Pulkit 2011/9/27 Yury Kats yuryk...@yahoo.com: On 9/27/2011 5:16 PM, Darren Govoni wrote: On 09/27/2011 05:05 PM, Yury Kats wrote: You need to either submit the docs to both nodes, or have a replication setup between the two. Otherwise they are not in sync. I hope that's not the case. 
:/ My understanding (or hope maybe) is that the new Solr Cloud implementation will support auto-sharding and distributed indexing. This means that shards will receive different documents regardless of which node received the submitted document (spread evenly based on a hash-node assignment). Distributed queries will thus merge all the solr shard/node responses. All cores in the same shard must somehow have the same index. Only then can you continue servicing searches when individual cores fail. Auto-sharding and distributed indexing don't have anything to do with this. In the future, SolrCloud may be managing replication between cores in the same shard automatically. But right now it does not.
Re: basic solr cloud questions
Agree. Thanks also for clarifying. It helps. On 09/29/2011 08:50 AM, Yury Kats wrote: On 9/29/2011 7:22 AM, Darren Govoni wrote: That was kinda my point. The new cloud implementation is not about replication, nor should it be. But rather about horizontal scalability where nodes manage different parts of a unified index. It's about many things. You stated one, but there are other goals, one of them being tolerance to node outages. In a cloud, when one of your many nodes fails, you don't want to stop querying and indexing. For this to happen, you need to maintain redundant copies of the same pieces of the index, hence you need to replicate. One of the design goals of the new cloud implementation is for this to happen more or less automatically. True, but there is a big gap between goals and current state. Right now, there is distributed search, but not distributed indexing or auto-sharding, or auto-replication. So if you want to use SolrCloud now (as many of us do), you need to do a number of things yourself, even if they might be done by SolrCloud automatically in the future. To me that means one does not have to manually distribute documents or enforce replication as Yury suggests. Replication is different to me than what was being asked. And perhaps I misunderstood the original question. Yury's response introduced the term core where the original person was referring to nodes. For all I know, those are two different things in the new cloud design terminology (I believe they are). I guess understanding cores vs. nodes vs. shards is helpful. :) A shard is a slice of the index. An index is managed/stored in a core. Nodes are Solr instances, usually physical machines. Each node can host multiple shards, and each shard can consist of multiple cores. However, all cores within the same shard must have the same content. This is where the OP ran into the problem. The OP had 1 shard, consisting of two cores on two nodes. 
Since there is no distributed indexing yet, all documents were indexed into a single core. However, there is distributed search, therefore queries were sent randomly to different cores of the same shard. Since one core in the shard had documents and the other didn't, the query result was random. To solve this problem, the OP must make sure all cores within the same shard (be they on the same node or not) have the same content. This can currently be achieved by: a) setting up replication between cores. you index into one core and the other core replicates the content b) indexing into both cores Hope this clarifies.
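For option (a), a minimal sketch of what the replication wiring could look like in each core's solrconfig.xml. The host name, port, core name, and poll interval here are placeholders for illustration, not values taken from the thread:

```xml
<!-- On the core you index into (the "master" of the pair) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>

<!-- On the other core in the same shard (the "slave" of the pair) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/core1/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```

With this in place, indexing goes to the master core only, and the slave core pulls the index after each commit, keeping both cores in the shard consistent.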
Re: basic solr cloud questions
On 09/27/2011 05:05 PM, Yury Kats wrote: You need to either submit the docs to both nodes, or have a replication setup between the two. Otherwise they are not in sync. I hope that's not the case. :/ My understanding (or hope, maybe) is that the new Solr Cloud implementation will support auto-sharding and distributed indexing. This means that shards will receive different documents regardless of which node received the submitted document (spread evenly based on a hash-node assignment). Distributed queries will thus merge all the Solr shard/node responses. This is similar in theory to how memcache and other big-scale DHTs work. If it's just manually replicated indexes then it's not really a step forward from current Solr. :/
Re: Geo spatial search with multi-valued locations (SOLR-2155 / lucene-spatial-playground)
It doesn't. On 08/29/2011 01:37 PM, Mike Austin wrote: I've been trying to follow the progress of this and I'm not sure what the current status is. Can someone update me on what is currently in Solr4 and does it support multi-valued location in a single document? I saw that SOLR-2155 was not included and is now lucene-spatial-playground. Thanks, Mike
Paging over mutlivalued field results?
Hi, Is it possible to construct a query in Solr where the paged results are matching multivalued fields and not documents? thanks, Darren
Re: Paging over mutlivalued field results?
Hi Erick, Sure thing. I have a document schema where I put the sentences of that document in a multivalued field 'sentences'. I search that field in a query but get back the document results, naturally. I then need to further find which exact sentences matched the query (for each document result) and then do my own paging, since I am only returning pages of sentences and not whole documents (i.e. I don't want to page the document results). Does this make sense? Or is there a better way Solr can accommodate this? Much appreciated. Darren On 08/25/2011 07:24 PM, Erick Erickson wrote: Hmm, I don't quite understand what you want. An example or two would help. Best Erick On Thu, Aug 25, 2011 at 12:11 PM, Darren Govoni dar...@ontrenet.com wrote: Hi, Is it possible to construct a query in Solr where the paged results are matching multivalued fields and not documents? thanks, Darren
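The client-side workflow described here — collect the matching sentences from each returned document, then page over the flattened sentence list — can be sketched standalone. The matching step is reduced to a simple substring test for illustration; in practice you would need the real query semantics (e.g. Solr highlighting) to decide which sentence values matched:

```java
import java.util.*;

// Sketch of client-side sentence paging: flatten the sentences that match
// the query across all document results, then slice out one page.
public class SentencePaging {
    static List<String> pageOfMatches(List<List<String>> docSentences,
                                      String term, int page, int pageSize) {
        List<String> matches = new ArrayList<>();
        for (List<String> sentences : docSentences)
            for (String s : sentences)
                if (s.contains(term))     // stand-in for real query matching
                    matches.add(s);
        int from = Math.min(page * pageSize, matches.size());
        int to = Math.min(from + pageSize, matches.size());
        return matches.subList(from, to);
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("solr is fast", "lucene inside"),
            Arrays.asList("about solr paging", "unrelated line"));
        System.out.println(pageOfMatches(docs, "solr", 0, 1)); // first page
        System.out.println(pageOfMatches(docs, "solr", 1, 1)); // second page
    }
}
```

With Solr itself, the highlighting parameters (hl=true&hl.fl=sentences) are the usual way to recover which field values matched, which would replace the substring test above.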