IndexSchema is not mutable error Solr Cloud 7.7.1

2020-07-23 Thread Porritt, Ian
Hi All,

 

I made a change to schema to add new fields in a
collection, this was uploaded to Zookeeper via the
below command:

 

For the Schema

solr zk cp file:E:\SolrCloud\server\solr\configsets\COLLECTION\conf\schema.xml zk:/configs/COLLECTION/schema.xml -z SERVERNAME1.uleaf.site

 

For the Solrconfig

solr zk cp file:E:\SolrCloud\server\solr\configsets\COLLECTION\conf\solrconfig.xml zk:/configs/COLLECTION/solrconfig.xml -z SERVERNAME1.uleaf.site

Note: the solrconfig has  defined.
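
After pushing changed files to Zookeeper, the collection normally has to be reloaded before the new config and schema take effect; a minimal sketch of that call against the Collections API (the host name is from the commands above, the port is assumed):

http://SERVERNAME1.uleaf.site:8983/solr/admin/collections?action=RELOAD&name=COLLECTION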

 

 

When I then go to update a record with the new field in, I get the following error:

 

org.apache.solr.common.SolrException: This IndexSchema is not mutable.
at org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessorFactory$AddSchemaFieldsUpdateProcessor.processAdd(AddSchemaFieldsUpdateProcessorFactory.java:376)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.FieldNameMutatingUpdateProcessorFactory$1.processAdd(FieldNameMutatingUpdateProcessorFactory.java:75)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.AbstractDefaultValueUpdateProcessorFactory$DefaultValueUpdateProcessor.processAdd(AbstractDefaultValueUpdateProcessorFactory.java:92)
at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:110)
at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:327)
at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readIterator(JavaBinUpdateRequestCodec.java:280)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:333)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:278)
at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$StreamingCodec.readNamedList(JavaBinUpdateRequestCodec.java:235)
at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:298)
at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:278)
at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:191)
at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:126)
at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:123)
at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:70)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2551)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:710)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:395)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:341)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
at
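
The stack trace points at AddSchemaFieldsUpdateProcessorFactory, which needs a mutable (managed) schema; with the classic schema.xml factory the schema is read-only at runtime. Where the managed schema is in use, one alternative to editing schema.xml by hand is to add the new field through the Schema API. A minimal SolrJ sketch of that call; the ZK address, collection name, and field definition below are placeholders, not values from this thread:

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;
import org.apache.solr.client.solrj.response.schema.SchemaResponse;

public class AddFieldSketch {
  public static void main(String[] args) throws Exception {
    // ZK ensemble address and collection name are assumptions for illustration.
    try (CloudSolrClient client = new CloudSolrClient.Builder(
            Collections.singletonList("SERVERNAME1.uleaf.site:2181"), Optional.empty()).build()) {
      Map<String, Object> field = new LinkedHashMap<>();
      field.put("name", "newField");   // hypothetical field name
      field.put("type", "string");     // hypothetical field type
      field.put("stored", true);

      // Sends an "add-field" command to /COLLECTION/schema.
      SchemaResponse.UpdateResponse rsp =
          new SchemaRequest.AddField(field).process(client, "COLLECTION");
      System.out.println("Schema API status: " + rsp.getStatus());
    }
  }
}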

RE: Query regarding Solr Cloud Setup

2019-09-06 Thread Porritt, Ian
Hi Jörn/Erick/Shawn, thanks for your responses.

@Jörn - much appreciated for the heads up on Kerberos authentication; it's something we haven't 
really considered at the moment, though for production this may well be the case. With regards to 
the Solr nodes, 3 is something we are looking at as a minimum; when adding a new Solr node to the 
cluster, will settings/configuration be applied by Zookeeper on the new node or is there 
manual intervention?
@Erick - With regards to the core.properties, on standard Solr the 
update.autoCreateFields=false is within the core.properties file, however for 
Cloud I have it added within solrconfig.xml which gets uploaded to Zookeeper. I 
appreciate standalone and cloud may work entirely differently; I just wanted to ensure 
it's the correct way of doing it.
@Shawn - Will try the creation of the lib directory in Solr Home to see if it 
gets picked up, and having 5 Zookeepers would more than satisfy high availability.


Regards
Ian 

-Original Message-
From: Jörn Franke  

If you have a properly secured cluster, e.g. with Kerberos, then you should not 
update files in ZK directly. Use the corresponding Solr REST interfaces; then you 
are also less likely to mess something up. 
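
One concrete example of that approach: the add-unknown-fields behaviour discussed in this thread can be switched off through the Config API rather than by hand-editing files in ZK. A sketch, with the host, port, and collection name assumed:

curl http://localhost:8983/solr/COLLECTION/config -H 'Content-Type: application/json' -d '{"set-user-property": {"update.autoCreateFields": "false"}}'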

If you want to have HA you should have at least 3 Solr nodes and replicate the 
collection to all three of them (more is not needed from a HA point of view). 
This would also allow you upgrades to the cluster without downtime.

-Original Message-
From: Erick Erickson <erickerick...@gmail.com>
Having custom core.properties files is “fraught”. First of all, that file can 
be re-written. Second, the collections ADDREPLICA command will create a new 
core.properties file. Third, any mistakes you make when hand-editing the file 
can have grave consequences.

What change exactly do you want to make to core.properties and why?

Trying to reproduce “what a colleague has done on standalone” is not something 
I’d recommend, SolrCloud is a different beast. Reproducing the _behavior_ is 
another thing, so what is the behavior you want in SolrCloud that causes you to 
want to customize core.properties?

Best,
Erick  

-Original Message-
From: Shawn Heisey 

I cannot tell what you are asking here.  The core.properties file lives 
on the disk, not in ZK.

I was under the impression that .jar files could not be loaded into ZK 
and used in a core config.  Documentation saying otherwise was recently 
pointed out to me on the list, but I remain skeptical that this actually 
works, and I have not tried to implement it myself.

The best way to handle custom jar loading is to create a "lib" directory 
under the solr home, and place all jars there.  Solr will automatically 
load them all before any cores are started, and no config commands of 
any kind will be needed to make it happen.
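
A minimal sketch of that layout (paths and jar name are assumptions; any solr home location works the same way):

<solr home>/
    solr.xml
    lib/
        custom-plugin.jar      (hypothetical custom jar)
    <core directories>/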

> Also from a high availability aspect, if I effectivly lost 2 of the Solr 
> Servers due to an outage will the system still work as expected? Would I 
> expect any data loss?

If all three Solr servers have a complete copy of all your indexes, then 
you should remain fully operational if two of those Solr servers go down.

Note that if you have three ZK servers and you lose two, that means that 
you have lost zookeeper quorum, and in that situation, SolrCloud will 
transition to read only -- you will not be able to change any index in 
the cloud.  This is how ZK is designed and it cannot be changed.  If you 
want a ZK deployment to survive the loss of two servers, you must have 
at least five total ZK servers, so more than 50 percent of the total 
survives.
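
In other words, ZK quorum needs a strict majority, floor(n/2) + 1 live servers: with 3 ZK nodes that is 2, so losing 2 leaves only 1 and quorum is gone; with 5 ZK nodes it is 3, so 2 can fail and the remaining 3 still form a quorum.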

Thanks,
Shawn


smime.p7s
Description: S/MIME cryptographic signature


Query regarding Solr Cloud Setup

2019-09-03 Thread Porritt, Ian
Hi,

 

I am relatively new to Solr especially Solr Cloud and have been using it for
a few days now. I think I have setup Solr Cloud correctly however would like
some guidance to ensure I am doing it correctly. I ideally want to be able
to process 40 million documents on production via Solr Cloud. The number of
fields is undefined as the documents may differ but could be around 20+. 

 

The current setup I have at present is as follows (note this is all on 1
machine for now): a 3-Zookeeper ensemble (all running on different ports),
which works as expected. 

 

3 Solr nodes started on separate ports (note: directory path
D:\solr-7.7.1\example\cloud\Node (1/2/3)). 

 



 

Setup of Solr would be similar to the above except it's on my local machine; the
below is the Graph status in Solr Cloud.

 



 

I have a few questions which I cannot seem to find the answer for on the
web. 

 

We have a schema which I have managed to upload to Zookeeper along with the
solrconfig; how do I get the system to recognise both a lib/.jar extension
and a custom core.properties file? I bypassed the core.properties issue by
amending update.autoCreateFields in the solrconfig.xml to false, however I
would like to include it as a colleague has done on Solr Standalone.

 

Also, from a high availability aspect, if I effectively lost 2 of the Solr
servers due to an outage, would the system still work as expected? Would I
expect any data loss? 

 

 



smime.p7s
Description: S/MIME cryptographic signature


Re: Solr POST Tool Hidden Files

2018-06-01 Thread Ian Goldsmith-Rooney
Agreed, but yes it skips them even when explicitly referenced by name. The
line I linked to (530) will skip any file whose name begins with a dot. If
there's a better workaround than what I've proposed then I'm certainly open
to it.

Best,
Ian

On Fri, Jun 1, 2018 at 1:25 PM, Alexandre Rafalovitch 
wrote:

> Does it still skip them if they are provided directly by name? It is rather
> a narrow use case.
>
> Regards,
> Alex
>
> On Fri, Jun 1, 2018, 1:01 PM Ian Goldsmith-Roooney, <
> iangoldsmithroo...@gmail.com> wrote:
>
> > Hello,
> >
> > I was hoping to make a small change to allow the simple POST tool to
> accept
> > a command line arg (-Dhidden=yes) so that it will not ignore hidden
> files.
> > Currently there is no toggle; it always ignores hidden files
> > <
> > https://github.com/apache/lucene-solr/blob/master/solr/
> core/src/java/org/apache/solr/util/SimplePostTool.java#L530
> > >.
> >
> >
> > Having never contribute to Solr before, can somebody point me to the best
> > way of making this change if it is acceptable? The HowToContribute
> > indicated that one can either go the route of GitHub fork -> PR or JIRA
> bug
> > -> patch. Additionally wasn't sure which git branch would be best in this
> > case. Any guidance on best practices is much appreciated!
> >
> > Best,
> > --
> > Ian Goldsmith-Rooney
> >
>



-- 
Ian Goldsmith-Rooney


Solr POST Tool Hidden Files

2018-06-01 Thread Ian Goldsmith-Rooney
Hello,

I was hoping to make a small change to allow the simple POST tool to accept
a command line arg (-Dhidden=yes) so that it will not ignore hidden files.
Currently there is no toggle; it always ignores hidden files
<https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/util/SimplePostTool.java#L530>.
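
A rough sketch of the kind of toggle being proposed, kept separate from the real SimplePostTool code; the -Dhidden=yes property name is the one suggested above, everything else is illustrative:

import java.io.File;

public class HiddenFileToggleSketch {
  // Mirrors the proposed -Dhidden=yes switch; defaults to skipping hidden files.
  private static final boolean INCLUDE_HIDDEN =
      "yes".equalsIgnoreCase(System.getProperty("hidden", "no"));

  static boolean shouldSkip(File f) {
    // Today the tool always skips names starting with a dot; the proposal
    // makes that conditional on the system property.
    return f.getName().startsWith(".") && !INCLUDE_HIDDEN;
  }

  public static void main(String[] args) {
    for (String path : args) {
      File f = new File(path);
      System.out.println(f + (shouldSkip(f) ? "  [skipped]" : "  [posted]"));
    }
  }
}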


Having never contributed to Solr before, can somebody point me to the best
way of making this change if it is acceptable? The HowToContribute
indicated that one can either go the route of GitHub fork -> PR or JIRA bug
-> patch. Additionally wasn't sure which git branch would be best in this
case. Any guidance on best practices is much appreciated!

Best,
-- 
Ian Goldsmith-Rooney


RE: Re:the number of docs in each group depends on rows

2018-05-06 Thread Ian Caldwell
When I looked at this in Solr 5.5.3, the second phase of the query was only sent 
to the shards that returned documents in the first phase; the problem is that 
one shard may contain matching documents in a group but ranked outside the top 
N results.

Fatduo, this solution won't help you unless you are looking at changing some 
Solr code, but it supports Diego's point that maybe this could be fixed (and is 
a starting point to look at, as the code may have changed in 7.0).

We changed the grouping code to search all shards on the second phase. (I think 
that this was all that was needed, but we changed grouping to be two-level, so 
there were lots of changes in the grouping code.)
In the 5.5.3 code base we changed the method constructRequest(ResponseBuilder 
rb) in TopGroupsShardRequestFactory to always call createRequestForAllShards(rb).


Ian
NLA

-Original Message-
From: Diego Ceccarelli (BLOOMBERG/ LONDON) <dceccarel...@bloomberg.net> 
Sent: Friday, 4 May 2018 9:37 PM
To: solr-user@lucene.apache.org
Subject: Re:the number of docs in each group depends on rows

Hello, 

I'm not 100% sure, but I think that if you have multiple shards the number of 
docs matched in each group is *not* guaranteed to be exact. Increasing the rows 
will increase the amount of partial information that each shard sends to the 
federator and make the number more precise.

For exact counts you might need one shard OR  to make sure that all the 
documents in the same group are in the same shard by using document routing via 
composite keys [1].
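
A small SolrJ sketch of that routing approach with the default compositeId router; the group field, id scheme, ZK address, and collection name are hypothetical. The part of the id before the '!' decides the shard, so every document of a group lands together:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class GroupRoutingSketch {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder(
            Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
      client.setDefaultCollection("mycollection");   // hypothetical collection

      List<SolrInputDocument> docs = new ArrayList<>();
      for (int i = 0; i < 3; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "groupA!doc" + i);  // "groupA" is the route key
        doc.addField("group_s", "groupA");     // field later used for grouping
        docs.add(doc);
      }
      client.add(docs);
      client.commit();
    }
  }
}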

Thinking about that, it should be possible to fix grouping to compute the exact 
numbers on request...

cheers,
Diego


[1] 
https://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html#shards-and-indexing-data-in-solrcloud


From: solr-user@lucene.apache.org At: 05/04/18 07:53:41To:  
solr-user@lucene.apache.org
Subject: the number of docs in each group depends on rows

Hi,
We used Solr Cloud 7.1.0 (3 nodes, 3 shards with 2 replicas). When we used a group 
query, we found that the number of docs in each group depends on the rows 
number (group number).

difference:
<http://lucene.472066.n3.nabble.com/file/t494000/difference.jpeg> 

When rows is bigger than 5, the returned docs are correct and stable; for the 
rest, the number of docs is smaller than the actual result.

Could you please explain why and give me some suggestion about how to decide 
the rows number?


--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html




possible to dump "routing table" from a single Solr node?

2016-02-03 Thread Ian Rose
Hi all,

I'm having a situation where our SolrCloud cluster often gets into a bad
state where our solr nodes frequently respond with "no servers hosting
shard" even though the node that hosts that shard is clearly up.  We
suspect that this is a state bug where some servers are somehow ending up
with an incorrect view of the network (e.g. which nodes are up/down, which
shards are hosted on which nodes).  Is it possible to somehow get a "dump"
of the current "routing table" (i.e. documents with prefixes in this range
in this collection are stored in this shard on this node)?  That would help
immensely when debugging.

Thanks!
- Ian


Using facets and stats with solr v4

2015-11-20 Thread Ian Harrigan
Hi guys

So I have a question about using facet queries but getting stats for each
facet item; it seems this is possible on Solr v5+. Something like this:

 

q=*:*&facet=true&facet.pivot={!stats=t1}servicename&stats.field={!tag=t1}duration&rows=0&stats=true&wt=json&indent=true

 

It also seems this isn't available for lower versions (v4), so is there any way
to achieve something similar, as we are stuck on v4?

Any help / advice / pointers would be great! - thanks in advance

Ian

 



Re: Async deleteshard commands?

2015-04-28 Thread Ian Rose
Done!

https://issues.apache.org/jira/browse/SOLR-7481


On Tue, Apr 28, 2015 at 11:09 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 This is a bug. Can you please open a Jira issue?

 On Tue, Apr 28, 2015 at 8:35 PM, Ian Rose ianr...@fullstory.com wrote:

  Is it possible to run DELETESHARD commands in async mode?  Google
 searches
  seem to indicate yes, but not definitively.
 
  My local experience indicates otherwise.  If I start with an async
  SPLITSHARD like so:
 
 
 
 http://localhost:8983/solr/admin/collections?action=splitshardcollection=2Gpshard=shard1_0_0async=12-foo-1
 
  Then I get back the expected response format, with str name=requestid
  12-foo-1/str
 
  And I can later query for the result via REQUESTSTATUS.
 
  However if I try an async DELETESHARD like so:
 
 
 
 http://localhost:8983/solr/admin/collections?action=deleteshardcollection=2Gpshard=shard1_0_0async=12-foo-4
 
  The response includes the command result, indicating that the command was
  not run async:
 
  lst name=success
  lst name=192.168.1.106:8983_solr
  lst name=responseHeader
  int name=status0/int
  int name=QTime16/int
  /lst
  /lst
  /lst
 
  And in addition REQUESTSTATUS calls for that requestId fail with Did not
  find taskid [12-foo-4] in any tasks queue.
 
  Synchronous deletes are causing problems for me in production as they are
  timing out in some cases.
 
  Thanks,
  Ian
 
 
  p.s. I'm on version 5.0.0
 



 --
 Regards,
 Shalin Shekhar Mangar.



Re: Async deleteshard commands?

2015-04-28 Thread Ian Rose
Hi Anshum,

FWIW I find that page is not entirely accurate with regard to async
params.  For example, my testing shows that DELETEREPLICA *does* support
the async param, although that is not listed here:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api9

Cheers,
Ian


On Tue, Apr 28, 2015 at 12:47 PM, Anshum Gupta ans...@anshumgupta.net
wrote:

 Hi Ian,

 DELETESHARD doesn't support ASYNC calls officially. We could certainly do
 with a better response but I believe with most of the Collections API calls
 at this time in Solr, you could send random params which would get ignored.
 Therefore, in this case, I believe that the async param gets ignored.

 The go-to reference point to check what's supported is the official
 reference guide:

 https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api7

 This doesn't mentioned support for async DELETESHARD calls.

 On Tue, Apr 28, 2015 at 8:05 AM, Ian Rose ianr...@fullstory.com wrote:

  Is it possible to run DELETESHARD commands in async mode?  Google
 searches
  seem to indicate yes, but not definitively.
 
  My local experience indicates otherwise.  If I start with an async
  SPLITSHARD like so:
 
 
 
 http://localhost:8983/solr/admin/collections?action=splitshardcollection=2Gpshard=shard1_0_0async=12-foo-1
 
  Then I get back the expected response format, with str name=requestid
  12-foo-1/str
 
  And I can later query for the result via REQUESTSTATUS.
 
  However if I try an async DELETESHARD like so:
 
 
 
 http://localhost:8983/solr/admin/collections?action=deleteshardcollection=2Gpshard=shard1_0_0async=12-foo-4
 
  The response includes the command result, indicating that the command was
  not run async:
 
  lst name=success
  lst name=192.168.1.106:8983_solr
  lst name=responseHeader
  int name=status0/int
  int name=QTime16/int
  /lst
  /lst
  /lst
 
  And in addition REQUESTSTATUS calls for that requestId fail with Did not
  find taskid [12-foo-4] in any tasks queue.
 
  Synchronous deletes are causing problems for me in production as they are
  timing out in some cases.
 
  Thanks,
  Ian
 
 
  p.s. I'm on version 5.0.0
 



 --
 Anshum Gupta



Re: Async deleteshard commands?

2015-04-28 Thread Ian Rose
Sure.  Here is an example of ADDREPLICA in synchronous mode:

http://localhost:8983/solr/admin/collections?action=addreplica&collection=293&shard=shard1_1

response:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1168</int>
</lst>
<lst name="success">
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1158</int>
</lst>
<str name="core">293_shard1_1_replica2</str>
</lst>
</lst>
</response>

And here is the same in asynchronous mode:

http://localhost:8983/solr/admin/collections?action=addreplica&collection=293&shard=shard1_1&async=foo99

response:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">2</int>
</lst>
<str name="requestid">foo99</str>
</response>

Note that the format of this response does NOT match the response format
that I got from the attempt at an async DELETESHARD in my earlier email.

Also note that I am now able to query for the status of this request:

http://localhost:8983/solr/admin/collections?action=requeststatus&requestid=foo99

response:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="status">
<str name="state">completed</str>
<str name="msg">found foo99 in completed tasks</str>
</lst>
</response>



On Tue, Apr 28, 2015 at 2:06 PM, Anshum Gupta ans...@anshumgupta.net
wrote:

 Hi Ian,

 What do you mean by *my testing shows* ? Can you elaborate on the steps
 and how did you confirm that the call was indeed *async* ?
 I may be wrong but I think what you're seeing is a normal DELETEREPLICA
 call succeeding behind the scenes. It is not treated or processed as an
 async call.

 Also, that page is the official reference guide and might need fixing if
 it's out of sync.


 On Tue, Apr 28, 2015 at 10:47 AM, Ian Rose ianr...@fullstory.com wrote:

  Hi Anshum,
 
  FWIW I find that page is not entirely accurate with regard to async
  params.  For example, my testing shows that DELETEREPLICA *does* support
  the async param, although that is not listed here:
 
 
 https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api9
 
  Cheers,
  Ian
 
 
  On Tue, Apr 28, 2015 at 12:47 PM, Anshum Gupta ans...@anshumgupta.net
  wrote:
 
   Hi Ian,
  
   DELETESHARD doesn't support ASYNC calls officially. We could certainly
 do
   with a better response but I believe with most of the Collections API
  calls
   at this time in Solr, you could send random params which would get
  ignored.
   Therefore, in this case, I believe that the async param gets ignored.
  
   The go-to reference point to check what's supported is the official
   reference guide:
  
  
 
 https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api7
  
   This doesn't mentioned support for async DELETESHARD calls.
  
   On Tue, Apr 28, 2015 at 8:05 AM, Ian Rose ianr...@fullstory.com
 wrote:
  
Is it possible to run DELETESHARD commands in async mode?  Google
   searches
seem to indicate yes, but not definitively.
   
My local experience indicates otherwise.  If I start with an async
SPLITSHARD like so:
   
   
   
  
 
 http://localhost:8983/solr/admin/collections?action=splitshardcollection=2Gpshard=shard1_0_0async=12-foo-1
   
Then I get back the expected response format, with str
  name=requestid
12-foo-1/str
   
And I can later query for the result via REQUESTSTATUS.
   
However if I try an async DELETESHARD like so:
   
   
   
  
 
 http://localhost:8983/solr/admin/collections?action=deleteshardcollection=2Gpshard=shard1_0_0async=12-foo-4
   
The response includes the command result, indicating that the command
  was
not run async:
   
lst name=success
lst name=192.168.1.106:8983_solr
lst name=responseHeader
int name=status0/int
int name=QTime16/int
/lst
/lst
/lst
   
And in addition REQUESTSTATUS calls for that requestId fail with Did
  not
find taskid [12-foo-4] in any tasks queue.
   
Synchronous deletes are causing problems for me in production as they
  are
timing out in some cases.
   
Thanks,
Ian
   
   
p.s. I'm on version 5.0.0
   
  
  
  
   --
   Anshum Gupta
  
 



 --
 Anshum Gupta



Async deleteshard commands?

2015-04-28 Thread Ian Rose
Is it possible to run DELETESHARD commands in async mode?  Google searches
seem to indicate yes, but not definitively.

My local experience indicates otherwise.  If I start with an async
SPLITSHARD like so:

http://localhost:8983/solr/admin/collections?action=splitshard&collection=2Gp&shard=shard1_0_0&async=12-foo-1

Then I get back the expected response format, with <str name="requestid">12-foo-1</str>

And I can later query for the result via REQUESTSTATUS.

However if I try an async DELETESHARD like so:

http://localhost:8983/solr/admin/collections?action=deleteshard&collection=2Gp&shard=shard1_0_0&async=12-foo-4

The response includes the command result, indicating that the command was
not run async:

<lst name="success">
<lst name="192.168.1.106:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">16</int>
</lst>
</lst>
</lst>

And in addition REQUESTSTATUS calls for that requestId fail with "Did not
find taskid [12-foo-4] in any tasks queue".

Synchronous deletes are causing problems for me in production as they are
timing out in some cases.

Thanks,
Ian


p.s. I'm on version 5.0.0


proper routing (from non-Java client) in solr cloud 5.0.0

2015-04-14 Thread Ian Rose
Hi all -

I've just upgraded my dev install of Solr (cloud) from 4.10 to 5.0.  Our
client is written in Go, for which I am not aware of a client, so we wrote
our own.  One tricky bit for this was the routing logic; if a document has
routing prefix X and belong to collection Y, we need to know which solr
node to connect to.  Previously we accomplished this by watching the
clusterstate.json
file in zookeeper - at startup and whenever it changes, the client parses
the file contents to build a routing table.

However in 5.0 newly create collections do not show up in clusterstate.json
but instead have their own state.json document.  Are there any
recommendations for how to handle this from the client?  The obvious answer
is to watch every collection's state.json document, but we run a lot of
collections (~1000 currently, and growing) so I'm concerned about keeping
that many watches open at the same time (should I be?).  How does the SolrJ
client handle this?

Thanks!
- Ian


Re: proper routing (from non-Java client) in solr cloud 5.0.0

2015-04-14 Thread Ian Rose
Hi Hrishikesh,

Thanks for the pointers - I had not looked at SOLR-5474
https://issues.apache.org/jira/browse/SOLR-5474 previously.  Interesting
approach...  I think we will stick with trying to keep zk watches open from
all clients to all collections for now, but if that starts to be a
bottleneck its good to know how the route that Solrj has chosen...

cheers,
Ian



On Tue, Apr 14, 2015 at 3:56 PM, Hrishikesh Gadre gadre.s...@gmail.com
wrote:

 Hi Ian,

 As per my understanding, Solrj does not use Zookeeper watches but instead
 caches the information (along with a TTL). You can find more information
 here,

 https://issues.apache.org/jira/browse/SOLR-5473
 https://issues.apache.org/jira/browse/SOLR-5474

 Regards
 Hrishikesh


 On Tue, Apr 14, 2015 at 8:49 AM, Ian Rose ianr...@fullstory.com wrote:

  Hi all -
 
  I've just upgraded my dev install of Solr (cloud) from 4.10 to 5.0.  Our
  client is written in Go, for which I am not aware of a client, so we
 wrote
  our own.  One tricky bit for this was the routing logic; if a document
 has
  routing prefix X and belong to collection Y, we need to know which solr
  node to connect to.  Previously we accomplished this by watching the
  clusterstate.json
  file in zookeeper - at startup and whenever it changes, the client parses
  the file contents to build a routing table.
 
  However in 5.0 newly create collections do not show up in
 clusterstate.json
  but instead have their own state.json document.  Are there any
  recommendations for how to handle this from the client?  The obvious
 answer
  is to watch every collection's state.json document, but we run a lot of
  collections (~1000 currently, and growing) so I'm concerned about keeping
  that many watches open at the same time (should I be?).  How does the
 SolrJ
  client handle this?
 
  Thanks!
  - Ian
 



Re: Help understanding addreplica error message re: maxShardsPerNode

2015-04-08 Thread Ian Rose
Wups - sorry folks, I sent this prematurely.  After typing this out I think
I have it figured out - although SPLITSHARD ignores maxShardsPerNode,
ADDREPLICA does not.  So ADDREPLICA fails because I already have too many
shards on a single node.

On Wed, Apr 8, 2015 at 11:18 PM, Ian Rose ianr...@fullstory.com wrote:

 On my local machine I have the following test setup:

 * 2 nodes (JVMs)
 * 1 collection named testdrive, that was originally created with
 numShards=1 and maxShardsPerNode=1.
 * After a series of SPLITSHARD commands, I now have 4 shards, as follows:

 testdrive_shard1_0_0_replica1 (L) Active 115
 testdrive_shard1_0_1_replica1 (L) Active 0
 testdrive_shard1_1_0_replica1 (L) Active 5
 testdrive_shard1_1_1_replica1 (L) Active 88

 The number in the last column is the number of documents.  The 4 shards
 are all on the same node; the second node holds nothing for this collection.

 Already, this situation is a little strange because I have 4 shards on one
 node, despite the fact that maxShardsPerNode is 1.  My guess is that
 SPLITSHARD ignores the maxShardsPerNode value - is that right?

 Now, if I issue an ADDREPLICA command
 with collection=testdrive&shard=shard1_0_0, I get the following error:

 Cannot create shards testdrive. Value of maxShardsPerNode is 1, and the
 number of live nodes is 2. This allows a maximum of 2 to be created. Value
 of numShards is 4 and value of replicationFactor is 1. This requires 4
 shards to be created (higher than the allowed number)

 I don't totally understand this.





Re: change maxShardsPerNode for existing collection?

2015-04-08 Thread Ian Rose
Thanks, I figured that might be the case (hand-editing clusterstate.json).

- Ian


On Wed, Apr 8, 2015 at 11:46 PM, ralph tice ralph.t...@gmail.com wrote:

 It looks like there's a patch available:
 https://issues.apache.org/jira/browse/SOLR-5132

 Currently the only way without that patch is to hand-edit
 clusterstate.json, which is very ill advised.  If you absolutely must,
 it's best to stop all your Solr nodes, backup the current clusterstate
 in ZK, modify it, and then start your nodes.

 On Wed, Apr 8, 2015 at 10:21 PM, Ian Rose ianr...@fullstory.com wrote:
  I previously created several collections with maxShardsPerNode=1 but I
  would now like to change that (to unlimited if that is an option).  Is
  changing this value possible?
 
  Cheers,
  - Ian



Help understanding addreplica error message re: maxShardsPerNode

2015-04-08 Thread Ian Rose
On my local machine I have the following test setup:

* 2 nodes (JVMs)
* 1 collection named testdrive, that was originally created with
numShards=1 and maxShardsPerNode=1.
* After a series of SPLITSHARD commands, I now have 4 shards, as follows:

testdrive_shard1_0_0_replica1 (L) Active 115
testdrive_shard1_0_1_replica1 (L) Active 0
testdrive_shard1_1_0_replica1 (L) Active 5
testdrive_shard1_1_1_replica1 (L) Active 88

The number in the last column is the number of documents.  The 4 shards are
all on the same node; the second node holds nothing for this collection.

Already, this situation is a little strange because I have 4 shards on one
node, despite the fact that maxShardsPerNode is 1.  My guess is that
SPLITSHARD ignores the maxShardsPerNode value - is that right?

Now, if I issue an ADDREPLICA command
with collection=testdrive&shard=shard1_0_0, I get the following error:

Cannot create shards testdrive. Value of maxShardsPerNode is 1, and the
number of live nodes is 2. This allows a maximum of 2 to be created. Value
of numShards is 4 and value of replicationFactor is 1. This requires 4
shards to be created (higher than the allowed number)

I don't totally understand this.


change maxShardsPerNode for existing collection?

2015-04-08 Thread Ian Rose
I previously created several collections with maxShardsPerNode=1 but I
would now like to change that (to unlimited if that is an option).  Is
changing this value possible?

Cheers,
- Ian


Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Ian Rose
Per - Wow, 1 trillion documents stored is pretty impressive.  One
clarification: when you say that you have 2 replica per collection on each
machine, what exactly does that mean?  Do you mean that each collection is
sharded into 50 shards, divided evenly over all 25 machines (thus 2 shards
per machine)?  Or are some of these slave replicas (e.g. 25x sharding with
1 replica per shard)?

Thanks!

On Wed, Mar 25, 2015 at 5:13 AM, Per Steffensen st...@designware.dk wrote:

 In one of our production environments we use 32GB, 4-core, 3T RAID0
 spinning disk Dell servers (do not remember the exact model). We have about
 25 collections with 2 replica (shard-instances) per collection on each
 machine - 25 machines. Total of 25 coll * 2 replica/coll/machine * 25
 machines = 1250 replica. Each replica contains about 800 million pretty
 small documents - thats about 1000 billion (do not know the english word
 for it) documents all in all. We index about 1.5 billion new documents
 every day (mainly into one of the collections = 50 replica across 25
 machine) and keep a history of 2 years on the data. Shifting the index
 into collection every month. We can fairly easy keep up with the indexing
 load. We have almost non of the data on the heap, but of course a small
 fraction of the data in the files will at any time be in OS file-cache.
 Compared to our indexing frequency we do not do a lot of searches. We have
 about 10 users searching the system from time to time - anything from major
 extracts to small quick searches. Depending on the nature of the search we
 have response-times between 1 sec and 5 min. But of course that is very
 dependent on clever choice on each field wrt index, store, doc-value etc.
 BUT we are not using out-of-box Apache Solr. We have made quit a lot of
 performance tweaks ourselves.
 Please note that, even though you disable all Solr caches, each replica
 will use heap-memory linearly dependent on the number of documents (and
 their size) in that replica. But not much, so you can get pretty far with
 relatively little RAM.
 Our version of Solr is based on Apache Solr 4.4.0, but I expect/hope it
 did not get worse in newer releases.

 Just to give you some idea of what can at least be achieved - in the
 high-end of #replica and #docs, I guess

 Regards, Per Steffensen


 On 24/03/15 14:02, Ian Rose wrote:

 Hi all -

 I'm sure this topic has been covered before but I was unable to find any
 clear references online or in the mailing list.

 Are there any rules of thumb for how many cores (aka shards, since I am
 using SolrCloud) is too many for one machine?  I realize there is no one
 answer (depends on size of the machine, etc.) so I'm just looking for a
 rough idea.  Something like the following would be very useful:

 * People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
 server without any problems.
 * I have never heard of anyone successfully running X cores/shards on a
 single machine, even if you throw a lot of hardware at it.

 Thanks!
 - Ian





Re: TooManyBasicQueries?

2015-03-24 Thread Ian Rose
Hi Erik -

Sorry, I totally missed your reply.  To the best of my knowledge, we are
not using any surround queries (have to admit I had never heard of them
until now).  We use solr.SearchHandler for all of our queries.

Does that answer the question?

Cheers,
Ian


On Fri, Mar 13, 2015 at 10:08 AM, Erik Hatcher erik.hatc...@gmail.com
wrote:

 It results from a surround query with too many terms.   Says the javadoc:

 * Exception thrown when {@link BasicQueryFactory} would exceed the limit
 * of query clauses.

 I’m curious, are you issuing a large {!surround} query or is it expanding
 to hit that limit?


 —
 Erik Hatcher, Senior Solutions Architect
 http://www.lucidworks.com




  On Mar 13, 2015, at 9:44 AM, Ian Rose ianr...@fullstory.com wrote:
 
  I sometimes see the following in my logs:
 
  ERROR org.apache.solr.core.SolrCore  –
  org.apache.lucene.queryparser.surround.query.TooManyBasicQueries:
 Exceeded
  maximum of 1000 basic queries.
 
 
  What does this mean?  Does this mean that we have issued a query with too
  many terms?  Or that the number of concurrent queries running on the
 server
  is too high?
 
  Also, is this a builtin limit or something set in a config file?
 
  Thanks!
  - Ian




rough maximum cores (shards) per machine?

2015-03-24 Thread Ian Rose
Hi all -

I'm sure this topic has been covered before but I was unable to find any
clear references online or in the mailing list.

Are there any rules of thumb for how many cores (aka shards, since I am
using SolrCloud) is too many for one machine?  I realize there is no one
answer (depends on size of the machine, etc.) so I'm just looking for a
rough idea.  Something like the following would be very useful:

* People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
server without any problems.
* I have never heard of anyone successfully running X cores/shards on a
single machine, even if you throw a lot of hardware at it.

Thanks!
- Ian


Re: TooManyBasicQueries?

2015-03-24 Thread Ian Rose
Ah yes, right you are.  I had thought that `surround` required a different
endpoint, but I see now that someone is using a surround query.

Many thanks!

On Tue, Mar 24, 2015 at 10:02 AM, Erik Hatcher erik.hatc...@gmail.com
wrote:

 Somehow a surround query is being constructed along the way.  Search your
 logs for “surround” and see if someone is maybe sneaking a q={!surround}…
 in there.  If you’re passing input directly through from your application
 to Solr’s q parameter without any sanitizing or filtering, it’s possible a
 surround query parser could be asked for.


 —
 Erik Hatcher, Senior Solutions Architect
 http://www.lucidworks.com http://www.lucidworks.com/




  On Mar 24, 2015, at 8:55 AM, Ian Rose ianr...@fullstory.com wrote:
 
  Hi Erik -
 
  Sorry, I totally missed your reply.  To the best of my knowledge, we are
  not using any surround queries (have to admit I had never heard of them
  until now).  We use solr.SearchHandler for all of our queries.
 
  Does that answer the question?
 
  Cheers,
  Ian
 
 
  On Fri, Mar 13, 2015 at 10:08 AM, Erik Hatcher erik.hatc...@gmail.com
  wrote:
 
  It results from a surround query with too many terms.   Says the
 javadoc:
 
  * Exception thrown when {@link BasicQueryFactory} would exceed the limit
  * of query clauses.
 
  I’m curious, are you issuing a large {!surround} query or is it
 expanding
  to hit that limit?
 
 
  —
  Erik Hatcher, Senior Solutions Architect
  http://www.lucidworks.com
 
 
 
 
  On Mar 13, 2015, at 9:44 AM, Ian Rose ianr...@fullstory.com wrote:
 
  I sometimes see the following in my logs:
 
  ERROR org.apache.solr.core.SolrCore  –
  org.apache.lucene.queryparser.surround.query.TooManyBasicQueries:
  Exceeded
  maximum of 1000 basic queries.
 
 
  What does this mean?  Does this mean that we have issued a query with
 too
  many terms?  Or that the number of concurrent queries running on the
  server
  is too high?
 
  Also, is this a builtin limit or something set in a config file?
 
  Thanks!
  - Ian
 
 




Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Ian Rose
First off thanks everyone for the very useful replies thus far.

Shawn - thanks for the list of items to check.  #1 and #2 should be fine
for us and I'll check our ulimit for #3.

To add a bit of clarification, we are indeed using SolrCloud.  Our current
setup is to create a new collection for each customer.  For now we allow
SolrCloud to decide for itself where to locate the initial shard(s) but in
time we expect to refine this such that our system will automatically
choose the least loaded nodes according to some metric(s).

Having more than one business entity controlling the configuration of a
 single (Solr) server is a recipe for disaster. Solr works well if there is
 an architect for the system.


Jack, can you explain a bit what you mean here?  It looks like Toke caught
your meaning but I'm afraid it missed me.  What do you mean by business
entity?  Is your concern that with automatic creation of collections they
will be distributed willy-nilly across the cluster, leading to uneven load
across nodes?  If it is relevant, the schema and solrconfig are controlled
entirely by me and is the same for all collections.  Thus theoretically we
could actually just use one single collection for all of our customers
(adding a 'customer:whatever' type fq to all queries) but since we never
need to query across customers it seemed more performant (as well as safer
- less chance of accidentally leaking data across customers) to use
separate collections.

Better to give each tenant a separate Solr instance that you spin up and
 spin down based on demand.


Regarding this, if by tenant you mean customer, this is not viable for us
from a cost perspective.  As I mentioned initially, many of our customers
are very small so dedicating an entire machine to each of them would not be
economical (or efficient).  Or perhaps I am not understanding what your
definition of tenant is?

Cheers,
Ian



On Tue, Mar 24, 2015 at 4:51 PM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 Jack Krupansky [jack.krupan...@gmail.com] wrote:
  I'm sure that I am quite unqualified to describe his hypothetical setup.
 I
  mean, he's the one using the term multi-tenancy, so it's for him to be
  clear.

 It was my understanding that Ian used them interchangeably, but of course
 Ian it the only one that knows.

  For me, it's a question of who has control over the config and schema and
  collection creation. Having more than one business entity controlling the
  configuration of a single (Solr) server is a recipe for disaster.

 Thank you. Now your post makes a lot more sense. I will not argue against
 that.

 - Toke Eskildsen



Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Ian Rose
Let me give a bit of background.  Our Solr cluster is multi-tenant, where
we use one collection for each of our customers.  In many cases, these
customers are very tiny, so their collection consists of just a single
shard on a single Solr node.  In fact, a non-trivial number of them are
totally empty (e.g. trial customers that never did anything with their
trial account).  However there are also some customers that are larger,
requiring their collection to be sharded.  Our strategy is to try to keep
the total documents in any one shard under 20 million (honestly not sure
where my coworker got that number from - I am open to alternatives but I
realize this is heavily app-specific).

So my original question is not related to indexing or query traffic, but
just the sheer number of cores.  For example, if I have 10 active cores on
a machine and everything is working fine, should I expect that everything
will still work fine if I add 10 nearly-idle cores to that machine?  What
about 100?  1000?  I figure the overhead of each core is probably fairly
low but at some point starts to matter.

Does that make sense?
- Ian


On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 Shards per collection, or across all collections on the node?

 It will all depend on:

 1. Your ingestion/indexing rate. High, medium or low?
 2. Your query access pattern. Note that a typical query fans out to all
 shards, so having more shards than CPU cores means less parallelism.
 3. How many collections you will have per node.

 In short, it depends on what you want to achieve, not some limit of Solr
 per se.

 Why are you even sharding the node anyway? Why not just run with a single
 shard per node, and do sharding by having separate nodes, to maximize
 parallel processing and availability?

 Also be careful to be clear about using the Solr term shard (a slice,
 across all replica nodes) as distinct from the Elasticsearch term shard
 (a single slice of an index for a single replica, analogous to a Solr
 core.)


 -- Jack Krupansky

 On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose ianr...@fullstory.com wrote:

  Hi all -
 
  I'm sure this topic has been covered before but I was unable to find any
  clear references online or in the mailing list.
 
  Are there any rules of thumb for how many cores (aka shards, since I am
  using SolrCloud) is too many for one machine?  I realize there is no
 one
  answer (depends on size of the machine, etc.) so I'm just looking for a
  rough idea.  Something like the following would be very useful:
 
  * People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
  server without any problems.
  * I have never heard of anyone successfully running X cores/shards on a
  single machine, even if you throw a lot of hardware at it.
 
  Thanks!
  - Ian
 



TooManyBasicQueries?

2015-03-13 Thread Ian Rose
I sometimes see the following in my logs:

ERROR org.apache.solr.core.SolrCore  –
org.apache.lucene.queryparser.surround.query.TooManyBasicQueries: Exceeded
maximum of 1000 basic queries.


What does this mean?  Does this mean that we have issued a query with too
many terms?  Or that the number of concurrent queries running on the server
is too high?

Also, is this a builtin limit or something set in a config file?

Thanks!
- Ian


Re: Using Zookeeper with REST URL

2014-11-19 Thread Ian Rose
I don't think zookeeper has a REST api.  You'll need to use a Zookeeper
client library in your language (or roll one yourself).
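
For example, with the plain ZooKeeper Java client you can read the children of Solr's /live_nodes znode to discover which servers are currently up; a minimal sketch, with the ensemble address assumed and no reconnect handling:

import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class LiveNodesSketch {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("10.0.1.8:2181", 15000, new Watcher() {
      @Override
      public void process(WatchedEvent event) {
        if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
          connected.countDown();
        }
      }
    });
    try {
      connected.await();   // wait for the session to be established
      // Solr registers each live node under /live_nodes, e.g. "10.0.1.8:8983_solr".
      List<String> liveNodes = zk.getChildren("/live_nodes", false);
      for (String node : liveNodes) {
        System.out.println(node);
      }
    } finally {
      zk.close();
    }
  }
}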

On Wed, Nov 19, 2014 at 9:48 AM, nabil Kouici koui...@yahoo.fr wrote:

 Hi All,

 I'm connecting to Solr using the REST API (no library like SolrJ). As my Solr
 configuration is in cloud mode using a Zookeeper ensemble, I don't know how to get an
 available Solr server from ZooKeeper to use in my URL call. With SolrJ
 I can do:

 String zkHostString = "10.0.1.8:2181";
 CloudSolrServer solr = new CloudSolrServer(zkHostString);
 solr.connect();

 SolrQuery solrQuery = new SolrQuery("*:*");
 solrQuery.setRows(10);
 QueryResponse resp = solr.query(solrQuery);

 Any help.

 Regards,
 Nabil


Re: Ideas for debugging poor SolrCloud scalability

2014-11-07 Thread Ian Rose
Hi again, all -

Since several people were kind enough to jump in to offer advice on this
thread, I wanted to follow up in case anyone finds this useful in the
future.

*tl;dr: *Routing updates to a random Solr node (and then letting it forward
the docs to where they need to go) is very expensive, more than I
expected.  Using a smart router that uses the cluster config to route
documents directly to their shard results in (near) linear scaling for us.

*Expository version:*

We use Go on our client, for which (to my knowledge) there is no SolrCloud
router implementation.  So we started by just routing updates to a random
Solr node and letting it forward the docs to where they need to go.  My
theory was that this would lead to a constant amount of additional work
(and thus still linear scaling).  This was based on the observation that if
you send an update of K documents to a Solr node in a N node cluster, in
the worst case scenario, all K documents will need to be forwarded on to
other nodes.  Since Solr nodes have perfect knowledge of where docs belong,
each doc would only take 1 additional hop to get to its replica.  So random
routing (in the limit) imposes 1 additional network hop for each document.

In practice, however, we find that (for small networks, at least) per-node
performance falls as you add shards.  In fact, the client performance (in
writes/sec) was essentially constant no matter how many shards we added.  I
do have a working theory as to why this might be (i.e. where the flaw is in
my logic above) but as this is merely an unverified theory I don't want to
lead anyone astray by writing it up here.

However, by writing a smart router that retrieves the clusterstate.json
file from Zookeeper and uses that to perfectly route documents to their
proper shard, we were able to achieve much better scalability.  Using a
synthetic workload, we were able to achieve 141.7 writes/sec to a cluster
of size 1 and 2506 writes/sec to a cluster of size 20 (125
writes/sec/node).  So a dropoff of ~12% which is not too bad.  We are
hoping to continue our tests with larger clusters to ensure that the
per-node write performance levels off and does not continue to drop as the
cluster scales.
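
For comparison, Java clients get this behaviour from SolrJ: CloudSolrServer (CloudSolrClient in later releases) reads the cluster state from ZooKeeper and routes each document to its shard leader. A minimal batched-indexing sketch along those lines; the ZK address, collection name, field names, and batch sizes are assumptions, with the batch of 1,000 taken from the advice quoted below:

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIndexingSketch {
  public static void main(String[] args) throws Exception {
    CloudSolrServer solr = new CloudSolrServer("localhost:2181");  // ZK ensemble assumed
    solr.setDefaultCollection("mycollection");                     // collection name assumed
    solr.connect();

    List<SolrInputDocument> batch = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", UUID.randomUUID().toString());
      doc.addField("body_t", "synthetic document " + i);
      batch.add(doc);
      if (batch.size() == 500) {   // send in chunks; leader-aware routing happens inside SolrJ
        solr.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      solr.add(batch);
    }
    solr.commit();
    solr.shutdown();
  }
}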

I will also note that we initially had several bugs in our smart router
implementation so if you follow a similar path and see bad performance look
to your router implementation as you might not be routing correctly.  We
ended up writing a simple proxy that we ran in front of Solr to observe all
requests which helped immensely when verifying and debugging our router.
Yes tcpdump does something similar but viewing HTTP-level traffic is way
more convenient than TCP-level.  Plus Go makes little proxies like this
super easy to do.

Hope all that is useful to someone.  Thanks again to the posters above for
providing suggestions!

- Ian



On Sat, Nov 1, 2014 at 7:13 PM, Erick Erickson erickerick...@gmail.com
wrote:

 bq: but it should be more or less a constant factor no matter how many
 Solr nodes you are using, right?

 Not really. You've stated that you're not driving Solr very hard in
 your tests. Therefore you're waiting on I/O. Therefore your tests just
 aren't going to scale linearly with the number of shards. This is a
 simplification, but

 Your network utilization is pretty much irrelevant. I send a packet
 somewhere. somewhere does some stuff and sends me back an
 acknowledgement. While I'm waiting, the network is getting no traffic,
 so. If the network traffic was in the 90% range that would be
 different, so it's a good thing to monitor.

 Really, use a leader aware client and rack enough clients together
 that you're driving Solr hard. Then double the number of shards. Then
 rack enough _more_ clients to drive Solr at the same level. In this
 case I'll go out on a limb and predict near 2x throughput increases.

 One additional note, though. When you add _replicas_ to shards expect
 to see a drop in throughput that may be quite significant, 20-40%
 anecdotally...

 Best,
 Erick

 On Sat, Nov 1, 2014 at 9:23 AM, Shawn Heisey apa...@elyograg.org wrote:
  On 11/1/2014 9:52 AM, Ian Rose wrote:
  Just to make sure I am thinking about this right: batching will
 certainly
  make a big difference in performance, but it should be more or less a
  constant factor no matter how many Solr nodes you are using, right?
 Right
  now in my load tests, I'm not actually that concerned about the absolute
  performance numbers; instead I'm just trying to figure out why relative
  performance (no matter how bad it is since I am not batching) does not
 go
  up with more Solr nodes.  Once I get that part figured out and we are
  seeing more writes per sec when we add nodes, then I'll turn on
 batching in
  the client to see what kind of additional performance gain that gets us.
 
  The basic problem I see with your methodology is that you are sending an
  update request and waiting for it to complete before sending another

Migrating shards

2014-11-07 Thread Ian Rose
Howdy -

What is the current best practice for migrating shards to another machine?
I have heard suggestions that it is add replica on new machine, wait for
it to catch up, delete original replica on old machine.  But I wanted to
check to make sure...

And if that is the best method, two follow-up questions:

1. Is there a best practice for knowing when the new replica has caught
up or do you just do a *:* query on both, compare counts, and call it a
day when they are the same (or nearly so, since the slave replica might lag
a little bit)?

2. When deleting the original (old) replica, since that one could be the
leader, is the replica deletion done in a safe manner such that no
documents will be lost (e.g. ones that were recently received by the leader
and not yet synced over to the slave replica before the leader is deleted)?

Thanks as always,
Ian
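
For reference, that add-then-delete flow maps onto two Collections API calls; a sketch with the host, collection, shard, target node, and replica name assumed:

http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&node=newhost:8983_solr

(then, once the new replica shows as active)

http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycollection&shard=shard1&replica=core_node2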


Re: Migrating shards

2014-11-07 Thread Ian Rose
Sounds great - thanks all.

On Fri, Nov 7, 2014 at 2:06 PM, Erick Erickson erickerick...@gmail.com
wrote:

 bq: I think ADD/DELETE replica APIs are best for within a SolrCloud

 I second this, if for no other reason than I'd expect this to get
 more attention than the underlying core admin API.

 That said, I believe ADD/DELETE replica just makes use of the core
 admin API under the covers, in which case you'd get all the goodness
 baked into the core admin API plus whatever extra is written into
 the collections api processing.

 Best,
 Erick

 On Fri, Nov 7, 2014 at 8:28 AM, ralph tice ralph.t...@gmail.com wrote:
  I think ADD/DELETE replica APIs are best for within a SolrCloud,
  however if you need to move data across SolrClouds you will have to
  resort to older APIs, which I didn't find good documentation of but
  many references to.  So I wrote up the instructions to do so here:
  https://gist.github.com/ralph-tice/887414a7f8082a0cb828
 
  I haven't had much time to think about how to translate this to more
  generic documentation for inclusion in the community wiki but I would
  love to hear some feedback if anyone else has a similar use case for
  moving Solr indexes across SolrClouds.
 
 
 
  On Fri, Nov 7, 2014 at 10:18 AM, Michael Della Bitta
  michael.della.bi...@appinions.com wrote:
  1. The new replica will not begin serving data until it's all there and
  caught up. You can watch the replica status on the Cloud screen to see
 it
  catch up; when it's green, you're done. If you're trying to automate
 this,
  you're going to look for the replica that says recovering in
  clusterstate.json and wait until it's active.
 
  2. I believe this to be the case, but I'll wait for someone else to
 chime in
  who knows better. Also, I wonder if there's a difference between
  DELETEREPLICA and unloading the core directly.
 
  Michael
 
 
 
  On 11/7/14 10:24, Ian Rose wrote:
 
  Howdy -
 
  What is the current best practice for migrating shards to another
 machine?
  I have heard suggestions that it is add replica on new machine, wait
 for
  it to catch up, delete original replica on old machine.  But I wanted
 to
  check to make sure...
 
  And if that is the best method, two follow-up questions:
 
  1. Is there a best practice for knowing when the new replica has
 caught
  up or do you just do a *:* query on both, compare counts, and call
 it a
  day when they are the same (or nearly so, since the slave replica might
  lag
  a little bit)?
 
  2. When deleting the original (old) replica, since that one could be
 the
  leader, is the replica deletion done in a safe manner such that no
  documents will be lost (e.g. ones that were recently received by the
  leader
  and not yet synced over to the slave replica before the leader is
  deleted)?
 
  Thanks as always,
  Ian
 
 



Re: any difference between using collection vs. shard in URL?

2014-11-05 Thread Ian Rose
Awesome, thanks.  That's what I was hoping.

Cheers,
Ian


On Wed, Nov 5, 2014 at 10:33 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 There's no difference between the two. Even if you send updates to a shard
 url, it will still be forwarded to the right shard leader according to the
 hash of the id (assuming you're using the default compositeId router). Of
 course, if you happen to hit the right shard leader then it is just an
 internal forward and not an extra network hop.

 The advantage with using the collection name is that you can hit any
 SolrCloud node (even the ones not hosting this collection) and it will
 still work. So for a non Java client, a load balancer can be setup in front
 of the entire cluster and things will just work.

 On Wed, Nov 5, 2014 at 8:50 PM, Ian Rose ianr...@fullstory.com wrote:

  If I add some documents to a SolrCloud shard in a collection alpha, I
 can
  post them to /solr/alpha/update.  However I notice that you can also
 post
  them using the shard name, e.g. /solr/alpha_shard4_replica1/update - in
  fact this is what Solr seems to do internally (like if you send documents
  to the wrong node so Solr needs to forward them over to the leader of the
  correct shard).
 
  Assuming you *do* always post your documents to the correct shard, is
 there
  any difference between these two, performance or otherwise?
 
  Thanks!
  - Ian
 



 --
 Regards,
 Shalin Shekhar Mangar.



Re: Ideas for debugging poor SolrCloud scalability

2014-11-01 Thread Ian Rose
Erick,

Just to make sure I am thinking about this right: batching will certainly
make a big difference in performance, but it should be more or less a
constant factor no matter how many Solr nodes you are using, right?  Right
now in my load tests, I'm not actually that concerned about the absolute
performance numbers; instead I'm just trying to figure out why relative
performance (no matter how bad it is since I am not batching) does not go
up with more Solr nodes.  Once I get that part figured out and we are
seeing more writes per sec when we add nodes, then I'll turn on batching in
the client to see what kind of additional performance gain that gets us.

Cheers,
Ian


On Fri, Oct 31, 2014 at 3:43 PM, Peter Keegan peterlkee...@gmail.com
wrote:

 Yes, I was inadvertently sending them to a replica. When I sent them to the
 leader, the leader reported (1000 adds) and the replica reported only 1 add
 per document. So, it looks like the leader forwards the batched jobs
 individually to the replicas.

 On Fri, Oct 31, 2014 at 3:26 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Internally, the docs are batched up into smaller buckets (10 as I
  remember) and forwarded to the correct shard leader. I suspect that's
  what you're seeing.
 
  Erick
 
  On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan peterlkee...@gmail.com
  wrote:
   Regarding batch indexing:
   When I send batches of 1000 docs to a standalone Solr server, the log
  file
   reports (1000 adds) in LogUpdateProcessor. But when I send them to
 the
   leader of a replicated index, the leader log file reports much smaller
   numbers, usually (12 adds). Why do the batches appear to be broken
 up?
  
   Peter
  
   On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson 
  erickerick...@gmail.com
   wrote:
  
   NP, just making sure.
  
   I suspect you'll get lots more bang for the buck, and
   results much more closely matching your expectations if
  
   1 you batch up a bunch of docs at once rather than
   sending them one at a time. That's probably the easiest
   thing to try. Sending docs one at a time is something of
   an anti-pattern. I usually start with batches of 1,000.
  
   And just to check.. You're not issuing any commits from the
   client, right? Performance will be terrible if you issue commits
   after every doc, that's totally an anti-pattern. Doubly so for
   optimizes Since you showed us your solrconfig  autocommit
   settings I'm assuming not but want to be sure.
  
   2 use a leader-aware client. I'm totally unfamiliar with Go,
   so I have no suggestions whatsoever to offer there But you'll
   want to batch in this case too.
  
   On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose ianr...@fullstory.com
  wrote:
Hi Erick -
   
Thanks for the detailed response and apologies for my confusing
terminology.  I should have said WPS (writes per second) instead
 of
  QPS
but I didn't want to introduce a weird new acronym since QPS is well
known.  Clearly a bad decision on my part.  To clarify: I am doing
*only* writes
(document adds).  Whenever I wrote QPS I was referring to writes.
   
It seems clear at this point that I should wrap up the code to do
  smart
routing rather than choose Solr nodes randomly.  And then see if
 that
changes things.  I must admit that although I understand that random
  node
selection will impose a performance hit, theoretically it seems to
 me
   that
the system should still scale up as you add more nodes (albeit at
  lower
absolute level of performance than if you used a smart router).
Nonetheless, I'm just theorycrafting here so the better thing to do
 is
   just
try it experimentally.  I hope to have that working today - will
  report
back on my findings.
   
Cheers,
- Ian
   
p.s. To clarify why we are rolling our own smart router code, we use
  Go
over here rather than Java.  Although if we still get bad
 performance
   with
our custom Go router I may try a pure Java load client using
CloudSolrServer to eliminate the possibility of bugs in our
   implementation.
   
   
On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson 
  erickerick...@gmail.com
   
wrote:
   
I'm really confused:
   
bq: I am not issuing any queries, only writes (document inserts)
   
bq: It's clear that once the load test client has ~40 simulated
 users
   
bq: A cluster of 3 shards over 3 Solr nodes *should* support
a higher QPS than 2 shards over 2 Solr nodes, right
   
QPS is usually used to mean Queries Per Second, which is
 different
   from
the statement that I am not issuing any queries. And what do
  the
number of users have to do with inserting documents?
   
You also state:  In many cases, CPU on the solr servers is quite
  low as
well
   
So let's talk about indexing first. Indexing should scale nearly
linearly as long as
1 you are routing your docs to the correct leader, which

Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Ian Rose
Hi Erick -

Thanks for the detailed response and apologies for my confusing
terminology.  I should have said WPS (writes per second) instead of QPS
but I didn't want to introduce a weird new acronym since QPS is well
known.  Clearly a bad decision on my part.  To clarify: I am doing
*only* writes
(document adds).  Whenever I wrote QPS I was referring to writes.

It seems clear at this point that I should wrap up the code to do smart
routing rather than choose Solr nodes randomly.  And then see if that
changes things.  I must admit that although I understand that random node
selection will impose a performance hit, theoretically it seems to me that
the system should still scale up as you add more nodes (albeit at lower
absolute level of performance than if you used a smart router).
Nonetheless, I'm just theorycrafting here so the better thing to do is just
try it experimentally.  I hope to have that working today - will report
back on my findings.

Cheers,
- Ian

p.s. To clarify why we are rolling our own smart router code, we use Go
over here rather than Java.  Although if we still get bad performance with
our custom Go router I may try a pure Java load client using
CloudSolrServer to eliminate the possibility of bugs in our implementation.


On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson erickerick...@gmail.com
wrote:

 I'm really confused:

 bq: I am not issuing any queries, only writes (document inserts)

 bq: It's clear that once the load test client has ~40 simulated users

 bq: A cluster of 3 shards over 3 Solr nodes *should* support
 a higher QPS than 2 shards over 2 Solr nodes, right

 QPS is usually used to mean Queries Per Second, which is different from
 the statement that I am not issuing any queries. And what do the
 number of users have to do with inserting documents?

 You also state:  In many cases, CPU on the solr servers is quite low as
 well

 So let's talk about indexing first. Indexing should scale nearly
 linearly as long as
 1 you are routing your docs to the correct leader, which happens with
 SolrJ
 and the CloudSolrSever automatically. Rather than rolling your own, I
 strongly
 suggest you try this out.
 2 you have enough clients feeding the cluster to push CPU utilization
 on them all.
 Very often slow indexing, or in your case lack of scaling is a
 result of document
  acquisition or, in your case, your doc generator is spending all its
 time waiting for
 the individual documents to get to Solr and come back.

 bq: chooses a random solr server for each ADD request (with 1 doc per add
 request)

 Probably your culprit right there. Each and every document requires that
 you
 have to cross the network (and forward that doc to the correct leader). So
 given
 that you're not seeing high CPU utilization, I suspect that you're not
 sending
 enough docs to SolrCloud fast enough to see scaling. You need to batch up
 multiple docs, I generally send 1,000 docs at a time.
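
 A minimal sketch of that batched, CloudSolrServer-based indexing (assuming
 Solr 4.x SolrJ; the ZooKeeper address, collection name and field names are
 placeholders):

     import java.util.ArrayList;
     import java.util.List;
     import org.apache.solr.client.solrj.impl.CloudSolrServer;
     import org.apache.solr.common.SolrInputDocument;

     public class BatchIndexer {
         public static void main(String[] args) throws Exception {
             CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
             solr.setDefaultCollection("alpha");

             List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
             for (int i = 0; i < 100000; i++) {
                 SolrInputDocument doc = new SolrInputDocument();
                 doc.addField("id", "doc-" + i);
                 doc.addField("body_t", "tiny fake document " + i);
                 batch.add(doc);
                 if (batch.size() == 1000) {   // ~1,000 docs per update request
                     solr.add(batch);          // CloudSolrServer routes each doc to its shard leader
                     batch.clear();
                 }
             }
             if (!batch.isEmpty()) {
                 solr.add(batch);
             }
             solr.commit();                    // one explicit commit at the end
             solr.shutdown();
         }
     }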

 But even if you do solve this, the inter-node routing will prevent
 linear scaling.
 When a doc (or a batch of docs) goes to a random Solr node, here's what
 happens:
 1 the docs are re-packaged into groups based on which shard they're
 destined for
 2 the sub-packets are forwarded to the leader for each shard
 3 the responses are gathered back and returned to the client.

 This set of operations will eventually degrade the scaling.

 bq:  A cluster of 3 shards over 3 Solr nodes *should* support
 a higher QPS than 2 shards over 2 Solr nodes, right?  That's the whole idea
 behind sharding.

 If we're talking search requests, the answer is no. Sharding is
 what you do when your collection no longer fits on a single node.
 If it _does_ fit on a single node, then you'll usually get better query
 performance by adding a bunch of replicas to a single shard. When
 the number of  docs on each shard grows large enough that you
 no longer get good query performance, _then_ you shard. And
 take the query hit.

 If we're talking about inserts, then see above. I suspect your problem is
 that you're _not_ saturating the SolrCloud cluster, you're sending
 docs to Solr very inefficiently and waiting on I/O. Batching docs and
 sending them to the right leader should scale pretty linearly until you
 start saturating your network.

 Best,
 Erick

 On Thu, Oct 30, 2014 at 6:56 PM, Ian Rose ianr...@fullstory.com wrote:
  Thanks for the suggestions so for, all.
 
  1) We are not using SolrJ on the client (not using Java at all) but I am
  working on writing a smart router so that we can always send to the
  correct node.  I am certainly curious to see how that changes things.
  Nonetheless even with the overhead of extra routing hops, the observed
  behavior (no increase in performance with more nodes) doesn't make any
  sense to me.
 
  2) Commits: we are using autoCommit with openSearcher=false
 (maxTime=6)
  and autoSoftCommit (maxTime=15000).
 
  3) Suggestions to batch documents certainly make sense for production
 code

Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Ian Rose
Howdy all -

The short version is: We are not seeing Solr Cloud performance scale (even
close to) linearly as we add nodes. Can anyone suggest good diagnostics for
finding scaling bottlenecks? Are there known 'gotchas' that make Solr Cloud
fail to scale?

In detail:

We have used Solr (in non-Cloud mode) for over a year and are now beginning
a transition to SolrCloud.  To this end I have been running some basic load
tests to figure out what kind of capacity we should expect to provision.
In short, I am seeing very poor scalability (increase in effective QPS) as
I add Solr nodes.  I'm hoping to get some ideas on where I should be
looking to debug this.  Apologies in advance for the length of this email;
I'm trying to be comprehensive and provide all relevant information.

Our setup:

1 load generating client
 - generates tiny, fake documents with unique IDs
 - performs only writes (no queries at all)
 - chooses a random solr server for each ADD request (with 1 doc per add
request)

N collections spread over K solr servers
 - every collection is sharded K times (so every solr instance has 1 shard
from every collection)
 - no replicas
 - external zookeeper server (not using zkRun)
 - autoCommit maxTime=6
 - autoSoftCommit maxTime =15000

Everything is running within a single zone on Google Compute Engine, so
high quality gigabit network links between all machines (ping times < 1 ms).

My methodology is as follows.
1. Start up a K solr servers.
2. Remove all existing collections.
3. Create N collections, with numShards=K for each.
4. Start load testing.  Every minute, print the number of successful
updates and the number of failed updates.
5. Keep increasing the offered load (via simulated users) until the qps
flatlines.

In brief (more detailed results at the bottom of email), I find that for
any number of nodes between 2 and 5, the QPS always caps out at ~3000.
Obviously something must be wrong here, as there should be a trend of the
QPS scaling (roughly) linearly with the number of nodes.  Or at the very
least going up at all!

So my question is what else should I be looking at here?

* CPU on the loadtest client is well under 100%
* No other obvious bottlenecks on loadtest client (running 2 clients leads
to ~1/2 qps on each)
* In many cases, CPU on the solr servers is quite low as well (e.g. with
100 users hitting 5 solr nodes, all nodes are 50% idle)
* Network bandwidth is a few MB/s, well under the gigabit capacity of our
network
* Disk bandwidth (< 2 MB/s) and iops (< 20/s) are low.

Any ideas?  Thanks very much!
- Ian


p.s. Here is my raw data broken out by number of nodes and number of
simulated users:


Num Nodes   Num Users   QPS
1           1           1020
1           5           3180
1           10          3825
1           15          3900
1           20          4050
1           40          4100
2           1           472
2           5           1790
2           10          2290
2           15          2850
2           20          2900
2           40          3210
2           60          3200
2           80          3210
2           100         3180
3           1           385
3           5           1580
3           10          2090
3           15          2560
3           20          2760
3           25          2890
3           80          3050
4           1           375
4           5           1560
4           10          2200
4           15          2500
4           20          2700
4           25          2800
4           30          2850
5           15          2450
5           20          2640
5           25          2790
5           30          2840
5           100         2900
5           200         2810


Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Ian Rose

 If you want to increase QPS, you should not be increasing numShards.
 You need to increase replicationFactor.  When your numShards matches the
 number of servers, every single server will be doing part of the work
 for every query.



I think this is true only for actual queries, right?  I am not issuing any
queries, only writes (document inserts).  In the case of writes, increasing
the number of shards should increase my throughput (in ops/sec) more or
less linearly, right?


On Thu, Oct 30, 2014 at 4:50 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 10/30/2014 2:23 PM, Ian Rose wrote:
  My methodology is as follows.
  1. Start up a K solr servers.
  2. Remove all existing collections.
  3. Create N collections, with numShards=K for each.
  4. Start load testing.  Every minute, print the number of successful
  updates and the number of failed updates.
  5. Keep increasing the offered load (via simulated users) until the qps
  flatlines.

 If you want to increase QPS, you should not be increasing numShards.
 You need to increase replicationFactor.  When your numShards matches the
 number of servers, every single server will be doing part of the work
 for every query.  If you increase replicationFactor instead, then each
 server can be doing a different query in parallel.

 Sharding the index is what you need to do when you need to scale the
 size of the index, so each server does not get overwhelmed by dealing
 with every document for every query.

 Getting a high QPS with a big index requires increasing both numShards
 *AND* replicationFactor.

 Thanks,
 Shawn




Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Ian Rose
Thanks for the suggestions so for, all.

1) We are not using SolrJ on the client (not using Java at all) but I am
working on writing a smart router so that we can always send to the
correct node.  I am certainly curious to see how that changes things.
Nonetheless even with the overhead of extra routing hops, the observed
behavior (no increase in performance with more nodes) doesn't make any
sense to me.

2) Commits: we are using autoCommit with openSearcher=false (maxTime=6)
and autoSoftCommit (maxTime=15000).

3) Suggestions to batch documents certainly make sense for production code
but in this case I am not real concerned with absolute performance; I just
want to see the *relative* performance as we use more Solr nodes.  So I
don't think batching or not really matters.

4) No, that won't affect indexing speed all that much.  The way to
increase indexing speed is to increase the number of processes or threads
that are indexing at the same time.  Instead of having one client
sending update
requests, try five of them.

Can you elaborate on this some?  I'm worried I might be misunderstanding
something fundamental.  A cluster of 3 shards over 3 Solr nodes
*should* support
a higher QPS than 2 shards over 2 Solr nodes, right?  That's the whole idea
behind sharding.  Regarding your comment of increase the number of
processes or threads, note that for each value of K (number of Solr nodes)
I measured with several different numbers of simulated users so that I
could find a saturation point.  For example, take a look at my data for
K=2:

Num Nodes   Num Users   QPS
2           1           472
2           5           1790
2           10          2290
2           15          2850
2           20          2900
2           40          3210
2           60          3200
2           80          3210
2           100         3180

It's clear that once the load test client has ~40 simulated users, the Solr
cluster is saturated.  Creating more users just increases the average
request latency, such that the total QPS remained (nearly) constant.  So I
feel pretty confident that a cluster of size 2 *maxes out* at ~3200 qps.
The problem is that I am finding roughly this same max point, no matter
how many simulated users the load test client created, for any value of K
(> 1).

Cheers,
- Ian


On Thu, Oct 30, 2014 at 8:01 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Your indexing client, if written in SolrJ, should use CloudSolrServer
 which is, in Matt's terms leader aware. It divides up the
 documents to be indexed into packets that where each doc in
 the packet belongs on the same shard, and then sends the packet
 to the shard leader. This avoids a lot of re-routing and should
 scale essentially linearly. You may have to add more clients
  though, depending upon how hard the document-generator is
 working.

 Also, make sure that you send batches of documents as Shawn
 suggests, I use 1,000 as a starting point.

 Best,
 Erick

 On Thu, Oct 30, 2014 at 2:10 PM, Shawn Heisey apa...@elyograg.org wrote:
  On 10/30/2014 2:56 PM, Ian Rose wrote:
  I think this is true only for actual queries, right? I am not issuing
  any queries, only writes (document inserts). In the case of writes,
  increasing the number of shards should increase my throughput (in
  ops/sec) more or less linearly, right?
 
  No, that won't affect indexing speed all that much.  The way to increase
  indexing speed is to increase the number of processes or threads that
  are indexing at the same time.  Instead of having one client sending
  update requests, try five of them.  Also, index many documents with each
  update request.  Sending one document at a time is very inefficient.
 
  You didn't say how you're doing commits, but those need to be as
  infrequent as you can manage.  Ideally, you would use autoCommit with
  openSearcher=false on an interval of about five minutes, and send an
  explicit commit (with the default openSearcher=true) after all the
  indexing is done.
 
  You may have requirements regarding document visibility that this won't
  satisfy, but try to avoid doing commits with openSearcher=true (soft
  commits qualify for this) extremely frequently, like once a second.
  Once a minute is much more realistic.  Opening a new searcher is an
  expensive operation, especially if you have cache warming configured.
 
  Thanks,
  Shawn
 



Re: Slow inserts when using Solr Cloud

2014-08-04 Thread ian
Very interested in what you find out with your benchmarking, and whether it
bears out what I've experienced.

Does anyone know when 4.10 is likely to be released?



I'm benchmarking this right now so I'll share some numbers soon.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Slow-inserts-when-using-Solr-Cloud-tp4146087p4150963.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Slow inserts when using Solr Cloud

2014-07-25 Thread ian
I've built and installed the latest snapshot of Solr 4.10 using the same
SolrCloud configuration and that gave me a tenfold increase in throughput,
so it certainly looks like SOLR-6136 was the issue that was causing my slow
insert rate/high latency with shard routing and replicas.  Thanks for your
help.


Timothy Potter wrote
 Hi Ian,
 
 What's the CPU doing on the leader? Have you tried attaching a
 profiler to the leader while running and then seeing if there are any
 hotspots showing. Not sure if this is related but we recently fixed an
 issue in the area of leader forwarding to replica that used too many
 CPU cycles inefficiently - see SOLR-6136.
 
 Tim





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Slow-inserts-when-using-Solr-Cloud-tp4146087p4149219.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Slow inserts when using Solr Cloud

2014-07-17 Thread ian
Hi Tim

Thanks for the info about the bug.  I've just looked at the CPU usage for
the leader using JConsole, while my bulk load process was running, inserting
documents into my Solr cloud.  Is that what you meant by profiling and
looking for hotspots?   I find the CPU usage goes up quite a lot when the
replica is enabled, compared to when it is disabled:  

http://lucene.472066.n3.nabble.com/file/n4147645/solr-cpu-usage.jpg 

In the above chart, the dip in CPU usage in the middle was while the replica
(which lives on a different VM) was disabled.

Thanks
Ian


Timothy Potter wrote
 Hi Ian,
 
 What's the CPU doing on the leader? Have you tried attaching a
 profiler to the leader while running and then seeing if there are any
 hotspots showing. Not sure if this is related but we recently fixed an
 issue in the area of leader forwarding to replica that used too many
 CPU cycles inefficiently - see SOLR-6136.
 
 Tim





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Slow-inserts-when-using-Solr-Cloud-tp4146087p4147645.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Slow inserts when using Solr Cloud

2014-07-16 Thread ian
That's useful to know, thanks very much.   I'll look into using
CloudSolrServer, although I'm using solrnet at present.

That would reduce some of the overhead - but not the extra 200ms I'm getting
for forwarding to the replica when the replica is switched on.  

It does seem a very high overhead.  When I consider that it takes 20ms to
insert a new document to Solr with replicas disabled (if I route to the
correct shard), you might expect it to take two to three times longer if it
has to forward to one replica and then wait for a response, but an increase
of 200ms seems really high doesn't it?  Is there a forum where I should
raise that? 

Thanks again for your help
Ian


Shalin Shekhar Mangar wrote
 You can use CloudSolrServer (if you're using Java) which will route
 documents correctly to the leader of the appropriate shard.
 
 
  On Tue, Jul 15, 2014 at 3:04 PM, ian <Ian.Williams@.nhs> wrote:
 
 Hi Mark

 Thanks for replying to my post.  Would you know whether my findings are
 consistent with what other people see when using SolrCloud?

 One thing I want to investigate is whether I can route my updates to the
 correct shard in the first place, by having my client using the same
 hashing
 logic as Solr, and working out in advance which shard my inserts should
 be
 sent to.  Do you know whether that's an approach that others have used?

 Thanks again
 Ian



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Slow-inserts-when-using-Solr-Cloud-tp4146087p4147183.html
 Sent from the Solr - User mailing list archive at Nabble.com.

 
 
 
 -- 
 Regards,
 Shalin Shekhar Mangar.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Slow-inserts-when-using-Solr-Cloud-tp4146087p4147481.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Slow inserts when using Solr Cloud

2014-07-15 Thread ian
Hi Mark

Thanks for replying to my post.  Would you know whether my findings are
consistent with what other people see when using SolrCloud?

One thing I want to investigate is whether I can route my updates to the
correct shard in the first place, by having my client using the same hashing
logic as Solr, and working out in advance which shard my inserts should be
sent to.  Do you know whether that's an approach that others have used?
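
For what it's worth, a rough sketch of computing the default compositeId
routing hash client-side with the MurmurHash3 utility that ships in SolrJ 4.x.
This approximates CompositeIdRouter's behaviour for simple "prefix!id" keys
only, it is not the exact Solr code, and the shard hash ranges still have to be
read from clusterstate.json; the document id below is made up:

    import org.apache.solr.common.util.Hash;

    public class RouteSketch {
        // id of the form "prefix!rest" -> 32-bit hash used to pick a shard range
        static int compositeHash(String id) {
            int sep = id.indexOf('!');
            if (sep < 0) {
                // plain ids are hashed whole
                return Hash.murmurhash3_x86_32(id, 0, id.length(), 0);
            }
            String prefix = id.substring(0, sep);
            String rest = id.substring(sep + 1);
            int upper = Hash.murmurhash3_x86_32(prefix, 0, prefix.length(), 0);
            int lower = Hash.murmurhash3_x86_32(rest, 0, rest.length(), 0);
            // high 16 bits come from the prefix, low 16 bits from the rest of the id
            return (upper & 0xFFFF0000) | (lower & 0x0000FFFF);
        }

        public static void main(String[] args) {
            // hypothetical document id
            System.out.println(Integer.toHexString(compositeHash("tenant1!doc42")));
        }
    }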

Thanks again
Ian



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Slow-inserts-when-using-Solr-Cloud-tp4146087p4147183.html
Sent from the Solr - User mailing list archive at Nabble.com.


Slow inserts when using Solr Cloud

2014-07-08 Thread Ian Williams (NWIS - Applications Design)
Hi

I'm encountering a surprisingly high increase in response times when I insert 
new documents into a SolrCloud, compared with a standalone Solr instance.

I have a SolrCloud set up for test and evaluation purposes.  I have four 
shards, each with a leader and a replica, distributed over four Windows virtual 
servers.  I have zookeeper running on three of the four servers. There are not 
many documents in my SolrCloud (just a few hundred).   I am using composite id 
routing, specifying a prefix to my document ids which is then used by Solr to 
determine which shard the document should be stored on.

I determine in advance which shard a document with a given id prefix will end 
up in, by trying it out in advance.  I then try the following scenarios, using 
inserts without commits.  E.g. I use:
curl http://servername:port/solr/update -H "Content-Type: text/xml" --data-binary @test.txt

1. Insert a document, sending it to the server hosting the correct shard, with 
replicas turned off (response time 20ms)
I find that if I 'switch off' the replicas for my shard (by shutting down Solr 
for the replicas), and then I send the new document to the server hosting the 
leader for the correct shard, then I get a very fast response, i.e. under 10ms, 
which is similar to the performance I get when not using SolrCloud.  This is 
expected, as I've removed any overhead to do with replicas or routing to the 
correct shard.

2. Insert a document, sending it to the server hosting the correct shard, but 
with replicas turned on (response time approx 250ms)
If I switch on the replica for that shard, then my average response time for an 
insert increases from 10ms  to around 250ms.  Now I expect an overhead, 
because the leader has to find out where the replica is (from Zookeeper?) and 
then forward the request to that replica, then wait for a reply - but an 
increase from 20ms to 250ms seems very high?

3. Insert a document, sending it to a server hosting the incorrect shard, with 
replicas turned on (response time approx 500ms)
If I do the same thing again but this time send to the server hosting a 
different shard to the shard my document will end up in, the average response 
times increase again to around 500ms.  Again, I'd expect an increase because of 
the extra step of needing to forward to the correct shard, but the increase 
seems very high?


Should I expect this much of an overhead for shard routing and replicas, or 
might this indicate a problem in my configuration?

Many thanks
Ian



Re: Faceting on a date field multiple times

2012-05-04 Thread Ian Holsman
Thanks Marc.
On May 4, 2012, at 8:52 PM, Marc Sturlese wrote:

 http://lucene.472066.n3.nabble.com/Multiple-Facet-Dates-td495480.html
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Faceting-on-a-date-field-multiple-times-tp3961282p3961865.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Faceting on a date field multiple times

2012-05-03 Thread Ian Holsman
Hi.

I would like to be able to do a facet on a date field, but with different 
ranges (in a single query).

for example. I would like to show

#documents by day for the last week - 
#documents by week for the last couple of months
#documents by year for the last several years.

is there a way to do this without hitting solr 3 times?


thanks
Ian

Intermittent connection timeouts to Solr server using SolrNet

2012-01-05 Thread Ian Grainger
Hi - I have also posted this question on SO:
http://stackoverflow.com/questions/8741080/intermittent-connection-timeouts-to-solr-server-using-solrnet


I have a production webserver hosting a search webpage, which uses SolrNet
to connect to another machine which hosts the Solr search server (on a
subnet which is in the same room, so no network problems). All is fine 90%
of the time, but I consistently get a small number of "The operation has
timed out" errors.

I've increased the timeout in the SolrNet init to *30* seconds (!)

SolrNet.Startup.Init<SolrDataObject>(
    new SolrNet.Impl.SolrConnection(
        System.Configuration.ConfigurationManager.AppSettings["URL"]
    ) { Timeout = 30000 }
);

...but all that happened is that I started getting this message instead of the
"Unable to connect to the remote server" error I was seeing before. It seems to
have made no difference to the number of timeout errors.

I can see *nothing* in *any* log (believe me I've looked!) and clearly my
configuration is correct because it works most of the time. Anyone any
ideas how I can find more information on this problem?

Thanks!


-- 
Ian

i...@isfluent.com a...@endissolutions.com
+44 (0)1223 257903


Re: Unable to index documents using DataImportHandler with MSSQL

2011-11-28 Thread Ian Grainger
Right.
This is REALLY weird - I've now started from scratch on another
machine (this time Windows 7), and got _exactly_ the same problem !?


On Mon, Nov 28, 2011 at 7:37 AM, Husain, Yavar yhus...@firstam.com wrote:
 Hi Ian

 I am having exactly the same problem what you are having on Win 7 and 2008 
 Server http://lucene.472066.n3.nabble.com/DIH-Strange-Problem-tc3530370.html

 I still have not received any replies which could solve my problem till now. 
 Please do let me know if you have arrived at some solution for your problem.

 Thanks.

 Regards,
 Yavar

 -Original Message-
 From: Ian Grainger [mailto:i...@isfluent.com]
 Sent: Friday, November 25, 2011 10:59 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Unable to index documents using DataImportHandler with MSSQL

 Update on this: I've established:
 * It's not a problem in the DB (I can index from this DB into a Solr
 instance on another server)
 * It's not Tomcat (I get the same problem in Jetty)
 * It's not the schema (I have simplified it to one field)

 That leaves SolrConfig.xml and data-config.

 Only thing changed in SolrConfig.xml is adding:

 <lib dir="D:/Software/Solr/example/solr/dist/" regex="apache-solr-cell-\d.*\.jar" />
 <lib dir="D:/Software/Solr/example/solr/dist/" regex="apache-solr-clustering-\d.*\.jar" />
 <lib dir="D:/Software/Solr/example/solr/dist/" regex="apache-solr-dataimporthandler-\d.*\.jar" />
 <requestHandler name="/dataimport"
     class="org.apache.solr.handler.dataimport.DataImportHandler">
   <lst name="defaults">
     <str name="config">D:/Software/Solr/example/solr/conf/data-config.xml</str>
   </lst>
 </requestHandler>

 And data-config.xml is pretty much as attached - except simpler.

 Any help or any advice on how to diagnose would be appreciated!


 On Fri, Nov 25, 2011 at 12:29 PM, Ian Grainger i...@isfluent.com wrote:
 Hi I have copied my Solr config from a working Windows server to a new
 one, and it can't seem to run an import.

 They're both using win server 2008 and SQL 2008R2. This is the data
 importer config

    <dataConfig>
      <dataSource type="JdbcDataSource" name="ds1"
            driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
            url="jdbc:sqlserver://localhost;databaseName=DB"
            user="Solr"
            password="pwd"/>
      <document name="datas">
        <entity name="data" dataSource="ds1" pk="key"
            query="EXEC SOLR_COMPANY_SEARCH_DATA"
            deltaImportQuery="SELECT * FROM Company_Search_Data WHERE [key]='${dataimporter.delta.key}'"
            deltaQuery="SELECT [key] FROM Company_Search_Data WHERE modify_dt > '${dataimporter.last_index_time}'">
          <field column="WorkDesc_Comments" name="WorkDesc_Comments_Split" />
          <field column="WorkDesc_Comments" name="WorkDesc_Comments_Edge" />
        </entity>
      </document>
    </dataConfig>

 I can use MS SQL Profiler to watch the Solr user log in successfully,
 but then nothing. It doesn't seem to even try and execute the stored
 procedure. Any ideas why this would be working one server and not on
 another?

 FTR the only thing in the tomcat catalina log is:

    org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Creating a connection for entity data with URL:
 jdbc:sqlserver://localhost;databaseName=CATLive

 --
 Ian

 i...@isfluent.com
 +44 (0)1223 257903




 --
 Ian

 i...@isfluent.com
 +44 (0)1223 257903





-- 
Ian

i...@isfluent.com
+44 (0)1223 257903


Re: Unable to index documents using DataImportHandler with MSSQL

2011-11-28 Thread Ian Grainger
Hah, I've just come on here to suggest you do the same thing! Thanks
for getting back to me - and interesting we both came up with the same
solution!

Now I have the problem that running a delta update updates the
'dataimport.properties' file - but then just re-fetches all the data
regardless! Weird!


On Mon, Nov 28, 2011 at 11:59 AM, Husain, Yavar yhus...@firstam.com wrote:
 Hi Ian

 I downloaded and build latest Solr (3.4) from sources and finally hit 
 following line of code in Solr (where I put my debug statement) :

 if (url != null) {
     LOG.info("Yavar: getting handle to driver manager:");
     c = DriverManager.getConnection(url, initProps);
     LOG.info("Yavar: got handle to driver manager:");
 }

 The call to Driver Manager was not returning. Here was the error!! The Driver 
 we were using was Microsoft Type 4 JDBC driver for SQL Server. I downloaded 
 another driver called jTDS jDBC driver and installed that. Problem got 
 fixed!!!

 So please follow the following steps:

 1. Download jTDS jDBC driver from http://jtds.sourceforge.net/
 2. Put the driver jar file into your Solr/lib directory where you had put 
 Microsoft JDBC driver.
 3. In the data-config.xml use this statement: 
 driver=net.sourceforge.jtds.jdbc.Driver
 4. Also in data-config.xml mention url like this: 
 url=jdbc:jTDS:sqlserver://localhost:1433;databaseName=XXX
 5. Now run your indexing.

 It should solve the problem.

 Regards,
 Yavar

 -Original Message-
 From: Ian Grainger [mailto:i...@isfluent.com]
 Sent: Monday, November 28, 2011 4:11 PM
 To: Husain, Yavar
 Cc: solr-user@lucene.apache.org
 Subject: Re: Unable to index documents using DataImportHandler with MSSQL

 Right.
 This is REALLY weird - I've now started from scratch on another
 machine (this time Windows 7), and got _exactly_ the same problem !?


 On Mon, Nov 28, 2011 at 7:37 AM, Husain, Yavar yhus...@firstam.com wrote:
 Hi Ian

 I am having exactly the same problem what you are having on Win 7 and 2008 
 Server http://lucene.472066.n3.nabble.com/DIH-Strange-Problem-tc3530370.html

 I still have not received any replies which could solve my problem till now. 
 Please do let me know if you have arrived at some solution for your problem.

 Thanks.

 Regards,
 Yavar

 -Original Message-
 From: Ian Grainger [mailto:i...@isfluent.com]
 Sent: Friday, November 25, 2011 10:59 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Unable to index documents using DataImportHandler with MSSQL

 Update on this: I've established:
 * It's not a problem in the DB (I can index from this DB into a Solr
 instance on another server)
 * It's not Tomcat (I get the same problem in Jetty)
 * It's not the schema (I have simplified it to one field)

 That leaves SolrConfig.xml and data-config.

 Only thing changed in SolrConfig.xml is adding:

 <lib dir="D:/Software/Solr/example/solr/dist/" regex="apache-solr-cell-\d.*\.jar" />
 <lib dir="D:/Software/Solr/example/solr/dist/" regex="apache-solr-clustering-\d.*\.jar" />
 <lib dir="D:/Software/Solr/example/solr/dist/" regex="apache-solr-dataimporthandler-\d.*\.jar" />
 <requestHandler name="/dataimport"
     class="org.apache.solr.handler.dataimport.DataImportHandler">
   <lst name="defaults">
     <str name="config">D:/Software/Solr/example/solr/conf/data-config.xml</str>
   </lst>
 </requestHandler>

 And data-config.xml is pretty much as attached - except simpler.

 Any help or any advice on how to diagnose would be appreciated!


 On Fri, Nov 25, 2011 at 12:29 PM, Ian Grainger i...@isfluent.com wrote:
 Hi I have copied my Solr config from a working Windows server to a new
 one, and it can't seem to run an import.

 They're both using win server 2008 and SQL 2008R2. This is the data
 importer config

    <dataConfig>
      <dataSource type="JdbcDataSource" name="ds1"
            driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
            url="jdbc:sqlserver://localhost;databaseName=DB"
            user="Solr"
            password="pwd"/>
      <document name="datas">
        <entity name="data" dataSource="ds1" pk="key"
            query="EXEC SOLR_COMPANY_SEARCH_DATA"
            deltaImportQuery="SELECT * FROM Company_Search_Data WHERE [key]='${dataimporter.delta.key}'"
            deltaQuery="SELECT [key] FROM Company_Search_Data WHERE modify_dt > '${dataimporter.last_index_time}'">
          <field column="WorkDesc_Comments" name="WorkDesc_Comments_Split" />
          <field column="WorkDesc_Comments" name="WorkDesc_Comments_Edge" />
        </entity>
      </document>
    </dataConfig>

 I can use MS SQL Profiler to watch the Solr user log in successfully,
 but then nothing. It doesn't seem to even try and execute the stored
 procedure. Any ideas why this would be working one server and not on
 another?

 FTR the only thing in the tomcat catalina log is:

    org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Creating a connection for entity data with URL:
 jdbc:sqlserver://localhost;databaseName=CATLive

 --
 Ian

 i...@isfluent.com
 +44 (0)1223 257903

Re: DIH Strange Problem

2011-11-28 Thread Ian Grainger
Aha! That sounds like it might be it!

On Mon, Nov 28, 2011 at 4:16 PM, Husain, Yavar yhus...@firstam.com wrote:

 Thanks Kai for sharing this. Ian encountered the same problem so marking him 
 in the mail too.
 
 From: Kai Gülzau [kguel...@novomind.com]
 Sent: Monday, November 28, 2011 6:55 PM
 To: solr-user@lucene.apache.org
 Subject: RE: DIH Strange Problem

 Do you use Java 6 update 29? There is a known issue with the latest mssql 
 driver:

 http://blogs.msdn.com/b/jdbcteam/archive/2011/11/07/supported-java-versions-november-2011.aspx

 In addition, there are known connection failure issues with Java 6 update 
 29, and the developer preview (non production) versions of Java 6 update 30 
 and Java 6 update 30 build 12.  We are in contact with Java on these issues 
 and we will update this blog once we have more information.

 Should work with update 28.

 Kai

 -Original Message-
 From: Husain, Yavar [mailto:yhus...@firstam.com]
 Sent: Monday, November 28, 2011 1:02 PM
 To: solr-user@lucene.apache.org; Shawn Heisey
 Subject: RE: DIH Strange Problem

 I figured out the solution and Microsoft and not Solr is the problem here :):

 I downloaded and build latest Solr (3.4) from sources and finally hit 
 following line of code in Solr (where I put my debug statement) :

 if (url != null) {
     LOG.info("Yavar: getting handle to driver manager:");
     c = DriverManager.getConnection(url, initProps);
     LOG.info("Yavar: got handle to driver manager:");
 }

 The call to Driver Manager was not returning. Here was the error!! The Driver 
 we were using was Microsoft Type 4 JDBC driver for SQL Server. I downloaded 
 another driver called jTDS jDBC driver and installed that. Problem got 
 fixed!!!

 So please follow the following steps:

 1. Download jTDS jDBC driver from http://jtds.sourceforge.net/ 2. Put the 
 driver jar file into your Solr/lib directory where you had put Microsoft JDBC 
 driver.
 3. In the data-config.xml use this statement: 
 driver=net.sourceforge.jtds.jdbc.Driver
 4. Also in data-config.xml mention url like this: 
 url=jdbc:jTDS:sqlserver://localhost:1433;databaseName=XXX
 5. Now run your indexing.

 It should solve the problem.

 -Original Message-
 From: Husain, Yavar
 Sent: Thursday, November 24, 2011 12:38 PM
 To: solr-user@lucene.apache.org; Shawn Heisey
 Subject: RE: DIH Strange Problem

 Hi

 Thanks for your replies.

 I carried out these 2 steps (it did not solve my problem):

 1. I tried setting responseBuffering to adaptive. Did not work.
 2. For checking Database connection I wrote a simple java program to connect 
 to database and fetch some results with the same driver that I use for solr. 
 It worked. So it does not seem to be a problem with the connection.
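
 A rough sketch of that kind of standalone connectivity check, assuming the
 jTDS driver jar is on the classpath (the URL, credentials and query below are
 placeholders):

     import java.sql.Connection;
     import java.sql.DriverManager;
     import java.sql.ResultSet;
     import java.sql.Statement;

     public class JdbcCheck {
         public static void main(String[] args) throws Exception {
             Class.forName("net.sourceforge.jtds.jdbc.Driver");   // load the jTDS driver
             Connection c = DriverManager.getConnection(
                     "jdbc:jtds:sqlserver://localhost:1433/SampleOrders",
                     "testUser", "password");
             Statement st = c.createStatement();
             ResultSet rs = st.executeQuery("SELECT 1");
             while (rs.next()) {
                 System.out.println("got: " + rs.getInt(1));      // prove a round trip works
             }
             c.close();
         }
     }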

 Now I am stuck where Tomcat log says: Creating a connection for entity 
 . and does nothing, I mean after this log we usually get the 
 getConnection() took x millisecond however I dont get that ,I can just see 
 the time moving with no records getting fetched.

 Original Problem listed again:


 I am using Solr 1.4.1 on Windows/MS SQL Server and am using DIH for importing 
 data. Indexing and all was working perfectly fine. However today when I 
 started full indexing again, Solr halts/gets stuck at the line "Creating a 
 connection for entity ...". There are no further messages after that. I 
 can see that DIH is busy, and on the DIH console I can see "A command is still 
 running"; I can also see total rows fetched = 0 and total requests made to 
 datasource = 1, and the time keeps increasing but it is not doing anything. This 
 is the exact configuration that worked for me. I am not really able to 
 understand the problem here. Also in the index directory where I am storing 
 the index there are just 3 files: 2 segment files + 1  lucene*-write.lock 
 file.
 ...
 data-config.xml:
 
 <dataSource type="JdbcDataSource"
     driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
     url="jdbc:sqlserver://127.0.0.1:1433;databaseName=SampleOrders"
     user="testUser" password="password"/>
 <document> .
 .

 Logs:

 INFO: Server startup in 2016 ms
 Nov 23, 2011 4:11:27 PM org.apache.solr.handler.dataimport.DataImporter 
 doFullImport
 INFO: Starting Full Import
 Nov 23, 2011 4:11:27 PM org.apache.solr.core.SolrCore execute
 INFO: [] webapp=/solr path=/dataimport params={command=full-import} status=0 
 QTime=11 Nov 23, 2011 4:11:27 PM 
 org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
 INFO: Read dataimport.properties
 Nov 23, 2011 4:11:27 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
 INFO: [] REMOVING ALL DOCUMENTS FROM INDEX Nov 23, 2011 4:11:27 PM 
 org.apache.solr.core.SolrDeletionPolicy onInit
 INFO: SolrDeletionPolicy.onInit: commits:num=1
               
 commit{dir=C:\solrindexes\index,segFN=segments_6,version=1322041133719,generation=6,filenames=[segments_6]
 Nov 23, 2011 4:11:27 PM

Unable to index documents using DataImportHandler with MSSQL

2011-11-25 Thread Ian Grainger
Hi I have copied my Solr config from a working Windows server to a new
one, and it can't seem to run an import.

They're both using win server 2008 and SQL 2008R2. This is the data
importer config

<dataConfig>
  <dataSource type="JdbcDataSource" name="ds1"
        driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
        url="jdbc:sqlserver://localhost;databaseName=DB"
        user="Solr"
        password="pwd"/>
  <document name="datas">
    <entity name="data" dataSource="ds1" pk="key"
        query="EXEC SOLR_COMPANY_SEARCH_DATA"
        deltaImportQuery="SELECT * FROM Company_Search_Data WHERE [key]='${dataimporter.delta.key}'"
        deltaQuery="SELECT [key] FROM Company_Search_Data WHERE modify_dt > '${dataimporter.last_index_time}'">
      <field column="WorkDesc_Comments" name="WorkDesc_Comments_Split" />
      <field column="WorkDesc_Comments" name="WorkDesc_Comments_Edge" />
    </entity>
  </document>
</dataConfig>

I can use MS SQL Profiler to watch the Solr user log in successfully,
but then nothing. It doesn't seem to even try and execute the stored
procedure. Any ideas why this would be working on one server and not on
another?

FTR the only thing in the tomcat catalina log is:

org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity data with URL:
jdbc:sqlserver://localhost;databaseName=CATLive

-- 
Ian

i...@isfluent.com
+44 (0)1223 257903


Re: Unable to index documents using DataImportHandler with MSSQL

2011-11-25 Thread Ian Grainger
Update on this: I've established:
* It's not a problem in the DB (I can index from this DB into a Solr
instance on another server)
* It's not Tomcat (I get the same problem in Jetty)
* It's not the schema (I have simplified it to one field)

That leaves SolrConfig.xml and data-config.

Only thing changed in SolrConfig.xml is adding:

<lib dir="D:/Software/Solr/example/solr/dist/" regex="apache-solr-cell-\d.*\.jar" />
<lib dir="D:/Software/Solr/example/solr/dist/" regex="apache-solr-clustering-\d.*\.jar" />
<lib dir="D:/Software/Solr/example/solr/dist/" regex="apache-solr-dataimporthandler-\d.*\.jar" />
<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">D:/Software/Solr/example/solr/conf/data-config.xml</str>
  </lst>
</requestHandler>

And data-config.xml is pretty much as attached - except simpler.

Any help or any advice on how to diagnose would be appreciated!


On Fri, Nov 25, 2011 at 12:29 PM, Ian Grainger i...@isfluent.com wrote:
 Hi I have copied my Solr config from a working Windows server to a new
 one, and it can't seem to run an import.

 They're both using win server 2008 and SQL 2008R2. This is the data
 importer config

    <dataConfig>
      <dataSource type="JdbcDataSource" name="ds1"
            driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
            url="jdbc:sqlserver://localhost;databaseName=DB"
            user="Solr"
            password="pwd"/>
      <document name="datas">
        <entity name="data" dataSource="ds1" pk="key"
            query="EXEC SOLR_COMPANY_SEARCH_DATA"
            deltaImportQuery="SELECT * FROM Company_Search_Data WHERE [key]='${dataimporter.delta.key}'"
            deltaQuery="SELECT [key] FROM Company_Search_Data WHERE modify_dt > '${dataimporter.last_index_time}'">
          <field column="WorkDesc_Comments" name="WorkDesc_Comments_Split" />
          <field column="WorkDesc_Comments" name="WorkDesc_Comments_Edge" />
        </entity>
      </document>
    </dataConfig>

 I can use MS SQL Profiler to watch the Solr user log in successfully,
 but then nothing. It doesn't seem to even try and execute the stored
 procedure. Any ideas why this would be working one server and not on
 another?

 FTR the only thing in the tomcat catalina log is:

    org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Creating a connection for entity data with URL:
 jdbc:sqlserver://localhost;databaseName=CATLive

 --
 Ian

 i...@isfluent.com
 +44 (0)1223 257903




-- 
Ian

i...@isfluent.com
+44 (0)1223 257903


Solr 3.4 group.truncate does not work with facet queries

2011-10-28 Thread Ian Grainger
Hi, I'm using Grouping with group.truncate=true. The following simple facet
query:

facet.query=Monitor_id:[38 TO 40]

Doesn't give the same number as the nGroups result (with
grouping.ngroups=true) for the equivalent filter query:

fq=Monitor_id:[38 TO 40]

I thought they should be the same - from the Wiki page: 'group.truncate: If
true, facet counts are based on the most relevant document of each group
matching the query.'

What am I doing wrong?

If I turn off group.truncate then the counts are the same, as I'd expect -
but unfortunately I'm only interested in the grouped results.

- I have also asked this question on StackOverflow, here:
http://stackoverflow.com/questions/7905756/solr-3-4-group-truncate-does-not-work-with-facet-queries

Thanks!

-- 
Ian

i...@isfluent.com a...@endissolutions.com
+44 (0)1223 257903


Re: Solr 3.4 group.truncate does not work with facet queries

2011-10-28 Thread Ian Grainger
Thanks, Marijn. I have logged the bug here:
https://issues.apache.org/jira/browse/SOLR-2863

Is there any chance of a workaround for this issue before the bug is fixed?

If you want to answer the question on StackOverflow:
http://stackoverflow.com/questions/7905756/solr-3-4-group-truncate-does-not-work-with-facet-queries
I'll
accept your answer.


On Fri, Oct 28, 2011 at 12:14 PM, Martijn v Groningen 
martijn.v.gronin...@gmail.com wrote:

 Hi Ian,

 I think this is a bug. After looking into the code the facet.query
 feature doesn't take into account the group.truncate option.
 This needs to be fixed. You can open a new issue in Jira if you want to.

 Martijn

 On 28 October 2011 12:09, Ian Grainger i...@isfluent.com wrote:
  Hi, I'm using Grouping with group.truncate=true, The following simple
 facet
  query:
 
  facet.query=Monitor_id:[38 TO 40]
 
  Doesn't give the same number as the nGroups result (with
  grouping.ngroups=true) for the equivalent filter query:
 
  fq=Monitor_id:[38 TO 40]
 
  I thought they should be the same - from the Wiki page: 'group.truncate:
 If
  true, facet counts are based on the most relevant document of each group
  matching the query.'
 
  What am I doing wrong?
 
  If I turn off group.truncate then the counts are the same, as I'd expect
 -
  but unfortunately I'm only interested in the grouped results.
 
  - I have also asked this question on StackOverflow, here:
 
 http://stackoverflow.com/questions/7905756/solr-3-4-group-truncate-does-not-work-with-facet-queries
 
  Thanks!
 
  --
  Ian
 
  i...@isfluent.com a...@endissolutions.com
  +44 (0)1223 257903
 



 --
 Met vriendelijke groet,

 Martijn van Groningen




-- 
Ian

i...@isfluent.com a...@endissolutions.com
+44 (0)1223 257903


Re: Index directories on slaves

2011-08-29 Thread Ian Connor
This turned out to be a missing SolrDeletionPolicy in the configuration.

Once the slaves had a SolrDeletionPolicy, they stopped growing out of
control.

Ian.

On Wed, Aug 17, 2011 at 8:46 AM, Ian Connor ian.con...@gmail.com wrote:

 Hi,

 We have noticed that many index.* directories are appearing on slaves (some
 more than others).

 e.g. ls shows

 index/index.20110101021510/ index.20110105030400/
 index.20110106040701/ index.20110130031416/
 index.20101222081713/ index.20110101034500/ index.20110105075100/
 index.20110107085605/ index.20110812153349/
 index.20101231011754/ index.20110105022600/ index.20110106024902/
 index.20110108014100/ index.20110814204200/

 Are this harmful, should I clean them out. I see a command for backup
 cleanup but am not sure the best way to clean these up (apart from removing
 all index* and getting a fresh replica).

 We have also seen on the latest 3.4 build that replicas are getting 1000s
 of files even though the masters have less than a 100 each. It seems as
 though they are not deleting after some replications and not sure if this is
 also related. We are trying to monitor this to see if we can find out how to
 reproduce it or at least the conditions that tend to reproduce it.

 --
 Regards,

 Ian Connor
 1 Leighton St #723
 Cambridge, MA 02141
 Call Center Phone: +1 (714) 239 3875 (24 hrs)
 Fax: +1(770) 818 5697
 Skype: ian.connor



Index directories on slaves

2011-08-17 Thread Ian Connor
Hi,

We have noticed that many index.* directories are appearing on slaves (some
more than others).

e.g. ls shows

index/index.20110101021510/ index.20110105030400/
index.20110106040701/ index.20110130031416/
index.20101222081713/ index.20110101034500/ index.20110105075100/
index.20110107085605/ index.20110812153349/
index.20101231011754/ index.20110105022600/ index.20110106024902/
index.20110108014100/ index.20110814204200/

Are these harmful, and should I clean them out? I see a command for backup
cleanup but am not sure of the best way to clean these up (apart from removing
all index* and getting a fresh replica).

We have also seen on the latest 3.4 build that replicas are getting 1000s of
files even though the masters have fewer than 100 each. It seems as though
they are not deleting files after some replications, and we are not sure if this
is also related. We are trying to monitor this to see if we can find out how to
reproduce it, or at least the conditions that tend to reproduce it.

-- 
Regards,

Ian Connor
1 Leighton St #723
Cambridge, MA 02141
Call Center Phone: +1 (714) 239 3875 (24 hrs)
Fax: +1(770) 818 5697
Skype: ian.connor


Re: solr-ruby: Error undefined method `closed?' for nil:NilClass

2011-08-17 Thread Ian Connor
That is a good suggestion. At the very least I can catch this error and
create a new connection when I see this - thanks.

On Sun, Aug 14, 2011 at 3:46 PM, Erik Hatcher erik.hatc...@gmail.comwrote:

 Does instantiating a Solr::Connection for each request make things better?

Erik

 On Aug 14, 2011, at 11:34 , Ian Connor wrote:

  It is nothing special - just like this:
 
 conn   = Solr::Connection.new(http://#{LOCAL_SHARD};,
  {:timeout = 1000, :autocommit = :on})
 options[:shards] = HA_SHARDS
 response = conn.query(query, options)
 
  Where LOCAL_SHARD points to a haproxy of a single shard and HA_SHARDS is
 an
  array of 18 shards (via haproxy).
 
  Ian.
 
  On Mon, Aug 8, 2011 at 12:50 PM, Erik Hatcher erik.hatc...@gmail.com
 wrote:
 
  Ian -
 
  What does your solr-ruby using code look like?
 
  Solr::Connection is light-weight, so you could just construct a new one
 of
  those for each request.  Are you keeping an instance around?
 
  Erik
 
 
  On Aug 8, 2011, at 12:03 , Ian Connor wrote:
 
  Hi,
 
  I have seen some of these errors come through from time to time. It
 looks
  like:
 
  /usr/lib/ruby/1.8/net/http.rb:1060:in
  `request'\n/usr/lib/ruby/1.8/net/http.rb:845:in `post'
 
 
 /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:158:in
  `post'
 
 
 /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:151:in
  `send'
 
 
 /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:174:in
  `create_and_send_query'
 
 
 /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:92:in
  `query'
 
  It is as if the http object has gone away. Would it be good to create a
  new
  one inside of the connection or is something more serious going on?
  ubuntu 10.04
  passenger 3.0.8
  rails 2.3.11
 
  --
  Regards,
 
  Ian Connor
 
 
 
 
  --
  Regards,
 
  Ian Connor
  1 Leighton St #723
  Cambridge, MA 02141
  Call Center Phone: +1 (714) 239 3875 (24 hrs)
  Fax: +1(770) 818 5697
  Skype: ian.connor




-- 
Regards,

Ian Connor
1 Leighton St #723
Cambridge, MA 02141
Call Center Phone: +1 (714) 239 3875 (24 hrs)
Fax: +1(770) 818 5697
Skype: ian.connor


Re: solr-ruby: Error undefined method `closed?' for nil:NilClass

2011-08-14 Thread Ian Connor
It is nothing special - just like this:

   conn = Solr::Connection.new("http://#{LOCAL_SHARD}",
                               {:timeout => 1000, :autocommit => :on})
   options[:shards] = HA_SHARDS
   response = conn.query(query, options)

Where LOCAL_SHARD points to a haproxy of a single shard and HA_SHARDS is an
array of 18 shards (via haproxy).

Ian.

On Mon, Aug 8, 2011 at 12:50 PM, Erik Hatcher erik.hatc...@gmail.comwrote:

 Ian -

 What does your solr-ruby using code look like?

 Solr::Connection is light-weight, so you could just construct a new one of
 those for each request.  Are you keeping an instance around?

Erik


 On Aug 8, 2011, at 12:03 , Ian Connor wrote:

  Hi,
 
  I have seen some of these errors come through from time to time. It looks
  like:
 
  /usr/lib/ruby/1.8/net/http.rb:1060:in
  `request'\n/usr/lib/ruby/1.8/net/http.rb:845:in `post'
 
  /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:158:in
  `post'
 
  /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:151:in
  `send'
 
  /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:174:in
  `create_and_send_query'
 
  /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:92:in
  `query'
 
  It is as if the http object has gone away. Would it be good to create a
 new
  one inside of the connection or is something more serious going on?
  ubuntu 10.04
  passenger 3.0.8
  rails 2.3.11
 
  --
  Regards,
 
  Ian Connor




-- 
Regards,

Ian Connor
1 Leighton St #723
Cambridge, MA 02141
Call Center Phone: +1 (714) 239 3875 (24 hrs)
Fax: +1(770) 818 5697
Skype: ian.connor


solr-ruby: Error undefined method `closed?' for nil:NilClass

2011-08-08 Thread Ian Connor
Hi,

I have seen some of these errors come through from time to time. It looks
like:

/usr/lib/ruby/1.8/net/http.rb:1060:in
`request'\n/usr/lib/ruby/1.8/net/http.rb:845:in `post'

/usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:158:in
`post'

/usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:151:in
`send'

/usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:174:in
`create_and_send_query'

/usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:92:in
`query'

It is as if the http object has gone away. Would it be good to create a new
one inside of the connection or is something more serious going on?
ubuntu 10.04
passenger 3.0.8
rails 2.3.11

-- 
Regards,

Ian Connor


how does Solr/Lucene index multi-value fields

2011-05-31 Thread Ian Holsman
Hi.

I want to store a list of documents (say each being 30-60k of text) into a 
single SolrDocument. (to speed up post-retrieval querying)

In order to do this, I need to know if lucene calculates the TF/IDF score over 
the entire field or does it treat each value in the list as a unique field? 

If I can't store it as a multi-value, I could create a schema where I put each 
document into a unique field, but I'm not sure how to create the query to 
search all the fields.


Regards
Ian



Re: how does Solr/Lucene index multi-value fields

2011-05-31 Thread Ian Holsman

On May 31, 2011, at 12:11 PM, Erick Erickson wrote:

 Can you explain the use-case a bit more here? Especially the post-query
 processing and how you expect the multiple documents to help here.
 

we have a collection of related stories. when a user searches for something, we 
might not want to display the story that is most relevant according to SOLR, 
but rather the one preferred by our home-grown rules.  by combining all the 
possibilities in one SolrDocument, we can avoid a DB hit to get related stories.


 But TF/IDF is calculated over all the values in the field. There's really no
 difference between a multi-valued field and storing all the data in a
 single field
 as far as relevance calculations are concerned.
 

so.. it will suck regardless.. I thought we had per-field relevance in the 
current trunk. :-(


 Best
 Erick
 
 On Tue, May 31, 2011 at 11:02 AM, Ian Holsman had...@holsman.net wrote:
 Hi.
 
 I want to store a list of documents (say each being 30-60k of text) into a 
 single SolrDocument. (to speed up post-retrieval querying)
 
 In order to do this, I need to know if lucene calculates the TF/IDF score 
 over the entire field or does it treat each value in the list as a unique 
 field?
 
 If I can't store it as a multi-value, I could create a schema where I put 
 each document into a unique field, but I'm not sure how to create the query 
 to search all the fields.
 
 
 Regards
 Ian
 
 



Re: how does Solr/Lucene index multi-value fields

2011-05-31 Thread Ian Holsman
Thanks Erick.

sadly, in my use-case that wouldn't work. I'll go back to storing them 
at the story level and hitting a DB to get related stories, I think.

--I
On May 31, 2011, at 12:27 PM, Erick Erickson wrote:

 Hmmm, I may have misled you. Re-reading my text, it
 wasn't very well written
 
 TF/IDF calculations are, indeed, per-field. I was trying
 to say that there was no difference between storing all
 the data for an individual field as a single long string of text
 in a single-valued field or as several shorter strings in
 a multi-valued field.
 
 Best
 Erick
 
 On Tue, May 31, 2011 at 12:16 PM, Ian Holsman had...@holsman.net wrote:
 
 On May 31, 2011, at 12:11 PM, Erick Erickson wrote:
 
 Can you explain the use-case a bit more here? Especially the post-query
 processing and how you expect the multiple documents to help here.
 
 
 we have a collection of related stories. when a user searches for something, 
 we might not want to display the story that is most relevant according to 
 SOLR, but rather the one preferred by our home-grown rules.  by combining all the 
 possibilities in one SolrDocument, we can avoid a DB hit to get related 
 stories.
 
 
 But TF/IDF is calculated over all the values in the field. There's really no
 difference between a multi-valued field and storing all the data in a
 single field
 as far as relevance calculations are concerned.
 
 
 so.. it will suck regardless.. I thought we had per-field relevance in the 
 current trunk. :-(
 
 
 Best
 Erick
 
 On Tue, May 31, 2011 at 11:02 AM, Ian Holsman had...@holsman.net wrote:
 Hi.
 
 I want to store a list of documents (say each being 30-60k of text) into a 
 single SolrDocument. (to speed up post-retrieval querying)
 
 In order to do this, I need to know if lucene calculates the TF/IDF score 
 over the entire field or does it treat each value in the list as a unique 
 field?
 
 If I can't store it as a multi-value, I could create a schema where I put 
 each document into a unique field, but I'm not sure how to create the 
 query to search all the fields.
 
 
 Regards
 Ian
 
 
 
 



Boosting score by distance

2011-05-13 Thread Ian Eure
I have a bunch of documents representing points of interest indexed in Solr. 
I'm trying to boost the score of documents based on distance from an origin 
point, and having some difficulty.

I'm currently using the standard query parser and sending in this query:

(name:sushi OR tags:sushi OR classifiers:sushi) AND deleted:False AND 
owner:simplegeo

I'm also using the spatial search to limit results to ones found within 25km of 
my origin point. The issue I'm having is that I need the score to be a blend of 
the FT match _and_ distance from the origin point; If i sort by distance, lots 
of low quality matches clog up the results for simple searches, but if I sort 
by score, more distant results overwhelm nearby, though less relevant 
(according to Solr) results.

I think what I want to do is boost the score of documents based on the distance 
from the origin search point. Alternately, if there was some way to treat a 
match on any of the three fields as having equal weight, I believe that would 
get me much closer to what I want.

The examples I've seen for doing this kind of thing use dismax and its boost 
function (`bf') parameter. I don't know if my queries are translatable to 
dismax syntax as they are now, and it looks like the boost functions don't work 
with the standard query parser — at least, I have been completely unable to 
change the score when using it.

Is there some way to boost by the inverse of the distance using the standard 
query parser, or alternately, to filter my results by different fields with the 
dismax parser?
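
One commonly suggested shape for the first option, assuming a Solr release
that provides the dismax parser plus the geodist()/geofilt spatial functions
(the location field name, origin point and constants below are illustrative,
not taken from this setup):

 q=sushi&defType=dismax&qf=name tags classifiers
 &fq=deleted:False&fq=owner:simplegeo
 &sfield=location&pt=47.6,-122.3&d=25&fq={!geofilt}
 &bf=recip(geodist(),2,200,20)

recip() turns larger distances into smaller boosts, so nearby matches get a
relevance bump without completely overriding the text score.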

Re: resetting stats

2011-01-31 Thread Ian Connor
Has there been any progress on this or tools people might use to capture the
average or 90% time for the last hour?

That would allow us to better match up slowness with other metrics like
CPU/IO/Memory to find bottlenecks in the system.

Thanks,
Ian.

On Wed, Mar 31, 2010 at 9:13 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : Say I have 3 Cores names core0, core1, and core2, where only core1 and
 core2
 : have documents and caches.  If all my searches hit core0, and core0
 shards
 : out to core1 and core2, then the stats from core0 would be accurate for
 : errors, timeouts, totalTime, avgTimePerRequest, avgRequestsPerSecond,
 etc.

  Ahhh yes. (I see what you mean by aggregating core now ... I thought
  you meant a core just for aggregating stats)

 *If* you are using distributed search, then you can gather stats from the
 core you use for collating/aggregating from the other shards, and
 reloading that core should be cheap.

 but if you aren't already using distributed searching, it would be a bad
 idea from a performance standpoint to add it just to take advantage of
 being able to reload the coordinator core (the overhead of searching one
 distributed shard vs doing the same query directly is usually very
  measurable, even if the shard is the same Solr instance as your
 coordinator)



 -Hoss




-- 
Regards,

Ian Connor
1 Leighton St #723
Cambridge, MA 02141
Call Center Phone: +1 (714) 239 3875 (24 hrs)
Fax: +1(770) 818 5697
Skype: ian.connor


Solr rate limiting / DoS attacks

2010-09-29 Thread Ian Upright
Hi, I'm curious as to what approaches one would take to defend against users
attacking a Solr service, especially if exposed to the internet as opposed
to an intranet.  I'm fairly new to Solr, is there anything built in?

Is there anything in place to prevent the search engine from getting
overwhelmed by a particular user or group of users, submitting loads of
time-consuming queries as some form of a DoS attack?  

Additionally, is there a way of rate-limiting it so that only a certain
number of queries per user/per hour can be submitted, etc?  (for example, to
prevent programmatic access to the search engine as opposed to a human user)

Thanks, Ian


Re: getting a list of top page-ranked webpages

2010-09-17 Thread Ian Upright
On Fri, 17 Sep 2010 04:46:44 -0700 (PDT), kenf_nc
ken.fos...@realestate.com wrote:

A slightly different route to take, but one that should help test/refine a
semantic parser is wikipedia. They make available their entire corpus, or
any subset you define. The whole thing is like 14 terabytes, but you can get
smaller sets. 

Actually, I do heavy analysis of the entire wikipedia, plus 1m top webpages
from Alexa, and all of the dmoz URLs, in order to build the semantic engine in
the first place.  However, an outside corpus is required to test its
quality outside of this space.

Cheers, Ian


Re: getting a list of top page-ranked webpages

2010-09-17 Thread Ian Upright
On Thu, 16 Sep 2010 15:31:02 -0700, you wrote:

The public terabyte dataset project would be a good match for what you  
need.

http://bixolabs.com/datasets/public-terabyte-dataset-project/

Of course, that means we have to actually finish the crawl  finalize  
the Avro format we use for the data :)

There are other free collections of data around, though none that I  
know of which target top-ranked pages.

-- Ken

Hi Ken.. this looks exactly like what I need.  There is the ClueWeb dataset,
http://boston.lti.cs.cmu.edu/Data/clueweb09/   However, one must buy it from
them, the crawl was done in 09, and it includes a number of hard drives which
are shipped to you.  Any crawl that would be available as an Amazon Public
Dataset would be totally perfect.

Ian


getting a list of top page-ranked webpages

2010-09-16 Thread Ian Upright
Hi, this question is a little off topic, but I thought since so many people
on this are probably experts in this field, someone may know.

I'm experimenting with my own semantic-based search engine, but I want to
test it with a large corpus of web pages.  Ideally I would like to have a
list of the top 10M or top 100M page-ranked URL's in the world.

Short of using Nutch to crawl the entire web and build this page-rank, is
there any other ways?  What other ways or resources might be available for
me to get this (smaller) corpus of top webpages?

Thanks, Ian


Re: How to find first document for the ALL search

2010-07-15 Thread Ian Connor
Hi,

The good news is that:

/solr/select?q=*%3A*&fq=&start=1&rows=1&fl=id

did work (kind of odd really), so I am reading all the documents from the bad
index into a new Solr with the same configuration, using ruby (complete
rebuild).

so far so good - it has gone through 500k out of 1.7M and seems to be the
best approach I could think of.
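
A rough sketch of that kind of copy loop (hedged: this assumes solr-ruby,
that every field you need is stored, and that page size and host names are
placeholders):

 require 'solr'

 source = Solr::Connection.new("http://old-host:8983/solr")
 dest   = Solr::Connection.new("http://new-host:8983/solr")

 start = 1          # skip the bad document at position 0
 rows  = 1000
 loop do
   res = source.query("*:*", :start => start, :rows => rows)
   break if res.hits.empty?
   res.hits.each { |doc| dest.add(doc) }   # each hit is a field => value hash
   start += rows
 end
 dest.commit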

Running the luke tool and trying to check the index on a copy ended up
destroying the index and leaving only about 5k documents. Reading them
out via ruby seemed better in this case (and less work than restoring from
backup and re-running a few days' transactions to catch it up).

Ian.


On Wed, Jul 14, 2010 at 9:22 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : I have found that this search crashes:
 :
 : /solr/select?q=*%3A*&fq=&start=0&rows=1&fl=id

  Ouch .. that exception is kind of hairy.  It suggests that your index may
  have been corrupted in some way -- do you have any idea what happened?
  Have you tried using the CheckIndex tool to see what it says?

  (I'd hate to help you work around this but get bit by a timebomb of some
  other bad docs later)

 : It looks like just that first document is bad. I am happy to delete it -
 but
 : not sure how to get to it. Does anyone know how to find it?

  CheckIndex might help ... if it doesn't, the next thing you might try is
 asking for a legitimate field name that you know no document has (ie: if
 you have a dynamicField with the pattern str_* because you have fields
 like str_foo and str_bar but you never have fields named
 strBOGUS then use fl=strBOGUS) and then add debugQuery=true to
 the URL -- the debug info should contain the id.
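
  For example (hypothetical, using the made-up field name from above):

  /solr/select?q=*%3A*&start=0&rows=1&fl=strBOGUS&debugQuery=true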

  I'll be honest though: I'm guessing that if your example query doesn't
  work, my suggestion won't either -- because if you get that error just
  trying to access the id field, the same thing will probably happen when
  the debugComponent tries to look it up as well.



 -Hoss




-- 
Regards,

Ian Connor
1 Leighton St #723
Cambridge, MA 02141
Call Center Phone: +1 (714) 239 3875 (24 hrs)
Fax: +1(770) 818 5697
Skype: ian.connor


How to find first document for the ALL search

2010-07-12 Thread Ian Connor
I have found that this search crashes:

/solr/select?q=*%3A*&fq=&start=0&rows=1&fl=id

SEVERE: java.lang.IndexOutOfBoundsException: Index: 114, Size: 90
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:288)
at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:217)
at
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948)
at
org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506)
at
org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:259)

but this one works:

/solr/select?q=*%3A*&fq=&start=1&rows=1&fl=id

It looks like just that first document is bad. I am happy to delete it - but
not sure how to get to it. Does anyone know how to find it?

- Ian


Generating a sitemap

2010-03-10 Thread Ian Evans
Been testing nutch to crawl for solr and I was wondering if anyone had
already worked on a system for getting the urls out of solr and generating
an XML sitemap for Google.
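
One possible shape for this (a sketch only; it assumes solr-ruby, a stored
url field, and a default localhost Solr instance):

 require 'solr'

 conn  = Solr::Connection.new("http://localhost:8983/solr")
 start = 0
 rows  = 500
 urls  = []
 loop do
   res = conn.query("*:*", :start => start, :rows => rows)
   break if res.hits.empty?
   urls.concat(res.hits.map { |h| h["url"] })   # "url" is an assumed field name
   start += rows
 end

 File.open("sitemap.xml", "w") do |f|
   f.puts '<?xml version="1.0" encoding="UTF-8"?>'
   f.puts '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
   # real URLs should be XML-escaped before writing
   urls.each { |u| f.puts "  <url><loc>#{u}</loc></url>" }
   f.puts '</urlset>'
 end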


RE: Handling and sorting email addresses

2010-03-08 Thread Ian Battersby
Thanks Mitch, using the analysis page has been a real eye-opener and given
me a better insight into how Solr was applying the filters (and more
importantly in which order). I've ironically ended up with a charFilter
mapping file, as this seemed the only route to replacing characters before
the tokenizer kicked in; unfortunately Solr just refused to allow sorting on
anything tokenized on characters other than whitespace.
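
A minimal sketch of that charFilter route (the field type name, the
KeywordTokenizer choice and the mapping file name here are illustrative, not
taken from this thread):

 <fieldType name="email_sort" class="solr.TextField" sortMissingLast="true">
   <analyzer>
     <!-- replace characters before the tokenizer runs -->
     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-email.txt"/>
     <!-- keep the whole value as one token so the field stays sortable -->
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>

 # mapping-email.txt
 "@" => " AT "
 "." => " DOT "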

Cheers, Ian.

-Original Message-
From: MitchK [mailto:mitc...@web.de] 
Sent: 07 March 2010 22:44
To: solr-user@lucene.apache.org
Subject: Re: Handling and sorting email addresses


Ian,

did you have a look at Solr's admin analysis.jsp?
When everything on the analysis page is fine, you have misunderstood
Solr's schema.xml file.

You've set two attributes in your schema.xml:
stored = true
indexed = true

What you get as a response is the stored field value.
The stored field value is the original field value, without any
modifications.
However, Solr is using the indexed field value to query your data.

Kind regards
- Mitch
 

Ian Battersby wrote:
 
 Forgive what might seem like a newbie question but am struggling
 desperately
 with this. 
 
 We have a dynamic field that holds email address and we'd like to be able
 to
 sort by it, obviously when trying to do this we get an error as it thinks
 the email address is a tokenized field. We've tried a custom field type
 using PatternReplaceFilterFactory to specify that @ and . should be
 replaced
 with " AT " and " DOT ", but we just can't seem to get it to work; all the
 fields still contain the unparsed email.
 
 We used an example found on the mailing-list for the field type:
 
 <fieldType name="email" class="solr.TextField"
 positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.PatternReplaceFilterFactory" pattern="\."
 replacement=" DOT " replace="all" />
     <filter class="solr.PatternReplaceFilterFactory" pattern="@"
 replacement=" AT " replace="all" />
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
 generateNumberParts="1" catenateWords="0" catenateNumbers="0"
 catenateAll="0" splitOnCaseChange="0"/>
   </analyzer>
 </fieldType>
 
 .. our dynamic field looks like ..
 
   <dynamicField name="dynamicemail_*" type="email" indexed="true"
 stored="true" multiValued="true" />
 
 When writing a document to Solr it still seems to write the original email
 address (e.g. this.u...@somewhere.com) as opposed to its parsed version (e.g.
 this DOT user AT somewhere DOT com). Can anyone help? 
 
 We are running version 1.4 but have even tried the nightly build in an
 attempt to solve this problem.
 
 Thanks.
 
 
 

-- 
View this message in context:
http://old.nabble.com/Handling-and-sorting-email-addresses-tp27813111p278152
39.html
Sent from the Solr - User mailing list archive at Nabble.com.




Handling and sorting email addresses

2010-03-07 Thread Ian Battersby
Forgive what might seem like a newbie question but am struggling desperately
with this. 

We have a dynamic field that holds email address and we'd like to be able to
sort by it, obviously when trying to do this we get an error as it thinks
the email address is a tokenized field. We've tried a custom field type
using PatternReplaceFilterFactory to specify that @ and . should be replaced
with " AT " and " DOT ", but we just can't seem to get it to work; all the
fields still contain the unparsed email.

We used an example found on the mailing-list for the field type:

<fieldType name="email" class="solr.TextField"
positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\."
replacement=" DOT " replace="all" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="@"
replacement=" AT " replace="all" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="0" catenateNumbers="0"
catenateAll="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>

.. our dynamic field looks like ..

  <dynamicField name="dynamicemail_*" type="email" indexed="true"
stored="true" multiValued="true" />

When writing a document to Solr it still seems to write the original email
address (e.g. this.u...@somewhere.com) as opposed to its parsed version (e.g.
this DOT user AT somewhere DOT com). Can anyone help? 

We are running version 1.4 but have even tried the nightly build in an
attempt to solve this problem.

Thanks.



[ANN] Zoie Solr Plugin - Zoie Solr Plugin enables real-time update functionality for Apache Solr 1.4+

2010-03-07 Thread Ian Holsman


I just saw this on twitter, and thought you guys would be interested.. I 
haven't tried it, but it looks interesting.


http://snaprojects.jira.com/wiki/display/ZOIE/Zoie+Solr+Plugin

Thanks for the RT Shalin!


Re: If you could have one feature in Solr...

2010-02-28 Thread Ian Holsman

On 2/24/10 8:42 AM, Grant Ingersoll wrote:

What would it be?

   

most of this will be coming in 1.5,
but for me it's

- sharding.. it still seems a bit clunky

secondly.. this one isn't in 1.5.
I'd like to be able to find interesting terms that appear in my result 
set that don't appear in the global corpus.


it's kind of like doing a facet count on *:* and then on the search term, 
and discounting the terms that appear heavily in the global one.
(sorry.. there is a textbook definition of this.. XX distance.. but I 
haven't got the books in front of me).








Re: HTTP ERROR: 404 missing core name in path after integrating nutch

2010-02-26 Thread Ian Evans
Just wanted to give an update on my efforts.

I installed the Feb. 26 update this morning. Was able to access /solr/admin.

Copied over the nutch schema.xml. restarted solr and was able to access
/solr/admin

Edited solrconfig.xml to add the nutch requesthandler snippet from
lucidimagination. Restarted solr and got the 404 missing core name in path
error.

What in the requesthandler snippet (see below) could be causing this error?

from http://bit.ly/1mOb

<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      content^0.5 anchor^1.0 title^1.2
    </str>
    <str name="pf">
      content^0.5 anchor^1.5 title^1.2 site^1.5
    </str>
    <str name="fl">
      url
    </str>
    <str name="mm">
      2&lt;-1 5&lt;-2 6&lt;90%
    </str>
    <int name="ps">100</int>
    <bool hl="true"/>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>

Have a great weekend.


HTTP ERROR: 404 missing core name in path after integrating nutch

2010-02-25 Thread Ian M. Evans

Hi everyone,

Last night I was able to get Solr up and running, and was able to 
access:


http://localhost:8983/solr/admin

This morning, I started on the nutch crawling instructions over at:

http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/

After adding the following to /solr/conf/solrconfig.xml:
<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      content^0.5 anchor^1.0 title^1.2
    </str>
    <str name="pf">
      content^0.5 anchor^1.5 title^1.2 site^1.5
    </str>
    <str name="fl">
      url
    </str>
    <str name="mm">
      2&lt;-1 5&lt;-2 6&lt;90%
    </str>
    <int name="ps">100</int>
    <bool hl="true"/>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>

going to http://localhost:8983/solr/admin suddenly throws a HTTP ERROR: 
404 missing core name in path


Why would adding the above snippet suddenly throw that error?

Thanks.


Re: Distributed search and haproxy and connection build up

2010-02-11 Thread Ian Connor
Not yet - but thanks for the link.

I think that the OS also has a timeout that keeps it around even after this
event and with heavy traffic I have seen this build up. Having said all
this, the performance impact after testing was negligible for us but I
thought I would post that haproxy can cause large numbers of connections on
a busy site. Going directly to shards does cut the number of connections
down a lot if someone else finds this to be a problem.

I am looking forward to distribution under 1.5 where the | option allows
redundancy in the request. This will solve the persistence problem while
still allowing failover for the shard requests.

Even after 1.5, I would still advocate haproxy between ruby (or your
http stack) and Solr. It is just that when Solr is sharding the request, it
can keep its connections open and save some resources there.

Ian.


On Thu, Feb 11, 2010 at 11:49 AM, Tim Underwood timunderw...@gmail.comwrote:

 Have you played around with the option httpclose or the option
 forceclose configuration options in HAProxy (both documented here:
 http://haproxy.1wt.eu/download/1.3/doc/configuration.txt)?
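
 For reference, that option goes in the backend (or listen) section of the
 haproxy config, roughly like this (server names and addresses below are
 placeholders based on the netstat output, not a tested config):

 backend solr_shards
     option httpclose       # or: option forceclose
     server shard1 10.0.16.181:8890 check
     server shard2 10.0.16.181:8893 check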

 -Tim

 On Wed, Feb 10, 2010 at 10:05 AM, Ian Connor ian.con...@gmail.com wrote:
  Thanks,
 
  I bypassed haproxy as a test and it did reduce the number of connections
 -
  but it did not seem as though these connections were hurting anything.
 
  Ian.
 
  On Tue, Feb 9, 2010 at 11:01 PM, Lance Norskog goks...@gmail.com
 wrote:
 
  This goes through the Apache Commons HTTP client library:
  http://hc.apache.org/httpclient-3.x/
 
  We used 'balance' at another project and did not have any problems.
 
  On Tue, Feb 9, 2010 at 5:54 AM, Ian Connor ian.con...@gmail.com
 wrote:
   I have been using distributed search with haproxy but noticed that I
 am
   suffering a little from tcp connections building up waiting for the OS
  level
   closing/time out:
  
   netstat -a
   ...
   tcp6   1  0 10.0.16.170%34654:53789 10.0.16.181%363574:8893
   CLOSE_WAIT
   tcp6   1  0 10.0.16.170%34654:43932 10.0.16.181%363574:8890
   CLOSE_WAIT
   tcp6   1  0 10.0.16.170%34654:43190 10.0.16.181%363574:8895
   CLOSE_WAIT
   tcp6   0  0 10.0.16.170%346547:8984 10.0.16.181%36357:53770
   TIME_WAIT
   tcp6   1  0 10.0.16.170%34654:41782 10.0.16.181%363574:
   CLOSE_WAIT
   tcp6   1  0 10.0.16.170%34654:52169 10.0.16.181%363574:8890
   CLOSE_WAIT
   tcp6   1  0 10.0.16.170%34654:55947 10.0.16.181%363574:8887
   CLOSE_WAIT
   tcp6   0  0 10.0.16.170%346547:8984 10.0.16.181%36357:54040
   TIME_WAIT
   tcp6   1  0 10.0.16.170%34654:40030 10.0.16.160%363574:8984
   CLOSE_WAIT
   ...
  
   Digging a little into the haproxy documentation, it seems that they do
  not
   support persistent connections.
  
   Does solr normally persist the connections between shards (would this
   problem happen even without haproxy)?
  
   Ian.
  
 
 
 
  --
  Lance Norskog
  goks...@gmail.com
 
 
 
 
  --
  Regards,
 
  Ian Connor
 



Has anyone done request logging with Solr-Ruby for use in Rails?

2010-02-11 Thread Ian Connor
The idea is that the log currently looks like:

Completed in 1290ms (View: 152, DB: 75) | 200 OK [
http://localhost:3000/search?q=nik+gene+cluster&view=2]

I want to extend it to also track the Solr query times and time spent in
solr-ruby, like:

Completed in 1290ms (View: 152, DB: 75, Solr: 334) | 200 OK [
http://localhost:3000/search?q=nik+gene+cluster&view=2]

Has anyone done such a plug-in or extension already?

-- 
Regards,

Ian Connor


Re: Has anyone done request logging with Solr-Ruby for use in Rails?

2010-02-11 Thread Ian Connor
This seems to allow you to log each query - which is a good start.

I was thinking of something that would add all the ms together and report the
total in the 'Completed in' line, so you can get a higher-level view of which
requests take the time and where.
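
A rough sketch of the accumulating side (hedged: assumes Ruby 1.8 / Rails 2.3
and solr-ruby's Solr::Connection#query as the single call worth timing):

 require 'solr'

 class Solr::Connection
   alias_method :query_without_timing, :query

   def query(*args)
     started = Time.now
     result  = query_without_timing(*args)
     ms = ((Time.now - started) * 1000).round
     # keep a per-request running total; reset it at the start of each request
     Thread.current[:solr_runtime] = (Thread.current[:solr_runtime] || 0) + ms
     result
   end
 end

Surfacing Thread.current[:solr_runtime] in the Completed in line would still
mean patching ActionController's benchmarking, which is the crufty part
mentioned below.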

Ian.

On Thu, Feb 11, 2010 at 1:13 PM, Mat Brown m...@patch.com wrote:

 On Thu, Feb 11, 2010 at 13:07, Ian Connor ian.con...@gmail.com wrote:
  The idea is that in the log is currently like:
 
  Completed in 1290ms (View: 152, DB: 75) | 200 OK [
  http://localhost:3000/search?q=nik+gene+clusterview=2]
 
  I want to extend it to also track the Solr query times and time spent in
  solr-ruby like:
 
  Completed in 1290ms (View: 152, DB: 75, Solr: 334) | 200 OK [
  http://localhost:3000/search?q=nik+gene+clusterview=2]
 
  Has anyone done such a plug-in or extension already?
 
  --
  Regards,
 
  Ian Connor
 

 Here's a module in Sunspot::Rails that does that. It's written against
 RSolr, which is an alternative to solr-ruby, but the concept is the
 same:

 http://github.com/outoftime/sunspot/blob/master/sunspot_rails/lib/sunspot/rails/solr_logging.rb




-- 
Regards,

Ian Connor
1 Leighton St #723
Cambridge, MA 02141
Call Center Phone: +1 (714) 239 3875 (24 hrs)
Fax: +1(770) 818 5697
Skype: ian.connor


Re: Has anyone done request logging with Solr-Ruby for use in Rails?

2010-02-11 Thread Ian Connor
...and probably break stuff - that might be why it hasn't been done.

On Thu, Feb 11, 2010 at 1:28 PM, Mat Brown m...@patch.com wrote:

 Oh - indeed - sorry, didn't read your email closely enough : )

 Yeah that would probably involve some pretty crufty monkey patching /
 use of globals...

 On Thu, Feb 11, 2010 at 13:22, Ian Connor ian.con...@gmail.com wrote:
  This seems to allow you to log each query - which is a good start.
 
  I was thinking of something that would add all the ms together and report
 it
  in the completed at line so you can get a higher level view of which
  requests take the time and where.
 
  Ian.
 
  On Thu, Feb 11, 2010 at 1:13 PM, Mat Brown m...@patch.com wrote:
 
  On Thu, Feb 11, 2010 at 13:07, Ian Connor ian.con...@gmail.com wrote:
   The idea is that in the log is currently like:
  
   Completed in 1290ms (View: 152, DB: 75) | 200 OK [
   http://localhost:3000/search?q=nik+gene+clusterview=2]
  
   I want to extend it to also track the Solr query times and time spent
 in
   solr-ruby like:
  
   Completed in 1290ms (View: 152, DB: 75, Solr: 334) | 200 OK [
   http://localhost:3000/search?q=nik+gene+clusterview=2]
  
   Has anyone done such a plug-in or extension already?
  
   --
   Regards,
  
   Ian Connor
  
 
  Here's a module in Sunspot::Rails that does that. It's written against
  RSolr, which is an alternative to solr-ruby, but the concept is the
  same:
 
 
 http://github.com/outoftime/sunspot/blob/master/sunspot_rails/lib/sunspot/rails/solr_logging.rb
 
 
 
 
  --
  Regards,
 
  Ian Connor
  1 Leighton St #723
  Cambridge, MA 02141
  Call Center Phone: +1 (714) 239 3875 (24 hrs)
  Fax: +1(770) 818 5697
  Skype: ian.connor
 




-- 
Regards,

Ian Connor
1 Leighton St #723
Cambridge, MA 02141
Call Center Phone: +1 (714) 239 3875 (24 hrs)
Fax: +1(770) 818 5697
Skype: ian.connor


Re: Distributed search and haproxy and connection build up

2010-02-10 Thread Ian Connor
Thanks,

I bypassed haproxy as a test and it did reduce the number of connections -
but it did not seem as though these connections were hurting anything.

Ian.

On Tue, Feb 9, 2010 at 11:01 PM, Lance Norskog goks...@gmail.com wrote:

 This goes through the Apache Commons HTTP client library:
 http://hc.apache.org/httpclient-3.x/

 We used 'balance' at another project and did not have any problems.

 On Tue, Feb 9, 2010 at 5:54 AM, Ian Connor ian.con...@gmail.com wrote:
  I have been using distributed search with haproxy but noticed that I am
  suffering a little from tcp connections building up waiting for the OS
 level
  closing/time out:
 
  netstat -a
  ...
  tcp6   1  0 10.0.16.170%34654:53789 10.0.16.181%363574:8893
  CLOSE_WAIT
  tcp6   1  0 10.0.16.170%34654:43932 10.0.16.181%363574:8890
  CLOSE_WAIT
  tcp6   1  0 10.0.16.170%34654:43190 10.0.16.181%363574:8895
  CLOSE_WAIT
  tcp6   0  0 10.0.16.170%346547:8984 10.0.16.181%36357:53770
  TIME_WAIT
  tcp6   1  0 10.0.16.170%34654:41782 10.0.16.181%363574:
  CLOSE_WAIT
  tcp6   1  0 10.0.16.170%34654:52169 10.0.16.181%363574:8890
  CLOSE_WAIT
  tcp6   1  0 10.0.16.170%34654:55947 10.0.16.181%363574:8887
  CLOSE_WAIT
  tcp6   0  0 10.0.16.170%346547:8984 10.0.16.181%36357:54040
  TIME_WAIT
  tcp6   1  0 10.0.16.170%34654:40030 10.0.16.160%363574:8984
  CLOSE_WAIT
  ...
 
  Digging a little into the haproxy documentation, it seems that they do
 not
  support persistent connections.
 
  Does solr normally persist the connections between shards (would this
  problem happen even without haproxy)?
 
  Ian.
 



 --
 Lance Norskog
 goks...@gmail.com




-- 
Regards,

Ian Connor


Distributed search and haproxy and connection build up

2010-02-09 Thread Ian Connor
I have been using distributed search with haproxy but noticed that I am
suffering a little from tcp connections building up waiting for the OS level
closing/time out:

netstat -a
...
tcp6   1  0 10.0.16.170%34654:53789 10.0.16.181%363574:8893
CLOSE_WAIT
tcp6   1  0 10.0.16.170%34654:43932 10.0.16.181%363574:8890
CLOSE_WAIT
tcp6   1  0 10.0.16.170%34654:43190 10.0.16.181%363574:8895
CLOSE_WAIT
tcp6   0  0 10.0.16.170%346547:8984 10.0.16.181%36357:53770
TIME_WAIT
tcp6   1  0 10.0.16.170%34654:41782 10.0.16.181%363574:
CLOSE_WAIT
tcp6   1  0 10.0.16.170%34654:52169 10.0.16.181%363574:8890
CLOSE_WAIT
tcp6   1  0 10.0.16.170%34654:55947 10.0.16.181%363574:8887
CLOSE_WAIT
tcp6   0  0 10.0.16.170%346547:8984 10.0.16.181%36357:54040
TIME_WAIT
tcp6   1  0 10.0.16.170%34654:40030 10.0.16.160%363574:8984
CLOSE_WAIT
...

Digging a little into the haproxy documentation, it seems that they do not
support persistent connections.

Does solr normally persist the connections between shards (would this
problem happen even without haproxy)?

Ian.


Re: distributed search and failed core

2010-02-03 Thread Ian Connor
My only suggestion is to put haproxy in front of two replicas and then have
haproxy do the failover. If a shard fails, the whole search will fail unless
you do something like this.

On Fri, Jan 29, 2010 at 3:31 PM, Joe Calderon calderon@gmail.comwrote:

 hello *, in distributed search when a shard goes down, an error is
 returned and the search fails, is there a way to avoid the error and
 return the results from the shards that are still up?

 thx much

 --joe




-- 
Regards,

Ian Connor


Re: Lock problems: Lock obtain timed out

2010-01-27 Thread Ian Connor
Can anyone think of a reason why these locks would hang around for more than
2 hours?

I have been monitoring them and they look like they are very short lived.

On Tue, Jan 26, 2010 at 10:15 AM, Ian Connor ian.con...@gmail.com wrote:

 We traced one of the lock files, and it had been around for 3 hours. A
 restart removed it - but is 3 hours normal for one of these locks?

 Ian.


 On Mon, Jan 25, 2010 at 4:14 PM, mike anderson saidthero...@gmail.comwrote:

 I am getting this exception as well, but disk space is not my problem.
 What
 else can I do to debug this? The solr log doesn't appear to lend any other
 clues..

 Jan 25, 2010 4:02:22 PM org.apache.solr.core.SolrCore execute
 INFO: [] webapp=/solr path=/update params={} status=500 QTime=1990
 Jan 25, 2010 4:02:22 PM org.apache.solr.common.SolrException log
 SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain
 timed
 out: NativeFSLock@
 /solr8984/index/lucene-98c1cb272eb9e828b1357f68112231e0-write.lock
 at org.apache.lucene.store.Lock.obtain(Lock.java:85)
 at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1545)
 at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1402)
 at org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:190)
 at

 org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:98)
 at

 org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
 at

 org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:220)
 at

 org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
 at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
 at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
 at

 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
 at

 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
 at

 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
 at

 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
 at
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
 at

 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
 at
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
 at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
 at

 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
 at

 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
 at
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
 at org.mortbay.jetty.Server.handle(Server.java:285)
 at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
 at

 org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
 at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
 at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
 at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
 at

 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
 at

 org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)


 Should I consider changing the lock timeout settings (currently set to
 defaults)? If so, I'm not sure what to base these values on.

 Thanks in advance,
 mike


 On Wed, Nov 4, 2009 at 8:27 PM, Lance Norskog goks...@gmail.com wrote:

  This will not ever work reliably. You should have 2x total disk space
  for the index. Optimize, for one, requires this.
 
  On Wed, Nov 4, 2009 at 6:37 AM, Jérôme Etévé jerome.et...@gmail.com
  wrote:
   Hi,
  
   It seems this situation is caused by some "No space left on device"
  exceptions:
   SEVERE: java.io.IOException: No space left on device
  at java.io.RandomAccessFile.writeBytes(Native Method)
  at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
  at
 
 org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexOutput.flushBuffer(SimpleFSDirectory.java:192)
  at
 
 org.apache.lucene.store.BufferedIndexOutput.flushBuffer(BufferedIndexOutput.java:96)
  
  
   I'd better try to set my maxMergeDocs and mergeFactor to more
   adequate values for my app (I'm indexing ~15 GB of data on a 20 GB
   device, so I guess there's a problem when Solr tries to merge the index
   bits being built).
  
   At the moment, they are set to <mergeFactor>100</mergeFactor> and
   <maxMergeDocs>2147483647</maxMergeDocs>
  
   Jerome.
  
   --
   Jerome Eteve.
   http://www.eteve.net
   jer...@eteve.net
  
 
 
 
  --
  Lance Norskog
  goks...@gmail.com
 




-- 
Regards,

Ian Connor
1 Leighton St

Re: Lock problems: Lock obtain timed out

2010-01-26 Thread Ian Connor
We traced one of the lock files, and it had been around for 3 hours. A
restart removed it - but is 3 hours normal for one of these locks?

Ian.

On Mon, Jan 25, 2010 at 4:14 PM, mike anderson saidthero...@gmail.comwrote:

 I am getting this exception as well, but disk space is not my problem. What
 else can I do to debug this? The solr log doesn't appear to lend any other
 clues..

 Jan 25, 2010 4:02:22 PM org.apache.solr.core.SolrCore execute
 INFO: [] webapp=/solr path=/update params={} status=500 QTime=1990
 Jan 25, 2010 4:02:22 PM org.apache.solr.common.SolrException log
 SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain
 timed
 out: NativeFSLock@
 /solr8984/index/lucene-98c1cb272eb9e828b1357f68112231e0-write.lock
 at org.apache.lucene.store.Lock.obtain(Lock.java:85)
 at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1545)
 at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1402)
 at org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:190)
 at

 org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:98)
 at

 org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
 at

 org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:220)
 at

 org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
 at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
 at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
 at

 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
 at

 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
 at

 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
 at

 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
 at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
 at
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
 at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
 at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
 at

 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
 at

 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
 at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
 at org.mortbay.jetty.Server.handle(Server.java:285)
 at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
 at

 org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
 at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
 at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
 at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
 at

 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
 at

 org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)


 Should I consider changing the lock timeout settings (currently set to
 defaults)? If so, I'm not sure what to base these values on.

 Thanks in advance,
 mike


 On Wed, Nov 4, 2009 at 8:27 PM, Lance Norskog goks...@gmail.com wrote:

  This will not ever work reliably. You should have 2x total disk space
  for the index. Optimize, for one, requires this.
 
  On Wed, Nov 4, 2009 at 6:37 AM, Jérôme Etévé jerome.et...@gmail.com
  wrote:
   Hi,
  
   It seems this situation is caused by some "No space left on device"
  exceptions:
   SEVERE: java.io.IOException: No space left on device
  at java.io.RandomAccessFile.writeBytes(Native Method)
  at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
  at
 
 org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexOutput.flushBuffer(SimpleFSDirectory.java:192)
  at
 
 org.apache.lucene.store.BufferedIndexOutput.flushBuffer(BufferedIndexOutput.java:96)
  
  
   I'd better try to set my maxMergeDocs and mergeFactor to more
   adequate values for my app (I'm indexing ~15 GB of data on a 20 GB
   device, so I guess there's a problem when Solr tries to merge the index
   bits being built).
  
   At the moment, they are set to <mergeFactor>100</mergeFactor> and
   <maxMergeDocs>2147483647</maxMergeDocs>
  
   Jerome.
  
   --
   Jerome Eteve.
   http://www.eteve.net
   jer...@eteve.net
  
 
 
 
  --
  Lance Norskog
  goks...@gmail.com
 



Re: checkindex

2010-01-08 Thread Ian Kallen

When I needed to use it, I couldn't find docs for it either, but it's
straightforward. Here's what I did:
un-jar the solr war file to find the lucene jar that solr was using and run 
CheckIndex like this
java -cp lucene-core-2.9-dev.jar org.apache.lucene.index.CheckIndex 
/path/to/solr/data/index/
to actually *fix* the index, add the -fix argument
java -cp lucene-core-2.9-dev.jar org.apache.lucene.index.CheckIndex -fix 
/path/to/solr/data/index/

hope that helps,
-Ian


On 1/8/10 2:09 PM, Giovanni Fernandez-Kincade wrote:


I've seen many mentions of the Lucene CheckIndex tool, but where can I 
find it? Is there any documentation on how to use it?


I noticed Luke has it built-in, but I can't get Luke to open my index 
with the "Don't open IndexReader (when opening corrupted index)" option 
checked. Opening even an index I know is valid doesn't work using this 
option:





--
Ian Kallen
blog: http://www.arachna.com/roller/spidaman
tweetz: http://twitter.com/spidaman
vox: 925.385.8426




Re: Improvising solr queries

2010-01-04 Thread Ian Holsman

On 1/5/10 12:46 AM, Shalin Shekhar Mangar wrote:

(sitename:"XYZ" OR sitename:"All Sites") AND (localeid:1237400589415) AND
  ((assettype:Gallery))  AND (rbcategory:"ABC XYZ") AND (startdate:[* TO
  2009-12-07T23:59:00Z] AND enddate:[2009-12-07T00:00:00Z TO
  *])&rows=9&start=63&sort=date
  desc&facet=true&facet.field=assettype&facet.mincount=1

  Similar to this query, we have several much more complex queries supporting
  all major landing pages of our application.

  Just want to confirm whether anyone can identify any major flaws or
  issues in the sample query?


 
I'm not the expert Shalin is, but I seem to remember sorting by date was 
pretty rough on CPU. (this could have been resolved since I last looked 
at it)


the other thing I'd question is the facet. it looks like you're only 
retrieving a single assetType (Gallery),
so you will only get a single facet value back. if that's the case, wouldn't 
the rows returned (which is part of the response)

give you the same answer?


Most of those AND conditions can be separate filter queries. Filter queries
can be cached separately and can therefore be re-used. See
http://wiki.apache.org/solr/FilterQueryGuidance
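
For example, the query quoted above could be split roughly like this (a
sketch only; which clauses become filters is a judgement call):

 q=assettype:Gallery
 fq=sitename:"XYZ" OR sitename:"All Sites"
 fq=localeid:1237400589415
 fq=rbcategory:"ABC XYZ"
 fq=startdate:[* TO 2009-12-07T23:59:00Z] AND enddate:[2009-12-07T00:00:00Z TO *]
 rows=9&start=63&sort=date desc
 facet=true&facet.field=assettype&facet.mincount=1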

   




Re: Adaptive search?

2009-12-21 Thread Ian Holsman

On 12/18/09 2:46 AM, Siddhant Goel wrote:

Let's say we have a search engine (a simple front end - web app kind of a
thing - responsible for querying Solr and then displaying the results in a
human readable form) based on Solr. If a user searches for something, gets
quite a few search results, and then clicks on one such result - is there
any mechanism by which we can notify Solr to boost the score/relevance of
that particular result in future searches? If not, then any pointers on how
to go about doing that would be very helpful.
   


Hi Siddhant.
Solr can't do this out of the box.
you would need to use an external field and a custom scoring function to 
do something like this.
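
A rough sketch of that route (hedged: this assumes Solr's ExternalFileField
and the boost query parser are available in your version; all names and
values here are illustrative):

 <!-- schema.xml -->
 <fieldType name="clickBoost" class="solr.ExternalFileField"
            keyField="id" defVal="0" valType="float"/>
 <field name="click_boost" type="clickBoost" indexed="false" stored="false"/>

 <!-- at query time, something like -->
 q={!boost b=sum(1,click_boost)}original user query

The external_click_boost file under the data directory would then be
regenerated periodically from click logs and picked up when a new searcher
is opened.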


regards
Ian

Thanks,

On Thu, Dec 17, 2009 at 7:50 PM, Paul Libbrechtp...@activemath.org  wrote:

   

What can it mean to adapt to user clicks ? Quite many things in my head.
Do you have maybe a citation that inspires you here?

paul


Le 17-déc.-09 à 13:52, Siddhant Goel a écrit :


  Does Solr provide adaptive searching? Can it adapt to user clicks within
  the search results it provides? Or does that have to be done externally?


 


   




RE: Selection of returned fields - dynamic fields?

2009-12-10 Thread Ian Smith
OK thanks for the reply, fortunately we have now found an approach which
avoids storing the field.  It would be nice to be able to search for
dynamic fields in a way which is consistent with their definition,
although I suppose there probably isn't demand for this.

Regards,

Ian.

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: 09 December 2009 19:36
To: solr-user@lucene.apache.org
Cc: Gary Ratcliffe
Subject: Re: Selection of returned fields - dynamic fields?


: Unfortunately this does not seem to work for dynamic fields -

you can definitely ask for a field that exists because of a
dynamicField by name, but you can't use wildcard style patterns in the
fl param.

: fl=PREFIX* does not return anything, and neither does fl=*POSTFIX.
: What seems to be missing from Solr is a removeField(FIELDNAME) method
in
: SolrJ, or a fl=-FIELDNAME query parameter to remove the fixed field.
: 
: Is such a feature planned, or is there a workaround that I have
missed?

There's been a lot of discussion about it over the years. The crux of
the problem is that it's hard to come up with a good way of dealing with
field names using meta characters that doesn't make it hard for people
to actually use those metacharacters in their field names...

http://wiki.apache.org/solr/FieldAliasesAndGlobsInParams

-Hoss



Selection of returned fields - dynamic fields?

2009-12-09 Thread Ian Smith
Hi Guys,

We need to eliminate one of our stored fields from the Solr response to
reduce traffic as it is very bulky and not used externally.  I have been
experimenting both with fl=FIELDNAME and addField(FIELDNAME) from
SolrJ and have found it is possible to achieve this effect for fixed
fields by starting with an empty list and adding the field names
explicitly in the request.

Unfortunately this does not seem to work for dynamic fields -
fl=PREFIX* does not return anything, and neither does fl=*POSTFIX.
What seems to be missing from Solr is a removeField(FIELDNAME) method in
SolrJ, or a fl=-FIELDNAME query parameter to remove the fixed field.

Is such a feature planned, or is there a workaround that I have missed?

Regards,

Ian.


RE: schema-based Index-time field boosting

2009-12-03 Thread Ian Smith
Aaaargh!  OK, I would like a document with (eg.) a title containing
a term to score higher than one on (eg.) a summary containing the same
term, all other things being equal.  You seem to be arguing against
field boosting in general, and I don't understand why!

May as well let this drop since we don't seem to be talking about the
same thing . . . but thanks anyway,

Ian.
-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: 30 November 2009 23:05
To: solr-user@lucene.apache.org
Subject: RE: schema-based Index-time field boosting 


: I am talking about field boosting rather than document boosting, ie. I
: would like some fields (say eg. title) to be louder than others,
: across ALL documents.  I believe you are at least partially talking
: about document boosting, which clearly applies on a per-document
basis.

index time boosts are all the same -- it doesn't matter if they are
field boosts or document boosts -- a document boost is just a field
boost for every field in the document.

: If it helps, consider a schema version of the following, from
: org.apache.solr.common.SolrInputDocument:
: 
:   /**
:* Adds a field with the given name, value and boost.  If a field
with
: the name already exists, then it is updated to
:* the new value and boost.
:*
:* @param name Name of the field to add
:* @param value Value of the field
:* @param boost Boost value for the field
:*/
:   public void addField(String name, Object value, float boost ) 

...

: Where a constant boost value is applied consistently to a given field.
: That is what I was mistakenly hoping to achieve in the schema.  I
still
: think it would be a good idea BTW.  Regards,

But now we're right back to what I was trying to explain before: index
time boost values like these are only used as a multiplier in the
fieldNorm.  When included as part of the document data, you can specify
a field boost for fieldX of docA that's greater than the boost for fieldX
of docB, and that will make docA score higher than docB when fieldX
contains the same number of matches and is the same length -- but if you
apply a constant boost of B to fieldX for every doc (which is what a
feature to hardcode boosts in schema.xml might give you) then the net
effect would be zero when scoring docA and docB, because the fieldNorms
for fieldX in both docs would include the exact same multiplier.



-Hoss



Re: dismax query syntax to replace standard query

2009-12-03 Thread Ian Sugar
I believe you need to use the fq parameter with dismax (not to be confused
with qf) to add a filter query in addition to the q parameter.

So your text search value goes in q parameter (which searches on the fields
you configure) and the rest of the query goes in the fq.

Would that work?

On Thu, Dec 3, 2009 at 7:28 PM, javaxmlsoapdev vika...@yahoo.com wrote:


 I have configured the dismax handler to search against both title &
 description fields; now I have some other attributes on the page, e.g.
 status, name, etc. On the search page I have three fields for the user to
 input search values:

 1) Free text search field (which searches against both title &
 description)
 2) Status (multi-select dropdown)
 3) Name (single-select dropdown)

 I want to form a query like textField1:value AND status:(Male OR Female) AND
 name:abc. I know the first part (textField1:value searches against both title &
 description, as that's how I have configured dismax in the configuration),
 but I'm not sure how I can AND the other attributes (in my case status & name).

 note: the standard query looks like the following (w/o using the dismax handler)
 title:test&description:test&name:Joe&statusName:(Male OR Female)
 --
 View this message in context:
 http://old.nabble.com/dismax-query-syntax-to-replace-standard-query-tp26631725p26631725.html
 Sent from the Solr - User mailing list archive at Nabble.com.




RE: schema-based Index-time field boosting

2009-11-26 Thread Ian Smith
Hi Chris, thanks for replying! 


OK, now I'm going to take the bait ;)


I am talking about field boosting rather than document boosting, ie. I
would like some fields (say eg. title) to be louder than others,
across ALL documents.  I believe you are at least partially talking
about document boosting, which clearly applies on a per-document basis.

If it helps, consider a schema version of the following, from
org.apache.solr.common.SolrInputDocument:

  /**
   * Adds a field with the given name, value and boost.  If a field with
the name already exists, then it is updated to
   * the new value and boost.
   *
   * @param name Name of the field to add
   * @param value Value of the field
   * @param boost Boost value for the field
   */
  public void addField(String name, Object value, float boost ) 
  {
SolrInputField field = _fields.get( name );
if( field == null || field.value == null ) {
  setField(name, value, boost);
}
else {
  field.addValue( value, boost );
}
  }

Where a constant boost value is applied consistently to a given field.
That is what I was mistakenly hoping to achieve in the schema.  I still
think it would be a good idea BTW.  Regards,

Ian.

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: 23 November 2009 18:34
To: solr-user@lucene.apache.org
Subject: RE: schema-based Index-time field boosting 


: Yeah, like I said, I was mistaken about setting field boost in
: schema.xml - doesn't mean it's a bad idea though.  At any rate, from
: your penultimate sentence I reckon at least one of us is still
confused
: about field boosting, feel free to reply if you think it's me ;)

Yeah ... i think it's you.  like i said...

: field boosting only makes sense if it's only applied to some of the
: documents in the index, if every document has an index time boost on
: fieldX, then that boost is meaningless.

...if there was a way to boost fields at index time that was configured
in the schema.xml, then every doc would get that boost on its instances
of those fields, but the only purpose of index time boosting is to
indicate that one document is more significant than another doc -- if
every doc gets the same boost, it becomes a no-op.

(think about the math -- field boosts become multipliers in the
fieldNorm
-- if every doc gets the same multiplier, then there is no net effect)



-Hoss






Re: Question about lat/long data type in localsolr

2009-11-21 Thread Ian Ibbotson
Heya...

I think you need to use the newer types in your schema.xml, IE

   <field name="lat" type="tdouble" indexed="true" stored="true"/>
   <field name="lng" type="tdouble" indexed="true" stored="true"/>
   <field name="geo_distance" type="tdouble"/>

as doubles are no longer index-compatible (AFAIK)

To use the above, make sure you have the tdouble types declared with

<fieldType name="tdouble" class="solr.TrieDoubleField"
precisionStep="8" omitNorms="true" positionIncrementGap="0"/>

in your types section.

HTH

Ian.


2009/11/21 Bertie Shen bertie.s...@gmail.com:
 Hey everyone,

  I used localsolr and locallucene to do local search. But I could not get
 longitude and latitude successfully indexed. During the DataImport process,
 there is an exception. Do you have any ideas about it?

  I copied solrconfig.xml and schema.xml from your
 http://www.gissearch.com/localsolr. The only change I made is to replace the
 names lat and lng with latitude and longitude respectively, which are the
 field names in my index. Should the str in <str name="latField">lat</str> be
 replaced by double according to the following exception?

  Thanks.

 Solr log about Exception.

 <exception>
    <message>java.lang.ClassCastException: java.lang.Double cannot be cast
 to java.lang.String</message>
    <frame>
      <class>com.pjaol.search.solr.update.LocalUpdaterProcessor</class>
      <method>processAdd</method>
      <line>136</line>
    </frame>
    <frame>
      <class>org.apache.solr.handler.dataimport.SolrWriter</class>
      <method>upload</method>
      <line>75</line>
    </frame>
    <frame>
      <class>org.apache.solr.handler.dataimport.DataImportHandler$1</class>
      <method>upload</method>
      <line>292</line>
    </frame>
    <frame>
      <class>org.apache.solr.handler.dataimport.DocBuilder</class>
      <method>buildDocument</method>
      <line>392</line>
    </frame>
    <frame>
      <class>org.apache.solr.handler.dataimport.DocBuilder</class>
      <method>doFullDump</method>
      <line>242</line>
    </frame>
    <frame>
      <class>org.apache.solr.handler.dataimport.DocBuilder</class>
      <method>execute</method>
      <line>180</line>
    </frame>
    <frame>
      <class>org.apache.solr.handler.dataimport.DataImporter</class>
      <method>doFullImport</method>
      <line>331</line>
    </frame>
    <frame>
      <class>org.apache.solr.handler.dataimport.DataImporter</class>
      <method>runCmd</method>
      <line>389</line>
    </frame>
    <frame>
      <class>org.apache.solr.handler.dataimport.DataImporter$1</class>
      <method>run</method>
      <line>370</line>
    </frame>
  </exception>


 How do I set up local indexing

 Here is what I have done to set up local indexing.
 1) Download localsolr. I downloaded it from
 http://developer.k-int.com/m2snapshots/localsolr/localsolr/1.5/ and put the jar
 file (in my case, localsolr-1.5.jar) in your application's WEB-INF/lib
 directory on the application server.

 2) Download locallucene. I downloaded it from
 http://sourceforge.net/projects/locallucene/ and put the jar file (in my case,
 locallucene.jar in the locallucene_r2.0/dist/ directory) in your application's
 WEB-INF/lib directory on the application server. I also needed to copy
 gt2-referencing-2.3.1.jar, geoapi-nogenerics-2.1-M2.jar, and jsr108-0.01.jar
 from the locallucene_r2.0/lib/ directory to WEB-INF/lib. Do not copy
 lucene-spatial-2.9.1.jar from the Lucene codebase; its namespace has been
 changed from com.pjaol.blah.blah.blah to org.apache.blah.blah.

 3) Update your solrconfig.xml and schema.xml. I copied them from
 http://www.gissearch.com/localsolr.
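
For reference, the solrconfig.xml from that page wires localsolr in through an
update processor chain along these lines. Only the latField str is confirmed
by the snippet quoted above; the factory class name, the lngField entry and
the tier values are assumptions taken from the localsolr documentation, so
treat this as a sketch rather than the exact configuration:

  <updateRequestProcessorChain>
    <processor class="com.pjaol.search.solr.update.LocalUpdateProcessorFactory">
      <!-- must match the lat/lng field names declared in schema.xml -->
      <str name="latField">lat</str>
      <str name="lngField">lng</str>
      <!-- cartesian tier range used for the distance filter; values are assumptions -->
      <int name="startTier">9</int>
      <int name="endTier">16</int>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>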



RE: schema-based Index-time field boosting

2009-11-20 Thread Ian Smith
Hi David, thanks for replying,

The field boost attribute was put there by me back in the 1.3 days, when
I somehow gained the mistaken impression that it was supposed to work!
Of course, despite a lot of searching I haven't been able to find
anything to back up my position ;)

Unfortunately our code (intentionally) has no idea what index it is
writing to so only a schema-based approach is really going to work for
us.

Of course, by now I am convinced that this might be a really good
feature - I might get the chance to look into it in the near future -
can anyone think of reasons why this might not work in practice?

Regards,

Ian.

-Original Message-
From: Smiley, David W. [mailto:dsmi...@mitre.org] 
Sent: 19 November 2009 19:29
To: solr-user@lucene.apache.org
Subject: Re: Index-time field boosting not working?

Hi Ian.  Thanks for buying my book.

The boost attribute goes on the field for the XML message you're
sending to Solr.  In your example you mistakenly placed it in the
schema.
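
For illustration, a minimal sketch of the update message being described, with
made-up id and title values -- the boost sits on the field element of the
add/doc message rather than in schema.xml:

  <add>
    <doc>
      <field name="id">doc-1</field>
      <field name="title" boost="3.0">An example title</field>
    </doc>
  </add>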

FYI I use index time boosting as well as query time boosting.  Although
index time boosting isn't something I can change on a whim, I've found
it to be far easier to control the scoring than say function queries
which would be the query time substitute if the boost is a function of
particular field values.
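
A common query-time way to favour a particular field such as title (distinct
from the function-query case mentioned above) is a dismax qf boost; a sketch
only, with illustrative handler and field names:

  <requestHandler name="/search" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <!-- matches in title count three times as much as matches in text -->
      <str name="qf">title^3.0 text</str>
    </lst>
  </requestHandler>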

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/


On Nov 18, 2009, at 6:40 AM, Ian Smith wrote:

 I have the following field configured in schema.xml:
 
 <field name="title" type="text" indexed="true" stored="true"
 omitNorms="false" boost="3.0" />
 
 Where text is the type which came with the Solr distribution.  I 
 have not been able to get this configuration to alter any document 
 scores, and if I look at the indexes in Luke there is no change in the
 norms (compared to an un-boosted equivalent).
 
 I have confirmed that document boosting works (via SolrJ), but our 
 field boosting needs to be done in the schema.
 
 Am I doing something wrong (BTW I have tried using 3.0f as well, no 
 difference)?
 
 Also, I have seen no debug output during startup which would indicate 
 that field boosting is being configured - should there be any?
 
 I have found no usage examples of this in the Solr 1.4 book, except a 
 vague discouragement - is this a deprecated feature?
 
 TIA,
 
 Ian
 
 
 



Solr Cell text extraction

2009-11-20 Thread Ian Smith
Hi Guys,

I am trying to use Solr Cell to extract body content from documents, and
also to pass along some literal field values.  Trouble is, some of the
literal fields contain spaces, colons etc. which cause a bad request
exception in the server.  However, if I URL encode these fields the
encoding is not stripped away, so it is still present in search
responses.

Is there a way to pass literal values containing non-URL safe characters
to Solr Cell?

Regards,

Ian.





RE: Solr Cell text extraction - non-issue

2009-11-20 Thread Ian Smith
Sorry guys, the bad request seemed to be caused elsewhere, no need to
URL encode now.
Ian.

-Original Message-
From: Ian Smith [mailto:ian.sm...@gossinteractive.com] 
Sent: 20 November 2009 15:26
To: solr-user@lucene.apache.org
Subject: Solr Cell text extraction

Hi Guys,

I am trying to use Solr Cell to extract body content from documents, and
also to pass along some literal field values.  Trouble is, some of the
literal fields contain spaces, colons etc. which cause a bad request
exception in the server.  However, if I URL encode these fields the
encoding is not stripped away, so it is still present in search
responses.

Is there a way to pass literal values containing non-URL safe characters
to Solr Cell?

Regards,

Ian.





RE: Index-time field boosting not working?

2009-11-19 Thread Ian Smith
Hi Otis, thanks for replying,

Well I'm pretty sure it was there (and documented) in the 1.3 era.
Strangely, it is still accepted in the Eclipse HTML editor, even for
attribute completion (if you can, try it).  But if it is truly
deprecated, we will have to reassess part of our system design :(

If you or anyone else here has any historical perspective on this, I'd
be interested to hear about it.

Regards,

Ian,

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: 18 November 2009 22:55
To: solr-user@lucene.apache.org
Subject: Re: Index-time field boosting not working?

Can a boost attribute really be specified for a field in the schema?  I
wasn't aware of that, and I don't see it on
http://wiki.apache.org/solr/SchemaXml .

Maybe you are mixing it up with
http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22field.22 ?

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: Ian Smith ian.sm...@gossinteractive.com
 To: solr-user@lucene.apache.org
 Sent: Wed, November 18, 2009 6:40:11 AM
 Subject: Index-time field boosting not working?
 
 I have the following field configured in schema.xml:
 
 <field name="title" type="text" indexed="true" stored="true"
 omitNorms="false" boost="3.0" />
 
 Where text is the type which came with the Solr distribution.  I 
 have not been able to get this configuration to alter any document 
 scores, and if I look at the indexes in Luke there is no change in the
 norms (compared to an un-boosted equivalent).
 
 I have confirmed that document boosting works (via SolrJ), but our 
 field boosting needs to be done in the schema.
 
 Am I doing something wrong (BTW I have tried using 3.0f as well, no 
 difference)?
 
 Also, I have seen no debug output during startup which would indicate 
 that field boosting is being configured - should there be any?
 
 I have found no usage examples of this in the Solr 1.4 book, except a 
 vague discouragement - is this a deprecated feature?
 
 TIA,
 
 Ian
 



Index-time field boosting not working?

2009-11-18 Thread Ian Smith
I have the following field configured in schema.xml:

<field name="title" type="text" indexed="true" stored="true"
omitNorms="false" boost="3.0" />

Where text is the type which came with the Solr distribution.  I have
not been able to get this configuration to alter any document scores,
and if I look at the indexes in Luke there is no change in the norms
(compared to an un-boosted equivalent).

I have confirmed that document boosting works (via SolrJ), but our field
boosting needs to be done in the schema.

Am I doing something wrong (BTW I have tried using 3.0f as well, no
difference)?

Also, I have seen no debug output during startup which would indicate
that field boosting is being configured - should there be any?

I have found no usage examples of this in the Solr 1.4 book, except a
vague discouragement - is this a deprecated feature?

TIA,

Ian





Re: Problems downloading lucene 2.9.1

2009-11-03 Thread Ian Ibbotson
Heya Ryan...

For me the big problem with adding
http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/maven/ to
my build config is that the artifact names of the interim release are the
same as the final artifacts will be -- thus once they are copied to a local
repo, Maven won't bother to go looking for more recent versions, even if you
blow away that temporary repo. Would it be possible to publish tagged rc-N
releases to a public and more permanent repository, where people can
reference them and upgrade to the final release when it's available?

Just a thought, cheers for all your hard work.

Ian.

2009/11/2 Ryan McKinley ryan...@gmail.com


 On Nov 2, 2009, at 8:29 AM, Grant Ingersoll wrote:


 On Nov 2, 2009, at 12:12 AM, Licinio Fernández Maurelo wrote:

  Hi folks,

 as we are using a snapshot dependency on solr 1.4, today we are getting
 problems when Maven tries to download lucene 2.9.1 (there isn't any 2.9.1
 there).

 Which repository can I use to download it?


 They won't be there until 2.9.1 is officially released.  We are trying to
 speed up the Solr release by piggybacking on the Lucene release, but this
 little bit is the one downside.


 Until then, you can add a repo to:

 http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/maven/





LocalSolr, Maven, build files and release candidates (Just for info) and spatial radius (A question)

2009-11-02 Thread Ian Ibbotson
Hallo All. I've been trying to prepare a project using localsolr for the
impending (I hope) arrival of solr 1.4 and Lucene 2.9.1. Here are some
notes in case anyone else is suffering similarly. Obviously everything here
may change by next week.

The first problem has been the lack of any stable maven-based lucene and solr
artifacts to wire into my poms. Because of that, and as an interim only
measure, I've built the latest branches of the lucene 2.9.1 and solr 1.4
trees and made them into a *temporary* maven repository at
http://developer.k-int.com/m2snapshots/. In there you can find all the jar
artifacts tagged as xxx-ki-rc1 (For solr) and xxx-ki-rc3 (For lucene) and
finally, a localsolr.localsolr build tagged as 1.5.2-rc1. Sorry for the
naming, but I don't want these artifacts to clash with the real ones when
they come along. This is really just for my own use, but I've seen messages
and spoken to people who are really struggling to get their maven deps
right, if this helps anyone, please feel free to use these until the real
apache artifacts appear. I can't take any responsibility for their quality.
All the poms have been altered to look for the correct dependent artifacts
in the same repository, adding the stanza

  <!-- Emergency repository for storing interim builds of lucene and solr
       whilst they sort their act out -->
  <repositories>
    <repository>
      <id>k-int-m2-snapshots</id>
      <name>K-int M2 Snapshots</name>
      <url>http://developer.k-int.com/m2snapshots</url>
      <releases>
        <enabled>true</enabled>
      </releases>
    </repository>
  </repositories>

to your pom will let you use these deps temporarily until we see an official
build. If you're a maven developer and I've gone way around the houses with
this, please tell me of an easier solution :) This repo *will* go away when
the real builds turn up.

The localsolr in this repo also contains the patches I've submitted (a good
while ago) to the localsolr project to make it build with the lucene 2.9.1
rc3, as the downloadable dist is currently built against an older 2.9 release
that had a different API (i.e. it won't work with the new lucene and solr).

All this means that there is a working localsolr build.

Second up, I've also seen emails floating around (and seen the exception
myself) asking about the following when trying to get all these revisions
working together.

java.lang.NumberFormatException: Invalid shift value in prefixCoded string
(is encoded value really a LONG?)

There are some threads out there telling you that the Lucene indexes are not
binary compatible between versions, but if you're using localsolr, what you
really need to know is:

1) Make sure that your schema.xml contains at least the following fieldType
defs

   <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8"
   omitNorms="true" positionIncrementGap="0"/>

2) Convert your old solr sdouble fields to tdoubles:

  <field name="lat" type="tdouble" indexed="true" stored="true"/>
  <field name="lng" type="tdouble" indexed="true" stored="true"/>
  <dynamicField name="_local*" type="tdouble" indexed="true" stored="true"/>

Pretty sure you would need to rebuild your indexes.

Ok, with those changes I managed to get a working spatial search.

My only problem now is that the radius param on the command line seems to
need to be much bigger than expected in order to find anything.
Specifically, if I search with a radius of 220 I get a record back which
reports its geo_distance as 83.76888211666025. Shuffling the radius around
shows that a radius of 205 returns that doc, while at 204 it's filtered out.
I'm going to dig into this now, but if anyone knows about this I'd really
appreciate any help.

Cheers all, hope this is of use to someone out there, if anyone has
corrections/comments I'd really appreciate any info.

Best,
Ian.


Re: Solr via ruby

2009-09-23 Thread Ian Connor
Hi,

Thanks for the discussion. We use the distributed option so I am not sure
embedded is possible.

As you also guessed, we use haproxy for load balancing and failover between
replicas of the shards, so giving this up for a minor performance boost is
probably not wise.

So essentially we have: User -> HTTP Load Balancer -> Mongrel Cluster ->
Haproxy -> N x Solr Shards
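
(For anyone following along: the distributed option referred to here is driven
by Solr's shards request parameter, which can be baked into a handler's
defaults and pointed at the haproxy endpoints -- a sketch only, with made-up
host names:)

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <!-- one load-balanced endpoint per shard; hostnames are illustrative -->
      <str name="shards">haproxy1:8983/solr,haproxy2:8983/solr</str>
    </lst>
  </requestHandler>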

and it looks like that is the standard setup for performance from what you
suggest here, and most of the performance tweaks I thought of are already in
use.

Ian.

On Fri, Sep 18, 2009 at 3:09 AM, Erik Hatcher erik.hatc...@gmail.com wrote:


 On Sep 18, 2009, at 1:09 AM, rajan chandi wrote:

 We are planning to use the external Solr on tomcat for scalability
 reasons.

 We thought that EmbeddedSolrServer uses HTTP too to talk with Ruby and
 vice versa, as in the acts_as_solr ruby plugin.


 EmbeddedSolrServer is a way to run Solr as an API (like Lucene) rather than
 with any web container involved at all.  In other words, only Java can use
 EmbeddedSolrServer (which means JRuby works great).

 The acts_as_solr plugin uses the solr-ruby library to communicate with
 Solr.  Under solr-ruby, it's HTTP with ruby (wt=ruby) formatted responses
 for searches, and documents being indexed get converted to Solr's XML format
  and POSTed to the Solr URL used to open the Solr::Connection.

Erik




 If Ruby is not using HTTP to talk to EmbeddedSolrServer, what is it
 using?

 Thanks and Regards
 Rajan Chandi

 On Thu, Sep 17, 2009 at 9:44 PM, Erik Hatcher erik.hatc...@gmail.com
 wrote:


 On Sep 17, 2009, at 11:40 AM, Ian Connor wrote:

  Is there any support for connection pooling or a more optimized data
 exchange format?


 The solr-ruby library (as do other Solr + Ruby libraries) use the ruby
 response format and eval it.  solr-ruby supports keeping the HTTP
 connection
 alive too.

 We are looking at any further ways to optimize the solr
 queries so we can possibly make more of them in the one request.

 The JSON-like format seems pretty tight, but I understand that when the
 distributed search takes place it uses a binary protocol instead of text.
 I wanted to know if that was available or could be available via the ruby
 library.

 Is it possible to host a local shard and skip HTTP between ruby and
 solr?


 If you use JRuby you can do some fancy stuff, like use the javabin update
 and response formats so no XML is involved, and you could also use Solr's
 EmbeddedSolrServer to avoid HTTP.  However, in practice HTTP is rarely the
 bottleneck, and it actually offers a lot of advantages, such as easy
 commodity load balancing and caching.

 But JRuby + Solr is a very beautiful way to go!

 If you're using MRI Ruby, though, you don't really have any options other
 than to go over HTTP. You could use json or ruby formatted responses -
 I'd be curious to see some performance numbers comparing those two.

  Erik






-- 
Regards,

Ian Connor
1 Leighton St #723
Cambridge, MA 02141
Call Center Phone: +1 (714) 239 3875 (24 hrs)
Fax: +1(770) 818 5697
Skype: ian.connor

