Re: Fix sort order within an index ?

2013-10-08 Thread Upayavira


On Mon, Oct 7, 2013, at 11:09 PM, user 01 wrote:
 Any way to store documents in a fixed sort order within the indexes of
 certain fields(either the arrival order or sorted by int ids, that also
 serve as my unique key), so that I could store them optimized for
 browsing
 lists of items ?
 
 The order for browsing is always fixed & there are no further filter
 queries. Just I need to fetch the top 20 (most recently added) document
 with field value topic=x1
 
 I came across this article & a JIRA issue which encouraged me that
 something like this may be possible:
 
 http://shaierera.blogspot.com/2013/04/index-sorting-with-lucene.html
 
 https://issues.apache.org/jira/browse/LUCENE-4752

That ticket is an optimisation. If your IDs are sequential, you can sort
on them. Or you can add a timestamp field with a default of NOW, and
sort on that.
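
As a rough sketch of the second option (this is the field definition from the
stock example schema, which has a TrieDateField type named "date"):

  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW"/>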

q=topic:x1&rows=20&sort=id desc
Or
q=topic:x1&rows=20&sort=timestamp desc

Will get you what you ask for.

The above ticket might just make it a little faster.

Upayavira


Re: How to round solr score ?

2013-10-08 Thread Mamta Thakur
Thanks for your replies.
I am actually using the frange approach for now. The only downside I see there 
is that it makes the function call twice, calling createWeight() twice, and so my 
social connections are evaluated twice, which is quite a heavy operation. So I was 
wondering if I could get away with just one call.
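
For what it's worth, the double evaluation shows up in a minimal sketch of that
kind of frange filtering over the main query's score (the actual
social-connection function from this thread isn't shown, so this is only
illustrative):

  q=text:shoes&fq={!frange l=5}query($q)

Here query($q) re-parses and re-scores the main query inside the filter, so its
createWeight() runs once for q and once for the frange function.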





SolrCloud shard splitting keeps failing

2013-10-08 Thread Kalle Aaltonen
I have a test system where I have an index of 15M documents in one shard
that I would like to split in two. I've tried it four times now. I have a
stand-alone ZooKeeper running on the same machine.

The end result is that I have two new shards with state "construction", and
each has one replica which is down.

Two of the attempts failed because of heapspace. Now the heap size is 24GB.
I can't figure out from the logs what is going on.

I've attached a log of the latest attempt. Any help would be much
appreciated.

- Kalle Aaltonen


splitfail3.txt.gz
Description: GNU Zip compressed data


DIH with SolrCloud

2013-10-08 Thread Prasi S
Hi,
I have set up SolrCloud with Solr 4.4. The cloud has 2 Tomcat instances with
a separate ZooKeeper.

I execute the below command in the URL:

http://localhost:8180/solr/colindexer/dataimportmssql?command=full-import&commit=true&clean=false


<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">data-config-mssql.xml</str>
    </lst>
  </lst>
  <str name="command">status</str>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">1</str>
    <str name="Total Rows Fetched">0</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2013-10-08 10:55:27</str>
    <str name="Total Documents Processed">0</str>
    <str name="Time taken">0:0:1.585</str>
  </lst>
  <str name="WARNING">
    This response format is experimental. It is likely to change in the future.
  </str>
</response>

I don't get the "Indexing completed. Added ... documents" status message at
all. Also, when I check the dataimport page in the Solr admin, I get the below
status, and no documents are indexed.


[image: Inline image 1]

Not sure of the problem.


Re: SolrCloud shard splitting keeps failing

2013-10-08 Thread Harald Kirsch

Hello Kalle,

we noticed the same problem some weeks ago:

http://lucene.472066.n3.nabble.com/Share-splitting-at-23-million-documents-gt-OOM-td4085064.html

Would be interesting to hear if there is more positive feedback this time.

We finally concluded that it may be worth starting with many shards 
right away. And as they grow, they can be distributed to other machines. 
This works, as we have tested (though not yet in production).


Regards,
Harald.

On 08.10.2013 08:43, Kalle Aaltonen wrote:


I have a test system where I have a index of 15M documents in one shard
that I would like to split in two. I've tried it four times now. I have
a stand-alone zookeeper running on the same machine.

The end result is that I have two new shards with state construction,
and each has one replica which is down.

Two of the attempts failed because of heapspace. Now the heap size is
24GB. I can't figure out from the logs what is going on.

I've attached a log of the latest attempt. Any help would be much
appreciated.

- Kalle Aaltonen





Regex to match one of two words

2013-10-08 Thread Dinusha Dilrukshi
I have an input that can have only 2 values Published or Deprecated. What
regular expression can I use to ensure that either of the two words was
submitted?

I tried with different regular expressions (as in [1], [2]) that
contain the most generic syntax, but Solr throws a parser exception when
validating these expressions. Could someone help me write a
regular expression that will be evaluated by the Solr parser?

[1] /^(PUBLISHED)?(DEPRECATED)?$/
[2] /(PUBLISHED)?(DEPRECATED)?/


SolrCore org.apache.solr.common.SolrException:
org.apache.lucene.queryParser.ParseException: Cannot parse
'overview_status_s:/(PUBLISHED)?(DEPRECATED)?/': '*' or '?' not allowed as
first character in WildcardQuery
at
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:108)


Regards,
Dinusha.


Re: SolrCloud shard splitting keeps failing

2013-10-08 Thread Shalin Shekhar Mangar
Hi Kalle,

The problem here is that certain actions are taking too long causing the
split process to terminate in between. For example, a commit on the parent
shard leader took 83 seconds in your case but the read timeout value is set
to 60 seconds only. We actually do not need to open a searcher during this
commit. I'll open an issue and attach a fix.

Longer term we need to introduce asynchronous commands so that status can
be reported in a better way.


On Tue, Oct 8, 2013 at 12:13 PM, Kalle Aaltonen
kalle.aalto...@zemanta.comwrote:


 I have a test system where I have a index of 15M documents in one shard
 that I would like to split in two. I've tried it four times now. I have a
 stand-alone zookeeper running on the same machine.

 The end result is that I have two new shards with state construction,
 and each has one replica which is down.

 Two of the attempts failed because of heapspace. Now the heap size is
 24GB. I can't figure out from the logs what is going on.

 I've attached a log of the latest attempt. Any help would be much
 appreciated.

 - Kalle Aaltonen






-- 
Regards,
Shalin Shekhar Mangar.


Re: DIH with SolrCloud

2013-10-08 Thread Raymond Wiker
It looks like your select statement does not return any rows... have you
verified it with some sort of SQL client?


On Tue, Oct 8, 2013 at 8:57 AM, Prasi S prasi1...@gmail.com wrote:

 Hi ,
 I have setup solrcloud with solr4.4. The cloud has 2 tomcat instances with
 separate zookeeper.

  i execute the below command in the url,


 http://localhost:8180/solr/colindexer/dataimportmssql?command=full-importcommit=trueclean=false


 response
 lst name=responseHeader
 int name=status0/int
 int name=QTime0/int
 /lst
 lst name=initArgs
 lst name=defaults
 str name=configdata-config-mssql.xml/str
 /lst
 /lst
 str name=commandstatus/str
 str name=statusidle/str
 str name=importResponse/
 lst name=statusMessages
 str name=Total Requests made to DataSource1/str
 str name=Total Rows Fetched0/str
 str name=Total Documents Skipped0/str
 str name=Full Dump Started2013-10-08 10:55:27/str
 str name=Total Documents Processed0/str
 str name=Time taken0:0:1.585/str
 /lst
 str name=WARNING
 This response format is experimental. It is likely to change in the future.
 /str
 /response

 I dont get Indexing completed. added  documents ...  status message at
 all. Also, when i check the dataimport in Solr admin page,get the below
 status. and no documents are indexed.


 [image: Inline image 1]

 Not sure of the problem.



SolrCloud+Tomcat 3 win VMs, 3 shards * 2 replica

2013-10-08 Thread magnum87
Hello,
I'm trying to deploy, using SolrCloud, a cluster of 3 Windows VMs, each
with an instance of Solr running on a Tomcat container AND with an external
ZooKeeper (3.4.5) (so 3 ZK + 3 Solr). I'm using Solr 4.2; the original conf
is multi-core (6 different cores).

I tried to set up a configuration of 3 shards each with 2 replica (1
original + 1), so that:
* VM1 -- shards 1,2
* VM2 -- shards 2,3
* VM3 -- shards 1,3

After days of googling, reading documentation (in particular
https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble
http://wiki.apache.org/solr/SolrCloud#Example_C:_Two_shard_cluster_with_shard_replicas_and_zookeeper_ensemble
and http://wiki.apache.org/solr/SolrCloudTomcat ) and browsing
forums, I still can't find the solution.
Apparently the only way to force 2 shards on the same machine is to use the
Collections API (otherwise I could only deploy 3 shards * 1 replica, using
numShards, or 1 shard * 3 replicas).
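
For reference, the Collections API call for that layout would look roughly like
this (host, port and names are illustrative):

  http://vm1:8080/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&replicationFactor=2&maxShardsPerNode=2&collection.configName=myconf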

After several attempts (almost all combinations of adding/removing
bootstrap_conf=true, solr.xml persistent true/false, removing/leaving 'core'
tags in solr.xml, using DELETE/RELOAD/CREATE on collections) I managed to
deploy this configuration using bootstrap_conf=true and DELETEing and CREATEing
each collection, but when I stop the Solr service and then start it again, it
does not work (with or without bootstrap_conf etc.).
I think this is quite a standard use case; is there a simple solution avoiding
very ugly workarounds like deploying 2 Tomcats or more than 1 Solr per
Tomcat?

Thank you very much



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-Tomcat-3-win-VMs-3-shards-2-replica-tp4094051.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DIH with SolrCloud

2013-10-08 Thread Prasi S
My select statement returns documents. I have checked the query in the SQL
server.

The problem is that the same configuration was working when given to the
default /dataimport handler. If I give it to the /dataimportmssql handler, I
get this type of behaviour.
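
For reference, a second DIH handler is normally registered in solrconfig.xml
along these lines (the config file name is taken from the response above, the
rest is a sketch):

  <requestHandler name="/dataimportmssql"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config-mssql.xml</str>
    </lst>
  </requestHandler>

It may be worth double-checking that this handler really points at the config
file you think it does.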


On Tue, Oct 8, 2013 at 1:28 PM, Raymond Wiker rwi...@gmail.com wrote:

 It looks like your select statement does not return any rows... have you
 verified it with some sort of SQL client?


 On Tue, Oct 8, 2013 at 8:57 AM, Prasi S prasi1...@gmail.com wrote:

  Hi ,
  I have setup solrcloud with solr4.4. The cloud has 2 tomcat instances
 with
  separate zookeeper.
 
   i execute the below command in the url,
 
 
 
 http://localhost:8180/solr/colindexer/dataimportmssql?command=full-importcommit=trueclean=false
 
 
  response
  lst name=responseHeader
  int name=status0/int
  int name=QTime0/int
  /lst
  lst name=initArgs
  lst name=defaults
  str name=configdata-config-mssql.xml/str
  /lst
  /lst
  str name=commandstatus/str
  str name=statusidle/str
  str name=importResponse/
  lst name=statusMessages
  str name=Total Requests made to DataSource1/str
  str name=Total Rows Fetched0/str
  str name=Total Documents Skipped0/str
  str name=Full Dump Started2013-10-08 10:55:27/str
  str name=Total Documents Processed0/str
  str name=Time taken0:0:1.585/str
  /lst
  str name=WARNING
  This response format is experimental. It is likely to change in the
 future.
  /str
  /response
 
  I dont get Indexing completed. added  documents ...  status message at
  all. Also, when i check the dataimport in Solr admin page,get the below
  status. and no documents are indexed.
 
 
  [image: Inline image 1]
 
  Not sure of the problem.
 



What is the full list of Solr Special Characters?

2013-10-08 Thread Furkan KAMACI
I found that:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \

at that URL:
http://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Escaping+Special+Characters

I'm using Solr 4.5. Is there any full list of special characters to escape
inside my custom search API before making a request to SolrCloud?


Re: What is the full list of Solr Special Characters?

2013-10-08 Thread Furkan KAMACI
Actually I want to remove special characters and not send them into my
Solr indexes. I mean a user can send a special query, much like a SQL injection,
and I want to protect my system from such scenarios.
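
If the custom search API is built on SolrJ, one option (a sketch, assuming
SolrJ is on the classpath and "title" is an illustrative field name) is to
escape rather than strip the characters:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.util.ClientUtils;

  // Backslash-escapes every Lucene/Solr query-syntax special character
  String safe = ClientUtils.escapeQueryChars(userInput);
  SolrQuery query = new SolrQuery("title:" + safe);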


2013/10/8 Furkan KAMACI furkankam...@gmail.com

 I found that:

  + - && || ! ( ) { } [ ] ^ " ~ * ? : \

 at that URL:
 http://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Escaping+Special+Characters

 I'm using Solr 4.5 Is there any full list of special characters to escape
 inside my custom search API before making a request to SolrCloud?



Re: documents are not commited distributively in solr cloud tomcat with core discovery, range is null for shards in clusterstate.json

2013-10-08 Thread Liu Bo
I've solved this problem myself.

If you use core discovery, you must specify the numShards parameter in
core.properties, or else Solr won't allocate a hash range to each shard and
documents won't be distributed properly.
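
For example, a core.properties along these lines (values from the setup
described below, with numShards added):

  name=content
  collection=content_collection
  shard=shard1
  numShards=3
  loadOnStartup=true
  transient=false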

Using core discovery to set up SolrCloud in Tomcat is much easier and
cleaner than the CoreAdmin approach described in the wiki:
http://wiki.apache.org/solr/SolrCloudTomcat.

It cost me some time to move from Jetty to Tomcat, but I think our IT team
will like it this way. :)




On 6 October 2013 23:53, Liu Bo diabl...@gmail.com wrote:

 Hi all

 I've sent out this mail before, but I only subscribed to lucene-user but
 not solr-user at that time. Sorry for repeating if any and your help will
 be much of my appreciation.

 I'm trying out the tutorial about SolrCloud, and I managed to write my
 own plugin to import data from our set of databases. I use SolrWriter from
 the DataImporter package and the docs can be distributed and committed to shards.

 Everything works fine using Jetty from the Solr example, but when I move
 to Tomcat, SolrCloud seems not to be configured right, as the documents are
 just committed to the shard where the update request goes.

 The cause is probably that the range is null for the shards in clusterstate.json.
 The router is implicit instead of compositeId as well.

 Is there anything missed or configured wrong in the following steps? How
 could I fix it. Your help will be much of my appreciation.

 PS, the SolrCloudTomcat wiki page isn't up to date for 4.4 with core discovery; I'm
 trying this out after reading the SolrCloud, SolrCloudJboss, and CoreAdmin wiki
 pages.

 Here's what I've done and some useful logs:

 1. start three zookeeper server.
 2. upload configuration files to zookeeper, the collection name is
 content_collection
 3. start three tomcat instants on three server with core discovery

 a) core file:
  name=content
  loadOnStartup=true
  transient=false
  shard=shard1   (different on each server)
  collection=content_collection
 b) solr.xml

 <solr>
   <solrcloud>
     <str name="host">${host:}</str>
     <str name="hostContext">${hostContext:solr}</str>
     <int name="hostPort">8080</int>
     <int name="zkClientTimeout">${zkClientTimeout:15000}</int>
     <str name="zkHost">10.199.46.176:2181,10.199.46.165:2181,10.199.46.158:2181</str>
     <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
   </solrcloud>

   <shardHandlerFactory name="shardHandlerFactory"
                        class="HttpShardHandlerFactory">
     <int name="socketTimeout">${socketTimeout:0}</int>
     <int name="connTimeout">${connTimeout:0}</int>
   </shardHandlerFactory>
 </solr>

 4. In the solr.log I see the three shards are recognized, and the
 SolrCloud admin can see that content_collection has three shards as well.
 5. I write documents to content_collection using my update request; the
 documents are only committed to the shard the request goes to. In the log I can
 see the DistributedUpdateProcessorFactory is in the processor chain and a
 distributed commit is triggered:

 INFO  - 2013-09-30 16:31:43.205;
 com.microstrategy.alert.search.solr.plugin.index.handler.IndexRequestHandler;
 updata request processor factories:

 INFO  - 2013-09-30 16:31:43.206;
 com.microstrategy.alert.search.solr.plugin.index.handler.IndexRequestHandler;
 org.apache.solr.update.processor.LogUpdateProcessorFactory@4ae7b77

 INFO  - 2013-09-30 16:31:43.207;
 com.microstrategy.alert.search.solr.plugin.index.handler.IndexRequestHandler;
 org.apache.solr.update.processor.*DistributedUpdateProcessorFactory*
 @5b2bc407

 INFO  - 2013-09-30 16:31:43.207;
 com.microstrategy.alert.search.solr.plugin.index.handler.IndexRequestHandler;
 org.apache.solr.update.processor.RunUpdateProcessorFactory@1652d654

 INFO  - 2013-09-30 16:31:43.283; org.apache.solr.core.SolrDeletionPolicy;
 SolrDeletionPolicy.onInit: commits: num=1


 commit{dir=/home/bold/work/tomcat/solr/content/data/index,segFN=segments_1,generation=1}

 INFO  - 2013-09-30 16:31:43.284; org.apache.solr.core.SolrDeletionPolicy;
 newest commit generation = 1

 INFO  - 2013-09-30 16:31:43.440; *org.apache.solr.update.SolrCmdDistributor;
 Distrib commit to*:[StdNode: http://10.199.46.176:8080/solr/content/,
 StdNode: http://10.199.46.165:8080/solr/content/]
 params:commit_end_point=true&commit=true&softCommit=false&waitSearcher=true&expungeDeletes=false

 but the documents won't go to the other shards; the other shards only get a
 commit request with no documents:

 INFO  - 2013-09-30 16:31:43.841;
 org.apache.solr.update.DirectUpdateHandler2; start
 commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}

 INFO  - 2013-09-30 16:31:43.855; org.apache.solr.core.SolrDeletionPolicy;
 SolrDeletionPolicy.onInit: commits: num=1


 commit{dir=/home/bold/work/tomcat/solr/content/data/index,segFN=segments_1,generation=1}

 INFO  - 2013-09-30 16:31:43.855; org.apache.solr.core.SolrDeletionPolicy;
 newest commit 

Re: Improving indexing performance

2013-10-08 Thread Matteo Grolla
Thanks Erick,
I think I have been able to exhaust a resource:
if I split the data in 2 and upload it with 2 clients like benchmark
1.1, it takes 120s; here the bottleneck is my LAN.
If I use a setting like benchmark 1, the bottleneck is probably the
ramBuffer.

I'm going to buy a Gigabit ethernet cable so I can make a better test.

OutOfMemory error: it's the solrj client that crashes.
I'm using Solr 4.2.1 and the corresponding solrj client.
HttpSolrServer works fine;
ConcurrentUpdateSolrServer gives me problems, and I didn't
understand how to size the queueSize parameter optimally.
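
For reference, this is roughly how the two clients are constructed (a sketch
using the SolrJ 4.2 API; the URL is illustrative, while the queue size and
thread count match the benchmark settings in this thread):

  import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;

  // plain HTTP client: each add() call is a separate request to Solr
  HttpSolrServer http = new HttpSolrServer("http://solrhost:8983/solr");

  // concurrent client: buffers documents in an internal queue (queueSize=20000)
  // and streams them to Solr from 4 background threads
  ConcurrentUpdateSolrServer concurrent =
      new ConcurrentUpdateSolrServer("http://solrhost:8983/solr", 20000, 4);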


On Oct 7, 2013, at 14:03, Erick Erickson wrote:

 Just skimmed, but the usual reason you can't max out the server
 is that the client can't go fast enough. Very quick experiment:
 comment out the server.add line in your client and run it again,
 does that speed up the client substantially? If not, then the time
 is being spent on the client.
 
 Or split your csv file into, say, 5 parts and run it from 5 different
 PCs in parallel.
 
 bq:  I can't rely on auto commit, otherwise I get an OutOfMemory error
 This shouldn't be happening, I'd get to the bottom of this. Perhaps simply
 allocating more memory to the JVM running Solr.
 
 bq: committing every 100k docs gives worse performance
 It'll be best to specify openSearcher=false for max indexing throughput
 BTW. You should be able to do this quite frequently, 15 seconds seems
 quite reasonable.
 
 Best,
 Erick
 
 On Sun, Oct 6, 2013 at 12:19 PM, Matteo Grolla matteo.gro...@gmail.com 
 wrote:
 I'd like to have some suggestion on how to improve the indexing performance 
 on the following scenario
 I'm uploading 1M docs to solr,
 
 every docs has
id: sequential number
title:  small string
date: date
body: 1kb of text
 
 Here are my benchmarks (they are all single executions, not averages from 
 multiple executions):
 
 1)  using the updaterequesthandler
and streaming docs from a csv file on the same disk of solr
auto commit every 15s with openSearcher=false and commit after last 
 document
 
total time: 143035ms
 
 1.1)using the updaterequesthandler
and streaming docs from a csv file on the same disk of solr
auto commit every 15s with openSearcher=false and commit after last 
 document
 <ramBufferSizeMB>500</ramBufferSizeMB>
 <maxBufferedDocs>10</maxBufferedDocs>
 
total time: 134493ms
 
 1.2)using the updaterequesthandler
and streaming docs from a csv file on the same disk of solr
auto commit every 15s with openSearcher=false and commit after last 
 document
 <mergeFactor>30</mergeFactor>
 
total time: 143134ms
 
 2)  using a solrj client from another pc in the lan (100Mbps)
with httpsolrserver
with javabin format
add documents to the server in batches of 1k docs   ( server.add( 
 collection ) )
auto commit every 15s with openSearcher=false and commit after last 
 document
 
total time: 139022ms
 
 3)  using a solrj client from another pc in the lan (100Mbps)
with concurrentupdatesolrserver
 with javabin format
add documents to the server in batches of 1k docs   ( server.add( 
 collection ) )
server queue size=20k
server threads=4
no auto-commit and commit every 100k docs
 
total time: 167301ms
 
 
 --On the solr server--
 cpu averages 25%
at best 100% for 1 core
 IO  is still far from being saturated
iostat gives a pattern like this (every 5 s)
 
time(s) %util
100 45,20
105 1,68
110 17,44
115 76,32
120 2,64
125 68
130 1,28
 
 I thought that using concurrentupdatesolrserver I was able to max cpu or IO 
 but I wasn't.
 With concurrentupdatesolrserver I can't rely on auto commit, otherwise I get 
 an OutOfMemory error
 and I found that committing every 100k docs gives worse performance than 
 auto commit every 15s (benchmark 3 with httpsolrserver took 193515)
 
 I'd really like to understand why I can't max out the resources on the 
 server hosting solr (disk above all)
 And I'd really like to understand what I'm doing wrong with 
 concurrentupdatesolrserver
 
 thanks
 



Re: Regex to match one of two words

2013-10-08 Thread Jack Krupansky

Why use regular expressions at all?

Try:

published OR deprecated

-- Jack Krupansky

-Original Message- 
From: Dinusha Dilrukshi

Sent: Tuesday, October 08, 2013 3:32 AM
To: solr-user@lucene.apache.org
Subject: Regex to match one of two words

I have an input that can have only 2 values Published or Deprecated. What
regular expression can I use to ensure that either of the two words was
submitted?

I tried with different regular expressions (as in the [1], [2]) that
contains most generic syntax.. But Solar throws parser exception when
validating these expressions.. Could someone help me on writing this
regular expression that will evaluate by the Solar parser.

[1] /^(PUBLISHED)?(DEPRECATED)?$/
[2] /(PUBLISHED)?(DEPRECATED)?/


SolrCore org.apache.solr.common.SolrException:
org.apache.lucene.queryParser.ParseException: Cannot parse
'overview_status_s:/(PUBLISHED)?(DEPRECATED)?/': '*' or '?' not allowed as
first character in WildcardQuery
at
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:108)


Regards,
Dinusha. 



Re: {soft}Commit and cache flusing

2013-10-08 Thread Dmitry Kan
Tim,
I suggest you open a new thread and not reply to this one to get noticed.
Dmitry


On Mon, Oct 7, 2013 at 9:44 PM, Tim Vaillancourt t...@elementspace.comwrote:

 Is there a way to make autoCommit only commit if there are pending changes,
 ie: if there are 0 adds pending commit, don't autoCommit (open-a-searcher
 and wipe the caches)?

 Cheers,

 Tim


 On 2 October 2013 00:52, Dmitry Kan solrexp...@gmail.com wrote:

  right. We've got the autoHard commit configured only atm. The
 soft-commits
  are controlled on the client. It was just easier to implement the first
  version of our internal commit policy that will commit to all solr
  instances at once. This is where we have noticed the reported behavior.
 
 
  On Wed, Oct 2, 2013 at 9:32 AM, Bram Van Dam bram.van...@intix.eu
 wrote:
 
   if there are no modifications to an index and a softCommit or
 hardCommit
   issued, then solr flushes the cache.
  
  
   Indeed. The easiest way to work around this is by disabling auto
 commits
   and only commit when you have to.
  
 



Re: Fix sort order within an index ?

2013-10-08 Thread user 01
@Upayavira:

q=topic:x1&rows=20&sort=id desc
Or
q=topic:x1&rows=20&sort=timestamp desc

Will get you what you ask for.

Yeah, I know that I could use sort & that will work, but I asked just for an
optimized way. Also, that ticket has been fixed, so shouldn't I now be able to
make use of the fixed sort order?


On Tue, Oct 8, 2013 at 11:59 AM, Upayavira u...@odoko.co.uk wrote:



 On Mon, Oct 7, 2013, at 11:09 PM, user 01 wrote:
  Any way to store documents in a fixed sort order within the indexes of
  certain fields(either the arrival order or sorted by int ids, that also
  serve as my unique key), so that I could store them optimized for
  browsing
  lists of items ?
 
  The order for browsing is always fixed & there are no further filter
  queries. Just I need to fetch the top 20 (most recently added) document
  with field value topic=x1
 
  I came across this article & a JIRA issue which encouraged me that
  something like this may be possible:
 
  http://shaierera.blogspot.com/2013/04/index-sorting-with-lucene.html
 
  https://issues.apache.org/jira/browse/LUCENE-4752

 That ticket is an optimisation. If your IDs are sequential, you can sort
 on them. Or you can add a timestamp field with a default of NOW, and
 sort on that.

  q=topic:x1&rows=20&sort=id desc
  Or
  q=topic:x1&rows=20&sort=timestamp desc

 Will get you what you ask for.

 The above ticket might just make it a little faster.

 Upayavira



Applying an AND search considering several document snippets as a single document

2013-10-08 Thread Rodrigo Rosenfeld Rosas

Hi there, this is my first message to this list :)

In our application we have a document split into several pages. When the 
user searches for words in a document we want to bring back all documents 
containing all the words, but we'd like to add a link to the specific 
page for each highlight.


Currently, I can think of a solution like indexing both the full 
documents and the pages and doing this in two steps (conceptually, as I 
haven't actually implemented this):


- perform an AND search across the full documents only and retrieve 
the document ids
- perform an OR search across the pages index only for those pages 
belonging to the previously returned document ids so that I could build 
the link to the specific returned pages.
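
Sketching those two steps as queries (field names such as type, doc_id and
body are hypothetical):

  1) q=body:(word1 AND word2)&fq=type:document&fl=id
  2) q=body:(word1 OR word2)&fq=type:page&fq=doc_id:(17 OR 42)&hl=true&hl.fl=body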


But since the AND search is already a bit slow here, I'd like to avoid 
two Solr queries if possible, as I already need another RDBMS query as 
well and all of that adds up.


Is there any way I could tell Solr to consider all indexed documents 
with a specified attribute as a single document for the purpose of AND 
matching?


Thanks in advance,
Rodrigo.



Hardware dimension for new SolrCloud cluster

2013-10-08 Thread Henrik Ossipoff Hansen
We're in the process of moving onto SolrCloud, and have gotten to the point 
where we are considering how to do our hardware setup.

We're limited to VMs running on our server cluster and storage system, so 
buying new physical servers is out of the question - the question is how we 
should dimension the new VMs.

Our document area is somewhat small, with about 1.2 million orders (rising of 
course), 75k products (divided into 5 countries - each of which will be its own 
collection/core) and some million customers.

In our current master/slave setup, we only index the products, with each 
country taking up about 35 MB of disk space. The index frequency is more or less 
updating the indexes 8 times per hour (mostly this is not all data though, but 
atomic updates with new stock data, new prices etc.).

Our upcoming order and customer indexes however will more or less receive 
updates on the fly as it happens (softcommit) and we expect the same to be 
the case for products in the near future.

- For hardware, it's down to 1 or 2 cores - current master runs with 2 cores
- RAM - currently our master runs with 6 GB only
- How much heap space should we allocate for max heap?

We currently plan on this setup:
- 1 machine for a simple loadbalancer
- 4 VMs totally for the Solr machines themselves (for both leaders and 
replicas, just one replica per shard is enough for our use case)
- A quorum of 3 ZKs

Question is - is this machine setup enough? And how exactly do we dimension the 
Solr machines?

Any help, pointers or resources will be much appreciated :)

Thank you!

Re: Hardware dimension for new SolrCloud cluster

2013-10-08 Thread primoz . skale
I think Mr. Erickson summarized the issue of hardware sizing quite well in 
the following article:

http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best regards,

Primož




From:   Henrik Ossipoff Hansen h...@entertainment-trading.com
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Date:   08.10.2013 14:59
Subject:Hardware dimension for new SolrCloud cluster



We're in the process of moving onto SolrCloud, and have gotten to the 
point where we are considering how to do our hardware setup.

We're limited to VMs running on our server cluster and storage system, so 
buying new physical servers is out of the question - the question is how 
we should dimension the new VMs.

Our document area is somewhat small, with about 1.2 million orders (rising 
of course), 75k products (divided into 5 countries - each which will be 
their own collection/core) and some million customers.

In our current master/slave setup, we only index the products, with each 
country taking up about 35 MB of disk space. The index frequency i more or 
less updating the indexes 8 times per hour (mostly this is not all data 
thought, but atomic updates with new stock data, new prices etc.).

Our upcoming order and customer indexes however will more or less receive 
updates on the fly as it happens (softcommit) and we expect the same to 
be the case for products in the near future.

- For hardware, it's down to 1 or 2 cores - current master runs with 2 
cores
- RAM - currently our master runs with 6 GB only
- How much heap space should we allocate for max heap?

We currently plan on this setup:
- 1 machine for a simple loadbalancer
- 4 VMs totally for the Solr machines themselves (for both leaders and 
replicas, just one replica per shard is enough for our use case)
- A qorum of 3 ZKs

Question is - is this machine setup enough? And how exactly do we 
dimension the Solr machines?

Any help, pointers or resources will be much appreciated :)

Thank you!


Re: SolrCloud shard splitting keeps failing

2013-10-08 Thread Shalin Shekhar Mangar
I was wrong in saying that we don't need to open a searcher; we do. I
committed a fix in SOLR-5314 to use soft commits instead of hard commits. I
also increased the read timeout value. Both of these together will reduce
the likelihood of such a thing happening.

https://issues.apache.org/jira/browse/SOLR-5314


On Tue, Oct 8, 2013 at 1:24 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Hi Kalle,

 The problem here is that certain actions are taking too long causing the
 split process to terminate in between. For example, a commit on the parent
 shard leader took 83 seconds in your case but the read timeout value is set
 to 60 seconds only. We actually do not need to open a searcher during this
 commit. I'll open an issue and attach a fix.

 Longer term we need to introduce asynchronous commands so that status can
 be reported in a better way.


 On Tue, Oct 8, 2013 at 12:13 PM, Kalle Aaltonen 
 kalle.aalto...@zemanta.com wrote:


 I have a test system where I have a index of 15M documents in one shard
 that I would like to split in two. I've tried it four times now. I have a
 stand-alone zookeeper running on the same machine.

 The end result is that I have two new shards with state construction,
 and each has one replica which is down.

 Two of the attempts failed because of heapspace. Now the heap size is
 24GB. I can't figure out from the logs what is going on.

 I've attached a log of the latest attempt. Any help would be much
 appreciated.

 - Kalle Aaltonen






 --
 Regards,
 Shalin Shekhar Mangar.




-- 
Regards,
Shalin Shekhar Mangar.


Re: problem with data import handler delta import due to use of multiple datasource

2013-10-08 Thread Bill Au
I am using 4.3.  It is not related to bugs related to last_index_time.  The
problem is caused by the fact that the parent entity and child entity use
different data source (different databases on different hosts).

From the log output, I do see the delta query of the child entity being
executed correctly and finding all the rows that have been modified for the
child entity.  But it fails when it executes the parentDeltaQuery because
it is still using the database connection from the child entity (i.e.
datasource ds2 in my example above).

Is there a way to tell DIH to use a different datasource in the
parentDeltaQuery?

Bill


On Sat, Oct 5, 2013 at 10:28 PM, Alexandre Rafalovitch
arafa...@gmail.comwrote:

 Which version of Solr and what kind of SQL errors? There were some bugs in
 4.x related to last_index_time, but it does not sound related.

 Regards,
Alex.

 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 On Sun, Oct 6, 2013 at 8:51 AM, Bill Au bill.w...@gmail.com wrote:

  Here is my DIH config:
 
  <dataConfig>
    <dataSource name="ds1" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
        url="jdbc:mysql://localhost1/dbname1" user="db_username1"
        password="db_password1"/>
    <dataSource name="ds2" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
        url="jdbc:mysql://localhost2/dbname2" user="db_username2"
        password="db_password2"/>
    <document name="products">
      <entity name="item" dataSource="ds1" query="select * from item">
        <field column="ID" name="id" />
        <field column="NAME" name="name" />

        <entity name="feature" dataSource="ds2" query="select
            description from feature where item_id='${item.ID}'">
          <field name="features" column="description" />
        </entity>
      </entity>
    </document>
  </dataConfig>
 
  I am having trouble with delta import.  I think it is because the main
  entity and the sub-entity use different data source.  I have tried using
  both a delta query:
 
  deltaQuery="select id from item where id in (select item_id as id from
  feature where last_modified > '${dih.last_index_time}') or last_modified
  > '${dih.last_index_time}'"

  and a parentDeltaQuery:

  <entity name="feature" pk="ITEM_ID" query="select DESCRIPTION as features
  from FEATURE where ITEM_ID='${item.ID}'" deltaQuery="select ITEM_ID from
  FEATURE where last_modified > '${dih.last_index_time}'"
  parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}"/>
 
  I ended up with an SQL error for both.  Is there any way to make delta
  import work in my case?
 
  Bill
 



RE: problem with data import handler delta import due to use of multiple datasource

2013-10-08 Thread Dyer, James
Bill,

I do not believe there is any way to tell it to use a different datasource for 
the parent delta query.  

If you used this approach, would it solve your problem:  
http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport ?
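
If it helps, the pattern from that wiki page applied to your item entity would
look roughly like this (a sketch only, adjust to your schema):

  <entity name="item" dataSource="ds1"
          query="select * from item
                 where '${dataimporter.request.clean}' != 'false'
                    or last_modified > '${dih.last_index_time}'">

and the import is then always run with command=full-import&clean=false, so
only the changed rows are selected.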

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: Bill Au [mailto:bill.w...@gmail.com] 
Sent: Tuesday, October 08, 2013 8:50 AM
To: solr-user@lucene.apache.org
Subject: Re: problem with data import handler delta import due to use of 
multiple datasource

I am using 4.3.  It is not related to bugs related to last_index_time.  The
problem is caused by the fact that the parent entity and child entity use
different data source (different databases on different hosts).

From the log output, I do see the the delta query of the child entity being
executed correctly and found all the rows that have been modified for the
child entity.  But it fails when it executed the parentDeltaQuery because
it is still using the database connection from the child entity (ie
datasource ds2 in my example above).

Is there a way to tell DIH to use a different datasource in the
parentDeltaQuery?

Bill


On Sat, Oct 5, 2013 at 10:28 PM, Alexandre Rafalovitch
arafa...@gmail.comwrote:

 Which version of Solr and what kind of SQL errors? There were some bugs in
 4.x related to last_index_time, but it does not sound related.

 Regards,
Alex.

 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 On Sun, Oct 6, 2013 at 8:51 AM, Bill Au bill.w...@gmail.com wrote:

  Here is my DIH config:
 
  dataConfig
  dataSource name=ds1 type=JdbcDataSource
 driver=com.mysql.jdbc.Driver
  url=jdbc:mysql://localhost1/dbname1 user=db_username1
  password=db_password1/
  dataSource name=ds2 type=JdbcDataSource
 driver=com.mysql.jdbc.Driver
  url=jdbc:mysql://localhost2/dbname2 user=db_username2
  password=db_password2/
  document name=products
  entity name=item dataSource=ds1 query=select * from item
  field column=ID name=id /
  field column=NAME name=name /
 
  entity name=feature dataSource=ds2 query=select
  description from feature where item_id='${item.ID}'
  field name=features column=description /
  /entity
  /entity
  /document
  /dataConfig
 
  I am having trouble with delta import.  I think it is because the main
  entity and the sub-entity use different data source.  I have tried using
  both a delta query:
 
  deltaQuery=select id from item where id in (select item_id as id from
  feature where last_modified  '${dih.last_index_time}') or last_modified
  gt; '${dih.last_index_time}'
 
  and a parentDeltaQuery:
 
  entity name=feature pk=ITEM_ID query=select DESCRIPTION as features
  from FEATURE where ITEM_ID='${item.ID}' deltaQuery=select ITEM_ID from
  FEATURE where last_modified  '${dih.last_index_time}'
  parentDeltaQuery=select ID from item where ID=${feature.ITEM_ID}/
 
  I ended up with an SQL error for both.  Is there any way to make delta
  import work in my case?
 
  Bill
 




Effect of multiple white space at WhiteSpaceTokenizer

2013-10-08 Thread Furkan KAMACI
I use Solr 4.5 and I have a WhitespaceTokenizer in my schema. What is the
difference (index size and performance) between these two sentences:

First one: "This is a sentence."
Second one: "This   is a  sentence."


RE: How to achieve distributed spelling check in SolrCloud ?

2013-10-08 Thread Dyer, James
Shamik,

Are you using a request handler other than /select, and if so, did you set 
shards.qt in your request?  It should be set to the name of the request 
handler you are using.

See http://wiki.apache.org/solr/SpellCheckComponent?#Distributed_Search_Support
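
For example, if your custom handler is registered as /search (name
illustrative), each request would carry something like:

  http://host:8983/solr/collection1/search?q=whatevr&spellcheck=true&shards.qt=/search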

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: Shamik Bandopadhyay [mailto:sham...@gmail.com] 
Sent: Monday, October 07, 2013 4:47 PM
To: solr-user@lucene.apache.org
Subject: How to achieve distributed spelling check in SolrCloud ?

Hi,

  We are in the process of transitioning to SolrCloud (4.4) from
Master-Slave architecture (4.2) . One of the issues I'm facing now is with
making spell check work. It only seems to work if I explicitly set
distrib=false. I'm using a custom request handler and included the spell
check option.

   <str name="spellcheck">on</str>
   <str name="spellcheck.collate">true</str>
   <str name="spellcheck.onlyMorePopular">false</str>
   <str name="spellcheck.extendedResults">false</str>
   <str name="spellcheck.count">1</str>
   <str name="spellcheck.dictionary">default</str>
  </lst>
  <!-- append spellchecking to our list of components -->
  <arr name="last-components">
   <str>spellcheck</str>
  </arr>

The spellcheck component has the usual configuration.

The spell check is part of the request handler which is being used to
execute a distributed query. I can't possibly add distrib=false.

Just wondering if there's a way to address this.

Any pointers will be appreciated.

-Thanks,
Shamik



RE: Effect of multiple white space at WhiteSpaceTokenizer

2013-10-08 Thread Markus Jelsma
Result is the same and performance difference should be negligible, unless 
you're uploading megabytes of white space. Consecutive white space should be 
collapsed outside of Solr/Lucene anyway because it'll end up in your stored 
field. Index size will be slightly bigger but not much due to compression.
 
-Original message-
 From:Furkan KAMACI furkankam...@gmail.com
 Sent: Tuesday 8th October 2013 16:21
 To: solr-user@lucene.apache.org
 Subject: Effect of multiple white space at WhiteSpaceTokenizer
 
 I use Solr 4.5 and I have a WhiteSpaceTokenizer at my schema. What is the
 difference (index size and performance) for that two sentences:
 
 First one: This is a sentence.
 Second one: This   is a  sentence.
 


Adding Functionalities in SOLR

2013-10-08 Thread Ankit Kumar
*1. Span NOT Operator*

 We have a business use case to use SPAN NOT queries in SOLR. Query
Parser of LUCENE currently doesn't support/parse SPAN NOT queries.

2.Adding Recursive and Range Proximity

  *Recursive Proximity *is a proximity query within a proximity query

Ex:   “ “income tax”~5   statement” ~4  The recursion can be up to any
level.

* Range Proximity*: Currently we can only define a number as the range; we
want an interval as the range.

Ex: “profit income”~3,5,  “United America”~-5,4



3. Complex  Queries

A complex query is a query formed with a combination of Boolean operators
or proximity queries or range queries or any possible combination of these.

Ex:“(income AND tax) statement”~4

  “ “income tax”~4  (statement OR period) ”~3

  (“ income” SPAN NOT  “income tax” ) source ~3,5

 Can anyone suggest us some way of achieving these 3 functionalities in
SOLR ???
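
As you note, the default query parser doesn't parse SPAN NOT, so one route for
the first item is a custom QParserPlugin (or client-side Lucene code) built on
Lucene's span queries. A minimal sketch of the Lucene side (field and terms
illustrative):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.spans.SpanNearQuery;
  import org.apache.lucene.search.spans.SpanNotQuery;
  import org.apache.lucene.search.spans.SpanQuery;
  import org.apache.lucene.search.spans.SpanTermQuery;

  SpanQuery income    = new SpanTermQuery(new Term("body", "income"));
  SpanQuery tax       = new SpanTermQuery(new Term("body", "tax"));
  // the phrase "income tax" as a span (slop 0, in order)
  SpanQuery incomeTax = new SpanNearQuery(new SpanQuery[]{income, tax}, 0, true);
  // matches "income" only where it is NOT part of "income tax"
  SpanQuery notQuery  = new SpanNotQuery(income, incomeTax);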


Re: Regex to match one of two words

2013-10-08 Thread Walter Underwood
Or a boolean field for published, with false meaning deprecated.

wunder

On Oct 8, 2013, at 3:42 AM, Jack Krupansky wrote:

 Why use regular expressions at all?
 
 Try:
 
 published OR deprecated
 
 -- Jack Krupansky
 
 -Original Message- From: Dinusha Dilrukshi
 Sent: Tuesday, October 08, 2013 3:32 AM
 To: solr-user@lucene.apache.org
 Subject: Regex to match one of two words
 
 I have an input that can have only 2 values Published or Deprecated. What
 regular expression can I use to ensure that either of the two words was
 submitted?
 
 I tried with different regular expressions (as in the [1], [2]) that
 contains most generic syntax.. But Solar throws parser exception when
 validating these expressions.. Could someone help me on writing this
 regular expression that will evaluate by the Solar parser.
 
 [1] /^(PUBLISHED)?(DEPRECATED)?$/
 [2] /(PUBLISHED)?(DEPRECATED)?/
 
 
 SolrCore org.apache.solr.common.SolrException:
 org.apache.lucene.queryParser.ParseException: Cannot parse
 'overview_status_s:/(PUBLISHED)?(DEPRECATED)?/': '*' or '?' not allowed as
 first character in WildcardQuery
 at
 org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:108)
 
 
 Regards,
 Dinusha. 

--
Walter Underwood
wun...@wunderwood.org





Re: ALIAS feature, can be used for what?

2013-10-08 Thread Michael Della Bitta
CREATEALIAS is also used to move an alias.
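
For example (alias and collection names illustrative), re-running CREATEALIAS
with the same alias name repoints it at the new collection:

  http://host:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_v2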

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinionshttps://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/


On Fri, Oct 4, 2013 at 5:41 AM, Jan Høydahl jan@cominvent.com wrote:

 Hi,

 I have been asked the same question. There are only DELETEALIAS and
 CREATEALIAS actions available, so is there a way to achieve uninterrupted
 switch of an alias from one index to another? Are we lacking a MOVEALIAS
 command?

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com

 27. sep. 2013 kl. 10:46 skrev Yago Riveiro yago.rive...@gmail.com:

  I need delete the alias for the old collection before point it to the
 new, right?
 
  --
  Yago Riveiro
  Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
 
 
  On Friday, September 27, 2013 at 2:25 AM, Otis Gospodnetic wrote:
 
  Hi,
 
  Imagine you have an index and you need to reindex your data into a new
  index, but don't want to have to reconfigure or restart client apps
  when you want to point them to the new index. This is where aliases
  come in handy. If you created an alias for the first index and made
  your apps hit that alias, then you can just repoint the same alias to
  your new index and avoid having to touch client apps.
 
  No, I don't think you can write to multiple collections through a
 single alias.
 
  Otis
  --
  Solr  ElasticSearch Support -- http://sematext.com/
  Performance Monitoring -- http://sematext.com/spm
 
 
 
  On Thu, Sep 26, 2013 at 6:34 AM, yriveiro yago.rive...@gmail.com(mailto:
 yago.rive...@gmail.com) wrote:
  Today I was thinking about the ALIAS feature and the utility on Solr.
 
  Can anyone explain me with an example where this feature may be useful?
 
  It's possible have an ALIAS of multiples collections, if I do a write
 to the
  alias, Is this write replied to all collections?
 
  /Yago
 
 
 
  -
  Best regards
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/ALIAS-feature-can-be-used-for-what-tp4092095.html
  Sent from the Solr - User mailing list archive at Nabble.com (
 http://Nabble.com).
 
 
 
 
 
 




Re: ALIAS feature, can be used for what?

2013-10-08 Thread Michael Della Bitta
You can index to an alias that points at only one collection. Works fine!

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinionshttps://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/


On Fri, Oct 4, 2013 at 7:59 AM, Upayavira u...@odoko.co.uk wrote:

 I've used this feature to great effect. I have logs coming in, and I
 create a core for each day. At the end of each day, I create a new core
 for tomorrow, unload any cores over 2 months old, then create a set of
 aliases (all, month, week, today) pointing to just the cores
 that are needed for that range. Thus, my app can efficiently query the
 bit of the index it is really interested in.

 You cannot, as far as I am aware, index directly to an alias. It
 wouldn't know what to do with the content. However, you can create an
 alias over the top of an existing one, and it will replace it. Works
 nicely.

 Upayavira

 On Fri, Oct 4, 2013, at 10:41 AM, Jan Høydahl wrote:
  Hi,
 
  I have been asked the same question. There are only DELETEALIAS and
  CREATEALIAS actions available, so is there a way to achieve uninterrupted
  switch of an alias from one index to another? Are we lacking a MOVEALIAS
  command?
 
  --
  Jan Høydahl, search solution architect
  Cominvent AS - www.cominvent.com
 
  27. sep. 2013 kl. 10:46 skrev Yago Riveiro yago.rive...@gmail.com:
 
   I need delete the alias for the old collection before point it to the
 new, right?
  
   --
   Yago Riveiro
   Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
  
  
   On Friday, September 27, 2013 at 2:25 AM, Otis Gospodnetic wrote:
  
   Hi,
  
   Imagine you have an index and you need to reindex your data into a new
   index, but don't want to have to reconfigure or restart client apps
   when you want to point them to the new index. This is where aliases
   come in handy. If you created an alias for the first index and made
   your apps hit that alias, then you can just repoint the same alias to
   your new index and avoid having to touch client apps.
  
   No, I don't think you can write to multiple collections through a
 single alias.
  
   Otis
   --
   Solr  ElasticSearch Support -- http://sematext.com/
   Performance Monitoring -- http://sematext.com/spm
  
  
  
   On Thu, Sep 26, 2013 at 6:34 AM, yriveiro yago.rive...@gmail.com(mailto:
 yago.rive...@gmail.com) wrote:
   Today I was thinking about the ALIAS feature and the utility on Solr.
  
   Can anyone explain me with an example where this feature may be
 useful?
  
   It's possible have an ALIAS of multiples collections, if I do a
 write to the
   alias, Is this write replied to all collections?
  
   /Yago
  
  
  
   -
   Best regards
   --
   View this message in context:
 http://lucene.472066.n3.nabble.com/ALIAS-feature-can-be-used-for-what-tp4092095.html
   Sent from the Solr - User mailing list archive at Nabble.com (
 http://Nabble.com).
  
  
  
  
  
  
 



Bootstrapping / Full Importing using Solr Cloud

2013-10-08 Thread Mark
We are in the process of upgrading our Solr cluster to the latest and greatest 
Solr Cloud. I have some questions regarding full indexing though. We're 
currently running a long job (~30 hours) using DIH to do a full index of over 
10M products. This process consumes a lot of memory and, while updating, cannot 
handle any user requests.

How, or what, would be the best way to go about this when using Solr Cloud? 
First off, does DIH work with cloud? Would I need to separate out my DIH 
indexing machine from the machines serving up user requests? If not going down 
the DIH route, what are my best options (SolrJ?)

Thanks for the input

Case insensitive suggestion - Suggester with external dictionary

2013-10-08 Thread SolrLover

I am using suggester that uses external dictionary file for suggestions (as
below).

# This is a sample dictionary file.

iPhone3g
iPhone4     295
iPhone5c    620
iPhone4g    710

Everything works fine except for the fact that the suggester seems to be
case sensitive.

/suggest?q=ip is not matching any of the entries in the dictionary (listed
above). Is there a way to make the suggester case insensitive when using
external dictionary file?

Thanks for your help!!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Case-insensitive-suggestion-Suggester-with-external-dictionary-tp4094133.html
Sent from the Solr - User mailing list archive at Nabble.com.


EdgeNGramFilterFactory and Faceting

2013-10-08 Thread Tyler Foster
Hey Everyone,
When faceting on a field using the EdgeNGramFilterFactory the returned
facets values include all of the n-gram values. Is there a way to limit
this list to the stored values without creating a new field?

Thanks in advance!

Tyler


RE: EdgeNGramFilterFactory and Faceting

2013-10-08 Thread Markus Jelsma
Facets do not return stored values; it's usually a bad idea to tokenize 
or do heavy analysis on facet fields. You need to facet on a copy of your field instead.
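
A sketch of that (field and type names assumed, not from the original schema):

  <field name="title" type="text_edgengram" indexed="true" stored="true"/>
  <field name="title_facet" type="string" indexed="true" stored="false"/>
  <copyField source="title" dest="title_facet"/>

Facet on title_facet and search/highlight on title.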

-Original message-
 From:Tyler Foster tfos...@cloudera.com
 Sent: Tuesday 8th October 2013 19:28
 To: solr-user@lucene.apache.org
 Subject: EdgeNGramFilterFactory and Faceting
 
 Hey Everyone,
 When faceting on a field using the EdgeNGramFilterFactory the returned
 facets values include all of the n-gram values. Is there a way to limit
 this list to the stored values without creating a new field?
 
 Thanks in advance!
 
 Tyler
 


Re: EdgeNGramFilterFactory and Faceting

2013-10-08 Thread Shalin Shekhar Mangar
Tyler, faceting works on indexed content and not stored content.


On Tue, Oct 8, 2013 at 10:45 PM, Tyler Foster tfos...@cloudera.com wrote:

 Hey Everyone,
 When faceting on a field using the EdgeNGramFilterFactory the returned
 facets values include all of the n-gram values. Is there a way to limit
 this list to the stored values without creating a new field?

 Thanks in advance!

 Tyler




-- 
Regards,
Shalin Shekhar Mangar.


Re: EdgeNGramFilterFactory and Faceting

2013-10-08 Thread Tyler Foster
Thanks, that was the way it was looking. I just wanted to make sure I
wasn't missing something.


On Tue, Oct 8, 2013 at 10:32 AM, Markus Jelsma
markus.jel...@openindex.iowrote:

 Facets do not return the stored constraints, it's usually bad idea to
 tokenize or do some have analysis on facet fields. You need to copy your
 field instead.

 -Original message-
  From:Tyler Foster tfos...@cloudera.com
  Sent: Tuesday 8th October 2013 19:28
  To: solr-user@lucene.apache.org
  Subject: EdgeNGramFilterFactory and Faceting
 
  Hey Everyone,
  When faceting on a field using the EdgeNGramFilterFactory the returned
  facets values include all of the n-gram values. Is there a way to limit
  this list to the stored values without creating a new field?
 
  Thanks in advance!
 
  Tyler
 



RE: How to achieve distributed spelling check in SolrCloud ?

2013-10-08 Thread shamik
James,

  Thanks for your reply. The shards.qt did the trick. I read the
documentation earlier but was not clear on the implementation, now it
totally makes sense.

Appreciate your help.

Regards,
Shamik



--
View this message in context: 
http://lucene.472066.n3.nabble.com/RE-How-to-achieve-distributed-spelling-check-in-SolrCloud-tp4094113p4094137.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr 4.4.0 Shard Update Errors (503) but cloud graph shows all green?

2013-10-08 Thread dmarini
Hi!

We are running Solr 4.4.0 on a 3 node linux cluster and have about 2
collections storing product data with no problems. Yesterday, I attempted to
create another one of these collections using the Collections API, but I had
forgotten to upload the config to the zookeeper prior to making the call and
it failed spectacularly as expected :). The API command I ran was to create
a 3 shard collection with a replicationFactor of 2 and maxShardsPerNode set to
2, since the default understandably causes issues on 3 node clusters.

Since I ran that command however, I see the following message in the red
'SolrCore Initialization Failures' when I load up the admin for 2 out of 3 of
the nodes (the following is from one of the boxes):

MyNewCollection_shard1_replica2:
org.apache.solr.common.cloud.ZooKeeperException:org.apache.solr.common.cloud.ZooKeeperException:
Could not find configName for collection MyNewCollection
found:[MyFirstCollection, MySecondCollection]

MyNewCollection_shard3_replica1:
org.apache.solr.common.cloud.ZooKeeperException:org.apache.solr.common.cloud.ZooKeeperException:
Could not find configName for collection MyNewCollection
found:[MyFirstCollection, MySecondCollection]

My first question is, how do I get this to go away since the cores never
actually got created? I looked in the solr directory and I do not see folders
with the core names (which I'm under the impression the implicit core walking
uses to determine what cores to attempt to load).

Second, and a bit stranger, is that since I messed up that command, I now
appear to be seeing errors in the admin log (every 2 seconds) when attempting
to update documents in the other 2 collections that were working fine prior to
the command being run. Specifically, I'm seeing these messages repeating over
and over near constantly:

14:07:11 ERROR SolrCmdDistributor shard update error StdNode:
http://10.0.1.29:8983/solr/MyFirstCollection_shard1_replica2/:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
Server at http://10.0.1.29:8983/solr/MyFirstCollection_shard1_replica2
returned non ok status:503, message:Service Unavailable

14:07:11 ERROR SolrCore Request says it is coming from leader, but we are the leader:
distrib.from=http://10.0.1.30:8983/solr/MyFirstCollection_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2

14:07:11 ERROR SolrCore org.apache.solr.common.SolrException:
Request says it is coming from leader, but we are the leader

14:07:11 WARN RecoveryStrategy Stopping recovery for
zkNodeName=core_node1 core=MyFirstCollection_shard1_replica2

14:07:11 WARN RecoveryStrategy We have not yet recovered - but we are now the leader!
core=MyFirstCollection_shard1_replica2

The first error worries me much, as I think I'm losing data, but I can directly
query that shard from that machine with no issues and the cloud view from ALL
of the machines shows totally green.

I'm not sure how the failed command got the system into this state, and I'm
kicking myself for making that mistake to begin with, but I'm completely at a
loss for how to attempt to recover, since these are live collections that I
can't take down without incurring significant downtime.

Any ideas? Will reloading the cores that are throwing these messages help? Can
the zookeeper and solr not have the same idea as to who the leader is for that
shard? And if so, how do I re-introduce consistency there?

Appreciate any help that can be offered.

Thanks,
--Dave



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-4-0-Shard-Update-Errors-503-but-cloud-graph-shows-all-green-tp4094139.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to achieve distributed spelling check in SolrCloud ?

2013-10-08 Thread Jason Hellman
The shards.qt parameter is the easiest one to forget, with the most dramatic of 
consequences!

On Oct 8, 2013, at 11:10 AM, shamik sham...@gmail.com wrote:

 James,
 
  Thanks for your reply. The shards.qt did the trick. I read the
 documentation earlier but was not clear on the implementation, now it
 totally makes sense.
 
 Appreciate your help.
 
 Regards,
 Shamik
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/RE-How-to-achieve-distributed-spelling-check-in-SolrCloud-tp4094113p4094137.html
 Sent from the Solr - User mailing list archive at Nabble.com.
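
For anyone who lands here later: the fix amounts to sending shards.qt along with
the spellcheck request so it points at the request handler that actually has the
spellcheck component configured. A request of roughly this shape (host, collection,
and the /spell handler name are illustrative) keeps the check distributed correctly:

    http://host:8983/solr/collection1/spell?q=delll+ultra+sharp
        &spellcheck=true&spellcheck.collate=true&shards.qt=/spell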



Solr 4.4 - Master/Slave configuration - Replication Issue with Commits after deleting documents using Delete by ID

2013-10-08 Thread Akkinepalli, Bharat (ELS-CON)
Hi,
We have recently migrated from Solr 3.6 to Solr 4.4.  We are using the 
Master/Slave configuration in Solr 4.4 (not Solr Cloud).  We have noticed the 
following behavior/defect.

Configuration:
===

1.   The Hard Commit and Soft Commit are disabled in the configuration (we 
control the commits from the application)

2.   We have 1 Master and 2 Slaves configured and the pollInterval is 
configured to 10 Minutes.

3.   The Master is configured to have the replicateAfter as commit & startup

Steps to reproduce the problem:
==

1.   Delete a document in Solr (using delete by id).  URL - 
http://localhost:8983/solr/annotation/update with body as 
<delete><id>change.me</id></delete>

2.   Issue a commit in Master 
(http://localhost:8983/solr/annotation/update?commit=true).

3.   The replication of the DELETE WILL NOT happen.  The master and slave 
has the same Index version.

4.   If we try to issue another commit in Master, we see that it replicates 
fine.

Request you to please confirm if this is a known issue.  Thank you.

Regards,
Bharat Akkinepalli



Re: problem with data import handler delta import due to use of multiple datasource

2013-10-08 Thread Bill Au
Thanks for the suggestion but that won't work as I have last_modified field
in both the parent entity and child entity as I want delta import to kick
in when either change.  That other approach has the same problem since the
parent and child entity uses different datasources.

Bill


On Tue, Oct 8, 2013 at 10:18 AM, Dyer, James
james.d...@ingramcontent.comwrote:

 Bill,

 I do not believe there is any way to tell it to use a different datasource
 for the parent delta query.

 If you used this approach, would it solve your problem:
 http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport ?

 James Dyer
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Bill Au [mailto:bill.w...@gmail.com]
 Sent: Tuesday, October 08, 2013 8:50 AM
 To: solr-user@lucene.apache.org
 Subject: Re: problem with data import handler delta import due to use of
 multiple datasource

 I am using 4.3.  It is not related to bugs related to last_index_time.  The
 problem is caused by the fact that the parent entity and child entity use
 different data source (different databases on different hosts).

 From the log output, I do see the the delta query of the child entity being
 executed correctly and found all the rows that have been modified for the
 child entity.  But it fails when it executed the parentDeltaQuery because
 it is still using the database connection from the child entity (ie
 datasource ds2 in my example above).

 Is there a way to tell DIH to use a different datasource in the
 parentDeltaQuery?

 Bill


 On Sat, Oct 5, 2013 at 10:28 PM, Alexandre Rafalovitch
 arafa...@gmail.comwrote:

  Which version of Solr and what kind of SQL errors? There were some bugs
 in
  4.x related to last_index_time, but it does not sound related.
 
  Regards,
 Alex.
 
  Personal website: http://www.outerthoughts.com/
  LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
  - Time is the quality of nature that keeps events from happening all at
  once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
 
 
  On Sun, Oct 6, 2013 at 8:51 AM, Bill Au bill.w...@gmail.com wrote:
 
   Here is my DIH config:
  
    <dataConfig>
      <dataSource name="ds1" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost1/dbname1" user="db_username1"
                  password="db_password1"/>
      <dataSource name="ds2" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost2/dbname2" user="db_username2"
                  password="db_password2"/>
      <document name="products">
        <entity name="item" dataSource="ds1" query="select * from item">
          <field column="ID" name="id"/>
          <field column="NAME" name="name"/>
          <entity name="feature" dataSource="ds2"
                  query="select description from feature where item_id='${item.ID}'">
            <field name="features" column="description"/>
          </entity>
        </entity>
      </document>
    </dataConfig>
  
   I am having trouble with delta import.  I think it is because the main
   entity and the sub-entity use different data source.  I have tried
 using
   both a delta query:
  
   deltaQuery="select id from item where id in (select item_id as id from
   feature where last_modified > '${dih.last_index_time}') or
   last_modified > '${dih.last_index_time}'"
  
   and a parentDeltaQuery:
  
   <entity name="feature" pk="ITEM_ID"
           query="select DESCRIPTION as features from FEATURE where ITEM_ID='${item.ID}'"
           deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dih.last_index_time}'"
           parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}"/>
  
   I ended up with an SQL error for both.  Is there any way to make delta
   import work in my case?
  
   Bill
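
For reference, the DeltaQueryViaFullImport pattern James points to looks roughly
like this when spread over two data sources (table, column, and datasource names
follow the config quoted above and are illustrative); note it only reacts to
changes that the parent query itself can detect:

    <entity name="item" dataSource="ds1" pk="ID"
            query="select * from item
                   where '${dih.request.clean}' != 'false'
                      or last_modified &gt; '${dih.last_index_time}'">
      <field column="ID" name="id"/>
      <entity name="feature" dataSource="ds2"
              query="select description from feature where item_id='${item.ID}'">
        <field name="features" column="description"/>
      </entity>
    </entity>

It is then run with command=full-import&clean=false so only the rows matched by
the last_modified test are re-fetched.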
  
 




Re: What is the full list of Solr Special Characters?

2013-10-08 Thread Shawn Heisey

On 10/8/2013 3:01 AM, Furkan KAMACI wrote:
 Actually I want to remove special characters and not send them into my
 Solr indexes. I mean a user can send a special query, like a SQL injection,
 and I want to protect my system from such scenarios.

There is a newer javadoc than the *very* old one you are looking at:

http://lucene.apache.org/core/4_5_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html?is-external=true#Escaping_Special_Characters

When I compare that list to what's actually in the SolrJ 
escapeQueryChars method, it looks like that method does one additional 
character - the semicolon.


http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_5_0/solr/solrj/src/java/org/apache/solr/client/solrj/util/ClientUtils.java

Just search the page for escapeQueryChars to see the java code.

Thanks,
Shawn
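
If the query string is assembled in a SolrJ client, a minimal sketch of using that
method looks like this (the raw input is just an example):

    import org.apache.solr.client.solrj.util.ClientUtils;

    public class EscapeExample {
        public static void main(String[] args) {
            String raw = "(title:foo) AND \"bar\"";   // illustrative, possibly hostile, user input
            String safe = ClientUtils.escapeQueryChars(raw);
            // 'safe' can now be embedded in a query without being parsed as query syntax
            System.out.println(safe);
        }
    }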



run filter queries after post filter

2013-10-08 Thread Rohit Harchandani
Hey,
I am using solr 4.0 with my own PostFilter implementation which is executed
after the normal solr query is done. This filter has a cost of 100. Is it
possible to run filter queries on the index after the execution of the post
filter?
I tried adding the below line to the url but it did not seem to work:
fq={!cache=false cost=200}field:value
Thanks,
Rohit


Re: no such field error:smaller big block size details while indexing doc files

2013-10-08 Thread sweety
This is my new schema.xml:

<schema name="documents">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="author" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="comments" type="text" indexed="true" stored="true" multiValued="false"/>
    <field name="keywords" type="text" indexed="true" stored="true" multiValued="false"/>
    <field name="contents" type="text" indexed="true" stored="true" multiValued="false"/>
    <field name="title" type="text" indexed="true" stored="true" multiValued="false"/>
    <field name="revision_number" type="string" indexed="true" stored="true" multiValued="false"/>
    <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
    <dynamicField name="ignored_*" type="string" indexed="false" stored="true" multiValued="true"/>
    <dynamicField name="*" type="ignored" multiValued="true"/>
    <copyField source="id" dest="text"/>
    <copyField source="author" dest="text"/>
  </fields>
  <types>
    <fieldType name="ignored" stored="false" indexed="false" class="solr.StrField"/>
    <fieldType name="integer" class="solr.IntField"/>
    <fieldType name="long" class="solr.LongField"/>
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="text" class="solr.TextField"/>
  </types>
  <uniqueKey>id</uniqueKey>
</schema>
I still get the same error.


 From: Erick Erickson [via Lucene] ml-node+s472066n4094013...@n3.nabble.com
To: sweety sweetyshind...@yahoo.com 
Sent: Tuesday, October 8, 2013 7:16 AM
Subject: Re: no such field error:smaller big block size details while indexing 
doc files
 


Well, one of the attributes parsed out of, probably the 
meta-information associated with one of your structured 
docs is SMALLER_BIG_BLOCK_SIZE_DETAILS and 
Solr Cel is faithfully sending that to your index. If you 
want to throw all these in the bit bucket, try defining 
a true catch-all field that ignores things, like this. 
<dynamicField name="*" type="ignored" multiValued="true"/> 

Best, 
Erick 

On Mon, Oct 7, 2013 at 8:03 AM, sweety [hidden email] wrote: 

 I'm trying to index .doc, .docx, and .pdf files; 
 I'm using this url: 
 curl "http://localhost:8080/solr/document/update/extract?literal.id=12&commit=true" 
 -F "myfile=@complex.doc" 
 
 This is the error I get: 
 Oct 07, 2013 5:02:18 PM org.apache.solr.common.SolrException log 
 SEVERE: null:java.lang.RuntimeException: java.lang.NoSuchFieldError: 
 SMALLER_BIG_BLOCK_SIZE_DETAILS 
         at 
 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:651)
  
         at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:364)
  
         at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
  
         at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
  
         at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
  
         at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
  
         at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
  
         at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168) 
         at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98) 
         at 
 org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:928) 
         at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
  
         at 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407) 
         at 
 org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
  
         at 
 org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:539)
  
         at 
 org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:298)
  
         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
         at java.lang.Thread.run(Unknown Source) 
 Caused by: java.lang.NoSuchFieldError: SMALLER_BIG_BLOCK_SIZE_DETAILS 
         at 
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.init(NPOIFSFileSystem.java:93)
  
         at 
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.init(NPOIFSFileSystem.java:190)
  
         at 
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.init(NPOIFSFileSystem.java:184)
  
         at 
 org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:376)
  
         at 
 org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:165)
  
         at 
 org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61) 
         at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:113) 
         at 
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
  
         at 
 

Re: solr cpu usage

2013-10-08 Thread Tim Vaillancourt
Yes, you've saved us all lots of time with this article. I'm about to do
the same for the old Jetty or Tomcat? container question ;).

Tim


On 7 October 2013 18:55, Erick Erickson erickerick...@gmail.com wrote:

 Tim:

 Thanks! Mostly I wrote it to have something official looking to hide
 behind when I didn't have a good answer to the hardware sizing question
 :).

 On Mon, Oct 7, 2013 at 2:48 PM, Tim Vaillancourt t...@elementspace.com
 wrote:
  Fantastic article!
 
  Tim
 
 
  On 5 October 2013 18:14, Erick Erickson erickerick...@gmail.com wrote:
 
  From my perspective, your question is almost impossible to
  answer, there are too many variables. See:
 
 
 http://searchhub.org/dev/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
 
  Best,
  Erick
 
  On Thu, Oct 3, 2013 at 9:38 PM, Otis Gospodnetic
  otis.gospodne...@gmail.com wrote:
   Hi,
  
   More CPU cores means more concurrency.  This is good if you need to
  handle
   high query rates.
  
   Faster cores mean lower query latency, assuming you are not
 bottlenecked
  by
   memory or disk IO or network IO.
  
   So what is ideal for you depends on your concurrency and latency
 needs.
  
   Otis
   Solr  ElasticSearch Support
   http://sematext.com/
   On Oct 1, 2013 9:33 AM, adfel70 adfe...@gmail.com wrote:
  
   hi
   We're building a spec for a machine to purchase.
   We're going to buy 10 machines.
   we aren't sure yet how many proccesses we will run per machine.
   the question is  -should we buy faster cpu with less cores or slower
 cpu
   with more cores?
   in any case we will have 2 cpus in each machine.
   should we buy 2.6Ghz cpu with 8 cores or 3.5Ghz cpu with 4 cores?
  
   what will we gain by having many cores?
  
   what kinds of usages would make cpu be the bottleneck?
  
  
  
  
   --
   View this message in context:
   http://lucene.472066.n3.nabble.com/solr-cpu-usage-tp4092938.html
   Sent from the Solr - User mailing list archive at Nabble.com.
  
 



dynamically adding core with auto-discovery in Solr 4.5

2013-10-08 Thread Jan Van Besien
Hi,

We are using auto discovery and have a use case where we want to be
able to add cores dynamically, without restarting solr.

In 4.4 we were able to
- add a directory (e.g. core1) with an empty core.properties
- call 
http://localhost:8983/solr/admin/cores?action=CREATE&core=core1&name=core1&instanceDir=%2Fsomewhere%2Fcore1

In 4.5 however this (the second step) fails, saying it cannot create a
new core in that directory because another core is already defined
there.

From the documentation (http://wiki.apache.org/solr/CoreAdmin), I
understand that since 4.3 we should actually do RELOAD. However,
RELOAD results in this stacktrace:

org.apache.solr.common.SolrException: Error handling 'reload' action
at 
org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:673)
at 
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:172)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at 
org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:655)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:246)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:322) at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at
org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at
org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.solr.common.SolrException: Unable to reload
core: core1 at 
org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:936)
at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:691)
at 
org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:671)
... 20 more Caused by: org.apache.solr.common.SolrException: No such
core: core1 at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:642)
... 21 more

Note that before I RELOAD, the core1 directory was created.

Also note that next to the core1 directory, there is a core0 directory
which has exactly the same content and is auto-discovered perfectly
fine at startup.

So... what should it be? Or am I missing something here?

thanks in advance,
Jan


Re: Accent insensitive multi-words suggester

2013-10-08 Thread Dominique Bejean

Thank you Erick.
I will try this.

Regards
Dominique

Le 06/10/13 03:03, Erick Erickson a écrit :

Consider implementing a special field that of the form
accentfolded|original

For instance, you'd index something like
ecole|école
ecole|école privée
as _terms_, not broken up at all.

Now, when you send something to the suggester (just eco or éco),
you fold it to eco too and get back these tokens.
Then the app layer breaks them up and displays them pleasingly.

Best
Erick

On Tue, Oct 1, 2013 at 5:45 PM, Dominique Bejean
dominique.bej...@eolya.fr wrote:

Hi,

Up to now, the best solution I found in order to implement a multi-words
suggester was to use ShingleFilterFactory filter at index time and the
termsComponent. At index time the analyzer was :

   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <filter class="solr.ElisionFilterFactory" ignoreCase="true"
             articles="lang/contractions_fr.txt"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
             outputUnigrams="true"/>
   </analyzer>


With ASCIIFoldingFilter filter, it works find if the user do not use
accent in query terms and all suggestions are without accents.
Without ASCIIFoldingFilter filter, it works find if the user do not forget
accent in query terms and all suggestions are with accents.

Note : I use the StopFilter to avoid suggestions including stop words and
particularly starting or ending with stop words.


What I need is a suggester where the user can use or not use the accent in
query terms and the suggestions are returned with accent.

For example, if the user type éco or eco, the suggester should return :

école
école primaire
école publique
école privée
école primaire privée


I think it is impossible to achieve this with the termComponents and I
should use the SpellCheckComponent instead. However, I don't see how to make
the suggester accent insensitive and return the suggestions with accents.

Did somebody already achieved that ?

Thank you.

Dominique


--
Dominique Béjean
+33 6 08 46 12 43
skype: dbejean
www.eolya.fr
www.crawl-anywhere.com



Re: {soft}Commit and cache flusing

2013-10-08 Thread Tim Vaillancourt
I have a genuine question with substance here. If anything this
nonconstructive, rude response was to get noticed. Thanks for
contributing to the discussion.

Tim


On 8 October 2013 05:31, Dmitry Kan solrexp...@gmail.com wrote:

 Tim,
 I suggest you open a new thread and not reply to this one to get noticed.
 Dmitry


 On Mon, Oct 7, 2013 at 9:44 PM, Tim Vaillancourt t...@elementspace.com
 wrote:

  Is there a way to make autoCommit only commit if there are pending
 changes,
  ie: if there are 0 adds pending commit, don't autoCommit (open-a-searcher
  and wipe the caches)?
 
  Cheers,
 
  Tim
 
 
  On 2 October 2013 00:52, Dmitry Kan solrexp...@gmail.com wrote:
 
   right. We've got the autoHard commit configured only atm. The
  soft-commits
   are controlled on the client. It was just easier to implement the first
   version of our internal commit policy that will commit to all solr
   instances at once. This is where we have noticed the reported behavior.
  
  
   On Wed, Oct 2, 2013 at 9:32 AM, Bram Van Dam bram.van...@intix.eu
  wrote:
  
if there are no modifications to an index and a softCommit or
  hardCommit
issued, then solr flushes the cache.
   
   
Indeed. The easiest way to work around this is by disabling auto
  commits
and only commit when you have to.
   
  
 



What's the purpose of the bits option in compositeId (Solr 4.5)?

2013-10-08 Thread Brett Hoerner
I'm curious what the later shard-local bits do, if anything?

I have a very large cluster (256 shards) and I'm sending most of my data
with a single composite, e.g. 1234!unique_id, but I'm noticing the data
is being split among many of the shards.

My guess right now is that since I'm only using the default 16 bits my data
is being split across multiple shards (because of my high # of shards).

Thanks,
Brett
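
For context, the bits option being asked about is the optional /N suffix on the
route key; as documented for 4.5 it controls how many of the leading hash bits the
route key contributes (the values below are illustrative):

    1234/8!unique_id    8 bits of the hash come from "1234", the rest from "unique_id"
    1234!unique_id      default: 16 bits from "1234", 16 from "unique_id"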


Re: What's the purpose of the bits option in compositeId (Solr 4.5)?

2013-10-08 Thread Yonik Seeley
On Tue, Oct 8, 2013 at 6:29 PM, Brett Hoerner br...@bretthoerner.com wrote:
 I'm curious what the later shard-local bits do, if anything?

 I have a very large cluster (256 shards) and I'm sending most of my data
 with a single composite, e.g. 1234!unique_id, but I'm noticing the data
 is being split among many of the shards.

That shouldn't be the case.  All of your shards should have a lower
hash value with all 0 bits and an upper hash value of all 1s (i.e.
0x to 0x)
So you see any shards where that's not true?

Also, is the router set to compositeId?

-Yonik

 My guess right now is that since I'm only using the default 16 bits my data
 is being split across multiple shards (because of my high # of shards).

 Thanks,
 Brett


Re: What's the purpose of the bits option in compositeId (Solr 4.5)?

2013-10-08 Thread Brett Hoerner
Router is definitely compositeId.

To be clear, data isn't being spread evenly... it's like it's *almost*
working. It's just odd to me that I'm slamming in data that's 99% of one
_route_ key yet after a few minutes (from a fresh empty index) I have 2
shards with a sizeable amount of data (68M and 128M) and the rest are very
small as expected.

The fact that two are receiving so much makes me think my data is being
split into two shards. I'm trying to debug more now.


On Tue, Oct 8, 2013 at 5:45 PM, Yonik Seeley ysee...@gmail.com wrote:

 On Tue, Oct 8, 2013 at 6:29 PM, Brett Hoerner br...@bretthoerner.com
 wrote:
  I'm curious what the later shard-local bits do, if anything?
 
  I have a very large cluster (256 shards) and I'm sending most of my data
  with a single composite, e.g. 1234!unique_id, but I'm noticing the
 data
  is being split among many of the shards.

 That shouldn't be the case.  All of your shards should have a lower
 hash value with all 0 bits and an upper hash value of all 1s (i.e.
 0x to 0x)
 So you see any shards where that's not true?

 Also, is the router set to compositeId?

 -Yonik

  My guess right now is that since I'm only using the default 16 bits my
 data
  is being split across multiple shards (because of my high # of shards).
 
  Thanks,
  Brett



limiting deep pagination

2013-10-08 Thread Peter Keegan
Is there a way to configure Solr 'defaults/appends/invariants' such that
the product of the 'start' and 'rows' parameters doesn't exceed a given
value? This would be to prevent deep pagination.  Or would this require a
custom requestHandler?

Peter


dynamic field question

2013-10-08 Thread Twomey, David

I am having trouble trying to return a particular dynamic field only instead of 
all dynamic fields.

Imagine I have a document with an unknown number of sections.  Each section can 
have a 'title' and a 'body'

 I have each section title and body as dynamic fields such as section_title_*  
and section_body_*

Imagine that some documents contain a section that has a title=Appendix

I want a query that will find all docs with that section and return just the 
Appendix section.

I don't know how to return just that one section though

I can copyField my dynamic field section_title_* into a static field called 
section_titles and query that for docs that contain the Appendix

But I don't know how to only return that one dynamic field

?q=section_titles:Appendixfl=section_body_*

Any ideas?   I can't seem to put a conditional in the fl parameter





Re: What's the purpose of the bits option in compositeId (Solr 4.5)?

2013-10-08 Thread Brett Hoerner
This is my clusterstate.json:
https://gist.github.com/bretthoerner/0098f741f48f9bb51433

And these are my core sizes (note large ones are sorted to the end):
https://gist.github.com/bretthoerner/f5b5e099212194b5dff6

I've only heavily sent 2 shards by now (I'm sharding by hour and it's
been running for 2). There *is* a little old data in my stream, but not
that much (like 5%). What's confusing to me is that 5 of them are rather
large, when I'd expect 2 of them to be.


On Tue, Oct 8, 2013 at 5:45 PM, Yonik Seeley ysee...@gmail.com wrote:

 On Tue, Oct 8, 2013 at 6:29 PM, Brett Hoerner br...@bretthoerner.com
 wrote:
  I'm curious what the later shard-local bits do, if anything?
 
  I have a very large cluster (256 shards) and I'm sending most of my data
  with a single composite, e.g. 1234!unique_id, but I'm noticing the
 data
  is being split among many of the shards.

 That shouldn't be the case.  All of your shards should have a lower
 hash value with all 0 bits and an upper hash value of all 1s (i.e.
 0x to 0x)
 So you see any shards where that's not true?

 Also, is the router set to compositeId?

 -Yonik

  My guess right now is that since I'm only using the default 16 bits my
 data
  is being split across multiple shards (because of my high # of shards).
 
  Thanks,
  Brett



Re: limiting deep pagination

2013-10-08 Thread Tomás Fernández Löbbe
I don't know of any OOTB way to do that, I'd write a custom request handler
as you suggested.

Tomás


On Tue, Oct 8, 2013 at 3:51 PM, Peter Keegan peterlkee...@gmail.com wrote:

 Is there a way to configure Solr 'defaults/appends/invariants' such that
 the product of the 'start' and 'rows' parameters doesn't exceed a given
 value? This would be to prevent deep pagination.  Or would this require a
 custom requestHandler?

 Peter



Re: limiting deep pagination

2013-10-08 Thread Erik Hatcher
I'd recommend a custom first-components SearchComponent.  Then it could 
simply validate (or adjust) the parameters or throw an exception. 

Knowing Tomás - that's probably what he'd really do :) 

Erik

On Oct 8, 2013, at 19:34, Tomás Fernández Löbbe tomasflo...@gmail.com wrote:

 I don't know of any OOTB way to do that, I'd write a custom request handler
 as you suggested.
 
 Tomás
 
 
 On Tue, Oct 8, 2013 at 3:51 PM, Peter Keegan peterlkee...@gmail.com wrote:
 
 Is there a way to configure Solr 'defaults/appends/invariants' such that
 the product of the 'start' and 'rows' parameters doesn't exceed a given
 value? This would be to prevent deep pagination.  Or would this require a
 custom requestHandler?
 
 Peter
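
A rough sketch of such a first-components check, assuming the intent is to cap how
deep paging can go (the class name and the 10,000 limit are illustrative, not an
existing Solr class):

    import java.io.IOException;
    import org.apache.solr.common.SolrException;
    import org.apache.solr.common.params.CommonParams;
    import org.apache.solr.common.params.SolrParams;
    import org.apache.solr.handler.component.ResponseBuilder;
    import org.apache.solr.handler.component.SearchComponent;

    public class PagingLimitComponent extends SearchComponent {
        private static final int MAX_OFFSET = 10000;  // illustrative cap

        @Override
        public void prepare(ResponseBuilder rb) throws IOException {
            SolrParams params = rb.req.getParams();
            int start = params.getInt(CommonParams.START, 0);
            int rows = params.getInt(CommonParams.ROWS, 10);
            if (start + rows > MAX_OFFSET) {
                throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
                        "start + rows must not exceed " + MAX_OFFSET);
            }
        }

        @Override
        public void process(ResponseBuilder rb) throws IOException {
            // validation only; nothing to do at process time
        }

        @Override
        public String getDescription() {
            return "Rejects overly deep pagination";
        }

        @Override
        public String getSource() {
            return null;
        }
    }

It would then be registered as a searchComponent in solrconfig.xml and listed in
the request handler's first-components array.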
 


Re: What's the purpose of the bits option in compositeId (Solr 4.5)?

2013-10-08 Thread Yonik Seeley
On Tue, Oct 8, 2013 at 7:31 PM, Brett Hoerner br...@bretthoerner.com wrote:
 This is my clusterstate.json:
 https://gist.github.com/bretthoerner/0098f741f48f9bb51433

 And these are my core sizes (note large ones are sorted to the end):
 https://gist.github.com/bretthoerner/f5b5e099212194b5dff6

 I've only heavily sent 2 shards by now (I'm sharding by hour and it's
 been running for 2). There *is* a little old data in my stream, but not
 that much (like 5%). What's confusing to me is that 5 of them are rather
 large, when I'd expect 2 of them to be.

The cluster state looks fine at first glance... and each route key
should map to a single shard.
You could try a query to each of the big shards and see what IDs are in them.

-Yonik


Re: What's the purpose of the bits option in compositeId (Solr 4.5)?

2013-10-08 Thread Brett Hoerner
I have a silly question, how do I query a single shard in SolrCloud? When I
hit solr/foo_shard1_replica1/select it always seems to do a full cluster
query.

I can't (easily) do a _route_ query before I know what each have.


On Tue, Oct 8, 2013 at 7:06 PM, Yonik Seeley ysee...@gmail.com wrote:

 On Tue, Oct 8, 2013 at 7:31 PM, Brett Hoerner br...@bretthoerner.com
 wrote:
  This is my clusterstate.json:
  https://gist.github.com/bretthoerner/0098f741f48f9bb51433
 
  And these are my core sizes (note large ones are sorted to the end):
  https://gist.github.com/bretthoerner/f5b5e099212194b5dff6
 
  I've only heavily sent 2 shards by now (I'm sharding by hour and it's
  been running for 2). There *is* a little old data in my stream, but not
  that much (like 5%). What's confusing to me is that 5 of them are rather
  large, when I'd expect 2 of them to be.

 The cluster state looks fine at first glance... and each route key
 should map to a single shard.
 You could try a query to each of the big shards and see what IDs are in
 them.

 -Yonik



Re: What's the purpose of the bits option in compositeId (Solr 4.5)?

2013-10-08 Thread Brett Hoerner
Ignore me I forgot about shards= from the wiki.


On Tue, Oct 8, 2013 at 7:11 PM, Brett Hoerner br...@bretthoerner.comwrote:

 I have a silly question, how do I query a single shard in SolrCloud? When
 I hit solr/foo_shard1_replica1/select it always seems to do a full cluster
 query.

 I can't (easily) do a _route_ query before I know what each have.


 On Tue, Oct 8, 2013 at 7:06 PM, Yonik Seeley ysee...@gmail.com wrote:

 On Tue, Oct 8, 2013 at 7:31 PM, Brett Hoerner br...@bretthoerner.com
 wrote:
  This is my clusterstate.json:
  https://gist.github.com/bretthoerner/0098f741f48f9bb51433
 
  And these are my core sizes (note large ones are sorted to the end):
  https://gist.github.com/bretthoerner/f5b5e099212194b5dff6
 
  I've only heavily sent 2 shards by now (I'm sharding by hour and it's
  been running for 2). There *is* a little old data in my stream, but not
  that much (like 5%). What's confusing to me is that 5 of them are
 rather
  large, when I'd expect 2 of them to be.

 The cluster state looks fine at first glance... and each route key
 should map to a single shard.
 You could try a query to each of the big shards and see what IDs are in
 them.

 -Yonik





Re: Improving indexing performance

2013-10-08 Thread Erick Erickson
queue size shouldn't really be too large, the whole point of
the concurrency is to keep from waiting around for the
communication with the server in a single thread. So having
a bunch of stuff backed up in the queue isn't buying you anything

And you can always increase the memory allocated to the JVM
running SolrJ...

Erick

On Tue, Oct 8, 2013 at 5:29 AM, Matteo Grolla matteo.gro...@gmail.com wrote:
 Thanks Erick,
 I think I have been able to exhaust a resource:
 if I split the data in 2 and upload it with 2 clients like benchmark
 1.1, it takes 120s; here the bottleneck is my LAN.
 If I use a setting like benchmark 1, the bottleneck is probably the
 ramBuffer.

 I'm going to buy a Gigabit ethernet cable so I can make a better test.

 OutOfMemory error: it's the solrj client that crashes.
 I'm using solr 4.2.1 and the corresponding solrj client.
 HttpSolrServer works fine;
 ConcurrentUpdateSolrServer gives me problems, and I didn't
 understand how to size the queueSize parameter optimally.


 Il giorno 07/ott/2013, alle ore 14:03, Erick Erickson ha scritto:

 Just skimmed, but the usual reason you can't max out the server
 is that the client can't go fast enough. Very quick experiment:
 comment out the server.add line in your client and run it again,
 does that speed up the client substantially? If not, then the time
 is being spent on the client.

 Or split your csv file into, say, 5 parts and run it from 5 different
 PCs in parallel.

 bq:  I can't rely on auto commit, otherwise I get an OutOfMemory error
 This shouldn't be happening, I'd get to the bottom of this. Perhaps simply
 allocating more memory to the JVM running Solr.

 bq: committing every 100k docs gives worse performance
 It'll be best to specify openSearcher=false for max indexing throughput
 BTW. You should be able to do this quite frequently, 15 seconds seems
 quite reasonable.

 Best,
 Erick

 On Sun, Oct 6, 2013 at 12:19 PM, Matteo Grolla matteo.gro...@gmail.com 
 wrote:
 I'd like to have some suggestion on how to improve the indexing performance 
 on the following scenario
 I'm uploading 1M docs to solr,

 every docs has
id: sequential number
title:  small string
date: date
body: 1kb of text

 Here are my benchmarks (they are all single executions, not averages from 
 multiple executions):

 1)  using the updaterequesthandler
and streaming docs from a csv file on the same disk of solr
auto commit every 15s with openSearcher=false and commit after last 
 document

total time: 143035ms

 1.1)using the updaterequesthandler
and streaming docs from a csv file on the same disk of solr
auto commit every 15s with openSearcher=false and commit after last 
 document
        <ramBufferSizeMB>500</ramBufferSizeMB>
        <maxBufferedDocs>10</maxBufferedDocs>

total time: 134493ms

 1.2)using the updaterequesthandler
and streaming docs from a csv file on the same disk of solr
auto commit every 15s with openSearcher=false and commit after last 
 document
        <mergeFactor>30</mergeFactor>

total time: 143134ms

 2)  using a solrj client from another pc in the lan (100Mbps)
with httpsolrserver
with javabin format
add documents to the server in batches of 1k docs   ( 
 server.add( collection ) )
auto commit every 15s with openSearcher=false and commit after last 
 document

total time: 139022ms

 3)  using a solrj client from another pc in the lan (100Mbps)
with concurrentupdatesolrserver
        with javabin format
add documents to the server in batches of 1k docs   ( 
 server.add( collection ) )
server queue size=20k
server threads=4
no auto-commit and commit every 100k docs

total time: 167301ms


 --On the solr server--
 cpu averages25%
at best 100% for 1 core
 IO  is still far from being saturated
iostat gives a pattern like this (every 5 s)

time(s) %util
100 45,20
105 1,68
110 17,44
115 76,32
120 2,64
125 68
130 1,28

 I thought that using concurrentupdatesolrserver I was able to max cpu or IO 
 but I wasn't.
 With concurrentupdatesolrserver I can't rely on auto commit, otherwise I 
 get an OutOfMemory error
 and I found that committing every 100k docs gives worse performance than 
 auto commit every 15s (benchmark 3 with httpsolrserver took 193515)

 I'd really like to understand why I can't max out the resources on the 
 server hosting solr (disk above all)
 And I'd really like to understand what I'm doing wrong with 
 concurrentupdatesolrserver

 thanks
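
For reference, the openSearcher=false auto commit mentioned in this thread is
configured in solrconfig.xml roughly like this (the 15 second interval is just the
value used in these benchmarks):

    <autoCommit>
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>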




Re: What's the purpose of the bits option in compositeId (Solr 4.5)?

2013-10-08 Thread Shawn Heisey

On 10/8/2013 6:12 PM, Brett Hoerner wrote:

Ignore me I forgot about shards= from the wiki.


On Tue, Oct 8, 2013 at 7:11 PM, Brett Hoerner br...@bretthoerner.comwrote:


I have a silly question, how do I query a single shard in SolrCloud? When
I hit solr/foo_shard1_replica1/select it always seems to do a full cluster
query.

I can't (easily) do a _route_ query before I know what each have.


There is also the distrib=false parameter that will cause the request 
to be handled directly by the core it is sent to rather than being 
distributed/balanced by SolrCloud.


Thanks,
Shawn
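
For example, a per-core document count can be checked with a request of this shape
(host and core name are illustrative); with distrib=false the numFound reflects
only the core the request was sent to:

    http://host:8983/solr/foo_shard1_replica1/select?q=*:*&rows=0&distrib=false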



Re: Bootstrapping / Full Importing using Solr Cloud

2013-10-08 Thread Erick Erickson
DIH works with SolrCloud as far as I understand. But
moving to SolrJ has several advantages:
1 you have more control over our process, beter
ability to debug etc.
2 If you can partition your data up amongst
several clients, you can probably get through your jobs
much faster.
3 You're not overloading one machine with both the
DIH bits and the indexing bits.

There are some other options, I generally prefer SolrJ
though. Others have different opinions of course.

Best,
Erick

On Tue, Oct 8, 2013 at 12:57 PM, Mark static.void@gmail.com wrote:
 We are in the process of upgrading our Solr cluster to the latest and 
 greatest Solr Cloud. I have some questions regarding full indexing though. 
 We're currently running a long job (~30 hours) using DIH to do a full index 
 on over 10M products. This process consumes a lot of memory and while 
 updating can not handle any user requests.

 How, or what would be the best way going about this when using Solr Cloud? 
 First off, does DIH work with cloud? Would I need to separate out my DIH 
 indexing machine from the machines serving up user requests? If not going 
 down the DIH route, what are my best options (solrj?)

 Thanks for the input
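
For what it's worth, a minimal sketch of the SolrJ route against SolrCloud looks
like the following; the zkHost string, collection name, fields, and the synthetic
loop standing in for a real data source are all illustrative:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexer {
        public static void main(String[] args) throws Exception {
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("products");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 10000; i++) {            // stand-in for the real data source
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("name", "product " + i);
                batch.add(doc);
                if (batch.size() >= 1000) {              // send in batches, not one doc at a time
                    server.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                server.add(batch);
            }
            server.commit();
            server.shutdown();
        }
    }

Splitting the input among several such clients, as Erick suggests, is then just a
matter of running more than one of them over disjoint slices of the data.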


Re: run filter queries after post filter

2013-10-08 Thread Erick Erickson
Hmmm, seems like it should. What's your evidence that it isn't working?

Best,
Erick

On Tue, Oct 8, 2013 at 4:10 PM, Rohit Harchandani rhar...@gmail.com wrote:
 Hey,
 I am using solr 4.0 with my own PostFilter implementation which is executed
 after the normal solr query is done. This filter has a cost of 100. Is it
 possible to run filter queries on the index after the execution of the post
 filter?
 I tried adding the below line to the url but it did not seem to work:
 fq={!cache=false cost=200}field:value
 Thanks,
 Rohit


Re: no such field error:smaller big block size details while indexing doc files

2013-10-08 Thread Erick Erickson
Hmmm, that is odd, the glob dynamicField should
pick this up.

Not quite sure what's going on. You an parse the file
via Tika yourself and look at what's in there, it's a relatively
simple SolrJ program, here's a sample:
http://searchhub.org/2012/02/14/indexing-with-solrj/

Best,
Erick

On Tue, Oct 8, 2013 at 4:15 PM, sweety sweetyshind...@yahoo.com wrote:
 This my new schema.xml:
 schema  name=documents
 fields
 field name=id type=string indexed=true stored=true required=true 
 multiValued=false/
 field name=author type=string indexed=true stored=true 
 multiValued=true/
 field name=comments type=text indexed=true stored=true 
 multiValued=false/
 field name=keywords type=text indexed=true stored=true 
 multiValued=false/
 field name=contents type=text indexed=true stored=true 
 multiValued=false/
 field name=title type=text indexed=true stored=true 
 multiValued=false/
 field name=revision_number type=string indexed=true stored=true 
 multiValued=false/
 field name=_version_ type=long indexed=true stored=true 
 multiValued=false/
 dynamicField name=ignored_* type=string indexed=false stored=true 
 multiValued=true/
 dynamicField name=* type=ignored  multiValued=true /
 copyfield source=id dest=text /
 copyfield source=author dest=text /
 /fields
 types
 fieldtype name=ignored stored=false indexed=false 
 class=solr.StrField /
 fieldType name=integer class=solr.IntField /
 fieldType name=long class=solr.LongField /
 fieldType name=string class=solr.StrField  /
 fieldType name=text class=solr.TextField /
 /types
 uniqueKeyid/uniqueKey
 /schema
 I still get the same error.

 
  From: Erick Erickson [via Lucene] ml-node+s472066n4094013...@n3.nabble.com
 To: sweety sweetyshind...@yahoo.com
 Sent: Tuesday, October 8, 2013 7:16 AM
 Subject: Re: no such field error:smaller big block size details while 
 indexing doc files



 Well, one of the attributes parsed out of, probably the
 meta-information associated with one of your structured
 docs is SMALLER_BIG_BLOCK_SIZE_DETAILS and
 Solr Cel is faithfully sending that to your index. If you
 want to throw all these in the bit bucket, try defining
 a true catch-all field that ignores things, like this.
 dynamicField name=* type=ignored multiValued=true /

 Best,
 Erick

 On Mon, Oct 7, 2013 at 8:03 AM, sweety [hidden email] wrote:

 Im trying to index .doc,.docx,pdf files,
 im using this url:
 curl
 http://localhost:8080/solr/document/update/extract?literal.id=12commit=true;
 -Fmyfile=@complex.doc

 This is the error I get:
 Oct 07, 2013 5:02:18 PM org.apache.solr.common.SolrException log
 SEVERE: null:java.lang.RuntimeException: java.lang.NoSuchFieldError:
 SMALLER_BIG_BLOCK_SIZE_DETAILS
 at
 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:651)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:364)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
 at
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
 at
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
 at
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
 at
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
 at
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
 at
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
 at
 org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:928)
 at
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
 at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
 at
 org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
 at
 org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:539)
 at
 org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:298)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
 at java.lang.Thread.run(Unknown Source)
 Caused by: java.lang.NoSuchFieldError: SMALLER_BIG_BLOCK_SIZE_DETAILS
 at
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.init(NPOIFSFileSystem.java:93)
 at
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.init(NPOIFSFileSystem.java:190)
 at
 org.apache.poi.poifs.filesystem.NPOIFSFileSystem.init(NPOIFSFileSystem.java:184)
 at
 org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:376)
 at
 org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:165)
 at
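
A minimal sketch of the parse it yourself check Erick suggests, using Tika directly
to list the metadata names a document would hand to Solr Cell (assumes tika-core
and tika-parsers on the classpath; the class name is illustrative):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaPeek {
        public static void main(String[] args) throws Exception {
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                BodyContentHandler text = new BodyContentHandler(-1);  // -1 = no write limit
                Metadata meta = new Metadata();
                new AutoDetectParser().parse(in, text, meta, new ParseContext());
                for (String name : meta.names()) {
                    System.out.println(name + " = " + meta.get(name));
                }
            }
        }
    }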
 

Re: How to share Schema between multicore on Solr 4.4

2013-10-08 Thread Shawn Heisey

On 10/7/2013 6:02 AM, Dharmendra Jaiswal wrote:

I am using Solr 4.4 version with SolrCloud on Windows machine.
Somehow i am not able to share schema between multiple core.


If you're in SolrCloud mode, then you already *are* sharing your 
schema.  You are also sharing your configuration.  Both of them are in 
zookeeper.  All collections (and all shards within a collection) which 
use a given config name are using the same copy.


Any copies of your config/schema that might be on your disk are *NOT* 
being used.  If you are starting Solr with any bootstrap options, then 
the config set that is in zookeeper might be getting overwritten by 
what's on your disk when Solr restarts, but otherwise SolrCloud *only* 
uses zookeeper for config/schema. The bootstrap options are meant to be 
used once, and I actually prefer to get SolrCloud operational without 
using bootstrap options at all.


Thanks,
Shawn
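
For reference, pushing a config set to ZooKeeper without any bootstrap options can
be done with the zkcli script that ships with Solr, roughly as follows (ZooKeeper
hosts, paths, and names are illustrative):

    cloud-scripts/zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 \
        -cmd upconfig -confdir /path/to/myconf -confname shared_conf

    cloud-scripts/zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 \
        -cmd linkconfig -collection mycollection -confname shared_conf

Every collection linked to the same confname then shares one copy of the schema
and config.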



Re: What's the purpose of the bits option in compositeId (Solr 4.5)?

2013-10-08 Thread Yonik Seeley
On Tue, Oct 8, 2013 at 8:27 PM, Shawn Heisey s...@elyograg.org wrote:
 There is also the distrib=false parameter that will cause the request to
 be handled directly by the core it is sent to rather than being
 distributed/balanced by SolrCloud.

Right - this is probably the best option for diagnosing what is in what index.

-Yonik


Re: How to warm up filter queries for a category field with 1000 possible values ?

2013-10-08 Thread Shawn Heisey

On 10/7/2013 12:36 AM, user 01 wrote:

what's the way to warm up filter queries for a category field with 1000
possible values. Would I need to write 1000 lines manually in the
solrconig.xml or what is the format?


Erick has given you awesome advice.  Here's something a little bit 
different that doesn't invalidate his advice:


If you have enough free RAM (not used by programs) for good OS disk 
caching, then as soon as you do one query that checks this field, then 
all 1000 values for that field are likely to be in RAM, and the next 
query against that field is going to be lightning fast, because the 
operating system will not have to read the disk to get the information.  
Although it is slightly faster to get informatin out of Solr's caches 
than the OS disk cache, the operating system is far better at managing 
huge caches than Solr and Java are.


http://wiki.apache.org/solr/SolrPerformanceProblems#General_information

Thanks,
Shawn
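
For completeness, explicit warming, if you still want a handful of representative
filters primed in solrconfig.xml rather than all 1000 values, looks roughly like
this (field and values are illustrative):

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">*:*</str><str name="fq">category:books</str></lst>
        <lst><str name="q">*:*</str><str name="fq">category:electronics</str></lst>
      </arr>
    </listener>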



Re: SolrJ best practices

2013-10-08 Thread Shawn Heisey

On 10/7/2013 3:08 PM, Mark wrote:

Some specific questions:
- When working with HttpSolrServer should we keep around instances for ever or 
should we create a singleton that can/should be used over and over?
- Is there a way to change the collection after creating the server or do we 
need to create a new server for each collection?


If at all possible, you should create your server object and use it for 
the life of your application.  SolrJ is threadsafe.  If there is any 
part of it that's not, the javadocs should say so - the SolrServer 
implementations definitely are.


By using the word collection you are implying that you are using 
SolrCloud ... but earlier you said HttpSolrServer, which implies that 
you are NOT using SolrCloud.


With HttpSolrServer, your base URL includes the core or collection name 
- http://server:port/solr/corename; for example.  Generally you will 
need one object for each core/collection, and another object for 
server-level things like CoreAdmin.


With SolrCloud, you should be using CloudSolrServer instead, another 
implementation of SolrServer that is constantly aware of the SolrCloud 
clusterstate.  With that object, you can use setDefaultCollection, and 
you can also add a collection parameter to each SolrQuery or other 
request object.


Thanks,
Shawn
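
In code, the keep one instance around advice usually just means something like this
(URL and core name are illustrative):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class SolrClientHolder {
        // created once and shared across threads; SolrServer implementations are thread-safe
        public static final HttpSolrServer PRODUCTS =
                new HttpSolrServer("http://solrhost:8983/solr/products");
    }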



Re: SolrCloud High Availability during indexing operation

2013-10-08 Thread Saurabh Saxena
Repeated the experiments on a local system: a single-shard SolrCloud with a
replica. Tried to index 10K docs. All the indexing operations were
directed to the replica Solr node. While the documents were getting indexed
on the replica, I shut down the leader Solr node. Out of 10K docs, only 9900
docs got indexed. If I repeat the experiment without shutting down the
leader instance, all 10K docs get indexed. I am using curl to upload the
docs, and there was no curl error while uploading documents.

Following error was there in replica log file.

ERROR - 2013-10-08 16:10:32.662; org.apache.solr.common.SolrException;
org.apache.solr.common.SolrException: No registered leader was found,
collection:test_collection slice:shard1

Attached replica log file.


On Thu, Sep 26, 2013 at 7:15 PM, Saurabh Saxena ssax...@gopivotal.comwrote:

 Sorry for the late reply.

 All the documents have unique id. If I repeat the experiment, the num of
 docs indexed changes (I guess it depends when I shutdown a particular
 shard). When I do the experiment without shutting down leader Shards, all
 80k docs get indexed (which I think proves that all documents are valid).

 I need to dig the logs to find error message. Also, I am not tracking of
 curl return code, will run again and reply.

 Regards,
 Saurabh


 On Wed, Sep 25, 2013 at 3:01 AM, Erick Erickson 
 erickerick...@gmail.comwrote:

 And do any of the documents have the same uniqueKey, which
 is usually called id? Subsequent adds of docs with the same
 uniqueKey replace the earlier one.

 It's not definitive because it changes as merges happen, old copies
 of docs that have been deleted or updated will be purged, but what
 does your admin page show for maxDoc? If it's more than numDocs
 then you have duplicate uniqueKeys. NOTE: if you optimize
 (which you usually shouldn't) then maxDoc and numDocs will be
 the same so if you test this don't optimize.

 Best,
 Erick


 On Tue, Sep 24, 2013 at 10:43 AM, Walter Underwood
 wun...@wunderwood.org wrote:
  Did all of the curl update commands return success? Ane errors in the
 logs?
 
  wunder
 
  On Sep 24, 2013, at 6:40 AM, Otis Gospodnetic wrote:
 
  Is it possible that some of those 80K docs were simply not valid? e.g.
  had a wrong field, had a missing required field, anything like that?
  What happens if you clear this collection and just re-run the same
  indexing process and do everything else the same?  Still some docs
  missing?  Same number?
 
  And what if you take 1 document that you know is valid and index it
  80K times, with a different ID, of course?  Do you see 80K docs in the
  end?
 
  Otis
  --
  Solr  ElasticSearch Support -- http://sematext.com/
  Performance Monitoring -- http://sematext.com/spm
 
 
 
  On Tue, Sep 24, 2013 at 2:45 AM, Saurabh Saxena ssax...@gopivotal.com
 wrote:
  Doc count did not change after I restarted the nodes. I am doing a
 single
  commit after all 80k docs. Using Solr 4.4.
 
  Regards,
  Saurabh
 
 
  On Mon, Sep 23, 2013 at 6:37 PM, Otis Gospodnetic 
  otis.gospodne...@gmail.com wrote:
 
  Interesting. Did the doc count change after you started the nodes
 again?
  Can you tell us about commits?
  Which version? 4.5 will be out soon.
 
  Otis
  Solr  ElasticSearch Support
  http://sematext.com/
  On Sep 23, 2013 8:37 PM, Saurabh Saxena ssax...@gopivotal.com
 wrote:
 
  Hello,
 
  I am testing High Availability feature of SolrCloud. I am using the
  following setup
 
  - 8 linux hosts
  - 8 Shards
  - 1 leader, 1 replica / host
  - Using Curl for update operation
 
  I tried to index 80K documents on replicas (10K/replica in
 parallel).
  During indexing process, I stopped 4 Leader nodes. Once indexing is
 done,
  out of 80K docs only 79808 docs are indexed.
 
  Is this an expected behaviour ? In my opinion replica should take
 care of
  indexing if leader is down.
 
  If this is an expected behaviour, any steps that can be taken from
 the
  client side to avoid such a situation.
 
  Regards,
  Saurabh Saxena
 
 
 
  --
  Walter Underwood
  wun...@wunderwood.org
 
 
 





stats on dynamic fields?

2013-10-08 Thread Li Xu
Hi,

I don't seem to be able to find any info on the possibility to get stats on
dynamic fields. stats=truestates.field=xyz_* appears to literally treat
xyz_* as the field name with a star. Is there a way to get stats on
dynamic fields without explicitly listing them in the query?

Thanks!
Li
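
For reference, the explicit listing the question mentions is simply one stats.field
parameter per resolved field name (names illustrative):

    stats=true&stats.field=xyz_price&stats.field=xyz_weight&stats.field=xyz_length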


Re: SolrCloud High Availability during indexing operation

2013-10-08 Thread Mark Miller
The attachment did not go through - try using pastebin.com or something.

Are you adding docs with curl one at a time or in bulk per request.

- Mark

On Oct 8, 2013, at 9:58 PM, Saurabh Saxena ssax...@gopivotal.com wrote:

 Repeated the experiments on local system. Single shard Solrcloud with a 
 replica. Tried to index 10K docs. All the indexing operation were redirected 
 to replica Solr node. While the document while getting indexed on replica, I 
 shutdown the leader Solr node. Out of 10K docs, only 9900 docs got indexed. 
 If I repeat the experiment without shutting down the leader instance, all 10K 
 docs get indexed. I am using curl to upload the docs, there was no curl error 
 while uploading documents. 
 
 Following error was there in replica log file. 
 
 ERROR - 2013-10-08 16:10:32.662; org.apache.solr.common.SolrException; 
 org.apache.solr.common.SolrException: No registered leader was found, 
 collection:test_collection slice:shard1
 
 Attached replica log file. 
 
 
 On Thu, Sep 26, 2013 at 7:15 PM, Saurabh Saxena ssax...@gopivotal.com wrote:
 Sorry for the late reply.
 
 All the documents have unique id. If I repeat the experiment, the num of docs 
 indexed changes (I guess it depends when I shutdown a particular shard). When 
 I do the experiment without shutting down leader Shards, all 80k docs get 
 indexed (which I think proves that all documents are valid). 
 
 I need to dig the logs to find error message. Also, I am not tracking of curl 
 return code, will run again and reply.
 
 Regards,
 Saurabh 
 
 
 On Wed, Sep 25, 2013 at 3:01 AM, Erick Erickson erickerick...@gmail.com 
 wrote:
 And do any of the documents have the same uniqueKey, which
 is usually called id? Subsequent adds of docs with the same
 uniqueKey replace the earlier one.
 
 It's not definitive because it changes as merges happen, old copies
 of docs that have been deleted or updated will be purged, but what
 does your admin page show for maxDoc? If it's more than numDocs
 then you have duplicate uniqueKeys. NOTE: if you optimize
 (which you usually shouldn't) then maxDoc and numDocs will be
 the same so if you test this don't optimize.
 
 Best,
 Erick
 
 
 On Tue, Sep 24, 2013 at 10:43 AM, Walter Underwood
 wun...@wunderwood.org wrote:
  Did all of the curl update commands return success? Ane errors in the logs?
 
  wunder
 
  On Sep 24, 2013, at 6:40 AM, Otis Gospodnetic wrote:
 
  Is it possible that some of those 80K docs were simply not valid? e.g.
  had a wrong field, had a missing required field, anything like that?
  What happens if you clear this collection and just re-run the same
  indexing process and do everything else the same?  Still some docs
  missing?  Same number?
 
  And what if you take 1 document that you know is valid and index it
  80K times, with a different ID, of course?  Do you see 80K docs in the
  end?
 
  Otis
  --
  Solr  ElasticSearch Support -- http://sematext.com/
  Performance Monitoring -- http://sematext.com/spm
 
 
 
  On Tue, Sep 24, 2013 at 2:45 AM, Saurabh Saxena ssax...@gopivotal.com 
  wrote:
  Doc count did not change after I restarted the nodes. I am doing a single
  commit after all 80k docs. Using Solr 4.4.
 
  Regards,
  Saurabh
 
 
  On Mon, Sep 23, 2013 at 6:37 PM, Otis Gospodnetic 
  otis.gospodne...@gmail.com wrote:
 
  Interesting. Did the doc count change after you started the nodes again?
  Can you tell us about commits?
  Which version? 4.5 will be out soon.
 
  Otis
  Solr  ElasticSearch Support
  http://sematext.com/
  On Sep 23, 2013 8:37 PM, Saurabh Saxena ssax...@gopivotal.com wrote:
 
  Hello,
 
  I am testing High Availability feature of SolrCloud. I am using the
  following setup
 
  - 8 linux hosts
  - 8 Shards
  - 1 leader, 1 replica / host
  - Using Curl for update operation
 
  I tried to index 80K documents on replicas (10K/replica in parallel).
  During indexing process, I stopped 4 Leader nodes. Once indexing is 
  done,
  out of 80K docs only 79808 docs are indexed.
 
  Is this an expected behaviour ? In my opinion replica should take care 
  of
  indexing if leader is down.
 
  If this is an expected behaviour, any steps that can be taken from the
  client side to avoid such a situation.
 
  Regards,
  Saurabh Saxena
 
 
 
  --
  Walter Underwood
  wun...@wunderwood.org
 
 
 
 
 



Re: ALIAS feature, can be used for what?

2013-10-08 Thread Mark Miller
Right - update aliases should only map an alias to one collection, but are 
perfectly valid.

Read aliases can map to multiple collections or just one.

There is currently only a create alias command and not an update alias command. 
I suppose because the impl for create just happened to work for update as well, 
so I guess I figured why add it explicitly. I figured we could still do it 
later - and I suppose we probably should.

I also intend to add a list alias command: 
https://issues.apache.org/jira/browse/SOLR-4968

- Mark

On Oct 8, 2013, at 11:31 AM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 You can index to an alias that points at only one collection. Works fine!
 
 Michael Della Bitta
 
 Applications Developer
 
 o: +1 646 532 3062  | c: +1 917 477 7906
 
 appinions inc.
 
 “The Science of Influence Marketing”
 
 18 East 41st Street
 
 New York, NY 10017
 
 t: @appinions https://twitter.com/Appinions | g+:
 plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
 w: appinions.com http://www.appinions.com/
 
 
 On Fri, Oct 4, 2013 at 7:59 AM, Upayavira u...@odoko.co.uk wrote:
 
 I've used this feature to great effect. I have logs coming in, and I
 create a core for each day. At the end of each day, I create a new core
 for tomorrow, unload any cores over 2 months old, then create a set of
 aliases (all, month, week, today) pointing to just the cores
 that are needed for that range. Thus, my app can efficiently query the
 bit of the index it is really interested in.
 
 You cannot, as far as I am aware, index directly to an alias. It
 wouldn't know what to do with the content. However, you can create an
 alias over the top of an existing one, and it will replace it. Works
 nicely.
 
 Upayavira
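
  A sketch of that nightly rollover in SolrCloud terms, assuming one collection
  per day and illustrative names (configName and shard/replica counts are
  assumptions):

      # Create tomorrow's collection
      curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=logs_20131009&numShards=1&replicationFactor=2&collection.configName=logs"

      # Rebuild the rolling read aliases over just the collections each range needs;
      # CREATEALIAS over an existing alias name replaces it
      curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=today&collections=logs_20131009"
      curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=week&collections=logs_20131003,logs_20131004,logs_20131005,logs_20131006,logs_20131007,logs_20131008,logs_20131009"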
 
 On Fri, Oct 4, 2013, at 10:41 AM, Jan Høydahl wrote:
 Hi,
 
 I have been asked the same question. There are only DELETEALIAS and
 CREATEALIAS actions available, so is there a way to achieve uninterrupted
 switch of an alias from one index to another? Are we lacking a MOVEALIAS
 command?
 
 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 
 27. sep. 2013 kl. 10:46 skrev Yago Riveiro yago.rive...@gmail.com:
 
 I need to delete the alias for the old collection before pointing it to the
 new one, right?
 
 --
 Yago Riveiro
 Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
 
 
 On Friday, September 27, 2013 at 2:25 AM, Otis Gospodnetic wrote:
 
 Hi,
 
 Imagine you have an index and you need to reindex your data into a new
 index, but don't want to have to reconfigure or restart client apps
 when you want to point them to the new index. This is where aliases
 come in handy. If you created an alias for the first index and made
 your apps hit that alias, then you can just repoint the same alias to
 your new index and avoid having to touch client apps.
 
 No, I don't think you can write to multiple collections through a
 single alias.
 
 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm
 
 
 
 On Thu, Sep 26, 2013 at 6:34 AM, yriveiro yago.rive...@gmail.com(mailto:
 yago.rive...@gmail.com) wrote:
 Today I was thinking about the ALIAS feature and its utility in Solr.
 
 Can anyone explain to me, with an example, where this feature may be
 useful?
 
 Is it possible to have an ALIAS of multiple collections? If I do a
 write to the
 alias, is this write applied to all collections?
 
 /Yago
 
 
 
 -
 Best regards
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/ALIAS-feature-can-be-used-for-what-tp4092095.html
 Sent from the Solr - User mailing list archive at Nabble.com (
 http://Nabble.com).
 
 
 
 
 
 
 
 



Re: dynamic field question

2013-10-08 Thread Jack Krupansky
I'd suggest that each of your source document sections be a distinct 
Solr document. All of the sections could have a source document ID field 
to tie them together.


Dynamic fields work best when used in moderation. Your use case seems like 
an excessive use of dynamic fields.


-- Jack Krupansky
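
A minimal sketch of that per-section modelling, assuming fields named
source_doc_id, section_title and section_body are defined in the schema (or
match a dynamic-field rule), and a core named collection1:

    # Index each section of a source document as its own Solr document
    curl "http://localhost:8983/solr/collection1/update?commit=true" \
         -H "Content-Type: application/json" -d '
    [
      {"id": "doc42_sec1", "source_doc_id": "doc42", "section_title": "Introduction", "section_body": "..."},
      {"id": "doc42_sec9", "source_doc_id": "doc42", "section_title": "Appendix",     "section_body": "..."}
    ]'

    # Return only the Appendix sections, plus the id of the parent document
    curl "http://localhost:8983/solr/collection1/select?q=section_title:Appendix&fl=source_doc_id,section_body"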

-Original Message- 
From: Twomey, David

Sent: Tuesday, October 08, 2013 6:59 PM
To: solr-user@lucene.apache.org
Subject: dynamic field question


I am having trouble trying to return a particular dynamic field only instead 
of all dynamic fields.


Imagine I have a document with an unknown number of sections.  Each section 
can have a 'title' and a 'body'


I have each section title and body as dynamic fields such as section_title_* 
and section_body_*


Imagine that some documents contain a section that has a title=Appendix

I want a query that will find all docs with that section and return just the 
Appendix section.


I don't know how to return just that one section though

I can copyField my dynamic field section_title_* into a static field called 
section_titles and query that for docs that contain the Appendix


But I don't know how to only return that one dynamic field

?q=section_titles:Appendix&fl=section_body_*

Any ideas?   I can't seem to put a conditional in the fl parameter





Re: SolrCloud High Availability during indexing operation

2013-10-08 Thread Saurabh Saxena
Pastebin link: http://pastebin.com/cnkXhz7A

I am doing a bulk request. I am uploading 100 files, each file having 100
docs.

-Saurabh
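
For reference, a bulk upload of one file of documents per request might look like
the sketch below (file names, collection name and JSON format are assumptions;
the same idea applies to XML files posted to /update):

    # Post one file containing ~100 documents per request, then commit once at the end
    for f in docs_*.json; do
      curl "http://localhost:8983/solr/test_collection/update" \
           -H "Content-Type: application/json" --data-binary @"$f"
    done
    curl "http://localhost:8983/solr/test_collection/update?commit=true"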


On Tue, Oct 8, 2013 at 7:39 PM, Mark Miller markrmil...@gmail.com wrote:

 The attachment did not go through - try using pastebin.com or something.

 Are you adding docs with curl one at a time or in bulk per request?

 - Mark

 On Oct 8, 2013, at 9:58 PM, Saurabh Saxena ssax...@gopivotal.com wrote:

  Repeated the experiments on a local system: a single-shard SolrCloud with a
 replica. Tried to index 10K docs. All the indexing operations were
 directed to the replica Solr node. While the documents were getting indexed
 on the replica, I shut down the leader Solr node. Out of 10K docs, only 9900
 docs got indexed. If I repeat the experiment without shutting down the
 leader instance, all 10K docs get indexed. I am using curl to upload the
 docs; there was no curl error while uploading documents.
 
  The following error was in the replica log file.
 
  ERROR - 2013-10-08 16:10:32.662; org.apache.solr.common.SolrException;
 org.apache.solr.common.SolrException: No registered leader was found,
 collection:test_collection slice:shard1
 
  Attached replica log file.
 
 
  On Thu, Sep 26, 2013 at 7:15 PM, Saurabh Saxena ssax...@gopivotal.com
 wrote:
  Sorry for the late reply.
 
  All the documents have a unique id. If I repeat the experiment, the number of
 docs indexed changes (I guess it depends on when I shut down a particular
 shard). When I do the experiment without shutting down leader shards, all
 80k docs get indexed (which I think proves that all documents are valid).
 
  I need to dig through the logs to find the error message. Also, I am not tracking
 the curl return code; I will run again and reply.
 
  Regards,
  Saurabh
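
  One simple way to track the return codes mentioned above is to print the HTTP
  status of each update request (collection name and file name are assumptions):

      # Capture the HTTP status so a failed update is not silently ignored
      curl -sS -o /dev/null -w "%{http_code}\n" \
           "http://localhost:8983/solr/test_collection/update" \
           -H "Content-Type: application/json" --data-binary @docs_001.json
      # curl's own exit code ($?) additionally signals transport-level failures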
 
 
  On Wed, Sep 25, 2013 at 3:01 AM, Erick Erickson erickerick...@gmail.com
 wrote:
  And do any of the documents have the same uniqueKey, which
  is usually called id? Subsequent adds of docs with the same
  uniqueKey replace the earlier one.
 
  It's not definitive because it changes as merges happen, old copies
  of docs that have been deleted or updated will be purged, but what
  does your admin page show for maxDoc? If it's more than numDocs
  then you have duplicate uniqueKeys. NOTE: if you optimize
  (which you usually shouldn't) then maxDoc and numDocs will be
  the same so if you test this don't optimize.
 
  Best,
  Erick
 
 
  On Tue, Sep 24, 2013 at 10:43 AM, Walter Underwood
  wun...@wunderwood.org wrote:
    Did all of the curl update commands return success? Any errors in the
 logs?
  
   wunder
  
   On Sep 24, 2013, at 6:40 AM, Otis Gospodnetic wrote:
  
   Is it possible that some of those 80K docs were simply not valid? e.g.
   had a wrong field, had a missing required field, anything like that?
   What happens if you clear this collection and just re-run the same
   indexing process and do everything else the same?  Still some docs
   missing?  Same number?
  
   And what if you take 1 document that you know is valid and index it
   80K times, with a different ID, of course?  Do you see 80K docs in the
   end?
  
   Otis
   --
    Solr & ElasticSearch Support -- http://sematext.com/
   Performance Monitoring -- http://sematext.com/spm
  
  
  
   On Tue, Sep 24, 2013 at 2:45 AM, Saurabh Saxena 
 ssax...@gopivotal.com wrote:
   Doc count did not change after I restarted the nodes. I am doing a
 single
   commit after all 80k docs. Using Solr 4.4.
  
   Regards,
   Saurabh
  
  
   On Mon, Sep 23, 2013 at 6:37 PM, Otis Gospodnetic 
   otis.gospodne...@gmail.com wrote:
  
   Interesting. Did the doc count change after you started the nodes
 again?
   Can you tell us about commits?
   Which version? 4.5 will be out soon.
  
   Otis
    Solr & ElasticSearch Support
   http://sematext.com/
   On Sep 23, 2013 8:37 PM, Saurabh Saxena ssax...@gopivotal.com
 wrote:
  
   Hello,
  
   I am testing High Availability feature of SolrCloud. I am using the
   following setup
  
   - 8 linux hosts
   - 8 Shards
   - 1 leader, 1 replica / host
   - Using Curl for update operation
  
   I tried to index 80K documents on replicas (10K/replica in
 parallel).
   During indexing process, I stopped 4 Leader nodes. Once indexing
 is done,
   out of 80K docs only 79808 docs are indexed.
  
   Is this an expected behaviour ? In my opinion replica should take
 care of
   indexing if leader is down.
  
   If this is an expected behaviour, any steps that can be taken from
 the
   client side to avoid such a situation.
  
   Regards,
   Saurabh Saxena
  
  
  
   --
   Walter Underwood
   wun...@wunderwood.org